Create evaluator

POST https://api.keywordsai.co/api/evaluators
curl --request POST \
  --url https://api.keywordsai.co/api/evaluators/ \
  --header 'Authorization: Bearer <token>'
Creates a new evaluator for your organization. You must specify type and score_value_type. The eval_class field is optional and only used for pre-built templates.

Authentication

All endpoints require API key authentication:
Authorization: Bearer YOUR_API_KEY

Evaluator Types and Score Value Types

Evaluator Types (Required)

Important: The evaluator type field now represents the primary interface/use case, but automation can be added independently via llm_config or code_config. This decouples the annotation method from the evaluator type; a sketch of the decoupling follows the list below.
  • llm: Primarily LLM-based evaluators (can also have code automation)
  • human: Primarily human annotation-based (can have LLM or code automation for assistance)
  • code: Primarily code-based evaluators (can also have LLM automation as fallback)
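
For example, a human-first evaluator can still carry code automation that proposes a pass/fail score. A minimal sketch using the new flat-format fields documented below; the slug and code snippet are illustrative, not taken from this page:

import requests

url = "https://api.keywordsai.co/api/evaluators/"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}

data = {
    "name": "Human Review with Code Suggestion",
    "evaluator_slug": "human_code_assist_v1",  # illustrative slug
    "type": "human",  # primary interface: human annotation
    "score_value_type": "boolean",
    "code_config": {  # code automation attached independently of the type
        "eval_code_snippet": "def main(eval_inputs):\n    return bool(eval_inputs.get('output'))"
    }
}

response = requests.post(url, headers=headers, json=data)
print(response.json())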

Score Value Types (Required)

  • numerical: Numeric scores (e.g., 1-5, 0.0-1.0)
  • boolean: True/false or pass/fail evaluations
  • percentage: 0-100 percentage scores (use decimals; 0.0–100.0)
  • single_select: Choose exactly one option from predefined choices
  • multi_select: Choose one or more options from predefined choices
  • json: Structured JSON data for complex evaluations
  • text: Text-based feedback and comments
  • (Legacy) categorical and comment remain supported for reading older evaluators

Pre-built Templates (Optional)

You can optionally use pre-built templates by specifying eval_class:
  • keywordsai_custom_llm: LLM-based evaluator with standard configuration
  • custom_code: Code-based evaluator template

Unified Evaluator Inputs

All evaluator runs now receive a single unified inputs object. This applies to all evaluator types (llm, human, code); a code-evaluator sketch follows the field list below. Structure:
{
  "inputs": {
    "input": {},
    "output": {},
    "metrics": {},
    "metadata": {}
  }
}
  • input (any JSON): The request/input to be evaluated.
  • output (any JSON): The response/output being evaluated.
  • metrics (object, optional): System-captured metrics (e.g., tokens, latency, cost).
  • metadata (object, optional): Context and custom properties you pass; also logged.
  • llm_input and llm_output (string, optional): Legacy convenience aliases.
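
In a code evaluator, this unified object is what arrives as the eval_inputs argument of main(). A minimal sketch; the top-level keys match the structure above, but any key names inside metrics (e.g., "latency") are illustrative assumptions:

def main(eval_inputs):
    # Top-level keys are documented above: input, output, metrics, metadata.
    output = eval_inputs.get("output", "")
    metrics = eval_inputs.get("metrics", {}) or {}

    # Example rule: pass only if there is a non-empty output and latency stayed low.
    has_output = bool(str(output).strip())
    low_latency = metrics.get("latency", 0) < 2.0  # "latency" key name is an assumption
    return has_output and low_latency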

Required Fields

  • name (string): Display name for the evaluator
  • type (string): Evaluator type - "llm", "human", or "code"
  • score_value_type (string): Score format - "numerical", "boolean", "percentage", "single_select", "multi_select", "json", or "text" (legacy "categorical" and "comment" are still accepted)

Optional Fields

  • evaluator_slug (string): Unique identifier (auto-generated if not provided)
  • description (string): Description of the evaluator
  • eval_class (string): Pre-built template to use (optional)
  • configurations (object): Custom configuration based on evaluator type
  • categorical_choices (array): Required when score_value_type is "categorical"; in the legacy format it also supplies the choices for "single_select" and "multi_select" (the new format uses score_config.choices instead)

New Format

The new evaluator format uses clean, flat configuration fields instead of nested configurations. This format allows you to add both LLM and code automation to any evaluator type, decoupling the annotation method from the evaluator type.

New Top-Level Fields (All Optional)

  • score_config (object): Score type configuration (shape varies by score_value_type)
  • passing_conditions (object): Passing conditions using the universal filter format
  • llm_config (object): LLM automation config (if using LLM for scoring)
  • code_config (object): Code automation config (if using code for scoring)

Score Config Shapes

Numerical/Percentage:
{
  "min_score": 0.0,
  "max_score": 5.0,
  "choices": [...]  // Optional discrete values
}
Single/Multi Select:
{
  "choices": [
    {"name": "Professional", "value": "professional"},
    {"name": "Casual", "value": "casual"}
  ]
}

LLM Config

{
  "model": "gpt-4o-mini",
  "evaluator_definition": "Your prompt template with {{input}} and {{output}}",
  "scoring_rubric": "Scoring instructions",
  "temperature": 0.1,
  "max_tokens": 200
}
Available LLM config fields (all optional except model and evaluator_definition):
  • Core: model, stream
  • Sampling: temperature, top_p, max_tokens, max_completion_tokens
  • Penalties: frequency_penalty, presence_penalty, stop
  • Formatting: response_format, verbosity
  • Tools: tools, tool_choice, parallel_tool_calls

Code Config

{
  "eval_code_snippet": "def main(eval_inputs):\n    return 1 if 'success' in eval_inputs.get('output', '') else 0"
}

Passing Conditions

Uses the universal filter format. Example:
{
  "primary_score": {
    "operator": "gte",
    "value": 3
  }
}
For complete details, see the Filters API Reference.

Legacy Format (Still Supported)

The legacy configurations format remains fully functional for backward compatibility.

Configuration Fields by Type

For type: "llm" evaluators:
  • evaluator_definition (string): The evaluation prompt/instruction. Must include {{input}} and {{output}} template variables. Legacy {{llm_input}} and {{llm_output}} are also supported for backward compatibility.
  • scoring_rubric (string): Description of the scoring criteria
  • llm_engine (string): LLM model to use (e.g., "gpt-4o-mini", "gpt-4o")
  • model_options (object, optional): LLM parameters like temperature, max_tokens
  • min_score (number, optional): Minimum possible score
  • max_score (number, optional): Maximum possible score
  • passing_score (number, optional): Score threshold for passing
For type: "code" evaluators:
  • eval_code_snippet (string): Python code with main(eval_inputs) function that returns the score
For type: "human" evaluators:
  • No specific configuration fields required
  • Use the categorical_choices field when score_value_type is "single_select" or "multi_select"
For score_value_type: "single_select" | "multi_select":
  • categorical_choices (array): List of choice objects with name and value properties
    [
      { "name": "Excellent", "value": 5 },
      { "name": "Good", "value": 4 }
    ]
    

Examples

New Format Examples

LLM Evaluator with Automation (Numerical)

import requests

url = "https://api.keywordsai.co/api/evaluators/"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}

data = {
    "name": "Response Quality",
    "evaluator_slug": "response_quality_v2",
    "score_value_type": "numerical",
    "score_config": {
        "min_score": 1,
        "max_score": 5,
        "choices": [
            {"name": "Poor", "value": 1},
            {"name": "Fair", "value": 2},
            {"name": "Good", "value": 3},
            {"name": "Great", "value": 4},
            {"name": "Excellent", "value": 5}
        ]
    },
    "passing_conditions": {
        "primary_score": {
            "operator": "gte",
            "value": 3
        }
    },
    "llm_config": {
        "model": "gpt-4o-mini",
        "evaluator_definition": "Rate the quality of this response:\n<input>{{input}}</input>\n<output>{{output}}</output>",
        "scoring_rubric": "1=Poor, 5=Excellent",
        "temperature": 0.1
    }
}

response = requests.post(url, headers=headers, json=data)
print(response.json())

Human Evaluator with LLM Assistance

This shows how a human evaluator can include LLM automation that suggests scores, decoupling the annotation method from the evaluator type.
data = {
    "name": "Human Review with AI Assistance",
    "evaluator_slug": "human_ai_assist_v1",
    "type": "human",
    "score_value_type": "numerical",
    "score_config": {"min_score": 1, "max_score": 5},
    "llm_config": {
        "model": "gpt-4o-mini",
        "evaluator_definition": "Suggest a quality score for this response",
        "temperature": 0.1
    }
}

response = requests.post(url, headers=headers, json=data)
print(response.json())

Code Evaluator (Boolean)

data = {
    "name": "Length Check",
    "evaluator_slug": "length_check_v1",
    "score_value_type": "boolean",
    "description": "Checks if response is longer than 10 characters",
    "code_config": {
        "eval_code_snippet": "def main(eval_inputs):\n    output = eval_inputs.get('output', '')\n    return len(str(output)) > 10"
    }
}

response = requests.post(url, headers=headers, json=data)
print(response.json())

Single Select Evaluator with LLM

data = {
    "name": "Tone Classifier",
    "score_value_type": "single_select",
    "score_config": {
        "choices": [
            {"name": "Professional", "value": "professional"},
            {"name": "Casual", "value": "casual"},
            {"name": "Formal", "value": "formal"}
        ]
    },
    "llm_config": {
        "model": "gpt-4o-mini",
        "evaluator_definition": "Classify the tone of this response"
    }
}

response = requests.post(url, headers=headers, json=data)
print(response.json())
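
Multi Select Evaluator (Sketch)

A multi_select example is not shown above. Assuming it mirrors the single_select shape, with choices defined in score_config and one or more values selected per run, a sketch (the evaluator name and choices are illustrative):

data = {
    "name": "Issue Tags",
    "type": "llm",
    "score_value_type": "multi_select",
    "score_config": {
        "choices": [
            {"name": "Hallucination", "value": "hallucination"},
            {"name": "Off-topic", "value": "off_topic"},
            {"name": "Formatting issue", "value": "formatting_issue"}
        ]
    },
    "llm_config": {
        "model": "gpt-4o-mini",
        "evaluator_definition": "Select every issue present in this response:\n<input>{{input}}</input>\n<output>{{output}}</output>"
    }
}

response = requests.post(url, headers=headers, json=data)
print(response.json())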

Legacy Format Examples

Custom LLM Evaluator (Numerical)

import requests

url = "https://api.keywordsai.co/api/evaluators/"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}

data = {
    "name": "Response Quality Evaluator",
    "evaluator_slug": "response_quality_v1",
    "type": "llm",
    "score_value_type": "numerical",
    "description": "Evaluates response quality on a 1-5 scale",
    "configurations": {
        "evaluator_definition": "Rate the response quality based on accuracy, relevance, and completeness.\n<llm_input>{{llm_input}}</llm_input>\n<llm_output>{{llm_output}}</llm_output>",
        "scoring_rubric": "1=Poor, 2=Fair, 3=Good, 4=Very Good, 5=Excellent",
        "llm_engine": "gpt-4o-mini",
        "model_options": {
            "temperature": 0.1,
            "max_tokens": 200
        },
        "min_score": 1.0,
        "max_score": 5.0,
        "passing_score": 3.0
    }
}

response = requests.post(url, headers=headers, json=data)
print(response.json())

Human Categorical Evaluator

data = {
    "name": "Content Quality Assessment",
    "evaluator_slug": "content_quality_categorical",
    "type": "human",
    "score_value_type": "categorical",
    "description": "Human assessment of content quality with predefined categories",
    "categorical_choices": [
        { "name": "Excellent", "value": 5 },
        { "name": "Good", "value": 4 },
        { "name": "Average", "value": 3 },
        { "name": "Poor", "value": 2 },
        { "name": "Very Poor", "value": 1 }
    ]
}

response = requests.post(url, headers=headers, json=data)
print(response.json())

Code-based Boolean Evaluator

data = {
    "name": "Response Length Checker",
    "evaluator_slug": "length_checker_boolean",
    "type": "code",
    "score_value_type": "boolean",
    "description": "Checks if response meets minimum length requirement",
    "configurations": {
        "eval_code_snippet": "def evaluate(llm_input, llm_output, **kwargs):\n    '''\n    Check if response meets minimum length requirement\n    Returns True if length >= 50 characters, False otherwise\n    '''\n    if not llm_output:\n        return False\n    \n    return len(llm_output.strip()) >= 50"
    }
}

response = requests.post(url, headers=headers, json=data)
print(response.json())

LLM Boolean Evaluator

data = {
    "name": "LLM Factual Accuracy Check",
    "evaluator_slug": "llm_factual_accuracy",
    "type": "llm",
    "score_value_type": "boolean",
    "description": "LLM-based evaluator that checks if response is factually accurate",
    "configurations": {
        "evaluator_definition": "Determine if the response is factually accurate and contains no misinformation.\n<llm_input>{{llm_input}}</llm_input>\n<llm_output>{{llm_output}}</llm_output>",
        "scoring_rubric": "Return True if factually accurate, False if contains errors or misinformation",
        "llm_engine": "gpt-4o-mini"
    }
}

response = requests.post(url, headers=headers, json=data)
print(response.json())

Using Pre-built Template

data = {
    "name": "Template-based LLM Evaluator",
    "evaluator_slug": "template_llm_eval",
    "type": "llm",
    "score_value_type": "numerical",
    "eval_class": "keywordsai_custom_llm",
    "description": "Uses pre-built LLM evaluator template",
    "configurations": {
        "evaluator_definition": "Evaluate response accuracy and helpfulness.\n<llm_input>{{llm_input}}</llm_input>\n<llm_output>{{llm_output}}</llm_output>",
        "scoring_rubric": "Score from 1-10 based on accuracy and helpfulness",
        "llm_engine": "gpt-4o",
        "min_score": 1.0,
        "max_score": 10.0
    }
}

response = requests.post(url, headers=headers, json=data)
print(response.json())

Response

Status: 201 Created
{
  "id": "0f4325f9-55ef-4c20-8abe-376694419947",
  "name": "Response Quality Evaluator",
  "evaluator_slug": "response_quality_v1",
  "type": "llm",
  "score_value_type": "numerical",
  "eval_class": "",
  "description": "Evaluates response quality on a 1-5 scale",
  "configurations": {
    "evaluator_definition": "Rate the response quality based on accuracy, relevance, and completeness.\n<llm_input>{{llm_input}}</llm_input>\n<llm_output>{{llm_output}}</llm_output>",
    "scoring_rubric": "1=Poor, 2=Fair, 3=Good, 4=Very Good, 5=Excellent",
    "llm_engine": "gpt-4o-mini",
    "model_options": {
      "temperature": 0.1,
      "max_tokens": 200
    },
    "min_score": 1.0,
    "max_score": 5.0,
    "passing_score": 3.0
  },
  "created_by": {
    "first_name": "Keywords AI",
    "last_name": "Team",
    "email": "[email protected]"
  },
  "updated_by": {
    "first_name": "Keywords AI",
    "last_name": "Team",
    "email": "[email protected]"
  },
  "created_at": "2025-09-11T09:43:55.858321Z",
  "updated_at": "2025-09-11T09:43:55.858331Z",
  "custom_required_fields": [],
  "categorical_choices": null,
  "starred": false,
  "tags": []
}

Error Responses

400 Bad Request

{
  "configurations": [
    "Configuration validation failed: 1 validation error for KeywordsAICustomLLMEvaluatorType\nscoring_rubric\n  Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]"
  ]
}

401 Unauthorized

{
  "detail": "Your API key is invalid or expired, please check your API key at https://platform.keywordsai.co/platform/api/api-keys"
}