Create evaluator

POST https://api.keywordsai.co/api/evaluators
curl --request POST \
  --url https://api.keywordsai.co/api/evaluators/ \
  --header 'Authorization: Bearer <token>'
Creates a new evaluator for your organization. You must specify type and score_value_type. The eval_class field is optional and only used for pre-built templates.

Authentication

All endpoints require API key authentication:
Authorization: Bearer YOUR_API_KEY

Evaluator Types and Score Value Types

Evaluator Types (Required)

Important: The evaluator type field now represents the primary interface/use case, but automation can be added independently via llm_config or code_config. This decouples the annotation method from the evaluator type; a sketch of the decoupling follows the list below.
  • llm: Primarily LLM-based evaluators (can also have code automation)
  • human: Primarily human annotation-based (can have LLM or code automation for assistance)
  • code: Primarily code-based evaluators (can also have LLM automation as fallback)
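
For example, a human-first evaluator can still carry code automation that proposes a pass/fail score. A minimal sketch using the new flat-format fields documented below; the slug and code snippet are illustrative, not taken from this page:

import requests

url = "https://api.keywordsai.co/api/evaluators/"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}

data = {
    "name": "Human Review with Code Suggestion",
    "evaluator_slug": "human_code_assist_v1",  # illustrative slug
    "type": "human",  # primary interface: human annotation
    "score_value_type": "boolean",
    "code_config": {  # code automation attached independently of the type
        "eval_code_snippet": "def main(eval_inputs):\n    return bool(eval_inputs.get('output'))"
    }
}

response = requests.post(url, headers=headers, json=data)
print(response.json())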

Score Value Types (Required)

  • numerical: Numeric scores (e.g., 1-5, 0.0-1.0)
  • boolean: True/false or pass/fail evaluations
  • percentage: 0-100 percentage scores (use decimals; 0.0–100.0)
  • single_select: Choose exactly one option from predefined choices
  • multi_select: Choose one or more options from predefined choices
  • json: Structured JSON data for complex evaluations
  • text: Text-based feedback and comments
  • (Legacy) categorical and comment remain supported for reading older evaluators

Pre-built Templates (Optional)

You can optionally use pre-built templates by specifying eval_class:
  • keywordsai_custom_llm: LLM-based evaluator with standard configuration
  • custom_code: Code-based evaluator template

Unified Evaluator Inputs

All evaluator runs now receive a single unified inputs object. This applies to all evaluator types (llm, human, code); a code-evaluator sketch follows the field list below. Structure:
{
  "inputs": {
    "input": {},
    "output": {},
    "metrics": {},
    "metadata": {}
  }
}
  • input (any JSON): The request/input to be evaluated.
  • output (any JSON): The response/output being evaluated.
  • metrics (object, optional): System-captured metrics (e.g., tokens, latency, cost).
  • metadata (object, optional): Context and custom properties you pass; also logged.
  • llm_input and llm_output (string, optional): Legacy convenience aliases.
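
In a code evaluator, this unified object is what arrives as the eval_inputs argument of main(). A minimal sketch; the top-level keys match the structure above, but any key names inside metrics (e.g., "latency") are illustrative assumptions:

def main(eval_inputs):
    # Top-level keys are documented above: input, output, metrics, metadata.
    output = eval_inputs.get("output", "")
    metrics = eval_inputs.get("metrics", {}) or {}

    # Example rule: pass only if there is a non-empty output and latency stayed low.
    has_output = bool(str(output).strip())
    low_latency = metrics.get("latency", 0) < 2.0  # "latency" key name is an assumption
    return has_output and low_latency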

Required Fields

  • name (string): Display name for the evaluator
  • type (string): Evaluator type - "llm", "human", or "code"
  • score_value_type (string): Score format - "numerical", "boolean", "percentage", "single_select", "multi_select", "json", or "text" (legacy "categorical" and "comment" are still accepted)

Optional Fields

  • evaluator_slug (string): Unique identifier (auto-generated if not provided)
  • description (string): Description of the evaluator
  • eval_class (string): Pre-built template to use (optional)
  • configurations (object): Custom configuration based on evaluator type
  • categorical_choices (array): Required when score_value_type is "categorical"; in the legacy format it also supplies the choices for "single_select" and "multi_select" (the new format uses score_config.choices instead)

New Format

The new evaluator format uses clean, flat configuration fields instead of nested configurations. This format allows you to add both LLM and code automation to any evaluator type, decoupling the annotation method from the evaluator type.

New Top-Level Fields (All Optional)

  • score_config (object): Score type configuration (shape varies by score_value_type)
  • passing_conditions (object): Passing conditions using the universal filter format
  • llm_config (object): LLM automation config (if using LLM for scoring)
  • code_config (object): Code automation config (if using code for scoring)

Score Config Shapes

Numerical/Percentage:
{
  "min_score": 0.0,
  "max_score": 5.0,
  "choices": [...]  // Optional discrete values
}
Single/Multi Select:
{
  "choices": [
    {"name": "Professional", "value": "professional"},
    {"name": "Casual", "value": "casual"}
  ]
}

LLM Config

{
  "model": "gpt-4o-mini",
  "evaluator_definition": "Your prompt template with {{input}} and {{output}}",
  "scoring_rubric": "Scoring instructions",
  "temperature": 0.1,
  "max_tokens": 200
}
Available LLM config fields (all optional except model and evaluator_definition):
  • Core: model, stream
  • Sampling: temperature, top_p, max_tokens, max_completion_tokens
  • Penalties: frequency_penalty, presence_penalty, stop
  • Formatting: response_format, verbosity
  • Tools: tools, tool_choice, parallel_tool_calls

Code Config

{
  "eval_code_snippet": "def main(eval_inputs):\n    return 1 if 'success' in eval_inputs.get('output', '') else 0"
}

Passing Conditions

Uses the universal filter format. Example:
{
  "primary_score": {
    "operator": "gte",
    "value": 3
  }
}
For complete details, see the Filters API Reference.

Legacy Format (Still Supported)

The legacy configurations format remains fully functional for backward compatibility.

Configuration Fields by Type

For type: "llm" evaluators:
  • evaluator_definition (string): The evaluation prompt/instruction. Must include {{input}} and {{output}} template variables. Legacy {{llm_input}} and {{llm_output}} are also supported for backward compatibility.
  • scoring_rubric (string): Description of the scoring criteria
  • llm_engine (string): LLM model to use (e.g., "gpt-4o-mini", "gpt-4o")
  • model_options (object, optional): LLM parameters like temperature, max_tokens
  • min_score (number, optional): Minimum possible score
  • max_score (number, optional): Maximum possible score
  • passing_score (number, optional): Score threshold for passing
For type: "code" evaluators:
  • eval_code_snippet (string): Python code with main(eval_inputs) function that returns the score
For type: "human" evaluators:
  • No specific configuration fields required
  • Use the categorical_choices field when score_value_type is "single_select" or "multi_select"
For score_value_type: "single_select" | "multi_select":
  • categorical_choices (array): List of choice objects with name and value properties
    [
      { "name": "Excellent", "value": 5 },
      { "name": "Good", "value": 4 }
    ]
    

Examples

New Format Examples

LLM Evaluator with Automation (Numerical)

import requests

url = "https://api.keywordsai.co/api/evaluators/"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}

data = {
    "name": "Response Quality",
    "evaluator_slug": "response_quality_v2",
    "score_value_type": "numerical",
    "score_config": {
        "min_score": 1,
        "max_score": 5,
        "choices": [
            {"name": "Poor", "value": 1},
            {"name": "Fair", "value": 2},
            {"name": "Good", "value": 3},
            {"name": "Great", "value": 4},
            {"name": "Excellent", "value": 5}
        ]
    },
    "passing_conditions": {
        "primary_score": {
            "operator": "gte",
            "value": 3
        }
    },
    "llm_config": {
        "model": "gpt-4o-mini",
        "evaluator_definition": "Rate the quality of this response:\n<input>{{input}}</input>\n<output>{{output}}</output>",
        "scoring_rubric": "1=Poor, 5=Excellent",
        "temperature": 0.1
    }
}

response = requests.post(url, headers=headers, json=data)
print(response.json())

Human Evaluator with LLM Assistance

This shows how a human evaluator can include LLM automation that suggests scores, decoupling the annotation method from the evaluator type.
data = {
    "name": "Human Review with AI Assistance",
    "evaluator_slug": "human_ai_assist_v1",
    "type": "human",
    "score_value_type": "numerical",
    "score_config": {"min_score": 1, "max_score": 5},
    "llm_config": {
        "model": "gpt-4o-mini",
        "evaluator_definition": "Suggest a quality score for this response",
        "temperature": 0.1
    }
}

response = requests.post(url, headers=headers, json=data)
print(response.json())

Code Evaluator (Boolean)

data = {
    "name": "Length Check",
    "evaluator_slug": "length_check_v1",
    "score_value_type": "boolean",
    "description": "Checks if response is longer than 10 characters",
    "code_config": {
        "eval_code_snippet": "def main(eval_inputs):\n    output = eval_inputs.get('output', '')\n    return len(str(output)) > 10"
    }
}

response = requests.post(url, headers=headers, json=data)
print(response.json())

Single Select Evaluator with LLM

data = {
    "name": "Tone Classifier",
    "score_value_type": "single_select",
    "score_config": {
        "choices": [
            {"name": "Professional", "value": "professional"},
            {"name": "Casual", "value": "casual"},
            {"name": "Formal", "value": "formal"}
        ]
    },
    "llm_config": {
        "model": "gpt-4o-mini",
        "evaluator_definition": "Classify the tone of this response"
    }
}

response = requests.post(url, headers=headers, json=data)
print(response.json())
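
Multi Select Evaluator (Sketch)

A multi_select example is not shown above. Assuming it mirrors the single_select shape, with choices defined in score_config and one or more values selected per run, a sketch (the evaluator name and choices are illustrative):

data = {
    "name": "Issue Tags",
    "type": "llm",
    "score_value_type": "multi_select",
    "score_config": {
        "choices": [
            {"name": "Hallucination", "value": "hallucination"},
            {"name": "Off-topic", "value": "off_topic"},
            {"name": "Formatting issue", "value": "formatting_issue"}
        ]
    },
    "llm_config": {
        "model": "gpt-4o-mini",
        "evaluator_definition": "Select every issue present in this response:\n<input>{{input}}</input>\n<output>{{output}}</output>"
    }
}

response = requests.post(url, headers=headers, json=data)
print(response.json())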

Legacy Format Examples

Custom LLM Evaluator (Numerical)

import requests

url = "https://api.keywordsai.co/api/evaluators/"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}

data = {
    "name": "Response Quality Evaluator",
    "evaluator_slug": "response_quality_v1",
    "type": "llm",
    "score_value_type": "numerical",
    "description": "Evaluates response quality on a 1-5 scale",
    "configurations": {
        "evaluator_definition": "Rate the response quality based on accuracy, relevance, and completeness.\n<llm_input>{{llm_input}}</llm_input>\n<llm_output>{{llm_output}}</llm_output>",
        "scoring_rubric": "1=Poor, 2=Fair, 3=Good, 4=Very Good, 5=Excellent",
        "llm_engine": "gpt-4o-mini",
        "model_options": {
            "temperature": 0.1,
            "max_tokens": 200
        },
        "min_score": 1.0,
        "max_score": 5.0,
        "passing_score": 3.0
    }
}

response = requests.post(url, headers=headers, json=data)
print(response.json())

Human Categorical Evaluator

data = {
    "name": "Content Quality Assessment",
    "evaluator_slug": "content_quality_categorical",
    "type": "human",
    "score_value_type": "categorical",
    "description": "Human assessment of content quality with predefined categories",
    "categorical_choices": [
        { "name": "Excellent", "value": 5 },
        { "name": "Good", "value": 4 },
        { "name": "Average", "value": 3 },
        { "name": "Poor", "value": 2 },
        { "name": "Very Poor", "value": 1 }
    ]
}

response = requests.post(url, headers=headers, json=data)
print(response.json())

Code-based Boolean Evaluator

data = {
    "name": "Response Length Checker",
    "evaluator_slug": "length_checker_boolean",
    "type": "code",
    "score_value_type": "boolean",
    "description": "Checks if response meets minimum length requirement",
    "configurations": {
        "eval_code_snippet": "def evaluate(llm_input, llm_output, **kwargs):\n    '''\n    Check if response meets minimum length requirement\n    Returns True if length >= 50 characters, False otherwise\n    '''\n    if not llm_output:\n        return False\n    \n    return len(llm_output.strip()) >= 50"
    }
}

response = requests.post(url, headers=headers, json=data)
print(response.json())

LLM Boolean Evaluator

data = {
    "name": "LLM Factual Accuracy Check",
    "evaluator_slug": "llm_factual_accuracy",
    "type": "llm",
    "score_value_type": "boolean",
    "description": "LLM-based evaluator that checks if response is factually accurate",
    "configurations": {
        "evaluator_definition": "Determine if the response is factually accurate and contains no misinformation.\n<llm_input>{{llm_input}}</llm_input>\n<llm_output>{{llm_output}}</llm_output>",
        "scoring_rubric": "Return True if factually accurate, False if contains errors or misinformation",
        "llm_engine": "gpt-4o-mini"
    }
}

response = requests.post(url, headers=headers, json=data)
print(response.json())

Using Pre-built Template

data = {
    "name": "Template-based LLM Evaluator",
    "evaluator_slug": "template_llm_eval",
    "type": "llm",
    "score_value_type": "numerical",
    "eval_class": "keywordsai_custom_llm",
    "description": "Uses pre-built LLM evaluator template",
    "configurations": {
        "evaluator_definition": "Evaluate response accuracy and helpfulness.\n<llm_input>{{llm_input}}</llm_input>\n<llm_output>{{llm_output}}</llm_output>",
        "scoring_rubric": "Score from 1-10 based on accuracy and helpfulness",
        "llm_engine": "gpt-4o",
        "min_score": 1.0,
        "max_score": 10.0
    }
}

response = requests.post(url, headers=headers, json=data)
print(response.json())

Response

Status: 201 Created
{
  "id": "0f4325f9-55ef-4c20-8abe-376694419947",
  "name": "Response Quality Evaluator",
  "evaluator_slug": "response_quality_v1",
  "type": "llm",
  "score_value_type": "numerical",
  "eval_class": "",
  "description": "Evaluates response quality on a 1-5 scale",
  "configurations": {
    "evaluator_definition": "Rate the response quality based on accuracy, relevance, and completeness.\n<llm_input>{{llm_input}}</llm_input>\n<llm_output>{{llm_output}}</llm_output>",
    "scoring_rubric": "1=Poor, 2=Fair, 3=Good, 4=Very Good, 5=Excellent",
    "llm_engine": "gpt-4o-mini",
    "model_options": {
      "temperature": 0.1,
      "max_tokens": 200
    },
    "min_score": 1.0,
    "max_score": 5.0,
    "passing_score": 3.0
  },
  "created_by": {
    "first_name": "Keywords AI",
    "last_name": "Team",
    "email": "[email protected]"
  },
  "updated_by": {
    "first_name": "Keywords AI",
    "last_name": "Team",
    "email": "[email protected]"
  },
  "created_at": "2025-09-11T09:43:55.858321Z",
  "updated_at": "2025-09-11T09:43:55.858331Z",
  "custom_required_fields": [],
  "categorical_choices": null,
  "starred": false,
  "tags": []
}

Error Responses

400 Bad Request

{
  "configurations": [
    "Configuration validation failed: 1 validation error for KeywordsAICustomLLMEvaluatorType\nscoring_rubric\n  Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]"
  ]
}

401 Unauthorized

{
  "detail": "Your API key is invalid or expired, please check your API key at https://platform.keywordsai.co/platform/api/api-keys"
}