Every evaluator must specify a type and a score_value_type. The eval_class field is optional and only used for pre-built templates.
Authentication
All endpoints require API key authentication:
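A minimal sketch of an authenticated request, assuming a Bearer-token Authorization header and an illustrative endpoint path:

```python
import requests

API_KEY = "your-api-key"  # replace with your actual key

# Endpoint URL is illustrative; use the base URL for your account.
response = requests.get(
    "https://api.keywordsai.co/api/evaluators/",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
response.raise_for_status()
print(response.json())
```

Evaluator Types and Score Value Types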
Evaluator Types (Required)
- llm: LLM-based evaluators that use another LLM to evaluate responses
- human: Human annotation-based evaluators for manual scoring
- code: Python code-based evaluators that run custom evaluation logic
Score Value Types (Required)
- numerical: Numeric scores (e.g., 1-5, 0.0-1.0)
- boolean: True/false or pass/fail evaluations
- categorical: Multiple choice selections with predefined options
- comment: Text-based feedback and comments
Pre-built Templates (Optional)
You can optionally use pre-built templates by specifying eval_class:
- keywordsai_custom_llm: LLM-based evaluator with standard configuration
- custom_code: Code-based evaluator template
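For instance, a creation payload that opts into a template might set eval_class alongside the required fields (a hypothetical sketch):

```python
# Hypothetical payload fragment selecting a pre-built template.
payload = {
    "name": "Standard LLM Grader",
    "type": "llm",
    "score_value_type": "numerical",
    "eval_class": "keywordsai_custom_llm",
}
```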
Unified Evaluator Inputs
All evaluator runs now receive a single unified inputs object. This applies to all evaluator types (llm, human, code).
Structure:
- input (any JSON): The request/input to be evaluated.
- output (any JSON): The response/output being evaluated.
- metrics (object, optional): System-captured metrics (e.g., tokens, latency, cost).
- metadata (object, optional): Context and custom properties you pass; also logged.
- llm_input and llm_output (string, optional): Legacy convenience aliases.
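For illustration, a unified inputs object might look like the following (all field values are hypothetical):

```python
# Hypothetical example of the unified `inputs` object an evaluator receives.
inputs = {
    "input": {"messages": [{"role": "user", "content": "What is the capital of France?"}]},
    "output": {"content": "The capital of France is Paris."},
    "metrics": {"prompt_tokens": 12, "completion_tokens": 9, "latency": 0.42},
    "metadata": {"customer_id": "abc-123"},
}
```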
Required Fields
- name (string): Display name for the evaluator
- type (string): Evaluator type - "llm", "human", or "code"
- score_value_type (string): Score format - "numerical", "boolean", "categorical", or "comment"
Optional Fields
- evaluator_slug (string): Unique identifier (auto-generated if not provided)
- description (string): Description of the evaluator
- eval_class (string): Pre-built template to use (optional)
- configurations (object): Custom configuration based on evaluator type
- categorical_choices (array): Required when score_value_type is "categorical"
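Putting these together, a create-evaluator request might look like this sketch (the endpoint path, base URL, and header format are assumptions):

```python
import requests

payload = {
    # Required fields
    "name": "Response Relevance",
    "type": "llm",
    "score_value_type": "numerical",
    # Optional fields
    "evaluator_slug": "response-relevance",
    "description": "Scores how relevant the output is to the input.",
}

# Endpoint URL is illustrative.
response = requests.post(
    "https://api.keywordsai.co/api/evaluators/",
    headers={"Authorization": "Bearer your-api-key"},
    json=payload,
)
print(response.json())
```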
Configuration Fields by Type
For type: "llm" evaluators:
- evaluator_definition (string): The evaluation prompt/instruction. Must include {{input}} and {{output}} template variables. Legacy {{llm_input}} and {{llm_output}} are also supported for backward compatibility.
- scoring_rubric (string): Description of the scoring criteria
- llm_engine (string): LLM model to use (e.g., "gpt-4o-mini", "gpt-4o")
- model_options (object, optional): LLM parameters like temperature, max_tokens
- min_score (number, optional): Minimum possible score
- max_score (number, optional): Maximum possible score
- passing_score (number, optional): Score threshold for passing
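As an illustration, a configurations object for an LLM evaluator might look like this (the prompt, model name, and score values are examples):

```python
# Illustrative configurations for an LLM evaluator; all values are examples.
configurations = {
    "evaluator_definition": (
        "Given the request {{input}} and the response {{output}}, "
        "rate how relevant the response is on a scale of 1 to 5."
    ),
    "scoring_rubric": "1 = irrelevant, 3 = partially relevant, 5 = fully relevant",
    "llm_engine": "gpt-4o-mini",
    "model_options": {"temperature": 0.0, "max_tokens": 256},
    "min_score": 1,
    "max_score": 5,
    "passing_score": 3,
}
```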
For type: "code" evaluators:
- eval_code_snippet (string): Python code with an evaluate() function that returns the score
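A minimal sketch of what eval_code_snippet might contain; the exact signature and the argument passed to evaluate() are assumptions based on the unified inputs object described above:

```python
def evaluate(inputs):
    """Return 1.0 if the output mentions the expected answer, else 0.0."""
    # `inputs` is assumed to be the unified inputs object
    # (input, output, metrics, metadata).
    output_text = str(inputs.get("output", ""))
    return 1.0 if "paris" in output_text.lower() else 0.0
```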
For type: "human" evaluators:
- No specific configuration fields are required
- Use the categorical_choices field when score_value_type is "categorical"
For score_value_type: "categorical":
- categorical_choices (array): List of choice objects with name and value properties
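For example, a categorical evaluator might define its choices like this (names and values are illustrative):

```python
# Illustrative categorical_choices array.
categorical_choices = [
    {"name": "Excellent", "value": 3},
    {"name": "Acceptable", "value": 2},
    {"name": "Poor", "value": 1},
]
```

Each choice pairs a human-readable name with the value recorded as the score.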