Evaluators API Documentation
Keywords AI evaluators allow you to define functions for evaluating the quality of your LLM responses through a simple REST API.

Base URL

All evaluator endpoints are prefixed with /api/evaluators.
Authentication
All endpoints require API key authentication:
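As a rough sketch, requests would typically carry the key in an authorization header; the exact header scheme shown here is an assumption, so use whatever your Keywords AI API key setup specifies:

```
Authorization: Bearer YOUR_KEYWORDSAI_API_KEY
```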
Evaluator Types and Score Value Types

Evaluator Types (Required)
- llm: LLM-based evaluators that use another LLM to evaluate responses
- human: Human annotation-based evaluators for manual scoring
- code: Python code-based evaluators that run custom evaluation logic
Score Value Types (Required)
- numerical: Numeric scores (e.g., 1-5, 0.0-1.0)
- boolean: True/false or pass/fail evaluations
- categorical: Multiple-choice selections with predefined options
- comment: Text-based feedback and comments
Each score_value_type maps to a specific value field on the resulting score:

- numerical → numerical_value
- boolean → boolean_value
- categorical → categorical_value (array of strings; should match the configured categorical_choices)
- comment → string_value

Report score values in the field that matches the evaluator's score_value_type to ensure consistent behavior across the UI and analytics. See the Scores API docs for details (refer to scores/scores_api_docs.md).
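For example, a categorical result is reported as an array of the selected choice values (a sketch; the full score object is covered in the Scores API docs):

```json
{ "categorical_value": ["friendly"] }
```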
Pre-built Templates (Optional)
You can optionally use pre-built templates by specifying eval_class:

- keywordsai_custom_llm: LLM-based evaluator with standard configuration
- custom_code: Code-based evaluator template
Unified Evaluator Inputs
All evaluator runs now receive a single unified inputs object. This applies to all evaluator types (llm, human, code). The same fields are also recorded and visible on the Scores page for every evaluation.
Structure:
- input (any JSON): The request/input to be evaluated.
- output (any JSON): The response/output being evaluated.
- metrics (object, optional): System-captured metrics (e.g., tokens, latency, cost).
- metadata (object, optional): Context and custom properties you pass; also logged.
- llm_input and llm_output (string, optional): Legacy convenience aliases. New templates should reference {{input}} and {{output}}. Legacy {{llm_input}}/{{llm_output}} placeholders remain supported and transparently map to the unified fields.
- These fields are stored with each evaluation and shown on the Scores page alongside the resulting score.
- When running evaluators from LLM calls, inputs is auto-populated from the request/response and tracing data.
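For illustration, a unified inputs object might look like the following sketch; the particular metric and metadata keys shown are placeholders, since the doc only specifies that these objects hold system-captured metrics and custom properties:

```json
{
  "input": "What is the capital of France?",
  "output": "Paris is the capital of France.",
  "metrics": {
    "prompt_tokens": 12,
    "completion_tokens": 9,
    "latency": 0.42
  },
  "metadata": {
    "customer_id": "customer_123"
  }
}
```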
API Endpoints
Create Evaluator
POST /api/evaluators/
Creates a new evaluator. You must specify type and score_value_type. The eval_class field is optional and only used for pre-built templates.
Required Fields
- name (string): Display name for the evaluator
- type (string): Evaluator type - "llm", "human", or "code"
- score_value_type (string): Score format - "numerical", "boolean", "categorical", or "comment"
Optional Fields
- evaluator_slug (string): Unique identifier (auto-generated if not provided)
- description (string): Description of the evaluator
- eval_class (string): Pre-built template to use (optional)
- configurations (object): Custom configuration based on evaluator type
- categorical_choices (array): Required when score_value_type is "categorical"
Example 1: Custom LLM Evaluator (Numerical)

Request:
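A minimal sketch of a possible request body; the field names follow the definitions in this document, while the specific values (name, prompt, model, score range) are illustrative:

```json
{
  "name": "Response Quality",
  "evaluator_slug": "response-quality",
  "description": "Scores how well the output answers the input.",
  "type": "llm",
  "score_value_type": "numerical",
  "configurations": {
    "evaluator_definition": "Rate how well {{output}} answers {{input}}.",
    "scoring_rubric": "1 = poor, 5 = excellent",
    "llm_engine": "gpt-4o-mini",
    "model_options": { "temperature": 0 },
    "min_score": 1,
    "max_score": 5,
    "passing_score": 3
  }
}
```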
Example 2: Human Categorical Evaluator

Request:
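A sketch, assuming the categorical_choices objects use the name/value shape described under Configuration Fields by Type:

```json
{
  "name": "Tone Review",
  "type": "human",
  "score_value_type": "categorical",
  "categorical_choices": [
    { "name": "Friendly", "value": "friendly" },
    { "name": "Neutral", "value": "neutral" },
    { "name": "Hostile", "value": "hostile" }
  ]
}
```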
Example 3: Code-based Boolean Evaluator

Request:
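A sketch; the evaluate() signature used inside eval_code_snippet is an assumption, since this document only states that the function returns the score:

```json
{
  "name": "Contains Greeting",
  "type": "code",
  "score_value_type": "boolean",
  "configurations": {
    "eval_code_snippet": "def evaluate(inputs):\n    return \"hello\" in str(inputs.get(\"output\", \"\")).lower()"
  }
}
```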
Example 4: LLM Boolean Evaluator

Request:
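Another illustrative sketch; only type and score_value_type differ materially from Example 1:

```json
{
  "name": "Stays On Topic",
  "type": "llm",
  "score_value_type": "boolean",
  "configurations": {
    "evaluator_definition": "Given {{input}}, does {{output}} stay on topic? Answer true or false.",
    "llm_engine": "gpt-4o-mini"
  }
}
```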
Example 5: LLM Categorical Evaluator

Request:
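A sketch combining an LLM evaluator with predefined choices; all values are illustrative:

```json
{
  "name": "Sentiment",
  "type": "llm",
  "score_value_type": "categorical",
  "categorical_choices": [
    { "name": "Positive", "value": "positive" },
    { "name": "Neutral", "value": "neutral" },
    { "name": "Negative", "value": "negative" }
  ],
  "configurations": {
    "evaluator_definition": "Classify the sentiment of {{output}} given {{input}}.",
    "llm_engine": "gpt-4o"
  }
}
```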
Example 6: LLM Comment Evaluator

Request:
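A sketch for free-text feedback from an LLM judge (values illustrative):

```json
{
  "name": "Response Critique",
  "type": "llm",
  "score_value_type": "comment",
  "configurations": {
    "evaluator_definition": "Write a short critique of {{output}} as a response to {{input}}.",
    "llm_engine": "gpt-4o-mini"
  }
}
```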
Example 7: Human Numerical Evaluator

Request:
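A sketch; human evaluators need no configurations block:

```json
{
  "name": "Helpfulness (manual)",
  "description": "Reviewers rate helpfulness from 1 to 5.",
  "type": "human",
  "score_value_type": "numerical"
}
```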
Example 8: Human Boolean Evaluator

Request:
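A sketch for a manual pass/fail review (values illustrative):

```json
{
  "name": "Factually Correct (manual)",
  "type": "human",
  "score_value_type": "boolean"
}
```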
Example 9: Human Comment Evaluator

Request:
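A sketch for collecting free-text reviewer notes (values illustrative):

```json
{
  "name": "Reviewer Notes",
  "type": "human",
  "score_value_type": "comment"
}
```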
Example 10: Using Pre-built Template (Optional)

Request:
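A sketch using the keywordsai_custom_llm template via eval_class; the configuration values are illustrative:

```json
{
  "name": "Template-based Quality Check",
  "type": "llm",
  "score_value_type": "numerical",
  "eval_class": "keywordsai_custom_llm",
  "configurations": {
    "evaluator_definition": "Evaluate how well {{output}} addresses {{input}}.",
    "scoring_rubric": "0.0 = irrelevant, 1.0 = fully addresses the request",
    "llm_engine": "gpt-4o-mini",
    "min_score": 0.0,
    "max_score": 1.0
  }
}
```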
Configuration Fields by Type
For type: "llm" evaluators:

- evaluator_definition (string): The evaluation prompt/instruction. Must include {{input}} and {{output}} template variables. Legacy {{llm_input}} and {{llm_output}} are also supported for backward compatibility.
- scoring_rubric (string): Description of the scoring criteria
- llm_engine (string): LLM model to use (e.g., "gpt-4o-mini", "gpt-4o")
- model_options (object, optional): LLM parameters like temperature, max_tokens
- min_score (number, optional): Minimum possible score
- max_score (number, optional): Maximum possible score
- passing_score (number, optional): Score threshold for passing
type: "code" evaluators:
eval_code_snippet(string): Python code with evaluate() function that returns the score
type: "human" evaluators:
- No specific configuration fields required
- Use the categorical_choices field when score_value_type is "categorical"
score_value_type: "categorical":
categorical_choices(array): List of choice objects withnameandvalueproperties
List Evaluators
GET /api/evaluators/
Returns a paginated list of evaluators for your organization.
Response (200 OK):
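The exact pagination envelope is not reproduced here; as a rough sketch, a conventional shape might look like the following, where field names such as count and results are assumptions:

```json
{
  "count": 2,
  "results": [
    { "id": "evaluator_123", "name": "Response Quality", "evaluator_slug": "response-quality", "type": "llm", "score_value_type": "numerical" },
    { "id": "evaluator_456", "name": "Tone Review", "evaluator_slug": "tone-review", "type": "human", "score_value_type": "categorical" }
  ]
}
```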
Get Evaluator Details
GET /api/evaluators/{evaluator_id}/
Retrieves detailed information about a specific evaluator, including full configuration.
Response (200 OK):
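A sketch of what the response might contain, built from the fields described in this document; the id value and any additional system fields (timestamps, ownership, etc.) are assumptions:

```json
{
  "id": "evaluator_123",
  "name": "Response Quality",
  "evaluator_slug": "response-quality",
  "description": "Scores how well the output answers the input.",
  "type": "llm",
  "score_value_type": "numerical",
  "eval_class": null,
  "configurations": {
    "evaluator_definition": "Rate how well {{output}} answers {{input}}.",
    "scoring_rubric": "1 = poor, 5 = excellent",
    "llm_engine": "gpt-4o-mini",
    "min_score": 1,
    "max_score": 5,
    "passing_score": 3
  }
}
```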
Update Evaluator (Partial)
PATCH /api/evaluators/{evaluator_id}/
Updates specific fields of an evaluator. Supports partial updates of configuration fields.
Request:
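For example, a sketch of a partial update that changes only the scoring rubric and model of an LLM evaluator (values illustrative):

```json
{
  "configurations": {
    "scoring_rubric": "0 = unusable, 5 = production quality",
    "llm_engine": "gpt-4o"
  }
}
```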
- Only provide the fields you want to update
- Configuration fields are merged with existing values
- Non-null values take precedence over existing null values
Delete Evaluator
DELETE /api/evaluators/{evaluator_id}/
Permanently deletes an evaluator.
Response (204 No Content)
Run Evaluator
POST /api/evaluators/{evaluator_id}/run/
Executes an evaluator against provided input/output data for testing purposes.
Request:
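A sketch of a possible request body, assuming the run endpoint accepts the unified inputs object described earlier; whether input/output are nested under inputs or sent at the top level may differ:

```json
{
  "inputs": {
    "input": "What is the capital of France?",
    "output": "Paris is the capital of France."
  }
}
```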
Using Evaluators in LLM Calls
Once created, evaluators can be applied to LLM calls using the evaluation_params section:
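A hedged sketch of what this could look like in an LLM request body; the nesting and field names inside evaluation_params (an evaluators list keyed by evaluator_slug) are assumptions for illustration only:

```json
{
  "model": "gpt-4o-mini",
  "messages": [{ "role": "user", "content": "What is the capital of France?" }],
  "evaluation_params": {
    "evaluators": [{ "evaluator_slug": "response-quality" }]
  }
}
```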
Error Responses
400 Bad Request
401 Unauthorized
404 Not Found
Summary of New API Structure
Key Changes
- type and score_value_type are now required fields
- eval_class is now optional and only used for pre-built templates
- categorical_choices provides a user-friendly alias for categorical evaluations
- Supports flexible evaluation approaches: LLM, human, and code-based
Migration from Old API
If you were using the old API structure, here's how to migrate; a sketch of the old and new request shapes follows the compatibility mapping below.

Backward Compatibility
The API maintains backward compatibility with legacy type values:

- "human_numerical" → "type": "human", "score_value_type": "numerical"
- "human_boolean" → "type": "human", "score_value_type": "boolean"
- "human_categorical" → "type": "human", "score_value_type": "categorical"
- "function" → "type": "code"
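As a rough migration sketch based on the mapping above (other fields unchanged, values illustrative), a legacy definition such as:

```json
{ "name": "Helpfulness", "type": "human_numerical" }
```

would become:

```json
{ "name": "Helpfulness", "type": "human", "score_value_type": "numerical" }
```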
Best Practices
- Use descriptive names and slugs for easy identification
- Test evaluators using the run endpoint before deploying
- Specify both type and score_value_type explicitly for clarity
- Use categorical_choices for human evaluations with predefined options
- Keep evaluation prompts specific for consistent LLM-based results
- Set appropriate score ranges that match your use case
- Use model_options to control LLM behavior (temperature, max_tokens)
Rate Limits
- API key authentication: Standard rate limits apply
- Evaluator execution: May have additional limits based on LLM usage