Authentication
All endpoints require API key authentication.
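For example, with Python's `requests` library the key can be attached to every call. The base URL, header scheme, and listing path below are placeholders, not the documented values; check your dashboard for the exact scheme:

```python
import os
import requests

# Placeholder values -- substitute your actual base URL and API key header.
BASE_URL = "https://api.example.com/v1"
headers = {"Authorization": f"Bearer {os.environ['API_KEY']}"}

# Every endpoint call carries the same headers (the /evaluators path is illustrative).
resp = requests.get(f"{BASE_URL}/evaluators", headers=headers)
resp.raise_for_status()
```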
Path Parameters

| Parameter | Type | Description |
|---|---|---|
| `evaluator_id` | string | The unique ID of the evaluator to run |
Unified Evaluator Inputs
All evaluator runs now receive a single unified `inputs` object. This applies to all evaluator types (`llm`, `human`, `code`). The same fields are also recorded and visible on the Scores page for every evaluation.
Request Body Structure
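As a sketch, the unified body is a single `inputs` object; the field names come from the table below and the values are illustrative:

```python
# Unified request body: everything lives under a single "inputs" object.
payload = {
    "inputs": {
        "input": "What is the capital of France?",        # request being evaluated
        "output": "The capital of France is Paris.",      # response being evaluated
        "metrics": {"total_tokens": 412, "latency_ms": 880},          # optional, system-captured
        "metadata": {"environment": "staging", "customer": "acme"},   # optional, logged with the run
    }
}
```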
Field Descriptions
| Field | Type | Required | Description |
|---|---|---|---|
| `inputs` | object | Yes | The unified input object containing all evaluation data |
| `inputs.input` | any JSON | Yes | The request/input to be evaluated |
| `inputs.output` | any JSON | Yes | The response/output being evaluated |
| `inputs.metrics` | object | No | System-captured metrics (e.g., tokens, latency, cost) |
| `inputs.metadata` | object | No | Context and custom properties you pass; also logged |
| `inputs.llm_input` | string | No | Legacy convenience alias for `input` (maps to the unified fields) |
| `inputs.llm_output` | string | No | Legacy convenience alias for `output` (maps to the unified fields) |
- These fields are stored with each evaluation and shown on the Scores page alongside the resulting score
- When running evaluators from LLM calls, `inputs` is auto-populated from the request/response and tracing data
- Legacy `{{llm_input}}`/`{{llm_output}}` placeholders remain supported and transparently map to the unified fields
- New templates should reference `{{input}}` and `{{output}}` (see the sketch after this list)
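For illustration, an LLM evaluator prompt that references the unified fields might look like the following; only the `{{input}}`/`{{output}}` placeholder names come from the documentation above, the surrounding prompt text is an assumption:

```python
# Illustrative prompt template; {{input}} and {{output}} are the documented placeholders.
EVALUATOR_PROMPT = """
Rate how well the response answers the request on a 1-5 scale.

Request: {{input}}
Response: {{output}}
"""
```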
Examples
Test LLM Evaluator
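A minimal sketch of a test call using `requests`; the base URL, header scheme, and endpoint path are placeholders rather than documented values:

```python
import os
import requests

BASE_URL = "https://api.example.com/v1"                         # placeholder
HEADERS = {"Authorization": f"Bearer {os.environ['API_KEY']}"}  # placeholder header scheme
LLM_EVALUATOR_ID = "your-llm-evaluator-id"

payload = {
    "inputs": {
        "input": "Summarize the refund policy for a customer.",
        "output": "Refunds are available within 30 days of purchase with a receipt.",
        "metadata": {"source": "manual-test"},
    }
}

resp = requests.post(
    f"{BASE_URL}/evaluators/{LLM_EVALUATOR_ID}/run",  # path is an assumption
    headers=HEADERS,
    json=payload,
)
print(resp.json())
```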
Test Code Evaluator
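A code evaluator takes the same unified body; `input` and `output` can be any JSON, including structured data (values illustrative):

```python
# Same unified structure; a code evaluator consumes the object programmatically.
payload = {
    "inputs": {
        "input": {"ticket_id": 123, "question": "Reset my password"},
        "output": {"reply": "A reset link has been sent.", "escalate": False},
        "metrics": {"latency_ms": 240},
    }
}
# POST this to the same run endpoint shown in the LLM example above.
```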
Test Human Categorical Evaluator
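Human categorical evaluators accept the same body; when testing, the API returns placeholder category values (see "Categorical Evaluators" below). Values here are illustrative:

```python
payload = {
    "inputs": {
        "input": "Customer asked about a late delivery.",
        "output": "Apologized and offered a shipping refund.",
        "metadata": {"reviewer_queue": "support-quality"},  # illustrative custom property
    }
}
# POST this to the same run endpoint; the test response contains placeholder categories.
```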
Legacy Format Support
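The legacy string aliases are still accepted and map onto the unified fields, so an older payload like the following keeps working (values illustrative):

```python
# Legacy aliases: llm_input/llm_output map to the unified input/output fields.
payload = {
    "inputs": {
        "llm_input": "What is the capital of France?",
        "llm_output": "The capital of France is Paris.",
    }
}
```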
Response
LLM Evaluator Response
Status: `200 OK`

Code Evaluator Response
Human Categorical Evaluator Response
Response Fields
| Field | Type | Description |
|---|---|---|
| `score` | varies | The evaluation score (type depends on the evaluator's `score_value_type`) |
| `score_type` | string | The type of score: `numerical`, `boolean`, `categorical`, or `comment` |
| `evaluator_id` | string | ID of the evaluator that was run |
| `evaluator_name` | string | Name of the evaluator that was run |
| `evaluation_result` | object | Detailed evaluation results and reasoning |
| `inputs` | object | The input data that was evaluated (echoed back) |
| `execution_time` | number | Time taken to execute the evaluation (in seconds) |
| `timestamp` | string | ISO timestamp of when the evaluation was performed |
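Put together, a successful response might look like this sketch for a numerical evaluator; the field names come from the table above, but the values, IDs, and the keys inside `evaluation_result` are illustrative:

```python
# Illustrative response body; other score types change "score"/"score_type"
# as described in the next section.
example_response = {
    "score": 4.5,
    "score_type": "numerical",
    "evaluator_id": "eval_abc123",                                   # illustrative ID
    "evaluator_name": "Answer Quality",                              # illustrative name
    "evaluation_result": {"reasoning": "Accurate and concise."},     # illustrative keys
    "inputs": {
        "input": "What is the capital of France?",
        "output": "The capital of France is Paris.",
    },
    "execution_time": 1.27,
    "timestamp": "2024-01-01T12:00:00Z",
}
```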
Score Types by Evaluator
Numerical Evaluators
- Score: Number (e.g., `4.5`, `8.2`)
- Range: Defined by the evaluator's `min_score` and `max_score`
- Passing: Determined by the `passing_score` threshold
Boolean Evaluators
- Score: Boolean (`true` or `false`)
- Passing: `true` = passed, `false` = failed
Categorical Evaluators
- Score: Array of selected category names (e.g., `["Good", "Accurate"]`)
- Values: Corresponding numeric values from `categorical_choices`
- Note: Human evaluators return placeholder values for testing
Comment Evaluators
- Score: String with detailed feedback
- Content: Varies based on evaluator configuration
- Length: Can be extensive for detailed feedback
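Because the type of `score` follows `score_type`, client code that consumes any of the evaluator types above usually branches on it. A small sketch (field names from the response table; formatting choices are illustrative):

```python
def summarize_score(result: dict) -> str:
    """Render the score according to its score_type."""
    score = result["score"]
    score_type = result["score_type"]
    if score_type == "numerical":
        return f"{result['evaluator_name']}: {score:.2f}"
    if score_type == "boolean":
        return "passed" if score else "failed"
    if score_type == "categorical":
        return ", ".join(score)        # list of selected category names
    return str(score)                  # comment evaluators return free-form text
```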
Error Responses
400 Bad Request
401 Unauthorized
404 Not Found
422 Unprocessable Entity
500 Internal Server Error
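The exact error payloads are not reproduced here; a defensive client can rely on the status codes alone, for example:

```python
import requests

def run_evaluator(url: str, headers: dict, payload: dict) -> dict:
    """POST a test evaluation and translate common error codes (sketch, not the official client)."""
    resp = requests.post(url, headers=headers, json=payload, timeout=30)
    if resp.status_code == 401:
        raise RuntimeError("Check your API key (401 Unauthorized).")
    if resp.status_code == 404:
        raise RuntimeError("Unknown evaluator_id (404 Not Found).")
    resp.raise_for_status()  # surfaces 400, 422, and 500 responses as exceptions
    return resp.json()
```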
Testing Best Practices
1. Test with Realistic Data
Use actual examples from your use case:
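A sketch along these lines; the base URL, header scheme, endpoint path, and sample cases are placeholders:

```python
import os
import requests

BASE_URL = "https://api.example.com/v1"                         # placeholder
HEADERS = {"Authorization": f"Bearer {os.environ['API_KEY']}"}  # placeholder header scheme
EVALUATOR_ID = "your-evaluator-id"

# Real examples pulled from production logs or your test set, not toy strings.
realistic_cases = [
    {
        "input": "My order #4821 arrived damaged. Can I get a replacement?",
        "output": "I'm sorry about that. I've created replacement order #4822 at no charge.",
    },
    {
        "input": "How do I export my billing history as a CSV?",
        "output": "Go to Settings > Billing > Export and choose CSV.",
    },
]

for case in realistic_cases:
    resp = requests.post(
        f"{BASE_URL}/evaluators/{EVALUATOR_ID}/run",  # path is an assumption
        headers=HEADERS,
        json={"inputs": case},
    )
    print(resp.json()["score"])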
2. Test Edge Cases
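Exercise inputs that tend to break evaluators, such as empty, very long, or non-English responses. A sketch with illustrative cases:

```python
# Edge cases worth testing (illustrative); POST each as {"inputs": case} to the run
# endpoint, as in the previous example, and confirm the evaluator neither errors
# nor returns misleading scores.
edge_cases = [
    {"input": "Hello?", "output": ""},                                          # empty response
    {"input": "Summarize this.", "output": "word " * 5000},                     # very long response
    {"input": "¿Dónde está mi pedido?", "output": "Su pedido llega mañana."},   # non-English
    {"input": "", "output": "Unprompted reply."},                               # missing input
]
```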
3. Validate Configuration
Test your evaluator configuration before production use:
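A sketch that checks a known-good case against an expected outcome; the endpoint path, header scheme, and passing threshold are assumptions you should replace with your evaluator's actual configuration:

```python
import os
import requests

BASE_URL = "https://api.example.com/v1"                         # placeholder
HEADERS = {"Authorization": f"Bearer {os.environ['API_KEY']}"}  # placeholder header scheme
EVALUATOR_ID = "your-evaluator-id"

# A case with a known expected outcome: a clearly correct answer should pass.
known_good = {
    "inputs": {
        "input": "What is 2 + 2?",
        "output": "2 + 2 = 4",
    }
}

resp = requests.post(
    f"{BASE_URL}/evaluators/{EVALUATOR_ID}/run",  # path is an assumption
    headers=HEADERS,
    json=known_good,
)
result = resp.json()

# For a numerical evaluator, confirm the score clears your passing_score threshold
# before wiring the evaluator into production traffic.
assert result["score_type"] == "numerical"
assert result["score"] >= 4.0, f"Unexpectedly low score: {result['score']}"
```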