type and score_value_type. The eval_class field is optional and only used for pre-built templates.
Authentication
All endpoints require API key authentication.
Evaluator Types and Score Value Types
Evaluator Types (Required)
- llm: LLM-based evaluators that use another LLM to evaluate responses
- human: Human annotation-based evaluators for manual scoring
- code: Python code-based evaluators that run custom evaluation logic
Score Value Types (Required)
- numerical: Numeric scores (e.g., 1-5, 0.0-1.0)
- boolean: True/false or pass/fail evaluations
- categorical: Multiple-choice selections with predefined options
- comment: Text-based feedback and comments
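To make the four formats concrete, here is one illustrative score per score_value_type; the values are invented for this sketch, not taken from the API:

```python
# One illustrative score per score_value_type; all values are invented.
example_scores = {
    "numerical": 4.5,                        # e.g., on a 1-5 scale
    "boolean": True,                         # pass/fail
    "categorical": "Friendly",               # one of the predefined options
    "comment": "Clear and concise answer.",  # free-text feedback
}
```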
Pre-built Templates (Optional)
You can optionally use pre-built templates by specifying eval_class:
- keywordsai_custom_llm: LLM-based evaluator with standard configuration
- custom_code: Code-based evaluator template
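A create-evaluator payload that opts into a pre-built template might look like the following sketch; only eval_class and the template name come from the list above, the other values are illustrative:

```python
# Hypothetical payload selecting the keywordsai_custom_llm template.
# Field values other than eval_class are made up for illustration.
payload = {
    "name": "Helpfulness (from template)",
    "type": "llm",
    "score_value_type": "numerical",
    "eval_class": "keywordsai_custom_llm",  # pre-built template name
}
```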
Unified Evaluator Inputs
All evaluator runs now receive a single unified inputs object. This applies to all evaluator types (llm, human, code).
Structure:
- input (any JSON): The request/input to be evaluated.
- output (any JSON): The response/output being evaluated.
- metrics (object, optional): System-captured metrics (e.g., tokens, latency, cost).
- metadata (object, optional): Context and custom properties you pass; also logged.
- llm_input and llm_output (string, optional): Legacy convenience aliases.
Required Fields
- name (string): Display name for the evaluator
- type (string): Evaluator type - "llm", "human", or "code"
- score_value_type (string): Score format - "numerical", "boolean", "categorical", or "comment"
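Client-side, the required fields above could be checked before sending a request. This validator is a sketch layered on the documented field names, not part of the API itself:

```python
# Client-side sanity check for a create-evaluator payload (sketch).
REQUIRED = {"name", "type", "score_value_type"}
VALID_TYPES = {"llm", "human", "code"}
VALID_SCORE_TYPES = {"numerical", "boolean", "categorical", "comment"}

def check_required(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload passes."""
    problems = [f"missing field: {f}" for f in REQUIRED - payload.keys()]
    if payload.get("type") not in VALID_TYPES:
        problems.append("type must be one of: llm, human, code")
    if payload.get("score_value_type") not in VALID_SCORE_TYPES:
        problems.append(
            "score_value_type must be one of: numerical, boolean, categorical, comment"
        )
    return problems
```

For example, `check_required({"name": "x", "type": "llm", "score_value_type": "boolean"})` returns an empty list, while an empty payload reports every missing and invalid field.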
Optional Fields
- evaluator_slug (string): Unique identifier (auto-generated if not provided)
- description (string): Description of the evaluator
- eval_class (string): Pre-built template to use (optional)
- configurations (object): Custom configuration based on evaluator type
- categorical_choices (array): Required when score_value_type is "categorical"
Configuration Fields by Type
For type: "llm" evaluators:
- evaluator_definition (string): The evaluation prompt/instruction. Must include {{input}} and {{output}} template variables. Legacy {{llm_input}} and {{llm_output}} are also supported for backward compatibility.
- scoring_rubric (string): Description of the scoring criteria
- llm_engine (string): LLM model to use (e.g., "gpt-4o-mini", "gpt-4o")
- model_options (object, optional): LLM parameters like temperature, max_tokens
- min_score (number, optional): Minimum possible score
- max_score (number, optional): Maximum possible score
- passing_score (number, optional): Score threshold for passing
For type: "code" evaluators:
- eval_code_snippet (string): Python code with an evaluate() function that returns the score
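A minimal eval_code_snippet might define evaluate() as below. The document does not specify the exact signature the platform calls, so the single unified inputs dict argument is an assumption based on the Unified Evaluator Inputs section above:

```python
# Sketch of an eval_code_snippet body. The evaluate() signature
# (a single unified `inputs` dict) is an assumption, not confirmed here.
def evaluate(inputs: dict) -> float:
    """Score the fraction of input words that appear in the output."""
    request = str(inputs.get("input", ""))
    response = str(inputs.get("output", ""))
    words = request.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in response.lower())
    return hits / len(words)
```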
For type: "human" evaluators:
- No specific configuration fields required
- Use the categorical_choices field when score_value_type is "categorical"
For score_value_type: "categorical":
- categorical_choices (array): List of choice objects with name and value properties