Every evaluator must specify a type and a score_value_type. The eval_class field is optional and only used for pre-built templates.
Authentication
All endpoints require API key authentication:
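A minimal sketch of an authenticated request, assuming a Bearer-token Authorization header and an illustrative endpoint path:

```python
import requests

API_KEY = "your-api-key"  # replace with your actual key

# Endpoint URL is illustrative; use the base URL for your account.
response = requests.get(
    "https://api.keywordsai.co/api/evaluators/",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
response.raise_for_status()
print(response.json())
```

Evaluator Types and Score Value Types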
Evaluator Types (Required)
- llm: LLM-based evaluators that use another LLM to evaluate responses
- human: Human annotation-based evaluators for manual scoring
- code: Python code-based evaluators that run custom evaluation logic
Score Value Types (Required)
- numerical: Numeric scores (e.g., 1-5, 0.0-1.0)
- boolean: True/false or pass/fail evaluations
- categorical: Multiple choice selections with predefined options
- comment: Text-based feedback and comments
Pre-built Templates (Optional)
You can optionally use pre-built templates by specifying eval_class:
- keywordsai_custom_llm: LLM-based evaluator with standard configuration
- custom_code: Code-based evaluator template
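For instance, a creation payload that opts into a template might set eval_class alongside the required fields (a hypothetical sketch):

```python
# Hypothetical payload fragment selecting a pre-built template.
payload = {
    "name": "Standard LLM Grader",
    "type": "llm",
    "score_value_type": "numerical",
    "eval_class": "keywordsai_custom_llm",
}
```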
Unified Evaluator Inputs
All evaluator runs now receive a single unified inputs object. This applies to all evaluator types (llm, human, code).
Structure:
- input (any JSON): The request/input to be evaluated.
- output (any JSON): The response/output being evaluated.
- metrics (object, optional): System-captured metrics (e.g., tokens, latency, cost).
- metadata (object, optional): Context and custom properties you pass; also logged.
- llm_input and llm_output (string, optional): Legacy convenience aliases.
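For illustration, a unified inputs object might look like the following (all field values are hypothetical):

```python
# Hypothetical example of the unified `inputs` object an evaluator receives.
inputs = {
    "input": {"messages": [{"role": "user", "content": "What is the capital of France?"}]},
    "output": {"content": "The capital of France is Paris."},
    "metrics": {"prompt_tokens": 12, "completion_tokens": 9, "latency": 0.42},
    "metadata": {"customer_id": "abc-123"},
}
```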
Required Fields
- name (string): Display name for the evaluator
- type (string): Evaluator type - "llm", "human", or "code"
- score_value_type (string): Score format - "numerical", "boolean", "categorical", or "comment"
Optional Fields
- evaluator_slug (string): Unique identifier (auto-generated if not provided)
- description (string): Description of the evaluator
- eval_class (string): Pre-built template to use (optional)
- configurations (object): Custom configuration based on evaluator type
- categorical_choices (array): Required when score_value_type is "categorical"
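Putting these together, a create-evaluator request might look like this sketch (the endpoint path, base URL, and header format are assumptions):

```python
import requests

payload = {
    # Required fields
    "name": "Response Relevance",
    "type": "llm",
    "score_value_type": "numerical",
    # Optional fields
    "evaluator_slug": "response-relevance",
    "description": "Scores how relevant the output is to the input.",
}

# Endpoint URL is illustrative.
response = requests.post(
    "https://api.keywordsai.co/api/evaluators/",
    headers={"Authorization": "Bearer your-api-key"},
    json=payload,
)
print(response.json())
```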
Configuration Fields by Type
For type: "llm" evaluators:
- evaluator_definition (string): The evaluation prompt/instruction. Must include {{input}} and {{output}} template variables. Legacy {{llm_input}} and {{llm_output}} are also supported for backward compatibility.
- scoring_rubric (string): Description of the scoring criteria
- llm_engine (string): LLM model to use (e.g., "gpt-4o-mini", "gpt-4o")
- model_options (object, optional): LLM parameters like temperature, max_tokens
- min_score (number, optional): Minimum possible score
- max_score (number, optional): Maximum possible score
- passing_score (number, optional): Score threshold for passing
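As an illustration, a configurations object for an LLM evaluator might look like this (the prompt, model name, and score values are examples):

```python
# Illustrative configurations for an LLM evaluator; all values are examples.
configurations = {
    "evaluator_definition": (
        "Given the request {{input}} and the response {{output}}, "
        "rate how relevant the response is on a scale of 1 to 5."
    ),
    "scoring_rubric": "1 = irrelevant, 3 = partially relevant, 5 = fully relevant",
    "llm_engine": "gpt-4o-mini",
    "model_options": {"temperature": 0.0, "max_tokens": 256},
    "min_score": 1,
    "max_score": 5,
    "passing_score": 3,
}
```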
For type: "code" evaluators:
- eval_code_snippet (string): Python code with an evaluate() function that returns the score
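A minimal sketch of what eval_code_snippet might contain; the exact signature and the argument passed to evaluate() are assumptions based on the unified inputs object described above:

```python
def evaluate(inputs):
    """Return 1.0 if the output mentions the expected answer, else 0.0."""
    # `inputs` is assumed to be the unified inputs object
    # (input, output, metrics, metadata).
    output_text = str(inputs.get("output", ""))
    return 1.0 if "paris" in output_text.lower() else 0.0
```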
For type: "human" evaluators:
- No specific configuration fields are required
- Use the categorical_choices field when score_value_type is "categorical"
For score_value_type: "categorical":
- categorical_choices (array): List of choice objects with name and value properties
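For example, a categorical evaluator might define its choices like this (names and values are illustrative):

```python
# Illustrative categorical_choices array.
categorical_choices = [
    {"name": "Excellent", "value": 3},
    {"name": "Acceptable", "value": 2},
    {"name": "Poor", "value": 1},
]
```

Each choice pairs a human-readable name with the value recorded as the score.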