The required fields are name, type, and score_value_type. The eval_class field is optional and only used for pre-built templates.
The type field now represents the primary interface/use case, while automation can be added independently via llm_config or code_config. This decouples the annotation method from the evaluator type.

- llm: Primarily LLM-based evaluators (can also have code automation)
- human: Primarily human annotation-based (can have LLM or code automation for assistance)
- code: Primarily code-based evaluators (can also have LLM automation as fallback)

score_value_type values:

- numerical: Numeric scores (e.g., 1-5, 0.0-1.0)
- boolean: True/false or pass/fail evaluations
- percentage: 0-100 percentage scores (use decimals; 0.0-100.0)
- single_select: Choose exactly one option from predefined choices
- multi_select: Choose one or more options from predefined choices
- json: Structured JSON data for complex evaluations
- text: Text-based feedback and comments

categorical and comment remain supported for older evaluators.

eval_class keywords:

- ai_custom_llm: LLM-based evaluator with standard configuration
- custom_code: Code-based evaluator template
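To illustrate the decoupling described above, here is a minimal sketch of an evaluator definition that pairs a human interface with LLM automation. The field names come from the reference below; the overall shape and all values are illustrative assumptions.

```python
# Minimal sketch of an evaluator definition combining a human interface
# with LLM automation. The surrounding API call is not shown; only the
# field names come from this page, values are illustrative.
evaluator = {
    "name": "Response Helpfulness",
    "type": "human",                  # primary interface is human annotation
    "score_value_type": "numerical",
    "configurations": {
        # LLM automation assists the human annotators,
        # independent of the evaluator's type.
        "llm_config": {
            "model": "gpt-4o-mini",
            "evaluator_definition": (
                "Rate how helpful the response is.\n"
                "Request: {{input}}\nResponse: {{output}}"
            ),
        },
    },
}
```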
Every evaluator receives an inputs object; this applies to all evaluator types (llm, human, code).

Structure:
- input (any JSON): The request/input to be evaluated.
- output (any JSON): The response/output being evaluated.
- metrics (object, optional): System-captured metrics (e.g., tokens, latency, cost).
- metadata (object, optional): Context and custom properties you pass; also logged.
- llm_input and llm_output (string, optional): Legacy convenience aliases.

Evaluator fields:

- name (string): Display name for the evaluator
- type (string): Evaluator type - "llm", "human", or "code"
- score_value_type (string): Score format - "numerical", "boolean", "categorical", or "comment"
- evaluator_slug (string): Unique identifier (auto-generated if not provided)
- description (string): Description of the evaluator
- eval_class (string): Pre-built template to use (optional)
- configurations (object): Custom configuration based on evaluator type
- categorical_choices (array): Required when score_value_type is "categorical"

Automation settings are nested under configurations. This format allows you to add both LLM and code automation to any evaluator type, decoupling the annotation method from the evaluator type.

| Field | Type | Description |
|---|---|---|
| score_config | object | Score type configuration (shape varies by score_value_type) |
| passing_conditions | object | Passing conditions using universal filter format |
| llm_config | object | LLM automation config (if using LLM for scoring) |
| code_config | object | Code automation config (if using code for scoring) |
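As a sketch of how these fields fit together: the table above specifies only the field names, so the nested shapes of score_config, passing_conditions, and code_config below are assumptions for illustration.

```python
# Hypothetical configurations object. Nested shapes are assumed;
# only the top-level field names come from the table above.
configurations = {
    "score_config": {"min": 0.0, "max": 1.0},  # assumed keys for a numerical score
    "passing_conditions": {                    # universal filter format; keys assumed
        "field": "score",
        "operator": ">=",
        "value": 0.7,
    },
    "code_config": {
        # Assumed to carry the Python snippet described in the
        # code-evaluator section below.
        "eval_code_snippet": "def main(eval_inputs):\n    return 1.0",
    },
}
```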
llm_config supports the following parameters (in addition to model and evaluator_definition):

- model, stream
- temperature, top_p, max_tokens, max_completion_tokens
- frequency_penalty, presence_penalty, stop
- response_format, verbosity
- tools, tool_choice, parallel_tool_calls
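A sketch of an llm_config using several of these parameters; the parameter names come from the list above, while all values are illustrative.

```python
# Illustrative llm_config; parameter names come from the list above,
# values are arbitrary examples.
llm_config = {
    "model": "gpt-4o-mini",
    "evaluator_definition": (
        "Score the response from 1-5 for accuracy.\n"
        "Input: {{input}}\nOutput: {{output}}"
    ),
    "temperature": 0.0,   # deterministic scoring
    "max_tokens": 256,
    "response_format": {"type": "json_object"},  # assumed OpenAI-style value
}
```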
type: "llm" evaluators:
- evaluator_definition (string): The evaluation prompt/instruction. Must include {{input}} and {{output}} template variables. Legacy {{llm_input}} and {{llm_output}} are also supported for backward compatibility.
- scoring_rubric (string): Description of the scoring criteria
- llm_engine (string): LLM model to use (e.g., "gpt-4o-mini", "gpt-4o")
- model_options (object, optional): LLM parameters like temperature, max_tokens
- min_score (number, optional): Minimum possible score
- max_score (number, optional): Maximum possible score
- passing_score (number, optional): Score threshold for passing
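For example, a legacy-format LLM evaluator configuration might look like this; the field names come from the list above, and all values are illustrative.

```python
# Illustrative legacy configurations for a type "llm" evaluator,
# using the fields listed above; values are arbitrary examples.
configurations = {
    "evaluator_definition": (
        "Given the request {{input}} and the response {{output}}, "
        "rate factual accuracy."
    ),
    "scoring_rubric": "5 = fully accurate, 1 = mostly incorrect",
    "llm_engine": "gpt-4o-mini",
    "model_options": {"temperature": 0.0, "max_tokens": 128},
    "min_score": 1,
    "max_score": 5,
    "passing_score": 4,
}
```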
For type: "code" evaluators:

- eval_code_snippet (string): Python code with a main(eval_inputs) function that returns the score
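A minimal sketch of such a snippet, assuming eval_inputs carries the inputs object fields documented earlier (input, output, metrics, metadata); the scoring rule itself is a toy example.

```python
# Minimal eval_code_snippet sketch. Assumes eval_inputs mirrors the
# inputs object documented above (input, output, metrics, metadata).
def main(eval_inputs):
    output = eval_inputs.get("output") or ""
    # Toy scoring rule for illustration: pass if the output is
    # non-empty and under 2000 characters.
    return 1.0 if 0 < len(str(output)) < 2000 else 0.0
```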
For type: "human" evaluators:

- Provide the categorical_choices field when score_value_type is "single_select" or "multi_select"

For score_value_type: "single_select" | "multi_select":
- categorical_choices (array): List of choice objects with name and value properties
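For instance, with illustrative choice names and values:

```python
# Illustrative categorical_choices array; each choice carries the
# name and value properties described above.
categorical_choices = [
    {"name": "Excellent", "value": 3},
    {"name": "Acceptable", "value": 2},
    {"name": "Poor", "value": 1},
]
```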