LLM as a judge is a feature that lets you evaluate your prompts with the help of an LLM. You can score your prompts against various metrics and see the results on the Logs page.

This is a beta feature. Please let us know if you encounter any issues; we will continue to improve it.

1. Create an evaluator

To evaluate your prompts, first create a new evaluator on the Evaluators page.

After you click the Create Evaluator button, you will see the following page. Define the evaluator slug here; you will use it to apply the evaluator in your LLM calls.

2. Configure the evaluator

Configure the evaluator with the prompt and the customer you want to evaluate. You can also test the evaluator with a sample prompt.

3. Pass required params in code

After you have configured the evaluator, pass the required parameters in your code to evaluate the prompt.

For example, suppose you are asking a question about the capital of France, and you have retrieved some context about France and Paris.

Suppose you have defined an evaluator with the slug evaluator-slug1, and its required fields are retrieved_contexts, expected_response, input_text (default), and output_text (default).

By default, input_text is extracted from the messages array and output_text is extracted from the LLM's response, so you don't need to specify them in extra_params.

Example code:

{
  "model": "gpt-4o-mini",
  "messages": [
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ],
  "evaluation_params": {
    "evaluators": [{ "evaluator_slug": "evaluator-slug1" }],
    "extra_params": {
      "retrieved_contexts": ["the capital of France is Paris", "France is a country in Europe"],
      "expected_response": "Paris"
    },
    "last_n_messages": 3 // The last 3 messages of the `messages` array are used as the input.
  }
}
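You can send this request body with any HTTP client. Below is a minimal Python sketch; the endpoint URL and API key are placeholders for illustration, so substitute your provider's actual chat completions URL and credentials:

```python
import json
import urllib.request

API_URL = "https://api.example.com/chat/completions"  # placeholder; use your provider's endpoint
API_KEY = "YOUR_API_KEY"  # placeholder credential

def build_payload(question, contexts, expected):
    """Build a chat completion request that attaches the evaluator."""
    return {
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": question}],
        "evaluation_params": {
            "evaluators": [{"evaluator_slug": "evaluator-slug1"}],
            "extra_params": {
                # input_text and output_text are filled in automatically,
                # so only the evaluator's other required fields go here.
                "retrieved_contexts": contexts,
                "expected_response": expected,
            },
            "last_n_messages": 3,
        },
    }

payload = build_payload(
    "What is the capital of France?",
    ["the capital of France is Paris", "France is a country in Europe"],
    "Paris",
)

request = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
)
# urllib.request.urlopen(request) would send the call; the evaluator
# then runs on the response and the scores appear on the Logs page.
```

Note that extra_params only needs the evaluator's fields that cannot be inferred from the call itself.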

4. See the result

The evaluator runs automatically on the LLM call, and you can see the results in the side panel of the corresponding log.