LLM as a judge is a feature that lets you evaluate your prompts with the help of an LLM. You can evaluate your prompts against various metrics and see the results on the Logs page.

This is a beta feature. Please let us know if you encounter any issues; we'll continue to improve it.

1

Create an evaluator

To evaluate your prompts, you should first create a new evaluator on the Evaluators page.

2

Choose a metric

Choose a metric to evaluate your prompts. We have integrated evaluation frameworks from Ragas and Relari (coming soon).

We are adding more metrics. Let us know if you want to add a specific metric.

3

Configure the evaluator

Configure the evaluator with the prompt and the customer you want to evaluate. You can also test the evaluator with a sample prompt.

4

Pass required params in code

After you have configured the evaluator, pass the required parameters in your code to evaluate the prompt.

Example code:

{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "user",
      "content": "When was the first super bowl?"
    }
  ],
  "evaluation_params": {
    "evaluation_identifier": "rag_evaluation_1", // you will need to pass in the evaluation_identifier to mark the logs you want to run evals on
    "extra_params": {
      "contexts": ["The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,"], 
      // could be your RAG content, system prompt, or any other context you want to pass in
    },
    "sample_percentage": 100 // this will override the Sample percentage in the evaluator UI
    //last_n_messages: 5 // how many of the most recent messages to consider for evaluation Question, default is 1
  }
}
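
For reference, here is a minimal sketch of how this request body could be sent from Python. The endpoint URL and Authorization header below are placeholders for your own deployment's values, not part of this documentation; only the evaluation_params structure comes from the example above.

import requests

# Placeholder values -- substitute your own API endpoint and key.
API_URL = "https://api.example.com/chat/completions"
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "gpt-3.5-turbo",
    "messages": [
        {"role": "user", "content": "When was the first super bowl?"}
    ],
    "evaluation_params": {
        # Marks the logs you want to run evals on.
        "evaluation_identifier": "rag_evaluation_1",
        "extra_params": {
            # Your RAG content, system prompt, or any other context.
            "contexts": [
                "The First AFL–NFL World Championship Game was an American "
                "football game played on January 15, 1967, at the Los Angeles "
                "Memorial Coliseum in Los Angeles,"
            ]
        },
        # Overrides the Sample percentage set in the evaluator UI.
        "sample_percentage": 100,
    },
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
print(response.json())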

Explanation of extra_params

You must pass the extra_params required by the metric you choose. For example, if you choose the Ragas Answer Relevancy metric, the required parameters are Question, Answer, and Contexts.

The Question and Answer fields are automatically extracted from the conversation log when the evaluator requires them, so they don't need to be explicitly provided in extra_params.
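
As an illustration (assuming the Ragas Answer Relevancy metric from the example above), extra_params would only need to carry the Contexts, since the Question and Answer are filled in from the conversation log:

# Only Contexts needs to be supplied for Answer Relevancy;
# Question and Answer are taken from the conversation log.
extra_params = {
    "contexts": [
        "The First AFL–NFL World Championship Game was an American football game "
        "played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,"
    ]
}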

5

See the result

After you have passed the required parameters in the code, you can see the evaluation results on the Logs page.