Keywords AI integrates with Continuous Eval, an open-source package for granular and holistic evaluation of GenAI application pipelines, to provide a comprehensive evaluation of the LLM-generated results from the Keywords AI API.

You can find the evaluation settings at Keywords AI (top of left nav bar) > Evaluation.

Sampling

Running an evaluation on every request is expensive, so you can specify a percentage of API calls to be randomly sampled for evaluation.

The default value is 10%.
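
To make the sampling behavior concrete, here is a minimal sketch of percentage-based random selection at the default 10% rate. The `should_evaluate` helper is hypothetical and only illustrates the idea; it is not part of the Keywords AI API.

```python
import random

def should_evaluate(sample_rate: float = 0.10) -> bool:
    """Return True if this API call is selected for evaluation.

    sample_rate is the fraction of calls to evaluate (default 10%).
    """
    return random.random() < sample_rate

# Example: out of 1,000 simulated calls, roughly 100 are selected.
selected = sum(should_evaluate() for _ in range(1_000))
print(f"{selected} of 1000 calls selected for evaluation")
```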

Evaluation metrics

Keywords AI provides the following categories of evaluation metrics:

1. Retrieval metrics:

  • Context precision: measures the information density of the retrieved contexts (see more).

2. Text generation metrics:

  • Faithfulness: measures how grounded the answer is in the retrieved contexts (see more).
  • Flesch–Kincaid readability: measures the readability of the generated text (see more); a formula sketch follows this list.
  • Relevance: measures the relevance of the generated text to the query (see more).

3. Other metrics:

  • Sentiment: measures the sentiment of the generated text (see more).
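
To illustrate the readability metric referenced above, the sketch below computes the standard Flesch–Kincaid grade level: 0.39 × (words/sentences) + 11.8 × (syllables/word) − 15.59. It is a self-contained example with a rough heuristic syllable counter, not the exact implementation used by Continuous Eval or Keywords AI.

```python
import re

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    """
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    word_count = max(1, len(words))

    def count_syllables(word: str) -> int:
        # Rough heuristic: count vowel groups, treating a trailing 'e' as silent.
        groups = re.findall(r"[aeiouy]+", word.lower())
        if word.lower().endswith("e") and len(groups) > 1:
            return len(groups) - 1
        return max(1, len(groups))

    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (word_count / sentences) + 11.8 * (syllables / word_count) - 15.59

# Simple text scores near or below grade 0; denser prose scores higher.
print(flesch_kincaid_grade("The cat sat on the mat. It was a sunny day."))
```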

Run evaluation manually

You can run the evaluation manually by clicking the Run button in the side panel of the Requests page.