Keywords AI provides two main types of observability: Traces for agent workflows and Logs for LLM calls, plus a comprehensive Evaluation framework.

Traces vs Logs

  • Logs - LLM call logging: individual requests and responses to language models
  • Traces - Agent tracking: complete workflows with multiple steps

Logs

Logs record individual LLM API calls and responses (an illustrative sketch of a log record follows the field lists below).

What logs capture:
  • Single request/response pairs
  • Token usage and costs
  • Model performance metrics
  • Error tracking
  • Response latency

Log types:
  • Text
  • Chat
  • Completion
  • Response
  • Embedding
  • Transcription
  • Speech
  • Workflow
  • Task
  • Tool
  • Agent
  • Handoff
  • Guardrail
  • Function
  • Custom
  • Generation

Log table fields:
  • Status (Error, Success)
  • Time
  • Trace ID
  • Workflow name
  • Customer ID
  • Span count
  • Error count
  • Input token count
  • Output token count
  • Total tokens
  • Duration
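For reference, a single log can be pictured roughly as the record below. This is a minimal illustrative sketch built from the fields listed above; the exact field names and payload shape in the Keywords AI API may differ.

```python
from dataclasses import dataclass

@dataclass
class LogRecord:
    """Illustrative shape of one LLM call log (field names assumed from the list above)."""
    status: str             # "success" or "error"
    time: str               # ISO 8601 timestamp of the request
    trace_id: str           # links the log to its parent trace, if any
    workflow_name: str      # workflow the call belongs to
    customer_id: str        # your end user's identifier
    type: str               # one of the log types above, e.g. "chat", "embedding", "tool"
    span_count: int         # number of spans under this entry
    error_count: int        # number of failed spans
    input_token_count: int
    output_token_count: int
    total_tokens: int       # input + output
    duration: float         # end-to-end latency in seconds

log = LogRecord(
    status="success",
    time="2025-01-01T12:00:00Z",
    trace_id="trace_123",
    workflow_name="support-bot",
    customer_id="customer_42",
    type="chat",
    span_count=1,
    error_count=0,
    input_token_count=350,
    output_token_count=120,
    total_tokens=470,
    duration=1.8,
)
```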

Traces

Traces capture the full execution flow of agent workflows (a structural sketch follows under Trace Structure below).

What traces capture:
  • Multi-step agent processes
  • Tool calls and function executions
  • Decision-making steps
  • Hierarchical workflow visualization
  • Agent reasoning and planning

Trace table fields:
  • Thread ID
  • Customer ID
  • API key
  • Input tokens
  • Output tokens
  • All tokens
  • Cost
  • Generation
  • Number of Rows

Trace Data Model

Trace Structure
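As a rough mental model, a trace is a tree of spans: a root workflow span containing task, tool, agent, and generation spans, mirroring the span types listed in the Logs section. The sketch below is illustrative only; the field names (name, span_type, duration, children) are assumptions, not the platform's exact schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Span:
    """One step inside a trace; span types mirror the types above (workflow, task, tool, agent, ...)."""
    name: str
    span_type: str                              # e.g. "workflow", "task", "tool", "agent", "generation"
    duration: float                             # seconds
    children: List["Span"] = field(default_factory=list)

# A trace for a simple agent run: workflow -> task -> (tool call, LLM generation)
trace = Span(
    name="customer-support-run",
    span_type="workflow",
    duration=4.2,
    children=[
        Span(
            name="answer-question",
            span_type="task",
            duration=3.9,
            children=[
                Span(name="search_kb", span_type="tool", duration=0.6),
                Span(name="draft_reply", span_type="generation", duration=2.8),
            ],
        )
    ],
)
```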

Multi-Trace Workflow Grouping
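One way to picture multi-trace grouping: several traces that belong to the same conversation or workflow share a common identifier (for example, the thread ID listed in the trace fields above) and can be rolled up under it. The snippet below is a generic illustration using assumed fields (trace_id, thread_id, cost); it is not a description of the platform's actual grouping mechanism.

```python
from collections import defaultdict

# Hypothetical trace summaries; in practice these would come from the traces table above.
traces = [
    {"trace_id": "t1", "thread_id": "thread_a", "cost": 0.012},
    {"trace_id": "t2", "thread_id": "thread_a", "cost": 0.009},
    {"trace_id": "t3", "thread_id": "thread_b", "cost": 0.021},
]

# Group traces that share a thread ID into one workflow view.
grouped = defaultdict(list)
for t in traces:
    grouped[t["thread_id"]].append(t)

for thread_id, members in grouped.items():
    total_cost = sum(t["cost"] for t in members)
    print(f"{thread_id}: {len(members)} traces, ${total_cost:.3f} total")
```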

Organization Data Model

Single Team Setup (Initial Onboarding)

Multi-Team Setup (After Scaling)

User Journey

User Data

User data provides insights into customer usage patterns and activity (an aggregation sketch follows this list).
  • Customer ID
  • Name
  • Email
  • Requests
  • Total tokens
  • Total cost
  • Active for
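Conceptually, these user metrics can be thought of as aggregates over a customer's logs, keyed by customer ID. The sketch below is illustrative only; the field names and the assumption that user rows are derived from log summaries are not confirmed by the source.

```python
# Hypothetical per-call log summaries for one customer ID.
logs = [
    {"customer_id": "customer_42", "total_tokens": 470, "cost": 0.0031},
    {"customer_id": "customer_42", "total_tokens": 910, "cost": 0.0064},
]

user_row = {
    "customer_id": "customer_42",
    "requests": len(logs),
    "total_tokens": sum(l["total_tokens"] for l in logs),
    "total_cost": round(sum(l["cost"] for l in logs), 4),
}
print(user_row)  # {'customer_id': 'customer_42', 'requests': 2, 'total_tokens': 1380, 'total_cost': 0.0095}
```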

Testsets

Testsets allow systematic evaluation of model performance.

Testset table fields:
  • Testset ID
  • Name
  • Created At
  • Updated At
  • Row Count
  • Column Count
  • Starred

Evaluation Framework

The evaluation system helps you assess and improve LLM performance through systematic testing.

Testsets

Curated collections of examples for evaluation (a row-level sketch follows this list).
  • Input/output pairs
  • Expected responses
  • Evaluation criteria
  • Test case metadata
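A testset can be pictured as a small table of rows, each pairing an input with an expected response plus test-case metadata. The structure below is a hedged sketch; the actual column layout is up to you, and the exact schema may differ.

```python
# Illustrative testset: each row is an input/expected-output pair with metadata.
testset = {
    "name": "refund-policy-qa",
    "rows": [
        {
            "input": "Can I get a refund after 30 days?",
            "expected_output": "Refunds are available within 30 days of purchase.",
            "metadata": {"category": "billing", "difficulty": "easy"},
        },
        {
            "input": "How do I cancel my subscription?",
            "expected_output": "You can cancel anytime from the billing settings page.",
            "metadata": {"category": "billing", "difficulty": "easy"},
        },
    ],
}
print(len(testset["rows"]), "rows")
```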

Evaluators

Tools that assess LLM output quality (a rule-based example follows the list below).
Types of evaluators:
  • LLM Evaluators: AI-powered assessment
  • Human Evaluators: Manual review
  • Rule-based: Automated validation
  • Custom Metrics: Domain-specific scoring
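As a concrete example of the rule-based category, the sketch below scores an output with a simple automated check. It is a generic illustration, not the platform's evaluator interface.

```python
def contains_required_terms(output: str, required_terms: list[str]) -> dict:
    """Rule-based evaluator: pass only if every required term appears in the model output."""
    missing = [t for t in required_terms if t.lower() not in output.lower()]
    return {
        "score": 1.0 if not missing else 0.0,   # pass/fail expressed as a number
        "passed": not missing,
        "feedback": "ok" if not missing else f"missing terms: {missing}",
    }

result = contains_required_terms(
    output="Refunds are available within 30 days of purchase.",
    required_terms=["refund", "30 days"],
)
print(result)  # {'score': 1.0, 'passed': True, 'feedback': 'ok'}
```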

Experiments

Comparative testing of different configurations (a comparison sketch follows this list).
  • A/B testing of prompts
  • Model comparisons
  • Performance benchmarking
  • Cost analysis
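A prompt A/B test can be sketched as running each candidate configuration over the same testset rows and comparing aggregate scores and cost. The outline below is generic; call_model and evaluator are hypothetical stand-ins for whatever client and scoring function you use.

```python
def run_experiment(prompts: dict, rows: list, call_model, evaluator) -> dict:
    """Compare prompt variants on the same testset rows (generic sketch)."""
    results = {}
    for variant, prompt in prompts.items():
        scores, cost = [], 0.0
        for row in rows:
            output, call_cost = call_model(prompt, row["input"])  # hypothetical client call
            scores.append(evaluator(output, row["expected_output"]))
            cost += call_cost
        results[variant] = {
            "avg_score": sum(scores) / len(scores),
            "total_cost": cost,
        }
    return results

# Usage with stubbed model and evaluator:
prompts = {"A": "Answer concisely.", "B": "Answer with step-by-step reasoning."}
rows = [{"input": "2+2?", "expected_output": "4"}]
fake_model = lambda prompt, question: ("4", 0.0001)
exact_match = lambda out, expected: 1.0 if out.strip() == expected else 0.0
print(run_experiment(prompts, rows, fake_model, exact_match))
```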

Scores

Quantitative and qualitative assessment results (a record-level sketch follows this list).

Score value types:
  • Numerical values: Ratings on scales (1-5, 1-10, percentages)
  • Boolean values: Pass/fail assessments
  • String values: Textual feedback and classifications

Score properties:
  • Unique per evaluator per log
  • Includes creation timestamp
  • Can be managed via API
  • Automatically enriches log data

Score relationships:
  • Connected to specific logs via log_id
  • Associated with evaluators via evaluator_id or evaluator_slug
  • Can be part of testsets via dataset_id
  • Provides context through type and environment fields
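Putting the attributes above together, a score can be pictured as the record below. The field names follow the list above (log_id, evaluator_id/evaluator_slug, dataset_id, type, environment); the exact value-field names and API shape are assumptions, so treat this as a sketch.

```python
# Illustrative score record assembled from the attributes listed above.
score = {
    "log_id": "log_9f2c",              # the log this score enriches
    "evaluator_id": "eval_123",        # or identify the evaluator via "evaluator_slug"
    "dataset_id": None,                # set when the score belongs to a testset
    "type": "llm",                     # e.g. LLM, human, or rule-based evaluation
    "environment": "production",
    "created_at": "2025-01-01T12:00:05Z",
    # Exactly one value kind per score: numerical, boolean, or string.
    "numerical_value": 4.5,
    "boolean_value": None,
    "string_value": None,
}
```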

Data Flow

Important terminology notes:
  • “testsets” is one word in our product
  • A “log” is actually a type of “trace” in our system

Score Integration

Scores are integrated throughout the Keywords AI platform (see the API sketch below):
  • Scores automatically appear in log details under the scores field
  • Scores provide quantitative metrics for evaluating model performance
  • Create, retrieve, update, and delete scores programmatically
  • Filter logs and test results based on score values
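Since scores can be created, retrieved, updated, and deleted programmatically, a request might look roughly like the sketch below. The endpoint path and payload fields here are assumptions for illustration; check the Keywords AI API reference for the actual routes and schema.

```python
import requests

API_BASE = "https://api.keywordsai.co"      # base URL assumption; confirm in the API reference
API_KEY = "YOUR_KEYWORDSAI_API_KEY"

# Hypothetical "create score" call: attach an evaluator result to an existing log.
response = requests.post(
    f"{API_BASE}/api/scores/",              # illustrative path, not confirmed
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "log_id": "log_9f2c",
        "evaluator_slug": "helpfulness",
        "numerical_value": 4.5,
    },
    timeout=10,
)
response.raise_for_status()
print(response.json())
```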