Keywords AI provides two main types of observability data, Traces for agent workflows and Logs for individual LLM calls, along with a comprehensive Evaluation framework.

Traces vs Logs

  • Logs: LLM call logging - individual requests and responses to language models
  • Traces: Agent tracking - complete workflows with multiple steps

Logs

Logs record individual LLM API calls and responses; a minimal logging sketch follows the list below.
  • Single request/response pairs
  • Token usage and costs
  • Model performance metrics
  • Error tracking
  • Response latency

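For illustration, here is a minimal sketch of what capturing one such log record from application code might look like. The endpoint URL, field names, and the log_llm_call helper are assumptions made for this example, not the documented Keywords AI API.

```python
import time
import requests  # any HTTP client works; requests is used here for brevity

# Hypothetical logging endpoint and payload shape, shown only to illustrate
# the kind of data a single LLM-call log carries.
LOGGING_ENDPOINT = "https://example.com/api/request-logs"  # placeholder URL

def log_llm_call(model, prompt, completion, prompt_tokens, completion_tokens,
                 latency_s, cost_usd, error=None):
    """Send one request/response pair plus its metrics as a log record."""
    record = {
        "model": model,
        "prompt": prompt,                      # single request
        "completion": completion,              # single response
        "prompt_tokens": prompt_tokens,        # token usage
        "completion_tokens": completion_tokens,
        "cost_usd": cost_usd,                  # cost tracking
        "latency_s": latency_s,                # response latency
        "error": error,                        # error tracking
        "timestamp": time.time(),
    }
    requests.post(LOGGING_ENDPOINT, json=record, timeout=5)
```
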
Traces

Traces capture the full execution flow of agent workflows; see the tracing sketch after this list.
  • Multi-step agent processes
  • Tool calls and function executions
  • Decision-making steps
  • Hierarchical workflow visualization
  • Agent reasoning and planning

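To show how a trace differs from a flat log, the sketch below builds a small hierarchical span tree around an agent's planning step, tool call, and LLM call. The span helper and attribute names are illustrative assumptions, not a specific SDK.

```python
import time
import uuid
from contextlib import contextmanager

# Minimal span tree: each span records a named step, its parent, and timing,
# so the whole agent run can be visualized hierarchically.
trace = []

@contextmanager
def span(name, parent_id=None, **attributes):
    record = {"id": str(uuid.uuid4()), "parent_id": parent_id,
              "name": name, "attributes": attributes, "start": time.time()}
    trace.append(record)
    try:
        yield record
    finally:
        record["end"] = time.time()

# One agent run: planning, a tool call, and an LLM call nested under a root span.
with span("agent_run", goal="answer user question") as root:
    with span("plan", parent_id=root["id"], reasoning="decide which tool to use"):
        pass
    with span("tool_call", parent_id=root["id"], tool="web_search", query="refund policy"):
        pass
    with span("llm_call", parent_id=root["id"], model="gpt-4o", tokens=512):
        pass
```
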
Evaluation Framework

The evaluation system helps you assess and improve LLM performance through systematic testing.

Test Sets

Curated collections of examples for evaluation; a sample test-set structure is shown after this list.
  • Input/output pairs
  • Expected responses
  • Evaluation criteria
  • Test case metadata

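A test set is essentially structured data. The shape below is one illustrative way to express the fields listed above; it is not a required schema.

```python
# Illustrative test-set entries: input/output pairs, expected responses,
# evaluation criteria, and per-case metadata.
test_set = [
    {
        "input": "Summarize the refund policy in one sentence.",
        "expected_output": "Refunds are available within 30 days of purchase.",
        "criteria": ["factually consistent with policy", "single sentence"],
        "metadata": {"category": "support", "difficulty": "easy"},
    },
    {
        "input": "Translate 'good morning' to French.",
        "expected_output": "Bonjour",
        "criteria": ["exact match"],
        "metadata": {"category": "translation", "difficulty": "easy"},
    },
]
```
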
Evaluators

Tools that assess LLM output quality; a brief evaluator sketch follows the list below.
Types of evaluators:
  • LLM Evaluators: AI-powered assessment
  • Human Evaluators: Manual review
  • Rule-based: Automated validation
  • Custom Metrics: Domain-specific scoring

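The sketch below illustrates two of these evaluator types, a rule-based check and an LLM evaluator stub. The grading prompt, the 1-5 scale, and the judge callable are assumptions made for this example.

```python
def rule_based_evaluator(output: str, expected: str) -> bool:
    """Automated validation: pass/fail based on an exact-match rule."""
    return output.strip().lower() == expected.strip().lower()

def llm_evaluator(output: str, criteria: list[str], judge) -> int:
    """AI-powered assessment: ask a judge model to score the output from 1 to 5.

    `judge` is any callable that takes a prompt string and returns the model's text.
    """
    prompt = (
        "Rate the following output from 1 (poor) to 5 (excellent) "
        f"against these criteria: {', '.join(criteria)}.\n\n"
        f"Output:\n{output}\n\nReply with a single digit."
    )
    return int(judge(prompt).strip())
```
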
Experiments

Comparative testing of different configurations; an A/B testing sketch follows the list below.
  • A/B testing of prompts
  • Model comparisons
  • Performance benchmarking
  • Cost analysis

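One way to picture an experiment is an A/B run of two prompt variants over the same test set, aggregating average quality and total cost. The run_experiment helper below is a sketch that assumes the test-set and evaluator shapes from the earlier examples.

```python
def run_experiment(variants, test_set, call_llm, evaluate):
    """Compare prompt variants on the same test set.

    `call_llm(prompt)` returns (output, cost_usd);
    `evaluate(output, case)` returns a numeric score for one test case.
    """
    results = {}
    for name, prompt_template in variants.items():
        scores, total_cost = [], 0.0
        for case in test_set:
            output, cost = call_llm(prompt_template.format(input=case["input"]))
            scores.append(evaluate(output, case))
            total_cost += cost
        results[name] = {
            "avg_score": sum(scores) / len(scores),  # performance benchmark
            "total_cost_usd": total_cost,            # cost analysis
        }
    return results

# Usage: A/B test two prompt phrasings against the same cases.
# variants = {"A": "Answer briefly: {input}", "B": "Answer step by step: {input}"}
# report = run_experiment(variants, test_set, call_llm, evaluate)
```
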
Scores

Quantitative and qualitative assessment results; a sample score record is shown after this list.
  • Numeric ratings (1-5, 1-10)
  • Boolean pass/fail
  • Categorical classifications
  • Comments and feedback

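All of these result types can be carried by one small, uniform record. The Score dataclass below is an illustrative shape, not a prescribed format.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class Score:
    """One evaluation result: numeric, boolean, or categorical, plus feedback."""
    evaluator: str
    value: Union[float, bool, str]   # e.g. 4.0, True/False, or "friendly"
    scale: Optional[str] = None      # e.g. "1-5", "pass/fail"
    comment: Optional[str] = None    # reviewer feedback

# Examples of each result type from the list above:
numeric = Score("helpfulness", 4.0, scale="1-5")
boolean = Score("format_check", True, scale="pass/fail")
categorical = Score("tone", "friendly", comment="Matches brand voice.")
```
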
Data Flow