Traces vs Logs
- Logs (LLM call logging): individual requests and responses to language models
- Traces (agent tracking): complete workflows with multiple steps
Logs
Logs record individual LLM API calls and responses (a minimal logging sketch follows this list):
- Single request/response pairs
- Token usage and costs
- Model performance metrics
- Error tracking
- Response latency
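As a sketch of what a single log captures, the snippet below posts one request/response pair with token counts and latency. The endpoint path, field names, and API key are illustrative assumptions rather than confirmed API details; check the API reference for the exact schema.

```python
import requests

# Illustrative sketch only: the endpoint path and field names are
# assumptions, not confirmed Keywords AI API details.
KEYWORDSAI_API_KEY = "sk-..."  # placeholder

log_payload = {
    "model": "gpt-4o",  # which model served the call
    "prompt_messages": [{"role": "user", "content": "Summarize this doc."}],
    "completion_message": {"role": "assistant", "content": "The doc covers..."},
    "prompt_tokens": 42,        # input token count
    "completion_tokens": 128,   # output token count
    "latency": 0.93,            # response latency in seconds
    "status_code": 200,         # used for error tracking
}

resp = requests.post(
    "https://api.keywordsai.co/api/request-logs/create/",  # assumed endpoint
    headers={"Authorization": f"Bearer {KEYWORDSAI_API_KEY}"},
    json=log_payload,
)
resp.raise_for_status()
```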
Log Parameters
Content Types
- Text
- Chat
- Completion
- Response
- Embedding
- Transcription
- Speech
- Workflow
- Task
- Tool
- Agent
- Handoff
- Guardrail
- Function
- Custom
- Generation
Filtering Metrics
- Status (Error, Success)
- Time
- Trace ID
- Workflow name
- Customer ID
- Span count
- Error count
- Input token count
- Output token count
- Total tokens
- Duration
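The metrics above can be combined into a single query. The sketch below is hypothetical: the list endpoint and the filter schema are assumptions for illustration, not a documented request shape.

```python
import requests

# Hypothetical filter payload combining the metrics listed above; the
# endpoint and schema are assumptions for illustration.
filters = {
    "status": "error",                  # Status (Error, Success)
    "customer_identifier": "user_123",  # Customer ID
    "total_tokens": {"gte": 1000},      # Total tokens at or above a threshold
    "duration": {"lte": 5.0},           # Duration under five seconds
}

resp = requests.post(
    "https://api.keywordsai.co/api/request-logs/list/",  # assumed endpoint
    headers={"Authorization": "Bearer sk-..."},
    json={"filters": filters, "page_size": 50},
)
logs = resp.json()
```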
Traces
Traces capture the full execution flow of agent workflows (a tracing sketch follows this list):
- Multi-step agent processes
- Tool calls and function executions
- Decision-making steps
- Hierarchical workflow visualization
- Agent reasoning and planning
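A decorator-style tracing SDK is a common way to capture this kind of workflow. The sketch below assumes such a surface; the package name, decorators, and initializer are assumptions about the SDK, not confirmed imports.

```python
# Hypothetical SDK surface: the package name, decorators, and initializer
# below are assumptions for illustration, not confirmed imports.
from keywordsai_tracing import KeywordsAITelemetry, workflow, task

telemetry = KeywordsAITelemetry()  # hypothetical initializer

@task(name="search_docs")  # recorded as a child span of the workflow
def search_docs(query: str) -> list[str]:
    return ["doc1", "doc2"]

@task(name="draft_answer")
def draft_answer(query: str, docs: list[str]) -> str:
    return f"Answer to {query!r} based on {len(docs)} docs"

@workflow(name="support_agent")  # the top-level trace for the whole run
def support_agent(query: str) -> str:
    docs = search_docs(query)    # tool call captured inside the trace
    return draft_answer(query, docs)

support_agent("How do I rotate my API key?")
```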
Trace Parameters
Filtering Metrics
- Thread ID
- Customer ID
- API key
- Input tokens
- Output tokens
- All tokens
- Cost
- Generation
- Number of Rows
Trace Data Model
Trace Structure
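The accompanying diagram is not reproduced here. As an illustrative stand-in, a trace can be pictured as nested spans, with span types drawn from the Content Types list above; the field names are assumptions.

```python
# Illustrative only: a hierarchical trace rendered as nested spans, using
# span types from the Content Types list. Field names are assumptions.
trace = {
    "trace_id": "tr_abc123",
    "type": "workflow",  # top-level agent workflow
    "name": "support_agent",
    "spans": [
        {"type": "task", "name": "plan", "duration": 0.4},
        {
            "type": "tool",  # a tool call made by the agent
            "name": "search_docs",
            "duration": 1.2,
            "spans": [
                {"type": "generation", "name": "rerank", "duration": 0.6},
            ],
        },
        {"type": "generation", "name": "draft_answer", "duration": 2.1},
    ],
}
```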
Multi-Trace Workflow Grouping
Organization Data Model
Single Team Setup (Initial Onboarding)
Multi-Team Setup (After Scaling)
User Journey
User Data
User data provides insights into customer usage patterns and activity; a sketch of tagging calls with a customer ID follows the parameter list below.
User Parameters
- Customer ID
- Name
- Requests
- Total tokens
- Total cost
- Active for
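To populate these per-user metrics, each logged call needs a customer identifier attached. The field names below (customer_identifier, customer_params) are assumptions for illustration; the payload would go to the same kind of logging endpoint sketched earlier.

```python
# Sketch: tagging a logged call with a customer ID so usage rolls up into
# per-user requests, tokens, and cost. Field names are assumptions.
log_payload = {
    "model": "gpt-4o",
    "prompt_messages": [{"role": "user", "content": "Hi"}],
    "completion_message": {"role": "assistant", "content": "Hello!"},
    "customer_identifier": "user_123",            # Customer ID in User data
    "customer_params": {"name": "Ada Lovelace"},  # optional display name
}
```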
Testsets
Testsets allow systematic evaluation of model performance
Testset Parameters
- Testset ID
- Name
- Created At
- Updated At
- Row Count
- Column Count
- Starred
Evaluation Framework
The evaluation system helps you assess and improve LLM performance through systematic testing.
Testsets
Curated collections of examples for evaluation (illustrative rows follow this list):
- Input/output pairs
- Expected responses
- Evaluation criteria
- Test case metadata
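A couple of illustrative rows show how these pieces fit together; the column names here are assumptions, not a prescribed schema.

```python
# Illustrative testset rows: input/output pairs with expected responses,
# evaluation criteria, and metadata. Column names are assumptions.
testset_rows = [
    {
        "input": "What is the capital of France?",
        "expected_output": "Paris",
        "criteria": "exact_match",  # evaluation criteria for this case
        "metadata": {"topic": "geography", "difficulty": "easy"},
    },
    {
        "input": "Summarize the refund policy.",
        "expected_output": "Refunds are available within 30 days.",
        "criteria": "semantic_similarity",
        "metadata": {"topic": "support"},
    },
]
```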
Evaluators
Tools that assess LLM output quality (a rule-based example follows this list):
- LLM Evaluators: AI-powered assessment
- Human Evaluators: Manual review
- Rule-based: Automated validation
- Custom Metrics: Domain-specific scoring
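Of the four, rule-based evaluators are the simplest to picture. The function below is a minimal example of automated validation returning a pass/fail result; it is a generic sketch, not a platform API.

```python
# A minimal rule-based evaluator: automated validation that returns a
# pass/fail (boolean) score.
def contains_required_terms(output: str, required: list[str]) -> bool:
    """Pass only if every required term appears in the model output."""
    lowered = output.lower()
    return all(term.lower() in lowered for term in required)

passed = contains_required_terms(
    "Refunds are available within 30 days of purchase.",
    required=["refund", "30 days"],
)
print(passed)  # True
```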
Experiments
Comparative testing of different configurations (an A/B sketch follows this list):
- A/B testing of prompts
- Model comparisons
- Performance benchmarking
- Cost analysis
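As a sketch of A/B prompt testing, the snippet below runs two prompt variants over the same rows and compares pass rates; run_model is a hypothetical stand-in for your actual LLM call.

```python
# Sketch of an A/B prompt experiment. `run_model` is a hypothetical
# stand-in for a real LLM call.
def run_model(prompt_template: str, row: dict) -> str:
    raise NotImplementedError("call your LLM provider here")

def pass_rate(prompt_template: str, rows: list[dict]) -> float:
    """Fraction of rows whose output matches the expected answer."""
    hits = [run_model(prompt_template, r) == r["expected_output"] for r in rows]
    return sum(hits) / len(hits)

variant_a = "Answer concisely: {input}"
variant_b = "Think step by step, then answer: {input}"
# Compare pass_rate(variant_a, rows) against pass_rate(variant_b, rows).
```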
Scores
Quantitative and qualitative assessment results
Score Types
- Numerical values: Ratings on scales (1-5, 1-10, percentages)
- Boolean values: Pass/fail assessments
- String values: Textual feedback and classifications
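The records below illustrate one value of each type; the field names are assumptions.

```python
# Illustrative score values, one per type listed above. Field names are
# assumptions.
scores = [
    {"type": "numerical", "value": 4, "scale": "1-5"},    # rating on a scale
    {"type": "numerical", "value": 0.87},                 # percentage (0-1)
    {"type": "boolean", "value": True},                   # pass/fail
    {"type": "string", "value": "Accurate but verbose"},  # textual feedback
]
```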
Score Properties
- Unique per evaluator per log
- Includes creation timestamp
- Can be managed via API
- Automatically enriches log data
Score Relationships
- Connected to specific logs via log_id
- Associated with evaluators via evaluator_id or evaluator_slug
- Can be part of test sets via dataset_id
- Provides context through type and environment fields
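Putting those relationships together, a single score record might look like the sketch below. Only the field names from the list above (log_id, evaluator_id, evaluator_slug, dataset_id, type, environment) come from the text; the remaining fields and all values are illustrative.

```python
# Illustrative score record. Field names from the list above are real;
# the values and extra fields are made up for illustration.
score = {
    "log_id": "log_9f2c",       # the log this score enriches
    "evaluator_id": "ev_71aa",  # or "evaluator_slug": "accuracy-check"
    "dataset_id": "ds_0042",    # present when the score belongs to a test set
    "type": "numerical",
    "environment": "production",
    "value": 4,
    "created_at": "2025-01-15T10:30:00Z",  # creation timestamp
}
```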
Data Flow
Important terminology notes:
- “testsets” is one word in our product
- A “log” is actually a type of “trace” in our system
Score Integration
Scores are integrated throughout the Keywords AI platform:
Log Enrichment
Scores automatically appear in log details under the scores field.
Evaluation Results
Scores provide quantitative metrics for evaluating model performance
API Access
Create, retrieve, update, and delete scores programmatically
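The sketch below shows the four operations end to end. The endpoint paths and response shape are assumptions, not confirmed API routes; consult the API reference for the real ones.

```python
import requests

# Hypothetical CRUD sketch: endpoint paths and response fields are
# assumptions, not confirmed Keywords AI API routes.
BASE = "https://api.keywordsai.co/api"
HEADERS = {"Authorization": "Bearer sk-..."}

# Create a score on a log
payload = {"log_id": "log_9f2c", "evaluator_slug": "accuracy-check", "value": 4}
created = requests.post(f"{BASE}/scores/", headers=HEADERS, json=payload).json()

# Retrieve, update, and delete it
score_id = created["id"]  # assumed response field
requests.get(f"{BASE}/scores/{score_id}/", headers=HEADERS)
requests.patch(f"{BASE}/scores/{score_id}/", headers=HEADERS, json={"value": 5})
requests.delete(f"{BASE}/scores/{score_id}/", headers=HEADERS)
```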
Filtering
Filter logs and test results based on score values
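For example, a client-side pass over fetched logs could keep only those with a low numerical score; the scores field name comes from the Log Enrichment section above, while the record shape is assumed.

```python
# Sketch: client-side filtering of fetched logs by score value. The
# `scores` field name comes from Log Enrichment; the shape is assumed.
def logs_with_low_scores(logs: list[dict], threshold: float) -> list[dict]:
    """Keep logs that have at least one numerical score below threshold."""
    return [
        log for log in logs
        if any(
            s["type"] == "numerical" and s["value"] < threshold
            for s in log.get("scores", [])
        )
    ]
```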