Overview
Keywords AI’s observability is built around one core concept: different views of the same underlying data. All views present the same log data, just organized differently for different use cases:Logs
Plain view - Individual LLM requests as they happen
Traces
Hierarchical view - Multi-step workflows and AI agent operations
Threads
Conversational view - Linear chat interface for dialogue systems
Scores
Evaluation view - Quality assessments and performance metrics
All four views show the same underlying data - they just organize and present it differently. The core data structure remains consistent across all views.
Logs
A log represents a single LLM request and contains all the information about that interaction. This is the foundational data that powers all other views.
Core fields
Request data
Request data
- Prompt: The input sent to the LLM
- Model: Which LLM model was used
- Parameters: Temperature, max tokens, etc.
Response data
Response data
- Response: The output generated by the LLM
- Tokens: Input, output, and total token counts
- Finish reason: How the generation ended
Metadata
Metadata
- Cost: Calculated cost for the request
- Latency: Response time in milliseconds
- Timestamp: When the request was made
- Status: Success, error, or other states
- Customer ID: User identifier
- API key: Which key was used
Log types
All logs are categorized by type to help with filtering and organization:- LLM inference
- Workflow & agent
- Advanced
text
: Basic text completion requestschat
: Conversational chat completions (most common)completion
: Legacy completion formatresponse
: Response API callsembedding
: Vector embedding generationtranscription
: Speech-to-text conversionspeech
: Text-to-speech generation
Example log structure
Traces
Traces organize the same log data into hierarchical workflows, perfect for complex AI agent operations and multi-step processes.
Trace structure
Key trace fields
trace_unique_id
: Groups all spans in the same workflowspan_unique_id
: Individual span identifier (maps to log ID)span_parent_id
: Creates the hierarchical structurespan_name
: Descriptive name for the operationspan_workflow_name
: Root workflow identifier
Multi-trace grouping
Complex workflows can span multiple traces usingtrace_group_identifier
:
Threads
Threads organize the same log data in a conversational format, ideal for chat applications and dialogue systems.
Thread structure
- Thread ID: Unique identifier for the conversation
- Messages: Ordered sequence of user and assistant messages (each message maps to log entries)
- Context: Maintains conversation history and state
Notice how each message in the thread references a
log_id
- this shows how threads are just a different presentation of the same underlying log data.Scores
Scores organize the same log data with evaluation metrics and quality assessments, perfect for monitoring LLM performance and conducting evaluations.
Score structure
Scores are linked to logs through thelog_id
field and can be created by two types of evaluators:
- Platform evaluators: Use
evaluator_id
(UUID from Keywords AI platform) - Custom evaluators: Use
evaluator_slug
(your custom string identifier)
Key score fields
id
: Unique score identifierlog_id
: Links the score to its corresponding log entryevaluator_id
: UUID of Keywords AI platform evaluator (optional)evaluator_slug
: Custom evaluator identifier (optional)is_passed
: Whether the evaluation passed defined criteriacost
: Cost of running the evaluationcreated_at
: When the score was created
Each evaluator can only have one score per log, ensuring data integrity and preventing duplicate evaluations.
Score value types
Scores support four different value types based on the evaluator’sscore_value_type
:
numerical_value
: Numeric scores (e.g., 4.5, 0.85, 92) used whenscore_value_type
is"numerical"
- for rating scales, confidence scores, and similarity metricsboolean_value
: True/false evaluations used whenscore_value_type
is"boolean"
- for pass/fail checks and binary classificationscategorical_value
: Array of predefined categories (e.g., [“coherent”, “relevant”, “helpful”]) used whenscore_value_type
is"categorical"
- for multi-label classifications and tag systemsstring_value
: Free-form text feedback used whenscore_value_type
is"comment"
- for detailed comments and explanations