What is Experiments?
Experiments lets you run repeatable evaluations over a dataset and inspect the results as traces/logs (outputs, evaluator scores, costs, latency). The main decision is which workflow type you want:
- Prompt workflow: render a saved prompt template with dataset variables, then run LLM calls automatically
- Completion workflow: run direct LLM completions on dataset messages automatically (no prompt templates)
- Custom workflow: you fetch inputs, run your own code/model, and submit outputs back for evaluator execution
This page is a tutorial. For full endpoint specs, jump to the linked API reference pages inside each step.
Steps to use
The steps below walk through the Prompt workflow end to end; the Completion and Custom workflows follow the same experiment lifecycle with a different workflow config:
- Prompt workflow
- Completion workflow
- Custom workflow
Step 1: Prepare a dataset with variables
Dataset entries must include the variables your prompt template expects (for example: name, issue, order_id).
Reference: Datasets: create / add logs
Dataset format (prompt workflow)
Your dataset log input should be an object whose keys match your prompt version's variables schema:
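A minimal sketch of a matching dataset log, assuming the prompt version declares the variables name, issue, and order_id. The base URL, endpoint path, and payload field names below are placeholders, not the documented schema; check the Datasets API reference linked above for the real shapes.

```python
import requests

API_BASE = "https://api.example.com/v1"   # placeholder base URL
HEADERS = {"Authorization": "Bearer <API_KEY>"}

# One dataset log per evaluation case. The `input` keys must match the
# variables declared on the prompt version you deploy in Step 2.
dataset_log = {
    "input": {
        "name": "Ada",
        "issue": "My order arrived damaged",
        "order_id": "ORD-1042",
    }
}

# Hypothetical "add logs to a dataset" call; the real path and body are in
# the Datasets API reference.
requests.post(
    f"{API_BASE}/datasets/<dataset_id>/logs",
    headers=HEADERS,
    json={"logs": [dataset_log]},
)
```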
Step 2: Create a prompt + deploy a version
Create a prompt template, create a version (with variables), and deploy it so it can be used in experiments.
Reference: Prompts API
What “deployed” means (important)
Prompt workflows need a readonly/deployed version so the experiment run is reproducible. A common pattern (see the sketch below) is:
- Create version 1
- Create version 2 (locks version 1 as readonly)
- Deploy version 1
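A rough sketch of that create/lock/deploy pattern. The endpoint paths, template syntax, and field names are assumptions for illustration only; consult the Prompts API reference for the real payloads.

```python
import requests

API_BASE = "https://api.example.com/v1"   # placeholder base URL
HEADERS = {"Authorization": "Bearer <API_KEY>"}

# 1. Create the prompt template (name/fields are illustrative).
prompt = requests.post(
    f"{API_BASE}/prompts", headers=HEADERS, json={"name": "support-reply"}
).json()

# 2. Create version 1, declaring the variables the dataset supplies.
v1 = requests.post(
    f"{API_BASE}/prompts/{prompt['id']}/versions",
    headers=HEADERS,
    json={
        "messages": [{
            "role": "user",
            "content": "Customer {{name}} reports: {{issue}} (order {{order_id}})",
        }],
        "variables": ["name", "issue", "order_id"],
    },
).json()

# 3. Create version 2, which locks version 1 as readonly.
requests.post(
    f"{API_BASE}/prompts/{prompt['id']}/versions",
    headers=HEADERS,
    json={"messages": [{"role": "user", "content": "(revised draft)"}]},
)

# 4. Deploy version 1 so experiments can reference it reproducibly.
requests.post(
    f"{API_BASE}/prompts/{prompt['id']}/versions/{v1['id']}/deploy",
    headers=HEADERS,
)
```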
Step 3: Create the experiment (type = prompt)
Create an experiment that references your dataset_id and prompt_id.
Reference: Create experiment
Workflow config (prompt)
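A hedged sketch of the create-experiment call for a prompt workflow. The exact field names, especially the workflow config shape and how evaluators are attached, are guesses; see the Create experiment reference for the documented request body.

```python
import requests

API_BASE = "https://api.example.com/v1"   # placeholder base URL
HEADERS = {"Authorization": "Bearer <API_KEY>"}

experiment = requests.post(
    f"{API_BASE}/experiments",
    headers=HEADERS,
    json={
        "name": "support-reply-eval",
        "dataset_id": "<dataset_id>",
        # Workflow config (prompt): field names here are illustrative, not
        # the documented schema.
        "workflow": {
            "type": "prompt",
            "prompt_id": "<prompt_id>",
        },
        # Evaluator slugs must exist and be accessible to your org.
        "evaluators": ["helpfulness"],
    },
).json()

print(experiment["id"])  # keep the id for Step 4
```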
Step 4: List results and inspect the span tree
After the job runs, list logs for the experiment and fetch details for the full span trees (workflow execution + evaluator spans); see the sketch below.
Reference: List experiment logs
What you should see in results
- A root workflow span for the experiment trace
- Child spans for prompt loading/rendering and the LLM call
- Evaluator spans named like evaluator.{slug} with score outputs
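Putting Step 4 together, a sketch that lists the experiment's logs and then fetches each log's detail to walk the span tree. Paths and response field names are placeholders; see the List experiment logs and get log references for the real ones.

```python
import requests

API_BASE = "https://api.example.com/v1"   # placeholder base URL
HEADERS = {"Authorization": "Bearer <API_KEY>"}

# List logs produced by the experiment run (may be empty while the job is
# still processing; see Troubleshooting).
logs = requests.get(
    f"{API_BASE}/experiments/<experiment_id>/logs", headers=HEADERS
).json()["logs"]

for log in logs:
    # The list view can be summarized/truncated; the detail endpoint returns
    # the full span tree and untruncated inputs/outputs.
    detail = requests.get(f"{API_BASE}/logs/{log['id']}", headers=HEADERS).json()
    for span in detail["spans"]:
        # Expect a root workflow span, prompt/LLM child spans, and
        # evaluator.{slug} spans carrying score outputs.
        print(span["name"], span.get("output"))
```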
Troubleshooting
I created an experiment but logs list is empty
- The experiment may still be processing in the background (wait 5–10 seconds and retry)
- Your dataset might be empty
Evaluator spans never appear
- Confirm your evaluator slug exists and is accessible to your org
- Evaluators can run asynchronously; poll the log detail endpoint after submission (see the sketch below)
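Since evaluators can finish after the main workflow trace, a simple polling loop like the following is usually enough. The endpoint path and span field names are assumptions; adapt them to the get log reference.

```python
import time
import requests

API_BASE = "https://api.example.com/v1"   # placeholder base URL
HEADERS = {"Authorization": "Bearer <API_KEY>"}

def wait_for_evaluator_spans(log_id, timeout_s=60.0):
    """Poll the log detail endpoint until evaluator.* spans appear or we give up."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        detail = requests.get(f"{API_BASE}/logs/{log_id}", headers=HEADERS).json()
        spans = [s for s in detail.get("spans", []) if s["name"].startswith("evaluator.")]
        if spans:
            return spans
        time.sleep(2)  # evaluators run asynchronously; retry after a short pause
    return []
```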
Inputs/outputs look truncated
Use the “get log” endpoint (detail) to retrieve the full span tree and untruncated fields.