Enable prompt caching
Prompt caching is a feature that lets you cache the processed prefix of a prompt so that later requests can reuse it, reducing latency and cost.
Prompt caching can only be enabled when you are using the LLM proxy with Anthropic models.
How to use prompt caching
You can either use the prompt_caching feature through the LLM proxy or log the LLM requests that are cached, which gives you better observability into cache behavior.
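As a rough sketch, here is how a cached request might be routed through the proxy using Anthropic's Python SDK. The proxy base URL is a hypothetical placeholder; substitute your proxy's actual endpoint and consult its docs for any required headers.

```python
import anthropic

# A minimal sketch of a prompt-cached request sent through a proxy.
# base_url is a hypothetical placeholder for your proxy's endpoint;
# the Anthropic SDK itself accepts a custom base_url.
client = anthropic.Anthropic(
    api_key="YOUR_ANTHROPIC_API_KEY",
    base_url="https://your-llm-proxy.example.com",  # hypothetical proxy endpoint
)

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant. <long, reusable instructions>",
            # cache_control marks the end of the prefix to cache
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the instructions above."}],
)
print(response.content[0].text)
```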
How does prompt caching work?
The following information is based on Anthropic's documentation.
When you send a request with prompt caching enabled:
- The system checks if a prompt prefix, up to a specified cache breakpoint, is already cached from a recent query.
- If found, it uses the cached version, reducing processing time and costs.
- Otherwise, it processes the full prompt and caches the prefix once the response begins.
This is especially useful for:
- Prompts with many examples
- Large amounts of context or background information
- Repetitive tasks with consistent instructions
- Long multi-turn conversations
The cache has a 5-minute lifetime, refreshed each time the cached content is used.
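You can observe this behavior through the cache fields Anthropic returns in the response's usage object. A minimal sketch, assuming Anthropic's Python SDK and a context long enough to meet the caching minimums (see the limitations section below):

```python
import anthropic

client = anthropic.Anthropic()

LONG_CONTEXT = "<several thousand tokens of background material>"

def ask(question: str):
    return client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": LONG_CONTEXT,
                "cache_control": {"type": "ephemeral"},  # cache breakpoint
            }
        ],
        messages=[{"role": "user", "content": question}],
    )

first = ask("What is the main topic?")
# A cache miss writes the prefix to the cache:
print(first.usage.cache_creation_input_tokens)  # > 0 on the first call

second = ask("List the key points.")
# A repeat within the 5-minute window reads it back:
print(second.usage.cache_read_input_tokens)     # > 0 on the follow-up call
```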
Prompt caching pricing for Anthropic models
| Model | Base Input Tokens | Cache Writes | Cache Hits | Output Tokens |
|---|---|---|---|---|
| Claude 3.5 Sonnet | $3 / MTok | $3.75 / MTok | $0.30 / MTok | $15 / MTok |
| Claude 3.5 Haiku | $1 / MTok | $1.25 / MTok | $0.10 / MTok | $5 / MTok |
| Claude 3 Haiku | $0.25 / MTok | $0.30 / MTok | $0.03 / MTok | $1.25 / MTok |
| Claude 3 Opus | $15 / MTok | $18.75 / MTok | $1.50 / MTok | $75 / MTok |
Note:
- Cache write tokens are 25% more expensive than base input tokens
- Cache read tokens are 90% cheaper than base input tokens
- Regular input and output tokens are priced at standard rates
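As an illustration of the arithmetic, here is a back-of-the-envelope comparison using the Claude 3.5 Sonnet rates from the table above; the 100k-token prompt and 50 calls are hypothetical numbers chosen only to show the calculation.

```python
# Back-of-the-envelope savings for Claude 3.5 Sonnet, using the
# per-MTok rates from the table above. The 100k-token prompt and
# 50 calls are hypothetical numbers chosen to show the arithmetic.
MTOK = 1_000_000
BASE_IN, CACHE_WRITE, CACHE_HIT = 3.00, 3.75, 0.30  # $ / MTok

prompt_tokens = 100_000
calls = 50

without_cache = calls * prompt_tokens / MTOK * BASE_IN
with_cache = (prompt_tokens / MTOK * CACHE_WRITE                 # first call writes
              + (calls - 1) * prompt_tokens / MTOK * CACHE_HIT)  # later calls hit

print(f"without caching: ${without_cache:.2f}")  # $15.00
print(f"with caching:    ${with_cache:.2f}")     # ~$1.85
```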
Supported models
Prompt caching is currently supported on:
- Claude 3.5 Sonnet
- Claude 3.5 Haiku
- Claude 3 Haiku
- Claude 3 Opus
Cache limitations
The minimum cacheable prompt length is:
- 1024 tokens for Claude 3.5 Sonnet and Claude 3 Opus
- 2048 tokens for Claude 3.5 Haiku and Claude 3 Haiku
Shorter prompts cannot be cached, even if they are marked with cache_control; any request to cache fewer tokens than the minimum is simply processed without caching.
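If you are unsure whether a prompt clears the minimum, recent versions of Anthropic's Python SDK expose a token-counting endpoint you can check before adding a breakpoint. A minimal sketch (the system text and threshold handling are illustrative):

```python
import anthropic

client = anthropic.Anthropic()

# Minimums from the list above (1024 for Claude 3.5 Sonnet / Claude 3 Opus,
# 2048 for the Haiku models).
MIN_CACHEABLE = 1024

system_block = {
    "type": "text",
    "text": "<the instructions you intend to cache>",
}

# Count the request's tokens first; the tiny placeholder message means
# the count approximates the size of the system prefix.
count = client.messages.count_tokens(
    model="claude-3-5-sonnet-20241022",
    system=[system_block],
    messages=[{"role": "user", "content": "placeholder"}],
)

if count.input_tokens >= MIN_CACHEABLE:
    # Long enough to cache: add the breakpoint.
    system_block["cache_control"] = {"type": "ephemeral"}
# Otherwise leave cache_control off; a shorter prompt would just be
# processed without caching anyway.
```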