You can only enable prompt caching if you are using the LLM proxy with Anthropic models.
How to use prompt caching
You can either use the prompt_caching feature through the LLM proxy or log the LLM requests that are cached, which gives you better observability into cache usage.
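Below is a minimal sketch of what a cached request can look like when sent with the Anthropic Python SDK. The proxy `base_url` and the placeholder API key are assumptions for illustration only; the `cache_control` block marks the prefix to cache, following Anthropic's documented format.

```python
import anthropic

# Point the SDK at the LLM proxy; the URL and key below are placeholders,
# not a real proxy endpoint.
client = anthropic.Anthropic(
    api_key="YOUR_API_KEY",
    base_url="https://your-llm-proxy.example.com",  # hypothetical proxy URL
)

# A long, reusable system prompt is a good candidate for caching.
long_instructions = "You are a contract-review assistant. ..." * 200  # must exceed the minimum cacheable length

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_instructions,
            # Everything up to this block (the cache breakpoint) is cached.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the attached contract."}],
)

# usage reports cache_creation_input_tokens / cache_read_input_tokens,
# which is what the cached-request logs surface.
print(response.usage)
```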
How does prompt caching work?
All information below is from Anthropic's documentation. When you send a request with prompt caching enabled:
- The system checks whether a prompt prefix, up to a specified cache breakpoint, is already cached from a recent query.
- If found, it uses the cached version, reducing processing time and costs.
- Otherwise, it processes the full prompt and caches the prefix once the response begins.
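To make that flow concrete, here is a purely illustrative sketch of a prefix-cache lookup. It is not Anthropic's implementation; it only shows the check-then-write behaviour described in the bullets above.

```python
import hashlib

# Illustrative only: an in-memory prefix cache keyed by a hash of the
# prompt text up to the cache breakpoint.
prefix_cache: dict[str, str] = {}

def handle_request(prompt: str, breakpoint_index: int) -> None:
    prefix = prompt[:breakpoint_index]
    key = hashlib.sha256(prefix.encode()).hexdigest()

    if key in prefix_cache:
        # Cache hit: the cached prefix is reused, so it is billed at the
        # cheaper cache-read rate and processed faster.
        print("cache hit - prefix billed at the cache-read rate")
    else:
        # Cache miss: the full prompt is processed, then the prefix is
        # stored so follow-up requests with the same prefix can reuse it.
        prefix_cache[key] = prefix
        print("cache miss - prefix billed at the cache-write rate")
```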
Prompt caching is most useful for:
- Prompts with many examples
- Large amounts of context or background information
- Repetitive tasks with consistent instructions
- Long multi-turn conversations
Prompt caching pricing for Anthropic models
| Model | Base Input Tokens | Cache Writes | Cache Hits | Output Tokens |
|---|---|---|---|---|
| Claude 3.5 Sonnet | $3 / MTok | $3.75 / MTok | $0.30 / MTok | $15 / MTok |
| Claude 3.5 Haiku | $1 / MTok | $1.25 / MTok | $0.10 / MTok | $5 / MTok |
| Claude 3 Haiku | $0.25 / MTok | $0.30 / MTok | $0.03 / MTok | $1.25 / MTok |
| Claude 3 Opus | $15 / MTok | $18.75 / MTok | $1.50 / MTok | $75 / MTok |
- Cache write tokens are 25% more expensive than base input tokens
- Cache read tokens are 90% cheaper than base input tokens
- Regular input and output tokens are priced at standard rates
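As a rough worked example of how these rates interact, the sketch below estimates the cost of one cache write followed by repeated cache hits for Claude 3.5 Sonnet. The token counts are made-up numbers chosen only to illustrate the arithmetic.

```python
# Claude 3.5 Sonnet rates from the table above, in dollars per million tokens.
BASE_INPUT = 3.00
CACHE_WRITE = 3.75   # 25% more than base input
CACHE_READ = 0.30    # 90% less than base input
OUTPUT = 15.00

# Hypothetical workload: a 10,000-token shared prefix, a 200-token question,
# and a 500-token answer, repeated over 50 calls.
prefix_tokens, question_tokens, output_tokens, calls = 10_000, 200, 500, 50

def cost(tokens: int, rate_per_mtok: float) -> float:
    return tokens / 1_000_000 * rate_per_mtok

# The first call writes the prefix to the cache; later calls read it.
with_caching = (
    cost(prefix_tokens, CACHE_WRITE)
    + cost(prefix_tokens, CACHE_READ) * (calls - 1)
    + (cost(question_tokens, BASE_INPUT) + cost(output_tokens, OUTPUT)) * calls
)
without_caching = (
    cost(prefix_tokens + question_tokens, BASE_INPUT)
    + cost(output_tokens, OUTPUT)
) * calls

print(f"with caching:    ${with_caching:.2f}")    # ~ $0.59
print(f"without caching: ${without_caching:.2f}") # ~ $1.91
```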
Supported models
Prompt caching is currently supported on:
- Claude 3.5 Sonnet
- Claude 3.5 Haiku
- Claude 3 Haiku
- Claude 3 Opus
Cache limitations
The minimum cacheable prompt length is:
- 1024 tokens for Claude 3.5 Sonnet and Claude 3 Opus
- 2048 tokens for Claude 3.5 Haiku and Claude 3 Haiku
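If you want to check whether a prompt clears these minimums before relying on caching, one option is a token count before sending the real request. The sketch below assumes the Anthropic SDK's `messages.count_tokens` endpoint and a hypothetical instructions file; prompts below the minimum are simply processed without caching.

```python
import anthropic

client = anthropic.Anthropic(api_key="YOUR_API_KEY")  # placeholder key

MIN_CACHEABLE_TOKENS = 1024  # threshold for Claude 3.5 Sonnet / Claude 3 Opus

system_prompt = open("shared_instructions.txt").read()  # hypothetical file

# Count the tokens the prompt would consume before sending the real request.
count = client.messages.count_tokens(
    model="claude-3-5-sonnet-20241022",
    system=system_prompt,
    messages=[{"role": "user", "content": "placeholder"}],
)

if count.input_tokens >= MIN_CACHEABLE_TOKENS:
    print("Prompt is long enough to benefit from caching.")
else:
    print("Prompt is below the minimum; it will be processed without caching.")
```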