Cache LLM responses
Reduce latency and save LLM costs by caching LLM prompts and responses.
Why use caches?
You may find caches useful when you want to:
- Reduce latency: Serve stored responses instantly, eliminating the need for repeated API calls.
- Save costs: Minimize expenses by reusing cached responses instead of making redundant requests.
- Improve performance: Deliver consistently high-quality outputs by serving pre-vetted, cached responses.
How to use caches
Turn on caches by setting cache_enabled to true. Currently we cache the whole conversation, including the system message, the user message, and the response.
In the example below, we cache the user message “Hi, how are you?” and its response.
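Here is a minimal sketch of such a request in Python. The endpoint URL, the KEYWORDSAI_API_KEY environment variable, and the model name are assumptions for illustration; adjust them to match your setup and the API reference.

```python
import os
import requests

# Assumed gateway endpoint and auth scheme; verify against the API reference.
API_URL = "https://api.keywordsai.co/api/chat/completions"
API_KEY = os.environ["KEYWORDSAI_API_KEY"]

response = requests.post(
    API_URL,
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "model": "gpt-4o-mini",  # example model name
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hi, how are you?"},
        ],
        # Turn on caching; an identical conversation will be served from the cache.
        "cache_enabled": True,
    },
)
print(response.json())
```

On the first call the response is generated and stored; a subsequent identical conversation is served from the cache instead of hitting the model again.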
Cache parameters
- cache_enabled: Enables or disables caches.
- Cache TTL: Specifies the time-to-live (TTL) for the cache, in seconds.
- Cache options: Specifies additional cache options. Currently we support the cache_by_customer option, which can be set to true or false. If cache_by_customer is set to true, the cache is stored per customer identifier.
How to view caches
You can view the caches on the Logs page. The model tag will be keywordsai/cache. You can also filter the logs by the Cache hit field.
Regular caching vs Prompt caching
Regular caching stores complete prompt responses. When the same prompt is received, the system returns the stored response, reducing costs but limiting response variety.
Prompt caching stores the model’s intermediate computation state. This allows the model to generate diverse responses while still saving computational cost, since it doesn’t need to reprocess the entire prompt from scratch. See the Prompt caching section for more information.