Proxy
LLM caches
Reduce latency and save LLM costs by caching LLM prompts and responses.
This is a beta feature. Please let us know if you encounter any issues; we will continue to improve it.
Why Caches?
You may find caches useful when you want to:
- Reduce latency: Serve stored responses instantly, eliminating the need for repeated API calls.
- Save costs: Minimize expenses by reusing cached responses instead of making redundant requests.
- Improve performance: Deliver consistently high-quality outputs by serving pre-vetted, cached responses.
How to use Caches?
Turn on caches by setting cache_enabled to true. We currently cache the last message of the conversation.
In the example below, the user message “Hi, how are you?” and its response will be cached.
{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "system",
      "content": "Hello, how can I help you today?"
    },
    {
      "role": "user",
      "content": "Hi, how are you?" // message to be cached; its response will be cached as well
    }
  ],
  "max_tokens": 30,
  "customer_identifier": "a_model_customer",
  "stream": true,
  "cache_enabled": true, // enable caches
  "cache_ttl": 600 // cache for 10 minutes, optional
}
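For reference, below is a minimal sketch of sending this request from Python. The proxy URL and the PROXY_API_KEY environment variable are placeholders, not part of this documentation; substitute your deployment's chat completions endpoint and credentials (streaming is omitted for brevity).

import os
import requests

# Placeholder endpoint and credentials; replace with your proxy's actual values.
PROXY_URL = "https://your-proxy.example.com/api/chat/completions"
API_KEY = os.environ["PROXY_API_KEY"]

payload = {
    "model": "gpt-3.5-turbo",
    "messages": [
        {"role": "system", "content": "Hello, how can I help you today?"},
        # The last message and its response are cached.
        {"role": "user", "content": "Hi, how are you?"},
    ],
    "max_tokens": 30,
    "customer_identifier": "a_model_customer",
    "cache_enabled": True,  # enable caches
    "cache_ttl": 600,       # optional: cache for 10 minutes
}

response = requests.post(
    PROXY_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
)
print(response.json())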
Cache parameters
cache_enabled
boolean
Enable or disable caches.
{
  "cache_enabled": true
}
cache_ttl
number
This parameter specifies the time-to-live (TTL) for the cache in seconds.
This parameter is optional; the default TTL is 30 days.
{
  "cache_ttl": 3600 // in seconds
}
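As a quick reference for converting durations to seconds (using the 30-day default mentioned above), a small Python sketch:

# cache_ttl is expressed in seconds.
TEN_MINUTES = 10 * 60            # 600, as in the request example above
ONE_HOUR = 60 * 60               # 3600
THIRTY_DAYS = 30 * 24 * 60 * 60  # 2592000, the default when cache_ttl is omitted

payload = {
    "cache_enabled": True,
    "cache_ttl": ONE_HOUR,  # keep cached responses for one hour
}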