This is a beta feature. Please let us know if you encounter any issues; we will continue to improve it.

Why Caches?

You may find caches useful when you want to:

  • Reduce latency: Serve stored responses instantly, eliminating the need for repeated API calls.
  • Save costs: Minimize expenses by reusing cached responses instead of making redundant requests.
  • Improve performance: Deliver consistently high-quality outputs by serving pre-vetted, cached responses.

How to Use Caches?

Turn on caching by setting cache_enabled to true. Currently, we cache the last message of the conversation along with its response.

In the example below, the user message “Hi, how are you?” and its response will be cached.

{
    "model": "gpt-3.5-turbo",
    "messages": [
        {
            "role": "system",
            "content": "Hello, how can I help you today?"
        },
        {
            "role": "user",
            "content": "Hi, how are you?" // message to be cached, its response will be cached as well
        }
    ],
    "max_tokens": 30,
    "customer_identifier": "a_model_customer",
    "stream": true,
    "cache_enabled": true, // enable cache
    "cache_ttl":600 // cache for 10 minutes, optional
}
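
For reference, here is a minimal Python sketch that sends the request above with the requests library. The endpoint URL and API key are placeholders, not values from this guide; substitute your own. Streaming is disabled in the sketch so the full JSON response can be printed in one step.

import requests

# Placeholder endpoint and key -- replace with your actual values.
API_URL = "https://api.example.com/chat/completions"
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "gpt-3.5-turbo",
    "messages": [
        {"role": "system", "content": "Hello, how can I help you today?"},
        # This message and its response will be cached.
        {"role": "user", "content": "Hi, how are you?"},
    ],
    "max_tokens": 30,
    "customer_identifier": "a_model_customer",
    "stream": False,  # disabled here to keep the example simple
    "cache_enabled": True,  # enable cache
    "cache_ttl": 600,  # cache for 10 minutes, optional
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json())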

Cache parameters

cache_enabled
boolean

Enable or disable caches.

{
    "cache_enabled": true
}
cache_ttl
number

This parameter specifies the time-to-live (TTL) for the cache in seconds.

This parameter is optional; the default value is 30 days.

{
    "cache_ttl": 3600 // in seconds
}
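
To see caching in action, you can send the same request twice and compare latencies: the first call is forwarded to the model, while the second should return the stored response almost instantly. A rough sketch, using the same placeholder endpoint and key as above:

import time

import requests

API_URL = "https://api.example.com/chat/completions"  # placeholder
API_KEY = "YOUR_API_KEY"  # placeholder

payload = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hi, how are you?"}],
    "max_tokens": 30,
    "stream": False,
    "cache_enabled": True,
    "cache_ttl": 3600,  # keep the cached entry for one hour
}

def timed_call() -> float:
    """Send the request and return the elapsed time in seconds."""
    start = time.perf_counter()
    r = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=payload,
        timeout=30,
    )
    r.raise_for_status()
    return time.perf_counter() - start

first = timed_call()   # cache miss: forwarded to the model
second = timed_call()  # cache hit: served from the cache
print(f"first: {first:.2f}s, second: {second:.2f}s")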