Why Caches?

You may find caches useful when you want to:

  • Reduce latency: Serve stored responses instantly, eliminating the need for repeated API calls.
  • Save costs: Minimize expenses by reusing cached responses instead of making redundant requests.
  • Improve performance: Deliver consistently high-quality outputs by serving pre-vetted, cached responses.

How to use Caches?

Turn on caching by setting cache_enabled to true. We currently cache the whole conversation, including the system message, the user messages, and the response.

In the example below, we cache the user message “Tell me a long story” and its response.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.keywordsai.co/api/",
    api_key="YOUR_KEYWORDSAI_API_KEY",
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Tell me a long story"}
    ],
    extra_body={
        "cache_enabled": True,  # turn on caching for this request
        "cache_ttl": 600,  # keep the cached response for 600 seconds
        "cache_options": {
            "cache_by_customer": True  # scope the cache to the customer identifier
        }
    }
)
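
Once a response is cached, sending the same conversation again within the TTL returns the stored response. Below is a minimal sketch (reusing the client from the example above) that times two identical requests to illustrate a cache hit; the exact latencies will vary with your account and model.

import time

def timed_request():
    """Send the same request and measure how long it takes."""
    start = time.time()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": "Tell me a long story"}
        ],
        extra_body={"cache_enabled": True, "cache_ttl": 600},
    )
    return response, time.time() - start

# First call goes to the model and stores the conversation in the cache.
first_response, first_latency = timed_request()

# A second identical call within the TTL should be served from the cache.
second_response, second_latency = timed_request()

print(f"First call: {first_latency:.2f}s, cached call: {second_latency:.2f}s")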

Cache parameters

cache_enabled
boolean

Enable or disable caches.

{
    "cache_enabled": true
}
cache_ttl
number

This parameter specifies the time-to-live (TTL) for the cache in seconds.

It’s optional; the default value is 30 days.
{
    "cache_ttl": 3600 // in seconds
}
cache_options
object

This parameter specifies the cache options. Currently we support the cache_by_customer option, which can be set to true or false. When cache_by_customer is set to true, cached responses are stored per customer identifier (see the sketch after the example below).

This parameter is optional.
{
    "cache_options": { // optional
        "cache_by_customer": true // or false
    }
}
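
For example, per-customer caching pairs with a customer identifier on the request. The sketch below assumes the customer_identifier request parameter is how your requests identify the customer; adjust the field name to match your setup.

# Minimal sketch of per-customer caching. The customer_identifier field is an
# assumption; use whatever customer field your requests already send.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Tell me a long story"}
    ],
    extra_body={
        "cache_enabled": True,
        "customer_identifier": "customer_123",  # assumed customer field
        "cache_options": {
            "cache_by_customer": True  # cache is scoped to this customer
        }
    }
)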

How to view caches

You can view cached responses on the Logs page. The model tag will be keywordsai/cache. You can also filter the logs by the Cache hit field.

Regular caching vs Prompt caching

Regular caching stores complete prompt responses. When the same prompt is received, the system returns the stored response, reducing costs but limiting response variety.

Prompt caching stores the model’s intermediate computation state. This allows the model to generate diverse responses while still saving computational costs, as it doesn’t need to reprocess the entire prompt from scratch. See the Prompt caching section for more information.