A Redis wrapper, under 60 lines, that gives you exact-match dedup, prefix-based mass-invalidation on prompt changes, sliding expiry on hot keys, and a single-flight lock so concurrent workers do not pay for the same miss twice. It is the version of LLM caching you reach for when one worker becomes a fleet.

Caching LLM Responses with Redis (The Right Way)

You added requests-cache to your app and the latency dropped 80%. Congrats. Then a user reports that the model served an answer from three months ago, because the cache never let it go. You shrug and add a 24-hour TTL. Now a user who asks the same question a day apart misses the cache every time, and you ship "good enough." Good enough is leaving money on the table.

Hey guys, Mr. Technology here. The diskcache tutorial covered SQLite. This is the version you reach for when one process becomes ten. A Redis wrapper of under 60 lines gets you exact-match deduplication, prefix invalidation when you ship a prompt change, sliding expiry on hot keys, and a lock that lets only one worker call the API when several miss on the same key. Drop it in, ship the latency win, keep your sanity when you rewrite the system prompt on a Wednesday afternoon.

Install

bash

pip install redis openai

One env var: REDIS_URL=redis://localhost:6379/0. The library is the official redis-py 5.x.

The Wrapper

python

import os, json, hashlib, time
from typing import Callable

import redis

r = redis.from_url(os.environ["REDIS_URL"], decode_responses=True)

PREFIX = "llm:v3:"          # bump on every prompt-template change -> mass-invalidate
DEFAULT_TTL = 60 * 60 * 24  # 24h for cold keys
HOT_TTL = 60 * 60 * 6       # hot keys live 6h sliding
HOT_THRESHOLD = 10          # seen this many times in 1h -> hot
LOCK_TTL = 60               # must exceed the slowest LLM call you allow
MAX_WAIT = 90               # seconds a waiting worker polls before giving up


def _key(messages: list[dict], model: str) -> str:
    # sort_keys makes the hash independent of dict key order
    blob = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    h = hashlib.sha256(blob.encode()).hexdigest()[:32]
    return f"{PREFIX}{model}:{h}"


def get_or_set(messages, model: str, call_fn: Callable,
               ttl: int = DEFAULT_TTL) -> dict:
    key = _key(messages, model)
    lock_key = f"{key}:lock"
    hits_key = f"{key}:hits"
    deadline = time.monotonic() + MAX_WAIT
    while True:
        cached = r.get(key)
        if cached:
            # Count hits in a 1h window; slide the expiry only once the key is hot
            hits = r.incr(hits_key)
            if hits == 1:
                r.expire(hits_key, 60 * 60)
            if hits >= HOT_THRESHOLD:
                r.expire(key, HOT_TTL)
            return json.loads(cached)
        # Single-flight: only one worker calls the API for a given key at a time
        if r.set(lock_key, "1", nx=True, ex=LOCK_TTL):
            try:
                result = call_fn()
                r.set(key, json.dumps(result), ex=ttl)
                return result
            finally:
                r.delete(lock_key)
        if time.monotonic() > deadline:
            raise TimeoutError(f"gave up waiting for {lock_key}")
        time.sleep(0.1)  # another worker holds the lock; poll the cache again

That is the whole cache. No classes, no abstractions, no decorators.
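One thing worth sketching: the wrapper releases its lock with a plain r.delete, so a worker whose lock already expired can delete a lock that now belongs to someone else. A common hardening, shown here as a rough sketch rather than a drop-in, is to store a random token as the lock value and release it with a compare-and-delete Lua script. The names acquire_lock, release_lock, and RELEASE_LUA are made up for this example, and client is assumed to be a redis-py client created with decode_responses=True.

```python
import uuid
from typing import Optional

# Compare-and-delete: remove the lock only if it still holds our token.
RELEASE_LUA = """
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
end
return 0
"""


def acquire_lock(client, lock_key: str, ttl: int = 60) -> Optional[str]:
    """Return a random token if we took the lock, else None."""
    token = uuid.uuid4().hex
    # SET key token NX EX ttl: succeeds only if the key does not exist yet
    if client.set(lock_key, token, nx=True, ex=ttl):
        return token
    return None


def release_lock(client, lock_key: str, token: str) -> bool:
    """Release the lock only if we still own it; True if a key was deleted."""
    # redis-py signature: eval(script, numkeys, *keys_and_args)
    return client.eval(RELEASE_LUA, 1, lock_key, token) == 1
```

If you adopt this, the try/finally in get_or_set would call release_lock with the token it got from acquire_lock instead of calling r.delete directly.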

Wiring It In

python

from openai import OpenAI
from cache import get_or_set  # the wrapper above, saved as cache.py

client = OpenAI()
MODEL = "gpt-4.1-mini"
# Define the messages once so the cache key and the API call cannot drift apart
MESSAGES = [{"role": "user", "content": "Summarize the Eiffel Tower."}]


def call():
    return client.chat.completions.create(
        model=MODEL,
        messages=MESSAGES,
    ).model_dump()


out = get_or_set(messages=MESSAGES, model=MODEL, call_fn=call)

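Send the same prompt twice and the second call comes back in about the time of one Redis round trip, typically a millisecond or so against a local instance. Bump PREFIX from llm:v3: to llm:v4: on the next prompt change and every old key becomes unreachable with a one-line edit. The old keys still occupy memory until their TTLs run out, but nothing reads them anymore. No FLUSHDB, no downtime, no surprise RuntimeError: cache miss in your logs at 3am.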
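If you would rather reclaim that memory right away than wait for the old TTLs, a small cleanup pass can do it. This is a minimal sketch, assuming client is a redis-py client and that the prefix contains no glob characters such as * or ?; purge_prefix is a name invented for this example.

```python
def purge_prefix(client, old_prefix: str, batch: int = 500) -> int:
    """Delete every key under old_prefix; returns how many keys were removed."""
    removed = 0
    chunk = []
    # SCAN walks the keyspace incrementally, unlike KEYS, which blocks the server
    for key in client.scan_iter(match=f"{old_prefix}*", count=batch):
        chunk.append(key)
        if len(chunk) >= batch:
            # UNLINK frees the memory in a background thread
            removed += client.unlink(*chunk)
            chunk = []
    if chunk:
        removed += client.unlink(*chunk)
    return removed
```

After shipping llm:v4:, something like purge_prefix(r, "llm:v3:") from a one-off script would clear the retired keys, including their :hits counters and any leftover locks.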

Gotchas

**Hash the whole message array, not just the user turn.** Two prompts that differ only in the system prompt are different keys. That is the point. If you hash only the user message, you serve the wrong answer to a different persona.
**Lock TTL must exceed call timeout.** If your LLM call takes 30s, set the lock ex=60. Otherwise the lock expires mid-call, a second worker with the identical request takes it, both call the API, and you pay twice for the same answer.
**Sliding expiry is a footgun on small datasets.** A 50-key cache where every key is "hot" gives you the same staleness as no TTL. Use the 24h default for cold paths. Promote only after HOT_THRESHOLD real hits in a short window; keep a counter with a per-hour expiry if you want precision.
**decode_responses=True saves an hour.** Without it, every r.get returns bytes. json.loads will still parse them, but keys from SCAN, string comparisons, and anything you log or concatenate come back as b'...' and fail in confusing ways. Set it once on the client connection, forget about it.
**Don't cache tool calls.** If your call_fn invokes a tool, the result depends on external state (DB, API, time). Cache only the pure-prompt completions; let tool-augmented calls go through.
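The tool-call rule is easy to enforce with a guard before you write to the cache. Here is a rough sketch, assuming the response is the dict you get from .model_dump() on an OpenAI Chat Completions result and that messages follow the same chat format; is_cacheable is a hypothetical helper, not part of the wrapper.

```python
def is_cacheable(messages: list[dict], response: dict) -> bool:
    """True only for pure-prompt completions with no tool involvement."""
    # A "tool" message in the request means the answer depends on tool output
    if any(m.get("role") == "tool" for m in messages):
        return False
    for choice in response.get("choices", []):
        message = choice.get("message") or {}
        # The model asked to call a tool, so this is not a final answer
        if message.get("tool_calls") or choice.get("finish_reason") == "tool_calls":
            return False
    return True
```

You would then skip the cache write whenever is_cacheable(messages, result) is False and simply return the result.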

That's it. Drop the wrapper into cache.py, import it, and ship.

— Mr. Technology

*Tested June 2026 with redis-py 5.0+ and Redis 7.2+. decode_responses=True is set on the client, not per call. r.set(..., nx=True, ex=...) is the usual single-instance Redis lock pattern; to make the release safe, store a random token as the lock value and delete it with a compare-and-delete Lua script. For multi-instance Redis, look at Redlock with a fencing token. Bump PREFIX on every prompt-template change; PREFIX=llm:v4: instantly retires every v3 key.*