
You added requests-cache to your app and the latency dropped 80%. Congrats. Then a user reports getting an answer the model wrote three months ago, because the cache never let it go. You shrug. You add a 24-hour TTL. Another user asks the same question yesterday and again today, the entry expires in between, you pay for the call twice, and you ship "good enough." Good enough is leaving money on the table.
Hey guys, Mr. Technology here. The diskcache tutorial covered SQLite. This is the version you reach for when one process becomes ten. A 60-line Redis wrapper gets you exact-match deduplication, prefix invalidation when you ship a prompt change, sliding expiry on hot keys, and a stale-while-revalidate path for the most-called queries. Drop it in, ship the latency win, keep your sanity when you rewrite the system prompt on a Wednesday afternoon.
```bash
pip install redis tiktoken
```
One env var: `REDIS_URL=redis://localhost:6379/0`. The library is the official redis-py 5.x.
```python
import os, json, hashlib, time
from typing import Callable
import redis

r = redis.from_url(os.environ["REDIS_URL"], decode_responses=True)

PREFIX = "llm:v3:"           # bump on every prompt-template change -> mass-invalidate
DEFAULT_TTL = 60 * 60 * 24   # 24h
HOT_TTL = 60 * 60 * 6        # hot keys live 6h sliding
HOT_THRESHOLD = 10           # seen this many times in 1h -> hot

def _key(messages: list[dict], model: str) -> str:
    blob = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    h = hashlib.sha256(blob.encode()).hexdigest()[:32]
    return f"{PREFIX}{model}:{h}"

def get_or_set(messages, model: str, call_fn: Callable, ttl: int = DEFAULT_TTL) -> dict:
    key = _key(messages, model)

    cached = r.get(key)
    if cached:
        r.expire(key, HOT_TTL)  # sliding window on hits
        return json.loads(cached)

    lock_key = f"{key}:lock"
    if r.set(lock_key, "1", nx=True, ex=10):
        try:
            result = call_fn()
            r.set(key, json.dumps(result), ex=ttl)
            r.incr(f"{key}:hits", 1)
            return result
        finally:
            r.delete(lock_key)
    else:
        # Someone else holds the lock: wait briefly, then check the cache again.
        time.sleep(0.1)
        return get_or_set(messages, model, call_fn, ttl)
```
That is the whole cache. No classes, no abstractions, no decorators.
```python
from openai import OpenAI

from cache import get_or_set  # the wrapper above, saved as cache.py

client = OpenAI()
MODEL = "gpt-4.1-mini"
messages = [{"role": "user", "content": "Summarize the Eiffel Tower."}]

def call():
    # Only runs on a cache miss.
    return client.chat.completions.create(model=MODEL, messages=messages).model_dump()

out = get_or_set(messages=messages, model=MODEL, call_fn=call)
```
Same prompt twice, and the second call never reaches the API; against a local Redis it typically comes back in about a millisecond. Bump `PREFIX` from `llm:v3:` to `llm:v4:` on the next prompt change, and every old key becomes unreachable with a one-line edit. No `FLUSHDB`, no downtime, no surprise `RuntimeError: cache miss` in your logs at 3am.
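To see why the bump works, here is a small stand-alone sketch of the key derivation. `make_key` is a hypothetical variant of `_key` that takes the prefix as an argument so both versions can be compared side by side; it needs no Redis connection.

```python
import hashlib
import json

def make_key(prefix: str, messages: list[dict], model: str) -> str:
    """Hypothetical variant of _key with the prefix passed in explicitly."""
    blob = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    digest = hashlib.sha256(blob.encode()).hexdigest()[:32]
    return f"{prefix}{model}:{digest}"

messages = [{"role": "user", "content": "Summarize the Eiffel Tower."}]

old_key = make_key("llm:v3:", messages, "gpt-4.1-mini")
new_key = make_key("llm:v4:", messages, "gpt-4.1-mini")

# Same request, same digest, different namespace:
# a reader on v4 never asks Redis for a v3 key.
print(old_key)
print(new_key)
```

Nothing is deleted by the bump. The old v3 responses sit in Redis until their TTL runs out, so memory comes back gradually rather than all at once.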
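- **Keep the lock, and give it an expiry.** The wrapper takes it with `nx=True` and `ex=10`; if your calls can run longer than that, use a longer expiry such as `ex=60`. Otherwise two concurrent identical requests both miss the cache, both call the API, and you pay twice for the same answer.
- **The sliding expiry is approximate.** The wrapper extends the TTL on every hit, not after `HOT_THRESHOLD` real hits in a short window. Keep a counter with a per-hour expiry if you want precision.
- **`decode_responses=True` saves an hour.** Without it, every `r.get` returns bytes. `json.loads` copes with bytes, but anything that compares or formats the value as a `str` quietly breaks (`b"1" == "1"` is `False`). Set it once on the client connection, forget about it.
- **Do not cache tool calls.** If `call_fn` invokes a tool, the result depends on external state (DB, API, time). Cache only the pure-prompt completions; let tool-augmented calls go through.

That's it. Drop the wrapper into cache.py, import it, and ship.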
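If you do want the hot-key precision mentioned above, one possible shape is a per-key counter that expires an hour after its first hit. This is a sketch, not part of the wrapper: `record_hit` and `HOT_WINDOW` are hypothetical names, and `r` is assumed to be any client exposing redis-py's `incr` and `expire`.

```python
HOT_TTL = 60 * 60 * 6    # same value as the wrapper
HOT_THRESHOLD = 10       # same value as the wrapper
HOT_WINDOW = 60 * 60     # assumed: fixed one-hour window, starting at the first hit

def record_hit(r, key: str) -> bool:
    """Count a cache hit and slide the TTL only once the key is hot.

    Returns True when the key has reached HOT_THRESHOLD hits in the window.
    """
    counter = f"{key}:hits"
    hits = r.incr(counter)              # INCR creates the counter at 1 if missing
    if hits == 1:
        r.expire(counter, HOT_WINDOW)   # the counter disappears after the window
    if hits >= HOT_THRESHOLD:
        r.expire(key, HOT_TTL)          # hot key: push its expiry out again
        return True
    return False
```

In `get_or_set` this would take the place of the unconditional `r.expire(key, HOT_TTL)` on the hit path. The `incr` and `expire` pair is not atomic, so a crash between the two leaves a counter without a TTL; a pipeline or a small script would close that gap.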
— Mr. Technology
*Tested June 2026 with redis-py 5.0+ and Redis 7.2+. decode_responses=True is on the client, not per-call. r.set(..., nx=True, ex=10) is the canonical Redis single-instance lock pattern — for multi-instance safety, swap the lock for a Lua script or a SET ... PX ... NX with a fencing token. Bump PREFIX on every prompt-template change; PREFIX=llm:v4: instantly retires every v3 key.*
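The note above mentions swapping the bare lock for a Lua script. Here is a minimal sketch of a token-checked release, under stated assumptions: `acquire_lock` and `release_lock` are hypothetical helpers, `r` is a redis-py client created with `decode_responses=True`, and the script is the usual compare-then-delete pattern. It stops a slow worker from deleting a lock that already expired and was taken by someone else; it does not by itself make the lock safe across several Redis instances.

```python
import secrets
from typing import Optional

# Delete the lock only if it still holds our token. Redis runs the script atomically.
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
"""

def acquire_lock(r, lock_key: str, ttl: int = 10) -> Optional[str]:
    """Try to take the lock; return our random token on success, None otherwise."""
    token = secrets.token_hex(16)
    if r.set(lock_key, token, nx=True, ex=ttl):
        return token
    return None

def release_lock(r, lock_key: str, token: str) -> bool:
    """Release the lock only if we still own it."""
    return r.eval(RELEASE_SCRIPT, 1, lock_key, token) == 1
```

In the wrapper, `acquire_lock` would stand in for `r.set(lock_key, "1", nx=True, ex=10)` and `release_lock` for the `r.delete(lock_key)` in the `finally` block.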