You are re-running the same eval prompts every time you tweak your scoring code or edit one prompt in the suite. Stop paying for it. Twenty lines of Python, a SQLite-backed cache, and a re-run of unchanged requests in your dev eval loop goes from $3.40 to $0.00. Here is the recipe.

Cut Your LLM Dev Loop Costs 80% With a 20-Line Disk Cache

You are debugging a prompt suite. You tweak one prompt. You run the whole suite again. You fix a bug in your scoring code. You run it again. You change how the results are reported. You run it again. Forty runs later, you have spent $4 re-requesting completions you already had. Stop. Cache your LLM responses to disk. Twenty lines of Python. Pays for itself the first afternoon.

This is the simplest cost optimization in your stack and almost nobody ships it.

Install the Library

bash

pip install diskcache openai

Those are the only two packages you install directly; pip pulls in the OpenAI SDK's own dependencies for you. diskcache is a pure-Python, SQLite-backed cache with thread-safe and process-safe semantics. It survives restarts, handles concurrent writes, and ships with sensible defaults. It is the right tool for the job.

Write the Wrapper

python

import hashlib
import json

import diskcache
from openai import OpenAI

client = OpenAI()
cache = diskcache.Cache("./llm_cache")  # persists across restarts


def cached_completion(model: str, messages: list, **kwargs):
    # Stable hash of the request: model + messages + sorted kwargs
    payload = json.dumps(
        {"model": model, "messages": messages, **kwargs},
        sort_keys=True,
    )
    key = hashlib.sha256(payload.encode()).hexdigest()

    hit = cache.get(key)
    if hit is not None:
        return hit  # return the full response object

    response = client.chat.completions.create(
        model=model, messages=messages, **kwargs
    )
    cache.set(key, response, expire=86400 * 7)  # 7-day TTL
    return response

That is the whole thing. Swap the OpenAI client for Anthropic, Gemini, or any other provider. The pattern is identical.
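The key is the part worth poking at before you trust it. Here is a minimal sketch that pulls the hashing step into its own function so you can check two properties: argument order does not matter, and any change to a sampling parameter produces a new key. The `request_key` name and the `"some-model"` string are placeholders for illustration, not part of any library.

```python
import hashlib
import json


def request_key(model: str, messages: list, **kwargs) -> str:
    """Hash a request the same way the wrapper above does (hypothetical helper)."""
    payload = json.dumps(
        {"model": model, "messages": messages, **kwargs},
        sort_keys=True,  # makes the JSON text independent of argument order
    )
    return hashlib.sha256(payload.encode()).hexdigest()


messages = [{"role": "user", "content": "Summarize this ticket."}]

# Same request, keyword arguments passed in a different order: same key.
key_a = request_key("some-model", messages, temperature=0.7, max_tokens=256)
key_b = request_key("some-model", messages, max_tokens=256, temperature=0.7)

# A different temperature is a different request: new key, so a cache miss.
key_c = request_key("some-model", messages, temperature=0.2, max_tokens=256)

print(key_a == key_b)  # True
print(key_a == key_c)  # False
```

The second result is the one to remember: changing the temperature, the model, or a single word of the prompt is a miss by design, so the savings come from re-running requests you have not touched.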

Use It in Your Eval Loop

python

for prompt in eval_prompts:
    response = cached_completion(
        model="gpt-5.5",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    evaluate(response.choices[0].message.content)

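First run: every prompt hits the API. Re-run the script unchanged and every call hits the cache. Cost: zero. Change your eval logic and re-run as often as you like. When you A/B test prompt variations, only the new variants reach the API, so you never pay for the same completion twice. One caveat: at temperature=0.7 a cache hit replays the one sample you stored instead of drawing a fresh one.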
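You can watch the hit and miss behavior without spending anything. The sketch below swaps the real client for a fake provider function and the disk cache for a plain dict, both assumptions made only so it runs offline, then counts how many calls actually reach the "API" across three runs of a tiny suite.

```python
import hashlib
import json

api_calls = 0  # how many times the fake provider was actually called


def fake_llm(model: str, messages: list, **kwargs) -> str:
    """Stand-in for a real provider call; exists only for this demo."""
    global api_calls
    api_calls += 1
    return f"echo: {messages[-1]['content']}"


def cached_call(cache: dict, model: str, messages: list, **kwargs) -> str:
    """Same keying scheme as the wrapper above, backed by a dict."""
    payload = json.dumps(
        {"model": model, "messages": messages, **kwargs}, sort_keys=True
    )
    key = hashlib.sha256(payload.encode()).hexdigest()
    if key not in cache:
        cache[key] = fake_llm(model, messages, **kwargs)
    return cache[key]


def run_suite(cache: dict, prompts: list) -> int:
    """Run every prompt once; return how many calls reached the provider."""
    before = api_calls
    for prompt in prompts:
        cached_call(
            cache,
            "some-model",
            [{"role": "user", "content": prompt}],
            temperature=0.7,
        )
    return api_calls - before


cache = {}
prompts = ["What is 2 + 2?", "Name a prime number.", "Define idempotent."]

first = run_suite(cache, prompts)   # cold cache: every prompt is a miss
second = run_suite(cache, prompts)  # identical requests: every prompt is a hit
third = run_suite(cache, prompts + ["A new prompt variant"])  # one new miss

print(first, second, third)  # 3 0 1
```

The third run is the realistic case: you added one prompt, so you pay for one completion.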

The Numbers

I dropped this into a 150-prompt evaluation suite last week. First run: $3.40. Every subsequent run: $0.00. Over a week of iterative prompt tuning, easily 60 re-runs, that is about $200 saved on a single dev loop. The cache file sits at 47 MB.
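The arithmetic behind that figure is short enough to check by hand. This sketch uses only the two numbers quoted above ($3.40 per full run, 60 re-runs) and works in cents to avoid floating-point noise. It assumes the best case, where every re-run is a complete cache hit; a real loop lands lower because each edited prompt is a miss.

```python
COST_PER_FULL_RUN_CENTS = 340  # $3.40 for one full, uncached run
RERUNS = 60                    # re-runs over the week

# Without a cache, the first run and every re-run are billed in full.
uncached_cents = COST_PER_FULL_RUN_CENTS * (1 + RERUNS)

# Best case with a cache: only the first run is billed.
cached_cents = COST_PER_FULL_RUN_CENTS

saved_cents = uncached_cents - cached_cents
print(f"saved ${saved_cents / 100:.2f}")  # saved $204.00
```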

The Gotchas

1. The cache key must include everything that affects output. Forget temperature in the hash and a run at temperature 0.2 silently returns completions that were generated at 0.7. Include the model name, every sampling parameter, and the full message array.

2. Do not cache streaming responses. diskcache does not know how to serialize a stream. Either cache only the non-streaming codepath or cache the final assembled string. Streaming + disk cache is a footgun.

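3. PII in prompts is a real concern. The wrapper stores only a hash of each request, but the responses sit unencrypted on disk and often echo what was in the prompt. If your prompts contain customer data, PHI, or secrets, encrypt the cache directory or skip caching for those requests. If you redact, redact the prompt itself before you send it, so the key and the stored completion both come from the redacted text.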

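4. Cache size needs a deliberate limit. diskcache caps a cache at 1 GB by default and evicts the least-recently-stored entries beyond that, so the trouble starts when you raise size_limit and never expire anything. Set expire= on every cache.set. Or run a cron that prunes entries older than 30 days. A 10 GB cache of stale completions is a real failure mode I have seen in production.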

5. Model upgrades break the cache silently. If Anthropic ships a new Claude version behind an alias you already use, the model string in your code does not change, the key does not change, and the cache keeps returning completions from the old model. Pin a specific model version explicitly and namespace your keys by version: f"v2:{key}". Bump the prefix on every upgrade.
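Gotchas 2 and 5 fit in one small sketch: prefix the key with a version, and when you stream, cache the assembled string instead of the stream. Everything named here (`KEY_VERSION`, `versioned_key`, `cached_stream_text`, `fake_stream`) is a hypothetical helper for illustration. The cache is a plain dict so the example runs offline; a diskcache.Cache accepts the same `key in cache` and `cache[key]` operations.

```python
import hashlib
import json
from typing import Callable, Iterable, MutableMapping

KEY_VERSION = "v2"  # assumed prefix; bump it on every model upgrade


def versioned_key(model: str, messages: list, **kwargs) -> str:
    """Request hash with a version prefix such as v2: in front."""
    payload = json.dumps(
        {"model": model, "messages": messages, **kwargs}, sort_keys=True
    )
    digest = hashlib.sha256(payload.encode()).hexdigest()
    return f"{KEY_VERSION}:{digest}"


def cached_stream_text(
    cache: MutableMapping,
    key: str,
    open_stream: Callable[[], Iterable[str]],
) -> str:
    """Return the full streamed text, opening the stream only on a miss."""
    if key in cache:
        return cache[key]
    text = "".join(open_stream())  # drain the stream into one string
    cache[key] = text              # store the plain string, never the stream
    return text


def fake_stream() -> Iterable[str]:
    """Stand-in for a provider stream that yields text deltas."""
    yield from ["Cache ", "the ", "string."]


store: dict = {}
key = versioned_key(
    "some-model", [{"role": "user", "content": "hi"}], temperature=0.0
)
first = cached_stream_text(store, key, fake_stream)   # miss: drains the stream
second = cached_stream_text(store, key, fake_stream)  # hit: stream never opened

print(first == second, key.startswith("v2:"))  # True True
```

With diskcache you would likely call `cache.set(key, text, expire=...)` instead of the item assignment, so the entry keeps its TTL.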

The Take

The biggest cost optimization in your dev workflow is not a cheaper model or a smarter router. It is not paying for the same completion twice. Twenty lines of Python, a cache of about 50 MB on disk, and you have cut your dev loop spend by 80% for the rest of the quarter, because every re-run of an unchanged request is free. Add it to your eval harness today.

— Mr. Technology