
If you are calling GPT-4o, Claude, or Gemini in production with a large, mostly static prompt and you are not using prompt caching, you are leaving 50-90% of your input-token spend on the table, depending on the provider. Every major provider shipped it in 2024-2025. Most engineering teams have not touched it. Here is the pattern that works.
Say you send a long system prompt: 8,000 tokens of instructions, examples, and retrieved context. Without caching, you pay the full input cost on every request. At 1,000 calls per hour against the same 8,000-token system prompt, the provider bills you for 8,000,000 input tokens per hour that are byte-identical to the previous call.
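Prompt caching fixes this with a prefix-keyed cache. The provider stores the processed prefix, and subsequent calls that share it get a discount on those tokens: roughly 90% off on Anthropic, 50% off on OpenAI (GPT-4o), 75% off on Google, with the exact rate depending on the model. How you opt in differs. Anthropic needs an explicit `cache_control` marker and charges a write premium on the first call (25% over the base input price for the 5-minute cache). OpenAI caches automatically with no write premium. Google offers both an automatic (implicit) mode and explicit caches that are billed for storage. Everything after the first call is cheap.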
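To make the arithmetic concrete, here is a minimal sketch of the hourly cost of the shared prefix with and without caching. The $3.00 per million tokens price, the 25% write premium, and the 12 cache writes per hour (one per 5-minute window, a pessimistic guess) are illustrative assumptions, not quoted rates.

```python
def prefix_cost_per_hour(prefix_tokens, calls, price_per_mtok,
                         read_discount, write_premium, writes=1):
    """Dollar cost of the shared prefix for one hour of traffic.

    writes: calls that miss and (re)write the cache; the rest are cache reads.
    """
    per_call = prefix_tokens / 1_000_000 * price_per_mtok
    reads = calls - writes
    return (writes * per_call * (1 + write_premium)
            + reads * per_call * (1 - read_discount))

PRICE = 3.00  # hypothetical $ per million input tokens

# No caching: every call pays full price for the 8,000-token prefix.
uncached = prefix_cost_per_hour(8_000, 1_000, PRICE,
                                read_discount=0.0, write_premium=0.0)

# Caching with a 90% read discount and an assumed 25% write premium.
cached = prefix_cost_per_hour(8_000, 1_000, PRICE,
                              read_discount=0.90, write_premium=0.25, writes=12)

print(f"uncached ${uncached:.2f}/h, cached ${cached:.2f}/h")
# -> uncached $24.00/h, cached $2.73/h
```

Under these assumptions the prefix cost falls by about 89%. Swap in your own price, discount, and miss count to see where your workload lands.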
The biggest mistake is treating this as a clever optimization. It is a structural decision. Put the static content up front, mark it cacheable, and let the variable content (user input, current step, tool outputs) sit at the end.
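```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# LONG_STATIC_INSTRUCTIONS and user_input are assumed to be defined elsewhere.
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_STATIC_INSTRUCTIONS,        # 8K tokens, never changes
        "cache_control": {"type": "ephemeral"},  # 5-minute TTL by default
    }],
    messages=[{"role": "user", "content": user_input}],  # variable content goes last
)
```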
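OpenAI is the same idea with less syntax. Caching is automatic for prompts of 1,024 tokens or more, on both Chat Completions and the Responses API, so there is nothing to mark. The optional `prompt_cache_key` parameter is a routing hint that helps requests sharing a prefix land on the same cache. On Gemini, the explicit equivalent is a `cached_content` reference to a cache you create up front. In every case the provider keys the cache on the literal prefix of the request.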
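**Rule 1: Cache boundaries are prefix-only.** The cache key covers everything from the start of the request up to the marker, so a static section that sits after variable content never hits. Move it to the front.
**Rule 2: Breaks cost you.** A timestamp, request ID, or UUID in the prefix changes the key on every call, so every request is a miss. Move them into the messages array, after the cached prefix.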
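**Rule 3: TTL is the silent killer.** Anthropic's ephemeral cache lives for 5 minutes by default, and each hit resets the clock. A 1-hour TTL is available at a higher write price. OpenAI typically evicts a prefix after 5-10 minutes of inactivity and sometimes holds it for up to an hour. Bursty traffic (burst, silence, burst) kills the cache in the silence. Fire a low-cost warm-up call during idle periods, timed just inside the TTL.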
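The warm-up from Rule 3 needs a timing decision: fire too early and you pay for pointless calls, too late and the entry is already gone. A minimal sketch of that decision follows. It assumes a 300-second TTL that is refreshed by every cache hit. The name `should_warm` and the 30-second safety margin are hypothetical choices for illustration.

```python
def should_warm(last_hit_ts, now, ttl_s=300.0, margin_s=30.0):
    """Return True when the cached prefix is close to expiring but still alive.

    last_hit_ts: time of the last request that used the prefix (seconds).
    now:         current time on the same clock (seconds).
    """
    age = now - last_hit_ts
    return ttl_s - margin_s <= age < ttl_s

# Fixed timestamps so the behaviour is easy to follow.
print(should_warm(0.0, 100.0))  # fresh entry, nothing to do
print(should_warm(0.0, 280.0))  # inside the last 30 seconds, send a warm-up
print(should_warm(0.0, 300.0))  # already expired, a warm-up would be a full write
```

In a real scheduler you would poll this with a monotonic clock and, when it returns True, send a request with the same prefix and a tiny `max_tokens`. Each warm-up still pays the cached-read price for the whole prefix. If a read costs 0.1x and a write 1.25x the base price, keeping the cache warm beats letting it expire only while the silence lasts fewer than about 11 warm-up intervals.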
A customer-support agent I worked on: 12,000-token system prompt, ~50,000 conversations per day, 8% cache hit rate without explicit cache hints. After moving the static prefix to the front and removing a per-request session ID from the system block, the hit rate climbed to 91%. Input token cost dropped from $340/day to $48/day. Cached-call latency fell from 1.2s to 0.4s.
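A stray session ID in the system block is the kind of regression a unit test can catch before it reaches production. The sketch below hashes the system text of two requests and asserts that the hashes match. `build_request`, `prefix_fingerprint`, and the stand-in instructions are hypothetical names for illustration, and the hash is only a local proxy for what the provider compares, not its actual cache key.

```python
import hashlib
import uuid

# Stand-in for the real long, static instructions.
STATIC_INSTRUCTIONS = "You are a support agent. Follow the policy below.\n" * 50

def build_request(user_input, session_id):
    """Static text goes in the system block; per-request values ride in the user message."""
    return {
        "system": [{
            "type": "text",
            "text": STATIC_INSTRUCTIONS,
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": [{
            "role": "user",
            "content": f"[session {session_id}]\n{user_input}",
        }],
    }

def prefix_fingerprint(request):
    """Hash of the system text that must be identical across calls for a cache hit."""
    text = "".join(block["text"] for block in request["system"])
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

a = build_request("Where is my order?", uuid.uuid4())
b = build_request("Cancel my plan.", uuid.uuid4())

# Different users and sessions, same prefix: this is what keeps the hit rate high.
assert prefix_fingerprint(a) == prefix_fingerprint(b)
```

Run a check like this in CI against your real request builder, and the next per-request field that slips into the system block fails the build instead of the cache.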
Do not try to cache a 200-token system prompt. Baseline minimum cacheable sizes are 1,024 tokens (Anthropic), 1,024 (OpenAI), 2,048 (Google), and some models require more. Below the threshold the prompt is simply processed without caching, so the marker buys you nothing.
Do not break the prefix with per-user data. Move user-specific content into the messages array, never the system prompt.
Do not assume cache hits. On Anthropic, inspect `cache_read_input_tokens` and `cache_creation_input_tokens` in `response.usage`. On OpenAI, look at `cached_tokens` under `usage.prompt_tokens_details`. If the read-to-creation ratio is below 5:1 in steady state, something is breaking the prefix, usually a non-deterministic field you forgot about.
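The read-to-creation ratio is easy to compute once you log the usage fields. Here is a minimal sketch, assuming each usage record has been collected as a plain dict carrying the two Anthropic field names. The function name and the sample numbers are made up for illustration.

```python
def read_to_creation_ratio(usages):
    """Ratio of cached-read tokens to cache-write tokens across a batch of calls."""
    reads = sum(u.get("cache_read_input_tokens") or 0 for u in usages)
    writes = sum(u.get("cache_creation_input_tokens") or 0 for u in usages)
    if writes == 0:
        # No writes at all: either everything hit, or nothing was cacheable.
        return float("inf") if reads else 0.0
    return reads / writes

# Illustrative log: one call writes a 12,000-token prefix, nine calls read it.
usages = [{"cache_creation_input_tokens": 12_000, "cache_read_input_tokens": 0}]
usages += [{"cache_creation_input_tokens": 0, "cache_read_input_tokens": 12_000}] * 9

ratio = read_to_creation_ratio(usages)
print(ratio)  # -> 9.0
```

A ratio of 0.0 with no reads and no writes usually means the prefix is under the minimum cacheable size or the marker is missing, which is a different problem from a broken prefix.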
Pick your top three prompts by token volume. Mark the static prefixes cacheable. Measure the read-to-creation ratio after 24 hours. Above 5:1, you just saved more this month than any other optimization you will ship this quarter.
— Mr. Technology
*As of June 2026, prompt caching is live on Anthropic (claude-sonnet-4-5, claude-opus-4-8, claude-haiku-4-5), OpenAI (gpt-4o, gpt-5, gpt-5-mini, o-series), and Google (gemini-2.5-pro, gemini-2.5-flash). Minimum cacheable size, TTL, and discount rate vary per provider — check the pricing page before you ship.*