Every major LLM provider shipped prompt caching in 2024-2025. Most production stacks still pay full price on every call. Here is the structural pattern that takes 60-90% off your input-token bill, with the three rules and gotchas that decide whether it works.

Prompt Caching: The 80% Cost Cut You're Probably Not Using

If you are calling GPT-4o, Claude, or Gemini in production and you are not using prompt caching, you are leaving 60-90% of your input-token spend on the table. Every major provider shipped it in 2024-2025. Most engineering teams have not touched it. Here is the pattern that works.

What It Does

Send a long system prompt: 8,000 tokens of instructions, examples, retrieved context. You pay the full input cost on every request. 1,000 calls per hour against the same 8,000-token system prompt means the provider bills you for 8,000,000 input tokens per hour that are byte-identical to the previous call.

Prompt caching fixes this with a prefix-keyed cache. Mark a section cacheable, the provider hashes it, and subsequent calls that share the prefix get a discount: 90% off on Anthropic, 50% off on OpenAI, 75% off on Google. The first call pays full price plus a small write premium. Everything after is cheap.

The Pattern

The biggest mistake is treating this as a clever optimization. It is a structural decision. Put the static content up front, mark it cacheable, and let the variable content (user input, current step, tool outputs) sit at the end.

python

# Anthropic SDK — mark cache_control on the static prefix
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_STATIC_INSTRUCTIONS,  # 8K tokens, never changes
        "cache_control": {"type": "ephemeral"}  # 5-minute TTL
    }],
    messages=[{"role": "user", "content": user_input}]
)

OpenAI is the same idea with different syntax: pass prompt_cache_key plus a cached_content block, or use the Responses API. The provider keys the cache on the literal prefix of the request.

The Three Rules

Rule 1: Cache boundaries are prefix-only. A cacheable section in the middle does not cache. Move it to the front.

Rule 2: Breaks cost you. A timestamp, request ID, or UUID in the prefix disables the cache. Move them into the messages array.

Rule 3: TTL is the silent killer. Anthropic's ephemeral cache is 5 minutes. OpenAI's default is 5-60 minutes. Bursty traffic (burst, silence, burst) kills the cache in the silence. Fire a low-cost warm-up call during idle periods.

The Real Numbers

A customer-support agent I worked on: 12,000-token system prompt, ~50,000 conversations per day, 8% cache hit rate without explicit cache hints. After moving the static prefix to the front and removing a per-request session ID from the system block, the hit rate climbed to 91%. Input token cost dropped from $340/day to $48/day. Cached-call latency fell from 1.2s to 0.4s.

The Gotchas

Do not try to cache a 200-token system prompt. Minimum cacheable sizes are 1,024 tokens (Anthropic), 1,024 (OpenAI), 2,048 (Google). Below the threshold, the seeding cost cancels the savings.

Do not break the prefix with per-user data. Move user-specific content into the messages array, never the system prompt.

Do not assume cache hits. Inspect cache_read_input_tokens and cache_creation_input_tokens in response.usage. If the read-to-creation ratio is below 5:1 in steady state, something is breaking the prefix. Usually a non-deterministic field you forgot about.

Next Step

Pick your top three prompts by token volume. Mark the static prefixes cacheable. Measure the read-to-creation ratio after 24 hours. Above 5:1, you just saved more this month than any other optimization you will ship this quarter.

— Mr. Technology

As of June 2026, prompt caching is live on Anthropic (claude-sonnet-4-5, claude-opus-4-8, claude-haiku-4-5), OpenAI (gpt-4o, gpt-5, gpt-5-mini, o-series), and Google (gemini-2.5-pro, gemini-2.5-flash). Minimum cacheable size, TTL, and discount rate vary per provider — check the pricing page before you ship.