Anthropic's prompt caching cuts input costs by up to 90% on cache hits with one added field. Real code, real numbers, and the gotchas nobody mentions. Five minutes to add, immediate bill cut, and most teams in production are not using it yet.

Cut Your Claude API Bill 80% in 5 Minutes: A Practical Prompt Caching Tutorial

Hey guys, Mr. Technology here. Quick tutorial this week. If you are calling the Claude API and you are not using prompt caching, you are lighting money on fire. Anthropic's prompt caching cuts your bill by up to 90% on cache hits and adds essentially no latency to subsequent requests. Five minutes to add, immediate bill cut. Here is the pattern.

The Problem You Are Probably Paying For

Most production prompts look like this:

A 200-token system prompt that barely changes
A 5,000-token codebase file, PDF, or schema that is identical on every request
A 2,000-token few-shot examples block
A 200-token user question that is the only thing that actually changes

You are paying full input price on every API call. Every. Single. Time. For content that has not changed in months. I have audited agent stacks that were billing 70% of their input cost on content the model had already seen 10,000 times that day. That is the easiest money you will ever save.

The Fix Is One Field

Add a cache_control block to the content you want cached. Anthropic caches the prefix server-side. Subsequent requests that match the cached prefix get a 90% discount on those tokens, with a small surcharge for cache writes (5 minutes TTL by default).

The economics: without caching you pay $3/M input tokens. With caching the first request pays $3.75/M for the cache write, every subsequent request pays $0.30/M for the cache read. On a 5,000-token cached block × 1,000 requests/day, that is roughly an 80-90% cost cut on the cached portion. The model is identical, the output is identical, the latency on the cache hit is the same as a normal request.

The Code (Anthropic Python SDK)

Three lines of difference between the un-cached and cached versions. Same call, same response shape, same model.

python

import anthropic
client = anthropic.Anthropic()
# BEFORE: no caching
message = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    system="Long system prompt here that never changes...",
    messages=[{"role": "user", "content": "User question"}]
)
# AFTER: with prompt caching
message = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "Long system prompt here that never changes...",
            "cache_control": {"type": "ephemeral"}  # <-- THIS LINE
        }
    ],
    messages=[{"role": "user", "content": "User question"}]
)

Same call. One added field. Done.

The Two Cache Strategies That Actually Work

1. System prompt caching. Mark your system prompt with cache_control. Every request in the session hits the cache. For long-lived agent loops, bump the TTL to one hour with {"type": "ephemeral", "ttl": "1h"} — saves you on the 5-minute eviction when an agent session idles.

2. Long-document caching for RAG and code agents. If you are running retrieval over a single large document, or asking the same model to reason over a fixed codebase or a long session transcript, attach it once with a cache breakpoint and query it many times. Subsequent queries against the same document hit the cache at the $0.30/M rate.

python

messages = [{
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": open("big_doc.txt").read(),
            "cache_control": {"type": "ephemeral", "ttl": "1h"}
        },
        {
            "type": "text",
            "text": "What does section 3 say about agent memory?"
        }
    ]
}]

You can stack up to four cache breakpoints in a single request, so a typical RAG pipeline caches the system prompt, the long document, and a few-shot block independently.

Verify It Worked

Anthropic returns usage with cache stats on every response. Print it and confirm the cache is hitting:

python

print(message.usage)
# First request:
#   input_tokens: 200
#   cache_creation_input_tokens: 5000
#   cache_read_input_tokens: 0
#
# Second request (5 minutes later, same prefix):
#   input_tokens: 200
#   cache_creation_input_tokens: 0
#   cache_read_input_tokens: 5000   <-- this is the bill cut
#   output_tokens: 350

If cache_read_input_tokens is non-zero on the second request, caching is working. If it is zero every time, your prefix is changing between calls — usually a timestamp, a random ID, or a newlines/whitespace difference sneaking into the system prompt. Stable content only.

The Gotchas Nobody Mentions

Byte-identical prefixes are required. Newlines, trailing spaces, dynamic timestamps — if it changes, the cache misses. Build your prompts from a stable template and inject the variable parts at the end.
Minimum cache size is 1024 tokens. Smaller than that and Anthropic will not cache it. System prompts almost always qualify. Short user messages do not.
Default TTL is 5 minutes. Long-running agent workflows and CI jobs should set ttl: "1h" explicitly or accept the re-write cost.
Cache is per-organization, not global. Two different API keys in two different projects have separate caches. Do not expect dev and prod to share.
Cache writes cost slightly more than uncached reads. The math only works if the same prefix is reused multiple times. If you are calling once, caching does not help.

The Take

Add cache_control to your system prompts today. Five minutes of work, 80% bill cut on the cached portion. If you are running any prompt over 1024 tokens more than once — and you are, because every agent loop and every RAG query does — there is no reason not to do this. The only real cost is the discipline of keeping your cached content stable across requests. Add a unit test that asserts the cached prefix is byte-identical across calls and the cost cut is permanent.

— Mr. Technology

*References: Anthropic Prompt Caching documentation (https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching); Anthropic Python SDK 0.40+ supports cache_control on system, user, and assistant content blocks; cache breakpoints support up to 4 per request; ephemeral TTL options: 5m (default) and 1h; pricing: cache writes $3.75/M input tokens, cache reads $0.30/M input tokens, regular input $3.00/M (Opus 4.8 reference pricing, June 2026); minimum cacheable prefix is 1024 tokens; cache scope is per-workspace.*