
Hey guys, Mr. Technology here. Quick tutorial this week. If you are calling the Claude API and you are not using prompt caching, you are lighting money on fire. Anthropic's prompt caching cuts your bill by up to 90% on cache hits and adds essentially no latency to subsequent requests. Five minutes to add, immediate bill cut. Here is the pattern.
Most production prompts look like this:
You are paying full input price on every API call. Every. Single. Time. For content that has not changed in months. I have audited agent stacks that were billing 70% of their input cost on content the model had already seen 10,000 times that day. That is the easiest money you will ever save.
Add a cache_control block to the content you want cached. Anthropic caches the prefix server-side. Subsequent requests that match the cached prefix get a 90% discount on those tokens, with a small surcharge for cache writes (5 minutes TTL by default).
The economics: without caching you pay $3/M input tokens. With caching the first request pays $3.75/M for the cache write, every subsequent request pays $0.30/M for the cache read. On a 5,000-token cached block × 1,000 requests/day, that is roughly an 80-90% cost cut on the cached portion. The model is identical, the output is identical, the latency on the cache hit is the same as a normal request.
Three lines of difference between the un-cached and cached versions. Same call, same response shape, same model.
```python import anthropic
client = anthropic.Anthropic()
message = client.messages.create( model="claude-opus-4-8", max_tokens=1024, system="Long system prompt here that never changes...", messages=[{"role": "user", "content": "User question"}] )
message = client.messages.create( model="claude-opus-4-8", max_tokens=1024, system=[ { "type": "text", "text": "Long system prompt here that never changes...", "cache_control": {"type": "ephemeral"} # <-- THIS LINE } ], messages=[{"role": "user", "content": "User question"}] ) ```
Same call. One added field. Done.
1. System prompt caching. Mark your system prompt with cache_control. Every request in the session hits the cache. For long-lived agent loops, bump the TTL to one hour with {"type": "ephemeral", "ttl": "1h"} — saves you on the 5-minute eviction when an agent session idles.
2. Long-document caching for RAG and code agents. If you are running retrieval over a single large document, or asking the same model to reason over a fixed codebase or a long session transcript, attach it once with a cache breakpoint and query it many times. Subsequent queries against the same document hit the cache at the $0.30/M rate.
python messages = [{ "role": "user", "content": [ { "type": "text", "text": open("big_doc.txt").read(), "cache_control": {"type": "ephemeral", "ttl": "1h"} }, { "type": "text", "text": "What does section 3 say about agent memory?" } ] }]
You can stack up to four cache breakpoints in a single request, so a typical RAG pipeline caches the system prompt, the long document, and a few-shot block independently.
Anthropic returns usage with cache stats on every response. Print it and confirm the cache is hitting:
```python print(message.usage)
#
`
If cache_read_input_tokens is non-zero on the second request, caching is working. If it is zero every time, your prefix is changing between calls — usually a timestamp, a random ID, or a newlines/whitespace difference sneaking into the system prompt. Stable content only.
ttl: "1h" explicitly or accept the re-write cost.Add cache_control to your system prompts today. Five minutes of work, 80% bill cut on the cached portion. If you are running any prompt over 1024 tokens more than once — and you are, because every agent loop and every RAG query does — there is no reason not to do this. The only real cost is the discipline of keeping your cached content stable across requests. Add a unit test that asserts the cached prefix is byte-identical across calls and the cost cut is permanent.
— Mr. Technology
*References: Anthropic Prompt Caching documentation (https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching); Anthropic Python SDK 0.40+ supports cache_control on system, user, and assistant content blocks; cache breakpoints support up to 4 per request; ephemeral TTL options: 5m (default) and 1h; pricing: cache writes $3.75/M input tokens, cache reads $0.30/M input tokens, regular input $3.00/M (Opus 4.8 reference pricing, June 2026); minimum cacheable prefix is 1024 tokens; cache scope is per-workspace.*