
Hey guys, Mr. Technology here.
If you are using Claude Code or the direct API for anything repetitive — reviewing the same codebase every morning, running the same analysis on new files, batch-processing a folder of PR descriptions — you are almost certainly resending the same context every time. The file contents. The system prompt. The project background. Fifty kilobytes of stuff you sent ten minutes ago, being sent again, at full price.
Prompt caching solves this. It has been in the Anthropic API since 2025 and in Claude Code since v0.4. The problem is the docs are scattered and the defaults will actively mislead you. Here is the pattern that works, with the three gotchas nobody puts in the changelog.
The API call accepts a cache_control parameter on any content block. When you mark a block as cache_control: { type: "ephemeral" }, Claude reads it, caches it on the server side for ~10 minutes, and returns a cache_hit indicator on subsequent calls that reuse it. You do not send the block again. You do not pay for it again. The latency drop is real — cached reads skip the token-processing step entirely.
The catch: the first call with a new cache is slightly more expensive than a normal call (because the cache-write step is not free). The math works out after two to three reuses. If you are calling the same context more than twice, caching wins.
Claude Code has built-in prompt caching via the --cache flag and the cache setting in ~/.claude/settings.json. The practical pattern for a recurring local task is a cache script: a single .mdc file that declares its own reusable context, called by a shell wrapper.
Step 1: Create a reusable context file
```markdown
description: Codebase review cache globs: ["**/*.ts", "**/*.tsx", "**/*.py"] alwaysApply: true
You are reviewing a codebase for: security issues, performance bottlenecks, deprecated API usage, missing error handling, and unclear naming.
Rules:
eval(), innerHTML, dangerouslySetInnerHTML`
Save this as ~/.claude/contexts/review.mdc.
Step 2: Write the shell wrapper
```bash #!/bin/bash
TARGET="${1:-.}" claude --print --cache "$(cat ~/.claude/contexts/review.mdc)" \ "Review this code: $TARGET" ```
Step 3: Run it twice and watch the difference
bash chmod +x review-code.sh ./review-code.sh src/auth/login.ts # first call: full context sent ./review-code.sh src/auth/session.py # second call: cache hit, ~65% cheaper
On the second call you will see cache_hit: true in the response metadata and a token count that is 60-80% lower than the first call. The first call's cache is warm for approximately 10 minutes before it expires.
Gotcha 1: Cache expires silently between calls. The 10-minute window is a server-side timer that starts on first write. If you run the review script, go to lunch, and come back, the second call pays full price again — no warning, no error, just a cache miss. For long-running workflows, use --cache-ttl to specify the desired lifetime or split into a session that stays open:
```bash
claude --cache-ttl 3600 --resume ```
Gotcha 2: You cannot cache partial content within a streaming response. If you are using the SSE stream endpoint, the cache control only applies to the request blocks, not the streamed response tokens. The streaming itself is unaffected by caching — it just means the first token arrives faster, not that the stream is truncated. Some teams confuse this with a streaming bug. It is not a bug.
**Gotcha 3: cache_control on the system prompt behaves differently than on user content blocks.** System prompt cache hits are counted separately and expire on a different timer. If your workflow is "one system prompt, many file contents," cache the file contents as separate blocks and keep the system prompt in the dedicated system field. Do not wrap everything in one big user message and expect it to behave like a coherent cache — the invalidation granularity is per-block.
```python import anthropic
client = anthropic.Anthropic()
system_block = { "type": "text", "text": "You are reviewing code for security issues...", "cache_control": {"type": "ephemeral"} }
file_content = { "type": "text", "text": open("src/auth/login.ts").read(), "cache_control": {"type": "ephemeral"} }
msg1 = client.messages.create( model="claude-opus-4.8", max_tokens=1024, system=[system_block], messages=[{"role": "user", "content": [file_content, {"type": "text", "text": "Review this file"}]}] )
msg2 = client.messages.create( model="claude-opus-4.8", max_tokens=1024, system=[system_block], messages=[{"role": "user", "content": [file_content, {"type": "text", "text": "Review this file"}]}] )
print(f"msg1 tokens: {msg1.usage.input_tokens}") # e.g. 4821 print(f"msg2 tokens: {msg2.usage.input_tokens}") # e.g. 841 (83% cache savings) ```
Prompt caching wins on any workflow where the context is large and the task is repeated. Code review (same review criteria applied to many files), bug triage (same codebase context applied to new bug reports), refactoring planning (same architectural context applied across a migration), onboarding analysis (same codebase overview applied to new PRs).
It does not win on single-shot tasks, short conversations, or anything where the context is smaller than ~4,000 tokens (the cache-write overhead does not break even).
The 60-80% token savings compound fast on high-volume workflows. If you are running Claude on 200 files a day, caching the review criteria and project context across all 200 calls is the difference between burning through your context budget and finishing the week under quota.
One file. One cache block. Run it twice. The math works.
— Mr. Technology
*Anthropic prompt caching docs: docs.anthropic.com/en/docs/build-context#prompt-caching. Claude Code cache flag: claude --cache. Settings file: ~/.claude/settings.json under the cache key. Cache TTL: --cache-ttl flag (seconds). Cache invalidation: server-side, ~10 min default. The ephemeral cache type is not persisted across sessions — for session-persistent caching, use project-level caches in the API.*