Prompt caching is built into the Claude API and CLI, but almost nobody uses it for local workflows. Here is the exact one-file pattern that cuts your token bill and latency by 60-80% on repetitive tasks — and the three gotchas that will bite you if you do not read the docs.

Tutorial: Stop Paying to Resend the Same 50KB of Context Every Claude Request

Hey guys, Mr. Technology here.

If you are using Claude Code or the direct API for anything repetitive — reviewing the same codebase every morning, running the same analysis on new files, batch-processing a folder of PR descriptions — you are almost certainly resending the same context every time. The file contents. The system prompt. The project background. Fifty kilobytes of stuff you sent ten minutes ago, being sent again, at full price.

Prompt caching solves this. It has been in the Anthropic API since August 2024 (as a beta at first), and Claude Code uses it automatically. The problem is the docs are scattered and it is easy to assume the wrong defaults. Here is the pattern that works, with the three gotchas nobody puts in the changelog.

What Prompt Caching Actually Is

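The API accepts a cache_control field on content blocks: tool definitions, system blocks, and message content. When you mark a block with cache_control: { type: "ephemeral" }, the API caches the whole prompt prefix up to and including that block for 5 minutes by default, and the timer resets every time the prefix is reused. You still send the full prompt on every call. The saving is that a byte-identical prefix is read from cache instead of being reprocessed, and those tokens are billed at the cache-read rate (about a tenth of the normal input price at the time of writing). There is no cache_hit flag: the response's usage object reports cache_creation_input_tokens on a write and cache_read_input_tokens on a hit. The latency drop is real, because the cached prefix does not have to be processed again.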

The catch: the first call with a new cache is slightly more expensive than a normal call, because the cache write is billed at a premium (about 1.25x the base input price for the default lifetime). A single reuse already more than pays that back. If you are sending the same context more than once inside the cache lifetime, caching wins.
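To make the break-even concrete, here is a back-of-the-envelope sketch. It assumes the multipliers from Anthropic's published pricing at the time of writing (cache write at 1.25x the base input price for the default lifetime, cache read at 0.1x). The function name `relative_input_cost` and its parameters are made up for this example, and it ignores the uncached tail of each prompt and all output tokens.

```python
# Rough cost model for a cached prefix, in units of "one uncached send".
# Assumed multipliers (check current pricing): write = 1.25x, read = 0.10x.
CACHE_WRITE_MULTIPLIER = 1.25
CACHE_READ_MULTIPLIER = 0.10


def relative_input_cost(calls: int, cached: bool) -> float:
    """Cost of sending the same prefix `calls` times, all inside the cache lifetime."""
    if calls <= 0:
        return 0.0
    if not cached:
        return float(calls)
    # One write on the first call, then a read on every later call.
    return CACHE_WRITE_MULTIPLIER + (calls - 1) * CACHE_READ_MULTIPLIER


for n in (1, 2, 3, 10):
    plain = relative_input_cost(n, cached=False)
    with_cache = relative_input_cost(n, cached=True)
    print(f"{n:>2} calls: uncached {plain:.2f}, cached {with_cache:.2f}")
```

Under those assumptions a single call costs more with caching (1.25 against 1.00), and two calls already cost less (1.35 against 2.00).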

The CLI Workflow (Claude Code)

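Claude Code applies prompt caching automatically. As far as I can tell from the CLI reference there is no --cache flag and no cache key in ~/.claude/settings.json to switch on, so your job is only to keep the reusable part of the prompt byte-identical between runs. The practical pattern for a recurring local task is a context file: a single file that holds the reusable instructions, passed in by a shell wrapper. The .mdc extension and the front matter below are a Cursor rules convention; Claude Code just receives them as text.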

Step 1: Create a reusable context file

markdown

---
description: Codebase review cache
globs: ["**/*.ts", "**/*.tsx", "**/*.py"]
alwaysApply: true
---
You are reviewing a codebase for: security issues, performance bottlenecks,
deprecated API usage, missing error handling, and unclear naming.
Rules:
- Flag any `eval()`, `innerHTML`, `dangerouslySetInnerHTML`
- Flag N+1 query patterns in ORM code
- Flag synchronous operations in async functions
- Do not suggest architecture changes. Focus on what is wrong today.
- Output: numbered list of findings, severity (HIGH/MEDIUM/LOW), file:line

Save this as ~/.claude/contexts/review.mdc.

Step 2: Write the shell wrapper

bash

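#!/bin/bash
# review-code.sh: run a Claude Code review on a file or folder with a fixed, reusable prompt
# Caching is automatic; --append-system-prompt keeps the shared instructions identical
# on every run so the prefix can be reused.
TARGET="${1:-.}"
claude --print --append-system-prompt "$(cat ~/.claude/contexts/review.mdc)" \
  "Review this code: $TARGET"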

Step 3: Run it twice and watch the difference

bash

chmod +x review-code.sh
./review-code.sh src/auth/login.ts     # first call: prefix is written to the cache
./review-code.sh src/auth/session.py   # second call, within ~5 min: shared prefix can be read from cache

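To confirm the hit, add --output-format json to the claude command and look at the usage section of the result: on the second call you should see a non-zero cache_read_input_tokens instead of a large cache_creation_input_tokens. There is no cache_hit: true field, and the total token count does not shrink; the cached tokens are simply billed at the lower cache-read rate. The cache stays warm for about 5 minutes after its last use before it expires.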

The Three Gotchas

Gotcha 1: Cache expires silently between calls. The default lifetime is 5 minutes, and the server-side timer resets on every cache hit. If you run the review script, go to lunch, and come back, the next call pays the cache-write price again: no warning, no error, just a cache miss. On the API you can ask for a 1-hour lifetime by adding "ttl": "1h" to cache_control, at a higher write price. I could not find a --cache-ttl flag in Claude Code; the closest thing there is continuing the same conversation, which keeps the prefix identical but does not extend the lifetime:

bash

# Continue the most recent conversation instead of starting a new one
claude --continue
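Because a miss is silent, one way to avoid surprises is to track when the prefix was last used and assume the cache is cold once the lifetime has passed. This is a minimal sketch, assuming a 5-minute default lifetime that is refreshed on every hit. `CacheClock` is a hypothetical helper for this post, not part of any SDK, and it only estimates what the server is doing.

```python
import time
from typing import Optional

DEFAULT_TTL_SECONDS = 5 * 60  # assumed default lifetime; verify against current docs


class CacheClock:
    """Local guess at whether a cached prefix is still warm."""

    def __init__(self, ttl_seconds: float = DEFAULT_TTL_SECONDS) -> None:
        self.ttl_seconds = ttl_seconds
        self.last_used: Optional[float] = None

    def mark_used(self, now: Optional[float] = None) -> None:
        # Call this after every request that sent the cached prefix.
        self.last_used = time.monotonic() if now is None else now

    def probably_warm(self, now: Optional[float] = None) -> bool:
        if self.last_used is None:
            return False
        current = time.monotonic() if now is None else now
        return (current - self.last_used) < self.ttl_seconds


clock = CacheClock()
clock.mark_used(now=0.0)
print(clock.probably_warm(now=120.0))   # True: two minutes later
print(clock.probably_warm(now=3600.0))  # False: after a long lunch
```

A wrapper could consult this before a batch and decide whether to send one cheap warm-up call first, or simply log that the next call will be a cache write.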

Gotcha 2: Caching applies to what you send, not to what streams back. If you are using the SSE stream endpoint, cache_control only affects the request blocks; the streamed response tokens are never cached and are billed as normal output. Streaming itself behaves the same either way. A cache hit just means the first token tends to arrive sooner, not that the stream is truncated, and the cache counters show up in the usage on the message_start event. Some teams confuse this with a streaming bug. It is not a bug.
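To see what caching did on a streamed call, look at the usage numbers rather than the text. The sketch below assumes the raw server-sent events have already been parsed into dictionaries and that the cache counters arrive on the `message_start` event. The helper `cache_usage_from_events` and the sample events are invented for illustration.

```python
from typing import Iterable, Optional


def cache_usage_from_events(events: Iterable[dict]) -> Optional[dict]:
    """Return the cache counters from the first message_start event, if any."""
    for event in events:
        if event.get("type") != "message_start":
            continue
        usage = event.get("message", {}).get("usage", {})
        return {
            "cache_creation_input_tokens": usage.get("cache_creation_input_tokens") or 0,
            "cache_read_input_tokens": usage.get("cache_read_input_tokens") or 0,
            "input_tokens": usage.get("input_tokens") or 0,
        }
    return None


# Hand-written sample events, not real API output.
sample_events = [
    {"type": "message_start",
     "message": {"usage": {"input_tokens": 12,
                           "cache_creation_input_tokens": 0,
                           "cache_read_input_tokens": 3000}}},
    {"type": "content_block_delta", "delta": {"type": "text_delta", "text": "Finding 1"}},
    {"type": "message_stop"},
]
print(cache_usage_from_events(sample_events))
```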

Gotcha 3: The cache is a prefix, not a set of independent blocks. The prompt is cached in order (tools, then system, then messages), and a cache_control marker covers everything up to and including the block it sits on. Change anything earlier in the prompt and every breakpoint after it misses. System blocks do not get their own timer or their own hit counter. If your workflow is "one system prompt, many file contents," keep the stable material (review criteria, project background) first with a breakpoint at the end of it, and put the file that changes last. You get at most four breakpoints per request, so do not wrap everything in one big user message and expect the shared part to be reused.
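Here is one way to lay out a "same criteria, many files" request so the shared part stays a stable prefix: criteria and project background go in the system field with a cache marker on the last shared block, and the file that changes goes last in the user message. This is a sketch only. `build_review_request` and its arguments are hypothetical names, the returned dictionary is meant to be passed as keyword arguments to `client.messages.create`, and the model ID is a placeholder.

```python
def build_review_request(model: str, criteria: str, project_context: str,
                         file_text: str, max_tokens: int = 1024) -> dict:
    """Build Messages API kwargs with the shared context first and the file last."""
    system_blocks = [
        {"type": "text", "text": criteria},
        {
            "type": "text",
            "text": project_context,
            # Marker on the last shared block: everything up to here is the cached prefix.
            "cache_control": {"type": "ephemeral"},
        },
    ]
    user_content = [
        # The part that changes per call goes after the cached prefix.
        {"type": "text", "text": file_text},
        {"type": "text", "text": "Review this file"},
    ]
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": system_blocks,
        "messages": [{"role": "user", "content": user_content}],
    }


req_a = build_review_request("your-model-id", "criteria...", "background...", "file A")
req_b = build_review_request("your-model-id", "criteria...", "background...", "file B")
print(req_a["system"] == req_b["system"])  # True: the shared prefix is identical
```

Only the user message differs between the two requests, so the second one can reuse whatever the first one cached, provided the shared text is long enough to be cacheable at all.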

The API Equivalent

python

import pathlib

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder model ID; replace with a model your account can use.
MODEL = "claude-opus-4.8"

# Note: a prefix shorter than the model's minimum cacheable length is not cached,
# so in practice the system text needs to be much longer than this stub.
system_block = {
    "type": "text",
    "text": "You are reviewing code for security issues...",
    "cache_control": {"type": "ephemeral"},
}
file_content = {
    "type": "text",
    "text": pathlib.Path("src/auth/login.ts").read_text(),
    "cache_control": {"type": "ephemeral"},
}
request = dict(
    model=MODEL,
    max_tokens=1024,
    system=[system_block],
    messages=[{"role": "user", "content": [file_content, {"type": "text", "text": "Review this file"}]}],
)

# First call: the prefix is written to the cache
msg1 = client.messages.create(**request)
# Second call with the identical prefix: read from the cache at the cache-read rate
msg2 = client.messages.create(**request)

# input_tokens only counts the uncached tail; the cache counters tell the real story
for name, msg in (("msg1", msg1), ("msg2", msg2)):
    print(f"{name}: input={msg.usage.input_tokens} "
          f"cache_write={msg.usage.cache_creation_input_tokens} "
          f"cache_read={msg.usage.cache_read_input_tokens}")
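To check whether a call wrote or read the cache, compare the cache counters on the usage object instead of eyeballing totals. A small sketch: `describe_cache_use` is a made-up helper that accepts anything with `cache_creation_input_tokens` and `cache_read_input_tokens` attributes (the field names the Messages API reports), so it can be tried here with a stand-in object.

```python
from types import SimpleNamespace


def describe_cache_use(usage) -> str:
    """Classify a usage object as a cache read, a cache write, or neither."""
    read = getattr(usage, "cache_read_input_tokens", None) or 0
    written = getattr(usage, "cache_creation_input_tokens", None) or 0
    if read > 0:
        return f"cache read: {read} tokens served from cache"
    if written > 0:
        return f"cache write: {written} tokens stored"
    return "no caching: prefix too short, marker missing, or prefix changed"


# Stand-in values for illustration, not real API output.
first = SimpleNamespace(input_tokens=20, cache_creation_input_tokens=3000, cache_read_input_tokens=0)
second = SimpleNamespace(input_tokens=20, cache_creation_input_tokens=0, cache_read_input_tokens=3000)
print(describe_cache_use(first))
print(describe_cache_use(second))
```

With the API example above you would call `describe_cache_use(msg1.usage)` and `describe_cache_use(msg2.usage)`.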

When This Is Worth It

Prompt caching wins on any workflow where the context is large and the task is repeated. Code review (same review criteria applied to many files), bug triage (same codebase context applied to new bug reports), refactoring planning (same architectural context applied across a migration), onboarding analysis (same codebase overview applied to new PRs).

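It does not win on single-shot tasks, short conversations, or anything where the shared context is below the minimum cacheable length (1,024 tokens on many models, higher on some). Below that the cache_control marker is accepted but nothing is cached, so you save nothing.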

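The savings compound fast on high-volume workflows, because every cached token is billed at roughly a tenth of the normal input price. If you are running Claude on 200 files a day, caching the review criteria and project context across all 200 calls is the difference between burning through your token budget and finishing the week under quota, as long as the calls land close enough together to keep the cache warm.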

One file. One cache block. Run it twice. The math works.

— Mr. Technology

*Anthropic prompt caching docs: docs.anthropic.com/en/docs/build-with-claude/prompt-caching. Claude Code applies caching automatically; I am not aware of a --cache or --cache-ttl flag, or of a cache key in ~/.claude/settings.json. Cache lifetime: 5 minutes by default, refreshed on each hit, with a 1-hour option on the API via "ttl": "1h" in cache_control. Expiry is server-side and silent. The ephemeral cache is not tied to a CLI session; what matters is sending an identical prefix inside the lifetime.*