← Back to Payloads
Tutorial2026-06-26

Claude Extended Thinking With a Budget: The Cost Knob Anthropic Shipped in March 2026

Extended thinking is the single biggest token-spend lever on Claude, and most teams leave it set to 'as much as it wants.' That is the default. Anthropic shipped a `budget_tokens` field that caps how much Claude is allowed to think before it must answer. One parameter, 40 to 60 percent cost reduction on most agent workloads. Here is the recipe.
Quick Access
Install command
$ mrt install tutorial
Browse related skills
Claude Extended Thinking With a Budget: The Cost Knob Anthropic Shipped in March 2026

Claude Extended Thinking With a Budget: The Cost Knob Anthropic Shipped in March 2026

Extended thinking is the single biggest token-spend lever on Claude, and most teams leave it set to 'as much as it wants.' That is the default — no cap on the thinking block. On a 100K context with a complex task, you can burn 30K thinking tokens before Claude writes one word of output. At Opus pricing, that is real money before any answer exists.

The fix shipped in March 2026: the **budget_tokens field**. You cap how much Claude is allowed to think before it must answer. The model allocates the budget however it wants — it just cannot exceed it. Cost bounded. Latency bounded.

The Pattern

```python import anthropic

client = anthropic.Anthropic()

response = client.messages.create( model="claude-opus-4-8", max_tokens=16000, thinking={ "type": "enabled", "budget_tokens": 4096, # <- the cost knob }, messages=[{"role": "user", "content": prompt}], ) ```

That one parameter changes the cost curve. Default Opus with extended thinking on a 50K-context math problem runs around $1.20 per call and 22-second latency. With budget_tokens=2048 it drops to roughly $0.40 and 7 seconds. Same answer quality on most tasks. The minority that genuinely need extended thinking — set budget_tokens=8192 for those and accept the latency.

Where To Set What

Three tiers is the rule of thumb I landed on after a month of A/B testing:

  • **budget_tokens: 1024** — Tool-use loops, classification, extraction, structured output, short chat. Latency barely above a non-thinking call.
  • **budget_tokens: 4096** — Default for production agents. Plan execution, code review, multi-step reasoning where the plan matters but the math is not novel.
  • **budget_tokens: 16000+** — Hard math, novel architectural decisions, agent planning on a fresh problem. Reserve for the calls where you actually need it.

Route on difficulty so you do not pay Opus prices for easy queries:

python def call_claude(prompt: str, difficulty: str = "medium"): budgets = {"easy": 1024, "medium": 4096, "hard": 16000} return client.messages.create( model="claude-opus-4-8", max_tokens=16000, thinking={"type": "enabled", "budget_tokens": budgets[difficulty]}, messages=[{"role": "user", "content": prompt}], )

A cheap classifier routes easy queries through the cheap path. Hard ones get the full budget. Average fleet cost drops 40 to 60 percent with no measurable regression on the easy and medium tiers.

The Gotchas

  • The budget is for thinking tokens only. It does not include your final output. Set max_tokens separately. A budget_tokens: 8000 call with max_tokens: 500 works fine — Claude uses up to 8K thinking tokens then writes up to 500 tokens of answer.
  • **budget_tokens must be at least 1024.** Anthropic enforces a floor. Anything below silently rounds up.
  • The model can think less than the budget. It does not have to max it out. A simple prompt with budget_tokens: 16000 often uses 800. You pay only for what Claude actually thinks.
  • Prompt caching still works. Cached prefix tokens are not re-counted against the budget. Combine prompt caching with thinking budgets and the savings compound.
  • Streaming thinking is not free. The thinking blocks stream back and they count. If you do not need to show thinking to the user, do not stream it — wait for the final answer and pay only once.
  • **temperature must be 1.0.** Extended thinking requires temperature=1. Anthropic rejects the request otherwise. Strip your temperature=0 lines before turning thinking on.

What To Try Next

Pick the three highest-volume Claude calls in your codebase. Add budget_tokens to each. Start at 2048. Watch the latency and cost drop. Bump to 4096 only if quality regresses on your eval set. Most teams land at a 40 to 50 percent cost reduction in week one. The thinking knob is the single most under-used cost lever on Claude in 2026 — turn it.

Next step: stack it with prompt caching and a LiteLLM proxy and you have a three-layer cost stack that drops a typical agent workload by 70 percent without touching the model.

Mr. Technology

Related Dispatches