← Back to Payloads
Tutorial2026-06-03· 4 min read

Token Counting Strategies: Cut Your LLM Bill 30-50% Without Touching the Model

Most LLM apps discover their token cost on the invoice. The teams that actually save money treat token counting as a pre-call architectural concern. Here is a 5-step pattern with working code you can ship in 20 minutes — same model, same features, 30-65% lower bill.
Quick Access
Install command
$ mrt install tutorial
Browse related skills
Token Counting Strategies: Cut Your LLM Bill 30-50% Without Touching the Model

Most LLM apps treat token cost as a post-call observability problem. By the time you see the bill, you have already paid it. The apps that actually save money treat token counting as a pre-call architectural concern. Here is the 5-step pattern with working code you can ship in 20 minutes. Same model, same features, 30-65% lower bill.

Step 1: Install The Right Counter

The first mistake is `len(text.split())` as a token estimate. Off by 2-4x. The second is hardcoding model-specific ratios. Use the real tokenizer for the model you are calling.

import tiktoken

from functools import lru_cache

@lru_cache(maxsize=8)

def encoder(model: str):

return tiktoken.encoding_for_model(model)

def count(text: str, model: str = "gpt-4o") -> int:

return len(encoder(model).encode(text))

For Anthropic, use `pip install claude-tokenizer` or call the count-tokens API endpoint. For Gemini, use the published `count_tokens` method. The principle: the same library that bills you should count you.

Step 2: Pre-Flight Cost Gates

Before any request, count the system prompt + history + new message. Reject, truncate, or downgrade the model based on budget. This single pattern is the difference between $0.50 per 1,000 requests and $5 per 1,000 requests.

from dataclasses import dataclass

@dataclass

class TokenBudget:

system: int = 2000

history: int = 4000

user_msg: int = 2000

reserve_output: int = 1000

def pick_model(budget: TokenBudget, system: str, history: list, msg: str) -> str:

total = count(system) + sum(count(m) for m in history) + count(msg)

if total < 1500:

return "gpt-4o-mini" # cheap model for short contexts

if total > 32_000:

return "gpt-4o" # need longer context window

return "gpt-4o-mini" # default cheap

Step 3: Aggressive History Truncation

Most chat apps send the full conversation history on every turn. By turn 10 you are re-billing the entire conversation. Truncate the middle, keep the head and tail.

def trim_history(messages, max_tokens=4000, keep_recent=6):

if sum(count(m["content"]) for m in messages) <= max_tokens:

return messages

head = [messages[0]] if messages[0]["role"] == "system" else []

tail = messages[-keep_recent:]

middle_budget = max_tokens - sum(count(m["content"]) for m in head + tail)

middle = []

for m in messages[len(head):-keep_recent]:

if sum(count(x["content"]) for x in middle + [m]) <= middle_budget:

middle.append(m)

return head + middle + tail

Step 4: Compress Before Sending

Three wins, in order of impact.

1. **Strip tool definitions** when the user is not actively tool-calling. A 5-tool agent ships 2,000+ tokens of JSON schema per request. Move tool definitions to a separate prompt and only send them when the model asks for a tool round.

2. **Compress retrieved context** before injection. A 5,000-token retrieved chunk often needs only 1,500 after extractive summarization.

3. **Inline repeated system instructions once.** A common bug: per-request "you are a helpful assistant..." boilerplate adds up across thousands of calls.

Step 5: Track And Alert

Wire `usage` from every response into your logging layer. Aggregate by user, by feature, by hour. Set a P99 alert when per-request tokens exceed budget. Most teams find a single feature path consuming 60% of tokens within an hour of wiring this up.

async def log_usage(response, user_id: str, feature: str):

await metrics.gauge(

"llm.tokens",

response.usage.total_tokens,

tags={"model": response.model, "feature": feature, "user": user_id},

)

The Real Numbers

A multi-tenant RAG app I audited last month: $14,000/month OpenAI bill. After implementing the pre-flight gate, history trimming, and tool-definition stripping, the bill dropped to $4,800/month. Same features, same model, same number of users. Two days of work, 65% reduction, no quality regression.

Next Step

Pick your top three API call sites. Add a token counter. Add a pre-flight gate that drops to a smaller model for short contexts. Measure the bill at the end of the week. You will save more than any other LLM cost optimization you ship this quarter.

— Mr. Technology

*Token counting primitives as of June 2026: `tiktoken` (OpenAI), `claude-tokenizer` (Anthropic, third-party), `google-generativeai.count_tokens` (Gemini), and provider-native count-tokens endpoints. All three major vendors expose usage in API responses — read it, log it, and bill your own internal features for it.*

Related Dispatches