Most LLM apps discover their token cost on the invoice. The teams that actually save money treat token counting as a pre-call architectural concern. Here is a 5-step pattern with working code you can ship in 20 minutes — same model, same features, 30-65% lower bill.

Token Counting Strategies: Cut Your LLM Bill 30-50% Without Touching the Model

Most LLM apps treat token cost as a post-call observability problem. By the time you see the bill, you have already paid it. The apps that actually save money treat token counting as a pre-call architectural concern. Here is the 5-step pattern with working code you can ship in 20 minutes — same model, same features, 30-65% lower bill.

Step 1: Install The Right Counter

The first mistake is len(text.split()) as a token estimate — off by 2-4x. The second is hardcoding model-specific ratios. Use the real tokenizer for the model you are calling.

python

import tiktoken
from functools import lru_cache
@lru_cache(maxsize=8)
def encoder(model: str):
    return tiktoken.encoding_for_model(model)
def count(text: str, model: str = "gpt-4o") -> int:
    return len(encoder(model).encode(text))

For Anthropic, use pip install claude-tokenizer or call the count-tokens API endpoint. For Gemini, use the published count_tokens method. The principle: the same library that bills you should count you.

Step 2: Pre-Flight Cost Gates

Before any request, count the system prompt + history + new message. Reject, truncate, or downgrade the model based on budget. This single pattern is the difference between $0.50 per 1,000 requests and $5 per 1,000 requests.

python

from dataclasses import dataclass
@dataclass
class TokenBudget:
    system: int = 2000
    history: int = 4000
    user_msg: int = 2000
    reserve_output: int = 1000
def pick_model(budget: TokenBudget, system: str, history: list, msg: str) -> str:
    total = count(system) + sum(count(m) for m in history) + count(msg)
    if total < 1500:
        return "gpt-4o-mini"   # cheap model for short contexts
    if total > 32_000:
        return "gpt-4o"         # need longer context window
    return "gpt-4o-mini"        # default cheap

Step 3: Aggressive History Truncation

Most chat apps send the full conversation history on every turn. By turn 10 you are re-billing the entire conversation. Truncate the middle, keep the head and tail.

python

def trim_history(messages, max_tokens=4000, keep_recent=6):
    if sum(count(m["content"]) for m in messages) <= max_tokens:
        return messages
    head = [messages[0]] if messages[0]["role"] == "system" else []
    tail = messages[-keep_recent:]
    middle_budget = max_tokens - sum(count(m["content"]) for m in head + tail)
    middle = []
    for m in messages[len(head):-keep_recent]:
        if sum(count(x["content"]) for x in middle + [m]) <= middle_budget:
            middle.append(m)
    return head + middle + tail

Step 4: Compress Before Sending

Three wins, in order of impact. Strip tool definitions when the user is not actively tool-calling — a 5-tool agent ships 2,000+ tokens of JSON schema per request, so move tool definitions to a separate prompt sent only on tool rounds. Compress retrieved context before injection: a 5,000-token retrieved chunk often needs only 1,500 after extractive summarization. Inline repeated system instructions once — a common bug is per-request "you are a helpful assistant..." boilerplate that adds up across thousands of calls.

Step 5: Track And Alert

Wire usage from every response into your logging layer. Aggregate by user, by feature, by hour. Set a P99 alert when per-request tokens exceed budget. Most teams find a single feature path consuming 60% of tokens within an hour of wiring this up.

python

async def log_usage(response, user_id: str, feature: str):
    await metrics.gauge(
        "llm.tokens",
        response.usage.total_tokens,
        tags={"model": response.model, "feature": feature, "user": user_id},
    )

The Real Numbers

A multi-tenant RAG app I audited last month ran a $14,000/month OpenAI bill. After implementing the pre-flight gate, history trimming, and tool-definition stripping, the bill dropped to $4,800/month. Same features, same model, same users. Two days of work, 65% reduction, no quality regression.

Next Step

Pick your top three API call sites. Add a token counter. Add a pre-flight gate that drops to a smaller model for short contexts. Measure the bill at the end of the week. You will save more than any other LLM cost optimization you ship this quarter.

— Mr. Technology

*Token counting primitives as of June 2026: tiktoken (OpenAI), claude-tokenizer (Anthropic, third-party), and google-generativeai.count_tokens (Gemini). All three major vendors expose usage in API responses — read it, log it, bill your own internal features for it.*