Let me tell you about the context window arms race that nobody asked for and everyone is participating in.
GPT-4 had 32K tokens. Claude hit 200K. Gemini went to 1M. We are now at a point where models can ingest entire codebases, years of documentation, or the complete works of Shakespeare in a single call. And the industry response has been to treat this as an unalloyed good — a feature to market, a number to benchmark, a capability to be proud of.
It is not. Massive context windows are a crutch. And like most crutches, they have made us weaker even as they have made us feel more capable.
The premise behind large context windows sounds logical: if the model can see more information, it can reason over more information and give better answers. More context = more capability.
Except that is not how inference works. The model attends over its context. As context length grows, the attention the model can pay to any given piece of information shrinks. You can fit a thousand documents in the context window, but the model's attention gets spread thin across all of them, and it misses the signal in the noise.
This is not a theoretical problem. Studies on long-context models consistently show that performance degrades significantly beyond a certain context length — often well before you hit the ceiling. Models are better at reasoning over 4K-8K well-selected tokens than over 128K poorly-selected ones.
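To make the dilution intuition concrete, here is a toy sketch, not a model of any real transformer. It assumes a single attention head, one relevant token with a fixed logit, and a growing number of irrelevant tokens that all share a lower logit. The function name and the logit values are illustrative assumptions.

```python
import math

def attention_on_target(n_distractors: int, target_logit: float = 5.0,
                        distractor_logit: float = 0.0) -> float:
    """Softmax weight on one relevant token competing with n distractors.

    Toy model: one attention head, one relevant token, and
    n_distractors irrelevant tokens that all share the same logit.
    """
    target = math.exp(target_logit)
    distractors = n_distractors * math.exp(distractor_logit)
    return target / (target + distractors)

# The relevant token's logit never changes; only the crowd around it grows.
for n in (1_000, 8_000, 128_000, 1_000_000):
    weight = attention_on_target(n)
    print(f"{n:>9,} distractors -> weight on target: {weight:.6f}")
```

Real models have many heads and layers and learn their logits, so the actual curve will differ. The point of the sketch is only that a signal of fixed strength gets a smaller share of a softmax as more tokens compete with it.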
Large context windows have made AI engineers lazy about retrieval.
Think about it. If you have a 1M token context, why bother building a precise retrieval system? Why clean your data? Why structure your documents carefully? Why write good chunking logic? Just dump everything into the prompt and let the model sort it out.
This is the wrong lesson to learn from having large context windows. The teams that have built the most impressive AI systems — the ones that actually work reliably — are obsessive about retrieval quality, data structure, and prompt engineering. They are not using context as a substitute for those things.
RAG got dismissed as complexity that large context would make obsolete. That is like saying cars made horses obsolete, so we can stop building roads. The road quality still matters. The retrieval still matters. If anything, large context made the retrieval problem harder, because now you are retrieving from a larger haystack and need even more precision to avoid polluting context.
Token pricing is linear in context length. When you send 100K tokens to a model, you are not just paying 100X more than a 1K call. You are also paying for the model to attend over 100K tokens, which means slower inference and higher latency. Cost and latency both compound as your context grows.
For a 1M token context call, you are potentially paying orders of magnitude more than a well-targeted 8K call that gives the model exactly the information it needs to answer the question.
The math is not subtle: a perfectly retrieved 8K context is worth more than a careless 1M context. And at linear per-token pricing, it is roughly 125X cheaper.
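The arithmetic can be checked in a few lines. This sketch assumes purely linear input pricing and round token counts (1,000,000 and 8,000). The price constant is a hypothetical placeholder, not any vendor's actual rate, and it cancels out of the ratio anyway.

```python
def call_cost(tokens: int, price_per_million: float) -> float:
    """Input cost of one call under purely linear per-token pricing."""
    return tokens / 1_000_000 * price_per_million

PRICE_PER_MILLION = 3.00  # hypothetical USD price per 1M input tokens

full_dump = call_cost(1_000_000, PRICE_PER_MILLION)
targeted = call_cost(8_000, PRICE_PER_MILLION)
ratio = full_dump / targeted  # the price cancels out: 1,000,000 / 8,000

print(f"1M-token call: ${full_dump:.2f}")
print(f"8K-token call: ${targeted:.3f}")
print(f"ratio: {ratio:.0f}x")
```

Under these assumptions the ratio is 1,000,000 / 8,000 = 125, whatever the per-token price. Output tokens, caching discounts, and tiered long-context pricing would all change the real bill, so treat this as a lower-bound intuition and not a quote.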
Human memory is not a context window. You do not remember everything you have ever experienced equally. You retrieve relevant information based on context, recency, importance, and emotional salience. You forget most of what happens to you. That is a feature, not a bug.
The human brain does not work by having every document you have ever read loaded into working memory simultaneously. It works by having good retrieval systems that surface the right information at the right time.
Large context windows treat the model context as a datastore rather than working memory. That architectural metaphor has costs that the benchmark numbers do not capture: noise, confusion, degraded reasoning on specific items, and ballooning costs.
Large context is genuinely useful for:
Those are legitimate use cases. They are not everything.
The right architecture for most AI systems is: excellent retrieval, careful context construction, small-but-perfect context windows, and a model that is good at reasoning over precisely what you have given it.
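As a rough illustration of "retrieve, then build a small context", here is a minimal sketch. Everything in it is an assumption made for the example: the term-overlap `score`, the word-count token estimate, the `min_score` cutoff, and the sample chunks. A production system would use a real tokenizer and a proper retriever (embeddings, BM25, or similar).

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase word tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query: str, chunk: str) -> float:
    """Fraction of query terms that appear in the chunk (toy relevance score)."""
    q_terms = set(tokenize(query))
    if not q_terms:
        return 0.0
    return len(q_terms & set(tokenize(chunk))) / len(q_terms)

def build_context(query: str, chunks: list[str], token_budget: int,
                  min_score: float = 0.1) -> list[str]:
    """Pick the highest-scoring chunks that fit within token_budget."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        if score(query, chunk) < min_score:
            break  # ranked is sorted, so the rest are no more relevant
        cost = len(tokenize(chunk))  # word count as a rough token estimate
        if used + cost <= token_budget:
            selected.append(chunk)
            used += cost
    return selected

chunks = [
    "Refunds are issued within 14 days of a return request.",
    "Our office dog is named Biscuit.",
    "Return requests must include the original order number.",
    "The cafeteria menu changes every Monday.",
]
query = "how do I request a refund for a return"
context = build_context(query, chunks, token_budget=25)
print(context)
```

The design choice worth noticing is the relevance floor: irrelevant chunks are dropped even when the budget has room for them, so a bigger budget does not automatically mean a noisier prompt.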
This is not a novel insight. It is how every high-performing production AI system I have seen is built. Not a maxed-out context window, but exactly the right context, delivered precisely.
The teams obsessed with retrieval quality, data architecture, and context engineering are running circles around the teams who bought the 1M token context narrative and called it a strategy.
Context windows are a feature. They are not a substitute for doing the hard work of building systems that actually retrieve and reason well.
Stop bragging about your context window size. Start bragging about your retrieval precision. That is the number that actually matters.