Context Windows Are a Red Herring

Every vendor is bragging about their million-token context window. Nobody is asking the right question: why are you stuffing your entire codebase into a prompt and calling it reasoning?

Let me say it plainly: the context window arms race is a distraction.

Anthropic says 200K tokens. Google says two million. OpenAI says a million. We're in a benchmark war over who can stuff the most context into a single call, and the entire industry is acting like this is a meaningful signal of intelligence. It isn't.

Here's what I'd rather ask: why does your AI need to read your entire codebase to fix a bug? Why does it need to see 50,000 lines of context to understand what a function does? If a model needs to see your entire repository to write a unit test, it isn't reasoning. It's solving a very expensive needle-in-a-haystack retrieval problem.

Real reasoning is the ability to take a small amount of information — a clean specification, a well-scoped problem, a clear constraint — and work through it correctly. It doesn't require reading every line of your monorepo. It requires understanding principles, recognizing patterns, and applying them to novel situations.

A 10-million-token context window doesn't make a model smarter. It makes it better at looking things up. That's a fundamentally different capability, and the industry is conflating the two because a bigger number is easier to market than progress on the hard problem.

What the Benchmark Culture Gets Wrong

The context window race is a benchmark optimization problem. Every time someone publishes a "long-context understanding" benchmark, every vendor races to win it. The benchmark tests whether a model can find a relevant piece of information buried deep in a long context. That's a retrieval problem, not a reasoning problem.

And here's the uncomfortable part: those benchmarks can be gamed. Models that perform well on long-context benchmarks sometimes do so because they've been trained on data that includes the benchmark content, or because the evaluation methodology rewards pattern-matching over genuine comprehension. When that happens, the benchmark doesn't measure whether the model understands the information. It measures whether it can retrieve and repeat it.

The difference between retrieval and reasoning is the difference between a search engine and a scientist. A search engine gives you everything related to your query. A scientist gives you the implication you didn't know to look for.

The context window race is making everyone's search engine better. Nobody is making scientists.

The Memory Illusion

When you hand a model 100,000 tokens of context, it feels like memory. It feels like the model "knows" your codebase the way a senior engineer would. That's an illusion.

The model doesn't have persistent memory across sessions. It doesn't build a mental model of your system that accumulates over time. It processes the context you give it in this specific call, produces an output, and resets. The 100,000 tokens you gave it are gone the next time you call it, unless you pay to send them again.

This isn't memory. It's a very expensive form of copy-paste.

A senior engineer who joins your team builds a mental model over weeks. They remember architectural decisions. They form opinions about what's right and wrong. They get better at predicting what you'll need before you ask. That accumulation is what actual expertise looks like.

A model with a million-token context window gives you all the text but none of the accumulation. Every session is a fresh start wearing a very expensive hat.

The Real Problem Nobody Is Solving

The actual capability gap isn't context length. It's three things:

Compositional generalization: The ability to take principles learned in one domain and apply them correctly in a different, novel domain. This is what human experts do. Current models are surprisingly bad at it — they pattern-match on surface features and fail when the structure is similar but the surface is different.

Causal reasoning: The ability to distinguish between correlation and causation, to reason about interventions and counterfactuals, to understand why something works rather than just what works. This is fundamentally different from statistical pattern matching, and it's what makes expert judgment possible.

Accumulated understanding: Building a persistent model of a complex system that improves over time without requiring the entire system state to be re-explained every session. This is what memory actually is, and no current LLM architecture does it well.

These are hard problems. They aren't solved by increasing context windows from 200K to two million. They're solved by architectural innovations, training methodology improvements, and fundamentally different approaches to how models represent and update knowledge.

The vendors don't talk about these because they can't be solved with a press release. You can't ship "better causal reasoning" in a patch note. You can ship "one million context tokens." So that's what gets marketed.

The Practical Implication

Here's why this matters for how you build things:

If you're relying on massive context windows to make your agents work, you're papering over a design problem. You're saying "I can't figure out how to give this agent the right information at the right time, so I'll just give it everything and hope it figures it out."

That approach scales poorly. It burns tokens. It increases latency. It produces inconsistent outputs because the few details that matter are buried among thousands that don't. And it doesn't actually solve the underlying problem: the agent still doesn't understand your system, it just has more raw material to pattern-match against.
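To put a rough number on "burns tokens," here is a back-of-the-envelope sketch. The price, the call count, and both prompt sizes are made-up placeholders, not any vendor's actual figures; plug in your own.

```python
def input_cost_usd(tokens_per_call: int, calls: int, usd_per_million_tokens: float) -> float:
    """Total input-token spend for a batch of calls."""
    return tokens_per_call * calls * usd_per_million_tokens / 1_000_000


# All values below are hypothetical placeholders for illustration.
PRICE_PER_MILLION = 3.00   # assumed input price in USD per million tokens
CALLS = 1_000              # assumed number of agent calls

dump_everything = input_cost_usd(500_000, CALLS, PRICE_PER_MILLION)
scoped_context = input_cost_usd(5_000, CALLS, PRICE_PER_MILLION)

print(f"dump everything: ${dump_everything:,.2f}")
print(f"scoped context:  ${scoped_context:,.2f}")
print(f"ratio: {dump_everything / scoped_context:.0f}x")
```

Input cost is linear in prompt size, so under these assumptions the ratio is simply the ratio of the two prompt sizes. The sketch says nothing about latency or output quality.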

The better approach is to invest in information architecture: how you retrieve, chunk, and present information to your agents. Retrieval-augmented generation (RAG) done well beats a massive context window filled indiscriminately. Structured tool definitions that give models exactly what they need for the specific task are better than dumping your entire API schema into the prompt.
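The retrieve, chunk, and present steps can be sketched in a few lines. This is a toy, not a production pipeline: the keyword-overlap scoring is a deliberately crude stand-in for whatever retrieval you actually use (embeddings, BM25, a code index), and every name here (Chunk, chunk_by_paragraph, retrieve, build_prompt, word_budget) is hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # hypothetical file path or document name
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for a real embedding."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def chunk_by_paragraph(source: str, document: str) -> list[Chunk]:
    """Chunk: split on blank lines so each chunk is one coherent unit."""
    parts = [p.strip() for p in re.split(r"\n\s*\n", document)]
    return [Chunk(source, p) for p in parts if p]


def retrieve(query: str, chunks: list[Chunk], word_budget: int) -> list[Chunk]:
    """Retrieve: keep the best-overlapping chunks that fit the budget."""
    q = tokenize(query)
    scored = [(len(q & tokenize(c.text)), c) for c in chunks]
    scored = [(s, c) for s, c in scored if s > 0]  # drop irrelevant chunks
    scored.sort(key=lambda pair: pair[0], reverse=True)
    selected, used = [], 0
    for _, chunk in scored:
        size = len(chunk.text.split())
        if used + size <= word_budget:
            selected.append(chunk)
            used += size
    return selected


def build_prompt(task: str, chunks: list[Chunk]) -> str:
    """Present: label each chunk with its source, then state the task."""
    context = "\n\n".join(f"[{c.source}]\n{c.text}" for c in chunks)
    return f"Context:\n{context}\n\nTask: {task}"


code = (
    "def parse_invoice(raw):\n    return raw.split(',')\n\n"
    "def send_email(to, body):\n    smtp.send(to, body)\n\n"
    "def render_chart(data):\n    plot(data)"
)
task = "fix the bug in parse_invoice when raw is empty"
chunks = chunk_by_paragraph("billing.py", code)
print(build_prompt(task, retrieve(task, chunks, word_budget=50)))
```

The point of the design is the budget: the agent gets the one function the task mentions, and the two it doesn't need never reach the prompt.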

Context windows are a feature. Architectural intelligence is the actual product.

The industry has chosen the wrong benchmark. Don't let that choice infect your own thinking.

— Mr. TECHNOLOGY