RAG Is Mostly Theater and the Industry Knows It

Retrieval-Augmented Generation became the default architecture for LLM applications because developers couldn't trust context windows. That says more about the industry than it does about the technique.

I'm going to say something that will get me yelled at in engineering circles: RAG is mostly theater.

Before you draft that angry reply — yes, I know it works. Yes, I know vector search can retrieve relevant documents. Yes, I know retrieval augmentation helps models answer questions about information they weren't trained on. I'm not saying it does nothing.

I'm saying it's a workaround pretending to be an architecture.

What RAG actually is

Retrieval-Augmented Generation: you index documents, you retrieve chunks that are semantically similar to the query, you stuff those chunks into the context window, the model reads them and generates a response. Clean, elegant, widely deployed.

Here's what that architecture communicates: we don't trust the model's context window. We don't trust the model's attention mechanism to focus on what's relevant in a long context. We don't trust the model to reason over the full document.

So we pre-process: chunk, embed, index, retrieve. We optimize for retrieval because we assume the model's attention will fail at the task.
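To make the chunk, embed, index, retrieve loop concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption: the function names are made up, the three sample documents are invented, and the "embedding" is a toy bag-of-words count standing in for a real embedding model and vector store.

```python
import math
import re
from collections import Counter


def chunk(text, size=40):
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, index, k=2):
    """Return the k chunk texts most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query, chunks):
    """Stuff the retrieved chunks into the context ahead of the question."""
    context = "\n---\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical corpus, invented for this example.
docs = [
    "Refunds are issued within 14 days of a return request.",
    "Shipping takes 3 to 5 business days within the country.",
    "Support is available by email on weekdays.",
]

# Index: every chunk stored alongside its embedding.
index = [(c, embed(c)) for d in docs for c in chunk(d)]

query = "How long do refunds take?"
top = retrieve(query, index, k=1)
prompt = build_prompt(query, top)
```

Note how much is decided before the model reads a single token: chunk size, the embedding, the similarity metric, the value of k. Each of those is a knob that can be set wrong.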

And you know what? That assumption is often correct. Current models do struggle with very long contexts. Attention degrades. Models lose track of relevant information in noise. The retrieval step is doing real work.

But it should feel embarrassing that we've built an entire industry around compensating for a limitation that the model vendors are actively working to eliminate.

The context window arms race makes RAG obsolete

GPT-4o shipped with a 128K-token context window. Gemini 3 has a million-token context window. SubQ launched with 12M tokens. The trajectory is obvious: context windows are growing, attention mechanisms are improving, and the fundamental premise of RAG, that we can't fit enough in context to answer the question, is eroding.

Every model vendor is racing to solve the context utilization problem because that's how they differentiate. Long-context attention is actively researched. Sparse attention, state space approaches, hierarchical attention — take your pick of the architectural innovations.

RAG is a point solution for a structural problem. It's valuable today. It will be less valuable next year. By 2028? Probably largely irrelevant for most use cases.

This isn't a knock on the engineers using RAG today. You're solving real problems with the right tools for today. But calling it "architecture" overstates its permanence.

The real problem RAG is hiding

When you strip it down, RAG is a band-aid for two underlying problems:

1. Unreliable attention: models lose relevant information in long contexts
2. Static knowledge: models can't answer questions about data that changed after training

The second problem is legitimate. Retrieval augmentation is a genuine solution for dynamic, frequently-updated knowledge bases. This use case isn't going away.

The first problem is a model quality problem. As models improve and attention mechanisms get better, the retrieval step becomes less necessary for this purpose. What's left? Overhead. Chunking decisions. Retrieval latency. Embedding quality. Retrieval recall. All of them failure modes that the retrieval layer introduces and that a pure long-context approach doesn't have.
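Retrieval recall is the easiest of those failure modes to measure. A minimal sketch, assuming you have hand-labeled which chunks are relevant for a query (the chunk IDs and the function name here are hypothetical):

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical ranking from a retriever, best match first.
retrieved = ["c7", "c2", "c9", "c4"]
# Hypothetical ground truth: the chunks that actually answer the query.
relevant = {"c2", "c4"}

r2 = recall_at_k(retrieved, relevant, 2)  # 0.5: only c2 made the top 2
r4 = recall_at_k(retrieved, relevant, 4)  # 1.0: both are in the top 4
```

Anything below 1.0 means the model never sees part of the answer, no matter how good its attention is.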

My take

Use RAG for what it actually solves: dynamic knowledge, frequently updated corpora, very large document sets where pure context is genuinely cost-prohibitive. Don't use it as a default architecture because your embeddings are good or because everyone's doing it.
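"Cost-prohibitive" is just arithmetic. A rough sketch, with every number an assumption: the price per million input tokens, the corpus size, the chunk size, and the query volume are all hypothetical, and the calculation ignores output tokens, prompt caching, and the cost of running the retrieval layer itself.

```python
def input_cost(tokens_per_query, queries, price_per_million_tokens):
    """Total input-token cost for a batch of queries."""
    return tokens_per_query * queries * price_per_million_tokens / 1_000_000


PRICE = 2.00          # hypothetical dollars per million input tokens
QUERIES = 10_000      # hypothetical monthly query volume

# Pure context: send the whole 500K-token corpus with every query.
full_context = input_cost(500_000, QUERIES, PRICE)   # 10000.0

# Retrieval: send 5 chunks of roughly 800 tokens each.
with_retrieval = input_cost(5 * 800, QUERIES, PRICE)  # 80.0
```

At these made-up numbers the gap is two orders of magnitude. Shrink the corpus to 20K tokens and the same math gives 400.0 for pure context, at which point the retrieval layer may cost more in engineering time than it saves.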

The question you should be asking isn't "should I use RAG" — it's "is the information I'm retrieving actually changing frequently enough to justify the retrieval overhead?"

For static documentation that changes quarterly? Probably not. For a legal database that's updated daily? Yes. For a product catalog with live inventory? Absolutely.
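If you want that question as a checklist, here is one way to sketch it. The function name, the 30-day threshold, and the token budget are my assumptions for illustration, not established cutoffs; tune them to your own system.

```python
def retrieval_justified(update_interval_days, corpus_tokens,
                        context_budget_tokens, max_static_interval_days=30):
    """Rough heuristic: is a retrieval layer worth its overhead?"""
    # A corpus that doesn't fit the context budget needs retrieval regardless.
    if corpus_tokens > context_budget_tokens:
        return True
    # Otherwise, retrieval earns its keep only if the data changes often.
    return update_interval_days <= max_static_interval_days


BUDGET = 1_000_000  # hypothetical usable context, in tokens

quarterly_docs = retrieval_justified(90, 200_000, BUDGET)    # False
daily_legal_db = retrieval_justified(1, 200_000, BUDGET)     # True
live_inventory = retrieval_justified(0, 200_000, BUDGET)     # True
huge_archive = retrieval_justified(365, 50_000_000, BUDGET)  # True
```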

Context matters. The cargo-culting of RAG as a default for every LLM application is the part that bothers me. It's a technique with real use cases, not a universal pattern.

And to the people who will say "but the benchmarks show retrieval still helps even with unlimited context" — sure. In controlled evaluations on specific tasks. In your production system, with your embedding model, your chunking strategy, your retrieval optimization? The math is more complicated than the benchmark suggests.

Build what works. Be honest about why it works. Don't ship RAG because it's fashionable.

— Mr. Technology