RAG Is Overrated

Everyone's building RAG pipelines. Almost none of them are actually solving the problem they claim to solve.

The Retrieval Illusion

Let's talk about RAG. Retrieval-Augmented Generation. The darling of every enterprise AI pitch deck since 2023. The answer to everything: "Just add RAG."

I'm calling it: RAG is overrated.

Not all of it. But most of it. Let me explain why.

What RAG Actually Does

RAG retrieves documents and stuffs them into context. That's it. That's the whole trick. You have a question, you find some relevant docs, you jam them into the prompt, the LLM answers based on those docs.
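To make "that's the whole trick" concrete, here is a minimal sketch of the retrieve-and-stuff loop. It rests on loud assumptions: a bag-of-words cosine score stands in for a real embedding model, the names `retrieve`, `build_prompt`, and `DOCS` are invented for illustration, and the LLM call itself is left out.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k docs most similar to the question (toy stand-in for vector search)."""
    q = Counter(tokenize(question))
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]


def build_prompt(question: str, docs: list[str], k: int = 2) -> str:
    """Stuff the retrieved docs into the prompt. This string is what the LLM would see."""
    context = "\n\n".join(retrieve(question, docs, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical three-document corpus.
DOCS = [
    "Refunds are issued within 14 days of a return request.",
    "Our office is closed on public holidays.",
    "Shipping takes 3 to 5 business days.",
]

prompt = build_prompt("How long do refunds take?", DOCS, k=1)
print(prompt)
```

Everything downstream depends on what `retrieve` returns. If it picks the wrong document, the prompt is confidently built around the wrong text.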

This works beautifully when:

1. The answer is explicitly in the retrieved docs
2. The retrieval is precise enough to exclude noise
3. The LLM doesn't hallucinate anyway

In the real world, none of these conditions hold consistently.

The Retrieval Problem Nobody Talks About

Here's the part that gets skipped: retrieval quality degrades badly as the corpus grows.

Semantic similarity search sounds great in a demo. Five documents, all perfectly chunked, all semantically distinct. Great retrieval. Real documents are messy. Multiple pages covering overlapping topics. Similar terminology meaning different things in different contexts. The retrieval returns plausible but wrong chunks constantly.

You know what happens when you retrieve four irrelevant chunks alongside one relevant one? The LLM has no reliable way to know which chunk matters, so it conditions on all of them. It spreads its attention across noise. The signal-to-noise ratio drops. Your "grounded" answer can end up less accurate than just letting the model answer from its weights.
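The four-noise-plus-one-signal case is easy to put a number on. A rough sketch, assuming you have hand-labeled which chunk IDs are relevant for a query; the IDs below are made up:

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    return sum(1 for chunk_id in top if chunk_id in relevant_ids) / len(top)


def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# Hypothetical query result: five chunks retrieved, only "c2" is relevant.
retrieved = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c2"}

print(precision_at_k(retrieved, relevant, k=5))  # 0.2
print(reciprocal_rank(retrieved, relevant))      # 0.5
```

A precision of 0.2 means 80% of the chunks in the prompt are noise. Tracking this per query, rather than eyeballing a demo, is what tells you whether retrieval is holding up.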

The Chunking Problem Is Unsolvable

RAG performance is hypersensitive to how you chunk your documents. Too small: you lose context and coherence. Too large: you dilute relevance. The "right" chunk size is different for every document type, every query type, every model, and every update to your corpus.

Almost nobody does rigorous chunk-size optimization. They pick 500 tokens, ship it, and call it done. Meanwhile, their retrieval is silently failing on a large share of their queries, and they have no measurement that would tell them how large.
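A chunk-size sweep doesn't have to be elaborate. Here is a sketch under stated assumptions: whitespace-separated words stand in for tokens, a word-overlap score stands in for the real retriever, and `TEXT`, `EVAL`, and the function names are hypothetical. A real sweep would plug in your own tokenizer, retriever, and labeled questions.

```python
def chunk_words(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def top1_hit_rate(text: str, eval_set: list[tuple[str, str]], size: int, overlap: int = 0) -> float:
    """Fraction of (question, answer snippet) pairs whose snippet appears in the top-ranked chunk."""
    chunks = chunk_words(text, size, overlap)
    hits = 0
    for question, answer in eval_set:
        q_words = set(question.lower().split())
        # Toy retriever: pick the chunk sharing the most words with the question.
        best = max(chunks, key=lambda c: len(q_words & set(c.lower().split())))
        hits += answer.lower() in best.lower()
    return hits / len(eval_set)


def sweep(text: str, eval_set: list[tuple[str, str]], sizes: list[int]) -> dict[int, float]:
    """Hit rate for each candidate chunk size."""
    return {size: top1_hit_rate(text, eval_set, size) for size in sizes}


# Hypothetical corpus and labeled questions.
TEXT = (
    "Refunds are issued within 14 days of a return request. "
    "Shipping takes 3 to 5 business days. "
    "Our office is closed on public holidays."
)
EVAL = [
    ("how fast are refunds issued", "14 days"),
    ("how long does shipping take", "3 to 5 business days"),
]

results = sweep(TEXT, EVAL, sizes=[5, 10, 50])
for size, rate in results.items():
    print(size, rate)
```

The point is not the toy scorer. The point is that the number changes with `size`, and you only find out by running it against questions with known answers.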

And chunking strategies? Overlapping chunks, semantic chunking, recursive splitting: all sound good in blog posts. They are rarely tested rigorously against a baseline "just dump everything" approach, because that baseline often performs comparably, and nobody gets to write a Medium post about that.
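Testing against a baseline can be as small as the sketch below. It assumes each pipeline is wrapped as a function from question to answer string, and it scores by crude substring match; the stub pipelines and the `compare` helper are hypothetical placeholders for your own.

```python
from typing import Callable

AnswerFn = Callable[[str], str]


def accuracy(answer_fn: AnswerFn, eval_set: list[tuple[str, str]]) -> float:
    """Share of questions whose expected snippet appears in the answer (crude substring match)."""
    correct = sum(expected.lower() in answer_fn(q).lower() for q, expected in eval_set)
    return correct / len(eval_set)


def compare(baseline_fn: AnswerFn, candidate_fn: AnswerFn, eval_set: list[tuple[str, str]]) -> dict[str, float]:
    """Score both pipelines on the same questions and report the difference."""
    base = accuracy(baseline_fn, eval_set)
    cand = accuracy(candidate_fn, eval_set)
    return {"baseline": base, "candidate": cand, "delta": cand - base}


# Stub pipelines standing in for "dump everything" and "fancy chunking + retrieval".
def dump_everything(question: str) -> str:
    canned = {"refund window?": "Refunds take 14 days.", "shipping time?": "I'm not sure."}
    return canned[question]


def fancy_chunking(question: str) -> str:
    canned = {"refund window?": "Refunds take 14 days.", "shipping time?": "3 to 5 business days."}
    return canned[question]


EVAL_SET = [("refund window?", "14 days"), ("shipping time?", "3 to 5 business days")]
report = compare(dump_everything, fancy_chunking, EVAL_SET)
print(report)
```

If `delta` sits near zero on your real questions, the extra pipeline complexity isn't buying anything.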

RAG Doesn't Actually Prevent Hallucination

Here's the thing that kills RAG for me: it doesn't actually prevent hallucination.

RAG is supposed to make the model answer from the retrieved documents. But nothing actually forces it to. The model can still confabulate. It can misread the retrieved text. It can confidently state something that isn't in the docs. It can "follow" the retrieved context to a wrong conclusion.

RAG reduces certain classes of hallucination — specifically, hallucinating facts that contradict your knowledge base. But it introduces new failure modes: retrieval failures, chunk boundary errors, and context overflow that confuses the model into making things up.

You've traded known failure modes for unknown ones. That's not obviously better.
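One way to make those failure modes at least visible is to check the answer against the retrieved text after generation. The sketch below is a deliberately crude lexical heuristic, not a real entailment check: it flags answer sentences whose content words mostly don't appear in the context. The function names and the 0.6 threshold are assumptions chosen for illustration.

```python
import re


def content_words(text: str) -> set[str]:
    """Lowercased tokens that are numbers or longer than three characters."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if len(t) > 3 or t.isdigit()}


def unsupported_sentences(answer: str, context: str, min_overlap: float = 0.6) -> list[str]:
    """Return answer sentences whose content words are poorly covered by the retrieved context."""
    ctx = content_words(context)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = content_words(sentence)
        if not words:
            continue
        coverage = len(words & ctx) / len(words)
        if coverage < min_overlap:
            flagged.append(sentence)
    return flagged


# Hypothetical retrieved context and model answer.
context = "Refunds are issued within 14 days of a return request."
answer = "Refunds are issued within 14 days. Refunds also include free shipping vouchers."

print(unsupported_sentences(answer, context))
```

This catches the second sentence, which the context never mentions. It won't catch a misreading that reuses the context's own words, which is exactly why retrieval alone doesn't settle the hallucination question.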

The Real Problem

RAG exists because people don't trust LLMs to have knowledge. Which is fair — LLMs hallucinate. But RAG doesn't solve that problem. It papers over it with a complicated pipeline that introduces new failure modes.

What actually works better in most cases:

Better prompting
Fine-tuning on domain-specific data (yes, I know I said fine-tuning is often a waste — but it's the right call for some problems)
Training and prompting approaches (Constitutional AI-style methods, for example) that push models to be more honest about uncertainty
Simply using a bigger context window and dumping more relevant text directly

Or, radical idea: accept that LLMs are hallucination machines and build systems that handle that gracefully, instead of bolting on retrieval to make them something they're not.

When RAG Actually Works

To be fair: RAG isn't useless. It works when:

You have highly structured, factual data that changes frequently
Exact citation is a regulatory or legal requirement
The retrieval precision can be made very high (legal, medical domains with controlled vocabulary)
Time and cost constraints make fine-tuning infeasible

If you're building a legal research tool where every claim needs a cite, RAG is essential. If you're building a chatbot that answers questions about your product documentation, you're probably better off with a well-prompted model that knows its limitations.

The Bottom Line

RAG is a useful tool in specific contexts. It's not a universal solution to AI reliability. Most implementations I've seen are doing just enough retrieval theater to feel like they're solving the problem while introducing all the failure modes I described above.

Stop adding RAG because everyone else is. Add it because you've measured and it actually helps.

Otherwise you're just putting on a show. And the people who actually know how this stuff works can tell.