
Here is what I see every single time a team has a knowledge problem with an LLM in 2026: they reach for a vector database. They chunk their documents. They embed them. They set up a retrieval pipeline. They tune the similarity threshold. They build a reranker. They deploy it. They spend three months on it. And then the model hallucinates something that was not in any of the retrieved documents anyway, and the whole thing collapses under the weight of its own complexity.
RAG — Retrieval-Augmented Generation — has become the default answer to every LLM knowledge problem. It is not the right answer to most of them. And the cult-like devotion to the pattern is costing engineering teams months of time and real money for a benefit that is often invisible.
Let me start with the most obvious point: context windows are not what they were in 2023.
The models you are running in production right now support 128K to 1M tokens of context. That is not a small number. That is an entire codebase. That is years of support tickets. That is a full employee handbook. That is a clinical trial dataset. The "we cannot fit everything in context" argument that made RAG necessary in the first place is largely gone for most production workloads.
When you can put 500 pages of documentation directly in the prompt, the retrieval layer is solving a problem you no longer have. The marginal value of retrieval, when context is cheap and abundant, approaches zero for a large class of knowledge tasks. You are not retrieving because you have to. You are retrieving because it is the reflex. That reflex is costing you.
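Before you argue about whether your corpus fits, measure it. Here is a rough sketch of that check. Everything in it is an assumption for illustration: the 4-characters-per-token rule of thumb, the 2,000-character page, the 16,000 tokens reserved for instructions and the answer, and the names `estimate_tokens` and `corpus_fits`. Your model's real tokenizer will give different numbers.

```python
from dataclasses import dataclass

# Assumed rule of thumb: roughly 4 characters per token for English prose.
# Real tokenizers vary by model and content, so treat this as an estimate.
CHARS_PER_TOKEN = 4


@dataclass
class FitReport:
    corpus_tokens: int   # estimated tokens across all documents
    budget_tokens: int   # context window minus the reserved tokens
    fits: bool


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def corpus_fits(documents: list[str], context_window: int, reserved_tokens: int) -> FitReport:
    """Check whether every document fits after reserving room for the
    instructions, the conversation, and the model's answer."""
    corpus_tokens = sum(estimate_tokens(doc) for doc in documents)
    budget_tokens = context_window - reserved_tokens
    return FitReport(corpus_tokens, budget_tokens, corpus_tokens <= budget_tokens)


if __name__ == "__main__":
    # Hypothetical corpus: 500 pages of about 2,000 characters each.
    pages = ["x" * 2000 for _ in range(500)]
    for window in (128_000, 1_000_000):
        print(window, corpus_fits(pages, context_window=window, reserved_tokens=16_000))
```

Under those assumptions, 500 pages comes to about 250,000 tokens. That overflows a 128K window and fits comfortably in a 1M one, so the answer depends on which end of the range your model sits at.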
Here is the part of the RAG cult that nobody wants to hear: vector similarity search is not better than BM25 for most production use cases.
BM25 is a sparse retrieval algorithm from the 1990s. It counts term frequencies. It applies inverse document frequency. It normalizes for document length. It does not need an embedding model. It does not need a vector database. It does not need GPU inference for retrieval. It runs on a standard inverted index, the kind every full-text search engine already ships. It scales horizontally. And for keyword-anchored queries, which is what most users actually send, it is frequently competitive with dense vector retrieval on recall and sometimes ahead of it.
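If you have never looked inside BM25, here is how little there is to it. This is a minimal sketch, not a production index. It assumes a naive lowercase whitespace tokenizer, the common defaults `k1 = 1.5` and `b = 0.75`, and the IDF variant with `+ 1` inside the logarithm so that scores never go negative. The class and method names are mine, chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Naive lowercase whitespace tokenizer (assumption: no stemming or punctuation handling)."""
    return text.lower().split()


class BM25:
    def __init__(self, documents: list[str], k1: float = 1.5, b: float = 0.75):
        if not documents:
            raise ValueError("documents must not be empty")
        self.k1 = k1  # controls term-frequency saturation
        self.b = b    # controls document-length normalization
        tokens = [tokenize(doc) for doc in documents]
        self.term_freqs = [Counter(toks) for toks in tokens]
        self.doc_lens = [len(toks) for toks in tokens]
        self.n_docs = len(documents)
        # Fall back to 1.0 if every document is empty, to avoid dividing by zero.
        self.avg_len = sum(self.doc_lens) / self.n_docs or 1.0
        # Document frequency: how many documents contain each term.
        self.doc_freq: Counter = Counter()
        for freqs in self.term_freqs:
            self.doc_freq.update(freqs.keys())

    def idf(self, term: str) -> float:
        n = self.doc_freq.get(term, 0)
        return math.log((self.n_docs - n + 0.5) / (n + 0.5) + 1.0)

    def score(self, query: str, doc_index: int) -> float:
        freqs = self.term_freqs[doc_index]
        length_norm = 1 - self.b + self.b * self.doc_lens[doc_index] / self.avg_len
        total = 0.0
        for term in tokenize(query):
            tf = freqs.get(term, 0)
            if tf:
                total += self.idf(term) * tf * (self.k1 + 1) / (tf + self.k1 * length_norm)
        return total

    def search(self, query: str, top_k: int = 3) -> list[tuple[int, float]]:
        """Return (doc_index, score) pairs for matching documents, best first."""
        scored = [(i, self.score(query, i)) for i in range(self.n_docs)]
        matches = [(i, s) for i, s in scored if s > 0]
        matches.sort(key=lambda pair: (-pair[1], pair[0]))
        return matches[:top_k]


if __name__ == "__main__":
    docs = [
        "Q3 financial results and revenue summary",
        "Employee handbook: vacation policy",
        "Q3 hiring plan for engineering",
    ]
    print(BM25(docs).search("q3 financial results"))
```

On the three toy documents, the query matches the first document on all three terms and the third on `q3` alone, so the first ranks highest and the handbook is not returned. No embedding model and no GPU are involved, only counting.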
Vector search looks better in demos. You type "financial results for Q3" and you get the right document. What the demos never show you is the failure mode: dense retrieval gets shaky on exact identifiers, on rare domain-specific jargon, on part numbers and error codes the embedding model never saw in training. BM25 matches those terms literally. Its own weakness is the mirror image: queries whose phrasing has no lexical overlap with the document text, which is exactly the case dense retrieval was built for. But BM25 does not need training data. BM25 does not drift when your embedding model goes stale.
The dirty secret of the vector DB industry is that the quality difference between a well-tuned BM25 index and a dense vector index is small for most workloads, and BM25 wins on latency, cost, and operational simplicity. The vector DB vendors will not tell you this. I will.
Let me walk you through what you are actually committing to when you choose RAG for a production system:
You are committing to a document ingestion pipeline. You are committing to chunking logic that you will argue about for months. You are committing to an embedding model that will need updating. You are committing to a vector database that will need monitoring, capacity planning, and eventual migration when the vendor changes their indexing algorithm. You are committing to a retrieval step that introduces latency you cannot fully predict. You are committing to a reranking step if you are serious. You are committing to an eval suite that tests retrieval quality independently of generation quality. You are committing to all of this so that when the user asks a question, you can show the model a few documents instead of putting those documents in the prompt.
Now ask yourself: what did you actually buy? A lower token count in your LLM call? A hallucination rate that is meaningfully different? A retrieval step that introduces a new class of failure modes — wrong document retrieved, relevant document missed, stale document retrieved — that your system now has to handle?
For most teams, the honest answer is: they bought complexity. They bought a second system to maintain. They bought a problem that did not need to exist.
I am not saying RAG is never the right tool. I am saying it is not the default tool.
RAG makes sense when you have a very large document corpus that changes frequently and you cannot afford to resend the entire corpus in the prompt on every request; with retrieval, you re-index only the documents that changed. RAG makes sense when you have a compliance requirement to show which specific document supported a given answer: a retrieval audit trail is genuinely useful in regulated industries. RAG makes sense when you have a volume problem: millions of documents, where putting them all in context is either impossible or genuinely cost-prohibitive.
These are real cases. They are not the majority of cases. The majority of cases I see are teams with 200 PDFs who have decided that RAG is the correct architectural choice because they saw it in a blog post, and who are now maintaining a retrieval pipeline for a knowledge base that could fit entirely in a long context window (128K tokens if the PDFs are about a page each, 1M if they run longer) with room left over for the conversation.
Here is what most of those teams should be doing instead: put the documents in the prompt. Use the context window. Pay the marginal token cost, which is now a fraction of what it was two years ago. Get retrieval for free, without a retrieval pipeline, without a vector database, without embedding maintenance.
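"Put the documents in the prompt" is about as much code as it sounds like. Here is a minimal sketch. The `<document name="...">` wrapper and the instruction wording are my own illustrative conventions, not a format any particular model requires, and `build_prompt` is a hypothetical helper. The string it returns goes to whatever LLM client you already use.

```python
def build_prompt(documents: dict[str, str], question: str) -> str:
    """Concatenate every document into one prompt, wrapping each in a
    labeled block so the model can cite its source by name."""
    parts = [
        "Answer the question using only the documents below.",
        "Cite the document name for each claim.",
        "",
    ]
    for name, text in documents.items():
        parts.append(f'<document name="{name}">')
        parts.append(text.strip())
        parts.append("</document>")
        parts.append("")
    parts.append(f"Question: {question}")
    return "\n".join(parts)


if __name__ == "__main__":
    # Hypothetical two-document knowledge base.
    kb = {
        "handbook.pdf": "Employees accrue 1.5 vacation days per month.",
        "q3-report.pdf": "Q3 revenue grew compared with Q2.",
    }
    print(build_prompt(kb, "How many vacation days do employees accrue per month?"))
```

Labeling each block with its file name keeps a lightweight citation trail without a retrieval layer. The model can name the document it used, and you can check it.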
If your knowledge base is too large for context, start with the simplest possible retrieval: keyword search, SQL full-text index, or BM25. Add vector search only if you have a demonstrated recall problem that keyword methods cannot solve. This is the boring, cost-effective path. It works. The people who are not on it are the people who are three months into building a RAG pipeline and are not asking whether they needed one.
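"Demonstrated" means measured. Here is a small sketch of recall@k over a hand-labeled query set, which is the number I would want to see before anyone adds a vector index. The toy `keyword_search`, the two-document corpus, and the labeled queries are all hypothetical. Swap in your real search function and real labels.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(
    search: Callable[[str], list[str]],
    labeled_queries: dict[str, set[str]],
    k: int,
) -> float:
    """Average recall@k over a set of queries with known relevant documents."""
    if not labeled_queries:
        raise ValueError("labeled_queries must not be empty")
    scores = [recall_at_k(search(query), relevant, k)
              for query, relevant in labeled_queries.items()]
    return sum(scores) / len(scores)


# Hypothetical toy corpus and a deliberately naive keyword search.
DOCS = {
    "handbook": "vacation policy and paid leave",
    "finance": "q3 financial results",
}


def keyword_search(query: str) -> list[str]:
    """Return ids of documents sharing at least one term with the query."""
    terms = set(query.lower().split())
    return [doc_id for doc_id, text in DOCS.items() if terms & set(text.lower().split())]


if __name__ == "__main__":
    labeled = {
        "q3 financial results": {"finance"},
        "how much time off do I get": {"handbook"},
    }
    print(mean_recall_at_k(keyword_search, labeled, k=5))
```

In this toy set, the keyword query hits and the paraphrased one ("time off" versus "vacation") misses, so mean recall comes out at 0.5. If your real query logs look like that second query often enough to matter, you have a demonstrated recall problem. If they do not, you have your answer.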
RAG became the default answer to the wrong problem. In early 2023, when 4K to 8K context windows were the norm and embedding models were new, retrieval was necessary. In 2026, with context windows of up to 1M tokens and commoditized embedding infrastructure, retrieval is often ceremonial. The vector DB industry needs you to believe otherwise. The honest engineers are the ones who will tell you that the pipeline you are building to retrieve 10 documents you could have put in the prompt is not a sophistication signal. It is a tax.
Next time you reach for a vector database, ask yourself one question: what problem am I solving that I cannot solve by putting more context in the prompt? If the answer is "I do not know" or "probably none," you have your answer.
The reflex to reach for RAG is costing you. Stop it.
— Mr. Technology