
Half the RAG stacks I have torn down for paying customers this year solve a problem the underlying model stopped having in late 2025. The architecture is a 2023 answer to a 2023 question — and the question is obsolete. With 1M- to 2M-token contexts now table stakes, retrieval-first is a tax on latency, an excuse to avoid real evaluation, and a billing meter dressed up as engineering.
The original pitch was real. GPT-3.5 had a 4K context, and a 400-page policy manual did not fit. Embed it, retrieve chunks, prepend them, let the model answer. Right answer in 2023. Wrong answer in 2026.
Gemini 2.5 Pro shipped at 2M tokens. Claude Opus 4.5 hit 1M in late 2025. GPT-5.x, DeepSeek, Qwen, and MiniMax all sit in the 256K-to-1M range, roughly 600 to 2,500 pages (Anthropic Claude Opus 4.5 release notes, Google Gemini 2.5 Pro docs). For "answer questions about a corporate document collection," long context is a strict superset of what retrieval offers: the model sees everything a retriever could have returned, plus everything it would have missed. Retrieval adds failure modes, not capabilities: the wrong chunk comes back, the answer-bearing chunk is missed, a chunk boundary splits a sentence, or the embedding model and the generative model disagree about what is relevant. Long context removes every one of those for any corpus that fits in the window.
The dirty secret of the RAG industry: almost nobody runs a real eval harness on retrieval. They eval the final answer — "did the user thumbs-up the response?" — and treat retrieval as a black box. When the answer is wrong, they blame the model. They do not instrument which chunk was retrieved, what the cosine similarity was, or whether the reranker promoted the right candidate. The harness does not exist.
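For readers who have never seen one, here is a minimal sketch of what chunk-level instrumentation could look like. It assumes you have a small set of queries with human-labeled "gold" chunk ids; the record layout and every name in it (`RetrievalRecord`, `summarize`, the field names) are hypothetical and not taken from any particular vendor or library.

```python
from dataclasses import dataclass


@dataclass
class RetrievalRecord:
    """One logged retrieval call. All field names are illustrative assumptions."""
    query: str
    retrieved_ids: list[str]  # chunk ids in ranked order, as the retriever returned them
    scores: list[float]       # similarity score logged for each retrieved chunk
    gold_ids: set[str]        # chunk ids a human labeled as containing the answer


def recall_at_k(rec: RetrievalRecord, k: int) -> float:
    """Fraction of gold chunks that appear in the top k results."""
    if not rec.gold_ids:
        return 0.0
    hits = rec.gold_ids & set(rec.retrieved_ids[:k])
    return len(hits) / len(rec.gold_ids)


def reciprocal_rank(rec: RetrievalRecord) -> float:
    """1 / rank of the first gold chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(rec.retrieved_ids, start=1):
        if chunk_id in rec.gold_ids:
            return 1.0 / rank
    return 0.0


def summarize(records: list[RetrievalRecord], k: int = 5) -> dict[str, float]:
    """Aggregate chunk-level metrics over a labeled query set."""
    if not records:
        raise ValueError("need at least one labeled retrieval record")
    n = len(records)
    return {
        "recall_at_k": sum(recall_at_k(r, k) for r in records) / n,
        "mrr": sum(reciprocal_rank(r) for r in records) / n,
        # Share of queries where no gold chunk reached the top k at all.
        "miss_rate": sum(1 for r in records if recall_at_k(r, k) == 0.0) / n,
    }
```

The point of the `miss_rate` number is that it separates "the model got it wrong" from "the model never saw the answer," which answer-level thumbs-up rates cannot do.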
The 14-box architecture diagram — vector DB, reranker, query rewriter, HyDE, GraphRAG hop, agentic retriever, citation post-processor, answer synthesizer — looks rigorous in a pitch deck. In production, retrieval breaks most often and gets debugged least. Microsoft Research's GraphRAG paper documented 3x-6x cost and 2x-4x latency versus naive RAG, with a small, corpus-dependent quality improvement. (Microsoft Research GraphRAG)
The platforms I have audited fall into three buckets. One: the corpus fits in 1M tokens with room to spare, and the pipeline exists because the team built it in 2023. Two: the corpus is large but the queries are easy, and a single long-context prompt would beat chunked retrieval. Three: the corpus is large, the queries are hard, and the pipeline is disguising the fact that nobody has built an eval harness. Bucket three is the most damaging. There, RAG is a confidence artifact: it looks rigorous while retrieval quality goes unmeasured.
The honest counterargument: for multi-million-token enterprise search, real-time data, or regulated records, retrieval is still necessary, and stuffing the context is a non-starter on cost.
True, but it is not a counterargument to the take. The take is that most RAG pipelines are theater for the use case they were sold for: answering questions over a static document collection inside a 1M-2M token budget. For the genuine large-corpus, real-time, cost-sensitive, citation-required use cases, retrieval is real. The market is not full of those. It is full of "AI over your knowledge base" SaaS products that built a RAG pipeline because that was the LangChain diagram in 2023 and have not updated it since.
The first question is not "which embedding model, which vector DB?" The first question is "does the corpus fit in 1M tokens, and if not, can I shrink it or summarize the irrelevant parts?" If the answer is yes, and for most "knowledge base AI" use cases it is, long context is cheaper, faster, simpler to evaluate, and strictly higher quality.
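A rough way to answer the "does it fit" question before touching a vector DB is to estimate the corpus size in tokens. The sketch below assumes about four characters per token, a common rule of thumb for English text and not a property of any specific model; use the target model's real tokenizer for a number you intend to rely on. The function names and the reserved-token figure are hypothetical.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Heuristic token count. The 4 chars/token ratio is an assumption, not a measurement."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(
    documents: list[str],
    context_limit: int = 1_000_000,
    reserved_tokens: int = 50_000,  # assumed headroom for instructions, question, and answer
    chars_per_token: float = 4.0,
) -> dict:
    """Report whether the whole corpus fits in a single prompt."""
    corpus_tokens = sum(estimate_tokens(doc, chars_per_token) for doc in documents)
    budget = context_limit - reserved_tokens
    return {
        "corpus_tokens": corpus_tokens,
        "budget": budget,
        "fits": corpus_tokens <= budget,
        "headroom": budget - corpus_tokens,  # negative means the corpus must shrink
    }
```

A negative `headroom` tells you how much has to be trimmed or summarized before long context is even an option.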
If you are buying a RAG platform, demand the eval harness. Demand chunk-level retrieval metrics, not just answer-level thumbs-up rates. Demand a benchmark of long-context-only against RAG on your actual corpus. If the vendor cannot produce it, you are buying the architecture diagram, not the engineering.
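The benchmark itself does not need to be elaborate. Below is a minimal sketch of a side-by-side harness, assuming you have a list of questions with expected answers and two callables that wrap the pipelines. `long_context_fn` and `rag_fn` are placeholders you would supply yourself, and `exact_match` is a deliberately crude stand-in for whatever grading rubric you actually trust.

```python
import time
from typing import Callable


def exact_match(predicted: str, expected: str) -> bool:
    """Crude correctness check; replace with a real grader for free-text answers."""
    return predicted.strip().lower() == expected.strip().lower()


def run_pipeline(
    answer_fn: Callable[[str], str],
    labeled_questions: list[tuple[str, str]],
    is_correct: Callable[[str, str], bool],
) -> tuple[list[bool], list[float]]:
    """Run one pipeline over every question, recording correctness and wall-clock latency."""
    correct, latencies = [], []
    for question, expected in labeled_questions:
        start = time.perf_counter()
        predicted = answer_fn(question)
        latencies.append(time.perf_counter() - start)
        correct.append(is_correct(predicted, expected))
    return correct, latencies


def compare_pipelines(
    labeled_questions: list[tuple[str, str]],
    long_context_fn: Callable[[str], str],
    rag_fn: Callable[[str], str],
    is_correct: Callable[[str, str], bool] = exact_match,
) -> dict:
    """Compare accuracy and mean latency, and list questions where the two disagree."""
    if not labeled_questions:
        raise ValueError("need at least one labeled question")
    lc_ok, lc_lat = run_pipeline(long_context_fn, labeled_questions, is_correct)
    rag_ok, rag_lat = run_pipeline(rag_fn, labeled_questions, is_correct)
    n = len(labeled_questions)
    return {
        "long_context": {"accuracy": sum(lc_ok) / n, "mean_latency_s": sum(lc_lat) / n},
        "rag": {"accuracy": sum(rag_ok) / n, "mean_latency_s": sum(rag_lat) / n},
        # Questions where exactly one pipeline was right are the ones worth reading by hand.
        "disagreements": [
            q for (q, _), a, b in zip(labeled_questions, lc_ok, rag_ok) if a != b
        ],
    }
```

The `disagreements` list is usually the most useful output: it is the short set of questions where the architecture choice changed the outcome.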
The naive RAG era is over. The use case it was built for was solved by a longer context window. Update the architecture, or keep selling the diagram.
— Mr. Technology