Why RAG Quality Lives or Dies on Retrieval, Not the Model
Retrieval Is the Ceiling on Answer Quality
A retrieval-augmented system can only answer with what it retrieves. If the right chunk does not surface in the top results, the model has nothing to ground a correct answer in, no matter how capable it is. Teams that skip past this find out the hard way, usually when a customer screenshots a confidently wrong answer.
The fix is not a better model. It is a better retrieval layer.
The Three Failures We See Most Often
In the production RAG systems we audit, the same three patterns show up:
- Chunks are too large, so the embedding represents an average of unrelated topics
- Chunks are too small, so the embedding loses the context needed to be retrievable
- The evaluation set was written by engineers, not derived from how users actually ask questions
The third one is the most damaging. Engineers write clean, factual queries. Real users misspell terms, omit half the context, and ask multi-part questions in a single sentence. A retriever tuned for the first will fail at the second.
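The first two failures are easier to diagnose when chunk size and overlap are explicit parameters that can be swept against an evaluation set. Below is a minimal sketch, assuming simple whitespace word splitting; the function name `chunk_words` and its default values are illustrative choices, not taken from any particular library or from a specific engagement.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of up to `size` words.

    Consecutive chunks share `overlap` words, so a sentence that straddles
    a boundary still appears with some surrounding context in one chunk.
    The defaults are hypothetical starting points, not recommendations.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")

    words = text.split()
    step = size - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # Stop once this chunk has reached the end of the text,
        # otherwise the loop would emit a redundant trailing chunk.
        if start + size >= len(words):
            break
    return chunks
```

Production chunkers usually split on tokens or document structure (headings, paragraphs) rather than raw words, but the two knobs stay the same: how much text goes into one embedding, and how much neighbouring chunks share.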
Build the Evaluation Set From Real Questions
The first thing we do on a RAG engagement is pull real user questions from logs, support tickets, or prior conversations. We label each one with the document or chunk that should answer it. That becomes the held-out evaluation set.
Every change to chunking, embedding model, or ranker gets scored against that set. We track recall@10 and recall@20 weekly. A change that bumps recall@10 by two points is the kind of win that compounds. A change that does not move recall@10 is theater.
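With one labeled chunk per question, recall@k reduces to the fraction of questions whose labeled chunk appears in the top k results. A minimal sketch of that scoring loop follows. The `retrieve` callable is a stand-in for whatever retrieval pipeline is under test, and the `(question, chunk id)` pair layout is an assumed format for the evaluation set, not a prescribed one.

```python
from typing import Callable, Sequence

# Assumed layout: (user question, id of the chunk labeled as its answer).
EvalExample = tuple[str, str]


def recall_at_k(
    examples: Sequence[EvalExample],
    retrieve: Callable[[str], Sequence[str]],
    k: int,
) -> float:
    """Fraction of questions whose labeled chunk is among the top k results.

    `retrieve` is a hypothetical hook: it takes a question and returns
    chunk ids ordered from most to least relevant.
    """
    if not examples:
        raise ValueError("evaluation set is empty")
    hits = sum(1 for question, gold_id in examples if gold_id in retrieve(question)[:k])
    return hits / len(examples)


# Toy usage with a hard-coded ranking in place of a real retriever.
rankings = {
    "how do i reset pasword": ["c7", "c2", "c9"],
    "refund and also cancel?": ["c4", "c1", "c3"],
}
examples = [("how do i reset pasword", "c2"), ("refund and also cancel?", "c3")]
score_at_2 = recall_at_k(examples, lambda q: rankings[q], k=2)
score_at_3 = recall_at_k(examples, lambda q: rankings[q], k=3)
```

Running the same function for k=10 and k=20 before and after each pipeline change gives the week-over-week comparison described above.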
Re-Ranking Is Underused
The retrieval pipeline most teams ship is dense vector search and nothing else. That works for clean queries. It breaks on the long tail of misspelled, underspecified, and multi-part questions.
A cheap cross-encoder re-ranker over the top 50 candidates typically adds 8 to 15 points of recall@10 on real user queries. The latency cost is modest. The accuracy gain is the difference between a useful product and a demo.
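The re-ranking step itself is small. The sketch below shows its shape, assuming the first stage has already returned candidates as `(chunk id, text)` pairs. The `score` callable is a placeholder for a cross-encoder that scores a query and a passage together; `overlap_score` is a toy stand-in that only exists to make the example runnable, and is not a cross-encoder.

```python
from typing import Callable, Sequence

Candidate = tuple[str, str]  # (chunk id, chunk text), an assumed layout


def rerank(
    query: str,
    candidates: Sequence[Candidate],
    score: Callable[[str, str], float],
    top_n: int = 10,
) -> list[str]:
    """Re-order first-stage candidates by a pairwise relevance score.

    Returns the ids of the `top_n` highest-scoring chunks. The sort is
    stable, so ties keep their first-stage order.
    """
    scored = [(score(query, text), chunk_id) for chunk_id, text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:top_n]]


def overlap_score(query: str, text: str) -> float:
    """Toy scorer: number of distinct query words that appear in the text."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


candidates = [
    ("c1", "reset your router"),
    ("c2", "how to reset a forgotten password"),
    ("c3", "password policy"),
]
top_two = rerank("reset password", candidates, overlap_score, top_n=2)
```

In a real pipeline, `candidates` would be the top 50 results from vector search and `score` would wrap a model call, ideally batched so the added latency stays small.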
Where the LLM Actually Matters
The LLM matters at the very end. Once the right context is in the prompt, modern frontier models will answer correctly more often than not. If they do not, the problem is usually the prompt or the formatting of the retrieved chunks, not the model class.
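Because chunk formatting is a common culprit, it helps to build the prompt in one place with one consistent layout. The following is a minimal sketch under stated assumptions: the numbered-block layout, the `source` label, and the instruction wording are illustrative choices, not a format any particular model requires.

```python
from typing import Sequence


def build_prompt(question: str, chunks: Sequence[tuple[str, str]]) -> str:
    """Assemble a prompt from retrieved (source, text) pairs.

    Each chunk becomes a numbered block with its source on the first line,
    so the boundaries between chunks are unambiguous to the model.
    """
    blocks = [
        f"[{i}] source: {source}\n{text.strip()}"
        for i, (source, text) in enumerate(chunks, start=1)
    ]
    context = "\n\n".join(blocks)
    return (
        "Answer the question using only the numbered context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}"
    )


prompt = build_prompt(
    "How do I reset my password?",
    [("faq.md", "  Click 'Forgot password' on the login page.  ")],
)
```

Keeping this in a single function makes formatting changes testable against the same evaluation set as retrieval changes.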
We have shipped RAG systems where switching from GPT-4o to Claude or back made no measurable difference. The same engagements showed 20-point retrieval lifts from better chunking and re-ranking. Spend the budget where the lift is.