Retrieval-Augmented Generation is one of those ideas that makes immediate sense and works brilliantly in a demo. You give an LLM access to a corpus of documents, it retrieves relevant context before answering, and suddenly it stops hallucinating about your proprietary data. Beautiful. Ship it.
Then it goes to production. And you discover a cluster of problems nobody talked about in the tutorial.
The Gap Nobody Advertises
The tutorial RAG system works because tutorial data is clean. The questions are well-formed. The documents are short, coherent, and directly answer the questions being asked. Production is none of those things.
Production data is messy PDFs, inconsistently formatted spreadsheets, and documents where the answer to a question is spread across three sections, two appendices, and a footnote. Production users ask ambiguous questions. They use jargon the documents never use. They ask follow-ups that reference context from five turns ago.
If you have shipped a RAG system, you have almost certainly experienced what we call "precision collapse": the system achieves high recall (it retrieves something relevant) but terrible precision (the retrieved chunks are not what the LLM actually needs to answer correctly).
Chunking Is Your First Real Decision
Most tutorials chunk documents by character count. This is fast and simple, and it is almost always wrong for real documents.
A 512-character chunk boundary will frequently split a sentence, or worse, split a table or a list at a semantically meaningless point. The chunk will contain half an argument, and the retriever will correctly identify it as relevant — but the LLM will not be able to use it properly because the context is incomplete.
What we have found works better in practice:
- Semantic chunking — split on paragraph or section boundaries, not character counts. This requires more preprocessing but dramatically improves retrieval coherence.
- Hierarchical chunking — store both fine-grained chunks and parent document summaries. Retrieve at the fine level, but augment the prompt with the parent summary for context.
- Metadata injection — every chunk should carry the document title, section heading, and page number as structured metadata, queryable independently of the vector search.
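The first and third ideas can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not a production splitter: it splits only on blank-line paragraph boundaries, leaves oversized paragraphs whole, and the size budget and metadata fields are arbitrary placeholders.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def semantic_chunks(document: str, title: str, max_chars: int = 1000) -> list:
    """Split on paragraph boundaries instead of raw character counts,
    merging consecutive paragraphs until a size budget is reached."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    chunks, buffer = [], ""
    for para in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the budget.
        # Note: a single paragraph longer than max_chars stays whole rather
        # than being split mid-sentence.
        if buffer and len(buffer) + len(para) + 2 > max_chars:
            chunks.append(buffer)
            buffer = para
        else:
            buffer = f"{buffer}\n\n{para}" if buffer else para
    if buffer:
        chunks.append(buffer)
    # Metadata injection: each chunk carries document-level fields so it can
    # be filtered or joined independently of the vector search.
    return [Chunk(text=c, metadata={"title": title, "chunk_index": i})
            for i, c in enumerate(chunks)]
```

A real pipeline would extend the metadata with section headings and page numbers extracted during parsing, and add a fallback splitter for paragraphs that blow past the budget.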
Hybrid Search Is Not Optional
Pure dense vector retrieval misses a class of queries that keyword search handles trivially: exact entity names, product codes, specific dates, technical identifiers. A user asking about "clause 14.3(b)" expects you to find clause 14.3(b) — not a semantically similar clause.
Production RAG systems need hybrid retrieval: dense vector search for semantic similarity, plus BM25 or similar sparse retrieval for exact-match keywords. The results are then re-ranked with a cross-encoder model before being passed to the LLM. This pipeline is more expensive and more complex than pure vector search, but the precision improvement is not marginal — in our legal document system, hybrid search reduced retrieval errors by roughly 40% compared to vector-only.
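One common way to merge the dense and sparse result lists before the cross-encoder re-rank is reciprocal rank fusion (RRF). The sketch below assumes you already have each retriever's output as a ranked list of chunk ids; the constant k=60 is the conventional default from the RRF literature, not something the text above specifies.

```python
def reciprocal_rank_fusion(ranked_lists: list, k: int = 60) -> list:
    """Fuse multiple ranked lists of chunk ids with Reciprocal Rank Fusion:
    score(d) = sum over lists of 1 / (k + rank of d in that list).
    Ids missing from a list simply contribute nothing from it."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; this list then feeds the re-ranker.
    return sorted(scores, key=scores.get, reverse=True)
```

RRF is attractive because it needs no score calibration between the two retrievers: only ranks matter, so BM25 scores and cosine similarities never have to live on the same scale.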
Evaluation Is a Product
Here is the uncomfortable truth: you cannot know if your RAG system is working without a proper evaluation framework, and most teams do not build one until after they have had a production incident.
You need a curated set of question-answer pairs that represents the full distribution of real user queries — including edge cases, adversarial inputs, and questions where the correct answer is "I don't know, it's not in the documents." You need to track retrieval metrics (precision, recall, MRR) independently of generation quality, because they fail in different ways.
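Tracking retrieval metrics separately can start as simply as the helper below, which computes per-query precision, recall, and reciprocal rank against human-labeled relevant chunk ids; averaging the reciprocal ranks across the eval set gives MRR. The function name and return shape are illustrative, not from any particular library.

```python
def retrieval_metrics(retrieved: list, relevant: set) -> dict:
    """Score one query. `retrieved` is the ranked list of chunk ids the
    system returned; `relevant` is the set of ids judged correct."""
    relevant = set(relevant)
    hits = [doc for doc in retrieved if doc in relevant]
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    # Reciprocal rank: 1 / position of the first relevant result, 0 if none.
    rr = 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            rr = 1.0 / rank
            break
    return {"precision": precision, "recall": recall, "reciprocal_rank": rr}
```

Keeping these numbers separate from generation quality matters because a bad answer with perfect retrieval points at the prompt or model, while a bad answer with a reciprocal rank of zero points at chunking or search.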
We run evaluations on every significant change to the pipeline — chunking strategy, embedding model, retrieval parameters, prompt template. Evaluation is not a one-time step before launch; it is a continuous process.
Latency Budgets
A RAG query involves at minimum: embedding the query, running vector search, optionally re-ranking, constructing the prompt, and running inference. At each step, latency accumulates. In a synchronous, single-turn interface, users will notice anything over 2–3 seconds. In an agentic system with multiple RAG calls per turn, this becomes critical.
Practical strategies we use: aggressive caching of embedding results for repeat queries, pre-computing embeddings for high-frequency questions, streaming the generation so users see output immediately rather than waiting for the full response, and setting hard timeouts with graceful degradation.
What We Would Do Differently
If we were starting over on the legal document system we shipped in January, we would front-load evaluation infrastructure. We spent the last two weeks of the project retrofitting an eval harness that should have been built in week one. The system works well, but we made decisions based on intuition that we should have been making based on data.
RAG is a proven approach for grounding LLM outputs in real documents. The techniques exist to do it well. The challenge is discipline — treating it as an engineering problem with measurable outcomes rather than a prompt-engineering exercise where you keep tweaking until it feels right.