
Advanced RAG Patterns for Enterprise

Basic RAG gets you started. Advanced patterns like hybrid search, re-ranking, and multi-step retrieval get you to production quality.
18 April 2024·7 min read
Mak Khan
Chief AI Officer
We published our RAG explainer last year, and it remains the most-read piece on this site. But basic RAG (retrieve some chunks, stuff them into a prompt, generate a response) is where the journey starts, not where it ends. Enterprise RAG needs more than a vector database and a prayer.

What You Need to Know

  • Basic RAG works for demos. Production RAG needs advanced patterns. Chunking strategy, retrieval quality, and answer grounding all require engineering beyond the default setup.
  • Hybrid search (semantic + keyword) outperforms pure vector search in enterprise settings where exact terms matter (policy numbers, product codes, regulatory references).
  • Re-ranking retrieved results before passing them to the LLM significantly improves answer quality. Not all retrieved chunks are equally relevant.
  • Multi-step retrieval (query decomposition, follow-up retrieval) handles complex questions that basic RAG misses entirely.
  • Evaluation is non-negotiable. Without systematic measurement of retrieval and generation quality, you're guessing.
35-50%
improvement in answer accuracy when moving from basic to advanced RAG patterns in enterprise document sets
Source: RIVER, internal benchmarking across client engagements, 2023-2024

Why Basic RAG Falls Short

Basic RAG follows a simple pipeline: chunk documents, embed them, store in a vector database, retrieve the top-k similar chunks, generate an answer. This works well for straightforward questions against clean, well-structured documents.
Enterprise reality is different. Documents are messy. Questions are complex. Users expect precision.
Here's where basic RAG breaks down:
Chunking failures. A fixed-size chunk (say, 500 tokens) might split a critical table across two chunks, or combine unrelated sections. The model gets half the table or irrelevant context.
Retrieval noise. Semantic similarity isn't the same as relevance. A chunk that's semantically similar to the query might not contain the answer. When you retrieve 10 chunks and 6 are noise, the model has to work harder, and it sometimes gets confused.
Complex questions. "What's the difference between our 2023 and 2024 leave policies for employees on parental leave?" requires retrieving from multiple documents, comparing them, and synthesising. Basic RAG retrieves chunks for the query as-is, which usually misses half the picture.

Pattern 1: Hybrid Search

Pure vector search finds semantically similar content. But enterprise users often search for specific terms: policy numbers, clause references, product codes, technical acronyms. These are exact-match problems, not similarity problems.
Hybrid search combines vector similarity with keyword matching (typically BM25). The retrieval layer runs both searches, then fuses the results.
In practice, we weight the fusion based on query type. A question like "What does our policy say about remote work?" leans on semantic search. A question like "What does clause 4.3.2 of the employment agreement state?" leans on keyword search. The system can detect which pattern to favour.
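One common way to fuse the two result lists is reciprocal rank fusion (RRF). The sketch below is a minimal, self-contained illustration: the document IDs are made up, `k=60` is the conventional RRF constant, and the per-retriever weights stand in for the query-type weighting described above. A real system would feed in rankings from an actual BM25 index and vector store.

```python
# Reciprocal rank fusion (RRF): merge a semantic ranking and a keyword
# (BM25-style) ranking into one list. Each document scores the sum of
# weight / (k + rank) across the lists it appears in.

def rrf_fuse(rankings, weights=None, k=60):
    """Fuse ranked lists of doc IDs (best-first) into one ranking."""
    weights = weights or [1.0] * len(rankings)
    scores = {}
    for weight, ranking in zip(weights, rankings):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative rankings from the two retrievers.
semantic = ["doc_a", "doc_b", "doc_c"]
keyword = ["doc_c", "doc_a", "doc_d"]

# Balanced fusion for a semantic-leaning query.
fused = rrf_fuse([semantic, keyword])

# Down-weight semantic search for an exact-match query (e.g. a clause number).
keyword_leaning = rrf_fuse([semantic, keyword], weights=[0.3, 1.0])
```

With balanced weights, `doc_a` wins because both retrievers rank it highly; shifting weight toward the keyword list promotes `doc_c`, mirroring the clause-number query above.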

Pattern 2: Re-Ranking

Basic RAG takes the top-k results from the vector database and passes them all to the model. The assumption is that the vector database's similarity ranking is good enough. Often, it isn't.
Re-ranking adds a second pass. After initial retrieval (usually a broader set, say top-20), a cross-encoder model scores each result against the original query with much higher precision than the embedding similarity score. The top results after re-ranking are then passed to the LLM.
Cross-encoder re-ranking is more compute-intensive than vector similarity, which is why you don't use it for the initial retrieval. But applied to a shortlist of 20-50 candidates, it's fast and dramatically improves precision.
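The two-stage shape of this pattern can be sketched as follows. The `score_pair` function here is a toy lexical-overlap stand-in so the example is self-contained; in production you would replace it with a real cross-encoder score (e.g. from sentence-transformers or Cohere Rerank), and the candidate texts are invented.

```python
# Two-stage retrieval: over-retrieve a broad candidate set from the vector
# store, then re-rank with a precise scorer before passing the top few
# chunks to the LLM.

def score_pair(query, chunk):
    """Toy relevance score: fraction of query terms present in the chunk.
    Stand-in for a cross-encoder's query/chunk relevance score."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / len(q_terms)

def rerank(query, candidates, top_n=3):
    """Score every candidate against the query and keep the best top_n."""
    ranked = sorted(candidates, key=lambda c: score_pair(query, c), reverse=True)
    return ranked[:top_n]

candidates = [
    "Remote work requests are approved by the line manager.",
    "The cafeteria opens at 8am on weekdays.",
    "Employees may work remotely up to three days per week.",
    "Parking permits are issued quarterly.",
]
top = rerank("how many days can employees work remotely", candidates, top_n=2)
```

The vector store's top-20 becomes `candidates`; only the re-ranked shortlist reaches the prompt, which is what keeps the noisy chunks out of the context window.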
Quick Win
If you're running basic RAG in production, adding a re-ranking step is the single highest-impact improvement you can make. Models like Cohere Rerank or cross-encoder models from Hugging Face can be integrated in a day.

Pattern 3: Query Decomposition

Complex questions often need to be broken into sub-questions. "Compare our claims process for domestic vs international travel insurance" is really two retrieval tasks: one for domestic claims process, one for international.
Query decomposition uses the LLM itself to break a complex question into retrieval sub-queries. Each sub-query retrieves its own set of chunks. The combined context is then used for the final generation step.
This pattern adds latency (an extra LLM call for decomposition), but for complex questions, the accuracy improvement is substantial. We typically implement it conditionally: simple questions go through the standard pipeline, while complex questions trigger decomposition.
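A minimal sketch of the conditional flow, with `call_llm` as a placeholder that returns a canned decomposition (a real implementation would call your model provider), and `retrieve`, `generate`, and `is_complex` supplied by the caller:

```python
# Conditional query decomposition: complex questions are split into
# sub-queries, each retrieved independently; simple questions pass through.

def call_llm(prompt):
    # Placeholder: a real implementation calls your model provider here.
    return ("What is the claims process for domestic travel insurance?\n"
            "What is the claims process for international travel insurance?")

def decompose(question):
    """Ask the LLM to split a complex question into retrieval sub-queries."""
    prompt = ("Break this question into independent retrieval sub-queries, "
              "one per line:\n" + question)
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]

def answer(question, retrieve, generate, is_complex):
    """Route: simple questions go straight through; complex ones decompose."""
    sub_queries = decompose(question) if is_complex(question) else [question]
    context = [chunk for q in sub_queries for chunk in retrieve(q)]
    return generate(question, context)

sub_queries = decompose(
    "Compare our claims process for domestic vs international travel insurance")
```

The `is_complex` gate is where the latency trade-off lives: it can be a cheap heuristic (question length, presence of "compare"/"difference") or a small classifier, so the extra LLM call only fires when it pays for itself.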

Pattern 4: Contextual Chunking

Instead of fixed-size chunks, contextual chunking preserves document structure. Tables stay intact. Sections are chunked at heading boundaries. Metadata (document title, section heading, page number) is attached to each chunk.
This requires more preprocessing effort, but the retrieval quality improvement is worth it. When the model receives a chunk that says "Section 4.2: Parental Leave Entitlements" with the full section intact, it generates better answers than when it receives an arbitrary 500-token fragment.
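A stripped-down sketch of heading-boundary chunking. The heading convention (lines starting with "Section") and the document text are illustrative; real preprocessing would also handle tables, page numbers, and nested headings.

```python
# Contextual chunking: split a document at section headings and attach
# the document title and section heading as metadata on each chunk.

def chunk_by_heading(title, text, heading_prefix="Section"):
    chunks, current_heading, buffer = [], None, []

    def flush():
        if buffer:
            chunks.append({
                "document": title,
                "heading": current_heading,
                "text": "\n".join(buffer).strip(),
            })

    for line in text.splitlines():
        if line.startswith(heading_prefix):
            flush()  # close out the previous section
            current_heading, buffer = line.strip(), []
        else:
            buffer.append(line)
    flush()  # close out the final section
    return chunks

doc = """Section 4.1: Annual Leave
Employees accrue four weeks of annual leave per year.
Section 4.2: Parental Leave Entitlements
Primary carers are entitled to 18 weeks of paid leave."""
chunks = chunk_by_heading("HR Policy 2024", doc)
```

Each chunk now carries its heading, so the retrieved fragment arrives as "Section 4.2: Parental Leave Entitlements" plus its full section, rather than an arbitrary 500-token slice.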

Evaluating RAG Quality

You can't improve what you don't measure. RAG evaluation has two components:
Retrieval quality. Are the right documents being retrieved? Measure precision (how many retrieved chunks are relevant) and recall (how many relevant chunks are retrieved).
Generation quality. Given the right context, does the model produce an accurate, well-grounded answer? Measure faithfulness (does the answer match the source material) and completeness (does the answer address the full question).
We build evaluation datasets early in every engagement, typically 50-100 question-answer pairs with source annotations. This becomes the benchmark for every pipeline change.
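The retrieval-side metrics reduce to set arithmetic over chunk IDs. A minimal sketch, with invented chunk IDs standing in for the source annotations described above:

```python
# Retrieval quality per question: precision (how many retrieved chunks are
# relevant) and recall (how many relevant chunks are retrieved), given the
# annotated relevant-chunk IDs from the evaluation dataset.

def retrieval_metrics(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# One evaluation question: four chunks retrieved, three annotated as relevant.
p, r = retrieval_metrics(
    retrieved=["c1", "c2", "c3", "c4"],
    relevant=["c2", "c4", "c9"],
)
```

Averaging these across the 50-100 annotated questions gives a single pair of numbers to compare before and after every pipeline change; faithfulness and completeness on the generation side typically need an LLM-as-judge or human review rather than set arithmetic.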
The difference between a demo RAG system and a production RAG system isn't the model or the vector database; it's the retrieval engineering around them. These patterns aren't glamorous, but they're what makes the difference between 70% accuracy and 95%.