The Maths Behind RAG Retrieval Quality

RAG systems are only as good as their retrieval layer. Here's the mathematical framework for evaluating whether your RAG is returning the right results.
20 March 2024·6 min read
Mak Khan
Chief AI Officer
Dr Vincent Russell
Machine Learning (AI) Engineer
Everyone building enterprise AI is using RAG. Few are measuring whether their RAG is any good. The standard evaluation is vibes: "the answers seem reasonable." That's not measurement. Here's a mathematical framework for evaluating retrieval quality with statistical rigour.

What You Need to Know

  • RAG retrieval quality is measurable using precision, recall, and ranking metrics at the chunk level
  • Most enterprise RAG systems retrieve too many irrelevant chunks (low precision) or miss relevant ones (low recall)
  • The embedding model choice, chunk strategy, and similarity threshold each have quantifiable impact on retrieval quality
  • A systematic evaluation framework catches degradation before users notice it
  • 0.72 — average precision@5 for enterprise RAG systems, out of 1.0 (Source: RIVER Group, enterprise engagement data)
  • 0.58 — average recall@10 in typical enterprise knowledge bases (Source: RIVER Group, enterprise engagement data)
  • 23% — of retrieved chunks are irrelevant in a typical enterprise RAG query (Source: RIVER Group, enterprise engagement data)

Why Retrieval Quality Matters More Than Generation Quality

The generation model (GPT-4, Claude, etc.) gets the attention, but the retrieval layer determines what information the model has to work with. If retrieval returns the wrong chunks, even the most capable generation model will produce confident, well-written, wrong answers.
This is the RAG quality paradox: the better the generation model, the more dangerous poor retrieval becomes. A weak model with good retrieval produces awkward but accurate answers. A strong model with poor retrieval produces fluent but misleading ones.

The Evaluation Framework

Precision@K

Precision@K answers: "Of the K chunks retrieved, how many are actually relevant?"
If you retrieve 5 chunks and 3 are relevant: Precision@5 = 0.60.
Enterprise target: Precision@5 above 0.80. Below this, users are wading through noise.
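The calculation is simple enough to sketch in a few lines of Python (function and variable names here are illustrative, not from any particular library):

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved chunk IDs that are in the relevant set."""
    top_k = retrieved[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    return hits / k

# 3 of the top 5 retrieved chunks are relevant -> Precision@5 = 0.60
print(precision_at_k(["c1", "c2", "c3", "c4", "c5"], {"c1", "c3", "c5"}))  # 0.6
```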

Recall@K

Recall@K answers: "Of all the relevant chunks in the knowledge base, how many did we retrieve?"
If there are 8 relevant chunks and you retrieved 5 of them: Recall@10 = 0.625.
Enterprise target: Recall@10 above 0.75. Below this, the system misses important context.
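A matching sketch for recall, dividing by the total number of relevant chunks rather than by K (again, names are illustrative):

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of all relevant chunks that appear in the top-k results."""
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)

# 5 of the 8 relevant chunks appear in the top 10 -> Recall@10 = 0.625
```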

Mean Reciprocal Rank (MRR)

MRR answers: "How high in the results is the first relevant chunk?"
If the first relevant chunk is at position 1: reciprocal rank = 1.0. Position 3: reciprocal rank = 0.33.
This matters because generation models weight earlier chunks more heavily. A relevant chunk at position 8 contributes less to the answer than one at position 1.
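MRR averages the reciprocal rank of the first relevant chunk across a set of queries. A minimal sketch:

```python
def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant chunk; 0.0 if none is retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(results, relevant_sets):
    """Average reciprocal rank over a batch of queries."""
    rr = [reciprocal_rank(r, rel) for r, rel in zip(results, relevant_sets)]
    return sum(rr) / len(rr)
```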

Normalised Discounted Cumulative Gain (nDCG)

nDCG measures the quality of the ranking, accounting for the position of each relevant result. It penalises relevant chunks that appear late in the ranking.
nDCG is the most informative single metric for RAG retrieval quality. It captures both relevance and ranking in one number. An nDCG@10 of 0.85+ means your retrieval is placing the right information where the generation model will actually use it.
Dr Vincent Russell
Machine Learning (AI) Engineer
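The discounting in nDCG is logarithmic in position: a relevant chunk at rank i contributes rel / log2(i + 1), and the sum is normalised by the best possible ranking of the same relevance scores. A sketch with binary (0/1) relevance labels:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: position i (1-based) is discounted by log2(i + 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG normalised by the ideal (best-possible) ordering of the same scores."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg else 0.0
```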

What Degrades Retrieval Quality

Embedding Drift

Embedding models represent text as vectors. When new content enters the knowledge base with different vocabulary, structure, or domain terminology, the embedding space can drift. Queries that worked well last month may retrieve different chunks this month.
Detection: Track retrieval metrics weekly. A declining precision trend signals drift.
Fix: Re-embed periodically. For rapidly changing knowledge bases, monthly re-embedding is reasonable. For stable bases, quarterly.

Chunk Size Mismatch

Chunks that are too large contain relevant and irrelevant information mixed together (high recall, low precision). Chunks that are too small lose context (high precision, low recall).
The sweet spot depends on the domain. For structured enterprise documents (policies, procedures), 200-400 tokens per chunk. For narrative content (reports, analysis), 400-800 tokens. Test empirically.
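One common way to test chunk sizes empirically is a sliding window over the token stream, so a chunk-size sweep is just a parameter change (this is a generic sketch, not a specific library's chunker):

```python
def chunk_tokens(tokens, size=300, overlap=50):
    """Split a token list into overlapping chunks of a target size."""
    chunks = []
    step = size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # final chunk already covers the end of the document
    return chunks
```

Re-run the retrieval metrics for each candidate `size` and pick the one that balances precision and recall for your corpus.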

Similarity Threshold Calibration

The cosine similarity threshold determines how close a chunk needs to be to the query to be retrieved. Too low: noise. Too high: misses.
Calibration approach: Plot precision and recall against threshold values. The intersection point is your operating threshold. Adjust per query type if your use cases are diverse.
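The sweep itself can be sketched as follows: for each candidate threshold, compute precision and recall, and keep the threshold where the two curves are closest (the intersection point). `scored_results` and `calibrate_threshold` are illustrative names.

```python
def calibrate_threshold(scored_results, relevant, thresholds):
    """scored_results: list of (chunk_id, cosine_similarity) pairs.
    Returns (gap, threshold, precision, recall) at the precision/recall crossover."""
    best = None
    for t in thresholds:
        retrieved = {cid for cid, score in scored_results if score >= t}
        if not retrieved:
            continue
        p = len(retrieved & relevant) / len(retrieved)
        r = len(retrieved & relevant) / len(relevant)
        gap = abs(p - r)  # smallest gap = closest to the intersection point
        if best is None or gap < best[0]:
            best = (gap, t, p, r)
    return best
```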

Building the Evaluation Pipeline

Step 1: Create a Test Set

Build a set of 50-100 representative queries with manually labelled relevant chunks. This is labour-intensive but essential. Without ground truth, you can't measure retrieval quality.

Step 2: Automate Metrics

Run the test set through your retrieval pipeline weekly. Calculate precision@5, recall@10, MRR, and nDCG@10. Track trends.
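The weekly run reduces to a loop over the labelled test set. A minimal sketch, assuming `retrieve` is your pipeline's function from query to ranked chunk IDs (the names here are placeholders for your own code):

```python
def evaluate(test_set, retrieve):
    """test_set: list of (query, relevant_chunk_ids) pairs.
    retrieve: callable mapping a query to a ranked list of chunk IDs."""
    totals = {"precision@5": 0.0, "recall@10": 0.0, "mrr": 0.0}
    for query, relevant in test_set:
        ranked = retrieve(query)
        totals["precision@5"] += sum(c in relevant for c in ranked[:5]) / 5
        totals["recall@10"] += len(set(ranked[:10]) & relevant) / len(relevant)
        totals["mrr"] += next(
            (1 / rank for rank, c in enumerate(ranked, 1) if c in relevant), 0.0
        )
    n = len(test_set)
    return {name: total / n for name, total in totals.items()}
```

Log each week's output and the trend lines fall out for free.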

Step 3: Set Alerting Thresholds

Define acceptable ranges for each metric. When a metric drops below threshold, investigate.
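Using the enterprise targets stated above (Precision@5 ≥ 0.80, Recall@10 ≥ 0.75, nDCG@10 ≥ 0.85), the alert check is a few lines:

```python
THRESHOLDS = {"precision@5": 0.80, "recall@10": 0.75, "ndcg@10": 0.85}

def check_alerts(metrics, thresholds=THRESHOLDS):
    """Return the names of any metrics that have dropped below their floor."""
    return [name for name, floor in thresholds.items()
            if metrics.get(name, 0.0) < floor]
```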

Step 4: Iterate

Use the metrics to guide improvements. Low precision: refine chunk boundaries, adjust similarity threshold. Low recall: expand the knowledge base, improve embeddings, add query expansion. Low MRR: improve ranking with hybrid search (vector + keyword).

RAG evaluation isn't optional for enterprise systems. If you're deploying RAG without measuring retrieval quality, you're trusting the most important layer of your AI to vibes. The mathematical framework exists. The metrics are well-defined. Build the evaluation pipeline and run it continuously.