Evaluating RAG and MCP

Building a working system is only half the job. Checking whether that system actually performs well matters just as much. This topic covers practical ways to measure quality for both RAG and MCP components, using clear checks rather than guesswork.

Why Evaluation Cannot Be Skipped

A system can look impressive during a quick demo while still failing on real user questions. Structured evaluation catches these hidden weaknesses before users encounter them, protecting both quality and trust.

A Driving Test Analogy

A new driver does not receive a license just from feeling confident behind the wheel. A structured driving test checks specific skills under real conditions. Evaluating a RAG or MCP system works the same way, checking specific measurable skills instead of relying on a general impression.

Core RAG Evaluation Measures

Measure	What It Checks
Retrieval accuracy	Did the search step find the truly relevant chunks
Answer faithfulness	Does the final answer stick to the retrieved facts
Answer relevance	Does the final answer actually address the question asked
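
Retrieval accuracy is the easiest of these three measures to compute by hand. The sketch below shows one common way to score it, recall at k: the share of the known-relevant chunks that appear among the top k retrieved results. The function name, the chunk IDs, and the choice of k are illustrative assumptions, not part of any particular RAG library.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k retrieved chunks."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # nothing to find, so there is no meaningful recall
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical data: the retriever's ranked output and the hand-labeled relevant chunks.
retrieved = ["c7", "c2", "c9", "c4"]
relevant = {"c2", "c4"}

print(recall_at_k(retrieved, relevant, k=3))  # 0.5 (only c2 is in the top 3)
print(recall_at_k(retrieved, relevant, k=4))  # 1.0 (both c2 and c4 are found)
```

Faithfulness and relevance are harder to score mechanically and often need a human reviewer or a second model acting as a judge.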

Checking Faithfulness

Spotting an Unfaithful Answer

A retrieved chunk states a return window of thirty days. A faithful answer repeats that thirty-day window exactly. An unfaithful answer states sixty days instead, a figure that never appeared in the retrieved material. Catching this gap protects users from confident but incorrect information.
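
A very rough automated check for this kind of mismatch is to flag any number in the answer that appears in none of the retrieved chunks. The sketch below assumes the figures are written as digits, and the example strings are made up to mirror the return-window case. It is a crude heuristic only: it will miss unfaithful claims that contain no numbers, and it ignores numbers spelled out as words.

```python
import re


def unsupported_numbers(answer, chunks):
    """Return the numbers in the answer that appear in none of the retrieved chunks."""
    source_numbers = set()
    for chunk in chunks:
        source_numbers.update(re.findall(r"\d+", chunk))
    return [n for n in re.findall(r"\d+", answer) if n not in source_numbers]


# Hypothetical retrieved chunk and two candidate answers.
chunks = ["Items may be returned within 30 days of purchase."]
faithful = "You can return the item within 30 days."
unfaithful = "You can return the item within 60 days."

print(unsupported_numbers(faithful, chunks))    # []
print(unsupported_numbers(unfaithful, chunks))  # ['60']
```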

Core MCP Evaluation Measures

Measure	What It Checks
Tool call accuracy	Did the model call the correct tool for the situation
Input correctness	Did the model send correctly formatted input to that tool
Action success rate	Did the tool call actually complete without errors
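
All three measures can be computed from a log of test runs, as long as each entry records which tool was expected, which tool was actually called, whether the input was valid, and whether the call completed. The sketch below assumes such a log already exists. The record fields and the tool names are hypothetical, and how `input_valid` gets decided (for example, by validating the arguments against the tool's declared input schema) is left outside the example.

```python
from dataclasses import dataclass


@dataclass
class ToolCallRecord:
    expected_tool: str   # tool a human labeled as correct for this test case
    called_tool: str     # tool the model actually called
    input_valid: bool    # whether the arguments passed validation
    succeeded: bool      # whether the call completed without errors


def mcp_metrics(records):
    """Compute the three MCP measures as fractions of all recorded calls."""
    total = len(records)
    if total == 0:
        return {}
    return {
        "tool_call_accuracy": sum(r.called_tool == r.expected_tool for r in records) / total,
        "input_correctness": sum(r.input_valid for r in records) / total,
        "action_success_rate": sum(r.succeeded for r in records) / total,
    }


# Hypothetical log of four test runs.
records = [
    ToolCallRecord("lookup_order", "lookup_order", True, True),
    ToolCallRecord("lookup_order", "search_policy", True, True),
    ToolCallRecord("search_policy", "search_policy", False, False),
    ToolCallRecord("search_policy", "search_policy", True, True),
]

print(mcp_metrics(records))
```

Note that the second record succeeds without errors even though the wrong tool was called, which is why action success rate alone is not enough.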

Building a Simple Test Set

Collect a batch of real or realistic questions covering common use cases.
Write down the ideal correct answer for each question in advance.
Run every question through the system and record its actual answer.
Compare actual answers against the ideal answers, scoring each one.
Review the lowest-scoring cases to find patterns worth fixing.
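
The steps above can be sketched as a small script. Everything here is an assumption made for illustration: `fake_assistant` stands in for the real system, the two test cases are invented, and `score_answer` uses a deliberately crude word-overlap score where a real project would more likely use human review or a model-based grader.

```python
def score_answer(actual, ideal):
    """Crude score: fraction of the ideal answer's words found in the actual answer."""
    ideal_words = set(ideal.lower().split())
    actual_words = set(actual.lower().split())
    if not ideal_words:
        return 0.0
    return len(ideal_words & actual_words) / len(ideal_words)


def run_test_set(test_set, ask):
    """Run every question through `ask`, score it, and return results sorted worst first."""
    results = []
    for case in test_set:
        actual = ask(case["question"])
        score = score_answer(actual, case["ideal"])
        results.append({**case, "actual": actual, "score": score})
    return sorted(results, key=lambda r: r["score"])


def fake_assistant(question):
    """Hypothetical stand-in for the system under test."""
    canned = {"What is the return window?": "The return window is thirty days"}
    return canned.get(question, "I am not sure")


# Hypothetical test set: each question is paired with its ideal answer in advance.
test_set = [
    {"question": "What is the return window?", "ideal": "thirty days"},
    {"question": "Do you ship abroad?", "ideal": "yes to most countries"},
]

results = run_test_set(test_set, fake_assistant)
for r in results:
    print(f'{r["score"]:.2f}  {r["question"]}')
```

Sorting worst first puts the cases most worth reviewing at the top of the output.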

The Testing Loop

A Worked Example

A team tests their support assistant with fifty sample questions. They discover the assistant answers general policy questions well but frequently calls the wrong tool for order lookups. This finding points the team directly toward fixing the MCP tool descriptions, rather than wasting time adjusting unrelated parts of the system.
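
A pattern like this usually becomes visible only when results are grouped by question type. The sketch below shows one way to do that grouping. The category labels and the pass/fail values are invented for illustration and are not the team's actual results.

```python
from collections import defaultdict


def pass_rate_by_category(results):
    """Return the fraction of passing test cases for each question category."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {category: passes[category] / totals[category] for category in totals}


# Hypothetical (category, passed) pairs from a test run.
results = [
    ("policy", True), ("policy", True), ("policy", True), ("policy", False),
    ("order_lookup", True), ("order_lookup", False),
    ("order_lookup", False), ("order_lookup", False),
]

rates = pass_rate_by_category(results)
print(rates)  # {'policy': 0.75, 'order_lookup': 0.25}
```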

Making Evaluation an Ongoing Habit

User needs and available documents keep changing after launch. Running evaluation regularly, not just once before launch, catches new problems as they appear and keeps quality steady over time, rather than letting small issues pile up unnoticed.
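
One simple way to make regular evaluation actionable is to compare each new run against a saved baseline and flag any measure that has dropped noticeably. The sketch below assumes scores are stored as plain dictionaries. The metric names, the numbers, and the 0.05 tolerance are all illustrative choices.

```python
def find_regressions(baseline, current, tolerance=0.05):
    """Return metrics whose score dropped by more than `tolerance` since the baseline."""
    return {
        name: (baseline[name], current[name])
        for name in baseline
        if name in current and baseline[name] - current[name] > tolerance
    }


# Hypothetical scores from the launch-time run and from the latest run.
baseline = {"retrieval_accuracy": 0.90, "faithfulness": 0.95, "tool_call_accuracy": 0.80}
current = {"retrieval_accuracy": 0.91, "faithfulness": 0.85, "tool_call_accuracy": 0.78}

print(find_regressions(baseline, current))  # {'faithfulness': (0.95, 0.85)}
```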