
Building Evaluation into AI Design

Evaluation is not a phase at the end. It is baked into the design from day one. An evidence-based approach to AI systems that know how well they are working.
10 September 2025 · 8 min read
Dr Tania Wolfgramm
Chief Research Officer
Dr Vincent Russell
Machine Learning (AI) Engineer
The standard AI development process goes: design, build, test, evaluate. Evaluation happens at the end, after the system is built, after the architecture is set, after the prompts are written. By that point, the most important evaluation decisions have already been made by default. Vincent and I argue for inverting this entirely. Evaluation is a design activity. It starts before the first line of code.

Why Post-Hoc Evaluation Fails

When evaluation is deferred to the end of development, three things go wrong:
The architecture constrains what you can evaluate. If you did not design the system to capture intermediate outputs, confidence signals, or provenance information, you cannot evaluate those dimensions later without rebuilding. Evaluation needs inform architecture decisions. They must come first.
The metrics default to what is easy to measure. Without deliberate evaluation design, teams measure what the system naturally produces: response times, error rates, maybe output length. These are operational metrics, not quality metrics. They tell you whether the system is running, not whether it is working well.
Evaluation becomes adversarial. When evaluation is a separate phase at the end, it becomes a gate that the system must pass. The incentive shifts from "how can evaluation make this system better?" to "how can we get this system through evaluation?" This is a subtle but damaging distortion.
If you design evaluation into the system from the start, it becomes a feedback mechanism that improves outputs. Bolted on at the end, it is just a judgement that scores them.
Dr Tania Wolfgramm
Chief Research Officer

Evaluation-First Design

Evaluation-first design means starting with the question "how will we know this is working?" before designing how it works. This sounds obvious. In practice, it requires discipline because the pressure to start building is always strong.

Step 1: Define Success

Before any technical design, define what success looks like for this AI capability. Not technical success (the model responds). Business success (the right things happen as a result).
For a document processing AI: Success is not "the model extracts fields from documents." Success is "extracted fields are accurate enough to reduce manual processing by 60% while maintaining compliance standards."
For a customer service AI: Success is not "the model answers questions." Success is "customer queries are resolved faster with equal or higher satisfaction, and escalation rate for complex queries remains within acceptable bounds."
These definitions drive everything that follows. They determine what you measure, how you architect the system to enable measurement, and what thresholds matter.

Step 2: Design the Evaluation Architecture

With success defined, design the system architecture to support continuous evaluation. This means:
Capture intermediate outputs. Do not just capture the final output. Capture the retrieval results, the confidence signals, the reasoning steps, and the context used. These intermediate outputs are essential for diagnosing quality issues.
Vincent's contribution here is specific: design the data pipeline so that every AI interaction produces a structured evaluation record. Input, context retrieved, model used, prompt version, intermediate reasoning (if available), output, and any user corrections. This record is the foundation for all evaluation.
Build comparison infrastructure. Evaluation is always comparative. The system's output compared to a human baseline. Current performance compared to last month. Model A compared to Model B. Design the infrastructure to run comparisons efficiently from the start.
Instrument for user signals. Build feedback mechanisms into the interface from day one. Not just thumbs up/down (though those help), but structured signals: did the user accept the output? Modify it? Reject it? What did they change? These signals are evaluation data.

Step 3: Establish Baselines

Before the AI system is live, establish baselines for the current process. How accurate is the human process? How long does it take? What is the error rate? What do errors cost?
This is where Vincent's statistical rigour matters. The baseline is not a single number. It is a distribution. Human accuracy varies by person, by case type, by time of day. Understanding the variance in the current process is essential for fairly evaluating the AI alternative.
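A simple way to treat the baseline as a distribution rather than a single number is a percentile bootstrap over per-reviewer (or per-case-type) accuracy. This is a minimal sketch with made-up illustrative data, not figures from the article:

```python
import random
import statistics

def bootstrap_ci(samples, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of a
    baseline metric (e.g. human accuracy across reviewers)."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=len(samples)))
        for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2))]
    return lo, hi

# Illustrative per-reviewer accuracy on the same case set: the human
# baseline is a spread, not a point.
human_accuracy = [0.91, 0.88, 0.95, 0.84, 0.93, 0.90, 0.87, 0.92, 0.89, 0.94]
lo, hi = bootstrap_ci(human_accuracy)
```

An AI system scoring inside this interval is performing comparably to the human process; only a score clearly above `hi` supports a claim of improvement.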
82% of enterprise AI evaluations compare AI performance to a theoretical perfect standard rather than the actual human process, inflating perceived failure rates.
Source: MIT Sloan Management Review, 2024

Step 4: Continuous Evaluation Loop

Once the system is live, evaluation is continuous:
Automated evaluation runs daily against a golden test suite. This catches regressions quickly.
Statistical monitoring tracks output distributions, user correction rates, and quality metrics over time. Change-point detection identifies degradation before it becomes obvious.
Periodic deep evaluation involves human reviewers assessing a sample of production outputs against detailed rubrics. This catches quality issues that automated metrics miss.
Comparative evaluation runs whenever models, prompts, or retrieval pipelines change. Before any change goes live, it must demonstrate at least equivalent performance on the evaluation suite.
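The comparative-evaluation gate above can be sketched as a single check: a change ships only if the candidate's mean score on the evaluation suite is not worse than the current system's by more than a small margin. The function name and margin are illustrative assumptions:

```python
def passes_regression_gate(current_scores, candidate_scores, margin=0.01):
    """Release gate: the candidate (new model, prompt, or retrieval
    pipeline) must be at least equivalent to the current system on the
    golden evaluation suite, within `margin`."""
    cur = sum(current_scores) / len(current_scores)
    cand = sum(candidate_scores) / len(candidate_scores)
    return cand >= cur - margin
```

In practice this gate sits in CI: the evaluation suite runs automatically on every proposed change, and a failed gate blocks the deploy rather than relying on someone remembering to re-evaluate.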

The Evidence Standard

Vincent brings a specific perspective to AI evaluation: the evidence standard should match the claim.
If you claim your AI system is "95% accurate," you need evidence that supports that claim with statistical confidence. A test suite of 50 examples is not sufficient. The confidence interval on 50 examples is wide enough to be meaningless.
Minimum sample sizes for meaningful evaluation:
  • Binary classification: 380+ examples for 95% confidence interval of plus or minus 5%
  • Multi-class problems: scale with the number of classes
  • Generation tasks: 100+ examples with multi-reviewer assessment for inter-rater reliability
These numbers surprise most teams. They expect to validate an AI system with 20 or 30 test cases. That is a sanity check, not an evaluation.
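The sample-size figures above follow directly from the normal-approximation confidence interval for a proportion. A short sketch makes the arithmetic concrete:

```python
import math

def ci_halfwidth(p, n, z=1.96):
    """95% confidence interval half-width for an observed proportion p
    over n examples (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)

# Worst case is p = 0.5, where the interval is widest:
# n = 385 gives roughly +/- 5 points, matching the 380+ guideline.
# n = 50 gives roughly +/- 14 points: far too wide to support a claim
# like "95% accurate".
```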
If a medical trial evaluated a drug with 30 patients, nobody would trust the results. We should hold AI systems to the same statistical standard, particularly when they inform consequential decisions.
Dr Vincent Russell
Machine Learning (AI) Engineer

Practical Implementation

For teams starting to build evaluation into their AI design:
Week 1: Define success metrics. What does good look like for this specific capability? Define three to five metrics that capture business impact, not just technical performance.
Week 2: Design the evaluation record. What data do you need to capture from every AI interaction to enable evaluation? Design the schema before designing the system.
Week 3: Establish baselines. Measure the current human process against your success metrics. This is your comparison point.
Week 4: Build the golden test suite. Create 100+ test cases with verified correct outputs. These become your regression baseline.
Ongoing: Instrument and iterate. Every production interaction generates evaluation data. Use it to improve the system continuously.
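The Week 4 golden test suite plugs into a daily run that is simple to state: score the live system against the verified cases and flag any drop below threshold. A minimal sketch, assuming each case is an (input, expected output) pair and exact match is an acceptable scoring rule for the task:

```python
def run_golden_suite(system, cases, threshold=0.95):
    """Daily regression run against the golden test suite.

    `system` is a callable mapping input -> output; `cases` is a list of
    (input, expected_output) pairs with verified correct answers.
    Returns the pass rate and whether it clears the threshold.
    """
    passed = sum(1 for inp, expected in cases if system(inp) == expected)
    score = passed / len(cases)
    return score, score >= threshold
```

For generation tasks where exact match is too strict, the comparison would be replaced by a rubric-based or similarity-based scorer, but the loop structure stays the same.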
Evaluation-first design is more work upfront. But it produces AI systems that know how well they are working, improve with use, and earn the trust of the people who depend on them. The alternative, building first and evaluating later, produces AI systems that might be working well and might not, and nobody knows which.