"What's the accuracy?" It's the first question every stakeholder asks about an ML model. It's also the least useful question. Accuracy tells you how often the model is right. It tells you nothing about when it's wrong, how badly it's wrong, or whether the errors matter. Here's what to measure instead.
What You Need to Know
- Accuracy is misleading for imbalanced datasets. A fraud detection model that labels everything "not fraud" achieves 99.5% accuracy. It also catches zero fraud.
- Precision and recall tell you where the model fails. Precision: when it says positive, how often is it right? Recall: of all actual positives, how many does it find? The trade-off between these two defines whether your model is useful.
- Calibration matters as much as classification. A model that says "80% confident" should be right 80% of the time. Most models aren't calibrated. This breaks every downstream decision that relies on confidence scores.
- Evaluation must reflect production conditions, not clean test sets. Distribution shift, data quality variation, and adversarial inputs all degrade performance in ways that hold-out test sets don't capture.
Why Accuracy Fails
Vincent has a phrase he uses with clients: "Accuracy is the metric that makes your model look good. The others make your model useful."
Consider a claims triage model. The dataset is 70% standard claims, 20% complex, 10% specialist. If the model classifies everything as "standard," it achieves 70% accuracy. But it's useless, because it misses every complex and specialist claim.
This is not a contrived example. It is the default behaviour of models trained on imbalanced data without appropriate handling. And enterprise data is almost always imbalanced.
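The failure mode is easy to reproduce in a few lines. This is a minimal sketch mirroring the fraud example above; the class names and counts are invented for illustration:

```python
# A "classifier" that always predicts the majority class on an
# imbalanced fraud dataset (labels and proportions are invented).
labels = ["fraud"] * 5 + ["not_fraud"] * 995   # 0.5% fraud
preds = ["not_fraud"] * len(labels)            # majority-class baseline

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
fraud_caught = sum(p == y == "fraud" for p, y in zip(preds, labels))
recall = fraud_caught / labels.count("fraud")

print(f"accuracy: {accuracy:.3f}")  # 0.995 -- looks excellent
print(f"recall:   {recall:.3f}")    # 0.000 -- catches zero fraud
```

The model with the best-looking accuracy number is the one doing nothing at all.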
70-80%
of enterprise ML datasets have class imbalance ratios of 5:1 or greater, making accuracy a misleading primary metric
Source: RIVER, internal analysis across client engagements, 2023-2024
The Metrics That Matter
Precision and Recall
Precision answers: of all the items the model labelled as positive, how many actually were? High precision means few false positives. When the model says "this claim is fraudulent," it's usually right.
Recall answers: of all the items that actually were positive, how many did the model find? High recall means few false negatives. The model catches most actual fraud.
These two metrics are in tension. Increasing precision typically decreases recall and vice versa. The right balance depends on the cost of each error type.
For fraud detection: High recall matters more. Missing a fraudulent claim (false negative) is expensive. Flagging a legitimate claim for review (false positive) is inconvenient but not catastrophic.
For automated approvals: High precision matters more. Approving something that should have been flagged (false positive on "safe") is the expensive error. Sending a safe claim for manual review (false negative on "safe") just adds a bit of processing time.
The F1 score is the harmonic mean of precision and recall, and it is a useful single number when you need one. But Vincent rightly argues that collapsing two informative metrics into one number loses the information that matters most: where is the model failing?
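The definitions above reduce to simple arithmetic over error counts. The counts below are invented for illustration:

```python
# Precision, recall, and F1 from raw error counts (invented numbers).
tp, fp, fn = 80, 20, 40   # true positives, false positives, false negatives

precision = tp / (tp + fp)   # of items flagged positive, how many were right
recall = tp / (tp + fn)      # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.800 recall=0.667 f1=0.727
```

Note how the single F1 number (0.727) hides which side is weak: here the model misses a third of actual positives, which may be exactly the failure that matters.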
The Confusion Matrix
A 2x2 table (or NxN for multi-class) showing exactly how the model's predictions map to reality. True positives, true negatives, false positives, false negatives. It is the most informative single artefact in model evaluation.
For multi-class problems (like claims triage with standard/complex/specialist), the confusion matrix shows you which classes the model confuses. If it frequently classifies specialist claims as complex, that's a specific, actionable finding. "85% accuracy" hides this entirely.
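A multi-class confusion matrix is just a tally of (actual, predicted) pairs. This sketch uses the claims-triage classes from the text with invented labels and predictions:

```python
from collections import Counter

# Tally (actual, predicted) pairs into a confusion matrix (invented data).
classes = ["standard", "complex", "specialist"]
actual    = ["standard", "standard", "complex", "specialist", "specialist", "complex"]
predicted = ["standard", "standard", "standard", "complex", "specialist", "complex"]

# matrix[(a, p)] counts items of actual class a predicted as class p
matrix = Counter(zip(actual, predicted))

print("actual \\ predicted".ljust(20) + "".join(c.ljust(12) for c in classes))
for a in classes:
    print(a.ljust(20) + "".join(str(matrix[(a, p)]).ljust(12) for p in classes))
```

Reading down the "standard" column shows which classes leak into the default bucket, which is precisely what a single accuracy figure hides.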
Calibration
A model's confidence scores should be meaningful. If the model assigns 90% confidence to a prediction, that prediction should be correct roughly 90% of the time. This property is called calibration.
Most models are not well calibrated out of the box. They tend to be overconfident, assigning high confidence scores even when they're wrong. This matters enormously in enterprise contexts where confidence scores drive decisions: "auto-approve if confidence > 0.95" is a reasonable policy only if 0.95 confidence actually means what it says.
How to measure calibration:
Plot predicted confidence against actual accuracy across confidence bins. A perfectly calibrated model produces a diagonal line. Most models produce a curve that sits above the diagonal (overconfident) in some ranges and below it (underconfident) in others.
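The binned comparison can be sketched directly. The confidence scores, outcomes, and bin edges below are invented; a real check would use many more predictions and finer bins:

```python
# Binned calibration check: compare mean predicted confidence with
# observed accuracy per bin (all data invented for illustration).
confidences = [0.55, 0.62, 0.71, 0.78, 0.85, 0.88, 0.93, 0.97, 0.99, 0.95]
correct =     [1,    0,    1,    0,    1,    1,    0,    1,    1,    1]

bins = [(0.5, 0.7), (0.7, 0.9), (0.9, 1.01)]  # coarse bins for illustration
for lo, hi in bins:
    idx = [i for i, c in enumerate(confidences) if lo <= c < hi]
    if not idx:
        continue
    mean_conf = sum(confidences[i] for i in idx) / len(idx)
    acc = sum(correct[i] for i in idx) / len(idx)
    gap = mean_conf - acc   # positive gap => overconfident in this bin
    print(f"bin [{lo:.1f}, {hi:.1f}): conf={mean_conf:.2f} acc={acc:.2f} gap={gap:+.2f}")
```

In this invented sample the top bin claims 96% confidence but is right only 75% of the time, the overconfidence pattern described above.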
How to fix it:
Calibration techniques (Platt scaling, isotonic regression, temperature scaling) adjust the model's confidence scores to match observed accuracy. These are post-hoc adjustments that don't change the model's predictions, only the confidence scores attached to them. Simple to implement, significant impact.
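Temperature scaling, the simplest of the three, divides the model's logits by a single scalar T fitted on held-out data. This sketch fits T by grid search over invented validation logits; production code would use a proper optimiser:

```python
import math

# Temperature scaling sketch: choose T to minimise negative log-likelihood
# on a validation set. Logits and labels below are invented.
val_logits = [[2.5, 0.0], [0.0, 2.2], [3.0, 0.0], [2.8, 0.1]]
val_labels = [0, 1, 1, 0]   # index of the true class; example 3 is a
                            # confidently wrong prediction

def nll(logits, labels, T):
    total = 0.0
    for row, y in zip(logits, labels):
        scaled = [z / T for z in row]
        m = max(scaled)  # log-sum-exp trick for numerical stability
        log_z = m + math.log(sum(math.exp(z - m) for z in scaled))
        total += log_z - scaled[y]   # -log softmax prob of the true class
    return total / len(labels)

# Grid search over T; an overconfident model typically fits T > 1,
# which softens the confidence scores without changing any argmax.
best_T = min((t / 10 for t in range(5, 51)), key=lambda t: nll(val_logits, val_labels, t))
print(f"fitted temperature: {best_T:.1f}")
```

Because dividing all logits by a positive constant preserves their ordering, the model's predictions are untouched; only the confidence attached to each one changes, which is exactly the post-hoc property described above.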
If your stakeholder dashboard shows accuracy and nothing else, you're not monitoring your model. You're reassuring yourself.
Mak Khan
Chief AI Officer
Distribution Shift
The test set is a snapshot of the data distribution at training time. Production data drifts. New document formats appear. Claim patterns change seasonally. Regulatory changes alter the data landscape.
Population stability index (PSI) measures how much the input data distribution has shifted since training. A PSI above 0.2 suggests significant drift that warrants investigation.
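The PSI calculation itself is small once the feature has been binned: sum, over bins, the difference in proportions weighted by the log-ratio. The bin proportions below are invented; in practice the bins and expected proportions come from the training data:

```python
import math

# Population stability index over pre-binned feature proportions (invented).
expected = [0.30, 0.40, 0.20, 0.10]   # training-time distribution
actual   = [0.15, 0.35, 0.30, 0.20]   # current production distribution

psi = sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))
print(f"PSI = {psi:.3f}")  # 0.221 -- above the 0.2 rule of thumb
```

Note that PSI is undefined for empty bins (the log-ratio blows up), so real implementations smooth zero proportions with a small epsilon.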
Performance monitoring on labelled production data (even a sample) catches degradation that input drift metrics alone might miss. A model can receive similar-looking inputs and still degrade if the relationship between inputs and outputs has changed.
A Practical Evaluation Framework
Before Deployment
- Define the cost matrix. What does each error type cost? False positive cost vs false negative cost. This determines your precision-recall trade-off.
- Evaluate on a representative test set. Not a random split. A test set that reflects the data distribution you expect in production, including edge cases and rare classes.
- Report the confusion matrix, not just accuracy. Show stakeholders exactly where the model succeeds and where it fails.
- Calibrate confidence scores. If downstream processes rely on confidence thresholds, calibration is non-negotiable.
- Test on adversarial inputs. What happens with garbled documents? Missing fields? Unusual formats? The model's failure modes matter as much as its success modes.
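The first step above, defining the cost matrix, directly determines where to set the decision threshold. This sketch picks the threshold that minimises expected cost on a validation set; all costs, scores, and labels are invented:

```python
# Choosing a decision threshold from an explicit cost matrix (all invented).
COST_FN = 1000.0   # cost of missing a fraudulent claim (false negative)
COST_FP = 50.0     # cost of manually reviewing a legitimate claim

# (model score, true label) pairs from a validation set -- invented
scored = [(0.95, 1), (0.80, 1), (0.60, 0), (0.55, 1), (0.30, 0), (0.10, 0)]

def total_cost(threshold):
    cost = 0.0
    for score, label in scored:
        flagged = score >= threshold
        if label == 1 and not flagged:
            cost += COST_FN      # false negative: fraud missed
        elif label == 0 and flagged:
            cost += COST_FP      # false positive: needless review
    return cost

best = min((t / 100 for t in range(1, 100)), key=total_cost)
print(f"threshold {best:.2f} -> expected cost {total_cost(best):.0f}")
# threshold 0.31 -> expected cost 50
```

With a 20:1 cost ratio the optimal threshold sits low, accepting extra reviews to avoid missed fraud; flip the ratio and the same code pushes the threshold up. The threshold is a business decision expressed numerically, not a statistical default.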
After Deployment
- Monitor input distribution. Track PSI or equivalent drift metrics on all input features.
- Sample and label production predictions. Even 50-100 labelled examples per week gives you a signal on real-world performance.
- Track performance by subgroup. Overall metrics can mask subgroup degradation. A model that performs well on average but poorly for a specific claim type or customer segment is not performing well.
- Set alert thresholds. Define what level of degradation triggers investigation, retraining, or rollback.
- Version everything. Model version, data version, evaluation results. When performance degrades, you need to trace back to what changed.
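The subgroup-tracking step above is mechanically simple: group the labelled production sample by segment and compute the metric per group. The claim types and records below are invented:

```python
from collections import defaultdict

# Per-subgroup recall on a labelled production sample (invented records).
records = [
    # (claim_type, actual_positive, predicted_positive)
    ("standard", 1, 1), ("standard", 1, 1), ("standard", 0, 0),
    ("complex", 1, 0), ("complex", 1, 1), ("complex", 0, 0),
    ("specialist", 1, 0), ("specialist", 1, 0), ("specialist", 0, 0),
]

by_group = defaultdict(lambda: {"tp": 0, "fn": 0})
for group, actual, predicted in records:
    if actual == 1:
        key = "tp" if predicted == 1 else "fn"
        by_group[group][key] += 1

for group, c in by_group.items():
    recall = c["tp"] / (c["tp"] + c["fn"])
    print(f"{group:<12} recall={recall:.2f}")
```

In this invented sample the overall recall is 0.50, which looks merely mediocre; the breakdown shows it is 1.00 for standard claims and 0.00 for specialist ones, a very different operational problem.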
The Conversation With Stakeholders
The hardest part of model evaluation is not the statistics. It is explaining to stakeholders why "92% accuracy" is not the whole story.
We've found that framing it in business terms works. Don't say "the recall on class 2 is 0.73." Say "the model misses 27% of complex claims and routes them as standard." Don't say "the model is poorly calibrated above 0.9." Say "when the model says it's 95% confident, it's actually right about 80% of the time."
Stakeholders don't need to understand the maths. They need to understand the consequences. Give them consequences, and the right decisions follow.

