
The Evidence Gap in Enterprise AI

Enterprise AI vendors claim extraordinary results. The evidence rarely supports them. Here's how to evaluate AI claims with statistical confidence.
5 September 2025 · 6 min read
Dr Tania Wolfgramm
Chief Research Officer
Dr Vincent Russell
Machine Learning (AI) Engineer
"90% accuracy." "60% time savings." "3x productivity improvement." Enterprise AI is sold on numbers. But when you ask for the methodology behind those numbers, the conversation gets vague fast. Sample sizes, control groups, confidence intervals, and replication studies are rare. The evidence gap between what's claimed and what's proven is the biggest risk in enterprise AI procurement.

What You Need to Know

  • Most enterprise AI performance claims lack the methodological rigour to be credible
  • The absence of evidence isn't evidence of absence, but it should temper expectations and inform procurement decisions
  • Enterprises should demand the same evidentiary standards from AI vendors that they'd expect from any other significant investment
  • Building internal evaluation capability is the most reliable defence against inflated claims

78% of enterprise AI vendor claims cannot be independently verified (Source: MIT Technology Review, 2025)
$4.4T projected global AI market by 2030, largely based on unverified ROI claims (Source: McKinsey Global Institute, 2024)

The Claim-Evidence Spectrum

Not all AI claims are equal. Here's a framework for assessing the evidence behind them:

Level 1: Anecdotal

"One client saw a 60% improvement." No sample size. No methodology. No context about what else changed during the period. This is marketing, not evidence.

Level 2: Internal Benchmarks

"Our model achieves 94% accuracy on our test set." Better, but the test set was designed by the vendor. The data may not represent your environment. There's no independent verification.

Level 3: Independent Evaluation

"Our model was evaluated by [third party] on a standardised benchmark." Stronger. But standardised benchmarks may not reflect your specific use case, data quality, or operational conditions.

Level 4: Peer-Reviewed Research

"Our approach is published in [journal/conference] with full methodology." The strongest publicly available evidence. Subject to peer review, replicable, and methodologically transparent.

Level 5: Replicated Results

"Multiple independent teams have replicated our results in different contexts." The gold standard. Rare in enterprise AI, but the aspiration.

"In research, we distinguish between claims and evidence. A claim is what someone says happened. Evidence is what can be independently verified. Enterprise AI is full of claims. It's remarkably short on evidence."
Dr Tania Wolfgramm, Chief Research Officer

Common Evidence Gaps

Missing Baselines

"AI improved processing time by 60%." Compared to what? The manual process? A different AI system? The same process on a bad day? Without a clear, documented baseline, improvement claims are uninterpretable.

Cherry-Picked Metrics

Vendors report the metrics that look best. Accuracy might be high, but precision on the critical category might be low. Overall time savings might be impressive, but time spent on AI error correction might not be counted.
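
As a hypothetical illustration of how a headline metric can mislead, the short Python sketch below builds a toy dataset in which overall accuracy looks strong while recall on the rare, high-stakes class is poor. All numbers are invented.

```python
# Hypothetical toy data: 0 = routine case, 1 = critical case (rare, costly to miss).
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 90 + [0] * 7 + [1] * 3   # the model mostly predicts "routine"

# Overall accuracy: fraction of all cases labelled correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the critical class: of the real critical cases, how many were caught?
critical = [(t, p) for t, p in zip(y_true, y_pred) if t == 1]
recall_critical = sum(p == 1 for _, p in critical) / len(critical)

print(f"Overall accuracy:         {accuracy:.0%}")         # 93%
print(f"Recall on critical class: {recall_critical:.0%}")  # 30%
```

A vendor quoting only the 93% figure isn't lying; they're reporting the metric that flatters the product.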

Confounded Results

AI was deployed alongside a process redesign, new training, and a management change. The AI team claims the improvement. But which intervention caused it? Without a controlled comparison, attribution is guesswork.

Survivorship Bias

Vendors showcase successful deployments. The failed deployments, the pilots that didn't scale, and the clients who churned are invisible. The success rate you see in case studies doesn't represent the actual success rate of the product.

"Survivorship bias in AI vendor marketing is the same statistical error that leads people to conclude that dolphins push drowning swimmers to shore. You only hear from the survivors. The data set is systematically biased toward positive outcomes."
Dr Vincent Russell, Machine Learning (AI) Engineer

What Enterprises Should Demand

From Vendors

  1. Methodology documentation. How was the benchmark conducted? What was the sample size? What was the test set composition?
  2. Confidence intervals. "94% accuracy (95% CI: 91-96%)" is an honest claim. "94% accuracy" without context is marketing; a sketch of how to sanity-check such an interval follows this list.
  3. Baseline comparisons. What's the comparison point? Manual processing? A competitor? The same vendor's previous version?
  4. Reference clients. Not testimonials. Conversations with actual users who can speak to real-world performance, limitations, and implementation challenges.
  5. Failure case analysis. Where does the AI underperform? What are the known limitations? Honest vendors discuss both.
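
To make the confidence-interval point concrete, here is a minimal sketch of what a claim like "94% accuracy" implies once the test-set size is known. It uses the Wilson score interval; the 200-case test set and the 94% figure are hypothetical, not taken from any vendor.

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - margin, centre + margin

# Hypothetical claim: 94% accuracy, measured on a 200-case test set.
low, high = wilson_interval(successes=188, n=200)
print(f"94% on 200 cases -> 95% CI roughly {low:.1%} to {high:.1%}")
```

On 200 cases the interval runs from roughly 90% to 96%; on a 30-case test set the same point estimate would span something closer to 79% to 98%. The sample size behind a claim matters as much as the number itself.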

From Internal Teams

  1. Pre-deployment baselines. Measure current performance before AI deployment so you have a valid comparison point.
  2. Controlled comparisons. Where possible, run AI and non-AI processes in parallel during evaluation; a simple significance check on such a comparison is sketched after this list.
  3. Independent evaluation. The team that built the AI shouldn't be the only team evaluating it.
  4. Longitudinal measurement. Measure at 1 month, 3 months, 6 months, and 12 months. Initial results may not sustain.
  5. Honest reporting. Report what you found, including disappointing results. Organisations that suppress negative findings make worse decisions over time.
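
As a sketch of what a controlled comparison might look like in practice, the snippet below runs a simple permutation test on hypothetical outcomes from parallel AI-assisted and manual processing. The data and sample sizes are invented; the point is the discipline of asking whether an observed gap could plausibly be noise.

```python
import random

random.seed(0)

# Hypothetical parallel-run outcomes: 1 = handled correctly, 0 = needed rework.
manual = [1] * 78 + [0] * 22   # 78/100 correct under the existing process
ai     = [1] * 85 + [0] * 15   # 85/100 correct with AI assistance

observed_diff = sum(ai) / len(ai) - sum(manual) / len(manual)

# Permutation test: if the two processes were equivalent, how often would
# randomly relabelling the cases produce a gap at least this large?
pooled = manual + ai
extreme, trials = 0, 10_000
for _ in range(trials):
    random.shuffle(pooled)
    perm_manual, perm_ai = pooled[:len(manual)], pooled[len(manual):]
    if sum(perm_ai) / len(perm_ai) - sum(perm_manual) / len(perm_manual) >= observed_diff:
        extreme += 1

print(f"Observed improvement: {observed_diff:.0%}")
print(f"One-sided p-value:    {extreme / trials:.2f}")
```

With these made-up numbers, a seven-point improvement on 100 cases per arm comes out suggestive rather than conclusive (a p-value in the region of 0.1), which is exactly the kind of finding that should shape how much weight the result carries.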

Building Evidence Culture

The long-term fix isn't better vendor interrogation. It's building an internal culture that values evidence.
This means:
  • Training AI teams in basic research methodology
  • Establishing evaluation standards that apply to all AI initiatives
  • Creating space for honest reporting (including negative results)
  • Connecting evaluation findings to investment decisions
  • Learning from evaluations across initiatives, not just within them

The evidence gap in enterprise AI isn't malicious. Vendors believe in their products. Internal teams believe in their projects. The gap exists because rigorous evaluation is expensive, slow, and sometimes produces uncomfortable answers. But decisions based on weak evidence produce weak outcomes. Invest in evaluation. Demand evidence. Build the capability to generate your own.