
Research Methodology for Responsible Health AI

Health AI research requires specific methodological rigour. How to design evaluation studies that produce trustworthy evidence for health AI systems.
22 July 2025 · 6 min read
Dr Tania Wolfgramm
Chief Research Officer
Louise Epa
AI Analyst & Research Consultant
Health AI claims are everywhere: "AI detects cancer earlier." "AI reduces diagnostic errors." "AI improves patient outcomes." Some of these claims are backed by rigorous evidence. Most are backed by vendor case studies, pilot results on small samples, or evaluations that wouldn't survive peer review. Tania and I have been working through what responsible research methodology looks like for health AI, drawing on her research expertise and my experience with the realities of health data systems.

What You Need to Know

  • Health AI evaluation requires the same methodological standards as any health intervention: defined populations, appropriate controls, pre-specified outcomes, and independent review
  • Most health AI evidence is at the lowest level of the evidence hierarchy: case studies and uncontrolled before-after comparisons
  • Three specific challenges distinguish health AI evaluation from general AI evaluation: patient safety requirements, equity impact measurement, and the need for real-world clinical validation
  • Organisations deploying health AI should demand evidence that meets clinical standards, not just technology standards

The Evidence Gap

What Vendors Provide

Most health AI vendors provide evidence in one of three forms:
Technical benchmarks. "Our model achieved 94% accuracy on the MIMIC-III dataset." This tells you about model performance on a specific dataset. It tells you nothing about performance in your clinical context, with your patient population, using your data.
Pilot results. "In a 3-month pilot with Hospital X, our system reduced diagnosis time by 40%." Pilot results are useful but limited. They typically involve motivated users, curated data, and supportive implementation conditions. They do not predict real-world performance at scale.
Customer testimonials. "Dr Y says our system has transformed their practice." Testimonials are marketing, not evidence.

What's Actually Needed

Evidence that a health AI system is safe and effective requires:
  • Evaluation on a representative patient population (not just the easy cases)
  • Comparison to an appropriate baseline (human performance, existing tools, or no intervention)
  • Pre-specified outcome measures (defined before the evaluation, not after)
  • Sufficient sample size for statistical power (a sketch of a power calculation follows this list)
  • Equity analysis across demographic subgroups
  • Independent evaluation (not conducted by the vendor)
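
To make the sample-size point concrete, here is a minimal sketch of a pre-study power calculation in Python. The error rates, significance level, and power target are hypothetical placeholders, not recommendations; the point is that this arithmetic happens before recruitment, not after.

```python
# A minimal sketch of a pre-study power calculation, assuming a
# two-arm comparison of diagnostic error rates (hypothetical values).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Hypothetical: current practice has a 10% error rate; we want to
# detect a reduction to 7% with the AI-assisted process.
effect_size = proportion_effectsize(0.10, 0.07)

analysis = NormalIndPower()
n_per_arm = analysis.solve_power(
    effect_size=effect_size,
    alpha=0.05,            # two-sided significance level
    power=0.80,            # 80% chance of detecting the effect if it is real
    alternative="two-sided",
)
print(f"Required sample size per arm: {n_per_arm:.0f}")
```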
Health AI should be held to the same standard as any other health intervention. Patient safety and equity require evidence that we can trust.
Dr Tania Wolfgramm
Chief Research Officer

Designing Health AI Evaluations

Define the Population

Who will this AI system serve? The evaluation population must reflect the actual deployment population. If the system will serve a diverse urban hospital, the evaluation cannot be conducted on a homogeneous suburban dataset.
For NZ contexts, the evaluation population must include adequate representation of Māori, Pacific, and other populations that experience health disparities. An AI system that performs well for the majority but poorly for underserved populations is not acceptable, regardless of its aggregate accuracy.
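
One way to make "representative" checkable is to compare the evaluation cohort's demographic mix against the deployment population before running the evaluation. A minimal sketch, with hypothetical group labels and counts:

```python
# A minimal sketch of checking whether an evaluation cohort reflects
# the deployment population's demographic mix. Labels and counts are
# hypothetical placeholders.
from scipy.stats import chisquare

groups = ["Māori", "Pacific", "European", "Asian", "Other"]
eval_counts = [120, 80, 610, 150, 40]          # evaluation cohort
deploy_props = [0.17, 0.09, 0.55, 0.14, 0.05]  # deployment population

total = sum(eval_counts)
expected = [p * total for p in deploy_props]

stat, p_value = chisquare(f_obs=eval_counts, f_exp=expected)
print(f"Chi-square = {stat:.1f}, p = {p_value:.4f}")
# A small p-value flags a mismatch worth investigating before the
# evaluation results are treated as generalisable.
```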

Choose Appropriate Controls

The comparison matters. "Better than nothing" is a low bar. "Better than existing practice" is meaningful. "Better than the best available alternative" is rigorous.
For health AI, the appropriate control is usually current clinical practice. Does the AI-assisted process produce better outcomes than the clinician-only process? This requires a study design that compares the two, ideally with random or quasi-random assignment.
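
A minimal sketch of the comparison itself, assuming a two-arm design with a binary outcome such as correct diagnosis (all counts hypothetical):

```python
# A minimal sketch comparing AI-assisted and clinician-only arms on a
# binary outcome (e.g., correct diagnosis). Counts are hypothetical.
from statsmodels.stats.proportion import proportions_ztest

# Correct diagnoses out of total cases in each arm
successes = [412, 375]   # [AI-assisted, clinician-only]
totals = [500, 500]

stat, p_value = proportions_ztest(count=successes, nobs=totals)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
# With random assignment, a significant difference supports a causal
# claim; without it, the same numbers only show an association.
```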

Specify Outcomes Before Evaluation

Pre-registration of outcome measures prevents post-hoc analysis that finds whatever pattern is convenient. Define the following before the evaluation begins (a sketch of a locked analysis plan follows this list):
  • Primary outcome (the main thing you're measuring)
  • Secondary outcomes (additional measures of interest)
  • Safety outcomes (adverse events, errors, harms)
  • Equity outcomes (performance differences across demographic groups)
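
One lightweight way to enforce pre-specification is to capture the plan as a version-controlled artefact that is locked before any data is collected. A minimal sketch, with hypothetical outcome names and dates:

```python
# A minimal sketch of a pre-registered analysis plan captured as a
# data structure and frozen before data collection. All names and
# values are hypothetical placeholders.
ANALYSIS_PLAN = {
    "primary_outcome": "time_to_correct_diagnosis_hours",
    "secondary_outcomes": [
        "clinician_override_rate",
        "patient_length_of_stay_days",
    ],
    "safety_outcomes": [
        "missed_critical_findings",
        "adverse_events_attributable_to_ai",
    ],
    "equity_outcomes": [
        # performance reported separately per group, pre-specified
        "sensitivity_by_ethnicity",
        "false_negative_rate_by_age_band",
    ],
    "registered_on": "2025-01-15",  # locked before data collection
}
```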

Measure Real-World Performance

Laboratory performance and real-world performance diverge in health AI for predictable reasons: messier data, sicker patients, time pressure, workflow friction, and user variability. Evaluation must include real-world deployment conditions, not just controlled testing environments.
My experience with health data systems at the national level reinforces this: the data in production is always messier than the data in the evaluation set. The evaluation must account for this, or its results won't generalise.

Equity-Specific Methodology

Stratified Analysis

Report performance metrics for each demographic subgroup separately, not just in aggregate. Aggregate accuracy of 93% could mean 97% for the majority population and 78% for underserved populations. The aggregate number hides the equity problem.
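
A minimal sketch of what stratified reporting looks like in practice, using hypothetical data and pandas:

```python
# A minimal sketch of stratified reporting: accuracy per demographic
# subgroup rather than a single aggregate. Column names are
# hypothetical; `df` holds one row per evaluated case.
import pandas as pd

df = pd.DataFrame({
    "group":  ["A", "A", "B", "B", "B", "A"],
    "y_true": [1, 0, 1, 1, 0, 1],
    "y_pred": [1, 0, 0, 1, 0, 1],
})

df["correct"] = (df["y_true"] == df["y_pred"]).astype(int)

# Aggregate accuracy hides subgroup differences...
print("Aggregate:", df["correct"].mean())
# ...so report it per group as well.
print(df.groupby("group")["correct"].agg(["mean", "count"]))
```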

Fairness Metrics

Beyond accuracy, measure fairness-specific metrics: equal opportunity (the same true positive rates across groups), predictive parity (the same positive predictive values across groups), and calibration (predicted probabilities that match observed outcome rates within each group).
No single fairness metric is sufficient. Report multiple metrics and be transparent about trade-offs.
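
A minimal sketch of per-group reporting for two of these metrics, true positive rate (equal opportunity) and positive predictive value (predictive parity), with hypothetical data:

```python
# A minimal sketch computing two of the fairness metrics named above
# per group. Data is hypothetical; one row per evaluated case.
import pandas as pd

def tpr(g):
    # True positive rate: of the actual positives, how many were flagged?
    positives = g[g["y_true"] == 1]
    return (positives["y_pred"] == 1).mean()

def ppv(g):
    # Positive predictive value: of the flagged cases, how many were real?
    predicted_pos = g[g["y_pred"] == 1]
    return (predicted_pos["y_true"] == 1).mean()

df = pd.DataFrame({
    "group":  ["A"] * 4 + ["B"] * 4,
    "y_true": [1, 1, 0, 0, 1, 1, 0, 0],
    "y_pred": [1, 0, 0, 1, 1, 1, 0, 0],
})

for name, g in df.groupby("group"):
    print(f"{name}: TPR={tpr(g):.2f}, PPV={ppv(g):.2f}")
# Calibration would additionally compare predicted probabilities to
# observed outcome rates within each group (e.g., reliability curves).
```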

Health AI has genuine potential to improve outcomes. Realising that potential requires evidence that meets clinical standards, not just technology benchmarks. The methodology is known. The challenge is applying it consistently, particularly for populations that have historically been underserved by both health systems and technology.