Data Quality Is the AI Bottleneck Nobody Budgets For

After a year of enterprise AI delivery, the pattern is unmistakable. Projects don't fail because the model is wrong. They fail because the data feeding the model is inconsistent, incomplete, or just plain wrong.
15 January 2024 · 8 min read
Mak Khan
Chief AI Officer
John Li
Chief Technology Officer
We've now completed over a dozen enterprise AI engagements. The models work. The infrastructure scales. The APIs connect. And yet, the single biggest drag on every project, the thing that blows timelines and burns budgets, is data quality. Not model selection. Not architecture. Not even organisational buy-in. The data itself.

What You Need to Know

  • Data quality work consistently consumes 30-40% of enterprise AI project effort, but most organisations budget 5-10%
  • The gap between test accuracy and production accuracy is almost always a data quality gap
  • RAG systems are only as reliable as the knowledge base behind them. Contradictory or stale documents produce hallucinations
  • Budget explicitly for data assessment, cleansing, and ongoing monitoring before any model work begins

The Pattern

Every project follows a familiar arc. The vendor demo looks great. The pilot, trained on a curated dataset, performs well. Everyone agrees to move forward. And then the real data arrives.
That's when things get interesting.
40% of enterprise AI project time is spent on data preparation and cleansing (Source: Anaconda, State of Data Science Report, 2023).
On a recent engagement, we were building a RAG-based knowledge assistant for a mid-sized organisation. The concept was sound. Take the organisation's internal documentation, index it, let people ask questions and get accurate answers grounded in their own material. Simple enough.
The prototype worked beautifully on a sample of 200 documents. Clean, well-structured, consistent formatting. We went to production with 12,000 documents. The system started confidently presenting information that was flatly wrong.

Contradictory Documents Kill RAG Systems

The problem wasn't the model. It was that the knowledge base contained multiple versions of the same policy, some current and some five years out of date. The model would retrieve a deprecated document, present it as fact, and the user had no way to know the information was stale.
This organisation, like most, had never audited their document corpus. They had a shared drive with a decade of accumulation. Policies had been updated but old versions hadn't been removed. Different departments had created their own versions of the same procedure. Three separate documents described the leave policy, each slightly different.
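The remediation we landed on can be sketched in a few lines: collapse every policy to its single newest, still-current version before anything reaches the index. The `Doc` record and its fields (`policy_key`, `effective_date`, `superseded`) are illustrative assumptions, not the client's actual schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Doc:
    doc_id: str
    policy_key: str       # logical identifier shared by all versions of one policy
    effective_date: date
    superseded: bool      # explicitly retired documents are never indexed

def latest_per_policy(docs: list[Doc]) -> list[Doc]:
    """Keep only the newest non-superseded version of each policy."""
    best: dict[str, Doc] = {}
    for d in docs:
        if d.superseded:
            continue
        cur = best.get(d.policy_key)
        if cur is None or d.effective_date > cur.effective_date:
            best[d.policy_key] = d
    return list(best.values())
```

The hard part, of course, is not the filter but populating `policy_key` and `superseded` in the first place — that metadata rarely exists and has to be reconstructed by hand.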
When someone tells me their knowledge base is ready for AI, I ask them one question: how many documents describe the same process? If they don't know the answer, the knowledge base isn't ready.
Mak Khan
Chief AI Officer
We spent four weeks on data remediation. Deduplication. Version reconciliation. Metadata standardisation. It wasn't glamorous work. But without it, the system was worse than useless. It was confidently wrong.
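The first pass of that deduplication work is mechanical enough to automate. A minimal sketch, assuming documents arrive as plain text: normalise whitespace and case, hash, and keep the first copy of each hash. Near-duplicates (reworded versions of the same policy) still need fuzzier matching and human review:

```python
import hashlib
import re

def dedupe(texts: list[str]) -> list[str]:
    """Drop exact duplicates, ignoring whitespace and case differences."""
    seen: set[str] = set()
    keep: list[str] = []
    for t in texts:
        normalised = re.sub(r"\s+", " ", t).strip().lower()
        key = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            keep.append(t)
    return keep
```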

The Training Data Trap

A different project. A classification model for a financial services client. We were categorising incoming requests to route them to the right team. The training data came from the client's historical records: 50,000 labelled examples, two years of production data.
On the test set, we hit 95% accuracy. The client was thrilled. We deployed.
Production accuracy: 62%.
The gap was entirely explained by data quality. The training data had been pre-cleaned by the team that prepared it. Obvious errors had been corrected. Ambiguous cases had been resolved with the benefit of hindsight. Edge cases had been excluded as "outliers."
Production data had none of those advantages. Requests came in with typos, incomplete information, multiple issues in a single submission, and categories that didn't map cleanly to the training labels. The model had learned to classify clean data. Production data isn't clean.
We rebuilt the training pipeline using raw, unprocessed historical data. Accuracy on the new test set dropped to 78%. Production accuracy climbed to 76%. A less impressive number on paper. A dramatically more useful system in practice.
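The discipline behind the rebuild fits in one helper: carve off a raw, untouched holdout before any cleaning happens, so evaluation always runs against production-like data. This is a generic sketch of the practice, not the client's actual pipeline:

```python
import random

def split_raw_holdout(records: list, eval_frac: float = 0.2, seed: int = 0):
    """Hold out raw, untouched records for evaluation.

    Cleaning and correction may then be applied to the training split,
    but never to the evaluation split — that is what keeps the test
    number honest about production conditions.
    """
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * eval_frac)
    return shuffled[cut:], shuffled[:cut]  # (train, raw_eval)
```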

Why Organisations Underbudget for Data Work

There are three reasons, and they compound.
Data quality is invisible until you look. Nobody knows their data is messy until they try to use it for something that requires consistency. A human can read a document with a typo. A model trained on that document learns the typo.
Data work feels like overhead, not progress. Stakeholders want to see the AI working. They don't want status updates about data cleansing pipelines. The pressure to demonstrate value quickly pushes teams past the data preparation phase and into model development before the data is ready.
The skills are different. The people who build models are not typically the people who understand the source data systems: that knowledge sits with data engineers, domain experts, and the operational staff who actually use these systems daily. You need all of them in the room during the data preparation phase, and getting everyone's time is expensive.

What Good Looks Like

The projects that go well share a common approach.
Data assessment before any AI work. We now run a dedicated data quality sprint at the start of every engagement. Profile every data source. Measure completeness, consistency, freshness, duplication. Build a data quality scorecard. This sprint typically takes two to three weeks and saves months later.
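The scorecard itself doesn't need heavy tooling to start. A minimal sketch, assuming records come in as dictionaries — the field names here are placeholders, and a real sprint would add freshness and consistency checks per source:

```python
def scorecard(rows: list[dict], required_fields: list[str]) -> dict:
    """Tiny data-quality scorecard: per-field completeness and duplicate rate."""
    n = len(rows)
    completeness = {
        f: sum(1 for r in rows if r.get(f) not in (None, "")) / n
        for f in required_fields
    }
    seen: set = set()
    dupes = 0
    for r in rows:
        key = tuple(sorted(r.items()))  # exact-duplicate check on whole record
        dupes += key in seen
        seen.add(key)
    return {"rows": n, "completeness": completeness, "duplicate_rate": dupes / n}
```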
Realistic training data. If you're building a supervised model, your training data must reflect production conditions. Not a cleaned, curated subset. The messy, inconsistent, ambiguous data that your system will actually encounter. If your test accuracy is significantly higher than you'd expect from the raw data, something is wrong with your evaluation.
Ongoing data monitoring. Data quality isn't a one-time fix. Source systems change. New document formats appear. Data entry practices drift. Without continuous monitoring, a system that was accurate at launch degrades silently over months.
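Even the monitoring can start simple. One hedged sketch: compare the category distribution of current inputs against a baseline snapshot using total-variation distance, and alert above a threshold. The threshold value is an assumption to be tuned per system:

```python
from collections import Counter

def drift_score(baseline: list[str], current: list[str]) -> float:
    """Total-variation distance between two categorical samples.

    0.0 means identical distributions, 1.0 means fully disjoint.
    A threshold such as 0.2 (illustrative, tune per system) can
    trigger an alert that inputs have drifted since launch.
    """
    b, c = Counter(baseline), Counter(current)
    nb, nc = len(baseline), len(current)
    categories = set(b) | set(c)
    return 0.5 * sum(abs(b[k] / nb - c[k] / nc) for k in categories)
```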
Explicit budget allocation. We tell every client to budget 30-40% of their AI project effort for data work. Assessment, cleansing, transformation, validation, and ongoing monitoring. Most are surprised. But after the first project, nobody questions it.

The Conversation We Need to Have

The AI industry has done an excellent job of selling the capabilities of models. It's done a poor job of explaining that those capabilities depend entirely on data quality.
When a client asks us why their AI system isn't performing as expected, the answer is almost never "the model is wrong." It's almost always "the data feeding the model doesn't meet the quality bar the model requires."
This isn't a technology problem to solve with better algorithms. It's an organisational problem that requires investment in data governance, data engineering, and the unsexy work of making information consistent and reliable.
The organisations that treat data quality as a first-class concern, not an afterthought, are the ones whose AI investments actually deliver.