
Enterprise Data Is Messier Than You Think

Every enterprise project involves data that's worse than the client expects. Migration, quality, legacy formats. How to plan for the mess and deliver anyway.
5 November 2019·7 min read
Hassan Nawaz
Senior Developer
I've never started a data migration and found the data was cleaner than expected. The client says "our data is pretty good." First extraction: duplicate records, inconsistent formats, missing fields, impossible dates, relationships that contradict each other. Not negligence. Entropy. Data degrades over years of use, system changes, and human input. Planning for that is the difference between a smooth migration and a project-threatening crisis.

What You Need to Know

  • Data quality is consistently worse than stakeholders believe, across every industry and organisation size
  • Data migration is a project within the project and should be scoped, budgeted, and managed accordingly
  • The three data challenges are quality, mapping, and volume. All three require dedicated time
  • Starting data assessment during discovery prevents surprises during the build

The Optimism Problem

Clients underestimate data messiness for understandable reasons. The current system works. People use it every day. Data goes in, reports come out. From the user's perspective, the data is fine.
But "fine for the current system" and "ready for the new system" are different standards. The current system has years of accommodations built in. It handles the duplicates because someone added a deduplication rule in 2014. It displays the dates correctly because there's a formatting layer that normalises three different date formats. It ignores the orphaned records because they don't appear in any active view.
Strip away those accommodations and transfer the raw data to a new system, and every accumulated compromise becomes visible.
25%
of enterprise data is estimated to contain critical quality issues
Source: Gartner Data Quality Market Survey, 2018
Twenty-five percent is the industry average. In our experience, for organisations with systems older than five years, the number is higher.

The Three Data Challenges

Quality

Quality problems fall into predictable categories.
Duplicates. The same customer entered three times with slightly different names. The same product with different codes. The same address formatted four different ways. Deduplication sounds simple until you realise there's no reliable unique identifier and the "same" record has different data in each duplicate.
Inconsistency. Phone numbers with and without country codes. Dates in DD/MM/YYYY and MM/DD/YYYY and YYYY-MM-DD. Status fields where "Active," "active," "ACTIVE," and "A" all mean the same thing. Freetext fields where people entered whatever they wanted.
Missing data. Fields that were optional in the old system but required in the new one. Records that were partially entered and never completed. Historical data where the original context has been lost.
Impossible data. Dates in the future for past events. Negative quantities. Email addresses that aren't email addresses. Phone numbers with too many digits. Each one needs a decision: fix, flag, or discard.
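The first two categories lend themselves to small, mechanical rules once someone has decided what the canonical values are. A minimal sketch, using hypothetical status aliases and a hypothetical impossible-date check (the names and mappings here are illustrative, not from any real migration):

```python
from datetime import date

# Hypothetical aliases as they might appear in a legacy export:
# "Active," "active," "ACTIVE," and "A" all mean the same thing.
STATUS_ALIASES = {"active": "Active", "a": "Active",
                  "inactive": "Inactive", "i": "Inactive"}

def normalise_status(raw: str) -> str:
    """Collapse the variant spellings into one canonical value."""
    return STATUS_ALIASES.get(raw.strip().lower(), "Unknown")

def is_impossible_date(event_date: date, today: date) -> bool:
    """A past event dated in the future is impossible data: flag it for a decision."""
    return event_date > today

print(normalise_status("ACTIVE"))                               # → Active
print(is_impossible_date(date(2030, 1, 1), date(2019, 11, 5)))  # → True
```

The rules are trivial; deciding that "A" means "Active" rather than "Archived" is the hard part, and that decision belongs to the domain expert, not the script.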

Mapping

The old system and the new system don't share the same data model. A "customer" in the old system might map to an "organisation" and a "contact" in the new one. A single field in the old system might split into three fields in the new one. A workflow status with twelve values in the old system needs to map to eight different values in the new one.
Data mapping is a business decision disguised as an engineering task. Every mapping rule requires someone who understands the domain to decide what the data means and where it belongs.
Data mapping workshops are a standard line item now. You cannot automate business judgement.
Hassan Nawaz
Senior Developer
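The "customer splits into organisation and contact" case can be sketched as a mapping function. The field names below are hypothetical; the point is that every assignment in the function body encodes a business decision:

```python
# Hypothetical legacy record: one flat "customer" row in the old system.
legacy = {"customer_name": "Acme Ltd",
          "contact": "Jane Doe",
          "phone": "+44 20 7946 0000"}

def map_customer(row: dict) -> tuple[dict, dict]:
    """Split one legacy customer into an organisation and a contact.

    Which field lands where is a business decision, not an engineering
    one: does the phone number belong to the person or the company?
    """
    organisation = {"name": row["customer_name"]}
    contact = {"full_name": row["contact"],
               "phone": row["phone"],          # assumed to be the person's number
               "organisation": row["customer_name"]}
    return organisation, contact

org, person = map_customer(legacy)
print(org)     # → {'name': 'Acme Ltd'}
```

This is what the mapping workshops produce: the rules inside functions like this one, signed off by people who know what the data means.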

Volume

Large datasets take time to migrate. Not just the transfer time. The validation time. When you're migrating a million records, even a 1% error rate means ten thousand records that need manual review. At scale, every data quality problem is multiplied.
Performance matters too. A query that takes a millisecond on ten thousand records takes ten seconds on ten million records. Data structures that worked fine at the old scale may need redesigning for the new scale. This needs to be discovered during testing, not on launch day.
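The review-workload arithmetic is worth doing explicitly during planning. A back-of-envelope sketch, where the minutes-per-record figure is an assumption for illustration, not a measured value:

```python
# Back-of-envelope: what a 1% error rate costs at one million records.
records = 1_000_000
error_rate = 0.01            # 1% of records fail validation
minutes_per_review = 2       # assumed manual review time per flagged record

flagged = int(records * error_rate)
person_days = flagged * minutes_per_review / 60 / 7.5   # 7.5-hour working day

print(flagged)                    # → 10000
print(round(person_days, 1))      # → 44.4
```

Ten thousand flagged records at two minutes each is roughly nine working weeks for one person. That is why volume turns every small quality problem into a scheduling problem.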

How We Handle It

Assess Early

Data assessment happens during discovery at RIVER. Before we scope the build, we extract a sample of the data and assess its quality. We look for the common problems: duplicates, inconsistencies, missing fields, impossible values.
This assessment shapes the project plan. If the data is relatively clean, migration is a contained task within the build. If the data is a mess, migration becomes a significant workstream with its own timeline, resources, and budget.
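A discovery-phase assessment does not need sophisticated tooling; even a crude profile of a sample extract surfaces the headline numbers. A minimal sketch over hypothetical sample rows:

```python
from collections import Counter

# Hypothetical sample rows from a discovery-phase extract.
sample = [
    {"email": "jane@example.com", "phone": "+44 20 7946 0000"},
    {"email": "jane@example.com", "phone": ""},
    {"email": "",                 "phone": "020 7946 0000"},
]

def profile(rows: list[dict], key: str) -> dict:
    """Crude field profile: how many values are missing, how many repeat."""
    values = [r[key] for r in rows]
    missing = sum(1 for v in values if not v)
    duplicates = sum(c - 1 for c in Counter(v for v in values if v).values())
    return {"missing": missing, "duplicates": duplicates}

print(profile(sample, "email"))   # → {'missing': 1, 'duplicates': 1}
```

Run against a real extract, numbers like these are what turn "our data is pretty good" into a concrete scope conversation.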

Clean Progressively

We don't try to clean all the data before the migration. That approach leads to a data cleansing project that delays the build indefinitely. Instead, we define minimum quality standards for migration and clean progressively.
Block-and-fix. Critical issues that would break the new system (missing required fields, incompatible formats) must be fixed before migration.
Flag-and-review. Quality issues that don't break the system but reduce its value (duplicates, inconsistencies) get flagged for post-migration review.
Accept-and-improve. Minor issues that can be corrected over time through normal use of the new system. Better data entry validation prevents the same problems from recurring.
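The triage above amounts to a routing rule. A sketch with hypothetical issue codes (the categories are from this article; the code values are invented for illustration):

```python
# Hypothetical issue codes a validation pass might emit.
BLOCKING = {"missing_required_field", "incompatible_format"}
FLAGGED = {"duplicate", "inconsistent_value"}

def triage(issue: str) -> str:
    """Route each quality issue to block-and-fix, flag-and-review,
    or accept-and-improve."""
    if issue in BLOCKING:
        return "block-and-fix"       # must be fixed before migration
    if issue in FLAGGED:
        return "flag-and-review"     # migrate, then review post-migration
    return "accept-and-improve"      # correct over time through normal use

print(triage("missing_required_field"))  # → block-and-fix
print(triage("duplicate"))               # → flag-and-review
```

The useful property is the default: anything not explicitly classified as blocking or flagged migrates anyway, which is what keeps the cleansing work from delaying the build indefinitely.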

Test With Real Data

Development and testing happen with real data as early as possible. Not sanitised sample data. Not generated test data. Actual production data (with appropriate privacy controls). Real data reveals problems that test data hides.
Every migration gets at least two full rehearsal runs before the actual migration. Each rehearsal surfaces issues that get fixed before the next run. By the time the real migration happens, there should be no surprises.

The Cost Conversation

Data migration typically accounts for 15-30% of the total project effort on enterprise projects with legacy systems. That's a significant number. And it's a number that needs to be in the original budget, not discovered halfway through.
We've learned to have this conversation early and honestly. "Your data will have quality issues. Every organisation's does. We'll budget for that upfront." It's not a comfortable conversation. But it's better than the conversation three months in when the migration is behind schedule and nobody planned for it.
The data is messier than you think. It always is. The question is whether you discover that during a controlled assessment in week two or during a panicked migration in month six.