We went all-in on AI in early 2023. Six months of active delivery later, I can say with confidence: some of what we expected was right, some was wrong, and the most important lessons were the ones we didn't see coming.
What You Need to Know
- Enterprise AI delivery is harder than we expected, but not for the reasons we expected. The models are better than anticipated. The integration, data, and human challenges are worse.
- The compound effect is real. Each AI capability we build makes the next one faster to build. But it only compounds if you invest in shared infrastructure, not isolated prototypes.
- Accuracy improvements from the model matter less than accuracy improvements from better data preparation. Moving from GPT-3.5 to GPT-4 improved our results by roughly 15%. Improving our data pipeline improved results by 30-40%.
[Chart: Where Accuracy Gains Actually Come From. Source: RIVER Group, enterprise delivery data, 2023]
- Enterprise clients don't care about AI. They care about outcomes. The moment we stopped talking about technology and started talking about operational improvement, every conversation got better.
What's Working
RAG Is the Enterprise Pattern
Retrieval-augmented generation - connecting an LLM to an organisation's own data - has become our primary delivery pattern. It's not the only pattern, but it's the one that delivers the clearest enterprise value.
Here's why: enterprises don't need general knowledge. They need their knowledge, made accessible and actionable. A health insurer doesn't need an AI that knows about insurance in general. They need an AI that knows their policies, their processes, their clinical guidelines.
RAG solves this. Ingest the organisation's documents. Build a retrieval system that finds the right information for each query. Use the LLM to synthesise an answer grounded in that specific context. Cite the sources.
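The steps above can be sketched end to end. This is a toy illustration, not our delivery code: a naive keyword-overlap retriever stands in for a real vector store, and the synthesis step is stubbed where a production system would call the LLM. The names (`DOCS`, `retrieve`, `answer_query`) and the sample documents are invented for the example.

```python
# Toy RAG sketch: ingest -> retrieve -> synthesise grounded answer -> cite sources.
# Keyword overlap stands in for embedding search; the LLM call is stubbed.

DOCS = {
    "policy-101.pdf": "Hospital cover excludes cosmetic procedures unless clinically required.",
    "process-guide.pdf": "Claims over $5,000 require a second assessor before approval.",
    "clinical-guidelines.pdf": "Physiotherapy is covered for up to 12 sessions per calendar year.",
}

def retrieve(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Rank documents by naive keyword overlap with the query."""
    terms = set(query.lower().split())
    scored = [
        (len(terms & set(text.lower().split())), name, text)
        for name, text in DOCS.items()
    ]
    scored.sort(reverse=True)
    return [(name, text) for score, name, text in scored[:k] if score > 0]

def answer_query(query: str) -> str:
    """Synthesise an answer grounded in retrieved context, citing sources."""
    context = retrieve(query)
    if not context:
        return "No relevant documents found."
    # In production, the retrieved context and query go into an LLM prompt;
    # here we just show the grounding and the citations.
    sources = ", ".join(name for name, _ in context)
    grounding = " ".join(text for _, text in context)
    return f"{grounding} [Sources: {sources}]"

print(answer_query("how many physiotherapy sessions are covered per year"))
```

The essential property is the last step: the answer is assembled only from retrieved organisational content, with sources attached, rather than from the model's general knowledge.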
We've built multiple production RAG systems now, and the pattern is maturing. The architecture is stabilising. The quality is improving with each build. This is the compound effect we were hoping for.
Domain Experts Are the Secret Weapon
The most important person on our AI delivery teams isn't the ML engineer. It's the domain expert - the person from the client's organisation who knows the subject matter cold.
They define what "good" looks like. They identify edge cases the model misses. They validate outputs in ways that no automated metric can replicate. They know when a technically correct answer is practically wrong because it misses context that only a practitioner would understand.
Every project where we've had strong domain expert involvement has delivered better outcomes. Every project where domain experts were treated as an afterthought has struggled.
Infrastructure Compounds, Prototypes Don't
Early in the year, we were tempted to build each AI project from scratch. Bespoke pipeline, bespoke interface, bespoke deployment. That's fine for a prototype. It's death for a delivery team.
We've since invested heavily in shared infrastructure: common RAG pipelines, reusable ingestion patterns, standardised deployment processes. The result is that our second, third, and fourth AI builds are materially faster and more reliable than the first.
This is the foundation argument made concrete. If you build each AI capability as an isolated experiment, you get linear progress. If you build shared infrastructure, you get exponential progress.
What Isn't Working
The Demo-to-Production Gap Is Real
We wrote about this, but experiencing it firsthand is different from theorising about it. The gap between a compelling AI demo and a production system that works reliably, at scale, with real data, in a real workflow, is enormous.
A demo can cherry-pick the best examples. Production has to handle everything - the messy documents, the edge cases, the formats nobody anticipated, the users who interact with it in ways you didn't plan for.
We've gotten better at managing expectations around this gap. But it still surprises clients, and honestly, it still sometimes surprises us.
Data Quality Is Always Worse Than Expected
Every client says their data is "pretty good." It never is. Inconsistent formats. Missing fields. Duplicate records. Documents that are technically PDFs but are actually scanned images with no OCR. Spreadsheets used as databases. Email threads used as approval workflows.
We now budget significant time for data assessment and preparation at the start of every engagement. It's rarely glamorous work. It's always essential.
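A data-assessment pass of this kind can start very simply. The record layout and the three checks below (missing fields, duplicate IDs, non-ISO dates) are illustrative assumptions, not a client schema, but they are representative of what the first day of an engagement tends to surface.

```python
# A toy data-assessment pass of the kind we now budget for up front.
# The record layout and checks are illustrative, not a real client schema.

from collections import Counter

records = [
    {"id": "A1", "name": "Smith, J", "dob": "1980-04-12"},
    {"id": "A2", "name": "",         "dob": "12/04/1980"},  # missing name, odd date format
    {"id": "A1", "name": "Smith, J", "dob": "1980-04-12"},  # duplicate id
]

def assess(rows):
    report = {"missing_fields": 0, "duplicate_ids": 0, "bad_dates": 0}
    ids = Counter(r["id"] for r in rows)
    report["duplicate_ids"] = sum(n - 1 for n in ids.values() if n > 1)
    for r in rows:
        if any(not v for v in r.values()):
            report["missing_fields"] += 1
        # Expect ISO dates (YYYY-MM-DD); anything else is flagged for normalisation.
        y, _, rest = r["dob"].partition("-")
        if not (len(y) == 4 and y.isdigit() and rest):
            report["bad_dates"] += 1
    return report

print(assess(records))
```

Even a crude report like this turns the "pretty good" conversation into a concrete remediation list before any model work begins.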
73% of enterprise data leaders rate their data quality as inadequate for AI applications. (Source: NewVantage Partners, Data and AI Leadership Executive Survey, 2023)
Change Management Can't Be Bolted On
This one hurts because we knew it intellectually and still underestimated it in practice. Building the AI is maybe 40% of the work. Getting people to adopt it, trust it, and integrate it into their daily operations is the other 60%.
We've started treating change management as a parallel workstream from day one, not something that happens after the technology is built. Early user involvement, transparent communication about capabilities and limitations, phased rollout. It's working better, but it remains the hardest part of delivery.
What Surprised Us
The Speed of Model Improvement
When we started in January, GPT-3.5 was the state of the art. By March, GPT-4 had changed the calculus. By June, the open-source model landscape had shifted dramatically. By September, we were seeing capabilities we hadn't expected until late 2024.
This is both exciting and strategically challenging. Building on a foundation that improves every quarter means your architecture needs to be model-agnostic. We've learned to abstract the model layer so we can swap models without rebuilding systems.
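One way to sketch that abstraction: application code depends on a minimal interface, and each model sits behind a backend that implements it. The `ModelClient` protocol and backend classes here are illustrative stubs, not our actual architecture or any vendor's SDK.

```python
# Sketch of a model-agnostic layer: application code depends only on an
# interface, so backends can be swapped without rebuilding the system.
# All class and method names here are illustrative assumptions.

from typing import Protocol

class ModelClient(Protocol):
    def complete(self, prompt: str) -> str: ...

class HostedBackend:
    """Would wrap a commercial API SDK; stubbed for illustration."""
    def complete(self, prompt: str) -> str:
        return f"[hosted] {prompt}"

class LocalBackend:
    """Would wrap a self-hosted open-source model; stubbed for illustration."""
    def complete(self, prompt: str) -> str:
        return f"[local] {prompt}"

def summarise(document: str, model: ModelClient) -> str:
    # Application logic never names a specific model or vendor.
    return model.complete(f"Summarise: {document}")

print(summarise("quarterly claims report", HostedBackend()))
print(summarise("quarterly claims report", LocalBackend()))
```

Swapping models then becomes a one-line change at the call site (or a config value), which is what makes a quarterly-improving foundation survivable.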
Clients Want Guidance, Not Options
Early on, we presented clients with options: here are three approaches, here are the trade-offs, which do you prefer? This was well-intentioned and completely wrong.
Clients don't want options. They want informed recommendations. "Based on your situation, we recommend this approach because of these reasons. Here's why the alternatives are worse." That's what a trusted advisor does. Presenting options is what a vendor does.
The Small Wins Matter Most
The AI capabilities that have driven the strongest client satisfaction aren't the big, ambitious ones. They're the targeted, specific ones. Reducing a three-hour document review to 20 minutes. Automatically routing incoming documents to the right team. Extracting structured data from unstructured forms.
Nobody writes a case study about these. But they're the wins that build trust, demonstrate value, and create momentum for bigger initiatives.
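The structured-extraction win, in particular, can be surprisingly small in code. A minimal sketch, assuming a semi-regular form layout; the field patterns and the sample form are invented for the example, and a production version would need to handle far messier inputs.

```python
# A "small win" sketch: pulling structured fields out of free-text forms.
# The form layout and field patterns are illustrative assumptions.

import re

FORM = """
Claim number: CLM-20451
Member name: J. Smith
Amount claimed: $1,240.50
"""

PATTERNS = {
    "claim_number": r"Claim number:\s*(\S+)",
    "member_name": r"Member name:\s*(.+)",
    "amount": r"Amount claimed:\s*\$([\d,]+\.\d{2})",
}

def extract(text: str) -> dict[str, str]:
    """Return whichever known fields can be found in the text."""
    out = {}
    for field, pattern in PATTERNS.items():
        m = re.search(pattern, text)
        if m:
            out[field] = m.group(1).strip()
    return out

print(extract(FORM))
```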
What's Next
We're six months in. The next six months will be about scaling what works, fixing what doesn't, and continuing to invest in the shared infrastructure that lets each project compound on the last.
If I had to distil everything into one lesson, it's this: the technology is the least interesting part of enterprise AI. The interesting part - the hard, valuable, differentiated part - is understanding the client's domain deeply enough to apply AI where it creates genuine operational advantage.
That's not a technology capability. It's a consulting capability built on technology.
