Every enterprise AI conversation I'm in eventually devolves into a model debate. "Should we use GPT-4 or Claude 3?" "What about Gemini?" "Have you seen the latest benchmarks?" Stop. The model is maybe 20% of your outcome. The other 80% is everything around it - and that's the part nobody wants to talk about.
I'm going to say something that's borderline heretical in AI circles: for most enterprise use cases, the top-tier models produce roughly equivalent results. GPT-4 is excellent. Claude 3 is excellent. They have different strengths. They have different weaknesses. But the delta between them on a well-engineered system is small compared to the delta between a good system and a bad system using the same model.
The 80/20 of Enterprise AI
Here's what actually determines whether your enterprise AI system works:
Data quality and preparation (30%). How clean is your data? How well is it chunked? How good are your embeddings? How comprehensive is your metadata? A mediocre model with excellent data preparation will outperform a frontier model with poor data preparation. Every time.
Retrieval quality (25%). When a user asks a question, does the system find the right documents? Are the search results relevant? Is the reranking effective? This is the difference between an AI that gives useful answers and one that confidently gives wrong answers drawn from irrelevant documents.
System design (15%). Prompt engineering, guardrails, error handling, caching, latency management, cost optimisation. The architecture decisions that determine whether the system is reliable, fast, and economical - or fragile, slow, and expensive.
User experience (10%). How AI output is presented, how confidence is communicated, how users interact with the system. A brilliant AI behind a terrible interface delivers no value.
The model (20%). The actual language model. Important? Yes. The deciding factor? Almost never.
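To make the biggest factor concrete, here's a minimal sketch of paragraph-aware chunking with overlap - the kind of unglamorous data-preparation decision that moves the needle more than a model swap. The size and overlap values are illustrative, not recommendations:

```python
def chunk_by_paragraph(text: str, max_chars: int = 800, overlap: int = 1) -> list[str]:
    """Pack paragraphs into chunks of roughly max_chars, never splitting
    mid-sentence, and repeat the last `overlap` paragraphs at the start
    of the next chunk so context survives the boundary."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for p in paragraphs:
        if current and size + len(p) > max_chars:
            chunks.append("\n\n".join(current))
            # carry trailing paragraphs forward for continuity
            current = current[-overlap:] if overlap else []
            size = sum(len(x) for x in current)
        current.append(p)
        size += len(p)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Even this naive version beats cutting text every N characters regardless of structure, which is what many teams ship by default.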
~3% - the average accuracy difference between GPT-4 and Claude 3 Opus on enterprise retrieval tasks with equivalent infrastructure. (Source: RIVER Group, internal benchmark across 4 enterprise deployments, Q1-Q2 2024.)
Why Everyone Obsesses Over Models
Three reasons, and none of them are good:
It's the visible part. Models have names, benchmarks, and marketing teams. Infrastructure has documentation nobody reads. It's natural to focus on the thing you can see and compare. But it's wrong.
It avoids the hard work. Debating models is fun. Building data pipelines is not. Optimising retrieval is tedious. Fixing chunking strategies is boring. The model debate is a procrastination mechanism disguised as strategy.
Vendors want you to care. If the model is what matters, then you need the vendor with the best model. If infrastructure is what matters, then you need good engineering - and that's a different conversation entirely.
Where the Model Does Matter
I'm not saying the model is irrelevant. There are specific situations where model choice is significant:
Complex reasoning tasks. When the task requires multi-step logical reasoning, synthesis across multiple documents, or nuanced interpretation, frontier models genuinely outperform smaller ones. Legal analysis, financial modelling, strategic recommendations.
Long context processing. Claude 3 handles long documents better than most alternatives right now. If your use case involves processing 50-page contracts or 200-page reports, this matters.
Multilingual capability. Model performance varies significantly across languages. If you need te reo Māori or Pacific language support, model choice matters more than for English-only use cases.
Cost-sensitive high-volume. When you're processing thousands of queries per hour, the cost difference between GPT-4 and a smaller model is substantial. Model routing - using expensive models only when needed - is an architecture decision, not a model decision.
For everything else? Build the infrastructure right and the model choice becomes a configuration decision, not an architectural one.
The Architecture That Makes Models Swappable
This is the actual strategic move: build your system so the model is a pluggable component.
Abstraction layer. A common interface for model interaction that isolates your application from model-specific APIs. When a better model arrives (and it will, frequently), you swap the configuration, not the code.
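A sketch of what that abstraction can look like in Python. The `ChatModel` interface and `EchoModel` stand-in are hypothetical names; a real adapter would wrap a vendor SDK behind the same interface:

```python
from dataclasses import dataclass
from typing import Protocol


class ChatModel(Protocol):
    """The one interface every provider adapter implements."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class EchoModel:
    """Test stand-in; a real adapter would call a vendor SDK here."""

    name: str

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


def answer(model: ChatModel, question: str) -> str:
    # Application code depends only on ChatModel, never on a vendor API,
    # so swapping providers is a configuration change, not a rewrite.
    return model.complete(question)
```

The point is that `answer` - and everything built on it - never imports a vendor library.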
Model routing. Different tasks routed to different models based on complexity, cost, and latency requirements. Simple extraction goes to a fast, cheap model. Complex analysis goes to a frontier model. The routing logic is yours to control.
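Routing can start as something this simple - the model names and thresholds below are placeholders, not recommendations:

```python
def route_model(prompt: str, needs_reasoning: bool = False) -> str:
    """Toy routing policy: a fast, cheap model by default; a frontier
    model only for long inputs or tasks flagged as reasoning-heavy."""
    if needs_reasoning or len(prompt) > 4000:
        return "frontier-large"
    return "fast-small"
```

Real routing policies grow more sophisticated - classifiers, cost budgets, latency targets - but the control point stays in your code, not the vendor's.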
Evaluation framework. Automated testing that measures model performance on your specific tasks with your specific data. When you're considering a new model, you run the evaluation suite, not a vibe check.
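A minimal version of that evaluation loop, assuming a hand-labelled case set; a real suite would use task-specific scoring rather than substring matching:

```python
from typing import Callable


def evaluate(model_fn: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Run a candidate model over labelled (input, expected) cases and
    return the fraction of outputs containing the expected answer."""
    hits = sum(
        1 for inp, expected in cases
        if expected.lower() in model_fn(inp).lower()
    )
    return hits / len(cases)
```

The same case set run against two candidate models gives you a like-for-like comparison on your data - the "vibe check" replacement.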
Prompt management. Centralised prompt templates that can be optimised per model. Different models respond differently to the same prompt. Your prompt layer should handle this.
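One lightweight way to sketch this: templates keyed by task and model, with a default fallback. The task names and template text here are illustrative:

```python
TEMPLATES: dict[str, dict[str, str]] = {
    "summarise": {
        "default": "Summarise the following text:\n\n{text}",
        # smaller models often need terser, more directive prompts
        "fast-small": "TL;DR:\n{text}",
    },
}


def render_prompt(task: str, model: str, **fields: str) -> str:
    """Pick the model-specific template for a task, falling back to the
    task default, then fill in the variables."""
    by_model = TEMPLATES[task]
    template = by_model.get(model, by_model["default"])
    return template.format(**fields)
```

Keeping templates in one registry - rather than scattered through application code - is what makes per-model tuning tractable.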
This architecture means you never get locked into a model. You're locked into your infrastructure - which you own and control.
What to Do Instead of Debating Models
Next time the model debate starts in your AI strategy meeting:
- Ask about data quality. How clean is the data the AI will use? What's the plan for maintaining quality? This conversation is ten times more productive than the model conversation.
- Ask about retrieval. How will the system find the right information? What's the search strategy? How will you measure retrieval quality?
- Ask about the user experience. Who will use this? What does their workflow look like? How does AI fit into it?
- Ask about monitoring. How will you know if the system is working? What metrics matter? How will you detect degradation?
- Then ask about the model. Once the above questions are answered, model selection becomes straightforward.
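On the monitoring question, a degradation check can start as a rolling window over whatever quality signal you already collect - thumbs-up rate, retrieval hit rate, eval-suite pass rate. A sketch, with illustrative thresholds:

```python
from collections import deque


class DegradationMonitor:
    """Rolling average over a quality signal; flags when the recent
    window drops below a floor. Window size and floor are illustrative."""

    def __init__(self, window: int = 100, floor: float = 0.8):
        self.scores: deque[float] = deque(maxlen=window)
        self.floor = floor

    def record(self, score: float) -> bool:
        """Record one observation; return True if quality has degraded."""
        self.scores.append(score)
        avg = sum(self.scores) / len(self.scores)
        return avg < self.floor
```

It's crude, but a crude alert beats discovering degradation from a user complaint three weeks later.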
The enterprises that build great AI systems don't build them around a model. They build them around their data, their workflows, and their users. The model is the last decision, not the first.
Build the infrastructure right and the model is a configuration choice. Build around a specific model and you're rewriting everything in six months when something better comes out.
Mak Khan
Chief AI Officer
