Enterprise teams spend weeks debating which AI model to use. The debate usually centres on benchmarks, brand recognition, and whatever the CTO read last weekend. There's a better way. Model selection should be systematic, task-driven, and reversible.
What You Need to Know
- There is no "best" AI model. There is only the best model for a specific task, at a specific quality threshold, at an acceptable cost. This is a multi-variable optimisation, not a brand choice.
- Benchmark performance doesn't predict enterprise performance. Models are benchmarked on academic tasks. Your documents, your data quality, your edge cases will produce different results.
- The model decision should be reversible. If your architecture locks you to a specific model, you have an architecture problem, not a model problem.
- Cost differences between models are 10-50× for the same task. Choosing the most expensive model for every task is the most common waste we see.
10-50× cost difference between using a frontier model and an adequate smaller model for routine enterprise tasks (Source: RIVER model benchmarking data, 2024)
The Decision Framework
We use a five-step framework for model selection in every enterprise AI engagement.
Step 1: Define the Task Type
Different tasks have fundamentally different model requirements:
| Task Type | Key Requirement | Model Tier |
|---|---|---|
| Complex reasoning, analysis, synthesis | Highest capability | Frontier (Claude 3.5 Sonnet, GPT-4o) |
| Document extraction, summarisation | Good accuracy, moderate cost | Mid-tier (Claude 3 Sonnet, GPT-4o-mini) |
| Classification, routing, tagging | Speed and cost efficiency | Small/fine-tuned (Haiku, open-source) |
| Embedding and search | Specialised performance | Purpose-built embedding models |
| Customer-facing conversation | Safety, controllability, quality | Frontier with guardrails |
| Code generation | Capability + instruction following | Frontier |
The first filter is always task type. Don't evaluate frontier models for classification tasks. Don't evaluate small models for complex reasoning.
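This first-pass filter can be encoded directly. The sketch below mirrors the table above as a lookup from task type to the model tiers worth shortlisting; the task-type keys and tier names are illustrative, not a standard taxonomy.

```python
# Hypothetical first-pass filter: map each task type from the table above
# to the model tiers worth evaluating. Names are illustrative only.
TASK_TIERS = {
    "complex_reasoning": ["frontier"],
    "document_extraction": ["mid", "frontier"],
    "classification": ["small", "mid"],
    "embedding_search": ["embedding"],
    "customer_conversation": ["frontier"],
    "code_generation": ["frontier"],
}

def candidate_tiers(task_type: str) -> list[str]:
    """Return the model tiers to shortlist for a given task type."""
    if task_type not in TASK_TIERS:
        raise ValueError(f"Unknown task type: {task_type}")
    return TASK_TIERS[task_type]
```

The point of making the filter explicit is that nobody wastes an evaluation cycle benchmarking a frontier model on a routing task, or a small model on synthesis.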
Step 2: Benchmark on Your Data
Generic benchmarks (MMLU, HumanEval, etc.) tell you about general capability. They don't tell you how a model performs on your specific documents, your formatting, your edge cases.
Build a task-specific evaluation set: 50-100 examples that represent your actual workload, including the hard cases. Run every candidate model against this set. The results will often surprise you. Models that lead generic benchmarks sometimes underperform on specific enterprise tasks.
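A task-specific evaluation harness does not need to be elaborate. The sketch below assumes a `call_model` callable standing in for your actual API client, and uses exact-match scoring; in practice the scorer is task-specific (rubric-based, fuzzy match, or human review).

```python
# Minimal evaluation harness sketch. `call_model` is a placeholder for
# your real API client; `scorer` compares a model answer to the expected
# output and returns a score in [0, 1].
from typing import Callable

def evaluate(call_model: Callable[[str, str], str],
             model_id: str,
             eval_set: list[dict],
             scorer: Callable[[str, str], float]) -> float:
    """Run one candidate model over the eval set; return its mean score."""
    scores = []
    for example in eval_set:  # assumes a non-empty eval set
        answer = call_model(model_id, example["input"])
        scores.append(scorer(answer, example["expected"]))
    return sum(scores) / len(scores)

def exact_match(answer: str, expected: str) -> float:
    """Simplest possible scorer; replace with a task-specific one."""
    return 1.0 if answer.strip() == expected.strip() else 0.0
```

Run every shortlisted model through the same harness and the comparison becomes a table of numbers on your workload, not a debate about leaderboards.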
Step 3: Calculate Total Cost
Model cost isn't just the per-token API price. Total cost includes:
- Inference cost: Price per input/output token × expected volume
- Latency impact: Slower models mean longer user wait times, which affects adoption and productivity
- Integration cost: Some models require more prompt engineering or post-processing
- Operational cost: Monitoring, error handling, model updates, and vendor management
A model that's 20% cheaper per token but requires 50% more prompt engineering effort isn't actually cheaper.
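The inference component, at least, is simple arithmetic. A back-of-envelope sketch, with prices quoted per million tokens (the figures you plug in are your vendor's current rates, not the illustrative ones here):

```python
# Back-of-envelope monthly inference cost. Prices are per million tokens;
# the other cost lines (latency, integration, operations) still need
# estimating separately.
def monthly_inference_cost(requests_per_month: int,
                           avg_input_tokens: int,
                           avg_output_tokens: int,
                           input_price_per_mtok: float,
                           output_price_per_mtok: float) -> float:
    input_cost = requests_per_month * avg_input_tokens / 1e6 * input_price_per_mtok
    output_cost = requests_per_month * avg_output_tokens / 1e6 * output_price_per_mtok
    return input_cost + output_cost
```

Running this for each shortlisted model against your real volumes is usually where the 10-50× spread becomes concrete.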
Step 4: Assess Vendor Risk
For enterprise deployments, vendor assessment matters as much as model quality:
- Data processing agreements: Where is data processed? What retention policies apply? Is the DPA compatible with your compliance requirements?
- SLAs and reliability: What uptime guarantees exist? What happens during outages?
- Regional availability: Can you access the model from your region with acceptable latency?
- Enterprise support: Do they offer enterprise support tiers? What's the escalation path?
- Pricing stability: How often has pricing changed? Are there volume commitments?
Step 5: Build for Switching
The most important step: ensure your architecture allows model switching. This means:
- An abstraction layer between your application and the model API
- Standardised input/output formats that aren't model-specific
- Evaluation datasets that can benchmark any new model against your current one
- Feature flags or configuration that allow model switching without code deployment
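A minimal sketch of what such an abstraction layer looks like, assuming hypothetical provider clients behind a common callable interface; the application depends only on `complete()`, and the active model comes from configuration:

```python
# Thin model-abstraction layer sketch. Provider clients are hypothetical
# stand-ins for real SDK calls; the application only ever calls complete().
from typing import Callable

PROVIDERS: dict[str, Callable[[str], str]] = {}

def register(model_id: str, client: Callable[[str], str]) -> None:
    """Register a provider client under a model identifier."""
    PROVIDERS[model_id] = client

# In production this would be loaded from a feature-flag or config service,
# so switching models is a config change, not a code deployment.
CONFIG = {"active_model": "model-a"}

def complete(prompt: str) -> str:
    """Route the prompt to whichever model the config currently selects."""
    return PROVIDERS[CONFIG["active_model"]](prompt)
```

The design choice that matters is that no application code mentions a vendor SDK directly; re-running your evaluation set against a newly registered model is then the only gate to switching.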
The 80/20 Test
For most enterprise tasks, 2-3 models will perform within 5% of each other on your benchmarks. When that happens, choose based on cost, vendor relationship, and operational factors. Don't optimise for the last 2% of benchmark performance.
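The rule can be made mechanical. A sketch, assuming each candidate carries a benchmark score from Step 2 and a cost figure from Step 3 (field names are illustrative):

```python
# The 80/20 test as code: among models scoring within 5% of the leader
# on your benchmarks, pick the cheapest rather than the top scorer.
def pick_model(results: dict[str, dict]) -> str:
    """results maps model_id -> {"score": float, "cost_per_mtok": float}."""
    best = max(r["score"] for r in results.values())
    near_best = {m: r for m, r in results.items()
                 if r["score"] >= best * 0.95}
    return min(near_best, key=lambda m: near_best[m]["cost_per_mtok"])
```

Vendor relationship and operational factors can be added as tiebreakers, but cost alone already filters out the reflexive "always use the biggest model" choice.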
The Open-Source Question
Enterprise teams increasingly ask about open-source models (Llama, Mistral, Mixtral). The honest answer:
Where open-source works well: Classification, extraction, and other focused tasks where you can fine-tune a smaller model. Also where data sovereignty requires on-premise deployment.
Where it doesn't (yet): Complex reasoning tasks where frontier models still lead significantly. Also where you need enterprise support, SLAs, and compliance documentation.
The hybrid approach: Use commercial frontier models for your most critical, complex tasks. Use open-source for high-volume, lower-complexity tasks where cost matters most. This is the pattern we recommend for most enterprise clients.
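Sketched as a router, with the complexity and customer-facing flags coming from the task taxonomy in Step 1 (the flag names are assumptions for illustration):

```python
# Hybrid routing sketch: critical/complex and customer-facing work goes
# to a commercial frontier model; routine high-volume work goes to a
# cheaper open-source model.
def route(task: dict) -> str:
    """Return which model pool should handle this task."""
    if task.get("complex") or task.get("customer_facing"):
        return "frontier"
    return "open_source"
```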
Current Landscape (Late 2024)
A snapshot of where we see each provider excelling:
Anthropic (Claude): Strong reasoning and analysis. Excellent at following complex instructions. Best safety and controllability for enterprise use. Growing enterprise offering.
OpenAI (GPT-4o): Broad capability. Strong multimodal. Largest ecosystem and tooling. Most mature enterprise agreements.
Google (Gemini): Strong multimodal. Long context windows. Deep integration with Google Cloud. Improving rapidly.
Meta (Llama): Best open-source option. Good for fine-tuning and on-premise deployment. Enterprise support through partners.
Mistral: Strong European option. Good performance-to-cost ratio. Emerging enterprise offering.
This landscape changes every quarter. The framework for evaluating doesn't.
The model you choose today won't be the model you use in 12 months. Invest in the abstraction layer: it's the cheapest insurance in enterprise AI.
Mak Khan
Chief AI Officer

