One of our clients was spending $14,000 a month on OpenAI API calls. After four weeks of optimisation, we got that to $3,200 with no measurable loss in output quality. The techniques are not complicated. They are just not obvious until you have run AI at scale and watched the bills come in.
What You Need to Know
- Model selection is your biggest cost lever. Not every task needs GPT-4. Routing simple tasks to smaller, cheaper models can cut costs by 60-80% with negligible quality impact.
- Caching is underused in enterprise AI. Many enterprise queries are repetitive. A well-designed semantic cache can eliminate 30-50% of API calls entirely.
- Batching and async processing reduce per-unit costs. Real-time inference is expensive. Moving non-urgent tasks to batch processing can cut costs significantly.
- The optimisation order matters: model routing first, caching second, prompt optimisation third, batching fourth. This sequence delivers the most impact with the least disruption.
Why Costs Spiral
Enterprise AI costs spiral for predictable reasons. Teams start with a proof of concept using the most capable model available. It works well. It ships to production. Usage grows. Nobody revisits the model choice because it is working and nobody wants to break it.
Meanwhile, every request hits the most expensive model regardless of complexity. Simple classification tasks that a model at one-tenth the cost could handle are processed alongside complex reasoning tasks. The bill grows linearly with usage, but it does not need to.
73% of enterprise AI teams use a single model for all tasks, regardless of complexity. (Source: Andreessen Horowitz, AI Infrastructure Survey, 2024)
The Optimisation Playbook
1. Model Routing
This is the single highest-impact optimisation. Not every task needs your most capable model.
How it works: Build a routing layer that classifies incoming requests by complexity and routes them to the appropriate model. Simple tasks (classification, extraction, formatting) go to smaller models. Complex tasks (reasoning, creative generation, multi-step analysis) go to your most capable model.
Implementation pattern:
- Categorise your tasks by complexity. Most enterprise workloads break into three tiers: simple (60-70%), moderate (20-30%), complex (5-15%).
- Assign a model to each tier. For example: a small open-source model for simple tasks, a mid-tier model for moderate tasks, GPT-4 or Claude for complex tasks.
- Build a lightweight classifier that routes requests. This can be rule-based initially (if the task is classification, use the small model) and evolve to an ML-based router over time.
Expected impact: 50-70% cost reduction on a typical enterprise workload. The exact number depends on your task distribution.
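A rule-based router along these lines can be very small. The sketch below is illustrative: the task categories and model names are placeholder assumptions, not recommendations, and a production router would grow into the ML-based classifier described above.

```python
# Minimal rule-based model router: map task categories to model tiers.
# Model identifiers here are placeholders; substitute the models you actually use.
ROUTES = {
    "classification": "small-model",   # simple tier (60-70% of traffic)
    "extraction":     "small-model",
    "formatting":     "small-model",
    "summarisation":  "mid-model",     # moderate tier (20-30%)
    "qa":             "mid-model",
    "reasoning":      "large-model",   # complex tier (5-15%)
    "generation":     "large-model",
}

# Fail safe: anything unclassified goes to the most capable model,
# so routing mistakes degrade cost, never quality.
DEFAULT_MODEL = "large-model"

def route(task_type: str) -> str:
    """Return the model tier to use for a given task type."""
    return ROUTES.get(task_type, DEFAULT_MODEL)
```

The fail-safe default is the important design choice: when the router is unsure, it spends money rather than risking quality, which keeps the optimisation low-risk.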
John has written about this pattern in the context of multi-model architectures. The cost benefit is a side effect of good architecture.
2. Semantic Caching
Enterprise AI queries are more repetitive than most teams realise. Customer service queries cluster around common topics. Document processing handles similar document types. Knowledge retrieval answers variations of the same questions.
How it works: Before sending a request to the model, check a semantic cache for similar previous requests. If a sufficiently similar query has been answered recently, return the cached response. "Sufficiently similar" is defined by an embedding similarity threshold that you tune.
Implementation pattern:
- Embed incoming queries using a lightweight embedding model.
- Store query embeddings alongside their responses in a vector store.
- For each new query, search for similar cached queries above a similarity threshold (typically 0.92-0.95).
- Return the cached response for matches. Pass through to the model for misses.
Expected impact: 30-50% cache hit rate on typical enterprise workloads. Higher for customer service and knowledge retrieval. Lower for creative and analytical tasks.
Watch out for: Stale cache entries. Set TTLs based on how frequently your underlying data changes. A knowledge base that updates weekly needs shorter TTLs than a policy database that updates quarterly.
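The pattern can be sketched in a few dozen lines. The feature-hashing `embed` function below is a stand-in so the example is self-contained; a real deployment would call a lightweight embedding model and store vectors in a proper vector store. The threshold and TTL values are the tunables discussed above.

```python
import hashlib
import math
import time

def embed(text: str, dims: int = 64) -> list[float]:
    """Stand-in embedding via feature hashing, for illustration only.
    A real system would use a lightweight embedding model instead."""
    vec = [0.0] * dims
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

class SemanticCache:
    def __init__(self, threshold: float = 0.92, ttl_seconds: float = 3600.0):
        self.threshold = threshold   # similarity cut-off, tune per workload
        self.ttl = ttl_seconds       # set from how often underlying data changes
        self.entries = []            # (embedding, response, timestamp)

    def get(self, query: str):
        """Return a cached response for a sufficiently similar query, else None."""
        now = time.time()
        # Evict stale entries first, then find the closest live match.
        self.entries = [e for e in self.entries if now - e[2] < self.ttl]
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None  # cache miss: caller passes through to the model

    def put(self, query: str, response: str):
        self.entries.append((embed(query), response, time.time()))
```

On a miss the caller sends the request to the model and calls `put` with the result, so the cache warms itself from live traffic.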
3. Prompt Optimisation
Longer prompts cost more. Most enterprise prompts are longer than they need to be because they were written iteratively (each iteration adding context to fix edge cases) and never trimmed.
Techniques:
- Compress system prompts. Remove redundant instructions. Models do not need lengthy preambles. A concise system prompt with clear constraints outperforms a verbose one.
- Use structured output formats. Asking for JSON with a defined schema produces shorter, more predictable outputs than asking for free-form text.
- Reduce few-shot examples. Many prompts include five to ten examples when two to three achieve the same quality. Test systematically, starting with fewer examples.
- Move static context to retrieval. Instead of including large context windows in every prompt, use RAG to inject only the relevant context per request.
Expected impact: 20-40% reduction in token usage per request. Compounds with model routing (cheaper models process the optimised prompts).
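It is worth putting rough numbers on prompt trimming before doing it, because the savings scale with request volume. The sketch below uses a crude characters-per-token estimate and an illustrative price; use your provider's tokenizer and published rates for real figures.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate (~4 characters per token for English text).
    Use your provider's tokenizer for exact counts."""
    return max(1, len(text) // 4)

def monthly_prompt_cost(prompt: str, requests_per_month: int,
                        price_per_1k_tokens: float) -> float:
    """Input-token cost of sending this prompt on every request."""
    return estimate_tokens(prompt) * requests_per_month * price_per_1k_tokens / 1000

# Illustrative comparison: a verbose, never-trimmed system prompt versus
# a concise one. Volume and price are assumptions, not real rates.
verbose = "You are a helpful assistant. " * 40
concise = "Classify the ticket as billing, technical, or other."
before = monthly_prompt_cost(verbose, 500_000, 0.01)
after = monthly_prompt_cost(concise, 500_000, 0.01)
```

Even at modest volumes the gap compounds, and it compounds again once the trimmed prompt is routed to a cheaper model.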
4. Batching and Async Processing
Not every AI task needs real-time inference. Document processing, report generation, data enrichment, and analytics can often run asynchronously.
How it works: Queue non-urgent tasks and process them in batches. Many model providers offer batch APIs with significant discounts (OpenAI's batch API is 50% cheaper than real-time).
Implementation pattern:
- Classify tasks as real-time (user is waiting) or async (result needed within hours).
- Queue async tasks in a job system.
- Process in batches during off-peak hours or via batch APIs.
- Notify users when results are ready.
Expected impact: 30-50% cost reduction on async workloads. The organisational change (getting users to accept async results) is harder than the technical implementation.
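The queueing side of this pattern is straightforward. The sketch below shows the real-time versus async split and batch accumulation; `dispatch` is a placeholder for a call to your provider's batch API, and the names and batch size are illustrative assumptions.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Task:
    payload: str
    realtime: bool  # True when a user is actively waiting on the result

@dataclass
class BatchQueue:
    batch_size: int = 100
    pending: deque = field(default_factory=deque)

    def submit(self, task: Task):
        """Dispatch real-time tasks immediately; accumulate async tasks."""
        if task.realtime:
            return self.dispatch([task])
        self.pending.append(task)
        if len(self.pending) >= self.batch_size:
            return self.flush()
        return None  # queued; results delivered later via notification

    def flush(self):
        """Send everything queued so far, e.g. on a timer or off-peak schedule."""
        batch = [self.pending.popleft() for _ in range(len(self.pending))]
        return self.dispatch(batch)

    def dispatch(self, tasks):
        # Placeholder: in production this would call the provider's batch API.
        return [f"processed:{t.payload}" for t in tasks]
```

In practice a timer-driven `flush` matters as much as the size trigger, so queued work never waits longer than the agreed async window.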
Monitoring and Governance
Cost optimisation is not a one-time exercise. You need ongoing monitoring:
- Cost per task type. Track spending by task category, not just total spend. This reveals which tasks are disproportionately expensive.
- Quality metrics per model. Ensure cheaper models are maintaining acceptable quality. Run automated evaluations on a sample of routed requests.
- Cache hit rates. Monitor cache effectiveness over time. Declining hit rates may indicate changing query patterns or stale cache entries.
- Cost per business outcome. The ultimate metric. Not cost per API call, but cost per customer query resolved, cost per document processed, cost per insight generated.
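Per-category tracking needs very little machinery to get started. A minimal sketch, assuming you can attribute a cost figure to each call (most provider responses include token usage from which cost is derived):

```python
from collections import defaultdict

class CostTracker:
    """Track spend and call counts per task category."""

    def __init__(self):
        self.spend = defaultdict(float)
        self.calls = defaultdict(int)

    def record(self, task_type: str, cost: float):
        self.spend[task_type] += cost
        self.calls[task_type] += 1

    def report(self):
        """(task_type, total_spend, cost_per_call), most expensive first.
        Surfaces which categories are disproportionately expensive."""
        return sorted(
            ((t, self.spend[t], self.spend[t] / self.calls[t])
             for t in self.spend),
            key=lambda row: row[1],
            reverse=True,
        )
```

The same structure extends naturally to cost per business outcome: record against "customer query resolved" or "document processed" instead of raw task type.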
The Bottom Line
Enterprise AI cost optimisation is unglamorous work. There is no breakthrough technology involved. It is routing, caching, compressing, and batching. Engineering fundamentals applied to a new problem.
But the impact is substantial. The difference between a $14,000 monthly bill and a $3,200 monthly bill is not just cost savings. It is the difference between AI that is too expensive to scale and AI that the business case supports expanding.
Start with model routing. It is the highest-impact, lowest-risk optimisation. Then add caching. Then optimise prompts. Then batch what you can. In that order.

