The fine-tuning versus prompting debate generates more heat than light. Vendor positions, blog posts, and conference talks tend to advocate for whichever approach the speaker specialises in. Vincent and I have been working through the decision systematically, and the answer depends on factors that most discussions ignore: the volume of inference calls, the cost of errors, the rate of domain change, and the mathematical relationship between training data size and marginal performance gain.
What You Need to Know
- Fine-tuning wins on cost efficiency at high volume (roughly 10,000+ inference calls per month on the same task type), because a smaller fine-tuned model can match a larger prompted model at lower per-call cost.
- Prompting wins on flexibility. If your task requirements change frequently, the cost of re-fine-tuning erodes the per-call savings.
- Neither wins if the bottleneck is data quality. A fine-tuned model trained on poor data is confidently wrong. A prompted model with poor context is generically wrong. Both fail.
- The decision is quantifiable. Below is the framework we use.
The Cost Crossover Point
Per-Call Economics
A large general model (GPT-4 class) with a well-engineered prompt costs more per call than a smaller fine-tuned model (GPT-3.5 class) that has been trained on your specific task. The per-call difference varies but is typically 5-10x.
However, fine-tuning has upfront costs: training data preparation, training compute, evaluation, and ongoing maintenance. The question is where the crossover point sits.
If the per-call saving from fine-tuning is S and the total upfront cost of fine-tuning is C, then fine-tuning becomes cost-effective after C / S calls. For most enterprise use cases, this crossover sits between 5,000 and 50,000 calls.
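The crossover arithmetic is simple enough to sketch in a few lines. The dollar figures below are illustrative placeholders, not quotes from any provider:

```python
def crossover_calls(fine_tune_cost: float, prompted_cost_per_call: float,
                    finetuned_cost_per_call: float) -> float:
    """Number of calls after which fine-tuning pays back its upfront cost."""
    saving_per_call = prompted_cost_per_call - finetuned_cost_per_call
    if saving_per_call <= 0:
        raise ValueError("Fine-tuning never pays back: no per-call saving")
    return fine_tune_cost / saving_per_call

# Illustrative numbers: $2,000 total fine-tuning cost, $0.05 vs $0.01 per call.
calls = crossover_calls(2000, 0.05, 0.01)
print(calls)  # 50000.0
```

With these inputs the crossover lands at the top of the 5,000-50,000 range; a cheaper fine-tuning pipeline or a larger per-call gap pulls it down fast.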
Dr Vincent Russell
Machine Learning (AI) Engineer
The Maintenance Tax
Fine-tuned models need retraining when the domain changes. New products, updated policies, changed regulations, evolved terminology. Each retraining cycle incurs cost. If the domain changes quarterly, the annual maintenance cost is four training cycles plus four evaluation cycles plus the engineering time to manage the pipeline.
Prompted models absorb domain changes through updated prompts and retrieval context. The maintenance cost is lower but the per-call cost remains higher.
The break-even calculation must include maintenance, not just the initial training cost.
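Folding maintenance into the comparison, a minimal annual-cost model might look like this (all dollar figures and call volumes are hypothetical):

```python
def annual_cost_finetuned(calls_per_year: int, cost_per_call: float,
                          retrain_cost: float, retrains_per_year: int) -> float:
    """Yearly cost of a fine-tuned model, including retraining cycles."""
    return calls_per_year * cost_per_call + retrain_cost * retrains_per_year

def annual_cost_prompted(calls_per_year: int, cost_per_call: float) -> float:
    """Yearly cost of a prompted model (prompt maintenance assumed negligible)."""
    return calls_per_year * cost_per_call

# Quarterly domain changes mean four retraining cycles per year.
ft = annual_cost_finetuned(120_000, 0.01, 1500, 4)
pr = annual_cost_prompted(120_000, 0.05)
print(ft, pr)  # fine-tuning can lose here despite 5x cheaper calls
```

At this volume the four retraining cycles outweigh the per-call saving; double the call volume and the comparison flips.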
Performance Characteristics
Where Fine-Tuning Excels
Consistent formatting. If outputs must follow a strict schema (JSON with specific fields, structured reports, classification into a fixed taxonomy), fine-tuning produces more reliable formatting than prompting. The model learns the output structure during training rather than inferring it from instructions.
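The formatting claim is easy to make measurable: score schema compliance over a sample of outputs from each approach. The schema and sample outputs below are invented for illustration:

```python
import json

REQUIRED_FIELDS = {"category", "confidence"}  # hypothetical fixed schema

def is_schema_compliant(output: str) -> bool:
    """True if the output parses as a JSON object with all required fields."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_FIELDS <= parsed.keys()

outputs = ['{"category": "claim", "confidence": 0.9}',
           'Sure! Here is the JSON: ...']  # a typical prompted-model failure mode
rate = sum(is_schema_compliant(o) for o in outputs) / len(outputs)
print(rate)  # 0.5
```

Run the same check over a few hundred outputs from each approach and the "more reliable formatting" claim becomes a number you can compare.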
Domain-specific language. Industries with specialised terminology (legal, medical, insurance, engineering) benefit from fine-tuning because the model learns the vocabulary and usage patterns of the domain. Prompted models can handle domain language with context, but fine-tuned models handle it natively.
High-volume, stable tasks. If you're processing thousands of similar documents per day and the task hasn't changed in months, fine-tuning optimises both cost and quality.
Where Prompting Excels
Rapidly changing requirements. If the task evolves weekly (new edge cases, updated rules, expanded scope), re-prompting is immediate while re-fine-tuning takes days to weeks.
Multi-task flexibility. A prompted general model can handle diverse tasks. A fine-tuned model excels at one task. If your AI system serves multiple use cases, maintaining separate fine-tuned models for each is an engineering burden.
Small data regimes. Fine-tuning requires hundreds to thousands of training examples. If you have fewer than 200 examples for a task, prompting with few-shot examples will likely outperform fine-tuning.
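In the small-data regime, the same labelled examples can be spent on few-shot prompting instead of training. A minimal prompt builder, with placeholder labels and template:

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Assemble a few-shot classification prompt from (input, label) pairs."""
    shots = "\n\n".join(f"Input: {x}\nLabel: {y}" for x, y in examples)
    return f"{shots}\n\nInput: {query}\nLabel:"

# Hypothetical support-ticket routing examples.
examples = [("Policy renewal overdue", "billing"),
            ("Broken login page", "technical")]
print(build_few_shot_prompt(examples, "Invoice shows wrong amount"))
```

The same 50-200 examples that are too few for fine-tuning are plenty for selecting and rotating few-shot demonstrations.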
The Decision Framework
Step 1: Volume Assessment
How many inference calls per month on this specific task?
- Under 1,000: prompting (fine-tuning cost never pays back)
- 1,000-10,000: prompting unless performance gap is critical
- Over 10,000: evaluate fine-tuning
Step 2: Stability Assessment
How often do task requirements change?
- Weekly: prompting (re-fine-tuning cycle is too slow)
- Monthly: either (evaluate on other factors)
- Quarterly or less: fine-tuning candidate
Step 3: Data Assessment
Do you have sufficient labelled examples for fine-tuning?
- Under 200: prompting
- 200-1,000: fine-tuning possible but evaluate quality carefully
- Over 1,000: fine-tuning viable
Step 4: Performance Gap Assessment
Run both approaches on a held-out test set. Is the performance difference statistically significant and practically meaningful?
- No significant difference: prompting (lower maintenance cost wins)
- Significant difference favouring fine-tuning: proceed with fine-tuning if volume and stability criteria are met
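The four steps collapse naturally into a single decision function. The thresholds mirror the ones above and are starting points, not laws:

```python
def recommend(calls_per_month: int, change_interval_days: int,
              labelled_examples: int, significant_gain: bool) -> str:
    """Apply the four-step framework; returns a recommendation string."""
    if calls_per_month < 1000:            # Step 1: volume never pays back
        return "prompting"
    if change_interval_days < 30:         # Step 2: weekly churn, retraining too slow
        return "prompting"
    if labelled_examples < 200:           # Step 3: too little data to fine-tune
        return "prompting"
    if not significant_gain:              # Step 4: no measured gap, lower maintenance wins
        return "prompting"
    if calls_per_month < 10_000:
        return "prompting unless the performance gap is critical"
    return "fine-tuning"

print(recommend(50_000, 90, 2_000, True))  # fine-tuning
```

Note the ordering: the cheap-to-answer questions (volume, stability, data) gate the expensive one, the head-to-head evaluation on a held-out test set.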
The decision between fine-tuning and prompting is not ideological. It is an engineering and economic decision that can be quantified. Run the numbers for your specific use case. The framework above gives you the structure. The data from your own evaluation gives you the answer.

