
GPT-4 Changes the Game

GPT-4 just launched. Multimodal, more reliable, genuinely useful for enterprise. What this step change means for the work we do.
20 March 2023·5 min read
Mak Khan
Chief AI Officer
Isaac Rolfe
Managing Director
GPT-4 launched last week. We've been running it against our internal test cases since the API went live. This is not an incremental update. This is the kind of capability jump that moves use cases from "interesting demo" to "production viable."

What's Actually Different

Mak: Let me be specific about what changed, because the marketing around this launch is intense and specificity matters.
Accuracy. GPT-4 is materially more accurate than GPT-3.5. OpenAI reports it scores roughly 40% higher than GPT-3.5 on its internal adversarial factuality evaluations. Our own testing suggests that figure is roughly right - maybe slightly optimistic - but the improvement is real and consistent across tasks.
Reasoning. Complex, multi-step reasoning is dramatically better. Where GPT-3.5 would lose the thread on long logical chains, GPT-4 maintains coherence. For enterprise tasks like policy analysis or contract review, this is the difference between a tool you demo and a tool you deploy.
Context window. The 32K token variant can process roughly 25,000 words in a single request. That's an entire report, a full contract, a comprehensive policy document. The engineering workarounds we needed for GPT-3.5's 4K limit - chunking, retrieval strategies, careful context management - become optional rather than essential.
Multimodal input. GPT-4 accepts images alongside text. Feed it a photo of a whiteboard, a screenshot of a dashboard, a scan of a document. It processes all of them. For enterprise document processing, this opens up use cases that were previously out of reach.
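The point about chunking becoming optional can be sketched in a few lines. This is a rough illustration, not production code: the token estimate uses the common rule of thumb of roughly 0.75 words per token (an assumption, not an exact count - a real tokenizer would be used in practice), and the function names are ours.

```python
# Rough sketch: decide whether a document fits GPT-4's 32K context,
# or still needs the GPT-3.5-era chunking fallback.
# Assumption: ~1 token per 0.75 English words (a rule of thumb,
# not an exact tokenizer count).

def estimate_tokens(text: str) -> int:
    """Crude token estimate from word count."""
    return int(len(text.split()) / 0.75)

def chunk_words(text: str, max_tokens: int) -> list[str]:
    """Naive word-based chunking fallback for models with small windows."""
    words = text.split()
    words_per_chunk = int(max_tokens * 0.75)
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

def prepare_request(document: str, context_limit: int = 32_000,
                    reserve_for_reply: int = 2_000) -> list[str]:
    """Send the whole document if it fits; otherwise fall back to chunks."""
    budget = context_limit - reserve_for_reply
    if estimate_tokens(document) <= budget:
        return [document]  # single request - no chunking needed
    return chunk_words(document, budget)
```

With a 4K window, almost every real document hits the fallback path; with 32K, most don't. That's the shift from essential to optional.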
40%
higher score than GPT-3.5 on OpenAI's internal factuality evaluations
Source: OpenAI, GPT-4 Technical Report, March 2023

What This Means for Enterprise

Isaac: Here's the honest version. Three months ago, after ChatGPT launched, I wrote that the gap between consumer AI and enterprise AI was enormous. That gap just got smaller. Not closed - but smaller.
The accuracy improvement matters most. Enterprise AI lives or dies on reliability. A tool that's right 75% of the time is a liability. A tool that's right 92% of the time, with clear confidence indicators and human review workflows, is genuinely useful. GPT-4 pushes several of our internal test cases past that threshold.
But the fundamentals haven't changed. You still need clean data. You still need clear use cases. You still need governance. A better model doesn't fix a broken process - it just breaks it faster.

The Cost Question

Mak: GPT-4 is expensive. Roughly 10-30x the cost of GPT-3.5 depending on usage patterns. For enterprise deployments at scale, this means model routing becomes an architectural requirement, not an optimisation. Use GPT-4 for the hard stuff - analysis, synthesis, complex reasoning. Use GPT-3.5 for the routine - classification, formatting, simple extraction.
This is a pattern we expect to see everywhere. Not every task needs the most capable model. Most tasks need the right model for the job.
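The routing pattern above can be sketched as a simple dispatch table. The task categories and the default-to-stronger-model fallback are our assumptions for illustration, not a fixed taxonomy.

```python
# Illustrative sketch of model routing: send complex reasoning to
# GPT-4, routine work to GPT-3.5. Task names here are assumptions.

# Tasks where the capability jump justifies GPT-4's higher per-token cost.
HARD_TASKS = {"analysis", "synthesis", "contract_review", "policy_review"}

# Routine tasks the cheaper model handles well enough.
ROUTINE_TASKS = {"classification", "formatting", "extraction"}

def choose_model(task_type: str) -> str:
    """Route a task to the cheapest model that handles it reliably."""
    if task_type in HARD_TASKS:
        return "gpt-4"
    if task_type in ROUTINE_TASKS:
        return "gpt-3.5-turbo"
    # Unknown task types default to the stronger model until benchmarked.
    return "gpt-4"
```

In a real system the routing decision would be driven by measured accuracy per task, not a hand-written list - but the architecture is the same: the router sits in front of the model calls.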

What We're Doing With It

We're running GPT-4 through every enterprise use case we've been exploring:
  • Document analysis. Extraction accuracy jumped from roughly 82% to 94% on our policy document test set. That's the difference between prototype and production.
  • Knowledge synthesis. Multi-source question answering - pulling from several documents to construct an answer - improved significantly. GPT-3.5 struggled to synthesise. GPT-4 handles it cleanly.
  • Code generation. Structured output (JSON, API responses, data schemas) is markedly more consistent. We're removing entire layers of validation logic that were necessary for GPT-3.5.
These aren't toy examples. These are the building blocks of enterprise AI systems.

The Pace Is Staggering

Isaac: I want to name something that keeps nagging at me. ChatGPT launched in November. GPT-4 launched in March. That's four months between a technology that impressed everyone and a technology that's genuinely enterprise-viable.
The pace of improvement in this space is unlike anything I've seen in enterprise technology. And it's creating a real tension: the organisations that wait for the technology to "settle down" before investing will find the gap between them and early movers has widened substantially.
You don't need to move recklessly. But you need to be moving. Experimenting. Building institutional knowledge about what AI can do for your specific business. That knowledge compounds, and four months from now the landscape will look different again.