
GPT-4 Just Raised the Bar

Two days after launch: GPT-4 is a genuine step-change. Multimodal, more accurate, better reasoning. What it means for enterprise AI.
16 March 2023·6 min read
Mak Khan
Chief AI Officer
Isaac Rolfe
Managing Director
GPT-4 launched two days ago. We've been testing it non-stop since the API went live. This isn't an incremental update. It's the kind of capability jump that changes what's possible in enterprise AI delivery.

What You Need to Know

  • GPT-4 is multimodal (text and image input), significantly more accurate, and better at complex reasoning than GPT-3.5. It scores in the top 10% on the bar exam. GPT-3.5 scored in the bottom 10%.
  • For enterprise use cases, the accuracy improvement matters more than any other feature. The gap between "mostly right" and "reliably right" is the difference between a demo and a production system.
  • GPT-4 reduces hallucination rates by roughly 40% compared to GPT-3.5, according to OpenAI's internal benchmarks. That's meaningful, but it doesn't eliminate the problem.
  • The cost is approximately 10-30x higher than GPT-3.5 depending on usage. Enterprise economics will require careful decisions about which tasks justify the premium.
40%
reduction in hallucination rate compared to GPT-3.5, per OpenAI's internal evaluation
Source: OpenAI, GPT-4 Technical Report, March 2023

First Impressions

Isaac: I ran our standard enterprise test suite against GPT-4 within an hour of getting API access. Document summarisation, policy extraction, knowledge synthesis. The improvement is immediately visible. Responses that GPT-3.5 got roughly right, GPT-4 gets precisely right. Edge cases that broke 3.5 are handled cleanly. The reasoning chains are longer and more coherent.
But the thing that stopped me was the image input. I fed it a photo of a whiteboard from a recent workshop. It read the handwriting, understood the diagram structure, and produced a coherent summary of the strategic discussion. That's not a party trick. That's a capability that changes how we think about document ingestion for enterprise knowledge bases.
Mak: From an architecture perspective, the 32K context window variant changes the game for retrieval-augmented generation. With GPT-3.5's 4K context, we were spending significant engineering effort on chunking strategies and retrieval precision. 32K tokens means we can pass substantially more context per query, which directly improves answer quality for complex enterprise queries.
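The context-window arithmetic is worth making concrete. A rough sketch of how many retrieved chunks fit per query, under assumed numbers (roughly 500-token chunks, with a reserve for the question and the answer):

```python
def max_chunks(context_window: int, chunk_tokens: int = 500,
               reserved_tokens: int = 1000) -> int:
    """How many retrieved chunks fit in a model's context window,
    after reserving room for the question and the answer."""
    return max(0, (context_window - reserved_tokens) // chunk_tokens)

# GPT-3.5's 4K window vs the 32K GPT-4 variant
print(max_chunks(4_096))    # -> 6
print(max_chunks(32_768))   # -> 63
```

An order-of-magnitude jump in usable context is what relaxes the pressure on chunking strategy and retrieval precision.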
The structured output is also notably better. When you ask GPT-4 to return JSON or follow a schema, it does so consistently. That matters for production systems where downstream processing depends on predictable output formats. We were building extensive validation and retry logic for GPT-3.5. Much of that becomes unnecessary.
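The validation-and-retry logic mentioned above looks roughly like this. A minimal sketch: `call_model` is a hypothetical stand-in for a real chat-completion call, and the required keys are illustrative.

```python
import json

def call_model(prompt: str) -> str:
    # Hypothetical stand-in for a chat-completion API call.
    # Swap in a real client; here it returns canned JSON.
    return '{"policy_id": "P-123", "status": "active"}'

def get_json(prompt: str, required_keys: set[str],
             max_retries: int = 3) -> dict:
    """Ask the model for JSON and retry until the response parses
    and contains the keys downstream code depends on."""
    for _ in range(max_retries):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: retry
        if required_keys <= data.keys():
            return data
    raise ValueError(f"No valid JSON after {max_retries} attempts")

result = get_json("Extract the policy record as JSON.",
                  {"policy_id", "status"})
```

With GPT-3.5 this wrapper earns its keep; with GPT-4's more consistent schema-following, the retry path fires far less often, though keeping the validation is still prudent in production.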

What This Changes for Enterprise AI

The "Good Enough" Threshold Just Moved

We've been saying that enterprise AI becomes viable when accuracy crosses specific thresholds for specific use cases. Claims triage might need 90%. Document extraction might need 95%. Knowledge synthesis might need 85% with clear source attribution.
GPT-4 pushes several use cases past their viability threshold. Things that were "promising but not production-ready" with GPT-3.5 are now genuinely viable. Our internal experiments with policy document analysis, for example, went from roughly 82% extraction accuracy to 94%. That's the difference between an interesting prototype and something an underwriter would actually trust.

Retrieval-Augmented Generation Gets Better

RAG systems (where you combine an LLM with your own data sources) improve dramatically when the underlying model is better at reasoning over retrieved content. With GPT-3.5, we often saw the model struggle to synthesise information from multiple retrieved passages. GPT-4 handles multi-source synthesis significantly better. For enterprise knowledge base applications, this is the most important improvement.
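The multi-source synthesis step can be sketched as prompt assembly. This is a minimal illustration, not our production pipeline; the passage format and file names are assumptions:

```python
def build_rag_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    """Assemble retrieved (source, text) passages into one prompt that
    asks the model to synthesise across sources with attribution."""
    context = "\n\n".join(
        f"[Source: {source}]\n{text}" for source, text in passages
    )
    return (
        "Answer the question using ONLY the sources below. "
        "Cite the source name for each claim.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

prompt = build_rag_prompt(
    "What is the notification deadline?",
    [("Policy_A.pdf", "Claims must be notified within 30 days."),
     ("Handbook.docx", "Late notification may void cover.")],
)
```

The retrieval side is unchanged; what improves with GPT-4 is the model's ability to reconcile and attribute claims across the `[Source: ...]` blocks in a single pass.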

Cost Is a Real Constraint

GPT-4 is expensive. At current pricing, running it at the scale needed for enterprise operations requires careful thinking about which tasks justify the cost. A reasonable pattern: use GPT-4 for complex, high-value tasks (analysis, synthesis, reasoning) and GPT-3.5 for simpler tasks (classification, summarisation, formatting).
This kind of model routing will become a standard architectural pattern. Not every query needs the best model. Most need the right model for the task.
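In its simplest form, that routing is a lookup from task type to model, falling back to the stronger model when in doubt. The task categories and the mapping here are illustrative assumptions:

```python
# Route each task to the cheapest model that meets its quality bar.
ROUTES = {
    "classification": "gpt-3.5-turbo",
    "summarisation": "gpt-3.5-turbo",
    "formatting": "gpt-3.5-turbo",
    "analysis": "gpt-4",
    "synthesis": "gpt-4",
    "reasoning": "gpt-4",
}

def pick_model(task_type: str, default: str = "gpt-4") -> str:
    """Fall back to the stronger model for unrecognised tasks."""
    return ROUTES.get(task_type, default)

print(pick_model("classification"))  # -> gpt-3.5-turbo
print(pick_model("analysis"))        # -> gpt-4
```

Real routers get more sophisticated (confidence scores, escalation on failure), but the economic principle is exactly this table.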

What This Doesn't Change

GPT-4 doesn't change the fundamentals we've been talking about. You still need clean data. You still need clear use cases. You still need governance. You still need change management. A better model doesn't fix a broken process. It just breaks it faster and more confidently.
The bar exam score makes a great headline. But your enterprise doesn't need a model that can pass the bar. It needs a model that reads your policy documents accurately, is integrated into your existing systems, is governed appropriately, and is trusted by the people who'll use it.
That work is still the hard part. GPT-4 just made the model part of the equation significantly easier.
We'll share more detailed findings as we continue testing. For now: this is real, it's material, and if you've been waiting for the technology to be "ready enough" for enterprise work, the wait just got shorter.