In 2023, we bet the company on AI. The whole company. Not a side project, not a skunkworks team. Every project, every conversation, every decision filtered through one question: how does AI change this? Two years later, we've come up for air. We have scars. We have stories. And we're finally ready to talk about what we learned.
We Did the Work
When we say two years, we mean two years. Not "we hired an AI consultant and ran a workshop." We tried every popular tool. We ran dev projects in three different toolsets at the same time just to figure out where AI was actually useful and where it was smoke and mirrors. We built internal agents, broke them, rebuilt them, broke them again.
We ran countless blind tests to figure out which models performed best at which tasks. One of our favourites: we had the team rank responses from different models without knowing which was which. Subjectively, they rated the Anthropic Claude responses about 50% higher than those from the closest OpenAI model. Beautiful reasoning, clear structure, thoughtful output. But the cost? 10x.
Here's where it gets interesting. You could daisy-chain three GPT calls together, get a more robust result, and still come in cheaper than a single Claude call. We love the thought quality from Anthropic, but for most enterprise workloads we've found OpenAI's models more cost-effective and more stable. That kind of insight doesn't come from a whitepaper. It comes from doing the work.
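To make that concrete, here's roughly what that daisy-chain looks like. A minimal sketch using the OpenAI Node SDK: the prompts, the model choice, and the draft-critique-revise split are illustrative, not our production pipeline.

```ts
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// One cheap call, reused with different roles.
async function ask(system: string, user: string): Promise<string> {
  const res = await client.chat.completions.create({
    model: "gpt-4o-mini", // placeholder: any inexpensive model
    messages: [
      { role: "system", content: system },
      { role: "user", content: user },
    ],
  });
  return res.choices[0].message.content ?? "";
}

// Stage 1 drafts, stage 2 critiques, stage 3 revises using the critique.
// Three cheap calls can beat one premium call on robustness per dollar.
export async function daisyChain(task: string): Promise<string> {
  const draft = await ask("You are a careful analyst. Answer the task.", task);
  const critique = await ask(
    "You are a sceptical reviewer. List concrete flaws in this answer.",
    `Task: ${task}\n\nAnswer: ${draft}`
  );
  return ask(
    "Rewrite the answer to fix every flaw the reviewer found.",
    `Task: ${task}\n\nAnswer: ${draft}\n\nFlaws: ${critique}`
  );
}
```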
2 years of internal R&D before offering AI services to a single client
The Horror Stories
Every team that's serious about AI has horror stories. Here are ours. We share these because every one of them taught us something we now build into every client engagement.
The Email Agent That Ate Our Pipeline
I was building an email inbox agent in n8n. Three-stage process: triage incoming mail, organise it (CRM, tickets, etc.), and reply if necessary. Tested it against legacy email. Working beautifully.
Then I got a call from a lead. "Did you get the latest proposal update?"
No. I checked my inbox. Nothing. Checked the AI agent workflow and there it was. My triage rules marked anything that looked like accounting as read and archived, skipping the inbox entirely. The lead's email about a financial update on a proposal? Looked exactly like an accounting notification to the AI.
The lesson: when you're filtering with AI, think about what you can't afford to miss. Better to catch 100% of sales emails and deal with a few extra accounting emails than to miss a single proposal. Score and rank. Don't hard-filter.
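In code, the gap between hard-filtering and scoring is a few lines, but those lines are the whole lesson. A minimal sketch; the score shape and threshold values are illustrative, not our actual triage rules.

```ts
// Hypothetical triage output: the model scores, our code decides.
interface TriageScores {
  sales: number;      // 0 to 1: could this touch revenue or a live deal?
  accounting: number; // 0 to 1: routine invoice/receipt traffic?
}

const ARCHIVE_THRESHOLD = 0.8; // tunable over time; start conservative
const SALES_FLOOR = 0.2;       // anything plausibly sales never gets archived

function route(scores: TriageScores): "inbox" | "archive" {
  // Hard-filtering on "looks like accounting" is what lost us the proposal.
  // Instead, archive only when accounting confidence is high AND sales
  // confidence is near zero. Anything ambiguous stays in the inbox.
  if (scores.sales >= SALES_FLOOR) return "inbox";
  if (scores.accounting >= ARCHIVE_THRESHOLD) return "archive";
  return "inbox"; // fail open: the inbox is the safe default
}
```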
The Products That Didn't Exist
Early API trial. E-commerce integration. We asked the AI to surface "good gift ideas for Mother's Day." Results looked fantastic. Beautiful product descriptions, perfect price points, exactly what you'd want to see.
One problem. The products didn't exist.
"Oh, I made those up. They seemed like they'd be good API responses."
Cheers for that.
The lesson: AI will hallucinate with total confidence if you let it. Every system prompt we write now includes some version of "never make something up, use only the context provided." And we verify. Always.
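In practice that means two things in code: a guardrail in the prompt, and a verification step after the response. A stripped-down sketch; the prompt wording and the catalogue check are illustrative.

```ts
// A guardrail along the lines we bake into every system prompt (wording illustrative).
const SYSTEM_PROMPT = `Answer using ONLY the product data provided below.
If the data does not contain a suitable product, say "no matching products".
Never invent a product, price, or ID.`;

// And we verify: every product the model mentions must exist in the real catalogue.
function verifyProducts(suggestedIds: string[], catalogue: Set<string>): string[] {
  // Anything the model invented simply isn't in the set, so it gets dropped.
  return suggestedIds.filter((id) => catalogue.has(id));
}
```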
The Email to sarah@gmail.com
CRM integration trial. "Give me an update on our lead Sarah." Great summary. "Draft a response." Solid email. "Now send it."
Email sent. To sarah@gmail.com.
Why? "That seemed like it would be her email."
Thankfully, test mode. But imagine that in production. Your AI agent, confidently emailing a stranger with your internal sales context.
The lesson: never use AI for what you can do with normal logic. Get AI to return an ID or a key, then pass it through your existing systems. Let your CRM return the email by ID, and fail gracefully if it's not found. AI for reasoning. Systems for execution.
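Sketched out with hypothetical names (this isn't our CRM integration, just the shape of the pattern): the model's only job is to hand back an ID, and everything that touches the outside world runs through deterministic code.

```ts
interface Lead {
  id: string;
  name: string;
  email: string;
}

// The model picked the leadId. The address comes from the CRM record,
// never from the model's imagination.
async function emailLead(leadId: string, body: string, crm: Map<string, Lead>) {
  const lead = crm.get(leadId);
  if (!lead) {
    // Fail gracefully: no guessed addresses, no sarah@gmail.com.
    throw new Error(`No CRM record for lead ${leadId}; refusing to send.`);
  }
  await send(lead.email, body); // your existing, boring, reliable mail system
}

async function send(to: string, body: string): Promise<void> {
  // Stub standing in for the real mail integration.
}
```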
What Two Years of AI R&D Actually Taught Us
- AI for reasoning, systems for execution. AI decides what to do. Your existing systems do it. Never let AI fabricate data it could look up.
- Score and rank, don't hard-filter. Hard rules miss edge cases. Let AI score relevance, then set thresholds you can tune over time.
- Every system prompt needs guardrails. "Only use the context provided" is non-negotiable. AI fills gaps creatively if you don't tell it not to.
- Cost and quality are different conversations. The "best" model isn't always the right one. Test for the task, not the benchmark.
- Stability matters more than capability. A model that's 90% as good but consistent beats one that's brilliant on Monday and unpredictable on Thursday.
We Rebuilt Everything
The AI lessons were only half the story. The other half was harder to admit: our own tech stack wasn't good enough.
We'd been on Laravel for a decade. It served us well. But if we were going AI-first, the stack had to be AI-first too. AI models work in real-time streams; Laravel is request-response. AI interfaces need to feel alive: update as the model thinks, show confidence levels and sources inline. We couldn't do that on our existing architecture without fighting it every step of the way.
So we rebuilt. Next.js. React Server Components. Streaming from the ground up. A modern TypeScript stack designed around the way AI actually works, not bolted on top of something that was built for a different era. It meant throwing away a decade of patterns and muscle memory. It meant being beginners again, briefly. But the alternative was building AI products on a foundation that would hold us back for years.
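To give a feel for the difference, here's the streaming shape in miniature, using the OpenAI Node SDK (the model name is a placeholder): tokens reach the UI as the model produces them, instead of after the full response lands.

```ts
import OpenAI from "openai";

const client = new OpenAI();

// Yields tokens as the model produces them: the raw material of an
// interface that updates while the model thinks.
export async function* streamAnswer(question: string) {
  const stream = await client.chat.completions.create({
    model: "gpt-4o-mini", // placeholder model
    messages: [{ role: "user", content: question }],
    stream: true,
  });
  for await (const chunk of stream) {
    const token = chunk.choices[0]?.delta?.content;
    if (token) yield token; // hand each token to the UI as it arrives
  }
}
```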
Ripple
While we were rebuilding, we started building something internal. An AI platform we called Ripple. The idea was straightforward: instead of building every AI capability from scratch for every project, build a shared foundation that each new capability could plug into. Knowledge bases, model orchestration, tool calling, guardrails, evaluation. All in one place.
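We're not publishing Ripple's internals here, but the core idea, a shared foundation that capabilities plug into rather than rebuild, sketches down to something like this (every name below is hypothetical):

```ts
// Hypothetical shape only: shared services built once, capabilities built many times.
interface Foundation {
  retrieve(query: string): Promise<string[]>;               // knowledge bases
  complete(prompt: string): Promise<string>;                // model orchestration
  callTool(name: string, args: unknown): Promise<unknown>;  // tool calling
  check(output: string): Promise<boolean>;                  // guardrails & evaluation
}

// Each new capability is a function over the shared foundation,
// not a from-scratch rebuild of retrieval, models, and guardrails.
type Capability = (foundation: Foundation, input: string) => Promise<string>;
```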
Ripple started in Laravel. That was fitting, actually. We learned what the platform needed to do by building it in the stack we knew, then rebuilt it in the stack it deserved. The Laravel version taught us the architecture. The Next.js version made it real.
We didn't build Ripple to sell it. We built it because we needed it. Every internal experiment, every horror story, every lesson from two years of breaking AI went into the design. It was the test of whether we could transform ourselves before we asked anyone else to trust us with their transformation.
That part of the story isn't finished yet. But it's getting close.
When It Got Real
For two years, we kept running into the same wall. AI was impressive in demos but brittle in production. Tool calling was unreliable. Hallucination rates were too high for enterprise use. We could build around these problems, and we did, but it meant layers of verification and fallback logic that added complexity and cost.
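That safety net pattern is simple to show, even if living with it wasn't. A generic sketch, not our actual wrapper: every AI call gets verified, retried, and finally handed to a deterministic fallback.

```ts
// Pre-GPT-5, something like this wrapped every AI call (illustrative).
async function withVerification<T>(
  call: () => Promise<T>,          // the AI call
  verify: (result: T) => boolean,  // deterministic check on the output
  fallback: () => T,               // safe, non-AI behaviour
  attempts = 3
): Promise<T> {
  for (let i = 0; i < attempts; i++) {
    const result = await call();
    if (verify(result)) return result; // only verified output escapes
  }
  return fallback(); // after repeated failures, degrade gracefully
}
```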
Then GPT-5 dropped.
Massive improvement in tool calling. Reduced hallucination. Better instruction following. We ran our standard test suite and the results were clear: this was ready for enterprise. Not perfect (nothing is), but reliable enough to build production systems without wrapping every AI call in three layers of safety net.
That was the moment we'd been waiting for. Not the moment AI got interesting (that was 2022). The moment it got reliable.
What Comes Next
We're ready. Two years of R&D. A rebuilt tech stack. An internal AI platform shaped by every mistake we've made. Two flagship products in production (Ora and Hakamana). And a delivery methodology we've been refining since 2011.
We're starting to work with select clients on their AI platforms. Not because the market is asking. Because we've earned the right to. Organisations we already know, where we understand the domain and can deliver results that justify the trust.
We'll open up more broadly when the time is right. For now, the work continues. And we're having a lot more fun than we expected.
Two years ago I said we were going in head first. Now it's time to put it to work.
Isaac Rolfe
Managing Director
This is the fourth chapter. Read Where It Started for the beginning, Why We Became RIVER for the rebrand, The AI Pivot for the commitment, or continue to what came next for the launch.
