
Enterprise Error Handling Patterns

Errors will happen. The question is whether your system handles them gracefully or silently corrupts data. The patterns that matter.
10 May 2022·8 min read
John Li
Chief Technology Officer
Last month we traced a data inconsistency back to a try-catch block that swallowed an exception. The error had been silently failing for three weeks. Three weeks of corrupted records that nobody noticed because the system kept returning 200 OK. This is what bad error handling looks like in practice: not a crash, but a slow, invisible corruption.

The Real Cost of Silent Failures

In enterprise systems, errors that crash loudly are preferable to errors that fail silently. A crash gets attention. An alert fires, someone investigates, the problem gets fixed. A silent failure - a swallowed exception, a default return value, a retry that quietly drops data - can persist for weeks or months before anyone notices.
The cost isn't the error itself. It's the data cleanup after you discover it. We've spent more time reconciling silently corrupted data than we've ever spent fixing actual outages.
"I'd rather have a system that crashes visibly than one that lies quietly. The silent failure costs you a weekend of data reconciliation and a difficult client conversation." - John Li, Chief Technology Officer

Pattern 1: Fail Fast, Recover Deliberately

Validate inputs at the boundary. Don't let bad data travel through your system hoping something downstream will catch it. If a required field is missing, if a value is outside expected range, if a foreign key doesn't resolve - reject it immediately with a clear error.
This feels aggressive. Enterprise teams often resist it because they want systems to be "resilient." But resilience doesn't mean accepting garbage. It means handling failures explicitly and predictably.
// Bad: let it through and hope for the best
function processOrder(order) {
  const customer = getCustomer(order.customerId) // might be null
  return createInvoice(customer.name, order.total) // crashes somewhere
}

// Good: validate and fail with context
class ValidationError extends Error {}
class NotFoundError extends Error {}

function processOrder(order) {
  if (!order.customerId) throw new ValidationError('Order missing customerId')
  const customer = getCustomer(order.customerId)
  if (!customer) throw new NotFoundError(`Customer ${order.customerId} not found`)
  return createInvoice(customer.name, order.total)
}
The second version fails in the same place, but it tells you exactly what went wrong and where. That's the difference between a five-minute fix and a two-hour investigation.

Pattern 2: Error Classification

Not all errors are equal. Your error handling should distinguish between:
Transient errors. Network timeouts, rate limits, temporary service unavailability. These should be retried with backoff. They'll resolve on their own.
Client errors. Bad input, missing authentication, invalid requests. These should not be retried. Return a clear error message. The client needs to fix something.
System errors. Bugs, corrupted state, unexpected failures. These need alerts and investigation. They won't resolve without human intervention.
Dependency errors. A service you depend on is down or returning unexpected results. Depending on the dependency, this might be transient (retry) or might need a fallback strategy.
Treating all errors the same - either retrying everything or alerting on everything - creates noise. Transient errors that auto-resolve flood your alert channels. Client errors that get retried waste resources. Classification is the foundation of effective error handling.
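A minimal sketch of what classification looks like in code. The class names and the `retryable`/`alertable` flags are illustrative conventions, not any specific library's API:

```javascript
// Base class carries the handling policy; subclasses pick a category.
class AppError extends Error {
  constructor(message, { retryable = false, alertable = false } = {}) {
    super(message)
    this.retryable = retryable
    this.alertable = alertable
  }
}

class TransientError extends AppError {
  constructor(message) { super(message, { retryable: true }) } // retry with backoff
}

class ClientError extends AppError {
  constructor(message) { super(message) } // never retry; return a clear message
}

class SystemError extends AppError {
  constructor(message) { super(message, { alertable: true }) } // page a human
}

// Dispatch on the category, not on string-matching error messages
function handle(err) {
  if (err instanceof TransientError) return 'retry'
  if (err instanceof ClientError) return 'reject'
  return 'alert' // system errors, and anything unclassified, get investigated
}
```

Note that the unclassified case falls through to 'alert': an error you can't categorize is treated as a system error, not silently retried.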
62%
of enterprise production incidents are caused by inadequate error handling rather than logic bugs
Source: PagerDuty State of Digital Operations, 2022

Pattern 3: Structured Error Context

Every error needs context. Not just "something went wrong" but:
  • What was being attempted
  • What input triggered the error
  • Where in the process it failed
  • When it happened (timestamps, not just log ordering)
  • Correlation ID so you can trace it across services
In a microservices architecture, a single user action might touch five services. Without a correlation ID, connecting the error in Service D to the request that started in Service A is detective work. With a correlation ID, it's a log search.
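One way to sketch this: an error type that carries structured context alongside the message. The field names here (operation, input, service, correlationId) are assumed conventions, not a standard:

```javascript
// Error that captures the "what, where, when" context at the point of failure
class ContextualError extends Error {
  constructor(message, context) {
    super(message)
    this.context = {
      timestamp: new Date().toISOString(), // when: a timestamp, not log ordering
      ...context,
    }
  }
  // Flatten into a structured log entry
  toLogEntry() {
    return { message: this.message, ...this.context }
  }
}

const err = new ContextualError('Invoice creation failed', {
  operation: 'processOrder',     // what was being attempted
  input: { orderId: 'ord_123' }, // what input triggered the error
  service: 'billing',            // where in the process it failed
  correlationId: 'req-9f2c',     // trace it across services with a log search
})
```

The correlation ID is generated once at the edge (the first service that receives the request) and passed along on every downstream call.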

Pattern 4: Retry with Backoff and Budgets

Retries are necessary for transient errors but dangerous without limits.
Exponential backoff. First retry after 1 second, then 2, then 4, then 8. This prevents your system from hammering a struggling service.
Retry budgets. Limit the total number of retries. Three attempts is usually sufficient. If it hasn't worked after three tries, it's probably not transient.
Jitter. Add randomness to the backoff interval. Without jitter, all your retries hit the struggling service at the same time, creating thundering herd problems.
Circuit breakers. After a threshold of failures, stop trying. The circuit "opens" and requests fail immediately instead of waiting for a timeout. Periodically let a single request through to check if the service has recovered. This prevents a cascade where one failing service takes down everything that depends on it.
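The four ideas above fit in a few dozen lines. This is a sketch under assumed names (retryWithBackoff, CircuitBreaker) and defaults; in production you'd reach for a resilience library rather than hand-rolling it:

```javascript
// Exponential backoff with jitter: 1s, 2s, 4s, 8s... plus randomness
function backoffDelay(attempt, baseMs = 1000) {
  const exponential = baseMs * 2 ** attempt
  return exponential + Math.random() * baseMs // jitter spreads the herd
}

// Retry budget: give up after a fixed number of attempts
async function retryWithBackoff(fn, { budget = 3, baseMs = 1000 } = {}) {
  let lastError
  for (let attempt = 0; attempt < budget; attempt++) {
    try {
      return await fn()
    } catch (err) {
      lastError = err
      if (attempt < budget - 1) {
        await new Promise(r => setTimeout(r, backoffDelay(attempt, baseMs)))
      }
    }
  }
  throw lastError // budget exhausted: probably not transient
}

// Circuit breaker: after `threshold` failures, fail fast until a cooldown passes
class CircuitBreaker {
  constructor({ threshold = 5, cooldownMs = 30000 } = {}) {
    this.failures = 0
    this.threshold = threshold
    this.cooldownMs = cooldownMs
    this.openedAt = null
  }
  async call(fn) {
    if (this.openedAt !== null) {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error('circuit open') // fail immediately, no timeout wait
      }
      this.openedAt = null // half-open: let one request probe for recovery
    }
    try {
      const result = await fn()
      this.failures = 0
      return result
    } catch (err) {
      if (++this.failures >= this.threshold) this.openedAt = Date.now()
      throw err
    }
  }
}
```

The breaker resets its failure count on any success, so only consecutive failures trip it, and the half-open probe means a recovered service is picked up automatically.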

Pattern 5: Dead Letter Queues

When a message can't be processed after all retries, don't drop it. Put it in a dead letter queue. This gives you a record of what failed, the ability to investigate and fix the issue, and the option to reprocess the messages once the problem is resolved.
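A minimal in-memory sketch of the pattern. A real system would use the DLQ support built into its broker (SQS, RabbitMQ, Kafka); the names here (processWithDlq, redrive) are illustrative:

```javascript
const deadLetters = []

// Process a message; after the final failed attempt, record it instead of dropping it
async function processWithDlq(message, handler, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await handler(message)
    } catch (err) {
      if (attempt === maxAttempts) {
        // Keep the message, the error, and when it failed
        deadLetters.push({
          message,
          error: err.message,
          attempts: attempt,
          failedAt: new Date().toISOString(),
        })
        return null
      }
    }
  }
}

// After the underlying problem is fixed, reprocess everything that failed
async function redrive(handler) {
  const pending = deadLetters.splice(0)
  for (const entry of pending) await processWithDlq(entry.message, handler)
}
```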
We've saved clients from data loss multiple times with dead letter queues. The alternative - dropping failed messages - means permanent data loss that you might not discover for weeks.

Pattern 6: Graceful Degradation

When a non-critical dependency fails, the system should continue operating with reduced functionality rather than failing entirely.
  • Recommendation engine is down? Show the default list.
  • Analytics service isn't responding? Queue the events and send them later.
  • Image processing is slow? Show a placeholder and process asynchronously.
The key word is "non-critical." If your payment processor is down, you can't degrade gracefully. You need to tell the user clearly. But most dependencies in an enterprise system are not payment processors. Most are services where a fallback is acceptable.
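In code, graceful degradation is often just a fallback wrapper around the non-critical call. `withFallback` here is an assumed helper name, not a library API:

```javascript
// Try the primary dependency; on any failure, degrade to the fallback value
async function withFallback(primary, fallback) {
  try {
    return await primary()
  } catch {
    // Non-critical dependency failed: reduced functionality beats total failure
    return fallback
  }
}

// Recommendation engine down? Show the default list.
async function getRecommendations(userId, fetchRecs) {
  const DEFAULT_LIST = ['bestseller-1', 'bestseller-2']
  return withFallback(() => fetchRecs(userId), DEFAULT_LIST)
}
```

The important design choice is that the fallback is decided per call site: the wrapper goes around the recommendation fetch, never around the payment call.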

The Anti-Patterns

Catching and ignoring. The catch (e) {} pattern. This is the silent corruption generator. If you catch an exception, do something with it. Log it, rethrow it, return an error response. Never swallow it.
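The difference in practice, with `updateInventory` and `logger` as illustrative stand-ins:

```javascript
// Stand-ins for a real dependency and a structured logger
const logged = []
const logger = { error: (msg, ctx) => logged.push({ msg, ...ctx }) }
function updateInventory() { throw new Error('db connection lost') }

// Bad: the silent corruption generator - the error vanishes, caller sees success
function badHandler() {
  try { updateInventory() } catch (e) {}
}

// Good: log with context, then rethrow so the failure stays visible
function goodHandler(orderId) {
  try {
    updateInventory()
  } catch (e) {
    logger.error('inventory update failed', { orderId, error: e.message })
    throw e
  }
}
```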
Generic error messages. "An error occurred. Please try again." This tells the user nothing and tells the support team less. Be specific. If you can't be specific to the user for security reasons, be specific in the logs.
Retry without backoff. Immediate retries against a struggling service make the problem worse. You're adding load to a system that's already overloaded.
Alerting on everything. If every error generates an alert, the alert channel becomes noise and gets ignored. Alert on patterns, thresholds, and severity - not on individual occurrences of expected errors.

Making It Practical

If you're inheriting an enterprise codebase with poor error handling - and you probably are - start with the highest-value improvements:
  1. Find and eliminate silent catch blocks. Search for empty catch handlers. Each one is a potential data corruption source.
  2. Add correlation IDs to all cross-service communication. This makes debugging distributed errors manageable.
  3. Classify your errors and handle each category appropriately. This reduces alert noise and improves response time.
  4. Implement dead letter queues for asynchronous processing. This prevents data loss from transient failures.
You don't need to fix everything at once. But you do need to stop adding new silent failures.