
Database Performance at Enterprise Scale

When your database starts slowing down at 10M rows. The practical guide to indexing, query optimisation, and when to rethink the architecture.
15 February 2022·8 min read
John Li
Chief Technology Officer
The call usually comes on a Monday morning. "The dashboard is slow." You check. The query that used to take 200ms now takes 8 seconds. Nothing changed in the code. What changed is that the table grew from 2 million rows to 12 million over the weekend's data import, and the query plan that was fine at 2 million is catastrophic at 12 million.

The Scale Thresholds

In my experience, database performance problems arrive at predictable thresholds. Not exact numbers, but consistent enough to plan around.
Under 1 million rows. Almost anything works. Bad queries are fast enough that nobody notices. This is where most prototypes live, and it's why prototypes feel fast.
1-10 million rows. Missing indexes start to hurt. Queries that scan the full table become noticeable. This is where most applications hit their first real performance problem.
10-100 million rows. Query optimisation becomes a discipline. You need to understand execution plans, index strategy, and data access patterns. Some queries need to be restructured entirely.
Over 100 million rows. Architecture decisions matter. Partitioning, read replicas, caching layers, and potentially rethinking whether a relational database is the right tool for this particular dataset.
The database doesn't care about your deadline. Learn to read query plans and you'll solve 80% of performance problems before they reach production.
John Li
Chief Technology Officer

Indexing: The 80% Solution

Most database performance problems at enterprise scale are indexing problems. Not missing indexes - too many indexes, wrong indexes, or indexes that don't match the actual query patterns.
Understand your access patterns first. Before adding an index, understand how the data is actually queried. Not how you think it's queried. Run query analytics. Look at the slow query log. Find the queries that run most frequently and the queries that take the longest.
Column order in composite indexes matters. An index on (status, created_at) is useful for queries that filter by status, with or without a created_at condition. It's useless for queries that filter only by created_at. Getting the column order wrong is the most common indexing mistake I see.
Covering indexes reduce I/O. If your index contains all the columns the query needs, the database never reads the table itself. For frequently-run analytical queries, this can reduce response time by an order of magnitude.
Too many indexes slow writes. Every index needs to be updated on every insert and update. A table with fifteen indexes will have slow writes. If your workload is write-heavy, be selective about indexing.
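The column-order rule is easy to verify yourself. The sketch below uses SQLite's EXPLAIN QUERY PLAN as a stand-in for PostgreSQL's EXPLAIN (the table, column, and index names are illustrative): a composite index on (status, created_at) is picked up when the query filters on the leading column, but a filter on created_at alone falls back to a full scan.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, created_at TEXT)")
con.execute("CREATE INDEX idx_status_created ON orders (status, created_at)")

def plan(sql):
    # EXPLAIN QUERY PLAN is SQLite's rough equivalent of Postgres EXPLAIN.
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# Filter on the leading column: the planner uses the composite index.
leading = plan("SELECT id FROM orders WHERE status = 'open'")

# Filter only on the second column: the index can't help, so it's a scan.
second_only = plan("SELECT id FROM orders WHERE created_at > '2022-01-01'")

print(leading)      # mentions idx_status_created
print(second_only)  # a SCAN of the table
```

The same experiment against a real Postgres table (with `EXPLAIN ANALYZE`) shows the identical pattern, just with different plan vocabulary.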
10x — typical query improvement from adding the correct composite index to a table with 10M+ rows. Source: RIVER internal benchmarks, PostgreSQL 14.

Query Optimisation Patterns

Beyond indexing, there are query patterns that consistently cause problems at scale.
SELECT *. Never in production. Fetch only the columns you need. At 10 million rows, the difference between selecting 3 columns and selecting 30 columns is significant, especially when the unused columns contain large text or JSON fields.
N+1 queries. The classic ORM trap. Your code fetches 100 records, then makes a separate query for each record's related data. That's 101 queries where one join would do. At scale, N+1 is the most common cause of "the page takes 30 seconds to load."
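The N+1 trap is easiest to see side by side. This minimal sketch (SQLite in place of Postgres; the users/orders schema is invented for illustration) computes the same per-user totals twice: once with a query per user, once with a single JOIN plus GROUP BY.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
INSERT INTO users  VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 20.0), (3, 2, 5.0);
""")

def totals_n_plus_one():
    # 1 query for the users, then 1 more query PER user: N+1 round trips.
    out = {}
    for uid, name in con.execute("SELECT id, name FROM users"):
        row = con.execute("SELECT SUM(total) FROM orders WHERE user_id = ?", (uid,)).fetchone()
        out[name] = row[0]
    return out

def totals_join():
    # The same result in exactly one query.
    rows = con.execute("""
        SELECT u.name, SUM(o.total)
        FROM users u JOIN orders o ON o.user_id = u.id
        GROUP BY u.id
    """)
    return dict(rows)
```

At 2 users the difference is invisible; at 100,000 users the first version issues 100,001 queries, each paying a network round trip. Most ORMs have an eager-loading option (e.g. a join or batched fetch) that collapses this back to one or two queries.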
Unbounded queries. A query without a LIMIT clause on a table with 50 million rows will try to return 50 million rows. Always paginate. Always set upper bounds. This sounds obvious and yet I find it in production codebases regularly.
Correlated subqueries. A subquery that references the outer query executes once per row of the outer query. At 10,000 rows that's slow. At 10 million rows it never finishes. Rewrite as a JOIN or a CTE.
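A correlated subquery and its JOIN rewrite are logically equivalent, which is what makes the rewrite safe. The sketch below (SQLite standing in for Postgres, invented schema) fetches each user's largest order both ways and checks the results match; the correlated form re-runs the inner SELECT once per user row, the JOIN form makes one pass over each table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
INSERT INTO users  VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 20.0), (3, 2, 5.0);
""")

# Correlated: the inner SELECT executes once per row of the outer query.
correlated = """
SELECT name, (SELECT MAX(total) FROM orders WHERE user_id = users.id)
FROM users
"""

# Rewritten as a JOIN + GROUP BY: each table is read once.
joined = """
SELECT u.name, MAX(o.total)
FROM users u JOIN orders o ON o.user_id = u.id
GROUP BY u.id
"""

same = dict(con.execute(correlated)) == dict(con.execute(joined))
```

Note the rewrite changes one behaviour: users with no orders appear with NULL in the correlated form but vanish from the inner JOIN; use a LEFT JOIN if you need them.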

When to Use EXPLAIN

Every database has an EXPLAIN command that shows you the query execution plan. Learn to read it. It will tell you:
  • Whether the database is using your indexes or ignoring them
  • Whether it's doing a sequential scan (reading every row) or an index scan (reading only relevant rows)
  • How many rows it expects to process at each step
  • Where the time is actually being spent
The execution plan is not a suggestion. It's what the database is actually doing. When a query is slow, the plan tells you why. Not approximately - exactly.
I make every developer on our team learn to read PostgreSQL EXPLAIN output in their first month. It's the single most valuable database skill.
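The sequential-scan vs index-scan distinction in the bullet list above is the first thing to look for in any plan. A small demonstration (again using SQLite's EXPLAIN QUERY PLAN as a stand-in; Postgres EXPLAIN output is wordier but shows the same shift from "Seq Scan" to "Index Scan"):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT, ts TEXT)")

def plan(sql):
    # Returns the plan's text lines, e.g. "SCAN events" or "SEARCH events USING INDEX ...".
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM events WHERE kind = 'click'"

before = plan(query)  # no index yet: the plan is a full sequential scan
con.execute("CREATE INDEX idx_kind ON events (kind)")
after = plan(query)   # same query, now a targeted index search

print(before)
print(after)
```

Nothing about the query changed between the two plans; only the available indexes did. That's the mindset shift EXPLAIN teaches: the plan is a function of the schema and statistics, not just the SQL text.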

Connection Pooling

At enterprise scale, database connections become a bottleneck before most teams expect it. Each connection consumes memory on the database server. PostgreSQL's default of 100 connections sounds like a lot until you have 20 application servers each wanting 10 connections.
Use a connection pooler. PgBouncer for PostgreSQL. The pooler manages a smaller number of actual database connections and multiplexes application requests across them. It's simple to set up and it prevents the "too many connections" error that brings down production systems at the worst possible moment.
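In production you would reach for PgBouncer rather than rolling your own, but the core idea fits in a few lines. A minimal sketch (SQLite connections for demonstration; a real pooler also handles health checks, timeouts, and transaction-level multiplexing):

```python
import queue
import sqlite3
from contextlib import contextmanager

class ConnectionPool:
    """Toy pooler: hand out a fixed set of connections and block
    when they're all in use, instead of opening unbounded connections."""

    def __init__(self, factory, size=5):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())

    @contextmanager
    def connection(self, timeout=5.0):
        conn = self._pool.get(timeout=timeout)  # blocks if the pool is exhausted
        try:
            yield conn
        finally:
            self._pool.put(conn)  # return to the pool rather than closing

pool = ConnectionPool(lambda: sqlite3.connect(":memory:"), size=2)
with pool.connection() as con:
    result = con.execute("SELECT 1").fetchone()
```

The key property is the fixed upper bound: no matter how many application threads ask for a connection, the database only ever sees `size` of them, which is exactly what keeps you under PostgreSQL's max_connections.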

When Postgres Isn't Enough

I'm a strong advocate for PostgreSQL. It handles most enterprise workloads well. But there are legitimate cases where you need something else or something additional.
Time-series data at volume. If you're writing millions of data points per day with time-based queries, a purpose-built time-series database like TimescaleDB (which runs on top of Postgres) is worth evaluating.
Full-text search. PostgreSQL's built-in full-text search is decent for simple cases. For complex search requirements - faceted search, relevance tuning, fuzzy matching across millions of documents - Elasticsearch is still the practical choice.
Caching layer. Redis in front of PostgreSQL for frequently-read, rarely-changed data. This is an architectural addition, not a replacement. Cache the expensive queries. Invalidate on write. It's well-understood and it works.
Read replicas. When your read volume outgrows what a single server can handle, route read queries to replicas. Most cloud-managed PostgreSQL services make this straightforward.

The Practical Checklist

If you're hitting performance problems at scale, work through this in order:
  1. Read the slow query log. Find the actual problem queries.
  2. Run EXPLAIN on each problem query. Understand the execution plan.
  3. Add or adjust indexes to match actual access patterns.
  4. Rewrite problematic query patterns (N+1, unbounded, correlated subqueries).
  5. Implement connection pooling if you haven't already.
  6. Consider caching for expensive, frequently-run queries.
  7. Evaluate read replicas if read volume is the bottleneck.
  8. Only then consider whether the architecture needs to change.
Most teams jump to step 8 first. Don't. The first seven steps solve 95% of performance problems and they're cheaper, faster, and less risky than an architectural change.