I've built and maintained over 30 custom web scrapers since 2015. Most of them are still running. That's unusual. The typical scraper has a lifespan measured in weeks before something breaks and nobody fixes it. The difference isn't clever code. It's a set of patterns I've learned through years of things breaking at inconvenient times. The same maintainability principles that keep enterprise software alive apply here.
What You'll Learn
- The four main reasons scrapers break: DOM changes, rate limiting, authentication changes, and data format shifts
- Selector strategies that survive site redesigns
- How to build monitoring that tells you something broke before your client does
- Data validation patterns that catch silent failures
Why Scrapers Break
DOM Changes
This is the obvious one. A site changes its HTML structure, your selectors stop matching, and the scraper returns empty data or crashes. Happens constantly. Sites redesign, A/B test, update their CMS, or just push a deploy that moves a div.
The mistake most people make is writing selectors that are too specific.
div.container > div:nth-child(3) > ul > li.item-class > span.price will break the moment anyone touches that page. A selector like [data-price] or a broader CSS class match survives far more changes. I use a hierarchy: data attributes first, semantic classes second, structural selectors as a last resort. When a site has data attributes (and many do, because their own JavaScript needs them), those tend to be stable. They're part of the site's functionality, not its styling.
Rate Limiting
Scrape too fast and you'll get blocked. This seems obvious, but the failure mode is subtle. Many sites don't return a 429 status code. They return a 200 with a CAPTCHA page, or they silently serve cached/stale data, or they start returning partial results. Your scraper thinks it's working. It's not.
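One way to catch these soft blocks is to check the response body, not just the status code. A minimal sketch; the marker strings here are invented examples and need tuning per target site:

```python
# Strings that suggest a "successful" response is actually a block page.
# These markers are illustrative assumptions, not a universal list.
BLOCK_MARKERS = ("captcha", "unusual traffic", "access denied")

def looks_blocked(status_code: int, body: str) -> bool:
    """True if the response should be treated as a block, even on HTTP 200."""
    if status_code == 429:
        return True
    lower = body.lower()
    return any(marker in lower for marker in BLOCK_MARKERS)
```

The point is that the check runs on every response, so a 200 with a CAPTCHA page gets flagged instead of parsed.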
I add random delays between requests - not fixed intervals. A fixed 2-second delay between requests looks like a bot. Random delays between 1.5 and 4 seconds look more like a person browsing. For larger scraping jobs, I spread the work across hours rather than running everything in a burst.
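The randomized delay is a couple of lines, sketched here with the standard library:

```python
import random
import time

def polite_sleep(min_s: float = 1.5, max_s: float = 4.0) -> float:
    """Sleep for a random interval so request timing doesn't look machine-generated."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

Call it between every request; returning the actual delay makes it easy to log.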
Authentication Changes
Sites change their login flows. A scraper that authenticated via a simple POST request yesterday might need to handle a CSRF token today, a CAPTCHA tomorrow, and OAuth next month. This breaks scrapers silently because the scraper successfully "logs in" but gets redirected to an error page.
My approach: treat authentication as a separate, testable module. Check for authentication success explicitly after every login attempt. Don't assume a 200 response means you're logged in. Look for a known element or cookie that only appears when authentication actually worked.
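The explicit post-login check can be a small pure function. A sketch, where the cookie name and the logged-in-only page element are assumptions for illustration:

```python
def login_succeeded(cookie_names, body: str) -> bool:
    """Explicit post-login verification: require both the auth cookie and a
    page element that only appears for logged-in users.
    'session_id' and 'account-menu' are hypothetical names for this site."""
    has_cookie = "session_id" in cookie_names
    has_marker = 'id="account-menu"' in body
    return has_cookie and has_marker
```

Run this after every login attempt and treat a 200 response without both markers as a failed login, not a success.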
Data Format Shifts
A price field that used to contain "NZ$49.99" now contains "$49.99 NZD". A date that was "12 Jun 2023" is now "2023-06-12". An address that was one field is now split across three. These changes don't break the scraper. They corrupt the data. Data quality is the real risk, not downtime.
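Normalizing at parse time absorbs some of these shifts. A sketch for the price case, assuming you want a float or an explicit flag rather than a corrupted value:

```python
import re
from typing import Optional

def parse_price(raw: str) -> Optional[float]:
    """Normalize strings like 'NZ$49.99' or '$49.99 NZD' to a float.
    Returns None when no numeric value is found, so the record can be
    flagged as incomplete instead of stored with garbage."""
    match = re.search(r"\d+(?:\.\d{1,2})?", raw.replace(",", ""))
    return float(match.group()) if match else None
```

The same idea applies to dates and addresses: parse to a canonical form, and return a sentinel you can flag when parsing fails.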
Patterns That Survive
Graceful Degradation
A scraper should never just crash. When a selector doesn't match, log the failure, skip the record, and continue. When a page returns unexpected content, save the raw HTML for debugging and move on. When a field is missing, mark the record as incomplete rather than discarding it entirely.
I structure my scrapers so each record is processed independently. If record 47 out of 500 has a problem, I get 499 good records and one flagged error. Not a stack trace and zero records.
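That per-record isolation looks roughly like this, with the parser passed in as any per-item callable:

```python
def scrape_all(items, parse_record):
    """Process each record independently: one bad record becomes one
    flagged error, never a crashed batch."""
    results, errors = [], []
    for i, item in enumerate(items):
        try:
            results.append(parse_record(item))
        except Exception as exc:  # record the failure and keep going
            errors.append((i, repr(exc)))
    return results, errors
```

The errors list is what feeds the monitoring described below: a sudden spike in its length means the site changed.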
Data Validation
Every scraped value gets validated before storage. Prices should be positive numbers. Dates should parse to valid dates. URLs should be well-formed. Phone numbers should match expected patterns. It's the same rigour you'd apply to any integration.
This catches two things: actual errors in the scraper logic, and sites that have changed their data format. When validation failures spike, something has changed. Without validation, you'd only find out when someone looks at the data and notices the prices are all wrong.
I validate scraped data the same way I'd validate user input. Anything coming from outside your system is untrusted. A website's HTML is no different from a form submission.
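A minimal validator returning a list of problems per record, with field names and rules as illustrative assumptions:

```python
import datetime

def validate_record(record: dict) -> list:
    """Return a list of validation failures; an empty list means clean.
    Field names ('price', 'date', 'url') are illustrative."""
    problems = []
    price = record.get("price")
    if not isinstance(price, (int, float)) or price <= 0:
        problems.append("price must be a positive number")
    try:
        datetime.date.fromisoformat(record.get("date", ""))
    except ValueError:
        problems.append("date must be a valid ISO date")
    if not record.get("url", "").startswith(("http://", "https://")):
        problems.append("url must be well-formed")
    return problems
```

Counting these failures per run is what makes format shifts visible as a spike rather than a quiet corruption.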
Hassan Nawaz
Senior Developer
Monitoring and Alerting
Every scraper I maintain reports three things: how many records it processed, how many it skipped, and how long it took. I track these over time. A scraper that normally processes 200 records and suddenly processes 15 is broken, even if it didn't throw an error.
I set thresholds. If the record count drops below 50% of the rolling average, I get an alert. If the error rate goes above 10%, I get an alert. If the runtime doubles, I get an alert. Most of the time these alerts fire before anyone notices a problem with the actual data.
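Those thresholds reduce to a few comparisons against the rolling averages. A sketch using the same cutoffs described above:

```python
def check_run(records: int, errors: int, runtime_s: float,
              avg_records: float, avg_runtime_s: float) -> list:
    """Compare one run against rolling averages. Thresholds: 50% record
    drop, 10% error rate, doubled runtime."""
    alerts = []
    if records < 0.5 * avg_records:
        alerts.append("record count below 50% of rolling average")
    if records and errors / records > 0.10:
        alerts.append("error rate above 10%")
    if runtime_s > 2 * avg_runtime_s:
        alerts.append("runtime more than doubled")
    return alerts
```

Anything in the returned list goes to whatever alerting channel you already use.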
Selector Resilience
For critical fields, I use multiple selector strategies. Try the data attribute first. If that fails, try the semantic class. If that fails, try a structural pattern. If all three fail, flag the record.
This sounds like over-engineering until you've had a site change its class names but keep its data attributes. The primary selector breaks, the fallback catches it, and the scraper keeps running. I find out about the change from the logs, not from a panicked email.
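The fallback chain generalizes to a small helper that takes selector strategies in priority order. Each strategy here is any callable returning a value or None, so it works with whatever parsing library you use:

```python
def extract_with_fallback(selectors, page):
    """Try each (name, selector) strategy in order; return (value, name),
    or (None, None) so the caller can flag the record."""
    for name, selector in selectors:
        value = selector(page)
        if value is not None:
            return value, name
    return None, None
```

Logging which strategy matched is what tells you a primary selector has broken while the scraper keeps running.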
Separate Fetching from Parsing
I store the raw HTML before parsing it. If the parser breaks, I can fix the parsing logic and re-run it against the stored HTML without re-fetching everything. This saves time, avoids hitting rate limits during debugging, and gives me a historical record of what the site looked like when the scrape ran.
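Storing the raw HTML can be as simple as hashing the URL and timestamping the file, a sketch with the standard library:

```python
import hashlib
import pathlib
import time

def save_raw(html: str, url: str, root: str = "raw_html") -> pathlib.Path:
    """Persist raw HTML before parsing so the parser can be re-run offline.
    Filenames combine a URL hash with a timestamp to keep a history."""
    digest = hashlib.sha256(url.encode()).hexdigest()[:12]
    out_dir = pathlib.Path(root)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{digest}-{int(time.time())}.html"
    path.write_text(html, encoding="utf-8")
    return path
```

Fetch, save, then parse from disk; the parser never needs the network again for that snapshot.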
The Maintenance Reality
Building a scraper takes a day or two. Maintaining it takes years. The scrapers I built in 2015 and 2016 have needed dozens of small fixes each. A selector change here, a new authentication flow there, a data format adjustment. None of these fixes are difficult. But they only happen if you have monitoring that tells you something broke.
The scrapers that die are the ones nobody's watching. They break silently, output bad data, and eventually someone notices the data hasn't been updated in three months. By that point the scraper needs a full rewrite because the target site has changed too much.
Build the monitoring first. The scraping logic is the easy part. Good testing habits apply to scrapers just as much as application code.
If you've got a scraping or data integration challenge that needs to work long-term, let's talk.
