Designing a distributed web crawler

Crawl billions of web pages politely and without duplication — the URL frontier, dedup at scale, politeness, and the traps that snare naive crawlers.

The problem

Design a distributed web crawler (like Googlebot): start from seed URLs, fetch pages, extract links, and recursively crawl the web — billions of pages — politely (don’t hammer sites), without re-crawling the same content endlessly, and at huge scale. It feeds the search engine (next lesson).

Step 1 — Requirements

Functional: fetch pages from seed URLs; extract and enqueue new links; store page content; respect robots.txt; re-crawl pages to stay fresh; avoid duplicates.

Non-functional: massive scale (billions of pages), politeness (rate-limit per domain — don’t DDoS a site), robustness (handle traps, malformed pages, slow servers), freshness (re-crawl changing pages), and extensibility.

Step 2 — The crawl loop and the URL frontier

The heart is the URL frontier — the queue of URLs to crawl, which must enforce priority and politeness:

seed URLs → frontier → fetcher → parser → extract links → filter/dedup → frontier
                                    └─ store page content

The frontier isn’t a plain FIFO. It’s a two-level structure: priority queues (important/fresh pages first) feeding per-host queues that enforce a delay between requests to the same domain (politeness). A URL is only fetched when its host’s politeness window allows.

Step 3 — Dedup at scale (don’t crawl the same thing twice)

Two kinds of duplication to suppress:

Seen URLs — billions of URLs; checking “have we queued this?” against a full set is too much memory → use a Bloom filter (Chapter 2): a no is certain (enqueue), a maybe is confirmed against a store. Cheap membership at scale.
Duplicate content — different URLs, same content (mirrors, session ids). Hash the page content (or use SimHash for near-duplicates) and skip already-seen content.

Step 4 — Politeness and robots.txt

Respect robots.txt — fetch and cache each site’s crawl rules; obey disallowed paths and crawl-delay.
Per-domain rate limiting — never more than one (or a few) concurrent requests per host, with a delay; the frontier’s per-host queues enforce this. This is both courteous and avoids getting blocked.

Step 5 — Architecture and scale

frontier (sharded by host) → fetcher fleet → DNS resolver (cached)
       → parser → link extractor → URL filter/dedup (Bloom) → back to frontier
       → content store (object store) + metadata (crawled_at, hash)

Fetchers are a horizontally-scaled fleet; shard the frontier by host so politeness for a domain is enforced on one worker (no coordination needed) and load spreads.
DNS is a hidden bottleneck (a lookup per host) → run a caching DNS resolver.
Store raw pages in an object store; metadata (URL, hash, last-crawled) in a DB.

Step 6 — Traps and robustness

Name the hazards (this is where the problem gets interesting):

Crawler traps — infinite URL spaces (calendars, faceted filters generating endless links). Mitigate with depth limits, URL-pattern detection, and per-host page caps.
Slow/huge pages — timeouts and size limits.
Malformed HTML / non-HTML — robust parsing; skip binaries unless wanted.
Freshness — schedule re-crawls by change rate (news re-crawled often, static pages rarely).

Trade-offs to raise

Politeness vs throughput — per-host limits cap speed but are mandatory; scale by crawling many hosts in parallel, not one host fast.
Bloom filter (tiny memory, rare false positives) vs exact set (huge memory).
Freshness vs cost — re-crawl frequency trades coverage freshness against load.
BFS vs priority — prioritize important/changing pages rather than blind BFS.

The interview cue

“A URL frontier with priority + per-host politeness queues, a fetcher fleet sharded by host, Bloom-filter URL dedup + content hashing for near-dup detection, robots.txt compliance, cached DNS, pages in an object store, and re-crawl scheduling for freshness — plus trap defenses (depth/pattern limits).” Frontier + politeness + dedup is the core; implementation next.