Designing a distributed web crawler
Crawl billions of web pages politely and without duplication — the URL frontier, dedup at scale, politeness, and the traps that snare naive crawlers.
The problem
Design a distributed web crawler (like Googlebot): start from seed URLs, fetch pages, extract links, and recursively crawl the web — billions of pages — politely (don’t hammer sites), without re-crawling the same content endlessly, and at huge scale. It feeds the search engine (next lesson).
Step 1 — Requirements
Functional: fetch pages from seed URLs; extract and enqueue new links; store
page content; respect robots.txt; re-crawl pages to stay fresh; avoid duplicates.
Non-functional: massive scale (billions of pages), politeness (rate-limit per domain — don’t DDoS a site), robustness (handle traps, malformed pages, slow servers), freshness (re-crawl changing pages), and extensibility.
Step 2 — The crawl loop and the URL frontier
The heart is the URL frontier — the queue of URLs to crawl, which must enforce priority and politeness:
seed URLs → frontier → fetcher → parser → extract links → filter/dedup → frontier
└─ store page content
- The frontier isn’t a plain FIFO. It’s a two-level structure: priority queues (important/fresh pages first) feeding per-host queues that enforce a delay between requests to the same domain (politeness). A URL is only fetched when its host’s politeness window allows.
Step 3 — Dedup at scale (don’t crawl the same thing twice)
Two kinds of duplication to suppress:
- Seen URLs — billions of URLs; checking “have we queued this?” against a full set is too much memory → use a Bloom filter (Chapter 2): a no is certain (enqueue), a maybe is confirmed against a store. Cheap membership at scale.
- Duplicate content — different URLs, same content (mirrors, session ids). Hash the page content (or use SimHash for near-duplicates) and skip already-seen content.
Step 4 — Politeness and robots.txt
- Respect
robots.txt— fetch and cache each site’s crawl rules; obey disallowed paths and crawl-delay. - Per-domain rate limiting — never more than one (or a few) concurrent requests per host, with a delay; the frontier’s per-host queues enforce this. This is both courteous and avoids getting blocked.
Step 5 — Architecture and scale
frontier (sharded by host) → fetcher fleet → DNS resolver (cached)
→ parser → link extractor → URL filter/dedup (Bloom) → back to frontier
→ content store (object store) + metadata (crawled_at, hash)
- Fetchers are a horizontally-scaled fleet; shard the frontier by host so politeness for a domain is enforced on one worker (no coordination needed) and load spreads.
- DNS is a hidden bottleneck (a lookup per host) → run a caching DNS resolver.
- Store raw pages in an object store; metadata (URL, hash, last-crawled) in a DB.
Step 6 — Traps and robustness
Name the hazards (this is where the problem gets interesting):
- Crawler traps — infinite URL spaces (calendars, faceted filters generating endless links). Mitigate with depth limits, URL-pattern detection, and per-host page caps.
- Slow/huge pages — timeouts and size limits.
- Malformed HTML / non-HTML — robust parsing; skip binaries unless wanted.
- Freshness — schedule re-crawls by change rate (news re-crawled often, static pages rarely).
Trade-offs to raise
- Politeness vs throughput — per-host limits cap speed but are mandatory; scale by crawling many hosts in parallel, not one host fast.
- Bloom filter (tiny memory, rare false positives) vs exact set (huge memory).
- Freshness vs cost — re-crawl frequency trades coverage freshness against load.
- BFS vs priority — prioritize important/changing pages rather than blind BFS.
The interview cue
“A URL frontier with priority + per-host politeness queues, a fetcher fleet sharded by host, Bloom-filter URL dedup + content hashing for near-dup detection, robots.txt compliance, cached DNS, pages in an object store, and re-crawl scheduling for freshness — plus trap defenses (depth/pattern limits).” Frontier + politeness + dedup is the core; implementation next.