Building a distributed web crawler
Implement the two-level politeness frontier, Bloom-filter URL dedup, the fetch-parse-extract worker, and trap defenses.
The fetch–parse–extract worker
The core loop: pull a URL the frontier says is ready, fetch politely, store, extract links, dedup, enqueue:
def crawl_worker(frontier, seen, content_seen, store):
while True:
url = frontier.next_ready() # respects per-host politeness
if not allowed_by_robots(url): # cached robots.txt
continue
page = fetch(url, timeout=10, max_bytes=10*MB)
if not page: continue
h = simhash(page.text) # near-duplicate content detection
if content_seen.contains(h): # skip mirror/dup content
continue
content_seen.add(h)
store.put(url, page) # raw page → object store
for link in extract_links(page):
link = normalize(link) # canonicalize (strip fragments, sort params)
if not seen.contains(link): # Bloom filter check
seen.add(link)
frontier.add(link, priority=score(link))
URL dedup with a Bloom filter
Billions of URLs won’t fit in a set, so a Bloom filter answers “seen this?” cheaply; a maybe is confirmed against a backing store:
class SeenUrls:
def contains(self, url):
if not self.bloom.might_contain(url): # definitely new → enqueue
return False
return self.store.exists(url) # maybe → confirm (rare)
def add(self, url):
self.bloom.add(url); self.store.put(url)
A no is certain (no false negatives), so you never miss a new URL; a maybe costs one confirming lookup. Normalize URLs first (canonical form) so trivial variants don’t count as distinct.
The politeness frontier (two levels)
The frontier guarantees one in-flight request per host with a delay, while staying busy across many hosts:
class Frontier:
# priority queues feed per-host queues; a host is "ready" only after its delay
def next_ready(self):
host = self.ready_hosts.pop() # hosts past their politeness window
url = self.host_queues[host].pop()
self.schedule_host(host, after=self.delay(host)) # re-arm after crawl-delay
return url
def add(self, url, priority):
self.host_queues[host_of(url)].push(url, priority)
Shard the frontier by host across crawler nodes so each host’s politeness is enforced on one node with no cross-node coordination, and load spreads across hosts.
DNS caching
A DNS lookup per host is a real bottleneck at billions of fetches — cache resolutions (with TTLs) in a local/shared resolver so most fetches skip the lookup.
Trap and robustness defenses
def score(url):
if depth(url) > MAX_DEPTH: return DROP # cap infinite link spaces
if host_page_count[host_of(url)] > HOST_CAP: return DROP # per-host page cap
if looks_like_trap(url): return DROP # calendar/faceted-URL patterns
return priority_by_importance_and_freshness(url)
Plus fetch timeouts, size limits, content-type checks, and robust HTML parsing that tolerates malformed markup.
Re-crawling for freshness
Track last_crawled and an estimated change rate per page; a scheduler re-enqueues
pages by priority (news hourly, static pages monthly). Reuse the job-scheduler pattern.
Scaling and failure handling
- Throughput → add crawler nodes; the frontier shards by host, so politeness still holds and many hosts crawl in parallel.
- Worker crash → URLs in flight time out and return to the frontier (at-least-once; content dedup makes re-fetch harmless).
- A slow/huge site → timeouts + per-host caps prevent it from stalling the fleet.
- Politeness is non-negotiable — getting your crawler IP-banned is the real failure mode.
The takeaway
Concrete signals: a two-level politeness frontier sharded by host, Bloom-filter URL dedup + SimHash content dedup, robots.txt + cached DNS, and trap defenses (depth/host caps). It directly composes Chapter 2’s Bloom filters, queues, and sharding — and its output is the corpus the search engine indexes next.