Building a distributed web crawler

Implement the two-level politeness frontier, Bloom-filter URL dedup, the fetch-parse-extract worker, and trap defenses.

The fetch–parse–extract worker

The core loop: pull a URL the frontier says is ready, fetch politely, store, extract links, dedup, enqueue:

def crawl_worker(frontier, seen, content_seen, store):
    while True:
        url = frontier.next_ready()             # respects per-host politeness
        if not allowed_by_robots(url):          # cached robots.txt
            continue
        page = fetch(url, timeout=10, max_bytes=10*MB)
        if not page: continue
        h = simhash(page.text)                  # near-duplicate content detection
        if content_seen.contains(h):            # skip mirror/dup content
            continue
        content_seen.add(h)
        store.put(url, page)                    # raw page → object store
        for link in extract_links(page):
            link = normalize(link)              # canonicalize (strip fragments, sort params)
            if not seen.contains(link):         # Bloom filter check
                seen.add(link)
                frontier.add(link, priority=score(link))

URL dedup with a Bloom filter

Billions of URLs won’t fit in a set, so a Bloom filter answers “seen this?” cheaply; a maybe is confirmed against a backing store:

class SeenUrls:
    def contains(self, url):
        if not self.bloom.might_contain(url):   # definitely new → enqueue
            return False
        return self.store.exists(url)           # maybe → confirm (rare)
    def add(self, url):
        self.bloom.add(url); self.store.put(url)

A no is certain (no false negatives), so you never miss a new URL; a maybe costs one confirming lookup. Normalize URLs first (canonical form) so trivial variants don’t count as distinct.

The politeness frontier (two levels)

The frontier guarantees one in-flight request per host with a delay, while staying busy across many hosts:

class Frontier:
    # priority queues feed per-host queues; a host is "ready" only after its delay
    def next_ready(self):
        host = self.ready_hosts.pop()           # hosts past their politeness window
        url = self.host_queues[host].pop()
        self.schedule_host(host, after=self.delay(host))  # re-arm after crawl-delay
        return url
    def add(self, url, priority):
        self.host_queues[host_of(url)].push(url, priority)

Shard the frontier by host across crawler nodes so each host’s politeness is enforced on one node with no cross-node coordination, and load spreads across hosts.

DNS caching

A DNS lookup per host is a real bottleneck at billions of fetches — cache resolutions (with TTLs) in a local/shared resolver so most fetches skip the lookup.

Trap and robustness defenses

def score(url):
    if depth(url) > MAX_DEPTH: return DROP       # cap infinite link spaces
    if host_page_count[host_of(url)] > HOST_CAP: return DROP   # per-host page cap
    if looks_like_trap(url): return DROP         # calendar/faceted-URL patterns
    return priority_by_importance_and_freshness(url)

Plus fetch timeouts, size limits, content-type checks, and robust HTML parsing that tolerates malformed markup.

Re-crawling for freshness

Track last_crawled and an estimated change rate per page; a scheduler re-enqueues pages by priority (news hourly, static pages monthly). Reuse the job-scheduler pattern.

Scaling and failure handling

Throughput → add crawler nodes; the frontier shards by host, so politeness still holds and many hosts crawl in parallel.
Worker crash → URLs in flight time out and return to the frontier (at-least-once; content dedup makes re-fetch harmless).
A slow/huge site → timeouts + per-host caps prevent it from stalling the fleet.
Politeness is non-negotiable — getting your crawler IP-banned is the real failure mode.

The takeaway

Concrete signals: a two-level politeness frontier sharded by host, Bloom-filter URL dedup + SimHash content dedup, robots.txt + cached DNS, and trap defenses (depth/host caps). It directly composes Chapter 2’s Bloom filters, queues, and sharding — and its output is the corpus the search engine indexes next.