Skip to content
System design course
Ch.4 · Designing real systems·how to build it ·7 min read

Building a web search engine

Implement the MapReduce index build, iterative PageRank over the link graph, two-phase scatter-gather ranking, and result caching.


Building the index with MapReduce

The inverted index over billions of pages is built as a distributed batch job:

# MAP: emit (term, doc_id) for every term in every page
def map_page(doc_id, text):
    for term in tokenize(text):
        emit(term, Posting(doc_id, tf=count, fields=...))

# REDUCE: collect all docs for a term into one sorted, compressed posting list
def reduce_term(term, postings):
    emit(term, compress(sorted(postings, key=lambda p: p.doc_id)))

The output is document-sharded index segments, published atomically to the serving tier. Run incrementally so new crawls update the index without a full rebuild.

Authority is precomputed iteratively: a page’s rank is shared out to the pages it links, with damping:

def pagerank(graph, d=0.85, iters=30):
    N = len(graph)
    rank = {p: 1/N for p in graph}
    for _ in range(iters):                      # iterate to convergence
        new = {p: (1 - d) / N for p in graph}
        for p in graph:
            share = d * rank[p] / max(len(graph[p].out_links), 1)
            for q in graph[p].out_links:
                new[q] += share                 # p donates rank to its targets
        rank = new
    return rank                                 # query-independent authority score

This is itself a giant distributed computation (MapReduce/Pregel over the link graph). The result is a per-page authority score stored alongside the index.

Two-phase query serving

def search(query, k=10):
    terms = expand(spell_correct(query))        # synonyms, corrections
    if cached := result_cache.get(query):       # popular queries served from cache
        return cached
    # phase 1: cheap candidate retrieval on every shard (scatter-gather)
    candidates = parallel_map(
        lambda s: s.top_candidates(terms, n=100, score=bm25_plus_pagerank), shards)
    merged = merge_top_n(candidates, n=1000)
    # phase 2: expensive ML re-rank on the merged candidates only
    ranked = ml_rank(merged, query, context)[:k]
    results = attach_snippets(ranked)
    result_cache.set(query, results, ttl="5m")
    return results

The two phases keep it fast: cheap scoring narrows billions of docs to ~1000 candidates per shard-merge, then costly ML ranking touches only those.

Snippets

For each result, generate a snippet — the passage best matching the query terms — from the stored page (positions in the posting list locate the matching span).

Caching layers

  • Query result cache — popular queries (a heavy-tailed but very repetitive distribution) served from cache.
  • Posting-list cache — hot terms’ posting lists kept in memory.
  • Document/snippet cache — for frequently shown results.

Scaling and failure handling

  • Index document-sharded across thousands of nodes, each replicated for query throughput + availability; a shard replica failing just routes to another.
  • A shard slow/down → return partial results (top-k from the shards that answered) rather than block — graceful degradation, acceptable for search.
  • Freshness → a real-time index of breaking content is queried alongside the main index and merged.
  • Load spikes → result cache + replicas absorb; the offline pipeline is unaffected.

The takeaway

Concrete signals: MapReduce index build, iterative PageRank over the link graph for authority, two-phase scatter-gather (cheap candidates → ML re-rank), snippets from positions, and layered caching + partial-result degradation. It’s the crawler + inverted index + ranking composed into one system — the search group’s capstone, reusing MapReduce, sharding, and caching throughout.