Building a web search engine
Implement the MapReduce index build, iterative PageRank over the link graph, two-phase scatter-gather ranking, and result caching.
Building the index with MapReduce
The inverted index over billions of pages is built as a distributed batch job:
# MAP: emit (term, doc_id) for every term in every page
def map_page(doc_id, text):
for term in tokenize(text):
emit(term, Posting(doc_id, tf=count, fields=...))
# REDUCE: collect all docs for a term into one sorted, compressed posting list
def reduce_term(term, postings):
emit(term, compress(sorted(postings, key=lambda p: p.doc_id)))
The output is document-sharded index segments, published atomically to the serving tier. Run incrementally so new crawls update the index without a full rebuild.
PageRank over the link graph
Authority is precomputed iteratively: a page’s rank is shared out to the pages it links, with damping:
def pagerank(graph, d=0.85, iters=30):
N = len(graph)
rank = {p: 1/N for p in graph}
for _ in range(iters): # iterate to convergence
new = {p: (1 - d) / N for p in graph}
for p in graph:
share = d * rank[p] / max(len(graph[p].out_links), 1)
for q in graph[p].out_links:
new[q] += share # p donates rank to its targets
rank = new
return rank # query-independent authority score
This is itself a giant distributed computation (MapReduce/Pregel over the link graph). The result is a per-page authority score stored alongside the index.
Two-phase query serving
def search(query, k=10):
terms = expand(spell_correct(query)) # synonyms, corrections
if cached := result_cache.get(query): # popular queries served from cache
return cached
# phase 1: cheap candidate retrieval on every shard (scatter-gather)
candidates = parallel_map(
lambda s: s.top_candidates(terms, n=100, score=bm25_plus_pagerank), shards)
merged = merge_top_n(candidates, n=1000)
# phase 2: expensive ML re-rank on the merged candidates only
ranked = ml_rank(merged, query, context)[:k]
results = attach_snippets(ranked)
result_cache.set(query, results, ttl="5m")
return results
The two phases keep it fast: cheap scoring narrows billions of docs to ~1000 candidates per shard-merge, then costly ML ranking touches only those.
Snippets
For each result, generate a snippet — the passage best matching the query terms — from the stored page (positions in the posting list locate the matching span).
Caching layers
- Query result cache — popular queries (a heavy-tailed but very repetitive distribution) served from cache.
- Posting-list cache — hot terms’ posting lists kept in memory.
- Document/snippet cache — for frequently shown results.
Scaling and failure handling
- Index document-sharded across thousands of nodes, each replicated for query throughput + availability; a shard replica failing just routes to another.
- A shard slow/down → return partial results (top-k from the shards that answered) rather than block — graceful degradation, acceptable for search.
- Freshness → a real-time index of breaking content is queried alongside the main index and merged.
- Load spikes → result cache + replicas absorb; the offline pipeline is unaffected.
The takeaway
Concrete signals: MapReduce index build, iterative PageRank over the link graph for authority, two-phase scatter-gather (cheap candidates → ML re-rank), snippets from positions, and layered caching + partial-result degradation. It’s the crawler + inverted index + ranking composed into one system — the search group’s capstone, reusing MapReduce, sharding, and caching throughout.