Designing a web search engine
Assemble crawling, indexing, and ranking into a search engine — the offline pipeline that builds the index, PageRank-style ranking, and serving queries in milliseconds.
The problem
Design a web search engine (Google): crawl the web, build an index, and answer keyword queries with the most relevant, authoritative pages in milliseconds — over hundreds of billions of pages and billions of queries/day. This is the capstone of the search group: it composes the crawler and inverted index with ranking and serving.
Step 1 — Requirements
Functional: crawl and index the web; answer keyword queries with ranked results; support snippets, pagination, freshness; spelling/synonyms.
Non-functional: massive scale (index hundreds of billions of pages), low query latency (< few hundred ms), high relevance (the product is ranking quality), high availability, and reasonable freshness.
Step 2 — Two systems: offline indexing vs online serving
Cleanly separate the slow, batch indexing pipeline from the fast query serving path — they scale and evolve independently.
OFFLINE: crawler → content store → indexer → inverted index + ranking signals
ONLINE: query → parse → fan out to index shards → rank → snippets → results
Step 3 — The offline indexing pipeline
- Crawl the web (previous lessons) → raw pages in an object store.
- Parse & extract — text, title, links, metadata; build the link graph.
- Build the inverted index (term → posting list) — a giant batch/streaming job (MapReduce-style), document-sharded across thousands of nodes.
- Compute ranking signals — per-page authority (PageRank), term stats (for BM25/TF-IDF), spam scores, etc.
- Publish new index segments to the serving tier atomically.
Step 4 — Ranking: relevance × authority
Ranking quality is the whole product. It blends:
- Textual relevance — BM25/TF-IDF: does the page match the query terms well?
- Authority — PageRank — a page is important if many important pages link to it. It models a random surfer: a page’s score is the probability of landing on it, computed iteratively over the link graph. Authority is independent of the query (precomputed).
- Hundreds of other signals — freshness, location, click/engagement data, page quality, mobile-friendliness, and increasingly ML-learned ranking (learning to rank) and semantic/embedding matching.
Final score combines query-dependent (relevance) and query-independent (authority) signals; modern engines apply ML on top.
Step 5 — Serving a query (milliseconds)
query → spell-correct/expand → fan out to all index shards (document-sharded)
→ each shard returns its top-k candidates (relevance + precomputed authority)
→ coordinator merges → final ML re-rank of the top candidates → snippets → results
- Scatter-gather over thousands of shards, merged into the global top-k.
- A two-phase rank: cheap scoring on each shard to get candidates, then expensive ML re-ranking on just the merged top candidates.
- Heavy caching — popular queries’ results are cached; common posting lists cached in memory.
Step 6 — Scale
- Index is document-sharded across thousands of machines, each replicated for query throughput and availability.
- Query serving fans out and merges; replicas handle the billions of daily queries.
- Freshness — a fast real-time index for breaking content (news) merges with the main batch index (Lambda-style).
Trade-offs to raise
- Offline batch index (fresh-on-rebuild, cheap) vs real-time index (fresh, costly) — use both.
- Document sharding (even load, scatter-gather) vs term sharding (hot terms).
- Ranking quality vs latency — cheap candidate scoring then expensive ML re-rank on few candidates.
- Coverage/freshness vs cost — how much of the web to crawl and how often.
The interview cue
“Separate offline indexing (crawl → parse → document-sharded inverted index + PageRank/signals via MapReduce) from online serving (scatter-gather over replicated shards → merge → ML re-rank top candidates → snippets), with heavy caching and a real-time index for freshness. Ranking blends relevance (BM25) × authority (PageRank) plus learned signals.” Indexing/serving separation + ranking is the core; implementation next.