Designing a web search engine

Assemble crawling, indexing, and ranking into a search engine — the offline pipeline that builds the index, PageRank-style ranking, and serving queries in milliseconds.

The problem

Design a web search engine (Google): crawl the web, build an index, and answer keyword queries with the most relevant, authoritative pages in milliseconds — over hundreds of billions of pages and billions of queries/day. This is the capstone of the search group: it composes the crawler and inverted index with ranking and serving.

Step 1 — Requirements

Functional: crawl and index the web; answer keyword queries with ranked results; support snippets, pagination, freshness; spelling/synonyms.

Non-functional: massive scale (index hundreds of billions of pages), low query latency (< few hundred ms), high relevance (the product is ranking quality), high availability, and reasonable freshness.

Step 2 — Two systems: offline indexing vs online serving

Cleanly separate the slow, batch indexing pipeline from the fast query serving path — they scale and evolve independently.

OFFLINE:  crawler → content store → indexer → inverted index + ranking signals
ONLINE:   query → parse → fan out to index shards → rank → snippets → results

Step 3 — The offline indexing pipeline

Crawl the web (previous lessons) → raw pages in an object store.
Parse & extract — text, title, links, metadata; build the link graph.
Build the inverted index (term → posting list) — a giant batch/streaming job (MapReduce-style), document-sharded across thousands of nodes.
Compute ranking signals — per-page authority (PageRank), term stats (for BM25/TF-IDF), spam scores, etc.
Publish new index segments to the serving tier atomically.

Step 4 — Ranking: relevance × authority

Ranking quality is the whole product. It blends:

Textual relevance — BM25/TF-IDF: does the page match the query terms well?
Authority — PageRank — a page is important if many important pages link to it. It models a random surfer: a page’s score is the probability of landing on it, computed iteratively over the link graph. Authority is independent of the query (precomputed).
Hundreds of other signals — freshness, location, click/engagement data, page quality, mobile-friendliness, and increasingly ML-learned ranking (learning to rank) and semantic/embedding matching.

Final score combines query-dependent (relevance) and query-independent (authority) signals; modern engines apply ML on top.

Step 5 — Serving a query (milliseconds)

query → spell-correct/expand → fan out to all index shards (document-sharded)
      → each shard returns its top-k candidates (relevance + precomputed authority)
      → coordinator merges → final ML re-rank of the top candidates → snippets → results

Scatter-gather over thousands of shards, merged into the global top-k.
A two-phase rank: cheap scoring on each shard to get candidates, then expensive ML re-ranking on just the merged top candidates.
Heavy caching — popular queries’ results are cached; common posting lists cached in memory.

Step 6 — Scale

Index is document-sharded across thousands of machines, each replicated for query throughput and availability.
Query serving fans out and merges; replicas handle the billions of daily queries.
Freshness — a fast real-time index for breaking content (news) merges with the main batch index (Lambda-style).

Trade-offs to raise

Offline batch index (fresh-on-rebuild, cheap) vs real-time index (fresh, costly) — use both.
Document sharding (even load, scatter-gather) vs term sharding (hot terms).
Ranking quality vs latency — cheap candidate scoring then expensive ML re-rank on few candidates.
Coverage/freshness vs cost — how much of the web to crawl and how often.

The interview cue

“Separate offline indexing (crawl → parse → document-sharded inverted index + PageRank/signals via MapReduce) from online serving (scatter-gather over replicated shards → merge → ML re-rank top candidates → snippets), with heavy caching and a real-time index for freshness. Ranking blends relevance (BM25) × authority (PageRank) plus learned signals.” Indexing/serving separation + ranking is the core; implementation next.