Designing search autocomplete (typeahead)

Suggest completions as the user types — a trie of top queries, precomputed top-k per prefix, and serving suggestions in tens of milliseconds.

The problem

Design search autocomplete / typeahead: as a user types, suggest the most likely completions (Google search box, app search). It must respond within a few tens of ms per keystroke, rank suggestions by popularity, and scale to billions of queries. The crux is prefix lookup of the top-k popular queries, fast.

Step 1 — Requirements

Functional: given a prefix, return the top k (≈5–10) most relevant complete queries; rank by popularity (and recency/personalization); update as new queries trend.

Non-functional: very low latency (< ~100 ms, ideally < 50 — it fires on every keystroke), high QPS (every keystroke is a request → reads ≫ the search itself), scalable, and reasonably fresh (trending terms appear within hours, not instantly).

Step 2 — The data structure: trie

A trie (prefix tree) maps each prefix to its completions: walk the tree by the typed characters, then return the best leaves below. But walking to all descendants per keystroke is too slow at scale, so the key optimization:

Precompute and cache the top-k at each node. Store, at every trie node, the top k completions for that prefix (with their scores). A lookup becomes: walk to the prefix node (O(prefix length)) and return its stored top-k — O(1) after the walk. No descendant scan.

node "ca" → ["cat", "car", "cake", ...]   (precomputed top-k by popularity)

Step 3 — Building and updating the top-k

Aggregate query logs (offline/streaming): count query frequencies over a window.
Build the trie with top-k per node from those counts (a batch job), and ship it to the serving tier.
Refresh periodically (e.g. hourly) so trending queries surface — autocomplete tolerates being a bit stale, so this batch/stream hybrid is fine (Chapter 3).

This is fan-out-on-write thinking: precompute at build time so reads are trivial.

Step 4 — Serving architecture

keystroke → API/LB → autocomplete servers (in-memory trie, top-k per node) → suggestions
                          ▲
        query logs → aggregation (batch/stream) → rebuild trie → deploy to servers

The trie lives in memory on serving nodes for speed; it’s large, so shard by prefix (e.g. first 1–2 characters) across nodes.
A cache/CDN can hold suggestions for very common prefixes.
Debounce on the client (wait ~100–300 ms after typing) to cut request volume.

Step 5 — Ranking and personalization

Score completions by popularity primarily, blended with recency (trending), personal history, and location/context. Personalization can be a second, smaller per-user signal merged at serve time.

Step 6 — Scale

Sharding the trie by prefix spreads memory and load; replicate hot shards.
Most traffic is short prefixes (1–3 chars) → cache those aggressively.
Billions of queries → aggregation is a big-data batch/stream job, decoupled from serving.

Trade-offs to raise

Precomputed top-k per node (fast reads, rebuild cost, slightly stale) vs on-the-fly traversal (fresh, too slow). Precompute wins.
Freshness vs cost — rebuild more often for fresher trends, at more compute.
Memory vs latency — in-memory trie is fast but large; shard it.

The interview cue

“A trie with precomputed top-k completions per node, built offline from aggregated query logs and refreshed hourly, served from in-memory sharded nodes so each keystroke is an O(prefix) lookup; client debounce and caching of short prefixes cut load. Rank by popularity + recency + personalization.” Precomputed top-k per trie node is the core idea; implementation next.