Designing search autocomplete (typeahead)
Suggest completions as the user types — a trie of top queries, precomputed top-k per prefix, and serving suggestions in tens of milliseconds.
The problem
Design search autocomplete / typeahead: as a user types, suggest the most likely completions (Google search box, app search). It must respond within a few tens of ms per keystroke, rank suggestions by popularity, and scale to billions of queries. The crux is prefix lookup of the top-k popular queries, fast.
Step 1 — Requirements
Functional: given a prefix, return the top k (≈5–10) most relevant complete queries; rank by popularity (and recency/personalization); update as new queries trend.
Non-functional: very low latency (< ~100 ms, ideally < 50 — it fires on every keystroke), high QPS (every keystroke is a request → reads ≫ the search itself), scalable, and reasonably fresh (trending terms appear within hours, not instantly).
Step 2 — The data structure: trie
A trie (prefix tree) maps each prefix to its completions: walk the tree by the typed characters, then return the best leaves below. But walking to all descendants per keystroke is too slow at scale, so the key optimization:
- Precompute and cache the top-k at each node. Store, at every trie node, the top k completions for that prefix (with their scores). A lookup becomes: walk to the prefix node (O(prefix length)) and return its stored top-k — O(1) after the walk. No descendant scan.
node "ca" → ["cat", "car", "cake", ...] (precomputed top-k by popularity)
Step 3 — Building and updating the top-k
- Aggregate query logs (offline/streaming): count query frequencies over a window.
- Build the trie with top-k per node from those counts (a batch job), and ship it to the serving tier.
- Refresh periodically (e.g. hourly) so trending queries surface — autocomplete tolerates being a bit stale, so this batch/stream hybrid is fine (Chapter 3).
This is fan-out-on-write thinking: precompute at build time so reads are trivial.
Step 4 — Serving architecture
keystroke → API/LB → autocomplete servers (in-memory trie, top-k per node) → suggestions
▲
query logs → aggregation (batch/stream) → rebuild trie → deploy to servers
- The trie lives in memory on serving nodes for speed; it’s large, so shard by prefix (e.g. first 1–2 characters) across nodes.
- A cache/CDN can hold suggestions for very common prefixes.
- Debounce on the client (wait ~100–300 ms after typing) to cut request volume.
Step 5 — Ranking and personalization
Score completions by popularity primarily, blended with recency (trending), personal history, and location/context. Personalization can be a second, smaller per-user signal merged at serve time.
Step 6 — Scale
- Sharding the trie by prefix spreads memory and load; replicate hot shards.
- Most traffic is short prefixes (1–3 chars) → cache those aggressively.
- Billions of queries → aggregation is a big-data batch/stream job, decoupled from serving.
Trade-offs to raise
- Precomputed top-k per node (fast reads, rebuild cost, slightly stale) vs on-the-fly traversal (fresh, too slow). Precompute wins.
- Freshness vs cost — rebuild more often for fresher trends, at more compute.
- Memory vs latency — in-memory trie is fast but large; shard it.
The interview cue
“A trie with precomputed top-k completions per node, built offline from aggregated query logs and refreshed hourly, served from in-memory sharded nodes so each keystroke is an O(prefix) lookup; client debounce and caching of short prefixes cut load. Rank by popularity + recency + personalization.” Precomputed top-k per trie node is the core idea; implementation next.