Designing search indexing (Twitter search)
Full-text search over a firehose of new content — the inverted index, how it's sharded and queried, and a real-time pipeline that makes new posts searchable in seconds.
The problem
Design search over tweets (or any high-write content): users search keywords and get matching posts ranked by relevance/recency, and brand-new posts become searchable within seconds. The crux is the inverted index plus a real-time indexing pipeline over a firehose of writes.
Step 1 — Requirements
Functional: index posts as they’re created; full-text search by keyword(s); rank results (relevance + recency + popularity); filters (user, time, language).
Non-functional: near-real-time indexing (new posts searchable in seconds), low-latency queries, huge write volume (hundreds of millions of posts/day) and huge query volume, scalable, available.
Step 2 — The inverted index
The core structure for full-text search: a map from term → list of documents (posting list) containing it.
"system" → [doc_42, doc_91, doc_220, ...]
"design" → [doc_42, doc_220, doc_888, ...]
search "system design" → intersect the two posting lists → [doc_42, doc_220]
Building blocks: tokenize text (lowercase, split, remove stopwords, stem), store a posting list per term (often with positions for phrase queries and term frequencies for ranking). A query intersects/unions posting lists and ranks the hits.
Step 3 — Sharding the index
The index is far too big for one node. Two sharding strategies:
- Document-based (by document) — each shard holds the full index for a subset of documents. A query fans out to all shards, each returns its top hits, and a coordinator merges them (scatter-gather). Even write distribution; queries touch all shards. Most common.
- Term-based (by term) — each shard owns certain terms’ posting lists. A query hits only the shards for its terms, but posting lists for popular terms become hotspots, and multi-term queries get complex.
Document-based is usually preferred (even load, simpler), accepting scatter-gather.
Step 4 — The real-time indexing pipeline
New posts must be searchable in seconds, so indexing is a streaming pipeline (Twitter’s Earlybird):
new post → ingest queue (Kafka) → indexers → in-memory real-time index (recent posts)
│ periodically flush/merge
▼
on-disk segments (older posts)
- A real-time index (in memory) holds the most recent posts for instant searchability; older posts live in immutable on-disk segments (LSM-like).
- Queries search both the real-time and historical indexes and merge.
Step 5 — Ranking
Score results by a blend of relevance (TF-IDF / BM25 — term frequency × inverse document frequency), recency (newer ranks higher for fresh topics), popularity/engagement (likes, retweets), and the searcher’s context. Return the top-k after merging shard results.
Step 6 — Scale
- Reads (queries) scale by replicating shards; the coordinator fans out and merges.
- Writes (indexing) scale via the queue + parallel indexers; document-sharding spreads write load.
- Caching frequent queries and hot posting lists cuts latency.
Trade-offs to raise
- Document-sharding (even load, scatter-gather) vs term-sharding (targeted queries, hot terms).
- Real-time index (fresh, in-memory, costly) vs batch index (cheap, stale) — use both (Lambda-style).
- Index freshness vs query latency — more frequent flush/merge = fresher but more overhead.
The interview cue
“An inverted index (term → posting list), document-sharded so queries scatter-gather and merge; a real-time pipeline (Kafka → indexers → in-memory recent index, flushed to on-disk segments) so posts are searchable in seconds; ranking by BM25 + recency + engagement.” Inverted index + real-time indexing is the heart; implementation next. (A dedicated search index like this is also how you’d offload search from any system’s primary DB — Chapter 2’s indexes lesson.)