Designing search indexing (Twitter search)

Full-text search over a firehose of new content — the inverted index, how it's sharded and queried, and a real-time pipeline that makes new posts searchable in seconds.

The problem

Design search over tweets (or any high-write content): users search keywords and get matching posts ranked by relevance/recency, and brand-new posts become searchable within seconds. The crux is the inverted index plus a real-time indexing pipeline over a firehose of writes.

Step 1 — Requirements

Functional: index posts as they’re created; full-text search by keyword(s); rank results (relevance + recency + popularity); filters (user, time, language).

Non-functional: near-real-time indexing (new posts searchable in seconds), low-latency queries, huge write volume (hundreds of millions of posts/day) and huge query volume, scalable, available.

Step 2 — The inverted index

The core structure for full-text search: a map from term → list of documents (posting list) containing it.

"system" → [doc_42, doc_91, doc_220, ...]
"design" → [doc_42, doc_220, doc_888, ...]
search "system design" → intersect the two posting lists → [doc_42, doc_220]

Building blocks: tokenize text (lowercase, split, remove stopwords, stem), store a posting list per term (often with positions for phrase queries and term frequencies for ranking). A query intersects/unions posting lists and ranks the hits.

Step 3 — Sharding the index

The index is far too big for one node. Two sharding strategies:

Document-based (by document) — each shard holds the full index for a subset of documents. A query fans out to all shards, each returns its top hits, and a coordinator merges them (scatter-gather). Even write distribution; queries touch all shards. Most common.
Term-based (by term) — each shard owns certain terms’ posting lists. A query hits only the shards for its terms, but posting lists for popular terms become hotspots, and multi-term queries get complex.

Document-based is usually preferred (even load, simpler), accepting scatter-gather.

Step 4 — The real-time indexing pipeline

New posts must be searchable in seconds, so indexing is a streaming pipeline (Twitter’s Earlybird):

new post → ingest queue (Kafka) → indexers → in-memory real-time index (recent posts)
                                                   │ periodically flush/merge
                                                   ▼
                                          on-disk segments (older posts)

A real-time index (in memory) holds the most recent posts for instant searchability; older posts live in immutable on-disk segments (LSM-like).
Queries search both the real-time and historical indexes and merge.

Step 5 — Ranking

Score results by a blend of relevance (TF-IDF / BM25 — term frequency × inverse document frequency), recency (newer ranks higher for fresh topics), popularity/engagement (likes, retweets), and the searcher’s context. Return the top-k after merging shard results.

Step 6 — Scale

Reads (queries) scale by replicating shards; the coordinator fans out and merges.
Writes (indexing) scale via the queue + parallel indexers; document-sharding spreads write load.
Caching frequent queries and hot posting lists cuts latency.

Trade-offs to raise

Document-sharding (even load, scatter-gather) vs term-sharding (targeted queries, hot terms).
Real-time index (fresh, in-memory, costly) vs batch index (cheap, stale) — use both (Lambda-style).
Index freshness vs query latency — more frequent flush/merge = fresher but more overhead.

The interview cue

“An inverted index (term → posting list), document-sharded so queries scatter-gather and merge; a real-time pipeline (Kafka → indexers → in-memory recent index, flushed to on-disk segments) so posts are searchable in seconds; ranking by BM25 + recency + engagement.” Inverted index + real-time indexing is the heart; implementation next. (A dedicated search index like this is also how you’d offload search from any system’s primary DB — Chapter 2’s indexes lesson.)