Building search indexing (Twitter search)
Implement the inverted index and posting-list intersection, BM25 ranking, scatter-gather across document shards, and the real-time segment pipeline.
Building the inverted index
Tokenize each document and append its id to every term’s posting list:
def index(doc_id, text, index):
for pos, term in enumerate(tokenize(text)): # lowercase, split, stopwords, stem
postings = index.setdefault(term, [])
postings.append(Posting(doc_id, pos, tf=...)) # doc id (+ position, term freq)
Posting lists are kept sorted by doc id so they can be intersected efficiently and compressed (delta/varint encoding) since they’re huge.
Querying: intersect posting lists
A multi-term AND query intersects posting lists (merge two sorted lists in O(n+m)):
def search(terms, index, k=10):
lists = sorted((index.get(t, []) for t in terms), key=len) # shortest first
hits = lists[0]
for lst in lists[1:]:
hits = intersect_sorted(hits, lst) # advance the two pointers
return rank(hits, terms)[:k]
Intersecting the shortest list first minimizes work. OR queries union; phrase queries also check positions.
Ranking with BM25
Score each hit by term relevance (BM25 — a tuned TF-IDF) blended with recency and engagement:
def score(doc, terms):
relevance = sum(bm25(term, doc) for term in terms) # TF × IDF, length-normalized
recency = decay(now() - doc.created_at)
social = log1p(doc.likes + doc.retweets)
return w1*relevance + w2*recency + w3*social
Scatter-gather across document shards
With document-based sharding, the query fans out to all shards; each returns its local top-k; the coordinator merges:
def distributed_search(terms, k=10):
partials = parallel_map(lambda shard: shard.search(terms, k), all_shards)
return merge_top_k(partials, k) # heap-merge shard results
Each shard ranks locally; merging k-per-shard gives the global top-k.
The real-time pipeline
New posts flow through a queue to indexers that update an in-memory real-time segment; segments are periodically flushed to disk and merged (LSM-style), so recent posts are instantly searchable and old ones live in compact immutable segments:
def ingest_loop():
for post in kafka.consume("posts"):
realtime_segment.index(post.id, post.text) # searchable within seconds
if realtime_segment.full():
flush_to_disk(realtime_segment) # immutable segment
realtime_segment = new_segment()
# queries search realtime_segment + all on-disk segments, then merge
A background compaction merges small segments into larger ones and drops deleted docs (tombstones).
Scaling and failure handling
- Query load → replicate each shard; the coordinator load-balances across replicas.
- Index load → partition posts across indexers via the queue.
- Hot term (a trending word with a giant posting list) → cache it; document sharding limits any single list’s length per shard.
- Indexer crash → reprocess from the queue offset (at-least-once; indexing is idempotent by doc id).
- Shard down → serve from a replica; results are slightly incomplete until repair (acceptable for search).
The takeaway
Concrete signals: an inverted index with sorted posting-list intersection, BM25 + recency + engagement ranking, scatter-gather over document shards with top-k merge, and a real-time segment pipeline (Kafka → in-memory → flushed segments + compaction) for seconds-fresh search. This is the dedicated search service you bolt onto any system to offload search from its primary store.