Building search indexing (Twitter search)

Implement the inverted index and posting-list intersection, BM25 ranking, scatter-gather across document shards, and the real-time segment pipeline.

Building the inverted index

Tokenize each document and append its id to every term’s posting list:

def index(doc_id, text, index):
    for pos, term in enumerate(tokenize(text)):     # lowercase, split, stopwords, stem
        postings = index.setdefault(term, [])
        postings.append(Posting(doc_id, pos, tf=...))  # doc id (+ position, term freq)

Posting lists are kept sorted by doc id so they can be intersected efficiently and compressed (delta/varint encoding) since they’re huge.

Querying: intersect posting lists

A multi-term AND query intersects posting lists (merge two sorted lists in O(n+m)):

def search(terms, index, k=10):
    lists = sorted((index.get(t, []) for t in terms), key=len)   # shortest first
    hits = lists[0]
    for lst in lists[1:]:
        hits = intersect_sorted(hits, lst)          # advance the two pointers
    return rank(hits, terms)[:k]

Intersecting the shortest list first minimizes work. OR queries union; phrase queries also check positions.

Ranking with BM25

Score each hit by term relevance (BM25 — a tuned TF-IDF) blended with recency and engagement:

def score(doc, terms):
    relevance = sum(bm25(term, doc) for term in terms)   # TF × IDF, length-normalized
    recency   = decay(now() - doc.created_at)
    social    = log1p(doc.likes + doc.retweets)
    return w1*relevance + w2*recency + w3*social

Scatter-gather across document shards

With document-based sharding, the query fans out to all shards; each returns its local top-k; the coordinator merges:

def distributed_search(terms, k=10):
    partials = parallel_map(lambda shard: shard.search(terms, k), all_shards)
    return merge_top_k(partials, k)                 # heap-merge shard results

Each shard ranks locally; merging k-per-shard gives the global top-k.

The real-time pipeline

New posts flow through a queue to indexers that update an in-memory real-time segment; segments are periodically flushed to disk and merged (LSM-style), so recent posts are instantly searchable and old ones live in compact immutable segments:

def ingest_loop():
    for post in kafka.consume("posts"):
        realtime_segment.index(post.id, post.text)   # searchable within seconds
        if realtime_segment.full():
            flush_to_disk(realtime_segment)           # immutable segment
            realtime_segment = new_segment()
# queries search realtime_segment + all on-disk segments, then merge

A background compaction merges small segments into larger ones and drops deleted docs (tombstones).

Scaling and failure handling

Query load → replicate each shard; the coordinator load-balances across replicas.
Index load → partition posts across indexers via the queue.
Hot term (a trending word with a giant posting list) → cache it; document sharding limits any single list’s length per shard.
Indexer crash → reprocess from the queue offset (at-least-once; indexing is idempotent by doc id).
Shard down → serve from a replica; results are slightly incomplete until repair (acceptable for search).

The takeaway

Concrete signals: an inverted index with sorted posting-list intersection, BM25 + recency + engagement ranking, scatter-gather over document shards with top-k merge, and a real-time segment pipeline (Kafka → in-memory → flushed segments + compaction) for seconds-fresh search. This is the dedicated search service you bolt onto any system to offload search from its primary store.