Building Twitter
Implement the hybrid fan-out via a queue, the merge-at-read timeline assembly, Snowflake tweet IDs, and timeline caching.
Posting a tweet (async fan-out)
The write returns fast; fan-out happens asynchronously off a queue:
    def post_tweet(author_id, text, media):
        tweet_id = snowflake()                                # time-sortable unique id
        media_url = upload_to_blob(media) if media else None
        tweets.put(tweet_id, {"author": author_id, "text": text,
                              "media": media_url, "ts": now()})
        fanout_queue.publish({"tweet_id": tweet_id, "author_id": author_id})
        return tweet_id
The fan-out worker (hybrid)
Workers push the tweet into followers’ timelines unless the author is a celebrity, in which case they skip the push and the tweet is pulled at read time instead:
    def fanout_worker(msg):
        author = msg["author_id"]
        if follower_count(author) > CELEBRITY_THRESHOLD:   # e.g. > 1M followers
            return                                         # do NOT fan out; pulled on read
        for follower in followers(author):                 # paginate the follower list
            timeline = f"timeline:{follower}"
            redis.lpush(timeline, msg["tweet_id"])         # newest id goes to the head
            redis.ltrim(timeline, 0, 799)                  # cap the precomputed list at 800 entries
The celebrity check is what prevents one tweet from triggering tens of millions of list writes.
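To make the write amplification concrete, here is a minimal in-memory sketch of the same worker logic with explicit follower pagination. It is not the production path: `FOLLOWERS`, `TIMELINES`, `follower_pages`, and the tiny threshold, cap, and page size are all hypothetical stand-ins chosen for the demo, and a bounded `deque` plays the role of the capped Redis list.

```python
from collections import defaultdict, deque

CELEBRITY_THRESHOLD = 3   # tiny demo value; the text suggests something like 1M
TIMELINE_CAP = 5          # demo stand-in for the capped list length
PAGE_SIZE = 2             # followers fetched per page

# Hypothetical in-memory stand-ins for the social graph and for Redis.
FOLLOWERS = {"alice": ["u1", "u2", "u3"], "star": ["u1", "u2", "u3", "u4"]}
TIMELINES = defaultdict(lambda: deque(maxlen=TIMELINE_CAP))

def follower_pages(author, page_size=PAGE_SIZE):
    """Yield the follower list in fixed-size pages instead of all at once."""
    ids = FOLLOWERS.get(author, [])
    for start in range(0, len(ids), page_size):
        yield ids[start:start + page_size]

def fanout(author, tweet_id):
    """Push tweet_id to each follower; return the number of timeline writes."""
    if len(FOLLOWERS.get(author, [])) > CELEBRITY_THRESHOLD:
        return 0                                      # celebrity: pulled on read
    writes = 0
    for page in follower_pages(author):
        for follower in page:
            # appendleft on a full bounded deque drops the oldest entry,
            # which mirrors LPUSH followed by LTRIM.
            TIMELINES[follower].appendleft(tweet_id)
            writes += 1
    return writes
```

Calling `fanout("alice", 101)` performs three timeline writes, while `fanout("star", 102)` performs none: the cost of a celebrity tweet moves from write time to read time.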
Reading a timeline (merge precomputed + celebrities)
    def home_timeline(user_id, k=50):
        base = redis.lrange(f"timeline:{user_id}", 0, k - 1)      # newest k pushed (normal) tweets; O(k) from the head
        celeb_ids = celebrities_followed(user_id)                 # the few big accounts you follow
        celeb_tweets = [latest_tweets(c, k) for c in celeb_ids]   # pulled at read time
        merged = merge_by_time(base, *celeb_tweets)[:k]           # k-way merge by tweet_id (time)
        return hydrate(merged)                                    # fetch text/media from cache/store
Snowflake ids are time-sortable, so the merge is just a heap-merge by id. Hydration turns ids into full tweets (batched cache reads).
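One way `merge_by_time` could look, as a sketch rather than the definitive implementation: it uses the standard library's `heapq.merge` with `reverse=True`, assumes every input list is already sorted newest first (largest id first), and stops as soon as it has `k` ids. The keyword-only `k` parameter and the duplicate check are additions for illustration; duplicates could arise if a tweet is both pushed and pulled, for example when an author crosses the celebrity threshold.

```python
import heapq

def merge_by_time(*id_lists, k=50):
    """Merge newest-first id lists into one newest-first list of at most k ids."""
    out = []
    last = None
    # heapq.merge is lazy; with reverse=True each input must be sorted descending.
    for tweet_id in heapq.merge(*id_lists, reverse=True):
        if tweet_id == last:          # same tweet seen in two lists: keep one copy
            continue
        out.append(tweet_id)
        last = tweet_id
        if len(out) == k:             # stop early; the rest is never consumed
            break
    return out
```

For example, `merge_by_time([90, 70, 40], [95, 60], [70, 10], k=4)` returns `[95, 90, 70, 60]`. Because the merge is lazy, the cost is proportional to the `k` ids actually returned (times the log of the number of lists), not to the total input size.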
Snowflake IDs
Tweet ids must be unique, roughly time-ordered (for sorting/merging), and generated without a central bottleneck:
64-bit id = [unused sign bit (1)] [timestamp (41 bits)] [machine id (10)] [sequence (12)]
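Each node stamps the time in milliseconds, its machine id, and a per-ms counter, so ids are unique and sortable with no per-id coordination; the only shared state is that each node needs a distinct machine id. Ordering across nodes is only as exact as their clocks, hence "roughly" time-ordered. (Reuse for any “unique time-ordered id” need.)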
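The layout can be sketched as a small generator. This is an illustrative version under stated assumptions, not Twitter's code: the class name, the injectable `clock`, and the choice to raise when the clock moves backwards are all made up for the example, and the epoch constant is the one commonly cited from the open-sourced Snowflake project (any fixed instant in the past would do).

```python
import threading
import time

EPOCH_MS = 1288834974657      # assumed custom epoch; any fixed past instant works
MACHINE_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1

class Snowflake:
    def __init__(self, machine_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= machine_id < (1 << MACHINE_BITS):
            raise ValueError("machine_id must fit in 10 bits")
        self.machine_id = machine_id
        self.clock = clock            # returns Unix time in milliseconds
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = self.clock()
            if now < self.last_ms:
                raise RuntimeError("clock moved backwards; refusing to issue ids")
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & MAX_SEQUENCE
                if self.sequence == 0:            # 4096 ids issued this ms
                    while now <= self.last_ms:    # spin until the next millisecond
                        now = self.clock()
            else:
                self.sequence = 0
            self.last_ms = now
            return (((now - EPOCH_MS) << (MACHINE_BITS + SEQUENCE_BITS))
                    | (self.machine_id << SEQUENCE_BITS)
                    | self.sequence)

def timestamp_ms(snowflake_id):
    """Recover the creation time (Unix ms) from an id."""
    return (snowflake_id >> (MACHINE_BITS + SEQUENCE_BITS)) + EPOCH_MS
```

Because the timestamp occupies the high bits, comparing two ids as integers compares their creation times first, which is what lets the read path sort and merge by id alone.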
Storage and sharding
- Tweets sharded by tweet id (or author); replicated; media on CDN.
- Timelines are capped lists in Redis, sharded by user id.
- Social graph (followers / following) in a sharded store; “followers(author)” must page efficiently for fan-out.
- A user’s profile timeline (their own tweets) is a simple per-author list.
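As a rough illustration of "sharded by user id", the routing step might look like the sketch below. `shard_for` and `timeline_key` are hypothetical helper names, and simple modulo hashing is an assumption made for brevity; a real deployment would more likely use consistent hashing or fixed hash slots so that adding a shard does not remap most keys.

```python
import zlib

def shard_for(user_id, num_shards):
    """Map a user id to a shard index using a hash that is stable across processes."""
    # Python's built-in hash() is randomized per process for strings,
    # so a fixed checksum such as CRC32 is used instead.
    return zlib.crc32(str(user_id).encode("utf-8")) % num_shards

def timeline_key(user_id):
    """Key of the capped timeline list for one user."""
    return f"timeline:{user_id}"
```

Every fan-out write and every timeline read for a given user then lands on the same shard, `shard_for(user_id, num_shards)`.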
Handling the hard cases
- Celebrity tweet → not fanned out; merged at read (caps write amplification).
- New follow → backfill: merge the newly-followed user’s recent tweets into your timeline (or just pull until the next refresh).
- Fan-out lag → it’s async, so a tweet may take a few seconds to appear — eventual consistency, acceptable.
- Hot tweet (viral) → tweet content is cached; the CDN serves media; likes/retweets counted asynchronously (approximate counters).
- Timeline rebuild (cache loss) → regenerate from followees’ recent tweets (fan-out-on-read fallback).
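The backfill and rebuild cases reduce to the same operation: merge newest-first id lists and keep the newest entries up to the cap. A minimal sketch, assuming a hypothetical `recent_tweet_ids(author)` lookup that returns an author's tweet ids newest first (the per-author profile list), and assuming only non-celebrity followees are passed in, since celebrity tweets are pulled at read time anyway:

```python
import heapq
from itertools import islice

TIMELINE_CAP = 800   # assumed cap on the precomputed list

def merge_capped(id_lists, cap=TIMELINE_CAP):
    """Merge newest-first id lists and keep only the newest `cap` ids."""
    return list(islice(heapq.merge(*id_lists, reverse=True), cap))

def rebuild_timeline(followee_ids, recent_tweet_ids, cap=TIMELINE_CAP):
    """Cache loss: regenerate the timeline from each followee's recent tweets."""
    return merge_capped([recent_tweet_ids(a) for a in followee_ids], cap)

def backfill_on_follow(timeline_ids, new_followee, recent_tweet_ids, cap=TIMELINE_CAP):
    """New follow: fold the new followee's recent tweets into the existing timeline."""
    return merge_capped([timeline_ids, recent_tweet_ids(new_followee)], cap)
```

The result would then be written back to the user's capped list in one batch, after which the normal push path keeps it current.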
Scaling
- Reads dominate → Redis timelines + tweet cache absorb them; replicate hot shards.
- Writes/fan-out → queue + worker fleet; the hybrid caps amplification.
- Counts (likes/views) → async approximate counters, not synchronous DB increments.
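One simple form of an asynchronous counter is to buffer increments in memory and apply them to the store in batches. The sketch below is an assumption about how that could be done, not a description of any real system: `BufferedCounter` is a made-up name and `store` is any dict-like object standing in for the durable count store. The count that readers see lags by up to one flush interval, which is the "approximate" part.

```python
import threading
from collections import Counter

class BufferedCounter:
    """Accumulate increments in memory and flush them to the store in batches."""

    def __init__(self, store):
        self.store = store            # hypothetical dict-like store: tweet_id -> count
        self.pending = Counter()
        self.lock = threading.Lock()

    def incr(self, tweet_id, n=1):
        with self.lock:
            self.pending[tweet_id] += n      # no store write on the hot path

    def flush(self):
        """Apply buffered deltas; return the number of store writes performed."""
        with self.lock:
            batch, self.pending = self.pending, Counter()   # swap under the lock
        for tweet_id, delta in batch.items():
            self.store[tweet_id] = self.store.get(tweet_id, 0) + delta
        return len(batch)
```

A thousand likes on one viral tweet between flushes become a single store write instead of a thousand; `flush` would typically run on a timer or in a background worker.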
The takeaway
Concrete signals: async hybrid fan-out (push for normal, pull-and-merge for celebrities), Snowflake ids enabling a time-ordered merge at read, capped Redis timelines, and CDN media. The push/pull hybrid is the reusable feed pattern — Instagram, news feed, and TikTok are variations on it.