Building a chat system (WhatsApp)
Implement the connection servers and session registry, ordered persisted delivery with offline inboxes, receipts, and group fan-out.
Connection servers and the session registry
Each connection server holds millions of WebSockets; a registry maps users to their server so messages can be routed:
# on connect
def on_connect(user_id, conn):
    connections[user_id] = conn  # local: user -> socket on THIS server
    registry.set(f"session:{user_id}", this_server_id, ttl="...")  # global: user -> server
    presence.publish(user_id, "online")

# on disconnect
def on_disconnect(user_id):
    connections.pop(user_id, None)  # pop, not del: a repeated disconnect must not raise
    registry.delete(f"session:{user_id}")
    presence.publish(user_id, "offline", last_seen=now())
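The ttl on the registry entry is what lets a dead server's sessions age out without an explicit delete. The following is a minimal in-memory sketch of that behaviour, not the registry's real implementation: the SessionRegistry class, the injected clock, and the delete_if_owner guard are all assumptions made for illustration (a production registry would more likely be a shared store with native key expiry).

```python
import time
from typing import Callable, Dict, Optional, Tuple


class SessionRegistry:
    """Hypothetical in-memory stand-in for the global user -> server registry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # key -> (server_id, expires_at)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def set(self, key: str, server_id: str, ttl: float) -> None:
        # Called on connect and again on every heartbeat to extend the lease.
        self._entries[key] = (server_id, self._clock() + ttl)

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        server_id, expires_at = entry
        if self._clock() >= expires_at:
            # Lease lapsed: the owning server stopped heartbeating.
            del self._entries[key]
            return None
        return server_id

    def delete_if_owner(self, key: str, server_id: str) -> bool:
        # Only the owning server may delete, so a late disconnect from an old
        # server cannot wipe a session the user has since opened elsewhere.
        entry = self._entries.get(key)
        if entry is not None and entry[0] == server_id:
            del self._entries[key]
            return True
        return False
```

With a fake clock, a session set with a 30-second ttl resolves to its server until the clock passes 30, after which get returns None and the sender's deliver call would fall through to the offline path.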
Sending a message (persist, order, route)
def send_message(sender, convo_id, ciphertext, client_msg_id):
    seq = sequencer.next(convo_id)  # per-conversation ordering
    msg = {"id": snowflake(), "convo": convo_id, "from": sender, "seq": seq,
           "client_msg_id": client_msg_id,  # carried along so resends can be deduplicated
           "body": ciphertext, "ts": now()}
    message_store.append(convo_id, msg)  # persist (durability + history)
    for recipient in members(convo_id) - {sender}:
        deliver(recipient, msg)
    return msg["id"]

def deliver(recipient, msg):
    server = registry.get(f"session:{recipient}")
    if server:  # online
        route_to(server, recipient, msg)  # push to their connection server
    else:  # offline
        inbox.append(recipient, msg)  # store for reconnect
Sequence numbers give per-conversation ordering; persisting before delivery means a crash never loses an accepted message.
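To make the ordering and persist-first claims concrete, here is a small single-process sketch. MessageLog and its methods are hypothetical names, and treating a repeated (sender, client_msg_id) pair as a retry on the server side is an assumption of this sketch rather than something the design above spells out.

```python
import itertools
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


class MessageLog:
    """Hypothetical sequencer plus message store for illustration only."""

    def __init__(self) -> None:
        # One independent counter per conversation, starting at 1.
        self._counters: Dict[str, Iterator[int]] = defaultdict(
            lambda: itertools.count(1)
        )
        self._log: Dict[str, List[dict]] = defaultdict(list)
        # (sender, client_msg_id) -> already accepted message
        self._accepted: Dict[Tuple[str, str], dict] = {}

    def append(self, sender: str, convo_id: str, body: bytes,
               client_msg_id: str) -> dict:
        key = (sender, client_msg_id)
        if key in self._accepted:
            # Assumption: a retry of an accepted send returns the original
            # message instead of consuming a new sequence number.
            return self._accepted[key]
        msg = {
            "convo": convo_id,
            "from": sender,
            "seq": next(self._counters[convo_id]),
            "client_msg_id": client_msg_id,
            "body": body,
        }
        # Persist before any delivery attempt is made.
        self._log[convo_id].append(msg)
        self._accepted[key] = msg
        return msg

    def history(self, convo_id: str) -> List[dict]:
        return list(self._log[convo_id])
```

Two conversations number their messages independently, so a busy group never delays sequence assignment in an unrelated one-to-one chat.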
Offline delivery and sync
On reconnect, the client tells the server its last-acked sequence; the server flushes the backlog:
def on_resync(user_id, convo_id, last_seq):
    backlog = inbox.since(user_id, convo_id, last_seq)  # everything missed
    for msg in backlog:
        push(user_id, msg)
    # client acks; server marks delivered, removes from inbox
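The resync flow above relies on an inbox that can answer "everything after sequence N" and forget messages once they are acked. A minimal sketch of such an inbox follows; the Inbox class, its in-memory storage, and the ack method are assumptions for illustration, and a real inbox would be durable.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class Inbox:
    """Hypothetical in-memory offline inbox keyed by (user, conversation)."""

    def __init__(self) -> None:
        self._pending: Dict[Tuple[str, str], List[dict]] = defaultdict(list)

    def append(self, user_id: str, msg: dict) -> None:
        self._pending[(user_id, msg["convo"])].append(msg)

    def since(self, user_id: str, convo_id: str, last_seq: int) -> List[dict]:
        # Sort by seq so the backlog replays in conversation order even if
        # messages were appended out of order.
        pending = self._pending.get((user_id, convo_id), [])
        missed = (m for m in pending if m["seq"] > last_seq)
        return sorted(missed, key=lambda m: m["seq"])

    def ack(self, user_id: str, convo_id: str, up_to_seq: int) -> None:
        # Drop everything the client has confirmed; the rest stays queued
        # so an interrupted flush can be retried.
        key = (user_id, convo_id)
        self._pending[key] = [
            m for m in self._pending[key] if m["seq"] > up_to_seq
        ]
```

Because messages are removed only on ack, a connection that drops mid-flush simply causes the same backlog to be pushed again on the next resync.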
Delivery and read receipts
Receipts are tiny messages routed back to the sender:
def on_client_ack(user_id, msg_id, kind):  # kind = "delivered" | "read"
    msg = message_store.get(msg_id)
    deliver(msg["from"], {"type": "receipt", "msg_id": msg_id,
                          "by": user_id, "kind": kind})
At-least-once delivery + dedup by client_msg_id on the receiver means resends
(after a flaky network) don’t create duplicates.
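Receiver-side dedup can be sketched as a bounded set of already seen ids. This sketch assumes each delivered message carries its sender and client_msg_id. The DedupingReceiver class and the max_remembered bound are hypothetical, and keying on the (sender, client_msg_id) pair is a choice made here so that ids from different senders cannot collide.

```python
from collections import OrderedDict
from typing import List, Tuple


class DedupingReceiver:
    """Hypothetical client-side receiver that drops redelivered messages."""

    def __init__(self, max_remembered: int = 10_000) -> None:
        # Insertion-ordered so the oldest remembered id can be evicted first.
        self._seen: "OrderedDict[Tuple[str, str], None]" = OrderedDict()
        self._max = max_remembered
        self.displayed: List[dict] = []

    def on_message(self, msg: dict) -> bool:
        key = (msg["from"], msg["client_msg_id"])
        if key in self._seen:
            # Duplicate: the caller should still ack it, but it is not shown twice.
            return False
        self._seen[key] = None
        if len(self._seen) > self._max:
            self._seen.popitem(last=False)  # evict the oldest remembered id
        self.displayed.append(msg)
        return True
```

The bound trades memory for a small risk: a duplicate that arrives after its id has been evicted would be displayed again, so the window should comfortably exceed the longest plausible retry delay.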
Group fan-out
A group send writes once and fans out to each member’s delivery path (online push or
offline inbox), reusing deliver. Receipts aggregate per member (“delivered to 8/10”).
Cap fan-out for very large groups (rate-limit, or treat as broadcast channels).
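The per-member aggregation can be sketched with one set per receipt kind. ReceiptTracker is a hypothetical name, and counting a "read" receipt as also implying "delivered" is an assumption of this sketch.

```python
from typing import Dict, FrozenSet, Iterable, Set


class ReceiptTracker:
    """Hypothetical per-message receipt aggregation for a group send."""

    def __init__(self, msg_id: str, recipients: Iterable[str]) -> None:
        self.msg_id = msg_id
        self._recipients: FrozenSet[str] = frozenset(recipients)
        self._by_kind: Dict[str, Set[str]] = {"delivered": set(), "read": set()}

    def record(self, user_id: str, kind: str) -> None:
        if user_id not in self._recipients:
            return  # ignore receipts from users who were not sent the message
        # Sets make repeated receipts from the same member idempotent.
        self._by_kind[kind].add(user_id)
        if kind == "read":
            # Assumption: a message cannot be read without being delivered.
            self._by_kind["delivered"].add(user_id)

    def summary(self, kind: str) -> str:
        return f"{kind} to {len(self._by_kind[kind])}/{len(self._recipients)}"
```

For a ten-member recipient list with eight delivery receipts recorded, summary("delivered") yields the "delivered to 8/10" string used in the text.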
Presence
Connection servers publish connect/disconnect to a pub/sub channel; a contact subscribed to you receives your online/last-seen. Presence is high-churn, so it’s best-effort, debounced, and rate-limited (don’t let presence storms dominate traffic).
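One way to debounce presence is to publish a status only after it has held steady for a quiet period, so a flapping connection produces no traffic at all. The sketch below is one possible approach, not the design's prescribed one: PresenceDebouncer, quiet_period, and the explicit now arguments are all hypothetical.

```python
from typing import Callable, Dict, Tuple


class PresenceDebouncer:
    """Hypothetical debouncer: publish a status only once it has settled."""

    def __init__(self, publish: Callable[[str, str], None],
                 quiet_period: float = 5.0) -> None:
        self._publish = publish
        self._quiet = quiet_period
        self._pending: Dict[str, Tuple[str, float]] = {}  # user -> (status, changed_at)
        self._published: Dict[str, str] = {}              # user -> last published status

    def observe(self, user_id: str, status: str, now: float) -> None:
        pending = self._pending.get(user_id)
        if pending is None or pending[0] != status:
            # A new or changed status restarts the quiet period.
            self._pending[user_id] = (status, now)

    def flush(self, now: float) -> None:
        # Intended to be called periodically, e.g. from a timer.
        for user_id, (status, changed_at) in list(self._pending.items()):
            if now - changed_at < self._quiet:
                continue  # still settling
            del self._pending[user_id]
            if self._published.get(user_id) != status:
                self._published[user_id] = status
                self._publish(user_id, status)
```

A user who drops and reconnects within the quiet period ends up back at the already published status, so subscribers see nothing.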
End-to-end encryption
The client encrypts with the recipient’s public key (Signal protocol; keys exchanged via
a key server). The server stores/routes only ciphertext — body above is opaque to
it. Media is encrypted client-side and uploaded to the blob store; only the key (sent in
the message) decrypts it.
Scale and failure handling
- Connections → many connection servers, each with millions of sockets; the registry routes between them; a message broker (pub/sub) fans messages across servers.
- Connection server dies → its users’ sessions expire (TTL); clients reconnect to another server and resync from their last sequence (no loss — messages were persisted).
- Recipient offline → inbox holds messages durably until reconnect.
- Ordering across servers → the per-conversation sequencer is the single ordering authority for that conversation.
- Duplicate sends → client-msg-id dedup.
The takeaway
Concrete signals: connection servers + a session registry for routing, per-conversation sequence numbers for ordering, persist-then-deliver with an offline inbox for reliability, receipts/presence over the same channel, group fan-out, and E2E ciphertext routing. Persistent connections + reliable ordered delivery is the reusable real-time backbone (Zoom signaling and live features build on it).