System design course
Ch.4 · Designing real systems · how to build it · 8 min read

Building a chat system (WhatsApp)

Implement the connection servers and session registry, ordered persisted delivery with offline inboxes, receipts, and group fan-out.


Connection servers and the session registry

Each connection server holds millions of long-lived WebSockets; a registry maps each user to the server that holds their socket, so a message can be routed to the right machine:

# on connect
def on_connect(user_id, conn):
    connections[user_id] = conn                      # local: user -> socket on THIS server
    registry.set(f"session:{user_id}", this_server_id, ttl="...")  # global: user -> server
    presence.publish(user_id, "online")

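def on_disconnect(user_id):
    connections.pop(user_id, None)                   # tolerate a repeated disconnect event
    # Clear the session only if it still points at this server: the user may
    # already have reconnected elsewhere and overwritten the entry.
    # (A real store would do this compare-and-delete atomically.)
    if registry.get(f"session:{user_id}") == this_server_id:
        registry.delete(f"session:{user_id}")
        presence.publish(user_id, "offline", last_seen=now())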
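
To make the registry's role concrete, here is a minimal in-memory sketch of a TTL-based session map. It is an illustration, not the course's implementation: the class name SessionRegistry, the delete_if_owner helper, the injectable clock, and the 30-second TTL used in the usage lines are all assumptions. A production registry would live in a shared store and be refreshed by heartbeats.

```python
import time


class SessionRegistry:
    """In-memory stand-in for the global user -> server map (hypothetical)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (server_id, expires_at)

    def set(self, key, server_id, ttl):
        # Overwrites any previous owner; ttl is in seconds.
        self._entries[key] = (server_id, self._clock() + ttl)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        server_id, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # lazily expire a stale session
            return None
        return server_id

    def delete_if_owner(self, key, server_id):
        """Delete the entry only if it still points at server_id."""
        if self.get(key) == server_id:
            del self._entries[key]
            return True
        return False


class FakeClock:
    """Manually advanced clock so expiry can be shown without sleeping."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
registry = SessionRegistry(clock)
registry.set("session:alice", "cs-1", ttl=30)   # 30 s is an arbitrary example value
clock.now = 10
registry.set("session:alice", "cs-2", ttl=30)   # alice reconnected to another server
stale_delete = registry.delete_if_owner("session:alice", "cs-1")  # late disconnect from cs-1
```

The late disconnect from cs-1 leaves the newer cs-2 entry alone, and an entry that is never refreshed disappears once its TTL passes, which is what lets clients of a dead server be treated as offline.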

Sending a message (persist, order, route)

def send_message(sender, convo_id, ciphertext, client_msg_id):
    seq = sequencer.next(convo_id)                   # per-conversation ordering
    msg = {"id": snowflake(), "convo": convo_id, "from": sender, "seq": seq,
           "client_msg_id": client_msg_id,           # carried so receivers can dedup resends
           "body": ciphertext, "ts": now()}
    message_store.append(convo_id, msg)              # persist first (durability + history)
    for recipient in members(convo_id) - {sender}:
        deliver(recipient, msg)                      # deliver only after the write succeeds
    return msg["id"]

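def deliver(recipient, msg):
    server = registry.get(f"session:{recipient}")
    # Assumption: route_to returns False when the push cannot be handed off
    # (for example, the session entry is stale because the server just died).
    if server and route_to(server, recipient, msg):  # online: pushed to their connection server
        return
    inbox.append(recipient, msg)                     # offline or push failed: store for reconnect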

Sequence numbers give per-conversation ordering; persisting before delivery means a crash never loses an accepted message.
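
As a rough sketch of what the sequencer provides, the block below hands out gap-free, increasing numbers per conversation and shows how a receiver could use them to spot missing messages. The Sequencer class and the find_gaps helper are hypothetical names; a real sequencer would be durable and partitioned by conversation rather than a single in-process counter.

```python
import threading
from collections import defaultdict


class Sequencer:
    """Single-process sketch of a per-conversation sequence authority."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = defaultdict(int)  # convo_id -> last issued seq

    def next(self, convo_id):
        # The lock makes increment-and-read atomic across threads.
        with self._lock:
            self._counters[convo_id] += 1
            return self._counters[convo_id]


def find_gaps(last_seq, received_seqs):
    """Return the sequence numbers missing between last_seq and the highest seen."""
    seen = set(received_seqs)
    if not seen:
        return []
    return [s for s in range(last_seq + 1, max(seen) + 1) if s not in seen]


sequencer = Sequencer()
first_three = [sequencer.next("convo-a") for _ in range(3)]
other_convo = sequencer.next("convo-b")   # each conversation counts independently
missing = find_gaps(3, [4, 6, 7])         # the client has 1..3 and just received 4, 6, 7
```

Because numbers are dense within a conversation, a client that sees 4, 6, 7 after 3 knows that 5 is missing and can ask for it instead of silently rendering out of order.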

Offline delivery and sync

On reconnect, the client tells the server its last-acked sequence; the server flushes the backlog:

def on_resync(user_id, convo_id, last_seq):
    backlog = inbox.since(user_id, convo_id, last_seq)  # everything after last_seq
    for msg in backlog:                                 # push in sequence order
        push(user_id, msg)
    # once the client acks, the server marks these delivered and removes them from the inbox
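
The resync flow depends on an inbox that can return "everything after sequence N" and forget messages once they are acknowledged. The following is a minimal in-memory sketch under assumptions: the Inbox class, its ack method, and the cumulative-ack behaviour are illustrative choices, not something the text above specifies.

```python
from collections import defaultdict


class Inbox:
    """Per-user, per-conversation queue of undelivered messages (hypothetical)."""

    def __init__(self):
        # (user_id, convo_id) -> {seq: msg}; keying by seq makes appends idempotent
        self._pending = defaultdict(dict)

    def append(self, user_id, msg):
        self._pending[(user_id, msg["convo"])][msg["seq"]] = msg

    def since(self, user_id, convo_id, last_seq):
        """Return pending messages with seq > last_seq, in sequence order."""
        pending = self._pending.get((user_id, convo_id), {})
        return [pending[s] for s in sorted(pending) if s > last_seq]

    def ack(self, user_id, convo_id, up_to_seq):
        """Cumulative ack: drop everything at or below up_to_seq."""
        pending = self._pending.get((user_id, convo_id), {})
        for s in [s for s in pending if s <= up_to_seq]:
            del pending[s]


inbox = Inbox()
for seq in (3, 1, 2):                      # arrival order need not match sequence order
    inbox.append("bob", {"convo": "c1", "seq": seq, "body": b"..."})

backlog = [m["seq"] for m in inbox.since("bob", "c1", last_seq=1)]
inbox.ack("bob", "c1", up_to_seq=2)        # bob confirms he has everything through 2
remaining = [m["seq"] for m in inbox.since("bob", "c1", last_seq=0)]
```

Messages stay in the inbox until acknowledged, so a connection that drops mid-flush only causes a repeat push on the next resync, which the receiver's dedup absorbs.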

Delivery and read receipts

Receipts are tiny messages routed back to the sender:

def on_client_ack(user_id, msg_id, kind):            # kind = "delivered" | "read"
    msg = message_store.get(msg_id)
    deliver(msg["from"], {"type": "receipt", "msg_id": msg_id,
                          "by": user_id, "kind": kind})

Delivery is at-least-once, so a message resent after a flaky network can arrive more than once; the receiver deduplicates on client_msg_id, so resends never show up as duplicate messages.
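
Receiver-side dedup can be sketched as a bounded "seen" set keyed by sender and client_msg_id. The Deduper class, the (sender, client_msg_id) key, and the capacity parameter are assumptions made for illustration; the bounded window also means a very old resend could slip through, which is a trade-off of this sketch rather than a property stated in the text.

```python
from collections import OrderedDict


class Deduper:
    """Remembers recently seen (sender, client_msg_id) pairs, oldest evicted first."""

    def __init__(self, capacity=10_000):
        self._seen = OrderedDict()
        self._capacity = capacity

    def first_time(self, sender, client_msg_id):
        """Return True if this message has not been seen recently."""
        key = (sender, client_msg_id)
        if key in self._seen:
            self._seen.move_to_end(key)      # refresh recency on a repeat
            return False
        self._seen[key] = True
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)   # evict the least recently seen key
        return True


dedup = Deduper(capacity=2)
results = [
    dedup.first_time("alice", "m1"),   # new message
    dedup.first_time("alice", "m1"),   # resend after a timeout: dropped
    dedup.first_time("bob", "m1"),     # same id from another sender is a different message
]
```

A client would call first_time before rendering an incoming message and skip anything that returns False, while still acknowledging it so the server stops resending.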

Group fan-out

A group send writes once and fans out to each member’s delivery path (online push or offline inbox), reusing deliver. Receipts aggregate per member (“delivered to 8/10”). Cap fan-out for very large groups (rate-limit, or treat as broadcast channels).
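
The "delivered to 8/10" aggregation can be sketched as a per-message set of members who have acknowledged each receipt kind. ReceiptTracker is a hypothetical name, and treating a read receipt as implying delivery is an assumption of this sketch.

```python
class ReceiptTracker:
    """Aggregates delivered/read receipts for one group message (hypothetical)."""

    def __init__(self, msg_id, recipients):
        self.msg_id = msg_id
        self.recipients = frozenset(recipients)
        self._by_kind = {"delivered": set(), "read": set()}

    def record(self, user_id, kind):
        if user_id not in self.recipients:
            return                                    # ignore receipts from non-members
        self._by_kind[kind].add(user_id)              # sets make duplicate receipts harmless
        if kind == "read":
            self._by_kind["delivered"].add(user_id)   # assumption: read implies delivered

    def summary(self, kind):
        return f"{kind} to {len(self._by_kind[kind])}/{len(self.recipients)}"


tracker = ReceiptTracker("m1", [f"u{i}" for i in range(10)])
for i in range(8):
    tracker.record(f"u{i}", "delivered")
tracker.record("u0", "delivered")          # duplicate receipt, counted once
tracker.record("stranger", "delivered")    # not in the group, ignored
delivered_summary = tracker.summary("delivered")
```

Storing member sets rather than counters keeps the aggregate correct under at-least-once receipt delivery, at the cost of memory proportional to group size.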

Presence

Connection servers publish connect/disconnect to a pub/sub channel; a contact subscribed to you receives your online/last-seen. Presence is high-churn, so it’s best-effort, debounced, and rate-limited (don’t let presence storms dominate traffic).
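
One way to debounce presence is to hold each change for a quiet period and publish only the state that survives it, so a client flapping between online and offline produces one update instead of many. This is a sketch under assumptions: PresenceDebouncer, observe, flush, and the 5-second quiet period in the usage lines are illustrative, and timestamps are passed in explicitly to keep the example deterministic.

```python
class PresenceDebouncer:
    """Trailing-edge debounce: publish a status only after it has been stable."""

    def __init__(self, quiet_period):
        self._quiet = quiet_period
        self._pending = {}    # user_id -> (status, changed_at)
        self._published = {}  # user_id -> status last sent to subscribers

    def observe(self, user_id, status, now):
        # Later observations overwrite earlier ones and restart the quiet period.
        self._pending[user_id] = (status, now)

    def flush(self, now):
        """Return (user_id, status) pairs that are stable and differ from what was sent."""
        out = []
        for user_id, (status, changed_at) in list(self._pending.items()):
            if now - changed_at < self._quiet:
                continue                       # still settling
            del self._pending[user_id]
            if self._published.get(user_id) != status:
                self._published[user_id] = status
                out.append((user_id, status))
        return out


debouncer = PresenceDebouncer(quiet_period=5)
debouncer.observe("alice", "online", now=0)
debouncer.observe("alice", "offline", now=1)   # flaky network
debouncer.observe("alice", "online", now=2)
too_early = debouncer.flush(now=4)
settled = debouncer.flush(now=7)
```

Three raw events collapse into a single "online" update, and a later flush publishes nothing unless the settled status actually changed.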

End-to-end encryption

The client encrypts each message for the recipient using the Signal protocol: clients fetch each other's public keys from a key server and derive shared session keys from them. The server stores and routes only ciphertext; body above is opaque to it. Media is encrypted client-side and uploaded to the blob store; the media key travels inside the encrypted message, so only the recipients can decrypt it.

Scale and failure handling

  • Connections → many connection servers, each with millions of sockets; the registry routes between them; a message broker (pub/sub) fans messages across servers.
  • Connection server dies → its users’ sessions expire (TTL); clients reconnect to another server and resync from their last sequence (no loss — messages were persisted).
  • Recipient offline → inbox holds messages durably until reconnect.
  • Ordering across servers → the per-conversation sequencer is the single ordering authority for that conversation.
  • Duplicate sends → dedup on client_msg_id.

The takeaway

Concrete signals: connection servers + a session registry for routing, per-conversation sequence numbers for ordering, persist-then-deliver with an offline inbox for reliability, receipts/presence over the same channel, group fan-out, and E2E ciphertext routing. Persistent connections + reliable ordered delivery form the reusable real-time backbone (Zoom signaling and live features build on it).