System design course
Ch. 4 · Designing real systems · concept · 9 min read

Designing a chat system (WhatsApp)

Real-time 1:1 and group messaging — persistent connections, guaranteed ordered delivery, online/offline handling, receipts, and end-to-end encryption.


The problem

Design WhatsApp/Messenger: real-time 1:1 and group messaging with delivery to online and offline users, delivery/read receipts, online presence, and end-to-end encryption. The crux is persistent connections at scale and reliable, ordered delivery.

Step 1 — Requirements

Functional: send/receive messages in real time (1:1 and groups); deliver to offline users when they reconnect; delivery + read receipts; online/last-seen presence; media messages; E2E encryption.

Non-functional: low latency; reliable delivery (no lost messages, delivered in order); a massive number of concurrent connections (the user base runs into the billions); high availability; and privacy.

Step 2 — Persistent connections

Messaging needs the server to push instantly, so clients hold a persistent connection (WebSocket / long-lived TCP) to a connection/gateway server.

  • A fleet of connection servers, each holding millions of open connections.
  • A session registry (in Redis) maps user_id → the connection server currently holding that user’s connection, so the system knows where to route a message.
sender → conn server A → route via session registry → conn server B → recipient
                        └─ persist message (for ordering, receipts, offline)
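
The routing decision above can be sketched as follows. This is a minimal illustration, not a production design: a plain dictionary stands in for Redis, and the names `SessionRegistry`, `route`, `conn-A`, and `conn-B` are hypothetical.

```python
from typing import Dict, Optional


class SessionRegistry:
    """In-memory stand-in for the Redis-backed user_id -> connection server map."""

    def __init__(self) -> None:
        self._server_by_user: Dict[str, str] = {}

    def connect(self, user_id: str, server_id: str) -> None:
        # Record which connection server now holds this user's socket.
        self._server_by_user[user_id] = server_id

    def disconnect(self, user_id: str, server_id: str) -> None:
        # Only clear the entry if it still points at this server, so a stale
        # disconnect cannot erase a newer connection made elsewhere.
        if self._server_by_user.get(user_id) == server_id:
            del self._server_by_user[user_id]

    def lookup(self, user_id: str) -> Optional[str]:
        return self._server_by_user.get(user_id)


def route(registry: SessionRegistry, recipient_id: str) -> str:
    """Decide where a message goes: a live connection server or the offline inbox."""
    server = registry.lookup(recipient_id)
    return f"push via {server}" if server else "store in inbox"


registry = SessionRegistry()
registry.connect("alice", "conn-A")
registry.connect("bob", "conn-B")
print(route(registry, "bob"))   # push via conn-B
registry.disconnect("bob", "conn-B")
print(route(registry, "bob"))   # store in inbox
```

The guarded `disconnect` matters on mobile networks, where a client may reconnect to a new server before the old server notices the dropped socket.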

Step 3 — Message flow and delivery guarantees

  1. The sender’s connection server receives the message, assigns it a per-conversation sequence number (for ordering), and persists it.
  2. It looks up the recipient in the session registry:
    • Online → push to the recipient’s connection server → deliver instantly.
    • Offline → store in the recipient’s message queue/inbox (in a DB) for delivery on reconnect.
  3. The recipient acks delivery; the sender gets a delivery receipt, then a read receipt once the recipient opens the message.

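Ordering comes from per-conversation sequence numbers. Delivery is at-least-once: the server retries until the recipient acks, so nothing is lost. Retries can produce duplicates, so the client deduplicates by message id and each message is applied only once.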
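
A minimal sketch of sequence numbering plus client-side dedup, under simplifying assumptions: a single in-process `Sequencer` stands in for whatever component assigns sequence numbers, and the class and field names are hypothetical.

```python
import itertools
import uuid
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterator, List, Set


@dataclass(frozen=True)
class Message:
    message_id: str        # unique id, used for dedup
    conversation_id: str
    seq: int               # per-conversation sequence number, used for ordering
    body: str


class Sequencer:
    """Hands out increasing sequence numbers, one counter per conversation."""

    def __init__(self) -> None:
        self._counters: Dict[str, Iterator[int]] = defaultdict(
            lambda: itertools.count(1)
        )

    def stamp(self, conversation_id: str, body: str) -> Message:
        seq = next(self._counters[conversation_id])
        return Message(str(uuid.uuid4()), conversation_id, seq, body)


class Client:
    """Receiving side: always acks, but applies each message id only once."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self.delivered: List[Message] = []

    def receive(self, msg: Message) -> str:
        if msg.message_id not in self._seen:
            self._seen.add(msg.message_id)
            self.delivered.append(msg)
        return msg.message_id  # the ack; sent even for duplicates

    def ordered(self) -> List[str]:
        # Display order follows the sequence number, not arrival order.
        return [m.body for m in sorted(self.delivered, key=lambda m: m.seq)]


sequencer = Sequencer()
m1 = sequencer.stamp("alice:bob", "hi")
m2 = sequencer.stamp("alice:bob", "are you there?")

bob = Client()
for m in (m2, m1, m2):   # arrives out of order, with one retried duplicate
    bob.receive(m)
print(bob.ordered())     # ['hi', 'are you there?']
```

Acking duplicates is deliberate: if the first ack was lost, the server keeps retrying until one gets through.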

Step 4 — Offline users and sync

Undelivered messages sit in the recipient’s persistent inbox. On reconnect, the client asks for everything after its last-acked sequence number; the server delivers that backlog and, once the client acks, marks those messages delivered. This is what makes messaging reliable across flaky mobile networks.
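
The reconnect sync can be sketched like this. It is an in-memory illustration only: a real inbox would live in a database, and `Inbox`, `pull_since`, and `ack` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

InboxKey = Tuple[str, str]   # (user_id, conversation_id)
Entry = Tuple[int, str]      # (seq, body)


class Inbox:
    """Holds undelivered messages per user and conversation."""

    def __init__(self) -> None:
        self._pending: Dict[InboxKey, List[Entry]] = defaultdict(list)

    def store(self, user_id: str, conversation_id: str, seq: int, body: str) -> None:
        self._pending[(user_id, conversation_id)].append((seq, body))

    def pull_since(
        self, user_id: str, conversation_id: str, last_acked_seq: int
    ) -> List[Entry]:
        # Return the backlog after the client's last-acked sequence number,
        # sorted so the client can apply it in order.
        entries = self._pending.get((user_id, conversation_id), [])
        return sorted(e for e in entries if e[0] > last_acked_seq)

    def ack(self, user_id: str, conversation_id: str, acked_seq: int) -> None:
        # Drop everything the client has confirmed; anything newer stays
        # queued and is re-sent on the next pull.
        key = (user_id, conversation_id)
        self._pending[key] = [e for e in self._pending[key] if e[0] > acked_seq]


inbox = Inbox()
for seq, body in [(1, "a"), (2, "b"), (3, "c")]:
    inbox.store("bob", "alice:bob", seq, body)

print(inbox.pull_since("bob", "alice:bob", last_acked_seq=1))  # [(2, 'b'), (3, 'c')]
inbox.ack("bob", "alice:bob", 3)
print(inbox.pull_since("bob", "alice:bob", last_acked_seq=0))  # []
```

Messages are removed only on ack, not on send, so a connection that drops mid-sync loses nothing: the next pull returns the same backlog.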

Step 5 — Group chat (fan-out)

A group message fans out to all members:

  • Small/medium groups → fan-out-on-write: deliver to each member (online push or offline inbox).
  • The sender writes once; the server replicates to each member’s delivery path. Receipts aggregate per member.
  • Very large broadcast groups need care (rate limits, partial fan-out) — note the scaling limit.
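
Fan-out-on-write for a group can be sketched as below. This is an illustrative sketch: `push` and `store` are hypothetical callbacks standing in for the online delivery path and the offline inbox, and the `max_members` cap of 256 is an arbitrary placeholder for whatever limit a real system enforces.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple


def fan_out(
    sender: str,
    members: Iterable[str],
    online: Set[str],
    body: str,
    push: Callable[[str, str], None],
    store: Callable[[str, str], None],
    max_members: int = 256,   # arbitrary illustrative limit
) -> Dict[str, str]:
    """Deliver one group message to every member except the sender.

    Returns a per-member status, which is the raw material for aggregated
    receipts.
    """
    recipients = [m for m in members if m != sender]
    if len(recipients) > max_members:
        raise ValueError("group too large for fan-out-on-write")

    status: Dict[str, str] = {}
    for member in recipients:
        if member in online:
            push(member, body)    # live connection: deliver now
            status[member] = "pushed"
        else:
            store(member, body)   # offline: queue for reconnect
            status[member] = "queued"
    return status


pushed: List[Tuple[str, str]] = []
queued: List[Tuple[str, str]] = []

result = fan_out(
    sender="alice",
    members=["alice", "bob", "carol"],
    online={"bob"},
    body="lunch?",
    push=lambda user, text: pushed.append((user, text)),
    store=lambda user, text: queued.append((user, text)),
)
print(result)   # {'bob': 'pushed', 'carol': 'queued'}
```

The cost is one delivery per member per message, which is why the work grows with group size and very large groups need a different strategy.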

Step 6 — Presence and receipts

  • Presence (online/last-seen) — connection servers track connect/disconnect (heartbeats); presence is published to interested contacts (pub/sub). It’s high-churn, so often best-effort and rate-limited.
  • Receipts — delivery (reached device) and read (opened) are themselves small messages routed back to the sender.
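
Heartbeat-based presence can be sketched as follows. This is a minimal illustration: the 30-second timeout is an arbitrary value, `PresenceTracker` is a hypothetical name, and the current time is passed in explicitly so the example is deterministic.

```python
from typing import Dict, Optional


class PresenceTracker:
    """Treats a user as online while heartbeats keep arriving within a timeout."""

    def __init__(self, timeout_s: float = 30.0) -> None:
        self._timeout_s = timeout_s
        self._last_heartbeat: Dict[str, float] = {}

    def heartbeat(self, user_id: str, now: float) -> None:
        # Called by the connection server whenever the client pings.
        self._last_heartbeat[user_id] = now

    def is_online(self, user_id: str, now: float) -> bool:
        last = self._last_heartbeat.get(user_id)
        return last is not None and now - last <= self._timeout_s

    def last_seen(self, user_id: str) -> Optional[float]:
        # The most recent heartbeat doubles as the last-seen timestamp.
        return self._last_heartbeat.get(user_id)


presence = PresenceTracker(timeout_s=30.0)
presence.heartbeat("bob", now=100.0)
print(presence.is_online("bob", now=120.0))   # True
print(presence.is_online("bob", now=140.0))   # False
print(presence.last_seen("bob"))              # 100.0
```

Because a missed heartbeat only flips the status after the timeout, presence lags reality slightly, which fits the best-effort treatment described above.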

Step 7 — End-to-end encryption

Messages are encrypted on the sender’s device and are decryptable only by the recipient (Signal protocol: per-conversation keys established via a key exchange). The server routes ciphertext and can’t read content, so server-side features such as search work on metadata only. Mention this as a core requirement, not an afterthought.
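
To illustrate the idea of a key exchange, here is a toy sketch of classic finite-field Diffie-Hellman. It is not the Signal protocol (which uses elliptic-curve key agreement and key ratcheting), and the parameters are deliberately small and insecure; all names are hypothetical. The only point is that both devices derive the same key while the server sees nothing but public values.

```python
import hashlib
import secrets
from typing import Tuple

# Toy parameters for illustration only: far too small for real use.
P = 2**127 - 1   # a Mersenne prime
G = 3


def keypair() -> Tuple[int, int]:
    """Generate a (private, public) pair; only the public half leaves the device."""
    private = secrets.randbelow(P - 3) + 2   # uniform in [2, P - 2]
    public = pow(G, private, P)
    return private, public


def shared_key(my_private: int, their_public: int) -> bytes:
    """Combine my private value with the peer's public value into a 32-byte key."""
    secret = pow(their_public, my_private, P)
    return hashlib.sha256(secret.to_bytes(16, "big")).digest()


alice_private, alice_public = keypair()
bob_private, bob_public = keypair()

# The server relays alice_public and bob_public but never sees a private value.
alice_key = shared_key(alice_private, bob_public)
bob_key = shared_key(bob_private, alice_public)
print(alice_key == bob_key)   # True
```

A real client would rely on a vetted cryptographic library rather than hand-rolled arithmetic like this.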

Trade-offs to raise

  • Persistent connections (instant, stateful, hard to scale) vs polling: messaging needs persistent connections.
  • At-least-once + dedup vs exactly-once.
  • Fan-out-on-write vs fan-out-on-read for groups: write gives timely delivery, but huge groups need limits.
  • E2E encryption (privacy) vs server-side features (search/backup) — a real tension.

The interview cue

“Clients hold WebSocket connections to connection servers; a session registry routes messages; messages get per-conversation sequence numbers (ordering), are persisted, pushed if the recipient is online or queued in their inbox if offline; delivery/read receipts and presence ride the same channel; groups fan out to members; content is E2E-encrypted so the server routes only ciphertext.” Persistent connections + reliable ordered delivery + offline inbox is the core; implementation next.