Designing a chat system (WhatsApp)
Real-time 1:1 and group messaging — persistent connections, guaranteed ordered delivery, online/offline handling, receipts, and end-to-end encryption.
The problem
Design WhatsApp/Messenger: real-time 1:1 and group messaging with delivery to online and offline users, delivery/read receipts, online presence, and end-to-end encryption. The crux is persistent connections at scale and reliable, ordered delivery.
Step 1 — Requirements
Functional: send/receive messages in real time (1:1 and groups); deliver to offline users when they reconnect; delivery + read receipts; online/last-seen presence; media messages; E2E encryption.
Non-functional: low latency; reliable delivery (no lost messages, delivered in order); a very large number of concurrent connections (the user base is in the billions); high availability; and privacy.
Step 2 — Persistent connections
Messaging needs the server to push instantly, so clients hold a persistent connection (WebSocket / long-lived TCP) to a connection/gateway server.
- A fleet of connection servers holds the open connections, with each server keeping up to millions of them.
- A session registry (in Redis) maps user_id → the connection server that user is on, so the system knows where to route a message.
sender → conn server A → route via session registry → conn server B → recipient
         └─ persist message (for ordering, receipts, offline)
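The routing decision above can be sketched in a few lines. This is a minimal illustration, not a production design: a plain dict stands in for Redis, and the names `SessionRegistry`, `route`, and the server ids `conn-A`/`conn-B` are hypothetical.

```python
from typing import Dict, Optional


class SessionRegistry:
    """Maps user_id -> connection server id. A dict stands in for Redis here."""

    def __init__(self) -> None:
        self._sessions: Dict[str, str] = {}

    def connect(self, user_id: str, server_id: str) -> None:
        self._sessions[user_id] = server_id

    def disconnect(self, user_id: str, server_id: str) -> None:
        # Only clear the entry if it still points at this server, so a stale
        # disconnect cannot erase a newer session on another server.
        if self._sessions.get(user_id) == server_id:
            del self._sessions[user_id]

    def lookup(self, user_id: str) -> Optional[str]:
        return self._sessions.get(user_id)


def route(registry: SessionRegistry, recipient_id: str) -> str:
    """Return the routing decision for a message addressed to recipient_id."""
    server = registry.lookup(recipient_id)
    return f"push via {server}" if server else "store in inbox"


registry = SessionRegistry()
registry.connect("alice", "conn-A")
registry.connect("bob", "conn-B")
print(route(registry, "bob"))   # push via conn-B
registry.disconnect("bob", "conn-B")
print(route(registry, "bob"))   # store in inbox
```

The guarded `disconnect` is one way to handle a client that reconnects to a different server before the old server notices the dropped socket. A real registry would likely also put a TTL on each entry.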
Step 3 — Message flow and delivery guarantees
- Sender’s connection server receives the message, assigns a per-conversation sequence number (ordering), and persists it.
- It looks up the recipient in the session registry:
  - Online → push to the recipient’s connection server → deliver instantly.
  - Offline → store in the recipient’s message queue/inbox (in a DB) for delivery on reconnect.
- The recipient acks delivery; the sender gets a delivery receipt, then a read receipt when opened.
Ordering comes from per-conversation sequence numbers. At-least-once delivery (the server retries until the recipient acks) ensures nothing is lost, and client-side dedup by message id discards the duplicates that retries produce.
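A small sketch of those two mechanisms, assuming an in-memory counter per conversation and a UUID as the message id. The `Sequencer` and `Client` classes are illustrative names; a real system would keep the counter in durable storage.

```python
import itertools
import uuid
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterator, List, Set


@dataclass(frozen=True)
class Message:
    message_id: str
    conversation_id: str
    seq: int
    body: str


class Sequencer:
    """Hands out increasing sequence numbers, one counter per conversation."""

    def __init__(self) -> None:
        self._counters: Dict[str, Iterator[int]] = defaultdict(lambda: itertools.count(1))

    def stamp(self, conversation_id: str, body: str) -> Message:
        seq = next(self._counters[conversation_id])
        return Message(str(uuid.uuid4()), conversation_id, seq, body)


class Client:
    """Receiver that tolerates redelivery and out-of-order arrival."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self._messages: List[Message] = []

    def receive(self, msg: Message) -> bool:
        """Return True if the message was new, False if it was a duplicate."""
        if msg.message_id in self._seen:
            return False
        self._seen.add(msg.message_id)
        self._messages.append(msg)
        return True

    def ordered(self, conversation_id: str) -> List[str]:
        msgs = [m for m in self._messages if m.conversation_id == conversation_id]
        return [m.body for m in sorted(msgs, key=lambda m: m.seq)]


sequencer = Sequencer()
first = sequencer.stamp("alice:bob", "hi")
second = sequencer.stamp("alice:bob", "are you there?")

client = Client()
client.receive(second)              # arrives out of order
client.receive(first)
client.receive(second)              # server retry: ignored as a duplicate
print(client.ordered("alice:bob"))  # ['hi', 'are you there?']
```

The effect is exactly-once display built on at-least-once transport: the server may resend freely, and the client sorts by `seq` and drops message ids it has already seen.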
Step 4 — Offline users and sync
Undelivered messages sit in the recipient’s persistent inbox. On reconnect, the client asks for everything after its last-acked sequence number; the server delivers the backlog and marks those messages delivered once the client acks them. This is what makes messaging reliable across flaky mobile networks.
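The reconnect sync might look roughly like this. The sketch assumes a single sequence per inbox for simplicity; with per-conversation numbering the client would send one cursor per conversation. The `Inbox` class and its method names are hypothetical, and a dict of lists stands in for the database.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class Inbox:
    """Per-user inbox of undelivered (seq, body) pairs."""

    def __init__(self) -> None:
        self._pending: Dict[str, List[Tuple[int, str]]] = defaultdict(list)

    def enqueue(self, user_id: str, seq: int, body: str) -> None:
        self._pending[user_id].append((seq, body))

    def pull_since(self, user_id: str, last_acked_seq: int) -> List[Tuple[int, str]]:
        """Return messages with seq > last_acked_seq, in sequence order."""
        return sorted(m for m in self._pending[user_id] if m[0] > last_acked_seq)

    def ack(self, user_id: str, acked_seq: int) -> None:
        """Drop everything up to and including acked_seq once the client confirms."""
        self._pending[user_id] = [m for m in self._pending[user_id] if m[0] > acked_seq]


inbox = Inbox()
for seq, body in [(1, "a"), (2, "b"), (3, "c")]:
    inbox.enqueue("bob", seq, body)

# Bob reconnects having already acked seq 1.
backlog = inbox.pull_since("bob", last_acked_seq=1)
print(backlog)                      # [(2, 'b'), (3, 'c')]

# Nothing is removed until the ack arrives, so a crash mid-delivery is safe.
inbox.ack("bob", acked_seq=3)
print(inbox.pull_since("bob", 1))   # []
```

Deleting only on ack is what gives at-least-once behaviour: if the connection drops during the backlog push, the same messages are returned by the next pull.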
Step 5 — Group chat (fan-out)
A group message fans out to all members:
- Small/medium groups → fan-out-on-write: deliver to each member (online push or offline inbox).
- The sender writes once; the server replicates to each member’s delivery path. Receipts aggregate per member.
- Very large broadcast groups need care (rate limits, partial fan-out) — note the scaling limit.
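As a rough sketch of fan-out-on-write, the function below plans delivery for one group message and rejects groups above a cap. `MAX_FANOUT_MEMBERS` and its value are made up for illustration, as are the function names; the source does not specify a limit.

```python
from typing import Dict, Iterable, List, Set

MAX_FANOUT_MEMBERS = 1000  # hypothetical cap, not a figure from any real system


def fan_out(sender: str, members: Iterable[str], online: Set[str]) -> Dict[str, List[str]]:
    """Plan delivery of one group message: push to online members, queue the rest."""
    recipients = sorted(m for m in set(members) if m != sender)
    if len(recipients) > MAX_FANOUT_MEMBERS:
        raise ValueError("group too large for fan-out-on-write")
    plan: Dict[str, List[str]] = {"push": [], "inbox": []}
    for member in recipients:
        plan["push" if member in online else "inbox"].append(member)
    return plan


def receipt_summary(recipients: List[str], delivered: Set[str]) -> Dict[str, int]:
    """Aggregate per-member delivery receipts into counts for the sender."""
    done = sum(1 for m in recipients if m in delivered)
    return {"delivered": done, "pending": len(recipients) - done}


plan = fan_out("alice", ["alice", "bob", "carol", "dave"], online={"bob", "dave"})
print(plan)  # {'push': ['bob', 'dave'], 'inbox': ['carol']}
print(receipt_summary(plan["push"] + plan["inbox"], delivered={"bob", "dave"}))
# {'delivered': 2, 'pending': 1}
```

The work per message grows linearly with group size, which is why very large groups need a different path than this one.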
Step 6 — Presence and receipts
- Presence (online/last-seen) — connection servers track connect/disconnect (heartbeats); presence is published to interested contacts (pub/sub). It’s high-churn, so often best-effort and rate-limited.
- Receipts — delivery (reached device) and read (opened) are themselves small messages routed back to the sender.
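Heartbeat-based presence can be sketched as below. The 30-second timeout and the `PresenceTracker` name are assumptions for illustration, and timestamps are passed in explicitly to keep the example deterministic.

```python
from typing import Dict, Optional


class PresenceTracker:
    """Treats a user as online while heartbeats keep arriving within a timeout."""

    def __init__(self, timeout_s: float = 30.0) -> None:  # timeout is an assumed value
        self._timeout_s = timeout_s
        self._last_heartbeat: Dict[str, float] = {}

    def is_online(self, user_id: str, now: float) -> bool:
        last = self._last_heartbeat.get(user_id)
        return last is not None and now - last <= self._timeout_s

    def heartbeat(self, user_id: str, now: float) -> bool:
        """Record a heartbeat; return True if the user just came online."""
        came_online = not self.is_online(user_id, now)
        self._last_heartbeat[user_id] = now
        return came_online

    def last_seen(self, user_id: str) -> Optional[float]:
        return self._last_heartbeat.get(user_id)


presence = PresenceTracker()
print(presence.heartbeat("bob", now=100.0))   # True: offline -> online, worth publishing
print(presence.heartbeat("bob", now=110.0))   # False: still online, nothing to publish
print(presence.is_online("bob", now=120.0))   # True
print(presence.is_online("bob", now=141.0))   # False: 31 s since the last heartbeat
print(presence.last_seen("bob"))              # 110.0
```

Publishing only on transitions, rather than on every heartbeat, is one simple way to keep presence traffic down given how often mobile clients connect and disconnect.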
Step 7 — End-to-end encryption
Messages are encrypted on the sender’s device and can be decrypted only by the recipient (Signal protocol: per-conversation keys established via a key exchange). The server routes ciphertext and can’t read content, so server-side features such as search can work on metadata only. Mention this as a core requirement, not an afterthought.
Trade-offs to raise
- Persistent connections (instant, but stateful and hard to scale) vs polling: real-time messaging needs persistent connections.
- At-least-once + dedup vs exactly-once.
- Fan-out-on-write vs fan-out-on-read for groups: write gives timely delivery, but huge groups need limits.
- E2E encryption (privacy) vs server-side features (search/backup) — a real tension.
The interview cue
“Clients hold WebSocket connections to connection servers; a session registry routes messages; messages get per-conversation sequence numbers (ordering), are persisted, pushed if the recipient is online or queued in their inbox if offline; delivery/read receipts and presence ride the same channel; groups fan out to members; content is E2E-encrypted so the server routes only ciphertext.” Persistent connections + reliable ordered delivery + offline inbox is the core; implementation next.