Designing a video conferencing system (Zoom)
Real-time multi-party audio/video — WebRTC media transport, the SFU that scales group calls, signaling, and adapting to bad networks.
The problem
Design Zoom/Google Meet: real-time audio/video calls with many participants, low latency, screen sharing, and graceful behavior on poor networks. This is a real-time media problem — fundamentally different from request/response systems — centered on WebRTC and media routing.
Step 1 — Requirements
Functional: multi-party A/V calls; join/leave; mute/camera; screen share; chat; ideally recording. Works across devices and networks.
Non-functional: very low latency (roughly under 150 ms one-way, mouth to ear, for natural conversation), scale (many participants per call, many concurrent calls), resilience to packet loss and variable bandwidth, and efficient bandwidth use.
Step 2 — Real-time media: UDP, not TCP
Conversational media prioritizes timeliness over reliability: a packet that arrives after its playout deadline is useless, and TCP's in-order retransmission stalls everything queued behind the lost packet (head-of-line blocking). Use UDP-based RTP (via WebRTC): drop late packets, conceal small losses, and never stall playback waiting for a resend. This is the opposite of the reliable-delivery stance in chat.
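To make the "drop late packets" rule concrete, here is a minimal sketch of a receiver's playout decision. It assumes fixed 20 ms audio frames and a 60 ms jitter buffer; the function name and both defaults are illustrative assumptions, not values taken from WebRTC.

```python
def playout_plan(arrivals_ms, frame_ms=20, buffer_ms=60):
    """Decide, per sequence number, whether to play or conceal a frame.

    arrivals_ms maps seq -> arrival time in ms; a missing seq means the
    packet was lost. frame_ms and buffer_ms are illustrative assumptions.
    """
    plan = []
    for seq in range(max(arrivals_ms) + 1):
        # Frame `seq` was captured at seq * frame_ms and must be played
        # buffer_ms later; after that deadline it is useless.
        deadline = seq * frame_ms + buffer_ms
        arrival = arrivals_ms.get(seq)
        if arrival is not None and arrival <= deadline:
            plan.append((seq, "play"))
        else:
            # Late or lost: conceal the gap instead of waiting for a resend.
            plan.append((seq, "conceal"))
    return plan


# Packet 2 arrives after its deadline and packet 3 never arrives.
plan = playout_plan({0: 30, 1: 55, 2: 140, 4: 120})
```

Playback never waits: packet 4 is played on time even though packets 2 and 3 before it were late or missing, which is exactly what in-order TCP delivery would prevent.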
Step 3 — Topologies: how media flows between N participants
The central design choice — how do N participants exchange streams?
- Mesh (P2P): everyone sends their stream directly to everyone else. No media server cost, but each client encodes and uploads N−1 copies, so it only works for tiny calls (2–4).
- MCU (Multipoint Control Unit): a server mixes all streams into one composite and sends each participant a single stream. Low client bandwidth, but very CPU-expensive (decode/mix/encode per call), and the extra processing adds latency.
- SFU (Selective Forwarding Unit): a server receives each participant's stream and forwards the relevant streams to the others without decoding or mixing them. Each client uploads once; the SFU fans out. This is the standard for scalable group video, and the answer to give.
each client --(1 upload)--> SFU --(forwards needed streams)--> each other client
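The per-client stream counts behind that comparison can be worked out directly. The sketch below is plain arithmetic under the simplifying assumption that every participant sends one video stream and watches everyone else; the function name is hypothetical.

```python
def per_client_streams(topology, n):
    """Return (uploads, downloads) in streams per client for an n-party call."""
    if n < 2:
        raise ValueError("a call needs at least two participants")
    if topology == "mesh":
        # Each client sends to, and receives from, every other client.
        return (n - 1, n - 1)
    if topology == "mcu":
        # One upload to the mixer, one composite stream back.
        return (1, 1)
    if topology == "sfu":
        # One upload; the SFU forwards the other n - 1 streams.
        return (1, n - 1)
    raise ValueError(f"unknown topology: {topology}")


for topology in ("mesh", "mcu", "sfu"):
    up, down = per_client_streams(topology, 10)
    print(f"{topology}: {up} up, {down} down per client")
```

For a 10-person call, mesh needs 9 uploads per client, while the SFU keeps upload at 1 and moves the fan-out to the server. Upload is usually the scarcer direction on consumer links, which is why the SFU scales where mesh does not.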
Step 4 — Signaling vs media (two planes)
- Signaling plane — call setup/coordination over WebSockets: who’s in the call, SDP offer/answer (codec negotiation), ICE candidates (network paths), mute events. Low bandwidth.
- Media plane — the actual RTP audio/video over UDP through the SFU. High bandwidth.
STUN/TURN servers help peers traverse NATs/firewalls (STUN discovers your public address; TURN relays media when direct paths fail).
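A minimal, transport-agnostic sketch of the signaling plane follows. It relays small JSON messages between participants and never touches media. The class name, message types, and field names are hypothetical, and a real deployment would deliver each outbox over a WebSocket connection rather than an in-memory list.

```python
import json


class SignalingRoom:
    """In-memory stand-in for a signaling server (hypothetical design)."""

    def __init__(self):
        self.outboxes = {}  # participant id -> list of JSON strings to deliver

    def join(self, pid):
        # Tell everyone already in the room about the newcomer.
        for other in self.outboxes:
            self._deliver(other, {"type": "participant-joined", "from": pid})
        self.outboxes[pid] = []

    def send(self, sender, to, msg_type, payload):
        # Relay an SDP offer/answer or an ICE candidate to one participant.
        if msg_type not in {"offer", "answer", "ice-candidate"}:
            raise ValueError(f"unsupported message type: {msg_type}")
        self._deliver(to, {"type": msg_type, "from": sender, "payload": payload})

    def _deliver(self, pid, message):
        self.outboxes[pid].append(json.dumps(message))


room = SignalingRoom()
room.join("alice")
room.join("bob")
# The payload is a placeholder; a real offer carries a full SDP description.
room.send("alice", "bob", "offer", {"sdp": "<sdp placeholder>"})
```

The messages are a few hundred bytes each and only flow during setup and state changes, which is why this plane can share ordinary web infrastructure while the media plane needs dedicated UDP paths.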
Step 5 — Adapting to bad networks
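- Simulcast: each client sends several encodings of its stream at different resolutions and bitrates; the SFU forwards the one that fits each recipient's available bandwidth and layout (high resolution only for the active speaker or large visible tiles). This is how SFU-based services handle mixed network conditions without transcoding on the server.
- Adaptive bitrate, forward error correction, and packet-loss concealment smooth over loss: the sender lowers its encoding bitrate when the bandwidth estimate drops, FEC lets the receiver rebuild some lost packets from redundancy, and concealment masks the gaps that remain.
- Active-speaker detection: the SFU tracks audio levels and forwards high resolution only for whoever is talking, with thumbnails for the rest.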
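The per-recipient choice an SFU makes can be sketched as a small selection function. The three layer names and their bitrates are assumptions chosen for illustration; real encoders and SFUs pick their own values and usually add hysteresis to avoid flapping between layers.

```python
# Simulcast layers in ascending quality: (name, bitrate in kbps).
# These numbers are illustrative assumptions, not standard values.
LAYERS = [("low", 150), ("mid", 500), ("high", 1500)]


def select_layer(available_kbps, is_active_speaker):
    """Pick the simulcast layer to forward to one recipient for one sender.

    Non-speakers are capped at the lowest layer (thumbnail); the active
    speaker may use any layer that fits the recipient's bandwidth.
    Returns None when even the lowest layer does not fit (audio only).
    """
    allowed = LAYERS if is_active_speaker else LAYERS[:1]
    choice = None
    for name, kbps in allowed:
        if kbps <= available_kbps:
            choice = name  # keep the highest layer that still fits
    return choice


print(select_layer(2000, True))   # active speaker on a good link
print(select_layer(800, True))    # active speaker on a constrained link
print(select_layer(2000, False))  # thumbnail tile
```

Because the sender already produced every layer, the SFU only switches which packets it forwards; no transcoding is needed when a recipient's bandwidth changes.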
Step 6 — Scale and recording
- SFUs are regional (place media servers near participants for latency); a call is assigned to an SFU (or cascaded SFUs for very large meetings/webinars).
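- Recording and streaming to thousands of viewers (webinars): an MCU-like component mixes the call into one composite and pushes it to a CDN (HLS) for one-to-many delivery. HLS adds seconds of latency, which is acceptable for a view-only audience but not for active participants.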
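One simple way to assign a call to a regional SFU is to minimize the worst participant's round-trip time. This is a sketch of that single policy, not a description of how any particular product places calls; the region names and RTT figures are made up for illustration.

```python
def pick_sfu_region(participant_rtts):
    """Choose the region whose worst-case participant RTT is smallest.

    participant_rtts maps participant -> {region: measured RTT in ms}.
    Only regions measured by every participant are considered.
    """
    if not participant_rtts:
        raise ValueError("no participants")
    common = set.intersection(*(set(rtts) for rtts in participant_rtts.values()))
    if not common:
        raise ValueError("no region was measured by every participant")
    # sorted() makes tie-breaking deterministic (alphabetical).
    return min(
        sorted(common),
        key=lambda region: max(rtts[region] for rtts in participant_rtts.values()),
    )


# Hypothetical measurements for a three-person call.
rtts = {
    "alice": {"us-east": 20, "eu-west": 90},
    "bob": {"us-east": 95, "eu-west": 25},
    "carol": {"us-east": 30, "eu-west": 100},
}
region = pick_sfu_region(rtts)
```

Here the worst case is 95 ms via us-east and 100 ms via eu-west, so us-east is chosen. When participants are spread too widely for any single region to serve well, cascaded SFUs let each participant connect to a nearby server instead.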
Trade-offs to raise
- Mesh (no media server, tiny calls) vs MCU (low client bandwidth, high server CPU) vs SFU (scalable, moderate server cost because it forwards without transcoding). SFU wins for group video.
- UDP/RTP (timely, lossy) vs TCP (reliable, stalls): media should run over UDP, with TURN over TCP/TLS only as a fallback when a network blocks UDP.
- Simulcast (more upload bandwidth, flexible per-recipient quality) vs single stream.
The interview cue
“WebRTC over UDP/RTP for media; an SFU receives each participant’s stream once and selectively forwards (scales group calls without mixing); a signaling plane (WebSockets + SDP/ICE) sets up calls, with STUN/TURN for NAT traversal; simulcast + active-speaker detection adapt quality per recipient; regional SFUs for latency; an MCU/CDN path for webinars and recording.” SFU + UDP media + simulcast is the defining answer; implementation next.