Building a video conferencing system (Zoom)
Implement the signaling handshake, the SFU forwarding loop with simulcast layer selection, active-speaker detection, and TURN fallback.
Signaling: setting up a call
Before media flows, peers negotiate over a signaling WebSocket — exchanging session descriptions (SDP: codecs, resolutions) and ICE candidates (network paths):
def on_join(user, room_id, offer):               # offer: SDP the client sent over the WebSocket
    room = rooms[room_id]
    room.add(user)
    sfu = assign_sfu(room_id, region(user))      # regional media server
    # SDP offer/answer between the client and the SFU;
    # the client's offer lists its codecs and simulcast layers
    answer = sfu.negotiate(user, offer)
    signal.send(user, {"answer": answer, "ice": sfu.ice_candidates()})
    room.broadcast_presence(user, "joined")
Signaling is low-bandwidth and separate from the media plane — it just orchestrates who connects to which SFU and how.
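The offer/answer step can be sketched in memory as follows. This is a minimal illustration, not a real SDP stack: `Offer`, `FakeSFU`, the message shapes, and the address are hypothetical names invented for the example, and the "negotiation" is assumed to be a simple codec intersection.

```python
from dataclasses import dataclass, field

@dataclass
class Offer:
    codecs: list            # codecs the client can use, in preference order
    simulcast_layers: list  # e.g. ["high", "low"]

@dataclass
class FakeSFU:
    """Hypothetical stand-in for a media server (assumption, not a real API)."""
    supported_codecs: tuple = ("VP8", "opus")
    address: str = "203.0.113.10:3478"  # documentation-range IP, illustrative only

    def negotiate(self, offer: Offer) -> dict:
        # Answer with the codecs both sides support, keeping the client's order.
        common = [c for c in offer.codecs if c in self.supported_codecs]
        if not common:
            raise ValueError("no codec in common")
        return {"codecs": common, "simulcast_layers": offer.simulcast_layers}

    def ice_candidates(self) -> list:
        return [{"type": "host", "address": self.address}]

@dataclass
class Room:
    members: set = field(default_factory=set)

rooms: dict = {}

def on_join(user: str, room_id: str, offer: Offer, sfu: FakeSFU) -> dict:
    """Handle a join request and return the reply sent back over signaling."""
    room = rooms.setdefault(room_id, Room())
    room.members.add(user)
    return {"answer": sfu.negotiate(offer), "ice": sfu.ice_candidates()}

reply = on_join("alice", "standup",
                Offer(["H264", "VP8", "opus"], ["high", "low"]), FakeSFU())
print(reply["answer"]["codecs"])  # ['VP8', 'opus']
```

Only small JSON-like messages cross this path; no media bytes do, which is what keeps the signaling plane cheap.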
The SFU forwarding loop
Each participant sends one uplink to the SFU; the SFU forwards each stream to the other participants — routing, not mixing:
class SFU:
    def on_rtp_packet(self, sender, packet):     # an incoming media packet
        for receiver in self.room(sender).others(sender):
            # simulcast: pick quality per (receiver, sender) pair
            layer = self.choose_layer(receiver, sender)
            if packet.layer == layer and self.wants(receiver, sender):
                receiver.forward(packet)         # selective forwarding
The SFU never decodes or re-encodes (that’s the MCU’s expensive job). It only forwards the right packets, which is why it scales to many participants cheaply.
Simulcast: right quality per receiver
Each sender uploads multiple resolution layers (e.g. 1080p / 360p / 180p); the SFU forwards the layer each receiver can handle and wants:
    def choose_layer(self, receiver, sender):    # method of SFU
        if receiver.is_active_speaker_view(sender): return HIGH   # the big tile
        if receiver.bandwidth < LOW_THRESHOLD: return LOW         # poor network
        return MEDIUM                                             # thumbnail tiles
So a viewer on a weak connection gets 180p of the gallery and high-res only of the active speaker — efficient bandwidth use without re-encoding.
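A runnable sketch of forwarding with per-receiver layer selection, under stated assumptions: `Packet`, `Receiver`, the `spotlight` field, and the 500 kbps cutoff are hypothetical, and an in-memory `inbox` list stands in for the network send.

```python
from dataclasses import dataclass, field

HIGH, MEDIUM, LOW = "high", "medium", "low"
LOW_THRESHOLD_KBPS = 500  # assumed cutoff; tune per deployment

@dataclass
class Packet:
    sender: str
    layer: str
    seq: int

@dataclass
class Receiver:
    name: str
    bandwidth_kbps: int
    spotlight: str = ""                       # sender shown in the big tile
    inbox: list = field(default_factory=list)  # stands in for the network send

def choose_layer(receiver: Receiver, sender: str) -> str:
    # Same priority order as the text: big tile first, then bandwidth.
    if receiver.spotlight == sender:
        return HIGH
    if receiver.bandwidth_kbps < LOW_THRESHOLD_KBPS:
        return LOW
    return MEDIUM

def forward(packet: Packet, receivers: list) -> None:
    """Copy the packet to each receiver whose chosen layer matches; no transcoding."""
    for r in receivers:
        if r.name == packet.sender:
            continue  # never echo a stream back to its sender
        if packet.layer == choose_layer(r, packet.sender):
            r.inbox.append(packet)

alice = Receiver("alice", bandwidth_kbps=3000)
bob = Receiver("bob", bandwidth_kbps=3000, spotlight="alice")
carol = Receiver("carol", bandwidth_kbps=200)

# Alice uploads one packet per simulcast layer.
for seq, layer in enumerate([HIGH, MEDIUM, LOW]):
    forward(Packet("alice", layer, seq), [alice, bob, carol])

print([p.layer for p in bob.inbox], [p.layer for p in carol.inbox])
# ['high'] ['low']
```

Each receiver gets exactly one of the three uploaded layers, so the SFU's per-packet work is a comparison and a copy.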
Active-speaker detection
Clients (or the SFU) measure audio energy; the loudest becomes the active speaker, and the SFU promotes their video to high-res for everyone and signals the UI to spotlight them:
def on_audio_levels(room):
    speaker = max(room.participants, key=lambda p: p.audio_energy())
    if speaker != room.active_speaker:
        room.active_speaker = speaker
        room.broadcast({"active_speaker": speaker.id})   # UI spotlights + SFU bumps quality
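Picking the raw maximum on every report can switch the spotlight on a cough or during silence. One possible refinement, which the text above does not prescribe, is to smooth the levels and require a challenger to be clearly louder. The class name and the `alpha`, `margin`, and `floor` parameters below are assumptions chosen for illustration.

```python
class ActiveSpeakerDetector:
    def __init__(self, alpha=0.5, margin=1.5, floor=0.1):
        self.alpha = alpha    # weight of the newest sample in the moving average
        self.margin = margin  # challenger must be this many times louder
        self.floor = floor    # smoothed level below this counts as silence
        self.levels = {}      # participant id -> smoothed audio energy
        self.active = None

    def update(self, samples):
        """samples: {participant_id: energy in [0, 1]}; returns the active speaker."""
        for pid, energy in samples.items():
            prev = self.levels.get(pid, 0.0)
            self.levels[pid] = self.alpha * energy + (1 - self.alpha) * prev
        if not self.levels:
            return self.active
        loudest = max(self.levels, key=self.levels.get)
        current = self.levels.get(self.active, 0.0)
        # Switch only if the loudest is audible and clearly beats the current speaker.
        if self.levels[loudest] >= self.floor and self.levels[loudest] > self.margin * current:
            self.active = loudest
        return self.active

det = ActiveSpeakerDetector()
first = det.update({"a": 0.8, "b": 0.1})   # "a" starts talking
second = det.update({"a": 0.6, "b": 0.7})  # brief overlap, no switch
third = det.update({"a": 0.0, "b": 0.9})   # "b" takes over
print(first, second, third)  # a a b
```

The broadcast from `on_audio_levels` would then fire only when `update` returns a different participant than before.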
NAT traversal with STUN/TURN
Most peers are behind NATs. STUN lets a peer discover its public address for a direct path; when symmetric NATs/firewalls block that, a TURN server relays the media:
client → (try direct via STUN) → SFU
       → (if blocked) → TURN relay → SFU    # fallback, costs relay bandwidth
ICE runs connectivity checks on candidate pairs in priority order (direct > relayed) and nominates the highest-priority pair that succeeds.
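The preference order comes from a priority formula. The sketch below uses the candidate priority formula and recommended type preferences from RFC 8445; the `pick_candidate` helper and its `reachable` set are hypothetical simplifications, since real ICE checks candidate pairs rather than single candidates.

```python
# Recommended type preferences from RFC 8445 (higher is preferred).
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}

def candidate_priority(cand_type, local_preference=65535, component_id=1):
    """priority = 2^24 * type_pref + 2^8 * local_pref + (256 - component_id)."""
    return (TYPE_PREFERENCE[cand_type] << 24) + (local_preference << 8) + (256 - component_id)

def pick_candidate(candidates, reachable):
    """Hypothetical helper: return the highest-priority candidate that passes its check."""
    for cand in sorted(candidates, key=candidate_priority, reverse=True):
        if cand in reachable:
            return cand
    return None

print(candidate_priority("host"))   # 2130706431
print(candidate_priority("srflx"))  # 1694498815
print(pick_candidate(["relay", "srflx", "host"], reachable={"srflx", "relay"}))  # srflx
print(pick_candidate(["relay", "srflx", "host"], reachable={"relay"}))           # relay
```

The relayed candidate has the lowest type preference, so TURN is used only when every more direct path fails its check.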
Resilience on bad networks
- Adaptive bitrate — the SFU drops to a lower simulcast layer when a receiver reports loss.
- FEC / retransmission (NACK) — recover small losses without stalling.
- Packet-loss concealment — the decoder fills tiny gaps so audio stays smooth.
- Media is UDP/RTP — never block on a late packet.
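As an illustration of the NACK item, a receiver can detect gaps in RTP sequence numbers and request only the missing packets while continuing to play what it has. This is a sketch: `NackTracker` and its `max_gap` cap are hypothetical, and sending the actual RTCP feedback message is left out.

```python
SEQ_MOD = 1 << 16  # RTP sequence numbers are 16 bits and wrap around

def seq_delta(new, old):
    """Signed distance from old to new, accounting for wraparound."""
    half = SEQ_MOD // 2
    return ((new - old + half) % SEQ_MOD) - half

class NackTracker:
    def __init__(self, max_gap=100):
        self.highest = None     # highest sequence number seen so far
        self.missing = set()    # sequence numbers still outstanding
        self.max_gap = max_gap  # assumed cap: larger jumps are treated as a reset

    def on_packet(self, seq):
        """Record a packet; return the sequence numbers newly found missing."""
        if self.highest is None:
            self.highest = seq
            return []
        delta = seq_delta(seq, self.highest)
        if delta <= 0:
            # Late or retransmitted packet: it fills a hole, nothing new to request.
            self.missing.discard(seq)
            return []
        newly = []
        if delta <= self.max_gap:
            newly = [(self.highest + i) % SEQ_MOD for i in range(1, delta)]
            self.missing.update(newly)
        else:
            self.missing.clear()  # too far ahead: give up on old holes
        self.highest = seq
        return newly

t = NackTracker()
t.on_packet(65534)
print(t.on_packet(1))      # [65535, 0]  gap across the wraparound
print(t.on_packet(65535))  # []          retransmission arrives late
print(sorted(t.missing))   # [0]
```

`on_packet` never waits: a hole produces a request, and playback proceeds whether or not the retransmission arrives.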
Scale and failure handling
- Regional SFUs keep media close to participants; very large meetings/webinars cascade SFUs (one forwards to others) or switch to an MCU + CDN (HLS) for one-to-many broadcast.
- SFU failure → participants reconnect (signaling re-negotiates) to a standby SFU; the call continues after a brief blip.
- A slow participant → only their downlink quality drops (simulcast), not everyone’s.
- Recording → a server-side participant subscribes to all streams, mixes, and writes to storage.
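The SFU-failure item can be sketched as a reassignment step. This is a simplified illustration: `SfuNode`, the least-loaded selection rule, and the `assignments` mapping are assumptions, and the re-negotiation itself would go through the signaling path described earlier.

```python
from dataclasses import dataclass

@dataclass
class SfuNode:
    name: str
    region: str
    healthy: bool = True
    load: int = 0  # current participant count

def assign_sfu(nodes, region):
    """Prefer the least-loaded healthy node in the region, else any healthy node."""
    healthy = [n for n in nodes if n.healthy]
    if not healthy:
        raise RuntimeError("no healthy SFU available")
    local = [n for n in healthy if n.region == region]
    return min(local or healthy, key=lambda n: n.load)

def fail_over(assignments, nodes, failed):
    """Mark a node failed and move its participants; return who must re-negotiate."""
    failed.healthy = False
    moved = []
    for user, (node, region) in list(assignments.items()):
        if node is failed:
            new_node = assign_sfu(nodes, region)
            new_node.load += 1
            assignments[user] = (new_node, region)
            moved.append(user)
    return moved

nodes = [SfuNode("eu-1", "eu", load=2), SfuNode("eu-2", "eu"), SfuNode("us-1", "us", load=1)]
assignments = {"alice": (nodes[0], "eu"), "bob": (nodes[0], "eu"), "carol": (nodes[2], "us")}

moved = fail_over(assignments, nodes, nodes[0])
print(moved, assignments["alice"][0].name, assignments["carol"][0].name)
# ['alice', 'bob'] eu-2 us-1
```

Only the participants on the failed node are moved; everyone else keeps their existing media session.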
The takeaway
Concrete signals: a signaling plane (SDP/ICE) separate from media, an SFU that selectively forwards (scales without mixing), simulcast + active-speaker for per-receiver quality, STUN/TURN for NAT traversal, and UDP/RTP with FEC for timeliness over reliability. SFU forwarding is the scalable core of real-time group video.