
Building a video conferencing system (Zoom)

Implement the signaling handshake, the SFU forwarding loop with simulcast layer selection, active-speaker detection, and TURN fallback.


Signaling: setting up a call

Before media flows, peers negotiate over a signaling WebSocket — exchanging session descriptions (SDP: codecs, resolutions) and ICE candidates (network paths):

def on_join(user, room_id, offer):                 # offer: SDP created by the client, received over the WebSocket
    room = rooms[room_id]
    room.add(user)
    sfu = assign_sfu(room_id, region(user))        # regional media server
    # SDP offer/answer between the client and the SFU (codecs, simulcast layers)
    answer = sfu.negotiate(user, offer)
    signal.send(user, {"answer": answer, "ice": sfu.ice_candidates()})
    room.broadcast_presence(user, "joined")

Signaling is low-bandwidth and separate from the media plane — it just orchestrates who connects to which SFU and how.
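
To make the handshake concrete, here is a minimal in-memory sketch of a signaling handler. It is not a real WebSocket server: `FakeSFU`, `handle_signal`, `outbox`, and the JSON field names (`type`, `room`, `offer`) are all hypothetical stand-ins chosen for illustration, and the SDP strings are placeholders.

```python
import json
from dataclasses import dataclass, field

@dataclass
class FakeSFU:
    """Stand-in for a media server; a real one would parse the SDP offer."""
    name: str
    sessions: dict = field(default_factory=dict)

    def negotiate(self, user: str, offer: str) -> str:
        self.sessions[user] = offer
        return f"answer-for-{user}"

    def ice_candidates(self) -> list:
        # 203.0.113.0/24 is reserved for documentation examples
        return [{"type": "host", "address": "203.0.113.10", "port": 40000}]

@dataclass
class Room:
    members: set = field(default_factory=set)

rooms: dict = {}
outbox: list = []            # (recipient, json_text) pairs instead of real sockets
sfu = FakeSFU("sfu-eu-1")

def send(user: str, message: dict) -> None:
    outbox.append((user, json.dumps(message)))

def handle_signal(user: str, raw: str) -> None:
    """Dispatch one signaling message; media never passes through here."""
    msg = json.loads(raw)
    if msg["type"] == "join":
        room = rooms.setdefault(msg["room"], Room())
        answer = sfu.negotiate(user, msg["offer"])
        send(user, {"type": "answer", "answer": answer, "ice": sfu.ice_candidates()})
        for other in room.members:               # tell existing members about the newcomer
            send(other, {"type": "presence", "user": user, "event": "joined"})
        room.members.add(user)
    elif msg["type"] == "leave":
        room = rooms[msg["room"]]
        room.members.discard(user)
        for other in room.members:
            send(other, {"type": "presence", "user": user, "event": "left"})

handle_signal("alice", json.dumps({"type": "join", "room": "r1", "offer": "sdp-a"}))
handle_signal("bob", json.dumps({"type": "join", "room": "r1", "offer": "sdp-b"}))
```

After these two joins, `outbox` holds an answer for alice, an answer for bob, and one presence notice telling alice that bob joined. Every message is a few hundred bytes of JSON, which is why the signaling plane can stay small while the media plane carries the heavy traffic.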

The SFU forwarding loop

Each participant sends one uplink to the SFU; the SFU forwards each stream to the other participants — routing, not mixing:

class SFU:
    def on_rtp_packet(self, sender, packet):       # an incoming media packet
        for receiver in self.room(sender).others(sender):
            layer = self.choose_layer(receiver, sender)   # simulcast: pick quality per (receiver, sender)
            if packet.layer == layer and self.wants(receiver, sender):
                receiver.forward(packet)           # selective forwarding

The SFU never decodes or re-encodes (that’s the MCU’s expensive job); it only forwards the right packets, which is why it scales to many participants cheaply.

Simulcast: right quality per receiver

Each sender uploads multiple resolution layers (e.g. 1080p / 360p / 180p); the SFU forwards the layer each receiver can handle and wants:

def choose_layer(self, receiver, sender):       # method on SFU; sender is needed for the spotlight check
    if receiver.is_active_speaker_view(sender): return HIGH    # the big tile
    if receiver.bandwidth < LOW_THRESHOLD:      return LOW     # poor network
    return MEDIUM                                              # thumbnail tiles

So a viewer on a weak connection gets 180p of the gallery and high-res only of the active speaker — efficient bandwidth use without re-encoding.
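
The forwarding loop and the layer choice can be combined into one small runnable sketch. Everything here is an illustrative assumption: the `Packet` and `Receiver` classes, the `spotlight` field, and the 500 kbps cutoff are hypothetical, and packets are plain objects rather than real RTP.

```python
from dataclasses import dataclass, field

HIGH, MEDIUM, LOW = "high", "medium", "low"
LOW_THRESHOLD_KBPS = 500       # hypothetical cutoff for a "poor" downlink

@dataclass
class Packet:
    layer: str                 # which simulcast layer this packet belongs to
    seq: int                   # simplified; real layers have their own RTP sequence spaces

@dataclass
class Receiver:
    name: str
    bandwidth_kbps: int
    spotlight: str = ""        # sender shown in this receiver's big tile, if any
    received: list = field(default_factory=list)

    def forward(self, sender: str, packet: Packet) -> None:
        self.received.append((sender, packet.layer, packet.seq))

class SFU:
    def __init__(self) -> None:
        self.participants: dict = {}

    def join(self, receiver: Receiver) -> None:
        self.participants[receiver.name] = receiver

    def choose_layer(self, receiver: Receiver, sender: str) -> str:
        if receiver.spotlight == sender:
            return HIGH                                # the big tile
        if receiver.bandwidth_kbps < LOW_THRESHOLD_KBPS:
            return LOW                                 # poor network
        return MEDIUM                                  # thumbnail tiles

    def on_rtp_packet(self, sender: str, packet: Packet) -> None:
        for name, receiver in self.participants.items():
            if name == sender:
                continue                               # never echo back to the sender
            if packet.layer == self.choose_layer(receiver, sender):
                receiver.forward(sender, packet)       # forwarded as-is, no transcoding

sfu = SFU()
for r in (Receiver("alice", 3000),
          Receiver("bob", 3000, spotlight="alice"),
          Receiver("carol", 200)):
    sfu.join(r)

# alice uploads the same frame in all three simulcast layers
for seq, layer in enumerate((HIGH, MEDIUM, LOW)):
    sfu.on_rtp_packet("alice", Packet(layer, seq))
```

Of alice's three uploaded packets, bob (who has alice spotlighted) receives only the high layer and carol (on a 200 kbps link) receives only the low layer. The SFU's work per packet is a comparison and a copy, which is the whole point of forwarding instead of mixing.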

Active-speaker detection

Clients (or the SFU) measure audio energy; the loudest becomes the active speaker, and the SFU promotes their video to high-res for everyone and signals the UI to spotlight them:

def on_audio_levels(room):
    speaker = max(room.participants, key=lambda p: p.audio_energy())
    if speaker != room.active_speaker:
        room.active_speaker = speaker
        room.broadcast({"active_speaker": speaker.id})   # UI spotlights + SFU bumps quality
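
Picking the raw maximum on every report can make the spotlight flicker when two people talk over each other. One common mitigation, which is an addition here and not something the pseudocode above does, is to smooth each participant's level and require a challenger to beat the current speaker by a margin. The class name, `alpha`, and `margin` below are hypothetical choices for illustration.

```python
from typing import Dict, Optional

class ActiveSpeakerDetector:
    def __init__(self, alpha: float = 0.3, margin: float = 1.2) -> None:
        self.alpha = alpha       # weight of the newest sample in the moving average
        self.margin = margin     # challenger must exceed the current speaker by this factor
        self.smoothed: Dict[str, float] = {}
        self.active: Optional[str] = None

    def update(self, levels: Dict[str, float]) -> Optional[str]:
        """Feed one round of audio levels; return the new speaker id on a change, else None."""
        for pid, level in levels.items():
            prev = self.smoothed.get(pid, 0.0)
            self.smoothed[pid] = (1 - self.alpha) * prev + self.alpha * level
        if not self.smoothed:
            return None
        loudest = max(self.smoothed, key=self.smoothed.get)
        if loudest == self.active:
            return None
        current = self.smoothed.get(self.active, 0.0) if self.active else 0.0
        if self.smoothed[loudest] > current * self.margin:
            self.active = loudest
            return loudest       # caller broadcasts {"active_speaker": loudest}
        return None

detector = ActiveSpeakerDetector()
events = [
    detector.update({"alice": 0.8, "bob": 0.1}),   # alice starts talking
    detector.update({"alice": 0.7, "bob": 0.9}),   # one loud burst from bob
    detector.update({"alice": 0.1, "bob": 0.9}),   # bob keeps talking, alice stops
]
# events == ["alice", None, "bob"]
```

A single loud sample from bob does not steal the spotlight; he takes over only once his smoothed level clearly exceeds alice's.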

NAT traversal with STUN/TURN

Most peers are behind NATs. STUN lets a peer discover its public address for a direct path; when symmetric NATs/firewalls block that, a TURN server relays the media:

client → (try direct via STUN) → SFU
       → (if blocked) → TURN relay → SFU      # fallback, costs relay bandwidth

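ICE gathers host, server-reflexive (STUN), and relayed (TURN) candidates, runs connectivity checks on candidate pairs in priority order (direct before relayed), and uses the highest-priority pair that succeeds.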
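
The preference for direct paths over relayed ones falls out of how candidates are prioritized. The sketch below uses the priority formula and recommended type preferences from RFC 8445; the `first_working` helper and the `is_reachable` callback are hypothetical simplifications, since real ICE checks candidate pairs rather than single candidates.

```python
# Recommended type preferences from RFC 8445 (higher is preferred)
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}

def candidate_priority(cand_type: str, local_pref: int = 65535, component_id: int = 1) -> int:
    """priority = 2^24 * type_pref + 2^8 * local_pref + (256 - component_id)"""
    return (TYPE_PREFERENCE[cand_type] << 24) + (local_pref << 8) + (256 - component_id)

def first_working(candidates, is_reachable):
    """Simplified: try candidates from highest to lowest priority, return the first that connects."""
    ordered = sorted(candidates, key=lambda c: candidate_priority(c["type"]), reverse=True)
    for cand in ordered:
        if is_reachable(cand):
            return cand
    return None

candidates = [
    {"type": "relay", "address": "turn.example.net"},
    {"type": "host", "address": "192.0.2.5"},
    {"type": "srflx", "address": "198.51.100.7"},
]

print(candidate_priority("host"))    # 2130706431
print(candidate_priority("srflx"))   # 1694498815
print(candidate_priority("relay"))   # 16777215

# A firewall that blocks everything except the TURN relay forces the fallback
chosen = first_working(candidates, lambda c: c["type"] == "relay")
```

With an open network the host candidate wins; when only the relay is reachable, `chosen` is the TURN candidate, at the cost of relay bandwidth.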

Resilience on bad networks

  • Adaptive bitrate — the SFU drops to a lower simulcast layer when a receiver reports loss.
  • FEC / retransmission (NACK) — recover small losses without stalling.
  • Packet-loss concealment — the decoder fills tiny gaps so audio stays smooth.
  • Media is UDP/RTP — never block on a late packet.
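
As one illustration of the NACK item above, a receiver can detect gaps in RTP sequence numbers and ask for just those packets again. RTP sequence numbers are 16 bits and wrap around, so the gap arithmetic is done modulo 65536. The `LossTracker` class and the `MAX_GAP` limit are hypothetical; a production jitter buffer also handles timing, duplicates, and NACK retries.

```python
SEQ_MOD = 65536      # RTP sequence numbers are 16-bit and wrap around
MAX_GAP = 1000       # hypothetical limit: larger jumps are treated as old or unrecoverable

class LossTracker:
    def __init__(self) -> None:
        self.highest = None      # newest sequence number seen so far
        self.pending = set()     # sequence numbers we have asked to be resent

    def on_packet(self, seq: int) -> list:
        """Return the sequence numbers to NACK as a result of this packet."""
        if self.highest is None:
            self.highest = seq
            return []
        gap = (seq - self.highest) % SEQ_MOD
        if 0 < gap <= MAX_GAP:
            # Packet is ahead of us: everything skipped in between is missing
            missing = [(self.highest + i) % SEQ_MOD for i in range(1, gap)]
            self.pending.update(missing)
            self.highest = seq
            return missing
        # Late or retransmitted packet: it fills a hole instead of opening one
        self.pending.discard(seq)
        return []

tracker = LossTracker()
nacks = [tracker.on_packet(seq) for seq in (65533, 65534, 1, 65535)]
# nacks == [[], [], [65535, 0], []]
```

The jump from 65534 to 1 crosses the wraparound and reports 65535 and 0 as missing; when 65535 later arrives it is cleared from `pending`, leaving only 0 outstanding. The caller never waits on a missing packet, it just requests it and keeps playing.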

Scale and failure handling

  • Regional SFUs keep media close to participants; very large meetings/webinars cascade SFUs (one forwards to others) or switch to an MCU + CDN (HLS) for one-to-many broadcast.
  • SFU failure → participants reconnect (signaling re-negotiates) to a standby SFU; the call continues after a brief blip.
  • A slow participant → only their downlink quality drops (simulcast), not everyone’s.
  • Recording → a server-side participant subscribes to all streams, mixes, and writes to storage.
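
The regional placement and failover bullets can be sketched as a single selection rule: prefer a healthy SFU in the participant's region, then the least loaded one. `SfuNode`, `pick_sfu`, and the load numbers are hypothetical, and this variant takes a node list rather than the `room_id` used by `assign_sfu` earlier.

```python
from dataclasses import dataclass

@dataclass
class SfuNode:
    name: str
    region: str
    load: int            # e.g. number of forwarded streams
    healthy: bool = True

def pick_sfu(nodes, region: str) -> SfuNode:
    """Prefer a healthy node in the same region, then the least loaded one."""
    healthy = [n for n in nodes if n.healthy]
    if not healthy:
        raise RuntimeError("no healthy SFU available")
    # False sorts before True, so same-region nodes come first
    return min(healthy, key=lambda n: (n.region != region, n.load))

nodes = [
    SfuNode("eu-1", "eu", load=40),
    SfuNode("eu-2", "eu", load=10),
    SfuNode("us-1", "us", load=5),
]

first = pick_sfu(nodes, "eu")        # eu-2: same region, least loaded
first.healthy = False                # simulate an SFU failure
standby = pick_sfu(nodes, "eu")      # eu-1: participants re-negotiate against it
```

On failure, the signaling layer would rerun the same selection and repeat the offer/answer exchange with the standby node, which is the brief blip participants see.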

The takeaway

Concrete signals: a signaling plane (SDP/ICE) separate from media, an SFU that selectively forwards (scales without mixing), simulcast + active-speaker for per-receiver quality, STUN/TURN for NAT traversal, and UDP/RTP with FEC for timeliness over reliability. SFU forwarding is the scalable core of real-time group video.