System design course
Ch. 2 · The building blocks · concept · 6 min read

Heartbeats and failure detection

A periodic "I'm alive" signal: the simplest way to notice that a node has died, and the source of one of the hardest questions in distributed systems.


What a heartbeat is

Each node periodically sends a small “I’m alive” message, called a heartbeat, to a monitor (or to its peers). As long as heartbeats keep arriving, the node is presumed healthy. When none arrives for longer than a configured timeout, the node is presumed dead, and the system reacts: it stops routing traffic to the node, fails over its role, re-replicates its data, or triggers a leader election.

It’s the mechanism behind nearly every “and then we detect the failure” hand-wave in a design.
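
To make the mechanism concrete, here is a minimal sketch of a monitor that records heartbeats and reports nodes that have been silent for longer than the timeout. The class name `HeartbeatMonitor`, its methods, the node IDs, and the 3-second timeout are all illustrative assumptions, not part of any particular system. The clock is injected so the example runs without sleeping.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and flags nodes that have gone quiet."""

    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the example can use a fake clock
        self.last_seen: Dict[str, float] = {}

    def record_heartbeat(self, node_id: str) -> None:
        """Called whenever an "I'm alive" message arrives from node_id."""
        self.last_seen[node_id] = self.clock()

    def suspected_dead(self) -> List[str]:
        """Nodes whose last heartbeat is older than the timeout (a guess, not a fact)."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


# Usage with a fake clock (hypothetical node names and timings).
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_s=3.0, clock=lambda: fake_now[0])

monitor.record_heartbeat("node-a")
monitor.record_heartbeat("node-b")

fake_now[0] = 2.0
monitor.record_heartbeat("node-a")  # node-b stays silent from here on

fake_now[0] = 4.5
print(monitor.suspected_dead())  # ['node-b']
```

The monitor only ever learns "I have not heard from this node lately". Everything the system does next (failover, re-replication, election) is built on that inference.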

What it enables

  • Load balancers drop unhealthy backends from rotation (health checks are the same idea in pull form: the balancer typically probes each backend instead of waiting for it to report in).
  • Leader/follower systems notice a dead leader and elect a new one.
  • Cluster membership (which nodes are in the ring) stays current as nodes join and leave.
  • Re-replication kicks in to restore the replica count when a node is declared dead.

The fundamental problem: you can’t tell slow from dead

This is the deep idea to surface in an interview. A missing heartbeat could mean the node crashed — or that it’s slow, or the network dropped the message. From the outside, a dead node and an unreachable-but-alive node look identical. So every failure detector is really a guess, and the timeout tunes the guess:

  • Timeout too short → false positives: a momentarily slow node gets declared dead, triggering needless failovers, re-replication, and churn (even flapping).
  • Timeout too long → slow detection: real failures linger, extending downtime and data unavailability.

There’s no perfect setting — only a trade-off you pick for the workload.
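
A small worked example can make the trade-off tangible. The sketch below takes a made-up series of gaps between heartbeats from a node that never actually crashed, and counts how many of those gaps a given fixed timeout would misread as a death. The gap values and the helper name `false_positives` are assumptions for illustration only.

```python
from typing import List


def false_positives(gaps_s: List[float], timeout_s: float) -> int:
    """Count gaps from a healthy node that a fixed timeout would misread as death."""
    return sum(1 for gap in gaps_s if gap > timeout_s)


# Hypothetical gaps (seconds) between consecutive heartbeats from a node that
# stayed alive the whole time; the 4.2 s and 2.6 s gaps stand in for a GC
# pause or a brief network hiccup.
healthy_gaps_s = [1.0, 1.1, 0.9, 1.0, 4.2, 1.0, 1.2, 2.6, 1.0, 1.0]

for timeout_s in (2.0, 3.0, 5.0):
    wrongly_declared = false_positives(healthy_gaps_s, timeout_s)
    # If the node really crashes right after sending a heartbeat, the monitor
    # needs roughly the full timeout before it notices.
    print(f"timeout={timeout_s}s  false_positives={wrongly_declared}  "
          f"detection_delay~{timeout_s}s")
```

With these numbers, a 2 s timeout misreads two healthy gaps, 3 s misreads one, and 5 s misreads none, while the time to notice a real crash grows with each step.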

Making it robust

  • Multiple missed beats before declaring death (one missed beat is noise).
  • Adaptive timeouts (e.g. phi-accrual detectors) that learn the normal spread of heartbeat arrival times and output a suspicion level instead of applying a fixed threshold.
  • Multiple observers / quorum — let several nodes agree a node is dead before acting, so one bad network link doesn’t cause a false eviction.
  • Fencing — when you do fail over, make sure the “dead” node can’t come back and act as if it’s still in charge (prevents split-brain).
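
The adaptive-timeout idea from the list above can be sketched in a few lines. This is a simplified accrual detector, not a faithful copy of any production implementation: it assumes heartbeat gaps are exponentially distributed around the learned mean (a simplification chosen for brevity), so the probability of a gap longer than the current silence is exp(-elapsed / mean), and phi is the negative base-10 logarithm of that probability. The class name `PhiAccrualDetector` and the window size are illustrative assumptions.

```python
import math
from collections import deque
from typing import Deque, Optional


class PhiAccrualDetector:
    """Simplified accrual detector: suspicion grows with silence,
    scaled by the mean heartbeat gap learned from recent history."""

    def __init__(self, window: int = 100):
        self.gaps: Deque[float] = deque(maxlen=window)  # recent inter-arrival times
        self.last_arrival: Optional[float] = None

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat arriving at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.gaps.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now: float) -> float:
        """Suspicion level; higher means the silence is less likely to be normal."""
        if self.last_arrival is None or not self.gaps:
            return 0.0  # not enough history to suspect anything yet
        mean_gap = sum(self.gaps) / len(self.gaps)
        if mean_gap <= 0:
            return 0.0
        elapsed = now - self.last_arrival
        # Exponential assumption: P(gap > elapsed) = exp(-elapsed / mean_gap),
        # and phi = -log10(P) = elapsed / (mean_gap * ln 10).
        return elapsed / (mean_gap * math.log(10))


detector = PhiAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):  # steady 1 s heartbeats
    detector.heartbeat(t)

print(round(detector.phi(4.0), 2))   # 1 s of silence  -> 0.43
print(round(detector.phi(13.0), 2))  # 10 s of silence -> 4.34
```

A caller would compare `phi` against a threshold picked for the workload. Because the mean gap is learned, the same threshold tolerates longer silences on a link that is normally slow and reacts faster on one that is normally quick.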

The interview cue

Whenever your design relies on noticing a failure — failover, removing a bad shard, rebalancing — say how: “nodes heartbeat to the coordinator; after N missed beats it’s marked dead and we fail over.” Then add the senior note: “I’d tune the timeout carefully — too aggressive and a slow node causes a false failover — and require agreement before evicting.” That shows you know detection is a guess, not a fact.
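
If it helps to have something concrete in mind while saying that sentence, here is a rough sketch of the two safeguards it leans on: eviction only when a majority of observers each count at least N missed beats, and an epoch check (the fencing point from the robustness list) so a leader that was wrongly declared dead cannot keep writing. The names `should_evict` and `FencedStore`, the observer IDs, and the epoch scheme are all hypothetical.

```python
from typing import Dict, Optional


def should_evict(missed_by_observer: Dict[str, int], n_missed: int) -> bool:
    """Evict only if a strict majority of observers each saw >= n_missed missed beats."""
    votes = sum(1 for missed in missed_by_observer.values() if missed >= n_missed)
    return votes > len(missed_by_observer) // 2


class FencedStore:
    """Rejects writes carrying an epoch older than the newest one it has seen."""

    def __init__(self) -> None:
        self.highest_epoch = 0
        self.value: Optional[str] = None

    def write(self, epoch: int, value: str) -> bool:
        if epoch < self.highest_epoch:
            return False  # stale leader from before the failover: fenced off
        self.highest_epoch = epoch
        self.value = value
        return True


# One observer has a bad link to the node; the other two still hear it.
print(should_evict({"obs-1": 5, "obs-2": 0, "obs-3": 1}, n_missed=3))  # False
# All observers agree the node has gone silent.
print(should_evict({"obs-1": 5, "obs-2": 4, "obs-3": 3}, n_missed=3))  # True

store = FencedStore()
store.write(epoch=1, value="from old leader")
store.write(epoch=2, value="from new leader")    # failover bumped the epoch
print(store.write(epoch=1, value="late write"))  # False
```

The quorum check reduces false evictions caused by a single bad link; the epoch check limits the damage when the guess is wrong anyway.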