System design course

Building a code deployment system

Implement the rollout orchestrator state machine, batched health-gated deploys, canary metric analysis, and idempotent automatic rollback.


The deployment as a resumable state machine

A deployment is a durable state machine driven by the orchestrator. Because every transition is persisted, the deployment survives an orchestrator restart and is never stranded in an unknown, half-deployed state:

PENDING → CANARY → CANARY_ANALYSIS → ROLLING(batch 1..N) → COMPLETE
                          │                    │
                          └── FAILED → ROLLING_BACK → ROLLED_BACK

State (current batch, per-host version, target artifact) is persisted, so a crashed orchestrator resumes exactly where it left off.
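
As a rough illustration of the persistence idea, here is a minimal sketch that records each transition to a JSON file before moving on. The names `State`, `TRANSITIONS`, and `DeploymentRecord`, and the choice of a local file as storage, are assumptions made for this example; a real orchestrator would more likely use a database, and would also persist per-host versions, which this sketch leaves out.

```python
import json
import os
import tempfile
from enum import Enum


class State(str, Enum):
    PENDING = "PENDING"
    CANARY = "CANARY"
    CANARY_ANALYSIS = "CANARY_ANALYSIS"
    ROLLING = "ROLLING"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"
    ROLLING_BACK = "ROLLING_BACK"
    ROLLED_BACK = "ROLLED_BACK"


# Legal transitions, mirroring the diagram above (ROLLING loops once per batch).
TRANSITIONS = {
    State.PENDING: {State.CANARY},
    State.CANARY: {State.CANARY_ANALYSIS},
    State.CANARY_ANALYSIS: {State.ROLLING, State.FAILED},
    State.ROLLING: {State.ROLLING, State.COMPLETE, State.FAILED},
    State.FAILED: {State.ROLLING_BACK},
    State.ROLLING_BACK: {State.ROLLED_BACK},
    State.COMPLETE: set(),
    State.ROLLED_BACK: set(),
}


class DeploymentRecord:
    """Hypothetical durable record: state, current batch, and target artifact."""

    def __init__(self, path, artifact, state=State.PENDING, batch=0):
        self.path, self.artifact = path, artifact
        self.state, self.batch = State(state), batch

    def transition(self, new_state, batch=None):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state.value} -> {new_state.value}")
        self.state = new_state
        if batch is not None:
            self.batch = batch
        self._persist()

    def _persist(self):
        # Write to a temp file, then rename atomically, so a crash never leaves a torn record.
        payload = {"artifact": self.artifact, "state": self.state.value, "batch": self.batch}
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    @classmethod
    def load(cls, path):
        # Called by a restarted orchestrator to pick up where the old one stopped.
        with open(path) as f:
            data = json.load(f)
        return cls(path, data["artifact"], data["state"], data["batch"])
```

A restarted orchestrator would call `DeploymentRecord.load(path)` and continue from the recorded state and batch instead of starting over.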

The batched rolling loop

Deploy in batches and health-gate each one before proceeding:

def roll_out(deploy, hosts, batch_pct=10):
    deployed_so_far = []                        # hosts already moved to the new version
    for batch in chunk(hosts, pct=batch_pct):
        for host in batch:
            host.pull(deploy.artifact)          # from nearby mirror / P2P
            host.restart_with(deploy.artifact)
        deployed_so_far.extend(batch)           # include the current batch in any rollback
        if not healthy(batch, settle="60s"):    # readiness + error/latency metrics
            rollback(deploy, deployed_so_far)   # halt and revert
            return FAILED
        deploy.advance(batch)                   # persist progress
    return COMPLETE
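
The loop relies on a `chunk` helper that the text does not define. One possible version, assuming batches are sized as a percentage of the fleet and never smaller than one host, might look like this:

```python
import math


def chunk(hosts, pct=10):
    """Yield consecutive batches, each holding roughly pct percent of hosts (at least one)."""
    hosts = list(hosts)
    size = max(1, math.ceil(len(hosts) * pct / 100))
    for start in range(0, len(hosts), size):
        yield hosts[start:start + size]
```

Rounding the batch size up keeps the number of batches bounded for small fleets; the last batch may be smaller than the others.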

Health means a passing readiness probe plus real signals (error rate, p99 latency, crash loops) compared against a baseline, not just “process is up.”
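
To make the baseline comparison concrete, here is a small sketch of a health gate. The `Signals` type and the threshold values (one percentage point of extra errors, 20% extra p99 latency, zero restarts) are illustrative assumptions, not recommendations; suitable thresholds depend on the service.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    ready: bool            # readiness probe result
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float
    restarts: int          # process restarts seen during the settle window


def is_healthy(current, baseline, max_error_delta=0.01, max_latency_ratio=1.2, max_restarts=0):
    """Return True only if the batch is ready and no signal regressed past its threshold."""
    if not current.ready:
        return False
    if current.restarts > max_restarts:          # treated as a crash loop
        return False
    if current.error_rate - baseline.error_rate > max_error_delta:
        return False
    if current.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return False
    return True
```

Comparing against a baseline rather than fixed limits means a service that is normally slow is not flagged on every deploy, while a regression on a normally fast service still is.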

Canary analysis

Before the full rollout, send a small traffic slice to the new version and compare metrics statistically against the control (old version):

def canary(deploy):
    deploy_to(canary_cohort, deploy.artifact)   # ~1-5% of hosts/traffic
    metrics = observe(canary_cohort, window="10m")
    if regression(metrics.error_rate, metrics.latency, vs=baseline()):
        rollback(deploy, canary_cohort)
        return FAILED                           # auto-abort, users barely affected
    return PASS                                  # promote to full rolling

The canary catches bad deploys while the blast radius is still tiny. The key is that promotion and abort are automated, so the gate does not depend on a human watching a dashboard.
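
One way to make the comparison against the control statistical rather than eyeballed is a one-sided two-proportion z-test on error counts. This is a swapped-in illustration, not necessarily what the `regression` helper above does; the function name and the default critical value of 1.645 (a one-sided 5% significance level) are assumptions for this sketch.

```python
import math


def error_rate_regressed(canary_errors, canary_total, control_errors, control_total, z_critical=1.645):
    """One-sided two-proportion z-test: is the canary error rate significantly higher?"""
    p_canary = canary_errors / canary_total
    p_control = control_errors / control_total
    pooled = (canary_errors + control_errors) / (canary_total + control_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / canary_total + 1 / control_total))
    if std_err == 0:           # identical all-success or all-failure cohorts: nothing to distinguish
        return False
    z = (p_canary - p_control) / std_err
    return z > z_critical
```

A test like this needs enough canary traffic to be meaningful; with very few requests in the window, a real regression can fail to reach significance, so the observation window and cohort size matter as much as the threshold.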

Automatic, idempotent rollback

Rollback re-points hosts to the previous known-good artifact. Because artifacts are immutable and versioned, this is just “redeploy version N−1” — fast and reliable:

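def rollback(deploy, affected):
    prev = deploy.previous_good_version
    for host in affected:
        # current_version is a hypothetical attribute holding the persisted per-host version
        if host.current_version != prev:        # already reverted: skip, so a retry is a no-op
            host.restart_with(prev)
    alert(deploy, f"rolled back to {prev}")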

Idempotency matters: a rollback may itself be retried after a failure, so re-applying the same target version must be a no-op if already done.
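
The no-op property can be checked with a small self-contained sketch. The in-memory `Host` class and the `rollback_idempotent` function are hypothetical stand-ins for a real host agent and rollback routine; the restart counter exists only to make repeated calls observable.

```python
class Host:
    """Hypothetical in-memory stand-in for a real host agent."""

    def __init__(self, name, version):
        self.name, self.version, self.restarts = name, version, 0

    def restart_with(self, version):
        self.version = version
        self.restarts += 1


def rollback_idempotent(hosts, previous_good):
    """Revert every host to previous_good; return the names of hosts actually changed."""
    changed = []
    for host in hosts:
        if host.version == previous_good:    # already on target: skip instead of restarting
            continue
        host.restart_with(previous_good)
        changed.append(host.name)
    return changed
```

Running the function a second time on the same fleet changes nothing and restarts no host, which is what makes it safe to retry after a partial failure.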

Artifact distribution

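Hosts pull the artifact from a regional mirror or via P2P, so a fleet of N hosts does not send N full downloads to the central registry. Artifacts are stored as content-addressed layers, so only the layers that changed are transferred. Each host verifies the artifact checksum before running it, which guards against corrupted or tampered downloads.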
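
The integrity check can be sketched with Python's standard `hashlib`. The text does not say which hash the system uses; SHA-256 and the function name `verify_artifact` are assumptions for this example.

```python
import hashlib
import hmac


def verify_artifact(path, expected_sha256, chunk_size=1 << 20):
    """Stream the file through SHA-256 and compare with the expected hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so large artifacts never need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return hmac.compare_digest(digest.hexdigest(), expected_sha256.lower())
```

A host would refuse to start the new version when this returns `False`, and would typically retry the download from a different mirror or peer.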

Blue-green as an alternative

For instant rollback, deploy the new version to a parallel green fleet, health-check it fully, then flip the load balancer to green. Keep blue warm so that rollback is a single traffic switch back. The cost is double capacity for the duration of the deploy.
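
The switch itself is small enough to sketch. `BlueGreenRouter` below is a hypothetical stand-in for a load balancer with two backend pools; in practice the flip would be a load balancer or DNS configuration change, not an attribute assignment.

```python
class BlueGreenRouter:
    """Hypothetical stand-in for a load balancer with two backend pools."""

    def __init__(self, live="blue"):
        self.live = live
        self.versions = {"blue": None, "green": None}

    @property
    def idle(self):
        return "green" if self.live == "blue" else "blue"

    def deploy_idle(self, version):
        self.versions[self.idle] = version       # live traffic is untouched

    def promote(self, health_check):
        """Flip traffic to the idle fleet only if it passes the health check."""
        if not health_check(self.idle):
            return False
        self.live = self.idle
        return True

    def rollback(self):
        self.live = self.idle                    # the previous fleet is still warm
```

Because both fleets stay up, `rollback` is the same operation as `promote` without the health gate, which is why it is effectively instant.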

Failure handling

  • Bad deploy → canary/health gate halts it; auto-rollback to last-known-good.
  • Orchestrator crash → resume from persisted state machine.
  • Partial batch failure → roll back the affected hosts, hold the rest on the old version.
  • Stuck host (won’t health-check) → time out, mark failed, exclude and alert.
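
The stuck-host case above can be sketched as a bounded wait. The function names, the 60-second default, and the injectable `clock` and `sleep` parameters (there so the logic can be tested without real waiting) are assumptions for this example.

```python
import time


def wait_until_healthy(probe, timeout_s=60.0, interval_s=1.0, clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True or timeout_s elapses; return the final verdict."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def partition_batch(batch, probe_for, **wait_kwargs):
    """Split a batch into (healthy, stuck) hosts; stuck hosts are the ones to exclude and alert on."""
    healthy, stuck = [], []
    for host in batch:
        ok = wait_until_healthy(lambda: probe_for(host), **wait_kwargs)
        (healthy if ok else stuck).append(host)
    return healthy, stuck
```

This version waits on hosts one after another for simplicity; a real orchestrator would likely probe a batch concurrently so that one stuck host does not delay the verdict on the others.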

The takeaway

Concrete signals: a persisted, resumable state machine, batched health-gated rolling with canary metric analysis, automatic idempotent rollback to an immutable previous artifact, and P2P/mirror artifact distribution. Progressive rollout + automated rollback is how you deploy to thousands of hosts without fear — and it reuses health checks, load balancing, and immutable-artifact ideas from earlier.