Building a code deployment system
Implement the rollout orchestrator state machine, batched health-gated deploys, canary metric analysis, and idempotent automatic rollback.
The deployment as a resumable state machine
A deployment is a durable state machine driven by the orchestrator, so it survives an orchestrator restart and is never left stranded half-deployed:
PENDING → CANARY → CANARY_ANALYSIS → ROLLING(batch 1..N) → COMPLETE
                          │                   │
                          └───────────────────┴── FAILED → ROLLING_BACK → ROLLED_BACK
State (current batch, per-host version, target artifact) is persisted, so a crashed orchestrator resumes exactly where it left off.
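A minimal sketch of how such a record might be persisted, assuming a single JSON file as the durable store (a real orchestrator would more likely use a database). The `DeploymentRecord` class, the `TRANSITIONS` table, and the field names are hypothetical, and the allowed transitions are one reading of the diagram above:

```python
import json
import os
import tempfile
from enum import Enum


class State(str, Enum):
    PENDING = "PENDING"
    CANARY = "CANARY"
    CANARY_ANALYSIS = "CANARY_ANALYSIS"
    ROLLING = "ROLLING"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"
    ROLLING_BACK = "ROLLING_BACK"
    ROLLED_BACK = "ROLLED_BACK"


# Allowed transitions (assumed from the diagram; ROLLING -> ROLLING is "next batch").
TRANSITIONS = {
    State.PENDING: {State.CANARY},
    State.CANARY: {State.CANARY_ANALYSIS},
    State.CANARY_ANALYSIS: {State.ROLLING, State.FAILED},
    State.ROLLING: {State.ROLLING, State.COMPLETE, State.FAILED},
    State.FAILED: {State.ROLLING_BACK},
    State.ROLLING_BACK: {State.ROLLED_BACK},
    State.COMPLETE: set(),
    State.ROLLED_BACK: set(),
}


class DeploymentRecord:
    """Hypothetical durable record: state, target artifact, batch, per-host version."""

    def __init__(self, path, artifact):
        self.path = path
        self.data = {"state": State.PENDING.value, "artifact": artifact,
                     "batch": 0, "host_versions": {}}

    @classmethod
    def load(cls, path):
        # Called by a restarted orchestrator to pick up the persisted state.
        with open(path) as f:
            data = json.load(f)
        record = cls(path, data["artifact"])
        record.data = data
        return record

    def transition(self, new_state, **updates):
        current = State(self.data["state"])
        if new_state not in TRANSITIONS[current]:
            raise ValueError(f"illegal transition {current.value} -> {new_state.value}")
        self.data.update(updates)
        self.data["state"] = new_state.value
        self._persist()

    def _persist(self):
        # Write to a temp file, fsync, then atomically rename over the old record
        # so a crash never leaves a half-written file behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(self.data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.path)
```

After a crash, the orchestrator would call `DeploymentRecord.load(path)` and continue from the stored state and batch. A step that was in flight when the crash happened may run again, which is one reason the per-host operations need to be idempotent.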
The batched rolling loop
Deploy in batches, health-gate each before proceeding:
def roll_out(deploy, hosts, batch_pct=10):
    deployed_so_far = []                       # hosts already moved to the new artifact
    for batch in chunk(hosts, pct=batch_pct):
        for host in batch:
            host.pull(deploy.artifact)         # from nearby mirror / P2P
            host.restart_with(deploy.artifact)
        deployed_so_far.extend(batch)          # include this batch so a failure reverts it too
        if not healthy(batch, settle="60s"):   # readiness + error/latency metrics
            rollback(deploy, deployed_so_far)  # halt and revert
            return FAILED
        deploy.advance(batch)                  # persist progress
    return COMPLETE
Health means the readiness probe plus real signals (error rate, p99 latency, crash loops) compared against a baseline, not just “process is up.”
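The baseline comparison could look roughly like the sketch below. The `BatchMetrics` type, the `batch_is_healthy` name, and both thresholds are illustrative assumptions, not values taken from the text; real thresholds depend on the service.

```python
from dataclasses import dataclass


@dataclass
class BatchMetrics:
    ready_hosts: int        # hosts passing the readiness probe
    total_hosts: int
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float
    crash_loops: int        # hosts that restarted repeatedly during the settle window


def batch_is_healthy(current, baseline, max_error_delta=0.01, max_latency_ratio=1.2):
    """Return True only if the batch is ready AND its signals stay near the baseline."""
    if current.ready_hosts < current.total_hosts:
        return False                                   # "process is up" is necessary...
    if current.crash_loops > 0:
        return False                                   # ...but crash loops fail the gate
    if current.error_rate > baseline.error_rate + max_error_delta:
        return False                                   # absolute error-rate regression
    if current.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return False                                   # relative latency regression
    return True
```

Comparing against a baseline rather than a fixed number means a service that normally runs at a 1% error rate is not failed for being at 1%, while a jump to 5% is caught.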
Canary analysis
Before the full rollout, send a small traffic slice to the new version and compare metrics statistically against the control (old version):
def canary(deploy):
    deploy_to(canary_cohort, deploy.artifact)   # ~1-5% of hosts/traffic
    metrics = observe(canary_cohort, window="10m")
    if regression(metrics.error_rate, metrics.latency, vs=baseline()):
        rollback(deploy, canary_cohort)
        return FAILED                           # auto-abort, users barely affected
    return PASS                                 # promote to full rolling
The canary catches bad deploys while the blast radius is still tiny; automating the promote-or-abort decision is what makes it dependable.
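The text does not say which statistical comparison `regression` performs. One common choice for error rates is a one-sided two-proportion z-test, sketched below under that assumption. The function name `error_rate_regressed` and the `alpha` default are hypothetical.

```python
import math


def error_rate_regressed(canary_errors, canary_total,
                         control_errors, control_total, alpha=0.01):
    """One-sided two-proportion z-test.

    Returns True when the canary error rate is significantly HIGHER than the
    control error rate at significance level alpha.
    """
    if canary_total == 0 or control_total == 0:
        raise ValueError("both cohorts need traffic before they can be compared")

    p_canary = canary_errors / canary_total
    p_control = control_errors / control_total

    # Pooled error rate under the null hypothesis "both versions behave the same".
    pooled = (canary_errors + control_errors) / (canary_total + control_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / canary_total + 1 / control_total))
    if std_err == 0:
        return False  # all-success or all-failure in both cohorts: no evidence of a difference

    z = (p_canary - p_control) / std_err
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) for a standard normal
    return p_value < alpha
```

A test like this guards against aborting on noise: with only a few hundred canary requests, one extra error can move the raw rate a lot without being significant. Latency usually needs a different comparison (for example on percentiles), which is not shown here.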
Automatic, idempotent rollback
Rollback re-points hosts to the previous known-good artifact. Because artifacts are immutable and versioned, this is just “redeploy version N−1”, which is fast and reliable:
def rollback(deploy, affected):
    prev = deploy.previous_good_version
    for host in affected:
        host.restart_with(prev)                  # idempotent: safe to retry
    alert(deploy, f"rolled back to {prev}")      # f-string works even if prev is not a str
Idempotency matters: a rollback may itself be retried after a failure, so re-applying the same target version must be a no-op if already done.
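To make the no-op property concrete, here is a small sketch with an in-memory `Host` stand-in. The `Host` class, its `restarts` counter, and `rollback_hosts` are hypothetical; a real host agent would compare the running artifact's version or digest instead.

```python
class Host:
    """In-memory stand-in for a host agent (hypothetical)."""

    def __init__(self, name, version):
        self.name = name
        self.version = version
        self.restarts = 0

    def restart_with(self, version):
        # Idempotent: if the host already runs the target version, do nothing.
        if self.version == version:
            return False
        self.version = version
        self.restarts += 1
        return True


def rollback_hosts(hosts, previous_good):
    """Re-point every host at previous_good; return the names that actually changed."""
    return [host.name for host in hosts if host.restart_with(previous_good)]
```

Running `rollback_hosts` twice on the same hosts changes them only the first time; the second call returns an empty list and triggers no restarts, so a retried rollback cannot cause extra disruption.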
Artifact distribution
Hosts pull the artifact from a regional mirror or via P2P so the registry isn’t hit once per host; only changed content-addressed layers transfer. A host verifies the artifact checksum before running it, so a corrupted or tampered download is never executed.
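A minimal sketch of both ideas, assuming SHA-256 as the content hash (the text does not name one). The helper names `sha256_of`, `verify_artifact`, and `layers_to_fetch` are illustrative.

```python
import hashlib
import hmac


def sha256_of(data: bytes) -> str:
    """Content address of a blob: the hex SHA-256 digest of its bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, expected_sha256: str) -> bool:
    """Check downloaded bytes against the digest published with the release."""
    # compare_digest does a constant-time comparison of the two hex strings.
    return hmac.compare_digest(sha256_of(data), expected_sha256.lower())


def layers_to_fetch(manifest_digests, local_digests):
    """Return only the layers the host does not already hold, in manifest order."""
    have = set(local_digests)
    return [digest for digest in manifest_digests if digest not in have]
```

Because a layer's name is its digest, "has this layer changed?" reduces to a set lookup, and a host that already holds the base layers downloads only the new ones.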
Blue-green as an alternative
For instant rollback, deploy the new version to a parallel green fleet, health-check it fully, then flip the load balancer to green; keep blue warm so rollback is a single traffic switch back. The cost is double capacity for the duration of the deploy.
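The flip itself is small, as the sketch below suggests. `LoadBalancer` here is an in-memory stand-in, and `deploy_to_green` and `green_is_healthy` are hypothetical callables supplied by the caller; none of this models a real load balancer API.

```python
class LoadBalancer:
    """In-memory stand-in that routes all traffic to one fleet (hypothetical)."""

    def __init__(self, active="blue"):
        self.active = active

    def switch_to(self, fleet):
        self.active = fleet


def blue_green_deploy(lb, deploy_to_green, green_is_healthy):
    """Deploy to green and flip traffic only if green passes its health check."""
    deploy_to_green()
    if not green_is_healthy():
        return lb.active          # traffic never left blue, nothing to undo
    lb.switch_to("green")
    return lb.active


def blue_green_rollback(lb):
    """Rollback is a single switch, because blue was kept warm."""
    lb.switch_to("blue")
    return lb.active
```

Note that a failed health check on green needs no rollback at all, since users were never routed to it.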
Failure handling
- Bad deploy → canary/health gate halts it; auto-rollback to last-known-good.
- Orchestrator crash → resume from persisted state machine.
- Partial batch failure → roll back the affected hosts, hold the rest on the old version.
- Stuck host (won’t health-check) → time out, mark failed, exclude and alert.
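The stuck-host case can be sketched as a health wait with a deadline. The names `wait_until_healthy` and `gate_batch` are hypothetical, and hosts are checked one after another here for simplicity, whereas a real orchestrator would likely poll them concurrently. The clock and sleep functions are injectable so the logic can be exercised without real waiting.

```python
import time


def wait_until_healthy(check, timeout_s, interval_s=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False          # stuck host: give up instead of blocking the rollout
        sleep(interval_s)


def gate_batch(hosts, check_host, timeout_s, interval_s=1.0, **timing):
    """Split a batch into hosts that became healthy and hosts that timed out."""
    healthy, failed = [], []
    for host in hosts:
        ok = wait_until_healthy(lambda: check_host(host), timeout_s, interval_s, **timing)
        (healthy if ok else failed).append(host)
    return healthy, failed
```

The `failed` list is what would be marked failed, excluded, and alerted on, while the orchestrator decides separately whether the size of that list should fail the whole batch.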
The takeaway
Concrete signals: a persisted, resumable state machine, batched health-gated rolling with canary metric analysis, automatic idempotent rollback to an immutable previous artifact, and P2P/mirror artifact distribution. Progressive rollout plus automated rollback is how you deploy to thousands of hosts without fear, and it reuses health checks, load balancing, and immutable-artifact ideas from earlier.