Designing a distributed job scheduler
Build a system that runs tasks at a time or on a schedule across a worker fleet — reliably, exactly when due, and without losing or double-running jobs.
The problem
Design a distributed job/task scheduler (think cron-as-a-service, Airflow, or a delayed-job system): users submit jobs to run at a specific time, on a recurring schedule, or after a delay, and a fleet of workers executes them reliably. Powers reminders, report generation, retries, billing runs, cleanup.
Step 1 — Requirements
Functional: submit a one-off job (run at time T), a recurring job (cron), or a delayed job; execute it on a worker; retry on failure; let users query status/cancel.
Non-functional: reliability (every due job runs — no lost jobs), at-least-once execution (so make jobs idempotent), scalability (millions of jobs), timeliness (run close to when due), and no double execution (ideally).
Step 2 — Data model
Job: id, type, payload, run_at, schedule (cron|null),
status (pending|running|done|failed), attempts, owner, created_at
Store jobs in a durable DB (sharded by time or job id). Recurring jobs compute the
next run_at after each run.
Step 3 — The architecture
Three roles, decoupled by a queue:
clients → API → Job store (DB)
▲
│ poll/claim due jobs
[Scheduler] ──enqueue──▶ Message queue ──▶ Worker pool ──▶ execute
│ │
└────────── update status ◀───────────────┘
- Scheduler — finds jobs whose
run_at ≤ nowand enqueues them. - Queue — decouples scheduling from execution, absorbs bursts.
- Workers — pull jobs, execute, report result, schedule retries/next run.
Step 4 — Finding due jobs efficiently
Scanning all jobs every second doesn’t scale. Options:
- Indexed time query — an index on
run_at; the scheduler pollsWHERE run_at <= now AND status = pendingevery few seconds. Simple, scales with sharding by time bucket. - Time-bucketing / timing wheel — bucket jobs by minute; only load the current bucket. Efficient for huge volumes.
- For far-future jobs, store in the DB; pull into an in-memory priority queue (min-heap by run_at) only when near-due.
Step 5 — Reliability: don’t lose or double-run jobs
- Durability — jobs persist in the DB before acking submission, so a crash loses nothing.
- At-least-once + idempotency — a worker may crash mid-job; another retries, so jobs must be idempotent (dedup by job id/attempt).
- Visibility timeout / leasing — when a worker claims a job, it leases it (locks with a timeout). If the worker dies, the lease expires and another worker reclaims it. This prevents both loss (job comes back) and most double-runs.
- Scheduler HA — run multiple schedulers with leader election (or partition the job space) so one failing doesn’t stop scheduling, without two schedulers enqueuing the same job.
Step 6 — Retries and failures
Failed jobs retry with exponential backoff up to a max attempt count, then go to
a dead-letter queue for inspection. Track attempts on the job.
Trade-offs to raise
- Exactly-once is hard — settle for at-least-once + idempotent jobs; true exactly-once needs transactional claim+execute, often not worth it.
- Timeliness vs load — frequent polling = timely but more DB load; timing wheels reduce it.
- Scheduler centralization — leader-based is simple but a bottleneck; partition the job key space to scale schedulers.
The interview cue
“Durable job store indexed by run_at, an HA scheduler (leader-elected or
partitioned) that enqueues due jobs onto a message queue, a worker pool that
leases jobs (visibility timeout) so a dead worker’s job is reclaimed,
at-least-once with idempotent jobs, and exponential-backoff retries → DLQ.”
Durability + leasing + idempotency is the reliability story; implementation next.