Designing a distributed job scheduler

Build a system that runs tasks at a time or on a schedule across a worker fleet — reliably, exactly when due, and without losing or double-running jobs.

The problem

Design a distributed job/task scheduler (think cron-as-a-service, Airflow, or a delayed-job system): users submit jobs to run at a specific time, on a recurring schedule, or after a delay, and a fleet of workers executes them reliably. Powers reminders, report generation, retries, billing runs, cleanup.

Step 1 — Requirements

Functional: submit a one-off job (run at time T), a recurring job (cron), or a delayed job; execute it on a worker; retry on failure; let users query status/cancel.

Non-functional: reliability (every due job runs — no lost jobs), at-least-once execution (so make jobs idempotent), scalability (millions of jobs), timeliness (run close to when due), and no double execution (ideally).

Step 2 — Data model

Job: id, type, payload, run_at, schedule (cron|null),
     status (pending|running|done|failed), attempts, owner, created_at

Store jobs in a durable DB (sharded by time or job id). Recurring jobs compute the next run_at after each run.

Step 3 — The architecture

Three roles, decoupled by a queue:

clients → API → Job store (DB)
                   ▲
                   │ poll/claim due jobs
              [Scheduler] ──enqueue──▶ Message queue ──▶ Worker pool ──▶ execute
                   │                                          │
                   └────────── update status ◀───────────────┘

Scheduler — finds jobs whose run_at ≤ now and enqueues them.
Queue — decouples scheduling from execution, absorbs bursts.
Workers — pull jobs, execute, report result, schedule retries/next run.

Step 4 — Finding due jobs efficiently

Scanning all jobs every second doesn’t scale. Options:

Indexed time query — an index on run_at; the scheduler polls WHERE run_at <= now AND status = pending every few seconds. Simple, scales with sharding by time bucket.
Time-bucketing / timing wheel — bucket jobs by minute; only load the current bucket. Efficient for huge volumes.
For far-future jobs, store in the DB; pull into an in-memory priority queue (min-heap by run_at) only when near-due.

Step 5 — Reliability: don’t lose or double-run jobs

Durability — jobs persist in the DB before acking submission, so a crash loses nothing.
At-least-once + idempotency — a worker may crash mid-job; another retries, so jobs must be idempotent (dedup by job id/attempt).
Visibility timeout / leasing — when a worker claims a job, it leases it (locks with a timeout). If the worker dies, the lease expires and another worker reclaims it. This prevents both loss (job comes back) and most double-runs.
Scheduler HA — run multiple schedulers with leader election (or partition the job space) so one failing doesn’t stop scheduling, without two schedulers enqueuing the same job.

Step 6 — Retries and failures

Failed jobs retry with exponential backoff up to a max attempt count, then go to a dead-letter queue for inspection. Track attempts on the job.

Trade-offs to raise

Exactly-once is hard — settle for at-least-once + idempotent jobs; true exactly-once needs transactional claim+execute, often not worth it.
Timeliness vs load — frequent polling = timely but more DB load; timing wheels reduce it.
Scheduler centralization — leader-based is simple but a bottleneck; partition the job key space to scale schedulers.

The interview cue

“Durable job store indexed by run_at, an HA scheduler (leader-elected or partitioned) that enqueues due jobs onto a message queue, a worker pool that leases jobs (visibility timeout) so a dead worker’s job is reclaimed, at-least-once with idempotent jobs, and exponential-backoff retries → DLQ.” Durability + leasing + idempotency is the reliability story; implementation next.