Building a distributed job scheduler
Implement atomic job claiming with leases, the due-job poll, recurring-schedule computation, and crash-safe retries — without losing or double-running work.
Claiming a job atomically (the core)
The correctness crux: two workers must never both run the same job. Make the claim an atomic conditional update that sets a lease:
-- atomically claim one due job (only if still pending or its lease expired)
UPDATE jobs
SET status = 'running', worker_id = :me, lease_until = now() + interval '60s',
attempts = attempts + 1
WHERE id = (
SELECT id FROM jobs
WHERE run_at <= now()
AND (status = 'pending' OR (status = 'running' AND lease_until < now()))
ORDER BY run_at
FOR UPDATE SKIP LOCKED -- skip rows other workers are claiming
LIMIT 1)
RETURNING id, payload;
FOR UPDATE SKIP LOCKED lets many workers claim different jobs concurrently
without blocking each other. The lease (lease_until) means a crashed worker’s
job becomes reclaimable once the lease expires — no loss, no stuck jobs.
The worker loop
def worker():
while True:
job = claim_due_job() # the atomic UPDATE above
if not job:
sleep(poll_interval); continue
try:
heartbeat_lease(job) # extend lease for long jobs
execute(job.payload) # idempotent!
mark_done(job) # or schedule next run if recurring
except Exception:
schedule_retry(job) # backoff, or DLQ if attempts exhausted
Executing exactly the intended number of times
Because a worker can crash after executing but before mark_done, execution is
at-least-once — the job may rerun. So execute must be idempotent: dedup by
(job_id, attempt) or design the side effect as an upsert. State this explicitly.
Recurring jobs
On completing a cron job, compute the next fire time and reinsert (or update
run_at) so it’s picked up again:
def mark_done(job):
if job.schedule: # cron expression
job.run_at = croniter(job.schedule, now()).get_next()
job.status = 'pending'; save(job)
else:
job.status = 'done'; save(job)
Guard against missed fires (scheduler was down past a fire time): decide whether to run the skipped occurrences (catch-up) or just schedule the next future one.
Efficient due-job lookup at scale
The indexed poll works to moderate scale. Beyond it:
- Timing wheel / time buckets — jobs bucketed by minute; load only the current
bucket into an in-memory min-heap keyed by
run_at, pop as they come due. - Shard by job id across scheduler partitions so each scheduler owns a slice; scale horizontally without two schedulers fighting over a job.
Scheduler high availability
Run N schedulers; either leader-elect one (via etcd/ZooKeeper) so only it enqueues, or partition the job space so each owns disjoint jobs. Either way, a scheduler death doesn’t stop the system, and no job is enqueued twice. Workers are stateless and scale freely behind the queue.
Retries and dead-lettering
def schedule_retry(job):
if job.attempts >= MAX_ATTEMPTS:
move_to_dead_letter(job) # park for inspection
else:
delay = base * (2 ** job.attempts) + jitter() # exponential backoff
job.run_at = now() + delay; job.status = 'pending'; save(job)
Failure handling summary
- Worker dies mid-job → lease expires → another worker reclaims it.
- Job keeps failing → backoff retries → DLQ.
- Scheduler dies → standby/partition continues; catch up missed fires.
- DB is the source of truth → nothing is lost on any crash.
The takeaway
Concrete signals: atomic lease-based claiming (FOR UPDATE SKIP LOCKED +
lease_until) so jobs are never lost or double-claimed, idempotent execution for
at-least-once safety, timing wheels/heaps for scale, leader-elected or
partitioned schedulers for HA, and backoff retries → DLQ. The leasing pattern
here recurs anywhere work is handed to a fleet that might crash.