Building a distributed job scheduler

Implement atomic job claiming with leases, the due-job poll, recurring-schedule computation, and crash-safe retries — without losing or double-running work.

Claiming a job atomically (the core)

The correctness crux: two workers must never both run the same job. Make the claim an atomic conditional update that sets a lease:

-- atomically claim one due job (only if still pending or its lease expired)
UPDATE jobs
   SET status = 'running', worker_id = :me, lease_until = now() + interval '60s',
       attempts = attempts + 1
 WHERE id = (
     SELECT id FROM jobs
      WHERE run_at <= now()
        AND (status = 'pending' OR (status = 'running' AND lease_until < now()))
      ORDER BY run_at
      FOR UPDATE SKIP LOCKED          -- skip rows other workers are claiming
      LIMIT 1)
 RETURNING id, payload;

FOR UPDATE SKIP LOCKED lets many workers claim different jobs concurrently without blocking each other. The lease (lease_until) means a crashed worker’s job becomes reclaimable once the lease expires — no loss, no stuck jobs.

The worker loop

def worker():
    while True:
        job = claim_due_job()                 # the atomic UPDATE above
        if not job:
            sleep(poll_interval); continue
        try:
            heartbeat_lease(job)              # extend lease for long jobs
            execute(job.payload)             # idempotent!
            mark_done(job)                   # or schedule next run if recurring
        except Exception:
            schedule_retry(job)              # backoff, or DLQ if attempts exhausted

Executing exactly the intended number of times

Because a worker can crash after executing but before mark_done, execution is at-least-once — the job may rerun. So execute must be idempotent: dedup by (job_id, attempt) or design the side effect as an upsert. State this explicitly.

Recurring jobs

On completing a cron job, compute the next fire time and reinsert (or update run_at) so it’s picked up again:

def mark_done(job):
    if job.schedule:                          # cron expression
        job.run_at = croniter(job.schedule, now()).get_next()
        job.status = 'pending'; save(job)
    else:
        job.status = 'done'; save(job)

Guard against missed fires (scheduler was down past a fire time): decide whether to run the skipped occurrences (catch-up) or just schedule the next future one.

Efficient due-job lookup at scale

The indexed poll works to moderate scale. Beyond it:

Timing wheel / time buckets — jobs bucketed by minute; load only the current bucket into an in-memory min-heap keyed by run_at, pop as they come due.
Shard by job id across scheduler partitions so each scheduler owns a slice; scale horizontally without two schedulers fighting over a job.

Scheduler high availability

Run N schedulers; either leader-elect one (via etcd/ZooKeeper) so only it enqueues, or partition the job space so each owns disjoint jobs. Either way, a scheduler death doesn’t stop the system, and no job is enqueued twice. Workers are stateless and scale freely behind the queue.

Retries and dead-lettering

def schedule_retry(job):
    if job.attempts >= MAX_ATTEMPTS:
        move_to_dead_letter(job)              # park for inspection
    else:
        delay = base * (2 ** job.attempts) + jitter()   # exponential backoff
        job.run_at = now() + delay; job.status = 'pending'; save(job)

Failure handling summary

Worker dies mid-job → lease expires → another worker reclaims it.
Job keeps failing → backoff retries → DLQ.
Scheduler dies → standby/partition continues; catch up missed fires.
DB is the source of truth → nothing is lost on any crash.

The takeaway

Concrete signals: atomic lease-based claiming (FOR UPDATE SKIP LOCKED + lease_until) so jobs are never lost or double-claimed, idempotent execution for at-least-once safety, timing wheels/heaps for scale, leader-elected or partitioned schedulers for HA, and backoff retries → DLQ. The leasing pattern here recurs anywhere work is handed to a fleet that might crash.