Building a distributed lock service
Implement a safe lease lock with atomic acquire/release, fencing tokens the resource enforces, and the ZooKeeper ephemeral-znode pattern.
Atomic acquire and release
Mutual exclusion requires an atomic acquire (set-if-absent) and a release that deletes only your own lock; otherwise a holder whose TTL has lapsed could delete a lock that now belongs to someone else. In Redis:
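from uuid import uuid4
from redis import Redis
redis = Redis()  # redis-py client; connection details omitted
# res is the name of the resource being locked
token = str(uuid4())  # unique per acquisition, so release can prove ownership
# acquire: SET only if absent, with a TTL
ok = redis.set(f"lock:{res}", token, nx=True, px=30_000)
# release: delete ONLY if we still own it (atomic via Lua)
RELEASE = """
if redis.call('GET', KEYS[1]) == ARGV[1]
then return redis.call('DEL', KEYS[1]) else return 0 end
"""
redis.eval(RELEASE, 1, f"lock:{res}", token)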
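Never do if get() == me: del() in app code: the lock can expire and be re-acquired by another client between the get and the del, so you would delete their lock.
The Lua script runs the check and the delete as one atomic step on the server.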
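To make the ownership check concrete without a Redis server, here is a minimal in-memory sketch. InMemoryLeaseStore and its explicit now_ms clock are hypothetical stand-ins invented for illustration, not part of any Redis client; the sketch only mirrors the semantics of SET NX PX and the compare-and-delete script.

```python
import uuid


class InMemoryLeaseStore:
    """Toy stand-in for Redis: one dict, an explicit clock, no real concurrency."""

    def __init__(self):
        self._locks = {}  # key -> (owner_token, expires_at_ms)

    def acquire(self, key, token, ttl_ms, now_ms):
        """Set-if-absent with a TTL (the role SET NX PX plays)."""
        current = self._locks.get(key)
        if current is not None and current[1] > now_ms:
            return False  # someone holds an unexpired lease
        self._locks[key] = (token, now_ms + ttl_ms)
        return True

    def release(self, key, token, now_ms):
        """Delete only if the stored token is ours (the role of the Lua script)."""
        current = self._locks.get(key)
        if current is None or current[1] <= now_ms or current[0] != token:
            return False
        del self._locks[key]
        return True

    def owner(self, key, now_ms):
        """Return the token of the current unexpired holder, or None."""
        current = self._locks.get(key)
        if current is None or current[1] <= now_ms:
            return None
        return current[0]


store = InMemoryLeaseStore()
a, b = str(uuid.uuid4()), str(uuid.uuid4())

a_acquired = store.acquire("lock:job", a, ttl_ms=30_000, now_ms=0)
b_blocked = not store.acquire("lock:job", b, ttl_ms=30_000, now_ms=1_000)  # A still holds it
b_acquired = store.acquire("lock:job", b, ttl_ms=30_000, now_ms=31_000)    # A's lease lapsed
released_by_a = store.release("lock:job", a, now_ms=32_000)                # A wakes up late
owner_after = store.owner("lock:job", now_ms=32_000)
```

Because release compares tokens, A's late release is a no-op and B keeps the lock. An unconditional delete would have removed B's lock at that point.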
Renewing the lease for long work
If work may exceed the TTL, run a watchdog that extends the lease while you still hold it. Treat this as best-effort: a GC pause or network stall can freeze the watchdog along with the worker, so renewal alone doesn’t make the lock safe without fencing:
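from time import sleep
EXTEND_IF_MINE = """
if redis.call('GET', KEYS[1]) == ARGV[1]
then return redis.call('PEXPIRE', KEYS[1], ARGV[2]) else return 0 end"""
def watchdog(res, token):
    while holding:  # flag the worker clears when it finishes
        # extend only if still ours (atomic compare-and-set TTL)
        if not redis.eval(EXTEND_IF_MINE, 1, f"lock:{res}", token, 30_000):
            break  # lease lost: stop renewing and abort the work
        sleep(10)  # renew well inside the 30 s TTL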
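The renewal loop can be sketched independently of Redis. In this sketch, run_watchdog, the extend callback, and the stop event are all hypothetical names chosen for illustration; extend stands in for whatever performs the owner-checked TTL extension. The point it shows is that the loop should end both when the work finishes and when a renewal reports that the lease is gone.

```python
import threading


def run_watchdog(extend, interval_s, stop):
    """Renew a lease until told to stop or until a renewal fails.

    extend: hypothetical zero-argument callable; returns True if the lease
            was extended, False if we no longer own the lock.
    stop:   threading.Event the worker sets when its job is finished.
    Returns True on a clean stop, False if the lease was lost.
    """
    while not stop.wait(interval_s):  # sleeps, but wakes early once stop is set
        if not extend():
            return False  # lease lost: the worker must stop touching the resource
    return True


# Simulated renewals: two succeed, the third reports the lease is gone.
results = iter([True, True, False])
calls = []


def fake_extend():
    calls.append(1)
    return next(results)


lease_kept = run_watchdog(fake_extend, interval_s=0.001, stop=threading.Event())

# A worker that has already finished stops the watchdog before any renewal.
finished = threading.Event()
finished.set()
clean_stop = run_watchdog(lambda: True, interval_s=0.001, stop=finished)
```

Waiting on an event rather than sleeping lets the worker shut the watchdog down promptly instead of waiting out a full interval.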
Fencing tokens (the safety mechanism)
Hand out a monotonically increasing token with each grant and make the resource enforce it. Even if a stalled old holder wakes up and tries to write, its lower token is rejected:
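# lock service: increment a counter on each grant
def grant(res, token):  # hypothetical helper, called as part of a successful acquire
    # issue the fence in the same atomic step as the grant where possible,
    # so a stalled client cannot pick up a newer number later
    fence = redis.incr(f"fence:{res}")  # monotonic
    return (token, fence)
# the protected resource (e.g. storage/DB) enforces it
class StaleLockError(Exception):
    pass
highest_seen = {}  # resource -> highest fencing token accepted so far
def write(resource, data, fence):
    if fence < highest_seen.get(resource, 0):
        raise StaleLockError()  # an older holder: reject
    highest_seen[resource] = fence
    apply(data)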
This is what makes the lock correct rather than merely usually exclusive.
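The stalled-holder scenario can be replayed end to end with a toy resource. FencedStore and the hard-coded token values below are assumptions made for illustration; in a real system the compare-and-update inside write would itself need to be atomic, for example a conditional write in the database.

```python
class StaleLockError(Exception):
    """Raised when a write carries a fencing token older than one already seen."""


class FencedStore:
    """Toy resource that remembers the highest fencing token seen per key."""

    def __init__(self):
        self._highest_seen = {}
        self.data = {}

    def write(self, key, value, fence):
        highest = self._highest_seen.get(key, 0)
        if fence < highest:
            raise StaleLockError(f"fence {fence} is older than {highest}")
        self._highest_seen[key] = fence
        self.data[key] = value


store = FencedStore()
fence_a, fence_b = 1, 2  # A was granted first; B was granted after A's lease expired

store.write("report", "from B", fence_b)  # B writes while A is stalled
try:
    store.write("report", "from A", fence_a)  # A wakes up and attempts a late write
    stale_rejected = False
except StaleLockError:
    stale_rejected = True

store.write("report", "from B again", fence_b)  # equal token: same holder, accepted
```

The comparison is strict (fence < highest) so that the current holder can write more than once with the same token, while any older token is refused.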
The ZooKeeper pattern (consensus-backed)
For a highly available, safe lock without rolling your own consensus:
acquire:
    create an EPHEMERAL SEQUENTIAL znode: /lock/resource/lock-0000000042
    list children; if mine has the LOWEST sequence → I hold the lock
    else WATCH the next-lower child, wait for it to disappear, then list children again
release: delete my znode (or it auto-deletes if my session dies)
Why it’s robust: ephemeral nodes vanish when a client’s session dies (automatic release, so no stuck locks); sequential ordering gives fair FIFO acquisition; watching only the predecessor avoids a thundering herd of waiters. ZooKeeper’s ZAB protocol (etcd offers the same guarantee via Raft) commits each znode creation through a majority quorum, so the two sides of a partition can’t both grant the lock.
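The "lowest sequence holds, everyone else watches their predecessor" decision is a small pure function. The sketch below deliberately avoids any ZooKeeper client library; lock_decision and sequence_of are hypothetical helper names, and the child names are made-up examples of what listing the lock's children might return (in no particular order).

```python
def sequence_of(node_name):
    """Extract the numeric suffix appended to a sequential znode name."""
    return int(node_name.rsplit("-", 1)[1])


def lock_decision(children, mine):
    """Return ("held", None) or ("wait", name_of_predecessor_to_watch)."""
    ordered = sorted(children, key=sequence_of)
    position = ordered.index(mine)
    if position == 0:
        return ("held", None)  # lowest sequence number: we own the lock
    return ("wait", ordered[position - 1])  # watch only the node just ahead of us


children = ["lock-0000000044", "lock-0000000042", "lock-0000000043"]
first = lock_decision(children, "lock-0000000042")
third = lock_decision(children, "lock-0000000044")
```

Here the holder of 42 owns the lock, and the holder of 44 watches 43 rather than 42, so a release wakes exactly one waiter. When the watched node disappears, the waiter should list the children again and re-run the decision, since its predecessor may have died while waiting rather than released the lock.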
Failure handling
- Holder crashes → TTL expires (Redis) or ephemeral znode disappears (ZK) → released automatically.
- Holder pauses past TTL → fencing token rejects its late writes; correctness preserved.
- Lock service node fails → a consensus cluster fails over as long as a quorum survives. Redis replication is asynchronous, so a replica promoted after a failover may be missing a lock that was just granted; prefer ZK/etcd for correctness-critical locks.
- Network partition → CP: the minority side can’t reach quorum and won’t grant.
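The partition case comes down to majority arithmetic. As a minimal sketch (has_quorum is a hypothetical helper, not an API of ZooKeeper or etcd), a side of the partition may grant locks only if it can reach a strict majority of the cluster:

```python
def has_quorum(reachable_nodes, cluster_size):
    """True only if this side can reach a strict majority of the cluster."""
    return reachable_nodes > cluster_size // 2


cluster_size = 5
majority_side = has_quorum(3, cluster_size)  # 3 of 5: can keep granting
minority_side = has_quorum(2, cluster_size)  # 2 of 5: must refuse
even_split = has_quorum(2, 4)                # 2 of 4: neither side has a majority
```

Since two disjoint groups can never both hold a strict majority, at most one side of a partition keeps granting. An even split leaves both sides unable to grant, which is one reason odd cluster sizes are the usual choice.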
The takeaway
In short: atomic acquire (NX + TTL) with owner-checked release, a watchdog for long work, and, the part that matters most, fencing tokens enforced by the resource, ideally on top of a consensus-backed lock service (ZooKeeper ephemeral sequential znodes). The lesson to carry: a distributed lock is only as safe as the fencing the resource enforces.