Building a rate limiter
Implement token bucket atomically in Redis with a Lua script, handle the distributed-counter race, and scale to per-node local buckets.
The core: atomic token bucket in Redis
The whole correctness problem is atomicity. Refill-then-decrement is a read-modify-write; if two requests interleave, both can see the same token count and both pass when only one should. The fix is to make the operation atomic. In Redis, run it as a Lua script (Redis executes scripts atomically):
-- KEYS[1] = bucket key ARGV: capacity, refill_rate, now, requested
local b = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(b[1]) or tonumber(ARGV[1]) -- start full
local ts = tonumber(b[2]) or tonumber(ARGV[3])
local cap, rate, now, need = tonumber(ARGV[1]), tonumber(ARGV[2]),
tonumber(ARGV[3]), tonumber(ARGV[4])
tokens = math.min(cap, tokens + (now - ts) * rate) -- refill by elapsed time
local allowed = tokens >= need
if allowed then tokens = tokens - need end
redis.call('HMSET', KEYS[1], 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', KEYS[1], math.ceil(cap / rate)) -- evict idle keys
return allowed and 1 or 0
The middleware calls this once per request; a 0 becomes a 429. Because the whole
script runs atomically on the single-threaded Redis, the race disappears — no
locks needed.
The middleware
def rate_limit(request):
key = f"rl:{request.api_key}:{request.route_rule}"
allowed = redis.evalsha(SCRIPT_SHA, 1, key, CAP, RATE, time.time(), 1)
if not allowed:
retry = compute_retry_after(key)
return Response(429, headers={"Retry-After": retry})
return forward(request)
Return helpful headers: X-RateLimit-Limit, X-RateLimit-Remaining,
X-RateLimit-Reset, and Retry-After so well-behaved clients back off.
Sliding window counter (the accurate alternative)
If you want to avoid token-bucket’s burst allowance and the fixed-window boundary spike, blend two fixed windows by weight:
count ≈ prev_window_count * (overlap fraction) + current_window_count
allow if count < limit
Cheap (two integer counters per key) and smooth — a common production choice.
Scaling: the per-node hop problem
A Redis call per request is fine to ~100k QPS, but at extreme scale that round trip becomes the bottleneck. Two mitigations:
- Local token buckets — each gateway node keeps its own bucket for a share of the limit (limit / N nodes). Zero network hop, but approximate (uneven traffic across nodes wastes allowance).
- Local + periodic sync — nodes limit locally and reconcile counts with Redis every few hundred ms, trading a little accuracy for big latency/throughput wins.
State the trade: exactness (central Redis) vs latency/throughput (local, approximate).
Availability: fail open
The limiter must never take down the service it protects. If Redis is unreachable, allow the request (and alert) rather than 429 everyone:
try:
allowed = redis.evalsha(...)
except RedisError:
metrics.incr("ratelimiter.fail_open")
allowed = True # degrade to allow
Run Redis replicated so this is rare, but design for it.
Edge cases and extras
- Multiple rules — evaluate per-endpoint and per-tier limits; the strictest wins.
- Clock skew — use the Redis server time (or a passed timestamp) consistently, not each node’s clock.
- Distributed-denial protection — combine per-key limits with a global limit so many keys can’t collectively overwhelm a backend.
- Hot key — a single abusive key hammering one Redis shard; shard counters or absorb with local buckets.
The takeaway
The graded points: an atomic token-bucket (Lua/single-threaded Redis) to kill the distributed-counter race, a shared store so the limit holds across instances, 429 + Retry-After, fail-open for availability, and a path to local buckets when the Redis hop becomes the bottleneck. That’s a production rate limiter, and it directly reuses Chapter 3’s token-vs-leaky and Chapter 2’s caching/consistency ideas.