Building a rate limiter

Implement token bucket atomically in Redis with a Lua script, handle the distributed-counter race, and scale to per-node local buckets.

The core: atomic token bucket in Redis

The whole correctness problem is atomicity. Refill-then-decrement is a read-modify-write; if two requests interleave, both can see the same token count and both pass when only one should. The fix is to make the operation atomic. In Redis, run it as a Lua script (Redis executes scripts atomically):

-- KEYS[1] = bucket key   ARGV: capacity, refill_rate, now, requested
local b = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(b[1]) or tonumber(ARGV[1])   -- start full
local ts     = tonumber(b[2]) or tonumber(ARGV[3])
local cap, rate, now, need = tonumber(ARGV[1]), tonumber(ARGV[2]),
                             tonumber(ARGV[3]), tonumber(ARGV[4])

tokens = math.min(cap, tokens + (now - ts) * rate)   -- refill by elapsed time
local allowed = tokens >= need
if allowed then tokens = tokens - need end

redis.call('HMSET', KEYS[1], 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', KEYS[1], math.ceil(cap / rate)) -- evict idle keys
return allowed and 1 or 0

The middleware calls this once per request; a 0 becomes a 429. Because the whole script runs atomically on the single-threaded Redis, the race disappears — no locks needed.

The middleware

def rate_limit(request):
    key = f"rl:{request.api_key}:{request.route_rule}"
    allowed = redis.evalsha(SCRIPT_SHA, 1, key, CAP, RATE, time.time(), 1)
    if not allowed:
        retry = compute_retry_after(key)
        return Response(429, headers={"Retry-After": retry})
    return forward(request)

Return helpful headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, and Retry-After so well-behaved clients back off.

Sliding window counter (the accurate alternative)

If you want to avoid token-bucket’s burst allowance and the fixed-window boundary spike, blend two fixed windows by weight:

count ≈ prev_window_count * (overlap fraction) + current_window_count
allow if count < limit

Cheap (two integer counters per key) and smooth — a common production choice.

Scaling: the per-node hop problem

A Redis call per request is fine to ~100k QPS, but at extreme scale that round trip becomes the bottleneck. Two mitigations:

Local token buckets — each gateway node keeps its own bucket for a share of the limit (limit / N nodes). Zero network hop, but approximate (uneven traffic across nodes wastes allowance).
Local + periodic sync — nodes limit locally and reconcile counts with Redis every few hundred ms, trading a little accuracy for big latency/throughput wins.

State the trade: exactness (central Redis) vs latency/throughput (local, approximate).

Availability: fail open

The limiter must never take down the service it protects. If Redis is unreachable, allow the request (and alert) rather than 429 everyone:

try:
    allowed = redis.evalsha(...)
except RedisError:
    metrics.incr("ratelimiter.fail_open")
    allowed = True            # degrade to allow

Run Redis replicated so this is rare, but design for it.

Edge cases and extras

Multiple rules — evaluate per-endpoint and per-tier limits; the strictest wins.
Clock skew — use the Redis server time (or a passed timestamp) consistently, not each node’s clock.
Distributed-denial protection — combine per-key limits with a global limit so many keys can’t collectively overwhelm a backend.
Hot key — a single abusive key hammering one Redis shard; shard counters or absorb with local buckets.

The takeaway

The graded points: an atomic token-bucket (Lua/single-threaded Redis) to kill the distributed-counter race, a shared store so the limit holds across instances, 429 + Retry-After, fail-open for availability, and a path to local buckets when the Redis hop becomes the bottleneck. That’s a production rate limiter, and it directly reuses Chapter 3’s token-vs-leaky and Chapter 2’s caching/consistency ideas.