Skip to content
System design course
Ch.4 · Designing real systems·concept ·8 min read

Designing a rate limiter

A service that caps how many requests a client may make in a window — where to place it, which algorithm to use, and how to share counters across a fleet.


The problem

A rate limiter caps how many requests a client (user, IP, API key) can make in a time window, rejecting or delaying the excess. It protects services from abuse, accidental floods, and resource exhaustion, and enforces fair use and billing tiers. The classic interview prompt: “design an API rate limiter that allows N requests per user per minute.”

Step 1 — Requirements

Functional:

  • Limit requests per client per window (e.g. 100/min per API key).
  • Support multiple rules (per-endpoint, per-tier limits).
  • When exceeded, reject with HTTP 429 and a Retry-After header.

Non-functional:

  • Low latency — it’s on every request’s hot path; must add ≤ a millisecond.
  • Accurate — don’t wrongly block legitimate traffic.
  • Distributed — work across many server instances sharing one limit.
  • Highly available + fault-tolerant — if the limiter is down, fail open (allow) rather than block everything.

Step 2 — Where it lives

Place it early, before expensive work:

  • In the API gateway or a reverse proxy / middleware — the common choice, so limits are enforced centrally before requests reach services.
  • As a sidecar or shared service the gateway calls.

The key design tension: the counter must be shared across all instances, so a user hitting different servers still counts against one limit → a centralized fast store (Redis) holds the counters.

Step 3 — The algorithm

You estimated this in Chapter 3; pick deliberately:

  • Token bucket — capacity B, refill R/sec. Allows bursts up to B, caps the average. Cheap state (count + timestamp). The usual default.
  • Leaky bucket — constant output rate, smooths bursts; adds queuing latency.
  • Fixed window counter — count per minute, reset on the boundary. Simple but allows a 2× burst at the boundary.
  • Sliding window log — store timestamps, count those in the last 60s. Accurate but memory-heavy.
  • Sliding window counter — weight current + previous window; fixes the boundary spike cheaply. Best accuracy/cost trade-off at scale.

Recommend token bucket (or sliding window counter) and say why.

Step 4 — Data model

In Redis, per client+rule: a small record. For token bucket:

key: ratelimit:{api_key}:{rule}
value: { tokens: 87, last_refill: 1718880000.123 }
TTL: window length (auto-cleanup of idle keys)

Step 5 — High-level design

client → gateway/middleware ──check──▶ Redis (counters)
              │ allowed → forward to service
              │ denied  → 429 + Retry-After

On each request: atomically refill + decrement the token count in Redis; allow if ≥ 1, else 429.

Trade-offs to raise

  • Accuracy vs latency/cost — fixed window is cheapest but bursty; sliding-window log is exact but heavy; token bucket / sliding-window counter sit in the middle.
  • Centralized counter latency — a Redis round trip per request. Mitigate with a Redis near the gateways; for extreme scale, local token buckets per node with approximate global coordination.
  • Fail open vs closed — if Redis is unreachable, allow traffic (avoid a self-inflicted outage) and alert.

The interview cue

“Enforce at the gateway, token bucket per API key stored in Redis so the limit is shared across instances, 429 + Retry-After on reject, and fail open if the store is down. At very high scale I’d push approximate local buckets to each node to avoid a Redis hop per request.” Placement + algorithm + shared store + failure stance is the complete answer; implementation is next.