Skip to content
System design course
Ch.4 · Designing real systems·how to build it ·7 min read

Building a notification service

Implement the channel workers with provider adapters, idempotent dedup, circuit breakers, and batched fan-out to millions of recipients.


The send pipeline

The API does the cheap, synchronous work (validate, dedup, enqueue) and returns fast; everything slow happens in workers:

def notify(req):
    if seen.contains(req.idempotency_key):        # dedup at the door
        return Accepted
    prefs = prefs_store.get(req.user_id)
    for ch in req.channels:
        if not prefs.allows(ch) or prefs.in_quiet_hours(ch):
            continue                              # respect preferences
        body = render(req.template, req.data, prefs.locale)
        queue[ch].publish({**req, "body": body})  # per-channel queue
    seen.add(req.idempotency_key)
    return Accepted

Channel workers with provider adapters

Each channel has a worker pool and a provider adapter (Strategy) so providers swap without touching the worker:

class PushAdapter:
    def send(self, device_token, body): ...   # APNs/FCM
class SmsAdapter:
    def send(self, phone, body): ...          # Twilio

def push_worker():
    for msg in queue["push"].consume():
        for device in active_devices(msg.user_id):
            with breaker("apns"):              # circuit breaker
                result = PushAdapter().send(device.token, msg.body)
            record_status(msg.id, device, result)
            if result.invalid_token:
                deactivate(device)            # prune dead tokens

Idempotency and exactly-once-ish delivery

Queues are at-least-once, so a worker may retry a message it already sent. Dedup at send time keyed by (notification_id, device):

def send_once(key, fn):
    if not dedup.set(key, nx=True, ttl="7d"):  # someone already sent it
        return
    fn()

Providers also dedupe on their idempotency_key, giving belt-and-suspenders.

Circuit breaker for flaky providers

A failing provider shouldn’t pile up retries and exhaust workers. Trip a breaker after a failure threshold; fail fast (and fall back) while open; probe to recover:

class Breaker:
    def call(self, fn):
        if self.state == OPEN and not self.cooldown_elapsed():
            raise ProviderDown()              # fail fast, try fallback channel
        try:
            r = fn(); self.record_success(); return r
        except ProviderError:
            self.record_failure()             # may trip to OPEN
            raise

Retries, DLQ, and status

  • Retry transient provider errors with exponential backoff; after max attempts → dead-letter with the reason.
  • Status comes back asynchronously via provider webhooks (delivered, bounced, opened) — update the notification log; auto-disable channels for hard bounces.

Fan-out to millions

“Notify all followers” must not run inline. Enqueue a fan-out job that pages through the audience and emits per-user messages in batches:

def fanout(notification, audience_query):
    for batch in paginate(audience_query, size=1000):
        for user in batch:
            queue["fanout"].publish({"user_id": user, **notification})

Workers then apply preferences and per-user rate limiting / digesting so a busy user gets one digest, not 500 pings.

Scaling and failure handling

  • Throughput → add partitions/workers per channel independently.
  • Provider outage → breaker + fallback channel; DLQ for un-sendable.
  • Hot fan-out (celebrity) → batched fan-out job, never inline.
  • Duplicate suppression → idempotency keys end-to-end.

The takeaway

Concrete signals: per-channel queues + worker pools with provider adapters, idempotent dedup, circuit breakers + fallbacks for flaky third parties, retry → DLQ, webhook status, and batched fan-out with per-user rate limiting. It’s the async/queue + idempotency + fan-out patterns composed — the same moves power feeds and pipelines elsewhere.