Building a notification service
Implement the channel workers with provider adapters, idempotent dedup, circuit breakers, and batched fan-out to millions of recipients.
The send pipeline
The API does the cheap, synchronous work (validate, dedup, enqueue) and returns fast; everything slow happens in workers:
def notify(req):
    if seen.contains(req.idempotency_key):  # dedup at the door
        return Accepted
    prefs = prefs_store.get(req.user_id)
    for ch in req.channels:
        if not prefs.allows(ch) or prefs.in_quiet_hours(ch):
            continue  # respect preferences
        body = render(req.template, req.data, prefs.locale)
        queue[ch].publish({**req, "body": body})  # per-channel queue
    seen.add(req.idempotency_key)
    return Accepted
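The check-then-add on the idempotency key leaves a window where two concurrent requests with the same key both pass the check. A minimal sketch of an atomic claim follows, using an in-memory stand-in for the dedup store. `SeenKeys`, `claim`, `queues`, and the `allowed_channels` parameter are hypothetical names for illustration; a real deployment would likely use a shared store with a set-if-absent primitive.

```python
import threading
from collections import defaultdict, deque


class SeenKeys:
    """In-memory stand-in for a shared dedup store (hypothetical)."""

    def __init__(self):
        self._keys = set()
        self._lock = threading.Lock()

    def claim(self, key):
        """Return True only for the first caller to present this key."""
        with self._lock:
            if key in self._keys:
                return False
            self._keys.add(key)
            return True


seen = SeenKeys()
queues = defaultdict(deque)  # channel name -> pending messages


def notify(req, allowed_channels):
    """Enqueue req once per allowed channel; duplicates are acknowledged only."""
    if not seen.claim(req["idempotency_key"]):
        return "accepted"  # duplicate: acknowledge without re-enqueueing
    for ch in req["channels"]:
        if ch in allowed_channels:
            queues[ch].append({**req, "channel": ch})
    return "accepted"
```

Claiming before enqueueing trades one risk for another: if the publish fails after the claim, the key is already spent, so the caller would need to release it or retry under a new key.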
Channel workers with provider adapters
Each channel has a worker pool and a provider adapter (Strategy) so providers swap without touching the worker:
class PushAdapter:
    def send(self, device_token, body): ...  # APNs/FCM
class SmsAdapter:
    def send(self, phone, body): ...  # Twilio
def push_worker():
    for msg in queue["push"].consume():
        for device in active_devices(msg.user_id):
            with breaker("apns"):  # circuit breaker
                result = PushAdapter().send(device.token, msg.body)
            record_status(msg.id, device, result)
            if result.invalid_token:
                deactivate(device)  # prune dead tokens
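One way to make the adapter boundary explicit is a structural interface that every provider client satisfies, so the worker can be tested against a fake. This is a sketch under assumptions: `SendResult`, `ProviderAdapter`, `FakePushAdapter`, and `deliver` are illustrative names, not part of any provider SDK.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class SendResult:
    ok: bool
    invalid_token: bool = False


class ProviderAdapter(Protocol):
    """Anything with a matching send() can be plugged into the worker."""

    def send(self, address: str, body: str) -> SendResult: ...


class FakePushAdapter:
    """Test double standing in for a real APNs/FCM client."""

    def __init__(self, dead_tokens):
        self.dead_tokens = set(dead_tokens)
        self.sent = []

    def send(self, address, body):
        if address in self.dead_tokens:
            return SendResult(ok=False, invalid_token=True)
        self.sent.append((address, body))
        return SendResult(ok=True)


def deliver(adapter: ProviderAdapter, tokens, body):
    """Send to each token; return the tokens the provider rejected as invalid."""
    dead = []
    for token in tokens:
        result = adapter.send(token, body)
        if result.invalid_token:
            dead.append(token)
    return dead
```

Passing the adapter in, rather than constructing it inside the loop, is what lets a provider be swapped or faked without touching the worker.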
Idempotency and exactly-once-ish delivery
Queues deliver at least once, so a worker may retry a message it already sent. Deduplicate at send time, keyed by (notification_id, device):
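def send_once(key, fn):
    if not dedup.set(key, nx=True, ttl="7d"):  # another attempt already claimed this send
        return
    try:
        fn()
    except Exception:
        dedup.delete(key)  # release the claim so a retry can send (assumes the store supports delete)
        raise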
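Some providers also accept an idempotency key or a collapse identifier. Where one is available, pass it through as a second layer of protection; where it is not, the send-time dedup above is the only guard.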
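The send-time dedup can be exercised without a real key-value store. The sketch below assumes an in-memory `DedupStore` with a set-if-absent operation and a TTL; the class, its method names, and the injected `clock` are hypothetical and only mimic what a shared store would provide.

```python
import time


class DedupStore:
    """In-memory stand-in for a store with set-if-absent and TTL (hypothetical)."""

    def __init__(self, clock=time.monotonic):
        self._expiry = {}
        self._clock = clock

    def set_if_absent(self, key, ttl_seconds):
        """Claim key for ttl_seconds; return False if a live claim exists."""
        now = self._clock()
        expires = self._expiry.get(key)
        if expires is not None and expires > now:
            return False
        self._expiry[key] = now + ttl_seconds
        return True

    def delete(self, key):
        self._expiry.pop(key, None)


def send_once(store, key, fn, ttl_seconds=7 * 24 * 3600):
    """Run fn at most once per key within the TTL; return True if it ran."""
    if not store.set_if_absent(key, ttl_seconds):
        return False  # already claimed by an earlier attempt
    try:
        fn()
    except Exception:
        store.delete(key)  # release the claim so a retry can send
        raise
    return True
```

A crash between the claim and the send still loses the message until the TTL expires, which is why this is "exactly-once-ish" rather than exactly-once.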
Circuit breaker for flaky providers
A failing provider shouldn’t pile up retries and exhaust workers. Trip a breaker after a failure threshold; fail fast (and fall back) while open; probe to recover:
class Breaker:
    def call(self, fn):
        if self.state == OPEN and not self.cooldown_elapsed():
            raise ProviderDown()  # fail fast, try fallback channel
        try:
            r = fn()
            self.record_success()
            return r
        except ProviderError:
            self.record_failure()  # may trip to OPEN
            raise
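The breaker above leaves its state handling implied. A self-contained sketch of one possible version follows, assuming a consecutive-failure threshold and a fixed cooldown; the parameter names, the defaults, and the injected `clock` are illustrative choices. It is single-threaded and lets every call through once the cooldown has elapsed, so a production version would likely limit concurrent probes.

```python
import time


class ProviderError(Exception): ...


class ProviderDown(Exception): ...


class Breaker:
    def __init__(self, failure_threshold=3, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0       # consecutive failures while closed
        self.opened_at = None   # None means the breaker is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                raise ProviderDown()  # open: fail fast
            # Cooldown elapsed: let this call through as a probe (half-open).
        try:
            result = fn()
        except ProviderError:
            self.failures += 1
            probing = self.opened_at is not None
            if probing or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip, or re-open after a failed probe
            raise
        self.failures = 0
        self.opened_at = None  # success closes the breaker
        return result
```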
Retries, DLQ, and status
- Retry transient provider errors with exponential backoff; after max attempts → dead-letter with the reason.
- Status comes back asynchronously via provider webhooks (delivered, bounced, opened) — update the notification log; auto-disable channels for hard bounces.
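The retry-then-dead-letter rule could look roughly like the sketch below. `TransientError`, `backoff_delays`, `send_with_retry`, the list used as a dead-letter queue, and the default delays are all assumptions for illustration. A real worker would more likely re-enqueue with a delay than sleep in place.

```python
import random
import time


class TransientError(Exception): ...


def backoff_delays(max_attempts, base=0.5, cap=30.0):
    """Delay before each retry: base * 2**n seconds, capped at `cap`."""
    return [min(cap, base * 2 ** n) for n in range(max_attempts - 1)]


def send_with_retry(msg, send, dead_letters, max_attempts=4,
                    sleep=time.sleep, jitter=random.random):
    """Try send(msg) up to max_attempts times; dead-letter on final failure."""
    delays = backoff_delays(max_attempts)
    for attempt in range(max_attempts):
        try:
            return send(msg)
        except TransientError as exc:
            if attempt == max_attempts - 1:
                dead_letters.append(
                    {"msg": msg, "reason": str(exc), "attempts": max_attempts}
                )
                return None
            # Scale the delay by 0.5x to 1.0x so retries do not synchronize.
            sleep(delays[attempt] * (0.5 + jitter() / 2))
```

Only errors the adapter classifies as transient are retried here; anything else propagates immediately, which matches sending hard failures straight to the dead-letter path upstream.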
Fan-out to millions
“Notify all followers” must not run inline. Enqueue a fan-out job that pages through the audience and emits per-user messages in batches:
def fanout(notification, audience_query):
    for batch in paginate(audience_query, size=1000):
        for user in batch:
            # user_id goes last so a key inside notification cannot override it
            queue["fanout"].publish({**notification, "user_id": user})
Workers then apply preferences and per-user rate limiting / digesting so a busy user gets one digest, not 500 pings.
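A minimal sketch of the digesting idea: allow a small number of individual sends per user per window, then hold the rest for a single digest. `Digester`, `offer`, `flush`, and the per-window `limit` are hypothetical names, and the window boundary is left to whatever calls `flush()`.

```python
from collections import defaultdict


class Digester:
    """Allow up to `limit` sends per user per window; buffer the rest."""

    def __init__(self, limit=3):
        self.limit = limit
        self.sent_count = defaultdict(int)
        self.pending = defaultdict(list)

    def offer(self, user_id, msg):
        """Return True if msg should be sent now, False if held for the digest."""
        if self.sent_count[user_id] < self.limit:
            self.sent_count[user_id] += 1
            return True
        self.pending[user_id].append(msg)
        return False

    def flush(self):
        """Call at the end of a window: return held messages and reset counters."""
        digests = dict(self.pending)
        self.pending = defaultdict(list)
        self.sent_count = defaultdict(int)
        return digests
```

In a multi-worker deployment the counters and buffers would have to live in shared storage; the in-process dictionaries here only show the control flow.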
Scaling and failure handling
- Throughput → add partitions/workers per channel independently.
- Provider outage → breaker + fallback channel; DLQ for un-sendable.
- Hot fan-out (celebrity) → batched fan-out job, never inline.
- Duplicate suppression → idempotency keys end-to-end.
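For the batched fan-out job, one possible shape is a generator that pages through any iterable audience and one publish call per page. This is a sketch: `publish_batch` is a hypothetical callable standing in for a queue client's batch publish, and a real audience query would page with a database cursor rather than an in-memory iterable.

```python
from itertools import islice


def paginate(user_ids, size=1000):
    """Yield lists of at most `size` items from any iterable."""
    it = iter(user_ids)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def fanout(notification, audience, publish_batch, size=1000):
    """Publish one message per user, one call per page; return the total count."""
    total = 0
    for batch in paginate(audience, size):
        publish_batch([{**notification, "user_id": uid} for uid in batch])
        total += len(batch)
    return total
```

Because `paginate` consumes the audience lazily, memory stays bounded by the page size no matter how large the follower list is.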
The takeaway
Concrete signals: per-channel queues + worker pools with provider adapters, idempotent dedup, circuit breakers + fallbacks for flaky third parties, retry → DLQ, webhook status, and batched fan-out with per-user rate limiting. It’s the async/queue + idempotency + fan-out patterns composed — the same moves power feeds and pipelines elsewhere.