Skip to content
System design course
Ch.4 · Designing real systems·how to build it ·8 min read

Building a distributed message queue

Implement the on-disk segmented log, the replicated produce/consume paths with offset tracking, idempotent consumers, and dead-letter handling.


The partition log on disk

A partition is an append-only log split into segments (files) for manageable retention and compaction. An offset index maps offset → byte position for O(1) seeks:

partition-0/
  00000000.log     # messages, append-only
  00000000.index   # offset -> file position (sparse)
  00010000.log     # next segment (rolls at size/time)

Produce appends; consume seeks to an offset via the index and scans forward. Old segments are deleted/compacted by retention policy — never on read.

The produce path (replicated, acked)

def produce(topic, key, value, acks="all"):
    p = partition_for(key, num_partitions)         # hash(key) % N (ordering by key)
    leader = metadata.leader(topic, p)
    offset = leader.append(value)                  # sequential write to active segment
    if acks == "all":
        wait_for_isr_replication(leader, offset)   # in-sync replicas confirm
    return (p, offset)

acks=all waits for the in-sync replica set before acking → durable. acks=1 (leader only) is faster but can lose the tail on leader failure — the latency-vs-durability knob.

Replication and leader failover

Followers continuously fetch from the leader and append to their copy; the leader tracks the high-water mark (offset replicated to all ISR) — consumers can only read up to it (so they never see un-replicated messages). On leader death, a controller (backed by a quorum like ZooKeeper/Raft) promotes an ISR follower; committed messages survive.

The consume path (pull + offset commit)

Consumers pull batches (good backpressure and batching) and commit their offset:

def consume_loop(group, topic, partition):
    offset = offset_store.get(group, topic, partition)   # resume where we left off
    while True:
        batch = broker.fetch(topic, partition, offset, max_bytes=1<<20)
        for msg in batch:
            process(msg)                                  # do the work FIRST
            offset = msg.offset + 1
        offset_store.commit(group, topic, partition, offset)  # then commit → at-least-once

Committing after processing gives at-least-once (a crash mid-batch reprocesses). Committing before gives at-most-once.

Consumer groups and rebalancing

A coordinator assigns partitions to consumers in a group (one partition → one consumer). When a consumer joins/leaves/dies (detected by heartbeats), the group rebalances — partitions are reassigned. The new owner resumes from the committed offset, so no message is skipped.

Idempotency (defeating duplicates)

Since at-least-once yields duplicates, consumers must be idempotent:

def process(msg):
    if seen.contains(msg.id):    # dedup by message id (DB unique key or cache)
        return
    apply(msg)
    seen.add(msg.id)

Or design the operation to be naturally idempotent (upserts keyed by id). Producers can also use an idempotent producer (sequence numbers) so retries don’t append duplicates.

Dead-letter queue and poison messages

A message that keeps failing (bad data) would block its partition forever. After K retries, route it to a dead-letter queue for later inspection, and advance the offset so the partition keeps flowing.

Scaling and failure handling

  • Throughput → add partitions (and consumers up to the partition count).
  • Hot partition → a skewed key overloads one partition; choose a higher-cardinality key or add partitions.
  • Slow consumer / lag → monitor consumer lag (log end offset − committed offset); scale consumers or speed up processing.
  • Broker failure → leader failover from ISR; producers/consumers refresh metadata and retry.

The takeaway

Concrete signals: a segmented append-only log with an offset index, acks=all + ISR replication for durability, pull-based consumers that commit offsets after processing (at-least-once), idempotent consumers + DLQ for duplicates and poison messages, and consumer-group rebalancing. This queue is the async backbone — you’ll drop it into the notification, feed, metrics, and order pipelines ahead.