Skip to content
System design course
Ch.4 · Designing real systems·how to build it ·8 min read

Building object storage (S3)

Implement multipart upload, erasure-coded writes across failure domains, the locate-then-fetch read, background scrubbing, and read-after-write consistency.


The write path (multipart + erasure coding)

Large objects upload as parts (parallel, resumable); each part is erasure-coded into fragments spread across failure domains; metadata is committed last:

def put_object(bucket, key, stream):
    upload_id = begin_multipart(bucket, key)
    parts = []
    for i, part in enumerate(read_parts(stream, size=8*MB)):   # parallelizable
        frags = reed_solomon_encode(part, k=10, m=4)           # 10 data + 4 parity
        locations = placement.pick(k+m, distinct_failure_domains=True)
        for frag, node in zip(frags, locations):
            node.write(frag_id(upload_id, i, frag.index), frag, checksum=sha256(frag))
        parts.append({"part": i, "frags": locations})
    # commit AFTER all data is durable → object becomes visible (read-after-write)
    metadata.commit(bucket, key, parts, version=new_version(), size=..., etag=...)

Committing metadata only after fragments are durably written is what gives strong read-after-write: the object can’t be read until it fully exists.

Placement across failure domains

The placement service spreads the k+m fragments so that no single disk, rack, or AZ holds enough fragments to matter — losing a whole rack still leaves ≥ k fragments to reconstruct. It also balances capacity/load across nodes (consistent hashing or a capacity-aware allocator).

The read path (locate → fetch → reconstruct)

def get_object(bucket, key, version=None):
    meta = metadata.get(bucket, key, version)        # locate fragments
    out = []
    for part in meta.parts:
        frags = fetch_any_k(part.frags, k=10)        # only need k of k+m
        out.append(reed_solomon_decode(frags))       # reconstruct
    return concat(out)

You only need any k fragments, so reads tolerate up to m unavailable nodes with no impact — availability comes free from the coding.

Background scrubbing (catching bit rot)

Disks silently corrupt data. A scrubber continuously re-reads fragments, verifies checksums (and Merkle trees for efficient comparison), and rebuilds any corrupted or lost fragment from the surviving k:

def scrub(fragment):
    if sha256(read(fragment)) != fragment.expected_checksum:
        good = fetch_any_k(sibling_fragments(fragment), k)
        rebuilt = reed_solomon_decode(good)
        re_encode_and_replace(fragment)              # restore the lost fragment

This is how durability stays at eleven nines despite constant hardware failure.

Metadata service

(bucket, key) → object location/version lives in a sharded, replicated store (itself a distributed KV store, Chapter 4). LIST bucket/prefix/ is a range scan over sorted keys — which is why object keys are stored ordered. Cache hot metadata.

Deletes and versioning

  • Versioned PUT writes a new version; the metadata head points to the latest.
  • DELETE writes a tombstone (or removes the head); garbage collection later reclaims fragments no live version references (reference counting), then scrubbing stops tracking them.

Scaling and failure handling

  • Capacity → add storage nodes; placement starts using them.
  • Node/rack/AZ failure → reads use surviving fragments; a repair process rebuilds the lost fragments onto healthy nodes to restore the durability margin.
  • Hot object → front with a CDN / cache; pre-signed URLs let clients pull directly.
  • Huge object → multipart keeps each part small, parallel, and resumable.

The takeaway

Concrete signals: multipart upload, erasure coding (k+m) across failure domains for cheap durability, commit-metadata-last for read-after-write, locate→fetch-any-k→reconstruct reads, and scrubbing + repair to survive bit rot. This durable blob layer is what Dropbox, media systems, and backups store their bytes on.