Building object storage (S3)
Implement multipart upload, erasure-coded writes across failure domains, the locate-then-fetch read, background scrubbing, and read-after-write consistency.
The write path (multipart + erasure coding)
Large objects upload as parts (parallel, resumable); each part is erasure-coded into fragments spread across failure domains; metadata is committed last:
def put_object(bucket, key, stream):
upload_id = begin_multipart(bucket, key)
parts = []
for i, part in enumerate(read_parts(stream, size=8*MB)): # parallelizable
frags = reed_solomon_encode(part, k=10, m=4) # 10 data + 4 parity
locations = placement.pick(k+m, distinct_failure_domains=True)
for frag, node in zip(frags, locations):
node.write(frag_id(upload_id, i, frag.index), frag, checksum=sha256(frag))
parts.append({"part": i, "frags": locations})
# commit AFTER all data is durable → object becomes visible (read-after-write)
metadata.commit(bucket, key, parts, version=new_version(), size=..., etag=...)
Committing metadata only after fragments are durably written is what gives strong read-after-write: the object can’t be read until it fully exists.
Placement across failure domains
The placement service spreads the k+m fragments so that no single disk, rack, or AZ holds enough fragments to matter — losing a whole rack still leaves ≥ k fragments to reconstruct. It also balances capacity/load across nodes (consistent hashing or a capacity-aware allocator).
The read path (locate → fetch → reconstruct)
def get_object(bucket, key, version=None):
meta = metadata.get(bucket, key, version) # locate fragments
out = []
for part in meta.parts:
frags = fetch_any_k(part.frags, k=10) # only need k of k+m
out.append(reed_solomon_decode(frags)) # reconstruct
return concat(out)
You only need any k fragments, so reads tolerate up to m unavailable nodes with no impact — availability comes free from the coding.
Background scrubbing (catching bit rot)
Disks silently corrupt data. A scrubber continuously re-reads fragments, verifies checksums (and Merkle trees for efficient comparison), and rebuilds any corrupted or lost fragment from the surviving k:
def scrub(fragment):
if sha256(read(fragment)) != fragment.expected_checksum:
good = fetch_any_k(sibling_fragments(fragment), k)
rebuilt = reed_solomon_decode(good)
re_encode_and_replace(fragment) # restore the lost fragment
This is how durability stays at eleven nines despite constant hardware failure.
Metadata service
(bucket, key) → object location/version lives in a sharded, replicated store
(itself a distributed KV store, Chapter 4). LIST bucket/prefix/ is a range scan
over sorted keys — which is why object keys are stored ordered. Cache hot metadata.
Deletes and versioning
- Versioned PUT writes a new version; the metadata head points to the latest.
- DELETE writes a tombstone (or removes the head); garbage collection later reclaims fragments no live version references (reference counting), then scrubbing stops tracking them.
Scaling and failure handling
- Capacity → add storage nodes; placement starts using them.
- Node/rack/AZ failure → reads use surviving fragments; a repair process rebuilds the lost fragments onto healthy nodes to restore the durability margin.
- Hot object → front with a CDN / cache; pre-signed URLs let clients pull directly.
- Huge object → multipart keeps each part small, parallel, and resumable.
The takeaway
Concrete signals: multipart upload, erasure coding (k+m) across failure domains for cheap durability, commit-metadata-last for read-after-write, locate→fetch-any-k→reconstruct reads, and scrubbing + repair to survive bit rot. This durable blob layer is what Dropbox, media systems, and backups store their bytes on.