Designing object storage (S3)

Build durable, infinitely-scalable blob storage — a flat key→object namespace, erasure coding for 11-nines durability, and how data and metadata planes separate.

The problem

Design object storage (Amazon S3): store and retrieve arbitrarily large objects (files/blobs) by key, with effectively unlimited capacity and extraordinary durability (S3 advertises “11 nines” — 99.999999999%). It’s the foundation under Dropbox, Pastebin, and every media system here, so understanding how it achieves durability and scale is high-value.

Step 1 — Requirements

Functional: PUT(bucket, key, object), GET(bucket, key), DELETE, LIST by prefix; objects from bytes to terabytes; buckets as namespaces; versioning; access control.

Non-functional: extreme durability (never lose an object), high availability, strong-ish read-after-write consistency for new objects, massive scale (exabytes, trillions of objects), and high throughput.

Step 2 — The model: a flat key→object map

Object storage is not a filesystem — it’s a giant flat key-value map. “Folders” are just key prefixes (photos/2024/cat.jpg). This flatness is what lets it scale horizontally without a directory tree to lock and rebalance. LIST is a prefix scan over the key index.

Step 3 — Separate the data plane from the metadata plane

Metadata service — maps (bucket, key) → where the object’s data lives (which storage nodes, version, size, checksum, ACL). Sharded (by bucket/key hash), heavily indexed for LIST.
Data plane (storage nodes) — store the actual object bytes on disks across many machines, racks, and availability zones.

client → API/LB → metadata service (locate) → storage nodes (object bytes)

Large objects are split into parts/blocks spread across nodes (and uploaded via multipart upload for parallel, resumable transfer).

Step 4 — Durability: replication vs erasure coding

This is the heart of object storage. To never lose data despite constant disk failures:

Replication (e.g. 3 copies across AZs) — simple, fast recovery, but 3× storage cost.
Erasure coding — split an object into k data + m parity fragments (e.g. Reed-Solomon 10+4); any k of the (k+m) fragments reconstruct it. Survives m failures with only (k+m)/k overhead (1.4× vs 3×) — far cheaper for the same or better durability. The standard for cold/large data.

Spread fragments across failure domains (disks, racks, AZs) so correlated failures can’t take out enough fragments. Background scrubbing (checksums/Merkle, Chapter 2) detects bit rot and rebuilds lost fragments.

Step 5 — Consistency

Modern S3 offers strong read-after-write for new objects (a GET right after a PUT returns it). Achieve it by committing the metadata pointer after the data is durably written, so the object is visible only once stored. Overwrites/deletes can be versioned to avoid in-place mutation hazards.

Step 6 — Extras

Versioning — keep old versions (new version = new object, metadata points to latest).
Lifecycle/tiering — move cold objects to cheaper storage (infrequent-access, archive/Glacier) automatically.
Access — bucket policies/ACLs; pre-signed URLs for temporary direct access (reuse from Pastebin); a CDN in front for hot public objects.

Trade-offs to raise

Replication (simple, 3×) vs erasure coding (cheap, complex recovery, higher latency to reconstruct).
Strong consistency vs latency — commit-after-durable adds a touch of latency.
Flat namespace scales but offers no real directories/transactions across keys.

The interview cue

“A flat key→object map; metadata plane locates objects, storage nodes hold bytes across AZs; durability via erasure coding (k+m fragments across failure domains) plus scrubbing to rebuild lost fragments; multipart upload for large objects; strong read-after-write by committing metadata after durable write; pre-signed URLs and a CDN for access.” Durability (erasure coding) + the data/metadata split is exactly what this problem is testing.