Designing object storage (S3)
Build durable, infinitely-scalable blob storage — a flat key→object namespace, erasure coding for 11-nines durability, and how data and metadata planes separate.
The problem
Design object storage (Amazon S3): store and retrieve arbitrarily large objects (files/blobs) by key, with effectively unlimited capacity and extraordinary durability (S3 advertises “11 nines” — 99.999999999%). It’s the foundation under Dropbox, Pastebin, and every media system here, so understanding how it achieves durability and scale is high-value.
Step 1 — Requirements
Functional: PUT(bucket, key, object), GET(bucket, key), DELETE, LIST by
prefix; objects from bytes to terabytes; buckets as namespaces; versioning;
access control.
Non-functional: extreme durability (never lose an object), high availability, strong-ish read-after-write consistency for new objects, massive scale (exabytes, trillions of objects), and high throughput.
Step 2 — The model: a flat key→object map
Object storage is not a filesystem — it’s a giant flat key-value map. “Folders”
are just key prefixes (photos/2024/cat.jpg). This flatness is what lets it scale
horizontally without a directory tree to lock and rebalance. LIST is a prefix scan
over the key index.
Step 3 — Separate the data plane from the metadata plane
- Metadata service — maps
(bucket, key)→ where the object’s data lives (which storage nodes, version, size, checksum, ACL). Sharded (by bucket/key hash), heavily indexed forLIST. - Data plane (storage nodes) — store the actual object bytes on disks across many machines, racks, and availability zones.
client → API/LB → metadata service (locate) → storage nodes (object bytes)
Large objects are split into parts/blocks spread across nodes (and uploaded via multipart upload for parallel, resumable transfer).
Step 4 — Durability: replication vs erasure coding
This is the heart of object storage. To never lose data despite constant disk failures:
- Replication (e.g. 3 copies across AZs) — simple, fast recovery, but 3× storage cost.
- Erasure coding — split an object into k data + m parity fragments (e.g. Reed-Solomon 10+4); any k of the (k+m) fragments reconstruct it. Survives m failures with only (k+m)/k overhead (1.4× vs 3×) — far cheaper for the same or better durability. The standard for cold/large data.
Spread fragments across failure domains (disks, racks, AZs) so correlated failures can’t take out enough fragments. Background scrubbing (checksums/Merkle, Chapter 2) detects bit rot and rebuilds lost fragments.
Step 5 — Consistency
Modern S3 offers strong read-after-write for new objects (a GET right after a PUT returns it). Achieve it by committing the metadata pointer after the data is durably written, so the object is visible only once stored. Overwrites/deletes can be versioned to avoid in-place mutation hazards.
Step 6 — Extras
- Versioning — keep old versions (new version = new object, metadata points to latest).
- Lifecycle/tiering — move cold objects to cheaper storage (infrequent-access, archive/Glacier) automatically.
- Access — bucket policies/ACLs; pre-signed URLs for temporary direct access (reuse from Pastebin); a CDN in front for hot public objects.
Trade-offs to raise
- Replication (simple, 3×) vs erasure coding (cheap, complex recovery, higher latency to reconstruct).
- Strong consistency vs latency — commit-after-durable adds a touch of latency.
- Flat namespace scales but offers no real directories/transactions across keys.
The interview cue
“A flat key→object map; metadata plane locates objects, storage nodes hold bytes across AZs; durability via erasure coding (k+m fragments across failure domains) plus scrubbing to rebuild lost fragments; multipart upload for large objects; strong read-after-write by committing metadata after durable write; pre-signed URLs and a CDN for access.” Durability (erasure coding) + the data/metadata split is exactly what this problem is testing.