Data compression vs data deduplication

Two ways to store less — shrink each item's bytes, or stop storing identical copies. They target different redundancy and often work together.

Two different “store less” strategies

Both cut storage and bandwidth, but attack different redundancy:

Compression — shrink data by encoding it more efficiently (remove redundancy within an item).
Deduplication — store only one copy of identical data and replace duplicates with references (remove redundancy across items).

Compression

Re-encode bytes so the same content takes less space, exactly (gzip/zstd on text) or approximately (JPEG on images).

Scope: within a file/block.
Lossless (gzip, zstd, Brotli): exact reconstruction; for text, logs, code.
Lossy (JPEG, H.264, MP3): discards detail that is hard to perceive in exchange for a much smaller size; for images, video, audio.
Cost: CPU time to compress and decompress; higher compression levels trade more CPU and latency for smaller output.
Great when: individual items are internally redundant (text, logs, columnar data).
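
To make the "internally redundant" point concrete, here is a minimal sketch using Python's standard zlib module. The sample log line and the compression level are made-up choices for illustration; the exact sizes printed will vary.

```python
import os
import zlib

# Highly repetitive input, similar in spirit to log lines (made-up sample data).
redundant = b"GET /index.html 200 OK\n" * 1000
# Random bytes of the same length have no internal redundancy to remove.
random_bytes = os.urandom(len(redundant))

packed_redundant = zlib.compress(redundant, 6)
packed_random = zlib.compress(random_bytes, 6)

# Lossless: decompression reproduces the input byte for byte.
assert zlib.decompress(packed_redundant) == redundant
assert zlib.decompress(packed_random) == random_bytes

print("repetitive:", len(redundant), "->", len(packed_redundant))
print("random:    ", len(random_bytes), "->", len(packed_random))
```

The repetitive input shrinks dramatically, while the random input stays about the same size: compression only pays off when the item itself contains redundancy.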

Deduplication

Detect identical chunks (by hashing them) and keep one physical copy; everything else points to it.

Scope: across files / the whole dataset.
Granularity: file-level (whole identical files) or block-level (identical chunks shared by different files), which usually finds far more duplicates at the cost of a larger index.
Cost: maintaining a hash index of chunks; lookup on every write.
Great when: the same data repeats across many items — backups (yesterday ≈ today), VM images, a file-sync service where many users store the same file.
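
The idea can be sketched as a toy block-level store. This is an illustration under stated assumptions, not how any particular product works: the class name `DedupStore`, the 4-byte fixed-size chunks, and the file names are all hypothetical, and real systems often use much larger, content-defined chunks.

```python
import hashlib

CHUNK_SIZE = 4  # tiny fixed-size chunks so the example is easy to follow


class DedupStore:
    """Toy block-level dedup store (hypothetical, for illustration only)."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes (the single physical copy)
        self.files = {}   # file name -> ordered list of digests (references)

    def put(self, name, data):
        digests = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            # Index lookup on every write: store the chunk only if it is new.
            self.chunks.setdefault(digest, chunk)
            digests.append(digest)
        self.files[name] = digests

    def get(self, name):
        # Rebuild the file by following its references.
        return b"".join(self.chunks[d] for d in self.files[name])

    def physical_bytes(self):
        return sum(len(c) for c in self.chunks.values())


store = DedupStore()
store.put("monday.bak", b"AAAABBBBCCCC")
store.put("tuesday.bak", b"AAAABBBBDDDD")

assert store.get("tuesday.bak") == b"AAAABBBBDDDD"
print(store.physical_bytes())  # 16 bytes stored for 24 bytes of logical data
```

The two "backups" share the chunks AAAA and BBBB, so only four unique chunks are kept. The `chunks` dictionary is the hash index named under Cost above.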

The side-by-side

	Compression	Deduplication
Removes redundancy	Within an item	Across items
Mechanism	Re-encode bytes	Reference shared copies
Best on	Internally redundant data	Repeated/duplicate data
Cost	CPU	Chunk hash index + lookups

They compose

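You typically dedupe, then compress the unique chunks: eliminate repeated chunks first, then squeeze what’s left. The order matters, because compressing whole files first tends to turn shared content into different bytes, which hides the duplicates from the dedup step. Many backup systems and cloud storage services work this way to maximize savings.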
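
A minimal sketch of that pipeline, assuming fixed-size chunks, SHA-256 digests, and zlib as the compressor. The function names `dedupe_then_compress` and `restore` and the sample data are hypothetical.

```python
import hashlib
import zlib


def dedupe_then_compress(blobs, chunk_size=1024):
    """Return (store, recipes): compressed unique chunks and per-blob digest lists."""
    store = {}    # digest -> zlib-compressed chunk
    recipes = []  # one ordered list of digests per input blob
    for blob in blobs:
        recipe = []
        for i in range(0, len(blob), chunk_size):
            chunk = blob[i:i + chunk_size]
            digest = hashlib.sha256(chunk).digest()
            if digest not in store:                   # step 1: skip known chunks
                store[digest] = zlib.compress(chunk)  # step 2: compress new ones
            recipe.append(digest)
        recipes.append(recipe)
    return store, recipes


def restore(store, recipe):
    """Rebuild a blob by decompressing its chunks in order."""
    return b"".join(zlib.decompress(store[d]) for d in recipe)


# Made-up data: two "backups" that share everything except their last chunk.
base = (b"user=alice action=login\n" * 200)[:4096]
backups = [base + b"X" * 1024, base + b"Y" * 1024]

store, recipes = dedupe_then_compress(backups)
logical = sum(len(b) for b in backups)
stored = sum(len(c) for c in store.values())

assert restore(store, recipes[1]) == backups[1]
print(logical, "logical bytes ->", stored, "stored bytes")
```

Dedup removes the chunks the two backups share, and compression then shrinks the remaining unique chunks, so the two savings stack.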

The interview cue

Reach for these in storage-heavy designs (Dropbox, S3, backups, a CDN): “Files are chunked and deduplicated so the same file uploaded by a million users is stored once; unique chunks are then compressed. Dedup also slashes upload bandwidth — the client can skip sending chunks the server already has.” Knowing which redundancy each targets — and that they stack — is the signal.
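
The bandwidth claim can be sketched as a digest exchange: the client hashes its chunks, learns which digests the server already holds, and uploads only the rest. This is an illustrative sketch, not any real service's protocol; the function names, the 4-byte chunks, and the in-memory `server_has` set are assumptions.

```python
import hashlib


def chunk_digests(data, chunk_size=4):
    """Split data into fixed-size chunks and map each SHA-256 digest to its chunk."""
    return {
        hashlib.sha256(data[i:i + chunk_size]).hexdigest(): data[i:i + chunk_size]
        for i in range(0, len(data), chunk_size)
    }


def chunks_to_upload(data, server_has):
    """Return only the chunks whose digests the server does not already hold."""
    local = chunk_digests(data)
    return {d: c for d, c in local.items() if d not in server_has}


# Hypothetical server state: it already stores the chunks of an earlier upload.
server_has = set(chunk_digests(b"AAAABBBBCCCC"))

to_send = chunks_to_upload(b"AAAABBBBDDDD", server_has)
print(list(to_send.values()))  # only the one new chunk needs to travel
```

Here the second file shares two of its three chunks with the earlier upload, so only the DDDD chunk is sent.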