Data compression vs data deduplication
Two ways to store less — shrink each item's bytes, or stop storing identical copies. They target different redundancy and often work together.
Two different “store less” strategies
Both cut storage and bandwidth, but attack different redundancy:
- Compression — shrink data by encoding it more efficiently (remove redundancy within an item).
- Deduplication — store only one copy of identical data and replace duplicates with references (remove redundancy across items).
Compression
Re-encode bytes so the same information takes less space — e.g. gzip/zstd on text, JPEG on images.
- Scope: within a file/block.
- Lossless (gzip, zstd, Brotli): exact reconstruction; for text, logs, code.
- Lossy (JPEG, H.264 video, MP3): discards detail that is hard to perceive in exchange for a much smaller size; for media.
- Cost: CPU to compress/decompress; a latency/CPU-vs-size trade-off.
- Great when: individual items are internally redundant (text, logs, columnar data).
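To make the "internally redundant" point concrete, here is a minimal sketch using Python's standard `zlib` module. The helper name `compression_ratio` and the sample data are illustrative assumptions, not taken from any particular system.

```python
import os
import zlib

def compression_ratio(data: bytes, level: int = 6) -> float:
    """Return compressed size divided by original size (lower is better)."""
    compressed = zlib.compress(data, level)
    # Lossless: decompressing must give back the exact original bytes.
    assert zlib.decompress(compressed) == data
    return len(compressed) / len(data)

# Hypothetical sample inputs: a repetitive log line vs. random bytes.
log_like = b"2024-01-01 INFO request ok path=/health status=200\n" * 2000
random_like = os.urandom(len(log_like))

print(f"repetitive log text: {compression_ratio(log_like):.3f}")
print(f"random bytes:        {compression_ratio(random_like):.3f}")
```

The repetitive input should shrink to a small fraction of its size, while the random input stays at roughly its original size, since it has no internal redundancy for the encoder to exploit.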
Deduplication
Detect identical chunks (by hashing them) and keep one physical copy; everything else points to it.
- Scope: across files / the whole dataset.
- Granularity: file-level (whole identical files) or block-level (identical chunks, even inside files that differ elsewhere), which usually finds far more duplicates.
- Cost: maintaining a hash index of chunks; lookup on every write.
- Great when: the same data repeats across many items — backups (yesterday ≈ today), VM images, a file-sync service where many users store the same file.
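The hash-index idea above can be sketched as a toy in-memory store. This assumes fixed-size chunks and SHA-256 as the chunk fingerprint; the class name `DedupStore` and its methods are hypothetical, and real systems often use content-defined chunking and a persistent index instead.

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: fixed-size chunks keyed by SHA-256."""

    def __init__(self, chunk_size: int = 4096):
        self.chunk_size = chunk_size
        self.chunks = {}  # chunk hash -> the single physical copy
        self.files = {}   # file name -> ordered list of chunk hashes

    def put(self, name: str, data: bytes) -> None:
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # Index lookup on every write: keep the chunk only if it is new.
            self.chunks.setdefault(digest, chunk)
            recipe.append(digest)
        self.files[name] = recipe

    def get(self, name: str) -> bytes:
        # Rebuild the file by following its references in order.
        return b"".join(self.chunks[d] for d in self.files[name])

    def physical_bytes(self) -> int:
        return sum(len(c) for c in self.chunks.values())

store = DedupStore(chunk_size=4)
monday = b"AAAABBBBCCCCDDDD"
tuesday = b"AAAABBBBXXXXDDDD"  # only the third chunk changed
store.put("monday.bak", monday)
store.put("tuesday.bak", tuesday)

print(len(store.chunks))       # 5 unique chunks instead of 8
print(store.physical_bytes())  # 20 bytes stored for 32 logical bytes
```

Both files remain fully readable through `get`, but the three chunks they share are stored only once.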
The side-by-side
| | Compression | Deduplication |
|---|---|---|
| Removes redundancy | Within an item | Across items |
| Mechanism | Re-encode bytes | Reference shared copies |
| Best on | Internally redundant data | Repeated/duplicate data |
| Cost | CPU | Chunk hash index + lookups |
They compose
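You typically dedupe first, then compress the unique chunks: eliminate repeated chunks, then squeeze what’s left. The order matters, because compressing whole files first changes their bytes and can hide chunks that would otherwise match. Many backup and cloud storage systems combine the two this way to maximize savings.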
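A rough sketch of that ordering, assuming fixed-size chunks, SHA-256 fingerprints, and `zlib` for the compression step. The function name `dedupe_then_compress` and the sample backups are made up for illustration.

```python
import hashlib
import zlib

def dedupe_then_compress(blobs, chunk_size=1024):
    """Return {chunk hash: compressed chunk}, holding each unique chunk once."""
    stored = {}
    for blob in blobs:
        for i in range(0, len(blob), chunk_size):
            chunk = blob[i:i + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in stored:                   # dedup: skip chunks already seen
                stored[digest] = zlib.compress(chunk)  # compress only unique chunks
    return stored

# Hypothetical data: three nightly backups, the last with one extra line.
backup = b"user=alice action=login result=ok\n" * 1000
blobs = [backup, backup, backup + b"user=bob action=logout result=ok\n"]

stored = dedupe_then_compress(blobs)
logical = sum(len(b) for b in blobs)
physical = sum(len(c) for c in stored.values())
print(f"logical bytes:  {logical}")
print(f"physical bytes: {physical}")
```

Deduplication removes the chunks shared between backups, and compression then shrinks the chunks that remain, so the two savings multiply rather than overlap.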
The interview cue
Reach for these in storage-heavy designs (Dropbox, S3, backups, a CDN): “Files are chunked and deduplicated so the same file uploaded by a million users is stored once; unique chunks are then compressed. Dedup also slashes upload bandwidth — the client can skip sending chunks the server already has.” Knowing which redundancy each targets — and that they stack — is the signal.
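The bandwidth claim can be illustrated with a small sketch of a "which chunks are you missing?" exchange. Everything here is hypothetical: `ChunkServer`, `missing`, `upload`, and `sync` are invented names standing in for whatever API a real service exposes, and chunks are fixed-size for simplicity.

```python
import hashlib

class ChunkServer:
    """Stand-in for a remote chunk store (hypothetical API)."""

    def __init__(self):
        self.chunks = {}  # chunk hash -> chunk bytes

    def missing(self, digests):
        # Tell the client which hashes the server does not have yet.
        return [d for d in digests if d not in self.chunks]

    def upload(self, digest, chunk):
        # Verify the content matches its claimed hash before storing it.
        if hashlib.sha256(chunk).hexdigest() != digest:
            raise ValueError("chunk does not match its hash")
        self.chunks[digest] = chunk

def sync(server, data, chunk_size=4):
    """Upload data to the server and return the number of bytes actually sent."""
    pairs = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        pairs.append((hashlib.sha256(chunk).hexdigest(), chunk))
    needed = set(server.missing([d for d, _ in pairs]))
    sent = 0
    for digest, chunk in pairs:
        if digest in needed:
            server.upload(digest, chunk)
            needed.discard(digest)  # do not resend a chunk repeated within the file
            sent += len(chunk)
    return sent

server = ChunkServer()
print(sync(server, b"AAAABBBBCCCC"))  # 12: nothing is on the server yet
print(sync(server, b"AAAABBBBDDDD"))  # 4: only the new chunk is sent
```

The client sends short hashes first and full chunk bodies only on request, which is why a second upload of mostly identical data costs very little bandwidth.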