System design course
Ch.3 · Trade-offs that define a design · concept · 5 min read

Data compression vs data deduplication

Two ways to store less — shrink each item's bytes, or stop storing identical copies. They target different redundancy and often work together.


Two different “store less” strategies

Both cut storage and bandwidth, but attack different redundancy:

  • Compression — shrink data by encoding it more efficiently (remove redundancy within an item).
  • Deduplication — store only one copy of identical data and replace duplicates with references (remove redundancy across items).

Compression

Re-encode bytes so the same information takes less space — e.g. gzip/zstd on text, JPEG on images.

  • Scope: within a file/block.
  • Lossless (gzip, zstd, Brotli): exact reconstruction; for text, logs, code.
  • Lossy (JPEG, MP3, video codecs such as H.264): discard detail that is hard to perceive in exchange for a much smaller size; for media.
  • Cost: CPU to compress/decompress; a latency/CPU-vs-size trade-off.
  • Great when: individual items are internally redundant (text, logs, columnar data).
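As a rough illustration of "great when items are internally redundant", the sketch below compares lossless compression on a repetitive log with the same number of random bytes. It uses Python's standard zlib module; the log line and the helper name compression_ratio are made up for this example.

```python
import os
import zlib

def compression_ratio(data: bytes, level: int = 6) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, level)) / len(data)

# Internally redundant input: one (invented) log line repeated many times.
redundant = b"2024-01-01 INFO request served in 12ms\n" * 1000
# High-entropy input: random bytes leave almost nothing to remove.
random_bytes = os.urandom(len(redundant))

# Lossless means decompression returns exactly the original bytes.
assert zlib.decompress(zlib.compress(redundant)) == redundant

print(f"redundant log: {compression_ratio(redundant):.3f}")
print(f"random bytes:  {compression_ratio(random_bytes):.3f}")
```

The repetitive log should shrink to a small fraction of its size, while the random bytes should stay at roughly their original size: the CPU is spent either way, but only redundant data pays it back.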

Deduplication

Detect identical chunks (by hashing them) and keep one physical copy; everything else points to it.

  • Scope: across files / the whole dataset.
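  • File-level (whole identical files) or block-level (identical chunks shared by different files). Block-level usually finds far more duplicates, because files that differ only slightly still share most chunks, but it needs a larger index.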
  • Cost: maintaining a hash index of chunks; lookup on every write.
  • Great when: the same data repeats across many items — backups (yesterday ≈ today), VM images, a file-sync service where many users store the same file.
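A minimal sketch of block-level deduplication, assuming fixed-size chunks and SHA-256 as the chunk fingerprint. The class name DedupStore and its methods are hypothetical, and everything lives in memory. Real systems often use content-defined chunking and a persistent index instead.

```python
import hashlib

class DedupStore:
    """Toy block-level dedup store: fixed-size chunks keyed by SHA-256."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}  # chunk hash -> the single physical copy
        self.files = {}   # file name -> ordered list of chunk hashes

    def put(self, name, data):
        refs = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # Index lookup on every write: keep the chunk only if it is new.
            self.chunks.setdefault(digest, chunk)
            refs.append(digest)
        self.files[name] = refs

    def get(self, name):
        """Rebuild a file by following its chunk references."""
        return b"".join(self.chunks[d] for d in self.files[name])

    def physical_bytes(self):
        return sum(len(c) for c in self.chunks.values())

# Two "backups" that differ in a single 4-byte chunk.
store = DedupStore(chunk_size=4)
monday = b"AAAABBBBCCCCDDDD"
tuesday = b"AAAABBBBXXXXDDDD"
store.put("monday", monday)
store.put("tuesday", tuesday)

assert store.get("monday") == monday
assert store.get("tuesday") == tuesday
print(store.physical_bytes())  # 20 physical bytes for 32 logical bytes
```

Only five distinct chunks are stored (AAAA, BBBB, CCCC, DDDD, XXXX); the second backup adds just the one chunk that changed.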

The side-by-side

                     Compression                 Deduplication
Removes redundancy   Within an item              Across items
Mechanism            Re-encode bytes             Reference shared copies
Best on              Internally redundant data   Repeated/duplicate data
Cost                 CPU                         Chunk hash index + lookups

They compose

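You typically dedupe, then compress the unique chunks: eliminate repeated chunks first, then squeeze what’s left. The order matters because hashing the raw chunks keeps identical data recognizable, and each unique chunk is compressed only once. Many backup systems and storage services combine the two this way to maximize savings.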
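The dedupe-then-compress order can be sketched as follows, again assuming fixed-size chunks, SHA-256 fingerprints, and zlib for compression. The function names store_chunks and restore are hypothetical.

```python
import hashlib
import zlib

def store_chunks(blobs, chunk_size=4096):
    """Dedupe fixed-size chunks by SHA-256, then compress each unique chunk once."""
    compressed = {}  # chunk hash -> compressed bytes
    recipes = []     # per blob: ordered list of chunk hashes
    for blob in blobs:
        recipe = []
        for i in range(0, len(blob), chunk_size):
            chunk = blob[i:i + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in compressed:                   # step 1: dedupe
                compressed[digest] = zlib.compress(chunk)  # step 2: compress
            recipe.append(digest)
        recipes.append(recipe)
    return compressed, recipes

def restore(compressed, recipe):
    """Decompress the referenced chunks and join them back into the blob."""
    return b"".join(zlib.decompress(compressed[d]) for d in recipe)

# Three blobs: two identical, one with a small tail appended.
base = b"config line: retries=3 timeout=30\n" * 500
blobs = [base, base, base + b"extra tail\n"]
compressed, recipes = store_chunks(blobs)

logical = sum(len(b) for b in blobs)
physical = sum(len(c) for c in compressed.values())
assert all(restore(compressed, r) == b for r, b in zip(recipes, blobs))
print(f"logical={logical} bytes, physical={physical} bytes")
```

Dedup removes the chunks the three blobs share, and compression then shrinks the few unique chunks that remain, so the two savings multiply.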

The interview cue

Reach for these in storage-heavy designs (Dropbox, S3, backups, a CDN): “Files are chunked and deduplicated so the same file uploaded by a million users is stored once; unique chunks are then compressed. Dedup also slashes upload bandwidth — the client can skip sending chunks the server already has.” Knowing which redundancy each targets — and that they stack — is the signal.
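The bandwidth point can be illustrated with a small sketch of the "skip chunks the server already has" exchange. ChunkServer, missing, upload, and sync_file are hypothetical names, and the server is an in-memory stand-in. A real service would also store a per-file manifest of chunk hashes, which is left out here.

```python
import hashlib

class ChunkServer:
    """Hypothetical in-memory stand-in for the storage service."""

    def __init__(self):
        self.chunks = {}  # chunk hash -> chunk bytes

    def missing(self, digests):
        """Return the digests the server does not hold yet."""
        return [d for d in digests if d not in self.chunks]

    def upload(self, digest, chunk):
        self.chunks[digest] = chunk

def sync_file(server, data, chunk_size=4096):
    """Upload only the chunks the server lacks; return the bytes sent."""
    chunks = {}
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        chunks[hashlib.sha256(chunk).hexdigest()] = chunk
    sent = 0
    # Send hashes first; transfer chunk bodies only for the missing ones.
    for digest in server.missing(list(chunks)):
        server.upload(digest, chunks[digest])
        sent += len(chunks[digest])
    return sent

server = ChunkServer()
# Four distinct 4096-byte chunks, 16384 bytes in total.
data = b"".join(bytes([i]) * 4096 for i in range(4))

first = sync_file(server, data)   # server is empty: every chunk is sent
second = sync_file(server, data)  # same file again: nothing is sent
print(first, second)  # 16384 0
```

The second upload costs only the hash exchange, which is the effect described for a popular file uploaded by many users.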