Skip to content
System design course
Ch.2 · The building blocks·concept ·6 min read

Bloom filters

A tiny probabilistic structure that answers "have I seen this?" using a fraction of the memory — at the cost of occasional false positives.


The problem

You want to ask “is X in this set?” — but the set is huge (billions of items) and keeping it all in memory (or hitting disk for every check) is too expensive. A Bloom filter answers membership using a tiny, fixed amount of memory, by accepting one specific kind of error.

The guarantee (and the catch)

A Bloom filter tells you one of two things:

  • “Definitely not in the set” — always correct. No false negatives.
  • “Possibly in the set” — usually right, but sometimes wrong. False positives are possible.

So a no is the truth; a yes means “probably — go check the real store.” That asymmetry is exactly what makes it useful as a cheap pre-filter.

How it works

A bit array plus k independent hash functions:

  • Add(x): hash x with all k functions, set those k bits to 1.
  • Query(x): hash x the same way; if any of those bits is 0, x is definitely absent. If all are 1, it’s probably present (those bits could have been set by other items — that’s the false positive).

You can’t delete (clearing bits could break other items), and the false-positive rate rises as it fills. You trade memory for accuracy: more bits and well-chosen k → fewer false positives.

Why it’s worth it

To check membership against a billion items, a hash set might need many GB; a Bloom filter with a 1% false-positive rate needs only a few bits per item — often megabytes total. The win is using it to avoid expensive work: a no skips a disk read or network call entirely, and a yes is just confirmed against the real store.

Where it shows up

  • Databases (Cassandra, HBase, RocksDB) — before reading an SSTable from disk, a Bloom filter says whether the key might be there, skipping pointless disk reads on a miss.
  • Caches / CDNs — “is this worth caching yet?” (only cache on the second request).
  • Web crawlers — “have we already seen this URL?” without storing every URL in memory (its own appearance in Chapter 4).
  • Account/username checks, malicious-URL filters — fast reject of the common case.

The interview cue

When a design needs a fast existence check against a massive set and can tolerate a confirming lookup on a maybe, name a Bloom filter. The framing to say out loud: “a Bloom filter in front lets a no skip the expensive store entirely; a yes falls through to verify.” That’s the whole value in one sentence.