Bloom filters
A tiny probabilistic structure that answers "have I seen this?" using a fraction of the memory — at the cost of occasional false positives.
The problem
You want to ask “is X in this set?” — but the set is huge (billions of items) and keeping it all in memory (or hitting disk for every check) is too expensive. A Bloom filter answers membership using a tiny, fixed amount of memory, by accepting one specific kind of error.
The guarantee (and the catch)
A Bloom filter tells you one of two things:
- “Definitely not in the set” — always correct. No false negatives.
- “Possibly in the set” — usually right, but sometimes wrong. False positives are possible.
So a no is the truth; a yes means “probably — go check the real store.” That asymmetry is exactly what makes it useful as a cheap pre-filter.
How it works
A bit array plus k independent hash functions:
- Add(x): hash
xwith all k functions, set those k bits to 1. - Query(x): hash
xthe same way; if any of those bits is 0,xis definitely absent. If all are 1, it’s probably present (those bits could have been set by other items — that’s the false positive).
You can’t delete (clearing bits could break other items), and the false-positive rate rises as it fills. You trade memory for accuracy: more bits and well-chosen k → fewer false positives.
Why it’s worth it
To check membership against a billion items, a hash set might need many GB; a Bloom filter with a 1% false-positive rate needs only a few bits per item — often megabytes total. The win is using it to avoid expensive work: a no skips a disk read or network call entirely, and a yes is just confirmed against the real store.
Where it shows up
- Databases (Cassandra, HBase, RocksDB) — before reading an SSTable from disk, a Bloom filter says whether the key might be there, skipping pointless disk reads on a miss.
- Caches / CDNs — “is this worth caching yet?” (only cache on the second request).
- Web crawlers — “have we already seen this URL?” without storing every URL in memory (its own appearance in Chapter 4).
- Account/username checks, malicious-URL filters — fast reject of the common case.
The interview cue
When a design needs a fast existence check against a massive set and can tolerate a confirming lookup on a maybe, name a Bloom filter. The framing to say out loud: “a Bloom filter in front lets a no skip the expensive store entirely; a yes falls through to verify.” That’s the whole value in one sentence.