INTERNALS · PROBABILISTIC DATA STRUCTURES bloom

Bloom filter playground

Insert items, watch bits flip; query items, watch the false-positive rate trace the theoretical curve (1 − e^(−kn/m))^k.

What's actually happening:

A bloom filter is a fixed-size bit array (m bits) with k hash functions. Insert sets the k bit-positions an item hashes to. Query says "true if all k bits are set" — false negatives are impossible (you only ever turn bits on), but false positives are, because two items can hash to overlapping positions.

That trade is the whole point: it's a set-membership oracle that uses ~10 bits per item for ~1% FP rate, regardless of how big the items are.

Try this:

  1. 1Type something into INSERT and submit. Watch the k=5 highlighted bits flip on.
  2. 2Click + 100 a few times to fill it up. The grid lights up; FP·THEO climbs.
  3. 3Drag K back and forth. Watch the FP-curve marker move and find the dotted vertical at k* = (m/n) ln 2 — that's the optimum.
  4. 4Query a string you never inserted. If all k bits already happen to be set, you'll get a false positive — and the bits-checked highlight shows you exactly which collisions caused it.

Where this lives in production:

  • Cassandra / LevelDB / RocksDB — skip SSTable lookups for keys that aren't there
  • Chrome Safe Browsing — fast pre-check before fetching the full malicious-URL list
  • CDN / cache layers — "have we seen this URL before?" without storing every URL
  • Bitcoin SPV clients — query peer nodes for transactions of interest without revealing them

The double-hashing trick (idx_i = h1 + i × h2 mod m) means we get k hashes for the cost of two — Kirsch & Mitzenmacher 2006.

INSERT
QUERY
BIT ARRAY · 1024 bits · 32 × 32setjust-flippedprobed
FP CURVE · θ(k) for fixed m, nθ(k)k = 5k* ≈ 709.8
k=1θ(k) = (1 − e−kn/m)kk=16

Set K below K* and the FP rate climbs because too few bits are set per item; set it above and FP climbs again because every query checks more bits, all of which are likely set. The minimum is the sweet spot.