Algorithms

Bloom Filter

๐Ÿ“‹ Quick Reference

PropertyValue
TypeProbabilistic set membership
AddO(k) where k = number of hash functions
QueryO(k) - may return false positives
DeleteNot supported (standard Bloom filter)
SpaceMuch smaller than storing actual elements
Best ForFast membership checks with acceptable false positives
One-liner: Space-efficient probabilistic data structure that can tell you "definitely not in set" or "possibly in set".

๐ŸŽฎ Interactive Visualizer

Watch how Bloom Filter uses multiple hash functions:

Loading visualizer...
Try these operations:
  1. Add elements and watch multiple bits get set
  2. Query existing elements (always found)
  3. Query non-existent elements (possible false positives)
  4. Observe how more elements increase false positive rate

๐Ÿ”ง How It Works

Core Concept

Bloom Filter = bit array + k hash functions

Adding "hello":
  h1("hello") = 3   โ†’ set bit[3] = 1
  h2("hello") = 7   โ†’ set bit[7] = 1
  h3("hello") = 12  โ†’ set bit[12] = 1

Querying "hello":
  Check bit[3] AND bit[7] AND bit[12]
  All 1? โ†’ "Possibly in set"
  Any 0? โ†’ "Definitely not in set"

Key Properties

JAVA(7 lines)
Code
Loading syntax highlighter...

Optimal Parameters

Optimal number of hash functions:
k = (m/n) * ln(2) โ‰ˆ 0.693 * (m/n)

False positive probability:
p โ‰ˆ (1 - e^(-kn/m))^k

For 1% false positive rate with n elements:
m โ‰ˆ 9.6n bits (about 1.2 bytes per element)
k โ‰ˆ 7 hash functions

๐Ÿ“Š Complexity Analysis

OperationTimeSpace
add(element)O(k)-
contains(element)O(k)-
Space (total)-O(m) bits

Space Comparison

StructureSpace for 1M elements
HashSet~48 MB (avg 48 bytes/entry)
Bloom Filter (1% FP)~1.2 MB
Bloom Filter (0.1% FP)~1.8 MB

โœ… When to Use Bloom Filter

Good Use Cases

  • Cache filtering - avoid expensive lookups for non-existent items
  • Spell checkers - quick "probably misspelled" check
  • Network routers - packet filtering
  • Database queries - skip disk reads for absent keys
  • Web crawlers - avoid revisiting URLs
  • Distributed systems - reduce unnecessary network calls

Avoid When

  • Need exact membership - can't distinguish "definitely" from "maybe"
  • Need deletion - standard Bloom filter doesn't support delete
  • Very low false positive tolerance - might need more space than HashSet
  • Need to enumerate elements - Bloom filter only answers "is X in set?"

๐Ÿ”„ Bloom Filter Variants

VariantFeatureTrade-off
StandardSimple, fastNo delete, false positives
CountingSupports delete4x more space (counters vs bits)
ScalableGrows dynamicallyMultiple filters, slightly slower
Cuckoo FilterDelete support, better spaceMore complex

Counting Bloom Filter

JAVA(6 lines)
Code
Loading syntax highlighter...

๐Ÿงฉ Implementation Patterns

Guava BloomFilter (Java)

JAVA(18 lines)
Code
Loading syntax highlighter...

Redis Bloom Filter

BASH(10 lines)
Code
Loading syntax highlighter...

Cache Pattern

JAVA(26 lines)
Code
Loading syntax highlighter...

โš ๏ธ Common Pitfalls

1. Ignoring False Positive Rate

JAVA(5 lines)
Code
Loading syntax highlighter...

2. Underestimating Element Count

JAVA(5 lines)
Code
Loading syntax highlighter...

3. Treating "Maybe" as "Yes"

JAVA(11 lines)
Code
Loading syntax highlighter...

4. Expecting Deletion Support

JAVA(4 lines)
Code
Loading syntax highlighter...

๐ŸŽค Interview Tips

Q: What is a Bloom filter and what's it used for?
"

A probabilistic data structure for set membership testing. It can tell you "definitely not in set" or "possibly in set". Used for caching, spell-checking, and avoiding expensive lookups when an element doesn't exist.

Q: What's the difference between false positive and false negative?
"

False positive: filter says "maybe yes" but element was never added. False negative: filter says "no" but element WAS added. Bloom filters guarantee no false negatives - if it says "no", the element is definitely not in the set.

Q: Why use multiple hash functions?
"

Multiple hash functions reduce false positive rate. With k hash functions, an element sets k bits. For a false positive, ALL k bits must have been set by other elements, which is less likely with good hash distribution.

Q: How do you choose optimal parameters?
"

Given expected elements (n) and desired false positive rate (p): bits m โ‰ˆ -n*ln(p)/(ln(2)ยฒ), hash functions k โ‰ˆ (m/n)*ln(2). For 1% FP rate, use about 10 bits per element and 7 hash functions.


๐Ÿ“š Series Navigation


Visualizer: BloomFilterVisualizer from @tomaszjarosz/react-visualizers