Probabilistic Data Structures

📋 At a Glance

Aspect	Details
Time to Read	35 minutes
Prerequisites	Part 1 (Algorithm Analysis), Basic probability
Key Concepts	Randomized Algorithms, Bloom Filters, HyperLogLog, Approximation
Difficulty	⭐⭐⭐ (Intermediate)

┌─────────────────────────────────────────────────────────────────┐
│             PROBABILISTIC ALGORITHMS                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  WHY RANDOMIZATION?                                             │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ • Simpler algorithms                                    │    │
│  │ • Better average-case performance                       │    │
│  │ • Avoid worst-case inputs                               │    │
│  │ • Sublinear space for streaming                         │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  TWO TYPES                                                      │
│  ┌─────────────────────┐    ┌─────────────────────┐             │
│  │ LAS VEGAS           │    │ MONTE CARLO         │             │
│  │ ─────────────────── │    │ ─────────────────── │             │
│  │ Always correct      │    │ Usually correct     │             │
│  │ Random running time │    │ Fixed running time  │             │
│  │ Example: QuickSort  │    │ Ex.: Miller-Rabin   │             │
│  │ with random pivot   │    │ primality test      │             │
│  └─────────────────────┘    └─────────────────────┘             │
│                                                                 │
│  SKETCHING DATA STRUCTURES                                      │
│  ┌─────────────────┐  ┌─────────────────┐  ┌──────────────┐     │
│  │ Bloom Filter    │  │ Count-Min Sketch│  │ HyperLogLog  │     │
│  │ ─────────────── │  │ ─────────────── │  │ ──────────── │     │
│  │ Set membership  │  │ Frequency est.  │  │ Cardinality  │     │
│  │ No false neg    │  │ May overcount   │  │ estimation   │     │
│  └─────────────────┘  └─────────────────┘  └──────────────┘     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

🎯 What You'll Learn

After completing this article, you will be able to:

Distinguish algorithm types: Las Vegas vs Monte Carlo
Implement Bloom filters: Space-efficient set membership
Use Count-Min Sketch: Frequency estimation in streams
Apply HyperLogLog: Estimate the number of distinct elements in a few kilobytes of memory
Design approximation algorithms: Trade accuracy for speed

🔥 Production Story: Counting Unique Visitors at Scale

A social media platform needed to count unique daily visitors, but with 500 million users, holding every ID in memory for each counter was impractical.

The Problem

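A minimal sketch of the exact approach, assuming visitor IDs are 64-bit longs. The class `ExactVisitorCounter` and its method names are hypothetical, and the 8-bytes-per-ID figure is a lower bound: a real `HashSet<Long>` costs several times more per entry because of boxing and node overhead.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical exact counter: remembers every visitor ID it has seen.
final class ExactVisitorCounter {
    private final Set<Long> seen = new HashSet<>();

    void recordVisit(long userId) {
        seen.add(userId); // duplicates are ignored by the set
    }

    long uniqueVisitors() {
        return seen.size();
    }

    // Lower bound on memory: 8 bytes of raw ID per distinct user,
    // before any HashSet or boxing overhead is counted.
    static long rawIdBytes(long distinctUsers) {
        return distinctUsers * Long.BYTES;
    }
}
```

For 500 million users, `rawIdBytes` gives 4,000,000,000 bytes, roughly 4GB for a single counter.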

The Solution: HyperLogLog


The Impact

Metric	HashSet	HyperLogLog
Memory per counter	4GB	16KB
Counters possible	1	250,000+
Error	0%	~0.8% (standard error)
Mergeable	No	Yes

🧠 Mental Model: The Coin Flip Experiment

┌─────────────────────────────────────────────────────────────────┐
│           HYPERLOGLOG INTUITION                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Thought experiment: Flip coins until you get heads             │
│                                                                 │
│  1 flip:  H (50% probability)                                   │
│  2 flips: TH (25% probability)                                  │
│  3 flips: TTH (12.5% probability)                               │
│  k flips: T...TH (1/2^k probability)                            │
│                                                                 │
│  If the longest run took k flips, ~2^k experiments were run     │
│                                                                 │
│  HyperLogLog:                                                   │
│  1. Hash each element → random bit string                       │
│  2. Count leading zeros = "coin flips until heads"              │
│  3. Maximum leading zeros ≈ log₂(n)                             │
│                                                                 │
│  Example:                                                       │
│  Element → Hash → Leading zeros                                 │
│  "alice" → 0001... → 3 zeros                                    │
│  "bob"   → 0000001... → 6 zeros   ← Maximum!                    │
│  "carol" → 001... → 2 zeros                                     │
│                                                                 │
│  Max leading zeros = 6 → Estimate ~2^6 = 64 unique elements     │
│                                                                 │
│  Problem: High variance! Fix: M registers + harmonic mean.      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
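The single-register idea in the box can be sketched as follows. This is an illustration only: `LeadingZeroEstimator` is a hypothetical name, and the SplitMix64-style mixer is an assumed stand-in for whatever hash function a real implementation would use.

```java
// Single-register version of the coin-flip idea: remember the longest run of
// leading zero bits seen in any hash and estimate 2^max distinct elements.
final class LeadingZeroEstimator {
    private int maxLeadingZeros = 0;

    // SplitMix64-style mixer (assumed hash); spreads nearby inputs across all 64 bits.
    static long mix64(long z) {
        z += 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    void add(long element) {
        int zeros = Long.numberOfLeadingZeros(mix64(element));
        maxLeadingZeros = Math.max(maxLeadingZeros, zeros);
    }

    // Always a power of two, which is one reason a single register is so noisy.
    long estimate() {
        return 1L << Math.min(maxLeadingZeros, 62);
    }
}
```

Re-adding an element never changes the estimate, because the same element always hashes to the same bit string. That property is what makes the approach count distinct elements rather than total events.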

🔬 Deep Dive: Bloom Filters

The Concept

A Bloom filter answers "Is X in the set?" with:

No → Definitely not in set
Yes → Probably in set (false positives possible)

┌─────────────────────────────────────────────────────────────────┐
│                    BLOOM FILTER                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Bit array: [0][0][0][0][0][0][0][0][0][0]  (m bits)            │
│              0  1  2  3  4  5  6  7  8  9                       │
│                                                                 │
│  Add "apple" (k=3 hash functions):                              │
│  h1("apple") = 2, h2("apple") = 5, h3("apple") = 8              │
│                                                                 │
│  [0][0][1][0][0][1][0][0][1][0]                                 │
│        ↑        ↑        ↑                                      │
│                                                                 │
│  Add "banana":                                                  │
│  h1("banana") = 1, h2("banana") = 5, h3("banana") = 9           │
│                                                                 │
│  [0][1][1][0][0][1][0][0][1][1]                                 │
│     ↑           (already 1)     ↑                               │
│                                                                 │
│  Query "cherry":                                                │
│  h1("cherry") = 2, h2("cherry") = 7, h3("cherry") = 9           │
│  Check bits 2, 7, 9: [1][0][1] → 0 found → NOT IN SET           │
│                                                                 │
│  Query "date":                                                  │
│  h1("date") = 1, h2("date") = 5, h3("date") = 8                 │
│  Check bits 1, 5, 8: [1][1][1] → All 1 → PROBABLY IN SET        │
│  (false positive! "date" was never added)                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Implementation

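A minimal Bloom filter sketch, under stated assumptions: the class and method names are hypothetical, the hash is FNV-1a (64-bit), and the k bit positions are derived from two halves of that hash (double hashing) rather than from k independent hash functions. The sizing uses the standard formulas m = -n ln(p) / (ln 2)² and k = (m/n) ln 2.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

final class BloomFilter {
    private final BitSet bits;
    private final int m; // number of bits
    private final int k; // number of hash positions per item

    BloomFilter(int expectedItems, double falsePositiveRate) {
        this.m = optimalBits(expectedItems, falsePositiveRate);
        this.k = optimalHashes(m, expectedItems);
        this.bits = new BitSet(m);
    }

    static int optimalBits(int n, double p) {
        return (int) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    static int optimalHashes(int m, int n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    void add(String item) {
        long h = hash64(item);
        int h1 = (int) h;
        int h2 = (int) (h >>> 32) | 1; // odd step so positions vary with i
        for (int i = 0; i < k; i++) {
            bits.set(Math.floorMod(h1 + i * h2, m)); // floorMod keeps the index non-negative
        }
    }

    boolean mightContain(String item) {
        long h = hash64(item);
        int h1 = (int) h;
        int h2 = (int) (h >>> 32) | 1;
        for (int i = 0; i < k; i++) {
            if (!bits.get(Math.floorMod(h1 + i * h2, m))) {
                return false; // a zero bit proves the item was never added
            }
        }
        return true; // all bits set: probably present
    }

    // FNV-1a, 64-bit variant.
    private static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }
}
```

For 1,000 expected items at a 1% target rate, the formulas give 9,586 bits and 7 hash positions, a little under 10 bits per item.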

Use Cases


🔬 Deep Dive: Count-Min Sketch

The Concept

Estimate the frequency of elements in a stream using limited memory.

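A compact Count-Min Sketch, offered as a sketch under assumptions: names are hypothetical, each row hashes `String.hashCode()` combined with a per-row seed through a SplitMix64-style mixer, and the dimensions follow the usual w = ⌈e/ε⌉, d = ⌈ln(1/δ)⌉ sizing.

```java
import java.util.Random;

final class CountMinSketch {
    private final int width;
    private final int depth;
    private final long[][] table;
    private final long[] seeds; // one seed per row, so rows hash differently

    CountMinSketch(double epsilon, double delta) {
        this.width = (int) Math.ceil(Math.E / epsilon);
        this.depth = (int) Math.ceil(Math.log(1 / delta));
        this.table = new long[depth][width];
        this.seeds = new long[depth];
        Random rnd = new Random(42); // fixed seed keeps the sketch reproducible
        for (int i = 0; i < depth; i++) seeds[i] = rnd.nextLong();
    }

    int width() { return width; }
    int depth() { return depth; }

    void add(String item, long count) {
        for (int row = 0; row < depth; row++) {
            table[row][index(item, row)] += count;
        }
    }

    // Minimum over rows: collisions only add, so every row is an upper bound.
    long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, table[row][index(item, row)]);
        }
        return min;
    }

    private int index(String item, int row) {
        long h = mix64(item.hashCode() ^ seeds[row]);
        return (int) Math.floorMod(h, (long) width);
    }

    private static long mix64(long z) {
        z += 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }
}
```

With ε = 0.01 and δ = 0.01 the table has 272 columns and 5 rows. Estimates never fall below the true count; they can only exceed it when an item collides with others in every row.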

Visualization

┌─────────────────────────────────────────────────────────────────┐
│                 COUNT-MIN SKETCH                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Add "apple" (appears 3 times):                                 │
│                                                                 │
│  Row 0: h0("apple")=2   [0][0][3][0][0][0]                      │
│  Row 1: h1("apple")=5   [0][0][0][0][0][3]                      │
│  Row 2: h2("apple")=1   [0][3][0][0][0][0]                      │
│                                                                 │
│  Add "banana" (appears 2 times, collides with apple in row 0):  │
│                                                                 │
│  Row 0: h0("banana")=2  [0][0][5][0][0][0]  ← Collision!        │
│  Row 1: h1("banana")=3  [0][0][0][2][0][3]                      │
│  Row 2: h2("banana")=4  [0][3][0][0][2][0]                      │
│                                                                 │
│  Query "apple":                                                 │
│  Row 0: table[0][2] = 5                                         │
│  Row 1: table[1][5] = 3                                         │
│  Row 2: table[2][1] = 3                                         │
│  Estimate = min(5, 3, 3) = 3 ✓ (correct!)                       │
│                                                                 │
│  Query "banana":                                                │
│  Row 0: table[0][2] = 5  ← Overestimate due to collision        │
│  Row 1: table[1][3] = 2                                         │
│  Row 2: table[2][4] = 2                                         │
│  Estimate = min(5, 2, 2) = 2 ✓ (correct!)                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Use Cases


🔬 Deep Dive: HyperLogLog

Implementation

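A minimal HyperLogLog sketch under stated assumptions: elements are `long` values, the hash is a SplitMix64-style mixer (an assumed stand-in), each register is stored as one byte, and only the small-range (linear counting) correction from the original algorithm is applied. Class and method names are hypothetical.

```java
final class HyperLogLog {
    private final int p;            // precision: number of index bits
    private final int m;            // number of registers, 2^p
    private final byte[] registers; // max rank seen per register

    HyperLogLog(int precision) {
        if (precision < 4 || precision > 16) {
            throw new IllegalArgumentException("precision must be in [4, 16]");
        }
        this.p = precision;
        this.m = 1 << precision;
        this.registers = new byte[m];
    }

    static long mix64(long z) {
        z += 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    void add(long element) {
        long h = mix64(element);
        int index = (int) (h >>> (64 - p)); // top p bits pick the register
        long rest = h << p;                 // remaining bits feed the "coin flips"
        // Rank = position of the first 1 bit, capped when the remaining bits are all zero.
        int rank = Math.min(Long.numberOfLeadingZeros(rest), 64 - p) + 1;
        if (rank > registers[index]) registers[index] = (byte) rank;
    }

    double estimate() {
        double sum = 0.0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2.0, -r);
            if (r == 0) zeros++;
        }
        double raw = alpha() * m * (double) m / sum; // harmonic-mean estimator
        if (raw <= 2.5 * m && zeros > 0) {
            return m * Math.log((double) m / zeros); // linear counting for small sets
        }
        return raw;
    }

    // Union of two sketches: take the larger rank in each register.
    void merge(HyperLogLog other) {
        if (other.p != p) throw new IllegalArgumentException("precision mismatch");
        for (int i = 0; i < m; i++) {
            if (other.registers[i] > registers[i]) registers[i] = other.registers[i];
        }
    }

    private double alpha() {
        switch (m) {
            case 16: return 0.673;
            case 32: return 0.697;
            case 64: return 0.709;
            default: return 0.7213 / (1 + 1.079 / m);
        }
    }
}
```

Merging is what lets per-server or per-day counters be combined into a global count without revisiting the raw data.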

Error Analysis

┌─────────────────────────────────────────────────────────────────┐
│              HYPERLOGLOG PRECISION                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Precision P | Registers M | Memory | Standard Error            │
│  ───────────────────────────────────────────────────────────────│
│      4       |     16      |  16B   |    26%                    │
│      8       |    256      | 256B   |     6.5%                  │
│     12       |   4096      |  4KB   |     1.6%                  │
│     14       |  16384      | 16KB   |     0.8%                  │
│     16       |  65536      | 64KB   |     0.4%                  │
│                                                                 │
│  Standard Error ≈ 1.04 / √M                                     │
│                                                                 │
│  P=14 (default): 16KB memory, 0.8% error                        │
│  Can count up to 2^64 distinct elements!                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

🔬 Deep Dive: Reservoir Sampling

The Problem

Select k items uniformly at random from a stream of unknown length.

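One common way to solve this is the classic reservoir algorithm (often called Algorithm R), sketched here with hypothetical names. The stream is modeled as an `Iterator`, and the random index is drawn with `nextDouble()`, which is an approximation that is adequate for streams far shorter than 2^53 items.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

final class ReservoirSampler {
    // Returns up to k items; each stream item ends up in the sample with probability k/n.
    static <T> List<T> sample(Iterator<T> stream, int k, Random rnd) {
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(item); // fill the reservoir with the first k items
            } else {
                long j = (long) (rnd.nextDouble() * seen); // uniform in [0, seen)
                if (j < k) {
                    reservoir.set((int) j, item); // replace with probability k/seen
                }
            }
        }
        return reservoir;
    }
}
```

The sampler never needs to know the stream length in advance and keeps only k items in memory at any time.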

Weighted Reservoir Sampling

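One well-known technique for the weighted case is the Efraimidis-Spirakis scheme (A-Res): give each item the key u^(1/w) for a uniform random u and keep the k largest keys. The sketch below assumes that scheme and uses hypothetical names; it takes the items as a list for brevity but touches each one only once, in order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

final class WeightedReservoirSampler {
    private static final class Keyed<T> {
        final T item;
        final double key;
        Keyed(T item, double key) { this.item = item; this.key = key; }
    }

    static <T> List<T> sample(List<T> items, double[] weights, int k, Random rnd) {
        // Min-heap on key: the root is the weakest member of the current sample.
        PriorityQueue<Keyed<T>> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Keyed<T> e) -> e.key));
        for (int i = 0; i < items.size(); i++) {
            if (weights[i] <= 0) continue; // non-positive weights are never sampled
            double key = Math.pow(rnd.nextDouble(), 1.0 / weights[i]);
            if (heap.size() < k) {
                heap.add(new Keyed<>(items.get(i), key));
            } else if (k > 0 && key > heap.peek().key) {
                heap.poll(); // evict the weakest key
                heap.add(new Keyed<>(items.get(i), key));
            }
        }
        List<T> out = new ArrayList<>();
        for (Keyed<T> e : heap) out.add(e.item);
        return out;
    }
}
```

Larger weights push the key toward 1, so heavy items are more likely to survive in the heap.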

🔬 Deep Dive: Randomized Algorithms

QuickSelect (Las Vegas)

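A minimal QuickSelect sketch with a random pivot, using hypothetical names and a Lomuto-style partition. The result is always correct; only the running time depends on the random choices. Note that this simple partition can degrade on inputs with many equal values.

```java
import java.util.Random;

final class QuickSelect {
    // Returns the k-th smallest element (k is 0-based). Reorders the array in place.
    static int select(int[] a, int k, Random rnd) {
        if (k < 0 || k >= a.length) {
            throw new IllegalArgumentException("k out of range");
        }
        int lo = 0;
        int hi = a.length - 1;
        while (true) {
            if (lo == hi) return a[lo];
            int pivotIndex = lo + rnd.nextInt(hi - lo + 1); // random pivot in [lo, hi]
            int p = partition(a, lo, hi, pivotIndex);
            if (k == p) return a[k];
            if (k < p) hi = p - 1; // answer lies left of the pivot
            else lo = p + 1;       // answer lies right of the pivot
        }
    }

    // Moves the pivot to its final sorted position and returns that position.
    private static int partition(int[] a, int lo, int hi, int pivotIndex) {
        int pivot = a[pivotIndex];
        swap(a, pivotIndex, hi);
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (a[i] < pivot) swap(a, i, store++);
        }
        swap(a, store, hi);
        return store;
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i];
        a[i] = a[j];
        a[j] = t;
    }
}
```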

Miller-Rabin Primality Test (Monte Carlo)

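A sketch of the Miller-Rabin test for `long` inputs, with hypothetical names. It assumes `BigInteger.modPow` for the modular exponentiation to sidestep 64-bit overflow, which is simple rather than fast. A prime is never rejected; a composite slips through a single round with probability at most 1/4.

```java
import java.math.BigInteger;
import java.util.Random;

final class MillerRabin {
    static boolean isProbablePrime(long n, int rounds, Random rnd) {
        if (n < 2) return false;
        if (n < 4) return true;          // 2 and 3
        if ((n & 1) == 0) return false;  // even numbers above 2

        // Write n - 1 as d * 2^s with d odd.
        long d = n - 1;
        int s = 0;
        while ((d & 1) == 0) {
            d >>= 1;
            s++;
        }

        BigInteger bigN = BigInteger.valueOf(n);
        BigInteger nMinusOne = bigN.subtract(BigInteger.ONE);
        BigInteger bigD = BigInteger.valueOf(d);

        for (int i = 0; i < rounds; i++) {
            long a = 2 + (long) (rnd.nextDouble() * (n - 3)); // random base, roughly uniform in [2, n-2]
            BigInteger x = BigInteger.valueOf(a).modPow(bigD, bigN);
            if (x.equals(BigInteger.ONE) || x.equals(nMinusOne)) continue;
            boolean composite = true;
            for (int r = 1; r < s; r++) {
                x = x.multiply(x).mod(bigN);
                if (x.equals(nMinusOne)) {
                    composite = false; // this base does not witness compositeness
                    break;
                }
            }
            if (composite) return false; // definitely composite
        }
        return true; // probably prime
    }
}
```

With 20 rounds the chance of accepting a composite is at most 4^-20, which is why the answer is "probably prime" rather than "prime".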

🔬 Deep Dive: Approximation Algorithms

Vertex Cover (2-Approximation)

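The textbook 2-approximation picks any uncovered edge and adds both of its endpoints. A minimal sketch, assuming edges are given as `int[]{u, v}` pairs (the class name is hypothetical):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class VertexCoverApprox {
    // Each edge is an int[]{u, v}. Returns a cover at most twice the optimal size.
    static Set<Integer> cover(List<int[]> edges) {
        Set<Integer> cover = new HashSet<>();
        for (int[] e : edges) {
            // An edge with neither endpoint chosen is still uncovered: take both ends.
            if (!cover.contains(e[0]) && !cover.contains(e[1])) {
                cover.add(e[0]);
                cover.add(e[1]);
            }
        }
        return cover;
    }
}
```

The chosen edges share no endpoints, and any valid cover must contain at least one endpoint of each of them, which is where the factor of 2 comes from. On the path 1-2-3-4 this returns all four vertices, while the optimum {2, 3} has two.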

Set Cover (Greedy log(n)-Approximation)

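The greedy rule is to repeatedly pick the subset that covers the most still-uncovered elements. A minimal sketch with hypothetical names, returning the indices of the chosen subsets:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class GreedySetCover {
    static List<Integer> cover(Set<Integer> universe, List<Set<Integer>> subsets) {
        Set<Integer> uncovered = new HashSet<>(universe);
        List<Integer> chosen = new ArrayList<>();
        while (!uncovered.isEmpty()) {
            int best = -1;
            int bestGain = 0;
            for (int i = 0; i < subsets.size(); i++) {
                int gain = 0;
                for (int x : subsets.get(i)) {
                    if (uncovered.contains(x)) gain++;
                }
                if (gain > bestGain) { // ties keep the earliest subset
                    bestGain = gain;
                    best = i;
                }
            }
            if (best < 0) {
                throw new IllegalArgumentException("universe cannot be covered");
            }
            chosen.add(best);
            uncovered.removeAll(subsets.get(best));
        }
        return chosen;
    }
}
```

For the universe {1..5} with subsets {1,2,3}, {2,4}, {3,4}, {4,5}, the greedy rule takes {1,2,3} first and then {4,5}.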

⚠️ Common Mistakes

Mistake 1: Wrong Bloom Filter Size

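One way to see the cost of a wrongly sized filter is to plug numbers into the false positive formula (1 - e^(-kn/m))^k. The helper below is a hypothetical illustration, not a library API.

```java
final class BloomSizing {
    // Expected false positive rate for m bits, n inserted items, k hash functions.
    static double falsePositiveRate(long m, long n, int k) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }
}
```

With n = 1,000 items and k = 7, a filter of m = 9,586 bits lands near the 1% target, while a filter of only 1,000 bits is almost saturated and answers "probably in set" for nearly every query.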

Mistake 2: Using Random Without Seed


Mistake 3: Not Handling HyperLogLog Edge Cases


🐛 Debug This: Bloom Filter Bug

This Bloom filter has a bug causing false negatives. Can you find it?



Bug: The hash can be negative. hashCode() can return negative values, and Java's % operator keeps the sign of the dividend, so the computed index can be negative too.


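A small illustration of the sign problem and one possible fix, using `Math.floorMod`. The helper names are hypothetical. Note that `Math.abs` is not a safe fix on its own, because `Math.abs(Integer.MIN_VALUE)` is still negative.

```java
final class SafeIndex {
    // Buggy: the result takes the sign of hash, so it can be negative.
    static int unsafeIndex(int hash, int size) {
        return hash % size;
    }

    // Safe: floorMod always returns a value in [0, size) for positive size.
    static int safeIndex(int hash, int size) {
        return Math.floorMod(hash, size);
    }
}
```

For a hash of -7 and a table of size 10, the buggy version yields -7 while `floorMod` yields 3.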

💻 Exercises

Exercise 1: Counting Bloom Filter ⭐⭐

Implement a Bloom filter that supports deletion (use counters instead of bits).


Exercise 2: Min-Hash for Similarity ⭐⭐⭐

Implement MinHash to estimate Jaccard similarity between sets.


Exercise 3: Streaming Median ⭐⭐⭐

Find the approximate median of a stream using reservoir sampling.


Exercise 4: FPTAS for Knapsack ⭐⭐⭐⭐

Implement a fully polynomial-time approximation scheme (FPTAS) for 0/1 knapsack.


Exercise 5: Streaming Heavy Hitters ⭐⭐⭐⭐

Find all elements appearing more than n/k times using O(k) space.


🎤 Interview Questions

Q1: "When should you use a Bloom filter vs a hash set?"

Answer:

Use Bloom Filter	Use Hash Set
Can tolerate false positives	Need 100% accuracy
Memory is constrained	Memory is available
Set is large (millions+)	Set is small
Read-heavy, rarely delete	Need deletions
Distributed merging needed	Single machine

Examples:

Bloom: Spell checker, URL blocklist, cache prefetch
HashSet: Shopping cart, user session, exact dedup

Q2: "Explain the trade-offs in HyperLogLog precision."

Answer:

Higher precision (more registers):
+ Lower error rate (1.04/√M)
- More memory (M bytes)
- Slightly slower merge

Typical choices:
- P=10 (~1KB, 3.3% error): Resource-constrained
- P=14 (~16KB, 0.8% error): Good default
- P=16 (~64KB, 0.4% error): High accuracy needed

The logarithmic nature means:
- Quadrupling memory roughly halves error (error ∝ 1/√M)
- Can count up to ~2^64 items at any precision (with a 64-bit hash)
- Relative error is roughly independent of cardinality

Q3: "How would you find the top-k frequent items in a stream?"

Answer:

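One possible answer combines a Count-Min Sketch for frequency estimates with a small candidate set of size k. The sketch below is an assumption-laden illustration: the names are hypothetical, the row hashes come from a SplitMix64-style mixer over `String.hashCode()`, and a candidate's stored estimate is refreshed only when that item arrives again.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TopKTracker {
    private final int k;
    private final int width;
    private final int depth;
    private final long[][] counts;                                // Count-Min table
    private final Map<String, Long> candidates = new HashMap<>(); // item -> last estimate

    TopKTracker(int k, int width, int depth) {
        if (k < 1) throw new IllegalArgumentException("k must be at least 1");
        this.k = k;
        this.width = width;
        this.depth = depth;
        this.counts = new long[depth][width];
    }

    void add(String item) {
        long est = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            long h = mix64(item.hashCode() + 0x9E3779B97F4A7C15L * (row + 1));
            int col = (int) Math.floorMod(h, (long) width);
            est = Math.min(est, ++counts[row][col]); // update, then take the row minimum
        }
        if (candidates.containsKey(item) || candidates.size() < k) {
            candidates.put(item, est);
            return;
        }
        // Candidate set is full: evict the weakest entry if this item now beats it.
        String weakest = null;
        long weakestCount = Long.MAX_VALUE;
        for (Map.Entry<String, Long> e : candidates.entrySet()) {
            if (e.getValue() < weakestCount) {
                weakestCount = e.getValue();
                weakest = e.getKey();
            }
        }
        if (est > weakestCount) {
            candidates.remove(weakest);
            candidates.put(item, est);
        }
    }

    // Candidates ordered from most to least frequent (by estimate).
    List<String> topK() {
        List<String> out = new ArrayList<>(candidates.keySet());
        out.sort((a, b) -> Long.compare(candidates.get(b), candidates.get(a)));
        return out;
    }

    private static long mix64(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }
}
```

Memory stays at O(width × depth + k) regardless of stream length. The linear scan for the weakest candidate keeps the sketch short; a min-heap would make eviction O(log k).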

Q4: "What's the difference between Las Vegas and Monte Carlo algorithms?"

Answer:

Las Vegas	Monte Carlo
Always correct	May be incorrect
Random running time	Fixed running time
E[time] is bounded	Error probability bounded

Examples:

Las Vegas:
- QuickSort with random pivot: correct sort, E[O(n log n)] time
- QuickSelect: correct k-th element, E[O(n)] time

Monte Carlo:
- Miller-Rabin primality: may say composite is prime, O(k log³ n)
- Approximate counting: estimate with bounded error

Converting:
- Monte Carlo → Las Vegas: Run until verifiable answer
- Las Vegas → Monte Carlo: Set time limit, return best found

Q5: "Design a system to detect duplicate URLs across multiple servers."

Answer:

Architecture:
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐          │
│  │ Server1 │   │ Server2 │   │ Server3 │   │ Server4 │          │
│  │ Bloom   │   │ Bloom   │   │ Bloom   │   │ Bloom   │          │
│  │ Filter  │   │ Filter  │   │ Filter  │   │ Filter  │          │
│  └────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘          │
│       │             │             │             │               │
│       └─────────────┴─────────────┴─────────────┘               │
│                          │                                      │
│                     Merge Layer                                 │
│                     (OR operation)                              │
│                          │                                      │
│                   Global Bloom Filter                           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Algorithm:
1. Each server maintains local Bloom filter
2. On new URL: check local filter first (fast)
3. If negative locally, check global (slower)
4. Periodically merge local → global
5. For positives: verify in actual URL store (handles false positives)

Benefits:
- Fast local lookups
- Distributed, scalable
- Mergeable filters
- Bounded false positive rate
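The merge step works because two Bloom filters built with the same size and the same hash functions can be combined with a bitwise OR. A minimal sketch of that step, with hypothetical names and an assumed SplitMix64-style hash over `String.hashCode()`:

```java
import java.util.BitSet;

final class MergeableBloomFilter {
    private final BitSet bits;
    private final int m; // bits; must match across all servers
    private final int k; // hash positions; must match across all servers

    MergeableBloomFilter(int m, int k) {
        this.m = m;
        this.k = k;
        this.bits = new BitSet(m);
    }

    void add(String url) {
        for (int i = 0; i < k; i++) bits.set(position(url, i));
    }

    boolean mightContain(String url) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(url, i))) return false;
        }
        return true;
    }

    // OR-merge: afterwards this filter answers for the union of both sets.
    void mergeFrom(MergeableBloomFilter other) {
        if (other.m != m || other.k != k) {
            throw new IllegalArgumentException("filters must share m and k");
        }
        bits.or(other.bits);
    }

    private int position(String s, int i) {
        long h = mix64(s.hashCode());
        int h1 = (int) h;
        int h2 = (int) (h >>> 32) | 1;
        return Math.floorMod(h1 + i * h2, m);
    }

    private static long mix64(long z) {
        z += 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }
}
```

The merged filter's false positive rate reflects the combined number of URLs, so the global filter should be sized for the total, not for a single server's share.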

📋 Quick Reference

Data Structures

Structure	Operation	Space	Error
Bloom Filter	Membership	O(n)	FP: 1% typical
Count-Min Sketch	Frequency	O(1/ε × log(1/δ))	Overestimate
HyperLogLog	Cardinality	O(M) registers, fixed (e.g. 16KB)	~1% typical

Algorithm Types

Type	Correctness	Time
Las Vegas	Always	Random
Monte Carlo	Probably	Deterministic
Approximation	Within factor	Deterministic

Key Formulas

Bloom Filter:
- Optimal bits: m = -n × ln(p) / (ln(2))²
- Optimal hashes: k = (m/n) × ln(2)
- FP rate: (1 - e^(-kn/m))^k

HyperLogLog:
- Standard error: 1.04 / √M
- Memory: M registers × 6 bits for a 64-bit hash (5 bits for a 32-bit hash); often stored as 1 byte each

Count-Min Sketch:
- Width w = ⌈e/ε⌉ (overestimate at most ε × total count)
- Depth d = ⌈ln(1/δ)⌉ (bound holds with probability 1-δ)

🔗 What's Next?

In Part 22: Algorithm Design Patterns & Interview Prep, we'll explore:

Two pointers technique
Sliding window
Monotonic stack/queue
Pattern recognition guide
Interview strategies

📅 Review Schedule

Day	Focus	Time
0	Full read + Bloom Filter	90 min
1	Quick Reference	10 min
3	Implement HyperLogLog	30 min
7	Count-Min Sketch exercises	25 min
14	Debug exercise + approximation	20 min
30	Interview questions	15 min

