Algorithms
Probabilistic Data Structures
π At a Glance
| Aspect | Details |
|---|---|
| Time to Read | 35 minutes |
| Prerequisites | Part 1 (Algorithm Analysis), Basic probability |
| Key Concepts | Randomized Algorithms, Bloom Filters, HyperLogLog, Approximation |
| Difficulty | βββ (Intermediate) |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β PROBABILISTIC ALGORITHMS β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β WHY RANDOMIZATION? β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β β’ Simpler algorithms β β β β β’ Better average-case performance β β β β β’ Avoid worst-case inputs β β β β β’ Sublinear space for streaming β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β β TWO TYPES β β βββββββββββββββββββββββ βββββββββββββββββββββββ β β β LAS VEGAS β β MONTE CARLO β β β β βββββββββββββββββββ β β βββββββββββββββββββ β β β β Always correct β β Usually correct β β β β Random running time β β Fixed running time β β β β Example: QuickSort β β Example: Miller-Rabinβ β β β with random pivot β β primality test β β β βββββββββββββββββββββββ βββββββββββββββββββββββ β β β β SKETCHING DATA STRUCTURES β β βββββββββββββββββββ βββββββββββββββββββ ββββββββββββββββ β β β Bloom Filter β β Count-Min Sketchβ β HyperLogLog β β β β βββββββββββββββ β β βββββββββββββββ β β ββββββββββββ β β β β Set membership β β Frequency est. β β Cardinality β β β β No false neg β β May overcount β β estimation β β β βββββββββββββββββββ βββββββββββββββββββ ββββββββββββββββ β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
π― What You'll Learn
After completing this article, you will be able to:
- Distinguish algorithm types: Las Vegas vs Monte Carlo
- Implement Bloom filters: Space-efficient set membership
- Use Count-Min Sketch: Frequency estimation in streams
- Apply HyperLogLog: Count distinct elements in O(1) space
- Design approximation algorithms: Trade accuracy for speed
π₯ Production Story: Counting Unique Visitors at Scale
A social media platform needed to count unique daily visitors - but with 500 million users, storing all IDs was impossible.
The Problem
JAVA(16 lines)CodeLoading syntax highlighter...
The Solution: HyperLogLog
JAVA(37 lines)CodeLoading syntax highlighter...
The Impact
| Metric | HashSet | HyperLogLog |
|---|---|---|
| Memory per counter | 4GB | 16KB |
| Counters possible | 1 | 250,000+ |
| Error | 0% | ~2% |
| Mergeable | No | Yes |
π§ Mental Model: The Coin Flip Experiment
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β HYPERLOGLOG INTUITION β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β Thought experiment: Flip coins until you get heads β β β β 1 flip: H (50% probability) β β 2 flips: TH (25% probability) β β 3 flips: TTH (12.5% probability) β β k flips: T...TH (1/2^k probability) β β β β If you observe k flips, ~2^k unique experiments happened β β β β HyperLogLog: β β 1. Hash each element β random bit string β β 2. Count leading zeros = "coin flips until heads" β β 3. Maximum leading zeros β logβ(n) β β β β Example: β β Element β Hash β Leading zeros β β "alice" β 0001... β 3 zeros β β "bob" β 0000001... β 6 zeros β Maximum! β β "carol" β 001... β 2 zeros β β β β Max leading zeros = 6 β Estimate ~2^6 = 64 unique elements β β β β Problem: High variance! Solution: Use M registers, average. β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
π¬ Deep Dive: Bloom Filters
The Concept
A Bloom filter answers "Is X in the set?" with:
- No β Definitely not in set
- Yes β Probably in set (false positives possible)
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β BLOOM FILTER β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β Bit array: [0][0][0][0][0][0][0][0][0][0] (m bits) β β 0 1 2 3 4 5 6 7 8 9 β β β β Add "apple" (k=3 hash functions): β β h1("apple") = 2, h2("apple") = 5, h3("apple") = 8 β β β β [0][0][1][0][0][1][0][0][1][0] β β β β β β β β β Add "banana": β β h1("banana") = 1, h2("banana") = 5, h3("banana") = 9 β β β β [0][1][1][0][0][1][0][0][1][1] β β β (already 1) β β β β β Query "cherry": β β h1("cherry") = 2, h2("cherry") = 7, h3("cherry") = 9 β β Check bits 2, 7, 9: [1][0][1] β 0 found β NOT IN SET β β β β Query "date": β β h1("date") = 1, h2("date") = 5, h3("date") = 8 β β Check bits 1, 5, 8: [1][1][1] β All 1 β PROBABLY IN SET β β (false positive! "date" was never added) β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Implementation
JAVA(61 lines)CodeLoading syntax highlighter...
Use Cases
JAVA(49 lines)CodeLoading syntax highlighter...
π¬ Deep Dive: Count-Min Sketch
The Concept
Estimate frequency of elements in a stream with limited memory.
JAVA(61 lines)CodeLoading syntax highlighter...
Visualization
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β COUNT-MIN SKETCH β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β Add "apple" (appears 3 times): β β β β Row 0: h0("apple")=2 [0][0][3][0][0][0] β β Row 1: h1("apple")=5 [0][0][0][0][0][3] β β Row 2: h2("apple")=1 [0][3][0][0][0][0] β β β β Add "banana" (appears 2 times, collides with apple in row 0): β β β β Row 0: h0("banana")=2 [0][0][5][0][0][0] β Collision! β β Row 1: h1("banana")=3 [0][0][0][2][0][3] β β Row 2: h2("banana")=4 [0][3][0][0][2][0] β β β β Query "apple": β β Row 0: table[0][2] = 5 β β Row 1: table[1][5] = 3 β β Row 2: table[2][1] = 3 β β Estimate = min(5, 3, 3) = 3 β (correct!) β β β β Query "banana": β β Row 0: table[0][2] = 5 β Overestimate due to collision β β Row 1: table[1][3] = 2 β β Row 2: table[2][4] = 2 β β Estimate = min(5, 2, 2) = 2 β (correct!) β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Use Cases
JAVA(32 lines)CodeLoading syntax highlighter...
π¬ Deep Dive: HyperLogLog
Implementation
JAVA(70 lines)CodeLoading syntax highlighter...
Error Analysis
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β HYPERLOGLOG PRECISION β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β Precision P | Registers M | Memory | Standard Error β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β 4 | 16 | 16B | 26% β β 8 | 256 | 256B | 6.5% β β 12 | 4096 | 4KB | 1.6% β β 14 | 16384 | 16KB | 0.8% β β 16 | 65536 | 64KB | 0.4% β β β β Standard Error β 1.04 / βM β β β β P=14 (default): 16KB memory, 0.8% error β β Can count up to 2^64 distinct elements! β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
π¬ Deep Dive: Reservoir Sampling
The Problem
Select k random items from a stream of unknown length.
JAVA(45 lines)CodeLoading syntax highlighter...
Weighted Reservoir Sampling
JAVA(43 lines)CodeLoading syntax highlighter...
π¬ Deep Dive: Randomized Algorithms
QuickSelect (Las Vegas)
JAVA(46 lines)CodeLoading syntax highlighter...
Miller-Rabin Primality Test (Monte Carlo)
JAVA(57 lines)CodeLoading syntax highlighter...
π¬ Deep Dive: Approximation Algorithms
Vertex Cover (2-Approximation)
JAVA(29 lines)CodeLoading syntax highlighter...
Set Cover (Greedy log(n)-Approximation)
JAVA(37 lines)CodeLoading syntax highlighter...
β οΈ Common Mistakes
Mistake 1: Wrong Bloom Filter Size
JAVA(8 lines)CodeLoading syntax highlighter...
Mistake 2: Using Random Without Seed
JAVA(11 lines)CodeLoading syntax highlighter...
Mistake 3: Not Handling HyperLogLog Edge Cases
JAVA(18 lines)CodeLoading syntax highlighter...
π Debug This: Bloom Filter Bug
This Bloom filter has a bug causing false negatives. Can you find it?
JAVA(29 lines)CodeLoading syntax highlighter...
π Click to reveal the bug
Bug: Hash can be negative!
hashCode() can return negative values, and % preserves sign in Java.JAVA(7 lines)CodeLoading syntax highlighter...
Example:
JAVA(3 lines)CodeLoading syntax highlighter...
π» Exercises
Exercise 1: Counting Bloom Filter ββ
Implement a Bloom filter that supports deletion (use counters instead of bits).
JAVA(5 lines)CodeLoading syntax highlighter...
Exercise 2: Min-Hash for Similarity βββ
Implement MinHash to estimate Jaccard similarity between sets.
JAVA(6 lines)CodeLoading syntax highlighter...
Exercise 3: Streaming Median βββ
Find approximate median in a stream using reservoir sampling.
JAVA(5 lines)CodeLoading syntax highlighter...
Exercise 4: FPTAS for Knapsack ββββ
Implement fully polynomial-time approximation scheme for 0/1 knapsack.
JAVA(5 lines)CodeLoading syntax highlighter...
Exercise 5: Streaming Heavy Hitters ββββ
Find all elements appearing more than n/k times using O(k) space.
JAVA(5 lines)CodeLoading syntax highlighter...
π€ Interview Questions
Q1: "When should you use a Bloom filter vs a hash set?"
Answer:
| Use Bloom Filter | Use Hash Set |
|---|---|
| Can tolerate false positives | Need 100% accuracy |
| Memory is constrained | Memory is available |
| Set is large (millions+) | Set is small |
| Read-heavy, rarely delete | Need deletions |
| Distributed merging needed | Single machine |
Examples:
- Bloom: Spell checker, URL blocklist, cache prefetch
- HashSet: Shopping cart, user session, exact dedup
Q2: "Explain the trade-offs in HyperLogLog precision."
Answer:
Higher precision (more registers): + Lower error rate (1.04/βM) - More memory (M bytes) - Slightly slower merge Typical choices: - P=10 (~1KB, 3.3% error): Resource-constrained - P=14 (~16KB, 0.8% error): Good default - P=16 (~64KB, 0.4% error): High accuracy needed The logarithmic nature means: - Doubling memory halves error (roughly) - Can count 2^64 items with any precision - Error is independent of cardinality
Q3: "How would you find the top-k frequent items in a stream?"
Answer:
JAVA(25 lines)CodeLoading syntax highlighter...
Q4: "What's the difference between Las Vegas and Monte Carlo algorithms?"
Answer:
| Las Vegas | Monte Carlo |
|---|---|
| Always correct | May be incorrect |
| Random running time | Fixed running time |
| E[time] is bounded | Error probability bounded |
Examples:
Las Vegas: - QuickSort with random pivot: correct sort, E[O(n log n)] time - QuickSelect: correct k-th element, E[O(n)] time Monte Carlo: - Miller-Rabin primality: may say composite is prime, O(k logΒ³ n) - Approximate counting: estimate with bounded error Converting: - Monte Carlo β Las Vegas: Run until verifiable answer - Las Vegas β Monte Carlo: Set time limit, return best found
Q5: "Design a system to detect duplicate URLs across multiple servers."
Answer:
Architecture: βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β βββββββββββ βββββββββββ βββββββββββ βββββββββββ β β β Server1 β β Server2 β β Server3 β β Server4 β β β β Bloom β β Bloom β β Bloom β β Bloom β β β β Filter β β Filter β β Filter β β Filter β β β ββββββ¬βββββ ββββββ¬βββββ ββββββ¬βββββ ββββββ¬βββββ β β β β β β β β βββββββββββββββ΄ββββββββββββββ΄ββββββββββββββ β β β β β Merge Layer β β (OR operation) β β β β β Global Bloom Filter β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ Algorithm: 1. Each server maintains local Bloom filter 2. On new URL: check local filter first (fast) 3. If negative locally, check global (slower) 4. Periodically merge local β global 5. For positives: verify in actual URL store (handles false positives) Benefits: - Fast local lookups - Distributed, scalable - Mergeable filters - Bounded false positive rate
π Quick Reference
Data Structures
| Structure | Operation | Space | Error |
|---|---|---|---|
| Bloom Filter | Membership | O(n) | FP: 1% typical |
| Count-Min Sketch | Frequency | O(1/Ξ΅ Γ log(1/Ξ΄)) | Overestimate |
| HyperLogLog | Cardinality | O(1) | ~1% typical |
Algorithm Types
| Type | Correctness | Time |
|---|---|---|
| Las Vegas | Always | Random |
| Monte Carlo | Probably | Deterministic |
| Approximation | Within factor | Deterministic |
Key Formulas
Bloom Filter: - Optimal bits: m = -n Γ ln(p) / (ln(2))Β² - Optimal hashes: k = (m/n) Γ ln(2) - FP rate: (1 - e^(-kn/m))^k HyperLogLog: - Standard error: 1.04 / βM - Memory: M registers Γ 5 bits Count-Min Sketch: - Width w = e/Ξ΅ (for error Ξ΅) - Depth d = ln(1/Ξ΄) (for confidence 1-Ξ΄)
π What's Next?
In Part 22: Algorithm Design Patterns & Interview Prep, we'll explore:
- Two pointers technique
- Sliding window
- Monotonic stack/queue
- Pattern recognition guide
- Interview strategies
π Review Schedule
| Day | Focus | Time |
|---|---|---|
| 0 | Full read + Bloom Filter | 90 min |
| 1 | Quick Reference | 10 min |
| 3 | Implement HyperLogLog | 30 min |
| 7 | Count-Min Sketch exercises | 25 min |
| 14 | Debug exercise + approximation | 20 min |
| 30 | Interview questions | 15 min |
Series: Algorithms Compendium Index