Algorithms

Probabilistic Data Structures

πŸ“‹ At a Glance

AspectDetails
Time to Read35 minutes
PrerequisitesPart 1 (Algorithm Analysis), Basic probability
Key ConceptsRandomized Algorithms, Bloom Filters, HyperLogLog, Approximation
Difficulty⭐⭐⭐ (Intermediate)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚             PROBABILISTIC ALGORITHMS                            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  WHY RANDOMIZATION?                                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚ β€’ Simpler algorithms                                    β”‚    β”‚
β”‚  β”‚ β€’ Better average-case performance                       β”‚    β”‚
β”‚  β”‚ β€’ Avoid worst-case inputs                               β”‚    β”‚
β”‚  β”‚ β€’ Sublinear space for streaming                         β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                                                                 β”‚
β”‚  TWO TYPES                                                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”‚
β”‚  β”‚ LAS VEGAS           β”‚    β”‚ MONTE CARLO         β”‚             β”‚
β”‚  β”‚ ─────────────────── β”‚    β”‚ ─────────────────── β”‚             β”‚
β”‚  β”‚ Always correct      β”‚    β”‚ Usually correct     β”‚             β”‚
β”‚  β”‚ Random running time β”‚    β”‚ Fixed running time  β”‚             β”‚
β”‚  β”‚ Example: QuickSort  β”‚    β”‚ Example: Miller-Rabinβ”‚            β”‚
β”‚  β”‚ with random pivot   β”‚    β”‚ primality test      β”‚             β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β”‚
β”‚                                                                 β”‚
β”‚  SKETCHING DATA STRUCTURES                                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚
β”‚  β”‚ Bloom Filter    β”‚  β”‚ Count-Min Sketchβ”‚  β”‚ HyperLogLog  β”‚     β”‚
β”‚  β”‚ ─────────────── β”‚  β”‚ ─────────────── β”‚  β”‚ ──────────── β”‚     β”‚
β”‚  β”‚ Set membership  β”‚  β”‚ Frequency est.  β”‚  β”‚ Cardinality  β”‚     β”‚
β”‚  β”‚ No false neg    β”‚  β”‚ May overcount   β”‚  β”‚ estimation   β”‚     β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

🎯 What You'll Learn

After completing this article, you will be able to:

  1. Distinguish algorithm types: Las Vegas vs Monte Carlo
  2. Implement Bloom filters: Space-efficient set membership
  3. Use Count-Min Sketch: Frequency estimation in streams
  4. Apply HyperLogLog: Count distinct elements in O(1) space
  5. Design approximation algorithms: Trade accuracy for speed

πŸ”₯ Production Story: Counting Unique Visitors at Scale

A social media platform needed to count unique daily visitors - but with 500 million users, storing all IDs was impossible.

The Problem

JAVA(16 lines)
Code
Loading syntax highlighter...

The Solution: HyperLogLog

JAVA(37 lines)
Code
Loading syntax highlighter...

The Impact

MetricHashSetHyperLogLog
Memory per counter4GB16KB
Counters possible1250,000+
Error0%~2%
MergeableNoYes

🧠 Mental Model: The Coin Flip Experiment

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           HYPERLOGLOG INTUITION                                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  Thought experiment: Flip coins until you get heads             β”‚
β”‚                                                                 β”‚
β”‚  1 flip:  H (50% probability)                                   β”‚
β”‚  2 flips: TH (25% probability)                                  β”‚
β”‚  3 flips: TTH (12.5% probability)                               β”‚
β”‚  k flips: T...TH (1/2^k probability)                            β”‚
β”‚                                                                 β”‚
β”‚  If you observe k flips, ~2^k unique experiments happened       β”‚
β”‚                                                                 β”‚
β”‚  HyperLogLog:                                                   β”‚
β”‚  1. Hash each element β†’ random bit string                       β”‚
β”‚  2. Count leading zeros = "coin flips until heads"              β”‚
β”‚  3. Maximum leading zeros β‰ˆ logβ‚‚(n)                             β”‚
β”‚                                                                 β”‚
β”‚  Example:                                                       β”‚
β”‚  Element β†’ Hash β†’ Leading zeros                                 β”‚
β”‚  "alice" β†’ 0001... β†’ 3 zeros                                    β”‚
β”‚  "bob"   β†’ 0000001... β†’ 6 zeros   ← Maximum!                    β”‚
β”‚  "carol" β†’ 001... β†’ 2 zeros                                     β”‚
β”‚                                                                 β”‚
β”‚  Max leading zeros = 6 β†’ Estimate ~2^6 = 64 unique elements     β”‚
β”‚                                                                 β”‚
β”‚  Problem: High variance! Solution: Use M registers, average.    β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ”¬ Deep Dive: Bloom Filters

The Concept

A Bloom filter answers "Is X in the set?" with:

  • No β†’ Definitely not in set
  • Yes β†’ Probably in set (false positives possible)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    BLOOM FILTER                                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  Bit array: [0][0][0][0][0][0][0][0][0][0]  (m bits)            β”‚
β”‚              0  1  2  3  4  5  6  7  8  9                       β”‚
β”‚                                                                 β”‚
β”‚  Add "apple" (k=3 hash functions):                              β”‚
β”‚  h1("apple") = 2, h2("apple") = 5, h3("apple") = 8              β”‚
β”‚                                                                 β”‚
β”‚  [0][0][1][0][0][1][0][0][1][0]                                 β”‚
β”‚        ↑        ↑        ↑                                      β”‚
β”‚                                                                 β”‚
β”‚  Add "banana":                                                  β”‚
β”‚  h1("banana") = 1, h2("banana") = 5, h3("banana") = 9           β”‚
β”‚                                                                 β”‚
β”‚  [0][1][1][0][0][1][0][0][1][1]                                 β”‚
β”‚     ↑           (already 1)     ↑                               β”‚
β”‚                                                                 β”‚
β”‚  Query "cherry":                                                β”‚
β”‚  h1("cherry") = 2, h2("cherry") = 7, h3("cherry") = 9           β”‚
β”‚  Check bits 2, 7, 9: [1][0][1] β†’ 0 found β†’ NOT IN SET           β”‚
β”‚                                                                 β”‚
β”‚  Query "date":                                                  β”‚
β”‚  h1("date") = 1, h2("date") = 5, h3("date") = 8                 β”‚
β”‚  Check bits 1, 5, 8: [1][1][1] β†’ All 1 β†’ PROBABLY IN SET        β”‚
β”‚  (false positive! "date" was never added)                       β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Implementation

JAVA(61 lines)
Code
Loading syntax highlighter...

Use Cases

JAVA(49 lines)
Code
Loading syntax highlighter...

πŸ”¬ Deep Dive: Count-Min Sketch

The Concept

Estimate frequency of elements in a stream with limited memory.

JAVA(61 lines)
Code
Loading syntax highlighter...

Visualization

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                 COUNT-MIN SKETCH                                β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  Add "apple" (appears 3 times):                                 β”‚
β”‚                                                                 β”‚
β”‚  Row 0: h0("apple")=2   [0][0][3][0][0][0]                      β”‚
β”‚  Row 1: h1("apple")=5   [0][0][0][0][0][3]                      β”‚
β”‚  Row 2: h2("apple")=1   [0][3][0][0][0][0]                      β”‚
β”‚                                                                 β”‚
β”‚  Add "banana" (appears 2 times, collides with apple in row 0):  β”‚
β”‚                                                                 β”‚
β”‚  Row 0: h0("banana")=2  [0][0][5][0][0][0]  ← Collision!        β”‚
β”‚  Row 1: h1("banana")=3  [0][0][0][2][0][3]                      β”‚
β”‚  Row 2: h2("banana")=4  [0][3][0][0][2][0]                      β”‚
β”‚                                                                 β”‚
β”‚  Query "apple":                                                 β”‚
β”‚  Row 0: table[0][2] = 5                                         β”‚
β”‚  Row 1: table[1][5] = 3                                         β”‚
β”‚  Row 2: table[2][1] = 3                                         β”‚
β”‚  Estimate = min(5, 3, 3) = 3 βœ“ (correct!)                       β”‚
β”‚                                                                 β”‚
β”‚  Query "banana":                                                β”‚
β”‚  Row 0: table[0][2] = 5  ← Overestimate due to collision        β”‚
β”‚  Row 1: table[1][3] = 2                                         β”‚
β”‚  Row 2: table[2][4] = 2                                         β”‚
β”‚  Estimate = min(5, 2, 2) = 2 βœ“ (correct!)                       β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Use Cases

JAVA(32 lines)
Code
Loading syntax highlighter...

πŸ”¬ Deep Dive: HyperLogLog

Implementation

JAVA(70 lines)
Code
Loading syntax highlighter...

Error Analysis

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              HYPERLOGLOG PRECISION                              β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  Precision P | Registers M | Memory | Standard Error            β”‚
β”‚  ───────────────────────────────────────────────────────────────│
β”‚      4       |     16      |  16B   |    26%                    β”‚
β”‚      8       |    256      | 256B   |     6.5%                  β”‚
β”‚     12       |   4096      |  4KB   |     1.6%                  β”‚
β”‚     14       |  16384      | 16KB   |     0.8%                  β”‚
β”‚     16       |  65536      | 64KB   |     0.4%                  β”‚
β”‚                                                                 β”‚
β”‚  Standard Error β‰ˆ 1.04 / √M                                     β”‚
β”‚                                                                 β”‚
β”‚  P=14 (default): 16KB memory, 0.8% error                        β”‚
β”‚  Can count up to 2^64 distinct elements!                        β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ”¬ Deep Dive: Reservoir Sampling

The Problem

Select k random items from a stream of unknown length.

JAVA(45 lines)
Code
Loading syntax highlighter...

Weighted Reservoir Sampling

JAVA(43 lines)
Code
Loading syntax highlighter...

πŸ”¬ Deep Dive: Randomized Algorithms

QuickSelect (Las Vegas)

JAVA(46 lines)
Code
Loading syntax highlighter...

Miller-Rabin Primality Test (Monte Carlo)

JAVA(57 lines)
Code
Loading syntax highlighter...

πŸ”¬ Deep Dive: Approximation Algorithms

Vertex Cover (2-Approximation)

JAVA(29 lines)
Code
Loading syntax highlighter...

Set Cover (Greedy log(n)-Approximation)

JAVA(37 lines)
Code
Loading syntax highlighter...

⚠️ Common Mistakes

Mistake 1: Wrong Bloom Filter Size

JAVA(8 lines)
Code
Loading syntax highlighter...

Mistake 2: Using Random Without Seed

JAVA(11 lines)
Code
Loading syntax highlighter...

Mistake 3: Not Handling HyperLogLog Edge Cases

JAVA(18 lines)
Code
Loading syntax highlighter...

πŸ› Debug This: Bloom Filter Bug

This Bloom filter has a bug causing false negatives. Can you find it?

JAVA(29 lines)
Code
Loading syntax highlighter...
πŸ” Click to reveal the bug
Bug: Hash can be negative! hashCode() can return negative values, and % preserves sign in Java.
JAVA(7 lines)
Code
Loading syntax highlighter...
Example:
JAVA(3 lines)
Code
Loading syntax highlighter...

πŸ’» Exercises

Exercise 1: Counting Bloom Filter ⭐⭐

Implement a Bloom filter that supports deletion (use counters instead of bits).

JAVA(5 lines)
Code
Loading syntax highlighter...

Exercise 2: Min-Hash for Similarity ⭐⭐⭐

Implement MinHash to estimate Jaccard similarity between sets.

JAVA(6 lines)
Code
Loading syntax highlighter...

Exercise 3: Streaming Median ⭐⭐⭐

Find approximate median in a stream using reservoir sampling.

JAVA(5 lines)
Code
Loading syntax highlighter...

Exercise 4: FPTAS for Knapsack ⭐⭐⭐⭐

Implement fully polynomial-time approximation scheme for 0/1 knapsack.

JAVA(5 lines)
Code
Loading syntax highlighter...

Exercise 5: Streaming Heavy Hitters ⭐⭐⭐⭐

Find all elements appearing more than n/k times using O(k) space.

JAVA(5 lines)
Code
Loading syntax highlighter...

🎀 Interview Questions

Q1: "When should you use a Bloom filter vs a hash set?"

Answer:
Use Bloom FilterUse Hash Set
Can tolerate false positivesNeed 100% accuracy
Memory is constrainedMemory is available
Set is large (millions+)Set is small
Read-heavy, rarely deleteNeed deletions
Distributed merging neededSingle machine
Examples:
  • Bloom: Spell checker, URL blocklist, cache prefetch
  • HashSet: Shopping cart, user session, exact dedup

Q2: "Explain the trade-offs in HyperLogLog precision."

Answer:
Higher precision (more registers):
+ Lower error rate (1.04/√M)
- More memory (M bytes)
- Slightly slower merge

Typical choices:
- P=10 (~1KB, 3.3% error): Resource-constrained
- P=14 (~16KB, 0.8% error): Good default
- P=16 (~64KB, 0.4% error): High accuracy needed

The logarithmic nature means:
- Doubling memory halves error (roughly)
- Can count 2^64 items with any precision
- Error is independent of cardinality

Q3: "How would you find the top-k frequent items in a stream?"

Answer:
JAVA(25 lines)
Code
Loading syntax highlighter...

Q4: "What's the difference between Las Vegas and Monte Carlo algorithms?"

Answer:
Las VegasMonte Carlo
Always correctMay be incorrect
Random running timeFixed running time
E[time] is boundedError probability bounded
Examples:
Las Vegas:
- QuickSort with random pivot: correct sort, E[O(n log n)] time
- QuickSelect: correct k-th element, E[O(n)] time

Monte Carlo:
- Miller-Rabin primality: may say composite is prime, O(k logΒ³ n)
- Approximate counting: estimate with bounded error

Converting:
- Monte Carlo β†’ Las Vegas: Run until verifiable answer
- Las Vegas β†’ Monte Carlo: Set time limit, return best found

Q5: "Design a system to detect duplicate URLs across multiple servers."

Answer:
Architecture:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”‚
β”‚  β”‚ Server1 β”‚   β”‚ Server2 β”‚   β”‚ Server3 β”‚   β”‚ Server4 β”‚          β”‚
β”‚  β”‚ Bloom   β”‚   β”‚ Bloom   β”‚   β”‚ Bloom   β”‚   β”‚ Bloom   β”‚          β”‚
β”‚  β”‚ Filter  β”‚   β”‚ Filter  β”‚   β”‚ Filter  β”‚   β”‚ Filter  β”‚          β”‚
β”‚  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜          β”‚
β”‚       β”‚             β”‚             β”‚             β”‚               β”‚
β”‚       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜               β”‚
β”‚                          β”‚                                      β”‚
β”‚                     Merge Layer                                 β”‚
β”‚                     (OR operation)                              β”‚
β”‚                          β”‚                                      β”‚
β”‚                   Global Bloom Filter                           β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Algorithm:
1. Each server maintains local Bloom filter
2. On new URL: check local filter first (fast)
3. If negative locally, check global (slower)
4. Periodically merge local β†’ global
5. For positives: verify in actual URL store (handles false positives)

Benefits:
- Fast local lookups
- Distributed, scalable
- Mergeable filters
- Bounded false positive rate

πŸ“‹ Quick Reference

Data Structures

StructureOperationSpaceError
Bloom FilterMembershipO(n)FP: 1% typical
Count-Min SketchFrequencyO(1/Ξ΅ Γ— log(1/Ξ΄))Overestimate
HyperLogLogCardinalityO(1)~1% typical

Algorithm Types

TypeCorrectnessTime
Las VegasAlwaysRandom
Monte CarloProbablyDeterministic
ApproximationWithin factorDeterministic

Key Formulas

Bloom Filter:
- Optimal bits: m = -n Γ— ln(p) / (ln(2))Β²
- Optimal hashes: k = (m/n) Γ— ln(2)
- FP rate: (1 - e^(-kn/m))^k

HyperLogLog:
- Standard error: 1.04 / √M
- Memory: M registers Γ— 5 bits

Count-Min Sketch:
- Width w = e/Ξ΅ (for error Ξ΅)
- Depth d = ln(1/Ξ΄) (for confidence 1-Ξ΄)

πŸ”— What's Next?

In Part 22: Algorithm Design Patterns & Interview Prep, we'll explore:
  • Two pointers technique
  • Sliding window
  • Monotonic stack/queue
  • Pattern recognition guide
  • Interview strategies

πŸ“… Review Schedule

DayFocusTime
0Full read + Bloom Filter90 min
1Quick Reference10 min
3Implement HyperLogLog30 min
7Count-Min Sketch exercises25 min
14Debug exercise + approximation20 min
30Interview questions15 min