String Searching Algorithms
String searching - finding a pattern within text - is one of the most common operations in computing. From text editors to log analysis, from DNA sequencing to spam filtering, efficient string matching is critical. The naive O(nm) approach becomes prohibitive for large texts. This article covers four major algorithms that achieve O(n+m) performance through clever preprocessing.
📋 At a Glance
| Algorithm | Preprocessing | Search | Best For |
|---|---|---|---|
| Naive | O(1) | O(nm) | Short patterns, small texts |
| KMP | O(m) | O(n) | Single pattern, streaming |
| Rabin-Karp | O(m) | O(n) avg | Multiple patterns, plagiarism detection |
| Boyer-Moore | O(m + σ) | O(n/m) best | Long patterns, large alphabets |
| Aho-Corasick | O(Σm) | O(n + z) | Multiple patterns simultaneously |
Where n = text length, m = pattern length, σ = alphabet size, z = number of matches
🎯 What You'll Learn
After reading this article, you will be able to:
- Implement KMP with proper failure function construction
- Apply Rabin-Karp for multiple pattern matching
- Use Boyer-Moore for practical high-performance search
- Build Aho-Corasick automaton for simultaneous multi-pattern search
- Choose the right algorithm based on pattern count and size
- Understand Java's String.indexOf() implementation
🔥 Production Story: The Log Analysis Nightmare
A security team needed to scan 500GB of server logs daily for 200 suspicious patterns (SQL injection attempts, XSS payloads, known attack signatures).
JAVA(20 lines)CodeLoading syntax highlighter...
- 500GB ≈ 5 billion characters
- 200 patterns × average 20 chars = 4000 chars to check per position
- Naive: O(n × m × p) where p = number of patterns
- Total operations: 5B × 20 × 200 = 20 quadrillion operations!
JAVA(22 lines)CodeLoading syntax highlighter...
- Naive: 18 hours
- Aho-Corasick: 25 minutes
- 43x faster - reduced from overnight job to lunch break
🧠 Mental Model: String Search Algorithm Selection
TEXT(29 lines)CodeLoading syntax highlighter...
🔬 Deep Dive: Naive String Search
The Algorithm
JAVA(43 lines)CodeLoading syntax highlighter...
Why Naive Is Slow
TEXT(16 lines)CodeLoading syntax highlighter...
🔬 Deep Dive: KMP (Knuth-Morris-Pratt)
The Key Insight
When a mismatch occurs, we've already matched some prefix of the pattern. KMP uses this information to skip ahead intelligently.
TEXT(18 lines)CodeLoading syntax highlighter...
The Failure Function (LPS Array)
lps[i] = length of longest proper prefix of pattern[0..i] which is also a suffix.JAVA(108 lines)CodeLoading syntax highlighter...
LPS Array Example
TEXT(21 lines)CodeLoading syntax highlighter...
🔬 Deep Dive: Rabin-Karp
The Hash Approach
Instead of comparing characters, compare hash values. Only do character comparison when hashes match.
TEXT(15 lines)CodeLoading syntax highlighter...
Rolling Hash
JAVA(123 lines)CodeLoading syntax highlighter...
When Rabin-Karp Excels
JAVA(47 lines)CodeLoading syntax highlighter...
🔬 Deep Dive: Boyer-Moore
The Clever Insight
Boyer-Moore compares pattern from right to left, allowing bigger jumps on mismatch.
TEXT(23 lines)CodeLoading syntax highlighter...
Implementation
JAVA(153 lines)CodeLoading syntax highlighter...
Performance Comparison
JAVA(37 lines)CodeLoading syntax highlighter...
🔬 Deep Dive: Aho-Corasick
The Trie + Failure Links
Aho-Corasick builds a trie from all patterns, then adds "failure links" (like KMP's failure function) to enable efficient searching.
TEXT(20 lines)CodeLoading syntax highlighter...
Implementation
JAVA(115 lines)CodeLoading syntax highlighter...
Usage Example
JAVA(20 lines)CodeLoading syntax highlighter...
🔬 Deep Dive: Java's String.indexOf()
The Implementation
String.indexOf() uses a simple but effective approach:JAVA(44 lines)CodeLoading syntax highlighter...
When to Use Standard Library vs Custom
| Scenario | Recommendation |
|---|---|
| Short pattern, short text | String.indexOf() |
| Short pattern, long text | String.indexOf() (JIT optimizes well) |
| Long pattern, long text | Boyer-Moore |
| Multiple searches, same pattern | Precompile KMP/BM |
| Multiple patterns | Aho-Corasick |
| Streaming text | KMP (no backtracking) |
⚠️ Common Mistakes
Mistake 1: Wrong LPS Computation
JAVA(17 lines)CodeLoading syntax highlighter...
Mistake 2: Integer Overflow in Rabin-Karp
JAVA(10 lines)CodeLoading syntax highlighter...
Mistake 3: Negative Hash in Rolling Window
JAVA(7 lines)CodeLoading syntax highlighter...
Mistake 4: Boyer-Moore Bad Character Shift
JAVA(5 lines)CodeLoading syntax highlighter...
🐛 Debug This: The Broken String Search
These implementations have bugs. Find and fix them:
JAVA(47 lines)CodeLoading syntax highlighter...
💡 Hint
Look for:
- LPS initialization and indexing
- Math.pow precision issues
- Loop bounds and modular arithmetic
- Hash collision handling
✅ Solution
JAVA(66 lines)CodeLoading syntax highlighter...
-
LPS function:
- Starting i at 0 would try to compare
pattern.charAt(0)with itself lps[i] = lenbefore incrementing would be off-by-onelps[len]should belps[len-1](previous prefix)
- Starting i at 0 would try to compare
-
Rabin-Karp:
Math.powreturns double, loses precision for large exponents- Loop
< text.length() - mmisses last position - No verification allows false positives from hash collisions
- No modular arithmetic causes overflow and negative remainders
💻 Exercises
Exercise 1: Count Pattern Occurrences ⭐
Implement a function to count all occurrences of a pattern in text.
JAVA(9 lines)CodeLoading syntax highlighter...
Exercise 2: First Unique Substring ⭐⭐
Find the shortest substring of text that doesn't appear elsewhere in text.
JAVA(9 lines)CodeLoading syntax highlighter...
💡 Hint
Use Rabin-Karp with rolling hash to check each length starting from 1.
Exercise 3: Wildcard Pattern Matching ⭐⭐⭐
Extend string search to support wildcards.
JAVA(15 lines)CodeLoading syntax highlighter...
Exercise 4: Longest Repeated Substring ⭐⭐⭐
Find the longest substring that appears at least twice.
JAVA(12 lines)CodeLoading syntax highlighter...
Exercise 5: DNA Sequence Alignment ⭐⭐⭐⭐
Find all approximate matches allowing k mismatches.
JAVA(9 lines)CodeLoading syntax highlighter...
💡 Hint
Split pattern into k+1 segments. At least one segment must match exactly (pigeonhole principle). Use exact search for segments, then verify full pattern.
🎤 Interview Questions
Question 1: Compare KMP and Boyer-Moore
| Aspect | KMP | Boyer-Moore |
|---|---|---|
| Comparison direction | Left to right | Right to left |
| Best case | O(n) | O(n/m) - sublinear! |
| Worst case | O(n) | O(nm) |
| Preprocessing | O(m) | O(m + σ) |
| Space | O(m) | O(m + σ) |
| Streaming | Yes (no backtrack) | No (needs random access) |
- Streaming text (can't go back)
- Guaranteed O(n) important
- Small patterns
- Large alphabet (natural language)
- Long patterns
- Can afford preprocessing
Question 2: How would you search for multiple patterns efficiently?
Three approaches:
-
Run single-pattern search for each - O(n × p) where p = number of patterns. Simple but slow.
-
Rabin-Karp with multiple hashes - O(n × max_pattern_length). Good when patterns have similar lengths.
-
Aho-Corasick - O(n + z) where z = total matches. Best for many patterns.
JAVA(4 lines)CodeLoading syntax highlighter...
Question 3: Why does String.indexOf() use naive search?
Java's decision makes sense for several reasons:
-
Preprocessing overhead: KMP needs O(m) preprocessing. For short patterns (most common case), this exceeds the search time savings.
-
JIT optimization: The tight loop in naive search is highly optimizable by the JIT compiler - can vectorize comparisons.
-
Cache efficiency: Sequential memory access in naive search is cache-friendly.
-
Average case: Real-world strings aren't adversarial. The O(nm) worst case rarely occurs.
-
Code complexity: Simple code is easier to maintain and less bug-prone.
- Searching same pattern multiple times
- Known adversarial patterns (regex engines)
- Very long texts (> 1MB)
- Streaming data
Question 4: Explain rolling hash in Rabin-Karp
Rolling hash computes the next window's hash from the current in O(1):
TEXT(10 lines)CodeLoading syntax highlighter...
- Modular arithmetic: Use prime modulus to prevent overflow
- Negative handling:
(hash - x % MOD + MOD) % MOD - Collision handling: Always verify actual characters when hashes match
- Base selection: Use 256 for ASCII, larger for Unicode
Question 5: How does Aho-Corasick differ from running KMP multiple times?
- Time: O(text_length × num_patterns)
- Space: O(max_pattern_length)
- Each pattern searched independently
- Time: O(text_length + total_pattern_length + num_matches)
- Space: O(total_pattern_length × alphabet_size)
- All patterns searched simultaneously
- Build trie from all patterns
- Add "failure links" (like KMP's LPS) between nodes
- Single pass through text follows trie transitions
- Failure links handle partial matches
- When matching "she", we automatically also match "he"
- Failure link from "she" points to "he" node
- No redundant comparisons
📋 Quick Reference
Algorithm Selection
| Criteria | Best Algorithm |
|---|---|
| Single pattern, general | KMP |
| Single pattern, large alphabet | Boyer-Moore |
| Single pattern, streaming | KMP |
| Multiple patterns | Aho-Corasick |
| Approximate matching | Rabin-Karp variant |
| Short text, any pattern | Naive (String.indexOf) |
Complexity Comparison
| Algorithm | Preprocessing | Search | Space |
|---|---|---|---|
| Naive | O(1) | O(nm) | O(1) |
| KMP | O(m) | O(n) | O(m) |
| Rabin-Karp | O(m) | O(n) avg | O(1) |
| Boyer-Moore | O(m + σ) | O(n/m) to O(nm) | O(m + σ) |
| Aho-Corasick | O(Σm) | O(n + z) | O(Σm × σ) |
Java Built-in Methods
JAVA(17 lines)CodeLoading syntax highlighter...
📅 Review Schedule
| Day | Focus | Time |
|---|---|---|
| Day 1 | Read article, understand KMP LPS construction | 60 min |
| Day 3 | Implement KMP from scratch | 30 min |
| Day 7 | Implement Rabin-Karp with rolling hash | 30 min |
| Day 14 | Build simple Aho-Corasick for 3 patterns | 45 min |
| Day 30 | Solve string search problems (LeetCode) | 30 min |
🔗 What's Next?
- Adjacency list vs matrix tradeoffs
- Graph implementation in Java
- DFS: recursive vs iterative
- BFS: shortest path in unweighted graphs
- Topological sort
- Cycle detection