String Searching Algorithms

String searching - finding a pattern within text - is one of the most common operations in computing. From text editors to log analysis, from DNA sequencing to spam filtering, efficient string matching is critical. The naive O(nm) approach becomes prohibitive for large texts. This article covers four major algorithms that achieve O(n+m) performance through clever preprocessing.

📋 At a Glance

Algorithm	Preprocessing	Search	Best For
Naive	O(1)	O(nm)	Short patterns, small texts
KMP	O(m)	O(n)	Single pattern, streaming
Rabin-Karp	O(m)	O(n) avg	Multiple patterns, plagiarism detection
Boyer-Moore	O(m + σ)	O(n/m) best	Long patterns, large alphabets
Aho-Corasick	O(Σm)	O(n + z)	Multiple patterns simultaneously

Where n = text length, m = pattern length, σ = alphabet size, z = number of matches

🎯 What You'll Learn

After reading this article, you will be able to:

Implement KMP with proper failure function construction
Apply Rabin-Karp for multiple pattern matching
Use Boyer-Moore for practical high-performance search
Build Aho-Corasick automaton for simultaneous multi-pattern search
Choose the right algorithm based on pattern count and size
Understand Java's String.indexOf() implementation

🔥 Production Story: The Log Analysis Nightmare

A security team needed to scan 500GB of server logs daily for 200 suspicious patterns (SQL injection attempts, XSS payloads, known attack signatures).

JAVA(20 lines)
Code
Loading syntax highlighter...

Analysis:

500GB ≈ 5 billion characters
200 patterns × average 20 chars = 4000 chars to check per position
Naive: O(n × m × p) where p = number of patterns
Total operations: 5B × 20 × 200 = 20 quadrillion operations!

The optimization:

JAVA(22 lines)
Code
Loading syntax highlighter...

Results:

Naive: 18 hours
Aho-Corasick: 25 minutes
43x faster - reduced from overnight job to lunch break

The lesson: When searching for multiple patterns, building a combined automaton once pays off massively compared to searching each pattern separately.

🧠 Mental Model: String Search Algorithm Selection

TEXT(29 lines)
Code
Loading syntax highlighter...

🔬 Deep Dive: Naive String Search

The Algorithm

JAVA(43 lines)
Code
Loading syntax highlighter...

Why Naive Is Slow

TEXT(16 lines)
Code
Loading syntax highlighter...

🔬 Deep Dive: KMP (Knuth-Morris-Pratt)

The Key Insight

When a mismatch occurs, we've already matched some prefix of the pattern. KMP uses this information to skip ahead intelligently.

TEXT(18 lines)
Code
Loading syntax highlighter...

The Failure Function (LPS Array)

The failure function lps[i] = length of longest proper prefix of pattern[0..i] which is also a suffix.

JAVA(108 lines)
Code
Loading syntax highlighter...

LPS Array Example

TEXT(21 lines)
Code
Loading syntax highlighter...

🔬 Deep Dive: Rabin-Karp

The Hash Approach

Instead of comparing characters, compare hash values. Only do character comparison when hashes match.

TEXT(15 lines)
Code
Loading syntax highlighter...

Rolling Hash

JAVA(123 lines)
Code
Loading syntax highlighter...

When Rabin-Karp Excels

JAVA(47 lines)
Code
Loading syntax highlighter...

🔬 Deep Dive: Boyer-Moore

The Clever Insight

Boyer-Moore compares pattern from right to left, allowing bigger jumps on mismatch.

TEXT(23 lines)
Code
Loading syntax highlighter...

Implementation

JAVA(153 lines)
Code
Loading syntax highlighter...

Performance Comparison

JAVA(37 lines)
Code
Loading syntax highlighter...

🔬 Deep Dive: Aho-Corasick

The Trie + Failure Links

Aho-Corasick builds a trie from all patterns, then adds "failure links" (like KMP's failure function) to enable efficient searching.

TEXT(20 lines)
Code
Loading syntax highlighter...

Implementation

JAVA(115 lines)
Code
Loading syntax highlighter...

Usage Example

JAVA(20 lines)
Code
Loading syntax highlighter...

🔬 Deep Dive: Java's String.indexOf()

The Implementation

Java's String.indexOf() uses a simple but effective approach:

JAVA(44 lines)
Code
Loading syntax highlighter...

When to Use Standard Library vs Custom

Scenario	Recommendation
Short pattern, short text	`String.indexOf()`
Short pattern, long text	`String.indexOf()` (JIT optimizes well)
Long pattern, long text	Boyer-Moore
Multiple searches, same pattern	Precompile KMP/BM
Multiple patterns	Aho-Corasick
Streaming text	KMP (no backtracking)

⚠️ Common Mistakes

Mistake 1: Wrong LPS Computation

JAVA(17 lines)
Code
Loading syntax highlighter...

Mistake 2: Integer Overflow in Rabin-Karp

JAVA(10 lines)
Code
Loading syntax highlighter...

Mistake 3: Negative Hash in Rolling Window

JAVA(7 lines)
Code
Loading syntax highlighter...

Mistake 4: Boyer-Moore Bad Character Shift

JAVA(5 lines)
Code
Loading syntax highlighter...

🐛 Debug This: The Broken String Search

These implementations have bugs. Find and fix them:

JAVA(47 lines)
Code
Loading syntax highlighter...

💡 Hint

Look for:

LPS initialization and indexing
Math.pow precision issues
Loop bounds and modular arithmetic
Hash collision handling

✅ Solution

JAVA(66 lines)
Code
Loading syntax highlighter...

Bugs explained:

LPS function:
- Starting i at 0 would try to compare pattern.charAt(0) with itself
- lps[i] = len before incrementing would be off-by-one
- lps[len] should be lps[len-1] (previous prefix)
Rabin-Karp:
- Math.pow returns double, loses precision for large exponents
- Loop < text.length() - m misses last position
- No verification allows false positives from hash collisions
- No modular arithmetic causes overflow and negative remainders

💻 Exercises

Exercise 1: Count Pattern Occurrences ⭐

Implement a function to count all occurrences of a pattern in text.

JAVA(9 lines)
Code
Loading syntax highlighter...

Exercise 2: First Unique Substring ⭐⭐

Find the shortest substring of text that doesn't appear elsewhere in text.

JAVA(9 lines)
Code
Loading syntax highlighter...

💡 Hint

Use Rabin-Karp with rolling hash to check each length starting from 1.

Exercise 3: Wildcard Pattern Matching ⭐⭐⭐

Extend string search to support wildcards.

JAVA(15 lines)
Code
Loading syntax highlighter...

Exercise 4: Longest Repeated Substring ⭐⭐⭐

Find the longest substring that appears at least twice.

JAVA(12 lines)
Code
Loading syntax highlighter...

Exercise 5: DNA Sequence Alignment ⭐⭐⭐⭐

Find all approximate matches allowing k mismatches.

JAVA(9 lines)
Code
Loading syntax highlighter...

💡 Hint

Split pattern into k+1 segments. At least one segment must match exactly (pigeonhole principle). Use exact search for segments, then verify full pattern.

🎤 Interview Questions

Question 1: Compare KMP and Boyer-Moore

Answer:

Aspect	KMP	Boyer-Moore
Comparison direction	Left to right	Right to left
Best case	O(n)	O(n/m) - sublinear!
Worst case	O(n)	O(nm)
Preprocessing	O(m)	O(m + σ)
Space	O(m)	O(m + σ)
Streaming	Yes (no backtrack)	No (needs random access)

When to use KMP:

Streaming text (can't go back)
Guaranteed O(n) important
Small patterns

When to use Boyer-Moore:

Large alphabet (natural language)
Long patterns
Can afford preprocessing

In practice: Boyer-Moore is faster for most real-world cases because it skips characters.

Question 2: How would you search for multiple patterns efficiently?

Answer:

Three approaches:

Run single-pattern search for each - O(n × p) where p = number of patterns. Simple but slow.
Rabin-Karp with multiple hashes - O(n × max_pattern_length). Good when patterns have similar lengths.
Aho-Corasick - O(n + z) where z = total matches. Best for many patterns.

JAVA(4 lines)
Code
Loading syntax highlighter...

Recommendation: For < 10 patterns, run individually. For 10+ patterns, build Aho-Corasick automaton.

Question 3: Why does String.indexOf() use naive search?

Answer:

Java's decision makes sense for several reasons:

Preprocessing overhead: KMP needs O(m) preprocessing. For short patterns (most common case), this exceeds the search time savings.
JIT optimization: The tight loop in naive search is highly optimizable by the JIT compiler - can vectorize comparisons.
Cache efficiency: Sequential memory access in naive search is cache-friendly.
Average case: Real-world strings aren't adversarial. The O(nm) worst case rarely occurs.
Code complexity: Simple code is easier to maintain and less bug-prone.

When you should use KMP/BM:

Searching same pattern multiple times
Known adversarial patterns (regex engines)
Very long texts (> 1MB)
Streaming data

Question 4: Explain rolling hash in Rabin-Karp

Answer:

Rolling hash computes the next window's hash from the current in O(1):

TEXT(10 lines)
Code
Loading syntax highlighter...

Key considerations:

Modular arithmetic: Use prime modulus to prevent overflow
Negative handling: (hash - x % MOD + MOD) % MOD
Collision handling: Always verify actual characters when hashes match
Base selection: Use 256 for ASCII, larger for Unicode

Question 5: How does Aho-Corasick differ from running KMP multiple times?

Answer:

KMP × n patterns:

Time: O(text_length × num_patterns)
Space: O(max_pattern_length)
Each pattern searched independently

Aho-Corasick:

Time: O(text_length + total_pattern_length + num_matches)
Space: O(total_pattern_length × alphabet_size)
All patterns searched simultaneously

How Aho-Corasick works:

Build trie from all patterns
Add "failure links" (like KMP's LPS) between nodes
Single pass through text follows trie transitions
Failure links handle partial matches

Example benefit: Searching for "he", "she", "her" in text:

When matching "she", we automatically also match "he"
Failure link from "she" points to "he" node
No redundant comparisons

📋 Quick Reference

Algorithm Selection

Criteria	Best Algorithm
Single pattern, general	KMP
Single pattern, large alphabet	Boyer-Moore
Single pattern, streaming	KMP
Multiple patterns	Aho-Corasick
Approximate matching	Rabin-Karp variant
Short text, any pattern	Naive (String.indexOf)

Complexity Comparison

Algorithm	Preprocessing	Search	Space
Naive	O(1)	O(nm)	O(1)
KMP	O(m)	O(n)	O(m)
Rabin-Karp	O(m)	O(n) avg	O(1)
Boyer-Moore	O(m + σ)	O(n/m) to O(nm)	O(m + σ)
Aho-Corasick	O(Σm)	O(n + z)	O(Σm × σ)

Java Built-in Methods

JAVA(17 lines)
Code
Loading syntax highlighter...

📅 Review Schedule

Day	Focus	Time
Day 1	Read article, understand KMP LPS construction	60 min
Day 3	Implement KMP from scratch	30 min
Day 7	Implement Rabin-Karp with rolling hash	30 min
Day 14	Build simple Aho-Corasick for 3 patterns	45 min
Day 30	Solve string search problems (LeetCode)	30 min

🔗 What's Next?

Part 11: Graph Representation & Traversal covers:

Adjacency list vs matrix tradeoffs
Graph implementation in Java
DFS: recursive vs iterative
BFS: shortest path in unweighted graphs
Topological sort
Cycle detection

Key insight preview: Graphs are everywhere - social networks, maps, dependencies, state machines. Mastering graph traversal unlocks an enormous class of problems.

This article is part of the Algorithms Compendium series. For the complete learning path, see Part 0: How to Use This Series.

📋 At a Glance

🎯 What You'll Learn

🔥 Production Story: The Log Analysis Nightmare

🧠 Mental Model: String Search Algorithm Selection

🔬 Deep Dive: Naive String Search

The Algorithm

Why Naive Is Slow

🔬 Deep Dive: KMP (Knuth-Morris-Pratt)

The Key Insight

The Failure Function (LPS Array)

LPS Array Example

🔬 Deep Dive: Rabin-Karp

The Hash Approach

Rolling Hash

When Rabin-Karp Excels

🔬 Deep Dive: Boyer-Moore

The Clever Insight

Implementation

Performance Comparison

🔬 Deep Dive: Aho-Corasick

The Trie + Failure Links

Implementation

Usage Example

🔬 Deep Dive: Java's String.indexOf()

The Implementation

When to Use Standard Library vs Custom

⚠️ Common Mistakes

Mistake 1: Wrong LPS Computation

Mistake 2: Integer Overflow in Rabin-Karp

Mistake 3: Negative Hash in Rolling Window

Mistake 4: Boyer-Moore Bad Character Shift

🐛 Debug This: The Broken String Search

💻 Exercises

Exercise 1: Count Pattern Occurrences ⭐

Exercise 2: First Unique Substring ⭐⭐

Exercise 3: Wildcard Pattern Matching ⭐⭐⭐

Exercise 4: Longest Repeated Substring ⭐⭐⭐

Exercise 5: DNA Sequence Alignment ⭐⭐⭐⭐

🎤 Interview Questions

Question 1: Compare KMP and Boyer-Moore

Question 2: How would you search for multiple patterns efficiently?

Question 3: Why does String.indexOf() use naive search?

Question 4: Explain rolling hash in Rabin-Karp

Question 5: How does Aho-Corasick differ from running KMP multiple times?

📋 Quick Reference

Algorithm Selection

Complexity Comparison

Java Built-in Methods

📅 Review Schedule

🔗 What's Next?

Tags: