Algorithms

String Searching Algorithms

String searching - finding a pattern within text - is one of the most common operations in computing. From text editors to log analysis, from DNA sequencing to spam filtering, efficient string matching is critical. The naive O(nm) approach becomes prohibitive for large texts. This article covers four major algorithms that achieve O(n+m) performance through clever preprocessing.

📋 At a Glance

AlgorithmPreprocessingSearchBest For
NaiveO(1)O(nm)Short patterns, small texts
KMPO(m)O(n)Single pattern, streaming
Rabin-KarpO(m)O(n) avgMultiple patterns, plagiarism detection
Boyer-MooreO(m + σ)O(n/m) bestLong patterns, large alphabets
Aho-CorasickO(Σm)O(n + z)Multiple patterns simultaneously

Where n = text length, m = pattern length, σ = alphabet size, z = number of matches


🎯 What You'll Learn

After reading this article, you will be able to:

  • Implement KMP with proper failure function construction
  • Apply Rabin-Karp for multiple pattern matching
  • Use Boyer-Moore for practical high-performance search
  • Build Aho-Corasick automaton for simultaneous multi-pattern search
  • Choose the right algorithm based on pattern count and size
  • Understand Java's String.indexOf() implementation

🔥 Production Story: The Log Analysis Nightmare

A security team needed to scan 500GB of server logs daily for 200 suspicious patterns (SQL injection attempts, XSS payloads, known attack signatures).

JAVA(20 lines)
Code
Loading syntax highlighter...
Analysis:
  • 500GB ≈ 5 billion characters
  • 200 patterns × average 20 chars = 4000 chars to check per position
  • Naive: O(n × m × p) where p = number of patterns
  • Total operations: 5B × 20 × 200 = 20 quadrillion operations!
The optimization:
JAVA(22 lines)
Code
Loading syntax highlighter...
Results:
  • Naive: 18 hours
  • Aho-Corasick: 25 minutes
  • 43x faster - reduced from overnight job to lunch break
The lesson: When searching for multiple patterns, building a combined automaton once pays off massively compared to searching each pattern separately.

🧠 Mental Model: String Search Algorithm Selection

TEXT(29 lines)
Code
Loading syntax highlighter...

The Algorithm

JAVA(43 lines)
Code
Loading syntax highlighter...

Why Naive Is Slow

TEXT(16 lines)
Code
Loading syntax highlighter...

🔬 Deep Dive: KMP (Knuth-Morris-Pratt)

The Key Insight

When a mismatch occurs, we've already matched some prefix of the pattern. KMP uses this information to skip ahead intelligently.

TEXT(18 lines)
Code
Loading syntax highlighter...

The Failure Function (LPS Array)

The failure function lps[i] = length of longest proper prefix of pattern[0..i] which is also a suffix.
JAVA(108 lines)
Code
Loading syntax highlighter...

LPS Array Example

TEXT(21 lines)
Code
Loading syntax highlighter...

🔬 Deep Dive: Rabin-Karp

The Hash Approach

Instead of comparing characters, compare hash values. Only do character comparison when hashes match.

TEXT(15 lines)
Code
Loading syntax highlighter...

Rolling Hash

JAVA(123 lines)
Code
Loading syntax highlighter...

When Rabin-Karp Excels

JAVA(47 lines)
Code
Loading syntax highlighter...

🔬 Deep Dive: Boyer-Moore

The Clever Insight

Boyer-Moore compares pattern from right to left, allowing bigger jumps on mismatch.

TEXT(23 lines)
Code
Loading syntax highlighter...

Implementation

JAVA(153 lines)
Code
Loading syntax highlighter...

Performance Comparison

JAVA(37 lines)
Code
Loading syntax highlighter...

🔬 Deep Dive: Aho-Corasick

Aho-Corasick builds a trie from all patterns, then adds "failure links" (like KMP's failure function) to enable efficient searching.

TEXT(20 lines)
Code
Loading syntax highlighter...

Implementation

JAVA(115 lines)
Code
Loading syntax highlighter...

Usage Example

JAVA(20 lines)
Code
Loading syntax highlighter...

🔬 Deep Dive: Java's String.indexOf()

The Implementation

Java's String.indexOf() uses a simple but effective approach:
JAVA(44 lines)
Code
Loading syntax highlighter...

When to Use Standard Library vs Custom

ScenarioRecommendation
Short pattern, short textString.indexOf()
Short pattern, long textString.indexOf() (JIT optimizes well)
Long pattern, long textBoyer-Moore
Multiple searches, same patternPrecompile KMP/BM
Multiple patternsAho-Corasick
Streaming textKMP (no backtracking)

⚠️ Common Mistakes

Mistake 1: Wrong LPS Computation

JAVA(17 lines)
Code
Loading syntax highlighter...

Mistake 2: Integer Overflow in Rabin-Karp

JAVA(10 lines)
Code
Loading syntax highlighter...

Mistake 3: Negative Hash in Rolling Window

JAVA(7 lines)
Code
Loading syntax highlighter...

Mistake 4: Boyer-Moore Bad Character Shift

JAVA(5 lines)
Code
Loading syntax highlighter...

These implementations have bugs. Find and fix them:

JAVA(47 lines)
Code
Loading syntax highlighter...
💡 Hint

Look for:

  1. LPS initialization and indexing
  2. Math.pow precision issues
  3. Loop bounds and modular arithmetic
  4. Hash collision handling
✅ Solution
JAVA(66 lines)
Code
Loading syntax highlighter...
Bugs explained:
  1. LPS function:
    • Starting i at 0 would try to compare pattern.charAt(0) with itself
    • lps[i] = len before incrementing would be off-by-one
    • lps[len] should be lps[len-1] (previous prefix)
  2. Rabin-Karp:
    • Math.pow returns double, loses precision for large exponents
    • Loop < text.length() - m misses last position
    • No verification allows false positives from hash collisions
    • No modular arithmetic causes overflow and negative remainders

💻 Exercises

Exercise 1: Count Pattern Occurrences ⭐

Implement a function to count all occurrences of a pattern in text.

JAVA(9 lines)
Code
Loading syntax highlighter...

Exercise 2: First Unique Substring ⭐⭐

Find the shortest substring of text that doesn't appear elsewhere in text.

JAVA(9 lines)
Code
Loading syntax highlighter...
💡 Hint

Use Rabin-Karp with rolling hash to check each length starting from 1.


Exercise 3: Wildcard Pattern Matching ⭐⭐⭐

Extend string search to support wildcards.

JAVA(15 lines)
Code
Loading syntax highlighter...

Exercise 4: Longest Repeated Substring ⭐⭐⭐

Find the longest substring that appears at least twice.

JAVA(12 lines)
Code
Loading syntax highlighter...

Exercise 5: DNA Sequence Alignment ⭐⭐⭐⭐

Find all approximate matches allowing k mismatches.

JAVA(9 lines)
Code
Loading syntax highlighter...
💡 Hint

Split pattern into k+1 segments. At least one segment must match exactly (pigeonhole principle). Use exact search for segments, then verify full pattern.


🎤 Interview Questions

Question 1: Compare KMP and Boyer-Moore

Answer:
AspectKMPBoyer-Moore
Comparison directionLeft to rightRight to left
Best caseO(n)O(n/m) - sublinear!
Worst caseO(n)O(nm)
PreprocessingO(m)O(m + σ)
SpaceO(m)O(m + σ)
StreamingYes (no backtrack)No (needs random access)
When to use KMP:
  • Streaming text (can't go back)
  • Guaranteed O(n) important
  • Small patterns
When to use Boyer-Moore:
  • Large alphabet (natural language)
  • Long patterns
  • Can afford preprocessing
In practice: Boyer-Moore is faster for most real-world cases because it skips characters.

Question 2: How would you search for multiple patterns efficiently?

Answer:

Three approaches:

  1. Run single-pattern search for each - O(n × p) where p = number of patterns. Simple but slow.
  2. Rabin-Karp with multiple hashes - O(n × max_pattern_length). Good when patterns have similar lengths.
  3. Aho-Corasick - O(n + z) where z = total matches. Best for many patterns.
JAVA(4 lines)
Code
Loading syntax highlighter...
Recommendation: For < 10 patterns, run individually. For 10+ patterns, build Aho-Corasick automaton.

Answer:

Java's decision makes sense for several reasons:

  1. Preprocessing overhead: KMP needs O(m) preprocessing. For short patterns (most common case), this exceeds the search time savings.
  2. JIT optimization: The tight loop in naive search is highly optimizable by the JIT compiler - can vectorize comparisons.
  3. Cache efficiency: Sequential memory access in naive search is cache-friendly.
  4. Average case: Real-world strings aren't adversarial. The O(nm) worst case rarely occurs.
  5. Code complexity: Simple code is easier to maintain and less bug-prone.
When you should use KMP/BM:
  • Searching same pattern multiple times
  • Known adversarial patterns (regex engines)
  • Very long texts (> 1MB)
  • Streaming data

Question 4: Explain rolling hash in Rabin-Karp

Answer:

Rolling hash computes the next window's hash from the current in O(1):

TEXT(10 lines)
Code
Loading syntax highlighter...
Key considerations:
  1. Modular arithmetic: Use prime modulus to prevent overflow
  2. Negative handling: (hash - x % MOD + MOD) % MOD
  3. Collision handling: Always verify actual characters when hashes match
  4. Base selection: Use 256 for ASCII, larger for Unicode

Question 5: How does Aho-Corasick differ from running KMP multiple times?

Answer:
KMP × n patterns:
  • Time: O(text_length × num_patterns)
  • Space: O(max_pattern_length)
  • Each pattern searched independently
Aho-Corasick:
  • Time: O(text_length + total_pattern_length + num_matches)
  • Space: O(total_pattern_length × alphabet_size)
  • All patterns searched simultaneously
How Aho-Corasick works:
  1. Build trie from all patterns
  2. Add "failure links" (like KMP's LPS) between nodes
  3. Single pass through text follows trie transitions
  4. Failure links handle partial matches
Example benefit: Searching for "he", "she", "her" in text:
  • When matching "she", we automatically also match "he"
  • Failure link from "she" points to "he" node
  • No redundant comparisons

📋 Quick Reference

Algorithm Selection

CriteriaBest Algorithm
Single pattern, generalKMP
Single pattern, large alphabetBoyer-Moore
Single pattern, streamingKMP
Multiple patternsAho-Corasick
Approximate matchingRabin-Karp variant
Short text, any patternNaive (String.indexOf)

Complexity Comparison

AlgorithmPreprocessingSearchSpace
NaiveO(1)O(nm)O(1)
KMPO(m)O(n)O(m)
Rabin-KarpO(m)O(n) avgO(1)
Boyer-MooreO(m + σ)O(n/m) to O(nm)O(m + σ)
Aho-CorasickO(Σm)O(n + z)O(Σm × σ)

Java Built-in Methods

JAVA(17 lines)
Code
Loading syntax highlighter...

📅 Review Schedule

DayFocusTime
Day 1Read article, understand KMP LPS construction60 min
Day 3Implement KMP from scratch30 min
Day 7Implement Rabin-Karp with rolling hash30 min
Day 14Build simple Aho-Corasick for 3 patterns45 min
Day 30Solve string search problems (LeetCode)30 min

🔗 What's Next?

Part 11: Graph Representation & Traversal covers:
  • Adjacency list vs matrix tradeoffs
  • Graph implementation in Java
  • DFS: recursive vs iterative
  • BFS: shortest path in unweighted graphs
  • Topological sort
  • Cycle detection
Key insight preview: Graphs are everywhere - social networks, maps, dependencies, state machines. Mastering graph traversal unlocks an enormous class of problems.

This article is part of the Algorithms Compendium series. For the complete learning path, see Part 0: How to Use This Series.