String Matching & Hashing
Finding patterns in text fast — KMP and its LPS array, Z-algorithm, Rabin–Karp rolling hash, and where suffix structures take over.
The problem
Find a pattern p inside a text t, or detect repeats/periodicity. The naive scan is O(n·m)
(re-comparing after every mismatch); the algorithms here reach O(n + m) by never re-examining
text characters needlessly.
KMP (Knuth–Morris–Pratt)
Precompute, for each prefix of the pattern, the longest proper prefix that is also a suffix (the LPS / failure array). On a mismatch you jump the pattern back via LPS instead of rescanning the text:
def build_lps(p):
lps = [0] * len(p)
k = 0
for i in range(1, len(p)):
while k and p[i] != p[k]:
k = lps[k - 1] # fall back to the next-longest border
if p[i] == p[k]:
k += 1
lps[i] = k
return lps
def kmp(t, p): # all start indices of p in t
lps, k, hits = build_lps(p), 0, []
for i, c in enumerate(t):
while k and c != p[k]: k = lps[k - 1]
if c == p[k]: k += 1
if k == len(p): hits.append(i - k + 1); k = lps[k - 1]
return hits
The LPS array alone solves “Shortest Palindrome,” “Repeated Substring Pattern,” and “Longest Happy Prefix” — recognizing a borders/period problem is half the battle.
Z-algorithm
Computes, for each position, the length of the longest substring starting there that matches a
prefix of the string — in O(n). Concatenate p + sep + t and scan the Z-array for values equal
to len(p) to find all matches. An alternative to KMP that some find easier to reason about.
Rabin–Karp (rolling hash)
Treat a window as a number in some base mod a large prime; slide it and update the hash in O(1). Great for comparing many substrings or detecting duplicates:
def rolling_hashes(s, L, base=131, mod=(1 << 61) - 1):
h = 0
for c in s[:L]:
h = (h * base + ord(c)) % mod
power = pow(base, L - 1, mod)
yield h
for i in range(L, len(s)):
h = ((h - ord(s[i - L]) * power) * base + ord(s[i])) % mod # drop left, add right
yield h
Powers “Longest Duplicate Substring” (binary-search the length + rolling hash), “Repeated DNA Sequences,” and 2-D pattern search. Always verify a hash hit with a direct compare to dodge rare collisions; use double hashing (two mods) for adversarial inputs.
Suffix structures (the heavy artillery)
For many queries against one text — “how many distinct substrings,” “longest repeated substring,” multiple-pattern search — build a suffix array (+ LCP array) or a suffix automaton/tree. O(n log n) or O(n) to build, then queries are fast. Name these when one text is queried repeatedly; you rarely implement them in an interview.
Aho–Corasick (many patterns at once)
To match many patterns against a text simultaneously, build a trie of the patterns with KMP-style failure links — O(text + total pattern length + matches). The go-to for “find all dictionary words in this text.”
Choosing in an interview
- Single pattern, exact, O(n+m) → KMP (or Z).
- Borders / period / “repeated substring” → LPS array insight.
- Many substring comparisons / dedup / 2-D → rolling hash (verify hits; double-hash).
- Many patterns at once → Aho–Corasick.
- Many queries on one fixed text → suffix array / automaton.
For the everyday string toolkit (frequency, two pointers, palindromes), see the strings primer; reach here only when the task is literally pattern search.