Skip to content
All reading material
Advanced·14 min read

String Matching & Hashing

Finding patterns in text fast — KMP and its LPS array, Z-algorithm, Rabin–Karp rolling hash, and where suffix structures take over.


The problem

Find a pattern p inside a text t, or detect repeats/periodicity. The naive scan is O(n·m) (re-comparing after every mismatch); the algorithms here reach O(n + m) by never re-examining text characters needlessly.

KMP (Knuth–Morris–Pratt)

Precompute, for each prefix of the pattern, the longest proper prefix that is also a suffix (the LPS / failure array). On a mismatch you jump the pattern back via LPS instead of rescanning the text:

def build_lps(p):
    lps = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k and p[i] != p[k]:
            k = lps[k - 1]          # fall back to the next-longest border
        if p[i] == p[k]:
            k += 1
        lps[i] = k
    return lps

def kmp(t, p):                       # all start indices of p in t
    lps, k, hits = build_lps(p), 0, []
    for i, c in enumerate(t):
        while k and c != p[k]: k = lps[k - 1]
        if c == p[k]: k += 1
        if k == len(p): hits.append(i - k + 1); k = lps[k - 1]
    return hits

The LPS array alone solves “Shortest Palindrome,” “Repeated Substring Pattern,” and “Longest Happy Prefix” — recognizing a borders/period problem is half the battle.

Z-algorithm

Computes, for each position, the length of the longest substring starting there that matches a prefix of the string — in O(n). Concatenate p + sep + t and scan the Z-array for values equal to len(p) to find all matches. An alternative to KMP that some find easier to reason about.

Rabin–Karp (rolling hash)

Treat a window as a number in some base mod a large prime; slide it and update the hash in O(1). Great for comparing many substrings or detecting duplicates:

def rolling_hashes(s, L, base=131, mod=(1 << 61) - 1):
    h = 0
    for c in s[:L]:
        h = (h * base + ord(c)) % mod
    power = pow(base, L - 1, mod)
    yield h
    for i in range(L, len(s)):
        h = ((h - ord(s[i - L]) * power) * base + ord(s[i])) % mod   # drop left, add right
        yield h

Powers “Longest Duplicate Substring” (binary-search the length + rolling hash), “Repeated DNA Sequences,” and 2-D pattern search. Always verify a hash hit with a direct compare to dodge rare collisions; use double hashing (two mods) for adversarial inputs.

Suffix structures (the heavy artillery)

For many queries against one text — “how many distinct substrings,” “longest repeated substring,” multiple-pattern search — build a suffix array (+ LCP array) or a suffix automaton/tree. O(n log n) or O(n) to build, then queries are fast. Name these when one text is queried repeatedly; you rarely implement them in an interview.

Aho–Corasick (many patterns at once)

To match many patterns against a text simultaneously, build a trie of the patterns with KMP-style failure links — O(text + total pattern length + matches). The go-to for “find all dictionary words in this text.”

Choosing in an interview

  • Single pattern, exact, O(n+m) → KMP (or Z).
  • Borders / period / “repeated substring” → LPS array insight.
  • Many substring comparisons / dedup / 2-D → rolling hash (verify hits; double-hash).
  • Many patterns at once → Aho–Corasick.
  • Many queries on one fixed text → suffix array / automaton.

For the everyday string toolkit (frequency, two pointers, palindromes), see the strings primer; reach here only when the task is literally pattern search.