Below is the string-matching automaton of $\sa{abaab}$ of length $5$ on the alphabet $\{\sa{a},\sa{b}\}$. Labelled arcs represent non-null values in the transition table. There are five forward arcs (on the main line) and four backward arcs. All other undrawn arcs have $0$ as target state. Were the alphabet to contain a third letter, all arcs labelled with it would also have $0$ as the target state.
Problem 27: Sparse Matching Automaton |
The most standard method to locate given patterns in a text processed sequentially is to use a pattern-matching automaton. Border tables used in Algorithm KMP may be viewed as implementations of such an automaton. The aim of the problem is to show another implementation technique that eventually improves searches. Searches using this implementation run at least as fast as Algorithm KMP, then run in linear time executing no more letter comparisons than twice the text length.
The pattern-matching or string-matching automaton $\aSMA(x)$ of a word $x$ drawn from the alphabet $A$ is the minimal deterministic automaton accepting words ending with $x$. It accepts the language $A^*x$ and has $|x|+1$ states, $0, 1, \dots, |x|$, the initial state $0$ and its only accepting state $|x|$. Its transition table $\delta$ is defined, for a state $i$ and a letter $a$, by: $$\delta(i,a) = \max\{s+1 \mid -1\leq s\leq i \mbox{ and } x[0\dd s] \mbox{ suffix of } x[0\dd i-1]\cdot a\}.$$ Observe that the size of the table is $\Omega(|x|^2)$ when the alphabet of $x$ is as large as its length. But the table is very sparse, since most of its values are null.