CLR cover

Problem 36: String Matching with Don't Cares

Words in the problem are drawn from the alphabet of positive integers with an extra letter $*$. Letter $*$, called a don't care (or joker), stands for any other letter of the alphabet and matches any letter including itself.

Matching strings with don't cares consists in searching a text $y$ for all occurrences of a pattern $x$, assuming the two words contain don't care symbols. Let $m=|x|$ and $n=|y|$.

Contrary to several other string-matching algorithms, solutions to the present problem often use arithmetic operations for convolution: given sequences $B$ and $C$ of length at most $n$ compute the sequence $A$ of length $2n$ defined, for $0\leq i\leq 2n-1$, by $$A[i] = \sum_{j=0}^{n-1}B[j]\cdot C[i+j].$$

It is assumed that each elementary arithmetic operation executes in constant time and that convolution of sequences of length $n$ can be done in $O(n\log n)$ time.

Show how string matching with don't cares can be reduced in linear time to the convolution problem.

References

  • P. Clifford and R. Clifford. Simple deterministic wildcard matching. Inf. Process. Lett., 101(2):53-54, 2007.
  • W. J. Masek and M. Paterson. A faster algorithm computing string edit distances. J. Comput. Syst. Sci., 20(1):18-31, 1980.