CLR cover

Additional Problem 9: Compressed Pattern Matching in Thue-Morse Words (Preliminary version)

\( \def\EVEN{\mathit{EVEN}} \)

The Thue-Morse binary word is produced by iterating the Thue-Morse morphism $\mu$ from $\{\sa{0}, \sa{1}\}^*$ to itself defined by $$\mu(\sa{0}) = \sa{01}, \ \mu(\sa{1}) = \sa{10}. $$ Eventually it produces infinite Thue-Morse word: $$\mathbf{t}\,=\, \sa{01101001100101101001011001101001}\cdots. $$

For a pattern $x$ of even length define $\bar{\mu}(x)=z$, if $\mu(z)=x$. If there is no such word $z$ then define $\bar{\mu}(x)=nil$. We introduce also the set $\EVEN=\{0110, 1010, 0101, 1001\}$.

Denote by $4first(x)$ the prefix of $x$ of size 4, if there is any. $first(x),\, last(x)$ is respectively the first/last letter of $x$, and $\mbox{neg}(s)$ is the negation of a bit $s$.

The following algorithm tests in linear time and in a very simple way if a finite pattern $x$ is a subword of $\mathbf{t}$.

Test$(x \textrm{ non-empty word})$
\IF{$|x|\lt 4$}
  \RETURN{$x\ne 111$ and $x\neq 000$}
\IF{$4first(x)\in EVEN$}
  \IF{$|x|$ is odd}
    \STATE{$x\leftarrow x\cdot \mbox{neg}(last(x))$}
  \STATE{$x\leftarrow \mbox{neg}(first(x))\cdot x$}
  \IF{$|x|$ is even}
    \STATE{$x\leftarrow x\cdot \mbox{neg}(last(x))$}

Why this algorithm correctly tests in linear time if $x$ is a subword of the infinite Thue-Morse word $\mathbf{t}$?