CLR cover

Problem 34: Worst Case of the Boyer-Moore Algorithm

\( \def\lcs{\mathit{lcs}} \def\per{\mathit{per}} \def\tbsuff{\mathit{good\text{-}suff}} \def\dd{\dot{}\dot{}} \)

a Boyer-Moore string matching is based on a technique that leads to the fastest searching algorithms for fixed patterns. Its main feature is to scan the pattern backward when aligned with a factor of the searched text. A typical pattern preprocessing is shown in Problem 33.

When locating a fixed pattern $x$ of length $m$ in a text $y$ of length $n$, in the generic situation $x$ is aligned with a factor (the window) of $y$ ending at position $j$ (see picture). The algorithm computes the longest common suffix ($\lcs$) between $x$ and the factor of $y$; after possibly reporting an occurrence, it slides the window towards the end of $y$ based on the preprocessing and on information collected during the scan, without missing an occurrence of $x$. Algorithm BM implements the method with table $\tbsuff$ of Problem 33.

Worst Case of the Boyer-Moore Algorithm

BM$(x,y \mbox{ non-empty words}, m,n \mbox{ their lengths})$
    \begin{algorithmic}
  \STATE $j\leftarrow m-1$
  \WHILE{$j \lt n$}
    \STATE $i\leftarrow m-1-|lcs(x,y[j-m+1..j])|$
    \IF{$i \lt 0$}
      \STATE report an occurrence of $x$ at position $j-m+1$ on $y$
      \STATE $j\leftarrow j+per(x)$
      \COMMENT{$per(x)=good\text{-}suff[0]$}
    \ELSE
      \STATE $j\leftarrow j+good\text{-}suff[i]$
    \ENDIF
  \ENDWHILE
    \end{algorithmic}

After position $j$ on $y$ is treated, if an occurrence of $x$ is found the algorithm slides naturally the window at distance $\per(x)$. If no occurrence is found, the distance $\tbsuff[i]$ depends on the factor $bu$ of $y$ (it depends on $au$ of $x$ in Problem 33). Value $\per(x)$ and array $\tbsuff$ are preprocessed before the search.

Give examples of a non-periodic pattern and of a text $y$ for which Algorithm BM performs close to $3|y|$ letter comparisons at line 3 for computing the longest common suffix.

References

  • R. S. Boyer and J. S. Moore. A fast string searching algorithm. Commun. ACM., 20(10):762-772, 1977.
  • R. Cole. Tight bounds on the complexity of the Boyer-Moore string matching algorithm. SIAM J. Comput., 23(5):1075-1091, 1994.
  • M. Crochemore, C. Hancart, and T. Lecroq. Algorithms on Strings. Cambridge University Press, 2007. 392 pages.
  • M. Crochemore and W. Rytter. Text algorithms. Oxford University Press, 1994. 412 pages.
  • M. Crochemore and W. Rytter. Jewels of Stringology. World Scientific Publishing, Hong-Kong, 2002. 310 pages.
  • D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, 1997.
  • D. E. Knuth, J. H. Morris Jr., and V. R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323-350, 1977.
  • B. Smyth. Computing Patterns in Strings. Pearson Education Limited, Harlow, England, 2003. 423 pages.