CLR cover

Problem 46: Computing Longest Common Prefixes

\( \def\sa#1{\tt{#1}} \def\dd{\dot{}\dot{}} \def\SA{\mathrm{SA}} \def\LCP{\mathrm{LCP}} \)

The Suffix array of a non-empty word $y$ is a light and efficient solution for text indexing. It consists in using a binary search procedure to locate patterns inside $y$. To do so the suffixes of $y$ are first sorted in lexicographic order, producing a table $\SA$ that lists the starting positions of the sorted suffixes.

But this standard technique is not sufficient to get a powerful search method. This is why the table $\SA$ is adjoined a second table $\LCP$ that gives the length of longest common prefixes between consecutive suffixes in the sorted list (some more values easy to deduce are also needed). Using both tables, searching $y$ for a word $x$ is then achieved in time $O(|x|+\log |y|)$ instead of a straightforward $O(|x|\log |y|)$ time without the table $\LCP$. Here is the Suffix array of $\sa{abaabababbabbb}$:
$j$ 012345678910111213 
$y[j]$ $\sa{a}$$\sa{b}$$\sa{a}$$\sa{a}$$\sa{b}$$\sa{a}$$\sa{b}$$\sa{a}$$\sa{b}$$\sa{b}$ $\sa{a}$$\sa{b}$$\sa{b}$$\sa{b}$ 
Rank $r$ 01234567891011121314
$\SA[r]$ 203571013146912811 
$\LCP[r]$ 013423012341220
where $\LCP[r] = |\mbox{lcp}(y[\SA[r-1]\dd |y|-1],y[\SA[r]\dd |y|-1])|$.

Given the table $\SA$ for the word $y$, show that Algorithm Lcp computes the associated table $\LCP$ in linear time.

Lcp$(y \textrm{ non-empty word})$
    \begin{algorithmic}
  \FOR{$r \leftarrow 0$ \TO $|y|-1$}
    \STATE $\mathrm{Rank}[\mathrm{SA}[r]] \leftarrow r$
  \ENDFOR 
  \STATE $\ell \leftarrow 0$
  \FOR{$j \leftarrow 0$ \TO $|y|-1$}
    \STATE $\ell \leftarrow \max\{0,\ell-1\}$
    \IF{$\mathrm{Rank}[j] \gt 0$}
      \WHILE{$\max\{j+\ell,\mathrm{SA}[\mathrm{Rank}[j]-1]+\ell\}\lt |y|$ \AND
        $y[j+\ell]=y[\mathrm{SA}[\mathrm{Rank}[j]-1]+\ell]$}
        \STATE $\ell \leftarrow \ell+1$
      \ENDWHILE 
    \ELSE
      \STATE $\ell \leftarrow 0$
    \ENDIF 
    \STATE $\mathrm{LCP}[\mathrm{Rank}[j]] \leftarrow \ell$
  \ENDFOR 
  \STATE $\mathrm{LCP}[|y|] \leftarrow 0$
\RETURN{$\mathrm{LCP}$}
    \end{algorithmic}

Note the solution is counterintuitive, since it looks natural to compute the values $\LCP[r]$ sequentially, that is, by processing suffixes in the increasing order of their ranks. But this does not readily produce a linear-time algorithm. Instead, Algorithm Lcp processes the suffixes from the longest to the shortest, which is its key feature and is more efficient.

References

  • M. Crochemore, C. Hancart, and T. Lecroq. Algorithms on Strings. Cambridge University Press, 2007. 392 pages.
  • T. Kasai, G. Lee, H. Arimura, S. Arikawa, and K. Park. Linear-time longest-common-prefix computation in suffix arrays and its applications. In A. Amir and G. M. Landau, editors, Combinatorial Pattern Matching, 12th Annual Symposium, CPM 2001 Jerusalem, Israel, July 1-4, 2001 Proceedings, volume 2089 of Lecture Notes in Computer Science, pages 181-192. Springer, 2001.