Problem 46: Computing Longest Common Prefixes |
The Suffix array of a non-empty word $y$ is a light and efficient solution for text indexing. It consists in using a binary search procedure to locate patterns inside $y$. To do so the suffixes of $y$ are first sorted in lexicographic order, producing a table $\SA$ that lists the starting positions of the sorted suffixes.
But this standard technique is not sufficient to get a powerful search method. This is why the table $\SA$ is adjoined a second table $\LCP$ that gives the length of longest common prefixes between consecutive suffixes in the sorted list (some more values easy to deduce are also needed). Using both tables, searching $y$ for a word $x$ is then achieved in time $O(|x|+\log |y|)$ instead of a straightforward $O(|x|\log |y|)$ time without the table $\LCP$. Here is the Suffix array of $\sa{abaabababbabbb}$:
$j$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | |
$y[j]$ | $\sa{a}$ | $\sa{b}$ | $\sa{a}$ | $\sa{a}$ | $\sa{b}$ | $\sa{a}$ | $\sa{b}$ | $\sa{a}$ | $\sa{b}$ | $\sa{b}$ | $\sa{a}$ | $\sa{b}$ | $\sa{b}$ | $\sa{b}$ | |
Rank $r$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
$\SA[r]$ | 2 | 0 | 3 | 5 | 7 | 10 | 13 | 1 | 4 | 6 | 9 | 12 | 8 | 11 | |
$\LCP[r]$ | 0 | 1 | 3 | 4 | 2 | 3 | 0 | 1 | 2 | 3 | 4 | 1 | 2 | 2 | 0 |
\begin{algorithmic} \FOR{$r \leftarrow 0$ \TO $|y|-1$} \STATE $\mathrm{Rank}[\mathrm{SA}[r]] \leftarrow r$ \ENDFOR \STATE $\ell \leftarrow 0$ \FOR{$j \leftarrow 0$ \TO $|y|-1$} \STATE $\ell \leftarrow \max\{0,\ell-1\}$ \IF{$\mathrm{Rank}[j] \gt 0$} \WHILE{$\max\{j+\ell,\mathrm{SA}[\mathrm{Rank}[j]-1]+\ell\}\lt |y|$ \AND $y[j+\ell]=y[\mathrm{SA}[\mathrm{Rank}[j]-1]+\ell]$} \STATE $\ell \leftarrow \ell+1$ \ENDWHILE \ELSE \STATE $\ell \leftarrow 0$ \ENDIF \STATE $\mathrm{LCP}[\mathrm{Rank}[j]] \leftarrow \ell$ \ENDFOR \STATE $\mathrm{LCP}[|y|] \leftarrow 0$ \RETURN{$\mathrm{LCP}$} \end{algorithmic}
Note the solution is counterintuitive, since it looks natural to compute the values $\LCP[r]$ sequentially, that is, by processing suffixes in the increasing order of their ranks. But this does not readily produce a linear-time algorithm. Instead, Algorithm Lcp processes the suffixes from the longest to the shortest, which is its key feature and is more efficient.