CLR cover

Suffix Array

The Suffix array of a word is also used to produce indexes but proceeds differently than with trees or automata. It consists primarily in sorting the non-empty suffixes of the word to allow binary search for its factors. To get actually efficient searches another feature is considered: the longest common prefixes of successive suffixes in the sorted list.

The information is stored in two arrays, $\SA$ and $\LCP$. The array $\SA$ is the inverse of the array $\Rank$ that gives the rank of each suffix attached at its starting position.

The table $\LCP$ essentially contains longest common prefixes stored as maximal lengths of common prefixes between successive suffixes: \[\LCP[r]=|\lcp(x[\SA[r-1]\dd |x|-1],x[\SA[r]\dd |x|-1])|,\] where $\lcp$ denotes the longest common prefix between two words. This gives $\LCP[0\dd 6]$ for the example. The next values in $\LCP[7\dd 12]$ correspond to the same information for suffixes starting at positions $d$ and $f$ when the pair $(d,f)$ appears in the binary search. Formally, for such a pair, the value is stored at position $|x|+1+\lfloor (d+f)/2 \rfloor$. For example, in the above $\LCP$ array the value $1$ corresponding to the pair $(0,2)$, maximal length of prefixes between $x[5\dd 5]$ and $x[3\dd 5]$, is stored at position~$8$.

The table $\Rank$ is used in applications of the Suffix array that are mainly other than searching.