CLR cover

Problem 106: Compressing Suffix Arrays

\( \def\sa#1{\tt{#1}} \def\dd{\dot{}\dot{}} \)

Suffix arrays constitute a simple and space-economical data structure for indexing texts. In addition, there are many compressed versions of suffix arrays. The problem discusses one of them for compressing the array that stores the sorted (partial) list of suffixes (more precisely the partial rank array) of the concerned text. This shows an application of simple number theory.

Number-theoretic tools

A set $D\subseteq [0\dd t-1]$ is called a $t$-difference-cover if all elements of the interval are differences modulo $t$ of elements in $D$: $$[0\dd t-1]= \{(x-y) \bmod t \mid x,y\in D\}.$$ For example, the set $D=\{2,3,5\}$ is a $6$-difference-cover for the interval $[0\dd 6]$, since $1=3-2$, $2=5-3$, $3=5-2$, $4=3-5$ (mod $6$) and $5=2-3$ (mod $6$).

It is known that for every positive integer $t$ there is a $t$-difference-cover of size $O(\sqrt{t})$ and that the set can be constructed in time $O(\sqrt{t})$.

A set $\mathcal{S}(t)\subseteq [1\dd n]$ is called a $t$-cover of the interval $[1\dd n]$ if both $|\mathcal{S}(t)| = O(\frac{n}{\sqrt{t}})$ and there is a constant-time computable function $$h: [1..n-t] \times [1..n-t]\, \rightarrow \,[0..t]$$ that satisfies $$0\le h(i,j)\le t \mbox{ and } i+h(i,j), j+h(i,j)\in \mathcal{S}(t).$$ A $t$-cover can be obtained from a $t$-difference-cover ${\mathcal{D}}$ (of the interval $[0\dd t-1]$) by setting $\mathcal{S}(t) = \{i\in [1\dd n] \mid i \bmod t \in {\mathcal{D}}\}$. The following fact is known.

Fact

For each $t\le n$ a $t$-cover $\mathcal{S}(t)$ can be constructed in time $O(\frac{n}{\sqrt{t}})$.

Show that the sorted partial list of suffixes of a text of length $n$ can be represented in only $O(n^{3/4})$ amount of memory space and can still allow comparison of any two suffixes in $O(\sqrt{n})$ time.
Use the notion of $t$-covers on intervals of integers.

References

  • S. Burkhardt and J. Kärkkäinen. Fast lightweight suffix array construction and checking. In R. A. Baeza-Yates, E. Chávez, and M. Crochemore, editors, Combinatorial Pattern Matching, CPM 2003, volume 2676 of LNCS, pages 55-69. Springer, 2003.
  • P. Ferragina and G. Manzini. Indexing compressed text. J. ACM, 52(4):552-581, 2005.
  • M. Maekawa. A $\sqrt{N}$ algorithm for mutual exclusion in decentralized systems. ACM Transactions on Computer Systems, 3(2):145-159, 1985.
  • E. Ohlebusch. Bioinformatics Algorithms. Oldenbusch Verlag, 2013.