|
Problem 106: Compressing Suffix Arrays
|
|
\(
\def\sa#1{\tt{#1}}
\def\dd{\dot{}\dot{}}
\)
Suffix arrays constitute a simple and space-economical data structure for
indexing texts.
In addition, there are many compressed versions of suffix arrays.
The problem discusses one of them for compressing the array that
stores the sorted (partial) list of suffixes (more precisely the partial rank
array) of the concerned text.
This shows an application of simple number theory.
Number-theoretic tools
A set $D\subseteq [0\dd t-1]$ is called a $t$-
difference-cover if all
elements of the interval are differences modulo $t$ of elements in $D$:
$$[0\dd t-1]= \{(x-y) \bmod t \mid x,y\in D\}.$$
For example, the set $D=\{2,3,5\}$ is a $6$-difference-cover for the interval
$[0\dd 6]$, since $1=3-2$, $2=5-3$, $3=5-2$, $4=3-5$ (mod $6$) and $5=2-3$
(mod $6$).
It is known that for every positive integer $t$ there is a
$t$-difference-cover of size $O(\sqrt{t})$ and that the set can be
constructed in time $O(\sqrt{t})$.
A set $\mathcal{S}(t)\subseteq [1\dd n]$ is called a $t$-cover of the
interval $[1\dd n]$ if both $|\mathcal{S}(t)| = O(\frac{n}{\sqrt{t}})$ and
there is a constant-time computable function
$$h: [1..n-t] \times [1..n-t]\, \rightarrow \,[0..t]$$
that satisfies
$$0\le h(i,j)\le t \mbox{ and } i+h(i,j), j+h(i,j)\in \mathcal{S}(t).$$
A $t$-cover can be obtained from a $t$-difference-cover ${\mathcal{D}}$ (of
the interval $[0\dd t-1]$) by setting $\mathcal{S}(t) = \{i\in [1\dd n] \mid
i \bmod t \in {\mathcal{D}}\}$.
The following fact is known.
Fact
For each $t\le n$ a $t$-cover $\mathcal{S}(t)$ can be constructed
in time $O(\frac{n}{\sqrt{t}})$.
Show that the sorted partial list of suffixes of a text of length $n$ can be
represented in only $O(n^{3/4})$ amount of memory space and can still allow
comparison of any two suffixes in $O(\sqrt{n})$ time.
Use the notion of $t$-covers on intervals of integers.
References
S. Burkhardt and J. Kärkkäinen. Fast lightweight suffix array construction
and checking. In R. A. Baeza-Yates, E. Chávez, and M. Crochemore,
editors, Combinatorial Pattern Matching, CPM 2003, volume 2676 of
LNCS, pages 55-69. Springer, 2003.
P. Ferragina and G. Manzini. Indexing compressed text. J. ACM,
52(4):552-581, 2005.
M. Maekawa. A $\sqrt{N}$ algorithm for mutual exclusion in decentralized
systems. ACM Transactions on Computer Systems, 3(2):145-159, 1985.
E. Ohlebusch. Bioinformatics Algorithms. Oldenbusch Verlag, 2013.