Example

$ \def\sa#1{\tt{#1}} \def\dd{\dot{}\dot{}} \def\LZ{\mathrm{LZ}} \def\LPF{\mathrm{LPF}} $

The factorisation of $\sa{abaabababbabbb}$ is $\sa{a}\cdot\sa{b}\cdot\sa{a}\cdot\sa{aba}\cdot\sa{bab}\cdot\sa{babb}\cdot\sa{b}$, which gives the array $\LZ=[0,1,2,3,6,9,13,14]$.

Below is the $\LPF$ array of $\sa{abaabababbabbb}$.

$i$ 0 1 2 3 4 5 6 7 8 9 10 11 12 13

$w[i]$ $\sa{a}$ $\sa{b}$ $\sa{a}$ $\sa{a}$ $\sa{b}$ $\sa{a}$ $\sa{b}$ $\sa{a}$ $\sa{b}$ $\sa{b}$ $\sa{a}$ $\sa{b}$ $\sa{b}$ $\sa{b}$

$\LPF[i]$ 0 0 1 3 2 4 3 2 1 4 3 2 2 1

Problem 97: Lempel-Ziv Factorisation

The problem deals with a Lempel-Ziv factorisation of words. The factorisation considered here is the decomposition of a word $w$ into the product $w_0w_1\cdots w_k$ where each $w_i$ is the longest prefix of $w_iw_{i+1}\cdots w_k$ that occurs in $w$ before the present position $|w_0w_1\cdots w_{i-1}|$. If there is no previous occurrence, $w_i$ is the first letter of $w_iw_{i+1}\cdots w_k$.

The factorisation is stored in the array $\LZ$: $\LZ[0] = 0$ and, for $1\le i \le k$, $\LZ[i] = |w_0w_1\cdots w_{i-1}|$.

Show how to compute the array $\LZ$ of a word in linear time assuming a fixed-size alphabet.

The same running time can be achieved when the alphabet is linearly sortable, which is a weaker condition than the above one. This is done from the longest previous array ($\LPF$) array of the word that computes in linear time under this condition (see Problem 53).

The $\LPF$ array of a word $w$ is defined, for each position $i$ on $w$, by: $\LPF[i]$ is the length of the longest factor of $w$ that starts both at position $i$ and at a smaller position.

Design an algorithm that computes in linear time the Lempel-Ziv factorisation of a word given its $\LPF$ array.

References

H. Bannai, T. I, S. Inenaga, Y. Nakashima, M. Takeda, and K. Tsuruta. The "runs" theorem. SIAM J. Comput., 46(5):1501-1514, 2017.

J. Berstel and A. Savelli. Crochemore factorization of sturmian and other infinite words. In Mathematical Foundations of Computer Science 2006, 31st International Symposium, MFCS 2006, Stará Lesná, Slovakia, August 28-September 1, 2006, Proceedings, volume 4162 of Lecture Notes in Computer Science, pages 157-166. Springer, 2006.

M. Crochemore. Transducers and repetitions. Theor. Comput. Sci., 45(1):63-86, 1986.

M. Crochemore and L. Ilie. Computing longest previous factors in linear time and applications. Inf. Process. Lett., 106(2):75-80, 2008.

M. Crochemore, G. M. Landau, and M. Ziv-Ukelson. A subquadratic sequence alignment algorithm for unrestricted scoring matrices. SIAM J. Comput., 32(6):1654-1673, 2003.

R. M. Kolpakov and G. Kucherov. Finding maximal repetitions in a word in linear time. In 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA, pages 596-604. IEEE Computer Society, 1999.

R. M. Kolpakov and G. Kucherov. Finding approximate repetitions under Hamming distance. In F. Meyer auf der Heide, editor, Algorithms - ESA 2001, 9th Annual European Symposium, Aarhus, Denmark, August 28-31, 2001, Proceedings, volume 2161 of Lecture Notes in Computer Science, pages 170-181. Springer, 2001.

J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Trans. Information Theory, 23(3):337-343, 1977.

$i$	0	1	2	3	4	5	6	7	8	9	10	11	12	13
$w[i]$	$\sa{a}$	$\sa{b}$	$\sa{a}$	$\sa{a}$	$\sa{b}$	$\sa{a}$	$\sa{b}$	$\sa{a}$	$\sa{b}$	$\sa{b}$	$\sa{a}$	$\sa{b}$	$\sa{b}$	$\sa{b}$
$\LPF[i]$	0	0	1	3	2	4	3	2	1	4	3	2	2	1

125 Problems in Text Algorithms

Example

Problem 97: Lempel-Ziv Factorisation

References