CLR cover

Problem 97: Lempel-Ziv Factorisation

The problem deals with a Lempel-Ziv factorisation of words. The factorisation considered here is the decomposition of a word $w$ into the product $w_0w_1\cdots w_k$ where each $w_i$ is the longest prefix of $w_iw_{i+1}\cdots w_k$ that occurs in $w$ before the present position $|w_0w_1\cdots w_{i-1}|$. If there is no previous occurrence, $w_i$ is the first letter of $w_iw_{i+1}\cdots w_k$.

The factorisation is stored in the array $\LZ$: $\LZ[0] = 0$ and, for $1\le i \le k$, $\LZ[i] = |w_0w_1\cdots w_{i-1}|$.

Show how to compute the array $\LZ$ of a word in linear time assuming a fixed-size alphabet.

The same running time can be achieved when the alphabet is linearly sortable, which is a weaker condition than the above one. This is done from the longest previous array ($\LPF$) array of the word that computes in linear time under this condition (see Problem 53).

The $\LPF$ array of a word $w$ is defined, for each position $i$ on $w$, by: $\LPF[i]$ is the length of the longest factor of $w$ that starts both at position $i$ and at a smaller position.

Design an algorithm that computes in linear time the Lempel-Ziv factorisation of a word given its $\LPF$ array.

References

  • H. Bannai, T. I, S. Inenaga, Y. Nakashima, M. Takeda, and K. Tsuruta. The "runs" theorem. SIAM J. Comput., 46(5):1501-1514, 2017.
  • J. Berstel and A. Savelli. Crochemore factorization of sturmian and other infinite words. In Mathematical Foundations of Computer Science 2006, 31st International Symposium, MFCS 2006, Stará Lesná, Slovakia, August 28-September 1, 2006, Proceedings, volume 4162 of Lecture Notes in Computer Science, pages 157-166. Springer, 2006.
  • M. Crochemore. Transducers and repetitions. Theor. Comput. Sci., 45(1):63-86, 1986.
  • M. Crochemore and L. Ilie. Computing longest previous factors in linear time and applications. Inf. Process. Lett., 106(2):75-80, 2008.
  • M. Crochemore, G. M. Landau, and M. Ziv-Ukelson. A subquadratic sequence alignment algorithm for unrestricted scoring matrices. SIAM J. Comput., 32(6):1654-1673, 2003.
  • R. M. Kolpakov and G. Kucherov. Finding maximal repetitions in a word in linear time. In 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA, pages 596-604. IEEE Computer Society, 1999.
  • R. M. Kolpakov and G. Kucherov. Finding approximate repetitions under Hamming distance. In F. Meyer auf der Heide, editor, Algorithms - ESA 2001, 9th Annual European Symposium, Aarhus, Denmark, August 28-31, 2001, Proceedings, volume 2161 of Lecture Notes in Computer Science, pages 170-181. Springer, 2001.
  • J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Trans. Information Theory, 23(3):337-343, 1977.