CLR cover

Problem 105: Prediction by Partial Matching

\( \def\sa#1{\tt{#1}} \def\dd{\dot{}\dot{}} \)

Prediction by Partial Matching (PPM) is a lossless compression technique where the encoder maintains a statistical model of the text. The goal is to predict letters following a given factor of the input. In the problem we examine the data structure used to store the model.

Let $y$ be the text to be compressed and assume $y[0\dd i]$ has already been encoded. PPM assigns a probability $p(a)$ to each letter $a\in A$ depending on the number of occurrences of $y[i + 1 - d \dd i]\cdot a$ in $y[1\dd i]$, where $d$ is the context length. Then PPM sends $p(y[i+1])$ to an adaptive arithmetic encoder taking into account letter probabilities. When there is no occurrence of $y[i + 1 - d\dd i + 1]$ in $y[0\dd i]$ the encoder decrements the value of $d$ until either an occurrence of $y[i + 1 - d\dd i + 1]$ is found or $d=-1$. In the last case, $y[i + 1]$ is a letter not met before. Each time the encoder decrements the value of $d$ it sends a so-called ``escape'' probability to the adaptive arithmetic encoder.

PPM* is a variant of PPM which does not consider a maximum context length but stores all contexts. The initial context at each step is the shortest deterministic context, one that corresponds to the shortest repeated suffix that is always followed by the same letter or it is the longest context if such a suffix does not exist.

Design a data structure able to maintain online the number of occurrences of each context and that can be managed in linear time on a constant-size alphabet.

References

  • J. G. Cleary, W. J. Teahan, and I. H. Witten. Unbounded length contexts for PPM. In J. A. Storer and M. Cohn, editors, Proceedings of the IEEE Data Compression Conference, DCC 1995, Snowbird, Utah, March 28-30, 1995., pages 52-61. IEEE Computer Society, 1995.
  • J. G. Cleary and I. H. Witten. A comparison of enumerative and adaptive codes. IEEE Transactions on Information Theory, 30(2):306-315, 1984.
  • M. Effros. PPM performance with BWT complexity: A new method for lossless data compression. In Data Compression Conference, DCC 2000, Snowbird, Utah, USA, March 28-30, 2000., pages 203-212. IEEE Computer Society, 2000.
  • A. Moffat. Implementing the PPM data compression scheme. IEEE Transactions on Communications, 38(11):1917-1921, 1990.