#
Problem 105: Prediction by Partial Matching

\(
\def\sa#1{\tt{#1}}
\def\dd{\dot{}\dot{}}
\)

Prediction by Partial Matching (PPM) is a lossless compression technique
where the encoder maintains a statistical model of the text.
The goal is to predict letters following a given factor of the input.
In the problem we examine the data structure used to store the model.

Let $y$ be the text to be compressed and assume $y[0\dd i]$ has already been
encoded.
PPM assigns a probability $p(a)$ to each letter $a\in A$ depending
on the number of occurrences of $y[i + 1 - d \dd i]\cdot a$ in $y[1\dd i]$,
where $d$ is the context length.
Then PPM sends $p(y[i+1])$ to an adaptive
arithmetic encoder taking into account letter probabilities.
When there is no
occurrence of $y[i + 1 - d\dd i + 1]$ in $y[0\dd i]$ the
encoder decrements the value of $d$ until either an occurrence of
$y[i + 1 - d\dd i + 1]$ is found or $d=-1$.
In the last case, $y[i + 1]$ is a letter not met before.
Each time the encoder decrements the value of $d$ it sends a so-called
``escape'' probability to the adaptive arithmetic encoder.

PPM* is a variant of PPM which does not consider a maximum context length
but stores all contexts.
The initial context at each step is the shortest deterministic context,
one that corresponds to the shortest repeated suffix that is always followed
by the same letter or it is the longest context if such a suffix does not
exist.

Design a data structure able to maintain online the number of occurrences of
each context and that can be managed in linear time on a constant-size
alphabet.

## References

J. G. Cleary, W. J. Teahan, and I. H. Witten. Unbounded length contexts
for PPM. In J. A. Storer and M. Cohn, editors, *Proceedings of the IEEE
Data Compression Conference*, DCC 1995, Snowbird, Utah, March 28-30, 1995.,
pages 52-61. IEEE Computer Society, 1995.
J. G. Cleary and I. H. Witten. A comparison of enumerative and adaptive
codes. *IEEE Transactions on Information Theory*, **30**(2):306-315, 1984.
M. Effros. PPM performance with BWT complexity: A new method
for lossless data compression. In *Data Compression Conference*, DCC
2000, Snowbird, Utah, USA, March 28-30, 2000., pages 203-212. IEEE
Computer Society, 2000.
A. Moffat. Implementing the PPM data compression scheme. *IEEE
Transactions on Communications*, **38**(11):1917-1921, 1990.