Problem 105: Prediction by Partial Matching
Prediction by Partial Matching (PPM) is a lossless compression technique in which the encoder maintains a statistical model of the text. The goal is to predict the letter that follows a given factor of the input. In this problem we examine the data structure used to store the model.
Let $y$ be the text to be compressed and assume that $y[0\dd i]$ has already been encoded. PPM assigns a probability $p(a)$ to each letter $a\in A$ according to the number of occurrences of $y[i + 1 - d \dd i]\cdot a$ in $y[0\dd i]$, where $d$ is the context length, starting from a fixed maximum. Then PPM sends $p(y[i+1])$ to an adaptive arithmetic encoder that takes letter probabilities into account. When there is no occurrence of $y[i + 1 - d\dd i + 1]$ in $y[0\dd i]$, the encoder decrements $d$ until either an occurrence of $y[i + 1 - d\dd i + 1]$ is found or $d=-1$. In the latter case, $y[i + 1]$ is a letter that has not occurred before. Each time it decrements $d$, the encoder sends a so-called ``escape'' probability to the adaptive arithmetic encoder.
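To make the escape mechanism concrete, here is a minimal Python sketch that, given the text and the position $i$ up to which it has been encoded, returns the sequence of probabilities PPM would send to the arithmetic encoder. The function name ppm_predict and the escape estimate $1/(t+1)$ (where $t$ is the total count for the current context) are illustrative assumptions, not part of the problem; real PPM variants differ in their escape estimators, and the naive rescanning of the prefix is exactly the work that the data structure examined in the problem is meant to avoid.

\begin{verbatim}
from collections import defaultdict

def ppm_predict(y, i, d_max):
    # Return the sequence of (order, probability) pairs PPM sends to the
    # adaptive arithmetic encoder when encoding y[i+1], given that
    # y[0..i] is already encoded.  Illustrative escape estimator only.
    encoded = y[:i + 1]
    nxt = y[i + 1]
    sent = []
    d = min(d_max, i + 1)                    # context fits inside the prefix
    while d >= 0:
        ctx = encoded[i + 1 - d:i + 1]       # current context y[i+1-d..i]
        counts = defaultdict(int)
        for j in range(len(encoded) - d):    # naive scan of y[0..i]
            if encoded[j:j + d] == ctx:
                counts[encoded[j + d]] += 1  # letter following this occurrence
        total = sum(counts.values())
        if counts[nxt] > 0:                  # ctx.y[i+1] occurs: send p(y[i+1])
            sent.append((d, counts[nxt] / (total + 1)))
            return sent
        # no occurrence of y[i+1-d..i+1]: send an escape, shorten the context
        sent.append(('esc', 1 / (total + 1)))
        d -= 1
    sent.append(('new', None))               # y[i+1] never occurred before
    return sent
\end{verbatim}

For instance, ppm_predict("abracadabra", 9, 2) finds the order-2 context "br" already followed by 'a' at position 1 and returns [(2, 0.5)] without sending any escape.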
PPM* is a variant of PPM that imposes no maximum context length but stores all contexts. The initial context at each step is the shortest deterministic context, that is, the shortest repeated suffix of the encoded part that is always followed by the same letter; if no such suffix exists, the longest context is used instead.
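Under the same naive-scan assumptions (the helper name shortest_deterministic_context is ours), the following sketch computes PPM*'s starting context: the shortest suffix of the encoded part that occurs earlier and is always followed by the same letter, falling back to the longest repeated suffix when no repeated suffix is deterministic.

\begin{verbatim}
def shortest_deterministic_context(encoded):
    # Return the shortest suffix of encoded that occurs earlier and is
    # always followed by the same letter; if no repeated suffix is
    # deterministic, return the longest repeated suffix instead.
    n = len(encoded)
    longest_repeated = None
    for d in range(n):
        ctx = encoded[n - d:]                # suffix of length d ('' if d = 0)
        followers = {encoded[j + d]
                     for j in range(n - d)
                     if encoded[j:j + d] == ctx}
        if not followers:                    # suffix no longer occurs earlier
            break
        longest_repeated = ctx
        if len(followers) == 1:              # deterministic: one possible letter
            return ctx
    return longest_repeated
\end{verbatim}

On "abracadabra" this returns "ra": the suffixes "" and "a" are each followed by several distinct letters in the encoded part, while the earlier occurrence of "ra" is always followed by 'c'.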