Consider the word $\sa{abcacbabcbac}$. Let $L=\{\sa{a},\sa{c}\}$, $R=\{\sa{b}\}$. Then, substituting $\sa{d}$ for $\sa{ab}$ and $\sa{e}$ for $\sa{cb}$ produces the word $\sa{dcaedeac}$ of length $8\lt \frac{3}{4}12=9$.
On the contrary, setting $L=\{\sa{a}\}$, $R=\{\sa{b}\}$ and substituting $\sa{d}$ for $\sa{ab}$ in the word $\sa{aaabbb}$ containing two unary runs produces the word $\sa{aadbb}$ of length $5\gt \frac{3}{4}6=4.5$, which does not meet the desired bound.
Problem 129: Shrinking a text by pairing adjacent symbols |
One of the most powerfull compression techniques is recompression. In this technique there are two crucial operations: shrinking unary runs and pairing letters.
A unary run is a maximal occurrence of a factor of length at least 2 that is a repetition of the same letter. The first phase of the recompression technique consists in shrinking each unary run into a single letter.
The second phase it to apply the operation $\Comp(x,L,R)$, where $(L,R)$ is a partition of the alphabet $A$ of letters, $L\cup R=A$ and $L\cap R=\emptyset$. The compressed word $\Comp(x,L,R)$ results from $x$ by substituting a single letter (identifier of a pair of letters) for each occurrence of its 2-letter factor $ab$, whenever $a\in L$ and $b\in R$.
The Pairing problem consists in computing a partition $(L,R)$ of the alphabet of $x$ for which $|\Comp(x,L,R)|\le \frac{3}{4}|x|$.