CLR cover

Problem 52: Codicity Test

Sets of words, especially binary words, are used to encode information. They may be related to transmission protocols, to data compression or mere texts. Streams of data need to be parsed according to the set to retrieve the original information. Parsing is a simple operation when codewords have the same length, like ASCII and UTF-32 codes for characters, and gives a unique factorisation of encoded data.

A code is a set of words that features a uniquely decipherable property. The question of having a unique parsing concerns mostly variable-length codes. The goal of the problem is to test whether a set of words is a code.

More precisely, a set $C=\{w_1,w_2,\dots,w_n\}$ of words drawn from an alphabet $A$ is a code if for every two sequences (noted as words) $i_1i_2\cdots i_k$ and $j_1j_2\cdots j_\ell$ of indices from $\{1,2,\dots,n\}$ we have $$i_1i_2\cdots i_k \neq j_1j_2\cdots j_\ell \;\Rightarrow \; w_{i_1}w_{i_2}\cdots w_{i_k} \neq w_{j_1}w_{j_2}\cdots w_{j_\ell}.$$ In other words, if we define the morphism $h$ from $\{1,2,\dots,n\}^*$ to $A^*$ by $h(i)=w_i$, for $i\in \{1,2,....n\}$, the condition means $h$ is injective.

The size $N$ of the codicity testing problem for a finite set of words is the total length $||C||$ of all words of $C$.

Design an algorithm that checks if a finite set $C$ of words is a code and that runs in time $O(N^2)$.

References

  • A. Apostolico and R. Giancarlo. Pattern matching machine implementation of a fast test for unique decipherability. Inf. Process. Lett., 18(3):155-158, 1984.
  • J. Berstel and D. Perrin. Theory of Codes. Academic Press, INC., Orlando, Florida, 1985.
  • M. Lothaire. Combinatorics on Words. Addison-Wesley, 1983. Reprinted in 1997.
  • A. A. Sardinas and G. W. Patterson. A necessary and sufficient condition for the unique decomposition of coded messages. Research Division Report 50-27, Moore School of Electrical Engineering, University of Pennsylvania, 1950.