
Problem 133: Huffman codes vs entropy

We consider $n$ items with positive weights (probabilities) $p_1,p_2,\ldots,p_n$ satisfying $\sum_i\, p_i=1$, and let $\mathbf{p}=(p_1,p_2,\ldots,p_n)$. Huffman's algorithm constructs a full binary tree (each non-leaf node has two children) with the items assigned to its leaves. Let $l_i$ be the depth of the leaf holding $p_i$, that is, the number of edges on the path from the root to that leaf. The average code length of the Huffman code is $\Huf(\mathbf{p})=\sum_i\, p_i\cdot l_i$.
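
A minimal Python sketch (not part of the original problem) of how $\Huf(\mathbf{p})$ can be computed; it uses the standard observation that the weighted path length equals the sum of the merge costs produced by the greedy procedure. The function name is illustrative.

```python
import heapq

def huffman_average_length(p):
    """Average code length sum_i p_i * l_i of a Huffman tree for weights p.

    Each greedy merge of two subtrees with total weight w pushes both one
    edge deeper, adding w to the weighted path length, so we can just
    accumulate the merge costs instead of building the tree explicitly.
    """
    if len(p) == 1:
        return 0.0              # a single item sits at the root, depth 0
    heap = list(p)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b          # both merged subtrees get one edge deeper
        heapq.heappush(heap, a + b)
    return total

# Example: for p = (0.5, 0.25, 0.125, 0.125) the code lengths are 1, 2, 3, 3,
# so Huf(p) = 1.75.
print(huffman_average_length([0.5, 0.25, 0.125, 0.125]))  # 1.75
```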

An important concept in information theory is entropy. The sequence $\mathbf{p}$ is treated as a source of information and its entropy is $$\Entropy(\mathbf{p})\,=\, -\sum_i p_i\cdot \log_2 p_i.$$
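
A correspondingly small sketch of $\Entropy(\mathbf{p})$, again with illustrative naming:

```python
import math

def entropy(p):
    """Entropy(p) = -sum_i p_i * log2(p_i) of a probability distribution p."""
    return -sum(pi * math.log2(pi) for pi in p)

print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits, equal to Huf(p) here
```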

A useful property for the solution below (Gibbs' inequality) is: if $p_1,\dots,p_n$ and $q_1,\dots,q_n$ are two sequences of positive reals with the same sum, then $-\sum_i\,p_i \log_2 q_i\geq -\sum_i\,p_i \log_2 p_i$. It follows directly from the inequality $\ln x \le x-1$, valid for $x\gt 0$.
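
One short way to see this (a sketch, using only $\ln x\le x-1$ applied to $x=q_i/p_i$):
$$\sum_i p_i\log_2\frac{q_i}{p_i}\;=\;\frac{1}{\ln 2}\sum_i p_i\ln\frac{q_i}{p_i}\;\le\;\frac{1}{\ln 2}\sum_i p_i\Bigl(\frac{q_i}{p_i}-1\Bigr)\;=\;\frac{1}{\ln 2}\Bigl(\sum_i q_i-\sum_i p_i\Bigr)\;=\;0,$$
which is equivalent to $-\sum_i p_i\log_2 q_i\ge -\sum_i p_i\log_2 p_i$; equality holds exactly when $q_i=p_i$ for all $i$.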

Show that $\Entropy(\mathbf{p})\le \Huf(\mathbf{p})\le \Entropy(\mathbf{p})+1$.
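
The two sketches above can be combined into a self-contained numerical sanity check of this claim on random distributions (an empirical check only, not a proof; all names are illustrative):

```python
import heapq, math, random

def huffman_average_length(p):
    """Huf(p): accumulate merge costs of the greedy Huffman procedure."""
    heap = list(p)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total

def entropy(p):
    """Entropy(p) = -sum_i p_i * log2(p_i)."""
    return -sum(pi * math.log2(pi) for pi in p)

random.seed(0)
for _ in range(5):
    w = [random.random() for _ in range(8)]
    p = [x / sum(w) for x in w]             # normalize to a distribution
    h, huf = entropy(p), huffman_average_length(p)
    assert h <= huf + 1e-12 and huf <= h + 1 + 1e-12
    print(f"Entropy = {h:.4f}  Huf = {huf:.4f}")
```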

References