## ARTICLE

# Data Compression with Huffman’s Algorithm

*From **Algorithms and Data Structures in Action** by Marcello La Rocca*

This article discusses Huffman’s Algorithm: what it is and what you can do with it.

_________________________________________________________________

Take 37% off *Algorithms and Data Structures in Action*. Just enter **fccrocca** into the discount code box at checkout at manning.com.

_________________________________________________________________

Huffman’s algorithm is probably the most famous data compression algorithm. You probably have already studied in your introduction to CS course. It is a simple, brilliant greedy[1] algorithm that, despite not being the state of the art for compression anymore, was a major breakthrough in the ’50s.

A Huffman code is a tree, built bottom up, starting with the list of different characters appearing in a text and their frequency. The algorithm iteratively

- selects and removes the two elements in the list with the smallest frequency
- then creates a new node by combining them (summing the two frequencies)
- and finally adds back the new node to the list.

While the tree itself is not a heap, a key step of the algorithm is based on efficiently retrieving the smallest elements in the list, as well as efficiently add new elements to the list. You probably have guessed by now that, once again, that’s where heaps come to the rescue.

Let’s take a look at the algorithm itself, in listing 1.

We assume the input for the algorithm is a text, stored in a string (of course, the actual text might be stored in a file or stream, but we can always have a way to convert it to a string[2]), and the output is a map from characters to binary sequences.

The first sub-task that we need to perform is to transform the text: we want to compute some statistics on it, to understand what are the most used and least used characters in it. To that end, we compute the frequency of characters in the text[3].

The details of the `ComputeFrequencies`

method at line #1 are both out of scope and (at least in its basic version) simple enough, and there is no need to delve into that helper method here.

**Listing 1 The Huffman Coding Algorithm**

**function** huffman(text)

charFrequenciesMap ← ComputeFrequencies(text) #1

priorityQueue ← MinHeap() #2

**for** (char, frequency) in charFrequenciesMap #3

priorityQueue.insert(TreeNode([char], frequency)) #4

**while** priorityQueue.size > 1 #5

left ← priorityQueue.top() #6

right ← priorityQueue.top() #7

parent ← TreeNode(left.chars + right.chars,

left.frequency + right.frequency) #8

parent.left ← left #9

parent.right ← right #10

priorityQueue.insert(parent) #11

**return** buildTable(priorityQueue.top(), [], Map()) #12

Once we have computed the frequency map, we create a new priority queue and then at lines #3 and #4 we iterate over the frequency map, creating a new `TreeNode`

for each character and then adding it to the priority queue. Obviously, considering the subject of this chapter, for the queue we use a heap, and in particular a min-heap, where the element at the top is the one with the smallest value for the priority field. And in this case the priority field is (not surprisingly) the frequency field of the `TreeNode`

.

Each `TreeNode`

, in fact, contains two fields (besides the pointers to its children): a set of characters, and the frequency of those characters in the text, computed as the sum of the frequencies of individual characters.

If you look at figure 1, you can see that the root of the final tree is the set of all characters in our example text, and hence the total frequency is 1.

This set is split into two groups, each of which is assigned to one of the root’s trees, and so on each internal node is similarly split till we get to leaves, each of which contains just one character.

Back to our algorithm, you can see how the tree in figure 1 is constructed bottom-up, and lines #2 to #4 in listing 1 take care of the first step, creating the leaves of the tree and adding them to the priority queue.

Now, from line #5 we enter the core of the algorithm: until there is only one element left in the queue, we extract the top `TreeNode`

entries, in lines #6 and #7: as you can see in figure 2.B, those two elements will be the subtrees with the lowest frequencies so far.

Let’s call this subtrees `L`

and `R`

(the reason for these names will be apparent soon).

Figure 2.C shows the actions performed in lines #8 to #10 of our pseudocode: a new `TreeNode`

is created (let’s call it `P`

) by merging the two entries’ character sets, and setting its frequency as the sum of the old subtrees’ frequencies. Then the new node and two subtrees are combined in a new subtree, where the new node `P`

is the root and the subtrees `L`

and `R`

are its children.

Finally, at line #11 we add this new subtree back into the queue: as it’s shown in figure 2.D, it sometimes can be placed at the top of the queue, but that’s not always the case: the priority queue will take care of this detail for us (notice that here the priority queue is used as a black box)

**Listing 2 The Huffman Coding Algorithm (building a table from the tree)**

**function** buildTable(node, sequence, charactersToSequenceMap)

**if** node.characters.size == 1 **then** #1

charactersToSequenceMap[node.characters[0]] ← sequence #2

**else** #3

**if** node.left <> **none then** #4

buildTable(node.left, sequence + 0, charactersToSequenceMap) #5

**if** node.right <> **none** **then** #6

buildTable(node.right, sequence + 1, charactersToSequenceMap) #7

**return** charactersToSequenceMap #8

These steps are repeated until there is only 1 element left in the queue (figure 3 shows a few more steps), and that last element will be the `TreeNode`

that is the root of the final tree.

We can then use it, in line #12, to create a compression table, which will be the final output of the `huffman`

method. In turn, the compression table can be used to perform the compression of the text by translating each one of its characters into a sequence of bits.

While we won’t show this last step[4], we provide listing 2 with the steps needed to create a compression table from the tree in figure 1; while this goes beyond the scope of this chapter (since the method doesn’t use a priority queue), providing a brief explanation should help those readers interested in writing an implementation of Huffman coding.

We wrote the `buildTable`

method using recursive form. As explained in appendix E, this allows us to provide cleaner and more easily understandable code, but in some languages concrete implementations can be more performant when implemented using explicit iterations.

We’re going to pass 3 arguments to the method: a `TreeNode node`

, that is the current node in the traversal of the tree, a sequence, which is the path from the root to the current node (where we add a 0 for a “left turn” and a 1 for a “right turn”) and the `Map`

that will hold the associations between characters and bit sequences.

At line #1, we check if the set of characters in the node has only one character. If it does, it means we have reached a leaf, and so the recursion can stop: the bit sequence that is associated with the character in the node is the path from root to current node, stored in the `sequence`

variable.

Otherwise we check if the node has left and right children (it will have at least one, because it’s not a leaf) and traverse them. The crucial point here is how we build the sequence argument in the recursive calls: if we traverse the left child of the current node, we add a 0 at the end of the sequence, while if we traverse the right child, we add a 1.

Table 1 shows the compression table produced starting from the tree shown in figure 1; the last column would not be part of the actual compression table, but it’s useful to understand how most used characters end up translated into shorter sequences (which is the key to an efficient compression).

Looking at the sequences, the most important property is that they form a prefix code: no sequence is the prefix of another sequence in the code.

This property is the key point for the decoding: iterating on the compressed text, we immediately know how to break it into characters.

For instance, in the compressed text `1001101`

, if we start from the first character, we can immediately see the sequence `10`

that matches `B`

, then the next `0`

matches `A`

, and finally `1101`

matches `D`

, so the compressed bit sequence is translated into `“BAD”`

.

**Performance Analysis: Finding the best Branching Factor**

We have seen that priority queues are a key component in the Huffman compression pipeline; in chapter 2 of “Algorithms and Data Structures in Action”, we describe an efficient implementation of priority queues, the binary heap, and its advanced variant, the d-ary heap.

To sum it up, the difference between a binary (aka 2-ary) heap and its generalization, a d-ary heap, is the branching factor, the maximum number of children any node of the heap can have.

If we measure the performance of our heap’s methods as the number of swaps performed, we have shown that we have at most `h`

swaps per-method call, if `h`

is the height of the heap. This is because the heap is a complete balanced tree, the height of a d-ary heap is exactly `log`

D`(n)`

[5].

Therefore, for both methods insert and top, it would seem that the larger the branching factor, the smaller the height, and in turn the better heap performance should be.

But just limiting to swaps doesn’t tell us the whole story: you also have to dive into a performance analysis of these two methods, and also take into consideration the number of array accesses, or equivalently the number of comparisons on heap elements, that are needed for each method. While `insert`

accesses only one element per heap’s level, method `top`

traverses the tree from root to leaves, and at each level it needs to go through the list of children of a node: therefore, it needs approximately up to `D*log`

D`(n)`

accesses, on a heap with branching factor `D`

and containing `n`

elements.

Tables 2 and 3 summarize the performance analysis for the three main methods in heaps’ API.

So, for method `top`

, a larger branching factor doesn’t always improve performance, because while `log`

D`(n)`

becomes smaller, `d`

becomes larger. Bringing it to an extreme, if we choose `D > n-1`

, then the heap becomes a root with a list of `n-1`

children, and so `insert`

will require just `1`

comparison and `1`

swap, while `top`

method will need `n`

comparisons and `1`

swap (in practice, as bad as a keeping an unsorted list of elements).

There is no easy[6] way to find a value for `D`

that minimizes function `f(n) = D*log`

D`(n)`

for all values of `n`

, and besides, these formulas give us just an estimate of the maximum number of accesses/swaps performed, the exact number of comparisons and swaps actually performed depends on the sequence of operations, and on the order in which elements are added.

Then the question arises: how do we choose the best value for the branching factor?

The best we can do here is profiling our applications to choose the best value for this parameter. In theory, applications with more calls to `insert`

than to `top`

will perform better with larger branching factors, while when the ratio of calls between the two methods approaches 1.0, a more balanced choice will be best.

**Profile (like there is no tomorrow)**

And so, we are stuck with profiling. If you are asking yourself “what is profiling?” or “where do I start?”, here there are a few tips:

- Profiling means measuring the running time and possibly the memory consumption of different parts of your code.
- It can be done at a high level (measuring the calls to high level functions) or lower level, for each instruction, and although you can set it up manually (measuring the time before and after a call), there are great tools to aid you–usually guaranteeing an error-free measurement.
- Profiling can’t give you general-purpose answers: it can measure the performance of your code on the input you provide. In turn, this means that the result of profiling is as good as the input you use, meaning that if you only use a very specific input, you might tune your code for an edge case, and it could perform poorly on different inputs. Also, another key factor is the data volume: to be statistically significant, it can be nice to collect profiling results on many runs over (pseudo)random inputs. It also means that results are not generalizable to different programming languages, or even different implementations in the same language.
- Profiling requires time. The more in depth you go with tuning, the more time it requires. Remember Donald Knuth’s advice: “premature optimization is the root of all evil”. Meaning you should only get into profiling to optimize critical code paths: spending 2 days to shave 5 milliseconds on an application that takes 1 minute, is frankly a waste of time (and possibly, if you end up making your code more complicated to tune your app, it will also make your code worse).

If, in spite of all of the disclaimers above, you realize that your application actually needs some tuning, then brace yourself and choose the best profiling tool available for your framework.

In our example, we will profile a `Python`

implementation of Huffman encoding and d-ary heap. You can check out the implementation on our repo on GitHub.

Code and tests are written in Python 3, specifically using version 3.7.4, the latest stable version at the time of writing. We are going to use a few libraries and tools to make sense of the profiling stats we collect:

To make your life easier, if you’d like to try out the code I suggest you install Anaconda distribution, that already includes latest Python distribution and all the packages above.

To do the actual profiling, we use cProfile package, which is already included in the basic `Python`

distro.

We won’t explain in detail how to use `cProfile`

(lots of free material online covers this, starting from Python docs linked above), but to sum it up, `cProfile`

allows running a method or function and records the per-call, total, and cumulative time taken by every method involved.

Using pStats.Stats, we can retrieve and print (or process) those stats: the output of profiling looks something like what’s shown in figure 4.

Now, to reduce the noise, we are only interested in a few methods, specifically the functions that use a heap, in particular `_frequency_table_to_heap`

, which takes the dictionary with the frequency (or number of occurrences) of each character in the input text, and creates a heap with one entry per character, and `_heap_to_tree`

, which in turn takes the heap created by the former function, and uses it to create the Huffman encoding tree.

We also want to track down the calls to the heap methods used: `_heapify`

, `top`

and `insert`

. So, instead of just printing those stats, we can read them as a dictionary, and filter the entries corresponding to those five functions.

To have meaningful, reliable results, we also need to profile several calls to the method `huffman.create_encoding`

, and so processing the stats and saving the result to a CSV file seems the best choice anyway.

To see the profiling in action, check out our example on GitHub. The example profiles several calls to the method creating a Huffman encoding over an ensemble of large text files[7] and a bitmap image. The bitmap needs to be duly pre-processed in order to make it processable as text–the details of this pre-processing are not particularly interesting, but for the curious reader: we encode image bytes in base64 to have valid characters.

**Interpreting Results**

Now that we stored the results of profiling in a CSV file, we just need to interpret them to understand what’s the best choice of branching factor for our Huffman encoding app.

At this point it should look like a piece of cake, right?

We can do it in many ways, but my personal favorite is displaying some plots in a Jupyter notebook: take a look here at an example[8]–isn’t it great that GitHub lets you already display the notebook without having to run it locally?

Before delving into the results, though, we need to take a step back: since we will use a boxplot to display our data, let’s make sure we know how to interpret these plots first: if you feel you could use a hint, figure 5 comes to the rescue!

To make sense of the data our example notebook uses Pandas library, with which we read the CSV file with the stats and transform it into a `DataFrame`

, an internal Pandas representation that you can think of as a SQL table on steroids: in fact, using this `DataFrame`

we can partition data by test case (image versus text), then group it by method, and finally process each method separately and display results, for each method, grouped by branching factor. Figure 6 shows these results for encoding a large text, comparing the two main functions in Huffman encoding that use a heap.

If we look at those results, we can confirm a few points:

- 2 is never the best choice for a branching factor;
- A branching factor of 4 or 5 seems a good compromise;
- There is a consistent difference (even -50% for
`_frequency_table_to_heap`

) between a binary heap and the d-ary heap with best branching factor.

There is, however, also something that might look surprising: `_frequency_table_to_heap`

seems to improve with larger branching factors, while `_heap_to_tree`

has a minimum for branching factors around 9 and 10.

As you can see, the former method only calls the`_push_down`

helper method, while the latter uses both `top`

(which in turn calls `_push_down`

) and `insert`

(relying on `_bubble_up`

instead), so we would expect to see the opposite result. It is true, in any case, that even `_heap_to_tree`

has a ratio of 2 calls to `top`

per `insert`

.

To double check that this result is not biased because of the choice of texts, let’s take a look at the running times of encoding a bitmap image (figure 7).

The result looks somehow similar, making input bias less likely.

We should then go back to the text test-case: let’s then delve into the internals of these high-level methods, and see the running time per-call of the heap’s internal method `_heapify`

(figure 9), of the API’s methods `top`

and `insert`

and their helper methods `_push_down`

and `_bubble_up`

(figure 8 and 10).

Before digging into the most interesting part, let’s quickly check out method insert: figure 10 shows that there are no surprises here, the method tends to improve with larger branching factors, as does `_bubble_up`

, its main helper.

Method `_heapify`

, instead, shows a trend similar to `_frequency_table_to_heap`

– as expected, since all this method does is to create a heap from the frequency table. Still, is it a bit surprising that _`heapify`

‘s running time doesn’t degrade with larger branching factors?

Now to the juicy part: let’s take a look at `top`

, in figure 11. If we look at the running time per-call, the median and distribution clearly show a local minimum (around `D==9`

), as we would have expected considering that the method’s running time is `O(D*log`

D`(n))`

, as we have discussed above and summarized in table 2. Not surprisingly, the `_push_down`

helper method has an identical distribution.

If we look at listings 2.7 and 2.4, sections 2.6.2 and 2.6.4 of Algorithms and Data Structrures in Action, it’s clear how `pushDown`

is the hearth of method `top`

, and in turn the largest chunk of work for the latter is the method that at each heap level retrieves the child of the current node, we called it `highestPriorityChild`

[9].

And if we then look at the running time per-call for `_highest_priority_child`

(figure 11), we have a confirmation that our profiling does make sense, as the running time per-call increases with the branching factor: the largest `D`

is, the longest list of children for each node, which this method need to traverse entirely in order to find which tree branch should be traversed next.

You might then wonder: why then `_push_down`

doesn’t have the same trend? Remember that while the running time for `_highest_priority_child`

is `O(D)`

, in particular with `D`

comparisons, `_push_down`

performs (at most) `log`

D`(n)`

swap sand `D*log`

D`(n)`

comparisons, because it calls `_highest_priority_child`

at most `D`

times.

The largest it is the branching factor `D`

, the fewer calls are made to `_highest_priority_child`

: this becomes apparent if, instead of plotting the running time per-call for `_highest_priority_child`

, we use the total cumulative time (the sum of all calls to each method), as shown in figure 12: there we can see again how this composite function, `f(D) = D*log`

D`(n)`

, has a minimum at `D==9`

.

**The Mystery with Heapify**

So, summing up, `_heapify`

keeps improving even at larger branching factors, although we can also say it practically plateaus after `D==13`

, but this behavior is not caused by methods `top`

and `_push_down`

, which do behave as expected.

There could be a few explanations for how `_heapify`

running time grows:

- It’s possible that, checking larger branching factors, we discover that there is a minimum (we just haven’t found it yet).
- The performance of this method heavily depends on the order of insertions: with sorted data it performs way better than random data.
- Implementation-specific details makes the contribution of calls to
`_push_down`

less relevant over the total running time.

But… are we sure that this is in contrast with theory?

You should know by now that I like rhetorical questions: this means it is time to get into some Math.

And indeed, if we take a closer look at section 2.6.7 on the book, we can find out that the number of swaps performed by `_heapify`

is limited by:

As you can imagine, analyzing this function is not straightforward. But, and here is the fun part, now you can certainly plot the function using your (possibly newly acquired) Jupyter Notebook skills; turns out that, indeed, plotting this function of `D`

for a few values of `n`

, we get the charts in figure 13, showing that, despite the single calls to `_push_down`

will be slower with a higher branching factor, the total number of swaps performed by `_heapify`

is expected to decrease.

So, mystery solved, `_heapify`

does behave as expected–and good news too, it gets faster as we increase the running time.

**Choosing the Best Branching Factor**

For a mystery solved, there is still a big question standing: what is the best branching factor to speed up our Huffman encoding method? We digressed, delving into the details of the analysis of heaps, and somehow left the most important question behind.

To answer this question, there is only one way: looking at figure 14, showing the running time for the main method of our Huffman encoding library.

It shows three interesting facts:

- The best choice seems to be
`D==9`

; - Choosing any value larger than 7 will likely be as good as;
- While the max gain for
`_frequency_table_to_heap`

is 50%, and for`_heapify`

even up to 80%, we merely get to a 5% here.

Notice how the chart for create_encoding looks more similar to method `_heap_to_tree`

than to `_frequency_table_to_heap`

: also considering point #3 above, the explanation is that the operation on the heap only contributes to a fraction of the running time, while the most demanding methods needs to be searched elsewhere (hint: for `create_frequency_table`

running time depends on the length of the input file, while for the other methods it only depends on the size of the alphabet…).

You can check out the full results, including the other examples, on the notebook on GitHub. Keep in mind that this analysis uses a limited input, and as such is likely to be biased: this is just a starting point, I highly encourage you to try running a more thorough profiling using more runs on different inputs.

You can also pull the full profiling and visualization code, and delve even deeper into the analysis.

To wrap up, I’d like to highlight a few takeaways:

- D-ary heaps are faster than binary ones. Make sure to take a look at chapter 2 of
*Algorithms and Data Structures in Action* - If you have parametric code, profiling is the way to tune your parameter: the best choice depends on your implementation as well as on input;
- Be sure to run your profiling on a representative sample of your domain: if you use a limited subset, you will likely optimize for an edge case, and your results will be biased;
- As a general golden rule, make sure to optimize the right thing: only profile critical code, perform high-level profiling first to spot the critical section of code whose optimization would give you the largest improvement, and that this improvement is worth the time you’ll spend on it.

That’s all for now. If you want to learn more about the book, check it out on liveBook here and see this slide deck.

[1] Greedy algorithms are characterized by their strategy of making the local-optimum choice at each step; for some problems (including the Huffman optimal encoding) this leads to the global optimal solution, although this is not the case for the majority of problems: therefore, greedy algorithms are often used as heuristics that can provide a fast but approximate solution to problems.

[2] Additional challenges can occur if the text is so large that either can’t be contained in memory, or where a map-reduce approach is advisable, we encapsulate all that complexity in the ComputeFrequencies method.

[3] Instead of the actual frequency it might be easier, and equivalent for the algorithm, just counting the number of occurrences of each character.

[4] It’s simply going to be a 1:1 mapping over the characters in the text, with extra attention needed to efficiently construct the bits encoding in output.

[5] Where D is the branching factor for the d-ary heap under analysis.

[6] It is, obviously, possible to find a formula for f(D)’s minima using calculus, and in particular computing the first and second order derivatives of the function.

[7] We used copyright-free novels downloaded from Project Gutenberg, http://www.gutenberg.org (worth checking out!)

[8] Disclaimer: there are many possible ways to do this, and possibly better ways too. This is just one of the possible ways.

[9] In out Python implementation, the names become respectively _push_down and _highest_priority_child, to follow Python naming conventions.

*Check out more great free content from our titles on our **Free Content Center**.*