Welcome to this course on data structures and algorithms. We have been discussing binary trees this week, and we will continue by introducing an important and useful application called Huffman coding. In this session we will look at the basic principle of Huffman coding, and later on we shall see how it can be implemented as a C++ program.

So first of all, what is Huffman coding? Ordinarily we represent input characters by ASCII codes, which means we spend a fixed number of bits on every individual character. If instead we could assign a variable-length code to each character, with codes that are shorter than 8 bits in most cases, we would get a much more compact representation. We can use the frequency of occurrence of each character to decide what code to use for it: the shortest code is assigned to the character with the highest frequency of occurrence. The typical application of Huffman coding is data compression. We call it lossless data compression because the entire original data can be reconstructed from the Huffman-coded version.

Let us look at how a Huffman code is constructed. Consider a file that contains the characters A to F, and let the frequency of each character be the number of times it appears in the file: A appears 15 times, B 10 times, C 12 times, D 18 times, E 55 times and F 16 times. So B appears the smallest number of times and E the largest; different characters appear different numbers of times. We will now use this frequency information to design a code for every individual character. To construct a unique code, we first look for the pair of elements with the minimum frequencies. We note that B occurs 10 times and C occurs 12 times, while all other characters occur more frequently.
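Counting these frequencies is the first step in practice. As a minimal sketch (assuming the text is already in a string; the lecture's file would instead be read, for example, with std::ifstream):

```cpp
#include <map>
#include <string>

// Count how many times each character occurs in the given text.
// The text is passed in directly here for simplicity; a real program
// would read it from the input file first.
std::map<char, int> countFrequencies(const std::string& text) {
    std::map<char, int> freq;
    for (char c : text) {
        ++freq[c];   // create-or-increment the counter for this character
    }
    return freq;
}
```

The resulting map plays the role of the frequency table in the example that follows.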
So these two are the minimally occurring characters in the list. What do we do? We combine them and count the total number of times both of them occur, which is 22. We now create a small tree structure with the value 22 at the top: the left child of this node represents the character B, which occurs 10 times, and the right child represents the character C, which occurs 12 times. Please note that this node does not correspond to any particular character; it represents the combined frequency of occurrence of both child characters.

We proceed in exactly this fashion. Our table now has frequencies of 15 for A, 22 for the combined B and C, 18 for D, 55 for E and 16 for F. Continuing the same logic to build the tree, we find that the two minimum elements are now A and F, which occur 15 and 16 times. Since these are unrelated to B and C, we create a separate subtree with 31 as its value: the left child represents A, which occurs 15 times, and the right child represents F, which occurs 16 times. We label the branches 0 for left and 1 for right, exactly as before. So we have combined these two into a combined frequency of 31.

We continue with the same logic. Among the remaining elements, including the composite nodes we have created, the two minimum values are now 22 and 18. We combine these to form a node with the composite value 40. Please note that this node 40 has the composite node of frequency 22 as its right child, while on the left it has the node representing D, which has the next smallest frequency, 18. So the subtree rooted at 22 has grown further. Let us continue. The minimum values are now 31 and 40, which are combined to make 71. So 71 now appears here: its left child points to the composite value 31 and its right child points to the composite value 40.
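The repeated step of picking the two minimum-frequency elements and combining them under a new parent is usually implemented with a min-priority queue. A sketch of this construction in C++ (the names Node and buildHuffmanTree are illustrative, not from the lecture; with equal frequencies the exact tree shape can differ from the one drawn on the slides, but the frequency totals come out the same):

```cpp
#include <queue>
#include <vector>
#include <utility>

// A node of the Huffman tree: leaves carry a character,
// internal nodes carry only the combined frequency.
struct Node {
    int freq;
    char ch;              // meaningful only for leaf nodes
    Node* left;
    Node* right;
};

// Ordering for a min-heap: the node with the smallest frequency on top.
struct ByFreq {
    bool operator()(const Node* a, const Node* b) const {
        return a->freq > b->freq;
    }
};

// Repeatedly remove the two minimum-frequency nodes and combine them
// under a new parent whose frequency is their sum, until one tree remains.
Node* buildHuffmanTree(const std::vector<std::pair<char, int>>& freqs) {
    std::priority_queue<Node*, std::vector<Node*>, ByFreq> pq;
    for (const auto& p : freqs)
        pq.push(new Node{p.second, p.first, nullptr, nullptr});
    while (pq.size() > 1) {
        Node* a = pq.top(); pq.pop();   // smallest frequency
        Node* b = pq.top(); pq.pop();   // second smallest
        pq.push(new Node{a->freq + b->freq, '\0', a, b});
    }
    return pq.top();                    // root of the Huffman tree
}
```

For the lecture's frequencies this produces the same sequence of combinations: 10+12=22, 15+16=31, 18+22=40, 31+40=71, and finally 55+71=126 at the root.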
71 becomes the parent of both of them. Again, as before, we represent the left branch by the symbol 0 and the right branch by the symbol 1. Continuing, the only combination left is 71 and 55, which gives 126 at the root: on the left we have E and on the right we have the composite node 71. We note what has happened in this structure: the symbol that occurs with the maximum frequency has come right towards the top of the tree, while towards the lower levels of the tree are the symbols that occur less frequently.

All that we need to do now is associate a code with each symbol, using the values 0 and 1 that we have associated with the left and right branches. For example, E can be represented simply by 0. How would we represent A, F, D, B and C? We traverse the tree and use the bits that we encounter along the way to represent each individual character. So let us look at how we find a code word. We follow the path from the root node down to the leaf node. Traversing towards E, we immediately come across its leaf node, so we simply assign it the value 0. If we want to encode A, we have to traverse the path 1, 0, 0, so 100 becomes the code for A. Similarly, F is 101, D is 110, B is 1110 and C is 1111. These are the codes we can use to represent the different symbols.

These codes have a very peculiar property. Can you guess what the property is? Let us look at it. These are called Huffman codes, by the way. If you examine the code words generated, no code word in the system is a prefix, that is, an initial segment, of any other code word. Let us go back to the previous slide to look at this peculiar property once again. For example, 0 will never occur as a prefix of any other code; you can see that all other codes start with 1. Similarly, if you take the whole word 100, it will never occur as a prefix of any longer code, and 101 similarly will never occur as a prefix.
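The traversal that assigns these code words can be written as a short recursion: append 0 when going left, append 1 when going right, and record the accumulated path whenever a leaf is reached. A sketch in C++ (the Node struct and function name are illustrative; the tree would be the one built in the previous steps):

```cpp
#include <map>
#include <string>

// A leaf carries a character; internal nodes are marked with '\0'.
struct Node {
    char ch;
    Node* left;
    Node* right;
};

// Walk from the root, appending '0' for a left branch and '1' for a
// right branch; when a leaf is reached, the accumulated path is the
// code word for that leaf's character.
void assignCodes(const Node* n, const std::string& path,
                 std::map<char, std::string>& codes) {
    if (n == nullptr) return;
    if (n->left == nullptr && n->right == nullptr) {
        codes[n->ch] = path;           // leaf: record the code word
        return;
    }
    assignCodes(n->left,  path + "0", codes);
    assignCodes(n->right, path + "1", codes);
}
```

Run on the example tree, this yields exactly the codes above: E = 0, A = 100, F = 101, D = 110, B = 1110, C = 1111.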
110 will never occur as a prefix either. Notice that the two 4-bit codes in this example are 1110 and 1111, and neither of them appears as a prefix of any other code in the set. This property permits us to write the codes as one continuous sequence of bits and still uniquely identify every character occurring inside the file. This is the reason why Huffman codes are extremely popular. As a reference, the Wikipedia articles on prefix codes and variable-length codes describe these in more detail; those of you who are interested can look up Huffman coding and several other types of codes there.

So to conclude, we looked at Huffman coding today. We saw how variable-length codes can be generated by taking the different characters and their frequencies and building a binary tree from which the Huffman code for each individual character can be identified. Thank you.
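The prefix property is exactly what makes decoding the continuous bit sequence unambiguous: we can scan the bits greedily and emit a character as soon as the accumulated bits match a complete code word. A minimal C++ sketch (using a code table rather than walking the tree; the names are illustrative):

```cpp
#include <map>
#include <string>

// Scan the bit string left to right, growing a buffer until it matches
// a complete code word. Because no code word is a prefix of another,
// the first match found is always the correct one.
std::string decode(const std::string& bits,
                   const std::map<std::string, char>& codeToChar) {
    std::string out, current;
    for (char bit : bits) {
        current += bit;
        auto it = codeToChar.find(current);
        if (it != codeToChar.end()) {
            out += it->second;   // full code word matched: emit character
            current.clear();
        }
    }
    return out;
}
```

With the example's table, the bit string 0100 decodes unambiguously as E followed by A, with no separators needed between code words.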