 OK, let's get started. So today, we'll discuss collection types other than arrays and matrices. Thus far in the course, the only collection types we have discussed were vector<T> and matrix<T>. These were very regular data structures, mainly used for numerical-analysis-type programs: Gaussian elimination, inversion of matrices, and so on. The dominant access pattern for vectors, 2D matrices, or even multidimensional matrices is that there is one compound or collection variable. There's only one name for the whole matrix, but it consists of a lot of atomic values, which are organized in a 1D, 2D, or 3D grid. To address a cell in the grid, you need to give one or more indexes, along the row, along the column, et cetera. Once you give the index, you get back a cell of the matrix, which you can either read or write. That was the contract between the collection and the outside world. But that's only one paradigm for accessing collections. Today, we'll look at four more very useful collection data structures and at some applications where they can be used. To motivate this, let's look at two problems. The first problem is that you are given an input stream of numbers. To make things simple, let's assume these numbers are all digits, so there are 10 possible numbers, 0 through 9. And you have to maintain a running histogram of the number of times you have seen each digit. We now know that's easy to do. All you need to do is create a vector with 10 elements; the indexes will be 0 through 9. Then when you see the next number, num, all you need to do is say ++histogram[num]. That will maintain the count, provided you initialize the vector to all zeros. Now problem 2 is actually very similar to problem 1. But before that, let's do a problem 1.1. What if I did not a priori restrict the number of distinct numbers to just 10?
They were not the digits 0 through 9. I could say that I'll give you a stream of 2,500 numbers, say, but each number could be anything between 0 and the maximum possible integer. Now if the number of incoming numbers is only 2,500, then it seems like a waste to declare a vector which is 2 billion long just because you can receive any integer as input. Eventually there will be at most 2,500 distinct values. But if I had to use a vector to do this, I'd waste an awful lot of space, because my vector has to be as long as the number of distinct nums that could happen, not the ones that actually happen in the incoming sequence. So that's a problem. With a vector, the only mapping I can implement is a dense map, where I create a cell for every possible incoming value. If I don't want to do that, then I have to do this sparse vector business, and that's a little tedious, because every time a new number comes in, I have to do a linear scan through the array to see if I have seen that value before; otherwise I have to extend it. So it's kind of a messy job. Similarly, problem number 2 is where I'm given a piece of text, let's say a sequence of words, and I want to count how many times each distinct word appears in the text. Now here again, if words can be, say, up to 10 characters long, there are an awful lot of words that are 10 characters long. So in principle, the space of words is very, very large. But if the incoming text has 2,500 words, then there can be at most 2,500 distinct entries in your mapping. Ideally, you'd like to code exactly as before, as simply as we used to do with vectors. In other words, when I see a word called word, I should still be able to say something like ++hist[word]. So whatever the count of word is, increase it by 1. C++, and later Java, give us the facility to do that last line. You write it exactly that way, in fact, in C++, without regard to how sparse the input space is.
In other words, even if the number of potential words you can see is in the billions, suppose the incoming text has 2,500 words and 1,000 distinct words. You need some sort of data structure which has the following properties. First, the space taken by the data structure should be proportional to the number of distinct words you see, not the number of possible words which can appear in the input sequence. Second, incrementing the count of a word should take about constant time. That was what happened in vectors, and we'd like to keep up that good performance. So the two criteria are: one, space proportional to whatever payload you are actually storing, not the potential input space; and two, unit time or constant time updates. And reading: if you want to read the current frequency of a word, you should be able to read it in O(1) time, constant time. These are provided by a data structure which is called a hash table in computer science. In C++, it's provided by the standard library (std) as unordered_map. Remember, so far when we discussed collection objects like vectors or matrices, there was only one type of object inside. That's why we said vector<T> or matrix<T>. Now we have two kinds of objects inside a mapping. A map takes something of type K, the key, and maps it to something of type V, the value. So the unordered_map template has to be instantiated with two types, as against vectors or matrices, which have to be instantiated with only one type. In our example of counting the words in a text, the key type K is string. The value type V is int, or if you're anticipating very large counts, then maybe long or long long. So the declaration looks like unordered_map<string, int>, and then you say the name of the variable, which is the histogram, hist. hist is your chosen name. The stuff before that is the full specification of the map type. It's an unordered map mapping strings to integers.
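As a minimal sketch of the declaration and the counting line just described: the function name count_words and the choice to split the text on whitespace are assumptions made for illustration; only unordered_map<string, int> and the ++hist[word] idiom come from the lecture.

```cpp
#include <sstream>
#include <string>
#include <unordered_map>

// Count how many times each distinct word occurs in `text`.
// Words are taken to be whitespace-separated tokens (an assumption for
// this sketch; the lecture does not say how the text is tokenized).
std::unordered_map<std::string, int> count_words(const std::string& text) {
    std::unordered_map<std::string, int> hist;  // key type string, value type int
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        ++hist[word];  // whatever the count of word is, increase it by 1
    }
    return hist;
}
```

Called on a 2,500-word text with 1,000 distinct words, the returned map holds 1,000 entries, regardless of how many words could in principle have appeared.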
Why is it called unordered? We'll see that later on in more detail. It means that the keys, the words, are not assumed to have any order relationship with each other. So the unordered map, while mapping from words to counts, will not order the words in any significant way. There is no sense of an ordering between the words. Initially this map, when declared as hist, will be empty. There will be no mapping from strings to integers. If you want to, at any point, read the current frequency or count of the word hello, the syntax looks very similar to vector syntax. In fact, by the magic of C++, you can just have a quoted string, "hello", and that is silently converted into a string inside. So you can actually write literally hist["hello"]. And if the key hello exists in the current map, you will get back its current frequency. Yes. Say a word, the maximum word in the case. Yeah. So there is actually, there are two parts, some word inside it, respected characters. Then for every letter, for a character, it will go into object correct. Yeah. Go ahead. Bit edit. Yeah. Then bit. No, I'm not even discussing how this is implemented yet. So let's not go into that. Actually, the implementation is very complicated. No, no. I'll describe briefly how it's implemented. It's not like that. You can do it that way. Perhaps that's a different data structure. But that won't give you constant time access. It's not. We'll discuss it offline. So if the key hello did not exist in the map, then something funny will happen. If you access the count of hello, it will be silently created for you. The mapping will be created, even if it didn't exist. And by default, the initial count will be initialized to zero. So if you didn't insert any key called hello into the mapping, and you write int hn = hist["hello"], the value of hn will be zero. Not only that, after this statement, hist will contain a mapping from hello to zero. This may not always be what you want.
You may want to check if hello is in the map without creating a mapping if it's not there. We'll see how to do that soon. But you can also assign. You can explicitly say hist["world"] = 5. If there was no entry for world, this will create a new entry for world and map it to five. If there was already a mapping for world, that would be destroyed. It's just like a vector: if you say vec[5] = 13, whatever was in vec[5] earlier will get wiped out. Similarly here, if world existed as an entry in the mapping, that will be destroyed, and a new mapping will be created from world to five. So it's very intuitive, exactly similar to how a vector would behave. And finally, because of this facility of automatic creation of a mapping on first use, you can also write that last statement, ++hist["world"]. What does ++ do? It accesses the old value and increments it by one. That will happen transparently here. If world has no entry yet, the first time you call this, hist["world"] will become one, because it will start implicitly from zero and then go to one. So that's what an unordered map is. And the very next question is, of course, how do I iterate through it? In vectors, the easiest way to iterate is by the index, which goes from 0 to size minus 1. Here also there is a size call. You can ask for the number of entries in the map. But now there is no natural notion of an index. There's no natural ordering between the entries. But there is still an iterator provided. The iterator looks exactly the same as in vectors. You start off an iterator exactly the same way as in vectors, by calling begin. All these collection types are written to be extremely consistent with each other in terms of what syntax and methods they support. So given an unordered map hist, you can call begin on it and end on it.
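The access rules above (silent creation on read, overwrite on assignment, increment from an implicit zero) can be sketched as follows. The names Hist, has_key, and access_rules_demo are made up for this illustration. The presence check uses find, a standard unordered_map member that looks a key up without inserting it; the lecture has not introduced it at this point, so treat it as one possible way to do the check.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

using Hist = std::unordered_map<std::string, int>;

// True if `key` has an entry. find() only looks; it never inserts,
// unlike operator[].
bool has_key(const Hist& hist, const std::string& key) {
    return hist.find(key) != hist.end();
}

// Applies the operations discussed above to an empty map and returns it.
Hist access_rules_demo() {
    Hist hist;                 // initially empty
    int hn = hist["hello"];    // key missing: entry hello -> 0 is created
    assert(hn == 0);           // the freshly created count reads as zero
    hist["world"] = 5;         // key missing: entry world -> 5 is created
    hist["world"] = 13;        // key present: the old value 5 is overwritten
    ++hist["peach"];           // created as 0, then incremented to 1
    return hist;
}
```

After access_rules_demo runs, the map holds exactly three entries: hello with 0, world with 13, and peach with 1.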
In all these collection types, the understanding is that if the collection is non-empty, calling begin positions the iterator on the first element of that collection. In the case of a vector, there is a well-defined first element, the 0th element. In the case of an unordered map, the first element could be any element. There's no guarantee of where you'll start, but it will position you on some element. The value returned by begin is of type unordered_map<K, V>::iterator. Now you're not supposed to print that iterator. You can only do a few things with it. So the code for a for loop which runs through the key-to-value entries in a map looks like this: for (unordered_map<K, V>::iterator hx = hist.begin(); hx != hist.end(); ++hx). In this case, K would be string and V would be int. I'm just shortening the code on the screen. hist.begin(): that's how you initialize it. Then you test whether hx is equal to hist.end(). If it's equal, then you quit. For any collection, end() is an illegal value of the iterator, past the last entry in that collection object. So you cannot really access hist.end(). It's like a sentinel or guard value which protects you from going past the end of the data structure. So the test is hx != hist.end(). And then you do ++hx. ++ has been magically redefined so that it now advances your iterator over a hash table or an unordered map rather than over elements of a vector. Now inside the loop, how do you use hx? Note that hx iterates not over single values but over pairs, the key and the value. You access the key by saying hx->first, that is, hx, dash, greater-than, first. And you access the value as hx->second. So first gives you the key, second gives you the value. Now let's see a demo of this part of the code at work. Yeah. Yeah. Yeah, in this case it doesn't matter. If you use that expression's value in something else, then it has the same meaning as incrementing an integer. So here is my main function.
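A minimal sketch of the loop just described, with K as string and V as int. The function name total_count is hypothetical, and summing the values is only there to give the loop something to compute. Because the map is passed by const reference, the sketch uses const_iterator, the read-only flavor of the iterator type.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

// Visits every (key, value) entry exactly once, prints it, and returns
// the sum of all the values. The visiting order is unspecified.
int total_count(const std::unordered_map<std::string, int>& hist) {
    int total = 0;
    for (std::unordered_map<std::string, int>::const_iterator hx = hist.begin();
         hx != hist.end();   // end() is the past-the-last sentinel
         ++hx) {             // advance to some next entry
        std::cout << hx->first << " -> " << hx->second << '\n';  // key, value
        total += hx->second;
    }
    return total;
}
```

On an empty map, begin() equals end(), so the loop body never runs and the function returns 0.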
So I declared unordered_map<string, int> hist. And I print hist. We'll see the print code very soon. This should just print an empty mapping, because I have not inserted anything into hist yet. Then I print out hist of junk. Note that junk is not in the histogram. But by our rule, a mapping will be created for junk with value 0. And so I print again to verify that. Next we do ++hist["apple"]. We do ++hist["apple"] again. So after this point, the count of apple should be 2. And ++hist["peach"] should result in peach having a count of 1. But you could also do this sort of thing. You could say hist. Let's do that. And now we finally print hist again. Forget about those two things. So let's see what this code does. If you want to look at print, let's do that before we actually run it. Print is very simple. Again, it has two type parameters, the key type and the value type. And then it takes an unordered map by reference. We're not going to change it; I'm just going to print it. We start by printing a curly bracket, because inside the collection is a set of mappings. It's unordered, so to keep it simple, I'm writing it as a set. So, curly bracket. And then I create an iterator over the object. I write here const_iterator. That means that this iterator will not attempt to change anything in the collection. With an iterator you can also insert at the iterator position, the cursor position, and you can delete from the cursor position, as we have seen in vectors. If you're not going to do that, const is a hint to the compiler that this iterator is a harmless read-only iterator. It will not change the data structure. If you omit the const, this is a writable iterator. We'll see some examples of that also. Now we start from begin, we check against end, and we increment the iterator ux. Inside, I'm just going to print first, then an arrow pointing to second. This means first maps to second. So ux->first will print the word, and ux->second will print the count.
Then I'll print a comma in anticipation of the next entry. And finally, once I'm done, I'm going to close the curly bracket and print a new line. So this is the print routine. So: print, hist of junk, print, a bunch of things, print. Now it turns out that there's a big fight about C++ standards and whether unordered map is in the standard or not. Different standards committees are still working on it. The two most recent big standards of C++ are the ANSI C++ and now the ISO C++. Unordered map is part of the new ISO standard, the one g++ refers to as C++0x. And to enable that, you have to say -std=c++0x, for some reason. We don't argue, we'll just provide it. If you don't, g++ will complain and tell you to add that string. So I'm going to compile it. And now unordered map is done. See that the initial print is empty. There's nothing in the map. Then I read hist of junk, thereby creating a mapping, and I get zero because junk is not mapped to any value initially. So hist of junk comes out to be zero. But now if I print hist, you see that there's a mapping from junk to zero. Finally, after I do all my manipulations here, I get these entries: apple with a count of two, the old junk with a count of zero, pear with a count of minus five, and peach with a count of one. So there are four things in this mapping. Now unordered map only allows at most one entry with each distinct key. You can never have two entries with the first value being the same. That comes with a slight qualification, in that this is clear-cut for discrete key types, like integer or string. If your key is a float or a double, then by now you know what to expect. You are allowed two different entries provided the floats differ in even one bit, no matter how close they are. If the test of equality fails between two floats, then you're allowed two different entries with them.
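To make the floating-point caveat concrete, here is a small sketch. The function names and the idea of replacing each double by the integer index of its grid cell of width tol are assumptions added for illustration; the lecture only says that keys meant to be equal must be made bitwise equal before they enter the table.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <unordered_map>

// Number of distinct keys when the raw doubles a and b are used as keys.
std::size_t distinct_raw_keys(double a, double b) {
    std::unordered_map<double, int> m;
    ++m[a];
    ++m[b];
    return m.size();  // 2 whenever a == b is false, however close they are
}

// Hypothetical helper: index of the grid cell of width `tol` nearest to x.
long long grid_index(double x, double tol) {
    assert(tol > 0.0);
    return std::llround(x / tol);
}

// Same count, but keyed on the discrete grid index instead of the double.
std::size_t distinct_grid_keys(double a, double b, double tol) {
    std::unordered_map<long long, int> m;
    ++m[grid_index(a, tol)];
    ++m[grid_index(b, tol)];
    return m.size();
}
```

With a = 0.1 + 0.2 and b = 0.3, the equality test a == b fails in double arithmetic, so the raw version stores two entries, while the grid version with tol = 1e-9 stores one. Two values that straddle a cell boundary can still land in different cells, so this is a sketch of the idea rather than a complete tolerance scheme.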
But in your application, if you anticipate that key values within a certain tolerance of each other should be treated as the same, you have to carefully ensure that they're bitwise the same before they enter the hash table. Otherwise you'll get two different entries. If one key is one, and the other key differs from one even very slightly, they'll end up being different keys in the map. So be careful about that. With integers and strings, there's no danger, because they're discrete types. So anyway, this is how a map works. Observe that when I iterated through the map to print it, there was no guarantee of the order in which it would be printed. If you compile this same code on your machine, you may well get a different ordering. Is this deterministic? If I run this one more time, I seem to get the same order, but technically that's not guaranteed. So don't depend on that ordering. In fact, inserting one more key here may completely upset the relative ordering between these guys. So now I can talk a little bit about how this is implemented. But it's a very long story. There are books written about unordered maps. So I'll give a very short, five minute exposure to how unordered maps are implemented. Remember the two things I want. I want storage to be the amount of stuff you're actually storing and no more, typically. And I want constant time updates and access. So what is done is the following. The main data structure is called the buckets, and that's a vector, typically. In this case, it's a vector of ints. I also keep a bit vector to record which cell is already taken and which cell is empty. Initially, all cells are empty. Now, when the string hello comes in, I apply what's called a hash function to the string. I'll describe some example hash functions in a moment. This hash function has a domain, which is all strings, and a range, which is the integers between 0 and n minus 1. What is n? We'll discuss that too.
For now, assume that n is comparable to the number of keys you want to insert into the collection object. Now, this hash function will eventually pick some bucket. Say this one. A good hash function makes all of these buckets equally likely for any given string. So the hash function is sort of like a deterministic pseudo random type of number. I'll describe that in a moment. But basically, given a string, it should be very hard for you to predict which bucket it'll go into. All of them should be equally likely. So given two strings, the probability that both of them will go to the same bucket should be equal to one over number of buckets. So as you increase the number of buckets, the probability of a collision should go down to 0. Now, it's possible that hello and world may collide. Suppose your mapping says that hello has to go to 5 and world has to go to 3, say. Suppose hello was hashed first. Then this place was 5, and you said this is full. When you now try to hash world, if world maps to a different cell which is empty, then you're done. You just stick in 3 there. If you're unlucky, and hello and world both hash to this same bucket, then you'll see that this bucket is full. And you'll keep walking to the right until you find the first empty cell. And you'll stick the 3 in there. So anytime there's a collision, your access time may increase from constant. So you want to decrease or minimize the number of collisions. This is done in two ways. One is by designing good hash functions, which, given a particular string, will make it equally likely to be anywhere on the bucket space. And second, by allocating enough buckets. If you distribute evenly, and you have many, many buckets, then the probability that two strings will map to the same bucket will be smaller and smaller. So what's the hash function? A very common hash function is something like this. So you initialize h to, say, some prime number like 11. 
And then you say, for each character in the input string, h equals the character times h, plus some other prime p. You do that, and then finally return h % n. So this function is nothing but a deterministic mishmash of the characters that are coming in. Remember, these characters are also integers; they can be interpreted as numbers. So you keep on multiplying the incoming character with the old value of h, and you add some garbage to it. The goal is to avoid predictability. But it should be deterministic. If I hash a second time using the same function, I should get the same value. Now, because of this calculation and the percent, it's perfectly possible that two strings will map to the same eventual hash value. But it's known that for function classes of this form, the probability of that is small. So that's a typical hash function on a string. Unfortunately, given that the course is drawing to a close, I won't have time to go into the details of the properties of hash functions. There are entire books written about hash functions. So that's how a hash table is implemented. The advantage is that calculating the hash function is very fast. You run through the string, you get a number which looks like garbage, but it's actually a deterministic map from the string to an integer. Then you take that modulo the number of buckets, and you get a home to go to. If there's already someone there, you keep hopping to the right until you find an empty place. This is done both for updates and for access. Even during access, you go there and you check this bit. At the first bit which is set, you can just read off that value and return it. So that's how this structure works. Now let's go back to some more applications and code. Yes? So you say, give me the current count of peach or something. You apply the same function to it. You find a bucket. If that bucket is occupied, you return that value.
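The hash function described above can be sketched like this. The starting value 11 and the shape of the update (multiply the old h by the incoming character, add a prime p, reduce modulo n at the end) follow the lecture; the name toy_hash and the choice p = 31 are assumptions for illustration, and this is not the function any particular standard library uses.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy string hash: maps any string to a bucket index in 0 .. n-1.
// Deterministic: the same string and the same n always give the same bucket.
std::size_t toy_hash(const std::string& s, std::size_t n) {
    assert(n > 0);             // there must be at least one bucket
    const std::size_t p = 31;  // "some other prime"; 31 is an arbitrary pick
    std::size_t h = 11;        // initial prime from the lecture
    for (unsigned char c : s) {
        h = h * c + p;         // unsigned overflow wraps around, which is fine here
    }
    return h % n;
}
```

For the one-character string "a" (character code 97 in ASCII), h becomes 11 * 97 + 31 = 1098, so with n = 100 buckets the home bucket is 98. Two different strings can still land on the same bucket, which is the collision case discussed above.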
If it's not occupied, you keep walking until you find the first bucket to the right of that bucket which is occupied, and you return that value. You follow the exact same logic while inserting things and retrieving things. So for the search, once you land at the bucket, if the bucket is full, what you should also do is store the key inside here, the key which led to that value. If it matches, then you can return that. So there's a small detail there. Otherwise you might return a noisy value sometimes. If the bucket is full, it might be from someone else who collided with you. Say hello and world collided, and hello was inserted first. And then I ask for world. If you only check for presence, you will return the count of hello instead of the count of world. So the other thing you have is the key itself. You have another array of the keys if you want to avoid that confusion. When you inserted hello, this cell has hello in it. And now if you ask for world, you've collided. You'll find that the key is not the same. And then you'll keep walking until you find the next non-empty cell which matches the key. Now in some applications, you don't need to do that. In some applications, if you return a wrong value, it's not a disaster. So then you don't need to store the keys. Now, what else can we do with this? You could also erase all occurrences of a particular key in the hash map, or erase on any other condition. So I run an iterator just like I was doing for printing. And if the key is equal to peach, then I erase it. You can always do that. If I run that, then all that will happen on printing it again is that peach will have disappeared. So there was a peach here, and now no more peach. But there is technically no guarantee that deleting peach will preserve the relative ordering of the other guys. That's why it's called unordered. Never assume that the keys are ordered between themselves. Yeah? The value here is five. So there's five between. Minus five. Minus five. Yes.
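A minimal sketch of erasing from an unordered map, assuming a C++11 compiler. The helper names erase_key and erase_below are made up for illustration. For a single key, the erase member that takes a key is enough; for a general condition, the iterator loop relies on the fact that erase applied to an iterator returns the iterator following the removed entry.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

using Hist = std::unordered_map<std::string, int>;

// Erase by key: removes the entry for `key` if there is one.
// Returns the number of entries removed (0 or 1, since keys are unique).
std::size_t erase_key(Hist& hist, const std::string& key) {
    return hist.erase(key);
}

// Erase by condition: removes every entry whose count is below min_count.
void erase_below(Hist& hist, int min_count) {
    for (Hist::iterator hx = hist.begin(); hx != hist.end(); ) {
        if (hx->second < min_count) {
            hx = hist.erase(hx);  // continue from the entry after the erased one
        } else {
            ++hx;                 // keep this entry and move on
        }
    }
}
```

Note that the loop does not increment hx after an erase; incrementing an iterator whose entry has just been erased would be an error.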
So there's five between. No, no, no, no. There's exactly one integer for the value. And the original string is stored somewhere. The key has to be in it. The key is always in it. We'll look at a different data structure called a multimap, which allows multiple entries under a key. Any questions so far about unordered maps? Think of it as a sparse vector, where the keys are not contiguous indices, but could be of any arbitrary data type. All you require is a hash function on the data type. Then you can use the unordered map on it. Yes? Who occupies that space? They are actually initialized to fairly small vectors. And as you keep inserting keys, the system will dynamically adjust. It will reallocate more buckets as it needs. And that's when the relative ordering between elements will change. You have no control over that. And in some implementations, as you delete things and the hash map becomes very empty, it will also release the space to the system. Yeah? How does that? What do you mean, how does it? How does it? Oh, I just drew it here. So I want to turn a complicated key space into a one dimensional space. That's the summary of this, right? No, no, what I try to avoid is a linear search through this vector. The way I do this is I turn whatever key was given to me into one integer, between 0 and n minus 1. I go there and I check whether the key there is the same as what I used or not. If it was the same, then I've hit my cell, right? So let me draw this once again, slightly more clearly. This is the val array. This is the key array. And val may be empty, say. So we'll just use a simple thing. Here's a key which comes in. You hash it. Let's say I'm inserting the key. That's the action. I hash it. Let's say I find that this is the hash value. It's one cell between 0 and n minus 1. I look at this cell. If the key is empty, so let's say empty is represented by null or something, and the value is empty, then I have to insert key with value.
So I place key in here and value in there. That's the hash function. Remember, the hash function is a simple function which takes a string as input and outputs an integer between 0 and n minus 1. Now we will. So that's what I'm saying. Suppose you say insert key, value. I first take the key. I find its hash function value. I go to that place. I check if this cell in keys is equal to key or not. If it is occupied by something else, then that means some other key, key prime, hashed to the same value, in which case I have to start walking to the right until I find an empty slot. And then how would it recognize that this value is not writing in there? Because I stored the key in there. I stored the key in this key array. You're getting a little confused. So let me give specific examples. Insert (hello, 2), followed by insert (world, 3). Here is the keys buffer, and here is the vals buffer. Now suppose hello hashes to location 47. At that point, it was empty. So at location 47, you put hello in keys. And at location 47 in vals, you put 2. Now suppose you're unlucky, and world also hashes to location 47. I go and read location 47. I find that hello is already there. Then I know that the hash value of hello and the hash value of world are equal, and they're both equal to 47. And I have a collision. So now I say, OK, since 47 was occupied, I'll go look for the next chair, or the next slot. And suppose I find 48 to be empty. Now I stick world in here, and I stick 3 in there. OK? Yes. Yes. So tell me what you're searching for. World or hello? World, world. OK, so let's say look up world. Again, I use the same hash function and end up here. I find that there's a hello sitting there. So then I again start skipping, until I reach a non-empty cell whose contents are the same as the key I was looking for. And then I read this, and I return it. Yes, you showed the same. Yeah? It won't be. The only cell where? Maybe before.
But now if we have hello and world, we have to insert something where before world. Tell me what. So now you're going to say insert. Hello. OK. (hello, 5). So again, hello is going to hash to exactly the same position. What? No, no, no, that hash function is deterministic. Every time you call it on a particular string, you are computing the same function. You'll get exactly that same bucket. The key is unique for a function. Value of key is unique. That's why the key has to be unique. Yeah, the p is chosen once, at the beginning of the hash table. The 11 and p, everything, all constants in the hash function are chosen exactly once. That's stored inside the hash table. Yes, that's right. It may change when the number of buckets changes. If the number of keys is getting too large and the system has to change the number of buckets, then it may change that. But within the lifetime of a bucket system, it has to be the same. It will make sure of that by effectively creating a bigger hash table and transferring everything over. I didn't want to go into too much detail because we are running out of time in this course. So that's an unordered map.
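The bucket scheme walked through above (a keys array, a vals array, an occupied flag per cell, and walking right on a collision) can be sketched as a toy class. This is a simplified illustration under stated assumptions, not how any particular standard library implements unordered_map. The class name ProbingTable, wrapping around at the end of the array, stopping a lookup at the first empty cell, and passing the hash function in as a parameter are all choices made for this sketch. There is no resizing and no deletion.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy fixed-size hash table from string to int using linear probing.
class ProbingTable {
public:
    using HashFn = std::function<std::size_t(const std::string&)>;

    ProbingTable(std::size_t n, HashFn hash)
        : keys_(n), vals_(n, 0), used_(n, false), hash_(hash) {
        assert(n > 0);  // need at least one bucket
    }

    // Inserts key -> value, or overwrites the value if key is already stored.
    // Returns false only when every bucket is taken by other keys.
    bool insert(const std::string& key, int value) {
        std::size_t i = find_slot(key);
        if (i == keys_.size()) return false;  // table is full
        used_[i] = true;
        keys_[i] = key;   // store the key so lookups can tell collisions apart
        vals_[i] = value;
        return true;
    }

    // On success stores the value in `out` and returns true.
    bool lookup(const std::string& key, int& out) const {
        std::size_t i = find_slot(key);
        if (i == keys_.size() || !used_[i]) return false;  // key is absent
        out = vals_[i];
        return true;
    }

    // Bucket index currently holding `key`, or size() if key is absent.
    std::size_t slot_of(const std::string& key) const {
        std::size_t i = find_slot(key);
        return (i != keys_.size() && used_[i]) ? i : keys_.size();
    }

    std::size_t size() const { return keys_.size(); }  // number of buckets

private:
    // Starts at the home bucket and walks right (wrapping around) until it
    // finds either the bucket holding `key` or the first empty bucket.
    // Returns size() if neither exists.
    std::size_t find_slot(const std::string& key) const {
        const std::size_t n = keys_.size();
        const std::size_t home = hash_(key) % n;
        for (std::size_t step = 0; step < n; ++step) {
            std::size_t i = (home + step) % n;
            if (!used_[i] || keys_[i] == key) return i;
        }
        return n;
    }

    std::vector<std::string> keys_;  // which key led to each value
    std::vector<int> vals_;          // the stored values
    std::vector<bool> used_;         // bit vector: is this cell taken?
    HashFn hash_;
};
```

To replay the example from the lecture, construct a table with 100 buckets and a deliberately bad hash function that sends every string to 47. Inserting (hello, 2) then (world, 3) puts hello in bucket 47 and world in bucket 48; looking up world skips past hello and returns 3; inserting (hello, 5) lands on bucket 47 again and overwrites the 2.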