Now, I said that if you try to access a key which does not exist, it will be transparently created for you. That's not always what you want. Sometimes you want to test whether a key exists without updating the data structure. You do that with the find method, just like find existed for vectors. So you say hist.find(piece), and that returns another iterator. If the key piece exists in the map, you get a valid iterator from which you can access first and second. First will, of course, be piece; second will be the count of piece. But if the key piece did not exist in the histogram, then the histogram will not be modified. Instead, the returned iterator will be equal to hist.end(). Remember, hist.end() is a special value which shouldn't really be accessed; first and second of end() don't make any sense. But you can compare against end(). So you can say: if hist.find(piece) is not equal to hist.end(), then do some stuff with the key and value here. So let's demo that in the code. Check out this code. I'm trying to find "hello", and I'm comparing the result against hist.end(), then printing the comparison. So that's a Boolean: whether hist.find("hello") is equal to hist.end() or not. If it is equal, you haven't found "hello"; if it is not equal, you have found "hello". Now, nowhere did we insert "hello", so this comparison should come out true, printed as 1. In the second, I'm looking for "apple", and I should find it. Therefore, it should not be equal to hist.end(), and I should be printing false, or 0. And then comes a slightly dangerous piece of code: if find("apple") had turned out to be end(), then ->second would be a bug. But since here we know that "apple" is in the structure, I'll just write, as a shorthand, hist.find("apple")->second. That's the count of "apple". This is another way of writing hist["apple"], except that it will not create a mapping for "apple". If "apple" didn't exist, this could well crash the program.
So it's in fact good to test first, and only access second if hist.find("apple") is not equal to hist.end(). As you can see, after this point, the first Boolean printed is 1, meaning "hello" was not found. The second Boolean printed was 0: "apple" was found. The third was apple's count, which is 2. And finally, I print the histogram again. Observe that "hello" was not in the histogram, and no mapping for "hello" has been created: find does not create a mapping if the key didn't exist. Dereferencing end(), on the other hand, is a bug, but by now you know the culture of C++: you're allowed to shoot yourself in the foot as often as you want. So you can try it. It may give garbage. It may crash the program. We have no idea. All right, so that's unordered_map, and testing if a key exists. Now suppose we want to print the histogram, but sorted by words. Let me run another piece of code here. Now that we know how to collect word counts, there's a bigger program, wordcount.cpp. This is quite simple. main declares a histogram, hist. Then I keep reading words from cin. At some point, cin runs out: you're not willing to type any more, or the end of file is reached. That's detected by a method called fail. So far, we haven't used this fail method on cin; we'll look at it later on. If cin fails, you quit: cin has nothing more to offer you. Otherwise, you read the next word from cin, and if the word is non-empty, you do ++hist[word]. So in about six lines, this is the easiest program to collect word counts from a text file. For example, suppose I have the following text, just one small file, and I want to count how many times each distinct word appears. How do I do that? I'll run wordcount.exe. And what does the code do at the end? Eventually, I just print hist, just like before.
Because it reads cin, instead of typing things by hand, I'll just pipe in question.txt. That does the counting and says that "matter" appears in the text once, "r" appears five times, "the" appears twice, "off" appears twice, and so on. Of course, word counts are very important in web search, because words which are frequent are low-information, like "r" and "n", whereas words like "pneumonia" are much rarer. If you analyze a few billion pages, the counts are extremely indicative of how important a word is for the search. So that's one big application of word counting. But you need to count words all the time; hence this wordcount application. Now suppose I want to print this histogram. As you observed, when you print the histogram, it does not come out in any specific order: it starts with "matter", goes to "time", goes to "smarter", and "argument" is somewhere in the middle. There's no sorted order; it's an unordered map. Suppose I want to actually order the words while printing the counts. There are two ways to do that. One is to first extract all the keys in the mapping into a vector. We know how to sort a vector, so after that, we get a sorted vector of just the keys. Then we iterate over the vector while extracting the counts from the histogram. This should be fairly clear; you should be able to write that code. But just to go through it: in printSorted, we get an unordered map, um. We initialize a vector of strings, keys. We run an iterator ux through the unordered map, so we're getting keys in no particular order, but we keep pushing them back: ux->first gives me each key, and I push it back onto the keys vector. Then I sort keys. After that, I run an index kx from 0 to keys.size()-1 and read the keys from keys, but then I access um[keys[kx]]. That gives me the keys in sorted order along with the counts. But there are other ways of doing it, so you don't always want to do this.
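The printSorted routine walked through above can be sketched as follows (function and variable names follow the lecture's description; the exact signature is an assumption):

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

void printSorted(const unordered_map<string, int>& um) {
    vector<string> keys;
    for (auto ux = um.begin(); ux != um.end(); ++ux)
        keys.push_back(ux->first);       // keys arrive in no particular order
    sort(keys.begin(), keys.end());      // now the vector of keys is sorted
    for (size_t kx = 0; kx < keys.size(); ++kx)
        cout << keys[kx] << " " << um.at(keys[kx]) << "\n";  // at() never creates entries
}
```

Note that at() is used instead of operator[] here, since um is const and operator[] would try to create missing entries.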
The other is to use an ordered map, which in C++ is simply called map. map<K,V> behaves almost identically to unordered_map, except that when you run an iterator over it, it's guaranteed to give you the keys in sorted order of K. So let's see a demo of that. I'll comment out some of the later examples. Look at this main: it's identical to the body of the word count program. If you remember the wordcount main — an unordered map; while cin has not failed, read a word and increment the count of the word — this is exactly the same thing, only I declared hist as map instead of unordered_map. Let's see what that does. With that same question.txt, if I pipe it into sortedmap.exe, I now get the printout in increasing order of the words, starting with "able" and ending with "east", with their respective counts. What do you include? The header to include is named the same way: it's just <map> instead of <unordered_map>. Now, a further advantage of sorted maps is that because there is a natural ordering on the keys, you can do what is called range traversal using an iterator. You can also find. Remember these three things: find, lower_bound, and upper_bound. I'll now demo what they mean. Question: why not always keep the keys in order? Because operations on an ordered map are more expensive, precisely in order to keep the keys in order. You can no longer use the earlier data structure, because the hash function can take you anywhere, so you have to use a different kind of data structure, and there are various options: you can use what's called an AVL tree, or a red-black tree, and several others. These are all things that are taught in data structures courses in computer science. But it's more expensive: every single operation takes more time than in the hash map. Question: what happens when the hash table gets full?
You need some deterministic rule by which you home in on the correct cell during both insertion and reading. That's all. Ordered maps follow a totally different logic, which we don't have time to discuss. So, in the case of ordered maps, the key space has some natural total order on it. Now suppose I say find(k): just like find in unordered maps, if the key does not exist, you're going to get end(), which is beyond the last element — plus infinity, if you will. If the key is there, your iterator homes in on it. But suppose I have compiled the marks of all students in this class, and I don't know if there is, say, a John in the class. Maybe, maybe not. And I don't know if there is a Nick in the class. Maybe, maybe not. But I want to compute the average marks of students named between John and Nick. If John exists, then the iterator will be positioned at John if you call lower_bound instead of find. If John does not exist, then the iterator will be positioned on the very next larger key. Similarly, if you call upper_bound with argument Nick, your iterator will be placed on the first key strictly greater than Nick — just past Nick if Nick exists, and on the next larger key otherwise. Now if you get two iterators like this, you can iterate between them, just like between begin and end: it doesn't matter what your starting and ending points are, you can always run your iterator through those values, and you will visit every key from John through Nick inclusive. So those are find, lower_bound, and upper_bound: you find a key, you take the lower bound on a low key, and you take the upper bound on a high key. That's their purpose. Let's see some examples. Remember what was being printed: able, acting, action, argument, at, clearly, and so on. Now I'm going to do two searches using bounds. find we have already seen; it works exactly the same way, nothing particularly interesting.
The first iterator I'll pick is a lower bound on "best". Note that "best" doesn't even exist here; it falls between "at" and "clearly". So that lower bound will initialize lowx, the iterator, on "clearly". The iterator highx takes the upper bound of "nano", which also does not exist — it falls between "more" and "off" — so highx will be placed on "off", the first key after "nano". But when you iterate from lowx up to highx, you will go from "clearly" to "more" inclusive. So now I'm going to just print this out, not erase it: I want to print all keys between "best" and "nano". If a bound key matches an existing key, it is included; if it doesn't match, the rule is as I drew on the piece of paper. Then I do the final print, but that's the same thing as before. So what I print inside is all the keys between "best" and "nano", which is "clearly" through "more", with their respective counts. Now, instead of printing, you can choose to erase. An iterator is just a cursor; you can always erase what is at the cursor. And then we print again, which should print the remaining items. So you're saying: delete everything in the range "best" to "nano" and keep the rest. And indeed, after the deletion, observe that "act" is followed by "off": everything between "clearly" and "more" — that is, between "best" and "nano" — was deleted, and you end up with "act" being followed by "off". These are very powerful primitives. If you know how to use these classes, you can code many of the things you have done earlier more simply, or at least more easily. Some numbers: unordered maps are supposed to do insertions, deletions, and key lookups in constant time on average. Whereas if an ordered map currently holds n elements, inserting another element costs O(log n) time, deleting one takes O(log n) time, and looking one up also takes O(log n) time. So all operations are O(log n) when the current size of the ordered map is n elements. Any questions about ordered maps?
Because ordered maps are always kept ordered, you can always extract the minimum element relatively quickly, in O(log n) time. Suppose you declare map<K,V> myMap. If myMap is non-empty, then myMap.begin() points to the entry with the smallest key at all times, and you can keep removing it: after recording the key and value, you remove it with myMap.erase(myMap.begin()). So let's try that. You can test for emptiness: while the histogram is not empty, I can say cout << hist.begin()->first — that's just the key — and then hist.erase(hist.begin()). This keeps pulling out the smallest current element in the ordered map. If between two such removals you inserted something else, it would go into its rightful ordered position. As you can see: at, off... all the keys come out in sorted order, except that we know the range from "clearly" through "more" has been deleted, so "at" is followed by "off", and the rest follows in sorted order. So that's ordered maps for us. Any questions? Now, unordered_map and map allow at most one entry for each distinct key. You either have a mapping for the key or you don't; you can never have two mappings under the same key. Sometimes you need that. This is not a loss of generality, because if you need multiple values for the same key, then instead of V being an int, as in our case, you could always use a vector of ints and store all the values you want. But it's a little tedious. So for convenience, C++ also provides what's called a multimap, where there can be multiple entries under the same key with different values. Now, the notation hist[key] is no longer provided, because it's ambiguous which entry you mean: if there are multiple entries under a word, you don't want a construct like this, because it's confusing. Instead, you can only access entries through iterators, and you insert pairs of (k, v) into it. So this is what multimap code looks like. You include the same header, <map>.
And then here is your histogram. While cin has not failed, I keep reading words. Now, there is no longer any bracket construct, so you have to insert something. What do you insert? You have to insert a pair of a string and an int — that's what the multimap was declared as — and you insert the word with a count of 1. This will actually result in many different entries, each with value 1. If you don't want that, you have to use find to locate the iterator position and increment its value. We'll see both examples. In the first case, for every word I see, even if it is already mapped in the multimap, I'll create a new mapping with value 1. So let's run this. Observe that "r" appeared five times, and it's recorded five times, each entry with a count of 1. So a multimap will allow multiple entries. If you don't want that — if you want the old behavior, as in map — then you cannot insert like this; you have to locate the old entry and update it, because inserting always creates a new entry. So let's try the other logic. This is where I say something like: if hist.find(word) == hist.end(), that means the mapping is currently not there, and I can do the old thing and insert an entry with a count of 1. Otherwise, an entry was already there for this word, and in that case I do hist.find(word)->second++. With this, you get what we expect. So a multimap is more general; you can use it in both modes. You may choose to have multiple entries with the same key, or you can choose to update the old record under that key. They are genuinely different entries: if you run an iterator through them, you get distinct records, distinguishable only by their values. So as an informal homework, go read the manual about this: if I have a key "apple" with, say, five entries in the multimap, how do I iterate through exactly those entries with key "apple", instead of using find?
Suppose I want to modify exactly those five records. How do I do that? Read the manual to figure that out. So that is a multimap. Any questions about multimaps? It's a more general access to the mapping structure. Question: what exactly is being inserted? A map is always a map from one thing to another thing; that's why it's a pair. This multimap is from strings to ints, so to insert an entry, I need a pair of a string and an integer. Question: why do you have to write insert of a pair, rather than insert of the word and the count? That's just the convention: insert takes one argument, which is a pair. It's a slight bother; they could have packaged it differently. So that's a multimap. All of these have their place, in terms of where you should be using each of these data structures. The last one we'll look at is a list. So here's a list. It has a beginning and an end. As always, end is an illegal position of the iterator, which you shouldn't be reading. The beginning points to some element A, and the current elements are A, B, C, and E. So it's like a train: coaches are linked to each other, but it's more flexible in some ways. Unlike vector<T>, list<T> does not give you access by index. You cannot say, give me the 24th element in the list; you have to walk through. Just like in a train, you have to keep walking until you find the correct compartment. You cannot jump into it; you cannot instantly go from compartment one to five; you have to pass through the other compartments on the way. Similarly, in lists, you have to walk through, so you only access elements by iterator. But compared to vectors, the big advantage is that once you are at a compartment, you can delete that compartment, or insert a new compartment, at constant cost. This is probably even faster than trains, where you need a shunting yard and so on. You don't need any of that.
So how do I do that? For example, if I want to delete the element B, all I need to do is cut those two links and short A out to C. It's just a de-linking operation. B is now lost to the world: unless the user had some other handle or access into B, B is not accessible anymore. In Java, the system would reclaim the storage of B automatically. In C++, you are in charge of releasing the space — or rather, the list<T> implementation is in charge of releasing it. You don't need to worry about it; list<T> will look after it. But someone has to do it explicitly; the language itself doesn't do it. Similarly, suppose you want to insert D between C and E. The action is very simple: you snip that link and link in a D. So creating new entries anywhere inside a list, once you have an iterator positioned suitably, is very easy. The other operations which are very fast on lists are actions on the front and the back, the beginning and the end. It's very easy to push a new element onto the beginning: before A, if you want to insert a new element, you can do that in constant time. If you want to pop_front — remove A from the beginning of the list — that can also be done in constant time; then begin will point to C, and A will be freed up and given to you. Similarly, pushing something onto the back of the list and popping it off the back are constant time. Just like in a train, it's easy to append compartments at the ends. What is different from trains is that taking out a bogey from the middle is difficult in the case of trains, but very easy here. That's a list. Question: could the elements be strings? Yes — just as you could have a vector of strings, you can have a list of strings as well. The linking is logical: the elements are not contiguous in memory. That's the important thing to understand here: in memory, there is no contiguous layout of these elements.
To put many of these data structures together, in the last part of this lecture we will look at one very interesting application: a simulation of a queue. Queues are everywhere; queuing is sort of an unpleasant part of life. Customers arrive at a queue. Maybe a bank has multiple queues for multiple counters. If they all serve the same purpose, naturally you'll go and join the shortest queue — which you'll then promptly find to be not moving at all while some other queue moves faster. But that's a different story. Now, in queuing theory there are many, many models, ranging from very simple to analyze to very complicated but realistic. When people start out reading about queuing theory, very often they model two distributions. One is the inter-arrival time between successive customers at the bank, or the toll booth, and so on. The other distribution of importance is how long it takes to service you. You're either withdrawing cash, or depositing some money, or paying cash and getting a toll ticket and driving past. What is the actual service time, the actual amount of work that has to be done to satisfy your requirement? That service time is the other distribution. Very often, the same parametric distribution is used for both, with different parameters. An exponential distribution is often used both for the inter-arrival time of successive customers and for the service time of one customer. So what is an exponential distribution? The density at x, given a mean of 1/lambda, is lambda times e to the power minus lambda x if x is positive, and 0 if x is negative. So, roughly speaking, short service times are more likely, but a few jobs may be difficult: maybe you walk up to a cash counter at a bank and ask for exact change for pi rupees — that will take a long time. Or maybe you just want to check your balance, and that's fast. So there is a whole distribution, over customers, of their service times.
And the inter-arrival times of customers may be very short during paydays or something like that, and quite long on idle days. Maybe Saturday mornings are very busy because everyone goes to the bank, while Wednesday afternoons may be quite idle. So why are we interested in all this? Modeling queuing is very, very important, and there are literally billions of dollars in it. When you design a router or a switch for connecting up the internet, it has many ports. TCP/IP packets — network packets, sequences of bits — come in on different wires. You have to detect what address, what computer, each packet wants to go to; accordingly, when a packet comes in on an input port, you have to choose an output port. Now it may well be that one of the output ports is on the path to google.com, so a lot of packets want to go there, while another packet is destined for my home page, where there is hardly anyone. So, depending on that, each port has a queue of packets. And you have to simulate the router before you actually invest billions of dollars building it — at Cisco, for example — to figure out how long that buffer should be so that packets are not dropped excessively. When the buffer gets full, you have to drop packets, or you have to back up and tell the source to hold it. And that's when your browser shows a spinning wheel, so you want to reduce that. At the same time, you want to economize on the amount of silicon and hardware you put into every router, so you have to choose the optimal buffer size for those ports. The same thing holds in real life. I have a highway ending in a toll booth. Do I need 15 toll booths? Do I need 20? How many do I need at peak hours? At off-peak hours, maybe some toll booth operators can go do other jobs. Similarly, in a bank at rush hour, how many counters do I need for acceptable delays to customers?
And during off-peak hours, maybe some of the tellers go do other jobs, like record checking. So queuing is extremely important for optimizing system performance and ensuring tolerable delays to customers. And in many cases, you cannot solve the system in closed form because it is too complicated. For example, instead of going alone to a bank, you go in a gang of three friends and join three different queues; whoever gets to the head first, the other two run there and hand over all their papers. This, of course, makes the people behind that person unhappy, and then they start a fight. So in reality, queuing systems are very complicated, and that's why simulations are often required. So I'm starting with a very simple model of one queue. Customers arrive in order. Time itself is continuous, but events happen at discrete points in time; maybe we model time by a double, with each event happening at a definite double value. The 0th customer, C0, arrives at time 0. In our simulation, this triggers two events in the future. First, C0 will leave the queue at some point in the future, after servicing is done. That point is determined by customer 0's service time, which is created by invoking a random number generator: you have a black box which, when you invoke it, gives you an exponentially distributed random variable. That gives you the service time of the first customer. So when C0 arrives, you sample its service time, and you then know at which future time C0 will leave. The second event you trigger is the arrival of the next customer, C1. That follows a different exponential distribution, with a different parameter. So maybe the inter-arrival time you pick by random sampling is this much, and the next customer, C1, arrives at that point.
At that point, the first customer is still being serviced, so C1 has to queue up. C1 waits until C0 leaves; then C1 gets serviced, and C1 leaves at that time. C0 was lucky: C0 came to an empty queue. If you go to the bank in off-peak hours, you can also be lucky, but otherwise, like C1, you have to line up behind someone. A quantity that's often of interest is what's called the stretch metric. C1 arrived and left; the time from arrival to leaving is the yellow bar, and the actual service time of C1 is the green bar. If the wait time was 0, then the stretch is by definition 1. So consider the ratio of the yellow to the green. If yellow divided by green is 1, then everyone is happy: you certainly can't be unhappy while the teller is working on your case, so the green time is not unpleasant, hopefully. But if the yellow-to-green ratio gets large, then you get unhappy. Perhaps the epitome of a large yellow-to-green ratio is getting a passport or a visa: the work takes 10 minutes, and you waste the whole day on it. So in queuing simulations, the yellow-to-green ratio, averaged over customers, is a very important measure, and you want to keep it close to 1. Typically, if the number of tellers or queues is small, the yellow-to-green ratio increases; if you have an adequate number of tellers or counters, it will be close to 1. The goal is to balance your organization's budget against the yellow-to-green ratio: adding another teller costs you salary, so you decide what's important for you. At some point, you have to translate customer annoyance into loss of business — people will go away from SBI to HDFC or something — and then you judge whether you can invest that money in a teller to reduce the stretch ratio. So with that, let's start working out what data structures we need for this problem. The first thing we need is a multimap, which maps from event times to something else, the two question marks.
We don't know yet exactly what we'll put in there. This multimap will always record events to happen in the future, and it is a sorted map: whenever I pull out begin, I always get the next discrete event to happen on this number line. The first event I inject into this multimap is the arrival of customer C0. When I pull that out, I realize I have to trigger two new things — the completion of service for this customer, and the arrival of the next customer — and I push them back into the multimap as future events. Then I keep pulling out the next event and processing it. That's called the event loop. So the event simulation loop looks like this. Remove the next event from the event multimap. If the event is a customer arrival, do the following; if the event is a customer service completion, do some other things. If a customer is arriving: assign the customer an ID, or a ticket number — you can just count up from 0; record the arrival time, which we'll use later for statistics collection; push the customer (the customer ID, for example) onto the back of the queue; then generate the random next-customer arrival time and record it in the event map. If instead a customer is about to leave because service has been completed: pop the customer from the front of the queue; collect statistics to report, et cetera; then generate the random service completion time for the next customer in the queue, and again record it in the event map. Because the event multimap is always kept in sorted order, any event that is to happen in the future is inserted into the right slot, no matter when it was generated, and you keep pulling out the next event. This is a classic application of multimaps: as various things happen, various numbers are inserted into the multimap, but you always want the smallest one. Of course, there are some special cases, like when the queue is empty.
So if the arriving customer lands at the front of the queue, service can start immediately, and you have to do some relatively careful logic in there. Now, what data structures do we need? First of all, there's the queue, which is a list of ints, because the customer ID, CID, is an int. We have a CID generator, which is just an integer you keep clocking up every time a customer arrives — "now serving 56", or whatever it is. So this is the queue: you keep pushing back and popping front, and you assume that the person standing at the head of the queue is currently being served. So you need a list here. Could I manage with a vector? Well, yes, I could. But if you remove the first element of a vector, you're shifting everything down all the time, and that's inefficient. Here, push_back and pop_front take constant time, so it's more efficient. This is one classic use of a list: it's very naturally used as a queue. The other thing, as I said, is a multimap of events. One way you could code that is with the key being a double, the time of the next event, and the value being an integer, which is again a customer ID. Now, how about the customer's other information — like how long they've been waiting, and so on? In principle, I could create an unbounded vector of customer records: here is customer 0, here is customer 1, here is customer 2, and so on. A customer record has a bunch of fields: you need to record the customer ID, the arrival time, the service start time, and the service end time. Before the customer has arrived, all of these times are set to, say, NaN — not-a-number — or some other illegal value. Once they actually arrive and are added to the queue, you set the arrival time to a proper value. While they're waiting, the other two fields are still NaN, say. When they actually start being serviced, you fill in the service start time.
When they actually finish being serviced, you fill in the service end time, according to the logic of the previous pseudocode. The other thing you need is the next scheduled event time and event type for this customer: arrival, departure, and the times. But the important question is: do I need to store all these — one, two, three, four, five, six things — in separate places, or can I pack them into one record and store them in one array? That's where a struct comes in. A struct is a construct in C and C++ (with a counterpart in Java as a class without methods) where you take things of different types that logically pertain to one record — first name, last name, date of birth — and stick them into one thing called a struct, or structure. So let's finish the class with one look at what a struct looks like. You declare a struct like this: typedef struct, and then the name of the structure — it's a Customer. The customer has an ID, an arrival time, a service begin time, and a service end time; it also has the next event in the customer's life, at eventTime, and whatever happens at that event is eventAction. eventAction is an enum — we have already seen an example of an enum — with values like arrive, start service, and depart, or something like that. We won't have time today. Depending on how I feel about the remaining material in this course, I'll either finish this event simulation at the beginning of the next lecture, or, if I feel that's too risky and we have too much material, I'll post the complete code for the event simulation online. You should run it and see how it behaves; I'll put all kinds of trace code in it so you can run it and see how it works. So that's what we'll discuss about collection data types. These are the most important ones, namely lists, unordered maps, ordered maps, and multimaps. These four will pretty much see you through most programming assignments.
But there are others in the library, and there are a lot more things to learn. After collection objects, our next goal is to see how structs are used. A struct, as you can see, has a new type name that you assign — here, Customer — and inside are what are called fields, or members, of the struct. After that, we'll see that structs can be enhanced into what are called classes. A class is a struct plus methods — like begin(), end(), find(); those are called methods. So classes define the types of objects, and objects are instances of their classes. Next, we'll look at that kind of object-oriented programming, followed by a long-pending item, which is pointers and memory management. And then we'll finish off with maybe one lecture on file systems and input and output: binary-format I/O, random-access files, and so on. That will be the end of the course.