Today, we will continue with some important and interesting variations on array representations and operations on them, namely sort, merge, and search. Just to recap a little on where we finished off last time: we were looking at merge sort, and we wrote this algorithm, which was a bottom-up merging strategy. We started with arrays of size 1 each, which were trivially sorted, then we merged them into arrays of size 2, which were sorted because of the merge. And then we built our way up to larger and larger arrays, the sizes being powers of 2, until at the root of that merge tree we ended up with the whole original array in sorted order. So, we showed a demo of that, and there was a small bug in the code, which was that I was not copying back the merged array into the original array. That was the reason why the array was not getting sorted. So, let us look at that piece of code. Here was the merge sort code. Remember I was initializing an array with numbers which were not sorted. The number of elements was 8, which was a power of 2, namely 2 to the power 3. And C was the scratch space for merging runs from A. So, that was the purpose of C. And then the outer merge-phase loop was indexed by px. Inside, I decided to initialize C to all minus ones, which is not a value in the original array, just so that we can see how C is being filled in. So, you can read minus one as some sort of a null or empty cell. And then we print that we are starting merge phase number px. And inside, remember, there was this fancy footwork with rx and the runs, and where the runs start and end in the original array. So, we figured out where the left run began and ended and where the right run began and ended; we were always merging adjacent runs. And then I initialized ax and bx to the run beginnings. And cx would be a write cursor into the C array, where the result of the merge would be written. And the merge routine is right here.
This is the beginning of the merge, and that is the end of the merge. And then I would print the array C to show how it is being filled in. What I got wrong is that I never copied C back into A; otherwise it is a pointless exercise. And then the merge loop on px ends. So, instead of printing the entire result at the end, I will just be printing C each time, and we will see that at the end it becomes entirely sorted. So, if I run that, it says we are in merge phase zero; but let me first print out the original array. So, do not look at the print-array code yet; let us say I print out array A to start with. The array A was 35, 9, 22, 16, 17, 13, 29, 4. In merge phase zero, I do four merges. One is the merge of 35 and 9, which results in 9, 35. The next is the merge of 22 and 16, which results in 16, 22. The third merge merges 17 and 13, which results in 13, 17. And last, 29, 4 turns into 4, 29. So, that is the state of the C array after the first merge phase. Observe that it is not sorted, but runs of length 2 are sorted. Then merge phase 1 starts, where I take this array, which has now been copied back into A — the step that was missing earlier. And now I merge 9, 35 with 16, 22, and I get 9, 16, 22, 35. So, the first 4 are now sorted. The last 4 have not been filled in yet; that happens in the second run pair, which merges 13, 17 with 4, 29, resulting in 4, 13, 17, 29. And that finishes the second merge phase. Finally, merge phase number 2, or the third merge phase, takes just one merge between these 4 elements and those 4 elements, resulting in the sorted order. So, it's now clear how merge sort is working. All right, so this is actually a bottom-up way of expressing merge sort, where we start with small solutions to small sub-problems, and then we combine them to form solutions to larger and larger sub-problems, eventually solving the whole problem. But perhaps it is a little more natural to express merge sort instead in a top-down fashion.
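The bottom-up procedure just described, including the copy-back step that was missing in the lecture demo, can be sketched roughly like this. This is my reconstruction, not the lecture's exact code; the name merge_sort_bottom_up and the cursor names are mine.

```cpp
#include <algorithm>  // for min
#include <cassert>
#include <vector>
using namespace std;

// Bottom-up merge sort: runs of width 1, 2, 4, ... are merged pairwise,
// one merge phase per run width, exactly as in the lecture demo.
void merge_sort_bottom_up(vector<int>& a) {
    int n = (int)a.size();
    vector<int> c(n);                                  // scratch space for merged runs
    for (int width = 1; width < n; width *= 2) {       // one merge phase per width
        for (int lo = 0; lo < n; lo += 2 * width) {
            int mid = min(lo + width, n);              // left run is a[lo..mid)
            int hi = min(lo + 2 * width, n);           // right run is a[mid..hi)
            int ax = lo, bx = mid, cx = lo;            // read cursors and write cursor
            while (ax < mid && bx < hi)
                c[cx++] = a[ax] <= a[bx] ? a[ax++] : a[bx++];
            while (ax < mid) c[cx++] = a[ax++];        // leftovers of the left run
            while (bx < hi)  c[cx++] = a[bx++];        // leftovers of the right run
        }
        for (int i = 0; i < n; ++i) a[i] = c[i];       // the missing step: copy C back into A
    }
}
```

The min calls also let this work when n is not a power of 2, slightly generalizing the demo.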
And this is how it would go. To merge sort an array in positions low through high, where low is included and high is excluded — that's what is shown by the square bracket and the curved bracket. If high is equal to low, then the array is empty; since high is excluded and low is included, in case high equals low there is no array, so you're already done. If there's nothing, there's nothing to sort. (In actual code you would also stop when high is low plus one, a single element, so that the recursion always shrinks.) Otherwise, find the midpoint — this starts looking like binary search again. Find the midpoint m equal to the average of low and high, take the integer part of it, and then merge sort the arrays in positions low through m and m through high. So, unlike in binary search, where one of the halves would be excluded, here we really have to solve both half problems. We need to sort both the left half and the right half. So, these are two what are called recursive calls: to merge sort a big array, I have to merge sort smaller arrays. Once those calls return, once we have ensured that the two small half arrays are sorted, we can merge those two segments, and we already know how to merge. So, symbolically, we can say that the time T(n) to merge sort n elements is the time to solve two sub-problems of roughly n/2 elements each, plus the time for the merge, which is proportional to n, say cn. We have already seen that merging m plus n elements takes time proportional to m plus n, so here merging two runs of n/2 elements each takes approximately n time, or a constant times n. So T(n) = 2T(n/2) + cn, and it turns out that if you solve this recurrence, you eventually get T(n) equal to some other constant c' times n log n. We'll see that later on. So, even the top-down expression really results in the same algorithm — the computation is done in the same order. It is just that expressing it this way is much more powerful. Instead of messing around with indices and deciding where runs begin and end, we are just dividing things in half, approximately.
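The top-down formulation just described might look like this as code. This is a sketch under my own naming (merge_sort, merge_runs); the lecture only presented it symbolically.

```cpp
#include <cassert>
#include <vector>
using namespace std;

// Merge the sorted runs a[lo..mid) and a[mid..hi) into scratch, then copy back.
void merge_runs(vector<int>& a, int lo, int mid, int hi) {
    vector<int> c;
    int ax = lo, bx = mid;
    while (ax < mid && bx < hi)
        c.push_back(a[ax] <= a[bx] ? a[ax++] : a[bx++]);
    while (ax < mid) c.push_back(a[ax++]);       // leftovers of the left run
    while (bx < hi)  c.push_back(a[bx++]);       // leftovers of the right run
    for (int i = 0; i < (int)c.size(); ++i)
        a[lo + i] = c[i];                        // copy the merged run back into A
}

// Top-down merge sort on a[lo..hi): lo included, hi excluded.
void merge_sort(vector<int>& a, int lo, int hi) {
    if (hi - lo <= 1) return;        // empty or single element: already sorted
    int m = (lo + hi) / 2;           // midpoint, integer part of the average
    merge_sort(a, lo, m);            // recursive call: sort the left half
    merge_sort(a, m, hi);            // recursive call: sort the right half
    merge_runs(a, lo, m, hi);        // zip the two sorted halves together
}
```

Note the base case tests hi - lo <= 1, not just hi == lo: with integer midpoints, a single-element range would otherwise recurse on itself forever.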
We are recursively invoking merge sort on the two halves, and then we are zipping things together at the end, okay? So, this is what a top-down expression would be. And we will study in more detail how this works once we understand function calls, and in particular recursive function calls. So, this is an early preview into why we like to have recursive functions: because you can express the solution of a large problem as the solution of the same structural problem over smaller sizes of input, okay? So, today I will start with a new way of representing arrays, and that is by using the ANSI C++ vector class, or vector type, okay? Now, in the case of native arrays, you declare a native array as int a[n] or float b[n], which means an int array with n elements or a float array with n elements. Now, vector is a class template which comes from the library header called vector, okay? You have to tell the C++ compiler which kind of things you want to store in that vector, okay? So, if you want a vector of double numbers, then you have to say it's a vector of doubles, with double within angle brackets. If instead you want a vector of ints, you have to say vector of int, like this, like in the red font. So, vector<double> declares a vector of double-precision numbers, and vector<int> declares a vector of integers. How it manages memory inside is nominally of no concern to us. It will ensure that most of its behavior is very close to native arrays and strings. For example, I can resize an array to have a given number of elements. I can say resize the array to ten elements. In case the original array was of size less than ten, for example five, then after this call the array will be of size ten. The last five elements will be filled with default values (zero for numeric types — unlike a native array, they are not garbage). The first five elements will be whatever elements were there before you made the call. That is guaranteed.
If you resize to a size which is smaller than the initial size — suppose the array started out being 20 elements long, and you say resize it to ten — you will retain the first ten elements and discard everything after that, okay? So, a very simple, intuitive feel. You can also ask for the current size of the array at any time. Now, observe that you don't have to declare the array with any given size, okay? When you declare an array like this, the array is empty, with zero elements. You can resize it to a specific size at any time, okay? And then the array becomes of that size, with the semantics I've just described. So, the advantage is that your array size can be a function of various inputs you read from the user, or even runtime conditions in the algorithm or program that you're writing, okay? And generally speaking, extra space will not be wasted too badly. The system will try to reclaim old space if you shrink the array by a lot, and so on. And space reclaimed from this array can be used in other arrays, and so on. But we don't need to worry about it; space is managed entirely by the system. Apart from that, other accesses into the array are exactly like native arrays. So you can say foo = array[index], in which case you'll read the value of the cell numbered index in the array. Or you can even write into it: you can say array[index] = value. So that remains unchanged. Another way to build up an array is to initialize it to empty and then call what is called push_back. So you can say push_back the following value to the end of the current array, and then the length of the array will grow one at a time. We'll see an example of that in some code. So one way to initialize it is to create an array, resize it to the size you want, and then, in a loop, assign some initial value to each element. Here I'm arbitrarily assigning (ax + 1) * (n - ax) to the array elements. But you could do anything else with it.
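The resize, size, push_back, and indexing semantics just described can be exercised in a few lines. This is my own illustrative snippet (the function name vector_demo is mine), not code from the lecture:

```cpp
#include <cassert>
#include <vector>
using namespace std;

// Walk through the vector semantics described above and return the final contents.
vector<double> vector_demo() {
    vector<double> arr;            // declared with no size: empty, zero elements
    arr.resize(5, 1.5);            // grow to 5 elements, new cells filled with 1.5
    arr.resize(10);                // grow again; the first 5 values are preserved,
                                   // the new 5 are value-initialized to 0.0
    arr.resize(3);                 // shrink: keep the first 3, discard the rest
    arr.push_back(2.5);            // append one element; length grows by one
    arr[0] = 7.0;                  // read/write with [] just like a native array
    return arr;                    // {7.0, 1.5, 1.5, 2.5}
}
```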
Usually arrays will not be assigned magic values like this; in most cases, arrays will be read in from data files and populated. So let's look at a first sample of vector code. Don't look at this declaration yet; just start reading from main onwards. So I declare a vector of ints and a variable for its length. And I have some initialization code similar to earlier. Then I print that. Now, I'll come to the sort a little later. And then I can just print that — I have created a method called print, which can print an arbitrary array. So if I initialize and print, it shows 10, 18, 24; it increases and then decreases back again, because it's (ax + 1) * (n - ax). So it's a parabolic form. Now, there are various advantages to using the vector class. Along with the vector class, you get for free something called the algorithm package, which includes a huge number of goodies. For example, all those sort routines like selection sort and merge sort that we painfully learned — you don't need them at all. Algorithm already provides you a well-written sort routine, to which you pass the beginning of an array and the end of an array. So it can sort a segment of an array if you want. Today we won't look into what begin and end mean, except for their obvious colloquial meaning, which is: sort the array from beginning to end. But the advantage is that whatever we were doing painfully in all that code with index arithmetic, the algorithm library provides as a one-liner. So if I compile this and run it again, you see that I get a sorted array. As you learn more and more of the C++ standard libraries, you'll find that you'll hardly need to write any code of your own. You can pretty much reuse many things that are already provided as basic building blocks. And the more you learn about the details of invoking them, the better you'll use those facilities.
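The initialize-then-sort demo just described comes down to something like this (the helper name init_and_sort is mine; the lecture code did this inline in main):

```cpp
#include <algorithm>   // std::sort comes from the algorithm library
#include <cassert>
#include <vector>
using namespace std;

// Build the parabolic array (ax + 1) * (n - ax), then sort it with the
// library's one-liner instead of a hand-written selection or merge sort.
vector<int> init_and_sort(int n) {
    vector<int> arr(n);
    for (int ax = 0; ax < n; ++ax)
        arr[ax] = (ax + 1) * (n - ax);     // rises, then falls back down
    sort(arr.begin(), arr.end());          // sort the whole array, begin to end
    return arr;
}
```

For n = 10 the unsorted values run 10, 18, 24, ..., 30, ..., 24, 18, 10; after the sort call they come out in increasing order.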
For example, if you wanted to sort a section of an array, you could learn how to pass a section of an array instead of the beginning and end. And what do begin and end mean, effectively? What are their types? We won't see that today. So, these, then, are the advantages of using the vector class as against native arrays. Just like native arrays, we can store elements of any type in it. In fact, you can even store vectors of strings, which you couldn't easily do with native arrays. And as we shall see in the next example code, dealing with a vector of strings looks as easy syntactically as dealing with a vector of ints. There's no difference in what you can do with those things. Memory management is handled entirely by the C++ runtime library. When I discussed native arrays, I drew that picture showing how elements of the array are arranged one after the other, with uniform size, in RAM. As soon as you start using vector, you should stop thinking about that. The vector class, or type, is an abstraction that lets you index a position and then fetch a value. If it is a string, different strings can have different lengths, and my earlier statement about laying out things regularly in RAM no longer holds. But the system gives you an abstraction that there is a series of cells, numbered 0 to n minus 1. Each cell has a value — even if it's a string, that's a value. You can access it; you can do things with it. And you don't have to worry about how those things are packed in, or whether one string is longer than another. In fact, you can go in and modify one of the strings in the middle, and you don't even have to touch any of the other strings. We shall see an example of that. You can grow and shrink an array after declaration, depending on the needs of your program. So that's very convenient. But there's more. As you saw, sorting is already provided. So is binary search, and many, many other useful algorithms. So vector is basically the way to go.
As far as possible, there's no point using native arrays at all. So let's show some of these demos. Second piece of code: vector of strings. So now I just say vector<string> instead of vector<int>. (To a student's question: no, not as far as I know — they require one of the C++ standard collection classes.) And this is one way to initialize an array. I told you that you could either set a size and then keep assigning values, or you can provide values one by one, and the array will grow one element at a time. So you say: start with an empty array called names, and then push_back the string "zebra", then push_back the string "alligator", one after the other. And at the end — again, I'll comment out that code — let me just do this and print the names. So what I have here is an array where each element is a string, and the strings are currently not sorted. Now suppose, for some reason, I need to append the string "123" to each of the strings. If you tried to do this with native character arrays, you would have to go through a fair bit of circus. For those of you who have coded with native character arrays in C or C++, it's quite painful. But you don't need to worry about the bad old days. All we do is: for (int vx = 0; vx < names.size(); ++vx) names[vx] += "123". So += is a polymorphic operator. In the case of integers, it means integer addition; in the case of doubles, it means double addition. Each names[vx] is a string — whatever type you put in the angle brackets, as soon as you access an element, it is an element of that type. So names[vx] is a string, and you say: take that string and append "123" to it. Let's see how that works. So every animal name got 123 appended at the end of it. Now, you don't need to append only literals. I can, for example, append names[vx] to itself — just double it: pizza becomes pizzapizza. And that's what happens. So it's very easy to deal with string classes inside vectors.
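The append loop just described fits in a tiny helper. The function name append_to_all is mine; the lecture did this inline:

```cpp
#include <cassert>
#include <string>
#include <vector>
using namespace std;

// Append a suffix to every string in the vector, in place.
// += on a string element means concatenation, not numeric addition.
void append_to_all(vector<string>& names, const string& suffix) {
    for (int vx = 0; vx < (int)names.size(); ++vx)
        names[vx] += suffix;
}
```

Calling append_to_all(names, "123") turns "zebra" into "zebra123"; passing names[vx] itself as the suffix would double each string instead.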
And that's the other thing: if you take a native array, it is hard to store complicated things inside it. I might want a vector of vectors, or a vector of vectors of strings. You can do arbitrary things like that. You can put anything in vectors; there's no constraint on what you can keep there. Even if it's irregularly sized in RAM, even if it's not laid out contiguously in RAM, it doesn't matter — anything can be placed in a vector. So, in fact, just to give you another small preview, I'm going to declare a function, or method, called print, which returns nothing. It's a templatized method, so I'm not going to give the full type specification of the method. The input to the print method is a vector of some unknown type T. So T is not instantiated: I'm writing a single print routine which can print vectors with any type T in them. T is abstract; T is not bound to int or string or float or anything like that. And that's implemented later on in this file — you can have a look at it offline. So that is how you'd append things; I'm going to comment that out. And as I was saying, once you start using the ANSI C++ classes, everything becomes really uniform. You want to sort this vector of strings? It's exactly the same call: sort(names.begin(), names.end()). And then you print names. So I'm not doubling the strings anymore, but sorting results in dictionary order. So you need not worry about memory management, index arithmetic, merge sort — nothing like that. You have sort given to you, you have strings given to you; to sort a vector of strings is just one line. So that gives you a sorted vector of names. There's also binary search provided for you. And this version returns a Boolean: whether the query was found or not found. So, binary search — again, you give it a range of a vector. In this case, I'm searching from beginning to end, and I'm searching for the string "wolf", which is here.
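A templatized print routine of the kind just described might look like this. It is my sketch of the idea, not the file's actual implementation; I give it an optional output-stream parameter so it is easy to test:

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
using namespace std;

// One print routine for vectors of any element type T: T is left abstract,
// so the same code prints vector<int>, vector<string>, and so on.
template <typename T>
void print(const vector<T>& arr, ostream& os = cout) {
    for (size_t ix = 0; ix < arr.size(); ++ix)
        os << arr[ix] << (ix + 1 < arr.size() ? " " : "\n");
}
```

The compiler instantiates a concrete version of print for each element type you actually call it with.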
And then I'll print out whether the query was found or not. So "wolf" was found, so it is one. If I change the query, then I should not find it — and indeed I do not find it. So, observe that for binary search to work correctly, the array has to be sorted. Now, what is the restriction? If I did not call the sort, what could happen? Binary search is a generic method; it doesn't know to start with whether the array is sorted or not — it doesn't even test. It does that bisection of the bracket, like I described in class. Now, if binary search has to return true, then at some point a[mid] has to be equal to the query — remember the code? Now, if the element is really nowhere in the unsorted array, you can never get a true return, because no element will match the query. But it's possible that if your array is not sorted and "wolf" appears in the array, you will miss it. Because the array is not sorted, you'll look at a[mid], and you'll assume that everything to the left is smaller and everything to the right is larger. That will not be true if the array is not sorted, and your query could be lurking in the wrong half. So that's why you need to be careful to sort before doing the search. But here you see that operating on elements inside a vector is trivial, even if they're irregularly sized and the size changes dynamically. So this is the string class as embedded in the vector class. The next thing we'll look at is not related to vectors per se, but now that we have vectors, we can do many things a little less painfully than before. We'll talk about sparse arrays. So far, we have allocated arrays in contiguous fashion, where there are cells after cells, numbered from 0, 1, 2 onwards, and every cell contains an element. In many applications, this would be wasteful, because the semantics of the application would make most elements zero. Let me give two examples. Suppose we try to represent the Facebook network of friends.
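The sort-then-binary-search pattern just discussed can be sketched as follows. The helper name find_name is mine; it sorts a copy defensively before searching, since binary_search requires (and never checks) a sorted range:

```cpp
#include <algorithm>   // std::sort and std::binary_search
#include <cassert>
#include <string>
#include <vector>
using namespace std;

// Returns true iff query occurs in names. The sort establishes the
// precondition: on an unsorted range, binary_search may miss elements
// that are actually present.
bool find_name(vector<string> names, const string& query) {
    sort(names.begin(), names.end());                       // dictionary order
    return binary_search(names.begin(), names.end(), query);
}
```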
Now, mercifully, most people are not friends with most other people, although the degree is pretty large. So what you have is: for every user, you want to keep a list of friends, but you do not want to keep it like the Josephus problem's present array. Suppose Facebook has, whatever, 1 billion users. You don't want to have a Boolean array with 1 billion places in it just to say that most people are not your friends — only 1,000 people are your friends. It's wasteful to store so many falses in the array. Similarly, in the case of representing documents in a search engine, we cannot afford to represent the documents in a dense way. This is what I mean. Documents are represented as vectors in a very high-dimensional space. So how does that space get constructed? There is an axis, or dimension, in this space for every word that was found on the web — not just every word, every token found on the web. So the token "cat", the token "dog", the token "logic", and millions and millions of other tokens each have a dimension in this Euclidean space. In fact, the number of distinct tokens, if you crawl the entire web into a repository, scales roughly with the number of documents. Because in some page, someone has stuck in a unique email address — that's a token. In some other page, someone has listed their PGP public key — that's a unique token. For these reasons, the number of distinct tokens on the web is in the billions or tens of billions. So, in other words, this Euclidean space is almost unimaginably large: it has something like a billion dimensions. Every document is embedded in this Euclidean space. So this red arrow is one document, and its coordinate in each dimension is, say, to start with, the number of times that word appears in the document. So suppose the document says, "you can teach dogs logic, but not cats". After stemming off the s, et cetera, there'll be one occurrence each of dog and cat.
So the coordinate of this arrow in these three dimensions will be 1, 1, 1. Because each term appears once in it. Now the issue is that, although the number of raw dimensions in this space is in tens of billions, each document itself is short, a few hundred, two, maybe a thousand words. So the number of coordinates, which are non-zero, is a vanishingly small fraction of all dimensions that are there. So if I started writing these documents down as vectors, then suppose here is one document, d1, and the dimensions go this way. There's a dimension for cat. There's a dimension for dog. There's a dimension for logic. There's a dimension for zebra, and so on. If d1 has one occurrence of cat, but no occurrence of dog, one occurrence of logic, no occurrence of zebra, actually most of the dimensions will not appear in the document. So there'll be a humongous number of zeros, and at most 100 to 1,000 ones, or some counts of how many times dog appeared in the document or cat appeared in the document. So we'll be spending an enormous amount of space storing zeros all over the place. And that's undesirable. We can't even afford it. So let's take a more concrete example. Suppose our corpus only has two documents instead of 20 billion to keep things within screen size. One document is my care is loss of care. The second document is buy old care done. If we take the set of all distinct words in this vocabulary, there are eight distinct words. Buy, care, done, is loss, my, often old. And these are sorted in increasing order, in string order, and I've assigned them integer IDs, 0 through 7. Now a search engine will immediately do that. It doesn't want to mess around with variable size documents and so on, or terms. It will do that, and immediately represent these documents as sequence of token IDs instead of the original tokens. I've not lost any information. So the first document is the sequence 513461. The second sequence is 0, 7, 1, 2. 
If you look up this table for which words they were, you will recover the original documents. That's how a search engine will initially represent the documents. Now, the second move that the search engine makes is to collect counts. So it says: as far as a human reading the document is concerned, the order of words is, of course, very important. But just to compute the similarity between two documents, or the similarity between a query and a document, it is enough to realize that term 1 appears in this document a couple of times, and term 1 appears there once — so there is some similarity between the documents: both of them have the word "care". That's basically the first cut that search engines take; then there are more sophisticated algorithms that rank the pages. Now, if we turn these sequences into multisets — that is the notation for writing down the multisets — we say that term 1 appears in the first document twice, term number 3 appears once, and 4, 5, and 6 also appear once each. The second document has terms 0, 1, 2, and 7, each one time. And again, you notice that "term 1 appears twice here" and "term 1 appears once there" is a signal that there is some similarity between the two documents. Now, this is also called a sparse vector notation. As a sequence, document d1 looked like 5, 1, 3, 4, 6, 1. Suppose I have this entire space of term IDs 0 through 7 — remember, I had 8 terms overall. 5 appears once, so I bump up slot 5; 1 appears twice, so I bump up slot 1 twice; then 3, 4, 6. So this is a dense representation of document d1. The dense representation wastes 3 zeros. In this case, it doesn't seem like much of a loss to store 3 zeros. But imagine that instead of 3 zeros you have 30 million zeros and only 5 non-zeros. Then you realize it's much better to write d1 in the sparse format: 1 appears twice, 3 appears once, 4 appears once. So if the value is a float and the ID is an int, I'm still spending only 10 times 4 bytes.
Whereas if I really had a very long vector — say 10 billion slots, with only 100 non-zero — then the dense representation would cost me 10 billion times 4 bytes, whereas the sparse representation costs me 2 times 100 times 4 bytes. So yes, I do have to pay by explicitly recording the index positions. But if the density is much less than half, I'm far better off storing in the sparse format as compared to the dense format. Does everyone understand the motivation for this? Can I get a show of hands? So sparse arrays are ones in which there are lots of zeros. You don't want to waste positional space just storing the zeros. So instead, we squeeze out the zeros, and the extra overhead I now have to pay — there's no free lunch — is that I have to record at which indices the non-zero values occur. So that's a sparse array representation. Now, I won't go into it today, but we can talk about extreme compression somewhat later in the course. If you realize that these term IDs are strictly increasing, then instead of storing 7 here, you could say it's actually the previous value plus 5. That reduces the magnitudes of the term IDs, which by itself doesn't help until you start encoding integers in a variable number of bits — if we keep taking 32 bits for everything, smaller numbers don't help. But that's just an aside; we'll come back to it much later in the course. The reason this is done is that you can compress things even better. So, how do we store sparse arrays? We assign integer IDs to every axis or dimension — these are the token IDs, as you have seen. And we'll use two ganged vectors: a vector of ints, which we call dims, which stores the dimension IDs, and in this case a vector of floats, which stores the coordinate in the corresponding dimension. If you're taking raw word counts, you can use int again.
But in general scientific applications, your second vector will usually hold floats or doubles. You may normalize the vectors; you may add and subtract things, multiply by constants. So eventually the second part — the vals array — will usually be a floating-point array. And these are ganged vectors in the same sense as ganged variable resistors. How many of you have used ganged variable resistors? If you own a stereo set, when you change the volume, you change the volume of both channels simultaneously. That's a ganged resistor: as you turn the knob or move the slider, it actually slides over both resistor tracks at once. Similarly, our sparse array from the previous slide said: in dimension 1, the value is 2; in dimensions 3, 4, 5, 6, the value is 1. You should always look at this matrix column by column. It doesn't make sense to separate the upper row from the lower row. This 3 and this 1 always go together; this 1 and this 2 always go together. If you reorder one of the rows without reordering the other, the entire meaning falls apart. So the meaning of "ganged" is clear: whenever you index into one of these arrays, you have to access the same index in the other array. Otherwise, you are accessing something meaningless. Now, what are some important operations on vectors? Well, on dense vectors, you might want to find the norm, say the L2 norm; we may want to do the same on a sparse array. We generally want to add two vectors. We may want to find the dot product between two vectors. Why the dot product? Because it tells us that these two documents are similar — what's the value of the dot product here? 2. Because in dimension 1, the contribution to the dot product is 2 times 1, and the vectors don't overlap in any other dimension. So computing a dot product is an important operation to gauge the similarity between two documents.
So let us look at these three operations one by one: how to do them on sparse vectors. We already know how to do them on dense vectors; we have seen the code for that already. How do things change if we now have to compute these three quantities on sparse vectors? The norm turns out to be very, very easy. In fact, the norm will not even need to inspect the dims array. See, if you have a vector like this, the zero entries don't contribute to the norm at all, and they aren't recorded here anyway. So all you need to do is take just the vals part of the array. You don't care what the dimensions are — each dimension contributes equally to the norm — so you don't even need to look at the dims array. The code looks like this: float normSquared = 0; for vx between 0 and vals.size(), all I do is normSquared += vals[vx] * vals[vx]. Dims is not involved. And finally, I take the square root of that. That gives me the L2 norm of the sparse vector, which is exactly the L2 norm of the corresponding dense vector. No difference. How about summing two arrays? Suppose I have two sparse arrays, A and B. A is represented by A's dims and vals; B is represented by B's dims and vals. And I want my answer to be stored in C's dims and vals. Now, it is best to keep all the dimensions in increasing order. If I had to look at each dim here and then search the other array linearly, that would take a long time. If instead I sort the other one, I'd be able to binary search — or even better, we can just merge, because this is just a merge. So, conventionally, we can write the two vectors as, say: A in dimension 0 has value, or coordinate, 2.2; in dimension 3 it has value minus 1.4 — in general, sparse vectors can have negative values too; documents may not, but in other applications you can have negative elements — and in dimension 11 it has value 35. B has value minus 5.2 in dimension 1 and, say, 1.4 in dimension 3. Then what does the merge look like? B doesn't even have dimension 0.
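The norm computation described above comes out to just a few lines. The function name sparse_norm is mine; note that only vals is passed in, since dims plays no role:

```cpp
#include <cassert>
#include <cmath>    // sqrt and fabs
#include <vector>
using namespace std;

// L2 norm of a sparse vector: the squeezed-out zero coordinates contribute
// nothing, so the dims array is never even inspected.
double sparse_norm(const vector<double>& vals) {
    double normSquared = 0;
    for (size_t vx = 0; vx < vals.size(); ++vx)
        normSquared += vals[vx] * vals[vx];
    return sqrt(normSquared);
}
```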
So in dimension 0, you just inherit from A. The next is dimension 3, where both of them have values, but they seem to exactly cancel out. So, to save space, you don't want dimension 3 to appear in the output at all. If instead both A and B had plus 1.4, then you would need to insert 3 : 2.8 into C. And other than that, 11 : 35 is transferred to C, because there is no dimension 11 in the B sparse vector. So the definition of the problem is clear: to add two vectors, you align their dimensions, and you add them element-wise. How do we do that? Clearly, to align their dimensions, if they're in sorted order, all you have to do is a merge. So there are two parts to this. One is: given a sparse vector that is not initially ordered by dim, how do we clean it up in preparation for summing? Second: given A and B, which are already sorted on their dims, how do we compute C? We'll take the first problem later; we'll look at the second problem first. Assuming A and B are already in canonical order, with dims increasing, how do we sum two such sparse vectors? So here is a demo of how it should work, pictorially. Initially, I point ax and bx to the first cells. Remember, these are now ganged arrays — they're inseparable. If we shuffle them around, they have to be shuffled as column units; otherwise it doesn't make any sense. So ax and bx are pointing to the first positions. And the values which need to be compared this time, unlike in the original merge, are the actual dimensions. We're trying to see if any of the dimensions are shared between the two. The case, as it stands now, is that the dim here is zero and the dim there is one. There's no match, so zero just has to be transferred to the output. So to C's dims I append zero, and to C's vals I transfer the 2.2, at which point the first column goes away, and ax is advanced to the next position.
This time, the quantities which come under the scanner are 3 and 1. I compare the dims. I find that BX wins, and there is no match. So I transfer the column from the B array to the output, and I drop that, and BX advances. At this point, the elements being compared are 3 and 3, which do match. And then I find plus 1.4 and minus 1.4 sum to zero. And again, as I have mentioned multiple times before, you may not want to test for exact equality. You might say that to save space, any time an element of my sparse vector drops below some epsilon in magnitude, I'm going to squeeze it out. Whether you want to do that or not is application dependent, where you tune epsilon, and so on and so forth. But in this example, let's assume that they truly cancel out, and we decide to eliminate the entry from C. Then both AX and BX will advance. Nothing will go to C in this case. If they had both been plus 1.4, then I would transfer 3 and 2.8 to the output. In this case, I transfer nothing. B empties out, AX points to the last element, 11, which has no match in B, and that is transferred out. So this is how sparse vector addition should happen. Now all that remains is to write some simple code. It will, of course, resemble merging a lot. So how do you do the sum merge? We are given these two vectors, A dims, A vals, B dims, B vals, and the output is C dims, C vals. A and B are suitably filled. C is initially assumed to be empty, so C has zero elements. Now we set up read cursors just like before. In fact, we don't need a write cursor, because we'll just append to C. So AX is equal to 0, BX is equal to 0, with corresponding last positions AN and BN. And then, exactly like in merging numerical arrays, we have three cases inside: A dims of AX less than B dims of BX, A dims of AX greater than B dims of BX, and A dims of AX equal to B dims of BX. And afterward, again like before, if there are leftovers, I need to append the rest of that array to C. So I won't look into that in detail. We know how to do that.
So for the three cases, the exact code will be as follows. If A dims of AX is less than B dims of BX, then I will take C dims and push back A dims of AX to it. And to C vals, I'll push back A vals of AX. And then I'll advance the AX cursor. You've already seen a demo of this in the picture. On the other hand, if A dims of AX is strictly greater than B dims of BX, then BX wins, and the contents of the two rows of B have to be transferred to C. That corresponds to C dims push back B dims of BX, C vals push back B vals of BX, and advancing the BX cursor. In case A dims of AX is equal to B dims of BX, like happened with dimension 3 in the example, we create a new val, which is A vals of AX plus B vals of BX. In our example before, this turned into 0. If the magnitude of new val is 0 or small, we discard it. Otherwise, C dims push back A dims of AX, which happens to be the same as B dims of BX, and C vals push back new val. This is the quantity 0 or whatever is computed, appended to the C arrays. But in either case, we have to increment the AX and BX cursors together. So how many people are comfortable with how sparse vectors are summed up? Show of hands, please. Over in that corner? So that's how sparse vectors are added up. Now we look at the third problem, which is the sparse dot product. Now this will again be a merge, because we need to locate the shared dimensions, and only dimensions which are common to both sparse arrays will contribute to the dot product. But it's a little different in that there's nothing to clean up at the end. We produce a contribution only if dimensions match. So we start off with dims and vals as before. I initialize the dot product to 0. And then I initialize AX and AN and BX and BN as before. And again, I run through the two arrays, and we have three cases. But once one of the arrays empties out, no further contribution is possible to the dot product, so I can just quit. Nothing to do. Now what happens in these three cases? Suppose A dims of AX is less than B dims of BX.
So it basically says that the current dimension of A is strictly smaller than everything left in B, so it cannot appear in B at all. Implicitly, that means B's coordinate in that dimension is 0. I don't store 0s, so the contribution to the dot product would be 0 times something, which is 0. So in case A dims of AX is less than B dims of BX, there is no contribution to the dot product, but I have to advance the AX cursor, hoping that something else will match up. In case A dims of AX is greater than B dims of BX, then BX has to be advanced. Only if they're equal does the dot product get incremented, by A vals of AX times B vals of BX. And then I have to increment both. So let me take an example to clarify this. I'll take the same old arrays. The first array was 0 colon 2.2, 3 colon minus 1.4, and 11 colon 35. The second array was 1 colon minus 5.2 and 3 colon 1.4. So initially, AX will be here, BX will be there. And here is the dot product, which is equal to 0. Now let's compare 0 and 1. 0 is less. There's no match between 0 and 1, so nothing is contributed to the dot product. I advance AX to 3. Now I compare 3 and 1. I find no match, so still there is no contribution to the dot product. But BX is now smaller, so I advance BX to 3. Now there is a match. So I add 1.4 times minus 1.4 to the dot product, and I advance both of them. So BX drops off the edge of the earth, and AX gets advanced to 11. But by now the loop has terminated, because BX has reached BN, and my answer is minus 1.96, that is, minus 1.4 squared. So that's how sparse dot products can be computed. Any questions about sparse dot products? So these are both merges, but there's a slight difference. In case of sum, the resulting array typically gets denser, because any dimension present in either of the input arrays is potentially a candidate to contribute to the output array.
Whereas in case of the dot product, the number of contributing terms is even smaller, because a dimension needs to appear in both of the arrays to contribute to the dot product. Everything else you can skip over. Fine? Now we come back to the first question. So the next problem is the initial problem, which is: suppose we are provided sparse vectors, and they are, of course, ganged on dims and vals, otherwise it's meaningless, but dims is not sorted. And this would be the case when you're acquiring the data from raw devices, or you're reading documents and turning them into integers. You cannot guarantee that all the tokens will be in increasing order in your dictionary. So suppose A dims is not sorted, and we want it sorted, as we will need if we want to compute sums or dot products. So the question is, how can we rewrite selection sort, which keeps moving the smallest remaining element to the front, to work over ganged arrays? So we'll now rewrite selection sort to work with ganged dim and val arrays. That's the next task. Now, smallest will not mean smallest in the value sense. Just like in merging, smallest now means over the dimensions. But move to front will involve both arrays, because you have to keep the arrays ganged at all times. So here's an example. I start off with a sparse vector, which is assumed ganged, but with the dims out of order. So 11 goes to 2.1, 7 goes to 6.5, et cetera. Now in the first step, my frontier, fx, is at 0. And when I search for the minimum dimension, I find it at position 3. Min pos is 3, and the dim at min pos is 2. So now I have to swap the first column with the fourth column, that is, the 0th column with the 3rd column. Once I do that, I end up with 2 in the first column and 11 in the fourth column. And I'm done with 2, so fx can now advance. So the gray part is already sorted. And now the next minimum position I find is 2. Min pos is 2, with dim 5. So at this point, I decide to swap the current fx, which is 1, with min pos, which is 2. And that swap turns 7, 5 into 5, 7 in the dims array.
But observe that, at all times, I have to also transfer the 4.3 here and transfer the 6.5 there. That's the only new angle here, but it's critical to correctness. And I'm also done with the second column; that's grayed out. So what happens next is, after the first two positions are done, it turns out that the current position already holds the smallest remaining dim, 7. So there's nothing to do in that step. In the next step, I advance, and I find that 8 is the smaller value, at position 4. So I swap 8 and 11, along with 2.1 and 5.5, to get the final array. So the final dims will be 2, 5, 7, 8, 11, with the correspondingly permuted values at the bottom. Of course, those need not be sorted in general. So that's what the algorithm should do. And it's fairly easy to code it up based on the earlier selection sort. This is what the code will look like. So we have dims and vals as before. Now we say AN is the size of the arrays, which should be the same for both. The frontier fx goes from 0 to AN as before. Min dim is set to some maximum value, and min pos is set to minus 1. And then I hunt for the minimum position, except that I look at the dims: for mx equal to fx to AN minus 1, if min dim is greater than dims of mx, then remember dims of mx as min dim, but also remember mx as min pos. Mx is what we saw as the thing to be swapped with. So the important thing is what I do after I get min pos. At this point, I copy off both rows of column min pos into two temporary variables, tdim and tval. Then I overwrite the min pos position with the fx position in both arrays. And then I assign dims and vals at fx from tdim and tval. So you're swapping, but you're swapping in tandem. That's exactly as the picture wanted you to do. In the earlier selection sort, I was just swapping two values in one value array. Now I'm having to swap in the dim and the val arrays in conjunction, in sync.
So anytime you want to find the minimum value in an array, if you initialize that minimum value to infinity, then you can just keep taking the minimum of that and whatever comes next. If you want to find the minimum over a set, it always helps to initialize the minimum to infinity. As soon as the set is non-empty, the minimum will drop down to the first element, and so on. So it gives you uniform code without special-casing the first element. So is this code clear? How to do selection sort on dims? So how many people want a demo of this, versus how many people are sure enough that this will work? Then we can move on to something else. So the important thing is that we swap min pos and fx on both dims and vals simultaneously. So the next problem I'll discuss is very related to all this, but it feels slightly different. It's called an indirection or index array. So suppose each element of this array, which I call data of size dn, consumes a lot of bytes. For example, I may have a vector of strings where each element is a long string. So the problem is that we have seen two sort methods so far, and both of them involve copying and moving elements. In case of selection sort, we detect the smallest element and we copy it over to the front position; that involves copying so many bytes. In case of merge sort, of course, you read two runs and write out into another array. So in all cases, a lot of transfer of bytes is involved, and that is a cost. If I have a long vector of long strings, I'll be spending time not proportional to the number of elements in the array, but proportional to the total number of bytes I have to copy. It just happens that integers and floats have a small fixed number of bytes. But if you're storing complicated records, like a student record with the state from which the student came, roll number, age, date of birth, whatever, lots of fields, that's a lot of bytes.
If you want to keep them sorted on some field, it'll take a lot of time. So can we avoid this? The trick is to create a permutation array called pos. If you're given a data array, which is data of size dn, we will create a permutation array called pos, of the same size, such that when you access data through pos, as data of pos of px, you will get the data in increasing order as px increases. I'll draw a picture to make that clear. So hold that thought: data of pos of px. And here's an example. Suppose I have an array of strings. For shorthand, I'll just use one character per string. So suppose my original array, data, has, say, five cells, and these are q, z, a, e, r. The new technique will not at all move around data inside the data array. What I want to create is a pos array, which is a permutation of the numbers 0 through 4. What this records is that the smallest element was in cell 2, the second smallest element was in cell 3, the third smallest element was in cell 0, and so on. That's what I want in pos. So now suppose I index this using px. Let me write down px, let me write down pos, and let me write down data of pos of px. When px is 0, pos of px is 2, and data of 2 is a. When px is 1, pos of px is 3, and data of 3 is e. When px is 2, pos of px is 0, and data of 0 is q. When px is 3, pos of px is 4, so that is r. And when px is 4, pos of px is 1, and data of pos of px is z. So the summary is that neither data nor pos is sorted, but when you compose them as data of pos of px, as you increase px, data will increase. So that's what you call an index or indirection array. This is called an indirection: instead of accessing data directly with an index, you find the index in another array. So this is a composition operator. I'm indexing one array, and I'm using the result as an index into another array. So I have this vector of names, which is suitably filled, and I want to create in pos a permutation. Let's say we initialize pos to the right size or whatever; those are small details.
Let nn be the shared size of both of those arrays. Initially, I fill out pos with the identity permutation. So for nx equal to 0, nx less than nn, pos push back nx itself. So what will happen is pos starts out empty, and then 0, 1, 2, 3, 4 will be appended in that order. So pos will be the identity permutation: 0 goes to 0, 1 goes to 1. And at that stage, of course, data of pos of px will not be in sorted order; it will be the same as the original. And then we'll start the sorting routine, again with a frontier. So for fx equal to 0, fx less than nn, find the smallest string among positions fx to nn minus 1, read through pos, and then do a swap. But let's see what the nature of the swap is going to be. So here I have our favorite animals again, and we print all the names. Now, how do I really want this to happen? I want to find the smallest animal here, which happens to be Alligator, not physically, in the string sense only. So for fx equal to 0 through nn, min pos is minus 1. What is the minimum name to start with? What is infinity in the case of names? As a hack, I'm just putting a very sleepy animal here, all z's; that's the last animal possible. Now I'll start the mx loop. So mx goes from the frontier to the end. Now the nice part is that strings also have this operator called greater than; this is dictionary order. So if min name is greater than names of pos of mx, then I'm going to set min name to names of pos of mx, and remember min pos as mx. So finally, I have the position to swap. I'm going to swap fx with min pos in only the pos array. Note that, unlike last time, the data or value array is not being touched. So now, to make sure we understand what's going on, let me print the pos array at every step, rather than only at the end. So let's see how this works. Read this slowly; don't rush to read the whole thing. The names array never changes, by the way. Let me print out something more: where it found the min pos and what the element was. Let me print that out.
So I'm printing out the min name found to the right of the frontier. I'm also printing out min pos, and pos of min pos. So what happened here? Initially, my permutation array was the identity. So I started with 0, 1, 2, 3, 4, 5. And it says the minimum string it found was Alligator. Min pos was 1. See, currently I have the identity, so min pos is 1, and pos of min pos is 1: Alligator. So at this point, I've swapped positions 0 and 1, and my indirection array has become 1, 0, 2, 3, 4, 5. Now, in the next step, I find that Giraffe is the smallest remaining animal. Where is that found? Min pos is 2, and pos of min pos is also 2. So now I decide to swap these two guys, so that I get 2 and 0 instead of 0 and 2. Another important thing to note is that as soon as the permutation pos changes, your logical view of the array is getting shuffled around. So you are actually logically shuffling the original data array, but through the pos array only. You're not physically changing its layout. Let me draw a picture, actually, with the same example, then you'll understand a little better. So I have names equal to Zebra, Alligator, Giraffe, Wolf, Lynx, Jackalope. And initially, I have pos equal to 0, 1, 2, 3, 4, 5. If I now write names of pos of px, as px goes up from 0 through 5, I'm just going to get Zebra, Alligator, Giraffe, Wolf, Lynx, Jackalope. Same thing, right? Because pos of i is i. After the first stage, what I do is swap the first two entries of pos. Note that names is not swapped. But what has happened to my view through pos? If I take 0, I actually get 1: if px is 0, then pos of 0 is 1, and I get Alligator. So effectively, I've done the swap. And now in the second step, I have to find the smallest element to the right of this. That is Giraffe. And I decide to swap those two guys. My new pos is 1, 2, 0, 3, 4, 5. What happened to my names through pos? Names of pos of 0 is names of 1, Alligator; names of pos of 1 is names of 2, Giraffe. So it's nothing but always accessing the data array through the pos array.
So you only change things inside the pos array, whose entries are constant sized and small, only four bytes each, instead of moving around things in names, which are much larger and would take more time. So this is a useful trick. If you don't want to update or sort the original array in place, you create an indirection array which gives you the permutation that sorts the array effectively. Let me kill the intermediate prints and only print at the end. There's also some other code here which was sorting the original array; since that's actually cheap at this size, I'll run it as well for comparison. So see, I'm reusing the same print routine to print arrays of ints and arrays of strings. We'll see how that magic is implemented later on in the semester. Those printouts were too distracting. So my initial pos was that; my final pos is this. Now, what does it mean? If I access the first position through pos, I get Alligator. If I access the second position, I get Giraffe. Then I get Jackalope, then Lynx, then Wolf, and finally, Zebra. So this is the correct permutation through which you should read names so that it appears sorted. So that is pretty much where we'll stop. Next time we'll see a generalization of indirection arrays to handle multiple data arrays simultaneously.