Sorting and aggregating. It should be obvious why we need to sort: the relational model has no inherent ordering, so if the user puts an ORDER BY in their SQL query, we have to sort the output. As we go along, there are also going to be scenarios where certain algorithms or operators can be implemented more efficiently by sorting, even though the user didn't ask us to sort. So even if there is no explicit ORDER BY clause, sorting our tuples as we process them may provide additional benefits. Some of the things we'll be able to do are removing duplicates, computing aggregations with GROUP BYs, and bulk loading keys into an index.

So why am I even bothering to talk about sorting? Everyone here has taken CS 101 or whatever the equivalent is at your undergrad institution. If everything fits in memory, we can just use quicksort, heapsort, merge sort, whatever your favorite sort is, and we're done. There's nothing magic about a database system that makes one in-memory sorting algorithm better than another. The thing we're going to focus on is when the data doesn't fit in memory. That's what they don't teach you in intro CS: what to do when you have to spill to disk. Certain algorithms, like quicksort, are not what you want to use, because quicksort is all about random access. It assumes everything is in memory, so you can swap elements anywhere and who cares, right? If you assume there's no I/O. But in a database system we need to be mindful of this because we may have to spill to disk. So that's what we're going to focus on today.

The particular algorithm we're going to look at is called external merge sort. This is the most frequently used external sorting algorithm in basically every database system. The exact naming of the phases or the exact protocol for each step may differ per system, but at a high level they're all doing basically the same thing. It's a hybrid strategy: in the first phase, we sort our input sequence of keys into small sorted runs that fit in memory, and we may spill them to disk because we may not have enough space. Then, in the merge phase, we combine these sorted runs into progressively larger sorted runs that may also get spilled to disk, and we repeat this merging process until the entire sequence is sorted. Again, think about what we talked about before: the database system is going to try to maximize the amount of sequential access it does when it reads and writes data, because sequential access is much faster on disk-based storage. So the algorithm favors sequential access over the random writes that quicksort would do.

We're going to start off with a simple example: a two-way merge sort. Two-way means there are two sorted runs that we merge together at a time. We'll later generalize this to a k-way merge sort, where we can merge k runs together, but it's easier to start with two. So we're going to say that our file, the key sequence we want to sort, is broken up into N pages.
And we're going to restrict the database system to a fixed number of fixed-size buffers that it can use to hold pages while sorting. Again, think about the buffer pool assignment you guys built: you can't store everything in your buffer pool because you run out of space. So you want to restrict how much memory you allow the system to use for sorting so that it doesn't blow out the buffer pool every single time it needs to do a sort.

At a high level, the algorithm looks like this. In pass zero, we scan through all our pages, bringing each page into memory. For each page, we sort just that page, just the values in that page, and then write it back out to disk. These sorted pages are called runs, or sorted runs. So after the first pass, the size of each run is one page: we read a page into memory, ran quicksort on it, and wrote it out. Then, in the subsequent passes, we go back and recursively merge pairs of sorted runs together. It doesn't matter which ones we pick. You could try to be intelligent about it, and we'll see an example, but in practice I don't think anybody actually does this. We merge two runs together and produce a new sorted run that's twice as big as the input runs we merged, and we keep doing this until the whole thing is sorted.

For this algorithm, we only need three buffer pool pages: two pages for the input, one for each of our two runs, and one page for the output, because we always have to write the result out. So at a high level, it looks like this. Say we have our database on disk and we have two pages, and they're unsorted. We fetch the first page, sort it, and write it back out to disk. We fetch the next page, sort it, and write it out to disk. Then we fetch the two sorted pages together and combine them to produce the first output page, then combine the rest to produce the second output page. At a high level, this is what we're doing.

This is considered a divide-and-conquer strategy. Essentially, rather than trying to sort everything at once, which is what quicksort would want to do, we sort smaller chunks and then combine them together as we go along, being smart about how we read and write data.

So this is our input sequence, and we have a little marker at the end to mark the end of the file. In the first pass, we just take each page, run quicksort on it, and write each sorted page out. Then, in the second pass, we take two adjacent runs, combine them together, and write those out. Again, as I was saying before, we don't need to be smart and say, well, I know this run starts with a three and this one starts with a one, so maybe I want to merge these two together. You don't have to do that; I don't know if anybody actually does that. In practice, the algorithm just works fine as it is: these two runs are next to each other, so I'll merge them together.

And the reason we only need three pages to do this is that, say I start with these two runs here, as I'm scanning them I just compare the value under my cursor on the first page against the value under my cursor on the second page.
Whichever one is less gets written out, and that cursor moves down. And I never have to backtrack. I never have to scan down one page and then go back and fetch it again. As I do my comparisons, my cursors move down and I write output pages out; I never have to go back. So I only need two pages for input and one page for output. And when the output page I'm writing is full, it gets written out to disk and I start a new page. We continue to the next pass, do the same thing, combine runs again, and then we get down to the bottom here and we have our completely sorted sequence.

For the two-way merge sort, the number of passes we're going to need is 1 + ⌈log2 N⌉. The total I/O cost, the number of pages we read and write to produce this sorted sequence, is 2N times the number of passes. For each pass, for every page we have, we have to read it in and write it out; that's why it's 2N. The first N is for the reads, the second N is for the writes, and we do this once per pass. Pretty straightforward, right? Yes?

The question is, what happens to the intermediate pages? So say you're here, and I'm combining and sorting these two runs. Again, think of it like you have a cursor here and a cursor there, and you ask: what are the first two keys I want to compare across these two runs? This one is two and this one is four. Two is less than four, so I write two out to my output page, and that cursor moves down by one. Then I compare three and four. Three is less than four, so three gets written out and that cursor moves down. Now I compare four against four, and you flip a coin; it could be one or the other, it doesn't matter. So you're populating one output page at a time, and when it's full it gets written out to disk and you start with a fresh page. And you maintain some internal metadata to say: for this sorted run, pages one, two, three, and four can be found here, in that sequence. They may not be contiguous on disk, but in practice you would want them to be. Yes?

Where does the what, sorry? Oh, where does that one in the number-of-passes formula come from? Because you always need the first pass. You always have to go through the data once; that's unavoidable. And then the log term comes from halving the number of sorted runs you have as you go down. So one pass is always for going through the data the first time. Question here, or no? Okay.

So again, this particular algorithm only requires three buffer pages. It turns out that if you have more than three buffer pages, you're not going to get better performance, because the algorithm won't utilize them effectively. I don't want to go into all the details of the general merge sort algorithm other than to say that here's what the math looks like. The basic idea is that instead of the log being base two, it's based on the number of buffers B that you have. What you're doing is, for every single pass, instead of merging just two runs together, you merge B−1 runs together, and the algorithm essentially works the same way. For every run you're combining, you have a cursor, you compare across all of them, whichever value is the smallest gets put in the output buffer, and you move that cursor down by one.
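To make that cursor-based merge concrete, here's a minimal sketch in Python. This is not how a real system lays out pages on disk; the run iterators, the page size of four keys, and the function names are all made up for illustration, and `heapq.merge` plays the role of comparing the cursors across the runs and always emitting the smallest value. The pass and I/O helpers just restate the two-way formulas from above.

```python
import heapq
from math import ceil, log2

def merge_runs(runs, output_page_size=4):
    """Merge sorted runs into one sorted run, emitting one output page
    (a list of keys) at a time. Each run is an iterator over already-sorted
    keys, so we only hold one 'cursor' value per input run plus one output
    page in memory: the two-input-pages-plus-one-output-page idea from
    above, generalized to any number of runs."""
    output_page = []
    # heapq.merge keeps one cursor per run and always yields the smallest head.
    for key in heapq.merge(*runs):
        output_page.append(key)
        if len(output_page) == output_page_size:
            yield output_page          # "write" the full output page to disk
            output_page = []
    if output_page:
        yield output_page              # flush the last, partially filled page

def two_way_passes(n_pages):
    # Pass 0 makes N single-page runs; every later pass halves the run count.
    return 1 + ceil(log2(n_pages))

def total_io(n_pages, n_passes):
    # Every pass reads and writes every page once: 2N page I/Os per pass.
    return 2 * n_pages * n_passes

# Usage: two sorted runs of two "pages" each -> one sorted run of four pages.
run_a = iter([1, 3, 4, 6, 7, 8, 9, 13])
run_b = iter([2, 5, 10, 11, 12, 14, 15, 16])
for page in merge_runs([run_a, run_b]):
    print(page)

# Two-way sort of 8 pages: 1 + log2(8) = 4 passes, 2 * 8 * 4 = 64 page I/Os.
print(two_way_passes(8), total_io(8, two_way_passes(8)))
```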
So in general, the formula for the number of passes you have to do is 1 + ⌈log_(B−1) ⌈N/B⌉⌉, and the total cost is again 2N times the number of passes. That part is always the same, okay? Again, this is almost a plug-and-chug kind of thing. There's a simple example here that I'm not going to walk through completely, but it shows how the math works out, okay? And I think there's a homework problem that basically does the same thing.

We can actually use B+trees to speed things up, but we have the same distinction as before between a clustered index and an unclustered index. In general, if you need to do sorting and you have a B+tree available that is indexed on the key you want to sort by, then every database system is going to try to use that instead of doing the external merge sort. The way to think about it is that you're already paying an upfront cost to maintain the index, and the sort order is baked into the index along the leaf nodes. Since you're already paying that cost to maintain the sort order, it's going to be faster to use the index instead of computing the sort order on the fly with external merge sort. So in general, the optimizer is going to be smart enough to recognize: oh, I have an index on the attribute I want to ORDER BY, let me use that instead of actually doing the merge sort manually.

But it can really only do this well if you have a clustered index. This is what we talked about before: if the records are stored in the order specified by the index, then we just jump to the leaf page we need, scan across, and we're good to go. We know the output is already sorted in the order we want; there's no extra step. In the case of an unclustered B+tree, like a secondary index, the tuples the leaf entries point to are scattered all over the place, so we may end up doing more I/O than if we had just done the external merge sort.

These formulas I'm showing here are essentially what the query optimizer in the database system is going to use. It recognizes: I have an index, but it's unclustered, and I have this amount of memory I can use for sorting. It can use that to make trade-off decisions about whether it's better to do the index scan on the unclustered index or just do the external merge sort. And this is the beauty of SQL and declarative languages: this is not something you as the programmer have to know about. The system just does it for you automatically. It's not always going to get it right, but in general it's going to do better than what any human can do, right?

So in this example here, we talked about, back when we covered query processing, how I could recognize the pages I'm going to fetch and sort them by their page ID, so that I'm not fetching any page more than once. But if you do that, you've now ruined the sort order, so you have to maintain extra metadata to put everything back in the correct order when you're done. And at that point, you might again be better off just doing the external merge sort.

All right, so now we know how to sort. What can we do with it? Obviously, ORDER BY is easy: if the user asks for that, we have to sort, so we can use this algorithm. But we can use it for aggregations as well. So how would you use it for aggregation?
It's pretty straightforward, right? Say I'm doing a DISTINCT on the course ID for this query. The query plan would first do a sequential scan on the table, apply the filter in the WHERE clause, then do a projection to remove any columns we don't want, and then do our sort to order the keys we want in the output. Now we just walk a cursor through this sorted output and maintain a little buffer that holds the last value we looked at, so that if we come across a value we've already seen, we know it's a duplicate and can just skip it and not include it in the output. So this is another example of where sorting can be used for something other than ORDER BY queries; you can use it to make internal operations go much more quickly. This is pretty common, and a bunch of systems support it.

But actually, the most common way to do aggregations is with hashing. If you don't care about things being ordered, then running the external merge sort algorithm may just be too expensive, and the cheaper thing to do is hashing. We can use the hash table material we talked about before to compute our aggregations. The way to think about this is: in this example here, if I had an ORDER BY clause on course ID, then it would be a no-brainer to use sorting, because I need the data sorted anyway and I can just reuse the sort order to eliminate the duplicates. But if I don't care about sorting, then I may want to use hashing. Hashing is going to add some randomness to our output, because every time we hash a key and put it into a hash table, we remove all sense of ordering from the data it came from. So if we don't care about the sort order, then hashing is almost always going to be superior.

The way we compute aggregates with hashing is that we populate an ephemeral hash table that only lives for the execution of that query. Remember, we talked about data structures earlier and said they can be used in different ways: for table indexes, for metadata, for transient data structures. This is an example of a transient data structure. For one query, we build a hash table, populate it, compute our aggregate, and then throw it away. If we execute the same query over and over again, we rebuild that hash table every single time. So we populate this hash table, and depending on what aggregation we're trying to compute, we do different things with the values we store in the buckets. If we're doing DISTINCT and we see a duplicate, we just throw it away instead of inserting it again. If we're doing a GROUP BY with an aggregation function, then we can update the summary value on the fly as we go.

Again, just like with sorting, if everything fits in memory, then we can use any of the hash tables we talked about before and we're done: linear probe hashing, cuckoo hashing, and so on were all in-memory data structures. So if our data set fits in memory, we're fine. But if it's a really large data set that isn't going to fit in memory, then we need to be smarter, just like in the sorting case. So the hashing aggregate is going to have two phases.
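Before getting into those two phases, here's a minimal sketch of the sort-based DISTINCT just described: walk a cursor over the already-sorted values and remember only the last value seen. The generator style and the names are just for illustration.

```python
def sorted_distinct(sorted_values):
    """Emit each value once from an already-sorted stream of values.
    Because the input is sorted, duplicates are adjacent, so we only
    need to remember the last value we emitted; no hash table needed."""
    last = object()                    # sentinel that never equals a real value
    for value in sorted_values:
        if value != last:
            yield value
            last = value

# Usage: course IDs as they come out of the sort operator.
print(list(sorted_distinct([15445, 15445, 15645, 15721, 15721, 15826])))
# [15445, 15645, 15721, 15826]
```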
In the first phase, we partition the tuples into buckets and write those out to disk. Then in the second phase, we bring those buckets back in, populate an in-memory hash table for just that partition, and compute whatever aggregate we want. Again, we only have to do this if we have to spill to disk; if everything fits in memory, we don't.

So in the first phase, we use one hash function, and it doesn't matter which one, any of the hash functions we talked about before. We do our scan over the input tuples, and whatever key we want to group by, we hash it, and that tells us which partition we write the tuple into on disk. We're not writing into a hash table at this point; we're just writing into output buffers that get written out to disk. The idea behind pre-partitioning is that when we come back and read things into memory to compute our aggregation, there's never a time where we have to go get data from another partition to compute the aggregate, because everything with the same key gets put into the same partition.

So we go to our example here: same query as before, we apply our filter, we remove the columns we don't need, and at this point we scan through these values, apply our hash function, and write the values out into our buckets. Notice that in this case there actually are duplicates of 15-445 going into the same bucket. We're not doing anything smart here; we just take the values and write them out. Because it may be the case, again, if I have a billion tuples, that the first one that goes into this bucket is 15-445, and then a thousand tuples later I see another 15-445 that I want to write out. But I don't know about the earlier one, because I'm not maintaining any metadata about what I've seen in the past to decide what to filter out.

So all these buckets get written out, grouped together into partitions. Then, for each of these partitions, we bring it back into memory and build an in-memory hash table, using whatever your favorite hash table is: extendible, linear, cuckoo, it doesn't matter. We use a second hash function to hash the key into that hash table and then do whatever operation we need based on our aggregation. If it's a DISTINCT, we filter out duplicates. If it's a GROUP BY with an aggregation function, we compute that. Here we're going to assume that each partition, as we bring it in, fits in memory. That's not always the case either, but we'll see how to handle that when we talk about hash joins in the next class. The way to think about it is: we're assuming that our input data doesn't fit in memory, but that each partition, after partitioning, does fit in memory. This obviously assumes we get a uniform distribution after hashing, which isn't always true either, but for simplicity we'll assume it is.

So here we are now in phase two. These are the partitions we generated in the first phase, and for simplicity we're just going to show them combined into a single hash table like this. The keys are just the hashed group-by keys, and since we're doing DISTINCT, we're just maintaining the values inside the hash table.
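Here's a minimal sketch of that rehash step for DISTINCT, assuming the partition we just read back fits in memory. A plain Python dict stands in for the in-memory hash table (its internal hashing plays the role of the second hash function), and the names are made up for illustration.

```python
def rehash_distinct(partition):
    """Phase 2 for DISTINCT: read one spilled partition back into memory,
    insert each group-by key into an in-memory hash table, and drop any
    key that already has an entry in the table."""
    table = {}                            # stands in for the in-memory hash table
    for key in partition:
        if key not in table:              # duplicate? then just throw it away
            table[key] = True
    return list(table.keys())             # the distinct values for this partition

# Usage: one partition spilled during phase 1, with duplicate course IDs.
print(rehash_distinct([15445, 15645, 15445, 15645, 15721]))
# [15445, 15645, 15721]
```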
So as we're scanning along, if we see that we have a duplicate, we just throw it away. And then the final result we produce for the query is just the values stored in the hash table itself. So again, what happens as we go along in this phase is that if we hash 15-445 and see there's already an entry with the same value, we can just discard the new one. We're doing a linear scan inside the value list for that particular key, which we know how to do with hash tables.

For summarization, the idea is that you maintain a running total as you go along and store that as the value in your hash table, rather than just the existence of the key. So again, in the second phase, when I rehash, I hash on the group-by key, I land in a bucket, I find the entry that corresponds to my group-by key, and then I do whatever operation I need to do to update the running total for that function. Say we go to a different query: we're doing a join on student and enrolled, and we're getting the average GPA of the students for each course ID. Now, in the rehash phase, when I build my hash table, the value list holds a mapping from the group-by key, because we're grouping by course ID, to the running totals we need to compute the function.

So the table looks like this, in terms of what you actually need to maintain. If you're doing AVG, it's a pair of count and sum, because at the end you just take the sum divided by the count, and that gives you the average. For MIN and MAX, it's obvious: you maintain the smallest or largest value you've seen. For SUM, you just add the value in; for COUNT, you just add one. So this is how the DBMS maintains the aggregation function inside the hash table as it goes along, as it scans through each partition and populates it. And when it's done, that's the answer it produces as the final output.

So is this clear? Right, again, pretty straightforward, but it's kind of cool that we're building on the stuff we did before, right? We talked about how to build hash tables, and now we can use our hash tables to do aggregations. Or we can use sorting to do aggregations. We're building the layers of the system on top of each other.

The cost analysis for this is essentially: if you have B buffer pages available, then in the first phase you can spill out to B−1 partitions, because you always need one buffer to read the input into, and the other B−1 serve as output buffers for the partitions. And for the total I/O, every page gets read in and written out in the partition phase, and then read back in for the rehash phase. I'm not going to talk about the fudge factor here, but again, the big assumption we're making is that the keys hash uniformly, so each bucket has roughly the same amount of data. In practice that's not always the case. There are more people living in Pennsylvania than in Alaska, so if you hash on the state, the Pennsylvania bucket is going to be overflowing. We'll cover that next class when we talk about hash joins, because that's the big problem these systems have to deal with in practice: how to handle joins when everything isn't uniformly distributed. Okay?
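Here's a minimal sketch of both phases for the average-GPA-per-course example, assuming each partition fits in memory when it's read back. Python lists stand in for the spill files, the built-in `hash()` stands in for the partitioning hash function, and all of the names and the sample rows are made up for illustration.

```python
from collections import defaultdict

def hash_aggregate_avg(tuples, num_partitions):
    """Compute AVG(gpa) grouped by course_id with a two-phase hash aggregate.
    `tuples` is an iterable of (course_id, gpa) pairs."""
    # Phase 1 (partition): hash the group-by key and append the tuple to that
    # partition's buffer. In a real system each partition is spilled to disk
    # page by page; here a Python list stands in for the spill file.
    partitions = [[] for _ in range(num_partitions)]
    for course_id, gpa in tuples:
        partitions[hash(course_id) % num_partitions].append((course_id, gpa))

    # Phase 2 (rehash): read each partition back and build an in-memory hash
    # table keyed on the group-by key, storing a running (count, sum) pair.
    results = {}
    for partition in partitions:
        running = defaultdict(lambda: [0, 0.0])   # course_id -> [count, sum]
        for course_id, gpa in partition:
            running[course_id][0] += 1
            running[course_id][1] += gpa
        # All tuples for a given course_id live in exactly one partition,
        # so we can finalize its average as soon as its partition is done.
        for course_id, (count, total) in running.items():
            results[course_id] = total / count
    return results

# Usage with a few made-up enrollments.
rows = [(15445, 3.8), (15445, 3.0), (15721, 4.0), (15826, 3.6), (15721, 3.0)]
print(hash_aggregate_avg(rows, num_partitions=2))
# {15826: 3.6, 15445: 3.4, 15721: 3.5} (key order depends on the partitioning)
```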
All right, so the main takeaway, and we'll see more about this on Monday, is that the database system has to weigh this trade-off and decide whether to use sorting or hashing, and that depends on what indexes you have available, what the data looks like, and what your query looks like. It uses all of that to figure out whether one choice is better than the other. The spoiler, if you take the advanced class, is that hashing is almost always going to be the better approach. Unless you have an ORDER BY that explicitly wants the data sorted, hashing, whether it's hash joins or hash aggregations, is almost always better; sorting is just more work. There are a bunch of other optimizations we talked about before, about prefetching and trying to do as much work as you can for every I/O you have to do.

The main takeaway from all of this is that the optimizations we focused on were all about reducing disk I/O. Once the data is in memory, I don't really care which sorting algorithm I'm using or how cache-friendly my hashing approach is, because disk I/O is always the most expensive thing, so that's what I want to minimize. In the advanced class, or in an in-memory database, disk I/O is no longer the issue, so you start caring about things like CPU register usage, cache utilization, and other things. For that, you could come up with a better hash table that's cache-friendly. But since I/O is always going to be the biggest problem here, we don't care about that for this class, okay? Any questions about sorting or aggregations? All right, cool.

So next class we'll be focusing on joins. We'll start with the basic nested loop joins and then get more and more sophisticated. And if we have time, we'll talk about what I call exotic joins: lateral joins, anti-joins, semi-joins. But we'll get through the basic join algorithms and see how much time we have at the end to talk about fancier things, okay?