So today's class is about sorting and aggregation, and as usual we're going to start with a few administrative things. First, homework number two is due this coming Sunday, October 3rd, as usual at 11:59pm. Remember there are no late days: you can't use any of your extension days for homework, they're just for the projects. Project number two is out now, and it's due Sunday, October 17th. There will be a Q&A session tomorrow, Thursday, September 30th, from 5pm to 6pm; if you check the Piazza post, it has the Zoom link and all that information. One final thing I want to put on your radar: the midterm exam will be Wednesday, October 13th, in class during the normal class time. We'll release more details about the format by the beginning of next week. The topics covered are everything we've covered in the class up to the midterm, so everything is fair game.

Okay, so last class we were talking about index concurrency control, and we ran out of time at the end, so I want to pick up where we left off and finish the last little bit from that lecture before we move on to today's main topic, which, as I said, is sorting and aggregation. The piece we were talking about was B+tree concurrency control, where we want to allow multiple threads to read and update a B+tree at the same time. This way we don't have a bunch of threads sitting around doing nothing: if we have a lot of cores in our system, we can use all of that parallelism to operate on the B+tree concurrently. But we need to be careful to protect against two types of problems that can come up. One is two concurrent threads modifying the contents of a single node, for example a leaf node, at the same time. The other is one thread traversing the B+tree structure while another is doing reorganization operations that could be splitting nodes, merging nodes, that kind of rebalancing.
We ended last time with the observation that the first step in our latching algorithm to protect the contents of the B+tree was to always take a write latch on the root node, every single time. No matter what write we wanted to do, we always had to latch the root first. But as you scale up to more threads, this becomes a performance bottleneck: remember, the write latch is exclusive, so no other thread can acquire a latch while the root node is write-latched. The whole tree is essentially locked up and everyone is blocked until the modifying thread releases that root latch.

A better latching algorithm takes the following assumption into account: most modifications to a B+tree will not require a split or a merge. So instead of assuming there will be a split or merge, we optimistically traverse the tree using read latches. Remember, read latches let other threads that also want read latches come in and acquire them, so we can have multiple concurrent readers, but only one single thread can hold a write latch on a node at any given time. The intuition is that, especially at the higher levels of the tree, there will be very few rebalancing operations. In a lot of the examples we've shown, the nodes are relatively small, with two or three elements each; that's good for illustration because it lets us show rebalancing, but as you scale up to larger trees, imagine a million or a billion elements, most modifications take place at the bottom layers of the tree: the leaf nodes, or the inner nodes only a few layers up from the leaves. So the intuition is that we can optimistically traverse the tree, taking just read latches to protect our path down, not intending to write-latch anything. And if we guess wrong and we actually do need to propagate a rebalance operation up the tree, we abort the operation and start over from the beginning, this time acquiring write latches as we repeat the traversal.

So how does this work compared to the original algorithm? Search is the same as before; there's no difference there. For insert and delete, we set the latches the same as for search until we get down to the leaf node, and then we set a write latch on the leaf. So all latches are read latches all the way down, and then a write latch on the leaf node. If the leaf is not safe — remember, we defined "safe" last time as not needing to split or merge, meaning there's enough room or enough keys to accommodate whatever change you're going to make — then you have to start over from the root and traverse down taking write latches. That's why we call this optimistic: we optimistically assume we won't need write latches all the way down the tree, versus pessimistically acquiring those latches as we descend.
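To make this concrete, here is a minimal sketch of the optimistic protocol in Python. The node and latch API here (read_latch, write_latch, release, child_for, is_safe_for) is hypothetical, invented purely for illustration; it's not from the slides or from any real system, and it ignores edge cases like a tree whose root is itself a leaf.

```python
# Hypothetical node/latch API, invented for illustration only.

def write_optimistically(root, key, op):
    # Pass 1: crab down with READ latches; write-latch only the leaf.
    node = root
    node.read_latch()
    while not node.is_leaf:
        child = node.child_for(key)
        if child.is_leaf:
            child.write_latch()        # only the leaf gets a write latch
        else:
            child.read_latch()
        node.release()                 # latch coupling: drop the parent latch
        node = child
    if node.is_safe_for(op):           # no split/merge needed: the common case
        op.apply(node)
        node.release()
        return
    node.release()                     # guessed wrong: abort and restart

    # Pass 2 (rare): pessimistic descent with WRITE latches, releasing
    # all latched ancestors as soon as the current node is safe.
    node = root
    node.write_latch()
    held = [node]
    while not node.is_leaf:
        node = node.child_for(key)
        node.write_latch()
        if node.is_safe_for(op):
            for ancestor in held:
                ancestor.release()
            held = []
        held.append(node)
    op.apply(node)                     # may split/merge into latched ancestors
    for n in held:
        n.release()
```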
We'll just look at two examples here quickly. This is example two from the previous set, where we deleted key 38. Here, instead of taking a write latch on the root node, we take a read latch and descend the tree as usual, taking read latches all the way down. At the bottom we acquire the write latch and say, okay, we want to delete key 38; node H doesn't need to coalesce, so we're safe, and we're done. We achieved the same thing we did taking write latches down the tree, using only read latches, so we can do our deletion and we're all set.

This is example four, also from the previous set, where this time we insert key 25. Again we use the new read-latching algorithm to descend the tree until we get to where key 25 needs to be inserted, and we see that we need to split node F. So we have to restart and re-execute the insertion from the root node, this time acquiring write latches, because we figured out we have to split this node and we're going to need to rebalance some of the upper levels. Does anyone have any questions about either of these two examples?

Okay. The other thing to point out is that in all of the examples we've shown so far, both in the last class with pessimistic write latching and here with optimistic read latching, all of the threads have been acquiring latches in a top-down fashion. Remember, in the hash table examples we always had the threads accessing the hash table in a forward scan: in the linear probing hash table you scan forward until you wrap around back to the top, so threads are always scanning, and therefore always acquiring latches, in the same direction. Likewise for the B+tree, we've always acquired latches starting from the root node and moving down as we traverse the tree. That's what has allowed our algorithms to prevent deadlocks, because recall there's no deadlock detection or deadlock prevention for latches; the only thing we can do is avoid deadlocks through careful algorithm design.

One of the other operations we talked about on the B+tree was the leaf node scan: you traverse down to the leaf node where you want to start your scan, and then, using the sibling pointers, you traverse across the leaf node layer to get the range of values you're looking for. Now we have to think about what happens if we have threads acquiring latches from the top of the tree coming down and, at the same time, threads scanning along the leaf node layer: we have to be careful that we don't run into a deadlock. We'll see a few simple examples of this.

Let's first do a simple leaf node scan with one thread. T1 wants to find all keys less than 4. We start at the root, figure out which leaf node we need to reach to find keys less than 4, and then scan across the leaf nodes to cover our range. Similar to how we took latches going down the tree with the latch coupling, or "latch crabbing," algorithm, where you don't release the latch on the node you currently hold until you have the latch on the node you want to go to next, we do the same thing in this leaf node scan.
Here we don't want to release the latch on leaf node C before T1 gets the latch on leaf node B. Once we have that latch, we can move over and release the latch on C, because we're doing this in a coupling fashion, always holding the current latch while acquiring the next one.

Now let's get a little more complicated and add two threads executing at the same time. Thread 1 still wants to find all keys less than 4, and thread 2 wants to find all keys greater than 1. Say they start at roughly the same time: they both get the read latch on the root node, no problem there, and then they each move down to their respective leaf nodes where they need to start scanning. Now thread 1 wants to scan backwards from 4 and thread 2 wants to scan forwards from 1. They each release their latch on the root node, since they've reached their leaf nodes, and now each wants to acquire a latch on the other's leaf node. What do we think happens here?

Yes — the question is whether it's a reader-writer latch or just a regular latch. Let's assume it's a reader-writer latch, just like all the other ones we've had, so you can acquire it in either read mode or write mode. Okay, so yes, that's correct: there's not going to be a problem. Both T1 and T2 can get the read latch on each other's leaf nodes, because, as we've discussed before, the compatibility matrix allows two threads to hold a read latch on the same node at the same time. They both get the read latch on each other's leaf node and then release the latch they were holding as they pass each other. So this example works out.

But if we now have arbitrary readers and writers at the same time, you can pretty easily run into deadlocks, because latches are being acquired not just in a top-down fashion but in arbitrary directions. Again, the important thing to remember is that latches don't support deadlock detection or avoidance; the only protection we have is carefully designing the algorithms and data structures to prevent these situations from happening. In the previous example, we could mandate, for instance, that you can't scan in the reverse direction across leaf nodes — we only allow scans in the forward direction — and that would be one way of preventing deadlocks here. So the key takeaway is that there's no higher-level deadlock detection or management; it's in the algorithm and data structure design that we have to be careful to avoid deadlocks in our traversals.
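The fix described above is to restrict the scan direction. Another common way to get the same guarantee — sketched below as an assumption, not something from the slides — is a "no-wait" policy: if a leaf scan can't immediately latch the next sibling, it gives up and restarts instead of blocking, so two scans heading toward each other can never wait on each other forever.

```python
# Hypothetical sketch: try_read_latch() is assumed to return False
# immediately, instead of blocking, when another thread holds the latch.

def scan_siblings(start_leaf, visit, forward=True):
    leaf = start_leaf                     # caller already read-latched it
    while leaf is not None:
        visit(leaf)
        sibling = leaf.next if forward else leaf.prev
        if sibling is None:
            leaf.release()
            return True                   # scan finished
        if not sibling.try_read_latch():  # would block -> don't wait
            leaf.release()
            return False                  # caller restarts the whole scan
        leaf.release()
        leaf = sibling
    return True
```

The price is wasted work on restarts, but since latches are only held briefly, collisions like this should be rare.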
Just to wrap up the index concurrency material: generally, making data structures thread-safe is notoriously difficult in practice. It's really hard to do. You have to pay close attention to figuring out whether there are cases where you can run into potential concurrency errors or deadlocks, and engineer around them. We mostly focused on hash tables and B+trees, but the same high-level techniques are applicable to other data structures that we won't cover in this course; you can imagine generalizing these same ideas to arbitrary tree structures, for example. Before we move on to the main topic for today, are there any questions about the concurrent index material?

Okay, so let's take a step back and look at where we are in the course. We've talked about a lot of different layers of this stack representing the different components of the DBMS, and today we're going to start talking about how to execute queries using the components we've discussed so far. We're at the operator execution layer of the stack now: we've already talked about the disk manager, the buffer pool, and the different access methods and index structures that can speed up access to the data in the database. Now we're going to focus on the individual operators that we want to provide to the query interface. Over the course of the next four lectures we'll cover three topics: specific operator algorithms, meaning how we implement individual operators; query processing models, meaning how we string these operators together to execute something like a SQL query; and runtime architectures, which have trade-offs for different workloads and situations.

At a high level, the thing we're going to execute is called a query plan. In a query plan, the operators are arranged in a tree structure — technically it's a directed acyclic graph, but for our purposes you can think of it as a tree. Here we have a really simple SQL query: we're selecting some values from two tables, doing a join between them, and applying a filter condition on the B value. The way to think about this — and we'll talk more about it in later lectures on query processing models — is that data flows from the leaves of the query plan tree up to the root. Here we have A and B at the bottom; those are just scans of tables A and B. Data flows from those leaves up through the operators at the inner nodes, and whatever comes out at the root of the tree is the output of the query plan. So we have a scan over A and a scan over B; the output of the scan over B goes to a selection operator that filters out values based on the selection condition; those two inputs go to the join, which joins them together; and finally a projection returns just the two attributes we want. That's how it works at a high level, and we'll go into the details of exactly how this is implemented in a couple of classes.
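Just to preview the data-flow idea (we'll cover processing models properly later), here is a toy pull-based sketch in Python where each operator is a generator and tuples flow from the leaf scans up to the root projection. The table contents, attribute positions, and the filter constant 100 are all invented, and a simple hash join is used just to make the sketch runnable.

```python
# Invented toy tables: A = (id, attr), B = (id, value).
A = [(1, "x"), (2, "y"), (3, "z")]
B = [(1, 50), (2, 150), (3, 200)]

def scan(table):                          # leaf operator
    for row in table:
        yield row

def select(rows, pred):                   # filter operator
    for row in rows:
        if pred(row):
            yield row

def hash_join(left, right, lkey, rkey):   # inner-node operator
    build = {}
    for row in left:                      # build side: all of A
        build.setdefault(lkey(row), []).append(row)
    for row in right:                     # probe side: filtered B
        for match in build.get(rkey(row), []):
            yield match + row

def project(rows, cols):                  # root operator
    for row in rows:
        yield tuple(row[c] for c in cols)

plan = project(
    hash_join(scan(A),
              select(scan(B), lambda r: r[1] > 100),
              lkey=lambda r: r[0], rkey=lambda r: r[0]),
    cols=(0, 3))                          # A.id, B.value

print(list(plan))                         # [(2, 150), (3, 200)]
```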
The important thing to remember here is that, just as we can't assume a table fits entirely in memory, we also can't assume that the query results, or even the intermediate results of the different operators we saw in the query plan on the last slide, will fit entirely in memory. The point of the disk-oriented DBMS we've been building is that it gives the user and the application the impression that all of the data can be processed in memory, even when the data exceeds memory. So we're going to rely on the buffer pool mechanism that we've talked about, and that you've implemented in a project, to implement the different algorithms we need in order to spill to disk. And it's important to remember that we'll prefer algorithms that maximize the amount of sequential I/O. This matters because of how disk accesses work: sequential I/O, remember, is much more efficient than random I/O.

At a high level, what we'll talk about here is different from how you may have studied algorithms in the past. In an algorithms course you talk about the time complexity of an operation — a hash table lookup is O(1), a sort is O(n log n) — but as I mentioned in a previous lecture, the important thing to care about here is the constant that goes along with that complexity. An O(1) lookup in a hash table that lives in memory is fine, because memory is cheap to access; an O(1) lookup in a hash table where every single access goes to disk is going to be really slow, even though the theoretical time complexity is the same. This is another good example of why the DBMS needs to think carefully about managing its own memory and file I/O rather than letting the OS do it: we can design these algorithms to take the disk accesses into consideration, with better caching and prefetching policies than the OS could have, because the OS doesn't know exactly what operations we're trying to perform.

The two things we'll talk about today are the external merge sort, which is a sorting algorithm for data too large to sort in memory, and different aggregation operations. We'll see how these high-level strategies can be designed around a divide-and-conquer approach and applied to two different types of operators. This also segues into the next class, where we'll talk about an alternative to sorting: hash-based methods.

So what are some reasons we need to sort at all? In previous lectures I have made, and will continue to make, a big deal out of the idea that the relational model (and SQL) is a set or multi-set algebra: it's unsorted, so there's no inherent order between the tuples in our tables. So why do we care about sort order? Any examples of when we might need to sort our data? No? Okay. Well, the first and most obvious one is that queries may explicitly request that the tuples we return in our result are sorted in a specific way. If the SQL query specifies an ORDER BY clause — I want the output sorted by this particular attribute or set of attributes — then the DBMS needs to take care of that and sort the result in that particular order. Another situation is that even if the query doesn't explicitly specify an order, so even if the user doesn't care what order the results come back in, there are times when we may still want to sort the data in order to speed up or implement other algorithms.
Here are just a few examples; there are many others where the same idea applies. First, duplicate elimination: if you just want all the distinct values in a particular table, it's trivial to support once the data is sorted. Sort the data in a particular order, and as you scan through, consecutive duplicate values can be removed from the output, because you know you just saw that value; you can skip through the results that way. Second, we talked about bulk loading sorted tuples into a B+tree: it's much faster to bulk load than to do insertions from the root node following the usual insertion algorithm, but it requires that the data is sorted. So if you want to build a B+tree in this bulk-load fashion, you first sort your tuples and then build the tree from the leaf nodes up. And finally, we'll see in this lecture an example of how you can use sorted data to speed up aggregations and implement a GROUP BY: if you want an aggregate grouped by some value, the first step in that pipeline can be to sort the data and then perform the aggregation. We'll see how that works in a couple of slides.

I mentioned this a few slides ago, but I want to reiterate: if the data fits in memory, you can just use any standard sorting algorithm. Quicksort, or any of the other million sorting algorithms out there — pick your favorite, it doesn't matter, because everything is in memory and everything is fast. The problem is when the data does not fit in memory: then we need an algorithm that is aware that we'll be writing to and reading from disk. We have to know in advance how much working room we have in memory — I can read this many pages into my buffer pool — and then we have to write pages out intelligently when the buffer pool fills up. That's the whole idea behind the external merge sort algorithm we're going to talk about today.

External merge sort is basically a divide-and-conquer algorithm. We split the data up into separate pieces called runs, sort each of those runs individually, and then, in the merge phase, combine them back into longer sorted runs. So the two basic phases are: sorting, where we sort individual chunks of data that fit in memory — read the data in, sort it in memory, and write the sorted chunks out to a file on disk — and merging, where we combine sorted runs into larger sorted chunks.

A "run," or sorted run, is just a list of key/value pairs. The key is the attribute or attributes we're using for the sort order, whatever we're sorting by: it could be the primary key, it could be a date if I want all the data ordered by date — whatever we're using to impose the sort order on the tuples. For the value you have, as in many cases, two choices. You can include the entire tuple in the sorted run, which is called early materialization: since we're sorting by key, with keys ascending (key 1, key 2, and so on), we co-locate the tuple data directly in the sorted run with the keys, proactively putting all the data right where it needs to be. Or you can include just the record ID in the sorted run, which is called late materialization: instead of storing all the tuple data, we store the key along with a pointer — just a record ID — that tells us where to go at the end to reconstruct the tuple. We'll talk about the reasons you might prefer one over the other. It may have to do with your data storage model — whether you have a row store or a column store is one reason to prefer one or the other — and there are also cases, for example if you're doing a lot of filtering at later stages, where late materialization may be beneficial because you avoid a lot of extra copying. We'll see examples later where one method is preferred over the other, but for now we're not going to factor this trade-off into the algorithms we discuss.
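Here is a tiny sketch of the two run layouts, with made-up records; the point is just that late materialization pays one extra lookup per result tuple at the end, while early materialization carries the whole tuple through every pass.

```python
# Made-up records: record_id -> full tuple.
tuples = {
    ("page1", 0): {"key": 7, "name": "a", "gpa": 3.2},
    ("page1", 1): {"key": 3, "name": "b", "gpa": 3.9},
}

# Early materialization: the whole tuple travels with the sort key.
early_run = sorted(tuples.values(), key=lambda t: t["key"])

# Late materialization: sort (key, record_id) pairs; fetch tuples at the end.
late_run = sorted((t["key"], rid) for rid, t in tuples.items())
final = [tuples[rid] for _, rid in late_run]   # one extra lookup per result
```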
So let's start with the simplest, most basic example, the two-way external merge sort, and later we'll generalize it to a k-way sort. "Two" is the number of runs we merge into a new run on each pass: at any given pass in this algorithm we only ever take two runs and merge them into a single longer run. The data is broken up into N pages — imagine a table broken into N pages — and the DBMS has a finite number of buffer pool pages, B, that we use to hold the input and the output. This is something the DBMS needs to know ahead of time: the amount of memory we have to work with for sorting. It's typically configurable; there are configuration options in Postgres, MySQL, and so on where you, as the database administrator, can say how much memory should be available for these types of operations.

Let's go through an example. Imagine we have a dataset with, for now, just two pages stored on disk, and we want to do a two-way external merge sort. The first pass, which we'll call pass 0, reads the pages of the table into memory, sorts each page into a run, and writes each run back out to disk. For now assume we have one page in our buffer pool, and we read pages in one at a time to do the sorting. First we read in page 2: it goes into our buffer pool in memory, and we sort it — we can even do it in place, so we just need that one free slot in the buffer pool. Now we have a sorted run, and we write it back out to disk in a new page. The original page is unchanged, still unsorted as it was before, but now we have a new intermediate result materialized somewhere on disk: the sorted version of page 2. The next step is to read in page 1, sort it the same way to get the sorted version of page 1 as a run, and write that out to disk.
The next step, in the subsequent pass, is to recursively merge pairs of runs into runs that are each twice as long. We'll take each of these individual pages and merge them into runs that are two pages long. To do this step, what's the minimum number of buffer pool pages we need? Yes — the answer is three, that's correct. We need one for each of the input pages, plus one for the output: the two pages coming in for our two-way merge, and one page left over to write the merged output, which then gets written to disk. The way this works is we read in the two input pages and we have our output page. Once the output page fills up from the first two pages — because we can only fit some portion of each input page into the output page — we write that part of the run out; that's the first page of the run. Then we reuse the same buffer pool slot and fill it up with the remainder of the two input pages, which becomes the second page of the run. If we had more pages, we'd keep repeating this process recursively until we end up with one single run at the end.

Now let's make the accounting more concrete. In each pass we read and write every page in the file. In terms of the total number of passes and the total I/O cost: the number of passes is 1 + ⌈log₂ N⌉, where the "+1" is that initial pass 0 where we sort each page individually. (It varies whether Wikipedia or different textbooks include that first sort pass in the count, but we'll include it here.) So it's the first sort pass, plus ⌈log₂ N⌉ merge passes, because we're recursively merging two runs together at a time. And the total I/O cost is 2N × (number of passes), because on each pass we need to read and write every page of the file.

As a concrete example, say we have these pages to sort. Each page contains two values, and the last one is just, we'll say, an end-of-file marker to let us know we're done. Again, on pass 0 we read each page in and sort it individually, so all of our values are sorted within each page for the merging phase. In this case some pages are already sorted — the first page, [3, 4], is already in order, so we don't need to do anything — while the second page goes from [6, 2] to [2, 6], and so on all the way across. The next stage, pass 1, merges each pair of one-page runs into two-page runs: we take consecutive pairs and merge them together. For the first pair, we look at the minimum, or lowest, remaining value in each of the two pages. In this case the smallest value, from the right page, is 2, so that goes into the first slot of the output page.
Then we move the pointer over and say, okay, now I have 3 and 6; 3 is smaller, so 3 goes into the run page that we're going to write out. Now we repeat this process to fill out all of the pages the same way, writing out a page as soon as it fills up. In the case of the first two pages, when we merge them together, we fill the output page with 2 and 3, write it out, clear it, and then 4 and 6 come in to fill out the second page of the two-page run. This process continues all the way down: at pass 2 we merge into four-page runs, and finally we merge everything into one final eight-page run at the end, and that's the entire file sorted. So this is a divide-and-conquer strategy: split the file into these smaller runs, then recursively merge them together to get the final sorted result.

Are there any questions at a high level about how this algorithm works? This symbol here — yes, that should be N. B was the size of the buffer pool, and here we're talking about the number of pages in the table. Sorry about that. Any other questions about this, or other errors I've made in the slides? Those are also welcome.

Yes — the question is whether these merge phases are performed sequentially, or whether you can have multiple threads working on them concurrently. For simplicity, for everything we discuss here, assume one thread: as soon as we finish the first two runs, we move on to the next two. However, you could parallelize it, and in fact a lot of systems do partition the work across multiple threads. For example, when you're merging the one-page runs into two-page runs, you could have each thread take two individual pages to merge, and those can work independently of each other. As you get further down in the passes you have to be careful, because you can't use as many threads. For now we'll assume one thread does all of this. Also, if you have multiple concurrent threads running, you have to think about resource utilization: we've said we'll have, in this case, three buffer pool pages assigned — one for each input and one for the output — so with, say, four-way parallelism you now have four times the number of buffer pool pages assigned. You have to trade off the parallelism against the amount of extra resources you need to support that level of parallelism.

So, to reiterate, the two-way external merge sort only requires three buffer pool pages to perform the sorting: B = 3, with two input pages and one output page. But three pages is pretty restrictive; with 4 KB pages that's not very big in terms of today's main memory. In a lot of cases B is greater than 3 — you have more buffer pool space to work with — so we need to figure out how to leverage that buffer space without having the worker thread that's doing the sorting block on disk I/O. Since we're reading these pages in and then writing them out, we want to do it in a way where we're not stalled waiting on disk I/O while we have more buffer pool resources available.
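Putting the whole two-way algorithm together, here is a runnable toy version where "disk" is just a list of pages (each page a small list of values). The input values are invented, and a real implementation would stream pages through the three buffer slots instead of materializing whole runs in memory.

```python
import heapq

PAGE_SIZE = 2

def pass0(pages):
    # Pass 0: sort each page individually; each run is a list of pages.
    return [[sorted(p)] for p in pages]

def merge_two_runs(run_a, run_b):
    # Stream-merge two sorted runs, then repaginate the output.
    merged = list(heapq.merge(
        (v for page in run_a for v in page),
        (v for page in run_b for v in page)))
    return [merged[i:i + PAGE_SIZE] for i in range(0, len(merged), PAGE_SIZE)]

def two_way_external_sort(pages):
    runs = pass0(pages)
    while len(runs) > 1:                       # each iteration = one pass
        runs = [merge_two_runs(runs[i], runs[i + 1])
                if i + 1 < len(runs) else runs[i]
                for i in range(0, len(runs), 2)]
    return runs[0]

print(two_way_external_sort(
    [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2, 0]]))
# -> [[0, 1], [2, 2], [3, 3], [4, 4], [5, 6], [6, 7], [8, 9]]
```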
The way we do this is called double buffering. This is an optimization of the previous algorithm where we prefetch the next run in the background — either some other thread does the prefetching, or the worker thread issues a prefetch request for the next run that it knows it's going to need — and we store it in an extra buffer pool slot. So here we read page 1 from disk into our buffer pool in memory, and while the worker is sorting it and doing whatever it needs to do, in the background we can also be fetching page 2, which we know we need to sort next, into the buffer pool, to mask that disk I/O. We get it loaded and ready to work on as soon as we're done with the first page. When that page is done, we can move on to the next one and start sorting right away: it's already in memory, so we don't have to wait on a separate disk I/O request to get that page in. You can generalize this as far in advance as you want: with a sufficiently large buffer pool you can prefetch as many runs ahead as you like, since you know the order you'll need them in, and mask arbitrarily expensive disk accesses. Are there any questions about this optimization?

Yes — the question is whether you throw away the intermediate sorted runs you write to disk as soon as you're done with them, or keep them around. Technically you don't need them anymore: as soon as the query result is returned, you can get rid of them. But sometimes there are reasons to keep them around. Maybe it's a query that gets executed frequently, so you keep the intermediate result around — think of it sort of like caching. You store the intermediate computation you've done somewhere in a special cache, so that if another query would benefit from accessing it — or is even the exact same query — you can go back and re-scan your intermediate results rather than starting from scratch. But from a correctness point of view, there's no reason to keep the intermediate results around once the query is done executing.

Yes — the question is, what if your memory system is able to return sorted pages to you, like hardware acceleration or in-memory computing? That would be nice, I guess. The problem is that you're only getting sorting at the level of a page. Even if you can sort in the hardware or in the memory — and I assume it doesn't have to go through the CPU — you still need to do the merge phase to get the longer runs. So you could accelerate the sort part, with each individual page sorted for you, but you'd still need to merge them together into longer runs.
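Here is a toy sketch of the double-buffering idea: a one-thread "I/O service" prefetches the next page while the worker sorts the current one. read_page and its sleep are stand-ins for real disk reads, invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def read_page(page_id):
    time.sleep(0.01)                 # pretend this is a disk I/O
    return [3, 1, 2]                 # fake page contents

def sort_all(page_ids):
    results = []
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(read_page, page_ids[0])
        for next_id in page_ids[1:] + [None]:
            page = pending.result()                      # waits only if prefetch lags
            if next_id is not None:
                pending = io.submit(read_page, next_id)  # overlap I/O with CPU work
            results.append(sorted(page))                 # CPU work on current page
    return results

print(sort_all([101, 102, 103]))
```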
Okay, so the general external merge sort is the generalization where you have some number B of buffer pages, rather than just the three we were working with, and on each pass we use all B buffer pages at the same time. In the first pass we produce ⌈N / B⌉ sorted runs, each of size B pages. On subsequent passes we merge B − 1 runs at a time in a k-way merge: we take B − 1 input runs and merge them simultaneously into a single longer run. In this case the number of passes is 1 + ⌈log_{B−1} ⌈N / B⌉⌉, and the total I/O cost is still 2N × (number of passes), because again we read and write every page on every single pass.

An example of how this works is pretty straightforward. Let's determine how many passes it takes to sort 108 pages with 5 buffer pool pages, so our inputs are N = 108 and B = 5, and we can walk through the passes. Pass 0 produces ⌈108 / 5⌉ = 22 sorted runs of 5 pages each; it's uneven, so the last run is only 3 pages long. Pass 1 merges 4 runs at a time and leaves us with 6 sorted runs of 20 pages each; again it's uneven, so the last one is 8 pages long. Pass 2 leaves us with two sorted runs, the first with 80 pages and the second with 28 pages. Then the last pass merges those into the whole sorted file of 108 pages. So using this formula we can figure out how many total passes we're going to need. Are there any questions about how we generalized the two-way merge sort into this k-way merge sort?
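These formulas are easy to sanity-check in a few lines; this reproduces the 108-page, 5-buffer example from the slides.

```python
from math import ceil, log

def num_passes(n_pages, b_buffers):
    runs = ceil(n_pages / b_buffers)           # runs produced by pass 0
    return 1 + ceil(log(runs, b_buffers - 1))  # plus (B-1)-way merge passes

def io_cost(n_pages, b_buffers):
    # Every pass reads and writes all N pages.
    return 2 * n_pages * num_passes(n_pages, b_buffers)

print(num_passes(108, 5))  # 4 passes: 22 runs -> 6 -> 2 -> 1
print(io_cost(108, 5))     # 864 page reads + writes
```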
Okay, so the next thing to think about: we've talked about B+trees in the past, and they maintain the data in sorted order; we also looked at how we can scan a B+tree's leaf nodes to get a sorted result. So it's natural that, rather than doing this external merge sort, if we already have the data in some kind of sorted data structure like a B+tree index, it makes perfect sense to leverage that to accelerate sorting. There are two cases, the same ones we talked about in the B+tree lecture. One is the clustered B+tree, where the data is stored in sorted order in pages based on whatever the clustering attribute is — usually the primary key — so the physical tuples are stored in pages in the sort order. The other case is the unclustered B+tree, which, remember, can be completely out of order relative to the data; it just gives us a data structure built on some non-primary sort attribute.

We'll look at the first case first. We have a table stored in a clustered B+tree, and we traverse to the left-most leaf page, just like in the previous version, to scan the tuple pages. Since they're in sorted order we get nice sequential access: we can read all of the tuples in exactly sorted order — first page 101, then page 102, and so on — because this order is maintained and guaranteed by the B+tree. This is always going to be better than the external sorting algorithm, because there's no computational cost: we already know the data is in sorted order, and we can just scan forward with sequential disk I/O. If the query optimizer in your DBMS knows this, and knows you have a clustered B+tree index on the data, it can choose whether it makes sense to execute a sorting operation or, if it knows the data is already in sorted order, to just return a scan over the sorted underlying data.

The other case is the unclustered B+tree, and as you may have guessed, you end up with really bad I/O access patterns. It's going to be random access, because you have all of these pointers pointing all over the place in your base data. In general you end up with one disk I/O per record in the B+tree. This is really bad: now we're doing a disk I/O for every single record, whereas in the external merge sort — I'll talk about the sort-merge join next class, but in the merge sort passes — we can constrain how many passes we do and compute exactly how much disk I/O we need.

I also mentioned that you can use sorting as a precursor step to implementing an aggregation. Basically, we collapse the values of a single attribute from multiple tuples into a single scalar value: that's what an aggregation does. It takes groups of tuples, grouped by some key that you want to group by, and compacts each group into a single scalar value — think MAX, AVG, SUM, that kind of thing. As I mentioned earlier, we have two choices, sorting or hashing. We'll talk about sorting first, but I'll give you a small spoiler: hashing is going to be better in a lot of cases. There are some instances where sorting might be preferred, but hashing wins a lot of the time. We start with sorting because it's a little easier to think about, and we've already covered the prerequisite algorithms to get there.

As a simple example, let's take DISTINCT. Imagine we want to get all of the distinct course IDs from this enrollment table where some student got either a B or a C in the course — only the course IDs where some student got one of those grades — and let's say we want the result ordered by course ID. Since the query has that sort property, it's beneficial for us to sort, so we can return the results in sorted order. The first step — and this will come up again in our discussion of query optimization and query planning — is that you pretty much always want to remove useless data as early as you can in the plan. In this table there's one tuple where the student got a grade of A in the course, and we want to filter it out as early as we can in the plan, so we don't have to process it in our subsequent operators. It's not going to be in our result set, so it would just waste our time to sort something we don't care about. So we filter things out as early as possible: we apply the selection predicate, grade IN ('B', 'C'). Then we remove the columns we're not going to need. The only column in our SELECT clause is the CID — it's SELECT DISTINCT cid — so we only need the CID column in the result set. Going along with this idea of removing as much unnecessary data as possible, we apply the projection to get rid of the other two columns we don't need.

Now we're left with just the CIDs, and the final step in the sorting aggregation is, obviously, to sort: we take the CIDs we have after the first two steps and sort them in ascending order. Then we do essentially one final scan through the sorted result to remove duplicates. We start at the beginning and see 15-445; as we scan forward we see another 15-445, and since we just saw that value we don't need this next one, so we get rid of it. You scan forward in this way, checking whether the current key you're looking at is the same as the last key you just saw: if it is, you get rid of it; otherwise, you add it to your result set. That's the high-level idea of how sorting aggregation works. Are there any questions about this?
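Here's the whole pipeline as a few lines of Python, with invented enrollment rows; the in-memory sort() stands in for the external merge sort when the data doesn't fit.

```python
# Invented rows: (student_id, course_id, grade).
enrolled = [
    (100, "15-445", "C"), (101, "15-721", "A"),
    (102, "15-826", "B"), (103, "15-445", "C"), (104, "15-445", "C"),
]

filtered = [r for r in enrolled if r[2] in ("B", "C")]   # selection first
cids = [r[1] for r in filtered]                          # then projection
cids.sort()                                              # external sort in a real DBMS

distinct, last = [], None
for cid in cids:                  # one final scan removes duplicates
    if cid != last:
        distinct.append(cid)
        last = cid
print(distinct)                   # ['15-445', '15-826']
```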
Okay, so let's talk about some alternatives to sorting. What if we don't need the data to be ordered? In the last example there was an ORDER BY clause at the end saying we wanted the result produced in a particular order; now suppose we just want the result back and we don't care. An example is forming groups in a GROUP BY statement when we don't care what order the groups are in — that doesn't require ordering. Or removing duplicates when, again, there's no ORDER BY clause and you don't care what order the results come back in. An alternative we can use in these cases is hashing, and it's going to be a better choice in these scenarios than doing the full sort, because it can be computationally cheaper than having to sort the input to the later operators in the query plan.

The way we perform a hashing aggregation is to build some kind of ephemeral hash table as the DBMS scans the table. For each record, we check whether there's already an entry in the hash table. If we're doing a DISTINCT operation, then when we hash the key, look in the hash table, and find it already exists, we can just throw the record away: we already know its key is in the hash table. If we're doing a GROUP BY — and for now assume the aggregation function is commutative and associative, something like SUM or MIN or MAX — we can perform the aggregate computation incrementally. Imagine a SUM: if I want to sum all the values for a particular key, then each time I hash the key and find it in the hash table, I just add the value to the running sum, until I get to the end and have the total. If you have other aggregates, like MEDIAN, you have to do something different, but we won't talk about those here.

If everything fits in memory then, the same as with sorting, this is easy and you're done: use one of the hash tables we talked about before, use your favorite hash table, whatever — it's all going to fit in memory, you'll be fine, and there won't be a problem.
But again, if the DBMS has to spill to disk, we need to be more careful about how we manage our I/O and our operations. This algorithm is called an external hashing aggregation, and it's split into two phases. Phase 1 is the partition phase: we divide the tuples into different buckets based on the hash of their key, and while we're partitioning, we write each partition out to disk as it fills up. Once that's done, that's one pass through the data. In phase 2, the rehash phase, we build an in-memory hash table for each partition and compute the aggregation over that partition; then there's a final step where we merge our intermediate hash tables into one final hash table for the result. It's the same high-level divide-and-conquer idea as the external merge sort, and the partitioning is what lets us maximize the sequential I/O we perform to and from disk — we'll see why in a second.

Diving a little further into phase 1, the partitioning phase: we use some hash function h1 to split the tuples into separate partitions that are stored on disk. A partition is just one or more pages containing the set of all keys with the same hash value under h1. Partitions are spilled to disk the same way we spilled the output buffers in the sort: as soon as an output buffer fills up, we write it out to disk and reuse the buffer to fill out the rest. Again, assume we have B buffers; we use B − 1 buffers for the partitions and one buffer for the input. The key idea is that the same key from different tuples always hashes to the same partition, so all equal keys end up co-located in a partition. Note that the buffer usage is the opposite of the sorting operation: here we have one input buffer and B − 1 output buffers, whereas in the k-way sort we had B − 1 input buffers and one output buffer.

Let's see an example of how phase 1 works. We have the same query as before, except I removed the ORDER BY, so we're just getting all the distinct CIDs that match, over the same table. We do the same things: filter as early as possible to remove the rows we don't need, then remove the columns we don't need. But when we get to the end — where before we did the sort — here we do the partitioning instead. We pass the keys through the h1 function and it routes them into these B − 1 partitions. In this case we can see all the 15-445 keys end up in the first page, and the first two pages are part of the same partition.
So all of those keys end up in that partition, and the bottom page is a separate partition that fills up separately. We pass everything through the h1 hash function and it splits the keys across these partitions; this first one filled up, so it has the overflow values in it.

Now, phase 2, the rehash phase: for each of the partitions we created on disk, we read it into memory and build an in-memory hash table based on a second hash function, h2, separate from the original h1. It gives us a hash offset into the in-memory hash table we're building, and we go through each bucket of that hash table to bring matching tuples together. Imagine for simplicity that this hash table fits in memory. You could use the same kind of buffering idea we had with the extendible hash table, where you build it up incrementally and spill pages to disk, but for now just assume a really simple linear probing hash table. So, like I said, we're assuming the rehash result fits in memory, but in the general case it doesn't have to.

For the rehash phase, we have the same setup: these are the partitions, or buckets, produced by phase 1. We want to scan through each of these buckets and build the in-memory hash table. As we iterate over each page stored in a bucket, we pass the keys through the h2 hash function to populate the in-memory hash table and build up the values in there. Remember, we're just looking for the distinct CID values, so if we hash in and find a duplicate key already in the in-memory hash table, we just throw it away: we don't need duplicates, we just want the unique keys in the hash table. We build up this in-memory hash table from the first partition, and when we're done with it, we can immediately add its contents to our final result. We can do this because it's impossible for any of the keys in any of the other phase-1 buckets, or partitions, to show up again: we never have to backtrack and look at previous partitions, because all of the partitions are disjoint — we produced completely disjoint partitions in phase 1. So as soon as we're done looking at the pages in the first partition, we write its hash table out to our result set; we're done with it, and those keys will never come up again in later partitions. Then we go down to one of the later partitions, run it through h2, and build up a new hash table — we can get rid of the previous one, since we already wrote it out to the final result — and when we're done looking at the pages in that partition, we add its contents to the final result as well. Are there any questions about how this two-phase partition/rehash approach works?

Okay, so just to reiterate: the first phase is the partitioning phase, where you build up these disjoint partitions, and the second phase is the rehash phase, where you build the final hash table for each partition at the end.
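Here is a toy end-to-end version of the two-phase DISTINCT, where the "partitions on disk" are just Python lists; h1 stands in for the partitioning hash, and the built-in set stands in for the h2 in-memory table.

```python
NUM_PARTITIONS = 3                       # i.e., B - 1 partition output buffers

def h1(key):
    return hash(key) % NUM_PARTITIONS    # phase-1 partitioning hash

def distinct_external_hash(keys):
    # Phase 1: route every key to a partition (spilled to disk for real).
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for k in keys:
        partitions[h1(k)].append(k)

    # Phase 2: rehash one partition at a time with an in-memory table.
    result = []
    for part in partitions:
        table = set()                    # stands in for the h2 hash table
        for k in part:
            if k not in table:           # duplicate -> throw it away
                table.add(k)
        result.extend(table)             # keys can't recur in later partitions
    return result

print(distinct_external_hash(["15-445", "15-826", "15-445", "15-445"]))
```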
So what have we achieved in doing this? Instead of the giant hash table we would otherwise have had to build over all of the keys, we've broken the problem down into smaller, disjoint hash tables. That takes the random I/O of the one giant hash table and gives us sequential I/O instead, since each partition can be read in sequentially and the final result can be written out sequentially. We've avoided the big random-I/O hash table and turned it into smaller sequential-I/O operations. That's the key idea behind splitting the work into the partition and rehash phases.

Okay, so that was getting the distinct keys. If you want to do an actual aggregation — call it hash summarization — then during the rehash phase we store output values of the form (GROUP BY key, running aggregate): whatever key we're grouping by, plus the running aggregate we're computing. When we go to insert a new tuple into the hash table, we first check whether there's a matching GROUP BY key. If there is, great: we take whatever the value is — say, the current running sum — add our value to it, and we're done. Otherwise, if the GROUP BY key doesn't exist, we insert the key along with its starting running value.

As an example of how this looks, say we want the average GPA grouped by course. The process is similar: we do the phase-1 partitioning into disjoint buckets, then run each partition through building its individual hash table, and merge all of those results into the final hash table. We get key/value output where the key is the GROUP BY key, in this case the course ID, and the value — in order to compute the average — has to keep track of both the count and the sum: the count of the number of values matching the key and the sum of those values. You can think about the different running totals: most of them are easy. The minimum is just the smallest value you've seen so far, max is the largest you've seen so far, sum just adds everything up, and count is obviously just a count. But for the average we keep around both the count and the sum, so we can compute it in this online fashion. Then, once we have these running values, we convert them into the final result by dividing the sum we've computed by the count, and that gives us the average. Are there any questions about how this process works?
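And here is the running (count, sum) trick for AVG, as it would run inside the rehash phase for one partition; the rows are invented.

```python
# Invented rows: (course_id, gpa).
rows = [("15-445", 3.5), ("15-721", 3.8), ("15-445", 3.9)]

running = {}                             # cid -> [count, sum]
for cid, gpa in rows:
    if cid in running:                   # matching GROUP BY key: update
        running[cid][0] += 1
        running[cid][1] += gpa
    else:                                # new key: insert starting state
        running[cid] = [1, gpa]

averages = {cid: s / c for cid, (c, s) in running.items()}
print(averages)   # {'15-445': 3.7, '15-721': 3.8} (up to float rounding)
```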
Okay, so just to wrap up: there's a trade-off between sorting algorithms and hashing algorithms. It's subtle and depends on a few different things. Like I said, it depends on the data layout, whether you have a row store or a column store; it depends on the workload you're running; and it depends on the data distribution. For example, if you have data sitting on disk and you know it's already sorted, or mostly sorted, that might make you choose the sorting algorithm over a hashing algorithm — different trade-offs like that. So there are different reasons you might prefer one or the other in each case, and it's going to be the job of the query optimizer, which we'll talk about later in the semester, to make those sorts of decisions and trade-offs when executing queries. Again, we've already discussed the optimizations for sorting: chunking the buffered I/O into large blocks to amortize the costs, and the double-buffering idea, where you prefetch to keep your CPU utilized while disk I/O is going on in the background, so you're not stalled on it. So we've covered the optimizations for sorting, and we've started talking a little bit about the optimizations for hashing, but in the next class we'll be talking specifically about join operators, and that's where some of the more advanced optimizations for the hashing-based approaches will come up. That's it for today, and I'll see you next time.