Good morning, I'm Jignesh Patel and I'm going to be teaching this class for the month of October. Andy will be back in November. An additional change is that on Wednesday both Andy and I are traveling, so Matt, who is Andy's student, is going to run the class on Wednesday. This is a re-recording of the class that happened on Monday, so for those of you who were in class on Monday and see a different background and a slightly different set of notes, it's because the recording on Monday didn't work.

Let's get started with a few announcements first. We have the database seminar series, as many of you know, and there are two talks. VB8, which just happened yesterday; there's a recording for that, which you can find from the CMU database page. And there's Featureform, a feature store that works with machine learning models, stores those features, and helps you manage the entire retraining pipeline. Their presentation is on the coming Monday. So if you're interested in data-related things outside traditional relational databases, where databases meet machine learning, the seminar series is a great event and I hope you can come to it. Homework 3 is due on Sunday at midnight, and the midterm is coming up on Wednesday, October 13. So start brushing up on the material we've been discussing since the start of the semester, and of course come to our office hours and ask questions if anything is unclear.

So with that, let's get started. Today we are going to talk about executing queries inside the database engine. You see the structure of the database engine on the right side; you've seen this before. The part we're going to focus on today is the operator execution engine. At this point in the lifecycle of a query, the query has been parsed and planned, basically optimized, and now you have to go and execute the operators in that plan. There's a sequence of four lectures, starting today, where we look at the algorithms used for executing these operators. They will end up using the access methods: you might scan a file, or you might use the B+tree indexes you know from before. The data that gets fetched from disk gets pulled into the buffer manager, and you know how buffer managers work; that was your first project assignment. Essentially the goal is to fetch data from the disk and keep the hot stuff in the buffer pool for as long as possible, so that you can re-access any data pages you need from memory, which is far cheaper than going to the disk.

All right, so let's jump into it and look at what happens to a query after it's been parsed. Here's a SQL query which does a join between two tables, applies a selection on one of the tables, S, and projects out a column from each of the tables. After this query is presented to the database engine, it gets checked by a syntactic parser, and then it gets converted into a relational algebra representation. That representation is what gets optimized; we'll talk about optimization later in the class, but for today just assume we've got some tree-like representation for the query, and the nodes in the tree are operators. So here you've got the relation R feeding into a join operator, where the join condition is equality between the id fields of the R and S tables' records. The S table is being scanned.
And only the records that have value greater than 10 are sent to the join operator. If there's an index built on the value column, that will get used, and you know how those indexes work from the last lecture. The output of the join goes into a projection operator, where the two columns are projected, producing the final result. That's what the query tree looks like, with tuples flowing up through it. You can think of a query tree as a dataflow graph: records flow in from the bottom, go through the operators where they get processed based on the semantics of each operator (a selection, a join, a projection in this case), and eventually you produce the output that satisfies all the conditions in the query.

What we are seeing here is a tree form for this query. In general, this can be a DAG: it may be that the output of this join gets fed into another part of the plan, and all of those come together before you produce the final output. When you have nested queries or CTEs, you often end up with a structure that is a DAG and not a regular tree like you see here. But it's never an arbitrary graph; in the worst case the complexity is a DAG. Things flow in and come out through one node at the top, which produces the output of the query.

Okay, now let's jump into what we are trying to do with the algorithms for these operators. First, databases need to work with large amounts of data, so we want our operator algorithms to work on data that cannot fit in memory; we want to be able to deal with data that is far bigger than the memory we have available. Second, we want to use the memory that is available, in the form of the buffer pool, as efficiently as possible. This is where we bring pages into the buffer pool and apply some sort of buffer replacement policy; you implemented LRU-K in the last assignment, so you know how those things work. We are trying to avoid going to the disk as much as possible, because accessing a page in the buffer pool in memory is orders of magnitude cheaper than going to the disk.

Things get even more complicated in cloud settings. We won't delve into that in this class, but the advanced database class goes into it in much more detail. Sometimes the data is not coming from a local disk; it may come from cloud storage that is remote, which can be even more expensive. Sometimes there are configurations where the data comes from the memory of a different node, in which case it's not as expensive as going to disk, but it's still more expensive than accessing local memory. Regardless of where the data comes from, there's a hierarchy for where data lives, and you want to use the buffer pool, which sits in DRAM closest to where things are getting processed, as efficiently as possible. We also want our algorithms to work on large amounts of data, and generally, when we have to fetch something from the disk, we want the I/O access pattern to be sequential, because disks are better if you ask them for a bunch of pages sequentially rather than randomly.
So if you need to go to the disk and fetch 10 pages, and your operator is asking for those 10 pages at random locations on the disk, compared with asking for 10 pages that are contiguous on disk, the latter is nearly always going to be faster. So we want our algorithms to use the buffer pool efficiently, and when we do have to go to disk, do it in a way that minimizes the time needed to fetch those pages. How do we do that? We'll go through the sort and aggregation operations in today's lecture, and we'll continue with joins and other operations in the subsequent lectures.

Starting with sort, the first question that might come to mind is that the relational model, which is based on relational algebra, is based on set theory, and sets are not sorted. So why do we need sorting? We need it for a number of reasons. One is that SQL has an ORDER BY clause, which requires that the final result be produced sorted by the columns specified in that clause. Sorting is also useful for applying operations like DISTINCT, and we'll see that aggregations can be done using sorting as well. As we get into aggregations in the second half of today's lecture, we'll see that there are sort-based methods to evaluate the aggregate operator, but there are also hash-based methods. We'll see that duality between sorting and hashing; they're like siblings rivaling with each other over when sorting is better for evaluating a specific operation versus hashing. Generally, hashing is going to be better. That's well known now, but there was a time in the community when that was an active debate. Sorting still helps in many cases, and we'll talk about those; they are largely centered around data that is already pre-sorted, where the sort-based methods will likely win over the hash-based methods. And of course, if you have to produce records in order because there's an ORDER BY, you will have to sort for that final result. Many of the algorithms we're going to look at, for aggregation today and joins in later classes, have a sort-based approach and a hash-based approach, and they're constantly dueling with each other for higher performance.

So let's start with very basic in-memory sorting. If the data fits in memory, we can use a standard algorithm like quicksort, and all of you have seen a number of in-memory sorting algorithms. Yesterday in class, many of you volunteered other algorithms you've looked at, including bubble sort and insertion sort. Remember, we are also going to figure out how to sort things that are much larger than memory, so that we can sort extremely large data without requiring all of it to fit in memory. But there will be a portion of that external sort algorithm for which you need to sort in memory, so you need an in-memory sort algorithm, and most database systems are going to use quicksort for it. Well, that's not quite the whole story: in quicksort, you find a pivot point and then sort on both sides.
But if one of the sides becomes really small, the system might switch to something like insertion sort, because it's just much faster when you have a really small number of records, maybe 10 or 20; that number also depends on the hardware you're running on. Another data platform worth mentioning is Python, which is very popular in the data science world; many of the data science notebooks you see get connected to a relational database to go pull data, so in many cases Python is an extension of the database platform for doing data science work. In Python you can pull data into Python data structures, and Python's default sort mechanism uses something called Timsort, which is a combination of insertion sort and merge sort. So there are a lot of in-memory sorting algorithms, and it's fun to go look at them. Which combination of algorithms works best is still changing today, because the hardware is changing and some of the considerations change with it, and depending on the data distribution, which algorithm does best for this in-memory sort component also changes.

So let's go take a quick look at a couple of sort algorithms and visualize them. I'm not going to go into the details of all the sort algorithms, but let me just pull up this web page, and bear with me as I try to bring that window back into focus in Zoom. As you can see, this is a page on which we can look at a number of different sort algorithms; again, we are focusing just on in-memory sort methods here. The key point is that which algorithm wins depends on things like the data distribution and the inherent properties of the algorithm. Let's take a look at random data, and let's bump this up to 50 data points so we can watch the simulation run a little longer. Here is insertion sort; here is merge sort, and we'll use something like it later on as a primitive for building the external sort method we're going to talk about in a second; and here is quicksort, a variant that splits the input into three parts. Let's run that. As you can see, things are getting sorted, and this is random data, so the algorithms on the right side, which are more complex, are going to do a lot better. You can see heap sort is nearly done, quicksort is getting there, shell sort is done, and all five algorithms on the right are done while the simpler algorithms, insertion, selection, and bubble sort, are still working on it. Now let's change the scenario a little. We'll let that complete at the top, and let's go to the scenario where the data is nearly sorted. Play through that and watch how quickly the simpler algorithms finish. As you can see, the more complex algorithms on the right are still working, but the first column, insertion sort, is already done. Hopefully that illustrates how these sorting algorithms behave and why there is constant interest, even in database systems, in finding the right in-memory sort algorithms; people are constantly re-evaluating that. I'm also going to very briefly go to this other site, whose link is in the slide deck, and look at a different way of visualizing sorting. So let's see.
I might just type that in. There we go. Okay, and I'm going to switch this over to exploration mode. This is a really cute way of looking at algorithms; effectively it's like a debugger. We'll start with quicksort, the divide-and-conquer algorithm where you pick a pivot and then sort on both sides. Let's go ahead and sort. The nice thing about this is you can see what the code for quicksort looks like as it walks through the elements; it's a very nice way to connect the algorithm, in that code-centric way, with what's happening visually. So I encourage you to play around with this. If you have forgotten the in-memory sort algorithms, you should at least know the basics like quicksort and insertion sort; go refresh them. You're probably going to need them: if you're going to interview, it's a fundamental question that gets asked many times. All right, let's get back to the slides. I'll close this and let you play with the algorithms on that site yourself.

Moving along, we're going to start into the sort operation with something simple called top-N heapsort. This shows up when you have an ORDER BY; as we just discussed, SQL has an ORDER BY clause, and when it's present you have to produce the results sorted by the columns in that clause. Here, you have to output the records from the enroll table ordered by the student ID. There are a few additional things SQL allows you to specify. Here we are asking the system to send us only the first four rows, sorted by student ID of course, so the first four in that sorted order. And there's an extra optional clause, WITH TIES: if there are any ties in the top four, give me all of them. So if this table has more than four records, the output will be at least four rows, but it could be more; if one of those top-four student IDs is repeated, that repeated value will also be returned.

Okay, so let's see how that works. We'll use a version of heapsort called top-N heapsort, and here's how it goes. Assume the student IDs are these values. We start by fetching the first student ID, which is three, and we build a heap in memory, allocating a heap of size four. A heap keeps its elements ordered; its core implementation is an array, so I'm just going to show that array representation. I'm going to assume you know how a heap works, and if you've forgotten, you can quickly look it up. I'll show the array in sorted order as we insert elements, because ultimately that's the effect a heap gives you. So we fetch three and insert it into the heap; then we fetch four and insert it in sorted order. This heap I'm going to show sorted from right to left, so three is smaller than four, and the heap is growing in this direction. Then we get to six, which is added to the heap: three, four, six. Now we add two, and the heap is full. Then we come to the value nine and have to decide what to do with it. Yesterday in class I paused the lecture and asked what you would do with nine, and a bunch of answers were given. The simple answer was to add nine to the heap and grow the heap, which is correct, but there's a better way when you're doing this top-four evaluation.
And the insight is the following. I've already seen four numbers in the data scanned so far that are all less than nine. If I only have to retrieve the first four rows, the four smallest values, because we're ordering by student ID, there is no way I would ever output nine, since I already have four values smaller than nine. So nine does not even need to be processed. Any time my heap already holds enough values, four in this case, and I see a value greater than the largest value in the heap, six in this case, I can simply toss it. Now you can start to see how, when you implement something like in-memory sorting in the database context, where you have these query semantics, you can do very interesting things; you can modify the algorithms to make them more efficient. We'll see the same thing with external merge sort. You take the algorithms you know and love for sorting, hashing, and other things, and when you bring them into a database, whether it's an in-memory case like this with top-N semantics or an external-memory case, you make these subtle changes that let you do the operation more efficiently and on much larger datasets, much larger than the amount of memory you have. So let me just pause and let you take that in; make sure you understand why we can skip nine: it is greater than six, and we are guaranteed not to need it in the final answer.

Then we come to one, which is interesting. We have to put it in the heap. But notice that the heap previously held two, three, four, six; as I insert one, I no longer need to keep six around, and I can toss it out by the same reasoning: I've got the four smallest numbers I've seen so far, I only need to output four, so I can discard six, which I scanned a few records ago. Again, we are making these optimizations to the algorithm. Now, because we have WITH TIES, we come to this second occurrence of four, and we have to decide what to do. We can't throw it away, because four is one of the smallest values and we have WITH TIES; without WITH TIES we could toss it, but here we need to keep it. We have no choice but to grow the heap, and you can grow it in multiple ways; it's usually an array data structure that implements the heap, so we would generally double it. That's what we do here: double the size of the heap and add four to it, then go to the next element, which is also a four. Then we come to the last element, which is eight. As we discussed in class when I asked what you would do with eight, many of you knew that you would skip it, and that makes sense: just like nine, we are guaranteed not to need eight. Eight is greater than four, the largest value we could still output, and we already have our four distinct values, so we skip eight and simply output the sorted heap.

Okay. So this is a modified version of heapsort, far more efficient, that lets you evaluate fetching the first N rows WITH TIES in memory. There are different variants of this in SQL in terms of how you specify which rows to fetch, and this type of structure works for them with some modification, but that's the overall idea. All right.
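To make the top-N idea concrete, here's a minimal C++ sketch of the heap logic; the function and variable names are mine, not from any particular system, and it ignores the WITH TIES handling:

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Minimal sketch of top-N heapsort for ORDER BY ... FETCH FIRST n ROWS ONLY.
// We keep a max-heap of the n smallest keys seen so far; any key >= the heap's
// maximum can be discarded immediately, as with 9 and 8 in the lecture example.
std::vector<int> TopN(const std::vector<int>& keys, size_t n) {
  std::priority_queue<int> heap;  // max-heap: top() is the largest kept key
  for (int k : keys) {
    if (heap.size() < n) {
      heap.push(k);
    } else if (k < heap.top()) {
      heap.pop();   // evict the current largest (like tossing 6 when 1 arrives)
      heap.push(k);
    }               // else: skip k entirely (like 9 and 8)
  }
  // Drain the heap to produce the kept keys in ascending order.
  std::vector<int> out(heap.size());
  for (size_t i = out.size(); i-- > 0;) {
    out[i] = heap.top();
    heap.pop();
  }
  return out;
}
```

Running this on the lecture's stream 3, 4, 6, 2, 9, 1, 4, 8 with n = 4 skips 9 and 8 and evicts 6 when 1 arrives, exactly as in the walkthrough; only the WITH TIES growth step is missing.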
Now we are going to get into the more interesting part of today's lecture, and probably something you haven't seen before: how do you sort a table of records when the table is much larger than the amount of memory you have? As you'll see, a theme for many of these algorithms, called external-memory algorithms, is that they're based on divide and conquer. You take the big problem, data too large to fit in memory, and split it into smaller chunks in such a way that processing each chunk individually guarantees you can produce the final output in aggregate.

Before we go on: as we sort, we are going to take multiple passes over the data, and we will have to write records out to intermediate files. The question is, what do we write in those intermediate files? We're sorting on some key; K1 and K2 here are two distinct key values. We could take the entire record along with the key and store that in the intermediate files; that's called early materialization. Or we could do something called late materialization, where we store just the key and a pointer to the record. Obviously, if I'm sorting a table in which each record is a thousand bytes long and I'm sorting on an int4 key, then, thinking of everything as a key/value pair, the value portion of an early-materialized entry is a thousand bytes. If the record ID is, say, eight bytes, that's the number of bytes needed to represent the pointer on disk, and you've seen what those pointers look like: they're the same type of pointers you'd find in the leaf node of a B+tree. Eight bytes for the pointer versus storing the actual record: the late-materialized form is obviously far more space-efficient, so as we make multiple passes over the file, we write far fewer bytes to disk. But at the end, when we are done sorting, we have to go retrieve all the records, and that may incur a whole bunch of random I/Os. Generally, row stores will use the early-materialized format. Column stores are naturally in a format where the columns haven't been stitched together into rows to begin with, so they might as well keep record IDs, or some logical version of them, around and delay the stitching until later; column stores will often use the late-materialized format for their internal sort state.
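As a quick illustration of the space difference, here's a hypothetical sketch of the two entry layouts; the sizes mirror the lecture's example and the field names are mine:

```cpp
#include <cstdint>

// Early materialization: carry the whole record along with the sort key,
// so the payload travels through every pass of the external sort.
struct EarlyEntry {
  int32_t key;         // e.g., an int4 sort key
  char payload[1000];  // the full thousand-byte record
};

// Late materialization: carry only the key plus an 8-byte record ID
// (like a B+tree leaf entry); the record is fetched at the very end,
// possibly with random I/O.
struct LateEntry {
  int32_t key;
  uint64_t rid;        // page id + slot, encoded in eight bytes
};
```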
All right. So let's set up the external merge sort. We have a data file with N pages and a buffer pool with B pages, and N is much larger than B. We'll start with the very simple version of external merge sort in which we merge things two ways, and then we'll see how to generalize it. The two is the degree of the merge tree, and I'll show you that next. Let me go to the diagram and show you how it works pictorially.

Imagine we are starting with a file of eight pages, where the first page has two records, just to keep the diagram manageable. Obviously, a real page will have a lot more than two records, often hundreds, or with bigger block sizes maybe many thousands. But to keep things simple, here is a page with only two records, three and four. Three and four are really the keys on which I'm sorting; I'm not showing the payload of the entire record, or the key/pointer pairs we might keep around, just the key values. The second page has six and two, and so on. We want to sort all of this; in the final output file, the first record will be one, then two, and so on. And we'll get there using only three buffer pages, just to show how you can sort files much bigger than the amount of memory you have. So we have only three pages, pretty small, and seven pages in this file. There's an extra dummy page over here just to indicate the end of the file; that's for convenience, and in this case it also makes the count a power of two. You don't actually need to allocate that page, you could just remember which page is last, but it's shown for visual clarity.

Let's see what happens in the first pass. We're going to make multiple passes over the data, and in each pass we read what we wrote in the previous pass and then write new data, using either the fully materialized format or the key-and-record-ID format. In the first pass, the input is the original file, so we start there; that's the EOF marker we talked about. We bring one page into memory, sort it, and write it back to a new file as the first page. In this case, three-four was already sorted; that's okay, we write it back, a copy of that page in a new file. We take the second page, which is six-two, sort it in memory using quicksort or your favorite in-memory sort method, and write two-six as the second page of the new file we're creating. And we keep going: four-nine, then seven-eight. Each page gets sorted, and you finally end up with a file of what we call one-page runs. This file has as many pages as the input file, and each individual page in it is sorted. These one-page runs may seem pointless right now, but hold on for what happens in the second pass.

In the second pass, we start merging, and this is where the two-way merge comes in. We merge two pages at a time, where the records within each page are sorted. We bring the first page into the buffer pool; that's one page of our three-page buffer pool used. We bring the second page in; that's the second page used. We reserve the third page for the output. And I don't know why those bubbles came up; I think Zoom is trying to be too smart and did that automatically. Hopefully you enjoyed that. So with the first and second pages in memory, we hold a cursor on the smallest value in the first page, which points to three, and a cursor on the smallest value in the second page, which points to two. We compare the two and output the smaller, which is two, and as we do that, we advance the cursor on the second page to six. In the next step, we compare three and six, output three, and so on.
Effectively, we start producing output. As soon as the output buffer page, the third page, becomes full, we write it to disk; now that output page is free and we can fill it up again. So in this second pass, which is called pass number one because the first pass was labeled zero, we create sequences of two pages in which the records across both pages are fully sorted. That's called a two-page run: a sequence of two pages with the records sorted across them. The two-three gets written out to disk, then four-six, and I'm going to draw it in a slightly different way: the file created at the end of pass one still has seven pages as before, eight with the dummy page, but now I draw each pair of pages vertically to show that they're sorted together as a two-page run. The first two pages of this file hold what you see here; the second two pages also get merged, and there's a second two-page run for the four-nine and seven-eight pages that we wrote out at the end of pass zero.

So now you can see the progression. We started with completely unsorted data; then everything on a single page was sorted; now, in this new file, every pair of pages is sorted across. We repeat the process again using our three buffer pool pages, this time creating a four-page-run file. We bring the page holding two-three into memory, using one buffer pool page; we bring the page holding four-seven into memory, using the second; and the third is the output page, as before. It's the same algorithm: find the smallest among what's in memory, compare two and four, output two, move that cursor to three, compare three and four, output three, fill up the output buffer page, and keep going. At any point in time, we need just one input page from each of the two runs we are merging, plus one output page, and whenever an input page is exhausted, we fetch the next page of that run. We effectively create a new file at the end of this pass. Surprise, surprise: this is called a four-page run, because in this file, again seven pages plus the dummy eighth, the first four pages are now fully sorted, from the first record to the last in the right order. Do one more merge and you're basically done.

Now think about the cost of this operation. Each pass costs 2N page I/Os, where N is the number of pages, because in each pass we read every page and write every page. And the number of passes is the depth of this tree; it's effectively a binary tree, so the number of passes is 1 + ⌈log₂ N⌉, the one being pass zero. The ceiling is just to take care of the fact that N may not be an exact power of two; we can't take half a pass, so we take a full one. The total cost is therefore 2N × (1 + ⌈log₂ N⌉).
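Here's a minimal sketch of that two-way merge step in C++. It's a toy model under stated assumptions: runs live in memory as vectors and the page "flush" is simulated, whereas a real system would read and write pages through the buffer pool; all names are mine:

```cpp
#include <vector>

// One 2-way merge step: two sorted input runs, one cursor each, and a single
// output page that is flushed to the new run whenever it fills up.
constexpr size_t kPageSize = 2;  // records per page, matching the example

void MergeRuns(const std::vector<int>& a, const std::vector<int>& b,
               std::vector<int>& out_run) {
  std::vector<int> out_page;  // the one reserved output buffer page
  auto flush = [&] {          // stands in for writing the page to disk
    out_run.insert(out_run.end(), out_page.begin(), out_page.end());
    out_page.clear();
  };
  size_t i = 0, j = 0;        // cursors into the two runs
  while (i < a.size() || j < b.size()) {
    int next;
    if (j >= b.size() || (i < a.size() && a[i] <= b[j])) {
      next = a[i++];          // run a has the smaller current value
    } else {
      next = b[j++];          // e.g., compare 3 vs 2: output 2, advance b
    }
    out_page.push_back(next);
    if (out_page.size() == kPageSize) flush();
  }
  if (!out_page.empty()) flush();  // last, possibly partial, page
}
```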
There was an excellent question asked in class yesterday about why we merge this way. Could we, in pass one, have done this merge, then this merge, and then, before merging the five-six and one-three pages to create this two-page run, merged those two results to create the four-page run first? In other words, instead of processing level by level, could we go down the depth of the tree as far as we can and then come back? Effectively, you would produce this run in pass one, pause the rest of pass one, go on to pass two for that portion, and so on. That was an excellent question, and the answer is yes, you can do that. There's an advantage to it: once you've merged these two-page runs into the four-page run, you can delete that set of four pages from disk, so you're not allocating as much space; you can toss away each level as you consume it. In the level-by-level case, you will have twice as many pages allocated on disk. So if your disk is getting full, that might help. Disks these days generally have plenty of space, so it's not usually a problem, but it's a valid way to think about it. And these are the ways in which people are still constantly improving external sort. It's still a blood sport; people compete very heavily on how fast they can sort, with ideas like that, ideas like tuning the in-memory sort for specific hardware, and making this work when your data is coming from a cloud storage device. All of this is still interesting; in other words, external merge sort is still a research topic worth pursuing.

Okay, now let's see how we can make what we have better. I'll go back to the macro perspective of what the algorithm does. We have pass zero, where we read one page into memory and sort it into a one-page run; then in passes one, two, three, and onward, we recursively merge pairs of runs until we are finally done. Pretty standard divide and conquer, and everything is great so far: we can now sort data of any size. But if you had a petabyte file and just 100 pages in the buffer pool, or just three pages like we were using here, you would eventually finish sorting that file, but it would take a very, very long time; maybe decades or a century, depending on the hardware, before you're done. So can we do better, especially if we have a larger buffer pool? Today, gigabytes of memory are very feasible; in fact, many high-end database servers run with terabytes of main memory. So imagine I give you a lot more than three buffer pool pages. Can we make this go a lot faster?

Look at the cost formula, 2N × (number of passes). In each pass we have to do those 2N I/Os, so what we can really do is reduce the number of passes, the 1 + ⌈log₂ N⌉ factor, which is determined by the shape of this merge tree. You can do two things to that tree. First, suppose I give you more pages, say four or five. You could chop the base of the tree: instead of starting with one-page runs, you could immediately produce four- or five-page runs. If I give you four pages, bring all four into memory, sort them, and you immediately get this run.
So effectively you take this tree and chop its base, reducing the number of passes by jumping from here to there, because you have a lot more memory: with four pages, I bring in those four, go straight here, and save two passes. The second thing you can do with more pages is widen the tree: there's no reason to do a two-way merge. I could do a three-way merge, or a four-way merge, or however many pages I have. If I've got B buffer pages, I always need one page for the output, so I can do a (B−1)-way merge, and that widens the tree. Chopping the tree and widening the tree dramatically reduces the number of passes, and in practice you rarely need more than two passes even on very large datasets, because if you have a petabyte dataset, you're probably going to have terabytes of main memory to sort it with. Maybe three passes if you're doing something crazy big without a lot of memory, but the number of passes is not going to be much more than two or three in practice, and it's because of this technique.

So how does this work? The idea is essentially what I just told you, captured in a little more detail. Start by using all B buffer pool pages; there's no need to leave one for output in this first pass. Bring B pages at a time into memory, sort them, and write out sorted runs. That produces ⌈N/B⌉ sorted runs, each B pages long. Then merge them (B−1) ways. The number of passes becomes 1 + ⌈log_{B−1} ⌈N/B⌉⌉, and the total cost is 2N times that. It's a massive reduction: one big cut from the (B−1)-way fan-out of the tree, and another because you're starting with a much smaller number of leaves, ⌈N/B⌉ of them instead of N.

Just to put that in perspective, imagine we want to sort a file with 108 pages and we have five buffer pool pages. Bring five pages at a time into memory, sort them, write them back: you get a file of five-page runs, and the number of sorted runs is 108 divided by 5, rounded up to 22. Now each of these 22 runs has to be merged, and with B = 5 pages you can do a four-way merge; remember, we still need one page for the output. The output of that pass is a file whose runs are five times four, 20 pages each, a 20-page-run file with six runs in it, which you keep merging until you're done. As an exercise, try different values: try B equal to three, or make N a million, which is not unusual in large databases, and you'll see why the larger fan-out makes this merge finish fast.
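To sanity-check the formula on this example, here is the arithmetic written out; note that the six 20-page runs actually take two more merge passes (6 → 2 → 1), so the total is four passes:

```latex
% Worked cost for the 108-page example with B = 5 buffer pages:
\[
\left\lceil \tfrac{N}{B} \right\rceil = \left\lceil \tfrac{108}{5} \right\rceil = 22
\ \text{sorted runs after pass 0}
\]
\[
\#\text{passes} \;=\; 1 + \left\lceil \log_{B-1} \left\lceil \tfrac{N}{B} \right\rceil \right\rceil
\;=\; 1 + \left\lceil \log_{4} 22 \right\rceil \;=\; 1 + 3 \;=\; 4
\qquad (22 \to 6 \to 2 \to 1\ \text{runs})
\]
\[
\text{total I/O} \;=\; 2N \cdot \#\text{passes} \;=\; 2 \cdot 108 \cdot 4 \;=\; 864\ \text{page I/Os}
\]
```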
Now, there's an optimization you can do with external merge sort that plays on the following observation: if you plot this log function, you'll see that for large values of N, increasing N doesn't change the number of passes smoothly; it jumps in a step function. And if that's the case, then even if I give you B pages or only half of them, the number of passes may remain the same over a whole range of N values. So in many of those cases, it makes sense to do an optimization called double buffering. Double buffering uses the buffer pool so that, instead of using all the pages for the current merge, it uses only part of them while the next batch of input is getting ready. Let me explain what I mean.

Assume the table we want to sort is sitting on disk as a bunch of pages. All pages are equal; I've just colored them light and dark to make the animation work. With everything we've talked about so far, if I've got four buffer pool pages, I bring three pages in, do a three-way merge, and write the output into the sorted run we're creating on disk. Now suppose we have more pages, say eight, and the number of passes stays the same, because there's a whole range of N values for which that happens. With eight pages we could go to an eight-way merge, which would certainly help, and we should obviously consider that when optimizing this operation. But another thing to consider is the sequential nature of the processing: if I look at the CPU and the disk, then while I'm bringing pages into the buffer pool, the CPU is idle and the I/O bus is busy, and while I'm processing, the disk is idle and only the CPU is working. At any point in time, one of the two resources is basically idling. What if we wanted to use both, given the extra buffer space?

The way to do that is double buffering. We start the same as before, except we create a shadow buffer page for each actual buffer page. We still do a three-way merge, but as data is brought in and merged, I start fetching the next set of pages to get them ready for the next merge step, the creation of the next sorted run. So while I'm merging these three pages into the red run, the CPU is busy, but the disk is also busy fetching the next pages to be processed. When we finish writing the red run, we start working on the darker pages we brought in to create the blue run, and we keep toggling between the buffer pages. It's effectively as if you take the buffer pool and divide it in two: there's a shadow version of the pool, one half is operating while the shadow half is fetching data from disk, and you keep flipping which one is active. With this, we get far better resource utilization for the disk and the CPU, because we overlap I/O with computation, and we get a reduction in response time. The downside is that the effective B in the equations we had before is half of what it was, but as I said, there's a log and a ceiling in that formula, so in many cases it won't increase the number of passes. So this can certainly help reduce response time, because you're using both the disk and the CPU in parallel.
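Here's a minimal sketch of the double-buffering pattern. It assumes a hypothetical ReadPages/MergeInto split, with a background task standing in for the I/O; a real system would toggle buffer-pool pages and might use asynchronous I/O rather than a thread:

```cpp
#include <future>
#include <vector>

// Stubs standing in for real buffer-pool I/O and merge logic (hypothetical).
std::vector<int> ReadPages(size_t batch) {      // "disk": fetch one batch of pages
  return std::vector<int>(100, static_cast<int>(batch));
}
void MergeInto(const std::vector<int>& /*pages*/) { /* CPU-side merge work */ }

// While the CPU merges the pages in `active`, a background task fills the
// shadow buffer with the next batch; then the two buffers swap roles.
void DoubleBufferedPass(size_t num_batches) {
  if (num_batches == 0) return;
  std::vector<int> active = ReadPages(0);       // prime the first buffer
  for (size_t b = 1; b < num_batches; b++) {
    // Start fetching batch b while the CPU processes the current one.
    auto shadow = std::async(std::launch::async, ReadPages, b);
    MergeInto(active);                          // CPU and disk busy at once
    active = shadow.get();                      // shadow becomes active
  }
  MergeInto(active);                            // last batch: nothing to prefetch
}
```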
It may not change the throughput: if you've got a lot of sort operations happening in parallel, overall throughput may stay the same, so it's a trade-off. But it's a cool technique, and if you want to get the latency of a sort operation down, you can play with techniques like that.

All right, one little tidbit of information. Deep down in the sort code, we are doing a comparison between two values, and it often helps to write specialized code so that the comparison can be fast, especially for types that might be complex, like datetime, but even for integer types. If I can write a sort function where the type of the keys is predefined, as opposed to having to infer the type on the fly, I win: if I have to infer the type on the fly, I have to pass the sort function a pointer to a comparison function, and it'd be better if I could just use a native integer comparison. So what systems often do is write the sort code in a generic way; in C++, you can do that with template programming, and then instantiate the template at compile time for all the data types you care about making fast. Usually those have to be fixed-width types like int2, int4, int8, and so on; you can do a few more things with strings. With this code specialization, I write a generic sort function once, say the quicksort component; I don't want to write a quicksort for int2, a quicksort for int4, a quicksort for int8, and so on. I write it once using template metaprogramming and compile it, so now I have a specialized version for int2 where the inner loop is optimized: the comparison predicate is a native int2 or int4 comparison, and the compiler may even be able to unroll parts of that inner loop at compile time because it has perfect type information. This kind of code specialization is applied in a bunch of database systems through a variety of methods; in C++ databases it's often done through templates that are instantiated explicitly to make the code fast.

Now, if I'm sorting on VARCHAR keys, basically string keys, I could do the simple thing and compare the entire strings. But if the strings are long, say 100 characters, I'm going to burn a lot of cycles on every individual key comparison, and we do a lot of key comparisons during a sort. One popular alternative is to derive a fixed-length encoded representation of the key, usually something like a 64-bit encoded version of the string, typically just its prefix, with the property that comparing the encodings as plain 64-bit integers, which is much faster than string comparison, tells you whether one string is less than another. Only when the prefixes are equal do I have to fall back to a full string comparison, so for most of these comparisons, and there are a lot of them, I never touch the full string. So when you're sorting on strings, you should very rarely be sorting with a comparison function on the native strings; you could do that if you know the strings are really small, but for variable-length character strings, you generally want a technique like this.
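Here's a minimal sketch of both ideas, the templated sort that gets specialized per key type and the fixed-length prefix surrogate for strings. All names are mine and the details vary by system:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Code specialization: the sort is written once, generically, and the compiler
// instantiates a version per key type, so the comparison in the inner loop is
// a native operation instead of a call through a function pointer.
template <typename Key>
void SortRun(std::vector<Key>& run) {
  std::sort(run.begin(), run.end());  // comparator inlined per instantiation
}
template void SortRun<int16_t>(std::vector<int16_t>&);  // explicit instantiations
template void SortRun<int32_t>(std::vector<int32_t>&);
template void SortRun<int64_t>(std::vector<int64_t>&);

// String surrogate: compare an 8-byte, zero-padded prefix with one memcmp;
// fall back to the full string only when the prefixes tie.
struct VarcharKey {
  char prefix[8];           // first 8 bytes of the string, zero-padded
  const std::string* full;  // pointer back to the whole string
};

VarcharKey MakeKey(const std::string& s) {
  VarcharKey k{};           // zero-initializes the prefix
  std::memcpy(k.prefix, s.data(), std::min<size_t>(8, s.size()));
  k.full = &s;
  return k;
}

bool LessThan(const VarcharKey& a, const VarcharKey& b) {
  int c = std::memcmp(a.prefix, b.prefix, 8);  // cheap fixed-width compare
  if (c != 0) return c < 0;
  return *a.full < *b.full;                    // rare: prefixes are equal
}
```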
All right, switching gears from the sort operator to the aggregation operator, we can now look at different algorithms for it and also start to think about hashing. You've seen aggregation before, so let's just look at it with an example. Here's a query on the enroll table, which has student ID, course ID, and grade, and we are looking only at students who have a grade of B or C. As I joked in class yesterday, obviously this is not a query you would run at CMU, because it would probably come out empty; everyone gets an A. I'm just kidding. So we select the records that fit this criterion, order by the course ID, and compute the DISTINCT. It's a very simple aggregate; we'll bring GROUP BY in a little bit. DISTINCT cid is an aggregate, and we'll see how to do it with a sort-based method, and then the hash-based version after that.

The first thing we do in this query tree, which I won't redraw, is apply the selection at the bottom, keeping just the records with Bs and Cs. Then we go to the second operator, the projection, which removes columns, keeping just the course ID; that's what the second operator in the query tree produces. Then we sort it. You would use external merge sort if this file were really large; I know it's a small example here, four records, but imagine it were four million and you didn't have enough buffer pool space. If it's a small file, you sort it with an in-memory sort; either way you get sorted output. Now, to eliminate the duplicates, you just scan over this sorted data from top to bottom and output each value only once. So here we output 445, the first record, and skip over the second record because it's a duplicate; after sorting, duplicates are always next to each other. At any point in time you just remember the last value you output, go to the next record, and if it's the same value, skip it.

And there's a little optimization you can do with this idea of skipping. Go back many slides to the merge tree we had earlier: you could fold some of this duplicate elimination into the merge itself. Imagine you had to do a DISTINCT on the keys there, and imagine that instead of this two, there was another three. When you're doing the merge, you could write the three only once: you can drop the duplicate three early, because you're only looking for distinct values. So if you knew you were doing the sort in order to compute a DISTINCT, you could change the core sort algorithm to do the duplicate elimination earlier, using essentially the same idea: for DISTINCT, you only need to keep one copy of each value. You push the optimization we just saw on this slide, skipping the second duplicate, up into the sort operator itself. So you can see that if I unpack the full semantics of the SQL query inside the operator, I can modify that same external merge sort operator within the context of the rest of the query and do even better.
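The final duplicate-eliminating scan over the sorted column is tiny; here's a sketch, with hypothetical names, operating on an in-memory vector for simplicity:

```cpp
#include <vector>

// Duplicate elimination over sorted input: since duplicates are adjacent
// after sorting, we emit a value only when it differs from the previous one.
std::vector<int> SortedDistinct(const std::vector<int>& sorted_keys) {
  std::vector<int> out;
  for (size_t i = 0; i < sorted_keys.size(); i++) {
    if (i == 0 || sorted_keys[i] != sorted_keys[i - 1]) {
      out.push_back(sorted_keys[i]);  // e.g., emit 445 once, skip its duplicate
    }
  }
  return out;
}
```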
As I said earlier, you'll see this duality between sort-based methods and hash-based methods, like a sibling rivalry, and now we can start to unpack that by looking at hash-based aggregation. Imagine we don't need the data to be ordered, because we only have a GROUP BY or a DISTINCT. That opens up the chance to use hash-based aggregation, and hashing is nearly always going to be the better alternative in this scenario. The case where it may not be better is if I have a clustered B+tree built on the group-by column; then I could potentially use the index and get a faster operation, and again, an optimizer will consider that.

A hash-based aggregate, which I'll walk through with an example, has two phases: divide and conquer. In the divide phase, we break up the file, partitioning it into two or more pieces, such that each piece can be processed individually to get the final answer, and such that each partition is small enough to work on in memory. The file is too big to fit in memory, so break it into small parts, each of which can be handled in memory. Partitioning is the first phase; rehashing, which processes each partition in memory, is the second phase. We'll use two hash functions, h1 and h2, and we'll assume we've got B buffer pages; for the partitioning, we'll keep one page for the input and use the rest for output.

So assume this query, where we have to compute the DISTINCT course IDs; there's no GROUP BY yet, but hold on for that. This example is simplified, and again there are optimizations you could do here to fold in early elimination of duplicate values, but don't worry about that; it's an inefficient example, just to show how the hashing works. After you look at it a few times, you'll say, I could have done this hashing earlier, but it's just to show the partitioning phase while keeping the same kind of example we used for the sort-based method. So we have a whole bunch of records here; assume there is lots more data than shown. We apply the filter and get a file that has these four records, plus a lot more, because now our data is much larger. We remove the columns and get a long list of course IDs; again, assume it's really big and can't fit in memory.

Now we apply hash function h1 to create the partition files, B−1 of them. We bring records in through one input buffer page and allocate B−1 pages as the output buffers for the B−1 partition files. As we read a record, we apply the hash function and put the record into the output buffer page for its partition; as a buffer page fills up, we write it to disk. We need just one output page per partition, but at the end, each partition can have a whole bunch of pages; I'm showing one here, but a partition can spill over many pages. So we've broken the input, in this case the course IDs, into B−1 chunks, the partitions, where all the keys in a partition hash to the same value under h1. And why does that matter? Because if I've got duplicates, like the one here and the one here, the same key always hashes to the same partition, bucket zero here, and hence all the duplicates of a value are brought together in one partition. That's the divide strategy. For the second part, all I do is process each partition to eliminate duplicates.
I'm guaranteed, because I applied h1 for partitioning, that if two records are the same, they are in the same partition. So by dividing the file into these smaller partitions and then processing each partition individually with an in-memory algorithm, we are guaranteed to do the entire operation correctly in two phases: a partition phase and then a rehash phase. That's the second page in that partition that we're scanning through, just to show that a partition can have multiple pages.

The rehash phase applies a second hash function, h2. Let's see visually what that looks like. Imagine these are the pages across all the partitions; the first partition has two pages, and there are a bunch of other partitions. As we pull a partition in, we apply h2 to each record and insert it into an in-memory hash table. We take the first record and put it into the hash table. When we take the second record and apply the hash function, it lands on a slot that already holds that value, so we just toss it out; that's how duplicates are eliminated, because equal values hash to the same place. Then, in this example, we move to the second page of that same partition and hash its records into the table too; we're still within the first partition. When we're done processing all the records in the partition, we output the results: scan through the hash table and emit every value that's present. Now that we're done with this partition, we clear the hash table and do the same thing for the records of the next partition, appending its output to the final result. So for the second partition, we clear the table, reuse that memory, apply the same rehash, and keep doing that until all the partitions are done.

It's a very simple divide and conquer: first break the data into partitions, this is the first partition, this is the second, and there would be many more; because we applied h1, all the duplicates of a value, the 445 records in this case, hash to the same bucket, and that way you are guaranteed you can eliminate duplicates.
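Putting the two phases together, here's a minimal sketch of the external hash distinct. It's a toy model under stated assumptions: partitions are in-memory vectors standing in for spill files, std::hash plays the role of h1, and the hash set's internal function stands in for h2 (in a real system, h2 must be a different function from h1):

```cpp
#include <functional>
#include <unordered_set>
#include <vector>

// Phase 1 partitions the keys with h1 into B-1 "spill files"; phase 2 builds
// an in-memory hash table per partition and emits each key once. Duplicates
// always land in the same partition, so per-partition dedup is globally correct.
std::vector<int> HashDistinct(const std::vector<int>& keys, size_t num_partitions) {
  // Phase 1: partition. A real system writes each partition to disk through
  // one output buffer page per partition.
  std::vector<std::vector<int>> parts(num_partitions);
  for (int k : keys) {
    size_t h1 = std::hash<int>{}(k) % num_partitions;
    parts[h1].push_back(k);
  }
  // Phase 2: rehash each partition independently; the table must fit in memory.
  std::vector<int> out;
  for (const auto& part : parts) {
    std::unordered_set<int> table;   // stands in for the h2 hash table
    for (int k : part) {
      if (table.insert(k).second) {  // first time we see k in this partition
        out.push_back(k);
      }
    }
  }  // the table is cleared between partitions
  return out;
}
```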
Now a question might arise: what happens if a lot of values land in one partition, and the hash table for that partition doesn't fit in memory? At that point you could do one of several things. You could take that partition and repartition it with another hash function. Or, if you haven't output any of its records yet, you could decide there are too many records here and use a sort-based algorithm for just this partition; you can mix and match. You apply the first partitioning phase with a hash function, but for each individual partition you can decide whether to use a hash-based or a sort-based approach for eliminating duplicates. And often, once you've created the partitions, you know how many keys are in each partition file, so you can decide how to choose. There are also adaptive techniques that adapt on the fly; the advanced database course talks about those. I just want you to be aware that sometimes you need to deal with overflow of the data structure, because hashing is not perfect: it may be that with h1, or with the way the data is distributed, a lot of values map to the first partition, and the rehash hash table for it doesn't fit in memory. Those are details; just be aware that you may need to do something else, and you can look it up if you hit that case.

All right. So essentially we've seen how to do duplicate-eliminating aggregation using both sort-based and hash-based methods. Now let's cover one last scenario, where the aggregation is a little more complex: instead of a simple DISTINCT, the query has a GROUP BY, there's a join, and then you're computing an average. In this case, imagine we create the partitions exactly as before, applying h1 in the first phase; that part doesn't change. What changes is the rehash phase: when we process each individual partition, the hash table now stores not just the key but also a value. If I've got two records in the first partition for the same key, with two different GPAs, then what I store in the value is running state: a running count of how many records have hashed to this key, two 445 records so far, and a running sum of their GPAs, 7.2. Because I'm computing an average, if I instead stored a running average here, I would be averaging averages as I go, and that's not correct: an average of averages is not the true average. So we keep the sum and the count as two separate things, and when we're done scanning all the records in the partition, we output the aggregate by dividing the sum, 7.2, by the count, two, which gives the correct average.

So two things are happening. First, the hash table now has a key and a value. Second, the value may not be the final answer but some intermediate running state that allows us to calculate the final value, as you can see with the average. Of course, if the aggregate were MIN instead of AVG, then what we'd put in the value slot is just the running minimum, because the min of mins is guaranteed to be the min; the same goes for MAX and COUNT, but AVG is different. Databases often also support statistical aggregates like median and mode, and there are different ways of dealing with those; take the advanced database class, where we talk about handling more sophisticated statistical aggregation, what the properties are, and when something is easy versus hard. In some cases you have to keep all the values around, which is complicated, but there are ways to work around that, or at least reduce its effect. Statistical aggregates are present in database systems, you can ask for them in SQL, and they require some additional machinery; that's the main thing we want you to know from this material. In this example, the final result output for the average is pretty straightforward.
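Here's a sketch of that running-state idea for AVG with a GROUP BY; the names are mine, and a real system would keep this state in the rehash-phase hash table described above rather than in a standalone function:

```cpp
#include <unordered_map>
#include <utility>
#include <vector>

// Per-group running state for AVG: the hash table maps the group key to
// (running_sum, running_count), and the true average is computed only at
// output time. Storing a running average instead would be wrong, since an
// average of averages is not the average.
struct AvgState {
  double sum = 0.0;
  long count = 0;
};

std::vector<std::pair<int, double>> HashAverage(
    const std::vector<std::pair<int, double>>& rows) {  // (group key, value)
  std::unordered_map<int, AvgState> table;
  for (const auto& [key, value] : rows) {
    auto& st = table[key];
    st.sum += value;   // e.g., two GPAs for key 445: sum 7.2, count 2
    st.count += 1;
  }
  std::vector<std::pair<int, double>> out;
  for (const auto& [key, st] : table) {
    out.emplace_back(key, st.sum / st.count);  // finalize: 7.2 / 2 = 3.6
  }
  return out;
}
```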
And so with that, we conclude today's lecture. Hopefully you've seen that there's a huge emphasis in database systems on these operator algorithms: they work with large amounts of data, even when it doesn't fit in memory, and they require rethinking basic algorithms you might think you already know. There are all kinds of interesting optimizations you can bring to the table, and the jury is still out on how much more they can be optimized; people are constantly making subtle and interesting changes, especially as things move to the cloud, where new hardware changes a lot of the trade-offs and opens up opportunities for new algorithmic innovation. We saw that there are hash-based and sort-based methods, and that duality will continue into the next lecture, when we start talking about joins and other algorithms.