Oh geez, seriously — Coolio died. When? Like yesterday or something. Seriously? Yeah. Well, okay. Did I tell you about what he did with Cookin' with Coolio, what he did to me? What did he do to you? All right, we'll talk about it later, but I don't give a f**k. He owed me money. Again, who doesn't owe you money? All right, I don't want to talk about it.

So for this class, Project 1 is going to be due this Sunday coming up. Again, we have the extra office hours this Saturday at 3 p.m., as posted on Piazza. That'll be in Gates, I think the fifth floor; we reserved the little cutout spaces there. Homework 3 will be due on October 9th. And then, I can't believe it's already here, but the midterm is coming up on the 13th. That'll be in class, in here. I'll start discussing more details about it next week. There will be a practice exam that gets released, and there will be a web page with all the material and additional information that will be covered, okay? Any questions about any of these things? All right, cool.

So, back to the course. Again, we're going up the stack, the different layers of the database system, and now we're up here at operator execution. We know how to store the data on disk, we know how to bring it into memory, and we know how to access it at the lowest level — how to scan tables, how to access things through indexes. Now we're actually going to talk about how to execute queries. After all this, we're finally here. The next four lectures are going to cover the key ideas we care about. We'll start off talking about how to implement the algorithms we would have in our query plan — how to do sorting and aggregations, which is today, and then how to do joins and other types of relational operators. Then we'll talk about how you pass data within your query plan as you execute it. And then we'll talk about the implementation of the execution engine itself: how you use multiple threads or multiple processes to execute these queries following different processing models, okay?

So we need to talk a little bit about what a query plan actually looks like. We started off this semester talking about relational algebra, and I said at some point that you take a SQL query and convert it into a relational algebra equivalent, and that's what the database system is going to execute. In actuality, what happens is that the SQL query gets converted into an abstract syntax tree after it's been parsed, and then that syntax tree is translated into a tree structure — a query plan tree, or a DAG. The idea is that the leaf nodes at the bottom of the tree will be the access methods we talked about: either scanning indexes or scanning the table itself. And they're going to be feeding data up along the path to their parent node, the parent operator, to do the processing. So say we're doing this join here on A and B. At the lowest level, the leaves, I'm scanning A and I'm scanning B. The A operator is passing data up to the join. The B operator is passing data up to this filter operator, which then passes up to the join. The join outputs data that goes to the projection operator, and after the projection operator, that's the result that's returned back to the application.
I'm not going to say what the actual granularity is or what this data we're passing along actually looks like — we'll cover that next week. But just know that this is roughly how every single database system is going to execute queries at a high level. What I'm showing here is what we refer to as a logical plan, meaning I'm not telling you how I'm accessing A or how I'm accessing B, and I'm not telling you what join algorithm I'm using. I'm just saying this is the logical flow I want for my data. So today's lecture, and the ones going forward, are about the actual physical implementations of the operators we would have in our database system to do a join, to do a sort, to do a filter, and so forth.

The other thing that I'm going to punt on for now is coming up with the right, or optimal, configuration for this query plan. Again, SQL is a declarative language: it says, this is the answer I want. The database system has to figure out the most optimal way to execute it. And that could be in two ways. One is trying to optimize the physical algorithm you want to use — is one join better than another? Spoiler: hash join is usually better, so you would want to take that. But it also could be a logical restructuring of the query plan, similar to the logical rewriting of relational algebra we saw before. So for example here, I put the filter below the join because I don't want to do a join on a bunch of data that I'm going to filter out later on. That's called a filter pushdown, or predicate pushdown. These optimizations we can ignore for now.

And just as before — the entire semester we've been assuming we're working on a disk-oriented database system — not only can we not assume that the tables we're trying to access fit in memory, we also can't assume that the intermediate results from one operator to the next fit entirely in memory. That means we're going to have to use our buffer pool again to implement these physical operators so that, if necessary, they can spill to disk when the intermediate results of an operator exceed our allocated memory. And as I said before, this will be much different than how you may have analyzed algorithms in the past, where asymptotic complexity — just looking at the number of comparisons or the number of operations you have to do — may not be the right way to judge whether one algorithm is better than another. It may actually be how often you have to read and write things to disk. A hash table is going to be O(1), that's great, but if that means random access — I'm jumping to random pages and doing random I/O — that's bad. Maybe I want to do something that is, asymptotically, slower, but where I get more sequential I/O out of it, or where I bring things into memory, do as much work as I need to do on them, and then never have to go back and look at them again. So, just like before, we're going to try to choose algorithms that are conscious of how we're bringing things in and out of disk, and we want to maximize the amount of sequential I/O. And this is another good example of why we don't want the OS to manage our memory for us: the OS doesn't know what algorithm we're actually using, so it's not going to know how to prefetch things if our access pattern isn't always sequential.
Or it may not know that the next step of our algorithm is going to reuse the block we just fetched, and therefore it shouldn't swap it out. So today we're going to start off with the simple primitive of sorting. This seems obvious, so why do we need to sort in a database? Well, it's because the relational model — yes, question. So his statement is that the running theme of the course I've laid out is that we don't want the OS to do anything, because it keeps getting in our way. So what do we actually need an OS for? I mean, the OS is like a frenemy, right? We need it because we need to boot things up and deal with talking to the hardware, in some cases. Someone over here brought this up before and asked, should we just have a unikernel? What he's heard of is a unikernel, where you basically strip down the OS to do nothing but whatever your database system actually needs. There was some work done in the early 80s where you would build a database system on raw bare-metal hardware — you wrote your own firmware to talk directly to the hardware. Nobody does that anymore, because the overhead is just not worth it. But the answer is, we don't want the OS to do almost anything, right? And therefore there's a bunch of optimizations. We talked a little bit about some of these things, and we can talk more about them in the advanced class. There are methods now to do kernel bypass to avoid having to talk to the OS for things: instead of going through the OS's API to access the file system, we can maybe bypass the kernel and go directly to the hardware to read and write the data we need. We want to try to do those things, yes. We want to do our own scheduling, we want to do our own memory allocation — we want to do everything ourselves. We're always in a better position because we know exactly what the queries are trying to do, what our data looks like, and what other queries are queued up to run. So we need the OS — again, frenemy is the best way I can put it. You need it to exist, but we don't want to talk to it, okay?

All right, so again, why do we need to sort? At the very beginning of the semester we talked about how the relational model, and therefore SQL, is inherently unsorted, but oftentimes we're going to have queries that want things sorted. Show me the top 10 actors in different movies, as you did in Homework 1; show me the top 10 players in a game. ORDER BY is a very common operation, and therefore we need to be able to support it efficiently. There are going to be some cases, though, where even if the query doesn't ask for the data to be sorted, sorting things ourselves internally as we're running the query puts us in a better position to do things more easily in the upper levels of the tree. It's going to be really easy for us to do some aggregations, like duplicate elimination with a DISTINCT clause, if we sort things, because then we just rip through the data after it's sorted, and whatever key we're looking at, if it's the same one we saw in the previous position, we know it's a duplicate and we can just ignore it. We can do the same for aggregations with GROUP BY clauses, and we saw how to do bulk loading with a B+ tree if you sort ahead of time.
So sorting is going to be useful for a bunch of different parts of the database system, and it'll make some algorithms in other parts of the tree much more efficient if we know the data is coming in sorted — which we would know, because the database system is the one controlling the query plan; it knows what steps it's doing as it moves data up, so these assumptions are valid, or safe. As I said at the beginning, we're assuming a disk-oriented system where the data could be out on disk and you may have to spill to disk. But if everything does fit in memory, then we don't do anything special — we can just use quicksort; that's what Postgres does. Whatever: if everything fits in memory, then whatever your favorite sorting algorithm is — sample sort, smoothsort, cocktail shaker sort, whatever crazy thing you can find — you can do it, we don't care, that's fine. It's when it doesn't fit in memory that we need to be more careful and use a technique that's aware of the cost of going out to disk. Again, either the data doesn't fit in memory to begin with, or as we're generating intermediate results we have to spill those to disk as well. If everything doesn't fit in memory, quicksort is going to be a terrible idea, because you're going to have those pivots, and that's all going to be random I/O, and that's going to be a nightmare. We don't want to use quicksort; we have to do something smarter.

So today I'm going to talk about, in general, two different sorting approaches for when you have to spill to disk. The first is heap sort for top-N queries — this is pretty obvious, we'll cover it quickly. The main one is going to be the external merge sort, the main technique for taking a divide-and-conquer approach on data that doesn't fit in memory. And this divide-and-conquer strategy is going to be useful in other parts of our database as well, as we'll see next class and going forward. We'll see next week how we can use sorting, and also hashing, for doing joins, and then we'll finish up today talking about how to do aggregations with hashing or with sorting.

All right, so this is a special case, but it shows up often enough that it's worth discussing. If you have an ORDER BY clause with a LIMIT, then the database system doesn't actually need to do multiple passes over the data set; it can just scan it once, find the top-N things, and you're done. The technique that Postgres and others use is just a heap sort. Also, this syntax here — I don't think I told you about it before — the LIMIT/OFFSET syntax is actually not in the SQL standard. This FETCH FIRST n ROWS syntax is what's actually in the standard; it's what SQL Server and other systems support. It's basically saying: after I do my ORDER BY, I want to fetch the first four rows, but I can add this little WITH TIES thing at the end, which says I want to get all the elements even if there are ties. So in this case here, I have my original data set — a bunch of keys, a bunch of integers like this — and I'm just going to scan along and maintain a sort heap, which is like a priority queue where you keep track of the elements in sorted order.
So at the very beginning my heap is empty. I put three in; next I want four, so slide three over and put four there; slide four and three over, put six here; add two at the end. But now I get to nine and my heap is full — I know I'm looking for the top four, because the query says so — so I just slide things over and add nine. Then I get here and, because I have WITH TIES, I need these duplicate fours, so I drop out the three and put the four in. But now I get here and I have another four; again, with ties, I need to store the four multiple times, because the query result has to reflect how many times four was repeated. So in this case we just double the size of the memory for the heap and then we can add the four. And then lastly we get to eight, and we just slide it in like that. Again: one pass through the data, I maintain the sort heap, and assuming the heap fits in memory, this is super easy and I'm done.

Yes — the statement is, as soon as you see the LIMIT keyword, you try to use the sort heap. Yes. Right, and the other statement is that at the very beginning you don't know whether the heap is going to fit in memory; you assume the best-case scenario that it will, yes. We'll talk about this when we get to query optimization: the database system maintains its own internal statistics to try to predict how much data these different operators are going to spit out, and those estimates are notoriously wrong. In this example it would maintain a histogram to say how many distinct values it expects for a given attribute, and from that you try to estimate how many elements you're actually going to see, or what the likelihood is that you're going to have a bunch of duplicates. This is a simple example, so it's simple, yes.

Her statement — and she's correct, and we'll get to this later — is that if there was already a B+ tree index on the ORDER BY attribute, wouldn't it be faster to scan the index instead of scanning the data? Yes, we'll get to that in a second. Yes — he asks, if you put the ORDER BY after the LIMIT, does it change the semantics? Why would it? I don't think the syntax will allow you to put the LIMIT before the ORDER BY, unless you have a nested query. If you did a nested query with the LIMIT inside of it and then had the ORDER BY outside of it, then in that case the LIMIT 4 would just grab the first four tuples and be done, and the ORDER BY on the outside would just quicksort those four keys. So yes, that would change the semantics.

Okay. Like I said, this is a special case; Postgres supports it, some other systems support it, if you know you're just doing this top-N thing.
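Just to make that walkthrough concrete, here's a minimal sketch of the one-pass top-N heap idea in C++. It assumes we're keeping the N largest keys, ignores the WITH TIES handling, and uses the toy keys from the example; it is not Postgres's actual implementation, just the shape of the technique.

```cpp
// Minimal sketch: ORDER BY ... DESC LIMIT n via a bounded heap, one pass over the input.
// Assumes the heap of n keys fits in memory; WITH TIES is not handled here.
#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

std::vector<int> TopN(const std::vector<int>& input, size_t n) {
  // Min-heap holding the n largest keys seen so far.
  std::priority_queue<int, std::vector<int>, std::greater<int>> heap;
  for (int key : input) {
    if (heap.size() < n) {
      heap.push(key);
    } else if (key > heap.top()) {   // beats the smallest key we kept
      heap.pop();
      heap.push(key);
    }
  }
  std::vector<int> result;
  while (!heap.empty()) { result.push_back(heap.top()); heap.pop(); }
  std::sort(result.begin(), result.end(), std::greater<int>());  // descending order
  return result;
}

int main() {
  for (int k : TopN({3, 4, 6, 2, 9, 4, 8}, 4)) std::printf("%d ", k);  // 9 8 6 4
}
```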
What's likely more common is the external merge sort. Again, this is a divide-and-conquer algorithm that splits the data into separate runs, sorts those runs individually, and then in subsequent phases combines sorted runs together into larger sorted runs, repeating until the entire data set is sorted. Now, what can be confusing is that there are going to be two phases — the sorting phase and the merging phase — and this gets repeated. Next class we'll also talk about the sort-merge join algorithm, which uses this merge sort as its sorting step, so I'll be very careful next class to distinguish the sorting phase of the join from the sorting phase of the merge sort we're talking about here. But for now, we can ignore all that. All right, so phase one is: I'm going to sort chunks of data that fit in memory and write those sorted chunks back out to disk, and then I'm going to read those sorted runs back in and combine them into larger runs that are double the size of the previous runs.

All right, so let's go through this visually. Actually, quickly, before that: the things we're actually sorting in all these algorithms are going to be key/value pairs. For simplicity, going forward I'm not going to show the value portion, but you can think of it just like when we talked about B+ trees — I showed the keys but didn't always show the value, the thing they were pointing to. But one interesting design decision to think about in our operator algorithms is what the value actually is, and there are two choices. The first is that you store the entire contents of the tuple in the value: I would have the key and then a complete copy of the tuple. That means that as I'm doing the sorting and writing things out to disk, I'm making multiple copies of the entire tuple from one step to the next. The other approach is to just store the record ID, because that's a 32- or 64-bit integer that I can pass around, and that's much cheaper. These two choices are called early materialization and late materialization. Again, with early materialization I have the entire copy of the tuple. You can play a game of recognizing how much of the tuple you actually need in the upper part of the plan and only carry portions of it if you're doing projections — we'll cover that next week. With late materialization, the value is just the record ID. So there are trade-offs here. With early materialization I'm copying potentially a lot of data from one step to the next: as I do these runs I've got to copy the tuple, bring the tuple in, and write it back out over and over again. But the advantage is that when I actually need to produce the result to send back to the client, I don't have to go fetch the tuple again from disk — I bring it in once and try to keep it in memory as much as possible. With late materialization you do as much of the work as you can on the keys with just the record ID, and only when you actually need the rest of the values do you go fetch the tuple. I would say the record ID approach, late materialization, is more common in column stores, because I figure out the minimum set of columns I need, operate on those while passing the record ID around, and at some point I have to stitch the values back together — then and only then do I follow the record ID to go get the rest of the data. This assumes you're going to filter out a bunch of the tuples early in the query plan, so you don't fetch unnecessary data. Early materialization, again, has the advantage that I bring the tuple in once and never go back for it.

All right, so let's see how to do a two-way external merge sort. The "two" in two-way is the number of runs we're going to merge together to generate each new run.
You could have a K-way sort — say a four-way sort, where I fetch four runs together, have four cursors walking through them, and merge them into a new run that's four times as big as the previous ones. For simplicity, we'll keep it at two. I'm going to break our data up into N pages. I'm not specifying what the page size is; it doesn't matter for the algorithm. But one thing we do have to allocate and know ahead of time is the number of buffers we have. So we're going to say we have B buffer pages, and that's the finite amount of memory we have both to bring in the input data as we read runs from disk and to generate new outputs that we then write out to disk as we go from one step to the next. We talked before about how the database system can be configured with different page sizes — all of that still applies here. The number of sort buffers is usually controlled through a separate parameter you can set either globally in the database system or for a single query, session, or client connection. In Postgres I think it's called work_mem; in MySQL it's something like the sort buffer size. It's a parameter saying how many megabytes or kilobytes you're allowed to use for sorting in memory, and when a query exceeds that, it has to spill to disk. How to set that correctly is a whole bit of black magic we can talk about later.

All right, so let's walk through a simple example. We have a database on disk with just two pages, and then we have some amount of memory. In pass zero, what we're going to do is read the pages of the table into memory, sort them into runs the size of the original pages, and write them back out to disk. Assume we can only bring in one page at a time, so we're going to need one page for the input and one page for the output buffer that we then write out. So in this case here, I bring in this page, I sort it, and I write it back out to disk. Because these pages are coming from the actual table itself — these are the primary storage pages of the table — I don't want to sort in place; I don't want to overwrite the original data. The outputs are temporary buffers: if I crash and come back, who cares that they get lost. You could do in-place sorting once a page is in memory, but you wouldn't do that if it comes from the table pages, because if another query has to access that data too, reorganizing it might screw up that other query running at the same time. So in the first pass, you always make a copy and sort that. Then in the next step, I bring in the other page, sort it in memory, and write it back out. Now, in the next pass, I'm going to bring in the two pages together and walk through them with a cursor on each side, comparing the first element of the first page against the first element of the second page, and — assuming we're going in ascending order — whichever one is less, based on the comparison function, is the first element that goes into the new output buffer.
So I bring these two pages in — they're already sorted — and I walk through them incrementally, taking elements until I fill my output buffer and write that out, then go back, fill up the output buffer again, and write that out, and now we have our sorted data. This is a high-level overview; I'm just showing colors here. The next slide will show actual values, but the point is to understand the basic idea: you bring a page in, you sort it, then you combine it with another sorted page by walking through them incrementally, doing the merge, and writing out a new run that's twice the size of the original pages. Yes — we have three buffers, three blocks. No, the "two" in two-way means I'm going to merge two sorted runs together at a time. You've got two pages, so why isn't everything just sorted in the first go? You're basically asking, why can't I just bring both pages into memory and sort them right there? I mean, this is a toy example — I'm just trying to show you the process. It's two pages, you're right, this would fit in memory. Yes. Again, the two doesn't mean two pages; the two means I'm going to merge two runs together. Let's go to the next slide and hopefully this will make more sense.

All right, let's see a bigger example. Again, I'm not showing the value portion of the key/value pairs — these are just keys — and at the end of the keys there's some end-of-file marker. In this example, every page holds two keys. So the first page has the keys three and four, the next page has six and two, and so on; the comma just separates the keys, it doesn't denote key/value pairs. In the first pass, we bring every page into memory one by one, sort it based on the contents of that page, and write it back out. So in this case I'm going to need my buffers — well, in the first pass you just bring one page in, sort it, and write it back out; we'll talk about the buffer counts in a second. Then in the next pass, what I'm going to do is combine two adjacent sorted runs: bring them into memory, walk through them with the cursors, and compare the first key of the first run against the first key of the second run; whichever one is less gets written out to our output buffer. So we start with these two runs up here. We do a comparison between three and two: two is less than three, so two gets written out. The cursor moves over and now we compare three and six: three is less than six, so three gets written out. And now we've filled up our page — it holds two elements — so it gets flushed to disk. Then we continue with the other comparisons: four versus six; four is less than six, so four gets written out, followed by six, and then that page gets written out. Right? Yes — the statement is that over time we're going to have many physical pages that make up one sorted run, yes. So there's an additional bookkeeping data structure: we keep track of, I'm at this pass, and for sorted run one, here are the pages it contains, in that order. I'm not showing that data structure here.
And that's in memory — it doesn't need to get spilled to disk, because it's not going to be huge, and if I crash, who cares: the query and its intermediate results are lost anyway. Yes. So the statement is, if I don't have space to hold two pages, how do I handle this? I'm not showing when things get written out: when this output buffer gets full, it gets flushed and I reuse the memory, so now I can fill up another page. So I always need at least three pages for a two-way merge sort, because I need two pages for the inputs — the two sorted runs — and one page for the output. Yes, you need B minus one buffer pages, one for each of the runs you're trying to merge together in a K-way merge. In my example here... no, let's keep going. We can do this incrementally, right? Because the runs are sorted, the cursor never has to backtrack. Meaning, if I do a comparison and this one is less than that one, I write this one out to my output buffer and move this cursor down, and I never have to go back and look at previous pages, because I know I'm always moving forward in sorted order. Yes — the question is, is the sorting happening in place? I mean, once you're past pass zero... well, no, I'll take that back. No, because you're always writing to an output buffer. Could you swap? With two-way, maybe; if it's K-way, that's more complicated. For simplicity, just assume there's an output buffer you always write into. What about the same page, sorry? Like, I have this page here and this page here, I'm trying to merge them, I write into this one, I flush it out to disk — and then I overwrite these guys, or what? Overwrite them on disk, or overwrite them in memory? You could, yeah — you could reclaim space that way. It doesn't matter. So your question is, in this example, no matter how large the sorted runs are, I only need three pages in my buffer pool to do this — yes. And obviously, in a real system you'd have way more memory than three pages, but for simplicity and making it fit on slides, three.

All right, so now we get to the next pass, and we're going to generate four-page sorted runs. Same thing: I start with these guys here and these guys here and combine them together. I only need to bring in the first page of this run and the first page of that run and do the comparison. Two is less than four, so two gets written out; three is less than four, so three gets written out, and that output page can then be flushed. Then I grab another output buffer, zero it out, and keep the comparisons going. And you keep going until you eventually reach a completely sorted output. So the number of passes we have to do is 1 + ceil(log2 N). The 1 is because we always have pass zero, where we read everything in and sort the individual pages. Then it's ceil(log2 N), where N is the number of pages, because as we go from one pass to the next we're generating exponentially larger runs, so at some point the entire data set has been sorted. The total I/O cost is 2N times the number of passes: the 2N is because in every pass I have to read each page in and then write it back out — every pass is one read and one write of the data. All right, cool.
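Here's a minimal sketch of that merge step between two sorted runs: one cursor per run and a tiny output buffer that gets "flushed" when it fills, standing in for writing a full page back to disk. The page size and the run contents are made-up toy values, not from any real system.

```cpp
// Minimal sketch: merging two already-sorted runs with a bounded output buffer.
#include <cstdio>
#include <vector>

constexpr size_t kPageSize = 2;  // keys per output page (toy value)

void FlushPage(std::vector<int>& page) {
  // Stand-in for writing a full output page to disk, then reusing the memory.
  for (int k : page) std::printf("%d ", k);
  std::printf("| ");
  page.clear();
}

void MergeRuns(const std::vector<int>& a, const std::vector<int>& b) {
  std::vector<int> out;
  size_t i = 0, j = 0;
  // Cursors only move forward: the runs are sorted, so we never backtrack.
  while (i < a.size() && j < b.size()) {
    out.push_back(a[i] <= b[j] ? a[i++] : b[j++]);
    if (out.size() == kPageSize) FlushPage(out);
  }
  while (i < a.size()) { out.push_back(a[i++]); if (out.size() == kPageSize) FlushPage(out); }
  while (j < b.size()) { out.push_back(b[j++]); if (out.size() == kPageSize) FlushPage(out); }
  if (!out.empty()) FlushPage(out);
}

int main() { MergeRuns({3, 4, 6, 7}, {2, 6, 8, 9}); }  // prints 2 3 | 4 6 | 6 7 | 8 9 |
```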
As I said, in my example here I'm assuming we have three buffer pages: one for the input run on the left, one for the input run on the right, and one for the output. Obviously, in a real system we'll have way more space. And we don't want to have to wait for I/O, so we want to do things asynchronously. When I fill my output buffer and have to write it out to disk, I don't want to block my sorting algorithm while I wait for that write to finish. So you would actually have another set of memory pages that you can start filling up while some other thread or process writes out the data in the background. This is called double buffering: you can prefetch and asynchronously write things out in the background while you're doing whatever processing you want on the data you have in memory. In my example here, I bring in page one and start sorting it; in the background I'm fetching the other page, so that when I finish sorting the first page, the second page is ready to go and I can sort it right away. Pretty obvious.

The general external merge sort works exactly the same way, but now B becomes the factor that determines how many pages we're actually going to use. In pass zero, we always use all B buffer pages, because we're not doing any comparisons between sorted runs yet — we read in B pages at a time, sort them as one chunk, and write them out. So now we generate ceil(N / B) sorted runs of size B. In the subsequent passes, we merge B minus one runs at a time, because we always need exactly one output buffer to write things out. All right, so the math works out basically plug-and-chug. For example, say we have a data set with 108 pages and five buffer pages to sort with: N = 108, B = 5. In the first pass we take N divided by B and take the ceiling of that, which means we generate 22 sorted runs, each five pages long, except the last one will only have three pages. That's why we take the ceiling, yes.

Yes — the question is about this formula here: shouldn't the 1 + log2 N be something in terms of B? So, the size of the run is doubling each pass, right? The B factor really determines how many pages you're bringing in. The log base 2 in the earlier formula means I can bring in two sorted runs at a time — that's the two-way case — and I'm ignoring that the extra buffer is for the output. That's why B doesn't show up there. And the ceiling of N over B — that's the number of pages I have and how many chunks I can split them into; I take the ceiling because the last run may not be full, but I still have to account for it. Yes, we can take this offline. I think the issue is that the general version builds runs whose size is based on B, while in the two-way example the initial run size was just one page for simplicity. We can take it offline and I'll double-check the math, but I'm fairly certain it works out; we can clear up any confusion on Piazza afterwards, okay? Okay, right — so as I said, it's pretty much plug-and-chug.
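To make the plug-and-chug concrete, here's a tiny calculator for those formulas using the N = 108, B = 5 example from the slides. It only evaluates the math — pass zero produces ceil(N / B) runs, each later pass merges B - 1 runs, and every pass reads and writes all N pages — it doesn't simulate the sort itself.

```cpp
// Minimal sketch: cost formulas for the general external merge sort.
//   passes    = 1 + ceil( log_{B-1}( ceil(N / B) ) )
//   total I/O = 2 * N * passes   (each pass reads and writes every page once)
#include <cmath>
#include <cstdio>

int main() {
  const double N = 108;  // pages to sort (lecture example)
  const double B = 5;    // available buffer pages
  double runs = std::ceil(N / B);                                   // 22 runs after pass 0
  double passes = 1 + std::ceil(std::log(runs) / std::log(B - 1));  // 1 + ceil(log_4(22)) = 4
  std::printf("runs after pass 0: %.0f\n", runs);
  std::printf("passes: %.0f, total I/O: %.0f page reads/writes\n", passes, 2 * N * passes);
}
```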
You just keep generating larger and larger run sizes as you go down, and at the end you end up with a single sorted output file. In this example here, the math works out that we end up with four passes, like this. Again, it's just math; there's nothing really special about it. But the general idea is that we're dividing up the data set into smaller chunks that we can sort and write out, and then when we bring the sorted runs back in, we never have to backtrack as we do our scans and comparisons. That way we're maximizing the amount of sequential I/O and removing any wasted I/O.

Two quick optimizations I want to talk about, beyond the sorting algorithm itself, concern how you actually do the comparisons on your data. These matter whether we're doing external merge sort, the heap sort, or an in-memory quicksort. The first issue is the cost of invoking the comparison function itself. If you've ever used the standard template library, you may have to pass in a pointer to a function to do your sort, or in some cases a lambda function that can get inlined at compile time. But think of a system like PostgreSQL, or really any general-purpose database system: they need to be able to do a comparison on any possible key type or combination of keys. So typically the way the sorting algorithms are implemented is that you pass in a function pointer that knows how to compare whatever key type it is you're sorting. If everything fits in memory and you're trying to do this as fast as possible, the cost of following that pointer to the function to do each comparison can actually be quite expensive, especially for an in-memory quicksort. We're not going to talk about code specialization and just-in-time compilation this semester other than this point here, but in some systems you can actually build a specialized version of the sorting algorithm where the comparison function is hard-coded into the sorting algorithm implementation itself, so if you know you're comparing two 32-bit integer keys, you don't have to follow that function pointer — it's baked directly into the sorting algorithm, and that's really fast. PostgreSQL does this; it's kind of a hack. They have a Perl script that basically takes the sorting algorithm and duplicates it, making a version with the integer comparison built in, a version with the floating-point comparison built in, and so on. So they have duplicated code in the system to do quicksort for the different key types you can compare.

Another technique, similar to the compression stuff we talked about before, is suffix truncation. If I have to compare varchars that could be arbitrary length, doing string comparisons is really slow. A better approach is to first do the sorting on prefixes of the strings — ideally a binary representation of them, so now you're just comparing integers.
And I still preserve the ordering of the strings with my binary representation, but now I'm comparing two 32-bit integers rather than arbitrary-length strings. So if the binary prefixes are distinct, then my sorting algorithm still works with that comparison, and it's really fast. If the prefixes are equal, then I have to follow the pointer to get the original value and do the more expensive comparison. This is like the optimistic stuff we talked about last class: I assume the common case is that keys are not going to be equal, and therefore this truncation approach speeds things up. Yes? The question is, do I assume that the string order matches the binary order? It has to, yes. What's that — just the first few bytes? Yeah — I'll send a link to the Postgres one. So you won't compare the length? No, you don't need to compare the length. If my name is the character A 128 times and your name is the character X 128 times, I can just look at the first few bytes and know that mine is less than yours, and therefore I don't need to do the full string comparison. Yes? The statement is that this relies on ASCII being ordered. So, this is where collations come in: the database system can't always assume the data is ASCII. There are different collations, different languages with different orderings of characters, and then there's Unicode — you have to be aware of that. There was this bug in Postgres — I don't know if I talked about this — a few years ago where they relied on libc, the built-in libc comparison for strings. Then a new version of libc came out where the string ordering for Unicode switched. Since they rely on whatever libc says the ordering is, if you built a B+ tree on a varchar column before the libc upgrade it would be sorted one way, and after the upgrade it would sort another way, and it broke all these people's B+ trees. So collations are super tricky to deal with. The enterprise systems implement that stuff themselves; the open-source ones, Postgres and I think MySQL, rely on external libraries.
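Going back to the prefix-truncation idea for a second, here's a minimal sketch of what that comparison could look like. It assumes a plain byte-wise ordering (which, per the collation discussion above, is not safe for every locale), and the helper names are made up for illustration — this is not the Postgres code.

```cpp
// Minimal sketch: compare a fixed-width binary prefix first (cheap integer compare),
// and fall back to the full string comparison only when the prefixes tie.
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>

uint64_t Prefix8(const std::string& s) {
  unsigned char buf[8] = {0};  // zero padding keeps shorter strings ordered correctly
  std::memcpy(buf, s.data(), std::min<size_t>(8, s.size()));
  uint64_t v = 0;
  for (unsigned char c : buf) v = (v << 8) | c;  // big-endian pack: integer order == byte order
  return v;
}

bool LessThan(const std::string& a, const std::string& b) {
  uint64_t pa = Prefix8(a), pb = Prefix8(b);
  if (pa != pb) return pa < pb;  // the common, fast case
  return a < b;                  // prefixes tie: do the expensive full comparison
}
```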
Yes? So his question is — going back to the very beginning — if I don't have an index on a table, do these leaf nodes in the query plan always have to be a sequential scan? Yes, the fallback option is always going to be a sequential scan, and in some cases that's okay; sometimes it's a bad idea and you want to have an index. And his follow-up question — we can take this one offline — is, do modern database systems automatically build indexes? Some do, yes; most do not. Postgres does not — that's my startup. It's hard; it's not an easy problem to solve.

Okay, all right. So I want to come back to the example she brought up at the beginning: if I already have an order-preserving index on the keys I want to sort — like the B+ tree we spent the last two classes talking about — then the database system can recognize, aha, you want to sort on the thing I already have an index on; it may actually be better for me to just traverse the B+ tree and scan along the leaf nodes, because I'll get my data in sorted order that way. We have to consider two cases: clustered indexes and unclustered indexes. For clustered indexes, this is always going to be a good idea. Ideally, whatever you have in your ORDER BY clause is exactly what the index is built on; in some cases, if you only have a prefix of the index keys in your ORDER BY clause and not the full thing, you can still use it. Postgres, I think, is very conservative — if it doesn't have an exact match on the keys, it doesn't use the index — but the commercial guys are a bit more aggressive about using indexes. So if it's a clustered index, this is the best-case scenario: we just traverse to the leaf nodes, get all the entries we need, and scan along with sequential access. In this case you could traverse the index once to find the start, then scan along the pages and know when you need to stop; you wouldn't have to keep going back through the tree itself. For an unclustered index, this could be rough, because the tuple pages are in a different order than what's in the leaf nodes, so this could be a random I/O for each entry. The database system is going to try to be clever and figure out how much the index is going to help it versus just scanning everything sequentially and then doing the sort. And this is what I was saying before: when the optimizer converts a SQL query into an actual physical plan it wants to execute, it has to make estimations about how much I/O it's going to do with one strategy versus another.

Okay, before we jump into aggregations, any questions about sorting? If everything's in memory: quicksort. If it's a top-N query: heap sort. If we think it's going to spill to disk: external merge sort. Now, you could have a strategy where you assume the best-case scenario and always start off with a quicksort, and then if you got it wrong, throw the results away and fall back to a merge sort. That's called adaptive query processing. Very few systems do that, because it's hard to do: you basically have to have checks as you run the query that say, oh, my assumptions are wrong, kill this and start over. Again, the enterprise guys can do some of these things; the open-source ones cannot.

All right, so let's talk about how we actually want to implement aggregations. Recall from the beginning of the semester that an aggregation is some function that takes multiple values of a single attribute in our data set and collapses them into a single scalar value: MIN, MAX, SUM, COUNT, AVG, and so forth.
So when we actually want to implement these aggregations, we need a way for the system to quickly find the tuples that have the same values for the attribute we're aggregating on, so that we can group them and compute whatever aggregation function we want. There are basically only two ways to implement aggregations: either through sorting or through hashing. The dumbest thing you could do is scan through the table over and over and over again and compute your aggregation that way, but that would be dumb — you don't want to do that. So we're going to rely on the sorting stuff we talked about before, and then we'll see how we can use elements of the hash table stuff we've talked about to do aggregations with hash tables. And again, the spoiler is that the hashing approach is almost always going to be better than sorting, especially if your disk is slow. The idea in both cases is that we're trying to avoid random I/O going to disk and maximize the amount of sequential access.

So let's start with the simplest aggregation you could have: DISTINCT. You're just trying to find all the unique course IDs from the enrolled table. I'm not showing what the query tree looks like, but you can assume we have to scan the enrolled table, and we're going to apply the filter from our WHERE clause to find all the grades that are either B or C. We can also do a projection early on, because we're assuming early materialization — so we don't have to copy a bunch of the columns of our tuples as we go along. We'll push down the projection and remove everything but the course ID, because that's the only thing we actually need to produce the result. Then we sort this single column. Now, to produce the final result, we just scan through the output of the sort and keep track of the last value the cursor looked at; if I ever come across the same value as the one I previously looked at, I know it's a duplicate and I can throw it away. Pretty straightforward, pretty simple. This idea of throwing away data as quickly as possible is a recurring theme of query optimization — that's the whole point of query optimization. I showed how to push down the filter, I showed how to push down the projection, and again, the database system figures all this out for you from the SQL; you don't have to write this stuff yourself. In my example here I had an ORDER BY clause, so it made sense to implement the DISTINCT aggregation using sorting: the aggregate was on the course ID and the ORDER BY was on the course ID, so I killed two birds with one stone — I could do the sorting, get things in sorted order, and do quick duplicate elimination. But if my data doesn't need to be sorted, then maybe that's not the best approach. I still could use sorting, because I'd just compare each value against the previous one to find the duplicates.
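Just as a tiny illustration of that sorted-scan duplicate elimination, here's a sketch with made-up course IDs: sort once, then one sequential pass comparing each key to the previous one.

```cpp
// Minimal sketch: DISTINCT over a sorted column — compare each key to the previous one.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
  std::vector<int> cids = {15445, 15721, 15445, 15826, 15721};  // made-up course ids
  std::sort(cids.begin(), cids.end());
  int prev = -1;  // sentinel; assumes keys are non-negative
  for (int cid : cids) {
    if (cid != prev) std::printf("%d\n", cid);  // first time we've seen this key
    prev = cid;                                 // duplicates of prev are skipped
  }
}
```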
But there may be cases where, as I said, hashing can be better: if I only need to remove duplicates and I don't need any sorting, then it's more efficient to just build a hash table, or some intermediate data structure, than to sort everything and then walk through it again. I can start throwing things away as I'm doing the hashing, as soon as I identify that I have a duplicate. But of course, as I said, if it's going to be random I/O — if my hash table or whatever I'm building doesn't fit in memory — then I don't want to start swapping pages in and out and thrashing on disk, where the thing I needed before got written out to disk and now I need it again and have to read it back. We don't want to pay for the same real estate twice: when we bring something into memory from disk, we want to do all the processing we can on it, then throw it away and never have to go back and get it again.

So let's see how we do this with a hashing aggregate. What we're going to do is populate an ephemeral hash table as we scan the table, and for each record we find, we check whether there's already an entry in the hash table. If there is, and it's a DISTINCT clause, we can throw the record away; if it's a GROUP BY, we apply whatever computation the aggregate function needs incrementally, maintaining a running total, and once we reach the end, we produce the final result — maybe doing some additional computation to get the correct answer the query is looking for. If everything fits in memory, this is easy to do. Of course, the tricky part is when we have to spill to disk. Again, if we just use the hash tables we talked about before, that's going to be a bunch of random access, and that's going to kill us. So, by being smarter, we want to maximize the amount of work we do on a page whenever we bring it into memory. An external hashing aggregate is going to look kind of like the external merge sort in that there are multiple phases. For the purposes of this class, we'll assume we only need to run through these two phases once; with the external merge sort we may have multiple passes, writing things out and bringing them back in, but here we assume we can run both phases once and then we're done. We'll see next class how to handle cases where we might need multiple passes. All right: in the first phase, we partition the data into buckets based on the hash of the key, and whenever a bucket page gets full, we write it out to disk, reuse the memory, and keep filling it up. Once we've done that first pass over the data, we go into the second phase: now we bring those buckets back in one partition at a time and build an in-memory hash table for each partition to compute whatever aggregation we want for the query. Again, the idea is that we're trying to maximize sequential I/O.
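Before walking through the slides, here's a minimal sketch of that first, partitioning phase. The keys, partition count, and page size are made-up toy values; there's no duplicate checking here, just like in the example, and the "spilled" lists stand in for pages written to disk.

```cpp
// Minimal sketch: phase #1 of an external hashing aggregate.
// Hash each key with h1, mod by the partition count, append to that partition's page;
// when a page fills up, "spill" it and reuse the memory.
#include <cstdio>
#include <functional>
#include <vector>

constexpr size_t kNumPartitions = 3;  // stands in for the B - 1 output buffers
constexpr size_t kPageSize = 2;       // keys per page (toy value)

int main() {
  std::vector<int> keys = {15445, 15645, 15445, 15826, 15721, 15445};
  std::vector<std::vector<int>> buffers(kNumPartitions);               // in-memory pages
  std::vector<std::vector<std::vector<int>>> spilled(kNumPartitions);  // pages "on disk"

  std::hash<int> h1;
  for (int key : keys) {
    size_t p = h1(key) % kNumPartitions;
    buffers[p].push_back(key);              // append blindly; no duplicate checking
    if (buffers[p].size() == kPageSize) {   // page full: write it out, reuse the memory
      spilled[p].push_back(buffers[p]);
      buffers[p].clear();
    }
  }
  for (size_t p = 0; p < kNumPartitions; p++) {
    if (!buffers[p].empty()) spilled[p].push_back(buffers[p]);  // final partial pages
    std::printf("partition %zu: %zu page(s)\n", p, spilled[p].size());
  }
}
```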
All right, so in phase one, we use a hash function h1 to split the tuples into partitions. A partition is basically a logical grouping of one or more pages whose keys all hash to the same partition. When these bucket pages fill up, we write them out via the output buffers, and we maintain an in-memory data structure to keep track of which pages on disk belong to each partition. For this purpose we assume we have B buffers, and we always use B minus one buffers for the output partitions and one buffer for the input.

All right, let's see an example. I will say I'm showing the DISTINCT aggregation again, and there's an obvious optimization along the way where you could recognize duplicates as you're writing out the buckets, but let's ignore that for now. This is the same query we had before, just without the ORDER BY clause. So I do my filter, I remove the columns, and now I scan through the output of the projection, apply the hash function, mod it by the number of partitions I have, and that tells me which bucket I'm going to write to. So in this case, say there's a bunch of you taking 445 and 645 — all of those hash to this bucket and keep filling it up. This is not actually a hash table where I scan through trying to find duplicates; for simplicity, I just append to the end of the page, and when it's full I write it out. And the question is, if I see the same key, could I just ignore it and throw it away? Yes. But wouldn't doing that with a real hash table be more complex? Yes — so what I'm saying is, this is not a hash table that handles collisions. I literally take whatever the value is, hash it, mod by the number of partitions, and that tells me what page I'm going to write to; I just append to the end of that page until it's full. I don't look at what's already in the page. Your question is, we're trying to find DISTINCT values in this example — isn't that why we're hashing? In this phase I'm only trying to divide the data set up into smaller chunks, into partitions, so that I can then go back and do whatever aggregation function I want on each one. The confusing part, and I should fix this, is that at this point, as I'm hashing into a bucket, I could also just look and say, have I already seen this key before? If yes, don't put it in. But go ahead. Yeah, the statement is that we're using the hash function as a way to divide things up — yes. Then you get the question, what about skew, if everything hashes to the same thing? Ignore that; all the same challenges we talked about with hash tables apply, yes. And the other statement — he's correct — is that you have to keep track in memory of the list of page IDs that correspond to each partition, so that when you write pages out to disk you know what you need to bring back in later. And the next statement — he's correct again — is that if I crash while I'm doing this, I don't want to come back and have these pages look like they're occupied and then not be able to reuse them, right?
Yeah, so you store this temporary data in separate files, not in the actual table heap, because it's scratch data — if I crash and lose it, who cares. When we talk about distributed systems and distributed databases, that is the design philosophy in most distributed databases: if a machine goes down while I'm computing a query, the query just fails; I don't try to recover it. That has changed slightly in newer systems, but for our purposes here, if we crash while a query is running, we don't try to pick up where we left off when we start back up. If you really wanted that result, it's up to you to submit the query again. Okay, so again, this is not a hash table — we're just filling up buckets within our partitions — and we're ignoring duplicate elimination at this point.

All right, so now in the second phase, what we're going to do is rehash everything all over again, but now build an in-memory hash table. For each partition on disk, we look at our internal data structure that says, here are all the pages for that partition; we read them, bring them into memory one by one, scan through them, and use another hash function to build out a new hash table on which we then compute whatever the aggregation function is. Once we bring things back into memory in this second phase, we never have to go back and look at pages we've already processed: as we scan through sequentially, we hash each key, put it in the hash table, do whatever we want to do, and we never look at anything else again. And again, for simplicity, we assume that this hash table fits in memory — that a partition fits in memory. If you have to spill to disk, you can deal with it, but it's not as bad as building a hash table on the entire data set; the per-partition hash table is smaller, so it should fit.

So we go back here: here are our phase-one buckets, and we bring them back in partition by partition. We bring in the first page, scan through it sequentially, use another hash function, and build out the hash table to do whatever we want to do. So here are the two courses people have taken, 445 and 826 — I don't think we've offered 826 in a while. Once we've finished scanning all the pages of the first partition, we append the result to the final output buffer we're maintaining as the final result of this query. Then I skip down to the next partition, bring its pages into memory one by one, build a new hash table — same thing, using the second hash function — and when I'm finished scanning all the pages for that partition, I append the result to the final output buffer, and I'm done. Again, the key idea is that because we already hashed the keys in the first phase and split them into partitions, I know that as I'm scanning this second partition down here, there's not going to be a 721 up in the first one. So I can throw away the first hash table, because the hash function is deterministic — I'm not going to incorrectly find a 721 up here; it can only be in this partition down here. Yes — your statement is, why build the second hash table here? Why not just... do what? Keep track of what?
Yes? So the question is, why build the second hash table here, why not just keep track of... keep track of what, exactly? Yeah, so to his point, he's correct for some cases. In my example there's only one value, 445, so you could just rip through and count the number of entries. But because different keys can hash to the same partition, I have to maintain some data structure that keeps a running count per key. My simple example has only one value, so it's trivial, but think about larger data sets.

Then he asks, what's the worst case, where keys collide in both h1 and h2? The second-phase table is a real hash table, so we can handle conflicts there, with chaining or linear probing. The first phase is not a hash table; it's just dividing things up, and then I throw those buckets away and build the new table, because I don't want to assume that one giant hash table over everything fits in memory. By dividing the data up, the second hash table can be smaller, and I don't have to worry about false negatives, about missing something because it ended up in a different partition, because it would have hashed into this same partition in the first phase.

All right, so now let's talk about the case he was asking about. DISTINCT is sort of the obvious one, and there's an optimization we can do for it, but let's look at the other aggregations you might want. In the rehash phase, instead of the hash table just recording whether a key exists, we store whatever running value we're trying to compute for the aggregation function. When we insert a new entry into the in-memory hash table of the second phase, if the key doesn't exist we populate a new entry, but if it does exist we do an upsert, updating the running total for whatever the aggregate needs. And we can do this even if we have multiple aggregation functions in the query.

So going back to the example, now we have a new query: we want the average GPA. Say we've already done the first phase and set everything up in the buckets, and now we hash with our second function to build this hash table, but the value portion, in the case of an average, is going to be a count and a running sum. For MIN, MAX, SUM, and COUNT it's sort of obvious: count is plus one, sum just adds to the total. We compute this incrementally as we scan through in the second phase, hashing into the hash table, and when we're done scanning all the buckets, depending on the aggregation function, either we're done and the hash table is the final result, or we take one more pass; in the case of average, we divide the sum by the count to produce the final value. Pretty straightforward, right?
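As a sketch of that upsert pattern, here's what the rehash phase might look like for AVG, with the value in the hash table being a (count, running sum) pair and a final pass dividing the two. read_page() is the same hypothetical helper as before, and tuples are assumed to be (grouping key, gpa) pairs.

```python
def rehash_avg(partition_pages, read_page):
    results = {}                                    # grouping key -> final average
    for pages in partition_pages:                   # one phase-1 partition at a time
        table = {}                                  # key -> (running count, running sum)
        for page_id in pages:
            for key, gpa in read_page(page_id):     # tuples assumed to be (key, gpa)
                count, total = table.get(key, (0, 0.0))
                table[key] = (count + 1, total + gpa)   # the "upsert" step
        for key, (count, total) in table.items():
            results[key] = total / count            # AVG needs this extra finalize pass
    return results
```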
All right, cool. So that's it for aggregations. The choice between sorting and hashing is usually easy: hashing is almost always going to be the better one. Try it out on your favorite database system; if it supports both techniques, you'll see it often picks hashing over sorting. If you're going to build a new database system from scratch, well, if you need ORDER BY then you have to build sorting first, and maybe you can leverage that to do some of the aggregations, and then if you have time you go build the hashing implementation second.

We've already seen a bunch of optimizations in how we do this. We're chunking I/O, trying to amortize the cost of fetching things from disk by grouping them together and maximizing sequential scans, and we can do double buffering (sketched below) so a thread isn't blocked while it waits for I/O: somebody else goes and gets the next page while you keep processing.

Yes? So the statement is that at this point in the semester we're assuming our database system runs on a single machine, a single server, not multiple machines. Yes. And he asks, if that single server has enough memory, don't I avoid all of these problems? Let's keep it to a single server so we can ignore the network. So, yes and no. There are database systems called in-memory or main-memory databases whose architecture assumes that all the data fits in memory, including the tables, the intermediate results, and the indexes. In that case you get rid of the buffer pool and a bunch of other machinery, and it runs a lot faster. That's what my PhD thesis was on, in-memory databases. I really thought they were going to take off and be the way everyone built systems going forward, because memory was getting cheaper and larger, but flash got a lot cheaper and a lot faster, so that market never really took off and I was wrong. There are optimizations you can do when you assume everything fits in memory: you can use quicksort for everything, you may still want B+ trees, but you don't have to worry about latching pages and things like that. So yes, you can design a system specifically for in-memory execution. Postgres and MySQL, though, are built from the ground up assuming things don't fit in memory. Even if you give them a huge buffer pool, they still do all the things we talked about here; the query optimizer may recognize, okay, my sort buffer is set super large, so for this particular query I don't have to spill to disk, I don't need external merge sort, I'll use quicksort, but you still go through the page table lookups. That's a long way of saying there are optimizations you can do if you assume everything fits in memory, but most of these disk-oriented systems don't do that.
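Going back to the double-buffering point for a moment, here's a rough sketch of the pattern: one background thread prefetches the next page while the current one is being processed, so the worker rarely stalls on I/O. The thread-pool approach and the read_page()/process_page() helpers are assumptions for illustration; a real engine would drive this through its buffer pool and asynchronous I/O.

```python
from concurrent.futures import ThreadPoolExecutor

def process_with_double_buffering(page_ids, read_page, process_page):
    if not page_ids:
        return
    # One background thread fetches page i+1 while we work on page i.
    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        next_page = prefetcher.submit(read_page, page_ids[0])   # kick off the first read
        for i in range(len(page_ids)):
            page = next_page.result()                # block only if the prefetch isn't done yet
            if i + 1 < len(page_ids):
                next_page = prefetcher.submit(read_page, page_ids[i + 1])
            process_page(page)                       # CPU work overlaps the next disk read
```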
So the next question is, if I have low-cardinality data, meaning a small number of unique values, is hashing still better than sorting? Hashing is usually still going to be better. Yeah, you would end up with super large partitions because lots of values collide, but then, as we'll see in the next class, you can do recursive partitioning: you keep dividing a partition further and further until things fit in memory. So hashing is still going to be better.

Then the follow-up is, what if the column is literally binary, just zeros and ones, like a streaming case? For aggregation, let me think about this. If the optimizer knows there are literally two values, it's almost a degenerate case: you just want to scan through the data once and maintain a tiny hash table with an entry for each of the two values. At some point there's a trade-off where you do have to fall back to the divide-and-conquer approach, but in that simple case I don't think you'd do better than that single pass.

Okay, any other questions? All right, next class we'll do joins: nested-loop join, index nested-loop join, sort-merge join, and hash join. These are the most important algorithms for joins, so I'll spend a lot of time on how to optimize them and make them go fast, okay? All right, see you guys.