All right, guys, let's get started. We have a lot to discuss today. Again, thank you, DJ Drop Tables, for keeping everything fresh.

All right, so before we get into today's lecture, let's go through real quickly what's on the schedule for you guys for the next two weeks. Homework three is out today. It should be on the website, and we'll set up Gradescope so you can submit later today. That'll be due next week on Wednesday the 9th. In two weeks, we will have the midterm exam. Project two is going out today, and I'll talk about that at the end of this lecture, but that'll be due after the midterm. The midterm will be on Wednesday the 16th in this room at the normal class time. It'll be an hour and 20 minute exam, and it'll cover everything up to and including next week's lectures. So the midterm will cover the lecture on Wednesday the 9th and everything prior to that. Any questions about any of these expectations? So homework three, knock that out before the midterm. And project two will encompass some of the material that will be relevant to the midterm, but it won't officially be due until after the midterm. That spaces things out for you guys.

OK, so where we're at now in the semester: we're moving up through the architecture layers. We know how to store things on disk in pages. We know how to copy them into memory, into our buffer pool manager, as needed. Then we talked about how we actually access them: we can build indexes on top of them, or we can do sequential scans. So now where we're at is up above; now we actually want to start executing queries. We want to be able to take SQL queries, generate query plans for them, and then have them use the access methods to get at the data that we need.

OK, so for the next two weeks, we're going to first talk about how we actually implement the algorithms for the operators in our queries. Then we'll talk about different ways to process queries themselves, like how we move data from one operator to the next. And then we'll also talk about the more system-architecture side: what's the runtime architecture of the system for threads or processes, and how do we organize them to run queries in parallel.

So I'm not going to go into detail about what a query plan looks like just yet. I just want to show you what one potentially looks like, to frame the conversation for where we're going today and the next class. We'll go into way more detail about query plan execution next week, and then further when we talk about query optimization and query planning. So a query plan is essentially the instructions, or the high-level direction, of how the database system is going to execute a given query. And we're going to organize the query plan as a tree structure or a directed acyclic graph. So we can take a SQL query like this one, doing a join with a filter on tables A and B, and represent it as a query plan like this. At the leaf nodes, we're doing our scans, accessing the tables, and then we move tuples up to the next operator to do whatever it wants to do. So in this case here, we scan A, and since we don't have any filter on it, we just feed it right into our join operator. For the scan on B, we first apply the filter to throw out any values less than a hundred, and then we feed that into our join operator. This then produces some output that's fed into the projection operator.
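To make that shape concrete, here's a minimal sketch in Python of the kind of logical plan being described. The SQL text and the column names are my guesses for illustration, since the slide's exact query isn't in the transcript, and the nested-tuple encoding is just one convenient way to write down an operator tree.

```python
# Hypothetical query matching the plan described above:
#   SELECT A.id, B.value
#   FROM A JOIN B ON A.id = B.id
#   WHERE B.value >= 100;
#
# A logical plan is just a tree of operators; tuples flow from the
# leaf scans up toward the root. Nothing here says *how* each operator
# runs (e.g., which join algorithm or which scan type to use).
logical_plan = \
    ("Projection", ["A.id", "B.value"],
        ("Join", "A.id = B.id",
            ("SeqScan", "A"),                # no filter: feeds the join directly
            ("Filter", "B.value >= 100",     # drop any values less than a hundred
                ("SeqScan", "B"))))
```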
So what I'm showing here is what we call a logical plan, meaning I'm not saying anything about what algorithm we're going to use to implement all these different operators. I'm just saying, at a high level, this is what I want to do. It's almost like a direct translation of the relational algebra. I want to do a join; I'm not telling you how to do the join, I just want one on A and B. I want to get tuples from A; I didn't tell you whether to do a sequential scan or an index scan, I'm just saying get tuples from A. So what we're focusing on today is what these algorithms actually are, and then we'll put it all together when we talk about query planning and query optimization, where we need to make a decision: here are the different choices of algorithms or different access methods I could use for my query, which one is going to be the best for me? So today's lecture, and Wednesday's lecture as well, is really focusing on the different algorithms we can implement for the physical operators in our query plan. But at a high level we're assuming it's in this tree structure where we're moving tuples from one operator to the next. So is this clear? This is essentially what any query engine is going to do: represent the query as a data flow tree and move tuples between the operators.

The tricky thing for us, now that we start deciding how we actually implement the algorithms for these operators, is that again, this entire semester is focusing on a system that assumes that data doesn't fit entirely in memory. So in our disk-oriented database system, just like we can't assume that tables or indexes fit entirely in memory, we now need to worry that the intermediate results between those operators may not fit entirely in main memory either. So when we design the algorithms we're going to use to execute the operators, we need to pick ones that know how to write data out to disk, and be mindful that we may have to read and write data from disk, and therefore make certain design decisions in those algorithms to accommodate that. As a quick example, if I'm doing a join here on A and B, then depending on what join algorithm I use, the hash table may have to spill to disk: A doesn't fit in memory, B doesn't fit in memory, so I need a join algorithm that can handle inputs where the entire dataset may not fit in memory. Furthermore, the output also may not fit entirely in main memory. So again, we're going to use the buffer pool manager that we built in the first project and talked about before; that's how we're going to accommodate algorithms that need more memory than is actually available. So we're not just using it for tables and indexes; we can use it for intermediate results as well. And this goes back to why I was saying before that if the database system manages memory itself instead of letting the OS do it, it can know: oh, this page is for an ephemeral data structure for whatever my query is doing, and I'm going to throw it away immediately after the query is over, so maybe I want different replacement policies or different strategies for those kinds of pages versus the data coming from the underlying tables.
Whereas the OS doesn't see that; the OS doesn't know anything about what's inside these pages or how they're being used. So again, we're going to use our buffer pool manager to be able to spill data to disk, and therefore we're going to design or prefer algorithms that maximize the amount of sequential IO that we can do. And this is going to be different from any algorithms course you've taken before, where you assume that you're just reading and writing to memory and everything has uniform access cost; now we need to be mindful of what's actually in memory when we design these approaches.

So we're going to first talk about the external merge sort algorithm, and out of this discussion you will see some high-level strategies for essentially doing divide and conquer that we can then apply to other operators we want to implement. Then we'll finish up talking about how to do aggregations, which can rely on the sorting algorithm, but it also segues into the hash join material we'll talk about next week. So there's this trade-off between sorting versus hashing as the two different methods you can use to execute algorithms in your database system. We're going to first talk about sorting and then we'll add in hashing at the end. Okay?

All right, so it's sort of obvious why we need to sort, but let's make sure everyone puts this in the correct context. In the relational model, the tuples in our relations are inherently unsorted, right? It's set algebra; there's no sort order, so we can't assume that the data, as we read it, is going to be in any particular order. Now, there are clustering indexes, which we talked about before and will talk about again today, which enforce a sort order based on some index. But in general we can't assume that's always going to be the case, and furthermore, our table could be clustered on one particular key but now we need to sort on another key, so being pre-sorted doesn't actually help us in that scenario. So besides the case where someone writes an ORDER BY and we want to sort the output, if our data is sorted there are a bunch of other optimizations we can do for other utility things or queries we want to execute in our database system. If our table or our output keys are sorted, then it's really easy to do duplicate elimination, because now I just scan through the table once, and if I see that the thing I'm looking at is the same as the last thing I looked at, I know it's a duplicate and I can just throw it away. For group bys it's the same thing: if everything's pre-sorted, then I can generate the aggregations by scanning through the table once and computing running totals as needed. And then we talked about the optimization of doing bulk loading in a B+tree, where you pre-sort all the data along the leaf nodes and then build the index from the bottom up rather than top down, and that's way more efficient. So again, sorting is a useful utility operation we're going to need in our database system, but we need to be able to accommodate the case where the data doesn't fit entirely in memory. Because if it fits in memory, then we just pick your favorite sorting algorithm that you know and love from your intro classes, and that works just fine for us.
Quicksort, heapsort, if you're crazy, bubble sort, right? We don't care; it's in memory, so all the things we learned in intro to algorithms class work just fine. But again, the issue is that if the data doesn't fit in memory, quicksort is going to be terrible for us. Because what is quicksort doing? Quicksort picks pivots and jumps around to different locations in memory. That's random IO in our world, because the pages we're jumping into may not actually be in memory, and in the worst case scenario we're incurring one IO for every swap we make in the dataset. So instead we want an algorithm that is mindful of the potential costs of reading and writing data to disk, and therefore makes certain design decisions that try to maximize the amount of sequential IO. Because sequential IO, even on faster SSDs, is going to be more efficient than random IO: there's no seek on an SSD, but within a single read or write operation down to the device, you can get a lot more data coming back, all right?

So the algorithm we're going to use, and at a high level this is what every single database that supports out-of-memory sorting does, is called the external merge sort. Sometimes you'll see "external sort merge," and it can be really confusing because there's the external merge sort algorithm and then later we'll see a sort-merge join, which could use the merge sort algorithm, right? But I'll try to make it clearer what we're actually doing when we talk about joins. So as I said, this is a divide and conquer approach where we're basically going to split the dataset we want to sort up into smaller chunks called runs, and then we're going to sort those runs individually. All the keys within a given run are sorted, and the runs are disjoint subsets of the entire key set we want to sort. Then we're going to start combining these sorted runs to create larger sorted runs, and we keep doing this over and over until we get the full key set that we want sorted. So there are two phases for this. In the first phase, we take as many pages as we can fit in memory, sort them, and write them back out to disk. Then in the second phase, we combine these sorted runs into larger sorted runs and write them out, and we keep doing this over and over until we have the entire thing sorted. So this may end up taking multiple passes through the dataset we're trying to sort, but in the end we end up with a complete sorted run.

So let's start with a simple example called a two-way merge sort. All right, so in two-way, the number two is the number of sorted runs we're going to merge together in every single pass. Within a pass, we're going to grab two runs, merge them together, and produce a new run that's the combination of the two smaller ones that were our input. Our dataset is going to be broken up into N pages, and what's now important for us when we consider how our algorithm is going to work is that we need to know how much memory is available to us to buffer things in memory to do our sorting. Because again, if everything fits in memory, then we don't need to do any of this; we just do quicksort. But we need to be told ahead of time how much memory we're allowed to use for sorting. And this is actually something you can configure in database systems.
So in Postgres it's called working memory: you basically say how much memory one particular query is allowed to use for whatever kind of intermediate operation it wants to do, building a hash table, doing sorting, and other things like that. So we're always told this number B, the buffer pages available to us, ahead of time.

So let's look at a visual example. In pass zero, you're going to read B pages at a time from the table into memory, sort them in place in memory, and then write them out. So here's a really simple example: on disk I have my dataset, which is two pages. Let's say in this case I can bring in the first page, sort it in place, and now that's a sorted run, and then I write that sorted run out to disk. And I'm going to do this one at a time, assuming I have a single thread, for all my other pages, right? So this is step one, and then over here I read the other page, bring it into memory, sort it, and write it out. So that's the end of pass zero: I grab a run that's the size of B pages, I sort it in place in memory, because I have B pages I'm allowed to use, and then I write it back out, and I go to the next run once that's done.

Now, in the subsequent passes, we're going to do recursive merges of all the runs we've sorted so far, and combine them together to produce runs that are double the size of the inputs, right? And for this approach, I need at least three buffer pages: two buffer pages for the two runs I'm bringing into memory, and another buffer page for the output that I'm writing out. So in this case, say I want to merge these two sorted runs. I bring those into memory, and then I have one other page where I can write out the combination of these two guys. But the combination is two pages long, and I only have one output page. So I'm just going to scan through each of the runs and compare the values one by one to see which one is less than the other (or greater, depending on what order you're going), and I write that out into a sorted page like this. Once that page is full, I write it out to disk, and my merging continues. Think of it as two cursors scanning through these runs, comparing values one by one; I continue down through both inputs, writing each output page out as it fills, and once I reach the end, I'm done. So is this clear? Yes?

Question: if memory can hold three pages, why force the two initial pages through that first sort-and-write step separately? So the question is, if memory can hold three pages in this case, why not skip what I just did here first? Yeah, in this simple example, you could think of it as: I could bring the two unsorted pages into memory, sort them in place, and then do the combination without having to write them back out to disk first, yes. But in general, you can't do that; this is a trivial example. And this is also a bit of an oversimplification, because think of these as the data pages of the table, with tuples in them. You really can't do this in-place sorting like that, because that would be modifying the actual table itself, and you don't want to do that.
So in general, you're not going to do this one step here where I was sorting pages in place; you'd make another copy and then write that out. And in that case the shortcut wouldn't work, because you would need four pages in memory. So again, this is a simplification, but let's actually work through the math and see what would happen.

So let's go through a more fine-grained example. The way the math works out, the number of passes we need for this two-way merge sort is 1 + ⌈log₂ N⌉. The 1 is for the first pass, pass zero, and the log₂ N term is because in each pass you keep combining runs into larger and larger runs until you reach the total size of the dataset; in the last pass, the two runs you're merging will each be at most half the size of the total dataset. And the total IO cost of the external merge sort is 2N times the number of passes. It's 2N because in every pass I always have to read every page in and write it back out, right? One read in, one write out per pass. Which also means that in every pass, I'm reading and writing every record, every key that I'm trying to sort, exactly once: once in and once back out.

So let's look at an example like this. We have a bunch of pages, and in each page we have two keys, and then there's a little marker to say here's the end of file. In the first pass, we just read in every page, sort it, and write it back out. We're not examining data across different pages. These are actually runs, but each is a one-page run. Then in the next pass, I grab two sorted runs that are next to each other, bring them into memory, sort them globally across the two pages, and write those out. So the output of the second pass will be runs of size two pages. And I keep this going: in the next pass I have four-page runs, and in the last one I have an eight-page run. And at that point I'm done, because now my output run is the total size of the key set that I have.

Okay, so in what I've shown you so far for the two-way merge sort, as I said, it only requires three buffer pages: two for the input, one for the output. So going back here, when I was creating a run that's two pages, I used this example; now I'm creating a run that has four pages, but I still only have three pages in memory. So I'll have one for the left side, one for the right side, and one for the output. Again, think of it as having a cursor: I'm scanning through each of the pages on the two sides and comparing to see whether one is greater than the other. Whichever one is less is what I write out to my output, and then I move that cursor down and do the same comparison, right? I keep going step by step until the cursors reach the ends of both of them. Yes?

The question is, what am I showing here, are these numbers tuples? I'm showing a simplification: they're keys, but in actuality, in a real system, you would have the key you're trying to sort on and then the record ID of where it came from. Yes. In this example here, each square is a page, right? There are only two keys in a page, for simplicity, yeah. Okay. So again, for this example, we only need three buffer pages.
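Here's a minimal sketch of that merge step, assuming runs are represented as lists of sorted pages, that write_page() stands in for flushing a full output buffer to disk, and that a page holds two keys to match the example above. The helper names are made up for illustration.

```python
PAGE_SIZE = 2  # keys per page, matching the example above

def merge_two_runs(run_a, run_b, write_page):
    """Merge two sorted runs (lists of sorted pages) using three buffer
    pages: one per input run plus one output page."""
    out = []
    ia = ib = 0          # which page of each run is currently "in memory"
    ka = kb = 0          # cursor position within the current page
    while ia < len(run_a) and ib < len(run_b):
        a, b = run_a[ia][ka], run_b[ib][kb]
        if a <= b:
            out.append(a); ka += 1
        else:
            out.append(b); kb += 1
        if ka == len(run_a[ia]): ia += 1; ka = 0   # "fetch" next input page
        if kb == len(run_b[ib]): ib += 1; kb = 0
        if len(out) == PAGE_SIZE:                  # output buffer full
            write_page(out); out = []
    # drain whichever run still has keys left
    for run, i, k in ((run_a, ia, ka), (run_b, ib, kb)):
        while i < len(run):
            out.append(run[i][k]); k += 1
            if k == len(run[i]): i += 1; k = 0
            if len(out) == PAGE_SIZE:
                write_page(out); out = []
    if out:
        write_page(out)

pages = []
merge_two_runs([[1, 3], [5, 7]], [[2, 4], [6, 8]], pages.append)
print(pages)  # [[1, 2], [3, 4], [5, 6], [7, 8]]
```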
But the problem is, say I give you more pages; with what I'm showing so far, you're not actually going to get any better, right? Why? He's right, the IO remains the same. Because what am I doing? I'm fetching two pages, and then I have this cursor that's going to walk through them, and whichever one has the lower key gets written out to the third page. Having more pages in memory doesn't really help us, because I can only do the comparison two pages at a time.

So a really simple optimization to minimize this IO cost is to do prefetching. This technique is called double buffering. The idea is that when you start merging, say, two pages, you have a bunch of shadow pages or shadow buffers where you start fetching the next runs or next pages you're going to need. It requires you to do asynchronous IO, to have something in the background go fetch the next pages you're going to need, so that when the cursor reaches the end of the current page it's operating on, the next page it needs is already there. If it's single-threaded and everything's synchronous, then you have this ping-pong effect where I'm CPU-bound, then disk-bound, then CPU-bound, then disk-bound: I bring the page in, I wait for that, then I do my sorting, that's all CPU, then when I'm done doing the merging I have to go get the next page and I'm blocked on that, right? So, really simple example here: I want to sort page one, and while I'm doing that, in the background I go fetch page two, so by the time I'm done sorting this one, that guy's ready to go for me.

Okay, so the two-way merge sort is the simplest way to think about this. We need to consider how this works with a general N-way or K-way sort. Yes, sorry. Question: how can you compute the sort and do the loading at the same time? So the question is, how do you do this optimization, having the thread sort while in the background it's fetching data? Well, this is actually where the operating system can help us, right? You make a request to the operating system, go read this for us, and that's called asynchronous IO: in the background it goes and fetches the data we need, we tell it where to put it, and that way our thread can do all the computation as well. Actually, the database system can do this itself; you don't need the OS. In a real system you would have an IO dispatcher thread. You say, I want this request, give me this page, and here's a callback to tell me when it's actually ready, and then it goes and does that. Your thread can do whatever computation it wants, and when the IO is done, the page is available for you. So I have to have two threads? Yeah, you have to have two threads, yeah.

Okay, so let's quickly go over how to generalize this algorithm beyond the standard two-way sort. With the general K-way sort, it's still the same: we're going to use B buffer pages, and in the first pass we're going to produce ⌈N / B⌉ sorted runs of size B, because that's where we're doing the in-place sorting. And then in the subsequent passes, we're going to merge B minus one runs at a time, and it's always minus one because we always need one buffer for the output.
Having additional output buffers doesn't help us, because with one thread you can really only write to one output buffer at a time. So that's why it's B minus one. The way the math works out is just an extension of what we showed before: instead of log base 2 of N, it's now log base B minus one, and what's inside is the ceiling of N divided by B. So the number of passes is 1 + ⌈log_(B-1)(⌈N / B⌉)⌉, and the total IO cost is still 2N times the number of passes. This is very plug-and-chuggy; you'll see this when you do the homework, right? You fill in the Bs, fill in the Ns, and the numbers work out.

So let's walk through a really quick example. We want to sort 108 pages with five buffer pages we can use: N equals 108, B equals 5. So in the first pass, we compute how many runs we're going to generate: the ceiling of 108 divided by 5, which gives 22 sorted runs of five pages each, where the last run is only three pages. That's why you have to take the ceiling, right? Because you don't want a fractional count. Then in the subsequent passes, you take the number of runs you generated in the previous pass and divide it by the number of runs you merge at a time; before, with two-way, that was two, now it's four. So this generates six sorted runs of 20 pages each, where the last run is only eight pages. And you keep applying this over and over again until you reach the very end, where you have one run that's exactly the same size as the original dataset. Yes?

Question: are we assuming here that the sorts are done in place? For the first pass, yes. For the subsequent passes, no. This is what the textbook does, but in a real system you wouldn't do that, because it depends on where you're reading the data from: if it's coming directly from the table itself, then you can't modify that; if it's coming from another operator, then you can. Yes. Why does it have to be minus one? Does the minus one mean the output? Yeah, the minus one is because you always have to keep one buffer page for the output. Yeah. Question: does this still work if I can only use, like, three buffer pages? Yeah, in the minimal case it's three, right? Two for the input, one for the output. Yeah.

Question: why am I using five here? Because earlier it sounded like having five buffers wasn't going to make it that much faster. Yeah, sorry, let me clarify. Before, I was merging two runs at a time; this is now merging multiple runs at a time. So with B equals five, I've got five buffer pages, so I have B minus one, four sorted runs, that I'm going to try to merge at the same time. And again, all I need to do is have a cursor sit at each one, walk through them one by one, do a comparison across all of them to see which one is the smallest, and write that out. And then... Now we have, like, four cursors? Now we have four cursors and one output, yes. Yeah, I should visualize that, sorry. Again, this is why I record these, so I'll remember to do that next year. All right, this is clear. Okay. So that's external merge sort.
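Since the cost model is plug-and-chug, here's a tiny sketch that does the arithmetic; it iterates the merge passes rather than evaluating the logarithm directly, and it reproduces the N = 108, B = 5 example from above.

```python
import math

def external_sort_cost(n_pages, b_buffers):
    """# passes = 1 + ceil(log_{B-1}(ceil(N / B))); total IO = 2N * passes."""
    runs = math.ceil(n_pages / b_buffers)      # pass 0: ceil(N/B) sorted runs
    passes = 1
    while runs > 1:                            # each merge pass combines B-1 runs
        runs = math.ceil(runs / (b_buffers - 1))
        passes += 1
    return passes, 2 * n_pages * passes

print(external_sort_cost(108, 5))  # (4, 864): runs go 22 -> 6 -> 2 -> 1
```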
That said, the exact details of how you actually implement this will vary from system to system. There are obviously some other optimizations you can think about, like hints that say: I know the min value and the max value of this sorted run, and the min and max values of this other sorted run; so if the min value of one is greater than the max value of the other, I don't need to do the merge, I just append one on top of the other, right? So there are some optimizations you can do like that. But in general, what I've shown here today works well assuming a uniform distribution of the values, where there's no locality in the data. If it's just completely random, this works fine. If you know something more about your data, like it's skewed in a certain way, then you can apply some simple techniques to speed this up. But what I've shown today is the pickup truck version of it, the general-purpose version.

So instead of just doing the brute-force merge sort that I showed, there may be some cases where we can actually use the B+tree to speed up our sorting operation. In general, the sorting and join algorithms are actually the most expensive things to do. Always. So if there's any way we can speed these things up, it's always going to be a good choice for us. So what is the B+tree essentially doing? Well, it's maintaining a sort order for our keys in the data structure. We're paying the penalty to maintain and update it, doing splits and merges as needed on our B+tree as the table gets modified. But now we can potentially piggyback off the work we've already done, and speed up our sorting by not having to do any sorting at all. So if the keys we want to sort on are the same keys that our B+tree is indexed on, then we can potentially just reuse the B+tree and not go through that whole external merge sort with multiple passes that I just showed. But it only works if we have a clustered B+tree. Again, we showed an example of that, I think, two classes ago, but now I want to show a more visual example of what's actually going on, and you'll see why this makes sense for sorting but not for the unclustered one.

So a clustered index just means that the physical location of the tuples on our pages will match the sort order defined in the index. If I have an index on key foo, then the tuples will be laid out across my pages based on that order of foo. So now if I want to do a sort on that key, I don't need to do a merge sort at all, because all I need to do is get down to the leaf pages of my B+tree, because the sort order of the key will match the sort order of how the data is stored. I don't need any extra computation to sort it; it's already sorted for me. So this is another example where the query planner, the query optimizer, which we'll talk about in the class before the midterm, can figure out: oh, you want to do a sort on this key, I already have a clustered index on that key, let me go use that to generate the correct sort order and not even bother running the external merge sort.
But as we saw in the case of Postgres, they don't enforce this: you tell it you want the table clustered on an index or on a given key, but they're not going to maintain that sort order afterward, whereas other systems will. So now, if you have an unclustered index, an unclustered B+tree, this is actually the worst possible thing to use for trying to generate a sort order. Take a guess why; it should be obvious. One IO per record? What's that? You have to do one IO per record. Great, you have one IO per record. So again, I traverse the index, get down to the leaf nodes of the tree, and I want to scan across, because that's how my keys are sorted. But the data has no connection to how it's sorted in the index. So for every single record I have to go get and emit as my output, I may be doing another disk IO, because the page I need is not in memory. I go to disk, get it, bring it into my buffer pool, and then the very next key I look at is in another page, so I evict the one I just brought in and bring in the next one. Yes?

The question is, what does it mean for the tree to be unclustered; does the table have to be clustered on something, like another index? You wouldn't say the tree is clustered; the table's clustered. And again, I wouldn't call it that. If I were coming up with these terms, I would call it a sorted table. For whatever reason they call it a clustered table, because tuples that are similar to each other are clustered together on the page. So again, back here, the sort order of how the tuples are actually stored matches the sort order of the key; this would be a clustered index. In this other case, if you just call CREATE INDEX, this is what you normally get: where the records are actually stored has no relation to how they're sorted in the index.

So what's the information in the tree? The question is, what is the information in the tree? It's just the B+tree we talked about before, right? CREATE INDEX on key foo. To build that index, I do a sequential scan, look at every single tuple, get its value of foo, and insert that into my tree. The key-value pair is the value of foo, and then the value is the record ID, the pointer to the tuple. This is different from the index-organized tables we talked about, where the tuple pages are actually in the leaf nodes themselves. In that case, it is a clustered index, but it's also an index-organized table. If you're not storing the data in the leaf nodes themselves, if it's disconnected, then it's either clustered or unclustered. Yes?

So his statement is: instead of immediately fetching every key as I encounter it, what if I collect all the keys I need and their record IDs, and then combine the lookups, so that I get all the ones from page 101 first, then all the ones from page 102? Yes, we'll talk about that next week, for scans. That's a common optimization, but it assumes that you can fit the key set. There are also some algorithms where you actually want to start producing output sooner rather than later, and this is all or nothing. What I've shown so far is: this operator asked me to get this data in sorted order, so I'm going to get it all now, and I don't move on to the next operator until I have everything.
There are some streaming operators where you want to start streaming data out as you get it, because you'd rather have it sooner rather than later, because there are other optimizations you can do up in the tree. In that case, your batch approach won't work in that environment. But that is a common optimization; we'll see this next week. All right, so again, the main takeaway is: if it's a clustered index and the query needs the data sorted on the key that the index is built on, then you just use the clustered index. If it's not a clustered index, then you almost never want to use it.

All right, so that's basically it for sorting. Let's talk about some other operations. In particular, we're now going to focus on aggregations, because aggregations are a good example of a type of operator where we can make a choice between sorting versus hashing for our algorithm, with different trade-offs and different performance characteristics, because one is essentially trying to do a lot of sequential access and the other is doing random access. So there might be certain scenarios where one is better than the other. As a spoiler, what I'll say is that, although it's not always the case, no matter how fast the disk is, oftentimes the hashing approach will work better. And we'll see an example of how we can make the hashing aggregation do more sequential IO rather than random IO. If you take the advanced class in the spring, this is another big thing too: hashing is always super fast because everything's in memory.

All right, so how would you use sorting to do an aggregation? Well, what is an aggregation doing? You're basically taking a bunch of values and coalescing them to produce a single scalar value. With sorting, the nice thing is that because the data is sorted, when we take a pass through the sorted output, we don't have to backtrack to compute our aggregation. We only need one pass to compute whatever answer we want.

So here's a really simple query. We're doing a scan on the enrolled table: there are a bunch of students enrolled in the database classes at CMU, and we want to get all the distinct course IDs from any class where a student got either a B or a C, and we want the output sorted on the course ID. So the very first thing we're going to do in our query plan tree is the filter: we filter out all the tuples where the grade is not B or C. The next step is to remove all the columns we don't need in our output. We only need the course ID, for the ORDER BY and for the DISTINCT clause, because our filter is what accessed the grade column. At that point, we know our query plan never needs to look at the grade column again, and we don't need the student ID anymore either. So we can strip all that out before we move on to the next operator. And then we finish off by sorting on the course ID column. And because we're doing a DISTINCT clause, we want to remove any duplicate values, so all we need to do is have our cursor scan through the sorted output, and any time it finds a value that's the same as the one it just looked at, it knows it can throw it away. It strips that out, and that's our final output.
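As a rough sketch of that whole pipeline, assuming the query looks something like SELECT DISTINCT cid FROM enrolled WHERE grade IN ('B','C') ORDER BY cid (the table and column names are my guesses; the slide's exact SQL isn't in the transcript):

```python
def sorted_distinct(enrolled):
    """Filter -> project -> sort -> one-pass duplicate elimination.
    `enrolled` is an iterable of (sid, cid, grade) tuples."""
    filtered = (row for row in enrolled if row[2] in ("B", "C"))  # filter early
    projected = (row[1] for row in filtered)                      # keep only cid
    out, last = [], object()          # sentinel meaning "no previous key yet"
    for cid in sorted(projected):     # stand-in for the external merge sort
        if cid != last:               # same as previous key => duplicate, drop it
            out.append(cid)
            last = cid
    return out

print(sorted_distinct([(1, "15-445", "B"), (2, "15-445", "C"),
                       (3, "15-826", "A"), (4, "15-721", "C")]))
# ['15-445', '15-721']
```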
We'll go into this more next week when we talk about query planning, but one obvious thing in this pipeline was that I tried to strip out as much useless data as possible sooner in my pipeline rather than later. The very first thing I did was the filter. Say this table had a billion records in it, but only four or five of them match my predicate: rather than sorting a billion records first and then filtering, it's better for me to filter first and then move the data on to the next operators. Same thing for the projection. This is a row store, not a column store, in my example. So in order to get the data I need to do my sorting, I have to fetch the entire record, because it's packed together in a single page. But if I do a projection, I can strip out all the columns or attributes I don't need, and then when I'm doing my sorting I'm not copying around a bunch of extra data. And related to her earlier question: what am I actually passing around? It could be the record ID, or it could be the entire tuple itself, depending on how I want to materialize things. So the projection here lets me throw away columns I don't need, so when I'm doing my sorting, I'm only copying things related to what's needed for the rest of the query plan. Yes?

Question: for columns of an enum type, like grades, that can only take some fixed set of values, can we maintain counters, like a minimum and maximum tuple number, that we update every time we store a record with that enum value, and use those numbers when we're filtering? So, his question is about the grade column having a fixed domain, meaning it's A, B, C, D, or E. I don't think CMU does E's, does it? You have incompletes, right? But it's fixed. Oh, there's the S one too, whatever. The problem is, when I go input your grades, I can't tell whether you're an undergrad or a graduate student. So I'm like, oh, this student did awesome, they get an A plus, but then it throws an error because they're an undergrad, and undergrads can't get A pluses. Unless you're ECE, where I think you can; it's a nightmare. Anyway, so the question is: couldn't I have some kind of side table with a tally, so every single time I insert a tuple with one of these values, if I'm trying to maintain a count, I just increment that counter by one? No, something like the maximum record ID and minimum record ID, so that when you're filtering you know which is the first ID and which is the last ID, so you can fetch only those pages. Within a page? No, like if you have 10,000 pages and you know that the C grade is only in the first four pages, then you can bring in only those four pages.

Okay, now I understand. So what he's saying is: say this little example here is stored in one page, and for the grade column I keep track of the min and max value, in this case B and C. Now if I'm looking for all people that have the grade A, when I get to that page, I look and say: oh, it's only between B and C, nobody has an A in that page, so I don't even bother looking at it. That's what you're saying, right?
Okay, I think we're talking about the same thing. What you're describing are called zone maps, right? We will talk about this, I think next week or the week after, I forget when. Basically, you're talking about an auxiliary data structure on the side that you look at first, and then you decide whether to check the page. Yes, that's a zone map. It could be in the same page or in a separate page, but it's basically pre-computed information that says: here's the range of data that could possibly exist for each attribute. You refer to that first and make a decision about whether you need to go further. So those are called zone maps; Oracle calls them zone maps, and they're also known as pre-computed or materialized aggregates. Different systems do different things, but that does exist; we'll cover it later. Yes?

I think the idea was that if I'm just looking for the grade C, I don't have to look at all the tuples; I can look at a specific range of tuples that I know match. That's an index, right? That's what an index does. He's talking about something that's not an index, I think, right? That's a zone map; what you're describing is an index. And again, the beauty of a declarative language like SQL is that I write my SQL query like this, and I don't know whether I'm using zone maps, I don't know whether I'm using an index, I don't care. The database system will figure out the best strategy to go find the data that I want. It's essentially just trying to remove as much useless crap as quickly as possible. That's the whole goal of all of this. All right, so that was a tangent about zone maps; we'll cover them later. The main takeaway from this was: if the data is sorted, I do one pass and I can eliminate the duplicates, all right?

In this example here, this worked out great for us because the output needed to be sorted on the course ID anyway. So it was two for one: I did my sorting, because that's what I needed to eliminate duplicates, and I also ended up in the sort order I needed for my output. So in this case, doing a sort-based aggregation is a definite win. But in many cases, we don't actually need the output to be sorted, right? You can still use sorting for group bys and distinct and things like that, but if the output doesn't need to be sorted, then this might actually be more expensive, because the sorting process itself is not cheap. So this is where hashing can help us. Hashing gives us another divide and conquer approach, where we can split up the dataset and guide the tuples or keys we're examining to particular pages, and then do our processing in memory on those pages. But hashing removes all locality, any sort ordering, because it takes a key, applies some hash function to it, and now it jumps to some seemingly random location. So this works great if we don't need things to be ordered.

So the way we do a hashing aggregate is we populate an ephemeral hash table as the database system scans the table, or scans whatever our input is. When we do our lookup, depending on what kind of aggregation we're doing: if we do an insert and the key is not there, then we add it; if it is there, then we may want to modify it, or modify its value, to compute whatever aggregation it is that we want.
For distinct, I just hash the key and see whether it's in there; if it is, then I know it's a duplicate, so I don't bother inserting it. For group by queries and the other aggregations, you may have to update a running total, and we'll see an example of this. So this approach is fantastic if everything fits in memory.

The key thing I'm saying up above is that it's an ephemeral hash table, not just an in-memory hash table. Ephemeral, or transient, means that this is a hash table I'm going to build for my one query, and when that query is done, I throw it all away, and I do this for every single query. We said at the very beginning that we use data structures in different ways in the database system; this is the example of a transient data structure. I need it for just my one query, I do whatever I want, then I throw it away. So if everything's in memory, the hash table is fantastic, because it's O(1) lookups to go update things. In this case I'm also not doing deletes; I'm just inserting things or updating things. If we need to spill to disk, though, now we're screwed, because that randomness is going to hurt us: I'm jumping around to different pages or blocks in my hash table, and each one could incur an IO. So we want to be a bit smarter about this and try to maximize the amount of work we can do for every single page we bring into memory.

So this is what an external hashing aggregate does, and at a high level it's the same technique we used for external merge sort: a divide and conquer approach. In the first phase, we take a pass through our data and partition it into buckets, so that all the tuples that have the same key will land in the same partition. Then in the second phase, we go back through, and for each partition we build an in-memory hash table that we can use to do whatever aggregation it is that we want. We produce our final output, throw that in-memory hash table away, and move on to the next partition. So again, we're maximizing the amount of sequential IO: for every single page we bring into memory, we do all the work we need to do on that page before we move on to the next ones. We never have to backtrack.

So let's go through these two phases. In the first phase, what we're trying to do is split the tuples up into partitions that we can write out to disk as needed. We're going to use our first hash function just to split things up. And again, use MurmurHash, CityHash, xxHash3, whatever, it doesn't matter. The reason this works is that our hash function is deterministic, meaning the same key will always be given the same hash value output, which means that tuples that have the same key will land in the same partition. We never need to hunt around other parts of the table to find the same key; it's always going to be in the one partition. And our partitions can spill to disk using the buffer pool manager when they get full. We have a page that we're storing the current partition's data in; when that gets full, we just write it out to disk and start filling in the next page.
In this case, we're going to assume we have B buffers, and we're going to use B minus one buffers for the partitions and one buffer for the input. So I bring in one page from my table, do a sequential scan on that page, look at every single tuple, and write it out to one of the B minus one partitions. Right, because you need to have at least one buffer in memory for each partition. Yes?

Can you explain what you mean by matches? So, say I'm doing a group by on the course ID. Actually, here, next slide. I'm doing a group by on the course ID, I'm doing an aggregation. So I hash the course ID for every single tuple. If two tuples have the same course ID, they're going to land in the same partition. So they're going to live there, right? Reside, live, be stored. And that way, when I come back the second time to do, in this case, the duplicate elimination, I know that the tuples that have the same key have to be in the same partition. They're not going to be in some other random place. Is a partition a page? The question is, is a partition a page? No, a partition is sort of a logical thing. You take the hash value, mod it by the number of partitions, and that's where you write into. And each partition can have multiple pages.

Right, so again, we do our filter as we did before and project out the columns we don't need. Then we take all the output of that, run it through our hash function, and write it out to the partition pages. In this case there would be B minus one partitions, so say four or five; I'm showing three here. All the 15-445 keys land here, all the 15-826 keys land here, and 15-721 lands here. Again, you could be smart about this and say: I know I'm doing distinct, so within my page, if I see the same value, don't bother putting it in. But for simplicity, we're just blindly putting everything in. Yes.

Is a partition a page? The question again is, what is a partition? Okay, think of a partition like a bucket chain in a chained hash table: within a chain, you can have multiple pages. But I only have one page per partition in memory as I'm populating this, because every time I hash something and insert it, I'm only inserting it into one page. And when that page gets full, it gets written out to disk, and I allocate another one that I start filling up. So while I'm doing this first phase, I only need B minus one pages in memory, because I have B minus one partitions.

What if the number of distinct course IDs is larger than the number of buffers? So the question is, what if there are more distinct course IDs than I have buffers for? You're fine, because you're hashing, right? You take the hash value and mod it by B minus one. In this example, I'm only showing three distinct keys, but say I have another class, 15-410. That could land in the same bucket as 15-445. I don't need a partition for every distinct key; the hashing allows them to go into the same one. Your face looks like you're confused by this. Right, so I have 15-410, I hash it, I mod it by B minus one, it lands in partition zero, and I just append it to this page, right? And the main thing is that 15-410 can't exist in any other partition, because the hash function guarantees it always points to this one.
If the current page for a partition overflows, I write it out to disk, allocate a new page, and start filling that up. So you mean you flush the page? Yeah, you flush the page, allocate a new one, yes. And again, at this phase, all we're doing is partitioning. I could be smart and say: I'm doing duplicate elimination, I already have 15-445 in here, I won't put it in again. Ignore that for now, right? I'm just blindly putting things into the pages and writing them out.

Question: after you flush that overflowing page, how do you identify which partition it belongs to? So the question is: it's getting written out to disk, so where am I storing the metadata that says partition zero has these pages? You have that in an in-memory data structure. You keep track of: here are the pages for partition zero, here are the pages for partition one, and so on. But that's small, right? That's nothing.

Now, are we not considering collisions? The question is, are we not considering collisions? We don't care at this point, right? I should maybe use another example than distinct, maybe that's fouling people up. If I'm doing, you know, a count, again, you could do that more efficiently as well. But I don't care about collisions while putting things in here, because I'm going to resolve them in the second phase when we rehash things.

The question is, where is this number coming from, B minus one? So that's the database telling this query, whatever thread or worker is executing it: you have this amount of memory to use for query processing. Okay, so if I say B is 100, we can't have more than 99 partitions? Yeah, the database system says you're allowed to have B equals 100 pages to do whatever you want to execute the query. To execute this algorithm, I'm going to use B minus one of them to store my B minus one partitions, because each partition gets one page in memory. What if B is configured too small? Yeah, it sucks. Your statement is, if B is really small, you're screwed. Yes. Right, there's nothing you can do; you can't magically add more memory. It's a finite resource. The database system is doing resource management: it decides, oh, I have a lot of queries I need to execute at the same time, so I can't let them all have a lot of memory. This gets into the tuning side of things, which is actually very difficult as well. Yes.

Okay, so he says, and I don't have slides for this, we'll do it next class, he said that you're screwed. Let me rephrase what he said: you're screwed, whatever, if everything hashes into one bucket. Say this is the most popular course on campus, everyone's taking 15-445, right? Then as I hash, everyone lands in the same partition, and I'm screwed, right? But again, this gets into the query planning side of things. The database system could look at this and say: I know what the distribution of values is for this column, and everyone's taking 15-445. So if I do this technique, I'm not going to get any benefit, because everything hashes to the same place and it's all wasted work; I might as well just do a sequential scan. The question is, do you always know that about the data? A good database system will know something. It won't be entirely accurate, but it'll know something.
Student: but because you've removed the other columns at this point, like the ID, you have only one column left, right? So it might happen that the values in that one column are skewed, even though the other columns would make a difference. So his statement is: with the full dataset things could be unskewed, but this one column is skewed. Again, this is next week, or two weeks out. The database system can maintain metadata about every single column: histograms, sketches; it can do an approximation of what the distribution of values looks like. Again, for skewed workloads, that's harder. You got a call? All right, he's got a call, yeah, all right, sorry. He's out on parole, that's why he's got a call with his parole officer. All right, so for simplicity, assume a uniform distribution, okay? For skewed workloads, again, past a certain point this technique won't work and the sequential scan will be the better approach. Yes?

Question: how much overhead is there for removing columns? So in this example, I'm showing this as discrete steps, filter and then projection; you can inline and combine these together. But this is another good example of a trade-off. If my table is massive and I know I don't need all the columns up above in the tree, then it's totally worth it to pay the penalty of doing this projection, even though you're essentially copying data. But if I only have one tuple, then maybe I delay the projection as late as possible, because it's just cheaper to do at the very end, right? The trade-off is how wide and how tall the table is. And again, the database system can figure this out, or at least attempt to.

Okay, so what we're doing in the first phase is taking the course ID, hashing it, and putting it into these pages for the partitions. Now, in the second phase, for every single partition, we bring its pages in and rehash them, and we build an in-memory hash table that we can use to find the matching keys. We don't strictly have to do this; we could just bring in every partition and do a sequential scan on it. But because we're doing aggregations, we know we don't need to have all of the duplicate keys in memory at the same time, so we use a hash table to summarize and condense it down to the bare minimum information we need to compute our result. And again, the reason we did the partitioning first is that when we go back in the second phase and do the rehashing, we know that all the keys that are the same exist in the same partition. So once we've gone through all the pages within that partition and computed whatever answer we want, we can throw that hash table away, or rather produce its contents as output, because we know the keys we've updated while going through that one partition will never be updated again by any other partition; the hashing guarantees that locality for us. All right, so back here, these are all the buckets we generated in the first phase. Let's say we can bring in these two pages and process these two partitions in memory at the same time. All we do is have a cursor scan through them, hash every single key, and populate the hash table, and I keep scanning down and do the same thing for everything else.
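Here's a compact sketch of both phases for the distinct case. It assumes everything fits in Python lists; in a real system each partition would be pages spilled through the buffer pool, and the hash functions would be something like MurmurHash with fixed seeds rather than Python's built-in hash().

```python
def external_hash_distinct(keys, num_partitions):
    """Phase 1: partition by hash so that equal keys land in the same
    partition (each partition would spill to disk as its page fills).
    Phase 2: rehash one partition at a time into an in-memory hash
    table, emit its keys, then throw the table away."""
    partitions = [[] for _ in range(num_partitions)]
    for key in keys:                          # phase 1: one sequential pass
        partitions[hash(key) % num_partitions].append(key)

    result = []
    for part in partitions:                   # phase 2: one partition at a time
        table = set()                         # the ephemeral in-memory hash table
        for key in part:
            table.add(key)                    # collisions resolved inside the table
        result.extend(table)                  # equal keys can't be in other partitions
        # `table` is discarded here before the next partition
    return result

courses = ["15-445", "15-826", "15-445", "15-721", "15-826"]
print(sorted(external_hash_distinct(courses, num_partitions=4)))
# ['15-445', '15-721', '15-826']
```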
And now I produce this as my final result. I realize it may be confusing that the final contents of this hash table are the same as that one; the main takeaway is that we throw this one away when we move on to the next partitions, while that one we keep around. DISTINCT is a little too simple, but I was trying to pick something that distills down the core ideas. So now we get this other partition. Again, we blow away the hash table from the first set of partitions, do the same thing, build an in-memory hash table for this one, and when it's done, we populate the output.

Yes? The question is, are we assuming we won't have collisions with the second hash function? You absolutely can have them. Her follow-up is, wouldn't you then be overwriting one entry with a different one if two keys hash to the same place? That's handled by the collision schemes we talked about in the hash table lectures: linear probing, cuckoo hashing, the Robin Hood stuff. That's all internal to the hash table. We're above it now. We're saying: you have a hash table, I can write key-value pairs into it, and it will store them for me. I don't know and don't really care at this level how it handles collisions. Again, DISTINCT is a really simple example, but going through this process is the main thing I want you to get.

His question is, how is this faster than sorting? For this particular query, probably not; it depends on the size of the data. Let me punt on that question until next week; it'll be clearer when we start seeing the different join algorithms.

Yes? His question is, why do we need the intermediate hash table at all when we could just write directly into the output? In this example, yes, you could. But I was trying to show that you build and populate an ephemeral hash table, and when you're done, you shove its contents into the output. For DISTINCT it's trivially simple; for other aggregations you need it, because again, the output may not fit in memory.

Yes? His question is whether we have to use the same hash function. So I should be clear: the second phase uses the same hash function with a different seed. His follow-up is, between phases, say we finish the first two buckets and move to the third, do we need to change the seed again? I don't think it matters. If you're writing into the same hash table, you absolutely have to use the same seed. If you're just going to merge the results in later, it doesn't matter; it won't affect the final result, because the output isn't ordered anyway.
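As a tiny sketch of what "same hash function, different seed" means: SeededHash below is an assumed utility in the style of a MurmurHash or xxHash call, not a specific library API, and the seed values are arbitrary.

    #include <cstddef>
    #include <cstdint>

    uint64_t SeededHash(int64_t key, uint64_t seed);  // assumed seeded hash utility

    // h1 decides which partition a key lands in during phase 1.
    size_t PartitionOf(int64_t key, size_t num_partitions) {
      return SeededHash(key, /*seed=*/15445) % num_partitions;
    }

    // h2 decides the slot inside the in-memory table during phase 2.
    // Keys in one partition all share the same value of h1 mod (B-1),
    // so h2 should be independent of h1 to spread them out uniformly.
    size_t SlotOf(int64_t key, size_t table_size) {
      return SeededHash(key, /*seed=*/64045) % table_size;
    }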
Yes? His question is, is the final result a hash table or a simple list? The final result of an operator is always going to be a relation. It could be materialized as a hash table, or it could be just a buffer of pages; it depends on the implementation. I realize the slides use the same shapes for both. Sorry about that.

All right, to finish up, let's do something more complicated. Let's do real aggregations, where you're actually producing a computed result. For this one, the intermediate hash table that we use in the second phase is actually going to maintain the running total of whatever computation our aggregate function is doing. The running value depends on the aggregation you're trying to compute. So going back to our example: for each of these tuples, I take the course id and compute the average GPA. In the hash table, each key maps to a tuple value that keeps the running count of the number of tuples I've seen with that key, and the running sum of their GPAs. Then, when I want to produce the final output, I take the running sum divided by the count, and that's how I get my average.

For the different aggregation functions, in general you just keep track of a single scalar value. For a count, you add one every time you see a tuple with that key. For a sum, you keep adding the values together. For the average, you compute it from the count and the sum. Standard deviation and some other aggregation functions require you to maintain a little more information. So when we want to update the hash table, it's an insert-or-update: if the key isn't there, we just add it; if it is there, we modify the value in place, or do a delete followed by an insert. Is this clear? And if you were doing this with sorting, you could do the same thing: keep these running values on the side, and as you scan through the final sorted output, update the totals and produce the final output.
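As a sketch of those running values, here is what the second-phase table could look like for AVG(gpa) grouped by course id. std::unordered_map stands in for the hash table, and the names are made up for illustration.

    #include <cstdint>
    #include <unordered_map>

    struct RunningAvg {
      int64_t count = 0;  // number of tuples seen with this key
      double sum = 0.0;   // running total of the GPAs
    };

    // Insert-or-update: operator[] inserts a default RunningAvg if the key
    // is absent; otherwise we modify the existing entry in place.
    void Accumulate(std::unordered_map<int32_t, RunningAvg> &ht,
                    int32_t course_id, double gpa) {
      RunningAvg &rv = ht[course_id];
      rv.count += 1;
      rv.sum += gpa;
    }

    // Finalize: AVG = sum / count, computed once per group at output time.
    double FinalAverage(const RunningAvg &rv) {
      return rv.sum / static_cast<double>(rv.count);
    }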
I'm going to skip this slide for now; it'll make more sense next week when we do hash joins. A hash join is essentially going to do the same thing: build an ephemeral hash table on the keys we want to join on, then probe into it to see whether we have a match, and produce the operator's final output. So let's skip this and discuss it again next Wednesday when we do hash joins, okay?

All right, so in conclusion: what I showed today are the trade-offs between sorting and hashing, and we'll go into more detail about which one is better than the other when we talk about joins next week. The high-level techniques we covered here are applicable to other parts of the database system. This partitioning, divide-and-conquer approach is useful for other algorithms and methods we care about in our system. You'll see this recurring theme throughout the rest of the semester: splitting things up into smaller units of work and operating on a small chunk of data or a small problem at a time is a very useful technique, okay?

All right, so let's talk about project two. In project two, you are going to build a thread-safe linear probing hash table. It's going to be built on top of the buffer pool manager you built in the first project, so it's not an in-memory hash table; it has to be backed by disk pages. We're not going to do anything like what we talked about today with maximizing sequential I/O. It's all random I/O: you grab pages from your buffer pool manager as needed to do inserts and deletes. You are going to have to support resizing. A linear probing hash table is a static hash table, so when it gets full, you need to take a latch on it and resize the entire thing. And you need to support doing this resizing while multiple threads could be accessing the hash table at the same time. The website is up; it's not announced on Piazza yet, there are some final adjustments we're making to the source code before we release it to you, but we hope to have it up later today. I don't know, sorry, animations.

All right, so there are four tasks you're going to have to do. The first is that you're responsible for designing the page layout of the hash table pages: the header page, and then the block pages where the actual key-value pairs are stored. This is a useful exercise to get you to understand what it means to take a page from the buffer pool manager and interpret it in such a way that it stores exactly the data you want. You're not just malloc'ing some space; you go to the buffer pool manager, say, give me a page, and then you say, oh, this is a hash table block page, here are the offsets to find the data I'm looking for. Essentially, how do you do a reinterpret cast on that data? So you first implement those two classes, the header page and the block pages.
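To illustrate the reinterpret-cast idea, here is a minimal sketch. This layout, a slot-count header followed by a packed key-value array, is made up for illustration and is not the layout the project requires.

    #include <cstddef>
    #include <cstdint>

    constexpr size_t PAGE_SIZE = 4096;  // assume 4 KB pages

    struct Pair {
      int64_t key;
      int64_t value;
    };

    // Overlay interpretation of a raw buffer pool page as a hash table block page.
    struct BlockPage {
      uint32_t num_pairs;  // header field: how many slots are occupied
      Pair pairs[(PAGE_SIZE - sizeof(uint32_t)) / sizeof(Pair)];  // packed slot array
    };

    static_assert(sizeof(BlockPage) <= PAGE_SIZE, "layout must fit in one page");

    // The buffer pool hands you raw bytes; you overlay your layout on them.
    BlockPage *AsBlockPage(char *raw_page_data) {
      return reinterpret_cast<BlockPage *>(raw_page_data);
    }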
Then you implement the basic hash table itself, to do inserts and deletes. Then you add support for concurrent operations using a reader-writer latch, which we provide you, and you support resizing: you take a latch on the entire table, double its size, and rehash everything. You should follow the textbook's semantics and algorithms for the various operations. The lecture I gave on the linear probing hash table follows the textbook pretty closely, and a linear probing hash table doesn't have that many design decisions to make; it's just a matter of going through those exact steps.

I advise you to obviously work on the page layout first, because you can't have a hash table unless you can store it in pages. And you should make sure that your pages work perfectly before you move on to actually building the hash table itself. We'll provide some basic test cases to do rudimentary checks of your page layouts, but it's up to you to do something more rigorous. Because if your page layout gets fucked up and you start building your hash table on top of it, it's like building a house on sand: now your hash table isn't working, and it could be because your pages aren't working correctly. So get that down solid before moving to the next thing.

Then, when you actually build the hash table itself, don't worry about making it thread-safe. Focus on the single-threaded support first. This is a common design approach in database systems. It's the approach I take in my own research and in practice, although not every company follows this. He's wearing the shirt for the company that does not follow this. Focus on correctness first. Don't worry about it being slow. Make sure it works exactly the way you think it should. Then go back and start doing the optimizations, some of the things he suggested, some of the things we talked about in class: optimistic latching, being more crafty about how you release latches, right? Make sure it works correctly first. Have test cases that prove it works correctly. Then, when you start trying to make it go faster, and we will have a leaderboard to see who has the fastest hash table, you know that you're working with a solid implementation, okay?
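Here is a rough sketch of the resize-under-a-global-latch idea, using std::shared_mutex as a stand-in for the provided reader-writer latch. The probing and page management are elided, and in the real project you would also latch individual pages, which this sketch glosses over.

    #include <cstdint>
    #include <shared_mutex>

    class LinearProbeHashTable {
     public:
      bool Insert(int64_t key, int64_t value) {
        {
          std::shared_lock lock(table_latch_);  // shared mode: concurrent operations allowed;
                                                // per-page latches (not shown) protect slots
          if (InsertInternal(key, value)) {
            return true;
          }
        }                           // release the shared latch before resizing
        Resize();                   // table was full
        return Insert(key, value);  // retry on the bigger table
      }

      void Resize() {
        std::unique_lock lock(table_latch_);  // exclusive mode: block all readers and writers
        if (!IsFull()) {
          return;  // another thread already resized while we waited
        }
        DoubleCapacityAndRehashAll();  // allocate new pages and rehash every entry
      }

     private:
      std::shared_mutex table_latch_;
      bool InsertInternal(int64_t key, int64_t value);  // assumed: probe and place, false if full
      bool IsFull() const;
      void DoubleCapacityAndRehashAll();                // assumed: doubles size, rehashes
    };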
All right, so just like before, you don't need to change any files in the system other than the ones you submit on Gradescope. We'll send an announcement on Piazza that you should rebase your existing code on top of the latest master, because that will bring in the new sample header files and sample test cases. We'll provide instructions on exactly what you need to do to rebase. Since you can blow away your source code on GitHub very easily with a force push, make sure you make a backup of your first project before you start the rebase. And as always, post your questions on Piazza and come to office hours.

Yes? His question is, if you pass all the tests from the first project, can you safely assume your buffer pool implementation is solid enough to support the hash table? Yes. I could be wrong, but we think we tested it pretty thoroughly. There was a bug last year where exactly that problem showed up, but that's all been resolved. So if you pass our tests, it should be solid. Could we release the test code now? I can't do that, because some people still haven't submitted yet.

In the back, yes? The question is about resubmitting the first project. You can submit as many times as you want. Whatever the highest score you got before the deadline is what we'll use; actually, Gradescope will let you activate whichever submission you want, I think, but it's the highest score up to the deadline that counts. If you had an 80 before the deadline and 100 after, you have an 80. You can play games with the late days, but that's it. What if you change your implementation after the fact for project one? You'd still be allowed to submit on Gradescope for the old project one. We could also have the project two autograder run the first project's tests if that makes it easier; you won't get a score for them, but they'll be there. We could do that, we'll fix that, okay. What's that? It'll make the autograder slower, that's the only thing. But yes, you can still submit for the first one.

Okay, do not plagiarize. We're going to run your submissions through MOSS; we're doing that this week for your first project. If you plagiarize, we will fail you and you can be kicked out, okay? Don't do that. Next class, we're doing joins: nested loop joins, sort-merge joins, and hash joins. Okay. All right, he's got a call, he's not here. All right guys, see ya.