Welcome back. Yeah, sorry I missed yesterday, two days ago. Did you hear what happened with Fatface Rick? Yeah, he's not doing well. He's in pretty bad shape. I just got back from checking on him, and I think he appreciated the visit. Yeah, so that explains it. So, as I said, Fatface Rick is not in good shape. I'm traveling there this weekend to go check on him, so Tuesday's class will again be remote. I'll post all the details on Slack. Other things on the agenda: Homework three — all the dates here should be correct — is due this Sunday. The midterm exam is in class on Thursday, October 13. Project two is out now; we're having the info session tonight on Zoom, as I posted on Slack, at 8pm. Checkpoint one will be due on October 11, a Tuesday, and checkpoint two will be due on October 26 — we bumped it because of fall break. All right, any questions about any of this? Okay, Tuesday's class will be over Zoom or YouTube. Sorry, yes. All right, let's jump into this. Okay, yes. The question is, is project two the hardest? Yes. So please, start project two right now. Trust me, people always think, oh, project one was easy, and they put project two off until the end. You will not have a good time. Do not wait. Start it now. Is it because the projects are easier, or because you've gotten better? Like it's... what? I don't know what that is, sorry. You won't have any time to play video games. Sorry. Yes. The question is, what is hard about project two? Well, it's a concurrent data structure. We have to care about things moving around in memory. People always have a hard time with it. The question is, how do you manage the files in project two, how do you open them? Is this a real question? Okay. I mean, use an IDE, like CLion or Visual Studio. The C++ IDEs have gotten really good in the last 10 years. What's that? CLion. It's free for students — if you have a .edu email, it's free. It works with alumni emails too, just so you know. Okay. All right. So, some questions about how to work on BusTub and the projects going forward. We can't stress this enough: you should really be writing your own tests. Don't use Gradescope as a debugger — don't just fail a test, make a little fix, and submit it again. As you get closer to the deadlines, the Gradescope queue gets backed up and you're not going to get an immediate response. So I would encourage you to practice what is called defensive programming, especially in a concurrent data structure like a B+ tree, where you want to put asserts and other invariant checks all throughout the code. That way, if you get back a null pointer from some function, you trap it when it occurs, not five functions later when it blows up with a seg fault because you passed something wrong earlier. Write assertions in your code so you know the data you're passing around is correct as the data structure moves from one state to the next. And in terms of getting a higher score on the leaderboard, we'll discuss some things tonight that you can do to make your code better.
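To make that defensive-programming advice concrete, here is a minimal sketch of the kind of invariant checks you might sprinkle through a B+ tree. The node layout and function names are made up for illustration — this is not BusTub's actual API, just the pattern of asserting invariants every time you touch a node.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified leaf node used only to illustrate invariant checks.
struct LeafNode {
  std::vector<int> keys;
  std::size_t max_size = 4;
  LeafNode *next = nullptr;
};

// Check the invariants you expect to hold every time you touch a node, so a bad
// pointer or an unsorted page traps here instead of five functions later.
void CheckLeafInvariants(const LeafNode *node) {
  assert(node != nullptr && "caller passed a null leaf");
  assert(node->keys.size() <= node->max_size && "leaf overflowed its capacity");
  for (std::size_t i = 1; i < node->keys.size(); i++) {
    assert(node->keys[i - 1] < node->keys[i] && "keys must be sorted and unique");
  }
}

int main() {
  LeafNode leaf;
  leaf.keys = {1, 2, 3};
  CheckLeafInvariants(&leaf);  // passes; corrupt the node and it traps immediately
}
```

Call a check like this at the top of every function that takes a node, and the first place the structure goes bad is where the program stops — not five calls downstream at a seg fault.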
I also encourage you to use a profiler like Callgrind or perf. Perf might be overkill for this, but Callgrind would be pretty good. It'll basically show you where in your code you're spending all your time, and then you go look at that and say, all right, I might be doing something stupid — what can I do to speed this up? Okay. So again, don't use Gradescope for debugging. Don't email the TAs directly for help; please post your questions on Piazza. That way it's fair to all the students: if you have a question, somebody else might have the same one, and rather than repeating the same answer over and over again in private emails, it's better to put it on Piazza so everybody can see it. Okay. All right. So again, start project two. Do not wait. All right. So for today's class, I want to go over query execution. After all this work — we talked about the data structures, how to bring things from disk into memory, how to do sorting, how to do joins — now we're going to talk about how you actually take a query plan and run it, execute it. We'll begin by talking about the different processing models you can have in a database system as you execute queries. Then we'll talk about the access methods — basically, sequential scans versus index scans. Then we'll talk about queries that modify data: what bookkeeping needs to happen to keep track of things. Then we'll talk about how to evaluate the expressions in your WHERE clauses, HAVING clauses, and so forth. And then we'll finish up with a quick midterm review. Okay. All right. So the first design decision we have to make in our database system to execute queries is the processing model. The processing model defines how the system is going to traverse the query plan tree and execute the operators: in what form is the data that they pass from one operator to the next, and do I start at the top and go down, or start at the bottom and go up? Spoiler: most database systems you can think of, if you ignore the data warehouses and OLAP workloads, use the top-down approach with the iterator model, because it's the simplest way for humans to reason about the code. But we'll see some cases where, depending on the workload you're trying to execute, the iterator model may not be the best choice. And what we'll see in next Tuesday's class is how to parallelize these — how to have multiple threads or multiple processes, multiple workers, execute the same query at the same time, or different parts of the same query at the same time. All right, so we're going to go through each of these three approaches — the iterator model, the materialization model, and the vectorized batch model — and talk about the pros and cons of each. So the iterator model, as I said, is the most basic one. This is what people typically implement when they build a database system that's not targeting data warehousing or OLAP workloads. Sometimes in the literature this is called the Volcano model, or the pipeline model. Volcano was an influential academic project in the late '80s and early '90s by Goetz Graefe, the same guy who wrote the B+ tree paper.
So he had a whole system, and he wasn't the first to invent this, but he was the first to codify and explain what the iterator model actually means. He also invented techniques to parallelize it, which we'll talk about next class. So the basic idea is that every single operator in your query plan — and these are physical operators now, like which join implementation we're actually going to use, which scan operator we're actually going to use — the actual implementation of each of these provides a next function. On each invocation, when you call next on an operator, it either returns a tuple — and the tuple could be some subset of columns, could even just be a record ID, but some notion of a tuple back to the caller — or, if there are no more tuples to process, it passes back a null pointer, which says, hey, don't call next on me anymore, I don't have anything for you. The idea is that a parent operator in the tree calls next on its children, and that potentially pulls data up from the bottom of the query plan until you reach the top, which produces the output that we return back to the client. It's called the pipeline model, too, because the idea is that, in theory, you want to ride a single tuple up a pipeline in the query plan as far as you can, until you can't do anything more with it, before you go back down and get the next tuple. So let's look at an example. We have a simple join query on R and S, where R.id equals S.id and S.value is greater than 100, and we have this query plan here. Again, what I'm describing is actually the physical operators, the physical implementations, but visually I'm just showing the logical query plan. Normally I don't like to show code in class, but it's kind of unavoidable here. You can think of the implementation of all these operators as for loops that call next on their children — except for the leaf nodes, the scans on R and S, which actually iterate over the tuples in the corresponding target tables. So the way you invoke this, going from the top down, is that you call next on the root operator, which in turn calls the next function on its children. So the order goes like this. We start at the root operator, the projection. It calls next on its child, which invokes this function here for the join. For this one we're doing a hash join, so the first thing it has to do is build the hash table: it iterates over the left child, calling next on it over and over again to populate the hash table. Every time it calls next, it goes down to this function here, which iterates over every single tuple in R and emits them up one at a time — again, in the iterator model it's one tuple per next call. Once it has built the hash table, it goes to the second part of the join, the probe: it calls next on the right child, goes down, and the first thing inside there is the filter operator, which just iterates over every tuple from its child. So that goes down to the scan.
Then for every tuple in S, when it comes back up, we apply the filter — evaluate the predicate in this step — then spit it back up to this loop here, and then we do our probe. If we have a match, we pass it up to the parent. Yes? The question is about the probe of the hash table in a hash join. You build the hash table, and then the probe phase checks whether you have a match. If there's a match, you emit it up; if there isn't, you just go back and get the next tuple. So what's nice about this approach is that the operators are highly composable. The hash join doesn't care what its child is doing down here — it could be another join, it could be another scan, a sequential scan or an index scan, it doesn't matter. As long as all the operators implement next, they're all composable. I don't have a diagram to show what a pipeline looks like, but the join operator is considered a pipeline breaker, meaning that when its parent calls next on it, it can't return a tuple until it has seen all the tuples on the left side of the tree, because it has to build the hash table. So we're going to sit in this loop here, iterating over all the tuples to build the hash table. Now, once we've built the hash table and we're in the probe phase, when we call next on this side of the tree, this actually is a pipeline where we can go from the bottom all the way to the top: for every single tuple in S, as long as the predicate evaluates to true, it goes up to the next operator; then I probe the hash table, and as long as that matches, I go up to the next one, do the projection, and produce the output. So the pipeline in this example, for this side of the tree, would be five, four, two, and one. And the advantage of a pipeline is also that I don't have to write out any intermediate results to disk. Again, in a disk-based system, at every one of these steps we might have to spill to disk if we run out of memory. So it may be the case that I go fetch one page, I get one tuple, I can do as much work as I need to on that tuple and potentially get all the way to the output without ever writing it back to disk. As I said, this is used in almost every single database system you can think of, and it allows this pipelined approach of trying to keep things in memory and do as much work as you can for as long as possible. As I said, there are some operators that have to block until all their children emit all their tuples — these are called pipeline breakers. That's joins, subqueries, sorting: obviously I can't do a global sort on all the data until I've seen all the data. Another nice advantage of this approach, in addition to the engineering being simpler, is that output control is super easy. If I have a LIMIT clause — going back to my example here, if there were a LIMIT 10 on this — then in the top loop, once I hit 10, I stop, I don't call next anymore, and I'm done. All right, what are some pros and cons of this? Yes. The comment is that you make a huge number of function calls. Yes — and if everything's in memory, that can be expensive. But if you have to spill to disk, the function-call overhead is bad but not terrible; going to disk is always worse. Yes. The comment is that you can't optimize for things happening below you.
What do you mean by that? He says, if you want to do an ORDER BY with a LIMIT, you still need to pull all the data from your children, then sort it, then apply the limit. Yes, but that's somewhat unavoidable, right? Yes. So the question is how these pieces couple together: I have a SQL query, I somehow convert it to a tree, and then I somehow decide these are the functions we're going to call, right? These functions here are what you actually implement in the system itself — in BusTub, in project three, you implement these things. The thing the database system figures out for you is how to convert the SQL string into an optimized plan, and we'll cover that after the midterm. It figures out things like which table should be the inner versus the outer table of the join, and which join algorithm to use. In my example here, I pushed the predicate down below the join; in pure relational algebra it would actually sit up at the top. So there are a bunch of decisions there — for now, assume we already have the plan. Yes. It's kind of hard in this class because, how do I say this, there are so many parts of the system that all intertwine. It wouldn't make sense for me to talk about query optimization if you didn't know what the join algorithms were, right? So that's the top part. Right, so we hit a bunch of the downsides of this. The function-call overhead sucks, but the real limitation is the fact that you're dealing with one tuple at a time. I go fetch one thing, and I may have to do some expensive operation on it, where if I had multiple tuples at the same time I could batch them together and potentially vectorize, execute it all at once, and amortize the cost. So again, this approach is the easiest for humans to reason about, and it's what people typically implement. Actually, this is pretty much how everybody implemented it up until maybe 20 years ago. But let's look at some other approaches. So the next one is the materialization model. So this is — yes? The question is, going back here, what happens if the hash table doesn't fit entirely in memory — would I skip four and five, or what? So the question is, if my hash table has to spill to disk, would I try to do four and five as much as possible before going back up here? No, because the iterator model doesn't let you do that — at no point in my pseudocode do I check whether something is on disk or not. The database system basically has to figure out ahead of time whether it thinks it might have to spill to disk, and if so, maybe it does the recursive partitioning — the partitioned hash join we talked about last time — or the hybrid hash join. In that case, instead of the straight, regular hash join — instead of doing the probe phase here when I pull on the right side — I could do that bucketing approach, spill it all out to disk, and then do an extra step where I join the right side and the left side together.
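Before moving on, here is a minimal C++ sketch of what the pull-based next() interface described above might look like. The names (Executor, ScanExecutor, FilterExecutor) and the in-memory "table" are illustrative assumptions, not BusTub's exact classes; it just shows the shape of the iterator model.

```cpp
#include <cstdio>
#include <optional>
#include <vector>

using Tuple = std::vector<int>;  // stand-in for a real tuple representation

// Every physical operator exposes Next(): one tuple per call, or std::nullopt
// when there is nothing left (the "null pointer" in the lecture).
struct Executor {
  virtual ~Executor() = default;
  virtual std::optional<Tuple> Next() = 0;
};

// Leaf: scan over an in-memory "table", keeping a cursor between calls.
struct ScanExecutor : Executor {
  const std::vector<Tuple> &table;
  size_t cursor = 0;
  explicit ScanExecutor(const std::vector<Tuple> &t) : table(t) {}
  std::optional<Tuple> Next() override {
    if (cursor >= table.size()) return std::nullopt;
    return table[cursor++];
  }
};

// Filter: keep pulling from the child until a tuple passes the predicate.
struct FilterExecutor : Executor {
  Executor &child;
  explicit FilterExecutor(Executor &c) : child(c) {}
  std::optional<Tuple> Next() override {
    while (auto t = child.Next()) {
      if ((*t)[1] > 100) return t;  // e.g. S.value > 100
    }
    return std::nullopt;
  }
};

int main() {
  std::vector<Tuple> s = {{1, 50}, {2, 150}, {3, 500}};
  ScanExecutor scan(s);
  FilterExecutor filter(scan);
  while (auto t = filter.Next()) {  // the parent "pulls" tuples up the pipeline
    std::printf("id=%d value=%d\n", (*t)[0], (*t)[1]);
  }
}
```

A parent operator — say, the probe side of a hash join — would loop on filter.Next() the same way main() does, and a LIMIT 10 would simply stop calling Next() after ten tuples.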
But the database system isn't going to try to figure that out on the fly; most systems try to figure it out ahead of time, and then the query plan is essentially baked. All right, so the materialization model is a bottom-up approach. There isn't a next function; instead, each operator essentially implements an output (or emit) function, so that when you invoke the operator, it generates all the results it will ever generate and pushes them up to its parent operator. The idea is that you materialize the entire output of the operator all at once. And of course, because it's not going from the top down, you have to play some games of inlining or embedding higher-level operators that sit at the top of the tree down into the lower parts — like a limit clause, for example, if you can — so that you avoid scanning too much data and pushing too much data up. I'll say this — I actually meant to say it for all of the approaches — what gets passed around could be the entire tuple, the entire row with all its columns, or it could be a single column. And there are some additional bookkeeping mechanisms the database system maintains to keep track of what portion of the table I'm actually passing up from one operator to the next. So let's go back to our join example. Now, for all our different operators, instead of calling next, you just call the output function on the child. Actually, in this example I'm going top down — you could also go bottom up; all of these approaches work in both directions. Going top down, we call the output function on our child, and that's a blocking call: I'm not going to get control back in this operator until I get all the output from my child. So this guy calls output on its left side, which builds a buffer, an array, a list of all the tuples, and returns it all at once. Then we can build the hash table. Then I go down the right side: it calls output on its child, gets the complete list of all the tuples, and then I push that up, and push that up, and so forth. So the example I was giving before about inlining: if I have a billion tuples in S, then obviously it's stupid to take this entire table, put it into a buffer, and pass it to my parent. You'd know that you could inline this predicate evaluation, so that as you're scanning, when you bring a tuple into memory you apply the filter, put it into your output buffer, and skip step four here entirely. Yes. So the statement is — and you used the word generator in the Python context, am I correct? Yeah. The statement was that in the iterator approach, because it follows the Python generator pattern with the yield function, going one tuple at a time, for large tables you only ever look at a portion of the table at a time.
In this approach, if you have a large table, because you have to generate all the output of the operator at once, you can potentially thrash because you run out of memory. Yes. The question is whether anybody actually uses this. Yes. So VoltDB, again, is the system based on the project I worked on in grad school, H-Store, which was an OLTP system — and I'll admit it, we did this. We did it because the leaf nodes are almost always going to be index scans; you're almost always fetching one record at a time, like go fetch Andy's account record, go fetch somebody else's account record — you follow the index, you get one thing. And in VoltDB's case it's also an in-memory system, so we never go to disk, and therefore the function-call overhead actually mattered; we didn't want to pay it. RavenDB is a document database from Israel; they do this because, again, they're targeting operational workloads where you're fetching a small number of things. CrateDB is similar. Now, the one exception to this is MonetDB. MonetDB is an OLAP system, one of the first column-store systems, out of CWI in the Netherlands — the same place DuckDB came out of. Actually, DuckDB started off as a fork of MonetDB called MonetDB Lite; then they threw that away and wrote DuckDB from scratch, which is a cool system. Anyway, MonetDB does this. They rely on mmap, believe it or not, and they had a bunch of papers that came out to deal with the overhead of this materialization — essentially they try to push down as much of the scans as they can, and they just let the OS swap things in and out as needed. They claimed it worked well. I disagree — I think it's the wrong way to do it for OLAP, definitely — but they do it. I've said this in class before, and somebody emailed the creator of MonetDB, like, hey, Andy said this, you're wrong, and I had a good exchange with him about it. Actually, that guy died this year, which is kind of sad. Wait, sorry — the creator of MonetDB, not the student; I wasn't trying to be funny. Martin was a good guy. All right. So yeah, in my opinion this is bad for OLAP queries with large intermediate results, because you'd have to spill them to disk. For OLTP, it makes sense. So now I have two extremes: the iterator model is one tuple at a time, the materialization model is all the tuples at once. We might actually want something in the middle, and that's what the vectorized (vectorization) model does. We still have a next function, but now instead of getting one tuple, we get a batch of tuples. In the control loop of the operator, when it calls next on its child, it gets a batch of tuples, and inside the for loop that actually does whatever work the operator wants to do, because we're looking at multiple tuples' values at the same time, we can do a bunch of vectorized operations with SIMD to speed things up. The size of the batch will vary depending on the hardware and what the query looks like, so I can't say always use this size — every system does something different.
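As a rough sketch of what that batched next() interface might look like — again with invented names, and a fixed batch size just for illustration — the tight inner loop over the batch is where SIMD (via intrinsics, or compiler auto-vectorization of a loop like this one) gets its chance to pay off:

```cpp
#include <cstdio>
#include <optional>
#include <vector>

// A batch ("vector") of values for one column of a chunk of tuples.
struct Batch {
  std::vector<int> values;
};

struct VecExecutor {
  virtual ~VecExecutor() = default;
  virtual std::optional<Batch> Next() = 0;  // a chunk of tuples, not one tuple
};

// Leaf: hand the data up in fixed-size chunks.
struct VecScan : VecExecutor {
  const std::vector<int> &column;
  size_t cursor = 0;
  size_t batch_size;
  VecScan(const std::vector<int> &c, size_t n) : column(c), batch_size(n) {}
  std::optional<Batch> Next() override {
    if (cursor >= column.size()) return std::nullopt;
    Batch out;
    for (size_t i = 0; i < batch_size && cursor < column.size(); i++) {
      out.values.push_back(column[cursor++]);
    }
    return out;
  }
};

// Filter: evaluate the predicate over the whole batch in one tight loop.
// A real engine would use SIMD intrinsics here; even this plain loop is the
// kind of code a compiler can auto-vectorize.
struct VecFilter : VecExecutor {
  VecExecutor &child;
  explicit VecFilter(VecExecutor &c) : child(c) {}
  std::optional<Batch> Next() override {
    while (auto in = child.Next()) {
      Batch out;
      for (int v : in->values) {
        if (v > 100) out.values.push_back(v);  // e.g. S.value > 100
      }
      if (!out.values.empty()) return out;
    }
    return std::nullopt;
  }
};

int main() {
  std::vector<int> col = {50, 150, 500, 20, 999};
  VecScan scan(col, 2);      // batch size of 2, purely for the example
  VecFilter filter(scan);
  while (auto b = filter.Next()) {
    for (int v : b->values) std::printf("%d ", v);
  }
  std::printf("\n");
}
```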
But you typically want to align the chunks of data that you process with the vectorized instructions to the size of the SIMD registers — the batch itself, though, can be larger than a single SIMD register. So again, it's something in the middle: not one tuple, not all the tuples, and the batch size that gets the best performance varies per system. All right. Going back to our example: we still have the next function, and it passes things around just like before, but now in the leaf node, in our scan, once the size of our output array reaches whatever we want our batch size to be, we emit it out. So again, it's like the yield function in a Python generator, and everything else works just like before. This is the way to build a data warehouse or an OLAP system today. Pretty much every system that is designed specifically for analytical workloads supports this, because you get the vectorization inside the loop for whatever work the operator is actually trying to do, and pretty much every such system since the late 2000s has worked this way. The system that essentially invented this was VectorWise. Again, this came out of the same place as MonetDB and DuckDB — it started as a MonetDB offshoot called X100. They said, look, you actually want to do all this vectorized stuff and move things around in batches. VectorWise got bought by Actian; a co-founder of VectorWise left and went on to co-found Snowflake, so this is essentially how Snowflake works today as well. I also list other systems like SQL Server, DB2, and Oracle. Obviously these systems are super old — DB2 is from 1983, Oracle from 1978 or so — and back then they didn't have SIMD, so they didn't originally implement the vectorization model. But they have specialized engines that you can graft onto the regular iterator-model engine that operates on rows — engines specialized to work on columnar data using the vectorized model. Okay? So the main takeaway: for general-purpose systems, the iterator model is probably the best approach; for very specialized OLTP systems, I think the materialization model is better; and for analytical workloads, whether on disk or in memory, the vectorized model is better. Yes? So your first question is, with so many different systems out there, how does a customer choose? And your second question is, why are there so many databases? Let's see. As I've said many times before, I love databases, and my hobby project is this website, dbdb.io, the Database of Databases. I'm currently aware of 862 — there are way more than that; there are a few hobby ones I'm missing. But your question is, why are there so many? There's a ton of money, right? Snowflake went IPO, Databricks is going IPO next year or at some point. There's a lot of money, people have a lot of data, and there's always somebody out there who says these existing systems don't do what I need.
And they're always trying to build a specialized one. Is the market sustainable with all these different systems? Probably not. I think there's a Gartner report that says a bunch of these are going to start going under in the next two or three years. Where am I going with this? There's a lot of money to be made, so people always think they can do something better. And what often happens, too, is that systems exist, people use them, then they get bought by another company and relegated to maintenance mode, so you've got to figure out how to move to something else — and that's when people build something new. I mean, 800-plus databases; there's a lot of money. Got a question over here. SIMD means single instruction, multiple data. I have slides from the advanced class I could show. Basically, if I want to add two numbers together, one plus one, that's one instruction. But if I put two vectors of numbers into these specialized registers, adding them together is also a single instruction. So the question is, why is SIMD better for analytical queries? Because I'm doing long scans on single columns. Say I want to find all the customers that live in this county, and assuming I don't have an index, I have to scan through. Rather than the iterator model of getting one tuple at a time, running my where-clause check, and producing the output if it's true — what if I get four tuples at the same time in the vectorized model? I load them up into the SIMD registers, do the comparison with a single instruction, get back the result all at once, and pass up the ones that match. So the statement is that this is basically the iterator model, except I'm running special instructions. Well, I'm not showing any SIMD inside these for loops — I'm just saying that if you actually implemented this, you would use SIMD to speed it up. The high-level concept of the processing model is independent of the actual implementation, whether it's vectorized or uses hardware accelerators or not. The idea I want to get across is that you can pass one tuple at a time, all the tuples at once, or something in between — and if you do something in between, you can get this advantage from SIMD. Yes — we're going down a rabbit hole here. The question is, do people implement these things with CUDA? Yes, there are GPU-accelerated databases. I can't remember offhand which ones use CUDA — I think Brytlyt out of the UK does. But yes, you could implement this on the GPU with CUDA. I actually think they're a bad idea, but they exist, and if you go to the Database of Databases you can filter for which databases use GPUs and it'll tell you. Yes — why do I think it's a bad idea? All right, rabbit hole, real quickly. I'll just say it's a bad idea because the data lives where? On disk or in memory, right? So if it's on disk, I've got to fetch it into memory, then ship it down to the GPU, do the processing on it, and then bring the result back up.
Now there's NVLink, there's the new CXL, there's new hardware that can make that go faster. But oftentimes a well-written CPU-based execution engine will outperform the GPU one because you don't pay that transfer cost. That's the basic idea. And again, I can post a link to this on Piazza, but we run a seminar series on Mondays, and back before the pandemic I ran one specifically on hardware-accelerated databases — this one is just general stuff — so we had a bunch of these companies come give talks, and these were the GPU ones. This one has since been renamed OmniSci and then renamed again to HeavyDB; those are all GPU systems. Swarm64 was an FPGA one; basically nobody wanted an FPGA-accelerated database, so they switched to software only and then got bought for peanuts. Right? The question was why I think it's a bad idea. Have you heard of any of these? No, right? And when they came to give a talk, they would show how much faster they are because they're using CUDA, down on the GPU, versus, say, Greenplum, which is a fork of Postgres from years ago. And we'd say, okay, great — is that because of the GPU, or just because you rewrote everything and didn't pay the legacy overhead of Greenplum? And they didn't know. I will say BlazingDB, which got renamed to BlazingSQL, made the most sense of all of these, because the idea was basically: if I'm going to shove data down into TensorFlow or PyTorch on the GPU anyway, then on the way down I can do filtering on the GPU with SQL. That, to me, made the most sense. They basically closed up shop and formed a new company with the guy who was doing Apache Arrow — the guy who made Python pandas. They took him, plus the BlazingSQL guys out of Peru, plus the guy who was doing NVIDIA RAPIDS, mashed them together, and they're building a new system called Voltron Data. We can cover that one later too — it's basically built around the Apache Arrow project. All right, I'm way off on a tangent here, I apologize. Let's come back. Okay. So, as I mentioned, you can process in both directions. For this class it's not going to matter too much; assume most systems go top down. There are other systems that go bottom up — think of it as the pipeline approach — and if you go from the bottom to the top, you can be very careful about putting data in specific CPU registers and get even better performance than just keeping things in memory. There's a system out of Germany called HyPer, which Tableau bought, that does this. But again, most systems go top to bottom. All right, so let's talk about access methods now. This is stuff we've basically already covered this semester: these are the leaf nodes in our query plan, where we either scan the table, or scan an index, or potentially scan multiple indexes. Systems like Oracle have a bunch of different names for these things — heap scan, partition range scan, row ID scan — but at the end of the day, you're either scanning a table or you're scanning an index.
And the table could be a real table — something you created with CREATE TABLE — or it could be the result of an intermediate or nested query within the query you're running, or it could be a materialized view where you pre-computed a query and put the results in a temp space. But again, because our operators are composable, the upper parts of the plan don't necessarily care where the data came from. These access methods are not defined in relational algebra; they're something we extend relational algebra with, because they specify the implementation — how we're actually going to do this. So again, we're talking about access to the base tables, or to something that looks like a table or an index, at the bottom of the plan. And at the end of the day, you basically only have three choices. A sequential scan we've already covered: it's just iterating over every single page in the table. And again, you don't access things directly on disk — you go through the buffer pool and say, I want this page; it's responsible for going out to disk, bringing the page into memory in a frame, and handing back a pointer to where it's located. Then you do whatever you want inside your scan operator, inside the for loop: evaluate a predicate, compute some aggregation, whatever — and produce the result back up to your parent operator. The way this basically works is that inside the scan operator you have what's called a cursor, where you keep track of the last page I looked at and the last slot I looked at, if I'm using slotted pages. That way, if it's the iterator model and my parent calls next on me and I emit a tuple back up — again, like a Python generator yield — when next is called again, I can pick up where I left off. So within the operator you maintain some local state that says what's the last thing I looked at, and that's called a cursor. Now, there's not really a lot we can do to make the scan itself go faster: I fetch a page, bring it into memory, and I have to look at the data. This is oftentimes the worst thing you can possibly do in a database system, but it's the fallback option — if I don't have any indexes and I don't have any pre-computed data, I can always just scan the raw pages of the table and produce my result. And again, it's often the worst thing to do because the disk is slow and it's going to trash our buffer pool — the sequential flooding problem we talked about before. But there are some things we can do to make it a little faster, and most of them we've already covered. We've already talked about prefetching in the buffer pool: if I know I'm doing a sequential scan on a table, prefetch a bunch of pages ahead of time while I'm processing the first page, so that when I come back for the next page it's already in memory. Buffer pool bypass we talked about: I can have a little side memory space just for this one query, bring my pages into that private memory, and not trash the buffer pool or break the LRU policy. Parallel queries we'll talk about next class. Heap clustering is the index clustering technique where the data is pre-sorted based on the index.
Late materialization we talked about before, but the new one I want to mention is data skipping. The basic idea of data skipping is that we want to produce the result for the query without having to read all the data, and there's a lossy and a lossless approach to doing this. The lossy approach is to do approximate queries. Lossy means that since I'm not going to look at all the data, I may actually get incorrect results, but that might be okay. Think of MP3 or MP4: they compress the audio-visual data in such a way that we as humans can't detect the degradation, the loss, relative to the raw data. So the idea of an approximate query is that we can have a sample of the data — either one that already exists or one we take — and run the query, do the scan, over that sample to compute some aggregation or some result that's close enough, and I'm okay with that. If I want to know how many people visited my website and I get like a million visitors a day, instead of looking at exactly all one million clicks I just take a sample and approximate from that; if I'm off by a thousand, that's probably okay. So there are some systems that support a limited amount of approximate queries, and they're typically for aggregations — give me an approximate count, approximate min and max, and so forth. Systems like Snowflake, BigQuery, Databricks, and Oracle have special keywords where you can say I want an approximate count, and then maybe they maintain an internal data structure like a HyperLogLog or a sketch — similar to a probabilistic data structure like a Bloom filter — underneath the covers, and they use that to produce the result instead of doing the scan. Systems like BlinkDB, an academic system out of Berkeley, and ComputeDB, which was called SnappyData before they got bought, are designed specifically to do approximate queries: you can say, give me an approximate answer for this query within some amount of statistical error, and they figure that out for you. So again, this lets you skip data because you don't need to scan everything. The lossless approach is zone maps, and this one is way more common. The basic idea is that for some chunk of data — multiple pages, a single page, or a segment of a single column — you pre-compute a bunch of aggregations on that column. Then when your query runs, the where clause first checks the zone map to see whether the predicate could even possibly find a match in that chunk, and if not, it skips it entirely. I'm using the term zone maps; if you Google "zone maps database" you'll end up at Oracle's documentation. As far as I can tell, this was invented by Netezza, which was another FPGA hardware-accelerated database from the 2000s — IBM bought them and killed it off because nobody wanted FPGAs. But this is pretty much how everybody does it; they just might not call it zone maps. And if you're familiar with Parquet or ORC files — we'll talk about these when we talk about cloud databases — they maintain this metadata for you. So it works like this.
Say this is our original data, a single column, and I compute a zone map over it with the standard aggregates I'd potentially want: min, max, average, sum, and count. You'd do this for every single column. In my example here I only have five entries — that's super small; in something like a Parquet file or a column-store database you can think of these as 10-megabyte or 100-megabyte blocks. Then my query shows up: select star from table where value is greater than 600. Rather than doing a sequential scan on the original data, I just look in the zone map: value greater than 600, the max value is 400, so I know there can't possibly be a matching value, and I skip this block entirely. As I said, this is super common in a lot of systems today, especially cloud systems, because fetching the zone map is way cheaper than bringing in the entire column, the entire dataset. Snowflake does this too — Snowflake should be listed there; it's one of the optimizations they do. Now, obviously, the larger the collection of data you put a zone map over, the less selective it's going to be, and you may still end up having to read everything. So there's a tradeoff between how big you make the blocks you build zone maps over and how useful they're going to be. If I have a billion entries and one zone map for all of them, a query like this probably couldn't use the zone map — there probably is some value greater than 600 in there — and I'd have to scan everything. But if I break it up into smaller chunks — or, in Snowflake's case, they actually sort things ahead of time — then the zone maps can be very, very helpful for filtering things out. All right, so that's sequential scans. The other choice is index scans; we'll talk about a single-index scan and then a multi-index scan. The database system figures out ahead of time which indexes to use to find the result your query needs. This goes back to the earlier question about how the system knows ahead of time what the query plan, the execution, is going to be. There's this thing called the query optimizer, or the planner — some systems call it the compiler, and we'll talk about why they call it that; it's a remnant of the 1970s — that takes a SQL query and converts it into a physical plan, and part of generating that physical plan is figuring out which index to use. There's a bunch of different criteria it uses to figure out which index you want — all these things here, which we'll cover in lecture 14 — but one obvious one is what attributes the index is on. If my where clause is looking at columns A and B but my index is on C, then that index is useless for me, and the query optimizer in the database system will know not to pick it. And obviously, if it's a bunch of greater-than/less-than predicates, I can't use a hash index if that's the one I have.
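Circling back to zone maps for a second, the skip check itself is tiny. Here is a sketch with a made-up struct, matching the example above where a max of 400 lets you skip a value > 600 predicate without reading the block at all:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-block metadata: aggregates precomputed over one chunk of a column.
struct ZoneMap {
  int64_t min;
  int64_t max;
  int64_t count;
};

// For a predicate like "value > literal", a block cannot contain a match if its
// max is <= the literal, so the scan skips the block without ever reading it.
bool CanSkipGreaterThan(const ZoneMap &zm, int64_t literal) {
  return zm.max <= literal;
}

int main() {
  ZoneMap zm{100, 400, 5};               // the lecture's example block: max is 400
  assert(CanSkipGreaterThan(zm, 600));    // "value > 600" can never match -> skip it
  assert(!CanSkipGreaterThan(zm, 300));   // "value > 300" might match -> must read it
}
```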
All this additional information — what the index actually looks like, what operations or queries you can run on it — the database system has baked in, and it figures that out for you. So here's a quick worked scenario showing the kind of choice the system can make. Say we have a query: select star from students where age is less than 30, the department is computer science, and the home country is the United States. In our example we have a single table of a hundred tuples and two indexes, one on age and one on department, and we're trying to figure out which index we should actually use. In scenario one, there are 99 people under the age of 30 but only two people in the computer science department. In that case, we want to pick the index on department, because that's going to be more selective — it's going to throw away more data more quickly when we do the index lookup. But on the flip side, if there are 99 people in the CS department and only two people under 30 — which is definitely not true for Carnegie Mellon — then the age index would be better, because there are fewer people under the age of 30; for this particular where clause, that index lets me throw away more data more quickly. Again, this is what the database system is trying to figure out: I have a bunch of indexes, I look at my where clause, I look at what the query is trying to do, I look at what the indexes can actually support, and I pick one over the other. And the index scan works just like the sequential scan: I probe the index with the iterator you're building in project two, call next on it, it produces results, and you shove them up to your parent operator. Now, what if there's a scenario where both indexes could potentially be useful? This is called a multi-index scan. The idea is that the database system says, this one index isn't that great, and this other index isn't that great either, but the combination of the two will get me to the result I want more quickly. Most of the higher-end systems, the ones that have been around for a while, support this — pretty much all the major commercial systems: Oracle, DB2, SQL Server. Postgres does this as well, but they call it a bitmap scan, because the output of each index scan is a bitmap and then they take the intersection. And MySQL, for whatever reason, calls it an index merge, but the basic idea is the same. We figure out all the indexes we want to look at, evaluate whatever portion of the where clause each index can handle, and each one produces the set of record IDs that match. Then we combine them using either a union or an intersection, based on whether we have a disjunction or a conjunction in our where clause. Then we go retrieve the records, and if there's any additional filtering to do — something in the where clause that couldn't be satisfied by the indexes — we fetch the original tuples specified by our bitmap, do the additional predicate evaluation, and produce the result up above. I realize my hand gestures aren't helping here, so let's look at an example.
Same query as before: students under 30, in the CS department, in the country US. We first retrieve all the record IDs from the age index where age is less than 30, then retrieve all the record IDs from the department index where department equals CS, then take the intersection of the two, and then do the additional filtering on country. So it looks like this. In our operator notation, we grab this index, do our scan, and get a bunch of record IDs — this could just be a set of record IDs, or a sorted list; in Postgres's case it's actually a bitmap where every bit corresponds to a possible tuple in the table. Then I get my record IDs from the department index and take the intersection, because it's an AND clause, a conjunction. Then I go fetch the actual tuples in the intersection, check my additional predicate on the country, and any tuple that matches gets shoved up to my parent operator. The point I'm trying to make is that even though, in the scan-node examples I showed before for the processing models, it looked like there was one for loop iterating over a table or an index and producing one result at a time, I can do a bunch of extra for loops down below — multiple probes on indexes, then the intersection or union of them — before I start emitting tuples up. Again, this works because we've modularized the query plan into these separate, composable operators: the operator up above doesn't need to care how you actually produced the tuples, it just takes whatever the child spits out. So if you've ever run EXPLAIN on Postgres and seen a bitmap scan and wondered what's actually going on — Postgres doesn't support bitmap indexes; they really mean a multi-index scan. All right, the next thing we've got to talk about for queries is modification queries: inserts, updates, deletes. Truncate, which basically deletes all the tuples in a table, is typically implemented like a drop table plus a re-create, because that's easier than scanning through the table and deleting every single thing. For these guys, the way it works is that whatever children are below them are responsible for telling these insert, update, and delete operators what they want to modify. Think of an update query that has a where clause: below the update is just another scan operator — index scan or sequential scan — applying whatever additional predicates it wants, and it passes up, here's the record ID or some portion of the data, here are the things I want you to modify; and then inside the update operator it makes the change. The output of these operators can vary: it could just be a record ID — here's the thing I deleted or inserted or updated — or it could be the entire tuple itself. You get the latter when you use the RETURNING clause: normally an update query just tells you the number of tuples it updated, but if you add RETURNING, it passes back the modified versions of the tuples it generated. Upserts work the same way too — an upsert just updates something if it exists, otherwise inserts it.
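Going back to the multi-index scan for a moment, the heart of it is just set logic over record IDs. Here's a rough sketch using sorted RID lists (Postgres really uses bitmaps, but the idea is the same); the names and sample RIDs are invented for illustration:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <iterator>
#include <vector>

using RID = uint64_t;  // stand-in for a (page_id, slot) record identifier

// Each index scan produces the RIDs satisfying its piece of the WHERE clause.
// A conjunction (AND) intersects the sets; a disjunction (OR) would union them.
std::vector<RID> IntersectSorted(std::vector<RID> a, std::vector<RID> b) {
  std::sort(a.begin(), a.end());
  std::sort(b.begin(), b.end());
  std::vector<RID> out;
  std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                        std::back_inserter(out));
  return out;
}

int main() {
  std::vector<RID> rids_age  = {3, 7, 9, 12};  // pretend: index scan, age < 30
  std::vector<RID> rids_dept = {7, 12, 40};    // pretend: index scan, dept = 'CS'
  for (RID rid : IntersectSorted(rids_age, rids_dept)) {
    // fetch the tuple for rid, apply the leftover predicate (country = 'US'), emit it up
    std::printf("candidate rid = %llu\n", static_cast<unsigned long long>(rid));
  }
}
```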
So, as I said, for update and delete, the child operator passes up the record IDs of the target tuples, and for these operators we also have to maintain additional metadata about which tuples we've already seen and which we've already modified, to avoid modifying the same tuple multiple times within one operator invocation. For inserts, the two implementation choices are either to materialize the entire tuple you're going to insert within the insert operator itself — which I think is how we do it in BusTub, where the insert operator is responsible for putting the attributes of the tuple together and then inserting it into the table — or, the alternative, which is the better choice: have the insert operator take whatever tuple values its child operator gives it and insert them into the target table. If you do the second one, that's also how you implement, basically, SELECT INTO: you have a select query that does whatever it needs to do in the query plan below, and whatever is fed into the insert operator from its child just gets inserted into the table. So I want to talk about one quick problem that can occur if you don't keep track of which tuples you've modified. Say we have a table called people that has a bunch of employees and their salaries, and I have a query here where I want to give everyone a $100 raise if they make less than $1,100. Now, in my update operator, I call next on my child; it goes down below, does the index scan, and finds all the employees whose salary is less than $1,100, and whenever anybody matches, it produces the result. The way this works: above this for loop is the setup of the index iterator, so I know what my scan is going to be — I traverse the index, get an iterator, and now I walk along the leaf nodes. Say the first match is me, and I make $999. That gets emitted up into our for loop here. I then remove it from the index — because I'm going to change this value and the index is built on the value I'm changing, I have to do a delete — then I update the salary and put the new value back in. But now I keep scanning along, and what am I going to find? The thing I just put back in. So I find the updated version of Andy; it still satisfies my predicate because it's less than $1,100, it gets passed up here, and I give myself a second raise. Which I wouldn't complain about, but this is bad, right? I don't want the system to do this. What's that? You're saying I do want to do this? Well, how do I say this: the high-level meaning of the query is, give everybody who makes less than this amount a $100 raise. Semantically this isn't correct, because we're giving this one tuple two raises — we're updating the same value twice. So this is actually an example of a very famous problem called the Halloween problem; there's actually a Wikipedia article on it. It comes down to an anomaly that occurs in a database system when an update operation changes the physical location of a tuple, either in the index or in the table itself.
And because I'm trying to scan along in my leaf-node operator, my access method, I may end up seeing that tuple again: it got put in a new location physically ahead of where my cursor was, so as I scan along, I'm going to see it. The example has nothing to do with the holiday itself. This was discovered by researchers at IBM who were building one of the first relational databases, System R. They came upon the problem at the end of the day on a Friday — hey, this is kind of hard, how do we fix it? — and somebody basically said, it's Halloween, screw it, let's go party and we'll deal with it on Monday. And for whatever reason they called it the Halloween problem because of that. The way we solve it is that inside our operator we keep track of additional state: the tuples I've seen so far, the record IDs of the tuples I've modified — either the physical record ID or the primary key or so forth, because if the tuple moves, the record ID could actually change. It's metadata that says, within this query I've already seen this tuple, therefore I don't need to update it again. There's additional metadata we'll talk about when we get to transactions, where we track the tuples modified within a transaction, which could span multiple queries. In that case, if I ran this query twice within my transaction, it's okay to give Andy a raise twice, because you ran the query twice. But within a single run of the query on its own, you shouldn't give a person a raise twice. This is something you'll also have to handle in BusTub for project three, I think — just keeping track of what you've seen before. All right, any questions about access methods? Sequential scan is the default option; there are some things you can do to speed it up — parallelization is usually the best approach, and zone maps make a big difference too. An index scan can use either a single index or multiple indexes, where you intersect or union the bitmap results. Yes. Sorry, this one? The question is, for insert, what's the difference between choice one and choice two? Choice two is easy to understand: I take whatever my child gives me and just insert it into the table. But think about when you write an insert query with a VALUES list — something needs to convert that value list into an actual tuple. You could have a separate materialize operator, which doesn't exist in relational algebra, that takes those input parameters, creates a tuple, and shoves it up; or the insert operator could implement that itself. In BusTub I think we do the first one, but it really should be the second, because these operators are composable — the insert operator doesn't need to know whether there's a materialize operator or a scan operator below it; it just takes whatever it's given and inserts it.
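Circling back to the Halloween problem for a second, the bookkeeping really is just a set of identifiers you check before applying the change. A minimal sketch with invented names — using the raw RID here, which, as noted above, you might swap for the primary key since the RID can change when a tuple moves:

```cpp
#include <cstdint>
#include <unordered_set>

using RID = uint64_t;  // could also be the primary key, since the physical RID can change

// Invented sketch of the bookkeeping inside an update executor: remember what
// this operator has already modified so a tuple that physically moved ahead of
// the scan cursor does not get updated a second time.
struct UpdateState {
  std::unordered_set<RID> updated;

  // insert() returns {iterator, inserted}; inserted == false means we already
  // processed this tuple in this query, so the caller should skip it.
  bool ShouldUpdate(RID rid) { return updated.insert(rid).second; }
};

int main() {
  UpdateState state;
  bool first  = state.ShouldUpdate(42);  // true: apply the $100 raise
  bool second = state.ShouldUpdate(42);  // false: same tuple seen again after it moved
  return (first && !second) ? 0 : 1;
}
```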
So in our SQL query we're gonna have WHERE clauses, HAVING clauses, projection lists; they essentially all work the same way, in that they're gonna be represented as an expression tree. And the nodes in that tree represent the different types of expression operators you can have. So just as there's a tree for the query plan itself, with the join operators and the scan operators and so forth, within each of those operators in the query plan tree there can also be multiple expression trees that you then use to evaluate these predicates. So in our example here, we have a join on R and S where R.id equals S.id and S.value is greater than 100. And you would represent these two predicates together like this. This is a join, so in practice you would not combine them into a single expression, but for simplicity in the diagram, assume that we do, right? And essentially what happens at runtime is that the database system walks this tree: you have the conjunction at the top, the AND, go down this side, you see an equals operator, go down this side, get the R.id value, go down this side, get the S.id value, pop back up, do the comparison. Whatever the result of that evaluation is, you pass it up to the AND. And obviously you can implement short-circuiting: if one side returns false, you don't go down the other side, right? This is essentially how all these systems represent these things. So let's see how it actually works. I haven't taught you prepared statements yet, but basically think of a prepared statement as a way to tell the database system, hey, I'm gonna run this query a lot, give it a name so that I can invoke it again as if it were a function, passing in parameters based on that name instead of repeating the query. Right, so I add the PREPARE keyword, PREPARE xxx AS, and then my SELECT query. So now there's a handle called xxx for my connection that I can invoke like a function and pass in a parameter. And this dollar-sign-one here is SQL parlance for, here's a placeholder for a parameter that gets passed in, right? So I have my WHERE clause here: S.value equals $1, the first parameter I'm given, plus nine, right? And you would represent that in a tree structure like this. So now when I run my query, I do my scan, whether it's an index scan or, to keep it simple, say a sequential scan, and for every single tuple as I scan along, I basically have to traverse this tree. I start at the top, I see my equals sign, I go down the left, and this is an attribute reference for S.value. So now I have some pointer in my current execution context that says, this is the tuple I'm actually looking at. I go look in the catalog and say, okay, I want S.value, that's the second attribute and it's an integer, so I know how to jump to the second attribute, and I produce the output, a thousand. Then I go back up here and traverse down this side. Now I have a parameter placeholder, a parameter expression operator. I look at my list of query parameters that were passed in, it's 991, produce that output, shove that up to the addition. It goes down the other side, I get the constant nine, shove that up, do the addition, produce the result, shove that up here, do the comparison, and it comes out true. Right? Now say I have a billion tuples. I'm gonna do this tree traversal for all one billion tuples. Why is that slow? Maybe it's not obvious.
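To make that per-tuple tree traversal concrete, here is a minimal sketch of an interpreted expression tree for the predicate S.value = $1 + 9. The class names are invented for illustration; they are not BusTub's or Postgres's actual expression interfaces.

```cpp
// Interpreted expression evaluation: the predicate "S.value = $1 + 9" as a
// tree of nodes that is recursively walked for every single tuple.
#include <cstddef>
#include <memory>
#include <vector>

using Value = long;                       // pretend every value is an integer
using Tuple = std::vector<Value>;         // tuple = vector of attribute values
using Params = std::vector<Value>;        // query parameters ($1, $2, ...)

struct Expr {
  virtual Value Evaluate(const Tuple &t, const Params &p) const = 0;
  virtual ~Expr() = default;
};

struct Constant : Expr {                  // e.g. the literal 9
  Value v;
  explicit Constant(Value v) : v(v) {}
  Value Evaluate(const Tuple &, const Params &) const override { return v; }
};

struct Parameter : Expr {                 // e.g. $1
  std::size_t idx;
  explicit Parameter(std::size_t idx) : idx(idx) {}
  Value Evaluate(const Tuple &, const Params &p) const override { return p[idx]; }
};

struct ColumnRef : Expr {                 // e.g. S.value -> attribute #1
  std::size_t col;
  explicit ColumnRef(std::size_t col) : col(col) {}
  Value Evaluate(const Tuple &t, const Params &) const override { return t[col]; }
};

struct Add : Expr {
  std::unique_ptr<Expr> lhs, rhs;
  Add(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
      : lhs(std::move(l)), rhs(std::move(r)) {}
  Value Evaluate(const Tuple &t, const Params &p) const override {
    return lhs->Evaluate(t, p) + rhs->Evaluate(t, p);   // two virtual calls
  }
};

struct Equals : Expr {
  std::unique_ptr<Expr> lhs, rhs;
  Equals(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
      : lhs(std::move(l)), rhs(std::move(r)) {}
  Value Evaluate(const Tuple &t, const Params &p) const override {
    return lhs->Evaluate(t, p) == rhs->Evaluate(t, p) ? 1 : 0;
  }
};

int main() {
  // Build the tree for: S.value = $1 + 9   (S.value is attribute index 1)
  auto pred = std::make_unique<Equals>(
      std::make_unique<ColumnRef>(1),
      std::make_unique<Add>(std::make_unique<Parameter>(0),
                            std::make_unique<Constant>(9)));
  Params params = {991};        // the caller passed $1 = 991
  Tuple tuple = {123, 1000};    // current tuple: S.value == 1000
  // For every tuple the scan produces, we re-walk the whole tree.
  bool matches = pred->Evaluate(tuple, params) != 0;   // 1000 == 991 + 9 -> true
  return matches ? 0 : 1;
}
```

Every tuple pays for the virtual dispatch and pointer chasing in that recursive Evaluate call, which is exactly the overhead being discussed here.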
Because you're traversing this tree. For modern CPUs this is bad, because the branching isn't predictable, right? If this thing evaluates to false, say I had more stuff here, maybe this example is too simple, sometimes you'd go down one path and sometimes you'd go down another path, right? None of that is predictable, and modern CPUs don't like that. They like things going down predictable paths, right? So this is pretty much how most systems, actually pretty much all systems, implement expression evaluation at the very beginning. This is how PostgreSQL originally did it, how MySQL does it; this is how pretty much everyone does it. The way to optimize this, though, is to do just-in-time compilation, or codegen. You only see this in the high-end systems, especially OLAP data warehouses, and actually PostgreSQL supports it as well. We can see it work in a demo. The basic idea is that because we're gonna be doing essentially the same logic over and over again, instead of traversing the tree, which, again, is modular (the equals operator doesn't even care what's coming from below it; as long as it gets data of the right type it can do the comparison, and it doesn't care how you got the value), I can codegen a function, some simple expressions that do exactly the logic this expression tree is doing, run that through your favorite compiler, GCC, Clang, LLVM, whatever you want, and it produces machine code as a shared object. So now when I evaluate the query on a tuple, I pass in a pointer to the tuple, so it knows how to jump to the different offsets, but it doesn't have to do that tree traversal. It literally just invokes this one function, right? That'll get inlined into the query plan execution, and it'll be super, super fast. What I'm showing here is just-in-time compilation for expressions. In the advanced class, we'll talk about how to do just-in-time compilation for the entire query plan; the query plan tree itself could be codegenned and run in a similar manner. As I said, Postgres actually supports this, which is super awesome. So let's do a quick demo and see the performance benefit you can get from it. I've pre-generated a table that has about 50 million tuples of just random integers, right? So SELECT * from my fake data table: it's just a giant list of random integers like this, right? Let me make sure I run pg_prewarm. This basically reads the entire table and guarantees that everything's already in the buffer pool. First I'm gonna run the query without just-in-time compilation, so I'll tell Postgres to turn it off. My query is gonna be a simple aggregation, SELECT COUNT(*) on my table, but in my WHERE clause I have value times value greater than 100, and then some modular arithmetic to see whether it's divisible by two or by four, right? There you go. So it took five seconds, right? And you see it did a sequential scan and then applied this filter here.
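Setting the demo output aside for a moment, here is roughly what the codegen idea described above amounts to in code: a flat, specialized function for S.value = $1 + 9 with the parameter and constant baked in, which the executor calls directly instead of walking the tree. This is only a hand-written illustration under the same assumptions as the sketch above; it is not what Postgres's JIT actually emits, and the function names are made up.

```cpp
// Roughly what codegen buys you: a flat, specialized predicate function.
// In a real system this source (or LLVM IR) is generated at runtime,
// compiled into a shared object, and then invoked directly per tuple.
#include <vector>

using Value = long;
using Tuple = std::vector<Value>;   // same layout as in the sketch above

// Generated code: no expression tree, no virtual dispatch, one predictable
// comparison. ($1 = 991 is baked in at generation time.)
inline bool eval_pred_q42(const Tuple &t) {
  return t[1] == 991 + 9;   // S.value (attribute #1) = $1 + 9
}

// The executor's scan loop just calls the generated predicate for each tuple.
long CountMatches(const std::vector<Tuple> &table) {
  long count = 0;
  for (const Tuple &t : table) {
    if (eval_pred_q42(t)) { ++count; }
  }
  return count;
}
```

Once this is inlined into the scan loop, the per-tuple cost is a couple of loads and one comparison, which is roughly the gap the demo is measuring.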
And again, it tells you that shared hit equals, this number here, which means everything's in memory, right? And it took 5.8 seconds. Now let me run the same query again, but this time with just-in-time compilation turned on. It now took three seconds, so I shaved off about two seconds. And here you see Postgres tells you that it was able to inline and compile the expressions inside of it, and it actually gives you a breakdown of how much time it spent on these different steps, right? Postgres has its own internal cost model that says, well, I know roughly how much data I'm gonna have to access, and it can determine whether it even makes sense to just-in-time compile the expression. Because if the compilation takes 36 milliseconds and the query itself only runs for a fraction of a second, then that's a bad trade-off, right? Again, this is a great thing about SQL and the relational model: this is a super advanced feature, codegenning query expressions inside Postgres, and I don't have to know anything about it in my SQL query. I just said, hey, do this SELECT COUNT(*), and it figures out the best way to execute it, not just what access method to use, but how to actually implement the expressions, right? This is amazing. To me, this is super awesome: the average Joe Schmo programmer doesn't even need to know this thing exists, but they get the benefit of it, right? Now if I go back and put in a LIMIT 1 (I'm not sure why it's taking so long), Postgres decided, because it's LIMIT 1 and I only want one output result... If I go back here, right, it's the same predicate I had before but with my LIMIT clause. Granted it's not a COUNT(*), but Postgres said, I'm not gonna compile this one. And I didn't change anything; it figured that out for me. Any questions? All right, cool. So the conclusion here is that we saw how we can take a single query plan and execute it in multiple ways. Most database systems are gonna rely on index scans as much as possible. Some of the OLAP systems, like Vertica, don't support any indexes; they do sequential scans for everything, but they pre-sort the data. But again, from the perspective of somebody writing SQL, they don't need to know and don't need to care; the database system figures that out for you. We showed how expression trees are flexible but slow, and how modern systems can use just-in-time compilation to potentially speed things up, okay? All right, so next up we'll do the midterm review. Next class will be parallel execution: basically, how do you take all the constructs we've talked about this semester and run them in parallel, with either multiple threads or multiple workers? That'll also be a precursor to when we talk about distributed databases, where it's basically the same thing as parallel query execution, but now the workers are on separate machines. And once you go distributed, you gotta worry about the network, which is why we want to put that off as long as possible, okay? All right, the midterm: who's taking it? You guys. What is it? The midterm exam. Where is it? Here. When is it? Thursday, October 16th at the regular time. Why? That video will give you the meaning of life, so you can watch that. And the midterm guide is available; I posted it on Piazza, and the practice exam is posted there as well. Yes, yeah, the 13th, yes. It's next Thursday.
Thursday, that's the 13th, not the 16th, sorry. Every time. All right, the 13th, show up. You need to bring your CMU ID, and you gotta bring a calculator. It's open notes, but your notes have to be handwritten on one regular sheet of paper. You can write on the front and back, and you can write as small as you want; that's fine. What I don't want is you taking the slides, shrinking them down, and printing them out. I want it to be handwritten, because you'll get more out of it than just copying and pasting things. Now, it was pointed out on Piazza that in 2022 most people don't have a calculator. So yes, you can use your phone as a calculator; that'll be fine. What? Your phone, what, sorry? My phone doesn't do logs. Your phone doesn't do logs? You do have to be able to do logarithms. You can't do logarithms on your phone? Again, this is Carnegie Mellon; I've said you guys are really smart, so I'm pretty sure you can figure out how to do logs on your phone. All right, so please don't bring... real question? Yes. The question is, can the handwritten notes be written on a tablet and printed out? Yes. I mean, come on, be reasonable here, okay? All right. And the point of making it handwritten is that, I think, the psychology research shows that if you handwrite something, you retain it more than if you just print things out. So that's what I'm trying to do. Okay, so this list: this is the first time I've taught this class in person since 2019, so we didn't have this problem while it was all remote. But every year, things get worse. These are real things people have brought before; please don't repeat them. Do not bring live animals. Do not bring your wet laundry. We had one kid one year who, like, raced to the exam with his wet laundry and started laying it out on the chairs. And then one year a kid brought one of those votive candles, the religious candles you get from bodegas, but it had Jennifer Lopez on it. Don't do that; that's a fire hazard. Okay? All right. What do you need to know? Obviously the relational model, the integrity constraints, relational algebra. Understand the symbols, how to compose these things together; understand, like, are there duplicates or not in relational algebra? Right, these are the basics that we covered in the first days of class. For SQL, we're obviously not gonna ask you questions like, hey, take this prompt and write SQL, because that's impossible to grade without running it. But we will, as we did in the practice exam, show you SQL queries without explaining what they actually do; it's up to you, you should know what SQL is at this point, right? So we talked about WHERE clauses, the basic types of queries you can have, output control, and then joins, aggregations, and common table expressions. I've left window functions off that list on purpose, so take that as a hint. For database storage itself, we talked about the different buffer pool management policies you can have: LRU, LRU-K, clock. We talked about what the files actually look like on disk, how you represent the page directory and so forth, and then what the pages themselves look like. There are two main approaches, slotted pages and the log-structured approach; what are the pros and cons of these different methods?
For hash tables, we talked about static hashing: linear probe hashing, Robin Hood hashing, and cuckoo hashing. Understand basically how these hashing schemes work. And for dynamic hashing, obviously extendible hashing and linear hashing. So understand, again, for extendible hashing, the difference between global depth and local depth, and for linear hashing, the inserts, the splits, and the overflow chains as well. Yes? Do you have a question? No, the question is, am I gonna ask, like, hey, is this SQL query valid? No, let's not do that. Right, because, yeah, it's really annoying to try to grade; that'd be a terrible question. I'm not doing that, yes, yes. The question is, are we gonna have to differentiate between the different database systems, like for SQL, or just in general, Postgres versus, like... No. Okay, so the question is, am I gonna ask questions like, hey, does Postgres do this and does MySQL do that? No, because it's the high-level concepts I care about. The reason I keep mentioning, hey, Postgres does this, MySQL does that, is because I just wanna show that the things I'm saying aren't just random stuff in a textbook; these are actually things that people implement. But those kinds of nuances I don't care about. Yes? Okay. All right, tree indexes: we only really covered B+ trees: insertions, deletions, splits, merges, all the standard operations, latch crabbing and latch coupling, and some of the optimizations we talked about, like scanning along the leaf nodes. For sorting, we pretty much only talked about merge sort and heap sort; there's not much to ask there. But basically, similar to the homework, the cost of sorting data: different data sets with different numbers of buffers, right? For joins, the three basic types, nested loop, sort-merge join, and hash join, and then all the variants of each of them. And the kinds of questions we would ask would be about the execution costs and trade-offs between these different join algorithms under different conditions for different queries. And then query processing, which is literally today, so I obviously can't go too deep: the different processing models, their advantages and disadvantages, and then the three different access method types, sequential scan, index scan, and multi-index scan. Yes? So, Tuesday's lecture, not on the exam? The question is, is Tuesday's lecture not on the exam? That is correct. Yes? Is the format of the exam similar to the practice exam? Yes. All right, parallel execution next. Again, no in-person class on Tuesday next week, because I gotta go see Fatface Rick, and then I'll see you at the exam on Thursday. Hit it. Super snakes.