So today, again, we're continuing our discussion on how to actually start executing queries. This is what I showed a few lectures ago, just the overview of what a hypothetical high-performance modern database system would look like. We've covered some of the parts down here and the networking layer, and now we're at this point here, going up in this direction. We're going to talk about query execution today. Next week we'll talk about compilation. Then we'll come back and do more query execution, and after spring break we'll talk about query optimization and query planning. So the idea of what we're going to talk about today and going forward the rest of the semester is how to build an efficient query execution engine. And we're going to contrast this with some of the techniques we talked about in a disk-oriented system, because if we're entirely in memory, then we don't have to worry about the main bottleneck that a disk-oriented system had to worry about. In a disk-oriented system, the goal was always to reduce disk I/O, because that was always the most expensive thing. So now that we don't have stalls from fetching data off disk to run our queries, we have a bunch of other things to worry about, and those are the bottlenecks we'll discuss through the rest of the semester. The techniques I'll be describing will show you how to overcome them or mitigate their effects on the performance of the system. We obviously still need disk for logging, so that doesn't go away. But when I execute a query, I can assume that when I read a tuple, read a column, read a block of data, it's in memory. The way to think about what we're talking about here is that there's not going to be one technique that makes everything go better. It's going to be an orchestration or coordination across multiple optimizations, and by combining them together we'll get the efficient execution performance we need. As a spoiler and a heads up: compilation, parallelization, and vectorization. Those are going to be the big three, but there are plenty of other things we can talk about as we go along. So what are our optimization goals in our system? If the disk goes away, what do we actually need to care about to get good query execution performance? The first one is the most obvious: we want to reduce the amount of work we do when we execute queries. We can do this by reducing the instruction count that the database system executes in order to process the query. If we execute fewer instructions to do the same amount of work, we get better performance. The compiler will help us a little bit; we can pass in -O2. You typically don't ship database systems as -O3-compiled binaries. It's not that it's experimental, it's just that there might be some anomalies you may not be prepared for, so -O2 is typically what people ship software with. So the compiler is going to help a little bit, but what we really want to do is specialize our database system for the specific queries we're executing. Again, that'll be query compilation on Monday next week, but we'll see a flavor of what it looks like today.
The next one is to reduce the cycles we incur when we execute these instructions. Say we've already reduced our instruction count; for the remaining instructions we do have to execute, we want to reduce the number of cycles it takes to process them. How do we actually do that? Yes. She said parallelization. Not quite. That's vectorization, that's parallelization; SIMD will give us that, but it won't necessarily give us this. Branch misprediction is one part, but it's also reducing the number of cache misses. If we have an instruction that needs to touch a piece of data and that piece of data is not in our CPU caches, the CPU has to stall and burn more cycles to go out to DRAM to bring it into the caches. So we want to maximize the locality of the data as we're processing it, as well as being intelligent about how we structure the branches in the for loops as we process tuples, to reduce the amount of misprediction. And then the last one is sort of what you two said: parallelization and vectorization. This is going to allow us to use the additional cores we're getting on our CPUs to process queries in parallel, and we'll talk about what those different schemes look like today. Moore's law is essentially ending; Intel really can't crank up the clock speed anymore without the chip melting. There are alternative materials we could use to build our processors that don't have that melting point, but that's way, way in the future. So what Intel and AMD are giving us is just more cores, and we want to execute our queries on as many cores as possible. That's going to be tricky because now we potentially need to coordinate across them. So the paper you guys read ahead of this was a bit more, I'm going to say, analytical than what we'll talk about today. But the reason I picked it is that in the first portion of the paper, they went through a lot of the techniques we'll talk about today, showing that there are a bunch of different ways you could design the system to execute queries. That was based on OLAP, and that'll primarily be what we focus on in today's lecture. Again, they talked about this decision between should I do an index probe, which is random I/O, or should I do a sequential scan? There's no one answer for when you want to use one versus the other. The main thing I wanted you to get out of it was that a disk-oriented system would typically always make this decision about whether to do an index scan or a sequential scan based on the selectivity of the predicate: if I have an index and a predicate that could be used on the index, how many tuples do I think I'm going to get back from the index? In the in-memory world, we also need to care about what the performance of the hardware is going to look like, what the CPU will actually do when we execute our database system, as well as what other queries are running at the same time. And that last one is a bit tricky, because now the optimizer needs to be aware of what else is running at the same time and make decisions based on that. So you typically only see this technique applied in the high-end commercial database systems; Postgres doesn't look to see what other queries are running at the same time. Because it's hard to do, right? It's like, here's my query, I run it through the optimizer.
The optimizer looks at what queries will be running at the same time. But by the time my query comes out of the optimizer and starts running, those other queries might be gone, so my decisions might be incorrect. So this one is hard to do. The hardware one you can measure in the beginning, although it can vary if you're running on Amazon, because even though you get the same instance type, the performance can change by, I think, up to like 20%, because, again, somebody else might be running on the same box as you. All right, so the types of optimizations we're going to apply, the ones they talked about in that paper as well as some additional ones we'll cover today: first, how we're going to actually process the query, how we're going to move data from one operator to the next, whether we do a push or a pull. Scan sharing is a technique where you allow two or more queries that are running at the same time and accessing the same data to piggyback on each other's scans; instead of having each of them read the same data separately, you combine them together. Materialized views are a way to pre-compute some portion of a query ahead of time and maintain it as the table gets updated, so you can use it during query execution. These two we're not going to talk about much this semester. I'll try to touch on them a little later on, but it's the other ones here that we'll spend most of our time on because, as I said, these three are where you get the biggest bang for the buck across most workloads. Obviously, in some cases, materialized views and scan sharing can be very beneficial if you have queries with lots of opportunities to take advantage of them. But these other techniques are general purpose enough for what we want to do. So: query compilation is code specialization; vectorization is what he talked about, using SIMD instructions; and parallel algorithms, which she mentioned, is running the operators in parallel. And then we'll finish up at the end of the semester talking about how people embed user-defined functions in queries. The big opportunity there is to basically merge the UDF into the query plan itself instead of treating it as a black box, and you get way better performance as well. So that'll be at the end of the semester; these are the big three we want to discuss. Today's agenda: we're going to first talk about what modern CPUs look like in the context of database systems, what aspects of the hardware we need to care about when we build our database system. Then we'll talk about different processing models for how to move data between operators, and then we'll finish up talking about different parallel execution models, which is basically how we architect the system to support parallel execution of multiple operators at the same time. So there was this paper from over 15 years ago, from 2005, proposing an improved version of MonetDB, which was one of the original open-source academic in-memory column-store systems. They showed how existing database system implementations were not targeting what modern superscalar CPUs look like, and that if you redesign the architecture of the database system in a way that is amenable to how the CPU actually processes instructions, you can get much better performance.
So the way to think about this is: a lot of times when people build software, they build it in such a way that makes it easier for humans to reason about what the code is actually doing. It's less common now, but in 2005 this was very common, and in this paper they looked at MySQL and Postgres and Oracle and showed this was the case. It turns out the way that's easier for humans is actually bad for what CPUs want. What they're proposing here is that if you design the system in a way that may be more complicated for mere mortal humans to reason about, but is actually better for the CPU, you're going to get much, much better performance. So MonetDB/X100 was the name of their prototype. This later got renamed and commercialized as VectorWise. And then Actian, which is a holding company for old software, bought VectorWise and renamed it to Vector. They then killed it off for a couple of years, or at least they hid it on the web page. Every time I taught this class, I would say, oh, VectorWise is great, too bad Actian killed it off. And then some dude emailed me a year or two ago and said, hey, no, no, no, it's still around, here's the web page where it actually lives. The website went out of its way to hide where you could actually download the software; it was really bizarre. But then they came out and rebranded it as Avalanche, and they're selling this as an in-memory OLAP cloud system that runs in AWS or Azure. In previous years, I had assigned this paper. The first part is really good; the second part, not so much. And it's a bit dated, right? It's 15 years old. They spend a lot of time talking about Itanium CPUs, which don't exist anymore, and they talk about how Pentium 4 pipelines had 31 stages. But modern CPUs have much shorter pipelines. As of 2015, when Haswell and Broadwell came out, it was 14 stages. Even the latest one from Intel, Coffee Lake, is 14 stages; the AMD Ryzen is 19 stages. So building your system to deal with these really long pipelines the way they describe still matters, but not to the same extent it did back then. The other reason this paper is super interesting is the authors, Peter Boncz and Marcin Żukowski. Peter Boncz went back to CWI and he's working on DuckDB now. Marcin went off and co-founded Snowflake, and a lot of Snowflake's design is based on how VectorWise laid out the system. So although VectorWise is not that well known, Snowflake certainly is super well known now, and it's the rightful successor of VectorWise. So this is going to be a crash course, in two slides, of everything you need to know about CPUs that matters for databases. As I said, a CPU is going to execute instructions in terms of these pipeline stages; the Intel CPUs have 14 stages and the AMD has 19. They're not in the hundreds; it's pretty short. The idea of these pipelines is that the CPU tries to stay busy at all times by executing multiple instructions at the same time, pipelined on different parts of the CPU. That way, if one instruction has a cache miss and has to go out to DRAM to get some data it needs to process, in that same cycle the CPU can execute an instruction that maybe already has its data in registers and can execute efficiently.
So that's going to allow us to hide all the delays from these cache misses. And the CPU does this by executing the instructions out of the order in which they were loaded into the pipeline. What that means is: when you write your code, the compiler turns it into CPU instructions, the machine code; the CPU processes that stream of instructions and loads them into the pipeline, but it may not actually execute them in the same order they were defined in that stream. It does track things like dependencies, to know when the output of one instruction is used as the input for the next instruction, and make sure those execute in the correct order. Again, these are called superscalar CPUs. And because we're aware of what the hardware is actually going to do, we're going to try to build our database system's execution engine, where we actually process tuples, because that's the most expensive part, in such a way as to mitigate problems that can occur from the mispredictions the hardware is going to make. So the first problem is dependencies. As I said, if you have an instruction whose input depends on the output of another instruction, the CPU can't execute the second one first; it has to wait until the first one finishes. There really isn't much we can do in our database system to avoid this. When you think about it, if I need to scan a tuple and the output of that scan is then put into a buffer, I can't put the data into the buffer until I actually do the scan on the tuple. So again, there's not much we can do to avoid this problem. The one we are going to try to avoid is branch mispredictions. As part of executing multiple instructions in a single cycle, when the CPU sees a jump statement, like a jump to a branch based on some conditional, rather than waiting to see what that conditional actually evaluates to and then determining whether to go inside the if clause or around it, it will try to predict which path you're going to take and start executing the instructions that follow on that path. If it gets it wrong, it has to throw away everything it's already done and go back and refill the pipeline with the correct path it should have gone down. If it gets it right, then this is fantastic: we basically did speculative execution, assumed we were going to go down this branch, got it right, and all the work we did ahead of time is actually useful to us. And again, this allows us to reduce the number of cycles it takes to process queries. How the CPU actually does the branch prediction, from our perspective as people building database systems, we don't know, we don't care, and unless you're working at Intel or AMD, you're also not going to know, because this is one of the most secret parts of the CPU. The simplest thing you can think of is: you see a branch, you've been down that branch before, so you just take the path you took the last time. That's the easiest, dumbest thing to do, but obviously at Intel or AMD they're doing way more complicated things, and what they're actually doing is a trade secret, all right?
But we don't care what it actually is; we know that there is something doing branch prediction and we can design our code around it. Yes? Has this been heavily impacted by, I forget what the bug was, the thing where you can exploit branch prediction timing? Yeah, so the question is, is this part of the problem people were hitting with the Spectre and Meltdown stuff? This is part of it, yeah. As I said, because we have long pipelines, the CPU is going to start executing speculative branches for us. And again, the reason we want this is that it hides the stalls: the CPU takes a bunch of pending instructions and says, we think we're going to execute these, let's go ahead and do it. If it gets it right, things are fantastic, and that reduces my cycle count per instruction. So where we're going to hit this the most for analytical queries is as we start scanning tuples, we have to evaluate the predicates in our where clause to decide whether to put something in our buffer as the output for this operator. And that's just an if clause, right? So if I have a billion tuples in my table and I'm doing a complete sequential scan on that table, I potentially have a billion branches my CPU has to predict. And when you think about it, unless I'm pre-sorting my data like Vertica, the attribute values could be completely random as I do my predicate evaluation. So the CPU has no way to predict what branch you're actually going to take, because the probability that you take the same branch as last time depends on what the data actually looks like and the selectivity of your predicate. We'll see in the next slide how to design our database system to avoid this. In modern C++, in C++20, you can pass a hint to the compiler, the [[likely]] and [[unlikely]] attributes, on an if clause or a case statement, where you say I am or am not likely to go down this branch. So you could imagine, if you're doing codegen, which again we'll talk about on Monday, you could start inserting these hints if you know your predicate is not going to be selective and therefore most tuples are not going to satisfy the predicate and get put in the output buffer. You could inject these things to tell the CPU this is not going to happen. But even without the hints, the CPU can do a pretty good job in that case: if your predicate is not selective at all, it can still do really well. This came up in project one; somebody was asking about branch prediction, so I just want to point out there's a way in C++20 to make this happen. Modern GCC and Clang don't have that exact keyword in older standards, but they have an intrinsic, __builtin_expect, that lets you do this as well. So let's look at an example. Yeah, sorry? What is the compiler doing in this case? The compiler then, I believe, emits a hint in the x86 instruction stream before the conditional to say I'm not likely to go down this path. It's a hint, right? All I said was likely; I'm not saying with what probability, and the CPU may or may not take it into consideration. What it actually does when it sees this hint, who knows, only Intel knows, all right? All right, so let's say we get a select query like this with two predicates: where key greater than or equal to some low value and key less than or equal to some high value.
Normally I tell my PhD students don't show code in slides, but for this example we have to see some code, and it should be pretty simple to understand. A really simple way to implement that select query is just a for loop over every single tuple in the table: go grab the key we want to evaluate in our where clause and then apply our predicate. If key is greater than or equal to low and key is less than or equal to high, then copy that tuple into our output buffer and increment our offset in the output buffer, then come back around and start on the next tuple. The bad part about this query is obviously this if clause, because, as I said, unless the key is almost always the same value, it could be completely random whether I go down this path or not, and the CPU is going to have a hard time predicting it. I could put likely or unlikely on it, but again, that's just a hint and may not actually represent what the data looks like. So this is how to do the scan using branching. We can rewrite it, though, to do a branchless scan, where we're still going to iterate over every single tuple, but the very first thing we do is copy the tuple into the output buffer. We don't evaluate the key first; we just copy it. Then we have two little ternary-style comparisons here where we check the key against our where clause predicate: if the predicate matches we get a one, if it doesn't we get a zero. We AND the two results together, and that tells us whether we increment our offset or not. If either is zero, the offset stays put, so when I loop back around, the next copy just overwrites the tuple that didn't match. And then I need some code down here to make sure that if the last one didn't match, I don't include it in my output, but for simplicity I'm not showing that here. What's going to happen is, even though these look like conditionals, the compiler can turn them into arithmetic operating on the integers directly. It's a bit more involved than what's on the slide, but we can check the actual bits and compute the offset using arithmetic and bit operations, so we can rewrite the instructions to just do the math to determine what's going on, with no branches. Now, this seems like it would be terrible, because for a billion tuples I'm copying all one billion of them, whereas in the branching case I'm only copying the ones that actually match. But depending on the selectivity of the predicate and what our data looks like, this might actually be the better approach. So here's a graph from the VectorWise folks later on, showing a scan on a table with the branching and the branchless versions of that scan as I showed before, where the x-axis varies the selectivity of the predicate. Over here, when the selectivity is low, meaning most tuples are not going to match: when no tuples match, the branching one actually performs the best, because I check my conditional, it doesn't match, and I'm not wasting instructions doing the copying. Furthermore, the CPU is going to learn that almost every single tuple fails the if clause, so it predicts to go around it and skip the copy as well.
But above around 5% selectivity, it actually starts doing worse than the branchless case. The branchless one is a flat line because no matter whether the predicate evaluates to true or not, I'm always paying the penalty of doing the copy; that's why it's almost a plateau straight across. And as you can see up here, around 50% selectivity, where 50% of the tuples match and 50% don't, that's the worst-case scenario, because the CPU just can't predict this at all. Only when you get down here, when basically 100% of tuples match, does the branching version catch up. Yes? But this graph will be very dependent on your memory latency, right? Because you have changed the bottleneck from branch misprediction to memory. Your question is... the new bottleneck is memory. So you're saying this graph will look different if you change the memory bandwidth? Yes, because the new bottleneck is now the memory. Well, the memory for what? For the scan? Memory bandwidth. Yeah, but I have to do that scan in both cases. But you have to copy in only one. Yeah, but think about it; going back here, I do this copy, right? If the tuple doesn't match, I come back around, and when I do the next copy, I'm overwriting the last place I copied into memory, and that's going to be sitting in my CPU caches anyway. But there's always one transfer, right, that goes through memory. The transfer of what? To do this copy? This is going to be in CPU caches, right? I read the tuple, I bring it into my cache; I have to do that for both versions. Then I do this copy; that's writing from one cache location to another cache location, probably in L1. It's the next instruction after the read: I bring the tuple into my cache, then I copy it, and that write goes to another cache location. Then I loop back around, and if I'm overwriting the last one I copied, maybe that location got flushed from L1, but I think it's unlikely to get flushed from L2 or L3. This is a tight loop; there's nothing else we're doing, we're just evaluating this one predicate. There's no guarantee, because we can't control the caching policy of the CPU; on x86 you can give it hints, but it's not required to follow them. Plus the compiler does all those other things like inlining and loop unrolling; all of those could also be in play. You write this code; it doesn't mean it will be executed exactly like this on the CPU. Yeah, so his point, which I agree with, is that I don't know what other optimizations the compiler could be doing here. It could unroll the loop so it does, like, four copies at a time and then comes back around. So again, this is what systems research is, right? The hypothesis is that with branching, when most tuples don't match, the penalty of paying for that copying is not worth it, because you're executing extra instructions you maybe didn't need to. And I agree there are a lot of other things that could be happening here that the compiler could be doing.
But we can verify that this is actually working out the way we think it is based on these results. Yes? So actually, I guess I have a question. This is mostly for the case of a table scan. Does this significantly change depending on the access method? I imagine it does, because if you're doing an index scan, where you only walk along the leaf nodes of a B+ tree or something like that, then you don't have all these adjacent tuples. Yeah, so his question is: this works fine if you're doing a table scan because you're just going to rip through contiguous regions of memory. There's a memory prefetcher that, if it sees we're accessing things sequentially, will bring those things into the CPU caches, so we get a lot of benefit from scanning columns. If it's an index scan, there's not much I can do as I probe down, other than maybe prefetching memory. But once I get down to the leaf nodes and start scanning along them, am I going to get the same kind of benefit from something like this? For data in the same node, if I'm evaluating tuples in the same node and I don't have to go look at the remaining parts of the tuple, I think this would still work. If my index is on this key but there's another part of the where clause that touches another attribute, where I then have to go look up the full tuple, then this is probably not going to be as big of a win. Yes? So, about the likely clause. What clause? Likely and unlikely. Yes. Those special attributes, are they more about moving the code into locations that are better for the cache? Your question is: if you use likely in this case here, where our selectivity is 50%, likely and unlikely don't help us. So in that case, could the compiler rewrite the code, or do what, sorry? No, I was saying that maybe likely and unlikely don't actually influence how the underlying CPU works, but maybe the compiler can move the more likely block directly after the if statement. So it's basically like, if I have if-then-else, do I get any benefit from, say, when the else clause is the one that's most likely to happen, flipping them, doing whatever Boolean logic reverses the condition so that the block that comes right after the if is the likely one? I actually don't know what they do. Because then our instructions will be in cache. Yeah, yeah, that's your point. I actually don't know. Actually, that's a good point; going back to the documentation: it applies to a statement to allow the compiler to optimize for the case where paths of execution including that statement are more likely than any alternative path of execution that does not include it. So yeah, this is not very well documented, but maybe that is what it's actually doing. And if you know which block you're going into, maybe that can help you decide whether to inline the function or not. Correct, yeah. So his point is, if you know whether you're going to go down one branch or not, you can make a decision about whether to inline. Yeah, compilers are a whole other beast.
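To make the two scan variants from the slides concrete, here's a minimal sketch in C++. This is my own reconstruction, not the exact slide code; the function names are hypothetical, and it assumes an in-memory column of 32-bit keys with `out` sized to hold every tuple id:

```cpp
#include <cstdint>
#include <vector>

// Branching scan: copy a tuple id into the output only if the predicate
// matches. One hard-to-predict branch per tuple.
size_t ScanBranching(const std::vector<int32_t>& keys,
                     int32_t low, int32_t high, uint32_t* out) {
    size_t j = 0;
    for (uint32_t i = 0; i < keys.size(); i++) {
        // [[likely]] / [[unlikely]] hints would go on this branch in C++20.
        if (keys[i] >= low && keys[i] <= high) {
            out[j] = i;
            j++;
        }
    }
    return j;
}

// Branchless scan: always copy, then advance the output offset only when
// the predicate matched. The comparisons evaluate to 0 or 1, so the loop
// body is pure arithmetic with no branch for the CPU to mispredict; a
// non-matching tuple just gets overwritten on the next iteration.
size_t ScanBranchless(const std::vector<int32_t>& keys,
                      int32_t low, int32_t high, uint32_t* out) {
    size_t j = 0;
    for (uint32_t i = 0; i < keys.size(); i++) {
        out[j] = i;                                 // unconditional copy
        j += (keys[i] >= low) & (keys[i] <= high);  // 1 if match, else 0
    }
    return j;  // slot j may hold a leftover non-match; it is excluded
}
```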
I'm obviously not a compiler expert; we'll talk about compilers in the context of databases in the next class. That's a good point. Okay. So the other thing we want to avoid, which is the first goal we talked about, is using more instructions than necessary. If we design the system to be general purpose, a lot of times what you'll see in these databases are giant switch clauses: as you evaluate predicates or evaluate tuples, for every single data type your database supports, you have these switch clauses. If I'm adding an integer to an integer, do this; if I'm adding it to a float, do that. All of that adds branches, all of that adds misprediction that slows us down, and it's simply more instructions to evaluate the conditionals that decide what our type of data is. I showed this example before in the intro class: if you go look at the Postgres source code for how they handle numerics, their fixed-point decimals, this is just the function to add two numerics together. You see there are all these if clauses: if it's negative, if it's positive, if we're taking the absolute value of something. All of this is problematic because it's a lot of instructions and a lot of chances for the CPU to get it wrong. We'll see on Monday next week how to specialize our code, generating exactly one block of code for the one query you're running. That's called code generation or just-in-time compilation. All right, so now we know what the CPU looks like and how we want to design our operator implementations to be mindful of it. Now we want to talk about how we organize the system to process queries that comprise multiple operators. Last class was about how we actually schedule the tasks that execute these operators; now we're talking about a higher-level concept. There are going to be different trade-offs depending on what type of workload we want to support, OLTP transactions versus OLAP analytical queries. The models are: the iterator model, which is the most common, general-purpose one; the materialization model; and the vectorized model. Again, the spoiler is that in a modern analytical system we want to use the vectorized model, where we pass chunks of columns between operators and then use SIMD instructions inside the operators to execute them efficiently. The iterator model is the common one. This is sometimes called the Volcano model or the Pipeline model; Volcano was a paper that came out in the late 1980s, early 1990s. In addition to defining the exchange operator, which we'll talk about in a second, there's also the Volcano query optimizer, which we'll cover later in the semester, and which was the precursor to Cascades, another optimization framework I'll talk about. Goetz Graefe was also the guy who wrote the modern B+ tree survey that I sent you for the first project. The guy does amazing stuff; he's very influential. So in the iterator model, every single operator implements a next function, and when someone calls next on that operator, it has to return back one tuple.
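As a rough sketch, here's what that next interface might look like in C++. This is a simplified, hypothetical version (real systems carry schemas, expression trees, and memory management), just to show the shape of the interface:

```cpp
#include <optional>
#include <vector>

using Tuple = std::vector<int>;  // simplified: a tuple is just ints

// Hypothetical iterator-model operator interface.
struct Operator {
    virtual ~Operator() = default;
    // Returns the next tuple, or std::nullopt when exhausted.
    virtual std::optional<Tuple> Next() = 0;
};

// A filter operator pulls one tuple at a time from its child.
struct Filter : Operator {
    Operator* child;
    bool (*predicate)(const Tuple&);

    std::optional<Tuple> Next() override {
        // Keep pulling from the child until a tuple satisfies the
        // predicate or the child runs out of tuples.
        while (auto t = child->Next()) {
            if (predicate(*t)) return t;
        }
        return std::nullopt;
    }
};
```

The key point is that every call to Next returns at most one tuple, so a plan over N tuples with M operators pays on the order of N times M virtual function calls.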
Since we organize the operators into a tree, when an operator needs data from its children, it calls next on its children, and that cascades down to the bottom, where you have the access methods retrieving the data from the table or an index, and then we move the tuples up. Essentially, we keep calling next and next and next at the root and keep getting tuples, which we emit as output for our cursor, until at some point we get a response saying there are no more tuples available. At a high level it looks like this. Say we have a two-way join on tables R and S. For every single operator, we have this little for loop that iterates over some input data it's getting either from the table or from its children, and then emits results up. Again, think of these as just the next function. At the root we call next on our child; we come down here to this next block, which wants to do the join. It iterates over its left child: come down here, say give me the next tuple you have, and this sends back up a single tuple. We keep doing this until it's finished, we get all the tuples, and we build our hash table. Then we fall down to the next for loop, call next on the other child, and it's the same thing: it starts moving tuples up from the S table, we probe the hash table, and anything that matches gets shoved up here. All right? Yeah, sorry? So would we switch out and do something else in the meantime? Say R involves more processing than just emitting tuples. So we're not there yet; that's a question about what the CPUs are doing with the processing. Think of this, just for simplicity, as one thread calling next on the root, which calls next on its child, and there's one instance iterating one tuple at a time. Right, so it's a blocking call: when I call next, the caller is not processing, it's waiting, okay? So this is used in almost every single database that you've ever heard about. The advantage of this is that we can do pipelining, meaning, in this case, as we emit a tuple out of S, we can ride it all the way up, because everything we need to process a single tuple is available to us at the point in time we call next. I can take this tuple, evaluate the predicate on it, probe the hash table, and if it matches, go up here and do the projection, and then produce that tuple as output. So that's considered a pipeline: we can take one tuple and ride it all the way up to, in this case, the output of the query, or, if there's a pipeline breaker where we can't go any farther, it gets buffered up there. Right, and that's what I'm saying too: some operators have to block until their children emit all of their tuples. The other advantage you get from the iterator model is that output control is really easy, because we don't necessarily have to push down any logic to do limits: if I know I only want 10 tuples from my query, I just stop calling next on the root once I've seen 10 tuples come up. And down here, this is just a sample of the databases that use this approach.
These are the ones I can confirm by looking at the documentation or looking at the source code. They're doing the iterator model; that's the canonical way. Is anyone doing it with compilation? Again, this is different from compilation; it's orthogonal. You can still do compilation, right? I could take all of these for loops, compile them, and run that, and it would still be using the iterator model. There's another distinction we need to talk about as well: this is a top-down approach, meaning I start at the top and pull tuples up. The reverse would be a push approach, where you start at the bottom, the for loop starts here rather than here, and you start emitting tuples up. Hyper does that. Most systems do the top-down iterator model; that's the most common one. Doesn't it seem like we're paying a lot for all these function calls? Yeah, that's next class. I realize it's kind of lame for me to give this lecture and keep saying, oh yeah, here's something cool, next class. Tuples first, and then we'll get to that stuff. All right, so the next model: the idea here is that rather than having a next call that gives back a single tuple, every operator materializes all the tuples it's ever going to produce, all at once, and then shoves that to the next operator. I could do this top-down, where you call next and instead of getting one tuple you get everything all at once, or I could push from the bottom up: run the operator, get the output, and shove it up to the next guy. In the materialization model, you can materialize either entire rows or single columns. If you're doing analytics, it doesn't make sense to materialize all the attributes of a tuple in your output buffer if you know the rest of the query is never going to need those columns. And you can do this whether you're a column store or a row store. So going back here, instead of the next function, every single operator now allocates an output buffer, keeps filling it up with tuples that match, and when it's done, shoves it up. Someone comes from the top down: I call the output function on this guy, he calls the output function on this guy, he puts all the tuples he has in a buffer and hands it over. Now this one can iterate over that output buffer and build its hash table, and then we do the same thing for the other side and shove data up. So in this example, this is the naive implementation, and it's obviously really stupid, because for some of these things I can actually combine together what the operators are doing. In this case, what am I doing? I'm scanning table S, putting all the tuples from S into my output buffer, and then passing that output buffer to this operator, which just iterates over it and evaluates my predicate. A better idea is obviously to combine these two operators together so I do one pass: scan the table, evaluate the predicate, and if I see a match, put it in my output buffer.
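Here's a minimal sketch of that fused scan-plus-filter in the materialization model. The function name is hypothetical, and it assumes an in-memory column of keys; one call produces all matching tuple ids at once and hands the whole buffer to the parent:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical fused scan+filter in the materialization model: one call
// materializes ALL matching tuple ids, then the parent operator iterates
// over the returned buffer and never calls back into this operator.
std::vector<uint32_t> ScanFilter(const std::vector<int32_t>& keys,
                                 int32_t low, int32_t high) {
    std::vector<uint32_t> out;
    for (uint32_t i = 0; i < keys.size(); i++) {
        if (keys[i] >= low && keys[i] <= high) {
            out.push_back(i);  // materialize the match
        }
    }
    return out;  // the entire result, shoved up to the next operator
}
```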
And you could do the branchless one or the branching one there; it depends on the implementation. So all of this, again the naive example, seems really stupid, and you wouldn't want to do it that way; there are optimizations you can make to go faster. You can do other things too, like if I know I have a limit clause up above and I only want 10 tuples, I can push that down as needed. So it's my opinion that the materialization model is the best for OLTP workloads, because those queries only want to touch a small number of tuples at a time, so the size of the output buffer that I'm shoving up to the next operator isn't going to be that big. Go get Andy's account record on the Amazon website: that's one tuple that I need to move from one operator to the next. And we benefit from this because we call that output function once, we get all the tuples we're ever going to need from the operator, we never go back to it again, and we shove the data along. That reduces the number of function calls, which are expensive because they're jumps in our instruction stream that the CPU has to execute. So for OLTP, I think this is the right way to go, and when we built H-Store, which was then commercialized as VoltDB, this is how we did it. MonetDB did this as well, although they were trying to do it for analytics, and so there were a bunch of papers they had to come up with to overcome the issue of trying to materialize all the data at once in memory. HyRise, a German academic system, originally did this as well; the old version used this approach, the new version does not, they rewrote it to use the vectorized model. And then, surprisingly, I think this is true, Teradata does the same thing, but they're a massively parallel data warehouse running OLAP queries, so they have a bunch of work they have to do to push down as much as possible and basically inline operators within each other to avoid having to move data wholesale from one node to the next. I need to double check this, but I'm pretty sure this is how it works. It's less surprising when you remember Teradata was designed around 1979, before the vectorized model came along. All right, so the last one is the vectorized model. This seems sort of obvious to us now, but before the X100 paper from Peter Boncz and Marcin, people just didn't build database systems this way; they did either the materialization model or the iterator model. The vectorized model is basically like the iterator model in that you have this next function, but instead of sending a single tuple, which would be expensive if we have to scan a lot of tuples, we send a batch of tuples. The size of the batch is going to depend on what the hardware looks like and on whether the operator we're feeding it into can do vectorized execution with SIMD. If we know the size of our SIMD registers, we can make decisions about how big our batches should be. So going back to our example, now slightly more complicated: we still have our next function, but when we call next, instead of getting back a single tuple, we get back a batch of tuples. And then inside the kernel here, in our for loop, we can use vectorized instructions to execute those things in parallel efficiently.
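A minimal sketch of that batched interface, with hypothetical names and a single-column batch for simplicity; the inner loop is written as scalar code, but it's exactly the kind of tight loop a compiler can auto-vectorize or that you'd hand-write with SIMD intrinsics:

```cpp
#include <cstdint>
#include <vector>

constexpr size_t kBatchSize = 1024;  // sized from SIMD/cache info at startup

struct Batch {
    std::vector<int32_t> keys;  // one column, up to kBatchSize values
};

// Hypothetical vectorized-model operator: Next() hands back a whole batch
// instead of one tuple, amortizing the per-call overhead over the batch.
struct VecOperator {
    virtual ~VecOperator() = default;
    virtual bool Next(Batch* out) = 0;  // false when exhausted
};

struct VecFilter : VecOperator {
    VecOperator* child;
    int32_t low, high;

    bool Next(Batch* out) override {
        Batch in;
        if (!child->Next(&in)) return false;
        out->keys.clear();
        // This tight loop over the batch is where SIMD applies: evaluate
        // the predicate on many keys per instruction.
        for (int32_t k : in.keys) {
            if (k >= low && k <= high) out->keys.push_back(k);
        }
        return true;
    }
};
```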
Take the batch, apply all the predicates with a single SIMD instruction, like that. And the same thing on the other side. This is ideal for OLAP queries, because we're reducing the number of invocations per operator and we're moving tuples from one operator to the next in such a way that we can execute vectorized instructions very efficiently. Most analytical database systems built in the last 10 years follow this approach. SQL Server and DB2 and Oracle, if you get the regular general-purpose row-store version of those systems, they all use the iterator model, but they have specialized execution engines: the fractured-mirrors stuff we talked about for Oracle, and DB2 has this accelerator called BLU. These are standalone copies of the data with vectorized execution engines, and they get better performance. CockroachDB is actually not an analytical system, but surprisingly they have a blog article showing that they have a vectorized engine. And in our new system that we're working on here, everything is vectorized as well. Okay? Yes? Is the size of the vector aligned to the SIMD register size, or is that decided later? The question is: is the size of the vector aligned to the SIMD size? In our system it's aligned to the SIMD size, but it's not exactly the SIMD register size. If your SIMD registers are 512 bits, you don't want to pass around only 512 bits; you'd pass around maybe ten chunks that are each 512 bits, so that you can take a chunk within that vector and do the vectorized execution on it. Typically the way that works, and this is what we do in our system, is when you turn on the database system, it reads information from the CPU, like what your SIMD register sizes are and what your cache sizes are, and then you make a decision about how to size things. It's usually just heuristics. Okay, and as I already said, in all my examples I was showing top-down, but you can actually go bottom-up. Top-down is the most common because it's the textbook implementation of how people build database systems. The paper you'll read for next week, from Hyper, shows that you actually want a bottom-up approach, because now you can be very careful about how you organize the code that executes queries so that within your pipeline you're keeping the values from tuples within your CPU caches. They go even lower level and try to keep things in the CPU registers, because that's even faster than the L1 cache. In our old system, we had a technique called relaxed operator fusion, which I'll cover in later lectures and which is in our new system as well. The idea is to combine the bottom-up approach with vectorized execution, to get the best of both worlds of Hyper and VectorWise: we pass vectors and go bottom-up. And you need to be careful about where you put your buffer boundaries, so that you stage things at a size you can pass along and that fits in CPU registers, sort of what he was saying. It may not always be at the pipeline breaker point; it may be at points within a pipeline. Again, we'll cover that more later this semester.
All right, so now, regardless of what processing model we're using, let's talk about how we actually want to run queries in parallel. We've already talked about inter-query parallelism; that was the scheduling stuff from last class. The idea there is to allow multiple queries to run at the same time, so you basically have a scheduling mechanism to decide what task or what query runs. Hyper originally could only run one query at a time; when that query was done, it switched to the next one. In their new system they don't do that, and most systems don't, because you want the system to be as responsive as possible. Sometimes you have a fast query queue and a slow query queue; there are different techniques. And then we use concurrency control to protect the data when queries are updating things. This is a conjecture of mine, I can't prove it, but I've thought about it for a bit: I don't think the complexity of implementing a given query processing model changes if you're using different concurrency control schemes. Meaning, whether I'm doing two-phase locking or OCC, it doesn't matter whether I'm using the materialization model or the vectorized model. I think they're isolated from each other enough that it doesn't really matter, because it's in my access method, when I go fetch a tuple, that the system checks whether I can actually read something, whether somebody else has a lock on it. None of that matters for this. What we care about here, though, is intra-query parallelism: how we take a single query and execute its operators in parallel. Again, the scheduling stuff, how we assign these tasks to cores, was last week; now it's basically how we organize the flow of data within our query plan so we can determine whether a given task is allowed to run yet. The two approaches to intra-query parallelism are horizontal parallelism, which is intra-operator parallelism, and vertical parallelism, which is inter-operator parallelism, where different parts of the query plan execute at the same time. Again, I need to be mindful of what data dependencies exist between these different tasks to determine whether I'm okay to run them in parallel. And although I'm laying these out as separate approaches, it's not either-or; you can combine them, doing horizontal and vertical parallelism together within the same query. There will also be, within every single operator, a parallel version of its algorithm that we can use to execute that operator; that's what we'll start covering next week when we talk about parallel joins and sorting. So for intra-operator parallelism, we take operators in the query plan and instantiate them as multiple separate instances, and all the instances of an operator perform the same function, just on different pieces of data. This is the morsel stuff from last time: break my table up into horizontal partitions, and the different instances run through different partitions at the same time.
So to coordinate these different instances running at the same time, what we do is insert an exchange operator into the query plan. It's basically a break point in the query plan, so that we can recognize that we can't proceed up the plan until all the operator instances below us have produced all the tuples they're supposed to produce. It's a way for us to organize the query plan and internally keep track of whether one set of operator-instance tasks is allowed to run or not. So say we have a simple query that joins A and B. We can take the scan on A and break it up into different operator instances, and each of these guys runs in a separate worker. But now, because I want to fuse the operators within my pipeline, I can do the filter as well: take the output of each of these scans and feed it directly into the filter operator to remove anything that shouldn't be there. I can also do other optimizations: if I have this projection up here that tells me I only need A.id for my query, and say A has a thousand tuples, then rather than copying a thousand full tuples from one operator to the next, I can push down a copy of the projection here to strip out everything except the A.id I need. Now for the join, they need to build the hash table, and I'm not declaring whether this is a single hash table or a partitioned one; it doesn't matter. But I know I can't do anything on the other side, I can't start scanning B and probing the hash table, until the hash table is actually built. Otherwise I could get false negatives: I could do a lookup and ask, does my hash table contain this key? It should, but the build just hasn't gotten to it yet. So I need to wait for these to all finish before proceeding with the next phase. That's what the exchange operator is doing for us: it's a way to coalesce the results from these different tasks running on different workers and keep track that I can't proceed until this is done. Then on the other side, I do the scan on B, same thing: I do the filter, do my push-down of the projection, then probe the hash table in parallel, and each of these guys produces output. But to make sure I don't produce the final result of the query until they've all finished, I add another exchange operator up here that knows it's waiting for three threads to give it all the results. Once it has everything, it can shove the output up above. Now, in my example here, I put the exchange operator after we build the hash table; that one is unavoidable, we definitely need it. But I could have set the query up so that there's an exchange operator here, where I do the scan, the filter, and the projection, and all my operator-instance tasks fill in an output buffer in that exchange operator; once they're all done, the workers can go do the join and produce results. So there are different alternatives for executing this query, and which one is best depends on the data: the selectivity of the predicates, the selectivity of the join clauses. The database system has to try to figure out the right way to do it; there's not one plan that works for everyone.
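To show the coordination role the exchange operator plays, here's a minimal, self-contained sketch (hypothetical names; real exchange operators also handle repartitioning, back-pressure, and so on). N workers run the same plan fragment over different partitions, and the parent may not consume the merged output until every instance signals it's done:

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical exchange operator: coalesces partial results from worker
// instances and acts as the barrier the parent waits on.
struct Exchange {
    std::mutex mu;
    std::vector<int> merged;     // coalesced results from all workers
    std::atomic<int> remaining;  // operator instances still running

    explicit Exchange(int n_workers) : remaining(n_workers) {}

    void Emit(const std::vector<int>& partial) {
        std::lock_guard<std::mutex> g(mu);
        merged.insert(merged.end(), partial.begin(), partial.end());
    }
    void Done() { remaining.fetch_sub(1); }
    bool Ready() const { return remaining.load() == 0; }
};

int main() {
    // Three horizontal partitions of a table, one worker each.
    std::vector<std::vector<int>> partitions = {{1, 2}, {3, 4}, {5, 6}};
    Exchange ex(static_cast<int>(partitions.size()));
    std::vector<std::thread> workers;
    for (auto& p : partitions) {
        workers.emplace_back([&ex, &p] {
            ex.Emit(p);  // e.g. scan + filter + projection over partition
            ex.Done();   // signal this operator instance has finished
        });
    }
    for (auto& w : workers) w.join();
    // Only now may the parent (e.g. the hash-join build) proceed.
    return ex.Ready() ? 0 : 1;
}
```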
All right, so now for inter-operator parallelism, the vertical parallelism: the idea is that we can overlap different operators running at the same time. We still use an exchange operator to keep track of whether a set of operator instances has produced all the results it's supposed to produce, but instead of doing one stage, then a bunch of work, then switching to the next one, I can have a thread start processing data as it arrives from the query plan below it and work in parallel. This is sometimes called pipeline parallelism, meaning I'm running the pipelines in parallel. So say I'm doing a Cartesian product across four tables: join A, B, C, and D with no join clauses. You wouldn't actually do this, but here it is. The query would look like this: I can run the join on A and B and the join on C and D in parallel. One thread does the join on A and B, filling up some output buffer here, which includes building the hash table; another thread does the join on C and D. And then for the join of their results, I can add a third thread that's just running at the same time, because in this case there's no where clause and no join clause to determine whether something should match. I'm doing a Cartesian product, so I want to take any tuple that comes out of C and D and match it with any tuple from A and B. I don't need to wait for those guys to finish; they can start shoving data out in parallel while the third thread computes the rest of the join. Okay, so let's finish up real quickly. The thing we didn't really talk about is how to determine the number of workers we're going to use. We talked a little bit last time about how to organize the scheduling mechanism, but we never really decided: okay, I have this number of cores and this number of tasks, so how many workers should I actually use? As I already alluded to in the horizontal parallelism example, it depends on what the data looks like, the selectivity of the predicates, and how much output data I'm going to generate. One simple way to do this, which is the Hyper approach, is to have one worker per core and pin each worker to the core it's running on. Another approach is to have multiple workers per core, which is the HANA approach; the idea there is that if one worker ever blocks, we let other threads run on that core at the same time. Last thing: push versus pull. Actually, ignore this; this is just the scheduling stuff, I don't know why this slide is here, sorry. It's just saying that in the case of Hyper, workers were pulling from a global queue, whereas in the case of HANA, things were pushed into queues and the threads had to take them. All right, so to finish up: as I said today, and as we'll see more on Monday, the easiest way for us as humans to implement parts of our database system may turn out to be the worst way for the CPU to actually execute it.
And so if we're aware of what the hardware looks like and how it's going to behave, and we may not know exactly how it's going to behave, but we can at least be mindful of it a little bit, then we can design the execution code for our database system to be closer to what the CPU actually wants. And as we'll see today and throughout the rest of the semester, it's my opinion that a vectorized, bottom-up execution approach will be the best way to execute OLAP queries. For OLTP, it'll be bottom-up materialization. But a lot of systems, like Postgres, MySQL, DB2, and Oracle, are designed to be general-purpose systems, and that's what the iterator model tries to be: sort of good for everyone. Yes? Why does it matter whether it's bottom-up or top-down? Fewer function calls, right? For bottom-up in the materialization model, I call the execution function for an operator, it runs, produces some output; I take that output and call the next operator's function, and so forth. If you're going top-down, it's call this, call this, call this. Actually, for materialization the number of calls is still the same, so it doesn't matter there. For vectorization, yes? So what you can do, imagine this being code-generated, is for bottom-up you call the bottom function, the leaf node, you save that output, and then you pass that output into the next function. So it's function, function, function, function, instead of a function that calls another function, that calls another function, that calls another function. Yes, he's right: it's iterative execution instead of recursive execution (there's a toy sketch of the contrast below). And as far as I understand, the best code you can give a compiler is straight-line code with no conditionals, everything one instruction after another, because the compiler can look at that holistically and make better decisions. Okay? A smaller call stack? You get a smaller call stack as well, yes, but that's typically not going to be an issue for query plans. I can't imagine a plan being a million function calls deep; if a query plan has a million operators, that's pretty unusual. I can't think of an example like that. It's not to say the query can't be big, but no one is doing a one-million-table join. I think the highest number I've seen was from a talk by the HANA people recently, where they had one query that was doing a join of 1,500 tables. That's a lot, but it's not a million. So it's not going to be the call stack; I don't think it's going to be that big. All right, so again, I realize I keep saying, oh, we'll cover this on Monday. The compilation stuff we will cover on Monday, and the paper I'm assigning is from the HyPer guy. And I do mean guy, singular: Thomas Neumann wrote that system and wrote the paper by himself, which is crazy. It's a bit dense because it shows the LLVM IR, which you don't need to follow completely. But the whole idea of what they're doing, organizing the query plan in such a way that it's ideal for the compiler to generate CPU-efficient code, is what the main takeaway should be, okay?
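Here's the toy sketch mentioned above, with made-up operators, contrasting the two control flows: top-down iterator execution recurses down a chain of Next() calls for every tuple, while bottom-up materialization runs each operator to completion with plain, one-after-another function calls.

```cpp
// Toy contrast between the two control flows (hypothetical operators).
#include <optional>
#include <vector>

// Top-down iterator (Volcano) model: each call to Next() recurses
// into the child, so every tuple costs a chain of virtual calls.
struct Iterator {
    virtual std::optional<int> Next() = 0;
    virtual ~Iterator() = default;
};
struct Scan : Iterator {
    std::vector<int> data{1, 2, 3};
    size_t pos = 0;
    std::optional<int> Next() override {
        if (pos == data.size()) return std::nullopt;
        return data[pos++];
    }
};
struct Filter : Iterator {
    Iterator* child;
    explicit Filter(Iterator* c) : child(c) {}
    std::optional<int> Next() override {
        while (auto t = child->Next())   // recursive descent per tuple
            if (*t % 2 == 1) return t;
        return std::nullopt;
    }
};

// Bottom-up materialization model: run each operator to completion,
// one plain function call after another, no per-tuple recursion.
std::vector<int> ScanAll() { return {1, 2, 3}; }
std::vector<int> FilterAll(const std::vector<int>& in) {
    std::vector<int> out;
    for (int t : in)
        if (t % 2 == 1) out.push_back(t);
    return out;
}

int main() {
    Scan s;
    Filter f(&s);
    while (auto t = f.Next()) { /* consume one tuple at a time */ }

    auto result = FilterAll(ScanAll());  // function, then function
    (void)result;
}
```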
Question, yes. I have a question about SIMD: it's single instruction, multiple data, so do those data points need to be contiguous? Yes, so for SIMD instructions, and again, we'll cover SIMD in way more detail, we'll have two lectures on it, the question is: do the data points you want to feed into a SIMD instruction need to be contiguous? The way it works is there's a vector register and you have to do loads into it. The data you're loading has to be in contiguous memory in the CPU cache, and then you load it in. I don't think you can do multiple loads from different locations into one register. So if the data is in different locations in memory, you have to copy it into a single contiguous location first and then load it in. But think about it: if I'm doing a scan on a column and applying a predicate with vectorized instructions, the column will all be contiguous for me anyway (there's a small sketch below of what a contiguous vectorized load looks like). Yes? What about the scatter and gather CPU instructions for vectorized loads? Yeah, so he said there are scatter and gather CPU instructions for this. One of them was for a while not really implemented in hardware; I forget which one, I think it's gather. There was a period where gather had to be implemented through multiple instructions and the hardware did that for you; as of maybe 2019 or 2020, that may have changed. We'll cover all that as well. Yes? Do you know about shared scans? Is that just the bottom layer sharing the same tuples? Yes, so I didn't talk about shared scans. A shared scan is when two queries show up at the exact same time and they're both accessing the same table. It's a bit trickier in an in-memory system, which is why I didn't cover it. For a disk-based system, the most expensive thing is fetching the page. So even if you and I are running at the same time with two completely different predicates, if we can share the disk IO to fetch the block in, and then we each take a copy of the block and apply our predicates separately, that's a huge win. For an in-memory system, I'll still get some benefit from the locality of bringing the data into my CPU cache; that still matters. But the overhead of coordinating the different predicates at the same time is tricky. Sometimes you can do tricks, and I think the HANA guys do something like this: if my query is WHERE A = 1 and your query is WHERE A = 2, the system converts that to WHERE A IN (1, 2), applies that predicate once, we both get the output, and then we each apply our own predicate again to get what we really want. So there are tricks like that. I don't think it's that common for in-memory systems, because it almost requires pinpoint precision: the queries have to show up at exactly the same time for this to work. In a disk-based system, I can piggyback: oh, you're reading these blocks, let me just come in where you started. For in-memory systems, I don't know that anybody actually does this, but that's the general idea.
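And here's the sketch of the contiguity point from the SIMD question above, assuming x86 AVX2 intrinsics (again, we'll cover SIMD properly in later lectures; this is just illustrative): eight 32-bit values are loaded from a contiguous column into one 256-bit register and compared against the predicate constant with a single instruction.

```cpp
// Sketch: vectorized predicate "col[i] > key" over a contiguous column
// using AVX2 (compile with -mavx2). Illustrative only.
#include <immintrin.h>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> col(32);
    for (int i = 0; i < 32; i++) col[i] = i;
    const int key = 20;
    const __m256i keyv = _mm256_set1_epi32(key);

    int matches = 0;
    size_t i = 0;
    for (; i + 8 <= col.size(); i += 8) {
        // One load of 8 contiguous 32-bit values; this is why layout
        // matters: scattered values would first have to be copied
        // (or gathered) into contiguous form.
        __m256i v = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(&col[i]));
        __m256i cmp = _mm256_cmpgt_epi32(v, keyv);  // 8 compares at once
        matches += __builtin_popcount(
            _mm256_movemask_ps(_mm256_castsi256_ps(cmp)));
    }
    for (; i < col.size(); i++)  // scalar tail for leftover values
        if (col[i] > key) matches++;
    printf("%d\n", matches);  // 11 values in [0,31] are > 20
}
```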
Yeah? Is that similar to when a disk-based system looks at the buffer pool and asks, can I evaluate my query using only the stuff that's in the buffer pool right now, rather than actually doing a complete table scan? Say there's a LIMIT clause, take a hundred tuples from a table, and I have a hundred of those tuples already in my buffer pool; I can just grab the pages out of the buffer pool. So his question is: in a disk-based system, for scan sharing, one technique would be, if I need to scan a table but I have a LIMIT clause so I only need maybe 10 tuples, rather than just opening a cursor and scanning the table, can I go peek in my buffer pool, figure out what I already have from that table, and if I have enough tuples, just process based on that? I don't think anybody actually does that. I think everyone just does the scan, because if a page is already in your buffer pool, when you do the lookup on that page you'll get a hit in the buffer pool anyway. I don't think anybody does what you're proposing. We should do that as a project in the intro class. The more common related thing would be covering indexes, where if I have all the attributes I need to process the query in the index itself, I avoid having to go scan the actual table. That's not really scan sharing, but it's probably the more common optimization. I don't know that anybody would peek into the buffer pool to see what's in there and whether there's enough for them, because you'd need to maintain some heuristics. Think about what you're proposing: it's a needle-in-a-haystack search. My query shows up, it's on table A, but most of the queries are on table B. So every single time I go look in my buffer pool, I'm only going to see pages from B, and I'm just wasting time; I might as well have just scanned A. So you'd have to maintain some heuristics and some kind of quick lookup to say, if you're looking for table A, you'll find some stuff in the buffer pool with this probability. I don't think anybody does that, and I don't know if there would be enough queries to make it worthwhile. For OLTP, it's: go get Andy's record; I follow the index, it takes me to the page I want, and I fetch that in. For OLAP, it's usually: scan the entire table. Things with LIMIT clauses would be like loading a web page, like Hacker News: you see the top 10 posts, or the most recent 10 posts. There's a LIMIT clause to make that work, but it's sorting based on the timestamp, and those tuples may or may not be on the same page. So it's an interesting idea, but I don't think there are enough queries to make it worthwhile, and the overhead of doing it would be too high. It's a good thought experiment, though. Anything else? Any other random questions about databases? Cool, awesome guys. All right. Monday's class will be compilation, so get started on the second project as soon as possible, and we will announce what machine you should test your concurrency stuff on. His question earlier, at the beginning of class, was about the first checkpoint: we will not check for concurrency, we'll only check for correctness. So if you wanted to, you could put a giant latch on top of the whole thing and just prove that you can do inserts and lookups correctly, okay? Because again, Gradescope only gives us a single thread, so we really can't hammer it too much. All right guys, see ya.