All right, so last class we talked about how to take a query plan, divide it up into pipelines and separate tasks, and then execute them. And we spent a bit of time focusing on the idea of NUMA awareness in our algorithms, so that we try to minimize the amount of non-local memory access. That won't be so much of an issue today. Today we're talking about vectorized execution; that also matters when we talk about query compilation next week, and NUMA will come up again when we talk about the parallel join algorithms, because there we spend a lot of effort making sure we maximize local memory access.

All right, so today's class is going to focus on vectorized query execution. The paper you guys read was a high-level introduction to the things that matter in a vectorized execution engine, and to what AVX-512 allows you to do that you couldn't do before. So we'll cover some quick background on what vectorization means and its consequences for databases, then we'll talk about different ways to actually implement it in your database system. Then we'll go over the fundamental building blocks for using SIMD that allow us to build the vectorized database algorithms. The paper you read talked about doing scans, hash table lookups, and I think a geospatial thing. Like I said, we'll skip the geospatial one, but we'll also go beyond the paper: we'll do the scans, we'll do the hash tables, and then we'll also look at histograms, because that comes from another paper.

All right, let's get started. It sort of goes without saying — I think everyone here has taken 418 or 618, right? Or no? — that everyone's mostly aware of what vectorization or SIMD is. But the basic idea is that we want to take an algorithm in our database system that would normally operate on a single tuple or a single piece of data at a time, and convert it into a vectorized form that allows us to process multiple pieces of data at the same time. And this is something we can rely on the hardware to provide for us, because it has these SIMD instructions that let us, within a single instruction, operate on multiple things at the same time. That's the fundamental building block we can rely on to build the more complex things we need in our database system. It's not enough to just add two numbers together; we want to do scans and other things. So we use those low-level SIMD primitives to build a vectorized execution engine.

And it goes without saying why this matters: it's additional parallelism that we can exploit in our hardware. Think of a really simple thought experiment. We have some algorithm in our database system — it doesn't matter what it is at this point — and we can parallelize the execution of that task across 32 cores. Now assume each core also has a four-wide, or four-lane, SIMD register. That means within each core we can run our algorithm on four elements in parallel. So we take our single algorithm, parallelize it 32 ways across cores, and then each core gets another 4x of parallelism. It's multiplicative: in total we get a 128x speedup over the single-threaded SISD version, the scalar version, of that algorithm. That's massive, that's a huge win.
That's why we're spending all this time trying to do this. Of course, the spoiler is that we're never actually going to achieve that, because it's the theoretical upper bound. In a database system there's a bunch of stuff we have to do that we're not going to be able to vectorize, so we'll never actually hit that number. But for the parts that are in the kernel — the for loops we're spinning through, processing tuples, the things that are very repetitive — vectorizing will be a big win. We're just never going to get that total massive 128x speedup; in practice we'll get maybe a 2x, maybe 4x speedup, all right?

So SIMD stands for single instruction, multiple data. This is an old concept that goes back to, I think, the 1960s — there's Flynn's taxonomy that distinguishes SISD versus SIMD and the other categories of CPU architectures. But in the modern era, at least in this century, in the last twenty-something years, pretty much all modern CPUs have architectural support for SIMD. x86 has a bunch of different variants. The first version was back in the 1990s, this thing called MMX — the Pentium era, like '95, '96, '97 — and it was really simplistic back then. I think it was 128-bit registers. And the problem in the old version was that whenever you switched into SIMD mode, it shut down the rest of the CPU, so you could only do SIMD and then SISD; you couldn't multiplex them. When the SSE stuff came out, then you could do things in parallel. POWER has its own thing, ARM has its own thing. And RISC-V — I mean, RISC-V is an open-source ISA, but they have a proposal for how they're going to do it. What's different about the RISC-V one is that all the others have specific instructions that depend on what the data type is and what the register size is, whereas the RISC-V one is more like: here's the instruction, here's what I want to do, and I don't necessarily care what the register size is — the hardware figures that out for me. But it's just a proposal at this point.

All right, so here's a really simple example of what SIMD does. You want to add two vectors together and produce an output vector Z. If you implement this in a procedural language like C, it looks something like the loop below. With SISD instructions, you're just looping through this for loop, going one at a time over every single offset in our two vectors, adding the elements together with a single instruction, and populating the output vector Z. The compiler can try to unroll this and do whatever other optimizations it wants, but at the end of the day it's one piece of data in, two numbers added, one output produced. With SIMD, what we do instead is batch together the entries of the vector. Say we have 128-bit registers and these are 32-bit integers, so we can store four elements — it's a four-wide, or four-lane, SIMD register. So now it's one instruction to take this chunk of X and this chunk of Y and produce a chunk of the output vector, and then we just move along to the next chunk and do the same thing. So previously, with eight entries, I needed eight add instructions to add the two vectors together. But with SIMD, I can do it in two instructions.
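To make that concrete, here is a minimal sketch of the scalar version in C (my own function and variable names, not the exact code from the slide):

```c
/* Scalar (SISD) version: one add per loop iteration,
 * one element in, one element out. */
void add_scalar(int *x, int *y, int *z, int n) {
    for (int i = 0; i < n; i++) {
        z[i] = x[i] + y[i];
    }
}
```

The SIMD version does the same work four lanes at a time; we'll see the explicit intrinsics for it in a bit.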
I'm ignoring how you get data from this location in memory into the register, then add it, then get it back out. That's overhead we have to deal with to take advantage of SIMD, and it's part of why we're never going to achieve that upper bound.

So there are two types of vectorization we're going to want to do. The first one is horizontal vectorization. The idea there is that we take some input and do the same operation across all the elements within that input register, and it produces a scalar output. Say I have a SIMD add instruction that takes an input vector of four elements: I add them all together and produce a single output somewhere. But in a database system, what's more common, and what we're going to want, is vertical vectorization. The idea there is that we perform an element-wise operation in parallel across values in different vectors that are in the same lane. Think of the lanes as just offsets: 0, 1, 2, 3. So at the first offset, I do the addition of the first element of one vector with the first element of the other, that produces an output, and the same thing happens going down the lanes. We'll see this in our algorithms; this is what we're going to try to do. And the game we want to play is to chain along whatever operations we want to do and keep everything in the SIMD registers as long as possible without having to go back to memory. Transferring data from register to register is really fast; having to go back to memory — which hopefully means L1 cache — and then back into a register is expensive. So we'll design our algorithms so that for the things SIMD can't do, like if clauses and branches, we play tricks like we saw before with branchless scans and predicate evaluation — same idea, but now we do it in SIMD.

All right, so what types of instructions are available to us? I've covered a lot of these already. You can move data in and out of the registers, and between registers. You can do all the arithmetic you can normally do in SISD in SIMD, and all the bitwise operators: ANDs, ORs, XORs, and so forth. You can do all the predicate evaluations you'd want: less than, greater than, not equal to, and so on. There are shuffle instructions that move data between SIMD registers based on certain patterns, so I can say I want the first element and the third element from this register to go into this other register. And there are instructions to transform data from the native x86 format into what the SIMD registers expect, as well as instructions for moving data out of a SIMD register directly to memory, bypassing the CPU cache. These are sometimes called streaming instructions. You do this when you know you're not going to look at some piece of data you just computed: you write it out to memory with a streaming store so it doesn't pollute your CPU cache, because you know you're not going to read it in the near future. Avoiding the CPU cache is useful for joins, because after I emit a join output tuple, I'm not going to immediately use it; I'm going to go back, get more tuples, and compute the rest of the join. The streaming stuff helps with that, and we'll see this next week when we talk about HyPer.

Yes, question — how do SIMD instructions handle exceptions, like what if I divide by zero?
So the question is: what happens if you divide by zero, how does SIMD handle that? Yeah, I actually don't know offhand — it's a good question. I think you'd get the same error as you normally would: you get an interrupt. Because in SIMD you can't have conditionals and you can't do jumps, right? It's pretty limited in what you can do — I say limited, they're very powerful constructs, but it's not a general-purpose CPU. There is a handler for it, but the handler is out in the CPU; it's not something inside the SIMD registers. Okay, let's take this offline. But again, SIMD is just the registers plus the instructions. Don't think of SIMD as a device like a GPU, where the GPU can have its own trap handling; control always has to come back to the main processor. And these things are happening in parallel as well — this is what a superscalar, out-of-order execution CPU can do. The CPU can be processing its regular instruction stream while the SIMD stuff is happening, and if there's a divide by zero, control has to come back to the CPU to figure out what to do next.

All right, so here's a quick summary of what the history of SIMD instructions, at least for x86, has looked like. And I misspoke: the first one, MMX, was 64 bits, not 128. It was very, very primitive. MMX doesn't actually mean anything — I think it's been retroactively said to stand for multimedia extensions. Intel is paranoid about getting sued, so they don't name anything after anything meaningful. All those names like Skylake and Crystal Lake are actual physical lakes, on a map, and they choose those names so nobody can sue them and say, hey, you stole my name — no, no, we got it from this physical lake. Same thing with MMX: they picked three random characters because it sounded cool and put it out there. And then someone did sue them claiming they stole the MMX idea, and in the court documents it came out that, no, it really was three random letters. Now, AVX — Advanced Vector Extensions — those names do mean something, but the original one didn't. And this is the state of the art right now, AVX-512. In the paper you guys read, they talk about it not just because the registers are wider, now 512 bits, but because Intel added additional things to the instruction set that make it better for doing database operations — the compress and expand stuff, which we'll talk about in a second. And this link here will take you to a great video, a few years old now, from James Reinders, who was one of the chief architects and evangelists for SIMD instructions at Intel. It gives you a big history of all the cool things you can do with SIMD. I think it's from 2016 or 2017, so I don't know if it covers AVX-512, but it's still really interesting to look at.

Okay, so what are the trade-offs of using SIMD? We're going to get huge performance gains if we can vectorize the most expensive core parts of the algorithms inside our database system. The downside is that actually leveraging it is going to be mostly a manual process.
It's not something the compiler is going to easily figure out for you in most cases — the compiler doesn't know what a hash table is, right? So we're going to have to write this code ourselves, and that's fine, because people pay a lot of money for people who write code like this. For our purposes it's fine, but for, say, the JavaScript programmer it's problematic; they're not going to be able to take advantage of this. SIMD also has restrictions on data alignment: we have to make sure data is in the right form to put it into the registers and actually use it. We'll see this in the work from a paper from Columbia: they simplify things by making a huge assumption that all your pointers are 32-bit integers and all your keys are 32-bit integers, because then everything fits nicely into these registers. In the real world we obviously know that's not the case; you can't assume that, and things like strings are always going to be a problem. And then there's an issue that isn't really an issue anymore but used to be: when I used to give this lecture in the old days, before the pandemic, getting data in and out of the registers and moving it between registers was a huge pain in the ass. Now, because of AVX-512 and the bitmask things the paper talks about, it's a lot easier.

All right, so AVX-512 is a set of 512-bit extensions to the AVX2 instructions. I don't know why they called it that instead of just continuing the numbering — AVX2 is 256-bit, and I don't think there is an AVX-1, right? So whatever, it's called AVX-512. In addition to the wider registers, you get better data conversion, scatter operations, and permutations, and then these bitmask operations let us do things we couldn't do before. But the challenging, or at least confusing, part about AVX-512 is that unlike before — when I showed that table with AVX, AVX2, SSE4, the different versions of the SIMD extensions — those were all or nothing, meaning if my CPU said it had AVX2, I knew I got all the AVX2 instructions. With AVX-512, for whatever reason, Intel decided to break it up into these different groups, and some CPUs have certain instruction groups while other CPUs have others. If you go to Wikipedia, you'll see this table: along here are the different Intel microarchitectures, and these columns are the different groups of AVX-512 instructions. Some chips have some of them, some have all of them. They always have to have the foundation one — that's the F. So if you look at your laptop — unless you're using a MacBook — you'll see AVX-512F, the foundation one, but everything beyond that is sort of random. There's another chart here showing what different chips have and don't have; this one only goes up to Skylake in 2017. So it's a bit of a mess. There's Linus complaining on the Linux mailing list about how he hopes AVX-512 dies in a fire, or whatever. But I did check with former students and friends at the big database companies — Databricks is using AVX-512, Redshift is using AVX-512 — though, I mean, they control exactly the hardware they're running on.
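Because support is fragmented like this, engines that ship binaries typically check what the CPU actually supports at startup and dispatch to the right kernel. Here's a minimal sketch of that check, assuming GCC or Clang (the builtin shown is theirs; MSVC or raw CPUID would look different):

```c
#include <stdio.h>

int main(void) {
    /* Pick a code path once at startup based on what this CPU reports. */
    if (__builtin_cpu_supports("avx512f")) {
        printf("using AVX-512 kernels\n");
    } else if (__builtin_cpu_supports("avx2")) {
        printf("falling back to AVX2 kernels\n");
    } else {
        printf("using scalar kernels\n");
    }
    return 0;
}
```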
But if you're trying to build a database system that people run on-prem, this could be a problem. So, all right. How do we actually implement this in our database system? There are basically three approaches: automatic vectorization, compiler hints, and explicit vectorization. You can think of these as being on a scale from what's easiest to use to what's hardest to use but gives you the most fine-grained control. And again, I confirmed with our friends in the database world: all the big vendors are using the last one. They're doing explicit vectorization with intrinsics. Nobody relies on the compiler for things they know should be vectorized; they're not going to let the compiler try to figure it out.

All right, so with automatic vectorization, the idea is that the compiler hopefully figures out whether the instructions inside a loop can be vectorized. It works for really simple loops, but that's going to be rare in database operators, because an operator isn't always just an "is something less than something" check inside a loop. We'll see how VectorWise tries to exploit this. But for more complex things, it simply won't work. And of course you have to have hardware support for it — although actually, if your hardware doesn't have the instructions, the compiler wouldn't throw an error here, because you're not writing explicit SIMD operations; it would just compile and you'd get the scalar version.

All right, so here's a function like the one I showed you before. We pass pointers to three vectors into this function, and we take each offset in X and Y, add them together, and store the result into Z. Can the compiler automatically vectorize this? You're shaking your head no — why? Yeah, he says X, Y, and Z can alias each other, meaning the compiler doesn't know what these pointers will point to at runtime. If Z just points to X plus one, then every time I write to Z, I'm clobbering something in X, and the result may be incorrect. What the compiler is basically trying to do before it vectorizes something is guarantee that the vectorized code produces the same results as the non-vectorized, scalar version. In this case, because we're dealing with raw pointers, it can't guarantee that.

So another thing you can do is add compiler hints that say, hey, you should try to vectorize this, or you shouldn't. It's basically like driving without a seatbelt: you're saying, I know what I'm doing, go ahead and vectorize this, it's going to be okay. And it's up to you as the developer, as the programmer, to make sure there aren't going to be problems down the road. Two ways to do this: give the compiler hints about what it can assume about the memory locations of pieces of data, or tell the compiler to ignore the dependency checks entirely. So in C++, there's a restrict keyword, which looks sort of like this.
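Here is roughly what that looks like on the same add loop (a sketch with my own function names; the exact spelling is compiler-specific — C99 has restrict, and GCC/Clang accept __restrict__ in C++):

```c
/* The restrict-style qualifiers promise the compiler that x, y, and z
 * point to distinct memory, so the aliasing concern goes away. */
void add_hinted(int *__restrict__ x, int *__restrict__ y,
                int *__restrict__ z, int n) {
    /* Alternatively (or additionally), a pragma such as
     *   #pragma GCC ivdep   or   #pragma omp simd
     * placed right above the loop tells the compiler to skip its
     * dependency checks and try to vectorize anyway. */
    for (int i = 0; i < n; i++) {
        z[i] = x[i] + y[i];
    }
}
```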
I don't know if it's actually in the C++ standard, but all the compilers support it. You basically add this restrict qualifier, and that just tells the compiler: don't worry about it, these pointers are definitely going to be different locations, everything is okay. Again, you're just giving it a hint that these things are definitely distinct. The other approach is to pass pragmas that say, for this loop here, ignore any dependency checks, because it's going to be okay. There are other pragmas too — in OpenMP you can write a pragma omp simd, and it basically does the same thing. You're saying: whatever checks you would normally make to be cautious, to make sure these things don't clobber each other, you can skip them and try to vectorize this. And again, you as the developer are telling the compiler it's okay. So if you write crappy code and pass in the same memory addresses, you're screwed, all right?

The last approach is explicit vectorization, and this is where we actually write the instructions — I was going to say functions, but they're not really functions — the actual operations that do the vectorization explicitly in our code. This can be done through what are called intrinsics. These are basically compiler built-ins that are just aliases for the actual instructions that perform whatever low-level operation we want. The problem with this is that it's not portable across CPUs, and definitely not portable across ISAs. If I take SIMD intrinsics and try to run them on ARM versus x86, it's not going to work. And as I showed before, even within the same ISA, x86, if I'm running on a CPU that doesn't have the AVX-512 instructions I need, it's going to blow up. There are libraries that hide some of this complexity. Google Highway is probably the most famous one — I don't know of any database system that's using it, but it's a way to get an abstraction layer, a programmatic API for using SIMD instructions without having to make explicit intrinsics calls, because the manual pages for these things are atrocious. This one — SIMD, unfortunate name — is probably the other most widely used one. EVE is a rewrite of another engine. And then, because he's here: if you want to use Rust, this is what you would use, but as far as I know it's still experimental. You're using the auto-vectorization? Sorry, what? Auto-vectorization — okay, so you're saying you're using the auto-vectorization and it works. Yeah. Wait, you see your name in the documentation — did you write it? Oh, it's just that there are so many of these types. Sorry, okay.

So here's how to do the same operation as before, explicitly with SIMD intrinsics. This double-underscore prefix is the syntax, the preferred prefix they use for intrinsics. This is saying: take my integer pointer and load it up into a 128-bit SIMD integer register, same for Y and Z, and now I'm changing my for loop to go over chunks of four. Then I do the addition of the two vectors, writing into the other register.
And then if I want the result back out, I've got to get it out of the register; it actually gets stored back into Z. Yes — in this example, am I assuming the memory is aligned? Yes. Keep it simple. Did we discuss memory alignment already? No? Okay — assume that it's aligned. If we need to cover memory alignment, we can do that in another lecture; we used to cover it in the in-memory database material, but now we don't.

All right, so as I said, the major database vendors use intrinsics, at least the ones running in the cloud, because they control exactly what hardware they're running on. If it's on-prem, it's a little dicier, and then maybe you want to rely on auto-vectorization for simple things or use one of those wrapper libraries.

All right, so now we understand how we're going to implement SIMD. Next, what are the fundamental building blocks — the things AVX-512 gives us explicitly — that will let us build up to more sophisticated operations like scans and hash table lookups? We'll first talk about masking. This was the big thing in the paper: AVX-512 added the ability to perform operations only on certain lanes, based on the results of previous operations. You could do this before in AVX2, it's just that there weren't explicit instructions or bitmask vectors for it; you had to do it manually yourself. Then we'll see permute; then selective load and store, which are the same idea but going in and out of memory; then compress and expand; and then selective gather and scatter, which again move data between registers and memory.

Okay, so AVX-512 adds these predication variants: for most operations you can now pass in a bitmask vector that says which lanes the operation should apply to. There are the non-predicated versions, and then the ones where you can pass the bitmask in — there you go, sorry. All right, so say we want to do an add again. We have the two vectors we want to add together, one and two here, then we have a mask that says which lanes we actually want to apply the addition to, and then we have this merge source vector that says: for any lane where the mask has a zero, go copy whatever is in that lane of the merge source. So these two offsets in the mask are one, so I go ahead and do the SIMD addition for those lanes right into my output vector, and for the lanes that are zero, I copy the values from the merge source. It looks like I'm doing this in multiple steps, but in actuality the instruction is essentially doing it all at once: within one instruction it produces the output vector with all of these results. It's not like I have to write a for loop and do it manually, or have an if clause checking whether the bit is one or zero — it's one instruction, and it shoves everything into my output vector, all right? So that's the merge version of masking. There's also a zeroing version where, instead of passing in a merge source vector, it just assumes zeros.
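Putting the last couple of slides into code, here is roughly what the explicit intrinsics version of the add loop looks like, plus the two AVX-512 masked variants just described (a sketch; the function and variable names are mine, and I'm assuming n is a multiple of four and ignoring alignment, as we said):

```c
#include <immintrin.h>

/* Explicit SSE version of the add loop: four 32-bit adds per instruction. */
void add_simd(int *x, int *y, int *z, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128i xv = _mm_loadu_si128((__m128i *)&x[i]);
        __m128i yv = _mm_loadu_si128((__m128i *)&y[i]);
        __m128i zv = _mm_add_epi32(xv, yv);   /* one instruction, four adds */
        _mm_storeu_si128((__m128i *)&z[i], zv);
    }
}

/* AVX-512 merge masking: lanes whose mask bit is 1 get a+b, lanes whose
 * bit is 0 are copied from the merge source vector. */
__m512i add_merge(__m512i a, __m512i b, __m512i merge_src, __mmask16 m) {
    return _mm512_mask_add_epi32(merge_src, m, a, b);
}

/* AVX-512 zero masking: lanes whose mask bit is 0 are just set to zero. */
__m512i add_zero(__m512i a, __m512i b, __mmask16 m) {
    return _mm512_maskz_add_epi32(m, a, b);
}
```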
So it puts zeros in instead of copying a value from another vector, all right? Now the next one is permute. The idea here is that each lane of the index vector corresponds to a lane of the output vector, and the offset stored in that lane tells us which lane of the input vector to copy the value from. So I start with the first one here: it says, at the first lane, I want whatever value is at offset three. So I copy that value up into the output vector — you get D there. Then you go down the line for all the other lanes. Again, you could do this fundamental step before AVX-512; the difference was that you had to go out to memory, do the shuffling, and then put the data back into the register. Now I can keep everything in the registers, and that's way, way faster — copying from register to register beats copying from register to L1 and back, for obvious reasons.

All right, so to get things in and out of memory, there are similar instructions. There's a selective load: I have a mask that tells me which lanes to put values into. Where the mask is zero, I leave the lane alone; where it's one, I copy the next value in from memory, given a starting address, and so on across the lanes. Selective store is the same thing going out to memory: I have my register, and the mask tells me which lanes get copied out, and they're written contiguously going across. So now I can do compress — yes? Question: going out to memory, how does it make sure that B and D end up next to each other? Is it sequential — do you first put B in memory, then D? It's all done together in the one instruction. Right, but what's done on the inside to make sure B and D go together? It's hardware. It's magic, I don't know — it's not my problem. At the lowest, lowest level, do I care what the wires or the interconnects are doing? No. I care about the cycles. And also, with this selective store, I don't think you could always do this on the older chips — I think the new ones can handle it — but this one you can definitely do, because it's just writing out.

Okay, so the next one is compress, and the idea here is that we want to store all the active elements specified by our index vector contiguously in our target register. So say this is our input vector, A, B, C, D, and we have some index vector, 1, 0, 0, 1. The values in each lane correspond to a lane in the input vector, and the idea is that we write the selected elements into our output value vector contiguously.
So in this case here, the first lane is 1, so we copy A in, and then we have this 1 over here, and that copies D into the next position, and the rest is just left as zeros. We're going to use this for selective scans: say the mask corresponds to which tuples evaluated the predicate to true or false. You want to take all the tuples that evaluated to true and put them contiguously into your output vector, so you can throw away the others and keep processing them in your pipeline. And the paper you guys read made a big deal about how, if you have underutilized lanes, you go back and fill in additional values.

Expand is the opposite: it takes contiguous elements from the input vector and stores them at certain positions in the target vector. So again, the lanes map up like this: the first one says I want A to go in there, and this one says, all right, give me the next one after A — that's B — so B goes there, and everything else is left as zero. And to move things in and out of memory the same way, you can use selective gather and scatter. With gather, the index vector's lanes correspond to the lanes of the value vector: this first lane has a two, so I go get whatever is at offset two in memory and put it up there, and so forth for all the other lanes. Scatter goes in the reverse direction: I have my value vector and my index vector, and the index tells me where in memory to store whatever is in the value vector at each lane. So I put the A up there — that should be an A, I missed that on the slide, sorry — and so forth. In terms of performance, L1 can only service one or two loads and stores per cycle, so I don't know exactly how fast this is going in and out of memory. I'm showing it as if it happens all at once, and these accesses can also cross cache lines; I don't know whether it's all done in a single cycle when the locations aren't contiguous, but we can ignore that for now.

All right, so with these vectorization fundamentals, we can now build algorithms in our database system that rely on them and try to keep everything in registers as long as possible. This is going to come from a different paper that I didn't have you guys read, from a few years earlier, from 2015. It was a real bare-bones prototype execution engine: it didn't do SQL, it didn't do indexes or compression or anything like that. It was just a testing ground for vectorized versions of these different algorithms. And the paper basically says, here's a bunch of different algorithms you could use to implement the most common operations in a database system. The paper you guys read, from the HyPer guys, sort of cherry-picked two of them — the most common ones, the scans and the hash tables. So in this paper, what they're going to do is prefer vertical vectorization, where I take two registers, do some operation on them, and produce an output, rather than horizontal vectorization, which takes one vector and combines it down to a single scalar value.
All right, and that's because vertical vectorization gives them better utilization of the hardware and lets them process more tuples more quickly. So we'll do selection scans, hash tables, and then we'll finish up with partitioning and histograms.

All right, so we showed before, we spent time talking about how you want to do a scan on a superscalar CPU where you want to avoid if clauses, because if the branch predictor mispredicts whether you're going to go down the if clause or not, you pay a big penalty: the CPU has to flush the instruction pipeline, throw away a bunch of work, and then load the pipeline back up. And we showed how this version here — where I just remove the if clause, always copy the tuple into the output buffer, use a ternary operation to decide whether the offset delta is one or zero, and then loop back around and potentially overwrite something that didn't evaluate to true last time — was faster in the common cases, when the predicate wasn't that selective, because we weren't paying a big penalty at the branch predictor. In SIMD we have a similar challenge: we still worry about branch misprediction, but more fundamentally, we can't have if clauses or branching inside our SIMD operations at all — there is no "if" SIMD instruction. So we take what we did before in the scalar version and modify it to work in a vectorized form.

It basically looks like this. Instead of looping over the table one tuple at a time, I grab a vector of tuples, load them into my SIMD registers, and apply my SIMD comparisons to do the predicate evaluation. That produces a bitmask vector that tells me which tuples in the vector evaluated to true or false. Then I write results back out to memory, and if none of them were true, I don't advance the output offset, so I just overwrite that space when I come back around. This is abstract — fake — so let's actually see how to do it with real values. Let's change our select statement to be: where key is greater than or equal to 'O' — the character O — and key is less than or equal to 'U'. Say I have a table like this, with a bunch of tuple IDs, one through six, and then the keys — assume each is a single character — and they spell out "Joy sucks." Joy was my first PhD student. He does not, trust me. So the first thing I do is store this column into a vector. I'm showing characters just to make it easier to follow, but assume they're really integers. And I'm showing six values here, which is a weird number; assume it's really some power of two. The first thing you do is the SIMD compare, and that's this part here. Since I have key greater than or equal to this and key less than or equal to that, that's two SIMD operations, just two SIMD comparisons. The output is the bitmask: one if the tuple matches, zero if it does not. So now I want to get back to: what are the IDs of the tuples that actually matched?
To do that, if I have another vector that just holds all the offsets, I can then use SIMD compress to produce an output vector that tells me which offsets in the column — in the table itself — actually match my predicate. And it's nicely packed, so I can do something with it. Whereas if I ended up with just the bitmask and didn't have SIMD instructions, I'd have to write a for loop that says: if this bit is set to one, then emit this offset. Instead I can do all of this in SIMD efficiently.

So let's see what the performance looks like. Again, this comes from the 2015 paper, so it's an older CPU — I think it's a Haswell Xeon — and then they had this other thing called the Xeon Phi. Has anyone heard of the Xeon Phi before? No? Well, yeah — you were at Apple, or working on compilers. The Xeon Phi was this thing from Intel that was sort of their version of a GPU. Instead of having GPU cores that are really, really simple and having hundreds or thousands of them, you had these co-processors with basically Atom or Pentium-class cores, but with SIMD instructions on them — I think probably AVX as well. You got fewer than a hundred of them, but you could push work to them like a GPU to do some computations. Intel killed it off, I think in 2019. Some elements of it went into the modern Xeons, but this was their response to NVIDIA, their attempt to build something like that. There was one version that sat on the PCIe bus, again like a GPU, and another that sat on the motherboard as the main processor — you could actually boot the OS from it — and then this one with the little funky thing here, the interconnect called OmniPath, which was their version of an RDMA-style, InfiniBand-like interconnect for reading remote memory.

Anyway, the main thing to point out is that for this paper, it's an older version of the Xeon Phi that didn't have out-of-order execution or speculative execution. So they're really simple cores that just do sequential processing, and whenever there was a branch, the misprediction penalty wasn't as big as on a speculative, out-of-order CPU — but it was overall just slower because it couldn't speculate. So they have four versions of this selection scan: the scalar version with and without branching, which we showed before, and a vectorized version with early or late materialization. Late materialization basically means they didn't actually go copy the tuple and stitch it back together — it's not a real database system, it was just ripping through columnar data in memory and either deferring the stitching to the very end or ignoring that cost.

All right, so the first line here is the scalar branching version. Then you have the branchless version, and because there's no benefit from speculative execution on the Phi and the CPU is just slower in general, the cost of always copying the tuple made the branchless version slower there. Whereas here, on the Xeon, you do see a difference. I forget why the branchless line doesn't dip down here the way you'd expect.
But once you get to 100% selectivity, you're saturating memory bandwidth, so there's no performance difference — you're just trying to stream things out of memory into the caches either way. Now for the vectorized versions: on the Xeon you see a big gap between early and late materialization, because late materialization is less overhead, which is why the penalty for the early version is higher. On the Xeon Phi there isn't much of a difference; I think that's because the main cost there is getting the data in and out of the registers, not the materialization, which comes later. The main point is that with vectorization you're getting a pretty significant performance improvement over the scalar version, even on this old hardware. Yes? So branching is better in this case? Sorry, for which one? Why is branching better in this case? Branching is better here because the Phi doesn't have speculative execution and the cost of copying was so high: if you're always copying and only afterwards checking whether you should have copied, that's worse than checking whether you should copy and then not doing it. That's why. But I don't remember why this one — sorry, this is the branchless one — yeah, the branching version at 0% selectivity should be faster than this, and I forget why it isn't. The main thing I want you to take away, again, is that SIMD is faster until you saturate memory bandwidth. Wait, is there a branching version that's vectorized? The question is whether there's a vectorized version with branching — no, the vectorized one is always branchless, because again you can't have an if clause, you can't have a conditional, without going back to scalar CPU instructions; you can't do that purely in SIMD. You use the bitmasks to decide what moves forward in the evaluation. And the question over here is: why, when selectivity is super low, is this one so much faster? It's memory bandwidth — the cost of copying everything out just crushes you.

So again, the reason I didn't have you read this paper, and had you read the HyPer paper instead, is that this paper makes a bunch of assumptions that don't hold in the real world: your memory pointers aren't 32 bits, and your keys aren't always going to be 32 bits. They did that because it makes everything align nicely in the registers they had at the time. The guy who wrote this paper then went off and built all the SIMD stuff in Redshift, so Redshift is doing similar things — there just isn't a paper about it.

So in the example I showed, as we process each batch, each vector of tuples, there are going to be entries in it for invalid tuples — tuples that didn't match whatever predicate or check we're doing. But because they're still sitting in our SIMD registers, we're still operating on them as we go along. At some point there can be a check in our pipeline that says: if my bitmask is all zeros, it doesn't make sense to keep processing; bail out of the loop and go get the next vector. But if at least one lane is still valid, I have to keep processing — or I can decide to go get more tuples and fill in the empty lanes of the vector.
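To tie the selection scan pieces together — the two comparisons, the bitmask, the compress of the matching offsets, and the all-zeros bail-out — here is a minimal AVX-512 sketch. It assumes 32-bit integer keys, like the Columbia paper does, a tuple count that's a multiple of 16, and function names of my own choosing:

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* WHERE key >= lo AND key <= hi: writes matching tuple offsets
 * contiguously into out_ids and returns how many matched. */
size_t select_scan(const int32_t *keys, size_t n, int32_t lo, int32_t hi,
                   uint32_t *out_ids) {
    __m512i vlo  = _mm512_set1_epi32(lo);
    __m512i vhi  = _mm512_set1_epi32(hi);
    __m512i lane = _mm512_set_epi32(15, 14, 13, 12, 11, 10, 9, 8,
                                    7, 6, 5, 4, 3, 2, 1, 0);
    size_t out = 0;
    for (size_t i = 0; i < n; i += 16) {
        __m512i k = _mm512_loadu_si512((const void *)&keys[i]);
        /* two SIMD comparisons produce one bitmask of qualifying lanes */
        __mmask16 m = _mm512_cmpge_epi32_mask(k, vlo) &
                      _mm512_cmple_epi32_mask(k, vhi);
        if (m == 0) continue;  /* nothing qualified, go get the next vector */
        /* tuple ids for this vector: i, i+1, ..., i+15 */
        __m512i ids = _mm512_add_epi32(_mm512_set1_epi32((int32_t)i), lane);
        /* compress-store only the qualifying ids, packed contiguously */
        _mm512_mask_compressstoreu_epi32(&out_ids[out], m, ids);
        out += (size_t)_mm_popcnt_u32((unsigned int)m);
    }
    return out;
}
```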
So look at an example like this: I have an aggregate query with a filter where age is greater than 20, and I'm doing an aggregation on the city field. Say this is our query plan. In a scalar engine we'd want to vectorize this part here, but the problem is that if I vectorize this aggregation lookup and only some of the tuples in my vector matched the filter, then when I do the aggregation I'm not getting the full benefit of SIMD — I'm doing wasted work, or I have to do extra work to throw out the things that shouldn't be aggregated. And for the last pipeline there's nothing magic going on.

So in the paper you guys read, they talked about this materialization model from Menon et al. That's Prashanth — he was one of my PhD students; he now works on Photon at Databricks, building a vectorized engine. The paper they were citing is this thing called relaxed operator fusion, and the basic idea is to introduce stages within a pipeline that let you decide whether to proceed or go back and get more tuples. The paper you read had all these additional checks about when to go back at different points; our idea was that at the places where you know the SIMD part stops — and there may be more vectorized work to do after it — you introduce these artificial buffers that you fill up, going back for more tuples as needed, and only when the buffer is full do you proceed to the next stage of the pipeline. Remember, a pipeline is a chunk of operators you can push a tuple through, as far up the query plan as possible, before you have to go back for more data or hand off to the next pipeline. So the idea is to introduce these mini pipeline breakers artificially into the pipeline.

Let's go back to our example with the simple aggregation. Say we want to vectorize this. What we do is introduce this stage buffer here, so we can do this part — stage one — as much as possible, and once the stage buffer is full of tuples we know satisfied the predicate, we move on to the next stage. Really simple pseudocode looks like this. The first stage is this part here, where we're iterating over vectors of tuples from the table and doing our comparison. Then there's the second stage: when the buffer is full, actually do the aggregation. I'm not showing how to vectorize that — this is just simple pseudo-Python — but the key point is this buffer check here: if my buffer has filled up to some threshold, I jump into this part and flush the buffer by applying the operations of the next stage. I'm also not showing that when you fall out of the scan loop over the table, you have to check whether the buffer still has anything left in it — assume there's a check for that. And then for the last step there's just the aggregation piece. Yes? With all this data moving around, how many vector units are there? The first versions of AVX-512 hardware had one; I think the newer ones have more than that, but it's like fewer than ten.
It's not like hundreds or thousands. And again, this is a really simple example: I can do one simple SIMD operation to check the age predicate. But say I have more complex predicates operating on different columns at the same time — then you have a whole series of these SIMD operations, and you're moving things from one register to the next.

So one of the advantages you can get with this approach, which I don't think the paper you read talked about, is this: since you know you're running this tight loop here and only occasionally popping down into the next stage — assuming that stage is more complicated — you can actually give the CPU a hint using what's called software prefetching and say, hey, by the way, I'm about to read this next piece of memory, go fetch it for me if you can. It's a non-blocking call, so you don't wait for it to show up in the CPU cache. The idea is that you avoid last-level cache misses because you know what you're going to read next — you're the one in charge of iterating over this loop. So you use software prefetching to tell the CPU: I'm going to work on this piece of data now, but go get me this other data, because I'm going to come back around and need it. We won't go into too much detail, but where you actually want to put the stage buffers — these artificial staging areas — depends on balancing the amount of processing you have to do against the time it takes the CPU to prefetch the next piece of data. You don't want to come back around too early, before the data is there, and stall; and you don't want to spend so long processing the current data that the CPU evicts the thing you asked for because you never touched it, and then you take a cache miss anyway. Figuring out exactly where to put these is tricky, but it can make a big difference in performance, because when you come back around the data you need is already there and you've reduced cache misses. There are different types of software prefetch hints; we can ignore that for now — just think of it as giving the CPU a hint, "I want this, go get it," and it tries to, with no guarantee that it will.

So this is a result from a paper we published a few years ago — this was the old Peloton system that became NoisePage, which we ended up killing because we couldn't keep building a new system during the pandemic; competing with the Germans was too hard. This is showing LLVM query compilation, which we'll cover next class, using the HyPer method with the push-based approach and pipeline fusion — you'll read about that next class — and then with relaxed operator fusion on top. For this particular query, the SIMD vectorization and the prefetching don't help, whereas for this other query they do. I forget the exact details of why that's the case for Q19 versus Q1, but the main takeaway is that it doesn't always work, it doesn't always make a huge difference, and there's a trade-off between compilation and vectorization depending on what you're targeting. Vectorization is probably the best first approach, because from an engineering perspective it's easier to do than query compilation, which we'll cover in the next class.
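Here is a rough sketch, in C with AVX-512 intrinsics, of what a stage-one filter loop with a stage buffer and a software prefetch hint could look like — this is my own simplification in the spirit of relaxed operator fusion, not the actual code from the paper or the slides, and the buffer size and prefetch distance are made-up tunables:

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

#define STAGE_BUF_SIZE 1024   /* hypothetical stage buffer capacity */
#define PREFETCH_VECS  8      /* hypothetical prefetch distance, in vectors */

/* Stage 1: vectorized filter (age > 20) fills a buffer of qualifying ids.
 * Stage 2 (the aggregation) runs whenever the buffer fills up. */
void filter_with_stages(const int32_t *age, size_t n,
                        void (*stage2_flush)(const uint32_t *, size_t)) {
    uint32_t buf[STAGE_BUF_SIZE];
    size_t buf_len = 0;
    __m512i lim  = _mm512_set1_epi32(20);
    __m512i lane = _mm512_set_epi32(15, 14, 13, 12, 11, 10, 9, 8,
                                    7, 6, 5, 4, 3, 2, 1, 0);
    for (size_t i = 0; i < n; i += 16) {
        /* non-blocking hint: we know we'll read this part of the column soon */
        if (i + 16 * PREFETCH_VECS < n)
            _mm_prefetch((const char *)&age[i + 16 * PREFETCH_VECS], _MM_HINT_T0);
        __m512i a = _mm512_loadu_si512((const void *)&age[i]);
        __mmask16 m = _mm512_cmpgt_epi32_mask(a, lim);
        if (m == 0) continue;                       /* nothing qualified */
        __m512i ids = _mm512_add_epi32(_mm512_set1_epi32((int32_t)i), lane);
        _mm512_mask_compressstoreu_epi32(&buf[buf_len], m, ids);
        buf_len += (size_t)_mm_popcnt_u32((unsigned int)m);
        if (buf_len >= STAGE_BUF_SIZE - 16) {       /* stage buffer is full */
            stage2_flush(buf, buf_len);             /* run the next stage */
            buf_len = 0;
        }
    }
    if (buf_len > 0) stage2_flush(buf, buf_len);    /* drain what's left */
}
```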
But if you combine the two — compilation and vectorization — you can get a huge win, though very few systems do that. Yes? Sorry, what are Q1 and Q19? This is the TPC-H benchmark, the standard benchmark. I forget the exact details; I think Q1 doesn't have a join, it's just ripping through a table, and I think Q19 has a join — maybe it's the other way around, I forget.

All right, so I'm going to show you this graph here. Some of you know how much I hate TPC-H, and it's because of this graph — because you can basically solve it. I gave a talk on this once and somebody did something unethical with it. It's still a good graph, all right. So the first version of Peloton we wrote was this interpreted engine, then we added query compilation with LLVM, then we had relaxed operator fusion with SIMD, then relaxed operator fusion with SIMD plus the prefetching. And you see that as you go down you get progressively faster: compilation makes a huge difference — of course, the old system was super, super slow. With relaxed operator fusion and SIMD you can shave it down by another 65%, but by the time you add prefetching, the system is so optimized that you only get a minor improvement. So the main thing I'm trying to say is: yes, you could do prefetching with this, but the engineering effort doesn't produce a big win. I also don't want to give the impression that compilation is better than using SIMD — we'll see a paper in two classes that compares the two approaches in a single system, and it's a wash; sometimes one is better than the other. From an engineering perspective, though, SIMD vectorization is better — less work.

Okay, so we've done selection scans; now let's do hash tables, and then partitioning with histograms to finish up. All right, so with hash tables, there's not much you can do to speed up the build side, because it's random writes, and worst case you have to expand the hash table when it overflows. I suppose when you're probing to decide where to actually put a key you could use some of this, but mostly these techniques don't apply there. Typically you use this on the probe side, because that's against the larger table and you're trying to rip through it and do probes as fast as possible. The scalar version looks like this: I have my input key, I hash it, that gives me some offset, mod the number of slots in my hash table — assume we're doing linear probing. I land at some location in memory and start doing comparisons: does this key equal the key stored there? I keep scanning down until I find a match, and then I'm done.

One way to vectorize this is the horizontal approach, where we take a single input key but compare it against multiple keys in the hash table at the same time using vectorized instructions. What we do is expand the table: for each slot in my hash table, I now store four keys instead of one, and likewise four values in the payload portion. Now when my input key shows up, I hash it just like before and land at some offset, but instead of getting back a single key, I get back a vector of keys, and I use a SIMD comparison to produce an output bitmask of whether I matched or not.
If there's a one anywhere, I know I'm done, because I found something that matches my key. If they're all zeros — which I can check quickly in SIMD — I go to the next slot, fetch the next four keys, and do the comparison again. This approach falls apart if the keys are too large — I guess all of these approaches have that problem — because the cost of pulling things out of the hash table and packing them into the registers becomes too expensive.

So instead we do vertical vectorization, where we have multiple input keys we want to probe with. Think of scanning along the probe side of a join: I'm getting a batch of tuples and I want to probe the hash table to find the matching key for each of them. For each of these input keys, I hash them individually and produce a hash index vector. I'll also point out here that there are libraries that provide vectorized hash functions, but as far as I know there isn't a single SIMD instruction that takes, say, four keys as input and hashes them all at once to produce the output. You can use the radix trick we'll see in a second — it's a poor man's hash function — but if we want low collisions, we don't want to do that.

Right, so — sorry. We have our hash index vector produced from our four input keys, and now we have to do four probes into the hash table to get the candidate slots. Sorry, why won't this advance? All right, here we go, good. So we do our probes, we land on a bunch of different keys, and we do a SIMD gather on those to bring them into a single vector. Then we do a SIMD comparison across the lanes, and that produces an output vector with ones and zeros depending on whether each lane has a match. So now we have two ones here: for those lanes we know we don't need to search any more, because we landed exactly on slots in the hash table that had the keys we were looking for. But for key two and key three, since they weren't a match, we have to go to the next slot at their respective positions and keep going until we find a match, or hit an empty slot and therefore know there's never going to be a match. This goes back to the lane utilization issue I mentioned before: we could keep scanning down with this same input vector and just ignore whatever happens in the first and last lanes, because we already found their matches, but that's wasted work — we're going out to the hash table and burning cycles comparing things we don't need to compare. So instead, as we're scanning the probe input, we go back and get two new keys for the first and last positions and run the hash function on them. For the second and third ones, we just add one to their previous positions, because we want to step down through the hash table — we're doing linear probing. The first and last lanes get new hash positions, and then we go back to the hash table, fetch the new slots, and do the same thing we did before: fill our vector and do the comparison again.
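As a concrete sketch of one round of that vertical probe — with my own function names, 32-bit keys, a power-of-two table size, and the lane-refill step left out — it could look something like this:

```c
#include <immintrin.h>
#include <stdint.h>

/* One probe round for 16 lanes at once against a linear-probing table.
 * ht_keys is the table's key array, pos holds each lane's current slot,
 * and table_mask is (number of slots - 1). Returns the match bitmask. */
__mmask16 probe_round(const int32_t *ht_keys, __m512i probe_keys,
                      __m512i *pos, uint32_t table_mask) {
    /* gather the key stored at each lane's current slot */
    __m512i slot_keys = _mm512_i32gather_epi32(*pos, ht_keys, 4);
    __mmask16 hit = _mm512_cmpeq_epi32_mask(probe_keys, slot_keys);
    /* lanes that missed advance to the next slot (mod table size);
     * lanes that hit keep their position (in the full algorithm they
     * would get refilled with fresh probe keys instead) */
    __m512i next = _mm512_and_epi32(
        _mm512_add_epi32(*pos, _mm512_set1_epi32(1)),
        _mm512_set1_epi32((int32_t)table_mask));
    *pos = _mm512_mask_mov_epi32(next, hit, *pos);
    return hit;
}
```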
And again, the idea is that every single time we do that SIMD comparison, the lanes are always fully utilized, all right? There's one problem with this approach, in addition to the cost of the SIMD gather: if the hash table exceeds the CPU cache, this is going to be super slow, because you're doing random probes into different locations in memory. So that's going to suck. The German paper talks about this: you can do radix partitioning to split it up into smaller, cache-sized hash tables, and we'll see how to do that in two weeks. But there's another problem, and that is that because we're using the lanes dynamically, the input keys are going to be read out of order. Under the relational model that's okay, because there's no guaranteed ordering for any data. But from a developer's perspective it makes things kind of nondeterministic: every time you run on the same input data, things happen in a different order, and that makes it much more difficult to debug and figure out what's going on when you have problems. So there's some implicit engineering complexity if you want to do something like this, but to get the benefit of SIMD, it's unavoidable. The selection scan doesn't have this problem, because there you're just ripping through tuples in order and filling things in as you go.

All right, let's quickly look at the performance. Again, this is from the Columbia paper, with the Xeon Phi and the older Haswell. When the hash table is really small, the vectorized stuff does amazingly well. But once you exceed the CPU cache for the respective hardware, they all converge to the same thing. Again, the German paper talks about this too: you don't get any benefit from SIMD once you're outside of L3.

All right, the last one I'll show you is how to do partitioning with histograms. Not that this is a super common operation you'd spend a lot of time on in a database system; I just think it's kind of neat, kind of clever, and that's why I like sharing it. For this one, we're going to use gathers to increment counts, and the way we avoid collisions when we do our summations is to split the histogram across different lanes, different registers, and then combine all those registers together at the end to produce the final output. So say this is our input key vector, and we want to compute a histogram of how often the different keys occur. Again, a histogram is an approximation of the actual values. We hash each key and then basically add one to the histogram location that the key maps to. Since we don't have a hash function that can operate on four keys in a register at the same time, we'll use a radix trick, which is just taking the first byte of whatever the value is; again, it's a poor man's hash. That produces our hash index vector, and that tells us where in our histogram we want to write. Right?
And again, this is sort of a bad example, since I have four keys and only five elements in my histogram, but assume I have a billion keys and I'm trying to summarize them into a much smaller vector. In this case it's not going to work for us, because both H2 and H4, those hash values there, map to the same histogram location. So when I do plus one on them and try to increment the histogram, they're going to clobber each other and the count will be one instead of two. I'm going to be missing an update, because there's no for loop inside the hardware that says, okay, update this lane first and then update it again, plus one, plus one, twice. So all we need to do is replicate our histogram, split it across multiple copies, and have each copy correspond to a lane in the register. Now when I write the first element, it goes into the first copy, the next element goes into the next copy, and so forth. Then to put it all together, I just do a SIMD add across the copies and I end up with the final, correct result.
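Going back to the histogram for a second, here's a minimal sketch of that replicate-per-lane trick, written with AVX-512 gather/scatter since that keeps the whole update loop in SIMD; the interleaved one-copy-per-lane layout, the low-byte radix "hash", and the 256-bucket size are all assumptions I'm making for illustration, not the paper's exact code.

/* SIMD histogram with one copy per lane: lane l only ever touches
 * hist[bucket * 16 + l], so two lanes that map to the same bucket cannot
 * clobber each other's increment. Needs -mavx512f. */
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define BUCKETS 256                          /* radix on the low byte of the key */

void histogram16(const int32_t *keys, size_t n, uint32_t out[BUCKETS]) {
    uint32_t hist[BUCKETS * 16];             /* 16 interleaved copies, one per lane */
    memset(hist, 0, sizeof(hist));

    const __m512i lane = _mm512_set_epi32(15, 14, 13, 12, 11, 10, 9, 8,
                                           7,  6,  5,  4,  3,  2, 1, 0);
    const __m512i ones = _mm512_set1_epi32(1);

    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512i k   = _mm512_loadu_si512((const void *)(keys + i));
        __m512i bkt = _mm512_and_epi32(k, _mm512_set1_epi32(0xFF));   /* radix   */
        /* index = bucket * 16 + lane  ->  unique per lane, no clobbering        */
        __m512i idx = _mm512_add_epi32(_mm512_slli_epi32(bkt, 4), lane);

        __m512i cnt = _mm512_i32gather_epi32(idx, hist, 4);           /* read    */
        cnt = _mm512_add_epi32(cnt, ones);                            /* + 1     */
        _mm512_i32scatter_epi32(hist, idx, cnt, 4);                   /* write   */
    }
    for (; i < n; i++)                                                /* tail    */
        hist[((uint32_t)keys[i] & 0xFF) * 16]++;

    /* combine: sum the 16 per-lane copies into the final histogram */
    for (int b = 0; b < BUCKETS; b++) {
        uint32_t sum = 0;
        for (int l = 0; l < 16; l++)
            sum += hist[b * 16 + l];
        out[b] = sum;
    }
}

You could do the same thing without scatter by computing the bucket indices with SIMD and doing the increments in scalar code; the key idea is just the one-copy-per-lane layout plus the final reduction across the copies.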
Okay, so this all sounds amazing, right? But the problem with AVX-512 is that it's not always going to be faster than AVX2, and in some cases it will actually make your program run slower than if you didn't use any SIMD at all. I don't know if anybody caught this in the paper: on one of the pages there's a footnote that says, please note that throughout our experiments we did not observe any performance penalties through downclocking; both processors, the Knights Landing (the Xeon Phi) and the Skylake Xeon, remained stable at their expected clock speeds. Anybody know what they're talking about? Well, it turns out that with AVX-512, Intel will actually downgrade the clock speed of the CPU when you execute certain AVX-512 instructions. There are actually two levels of clock speed downgrading: a soft one that kicks in gradually, and a hard one that depends on which instructions you execute; I think with AVX-512 floating-point instructions you get a hard reduction in clock speed, and in some cases it stays there for the remainder of the process. Because of this, and I'll post the links on Piazza afterwards, there are discussions on the compiler mailing lists saying that for GCC you almost never want to use 512-bit vectors; by default it will give you 256-bit or AVX2 instructions when it vectorizes something, and only if you explicitly say you definitely want AVX-512 will it actually emit it. Some people see a performance degradation in their system because some piece of code or some library they're linking in is using AVX-512. Just google "why is AVX-512 slow" and you'll see a bunch of Stack Overflow posts where somebody linked in a library that makes an AVX-512 call, they can't figure out where that instruction is, and it's that one instruction that causes the process to run 15 to 20% slower. There's a great Stack Overflow post on this, the link's in the slides, where a guy explains exactly what's going on. He talks about light instructions and heavy instructions and whether they cause a soft or a hard transition to the slower clock speed. As far as I know, this wasn't a problem with AVX2; it's only AVX-512. And I think Intel does this for thermal reasons: these instructions use more power and generate more heat, and rather than burn out the CPU, they turn down the clock speed. So be mindful that even though the paper talks about the great things you can do with AVX-512, if you're not careful about when you use it, you can get worse performance. There are ways to override this; you can take the seatbelt off and force it to always run at the higher clock speed, but whether you can do that depends on whether you control the hardware, okay?

All right, so the main takeaway from today's lecture is that vectorization is a neat building block we can use in our database system to speed up queries, if we're careful about it. And if we use it in combination with the additional parallelism we talked about before, running multiple threads where every thread is using SIMD for the most expensive parts, like scanning a table, then we get a big speedup in performance. But in practice it's usually two to four x, not the 100x I showed at the beginning.

All right, so next class we'll talk about query compilation. It's a little confusing because this idea of, oh yeah, we'll compile queries, has been sprinkled in throughout the entire semester so far, but now we'll actually read the paper on how the HyPer guys did it. The paper I'm having you guys read isn't the first one to do query compilation, but it's the seminal paper of the last ten years that says, yes, you should be compiling queries on the fly, and they use LLVM for it. The newer version, Umbra, actually emits assembly directly, which is insane, which is amazing. Anyway, we'll cover all those techniques next class. Then we'll also spend time talking about potential project three topics, because in two weeks you have to present your idea to the class. Some of you have already talked to me about what your topic will be; for the others who aren't sure, I'll show what you can do on Monday and then we can follow up and talk about it, okay? All right guys, enjoy the good weather. See ya.