Query compilation is an awesome topic and there's a lot to discuss, so here's what's on the docket for everyone. The Project 2 checkpoint should be posted on Gradescope; if it's not already up, it'll be posted today. We looked at ways to make the compilation go faster and there wasn't anything obvious, but when you submit this, it will run Clang Format and Clang Tidy, just not the full linter. So make sure you run that yourself, because for the final submission it has to pass. So again, the first checkpoint covers insert; we'll figure out delete and then scan key, and then the final one will be the full concurrency. And for this checkpoint we're not checking for concurrency, because Gradescope only gives you a single thread; we'll do more exhaustive tests on the final submission. Then in Wednesday's class I'll start discussing Project 3 and propose some topics that you can look at. Certainly some things we've talked about so far, and some things we'll talk about today, are germane topics you could explore. Then in the first class after spring break, on that first Monday, we'll have in-class presentations where every group comes up and spends five minutes to say, hey, this is what we're going to do. So even though I realize you still have to build the B+Tree, you should start thinking about what it is you actually want to build for Project 3. And if you're not sure about potential topics or want some further clarification, I'll be around over spring break and we can meet if necessary. All right. Query compilation is super important. It is one of the main techniques that people are using in modern database systems today to get the best performance, and this is why we're going to spend an entire lecture on it.
So we'll first talk about some background on code generation: why we actually want to do query compilation at all. Then we'll talk about the two techniques: code generation through transpilation, basically source-to-source compilation, and then JIT compilation using LLVM, because that was in the HyPer paper that you read. But certainly that is not the only way to do JIT compilation. Then we'll look at some real-world implementations, including what's in Postgres and our own database system, and see what the code-gen compiler looks like in these different systems. Okay. So a couple of classes ago, we talked about how we're going to get our system to run as fast as possible, and we said the way to do this is to reduce the number of instructions we have to execute and also get more instructions per cycle. If you start to think about how hard it actually is to get good speedup just from instruction counts, there's a back-of-the-napkin calculation the Hekaton guys did for their paper when they were explaining why they were doing code generation and compilation in their system. It basically says that if you want your database system to go 10x faster, then you need to execute 90% fewer instructions. This is doable, right? Not just by turning on a compiler optimization flag to get better binaries, but through careful redesign of the database architecture, we can achieve this. But if we want to get 100x faster, now we have to execute 99% fewer instructions, and this starts to get really, really hard.
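As a quick aside, the arithmetic behind that back-of-the-napkin claim is simple if we assume the instructions-per-cycle rate stays fixed, so speedup comes only from executing fewer instructions:

```latex
S = \frac{I_{\text{old}}}{I_{\text{new}}}
\quad\Longrightarrow\quad
I_{\text{new}} = \frac{I_{\text{old}}}{S}

% S = 10:  I_new = 0.10 \cdot I_old, i.e. 90\% fewer instructions
% S = 100: I_new = 0.01 \cdot I_old, i.e. 99\% fewer instructions
```

So each additional order of magnitude of speedup requires cutting the remaining instruction count by another factor of ten, which is why 100x is so much harder than 10x.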
So today's class, Wednesday's class, and then after spring break when we start to talk about vectorization: these are the techniques we're going to use to execute fewer instructions to do the same amount of work, so that we can try to achieve this 100x. Because Intel is not ratcheting up the clock speed anymore; they're giving us wider SIMD registers and more specialized instructions. Intel is going to give us these things rather than faster clocks, so we have to design our system to take advantage of them. And there's not going to be a magic flag in GCC or Clang, like -O100, that we can use to make this happen. Again, it's us as the database system developers who have to design the system, and fortunately people pay us a lot of money to do this, so that's good. So this is why we want to do code specialization, or query compilation. The idea of code specialization is that instead of having general-purpose code in our database system to process queries or do whatever tasks we want in our system, we're going to generate code that is specific to the one task we're trying to complete. For our purposes today, that's always going to be a query. So we have a query, and rather than running it through a general-purpose system, we generate code that is hard-coded, or baked, just to execute that one query. The reason this is better for performance is that in the general-purpose system, you have all this indirection: all these if-clauses or switch statements to deal with all the possible data types or operands or predicates or aggregations you could be executing while you process the query. That means that as I look at every single tuple and call my aggregate function, there's a switch statement that says, if my data type is this, do this; if my data type is that, do that. We want to avoid all of that.
We want to strip it down to just the exact instructions I need in order to execute that query. This is going to be tricky to do. And it's not that people who write general-purpose database system code are doing it because they're dumb. They're doing it mostly for software engineering reasons, right? You make the code reusable so you don't have to duplicate operations over and over again, like adding two integers versus adding two floats. We also write code in a way that's easier for people to maintain and support. But the problem, as I said before, is that the way we write code that's easy for humans to understand is often the worst way to write code for the CPU. So I'm going to show a bunch of examples using this simple three-table schema: A, B, C. A and B have integer primary keys and some value, and C just has two foreign-key references, one to the primary key of A and one to the primary key of B. I'm going to show a bunch of examples doing three-way joins on these tables over the foreign keys. So let's see now how we could process this query here, a three-way join with a nested aggregation, using the interpretation model, the iterator model that we talked about before. Again, we have an aggregation with a group-by on B inside the inner query, and then we just do a join on A, B, and C. The query plan would look like this. For our purposes here, this is just the logical plan. We're not saying what the join algorithm actually is, but we can assume it's a hash join; we're not saying how we're doing the aggregation, but assume it's also a hash table. So if we go back and use the for-loop iterator model approach we saw last class, the way we'd execute this query is just a bunch of for loops, where each operator iterates over its target input.
In this case here, we're scanning B: for every tuple in B, we shove it up into the next operator to do the filter, and that shoves it up to do the join, right? Again, this is going to be slow for an in-memory database, and it's still going to be slow for a disk-based database, because we're copying data from one operator to the next. Yes, we can combine some of these, but we have all these next calls and emit calls, and all these function calls are going to be expensive for us. And we have to have this general-purpose evaluate-predicate function, because we don't actually know what the predicate looks like. We just know we're going to have an expression tree that we want to evaluate. And all this code here that iterates over every tuple from the child just calls that function. It's not doing anything special; it's not inlined, right? So again, there's a lot of overhead from all these next calls because of all the different types of indirection. But the expression itself is also going to be expensive as well. So take this inner part here, where b.val equals question mark, meaning it's an input parameter: this is a prepared statement. At runtime, someone's going to pass in the value for this parameter, and then we're going to add one to it. So typically, the way you represent these predicates is through expression trees. The way to think about this is that the database system enters at the root and says, hey, I want to evaluate this expression tree. Then it has to traverse down into the tree, evaluate all the children, and start pushing results up. So say we come down here: we look at the equality operator and start going to the left child. We see that our expression there is a tuple attribute, b.val.
So we do a lookup on the current tuple we're looking at, because we're calling this evaluate-predicate function for every single tuple as we scan along. We say, all right, I want b.val. Now I've got to look at my table schema and see that the val attribute is the second attribute, so I know how to jump over here to get 1,000, and I produce that as my output. Now I traverse down the other side of the tree. This says I want parameter 0, so I go look up my parameter array, find the first one, and that's 999. Then this node is just a constant, so I take one, push it up here, add these together, and I get 1,000. Now I can do my comparison, and the result is true. So again, I have to traverse this tree for every single tuple that I evaluate. If I have a billion tuples, I'm making one billion times, you know, four jumps for all these different expression function calls. And we're doing this because, as humans, it's easy for us to reason about the where clause represented as this tree. But it's going to be slow because of all this indirection. So these are the two main things we're going to try to get rid of through code specialization: the interpretation of the query plan, with all the indirection across the different operators, and the predicates that we have to evaluate inside those operators. So again, the idea is that any time we have a CPU-intensive task in our database, we want to convert it, or compile it, into machine code that we can execute directly. We get rid of all the switch statements, we get rid of all the if-clauses other than checking whether the predicate evaluates to true, and it comes down to being stripped down to exactly what the query wants, with minimal lookups and indirection.
So I've already shown how to do this for access methods, predicate evaluation, and operator execution. We can also do it for stored procedures or prepared statements: the logic of PL/SQL can be converted into machine code. Only Oracle really does this. We also saw other parts of the system, like logging operations: on recovery, if I have to replay log records, rather than looking at the schema and then having indirection to say, oh, my log record has these types and these values, therefore I know how to apply them to the database, I could apply code specialization to make recovery work faster. No database actually does this one. Predicate evaluation is probably the most common, followed by access methods and operator execution, and only a small number of systems actually do those. This is actually something we're exploring in our own system. So I've already said this before, but why can we do this? Well, we can do this because we know what the attributes are in our database ahead of time. In the relational model, you have to declare a schema, so it's not like we're looking at arbitrary JSON fields or arbitrary CSV files. We know exactly what the schema looks like, and we know exactly what the size of the data is. Varchars have to be treated differently, but at least we know we have a fixed-size pointer to that varchar. So instead of having all these function calls to do lookups, where I say I want this attribute from this tuple and therefore jump into a function that knows how to do the arithmetic to find the value I'm looking for, I can just inline the address math directly to go get exactly the data I'm looking for. Likewise, all the predicates are known ahead of time, because we're given the where clauses and whatever is in our projection list.
So we know exactly how to represent the predicate: instead of representing it as a tree, we can represent it exactly as the predicate we're actually trying to apply. And likewise, we want to get rid of all the function calls we have inside loops, so that we have a tight kernel that doesn't have to do any branching inside of it. It can iterate very quickly over and over again, and we'll get some benefits from the compiler in this case, because it can do some unrolling, and in some cases it can do some auto-vectorization, which we'll cover on Wednesday. All right, so at a high level, the database system we're talking about now looks like this. I haven't shown this picture before, but this is basically the pipeline within a real database system when a query shows up. So say there's a networking layer here and a SQL query shows up. First thing, we parse the SQL, and from the parsed SQL we get an abstract syntax tree. This is just the tokens: the select, the names, all the strings that are inside the SQL itself. Then we feed this into our binder, and the binder does a lookup in the catalog to replace the string tokens naming objects in the database with internal identifiers. So if my table is called foo, the binder sees that someone is doing a lookup on something called foo, looks it up in the catalog, and says, foo corresponds to this table, and here's an internal identifier that lets you find it more quickly in the future. Then you pass this annotated AST into the query optimizer, which generates a physical plan.
There could be more steps, and we'll see them later in the semester, but for our purposes today, we only care that the optimizer generates a physical plan, which you then feed into some compiler, or transpiler, or code-gen engine, whatever you want to call it. It takes the physical plan, which is that operator tree, and spits out some kind of native code, or bytecode that we can then interpret. So the idea is that we take the physical plan and convert it into source code or native code that does exactly what the query wants to do, without any indirection. Now, how you actually run this compiler varies between the different systems, and what actually comes out of this thing varies across the different approaches we'll talk about today as well. One other thing I'll say: in the cases where we compile the physical plan into machine code that we then link or execute inside of our database system process, because we as the database developers are the ones writing this translation step, we don't have the usual security concerns. It's not like we're taking arbitrary C code from the user and running it inside our database system, which would be stupid. Because we control this step, we don't have to do extra security checks to make sure there are no buffer overruns or malicious code and things like that. This is code that we as the developers spit out, so unless somebody on the inside tries to take us down, we assume it's safe to run directly inside our database process, and we don't have to sandbox it at all. We'll see when we talk about UDFs that in some commercial systems, like Oracle for example, you can write UDFs in C, which obviously is super dangerous, because you can jump to any address in your process.
So in that case, they'll fork off a sandbox process and run it there, so you can't hurt the real database system. But again, in our generated code, we don't worry about this. Yes? [Question: say you have some global data structures; how does the compiled code know where exactly those data structures are?] So the question is: if I have global data structures, how do I know where they are from the compiled code? The compiled code gets linked in with everything that's running inside your database system. You wouldn't want to hard-code some arbitrary memory address for the object you want; you'd have a way to link it and say, here's the function I can call that gives me access to the object, so I can do what I want. It's the same as linking against an existing library and being able to invoke into that library; the ABI of Linux, or whatever operating system you're on, handles all that for you. All right. So there are two approaches to this: transpilation and JIT compilation. Transpilation is also called source-to-source compilation. The idea here is that the database system has specialized code that emits new source code. So it has C++ code that spits out C++ code; you then run that new source code through a regular compiler, link that in, and run it, and that's the query you're going to execute. The other approach is JIT compilation, where instead of generating higher-level source code, we generate a low-level IR, or intermediate representation, think LLVM IR or JVM bytecode. We emit that directly, and then we can invoke it inside of our database, and we can either compile it or interpret it.
At a high level, the end result is still going to be the same: we take our physical query plan and generate executable code that is baked just for that query plan. Whether we get there by generating C++ code first or by generating this low-level IR first, the compilation time is going to differ between these two, and there are software engineering differences as well, but at the end of the day it's still the same thing. There's no interpretation, no lookups for the different types of indirection we could have working on arbitrary data types; the plan that comes out of the compiler is baked just for that one query. Okay. So we'll go through these one by one. One of the first database systems in the modern era, and I'll explain what I mean by that when we talk about System R, to do code generation was this thing called HIQUE, out of the University of Edinburgh. What it did was: for a given query plan that the optimizer produced, it would write out C++ code that implements the executable plan for that query, with all the predicates and all the type conversions baked in exactly, based on the schema for the given tables. Then it would just fork/exec GCC, have GCC spit out a shared object, link that into the database system process, and invoke it to execute the query. The way to think about this is that the source code we're generating here has, like, a main function; it's not literally called main, but it's a function whose name and parameter signature are known to the database system. So when you get the shared object and want to invoke the query, you just call that one function, and it spits back whatever the result is, right?
As we'll see as we go along, you're obviously going to pay a big performance penalty for having to fork/exec GCC, because that's firing up another process, there's a context switch, GCC wants to read its own config files, and if you're doing this for every single query over and over again, it's going to be slow. This is what JIT compilation is going to solve. Let's see roughly how this works. We have our query here: select star from A where A.val equals question mark plus one. The interpreted plan is that sort of for loop I showed before in the first example. We iterate over every single tuple: get the tuple at a given offset, evaluate our predicate, and then emit the output. In the first step here, when we invoke this function, it has to go look in the catalog and figure out what the schema looks like for the table, then calculate the offset based on the tuple size, so I know to jump to the block and then jump to the fixed offset, and then I return the pointer to the tuple. Now, you can cache this first one here, so you can try to avoid fetching the schema every single time, but these other steps you still have to do. And the big cost is going to be when we evaluate our predicate, because now we've got to traverse that expression tree and pull all the values up and see whether it returns true or false, and if so, then we emit our tuple, right? So what HIQUE does is have a templated plan (the example shown here is pseudocode) with simple slots: they know they have to do an iteration or scan over a table, and the only things that are really going to differ are the predicates and the values you have to substitute when you want to evaluate the scan, right?
So the tuple size, the predicate offset, and the parameter value: these are the things that get told to us when we invoke the query. The predicate could change from one query to the next, but everything else is always the same from one scan to the next. So now all I need to do is take these values and this template, and at runtime it just fills these things in for me. This one here, I just check to see whether it evaluates to true or not, and same thing, I just get the value from up here and then I can evaluate this. So again, I got rid of the lookup to get the tuple and I got rid of the lookup to evaluate the predicate. I removed the two function calls that were going to cause jumps inside our for loop, and now the CPU can iterate through this for loop very quickly. Yes? [Question: how are we using the predicate offset?] The predicate offset is: what attribute in the tuple do I want to evaluate? So I'd have to know what I want, what was it, b.val? So this tells me at what offset in the tuple b.val lives. [Follow-up: but how are you converting this into "val equals parameter value plus one"? How do you know the predicate itself? Before, a function was traversing the tree to find the predicate.] [So how are you converting that into the code directly? Do you traverse the tree during compilation?] Correct. So the question is: how do I convert "A.val = ? + 1" into this line of code?
I take a pass through the query plan that comes out of the optimizer, figure out what the predicate actually is, and convert the expression tree into this line here. So you may think, well, isn't that itself doing a lookup? Sure, but if I have a billion tuples, I only do that once, whereas in the interpreted case you have to do it a billion times. So it should be obvious why this is going to be a big win for us: everything's baked in, there are no additional lookups, and we can just iterate very quickly over every single tuple, evaluate our predicate, and produce the output. Yes? [Question about generating this code at runtime for every query.] We'll get there. The question is: in my example here, when the query shows up, it's going to generate this code every single time. So in theory, if I don't do any query plan caching and I actually run the same query over and over again, it's going to regenerate this thing over and over, and compile it over and over. And so couldn't you recognize that there are only so many things a database does? There are only so many predicates, equals, less than, greater than, and so forth; there are only so many things a query will actually do. Could I just pre-generate all of those primitives, that's the word I'll use, ahead of time? That way, at runtime, I don't have to generate C++ code, I just invoke these functions directly. That's what Vectorwise does; we'll see that later in the class. And that's actually what we do in our own system. [Question: can't the number of predicates depend on the depth of the tree?]
So the question is whether the number of predicates depends on the depth of the tree. But the point is, say the predicate is val equals something. Instead of having this if-clause here to do that, I could have a pre-compiled function that takes two integers and checks whether they're equal. [But you don't know the shape ahead of time; it could be a tree five levels deep, then what?] You can decompose it, right? Into the conjunction clauses: now it's an array, not a tree. A equals one, and B equals two, and C equals three. Each of those can be a function call, something equals something, and I invoke them in an array one after another. I don't have to traverse any tree. [But you still have to construct that array at runtime, so what's the point of having the functions?] One part of the answer is vectorization, which we'll talk about when we cover Vectorwise, but briefly: the compilation cost is what's going to kill the transpilation approach. If you don't have to compile "something equals something" every single time you execute a query, and instead, as in his example, you cache it, or compile it once and just link it, then yes, that wins. That's what Vectorwise does. Right, so related to his earlier question: how can my generated query code invoke and touch other parts of the system if I don't know the memory addresses? Again, this is just C++ code. We're going to link it in with our database system's shared object. So if we expose an API that lets you get access to the internal components of our database system, then our code-gen query plan can invoke and touch those things as well.
So if I need to access the transaction manager, I have a function that says, for my current execution context, give me my transaction manager, and now my on-the-fly code can invoke that and check whether it's allowed to commit, things like that. So it's almost as if it's the same code we would write in our database system, but we're writing it and compiling it on the fly at runtime. We as the database system developers don't know exactly what query is going to show up, but we can still invoke the pieces of the system and actually make it happen. Now, this can be problematic in a JIT system on LLVM, because if it's C++, you have these mangled names for the functions. If you've ever looked at what C++ functions actually look like in GDB, unless you turn on demangling, they're these long strings with the class name and the function name and things like that. That can get a bit gnarly, so you need a way to bridge into the database system to handle that. In HIQUE's world, because they're generating source code that goes through the regular compiler, it's not an issue: they just invoke the other functions as if everything were being compiled at the same time. The other nice thing you get from this transpilation approach is that it makes debugging a lot easier, because now if I crash in my generated code, I can just use GDB and all my standard debugging tools to figure out why I crashed. That's actually going to be a big problem with the LLVM stuff. You do have to do a little extra work to figure out, well, what is the C++ code that generated the C++ code that crashed? You have to make that jump, and debug symbols give hints about how that happened. But again, it's not as bad as it is with LLVM.
All right, so let's look at some experiments they did for HIQUE to understand the benefits of their approach. This paper's pretty old now, I think it's from 2010, but I like it a lot because they built all the different variations of how you could do query execution in a single engine and compared against all of them. They have five different approaches. The generic iterators are like the textbook implementation of a database system, the Volcano model, where you're calling these functions inside the for loops to evaluate things, calling next, next, next over and over again. Then they have a slightly more optimized version, where now the iterators are specific to the different types of columns or attributes you could be accessing in your database, and you can evaluate the predicate before you pass tuples up to the next operator in the tree, so you're basically doing predicate push-down. Then there are the hard-coded implementations, where they had a grad student write a best-effort implementation by hand: one with generic iterators and predicates, equivalent to the first approach, and a more optimized version equivalent to the second. But again, it's hard-coded just for the query, not a general-purpose engine. And the last one is what their source-to-source compiler, the code-generation engine, spits out for the queries. The way to think about the optimized hard-coded one is that it's what he was asking about: can I pre-compile all the predicates and types ahead of time, and now I'm just linking together these different functions. So that's what that one is. For this experiment, they're running, I think, a join query over two 10,000-tuple tables that produces 10 million output tuples.
And so what you see is that, in the case of the generic iterators, no surprise, it performs the worst. And then as you get over here, the difference between the optimized hard-coded one and the HIQUE one, they're about the same, because the HIQUE engine is spitting out C++ code that is roughly equivalent to what the hard-coded one does. The idea, again, is that for any arbitrary query, instead of me having to write code by hand over and over again, I can have the engine generate that code directly. The other interesting thing to point out is that we have a lot more memory stalls in the generic one, just because there's so much indirection, and we're doing these jumps, and we don't know exactly what piece of memory we're gonna read ahead of time. Whereas in these other cases, again, the for loop is super tight, we can rip through things very quickly, and the hardware prefetcher can bring things in from memory ahead of time for us. Okay, yes? You kind of implied that there's never really a case where you don't wanna inline small functions. So his statement is that this basically argues that you almost always want to inline small functions inside the kernels. Yes, I would agree with that, for small functions. I'm trying to think of cases where predicates or functions are really expensive or really large, but actually, I think the compilers these days do a pretty good job of figuring out what to inline. So in the case of HIQUE, when it's code-gen'ing the source code, I don't think you wanna put explicit inline hints. I think you just want the compiler to do whatever it wants to do. I think that's the conventional wisdom now for C++: you don't add inline anywhere, you let the compiler figure things out. Okay, what's the downside of this approach? Well, it's gonna be the compilation cost, right?
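To make the iterator-overhead point concrete, here's a toy sketch (mine, not from the HIQUE paper) contrasting a Volcano-style next() loop with the kind of tight, fused loop a code generator would emit for the same scan-plus-predicate:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Textbook Volcano-style iterator: one next() call per tuple.
struct ScanIterator {
    const std::vector<int> &data;
    size_t pos = 0;
    std::optional<int> next() {
        if (pos == data.size()) return std::nullopt;
        return data[pos++];
    }
};

// Iterator-model evaluation: per-tuple call and branch overhead.
int count_matches_volcano(const std::vector<int> &col, int threshold) {
    ScanIterator it{col};
    int count = 0;
    while (auto t = it.next())          // one call per tuple
        if (*t < threshold) ++count;
    return count;
}

// What a code generator would emit instead: one tight, fused loop
// the compiler can keep in registers and the hardware prefetcher
// can stream through sequentially.
int count_matches_fused(const std::vector<int> &col, int threshold) {
    int count = 0;
    for (int v : col)
        if (v < threshold) ++count;
    return count;
}
```

Both versions produce identical answers; the difference is entirely in the per-tuple function-call overhead and the memory-access pattern, which is the gap the experiments above are measuring.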
So for this one, they're gonna compile with O0 and O2. O0 is basically no optimizations; O2 is the most aggressive optimizations that are considered to be safe, and obviously there's more passes when you do O2, so the compilation time goes up. So this is for TPC-H, right? This is not the query execution time, this is just the compilation time. So now you're starting to get into problems, because in the case of Q3, it's gonna take me 600 milliseconds to compile it, but in some cases the query can be done in maybe 100 milliseconds. So I'm spending more time doing compilation than I'm spending actually executing the query, right? So this is gonna be a problem with this approach. And we'll see MemSQL later in the class, but MemSQL's first implementation actually did this. And when you look at some of the early blog articles, which they've since removed, but if you look in the archive you can still find them, they would have examples where you run a query and it takes one second to run, even though it does no work, because they are fork-exec'ing GCC on the C++ code they were generating for that query, compiling it, and then linking it back in and running it. But they did a good job caching everything, so that when you execute the same query again, in their examples the execution time would now be zero, because they can just reuse that binary over and over again. Okay, so as I said in the beginning, the way we organize our query plan tree is useful for us to reason about as humans. It makes the code reusable, makes the code easily extensible, but again, it's not gonna be the most efficient way to execute it. And in the case of the HIQUE stuff, again, even though the generated C++ code would be a more efficient way to execute these queries, it's going to be slow for us to compile, right? The other big issue with HIQUE is that they're not gonna support full pipelining.
They're still gonna generate, on a per-operator basis, the for loop for that one operator, which still has an emit function to shove tuples up to the next operator, which then has its own for loop to process things. Now, they can do predicate push-down, so that when you do the scan on the table, they'll evaluate the predicate there. But for everything else up in the query plan, again, it's gonna be like that next call: every operator has to run its own for loop as well. So, to understand how we can get better performance with pipelining, we go back to this three-way join query, and now we're gonna divide it up into explicit pipelines. Again, a pipeline is a portion of the query plan where I can take a single tuple and ride it up as far as I can in the query plan, until I reach some point where I can't continue until I go get the next tuple, and all the other tuples that are coming within my pipeline. So the easiest one to understand here is pipeline two. I'm scanning B, then I'm applying my predicate, but now I wanna do my aggregation for the group by, and I can't go past this operator until I get all the tuples into my hash table for my aggregation, because I'm computing a count. So I need to know the count of all the tuples that I have for the group by clause before I can pass anything up over here. In the case of pipeline four over here, I can scan C, and assuming I've already built the hash tables to do the joins on A and B, I can take a single tuple, ride it up here, and check to see whether the join on B matches. If so, then I can run it up here and check to see whether the join on A matches. So the idea is, again, we can have the tuple go as far as we can up the query plan until we hit a pipeline breaker, right? So this is what Hyper does. Hyper actually has two main ideas in the paper you guys read.
So the first is that they're gonna do this push-based query processing model, but then they're also gonna do just-in-time compilation of the query plan using LLVM, right? And so when you read the paper, hopefully you didn't read the appendix. I should have warned you ahead of time, because there's all this LLVM IR in there, and it's not really useful. The core material of what's actually going on in the paper is in the front body of it. So again, the reason why they're gonna do this push-based model is that, more than just keeping things in your CPU caches, now you can keep tuples and values in your CPU registers, which are even faster than the L1 cache. So now as I go up the query plan in my pipeline, if I'm keeping the same tuple in my CPU registers, then I can rip through it very, very quickly, right? For those who don't know, actually, who here has heard of LLVM before? Or who has not heard of LLVM before? Okay, perfect. So LLVM originally stood for the Low-Level Virtual Machine, although you shouldn't think of it as a virtual machine like VirtualBox or VMware. It originally started at UIUC around 2000, and they were trying to build a tool to investigate dynamic compilation techniques for programming languages. And they ended up building this toolkit that has all the different components you would need to build a full-fledged compiler. The idea is that rather than being a compiler for only one programming language, it would have different front-end plugins. You can take different programming languages, build a front-end that converts each one to the LLVM IR, and then from there compile it into machine code. So Apple has invested heavily in this. They hired one of the main guys from UIUC in, like, 2004-2005, and he basically runs all the work they do on Clang and LLVM, right?
So again, what's gonna happen is we can take any arbitrary language and convert it down to this low-level IR that's gonna sort of look like assembly, but it's designed specifically for the abstract machine that LLVM provides. And then on the back end, they can compile that IR into whatever your target ISA is. So it supports x86, it supports ARM; you can take any arbitrary language, convert it to the IR, and then have it spit out code for any CPU ISA that LLVM supports, right? So an important thing to understand, though, is that in the case of Hyper, Hyper is gonna have C++ code emit LLVM IR directly. So HIQUE was C++ code generating C++ code; Hyper is C++ code generating IR. But the rest of the system, just like in HIQUE, does not need to be written in the same language that the query plan is generated in. So the rest of Hyper is written in C++, and the generated code can still make calls into that C++ code, but you have to deal with the mangled function names and class names, as I said. LLVM doesn't do that for you for free, but you can still have the IR code call into your C++ code. So you still get the same benefit of everything running in the same address space, right? So now let's go back to this query plan and see how we would generate source code that does the push-based approach in Hyper. So now we have these different for loops for the different pipelines. And again, the idea of a pipeline is that it's a for loop, or a bunch of for loops, that can take a single tuple and keep processing it as far as it can go up the query plan. First, if we're doing the scan on A, there's just a for loop over A, and then we evaluate our predicate, and then we can materialize the tuple into our hash table, right? Then we jump over to the pipeline on B: we can just scan through that, apply our predicate, build our hash table, and then materialize the output of the aggregation on B. But now here, in the pipeline on C, we have three nested for loops.
So we're gonna take a tuple from C and try to do the join against B, and if that matches, we try to do the join against A, and if that's correct, then we can spit it out as our output. So for one tuple in C, we can do the join with B, the join with A, and then produce the output if necessary, without ever switching to another tuple. Yes? Are there multiple ways to generate this? So the question is, are there multiple ways to generate these pipelines, and if so, is there always gonna be an optimal way to do it? Well, actually, there's two things. One is, are there different ways to generate this query plan? Yes, right, because that's what the query optimizer does. And from our perspective in this class, we're just saying, the optimizer gave us a query plan, how can we generate pipelines for it? And in that case, it's a pretty simple heuristic to decide what these pipelines should look like. The typical way you do this is, you start with the left side of the tree for any join; that's a pipeline feeding into the join. And then on the right side, you try to have the pipeline go all the way up until you hit a pipeline breaker. So to answer your question: yes, there are different ways I could generate the pipelines for a given physical plan. Like, I could break a pipeline in here and start a new pipeline; it'd be stupid, but I could do that. But the heuristic to find the optimal pipelining for a given physical plan is pretty straightforward. The harder decision is, should I join A with B or B with A, all the stuff from before.
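A rough sketch of what the fused code for that probe-side pipeline looks like, assuming the hash tables on A and B were already built by the earlier pipelines. This is my own toy reconstruction (both joins on the same key, made-up names), not Hyper's generated code:

```cpp
#include <cassert>
#include <unordered_map>
#include <utility>
#include <vector>

// Toy tuples: each table row is a (join_key, payload) pair.
using Table = std::vector<std::pair<int, int>>;

// Probe-side pipeline: for each tuple in C, probe the pre-built hash
// table on B, then the one on A, and emit only if both match. One tuple
// rides the whole pipeline before we ever fetch the next one.
std::vector<int> probe_pipeline(const Table &c,
                                const std::unordered_map<int, int> &ht_b,
                                const std::unordered_map<int, int> &ht_a) {
    std::vector<int> out;
    for (const auto &[key, payload] : c) {              // scan C
        auto b = ht_b.find(key);                        // join with B
        if (b == ht_b.end()) continue;
        auto a = ht_a.find(key);                        // join with A
        if (a == ht_a.end()) continue;
        out.push_back(payload + b->second + a->second); // emit result
    }
    return out;
}
```

The point is that the tuple never leaves the loop body between the scan, both probes, and the emit, so its values can stay in registers the whole way up.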
So again, what Hyper is doing is, instead of generating this pseudocode, they're gonna generate the LLVM IR that does exactly these steps, compile it as one giant function, all the pipelines together, and then it's just staged: all right, I'll run this for loop, and when that's done, I jump to this part, or actually it's not a jump, it's just executing sequentially. Then I do this for loop, then that for loop, and then this last for loop for the final pipeline. In our new system, we can actually compile these pipelines separately, but in this version of Hyper, everything was compiled all at once. So you had to have all the IR for the entire query plan, for all your pipelines, generated together in this giant function, and then you fire it off to LLVM. So let's look at some performance numbers comparing different approaches. One is Hyper doing the LLVM IR, and then one is Hyper doing the HIQUE approach, where you're spitting out C++ code and then fork-exec'ing GCC. Then we have the VectorWise approach, which is using the pre-compiled primitive method that he mentioned. MonetDB generates what looks like an IR, but then they have an interpreter for it; they're not actually compiling it into machine code. And then Oracle does nothing, it's just interpreting the query plan the same way we always did, right? This is also not measuring the compilation time. This is just saying, assuming I have everything compiled ahead of time, how fast can I go? And because the LLVM version of Hyper is doing more aggressive pipelining, making sure that each pipeline is as long as possible, it can do slightly better than the others. And Oracle is always gonna lose, again, because it's always doing interpretation, right? In the case of Q1, there's no join; it's a single table with just a bunch of aggregations.
So that one you can do more efficiently if you code-gen everything. Q5 is like five or six joins, and the output is pretty simple. So there you don't get as big of a benefit, because the major cost in executing that query is always gonna be the joins, and that machinery is gonna be pre-compiled anyway: probe a hash table, invoke the hash function, for example. So in that case, you're not gonna get that big of a benefit from code-gen'ing things, right? So let's now look at the compilation cost. Now, this is not a true apples-to-apples comparison, because I'm taking the numbers from the HIQUE paper and mashing them together with the Hyper results, so they're not running on the same hardware. I think the Hyper guys are running on a Xeon; HIQUE is running on a Core 2 Duo from 2008, 2009. The scale factor is still the same, and they're still compiling the same TPC-H queries, so it's the relative difference that matters, right? Just because one has a slightly newer CPU, it's not gonna magically get faster. So this is just showing you that not having to parse the C++ code, run it through your tokenizer and AST, and then compile it as you would in GCC or Clang, but instead emitting the IR directly and then running optimization passes on that inside LLVM, that's gonna be at least one order of magnitude faster than GCC. So for this reason, I think the LLVM compilation approach is the right way to go if you have a C++-based engine. We'll see examples of some Java-based database systems; they do sort of the same thing. They emit Java bytecode directly, instead of emitting Java code and then compiling that, okay? So where is this compilation cost coming from? You know, 37 milliseconds is not as bad as 400 milliseconds, but it's still a lot, right?
Some queries can run in less than a millisecond, but if it takes me 37 milliseconds to compile one, assuming I can't cache it ahead of time, then I'm not really getting any benefit. So what's happening here? The issue is that the compilation time is gonna depend on the query size. That means the number of joins we have, the number of predicates, the number of aggregates, just how complex the query is. The more things we're trying to do, the more the compilation time goes up. And now there's this trade-off: if my query's gonna run for 30 seconds, who cares if it took maybe an extra second to compile? I'm still gonna get a big win in terms of the performance numbers we saw over Oracle. But for other queries that are really fast, where the data they need to process can be ripped through very quickly because it's a column store, the compilation time could start to eat into the execution time. Now, for OLTP applications, this won't be an issue. Can anyone take a guess why? Why would we care less for OLTP? Less joins. He says fewer joins. Yes, that's one reason. And you were saying the same thing: the complexity of the query is low compared to OLAP, which means fewer joins, right? So yes, the OLTP query's gonna be way less complex. You're not gonna do a hundred-table join. It's gonna be like, look up Andy's record from the index and go get some basic information, and maybe do a join with a foreign-key table. There's another reason as well. Caching, exactly. In OLTP applications, we're gonna execute the same queries over and over again. I load the webpage on Amazon, they do a query lookup in the index to get my record; he goes and visits Amazon, and it's the same query, just a different key. So you can cache that, either as a prepared statement or as pre-compiled code, and invoke it over and over again, right? For OLAP, though, this is gonna be an issue.
But figuring out where that trade-off is, having to decide whether a query is worth compiling versus just interpreting, that's hard, right? Because at the point where we're doing code gen, we have a rough idea of how much data we're gonna access, but those estimations can always be very wrong. Yes? Generally, OLAP queries are not getting done in one or two seconds, right? They take a lot of time, like one or two minutes. Well, if it's in memory, then they can be really fast, right? It depends on the data also. So I think this is scale factor one, so this is reading one gigabyte of data in 35 milliseconds; multiply that by however many gigabytes you have, 1024 for a terabyte. For an in-memory system, this is an issue. For a disk-based system, the disk is always gonna crush you, so it matters less, which is part of the reason why Oracle has probably never done this, at least for the traditional disk-based system. Okay, so for a disk-based system, doing this wouldn't matter because the disk is the bottleneck? Correct, yes. In memory, the compilation time and all the function-call overhead are gonna be the bottleneck. Now, with that said, Postgres is a disk-based system, and we'll see this in a second: they do compilation as well now, as of 2018. But they have this little parameter you can set to make the decision about whether to actually compile or not, based on what the estimated execution cost of the query is gonna be. So let me talk about what Hyper does. The motivation for what Hyper's gonna do here was that, just like us, they were trying to support the Postgres wire protocol and the Postgres catalog. And there's this very commonly used tool called pgAdmin, which is a graphical interface for configuring your Postgres installation.
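For reference, the Postgres parameters he's alluding to (added with the LLVM JIT support in Postgres 11) are cost-based thresholds: the planner only bothers to JIT a query when its estimated plan cost crosses them. The values shown here are what I believe are the documented defaults:

```
# postgresql.conf (or per-session SET): JIT is cost-gated
jit = on                         # enable LLVM JIT compilation
jit_above_cost = 100000          # JIT only if estimated cost exceeds this
jit_inline_above_cost = 500000   # additionally inline functions/operators
jit_optimize_above_cost = 500000 # additionally run expensive LLVM opt passes
```

So a cheap OLTP-style lookup never pays the compilation cost, while an expensive analytical query gets progressively more aggressive compilation, which is exactly the trade-off being discussed here.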
So the way all these visual database tools work is that when you turn them on, they connect to the database and immediately run a bunch of queries against the catalog to figure out, what tables do I have, what columns do I have, what indexes do I have, so they can expose that to the DBA to manage the database. So in Hyper's case, when you would turn pgAdmin on and point it at Hyper, there would be this long pause at the very beginning, right? We're talking about maybe 10 seconds, because pgAdmin would fire off all these queries that then had to run through the LLVM compiler just to figure out what tables you have. Whereas if you run it against regular Postgres, which doesn't do any of that compilation, pgAdmin would boot up more quickly. Those catalog queries aren't that complex, but there were just a lot of them, and the compilation cost was eating all your time. So this is a paper that came out in 2018 from the Hyper guys; it won best paper at ICDE. I think this is actually a really good idea. We tried doing this in Peloton, but then we killed off Peloton. We do something slightly different than what they're doing, but the idea here is a really good one. So what they're gonna do is, when a query shows up, they're still gonna generate the IR just as you normally would, but then rather than firing off the LLVM compiler, waiting for that to finish, and then starting to execute the query, they're gonna have an IR interpreter. Think of this as a VM that can interpret that IR and start executing it right away. The German guy, Thomas, apparently wrote it in two weeks. You basically take the bytecode that LLVM spits out in its IR, and you implement a virtual machine to execute it. So it's gonna make all the same function calls into the rest of the system, and evaluate all the same operators and predicates as the compiled engine would, but it's running as an interpreter.
So it's not like you have to build two completely separate engines; you just have the interpreter execute the same instructions that the compiled version would. So now the interpreter is running. Then, in the background, you start compiling the query, and when the compiled query is ready, if the query is still running, you just slide it in to replace the interpreter execution. Again, they're using morsels. So what would happen is, every time a worker thread completes a morsel, it checks some flag that says, is my compiled version ready? If yes, then invoke that; if no, then keep running in the interpreter. So for those queries that could take a long time to compile but execute very quickly, you can finish them off entirely in the interpreter and not wait for the long compilation to finish. They're actually gonna do three stages of compilation, based on what types of optimizations they're gonna apply. So again, the SQL query shows up, and in their example the optimizer takes maybe 0.2 milliseconds. Then they have this code-gen engine, and that takes 0.7 milliseconds, because you're traversing the tree and spitting out the IR. And the first thing they'll do is pass the IR off to this bytecode compiler, and execute that. Sorry, it's a compiler, not an interpreter: you're taking the IR and converting it into a bytecode that they can then interpret. That takes 0.4 milliseconds. Then the IR also goes to the LLVM compiler, but with all the optimization passes turned off, so loop unrolling, peephole optimizations, all that's off. That completes in six milliseconds. So that's gonna be a bit faster than the interpreted version, and when it finishes, you can replace the interpreter with it.
But then, if the query runs even longer, they'll also run the IR through all the optimization passes that LLVM provides. Now you're doing dead-code elimination, common-subexpression elimination, the peephole stuff. Then they run it through the compiler, and that spits out the x86 code. So the idea is that I start interpreting right away on the bytecode. If the query finishes right away, then I'm done. If the unoptimized compile finishes before the execution finishes, then I start executing that instead. But I also fire off the optimized pass, and if the query is still not done by the time that's ready, then I replace the unoptimized version with the optimized one. So you're staging how fast the execution engine gets. And the idea is that rather than waiting for the full compilation to finish, which in this particular example here is 25 milliseconds plus 70 milliseconds, rather than doing no work during that time, I can at least get some work done. It's not gonna be as efficient, but it's better than nothing. So they have some numbers on TPC-H for those three stages. Again, this just shows you the relative difference in performance between the bytecode interpreter, the unoptimized LLVM version, and the optimized LLVM version. You're getting an order of magnitude difference between the optimized compiled version and the interpreter, so that explains why you wanna have both. The other benefit you get, which they don't talk about so much in the paper, is that since it's the same IR, the bytecode is executing the same query that the compiled version executes. So if there's a bug in how I generate that IR, rather than looking at the compiled version of the query plan, where you're not gonna have debug symbols or a stack trace when you crash, you're just gonna land in assembly, you can at least step through the interpreter and figure out why your query is breaking. All right? Yes?
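The morsel-level switchover they describe can be sketched with an atomic function pointer that the background compilation thread publishes, and that workers re-check between morsels. This is my own illustrative sketch (all names made up), not the actual Hyper code:

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

using PipelineFn = long (*)(const int *, size_t);

// Slow path: stands in for the bytecode interpreter.
long interp_sum(const int *data, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; ++i) s += data[i];
    return s;
}

// Fast path: stands in for the LLVM-compiled machine code that the
// background compiler thread publishes when it finishes.
long compiled_sum(const int *data, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; ++i) s += data[i];
    return s;
}

// Starts out pointing at the interpreter.
std::atomic<PipelineFn> active_fn{interp_sum};

// A worker processes one morsel at a time; between morsels it re-reads
// the pointer, so it picks up the compiled version as soon as it lands.
long run_query(const std::vector<int> &col, size_t morsel_size) {
    long total = 0;
    for (size_t off = 0; off < col.size(); off += morsel_size) {
        size_t n = std::min(morsel_size, col.size() - off);
        total += active_fn.load(std::memory_order_acquire)(col.data() + off, n);
    }
    return total;
}
```

A short-running query may finish before the compiler ever calls `active_fn.store(compiled_sum)`; a long-running one transparently speeds up mid-flight, which is exactly the behavior described above.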
But with this, you can only debug errors in the IR generation itself; if something goes wrong after that, in the bytecode or the LLVM path, you can't. Yeah, so the statement is, if there's a bug in compiling the IR into the bytecode, or a bug in the interpreter itself, then yes, you have to figure out what's going on there some other way. Even further down, from the IR to the machine code, that's LLVM, so the compiler itself could be wrong. Well, there's a hierarchy of what could be wrong. The first thing to blame is always your own code. The next thing to blame is maybe the library you're using. Then maybe you blame the compiler, and then, very unlikely, you blame the hardware. The highest probability by far is that your code is wrong; this interpreter is not gonna be wrong. So the point I'm trying to make is, this is not gonna be that difficult to write, and it's not something that has to be modified all the time as we add new functionality. If we design the system such that it's general enough, every time we add a new SQL function, it's not like we need to modify the interpreter. Think about it this way: there's a small number of smart people that can write a database system, and there's an even smaller number of smart people that can write this piece, right? So we pay that person a lot of money to get it right, and assume it's right going forward, okay? We have 15 minutes; I wanna rip through a bunch of different real-world implementations of this very quickly. So, as I said at the very beginning, HIQUE was the first example in the modern era, and the reason I use that phrasing is because IBM did this, as with many things in databases, back in the 1970s, but then they abandoned it. Now pretty much a lot of the systems today are using this approach.
So IBM had a primitive form of code gen and query compilation back in the 1970s for System R. Remember, the System R project was: they took Ted Codd's paper, handed it off to the researchers at San Jose, got a bunch of people with brand-new PhDs in a room, and said, hey, build a relational database system. Every person with a PhD carved off one piece of it. One guy did storage, one guy invented SQL, another woman did query optimization, and then somebody did this code-generation thing. And what would happen is they would take a SQL statement and spit out assembly that would be, again, just like in HIQUE or the IR for LLVM, exactly the baked-in execution plan for that query. And they had a bunch of templates and would sort of splice things together. It turned out, though, that this was a huge pain to maintain and engineer, because back in the 1970s, IBM had all these different mainframes with all these different ISAs and instruction sets that they had to support. So in order to get System R to work on a System/360 or some other machine, you had to port all this assembly stuff, which was error-prone. So when they went and started building DB2, some pieces of System R made it into the first version of DB2, like the SQL stuff, but all this code-generation stuff was abandoned. The other big issue was that any time the other parts of the system changed, like the layout of pages for tuples, or the indexes, you had to go change all of this assembly code, which was a huge nightmare, because every time there was a change, you've got to change this part and test it.
So there's a great retrospective that came out in 1981 that talks about the history of System R, and then the main developers who worked at IBM at the time did a panel, a bunch of interviews, in the early 1990s, where they talked about what it was like building a database system in the 1970s, when nobody knew how to build a database system. And one of the things, when you read those interviews, is they talk about how this thing was a huge nightmare, and that when they built DB2, they got rid of it. So Oracle: for their high-end offerings, like the in-memory column store and Exadata, they do something similar. But if you just download regular Oracle, the disk-based version, it doesn't do any compilation for queries. They might compile predicates, but again, that might only be in the high-end versions. The one thing they do compile, though, is stored procedures. They're gonna take your PL/SQL stored procedures and convert them into Pro*C, their specialized dialect of C, and then compile that into native C/C++ code. And the reason they do this is that the generated code has a bunch of security checks to make sure that your stored procedure is not doing something weird with the address space, so they don't have to run it in a sandbox; they can run it directly inside the database system process. The one thing that Oracle can do that nobody else can do, that's super insane, is actually put database operations directly in hardware. Now, FPGAs are a thing, and people have done this for database systems, but those are still slightly more general-purpose. This is them actually manufacturing the CPU and putting specialized instructions for the Oracle database on the chip. They bought Sun back in 2010, so someone there is making the SPARC chips.
So some of the newer versions of the SPARC chips have support for Oracle's compression algorithms or Oracle's bitmap stuff directly in hardware. That blows away anything you could do with code gen, because instead of compiling code that does what your database wants to do, you just invoke the operation on the hardware itself. It doesn't get any faster than that. I don't think they do this anymore, because I don't think they make SPARCs anymore, but maybe four or five years ago they were touting this: you could buy an Oracle rack machine from them with Sun's hardware, and it could make your Oracle database go faster. We'll talk about FPGAs at the end of the semester. Hekaton we've already talked about: they can compile both stored procedures and SQL. What was kind of cool is that they would pre-compile a bunch of these operators that allow non-Hekaton queries to touch Hekaton data, and I think you can run those efficiently. And the way they would do this is they would generate C code from the syntax tree of the query plan, compile that using the Microsoft compiler, generate a DLL, and then link that in at runtime, the same way that HIQUE actually did it. And to make sure that nobody was doing something funky, they would have a bunch of extra checks to make sure that you didn't have a weird predicate that tried to do a buffer overflow to take control of the system. Now, Actian Vector is what he was asking about before. They're not really doing code gen; they're pre-compiling these primitives, and this is why I used that term. Think of a primitive as some low-level operation you wanna do on a piece of data repeatedly, but it's all gonna be written for a specific type. So I'll have a primitive to do a comparison between two numbers, and there will be one for 32-bit, 64-bit, 16-bit integers, floats, right? I pre-compile all these primitives.
Then at runtime, what my query plan is basically doing is stitching together all these pre-compiled primitives, as if it were a bunch of C++ code generated on the fly. Now I'm just making calls into these pre-compiled functions, and that's going to be almost as fast, or in some cases faster. So looking at one example here: here's a primitive that takes a pointer to a bunch of int32 values inside of a column, takes the value I want to compare against, and then just does a less-than. If it matches, it produces output in my output buffer. And here's the same function, but now I'm comparing against a double. My input column is still 32-bit integers, but my comparison value is a double, so I just add that piece in there, and the compiler generates the right casting code for me. We'll see this next class and also after spring break. Now, you would think: isn't this going to be slow if I'm invoking this function for every single tuple? Well, that's why they're passing in a pointer to a column. It's sort of in the name, VectorWise: they're passing in a vector of tuples, so the loop can be vectorized by the compiler using SIMD instructions. Now I'm not doing a comparison between one scalar and another scalar; I'm taking a batch of values and invoking a single instruction to execute the comparison more efficiently. Again, we'll see how to do vectorization next class, but this is one of the reasons why, although VectorWise is not a full codegen engine, it can still match the performance of a codegen engine, because you get this benefit here. Whereas Hyper can't do vectorized execution; it's only tuple-at-a-time. All right, now there are a bunch of database systems that run on the JVM. Spark in 2015 announced this new Tungsten execution engine. Spark is written entirely in Scala.
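As a rough sketch of what these typed selection primitives might look like (the function names and signatures here are invented for illustration, not Actian Vector's actual code):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical VectorWise-style selection primitives: each one is
// specialized for a single input type, so the compiler can
// auto-vectorize the tight loop with SIMD instructions.
// Returns the number of matching positions written to out_sel.
size_t sel_lt_int32(const int32_t* col, size_t n, int32_t val,
                    uint32_t* out_sel) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        // Branch-free form: always write the index, advance only on match.
        out_sel[k] = static_cast<uint32_t>(i);
        k += (col[i] < val);
    }
    return k;
}

// Same primitive, but the comparison value is a double: the int-to-double
// cast is fixed at compile time instead of interpreted per tuple.
size_t sel_lt_int32_double(const int32_t* col, size_t n, double val,
                           uint32_t* out_sel) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        out_sel[k] = static_cast<uint32_t>(i);
        k += (static_cast<double>(col[i]) < val);
    }
    return k;
}
```

Because the loop body is a simple typed comparison over a contiguous array, an optimizing compiler can turn it into packed SIMD compares, which is the effect the lecture describes.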
So inside the source code, they have a way to take the predicates inside of your query, because Spark supports SQL, and convert them directly into Scala ASTs, which then get compiled into bytecode that they can invoke and execute natively inside the engine at runtime. So Spark is doing this for Scala, running on the JVM. There are a bunch of other JVM databases that are more or less all doing the same thing: Neo4j, Splice Machine, Presto, and Derby. Splice Machine uses Derby, so those two are kind of the same. I was looking at the Neo4j source code last night. It's not very well documented, but from what I can tell, it looks like they can generate the bytecode for your query and also reverse it back into Java source code. So if you want to run your query through a debugger and figure out why it isn't actually working, they can support that. I don't know whether the other ones can do the same thing. The other thing to think about with the JVM is that, because it does just-in-time compilation, I can emit the bytecode without running any compiler passes on it right away. Then the HotSpot VM will recognize that I'm executing this bytecode over and over again, in my for-loop as I'm accessing every single tuple, and it will do the compilation for me. So I don't need the multiple stages that we had with LLVM; the JVM takes care of that for me. I think that's kind of cool. So MemSQL is an interesting one, because they've had two versions of their engine. As I said before, one of MemSQL's co-founders was at Microsoft when they were building Hekaton. Although he wasn't working on Hekaton, he saw the early internal talks from the researchers about how they were doing codegen for Hekaton using C.
So when he went off and built MemSQL, he did more or less the same thing, which is the same thing as HIQUE: they had code that would generate the C source code for a query plan, then fork-and-exec GCC, link in the resulting shared object, and run the query. And as I said, in the early versions of MemSQL, when you look at their blog articles, they would show examples where the first time you run a query takes one second, because of all the compilation overhead, but subsequent runs are much faster. The way they made it go faster is through caching: take any query that shows up, extract out the constants, and then, if the same query shows up again just with different input parameters, reuse the cached shared object. So my query shows up: SELECT * FROM a WHERE a_id = 123. I recognize I have 123 here, rip that out into a parameter, compile the plan, and cache it. Now if another query shows up with a_id = 456, I recognize I can reuse the same query plan, run it, and not pay the compilation overhead. Now, we asked them, and they told us that the only thing they were doing here was string matching. So if my predicate was WHERE a_id = 1 AND b_id = 2 and I cached that, but now a query shows up with the reverse, b_id = 2 AND a_id = 1, they couldn't reuse it. Even though semantically it's the same query, the string won't match, right? And the MemSQL guy, Nikita, told me that in the early days of MemSQL, if they had to do it all over again, they would not have added this compilation stuff at the beginning, because it was a huge pain to maintain. I think they were going through a lot of the same pains the IBM guys had. Yes? [Student question:] This compilation is just a one-time thing for a query, so how much overhead does it add, and how much of it can caching avoid?
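A toy sketch of that string-matching cache: strip numeric constants out of the SQL text to form a cache key, so textually identical queries with different constants hit the same compiled plan. The function and type names are illustrative, not MemSQL's actual code, and real systems parameterize the parse tree rather than the raw string.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Replace every run of digits in the query text with '?', collecting the
// extracted constants. The returned string is the plan-cache key: two
// queries that differ only in their numeric literals produce the same key.
std::string parameterize(const std::string& sql,
                         std::vector<std::string>* params) {
    std::string key;
    for (size_t i = 0; i < sql.size();) {
        if (std::isdigit(static_cast<unsigned char>(sql[i]))) {
            std::string lit;
            while (i < sql.size() &&
                   std::isdigit(static_cast<unsigned char>(sql[i]))) {
                lit += sql[i++];
            }
            params->push_back(lit);  // remember the constant as a parameter
            key += '?';              // placeholder in the cache key
        } else {
            key += sql[i++];
        }
    }
    return key;
}
```

This also shows the limitation the lecture mentions: `WHERE a_id = 1 AND b_id = 2` and `WHERE b_id = 2 AND a_id = 1` produce different keys even though they are semantically the same query, so the cached plan is not reused.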
So you're saying: this compilation is expensive, but how much is the overhead of the compilation versus the execution time of the query, and how much can we actually cache? I mean, as you showed, it was 36 milliseconds? Yes, right. But consider a query where the interpreter takes 10 milliseconds: if my compiled version runs in one millisecond but takes 20 milliseconds to compile, then I'm better off just running the interpreter. MemSQL, to avoid the engineering overhead of building and maintaining two separate engines, would compile every query, and then you hope that most queries show up with the same pattern so you can reuse the cached plan over and over again. How much benefit this provides depends on what the query is actually going to do. If I'm going to read one tuple, then caching could help a lot, because the query itself is going to be so short. If I'm reading a petabyte of data, caching is probably not going to make a difference; who cares? Now, for in-memory systems, most of you don't have databases running in one petabyte of memory, because that would be super expensive, so most in-memory databases we're looking at are tens to hundreds of gigabytes. I think MemSQL said they had a customer with 14 terabytes or something like that. Still a lot, but most queries aren't going to rip through everything. So, as I said, the MemSQL guy told me that if they had to do it over again, they would not have started with the compilation stuff. But then they got a bunch of money, which always makes things easier, and they hired the guy from Facebook who built the HipHop VM. Facebook famously runs on PHP. PHP is an interpreted language, and the default PHP interpreter, back when I did PHP development, was super slow, so Facebook could never get good scalability with it.
So Facebook built their own VM that can compile PHP, and MemSQL hired the guy who led that HipHop VM project to rewrite their execution engine to be entirely based on LLVM. This is, in my opinion, the right way to do an LLVM-based query execution engine, and it's basically what we're doing now in our own system. What they do is take the physical query plan the optimizer spits out and first convert it into an imperative plan written in a high-level domain-specific language, or DSL, that they call MPL, the MemSQL Programming Language. It basically looks like C++, and we'll see an example in a second. Then you take that DSL and compile it into a bunch of opcodes, and those opcodes can be either interpreted or compiled. In MemSQL's case, I think they always went straight to compilation, but you could still have an interpreter the way the Hyper guys did for their IR, right? And the benefit you get from this, for the database engineers actually building the system, is that it's way easier for new hires to work on the upper layer without worrying about the layer below, because I can assume my really expensive, really smart people wrote the bottom part correctly, and everyone else can work on the part above it. You guys will do the same thing. For your project three, if you end up working on the execution engine and want to add new features, like new string functions, SQL functions, or date functions, you don't have to touch the lower layer; you just modify the layer on top. All right. Postgres added support for LLVM compilation in 2018. It first came out in version 11, where it was turned off by default, and then as of version 12 it's turned on by default. So for every query, they have an internal cost model to decide whether to do compilation or not.
And they do compilation of predicates and tuple deserialization, and basically reduce the number of get-next calls they have in the iterator model. So they try to inline everything as much as possible, right? And the way they did it is super interesting: they took all the backend Postgres code, the server itself, which is written in C, and compiled that same source code into LLVM bitcode, and through that they can remove these iterator calls. So you take the function that adds two numbers together, and they pre-compile that into LLVM bitcode; now, as I do codegen for queries, I invoke that pre-compiled bitcode and not the regular Postgres C function, right? I could give a demo, but we'll skip it; we'll do it at the end if we have time. Cloudera Impala is another one using LLVM. They only do predicates, just like Postgres, but they also do it for record parsing, which is interesting. Impala doesn't have its own proprietary storage format; Postgres and MySQL each have their own proprietary row or column layout of data, but Impala runs on Parquet, on Avro, on all the Hadoop and cloud-based file formats. So what they do, in order to parse these records more quickly, is pre-compile a bunch of these parsers ahead of time. If my query shows up and I'm operating on some CSV file, I know what the schema is, so I can pre-compile the CSV parser and don't need an interpreter to go find the columns and things like that. All right, I'm going to skip VitesseDB; I want to quickly talk about ours.
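Postgres's real cost thresholds are the `jit_above_cost`, `jit_inline_above_cost`, and `jit_optimize_above_cost` settings; the decision logic below is a simplified sketch of that idea (the function and struct names are invented, and the threshold defaults mirror but do not reproduce Postgres's actual code):

```cpp
#include <cassert>

// Sketch of a Postgres-style JIT decision: cheap queries stay on the
// interpreter, moderately expensive ones get JIT-compiled, and only the
// most expensive ones pay for inlining and the heavy LLVM optimization
// passes, since those add their own compile-time overhead.
struct JitFlags {
    bool jit;        // emit LLVM IR for expressions / tuple deforming
    bool inline_ops; // inline pre-built operator bitcode
    bool optimize;   // run expensive LLVM optimization passes
};

JitFlags decide_jit(double plan_cost,
                    double jit_above = 100000.0,
                    double inline_above = 500000.0,
                    double optimize_above = 500000.0) {
    JitFlags f{false, false, false};
    if (plan_cost > jit_above) {
        f.jit = true;
        f.inline_ops = plan_cost > inline_above;
        f.optimize = plan_cost > optimize_above;
    }
    return f;
}
```

The point is the same trade-off discussed above: compilation only pays off when the query is expensive enough to amortize the compile time.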
So, in the first version of Peloton, we did what Hyper did: we would have our C++ code spit out LLVM IR directly. We weren't doing the full pipelining they were doing; we actually introduced pipeline breakers at different parts of the query plan, because then we can pass around vectors of tuples and get the benefits you have in VectorWise. This will make more sense next class, and we use software prefetching to hide the memory stalls. So, just to show some numbers, this is what my PhD student, Prashanth, ran: the interpreted version of Peloton versus the compiled version, with and without this relaxed operator fusion technique. Now, this is not a fair picture of the benefit you can get from compilation, because the interpreted engine, the gray bar here, was a bad engine; it wasn't good at all. So I don't want you to get the idea that you're going to get a 100x performance improvement through compilation; it's usually in the range of 2x to 25x. But this is showing that we can go from roughly 88 seconds down to 800 milliseconds. That's a pretty significant drop from doing compilation, right? But we abandoned this approach because it was a huge pain: you needed to be an expert in LLVM IR to debug anything or make any changes to the execution engine, because when a query crashed, you landed in assembly with no stack trace, and only two or three students could actually work on it. Our new system, which is currently unnamed (we have a name, we just haven't announced it), does what MemSQL does: we take the query plan, convert it into a high-level DSL that's specific to our database system and basically looks like C, then compile that DSL into opcodes, and then we can interpret those opcodes while compilation occurs in the background.
So it looks like this: my query plan shows up, and I convert it into our DSL, which again looks like C. This example is doing a scan on foo with a predicate on column A and column B, and inside my DSL I'm doing that operation directly. Then I convert this into opcodes. The exact details don't matter, but this is actually human-readable, right? We have things like table vector iterators to iterate over tuples, get-next calls, and function calls to read integers and so on. Now I can take these opcodes and start interpreting them immediately, which means I can start executing without having to run the compiler. Then, in the background, I run the LLVM optimizing compiler, generate my shared object, link that in, and switch over to it, right? Again, that was a bit rushed; I can show you a demo of this next class. All right, so the main takeaways: query compilation is super important, and every modern database system is going to want to do this; the problem is that it's not easy to implement. You need to know something about compilers and low-level systems. The MemSQL approach, as of 2016, is the way to go, and every system is now exploring it. All right, next class we will discuss vectorization, and we'll talk about project three topics. Okay?
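A minimal sketch of the interpret-first half of that design: the query is lowered to opcodes that a simple dispatch loop can run immediately, while the optimizing compiler works on the same opcodes in the background. The opcode set here is invented for illustration and is far simpler than a real engine's.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A tiny stack-based bytecode: enough to evaluate an expression like
// (1 + 2) < 4 without waiting for any native code to be compiled.
enum class Op : uint8_t { PushConst, Add, LessThan, Halt };

struct Instr {
    Op op;
    int64_t arg;  // only used by PushConst
};

int64_t interpret(const std::vector<Instr>& code) {
    std::vector<int64_t> stack;
    for (const Instr& in : code) {
        switch (in.op) {
            case Op::PushConst:
                stack.push_back(in.arg);
                break;
            case Op::Add: {
                int64_t b = stack.back();
                stack.pop_back();
                stack.back() += b;
                break;
            }
            case Op::LessThan: {
                int64_t b = stack.back();
                stack.pop_back();
                stack.back() = (stack.back() < b) ? 1 : 0;
                break;
            }
            case Op::Halt:
                return stack.back();
        }
    }
    return stack.back();
}
```

In the real system, once the background LLVM compilation of the same opcodes finishes, execution switches from this dispatch loop to the compiled shared object; the interpreter just hides the compilation latency.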