Today's lecture is something that's actually really important: this is the way you want to build a modern database system. Last year when I taught this lecture, I put it at the very beginning, like the third lecture, because in the older system the compilation stuff was the major direction we wanted to go, and it was the new part of the system that I wanted to teach everyone so they understood it and knew how to use it in all their projects going forward. It's at the end this semester because, as you saw from one of the project groups, our system doesn't have an execution engine yet, so we don't have any of this code generation and compilation stuff; they're working on that now. So with the exception of that one group, nobody else is touching this stuff, so it makes sense to put it back here at the end. I think next year I'll put it back in the front, because I know that when people watch these lectures on YouTube, they watch maybe the first four, so they get a little bit about multi-version concurrency control, and then everyone drops off after that. Whereas this is super important and should be in the front. But for your purposes, since it didn't matter for the projects, I put it at the end.

So today we're going to talk about code generation and query compilation. First we need some background about why you want to do this, and then we'll talk about the two main techniques: transpilation, or source-to-source compilation, and then JIT compilation. Then we'll spend some time talking about how some actual real-world systems do this, and as I said, if you're building a new system today, you want to do the things we're talking about in this lecture; the performance difference is quite significant. Then we'll finish up discussing what's expected of you, and general tips, for the project #2 code reviews that I'll start on Monday.

So the motivation for why we want to do code generation and compilation comes from this nice little snippet from the Hekaton guys. When they started building Hekaton, they were trying to figure out: if we put everything in memory, how can we get this system to run faster? At the end of the day, it really comes down to trying to execute fewer instructions. We'll also talk about how to execute more instructions per cycle; there are some things we talked about before and some we'll talk about going forward. But just in terms of executing fewer instructions to compute over the same amount of data, what's our target, and how much faster can we actually get? So they did a little napkin calculation that said: if you want to get 10x faster, you need to execute 90 percent fewer instructions. That's feasible; it's not easy, but it's possible. But if you want to go 100x faster, then you need to execute 99 percent fewer instructions. That gets to be super hard. That's not just turning on -O100 in GCC; that doesn't exist. So this is a good example: we can try to execute fewer instructions, but there's a lower bound on how much less work we can actually do. So we're going to have to care about other things, and this is where caring about instructions per cycle matters a lot. So the way we're going to do this — we already saw some techniques to do less work with the Froid stuff.
We talked about not having to invoke the UDF as a function, and doing a bunch of optimizations by putting everything into relational algebra. What we're going to talk about today is what's called code specialization. The idea here is that rather than have the database system be this general-purpose engine that interprets the query plan and at runtime figures out what the data looks like and what the operation is trying to do, we're going to generate code on the fly. It could be machine code, or it could be source code like C or C++. We generate that on the fly for the query we're trying to execute and then compile it, and now we don't have any branches, we don't have any interpretation of what the query plan is trying to do. We have hard-coded, baked machine code that only executes that one query, that's specialized to execute that one query. Now we execute fewer instructions, and we also end up being more clever about our algorithms and get more instructions per cycle. So we can try to achieve that 100x speedup the Hekaton guys were going for.

The biggest bottleneck is that in, I'll say traditional databases, but really the way people normally implement databases, you write the source code in a way that makes it maintainable by humans. You follow good software engineering practices so that a human can read it and understand what's going on in case they need to modify or debug it. But in many cases, the way you organize code for humans is not the best way to organize code for modern CPUs. And the compiler, whether it's Clang or GCC or whatever you're using, is not going to magically make that optimal. So we can organize the specialized code we're generating for our queries in a way that may not be easy for humans to read, but actually is the best way to do this on our hardware.

To give you a quick example of the kinds of things we're going to do, I want to use a really simple database here. Three tables, A, B, and C. A and B have primary keys, and C just has foreign key references to both of them. We'll use this in a bunch of examples going forward. So the first thing to talk about is, again, what are existing systems doing? If you're not doing code specialization, if you're not doing compilation, you're doing interpretation. What does that look like? We have this query, it has a nested aggregation, and then it joins that with A and B. The query plan essentially looks like this, and as we saw when we talked about the processing models, the way we implement this using the iterator approach is, going from the top to the bottom, calling next, next, next, going down one at a time, and then pushing tuples back up. This sucks, because what's going on here? Every time I call next, what is that? That's a function call. We jump to some other address in our process, then we have to set up the stack, pass along registers, and all of that. Then when we do a return or an emit, we're sending data back up, popping off the stack and going back down to where our function was before. And when we actually evaluate these predicates, as we'll see on the next slide, this is really inefficient too. I'm going over every single tuple, and in this case here, I'm doing the filter on b.val = ? + 1.
I have to invoke that for every single tuple I'm evaluating. I have to do that addition on the tuple's attribute and then check it against what I'm looking for. If you remember, when I showed that example of the Postgres numeric code, it had that giant switch statement that said: if you're trying to add a negative numeric plus a positive numeric, do this; if it's two negatives, do that. All of that is being invoked for every single tuple as you go along, because this code is set up to know nothing about the data it's looking at. There's nothing in here that's baked in about what the schema is. It just says: I'm looking at a tuple, I know what offset I maybe want to look at in my predicate, I do my comparison on that, and then I return true or return the tuple up based on what I'm doing. So there are these giant switch statements all over the place that say: the value you're trying to operate on is this type, and you're trying to do this kind of comparison with this other type of value, here's how to actually do it. And that's really slow. You saw this when you read the MonetDB/X100 paper. If you remember them talking about why MySQL was slow, it's all that interpretation, because the query plan doesn't bake in what the data actually looks like.

We see the same problem with predicates. I sort of mentioned this before, but let's actually see what's going on. Say I just want to do this simple one here, b.val = ? + 1, where, if you have a prepared statement, the question mark is a placeholder where you can stick a value in at runtime. When you invoke the prepared statement, you say parameter one equals 123, and then it subs that in. If I want to evaluate this, the way it's organized is as an expression tree. You have the equality clause, which has a left child and a right child; on one side you have the lookup for the tuple attribute, and on the other side you have the addition that takes the parameter you're passed in and the constant. When I want to evaluate this on a per-tuple basis, I pass in a bunch of execution context about what tuple I'm looking at and what the invocation of this query is. So in this case, to look up what b.val is, I look in my context and say: I'm looking at this current tuple for this table, and I know b.val is at some offset based on the table schema over there. So then I know how to jump: I do the lookup and say I want b.val, b.val is the second attribute, and the one prior to it is 32 bits, so I know I want to jump over 32 bits into the tuple to find the value I want. Then the schema says this is an integer, so I know I should be passing an integer up into my equality clause, and that produces the attribute's value. Now I go down the other side, same thing: I want parameter zero, so I look up the array of parameters I got when the query was invoked, I want offset zero, and that gives me 99. I know that's an integer, so I pass that up to the addition. The other node says it's the constant one, so that's just 1. That goes up here, and now I'm doing interpretation to say: my left side is an integer, my right side is an integer, now do an addition of two integer values.
That produces an output that gets shoved up here, and now I do my comparison: I have an integer on this side and an integer on that side, so now I know how to compare those two. So again, it's just these giant switch statements that say: I know what my left child's type looks like and what my right child's type looks like, how do I do addition on them? Because what would happen if this were a float and that was an integer? You'd need to produce a float and not an integer. So again, this is implemented with switch statements. And this is really expensive; I'm doing this for every single tuple I'm looking at. If I have a billion tuples, and I organize my expressions as a tree, I'm traversing this tree, which means function calls and jumps to different regions of memory, to do some trivial computation and then emit the value back up. And we do this because these expression trees can represent any possible predicate we could have in our query. It's a general-purpose representation, but it's expensive: easy for us as humans to reason about, difficult to run fast because of the function calls and indirection.

So this is what code specialization is going to do for us: we take any kind of computation we're doing over and over and over again in our database system, and we generate code that is specialized to execute that operation. That could be the predicates I just showed on the last slide, or it could be the operator execution when I showed the entire query plan, like computing the join. But there are other parts of the system too, like the scans, or the stored procedure stuff. We talked a little bit about this with Froid: instead of taking the UDF, converting it into relational algebra, and shoving it into the query plan to do query rewriting, if I can't do that, I can compile the if-clauses in the T-SQL or PL/SQL code and invoke that as native machine code. For logging operations, this is mostly a bottleneck for doing replay on replication: I'm sending over log records, and instead of having to interpret the log records, I can compile specific code that knows how to apply the updates very efficiently. So again, that's the goal. The idea is that we're doing the same operation over and over again, and we want to generate machine code specific to that operation, because that's going to be way faster than doing any interpretation. That means we don't do any lookups to ask: what type of data am I looking at, how big is it, what kind of operation am I doing, how do I handle negative and positive numbers added together? I bake all of that directly into machine code, and that's way faster.

The reason we can do this is because we know a bunch of stuff ahead of time. In a relational database, you call CREATE TABLE, you tell the database system "my table's going to look like this", and it doesn't let you store just anything in a column. If your column is an integer, it won't let you store something that's a float or a varchar; it'll throw an error. So we know exactly the data type of every attribute we're looking at. So when we look at our predicate, and the predicate says take the second attribute in my tuple and this constant parameter here and add them together, I don't have to do any lookups to ask what the type of this is.
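To make the interpretation overhead concrete, here is a minimal sketch of what an interpreted expression tree for b.val = ? + 1 can look like. The class and member names are hypothetical, not any real system's API, but the shape is the point: one virtual call per node per tuple, plus a runtime type switch inside every arithmetic and comparison node.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

enum class TypeId { INTEGER, FLOAT };

struct Value {
  TypeId type;
  union { int32_t i; float f; };
};

struct ExecContext {
  const std::vector<Value>* tuple;   // the current tuple being evaluated
  const std::vector<Value>* params;  // prepared-statement parameters
};

struct Expr {                         // expression tree node
  virtual ~Expr() = default;
  virtual Value Eval(const ExecContext& ctx) const = 0;  // one virtual call per tuple
};

struct ColumnRef : Expr {             // e.g. b.val -> attribute offset 1
  size_t offset;
  explicit ColumnRef(size_t o) : offset(o) {}
  Value Eval(const ExecContext& ctx) const override { return (*ctx.tuple)[offset]; }
};

struct ParamRef : Expr {              // e.g. ? -> parameter 0
  size_t index;
  explicit ParamRef(size_t i) : index(i) {}
  Value Eval(const ExecContext& ctx) const override { return (*ctx.params)[index]; }
};

struct AddExpr : Expr {               // left + right, with the "giant switch" on types
  std::unique_ptr<Expr> left, right;
  AddExpr(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
      : left(std::move(l)), right(std::move(r)) {}
  Value Eval(const ExecContext& ctx) const override {
    Value l = left->Eval(ctx), r = right->Eval(ctx);
    Value out;
    if (l.type == TypeId::INTEGER && r.type == TypeId::INTEGER) {
      out.type = TypeId::INTEGER; out.i = l.i + r.i;
    } else {                          // mixed inputs: promote to float
      float lf = (l.type == TypeId::FLOAT) ? l.f : static_cast<float>(l.i);
      float rf = (r.type == TypeId::FLOAT) ? r.f : static_cast<float>(r.i);
      out.type = TypeId::FLOAT; out.f = lf + rf;
    }
    return out;
  }
};
// The equality node at the root looks just like AddExpr: another type switch,
// another pair of child Eval() calls, executed for every single tuple.
```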
Instead, I can bake that into my specialized, generated code to do exactly that operation, and that's going to be way faster. So again: we know all the attributes ahead of time, so we can avoid all the function calls for the access stuff. We know all the predicates ahead of time, so we know what they're looking at and can generate efficient machine code for them. And when we do our operator execution, or even the predicate evaluation, we basically inline everything as much as possible so that we don't have any function calls. We're just ripping through machine code; it won't always be branchless, but ideally it will be. We rip through sequential code, and that's the fastest way for the CPU to execute it. This is all about superscalar architectures: they have these long pipelines, and we just want to keep feeding that thing instructions and have all the data it needs in its caches, so that we never have any memory stalls and we don't have any branch mispredictions where we have to flush the pipeline and jump somewhere else. We just keep giving it instructions and let the CPU run at full speed. That's the goal.

All right, so how does this work? What does this look like in our database system? This is a rough overview of where everything we're going to talk about today fits. The application sends us a SQL query, and we first go through the parser, which gives back an abstract syntax tree. That's just saying, here are the tokens in your SQL query; it hasn't mapped anything to the actual table objects or column names. That's what the binder does. The binder does a lookup in the catalog and says, I see a query on table foo, comes back with either a pointer to table foo or an object ID for table foo, and annotates the AST. Then I send that along to my query optimizer, which crunches on it, tries to find an optimal query plan, and spits out a physical plan. The physical plan says: I want to do a hash join on these tables, reading these indexes, with these predicates. It's the thing you would actually interpret if you didn't have the compilation stuff we're talking about, okay? Then we feed that physical plan to the compiler or the code generator, and it does something, we'll explain what that is in a second, and spits out native code that we can execute as if we were invoking an executable from the command line. The way we're really going to do it is generate a shared object, link that into our program, and then invoke the entry point of that shared object. So it's not like we're generating native code as a standalone binary and then fork-and-exec'ing it on the side. We're going to run this in the address space of our database system, and it's not a security issue, because we're the ones generating this code. We're not worried about someone sending us malicious stuff that trashes our system or reads something they shouldn't be reading. This is code that we're generating, so it's safe to run in our own address space. For stored procedures and UDFs, if you let people write in arbitrary languages that are not safe, like C, then you don't run those in the same address space; you run them in a sandbox.
But again, we don't have to do that here, because we're running query code that we generated ourselves. All right, so the two approaches to do this: the first is called transpilation, or source-to-source compilation, and the second is JIT compilation. At a high level, at the end of the day they produce the same result: for your query plan, you end up with machine code that's hard-coded, baked, specialized for your one query. They just go about it slightly differently. With transpilation, you have code that generates code that gets compiled: you have C++ code, or Python code, or whatever you want, that takes in your query plan and spits out a source file of more C++ code; then you run that through a compiler, and that generates the machine code. The second approach is JIT compilation, where you generate an intermediate representation, an IR, think of it like Java bytecode for the JVM. You go from your query plan and generate this IR directly, then you feed that into your compiler, and that generates the machine code. Semantically, at a high level, they're doing the same thing, going from a query plan to machine code; it's just a question of what the intermediate steps are.

Okay, so we'll go through each of these. For the first one, transpilation or source-to-source compilation, there was a system out of the University of Edinburgh about a decade ago called HIQUE. This was one of the first prototypes in academia, at least in the modern era, that would take a query plan, generate C code, fork GCC and run the source code through it, and GCC would spit out a shared object that they'd link back into their database program, and that runs the query. So for this one, they were just using an off-the-shelf compiler. You pay a penalty for forking, obviously, and it seems like a bad thing to do. We'll see one commercial system at the end that actually did it the way HIQUE did it. I think this is the wrong way to do it, because the fork-and-exec part is super expensive, and there's a bunch of stuff you have to set up in the code you're generating that you may not need if you go with the JIT compilation approach and its intermediate step.

But let's look at HIQUE and what it's doing. Say we have our query here: SELECT * FROM A WHERE A.val = ? + 1. The interpreted plan would essentially do this: you grab some handle to a table and just iterate over every single tuple in it. To get each tuple, you go get the schema from the catalog for that table, which you do once and then cache, so that's not a big deal. But then, assuming we're in memory, for some offset we want in our table, we have to figure out how to jump into memory to find the tuple we want, and we return a pointer back into our code that we can actually use. This part we can cache, that's not a big deal, but this part is basically doing some simple arithmetic to figure out where we want to go, finding that pointer, and bringing it back. And the most expensive part is probably the jump to the code to call this function.
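Continuing the sketch from earlier (same hypothetical Value/Expr/ExecContext types), the generic interpreted scan is just a loop that walks that expression tree for every single tuple, after figuring out where each tuple lives:

```cpp
std::vector<std::vector<Value>> InterpretedScan(
    const std::vector<std::vector<Value>>& table,   // tuples already in memory
    const Expr& predicate,
    const std::vector<Value>& params) {
  std::vector<std::vector<Value>> out;
  ExecContext ctx{nullptr, &params};
  for (const auto& tuple : table) {
    ctx.tuple = &tuple;                  // "what tuple am I looking at right now"
    Value match = predicate.Eval(ctx);   // tree walk + type dispatch, per tuple
    if (match.type == TypeId::INTEGER && match.i != 0) {
      out.push_back(tuple);
    }
  }
  return out;
}
```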
Then, for every single tuple we get back, we want to evaluate this predicate. We have to traverse that tree I showed at the beginning, where you go down, evaluate each operator, and push things up through the expression tree until you get your final output that tells you whether the tuple satisfies your predicate or not. And as you go along, you're casting for the comparison operators based on what the predicate wants and what the tuple actually looks like, and then you return true or false.

So in HIQUE, what they would do is have these templated query plans in their source code, where all the things we were computing on the fly over here, like the tuple size, the predicate offset, and the parameter value, are just values you fill in. Now when you invoke this, the compiler basically inserts these constants in here, so you don't have to do a lookup, and it's super fast. And when you actually do the comparison here, instead of calling an evaluate function for the predicate, it's baked directly into my if clause: if val from the tuple I extracted equals the parameter plus one, then I know I want to emit it. There's still indirection, because I have this if clause to decide whether to emit or not, and we talked about how to do a branchless version of that, but everything else is the same. The main thing is that I didn't have to make any function calls, I didn't have to traverse a tree, I didn't have to interpret any aspect of the query plan. This thing is baked to do exactly what that query wanted to do, and that's going to be so much faster.

The nice thing about this, too, is that this example is pseudo Python code, but they're really generating C/C++ code, and that means that if the rest of the system is implemented in the same language and the same runtime, then our generated query code can invoke any other part of our database system that we've implemented in C++. So in the template of code I'm generating, if I need to make calls to the transaction manager or the log manager or other parts of the system, I can just bake that into this stuff here. There's no specialized bridge I need between this stuff and the rest of the database system. That's really nice, because you can implement all those other pieces in C++ without doing any of this codegen stuff, and test them as if it were a regular database system; then when you run the query, it's just making calls to those functions as if you were running the interpreted plan. So the integration of the rest of the system with the source code you're generating on the fly works seamlessly. Now, in the case of the LLVM stuff, we don't have this problem anymore in the new engine, but in the old system there were a bunch of macros we had to have to let the LLVM IR call into C++ code; you had to mangle the names of the objects and functions, a bunch of tricks to make it all work. It wasn't as seamless as what they're describing here. And now, with the way we're doing it and the new DSL, which I'll talk a little about at the end, you can still do this, but it's hidden from you as the rest of the programmers.
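Just to make the transpilation path concrete, here is a hedged sketch of the kind of specialized source a HIQUE-style generator might emit for our example query, plus the compile-and-load step. Everything here, the emitted function, the tuple layout, the file paths, is made up for illustration, not HIQUE's actual output; the point is that the tuple size, the attribute offset, and the predicate are constants in the emitted source, and the shared object gets linked back into the server's own address space.

```cpp
#include <dlfcn.h>     // dlopen / dlsym (POSIX)
#include <cstdint>
#include <cstdlib>
#include <fstream>

// What the code generator might write out to /tmp/q1.cpp for
// SELECT * FROM A WHERE A.val = ? + 1:
static const char* kGeneratedSource = R"cpp(
#include <cstdint>
extern "C" int64_t q1_execute(const char* data, int64_t num_tuples,
                              int32_t param0, int64_t* out_ids) {
  int64_t n = 0;
  for (int64_t i = 0; i < num_tuples; i++) {
    const char* tuple = data + i * 8;                             // tuple size baked in
    int32_t val = *reinterpret_cast<const int32_t*>(tuple + 4);   // offset of A.val baked in
    if (val == param0 + 1) {                                      // predicate inlined, no tree walk
      out_ids[n++] = *reinterpret_cast<const int32_t*>(tuple);    // emit A.id
    }
  }
  return n;
}
)cpp";

using QueryFn = int64_t (*)(const char*, int64_t, int32_t, int64_t*);

QueryFn CompileAndLoad() {
  std::ofstream("/tmp/q1.cpp") << kGeneratedSource;
  // HIQUE-style: fork off the system compiler to build a shared object.
  if (std::system("g++ -O2 -shared -fPIC -o /tmp/q1.so /tmp/q1.cpp") != 0) return nullptr;
  void* handle = dlopen("/tmp/q1.so", RTLD_NOW);   // link it back into our own process
  if (handle == nullptr) return nullptr;
  return reinterpret_cast<QueryFn>(dlsym(handle, "q1_execute"));
}
```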
There's some extra stuff you have to do before you can call C++ code from the LLVM stuff, whereas this transpiled code just calls it natively as if it were a regular function.

Okay, so let's look at an evaluation from the HIQUE guys, because it clearly shows the benefit and the kind of performance improvement you get from doing this code generation. What I like about this paper is that they implemented all the different variations of ways you could do query execution and put them all in a single system, so it's a real apples-to-apples comparison. The first is what they call a generic iterator: this is what someone in an intro database class would write as a really simple iterator scan over a table. Then they have an optimized version where they inline all the predicates, and they have type-specific iterators for every possible column type: an iterator over integers, an iterator over floats, and so on, whereas in the generic one you have the switch statement that interprets, for every single tuple you look at, what the type actually is. Then they had a hard-coded version, using the same general pattern, but written as hard-coded C++ by a grad student. Then they had an optimized version that does direct tuple access because the types are baked in; this ends up being the way VectorWise or X100 does it, which we'll show at the end of class, but as a spoiler, what they're describing here is basically the VectorWise approach. And then they have HIQUE, which is the query-specific specialized code: C++ code generates C++ code, and they compile that and link it in.

The paper is from 10 years ago, so the machine is a Core 2 Duo, which is pretty old by today's standards, but the relative performance difference is still valid. So, no surprise, the optimized hard-coded version and HIQUE do the best. The difference is very small, but the HIQUE version is actually slightly better, because it organizes the source code in a way that's more efficient for the CPU, whereas the hard-coded one is written by a human in a way a human can reason about. This shows that the database system can be a little smarter than a human in organizing the code it emits to get better performance. Everything else is as expected: the generic one is the worst, iterators and interpretation are always really bad, and as you move toward the more specialized side you get much better performance. Another thing to point out is that the generic one has huge memory stalls because of the indirection, because of that interpretation where you look up what type of data you're operating on, whereas down here they have almost zero memory stalls, which is super impressive. I think they also measured L1 cache misses, and I'm not showing them because the numbers were so small they weren't worth plotting.

The other thing we need to talk about is how long it takes to compile this thing, because that's the biggest issue. Going back here, for this TPC-H query doing a join on a rather small table, HIQUE takes 50 milliseconds. That's pretty good.
But the problem is it's going to take you over 100 milliseconds to actually compile it. So the compilation time is taking longer than the query itself, and that sucks. In this experiment, they show what the compilation cost is when running with O0 and O2; they didn't report O3. As far as I know, when people ship database system software, if you don't care about trying to recreate a crash or about shipping debug symbols, you ship with O3, because the difference is worth it. So, no surprise, O0 is basically no optimization passes, but even then it takes 100 milliseconds to compile query one, which is pretty simple; there's no join in it. And O2 takes twice the time, or even more in some cases. So again, this sucks, because even for really simple queries, if my query runs for five milliseconds but it takes me 100 milliseconds to compile it, then the benefit of compilation is basically out the door. And the reason it's so slow is, again, that they're fork-and-exec'ing GCC. Think about what that does when you run it from the command line: it reads in config files, parses the command-line parameters, loads in the file you want to compile, and then emits the binary. There's a lot of setup involved in making that call that is not directly related to the compilation itself. This run has no optimization passes and it still took a long time. So this is bad; this is what we want to avoid.

There are other aspects of why the HIQUE approach doesn't work as well. The compilation cost is unavoidable; we know about that because they're fork-and-exec'ing GCC. But in terms of the performance of the actual query plan they're generating, the way they set up the system to emit the source code doesn't allow for full pipelining. Meaning, they're still kind of doing that tuple-at-a-time iterator approach we showed at the very beginning: they're not making function calls every single time, because everything is inlined, but they're still operating one tuple at a time per operator and then moving things up step by step. If we want full pipelining, this is something where we can organize our pipelines in a way that's ideal for the CPU. It's going to generate really ugly code, but who cares, because people shouldn't be debugging our auto-generated code anyway. This is something we as the database developers have to deal with, not our customers. So we can be smarter about how we organize the source code we're generating to take advantage of full pipelining.

Remember what I said before: a pipeline is a segment of the query plan where we can keep operating on a single tuple, going as far up in the query plan as we can. And a pipeline breaker is where we can't proceed in the query plan for a tuple until we have all the tuples below it, all the tuples for that pipeline. So going back to our simple join query, these are the four pipeline boundaries we could have. The easiest one to reason about is number two down here: I'm computing an aggregation, the count grouped on B's id.
So I can scan over B, iterate over B, do the filter, and start computing the count here and building my hash table, but I can't proceed into the next pipeline until this thing completes. I take one tuple, add it to my hash table, then come back and get the next tuple; I can't keep going up in the query plan. Same thing for this side: I can do my filter on A and put it into the hash table for the build side of my join, but I can't compute that join until I finish building my hash table and then compute the probe side. For this pipeline over here, I can scan C, do the join against the hash table I built over here from this pipeline, and if a tuple satisfies the join predicate, it moves up to the next join operator; I've already built that hash table from this other pipeline, so I can do the join right there and then emit it further up. So in this case, I can take a single tuple and go from the scan on C, through this join, through this join, and produce the output, without ever having to go back for the next tuple. That's why this is called a pipeline.

So in HyPer, the paper you guys read, they take advantage of these pipelines and try to keep tuples in CPU registers for as long as possible going up the pipeline. It's different from the iterator model, where you grab a tuple, maybe do something with it, and go back and get the next one before you give data back up to your parent operator. Using the terminology from when we talked about query processing, they're doing a push-based model instead of a pull-based model: pushing data up from the bottom to the top rather than pulling it from the top down. And they're doing what's called data-centric execution, as opposed to operator-centric: they keep their tuples for as long as they can in CPU registers within a pipeline, rather than organizing the generated source code around whatever operator is being invoked. So it revolves around how we organize the system to operate on tuples, rather than how to execute the operators.

The HyPer paper you guys read is from 2011, and this was part of the modern era of doing query compilation; it's the first paper that came out and said, here's how to do this with LLVM. So they're doing JIT compilation. In the paper you read, they had this appendix with all this IR stuff. I hope you didn't read it, because I can't read it, and it doesn't matter. It's just showing that they have C++ code that invokes some macros that generate that IR directly, and then they pass the IR to LLVM, which compiles it. So they're not doing the source-to-source conversion; they're going directly to IR and compiling that, with no intermediate step.

This paper is crazy, by the way. It's written by one dude, Thomas Neumann. He's a professor. The professor wrote 80% of HyPer, then wrote that paper by himself and got it published, and it's a very highly cited paper that everyone follows. But this is crazy: he's maybe my age, he has PhD students, he has to teach two classes a semester, he's got three kids, I have zero, he doesn't have a dog. He has way more going on in life, and he wrote 80% of their system.
And then, I can say this publicly even though it hasn't been formally announced: HyPer got bought by Tableau, so half of his students went off to build the commercial version of HyPer for Tableau, and he's already building the new one. The new one already supports all of TPC-H and most of TPC-DS, and we haven't done anything yet. He's one dude. I don't know how he does it. Sorry.

LLVM. So, who here has not heard of LLVM before this lecture or this paper? Okay, that's fine. It's a really interesting project. It came out of UIUC, the University of Illinois, where they were trying to build a replacement compiler for GCC, and what they ended up building was this toolkit of components you need to build compilers. The idea is that they have this IR they've defined, very similar to the JVM's bytecode, and it's meant to be a universal IR that you can build a bunch of front ends for, to take any source language and turn it into that IR. So you can take Scala, or C++, or Python, or whatever you have, and generate that IR. And they have a bunch of back ends that take that IR and compile it to whatever architecture you want to target: x86, ARM, Power, whatever. That's actually really cool. Since then it's gone beyond the university, and I think Apple has invested heavily in it and is funding a lot of the development.

So we can leverage this to do basically the same thing HIQUE was using GCC for, but run it inside the same address space as our database system: feed it whatever IR we want to generate, have it do the compilation, and generate the shared object directly in memory. Then we have an address for that code, which we invoke. We don't have to write out a shared object file and dynamically link it in, like the HIQUE guys had to do. And again, the main thing to understand, just like in the HIQUE case where the generated code can call all the parts of our system written in C or C++, is that we don't have to implement our entire database system in LLVM IR. With a little macro magic, we can have the LLVM IR invoke our existing C++ code. That's why, for you guys working on our system today, even though the execution engine we're building out with LLVM isn't anywhere in the repository yet, everything you're implementing is in C++: once that execution engine is in place, it can invoke your C++ code as if the caller were written in C++, even though the engine is emitting LLVM stuff.

I'm not going to go into the details of exactly how the IR stuff works in HyPer; I want to talk about how they do the push-based execution. This is basically the kind of code they're going to generate; they generate it in IR and send it to LLVM to compile, and you can see it's organized into these pipelines. In pipeline one, we scan A, do our filter, and then materialize into the hash table on the build side. You can see that we take a single tuple, and for every single tuple in A, we go up as far as we can in the pipeline before we move on to the next one.
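Written out as C++ for readability, the generated code for this plan looks roughly like the sketch below. HyPer emits the equivalent directly as LLVM IR, and the table types, hash-table helpers, and parameter names here are placeholders I made up; this is a simplified illustration of the pipeline structure, not their actual output.

```cpp
// Types like TableA/TableB/TableC, JoinHashTable, AggHashTable, and Emit()
// are assumed to exist elsewhere in the system.
void ExecuteQuery(const TableA& A, const TableB& B, const TableC& C,
                  int32_t param0, int32_t param1) {
  JoinHashTable ht_a;         // build side for the join with A
  AggHashTable  ht_b_counts;  // COUNT(*) grouped by b.id, reused as the other build side

  // Pipeline 1: scan A, apply the filter, materialize the build side.
  for (const auto& a : A) {
    if (a.val == param0 + 1) ht_a.Insert(a.id, a);
  }

  // Pipeline 2: scan B, apply the filter, compute the aggregate.
  // This is a pipeline breaker: nothing above it can run until the loop finishes.
  for (const auto& b : B) {
    if (b.val == param1) ht_b_counts.Increment(b.id);
  }

  // Pipeline 3: the finished aggregate hash table doubles as the build side
  // for the join with C, so there is nothing extra to materialize here.

  // Pipeline 4: scan C, probe both hash tables, and emit, pushing each tuple
  // as far up the plan as it can go before fetching the next one.
  for (const auto& c : C) {
    if (const auto* agg = ht_b_counts.Probe(c.b_id)) {   // join with the aggregate on B
      if (const auto* a = ht_a.Probe(c.a_id)) {          // join with A
        Emit(a->id, c.id, agg->count);                   // tuple stayed in registers the whole way
      }
    }
  }
}
```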
So once we've completed that pipeline, we jump to the next pipeline and do the same thing, then pipeline three, and in four here you see we have three nested for loops, because we're iterating over every single tuple in C, then doing our join against B, and for every tuple that comes out of that, we do our join against A. Again, the idea is that for a single tuple coming out of the bottom of the pipeline, we go as far as we can up the tree, and they try to keep everything in CPU registers, which they can control with LLVM IR, and that gets the best performance.

So let's see what they can actually do in terms of runtime. This is running on TPC-H; I actually forget the scale factor, so I don't know how big the data is, but they compare HyPer with the LLVM approach; HyPer doing the same thing HIQUE did, generating C++ code and compiling that; VectorWise, which does the pre-compiled primitives I'll show in a second; MonetDB, which has an interpreter over opcodes, kind of like SQLite, which I can talk a little about as well; and the question mark one is just Oracle. For all of these, the HyPer LLVM version does the best, except for this last one here, where they basically just match what VectorWise can do, and you'll see why in a second: VectorWise is basically operating at bare-metal speed for the machine code. The primitives mean that instead of generating code on the fly, VectorWise pre-compiles all the possible kinds of code you could need and just stitches them together at runtime; that's why it's able to match the performance. I forget exactly why the C++ version of HyPer doesn't do as well; I think it's because the more aggressive pipelining approach only went into the LLVM version. Yes? Why does VectorWise not have a number for that one query? I think it just crashed, I think that's what they said, yeah. This is back in 2011, so VectorWise at that point was like three or four years old; I don't think it was in commercial use yet.

Okay, so performance looks good, but what was my main complaint about HIQUE? HIQUE was getting good performance too, but what was the main problem? Compilation, right. So how does this compare to the HIQUE stuff? This is not a true apples-to-apples comparison, because I'm taking the data from the HIQUE paper and the data from the HyPer paper; they're compiling the same TPC-H queries at the same data size, just not on the same hardware, but you can get a relative idea of the difference. And as you can see, the HyPer guys are down in the tens of milliseconds, like 13 milliseconds and so forth for these queries, whereas HIQUE with O2 is up at 200, 400, 600 milliseconds. Again, that's because you're not forking GCC: LLVM is optimized to run everything in memory and do the compilation there, so there's no setup cost, and that's why the compilation times are so low. They're executing the same queries; there's just no parsing of C++ source, no parsing of config files, you're going directly from the IR to machine code, and that's why they get that lower time.
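To show what that in-process flow looks like, here is a minimal sketch using LLVM's (older) MCJIT execution engine. The exact headers and calls vary across LLVM versions, and this is illustrative rather than anything HyPer actually does, but the shape is the same: build IR in memory, compile it inside your own address space, and get back a raw function pointer with no fork, no temp files, no dynamic linker step.

```cpp
#include "llvm/ExecutionEngine/ExecutionEngine.h"
#include "llvm/ExecutionEngine/MCJIT.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/TargetSelect.h"
#include <cstdint>
#include <memory>
#include <string>

int main() {
  llvm::InitializeNativeTarget();
  llvm::InitializeNativeTargetAsmPrinter();

  auto ctx = std::make_unique<llvm::LLVMContext>();
  auto module = std::make_unique<llvm::Module>("query", *ctx);
  llvm::IRBuilder<> builder(*ctx);

  // Emit IR for: int32_t add_one(int32_t x) { return x + 1; }
  // (a stand-in for the "? + 1" piece of the running example)
  auto* fn_type =
      llvm::FunctionType::get(builder.getInt32Ty(), {builder.getInt32Ty()}, false);
  auto* fn = llvm::Function::Create(fn_type, llvm::Function::ExternalLinkage,
                                    "add_one", module.get());
  builder.SetInsertPoint(llvm::BasicBlock::Create(*ctx, "entry", fn));
  llvm::Value* arg = &*fn->arg_begin();
  builder.CreateRet(builder.CreateAdd(arg, builder.getInt32(1)));

  // Compile the module in memory and fetch a callable address.
  std::string err;
  llvm::ExecutionEngine* engine =
      llvm::EngineBuilder(std::move(module)).setErrorStr(&err).create();
  auto add_one = reinterpret_cast<int32_t (*)(int32_t)>(
      engine->getFunctionAddress("add_one"));
  return add_one(41);  // 42
}
```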
So this seems like it solves all our problems, right? Of course we want to do this? Well, for these simple queries in TPC-H, yeah, these compilation times are small, but not all queries in the real world are as simple as TPC-H. It's a benchmark that's only sort of representative of what real-world workloads and real-world queries look like; there are certainly queries more complex than this that people want to run all the time. The issue with LLVM is that, the way to think about how they're organizing the program, the IR, is that they're generating one giant function: the entire query plan is baked into that IR that gets passed to the compiler. And what happens, according to the HyPer guys, is that the compilation time grows super-linearly relative to the complexity of the query.

In the real world, queries can get quite large. I always say this, but friends at Google told me that sometimes they see queries in their database systems that are like 10 megabytes in size. That doesn't mean they're operating on 10 megabytes of data; the SQL query text itself is 10 megabytes. What's happening is that people have all these dashboards or analytical tools where they click a bunch of checkboxes, filtering on state, zip code, or other features, and these tools generate a bunch of pre-filled predicate definitions that essentially become giant IN clauses or complex CASE statements that make the query really, really large. Now you throw this into the optimizer, the optimizer spits out a query plan that's really big, you throw that through LLVM, and the compilation time is going to be huge.

In the case of HyPer, they told me that when they were getting HyPer ready, whether or not this was for the acquisition by Tableau, they were doing the same thing we're trying to do: mimicking the Postgres catalogs and speaking the Postgres wire protocol, so you can connect with a bunch of Postgres tools and get the same management benefits you'd get from running regular Postgres. So they attached pgAdmin, the web tool for doing administrative stuff with Postgres, and when you turn on pgAdmin and first connect to the database, it invokes a bunch of queries against the catalog to figure out what tables you have, what databases you have, what the attributes look like. They told me that when they first turned this on, these queries weren't that complex, but the compilation time was in the one-second range. So you turn on pgAdmin and it takes two or three seconds for the thing to come up and show you your database, whereas on regular Postgres it would be instantaneous, because these queries are so fast.

So this is going to be a problem for OLAP queries. It's not going to be a problem for OLTP queries, and I'll have you take a guess why. Yes? It's usually just a quick update of one thing. Yes, so it's two aspects. He says the first is the simplicity of them. What's the other aspect? You just update a single attribute a lot? No, but the query doesn't know that. If you have an update statement, the compiler doesn't know whether you're updating a billion tuples or one tuple; the query is the same. In OLTP applications, you're executing the same queries over and over again, so we can cache all this shit we talked about.
So if it's a prepared statement, we just cache the query plan and fill in the values at runtime. Or we know the type for the placeholder, that the value needs to be an integer, so if someone passes us a string, we cast it once before filling it into the query plan. For OLTP applications this isn't a big deal, because we can cache everything and the queries are really simple. For OLAP queries it's more problematic; the pgAdmin and Google examples are the two I gave.

So the HyPer guys came up with an interesting solution to this, what they call adaptive execution. This was published at ICDE last year, and it actually won best paper at the conference. I think it's a really good idea. What they do is go through the same steps as before: you take your query plan and run some macros that spit out LLVM IR. But now, you take that LLVM IR and you're still going to compile it, but you don't wait for that compilation to finish before you start executing the query. Instead, you have an interpreter for that IR that knows how to execute the query directly from the IR. Then, when the compilation is done and you have the binary, you slide that into the query execution, and the rest of the executors just use that.

Remember the morsel stuff: the morsels were how they organized the tables into blocks, and every time a worker says, all right, I'm done with this task, I've finished this morsel, it goes back and tries to get another morsel. So now you just have a flag somewhere that says whether the compilation for this query has finished. If it hasn't, the interpreter engine just keeps crunching. Then at some point, when I finish a morsel and see that the flag is set to true, I know I don't want to keep interpreting; I go get the compiled version from some location and run that, and that runs much faster.

So it looks like this. The SQL query shows up, we run it through the optimizer, and, these are HyPer's numbers, that takes about 0.2 milliseconds; then we run it through the code generator, and that takes about 0.7 milliseconds. Now here we fork off three ways. We take the LLVM IR, and we have a bytecode compiler that just turns this IR into bytecode we can interpret. We actually tried this ourselves, and instead of doing that lightweight compilation step into bytecode, we wrote our own interpreter directly over the LLVM IR. Then another thread runs it through the unoptimized LLVM compiler, think of this as O0 in GCC, and that spits out x86 machine code, which replaces the bytecode we're interpreting. And in the background we also run the more expensive optimization passes through the optimizing LLVM compiler; that's the 25 milliseconds plus 17 milliseconds, and it produces the fast machine code that replaces everything else. They didn't report exact numbers, but I think it's roughly 2x faster as you go down each of these levels, which is quite significant.

So this solves the problems we talked about. The first problem was the large Google query that takes a long time to compile: I can still make forward progress up here while, in the background, this thing crunches on it for a second.
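Here is a minimal sketch of that morsel-boundary handoff, with made-up names for the scheduler and the interpreter (this is my illustration of the idea, not HyPer's code): workers keep pulling morsels, and at each morsel boundary they check whether a compiled version of the pipeline has shown up, switching over to it mid-query if it has.

```cpp
#include <atomic>
#include <cstdint>

struct Morsel { const char* data; int64_t num_tuples; };

using PipelineFn = void (*)(const Morsel&);

// Set by the background compiler thread once machine code is ready.
std::atomic<PipelineFn> compiled_pipeline{nullptr};

void InterpretPipeline(const Morsel& m);   // bytecode/IR interpreter (assumed elsewhere)
Morsel* GetNextMorsel();                   // returns nullptr when the input is exhausted

void WorkerLoop() {
  while (Morsel* m = GetNextMorsel()) {
    // Both paths execute the same plan derived from the same IR, so the answer
    // is identical either way; only the speed differs.
    if (PipelineFn fn = compiled_pipeline.load(std::memory_order_acquire)) {
      fn(*m);                  // compiled machine code is ready: use it
    } else {
      InterpretPipeline(*m);   // still compiling: keep making progress in the interpreter
    }
  }
}
```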
So if it's a really long query, say the query is going to take 10 seconds, I can still make forward progress, this compilation takes a second, and then for the remaining nine seconds I can use the faster version. In the case of the pgAdmin queries, they're really simple and really short, so maybe I only ever get to this first stage, and maybe it's 2x slower than a regular interpreted engine, but it's not the 10x slower it was in the compile-everything-first version. And of course I can cache the query plan, so if I come back and connect again and still have the plan and the compiled version around, I can reuse them. Yes? This requires that you have something like morsels, right? Yeah, so his statement is: this requires you to have morsels. Yes, because the morsel gives you a natural stopping point to check whether any of these compilation stages has finished. Otherwise, if everyone's just going full blast, you have to somehow interject and say, hey, stop running. And again, the whole point is that I don't want to throw away any of the work I've already done: just because this part ran interpreted and this part ran compiled, the answer is still going to be correct, because it's all coming from the same query plan.

To tell you how we did it in the old Peloton code: we actually couldn't do any of this, because we had two separate code paths, one for the interpreter engine and one for the LLVM engine, and they did not know about each other and couldn't talk to each other. It was even worse than that: they had slightly different semantics, so depending on whether you went down one code path versus the other, you could end up with slightly different query results. In this scheme, everybody operates on the same IR that comes out of this thing here, so as long as your interpreter of the IR is correct, it may not execute the instructions in exactly the same order as the compiled version, but that's okay; the high-level semantics of what's being executed are still the same. So I think this is the right way to do it. Yes? Are there any situations where even the code generation stuff is not worth it, and the old naive iterator model is better? So his question is, are there any cases where this compilation stuff is not worth it and you're better off using the old interpreter? Yeah, even more basic than that: cases where turning it into LLVM at all is the problem, where you just take the query plan and interpret it? Sure, sure, the old style: key-value stores. Put key, delete key, there's no query plan, and it's super simple. Most key-value stores are looking at opaque blobs: my key is a bunch of bytes, my value is a bunch of bytes, so there's no real query plan there. If you're doing anything beyond a simple key-value store, I can't think of a case where you wouldn't want to do this. Again, the thing I want to stress, and we'll see this in the newer system we're building now and in MemSQL, is that it's all operating on the same IR, so it's not like I have to have two separate code paths. If I'm building a system today and I want to do complex queries, I want this approach. I don't want to have an interpreter engine and an LLVM engine and then decide at runtime which one to use.
You want to go with just one. And if you have any kind of query plan, if you're allowing users to define the data they're storing, that would otherwise require interpretation, and this approach will always beat the interpreted model.

So now I want to talk about some real-world implementations. By the way, we've got to pick a name for our system this month, because I think two students are writing master's theses and we're trying to submit papers on it, and we can't keep calling it what we've been calling it, right? We've got to think of something. The original name of the system was something else, and we're trying to think of two one-syllable words to put together, like Postgres or ClickHouse, that only mean our database system. My wife and I were joking we could come up with something; it's stupid, but at least it would be unique enough. And for the database we built for the intro class, 15-445, we might call that something else too, just to mess with you. Anyway, we've got to think of a name this month. That's on me, not you.

All right, so let's go through a bunch of examples and see how systems do query compilation in different ways. As in many cases in databases, what seems novel and new isn't; that's why I was qualifying what I said before about how, in the modern era, HIQUE was the first one to do source-to-source compilation and HyPer was the first one to do JIT compilation. The reason I say "the modern era" is because back in the 1970s, IBM did it first. When they first started building System R, as I said, they got a bunch of brand-new PhDs, put them in a room, and said, let's build a database system off of Ted Codd's paper, and everyone picked their own little piece. One person went off and did SQL, another did query optimization, somebody went off and did query compilation. And the idea was that when a query showed up, they would generate assembly code, sort of doing what HIQUE was doing: selecting templated operators and filling in the parameters at runtime, and then invoking that. It had a big performance benefit, but the software engineering cost was really bad, because every single time you changed the header of a tuple or the layout of the data, you had to go back and change all the templates. So if one team made a modification, it would break everything in the query compiler. IBM never commercialized System R, but when they built the first commercial version of a relational database, DB2, in the early 1980s, they brought over some pieces of System R, and they didn't bring this over; they abandoned it because of that software engineering overhead. The other issue back then was that x86 wasn't as dominant as it is now, so IBM had all these different ISAs for all the different mainframes they had to support. If you wanted the database system to run on those different machines, and you were doing this code generation by emitting assembly directly, you had to support every single ISA for every architecture they had. So nobody did this in the 1980s or 90s or 2000s; it's really only in the modern era that people do it, because of all the problems IBM had.
There are these great papers, this one is called "A History and Evaluation of System R"; they have a bunch of interviews with the System R developers from the early 1990s, where they talk about all the problems they had building the system, and this comes up as being a big one.

So, Oracle. At least the plain Oracle database system, the one you download for your laptop or get through RDS on Amazon, not Exadata, not RAC, not the specialized Oracle stuff, doesn't do any of the end-to-end code generation and query compilation we're talking about here. I think they might do compilation for predicates, but I haven't verified that. The only kind of true end-to-end compilation they do is for stored procedures. For stored procedures, they'll convert the PL/SQL, which is supposed to be based on the SQL standard, into their own proprietary language called Pro*C, and then they have a compiler that compiles that to native code that can run inside the system. The reason they do this transformation to Pro*C is because they do a bunch of checks to make sure you're not violating memory constraints or reading something you shouldn't be reading; it's an intermediate security check.

The one thing Oracle does that nobody else does, which is crazy, is say: forget all this compiler stuff, let's just bake it into the hardware. They bought Sun, I don't know, ten years ago, and Sun was developing SPARC chips. So in the last five years, Oracle will sell you newer SPARC chips that have Oracle database operations directly in silicon, for things like compression, security, and vectorization; there's stuff they're adding specifically to those chips that Oracle's database system can use to go faster. In a sense this is even better than compilation, because it avoids general-purpose instructions entirely: go directly to the hardware and do things very quickly. Of course, this is not cheap; you pay a lot of money for it. I can't say anything more about that.

For Hekaton, I'm pretty sure this made it into the commercial version, and I know at least the academic version could do it: they can compile both the stored procedures and the SQL. What's really cool about this is that with Hekaton, the idea was you declare tables as being in memory, and you can still have your regular SQL Server tables; to speed up access, say if you wanted to join an in-memory Hekaton table against a regular SQL Server table, they would pre-compile these conduits, these interop operators, that let you do the joins and move data from one side to the other very efficiently. That's another good example: specialization is not just for a single query plan. There's a bunch of stuff you can set up ahead of time, based on the schema that's defined for the table, that lets you move data around or process data more quickly. The way they did this was the HIQUE way: they generate C code from the query plan, compile it into a DLL, and link that in at runtime. There's another system they have that does streaming analytics the same way. So the HIQUE approach is very common, at least in all the Microsoft systems I'm aware of.
All right, Impala is a SQL engine built on top of HDFS, the Hadoop file system. They use LLVM JIT compilation, and I think they generate the IR directly. They only do this, though, for predicate evaluation and for processing files. The idea is that Impala doesn't have its own proprietary data format the way Oracle, MySQL, and SQLite each have file formats they know how to read and write. Impala wants to live in the Hadoop ecosystem or on the cloud, where people are producing CSV files, JSON files, Parquet files, all these different formats, and it wants to process them efficiently. So they compile the code that parses these files, because you have to tell the system, hey, my CSV looks like this, and compilation lets it read that data very quickly. And they do the same for predicates. Again, this is another example of applying compilation not just to whole query plans but to the lower-level pieces where it helps.

Actian Vector we've already talked about; last year I said they were dead and then they emailed to complain and say they're not dead. Again, this is the commercial version of VectorWise. What they do is that instead of generating IR or C++ on the fly for every single query plan and compiling it, the Actian/VectorWise developers pre-define a bunch of primitives for all the different ways you could access data in a query. Think of hundreds of these primitives, one for every data type and every operator you might want to run on it. Then at runtime they figure out which primitives they need to stitch together to produce the result for the query you're trying to execute. These primitives look like this: say we have two scan operators on a column, and you can see we're passing in an "is something less than something" comparison. This one is less-than over a 32-bit integer, and this one is less-than over a 64-bit double. This code gets compiled when the system itself gets compiled. Then at runtime, I look at my query: it's doing a lookup on a column that's a double, and the scan operator is an "is something less than something," so I pass in a pointer to my column and the value I want to compare against, and I invoke this function. That's the same thing the code generation folks are doing; they generate code that looks a lot like this, it's just pre-baked ahead of time. And as I showed on that earlier slide, they get just about as good performance as HyPer in many cases. For us, it took a while before we could beat VectorWise with our own system, because there were a bunch of low-level tricks we had to pull off to go faster than this. So this is just another way to think about code generation: instead of doing it on the fly, you do it up front when you build the system and stitch the pieces together at runtime, and you end up with pretty close results. The paper you'll read two weeks from now compares the VectorWise approach and the HyPer approach in more detail, also in the context of vectorized execution. All right.
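Here is a rough sketch of what those pre-compiled primitives boil down to. The function names and signatures are my own guess at the pattern, not Actian's actual symbols: one tight selection loop per (operator, type) pair, compiled when the DBMS itself is built, with the executor picking the right one at query time.

```cpp
// Sketch of VectorWise-style pre-compiled selection primitives (illustrative
// names): one tight loop per (operator, type) pair, compiled when the DBMS
// itself is built, then stitched together at query time.
#include <cstdint>
#include <cstdio>
#include <vector>

// "less than" over an int32 column against a constant; returns the match
// count and writes matching row offsets into sel.
static int sel_lt_int32_col_const(const int32_t *col, int n,
                                  int32_t val, int *sel) {
  int k = 0;
  for (int i = 0; i < n; i++)
    if (col[i] < val) sel[k++] = i;
  return k;
}

// Same primitive, instantiated for 64-bit doubles.
static int sel_lt_double_col_const(const double *col, int n,
                                   double val, int *sel) {
  int k = 0;
  for (int i = 0; i < n; i++)
    if (col[i] < val) sel[k++] = i;
  return k;
}

int main() {
  // At runtime the executor looks at the plan ("col < 3.5", column is a
  // double) and dispatches to the matching primitive; no codegen needed.
  std::vector<double> col = {1.0, 4.2, 3.1, 9.9};
  int sel[4];
  int matches = sel_lt_double_col_const(col.data(), (int)col.size(), 3.5, sel);
  printf("%d matches, first offset %d\n", matches, sel[0]);  // 2 matches
}
```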
Actually, real quick, one more thing about this example. You'll learn more about vectorization and SIMD next class, but if the compiler is smart enough, or you give it pragma hints, it can recognize that this loop is iterating over a column doing a simple comparison, and it may be able to convert it into SIMD code automatically for you. That's where the "vector" in the VectorWise name comes from. The goal is that by writing the code as these simple primitives, the compiler can recognize that this little kernel, this one for loop inside the query, can be vectorized, and it will unroll it for you automatically. We'll cover that in more detail next week.

All right, MemSQL. MemSQL actually did this two different ways. The first way they implemented it was the HIQUE way, where you generate C++ code and then fork and exec GCC. The reason they did it this way is that one of the founders of MemSQL was at Microsoft when they were building Hekaton. He wasn't working on the project, but he saw some of the early internal talks where the Hekaton people described how they were going to do the source-to-source transpilation approach, and when he went off and built MemSQL, they borrowed that same idea. The problem, again, is that the compilation time was super high. HIQUE was an academic system, so they didn't have to worry about people getting upset that their query took a long time because it had to be compiled; in a real system, people notice these things. The way the MemSQL folks got around this is that the first time a query showed up, they would extract all the constants from it and turn it into a parameterized query, essentially an automatically prepared statement. They fork, exec, compile that, and cache the compiled plan. The next time they see the same query, or a query with the same structure but different constants, they can reuse the compiled plan from before. So: SELECT * FROM A WHERE a_id = 123. I extract out the 123, compile this, and cache it. Then if a query comes along with WHERE a_id = 456, I extract the 456, recognize I've seen this query shape before, and reuse that plan. In the early days of MemSQL their documentation was very explicit about this; they'd give examples saying, look, the first time you run your query it's going to take one second because we're doing this compilation, and the second time it takes 0.0 seconds because we can reuse the compiled plan. We asked them about this, and the parameterization and matching was really simplistic; it was just exact string matching. So if I had something like WHERE a.id = something AND b.id = something, and I came back with another query with those predicates reversed, b.id first and then a.id second, it couldn't semantically recognize that they're the same; the strings didn't match, so it would always recompile. It was a really simple technique. Then after 2016, MemSQL raised a lot of money and hired the guy from Facebook who built the HipHop VM.
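A minimal sketch of that early exact-match plan caching idea, in my own words rather than MemSQL's code: strip the numeric literals out of the SQL text, use the resulting template string as the cache key, and reuse the compiled plan on a hit. Because the key is just a string, the reversed-predicate case described above still misses.

```cpp
// Sketch of exact-string-match plan caching (my own simplification of the
// idea): replace numeric literals in the SQL text with '?', use the template
// as the cache key, and reuse the compiled plan on a hit.
#include <cctype>
#include <iostream>
#include <string>
#include <unordered_map>

struct CompiledPlan { /* pointer to compiled code, etc. */ };

// Replace runs of digits with '?' so "a_id = 123" and "a_id = 456" collide.
static std::string parameterize(const std::string &sql) {
  std::string out;
  for (size_t i = 0; i < sql.size(); ) {
    if (isdigit((unsigned char)sql[i])) {
      out += '?';
      while (i < sql.size() && isdigit((unsigned char)sql[i])) i++;
    } else {
      out += sql[i++];
    }
  }
  return out;
}

static std::unordered_map<std::string, CompiledPlan> plan_cache;

static CompiledPlan &get_plan(const std::string &sql) {
  std::string key = parameterize(sql);
  auto it = plan_cache.find(key);
  if (it != plan_cache.end()) return it->second;         // fast path: reuse
  std::cout << "compiling " << key << " (slow, ~1s)\n";   // fork/exec the compiler here
  return plan_cache.emplace(key, CompiledPlan{}).first->second;
}

int main() {
  get_plan("SELECT * FROM a WHERE a_id = 123");  // compiles
  get_plan("SELECT * FROM a WHERE a_id = 456");  // cache hit
  // Same predicates in a different order miss, exactly as described above:
  get_plan("SELECT * FROM c WHERE a_id = 1 AND b_id = 2");
  get_plan("SELECT * FROM c WHERE b_id = 2 AND a_id = 1");  // recompiles
}
```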
So Facebook famously runs on PHP, and the PHP interpreter is slow, so they built essentially their own VM for PHP. MemSQL hired the guy who led that project to come in and re-architect the entire system. What they're doing now, which I think is the best way to do this and is what we're doing in our own system, is that instead of going directly from the query plan to C++ and compiling that, or directly from the query plan to IR and compiling that, they have an intermediate language, a DSL, a domain-specific language. They convert the query plan into this C-like dialect they call the MemSQL Programming Language, MPL. From there they generate opcodes, and they have an interpreter for those opcodes, but they can also convert those opcodes into LLVM IR and compile that (see the toy sketch below). You might say this sounds like a lot of steps, so why do it? Well, the thing we haven't talked about, the biggest problem with the LLVM approach the HyPer guys take, is that when you crash, you land in assembly. You don't have a backtrace to figure out what the hell is going on, and you have no easy way to say: I'm crashing in this assembly, and here's the pointer to the source code that generated the IR that caused the crash. It's a lot of trial and error to figure that out. In our own system, we only had maybe two or three students who could actually work on that engine; everyone else would struggle or give up. And I'm not claiming I'm one of the ones who could; I couldn't figure it out either. It's hard. What this DSL and its interpreter buy you is that the people who aren't working on the IR internals can work directly at this level and debug it, because it looks like C or C++: you can run it through the interpreter, step through it, and see what's going on. With a few debugging tricks you can set up GDB with hints about where in the generated code the opcodes you're looking at came from. They told me, and I think they're absolutely right, that it's way easier to hire people to work on this than to have people work directly on LLVM IR. I actually don't know what the performance penalty is; obviously there's some, but I don't think it's major, because you're not doing optimization passes at this stage, you're just doing a transformation. It's almost the same cost you pay going from SQL to the AST or the query plan. In my opinion, this is the state-of-the-art way to do code generation; it's the right way to do it. And it's kind of a combination of everything: we're doing transpilation, source-to-source compilation, because we take the query plan and generate this intermediate DSL, and then we take that DSL, convert it to IR, and do the JIT stuff.

All right, a few more to finish up with. Vitesse DB, not to be confused with Vitess, the MySQL sharding system from YouTube. Vitesse DB is a query accelerator for Postgres and Greenplum, and it borrows a lot of ideas from the HyPer guys.
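As promised above, here is a toy version of the DSL-to-opcodes-to-interpreter idea. The opcode names and layout are invented for illustration (they are not MPL's or anyone's real bytecode); the point is that a filtered scan lowers to a handful of readable instructions that a developer can single-step through in a debugger.

```cpp
// Toy bytecode interpreter in the spirit of the DSL -> op-codes -> interpret
// pipeline described above (opcode names are invented for illustration).
// The "program" below is what a filtered scan might lower to.
#include <cstdio>
#include <vector>

enum OpCode { OP_LOAD_COL, OP_LT_CONST, OP_JUMP_IF_FALSE, OP_EMIT, OP_NEXT };

struct Instr { OpCode op; int arg; };

void run(const std::vector<Instr> &prog, const std::vector<int> &col) {
  int pc = 0, row = 0, reg = 0;
  bool flag = false;
  while (true) {
    const Instr &in = prog[pc];
    switch (in.op) {
      case OP_LOAD_COL:      reg = col[row]; pc++; break;       // reg = col[row]
      case OP_LT_CONST:      flag = reg < in.arg; pc++; break;  // flag = reg < arg
      case OP_JUMP_IF_FALSE: pc = flag ? pc + 1 : in.arg; break;
      case OP_EMIT:          printf("emit row %d (val=%d)\n", row, reg); pc++; break;
      case OP_NEXT:          if (++row >= (int)col.size()) return; pc = in.arg; break;
    }
  }
}

int main() {
  // SELECT val FROM t WHERE val < 100, lowered by hand:
  std::vector<Instr> prog = {
      {OP_LOAD_COL, 0},       // 0: reg = t.val[row]
      {OP_LT_CONST, 100},     // 1: flag = (reg < 100)
      {OP_JUMP_IF_FALSE, 4},  // 2: skip the emit if the predicate failed
      {OP_EMIT, 0},           // 3: output the tuple
      {OP_NEXT, 0},           // 4: advance to the next row, loop back to pc 0
  };
  run(prog, {5, 250, 99, 1000});  // emits rows 0 and 2
}
```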
So, back to Vitesse DB: they do the push-based processing model, they compile to LLVM IR directly, and they avoid indirection by inlining or using direct calls for everything. What's really interesting is how they hook into Postgres (and Greenplum, which is based on Postgres, works the same way). Your query shows up, they look at it and decide: is this something we can compile, yes or no? If no, say it's an UPDATE, it goes through the regular Postgres path. If yes, it goes down the compilation path, and the compiled code knows how to operate directly on the native Postgres data. So it's not like you have two different databases running; it's all the same underlying data structures and storage, and only how you get at it varies depending on whether the query can be compiled. There's a pretty good talk, a few years old now, from one of the founders of the company at a Postgres conference; he basically goes through all the things we talked about today, plus vectorization, and explains how they make Postgres run faster. Back then they could also do intra-query parallelism, running a single query on multiple threads, at a time when Postgres couldn't do that. Postgres is actually much better at this now; it supports parallel queries. MySQL still does not.

Apache Spark does this for expression trees. They have this thing called the Tungsten engine that came out in 2015. The way it works is that everything in Spark is written in Scala, and there's a trick you can do in Scala to generate JVM bytecode directly. So they're doing the same kind of JIT stuff that HyPer does, except instead of generating LLVM IR they generate JVM bytecode.

All right, so let me get to us. The original version of our system was entirely interpreted, because we didn't build it for this; we built the first version to do some non-volatile memory research, not the compilation stuff or the self-driving stuff we're doing now. So we had the old interpreted engine, and then my PhD student built a new LLVM engine. The goal eventually was to throw away all the interpreted stuff and just keep the compiled engine. Of course, that caused a bunch of problems along the way, because we essentially had two code paths to execute queries, we would get different results, and it wasn't always clear which path a query was taking. In the LLVM engine he implemented, he did the HyPer-style full compilation of the entire query plan; we had C++ code generate the IR directly and then compiled that. We didn't do the adaptive execution stuff the HyPer guys did. We did have a visiting student build an IR interpreter, but we never integrated it into the full system. We eventually wanted to do the staged execution that HyPer does, but we just never got there. The thing that was going to make what we were doing significant, compared to HyPer, is that we were going to relax the pipelines to create little mini-batches going from one pipeline to the next, and we would hide the memory stalls using software prefetching. HyPer wasn't doing vectorized execution; they would take a single tuple and push it up the pipeline as far as possible.
But if you want vectorized execution, that doesn't work, because you need to collect a batch of tuples, put them into your SIMD registers, and then fire off the instructions to process them. And that hurts, because instead of taking one tuple and keeping it in registers as long as possible, I'm going back around my loop to fetch more tuples, and I can take a memory stall on each of those fetches. So we would do software prefetching: give hints to the CPU saying, hey, we're going to loop back around and look at this next tuple, so go ahead and prefetch it into the caches, and that hides the memory stalls (there's a small prefetching sketch below). We have one graph that shows the benefit of doing this, and we'll cover it more in the next lecture. The gray bar is the old engine; I'm showing it just so you can see how bad it was. It's about 100 times slower. I don't want you to get the idea that going from interpretation to compilation gets you 100x; that number is more about going from a terrible implementation to a sane one. The realistic number is maybe 20 to 50x. The LLVM engine gives you a huge benefit on its own, but for some queries you can squeeze out another 20% or so by adding vectorization and software prefetching. Again, we'll cover a paper that compares the trade-offs between compilation and vectorization in two weeks. This is one example of trying to combine the two: VectorWise does predefined compilation plus vectorization, HyPer does on-the-fly compilation, and our approach tries to do both, on-the-fly compilation with vectorization.

All right, so that was 2017. What are we doing in 2019? This explains what that one project team has been working on. We're doing the MemSQL-style conversion: go from the query plan to a database-oriented, domain-specific language. It has a working name right now, but again, the name will change. We take that DSL and convert it to opcodes. Then we have an interpreter for those opcodes that can start executing right away, in the same spirit as what the HyPer guys do, while we compile in the background. At some point the compilation finishes and we can slide in the compiled version. The difference between this and what HyPer does is that HyPer takes the LLVM IR and interprets that, whereas we generate opcodes, the way the MemSQL guys do, and interpret those. That's even easier for humans to read, because if you look at the appendix of the paper you were required to read, it's all mumbo jumbo unless you know LLVM IR. These opcodes are a lot easier to understand; they're basically human-readable. So today we have a query plan, and we generate something in our DSL. This is what the language looks like: something like C or Python, readable. Then we convert that into opcodes, which look something like this. The exact names don't matter, but I can read this: jump-if-false, table-vector-iterator. These are all things that someone stepping through the code can use to figure out what the hell is actually going on.
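Going back to the software prefetching point for a second, here is a sketch of what hiding memory stalls in a vectorized probe loop can look like: while working on key i, issue a prefetch for the hash-table bucket that key i+8 will need, so the cache line is (hopefully) resident by the time you get there. __builtin_prefetch is the real GCC/Clang intrinsic; the hash-table layout and the prefetch distance are toy choices of mine, not our engine's actual code.

```cpp
// Sketch of software prefetching in a vectorized probe loop (GCC/Clang's
// __builtin_prefetch; the hash table here is a toy, not a real join table).
#include <cstdint>
#include <cstdio>
#include <vector>

struct Bucket { uint64_t key; uint64_t payload; };

static inline size_t hash_slot(uint64_t key, size_t n) {
  return (key * 0x9E3779B97F4A7C15ULL) % n;
}

// Probe a batch of keys; prefetch the bucket for key i+DIST while probing key i.
void probe_batch(const std::vector<Bucket> &table,
                 const uint64_t *keys, int batch, uint64_t *out) {
  constexpr int DIST = 8;  // how far ahead to prefetch
  for (int i = 0; i < batch; i++) {
    if (i + DIST < batch) {
      size_t future = hash_slot(keys[i + DIST], table.size());
      __builtin_prefetch(&table[future], /*rw=*/0, /*locality=*/1);
    }
    size_t slot = hash_slot(keys[i], table.size());
    // Toy "probe": a real table would walk a chain and handle collisions.
    out[i] = (table[slot].key == keys[i]) ? table[slot].payload : 0;
  }
}

int main() {
  std::vector<Bucket> table(1024);
  for (uint64_t k = 0; k < 1024; k++) table[hash_slot(k, table.size())] = {k, k * 10};

  std::vector<uint64_t> keys(256);
  for (size_t i = 0; i < keys.size(); i++) keys[i] = i;
  std::vector<uint64_t> out(keys.size());

  probe_batch(table, keys.data(), (int)keys.size(), out.data());
  printf("payload for key %llu: %llu\n",
         (unsigned long long)keys[42], (unsigned long long)out[42]);  // prints 420
}
```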
So, picking the opcode thread back up: we take those opcodes and run them through an interpreter that executes them step by step. In the background we also compile them, and when that's ready, we slide in the compiled machine code to replace the interpreted version (there's one last sketch of that swap-in at the very end). This is the future; this is what that team is helping my PhD student Prashanth build. Coming in the fall we'll have this fully integrated, so we'll be able to execute SQL end to end. And if you make a mistake in the code that generates the DSL, you'll be able to figure out what caused the error and trace it back to where it came from. That's the goal; that's what we want to do.

All right, we're well over time. Actually, since we're out of time, maybe I'll cover the code review stuff at the end of class on Monday. We'll do the reviews and then a quick rundown of what you need to do for the code reviews. I'll also post the guidelines online.

Okay, so the query compilation stuff: again, the reason I put this lecture at the beginning last year is because I think it's super important. It's hard, and it's part of why database developers get paid a lot of money, but this is what you want to do to get the best performance on today's systems. If you're building a system from scratch, you want to be doing something like this. I've already said it, but the way MemSQL does it now is the best way to implement this; I think you get most of the performance benefit and the best software engineering story. HIQUE gives you the best engineering benefit, because when you crash, you land in source code you know how to reason about. The LLVM approach that HyPer takes is the hardest, because you land in assembly. The MemSQL approach gives you the best of both worlds: the performance and the debuggability. So for any new system you're building today, this is what you want to do, okay? All right, any questions?
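One last sketch to close things out, since the background-compile-then-swap idea above is easy to show in miniature: the query starts on the interpreter immediately, a background thread plays the role of the compiler, and when it finishes, an atomic function-pointer swap routes later batches to the compiled path. All the names and the structure here are my own illustration of that idea, not the actual engine.

```cpp
// Sketch of adaptive execution: start interpreting immediately, "compile" in
// a background thread, and atomically swap in the compiled function once it
// is ready. All names are illustrative; a real engine swaps per pipeline.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

using PipelineFn = void (*)(int batch_id);

static void interpret_batch(int batch_id) { printf("batch %d: interpreter\n", batch_id); }
static void compiled_batch(int batch_id)  { printf("batch %d: compiled code\n", batch_id); }

int main() {
  std::atomic<PipelineFn> pipeline{interpret_batch};

  // Pretend codegen plus LLVM compilation takes a while.
  std::thread compiler([&] {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    pipeline.store(compiled_batch, std::memory_order_release);
  });

  // Meanwhile the query keeps making progress, one batch at a time.
  for (int batch = 0; batch < 100; batch++) {
    pipeline.load(std::memory_order_acquire)(batch);
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  compiler.join();
}
```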