So, disappointing turnout for today's class, because I think this is one of the better lectures or topics we're going to discuss. Today we're going to talk about query compilation, and the idea here is that instead of having a general-purpose database execution engine process queries, we're going to generate specialized code to execute each individual query. As we'll see, the performance difference is quite significant. We'll start off with the general background of query compilation, or the idea of specialized code in a database system. Then we'll talk about two different techniques: one is to do code generation on the fly, generating C++ files that we compile, link in, and execute for the query; and then we'll talk about a way to do just-in-time compilation, using LLVM to generate the IR you need for the query and execute that directly in the process of the database system. And then we'll finish up going through some examples of what real-world database systems actually implement. Part of the reason why I think this is a really important topic is that it's the next frontier, the next major boundary we have to overcome to improve the performance of an in-memory database system. We've already shown, in the beginning of the course, how going from a disk-based system to an in-memory system lets us have better concurrency control, better indexes, all sorts of things be much faster. But once we've done that, where do we go next? There's this great comment in one of the papers from the Hekaton people where they talk about how, when they started Hekaton, the idea was that they wanted to go 100x faster than what the existing SQL Server system could do. SQL Server at the time was 25 years old, and they were looking at how to make it last for the next 25 years. Obviously, the first thing they did was go from a disk-based system to an in-memory system. But to go beyond that, they came up with a calculation: once everything's in memory, the only real way to have the database process queries faster is to simply reduce the number of instructions you have to execute per query. We're not worrying about disk I/O; everything's in memory. Yes, we can be clever about how we keep things in our CPU caches, but at the end of the day it comes down to reducing the amount of work we have to do. And their calculation is that in order to go 10x faster, we need to execute 90% fewer instructions for every query. But to get to that 100x number they were shooting for, you need to execute 99% fewer instructions, which is a pretty daunting task. So this is where code specialization or query compilation fits in. The idea of code specialization is that we find the parts of the database system that are executed repeatedly as we process queries, and we specialize them for their input parameters so the system is no longer doing all these conditional checks that say: am I operating on an int, or am I operating on a varchar? You directly access the data that you need. We can do this for other aspects as well, like predicate evaluation.
So again, the idea here is that we're going to generate code on the fly, since we don't know all the queries that are going to show up. Someone could be typing in some random query from the terminal, and we generate code for it as it comes in and execute that directly in the system. Now, this may seem kind of obvious, and as we'll see later on, this is not a new idea by any means. As with many things in database systems, it was first implemented in the very first system, IBM's System R, in the 1970s. I'll talk a little about that later in the lecture, but they basically abandoned query compilation and code specialization in the 1970s for software engineering reasons, and then basically nobody did anything like this for 20 or 30 years. It's only within the last decade or so that query compilation has come back into vogue, and that's because we've finished transitioning to multi-cores and main memory, and now we're looking at how to go beyond that. One of the problems we face when we build a system that incorporates query compilation and code specialization is that a lot of the time, the code in our database system is written in a way that's maybe not the best way to execute on a CPU, but is the best way for us as humans, the people actually writing the code, to reason about query processing. To give an example of this, I want to show you how, when we use the traditional Volcano-style or iterator model for executing queries, it's really easy for us to look at it and make sense of it, but it's probably the worst thing to do from a CPU architecture standpoint. We're going to use a simple database in this example. We have three tables: A and B each have a primary key and a value, and C has foreign key references to both A and B. The query is a three-way join between A, B, and C, but for table B, we materialize it as a nested query that computes an aggregate over its ID. And then our WHERE clause does an equality predicate on A's value, and joins A with C and B with C. We didn't really talk about this when we covered query planning, but one of the main ways you can support a nested query in your relational algebra is to rewrite it to be a join. That's probably the most common technique. You could execute it as a separate query, materialize it, and then join with that, but typically you rewrite it to be a direct join within the query plan. So what we see here is the logical plan of this query; we can ignore the physical stuff, because it's not relevant to the point here. We don't care what join algorithm we're using. We see our scans on A, B, and C, and then we feed into each of these operators: this one has a filter, this one has a filter, we compute our aggregate, then the aggregate output is joined to C, and then this is joined up there. So when we implement this in the Volcano model, we would write all these operators as these little snippets of code, right? And I realize I don't expect you to read code — I always tell my students never to show code in a presentation. But for our purposes, we have to show code, because we're doing code compilation stuff.
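To make this concrete, here is a minimal sketch of what those Volcano-style operator snippets look like. The interfaces and names below are made up for illustration — this is not HIQUE's or any real system's code:

```cpp
// A minimal sketch of Volcano-style iterators (hypothetical interfaces).
// Each tuple travels up the plan through a chain of virtual getNext() calls.
struct Tuple;  // opaque row representation

struct Operator {
    virtual ~Operator() = default;
    virtual Tuple* getNext() = 0;  // returns one tuple, or nullptr at end
};

struct SeqScan : public Operator {
    Tuple* getNext() override {
        // advance a cursor over the table and return the next tuple;
        // a real implementation computes the tuple's address from the schema
        return nullptr;
    }
};

struct Filter : public Operator {
    Operator* child;
    bool (*predicate)(const Tuple*);
    Tuple* getNext() override {
        // pull from the child until some tuple satisfies the predicate:
        // a virtual call plus an indirect predicate call per tuple, which
        // is exactly the per-tuple overhead discussed next
        while (Tuple* t = child->getNext()) {
            if (predicate(t)) return t;
        }
        return nullptr;
    }
};
```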
This is basically everything we talked about before, where each operator calls getNext on its children to have tuples pulled up into it, right? So this thing would call getNext on its left child, then we'd get to this filter. The filter iterates over its child, which is A; it calls getNext, which comes down here and emits tuples for every single tuple that's in A, and moves them up into here. The reason why this is terrible from a CPU standpoint is that for every single tuple, we have to make all of these different function calls as we go down: I call getNext on the left child, then this calls getNext down here, and then that calls getNext on this, and we're pushing up one tuple at a time. Now, we can change this using the block processing model we saw in VectorWise, where instead of pushing a single tuple, we push a bunch of tuples. In the case of Peloton, we can push tiles. But it's still basically the same problem, where we have these for loops and we're calling all these different functions. And we haven't even gotten to how we actually evaluate predicates; that's a whole other mess as well, right? So again, all of this is really nice for us to understand from a human standpoint, because we have these function calls, we have these nice abstractions. But when it comes down to it, this is the worst thing for a CPU. Now, you may say, why can't you just inline all this in your code? For some of these you can; for others you can't, because as you process queries, you have to go figure out what the type is in order to do whatever processing you need to do. That's going to be these giant switch statements that say: if I'm an integer, I know I'm this size; if I'm a varchar, I know how to look it up in a variable-length table. All of that is indirection, which is terrible for modern CPUs with large pipelines. Because as soon as you have a function call or a conditional branch, if the processor makes a misprediction or you have to jump to another location, you're flushing your pipeline and spilling registers to the stack and things like that. So all this is going on just by calling these functions. And there's no easy way to put in magic macros or pragmas or preprocessor directives in your code to make this go magically faster. Any questions about this — about why the Volcano model or the iterator model is bad for modern CPUs? So, predicates have the same issue. Let's say we're just looking at this one predicate here: B.val equals some input parameter that we're getting from the client, assuming this is a prepared statement, plus 1. So the way we would evaluate this predicate: we have that for loop where we call getNext, getNext, getNext on the scan operator on the table. Then we feed that into our filter operator, and for every single tuple, we have to traverse this expression tree to determine whether the predicate evaluates to true and therefore whether the tuple should be emitted up the query plan. So every time you evaluate a predicate, you have to pass in some context information that tells the predicate code what tuple it's looking at and what the execution context is. You have the current tuple you're looking at, and this is just the actual values themselves, the byte array.
Then you have the input parameters that may have been passed in, because we have a prepared statement. And then we have to know the table schema, which tells us, for this byte array, how to jump to attribute 0, attribute 1, and so on, what types they are, and how to cast them. So the way we evaluate this predicate is we start at the root, where we have this equals operator, and then we go down the left side. Here we have a node that says: substitute this node with the tuple attribute named "val". For this, we look at our table schema, figure out that val is the second attribute in our schema, then jump over to our tuple, find the bytes we're looking for, cast them to the right type, and produce 1,000. So now we've done all these interactions, all these function calls, to figure out just what this one actual value is. Then we traverse further down the tree. We get to this plus operator; it knows it needs to go down and get more values. Now we have a parameter node, and this corresponds to the query parameters. We look it up in our context here, we have to know what the type is, we have to know the value, and we get our 999. Likewise over here, this is a constant — that's pretty easy, it's just 1. But now, as we traverse up the tree and add these together, and evaluate whether the comparison is true or not, there are all these additional checks we have to do during this predicate evaluation that are expensive. For example, we have to make sure we don't overflow. We have to make sure that if an input is null, we produce null and not an actual integer. We may have to cast based on what types we're operating on — we have to cast a certain way to compare a date and a time together. So this is pretty expensive, and we're doing it for every single tuple as we scan along. So this obviously sucks. This is very expensive. It adds more indirection, it adds non-determinism and conditional branches to our code, and it's really hard for the compiler to optimize. So this is where code specialization is designed to help. The idea is that instead of having this generic execution model for both our operators and our predicates, we bake in exactly what we need to do for every single query. We no longer have to go look at the schema for our table to know where to find the second attribute and how to cast it. We just have it directly in our code: jump to this offset, cast it this way, and then do whatever evaluation I want to do. This reduces the number of function calls; it reduces a lot of different things. We can use code specialization in all of these different parts. Access methods: how do we access the underlying tables or indexes? We can compile that down. Stored procedures: we can take the PL/SQL that the application developer writes and convert it into faster runtime code that doesn't have to be interpreted. Operator execution we saw, and we'll focus on it in the rest of the lecture, but predicate evaluation is the other big one. Most of the time when people do query compilation or code specialization, they're doing these two. The commercial guys do these. And this last one, logging operations — I don't think anybody actually does this.
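Before moving on, here's a rough sketch of what that interpreted predicate evaluation looks like for "B.val = $1 + 1". The structures are invented for illustration; a real interpreter also handles NULLs, overflow, and many more types, adding yet more branches:

```cpp
// Sketch of interpreted predicate evaluation for "B.val = $1 + 1"
// (hypothetical structures, simplified to 64-bit integers only).
#include <cstdint>
#include <cstring>
#include <vector>

enum class ExprType { ColumnRef, Parameter, Constant, OpPlus, OpEquals };

struct Schema { std::vector<size_t> offsets; };  // byte offset per attribute

struct ExprNode {
    ExprType type;
    int index = 0;          // attribute index or parameter index
    int64_t constant = 0;   // only for Constant nodes
    ExprNode* left = nullptr;
    ExprNode* right = nullptr;
};

// Every call re-inspects the node type, the schema, and the parameter
// list: this is the per-tuple indirection being described above.
int64_t evaluate(const ExprNode* n, const uint8_t* tuple,
                 const Schema& schema, const std::vector<int64_t>& params) {
    switch (n->type) {
        case ExprType::ColumnRef: {   // look up the offset, then copy the bytes
            int64_t v;
            std::memcpy(&v, tuple + schema.offsets[n->index], sizeof(v));
            return v;
        }
        case ExprType::Parameter: return params[n->index];
        case ExprType::Constant:  return n->constant;
        case ExprType::OpPlus:
            return evaluate(n->left, tuple, schema, params)
                 + evaluate(n->right, tuple, schema, params);
        case ExprType::OpEquals:
            return evaluate(n->left, tuple, schema, params)
                == evaluate(n->right, tuple, schema, params);
    }
    return 0;  // unreachable
}
```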
That last one, the logging operations, is something we're trying to look into here at CMU with the Peloton stuff. One way to think about it is: any time you have to do any parsing — parsing a string, parsing a log message, parsing a byte stream to figure out what a tuple is — that's where code specialization could help, because you'd get rid of that giant switch statement to figure out what the type should be. So what are the benefits of this? I've alluded to them already, but the obvious ones are these three. The first is that since we know exactly what the query is going to do, what data it's going to touch, and the layout of that data, because we have its schema, we don't need a giant switch statement to do the casting, right? We cast directly to exactly what we need using that reinterpret_cast method, or compiler directive, that I talked about before. That's not even an instruction; it's something the compiler uses to let you take an arbitrary pointer or byte stream and treat it as a particular type we can then use for other operations. Second, since we also know what the predicates are going to be — we know it's going to be val equals input plus one — we no longer need a tree to evaluate that. We can literally just write the code that knows how to take one value, add it to another value, and compare whether the two are the same. We don't have to traverse the tree. We can add a little bit of extra code to make sure we're not dealing with null values, and we can rely on the hardware to do overflow detection for us. There's a lot of stuff that you would otherwise write in code that we no longer need when we specialize. And the last one, obviously, is avoiding function calls inside of a loop. Inlining sort of helps with this, but not always. It also keeps the code very compact, so you can keep everything in your L1 cache. So this is why we want to do this, and the performance difference is quite significant. Just to ground everyone and make sure we're on the same page: this is the same diagram I showed last week when we were talking about query optimization. There we were dealing with the planner, the optimizer, and the cost model stuff up here. Now, when we say code specialization or query compilation, we're adding another box in the workflow: after we get the best physical plan from the optimizer, we take that physical plan, convert it into some intermediate format, compile it, and generate native code that we then execute directly in our system. So we're no longer compiling a plan that is then interpreted by those operator executors, the same way you guys do it in the hash join assignment. We're literally baking in the code to run exactly the query we want to run, and we invoke that as a function, right? Yes? So the question is: how much time do you have to do the compilation? That's a very good question, and the answer is: not much. We'll see this in a second, but for some queries in the paper you read, it took half a second to compile a query with GCC. LLVM makes things a little faster, but it's still in the tens of milliseconds. So there are two ways to get around this, which I'm not really going to cover in depth.
Well, I'll talk about one way that MemSQL does it, but in the case of OLTP, you're almost always going to execute stored procedures, or at least prepared statements, so you can compile ahead of time. And remember, we said most OLTP queries are sargable, so most of the time it's just picking exactly what index you want to use; you generate the code that knows how to go get the one thing you want and produce the answer. So in OLTP, we can get around the compilation cost. In OLAP, it's a little trickier, because it's usually a one-off query we haven't seen before, and you basically just pay the penalty. In the case of LLVM, it takes maybe 20 milliseconds to compile, but that's okay, because the query could take 20 seconds to run, so I'm willing to make that trade-off. I don't think anybody does this yet, but something we're interested in pursuing here at CMU is partial code compilation, or partial query compilation. Let's say there's some portion of a query — maybe an aggregate over a nested sub-query — that shows up over and over again in different queries. The outer part of the query is always different, but this inner part is the same. So maybe we compile just that one piece, cache it, and reuse it for subsequent queries that come later on. Anybody else? Yeah — the statement is that this is like what the JVM does with HotSpot optimization. Exactly, yes. So the answer to your question is: the faster the better. GCC, I think, is slow, but for LLVM, 20 milliseconds is a reasonable time. Yeah — so the statement is, the reason LLVM is faster than GCC is that it does less work than running with -O3 optimization in GCC. Absolutely, yes. No, I don't think anybody has done that yet. Yes — so in a sense, you could figure out which optimization passes actually matter for these kinds of queries and only apply those. Yeah, the statement is that you could figure out which optimizations matter for the query you're looking at and only enable those when you run LLVM. Absolutely, yes. All right, we're not going to go into that here. The main takeaway, and we'll see this later on, is that LLVM takes around 20 milliseconds to compile a query, and that's good enough for OLAP. For OLTP, you have to cache things. All right, that's a good segue, because here are the basically two different ways you can do this. The first choice is what we'll call code generation, and this is basically generating C or C++ code for the relational plan for that query. Then you fork GCC, it produces a shared object, you link that into the address space of your database process, there's a function that is the entry point for that query's execution, and then you just invoke that. The second choice is just-in-time compilation, where you use LLVM — which I'll talk about in a second — to generate an intermediate representation, or IR, of the query plan; you compile that, and it produces a binary that's already linked into the system, and you execute it right away. We'll talk about this later, but I think the second is the better way to go. For the assigned reading, though, I picked the first one, because I think that paper is better written and it walks through all the things that are important for doing query compilation.
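Mechanically, that first code-generation approach looks roughly like the sketch below: write the generated C++ to disk, shell out to the compiler, then dlopen() the shared object and grab the query's entry point. The paths, flags, and the entry-point name ("execute_query") are all made up for illustration:

```cpp
// Rough sketch of the code-generation workflow (assumed names/paths).
#include <dlfcn.h>
#include <cstdlib>
#include <fstream>
#include <stdexcept>
#include <string>

using QueryFn = void (*)(void* exec_ctx);  // assumed entry-point signature

QueryFn compileQuery(const std::string& generated_cpp) {
    std::ofstream("/tmp/query_42.cpp") << generated_cpp;

    // Fork/exec the compiler. This is the expensive part: process
    // startup, parsing, and optimization passes, measured later on.
    if (std::system("g++ -O2 -shared -fPIC /tmp/query_42.cpp -o /tmp/query_42.so") != 0)
        throw std::runtime_error("query compilation failed");

    // Link the result into our own address space. The generated code can
    // now call back into the rest of the DBMS through ordinary symbols.
    void* handle = dlopen("/tmp/query_42.so", RTLD_NOW);
    if (!handle) throw std::runtime_error(dlerror());

    auto fn = reinterpret_cast<QueryFn>(dlsym(handle, "execute_query"));
    if (!fn) throw std::runtime_error("missing entry point");
    return fn;
}
```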
So the HIQUE system from the assigned reading was an in-memory database out of Edinburgh, an academic project, and I don't think it was ever used beyond that, but it does everything I just described. They have a module that takes a physical relational plan and converts it into C code, then they exec GCC — you can pass in whatever flags you want — it compiles to a shared object, you link it, and then you invoke it to execute. Everything we talked about before is basically what they're doing here. So let me show you the examples. Let's say we use that same example database from before, but this time we'll just do a single select on A where A's value equals some parameter plus one, and assume for our purposes there is no index, so we have to do a sequential scan. The way you would do this in the Volcano model looks like this, right? You have some iterator that goes over every single tuple, we evaluate the predicate, and we emit the tuple if it passes. So this getTuple function is basically computing the pointer location for the tuple you're trying to get: I'm iterating over every possible tuple, say from zero to a hundred, and I need to say, find me tuple one, find me tuple two; therefore I need to look at my schema, figure out the size of every tuple — they're all fixed length — and compute the offset to jump to the memory location that has the tuple I want. Then we return the pointer and invoke the predicate evaluation, where we have to pass in the context information: what tuple we're looking at, what the table schema is, and what the input parameters for the query are. Then this has to traverse that predicate tree we saw, pull all the values up, and as you go along you're checking whether things overflow, checking whether things are null, performing whatever casting you need, and then returning true or false. So this is what we saw before, and as you can see, there are all these function calls, all this extra work, because it's generic — we don't know where anything actually is; we figure it out at runtime. Now, if we create a templated plan — you can think of this as what the template would be internally, with some variables that we fill in with the values for each particular query — we would know at the beginning what the tuple size is based on the schema. We would know, for our predicate where val equals the input parameter plus one, the offset of the val attribute within our tuple. And we would know the actual value of the parameter; we don't need that constant node in our predicate tree, we can do a straight evaluation. So we take all these values, bake them into the actual code, and this is all we have to execute to perform the same stuff that we did over here. In particular, the most important part is that our predicate evaluation is really simple now: we have our tuple, we know the offset, we can add one. This is now down to probably 10 instructions, whereas before we were executing thousands of instructions, because we were following pointers and checking all these conditions, right?
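As a concrete sketch, the generated, specialized code might look something like this. The offsets and sizes here are hypothetical values that the generator would bake in from the catalog at code-generation time:

```cpp
// Sketch of specialized generated code for
// "SELECT * FROM A WHERE A.val = $1 + 1" (hypothetical layout).
#include <cstddef>
#include <cstdint>

constexpr size_t TUPLE_SIZE = 16;  // known from the schema at codegen time
constexpr size_t VAL_OFFSET = 8;   // byte offset of the val attribute

void execute_query(const uint8_t* table, size_t num_tuples,
                   int64_t param, void (*emit)(const uint8_t*)) {
    const int64_t target = param + 1;  // folded once, not once per tuple
    for (size_t i = 0; i < num_tuples; i++) {
        const uint8_t* tuple = table + i * TUPLE_SIZE;  // direct offset math
        int64_t val = *reinterpret_cast<const int64_t*>(tuple + VAL_OFFSET);
        if (val == target) emit(tuple);  // a handful of instructions per tuple
    }
}
```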
So this is enormously faster, and this is why the code compilation or query compilation stuff is important. You can imagine the same techniques applied to all the other operators up the tree, the joins and things like that, because they're all done basically the same way. This is the most important part: understanding what we're doing here and why it's faster. I'm seeing heads nodding yes, so that's good. Okay. So you may be thinking: why not just do this using C++ templates? And I don't mean the code we're generating in HIQUE using C++ templates; I mean in our own database system code — why don't we just use templates for everything, and then at compile time it'll produce all the tailored, specialized implementations of all our functions? Can anyone guess why this is a bad idea? Right, exactly — he says you can't do this. Think about it: for every single tuple offset, for every single attribute I have, for every single scan I might want to do, I'd have to generate a new template instantiation. Templates are resolved at compile time; the compiler says, oh, I see you're using this class with an integer, I'll make an integer version of it, I'll make a float version of it. So you would have to instantiate all possible combinations for all possible queries, which is exponential. Yeah — so his statement is: for types this is fine, but for values this is bad, and for whatever comparison operator you're doing it's bad, right? And you also can't do this for complex query plans, because that's an infinite space. Yes — correct, his statement is that this is infinite. You can't do it. This is why we have to do this dynamically, why we have to generate our code as the query comes in: we simply can't prepare enough; we can't write enough preprocessor directives to make this work. Okay, a key thing to remember, though, is that the shared object we compile for our query gets linked into our process, and therefore it has full access to all the data structures and all the other functionality in our database system. That means not everything the query does has to be implemented in our generated code. For example, you probably don't want to bake the index code directly into the generated query, right? Because other queries running at the same time may need to access that index. So one of the nice things about this code generation approach is that the generated code can invoke other parts of the database system using the regular function calls and APIs that our interpreted version would use. That means we only have to specialize maybe the predicate evaluation and the stuff we're doing over and over again, while all the other parts of the system can still be written by the developers using your expected C++ APIs; we just call into them as we run. This is important because we don't want to implement concurrency control all over again, or logging, checkpoints, and indexes. All of this we reuse; we can test it, and everything just works. All right, so now we can get to the evaluation that the HIQUE guys did.
And I actually really like this paper, because they walk through all the complications of implementing this and what the performance expectations are, and then in their evaluation, rather than just saying "here's what it was before, here's what it is after," they implemented all these different variants leading up to what HIQUE can actually do. They start with the generic iterators — this is the Volcano model we talked about at the beginning, doing generic predicate evaluation by traversing the tree. Then they have an optimized version of the iterator model, where you inline as much of the predicate evaluation as possible and use type-specific iterators that know exactly what offset to find and how to cast the attribute you're looking for as you spin through the table. Then they have a hand-coded version of the system that still uses generic iterators and predicates — think of this as: I'm going to implement by hand the exact query plan I want to execute, but I'm still going to invoke the generic stuff from up here. Then they have the optimized hand-coded version of the query, and this is the best-case scenario: someone really smart writing exactly what the database system should do for that query and only that query. And then their HIQUE system is the automated version of all of this, and the goal is that it produces queries that run as fast as the hand-optimized one. For their evaluation, I'm only going to show one particular example, because it shows the most performance improvement. This is also running on 2009 hardware, so it isn't as speedy as today's, but that's okay. Obviously, what you see is that, as expected, going from left to right, the generic stuff is the slowest, and over here, the optimized stuff is the fastest. Also notice that the iterator models — both of them — plus the generic hand-coded version spend a lot of their time stalled on memory. We're showing L2 and memory stalls here; they also measured L1, but it's so fast that it doesn't really register on the graph, so I didn't include it. Obviously there are more memory stalls here because we're jumping around all the time doing these function calls and evaluating things; we're just doing more work, and we're not using the CPU in a smart way. The main thing to remember, though, is over here: the HIQUE guys can generate code that is almost as good or just as good as what a human would write for exactly the plan of a single query. That's pretty significant. This is running at around, I think, 50 milliseconds, whereas this guy up here is about a quarter of a second, 250 milliseconds, right? So that's about a 5x improvement from code compilation. A big part of it is that you're reducing the number of stalls, but you're also reducing the number of instructions you're executing, because you don't have that giant switch statement anymore. So, related to the earlier question: what does it take to actually compile these queries? Again, HIQUE is forking off GCC. When you think about what that actually means, it's a pretty expensive operation, right? You're starting a new process, right?
GCC has to check its config files, it has to check all these command-line arguments, then it has to allocate all the memory it needs, and then it actually does the compilation. So it's a pretty heavyweight way to do this. What they compared is running GCC for some queries from TPC-H with the compilation flag set to -O0, which means don't do any optimizations, and then with -O2. There's also -O3, but I don't know exactly what -O3 does beyond -O2. (Student comment: -O3 does a lot.) Yeah, -O3 is pretty aggressive. I think for the most part, when you run a production system, you usually run -O3, although it makes it difficult to see what's going on in the stack traces when things crash. But the main takeaway here is that the compilation time, in particular for this query over here, is well over half a second, right? So that's pretty bad. Now, this here is a different set of queries — for some reason they don't include the compilation time for TPC-H — but as we'll see with the HyPer guys, a query like this could take maybe 10 seconds to run, depending on how big the data set is, so maybe half a second isn't that big of a deal. But certainly for OLTP, where you care about queries coming back in milliseconds or microseconds, this is simply too long. And again, they're doing this for OLAP queries, where you expect to read more data and therefore paying this penalty is not that big of a deal. Any questions about HIQUE, the code generation stuff, the GCC problems, and everything? So, as we said before, using the relational operator model to execute a query plan is really useful for us as humans to read and reason about, and it's also useful for the query optimizer to apply transformation rules to generate an optimized plan. But as we said, it's not the most efficient way to execute. In the case of HIQUE, it's still doing something that looks like the Volcano model, right? The exception is that it's not calling all these extra functions, and it knows how to jump to offsets and find exactly the data it's looking for. The other big problem is that it takes a long time to compile a query, as we showed on the last slide. But the third piece is that because it still uses this iterator-style model to execute queries, it doesn't allow for the full pipelining we would want, where we reuse as much data in our low-level caches as possible all the way up and down the query plan. What I mean by that is: if we go back to the three-way join query from before — and this is the physical or logical plan for it — if we were executing this in an iterator model, we would push one tuple up at a time, and maybe not go as far up the tree as we could to do everything we need with it, right? We might do some extra stuff here before passing it along, but we're only looking at one tuple at a time — one tuple, one tuple, one tuple, like that. So the pipelines correspond to regions where we can do as much as we can with a single tuple before we push things up the query plan. And if you use a pull-based model, you can't always take advantage of that, right?
For example, what might happen is you say getNext, come down here, then jump over and call getNext down here, and then jump over and call getNext down here, right? What you really want to do is execute everything you need for these two operators, get it done, then push the result up to the next guy and have it work on the next pipeline. This is different from the materialization execution model, because we're combining multiple operators, whereas in the materialization model you would do everything you need for one operator, generate the answer, and then push it to the next operator. What I'm saying is that you can have these operators call back and forth to each other in an efficient manner: by avoiding the function calls, you do as much as you can with a single tuple while it's probably still sitting in your L1 cache or even your CPU registers, and you process all the operators before you move on to the next pipeline. Yes — so the statement is that the reason HIQUE doesn't do this is that it's not generating code that can deal with multiple operators at the same time. Correct. They go: I'm doing this scan — here's the code template, done; then here's the next one, right? And they just link those things together. Excellent point. All right, so here are our pipelines, and we want to do as much as we can within a single pipeline. And obviously there's this notion of a pipeline breaker, where for this join here, we can let the left side execute as much as possible, but for the right side we have to wait until things percolate up to us. So this gets to the second paper, which you weren't required to read, but which I think is the state of the art for doing just-in-time query compilation: HyPer. What these guys do is use LLVM directly inside the system, generate IR for the query plan, and execute that using LLVM. The difference is that they maximize the length of the pipelines in their execution chain by using a push-based model. The operators push data up, instead of all those pull-based calls where you call getNext, getNext, and have things pulled up to you. And the idea — as his point suggested — is to think about operators executing on data, with a processing model based on single tuples, rather than based on "I'm executing this operator followed by this operator." If you're not familiar with LLVM: it's been around for a while now, I think since 2000. It originally started out as, I think, a compiler replacement for GCC, but it ended up being a toolkit of compiler technologies that can do some really cool things. You can use it as a front-end compiler where the back end is GCC or Clang or whatever else you want. But the main thing people use it for in the context of databases is the ability to write a low-level programming language called the IR — the intermediate representation — which is similar to assembly but a little more human-readable, and then have LLVM execute that as native code directly in the system. You do have to do extra compilation to convert it into true machine code, but it's not as bad as invoking GCC like in the HIQUE case.
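To make the LLVM part less abstract, here is a tiny sketch of emitting IR for the predicate "val == param + 1" using LLVM's C++ IRBuilder API. This is a toy for illustration, not HyPer's actual generator — their generated code covers whole pipelines, not single expressions:

```cpp
// Emit: i1 pred(i64 %val, i64 %param) { return val == param + 1; }
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"

llvm::Function* emitPredicate(llvm::Module& module) {
    llvm::LLVMContext& ctx = module.getContext();
    llvm::IRBuilder<> builder(ctx);

    // declare the function: i1 pred(i64, i64)
    llvm::FunctionType* fnType = llvm::FunctionType::get(
        builder.getInt1Ty(), {builder.getInt64Ty(), builder.getInt64Ty()},
        /*isVarArg=*/false);
    llvm::Function* fn = llvm::Function::Create(
        fnType, llvm::Function::ExternalLinkage, "pred", module);

    llvm::BasicBlock* entry = llvm::BasicBlock::Create(ctx, "entry", fn);
    builder.SetInsertPoint(entry);

    llvm::Value* val   = fn->getArg(0);
    llvm::Value* param = fn->getArg(1);
    llvm::Value* sum   = builder.CreateAdd(param, builder.getInt64(1), "sum");
    llvm::Value* cmp   = builder.CreateICmpEQ(val, sum, "cmp");
    builder.CreateRet(cmp);  // hand the module to an LLVM JIT to run it
    return fn;
}
```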
So the key thing to remember, though, is that just as HIQUE's generated code could call other parts of the database system, like the indexes and concurrency control, without having to fully implement them in the generated code, we can have our LLVM query plans call into other parts of the system and not implement the entire thing in LLVM. And this is important because writing LLVM IR is non-trivial, right? It's not as easy as writing Python or Java. The number of people that can write LLVM IR is much smaller than the number of people that can write C++. So you can have a small crew of people write the part that generates the LLVM code, and the rest of your team can write the other parts of the system in C++. To give you an example of what this looks like — I'm obviously not going to show LLVM IR, because it would be difficult to read — we have a bunch of pseudocode here. This is an example of how they do push-based query execution using these pipelines. This first part is one pipeline: they iterate over A, which corresponds to this pipeline here, and we do everything we can for this pipeline before we switch over to the next part. These are the divisions of the pipelines in the query plan. And as you can see, the key difference from what we showed before is down here in this last pipeline, where we do all the join operations one after another, because up above we've already materialized all the information we're going to need for every single tuple. Now we can just scan right through and compute the joins, rather than jumping around calling getNext on all these different operators. So they claim this is pretty significant — that this is the correct way to do it when you want performance beyond what the Volcano model can give you. We'll look at the numbers in a second. And the way you actually implement this is not necessarily code that literally looks like this: they have a producer/consumer model that has the same flavor as writing getNext, without having to write everything in low-level LLVM IR. Does that make sense? I see heads nodding yes. Okay. So let's look at some performance results for HyPer. In this experiment, they run a subset of the TPC-H queries, and they compare HyPer running with LLVM versus HyPer using the HIQUE model where you generate C++, and they compare against VectorWise, MonetDB, and an unnamed database system, which is probably Oracle. What you see across the board for these red bars — and note we're on a log-scale Y axis — is that the red bar is almost always faster than everyone else, except, I think, in this one example here where the C++ case is a little bit better. And this is not including the compilation time; this is just asking whether the code you're generating is much faster, right? They basically say that using LLVM IR allows you to do certain optimizations that are difficult to do in C++, because you have more fine-grained control of exactly what's in registers and how data moves from one operation to the next, whereas C++ doesn't let you control all of that, and the compiler may end up changing things. So the main takeaway here is that using LLVM, in their model in HyPer, is the better way to go.
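Going back to that pseudocode for a moment, here is roughly what those push-based pipelines boil down to for the three-way join example, written as plain C++ for illustration. The data layout and key fields are simplified stand-ins, and a real plan would be generated, not handwritten:

```cpp
// Rough C++ rendering of HyPer-style push-based pipelines.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Tuple { int64_t id; int64_t val; };

void executePlan(const std::vector<Tuple>& A, const std::vector<Tuple>& B,
                 const std::vector<Tuple>& C, int64_t param) {
    // Pipeline 1: scan A, apply the filter, and build the hash table for
    // the A-C join, all while each tuple is still hot in registers/L1.
    std::unordered_map<int64_t, Tuple> ht_a;
    for (const Tuple& a : A)
        if (a.val == param + 1) ht_a.emplace(a.id, a);

    // Pipeline 2: scan B and build the aggregate (a count per id) that
    // the rewritten nested query produces, again in one tight loop.
    std::unordered_map<int64_t, int64_t> agg_b;
    for (const Tuple& b : B) agg_b[b.id]++;

    // Pipeline 3: scan C and probe BOTH hash tables back to back, so each
    // C tuple is joined all the way up the plan before we touch the next.
    for (const Tuple& c : C) {
        auto a_it = ht_a.find(c.id);    // c.id standing in for C's a_id key
        if (a_it == ht_a.end()) continue;
        auto b_it = agg_b.find(c.val);  // c.val standing in for C's b_id key
        if (b_it == agg_b.end()) continue;
        // emit the joined row (a_it->second, c, b_it->second) ...
    }
}
```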
Now, they have another paper where they show VectorWise actually beating their LLVM implementation in some cases. I haven't looked into why that turned out that way; maybe it's a newer version of VectorWise than what they used for this paper. So now, of course, we can go back and look at the query compilation cost. For this, I'm actually mashing up numbers from two different papers — the numbers from the HIQUE guys from before versus the HyPer paper — and it's not an apples-to-apples comparison, because it's obviously different hardware, and who knows what the data set size is, although that doesn't matter for compilation. What it basically shows is that when you run HIQUE and fork GCC, the compilation time can be over half a second, but with LLVM, they're down in the tens of milliseconds. So again, LLVM is faster because you're not forking an extra process and you don't have all the overhead of a full-fledged compiler like GCC. All right, so this LLVM stuff is actually something we're very interested in looking at here at CMU in the context of Peloton. Prashanth isn't here right now, but this is something his research is looking into. This is a micro-benchmark he worked on last semester — it's not within Peloton, just his own standalone thing, but now we're working on adding this to our full system. Here we're showing the performance difference you can get for a select query, a group-by, and a hash join, and you can see that the query time difference between the interpreted version and the compiled version is always about 5x, and his compilation times are pretty small — less than 10 milliseconds. I would attribute that to his thing being a testbed prototype that doesn't support all the possible types that a full-featured database system would support. But like I said, I think this is really interesting, and it's something we plan on pursuing in our own system as we go along. All right, so any questions about query compilation, the code generation stuff, or the LLVM? Yes — your question is the difference between the pipeline method and, say, a tile-based thing, where rather than going tuple by tuple, you go block by block or something like that? So, the difference between a block-based system like VectorWise and our tile system? They're basically the same thing. Oh, not the tile system — as opposed to the pipelining? Yeah, so here's the difference; let me go back here. Again, we operate on a single tuple, and we do as much processing on that single tuple as possible before we move on to the next one, right? So in this case here, for every single tuple in A, we do the predicate evaluation, and then normally you would shove it up to the next guy — or with VectorWise-style execution, you'd put it in a little queue, and once the queue is full, you shove that batch up to the next operator. But here, they do everything you can do with that tuple while it's still in your low-level registers, so that's also when they put it in the hash table.
The way this would actually be implemented in the Volcano model or the VectorWise model is that there would be a separate operator in your query plan to do this materialization; the tuple would get pushed up to that operator, which might come back and ask for more, or actually try to put it in your hash table. It's sort of like inlining, in a way. Yes? So the statement is: this is not unique to query compilation — you could implement this using, for example, the iterator model. True, yes. And the main takeaway is that they do two things differently from everyone else: the push-based model and using LLVM. But those are separable ideas — you can have one without the other. Okay. So let's go through some real-world implementations of query compilation. There might be more beyond these, but these are the major ones I know of, at least the ones that make the most noise about what they're doing. As I said in the beginning, this query compilation stuff was actually used in the very first system IBM built, System R. Remember, I said that when they started System R, there were like seven or eight people, all with PhDs, a lot of them fresh out of grad school. They put them in a room, said "we're building a new database system," and everyone picked their own topic, right? One person picked query optimization, another person picked SQL, and one person picked query compilation. What they would do is take any SQL statement that came in and generate the actual assembly code that would be used to execute the query, and then invoke that directly. So it wasn't using interpretation. Again, if you go back and read the retrospectives of System R, they talk about how the performance advantage of this was pretty significant, but it was such a pain to maintain and implement. Because any time you changed, say, the layout of data in the storage manager, you then had to go change all your query compilation code to reflect that. So that became a big pain for them. And at the time — this was before the dominance of the x86 or x64 architecture — there was DEC, there were all these other semiconductor companies, all with different chips and different instruction sets. Even within IBM, they had different instruction sets for different models. So it was really hard to write code that could generate assembly to run on all these different CPUs. It ended up not being very portable, and it was very hard to extend. And obviously, as I said, there were software engineering complications every time you changed something that affected the query compilation code. On top of that, they still had the expensive cost of making external function calls to other parts of the system. So when IBM started building DB2 — and I was told roughly 50% of DB2 in the early days came from System R — they ended up not bringing over the code compilation stuff, because it was thought to be too much of a burden. And then, like I said, this was abandoned for 20 or 30 years.
You'd read a couple of papers here and there that talked about doing code compilation for small parts of the system, but nobody really explored building a full system that does query compilation, like we're talking about here, until at least the 2000s. Oracle does this as well. They can take the PL/SQL that you write for a stored procedure and compile it into what is called Pro*C, which is their internal C dialect, and then they have their own compiler that converts that into native C/C++ code. I don't think they do code compilation for predicate evaluation or the actual operator implementations like we talked about with HIQUE or HyPer; I think they only do this for stored procedures. But one thing that's kind of cool — not directly related to query compilation — is that Oracle is now putting the different operations your database system does when it processes queries directly into silicon. If you remember, a few years ago Oracle bought Sun, and Sun was making all those SPARC chips, the Niagaras and things like that. Oracle still does this, and now you can buy boxes from Oracle with this specialized chip — I think it's called the SPARC M7 — that has specific instructions to do things you could otherwise do with LLVM. They have instructions that execute SQL operations directly: they have a SQL parser, they can do dictionary compression and decompression directly in silicon, they have a bunch of vectorized execution operators and registers just for SQL stuff, and they're also putting security and encryption in there. And I actually have one of these chips here — I'll pass it around for show and tell. This was given to me last summer at a conference. This is the SPARC M7 chip that Oracle has sold since 2015 with all this stuff in it. If you're putting operations in silicon, you don't even need the LLVM stuff, because you can make calls down to the chip and process things even faster than you could with LLVM. Now, this is actually a busted chip — a working one is like a $30,000 chip. Joe and I already tried to hawk it; we didn't get anything. This is from a rejected batch that they handed out as door prizes at a conference. But if it were real, it would be $30,000, and it would probably be the most expensive thing that small you've ever held. Like I said, it's kind of cool to look at. So again, they can put things in the chips so that maybe they don't need to do the query compilation stuff. Yes — the question is, how widely is this chip deployed? I don't know what their sales are for these things, and it's only been out since October 2015. The reason they're doing this is actually kind of smart: they are obviously losing ground to the Amazons of the world, right? And if you're already okay running on Amazon RDS, maybe you're already okay switching over to MySQL or something else. So their strategy is that by putting this stuff in silicon, if you're running Oracle, the way to get really good performance is to buy their specialized boxes with this stuff inside. I don't know how well it's working for them, but as a long-term strategy, I think it makes sense. Yes — you seem skeptical. Your point is you'd do something with a custom memory system, right?
So the statement is: if you really wanted to optimize the database system, you would optimize other parts of the machine, not just the CPU. There's a term for these — they were called database machines, and they were in vogue in the 1980s. The idea was that instead of buying a generic rack box from Dell or HP, you'd buy a machine with specialized hardware to do some of the repetitive database operations, right? And one aspect could be configuring the memory controller to be optimized for database workloads. These were real big in the 1980s, like I said — there were a bunch of proposals, and some people actually sold them. That market totally crashed, because in the end, people just want to run on commodity hardware. Even now, nobody can get away with selling an appliance other than Oracle, because they control the whole stack. If it doesn't run on Amazon, no one's going to use it. The last appliance system I can think of was Clustrix, and they basically gave up on that — and they weren't even doing specialized hardware; they just bought hardware and made sure the database system was tuned for it. The last specialized-hardware system was probably Netezza, which IBM bought; that was basically an FPGA that could do early filtering of data as it came off the disk. But for the most part, nobody does specialized hardware other than this M7 thing. How it's going to work out, I don't know. Yes — the question is, what is the performance improvement of the M7? I actually don't know. I can't imagine it's a lot, especially if you have a really small data set. Okay. Hekaton is another real implementation of query compilation. They compile both stored procedures and SQL, and they can compile the actual access methods for their in-memory tables. This allows it to interoperate with the non-main-memory, non-Hekaton parts of SQL Server: you can call into a Hekaton table and hit these fast compiled functions for various operations like hashing and selections. The way they do this is that they generate C code from an intermediate representation they produce after generating the physical plan — think of it as a tree — and then they feed that into the Visual C++ compiler, it compiles to a DLL, the same way HIQUE does, and then you link it in and invoke it. One of the things they talk about in their paper that I think is kind of cool is that they have to deal with people being malicious: they have all these extra safety mechanisms in the database system, or in their code compiler, to make sure the generated C code, once compiled, doesn't make calls out to things it shouldn't touch. Which I think is kind of cool. (You can just hold on to the chip; I'll pick it up afterwards.) Cloudera Impala is another big one that does compilation. I know they do this for predicate evaluation, and they do it for record parsing, but I'm not sure whether they do it for the actual operator compilation itself — it's a distributed system, so it might be that the network is the main bottleneck they had to deal with. Now, we didn't really talk about record parsing, but this is actually really important.
As I said, any time you have to take data, find the delimiters, and convert it into a format you can use internally — in the case of Impala, since they operate on top of HDFS, they don't have everything in a rigid schema. They have all these different data files in different formats with different serialization implementations. So they use LLVM to make all the parsing of that run really fast. Yes — right, so the statement is that Google protocol buffers has something similar to this. Yes, same thing. I bring this up because when I look at the literature, I only see Cloudera talking about using LLVM, or code generation, for record parsing. Everyone else uses it for operator execution and predicate evaluation. MemSQL does this as well, and this was one of the main tenets when they first came out — hey, look, we do this — because they got some of the ideas from the Hekaton guys; the founder was there when Hekaton was being built. What I'm going to describe now is what MemSQL did up until this year. What they do now is actually very, very cool, and I think it's one of the state-of-the-art implementations for doing code generation in a database system. I'm not going to say much about it, because their VP of engineering, Ankur Goyal, who is a CMU alum, is coming at the end of the semester to give a guest lecture exactly about how they do query compilation. So let me describe what they did before, and then a little bit about what they do now and why I think it's cool. They would do the HIQUE approach: generate a bunch of C code for the query and then invoke GCC. Obviously, if you do this for every single query that comes in, it's really slow. So the way they do caching is they take a query string like this, find all the constants in it — the things being passed in from the client — and parameterize it, converting it into a prepared statement even though the client didn't ask for one. And then they cache that. Now, if another query comes along that looks very similar, just with different input values, they parameterize it and map it to the same cached plan from before. So every time you execute this query, you don't have to fork GCC over and over again. They're doing straight string matching to make this work, right? So if you reorder things, or maybe instead of A.id you just wrote id — something really simple like that — they're not always going to be able to find the plan you had before and reuse it. So the statement is: why do string matching when you can match on the AST? Yes, I agree. From the documentation I've read, I think they're doing string matching, and we actually have a project here to do the AST matching and make this work. You can imagine if you have A.id equals some parameter and B.id equals some parameter, that compiles to one plan; but then if I come in with B.id equals some parameter and then A.id — if I just flip their order — I can't reuse the plan. The AST would take care of that. And again, that's something we're looking at. So what they're doing now is they got rid of GCC, and now they're doing the HyPer, LLVM-style thing.
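Here is a toy sketch of that parameterize-and-cache trick. A real implementation would use the SQL lexer to strip constants; this one only handles bare integer literals, and compileWithGcc is an assumed helper standing in for the codegen path sketched earlier:

```cpp
// Toy sketch of MemSQL-style plan caching: strip constants out of the
// query string, cache the compiled plan keyed on the parameterized form,
// and reuse it when a matching string arrives.
#include <cctype>
#include <string>
#include <unordered_map>
#include <vector>

using CompiledPlan = void (*)(const std::vector<std::string>& constants);

CompiledPlan compileWithGcc(const std::string& parameterized_sql);  // assumed

// Replace integer literals with '?' placeholders (toy version).
std::string parameterize(const std::string& sql, std::vector<std::string>& consts) {
    std::string out;
    for (size_t i = 0; i < sql.size();) {
        if (std::isdigit(static_cast<unsigned char>(sql[i]))) {
            size_t j = i;
            while (j < sql.size() && std::isdigit(static_cast<unsigned char>(sql[j]))) j++;
            consts.push_back(sql.substr(i, j - i));  // remember the constant
            out += '?';
            i = j;
        } else {
            out += sql[i++];
        }
    }
    return out;
}

std::unordered_map<std::string, CompiledPlan> plan_cache;

void execute(const std::string& sql) {
    std::vector<std::string> consts;
    const std::string key = parameterize(sql, consts);
    auto it = plan_cache.find(key);  // pure string match, as described above
    if (it == plan_cache.end())      // first time we see this query shape:
        it = plan_cache.emplace(key, compileWithGcc(key)).first;  // pay for GCC once
    it->second(consts);              // rerun the cached plan with new constants
}
```

Note how the cache is keyed on the literal parameterized string, which is exactly why flipping the order of two predicates misses the cache, as described above.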
So what they're doing now is they got rid of GCC and moved to the HyPer-style LLVM approach. They actually hired the guy from Facebook who created the HipHop VM — if you remember that, their optimized version of PHP. So now they still generate sort of C-like code — they have their own internal dialect of C — and they pass it through a set of compilers that the HipHop guy created, which compile it down into LLVM. The reason why I think this is a really good idea and really important is that you can have all your database developers just generate C code, and only a small number of people who know how to turn that into LLVM code. In the HyPer case, since all your queries are being compiled to LLVM, anybody that works on that part of the system has to know how to write LLVM code. But in the new version of MemSQL, most of your people can write in the C dialect, and that just gets compiled down into the optimized code later on. Question or no? The HipHop guy, whatever his name is — I think they put out a press release when they hired him, so, you know, he's legit, yes.

So the question is, am I suggesting that the right way to do this is to have a DSL the way MemSQL does? I think yes. If you're building a major commercial database system and you have a lot of people working on it, then you absolutely need to be able to do this, because I think it's very hard to hire people that can write LLVM directly. My understanding — and we've only been using LLVM for less than a year here — from people that have looked at it and tried to use it, is that it's very hard to debug. I think having this intermediate DSL makes things a lot easier, right? In the same way that it's really hard to hire people that can actually work on a database system, it's even harder to hire people that can work on LLVM stuff inside of a database system. It's a very niche market. So again, Ankur will come at the end of the semester and he'll talk about this. I specifically told him, don't give me some BS like "what's it like to work at MemSQL" — get down into the nitty-gritty details of how they do it. I think it'll be really cool.

So the last one is VitesseDB. Think of it as basically a query accelerator that gets hacked into, or grafted onto, Postgres and Greenplum. Greenplum is based on Postgres, so that's not a major deviation for them. They're using LLVM to do just-in-time query compilation, with the push-based processing model, which allows everything to become either direct calls or be inlined — all the same things we talked about before; there's a sketch of what that fused code looks like at the end of this section. And they report getting about a 5x improvement using LLVM over the regular Postgres execution. Right now they don't support all the different functionality: any time you have a transaction or anything that modifies the database, that still goes through the regular Postgres code. So it's sort of living off to the side, using the execution hooks that Postgres provides to interject itself into the execution pipeline of a query. I'll post the slides on the website, obviously, but there's a pretty good talk from their people at a Postgres conference last year that describes exactly what they're doing. They're also doing column stores and intra-query parallelism, but we've already covered all those things as well.
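Here is that sketch: a hedged, C-level picture of what a HyPer- or VitesseDB-style JIT conceptually emits (written as readable C++ here rather than LLVM IR) for a pipeline like SELECT COUNT(*) FROM A WHERE A.val = k. One loop pushes every tuple through scan, filter, and aggregate as straight-line code; the column layout is made up for illustration:

#include <cstddef>
#include <cstdint>

// Hypothetical columnar layout for table A.
struct ColumnA { const int32_t *val; std::size_t n; };

// What the JIT conceptually emits for this query: the whole pipeline
// fused into one loop. Each tuple is pushed through scan -> filter ->
// aggregate as straight-line code, instead of each operator exposing a
// virtual Next() that its parent pulls from one tuple at a time.
int64_t CompiledPipeline(const ColumnA &a, int32_t k) {
  int64_t count = 0;
  for (std::size_t i = 0; i < a.n; i++) {  // scan
    if (a.val[i] == k) {                   // filter: direct compare, inlined
      count++;                             // aggregate: COUNT(*), inlined
    }
  }
  return count;
}

Because the predicate and the aggregate are plain code in one function, there are no virtual calls or per-tuple function-pointer indirections left for the CPU to mispredict, which is where the reported speedups come from.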
All right, so the main takeaway I want you guys to remember is that the query compilation stuff is important and it makes a difference, but it's not easy to implement. And actually, I should have mentioned this earlier: I went and visited the MemSQL guys back when I was still in grad school, and we were talking about the compilation stuff. They told me that part of the reason they switched to the DSL from what they were doing before is obviously to avoid the GCC overhead, but it also makes writing the code that generates their query plans much, much easier. They told me that if they had to do it all over again, they probably wouldn't have started right away doing the code gen stuff the way they did it, because it made the system very brittle: if you changed one part, you had to change the code compilation stuff as well. But now they're in an awesome position. As Ankur will discuss when he comes and talks about it, I think the current version of how they do code compilation is the correct way, and I consider it to be the state of the art. The Hekaton guys do some very cool stuff as well, but I don't think they have a DSL; they have this sort of intermediate representation.

So, any questions about query generation, code compilation, predicate evaluation, and all of this good stuff? For next class, we'll spend time talking about scan sharing. We're now at the part of the semester where we've figured out exactly how to build our database system, and we're talking about techniques to speed things up. I think the code compilation stuff from today and the vectorized execution stuff from next week are the two things that are going to get you the most speed improvement out of all the techniques we'll talk about. The scan sharing stuff I think is kind of cool, but how useful it actually is depends on the workload. All right. All right guys, thanks for coming. See you on Wednesday.