As I said last class, this lecture is normally what I would put at the end of the semester; that's what I did the last two years when I taught this course. Right? Because this is the "advanced" part of the name of the course. But I think it's so important, and it's going to permeate through all the topics we talk about throughout the semester. And as you guys are seeing in the first project, you have to understand how code generation and query compilation work. So I wanted to push it up to be the third lecture. And also, when I go look on YouTube at which lectures people watch, everyone watches the first one, the second one has half as many viewers, the third one has half as many again, and after that everybody drops off. So I feel like I'm doing the world a service by putting this important topic in the third lecture, because then at least some people will actually see it, more so than they would if it were at the end of the semester. All right, so today's topic is query compilation and code generation. I'm going to start off with the background and the motivation: why do we actually want to do code generation? Why do we spend a lot of engineering effort to make this something you can actually implement in a database system? Then there are basically two methods to do this. We'll first talk about the first one, called transpilation, or source-to-source compilation. Then we'll talk about the technique from the HyPer paper that you guys read: just-in-time compilation with LLVM. The spoiler of the lecture is that, in my opinion, the second approach is the better way to go, and most systems that actually do query compilation take the second approach. And then I'll finish by talking about some other real-world systems, beyond just us at CMU or HyPer in Munich, that are doing code generation and query compilation. Okay? So we already said at the beginning of the semester that we were going to get rid of the disk and go to an in-memory database system architecture, and that's going to give us a huge speed-up. But now if we want to go even faster, if everything's already in memory, what's next? What can we do to actually speed up our system and make it more efficient? There's this great side comment in a paper from the Hekaton guys a few years ago, where they were talking about what they were going to do to speed up their database engine. Hekaton is a database engine that we're going to cover throughout the semester. The background is that it's an in-memory database engine that Microsoft developed in recent years to speed up the performance of the regular SQL Server engine. SQL Server was 25 years old at the time they started the Hekaton project, so they got some really awesome database people: Mike Zwilling, one of the original guys who helped port Sybase to Windows NT and who helped build the Shore system at the University of Wisconsin that you read about last class, and Paul Larson, another famous database researcher. And they basically said, let's design a new engine to make SQL Server relevant for the next 25 years, right?
And so if you start looking at what you can do to speed up your database system when everything's already in memory, what do you do next? They came up with this back-of-the-envelope calculation that I really like. They said: if you want your database system to go 10x faster, you need to execute 90% fewer instructions. And if you want your database to go 100x faster, you've got to cut out 99% of the instructions. The arithmetic is simple: if runtime is roughly proportional to instruction count, running in one-tenth the time means executing one-tenth the instructions, a 90% cut, and running in one-hundredth the time means a 99% cut. The first one is doable, right? It's not easy, but it's possible. But slicing out 99% of the instructions, that's pretty challenging, that's pretty hard. And that's the motivation for what we're going to talk about today: how can we get closer to that number? There's no magic flag in GCC, like -O100, that's going to make all these instructions go away. We're going to have to do careful engineering and reevaluate how we design our database system. Another thing to say is that in this lecture we'll mostly talk about instruction counts, but another useful metric we're going to care about throughout the semester is instructions per cycle: you want to reduce the number of instructions, but within a single cycle you also want your CPU to execute as many instructions as possible, and that essentially means trying to minimize the number of cache misses. That's also very related to what the HyPer guys are doing. So the only real option we have to achieve this goal of cutting down the number of instructions in our database system is code specialization. The idea of code specialization is that instead of having a general-purpose database system that knows how to execute any possible query you throw at it, we're going to generate code that is specific to one particular task in our system, and that essentially means executing a single query. With interpretation there's all this extra stuff we have to do: as we execute a query, we have to look at what's in the WHERE clause, and then for every single tuple figure out what the type is, whether it's null or not. There's all this extra work you have to do when you don't know in advance what data a query will run on. With code specialization, we can basically bake into machine code, executable code, exactly everything we know about the data and everything we know about the query, and we don't need giant switch statements that say if it's an int, do this, if it's a float, do that. We know everything a priori. The reason this is only now in vogue, and why a lot of existing systems don't do this kind of stuff, is that the way of writing your database engine's source code that's most easily understandable by us humans often turns out to be the worst way to write it for your CPU. That's why most systems don't start off doing code specialization. Most systems build the interpreted engine, which is what we originally had in Peloton, because that's easy for humans to understand and engineer. I'm going to go through a bunch of examples during this lecture, and the example database I'm going to use for all of them is this really simple one here that comes from the HyPer paper.
It's a three-table database: A, B, and C. A and B have integer primary keys and a value, and C has foreign key references to both A and B. So the first thing we need to talk about is how we actually process a query. The standard way that everyone does this, and the way we teach it in the introduction class, is the tuple-at-a-time query processing model. It's sometimes called the iterator model or the Volcano model. Volcano was an influential system from the late 1980s by Goetz Graefe, who has done a ton of different work in this area, and one of the things Volcano proposed was this tuple-at-a-time query processing model. It's not that people weren't processing queries this way before, but he laid out how to do it, and there's some extra stuff they added in Volcano for doing this in parallel, but we can ignore that for now. The tuple-at-a-time model is basically where you traverse the query plan tree and call Next on your child operator to pull up data from the operator below you. So here we have a really simple SQL statement: select A.id and B.val, do a simple inner join on A and B, and then apply a filter on B.val. This is the query plan the database system would generate: you have the access method scans on A and B, then the filter on B, then the join, and then the projection. With the tuple-at-a-time operator model, you start at the root, at the projection, and this guy iterates over all the tuples that its child operator produces. It calls Next on the join, saying give me the next tuple you have. The join says, I don't have anything, so I'll call Next on my children, and those guys then do the scans on the tables and produce the tuples. So you're starting from the top, going down, and pulling data up. The other alternative is the operator-at-a-time model; I think the HyPer paper refers to this as the materialization model. Basically, rather than starting at the top and going down, you start at the bottom. Each operator executes on its data and generates its complete output. So for A, I scan all the tuples, my output is all the tuples in A, and then I push them up into the operator above me. Same thing for B: it pushes everything up, does the filter, and then it goes up to the join. The advantage of this is that you don't keep calling Next as you go down. Each operator executes all at once, shoves its output up, and you never go back and ask it for more. This is actually a good way to do it for an in-memory database if you're doing OLTP workloads; it's what we implemented when I helped build H-Store and VoltDB. MonetDB actually implemented this as well, which I disagree with because they're doing OLAP queries, and they've tried to work around it in a bunch of different ways. But the advantage is that you don't have to call Next, and you don't have to do all these function pointer lookups. Function pointers are our enemy; we want to reduce them because they mean additional branches and additional lookups. With the operator-at-a-time model, we don't have any of that. We just run each operator once, it spits out its output, and we keep going.
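To make the iterator model concrete, here's a minimal sketch of the Volcano-style Next() interface; the toy types and names are mine, not any particular system's code:

```cpp
#include <optional>
#include <vector>

using Tuple = std::vector<int>;  // toy tuple: just a row of ints

// Pull-based Volcano interface: every operator exposes Next(), which
// returns the next tuple, or nothing once the operator is exhausted.
struct Operator {
  virtual ~Operator() = default;
  virtual std::optional<Tuple> Next() = 0;
};

// A filter pulls from its child and only emits tuples that satisfy
// its predicate. Note the virtual call (a function-pointer lookup)
// for every single tuple -- exactly the overhead discussed above.
struct Filter : Operator {
  Operator *child;
  bool (*pred)(const Tuple &);
  Filter(Operator *c, bool (*p)(const Tuple &)) : child(c), pred(p) {}
  std::optional<Tuple> Next() override {
    while (auto t = child->Next()) {
      if (pred(*t)) return t;
    }
    return std::nullopt;
  }
};
```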
The best of both worlds is the vector-at-a-time model, proposed by the VectorWise guys, a system we'll talk about later in the class. It's like the tuple-at-a-time model in that you start at the top and go down, but instead of getting one tuple every time you call Next, you get a vector of tuples, a batch of tuples. This has the advantage of reducing the number of function calls, like the operator-at-a-time model, but you don't need to materialize an entire table to ship it up to the next guy. We'll see this again when we talk about the HyPer way of doing compilation: they're going to process a tuple at a time, but they want to push things up, like the operator-at-a-time model, rather than making all these function calls going down. That's how they're going to get better cache locality and reduce the number of instructions they have to execute. Now let's look at a more complicated example, also from the HyPer paper. We're going to do a three-way join between A, B, and C: an inner join of A and C, and a nested query in our FROM clause where we do an aggregation on B. We'll cover this later when we talk about query optimization and rewriting, but essentially the optimizer can rewrite the nested inner query into a join operator. That's the most common way, and usually the best way. The alternative is to actually execute the subquery once, put the result into a temp table, and then join against it, but for this query we don't have to do that. So if we take the tuple-at-a-time approach, the pseudocode looks something like this. Now, I always tell my students that whenever they give a talk, whether in a class, at a conference, or at a group meeting, never show code, because everyone's eyes glaze over trying to read it instead of listening to you. I'm going to violate that here; if you're going to talk about code generation and query compilation, you kind of have to show code, so bear with me. With the tuple-at-a-time approach, the Volcano model, we start at the root, where we have this join. We have a for loop where we iterate over all the tuples generated by our child on the left. At the beginning we call left.Next(), and that goes down here, trying to get the next tuple it has. But at this point we don't have any tuples, so it looks at its child and calls Next on that. That goes down here, grabs a single tuple, and pushes it back up. Right? It pulls it up. This approach is fine for a disk-based database system because, again, we don't care about function pointers there. But now that we're in memory, all of these Next calls, all these branches, are problematic. This is slow because we're executing additional instructions to check things as we go along. So this is one big source of overhead we want to avoid, to reduce the number of instructions we're executing and speed up our query.
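Here's a minimal, hedged sketch of that pull-based join driver; it's a toy nested-loop join with my own names, whereas a real plan would likely use a hash join and richer operator state:

```cpp
#include <optional>
#include <vector>

using Tuple = std::vector<int>;

struct Op {
  virtual ~Op() = default;
  virtual std::optional<Tuple> Next() = 0;  // pull one tuple, or nothing
  virtual void Rewind() = 0;                // restart the scan (rescan)
};

// Toy nested-loop join: producing ONE output tuple costs a chain of
// virtual Next() calls all the way down the plan -- the branches and
// function pointers the lecture is complaining about.
struct NestedLoopJoin : Op {
  Op *left, *right;
  std::optional<Tuple> cur;
  NestedLoopJoin(Op *l, Op *r) : left(l), right(r) { cur = left->Next(); }
  std::optional<Tuple> Next() override {
    while (cur) {
      if (auto r = right->Next()) {
        if ((*cur)[0] == (*r)[0]) {      // join predicate: first columns
          Tuple out = *cur;
          out.insert(out.end(), r->begin(), r->end());
          return out;
        }
      } else {                           // right side exhausted:
        right->Rewind();                 // rescan it for the next
        cur = left->Next();              // left tuple
      }
    }
    return std::nullopt;
  }
  void Rewind() override {
    left->Rewind();
    right->Rewind();
    cur = left->Next();
  }
};
```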
The other thing I want to point out is that while we're executing this query, we have all of these predicates that we have to evaluate for every single tuple, to decide whether it satisfies the predicate before we move it up the tree. Let's use the one inside our nested query as an example: B.val = ?, where the question mark is an input parameter, something the client provides at runtime, plus one. So the way a database system represents a WHERE-clause predicate like this is with an expression tree. The expression tree just picks apart the different operators in your expression and lays them out as a tree like this. So if I want to evaluate, for a single tuple, whether b.val, the value of the val attribute for that tuple, equals my input parameter plus one, I have to traverse this tree, evaluate it, and see whether it comes out true. If so, then I know the tuple should be included in my output. The way you traverse this: you start at the equals sign at the root, then go down the left side, where you have an operator expression that says give me the value of the val attribute for this tuple. When you walk through this tree, you're also provided some execution context with information about the tuple you're operating on and other things going on in your transaction and the query that's running. So in this case, we maintain in our execution context a pointer to the tuple we're currently processing. In order to get the val attribute, I also have to look up the schema for this tuple and ask: at what offset is the val attribute? In this case I have two 32-bit integers, so I have to do some arithmetic to know that I need to jump 32 bits over to find the starting point of the val attribute. Once I have that, I know the value will be 1000, because that's what's in this tuple. Then I go back up the tree and come down to the addition operator, same thing. I go down its left side and evaluate the parameter. Here I've also been provided the input parameters the client specified when they invoked the query, so 999 gets substituted for the question mark. Again, same thing: I know how to jump to the offset in my input-parameter array, I get back 999, then go to the other side and get a constant. That one's pretty easy; it's just baked in as one. Now that I have the values of both children of the addition expression, I can move them up: I get 1000, go up to the equals sign, and evaluate whether 1000 equals 1000, which is true. At this point I know this tuple satisfies the predicate, and therefore it should be emitted in the output. This sounds like a lot of work, right? Now if you have a billion tuples in table B, you're doing this a billion times: traversing this tree, making these function calls, looking things up in memory, just to evaluate whether 1000 equals 999 plus one. This is really expensive. And there's a bunch of other stuff going on that I'm not even showing here. I have to check whether the tuple's value is null, because null can't be equal to anything. We'll talk about how to implement nulls later on, but it could be a separate bitmap I have to check, or a specific sentinel value in my range of integers that tells me whether it's null, or a little flag in front of the value. So there's a bunch of extra stuff I have to do at every step of the way just to check whether something's null, and that gets expensive.
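Here's a minimal sketch of that expression-tree interpretation; the types are toy versions of mine, and a real system would also thread a schema, null checks, and transaction state through the context:

```cpp
#include <cstdint>
#include <cstring>
#include <memory>
#include <vector>

// Context passed down the tree: which tuple we are on, plus the
// client-supplied parameter values for the '?' placeholders.
struct ExecContext {
  const char *tuple;
  std::vector<int32_t> params;
};

struct Expr {
  virtual ~Expr() = default;
  virtual int32_t Eval(const ExecContext &ctx) const = 0;
};

// b.val: a schema lookup told us the byte offset of the attribute.
struct ColumnRef : Expr {
  size_t offset;
  explicit ColumnRef(size_t off) : offset(off) {}
  int32_t Eval(const ExecContext &ctx) const override {
    int32_t v;
    std::memcpy(&v, ctx.tuple + offset, sizeof(v));
    return v;
  }
};

// '?': fetch the client-supplied value out of the parameter array.
struct Param : Expr {
  size_t index;
  explicit Param(size_t i) : index(i) {}
  int32_t Eval(const ExecContext &ctx) const override {
    return ctx.params[index];
  }
};

struct Const : Expr {
  int32_t value;
  explicit Const(int32_t v) : value(v) {}
  int32_t Eval(const ExecContext &) const override { return value; }
};

// '+' and '=' nodes; '=' returns 1 or 0.
struct BinaryOp : Expr {
  char op;
  std::unique_ptr<Expr> lhs, rhs;
  BinaryOp(char o, std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
      : op(o), lhs(std::move(l)), rhs(std::move(r)) {}
  int32_t Eval(const ExecContext &ctx) const override {
    int32_t l = lhs->Eval(ctx), r = rhs->Eval(ctx);
    return op == '+' ? l + r : (l == r ? 1 : 0);
  }
};

// b.val = ? + 1  ==>  BinaryOp('=', ColumnRef(4),
//                               BinaryOp('+', Param(0), Const(1)))
// Several virtual calls plus a memcpy per tuple, before any null
// checks -- this is the per-tuple overhead being described.
```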
All of this is what we'd have to do in a general-purpose interpreted engine: I can take any predicate, generate a tree like this, and the system knows how to traverse it and compute the answer you want. But again, this is all really, really expensive when your database fits entirely in memory. The way to solve this is code specialization. The idea is that any time we have a CPU-intensive task that we execute repeatedly inside our database system, something we keep doing over and over again just with different input values, like traversing that expression tree with different input values and different pointers to different tuples every time, we want to compile it down into native executable code and invoke that instead of the slower general-purpose version. That's the main idea of query compilation and code specialization. There are a bunch of things we can specialize beyond operator execution and the predicate evaluation I just showed. We can also use this for our access methods, as we actually scan the tables: say we're doing a scan and we just want one column; we can bake into our specialized code exactly how to jump to the right offset in each row to find the value we want. We haven't talked about this much in this class yet, but you can also compile down stored procedures. Instead of having PL/pgSQL that you reinterpret every single time you invoke a transaction, you could compile that stored procedure into machine code and execute it directly. That's way, way faster. I know Oracle does this, SQL Server can do this, and we're working on adding it in our own system. PL/pgSQL is actually really slow to execute: it basically takes every statement in the procedure and converts it to a SELECT statement. So if you have A := A + 1, it converts that to SELECT A + 1. That's going to be really slow. And the last one here is logging and recovery. As far as I know, this is not something anybody has done yet; it's something we're interested in exploring here at CMU once we fix our regular logging implementation. Again, think of something repetitive that you do over and over: parsing log records on recovery to figure out what changes you need to install into the database. We think we could code-specialize that as well and speed things up. So the benefits of doing this... sorry, yes? [Student: Isn't that not much of a win, since recovery only runs while the database is starting up? How much performance benefit do you get from code specialization for recovery?] So the question is: since the log is only parsed during recovery, and recovery runs only once, when the database starts up, how much benefit would code specialization for recovery actually give you? No, no.
If you do replication, most systems implement it by sending over the write-ahead log, and the replica on the server side is just replaying the log as if it were in recovery mode. So you would want to do this at runtime as well. You're right that there are other bottlenecks in replication and logging: reading from disk is always going to be super slow, going over the network is going to be super slow, so this may not be a huge benefit. But it's something worth exploring, especially as non-volatile memory and other faster disks and networks come out; this may end up being the bottleneck. It's the same argument we're making for in-memory systems in general: when everything was on disk, who cares? But now that everything's in memory, this makes a difference. It's a good point, though. Okay. All right. So what are the benefits of this? Because we know all our attribute types ahead of time, we can convert what would normally be giant switch statements, if it's an int do this, if it's a varchar do that, into direct access on the data in memory. This is one of the advantages of having a relational database system, or really any system with a schema. If you go the NoSQL route and you don't have a schema, it's hard to do this, because the query doesn't know what it's going to read; from one record to the next, it could be different. Because we enforce the schema when we insert or update the data, we know exactly what the types are going to be, we know the length of the data, and we can compile things down to the most efficient way to operate on them. Likewise, we know all our predicates ahead of time, and because we know the types, in many cases we'll be able to convert the evaluation of our predicate expressions down to primitive data-type operations. So instead of building a tree and traversing it to check whether something equals some value, we load the two values into CPU registers and invoke a single compare instruction to see whether they match. That's super, super fast, whereas traversing the tree is really slow and expensive, comparatively. The other thing, which will come up when we talk about HyPer, is that we're going to minimize or eliminate function calls in our loops. This is important because we want to reduce as many branches as possible; we want to get rid of if-clauses, we want to get rid of jumps. If we can inline as much as possible, and you can't inline everything, but you can inline a lot, you allow the compiler to maximize cache reuse and maximize its use of CPU registers. So that's the high level of where we are and what we're talking about. This query compilation and code generation stuff occurs after the query optimizer spits out the physical plan. Again: the client sends a SQL query, we run it through the parser, which gives us an abstract syntax tree. Then we have a binder, where we look in the catalog and do the mapping to say what table name maps to what internal pointer to that table. You guys are doing this in the first project, right? When you bind or register functions in the catalog, that allows the binder to say: oh, I'm invoking a function with this name, what's the internal operator ID for it?
So that's the same thing going on here. Now we have an annotated abstract syntax tree with operator IDs, internal IDs that say here's the actual data structure for the thing you're trying to operate on. We feed that through our query optimizer, which spits out a physical plan, and then our code generator, our query compiler, generates the actual machine code, the native code we use to execute that query. So this is done after we have the physical plan. For today's lecture, I'm not going to care about what the physical plan actually looks like, meaning I don't care whether it's a hash join or a sort-merge join; the total flow is always going to be the same. [Student question, partly inaudible: where do stored procedures fit into this? Is the compilation performed every time you run a stored procedure?] Okay, so your question is: if I'm doing code generation and query compilation for stored procedures, is that done every single time you invoke it? No. Typically you compile it once, throw it in a cache, and every time you invoke it, you just execute it right away. [Student: Can you also cache the output of physical plan compilations?] The question is: can I cache the output of physical plan compilations? Yes, and most systems try to do that, because, as we'll see in a second, compilation time can be a lot, so you try to cache as much as possible. You can actually do basically the same thing without code generation: for prepared statements you would cache the plan. Now the tricky thing, and this is sort of an unsolved problem in query optimization, is that since you don't know the actual parameter values you'll get at runtime when you plan the query and cache it, one input may have very different performance characteristics from another, because it may change the distribution of how you access the data. For our purposes we're going to ignore that for now. But you try to cache as much as possible, and I think we just added this in our own system. Right? Okay. Okay. So the two ways to do code generation are the following. The first one is called transpilation. This is basically where we write code that generates source code, which we then compile using an off-the-shelf compiler like GCC, Clang, or ICC. The other approach is JIT compilation, where we have code that generates an intermediate representation of the query, called IR, and then we can compile that quickly into machine code that we can invoke. We use LLVM to make this happen. I'm going to go through each of these, starting with the first one. So this is from a paper a few years ago from some guys in Edinburgh, for a system they had called HIQUE. This is a classic example of a transpilation database system. Given a query plan, their database system generates C code on the fly that implements exactly the execution of that query. Again: all the predicates, all the expressions, you know how to put them into a direct form you can execute without having to do additional lookups or walk the tree like I showed you before.
And then they just fork and exec GCC on the generated C code, GCC spits out a shared object, you link that into your process so it's in your address space, and then you invoke a function in it. Right? So when you open up the terminal, type a SELECT query, and hit go, it does all these things. So let's walk through an example. They have a simple query: SELECT * FROM A WHERE A.val = ? + 1. The interpreted plan would look something like this: loop over my tuples, grab a tuple at a given offset, evaluate my predicate to see whether it's true, and if so, emit the tuple. This follows the iterator model, the tuple-at-a-time model we talked about at the beginning. This seems pretty simple, right? But there's actually a lot more going on that I'm not showing here. In the case of actually getting the tuple, you have to go get the catalog entry for the table, which you would typically cache, and you calculate the offset of the tuple you're trying to access based on the tuple size. Say we have a fixed-length data pool for all the fixed-length data of the tuples. If my tuple is one kilobyte and I want the 20th tuple, I know how to multiply those two together and jump to the correct offset for the thing I want, and then return a pointer to it. Next, to do the predicate evaluation, I have to traverse that tree structure I showed at the beginning to get all the data I need, check whether it evaluates to true, and if it does, go down this branch and emit the tuple. So what the HIQUE source-to-source compiler generates instead is something that looks sort of like this. The first thing you see is that at the very top we now have placeholder variables for all the things we're going to need. We can bake in ahead of time exactly what the size of the tuple is. We know how to jump to the right offset in our input parameters to find the value for the predicate we want. We know what type it is; there's the offset into the tuple for the value we want, and the offset for the parameter value we're evaluating against. At runtime we fill in these parameters, and this is way more efficient. This is the example I was giving before: instead of having to traverse an expression tree with an equals sign, a parameter, and a tuple value, I can bake this down to a single instruction, does something equal something, and I know how to fill in those operands because I've already computed them. This is way, way faster than having to figure out from the query plan what the type is for every single tuple as you iterate over them. Okay? And the way you build this into a database system is actually kind of nice, because your code generator can emit query code that interacts with any other part of your database. If there's, say, a transaction manager that we want to invoke to keep track of things, we can easily generate code that invokes it directly. There's also a nice advantage that makes this easy to debug: because you have C code, when you crash, you open up GDB and figure out why you failed, and you can add debug symbols and annotations to keep track of things, like which query generated a given line. That's kind of nice.
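As a rough illustration, here's my own toy rendering of what transpiled, query-specific scan code could look like; this is not HIQUE's actual output, and the names and layout are assumptions:

```cpp
#include <cstdint>
#include <cstdio>

// All the values the interpreter would look up per tuple are baked in
// as compile-time constants for this one query.
extern "C" void query_42(const char *table, long num_tuples,
                         const int32_t *params) {
  constexpr long kTupleSize = 8;         // two 32-bit ints per tuple
  constexpr long kValOffset = 4;         // a.val starts 4 bytes in
  const int32_t target = params[0] + 1;  // "? + 1" folded once, up front
  for (long i = 0; i < num_tuples; i++) {
    const char *tuple = table + i * kTupleSize;
    int32_t val = *reinterpret_cast<const int32_t *>(tuple + kValOffset);
    if (val == target) {                 // single compare, no expression tree
      std::printf("match at tuple %ld\n", i);
    }
  }
}
```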
Now, there's a great experiment from this paper that I like to show, where they evaluate their source-to-source compiler, their code generator, against a bunch of other ways to implement query evaluation in your database system. At the very top they have what they call generic iterators; this is where you have an interpretation engine that knows how to operate on any possible query with any possible type. This is essentially the interpreted engine we have in our own system, and it's how most people implement their database system; it's what Postgres does, what MySQL does. It's the standard way of implementing a database system. Then they have an optimized version of that, where instead of generic iterators they have specific ones that are specialized to execute on particular data types. Then they have a generic hard-coded version, written by hand, presumably by a grad student, and an optimized hard-coded version that's like the optimized iterator, with iterators specific to the different types. The way to think about the hard-coded ones is that each is a hard-coded C plan that only knows how to execute that single query; it's not general purpose, and if you have another query, you have to rewrite the program to do something else. And at the bottom they have their HIQUE code-generator engine. The paper is a bit old, so the machine is a bit dated, but I still think the results are relevant for our discussion. What they do is break down the execution time for the queries, measuring how much time is spent on L2 cache misses, memory stalls, and actually executing instructions. In the paper they also show L1 cache misses, but that time is so small that I'm not going to bother showing it; it doesn't affect our discussion, but it's in the original paper. The first thing to point out is that you only see memory stalls in the iterator models and the generic hard-coded version. What's going on is that for predicate evaluation and other things, there are these giant switch statements you have to go through: is my predicate this, is it on this type, now do that evaluation. All of that ends up being extra code; you have jumps into different parts of your process's address space, you're fetching more data than you actually need, and that's causing the memory stalls. Over here, the generated code always has very few memory stalls, so you're basically operating almost directly out of your CPU caches, which is really nice, and exactly what you want.
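To illustrate what those giant switch statements look like, here's a toy generic comparison function of my own, not from the paper; the type tag and the switch run for every value the interpreter touches:

```cpp
#include <cstdint>
#include <cstring>

enum class TypeId { kInt32, kFloat, kVarchar };

// Generic equality an interpreter runs for *every* value: the type
// dispatch and branches are exactly the extra instructions and
// icache/branch pressure that specialized code eliminates.
bool EvalEq(TypeId type, const void *a, const void *b) {
  switch (type) {
    case TypeId::kInt32:
      return *static_cast<const int32_t *>(a) ==
             *static_cast<const int32_t *>(b);
    case TypeId::kFloat:
      return *static_cast<const float *>(a) ==
             *static_cast<const float *>(b);
    case TypeId::kVarchar:
      return std::strcmp(static_cast<const char *>(a),
                         static_cast<const char *>(b)) == 0;
  }
  return false;
}
```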
HIQUE's performance is slightly better than the optimized hard-coded version, and this is because the source-code generator can emit code that is actually better for the CPU, or better for the compiler to optimize, than what we as humans would normally write. It generates code that, if you had to read it as a human, would take you a long time to understand, but it turns out that's the best way to execute it on a modern CPU, and that's why it does slightly better. So what's one obvious problem with source-to-source compilation, or transpilation, the approach they're using here? [Student: compilation time.] Compilation time, exactly right. For every single query, HIQUE forks and execs GCC, feeds in the generated C code, lets it compile, spits out the shared object, and links it in. Think about what's going on: when you start GCC, it has to parse all the command-line parameters, parse configuration files, set up memory, and they're doing this every single time. It's really expensive. In this graph, what they're showing is the compilation time for three TPC-H queries, compiled once with -O0 and once with -O2. I don't know why they didn't include -O3; as far as I know, most production software ships with -O3 turned on if you don't need debug symbols. What you see is that -O2 is almost two to three times slower than -O0, but even -O0 takes pretty long to run. If your query takes 20 milliseconds to run but it takes you 100 milliseconds to compile it, what's the point, right? So this is the big problem they have to deal with. The big observation you get from HIQUE, though, is that the way they organized the generated code for the query plans is tuned to the CPU they're targeting. Rather than having humans write the code in a way we can understand, engineer, and change over time, they said: no one should ever have to debug our generated code once we know the generator is working, so we can emit it in whatever way is most efficient for the hardware. When the first version of the system came out, they had to use caching to avoid this roughly one-second startup cost for every single query; we'll talk about that at the end of the lecture. The other thing about HIQUE is that because they're still using the tuple-at-a-time approach, it doesn't allow for full pipelining, which is what the HyPer paper talks about. [Student: Caching helps only if you execute the same query again, because you need to generate machine code for every new query.] Yes, the statement is that caching only helps if you're executing the same query over and over again. Absolutely. In an OLTP application, the queries are usually invoked from the application code, some PHP code or Java code invoking a function, so it always goes through the same code path and always invokes the same queries, and caching works. If you open up a terminal and type a random query it's never seen before, it may never see it again, so caching doesn't help. So absolutely, yes: for OLTP, a compilation time of hundreds of milliseconds is really bad; for OLAP, depending on the query, it may or may not be bad. Yes? [Student: Are queries with different parameters considered to be the same query?]
So the question is: are queries with input parameters considered to be the same query? Yes. With a prepared statement, you know it's going to be the same thing over and over again. In the case of MemSQL, which we'll see at the end, if you didn't put in the question marks, if the query wasn't parameterized, they would actually extract the constants automatically and convert your query into a parameterized query so they could reuse the compiled plan. All right, so pipelining. A pipeline is basically a segment of our query plan where we can process the same tuple from one operator to the next. The idea is that if you want to maximize cache locality, rather than having all these function calls and conditional branches, or doing the operator-at-a-time model where we materialize all the tuples from one operator before moving to the next, we want a single tuple to go as far up the tree as it can, because that maximizes our cache locality and reduces the number of cache misses. So this query here has four pipelines, and the easiest one to think about is pipeline number two. We scan a tuple of B, we can apply the predicate, and then we have to stop at the aggregation, because we can't compute the count of the tuples until we've seen all the tuples. So we can process a single tuple up to here, then we put it into our hash table for the aggregation, and then we have to stop. That's the pipeline breaker, the pipeline boundary, and then we go back and get the next tuple. Same thing over here: now I want to do a join, say a hash join, and I'm on the build side of the join, so I put the tuple into my hash table, but then I have to stop processing that tuple and go back to get the next one. The idea, again, is that we want pipelines to be as long as possible, taking the same tuple as far up the tree as we can. You couldn't do this with the tuple-at-a-time approach because of all those function pointers and conditionals; that's what's problematic. So, the HyPer guys. I'm crazy about this paper. I realize there's a bunch of LLVM IR in it, which is a bit dense to read, but the reason I had you guys read it is that it's sort of the first one that laid out how to do LLVM compilation for a database system. The crazy thing about it, too, is that Thomas Neumann, who wrote it, is a freak, and I say that in a good way: he supposedly wrote the vast majority of HyPer by himself, and he wrote this paper by himself. It's amazing. And he's got three kids, too; I have zero kids and I don't have time to write papers like that, let alone write all the code. So he's awesome, and we're going to come across HyPer a lot throughout the semester. What this paper shows is that you can compile queries to native code using the LLVM compiler toolkit. It also proposes that if you want to maximize cache locality, you want a push-based model, where you drive a tuple as far up the query plan as you can and avoid branches and function lookups. This is essentially saying: instead of having an operator-centric view of how we execute a query and process data, this is one operator and this is the next, we take a data-centric approach, where we think about how to move data around rather than how to maintain boundaries between the different operators.
So, a quick show of hands: who here has heard of LLVM? Okay. LLVM is a collection of compiler tools that originally came out of UIUC. My understanding is that they were trying to build a replacement C compiler, Clang, and they ended up building this whole framework, this collection of components you can build other, more complicated things on top of. The core component of LLVM is its low-level programming language, its intermediate representation, or IR, which is sort of similar to assembly. You can think of it like the JVM's Java bytecode, except it's not tied to Java. You can generate IR from whatever language you want: you could have a Scala front end that spits out IR, and then compile that with LLVM. In our case, we're using C++. The idea is that you generate this IR, and it can be compiled by the LLVM compiler targeting any ISA, any architecture you want. The same IR we generate can run on ARM or POWER or x86. That's pretty powerful. The key thing to point out, though, and I think he mentions this in the paper, is that we don't need to write our entire database management system in LLVM IR. He has that appendix where he shows all the instructions; I can barely understand it, and it would be hard to find people who could. That's why we have those C++ macros in our system to abstract away the low-level details of actually generating the IR. But the key point is that the LLVM-generated code can make invocations into C++. So we get the same advantage we had with HIQUE: even though we have specialized code for our query, we don't have to reimplement the entire functionality of the database system; we can invoke our regular C++ code that doesn't need to be compiled at runtime. Our indexes are all written in C++, because it would be really hard to build and debug a fast index in LLVM IR, so we have proxy stubs that let us make these calls into C++. You have to do some mangling with the class names to find the right location in memory, but again, that's taken care of for you. So in this model, we take the pipelines we had over here and generate, essentially, pseudocode that looks like this. For the first pipeline, we do a scan on A, we have our predicate, and if it matches, we put the tuple into the hash table that we know we'll need for the later pipeline that does the join. Again, this is just pseudocode, but the other thing to notice is that predicate evaluation, does something equal something, no longer needs the expression tree: we just load the two values into registers and execute a single compare instruction. Then we execute pipeline 2, pipeline 3, and once we have everything, we finally execute pipeline 4. Now, we're not going to talk about parallel execution in this class yet; it will come up later. Some pipelines depend on other pipelines having been executed, but there's nothing to say we couldn't have multiple threads execute a single pipeline at the same time. I could have ten threads all executing pipeline 1, all operating on different segments of the table, each producing its own output, which we combine later. We'll cover how to do parallel execution later; for simplicity, we'll just assume everything is single-threaded for now.
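Here's a minimal, hedged sketch of what that data-centric pipeline could look like as ordinary code; the types and the predicate are toy assumptions of mine, and HyPer actually emits LLVM IR rather than C++:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct TupleA {
  int32_t id;
  int32_t val;
};

// Pipeline 1: scan A, apply the predicate, build the join hash table.
// One tight loop per pipeline: each tuple is carried as far up the
// plan as possible while it is still hot in registers and cache, and
// there are no Next() calls anywhere.
void RunPipeline1(const std::vector<TupleA> &A,
                  std::unordered_multimap<int32_t, TupleA> &join_ht) {
  for (const TupleA &t : A) {
    if (t.val < 100) {            // predicate baked in: single compare
      join_ht.emplace(t.id, t);   // pipeline breaker: materialize here
    }
  }
}
```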
[Student: This looks very similar to the materialization model, where you look at the leaf node, take all the tuples in it, and push them up.] So the statement is that this pseudocode looks a lot like the materialization model, the operator-at-a-time model, where you push all the data up. Yes, but in this case, when I do my scan on B, I'm operating on one tuple at a time; the materialization model would scan everything first. So here I can start emitting tuples out of this join to the join above it without waiting for the whole join to finish. I take a single tuple, I scan out of C, do my probe of the join, and once I know the build side has been populated, I check whether I have a match; if I do, then I can go up. Right? Again, the idea is that rather than taking a tuple, checking whether there's a match, going back for the next tuple, checking whether there's a match, and swapping things in and out of your caches each time, I take a single tuple and ride it all the way up. Right? Okay. For these numbers here, I'll defer to the paper. They take TPC-H and compare against HyPer itself. HyPer actually started out with the HIQUE approach, generating C++ code and compiling it, and then they went and implemented the LLVM engine, so it's kind of nice that within a single system we can compare the LLVM approach against the C++ approach. Then they compare against VectorWise, which we'll talk about later; VectorWise came out of the MonetDB project. Then MonetDB itself, which only does compilation of the predicates, not the whole query plan. And then there's a system called DBX in the paper, which is just Oracle; Oracle doesn't do any compilation at all here. What you see across the board is that LLVM with their push-based model almost always executes the fastest; in some cases VectorWise does a little better or is pretty close. VectorWise, again, is doing compilation, but in a different way, plus it can do vectorized operations, which we'll cover later. All right, and the numbers I showed you before didn't include the compilation cost; that was purely execution. This graph here is not a true comparison, because I'm taking results from the HIQUE paper and comparing them against the HyPer paper's LLVM compilation numbers, so it's not the exact same hardware, but it at least gives you an idea of the ballpark difference. With their LLVM engine, they're compiling queries in roughly less than 40 milliseconds, at least for these queries, because you don't have to fork a process, generate C code, and hand it to GCC. So these numbers look okay, right? Fifteen milliseconds is not bad, especially for an OLAP query. But the problem they found out later is that the compilation time grows super-linearly with how large the query is: it depends on things like the number of joins, the number of predicates, and the number of aggregations. So if you have a really large query, the compilation time is going to explode.
Two examples of where this happens. First, a lot of queries are machine-generated, like a dashboard or an analytics tool where you click a bunch of buttons to export some data, and that generates a SQL query that you then invoke. These queries can get quite large; some friends at Google told me they often see queries that are 10 megabytes in size. That's not 10 megabytes of data; the SQL text itself is 10 megabytes. The other thing that came up, and this was a real problem for HyPer, is that, like us, they support the Postgres wire protocol and the Postgres catalog. When they were getting acquired by Tableau, people would plug in pgAdmin, a web-based administration tool for Postgres. The way pgAdmin works is that when you start it up, it fires off a bunch of queries against the catalog to figure out what tables you have, what types you have, and so forth. And there was one query that came out as something like a 22-table join, and compiling that took about a second in HyPer with LLVM. So you'd point pgAdmin at HyPer and it would stall for a couple of seconds before it actually became available, whereas Postgres, because it doesn't do compilation, just executes the queries, maybe slower than HyPer would, but without paying all that expensive compilation cost up front. As I said before, this isn't that big an issue for OLTP queries, because you execute the same ones over and over again; you can parameterize them and cache the query plan. But it's a major problem for OLAP workloads. The way they solved it is, I think, really clever. I knew about this last year because one of the HyPer guys, Viktor Leis, came and visited us at CMU for a month, and when we talked about this problem they were like, yeah, we solved it a year ago, but we haven't published the paper yet; we just hadn't gotten around to it. So I felt bad, because I would teach the class and say, all right, here's this really sucky problem, and there's a way to solve it, but I can't tell you what it is, at least not on video, because I didn't want them to get scooped before they published. But the paper has now been accepted at ICDE 2018; it hasn't been announced yet, but Viktor sent it to me, so I can tell you what they're doing, which I think is really clever, and it's something we want to do ourselves soon. What they do now is: you take a query and generate the LLVM IR just as you normally would, and you kick off the compilation, which is going to be slow. But while you're waiting for it to compile, you start interpreting that IR, sort of like a virtual machine or the JVM; you start interpreting it and processing data. At some point the compilation finishes, and then you just seamlessly slide in the compiled machine code to replace the interpreted code, and execute the faster version. The way to think about this is: rather than what we have now in our system, an old interpreter engine and an LLVM engine that don't talk to each other, don't have the same semantics, and don't always produce the same output, instead of two separate code bases like that, you have a single path based on the same query execution plan and the same IR.
You interpret the IR you generate; it'll go slow, but it behaves exactly the same as the compiled version. When the compiled version is done, you just slip in the faster version. Or, if the query was super short, the interpreted version finishes right away, and then you cache the compiled plan and reuse it the next time you execute the query. So it looks like this; they call it adaptive execution. A SQL query comes in, and you run it through their query optimizer, which for HyPer takes about two tenths of a millisecond. You feed that into the code generator, which takes about seven tenths of a millisecond, and at this point you have LLVM IR. Now you split off into three different paths. The first has a simple bytecode compiler; this is basically the interpreter, which takes the bytecode and executes it like a virtual machine. Then another thread takes the IR and runs it through an unoptimized compiler; this only takes a few milliseconds, but it won't do all the extra optimizations, like -O3, that you'd get from a regular compile. And then another thread does multiple passes with the LLVM optimizer, takes the optimized IR, runs that through the compiler again, and generates machine code. The idea is that if your query can finish before those compilation stages are done, if you finish interpreting the query before the compiled code is ready, then you're done and you have the answer. Otherwise, you slip in the x86 code you just generated, have it pick up where the interpreter left off, and run really, really fast. I think this is a clever way to do it, and it solves the problem: if I have a really simple query and the compiler takes a long time, I just run the interpreted version, emit the output, and I'm done. But for long-running things, I can still make forward progress while my compiler is running. Yes? [Student: One of the things you talked about is keeping the data in registers. Now you have all this stuff running in the background; doesn't that mess up your caches and registers along the way?] So the question is: I made a big point that we want to maximize the reuse of data in our CPU registers, but now I have other threads compiling in the background; is that going to mess that up? Typically, what you do on a multi-threaded system is pin the execution threads to cores and run all the background tasks on other cores, so there's no interference. That's how you get around that. Yes? [Student: What's the overhead of actually switching out the code? It doesn't seem trivial that you can map the progress of one to the other.] So this question is: how can I slip in the compiled code and know that it's picking up exactly where the interpreted code left off, without missing or repeating work? Think of it this way: the unit of execution for the engine is a block of data; HyPer calls these morsels, and we'll read that paper later on. The interpreted engine says, all right, I'm reading this table, it has ten blocks, so I'll grab the first block and run my interpreter on it. When I finish that block, I go check whether the compiled version is done. It's not, so I grab the next block and interpret that. At some point the compiled version is done and sets a flag, and the interpreter says: oh, I just finished my block, and the compiled version is ready, so I shouldn't grab the next one; I stop, and the compiled code takes over. That's how you avoid that problem.
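Here's a hedged sketch of that block-wise handoff; this is my own toy rendering of the idea from the ICDE 2018 adaptive execution paper, not their actual code:

```cpp
#include <atomic>

struct Morsel { /* a block of a few thousand tuples */ };

// The interpreter processes one morsel at a time; once the background
// compile publishes a function pointer, every remaining morsel runs
// through the compiled machine code instead.
void ExecutePipeline(Morsel *morsels, int num_morsels,
                     void (*interpret)(const Morsel &),
                     std::atomic<void (*)(const Morsel &)> &compiled) {
  for (int i = 0; i < num_morsels; i++) {
    // One flag check per morsel, not per tuple, so the overhead of
    // checking for the compiled version is negligible.
    if (auto fn = compiled.load(std::memory_order_acquire)) {
      fn(morsels[i]);       // compiled code picks up where we left off
    } else {
      interpret(morsels[i]);
    }
  }
}
```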
In the back, yes? [Student: Is the super-linear compilation time only a problem for the LLVM compilation, or is it also a problem for the C++ compilation?] So the question is: is the super-linear growth of compilation time specific to LLVM, or is it also a problem in GCC? It is a problem in both. Yes? [Student: So you start by interpreting the bytecode, then when the unoptimized code is ready, you start executing the unoptimized code, and then the optimized code when that's ready?] Yes. The paper's not officially out yet, but Viktor sent it to us and it's available on the website; it's good reading. Yes? [Student: When you're doing this, are you requiring that the interpreter and the compiler do things in the exact same order?] The question is: am I requiring that the compiled version do things in the exact same order as the interpreted version? The order of what, instructions or operators? [Student: Operators.] Right, so the order of the operators is determined here: the IR encodes the actual execution of the query, so at that level it's going to be the same. Now, when the optimizer takes multiple passes over the IR, it may end up reordering instructions, but the order of the operators will be the same. [Student: Is that necessary, though? Once you have the LLVM IR and run the optimization passes on it, it could reorder things significantly. It will still produce the same output, but couldn't the order in which the tuples are processed be different?] So the question is: could the two versions process tuples in a different order?
All right, let me keep going, because I want to go through other examples of these. So here's a bunch of different systems that are doing this. As you can see, it's not everyone; it's not a lot of them. This code generation stuff has really only come into vogue in recent years. So I'm going to go through a bunch of these at a high level and see what they do.

As with all things in databases, what seems super new and super novel and super cool, IBM did in the 1970s. You're going to hit this multiple times throughout the semester. The System R project was the IBM guys trying to take the relational model paper from Ted Codd and actually implement a real system. They got a bunch of people who all had PhDs, put them in a single room, and every person got their own piece of the system: one person worked on query optimization, one person invented SQL, and one person worked on code generation. They would take SQL statements and generate assembly code by having these pre-generated code templates written in assembly and stitching them together to produce the query plan itself.

They ended up abandoning this when they later put out DB2. IBM never actually released System R; it was never commercially available. But there are different versions of DB2, which we can talk about later, and one of the main ones ended up taking 50% of the code from System R and used that as the basis to start DB2. We know all this because there are some really great papers, like "The History and Evaluation of System R," which is a retrospective of the project, and there was a reunion in the 1990s where they interviewed a bunch of people who worked on System R about the things they did. One of the things that comes up is that this made a big difference back then, but x86 wasn't the dominant CPU ISA yet; there were a ton of different chips. IBM alone had a bunch of different chips that all had different assembly, so you had to support all of them, and that was a big pain. The other big issue was that any time somebody changed another part of the system, say I take my row header and add more data to it, now I have to go back and change all of my code generation code to account for that, which back then was a big pain to actually do. So when IBM put out DB2 they abandoned this approach, and no one really considered it again for 30 years afterwards.
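To make the template idea concrete, here's a toy illustration of stitching pre-generated fragments together. This is my own sketch, not System R's real mechanism; System R patched assembly fragments, and I use C source strings here just so it's readable:

```cpp
#include <map>
#include <string>

// Pre-written code fragments with holes that get patched with
// query-specific values and concatenated into the final program.
std::string EmitComparePredicate(const std::string& column,
                                 const std::string& op,
                                 const std::string& constant) {
  static const std::map<std::string, std::string> kTemplates = {
      {"<", "if (row->$COL < $VAL) emit(row);\n"},
      {"=", "if (row->$COL == $VAL) emit(row);\n"},
  };
  std::string code = kTemplates.at(op);  // throws if op is unsupported
  code.replace(code.find("$COL"), 4, column);
  code.replace(code.find("$VAL"), 4, constant);
  return code;
}
// EmitComparePredicate("salary", "<", "64000") yields:
//   if (row->salary < 64000) emit(row);
```

You can see why changing a row layout was painful: every template that touches `row->` has to be rewritten by hand, once per ISA.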
In Oracle, as far as I can tell from reading their documentation, and I do this every year to see whether things have changed, the only thing they actually compile is PL/SQL from stored procedures. They convert their dialect to C through Pro*C, and then compile that into native code. So when you invoke a stored procedure as a transaction, you actually get native code, but all the predicate evaluation, all of the actual query execution, is done through an interpreted engine.

Now, one cool thing Oracle does is jump past code generation and query compilation entirely: instead of having software generate native code, they put stuff on the CPU itself. Oracle bought Sun, and Sun was making the SPARC chips, so in the M7, and more recently the M8, they ended up putting Oracle database-specific operations directly on the SPARC chips, and the database system can invoke them directly. You don't even need to generate specialized code if everything's in hardware: that covers vectorization, compression, in-memory scans, and evaluating predicates. I think this is kind of cool, and Oracle is in a unique position where they can actually do this. The one Oracle system I know of that can do code generation is the really expensive one, Exadata. They do predicate push-down; it's a shared-disk system, so you have this expensive storage appliance, they push the predicate down to it, and I think that gets compiled. But if you download Oracle on your laptop or a little machine, it doesn't do any compilation as far as I can tell.

We talked about Hekaton before. Hekaton can compile both stored procedures and SQL, and what's really cool is that you can have non-Hekaton queries: SQL Server can run multiple database engines, so the old engine can invoke or request data from the Hekaton engine, and they have these pre-compiled inter-operators that let you do that very efficiently, which I think is kind of cool. What they do is generate C code from the syntax tree of your query, compile that into a DLL, and then link it in at runtime. There's a bunch of extra stuff you have to do when you compile and link things in like this, to make sure that, one, you don't take down the entire system by reading locations in your address space that you shouldn't, and two, that nobody is injecting malicious code. So I think that's kind of novel.

Cloudera Impala is a distributed analytical database designed to run in Hadoop environments. They don't do entire-query compilation; they only compile predicates and expressions, and they also do it for record parsing. Because they run in a Hadoop environment, they're reading files from HDFS that may be generated by various different services, so they do compilation to read things like Parquet files, CSV files, or ORC files really efficiently when they bring them into memory. So that's another example: you can compile accelerators to make specific pieces go faster.
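Mechanically, Hekaton's generate-C-then-link-a-DLL step (and, as we'll see in a moment, early MemSQL's C++ version of it) boils down to the same transpilation recipe. Here's a hedged sketch using the POSIX dlopen/dlsym APIs; the file paths and the "query_func" symbol name are hypothetical, and on Windows the equivalent would be LoadLibrary:

```cpp
#include <cstdio>
#include <cstdlib>
#include <dlfcn.h>

typedef void (*QueryFn)(void* exec_ctx);

QueryFn CompileAndLoad(const char* generated_src_path) {
  char cmd[512];
  // -fPIC -shared: build a position-independent shared library
  // from the source file the code generator just wrote out.
  std::snprintf(cmd, sizeof(cmd), "cc -O2 -fPIC -shared -o /tmp/query.so %s",
                generated_src_path);
  if (std::system(cmd) != 0) return nullptr;        // compilation failed

  void* handle = dlopen("/tmp/query.so", RTLD_NOW); // map into our address space
  if (handle == nullptr) return nullptr;
  // Look up the entry point the code generator emitted into the source.
  return reinterpret_cast<QueryFn>(dlsym(handle, "query_func"));
}
```

This is exactly where the safety concerns above come from: once you dlopen the library, that generated code runs inside your process with full access to your address space.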
Actian Vector is an awesome database system, formerly called VectorWise, and they do something very different from everyone else. Instead of on-the-fly code generation and compilation, they pre-compile a bunch of primitives for all the possible operations you could run in the database system ahead of time, and then they stitch together these pre-compiled primitives into a plan that's optimized for your query, optimized for accessing the data a certain way. That basically gives you almost the same thing as doing on-the-fly code generation.

So it looks something like this. Say you want to scan a table where some attribute is less than some value: they'll have scan-less-than functions. It's basically the same idea; they know the size of the data you're trying to access and how to jump to the right offset to do it. In this case here, the bottom one operates on a double and the top one operates on an integer. They generate everything you could possibly need ahead of time and then just put the pieces together. We've done some benchmarking just recently, and we were able to beat it with our LLVM engine for some things.

VectorWise came out of an academic group in Europe. Actian bought them and then basically killed it: they removed it from their web page and made it difficult to download, which is a real shame, and basically the whole team left and went off to work on Spark or something like that. So it's a shame for the system.
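Here's what those pre-compiled primitives might look like. The scan_lessthan name mirrors the example from the slide, but the signature is my guess at the general shape, one instantiation per type, all compiled ahead of time:

```cpp
#include <cstddef>
#include <cstdint>

// One generic primitive, stamped out once per column type at build time.
template <typename T>
std::size_t scan_lessthan(const T* col, std::size_t n, T val,
                          uint32_t* out_sel) {
  std::size_t found = 0;
  for (std::size_t i = 0; i < n; i++)
    if (col[i] < val) out_sel[found++] = static_cast<uint32_t>(i);
  return found;  // out_sel now holds the selection vector of matches
}

// Instantiated ahead of time; the planner picks one by column type at runtime.
auto* scan_lessthan_int    = &scan_lessthan<int32_t>;
auto* scan_lessthan_double = &scan_lessthan<double>;
```

The design trade-off is clear: you pay zero compilation cost at query time, but the primitives can't bake in query-specific constants the way generated code can.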
All right, I talked about MemSQL before, and I said that when they first came out, one of the founders of MemSQL had been at Microsoft and watched the Hekaton project get developed, although he didn't work on it. So he saw a lot of the early ideas from Hekaton, went off, and basically did the same thing. They do code generation the way Hekaton does it: they generate C++ code, invoke GCC, link in the shared object, and execute that. And the question that came up earlier was, isn't this going to be really slow if you do it every single time you see a query, and can you speed things up by caching the query plan? That's essentially what they would do. If you have a query like a.id = 123, they convert it into a parameterized query, a.id = some placeholder, and when another query comes along with the exact same syntax, just a different input value, they map it to the parameterized cached query plan and execute that.

As far as I could tell from talking with those guys, all they're really doing here is basic string matching. They would extract out the 456 from the second query and then see whether the remaining string matches the cached one. So they couldn't handle things like this: if the WHERE clause were a = something AND b = something, and I come along with another query that's b = something AND a = something, with a and b reversed, they would treat that as a different query, even though semantically we know it's exactly the same. They used to have a blog article on their website with a demo of this: you run the query the first time and it takes one second, then you run it a second time and it takes zero seconds, because of this caching. They've done a good job hiding that article since, so I wasn't able to find it; maybe I should look on archive.org. So this is all before 2016.
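Here's a toy version of that string-matching plan cache, just to pin down the idea. It only strips numeric literals, and CompiledPlan is a placeholder for whatever the code generator produces; MemSQL's real implementation is certainly more involved:

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>

struct CompiledPlan { /* generated code, extracted parameters, ... */ };

// Replace each numeric literal in the query text with '?'.
std::string Parameterize(const std::string& sql) {
  std::string skeleton;
  for (std::size_t i = 0; i < sql.size(); i++) {
    if (std::isdigit(static_cast<unsigned char>(sql[i]))) {
      skeleton += '?';
      while (i + 1 < sql.size() &&
             std::isdigit(static_cast<unsigned char>(sql[i + 1]))) i++;
    } else {
      skeleton += sql[i];
    }
  }
  return skeleton;
}

std::unordered_map<std::string, CompiledPlan> plan_cache;
// "SELECT * FROM A WHERE A.id = 123" and the same text with 456 both map to
// "SELECT * FROM A WHERE A.id = ?", so the second execution hits the cache.
// But "WHERE a = 1 AND b = 2" vs. "WHERE b = 2 AND a = 1" normalize to
// different strings, which is exactly the limitation described above.
```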
And then in 2016, they hired this dude from Facebook who had helped build the HipHop VM, the PHP VM that Facebook uses, and he basically re-architected the entire system to use LLVM. And they do something I think is actually very interesting. Instead of doing what we do, where C++ code goes directly to IR, they have a bunch of different layers of DSLs: a high-level intermediate language that compiles down to a set of opcodes, which you can then compile down further to get LLVM IR. So there's an extra layer in between. The top layer is the MemSQL programming language; it's basically a C++ dialect with MemSQL-specific things in it. You compile that and get MemSQL bytecode, which you can think of as being like JVM bytecode, or, if you've ever looked at SQLite's bytecode and the way it uses opcodes to execute queries, it's sort of the same idea. And then finally they compile that down to LLVM IR and run it as native code.

The reason they do this is engineering. It's really hard to hire somebody to work at the lowest level down here; it's a lot easier to hire a general C++ person, because in that case you can compile the C++ code on your desktop, run it under GDB, and step through it to figure out what's going on. If you crash down at the IR level, you don't end up with a nice backtrace; you end up with assembly, and it's hard to get people to want to do that. So part of the reason they went this route is all the same problems IBM was having with source-to-source compilation. I visited them around 2013, before I came to CMU, and they sort of confided to me that if they had to do it all over again, they probably would not have gone with the code generation approach they used when they first started the company, because it was really, really hard to work on the system and add new features. But later on they got a lot of money, and I think they then did it the right way.

All right, VitesseDB is another system; it's actually an extension to Postgres and Greenplum. They use LLVM, and they allow for inter-query parallelism beyond what regular Postgres can do. This thing looks a lot like HyPer; they're doing a lot of the same techniques, like the push-based processing model, but they don't compile the entire query, only the predicates. There's a bunch of other stuff they do that's kind of interesting; the link here will take you to a YouTube video about how VitesseDB can get 100x faster than Postgres. It's not just compilation: they're also a column store, and they do some other things as well.

Apache Spark added a new engine called Tungsten in 2015. They rely on a feature of Scala that lets them convert the WHERE clause of your SQL query into an expression tree, and then they compile that into JVM bytecode, because Scala runs on the JVM, and they're able to execute it natively. They have a really good blog post that describes this, and a paper at SIGMOD 2015 that covers it along with their new optimizer.

So I want to finish up with our system. As you're aware, we use LLVM, and the goal is to use it for all of our query execution. We're doing it the HyPer way, compiling the entire query plan and not just the predicates. We'll cover this paper later in the semester, but we have a paper that came out in VLDB last year that talks about how we actually execute queries. So it's sort of like there's the VectorWise approach, and then there's the Peloton approach, or the HyPer approach. HyPer doesn't do any vectorization as it executes queries, because it pushes a single tuple all the way up through the query plan. VectorWise makes heavy use of moving vectors around, and that lets you use SIMD instructions to speed up query execution; that's another example of reducing the number of instructions you execute, by using SIMD to process multiple data items at the same time. We'll cover SIMD in more detail later in the semester, so I won't go into too much detail on how it works, but the key thing we used to pull this all together is software prefetching. There are intrinsics, instructions you can invoke, that tell the CPU: I'm about to read this piece of memory, go prefetch it and put it in my cache. What that lets us do is hide the memory stalls you would normally get when doing batched or vectorized execution, the stalls the HyPer guys work so hard to avoid; see the sketch below.
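Here's a minimal sketch of that software-prefetching trick applied to a hash-join probe. The grouping scheme and bucket layout are hypothetical simplifications of what our relaxed-operator-fusion engine does; __builtin_prefetch is the real GCC/Clang intrinsic:

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kGroup = 8;

void ProbeWithPrefetch(const uint64_t* hashes, std::size_t n,
                       void* const* buckets, std::size_t mask) {
  for (std::size_t i = 0; i < n; i += kGroup) {
    // Ask the CPU to start pulling the *next* group's buckets into cache
    // while we do useful work on the current group.
    for (std::size_t j = i + kGroup; j < i + 2 * kGroup && j < n; j++)
      __builtin_prefetch(buckets[hashes[j] & mask], /*rw=*/0, /*locality=*/3);
    // Probe the current group; its cache lines were requested last iteration.
    for (std::size_t j = i; j < i + kGroup && j < n; j++) {
      void* bucket = buckets[hashes[j] & mask];
      (void)bucket;  // ... walk the bucket, compare keys, emit matches ...
    }
  }
}
```

The point is that the random bucket accesses, which would normally each stall on a cache miss, get overlapped with the probing of the previous group.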
So the last graph I'll show you is this one here, comparing the old interpreted engine we have in there now with the new compiled version, with and without this relaxed operator fusion. To fit everything on the plot, we're on a log scale. Just take the first query here, Q1; this shows you how crappy our system was and why we had to switch over to LLVM execution. One query, same hardware, same database, same everything: we take 88 seconds to execute it interpreted, but when we compile with LLVM, we're down to 900 milliseconds. That's a significant drop. That's why this is the future for us, and why anyone building a new database system is going to want to do code generation and query compilation: the performance benefit you can get is quite significant. The little bit extra you get from vectorization is shown in the red bar; that's maybe a 10 to 20% improvement on top. But the big drop is going from the interpreted engine to the LLVM engine, and that's why we're investing all our time in this. Okay? I'll cover this in more detail later in the semester.

All right, so the main takeaway of all this is that query compilation makes a huge difference in performance. But as you're probably aware from the first project, it's non-trivial to implement, and reading the HyPer paper, when he shows all the IR, it makes your eyes bleed; it's like, oh god, this is terrible. But the performance difference is so significant that everyone is trying to do this now. In my opinion, the 2016 version of MemSQL is probably the best way to do this from an engineering standpoint. The Hekaton one is very good as well. The HyPer one is very good. Ours is getting there; it's not as good yet. And again, if you're building a new system from scratch these days, you want to do compilation. Any questions? All right.

So this has all mostly been about OLAP: analytical queries that read lots of data are where compilation really makes a big difference. Now we're going to switch back to where we would normally be in the course, with more transaction processing, more front-end things. On Monday next week, we'll start talking about concurrency control, and we'll spend about a week and a half on different concurrency control protocols. And then from there, we'll talk about building efficient indexes that need to be concurrent as well. Okay? Any questions?