All right, so today we're going to talk about code generation, or query compilation. Query compilation, code generation — you'll see what I mean as we go along; there isn't one specific term that captures exactly everything we're talking about. But it's basically how to take a SQL query, convert it to a query plan, and then generate machine code, or something that can be executed directly, for that query plan instead of having to interpret it.

For the things that are due: project one is due this Sunday. Wan posted a reminder on Piazza that you should do it. Project two was released this weekend; I'll talk more about it in Wednesday's class. That one is basically writing an encyclopedia-style article about a particular database system. I still need to go through the existing entries and come up with a curated list of which ones you can pick, because some of them are already pretty fleshed out — the PostgreSQL one is well written, potentially the same with Redshift and a few others — and I want the ones you do to be germane to the topics in the course. For project three, I'll post the write-up on the website tonight, but the proposals will be in class next Wednesday, on the first. Then there will be status updates, and the final presentation will be during finals week.

Question: so for project three, the proposal is due on the first? Yes — I'll discuss seed ideas later in this lecture. Do you have to give the proposal to me first? Yes, but it's just five minutes of "here's what I think we're going to do" — the plan of attack, so I know you're at least on the right path to get started. Again, the challenge is that we're not building our own database system at Carnegie Mellon anymore, so we don't have a full system that everyone can work on where we just carve off different pieces, like we did in previous years in 721. So everything you do will be on Postgres or DuckDB or really any other system we have, and the projects will mostly be disconnected, or isolated, from each other. We're reading the HyPer papers and a number of related papers. I miss building a system here at CMU, but it's not our main priority right now.

Okay, so we'll cover project three topics at the end of this class, and I'm happy to discuss in my office hours if there are additional things you're interested in exploring and how to get started.

All right, so last class we spent time talking about how to vectorize the basic algorithms in a database system to optimize them by taking advantage of SIMD. That's an example of intra-query parallelism within a single CPU: can we operate on multiple data items in parallel? If we can do this across all the cores running our query at the same time, we can potentially get a pretty significant speedup. Of course, there are challenges in getting the data in and out of the SIMD registers.
So if you read the HyPer paper and a bunch of papers from around 2013 to 2015, you see these two camps in the research literature: vectorization versus just-in-time compilation. As if they're mutually exclusive and you couldn't do both together — you either built a system that did one or a system that did the other. Let me say up front that that's not the case, and maybe that was touched on a little in the HyPer paper. We'll see a comparison of the two approaches next class, and when we talk about relaxed operator fusion from our old system, we'll see you can do SIMD and compilation together. So I don't want you to get the impression that you have to pick one or the other. Pretty much all the modern OLAP systems choose vectorization; some then do partial compilation, just for things like expressions, and some systems do holistic query compilation. In today's class we're focused entirely on how you compile queries, not on the vectorization stuff, and next class we'll compare the two approaches in a single system.

This is just a reminder of why we're doing this: our goal is to make the database system run faster. The three approaches are to reduce the number of instructions we execute, reduce the number of cycles per instruction, and maximize the amount of parallelism across multiple threads and workers. You do this after you've already applied your zone maps, your filters, your indexes to remove as much data as possible. So for the data you do have to access, the question is how to cut down the cost of executing over it.

Today we're mostly going to focus on the first one: we'll use compilation as a technique to reduce the instruction count for all the operators in our database system. Instead of interpreting the query plan and interpreting the data as we access it — think of giant switch statements that say "if my data type is this, do this; if it's that, do that" — we can hard-code an exact program that knows exactly what the data looks like, what size it is, and what we want to do with it. We can do this because SQL is declarative; we know ahead of time exactly what the query wants.

To give an idea of how hard this actually is, these quotes are from a paper Microsoft wrote in 2011 about why they chose to do query compilation, or code specialization, in the Hekaton system they were building. We're not going to cover Hekaton in depth in this class, but it is an in-memory execution engine for SQL Server for fast transactions. They wanted to reduce the execution time of transactions, because then you spend less time holding locks and latches, so they built a code compilation engine for Hekaton. If you do the back-of-the-envelope calculation on reducing instruction count: to get 10x faster, you have to execute 90% fewer instructions. Code specialization can do that. But if you want to get 100x faster — two orders of magnitude — now you have to execute 99% fewer instructions.
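Just to spell out that back-of-the-envelope arithmetic (assuming instruction count is the only thing that changes):

```latex
\text{speedup} \approx \frac{I_{\text{old}}}{I_{\text{new}}}
\;\;\Longrightarrow\;\;
10\times \Rightarrow I_{\text{new}} = 0.10\,I_{\text{old}} \;(\text{90\% fewer}),
\qquad
100\times \Rightarrow I_{\text{new}} = 0.01\,I_{\text{old}} \;(\text{99\% fewer}).
```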
And that's where this gets really, really hard. There's no magic -O100 flag in GCC that compiles everything away. And don't get the impression that reducing the instruction count is the only goal. We also care about the cycles per instruction, and a byproduct of code specialization, in addition to executing fewer instructions, is that we can be smart about what data those instructions access, which helps reduce the cycles per instruction.

So today we'll start with background on what code specialization and generation is. Then we'll talk about transpilation, the code generation approach from the HIQUE paper, which is a precursor to the HyPer paper you read — basically source-to-source compilation: how does my C++ database system generate C++ code for a query? Then the JIT compilation from the HyPer work: the database system, written in C++, generates some intermediate representation like LLVM IR — or, as we'll see with SQLite, their own opcodes or bytecodes — and then compiles or interprets that. Then we'll go through a quick buffet of the notable systems I'm aware of that do this code specialization, and you'll see how some ideas get reused from one system to the next while other systems take a completely different approach. And if we have time at the end, we'll go over some potential project three topics.

So the paper you read on JIT compilation is the one using LLVM. The reason you had to read that paper is that it came out in 2011. It wasn't the first paper in the modern era to do code specialization in a database system, and it's definitely not an easy read — hopefully nobody spent a lot of time on the LLVM IR stuff in the appendix; I should have told you not to read that. But it is the first system doing this LLVM approach, and a bunch of systems did something very similar afterwards. There are also systems based on the JVM that generate bytecode the JVM can interpret and JIT — at a high level, that's the same thing.

All right, so as I said, we want to reduce the number of instructions we execute for queries. The way we achieve this is through code specialization. Vectorization is another approach, and again they're not mutually exclusive. The basic idea is that instead of having general-purpose code in the database system that can handle every possible variation of predicates, data types, and aggregations for any query that could show up, for each query that does show up we generate exactly the execution code needed to do what that query wants. We can do this because it's SQL and everything is declarative — people aren't submitting arbitrary Python programs or arbitrary C code. We know exactly what needs to be executed, and we can generate code that does exactly that. The main benefits come from specializing on the data types and on the operators we have to execute in an expression tree.
This reduces the number of function calls, the number of jumps, the amount of indirection in our code path. The reason this matters so much is that the way people normally write code is often not the best way for the CPU to execute it — I've said this multiple times before, but it's especially true here. People want database systems where the code is modular and reusable, with inheritance and polymorphism, but that's the worst possible thing for a CPU, which wants long sequential runs of instructions without indirection. Code specialization avoids all of this.

As a running example for this class, assume we have a database with three tables: A, B, and C. A and B each have a primary key, and C has foreign key references to A and B that make up its primary key. This will be the running example going forward when we talk about code specialization. Say we have a query with a join of A and C, plus an inner query on B that does some kind of aggregation. A rough approximation of a query plan for this query would look something like this, and assuming we're going with the Volcano model, the iterator model, the code for each of these operators in the query tree looks roughly like this — some kind of pseudo-Python code.

Going from the top to the bottom: the top of the plan calls next(), which goes down one side, which calls next() on the right side, which goes down again — next, next, next. All of these are function calls. And within each operator there are lookup tables that say "my input table has this schema, therefore I'll allocate a hash table of this size and store data that looks like this." When I do comparisons, same thing: giant lookup tables that say "the data on my left side is this type, the data on my right side is that type, here's how to compare two integers, or an integer and a float." If you look at any open-source database system that isn't doing code specialization, you'll see these giant switch statements that say: if the left side is this and the right side is that, here's how to cast to the right type and do the comparison, or whatever the operation is. So there's a lot of general-purpose code just to traverse this tree.

Then the predicates themselves are expensive, because you also have these trees to traverse just to do whatever the predicate says. Here we have b.val = ? + 1. The question mark is a placeholder for a prepared statement — sometimes it's written $1, $2 — and the value is provided at runtime as if it were a function call. An approximate representation of just this one predicate is a tree like this. And as I scan the B table, for every single tuple I have to traverse this tree and produce a result to see whether it evaluates to true. I start at the root, go down the left or right side, and hit an attribute reference.
Now I have to look up the table schema and figure out: for b.val I want the first attribute of this tuple, it's an integer, so I know what size it is, and I can pass that up as the output of this node. Then I go down the other side, see the plus operator, and go down its left side: this node is asking for the parameter that was the input to the prepared statement. So I have to look it up in the query parameter context that's being passed around — for every single tuple. Say that produces 99. Then I come over to the constant, which gives me one, I go back up, and the whole thing evaluates to true. So I'm doing a depth-first traversal of this tree for every single tuple, and every time I go from one operator to the next, it's another function call, which means another jump in my instructions. In this case there's no conditional inside the expression, so the CPU could execute these in order, but if I have a bunch of conjunctions where short-circuiting sometimes skips a branch, I might mispredict. Always think of the extreme: if I have a billion tuples, that's a billion traversals of this tree, just to evaluate this one predicate in the WHERE clause.

The goal of code specialization is to remove all of this. I know what the predicate is, so I want to generate exactly the steps needed to apply my check. At a high level, the idea is to identify the parts of the system that we know are CPU-intensive and that have the same repeated behavior with different inputs, extract those, generate them on the fly or keep a cached version, and stitch that in so it becomes the query execution, rather than traversing trees and doing all these lookups.

The most common target is predicate evaluation — WHERE clauses. Postgres does this, ClickHouse does this. Then HyPer does what's called holistic query compilation, where you compile the entire query, including the predicates themselves. There are also access methods and stored procedures — for stored procedures and UDFs, as far as I know only Oracle does this. Query operators, again, that's the HyPer approach. And logging operations: if I had to replay the log, instead of interpreting each log entry, I could bake a program that knows "here's the update on this table with this format" and accelerate replay — as far as I know nobody does this outside the research literature.

So, going back to predicate evaluation: it's not the exact query I had before, but a really simple example is a WHERE clause of val = 1. I have my tree with the equals node, the attribute, and the constant, and I want to generate a function that takes in some value and returns true if it equals one. This is super high level, but that's the goal: generate machine code that does this so we don't have to traverse the tree.
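To make that concrete, here's a minimal sketch — not any real system's code — contrasting a generic expression-tree interpreter with the specialized function that code generation would produce for a predicate like `val = ? + 1`:

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Interpreted path: a generic expression tree evaluated per tuple.
struct Expr {
  enum class Op { CONST, PARAM, COLUMN, ADD, EQ } op;
  int64_t constant = 0;              // for CONST
  int param_idx = 0;                 // for PARAM
  int col_idx = 0;                   // for COLUMN
  std::unique_ptr<Expr> lhs, rhs;    // for ADD / EQ

  // Depth-first traversal: dispatch + recursion on every node, for every tuple.
  int64_t Eval(const std::vector<int64_t>& tuple,
               const std::vector<int64_t>& params) const {
    switch (op) {
      case Op::CONST:  return constant;
      case Op::PARAM:  return params[param_idx];
      case Op::COLUMN: return tuple[col_idx];
      case Op::ADD:    return lhs->Eval(tuple, params) + rhs->Eval(tuple, params);
      case Op::EQ:     return lhs->Eval(tuple, params) == rhs->Eval(tuple, params);
    }
    return 0;
  }
};

// Specialized path: what code generation would emit for "val = ? + 1".
// No tree, no switch, no recursion -- just one comparison per tuple.
inline bool EvalSpecialized(int64_t val, int64_t param) {
  return val == param + 1;
}
```

The interpreted path pays for recursion and dispatch on every node for every tuple; the specialized path is one comparison the compiler can keep entirely in registers.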
One point to call out is that there aren't really security concerns with what we're doing here, because we're not taking arbitrary code from outside the database system — the database system is the one generating this code. We have to be careful not to clobber memory and break things, but we don't have to worry about some query showing up and us injecting code that does Bitcoin mining or something. The database system generates the code, so we trust ourselves not to write stupid or malicious things. It's fine that we're linking arbitrary C into the system, because we're not going to shoot ourselves in the foot on purpose. We'll see that in systems that let you write stored procedures or UDFs in C, which could be malicious, they run that in a sandbox in a separate process so it can't access the database system's memory — Oracle does that. Postgres lets you write functions in C, but as far as I know they don't run in a sandbox.

All right, so there are two ways to do this. There's transpilation, or source-to-source compilation, and then the JIT compilation from the HyPer paper. The first one: the database system, say it's written in C++, generates C++ code for the query or predicate we're trying to evaluate, then we invoke a conventional compiler to produce a shared object, link that in, and call it. With JIT compilation, the idea is to generate some kind of intermediate representation — LLVM IR, or opcodes or bytecodes that we've defined — and then the database system can either interpret that bytecode or hand it to LLVM or whatever compiler to compile it directly.

Where does this sit in the database system's stack? We do all of this after we get the physical plan from the query optimizer. The SQL query shows up, we parse it into an abstract syntax tree, then we run it through the binder — sometimes called the mapper — where we do lookups in the catalog and map string names to the actual table object IDs, column IDs, and types. Then we have an annotated AST; in some systems this is actually a logical plan, but we can ignore that for now. Then the query optimizer — assuming it's cost-based, with cost estimates and catalog lookups to figure out types — produces the best physical plan it can find, and we take that physical plan and generate the compiled version of it. The optimizer says "you should do a hash join here," so the compiler can say "okay, here's the machine code for that hash join," and then we execute that natively. This is a gross approximation — we'll see later how in HyPer the optimizer spits out the physical plan, one pass generates bytecode, and then in the background they run the LLVM compiler and substitute the result in. The main thing to remember is that we do this after we have the physical plan.
All right, so I'm going to go through one of the first examples of doing transpilation, or source-to-source compilation, and then we'll focus on how HyPer does it — both the original version you read and the newer version. One of the first prototype systems in the modern era — and by modern era I mean the late 2000s, early 2010s — was HIQUE. As is often the case in database systems, IBM actually did this code generation stuff back in the 1970s: they did it first, threw it away in the 80s, and now it's back in vogue because everyone is trying to reduce instruction count.

So HIQUE does source-to-source compilation: a query shows up, and the database system generates a C++ program that implements exactly what that query wants to execute. You bake in all the type handling, the predicates, and so forth. There are no external function lookups or interpretation to figure out what the data type is — it does exactly what you need. If you're operating on 32-bit integers, it emits instructions for 32-bit integers, in C++. Then they use an off-the-shelf compiler — I think GCC — to turn the code into a shared object, link it into the database process, and execute it. The generated program has to have an entry point — like a main function, but not exactly: an entry point with a known function signature and arguments — and the database system knows how to invoke that function in the shared object. We're relying on the ELF format on Linux, or whatever the operating system uses. So it's not like we run the database system and then fork-exec a separate program to run the query; we link it in and call it directly in memory. But for the compiler itself, we do have to fork-exec: GCC runs as an external process. And that's expensive, because what does GCC do when it starts compiling? It reads its own configuration files and headers and parses them to figure out what it can do. So this is a pretty heavyweight way of generating machine code.

Let's see how they generate operator templates. The idea is to convert the query plan into a template of code that we compile, and then substitute at runtime whatever the actual values are for the tuples we're looking at or the parameters that get passed in. Say we have a simple query: SELECT * FROM A WHERE A.val = ? + 1. The interpreted plan is basically what we looked at before — pseudo-Python code where I iterate over all the tuples in my table, and the first step is to get the tuple. What actually happens in that step is a catalog lookup: what's the schema of the tuple I'm looking at? You can cache that, because the catalog is either in the database system itself or it's a remote service like the Hive metastore.
You don't want to do that lookup for every single tuple — you cache it — but you still have to do it at some point. Then you figure out the offset in the table you need to look at based on the tuple size, and you get back a pointer to that tuple. Then to actually evaluate the predicate, you traverse the expression tree we saw before, pull up the values, produce the result, and make a decision based on whether the predicate evaluates to true. I'm doing this for every single tuple — think of the extreme, a billion tuples. The catalog lookups are cached, but this is still very expensive.

Instead, you can generate a program — again, I'm showing Python, but in HIQUE it's C++ — where the values are passed in at runtime: the size of the tuple so you know how to jump to the right offset, the offset of the attribute the predicate evaluates, and the parameter I'm comparing against. In this case I can bake all of this into the code, and when I execute it I'm only filling in these parameters. Instead of an expression tree, it really is just: is the parameter value plus one equal to the thing I'm looking at? That's the earlier example of converting the expression tree into a single line of code that does the comparison. There's still some branching because of the if-clause — and we saw how to do branchless scans before, so you could substitute that in — but the main thing is that I don't have to traverse the expression tree every single time.

The generated code can invoke any other part of the database system. It's as if I were a programmer on the system who added a new file that got compiled together with everything else into the system binary: because it runs in the same address space, it can call any other part of the system. That lets you take advantage of all the built-in functionality the system comes with without writing it from scratch or using some kind of interconnect or IPC to send messages back and forth. If I need to allocate memory in the buffer pool, my generated query code can just call the buffer pool manager; if I need to send things over the network or read something from disk, I can reuse all of that code.

The other advantage of this approach is that debugging is relatively easy — relatively, compared to the LLVM stuff, which is a nightmare; we'll see why in a second. Because it's just C++ code, if there's a bug and it crashes, I can compile it and walk through it in my debugger to figure out what's going on. I have to do a little extra work to keep track of which C++ code in the system generated that C++ code, because fixing the generated program doesn't help me — I want to fix the thing that generated the program. But at least when you debug the crashed query you can understand why, because you can step through the generated source code.
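Here's a rough sketch of what that flow could look like: a "generated" translation unit with a C entry point for our example query, and the engine compiling it with an external g++ invocation and loading it with dlopen/dlsym. Everything here — the file names, the ExecState struct, the callback — is made up for illustration; it's not HIQUE's actual code.

```cpp
// ---- query_42.cpp: what the DBMS might emit for
// ----   SELECT * FROM A WHERE A.val = ? + 1
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <dlfcn.h>

struct ExecState {              // hypothetical handoff struct from the engine
  const std::byte* col_data;    // start of A's data
  size_t  num_tuples;
  size_t  tuple_size;           // tuple width; could also be baked in as a literal
  int64_t param0;               // the '?' value
  void  (*emit)(const std::byte* tuple);   // callback back into the DBMS
};

extern "C" void query_42_entry(ExecState* s) {
  for (size_t i = 0; i < s->num_tuples; i++) {
    const std::byte* tuple = s->col_data + i * s->tuple_size;
    int64_t val = *reinterpret_cast<const int64_t*>(tuple);  // attribute offset 0 baked in
    if (val == s->param0 + 1) {          // the whole expression tree is now one line
      s->emit(tuple);                    // call back into the rest of the system
    }
  }
}

// ---- inside the DBMS: fork/exec GCC, then link the result in and call it.
void RunCompiledQuery(ExecState* state) {
  // Expensive part: spawning the external compiler as a separate process.
  std::system("g++ -O2 -shared -fPIC -o /tmp/query_42.so /tmp/query_42.cpp");

  void* handle = dlopen("/tmp/query_42.so", RTLD_NOW);
  auto entry = reinterpret_cast<void (*)(ExecState*)>(dlsym(handle, "query_42_entry"));
  entry(state);     // runs in the DBMS's own address space
  dlclose(handle);
}
```

The expensive part is the external compiler invocation; the dlopen/dlsym part is cheap and keeps the generated code in the same address space as the rest of the engine.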
And all the nice debugger tooling you have for regular system development is available for that generated C++ code. That's going to be a problem with LLVM when we do just-in-time compilation.

To understand how much better this is — and this is what I like about the paper — they compare a bunch of different variations of how someone could build a database system against their approach, so you see the true performance benefit you get from code generation. HIQUE was an academic prototype, a column store, I think with everything in memory, processing one query at a time. The first variant is a vanilla, bare-bones implementation of a Volcano-style query engine with generic predicate evaluation traversing a tree. Then there's an optimized version of that where, instead of a switch statement on the data type, you have a specialized iterator for integer columns or floating-point columns — at least specialized in some way. Then there's a generic hand-coded version, handwritten but optimized for what the query could actually be — this is what a grad student would write — then an optimized version of that, and then the query-specific code that their system generates.

They show the total execution time, broken down into how much time is spent actually executing instructions versus stalling on memory lookups and stalling on L2 misses. They also measured L1 misses, but those are so small compared to the other components that it's not worth looking at. Obviously the generic one is slow, and this is a simple join query on a small table. This is back around 2008, running on a Core 2 Duo — a pretty obsolete CPU — but the relative difference is what I care about. The main thing to point out is that the grad-student-optimized, handwritten version of the query plan is a little slower than, but basically the same as, what HIQUE could generate. The database system could generate code as good as, if not better than, what a human could write and optimize by hand. All the other variants are slower because they're not specialized — they're traversing trees and so forth.

So this sounds fantastic: I can have my database system generate code as good as an expert engineer writing it by hand. What's the problem? Bingo — compilation time. This GCC step doesn't come for free. The paper doesn't really show the query performance for the two compilation settings, but if you just look at compilation time, compiling without any optimization passes at -O0 is two to three times faster than compiling with -O2. And as I've said before, you don't ship database binaries — or any system software — with -O3, because there's no guarantee things won't end up in a funky state and crash. And in the cloud you might run -O2 or even -O0 builds because you may want to step through with a debugger.
But if you're shipping on-prem, you don't ship with -O0 unless you're trying to debug some funky crash; -O2 is typically what you go with. And with -O2 the queries run faster, but now you're paying — in the case of TPC-H Q10 — 600 milliseconds to compile the query. If the query is going to run for five hours, who cares about 600 milliseconds. But if the query runs for 50 milliseconds and I'm spending 600 milliseconds to compile it, that's not a good trade-off. So that's going to be a problem. The HyPer paper you read has this problem too; we'll see how they solve it afterwards.

As I said before, when HIQUE forks GCC, that's a full process: it has to allocate memory, read its configuration files, do the compilation, and shut itself down. That's not cheap. What we want is to streamline this so we don't need an external program to compile things — we generate everything directly inside the database system. There are other issues too: HIQUE didn't support full pipelining, and we'll see that with the push-based approach. The HyPer paper has two major contributions, the push-based execution and the compilation, and it's hard to separate the two — you have to read both, because they're intertwined; you can't just read the compilation part. This is why people got the impression that you couldn't do vectorization and compilation in a single system: if you read this paper, it seems like you can only do the data-centric operator fusion that Thomas is talking about. But again, that's not the case.

So with HyPer, they compile a query into native code using LLVM. Does everyone know what LLVM is? Anyone not know? All right, I'll skip that. Instead of emitting C++ code, they call C++ macros that spit out LLVM IR, stitch those pieces together, and hand the result to the LLVM compiler to generate x86 machine code, or whatever the platform is. They also do operator fusion within a pipeline, with the goal of keeping data in CPU registers for as long as possible as it moves up the pipeline. They claim in the paper that because they're doing code generation and have exact control over how the LLVM IR allocates data to registers, you're more likely to actually achieve that when the code runs — though you can keep data in registers without the operator fusion they describe. We've talked about that before, but that's the main idea. Go back to the query plan we had before: these are our pipelines. On the left side we scan A, apply the filter, and build the hash table for the join, and then the other pipelines execute after that. The pipelines are delimited by pipeline breakers, and it's not hard to figure out where those are.
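To preview what that fused, push-based program could look like, here's a minimal sketch for our running example. The row layouts, column names, and the particular aggregation are placeholders; the point is that each for-loop is one pipeline and the operators inside it are fused together.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Placeholder row types for the running example tables.
struct RowA { int64_t id;   int64_t val; };
struct RowB { int64_t id;   int64_t val; };
struct RowC { int64_t a_id; int64_t b_id; };

// Each for-loop is one pipeline; a tuple stays in registers from the scan
// up to the pipeline breaker (the hash table it gets materialized into).
void ExecuteQuery(const std::vector<RowA>& A,
                  const std::vector<RowB>& B,
                  const std::vector<RowC>& C,
                  int64_t param) {
  // Pipeline 1: scan A, apply the filter, build the join hash table (breaker).
  std::unordered_map<int64_t, RowA> ht_a;
  for (const RowA& a : A) {
    if (a.val == param + 1) ht_a.emplace(a.id, a);
  }

  // Pipeline 2: scan B, build the aggregation hash table (breaker).
  std::unordered_map<int64_t, int64_t> agg_b;   // b.id -> COUNT(*)
  for (const RowB& b : B) {
    agg_b[b.id] += 1;
  }

  // Pipeline 3: scan C, probe both hash tables, emit the join results.
  for (const RowC& c : C) {
    auto a_it = ht_a.find(c.a_id);
    auto b_it = agg_b.find(c.b_id);
    if (a_it != ht_a.end() && b_it != agg_b.end()) {
      // emit(a_it->second, b_it->second, c);  // hand off to the output
    }
  }
}
```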
This is why you do the code generation after you have the physical plan: the physical plan tells you where the pipeline breakers are, and then you can convert that into the appropriate program. So the push-based, fused program that represents this query looks something like this — each of these for-loops is another pipeline, generated one after the other, and then I call the whole thing as if it were a single function and fire it off. In the first pipeline, it applies the filter and materializes the hash table, which gets used later by the pipeline further down. Same for the next one: it does the aggregation into a hash table that's fed through, and then the last pipeline probes those hash tables and does the join. So this is the push-based operator fusion: starting at the leaf nodes of the query plan, tuples get pushed up as far as they can go in the pipeline.

Okay. The paper then compares their query compilation approach against other well-known systems at the time — this is 2011, so these are state-of-the-art OLAP systems of that era. For HyPer they have two versions: the LLVM version, and a previous version that did something like HIQUE with C++ operators. The main thing to point out is that the red bar is always HyPer with LLVM — I think these numbers include the compilation time — and even then it's still faster. Oracle just gets crushed here because it's a row store and isn't doing any code specialization. VectorWise — we'll see in a second what they do: instead of compiling on the fly, they pre-compile all the predicate primitives when the database binary itself is compiled, and at runtime they just make function calls into them. So it is code specialization, but not just-in-time; it's done when the system is built. And MonetDB didn't do code specialization; they did what SQLite does, generating opcodes for the query plan and interpreting those opcodes.

Of course, HIQUE has the problem that compilation cost dominates a lot of these queries. This isn't exactly an apples-to-apples comparison, because I'm combining results from the HyPer paper, on a newer CPU in 2011 — an early Xeon, I think — with the HIQUE paper that preceded it, which ran on a Core 2 Duo. But the main takeaway is that if you have to fork GCC, the compilation time is super expensive: Q1 goes from 274 milliseconds down to 13. With LLVM everything happens in memory, and you can control which optimization passes you run, so you have much finer-grained control over how long compilation takes. These queries are relatively simple, so 37 milliseconds of compilation is actually a lot in absolute terms, but for this particular query it isn't that much.
Take Q2: HyPer can run it in 125 milliseconds, but the compilation time is 37 milliseconds — that's a sizable percentage. Then again, at scale factor 100, if compilation is 37 milliseconds and the query takes 20 seconds, that's fine. In a subsequent HyPer paper, I think from 2018, they talk about how this compilation cost becomes dominant for queries that don't read much data but are themselves very complex — the compilation cost grows super-linearly. It depends on the number of joins, predicates, and aggregations, and therefore the number of pipelines you generate; each of those contributes to the compilation cost. Again: if a query takes 30 seconds, one second to compile is fine; if a query could run in five milliseconds but takes 50 milliseconds to compile, that's a problem.

I don't know if this is in the paper — it's what they told me — but they hit this problem in particular after HyPer got bought by Tableau and they were trying to be PostgreSQL-compatible. People were hooking up this tool called pgAdmin, which is basically a GUI for looking at your tables and schemas and interacting with a PostgreSQL database instead of using the command line. When pgAdmin starts, the very first thing it does is run a bunch of queries against the catalog, pg_catalog, to find out what tables and columns the database has. Those queries don't read much data, but they have a lot of joins between the different catalog tables, and it would take something like 10 seconds to complete one of them because all the time was spent on compilation. With regular PostgreSQL you start pgAdmin and it immediately boots up and shows your information; hooked up to HyPer it would take 10, 20 seconds to start, there'd be a long pause, and people would ask, "is it broken?" No — it was spending all its time compiling.

For OLTP applications this isn't a big problem, because most OLTP applications execute prepared statements: you see the same query over and over with different input parameters, so you compile it once and you're done. Redshift achieves basically the same thing with a large cache of compiled queries. But in the HyPer world it's always a cold start — nothing is cached, and because they compile the entire query, if you've never seen that exact query before you couldn't have cached it anyway. So this is a big problem for them.

The way they handle it is in that subsequent 2018 paper — which I think won a best paper award at ICDE that year — on what they call adaptive query execution. The idea is: the query shows up, we have the physical plan, and we generate the LLVM IR just as in regular HyPer. But instead of waiting for it to be compiled before running it, we immediately start interpreting the IR while compiling in the background, and when compilation is done, we slide it in and replace the interpreter with the compiled version. They can do this with morsels, because every morsel is a separate task.
So when a worker finishes a morsel and goes to grab the next one, it checks a flag that says whether the compiled query is now available. If it is, it fetches that and uses it in place of the interpreted version. You can do this seamlessly: because you've already broken the data set into smaller chunks, you can drop in the compiled version without breaking anything.

Let's see how this works. A SQL query shows up, and the times I'm showing here are from the paper, showing how much time you spend in the different steps. You spend 0.2 milliseconds in the optimizer generating the physical plan, then 0.7 milliseconds in the code generation step, taking the physical plan and converting it into LLVM IR. At this point, execution can go down three paths. The first path takes the LLVM IR, converts it into a bytecode they developed — a lightweight translation — and then interprets that bytecode. The second path passes the LLVM IR to LLVM without any optimization passes, compiles it to machine code, and swaps that in to replace the bytecode interpreter. The third path runs the more expensive LLVM optimization passes and generates machine code from that. So if my query runs in less than a millisecond, I never even get to the later paths — the bytecode interpretation finishes and we're done.

Now, I don't remember whether the paper says they have their own bytecode or whether they interpret the LLVM IR directly. What they told me back before the pandemic was that they had their own interpreter for the LLVM IR — Thomas wrote it in two weeks because the existing implementations weren't good enough for him, and his two-week version was. But the paper talks about this separate bytecode, and I don't actually know whether they implemented that. In our old Peloton system, we also went directly from LLVM IR to our own interpreter. You don't need to implement all of LLVM's instructions to run queries, because there's a bunch of stuff you never need for SQL statements, so it's not as bad as writing the whole thing. These are just additional compilation paths.

Another benefit of this, not in terms of performance but in terms of software engineering, is debugging: if I need to debug code that's crashing, I can walk through it in the interpreter, because it executes the same logic that the compiled version would. If it crashes because of some malformed instructions I generated above, I can step through the interpreter and figure out what's going on.
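Here's a toy sketch of that morsel-level switch: the worker checks an atomic flag between morsels and flips from the interpreter to the compiled function once the background compile finishes. The types and names are hypothetical, not HyPer's actual code:

```cpp
#include <atomic>
#include <cstddef>

struct Morsel { size_t begin; size_t end; };   // a chunk of the input to process

using PipelineFn = void (*)(const Morsel&);

// Set once by the background compilation thread when LLVM finishes.
std::atomic<PipelineFn> compiled_pipeline{nullptr};

// Fallback path: interpret the generated IR / bytecode for this morsel.
void InterpretPipeline(const Morsel& /*m*/) {
  // ... walk the bytecode for this chunk of tuples ...
}

void WorkerLoop(Morsel (*next_morsel)(bool* done)) {
  bool done = false;
  while (true) {
    Morsel m = next_morsel(&done);     // morsel boundaries are the natural switch points
    if (done) break;
    if (PipelineFn fn = compiled_pipeline.load(std::memory_order_acquire)) {
      fn(m);                           // compiled machine code is ready: use it
    } else {
      InterpretPipeline(m);            // otherwise keep interpreting
    }
  }
}
```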
I still need links back to whatever generated the code, but I can step through the interpreter and debug it without having to look at a core dump from a broken query on the compiled path. That's machine-generated code, so you're not going to get a nice stack trace with function names and so forth — you land in x86 assembly and have to figure out what the hell you're looking at, which is not easy. At least on the interpreter path you end up with a potentially human-readable stack trace.

Question: with this approach, could you turn on -O3 here? I would say I don't think you want to. I think you could, but -O3 is going to reorganize things in ways you wouldn't get with the other two paths, and you could end up in trouble — I think it's too aggressive, it moves things around that it shouldn't. Honestly, I don't know enough about how to make -O3 do exactly what you want, and I don't think Ross knows that either — sorry, I'm looking at you as if you would. All right: just don't ship it with -O3.

Here are performance results on TPC-H, running the entire query under each of the three paths. The bytecode interpreter is always the slowest, no surprise there, but it shows you the marginal benefit you get from running the LLVM-optimized passes. I always like to point out Q2: it's clearly way faster if you run it as native machine code, but you're only getting about a five millisecond improvement from actually compiling it. So if the query runs in 20 milliseconds or so, the interpreter path is probably better, because by the time the compiler hands back a result I've already finished — but the compiled code is still clearly faster than the interpreter.

Question: do we have to do all three of these? If the query runs for one millisecond, like you were saying, you just use the first path. Is it not possible to get a rough estimate beforehand of how long the query will run, and skip the rest? So the question is: how hard is it to know the runtime of a query ahead of time? For simple things like index lookups you would know; for large scans it's harder. The pushback is that the estimate only has to be accurate below roughly 20 milliseconds — that's a small margin, there's only a small set of things you can do in 20 milliseconds, so you wouldn't need cost estimation for very large, complex queries. You're basically asking: could you use a simple heuristic to decide whether you even need to fire off the compilation? I suppose you could, yes. Query progress estimation and query runtime estimation are super hard, old problems in databases.
Heuristics would probably get you most of the way there. I'd also point out, going back to this: with their approach they're compiling the entire query, so for another query to reuse the cached plan it has to have the exact same structure. Redshift, in contrast, breaks the query up into fragments — segments, they call them — and caches those. So a new query may not look like anything I've seen before, but its physical plan can share pieces with plans I have seen, and I can reuse those and stitch them together. In that case you don't even need these extra paths: they always compile, but the compilation cost is super low because they fetch cached versions. Going back to your original question: yes, I think you could use simple heuristics to decide whether you still need the expensive pass. They also talk about how, even if the query completes before the expensive compilation finishes, you cache the result so it can be reused later. But the point I was making is that unless the exact same query — the exact same physical plan — shows up again, you won't get any benefit from that cache. Redshift handles that by breaking plans up.

Okay. So the high-level idea: the query plan shows up, we generate the LLVM IR, and we can either interpret it, do a cheap compilation pass to machine code, or do an expensive one; and if the query is still running by the time the cheap pass finishes, maybe we fire off the expensive one. And this is different from the HIQUE approach, where I generate C++ code and have GCC compile the whole thing.

All right, I want to go through a bunch of systems that implement these approaches. I've loosely categorized them into custom implementations that each do something slightly different; ones based on the JVM, which are all similar in that they generate JVM bytecode directly and let the JVM do whatever just-in-time compilation it wants on that; and then ones based on LLVM. The tombstone means the system is defunct — we've killed two systems here at CMU; not proud of it, but it happens.

As I said at the beginning, IBM invented all of this back in the 1970s with System R. The backstory of System R is fascinating: they took Ted Codd's paper, got a bunch of really smart people in a single room, and said, "hey, go build this." They were all people with brand-new PhDs, some in computer science, some in math, because computer science PhDs weren't as common back then as they are now. Basically each person took one piece of the system and went and built it: two guys built SQL, another built the locking and concurrency control, Selinger built the cost-based query optimizer, and somebody built the code generator — the compiler for the engine. The paper here is a retrospective on the System R project from 1981, so the paper itself is already old, and it was already looking back on System R.
They talk about how they figured out that for a SQL statement you could generate assembly code in IBM System/370 machine language, broken up into small fragments they called segments, stitch those together, and run that instead of interpreting the plan. They did this back then because the machines were super underpowered and had very limited memory, so interpreting a tree like the one I showed before would have been too expensive. Then machines got better, people needed to debug and maintain these things, and the approach went out of vogue — and now we've come back around to redoing what they did, because we're trying to squeeze out the best performance we can on modern hardware. So again: take a SQL statement, take a physical plan, convert it to assembly code, and stitch the fragments together.

But IBM abandoned this in the early 1980s when they started building SQL/DS and DB2. SQL/DS was the first commercial relational database system they put out; DB2 came a few years later. They used bits and pieces of System R, but the thing they didn't carry over was the codegen. In the retrospectives they say it turned out to be expensive to execute these code-generated plans because of all the function calls into the segments, and every time they wanted to port to a new IBM mainframe — a new OS, a new ISA, a new assembly language — they had to rewrite all the codegen code, which was a huge pain for them. And because the generated code accessed the data pages directly and assumed the data was laid out a certain way, every time they changed the page layout they had to go change the codegen code too, because of those baked-in assumptions. From their perspective this was a software engineering nightmare, so they discarded it.

Actian Vector, previously known as VectorWise, came out of the CWI group that built MonetDB — Marcin built it and then went off and co-founded Snowflake. The idea here is that instead of doing just-in-time compilation for predicates, they observe there are only so many primitive predicates you could possibly have: integer less than a constant, integer less than an integer, float greater than a float — only so many combinations. So the database system pre-generates all of these possible evaluation primitives as separate functions, and when you compile the database binary itself, you compile all of these pre-generated predicates in. At runtime, instead of code-generating on the fly, the system says: oh, you want to compare two integers with less-than? I have the function that does that — make a function call to it and use it. Obviously that would be expensive on a per-tuple basis, so you rely on vectorized execution — passing in batches of tuples — to amortize that function-call cost. So it'll look something like this.
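Here's a minimal sketch of what one of those pre-compiled primitives might look like — a selection over a batch of 32-bit integers. The name and signature are illustrative, not Vector's actual internals:

```cpp
#include <cstddef>
#include <cstdint>

// Pre-compiled at DBMS build time: "col < constant" over a vector of int32.
// Fills a selection vector with the qualifying positions and returns the count.
// One function call per batch of ~1000 tuples amortizes the call overhead,
// and the tight loop is easy for the compiler to auto-vectorize.
size_t sel_lessthan_i32(const int32_t* col, size_t n,
                        int32_t constant, uint32_t* sel_out) {
  size_t k = 0;
  for (size_t i = 0; i < n; i++) {
    sel_out[k] = static_cast<uint32_t>(i);
    k += (col[i] < constant);          // branchless append
  }
  return k;
}

// The engine would ship one of these for every (operator, type) combination:
// sel_lessthan_i64, sel_lessthan_f64, sel_equals_i32, ... all baked into the binary.
```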
Say I have a query that wants to compare two integers — is one integer less than another — or compare two doubles. You can see I'm passing in the constant from my predicate and a pointer to a 32-bit integer column, and the function knows how to do that comparison. So you pre-generate everything, compile it into the binary that ships with the database system, and then just stitch these calls together. We're not going to read this paper, but one thing they do in it — which I don't think shipped in the commercial version of VectorWise — is compile different variations of these functions with different compiler flags and then figure out on the fly which variation of the machine code is fastest. As far as I know it never made it into the product, but there are a bunch of other things like that you can do to try to speed this up.

All right, Redshift. As we talked about before, their basic idea is to take a query plan, break it up into fragments — essentially like the segments or pipelines we saw before — and convert each into templated C++ code. But before they compile a fragment, they check: does this fragment already exist in my cache, for this customer, for this local database? Compiling is the expensive part, so if it's already in the cache, reuse it; if not, compile it and put it in the cache. What's really fascinating — and it's a very different view of the world compared to the HyPer folks — is that Amazon will actually reuse compiled segments, these query plan fragments, across the entire fleet of databases. In HyPer everything runs on a single node, so even if another node had a cache of compiled plans, your node doesn't know about it and can't reuse it. But in the cloud, since Amazon controls everything, they can keep a giant catalog of every query plan fragment that's ever existed — a cache of everything. So my query shows up, I've never executed it before, but it has bits and pieces that look very similar to other queries, even queries on other tables from other customers, and I can just reuse their compiled versions.

I think they said the cache hit rate is something like 99.95%. Is that right? You can't say anything — okay. Someone here actually worked on this; I believe it's about an 85% hit rate on the local cache and 99.95% on the global cache. Her internship project was pre-warmed segments: things you pre-compile and push out to the caches, the global one or the local ones. And the related issue is version upgrades: Redshift is a fork of Postgres, so if they upgrade the version, all of these cached query plans have to be regenerated. So every time there was a version bump, there was a job that would pre-generate all the cached query plans they had before and push them out to everyone. Again, I find this fascinating.
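A rough sketch of that lookup logic — fingerprint the plan fragment, try the local cache, then the fleet-wide cache, and only pay for compilation on a miss. All the cache structures and names here are hypothetical, just to illustrate the flow:

```cpp
#include <functional>
#include <string>
#include <unordered_map>

struct CompiledSegment { void* code = nullptr; };   // handle to compiled machine code

// Hypothetical caches: one local to the cluster, one shared across the fleet.
// Modeled as in-process maps here just so the sketch is self-contained.
std::unordered_map<std::string, CompiledSegment> local_cache;
std::unordered_map<std::string, CompiledSegment> global_cache;

CompiledSegment CompileSegment(const std::string& /*templated_cpp*/) {
  // Slow path: emit the templated C++ for this fragment and invoke the compiler.
  return CompiledSegment{};
}

CompiledSegment GetOrCompile(const std::string& templated_cpp) {
  // Fingerprint the plan fragment; identical fragments from different
  // customers/tables hash to the same key, so their compiled code can be shared.
  std::string key = std::to_string(std::hash<std::string>{}(templated_cpp));

  if (auto it = local_cache.find(key); it != local_cache.end())
    return it->second;                              // local hit (~85% per the lecture)
  if (auto it = global_cache.find(key); it != global_cache.end()) {
    local_cache[key] = it->second;                  // fleet-wide hit (~99.95%)
    return it->second;
  }
  CompiledSegment seg = CompileSegment(templated_cpp);   // miss: pay the compile cost
  local_cache[key] = seg;
  global_cache[key] = seg;
  return seg;
}
```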
Again, we're not talking about distributed databases in this class yet, we're still focused on what happens on a single node, but even beyond distributed databases: if you can see all the databases and all the queries, you can leverage that and reuse work across customers. And there isn't anything proprietary in the query plan fragments they're generating. If my table has bank information with a float column, and your table has, I don't know, sensor data with a temperature column that's a float, then a predicate over two floats on my table versus your table compiles down to exactly the same instructions. There's no confidentiality issue, no proprietary information leaking from one customer to the next, because the generated code doesn't contain the data and doesn't care what's actually in it. Anyway, Ippokratis, one of the lead people on Redshift, will come give a talk at the end of the semester. He did his PhD here about a decade ago. So we're not going to read a Redshift paper; he'll cover it when he visits.

All right, Oracle. As far as I know, even in the newer versions, and they have a bunch of these accelerators, I still don't think they're doing any code generation or query compilation for regular queries. The only thing I think they do is for UDFs: if you write a UDF in PL/SQL, which is basically how the SQL standard defines procedural code in functions, they'll convert it into Pro*C, their restricted dialect of C, and then compile that into native code that they can link in. Think of this as a way to keep people from messing around in the address space of the database system: you don't want to take arbitrary C code and have someone start scribbling into memory, because then you could crash the database or corrupt your data. Before that, back when they were still shipping SPARC chips, about a decade ago or so, they also put specific instructions for common database operations like compression, decompression, and memory scans into the hardware itself. So is that code generation? Well, it's even better, because it's in the silicon, right? But they don't ship that anymore; it just doesn't exist.

All right, Hekaton. As I talked about, they were trying to build an in-memory, accelerated engine for transactions, running alongside the regular execution engine inside SQL Server. What they do is compile both stored procedures and SQL into C code, then compile that into a shared object or DLL and link it in at runtime. What was really cool is that one of these compiled queries can mix data from Hekaton's in-memory row store with regular table data in regular SQL Server. And in the paper they talk about the extra work they do when generating the C code to make sure nobody can put buffer overflow garbage inside a SQL statement and cause the generated C code to have security issues. Again, this is OLTP, not OLAP, but it's worth mentioning.
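To get a feel for why compiling to C helps, here is a hedged sketch of the general code specialization idea. This is my own toy example, not actual Hekaton-generated code, and the layout, offsets, and constant are all made up.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Generic path: the engine re-discovers the column type and offset for every tuple.
enum class ColType { INT32, INT64, DOUBLE };
struct ColInfo { ColType type; size_t offset; };

bool eval_generic(const ColInfo &c, const char *tuple, double constant) {
  switch (c.type) {                               // branch taken once per tuple
    case ColType::INT32: { int32_t v; std::memcpy(&v, tuple + c.offset, sizeof(v)); return v < constant; }
    case ColType::INT64: { int64_t v; std::memcpy(&v, tuple + c.offset, sizeof(v)); return v < constant; }
    case ColType::DOUBLE:{ double  v; std::memcpy(&v, tuple + c.offset, sizeof(v)); return v < constant; }
  }
  return false;
}

// Specialized path: what generated code for "WHERE price < 100.0" on a known
// layout might look like. The type, offset, and constant are hard-coded, so the
// per-tuple work collapses into a load and a compare.
bool eval_specialized(const char *tuple) {
  double price;
  std::memcpy(&price, tuple + 16, sizeof(price)); // column offset assumed for illustration
  return price < 100.0;
}
```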
SQLite is probably the most widely used database system out there, and it does code generation, which most people don't know. It's an embedded database system, but it has its own VM, which is pretty fascinating. A query plan shows up, they generate it into their own native opcodes, and then they have their own virtual machine, the Virtual Database Engine, the VDBE, their bytecode engine, that executes the opcodes they generated. So when you call EXPLAIN in SQLite, instead of getting a query plan tree like you would in Postgres or other database systems, you get something that looks like this: here's the address in the program, here's the opcode, here are the inputs, and here's what it's actually doing. If you run SELECT 1+1, they generate a little program and then execute it. What this buys you is portability: all I have to do is make sure the VM's implementation of these opcodes works in whatever environment I'm targeting, a new cell phone, a new CPU, and I don't have to worry about the rest of the system. There's also a file system layer and OS stuff to care about, but in terms of query execution, implement these opcodes and you can port SQLite to anything. We asked him a while ago: could you bake this into an FPGA and get hardware acceleration for it? And he said this stuff changes from one version to the next and is not meant to be backwards compatible, because he wants to hide from the end user that there's even a VM in there. I would argue that if you call EXPLAIN and see this, you're not really hiding it, and to get the actual query plan you have to call EXPLAIN QUERY PLAN instead. But Richard's a smart guy, so I'm sure he's thought this through.

All right, so there's a follow-up to the Hyper paper you read. There's the adaptive execution one I mentioned, but there's also a newer version of their query compilation engine in Umbra. And I always struggle with this: should I have you read the latest paper or the original, seminal one? Maybe in future years I'll have students read this one instead. Basically, they built a new code generation framework called Flying Start in Umbra, which is the successor to Hyper. Instead of generating LLVM IR, they generate their own IR that looks very similar to LLVM's, but then instead of lowering it to a bytecode and interpreting that, they say: screw all that, let me go immediately from the IR directly to x86 assembly. You run the assembler on that, which is super fast and super cheap. And they can still do the adaptive execution thing in the background, compiling the IR through LLVM into better machine code while the query runs, but most of the time you can get by just running the directly generated assembly. This paper is insane, because they talk about needing dead code elimination and all these other passes; they basically built a homegrown compiler inside the database system to generate optimized x86. So not only are they amazing at databases, they're also building a compiler. I don't know how you compete with that.
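To give a flavor of the bytecode VM approach, here is a toy sketch of what "lower the plan to opcodes, then run them through a small interpreter" looks like. These are not SQLite's actual VDBE opcodes, and this is not its EXPLAIN output format, just the shape of the idea.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

enum class Op { LoadConst, Add, ResultRow, Halt };
struct Instr { Op op; int p1; int p2; int p3; };  // opcode plus a few operands

// The whole portability story lives here: make this loop work on a new
// platform and every query program runs unchanged.
void run(const std::vector<Instr> &prog) {
  long reg[16] = {0};
  for (std::size_t pc = 0; pc < prog.size(); pc++) {
    const Instr &in = prog[pc];
    switch (in.op) {
      case Op::LoadConst: reg[in.p1] = in.p2; break;             // reg[p1] = p2
      case Op::Add:       reg[in.p3] = reg[in.p1] + reg[in.p2]; break;
      case Op::ResultRow: std::printf("%ld\n", reg[in.p1]); break;
      case Op::Halt:      return;
    }
  }
}

int main() {
  // Something like "SELECT 1 + 1" lowered to this toy instruction set.
  run({{Op::LoadConst, 0, 1, 0},
       {Op::LoadConst, 1, 1, 0},
       {Op::Add, 0, 1, 2},
       {Op::ResultRow, 2, 0, 0},
       {Op::Halt, 0, 0, 0}});
  return 0;
}
```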
This figure shows compilation time on the x-axis and execution time on the y-axis, and the point is that with the Flying Start approach you get low compilation time and reasonable execution performance because you're generating x86 assembly directly. If you want to squeeze out the absolute best performance, you pay the penalty of compiling with LLVM. So again, this is the state of the art for how you do code generation in a database system: generate x86 directly. My only critique would be that it's a single-node view of the world of databases. With the Redshift approach you don't need any of this, because you have a giant query cache covering basically every fragment you're ever going to execute. If you're trying to run on a single node as fast as possible and you can't reuse work from other queries or from the past, then you want to do something like this. But it's a very small segment of the world that can both build a database system and build a compiler that generates x86 assembly. And that's the Germans.

Okay, so now we're getting to the JVM stuff. Apache Spark was originally written in Scala, and they weren't doing code generation; they did interpretation, just like any traditional system. Then in 2015 they introduced the Tungsten engine, where they convert the WHERE clause, the expression trees, into ASTs, generate JVM bytecode from those, and execute that natively in the JVM. It's then up to the JVM's HotSpot compiler to decide whether to compile that bytecode down to machine code. Later in the semester we'll read the newer paper from Databricks on their Photon engine, where they abandoned this approach because they don't want the JVM anywhere in the hot path of query execution: use the JVM for the front-end stuff, but when you actually execute the query, run C++ code. Another challenge they ran into was that complex queries generated plans so large that they exceeded the JVM's limit on how big dynamically generated code can be, so the query would show up, they'd try to generate code to make it run fast, it would be too big, and they'd have to fall back to the interpreted plan. So with Photon they don't do any code generation, only vectorization, and they use the VectorWise-style tricks of having things pre-compiled ahead of time and stitched together. In terms of other JVM databases doing something similar, there's a bunch: Neo4j, Splice Machine, which was founded by a CMU alum who's on an SCS advisory board or something like that; that one was basically HBase plus Spark stitched together, and it failed. Presto and Trino do something like this as well. It's basically the same approach as Hyper, except instead of generating LLVM IR you're generating JVM bytecode. All right, SingleStore, formerly known as MemSQL, has actually had two iterations of an execution engine that does code generation.
So I'm going to talk about the first one, how they hit the same problems we've been talking about, and how they then rewrote it into a newer version that looks a lot like what Hyper did and what we did in our old system. In the first version of MemSQL, prior to 2016, a query plan shows up, they generate C++ code for it, and then they fork and exec GCC to compile it. Of course, forking GCC is really expensive, so they wanted to reuse cached plans as much as possible, but their caching was quite limited. They would take any query plan that showed up, pull out the constants, replace them to get a parameterized SQL string, and compile that, so that if someone else came along with basically the same SQL query but a different input parameter, they could reuse the compiled plan. The reason they went this route is that one of the MemSQL founders had been at Microsoft working on SQL Server while they were building Hekaton, and saw all the internal talks from the Hekaton team about how they were going to do code generation. But Microsoft was using a more sophisticated approach built on the CLR running inside SQL Server, so this was sort of a poor man's version of what Microsoft was doing. The problem is that if another query shows up written slightly differently, say WHERE ? = a.id instead of WHERE a.id = ?, then unless they do extra normalization, the signature, the fingerprint of the SQL string, won't match, and you can't reuse the cached plan.

So around 2015 or 2016 they hired away the guy who built the HipHop VM at Facebook. Facebook famously runs on PHP, PHP is slow, so this guy built a compiler for it called HipHop. They hired him to build the new compilation engine for MemSQL, now SingleStore. And it looks a lot like the adaptive execution approach that Hyper had, where instead of going directly from the IR to a compiled executable, you have different stages: an interpreter and then a compiled version. A query plan shows up, they generate it into an internal DSL called MPL, the MemSQL Programming Language. Then they convert that DSL into their own custom opcodes, like SQLite does, and they can either interpret those or compile them. I don't have performance numbers to show here, but the improvement was pretty significant. And from a software engineering perspective, only a small top-tier team had to build the actual interpreter and the compiler; everyone else could work at the level of the high-level MPL language. It's like: the JavaScript programmers you don't let touch any of this, the good systems programmers can maybe write the middle layer, and then the guys on cocaine or whatever write the bottom part.
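Going back to that parameterization trick for a second, here is a hedged sketch of the fingerprinting idea: strip the literals out of the SQL text and cache compiled plans on what's left. This is my own toy normalizer, not MemSQL's actual code, and a real one would work on the parse tree rather than the raw string.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>

// Replace numeric and string literals with '?' so that queries that differ only
// in their constants map to the same cache entry.
std::string fingerprint(const std::string &sql) {
  std::string out;
  for (std::size_t i = 0; i < sql.size();) {
    if (std::isdigit(static_cast<unsigned char>(sql[i]))) {
      while (i < sql.size() && std::isdigit(static_cast<unsigned char>(sql[i]))) i++;
      out += '?';                                  // numeric literal -> placeholder
    } else if (sql[i] == '\'') {
      i++;
      while (i < sql.size() && sql[i] != '\'') i++;
      if (i < sql.size()) i++;                     // skip the closing quote
      out += '?';                                  // string literal -> placeholder
    } else {
      out += sql[i++];
    }
  }
  return out;
}

struct CompiledPlan {};                            // stand-in for a compiled plan handle
std::unordered_map<std::string, CompiledPlan> plan_cache;  // keyed on the fingerprint

// fingerprint("SELECT * FROM t WHERE id = 5") and
// fingerprint("SELECT * FROM t WHERE id = 9") produce the same key, so only the
// first query pays the fork-and-exec-the-compiler cost.
```

And as the lecture points out, this only goes so far: write the predicate as WHERE 5 = id instead of WHERE id = 5 and the strings no longer match, which is exactly the normalization problem their second engine sidestepped.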
All right, Postgres added support for just-in-time compilation in 2018. I think they shipped it in version 11, but it wasn't turned on by default, and then in version 12, a year later, it was on by default. And again, the only things they compile are the expressions and the WHERE clauses; they're not doing the holistic query compilation we saw in Hyper and other systems. Basically there are a bunch of flags you can tune in Postgres, things like jit and jit_above_cost, that say: if the optimizer thinks the query is going to cost more than this threshold, go ahead and compile it, otherwise don't. I don't think you can start running the query and then decide after the fact that you want to compile it; the decision has to be made when the query shows up, and the flags let you specify how aggressive it is in making that decision.

Then there's Vitess versus VitesseDB, and I always forget which one is which. One of them, Vitess, is a sharded version of MySQL that came out of YouTube, and the other is VitesseDB, which is a fork of Greenplum. VitesseDB does much more aggressive query compilation than Postgres, and they were doing it before Postgres added the JIT stuff. Basically a query shows up, they identify whether it's something the VitesseDB engine can support, and if so they go down that code path, compile things, and run them more quickly, using a bunch of the techniques we've talked about so far.

All right, so last but not least, I want to quickly talk about the two systems we were building at CMU and the challenges we faced. We ended up throwing away the code, not because of these challenges, but it's useful to see how the other systems we've discussed got around these problems. We were a small, ragtag team of academics; we didn't have the money that Databricks or whatever company has to hire a large team, so we had to make do with what we had. And we only had one German; he went back, and he works on Hyper now in Germany. Anyway, in the first version of Peloton we were doing full compilation: we had C++ macros that generated the IR, and we would compile that and execute it directly. We used the relaxed operator fusion stuff we talked about before, along with the materialization approach: we had these staging buffers where we could store a batch of tuples, then iterate over those tuples in a vectorized manner and use software prefetching to hide the memory stalls. This is the graph I think I showed before: the interpreted engine was always awful, and compilation always made a big difference, but whether you got the full benefit of vectorization depended on the query. For this query here, which is just a scan over lineitem with a bunch of aggregates, you didn't get much benefit from vectorization, but for the other ones it did make a difference. The problem with this approach, as I started saying before, is that the number of students who could work on the compilation side of the engine was pretty small: my PhD student, who now works on Photon at Databricks, the German who showed up, and one or two others, a couple of whom are PhD students at MIT now. It was really hard to have students build out new functionality in the execution engine, because you had to understand LLVM IR, you had to understand compilers, and on top of that you had to understand databases.
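Since relaxed operator fusion keeps coming up, here is a rough C++ sketch of the staging buffer plus software prefetching idea described above. This is my own illustration, not Peloton's actual generated code; the buffer size, the hash table layout, and the prefetch placement are all made up, and __builtin_prefetch is a GCC/Clang builtin.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kStageSize = 1024;   // tuples per staging buffer

// A pipeline fragment that probes a hash table: instead of probing one tuple at
// a time, it stages a batch of bucket slots, issues prefetches for them, and
// then does the actual probes once the cache lines are (hopefully) on the way.
void probe_pipeline(const std::vector<uint64_t> &join_keys,
                    const std::vector<uint64_t> &buckets,   // assumed non-empty
                    std::vector<uint64_t> &out) {
  uint32_t stage[kStageSize];
  const std::size_t n = join_keys.size();
  for (std::size_t base = 0; base < n; base += kStageSize) {
    const std::size_t count = std::min(kStageSize, n - base);
    // Stage 1: compute slots and prefetch, a tight loop that vectorizes well.
    for (std::size_t i = 0; i < count; i++) {
      const uint32_t slot = static_cast<uint32_t>(join_keys[base + i] % buckets.size());
      stage[i] = slot;
      __builtin_prefetch(&buckets[slot]);
    }
    // Stage 2: by the time we loop back around, the buckets should be in cache.
    for (std::size_t i = 0; i < count; i++) {
      out.push_back(buckets[stage[i]]);
    }
  }
}
```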
So when we started over and built NoisePage, we switched to the MemSQL / SingleStore style approach, where we take a query plan and generate our own DSL. If you've poked around in the CMU-DB GitHub repositories, there's something called TPL, named after Terrier, which is the name of my dog; we named the programming language after my dog because we didn't know what else to call it. So we generate this DSL, then compile the DSL into opcodes, and there's an interpreter, so you can step through it and watch how it actually executes those opcodes. The benefit is that the opcodes execute basically the same path through the system as the compiled version does, so if there's a bug in the compiled version, you can at least walk through it in the interpreter and figure out what happened. Here's a really simple example of what it looked like: a query shows up and we generate DSL like this. Again, it's a made-up language specific to this database system, and there were deliberate design choices in the syntax to make compilation and interpretation easier later on. It sort of looks like C, but it's not exactly the same thing. Then we generate this big opcode program, which you can either step through with the interpreter or run through LLVM and compile, just like in Hyper. When we switched to this approach, it increased the number of students who could work on the execution part of the system. We had other challenges too, like not having a good query optimizer, which caused us problems, and that's why we threw away the NoisePage source code; the pandemic didn't help either.

All right, I am way over time. I know we wanted to get to the project three topics; we'll have to pick that up next class, sorry, but the slides are online, so you can start looking at things. The takeaway: query compilation makes a huge difference, but it's not easy to implement, and that's why most systems choose vectorization, and in the cases where they don't do holistic query compilation, they just do predicate compilation or the expression tree compilation stuff. The 2016 MemSQL engine, now SingleStore, is probably the best query compilation implementation out there right now, in terms of both performance and the software engineering side of things. The Umbra stuff is obviously amazing, but actually implementing that is super hard, whereas this approach is hard but within reach; if we could do it here in the US with only one German, then I think other companies could do the same thing. And any new system that comes out now, whether it uses vectorized execution or compilation, has to get the same performance as if it were doing holistic query compilation. There are ways to do that: in the vectorized approach you can pre-compile things and stitch them together, and that gets you most of the way there, okay? All right, so next class we'll go over the project three topics, and the paper you'll be reading is a comparison between the vectorization approach and the compilation approach, so you can understand when one is better than the other. Okay? All right guys, see ya.