Okay, so let's get started. Real quick, the announcement for Project 3: I will send out an email either today or tomorrow about how you guys should pull the latest changes from our master branch, because we have a bunch of fixes that are relevant, I think, for a lot of you. In particular, we have the new Postgres parser in place, so you shouldn't be using the old one; the old one, which was broken, is gone. We now take Postgres output and convert it into our own objects. We hope to have a new catalog for you guys within a couple of days. And there's a bunch of other bugs that we found and have been fixing over the last two weeks. So I'll send an email on how to do this. Essentially, you want to do a rebase, not a pull, and that way any changes you've made get put at the top of the commit log. Then if you ever need to roll back to an earlier version, it's easier, and your changes aren't all intermingled with our changes. Okay.

All right, so for today's lecture, this is probably what I consider one of the most important topics of the semester. Because, as we'll see in a second, the performance gains you can get from doing query compilation or code specialization are actually quite significant. And I would argue that for every database system that comes out from this point going forward, if you're not doing query compilation, then you're just not going to be able to compete with the other systems that are out there, because the performance gains you can get from doing this are a lot. So we'll start off talking about the background of what code specialization and query compilation look like, and then we'll look at two techniques for doing this. The first is called code generation or transpilation, and this is basically having source code that generates more source code, which you then compile. And then we'll talk about how to do the JIT compilation that was in the HyPer paper that you guys read, using LLVM. It is my opinion that this is the better way to do it, and this is what we're actually pursuing in our own system today. And there are not very many systems that do this; as we'll see at the very end, when we talk about the real-world implementations, it's only maybe, I think, five in the world. So we'll talk about what it takes to actually do this.

The reason why this compilation stuff or code specialization stuff is important is because when we start looking at what it's going to take to make our system run faster, we're getting to the point where we're running out of options. We already said that we were going to get rid of disk, we're going to have an in-memory database system, but now that everything's in memory, how are we going to get our system to run faster, to support more complex applications, more complex queries, and larger data sets? There's this great back-of-the-envelope calculation that the Microsoft guys did in the Hekaton paper when they were designing their system and trying to figure out how to get it to run as fast as possible. Again, the original motivation of Hekaton was that they were looking to build a new in-memory database engine, in this case for OLTP, and they needed it to run much faster than what the current SQL Server system could do.
So when Hekaton got started, SQL Server was already 25 years old, and they were looking to say, well, what do we need to do to have the system be around for the next 25 years and still remain competitive? And what they found is that because everything's already in memory, you're not paying any disk penalty anymore. The only way that you're going to really get better throughput and better performance is simply by doing less work, meaning executing fewer instructions to execute the same queries that you had before. So they had this rough calculation that says if you want the system to go 10x faster, then you need to execute 90% fewer instructions than in your original system, since running 10x faster on the same hardware means each query only gets about a tenth of the instruction budget it had before. But to get to 100x faster, you need to execute 99% fewer instructions. That's a lot. There's not going to be some magic flag we can give to GCC, like some -O100, that's going to make these instructions go away. We have to be smarter about how we design our system, and code specialization is the way we're going to do this. They also say that another metric you could use to determine whether you're speeding your system up is not just the number of instructions, but also the instructions per cycle. The idea there is that this takes into consideration cache misses, memory stalls, and things like that, so you try to avoid those things and keep the system running at full capacity or full utilization.

So the only way that we're going to get the kind of reduction that we just talked about here is through code specialization. And the idea of code specialization is that rather than having this sort of general-purpose framework or execution engine in our database system that can run any possible query plan the optimizer gives it, we're instead going to create code that is designed to execute only the thing that it's supposed to execute. Again, we'll see in the example what I mean by this, but instead of getting a query plan where you have to look at it and say, well, what's my predicate I want to evaluate? What attributes of my tuples do I want to touch? What type do they have? Instead of having these giant switch statements decide, oh, is my data this type versus that type, I'm just going to bake exactly that into machine code: I know it's this type and it's this predicate on this data, and you just execute that. There are no if clauses. There are no virtual function table lookups. You just execute the bare minimum you need to compute whatever it is that you want to compute. So this is the main idea of code specialization.

And you may think, all right, this seems obvious. Why doesn't everybody do this? Well, it turns out, as we'll see as we go along in the lecture, part of the reason why existing systems don't do this for the most part is that the code you end up writing for code specialization is a lot harder to debug, a lot harder for humans to understand. And actually, the slowest way to write your query execution engine is the easiest for us as humans to understand. So most of the time, and again, we did this in our own system, we're guilty of this, when you build the first version of your system, you build code that is easy for you to engineer, but as I'll show in the next example, it's actually the worst thing for performance. So in this example, I'm going to use three tables, A, B, and C.
A and B have a primary key and then a value attribute, and C just has foreign key references to both A and B. So the example we're going to use is this query here, which is a three-way join between A, B, and C. This actually comes from the paper you guys read from the HyPer guys. What it's doing is a nested subquery on B to compute some kind of aggregate, and then in the outer query you do a join of A and B with C. We didn't really discuss this in this class, but typically the way you rewrite nested queries in your optimizer is that you just rewrite them as joins. So this nested query essentially becomes a join on B and C and A. Another way is you can extract this subquery out, execute it first, put it into a temp table, and then do another join against that temp table in the original query. But for our purposes, it doesn't matter. So in this case I'm showing you how to do this one single query using joins.

So this is the relational algebra plan, the logical plan, to execute this query. For the purposes of this discussion we don't care about the physical plan, because we're not talking about how you actually implement the different algorithms to do joins and other things; we can use all the same techniques that we talked about before. So we only need to have this discussion at a logical level. The way this would actually be implemented in a system, in a Volcano iterator-style system like we teach in an intro class, is that for all of these operators there are essentially these for loops that iterate over the input tuples for the operator and then emit them to the next guy up in the tree. So again, the way the Volcano iterator model works is you start at the top and you literally just start executing this code, right? In the first operator, the first line calls get next on its left child to go get some tuple that it needs before it can do any processing. So this is the function down here, and it says, when I start executing, I have a single child and I'll call get next on that, and that's the iterator or the scan on A. And then this thing just takes the tuples and shoves them up to this filter. The filter evaluates the predicate, and if it matches, then we pass it up to here. As a human, this is sort of easy to understand because you're traversing this tree, you're calling get next, get next, get next on all your child operators, and you're passing tuples up the plan, and then once you have enough data, whatever it is you need, you can then shove it up to your parent operator.

So again, normally I always tell my students when they give a presentation to never show code, because people's eyes glaze over when they see a bunch of code. But for this lecture, we kind of have to, because to talk about code generation you kind of need to show code. So it turns out this approach here is pretty much how every database system you can think of implements it: MySQL, Oracle, DB2, Postgres, this is essentially what they do. We'll talk more about this on Thursday. We talked a little bit before about systems like VectorWise and Peloton, where we don't pass around single tuples. We pass around blocks of tuples or vectors.
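To make the shape of that iterator-style code concrete, here is a minimal sketch of what Volcano-style operators might look like. This is illustrative only: the class and method names (Operator, Next, SeqScan, Filter) are made up for this example, not taken from any particular system.

```cpp
#include <memory>
#include <optional>
#include <vector>

using Tuple = std::vector<int>;  // toy tuple: just a vector of column values

// Every operator exposes the same pull-based interface.
struct Operator {
  virtual ~Operator() = default;
  // Returns the next tuple, or nothing when the input is exhausted.
  virtual std::optional<Tuple> Next() = 0;
};

// Leaf operator: scans an in-memory table one tuple at a time.
struct SeqScan : Operator {
  const std::vector<Tuple>& table;
  size_t pos = 0;
  explicit SeqScan(const std::vector<Tuple>& t) : table(t) {}
  std::optional<Tuple> Next() override {
    if (pos == table.size()) return std::nullopt;
    return table[pos++];
  }
};

// Filter: pulls from its child and only emits tuples that pass the predicate.
struct Filter : Operator {
  std::unique_ptr<Operator> child;
  bool (*predicate)(const Tuple&);  // one indirect call per tuple
  Filter(std::unique_ptr<Operator> c, bool (*p)(const Tuple&))
      : child(std::move(c)), predicate(p) {}
  std::optional<Tuple> Next() override {
    while (auto t = child->Next()) {  // virtual call per tuple per operator
      if (predicate(*t)) return t;
    }
    return std::nullopt;
  }
};
```

Every tuple that flows up the plan pays for at least one virtual Next() call per operator plus the predicate call, which is exactly the per-tuple overhead this part of the lecture is complaining about.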
But for the most part, it's always starting at the top, going down to the bottom, and pulling things up. So the reason why this is actually the worst thing for your CPU is because you have essentially no locality for these tuples. You're making a function call here, and that becomes a branching operation in the actual instructions, and if the CPU predicts incorrectly, it'll flush your instruction pipeline and fill it back up as it moves to the next thing. And all of these are just function calls where we're jumping to some new location in memory, allocating space on the stack, and moving things around. Right? So this is bad. Having to do all these function calls, having to pass around a single tuple one at a time, is bad for locality, because as this thing evaluates the predicate and maybe emits the tuple, we may not do any more work with it until we get back up to here, right? So at a high level, executing the query plan with the Volcano model is bad for modern CPUs.

The other problem we're going to have is how we evaluate the predicate. So in this case, let's just look at this one example: say B.val equals some input parameter (because this is a prepared statement) plus one, some constant. The way a traditional database system would represent this predicate is through a syntax tree like this. At the top you have the equality predicate, and it has left and right children. On the left side it says, give me the value of the val attribute for the current tuple I'm looking at. On the right side, it wants to do an addition of this operand here and that operand there. So you're essentially going to evaluate the predicate by traversing this tree for every single tuple you process as you scan the table. And to make this work, when we invoke the evaluation function on this tree, we also have to pass in some execution context information that says what tuple we're currently looking at, what input parameters were passed in to the prepared statement by the application, as well as the schema of the table we're looking at, so that if we want to look up an attribute, we know where to go find the offset that has the thing we're looking for.

So again, for every single tuple, we start at the root of the tree, go down this side here, and this says, I want this attribute called val, give me the current value. So I have to go look up in my current tuple at some offset as defined in the table schema. As I make a function call down into here, I then may have to do another function call lookup to say, well, how do I get to this offset and what should the type be? All just to produce the one value that I need. Then I traverse back up to the equality and go down the right side, and same thing: I jump to this operator, and it says give me the parameter at position zero, so I look at my context information to get that. And then in this case here, it's a constant one, so it always puts out one. That's easy. But then we pass these up the tree, we produce the sum here, and then we finally do our evaluation of the two sides to see whether this evaluates to true. So again, we're doing this traversal of the tree for every single tuple that we have to look at.
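Here is a hedged sketch of what that interpreted expression tree might look like; names like ExprNode, EvalContext, and the specific members are invented for illustration, not taken from any real engine.

```cpp
#include <memory>
#include <vector>

// Runtime state that has to be threaded through every Evaluate() call:
// the current tuple, the prepared-statement parameters, and (implicitly)
// enough schema information to find an attribute's offset.
struct EvalContext {
  const std::vector<long>& tuple;    // current tuple's attribute values
  const std::vector<long>& params;   // parameters bound by the application
};

struct ExprNode {
  virtual ~ExprNode() = default;
  virtual long Evaluate(const EvalContext& ctx) const = 0;  // virtual call per node, per tuple
};

struct ColumnRef : ExprNode {          // "B.val"
  size_t offset;                       // looked up in the table schema
  long Evaluate(const EvalContext& ctx) const override { return ctx.tuple[offset]; }
};

struct ParamRef : ExprNode {           // the prepared-statement parameter
  size_t index;
  long Evaluate(const EvalContext& ctx) const override { return ctx.params[index]; }
};

struct Constant : ExprNode {           // "1"
  long value;
  long Evaluate(const EvalContext&) const override { return value; }
};

struct BinaryOp : ExprNode {           // "=", "+"
  enum class Op { kEq, kAdd } op;
  std::unique_ptr<ExprNode> left, right;
  long Evaluate(const EvalContext& ctx) const override {
    long l = left->Evaluate(ctx), r = right->Evaluate(ctx);
    return op == Op::kEq ? (l == r) : (l + r);
  }
};

// Evaluating "B.val = $1 + 1" for one tuple walks this whole tree: roughly
// four or five virtual calls just to produce a single boolean, repeated for
// every tuple in the scan.
```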
You can be smart about certain things: if I know there's an AND conjunction and one side evaluates to false, I know I don't need to go down the other side, but in general you always have to go through and look at everything. So in this case here, to evaluate one tuple, I have to do four function lookups at least, depending on how I represent this context information. For every single tuple, I'm doing four function lookups just to evaluate this one predicate. And so this is another reason why this is very expensive to do. Not only do we have to pay all these function call costs as we evaluate the entire plan going down, because all these emits and all these get nexts are function calls, but for every single tuple we also have to do all these function calls in here to evaluate the predicates.

So this is the problem that code specialization is trying to solve. Rather than having this sort of generalized execution engine to evaluate the predicates and evaluate the queries, we want to bake into native machine code exactly the evaluation, or whatever it is the database needs to do, just for that one query. So we don't have to do any switch statement lookups to say what kind of type I'm looking at or what kind of operator I have in my expression tree; we don't have to do anything like: jump, the data I'm looking for is at this offset and it's this type, cast it that way before I pass it along to the next operator. Everything we can know ahead of time, because we know what the query is going to do. Because it's a declarative language like SQL, we know exactly what data it wants to touch, we know what it wants to do with that data, and we know what kind of answer we need to produce. So we can know all of this a priori, and then we can have code that's specialized to execute the thing that we want to do without worrying about any conditionals.

We can do code specialization for access methods: when doing a sequential scan as I showed before, we know how to jump to the different offsets that we need to get the data we want, we know what the type is going to be. We won't talk about this too much in this lecture, but in some systems you can actually do compilation of stored procedures, so instead of interpreting PL/pgSQL, you can just generate machine code that executes that stuff directly. We can do operator execution, and that's again doing the joins and other things. I showed how to do predicate evaluation. And then the last one here, logging operations: I don't think anybody actually does this, and this is something that we're interested in pursuing here at CMU. The way to think about all these things is that any time you have to make a decision about, you know, do I go down one code path in my if statement or switch statement, if it's this type versus that type, or if I have to parse anything, then code specialization can make a big difference. If you know exactly what the data is going to look like, then you don't have to worry about making runtime decisions. You bake everything in ahead of time and then sort of set it and forget it. So this is the core idea of code specialization, and this is what we're going to focus on in the context of query compilation. I sort of alluded to all these things before. The benefit we get out of this is that we don't have to do any function calls based on what type we're looking at. We'll know everything ahead of time, and we can convert everything to inlined execution.
We know what the predicate is going to be, and that means we don't have to traverse that tree anymore. We can just write the machine code to do exactly that predicate evaluation: check whether the value at this offset equals this parameter plus one. Then in our loops, we can try to avoid having function calls, and this is going to allow us, at least in the paper that you guys read from HyPer, to keep tuples in CPU registers, which is the fastest memory you can have, and go through the pipeline and execute everything really, really fast without ever having to write anything out to the CPU caches or even out to memory.

So just to ground everyone in what we're talking about here: this is the same pipeline that I showed a few weeks ago when we talked about query optimization and query planning, and the piece that I'm adding is right here, the compiler. This is after the query plan comes out of the optimizer. We then want to convert it into either LLVM IR or C++ code or C code, depending on how we're doing our code generation or compilation scheme. So we take the physical plan from the optimizer, feed it into the compiler, and the compiler generates native machine code that we can execute directly inside our system. It's the same plan that we talked about before; now we're going to convert it into something that's much faster, and we're no longer going to interpret the plan. We just execute it directly.

There are essentially two ways to do code generation. The first approach is what's called transpilation. It's also called source-to-source compilation, and the idea here is that we have code in our database system that knows how to take the query plan that the optimizer generates and convert it into C or C++ code, and then we take that C++ code and run it through our favorite compiler, and that generates native code. You run it through GCC, it spits out native code, we link in that shared object, and then we invoke some function on that shared object. So that's one way to do it, and we'll discuss that first. And then the approach that you guys read about in the HyPer paper is to do JIT compilation, where instead of generating C++ code, we generate an intermediate representation, an IR, that we can efficiently compile into native code using something like LLVM. You can also do this in Java: you can generate bytecode and feed that into the JVM to have it executed; or on Windows, you can generate bytecode for the CLR. It's the same idea. So we'll go through both of these one by one and discuss the different tradeoffs of each.

One of the optional papers that you guys didn't read, that was listed on the website, was this prototype system called HIQUE. HIQUE was doing the transpilation stuff, the source-to-source compilation, where, again, given a query plan, they'd generate C++ code, and in that C++ code, since they know what data you're trying to access and what type it has, they just bake all of that into the execution code. Then they link that in, and when you invoke the query, you just invoke the run function on that shared object, the thing that you generated, and it produces the same kind of answer, in the same format, that your interpreted engine would have produced.
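Just to make the mechanics of the transpilation workflow concrete, here is a rough sketch: write out the generated C++, shell out to an off-the-shelf compiler, then dlopen the result. The file paths, the entry-point name run_query, and the exact compiler flags are assumptions for illustration, not what HIQUE or any real system actually uses; error handling is omitted to keep it short.

```cpp
#include <cstdlib>
#include <dlfcn.h>
#include <fstream>
#include <string>

// Signature of the generated query's entry point (assumed for this sketch).
using QueryFn = void (*)(void* exec_ctx);

QueryFn CompileAndLoad(const std::string& generated_cpp) {
  // 1. Dump the code-generated C++ for this query plan to a temp file.
  std::ofstream("/tmp/q1.cpp") << generated_cpp;

  // 2. Fork an external compiler; this is the expensive step measured later
  //    in the lecture (hundreds of milliseconds with optimizations enabled).
  std::system("g++ -O2 -shared -fPIC /tmp/q1.cpp -o /tmp/q1.so");

  // 3. Link the shared object into the running DBMS process and grab the
  //    query's entry point.
  void* handle = dlopen("/tmp/q1.so", RTLD_NOW);
  return reinterpret_cast<QueryFn>(dlsym(handle, "run_query"));
}
```

The generated source would define its entry point as `extern "C"` so that dlsym can find it without C++ name mangling.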
So again, for this, they're using an off-the-shelf compiler, GCC, and the upper levels of the system don't know that they're using a code-generated execution engine. They just get the tuples back in the same format they expect. So let's go through what this looks like. Say we have a basic query here: select star from A where A.val equals some input parameter plus one. The interpreted plan, the example that I showed with Volcano before, would look like this: you do a range scan over the table for some number of tuples, and for every tuple you get back from the table, you evaluate this predicate, and if it evaluates to true, you emit it to the next operator in the tree. And all the things we're going to try to avoid in our generated C++ code: in the case of getting the tuple, we don't want to worry about having to look up in the catalog to figure out what the type is. We don't have to look up the schema to figure out where to find the different attributes that I need in this tuple. And we don't even have to worry about getting a pointer to the tuple. If everything's in memory, then all we have to do is basic arithmetic to say: I'm at this offset in the table, I know the size of the tuple I'm looking at, because they're all fixed size in the fixed-size pool or heap for the table, so I know how to jump exactly to the tuple that I want. And then the other problem we avoid is in the predicate: we don't want to traverse that tree like I showed before, and we don't do any work looking into the execution context to decide where the parameter is that I need to fill in that value, and I don't need to cast anything to the appropriate type.

So the templated plan, the C++ plan that we would generate, essentially looks like this. At a high level, it's doing the exact same thing: it's just scanning over the tuples and emitting any tuple that matches. But the key difference here is that we have these constants that we can pre-compute ahead of time based on what table we're accessing, and that we then fill in to do whatever calculation or arithmetic finds the tuple that we're looking for. So in the case of before, when we said get tuple, we had to know how to jump to the right offset based on the size of the tuple. We don't need to do that anymore. Now we just take the starting address of the table and add the offset we're looking at multiplied by the size of the tuple, and that jumps us to the right location. For this, I'm assuming we're using the row store or NSM storage format, but the basic idea applies the same for a column store. And then the same thing for evaluating the predicate. We know what offset we're looking at in our input parameters for the prepared statement, so we just jump to that location to get what we want, and then we just evaluate it to produce the value that we're looking for. So again, we don't have to call any function that traverses the tree. In this case here, this code does: value equals some parameter plus one. That's exactly how it was written in the where clause of the query, so we don't have to do any additional lookups to see whether this evaluates to true. It's literally just the one instruction to say, does this value in this register equal that value in that register? And that's really, really fast.
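Here is a minimal sketch of what the query-specific code for "select * from A where A.val = $1 + 1" might look like once the offsets and types are baked in. The constants, struct layout, and names are invented for illustration and are not what HIQUE actually emits.

```cpp
#include <cstddef>
#include <cstdint>

// Everything below is known at code-generation time for this one query:
// the table's tuple size, the byte offset of A.val, and the predicate shape.
constexpr size_t kTupleSize = 16;   // fixed-size tuples in a row store (NSM)
constexpr size_t kValOffset = 8;    // byte offset of the val attribute

extern "C" void run_query(const char* table_base, size_t num_tuples,
                          const int64_t* params, void (*emit)(const char*)) {
  const int64_t target = params[0] + 1;   // "$1 + 1", computed once
  for (size_t i = 0; i < num_tuples; i++) {
    // Plain pointer arithmetic instead of a getTuple() call:
    const char* tuple = table_base + i * kTupleSize;
    // The predicate is a single inlined comparison, no expression tree walk:
    const int64_t val = *reinterpret_cast<const int64_t*>(tuple + kValOffset);
    if (val == target) emit(tuple);       // pass matches to the parent operator
  }
}
```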
No longer do we do function lookups, no longer do we have to pass around results of different operators in our expression tree. It's just one instruction that produces the output. So this is why this is going to be really, really fast: we're just avoiding all this extra work that we would have in a general-purpose system. Okay, so one important thing to point out about this is... yes, your question? So the question is: in the compiled version of the query plan, because I no longer have these extra function lookups, does that help with branch prediction? Correct, yes. You're less likely to have branch mispredictions. We'll talk more about the branch prediction stuff on Thursday, because you still have an if clause here, right? You can't avoid that entirely here; there are other ways to avoid it, and again, we'll talk about that on Thursday. But yeah, the predicate one's probably the easiest one to understand, because you're traversing this tree, right? And it may be the case that, if you have a conjunction, you don't have to go down one side of the tree based on what the other side actually produced. But the CPU is probably not going to be able to figure that out, because it's going to be different for every single tuple. So yes, you absolutely have fewer branch mispredictions, because everything's inlined together; there are fewer if clauses. And also, when you do a function call in C++, if you have inheritance in your objects, you have to do a lookup on the virtual function table to find the actual function you want to invoke. So there are actually two lookups per function call, right? We can avoid all of that here.

One important thing to point out about this code generation stuff is that the generated code is still able to invoke the other parts of the system that aren't using code specialization. So in our case, in Peloton, we have our indexes written in C++, we have our concurrency control manager written in C++. We don't want to rewrite all that into LLVM IR and bake it into our query plan. It's sort of obvious in the case of HIQUE, because you're generating C++ code and of course it can invoke other C++ code. But in the case of LLVM IR, and you saw this in the HyPer paper, you can still invoke the other parts of the system that are written in C++. The benefit of this is that for those parts of the system, code specialization is unlikely to provide much benefit anyway. So rather than writing them in a more arcane language that's hard to debug, you can just have your generated code invoke the regular code that you still have written in your regular language. That makes it easier to debug and easier from a software engineering standpoint. So in the same way that interpreted queries would invoke the transaction manager or the log manager and all these other things, your generated code can do the same thing. You can still interact natively with the other parts of the system without having to rewrite everything. That's often overlooked, and I think it's a really important aspect of this. It makes it easier from an engineering standpoint.
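To illustrate the point about generated code calling back into the rest of the system, here is a hedged sketch. The runtime helpers (JoinHashTableInsert and the hypothetical hash table handle) are stand-ins for whatever hand-written C++ components an engine already has; they are not real Peloton or HyPer APIs.

```cpp
#include <cstddef>
#include <cstdint>

// These live in the normal, hand-written C++ part of the DBMS (hash tables,
// indexes, concurrency control). The generated code just declares and calls
// them; they are never re-implemented in the generated/JITed code.
extern "C" void JoinHashTableInsert(void* ht, int64_t key, const void* tuple);

// Code-generated pipeline: scan the build side, apply the specialized
// predicate inline, and insert matches into the join hash table by calling
// straight back into the existing C++ runtime.
extern "C" void pipeline_build(void* ht, const int64_t* keys,
                               const int64_t* vals, size_t n) {
  for (size_t i = 0; i < n; i++) {
    if (vals[i] > 100) {                           // specialized, inlined predicate
      JoinHashTableInsert(ht, keys[i], &vals[i]);  // ordinary call into the engine
    }
  }
}
```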
So in the HIQUE paper they wanted to evaluate how this actually performs. What I like about this paper is that they implemented all the different variations of how you could do transpilation or code generation in their system, going from what we have, where everything is interpreted, which they call the generic iterators, to their auto-generated query-specific code that they run through GCC. But then they also have code that was handwritten by an expert, and they want to see how well the auto-generated code performs versus that, because ideally you want the generated code to be very close to the handwritten code. This paper is from 2009, so this is running on a much older CPU than we're used to, but the high-level ideas are essentially the same. In this graph I'm showing, they're measuring the time spent on L2 cache misses, memory stalls, and actually executing instructions. They also measured L1 cache misses, but because those are so short and so small, they don't really show up, so I didn't include them, but they're in the original paper. So again, as expected, the interpreted engine always performs the worst. And then as you get closer and closer to, in this case at this point here, the truly optimized versions of the query plans, you see these guys performing the best. This is showing that if you add a little bit more code specialization you get closer and closer, but it's only if you do the full code specialization for the entire query plan that you get the best performance. And they're showing that in their case, HIQUE's auto-generated code is actually just a little bit faster than the optimized hand-coded stuff, and this is because they've written the generated code in just the way that lets the compiler find additional optimizations that the hand-coded version wasn't able to utilize. Another thing here: you see that you don't have any memory stalls for these guys over here, because you don't have any branch mispredictions, you don't have any function lookups or function invocations, so you're allocating fewer things on the stack; everything's inlined very nicely, it's a very tight kernel. So this is a great paper that shows the benefit you can get from doing code specialization. So what's one obvious problem with doing this transpilation or source-to-source compilation? Compiling C code is slow. Yes, compiling C code is slow, absolutely.
So they also measured the time it takes to actually compile the query. The earlier numbers were just running the query, assuming it's already been compiled, but then they also measured what it takes to actually compile these queries. For this, they're comparing running GCC with -O0, meaning no optimizations, against -O2. Now, they didn't do -O3, which is normally what people use when they ship production software, and -O3 is much more aggressive than -O2, it does a lot more things, so it would definitely take even longer. But what you see in the case of -O0, which is the bare minimum you can get from GCC without any optimizations, is that it's around 150 to 160 milliseconds, all the way up to over 200 milliseconds. If you turn on the optimizations, then for this last query, Q3, you're now over 600 milliseconds. That's a lot. Before, in this example, I forget exactly which query this was, but this was the micro-benchmark, and the query could run in roughly 50 milliseconds. But if you turn on optimizations, the compilation takes 600 milliseconds, so all the benefit you get from doing query compilation is thrown out the window, because it takes so long to actually compile these queries. For OLTP this won't really be a big issue, because everything's going to be executed as prepared statements and it'll be the same queries that you see over and over again, so the first query would take a long time, but if you cache the compiled result, everything after that runs faster, and this is actually what MemSQL originally did, and I'll talk about that at the end. But again, for a one-off query, if someone's typing in a terminal and the query takes six times as long as it should because you're compiling, then that's not a good trade-off. The LLVM stuff will solve this a little bit, but it still isn't magic, it isn't perfect.

So the main takeaways from the HIQUE stuff are that doing compilation with GCC is expensive relative to the time it takes to execute the query, and that they were solving the problem of having to do the function lookups to evaluate predicates and get tuples from the table, but it was still following the basic iterator model, the Volcano model, for pulling tuples up from the bottom of the query plan to the top. So you're still calling get next, get next, get next all the way down the tree to move tuples up. All right, real quick, part of the reason why GCC is so slow is what it's actually doing: when the query comes in, they fork the GCC process, passing in the source code that they generated from the query plan, and then GCC allocates memory, compiles it, and writes it out to a shared object, and then they link that into the system. When GCC starts up, it's looking at config files, it's looking at command-line parameters, it's looking at environment stuff; there's a lot of extra overhead just from forking GCC to do this compilation. So that's part of the reason why this is so slow.

All right, so the two problems with HIQUE are: GCC is slow, and they're not doing full pipelining. A pipeline is where, in our query plan, in our tree, we can have a tuple move as far up the tree as possible without ever doing a context switch to go look at another tuple. This is going to give you better cache locality; maybe you can even keep your tuple in CPU registers and move it from one operator to the next. The idea is that you
bring your tuple into your CPU registers and do as much processing as you can on it before you have to move on to another tuple or another operator. That's going to allow you to reduce the number of... actually, it's still the same number of instructions, but you reduce the number of cycles, because you're not going to have as many cache misses. So we go back to the same plan that we had before. In this query plan, the pipeline boundaries are defined by these four areas here, and again, a pipeline is where we can take a single tuple and do all the processing that we need on it before we have to block on the next operator to get more data or more tuples. The simplest one to look at is here: if I'm doing a scan on B, I can scan the tuples of B and evaluate the predicate that I need to evaluate, but at this point, when I actually want to compute the aggregation using the group by, I can't compute the count of the number of tuples per group until I've gotten all the tuples in. So this is called a pipeline breaker, because the pipeline has to stop: in order to go to the next pipeline, I have to get all the tuples from the pipeline below me. The same thing here; this is the build side of the join, so I can do my scan on A and I can do all my filtering, but I can't actually do my join, because I still have to wait for the tuples from the other pipeline to come in. So again, the idea of a pipeline, and we're going to see this in HyPer, is that they want to do as much as possible within a pipeline for a single tuple before switching to the next tuple or the next operator, and keep everything local in your caches.

So the way HyPer is going to do this is by using LLVM to do query compilation. There are two major things I want you to get out of this paper: one is using LLVM instead of generating C++, and the second is that they're going to use this push-based query processing model instead of the pull-based model that we saw in HIQUE and traditional Volcano-style database systems. So with LLVM, if you're not familiar with it, it originally started off as a new compiler infrastructure out of UIUC, with Clang as a replacement for GCC, and then it ended up evolving into this collection of different compiler technologies and components that you can build together to make more complicated things. So the LLVM piece, the low-level virtual machine, is one component of all the other stuff that the LLVM project works on. The core thing that we care about is that it generates this low-level programming language called the IR, intermediate representation, which looks a lot like assembly language but is not specific to any CPU architecture. Meaning, if you wrote assembly for x86, it may not run on SPARC machines, but if you wrote the assembly as LLVM IR, LLVM would be able to compile it to run on whatever hardware or ISA you want to run on. And again, we talked about this before: not all of the components in our database system have to be written in LLVM IR, because we're still able to make calls out to C++, to the other parts of the system, without having to rewrite them. Those won't get the speedup benefit you'd get with LLVM, but it makes it easier from a software engineering standpoint, since we can still reuse other code. So, first of all, HyPer is going to use LLVM, and then they're also going to use this push-based execution model.
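Before walking through the annotated plan, here is a rough sketch of what push-based, pipelined code generated for this kind of plan might look like. The loop structure follows the pipeline idea described above, but the table layouts, predicates, and function names are invented for illustration, not what HyPer actually emits.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Pipeline 1: scan A, filter, and materialize the build side of the join.
// Each tuple is pushed as far up its pipeline as possible (scan -> filter ->
// hash-table insert) while it is still hot in registers/cache.
void pipeline1(const std::vector<int64_t>& a_id, const std::vector<int64_t>& a_val,
               std::unordered_map<int64_t, int64_t>& join_ht) {
  for (size_t i = 0; i < a_id.size(); i++) {
    if (a_val[i] > 100) {                  // hypothetical predicate on A, inlined
      join_ht.emplace(a_id[i], a_val[i]);  // pipeline breaker: materialize build side
    }
  }
}

// Pipeline 2: scan B, filter, and build the aggregate's hash table. The
// COUNT per group cannot be emitted until all of B has been consumed, which
// is exactly why this is a pipeline boundary.
void pipeline2(const std::vector<int64_t>& b_id, const std::vector<int64_t>& b_val,
               std::unordered_map<int64_t, int64_t>& agg_ht) {
  for (size_t i = 0; i < b_id.size(); i++) {
    if (b_val[i] > 100) agg_ht[b_id[i]]++;  // hypothetical predicate + group-by count
  }
}

// Pipelines 3 and 4 would then scan C, probe agg_ht and join_ht, and push
// qualifying tuples straight to the output without re-materializing them.
```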
So what I'll say, though, is that these two ideas aren't conjoined: you can have the push-based model without using LLVM, and you can use LLVM without the push-based model. The paper is combining two really good ideas into a single research paper. So this is the same query plan that we saw before, but now we're going to annotate the different for loops in our generated code based on the pipelines that I showed before. The first pipeline here, pipeline one, is going to iterate over A, evaluate the attribute to see whether it matches our predicate, and if it does, add it to the hash table that we're going to use for the join later on. Once this is done, we can transition to pipeline two, where we spin through B and do the same kind of thing. Now, for this, since we need to compute the count on B, we write into a hash table; once this is all done and the hash table is complete, we can compute the count and spit that data out into the next pipeline, which feeds into the first join with C. We use pipeline three to compute that, and then use that output to produce the final join at the top here.

I'm walking through this as if it were running as a single thread, but again, this is the same project that was in the Morsels paper that you guys read, so there's no reason that different cores or different threads couldn't be operating on these different pipelines at the same time. Everyone doesn't always have to do the first one and then switch over to the next one; if you want, your scheduler could decide how to spread out these different tasks and run them at different times based on the locality of where the data is. Yes? [Student: This also requires some physical plan knowledge, what the joins are, determining which side of the join is going to be blocking, depending on the physical plan.] So the question is: doesn't what I'm describing here depend on the physical plan? Absolutely, yes. When I said we don't care about the physical plan, I meant that in the discussion of this material I don't need to tell you whether it's a hash join versus a sort-merge join. But absolutely, you need to know what's what to actually produce the code. I'm being vague here, "materialize into hash table", I'm saying in English what it's actually doing, but you're absolutely right: you would take the physical plan and then bake into the LLVM IR exactly the equivalent machine code to do that. That's a good point.

So what performance benefit can you get from this? The paper is nice because, in the original version of HyPer, they actually did it the way that HIQUE did it and the way MemSQL used to do it, where you do transpilation and generate C++ code. So we actually have an apples-to-apples comparison of a system doing the LLVM approach and the same system also doing C++ code gen, and we can see what the performance difference is. The first thing I'll point out is that the Y axis is on a log scale, and for the comparison they're using HyPer with LLVM, HyPer with C++, VectorWise, MonetDB, and then the unnamed one, DBX, which was just Oracle. So Oracle... actually, take that back, Oracle would definitely be doing interpretation, so it's not a surprise that it's slow across the board. MonetDB does code specialization for just the predicates
but it doesn't do any code specialization for the scans and the other things. We'll talk about VectorWise at the end; VectorWise actually does something quite different. They do code specialization by pre-generating all the different possible types of primitives that you could execute on a query or on a table, and I'll show you examples of this: they'll pre-generate all the functions for "value equals some parameter", and they'll generate one for ints and floats and whatever other types you have. So they generate about 5,000 of these different primitives, and then they just know how to stitch them together at runtime. So you're essentially doing the same thing as the LLVM stuff or the C++ stuff, but you're not generating it on the fly; you pre-compute everything when you build the system.

So this is showing that, across the board, the LLVM approach is beating everyone else. I think for these times they're also including the compilation time as well; actually, that may or may not be true, because compilation should be roughly around 20 to 30 milliseconds, so this might just be execution. Again, this is showing that with LLVM plus the push-based model you outperform everyone else. I think the C++ version is also using the push-based model, so it's not clear how much of the benefit is the push-based model versus the LLVM stuff. Then this graph here is actually a combination of the HIQUE paper and the HyPer paper, so it's not a true apples-to-apples comparison, because it's clearly running on different hardware, but it's essentially the same TPC-H queries. This is just to give you a rough idea of the difference between compiling things with LLVM versus GCC -O2 for HIQUE, and you can see that with LLVM you're in the tens of milliseconds to compile these queries, which is now getting a bit more reasonable. For OLTP this is still bad, because for OLTP you want to return your queries in sub-milliseconds, like microseconds, but for an OLAP query that may run for 500 milliseconds, paying a 30 millisecond overhead to compile it is not that big of a problem.

I will say, though, that when you start doing more complicated queries, and especially when you start doing compilation on what are called machine-generated queries, things change. Think of, say, a dashboard or some kind of tool you use to look at data, with all these little checkboxes where you can select different options for your scans and things like that; the application will then generate a query that runs on your database system. For those kinds of dashboards, the queries can end up being quite big. I've heard from Google that sometimes they have queries where just the SQL itself is 10 megabytes. When you have those kinds of queries and you run them through even LLVM, even though it's faster than GCC, now you're talking about a second of compilation time. There is a way to solve this, and the HyPer guys have solved it, but they only solved it about a year ago and they haven't written the paper about it yet, so I don't want to say on camera what it is. When they told me, it was like, oh yeah, that's actually really clever, that's the right way to do it. So if you want to know, I'll tell you afterwards off camera, because they're working on the paper and I don't want them to get scooped because people saw the video. They hit this problem again when HyPer got bought by Tableau: they started hooking it up to real Postgres tools to
see how people actually use it. When you use something like pgAdmin, the first thing pgAdmin does is run a bunch of queries and joins against the catalog to see what's in your database, and they found that because they have to compile all these queries when the system starts up, pgAdmin would stall for a couple of seconds before it actually came up, whereas with the real Postgres it would come up right away, because it didn't have to do all this compilation stuff. So again, they solved that by doing something which I'll talk about afterwards.

All right, so as I said before, we're actually very interested in compilation in our own system. This is a little micro-benchmark that my PhD student Prashanth worked on last year that compares the current version of Peloton, which does interpretation, versus the new LLVM engine that we're hoping to put into the real system this week, for some basic TPC-H queries. It just gives you an idea of how much slower our current system was compared to what he can do: for some of these basic TPC-H queries he's going from 88 seconds to 900 milliseconds. Same database, same query, it's just so much faster, and I think this actually includes the compilation time as well. I will say, though, that we can get even faster. This is the unoptimized version, the first incarnation of his LLVM engine; the newer version that he's hopefully going to push out soon will cut this down by another factor of about 3x, and the way he's going to get that speedup is what we're going to talk about on Thursday: doing vectorization and prefetching as well. So these things aren't mutually exclusive; you can do compilation plus vectorization plus the push-based stuff we talked about, you can combine all these things together and cut this down even further.

So in the remaining time I want to talk about how real-world systems actually do this. I'll start off talking about IBM in the beginning and then go through each of these. The one that I think is actually very different from what most people do is VectorWise, or what's now called Actian Vector, and then MemSQL actually does what HyPer does, at least in the newer version, so I'll talk about what they did before and what they did afterwards. As with all things in database systems, a lot of these ideas aren't new. This code generation stuff, this query compilation stuff, was actually in the first database system: the IBM guys implemented the exact same things that we're talking about here today in the 1970s. Remember, the project started because they got seven or eight people who all had PhDs, put them in the same room, and they all divided up and worked on different parts: one person worked on query optimization, one person worked on SQL, and one person worked on code generation. What they would do is take a query plan and convert it into actual assembly code by splicing together different code templates for each operator, and then they would invoke that assembly rather than interpreting the query plan, and it would run much faster than the interpreted version. The reason this actually turned out to be bad for them, and they talk about this in a bunch of papers, so this is one of them, called "A History and Evaluation of System R", and there was also a retrospective, a reunion of some of the System R engineers, where they talked about all the different problems and issues they had trying to build the very
first database system. In these transcripts they talked about how this seemed like a really awesome idea, a no-brainer that you would want to do, but the problem was that in the 1970s, before x86 became the standard ISA, there were all these different hardware manufacturers and they all had their own chips and their own assembly; even IBM had different models of their systems that had completely different assembly. So it was impossible from an engineering standpoint to have this run on anything, because for every single new system you wanted to port it to, you had to rewrite all the code generation stuff to support the new assembly. The other big problem they had was that any time you changed one aspect of the system, like how things were laid out in memory, you then had to go back and rewrite or change all your code generation stuff. So if someone changed the storage manager and said, now the header for every tuple has this number of slots, this metadata, you then had to go back and change this on the compiler side of things. Every time this happened, it was a lot of work to go back and change all these things. This is actually the same problem that MemSQL hit, which I'll talk about in a second, and which then led them to switch over to using LLVM.

So, as I said, this seemed like a really good idea, but because of all these issues IBM abandoned this approach. They never commercialized System R; the first relational database system they came out with was DB2, and some of them have told me that about 50% of the original DB2 code came from System R. They brought in some of the SQL stuff and other things, but the thing that they didn't even bother to port at all was this piece here, the code generation stuff. And up until maybe the last decade or so, or the last 15 years, nobody really did this, because everyone always looked at the problems that IBM had and said, this is not something we're going to worry about. Also, they weren't operating on in-memory databases, or databases that fit mostly in memory: if your disk is always your main bottleneck, you're not worried about making function calls as you evaluate tuples. So again, this was originally done in the 1970s, but it's actually in vogue today because the hardware has caught up to the point where this is something that makes a big difference.

In the case of Oracle, as far as I know, at least from what I can tell when I look at the documentation, they don't do any code generation or query compilation for any aspect of the system, not even for the predicates, which is the first thing that everyone always does. As far as I can tell, the only thing they support is that they can take your stored procedures written in PL/SQL and convert them into their own dialect of C called Pro*C, and from that they can compile them into native code. But when you invoke the stored procedure, all the query processing is still done with the interpreted model. Now, in the more expensive versions of Oracle, like Oracle Exadata, they do some early filtering, where they can push predicates from the execution rack, the head nodes, down to the storage nodes and filter tuples directly where the data resides on disk, so they don't ship them back over the wire, but I don't think they do compilation. If you go download regular Oracle that runs on a single box, it doesn't do this. One thing that Oracle is doing that is
different from everybody else is that they're not even bothering with code generation or code specialization by generating software on the fly to execute things efficiently; they're skipping that entirely and putting SQL functionality directly into the hardware. That came out probably two years ago. So with the newer SPARC chips from Oracle, remember Oracle bought Sun, and Sun was making the SPARC Niagara chips, the newer versions of the SPARC chips actually have explicit instructions to do database stuff: they have instructions to do memory scans, they can do dictionary compression and decompression, vectorized instructions, and other things. So they're not doing code specialization at the software level; they're doing code specialization at the hardware level, and this is going to be way faster than anything LLVM can do, because it's all baked into hardware. I think this is sort of a smart move from Oracle's standpoint on the business side, because if people are on Amazon, using RDS, using cloud infrastructure, they may be more amenable to switching from Oracle to an open-source or cheaper system. But now if you say, look, you have your Oracle box, they lock you into specialized hardware that you're not going to get on Amazon or Azure. So I think that's interesting; we'll see how it plays out. I will also say this idea is not new either: putting database functionality in hardware was in vogue in the 1980s when these database machines, these specialized appliances, came out, and that went away because commodity hardware became cheaper and better. We may talk about this later in the semester, but there seems to be a trend now to add back specialized hardware for databases to speed these things up.

For Hekaton, I talked a little bit about this at the beginning. Hekaton can compile both the stored procedures and the SQL statements themselves, and one of the interesting things they do, because Hekaton was designed to be an engine that works with the entire SQL Server ecosystem, is handle the case where Hekaton transactions have to touch tables that are in the non-Hekaton portion of the system; you don't want that to go through the interpreted engine, which could be really, really slow. So what they can also do is generate C code for queries that have to touch tables that aren't in Hekaton, which can then directly invoke the low-level storage managers on the regular side of SQL Server and run really fast without doing interpretation. So it's sort of the same thing as in HIQUE: you're generating the C code, you compile it using the Visual Studio compiler, link it into the runtime, and then you can get whatever data you need. Obviously this is dangerous, and Microsoft cares about security a lot, so they have a bunch of additional mechanisms in place to prevent you from writing a SQL query that could take over the SQL Server box. It's not like you can just write raw C and do whatever you want; they have a bunch of checks to make sure you don't do buffer overruns and all the problems that Windows used to have in the early versions of the operating system. You want to avoid that in your database system.

With Impala, Cloudera's Impala is another one of the few systems that actually use LLVM, but unlike with HyPer, where they compile the entire query plan using LLVM, as far as I know, in Impala they only do
compilation for just evaluating the predicates, avoiding having to go down the tree, but then also for record parsing. This part, record parsing, is actually different; we haven't really talked about this at all for the other systems. Because Impala is meant to run in the Hadoop ecosystem, they want to read files off of HDFS, and the idea is that they support all these different formats, all these different serialization methods or protocols, things like Thrift or Parquet files or just plain HDFS text files. So they want to speed up parsing those files and reading them into Impala quickly, and if you know the schema of those files, then you can generate an LLVM program that knows how to jump to the different offsets to find the data you're looking for and produce it in the format you need in memory, which you can then execute the Impala queries directly on. I won't say too much about Impala now; the lead developer of Impala from Cloudera will be speaking at the end of the semester, so he'll talk a little bit about this LLVM stuff they're doing as well. We haven't really talked about the shared-nothing aspect, we haven't talked about distributed databases; we're really focusing on how you actually make things run faster on a single node, and this is the stuff that I think is relevant to what we're talking about here.

So Actian Vector, formerly VectorWise, formerly MonetDB/X100, is, in my opinion, one of the most underrated database systems out there. It was an in-memory system that could do really fast execution of queries because of its vectorized processing: instead of calling get next to get single tuples, they call get next and produce a vector of tuples, which they can then run SIMD and other vectorized instructions on really, really quickly. What's interesting about Vector, or VectorWise, and actually it's really sad: if you go Google Actian Vector or VectorWise, it takes you to the Actian homepage, and they're trying to pretend like the product doesn't exist anymore. They basically let go the entire VectorWise team, which was a huge mistake, well, I'm on video or whatever. And all those awesome people that built VectorWise, some of them went to Snowflake, but most of them went to Databricks to work on Spark, and it sucks, because I think the system is awesome. Dana's done some experiments with it and it outperforms pretty much everything else; it's neck and neck with HyPer. For the new system we're working on, we can get pretty close to VectorWise. But in my opinion, what they were doing was actually really interesting. VectorWise, I think I said this at the beginning, doesn't do on-the-fly code generation or query compilation, meaning when the query plan comes in they don't generate C++ code or LLVM IR. What they instead do is have these different primitives that perform some small task that you would have to do inside a query, like scan a column and evaluate a predicate, where all the column values are integers and the constant in the predicate is a boolean or a double. They pre-generate all these different primitives, and they say in the paper that there are about 5,000 of them, and then when a query plan comes in, they basically stitch together these different primitives to produce the query plan, and they just run that and let it go.
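As a rough illustration of what one of these pre-generated primitives might look like (the name and signature below are made up, not VectorWise's actual code): one specialization of a "select less-than" primitive for an int32 column compared against a double constant, operating on a whole vector of values per call.

```cpp
#include <cstddef>
#include <cstdint>

// One of the thousands of pre-compiled primitives: selection with "<" where
// the column is int32 and the constant is a double. It processes a vector of
// values per invocation and writes the indexes of matches to a selection
// vector, so the per-call overhead is amortized over the whole batch.
size_t select_lt_int32_double(const int32_t* col, double constant,
                              size_t n, uint32_t* out_sel) {
  size_t matched = 0;
  for (size_t i = 0; i < n; i++) {
    if (col[i] < constant) out_sel[matched++] = static_cast<uint32_t>(i);
  }
  return matched;
}

// The system would also pre-generate select_lt_int64_double,
// select_lt_double_double, and so on; at runtime the planner just picks the
// right one and chains it with the other primitives the query needs.
```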
And so in this case you're still going to have function calls, which we said were bad, but because they're doing vectorized execution, they're not doing a function call per tuple as you would in the Volcano model; they're doing it once for every 1,000 or 10,000 tuples, so the overhead of calling these functions is amortized by doing things in batches. To give you an idea of what these primitives look like: this one is doing a scan on a column, and for both versions you want to evaluate the same predicate, where the attribute in a particular column is less than some value. The two things I'm showing here are, one, they're both doing the less-than operator, so that's baked into the actual function, and two, we're passing in different types of arguments, so the compiler knows that if it's an int32 it's 32 bits and a double is 64 bits; it knows how to align all this and generate efficient machine code. Then at runtime we pass in these pointers and invoke it. The system says, alright, I know I'm doing this query plan, it's a less-than operator on this column, the column is 32-bit integers, and there's a double in my predicate, so I know I want to invoke this function. There are no conditional branches, other than the thing we'll talk about on Thursday, and I don't have to look at the data and ask, is it a double, is it an int, what is it; I know how to jump exactly to the offsets I need. Again, they compile all of these ahead of time and put it all together. This is why, in the results I showed, VectorWise gets pretty close to what HyPer can do: it isn't doing query compilation as we're describing it here, but at a high level it's more or less doing the same thing, because everything is pre-generated.

Alright, so MemSQL. In the first incarnation of MemSQL, up until about last year, they did HIQUE-style code generation: they would generate C code, invoke GCC, and then link in that shared object. As I showed before, GCC is really slow, and they knew this when they would run their experiments or show people the system. So the way they avoid having to invoke GCC every time is that they take any query that comes in, convert it into a parameterized form, and then cache the shared object that GCC produces, so that the next time you execute the same query you can just use the cached shared object. Say I have a query like SELECT * FROM A WHERE A.id = 123. The constant 123 is baked in, so if I generated C code with 123 hard-coded in it, any other query that came along with a different value couldn't use the same plan. So they do something basic: they extract out the constants and replace them with input parameters, like a prepared statement, and then cache that shared object. When the next query comes in with the same syntax, except it's A.id = 456, they know they can reuse the cached object and don't have to invoke GCC again. Their approach to this was basically naive string matching: they would always extract the constants and just check whether there was an exact string match against their cached plans. That means that if I had, say, A.id = something AND B.id = something, and another query came in later that just flipped the order of those predicates, so it was B.id = something AND A.id = something, their string matcher wouldn't be able to find it.
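To make the parameterize-and-cache idea concrete, here is a minimal sketch under my own assumptions; this is not MemSQL's code, and the string-keyed cache shown here is deliberately naive, with exactly the weakness just described.

```cpp
// A sketch (not MemSQL's code) of caching compiled plans keyed on a
// parameterized query string. Constants are replaced with '?', and the key
// is the raw SQL text, so reordering predicates produces a different key
// and misses the cache.
#include <string>
#include <unordered_map>
#include <cctype>
#include <iostream>

struct CompiledPlan { std::string param_sql; /* handle to a .so in a real system */ };

// Replace numeric literals with '?' to form the cache key.
static std::string parameterize(const std::string& sql) {
    std::string out;
    for (size_t i = 0; i < sql.size(); ) {
        if (std::isdigit(static_cast<unsigned char>(sql[i]))) {
            out += '?';
            while (i < sql.size() && std::isdigit(static_cast<unsigned char>(sql[i]))) i++;
        } else {
            out += sql[i++];
        }
    }
    return out;
}

static std::unordered_map<std::string, CompiledPlan> plan_cache;

static CompiledPlan& get_plan(const std::string& sql) {
    std::string key = parameterize(sql);
    auto it = plan_cache.find(key);
    if (it == plan_cache.end()) {
        std::cout << "cache miss: would invoke GCC for \"" << key << "\"\n";
        it = plan_cache.emplace(key, CompiledPlan{key}).first;
    } else {
        std::cout << "cache hit:  reuse shared object for \"" << key << "\"\n";
    }
    return it->second;
}

int main() {
    get_plan("SELECT * FROM A WHERE A.id = 123");                // miss: compile
    get_plan("SELECT * FROM A WHERE A.id = 456");                // hit: same shape
    get_plan("SELECT * FROM A WHERE A.id = 8 AND B.id = 7");     // miss: compile
    get_plan("SELECT * FROM A WHERE B.id = 7 AND A.id = 8");     // miss: reordered text
}
```

A less naive implementation would key the cache on the parse tree, or a canonicalized form of it, rather than the text, which is exactly the fix described next.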
Right, obviously when you fix this you can look at the syntax tree and avoid these issues, but the version they had prior to last year didn't do that. In the old documentation, if you read some of the old blog posts about MemSQL, they give examples where, to address the slowness of GCC, you invoke the first query and the runtime is one second, then you invoke the same query again and the runtime drops to zero; that's how they showed that, yes, GCC is slow, but you can cache these things and be smart about it. When I first met the MemSQL guys in 2013, I had known about them for a while, but I went to go visit them in 2013 before I started at CMU, and they sort of confessed to me that, having been around for a while, if they were going to do MemSQL all over again, they would not have done the code generation approach I showed at the very beginning. They were having the exact same problems the IBM guys did: anytime you change anything, any layout or organization in the system, you have to go change the code generation part of the system too. It became very brittle, any minor change would break everything else, and it became really hard for them to add new operators and expand the functionality over time. Then in 2015 they started working on the new version of the system, which came out in 2016, where they hired the guy that did HipHop VM at Facebook. HipHop VM was Facebook's optimized version of the PHP runtime, and this guy, Drew Paroski, rewrote the entire code generation portion of the system to use LLVM. What they do is actually kind of interesting. In the case of the HyPer paper you guys read, they have a bunch of macros to generate the LLVM IR and stitch together a query, but the problem with HyPer, and we're going to face this problem in our own system, is that finding people who can actually write LLVM code is really hard. It's hard enough to find people that can write C++ code inside a database system, and in the case of HyPer it was pretty much written by two dudes; not every student, not every person that worked on it, was able to modify the LLVM stuff. The MemSQL guys saw this was going to be a problem in their system too, and what they decided to do is have multiple stages: you start with a high-level language that you use to express your query plan, and then you go to lower and lower representations until you finally get down to the LLVM IR. They have this thing called the MemSQL programming language, which more or less looks like a dialect of C++, and if you click this link here when I post the slides, it'll take you to a High Scalability blog post that shows what it looks like. The idea is that a query plan gets converted into this C++ DSL, and then that gets compiled down into what they call MemSQL byte code, which looks a lot like JVM byte code; if you know how SQLite works, SQLite has these opcodes as well, same basic idea. Then you take that byte code and compile it again with LLVM to generate the IR, which then produces the native code.
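To give a flavor of what an opcode-style intermediate representation like this looks like, here is a tiny, made-up example in the spirit of SQLite-style opcodes; the opcode names and layout are my own assumptions, not MemSQL's or SQLite's actual format. The point is just that a query plan can be lowered to a flat list of simple instructions, which can then be interpreted or handed to LLVM and compiled to native code.

```cpp
// A toy opcode program for roughly "SELECT id FROM A WHERE id < 10"
// (invented for illustration), plus a minimal interpreter loop.
#include <cstdint>
#include <vector>
#include <iostream>

enum class Op { ScanNext, LoadCol, LtConst, JumpIfFalse, EmitRow, Jump, Halt };
struct Instr { Op op; int64_t arg; };

int main() {
    // Pretend this is table A with a single int column "id".
    std::vector<int64_t> id_col = {5, 42, -3, 17};

    std::vector<Instr> program = {
        {Op::ScanNext,    6},   // 0: advance cursor; if exhausted, jump to 6
        {Op::LoadCol,     0},   // 1: load column 0 of the current row
        {Op::LtConst,    10},   // 2: flag = (reg < 10)
        {Op::JumpIfFalse, 0},   // 3: no match -> back to ScanNext
        {Op::EmitRow,     0},   // 4: output the current row
        {Op::Jump,        0},   // 5: go get the next tuple
        {Op::Halt,        0},   // 6: done
    };

    size_t pc = 0, cursor = static_cast<size_t>(-1);
    int64_t reg = 0;
    bool flag = false;
    while (true) {
        const Instr& ins = program[pc];
        switch (ins.op) {
            case Op::ScanNext:
                cursor++;
                if (cursor >= id_col.size()) { pc = static_cast<size_t>(ins.arg); continue; }
                break;
            case Op::LoadCol:     reg = id_col[cursor]; break;
            case Op::LtConst:     flag = (reg < ins.arg); break;
            case Op::JumpIfFalse:
                if (!flag) { pc = static_cast<size_t>(ins.arg); continue; }
                break;
            case Op::EmitRow:     std::cout << id_col[cursor] << "\n"; break;
            case Op::Jump:        pc = static_cast<size_t>(ins.arg); continue;
            case Op::Halt:        return 0;
        }
        pc++;
    }
}
```

Running this prints 5 and -3, the two ids under 10; a backend could just as well translate the same instruction list into LLVM IR instead of interpreting it.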
The reason they do this is that, again, not everyone is going to be able to write at that lower level, but most people, anybody you're hiring to work on your database system, should be able to write C++, so they can all work on that part, and then you just have a few smart people write the part that actually does the conversion. You can modify all of the high-level stuff without changing any of the lower-level stuff. Another aspect that is not really addressed in the HyPer paper, but actually is a big deal with LLVM, is that it's very hard to debug: if you crash, if you get a seg fault and ask where it actually failed, you get a bunch of assembly that LLVM generated for you, and it's really hard to say where that assembly actually came from. With this approach you can compile, run, and debug the high-level code and see whether it actually works, and knowing that part is solid, you can make changes without worrying about the layers below it. Your regular developer can open up GDB and check whether the high-level stuff works instead of staring at assembly stack traces and things like that. From a software engineering standpoint, I think this is the better way to do it than what the HyPer guys do, but you need somebody like the guy who built HipHop VM to make it happen; it's a lot of software engineering work that doesn't add core functionality to the system, it just makes it easier to maintain. As far as I know, MemSQL is the only one that does this.

Alright, the last one I'm going to talk about is Vitesse DB. To be very clear, there's Vitesse DB, and there's also Vitess without the "e" at the end, which is a different project for scaling MySQL. These are Postgres and Greenplum guys. What they do is they have a custom query accelerator that they wrote with LLVM that does all the things we talked about before: they do the predicate evaluation, they use a push-based model like in HyPer, and they try to inline everything. They can also leverage the hardware to do overflow detection. Actually, this is another big thing that I forgot to mention: when you evaluate that predicate tree, at every single step you have to do all these extra checks, like is the thing null, did I overflow. We talked about this before when we looked at the difference between numerics and floating-point versus fixed-point numbers; there are all these extra checks you have to do, which makes things slower. So when you evaluate predicates using the interpreted model, all those extra checks are slowing you down too, but the hardware can do some of this for you, and LLVM can help you with that; there's a small sketch of this after this paragraph. The way Vitesse DB works is that they use the execution hooks inside Postgres, so when a query plan comes in, it goes through the normal Postgres query optimizer, and then if it's a SELECT statement they intercept it and hand it off to their LLVM engine, which knows how to read and write the Postgres data files directly but has a brand new execution engine. They can also do query parallelism in ways that stock Postgres can't, and they get much, much better performance. Anytime you do a modification, an update or a delete, or anytime you change the database schema, like updating the catalog, it goes through the regular Postgres machinery.
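Here is the sketch I mentioned about letting the hardware and the compiler handle overflow checks. This is my own minimal illustration using the __builtin_add_overflow intrinsic that GCC and Clang provide; it is not Vitesse DB's code. The intrinsic compiles down to an add plus a test of the CPU's overflow flag, instead of the explicit range comparisons an interpreter would typically emit around every arithmetic step.

```cpp
// Minimal illustration (not Vitesse DB's code) of overflow checking.
#include <cstdint>
#include <limits>
#include <iostream>

// Interpreter-style check: compare against the limits before adding.
bool add_checked_manual(int64_t a, int64_t b, int64_t* out) {
    if (b > 0 && a > std::numeric_limits<int64_t>::max() - b) return false;
    if (b < 0 && a < std::numeric_limits<int64_t>::min() - b) return false;
    *out = a + b;
    return true;
}

// Hardware-assisted check: GCC/Clang builtin, an add plus the overflow flag.
bool add_checked_hw(int64_t a, int64_t b, int64_t* out) {
    return !__builtin_add_overflow(a, b, out);
}

int main() {
    int64_t r;
    std::cout << add_checked_hw(1, 2, &r) << " r=" << r << "\n";           // 1 r=3
    std::cout << add_checked_hw(std::numeric_limits<int64_t>::max(), 1, &r)
              << "\n";                                                      // 0 (overflow)
}
```

Compiled query code can inline this check directly into the hot loop, whereas an interpreter pays for the surrounding dispatch on every tuple.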
Again, the link here is from a Postgres conference, probably two years ago, where one of the founders of the company goes through and talks about all the things they had to do to make Vitesse DB 100x faster than stock Postgres. They're also using a column store and doing some other things, but it's a lot of the same techniques we've already talked about.

Alright, so what are my parting thoughts? I think query compilation is very important; it makes a huge difference, as we saw in these numbers. You'll see a bigger difference in performance for OLAP than for OLTP. With OLTP, again, you're doing index lookups to grab a small number of tuples, so the LLVM stuff doesn't really help you there; it's really for OLAP, where you have to access a lot of data, that reducing your instruction count makes a big difference. In my opinion, the 2016 version of MemSQL, the LLVM stuff I talked about with the intermediate language and the DSL, is the best compilation engine out there. I think HyPer is very good too, but it's difficult to work on, and Hekaton is very good as well; it does some other smart things. VectorWise is the same, but again, it might be dead at this point. In my opinion, any new database system that wants to do analytics and isn't doing code generation and query compilation is not going to be able to compete with the systems that do. In the case of Spark, in Spark SQL, they had a blog post probably a year or two ago, and a new paper I think last year: they're now doing code generation and query compilation too. This is the trend in database systems right now, this is what everyone is doing, and it's kind of weird to think about because it's a lot different from the interpretation model we teach you as an undergrad or in the intro course; you sort of have to rethink how you architect the system to take advantage of this. Okay, so any questions about code generation and query compilation? Yes, sort of a tangential question: why is it customary in database papers not to name commercial systems? So his question is why it is customary in database papers not to name the actual system. Sorry, ignore this; if you have to go, go, but if you want to hear the answer to his question, I'll tell you. In the paper you guys read, it was called DBX, and this is because of what is called a DeWitt clause. David DeWitt is a very famous database professor; he used to be at Wisconsin, he started a research lab for Microsoft called the Jim Gray lab in Madison, and as of last fall he left Madison and is now at MIT with Mike Stonebraker; he's a long-time friend of Mike Stonebraker, so they're both at MIT together. This is his Wikipedia page; I know this because he had me update it last week, since he didn't want to change it himself. So, to his question: the reason you can't name a system is what's called a DeWitt clause, and it's named after this database professor. The story goes that when David DeWitt was an early database professor at Wisconsin, he made this system called Exodus, a relational database system, and he benchmarked it against the two major relational databases at the time: Ingres, which was his friend Mike Stonebraker's, and Oracle, from Larry Ellison. In the paper that came out, his system destroyed the other systems, and he published it saying, look how much faster my system is than Oracle and Ingres. Stonebraker was pissed, but the way Stonebraker got back at him was that he called him in the middle of the night and said, how could you do this, how could you betray me?
Then Larry Ellison went to the president of the university, the chancellor, and demanded that the University of Wisconsin fire David DeWitt because he had maligned his database system. He went to the department chair at Wisconsin and said, you have to fire David DeWitt, and obviously the university said, no, we're a university, we're not doing this. So by personal decree of Larry Ellison, the Oracle Corporation would not hire anybody from the University of Wisconsin system for one decade, for ten years. And then they went back and changed the user license agreements for all the Oracle database systems, so that you could not name the Oracle system by name in any academic study that was not sanctioned by Oracle. That's why you see this DBX in all these different papers; as far as I know, a bunch of other systems also have the same license agreement, but in general people just avoid naming systems because they want to avoid the same mess that DeWitt got into. Now flash forward to 2009: DeWitt, Stonebraker, and I worked on a paper where we compared Hadoop and MapReduce against Vertica and another commercial database system. Vertica was DeWitt and Stonebraker's startup, so obviously they were able to beat Hadoop without any problems; Hadoop wasn't very good, still isn't very good, and Vertica crushed it. But for this other commercial system, a parallel database, I was able to get Hadoop to beat it. So we showed this company and said, hey look, our grad student running Hadoop is able to beat your million-dollar database system, what do you think about that? Same thing: they started freaking out, and they threatened to put a DeWitt clause on their database system. So in exchange for replacing the real name of the system, which I can tell you offline, with "DBMS X", they had a DBA come in and fix the system for us to make it run faster. It got a little bit faster than Hadoop, but not by much, because otherwise it would look bad: here's this Hadoop thing, this was like early 2008, and Hadoop was able to beat this million-dollar system by just doing some simple things. So this is what's known as a DeWitt clause. There are a couple of articles in the links on the Wikipedia page that describe more about this and the bunch of different systems that have one. I would say it has sort of a chilling effect: not all systems have DeWitt clauses, but people just avoid naming them because you don't want to piss people off more than you need to. Okay, any questions? On Thursday we'll talk about vectorization, and I'll send an email out about doing a rebase on your code to pick up the latest changes we have in the system. Okay, alright guys, thanks, see you on Thursday.