Carnegie Mellon University's Advanced Database Systems course is filmed in front of a live studio audience. Alright, let's just start, we have a lot to discuss. Okay, so in this class I'm going to try to jam in as much as I can about query compilation and code generation. Like the vectorization stuff, it's another technique we can use to make our queries run faster and run better. In the last class we spent time talking about how to use SIMD to vectorize some of the core database algorithms in our system, so that when we run sequential scans we can achieve data parallelism: the system executes the same instructions but operates on multiple tuples at the same time. Again, this is data parallelism, or intra-query parallelism is another way to describe it. The paper I had you guys read today is from 2011, and it's a really seminal paper on query compilation and code generation. It certainly was not the first, but in the modern era it set off this line of investigation from multiple systems and researchers using LLVM to optimize things. A lot of the papers in the early-to-mid 2010s give the false impression that if you're doing vectorization you can't do compilation, and if you're doing compilation you can't do vectorization. In the HyPer paper you guys read, they say they don't need to do vectorization because they have this push-based model and data-centric optimizations, but we'll see at the end that these aren't mutually exclusive, and you can do both. So we showed this slide before: since we're executing sequential scans in our OLAP database system, we have to rely on making the execution engine itself as fast as possible so that we can get better results and make the system more efficient. We've gone through a bunch of these already: how to reduce the instruction count, reduce cycles per instruction, and then parallelize things, which we'll cover more of next week. Today's class is really focused on that first one. We certainly care about reducing cycles too, but the code specialization technique we're going to talk about today is really about making sure the database system only executes instructions for exactly what the query needs. The idea is almost like hard-coding a program to do nothing but execute that query. And once we've done this, we can also design the code generation components in our system to be aware of what the CPU wants and generate code that's going to be optimal for it. The reason code specialization and query compilation matter is that we're bringing all these files from disk into memory, and we assume there aren't going to be any disk stalls while we're processing queries. We can do asynchronous I/O; we can let something in the background go fetch the data that we need.
Obviously there's a pause to get the first thing we need into memory, but while we're crunching over these large Parquet and ORC files (think of them as hundreds of megabytes, not four-kilobyte pages), something else in the background is going to fetch the data we need next and bring it into memory, so that when the execution engine comes around and says, okay, let me do more work, the stuff it needs is already in memory. So we don't have to worry about disk stalls in this environment; something else is going to hide those stalls for us. Plus, as I said before, disks and networks have gotten really fast, so it's less of an issue. There's this great paper from Microsoft in 2011 where they talk about what it would actually take to build a database system that runs orders of magnitude faster, if you're just trying to reduce the number of instructions you execute while running queries. They made this observation that if you want your database system to run 10x faster, then you need to execute 90% fewer instructions than you normally would (at the same cycles per instruction, one tenth the time means one tenth the instructions). Seems like a lot, but that's actually doable, right? It's not far-fetched that we could achieve this. But if you want to build a system that goes 100x faster, now you're really cutting to the bone: you've got to execute 99% fewer instructions than before. And there isn't a flag in GCC like -O100 that's going to achieve this for us. It's going to come through careful engineering, and the code specialization and compilation stuff we're talking about today is how we can actually achieve that. And of course, as I said, it's not just about reducing the number of instructions. We also care about cycles per instruction, because we can execute fewer instructions, but if we keep stalling on trips out to memory because the things we need aren't in our L3 cache, then we're not going to get the benefit we thought we would. So today is really about this: we're never going to get to 99% fewer instructions, but we can do a pretty good job and get somewhere in the middle. First I want to talk about the background, why we want to do code generation and query compilation, and then we'll talk about the two main techniques. There's source-to-source compilation, or transpilation; that's from earlier work that predates HyPer, and I think the HyPer paper talks about it. Basically, you write C++ code or Rust code or whatever that generates code in another programming language, and then you run a traditional compiler on that. Then in the HyPer paper you guys read, the idea is to generate a low-level representation, some IR, within your database system for the query, and then use an embedded compiler like LLVM to compile it. At a high level they're the same: they're both doing code specialization for a query, but there are two different ways to approach it, and there are engineering and compilation-cost trade-offs between the two. Then we'll go through a quick smattering of different systems that do these different techniques, and we'll finish up with a quick Q&A about how things are going with the projects.
So as I already said, the way we're going to reduce the instruction count is through code specialization. The high-level idea is this: instead of having giant switch statements in our database system that check what the plan operator type is, or what the data type is, or what expression you're trying to run in your WHERE clause, instead of iterating through those switch statements to find the exact piece of code you want, we're literally going to hard-code a sequence of instructions that does exactly what the query wants to do. The great thing about SQL and the relational model is that because everything is declarative, we know exactly what the data looks like and exactly what the SQL query wants to do. It's not like we have to roll the dice and risk getting things wrong; we can look in the catalog and know exactly what we want. And with something like a Parquet or ORC file, even if we've never seen the file before, we can look in the header, figure out whether the schema matches what we expect, and do code specialization based on that. In addition to hard-coding things, the other big difference is that the way humans write code, as I said before, may not be the best way to write code for a modern CPU. There are ways to structure code, the Volcano model being an example, that are great from an engineering perspective: easy to debug, easy to write, composable, and we can move operators around without worrying about what's below them. But as I said, all that indirection and all that branching is bad for a superscalar CPU. So instead we can design the code specialization components in our database system to emit code that no human would ever write willingly, but that is going to be best when compiled into machine code for the CPU, right? Here's a little sketch of the difference.
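To make that concrete, here is a minimal sketch (mine, not from the lecture slides or the paper) contrasting an interpreter that dispatches on a type tag for every tuple with a specialized loop where the catalog-known type and the predicate are baked in:

```cpp
#include <cstdint>
#include <vector>

// Interpreted style: every tuple pays for a type-tag switch and a branch,
// even though the column's type never changes during the scan.
enum class TypeTag { Int32, Int64, Float };

struct Value {
  TypeTag tag;
  union { int32_t i32; int64_t i64; float f; };
};

int64_t CountMatchesInterpreted(const std::vector<Value>& col, int32_t k) {
  int64_t n = 0;
  for (const auto& v : col) {
    switch (v.tag) {                 // re-checked for every single tuple
      case TypeTag::Int32: n += (v.i32 < k); break;
      case TypeTag::Int64: n += (v.i64 < k); break;
      case TypeTag::Float: n += (v.f   < k); break;
    }
  }
  return n;
}

// Specialized style: the codegen already knows from the catalog that the
// column is int32 and the predicate is "< k", so it emits a tight,
// branch-light loop the compiler can unroll and auto-vectorize.
int64_t CountMatchesSpecialized(const int32_t* col, size_t n, int32_t k) {
  int64_t matches = 0;
  for (size_t i = 0; i < n; i++) {
    matches += (col[i] < k);
  }
  return matches;
}
```

Both functions compute the same answer; the second is what "code no human would write" ends up looking like once a generator stamps one out per column type and operator.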
So take a really simple example like this. Assume again we're using the Volcano model, we have a three-way join on tables A, B, and C, and the query plan looks roughly like this: the leaf nodes are scanning, you do some aggregation, you do some filtering, and there's a final join at the top. Assuming the Volcano model, the implementations of these operators would look roughly like this: you're iterating over each input tuple from your child, applying some predicate, and pushing things up. Whether this is vectorized or tuple-at-a-time doesn't matter for our purposes here. But think about how you'd actually implement this in C++. You would have a pointer to some root node in the query plan, and then at runtime there would be a virtual function table lookup to say, oh, at the very top I'm doing a join, so call the join operator. Even there you could have different types of joins: an abstract class for the join and then specializations for each of them. But it's still going to be function pointer lookups at runtime to figure out what you actually want to do, and that's bad for a modern CPU. (I'll show a small sketch of this kind of dispatch after we go through the list of what you can specialize.) Same thing when we want to run expressions. Say we have just this WHERE clause inside this nested query; that will be represented by some abstract expression tree, with the operators as nodes and the inputs corresponding to the different elements you're accessing in the query. The dumbest way to actually run this would be to traverse the tree, walking down, evaluating every single node, looking up values in whatever context is provided for the query, substituting the values, and pushing results up. That's a naive implementation, and some systems actually do this; Postgres is a little bit smarter, and other systems are a little bit smarter even before doing compilation or code specialization, but this is roughly what you're doing at runtime: giant switch statements or function lookups to figure out what things are. Alright, well, it's a three-level tree, is that really a big deal? Always think of extremes in databases: if I have to do this for a billion tuples as I'm scanning along table B, then doing it over and over again is going to be super expensive. I won't go into too much detail on Postgres, we'll see this in a second, but they do JIT compilation now for the WHERE clause. Prior to that they used a technique called direct threading: it's an interpreter where you have an array of function pointers and you call through that. But because they're doing it one tuple at a time, that sucks for OLAP, and the JIT only helps a little with that. Alright, so the idea is that we want to specialize code in our database system for anything that's going to be CPU-intensive, anything where we know we're going to spend most of our time while running a query. We could even do it for other parts of the system, but for now assume it's just queries. We want to specialize so that there's no indirection, no lookups to figure out what the type is or what the table is, anything like that; it's literally hard-coded as if a human wrote it for exactly that query plan. Access methods, like the scans. Some systems will do this for stored procedures.
We'll see this in the case of Oracle: they'll compile a PL/SQL UDF into their restricted dialect of C (it's called Pro*C, something like C0 here at CMU), and then they'll compile that to machine code. Operator execution, like the joins and the aggregations, all of that you can compile. You can convert a WHERE clause, the predicates inside your query, that expression tree, into an actual program. And then logging operations: I don't think anybody really does this, but the idea would be that if I'm doing recovery, I could compile the interpretation of log records. For our class this semester we don't care about this, because we're not worried about recovery, but some systems can do that as well. So when do we actually do this? For anything focused on queries, we do code generation or code specialization once we have the physical plan. A SQL query shows up, we go through the parser, go through the binder, and now we have a physical plan; then we convert that into specialized code, because we want to reason about physical plans and not something higher level or more abstract. For other things like stored procedures, you would do this compilation when they call CREATE FUNCTION. Most systems do one of these, or some variation like compiling only predicate evaluation. We'll see HIQUE and HyPer doing what we'll call holistic query compilation: they take the entire query plan and then specialize and compile that. Systems like Postgres, old versions of Spark, and QuestDB do something like predicate compilation instead. The idea is that you have an existing system that interprets the query plan, and the WHERE clause is often the most expensive part; rather than rewriting the entire engine to do code generation, you just do code generation for the WHERE clause, which is a much smaller engineering lift. I would also say that we're not going to have any security concerns here, as with pretty much the entire semester, because I don't care about that stuff too much (of course, once they steal my credit card, I care). We're not worried about someone doing code injection, sending us a funky query that we convert to C code that then leaks our SSH keys. We're assuming somebody else sanitized anything we're given, so we ignore all that. If you do care about those things: in the case of Hekaton in Microsoft SQL Server, and in Oracle, they have those restricted dialects of C to prevent you from doing stupid things. But again, we assume the code generation code is written by us, the developers, and that we're not going to do malicious things in our own code, so we can ignore all that. Some of this I've already said before. The benefits we're going to get: since we know all the attribute types a priori, we don't have to do any lookup to figure out what a type actually is; it's this type of this size. Furthermore, if it's encoded, we know exactly what the compression scheme is; there are no surprises, because we know everything ahead of time. Likewise, since we know all the data types ahead of time, we can distill operations down to low-level primitives like greater-than, less-than, and equals, for which there are hardware instructions that run really, really fast. Complex data types or user-defined types are sometimes tricky, but we can ignore that. And then we're going to try to avoid function calls in loops as much as possible.
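Here's the dispatch sketch I promised: a hypothetical miniature of the Volcano-style interpretation we're talking about (the class and method names are mine, not from any real system). Every Next() call and every Eval() call is a virtual dispatch per tuple that the CPU has to predict:

```cpp
#include <memory>
#include <optional>
#include <vector>

using Tuple = std::vector<int64_t>;

// Volcano / iterator model: every operator pulls tuples from its child
// through a virtual Next() call, one tuple at a time.
struct Operator {
  virtual ~Operator() = default;
  virtual std::optional<Tuple> Next() = 0;
};

// Expression trees are interpreted the same way: one virtual Eval() per
// node, per tuple.
struct Expr {
  virtual ~Expr() = default;
  virtual int64_t Eval(const Tuple& t) = 0;
};

struct ColumnRef : Expr {
  size_t idx;
  explicit ColumnRef(size_t i) : idx(i) {}
  int64_t Eval(const Tuple& t) override { return t[idx]; }
};

struct Add : Expr {
  std::unique_ptr<Expr> l, r;
  Add(std::unique_ptr<Expr> a, std::unique_ptr<Expr> b)
      : l(std::move(a)), r(std::move(b)) {}
  int64_t Eval(const Tuple& t) override { return l->Eval(t) + r->Eval(t); }
};

// A filter operator: two indirect calls (child Next, predicate Eval) for
// every tuple that flows through. Multiply by a billion tuples and all
// this indirection is exactly what code specialization removes.
struct Filter : Operator {
  Operator* child;
  std::unique_ptr<Expr> pred;
  Filter(Operator* c, std::unique_ptr<Expr> p)
      : child(c), pred(std::move(p)) {}
  std::optional<Tuple> Next() override {
    while (auto t = child->Next()) {
      if (pred->Eval(*t) != 0) return t;
    }
    return std::nullopt;
  }
};
```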
Now, we'll see that VectorWise and the other systems that use pre-compiled primitives are going to have function calls in loops, but they amortize that function call or jump cost because they operate on batches of tuples. If you paid that cost for every single tuple it would suck, but per batch it's much less of an issue. So as I said before, there are basically two ways to do this, and at a high level they achieve the same thing; they're just different ways of generating the code we then compile for our queries or predicates or whatever we're doing. Transpilation is also sometimes called source-to-source compilation, and the idea is that we have code in our database system that generates other source code, and then we run a traditional compiler to produce machine code that we link in and execute as a shared object in our system. Oftentimes you'll see C++ code that generates C++ code, which you then compile; Amazon Redshift is famous for this. Then, in the paper you guys read, HyPer basically has its own intermediate representation: they generate a lower-level implementation of the query and compile that into native code, or, as we'll see in a second, they can actually interpret it too. They even generate assembly; they do crazy stuff, but that's in the later version, in Umbra. The point is they're not generating typical C++ code; they're generating a lower-level representation, and we'll see the trade-offs between the two. So we'll go through transpilation first and then HyPer-style JIT compilation. The first system in the modern era, I'll say the late 2000s into the 2010s, to do this code specialization and query compilation was a thing called HIQUE, an academic system from the University of Edinburgh. What they would do is, for any query that shows up, generate C++ code for it, then do a fork-exec into GCC, which compiles it into a shared object, and then link that into the database system process. The way this works is that the program you generate for the query has to implement a known function with a known signature, so that the database knows how to call into that shared object and run the code. It has to look like a standard entry point; think of it like the main function of the program, without obviously calling it main. (There's a rough sketch of this harness below.) For this they were just using an off-the-shelf compiler, I think it was just GCC; they weren't trying to embed anything. This is a precursor to the LLVM stuff. So what's one big problem with this? Yes: compilation is slow. Why is GCC slow? It's doing a fork-exec, and what does GCC do when it starts? It goes and reads a bunch of config files as if you were running it from the command line. It wasn't designed to run in the critical path of a query in a database system; it's a general-purpose compiler, so it does all the stuff a general-purpose compiler does: go look for where my libraries are, link those in, parse config files, figure out what I'm allowed to do. And of course it's a fork call, so now I've got to talk to the OS to spawn a new process and run it on the side, and GCC is going to allocate memory and do a bunch of other stuff. That's going to be one of the big problems with this approach. But we'll see why, from an engineering perspective, this is actually easier to maintain and debug compared to the LLVM stuff.
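Here's that harness sketch: a hypothetical, stripped-down version of the transpilation flow (the file names, the entry-point name ExecuteQuery, and the error handling are my simplifications, not HIQUE's actual code). Write out the generated C++, shell out to the compiler, then dlopen the shared object and look up the agreed-upon entry point:

```cpp
#include <cstdio>
#include <cstdlib>
#include <dlfcn.h>
#include <fstream>
#include <string>

// The agreed-upon entry point every generated query program must export.
// The generated source defines it with extern "C" so the name isn't
// mangled and dlsym can find it.
using QueryFn = long (*)(void* exec_ctx);

QueryFn CompileAndLoad(const std::string& cxx_source) {
  // 1. Dump the generated source to disk.
  std::ofstream("/tmp/query_42.cpp") << cxx_source;

  // 2. Fork-exec the compiler (system() forks and execs a shell for us).
  //    This is the expensive step HIQUE pays on every query.
  if (std::system("g++ -O2 -fPIC -shared /tmp/query_42.cpp "
                  "-o /tmp/query_42.so") != 0) {
    return nullptr;  // compilation failed
  }

  // 3. Link the shared object into our own address space.
  void* handle = dlopen("/tmp/query_42.so", RTLD_NOW);
  if (handle == nullptr) return nullptr;

  // 4. Look up the known entry point and hand it back to the executor.
  return reinterpret_cast<QueryFn>(dlsym(handle, "ExecuteQuery"));
}
```

Once loaded, the engine just calls the returned function pointer as if the query program had shipped with the system.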
So at a high level it looks like this. Say this is some simple query, written here as pseudo-Python code. For this get-tuple operation, if you were interpreting the plan, you would first look up which table you're accessing and check the catalog to figure out what the schema is. Now, you wouldn't do this on a per-tuple basis; you'd obviously cache it outside the for loop, but you have to do it at least once. Then, to get a tuple within a page or block, you take the offset and the tuple size and return a pointer to the tuple. Then, in this if clause, to evaluate the predicate, I'm traversing that expression tree I showed before, pulling the values up, checking whether things match, deciding whether to terminate early or short-circuit, and then returning true or false. And maybe you have to cast things to the right data type as you go along, based on the attributes you're looking at. So again, this is a high-level explanation of what the query is actually going to do, and maybe some of these things you can cache up front, but if you're doing this on a per-tuple basis, it's a bunch of wasted work repeating the same thing over and over again. What HIQUE would do is generate stubbed code like this, where you set up some parameters that are given to you when you invoke it (think of it as a function for the query), and now within my for loop I'm not doing any lookups to ask what the size of the tuple is or what the offsets are; they're literally hard-coded values, right? Same thing for evaluating the predicate: the compiler can recognize that some number plus one can be folded and evaluated once, not repeated over and over again. Yes, the question is about this eval-predicate step: we extract that out, see that it's a plus one, and hard-code it into the program. Question: is there ever a case where the expression tree cannot be expressed in C++? No, why wouldn't it be? I can't think of anything. For IN clauses you maybe can't generate the exact instructions and might need an STL container like an array or a vector, but everything you'd need is in the standard library; you can get it down to pretty much the exact instructions. Going back to whatever that query was... I didn't show a query, sorry. If the expression was WHERE value equals input parameter plus one, that plus one is something you don't want to re-interpret over and over; since you have the input value, you can bake it into the exact C++ code, and then the equality test is just C++, and the compiler knows how to go to town on that and optimize it as much as possible. Yeah, the plus one is from the SQL; I might have removed the SQL query from the slide by accident. The query is basically SELECT * FROM table WHERE value = input parameter + 1. And whether or not your query optimizer evaluates that plus one before it hands the plan forward depends on the optimizer implementation. Below is a sketch of what the specialized version of that loop could look like.
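As an illustration (mine, not HIQUE's actual output), here is roughly what generated code for that SELECT * FROM table WHERE value = :param + 1 scan could look like once the schema and the folded constant are baked in. The layout numbers are hypothetical catalog values:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical generated code for:
//   SELECT * FROM A WHERE value = :param + 1
// The codegen read the catalog, so the tuple layout is hard-coded:
// 16-byte tuples, "value" is an int64 at byte offset 8. No catalog
// lookups, no type tags, no expression tree left at runtime.
extern "C" long ExecuteQuery(const uint8_t* block, size_t num_tuples,
                             int64_t param, std::vector<int64_t>* out) {
  const int64_t target = param + 1;    // constant-folded once, not per tuple
  constexpr size_t kTupleSize = 16;    // from the catalog, baked in
  constexpr size_t kValueOffset = 8;   // ditto

  long matched = 0;
  for (size_t i = 0; i < num_tuples; i++) {
    const uint8_t* tuple = block + i * kTupleSize;
    int64_t value;
    std::memcpy(&value, tuple + kValueOffset, sizeof(value));
    if (value == target) {             // plain hardware compare
      out->push_back(value);
      matched++;
    }
  }
  return matched;
}
```

Note that it exports the same ExecuteQuery entry point the loader sketch above looks up, which is the "known function with a known signature" convention.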
Question: for the systems that do predicate compilation, did their query optimizers not do any of this expression folding ahead of time, instead letting the compiler do it all for them? For systems that retrofitted compilation after the fact, like I think Postgres did, yes, they'll let the compiler do that, because they didn't have it before. But obviously, if you know you're always going to compile stuff, then you can just take advantage of that, as well as the additional optimizations a traditional source-code compiler gives you. What's interesting about the HIQUE approach is that it's us, the database system engineers, developing the code generation piece, so we can do anything we want in the generated code. We can invoke any other part of the system: we could call out to the network code to send messages if we wanted, we could move data in and out of the buffer pool manager, we could run transactions. We can do pretty much anything, because it's as if the generated code had shipped with the database system when the engineers built it. That means we don't need any specialized bridges to call out to other parts of the system; we just invoke things as if they were built-in functions. You see this a lot, not exactly as code specialization, but in the extensions for Postgres: because they're just linked-in shared objects, they call into all parts of the database system, for better or worse, because they can, because it's just C code. Of course you have to manage memory, but that's a whole other issue. The one key advantage, even though the compilation is slower, is that debugging the transpilation approach is relatively easier, easier relative to the compilation approach in HyPer. Because what are you generating? C++ code. What if it crashes? You have C++ code you can walk through in a debugger; you have nice stack traces, you have nice symbols, you can figure out what broke. Now, you've got to do a little extra work to annotate the generated C++ or C code so it references back to what part of the main system generated it, because I don't want to debug the thing I just codegen'd, I want to debug the system that generated the code, but you can put annotations in there to track where it came from. So you can take any off-the-shelf C++ programmer, and they know how to use GDB, in theory, and they can debug your program. You don't need specialized people who understand LLVM or assembly to make all this work. That'll be one of the things we see in Redshift, and when we read the Photon paper from Databricks later on: they talk about how the engineering burden of debugging an LLVM JIT compiled system is very, very high, so they decided to avoid it. Now I'll show some results from the HIQUE paper, where they compare against four other implementations. The first is a generic Volcano-style database system using generic predicates, expression trees, and evaluating them; this is like version zero of any system, the way so many systems are first built, think of BusTub. Then they have type-specific iterators, using I think C++ templates to inline as much as possible. Then there's a hard-coded implementation, sort of a first pass written by a grad student, and an optimized version written by another grad student. And then you have HIQUE itself, generating the C++ code. The unit of measurement here is so small that we'll ignore the absolute numbers, and the results are as you'd expect.
The generic, textbook implementation of the database system is the slowest, and as you go along and add more optimizations, things get better and better. The part we really care about is over here, where the grad student hand-wrote optimized query plans (this is TPC-H, I think). The difference is quite small, but HIQUE is able to generate C++ code that beats the hand-written code. This makes sense if you think about it: you only need to build the codegen piece once, and you can put every trick you know for making queries run fast into that one implementation, whereas if you're literally hard-coding the queries, you have to optimize every single one individually. The generated versions end up faster because there are fewer branch misses, less memory allocation, and fewer jumps to functions, since everything is essentially inlined. Yes? Sorry, how do you know when an L2 cache miss occurs, how do you collect this data? Oh, I haven't taught you guys about performance counters. Basically, the CPU can track all this for you: every CPU has performance counters, so the CPU itself maintains all this information about your programs, and you use a tool like Intel's VTune or the open-source equivalent, perf. It's almost like Valgrind, except Valgrind instruments the code as it runs, while here you don't instrument the code at all; the CPU just counts it all for you. You turn on perf, there's very little performance overhead, and it dumps out a perf file you can look through to see where the cache misses actually were. Who here is familiar with perf? About half. Maybe we can cover that in a week or two, a crash course on how to do performance debugging. The CPU can measure all this stuff for you, down to L1 cache misses, though here we're ignoring those because they'd be way too small. If you're curious, there's a tiny sketch below of the interface that perf uses under the hood.
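Quick digression, a minimal, Linux-only example (the structure follows the perf_event_open man page; the measured loop is just a stand-in) of reading a hardware counter programmatically, which is the same mechanism perf and VTune sit on top of:

```cpp
#include <cstdint>
#include <cstdio>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Count retired instructions for the current thread around some work.
// Swap config for PERF_COUNT_HW_CACHE_MISSES, branch misses, etc.
int main() {
  perf_event_attr attr{};
  attr.type = PERF_TYPE_HARDWARE;
  attr.size = sizeof(attr);
  attr.config = PERF_COUNT_HW_INSTRUCTIONS;
  attr.disabled = 1;        // start stopped; we enable it explicitly
  attr.exclude_kernel = 1;  // count user-space work only
  attr.exclude_hv = 1;

  // pid=0, cpu=-1: this process, any CPU. There's no glibc wrapper,
  // so we use the raw syscall.
  int fd = static_cast<int>(
      syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0));
  if (fd < 0) { std::perror("perf_event_open"); return 1; }

  ioctl(fd, PERF_EVENT_IOC_RESET, 0);
  ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

  volatile int64_t sum = 0;                      // the "query" we measure
  for (int i = 0; i < 1000000; i++) sum += i;

  ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
  int64_t count = 0;
  read(fd, &count, sizeof(count));
  std::printf("instructions: %lld\n", static_cast<long long>(count));
  close(fd);
  return 0;
}
```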
Alright, so this paper is around 14 years old at this point, so it's running on a really old CPU that's been exceeded by everything else, a Core 2 Duo. The exact numbers don't matter; it's the relative performance difference that matters, and even on a modern CPU it would look the same. So again, as I said, what's the problem with this approach? The compilation cost. How much time does that take? Here they compare the compilation cost of the HIQUE-generated source code when compiled with -O0 versus -O2. As I said, we don't ship database system software compiled with -O3, because it might reorder things and actually run slower than it would have with just -O2, so -O2 is what you want to ship with; -O0 is what you use when you're trying to debug something, and since it's far less aggressive about optimization, its compilation cost is lower. Again, old hardware, but it goes to show that, first of all, you see almost a 3x difference in compilation cost between -O0 and -O2, and now we're talking 600 milliseconds to compile this query, while at scale factor 1 you can probably run the query itself in 10 or 20 milliseconds. So this is a big problem: if your query is only going to run for a fraction of the compilation cost, you might have been better off skipping the compilation step entirely and just running the old interpreted version. Now, if your query is going to run for five hours and it takes 600 milliseconds to compile, who cares. But the disk has gotten faster, the network has gotten faster, CPUs not as much, and with a well-built execution engine some queries can finish in less than 10 milliseconds, so your compilation cost becomes a big problem. The HyPer paper you guys read doesn't solve this; they solve it afterwards, and we'll cover that in a second. But in HIQUE's case, because it's forking GCC, it's a much more expensive compilation process. Yes, the question is, what do people do with the compiled code? Do they just throw it away, or can they cache it? We'll see Redshift in a second: they cache it, and they cache everything. HIQUE was just an academic system, but you can imagine doing it, because you can potentially parameterize the generated code, turning it into a function where I pass in the input value. I might not get the constant folding I'd want, but at least the compilation cost gets paid less often. And then you essentially end up with VectorWise, because that's what they're doing, except they pre-compile everything when they ship the database system, not at runtime per query. But yes, caching helps if you're going to see the query again. Yes: is that what Redshift AQUA is? AQUA is a hardware accelerator; we won't cover that. Even before you get to AQUA, they cache things. Okay, so as I said, relational operators are a great way to reason about queries, and we have these composable query plans where we can move operators around anywhere we want without worrying about what's feeding into what, because they're just passing tuples. But that's problematic when the time comes to execute, because a literal translation of the relational algebra query plan into C++, or whatever you're compiling into, may not be the most efficient way to run it. As I said, there's the long compilation cost of C++. HIQUE also wasn't supporting full pipelining: it was still processing one tuple at a time from one operator to the next, and because it's from the late 2000s, the vectorization stuff and the other things we've talked about so far came much later. Alright, so the HyPer paper you guys read is not an easy paper, so hopefully no one spent too much time on the appendix, the IR stuff; I think the assignment said not to read it. What they do in HyPer is, rather than generating C++ code, they generate LLVM IR directly and then compile that into machine code. In theory you end up with the same machine code as the HIQUE approach, but you're not generating C++ and then converting that to machine code; of course GCC would lower the C++ into its own IR anyway, whereas here you go directly to the IR, using a bunch of C++ macros that compose the program for the query plan, hand it off to LLVM, and let LLVM compile it. Now, the challenge with this paper is that, because he's a genius, he's introducing two key concepts at once: this code compilation stuff with LLVM, but also push-based execution. That makes it seem like you can't have one without the other; that's not true.
But he's showing how you can use operator fusion to design these query plans to be very efficient, trying to keep query tuples up in CPU registers for as long as possible, and the push-based approach is how you do that. So let's look at a really simple example, the query we showed before. We know how to divide the query plan into pipelines, right? A pipeline breaker says we can't start executing the next pipeline until all the tuples have been processed by the child pipeline below it. We'll cover next class how to take these pipelines, divide them up into tasks, run them on different cores and different nodes, and schedule them; that's next week. For now, assume it's a very simple dependency graph: one and two can run in parallel, but three can't run until two finishes, and four can't run until one and three finish. What HyPer is going to do is generate essentially a bunch of nested for loops; it's the push-based model. These nested for loops, within one pipeline, do as much work as they can for a single tuple and only go back to the next iteration of the loop once they've done everything they can with that tuple. You can think of the boundaries here as these pipelines, and at the end you have this long pipeline where you're iterating over tuples of C, probing into the first hash table, probing into the second hash table, and then emitting the result when it's done. So in this case, for any tuple of C that matches the join clauses against the other two tables, I do all that processing up the pipeline before going back for the next tuple. And they're going to be very careful to keep this data in the CPU registers themselves, rather than saying, here, it's in memory, I hope it makes it; they're actually going to be careful to keep it in registers and not bring anything else in until they have to go back and get another tuple. Below is a rough sketch of the kind of program this produces. And again, these arrows are just showing the dependencies between the pipelines.
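Here's a hand-written approximation (mine; HyPer emits LLVM IR, not C++, and I'm ignoring details like grouping state and output batching) of the shape of the program that data-centric code generation produces for a plan like σ(A) ⋈ B ⋈ C: one loop per pipeline, with the final probe pipeline fused into a single pass:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct TupleA { int64_t key; int64_t payload; };
struct TupleB { int64_t key; int64_t payload; };
struct TupleC { int64_t a_key; int64_t b_key; int64_t val; };

// Pipeline 1: scan A, filter, build hash table (pipeline breaker).
// Pipeline 2: scan B, build the second hash table (pipeline breaker).
// Pipeline 3: scan C and run each tuple all the way up: probe both hash
// tables and emit, all inside one loop body, so the current tuple of C
// can stay in registers the whole time.
void ExecuteQuery(const std::vector<TupleA>& A, const std::vector<TupleB>& B,
                  const std::vector<TupleC>& C, std::vector<int64_t>* out) {
  std::unordered_map<int64_t, int64_t> ht_a;
  for (const auto& a : A) {
    if (a.payload > 100) ht_a.emplace(a.key, a.payload);  // fused σ(A)+build
  }

  std::unordered_map<int64_t, int64_t> ht_b;
  for (const auto& b : B) {
    ht_b.emplace(b.key, b.payload);
  }

  for (const auto& c : C) {             // the long probe pipeline
    auto a_it = ht_a.find(c.a_key);
    if (a_it == ht_a.end()) continue;   // no match: next tuple immediately
    auto b_it = ht_b.find(c.b_key);
    if (b_it == ht_b.end()) continue;
    out->push_back(a_it->second + b_it->second + c.val);  // emit
  }
}
```

Notice there are no operator boundaries left: the filter, both probes, and the projection have been fused into plain loop bodies, which is exactly the operator fusion he's describing.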
So for the comparison: he compares against the initial version of HyPer that he wrote, which did code generation the HIQUE way with C++, and against his LLVM-based approach. He also compares against VectorWise, MonetDB, and a commercial row-store DBMS. Again, this is 2010, 2011, so these were the state-of-the-art OLAP systems at the time. The commercial system is there to show what happens if you have a Volcano-based system that's not a column store; it's the worst-case scenario. I think he calls it DBMS X in the paper, because Oracle has a license agreement that says you can't name them in any benchmark. What's that? It's Oracle, yeah, but the paper's over ten years old, so who cares. Maybe they care, I don't know. Question: why is Oracle faster than MonetDB here? Yeah, that one I don't know. It could be the query optimizer: MonetDB at this point was maybe ten years old, while Oracle was 30 or 35 years old and had millions of dollars put into its query optimizer, whereas MonetDB was a small academic team. Maybe MonetDB is picking a bad query plan; we'll see bad query plans later on. And here, for Q2, VectorWise crashed, so they didn't get a result. Again, I don't care so much about the difference between Oracle and everything else; the thing we care about is the HyPer LLVM-based version versus everyone else. MonetDB, for its part, has an interpreter that uses opcodes, a bit like SQLite, and we'll see that in a second, so it isn't traversing the query plan tree the way we saw before; it's a little better than that, but it's still not an exact program for the query plan. The reason the LLVM version of HyPer beats the C++ version of HyPer is that they can be more aggressive about pipelining, keeping tuples in CPU registers, because they have low-level control over what goes in those registers when they generate the IR. In C++ you're hoping the compiler figures that out for you, and it doesn't always, whereas with the LLVM approach you have exact control. In the case of Q1 it's just a WHERE clause and an aggregate, there are no joins, and the WHERE clause is simple enough; riding things up that single pipeline as far as possible is why they get better performance here. Alright, so compared to the other systems this is pretty significant, but again, this is measured in milliseconds, it's over ten years old at this point, and scale factor 1 is only one gigabyte, so it's not that big. But you can see how we're getting down to almost the limit of how much better these systems can actually get. Yes, the data is going to get bigger, but if you have to process more data, there isn't a -O100 flag that's going to make the extra instructions go away; code specialization is what gets us almost to bare-metal speeds. So these results are pretty good. Now, to give you an idea of the compilation costs, and I know this wasn't in the paper you guys read: this is not a true scientific evaluation of the two approaches, because I'm taking the numbers from the HIQUE paper and the numbers from the HyPer paper, which were run on different hardware at the time. But you can roughly see the performance difference between the two systems. HIQUE is doing transpilation, so it's generating C++ code, invoking GCC, taking the shared object, linking it in, and running it, whereas with HyPer and LLVM it's all in the same address space: a separate thread runs the compilation step, and you can initialize LLVM once when the system starts without re-initializing it over and over. As you can see, the compilation times get down to under 20 milliseconds, which is not great, but it's acceptable. So, between HIQUE and HyPer?
That's this: HyPer C++ is an approximation of HIQUE. It's actually better than HIQUE, because I think it's still tuple-at-a-time but already using the push-based model. In the case of VectorWise, they're using pre-compiled primitives and actual vectorized execution. I don't know whether they were using SIMD; again, this is 2011, 2012, so maybe AVX or something like that. And there are other factors too; it's not always a true apples-to-apples comparison. For example, the way HyPer does fixed-point decimals for Q1 is really efficient: it's literally a 128-bit number it can rip through with hardware instructions, with no lookups and no casting of varchars to figure out where the decimal point is, versus what I think VectorWise did at the time. So the difference here isn't entirely because of compilation. But I would say that for these queries it's the same data types, so it's a fair comparison of the two approaches. Okay, so as I've said multiple times today, the big problem is going to be compilation time. If you can do it ahead of time, that's fantastic, but oftentimes in an OLAP system you've never seen the query before, at least not that exact query, so you have to go ahead and compile it. In the case of HyPer, they observed that the compilation time of a query grows super-linearly with the complexity of the query: not how much data you're accessing, but what's actually in the query, the number of joins, the aggregations, the predicates, the WHERE clauses, and so forth. For OLTP this is not that big a deal, because those applications run the exact same queries over and over again, maybe just parameterized, and the database system can automatically convert them to prepared statements, so we compile once and reuse. But for OLAP queries, as I said, if we've never seen the query before, this is a problem. I don't think this is in the paper, but one of the examples they told me about: after HyPer got acquired by Tableau, they had to make it Postgres-compatible, because it shipped in the Tableau product as a query accelerator. What would happen is people would install HyPer, and the very first thing they'd do was hook up pgAdmin to it, which is a graphical web interface for Postgres databases where you can see all your tables, write queries, and so forth. When you start pgAdmin, the very first thing it does, to figure out what tables you have, is run queries against the catalog: you have these tables of these types, and so forth. Pointed at a regular Postgres database, those queries run instantaneously, but with HyPer, because of all this compilation, there would be a 20-second pause for queries that aren't touching much data but have a bunch of complex joins against the Postgres catalog, so the administrative interface would hang for 20 seconds and it looked like everything was unresponsive. Twenty seconds doesn't seem like a lot, but if you're used to instantaneous access to your Postgres database through pgAdmin, something's wrong. So they came out with another approach, a 2018 follow-up to the paper you guys read, using a technique called adaptive execution. The idea is that they still do the same thing as in the paper you guys read, where they generate the LLVM IR.
But then, instead of waiting to run the query until the compilation step finishes, they have an interpreter that they wrote that starts interpreting the LLVM IR and running the query right away. In the background, they run the LLVM compiler to compile that IR into machine code, and if compilation finishes before the query does, they just slide in the compiled code, replace the interpreter, and the query runs super fast from then on. And we'll see in the paper you guys read for next class, the morsels paper, that there's a natural boundary between plan tasks: a worker thread processes some chunk of data, and when it's done, it asks, what's the next thing to do for this query? When it retrieves that next task, it can check whether the compiled version of that task is ready, and if it is, it just slides in. So it's not like you're interpreting and then get cut off in the middle of a for loop; when I'm done running my batch, I go get the next thing, and maybe that's the compiled version of the task rather than the interpreted version. All of this happens seamlessly. The overview diagram here shows the numbers they reported in their paper, so let's walk through how it all works. The SQL query shows up, and you run it through the query optimizer; for them that takes roughly 0.2 milliseconds. HyPer also has one of the state-of-the-art query optimizers, we'll cover that later, so that's a pretty good optimization time. Then they take the physical query plan and run it through the code generator, which generates the LLVM IR, and that takes roughly 0.7 milliseconds; this part is pretty trivial, you're measuring microseconds. Now at this point there are three different branches, and these can run in parallel. First, you take this LLVM IR and run your own bytecode compiler, which converts the IR to your own bytecode, and they have an interpreter they wrote that executes it. I think they told me they wrote all of this themselves: there were open-source LLVM IR interpreters, but Thomas didn't like any of them, so he wrote his own in two weeks. We imported our own German here, a visiting master's student from Germany, and it took him about a semester to write the same thing, so it's not impossible. It comes down to this: you don't have to implement all of LLVM IR, just whatever your code generator emits, and you control the whole stack, so you can do it. Then, in the background, they also take this LLVM IR and run the regular LLVM compiler with -O0, which generates x86 code, and when that's available you can slide it in. The bytecode compiler takes 0.4 milliseconds; the -O0 pass takes, say, 6 milliseconds, so if the query is still running after 6 milliseconds, you can slide that in. And if the query is really taking much longer, you can kick off the more expensive compilation step, running with -O2, which takes 25 milliseconds, and then you have more optimized x86 code. Question: isn't the bytecode just x86 code? No, it's more like JVM bytecode. It could just have been LLVM IR directly; in our system we interpreted LLVM IR directly, and I think they wrote their own format. But again, it's not doing the heavyweight compilation, the hoisting and so on, that GCC or Clang would do.
So there are a couple of benefits to this. Obviously, the speed: if my query is really simple, then the bytecode interpreter alone might be enough, and this is basically what SQLite does, they have their own VM that interprets opcodes. But if the query is going to run longer, then waiting 6 milliseconds to get machine code is a trivial amount of time. Another benefit is that you can now debug failed programs more easily, because you have this interpreter for your bytecodes at the top. If you run the compiled version and it crashes, you can go back and figure out which lines of generated code, which opcodes or machine code, failed, and then step through at the interpreter level with a regular debugger and figure out, here's the line of code I'm tripping up on. Yes, you're asking about the case where you're using the bytecode compiler? Correct, yes. Am I saying that if your query is short, the bytecode compiler is going to be better than anything down here? It's not C++, it's opcodes, like JVM opcodes, but we're doing JIT here; this is the JIT stuff. What kind of code, machine code or C++ code? What are you asking? Ah, your question is: if I did none of this code generation, if I just had a BusTub-style interpreter system, versus having this bytecode step, is this step always going to be better? Well, for really simple queries, SELECT 1, maybe not, but this will handle the short things without going through the expensive step, and then when the bigger queries and more complex things show up, BusTub is going to choke and this thing is going to rip through them, so it can handle both. The next question is: can I switch from the bytecode to the x86 code mid-query? Yes. I take a query plan, I break it up into pipelines, and I'll have different task instances of each pipeline, each processing, say, a thousand tuples. My thread is running the bytecode version; I process a thousand tuples, and when I'm done I go check: hey, is the compiled version ready for me? If it is, I just slide it in and process the next thousand tuples with it. The logic is exactly the same, so there's not going to be any difference in the results generated by the bytecode version versus the compiled version. (There's a little sketch of this hand-off below.) Yes, the question is: can the optimizer be smart enough to recognize, oh, this query is SELECT 1, I don't need to read any data or do anything expensive, so just skip all this? Yes, but when you start doing joins, and we'll see this in a few weeks, the optimizer's estimates can be way off. So a simple threshold might be: run the bytecode version, always kick off compilation in the background, and maybe some trigger says, okay, after 20 milliseconds, that's enough, switch over. Yes, so, building on his point: can I say that in the average case it might be worse, but in the worst cases it would be really good? Yes. And in that best case, is the interpreted system going to be better than compilation? It depends on the hardware, it depends on what you're doing, right?
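A tiny sketch of that hand-off (all names are mine; the real system does this per pipeline task inside the morsel framework): the worker checks an atomic function pointer between batches, so the swap never interrupts a batch mid-flight.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Both the interpreter and the compiled code implement the same contract:
// process tuples [begin, end) of this pipeline and return how many matched.
using PipelineFn = int64_t (*)(size_t begin, size_t end);

// Stand-in for the bytecode interpreter executing the pipeline.
int64_t InterpretPipeline(size_t begin, size_t end) {
  return static_cast<int64_t>(end - begin);
}

std::atomic<PipelineFn> g_compiled{nullptr};  // set by the LLVM thread

// The background compilation thread calls this when machine code is ready.
void OnCompilationDone(PipelineFn fn) {
  g_compiled.store(fn, std::memory_order_release);
}

int64_t RunPipeline(size_t num_tuples) {
  constexpr size_t kBatch = 1000;  // check for compiled code between batches
  int64_t result = 0;
  for (size_t i = 0; i < num_tuples; i += kBatch) {
    size_t end = (i + kBatch < num_tuples) ? i + kBatch : num_tuples;
    PipelineFn fn = g_compiled.load(std::memory_order_acquire);
    if (fn != nullptr) {
      result += fn(i, end);                 // compiled version is ready
    } else {
      result += InterpretPipeline(i, end);  // otherwise keep interpreting
    }
  }
  return result;
}
```

Because both paths implement the identical contract, the results are the same no matter when (or whether) the swap happens.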
Question: for what kind of companies would milliseconds matter for a single query? Alright, so at the extreme end you have the high-frequency trading guys: those guys are running their own fiber under the river, and they want to measure queries in microseconds. Is this an OLAP thing? It's for both; you could use this for both. So the high-frequency trading guys care a lot. For us mortals, the conventional wisdom is 50 milliseconds, and that's usually for a transaction; that number comes from internet advertising auctions. When you visit a web page and you're not using an ad blocker, the hosting company, or whoever's running the ads, say it's Google, sends a request out to the different advertising brokers, the auction houses, saying here's this user visiting this web page and here's everything we know about them, and I think the contract is that you have to respond within 50 milliseconds with your bid for serving the ad. It might have gone down to 40 or 30 milliseconds, but that's roughly what it is. So the response time you have to hit is 50 milliseconds, and accounting for round-trip times, you've got to run a query within that window to decide whether you want to sell an ad to this person. There's another number, too, I think from Amazon, that says for every 100 milliseconds the Amazon product page is slower, they lose a million dollars. I don't know if that's apocryphal, but there's some correlation between slower pages and fewer sales. I've never met anybody who says, my query is running slower? Yeah, great. And certainly there's a cost element too: you can pay millions and millions of dollars to get that microsecond or sub-microsecond latency, and maybe you don't need it, or it's not the right trade-off, it depends, but basically everybody would benefit from this. Okay, so in terms of the performance you can get from these different queries: you have the three different phases, the bytecode version, the unoptimized LLVM, and the optimized LLVM. You can see, and this is log scale, that there's a pretty big jump between the bytecode version and the simple LLVM -O0 pass, and then depending on the complexity of the query, the gap between the -O0 and -O2 versions will vary. Again, this is scale factor 1, and all the queries run on a single thread, so the other threads can do the compilation. And again, it's not just about performance; it's also about the debuggability and maintainability of the system. Alright, so I want to do a quick run-through of a bunch of different systems doing different variations of query compilation and code generation. I've broken them up loosely into four categories; these are not scientific definitions, just how I think about it, an easy way to categorize and understand things. Transpilation is the source-to-source stuff we started with in HIQUE. The custom ones are on the other side: different systems doing different things, sometimes using LLVM, sometimes not, or the CLR if it's Microsoft stuff. Then there's a bunch of Java databases doing just-in-time compilation with the JVM, and then here are all the ones using LLVM. The skull and crossbones means the system is dead; we've killed two at CMU, not proud of it, but it happened.
Alright, so the very first system that did code specialization was one of the first relational database systems, and as is so often the case in databases, IBM did it first. Before System R there was a precursor out of the UK called the Peterlee Relational Test Vehicle, which sounds like a prog rock band, but that was a prototype; System R was the first real one. Back in the '70s, they had an early implementation of code specialization, of code generation, for running queries. Think about the 1970s: the hardware was terrible, the CPUs were slow, disks were slow, everything was terrible. If you had to interpret the query plan on a really slow single-threaded CPU, it would take forever. So what they would do is, after the query came out of the optimizer, they would generate IBM System/370 assembly for it, essentially stitching together a bunch of code templates to do the scans and the joins and so forth, and run that as the query plan. Yes? This is the '70s, I have no idea, but the compilation cost must have been high then as well. Although the function calls were always expensive too, because the CPU sucked; mostly, though, it was engineering reasons why they abandoned this. So they built this in System R, and System R was a groundbreaking project; Ingres was being built at Berkeley at the same time, and Oracle came a little later in the '70s. For System R, they got about eight people, all brand-new PhDs in mathematics and CS, and tried to actually build it. One woman built the query optimizer, the first cost-based query optimizer; two other people went off and designed SQL; and one dude built nothing but this code generation stuff. When the System R project ended, they took some bits and pieces out of the code base and put them into the two commercial relational database systems IBM was building, SQL/DS and DB2, but they didn't carry over any of this codegen stuff. There's a retrospective paper, I think from 1981, so at that point System R was eight or nine years old, where they talk about why they decided to do this code generation in the early version of System R. The problem was that, since it was a brand-new system, everything was always in flux: any time an API changed, it broke the code generation stuff, and you had to go rewrite all of it. It just became too much of an engineering overhead to keep changing it over and over. Plus, when a query plan failed because of some bug in the code generation code, the engineers had no easy way to link the generated assembly back to the code that produced it. So when IBM went off and built the new systems, they threw that away. Yes: why did that not deter the Germans? Because they wrote their own debugger; we looked at that a second ago. Yes: my understanding from the paper was that most of the advantages of this come with OLAP queries, so what possessed them to say, okay, we really need query compilation, back in the '70s? Well, first of all, there were no other relational systems at the time; it's not like they could look at how someone else did it and follow them. They were literally inventing this.
This was terrible in the 70s: your cell phone is hundreds of times more powerful than what they had back in the day, so everything was super, super slow. If you then have to literally do lookups, interpret the query plan, look at the types — the query runs forever. They were trying to make do with the hardware they had at the time, so I think it was a perfectly plausible approach when they did it. It's just, again, hard to maintain.

Alright, VectorWise. We keep mentioning this over and over again, but I want to bring it up once more. What they're going to do is not compile queries on the fly — sorry, they're not going to be code-generating on the fly. Instead, they pre-generate all possible combinations of anything you would want to do to data in a query plan, for any possible data type, ahead of time. So think of hundreds of primitive functions that do things like: for a vector of int32 integers, here's the less-than version, here's the greater-than version, here's the greater-than-or-equal version. They pre-generate all of these functions using a bunch of scripts, compile all of that, and ship the binary — your database system binary just has them all in there, right? So, okay, it's an int32 column running against less-than some constant? Here's the pre-compiled primitive I want to use for that. Then you basically build an array of function pointers to form the pipelines for the queries. And as I said before, jumps to functions are bad for modern CPUs, but if you're passing along these batches or vectors of tuples, that function-call cost gets amortized, right?

So basically it looks like this: say you have some query like this with a simple filter — a string column equals 'abc' and a column equals 4. Then you would literally have some pseudocode-looking thing like this sitting in your source code that you compile: one primitive for the string, one for the integer. At runtime you're literally calling these one after another, passing the vectors along, and maintaining vectors of which tuples actually match. Now, you wouldn't allocate the memory inside the function like I'm doing here — you'd pass that along and keep track of which tuples actually match — and then there's all the SIMD stuff we talked about before, deciding whether you want to keep processing with the selection vector or re-fill it.
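To make that concrete, here's a minimal sketch of what one of these pre-generated primitives might look like. The names and the exact selection-vector convention are my own illustration, not VectorWise's actual code:

```cpp
#include <cstdint>
#include <cstddef>

// One pre-generated primitive out of hundreds: "int32 column < constant".
// The generator scripts would emit a variant like this for every
// (type, operator) combination and compile them all into the binary.
// sel_in holds the positions of tuples still alive after the previous
// primitive; positions that pass this predicate go into sel_out.
size_t select_lt_int32(const int32_t* col, int32_t constant,
                       const uint32_t* sel_in, uint32_t* sel_out,
                       size_t n) {
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t pos = sel_in[i];
        sel_out[out] = pos;            // always write,
        out += (col[pos] < constant);  // only advance on a match (branch-free)
    }
    return out;
}

// The plan then becomes an array of function pointers like this one,
// called once per vector of tuples, so the indirect-call cost is
// amortized across the whole batch instead of paid per tuple.
```

For the example query above you'd have one of these for the string equality and one for the integer equality, and the runtime just calls them back to back, threading the shrinking selection vector through.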
Now, I don't think they ever shipped this next part in the commercial version, but there's a paper — the Micro Adaptivity paper — where they would take all the primitives and compile them with all the different compilers — ICC, GCC, Clang — with a bunch of different parameters. Then at runtime they would run basically an optimization program to figure out which one is going to perform best on the CPU they're running on, so they could switch out which machine-code implementation — which compiled version of the primitive — gets used at runtime. But again, I don't think that ever went into the commercial version.
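Here's the flavor of that idea in a small sketch — the two "variants" below differ only in coding style, standing in for the same primitive built by different compilers and flags, and the pick-the-winner policy is much simpler than what the paper actually does:

```cpp
#include <chrono>
#include <cstdint>
#include <cstddef>

using SelectFn = size_t (*)(const int32_t*, int32_t,
                            const uint32_t*, uint32_t*, size_t);

// Stand-ins for "the same primitive compiled with ICC vs GCC vs Clang".
static size_t lt_branching(const int32_t* col, int32_t c,
                           const uint32_t* in, uint32_t* out, size_t n) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (col[in[i]] < c) out[k++] = in[i];
    return k;
}
static size_t lt_branchfree(const int32_t* col, int32_t c,
                            const uint32_t* in, uint32_t* out, size_t n) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++) { out[k] = in[i]; k += (col[in[i]] < c); }
    return k;
}
static SelectFn variants[] = { lt_branching, lt_branchfree };

// Time each variant on a real vector of work and keep the winner.
// A real system re-evaluates periodically, since the best variant can
// change with selectivity and data distribution, not just the CPU.
SelectFn pick_variant(const int32_t* col, int32_t c,
                      const uint32_t* in, uint32_t* out, size_t n) {
    SelectFn best = variants[0];
    auto best_t = std::chrono::steady_clock::duration::max();
    for (SelectFn fn : variants) {
        auto t0 = std::chrono::steady_clock::now();
        fn(col, c, in, out, n);  // the timing run still does useful work
        auto dt = std::chrono::steady_clock::now() - t0;
        if (dt < best_t) { best_t = dt; best = fn; }
    }
    return best;
}
```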
Alright, so Redshift is going to do the HIQUE approach to query compilation — I think Amazon actually hired the HIQUE guy from Edinburgh; I think he was there for a while. What they're going to do is convert query plan fragments into templated C++ code, and then they can do push-based execution with vectorization. And since the compilation cost — like forking GCC — is so expensive, they're going to cache everything. Not only are they going to cache for your database, for all your queries, they maintain this giant global cache for any query that anyone has ever run on any database running on Redshift. This is amazing, because it's a completely different way to think about how to do code generation, right? In the case of HyPer and Postgres and all the other systems, they start cold: every single time you turn the system on and start putting data in it, it has to warm up its own local cache, figure out how to code-generate and compile, and so forth. Amazon basically says: instead of us code-generating all these primitives ahead of time, let's just take queries as they show up, generate the simplest code for them, and cache that; then as new queries show up, we see whether we have an existing compiled version of that plan fragment and just use it. I think in the paper they say the hit rate for the global cache across every single Redshift database on all of Amazon is something like 99.95% — no matter what query shows up, 99.95% of the fragments are going to be in the global cache. For your local cache the hit rate is more like 85%, but then you can go fetch things from the global cache.

Again, the cloud is a completely different way of thinking about how to optimize things, because you can see everything and share information across different customers. And when you think about it, the simple code they're generating doesn't have any proprietary information that could leak from one customer to the next. Going back to the primitive stuff in VectorWise: you have a column of integers and you want to do less-than against some constant — it doesn't matter whether it's your database with banking information or my database with a blog application; at the end of the day it's just ripping through integer columns with constants. So there's no issue reusing those cached compiled query plans, right? They do have a little extra work every time the version of the system changes — similar to the problem System R had, where they have to maintain compatibility — but when a release breaks compatibility, they just have a background service that pre-warms the cache: because they kept all the C++ code they generated before, when a new version comes out they can see whether it breaks and fix the cache automatically, right? Yes — question: is fetching it from the global cache faster than compiling it yourself? Yeah, it's a network call, but it's way faster than, you know, running things with GCC, right? And again, you get the same benefit that when things crash, you have the generated source code and you can debug that.
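To show what fragment-level transpilation looks like at all — this is a toy of my own, not Redshift's actual templates — here's the shape of it: walk a tiny predicate description and emit C++ source with everything baked in, then hand that string to a real compiler:

```cpp
#include <string>

// Toy description of a WHERE clause on one int32 column.
struct Pred {
    std::string column;    // e.g. "price"
    std::string op;        // e.g. "<"
    std::string constant;  // e.g. "100"
};

// Emit specialized C++ for "scan + filter" with the predicate baked in.
// Note what is *not* here: no type tags, no operator dispatch, no
// expression-tree walking at runtime -- the generated code does exactly
// one thing for exactly this query, which is the point of specialization.
std::string codegen_filter(const Pred& p) {
    return
        "#include <cstdint>\n"
        "#include <cstddef>\n"
        "extern \"C\" size_t fragment(const int32_t* " + p.column + ",\n"
        "                             uint32_t* sel_out, size_t n) {\n"
        "    size_t k = 0;\n"
        "    for (size_t i = 0; i < n; i++)\n"
        "        if (" + p.column + "[i] " + p.op + " " + p.constant + ")\n"
        "            sel_out[k++] = (uint32_t)i;\n"
        "    return k;\n"
        "}\n";
}
// From here (elided): write the string to a .cpp file, fork the compiler
// to build a shared object, dlopen()/dlsym() it, and cache the result
// keyed by the fragment's shape -- that cache is what Redshift shares
// globally across customers.
```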
Alright, so Oracle, as far as I know, does not do any query compilation. I bring them up because they do something interesting, as I said: they'll take stored procedures written in PL/SQL — which is like PL/pgSQL; it looks like Ada; it's the SQL-standard way you write UDFs, and we'll cover UDFs later in the semester — and they transpile those PL/SQL UDFs into Pro*C, which is their internal dialect of C, and then go ahead and compile that to native code. Because the C dialect is restricted, you're not worried about the code jumping to some arbitrary location in memory and trashing the database, so you can run it as a shared object directly. We're not going to talk about new hardware, but back in the day, after Oracle bought Sun Microsystems, Sun was shipping the SPARC CPUs — which I don't think exist anymore — and they started putting database operations inside the silicon itself, for compression and security and other stuff like that. This is even better than code generation: there's an instruction that does exactly what your database needs. But it was only for Oracle's stuff, and again, they don't sell that anymore.

Hekaton is not an OLAP system — it's a transactional engine — but what they were doing was compiling both stored procedures and SQL into C code, which they would then link in alongside the CLR, Microsoft's version of the JVM. And as I said before, in this case they do have a bunch of checks to make sure somebody doesn't try to do buffer-overflow stuff in the generated C code — I don't know whether other systems actually do this or not. What was also cool is that since everything is linked in as a shared object, the generated code can invoke other parts of the system. You can think of Hekaton as a storage engine for SQL Server — the same way you can put RocksDB into MySQL, Hekaton was something you could put into SQL Server — so the generated code could talk to other parts of SQL Server that maybe weren't running on Hekaton. And again, you can do this because everything is linked together.

So, SQLite is the most widely deployed database system in the world, and most of you don't know this, but it's doing a variation of code generation. It's going to convert your query plan into opcodes for this virtual machine that they designed — that the one guy designed — and then at runtime the VM is literally going to interpret those opcodes, as if it were the JVM, right? So if you ever use SQLite and you run EXPLAIN, you don't get a query plan tree like you would in any other database system; you get this list of opcodes and instructions, with nice little comments to tell you what it's actually doing. If you want the actual query plan tree, you have to call EXPLAIN QUERY PLAN with the SQL statement. The reason he did this is that they want to run on any possible hardware — embedded devices, cell phones, laptops, any kind of ISA; they're certified to run on airplanes; your airplane is actually running SQLite. So when you want to port the system to a new platform or a new environment, instead of having to modify all the other parts of the system, you just have to modify the VM, and he has test cases to make sure it actually works. You don't care about what generated the opcodes — that's always going to be the same from one architecture to the next. There's file-system stuff you have to deal with and so forth, but it's vanilla C code — it's actually in C, not C++ — that isn't doing a lot of specialized stuff; all the specialized stuff is in the VM. We asked him once whether you could also build an FPGA to interpret these opcodes, and he said the opcodes change from one version to the next, so it probably wouldn't work.
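Here's the technique in miniature — a made-up five-opcode VM, nothing like SQLite's real VDBE instruction set (which has a couple hundred opcodes with names like OpenRead, Column, ResultRow, and Next) — just to show why this is so portable: the only thing you ever recompile for a new platform is the dispatch loop, never the code that produced the program:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// A made-up opcode set for illustration only.
enum class Op : uint8_t { PushConst, LoadCol, CmpLt, EmitRow, Halt };
struct Instr { Op op; int32_t arg; };

// One plain-C++ loop evaluates the flattened plan for a given row.
void run(const std::vector<Instr>& prog, const int32_t* col, size_t row) {
    int32_t stack[16]; int sp = 0;
    for (size_t pc = 0; pc < prog.size(); pc++) {
        switch (prog[pc].op) {
            case Op::PushConst: stack[sp++] = prog[pc].arg;                  break;
            case Op::LoadCol:   stack[sp++] = col[row];                      break;
            case Op::CmpLt:     sp--; stack[sp-1] = stack[sp-1] < stack[sp]; break;
            case Op::EmitRow:   if (stack[--sp]) printf("row %zu\n", row);   break;
            case Op::Halt:      return;
        }
    }
}
// e.g. "WHERE col < 4" flattens to:
//   {LoadCol,0}, {PushConst,4}, {CmpLt,0}, {EmitRow,0}, {Halt,0}
```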
Okay, so HyPer was the earlier system the Germans built. After it got bought by Tableau, Tableau owned the source code, so that went away, and Thomas started building a new system called Umbra — we'll talk a little about it again throughout the semester. What's amazing about Umbra is that rather than generate LLVM IR the way they did in HyPer, he instead generates his own IR that gets lowered into assembly directly — and I think he supports ARM as well. Once you have that assembly, instead of running the bytecode interpreter or an optimizing compiler, you just run an assembler, which is even faster, because it's literally translating assembly instructions into machine code without any additional optimization passes. Then they do the same thing we talked about before: they run the query on the assembled version, do the more extensive compilation in the background, and when the optimized code is available they slide it right in. They have follow-up work on Umbra called Flying Start that came out two years ago. I debated whether to assign that paper versus the one you guys read — the one you read is the seminal one, but it's a bit dated now, and Flying Start is probably the better way — or it is the better way — to do this if you're going to go down this path. Another way to think about it is that they basically built their own compiler inside the database system, and they built it by themselves. Who does that? Germans. It's insane.

Then, as I was saying before, you still have the problem that if a query crashes, you're living in an assembly world where you have no provenance on what generated that code. So they wrote their own debugger hooks for rr — they have a whole other paper on this — explicitly designed for taking crashed query plans and stepping through the query while it's running, showing you at the same time the code lines that generated it, right? The paper's called "On Another Level" — I think I said to them, "this s*** is on another level," and then they put it in the title of the paper. I can share it; there's a video they showed me, it's insane. And with this you don't have to do any of the Redshift business of caching every possible query plan — in the single-node approach you don't need that. This is probably the better way to do it.
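The swap-in trick mentioned above is worth seeing in miniature. This sketch uses my own naming and is a drastic simplification of what Umbra actually does: start executing through the cheap tier immediately, build the optimized tier in the background, and flip an atomic function pointer when it's ready:

```cpp
#include <atomic>
#include <thread>
#include <cstddef>

using PipelineFn = void (*)(size_t begin, size_t end);

// Tier 1 is available almost instantly (think: the directly-assembled
// code); tier 2 is the slow, fully optimized build. Both are stubs here.
static void pipeline_quick(size_t, size_t)     { /* cheap codegen path */ }
static void pipeline_optimized(size_t, size_t) { /* optimizing backend */ }

static std::atomic<PipelineFn> current{pipeline_quick};

void execute_query(size_t num_chunks, size_t chunk) {
    // Kick off the expensive compilation without blocking the query.
    std::thread bg([] {
        /* ... run the optimizing compiler here ... */
        current.store(pipeline_optimized, std::memory_order_release);
    });
    // Re-read the pointer per chunk of work, so later chunks of the
    // same query automatically pick up the optimized code when it lands.
    for (size_t i = 0; i < num_chunks; i++) {
        PipelineFn fn = current.load(std::memory_order_acquire);
        fn(i * chunk, (i + 1) * chunk);
    }
    bg.join();
}
```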
So quickly, I want to talk about Java databases, because it's very similar to the LLVM stuff we talked about before, but instead of generating LLVM IR you're generating JVM bytecode, and then you let the JVM's HotSpot compiler — or whatever it's called now — compile that to machine code if it decides to. The two I want to bring up are Spark and QuestDB. The reason I bring up Spark is that, as I said, they abandoned the code generation and query compilation stuff in the newer Photon engine, but this was the precursor to it. In the paper you guys are reading, they talk about the things I was saying: if you're doing code generation, you have to have people who really know compilers and low-level assembly and bytecode stuff in order to optimize things, whereas if you just build a vectorized engine in C++, there's a much larger number of people who can actually work on it. They found that even though codegen may run faster in the short run, in the long term you can have more people optimizing the vectorized code, and you end up getting better results and better maintainability for the system. So in 2015 they introduced this new Tungsten engine, where they would convert the WHERE-clause expression trees into abstract syntax trees written in Scala, hand that off to the JVM, have the JVM convert it into bytecode, and then invoke it as a function and run it natively. We'll also see in the paper you guys are reading that in some cases — really complex queries — this AST would get massive and the JVM would choke on it and say, this is too big, I can't compile this query for you, and then they have to fall back to the interpreted engine. If you switch over to a vectorized approach, you don't have this problem.

Another Java-based database that does code generation is a thing called QuestDB. It's a time-series database out of the UK, actually written by former HFT guys who know how to optimize Java code and set out to build a better database system. What happens is the query shows up and they compile just the WHERE clauses: they generate an IR from the Java code and use asmjit — which is sort of like a lightweight version of LLVM — to compile that into machine code and run it. This is from a blog post they wrote showing two optimizations they made: converting from single-threaded to multi-threaded, and going from a non-JIT system to a JIT system. The original version was single-threaded with no JIT, and for the query here it takes about 30 seconds. If you turn on the JIT, you shave off about 10x — you get it down to around 3.5 seconds. But if you turn on multi-threading with no JIT, you actually do even better than with the JIT alone, and then the combination of the two gives you the best result. To me it's curious that they built the JIT part first — the JIT came first, then multi-threading a year later — and that seems surprising, because the first thing I would do is obviously build parallel queries. I don't know why they did it that way, but in my opinion you should do multi-threading first and then, if you want, JIT second, because this result clearly shows the difference, and the combination of the two is what really makes the difference.

We're way over time — let me see if there's anything else I want to quickly show. We'll cut it off here and pick up a little of this next class, because we want to talk about the project as well. I didn't get to SingleStore; I'll bring it up next class, but the SingleStore approach is probably the better way to do JIT compilation. It's basically like the HyPer approach, where you start with an interpreter and then compile, and there's an additional intermediate step that lets programmers work through what happened if there's a failure. The Flying Start one is amazing, but you have to be German to actually build that kind of stuff — nobody else does what they're doing; generating assembly yourself is insane. All the newer systems, though, are for the most part going to choose a vectorized-style approach, again for the reasons we'll see in the Photon paper from Databricks. Alright, so next class we'll do query task scheduling — maybe I'll cover some of the compilation stuff that we missed — and the paper you guys are reading is from the Germans again: this will be morsels. Even though it's on a single node, the idea of how you break the query plan up into tasks — these morsels — based on the data is going to be the key idea we want to build into our system.