And I want to talk about how we can make Postgres faster using just-in-time compilation. One thing to note: while I currently work for EnterpriseDB, a significant part of this work was done while I was still working for Citus Data, so I think they deserve some co-credit.

The primary motivation for working on just-in-time compilation is performance, as most of you might have guessed. As an example, I'm going to show a profile of a query. It doesn't really matter which specific query it is; I'm just using it as a prop to show where the time goes. This is query one from the TPC-H benchmark set, and it's basically a simple query: a GROUP BY with a lot of aggregates over a single table, plus a filter on a time range. If you look at the profile, we see that most of the time is spent in ExecInterpExpr. That means we have an expression and we want to evaluate its value; expressions here are things like "what's the value of this column?" or "what's the value of extended price times something?". We spend about 45% of the time just in that. Another roughly 10% is spent in slot_deform_tuple, which is the function that converts an on-disk tuple into its in-memory representation, and below that another 10% in slot_getsomeattrs, which is just the wrapper for slot_deform_tuple. So deforming in total is already 20% of the time. Below that we see things like float8_accum and float8pl; float8pl just adds two floats, bpchareq does comparisons of char datums, and so on. So in this specific query, nearly all the time is spent either evaluating expressions, including the operators they reference, or deforming tuples. If you go down a few entries, you see other things that matter performance-wise.

I worked a lot on making each of these faster. Before Postgres 10, we spent a lot more time in the equivalent of what is now ExecInterpExpr. I tried hard to make it faster, but at some point I got stuck: we can't make it faster just by improving the code; we need a new approach. And one such approach is just-in-time compilation.

What is just-in-time compilation? It's converting interpreted code to native code at runtime. Interpretation means that you have some sort of program in memory that you want to evaluate, but it's not in the native form your CPU directly understands, so you need a program that evaluates that program, and you pay overhead for that double indirection. JIT compilation makes things faster in two basic ways, plus a lot of minor ones. First, it allows you to specialize code for specific cases. For example, the code that deforms an on-disk tuple into its in-memory representation has to deal with every possible shape of tuple: you might have a table with 1,000 columns, or a table with three columns, and each of those cases has to work. If you're instead able to specialize that code for your specific table, you can remove all the conditionals that depend on the number of columns.
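To make that concrete, here is a minimal sketch, not Postgres's actual deforming code, of what that specialization buys: the generic version branches on per-column metadata for every tuple, while a version generated for one known table layout is straight-line code.

```c
/* Minimal sketch (not Postgres code): generic vs. specialized deforming.
 * The generic version must loop and branch on column metadata for every
 * tuple; code specialized for a known layout folds all of that away. */
#include <stdint.h>

typedef struct ColumnDesc { int len; } ColumnDesc;

/* Generic: works for any table shape, pays for it on every call. */
static void
deform_generic(const char *tup, const ColumnDesc *cols, int ncols, int64_t *values)
{
    int off = 0;
    for (int i = 0; i < ncols; i++)     /* loop bound unknown at compile time */
    {
        if (cols[i].len == 4)           /* branch per column */
            values[i] = *(const int32_t *) (tup + off);
        else
            values[i] = *(const int64_t *) (tup + off);
        off += cols[i].len;
    }
}

/* Specialized: what a JIT could emit for a table known to be (int4, int8):
 * no loop, no per-column branches, fixed offsets. */
static void
deform_int4_int8(const char *tup, int64_t *values)
{
    values[0] = *(const int32_t *) (tup + 0);
    values[1] = *(const int64_t *) (tup + 4);
}
```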
You can also specialize the code so it's correct for the specific types of the columns, and so on. The other basic way is that you can remove the overhead of the interpretation itself. That often means removing a lot of indirect jumps, that is, jumps whose target is not known at compile time but determined at runtime, plus a number of branches, and so on. I'll go into more detailed examples for each of those later.

The technology of just-in-time compilation is well known from languages like JavaScript in your browser. Your browser doesn't interpret all the JavaScript it runs; instead, once it notices that some part is executed frequently, it converts it to native code and executes that. Similarly, the HotSpot Java VM uses just-in-time compilation to convert Java bytecode into native code when it detects that it's hot. And a lot of other languages do the same.

So which methods of just-in-time compilation have I considered? A lot of older databases, and a lot of other projects, do just-in-time compilation like this: when they detect that some part of the program or query is hot, they emit a lot of C code, built by plain string concatenation; then they fork a process and invoke a compiler on that generated C code; that gets linked into a shared object; they load the shared object using the dynamic linker; and then they run it. That is not exactly the most elegant approach: you need to fork a compiler, which means you need a compiler installed on your production box; you can't easily do inlining; and you have to do a lot of string building. So I decided very early on that I didn't want to go that way. But seriously, there are a lot of old-school production databases out there that use exactly this approach to make their queries faster.

Another approach, which is actually getting pretty popular these days, is to have your just-in-time compiler directly emit machine code into memory, and then remap the page to be executable; on modern systems a memory page isn't allowed to be writable and executable at the same time, for security reasons, so you have to remap it and then run it. The problem is that to directly emit machine code you need a lot of knowledge about the individual CPU, and you have to write the translation from your interpreted program into native code for each specific architecture. We would need a different backend for x86, for x86-64, for 32-bit ARM, for 64-bit ARM, and for all the other, I don't know, roughly dozen architectures Postgres supports. That would be a lot of work. I actually started playing with this, but more out of curiosity than because I thought it was realistic; it works, but the amount of work needed is too high. It's worth noting that a lot of other projects started out on JIT compiler frameworks and then moved more and more towards building their own: several of the browsers had their own JIT, then used something like LLVM, and finally went back to their own JIT for performance reasons. But their reasons don't quite apply to us, and they also have a lot more resources than we have.
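For the curious, the old-school "generate C and dlopen it" approach looks roughly like this; a deliberately simplified sketch, not code from any particular database:

```c
/* Sketch of the classic "emit C source, fork a compiler, dlopen the result"
 * style of JIT. Unix-only; compile with -ldl. Error handling minimal. */
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* 1. Emit C source by string concatenation. */
    FILE *src = fopen("/tmp/jit_gen.c", "w");
    fprintf(src, "int eval(int x) { return x * 2 + 1; }\n");
    fclose(src);

    /* 2. Fork a compiler to build a shared object. */
    if (system("cc -O2 -shared -fPIC -o /tmp/jit_gen.so /tmp/jit_gen.c") != 0)
        return 1;

    /* 3. Load it via the dynamic linker and look up the function. */
    void *handle = dlopen("/tmp/jit_gen.so", RTLD_NOW);
    int (*eval)(int) = (int (*)(int)) dlsym(handle, "eval");

    /* 4. Run the "jitted" code. */
    printf("%d\n", eval(20));   /* prints 41 */
    dlclose(handle);
    return 0;
}
```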
So I concluded that this isn't a realistic approach for Postgres either. Which basically leads to the conclusion that we have to use some external framework, one that absolves us of the need to deal with individual architectures and so on. There are a few out there. The most well-known is probably LLVM, and that's what I decided to use. But before that, I looked at a bunch of other frameworks. One that actually looks pretty decent is GCC's JIT library, libgccjit, which lets you do just-in-time compilation with GCC. Unfortunately its licensing is not compatible with Postgres, and it's not very mature; it only got started a few years ago and has no real production users. So I had to conclude it's not really an option. I also looked at the infrastructure used by LuaJIT, which is also not really feasible, in my opinion. So in the end I concluded we should build on LLVM.

At this point, can you tell me: how many of you have heard of LLVM? Oh, cool. And how many of you have used LLVM from a programming point of view? Okay, that's decidedly fewer. Cool. So what is LLVM? It's basically a framework for building various forms of compilers. Its name used to stand for "low-level virtual machine", but that's been abandoned; it's just a compiler framework now. One of the important things it has is an intermediate representation, IR. That's basically a high-level assembly language in which you can emit a program without knowing much about the target architecture: you don't need to know which instructions exist, how to encode them, or anything like that; none of that is represented there. It's relatively high-level, but still low-level enough that you can give LLVM enough information to optimize the program well. You can build a lot of different frontends that generate this intermediate representation; for example the Clang C compiler, which you might have used, or Swift, generate this IR, and then LLVM's optimizer and code generator turn it into native code. There are a lot of optimization passes available in LLVM, and it's good to be able to take advantage of those.

Now we come to the more Postgres-specific part, and unfortunately there are going to be some low-level details; I hope I roughly strike the right balance between too much detail and staying too high-level, but I'm not quite sure. Postgres, as you might know, is written in C, and most of LLVM is written in C++. LLVM has a C API that exposes a subset of LLVM to C. Unfortunately, it's not quite enough: there's some functionality not exposed by the C API that we need to do jitting in Postgres efficiently. So what I've done is this: most of the LLVM usage in Postgres goes through the C API, and then there are two or three source files that use the C++ API and expose, from that, another C API that the rest of Postgres uses. That encapsulates all the C++ usage in very few files, and I hope to eventually get most of it merged into LLVM so we don't need any of the C++ stuff ourselves; although I personally do like C++, it's not bad. I also put all the LLVM usage into a shared library, which has the advantage that we don't link the main Postgres binary against LLVM. That's good, because the LLVM library is roughly 50 megabytes.
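Since few hands went up for having used LLVM programmatically, here's a minimal, self-contained example of the C API, just building IR for a function that adds two integers and running it through the JIT. This is generic LLVM usage, not Postgres code:

```c
/* Minimal LLVM C API example: emit IR for "int add(int a, int b)" and run
 * it. Build with: cc add.c $(llvm-config --cflags --ldflags --libs) */
#include <llvm-c/Core.h>
#include <llvm-c/ExecutionEngine.h>
#include <llvm-c/Target.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    LLVMLinkInMCJIT();
    LLVMInitializeNativeTarget();
    LLVMInitializeNativeAsmPrinter();

    /* A module is the unit of compilation (like a C translation unit). */
    LLVMModuleRef mod = LLVMModuleCreateWithName("example");
    LLVMTypeRef params[] = { LLVMInt32Type(), LLVMInt32Type() };
    LLVMTypeRef fnty = LLVMFunctionType(LLVMInt32Type(), params, 2, 0);
    LLVMValueRef fn = LLVMAddFunction(mod, "add", fnty);

    /* Emit the body: one basic block returning a + b. */
    LLVMBasicBlockRef entry = LLVMAppendBasicBlock(fn, "entry");
    LLVMBuilderRef b = LLVMCreateBuilder();
    LLVMPositionBuilderAtEnd(b, entry);
    LLVMBuildRet(b, LLVMBuildAdd(b, LLVMGetParam(fn, 0), LLVMGetParam(fn, 1), "sum"));

    /* JIT the module and call the generated code. */
    LLVMExecutionEngineRef ee;
    char *err = NULL;
    if (LLVMCreateExecutionEngineForModule(&ee, mod, &err)) {
        fprintf(stderr, "failed: %s\n", err);
        return 1;
    }
    int (*add)(int, int) = (int (*)(int, int)) (uintptr_t) LLVMGetFunctionAddress(ee, "add");
    printf("%d\n", add(2, 3));   /* prints 5 */
    return 0;
}
```

This C API family is what most of the Postgres JIT code builds on, with the missing pieces wrapped as described above.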
Now, about that size: you can get LLVM down to maybe 35 or 40 megabytes, but that's still pretty large, and we don't want to force every distribution to make LLVM a hard dependency of the Postgres package, which itself is about half that size. So instead you can put the JIT support into a separate package and install that; because it's a shared library, you just drop in the library and its dependencies, and from then on it gets used.

Another pretty significant part is how we deal with errors that occur while LLVM is running, and that's what I want to talk about next. Unfortunately, LLVM is not exactly tuned for Postgres's use case, because the original users of LLVM were things like a C compiler. If a C compiler runs into a fatal error, what does it do? It prints an error message to standard error, returns a failure exit code, and is done. That doesn't work for Postgres. If LLVM runs out of memory and decides "I want to exit because I don't have enough memory", we don't want that to kill the entire Postgres server; the best case is to just return an error, or at least to limit the scope of the error to the individual connection. Because if some code just calls exit(1) somewhere inside a Postgres process, and Postgres itself doesn't know why the process exited, Postgres will say "something unexpected happened", kill the entire postmaster, and restart everything from scratch, which kills all connections. Most of us would not be too happy with that.

And unfortunately, LLVM is not exception safe. Even though it's C++, you can't just throw std::bad_alloc, or whatever the exception is called, at the right place and catch it; for overhead reasons LLVM isn't built to be exception safe, so that doesn't work. We had to find a workaround, and I hope LLVM improves the situation at some point. Whenever Postgres enters LLVM code, we install a bunch of fatal error handlers; basically callbacks that get invoked when there's an out-of-memory error or another fatal error. When one fires, we promote the error to a FATAL error, which Postgres knows how to deal with. But we can never re-enter LLVM after that: if we re-entered LLVM after throwing, we would call into state that may be corrupted, because we might have been in the middle of a function that modifies that state. Recovering from this is simply not supported by LLVM, so we have to throw FATAL, which takes down that individual connection but not the overall server. In practice that means that whenever Postgres interacts with any LLVM code, we wrap those sections: before, we call llvm_enter_fatal_on_oom(), and once we're done interacting with LLVM, llvm_leave_fatal_on_oom(). Within such a section, an out-of-memory error kills the session. Note that this is only about the just-in-time compilation itself, not about running the compiled functions: it deals with out-of-memory situations caused by doing JIT compilation, not by executing the generated code. That's not great, but the queries that use just-in-time compilation are typically analytics queries rather than OLTP queries.
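A simplified sketch of that guard pattern, as backend code; the enter/leave function names match what Postgres uses, but the bodies here are illustrative (the real implementation is C++ and also hooks LLVM's allocation-failure handler):

```c
/* Illustrative-only sketch: promote fatal LLVM errors (e.g. OOM during
 * compilation) to a Postgres FATAL, killing one backend, not the server. */
#include "postgres.h"                 /* backend code: ereport(), FATAL */
#include <llvm-c/ErrorHandling.h>

static int fatal_handler_depth = 0;

/* Callback LLVM invokes on unrecoverable errors. It must not return into
 * LLVM: LLVM's state may be corrupted mid-operation, so FATAL is the only
 * safe response. */
static void
llvm_fatal_handler(const char *reason)
{
    ereport(FATAL,
            (errcode(ERRCODE_OUT_OF_MEMORY),
             errmsg("fatal LLVM error: %s", reason)));
}

void
llvm_enter_fatal_on_oom(void)
{
    if (fatal_handler_depth++ == 0)
        LLVMInstallFatalErrorHandler(llvm_fatal_handler);
}

void
llvm_leave_fatal_on_oom(void)
{
    if (--fatal_handler_depth == 0)
        LLVMResetFatalErrorHandler();
}
```

Again, this guards the compilation phase only, not the execution of the generated code.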
Back to memory: those analytics queries use a lot of memory anyway, just because you want a big work_mem so you can build large hash tables in memory, or sort in memory instead of on disk. So it's actually not easy to hit this case. I was only able to reproduce it by creating joins between tables that each have 1,000 columns, and then joining 100 of them or something; then you run into out-of-memory problems, but that's not very realistic, I think.

There's another, larger and more important category of errors: if somebody cancels a query, for example, or your query inserts into a table somewhere and a constraint is violated, then that query errors out too. In that case we're not inside LLVM, but we might be running one of those generated functions, and we need to handle the error differently: if we just errored out and continued, we'd leak all the allocated functions. So we need to know how to clean those up; but that's not a problem, and in those cases the error isn't fatal, we can just raise the normal error, out of memory or whatever, and move on.

The API in Postgres for emitting jitted code is the following. The first time a query wants to use some just-in-time compilation, and how it decides when to do that I'll come to in a couple of slides, we create a context for all the just-in-time compiled code that we have in this query. You could also have another context for non-query things: for example, we could use just-in-time compilation to make the COPY command faster, which I have not done, and that would create another context. Once created, that context is used for the whole query. Whenever we want to emit a function, say "this expression is one function I want to evaluate", we get a module; a module is basically LLVM's equivalent of a translation unit in C, something that will eventually result in one object file, and we can add a lot of functions to it. The important part is that adding functions to the module does not compile it. Compilation is delayed until somebody asks to get a function out of the module, which returns a function pointer: until somebody calls llvm_get_function, the module is not compiled. And that's very important, because typically a query has, say, ten different expressions, and we don't want to compile each of those individually: the overhead of compiling is not just proportional to the number of functions; there's a lot of one-off overhead. For example, emitting the code requires an mmap, and then a remapping to say "this is now executable code", and that's a roughly constant cost per batch. So batching all the functions into one invocation of the optimizer and the code generator is a big win. Usually, when we run a query, we add a lot of different functions to the same module, and then at execution time, the first time a function is executed, we actually fetch the compiled function for it.
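In code, the lifecycle looks roughly like this. The llvm_create_context / llvm_mutable_module / llvm_get_function names match the Postgres 11-era internal API; emit_expr_function is a hypothetical stand-in for the IR emitter, and this is backend code, so Postgres headers are assumed:

```c
/* Sketch of the per-query JIT lifecycle (simplified backend code). */
#include "postgres.h"
#include "jit/llvmjit.h"
#include "nodes/execnodes.h"

extern void emit_expr_function(LLVMModuleRef mod, const char *name); /* hypothetical */

static void
jit_expressions_for_query(EState *estate)
{
    /* One JIT context per query, created the first time we decide to jit. */
    LLVMJitContext *ctx = llvm_create_context(estate->es_jit_flags);

    /* All of the query's expression functions go into one module... */
    LLVMModuleRef mod = llvm_mutable_module(ctx);
    emit_expr_function(mod, "evalexpr_0");
    emit_expr_function(mod, "evalexpr_1");

    /* ...and nothing is compiled yet: the first llvm_get_function() call
     * runs the optimizer and code generator over the whole batch and
     * hands back a callable pointer. */
    void *fn = llvm_get_function(ctx, "evalexpr_0");
    (void) fn;

    /* When the query ends, successfully or with an error, the whole
     * context, and with it all generated code, is released in one batch. */
}
```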
And if any error happens during query execution, or the query finishes successfully, the whole context gets deallocated in one batch, which also saves time.

Now we come to understanding how Postgres actually evaluates the expressions in queries, and that changed in Postgres 10. Since Postgres 10 we basically have our own mini bytecode language for evaluating expressions. Before that, we did a tree traversal over the expression trees, and that turned out to be slow; it also had significant disadvantages for doing just-in-time compilation, which is why I changed it to this bytecode representation, with the idea that jitting would be easier on top of it, and it's faster anyway.

If you have an expression like this one, a.column < 10 AND a.othercolumn = 3, it results in a number of expression steps. First there's a "scan fetchsome" step, which means: from the scan tuple, the row of the table we're currently looking at, deform the values of column and othercolumn; doing both together is faster, so we do it once for the whole expression. Then we need to get the value of the first column; then we need to evaluate the value of a constant, which is obviously pretty simple, in this case 10; and then we need to evaluate a function, and that function is int4lt, "int4 less than", assuming the column is a four-byte integer. It's a function because in Postgres every single operator is implemented as a function call; Postgres is very extensible, you can add any sort of data type by just invoking CREATE FUNCTION, CREATE OPERATOR, CREATE TYPE and so on. So we call that function, and then there's a step for handling the AND: for example, the first comparison might already have returned false, in which case we don't have to evaluate the second arm of the AND at all.
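Laid out as a step program, it looks roughly like this; the opcode names are in the spirit of Postgres's EEOP_* steps, simplified for illustration:

```c
/* Roughly the step program for "a.column < 10 AND a.othercolumn = 3".
 * Names simplified; Postgres's real opcodes are the EEOP_* enum. */
enum ExprStep {
    STEP_SCAN_FETCHSOME,  /* deform both needed columns from the scan tuple */
    STEP_SCAN_VAR_1,      /* fetch value of a.column */
    STEP_CONST_10,        /* the constant 10 */
    STEP_FUNC_INT4LT,     /* call int4lt(a.column, 10) */
    STEP_BOOL_AND_FIRST,  /* if false, jump straight to STEP_DONE */
    STEP_SCAN_VAR_2,      /* fetch value of a.othercolumn */
    STEP_CONST_3,         /* the constant 3 */
    STEP_FUNC_INT4EQ,     /* call int4eq(a.othercolumn, 3) */
    STEP_BOOL_AND_LAST,   /* combine into the final boolean */
    STEP_DONE             /* return the result */
};
```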
Because if it returns false, the whole expression can't be true, so we can jump directly to the end of the expression. If it's true, we have to evaluate the value of the three, then check whether the column value equals it, and then we again have some AND-specific step at the end.

The interesting thing is how these get evaluated: there's one big function, basically a switch over the opcodes, and once a step is done it jumps to the next step, and the next, and the next. Most of those are unconditional jumps, but they're indirect jumps, because in our evaluation function we don't know statically that after the fetch the next step will always be a scan variable and so on; instead we look at an in-memory location to find the next thing to jump to. And some of the jumps are conditional: the evaluation of the boolean, for example, may say "this is false", in which case we can jump to the end of the expression, or it may say "true", and then we continue evaluating. For those who have written an interpreter before: with enough compiler support, this is a direct-threaded interpreter, which is slightly better and slightly faster, but it's still a major bottleneck, as you saw in the profile earlier. The important part is that there are a lot of indirect jumps here, from each step to the next, and even conditional indirect jumps.

One short interlude: how does Postgres represent a function call? Postgres can have functions with default arguments and functions with arbitrary numbers of arguments, so we can't use the native C function call representation; instead we have our own function call interface. It basically consists of every function getting passed a pointer to a FunctionCallInfoData, which contains some metadata and then two arrays: one with the arguments, and one saying whether each argument is null. C doesn't have a direct representation of a value being null; it has one for pointers, but there's no representation of "this integer is uninitialized, it is null". So we have an array of booleans that says whether each parameter is null, and invoking a function just means calling the function pointer stored in the metadata. You might realize what that means: for every function call, we have to fill in each of the arguments, fill in whether each argument is null, reset whether the function as a whole returns null, call the function, and then check whether it returned null. That makes every function call pretty high-overhead, and it will play a role in a bit.

This is the code for evaluating a function expression that is strict. You don't need to understand it in detail, but there are a number of interesting points. For one, there's a loop here over all the arguments, checking whether any of them is null, because a strict function returns null if any of its arguments is null, and most of the common operators in Postgres are strict: for example, if you add a null to an integer, we won't actually call the int4 addition function, we just return null.
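Schematically, this call convention and the strict-function check look like the following; a simplified model, not the actual Postgres struct layout:

```c
/* Simplified model of Postgres's function call interface. The real
 * FunctionCallInfoData carries more metadata; this is just the essence. */
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t Datum;

typedef struct FuncCallInfo
{
    Datum (*fn_addr) (struct FuncCallInfo *fcinfo); /* indirect call target */
    int    nargs;
    bool   isnull;       /* did the call produce a NULL result? */
    Datum  arg[8];       /* argument values */
    bool   argnull[8];   /* per-argument NULL flags */
} FuncCallInfo;

/* What the interpreter does for a strict function call. When jitted for a
 * known call site, this loop unrolls into a fixed number of checks and the
 * indirect call becomes a direct one. */
static Datum
call_strict(FuncCallInfo *fcinfo, bool *resnull)
{
    for (int i = 0; i < fcinfo->nargs; i++)  /* strict: NULL in, NULL out */
    {
        if (fcinfo->argnull[i])
        {
            *resnull = true;
            return (Datum) 0;
        }
    }
    fcinfo->isnull = false;                  /* reset before calling */
    Datum result = fcinfo->fn_addr(fcinfo);  /* indirect function call */
    *resnull = fcinfo->isnull;               /* propagate result NULL-ness */
    return result;
}
```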
So we have to check whether any of the arguments is null. First there are a bunch of memory dereferences to get at the information about the function call from the opcode's metadata; then we have to iterate over all the arguments. Note that if we compiled this code, we wouldn't need that loop, because we'd know there are exactly two arguments, or seven, or a hundred; during interpretation we have to look the count up and loop, so all these branches on the number of arguments are completely unnecessary. Then we have to check, for each argument, "is this one actually null or not?"; those checks we'd need even in compiled code, unless the optimizer can prove the function can't receive a null argument. Then we reset whether the function returns null, call it, take the return value, save it in some scratch space, and go to the next step. That is again an indirect jump, because we don't know what the next operator will be. So all of this is pretty expensive. In the jitted case we can remove most of those branches, because we know we have exactly two arguments, so we can just emit the code for checking each argument; and the indirect function calls, like the call through the stored address, which is indirect because the target can change at runtime, can be replaced by direct calls to the specific function. That saves a lot of time.

Just turning on just-in-time compilation of function expressions, without any inlining and without jitting the tuple deforming, takes TPC-H Q1, on my laptop, while running a browser, so this is definitely not a scientific benchmark, but the differences are large enough, from 28 seconds to 22 seconds. The interesting part is the branch misprediction numbers: before just-in-time compilation something like 3.8% of branches are mispredicted, and given that a mispredicted branch often has a penalty of over 100 cycles, that's actually pretty high; after just-in-time compilation, only about 0.7% of branches are mispredicted. So we save a lot of mispredictions, and that is the largest cause of the speedup. The other interesting part: if you use perf to look at how many instruction TLB (translation lookaside buffer) lookups happen, we're roughly three orders of magnitude better with just-in-time compilation than without. That's not a perfect measure, but it roughly shows how local your code is.

And this is the generated IR for just the function expression code. You don't need to understand it in detail; I just thought it would be interesting to show the textual representation of some of this IR. What you can see is basically the jitted version of the code we saw earlier: we get the pointer to the array of arguments, and I'll come back to that in a second; then we go to the next block, via an unconditional jump that the compiler can optimize away entirely; then we access the first element of the null-flags array, load it from memory, compare it to one, and if it is one, meaning null, we jump directly onward, to the next block.
If it's not null, then we check the next argument, where we do the same thing. And then here we have the actual function call. The interesting part: first we reset the "returns null" flag, and then you can see we have a direct reference to an external function; in this case, this is from the TPC-H query I presented earlier, it's just comparing whether a date is smaller than a timestamp. So there's no indirect function call here: the compiler knows which function to call, and the generated program will contain a direct call to it. And then we look at the result.

But you can see that to do this, we needed to access the FunctionCallInfoData struct and its individual elements, which means the IR has to interact with the structs we've defined on the C side. We need to be able to say "at this part of the struct is this element", which means LLVM needs to know about the types we have in Postgres. We could alternatively hard-code lots of constants saying "cast the thing at this offset", but that would be relatively complicated to maintain and error-prone. So we need to synchronize the types between LLVM and the C code; you can see here that, for the function call info, we use an offset of seven, for example, to get at the argument array. The way we've done that: there's one C file, which does not get linked into the final binary, that lists all the types LLVM needs to know about, and we use Clang to convert it into bitcode. One of the cool things about Clang is that it doesn't have to emit machine-executable code; it can directly emit the LLVM intermediate representation, and that lets us use those types in the IR. At server startup we load that file, and then the type information is available in both places. For the offsets, unfortunately, LLVM's representation doesn't have field names, so I just added a bunch of defines giving the field numbers, which the C code emitting the LLVM IR can reference. That's not pretty, but it's the best I could come up with; it would be nicer if LLVM optionally had field names, and the overhead of that would be purely at compile time, so it wouldn't be bad. Before that, I had manually synced up the types, which obviously is extremely failure-prone and a lot of work.

So: we saw in the profile that we spend roughly 20% of the time doing tuple deforming, so another very important thing to just-in-time compile is the tuple deforming. You can optimize a lot away there, because the normal tuple deforming code has to be able to deal with any sort of table shape, one column of type integer, 1,000 columns of type text, and so on, so the code is very generic. If you can instead generate code for a specific table with a specific format, you can remove a lot of those conditionals: we know how many columns the table has, and we know whether there's a guaranteed minimum number of columns; for example, if the 1,000th column of the table is declared NOT NULL, then we know at least 1,000 columns exist in every tuple, even if somebody later added more columns to the table.
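Going back one step, to make the offset synchronization concrete: the pattern looks roughly like this. The field numbers here are made up for illustration (Postgres keeps similar defines next to the real struct definitions):

```c
/* Sketch: keep the C struct layout and the emitted IR in sync by putting
 * field-number defines next to the fields they describe, and using them
 * with a struct GEP when emitting IR. Numbers are illustrative only. */
#include <llvm-c/Core.h>

typedef struct FuncCallInfo
{
    void  *flinfo;                    /* field 0 */
    int    nargs;                     /* field 1 */
#define FIELDNO_FUNCCALLINFO_ISNULL 2
    _Bool  isnull;                    /* field 2 */
#define FIELDNO_FUNCCALLINFO_ARGS 3
    unsigned long args[8];            /* field 3 */
} FuncCallInfo;

/* Emit IR computing &fcinfo->args, given the struct type loaded from the
 * Clang-generated bitcode at startup. */
static LLVMValueRef
emit_args_pointer(LLVMBuilderRef b, LLVMValueRef fcinfo)
{
    return LLVMBuildStructGEP(b, fcinfo, FIELDNO_FUNCCALLINFO_ARGS, "args");
}
```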
Back to deforming: a NOT NULL column at the end, or anywhere in the tuple, gives us such a guaranteed minimum, and we also know which columns have which alignment. So by generating code specific to a particular tuple descriptor, which is Postgres's description of what a tuple looks like, we can get quite a speedup. Even afterwards, tuple deforming remains a bottleneck, which is not surprising: if a query looks at 100 gigabytes of data, even when all of it is in memory, looking at the tuples is going to be the main source of memory accesses, and 100 gigabytes is obviously never going to fit in your caches. We could later rewrite how we evaluate queries to be more memory-efficient, and there's a lot of potential for that, but it's worth knowing that even after just-in-time compilation, a lot of time is spent here. If you enable jitting of the tuple deforming, compared to jitting the whole query but not the deforming, we go from about 22 seconds to roughly 19 seconds, and we reduce the number of branches per second from roughly 1,400 million to 1,100 million. Even though less time is spent overall, we execute fewer branches per second, which is cool.

We saw earlier that the generated code still contains function calls, to the date-versus-timestamp comparison, to float8_accum, which is aggregate transition code, to float8pl, and so on. All of those are C code inside Postgres. It would be much nicer if we could inline them, so we don't have external function calls at all, because, as I said earlier, the function call overhead in Postgres is pretty high: arbitrary numbers of arguments, null flags for each of them. In a lot of cases we don't need any of that: if you call a function that just adds two integers, it's a strict function, so inside it we don't care about null handling and don't need the null array; and if the function can never return null, we don't need the returned-null check either. We want to let LLVM optimize all of that away, but for that it has to see the body of the function we're calling. Luckily, as I mentioned, Clang can generate IR for all the Postgres code. So at build time we generate LLVM bitcode for all of Postgres, install it in a server directory, and build an index of which file contains which function. At runtime, if we want to do inlining, we can look up: we're calling this function, how large is it? If it's 10,000 instructions, we don't want to inline it; but if it's below some cutoff, we pull that definition into the program, and then the compiler may be able to optimize things away.

For this there's a new directory in the installation, bitcode, and inside it you can have per-module code. One of the modules is postgres, which is just the Postgres code itself; but we want extensions to be able to take advantage of just-in-time compilation too. So if you have an extension installed on the server, say, hypothetically, PostGIS wanting to benefit from jitting, I don't think it would help a lot there, but let's just assume, it could install its bitcode files there as well, and those could then be used for just-in-time compilation.
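A sketch of what such a runtime size check could look like with the LLVM C API. This is not Postgres's actual inlining code, which is considerably more involved and driven by a prebuilt summary index; it just shows the idea of "load the installed bitcode, look at the function, apply a cutoff":

```c
/* Idea sketch: load an installed .bc file and apply a crude size cutoff
 * before deciding to inline a function from it. */
#include <llvm-c/BitReader.h>
#include <llvm-c/Core.h>
#include <stdbool.h>
#include <stdio.h>

static bool
function_small_enough(const char *bcpath, const char *funcname, unsigned max_blocks)
{
    LLVMMemoryBufferRef buf;
    LLVMModuleRef mod;
    char *msg = NULL;

    if (LLVMCreateMemoryBufferWithContentsOfFile(bcpath, &buf, &msg))
    {
        fprintf(stderr, "could not read %s: %s\n", bcpath, msg);
        return false;
    }
    if (LLVMParseBitcode2(buf, &mod))   /* takes ownership of buf */
        return false;

    /* Use the basic-block count as a crude stand-in for "how large?". */
    LLVMValueRef fn = LLVMGetNamedFunction(mod, funcname);
    bool ok = (fn != NULL && LLVMCountBasicBlocks(fn) <= max_blocks);

    LLVMDisposeModule(mod);
    return ok;
}
```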
Unfortunately, I couldn't make use of LLVM's own inlining logic. LLVM has cross-module inlining for its link-time optimization, but that has the requirement of being able to promote the visibility of functions: if there's a static variable that multiple modules want to see, LTO can just rename it to some unique name and make it externally visible. We can't do that, because that happens at link time, and we can't recompile Postgres while it's running just to evaluate a query; that obviously wouldn't work, it would be way too slow, and you can't replace a running image anyway. So that's not feasible, and I had to write our own inlining logic. It uses the index we've built, builds a combined index over all the modules that are available, and then checks: here's an external function reference, can we inline the target function? With some safety checks: how large is the function, if it's too large we don't inline, and so on.

So how do we decide, at plan time, when to do just-in-time compilation and when not? It's basically a very naive cost-based analysis. We have three variables: if jitting is enabled and the query's estimated cost is higher than the jit cost threshold, we do just-in-time compilation without optimization; if it's even more expensive, above the optimize cost, we also run the optimizer; and if it's even more expensive, above the inline cost, sorry, the slide should say "inline cost", we also do inlining. And that decision is made for the whole query. Compare that with something like a JavaScript just-in-time compiler: those are so-called tracing JIT compilers, which check "how many times has this function been executed? more than ten times? then compile it". I started with that, but it turned out to be slower, because we have something better: we have an estimate of how expensive the query is, and we can use that as a proxy for deciding when to JIT. That works reasonably well; I still need to tune the individual values, and I think at the beginning we'll need to gather some field experience.

One interesting thing: if you profile a jitted query with plain perf, without any support for jitted code, you get profiles like this, full of unknown symbols, because perf doesn't know about the just-in-time compiled functions. If you emit the additional information perf needs, you get sensible profiles, and then you can see the generated evalexpr functions; in this query there are three of them, or more, but three show up prominently in the profile. And one interesting part: even after just-in-time compilation, which made the query roughly twice as fast, we still spend the majority of the time in expression evaluation. That's partly because the code we generate is not that great, and I'll come to that, and partly because this query just inherently spends most of its time on expression evaluation, since it computes a large number of aggregates. This next slide I'll skip, because of time.
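On the cost thresholds from a moment ago: the plan-time decision itself is simple enough to sketch. In released Postgres the three variables are the jit_above_cost, jit_optimize_above_cost and jit_inline_above_cost settings; treat the exact names and shape here as illustrative:

```c
/* Illustrative sketch of the naive cost-threshold decision. */
#include <stdbool.h>

typedef double Cost;

enum { JIT_NONE = 0, JIT_EXPR = 1, JIT_OPT = 2, JIT_INLINE = 4 };

static int
decide_jit_flags(bool jit_enabled, Cost total_cost,
                 Cost above_cost, Cost optimize_above_cost, Cost inline_above_cost)
{
    int flags = JIT_NONE;

    if (!jit_enabled || total_cost < above_cost)
        return flags;           /* too cheap: interpreting wins */

    flags |= JIT_EXPR;          /* jit expressions, unoptimized code */
    if (total_cost >= optimize_above_cost)
        flags |= JIT_OPT;       /* expensive enough to run the optimizer */
    if (total_cost >= inline_above_cost)
        flags |= JIT_INLINE;    /* expensive enough to pay for inlining */
    return flags;
}
```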
So what's the problem at the moment, why aren't we getting bigger gains? The main reason is that the generated code is pretty crappy, and that's because I wanted to get the basic feature out instead of spending a lot of time optimizing everything and never getting to the point of collecting field experience. The main problem is that the generated programs reference a lot of memory locations, and those locations are per-query: there are a lot of references like "load this hard-coded pointer from static memory and write through it". Unfortunately, LLVM is not very good at doing store elimination and so on when code references memory locations that aren't on the stack. If we converted the programs so they only reference on-stack locations, LLVM could transform them into its SSA form, where it can detect a lot of dead stores and loads, and things get faster. I have prototyped this, and we get roughly another 2x performance benefit just from improving the generated code.

The other big thing is that at the moment, because of the same issue with memory locations, all the generated programs have to be regenerated for every single query execution: we reference memory that is per-query, so we can't just reuse the program, because the pointers in it would point to freed memory, which obviously would not work well. Enabling caching at some point is pretty important. For example, for query 1, if we do no optimization and no inlining, the just-in-time compilation takes 3.5 or 4 milliseconds in total; but if we enable optimization, it takes something like 180 milliseconds in total, and that's already quite a fraction for a query that afterwards runs for 18 seconds or so. And if you accidentally misestimate when just-in-time compilation is beneficial, and you jit a more complicated query that takes a second to run while the compilation also takes a second, you'll have lost time overall even if the compiled query is faster to execute. So caching will be very important, but that's not going to happen in the first version of this.

And I think that's where I'm going to stop. If somebody has questions, I'll happily answer them.

Q: Do you get instruction rescheduling, especially across inlining boundaries?
A: Yes, though some things limit the benefits: as I said, some of the references are to external memory, and even if you annotate them with "this memory's lifetime begins and ends with the query", LLVM unfortunately can't eliminate all of those memory references, so the benefits are reduced. But once you convert all the external memory references into allocas local to the function, LLVM turns them all into registers, and after that it works beautifully; with inlining, things like duplicate overflow checks get done only once. There's just a lot more work needed to get there.

Q: I have a somewhat unusual question, in two parts. First, how did you get your company to open-source this very cool work? And second, maybe I wasn't listening carefully enough, but it's not clear to me: do you have a custom IR emitter for the common bytecode instructions, or did you take the interpreter implementation and just translate it to IR using Clang?

A: How I got my company to do it is two-fold: A, a lot of customers say "we want Postgres to be faster", so there's customer demand; and B, before being hired I said I only want to work on open source, so that settled that part. On the second part: it's a bit of both. There is code that directly emits the IR for the most common instructions, because that's a lot faster than anything else; and for the less common ones, it just calls the same function that evaluates the step in the interpreter. The inliner can then pull that in, and LLVM can, if it decides it's beneficial, specialize that function, which basically amounts to using Clang-generated code to emit it. Doing that for everything would be too slow, so it's a bit of both. I don't know whether we still have time.

Q: You mentioned extensions before; how hard would it be for an extension to benefit from this kind of optimization?

A: It's basically a bunch of makefile rules: you add some targets that generate the bitcode and install it into the server, and afterwards it works. But it only helps if the relevant code is C code, for the operators; if your functions call into some external library whose code you don't have, it won't benefit. And the simple cases are the ones that benefit most, because there the function call overhead is the biggest part. PostGIS was a stupid example, because most PostGIS functions are so expensive and do so much work internally that you won't see a huge benefit; but with, for example, the contrib btree_gist extension, it helps a lot more. For the people leaving: please use this.