Hi, I'm Andres. I work at Citus Data. About half my time is spent on Postgres, where I'm by now a relatively long-term contributor and committer. The other half is spent working on Citus, a distributed layer on top of Postgres. What I want to talk about today is how to make query execution faster.

To start out: why did I want to make query execution faster? Largely because I was seeing comparisons with other databases, and in some workloads Postgres performed very, very comparably; in other workloads, not so much. Since I work with Postgres and I like Postgres, I didn't like the "not so much" part. I wanted to improve upon that. As I said, there are some axes along which Postgres performs well enough, and some other axes where it doesn't work that well. Just to define which axis we're talking about: there are a couple of properties that come up when we talk about performance, and it's all a bit wishy-washy, so I'm going to define how I mean them. Scalability is usually the property that you can add more resources and do more work — with common computers, you can either add more cores to the machine or add more machines. I'm not really talking a lot about scalability today. What I'm talking about is efficiency: how can we do more work with a given set of resources? At the moment, for example, if we disable parallelism and have a query that runs on a single core, we don't use that core very efficiently, and there are other database products that can do so more efficiently for some workloads. Another property is concurrency — how much work can you do in parallel — and I'm not going to talk a lot about that either. So really, my focus today is efficiency.

So what kind of workloads are we broadly talking about today, that a reasonable number of people are hitting with Postgres? One very common one is OLTP; the others are OLAP and HTAP. Online transaction processing (OLTP) is basically when you have very, very short transactions that do a few writes and a few reads, and that is where Postgres generally performs very well. We scale well — there are some issues, but in general it works pretty well. None of the performance improvements I'm talking about today will benefit OLTP in a meaningful manner; there are some minor benefits, but not a lot. Then there is online analytics processing (OLAP), where you have somewhat more complicated queries: your queries run longer, every query analyzes a lot of data, you compute aggregates over that data and then display the result. Those are a lot of the queries I'm talking about — the improvements will help a lot there. Then, what is becoming more and more hip and common is trying to do both analytic and transactional processing in the same database (HTAP). That's also why it's interesting for a database that is more focused on the transactional side of things to also be good at analytics: people don't necessarily want a different data store for their transactional and analytical workloads.

To understand a lot of the problems, benefits, and difficulties of making things faster, one of the first things we have to understand is unfortunately relatively theoretical, and that's a pretty important property called Amdahl's Law. Assume that we have a piece of work that consists of two parts — and by the way, I've just cribbed this graph from Wikipedia, because I'm bad at graphs, so I copied an even worse graph from Wikipedia.
So assume we have a piece of work — this is supposed to be red; yes, I'm noticing that; in theory this is red and blue, so I'm calling the parts A and B. If the work consists of two parts, and piece A takes quite a bit more time than piece B, and you make piece B a lot faster — 5x, and making a piece of software 5x faster is really a lot of work in real life, most of the time — then the total amount of time saved, relative to the total time taken, is actually not that much: while we made that piece five times faster, it was maybe only a fifth of the total time to begin with. If you instead make the part of the system that takes a lot of the time only a bit faster — 2x in the simplified case — then the total amount of time saved is large. Amdahl's law is a bit more complicated than that, and it is often applied to parallelism, but that's the basic idea: you have to optimize the parts that take a lot of time, because otherwise you don't see much result. Formally, if a fraction p of the runtime is sped up by a factor s, the overall speedup is only 1 / ((1 − p) + p/s).

One of the problems is that in Postgres there are a lot of parts you could improve. Usually there isn't a single part where you can say: hey, this part of the system takes most of the time. Usually you have a lot of parts, and each of them takes very little time, so it's very hard to see what actually takes up the time. The profile usually looks more like this: some parts take a bit more time, but there are also a lot of small parts that each take some time. So it's not always easy to see where you have to tackle the problem to gain speedups.

My initial idea, at the beginning of the 9.6 cycle, was that I would work on batch tuple processing. To understand what batch tuple processing is, we have to very quickly understand how Postgres actually processes tuples. Who here has ever seen EXPLAIN output from Postgres? Or, sorry, the other way around: who has not seen EXPLAIN output from Postgres? This is a graphical version, done by pgAdmin, which some people find a bit easier to understand. It's not really important what the query does in total here. If you have, for example, a hash join between the orders table and the customers table, the way Postgres traditionally does this is: it scans one row from the orders table, that row is returned to the hash join node, and the hash join node probes the hash table it previously built over the customers table — but it does so one row at a time: this row, then the next row, then the next. Every time it finds a row with a match in the hash table, it returns that row up to the next layer. And there the same thing happens, one by one by one. That is not necessarily very CPU efficient; there are a lot of problems with it.

So my initial idea was: hey, I'm going to improve upon that. I spent six weeks or so on it, and afterwards my performance benefits were about 3%. I had a bunch of artificial benchmarks where I saw 60, 80, 100% performance benefits, but once I ran it on actually useful queries, the benefits were largely gone. That's because I ended up optimizing something that was not currently the bottleneck — which is why I brought up Amdahl's law before.
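To make that concrete, here is a minimal sketch — with made-up names, not actual Postgres source — of the pull-based, one-tuple-at-a-time model just described:

```c
/*
 * A sketch of the classic pull-based, one-tuple-at-a-time ("Volcano")
 * executor model.  Names are illustrative, not actual Postgres source.
 * Every node hands exactly one tuple up per call, so each row pays an
 * indirect function call at every level of the plan.
 */
typedef struct Tuple Tuple;

extern Tuple *hash_table_probe(Tuple *outer);   /* hypothetical helper */

typedef struct PlanNode
{
    /* returns the next tuple, or NULL when the input is exhausted */
    Tuple *(*next) (struct PlanNode *self);
    struct PlanNode *child;     /* e.g. the scan of the orders table */
} PlanNode;

static Tuple *
hashjoin_next(PlanNode *self)
{
    Tuple  *outer;

    /* pull one row at a time from the scan below us */
    while ((outer = self->child->next(self->child)) != NULL)
    {
        Tuple  *match = hash_table_probe(outer);

        if (match != NULL)
            return match;       /* hand a single row up to the parent */
        /* no match: loop and pull the next single row */
    }
    return NULL;
}
```

Every single row pays an indirect call per plan node here, which is exactly the per-tuple overhead that batch processing tries to amortize.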
This per-tuple model is something that is architecturally bad, but it turned out we had two other things that were even worse — so fixing it didn't really help a lot.

The benchmark I ended up using for some of this, just because it was handy, is TPC-H. This is one example query from TPC-H, and it's not really that important what it does in detail. The important part for performance analysis is that it does a SELECT and then computes a lot of aggregates over one table; there could be a bunch of joins, but that doesn't really matter. The first thing I noticed, after staring at this for a couple of weeks, was that we do a lot of these aggregate computations. The way Postgres works is that to compute an aggregate, it evaluates the value of the argument to the aggregate and then calls the so-called transition function for each of the aggregates, once for every new row it finds in the same group of the GROUP BY. What we do in 9.6 and earlier is compute the value of each of these argument expressions separately — this one, then that one, then the next, and so on. It turns out that doing a lot of work in many small pieces is often slower than doing the same amount of work in one big chunk, because there are efficiencies you can gain by doing the repetitive setup work only once. So the first thing I tried was to compute all of the arguments to the transition functions in one pass. There were also some other optimizations we ended up doing, but those were important, because otherwise we would just hit that as the bottleneck.

If you look at the textual EXPLAIN output for this query, you can see that the actual scan of the lineitem table, which is what we select from, takes roughly 4.5 seconds in total — and by the way, you can see how old this benchmark is. But the total time for the query is about 19 seconds. So scanning the data is obviously not the largest part of where we spend time. It's also not the sort at the end, because we only return four rows, and sorting four rows obviously doesn't take a lot of time. Nearly all of the time is spent in the hash aggregate. The time in the hash aggregate was largely spent in two parts: one is evaluating the list of arguments to the aggregates, and the other is the hashing the hash aggregate actually has to do — maintaining a hash table and doing the aggregation. Optimizing those two together already yielded quite a bit of speedup, and that was important, because otherwise I couldn't have seen the later performance benefits.

If you compare 9.6 with master as of the release version from about two weeks ago — I will explain in a second why I used the version from two weeks ago — we can see that a lot of queries improve a fair bit, especially the long-running ones; for some smaller ones there's very little benefit. As a percentage, some queries improve tremendously, others are a bit slower. It turned out that the one query that got slower is actually an unrelated change; I hadn't yet figured out where it came from, so for one query we have a 9.6 regression. Oh, Robert says it was him. Okay. So yeah, we have a small regression that we have to fix.
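As a sketch of that change — eval_expr() and eval_expr_multi() are hypothetical stand-ins for the executor's expression machinery, which is considerably richer:

```c
/*
 * Sketch of the aggregate-argument optimization: instead of one
 * separate expression evaluation per aggregate per row, a single
 * combined pass computes all transition arguments at once,
 * amortizing the per-evaluation setup work.
 */
typedef struct ExprState ExprState;

extern long eval_expr(ExprState *expr, void *row);            /* hypothetical */
extern void eval_expr_multi(ExprState *combined, void *row,
                            long *out);                        /* hypothetical */

/* 9.6 and earlier: one separate evaluation per aggregate, per row */
static void
advance_aggregates_old(ExprState **argexprs, long *transvalues,
                       int naggs, void *row)
{
    for (int i = 0; i < naggs; i++)
    {
        long    arg = eval_expr(argexprs[i], row);  /* full setup each time */

        transvalues[i] += arg;      /* e.g. a sum() transition step */
    }
}

/* the optimization: one combined pass fills all argument slots at once */
static void
advance_aggregates_new(ExprState *combined, long *argvalues,
                       long *transvalues, int naggs, void *row)
{
    eval_expr_multi(combined, row, argvalues);

    for (int i = 0; i < naggs; i++)
        transvalues[i] += argvalues[i];
}
```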
So we can see that even with these relatively simple changes you already get quite substantial performance improvements, but it's still not that great. By the way, this is all TPC-H, and I ran these in non-parallelized mode, just because the timings are much more reproducible when you don't have to start new processes and such. And I ran it at scale factor 5 because I wanted it memory-resident — I worked on this on a flight, so I only had my laptop available.

Even after that, we spent roughly 60% of the remaining total time on expression evaluation. If you spend 60% of the total time in expression evaluation, that's a very, very juicy target for optimization. It was very hard to see that we spent all of this time in expression evaluation, because it's not just one function; it's spread over hundreds — or at least dozens — of functions. Where do we use expression evaluation? We use it for evaluating the target list of a query — the list of columns or expressions in the SELECT list — for evaluating WHERE clauses (where foo equals, or is smaller than, whatever), GROUP BY clauses, and a lot of other places. So it's a very central thing.

The way it previously worked is that we had a tree of expressions. For example — crap, there was supposed to be an expression on this slide; I think it was in black — something like WHERE a < something AND b = something. What you end up with is: the top-level expression is an AND expression; under it there's a function call, and another function call; each function call has to evaluate its arguments, where one argument is a scalar Var, like a, and another is a Const, like the date it's compared against. Once those arguments have been evaluated, the function itself can be evaluated — the function being the less-than, greater-than, or equals operator — and so on up the tree. That is relatively expensive, because we basically recurse through this tree. Walking the tree is relatively easy to implement, but it's not very good for performance on a modern CPU. Each of these expression node evaluations — ExecEvalAnd and so on — was reached via a callback, which is one of the reasons it's hard to see where the time is actually spent: because everything is called through callbacks, the time is spread over all these different functions and there's no top-level entry point, which makes it hard to discern which one it is. And it was recursive, which meant we spent a lot of time and energy walking up and down the stack at the CPU level.

What I did was to convert expression evaluation from this tree walk into an opcode format — basically what Python and a lot of other languages do internally: they translate the data structures into a format more suitable for execution. The result is that we don't do any callbacks in the common case anymore, we can remove a lot of the jumps by doing some lower-level trickery, and, very importantly, it's not recursive anymore.
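A minimal sketch of what such a flattened, opcode-based interpreter looks like — toy opcodes only, far simpler than the real thing:

```c
/*
 * The expression tree is flattened into a linear array of steps, and
 * one non-recursive loop executes them: no per-node callbacks, no
 * deep stack walking.
 */
typedef enum
{
    OP_FETCH_VAR,       /* load a column value */
    OP_LOAD_CONST,      /* load a constant */
    OP_LT,              /* compare the two loaded values */
    OP_DONE
} ExprOp;

typedef struct
{
    ExprOp  op;
    int     arg;        /* column number, constant index, ... */
} ExprStep;

static long
exec_interp(const ExprStep *steps, const long *columns, const long *consts)
{
    long    reg[2];
    int     nreg = 0;

    for (int ip = 0;; ip++)     /* single flat loop, no recursion */
    {
        switch (steps[ip].op)
        {
            case OP_FETCH_VAR:
                reg[nreg++] = columns[steps[ip].arg];
                break;
            case OP_LOAD_CONST:
                reg[nreg++] = consts[steps[ip].arg];
                break;
            case OP_LT:
                reg[0] = reg[0] < reg[1];
                nreg = 1;
                break;
            case OP_DONE:
                return reg[0];
        }
    }
}
```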
So unless we have subqueries, everything is evaluated at the same level of the stack, which means we can remove a lot of checks and be genuinely more efficient.

Here are the results of that, on top of the previous optimizations — which is why I used master from two weeks ago — and the "dev, LLVM off" line is basically my development snapshot as of a couple of days ago. This has now been integrated into Postgres as of Saturday, so it could now actually be labeled master, but I didn't want to update my slides, because there will be later versions with additional patches. We can see another 30-something percent improvement on some queries, up to 40. You can also see that in this version of my patch there was a small regression in query 5; that has since been resolved, but I left it in here because it illustrates something: it's very easy to optimize for a lot of the cases and then regress in some smaller cases. What this means is that, using this newer expression evaluation framework, we could save a lot of time on the queries where a lot of the time goes into expression work.

But, as I said earlier, we end up spending roughly 60% of the time in expression evaluation. Even if you make expression evaluation 40% or even 70% faster, we still spend most of the time in expression evaluation. So all of this is nice, and it's obviously a good result for Postgres 10, but we could do a lot more.

After I had done the best I could making expression evaluation faster, at some point I just couldn't find easy ways to make it any faster — and that's pretty typical. If you look at a lot of language implementations, you can only do so much with interpretation; beyond that, the overhead of interpretation itself is too high. So the next step is to remove interpretation and generate native programs. What does "native programs" mean? We basically compile the WHERE clause, or whatever we have, into a program and run it natively, without any interpretation. We no longer have to do any of the "check what the next instruction is; if it's this, do this; if it's that, do that" — those checks are pretty expensive compared to doing the work natively. The other thing is that with virtual instructions in an interpretation framework, you have a lot of indirect jumps, and modern CPUs are very good at executing things fast when they know what the next thing will be, but very bad when they don't know exactly what comes next — which is exactly the case in interpretation.
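The "lower-level trickery" mentioned a moment ago is, in part, the computed-goto dispatch technique — a GCC/Clang extension that many interpreters use. A sketch, reusing the ExprStep/ExprOp definitions from the previous example:

```c
/*
 * Computed-goto dispatch: every opcode ends in its own indirect jump
 * rather than returning to one central switch, which gives the branch
 * predictor a separate jump site to learn per opcode.
 */
static long
exec_interp_goto(const ExprStep *steps, const long *columns,
                 const long *consts)
{
    static void *dispatch[] = {
        &&do_fetch_var, &&do_load_const, &&do_lt, &&do_done
    };
    long    reg[2];
    int     nreg = 0;
    int     ip = 0;

#define NEXT() goto *dispatch[steps[++ip].op]

    goto *dispatch[steps[ip].op];

do_fetch_var:
    reg[nreg++] = columns[steps[ip].arg];
    NEXT();
do_load_const:
    reg[nreg++] = consts[steps[ip].arg];
    NEXT();
do_lt:
    reg[0] = reg[0] < reg[1];
    nreg = 1;
    NEXT();
do_done:
    return reg[0];
#undef NEXT
}
```

Even so, the jump target still depends on the data, which is the fundamental limit interpretation runs into — and the motivation for generating native code instead.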
So just-in-time compilation is the thing to do, and every one of you probably uses just-in-time compilation daily: all — or at least the majority — of the JavaScript executed in your browser when you use some website is just-in-time compiled to be faster. The browser vendors hit the same wall: they couldn't make JavaScript any faster when run interpreted, so they had to do just-in-time compilation. They also have a bunch of very familiar problems, namely that they can't just always JIT-compile, because the overhead of doing just-in-time compilation is pretty high. You don't want to do it unconditionally; you really only want to do it when the time spent doing this additional work, minus the time saved, is actually negative. And it's not just these expressions where you could use just-in-time compilation; there are a lot of additional places.

So I implemented a basic just-in-time compilation framework using LLVM. I don't know whether any of you have heard of LLVM; it's basically a framework for building compilers. If you have ever used the Clang compiler, or Xcode on OS X, or Swift from Apple for any development, that's all based on the LLVM framework plus some additional work. I wanted to reuse it because I don't think it's a good idea to develop a separate compiler framework inside Postgres. I actually initially started without LLVM, doing everything directly, but that's not something we could maintain — the amount of work would be too high. So I set out to do this with LLVM, building on the expression evaluation framework I had previously developed, because that format also allows for much easier just-in-time compilation, and I added optional just-in-time compilation on top.

As you can see, the performance improvements are again pretty nice: compared to just the new expression evaluation framework, the JIT on top is another 40%. And that's after the earlier improvements, so the relative improvement is pretty large — but even afterwards we spend a lot of time in expression evaluation, so there's still a lot of optimization potential, and I'll come to that in a second. What you can already see here is that in query 11 — which, if you look at the total amount of time, is very short compared to the rest — just-in-time compilation is actually a net negative. What I did here is set the thresholds so that we always do just-in-time compilation, because that was (a) easier and (b) allowed me to show this. I also ran this benchmark without parallelization; if you enable parallelization, the times in general obviously get a lot lower, and you can see that for some queries just-in-time compilation is still a good benefit, but for some others the benefit is already quite negative. Generally, if a query takes a very short amount of time after just-in-time compilation, then the performance benefit of doing just-in-time compilation is negative.
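For a feel of the moving parts, here is a bare-bones sketch of JIT-compiling a trivial "a + b" function through LLVM's C API — roughly the MCJIT-era interface, error handling mostly omitted; this shows the mechanics only, not how the Postgres patch is structured:

```c
#include <stdint.h>
#include <llvm-c/Core.h>
#include <llvm-c/ExecutionEngine.h>

typedef long (*add_fn) (long, long);

static add_fn
jit_add(void)
{
    LLVMModuleRef   mod = LLVMModuleCreateWithName("expr");
    LLVMTypeRef     i64 = LLVMInt64Type();
    LLVMTypeRef     params[] = {i64, i64};
    LLVMValueRef    fn = LLVMAddFunction(mod, "add",
                                         LLVMFunctionType(i64, params, 2, 0));
    LLVMBuilderRef  b = LLVMCreateBuilder();
    LLVMExecutionEngineRef ee;
    char           *err = NULL;

    /* emit: entry: ret a + b */
    LLVMPositionBuilderAtEnd(b, LLVMAppendBasicBlock(fn, "entry"));
    LLVMBuildRet(b, LLVMBuildAdd(b, LLVMGetParam(fn, 0),
                                 LLVMGetParam(fn, 1), "sum"));
    LLVMDisposeBuilder(b);

    /* set up the JIT and hand back a callable native function */
    LLVMLinkInMCJIT();
    LLVMInitializeNativeTarget();
    LLVMInitializeNativeAsmPrinter();
    if (LLVMCreateExecutionEngineForModule(&ee, mod, &err) != 0)
        return NULL;            /* real code would report 'err' */

    return (add_fn) (uintptr_t) LLVMGetFunctionAddress(ee, "add");
}
```

All of that setup work is exactly the fixed overhead that makes JIT a loss for short queries.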
So again, this is with the artificial setting of always doing just-in-time compilation, which you obviously wouldn't want to run in production. I included it because just-in-time compilation is a nice tool for long-running queries, but a really bad tool for very short-running queries. Parallel query is interesting here because it turns relatively long-running queries into shorter-running ones — and I wanted to plug that, because parallel query is the other big avenue for making Postgres fast; I think Robert is giving a talk about that tomorrow. So there's a lot more going on than just this. I think the framework is pretty good, but we need to be a lot more intelligent about when to actually do just-in-time compilation, and that work is not going to be in Postgres 10, because I already spent a lot of time on the framework that produced the previous benefits — this part is probably another couple of months of work, and unfortunately time is quite limited. That's the part where I know pretty well where we want to go, and I really want to get the just-in-time compilation into Postgres 11.

Next I want to talk a bit about what further optimizations I can see, in a similar vein. But so far — has anybody got questions so far? Okay.

One other bottleneck I see in a lot of queries — one that various people in this room have already complained about, and that various forks of Postgres have spent a good amount of time trying to make faster — is so-called tuple deforming. What that basically is: we store tuples in a certain format on disk, and that on-disk format — people can argue about how dense it really is — is denser than the representation we use in memory. Every row that we actually read from disk, or that survives filtering by an index, needs to be converted from the on-disk format into the in-memory format. For analytics queries that happens a lot of times, because we process a lot of rows, so it's something we have to optimize. It's a similar story to expression evaluation: it's currently slow because it's very hard for the CPU to predict, since we obviously have to deal with tables of every shape — sometimes tables with only one column, sometimes with many columns, different types, different widths. That's actually pretty hard to do in a very fast manner. I spent a lot of time on it — more than a month of actual time, even excluding interruptions — and I managed to speed it up by 3% while adding a thousand lines of code. That turned out not to be a great trade-off. So I think this is another case where we have to do just-in-time compilation: unless somebody comes up with a great idea for making this faster without it — I wasn't able to, and a bunch of other people who tried weren't able to do much about it either — we need just-in-time compilation here as well. As for the speedups: I have a very preliminary patch that doesn't handle every corner case, so it's not actually ready yet, but the deforming part itself gets up to 80% faster, which is a pretty beefy improvement. Obviously the total query time won't get 80% faster, because we don't spend all of the time in tuple deforming.
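A rough sketch of why generic deforming is hard on the CPU — heavily simplified relative to the real slot_deform_tuple(), with hypothetical helpers:

```c
/*
 * Every attribute of every row takes data-dependent branches — null?
 * fixed or variable width?  what alignment? — which the CPU cannot
 * predict well across arbitrary table layouts.  A JIT-compiled
 * deformer knows the concrete column layout and can emit
 * straight-line code with most of these branches gone.
 */
typedef struct
{
    int     attlen;     /* fixed width in bytes, or -1 for variable */
    char    attalign;   /* required alignment: 'c', 's', 'i' or 'd' */
} AttrDesc;

extern long align_offset(long off, char attalign);  /* hypothetical */
extern long varsize(const char *datum);             /* hypothetical */

static void
deform_tuple(const char *data, const unsigned char *nullbits,
             const AttrDesc *atts, int natts,
             const char **values, int *isnull)
{
    long    off = 0;

    for (int i = 0; i < natts; i++)
    {
        /* branch 1: is the attribute NULL? (a clear bit means null) */
        if (nullbits && (nullbits[i / 8] & (1 << (i % 8))) == 0)
        {
            isnull[i] = 1;
            continue;
        }
        isnull[i] = 0;

        if (atts[i].attlen > 0)     /* branch 2: fixed width? */
        {
            off = align_offset(off, atts[i].attalign);  /* branch 3 */
            values[i] = data + off;
            off += atts[i].attlen;
        }
        else
        {
            /* variable width: the length sits in the datum's header */
            values[i] = data + off;
            off += varsize(data + off);
        }
    }
}
```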
One thing that's pretty important here: I'm currently optimizing basically one particular benchmark, TPC-H, which is a benchmark from — I don't know, 1996, I think, something like that — and which is probably not the best representation of the workloads you run. So if you have workloads where your queries are too slow, please, please, please tell the community what you're doing and what's too slow. Otherwise we're optimizing workloads that are not actually representative of what people are doing, and that's not helpful to either you or me: I wasted my time, and you wasted your time upgrading Postgres and it's not faster after all, even though we promised, hey, we made this type of query a lot faster — because we didn't optimize for your queries. So please, if query performance is a bottleneck for your workload, tell the community. Say, hey, I'm doing this, and this is too slow. If you can provide benchmarks of what's actually too slow, or if we can tell you how to capture that data, please come to us and tell us about it, because otherwise we won't improve what you're doing.

(Question: what's the best way to do that, specifically?) There's the pgsql-hackers mailing list — on the last slide I'll have my email address — or the general Postgres mailing lists; if you just search for the Postgres mailing lists you'll find the right one, or you can search for my name and it'll pop up and I can forward it. But yeah, just as an informal poll: who here runs into problems with query performance in Postgres? So, roughly half or so. Okay — please, at least a third of you, come back and tell us what's too slow. My hope is not really that I personally implement the just-in-time compilation of tuple deforming and everything else, because there's only so much time I can spend on it. My goal is basically to get the framework for doing just-in-time compilation into Postgres 11, and then other people — hopefully including some people here in the room — will start using just-in-time compilation for other parts of Postgres.

As I said a few slides ago, even with my just-in-time compilation we still spend a lot of time in expression evaluation for a lot of queries. That's not really a fault of just-in-time compilation in general; it's because my just-in-time compilation is stupid — building it was already a lot of work. What I basically did is: we had this tree of expressions, which we used to walk recursively; for the fast expression evaluation we converted that to a linear form; and I translated that linear form directly into a native program and ran it. But that means there are still a lot of function calls in there, because Postgres is very extensible — you can add new data types at runtime, you can create an extension which creates a new data type — so a lot of the code is written in a very extensible manner, and that has some performance overhead.
For example, if you have a WHERE clause that says WHERE a + b < c, what that results in is an expression that first says: get me this column from the table; then call the float8 (or int4, or whatever) plus operator on it and another column from the table; then run the float8 (or whatever the data type is) less-than operator on that result and another column. So for this one expression we already do two function calls, and these function calls are not actually necessary: for the built-in types in Postgres, we could provide the just-in-time compiler with information about the data type, and the operation could just be inlined into the compiled program. That is potentially a lot faster, because a CPU can add two floating-point numbers many times a second, while going through the function-call machinery takes roughly a thousand times longer than just doing the addition inline. Part of that is error handling that you'd probably still have to do afterwards — we still have to detect overflows — but in a lot of cases much of the overhead can just be optimized away.

The problem is that we then have to tell our just-in-time compiler, namely LLVM, how each function is actually defined. We have to develop a framework for telling LLVM how all of our internal functions are defined. That's actually not impossible, because LLVM — or Clang and LLVM together — have this nice feature that you can generate JIT-compiler-ready code for all of these functions, load it into memory when Postgres starts, and then, when Postgres does just-in-time compilation, it can inline all of them. One of the big problems here is our extensibility: it's quite possible to do this for the built-in types, but a lot harder to develop a framework where we can do it for external data types. If you, for example, wanted to do it for — I don't know which extensions you use frequently; PostGIS? — and you wanted PostGIS just-in-time compilation (I don't think it's a particularly good target for just-in-time compilation, but it's the most common extension data type I could come up with), then we wouldn't know the definition of that type. So that's one of the big challenges: how to do this in a way that extensions can use.
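A toy illustration of the gap — made-up types; Postgres's real calling convention is the FunctionCallInfo machinery in fmgr.h:

```c
#include <math.h>

typedef struct ToyFunctionCallInfo
{
    double  args[2];
    int     argnull[2];
    int     resnull;
} ToyFunctionCallInfo;

/* the extensible path: boxed arguments, reached via an indirect call */
static double
toy_float8pl(ToyFunctionCallInfo *fcinfo)
{
    double  result = fcinfo->args[0] + fcinfo->args[1];

    if (isinf(result))
        fcinfo->resnull = 1;    /* the real code raises an error */
    return result;
}

/*
 * What the JIT can emit once it knows the function's definition:
 * a single machine add — the overflow check remains, but all the
 * boxing and call indirection disappears.
 */
static inline double
inlined_add(double a, double b)
{
    return a + b;
}
```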
Similarly, and more importantly, the same applies to avg() and expressions like that. For an avg() expression, we extract the column, call the plus operator for the data type, and then, for each row in the group, we call the avg() accumulator, which internally just adds to the sum and the count. At the end, once we have seen all the values inside the group, we call the final function, which does the result computation. It's relatively easy to see that most of the time will be spent either in the plus operator or in the avg() accumulator, because the final function is called only once per group, and usually there aren't that many different groups, so it's called a relatively low number of times. So one big win would be to inline the accumulator. If you think about what avg() needs to do: add one to the count of rows, add the value to the running sum — and that's really it. It's maybe 15 instructions or so if you do it right, but at the moment we spend a lot more time on it because of function-call overhead and some other internal details. One of the big problems is that the way Postgres calls functions is relatively complicated: we have to support calling functions with a nearly arbitrary number of arguments, and CPUs don't really do that, so there's a lot of odd overhead from that; we also support functions that return multiple rows as a result, which C doesn't really allow, so there's a lot of framework around calling a function. That's a lot of overhead we could optimize away in the optimized cases.

We could also generally generate better code: the code I'm generating right now is really simplistic — there are a lot of immediate values, if that tells anybody anything; we have pointer values embedded directly in the program code — and we can do a lot better there. So I really hope we get another good chunk of performance benefits after this. And after this, we can actually go back to what I started with: batch tuple processing. After all these optimizations, my initial guess about where the overhead was finally turns out to be true — the overhead then becomes this one-by-one-by-one tuple processing through the query tree. But that's a bit further out, because we have to get the just-in-time compilation work done first. I still have my early batch-processing approach from the beginning, the one that initially got 3%; applying it on top of all these changes, on some queries I got 0%, because it's still not the bottleneck there, but on others I got another 40% performance improvement on top of everything else. So there's a lot of optimization potential there, and I'm pretty sure there are a lot more ideas beyond the stuff I came up with.
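To put the avg() example above in concrete terms, a minimal sketch of the transition and final steps — illustrative names, not the actual Postgres implementation, which works on its internal aggregate state types:

```c
/*
 * The per-row transition step really is just two additions on a small
 * state struct; today each such step costs a full function call, while
 * an inlining JIT can reduce it to a handful of instructions.
 */
typedef struct AvgState
{
    double  sum;
    long    count;
} AvgState;

static inline void
avg_accum(AvgState *state, double value)
{
    state->sum += value;        /* running sum */
    state->count += 1;          /* rows seen in this group */
}

static inline double
avg_final(const AvgState *state)
{
    /* called once per group, so its cost barely matters */
    return state->count > 0 ? state->sum / state->count : 0.0;
}
```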
Oh — I thought there was an email address on that slide, but apparently it's just my name. I think that's all I had to talk about. Any questions?

(Question: an observation — it looks like you need to keep both versions of the code around?) Yes: we have to keep the current, interpreted version of expression evaluation, because just-in-time compilation is not always suitable, and we have to have the optimized one, because the interpreted version is not always suitable either. As for the question of showing which one was used — I think it would be relatively straightforward to show that in EXPLAIN, or rather not in EXPLAIN but in EXPLAIN ANALYZE. So far I've experimented with two ways of determining whether to use just-in-time compilation. One: you set a cost factor, and every query more expensive than that many cost units gets everything JIT-compiled. The other: every expression that gets called more than n times gets JIT-compiled. Both have advantages. The latter has the advantage that we only JIT the expressions we actually call frequently, instead of the stuff that only gets called once, but it has the disadvantage that we run the slow path n times first. We could easily indicate the decision in the EXPLAIN ANALYZE output, because that gets the post-execution query tree.

(Follow-up: it seems like there's real potential for divergence, where the two implementations don't act exactly the same, and you'd want to know when that happens.) I sure hope not. I started out by adding assertions so that every time an expression is evaluated with just-in-time compilation, it's also evaluated without, and the results are compared, across all the regression tests — and the same for tuple deforming — because I was very afraid of that. The two should not diverge at all, but who knows.

(Question: I assume the tuple-deforming JITing could be done at boot time, or else done once, cached, updated after every table DDL, and reused across many queries?) It's not necessarily easy to do this across process boundaries. What I did was cache it within the process for the common case. But you also deform tuples at other levels of the query tree — for example, if you do a hash join and store a compressed version of the tuple inside the hash table, we need to deform that again afterwards, and that one is not easily cacheable. But yes, in a long-running backend you would cache across multiple queries. So far I didn't even bother implementing a limit on that cache size, because the compiled functions are really small — usually you end up with something like 300 bytes — so it's not that important, but over the longer run we should probably implement something like "you're allowed to keep this many cached things around". I haven't thought too much about that, honestly.

(Question: do you think there's a lot of potential for ahead-of-time compilation using the same framework, perhaps during statement preparation?) At the moment all the just-in-time stuff is tied to execution time, and I think that's a really bad thing; we should make it possible to cache all of this at prepare time. I would still probably not do it unconditionally, because even prepared statements are often executed only once, and a lot of the expressions in a prepared statement are not actually executed frequently, so we still want some intelligence about when to do it. But we definitely should be able to cache it inside the prepared statement, so we only compile on the first execution or something. Yes, I think we should do that; I don't quite know all the details of how to do it yet. Thank you.
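A sketch of those two gating heuristics — all names are made up for illustration (they loosely echo what later discussions called a jit_above_cost threshold), and jit_compile_expr() is a hypothetical compiler entry point:

```c
static double jit_above_cost  = 100000.0;   /* heuristic 1: plan cost */
static int    jit_above_calls = 1000;       /* heuristic 2: call count */

/* heuristic 1: decide once, up front, from the planner's cost estimate */
static int
should_jit_by_cost(double plan_total_cost)
{
    return plan_total_cost > jit_above_cost;
}

/* heuristic 2: interpret first, compile only expressions that get hot */
typedef struct GatedExpr
{
    int     ncalls;         /* interpreted executions so far */
    void   *compiled;       /* native code, once it exists */
} GatedExpr;

extern void *jit_compile_expr(GatedExpr *expr);     /* hypothetical */

static void *
maybe_jit(GatedExpr *expr)
{
    if (expr->compiled == NULL && ++expr->ncalls > jit_above_calls)
        expr->compiled = jit_compile_expr(expr);
    return expr->compiled;      /* NULL means: keep interpreting */
}
```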
(Question: our computers these days have so many cores that it's hard to keep all of them busy, even with parallel queries. If there's idle time left over among your cores, might it make sense, when you encounter a new expression during execution, to start down both paths and see which one gives the answer faster for that particular iteration? Meanwhile, perhaps at lower priority, a code-generation thread keeps compiling in a JIT fashion, and the next time around the compiled code might already be ready — effectively getting the JIT for free instead of paying the penalty up front or having to store it somewhere. Is that a viable idea?) I think that's a longer-term thing that could be beneficial. There are a lot of things that make it non-trivial: as I said, one process's memory layout might not be the same as another process's, and a function might exist only in the current backend because it was created in the current transaction, so there are a lot of complexities around detecting when exactly you can do that. I don't think we're likely to try executing the JITed and non-JITed versions in parallel, but the idea of starting the JITing in a separate process and then using the compiled version once it's ready — I think that has merit. It also has some associated runtime overhead, because every now and then you need to check: hey, can I replace this function with the JITed version yet? But I think it's something for later; it's a relatively advanced thing, and we're not there yet.

(Question: with regard to function calls — say substring() on a string, or extracting the year from a timestamp — do you think we'll get to the point where we just-in-time compile those particular functions in place, eliminating the function-call overhead for the most commonly used ones?) For the very common ones, yes. My current idea is basically: when we compile Postgres, we go through all the functions that are defined — Postgres knows which those are — extract the bitcode for them, put that in the catalog, and then inline from that. There will be some size restrictions, because it doesn't make sense to inline a function that calls half the backend, for relatively obvious reasons, but yes, I think we definitely want to do that. I'm not sure you'd see a benefit on substring() — the overhead of the function call itself is not where most of its time is spent — but for things like plus and less-than, we definitely should do it. Thanks.
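A sketch of the runtime half of that pipeline using LLVM's C API — assuming .bc files produced at build time with clang -emit-llvm; error handling abbreviated, and this is the idea, not the actual patch:

```c
#include <llvm-c/Core.h>
#include <llvm-c/BitReader.h>
#include <llvm-c/Linker.h>

/*
 * Load pre-generated bitcode and link it into the module being JIT
 * compiled, so the optimizer can see the function bodies and inline
 * them into the generated expression code.
 */
static int
link_bitcode_for_inlining(LLVMModuleRef jitmod, const char *bcpath)
{
    LLVMMemoryBufferRef buf;
    LLVMModuleRef defs;
    char *err = NULL;

    if (LLVMCreateMemoryBufferWithContentsOfFile(bcpath, &buf, &err))
        return -1;              /* could not read the file */
    if (LLVMParseBitcode2(buf, &defs))
        return -1;              /* not valid bitcode */

    /* merge the definitions; the optimizer can now inline them */
    return LLVMLinkModules2(jitmod, defs) ? -1 : 0;
}
```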
(Question: on one of the slides you mentioned that expression evaluation has to work not only with the default data types but with custom data types as well. Why can't we have two separate paths — one for the default data types and another, possibly more expensive, for custom data types?) We could definitely do that. One of the problems is that it adds a bunch of code duplication if we start down that path. I would rather have a generic solution that's usable both by in-core data types and by external data types. If we can't figure out a way to do that, then I personally think — and I'm pretty sure some other people in the community will disagree — that we'll have to hard-code operations for the very common data types: int4, int8, float4, float8. If you inline the operations for those, you already see a lot of the benefit, and that's potentially something we'll have to do, but I would rather have a generic solution.

(Follow-up: any generic solution is going to add some additional cost, right?) A little bit, maybe — but not necessarily. If we do just-in-time compilation and we have an inlineable definition of the function, then there's no difference between an in-core and an external data type anymore; in that case we're good. It obviously only works if the extension provides that inlineable definition, but if your extension doesn't take advantage of it, I have no problem with it being slower — we can't do magic. Thank you.

(Question: the approach you describe here is a very carefully crafted, conscious approach, targeting specific features of the expression. I wonder if a more blind approach might also bear fruit: write the entire expression evaluator in a language that compiles down to P-code — the way Java works — and then have a virtual machine that just picks up speed as it goes, so there's less specific work to do at the level you're working at. Has that comparison been thought about or attempted?) I think what we do in reality is actually pretty close to what you describe: we have this opcode-based language for interpretation, and then we translate that into the compiled form — so basically we translate it into something like Java bytecode and then JIT that. That's actually kind of similar. What I don't think we would do is anything very granular: with Java, some JVMs can do that on a per-loop basis, where they optimize individual hot inner loops and only JIT-compile those. I don't think we want to go there, because the overhead of detecting when to do that — of counting whether it's worthwhile — is noticeable. And the dispatch overhead of an actual JVM would be way too high: we're talking about very few cycles in total here, and having to enter a JVM would already be slower than the interpreted version, unless you have a humongous expression — and the JVM couldn't easily call back into Postgres functions. So I don't see us using an existing VM there; I don't think it's likely, but maybe you have a better idea — we can talk about it later.

I think that's all the time we have. Thank you.