My name is Greg Stark. I've been a Postgres developer since about 2006, roughly ten years, and that's about half of Postgres' lifetime. Postgres was released as open source in 1995, 1996, so this is the 20th anniversary of its open source life. There's another talk by Magnus, the look at the elephant's trunk, covering the new features. I'm going to present some of the history of Postgres and how it got to where it is today, so I'll look at the elephant's tail. Twenty years on from the open source release is a good time to look back.

I'm going to look specifically at the sorting code. That's a narrow little slice of Postgres, but I think it's an interesting one, because it affects a lot of user queries and a lot of user-visible behavior: it impacts building indexes and several other parts of the system, as well as, obviously, queries. And it's nicely isolated from the rest of the source base: it's its own module, with an API that cleanly separates it from the rest of Postgres. The history is interesting because it shows you the people who were involved and the priorities of the day, and you can see how those priorities changed over time as the usage patterns and the hardware changed.

One thing I want to highlight, though, is that this isn't just about making Postgres faster. Every release there's a note in the release notes saying the new version is 5% faster, or 10% faster, or whatever, and it's easy to be fooled into thinking that means everything is 5% faster. That isn't very exciting: if your application is fast enough today, it will still be fast enough at 5% faster; it's not going to change your business or solve any real problems you have. In reality, and this is the theme I really want to focus on, many of these changes address problems that users actually had. They address discrepancies where some things were fast and other things were slow, or cases where Postgres hadn't adapted to hardware changes: you could increase the amount of memory but Postgres wouldn't use it, things like that. So when you see that Postgres got faster, it really means some user, or many users, had a problem that limited how they could scale or use their database, and we solved that problem. And that means you might have that problem too. You might have an outage one day that the new version wouldn't have had. The risks of upgrading need to be balanced against the risk of not upgrading: you could run into one of these problems and report it, only to discover that if you'd been on the new version it would already have been solved, because you're not the first person to run into it.
So one day you change your queries and the behavior isn't what you expect, or you increase a parameter and don't see the performance benefit you expect, or you improve your hardware and Postgres runs into a limitation. You're not the first person to run into that, and the new version might already have solved it.

So that's the introduction. This is the big picture: the last 13 years. I couldn't quite go back 20 years, for reasons I can get into in Q&A, but I managed to check out and build old versions of Postgres with modern tools on modern hardware, so all of these results are comparable. The only difference between these data points is the Postgres code. This is not comparing the way Postgres behaved 13 years ago with the way it behaves today on modern hardware; it's comparing the way it would behave today if we hadn't changed the code with the way it behaves today with the new code. (And the battery is dying in this remote, so I can't use the laser pointer.)

The chart starts at 7.3, and you can see that in the early days there was a lot of fluctuation and performance improved dramatically, and later on there were very specific increases in very specific versions. A couple of things to highlight about this chart. First, it's a log scale, so don't be confused: the differences at the top are just as large, percentage-wise, as the differences at the bottom. A six-inch drop on the slow queries is the same percentage improvement as a six-inch drop on the faster queries, so differences that might seem small at the top are actually major. The difference between those top two lines is a factor of two.

Each of these lines is a different query, and most of them are just doing sorting. The blue line at the bottom is a plain sequential scan. The green, yellow and red above it are sorting, in order: integer data, floating point data, and a binary blob, bytea. The top two are sorting text: the blue one is sorting text that doesn't fit in work_mem, so it needs to do repeated passes (I'm going to go into the algorithm in detail), and the red one is configured with a very large work_mem so it all fits in memory. All of these data sets fit in the physical memory of the machine. I did run a benchmark where they didn't, and there was a pretty small difference; it didn't really illuminate anything and it made the chart more cluttered. I'll repeat this chart after each improvement so you can see what the impact is.

1995 was the first public release. I couldn't compile it, so it's not in that chart, but I did go look at the code: the sorting module was about 600 lines of code at that time. It turns out sorting is a difficult intellectual exercise, but it doesn't actually require a lot of code. That code is actually fairly similar to what we have today. What we have today is a lot more lines, because there are more features, error handling, and some optimizations that inflate the size, but structurally it's very similar. There's one major change between that code and what we have today: in 1997 Vadim changed the behavior so that when the data fits in memory it uses an algorithm called Quicksort.
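As a rough illustration of that change (a minimal sketch with made-up names, not the actual Postgres code): accumulate rows while they fit inside the memory budget and quicksort them right there, and only fall back to the external, tape-based sort described next once the budget is exceeded.

```c
/*
 * Hedged sketch, not Postgres source.  If everything fits within the memory
 * budget, a plain in-memory quicksort (here, C's qsort) finishes the job
 * without touching disk; otherwise we would switch to the external sort
 * described later.  "work_mem_bytes" and "start_external_sort" are
 * illustrative names, not real Postgres identifiers.
 */
#include <stdio.h>
#include <stdlib.h>

static int
cmp_int(const void *a, const void *b)
{
    int ia = *(const int *) a;
    int ib = *(const int *) b;
    return (ia > ib) - (ia < ib);
}

static void
start_external_sort(void)
{
    /* stub: in Postgres this is where sorted runs start going to disk */
    fprintf(stderr, "data exceeded the memory budget, spilling to disk\n");
}

int
main(void)
{
    size_t work_mem_bytes = 64;                 /* tiny budget for the demo */
    int    values[] = {42, 7, 19, 3, 88, 1};
    size_t n = sizeof(values) / sizeof(values[0]);

    if (n * sizeof(int) <= work_mem_bytes)
    {
        /* everything fits: in-memory quicksort, no disk I/O at all */
        qsort(values, n, sizeof(int), cmp_int);
        for (size_t i = 0; i < n; i++)
            printf("%d ", values[i]);
        printf("\n");
    }
    else
        start_external_sort();

    return 0;
}
```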
Now, Quicksort is important to understand because it's actually what you're using in many cases; often your data does fit in memory. It has a lot of advantages. One of the main ones is that it doesn't need to be tuned for the size of the CPU cache, or really for the various cache levels at all: because it's a divide-and-conquer method, it breaks your data into successively smaller chunks and eventually reaches a point where a chunk fits in your CPU cache, and it takes advantage of that automatically. The more CPU cache you have, the better the performance, and Postgres doesn't need to be tuned to take advantage of it. It also has a very low constant factor: all of these sorting algorithms are N log N, but the factor in front is relatively small for Quicksort. It has some downsides as well; the code in Postgres is actually used as an example in some discussions online, and you can find interesting papers on Quicksort where the Postgres implementation is a good example of which of those downsides need to be addressed. I'm not going to go into detail on how Quicksort works. The important things to understand are that it's a very efficient algorithm, it makes good use of the CPU cache, and it only works if your data fits in memory.

When your data doesn't fit in memory we need to do an external sort, and I do want to spend a few minutes on that, because a lot of the improvements over the years, including fairly recent ones, have been to this algorithm. The algorithm we use is almost straight out of this book, Donald Knuth's The Art of Computer Programming, Volume 3, published in 1973. It's been the bible for programmers all that time, and in fact the comments in the code refer to specific page numbers and steps of the algorithm in the book. I'm going to try to rush through this, because there are recent changes at the end of the talk and I want to leave enough time for them. This is fairly technical, but it helps to understand the improvements.

The basic strategy is this: you can only fit some of your data in memory, but you want to sort all of it. So you fit into memory what you can and generate a sorted subset of your data, the size of your memory, or hopefully as much larger than memory as you can manage; then you work on the next batch and generate another subset; and once you've generated all of these sorted subsets, you merge them into one long sorted list. One of the goals is to make these subsets as large as possible, and it's actually possible to generate sorted subsets that are larger than what you can fit in memory at any one time.

So the first step is generating these sorted subsets. In this example you have to imagine you can only fit three datums in memory at a time; obviously in reality it would be thousands, but that's harder to put on a slide. You load the first three into something called a heap, which lets you extract the least element of that set; it shifts up to the top, and you output that least element to your sorted subset. Then the heap updates, things shift upwards, and you load the next value, the fourth one. Now you know what the least element of the first four elements is, and that one gets output.
And you proceed. Eventually you reach an element that has missed its chance: ideally it should have been output back here, but you don't have an infinite amount of memory, you had to output something, and you've already output the B and the C. Now you discover there's an A; it's too late, it's missed its chance, so you mark that A as belonging to the next sorted subset. You proceed: you can output the D, then there's another element, a B, that has also missed its chance, but you can still output the F, and then you discover there's actually a G, which you can still output, it hasn't missed its chance. Eventually you reach the point where the next element belongs to the second sorted subset. These subsets are actually called runs. So now you start outputting elements to the second run, and so on, and you keep going; you might generate hundreds of these runs. I'm sorry if I'm going too fast, but as I said, there are recent changes at the end and I'm worried about running long. Is this fairly clear? It does depend on already knowing what a heap is, which I didn't want to get too deep into. [Audience: Yeah, I got it.] The key point is that you're generating these sorted runs, and each run is a sorted subset of your data.

The next step is to merge all these sorted runs, and again that uses a heap. You load the first element of each run, and that tells you the first element you need to output. Then you need to replace that element: the first element came from run two, so we replace it with the B from run two, so that the heap always has exactly one element from each run. It's exactly the way you would merge piles of name cards at a registration desk: you have a bunch of piles that are each sorted, you take whichever card goes next, and that exposes the next card in that pile. So we replace it with B2, which came from that run, output the B1, replace it with C1, and we continue until we've merged all of the runs.

There's one detail that isn't apparent in this slide. This works fine if you've got three runs and you can fit three things in memory, but if you have more runs than you can merge in one step, you might need to repeat the process and merge repeatedly down to fewer and fewer runs. Knuth goes into quite a bit of detail about how to schedule that merging, because in 1973 tape machines were expensive and slow, and you spent a lot of time rewinding tapes and changing tapes. So there are algorithms here, each a different ordering in which to schedule these merges; it's in the book on a fold-out page, like a picture book. One of these algorithms is for when you can run your tape machines backwards: you write everything to one tape, then read going back in the other direction, then write again, and that's more efficient than having to rewind.
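To make the run-generation step concrete, here's a small standalone sketch of that heap-based scheme, which is known as replacement selection. This is illustrative code written for this transcript, not the Postgres implementation; single characters stand in for the datums, and only three fit in "memory", as on the slides.

```c
/*
 * Replacement selection sketch (not Postgres code).  The heap holds only
 * HEAP_CAP items; values that arrive too late for the current run are
 * tagged for the next one.  Ordering the heap by run number first makes
 * the current run drain before the next one starts.
 */
#include <stdio.h>
#include <string.h>

#define HEAP_CAP 3                      /* only three items fit in "memory" */

typedef struct { int run; char val; } Item;

static Item heap[HEAP_CAP];
static int  heap_size = 0;

static int
item_lt(Item a, Item b)
{
    return (a.run != b.run) ? (a.run < b.run) : (a.val < b.val);
}

static void
heap_push(Item it)
{
    int i = heap_size++;
    while (i > 0 && item_lt(it, heap[(i - 1) / 2]))
    {
        heap[i] = heap[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    heap[i] = it;
}

static Item
heap_pop(void)
{
    Item top = heap[0];
    heap[0] = heap[--heap_size];
    for (int i = 0;;)
    {
        int l = 2 * i + 1, r = 2 * i + 2, smallest = i;
        if (l < heap_size && item_lt(heap[l], heap[smallest])) smallest = l;
        if (r < heap_size && item_lt(heap[r], heap[smallest])) smallest = r;
        if (smallest == i) break;
        Item tmp = heap[i]; heap[i] = heap[smallest]; heap[smallest] = tmp;
        i = smallest;
    }
    return top;
}

int
main(void)
{
    const char *input = "CEBDAFBGH";    /* unsorted input stream */
    int pos = 0, nin = (int) strlen(input);
    int current_run = 0;

    /* prime the heap with as much as "memory" holds */
    while (pos < nin && heap_size < HEAP_CAP)
        heap_push((Item){ 0, input[pos++] });

    printf("run 0: ");
    while (heap_size > 0)
    {
        Item out = heap_pop();
        if (out.run != current_run)     /* current run is exhausted */
        {
            current_run = out.run;
            printf("\nrun %d: ", current_run);
        }
        printf("%c", out.val);

        if (pos < nin)
        {
            char next = input[pos++];
            /* too late for the current run? tag it for the next one */
            int run = (next >= out.val) ? current_run : current_run + 1;
            heap_push((Item){ run, next });
        }
    }
    printf("\n");
    return 0;
}
```

With the input "CEBDAFBGH" it prints run 0: BCDEFGH and run 1: AB. Most of the input lands in the first run even though only three items were ever held in memory at once, which is exactly why these runs can end up larger than memory.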
Some tape machines couldn't run backwards, so there's also an algorithm for when your operator needs to change tapes and how to schedule that. These factors aren't really relevant to Postgres these days; not many people are running on tape machines. In our implementation these tapes are just files on disk and rewinding is instant. One of these algorithms, I think it's the third one, is the one we implemented in 1997.

So two years later Tom rewrote some of this code, namely the abstraction layer that represents these tapes, so that it would reuse space. Actually it was 1999: it was released in 2000, but the code went in in 1999. Instead of having a file for each tape, there would be one file with some blocks belonging to one tape and some blocks belonging to the next. The key point is that at this time Postgres was still mainly a sort of academic exercise; in the early internet days there weren't a lot of big production systems running on it. But already people were pointing out that it kind of sucks that you need to store one copy of the data, plus the copy you're merging, plus the copy you finally output: you needed a lot more disk space than the size of the data you were actually storing. He solved that problem, and this is what I was getting at in the beginning: these aren't just speed improvements, these are solving problems. After this change you basically needed only as much temporary disk space as the amount of data you were sorting, not two, three, or in the worst cases four times as much. It also removed a limitation: it used to be that none of the tapes could be larger than two gigabytes, and because he's multiplexing files in a clever way, you could now sort more than two gigabytes of data on disk at a time. I've mixed up these two slides; they were two steps of the same change, the one I was just talking about, avoiding multiplying the storage space. These changes were in 7.0. That sounds like early days, and I wasn't involved yet, but this was quite relevant back then. And there was another major change: building an index now uses this code too; index builds used to have their own sorting code with its own behaviors.

So at this point, in 2000, the code looked very similar to what we have today: it quicksorts if the data fits in memory, and if it's doing an external sort it uses these tapes on disk, with the heap in memory to generate the runs and then a heap to merge them. Until a couple of weeks ago, that was pretty much exactly what happened when you ran a sort in Postgres.

It took six years before somebody revisited this. By this point, 2006, 2007, people are using Postgres in production, on the internet, for websites, to run their businesses; this was production-quality code. But in 8.2 Simon Riggs came along and said: this stuff from Knuth about tape drives, back in 1973, rests on the pretty reasonable assumption that you have some fixed number of tape drives in your data center, and you aren't going to go buy more tape drives just to run a sort faster. It's a lot easier to have more files on your hard drive. So why are we still merging only a handful of tapes? I showed merging three tapes; in fact the Knuth algorithms, I should point this out here, all assume the same number of tape drives, seven, and they're all about how to make the best use of your seven tape drives.
So Simon Riggs pointed out that we can have dozens of tape drives, or hundreds: whatever number makes sense for the sort you want to do. As a result the time for these queries dropped by a factor of about two to four. The sorting, I can't see the colors, but the top ones dropped from about 45 seconds to about 25 seconds, and some of them did even slightly better. The idea is that you can have as many tapes as you can merge in memory in one shot. The limitation is basically whether you have enough memory to buffer enough I/O that you're not jumping between tape files too often: we want to read something like a hundred kilobytes from a given tape file at a time, not one record at a time. If you have a million tapes you're only going to be able to fit one record from each, but on a machine with a reasonable amount of memory you can have thousands of files and read several hundred kilobytes from each file in each step. So he implemented that: it calculates how many tapes it can fit in memory if it's reading several hundred kilobytes from each, and now it typically merges hundreds of tapes in one step. And the interesting behavior is that usually you can do it in one merge; you don't need to schedule multiple merge passes. Back in 2006, if you were doing data warehousing, you still might have needed multiple merges; today, probably not.

Just to look at that graph again: that's this first big drop here, and you'll note it affects all the different data types, but it doesn't affect this blue line at the top. Sorry, it doesn't affect either of these; this one is the quicksort, so it shouldn't be affected, but I think I need to look into that one; it might be the one that was actually doing I/O. I might have plotted the wrong line there; I'm going to look into that.
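The arithmetic behind "how many tapes can we merge at once" is simple enough to sketch. The numbers below are illustrative, not the actual tuplesort constants or formula:

```c
/*
 * Back-of-the-envelope sketch of the 8.2 change: size the merge fan-in
 * from the memory budget, keeping a healthy read buffer per tape so we
 * aren't thrashing between files.  Illustrative numbers only.
 */
#include <math.h>
#include <stdio.h>

int
main(void)
{
    double work_mem_kb = 64 * 1024;     /* 64 MB of sort memory */
    double tape_buf_kb = 256;           /* read ~256 kB per tape per visit */
    double run_count   = 200;           /* sorted runs produced in step one */

    /* how many runs can we merge at once without thrashing between files? */
    double fan_in = floor(work_mem_kb / tape_buf_kb);

    /* and how many passes over the data does that fan-in imply? */
    double passes = ceil(log(run_count) / log(fan_in));

    printf("merge fan-in: %.0f tapes, merge passes: %.0f\n", fan_in, passes);
    return 0;
}
```

With a 64 MB budget and a few hundred kilobytes of read buffer per tape, a couple of hundred runs merge in a single pass, which is why the multi-pass scheduling from Knuth rarely matters any more.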
Then this guy with the nice shirt came along. He was working on a website and needed to do paging, where you display the first 10 results, then the next 10 results, and so on; you have thousands of results but you only want to display a page at a time. Oh yes, the commit message: I wasn't a committer, so I would submit the patch and Tom reviewed it, edited it, probably rewrote it, and committed it. I had to cut off the last few lines of the commit message, but usually you'll see the credit at the bottom.

The idea was that if you only need the first 10, or the second 10, results of your data set, you don't actually need to sort the entire result set to get that subset. You do need to look at all the rows, but once you've determined that a row doesn't fit in the subset you're interested in, you can throw it out; you don't need to keep moving it around and comparing it to other values. We never did implement this for external sorts, so it only runs when you're sorting in memory, but typically you're fetching 10, 20, 100, 200 values, which will fit in memory, even though you still have to process all the rows. The biggest win is when sorting the entire data set would have had to be done on disk, but now we can keep just the top 100 and process all the rows in memory. The performance impact is arbitrarily large, depending on how much data you're sorting versus the size of the page.

You can see it impacted just this one query: it brought this purple line down from basically the same speed as sorting text on disk, up here where it's the same as the blue line. Oh, and now I know what the other change was: the more-tapes change had a big impact on this query because you're not actually reading all of the values; more tapes means you're likely to do the merge in a single pass, so you're likely to be done very early in the process, whereas before you might have had to do multiple passes, and all but the final one run right to the end anyway. So this change brought it down from being similar to the other sorts to almost the same as a sequential scan, because it does have to process all the data, but it's processing it in memory. So this is about 4 seconds, and the sequential scan is about 1 second, whereas before it was several minutes. And everyone thought that was brilliant for four more years.
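The trick here is what's usually called a bounded or "top-N" sort: keep only the N best rows seen so far and discard everything else as soon as you know it can't make the cut. A small illustrative sketch (demo code written for this transcript, not the Postgres implementation):

```c
/*
 * Bounded "top-N" sort sketch (not Postgres code).  To serve something
 * like "ORDER BY x LIMIT 3" we keep only the 3 smallest values seen so
 * far and throw everything else away immediately, instead of sorting the
 * whole data set.
 */
#include <stdio.h>

#define LIMIT 3

int
main(void)
{
    int rows[] = {42, 7, 19, 3, 88, 1, 56, 23, 11, 64};
    int nrows = sizeof(rows) / sizeof(rows[0]);
    int best[LIMIT];                    /* the LIMIT smallest rows, sorted */
    int nbest = 0;

    for (int i = 0; i < nrows; i++)
    {
        int v = rows[i];

        /* this row can't be in the top LIMIT: discard it immediately */
        if (nbest == LIMIT && v >= best[LIMIT - 1])
            continue;

        /* otherwise insert it in order, dropping the old worst if full */
        int j = (nbest < LIMIT) ? nbest++ : LIMIT - 1;
        while (j > 0 && best[j - 1] > v)
        {
            best[j] = best[j - 1];
            j--;
        }
        best[j] = v;
    }

    for (int i = 0; i < nbest; i++)
        printf("%d ", best[i]);
    printf("\n");
    return 0;
}
```

Postgres uses a small bounded heap for this rather than a sorted array, but the effect is the same: every row is examined once, and only LIMIT rows are ever kept in memory and compared repeatedly.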
Then another guy in a striped shirt, Peter Geoghegan, became a force for sorting speed. He said it's crazy that we're using these generic comparison operators, which have to go through a lot of overhead, and this generic sorting code that has to call these generic functions, when we could implement a special function for each data type in a way that's visible in the C code. This is a very technical change, but the important thing is that it specializes the quicksort for each data type, so the compiler can inline these functions and optimize the quicksort. If you're sorting integers, it becomes just straight binary code that sorts the integers in memory, without going through all those levels of abstraction. One of the nice things is that it gives us a place where we can implement optimizations specifically for integers, or specifically for floating point, or specifically for text, whereas prior to this change all we had was a comparison operator.

This is sort support, the specializing of the quicksort per data type, and it had a large impact. It brought sorting integers and floating points, in this data set, down from about 17 seconds; it's about a 30% improvement. This data set, by the way, is about a gigabyte, so it's not large-large, but it's large enough for benchmarks that I had to run hundreds of times. The important thing is that it allowed later changes, which I'll get to imminently. So right now we're up to here, and this looks like a small change, but remember it's a log scale graph: this is 10 seconds, that's 20 seconds, the next line up is double the time.

Incidentally, there was an interesting speed improvement here; that's the point in time when 64-bit machines became quite common. Prior to 7.3 Postgres didn't support 64 bits at all, which is why I couldn't run these benchmarks further back. Up to this point 64-bit machines were high-end machines and relatively obscure; by 2008, 2009 people were running 64-bit laptops, 64 bits in a lot of places. And somebody pointed out that floating point numbers can be passed around in 64 bits directly; we don't need to pass around pointers to them. That was improved, and the impact is visible in the sorting code, because suddenly sorting didn't need to pass pointers around, and it's faster. It's an interesting example of hardware changes driving the improvements.

A couple more things in passing. One problem users reported was raising their work_mem setting, I think it was still called sort_mem at this time, and not finding Postgres using all of it. I think this was because we doubled the size of some of the data structures at each step, and if it was already using 55% of the memory it couldn't double again. So it would use, at worst, only about 50-55% of your memory, and usually somewhat better, but people were expecting it to use the amount of memory they had configured. These are problems that impacted users; the small release notes that sound technical actually reflect real problems that were fixed, and upgrading would mean avoiding running into those problems yourself. How much time do I have left? This is another such change, where you were limited to 2 GB for reasons that were internal and not relevant to the user; again, people were complaining and it was addressed.
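Going back to the sort support change for a moment, the gist of "specializing the comparison" is the difference between a sort that calls a comparator through a function pointer for every pair and a type-specific loop the compiler can inline. A rough standalone illustration follows; the function names are made up, and the specialized routine is a simple insertion sort only for brevity (the real change keeps quicksort):

```c
/*
 * Illustration of the sort support idea (not Postgres code).  The generic
 * path pays an indirect function call per comparison; the specialized path
 * compares integers directly, so the compiler can inline and optimize it.
 */
#include <stdio.h>
#include <stdlib.h>

/* generic path: every comparison is an indirect call through a pointer */
static int
generic_cmp(const void *a, const void *b)
{
    int ia = *(const int *) a;
    int ib = *(const int *) b;
    return (ia > ib) - (ia < ib);
}

/* specialized path: the comparison is plain inline integer code */
static void
sort_ints_specialized(int *a, int n)
{
    for (int i = 1; i < n; i++)         /* insertion sort, for brevity */
    {
        int v = a[i], j = i;
        while (j > 0 && a[j - 1] > v)   /* direct compare, no call */
        {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = v;
    }
}

int
main(void)
{
    int x[] = {5, 2, 9, 1}, y[] = {5, 2, 9, 1};

    qsort(x, 4, sizeof(int), generic_cmp);   /* generic, indirect calls */
    sort_ints_specialized(y, 4);             /* specialized, inlinable  */

    for (int i = 0; i < 4; i++)
        printf("%d %d\n", x[i], y[i]);
    return 0;
}
```

The point isn't the algorithm in the second function; it's that the hot comparison is ordinary integer code the compiler can optimize, instead of an opaque call for every pair of values.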
I wanted to talk about this one; it's relevant, and it was driven by users complaining. Prior to about 2010 or so, text meant ASCII, at least in the US. Postgres has always had fairly good support for collation, for non-US collations and non-ASCII encodings, but it was slower, people knew it was slower, and a lot of people chose to run in ASCII; I think that was the dominant choice. That changed over time: most people now just configure Unicode without thinking about it, because they need that support. But it is a lot slower to compare Unicode text strings under complicated collations. People expect it to be comparable to comparing ASCII text; they don't expect to see a huge performance hit, and this was a change to address that unexpected performance hit. It's still something we want to support going forward, but we're going to need to address a problem which is really a problem in the operating systems. I have slides showing how it works: the change converts the Unicode text into a big binary blob, and then the binary blobs can be compared using an efficient memory compare. The problem is that on some operating systems the binary blobs it generates don't compare in a way that's consistent with the normal collation comparisons, and so you would get actually broken indexes and bad results. It was only with certain collations and only on certain operating systems, so going forward we may be able to identify when it's a problem and when it's not, and still use the optimization when it's safe, but we'll have to see about that. The main takeaway, though, is that the kinds of problems these changes fix are discrepancies: you expect two things to behave similarly, or you expect your performance to be consistent, and instead you're comparing Unicode text and it's suddenly much slower than when you were comparing US addresses in ASCII.

However, there was a big improvement recently, and this is why I went into so much detail on the generation of these runs. This change was inspired by a paper on sorting called AlphaSort, which talked in detail about the experiments they did generating these runs with different algorithms and what they settled on. The AlphaSort paper was from the mid-90s, but on large machines these days, even on a regular server, the lessons of that paper are very relevant. This is the abstract of the paper, and that's Jim Gray, a very respected, eminent computer scientist, not a Postgres contributor unfortunately, one of the authors. The paper settled on using Quicksort to generate the runs rather than replacement selection; replacement selection is the heap-based run generation algorithm we were using, the one I walked through earlier. They tried using a heap to generate the runs, like we did, and found it wasn't necessary; it was much slower. Quicksort generates runs by just sorting whatever you can fit in memory, so it generates more runs, and the runs are only as large as what fits in memory, but Quicksort is so much faster that you come out ahead: you're generating more runs, so the merge step is slower, but you save so much time in the initial run generation that you don't lose as much in the merge as you saved in the first step. And the important observation is that as memory has grown, the likelihood that you need a second or third pass, all of those complicated merge scheduling algorithms, has gone down dramatically. In fact, if you configure a very small work_mem, 4 megabytes, this is actually slower, because it has to do those multiple passes; with the largest table size and the smallest work_mem it is noticeably slower. But with even a halfway reasonable work_mem, like 16 or 32 megabytes, you have to be sorting very large data sets to ever need that second merge pass.
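The quicksort-based run generation they argued for is easy to sketch: read whatever fills the memory budget, quicksort it, write it out as one run, and repeat. Again, this is demo code written for this transcript, not the Postgres implementation:

```c
/*
 * Quicksort-based run generation, the approach the AlphaSort paper argued
 * for (illustrative sketch, not Postgres code).  Runs are capped at the
 * "memory" size, so there are more of them than replacement selection
 * would produce, but each batch is sorted much faster.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BATCH 3                 /* pretend only three items fit in memory */

static int
cmp_char(const void *a, const void *b)
{
    return *(const char *) a - *(const char *) b;
}

int
main(void)
{
    const char *input = "CEBDAFBGH";   /* same stream as the earlier sketch */
    int         nin = (int) strlen(input);
    int         runno = 0;

    for (int start = 0; start < nin; start += BATCH)
    {
        char batch[BATCH];
        int  n = (nin - start < BATCH) ? nin - start : BATCH;

        memcpy(batch, input + start, n);
        qsort(batch, n, 1, cmp_char);          /* sort what fits in memory */

        printf("run %d: %.*s\n", runno++, n, batch);   /* "write" the run */
    }
    return 0;
}
```

On the same nine-character input as the replacement-selection sketch earlier, this produces three runs instead of two, so the merge has more to do, but each run came from a tight in-memory quicksort rather than from feeding a heap one element at a time.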
This change is sort of frustrating, because it throws away all of that complicated algorithm, but it takes something like 60% as much time as the old sorting algorithm on a typical server these days. With 64 or 120 megabytes of work_mem you can afford to generate that many runs and still merge them in a single pass. It's quite hard to see here because there are so many changes in a short time, but the final drop here is the most visible part: if you look at the green, orange and red lines, they drop quite sharply at the end. You can actually see it at the top as well; that's the line that's an external sort, and the red one is quicksort. It drops down to where it's significantly faster than 9.4, even after the upward tick you see, which is where we reverted the Unicode strxfrm fix; it's still quite significantly faster than 9.4.

Going forward there are still things coming down the pipeline. Robert Haas, who just left the room, is working on infrastructure for parallel query, and there's a good possibility we'll be able to use that to implement sorting on multiple CPUs. There's talk of using SIMD instructions or GPU cards for sorting; there's actually already a contrib module that uses GPU cards for sorting. As the hardware changes and users' usage patterns change, it drives a lot of different priorities, as I talked about with the tape drives. Things change. Any questions? That's the last slide. I can go back to the chart. Sorry? I can't hear you... oh, I see you.

[Audience question about using ICU instead of the system collation libraries.] So I used to be strongly in favor of our current approach, because it's consistent with all the other applications on your system; it uses the system collation, and the other approach is the one where you have to explain why you don't behave the same as the rest of your system. But I think this latest incident has probably convinced me that that may be the wrong decision. As Peter pointed out, we've now reported that bug to the glibc people and there's been no response; it's a minor part of glibc, whereas if that bug were in ICU it would be a catastrophic failure of their main goals. So if we're going to proceed with that kind of optimization and that kind of feature set, and the idea of versioning collations and being able to deal with upgrades (today, if you upgrade your glibc and the collation changes, you don't know, and your indexes are all broken and you wouldn't know), then I think this latest incident has gone a long way towards changing my mind. Across major releases it's a little more reasonable, but people expect their database to stay intact, so I may be changing my mind. ICU, for those who aren't familiar, is a library for doing collation and Unicode handling; instead of being consistent with the system libraries, Postgres would be consistent with other applications that use ICU. One of the main downsides is that ICU is several times larger than all of Postgres, so it's a big dependency, but there are pros and cons.

[Audience question about the characters on the slide.] These characters are Unicode characters; they're more than one byte and they're not part of ASCII at all. As for the collation rules: you can have a collation rule for ASCII-encoded text that's complicated too. It could be case insensitive, or it can consider case as secondary to the text, so that text that differs only in case compares not equal but sorts next to each other.
So lowercase "a" and capital "A" might sort in their logical order, but together, even if the text after them is all different. So there are rules on how things are sorted; the rules for sorting in America are relatively straightforward, whereas the rules for sorting this are substantially more complex, and for all of those rules we have to call the collation function to do the comparison rather than just comparing bytes of memory. Once we call strxfrm, though, we can just call memcmp on the two binary blobs. And I guess I was making an assumption: back in 1996, when the internet first started, people were pretty satisfied, and there are still websites out there where all the capital letters sort before all the lowercase letters, and people were satisfied with that because it's close enough for a lot of uses. Even the US collation is slower than memcmp, because if you have James with a capital J and james with a lowercase j, then rather than having all the capital letters come before all the lowercase letters, James and james sort next to each other; the capital one might still come before the other, but it won't be separated from it by all the other lowercase words. That's still relatively simple, though; this is much more complex. You can actually see that even though they start with, sorry, different letters, they both start with the same binary blob in the first three bytes, or nibbles rather. Apparently Cyrillic is notoriously complicated to sort. And you also need rules on how to compare it with ASCII text, which is not obvious.
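To make the strcoll versus strxfrm point concrete, here's a tiny demo (not the Postgres abbreviated-keys code): strcoll applies the collation rules on every comparison, while strxfrm pays that cost once per string, after which plain byte comparisons give the same ordering, as long as the C library keeps the two consistent, which was exactly the glibc problem mentioned earlier.

```c
/*
 * strcoll vs strxfrm+strcmp demo (not Postgres code).  strcoll() runs the
 * collation rules for every comparison; strxfrm() transforms each string
 * once into a byte sequence that can then be compared with ordinary
 * strcmp/memcmp.
 */
#include <locale.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    setlocale(LC_COLLATE, "");          /* use the environment's collation */

    const char *a = "James";
    const char *b = "james";

    /* the slow path: collation-aware comparison every time */
    int slow = strcoll(a, b);

    /* the fast path: transform once, then compare raw bytes */
    char ka[128], kb[128];
    strxfrm(ka, a, sizeof(ka));
    strxfrm(kb, b, sizeof(kb));
    int fast = strcmp(ka, kb);

    printf("strcoll: %d, strxfrm+strcmp: %d (signs should agree)\n",
           slow, fast);
    return 0;
}
```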