My name is Mark. I am a consultant with 2ndQuadrant and have been a minor contributor to Postgres over the years. Today I'm going to go over some performance-related topics from the past year: patches that have been committed to the 9.6 branch, a couple of extensions that are available now, a little segue into some mainframe performance, and a few paths that Postgres may take after 9.6 is released, at the beginning of whatever the next version may be, 9.7 or whatnot.

So, the first patch. How many people have large tables, more than a terabyte? How many of you fear when that autovacuum comes around? The freeze map is what I wanted to talk about here, a patch by Masahiko Sawada at NTT. What this patch does is track pages whose rows are all frozen, so that when the vacuum process comes around, if the majority of the table hasn't been changed, it can skip those pages and fly through the scan, hopefully alleviating any tension you feel when that process kicks in.

The next patch worth mentioning is the atomic pinning and unpinning of buffers in the shared buffer pool. The purpose of this is to improve read scalability, and it was authored primarily by someone who might have to help me with the pronunciation: Yury Zhuravlev at Postgres Professional. What this patch does is use atomic processor instructions, in place of spinlocks, to pin and unpin buffers in the shared buffer pool. There's a chart that Alexander Korotkov, also at Postgres Professional, posted on the hackers list showing how 9.6 performed before and after the patch was applied. This is from a four-socket Intel server with 18 cores per processor, 144 total threads. Anyone have that many cores on their system? No? One person? All right. It may be worth noting that this patch was tested on Intel and POWER; I don't know if anyone's tried it on SPARC yet. With the spinlocks, the bottom line there, the 9.6 branch before this patch was applied, you see throughput peak at about 70 clients and afterwards drop fairly dramatically, in my opinion. The three colored lines are different variations of the patch over its development. With those spinlocks removed in favor of atomic instructions, throughput peaks around 130 clients, and while it doesn't really get any better after that, it does level off and does not drop as dramatically as before. So this patch allows about twice as many clients in a pgbench test to push the system toward its limits, and just looking at peak performance, the peak throughput doubles with these changes to the shared buffers.

How many people saw Robert Haas talk about parallelism earlier today? All right, so I'm really not going to try to repeat too much of that. Robert, Amit, Haribabu, and David are among the many people who have helped get this new functionality into Postgres. What I want to do is illustrate, in a slightly different way, how you still need to be careful with this new parallelism: it's not always going to make everything better. The shape of the test query is sketched below.
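To make the test concrete, here is a minimal sketch of the kind of contrived query involved, assuming the TPC-H lineitem table described in a moment. The modulo predicate on a key column is how the selectivity was varied; the GUC shown is the released 9.6 knob that caps worker processes per scan, and the exact values here are illustrative only.

```sql
-- Cap the number of parallel worker processes a single scan may use
-- (PostgreSQL 9.6 setting; value chosen just for illustration).
SET max_parallel_workers_per_gather = 2;

-- Modulo on the key column controls selectivity: % 4 keeps ~25% of rows.
EXPLAIN ANALYZE
SELECT *
FROM lineitem
WHERE l_orderkey % 4 = 0;
```

In the actual test the planner was patched to always choose a parallel sequential scan; on a stock build you would instead see the planner weigh the worker count against the expected selectivity, which is exactly the trade-off the next charts illustrate.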
Now, I forgot who, sorry, but someone posted these numbers on the hackers list. The setup was a 100-gigabyte TPC-H database. One of the tables, the lineitem table, is 80 gigabytes in size and has 600 million rows in it. We're looking at just the basic parallel sequential scan, because it simplifies the example: the behavior of selecting rows from a table while varying the selectivity. What happens when we select all of the rows, 25% of the rows, and 12% of the rows? Pretty well contrived: we select on one of the key columns and do a modulo on the key to decide whether we want each row or not. And in this contrived example, the planner has been modified so that it will always choose a parallel sequential scan, just so we can look at the behavior.

The red line on top shows a single process selecting all the rows, taking about 250 seconds. Add a second process to help divide up that work, and you end up doubling your response time for that query. Selecting 25% of the rows, adding a second process does speed it up, by about 25%, but keep adding processes to scan that table, even though you're only selecting 25% of the rows, and response time starts going up. The purpose is just to illustrate that it's important we get the planning right, between how many rows the planner thinks you're going after and how many processes it thinks it should use, for this to perform well. Very contrived, but as this feature keeps improving, you probably still need to be careful about when it's actually in use.

The next chart shows a little more about how well parallel aggregation works. It's the same table as before, 600 million rows, 80 gigabytes of data, using query 1 of the TPC-H benchmark (sketched below), which does sums and averages over those 600 million rows to produce only four rows of output. In this chart, the time axis is on a log scale so we can see the behavior a little better. The blue line through the middle is the actual response time of running query 1 as we increase the number of backends performing the parallelism along the x-axis. The top line, which is supposed to be some kind of yellow-orange, is a percentage on the right-hand axis showing the efficiency of the parallelism: in other words, compared with the response time when one process performs this aggregation, do two processes cut the time in half or not? What we see is that the parallel aggregation of these sums and averages scales up to about 10 processes fairly cleanly, meaning the response time with 10 processes is about one tenth of the time it took with one process. Increasing to 30 parallel processes, the efficiency drops by about 20%, but that's still roughly 25 times faster than the same aggregation with a single process.
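For reference, TPC-H query 1 looks roughly like this, paraphrased from the spec. It sums and averages several columns of lineitem and groups the 600 million rows down to just a handful of output rows, which is why it parallelizes so well.

```sql
-- TPC-H query 1, paraphrased: heavy aggregation over lineitem that
-- collapses 600 million rows into only a few groups.
SELECT l_returnflag,
       l_linestatus,
       sum(l_quantity)                                       AS sum_qty,
       sum(l_extendedprice)                                  AS sum_base_price,
       sum(l_extendedprice * (1 - l_discount))               AS sum_disc_price,
       sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
       avg(l_quantity)                                       AS avg_qty,
       count(*)                                              AS count_order
FROM lineitem
WHERE l_shipdate <= date '1998-12-01' - interval '90 days'
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;
```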
So now, just to mention a couple of extensions that may be worth looking at. Anyone use a lot of numerics in their database? No? A couple of people? Are you happy with how fast it is to do calculations with numerics? This is another illustration with that lineitem table used previously, 100 gigabytes of data, 600 million rows. There are four numeric columns in this lineitem table, out of 16 columns total, and they're the ones query 1 just aggregated: the quantity on the line item, the extended price, the discount, and the tax.

To describe the fixeddecimal extension a little: it uses 64-bit integers to provide a predefined-precision decimal type, similar to numeric, except that numeric is a selectable-precision data type. An important thing to note is that, implemented this way, fixeddecimal calculations may be truncated rather than rounded, in case that matters for your normal use of numerics. So we take just these four columns out of the 16 in this 600-million-row table and compare how the aggregation goes with them declared as numerics versus as fixeddecimal, using that same query 1 doing an average and a sum across the 600 million rows.

There's a set of 22 queries in the power test, and the chart on the left is query 1, which I was describing earlier. In a query like that, using this fixeddecimal data type instead of numeric cut the response time in half, making it twice as fast. And in a total of four queries out of the TPC-H set that use a fair number of numerics, performance also improved fairly significantly because of the aggregates in those particular queries: some improved as much as three times, others about one and a half times. A minimal sketch of what the swap looks like follows.
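This is roughly what the swap looks like, assuming 2ndQuadrant's fixeddecimal extension is installed; the table definition is abbreviated here, and the original numeric typmods are my assumption about a typical TPC-H schema.

```sql
-- Install the fixed-precision decimal type (2ndQuadrant extension).
CREATE EXTENSION fixeddecimal;

-- Abbreviated lineitem: only the four money-like columns are changed.
CREATE TABLE lineitem (
    l_orderkey      bigint,
    l_quantity      fixeddecimal,   -- was numeric(15,2)
    l_extendedprice fixeddecimal,   -- was numeric(15,2)
    l_discount      fixeddecimal,   -- was numeric(15,2)
    l_tax           fixeddecimal    -- was numeric(15,2)
    -- ...the other 11 columns are unchanged...
);
```

Remember the caveat above: fixeddecimal may truncate where numeric would round, so check that your application can tolerate that before swapping types.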
So now I'll go over a little bit of replication. How many folks are using Slony or Londiste? No one? Streaming replication? How about that? All right. How many of you using streaming replication wish you could break it down a little more and not replicate the entire database? And do you choose not to use Londiste or Slony because they don't meet your performance needs?

Tomas Vondra, also at 2ndQuadrant, did a little test of an extension that 2ndQuadrant has worked on called pglogical. Anyone heard of it, kept up with it? All right. He took it out for a spin on Postgres 9.5, measuring how well pgbench ran using streaming replication and pglogical, and taking a look at Slony and Londiste as well. Just to give you an idea of the system size, he was using an i2.4xlarge instance: 16 virtual CPUs, four SSDs attached to the instance, and 122 gigabytes of RAM.

And this is what he saw. The blue line on top was what he saw using just streaming replication. With one client in the pgbench test, everyone's relatively close; there isn't a huge difference between streaming replication, Slony, Londiste, and pglogical. With two clients, everyone's still scaling up okay. But past two clients, Londiste has passed its peak; it's the green line on the very bottom, and it doesn't seem to do any better as we increase the number of pgbench clients against the master node. The remaining two, pglogical and Slony, keep pace with streaming replication up to four clients, at which point Slony starts to fall off a little and take a dive toward where Londiste is. At eight clients, pglogical stops keeping pace, though it continues to climb a little until it reaches about 16 clients, at which point it flattens out. What that sort of shows us is that pglogical has about a 15% overhead relative to streaming replication in keeping up with the pgbench activity on the master.

The limiter? I don't know for certain, but I think it's fair to say the overhead of the logical decoding used underneath pglogical is very likely responsible for most of it. And yes, pgbench is running against the master, and these numbers are the results pgbench returns. As for where that cost sits in the code path exactly, I probably shouldn't guess; I should ask someone who knows. But I was always under the impression that the decoding happens as part of the streaming, before the changes are sent. For those curious what a setup looks like, a minimal sketch follows.
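A minimal pglogical setup, for reference; the connection strings and names are placeholders, and both nodes also need wal_level = 'logical' and pglogical in shared_preload_libraries.

```sql
-- On the provider (the node being replicated from):
CREATE EXTENSION pglogical;
SELECT pglogical.create_node(
    node_name := 'provider',
    dsn       := 'host=provider dbname=bench');
-- Add every table in schema public to the default replication set;
-- this is where you could replicate only a subset instead.
SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);

-- On the subscriber:
CREATE EXTENSION pglogical;
SELECT pglogical.create_node(
    node_name := 'subscriber',
    dsn       := 'host=subscriber dbname=bench');
SELECT pglogical.create_subscription(
    subscription_name := 'sub',
    provider_dsn      := 'host=provider dbname=bench');
```

Unlike streaming replication, this works at the level of replication sets, which is what lets you break the database down and replicate only part of it.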
All right, so: anyone ever had any interest in mainframes? Anyone have a mainframe? No? Has anyone seen the pricing of a mainframe? All right. Shortly before the 9.5 release, I took a look at some OLTP results on IBM's z13, which I believe is the current mainframe platform; the LinuxONE line that has come out is based in part on the z13. We ran some tests on the mainframe and looked at how well it did against a popular competitor. Because this was shortly before 9.5 was released, these numbers were actually taken on 9.4, but I would hope that what we see here would be repeated on 9.5 and 9.6.

A couple of basic details: these z13 mainframes are clocked at 5 GHz, and this particular one had 8 cores that can be threaded up to 16, with 64 GB of RAM. The competitive system was chosen to also have 8 cores and 16 threads, although its processors were clocked at 2.4 GHz, also with 64 GB of RAM.

The first test is the result of pgbench doing a read-only test. The bottom line here is the transaction throughput of the competitive system without threading enabled. On the next line up, threading is enabled, and we can see a little bit of an increase in throughput. The third line up is what the mainframe can do without threading, and the top line shows that we can still increase throughput on the system by enabling threading on the mainframe. Comparing the mainframe in this read-only test to its commodity competitor, it's able to produce twice the throughput.

Similarly, using pgbench to do a simple write-only test: the reason you only see three lines is that the two overlapping lines on the bottom show the competitor system with and without threading, and threading doesn't buy much for a simple write-only test with pgbench. The mainframe, above it, gets more throughput without threading, and then a little more still with threading. Which competitor? That's usually a legal matter; naming it wouldn't necessarily be condoned by both parties, but you're welcome to guess. Still, even with the write-only test, the mainframe is able to get twice the throughput of its competitor.

The DBT-2 test is a more complicated OLTP-style test, based on the TPC-C benchmark, if anyone's familiar with that. What we're trying to show here is the scalability of a more complex OLTP workload. As we increase the number of clients against these two systems, the mainframe more or less keeps pace at 1.7 times the throughput, almost twice the throughput of its competitor, as the number of clients hitting the system increases from 10 all the way up to 80.

Sorry, a couple of slides got out of order here. At this point I wanted to point out a couple of directions Postgres may take in the future. Of course, these are not the only things happening in development, but they're the ones I have numbers and charts for. The first is column-oriented databases. Does anyone use a column-oriented database? And for those of you using them, is it because you need to do calculations and aggregations better than Postgres can, or for massive parallelism, or compression? A couple of the use cases I'll mention, which of course aren't everyone's, are data warehousing, ad hoc querying, and computing aggregates over large volumes of data. For those of you using one, I would imagine you took a look at column-oriented databases because of some deficiency in Postgres; I'm assuming you're all Postgres users here in the room anyway, and you needed to do more than Postgres could provide.

So what we're going to do is use the DBT-3 kit, which is basically a TPC-H derivative, to take a look at a couple of anonymized column-store databases, which again you are welcome to guess at. These are open-source column stores, but that doesn't mean there aren't still legal issues to work around, so we'll call them CS1 and CS2.

First, the load test. Part of the TPC-H benchmark is seeing how fast you can load up your database; it's one of the metrics people want to know when they review results of these benchmarks. We'll take 100 gigabytes of raw data, which lands in eight tables. On Postgres, building the indexes is part of the load test in addition to getting the data into the tables (roughly the steps sketched below), whereas for both of these column stores, building indexes was not necessary to run the power test, the queries against the data, later. In other words, we build the indexes for Postgres because we need them to run the later tests, whereas the column stores do not. Our first column store was able to load 100 gigabytes in a little more than an hour. The second one took about three and a half hours, maybe a little more than twice the first. Who wants to guess how long it took Postgres to load the data and build these indexes? Thirty-eight days? Anyone else want to take another guess? I don't have any prizes, but it's fun to guess. Fifteen minutes? The answer is almost nine hours, so I think the 15-minute guess beats the days-long guess. Whether it's because we need to build indexes, or because it takes Postgres that long to get that much data into the tables, that's two and a half to seven and a half times slower than getting the data into a column store.
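For Postgres, the load step boils down to a COPY per table plus the index builds, roughly as below. The file path and index name are placeholders, and DBT-3's actual loader handles details like the trailing delimiter in the dbgen flat files; this is just the shape of the work being timed.

```sql
-- Bulk-load one of the eight tables from a dbgen-produced flat file
-- (pipe-delimited); repeat for the other seven tables.
COPY lineitem FROM '/tmp/lineitem.tbl' WITH (FORMAT csv, DELIMITER '|');

-- Index builds count toward the Postgres load time; the column stores
-- needed no equivalent step to run the power test.
CREATE INDEX i_l_orderkey ON lineitem (l_orderkey);
ANALYZE lineitem;
```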
So now we'll take a look at the power test, which runs those 22 queries consecutively to see how well each database does. My math isn't that good, but the power tests are scored by a weighted geometric mean of the response times of each of the queries, where a higher score is better. Our first column store scores almost 8,200. Our second scores almost 7,300. Who would like to guess what Postgres scored? 350? 10,000? 20,000? Did I hear 6,000? I've already forgotten who was closest, but Postgres scored just a little over 3,900, which is less than half of those other two. And just to give you an idea, the queries Postgres was particularly bad at, compared to these column stores, were intended to answer questions like: what are the ten unshipped orders with the highest value we weren't able to collect on? Another queries the revenue volume of the local suppliers in this parts-supplier data warehouse, to see how well each of them is doing. Another is supposed to show which of your customers are having problems, based on how many orders they've been returning. These are the kinds of questions you're trying to get out of your data, and it seems like column stores are able to answer them better. So the question to pose is: what can Postgres do to close the gap?

One possible path, which isn't really just about closing the gap with column stores, is Postgres-XL, so let's talk about that a little. Any users of Postgres-XL now? XC maybe? Something else that looks like a massively parallel database? No? All right. Postgres-XL is, what's the best word for it, the next-generation implementation started by people who worked on Postgres-XC. Has anyone heard of XC, or Stado? Postgres-XL is an MPP implementation, a massively parallel processing database, and it's meant to be all-purpose: use it for OLTP workloads, use it for data warehousing. Fully ACID, open source, scale-out. The Postgres-XL project just released its update against 9.5 earlier this week; for those who might have heard of it, the last time XL was actually released was against the 9.2 code base. So if you've looked at it before, it may be worth taking another look now. From the application's point of view it's mostly just Postgres; the main schema-level difference is declaring how tables are distributed, as sketched below.

Before we get back to the numbers of how this thing looks against a data warehouse, this chart illustrates the scalability of a simple read-only pgbench workload. The bottom blue line is the baseline of where 9.6 is at running pgbench on a single node. Setting up two nodes with Postgres-XL increased the throughput 1.8 times, almost double. Doubling the number of nodes to four increased the throughput only about three times from the baseline, and then at eight nodes it almost gets to six times the throughput of what 9.6 does on a single node. Also take a look at updates: Postgres 9.6 is on the bottom, and two nodes with XL seem to give a little bit of benefit at the lower client counts, but as you increase the load, it's really not looking much better than a single node with 9.6. When we increased the number of nodes to four, though, we were able to double the throughput with a simple update workload, and going up to eight nodes, the total throughput is about three and a half times more.
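Here is the main schema-level difference in practice, sketched with abbreviated columns: in Postgres-XL you declare how each table is spread across the data nodes, hashing large tables and replicating small reference tables to every node.

```sql
-- Large fact table: hash-distribute rows across the data nodes so the
-- 22 queries can scan all nodes in parallel.
CREATE TABLE lineitem (
    l_orderkey bigint,
    -- ...remaining TPC-H columns elided...
    l_shipdate date
) DISTRIBUTE BY HASH (l_orderkey);

-- Small reference table: keep a full copy on every node so joins
-- against it never need to ship rows around.
CREATE TABLE region (
    r_regionkey integer,
    r_name      char(25)
) DISTRIBUTE BY REPLICATION;
```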
So now, going back to the power test, this is what might be encouraging down the road for competing with column stores. This bar chart shows the 22 queries of the power test, with response times for Postgres 9.6 in the lighter blue on the left compared to Postgres-XL on the right. A couple of other important things to mention: the XL cluster here is 16 nodes, and the amount of data is not 100 gigabytes; these queries are running against three terabytes of data. So if someone can help me do the math: 100 gigabytes is 600 million rows in that lineitem table, and we multiply by 30, so 18 billion rows. As we get into the terabyte range, Postgres-XL is showing that it can pull ahead of a single-node Postgres instance pretty well. To throw out a couple of numbers, in case you're wondering: the power score at three terabytes for 9.6 was only about 700, and if you remember, at 100 gigabytes it was able to score almost 4,000. XL with three terabytes of data scores almost 4,400, so it's able to score about six times better than what 9.6 can do with three terabytes of data.

Sorry, I have more slides out of order here. All right, for those of you who didn't see me flash the answer, one more quiz. Did everyone see how long it took Postgres to load three terabytes of data? It took more than 79 hours to load that and build all the indexes. Who wants to guess how long it took XL to load that data? Ten hours? Sixteen? Three hours? Thirteen hours? The math says the 16-node cluster can load the data about five and a half times faster than a single node with three terabytes of data.

So that's all the charts and quizzes I had for you. Questions? Yes, about the buffers in the shared buffer pool: it's not about whether a buffer is dirty or not. When you're pulling data out of a buffer or pushing data in, you have to make sure you're the only one touching it. Traditionally that's done with a spinlock, and a spinlock running on the processor just keeps checking: are you free yet? Are you free yet? Are you free yet? Using a spinlock to see if that buffer is free is significantly slower than using an atomic processor instruction, compare-and-swap, I think it is; thank you, I had CAS in my head. Using those atomic operations the processor provides is simply a more efficient way of asking, can I do something with this buffer or not? Any other questions? All right, thank you for listening.