I'm standing in for Mark Wong. This talk was written by Mark, and I'm going to attempt to give it and do some justice to the way that he wrote it. Mark has performed almost all of the work that's described in this presentation, in the sense that Mark is a really excellent performance engineer. Mark and I started working together on Postgres performance about eight or nine years ago. We did a lot of early work on Postgres performance in the 8.0, 8.1, 8.2 type timeframes. Mark's ability to set up quite complex tests was very important in the early analysis of things like the HOT feature, where it took quite a while to understand the behaviour of the systems. The type of work that Jan and Robert were showing in the last talk is very important. It can often take months or even years of work to truly understand what's happening. I was lucky enough to be able to hire Mark into 2ndQuadrant about a year ago, and Mark's begun quite a lengthy task of investigating some of the aspects of performance on Postgres. Luckily, we're not duplicating the work that Robert and Jan are doing. They're more focused on OLTP performance. A lot of the stuff that I'm talking about here today is around business intelligence performance and focused on a number of different benchmarks. I've got a range of different graphics to show you, but I'm going to go through the full story of what we're going to look at. Starting in the background, we're going to discuss the patches and new features under test. The background to this work is that we're running a project in Europe called AXLE, and that stands for Advanced Analytics for eXtremely Large European databases. Why European databases?
Well, they're paying, so obviously, but luckily for us, European databases look almost identical to US databases. It's just the addresses that are different. The salaries are higher in the US, but hopefully that's just because you guys get paid in dollars rather than euros. I think somebody's made that joke already. The AXLE project has got a number of different partners: a couple of different universities and another small and medium enterprise. We're working on a number of different aspects of analytical systems. One of the most important of those, which we don't normally get to talk about, is actually security and privacy. This project also sponsored work on row-level security and the work that's being done on audit. The things that I'm going to talk to you about today are really the core extensions to Postgres that we've been working on and the performance testing around that. There is quite a lot of stuff coming out of this project that's worth talking about, but it's difficult to mention it all. What we had a look at, to do some testing on, was the 9.5 development version of Postgres. That's not going to be available to you for production use until approximately September this year, assuming the commitfest closes sometime before then; it's supposed to close on March 15. We decided to evaluate a number of different features. One of those is something called the BRIN index, and I'm going to explain what it's for and show you some results about what it does do and what it doesn't do. We've also evaluated a couple of other patches related to performance in a business intelligence context. There are also the benefits inherent in some of the other work that's been done. Even though I'm not talking about it specifically, there are some other tweaks that are giving us performance improvements as well, but I'm not going to go into those. The first feature is block range indexes, or BRIN indexes, and they were written by Alvaro Herrera.
They were originally known as Minmax indexes and various other names before that, but if you're familiar with Oracle Exadata, there's something called a storage index in that system. This is in some ways very similar to that. What is the objective of a BRIN index? It's basically an index type that's aimed at very large databases. The idea is that we want to avoid some of the overheads that you get on B-tree indexes. Who's got a database larger than 100 gigabytes? Cool. You'll know that building B-tree indexes on very large tables sucks. That becomes something of a problem because if you've got 100 gigabytes of data and 300 or 400 gigabytes of indexes, then clearly the indexes become the bottleneck. On the other hand, we realise that performance with B-tree indexes can be very nice. What we're looking for is something that's a bit like a B-tree, but solves some of the problems with B-trees. That's kind of what we get with BRIN. What we have is an index type that exploits the natural order of the data on disk. If you're going to say to me, yes, but not all columns have a natural order on disk, then you would be absolutely right. In those cases, a BRIN index is not going to give you much at all. The cases where a BRIN index is going to help you are things like a date column, because obviously as we write the database, there's a natural ordering associated with the log date on events. It just so happens that date is, of course, one of the most important use cases because there are actually very large historical tables. It's a very frequent use case in databases. The other case is where there's some kind of natural ordering around the use of primary keys that are assigned via sequences. For example, an order number. We know that if we're accessing an order number that is 100,000 orders into the past, we know that that's likely to be associated with a particular area of disk. 
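As a rough sketch of the date-column case just described (the table and column names here are hypothetical, not from the test schema):

```sql
-- Hypothetical append-only events table: rows arrive in log_date order,
-- so physically adjacent blocks hold similar ranges of dates.
CREATE TABLE events (
    event_id  bigserial PRIMARY KEY,
    log_date  date NOT NULL,
    payload   text
);

-- A BRIN index stores only summary values per block range,
-- so it stays tiny even on a very large table.
CREATE INDEX events_log_date_brin ON events USING brin (log_date);
```

A date-range query such as `WHERE log_date >= current_date - 7` can then skip every block range whose summary falls outside the requested dates.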
What we're doing with BRIN is exploiting the natural ordering of the data as a way of speeding up queries. That gives us quite a few different advantages, and I'll show that this has actually been realised in our implementation. In a B-tree, we keep track of all of the data: every single record in the main table has got an entry in the B-tree. There's a one-to-one relationship between rows and index pointers. In a BRIN index, we only index ranges of values. For example, if we pick a range size in the BRIN index of 10 megabytes, then we will store just two values, the minimum and the maximum, for that 10 megabyte range of the table. As a result, you can see that the BRIN index has significantly fewer entries within it, so the index is going to be a lot more compact. With the right data, we can still get roughly similar performance between a BRIN index and a B-tree, but the index itself could be around 1,000 or 10,000 times smaller than a B-tree index. Some of the other patches that we're testing are the concurrent hash table patch that Robert had written. There's also a series of patches that Andres had written related to atomic locking; the concurrent hash work was basically exploiting those patches as well. In addition to the atomics patches, there's a lockless clock sweep patch, which relates to buffer eviction: basically, making shared buffers more efficient. That was quite important in business intelligence because we were actually dealing with quite large expanses of data. There are also some changes to the way that LWLocks were padded, which had caused problems with false sharing. I'm just reading the slides here; Mark says, yes, that was a very poor excuse, wasn't it? We won't try that gambit again. We've done some tests on a couple of different systems. We've got two different systems under test. The first one is at Barcelona Supercomputing Centre.
The characteristics of this system are that it's got a very large storage array, quite well connected; it's a storage intensive system. The second one has got a much smaller storage array, but it's got more RAM and faster CPUs. We were able to compare and contrast the two different types of hardware for these tests. Obviously, there's a size limit on what we could do on the Manchester machine. The tests that we're looking at here were basically the TPC-H benchmark. Why TPC-H? Well, I think the main thing is it's a reasonably well specified test. Obviously, there are discussion points around whether it's an applicable test. We've tried not to get too hung up on the specifics of the test; what we've really focused on is the intention of the tests themselves. We've also compared the difference between power and throughput tests. The main difference is that the power test is in some senses a measure of how quickly the system can produce results for a single stream of queries. If you're an analyst or a data miner and you're sitting in front of the system waiting for it to finish, how long does that take? Whereas the throughput test measures the ability of the system to produce significant numbers of reports; in that case, it's much more to do with the ability of the system to absorb a concurrent workload. Actually, it turns out that we get different results in each of the two types of test. The TPC-H schema is actually quite complex in these days where having more than one table is considered heresy. Look, it's got almost 10. It's not even in a star schema format. What I do know is that the test was originally developed by Teradata in the days when people wanted to express their business problems in application neutral data formats. A lot of the queries involve significant and complex joins that we really wouldn't get on some of the simpler Hadoop-style systems. I think that's a pretty reasonable test.
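For reference, the TPC-H specification defines the two metrics roughly as follows, where $SF$ is the scale factor, $QI(i,0)$ and $RI(j,0)$ are the timed query and refresh intervals in seconds for the single power stream, $S$ is the number of concurrent query streams and $T_s$ is the throughput measurement interval in seconds:

$$\mathrm{Power@Size} = \frac{3600 \times SF}{\sqrt[24]{\prod_{i=1}^{22} QI(i,0) \times \prod_{j=1}^{2} RI(j,0)}}$$

$$\mathrm{Throughput@Size} = \frac{S \times 22 \times 3600}{T_s} \times SF$$

The geometric mean in the power metric is why a single pathological query drags the whole power score down, while the throughput metric rewards sustained concurrent query completion.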
There's a variant of the TPC-H test called the Star Schema Benchmark. The reason why we haven't gone into that is simply that its tests are slightly tweaked, so it's a bit confusing to compare things accurately. For these purposes, we've just stuck to the TPC-H test. Mark has got some pretty strict methodologies for running the test. As you've seen, there are some points of the presentation that I don't know in detail. What I would say is that I was working fairly regularly with Mark on the results of the tests. Although he performed the tests, we were discussing the results; I didn't necessarily work with him on the details of the methodology. Some of the things here he's talking about are the specifics of the actual load and the power tests. The tests themselves look like this. There are 22 separate queries in the test suite, and they are typically complex queries. What's good about this benchmark is that they're not made-up tests; they actually represent useful business questions against that data set. That is what I consider to be particularly important. Very often when you see people talk about business intelligence performance, they use a made-up, unreal query that just happens to go really quickly in their tests. One of the things that we were looking to do was to make the tests as real-world as possible. We're not trying to blind you with some impressive results on a particular small set of queries. We're interested in measuring how well we're doing in terms of PostgreSQL's performance across the board. There's no sense of trying to get good results here; we're trying to get realistic, representative results. Query 12 is an example of that. Probably about two-thirds of the queries in the query mix have got restrictions on date ranges. This is quite a nice example of a complex query where we're joining the orders table and the lineitem table; we've got a master-detail relationship going on. It includes a complex case statement.
There are some complex where clauses and complex joins, followed by an aggregation. I think that's a fairly typical example of the queries in the test suite. It does actually cover most of the things that we would like it to. Let's go through to the BRIN tests. What we're doing is loading the data initially with no indexes. Then we're going to load data into a table that's got a single B-tree index. Then we're going to do the same thing but with a BRIN index. I'm going to flick ahead to the results, because I understand these. The column on the left is how long it takes to load a table with data. In that case, it took 5.1 minutes to load a 10 gigabyte scale factor test. With the BRIN index added onto that table, the load plus the index build took 5.6 minutes. With one B-tree index, the load time leaps up to 9.4 minutes. The effect of the B-tree index is to almost double the load time, whereas in the case of the BRIN index, we're basically only adding 10% to the load time. Those gains come from the simple fact that building the B-tree is significantly slower than building the BRIN index. The great win is that the BRIN index can be built completely in memory as we scan the data. With the B-tree, we need to scan the data, we need to sort it and then insert the rows into the index. It's a significantly longer process, and it gets a lot worse if the index itself falls out of memory. If you're building a very big index, then we end up needing to use a different sort algorithm, which slows it down even further. These are the effects with data that's completely in memory. What we then did was to produce a table growth test. What we were doing here was to simulate what would happen in a very large database that is growing by a fairly constant amount each day. We were simulating the idea that perhaps 1% of the table would be added each day.
It's a high volume, incremental change situation. If I just pull out the details so that you can be sure to see them: the blue line at the bottom is loading the data if we had no indexes at all. The red line just above it is the time taken to load that data with a BRIN index added. The line at the very top of the graph is the amount of time taken to load data when we've got a single B-tree added to the data. The important aspect that I'd like to draw your attention to is the fact that the red line is almost flat, whereas the B-tree line goes up very slowly. It goes up so slowly that on a day-to-day basis you probably wouldn't notice that your database is slowing down. As your database is getting bigger, those B-trees start to fall out of memory. Every insert into the B-tree slows down as the B-tree gets deeper. What eventually happens is that the graph keeps on going up and up and up, to the point where, when you get to terabyte-sized databases, building B-trees on those things... I did already use the word suck, so I'll just repeat that. It really does suck to have very large indexes. One of the objectives of the BRIN index was to have what we call constant-time insertion. As we go up into the terabyte range of data, the BRIN index is able to maintain the insertion time as we expand. That's particularly important for very large databases. The objective here is to build an index type that doesn't just work reasonably well on a 10 gigabyte system, but something that's going to work well at 10 terabytes, or dare we even think about what it will be like with a 10 petabyte system. I can tell you, if you did have a 10 petabyte system, I don't really want to sit and wait while a B-tree gets added that's going to be like five petabytes of B-tree, so I don't even want to be there. In fact, it probably wouldn't ever finish, at least not until we get parallel index build. By the way, where is Mark?
Are you still hiding at the back? He went out again, damn. I was going to ask him if he wanted to take over. The overall load test on a 30 gigabyte system showed that using BRIN indexes appropriately reduces the actual load time for data by 26%. This took a little bit of database design knowledge, because the basic original test had 28 B-trees on the database. What we did was selectively replace eight indexes with BRIN indexes. Initially, we just replaced all the B-tree indexes with BRIN indexes, and that's a bad plan. As I explained before, BRIN indexes only work when the data is appropriately organised. It is not true that you can replace every B-tree with a BRIN index. What we did was select the best cases where we could use a BRIN index. That was good and useful, and it reduced the overall load time by 26%. 15 of the 22 queries were able to use the BRIN index type effectively; that's roughly two-thirds of the tests. This is an interesting slide, because the pink bars are the performance we get from BRIN indexes. If you think you're not seeing it correctly: the BRIN indexes are slower than the B-tree indexes in quite a few cases. I'm saying that that's actually a design objective of the system, in that build time and constant insertion time were higher on the list of performance objectives than data retrieval time for some of the queries. What we will also see in other areas is that there are cases where the performance of the BRIN index is very cool and much better than a B-tree, but in general what we were aiming for was a retrieval time roughly in the same ballpark as B-trees, while at the same time making the indexes much more practical to use.
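The selective replacement just described might look something like this in SQL (the index names are hypothetical, though `lineitem` and `l_shipdate` are real TPC-H schema names):

```sql
-- Replace a B-tree on a naturally-ordered date column with a BRIN index.
DROP INDEX IF EXISTS lineitem_shipdate_btree;

-- pages_per_range is the main tuning knob: smaller ranges give finer
-- filtering granularity at the cost of a slightly larger index.
CREATE INDEX lineitem_shipdate_brin
    ON lineitem USING brin (l_shipdate) WITH (pages_per_range = 128);
```

Only columns whose values correlate with physical row order, like the ship date on an append-only fact table, are worth converting; the rest should keep their B-trees.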
What I'd like you to do is think about what that's going to look like on a 10 terabyte system. If you've got a B-tree that is so big it's practically unusable in that situation, even though the access time, if you added one, would be better, you realise that if you've ever tried to CREATE INDEX CONCURRENTLY on a multi-terabyte system, then saying it has a very severe impact on your system is not quite sufficient. Whereas a BRIN index will actually build quite quickly on some of those systems. So we do have some happy results: you can see query 20 and query 2 both show massive reductions in query times as against a B-tree, whereas in a lot of the other cases you can see there is some degradation, but it's not too bad. So the result we had was that using the BRIN indexes gives an overall improvement, averaged across all of the different query types, of 28%. So there's a clear net benefit from replacing some of your indexes with this new index type. And I do say some of your indexes. Another index comparison: here we're comparing red, which is no indexes at all; green, which is B-tree; and blue, which is BRIN. So you can see that there are cases where B-trees simply don't help, but the BRIN indexes work quite nicely. On the power test, having BRIN indexes against no indexes is obviously fairly good. I'm not sure that anybody was suggesting having no indexes was a viable option, but we measured it anyway. But the overall result was that practical tests were showing a BRIN index is 612 times smaller than a B-tree. Obviously that's just one measurement; don't expect it to be exactly the same. There are some configurable parameters that you can set on a BRIN index, and it will also depend upon how your data is organised. But in general, you should see massive reductions in size when using a BRIN index.
Now, the reason why that is important: what we're saying is that if you've got data that's, say, 5 terabytes in size, then a BRIN index on that data might only be something like 10 gigabytes. Whereas on a 5 terabyte system, if you build a B-tree, it's easily going to be bigger than a terabyte and be quite slow. So what then happens on the larger systems is that you get a huge amount of cache back for querying, for sorting, for joining, and it actually makes a significant difference to what we call the memory pressure in the systems. So instead of having B-trees occupying almost your whole cache, you're going to have that RAM available for query workloads. So there are some scalability results that we've got as well. How much longer have I got? 15 minutes. Thank you. On the scalability tests, we can see that the additional patches that we tested produced a 20% improvement, and a 14% improvement just with the most recent tests. What we're talking about is comparing 9.2 against the latest branch of 9.5 development. The scalability of the tests was considerably better. This is on a throughput test: we can see the number of streams was better and the overall throughput was better as well. So the tests that we've got showed that, taken as a whole, we can significantly improve the number of queries per hour that these systems can handle in comparison with the same situation roughly two years ago in terms of the software. On a four socket system, it becomes very clear that we've made significant gains in terms of throughput. The other results that I've got are broken down by query, so that you can see that we've made gains on every single query in the benchmark. A lot of the queries have improved radically in terms of their performance. What I would say is that we are yet to make even larger gains in terms of query throughput.
There are a number of queries that we cannot improve without either materialised views or parallel query. We've kind of done pretty much everything we can think of to do. There are a number of them where we're also hoping to improve performance using column store techniques. One of the tests that we did on this workload was to mimic the performance of a column store by literally dropping all of the columns that were not needed for particular queries, which obviously is cheating. But what we were trying to do was prototype what the performance would be like if we had a column store and the columns that were not part of the query were basically just not touched at all. When we did that, doing nothing else, we got a 25% improvement in performance from the simulated column store. So one of the things that we're hoping to do in the next release is work on column stores. And, really for Robert's benefit, there are certain queries that we understand need parallel query in order to get improvements. This is quite a good query mix, because we're able to identify particular queries that can be speeded up by particular techniques. So I'm switching around a bit with some of the performance results that I've got. We also did some testing; these results come from Tomas Vondra. What Tomas did was use a similar benchmark called TPC-DS. It's a business intelligence benchmark in the same way that TPC-H is. Interestingly, you can see that we've made significant gains in performance from the days of 8.0 down to the latest version, 9.4. The performance differences that I was talking about just now were differences between here and the latest 9.5 development version. In terms of the performance that we've achieved on a four core box, between 9.2 and the 9.5 development version we got a 232% increase in throughput, just between this point here and this point here. So you can see overall we've done quite a lot of work to improve things down the years.
One of the most interesting things that I note from this graph is that you can see in the middle of the graph there are some regressions; we're actually doing relatively badly in terms of performance. And that's basically the period where we stopped measuring performance. The early days are when Mark, who used to work for OSDL, was active in regular testing of Postgres, to check that we were actually improving performance. And that worked: you can see that by regularly publishing the results of performance tests to everybody, we were able to keep query times coming down. Then Mark moved on to some other jobs, and basically nobody took over those performance tests for a few years. What's then happened is that, round about the 9.1, 9.2 timeframe, people like Robert and people in other areas started to do more performance work, and we've begun to bring the times back down again. So while you might look at that and go, oh, you had a problem at 9.0, the problem was actually a systemic one, in the sense that we weren't measuring performance and so it crept up slightly. So that's a lesson for us all. I love this graph because it's so difficult to interpret; I feel it kind of lends me credibility to have a really complex graph. This is a very strange way of saying that GIN fast scan produced an algorithmic speed-up in performance. What actually happened was that, no matter how slow the query was before, it became a very fast query in 9.4. So basically, the slower the query was, the more it was speeded up. The reality is that at 9.4, GIN fast scan basically completely flattened the curve. Queries that were executing in quite a range of different times all came down to very fast performance. So it's a complex graph to say we made it faster. Cool. And is Mark going to come back in? No. Could somebody just pop their head round the door and grab Mark if he's there? Okay.
So the highlight of the performance results was that, taken together, using the concurrent hash table, the atomic locking improvements and the better cache line padding, plus all of the other improvements that had already gone into Postgres, and then in addition the selective replacement of certain indexes with BRIN indexes, gave us an overall difference between 9.2 and 9.5 development of a 232% gain in overall system throughput. This is a busy data warehouse executing a large number of queries. Now that result is particularly interesting to me, because what I see out there is that there are a lot of people using Postgres for business intelligence. When you speak to all these guys doing big data: if you have a big data system, it frequently ends up that individual queries go faster, but the systems are not able to produce as many answers per hour, simply because some of those systems use all the power of the machine all at once, and they're poorer at throughput than you would expect. So one of the key places where Postgres has got utility is the throughput case, where there's a very busy system doing lots of complex queries, and that's actually the area where we've improved the most. So I think that's an important result. But just to link back through what you've heard: basically what we've said is that the new BRIN indexes are very exciting; they've got very clear advantages when used appropriately. Plus we've got a number of other internal tweaks that are coming in the later releases, and they are proven to be very effective for business intelligence. I can't speak to whether they are improvements for OLTP, only because we haven't measured that. I believe they are, and I've got a feeling that, if I recall Andres's results, the benefits were actually better in OLTP, but I don't have hard numbers on that from these systems.
So what I'd like to do at this point, seeing as Mark's just walked back in, is: don't give me a clap, give Mark a clap, because he did all of this work. So given that you've arrived, Mark, you can probably help me with some of the questions on these things. There were some points I couldn't talk to, and I got one thing wrong earlier, actually. Were there any particular questions? Yeah, basically there will be a call to the index management functions, but that won't necessarily result in an action on the index. The BRIN index is indexing the minimum value and the maximum value for a whole range, and unless those change, the index itself won't change. Now, what I would say is that BRIN indexes are not ideal for heavily updated data. They are definitely more designed for the append-only case. It's not true that they only work if it's 100% pure append; it's more a case of a gradual decline in performance the more heavily the table is updated. Just to repeat what I said, heavily updated data will be a bad thing for the BRIN index. No, the data is not clustered because of the existence of the BRIN index. The BRIN index takes advantage of the natural ordering of the data. If there isn't a natural ordering of the data, then there would be no advantage in the BRIN index. Right, so it's possible for you to cluster: you can issue the CLUSTER command to physically sort your data, and then it will have a benefit. But the real benefit comes from types of data where the arrival of the data naturally causes a segregation. Obviously, if we've got events or transactions or something like that, there's a natural ordering to the data, so that the data at this end of the table will have an earlier log date and the data at that end of the table will have a later log date. So as we insert into the table, there's a rough ordering to the table. It doesn't need to be exact; we're not relying on an exact sort property.
What we're talking about is a natural tendency. So if I ask for the last week's data, we know roughly that's up this end of the table, and I can avoid all of the IO associated with a full table scan. So it sounds like a hit-and-miss operation, but what we're aiming to do here is produce a special type of index that is optimised for very large tables in a business intelligence context. The row ID obviously already is sorted, but we're interested in indexing data that the user would know about; nobody does a lookup on row ID, because it's not meaningful outside of the database. I think it's more a case of particular columns benefiting from one type or the other, but the real benefit for the very large tables comes where you can actually replace one or two indexes with maybe, say, five BRIN indexes. It's actually more practical to index columns than before: where you previously thought, I'd love to put an index on that, but I don't have 200 gigabytes spare for the B-tree that would be created, the BRIN index is smaller and faster to build, so you can have a BRIN index on it. It's just more practical to put there. It does a bitmap heap scan in the same way, so you can actually use multiple types of index, or multiple BRIN indexes, to retrieve the data, but obviously only if the optimiser sees that that will work. It still goes through the optimiser in that same way, which gives you all the selectivity problems and that kind of thing. An impact on the CLUSTER operation? To all intents and purposes, CLUSTER is impossible to use in production, because it's so dramatic a command that nobody's got enough time to sit and wait for it. Any other questions? Do you have a question for Mark? Those are all about the design of the BRIN index. Thank you, by the way. I believe I'm between you and lunch, so never a good place to be. Thank you very much.