First of all, good morning. My name is Ilya Kosmodemianski and I work for PostgreSQL-Consulting.com. A few words about myself: I have been with PostgreSQL for more than 10 years as a DBA and consultant, and before that I was an Oracle DBA and partly a DB2 DBA. I will talk about disks with PostgreSQL. Actually, the main topic, how PostgreSQL works with disks, is very complex, and it is not possible to cover it even in a pretty large tutorial, let alone a talk. So this will be an extract of the information that is important for DBAs; that is why I call it a checklist. Some topics will indeed not be covered, but I think the checklist style is more useful for people who work with PostgreSQL in practice, because you usually have to collect all this information from different sources, and sometimes those sources say different things. So, the outline of today's talk: why a database needs disks in general, and what is special about PostgreSQL in that respect; then some bottlenecks in disk performance and, most importantly, how to diagnose them, because trying to achieve maximum performance in a vacuum is quite impossible: you need to diagnose where you have a problem, and whether you have a problem at all. The important thing is that tuning PostgreSQL disk performance is not only magic with postgresql.conf: you need proper hardware, you need to configure it well, you need to configure your operating system, and everything works only together. You cannot fix just one piece and assume that everything is okay. So, with this outline, let me start. First of all, why does a database need disks? That is quite obvious: it needs to read pages from disk and it needs to write them, because we have a pretty well-advanced open-source database, which means it is transactional. So it writes to the write-ahead log (WAL), and then it needs to synchronize the dirty pages in memory, which are already recorded in the WAL, with the data files.
This synchronization of dirty pages with the data files is usually called a checkpoint, and in any database it can be a performance issue simply by its design: you need to put a lot of pages from memory to disk at once. When we talk about PostgreSQL, there are some additions. First of all, autovacuum. You know that PostgreSQL needs to clean up old row versions, and at the least that is not cost-free in terms of disk performance. Second, if autovacuum is not configured properly, it causes a lot of other issues in your database which will finally affect disk performance as well. Another PostgreSQL-specific thing is pg_clog, the bitmask of transaction states. I hope you have never experienced disk contention on it, but it can happen. And finally, there are user-query-related things such as on-disk sorts, hashing, and temporary files. In this talk, because it is a checklist, I will concentrate mostly on two topics, checkpointing and autovacuum, mostly because they are quite common, for example among our clients when they come to us: the usual first questions to ask are about autovacuum and checkpoints. And a couple of words about the other two. If you have contention on pg_clog, you can easily move it to a RAM disk, because nothing bad can happen there: if your RAM disk crashes, you will experience a short period of performance problems, but then everything will be okay. So it is quite an easy task to move it to a RAM disk. Contention on this file is a matter of very heavy transactional write workloads, and I hope you never experience it. For the query-related things, the main recipe is EXPLAIN, EXPLAIN ANALYZE, and proper work_mem settings, because your hashing and sorting could spill to disk, and they will be very slow in that case.
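As a minimal illustration of that last recipe, a postgresql.conf sketch; the values here are my illustrative assumptions, not recommendations:

```
# postgresql.conf sketch; values are illustrative, not recommendations
work_mem = 64MB       # per sort/hash operation per backend; enough of it keeps them in memory
log_temp_files = 0    # log every temporary file created, to spot queries that spill to disk
```

With log_temp_files enabled you can then run EXPLAIN ANALYZE on the offending queries and see whether raising work_mem keeps the sort or hash in memory.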
So the biggest problem for many relational databases, and for Postgres by its design, is fsync, this synchronizing of data pages with the data files. Usually a page, a block, travels through this stack, or queue: it is a page on disk, and when you read it, you put it into shared buffers. The important thing is that it always travels through the operating system cache, because that is the Postgres design, the Postgres way. Then it becomes dirty; you write the redo information to the write-ahead log, but the page is still dirty and still in shared buffers. Then a checkpoint is issued and this page is synchronized to disk using fsync. And if somewhere in these three layers you do not have enough throughput, it can be an issue for the performance of the whole database, because if you have, for example, large shared buffers but bad disks, or an unconfigured operating system cache, you will get stuck immediately. I believe that at the last PGCon in Ottawa, Greg Smith gave an excellent talk about pushing throughput through these layers, and if you want to know more details about that, I recommend finding the video; I believe it exists on YouTube. So why do we need checkpointing, why all this pain with performance? As I have said, if one tuple becomes dirty, we need to mark the whole page as dirty, and then we need to put it to disk when we checkpoint. This periodic checkpointing is designed to make the restore procedure faster, when we recover from a failure or something like that, or simply restart the database, because we do not need to scan the whole write-ahead log backwards; that could take very long, it could be a lot of WAL segment files. We only need to scan backwards until the last checkpoint, because at the checkpoint we know everything was synchronized. So we cannot avoid these spikes of writes, but we can make them behave a bit better. So, how do we understand that we have a problem with checkpoints?
The first tool to use is actually iostat, and its last column, the percentage of disk utilization. If from time to time you periodically have spikes up to a hundred percent, that means something is periodically eating your disk, and most likely it is the checkpointer process if your configuration is untuned. Then there is pg_stat_bgwriter, a single-row view in the system catalog; I believe it exists in each and every currently supported version of Postgres, and I will show what it looks like a bit later. Just a couple of words about iostat and monitoring. It is quite common that people measure only one metric, IOPS, and many cloud hostings say that they guarantee an average and a maximum throughput, and they actually mean IOPS. If you use this metric alone, it is practically useless, because, especially in virtual appliances, latency can play a significant role. If you have an average throughput of, for example, 10,000 IOPS, but the latency is not uniform and sometimes has spikes, some operations will be pretty slow in comparison with the others. For the database, that can be an issue: it is bad for your optimizer, and it is bad for writing to disk. So at a minimum you need IOPS plus latency plus the percentage of utilization, because with that last metric you can understand how many writes and reads your disk subsystem can manage; you only have 100% of your disk. Beyond that, it is not bad to have all of the metrics: you have iostat, so try to use everything it shows. The good thing is to have all of this on a graph, because a graph helps you see the trend. You could just sample iostat occasionally and see no problem, because the spikes are rare, while from time to time, for example in hot business hours, your database eats 100% of the I/O, and you cannot see that just by eyeballing the raw output. So put everything on graphs. Here, for example, adding an LSI-made SSD accelerator shows that disk utilization decreased drastically.
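Since iostat's raw output is what usually feeds those graphs, here is a tiny sketch of pulling %util out of an extended-statistics line for graphing or alerting. The sample line below is made up; I am only assuming sysstat's extended format, where %util is the last field:

```shell
# Extract %util from an `iostat -x`-style device line.
# The sample line is fabricated; in sysstat's extended output, %util is the last column.
line="sda 0.00 12.00 3.00 250.00 48.0 4096.0 32.4 1.9 7.5 0.4 98.7"
util=$(echo "$line" | awk '{print $NF}')   # $NF = last whitespace-separated field
echo "$util"
```

In a real monitoring agent you would run `iostat -x` on an interval and ship this value to your graphing system.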
That is why you need to monitor everything on graphs. Without them it is difficult to say, for example, whether your settings change helped, or whether you have a problem. So, pg_stat_bgwriter. This one, I would say, is pretty bad; it means the Postgres instance is not tuned well, it is just from my laptop. The important thing is that you can reset it periodically. And if we talk about tuning performance using pg_stat_bgwriter, I would say you need to reset it once a day, because over a long time it is practically useless: it accumulates a lot of information which you cannot interpret. Sometimes, when you need to localize a particular issue, you can reset it once per hour and see the result of your settings change. You see, the first column is how many checkpoints occurred by timeout, and the second one is how many checkpoints were requested because a lot of WAL segments were written. And the main thing, how I would recognize, knowing nothing else, that this is a badly tuned Postgres instance, is that I have both of them. If I have heavy writes, I would like to benefit from writing at once to my fast disks through my good RAID controller, so there would be a lot of requested checkpoints: checkpoints by request would be 29 plus 13, and the timeout column would be zero. At this moment that is enough to know about pg_stat_bgwriter, but we will return to it several times. This one is a better example, from a server with a heavy write workload: if I am not mistaken, it produced at peak times about 8,000 requested checkpoints per hour, which is a pretty high workload, and you see this is a sample for 10 hours. So here there are no checkpoints by timeout, only checkpoints by request. Let me see: does anybody ever look into the documentation for the meaning of all these fields? One, two, three. Yeah, well, I shall say a bit about that. The next important columns are how much time the checkpointer process spent in the write phase and how much in the sync phase, because you need to write and then you need to sync.
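The view and the periodic reset just mentioned, as SQL; the column layout is the one from the 9.x-era releases this talk refers to:

```sql
-- The single-row pg_stat_bgwriter view, with the columns discussed in the talk.
SELECT checkpoints_timed,        -- checkpoints triggered by checkpoint_timeout
       checkpoints_req,          -- checkpoints requested because enough WAL was written
       checkpoint_write_time,    -- ms spent in the write phase
       checkpoint_sync_time,     -- ms spent in the sync (fsync) phase
       buffers_checkpoint,       -- dirty pages written by the checkpointer
       buffers_clean             -- dirty pages written by the background writer
FROM pg_stat_bgwriter;

-- Reset the counters (e.g. once a day from cron) so they stay interpretable:
SELECT pg_stat_reset_shared('bgwriter');
```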
And this column, buffers_clean, is how many dirty pages were dumped not by the checkpointer but by the background writer. In Postgres you can dump dirty pages from shared memory in, generally, two good ways: the dedicated checkpointer process, and the background writer, which, as follows from its name, works in the background and can steadily dump dirty pages between checkpoints; you can use it to avoid very high spikes of I/O during a checkpoint. Those are the most important columns, and we will return to them a bit later in this talk. So, as I have said, it is a nice idea to put everything on a graph. You can see here: the green one is buffers dumped by the checkpointer, the red one is buffers flushed by the background writer, and you see the correlation with disk output. This is actually not a good example, because here they have problems with I/O. You see this 75%, and the disks are almost completely utilized. That is not a good situation, because if you get some spike on your internet site or in your system, you have too little headroom to cope with it. And here the background writer dumped some pages, probably because we changed something in the configuration parameters, and you see the disk utilization goes down slightly. So the graph is a good way to see how these things work together, because here you can clearly see that the I/O utilization spikes are caused by intensive work of the checkpointer. Finally, the checklist: what we need to do. The first thing is to get proper hardware, because if you have one bad set of disks in your server, it does not matter how good the rest of the server is. You can have a lot of memory and a lot of CPU, but if you have one single bad disk in this picture, you will see the problem right there, and it does not matter how well everything else works. So the idea is, first of all, to buy a proper RAID controller.
The proper one is one that has its own processor, because if it uses your CPU for its RAID work, it is even worse than software RAID: it will be very, very slow. The second thing: most modern RAID controllers have a battery-backed cache option, and you need to buy that option, because with the battery you can use the controller's cache to improve your write performance; the cache is backed by the battery, and if you have a power outage, your data will not be lost. As for manufacturers, I do not like to advertise particular hardware vendors, but usually when we work with LSI, or its Dell analog, it is quite okay: the utilities are very conservative and you usually know what you are doing. With Hewlett-Packard, for example, we have experienced a couple of times the problem that they have a top-range RAID controller whose interface is shaped to work only with top-range disks from Hewlett-Packard, and if you install other disks, you get a degraded exchange speed between the disks and the RAID controller. So HP can be very good hardware, but you definitely need to check by part numbers whether the components work together, because such problems exist. Another important thing: you need to monitor your battery state, because if your battery fails, first of all your controller will automatically switch to working without a battery, and that will cause a performance problem immediately. The second RAID controller setting is to use the write-back cache mode. That is an obvious thing: you want to benefit from your RAID controller cache, not from your disk cache, because with write-through your operating system returns from the fsync thinking everything is okay, then the data is moved from the RAID controller cache to the disk cache, and that disk cache will most likely lose the data in case of an outage.
Of course, you could have proper enterprise-level SSDs which have a capacitor on board and can dump your data if the electricity blacks out, but you should be sure that the capacitor actually exists; it does not work with every disk. The I/O mode should be direct in most cases, because the alternative, cached I/O mode, means your controller cache will be used for both writes and reads, and with PostgreSQL, with any database actually, you do not need this read caching on your RAID controller: it would be double caching, or triple caching in the case of PostgreSQL, because you already benefit from the clean or dirty pages in shared buffers when you read them, and a RAID controller cache is too slow for the database to benefit from. And the disk write cache mode should be disabled, which is obvious too, like the cache mode. The next step is to put proper disks into your proper RAID controller, because everything is coupled very tightly. First of all, if you use serial attached SCSI, prefer the small form factor, because you will benefit from the physics, from the geometry: according to the tests, the seek time for small form factor serial attached SCSI is two or three times faster, and for a database that is important. Many people believe that SSDs will do all this work and make the DBA's job quite useless, but it is not so easy, because, for example, if you put desktop-level cheap disks into your server, you can get very poor performance, and your system will be pretty unreliable: they cannot read and write a lot of data, and such disks fail quickly in comparison with enterprise-level SSDs. So if you use SSDs, especially a pure SSD installation for your OLTP database, try to use top-level enterprise disks, Intel for example, because they provide sufficient reliability and they work fast and predictably.
Another thing is that with a pure SSD installation you do not benefit drastically for the write-ahead log, for example, because writing it is a sequential operation, so probably the better idea is to keep a hybrid installation: have the really hot data in a special tablespace on SSDs and everything else on fast serial attached SCSI. That is more reliable and actually safer. Today there exist some top-level SSDs with a life cycle comparable to serial attached SCSI, but I would not recommend them right now, because I could only verify that claim in two years: a disk lives for some time, such disks are relatively new at the moment, and I have never seen an SSD in a pure SSD installation live as long as serial attached SCSI. So maybe in two years all of this becomes a bit obsolete. This part is, I think, clear: for a database we traditionally use RAID 10, 1+0, because it is a good compromise between reliability and speed. Sometimes people try to benefit from RAID 5 or RAID 6; actually, disk capacity and the price of that capacity today do not matter so much, and RAID 10 clearly wins on speed and reliability over RAID 5 and RAID 6. And the final point: if you cannot use proper hardware, the only thing that can make your database fast is to turn off synchronous commit. But you must understand that, for example, if you have financial data and you turn it off, your application can receive a successful return from COMMIT and think the transaction was saved, but it can still be lost later, and it is quite impossible to find such cases, because on a brief examination your data is consistent; you have simply lost whole transactions. That is not a good idea if your data is valuable, and I think most people who use their databases think their data is valuable: it could be user data, financial data, salary data, something like that.
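If you do make that trade-off deliberately, it is a one-line change; a postgresql.conf sketch, with the caveat from above spelled out:

```
# postgresql.conf: trade durability for speed. A crash can lose the most
# recently committed transactions (the WAL writer still flushes in the
# background, so the at-risk window is small), but it cannot corrupt data.
synchronous_commit = off
```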
The next layer is the file system, and generally, if we are talking about performance, our advice now is XFS or ext4. Which of the two is mostly a religious question. Some alternatives, for example ZFS, or setups with LVM, provide operational comfort: you can change slices, resize them and so on, but these things have their own cost, and if you switch off all the comfort features, for example in ZFS, you can probably achieve about the same performance as XFS or ext4, but then without the comfort. So, in general: XFS or ext4. Two mount options are important. The first one, noatime, is widely known by now; it makes the file system faster. The second is quite commonly misunderstood: if you have a RAID controller with a battery-backed cache, you need to turn the write barrier off. The barrier is the mechanism by which your Linux, for example, makes sure it does not write the journal data for data which is not yet on the disk, or vice versa: it issues a barrier, stops flushing pages to disk, and re-sorts the kernel buffers to find the journal data for data which is already on disk, or the data for journal data which is on disk. And if you have a lot of memory for buffers, and today it is quite common to have 128 gigabytes, for example, things will become very, very slow. So if you would like to benefit from your RAID controller, you must turn the barrier off. Then, if you use an external disk array, the details really differ by the manufacturer of the array, but in general a lot of things are common: you just need to take the recommendations for the RAID controller and see how you can implement them on your disk array. Usually each and every vendor publishes recommendations on how you should tune it, for example from Intel or for IBM arrays, and I have never seen bad advice in such manuals, so you just need to read them; the ideas are much the same as with a hardware RAID controller.
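Coming back to the two mount options for a moment, an /etc/fstab sketch; the device and mount point are placeholders, and the exact barrier option spelling varies with file system and kernel version:

```
# /etc/fstab sketch; device and mount point are placeholders.
# nobarrier is safe ONLY with a battery-backed controller cache.
/dev/sdb1  /var/lib/postgresql  xfs  noatime,nobarrier  0 0
```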
Going up the stack, we reach the operating system level, where there is a similar write-back mechanism, and you have parameters such as vm.dirty_background_ratio and vm.dirty_ratio. The first of them is the percentage of kernel buffers that is allowed to be filled with dirty pages before pdflush, or whichever flush daemon you have, starts flushing the pages to disk; in most Linux distributions these are, I think, 10% by default for the background ratio, with a higher hard ratio above it, and that means that somewhere between those fill levels the dirty pages will be dumped. Imagine that you have 128 gigabytes of operating system memory and 25% of it for shared buffers: flushing will only start when you definitely have more dirty pages than your RAID controller cache can absorb, and that is a bad thing, because with an enterprise array you could quite probably afford something like that, but with a normal RAID controller it is quite impossible. So my advice is to switch to the alternative parameters, vm.dirty_background_bytes and vm.dirty_bytes, and calculate them based on how much battery-backed cache you have on your RAID controller. If you set these byte parameters, the ratio parameters effectively become zero and the byte values take over, and you can always check whether you have the proper settings for your disk array. Of course, if you do not use a BBU-backed RAID, you should keep these values much, much smaller, at least divided by four compared to this example. And finally, when you have everything in place, you can try to benefit from better checkpointing in Postgres. We usually recommend setting checkpoint_timeout to infinity; the maximum value today is one hour, and that means that most likely you will checkpoint by checkpoint_segments. You could set the segments, for example, to some large amount, and then your database will only checkpoint when it has accumulated a lot of data; that causes a somewhat slower recovery, but you can choose. My advice is to choose between checkpointing by timeout and checkpointing by segments, and not to mix them: when you checkpoint by segments, keep the timeout very high, and if you checkpoint by timeout, because you need recovery from a given point to be fast, you just keep checkpoint_segments large.
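Returning to the vm.dirty_* calculation for a second, a poor man's sketch of it; the 1 GiB cache size and the fractions are my illustrative assumptions, not rules from the talk:

```shell
# Derive vm.dirty_background_bytes / vm.dirty_bytes from the RAID controller's
# battery-backed cache size. Cache size and fractions are illustrative assumptions.
BBU_CACHE_BYTES=$((1024 * 1024 * 1024))          # assume a 1 GiB BBU-backed cache
DIRTY_BACKGROUND_BYTES=$((BBU_CACHE_BYTES / 2))  # background flushing starts well below cache size
DIRTY_BYTES=$BBU_CACHE_BYTES                     # hard limit; writers block beyond this
echo "vm.dirty_background_bytes = $DIRTY_BACKGROUND_BYTES"
echo "vm.dirty_bytes = $DIRTY_BYTES"
```

The resulting values would go into /etc/sysctl.conf; once the *_bytes variants are set, the kernel reports the corresponding *_ratio parameters as zero. Without a BBU, the talk's advice amounts to dividing such values by at least four.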
And then, when you checkpoint, the idea is to avoid spikes of disk I/O utilization. That means you should set checkpoint_completion_target to 0.7 or 0.9 to spread the I/O activity between two checkpoints; Postgres can manage that pretty well, and instead of having one avalanche of fsyncs right after the checkpoint starts, as you would with, for example, 0.1 here, the writes are prolonged over the interval between checkpoints and your disk utilization will be noticeably lower. So with heavy write workloads such settings work well in most cases. You can check whether your operating system, hardware and everything else were configured properly: in, I believe, all 9.x versions of Postgres there is a utility called pg_test_fsync. It has a pretty large output, this is only one piece of it, and it really issues fsync operations, the same as a checkpoint does. If you get something like these numbers, it is probably okay, manageable to live with; if you get two times more, most likely you can handle very high write workloads; and if it is, for example, two times less than these values, that definitely means you have a problem: most likely you have the barrier enabled or something like that. So whenever you remount something or upgrade your hardware, just run this check; it helps you avoid stupid mistakes. And finally a small trick about checkpointing, and then I will say a bit about autovacuum. As I said, the other way to dump dirty pages in Postgres is to use the background writer, and it has its own parameters in postgresql.conf. If you have intensive writes, you can try to keep these values close to their maximums: that means that every time bgwriter_delay elapses, the background writer will be woken, and in one scan it will dump up to that number of dirty pages. Between checkpoints your background writer can really help, because it is usually not as heavily loaded as the checkpointer; working in the background, it can dump a lot of pages, and the performance will be better.
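Put together as a postgresql.conf sketch for a heavy write workload; the concrete numbers are illustrative, and checkpoint_segments belongs to the older releases this talk is based on (later versions use max_wal_size instead):

```
# postgresql.conf sketch for heavy writes; values illustrative
checkpoint_timeout = 1h               # the maximum; WAL volume will trigger checkpoints instead
checkpoint_segments = 256             # large, so segments drive checkpoint frequency predictably
checkpoint_completion_target = 0.9    # spread checkpoint I/O across the whole interval
bgwriter_delay = 10ms                 # wake the background writer as often as possible
bgwriter_lru_maxpages = 1000          # and let each round write many dirty pages
```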
It allows you to spread the disk utilization between checkpoints more or less evenly. One more comment: these defaults are actually meant for workloads that are not write-heavy, because if you do not have a very high write load, you need to avoid too aggressive work by the background writer. So, why is it important not to forget autovacuum? Who here has tried to switch autovacuum off? And the result was the obvious one: finally your database has problems, a lot of problems. In terms of disk performance, the main problem is that your database becomes larger, the fragmentation becomes higher, and finally you have a lot of dirty pages which may contain only a few live tuples; that means your checkpoints will have many more pages to dump to disk, which is not a nice thing to have at all. And when your autovacuum is not aggressive enough, it cannot cope with the load; it works longer and probably more intensively, and that is not cheap from the point of view of disk I/O. So, as I think many speakers have said, you should keep autovacuum as aggressive as possible. Here my usual advice is to use the scale factor: set it so that when 1% of the data in a table has changed, the table will be autovacuumed, and normally that is quite enough for proper autovacuum settings in modern versions of Postgres. A small addition to that: sometimes you need some refinements. Sometimes you need to limit the high disk usage of autovacuum, and then the poor man's idea, but very useful in practice, is to renice and ionice the autovacuum workers periodically from cron. It really works, and it is a better solution, from my point of view, than playing with autovacuum_naptime and so on, because it does not really hurt autovacuum: it keeps working, but the resource consumption on your database server is much better. The second trick, when you have bad disks or a heavy write workload, is to increase autovacuum_max_workers a bit. Why is that actually important? By default you have three of them.
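As a sketch, the aggressive baseline plus the worker increase might look like this in postgresql.conf; the exact numbers are illustrative:

```
# postgresql.conf sketch; values illustrative
autovacuum_vacuum_scale_factor = 0.01    # vacuum a table when ~1% of it has changed
autovacuum_analyze_scale_factor = 0.01   # keep planner statistics fresh on the same schedule
autovacuum_max_workers = 10              # raise from the default 3 if all workers are always busy
```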
For example, here we have an autovacuum graph: the red line is the value of autovacuum_max_workers, and we have three of them. Sometimes we see that all the workers are busy all the time, and that means that when the next table needs to be autovacuumed, it waits in a queue until a worker is free; by the time that worker gets to the table, it is no longer 1% of changes, it could be 10% or 20% of changes, and that makes your autovacuum not aggressive enough. So if you see on your graphs, for example, that almost 80% of the time all your autovacuum workers are busy, you need to increase the value, and then you will see that things really get better, because you will not have such shelves on the graph as here. So that is what is important for autovacuum from the point of view of disk performance. And finally, I would like to thank our clients, who provide us with all these tricky cases to cope with, my colleagues, who are actually in charge of them, and the team of okmeter.io, who provide such beautiful graphics. If you have any questions, you can ask them now, or just email me later and I will answer. Thank you. [Inaudible audience questions follow.] I am sorry; from time to time, but generally we do not. I would say that for small workloads it does not matter; for high ones, physical is better. Yes, if you have good disks you should decrease that, because it also allows autovacuum to work more aggressively. Yes, that is a good point. Yes, 10, 20, something like that; more than that, I think, does not matter. Autovacuum cost limit. autovacuum_vacuum_cost_limit, this one. Disabled; yes, that means the special one is disabled. It will be available; I will publish the slides, and I think the organizers will too. pg_stat_activity? Or just pay us, if you prefer that. You actually need information about what an autovacuum worker is doing, ANALYZE or something like that, and for that pg_stat_activity is the most convenient. More questions?
Are there cases where tuning the background writer is more efficient than tuning the checkpointer? If you have a host with a lot of dirty pages: I would say that if you have a heavy write workload and high I/O utilization by the checkpointer, and you actually have good disks, it is a good idea to let the background writer work, and you can usually see that the disk utilization decreases. The only case where you do not need the background writer to be aggressive is when you have a low write workload, because then it hurts the cache a bit and hurts reads a bit. So by default, if you do not need this aggressive behavior, you do not turn it on; but if you need it, you need it. I have never seen it cause problems. Some more questions? If there are no questions, I would like to thank you, and if you have some more, you can just easily write to me or ask here today. Thank you.