I'm Andres Freund, I work for Citus Data, and I'm a PostgreSQL developer and committer. I spend most of my time playing around with PostgreSQL in one way or another, and today I want to talk about how PostgreSQL deals with I/O and how that influences performance, because the two are closely related, and sometimes it's hard to understand why you need to tune a parameter to make performance better, or why it makes things worse.

If you run throughput tests on an untuned Postgres (I did these with 9.6), you often get graphs like this. The blue line is the number of transactions per second, the red line is latency in milliseconds, and you can see it's a pretty terrible picture. In this case the test was done on my workstation at home, unfortunately on rotating media, because I had destroyed my SSDs with tests the day before; they don't last all that long. But it doesn't really change if you use SSDs; the latency numbers are just not quite as bad. Obviously this is a bad picture, but it was also done with the default parameters, so there's a reason it's bad. So basically I want to explain why it looks this way and how you can make it better.

For that we unfortunately need to look a bit into the architecture of Postgres. The basic architecture is this: we have various processes. Those are the things in the middle, the connections, plus some background processes like the checkpointer and some other internal processes. They each have some local memory, which is whatever gets allocated when you call malloc or Postgres's equivalent; that's on the left-hand side. And they are all attached to some common memory, the shared memory, which is where we keep the buffer cache and the locking information used to coordinate between the connections. By size, the buffer cache, which you configure with shared_buffers, is going to be the largest part of shared memory most of the time, and since shared_buffers influences how I/O works, I'm going to explain a bit what it looks like.

It basically looks like this: we have one hash table that says "this block of this table is mapped to this buffer in shared buffers", and each shared-buffers page is eight kilobytes of data plus some flags and locks and so on associated with it. Initially, when you start the database, the hash table will be empty, and none of the buffers in shared buffers will contain anything.
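As a concrete way to see that mapping, here's a hedged sketch using the pg_buffercache contrib extension (it comes up again near the end of this talk); the table name is a placeholder:

    -- One row per shared buffer: which block of which relation it currently holds.
    CREATE EXTENSION IF NOT EXISTS pg_buffercache;
    SELECT b.relblocknumber, b.isdirty, b.usagecount
    FROM pg_buffercache b
    WHERE b.relfilenode = pg_relation_filenode('my_table')  -- placeholder table name
      AND b.reldatabase = (SELECT oid FROM pg_database
                           WHERE datname = current_database())
    ORDER BY b.relblocknumber
    LIMIT 10;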
So how does reading data actually work? Whenever we need a page from some table in memory, for example because you did a SELECT * FROM that table, the query execution engine finds it needs the first block and looks into the hash table, checking whether that specific entry already has a buffer associated with it. If there is one, that's easy: we can use it, and on further executions we don't have to go to the operating system again. If there isn't, we find a free buffer in shared buffers (if there are no free buffers we'll have to do something else, but that's the next step), point the hash table entry at it, acquire some locks, do some internal bookkeeping, and then comes the important part: we ask the operating system to please open the file and give us that block.

What's important here is that, in contrast to some other database systems, Postgres does not use direct I/O. So when it does a read() system call, the OS may already have the data cached; in that slightly faster case, the OS's own page cache says "here you go, here's the page". If that's not the case, the OS has to ask the storage for the data, which takes a bit longer, but eventually it has read it too. Once that's done, we can load the data into our shared buffers, and then we're done and can return the page to our internal user.

So what happens if we don't have a free shared buffer, because our workload is bigger than our shared_buffers setting? That will not always be the case, but especially in the default configuration it quite frequently will be. The algorithm Postgres uses for that is called a clock-sweep algorithm. We saw earlier that the shared buffers are basically arranged in a list from zero to however many you have configured; the clock sweep arranges those buffers in a circle, and we have one clock hand, which is shared by all processes. When we need a new page, we look at the buffer that's currently under the clock hand. Each buffer has a so-called usage count, which is basically information about how often that buffer has been used. Say the buffer under the hand has a usage count of four: that means it has been used somewhat recently, so we can't directly reuse it, because we would potentially throw out useful content. What we do instead is say "okay, we've now looked at it", decrement the usage count by one, and look at the next buffer, where we do the same again, and so on. At some point we'll find a buffer with a usage count of zero. Potentially that means we have to go around the whole clock, over all the buffers, as many times as the maximum usage count, which at the moment is five.

Once we have found such a buffer, we either just throw out the old content, or we potentially have to write out the current content first: if somebody had written to that page, it was updated in memory, and we obviously need to update the file system's view of that page with the new content. That works the following way: again we take the entry in shared buffers, lock the page, and then, and this is the relatively boring part, we do a write() system call to the operating system. Here it's important again that we don't use O_DIRECT: we write to the operating system, the operating system does its own caching, and after that we just say "okay, we're done with writing out".
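A hedged way to watch this read path from SQL is EXPLAIN with the BUFFERS option, which reports how many pages were found in shared buffers versus requested from the OS; the table name is again a placeholder:

    -- "shared hit" = found via the buffer-mapping hash table;
    -- "read"       = had to ask the OS (which may serve it from its page cache);
    -- "dirtied"/"written" = pages this query modified or had to write out.
    EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM my_table;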
There's one more very important part to understanding how databases, and not just Postgres, generally do I/O, and that is that for consistency most databases use something like write-ahead logging (WAL) or undo logging. That basically works like this: whenever you do a modification to the database, say you insert a record, you insert an entry into a log that says "insert into table so-and-so this content", and whenever you commit, you write "commit transaction so-and-so". So you have this log that grows a bit with each modification, and when you commit a transaction you don't have to write out all the changes made to all the data tables; you just say "sync the log up to this position", and then you can return the commit to the user.

At some point it might happen that your database crashes, because your hardware gives out, or because you kill -9 the Postgres process, or something like that. In that case, we start reading the log from, basically, the point where we started the database, and just look at the log entries: when a log entry says "insert something at that position in that table", we do that; when it says "mark this transaction as committed" (or aborted), we do that. So we go through the log and replay all those actions. That's how durability happens: we have redone all the operations that happened before the crash, and now we're up to date again. Once we've done crash recovery, we can say "okay, we just start at the position up to which we did crash recovery", accept writes again, and fill the WAL again. The reason I drew these small bits at the beginning and then the larger block is that in Postgres the WAL is by default created in 16-megabyte chunks, each of which is a physical file. You'll sometimes have to interact with those 16-megabyte files if you've ever done point-in-time recovery or streaming replication or something like that.

Once the database is working again, you'll just create new log entries, and the log will grow and grow and grow. At some point you have the problem that the log would grow to an indefinite size, because so far we don't know how to remove old log entries. That's obviously not something you can live with, because over the lifetime of a database you will write many times the amount of disk space you want to have available for the log. Also, if you always had to start recovery from the beginning of the WAL after a crash, it would take a lot of time. So there's a process called checkpointing, which basically says: I'm writing all the modifications that were done previously out of shared buffers and onto disk, and then I'm fsyncing all the files in the database, which means all the old modifications are now safely on disk, and I can remove the log entries that were made before my checkpoint. So I can just delete those files and continue writing. It's important to know that in Postgres this checkpoint doesn't happen in the foreground, where it would block things; it happens in the background, and you can continue writing while it happens. It's all a bit more complicated than I'm drawing here, but that suffices for a basic understanding.

The way reading and writing are done and the way checkpoints are done explain a lot of the performance problems; that's why I went through them. So when do we do these checkpoints, and why is that important? The default setting is checkpoint_timeout = 5min, so every five minutes we go through shared buffers and write out all the dirty buffers. That's important because, say your shared_buffers is one terabyte of memory and you dirty all of that one terabyte all the time, and you checkpoint every five minutes: you can calculate what your maximum write rate from checkpointing alone is, because it will be one terabyte divided by five minutes, more than three gigabytes a second. You'll have a very high constant write rate, and you need an I/O subsystem that can handle it.
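As a hedged aside: on a 9.6-era server you can estimate how fast your WAL grows by sampling the insert position twice (these functions were renamed to pg_current_wal_insert_lsn and pg_wal_lsn_diff in version 10; the LSN literals below are placeholders):

    -- Sample the current WAL insert position, wait a known interval, sample again:
    SELECT pg_current_xlog_insert_location();
    -- Then compute the bytes of WAL generated between the two samples:
    SELECT pg_xlog_location_diff('2A/BF000000', '2A/20000000');  -- placeholder LSNs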
There are also other reasons why we start checkpoints: when the WAL grows too much, we say "okay, we want to create a new checkpoint, because we want to remove the old entries", so we don't use arbitrary amounts of disk. Up to 9.4 the configuration variable for that was checkpoint_segments, and it defaulted to three, which meant that after 48 megabytes of WAL we wanted to do a checkpoint. That is obviously, in practice, ridiculously low for anything with high write throughput. Since 9.5 we have a different setting, max_wal_size, which has a slightly less insane default (one gigabyte). The reason we renamed it is that checkpoint_segments had the problem that the actual amount of disk space needed was more than checkpoint_segments times 16 megabytes; you basically had to multiply things out to figure out how much disk space it was going to use. That's why we renamed it and rejiggered how it all works, and it's now max_wal_size. If you enable the logging of checkpoints, you will see the reason a checkpoint was started: you'll see "checkpoint starting: xlog", and potentially also a warning that checkpoints are happening too frequently, which is not good for performance; it will warn you every now and then. You can also trigger checkpoints manually by issuing CHECKPOINT via SQL.

Up to, I think, 8.3, the way we did checkpoints was that some process did the checkpointing but tried to do so as fast as possible. So if you had shared_buffers set to, I don't know, 100 gigabytes, it tried to write all the dirty buffers in those 100 gigabytes as fast as possible and then went on to fsync all the data. But if you imagine writing out 100 gigabytes of data at one sudden point in time, that will make everything else very slow. So what was introduced back then is spread checkpoints: we just say our checkpoints are allowed to take longer, because it's not that important that we remove all the WAL immediately. By default this uses a setting named checkpoint_completion_target, which is set to 0.5. What that means is that we try to finish a checkpoint by half the time before the next checkpoint starts. The checkpointer will just sleep whenever it's ahead: after it has done some work, it computes how far it still has to go and sleeps accordingly. But that also means that if you have too much work to do during a checkpoint, the completion target might not play a role; it won't slow things down if the checkpoint wouldn't finish in time otherwise.

So, to go back to the graph from earlier: I said that this is an untuned Postgres installation, and you can see that after, I don't know whether you can see it, I can't see the screen, after about 20 seconds it starts to get very slow, then it's slow for a period, then it starts to be fast again. So what did we tune wrong? The first thing we might assume is that the default shared_buffers setting was too low, so let's increase shared_buffers to something larger, in this case 16 gigabytes, which actually makes the whole workload fit into shared memory. As you can see, it might have slightly improved performance, but it's still pretty bad. So the next guess is: okay, the default checkpoint_segments setting is very low, so what happens is that we do checkpoints very frequently.
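As a sketch, the checkpoint-related settings mentioned so far look like this in postgresql.conf (these are the 9.5/9.6-era defaults, shown for orientation, not as recommendations):

    checkpoint_timeout = 5min            # time-based checkpoint trigger
    max_wal_size = 1GB                   # WAL-size-based trigger (replaced checkpoint_segments)
    checkpoint_completion_target = 0.5   # spread each checkpoint over half the interval
    log_checkpoints = off                # turn on to see why each checkpoint started

And you can force a checkpoint manually at any time with the SQL command CHECKPOINT;.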
And whenever a checkpoint starts, you can see that the performance tanks very, very badly. That's because the checkpoint is writing out data quickly, since we very quickly exceed checkpoint_segments, or max_wal_size in this case, because I did this test with 9.6, and while it's busy doing that, it slows down throughput and increases latency: every write takes longer because of the checkpoint happening at the same time.

So let's increase max_wal_size so checkpoints happen less frequently, and the result is this. Note that the time scale at the bottom has changed; the whole run is much longer now. The first 300 seconds are where the performance is consistently very good, and after 300 seconds the performance goes down. What happens is that at that point a checkpoint starts and begins writing out data, and suddenly we have this zigzag pattern where performance is very good for a short moment and then very bad for a short moment.

Why would that be? I said earlier that whenever we write out data from the database, we just write it to the kernel, and we don't use O_DIRECT, so the kernel caches the data on its own. Every now and then the kernel will decide "now I have cached enough, and I need to write it out". So one possible assumption is that the kernel writes the data back in a not very regular manner. We can monitor the kernel for how much dirty data it has, and this graph shows exactly that, and as you can see, it follows a pattern very similar to the performance graph. The problem is basically that the kernel caches too much, and whenever it writes that data back, performance suffers.

So one idea would be to change the OS configuration to tell the kernel to cache a lot less, and for a pure-database, OLTP-style workload, that can actually be a very good idea up to and including 9.5. What you can do is tell the kernel to write data out more often: you can reduce vm.dirty_writeback_centisecs to a lower setting, or you can tell it to start writing data out whenever more than a certain number of bytes are dirty, which is vm.dirty_background_bytes. One of the problems with that is that it can increase the total amount of writes, and it can increase how random the I/O is, because the kernel tries to do some reordering; it's not that good at it, at least on Linux, but it tries. If you tell the kernel that it shouldn't cache more than 40 megabytes of dirty data, you get this kind of graph, which already looks a fair bit better. We can now see that performance is very good for 300 seconds, then we're slower for about 150 seconds or so, and then performance is very good again. And now we can actually see the effect of the checkpoint_completion_target I mentioned earlier: after 300 seconds we start a checkpoint, and we try to finish that checkpoint by half the time before we have to start the next one, and that's exactly these 150 seconds where the performance is lower.
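A sketch of that pre-9.6, Linux-specific kernel tuning, assuming the 40-megabyte threshold from the test; note these are system-wide knobs that affect all I/O on the machine, not just Postgres:

    # /etc/sysctl.conf sketch (only worth considering before Postgres 9.6)
    vm.dirty_background_bytes = 41943040   # start background writeback at ~40 MB of dirty data
    vm.dirty_writeback_centisecs = 100     # wake the flusher threads every second (default: 5 s)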
So the next thing we could do is try to spread the cost of the checkpoint over more time: we can increase checkpoint_completion_target to something higher, and then you end up with something like this; I think I set it to 0.9. And that's already pretty good. I mean, performance is still not as good as if there were no checkpoints, but that's pretty obvious: if we don't do any writes, performance will be better. One thing to also consider is that this benchmark is a saturation test: we try to do writes as fast as possible. Most applications aren't that way. Most of the time your application has a certain demand for writes, say "I need to do a thousand transactions a second", and then everything is okay; it's not a benefit to you if you can do 10,000 transactions a second when you don't have 10,000 transactions to process a second. The important part is just to get the performance above your required baseline, and increasing checkpoint_completion_target can very often achieve that by spreading the work.

To get from the example back to how you can tune things, I'll give some basic guidelines on how to tune shared_buffers. (Yeah, sure; give me one slide, and then I'll ask you afterwards whether it answered your question.) For shared_buffers, the most important part is: leave enough memory for the other work that has to happen. Queries sometimes need memory for sorting, you might run aggregates, you might want to run other applications, so you shouldn't set shared_buffers to something like 90 percent of the memory of your machine. Then there are some problems, which I'll go into in a little detail later, which mean you should only set shared_buffers to a very large value if your hot data set, your working set, fits into it, because in those cases none of the scalability limitations around shared buffers in Postgres really matter to you. So if you have a workload where you can say "I have 256 gigabytes of memory and my working set is about 100 gigabytes", then it can make a lot of sense to set shared_buffers to 128 gigabytes.

There are unfortunately a few exceptions to that. If you very frequently do a DROP TABLE, or re-index tables, or something like that, then large shared buffers will hurt you, because unfortunately we only have a hash table over shared buffers, as I explained earlier. When we drop a relation, we need to remove all its pages from shared buffers, and our solution is to walk through all of shared buffers, check whether each page is relevant, and remove it if so. That can increase the cost; it normally doesn't matter for production workloads, but for regression-test workloads it can make a huge difference. I've seen regression tests go from minutes to seconds by reducing shared_buffers. The other case is bulk writes in a workload that's bigger than shared buffers: there, reducing shared_buffers can, counterintuitively, often increase performance. If you remember the clock sweep I was talking about earlier, the reason is that the clock sweep can then take a lot longer to find a replacement page, and it will often have to write out dirty data from shared buffers itself, and that increases the cost. And if you do use a large shared_buffers setting and you use Linux, consider configuring huge pages, because that can drastically reduce the per-process overhead of having large shared buffers. So those are the basic guidelines for shared buffers.
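As a hedged sketch for the 256-gigabyte-machine example above (changing shared_buffers needs a restart; the values are illustrative):

    shared_buffers = 128GB   # working set (~100 GB) fits with room to spare
    huge_pages = try         # 9.4+: use huge pages if the kernel provides them, else fall back
    # on Linux you also have to reserve them, e.g. via the vm.nr_hugepages sysctl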
For the write-ahead log, that is, for the sizing of checkpoints, the most important guideline is that checkpoints should always be triggered by time, never by the WAL-size limit. What you want to do is size your checkpoints, by setting appropriate checkpoint_timeout and max_wal_size values, so that the timeout triggers them. The reason is that time is a lot easier to estimate: it's easy to estimate how long five minutes take, because, well, they're five minutes, so we can spread the work on a very regular basis. But if we have to do checkpoints because of the WAL size, your workload might not be completely predictable, so we might need to speed up and slow down the checkpointing, and that can influence your application's performance. It's okay if you have, say, a nightly batch job and the WAL size is the limiting factor then, because at that point you might not have other queries running. But that's the basic rule.

The other important part is that larger checkpoint distances increase how much time you spend on recovery, because after a crash you need to replay all that WAL; if you have a thousand segments, that will take longer than if you have only ten. A very rough guideline is that on current, decent-ish hardware you can usually replay something like three to eight segments a second. But there's unfortunately also a counterargument to having relatively small checkpoint distances, besides the total amount of I/O: whenever a page is touched for the first time after a checkpoint, we don't just write the change, for example the row we updated; we also write a so-called full-page image, which means we log the whole content of the page into the WAL, and that can drastically increase the amount of WAL. I've seen cases where increasing checkpoint_timeout from five minutes to one hour reduced the total amount of WAL by more than a factor of five; we went from something like 100 gigabytes an hour to 20, just because we needed far fewer of these full-page images. Having fewer full-page writes also helps replication performance, because they're pretty large: you have to transport them, and you have to write them on both sides. So increasing the checkpoint distance often actually improves replication performance, but it also increases the amount of time until a new base backup has recovered and can accept read queries.

Does that kind of answer your question, or not? [Question from the audience] No, with streaming replication we continuously send the data in 8-kilobyte or 16-kilobyte chunks or so, and the standby can apply it in those increments. If you use archive-based replication, with archive_command and restore_command, then it's 16 megabytes at a time, because the WAL is always written in 16-megabyte segments. And yes, it's pretty easy to create scenarios where your network can't keep up with the amount of WAL; I've seen 40-gigabit links saturated with WAL.
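Coming back to the sizing guideline at the start of this section: a hedged way to check it on a running server is the pg_stat_bgwriter view:

    -- If checkpoints_req dominates checkpoints_timed, checkpoints are mostly being
    -- forced by WAL volume, and checkpoint_timeout / max_wal_size deserve raising.
    SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;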
There are some performance features in Postgres that try to make all of this more efficient. There's the WAL writer, which just tries to write out WAL so the backends don't have to do it themselves when they insert WAL records. There's not really anything to tune there; it's mostly important if you use asynchronous commit, and you can't tune it, so it's there and it does its job.

Then there's another thing that's pretty important. I said earlier that when the clock sweep finds a dirty buffer to replace, it has to write that buffer out, and that means a query doing a plain SELECT suddenly has to write out data. You don't want it doing that work, because it should be answering the query as fast as possible. So there exists a process called the background writer, which basically just looks at the next few buffers ahead of the clock hand and tries to write them out. Unfortunately that thing is overly complicated and under-effective, and one of the problems is that the defaults are very bad for any real high-throughput workload, because its configuration values limit the maximum amount of work it can do to about four megabytes a second, and if you have any real throughput, that's not very interesting. If you want, you can tune it with the two settings at the bottom: bgwriter_delay, which is how long it sleeps after doing some work, and bgwriter_lru_maxpages, which is after how many written-out pages it goes to sleep. You can increase those, but unfortunately even then it's not very good; it's too complicated, and we're working on replacing it for 9.7.

Unsurprisingly, there's another very similar process that needs tuning in pretty much every database of non-trivial size, and that's autovacuum. In my experience, untuned autovacuum causes the most downtime with Postgres, because it is by default also cost-limited, and its cost limits cap it at about four megabytes of work per second at the maximum, often lower than that. That means that if you have more than four megabytes of writes a second, it can't keep up, you fall behind, and that's how you often end up with databases that have these huge amounts of bloat. If you ever had autovacuum problems and you ran the default configuration, this is likely the reason. Oh, there's a mistake on the slide: the default for vacuum_cost_page_miss is not 20, it's 10; I copied the value twice, 20 is actually what vacuum_cost_page_dirty is. Sorry about that. Anyway, the way it works is that vacuum accumulates a cost while performing its work, and after it reaches the cost limit, which is by default 200, it sleeps for 20 milliseconds, and if you do the math, that comes out to about four megabytes of dirtied data a second at the maximum. So you need to tune that if you have any real write activity in your database. If you have decent hardware, you can often just disable the cost limiting by setting the cost delay to zero milliseconds; that often works. Otherwise you'll need a much higher cost limit and a lower cost delay. So yeah, do that.

Then we have the problem we saw earlier, that the kernel caches all these writes and writes them back to the hardware in irregular patterns. One way to address that is by changing those OS dirty-data defaults, but what we can also do is force the kernel's hand and make it write out data on a more regular basis, and that's what we did in 9.6. With 9.6 you don't need the OS dirty-data tuning, which is good, because that tuning has a lot of problems: there are other writes that you don't want written out immediately, for example for a temp table or a disk-based sort, where you don't want the data to hit disk unnecessarily. So in 9.6 we have a feature, which works on Linux and on some other platforms, though not as well as on Linux, that just forces the kernel to write out pages once they've been dirtied beyond a threshold of, depending on what is writing them out, up to a couple of megabytes.
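A consolidated, hedged sketch of the knobs from this section; the values are illustrative starting points for a write-heavy machine, not universal recommendations:

    # background writer: lift the roughly 4 MB/s default work limit
    bgwriter_delay = 100ms               # sleep between rounds (default 200ms)
    bgwriter_lru_maxpages = 1000         # pages it may write per round (default 100)
    # autovacuum: don't let the default cost limit (~4 MB/s of dirtied pages) starve it
    autovacuum_vacuum_cost_delay = 0     # on decent hardware, often fine to disable throttling
    # 9.6+: tell the kernel to start writeback early, instead of the sysctl tuning above
    checkpoint_flush_after = 256kB
    bgwriter_flush_after = 512kB
    backend_flush_after = 512kB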
And to alleviate the problem that this increases the amount of random I/O, the checkpointer now sorts its writes, which is something we should have done long ago.

The other problem I mentioned is the hash table: at the moment we can't do any sort of "give me the next buffer of this file" operation efficiently; we have to do a completely separate lookup for each block. And we can't use the hash table for sorting, because hash tables by definition have no ordering, and that has caused a number of problems. Another part of the problem is that hash tables, from an efficiency point of view, are O(1) in the normal case, but the constant cost factor for something that's accessed across multiple CPUs is pretty bad. So we'll probably have to replace the hash table with something else. We have also had a number of locking issues around this hash table, because it's accessed by all processes, so the locking around it is very important. The most important improvements there went into 9.5, so if you run into locking issues on something older, upgrade to 9.5. We have some more important improvements in 9.6, but those mostly matter if you have a working set that's a lot larger than memory. And I think the best replacement for the data structure we use is a radix tree instead of a hash table; I'm trying to work on that. I'm not 100 percent sure I can get it into 9.7, but that's my hope.

The other big problem is that our cache replacement scales badly. You saw this clock earlier. In the past, whenever we wanted to get a buffer from that clock, we acquired a global lock, and only one process could advance the clock and look for replacement pages. We fixed that in 9.5 by using atomic operations, but the clock still has to be advanced, and as I showed earlier, we sometimes have to go around the whole clock five times to find a replacement buffer. If you have a couple million shared buffers, going through an array of a couple million entries, possibly several times, is really expensive. So our fundamental cache replacement algorithm is just not very efficient. The reason is that it was mostly designed for making it cheap to bump the usage count, which is very nice, but the replacement has become so inefficient with today's memory sizes that I think we need to replace the fundamental algorithm. I don't really know what would be good there; various people have proposed different algorithms, but it's pretty hard.

A related problem is that this maximum usage count of five means we have very low granularity of information about how often pages are accessed. You often reach a state where all buffers have a usage count of five, and then the clock goes around and reduces all of them to one or zero, and then buffers get replaced, so you have very little locality. That's something you can actually observe in the database using the pg_buffercache extension, and it's a very interesting thing to do: look at how the usage counts are distributed in your database, because that allows you to estimate how well the caching works. If you have a very uniformly distributed usage count, that usually means your workload is a lot bigger than shared buffers and the caching doesn't really work that well. If you mostly have a high usage count, plus a third or so of the buffers that are a lot lower, that means you have a well-cached workload. And by looking at how that develops over time, you can estimate how soon you're going to hit the wall of "okay, my cache is too small".
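That usage-count inspection can be a one-liner with pg_buffercache:

    -- Distribution of usage counts across all of shared_buffers;
    -- rows with NULL are buffers that are currently unused.
    SELECT usagecount, count(*)
    FROM pg_buffercache
    GROUP BY usagecount
    ORDER BY usagecount;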
Another problem is that, as I said earlier, we don't use O_DIRECT, and that means the kernel page cache can contain data that Postgres is also caching, so we can end up wasting memory by caching nearly the same data twice. It's unfortunately not that easy to just use O_DIRECT, which would avoid the page cache, because Linux, and most kernels, do a lot of optimization around how to queue I/O to devices, and if you don't use their facilities for that, things get slower. I think we're making slow progress towards being able to do it, but we're far from there. The other big problem is that it's very OS-specific, so we can't do it easily: the Linux implementation would look very different from even the Solaris or FreeBSD implementation, and even more so from the Windows implementation.

Okay, this is basically what I wanted to go through. Are there any questions?