The Carnegie Mellon Vaccination Database Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and Postgres configuration at ottertune.com. And by the Steven Moy Foundation for Keeping It Real — find out how best to keep it real at stevenmoyfoundation.org.

We're excited today to have Andres Freund. He's a core committer on the Postgres team, based out of Microsoft. Andres is here to talk about Postgres, obviously, but he's very explicit in his bio that he is not a DBA, so he's not here to answer DBA questions — it's more about Postgres internals. If you have any questions for Andres as he gives the talk, please unmute yourself, say who you are, and feel free to do so at any time; this can be a conversation. And with that, Andres, the floor is yours. Go for it.

Cool. Today I want to talk about how I/O in Postgres works, why it works like that, what the problems are, why we want to change it, and what made us change it right now rather than in the past. I started working on Postgres about 15 years ago, and it grew over time to take up more and more of my time.

To get started: Postgres has a fairly simple design in the end, and the basics have been the same for at least 15 to 20 years. There's a buffer pool in shared memory, all data accesses go through that buffer pool, and the buffer pool is filled by buffered I/O using pread, pwrite, or whatever the system calls are on the platform. When I say buffered I/O, the relevant thing, for those of you who don't know, is that the application — in this case Postgres — does a system call, but the I/O does not go directly to the drive. It first goes to the kernel page cache, which might already have the data; if not, the page cache asks the drive for the data, and the drive returns it, often via DMA. DMA just means the copying of the data doesn't have to be done by the CPU. At the end of the I/O, the kernel is notified that the I/O has completed, the data is copied from the kernel page cache into the application's buffer, and the system call completes. As a contrast to that, there's direct I/O, where the kernel page cache is not involved: the application does a system call, the kernel converts it into I/O, and the drive (or whatever the hardware is) does memory access directly into the buffer provided by the application, again via DMA, so no manual copying of data is required.

For durability, Postgres uses a fairly standard write-ahead log approach: before a modified page from the buffer pool can be written out, the corresponding write-ahead log has to be synchronized to disk first, if the page was modified recently enough that its log hasn't been written out yet. We've improved the design here over time — we have group commit, and the WAL in memory can be written concurrently by multiple processes, because that was often a significant bottleneck — along with other similar improvements. Checkpoints aren't done synchronously by user backends or anything like that; they happen in the background, with a dedicated process doing them, and that works reasonably well. Typically there are continuously ongoing checkpoints.
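To make the buffered-versus-direct distinction above concrete, here is a minimal, Linux-oriented sketch (not Postgres code — "datafile" is a placeholder path, and error handling is abbreviated). The buffered read may be satisfied from the kernel page cache and always involves a copy into the caller's buffer; the O_DIRECT read bypasses the cache, so the device can DMA straight into a suitably aligned buffer:

```c
/* Minimal sketch contrasting buffered and direct I/O on Linux.
 * Illustrative only: "datafile" is a placeholder path. */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLKSZ 8192

int main(void)
{
    char buf[BLKSZ];

    /* Buffered read: goes through the kernel page cache; the kernel may
     * already have the block, and copies it into our buffer. */
    int fd = open("datafile", O_RDONLY);
    if (fd < 0 || pread(fd, buf, BLKSZ, 0) != BLKSZ)
        perror("buffered pread");
    close(fd);

    /* Direct read: bypasses the page cache entirely; the drive DMAs into
     * our buffer, which must be aligned (and the length and offset
     * block-aligned). */
    void *dbuf;
    if (posix_memalign(&dbuf, 4096, BLKSZ) != 0)
        return 1;
    fd = open("datafile", O_RDONLY | O_DIRECT);
    if (fd < 0 || pread(fd, dbuf, BLKSZ, 0) != BLKSZ)
        perror("direct pread");
    close(fd);
    free(dbuf);
    return 0;
}
```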
There are also further helper processes. One is the WAL writer, which unsurprisingly writes out the WAL when backends don't need to do so themselves, and there's a background writer that writes out data buffers before backends have to when they need to do buffer replacement.

The buffer replacement uses a very simplistic algorithm. It's not really clock, but it was originally based on the clock paper, and its primary goal is to be very concurrent: one doesn't need to maintain expensive linked lists of all buffers or anything like that. Instead, each buffer just has a count of the number of times it has been used since the last time its usage count was reset, and each backend sweeps the clock of all buffers, checking whether a buffer's usage count is zero — if not, it decreases the usage count and moves on. The background writer integrates into that and tries to write out buffers before backends have to, because obviously when queries have to write out data, that's time spent that we don't want to spend in user-facing queries.

If it's clock, you can get sequential-scan flooding. How does Postgres ensure that a scan doesn't blow out the whole cache?

There's a pretty mediocre defense against that, which is that Postgres has what it calls ring buffers, used for operations that do bulk I/O, like sequential scans, vacuum, and so on. Those ring buffers are fixed size, and whenever a new page is to be read that's not already in shared buffers, we reuse a page from the ring buffer and discard its old contents. That means sequential scans, for example — which I think by default have 128-kilobyte ring buffers — will never read in more new data than fits in the ring. Most of you will have spotted the problem with that approach: if you start with a cold cache, sequential scans will not fill shared buffers very quickly, because they use the ring buffer. Now, it's not quite as bad as it sounds, because for sequential scans the whole ring buffer design is only used for relations larger than a quarter of shared buffers, which is a pretty crude heuristic. But it definitely is a problem: in production, in more analytically oriented workloads, you can end up with data that effectively never gets loaded into shared buffers, even though the whole workload would actually fit into shared buffers — for example when there's one large relation. One can work around that manually, using an extension to read in the data or something like that, but it's definitely a practical problem.

Is that ring buffer controlled by work_mem, on a per-worker basis?

No, those are just a set of compile-time constants. Vacuum uses one somewhere in the dozens of kilobytes or so; bulk writes like COPY use a comparatively large one, up to 16 megabytes I think.

The checkpointer also does a few other tasks, like fsyncing files as part of the checkpoint — all the data files need to be synced — and syncing some auxiliary data, and other related tasks, but that's not particularly interesting.
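Coming back to the replacement algorithm for a moment: here is a minimal sketch of clock-style replacement with usage counts, in the spirit of — but far simpler than — what was just described. The constants and structure are illustrative, not Postgres's actual code:

```c
/* Illustrative clock-sweep buffer replacement with usage counts.
 * NBUFFERS and MAX_USAGE are arbitrary for the sketch. */
#define NBUFFERS  1024
#define MAX_USAGE 5

typedef struct
{
    int usage_count;    /* real buffers also carry page, pins, dirty flag... */
} Buffer;

static Buffer buffers[NBUFFERS];
static int    clock_hand;

/* Every access bumps the usage count, capped so a once-hot buffer can
 * still eventually be evicted. */
static void
buffer_accessed(Buffer *b)
{
    if (b->usage_count < MAX_USAGE)
        b->usage_count++;
}

/* To replace, sweep the hand around: buffers with a nonzero count get
 * decremented and survive another lap; a zero-count buffer is the victim. */
static Buffer *
find_victim(void)
{
    for (;;)
    {
        Buffer *b = &buffers[clock_hand];

        clock_hand = (clock_hand + 1) % NBUFFERS;
        if (b->usage_count == 0)
            return b;
        b->usage_count--;
    }
}
```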
One thing we had to do to make buffered I/O not completely terrible was add control over dirty data in the kernel. For example, a checkpoint on a large database will write out a lot of dirty data, so the kernel page cache can fill up with it, and when the kernel then suddenly decides all that dirty data needs to be written out, latency can completely crater. Around 2014 there were a lot of cases where the kernel would start writing out data so aggressively that every other I/O was delayed by hundreds of seconds; we saw delays where not a single I/O happened for 300 seconds. Obviously there were some issues in Linux, but it's also just a huge problem if Postgres writes 100 gigabytes of dirty data into the kernel page cache on a large system: the kernel has no real throttle on how much data can be dirtied, so an eventual fsync — or the kernel deciding in the background to write it all out — can take a very long time if your storage is not very, very fast. These days we use sync_file_range, msync, and similar approaches to limit how much dirty data the kernel is allowed to accumulate, and that limit can be set per task — the checkpointer has its own limit, backends have their own, the background writer has its own, and so on. That has improved latency, especially worst-case latency, drastically.

Given how simplistic this approach is, why didn't Postgres basically fail? I think that's a good question, because we've always been told that using buffered I/O basically doesn't work at all — that it's terrible, terrible, terrible. There's definitely some truth to that, but it also turns out that for the use cases where Postgres is most used, it's not always terrible. For one, Linux does a decent job of read-ahead internally; its page cache is pretty decent compared to some other operating systems'. I mentioned earlier that our buffer replacement algorithm isn't great, and the kernel often saves the day there: when Postgres's buffer replacement fails, the kernel might still have the data. Obviously, running our own buffer pool on top of the kernel's has significant issues as well — the double buffering between the two can waste a lot of memory.

You said Linux's page cache seems to be about the best. Which one is the worst of all the different platforms?

Windows. Windows, by far. I don't know all the reasons; I think part of it is that there's a lot of metadata slowness that happens even when accessing the page cache, and it's not very aggressive about keeping things cached, especially in older versions. But to be honest, my Windows experience is limited and very dated, despite my employer, so...

I wouldn't have brought it up, but they are paying you.

I know — nobody wants to throw their employer under the bus. Keep going.

Another reason it was actually kind of okay is that most of the systems Postgres ran on — especially back in the day, but still now — have storage that isn't that fast and isn't that concurrent, so the simplicity of the I/O stack often didn't turn out to be a huge problem. And we do use prefetching, and we have parallelism, and that hides some of the cost of the simplistic approach.
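The sync_file_range technique mentioned above can be sketched as follows — a minimal, Linux-specific illustration, not the actual Postgres code. The 256 kB threshold is an arbitrary stand-in for the configurable per-task limits described:

```c
/* Bound the kernel's dirty-data backlog by kicking off writeback as we go,
 * instead of letting dirty pages pile up until one giant fsync. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define FLUSH_AFTER (256 * 1024)    /* illustrative threshold */

static void
write_with_writeback_control(int fd, const char *data, size_t len, off_t off)
{
    size_t done = 0;
    size_t pending_start = 0;

    while (done < len)
    {
        size_t n = (len - done < 8192) ? len - done : 8192;

        pwrite(fd, data + done, n, off + done);
        done += n;

        if (done - pending_start >= FLUSH_AFTER)
        {
            /* Start asynchronous writeback of the range written so far;
             * this initiates I/O but does not wait for it to complete. */
            sync_file_range(fd, off + pending_start, done - pending_start,
                            SYNC_FILE_RANGE_WRITE);
            pending_start = done;
        }
    }
}
```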
As a related question, you might ask why we do it this way at all. The main reason is that Postgres has a very, very small team compared to other database vendors. If you look at the big commercial databases, there are teams of hundreds to thousands of people working on the core engine. When I started, Postgres had maybe one person working on it full-time, or close to full-time, and a bunch of people doing it part-time. That has increased, obviously — we're definitely more than ten people working on it full-time now — but it's still nothing compared to a lot of the other database engines. Even some Postgres forks have more people working on them than the core Postgres team. Not all of them, but yeah.

Another reason is that it's actually not easy to change. There have been a lot of patches and proposals along the lines of "let's just use direct I/O," and they're not actually likely to be beneficial on their own. Part of that is that Postgres uses a process model, which has been known to be a weakness since basically before Postgres was PostgreSQL — it was a short-term hack to work around portability limitations in 1990, and unfortunately, as happens with such design decisions, it crept into more and more places over time and isn't easy to change. There are plenty of other reasons it's hard to change, too.

Another thing is that just using direct I/O is not a solution by itself. Remember what I said earlier: with direct I/O the kernel doesn't do any caching. If the application is simplistic about it, that basically means one I/O happens at a time: Postgres submits an I/O, the kernel asks the disk, the result comes back to Postgres, and so on — which means that if Postgres does five I/Os, it takes five times the full latency. That's not really acceptable. So direct I/O — where the kernel does no caching and no read-ahead — is only viable together with asynchronous I/O, because then Postgres doesn't have to submit the I/Os one by one and wait for each; it can submit a bunch of I/Os at once, the disk can process them at the same time, and we get the results in parallel. Obviously, that only helps if the storage can actually process multiple I/Os at the same time, but that's been the case for quite a while on enterprise storage, and for a decent amount of time even on normal consumer hardware.

Another big problem is that the asynchronous I/O APIs are very, very platform dependent; there's basically no API that works on more than one platform. There's POSIX AIO, which was supposed to be portable, but it's so incredibly bad that it's practically unused in my experience. The Linux implementation, for example, doesn't actually do proper asynchronous I/O — it's implemented with a thread pool and is full of data-loss-style bugs. It's basically something somebody hacked together to get some POSIX compatibility. And even a decent asynchronous I/O API isn't enough when used simplistically — a lot of work is needed to use it well.
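A small sketch of the serialization problem described above. With buffered I/O, one can at least hint the kernel about upcoming reads via posix_fadvise — the approach Postgres has traditionally used for prefetching — so that several I/Os can be in flight at once; with direct I/O there is no such kernel-side help, which is why asynchronous submission becomes a requirement. Illustrative code, not Postgres's:

```c
/* Reading n random blocks one at a time pays ~n full device round trips;
 * hinting them all first lets the kernel overlap the I/Os. */
#include <fcntl.h>
#include <unistd.h>

#define BLKSZ 8192

static void
read_blocks_serial(int fd, const off_t *offs, int n, char *buf)
{
    /* Each pread waits for the previous one: total time ~ n * latency. */
    for (int i = 0; i < n; i++)
        pread(fd, buf, BLKSZ, offs[i]);
}

static void
read_blocks_prefetched(int fd, const off_t *offs, int n, char *buf)
{
    /* Announce all upcoming reads so the kernel can start them ... */
    for (int i = 0; i < n; i++)
        posix_fadvise(fd, offs[i], BLKSZ, POSIX_FADV_WILLNEED);

    /* ... so much of the data is already in the page cache by here. */
    for (int i = 0; i < n; i++)
        pread(fd, buf, BLKSZ, offs[i]);
}
```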
Another reason is that, given the way Postgres is typically deployed, buffered I/O often turns out to be okay. One of the main reasons is that when running on shared hardware, or on a system that isn't maintained by a DBA or a team of developers with a lot of knowledge about what their data accesses look like, nobody will have tuned Postgres to have large enough shared buffers. The machine will often not be dedicated, so you can't even set shared buffers that large, because other tasks run on the same instance. In those cases, having the kernel decide which application gets its data cached at any moment is very beneficial, so buffered I/O is basically required for those use cases — not in general, but there. And traditionally, one could not do buffered I/O asynchronously; that's still the case on nearly all platforms. I'll get back to that in a second.

Another thing, which at least some people in the Postgres community have fallen into, is that when we make pragmatic decisions not to implement something, those decisions sometimes become hallowed truth over time — all that stuff the big vendors do is just wasted effort, or something. I think buffered I/O is one of the cases where we've elevated our pragmatism to truth.

Now, if it's actually working kind of okay, why would we want to change it? The main reason is that the hardware landscape has changed. There are other things too — Postgres has just gotten bigger and adoption has grown, so we have more resources — but the main thing is the hardware. I'll limit what I talk about here for time reasons, but one big trend is that we now have storage that behaves very differently from traditional storage. I'm talking about NVMe, which is a protocol specification for how storage is accessed, typically used for SSDs — initially only the very high-end server ones, but now if you buy a laptop, it will have an NVMe SSD. NVMe here is really just a placeholder for modern storage technology in general; there is storage that isn't accessed via NVMe, but it's by far the most widely used these days. NVMe typically accesses storage via PCIe, but there are also abstractions where it's used over a network, over Fibre Channel, and many other things.

One of the things that's very different from a lot of traditional storage protocols is that NVMe allows multiple queues. One can put multiple I/Os into those queues and submit them all at once with a single uncached write — a single hardware MMIO access. That's a very large difference from some of the traditional protocols, where every single I/O required multiple uncached memory accesses, each on the order of a few thousand cycles. That's really not cheap; if one needs to do that a couple of times for every I/O, it's a problem. It didn't used to matter much, because storage was so slow, but these days storage is fast enough that it's a real cost. With NVMe, the dispatch overhead is so small that the indirection through the kernel page cache is a significant issue: even with a single process doing I/O, it's often faster to do direct I/O than to use the kernel page cache — insane as that might sound — because the page cache adds more latency than the storage itself. For that one needs high-end storage, but the overhead is very significant regardless. Relatedly, many designs for doing I/O in databases use worker threads to do all the I/O, or similar schemes, and with modern storage that is now actually a problem.
The dispatch — having to switch to a different thread just to submit an I/O — will often, on its own, cost something like a fifth of the total latency a single I/O has on a high-end drive. So that's really expensive, and it has gotten a lot worse with the mitigations that had to be made in response to Spectre and the other related processor attacks. Storage improvements in dispatch cost and latency in general have been orders of magnitude, whereas memory latency has improved very little over the last 20 years — it's basically 1.5 to 2x faster now. Memory bandwidth is much better, but the latency hasn't improved a lot. (There should have been a separate slide here — at some point this was at the bottom with two paragraphs; I don't know what I mis-edited.) The latency for a single I/O on NVMe can go down to four microseconds — that's for high-end storage, like Intel Optane with a tuned kernel and so on — but that's not a lot of time, so adding other sources of latency is a significant problem.

Another thing is that, compared to older storage protocols, bandwidth has increased drastically. With SATA, or SAS of the same era, bandwidth per drive topped out at around 500 megabytes per second. These days, a lot of SSDs are limited solely by the PCIe bandwidth they can use — they use four PCIe lanes, so attached to a system with PCIe 3 that's about 3 gigabytes per second, and on PCIe 4 it's about 7 gigabytes per second. So when using the kernel page cache, the bandwidth spent copying data from the page cache into application memory becomes a significant problem. On server hardware it's a bigger problem than on laptop hardware, as it turns out: the memory bandwidth a single process can achieve is on the order of 8 to 9 gigabytes per second, so a single process can just about copy the data coming from the kernel and cannot do anything else. Also, at those speeds the page cache — in Linux, or actually in basically every operating system I've looked at — just doesn't scale: with more than a bit of memory, the amount of allocation and bookkeeping required is simply too high.

Modern storage can also do a lot of I/O concurrently, but that was hard to expose before NVMe. SSDs are internally extremely parallel — typically they have many flash chips that can be used independently and can all process I/O at the same time — and NVMe can expose that, which earlier protocols couldn't really. An NVMe drive can expose multiple queues to the operating system, which can then be filled in parallel without a lot of contention. These orders-of-magnitude differences in performance characteristics from previous technologies are very substantial.
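A back-of-the-envelope check of the copy-bandwidth point: if a PCIe 4 SSD delivers ~7 GB/s and a single core can copy ~8-9 GB/s, the page-cache copy alone nearly saturates a core. Here is a rough, illustrative single-core memcpy measurement (not a rigorous benchmark; results vary widely by machine):

```c
/* Crude single-core memory-copy bandwidth measurement. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    size_t sz = (size_t) 1 << 30;       /* 1 GiB */
    char *src = malloc(sz);
    char *dst = malloc(sz);

    if (!src || !dst)
        return 1;
    memset(src, 1, sz);                 /* fault all pages in first */
    memset(dst, 2, sz);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, sz);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("single-core copy: %.2f GB/s\n", sz / secs / 1e9);

    free(src);
    free(dst);
    return 0;
}
```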
The other trend — even though it's not exactly new hardware — is that cloud network storage behaves very differently from a lot of traditional storage. It has reasonably high latency compared to local storage: the better types, which are typically much more expensive, are somewhere on the order of 0.3 milliseconds of latency — there are some that are a bit better, and that 0.3 number is from sometime mid last year, I think. The cheapest storage typically has somewhere between 1 and 4 milliseconds of latency, and it differs between cloud vendors and between storage generations within a vendor. Now, 4 milliseconds is still better than a spinning disk — those were around 10 to 16 milliseconds on consumer hardware, down to 4 on enterprise hardware — so it's close enough. But the big difference is that cloud storage typically supports a lot of concurrency, and that's something old storage didn't have. On the better offerings — if you're big enough to be allowed to submit that much I/O — one can have dozens to hundreds of I/Os going at the same time, which definitely wasn't the case when everything was spinning disks with a few memory caches in front. And one needs a lot of I/O in flight at the same time to get decent bandwidth at all, because otherwise the latency just means one cannot utilize the storage. Currently, WAL write latency in Postgres on cloud storage is often a major issue, because the way we do I/O ends up incurring the storage latency multiple times, and even though we have group commit, we have only one flush of the WAL going on at a time. These changes are large enough that we have to change something in Postgres.

So I've been prototyping, and then working further on, asynchronous I/O in Postgres, starting sometime in 2019. Partially that was motivated by Linux getting a new I/O interface at the time, io_uring. For one, it has some nice efficiency properties in general, but the main reason it suddenly became more interesting to me in the context of Postgres is that io_uring is the first interface that supports doing buffered I/O asynchronously. As I mentioned earlier, for a lot of Postgres use cases there will not be somebody who can tune Postgres well enough to have properly sized shared buffers, and the memory requirements involved just aren't suited to automatic tuning either. Since I started, two colleagues have joined the work: Thomas Munro helped me a lot with portability, and Melanie did a bunch of general work and then helped a lot with improving the prefetching — but more on that later.

There are a few design constraints around using asynchronous I/O in Postgres. One is the process model: a lot of the asynchronous I/O APIs aren't easily usable from multiple processes without incurring problems. We could obviously work towards using threads in Postgres — I personally think, and I'm not alone in this, that it's something we should do at some point — but it's a huge project of its own. Another big and somewhat related issue is that in a naive design it's quite possible to get deadlocks: one process starts an asynchronous I/O and then blocks on a lock, while the holder of that lock is itself waiting for the I/O started by the first process to complete, and at that point nothing can continue. So it needs to be possible for a process other than the issuer of an I/O to complete it; then the deadlock can be avoided by just saying, okay, I'll process the completion of that I/O myself. We also want a design that doesn't require inter-process context switches for I/O. That's not achievable on all platforms, because some just don't have the necessary APIs, but the design should be such that we don't incur them where avoidable. Inter-process context switches are even more expensive than inter-thread context switches, because the memory mapping changes between processes, so TLB contents need to be flushed — or at least there's contention over them.
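For readers unfamiliar with io_uring: the essential property for this work is that many I/Os can be queued and submitted with a single system call (or, with kernel-side polling, none), and completions reaped independently. A minimal liburing read example — illustrative only, with error handling omitted and "datafile" as a placeholder:

```c
/* Submit several reads at once via io_uring and reap their completions. */
#include <fcntl.h>
#include <liburing.h>
#include <stdlib.h>
#include <unistd.h>

#define NREADS 8
#define BLKSZ  8192

int main(void)
{
    struct io_uring ring;
    io_uring_queue_init(NREADS, &ring, 0);

    int   fd   = open("datafile", O_RDONLY);
    char *bufs = malloc(NREADS * BLKSZ);

    /* Queue all the reads ... */
    for (int i = 0; i < NREADS; i++)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

        io_uring_prep_read(sqe, fd, bufs + i * BLKSZ, BLKSZ,
                           (off_t) i * BLKSZ);
    }

    /* ... submit them with one system call ... */
    io_uring_submit(&ring);

    /* ... and wait for each completion. */
    for (int i = 0; i < NREADS; i++)
    {
        struct io_uring_cqe *cqe;

        io_uring_wait_cqe(&ring, &cqe);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    close(fd);
    free(bufs);
    return 0;
}
```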
Maybe we're jumping ahead — you're getting a worker model in there; did you consider threading it as well?

We didn't do any threading, because that's just such a huge project on its own.

Another thing is that we need to deal with the portability issues of asynchronous I/O, because to this day all the usable APIs are basically platform dependent, and if there are going to be dozens of users of asynchronous I/O inside Postgres, we can't have operating-system-specific things creep into all those places. We need an abstraction for that. And we don't want all the complexity of asynchronous I/O creeping into all the different places in Postgres that do I/O: we don't want detailed knowledge of exactly how to schedule I/O living in vacuum, in sequential scans, in bitmap index scans, and all that, because then it just becomes unmaintainable for a team of our size.

Out of that flows a minimal design, which I can't go into in detail in this time, but the basics are: I/O completions can be processed by any process, which avoids the deadlock issue I mentioned earlier, and the I/O module itself doesn't know anything about shared buffers, because there are different kinds of I/O — we might need to do I/O for the WAL, which doesn't go through shared buffers; there's data that isn't in shared buffers for which we still might want asynchronous I/O; and so on. The layer can also do some optimization of I/Os before they're submitted to the operating system or hardware: for example, if multiple I/Os for different shared buffers are submitted, we combine them into one I/O using scatter-gather I/O, plus other similar improvements, which obviously depend on the specific hardware capabilities.

We've also had to make a lot of improvements to general Postgres code. Some have already been committed; some are still queued up as a large part of the AIO patch set. The basic way most code interacts with asynchronous I/O is a streaming read interface: the user of the I/O provides a callback that says "here is the next I/O that needs to be done" — for example, read this block at this offset — and the I/O infrastructure calls it whenever it decides more I/Os should be submitted, depending on the speed of the hardware, the size of shared buffers, whatever. That turns out to work pretty well for converting most reads: a sequential scan can just stream in the buffers that aren't yet in shared buffers, an index scan can read exactly the buffers it needs even though they're randomly distributed across the file, and so on.
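The shape of such a streaming interface can be sketched like this. The names here (read_stream_begin, read_stream_next_buffer, and so on) are hypothetical stand-ins for illustration, not the actual Postgres API:

```c
/* Hypothetical sketch of a callback-driven read stream. The consumer only
 * says which block it needs next; the stream decides how far ahead to
 * issue (and combine) I/O. */
typedef unsigned BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Callback: return the next block this scan will need,
 * or InvalidBlockNumber when the scan is finished. */
typedef BlockNumber (*next_block_cb) (void *scan_state);

/* Example consumer: a sequential scan just counts upward. */
typedef struct
{
    BlockNumber next;
    BlockNumber nblocks;
} SeqScanState;

static BlockNumber
seqscan_next_block(void *arg)
{
    SeqScanState *s = arg;

    return s->next < s->nblocks ? s->next++ : InvalidBlockNumber;
}

/*
 * Usage sketch: the stream machinery looks ahead via the callback,
 * submits I/O asynchronously, and hands buffers back in order:
 *
 *     stream = read_stream_begin(relation, seqscan_next_block, &state);
 *     while ((buf = read_stream_next_buffer(stream)) != NULL)
 *         process_page(buf);
 *     read_stream_end(stream);
 */
```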
Right now we have four different backends for asynchronous I/O in the Postgres patch set. One is io_uring, which is what I started with — it's what motivated me to look again at adding asynchronous I/O to Postgres. One of the nice things is that submitting I/O, and even reaping completions, doesn't require any inter-process context switches, and in some configurations one can even optimize the system calls away, because the kernel can poll the submission queue and start the I/Os without a system call being needed. That's probably not going to be a major use case for Postgres, because the CPU requirements are fairly high — typically it means a kernel thread running at 100% CPU usage — and that's probably too costly for most users of Postgres, but who knows. The other nice thing about io_uring is that it supports asynchronous buffered I/O; for buffered reads, these days, it doesn't even need a kernel thread. For writes it currently still uses a kernel thread to do the I/O asynchronously, but there's ongoing work to stop needing a dedicated thread there too.

One very important way of doing asynchronous I/O is to not actually do asynchronous I/O: using worker processes that perform the I/O. The reason those are so crucial is that without them we couldn't require the use of asynchronous I/O throughout the Postgres code, because not all platforms have decent asynchronous I/O support, and even where they do, we might not have the code for that platform — and traditionally, portability has been valued very highly in Postgres. There's also a POSIX AIO backend, which is mostly there because of macOS and FreeBSD; the API has real problems, so there's a lot of ugly code associated with it. And we have a work-in-progress IOCP backend for Windows. There's more work to be done there, mostly because none of us are expert Windows developers, so a lot of it has been developed remotely via CI, and that adds development-cycle latency.

We've converted quite a few subsystems to use asynchronous I/O at this point. The most important one was actually committed to Postgres itself recently — as committed it uses posix_fadvise and so on, and with the AIO patch set it will actually use asynchronous I/O: read-ahead during WAL replay. The reason that one matters so much is that before it, WAL generation was so fast — thanks to AIO — that recovery, which reads the WAL and then reads all the buffers the WAL references, was extremely slow when using direct I/O. I've had a lot of cases where within 5 minutes I generated so much WAL that replaying it with direct I/O literally took hours, and that makes the development cycle very painful, because, as one might imagine, crashes are pretty common when developing completely new code.

The checkpointer was a very easy conversion; the background writer, similarly. One of the harder conversions was that WAL writes are now done using asynchronous I/O, and there can be multiple I/Os in flight for different parts of the WAL. With decent performance, we can now also use O_DSYNC, which is basically a mode where, once the kernel has submitted the I/O and it completes, the data is already durable — there's no fsync or fdatasync at the end. With modern storage that can be more efficient, because writes with O_DSYNC can be implemented by setting a bit on each I/O saying "write through the cache," whereas something like fdatasync or fsync is implemented by telling the drive to write out all the dirty data in its cache, which obviously can be a lot more expensive.
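A minimal sketch of those two durability styles for log writes — illustrative only, with error handling omitted:

```c
/* Two ways to make a log write durable: an explicit fdatasync after the
 * write, versus opening with O_DSYNC so each write is durable on return. */
#include <fcntl.h>
#include <unistd.h>

static void
log_write_fdatasync(int fd, const char *rec, size_t len, off_t off)
{
    pwrite(fd, rec, len, off);
    fdatasync(fd);              /* may flush the drive's whole volatile cache */
}

static void
log_write_odsync(const char *path, const char *rec, size_t len, off_t off)
{
    int fd = open(path, O_WRONLY | O_DSYNC);

    pwrite(fd, rec, len, off);  /* durable when this returns; no extra flush */
    close(fd);
}
```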
We also have vacuum, sequential scans, bitmap index scans, and other similar things converted.

So what have the results been so far? My first conversion was checkpointing, and it was also, I think, the most fun, because before, we could do maybe 1.5 gigabytes a second of checkpoint writes or so; now I was able to reach the hardware limits of the four drives in my workstation, and because I only have PCIe 3, that's somewhere on the order of 12 gigabytes a second. It's very unlikely we'll ever need higher bandwidth for checkpointing — it's hard to generate that much dirty data.

We can do sequential scans a lot faster now, because we don't need the memory copy out of the kernel page cache; instead the drive can DMA directly into shared buffers, and that copy used to cost a lot of CPU cycles. I'm seeing up to 2.5x faster sequential scans in throughput per core — for cache misses, obviously — and one important part is that it also scales much better to larger degrees of parallelism for sequential scans. Of course, that only matters if the scan's qualifiers are cheap enough to evaluate; if one has a complicated JSON predicate or something, the evaluation overhead will be so high that the overhead of the I/O itself doesn't matter. We also have concurrent COPY, so loading data in parallel is much faster.

And one thing I didn't fully expect beforehand: we've also gotten a lot better write-heavy OLTP performance, mostly because we can write multiple parts of the WAL concurrently — we can have multiple flushes of the WAL happening at the same time. Unfortunately, that requires that the pages to be flushed are not the same, because we can only write out whole WAL pages at a time. That has the odd performance characteristic that initially throughput increases about as slowly as it used to, because the individual pages being written out are only partially filled; only once the concurrent write volume is high enough are there independent pages that can be flushed independently. I've implemented padding out such partially filled WAL pages — which is what other databases do — but the WAL volume increase from that is, at the moment, just tremendous: in some bad cases I've seen it use 3 to 4x the WAL volume, which is a problem. We can probably improve some of that just with better defaults — by default each WAL page is 8 kilobytes, and we could easily reduce that to 4 kilobytes — but we might need some trickery to improve it further: for example, not padding with zeros, but filling the space with data from background tasks, so that the quote-unquote wasted space is used for something useful instead.
Can you explain that a bit more? You've got some pages you're writing to the write-ahead log, there's extra space, and you're going to piggyback random stuff on it?

Yeah. For example, we could have vacuum be more eager when there's padding space available: not immediately write that data into the WAL, but fill it in only when padding space is there. Postgres also uses full-page writes to avoid issues around torn pages, and we could do some eager writing out of pages in those situations.

So the write-ahead log would keep stuff that isn't necessarily a write-ahead log record?

They'd still be write-ahead log records; we'd just generate them basically whenever we have to flush partially filled pages. I wish I had a better approach, but so far that's what I have.

What about really fast, efficient compression? I guess that makes it worse.

Yeah — I've actually had weird cases where compressing the WAL made throughput worse, because it leads to exactly that issue.

A question from the side: I was thinking about this — couldn't the writer wait some determined amount of time before padding out the entire page, to let it fill up?

Yeah, I think we probably need heuristics like that: a certain delay, or only padding WAL pages if the concurrency is high enough, for example. I've not implemented that yet, but I think that's probably the minimum we'll have to do.

So what's not working with AIO in Postgres right now — why haven't we merged this? There are a lot of odd performance cases — not necessarily in the most common use cases, but common enough to be problematic. For example, it's actually hard to extend files in a way that doesn't cause problems. Postgres uses a design where each table has its own file — its own set of files, rather — instead of one huge file that's internally subdivided between relations. That has definite simplicity advantages, and some advantages around how quickly space can be released back to the operating system or the file system, but it also means that when we extend those files, we have to decide by how much. That's not a problem with buffered I/O, because most file systems these days have delayed allocation, where the actual on-disk allocation is only determined when the data is written out; before that it's just buffered in memory. But with direct I/O we have to allocate the space when we extend the file, and it's not always easy to know how large a relation will get. If there's a new table and we just start bulk-loading data into it, we don't know yet whether it will stay at a few kilobytes — in which case bulk-extending it by 128 megabytes would be a problem — or become a huge file, in which case extending by 128 megabytes each time is no problem at all. And it turns out that fragmentation, even in 128-kilobyte chunks, is still measurable performance-wise, mostly because it increases the size of the file system's metadata, which in turn increases access overheads and hurts cache hit ratios.
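The allocation side of that can be sketched with posix_fallocate: under direct I/O, extending in preallocated chunks takes the place of the delayed allocation the file system would otherwise provide. The chunk size here is illustrative, not a Postgres setting:

```c
/* Extend a relation file by a fixed chunk, allocating the blocks now
 * rather than relying on the file system's delayed allocation. */
#include <fcntl.h>
#include <unistd.h>

#define EXTEND_CHUNK (128 * 1024)   /* illustrative chunk size */

static int
extend_relation_file(int fd, off_t current_size)
{
    /* Returns 0 on success, an errno value on failure. */
    return posix_fallocate(fd, current_size, EXTEND_CHUNK);
}
```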
There are a lot of paths that just aren't optimized for direct I/O yet. That's just work — there's nothing super hard about converting more of those places, but it's definitely work. There are a lot of implementation weaknesses, particularly around backends doing asynchronous write-back and deciding when and how aggressively to do it, but that's also something that can be addressed reasonably easily, I think.

There are some weird slowdowns, and that's one of those cases that just needs a lot of operating-system-specific — and potentially operating-system-version-specific — tuning. For example, buffered reads through io_uring are sometimes slower than plain read(), because io_uring ends up doing the buffered reads in the submitting process's context, whereas kernel-internal read-ahead splits the work across a bunch of kernel daemons. If there's enough I/O that one process would be at 100% CPU usage — and that's not actually that hard to reach — the kernel threads win. So initially io_uring is faster, but once there's a lot of throughput, suddenly io_uring has a slight performance regression. It's all solvable, but it does show how much weird operating-system-version-specific tuning there is. Also, the weaknesses of our buffer replacement algorithm, including the ring buffers, really show when using direct I/O.

Another related issue is that there are a lot of cases where other weaknesses in Postgres prevent us from fully benefiting from asynchronous I/O. For example, when vacuuming a relation, I/O for the data files or the WAL is no longer the bottleneck — but we look up transaction commit status fairly regularly while vacuuming, and the buffering for that is so simplistic that with enough data we end up bottlenecked on copying the relevant data from the kernel over and over, spending 90% of the time just caching and discarding the cache for commit statuses. There are lots of other similar things.

So what are the next steps? There's a lot of polishing needed — as is typical, the prototype started out pretty ugly, and it needs to get cleaner and better documented and so on; that's one of the big differences between working on academic prototypes and on things we want to ship to users. But we can also just start merging some prerequisites; there are enough of those to do. Bigger questions are algorithmic. We currently have a very simplistic heuristic for how aggressively to prefetch, but because hardware characteristics differ so much, we really need something adaptive, and we're working with a simulator to evaluate prefetching algorithms in much less time than real hardware would take. I really hope we can get something mostly adaptive that doesn't need lots of tuning. There are also planner and executor improvements we need in order to benefit fully from AIO — right now, for normal index scans for example, we don't know which other pages we could read ahead, because of the way the executor and planner are structured — and fixing that might require some planner improvements too. And we do need to switch to a better buffer replacement algorithm, fairly obviously.
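As one illustration of what "adaptive" could mean here, the sketch below ramps a prefetch lookahead window up when the consumer stalls waiting for I/O and decays it when completions arrive early. This is purely illustrative — not the algorithm Postgres uses or plans to use:

```c
/* Toy adaptive prefetch-distance controller: grow aggressively on stalls,
 * shrink gently when I/O completes ahead of need. */
typedef struct
{
    int distance;       /* number of I/Os to keep in flight */
    int max_distance;   /* upper bound, e.g. from a config setting */
} PrefetchState;

static void
prefetch_feedback(PrefetchState *p, int had_to_wait)
{
    if (had_to_wait)
    {
        /* We blocked on an I/O: look further ahead, up to the cap. */
        p->distance = (p->distance * 2 <= p->max_distance)
            ? p->distance * 2
            : p->max_distance;
    }
    else if (p->distance > 1)
    {
        /* The buffer was ready before we needed it: back off slowly. */
        p->distance--;
    }
}
```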
Right now we only do exact prefetching — prefetching pages we know will be needed soon. It might be worth doing some heuristic prefetching too. I've done some experiments just prefetching neighboring pages when, for example, index scans fetch the heap — the table — data, and that seems to work reasonably well, but that was just experimental and I'm sure there are lots of downsides. But yeah, those are the biggest issues, and with that, I think I'm done.

We have about 500 questions here, so let's open it up to the audience — fire away, unmute yourself if you want to go. Otherwise I'll ask questions; there's a lot here. So the basic proposal is that there will now be a separate worker that handles I/O — an I/O scheduler, right?

That would only be used as a fallback. Normally we use io_uring or IOCP or whatever, and the processes do the I/O themselves.

Got it. So the fallback would be, if you're running on something that doesn't have those, it basically acts as a broker?

Yeah. Or if the kernel is too old to have io_uring, or something like that.

Okay. And then — the way my students and I were talking about the technical debt of Postgres: the OS page cache, the process model, and the append-only MVCC. So the only thing you really have to change with this new approach is getting rid of the OS page cache, in the cases where you can; all the other parts of Postgres stay the same?

Yeah. There's some other architectural work that was needed, but it doesn't address any of those big-ticket items you mentioned. I think that was one of the design goals — to be able to work on, quote-unquote, just one of them, because it's already multiple years of work, and if you want to fix all of them at once, you'll never get anywhere.

Okay. And when you talk about the buffer replacement algorithm, would that be keeping the ring buffer, or using something like ARC — I don't know if the IBM patent is gone yet — something that can handle sequential-scan flooding within a single unified space of memory?

I don't think we'd necessarily need the ring buffers for read I/O. We probably would still want them for write I/O, to limit the amount of dirtying a single process is allowed to do, just for latency and production-impact reasons. I do think we'd want to switch to another algorithm. Not ARC, I think — there are some issues in ARC with allowing sufficient concurrency, if I remember correctly. It's surprisingly hard; a lot of the theoretical literature on caching and buffer replacement algorithms skips over the real-world concurrency issues associated with them. There's a solution to that, but there are also some issues in CAR, though it's been a while since I looked at it — CAR being, if you will, the clock variant of the original ARC, and probably closer to what we want than plain ARC. And there's some paper whose name I can't remember.

Are you talking about the one they use in ODB?

I don't know right now; we'll have to look at that. I'm not good at remembering acronyms, unfortunately.

Yeah, I think it begins with an I.

I actually had kind of a question, but it's going to be vague and high-level, so bear with me. You said something about heuristic prefetching, meaning prefetching based on speculative criteria — I think that's what you meant. I wonder how one thinks about that; there must be some general sense of a break-even point. I've thought about doing more of that in my own work on indexing.
You sort of have more bandwidth than you know what to do with, potentially. Is there some kind of high-level cost model that can inform intuitions, or is it just try and see?

For now, I think it's fine to not have any heuristic prefetching. From what I've seen, I think we could get better results with it, but I would not touch it for the merged version. I have no good intuition about what kind of heuristics to use and how to limit the worst cases and so on. I'm sure there's something — you can probably get some information from what the current latency is and how many I/Os we're allowed to have in flight, and do more aggressive prefetching based on the number of I/Os in flight — but I don't know.

He has a question. Can I ask my question now? Yes. So, have you looked at using Intel Optane in byte-addressable mode? Basically the trade-off there is that you get more of it, but it's slightly slower.

I've looked at it. I don't think it's hugely interesting for now, just because it's such a different model from how most storage is accessed, the gains aren't all that huge, and it's not widely available — so designing for it is probably not quite the right trade-off. It would be fairly easy to have a different backend, an AIO backend that redirects everything and benefits from persistent memory by avoiding all the normal I/O overhead, but I'm not entirely sure it's worth it. There are also constraints: we can't just use persistent memory to back shared buffers directly, because of ordering requirements and so on. In the end, it's just too niche to care much about at this point, in my view.

My last question would be this — another way to think about it. Now you have this check when the system boots up: do I support AIO, do I support io_uring; if not, there's a fallback mechanism. Because you have this API that, in the fallback case, lets a regular PostgreSQL worker communicate with the I/O worker, the I/O scheduler — do you think this makes it easier for everyone building their own database on Postgres? They could just replace that fallback process with whatever they want — Optane, S3, whatever device — and still get all the benefits of Postgres above, which stays the same. Replace that one scheduler worker, and they can support new I/O without having to rewrite everything or even use a foreign data wrapper. Is that a fair way to think about this?

I suspect there are quite a few limitations on making huge paradigm changes in just that layer, because at the I/O layer you might not have enough information available to decide things. In a lot of cases I suspect it would be more useful to change things at a higher level — create a new table access method, or swap things out in the storage manager API. That API is currently not very often replaced, but you can change how things are mapped to files there, and that might actually be the better place to change things for this purpose.

I didn't realize there was a storage manager API — in that case, it's nicely compartmentalized.

There are some limitations around that too, but it might be a good fit for this.

I'll just ask one quick question about the statement that you could generate 12 gigabytes per second of checkpoint traffic. I was curious how you were
able to do that. Did you have a tool of your own, or was it a real load from some database?

I think I was bulk-loading data, and I had disabled the logic so that the ring buffers — which limit the amount of dirty data — didn't take effect when bulk-loading; then I just shut down the database and saw what rate the checkpoint reached.

I see — so you didn't have WAL logging enabled?

The checkpointing itself doesn't do a lot of WAL — it writes a tiny bit, but it's not typically limited by WAL performance. The data loading that generates the dirty data takes a lot longer; that part definitely doesn't happen at 12 gigabytes a second. So the point is mainly that, when using AIO in a good way as part of checkpointing, the checkpoint itself is not a bottleneck anymore, because you can't generate dirty data quickly enough.

Right. Okay, thank you — you made lots of other improvements that actually improved the other side too. That's great. Thank you.