Please welcome our next speaker, Tomas Vondra.

Right, so hi, my name is Tomas, and I'm here to talk about fsync, how it's actually used in Postgres, and how we failed to use it correctly for something like 20 years. How many of you are actually using Postgres? Right, okay. And how many kernel developers are here? Okay, so don't worry, I will be careful. This is really a very different kind of talk.

Well, very little about me. I'm a long-term Postgres contributor, nowadays also a committer, and I've been working with Postgres for about 20 years, and I didn't know it was broken in this sense. And this is a really different kind of talk compared to the other talks, because usually talks are success stories, right? People want to share how they implemented something new, how it works, how amazing it is. This is a slightly different kind of talk. Usually talks have happy endings: they present "this is how it works", and that's it. I don't really have a happy ending here, although there is light at the end of the tunnel, hopefully. So this is the usual kind of talk; this is the talk I'm going to present here.

By the way, I know my accent is kind of funny, so if you don't understand something, let me know, and I will repeat it with exactly the same accent, so you can misunderstand it for the second time. I usually give talks that are practical, showing you how to do stuff; this is not that kind of talk. I usually talk about Postgres, about implementing features in Postgres or using features in Postgres; this talk includes a little bit about kernel behavior and so on. And finally, it's really about mistakes, or incorrect assumptions, that were made a long time ago. If you have any questions, by the way, please shout, don't wait until the end of the talk. I find it difficult to answer questions about a topic that was presented half an hour earlier, so ask right away.

Right. So if you are using Postgres, you are probably familiar with the concept of a checkpoint, so here is a very quick introduction into how Postgres handles durability. If you look at the system, you might see something like this. At the bottom here there is the storage device. On the storage device we have two kinds of files written by Postgres: data files and the transaction log (the WAL). The transaction log is accessed, and mostly just written, using direct I/O, so it's not accessed through the page cache managed by the operating system. The page cache is a general-purpose cache for the whole system, but Postgres also has a smaller amount of memory, the shared buffers, which is used for database-specific caching managed by the database. The page cache is managed by the kernel, the shared buffers by the database, and the shared buffers are usually much smaller. Postgres relies on the page cache and buffered I/O quite a bit. So when you do some modification to the database, the database will first write the descriptions of the changes into the transaction log, through a very small buffer but using direct I/O rather than buffered I/O, and then it makes the changes in the shared buffers. When you do a commit, we only flush the transaction log. So that's it.
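To make the commit path concrete, here is a minimal sketch of the idea just described; it is not the actual Postgres code, and names like wal_fd or commit_transaction are made up for illustration. The point is only that a commit forces the WAL to durable storage while the modified data pages stay dirty in shared buffers.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void commit_transaction(int wal_fd, const char *wal_record, size_t record_len)
{
    /* 1. Append the WAL record describing the change. */
    if (write(wal_fd, wal_record, record_len) != (ssize_t) record_len) {
        perror("write(WAL)");
        exit(EXIT_FAILURE);
    }

    /* 2. Force the WAL to durable storage.  How exactly depends on
       wal_sync_method (fdatasync, O_DSYNC, O_DIRECT, ...). */
    if (fdatasync(wal_fd) != 0) {
        perror("fdatasync(WAL)");
        exit(EXIT_FAILURE);
    }

    /* 3. The modified data pages are NOT flushed here: they stay dirty in
       shared buffers until a checkpoint writes them out. */
}

int main(void)
{
    int wal_fd = open("pg_wal_segment", O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (wal_fd < 0) { perror("open"); return 1; }
    commit_transaction(wal_fd, "INSERT ...", 10);
    close(wal_fd);
    return 0;
}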
So at the end of the transaction you will be in a state like this: you have some dirty data in the shared buffers, the modified contents of the data files, but it's not on durable storage. When the database crashes at this point, Postgres will essentially read the changes from the transaction log and apply them again. That's the whole idea of the transaction log, how it's used in Postgres and in other databases.

But if we only did this, the transaction log would pretty much just grow over time into terabytes, petabytes, whatever. And we would also always have to apply all the changes from the beginning: after a year, if the database crashes, you would have to replay a whole year's worth of changes, which is not very practical. So what the database does is something called a checkpoint. It looks at the current position in the WAL, which is essentially a sequence of changes. It takes that position, it flushes all the changes from the shared buffers, so it writes them into the page cache, and then it calls fsync on all the modified files. And ultimately, when this succeeds, it deletes the unnecessary part of the transaction log and remembers that if it crashes, it only needs to do recovery from that position. This happens regularly, every half an hour, 15 minutes, something like that in most cases.

And that works as long as nothing fails, which is the rainbows-and-unicorns land. We know that in production systems that's not the case. So what if there is an error? What if something fails during the checkpoint? Well, it's critical, really important, to actually detect the error, to learn that something failed. Because at that point we can crash the database and force a recovery. It's annoying, because the user connections will fail, they'll disconnect, the database will do something for five minutes and then restart, but you will not lose any data, you will not lose any committed changes. What the database must not do is mark the checkpoint as successful and delete the old transaction log, because at that point you have lost data.

Now, when the error happens during the write phase, here, where we are writing the data into the page cache, that's kind of okay, because we can detect the error, we still have the changes in the shared buffers managed by the database, and we can retry. So that's okay, and we don't really see this very often in production anyway, because those writes are just a copy from one part of memory to another part of memory; they don't involve any I/O to the storage system. So that's mostly a theoretical problem. What's worse is the fsync: when we call fsync, the data is supposed to be written onto durable storage, either local disks or storage connected over the network or something like that. And that is completely managed by the kernel; we only initiate the fsync. The database has no say in how that works, how the recovery happens, how the errors are reported, and so on. And that's where the problems lie, because from the database side we can't easily retry the fsync: the page cache is managed by the kernel, so we don't know whether the data will still be there or what happens.
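As a rough sketch of that checkpoint logic, and of why the error check on fsync matters so much, it might look something like the following; the helper names (dirty_buffer, recycle_old_wal) are hypothetical, and this is not the actual Postgres implementation.

#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

struct dirty_buffer {
    int         fd;       /* data file the page belongs to */
    off_t       offset;   /* position of the page in that file */
    const char *page;     /* page image held in shared buffers */
    size_t      len;
};

/* Stub: in the real thing this would truncate/recycle old WAL segments. */
static void recycle_old_wal(void) { }

static bool perform_checkpoint(struct dirty_buffer *buffers, int nbuffers,
                               const int *modified_fds, int nfds)
{
    /* 1. Write every dirty buffer out.  This only copies the data into the
       kernel page cache, so failures here are rare and can be retried:
       the changes are still in shared buffers. */
    for (int i = 0; i < nbuffers; i++) {
        if (pwrite(buffers[i].fd, buffers[i].page, buffers[i].len,
                   buffers[i].offset) != (ssize_t) buffers[i].len)
            return false;               /* keep the old WAL, retry later */
    }

    /* 2. fsync every modified file.  If any of these fails we must NOT
       declare the checkpoint complete, otherwise the old WAL is recycled
       and the lost changes can never be replayed. */
    for (int i = 0; i < nfds; i++) {
        if (fsync(modified_fds[i]) != 0)
            return false;               /* safest reaction: crash + WAL recovery */
    }

    /* 3. Only now is it safe to drop the WAL preceding the checkpoint. */
    recycle_old_wal();
    return true;
}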
So what were the expectations when this was designed and implemented, 20, 25 years ago, in Postgres?

The first expectation was: if there is an error during the fsync, the next fsync will retry, it will try to flush the data again. So if you have a modified four-kilobyte page in the page cache and the fsync for some reason can't write it to the storage, maybe because it's network-attached storage and there is a network hiccup or something, the next fsync will retry. Well, the reality is that the first fsync will fail with an error, but the data is discarded from the page cache. It just throws it away, and the next fsync will obviously succeed, because there is no data left to be written. So that's annoying, and it means we can't actually retry fsync. And this is not a problem only for Postgres; this is a problem for all applications that use fsync, especially in non-trivial cases, because it gets worse.

Furthermore, we kind of expected this to be universal behavior, but it's not. The file systems behave in slightly different ways. ext4 will leave the data in the page cache, but it will simply mark the page as clean. So you still have the modified data there, but unless you modify it again, it will not be written again, which is annoying, because it makes the failures unpredictable: in some cases you will lose the data, in some cases you will not, and so on. XFS and btrfs will simply throw away the data and mark the page as not up-to-date, like a stale page, which is not really POSIX compliant, but then again it's maybe better behavior than what ext4 does, because at least we know the page is obsolete. Neither of those file systems actually retries the write, so you lose the changes; in some cases you will learn about it, in some cases you will not. So that was the first expectation.

The other expectation is about behavior in multi-process systems. If you have an application using multiple file descriptors and multiple processes, and those processes may access the same data file, what happens if you invoke the fsync from one process, but you also have a file descriptor to the same file in another process? Which can easily happen if you have a running database and someone connects over the console and runs sync, which is a way to invoke fsync from the console. So will the other processes learn about the error? Well, it depends. The truth is, only the first process that actually initiates the fsync will probably learn about the error, but there are cases where even that may not be true. It also depends on the kernel version. Essentially, all kernel versions up to 4.13 are kind of doomed: we can't really detect the fsync failures reliably at all, because it depends. The representation of the file in the kernel is kind of transient: if you open a file, you get a file descriptor, but the kernel has no idea which process, which file descriptor, has written which page, so it can't easily map the failure to the process or the file descriptor that actually issued the write. And there are similar issues: the kernel can evict the inode from memory, which means it will also forget about the error, and things like that. There have been improvements over the years in the error reporting.
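To illustrate expectation number one and why it is unsafe, consider a toy example like the one below. It is not database code, the file name is made up, and the exact behavior depends on the file system and kernel version, but the pattern of a "successful" second fsync after a failed first one is exactly the trap.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* "datafile" is a made-up path; imagine it lives on flaky storage. */
    int fd = open("datafile", O_WRONLY | O_CREAT, 0600);
    char page[8192] = "modified page contents";

    if (fd < 0 || pwrite(fd, page, sizeof(page), 0) != (ssize_t) sizeof(page))
        return 1;

    if (fsync(fd) != 0) {
        perror("first fsync");

        /* Expectation #1: just call fsync again and it will flush the page.
           Reality: the failed writeback marked the page clean (ext4) or threw
           it away (xfs, btrfs), so this second call usually returns 0 even
           though the data never reached the disk. */
        if (fsync(fd) == 0)
            printf("retry \"succeeded\" -- the data may still be lost\n");

        /* The only safe reaction is to treat the first failure as fatal and
           recover the changes from the WAL. */
    }
    return 0;
}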
But unfortunately, it got fixed, then it got broken again, then fixed and broken again, and so on. So what we do have, at this point, is a reliable way to get the error, as long as we keep the oldest file descriptor around, which Postgres didn't do so far. We have a small cache of file descriptors, so when one process closes a file descriptor and another process needs to open the file, it will get the file descriptor from the cache, and we don't do system calls over and over. And we can modify that to keep the oldest file descriptor around, so on the newest kernels we will always learn reliably about the failures. So that's expectation number two.

Wonderful, right? So far I've been talking mostly about the kernel, and by kernel I meant Linux. But it's not really limited to Linux; it's kind of a universal problem. What happens with the memory when you fail to write it to storage? You have two choices: you can either keep it around so you can retry later, or you can discard it. And most systems, including the BSD systems, actually just discard the memory, because otherwise it would be a potential memory leak. One of the examples used to justify this behavior is pulling a USB stick out of the machine while you still have dirty data to be written to it. It's never going to be successfully written after that, so it doesn't make sense to keep it in memory. So it kind of makes sense from the kernel development point of view; for us it's unfortunate, because it means we have to do much more work to reliably detect and handle errors.

It's also not really against POSIX. I've been trying to read the POSIX specification; I don't know how many of you have tried to do something like that, but don't, that's my advice. It's not simple to decide what POSIX actually requires. In many cases it's ambiguous: it states very general requirements, but it doesn't say how exactly they should be implemented. So that's one reason. The other reason is that it doesn't really matter what the POSIX specification says when the systems already behave in a certain way, because you can't implement stuff for an ideal POSIX-compliant system when you don't have such systems in production. That ship has kind of already sailed. The systems behave the way they behave; we have to handle that, and we want to handle it reliably.

And now the question is, why did it take 20, 25 years to actually notice this? Well, in the past, the storage was usually designed for database servers, so it was designed for reliability. You had locally connected drives behind a RAID controller with a write cache and a small battery. It was really reliable and redundant, and when it failed, it failed spectacularly: it failed and it never came back again, so it was obvious that it was broken. And those errors were more permanent; it's not the kind of thing that fails on one attempt and then succeeds. So why is it a problem now? And by now I mean a couple of years back. Well, we have many more systems with network-attached storage. It might be different types of SAN, it might be EBS on Amazon, it might be NFS.
We also have things like thin provisioning, which also leads to temporary errors. You have thin provisioning with a quota, you run out of disk space on the storage device, after a few seconds the system notices the lack of space, you delete something, and suddenly it succeeds again. It's completely transient. And we have also fixed so many data durability and data corruption issues, not just in Postgres but also in the kernel, that this actually starts to matter now. These errors are quite rare in general, but they are becoming more common, and they are also becoming more common relative to the other errors. So it's not in the noise anymore; it's something we can actually investigate and reproduce.

Which leads me to the problem of actually causing these failures, of provoking such issues while developing the database. For a long time we didn't really have a way to easily cause such transient errors. A couple of years ago we got dm-error, part of the device mapper in the kernel, which is a great tool for exactly this, and that's one of the ways we have actually been able to reproduce a scenario where we lost data this way. And I can say that this explains so much from the past. We've been observing data corruption issues in production systems, of course, and whenever we saw NFS we said, look, NFS is known to be crap, so it's definitely that; we don't have any other explanation, it's definitely because of NFS. I'm pretty sure that if we went back and re-investigated all those reports, a significant part of them was probably because of these issues.

The other question is, why is PostgreSQL using buffered I/O at all? There are databases that adopted direct I/O, doing all the caching internally and not using the page cache at all. Oracle, for example, as far as I know, is not using buffered I/O, and other databases are probably doing the same. Well, that's really about the history of the project. PostgreSQL started as a research project at Berkeley, and the focus of that research was not to develop a direct-I/O-based database. The goal was to develop a database that was extensible, that adopted some object-oriented database ideas, and stuff like that. It was focused on very different parts of the database, and it was very easy to just adopt buffered I/O. It was also about the complexity of the I/O stack: implementing an I/O stack is not a trivial thing, and we are database engineers, not I/O engineers, so we don't really want to duplicate that work; we want to benefit from the work of the kernel community.
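Just to give a feel for why adopting direct I/O means taking on a lot of that I/O-stack work yourself, here is a small, hypothetical sketch of the difference between the two approaches; the file names are made up and this is not how Postgres does or would do it.

#define _GNU_SOURCE              /* O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Buffered I/O: any buffer and any length will do; the page cache
       absorbs the write, and a later fsync() makes it durable. */
    int fd = open("buffered.dat", O_WRONLY | O_CREAT, 0600);
    (void) write(fd, "hello", 5);
    fsync(fd);
    close(fd);

    /* Direct I/O: buffer, offset and length all have to be aligned
       (typically to the logical block size), unaligned requests fail with
       EINVAL, and the application now owns its own caching, read-ahead and
       write scheduling -- which is most of the multi-year effort. */
    int dfd = open("direct.dat", O_WRONLY | O_CREAT | O_DIRECT, 0600);
    void *buf;
    if (posix_memalign(&buf, 4096, 8192) != 0)
        return 1;
    memset(buf, 0, 8192);
    if (pwrite(dfd, buf, 8192, 0) != 8192)
        return 1;
    free(buf);
    close(dfd);
    return 0;
}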
And I think over the years that actually went quite well. Twenty years ago the development team behind PostgreSQL was much, much smaller; nowadays we have hundreds of people submitting and reviewing patches, but back then it was maybe five people. So we know what the problem is, kind of.

So how do we fix the issue? Option number one is modifying the kernel: making it keep the modified data around and actually retry properly. That's something that would work perfectly for the database, but it's something that can't really be done. First, we would have to convince the kernel developers that it's the right thing to do, and as I explained, while it might work for the database, it probably wouldn't work for other use cases. And it's not something that can fix existing systems: we already have hundreds of thousands of servers in production, and the likelihood of such a change getting to those systems anytime soon is essentially zero. We still have systems running kernel 2.6.something on old CentOS systems. So that's not really a solution we can use.

So we have to solve this in Postgres, with minimal help from the kernel. What we have since kernel 4.13 is a way to more or less reliably detect the errors during fsync, by keeping the oldest file descriptor around and looking at that particular file descriptor. We can make sure that in those cases we reliably detect the error and trigger a panic, which is essentially a database crash and recovery, and you don't lose any data. It is, of course, a disruption of the service, but if you really care about availability with Postgres, you should already have a standby with failover, and that will also work here. It requires modifications to how we cache the file descriptors, because so far Postgres hasn't actually tracked which file descriptor is the oldest, things like that; we can do that now. There is a patch which will likely get into Postgres 12, which is supposed to be released sometime in September or October 2019. It will require the new kernel behavior; we can't really do anything about older kernels. I'm not sure about backpatching to older Postgres versions, but chances are it will be backpatched to existing Postgres releases.

And there should actually be another slide here about the long-term solution in Postgres, which is, maybe over the next few years, essentially adopting the direct I/O approach. Not raw devices; we would still use the file systems available in the kernel, but instead of relying on the page cache, we would do all the work, all the fsync handling essentially, from the database. There are some proposals for how to do that, but while reworking the file descriptor handling is, let's say, trivial, this is going to be maybe a two- or three-year effort to get it working without significant performance regressions and so on. So that's the long-term plan, which I think was already discussed on the mailing lists.
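As a rough illustration of that short-term fix (never retry a failed fsync, escalate it to a panic so that WAL recovery replays the changes), it could look something like the following sketch; the function names are hypothetical, not the actual Postgres code.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* fd should be the oldest descriptor we hold for this file; on kernels
   >= 4.13 that means this fsync reports any writeback error that has
   happened since the file was opened. */
static void checkpoint_fsync(int fd, const char *path)
{
    if (fsync(fd) != 0) {
        /* Never retry, never mark the checkpoint as done.  In the database
           this is a PANIC: all backends abort and WAL recovery replays the
           changes, so committed data is not lost -- only availability. */
        fprintf(stderr, "PANIC: could not fsync file \"%s\": %s\n",
                path, strerror(errno));
        abort();
    }
    /* Success: this file's changes are durable, the checkpoint can proceed. */
}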
Right, so that's mostly what I have here. I do have a bunch of links, if you want more or are interested in the topic. First there are the discussions on pgsql-hackers where you can actually see how we discovered the issue. I don't know if you have ever seen such investigations in practice: it starts with a minor issue you investigate, and then it's a rabbit hole; you get into the small issue, then you discover things are much more broken than you thought, then you get into another rabbit hole with a much more significant issue, and so on. So those are the first two discussions. Then these issues were discussed on LWN in three different articles. The first one is specifically about Postgres, explaining why we were so surprised about this. And then there are two other articles explaining how the error reporting in the kernel got broken and fixed and broken again and so on. And finally, there's a very nice talk by Matthew Wilcox, from, I think, Microsoft, who works on the kernel nowadays. He gave a very nice talk at PGCon in Ottawa this year, which is the main Postgres developer conference, where he explained exactly what the issues are with tracking the errors in the kernel for different file descriptors, and how it got broken in different kernel versions. There is now a video of that talk on YouTube, so if you want the details from a much more knowledgeable person, that is probably the recording you should be looking at.

Right, so that's all I have here, I think. Are there any questions? Yeah.

My question was, how do you go about testing that your solutions are correct, and how are you testing the fsync behavior in different file systems? Are you reading the code, or are you applying tests to it somehow?

So, the question is whether we have testing for the correct behavior. Yes, my colleague actually has a script which uses dm-error to reproduce the errors, with and without the patches. It's really difficult to say that it's perfectly correct, but we see that with these changes, on the new kernels, it actually behaves correctly; we don't lose the data anymore. Thank you. I'm pretty sure he actually shared that on the Postgres mailing list; just look for Craig Ringer in the mailing list, that's him.

What inspired you to do this research? Sorry? What inspired you?

So, well, we have customers running on Postgres, and they actually lost data; they ran into a data corruption issue because of this. So we've been investigating why there is a broken index or something like that, and in that particular case it was more complicated, because they had been running on multipath, with multiple paths to the devices and so on. We discovered various issues in that setup, but ultimately it turned out to be because of this.

Is there actually a need to fix the problem in the Postgres project itself? Couldn't you rather leverage a plugin-based system, like is being done for the table engine, for example?

Right. So there is an effort to have pluggable storage in Postgres; it might actually get into Postgres 12. But it's mostly an independent thing, because even the pluggable storage will use buffered I/O, and once you use buffered I/O, you are subject to this issue. So yes, we still need to fix the issue even if you use different storage engines. Yes.

Are there any recommendations for what we can do right now with our existing deployments, such as use this or that file system?

Right. So if you are using things like thin provisioning, you need to carefully monitor disk space and so on. You might also use multipath, because that can queue the I/O when a path to the device fails, so it can kind of replace the queuing, the keeping of the data in the page cache, at the multipath level. And otherwise, not really. It's really a disagreement about the kernel API between Postgres and the kernel. Yes.
We have lots of Postgres databases, and we recently started seeing fdatasync errors; we are hot-plugging CPUs into our libvirt machines. How do you go about actually debugging errors like this? Because we see these errors, but I don't really know how to dig into this.

So, I'm not sure I have a good answer to that. Virtualization is a very complicated topic, and I don't have an answer. But essentially, if fdatasync fails, there is something wrong or misconfigured in the system. I don't know which kind of virtualization you use or how it's configured, but I don't recall fdatasync errors unless there is actually a problem at the device level. Okay, I don't know. Are there any more questions?

Is there some operating system kernel that behaves correctly according to the Postgres requirements?

So, as far as I know, POSIX doesn't really say what should happen in this case. At least I've been unable to deduce that from the POSIX specification, and I haven't seen a clear explanation of why this should be against POSIX, like a quote from POSIX saying this is wrong. So I think it's acceptable behavior; POSIX doesn't really say it shouldn't behave like this. That's my understanding. Okay.

One question: you had the line "Illumos and FreeBSD behave differently". Is it known what kind of different handling this is?

Right, thanks for reminding me. I said that the BSDs generally behave just like the Linux kernel. ZFS, and the systems based on Solaris, Illumos and so on, will actually keep the data in memory, as far as I know. So ZFS, when using ZFS, is an exception to this; it should be resilient to this kind of error.

Right. Which Postgres versions are going to crash now? So where is this backported to?

So it's not committed yet, right? So all production versions of Postgres actually have this problem. But as I said before, it's not something that suddenly makes Postgres less reliable than before; the likelihood of the error is still the same. It's just that we now know about it and we are fixing it. Yes.

Hi. I would assume that the file system will report an error if a sync fails, right? You could read the...

Right. So I'm not sure I understand the question exactly, but there is actually a related issue: the writeback happens in the background. So while you are writing the data, even if you never invoke fsync, you can have I/O errors losing data, and then the error is actually reported on a later close or another write or things like that. And it's exactly the same set of problems all over again.

I'm talking about the kernel directly. The file system itself will have an error in the file system error log, no? There is a system call, if I'm not totally wrong.

So the error is not really about the file system. You fail to write a block to disk for some reason, which is independent of the file system. It gets reported to the file system, and the file system can respond to that in some way. So that's the problem. Thanks.

You mentioned FreeBSD and Illumos behave differently because of ZFS. Yes. How about ZFS on Linux?

Good question. I don't know. But it's a good question, because I think FreeBSD is going to adopt ZFS on Linux, right? My understanding is that the different behavior applies to ZFS on Linux too, because it's not using the page cache, it's using the ARC. So I think it's going to behave just like FreeBSD in this case.

That was my second question.
Is there any file system known on Linux which behaves correctly with respect to requirement number one you mentioned? Because then you would have a lot of problems out of the way for existing installations, by putting the database files on such a file system.

Right, so... It's not a solution, but it's a hot fix. Right, so, good question. I don't know.

Any more questions? Okay, so if there are no more questions, I will still be around, so if you want to ask me personally, I will be here. But I think that's it. Thank you.