So, first of all, I'm going to give a quick introduction to the problem. In our large-scale data centers, memory errors are not rare. There are two kinds of errors: corrected errors and uncorrected errors. Uncorrected errors mean the error cannot be fixed by hardware, so your data is actually corrupted. Corrected errors mean the error was fixed by hardware, so your data is still consistent. For the memory failure handling in the kernel, here I'm referring to the uncorrected errors.

For the page cache, the memory failure handler truncates the hardware-poisoned page from the page cache, regardless of whether it is clean or dirty. So even if the page cache page is dirty, it is going to be truncated by the memory failure handler. Truncating a dirty page may cause data loss, because once the page is truncated, a later access will read the data from the disk, but that data is stale. The bigger problem is that it is a silent corruption, because there is no notification to the user from the kernel as long as the page is not mapped. So the memory failure causes data loss, and it is very hard to debug. Even if we are aware of some data corruption or data loss, it is not easy to trace it back to a memory failure, because you may not think of memory failure as the cause.

The solution I'm going to propose is simple. First, we keep the poisoned dirty page in the page cache instead of truncating it. Second, we have to make the filesystems aware of the poisoned page, because in the current design and implementation of the filesystems, there is no check for whether a page-cache page is poisoned or not.
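The behavior change proposed above can be sketched with a toy model. This is a hypothetical simplification, not the kernel's real API: the `struct page`, the flag bits, and `handle_memory_failure()` here are illustrative stand-ins for the kernel's `memory_failure()` path.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a page-cache page; the real kernel uses struct page/folio
 * with atomic flag bits. */
#define PG_DIRTY    (1u << 0)
#define PG_HWPOISON (1u << 1)

struct page {
    unsigned flags;
    bool in_page_cache;
};

/* Current behavior: truncate the poisoned page regardless of dirtiness.
 * Proposed behavior (keep_dirty_poison == true): keep a *dirty* poisoned
 * page in the page cache so a later access can return an error instead of
 * silently reading stale data from disk. */
static void handle_memory_failure(struct page *p, bool keep_dirty_poison)
{
    p->flags |= PG_HWPOISON;
    if ((p->flags & PG_DIRTY) && keep_dirty_poison)
        return;                 /* keep it; readers will see the poison flag */
    p->in_page_cache = false;   /* truncate from the page cache */
}
```

A clean poisoned page can still be dropped safely, because a re-read from disk returns valid data; only the dirty case needs to stay resident.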
So the filesystems assume the data in the page cache is always consistent, and actually ignore the poisoned page. We have to make the filesystems aware that there may be a poisoned page in the page cache. I have some bullet points about the page-cache-related paths in the filesystem code. The first one is writeback. I think writeback is fine as long as the dirty flag is cleared. The second one is drop_caches and other callers which may invalidate the page cache; we should prevent them from invalidating the poisoned pages. The third one is truncate, hole punch and similar operations: it should be fine to allow a truncate invoked by the user, because the user explicitly requested to remove the page from the page cache, so the user doesn't care about the data anymore. The fourth point is that we need to notify applications in user space by returning an error code when the poisoned page is accessed, for example in the read and write paths; the page fault path actually already handles that. And for some other paths, for example compression or encryption, I'm not quite sure, because I'm not a filesystem expert, so I need some advice from the filesystem developers.

As for the approach: to make the filesystems aware of the poisoned pages, we basically have three choices. The first one is to simply check the hardware poison flag in every code path that accesses the page, for example read and write. There is a page flag called hardware poison, and the memory failure code sets that flag for the poisoned page, so we can just check the flag to know whether the page is poisoned or not. The second choice is to just return NULL from the page cache lookup code for any poisoned page. I think that is the simplest way and incurs the least work.
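The first choice, checking the flag in every access path, could look roughly like the sketch below. This is a minimal toy model, not kernel code: the structures and the `read_from_page_cache()` helper are invented for illustration; the key point is returning -EIO rather than data or a NULL/-ENOMEM-looking failure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define PG_HWPOISON (1u << 1)

/* Toy stand-in for a page-cache page. */
struct page {
    unsigned flags;
    const char *data;
};

/* Every path that consumes page-cache data checks the poison flag before
 * touching the page.  -EIO tells user space the data is gone, instead of
 * silently handing back stale or garbage contents. */
static int read_from_page_cache(struct page *p, const char **out)
{
    if (p->flags & PG_HWPOISON)
        return -EIO;    /* surface the corruption to the caller */
    *out = p->data;
    return 0;
}
```

The same pattern would be repeated in the write, writeback, and invalidation paths, which is exactly why it touches many call sites but stays unambiguous.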
But the disadvantage is that at most call sites, when a NULL page pointer is returned, the filesystem assumes it is an out-of-memory error and doesn't consider other errors. So if we return out-of-memory errors, it will be confusing to the users, and the users may still not be aware of the data consistency issue. The third way is that we can return an error pointer instead of NULL. But this may incur more changes, because some paths don't actually care whether the page is poisoned or not; they only care whether the returned page pointer is NULL. Usually the code just checks if the pointer is NULL or not. If we return an error pointer, we have to add extra code to tell whether the page is poisoned, NULL, or just a normal page. So the steps to fix the problems...

Okay, I'm just going to stop you here. Could you go back to the previous slide? Okay, so we can't do either of those last two things. We can't return NULL and we can't return an error pointer, because the information about whether a page is poisoned is per page, but everything is being converted to folio lookups rather than direct page lookups. So that's just not going to work, because you might be accessing a part of the folio which doesn't have the hardware poison.

Yeah, I know, but actually the memory failure handler is going to split the large page. But that doesn't always succeed. Split the large page into base pages. The page split doesn't always succeed. Sorry? The page split does not always succeed. Yeah, and memory_failure() returns an error code when the split fails. So yeah, I get the point: that will leave the large page with the poisoned sub-page in the page cache, so we cannot know which sub-page is the poisoned one. Yeah, I agree that's a problem, so we may have to find a way to resolve that. Sorry, I think that first approach is the only one which actually works. Yeah, actually there is another flag called "has..."
I forgot the name, but it is actually set in the head page, and it tells the kernel that there is at least one poisoned sub-page in the large page. So we can check that, and if we find it set, we may have to iterate over every sub-page. But it's not very optimal.

I understand that that exists, but if somebody is reading a part of the folio which is not hardware poisoned, then that should be allowed to succeed, and we can't do that with either the second or third option here. So we have to go with the first approach. Yeah, I agree. Actually, I did convert tmpfs to the first approach.

One of the things we're doing recently on the DAX side is telling the filesystem that a failure happened, so I'm wondering if that's another approach. If the page is already dirty, you know that the data on the storage is bad. You still do the callback to the filesystem to record the error, and then when they try to read it back, they get the EIO from the filesystem. Yeah, I think we could use a similar solution for the page cache too.

Yeah, I'm inclined to support that particular solution, because, correct me if I am wrong, it appears that since the information is only being kept in memory, once the system crashes, all knowledge that there is a dirty page that was never written back is lost, and you still have silent data corruption. It's just that what is on disk is either non-existent or the previous contents of the page before it was modified. So it seems like this is an awful lot of work to only allow EIO to be returned until the system is rebooted or crashes. And I guess the question is, how worthwhile is that? And if we're going to go to all of that effort, it's certainly more effort to ask filesystems to keep state about the fact that a particular page is corrupted. But maybe that's the better approach if this is really a problem.
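The head-page flag check described earlier (a per-folio "at least one sub-page is poisoned" bit, plus a scan to find which one) could be sketched like this. The structure and flag names here are toy stand-ins chosen for illustration, not the kernel's actual definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PG_HWPOISON       (1u << 1)
#define PG_HAS_HWPOISONED (1u << 2)   /* set on the head page only */

/* Toy model of a large folio: one head flag word plus per-sub-page flags. */
struct folio {
    unsigned head_flags;
    unsigned subpage_flags[8];   /* toy: 8 base pages per folio */
    size_t   nr_pages;
};

/* Fast path: if the head page does not advertise any poisoned sub-page,
 * every sub-page is known good without scanning.  Only when the head flag
 * is set do we have to look at the individual sub-page.  A read of an
 * unpoisoned part of the folio can then still succeed. */
static bool subpage_is_poisoned(const struct folio *f, size_t idx)
{
    if (!(f->head_flags & PG_HAS_HWPOISONED))
        return false;
    return f->subpage_flags[idx] & PG_HWPOISON;
}
```

This is why the objection in the discussion matters: NULL or an error pointer from a folio lookup would fail the whole folio, while a per-sub-page check like this lets accesses to the healthy parts proceed.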
And maybe the next question is: do we really need to do this at the granularity of a page? I suspect for many use cases the filesystem could just simply mark a flag on the file, right? On the entire inode, saying: the contents of this file are suspect, you should restore from backups or take other administrative means to recover. And that might be a little more of a practical approach.

Yeah. First of all, about your concern whether it's worth it: I don't have a good answer for that. The memory failure does happen, as I said, not very rarely in large-scale data centers; we actually do hit a lot of problems from that. But how often it happens on dirty page cache, I don't have the data, so I cannot tell right now. But I agree it involves a lot of work for the filesystem developers, so the question of whether it's worth it is fair. And for the second question, setting a flag on the inode to mark the whole file as corrupted: if the user doesn't care about that, it should be fine. But if just one page is corrupted, for example you have a big file and one 4K page is corrupted and the whole file is marked corrupted, I don't think that's a good idea either.

I'm trying to understand the objective of this effort. Is it that you want to give the error code to the user and the user handles it however they want? Yeah. Okay. So I think my point is that notifying the user that the problem exists is better than silent data loss. Yeah. Sorry? It's also about preventing garbage from getting written out. Going back to your point earlier, Ted, if the system crashes, then the slate is wiped clean and we're all good; that garbage memory doesn't exist anymore. That's the good scenario. Well, it's not entirely a good scenario, because now we've had data loss. I mean, you've always got some amount of data loss if your system powers off without warning.
But usually it's somewhat time constrained, right? There are syncs that happen invisibly in the background every 30 seconds, so you probably haven't lost that much. But this could be sticking around for years. But I guess if it's still in the page cache and you give an error to the user, the user could actually take that, move the data to another file, recover it, and not lose it, right? Or am I misinterpreting?

So I think that of all of the things that you've talked about, the flag makes the most sense, because that's easy to put into the generic code. And I think that if we want to be smarter at some point in the future, we can do that. But we need to start with: mark the pages as poisoned, make sure that we don't return any data and we don't write that data out, and mark the mapping error. Jeff did all of this work to make sure we return errors to the user; we should take advantage of that. And okay, yeah, does it suck that only one page in this file went wrong but not the rest of it? Yeah, but from my experience, we are not very good in the kernel, and even worse in user space, at dealing with errors in the first place. So rather than having granular errors, as opposed to "this file is unreadable", let's err on the side of the biggest blast radius so we don't cause more problems. So I think: flag, put it in the generic code, check for the flag, if things are wrong make sure it doesn't get written out, set the mapping error, and then we can have further discussions later on about how fancy we want to be. Okay.

I'm just going to point out that for virtual machines, you sometimes have file-backed memory, and the hypervisor can actually propagate, for example, a memory error to the virtual machine. If you were damaging that whole file or prohibiting it from being used anymore, you would essentially kill that virtual machine, although that virtual machine might be able to deal with the MCE itself.
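The "mark the mapping error" step mentioned above can be sketched with a toy model of per-mapping writeback error reporting. The names `mapping_set_error` and the fsync sampling below echo the kernel's writeback error machinery, but this is a deliberately simplified model (the real kernel uses an `errseq_t` sequence counter so that multiple fsync callers each see the error once), not the actual implementation.

```c
#include <assert.h>
#include <errno.h>

/* Toy model: the mapping remembers the first writeback error until an
 * fsync() samples it, so at least one fsync caller sees -EIO rather than
 * experiencing silent data loss. */
struct address_space {
    int wb_err;
};

static void mapping_set_error(struct address_space *m, int err)
{
    if (err && !m->wb_err)
        m->wb_err = err;    /* remember the first error */
}

static int file_fsync(struct address_space *m)
{
    int err = m->wb_err;
    m->wb_err = 0;          /* simplified: real errseq_t supports many readers */
    return err;
}
```

For the poisoned-page case, the memory failure path would call the error-marking step so that the next fsync on the file reports the failure to user space.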
So that's just something to keep in mind when we're talking about completely corrupting the file, essentially, compared to only a single page that a guest operating system might be able to handle. But I mean, I agree, it's the easiest solution to just mark the whole file as corrupt; it just doesn't apply to all use cases, I think. You might cause more harm than not doing anything.

Yeah, right. So I agree for the VM case: all of a sudden the whole VM goes down because one page is screwed. I understand that and I think that's a valuable thing, but we are going to fuck this up in the generic case for several releases. We need to get the normal, just-the-file-is-screwed case working and then start talking about how we handle the more fine-grained cases, because, yeah, okay, it sucks for the VM people. But again, speaking as somebody that manages a humongous amount of machines, I don't care at all; I want to see that the thing died and that I need to replace memory. So we're kind of niggling about things that, yes, we could do and we could do a lot better with some fancy stuff, but we need to make sure that the brain-dead stupid case works first, and then go from there.

I also want to say that you've got to add a check in the page access paths, so in read/write. There is an upside here in the write path: if you're writing the entire page, get rid of it. Get rid of the corrupted page, because now you know what data should be there, because it's just being overwritten by user space. I think any writing of the poisoned page should be prevented, because if the existing data is written back to storage, the data on disk gets corrupted. No, I'm saying if user space calls the write system call and it's going to overwrite the entirety of the corrupted page, then you can throw away the corrupted page.
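The full-overwrite optimization being discussed could look like the following sketch. Again a toy model with invented names and a tiny page size, not kernel code: a write that covers the entire poisoned page makes the lost contents irrelevant, so the poison can be dropped; a partial write must still fail, because it would mix new data with lost data.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define PG_HWPOISON   (1u << 1)
#define TOY_PAGE_SIZE 16        /* toy page size for illustration */

struct page {
    unsigned flags;
    char data[TOY_PAGE_SIZE];
};

static int write_to_page(struct page *p, size_t off, const char *buf, size_t len)
{
    if (p->flags & PG_HWPOISON) {
        if (off != 0 || len != TOY_PAGE_SIZE)
            return -EIO;            /* partial write would mix in lost data */
        p->flags &= ~PG_HWPOISON;   /* full overwrite: old contents moot */
    }
    memcpy(p->data + off, buf, len);
    return 0;
}
```

As the discussion immediately notes, copy-on-write filesystems complicate this, since the on-disk handling of the old contents is not a simple in-place overwrite.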
That will still be a problem with copy-on-write filesystems, because the filesystem flushes the data first, and whatever is existing on the disk then gets in the way. Yes, and if you've taken a snapshot or something of that sort, then you're in trouble. Yeah, this is why I argue against fancy things in the first place. We need to just fucking mark the pages, or we might mark the mapping there. We can even add a fancier thing, like "mapping has a hardware-poisoned page, really do not write anything out". But I think we should start by simply having a hardware poison flag, then putting the logic in all the generic places and all the filesystems to recognize this flag and do something with it, and then we can start looking at fancier solutions. Yep.

Do you have anything else? No, I don't have anything else. Perfect. Let's go have some lunch. People on the Zoom call, we'll be back in an hour. Thank you. Thank you.