Okay, let's get started. My name is Gabriel, and this talk is called "A Guardian Angel of File Systems: Monitoring for File System Errors in Data Centers". First of all, I apologize that you cannot see me; there is a technical issue on my part. Since you cannot see me, let me tell you a little bit about myself. This is me, or at least how I looked when I was a bit younger. I'm a senior kernel engineer at Collabora, an open source software consultancy and one of the sponsors of this conference. In the past five years I've been focused mostly on improving gaming on Linux. You might know me from the futex wait-multiple work, which evolved into the futex2 development; it's a major feature for gaming on Linux on top of Steam. But I'm also a file system developer, and I've been contributing features back and forth. For instance, I wrote the case-insensitive support in ext4, which is now used on Android phones and is also quite important for gaming on Linux. Finally, I'm the maintainer of the Unicode subsystem in the kernel, which is used by file systems that want to interoperate using UTF-8 for case folding and case insensitiveness. I should also add that I'm Brazilian, so if you want to ask questions in Portuguese or Spanish, I can answer them directly. The talk itself is in English because I will reuse it at a later date: I will also be presenting this topic at Open Source Summit Europe. So I prepared the talk and the slides in English, but you are welcome to raise questions in any language, if that is allowed by the conference.

In this talk in particular, I'm going to change gears a little, away from gaming on Linux and how to improve it, to a different use case: servers and data centers. To get started, let's recap quickly what a file system error is and what problems it might bring. This kind of error usually happens when you have broken data, or when something went wrong on your disk, your volume, or your operating system. Sometimes a faulty disk has an unreadable block; sometimes the data got a bit flip on its way to the disk, or while sitting on it, and you ended up with corruption. The file system might try to read the data and fail, with the block layer signaling that the block is unreadable; or it actually reads the block, but what it finds there doesn't make any sense. Maybe it verifies the checksum of the block and sees that the checksum is wrong, or it looks at some metadata and finds data that simply doesn't make sense.

Why might a file system error happen? The simplest case is a faulty device, a disk that cannot read a specific block, but there are other causes, for instance an error in transmission, or even a file system bug. File system bugs are very common, and unfortunately, as much as we test file systems, developers are not perfect; we introduce bugs, and even though we try to minimize them, it's always important to be able to recover when these situations happen. So when you get your first file system error, that is usually a sign that you have bigger issues.
For instance, a small issue in an inode might propagate if you copy that inode around; you might lose a subtree, a subset of the directories of your file system. Or it might mean that your disk is dying, slowly, and after a while it will stop functioning completely. Or it might be something limited to one single file that never becomes a bigger problem; it's just one specific piece of data. But the important thing is that you need to be able to detect it quickly, since a system administrator will usually start some recovery procedure. The most common one is running fsck, the file system check and repair tool. It is designed to walk over the file system, correct errors, and query the user about situations that cannot be decided automatically: whether it should remove or unlink a file, or whether it should try to recover it. Or the administrator might just decide to restore from a backup. Doing that in a timely manner is essential, because once your errors propagate into your backups, you lose that information. If your device is failing, you might not have lost any important data yet, but if the device stops working in the future, you might not be able to access it at all. Of course, looking at the image on the right, I would say that particular disk is a little bit beyond repair, but if we are not there yet, we might at least be able to save some data.

That obviously raises the question: how do I know that my file system is failing? What can I do to figure out what is going on before it is too late? The first sign of trouble that we usually get as users of a system is an application failing: it throws an error telling us there is something wrong, something it cannot do. I cannot save this file, or I cannot read this file. This happens synchronously: when you try an operation, your application gets an error code returned by a system call. Sometimes it's EINVAL, sometimes it's EIO, but it's basically the operating system telling your application "I cannot execute this system call". That doesn't convey much information about what actually went wrong, and it depends on the application doing the right thing. An application that receives this error might just ignore it, pretend nothing happened, and take a different path. Or it might show a message to the user, or, if it's designed to be more resilient, it might try to recreate the file. But it's very application-dependent; some applications will simply swallow the error. So this doesn't give us a reliable way to detect these errors on its own.
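To make that synchronous path concrete, here is a minimal sketch; this is my illustration rather than code from the talk, and checked_read is a hypothetical wrapper:

```c
/* Illustrative sketch: the synchronous error path an application sees.
 * A failing read(2) returns -1 and sets errno, often to EIO when the
 * underlying block cannot be read. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

ssize_t checked_read(int fd, void *buf, size_t count)
{
	ssize_t ret = read(fd, buf, count);

	if (ret < 0 && errno == EIO)
		/* It is entirely up to the application what happens now;
		 * many programs silently ignore this, which is why the
		 * synchronous path is unreliable for systematic detection. */
		fprintf(stderr, "I/O error reading fd %d: %s\n",
			fd, strerror(errno));
	return ret;
}
```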
Another option is dmesg, a tool that prints out the kernel log. The kernel log includes more information about the error, though the format is, obviously, file system dependent. This image is an example of what the ext4 file system prints on an error. It dumps a message showing where the error happened; as you can see here, it happened in ext4_remount, at line 6273. So this is a specific link to the source code: it shows exactly where in the kernel code the error was raised, and it gives you a human-readable description of the error, all in a single line written to your kernel log.

This is good enough for a user who is familiar with the notation and able to spot it, but it's obviously not machine friendly, since it's plain text. The bigger problem with the kernel log, though, is that it's noisy. The image here is a kernel log I just dumped from my machine. I like to say that the kernel log is no man's land, because any kernel code can write pretty much anything in there. It's a limited-size structure where every event in your system gets captured in a line or a few lines, and there is a lot of noise. Just in this image, you can see Bluetooth messages interleaved because I connected my headset; there is information about my keyboard from when I plugged it in; there is also output from when I put my laptop into hibernation. So the kernel log is extremely noisy. And worse, the kernel log is basically a ring buffer, so it has a limited size. Once you reach the end of the buffer, writing starts again at the beginning, overwriting what was already there. That means an error coming from your file system might simply disappear into all this noise.

And since we're talking about plain text, there are real parsing challenges for any tool that would automatically watch the kernel log and try to interpret it looking for file system errors. Messages might be wrapped or truncated: a message can collide with another message while being written, so you get the start of one message, then a line of something else entirely, then the continuation of the first. There were some improvements in the kernel to prevent that from happening, but it is still possible. And, as I said, the kernel log is just a ring buffer, so if you fill it completely, it overwrites from the top. This is particularly problematic for file systems, because when a file system error happens, it usually propagates quickly, and you sometimes get thousands of messages all over the place: a cascade of errors triggered in sequence, because once you fail to open a file, user space might retry, might hit other errors in the kernel, and it's very common to end up with a waterfall of errors that just overwrites everything. And those follow-up errors are not really meaningful; what matters with file system errors is usually the first error that got detected, because that is where the real problem is. All of this makes it very hard for any parser to walk over the kernel log and see what is going on.

But there is an even bigger issue than that: we don't consider the kernel log a stable ABI. Any message that gets written to the kernel log is plain text, not a stable interface, and it can be modified at any time. So if you write your parser, your grep, to look for a specific pattern and then upgrade your kernel, there is no guarantee that the pattern stays the same. It might just change, and you lose your data. So you need to be very careful when monitoring the kernel log; it's quite a dangerous tool for a machine to monitor. And there are other issues. I find the kernel log extremely noisy, as I already showed you on the previous slide.
Just any USB device connected to my machine will trigger a lot of messages, a lot of log lines in the kernel, and that alone is too noisy. But it's not even just that: any application with root privileges can write random noise into the kernel log, just by writing to /dev/kmsg, and in effect crowd out what was already there. systemd is known for using the kernel log and writing there, which increases the noise quite a lot. And as a final issue, it's a poll interface, which means an application monitoring the kernel log needs to keep reading it constantly, checking for new data. That's not ideal. What we want is something like a push notification, where you tell the kernel: wake me up when something happens that I really care about. You can obviously poll the kernel log using the poll system call, but you are not able to filter, so any new line, any noise that gets written, will wake up your file system monitoring daemon. It's definitely not the right interface for this kind of work.

There are mitigations for the kernel log, of course; you can improve some of the issues I mentioned above. For instance, a monitoring daemon that persists the log onto disk, like syslogd or systemd-journald does, reduces the chance of the kernel log overflowing and messages being lost. But it doesn't really prevent you from losing messages; it just makes it less likely. As I mentioned before, there are also patch sets that attempted to eliminate, or at least reduce, the possibility of wrapped or truncated messages on the kernel side, so that at least what you get in user space is correct. But the worst problem for me, the fact that it's an unstable ABI that you shouldn't rely on for any kind of parsing, is simply not going to change. The kernel log is not going to be made stable; it's quite hard to control the entire community, and nobody is auditing the printks in the kernel to make sure they don't change over time. So this is a big concern for anyone attempting any kind of monitoring.

In fact, as a personal opinion, I firmly believe that relying on kernel log parsing for any kind of automated monitoring is a terrible, terrible idea, and you really should not do it. And yet it's extremely common, and a lot of people do it, from KernelCI to many others. The reason is that sometimes it's the only way to get the information. For file systems, for instance, unless you run fsck, which is going to crawl all over your disk checking for errors, and for very big disks that can take a very long time, there is simply no better infrastructure for fetching errors. Sometimes there is: for some file systems you can look into sysfs, but that's not true for all of them. So we need to understand that the kernel log is just not the tool for this job. It's great for other work: as a kernel developer, you're going to use the kernel log all the time, and if you're a user troubleshooting your system, you can look at the kernel log and collect information, because you know what is going on and you are looking for something specific.
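To make the poll-versus-push point concrete, here is a rough sketch (my own illustration, with the usual caveats) of what an automated kernel-log watcher has to do today:

```c
/* Sketch of what a kernel-log watcher is forced into. /dev/kmsg
 * supports poll(2), but the kernel offers no filtering: the daemon
 * is woken for every record from every subsystem. */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/kmsg", O_RDONLY | O_NONBLOCK);
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	char rec[8192];

	while (poll(&pfd, 1, -1) > 0) {
		/* Each read() returns one record: Bluetooth, USB,
		 * file systems -- everything wakes us up. A read can
		 * also fail with EPIPE if the ring buffer wrapped and
		 * records were overwritten before we got to them. */
		ssize_t n = read(fd, rec, sizeof(rec) - 1);

		if (n > 0) {
			rec[n] = '\0';
			fputs(rec, stdout);
		}
	}
	return 0;
}
```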
A human reading the log is also not subject to the same constraints as an automated parser. But still, the kernel log is not the right tool for monitoring, and the situation only gets worse as you add more and more servers. Monitoring a single kernel log is feasible if you're a user, but how do you monitor a hundred machines? How do you monitor an entire data center, with maybe hundreds of disks and file systems, looking for failures in all of them? Scaling this kind of monitoring in a reliable, consistent way is extremely hard, in particular because of the poll-versus-push aspect. You could poll something like the ext4 sysfs for errors (actually, I don't think you even can), but the idea is the same: you would still need to read the kernel log, or a sysfs file, frequently, looking for changes. That is not ideal. You want notifications pushed to you when something changes, and you want to do that from some kind of central entity, a tool watching over your whole data center, able to check hundreds of volumes, dozens or hundreds of machines, and catch a failure as soon as it happens. Which is why the kernel log was simply a no-go for us: for a data center, it just doesn't scale.

While looking for a solution to this problem, we found fanotify, an interface originally added to the kernel to monitor a specific class of file events: reads, writes, renames, file creation and deletion. Its original use case, as far as I understand, was things like antivirus scanning or file indexing. It lets you place a watch on the file system on the kernel side and ask to be told whenever one of these events happens. The interesting thing about fanotify is that it has an excellent object model for file system monitoring: it lets you hook a watch onto every kind of file system object, for instance a file, a directory, or the entire file system. You can watch any entity you want very cheaply: there is essentially zero overhead when it's not enabled on a given object, and it stays efficient when a lot of events happen, because, for instance, a burst of identical errors can be merged into a single event and delivered to the watching daemon.

So when I was looking into how to implement this kind of file system error notification, I found that fanotify was a great fit for what we were doing, and I decided to extend it by implementing a new event type. Beyond reads, writes, creations and deletions, we now have file system errors: an event that triggers on your object every time the file system detects an error. Instead of just writing to the kernel log, the file system now raises this event, allowing a daemon to open fanotify, tell it which file systems it wants to watch, and get notified every time something happens. And the FAN_FS_ERROR event type provides a protocol for the daemon to receive information about what went wrong; we are providing a machine-friendly way of interpreting this data, instead of relying on text parsing like we would with the kernel log.
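As a concrete illustration of how small the daemon side is, here is a minimal registration sketch, assuming Linux 5.16 or later (where FAN_FS_ERROR landed) and the CAP_SYS_ADMIN privilege that fanotify requires; error handling is trimmed:

```c
/* Minimal sketch: ask the kernel to push FAN_FS_ERROR events for a
 * whole file system to this process. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/fanotify.h>

int main(int argc, char *argv[])
{
	/* FAN_FS_ERROR requires a group that reports file identifiers. */
	int fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID, O_RDONLY);

	if (fd < 0) {
		perror("fanotify_init");
		return 1;
	}

	/* Watch the entire file system backing argv[1], e.g. "/mnt". */
	if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
			  FAN_FS_ERROR, AT_FDCWD, argv[1]) < 0) {
		perror("fanotify_mark");
		return 1;
	}

	/* The daemon now sleeps in read(fd, ...) until an error event
	 * arrives; see the parsing sketch further below. */
	return 0;
}
```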
The event provides, for instance, an identification of which error happened, what exactly went wrong. It's also planned to be extended to provide file-system-specific error types, so the file system doesn't just tell you that an I/O failed; it tells you exactly what is wrong with that device. Each file system should be able to send its own specific error types, because an error that exists on XFS, for instance, just doesn't make sense on ext4, since ext4 has a different file system layout. So we allow the file system to be specific about exactly what went wrong.

It also tells you exactly what object was affected: which file is broken, which file is failing to be read, which block. That means a recovery tool could be hooked to this monitoring daemon and be told in advance exactly what went wrong, without having to go looking and walking through an entire file system. And it provides some statistics, like how many errors happened since the last time you looked. This is quite essential, because once you have a file system error, as I already explained, it's very common for dozens of errors to propagate and occur in sequence. So we want to know how many errors happened since we last observed, to make sure we didn't miss any. If we did miss something, for instance because the queue overflowed (this interface obviously also has size limitations), we at least know that we missed something, in which case we can start a full verification. This is different from the kernel log: when the kernel log gets overwritten, we don't know whether what was lost was even related to the file system, because it's too noisy. Here, if we lost something, we know it was a file system error.

The interesting thing is that this interface is extensible, and extensible in a backward compatible way, which means that in the future, if we want to provide more data to the user, we can do it without breaking the API. That is quite essential for some future cases I want to discuss. As I said, the feature is file system specific: it lets the file system tell you when something broke. And this is cool because, as I said, you can get sufficient information to tell fsck "go fix this specific file" or "go fix this specific structure", instead of just saying "something is broken, I don't know what", and then having fsck walk over a multi-terabyte disk, which takes a long time, just to find out what went wrong.

Right now we have upstream support for this feature: it's supported in ext4 since Linux 5.16. ext4 was my target while developing this because we have applications already using this interface in the field, but the plan is to bring support to more file systems in the near future. Which errors you get is, again, very file system specific, but in ext4 it mirrors everything that gets sent to the kernel log. So if you are one of those people monitoring the kernel log for file system errors, you can move to this tool without losing any information; you only gain by upgrading. Here you can find an example of a tool that I wrote to demonstrate how it works: fs-monitor, available in the kernel source tree under the samples directory.
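fs-monitor is the complete version; stripped down, the reading side looks roughly like this. This is a sketch based on the record layout documented in fanotify(7), with struct names from <linux/fanotify.h> (kernel headers 5.16+ assumed) and error handling omitted:

```c
/* Sketch: read one batch of fanotify events and pick out the
 * FAN_FS_ERROR information records. */
#include <stdio.h>
#include <sys/fanotify.h>
#include <unistd.h>

static void handle_events(int fanotify_fd)
{
	char buf[4096] __attribute__((aligned(8)));
	ssize_t len = read(fanotify_fd, buf, sizeof(buf)); /* blocks */
	struct fanotify_event_metadata *ev;

	for (ev = (struct fanotify_event_metadata *)buf;
	     FAN_EVENT_OK(ev, len); ev = FAN_EVENT_NEXT(ev, len)) {
		/* Variable-length info records follow the fixed metadata. */
		char *p = (char *)ev + ev->metadata_len;

		while (p < (char *)ev + ev->event_len) {
			struct fanotify_event_info_header *hdr =
				(struct fanotify_event_info_header *)p;

			if (hdr->info_type == FAN_EVENT_INFO_TYPE_ERROR) {
				struct fanotify_event_info_error *err =
					(struct fanotify_event_info_error *)hdr;

				printf("error=%d, count since last read=%u\n",
				       err->error, err->error_count);
			} else if (hdr->info_type == FAN_EVENT_INFO_TYPE_FID) {
				/* fsid + file handle of the failed object;
				 * an empty/invalid handle marks an error
				 * that is file-system wide. */
			}
			p += hdr->len;
		}
	}
}
```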
The sample does something very simple. It receives a mount point, so here you could be passing /mnt, and you're asking the tool to monitor that specific file system, the file system accessible through that mount point. When something goes wrong, it parses the message that came from the kernel and prints it in a more human-friendly way. And the nice thing is that, since we can just ask the kernel to wake us up when there is an error, this has basically zero cost while nothing happens; it's just sitting there as a sleeping task. In this case, I went to another shell and tried to ls a file that I knew was broken in my file system, and this is what I got. The fsid is the unique identifier of this file system; if I do a statfs, I can match it to the exact file system that failed. I also receive a file handle, which is a unique identifier of this file in time, meaning it carries the inode number and the generation number; with this information, I can identify exactly what failed. The generic error record is the structure that tells me which error occurred; this part is file system specific, so I can query my file system and figure out what it means. And then there's the error count, which, as I explained, tells me how many errors happened since the last time I looked.

The report looks different for different kinds of errors. This one, for instance, was an error on a file, the file with inode number 13. If we look at the next slide, we have an error that happened on the superblock. This one I triggered as a demonstration: I simply aborted the file system during a remount. And here, as you can see in the decoded file handle, it doesn't point to a file; it tells you this was a file-system-wide error, a superblock error. It also carries the error type, so you can decode it and act on it in your tool.
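To show what a tool can do with that report, here is a hedged sketch of turning the delivered file handle back into an open file descriptor. open_by_handle_at(2) needs CAP_DAC_READ_SEARCH, and mount_fd is assumed to be any descriptor on the file system you already matched via the reported fsid (for example, against statfs(2)'s f_fsid):

```c
/* Sketch: resolve a reported file handle to an fd, so a repair or
 * backup tool can act on the exact broken object. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

int open_reported_object(int mount_fd, struct file_handle *fh)
{
	int fd = open_by_handle_at(mount_fd, fh, O_RDONLY);

	if (fd < 0)
		perror("open_by_handle_at"); /* ESTALE: object is gone */
	return fd;
}
```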
On the next slide I'll show you some documentation for this feature, so you can build your own monitors, your own daemons, and understand exactly what we are providing here. It's well documented in the kernel tree, where you can find an explanation of exactly what the kernel provides. I also published a blog post on how it works and how you can benefit from it. There is the sample user space code that I showed on the previous slide, which is also part of the kernel tree. And finally, you can read the man pages and understand how to use this feature.

As for future work, I think there is a lot we could do here; there are so many directions this could go to improve data centers. But when I think more about it, it's not just data centers: this is an excellent tool to run on your desktop too. It could be integrated into your session manager or into systemd, so you would get a notification when your laptop's disk is failing, and make sure you have your backups in order, go buy a new disk, or just run fsck. So this is not just a data center tool; it will benefit Linux on the desktop as well.

Anyway, for future work, one of the things we really want to do, let me start from the middle of the list here, is to expose the file-system-specific error blob. The idea is that if we can send enough information to user space about what error is happening, and this necessarily has to be file system specific (we want ext4, for instance, to tell us which field of the inode got corrupted), then we can deliver that information straight to a repair tool that could perform online repairs. XFS actually has quite an interesting online repair tool that could benefit from this directly. So this is a use case I think we should explore in the future: we want to provide this as a blob that a recovery tool can understand, and then we save a lot of effort, because we no longer need to walk the entire disk trying to figure out what went wrong.

There are a few other use cases too. For instance, I know people who gather statistics on file system errors, so they know which parts of the file system need to be improved, or which parts are more likely to fail, so those structures can be replicated. We want to provide the exact information about what address failed, even which specific line of code failed, as part of the notification, so you can build those statistics. I think this is also very helpful when you get a bug report that includes a file system error: sometimes the error messages are not unique, so you don't know exactly what caused the error. If we link the report back to the source code, we'll know exactly what is going on; just from looking at it, we know the exact if condition that failed, and we have a better chance of fixing user issues faster.

And finally, we want to have this in more file systems. The feature was designed to be generic and file system agnostic, except for parts like the file-system-specific blob, but it still requires some support from each file system. Right now we support ext4. I already have btrfs as a target, because I have a use case for it in mind, but we're not there yet. Fortunately, extending this to other file systems should be quite a trivial patch. The interface is fairly simple: you just invoke fsnotify_sb_error from a failure path in your file system. If anybody wants to tackle this, I left the ext4 example implementation in this commit; it's basically a few-line patch that invokes fsnotify from ext4's specific error hooks, and it makes sure that all the notifications I showed on the previous slides get sent to user space.
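For file system developers, the kernel side really is that small. This is a sketch, not a real patch: fsnotify_sb_error() is the helper that ext4's error paths call, and myfs_handle_error here is a hypothetical error hook standing in for your file system's own:

```c
/* Kernel-side sketch for a hypothetical file system "myfs". */
#include <linux/fs.h>
#include <linux/fsnotify.h>

static void myfs_handle_error(struct super_block *sb, struct inode *inode,
			      int error)
{
	/* ... existing policy: log, remount read-only, panic, etc. ... */

	/* Queue a FAN_FS_ERROR event for any fanotify watcher on this
	 * superblock. inode may be NULL for file-system-wide errors. */
	fsnotify_sb_error(sb, inode, error);
}
```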
There is one thing that bothers me about this work, I have to admit: could we have done it more generically? When I look at the kernel log, I understand that a lot of people rely on it and parse it to get error information and system health information for many different subsystems. And if you look at the kinds of notifications that get sent from the kernel to user space, I've listed just a few here: some subsystems use netlink to send error notifications; we are using fanotify; several subsystems expect you to poll sysfs. You also have notification systems that are not for errors, like uevents, eventfd, inotify, and the kernel log itself, which is what dmesg reads. We have so many ways to do it, and I don't think that is neat. We should have one mechanism for sending notifications from the kernel to user space that is generic across every subsystem in the kernel.

And I have to say that BSD actually got this quite right with the kqueue/kevent interface. They have a pair of system calls that let you register to receive several kinds of events. It's quite neat: it lets you filter on the event, and it's a single point of event notification, so you can write much more interesting daemons and you don't have this mess of so many interfaces. Why don't we have that on Linux? Why do we have so many different interfaces? Obviously, some of them were designed for specific use cases, but some could be made generic, right? And I understand; I just explained how I built this on top of fanotify, so why didn't we make it generic? I think the point is that there is a trade-off between the very specific features you need to support a particular use case and a generic interface, and designing such an API is quite challenging. fanotify, as I explained, has such an interesting object model, letting you hook into the exact objects of the file system, that it was very hard not to use it; it was pretty much exactly what we needed. But it would have been much better if we had been able to do this in a more generic way.

So, is there a generic way to do that on Linux? There is: the watch_queue notification interface, proposed by David Howells in 2020. It's exactly that: a generic interface for events, not tied to any subsystem. I wanted to use it for file system notifications; it's actually currently used for keyring management, but it's completely generic. And I find this interface very interesting: it lets you filter events by type, so you don't get notified about things you don't care about, and it lets you hook into all kinds of objects. The problem is that it's still early days and there is development ongoing, but it's a very promising interface, which is why I wanted to mention it. I think it's something similar to BSD's kqueue/kevent coming into the Linux core, and I just wanted to give kudos to this interface and express my interest in it.

So, what I presented here: some general issues with how you monitor events on Linux; the specific problems of file system monitoring; and a solution to those problems based on fanotify, which lets you detect errors early and in a consistent way across a fleet of machines, along with the tools to use it and an API that is upstream and implemented by ext4. I also introduced some future work, what the next steps are, and a discussion of a few interfaces we should be looking at, and how there may be better ways to solve this problem through a generic interface.

With that, I want to thank you for listening, and I'm open to questions. I just wanted to add one last note: we are hiring at Collabora. We are an open source focused software consultancy, and we have several opportunities in the kernel, both in the kernel core, with projects currently in several areas including file systems, and in device drivers and codecs. Outside the kernel, we do a lot of multimedia and a lot of graphics. You might see a few other presentations from my colleagues at the conference. So, I'm open to questions. Thank you very much.