Okay, so the talk is called "A Guardian Angel of File Systems": monitoring for file system errors in data centers. My name is Gabriel, I work for Collabora, and I've been a kernel developer for a lot of years. Collabora is an open source software consultancy: customers approach us with specific issues in parts of the stack and ask us to help them, to optimize their use cases or fix specific bugs. We specialize in helping companies succeed using open source.

This was going to be a virtual talk, which is why I have a slide about me, but since I'm here, let me go through it really quickly. As I said, I'm a kernel developer. You might know me from some work around gaming on Linux: I'm the developer who implemented case-insensitivity support in ext4, which was later adopted by F2FS. I also worked quite a lot on a feature originally called FUTEX_WAIT_MULTIPLE, which later became the new futex2 interface. Most of that is about optimizing gaming use cases on Linux, making games run faster and run better, whether they were brought over from the Windows world and emulated via Proton, or designed natively for Linux. I'm also the maintainer of a very small subsystem in the kernel, the Unicode subsystem, which is used by case-insensitive file systems.

But today's talk is not about gaming. I'm shifting gears a bit to look at servers, and I'm trying to solve a problem of reliability: how can you trust that your file systems are working fine, that everything in your system is good? So this talk is about how we detect and intercept file system errors that are going to become bigger issues in the future, that are going to corrupt your data. How do we detect them early?

Just a quick overview: a file system error happens, basically, when you cannot access some data that was originally on the disk.
Maybe the block where the data was written on the disk got corrupted. Maybe there was a bit flip, some alpha particle coming from the universe that flipped a bit, and now your data is broken. Maybe the checksums just don't match anymore, or maybe there was a file system bug that caused your data to be corrupted on disk. Whatever the cause, it means your file system is no longer able to read the data. Sometimes this only affects one file, but it might indicate bigger issues: if you have one block that is failing, it might mean that the underlying disk is failing, but you don't know.

When you find an issue, when something is broken in your data center, you have a few options. You might replace the disk. You might recover from backup. You might call your system administrator and ask them to fix the thing. Another alternative is running a recovery tool that crawls over your file system, tries to detect the error, and tries to fix it. The problem is that tools like fsck need to walk through the entire file system, checking everything for errors. That is time-consuming, and sometimes the file system needs to be taken offline; it's a complicated situation when you are dealing with a production system. Obviously, if you look at the image on the right and your disk is in that state, I would say it's a bit beyond repair. But if you can identify that something is broken before it gets to that point, you have a better chance of fixing it.

So the question in this talk is not "how do I recover?". We have tools to recover, we have mechanisms to recover, we can restore from backup. The question is: how do I know that my file system is failing? How do I know that I have an issue in the first place?
The first sign that something went wrong is when you get an error code from a system call, and, as we all know, a lot of people ignore them. An error code might just be an EINVAL or an EIO coming back when you try to do a write or when you try to open a file. Using that to monitor file system errors is fairly unreliable, because you depend on every application in your system to report errors to the upper layers: you need your application, or maybe your shell script, to check for errors, and some tools simply ignore them silently. Once you get an error code, if the application doesn't do anything with it, that information is lost.

The other way you can check for file system errors nowadays is the kernel log. I'm going to say a bit more about why that's problematic, but basically it's a buffer where the entire kernel, and some of user space, dumps message after message onto the log, and sometimes messages get dropped. It's not a very reliable mechanism. Some file systems also expose an interface through sysfs, where you can look at specific files that list the number of errors that occurred. They provide some information, but none of it is standardized, and it's not a pollable interface: at least these specific error files are not pollable in sysfs. So it's not like you can just set up a daemon and forget about it; you need to keep checking for errors. What people usually do is notice something in dmesg and then go look in sysfs for the details, or run a tool that polls sysfs all the time.

By the way, this is what the kernel log looks like. It's slightly cryptic, but here we see an example of a file system error: there was an error in ext4, a remount, triggered by that PID, the command that triggered it was mount, and a human description of the error: "abort forced by user".
This is obviously not an actual error; this is me crashing my own file system for demonstration purposes, and then the file system being remounted read-only, which is part of the kernel's recovery. So we have a single line here that actually describes an error in the file system, and that single line is somewhere in here: this is a random snapshot of my kernel log that I took while running a machine.

The kernel log, as I like to say, is no man's land. Any kernel developer, anyone who has a patch pushed into the kernel, can write stuff to the kernel log, and I'm not familiar with anyone in the community trying to gatekeep or protect it from anything being written to it. Sometimes people even put credits for the driver they developed in the kernel log. That's not very common nowadays, but you can still find people's email addresses in there. And the problem is that it's very noisy: just in this image, you can find information about which keyboard I use and which headset I use. There is an Apple keyboard in there, but I swear I don't use that one.

So the kernel log is definitely not the right tool for monitoring. It's implemented as a ring buffer, so the main issue is that you lose messages: once the buffer is completely full, it wraps around and starts overwriting the oldest messages at the top. And it's plain text, or kind of plain text; it's a very free form of logging, so you have parsing challenges.
You need to be able to understand what is written there for any kind of automated processing, and that has its own challenges: there were issues in the past where messages would be wrapped or truncated. A patch set in 2019 improved on that, but it's still problematic. The main issue, though, is that the kernel log is an unstable API: people change their error messages at every release. If you have an automated tool that is looking for a specific error string, you might simply not notice when that error happened.

It's also interesting that it's not only the kernel that writes to this no man's land; user space can write to it too, depending on the distro. This is blocked nowadays in some setups, but if you write to this file on your system, it goes straight into the kernel log and might overwrite something more interesting. It also has the same problem as the sysfs interface I mentioned: you cannot poll for errors. You cannot be notified when an error happens; you need to keep checking and checking and checking, which is not ideal for monitoring.

Just a small comment, because I know some people are going to say it's not exactly like that: obviously you can overcome some of these issues. For instance, if you have a daemon watching over your kernel log, you can avoid losing messages. But that's not a guarantee; your monitoring is now only as good as your daemon is at fetching information from the log before it actually gets overwritten. There is also the patch set I mentioned that attempted to improve on this problem. But as far as I know, and maybe Linus is not here, there is no plan to control what gets written to the kernel log.
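To see why that matters for monitoring, consider the kind of filter such a log-watching daemon ends up running: a plain string match against the log. A minimal sketch, assuming the "EXT4-fs error" prefix that ext4 happens to use today (which, as discussed, is not a stable ABI):

```shell
# Fragile by construction: this matches the "EXT4-fs error" message
# prefix as ext4 emits it today. If a future release rewords the
# message, the filter silently stops matching and errors go unnoticed.
watch_ext4_errors() {
    grep --line-buffered 'EXT4-fs error'
}

# On a live system you would feed it the kernel log, for example:
#   dmesg --follow | watch_ext4_errors
```

This is exactly the pattern the talk argues against: the daemon works until the format string changes, and nothing tells you when it does.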
So it's very easy to push a patch to the kernel that modifies a message, and that's not considered an ABI breakage. So the kernel log is definitely a no-no, in my opinion, and I actually wrote "personal opinion" here: relying on the kernel log for any kind of automated monitoring of your data center is a terrible, terrible idea. But the reality is that we do it all the time, and the reason is that, well, it's usually the only way.

Don't get me wrong, I'm not saying the kernel log is useless. It's very useful for troubleshooting issues; it's very useful for development. Every kernel developer in the room, I bet you debug your kernels with printk instead of nicer tools. But it's still not the right tool for automated monitoring. When you are a person looking at it, you can spot an error message that changed between releases; for automated monitoring, that doesn't work.

And obviously this issue gets worse as you grow in scale. If you are a single Linux user looking at your own kernel log, that's fine: you can understand the messages, you can see what changed. If you are Google, looking at a huge data center, well, you have a big problem. So the question here is: how do I solve this problem for data centers, for bigger users? How do I build an automated monitoring tool that can watch an entire data center?
The solution we implemented is based on fanotify, an infrastructure that already exists in the Linux kernel for monitoring file system events: when a file is accessed, when a file is removed, when a subtree is modified in any way. As far as I know, and maybe somebody in the room can correct me, the original use case for this feature was antivirus scanning: the scanner tells the system "I want to be notified any time a file is created", so it can go there and verify that file.

fanotify is quite an interesting tool for solving this problem, because it has an object model that allows you to attach to a specific file, a specific directory, a specific mount point, or the entire file system. And it's very low overhead, which means that when it's disabled, when you don't care about monitoring, it has basically zero cost, but you can quickly attach to a specific subtree of the system. I investigated a few alternatives: I considered implementing a custom notification event, and I looked at watch_queue events, but the reason I went with fanotify is how well it already ties into the file system layer.

So what I did was extend fanotify with a new event type called FAN_FS_ERROR, which is now triggered every time something happens that shouldn't happen on your file system. Every time the file system code detects an issue, it creates a FAN_FS_ERROR event. The event already carries a lot of information, and it's extensible: it tells you where the error happened, which file or directory, what kind of error happened, and a little bit of extra metadata. The point is that it's extensible. Say a file system in the future wants to do online repair, for instance.
It could simply notify that file system's fsck tool of exactly what went wrong, so the tool can go straight to that inode and fix it, without having to investigate by itself. There are many applications for this interface in online repair of file systems.

The implementation has to be file-system-specific, because every file system has different errors: some errors just don't make sense on one file system and do on another. I implemented support for ext4, which is what we use in production for a specific customer, and this support has been upstream for half a year now; it went out in 5.16. The way we implemented it, it mirrors the errors that are reported in your kernel log for that file system, so it's easy to migrate: you can go straight to this feature instead of continuing to rely on kernel log monitoring, and all the errors you cared about are now exposed through this interface.

An error looks like this. This is a tracer I wrote that just connects to the interface and prints out the information it receives. Here I'm running my tool against this mount point, which could be the root of my file system, so I'm monitoring everything inside that file system. This is per file system.
So if you have a submount, you need to run it against that too. This is the event that was generated for the error. This FSID uniquely identifies the file system; it's the same fsid you can obtain from fstatfs. This is a file handle, and what it does is uniquely identify an object in your file system: on ext4, for instance, it's basically the inode number plus the generation. With this you can quickly tie the event to a specific object: this specific object in my file system has an error. Then it prints the error type, and the error itself, in this case an EIO; obviously the meaning is dependent on your file system, so you need to go look at it to figure out what is going wrong.

Then it prints the error count. One specific detail about file system errors is that when something goes wrong, you sometimes get a flood of errors dumped at once, and they overwrite your entire kernel buffer in one go, so you lose the original information: one error in a file can quickly propagate, and if you try to open that file repeatedly, it just dumps a lot of noise into your kernel log. Instead, we record only the first error that happened for that specific file, which lets us focus on the information that matters: when you have a file system error, usually the only error you care about is the first one that appeared, and that is what we are catching here. But the count still provides for statistics, and I know people who are doing statistics on the number of errors that happened.

This is another example. This is actually the error I triggered at the beginning of the talk, when I showed the kernel log. It now becomes this: it tells you that this is a superblock error, error 108, which ext4 translates to some kind of abort. This one also has the FSID, but it doesn't have any file handle, because this was an error in the superblock of the
file system itself, so it's not tied to a specific inode.

All the documentation is there. I see one immediate use for this, which is desktop environments using this information to notify the user: "you need to run fsck", "you need to recover from a backup". Obviously there is the data center monitoring use case, where you are looking at a huge fleet of machines and making sure that your data is not corrupted, that your volumes are all safe. I also see a very interesting use case in online repair. The documentation explains how to use the feature, focused specifically on user space developers who want to implement monitors on top of it, and there is also a code sample in the kernel tree, which is what I used to generate the previous example.

Finally, some future work. As I said, the interface is extensible. One piece of information I really want to add is the line of code where the error occurred, so we can link it directly. There is some objection from the community there, because it's debatable how useful it is, but I know people who are doing statistics on which errors fire the most, on which paths need to be further improved, and that would be an interesting thing for us to have. There is also the possibility of exposing a file-system-specific blob, which would allow, for instance, XFS to use this for online repair. And another piece of future work, which I would say is the most important, and I'm actually working on patches for it, is supporting more file systems. Nothing here is ext4-specific, except that ext4 is the only implementation so far, as an example target. I would like to see this in btrfs.
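For reference, the file-system-side hook is small. A hedged kernel-side sketch follows: fsnotify_sb_error() is the real helper from <linux/fsnotify.h>, while the handler wrapped around it is made up, since each file system calls the helper from its own existing error paths. This is kernel code, not a runnable userspace example.

```c
/* Sketch of a filesystem wiring itself up to FAN_FS_ERROR: from its
 * existing error handler, report the failure before recovering.
 * The handler name is hypothetical; fsnotify_sb_error() is the
 * actual call. */
static void myfs_handle_error(struct super_block *sb,
			      struct inode *inode, int error)
{
	/* Queue a FAN_FS_ERROR event for any fanotify watcher on this
	 * filesystem. inode may be NULL for superblock-wide errors. */
	fsnotify_sb_error(sb, inode, error);

	/* ... existing recovery, e.g. remounting read-only ... */
}
```

This is why the per-filesystem patches stay tiny: the work is just picking the right spots in the error paths to place the call.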
I've been writing some patches for this, and it's quite a trivial interface for a file system to adopt: basically, when you detect an error, you just call this function, passing the superblock, the object, your inode, and which error happened. There is an example patch here if you want to look at it and try to implement it for your own file system. It's basically a four-line patch, and all the lines are the same, just calling this function in different places.

So this is the feature that I wrote, but while preparing this talk I kept thinking: could I have done it slightly better? What bugs me is that when we look at the ways the kernel notifies user space, we have so many of them. You can check sysfs, you can look at the kernel log, we have inotify, dnotify, fanotify, netlink, and we have watch_queue, which is quite a cool thing. When I was researching how to implement this, I bumped into an interface from BSD, kqueue, which is basically a generic queue from the kernel to user space where you can send events: you can be notified about different event types, and you can filter events inside the kernel. It's quite a neat interface, and I thought, we don't have an equivalent in Linux. But it turns out that we do.

In 2020, David Howells proposed watch_queue, a generic event interface that is not tied to any subsystem. The idea is to make it generic enough that it can be used by several subsystems. It already supports event filtering, and I think there is some work to be done there, for BPF filters or whatever. (I said "BPF" in the talk, so kudos, self-kudos.) It's already merged, already available in the kernel, but right now it's only used for keyring management. It's still early days.
It was merged in 2020, and there is a reason I couldn't use it here: the object model of fanotify makes fanotify ideal for any kind of file system monitoring, and watch_queue cannot be hooked to a specific file system yet. But I think it's a very promising interface, worth mentioning, and a much better solution than the very specific notification mechanisms we have scattered around the kernel.

So, that's my talk, thank you very much, and I'm open to questions. But before that, I just want to say that Collabora, as I said, is an open source consultancy that works on many interesting things. When I joined Collabora, I started working on so many interesting topics that I ended up speaking at open source summits around the globe every year about all of them. We have customers in very different areas: we don't do only kernel work, we also do a lot of graphics and a lot of multimedia work, and we are hiring. So if you're interested in the kind of work that I presented here today, or that my colleagues have presented at this conference, feel free to apply. Thank you.

So, can I answer any questions? Are there any questions? Yes?

Q: Hey. When you showed the error and it's just, like, "error 117", you were talking about linking it back to lines of code as a way to more uniquely identify errors. Is the direction just "here is the place in the code where the thing happened", or is there something more like SMART, where you have generic classes of error?

A: So here, this is the error that was returned at that point; in this case it's an ext4-specific error. We don't have the information about the line of code that triggered this error. That is future work.
It's something I'm working on; I have a patch for it, but it wasn't accepted, and I still need to do some work there to get that specific information in.

Q: But there's nothing like what you have with storage, where across the industry there are generic classes of error?

A: No, that is not going to be standardized like that; it is going to be very, very file-system-specific. There are errors that wouldn't make sense on other file systems. For instance, "this block group has an issue": a block group is not a structure that exists on a different file system. So it's not standardized. Unfortunately, we could do better here, but it's not there yet. Thank you.

Q: Is there support for ext2 and ext3, by extension of this being added to ext4?

A: That's a good question. I think so, because they share some code... actually, no. I haven't tested it on ext2, and I don't know if the code is shared for ext2 too. I thought there was some commonality, but I'm not very familiar with that area; I can check. So: it is definitely not supported on ext2. ext3, yes, because ext3 is implemented by the ext4 driver, but ext2 is not, so no.

Q: Just a remark. A structured logging approach is great if you have it, but sometimes you cannot afford it, and you mentioned earlier that the kernel log is not an ABI. That's unfortunately true, but since 5.15 there's been an interesting feature called printk indexing: during the build process, all printk format strings get collected into a special ELF section, and you can at least raise an alert if a new kernel build no longer matches any of your patterns. So I found that to be quite useful.

A: I agree, that improves the situation a lot. It doesn't solve all of the issues, but I think it's a great improvement.
Q: It's just a band-aid, though. Doing kernel-specific work doesn't scale to having many systems engineers, but writing regular expressions does.

A: Yes, that's true, it is a bit more palatable. But thanks for your work.

Q: Hello. Other file systems, like ZFS and some others, keep separate logs inside the file system itself that contain these error messages. Do you think this approach is better than those file-system-specific logs?

A: I think those are different use cases. Here we are thinking about how we notify the user that something happened. Logging it inside the file system is also important, but this is about how we tell the user that something crashed without the user having to poll for it.

So, if there are no other questions: thank you.