All right guys, this talk is about challenges observed when testing with fstests and blktests, and scaling that. The first thing I'd like to request is that you please just refer to it as fstests rather than xfstests, because even at LSFMM, I've been present when new file system developers really didn't understand, and still didn't know, that you can use xfstests to test all file systems. So I think it would help to make the name generic.

I took a poll on Twitter a while ago about how long people think it might take to stabilize a file system, and it seems people think it takes about two years to stabilize a new one. Clearly that's a bit wrong; it takes about 10 years, give or take, depending on who you ask, if you're introducing a new file system. So proper file system design is really, really difficult. Whenever you're designing something new, just have patience.

How about testing, though? Can we help, in the testing world, with stabilizing file systems? I set this as an objective many years ago, and I figured one of the problems I could tackle was reducing the amount of time it takes to test a file system. Once I had that done, I figured, why not also use this to test new block features? That's why this is now an fstests and blktests talk too.

One of the differences from other areas of the kernel is that determinism in fstests and blktests is really, really low. To give you an example, KUnit tests are highly deterministic, extremely deterministic. Selftests in the kernel have high determinism too, but that doesn't mean you're always going to have success; with some selftests, the kmod selftest being an example, you may have issues and have to run them over and over again. fstests and blktests, however, can be extremely non-deterministic and yet produce really, really catastrophic failures.

Before I get into examples, if there's any takeaway here, I'd really like you to think about this little graph and understand that the time invested in testing really needs to be divided properly if we want to take this seriously: about 25% of the time should be dedicated to test design, 25% to tracking results, 25% to reporting bugs, and 25% to fixing low-hanging fruit. I'll describe what some of these things are. We're pretty good at test design, and that's it, really.

To help with this, I worked on a project called kdevops. It tries to address variability by using Kconfig. It also addresses bring-up, whether you want to use local virtualization or the cloud, and for management it basically just uses Ansible. But this talk is not about that project; it's about lessons learned from that experience, and about actually scaling and automating testing.

To give you an example, though, here's a Leap 15.3 failure. For those that don't know, a Leap 15.3 failure also means it's a SLE bug too, because there's binary compatibility between the kernels and user space. This is just an example where you have an ext4 test failing one out of 300 times, so you have to run the test about 300 times in order to hit it.
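To make that concrete, here's a minimal sketch of reproducing a low-rate failure by running a single fstests test in a loop until it fails; generic/NNN is a placeholder, not the actual failing test:

    #!/bin/bash
    # Run one fstests test repeatedly until it fails, recording the iteration.
    # generic/NNN is a placeholder for whichever test is suspected to be flaky.
    cd /path/to/xfstests || exit 1
    for i in $(seq 1 300); do
        if ! ./check generic/NNN; then
            echo "failed on iteration $i"
            break
        fi
    done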
When I ask some file system developers how many times they have run fstests in a loop, they look at me kind of puzzled, because it seems not many people do that. Well, it turns out that if you do run fstests in a loop, you eventually will get a failure.

Here's an example in blktests. It fails about one out of 80 times, and it produces an RCU CPU stall warning. This is a recent one. I'd like to thank the folks who did the analysis of what likely could fail here. Shin'ichiro, I'm not sure how to pronounce his name, is he here? Is he here? Awesome, thank you, that was really cool. And Klaus, who is also a QEMU developer, ended up looking at these issues too. Basically, it seems that QEMU's zone reset is super slow when resetting all zones, so that could be optimized a bit. There are a few lessons we can learn from this case. One is that future optimizations are obviously needed in QEMU for resetting all zones. The other is that even false-positive CPU stall warnings are helpful, because they can tell us when user space is doing something really stupid.

Here's another example: a blktests block/009 failure. This failed one out of 669 times, and when it was reported, it really wasn't understood what the heck was going on. It took about eight months to figure out; really, it was just ignored by folks. It was fixed in 5.12, and Jan Kara has a series of patches that can also be backported. But as in Amir's last talk about merging stable fixes, granted, that was for file systems, and for the block layer it's also complex to merge some changes. For stable, you can't really merge Jan Kara's patches as is; it's really complex, just to give you an example. So fixing some of these issues in older kernels is really, really difficult.

Here's a low-hanging-fruit example. This one was really odd because it was failing for all these different XFS configurations at really different failure rates. It was just puzzling, but it was a very consistent type of failure, and it was just scsi_debug. Well, it turns out that scsi_debug, if you're familiar with SCSI, is a pain in the ass when you're trying to quiesce things. Long story short, there was a longstanding kernel module user-space issue: you basically tickle anything holding the module refcount and you just can't remove the module. That's been present in modules since forever. It will eventually be fixed in kmod with a patient module remover. But still, there's a lesson to learn here: fstests and blktests should not require modules.

As I think folks have pointed out before... sorry. But you can't remove it if there's still a reference held on the module, which is okay, and I think that's perfectly valid.

No, no, right. But the thing is, if you want to get to the point where you create high confidence in your baseline, you currently have to remove the module. I'm not arguing against removing the module.

If that module has some configuration, that configuration might well keep the module busy. So you first have to clean up the configuration.

Of course, of course. This is totally understood. The problem, though, is that we expect this to work and it doesn't, and then basically all your tests just stop.
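Since that fix isn't everywhere yet, here's a minimal sketch of the "patient module remover" idea, using nothing beyond standard modprobe and sysfs; this illustrates the retry approach, not the actual kmod implementation:

    #!/bin/bash
    # Retry module removal until the refcount drops, instead of giving up
    # on the first EBUSY. Illustrative only; the real fix lands in kmod.
    mod=scsi_debug
    timeout=60
    while ! modprobe -r "$mod" 2>/dev/null; do
        refcnt=$(cat /sys/module/"$mod"/refcnt 2>/dev/null || echo "?")
        echo "$mod still busy (refcnt=$refcnt), waiting..."
        timeout=$((timeout - 1))
        if [ "$timeout" -le 0 ]; then
            echo "gave up removing $mod"
            exit 1
        fi
        sleep 1
    done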
So it would help if we also just didn't require the modules.

Yeah, I'm aware of this problem; it's been pointed out several times. Currently blktests, especially the NVMe category, assumes it can test module removal, because there are so many modules involved when it comes to NVMe, or fabrics, or just an NVMe host and a target. And that interdependency has significant impact on the various reference counting we do on the host side and on the target side. What we can do is move those test cases out, so the module-removal tests come separately, in a separate category, and you can skip them.

Well, let's think about this. What's really desired is to quiesce things; it's not really to remove the module. Removing the module shouldn't be a requirement. For instance, with null_blk, folks added configfs support to essentially tear things down at runtime. That's the way to go, and scsi_debug already has something like that too. Can we do that for the NVMe modules, for instance?

No, that's what I'm saying: there is so much reference counting being done on the host side and the target side for the different transports, and when you remove the module, you are 100% sure that none of the recent changes are breaking the reference-counting mechanism.

Are you saying it's impossible?

It's not impossible, but it is easy to introduce a bug in that part. That's what I'm saying; that's why we made module removal and module loading mandatory. So we can just add separate test cases for those scenarios, where you make sure there are no bugs in the reference counting, and leave the rest of the test cases to simply assume the modules are loaded.

If I may: scsi_debug is rather old and strange in the way it's configured, in that you pass the configuration of the block device you want to create as module parameters, so it can only be configured at module load time. What if we just changed scsi_debug to be configurable via configfs?

My understanding, as was pointed out on the mailing list recently, is that that support is already there. Maybe the tests should be changed to use it and not care about loading and unloading the module.

If we have agreement here that that's the way to go, then absolutely. I mean, Christoph recently requested that blktests stop requiring modules, for instance. So that's something to consider here; we just need some consensus that this is the way to go, and then we just do it. If we have consensus here, then great. But you're pointing out that for NVMe we probably have no other option but to remove the module. If that's the case, well, the good thing is that the patient module remover is there.

Why do you need to remove the module, though?

What I understand is that they're testing that, at the time you tear the module down, you have properly torn down everything NVMe related. They want to test the reference counting in NVMe itself.

A test that covers scsi_debug's reference counting sounds like it should be a test of the scsi_debug code; that's not really an NVMe test.

No, we don't want to test the NVMe modules there.

Yeah, and I'm saying that they're not really testing what they think they're testing.

No, it's not only reference counting. We also free the workqueues and resources when we unload the modules, which are global to the transport or to the target, and removing the module exercises that cleanup.
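As an aside, here's roughly what runtime configuration through configfs looks like for null_blk, the kind of interface just suggested as the alternative to module parameters; the attribute names below reflect the null_blk configfs interface as I understand it, so treat this as a sketch:

    # Load null_blk without creating any devices via module parameters.
    modprobe null_blk nr_devices=0

    # Create and configure a device entirely at runtime through configfs.
    mkdir /sys/kernel/config/nullb/nullb0
    echo 4096 > /sys/kernel/config/nullb/nullb0/blocksize
    echo 1024 > /sys/kernel/config/nullb/nullb0/size        # size in MB
    echo 1 > /sys/kernel/config/nullb/nullb0/memory_backed
    echo 1 > /sys/kernel/config/nullb/nullb0/power          # bring the device up

    # Tear it down again without touching the module at all.
    echo 0 > /sys/kernel/config/nullb/nullb0/power
    rmdir /sys/kernel/config/nullb/nullb0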
The point is, we want to be sure that nothing has been broken in the cleanup part, because we have found bugs in the cleanup part before.

So we're talking about two different things, and let's not get too far into the weeds here. We should be making tests that test the NVMe cleanup path; that's what those tests are. And then we have this knock-on effect where, just because we happened to remove scsi_debug, it caused test failures, and that's not actually part of the test. So let's try to separate these things and make it as easy as possible to ensure we are testing the thing we care about and not introducing other side effects.

All right, moving on. It seems we have at least some consensus on the scsi_debug front; NVMe, I'll just leave alone.

Other issues and challenges? Well, it turns out that, at least in fstests, you have the .bad files and you also have the junit file. Sometimes you'll find an error present in the .bad files but not in the junit file, so if you're only processing the junit XML file, you're probably missing failures. Likewise, the other way around: if you're only looking at .bad files, sometimes you'll see the failure in the junit file but not in the .bad files. So you actually have to check for both (the sketch after this discussion folds that check into a loop). blktests is a bit better, but it's new and there aren't as many tests as in fstests. One of the things that's really nice about blktests, though, is that it also captures a per-test dmesg file, and I think we should be doing that in fstests as well. Ted?

Yeah, I think we need to be a little bit clear about what's going on here. If there is an nnn.bad file but the test is not marked as bad in the junit file, which probably also means it's not marked as bad when you look at the text output, that's arguably a test bug, and we should just fix the test. We can try to make the test runners a little bit more robust, but let's just fix the test.

Yeah, that's a problem; that's definitely just a bug. We need to fix that shit. Great.

Regarding the dmesg: fstests has KEEP_DMESG. Set KEEP_DMESG and you have a per-test .dmesg file to check.

Yeah, and it saves it when there is an error, right? Our CI stuff saves it, so you can look at the messages when a test fails.

All right, so I think no one is opposed to adding the .dmesg files to fstests, right?

It already exists. Right now it will generate the file if it detects a warning or a leak or whatever. If you look at the btrfs results, you can see where we have dmesg files uploaded to the results website. But there's also another option to say always save the dmesg files, so you can turn that on. By default, though, it only generates the dmesg file if it detects a warning, a leak, all that variety of things.

Great. And I copied that for blktests; blktests does the same thing. I'm pretty sure it only saves the dmesg if it found something bad in there, but I don't have an option to always save it. If you want that, we can add it.

Great, great. All right, moving on. To address this non-determinism in fstests and blktests, in kdevops you can configure a steady-state goal, which is basically how many times you want to run fstests or blktests in a loop.
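Here's a minimal sketch of that steady-state idea, not kdevops' actual implementation: a configuration only counts as a stable baseline if the whole run passes N consecutive times, and, given the reporting mismatch above, a run is treated as failed if either check's exit status is non-zero or any .out.bad file was produced. The paths are illustrative:

    #!/bin/bash
    # Steady-state baseline sketch (kdevops drives this through Kconfig
    # and Ansible rather than a loop like this).
    n=100
    results=/path/to/xfstests/results
    for i in $(seq 1 "$n"); do
        rm -rf "$results"
        if ! ./check -g auto || find "$results" -name '*.out.bad' | grep -q .; then
            echo "baseline broken on iteration $i of $n"
            exit 1
        fi
    done
    echo "steady-state goal of $n clean iterations reached"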
Just to give you guys an idea: for a setting of 100, it takes about five to six days to run that. Ideally, ideally you would... are there comments or questions?

What device? What file system?

This is all file systems.

No, but for five or six days, you're running on what device?

I'm sorry, can you say that again?

The fstests runs in loops, you say five or six days; what's the underlying device for the test?

Oh, so there's a bit of architectural design here that has to be considered. I basically use loopback devices and truncated files, and I test with both XFS and btrfs as the backing file system for the guests; the sparse files are created on XFS and on btrfs. It turns out that if you run fstests in a loop with XFS on the host, you sometimes get results that differ from btrfs, so I'm actually testing both, and I use both XFS and btrfs to get more coverage.

You're saying XFS and btrfs in the guests that are running the test, or testing XFS and btrfs on the whole...?

I'll explain. Say there's one server that has data1 and data2 partitions on the host; data1, for instance, is XFS, and data2 is btrfs. I then create truncated files on data1 to use as NVMe drives for the guest. Inside the guest I then have those NVMe drives, and I mkfs and mount them using XFS or btrfs.

Okay, that makes sense, thanks.

I think I'm going to guess what question Damien was really asking, or just ask my own: what could we do to get that down from five or six days to two hours?

I'm glad you asked; I have some crazy ideas there. When I mentioned those crazy ideas, some people were kind of, "I'm not sure that'll work," but I have some ideas, and I'm not sure we'll have enough time to get to some of them.

But what if, instead of being crazy, we just threw hardware at it?

Yeah. So my question, actually, the context was: five or six days, 200 runs, when I have drives that are so big that it takes a full day to do a single write pass. So I can't even do a full run in one day. That's just where we are, depending on the device.

Yeah, I know, I totally understand; you're right, this is all on a virtualized environment for sure. I haven't covered direct raw access to devices yet, but that can easily be done, given that you can just use PCI passthrough.

How large is the drive you're using? Would it be possible to just move it into a RAM disk?

In practice, in the worst-case scenario for testing all file systems, you need, give or take, about 50 gigs.

You can run it in a RAM disk.

What's that?

So you can run it in a RAM disk.

You can, but here's a funny thing, to address your question too: I actually did that, and there's no gain in running it in memory. The reason is that fstests is so slow because it wasn't designed to be parallelized; there's no gain in running it in memory.

Personally, I like running tests going over 16 terabytes in capacity, because you can actually catch overflow bugs with that: 4K times 4G is the 16 terabytes.

It would definitely be great to optimize fstests to be parallelized, but I think we need to think about a proper architecture for fstests. And if folks are okay with that, then by all means, let's just fucking do it, you know.

This is great work, thanks. When you run fstests and then establish a baseline, are you running the auto group or are you running all the tests?

For which file system is this?

For XFS and btrfs, for example.
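To make the setup just described concrete, here's a minimal sketch of backing a guest NVMe drive with a truncated sparse file on the host; the file names and the abridged QEMU invocation are illustrative, not kdevops' actual provisioning:

    # On the host: data1 is an XFS mount, data2 is btrfs.
    # Create a sparse 50G file to back a guest NVMe drive.
    truncate -s 50G /data1/vm1-nvme0.img

    # Attach it to a guest as an emulated NVMe drive (only the relevant
    # QEMU arguments shown; machine, memory, and boot options elided).
    qemu-system-x86_64 \
        -drive file=/data1/vm1-nvme0.img,if=none,id=nvme0,format=raw \
        -device nvme,drive=nvme0,serial=vm1nvme0

    # Inside the guest, the device shows up as /dev/nvme0n1 and can be
    # formatted with whatever file system is under test.
    mkfs.xfs /dev/nvme0n1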
So for XFS, these are configurations that have had failures, and, pretty much, I haven't looked into this in a while, but I think there are maybe eight profiles, each with different XFS parameters. So pretty much every known configuration that is important. The only one I'm not testing right now, but will soon, is the real-time configuration, because I'll have an interest in the real-time stuff in a bit. For btrfs, I welcome patches; right now I'm just testing the defaults. It should be really easy to add new configurations, to add support for new test profiles, what I call sections; it's pretty much just a Kconfig patch. (A sketch of what a section looks like in fstests terms follows this exchange.)

I think one other thing that's probably worth making sure people understand is that it really depends on what your goals are for running tests. What I perceive is that if you are a QA person striving for the platonic ideal of zero bugs, then yes, you may have to do a huge number of baselines, and you're interested in bugs that only show up one in 800 runs. At a company, the problem is that I could invest a lot of money trying to find those bugs, but I don't have the headcount of software engineers to run down all of those bugs. So I think we need to be very, very clear about what we do. In our environment, we run tests constantly on the hardware that we will actually use in production, because those are the bugs we care about. Now, I'm not pretending that I'm trying to find all bugs; I'm trying to find the bugs that will matter when I'm running a kernel in production at my company. I just have different goals than you do: my goal for the set of tests I use for work is to maximize bang for the buck, the highest-quality kernel I can afford given the headcount I'm allocated. And it's a slightly different thing, because, yes, if you told me there's a bug in ext4 that only triggers one in 800 times, I will ignore you, because I've got bugs that fail more frequently than that, and I don't have the bandwidth to handle the ones that fail 15% of the time. So I think this is really good work, and I'm glad you're doing it, but I think we also need to remember that what may be appropriate for a developer validating commits they plan to send to Linus in a pull request may be different from that sort of QA effort.

Time is limited, so I just want to move on, but I'd like to say that yes, you're right; it's basically a matter of managing what your priorities are, for sure. What I also want to convey clearly, though, is that it takes resources, as you're indicating, to do this properly. So yes, completely.

Moving on a bit: establishing a baseline for a new file system will take about one to two months. When you don't have a baseline, a public baseline, just consider it technical debt that we have in the community, because if we did have one, we could just move on. And as Ted is indicating, it takes time to go and process these failures; who's going to look at them? If you stop every single time to go try to address the issue, then you can never finish running this loop. So one of the things you have to do as well is what I call a lazy baseline; I'm not sure it's on this slide, but I'll mention it now.
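First, as promised above, here's what a "section" amounts to in fstests' own configuration format: local.config can define multiple sections, each with its own mkfs and mount parameters, selected at run time with check's -s option. The device paths and option values here are illustrative:

    # local.config for fstests: two XFS profiles as config sections.
    [xfs_4k]
    FSTYP=xfs
    TEST_DEV=/dev/nvme0n1
    TEST_DIR=/mnt/test
    SCRATCH_DEV=/dev/nvme1n1
    SCRATCH_MNT=/mnt/scratch
    MKFS_OPTIONS="-f -b size=4096"

    [xfs_1k_quota]
    FSTYP=xfs
    TEST_DEV=/dev/nvme0n1
    TEST_DIR=/mnt/test
    SCRATCH_DEV=/dev/nvme1n1
    SCRATCH_MNT=/mnt/scratch
    MKFS_OPTIONS="-f -b size=1024"
    MOUNT_OPTIONS="-o usrquota,grpquota"

Running ./check -s xfs_4k -s xfs_1k_quota -g auto would then run the auto group once per section.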
Basically, a lazy baseline just means that once you have a failure detected in at least two types of configurations for the file system you're testing, you expunge the test for all those sections and just move on with life. You eventually should look into what the hell happened, but, like I said, reporting bugs is also 25% of the time. And again, this needs to be collaborative.

Yes. Could you do a BoF, for example, at Plumbers or some conference, where you go: hey, here is how we use kdevops as a developer, as a file system developer, for example, or as a VFS developer? I think this would be really useful.

Yeah, I did submit my talk to Plumbers, so I will be talking about kdevops there and giving a demo.

Cool, because I have my own hacky way of running xfstests, but it would be great to have something where I really have an automatic baseline.

Well, this is also an invitation: if folks really are interested in collaborating on this front, fortunately my employer has provided a server that we can use to collaborate on this baseline for different file systems. So if you do want to collaborate on this, please let me know after this talk and we can move forward with that. But yeah, I'll definitely be giving a talk about kdevops at Plumbers if it gets accepted.

So, like I said, stable is also volunteer-based. I know some subsystems do the whole AUTOSEL stuff, and if it works for them, great; but if you have someone like Chinner complaining about stuff, then obviously things are not going to be merged easily for that file system. So we need to do proper diligence, and for me, proper diligence is having a baseline with a confidence of at least 100 loops for fstests, and for blktests as well. I'm a bit anal about testing, if you can't tell.

Another issue today is that we skip tests that are not applicable, but we should probably not just skip them. If we know that they should fail, we should actually run them and annotate the failure we expect, because otherwise we're missing a testing opportunity. If we have consensus here on that, then we'll just go ahead and fix those issues.

What do you mean by that? What exactly would you be testing for?

Well, one example, for instance: you shouldn't do a random write on a sequential ZNS drive; it should fail. Well, I don't want to crash things, right? But hey, maybe I should test it to ensure that it does fail.

That would mean introducing a bug in your file system code on purpose to test that.

What's that?

I think at that point you just add a separate test, a specific test for that.

Yeah, sure. Okay, so do we not have any way to express the failure we should expect if we run the test?

No, I think that introduces too many questions, like: did it fail because it failed to do the write, or did it fail because some other thing failed? When we're talking about that sort of thing, I'd really rather we go out of our way to make sure the thing that failed is the thing we meant to fail.

Yeah. If you want an example: for zonefs, I do have such a test, where I use dd to write a sector that shouldn't be written, then run a write on the file, and it should be working, but the file system fails, and that tests the error path in the file system. But that failure is expected, and what I'm testing is that the recovery is successful. So that's not really the same as what you were saying here.

Great, thank you.
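Going back to the lazy baseline for a moment: in fstests terms, it's just a per-section expunge list handed to check. The file name, test names, and annotations here are placeholders:

    # expunges/xfs_4k.txt: known failures for this section, with notes.
    generic/NNN   # fails ~1/300, reported, not yet analyzed
    xfs/MMM       # fails on 2+ configs, expunged everywhere (lazy baseline)

Running ./check -s xfs_4k -E expunges/xfs_4k.txt -g auto would then run the auto group for that section while skipping the expunged tests, assuming check's -E exclude-file option.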
So, one of the last things I'll just mention, because I think we're out of time: there's a desire to also have presentation of results. I'm hoping someone who does have a nice shiny presentation for results might be willing to volunteer on some of this stuff.

I wouldn't call it shiny, but it's there.

Hey, that's better than nothing.

It's better than nothing, that's for damn sure. I can present that later.

Fantastic.

For file systems that track this, we really ought to have a link somewhere in xfstests or on a wiki. Like with btrfs: I know what btrfs expects, because I have five or six wiki pages for different servers, what to expect and how to set it up and all that. But for every file system, we need a central place where I can find, for ext4 and all the other file systems, wherever they keep their wiki, what they expect.

All right, let's wrap this shit up. Let's meet at 4, because there's nothing scheduled there, and we can kind of hash all this stuff out as a BoF and not bore everybody else to death. Thank you.