Hi, good morning everyone. This session has three presenters: myself, Amir; Leah from Google; and Chandan from Oracle, who is joining remotely. Chandan, are you there? "I'm here." Hi, Chandan.

We are the three XFS stable maintainers, each responsible for a different LTS: I do 5.10, Leah does 5.15, and Chandan takes care of 5.4. We're here to present XFS stable maintenance as a case study, because you all have similar issues, and I was very happy to see that Sasha is here with us today, so I'm not preaching to the choir. That's one good thing.

I wanted to start by showing a retrospective of what happened with XFS in stable kernels over the past five years. This is a plot of the number of backports per LTS release. It's loosely a time series; it's actually a release-number series, but I tried to align them as much as I could. We can see some drama going on here, right? I'm going to talk about what happened and why this plot looks so hectic. You can clearly see that five years ago there was an okay period for maintaining backports of XFS to stable trees, and around year zero, or whatever that is, around the release of 5.10, things came to a halt.

So, a little bit of history: what happened there? If you look closely around minus 100 on the axis, you'll see a slowdown in the 4.14 backports and also a little bit of a shaky start for 4.19. Around that time there was another clash between the XFS maintainers and the stable tree maintainers, where the XFS maintainers said: you cannot backport stuff without testing properly, and testing properly is this, and they defined what "this" means. Then Sasha had to do some work to accommodate the newly defined requirements, and that was the period of slowing down. At the same time, Luis started polishing up kdevops, and they both gave a talk about this at LSFMM 2019. Around the start of 4.19 is when Luis started to run tests with kdevops and post some patches. After a while, Sasha got his setup going, finished the learning curve, got fstests into his regression testing, and AUTOSEL was working fine.

What happened around the start of 5.10 was a combination of a few things. There weren't any big disputes; there was one small incident, mentioned on LWN, and there's a link to it from my talk on stable backports last year. That small incident had to do with the linger time for taking patches, an issue that was just discussed again, and Sasha even made some changes to let patches linger longer before they go to stable. But it wasn't the main issue. The main issue was that Sasha just couldn't keep up with maintaining the fstests runs and the test infrastructure in a way that was sustainable. It took too much of his time, of his private time, and there was also an employer change at that time that contributed. And then there was nobody to do the job. I mean, the requirements were set, and a high bar was set for what it takes to get XFS patches into stable, but no one was there to do the work. So for two years, XFS in stable was on life support.

I should say that, about the users of those distributions... I don't know exactly who is using 5.10: Ubuntu, Microsoft, there are several distributions using 5.10. For the users of those distributions, the promise is that they get a stable kernel, and they are using XFS, and XFS is a well-maintained file system.
So how would they know they're getting a non-maintained file system? Nobody told them.

Anyway, two years later, at LSFMM last year, we met up. Of course, we had discussed this before; there was work going on behind the scenes between me, Luis, and Leah. We sat down, together with Darrick, and we set up a system where we have three maintainers, each taking care of testing their own kernel, specializing in one LTS kernel, and we started backporting patches, in what hopefully looks like a sustainable process. We have the yellow line here, of Chandan, making all of us look bad.

But one thing I want to say, and you don't see this in the plot: the red plot at the bottom is Leah's work, and it is backed by corporate Google, which is contributing resources and of course manpower to a distribution, the Google Cloud OS distribution. They had a business need to make this happen. The same goes for the yellow line: it's backed by Oracle and driven by the business need of the Unbreakable Enterprise Kernel release, and Oracle puts cloud resources into it. The green line, on the other hand, is community resources. I started doing this for my employer when they were using 5.10, but right now it's volunteer work, running on community resources contributed by Samsung, plus the work that Luis is doing. Without getting kdevops and a machine, I could have all the good will in the world to do this work, but I wouldn't be able to. And that's what was happening before we started: I was doing the backports for my employer, but I couldn't do the community work. So there's a lesson learned here: we need the companies and distributions that benefit to support this, but also to contribute community resources, so we can deal with orphan releases like 5.10. And we'll also need to take care of this new one.

Yeah, so I think a lot of you are probably already aware of the problem. Right now, a lot of the backports are very ad hoc: authors will sometimes backport to some branches, or all the branches, or sometimes not care. Then there are the Fixes tags and Cc: stable, but again, that's hit or miss. And then there's AUTOSEL, which is going, but even with the automatic backporting, a lot of patches that don't apply just get dropped, because no one follows up with them. And possibly even worse, patches might apply even with a critical prerequisite missing, and that would totally slide under the radar. And also, there's little testing.

So what we've been doing is pretty simple. We designate one person per tree, just so someone can be more familiar with that specific stable tree and keep an eye on things. Then we group things into batches, run a set of tests on them, and send them out to the linux-xfs mailing list with the testing results. Everyone can take a look, we get some people to sign off on it, and then, if no one yells, we send it to the stable list. As you've already seen from the chart, several hundred patches have already been backported through this. And as we mentioned, there will be a session on kdevops; there have been testing improvements to both kdevops and gce-xfstests, which I think are the primary testing tools used for backporting.
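To make that batch workflow concrete, here is a minimal sketch of one round, assuming the process described above; the branch names, commit ids, and exact test invocation are illustrative, not the maintainers' actual commands:

```
# Hypothetical example of one backport batch (names and ids illustrative).
git checkout -b xfs-5.15-backports v5.15.y         # start from the LTS tree
git cherry-pick -x <upstream-sha1> <upstream-sha2> # -x records the upstream origin
# ...resolve conflicts, pull in any prerequisite commits...

# Run the agreed xfstests coverage (e.g. via kdevops or gce-xfstests),
# then post the candidate series together with the test results:
git format-patch --cover-letter -o outgoing/ v5.15.y..
git send-email --to=linux-xfs@vger.kernel.org outgoing/*
```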
Hey everybody. So, obviously, I got this slide into the presentation, but what Oracle management kind of wanted me to do at LSF for this talk is to tell everybody a little bit more about what we've been up to with the LTS kernels.

Basically, in the old days, we would just kind of take a kernel, branch it off, apply random patches to it as necessary, and ship it to customers, which is basically the classic Linux distributor model. But we have reworked all of our processes to make it so that the Unbreakable Enterprise Kernel is basically hitched directly onto the LTS trains. So nowadays we actually do just follow the stream of LTS backports, and everything that goes into them eventually ends up in a UEK release, whenever we get around to doing one. The standard Oracle disclaimer here: I'm not talking about future products, blah blah blah, other than in the most vague, hand-waving terms: there will be more UEK kernels at some point, I guess.

Anyways, one thing that I feel has been complained about a little by the stable maintainers is that there aren't a lot of people who are willing to stand up and say: yes, we use the LTS kernels; yes, we build them into our products; yes, we are willing to fund development to do backporting work and whatnot, and upstream work too. So here we are: Oracle Linux uses LTS. We pick up the odd-year LTSes for UEK, and we are totally willing to fund at least the parts of the stable backporting work that we actually have any expertise in. I mean, we're probably not going to end up backporting, I don't know, AMD GPU or anything like that, but at least for the XFS and storage parts, we do take care of some bits and pieces.

We've actually integrated the LTS process into our own UEK backporting process, to the point where it's now actually easier to get things into UEK by simply taking the upstream patches and getting them backported into whatever the target LTS kernel is than it is to go through our whole entire bug-filing process and bug trees. And no, I don't particularly like doing the old bug process, so it's really nice to be able to just see things slide in like that. Let's see... do you mind advancing to the next slide?

So, we would like to continue to ensure LTS kernels stay current for a while. Now, I've kind of heard some mumblings about shortening the LTS release window; I think it's like six years or something like that, and slowly decreasing. We would like it to stay at least a couple of years, because frankly it takes a while for us to get customers actually onto a new UEK. Since I sometimes work directly with customers, but often don't, I'm not exactly sure how long it takes, but I know that it's more than a few months, and a year or two would be a reasonable guess as my best estimate of how long it actually takes customers to end up on a new UEK.
So hopefully the support window for them does not shrink too much. But we also recognize that all of this stuff takes a considerable amount of engineer time and some amount of cloud resources; conveniently, we're a cloud vendor. So we hope to find some kind of balance somewhere in there.

Let's see, other bits and pieces. I think Ted's been talking about how he's interested in having a stable team for ext4, similar to what we do for XFS. So I just want to say that, as the upstream XFS maintainer, I am really, really, really grateful to Amir and Leah and Chandan for taking on this task of familiarizing themselves with a particular LTS kernel release, backporting things, and running things through QA. Because as it is, I can just about keep up with upstream, and part of that weird plateau in the graphs on the first slide was just me not scaling and not keeping up with anything.

Anyways, Oracle is willing to contribute to making various bits of per-subsystem or per-maintainer stable patches happen. Obviously these things come with some limits: we do not want to have random on-disk format changes backported into stable. Generally that's not that hard, because the on-disk changes are really obviously large chunks of patches. Internally, we've kind of talked about whether it's better to cherry-pick backports, like we're doing right now, or whether it would be better to try to forklift entire releases into old kernels. Of course, the standard answer to that is: LOL, folios. So I don't know that we could even really do that; I think that's just too much stuff to backport. Anyways, that is my contribution to today's presentation.

Just to address the "LOL, folios": I am totally up for maintaining a folio compatibility layer for old kernels, if this is something that people want in general. I mean, it wouldn't necessarily be just XFS that benefits from this; if somebody wanted to forklift ext4 or btrfs into an old kernel, talk to me and I will be interested in talking back to you.

Actually, it's not just forklifting. We're also going to run into the problem that the changes for folios are invasive enough that mechanical backporting of patches is going to be difficult without this layer that you propose. Pretty much, AUTOSEL is going to break as soon as we get enough folio changes in, because a patch that applies on a folio kernel won't apply on a non-folio kernel, and so we need some way of either coping with that, or we need a lot more work put into backports.

I mean, I can only speak for btrfs, but chances are, if there's a bug, it's going to be somewhere not folio-related, and if it's folio-related, it's because folios are there and we screwed something up. So maybe it's more useful for you guys, but I think for us there's that nice, stark divide, right? We tend to fuck up things that are not related to anything else, just us.

Yeah, what I was thinking of was an unrelated bug in a piece of code that's already been updated for folios, because now the diff doesn't apply. That was the problem I was thinking we'd have.
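As a rough illustration of the compatibility-layer idea being discussed, and nothing more than that, here is a sketch of what such a shim might look like on a pre-folio LTS kernel such as 5.10; the header name and the particular wrappers are hypothetical, and a real layer would need to cover far more of the API:

```
/*
 * compat-folio.h: hypothetical sketch of a folio shim for a pre-folio
 * kernel (e.g. 5.10), so that folio-era fixes apply with less rework.
 * Treats every folio as an order-0 page, which is a real limitation.
 */
#include <linux/mm.h>
#include <linux/pagemap.h>

struct folio { struct page page; };

static inline struct folio *page_folio(struct page *page)
{
	return container_of(page, struct folio, page);
}

static inline void folio_lock(struct folio *folio)
{
	lock_page(&folio->page);
}

static inline void folio_unlock(struct folio *folio)
{
	unlock_page(&folio->page);
}

static inline bool folio_test_uptodate(struct folio *folio)
{
	return PageUptodate(&folio->page);
}
```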
Yeah, I mean, that's essentially one of the reasons why I am very, very interested in trying to recruit people who might be interested in learning more about ext4 for some sort of ext4 stable backports team. Number one, I think it's a really great path for career progression. We have junior engineers who today are fixing spelling mistakes, or maybe trying to address syzbot bugs, but who don't actually have the background to understand why the syzbot report was being triggered. I've actually found that sometimes I'm spending more time with new engineers, for whom the first thing the company has assigned them to do is to fix syzbot reports, teaching them that hammering on it until the warning doesn't trigger, without actually fixing the underlying bug, is not useful. And it occurs to me that if there are companies that simply want to get their engineers up to speed on a file system, this gives them a pathway to contribute in a structured way to the community. The advantage of doing backports is that someone has already fixed the bug; it's simply a matter of transplanting the bug fix into an older kernel. It's a great way for them to get up to speed, and it's a great way of expanding the development community for that file system. So that's one.

And then the other is the folio changes, and sometimes it's not even the folio changes proper; it's people saying, well, while we're making all these changes, let's clean up the function signature, so instead of returning a boolean we return an error pointer. Then the patch applies cleanly, but in fact will horrendously malfunction. We had one of these this last merge window, where the ext4 tree worked just fine in isolation, and the mm tree worked just fine in isolation, but when you merged the two, the result failed, and there was no merge conflict. And we've had another situation in ext4 where a missing prerequisite patch, which changed the semantic meaning of a function but didn't change the function signature, meant that "it builds, ship it" wasn't going to catch the problem. We had to revert the patch in the stable tree and then reapply it after applying the prerequisite patch. I think some of this is one of the reasons why the XFS community really decided to go down the stable-maintainers path, and I've started to see it more with ext4.

And I've certainly seen problems where I just haven't had the time, so a critical bug fix patch didn't apply, and Greg and Sasha very dutifully sent me an email saying "this patch didn't apply, so we dropped it; we tried to backport it and we couldn't," and I just didn't have the bandwidth to backport the patch myself, and it fell on the floor. If it's a critical bug fix or security fix, that's not particularly good for the companies that are relying on stable for keeping up with bug fixes, right? Some customers demand that you remediate high-severity CVEs within a short window of time.
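For the recovery just described, where a backport misbehaved because its prerequisite was missing, the mechanics in the stable tree are conceptually simple; a hypothetical sketch, with all commit ids illustrative:

```
# Back out the misbehaving backport first.
git revert <sha-of-broken-backport>
# Bring in the prerequisite that changed the function's semantics...
git cherry-pick -x <upstream-prerequisite-sha>
# ...then reapply the original fix on top of it.
git cherry-pick -x <upstream-fix-sha>
```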
To the extent that automated backports solve that problem, that's great. But I have a concern that the churn rate has gone up so much in file systems. Maybe this is not a new problem, because other parts of the kernel have had a lot of API churn, but I've noticed it's becoming more of an issue in the file system space. And that's one of the reasons why I'm seriously thinking about trying to recruit for an ext4 stable backports team, much like what has been described here. This may not be the best place to actually do the recruiting, but if you know of some people who are interested in learning about file systems, and are trying to find a way to take that next step beyond "my first patch", have them contact me.

Yeah, and another thing with that is that it's a very bite-sized way to learn about things, because you get sets of patches and you can dig into just those areas at a time, so it's not as overwhelming if you aren't extremely familiar with the area. And also, if you aren't on the latest, like 6.1, but are, say, on 5.10 or 5.4, you're kind of following along with the patches that are already going into the releases ahead of you. So it is a very approachable thing to get into.

The other thing we have left to do is to make it easier to identify potential backports. This does end up taking a lot of time; I think Chandan mentioned it took him some time too. I'm not quite sure of the best route for that, besides just manually combing through things. I know there are some opinions against adding Cc: stable or Fixes tags, but adding those things would at least be a red flag for us to look at, to make sure we don't miss them even if they aren't automatically pulled in through the automation.
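For reference, the annotations just mentioned are ordinary commit-message trailers; a commit carrying them might end like this, with the hash and subject invented for the example:

```
Fixes: 1a2b3c4d5e6f ("xfs: subject of the commit that introduced the bug")
Cc: stable@vger.kernel.org # 5.10+
```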
The other thing is unifying acceptance testing procedures. There were discussions a while back, over some emails, about how many runs to do, which configurations to test, and things like that, and there were varying degrees of what people are comfortable with. So it would be very nice to find the minimum amount of testing we need to do that makes everyone happy, so we aren't just wasteful with resources.

I guess just one idea that might help: even though some patches are not tagged with Cc: stable, and even though the automation is not used now and we have proper testing requirements for patches, AUTOSEL is still a tool. I am curious whether one of the things that might help would be to get an output of AUTOSEL candidates that could be reviewed by the LTS team. That would just be a tooling exercise, right? Kind of like: hey, can you guys help provide at least the candidates? It wouldn't mean that you automatically backport them, or that you backport them at all; it's just information. Is that a possibility?

Yeah, I'm totally supportive. This is something that was already asked for before, and KVM and parts of x86 already get that: we send them stuff with a manual AUTOSEL tag, and we saw that in the past few months their actual numbers were reduced quite drastically. I don't know why; maybe those subsystems became better at tagging things as a result. But that infrastructure already exists.

Okay, and is it possible to reproduce this entire environment, so that one can go and gather the information ourselves?

AUTOSEL is a massive pile of code that's still running on this old Azure VM.

All right. Can I work with you to try to get that to be a possibility, so that no one has to rely on, you know, Sasha or whatever, but can go and set this up themselves?

All right. Just saying that we are a bit over time.

So, there is one elephant in the room here, which is that Darrick put his hand up and said Oracle is assisting with this work because they use the LTS kernels. But we have several other distributions who also have huge backport teams, who need to identify all the bugs before they backport them and then actually do the backports, and who could potentially contribute to this effort as well. I'm looking at, sort of, Red Hat and SUSE and Android. Ted, you have a massive backport team in Android for file system stuff, don't you? But anyway, is there a way we could actually pool resources across the distributions?

I also want to mention, from the five-year retrospective, that a lot more distributions, Oracle and SUSE both, have since synced to the LTS. So from a minority of distributions that followed LTS, now, as far as I know, there's only Red Hat that's not.

Well, SUSE is not on 5.15, and I'm not saying they're following LTS per se; SUSE enterprise kernels are currently on 5.14. So we do backports to 5.14, and we do a lot of them, which kind of does a lot of the work that's needed for 5.15 or 5.10, but it's not quite there. And 5.14 is not going to change, obviously. For the future, the plan is that we do a kernel version upgrade every other service pack, and that means that whatever is released at the time, and is stable enough that it mostly passes our long-term testing, we hook onto that and then maintain it for whatever it takes. Actually, our trees are available, so whoever wants to can start their backporting from them. And we heavily rely on Fixes tags; we have a whole infrastructure internally to backport and evaluate most of those.

All right, we're calling time, guys; that's four minutes over. Wrap it up.

I think the rest of it has been covered, so just a summary. Ted already gave a pitch for other file systems. The benefits: healthier trees and efficiency. It does require more testing, but by batching things together you save time, and by working simultaneously with the other stable maintainers you save time. So if you are looking for better coverage, better testing, and catching more of the bug fixes, this is the best way to do it that we've found so far.

As far as making it easier to identify potential backports goes, I think it's generally helpful either to put a Fixes tag on the commit itself, or to push a new test case into fstests and mark it as a regression, using all that tagging that Amir added to say that when this test fails, it suggests that maybe you should think about backporting such-and-such a patch. That has also been helpful for actually watching those things go by.
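As an illustration of that fstests tagging, a regression test can declare which commit fixes the failure it exercises; a sketch using what I believe is the `_fixed_by_git_commit` helper, with the commit id and subject invented here and the exact signature an assumption:

```
# In the test's preamble, after sourcing common/rc:
_fixed_by_git_commit kernel 1a2b3c4d5e6f \
	"xfs: fix the regression this test exercises"
```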