Hi, I'm Andrew. This meeting is being recorded. Oh, okay. Hi, I'm Andrew, and this is my colleague Dylan, and we're going to talk about some of the work we've been doing at Google to make our kernel development more upstream friendly. Today we'll go over kernel development at Google and its issues, present our solution and its generally applicable themes, in the hope that it helps other people pull out ways they can apply this in their own work, and end with our current progress and what we hope to achieve looking forward.

So the first thing we'll talk about is how kernel development works at Google, more so in the past and a bit of the present; we're trying to change it, so this is a little backwards looking now. We have a kernel called prod kernel. It's a fork of the Linux kernel that Google deploys on its production systems, so it's what's running in all the servers in Google's data centers. It adds about 9,000 patches on top of the upstream Linux kernel. These patches do a bunch of different things: internal APIs, specialized hardware support, performance optimizations, and some specific security work. But they're things that, for one reason or another, never made it upstream. About every two years we try to rebase all these patches over a two-year code-base delta, and as you can imagine that's a pretty hefty task, taking 9,000 patches and pulling them across two years, especially because some of them are pretty intrusive in core subsystems.

So why do we have prod kernel? We had internal needs and timelines that necessitated having our own fork of Linux. Some examples: we have a method for setting quality of service for outgoing network traffic, specific rules for OOM kills, a cooperative scheduling API that is actually trying to go upstream now, called UMCG, and in perf we have a way to disable sampling of the user stack because it could contain user data. Some of these things probably could have been sent upstream, but again, for one reason or another, they did not make it.

Now, some problems we observed with the current process, which inform how we want to change it. The main hurdle is that Google-made features are currently developed and tested in prod kernel, which is two years behind upstream. This presents two major hurdles to getting patches upstream: we have to rebase the feature across a multi-year delta to get it to upstream, and even if we do that, we need to retest it on its newly rebased base. So imagine you have a patch set based on the kernel from two years ago, tested in Google production workloads; to get it upstream you have to rebase it two years ahead to the present, and now it's basically untested. While a feature might have been validated against Google production workloads on its old base, there's no good way to replicate that on the new upstream base without the rest of prod kernel, because you tested the old patch set on top of prod kernel with all the other features there, and now you're testing it in a totally different environment with a totally different kernel. That presents a pretty major hurdle in terms of testing. Bug discovery is another thing that hurts us.
So we have complex workloads at large scale (oh, sorry, did someone say something?), which sets us up to discover lots of system bottlenecks, deadlocks, and other things like that. The nature of our rebase means there's a very large delay between discovery and diagnosis, in the sense that we can find bugs on our old kernels, since that's what we're running, but not on current upstream kernels. So even if we do find a bug, it doesn't really benefit many people outside of us, even if we fix it, because we're fixing a bug that's present in our years-old kernel. It also means we can't benefit from upstream help: even if we said, "we found this bug on this kernel," the rest of the world is currently much past that. And by staying on one major version for an extended period, a bug could have been fixed upstream, but we won't benefit until we rebase again or we manually find the fix and backport it.

Another thing that bites us is platform support and backporting. As upstream adds support for new platforms, we have two choices: either backport the patches for those platforms, or go without that platform support until we rebase. This issue generalizes to all backports: we need to backport over a large delta, and even then the patches aren't being tested against the same kernel version they were developed for, so we encounter bugs that might not even be applicable to upstream.

And just in terms of resource cost, rebasing over such a large delta is extremely expensive. Individual patches end up having their conflicts resolved against the new upstream base; for some patches that isn't too difficult, but for the more intrusive ones there are a lot of non-trivial conflicts to figure out. Additionally, the entire kernel ends up having to be re-qualified against Google's workloads on the new base. This opens up a very large search space when we find bugs: we might rebase some patches, throw tests at them, and if those tests fail, is it because of something that happened in the two-year code delta, or did we rebase wrong? You're not even sure where to look first. Dependencies among these patches are also inconsistently documented, making it really hard to parallelize the rebase effort. And the delay associated with the rebase is a negative feedback loop: the longer the rebase takes, the further behind upstream you are, and the next time you try, you'll be even further behind and it'll take even longer. This is supported by statistics we've collected: every rebase takes longer and longer, we rebase more and more patches, and effectively our technical debt seems to be going up without bound. It's not just an increase in how long the rebase takes, but also in the number of patches we need to rebase, and the time delta keeps growing too. Structurally, this means our engineers are working on a years-old kernel, which, as we discussed, presents the hurdle of a possibly large rebase for any new feature they want to get upstream. More practically, everyone's time is limited, so if we're asking people to rebase, that's time they could have been spending participating upstream instead.
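To make the backporting pain described above concrete, here is a minimal sketch of the usual manual flow, assuming a hypothetical "upstream" remote, hypothetical commit SHAs, and a hypothetical prod-kernel branch name; it relies on the upstream convention that fixes carry a "Fixes:" footer naming the buggy commit.

```sh
# We know hypothetical commit abc123def456 in our old kernel is buggy.
# Search upstream history for anything that claims to fix it.
git fetch upstream master
git log upstream/master --oneline --grep='Fixes: abc123def456'

# Suppose that turns up hypothetical fix commit 789fedcba321; backport it,
# recording its upstream origin with -x for traceability.
git checkout prod-kernel-5.4
git cherry-pick -x 789fedcba321   # appends "(cherry picked from commit ...)"
```

The two-year delta is exactly what makes the cherry-pick step conflict-prone: the fix was written against code that may have changed substantially since the fork point.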
If we generalize, there are two basic issues we run into by not being close to upstream. We can't passively benefit from new upstream development: if someone puts a new feature upstream, we don't just get it for free, we have to work pretty hard to ingest it into our kernel. And by not being upstream, we also accumulate more internal patches, because we're developing against something so far from upstream that it usually doesn't make sense, from an engineer's point of view, to upstream the patches. These two feed off each other and make the problem worse. And now I'll hand it over to Dylan to discuss our hoped-for solution.

Thanks, Andrew. So what do we do about this giant technical-debt problem? Our aim is to introduce a new kind of kernel, the Icebreaker kernel, as an alternative to the current prod kernel. There are some general themes to this part of the presentation. We want to stay close to upstream, and the concrete goal there is to release an Icebreaker kernel for every major upstream version of Linux. That forces us to qualify, test, and release all of our patches at a regular cadence. We want to use this to encourage upstream contribution, the idea being that if we're close to upstream, it becomes reasonable to expect engineers to start sending things upstream and start paying down our technical debt. To do this we need to test a lot, both at the granularity of individual kernel features and with larger kernel-wide tests. The other side of this is that we want to deploy the Icebreaker kernel in production at a limited capacity, to actually qualify upstream Linux in production before it gets to prod kernel two years later.

Next slide. Oh, thanks. What makes this important is really all about upstream engagement. As Andrew alluded to before, an engineer working on a kernel feature is inherently creating technical debt, and they're faced with two options. They can propose it upstream, with an ambiguous amount of time to wait for feedback, not knowing whether it will be accepted or how much work it will take. Or they can say, "I'm not going to do any of that; at some point the kernel org will set aside time to do the rebase, and I'll do it then, with that time already set aside": a delayed but certain outcome. We want to get out of that cycle using the Icebreaker kernel. The other side of this is that we want to qualify upstream kernels against Google production workloads. We have a pretty large fleet, and it doesn't make sense to try to fix every kernel bug we discover ourselves, so it makes sense to consume upstream kernel releases, and with them other people's bug fixes. At the same time, we become better upstream citizens simply by being better Linux users: if we deploy a much newer version of Linux in our very large fleet, that gives Linux a lot more exposure and coverage, and we can help report bugs to upstream.
So how do we make this actually happen? There are two sides to this: actual development on Icebreaker, and upgrading our existing patch set onto new versions of upstream Linux. The source control model is trying its best to be optimized for developing while staying close to upstream.

The basic Icebreaker structure is that we start with an upstream release and fork off from it. We then create feature branches, where a feature branch is just vanilla Linux with some patches on top of it that constitute a feature. The nice thing about this is that these patches are more or less what you'd consider a patch series you would propose upstream. So the idea is that we're introducing into our repository a patch series that could be proposed upstream, and we're keeping it on a branch in exactly that shape. Development happens on these branches, they pick up bug fixes and so on, and eventually they get merged into a persistent subsystem staging branch. These staging branches are fully featured kernels, so they go through more testing. Eventually we do what we call a fan-in merge to our "next" branch, which is where the release of the Icebreaker kernel actually happens: everything gets merged together and qualified. Finally we do a fan-out merge back to the staging branches, so everything is reset to the universal state of Icebreaker. An important thing to note is that the fan-out does not happen to feature branches; feature branches stay isolated as vanilla Linux with just their own patches on top.

Taking a closer look at a feature branch, how do we actually do an upgrade? This happens at the feature-branch level. If we're developing on, say, 5.10 and we want to get onto 5.11, we create a new feature branch for 5.11 and then merge the 5.10 branch onto it. That creates a merge commit, and we can resolve any conflicts in that commit, the idea being that we're resolving conflicts for that one feature, not for all of prod kernel, so it's actually manageable to do in that one merge commit. This has the added benefit that we keep stable SHA-1s for past commits, because we never rebased anything; we're continuing the history of the feature. For bug fixes, say our oldest supported version is Linux 5.10 and we discover a bug there, introduced by some SHA-1. We commit a fix with a "Fixes:" footer naming that SHA-1, fix it in 5.10, and then just do another merge forward to propagate the fix into the latest version of Icebreaker. In this context the Fixes: footer is really meaningful, because the SHA-1 is totally stable: there's exactly one SHA-1 that introduced the bug and one SHA-1 that fixes it, and they remain constant through the full history of the feature. So staying close to upstream means these features can be easily upgraded and proposed upstream at any point in time.
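A minimal sketch of that merge-forward upgrade, assuming hypothetical branch names (feature/foo-5.10, feature/foo-5.11), the upstream release tags, and a hypothetical Fixes: SHA; this is the shape of the workflow the talk describes, not Google's actual tooling.

```sh
# Start the 5.11 copy of the feature from the upstream release...
git checkout -b feature/foo-5.11 v5.11
# ...and merge the 5.10 feature branch onto it. Conflicts get resolved once,
# inside this single merge commit, scoped to this one feature.
git merge feature/foo-5.10

# Nothing was rebased, so historical SHA-1s stay stable, and a later fix on
# the oldest supported branch can carry a meaningful footer such as:
#   Fixes: 1a2b3c4d5e6f ("foo: add the original helper")
git checkout feature/foo-5.10
git commit -s                 # the bug fix, with its Fixes: footer
# Propagate the fix forward the same way: merge the old branch into the new.
git checkout feature/foo-5.11
git merge feature/foo-5.10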
There's an important caveat, though: it ends up being very important that we test feature branches at the feature-branch level, because when we do this merge forward, we need to revalidate the feature at that point in time; there's a new merge commit. Having automated, actually written tests makes that much easier. Upstream contributions should be a lot more doable in this context too: to take a feature branch with all these merge commits and turn it back into a patch series for upstream is more or less just a git rebase. In the ideal scenario you just do the rebase and you're done. It doesn't always work out that simply, but at the end of the day the information about how to resolve the conflicts is stored in those merge commits, so even if it's a little complicated to resolve them again, the work has been done once before somewhere and we can extract that information. At the end of the rebase we can compare tree IDs to make sure we actually end up with the same rebased feature that we want to propose upstream; a sketch of that rebase-and-compare flow follows below.

I'll take over from Dylan and talk a little about the tooling we've built around Icebreaker. Everything we've covered about the source control model, how development works, and how upgrading works is very useful, but it would be very difficult without automation, because there are many mechanical steps involved. In theory the aforementioned ideas should work, and they make sense: you keep feature branches, which are just patch series you could send upstream, and you combine them into staging branches and a release branch to create the kernel you want. But there are many steps to take, and a developer likely doesn't want to be responsible for some of these mechanical tasks. A developer might just be concerned with developing their feature, and probably doesn't want to worry about whether it needs to be merged into the staging branch, or how to know that all their tests pass and that they didn't break anyone else's. For this to work we need some very well-oiled automation.

Starting at the smallest level, we have feature branches. When a developer uploads a patch set, we automatically run build tests across a variety of configs and architectures. This is pretty useful, because as a developer you might only check that it builds for one configuration, or maybe you have a script that runs a variety of build tests against your patch series; what we've found really useful is taking that worry out of the developer's hands, so all they have to do is upload their changes, the tests run automatically, and they can see very clearly whether they passed. We also validate the commit message and its metadata, which keeps people thinking about upstream, in the sense that the commit message should be something we'd be okay proposing upstream. All this ensures that if a developer ends up wanting to send a feature branch upstream, it's already in a very good state; likely all they have to do is rebase it to the latest version of Linux, which, as Dylan mentioned, is a much easier task than it would be in the prod kernel world.
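Here is the rebase-and-compare sketch referenced above, again with hypothetical branch names. The key idea is that after linearizing, the resulting source tree should be byte-identical to the merge-based branch's tree, which git can verify directly.

```sh
git config rerere.enabled true        # reuse recorded conflict resolutions,
                                      # if it was enabled when they were made
# Work on a throwaway branch so the merge-based history stays intact.
git checkout -b foo-upstream feature/foo-5.15

# Replay only the feature's own commits onto plain v5.15; the upgrade merge
# commits are dropped, leaving a linear series. Any conflicts here repeat
# resolutions already recorded in those merge commits.
git rebase v5.15

# Sanity check: the linear series must produce the same tree as the
# merge-based branch, so compare tree IDs.
test "$(git rev-parse foo-upstream^{tree})" = \
     "$(git rev-parse feature/foo-5.15^{tree})" \
  && echo "trees match: safe to format patches" \
  || echo "trees differ: inspect the conflict resolutions"

git format-patch v5.15 -o outgoing/   # the series to post upstream
```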
Andrew, there is a hand up; would you like to take the question? I don't know if it's for you or Dylan. Sure. Pascal Lambert, you can ask the question; go ahead and unmute.

Yeah, a question for Dylan. Just to summarize what you showed previously: in a sense, the idea of Icebreaker is to create little bundled patch sets of those features so they can be accepted into upstream, right? That's basically the whole point of the branching strategy, to have things separated. It makes it more parallelizable. Those branches can't always be sent directly upstream; they still have to be rebased. But the idea is that it's much more manageable when everything's separated into feature branches. Okay, okay, thank you.

So I have a question, Dylan, on one of the topics you presented earlier. You said you would test a feature on a feature branch. You might have two features, say, on separate branches. How do you make sure those two features interact well when you take them up to Icebreaker?

Right, so there is this other side that we didn't talk about too much, which has to do with dependencies between feature branches. Testing the isolated feature branch is important just to make sure it works on its own, but we do still test the fully merged staging branch. Then there's this other component: if we find out that merging one feature branch with another breaks one of the tests, or, pretty commonly, there's a merge conflict between feature branches because they modify code near each other, or there's an actual API dependency between two feature branches, then we say, okay, there's a real dependency between these feature branches. It's no longer quite as simple, but it's still manageable: we do a merge between the two feature branches on the actual feature branch, not just on the staging branch any longer. And when we upgrade, we manage that dependency by upgrading the feature branch that the other one depends on first, then merging it into the dependent feature branch, every time we upgrade, so we propagate the dependencies properly. So that is something we have to deal with, but I think a core assumption of Icebreaker is that there aren't too many interdependent features like that. Did that answer your question? Yes, thank you.

All right, let me find my place. Yeah, so we just talked about feature-branch testing, also in the context of the questions asked. That's one level of testing: making sure the patch set works on its own, which is good in the sense that you can propose it to upstream, point to tests, and say, "hey, I ran these against this patch set and it works." Eventually these feature branches all get merged together into a staging branch, and this is where we have our next level of testing. While we've already validated that these features work independently on their own, we still need to validate that they actually work together: like Dylan said, sometimes there are merge conflicts between features because they modify code in the same area, so when you merge them together you're actually introducing new code. What we do here is run a select subset of all our tests against these staging branches as a smoke test; a rough sketch of that merge-and-test loop is below.
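A rough sketch of that staging smoke test, assuming hypothetical branch names and a hypothetical local ./smoke_tests.sh wrapper; in practice this is driven by automation rather than a shell loop.

```sh
git checkout staging/sched-5.15
# Merge each feature branch in turn, smoke-testing after every merge so a
# failure points straight at the feature that introduced it.
for f in feature/foo-5.15 feature/bar-5.15; do
    git merge --no-ff "$f" || { echo "conflict merging $f: resolve by hand"; break; }
    ./smoke_tests.sh        || { echo "merging $f broke the smoke tests"; break; }
done
```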
What this looks like is: a subsystem maintainer will start merging individual feature branches, and on each of these merges we run this subset of tests to make sure we didn't break anything. This gives the subsystem maintainer some confidence in the health of their subsystem. They might have had to go resolve merge conflicts that introduced new code; they upload the conflict resolution, the tests run automatically against it, and when it's all green they can say, "okay, I'm confident the resolution is actually correct, and introducing feature A didn't break the tests for feature B." Going up one more level, our release branch runs all the tests we have, which is good because this is the actual source of the kernel we build and release, so we need to make sure it passes everything we expect. If we do find a failure, we can always bisect back to the faulty subsystem, and since we know which tests are failing, we can bisect back down to the actual feature-branch merge that caused the failure. This makes the job of the subsystem maintainer and the release manager easy, in the sense that all they have to do is look at the pre-submit results and see whether they're green. If there's a test failure, there's a pretty well-defined process to track down what code is causing it and where it was introduced; a sketch of that bisection is below. And again, all of this is automated: the subsystem maintainer really just runs a few commands to combine the feature branches or propose a merge into our release branch, the tests run automatically, and we can see whether they're good.
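A hedged sketch of walking a release-branch failure back to the offending merge, assuming hypothetical branch, tag, and script names. The --first-parent option (available in Git 2.29 and later) keeps the bisection on the fan-in merge commits rather than descending into the upstream history underneath them.

```sh
git checkout next-5.15
git bisect start --first-parent
git bisect bad HEAD                # the release branch tip fails
git bisect good v5.15              # the plain upstream base was fine
git bisect run ./failing_test.sh   # exits non-zero when the bug is present
# The "first bad commit" is a staging/feature merge, which names the
# subsystem and feature to hand the bug to.
git bisect reset
```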
I have a couple of questions, Andrew, in the Q&A. The first one is probably for Dylan: how do you separate the patch sets into feature branches so that they are independently valuable or even usable, especially considering you're starting with 9,000 patches? Partly answered earlier, but it still seems like a huge effort.

So, yeah, we're not doing all 9,000 at once. The initial goal for Icebreaker is to get all the API patches over, so that we can run arbitrary Google binaries on Icebreaker, even if we don't necessarily have the same performance prod kernel has. That actually reduces the number of patches that need to be separated out by quite a bit to start. Then actually separating things into feature branches is usually pretty manageable, mainly because we've already organized patches by effort; we have effort footers that say, "this patch is for this effort," so we can largely figure out which dependencies exist from that. From there it's a lot of trial and error, figuring out what the dependencies between these patches are and what makes sense as distinct feature branches, because I think defining a feature is more of an art than a science, right? I hope that answered the question. Does that answer your question, Caroline? Yes, thanks a lot. Great.

One more question, Dylan: why is the delta between the upstream kernel and prod kernel two years? Can it be reduced to something like three months? I mean, that's pretty much the goal of Icebreaker. The reason the delta is currently as big as it is, is that we just let our kernel diverge more and more each time, so every time we want to rebase it takes longer, and the amount of time between each version grows. For Icebreaker we really want to be on the most recent upstream version. Andrew, I don't know if you have anything to add to that?

So, Google adding kernel-specific patches has been happening for a while, since before Dylan and I joined, but I imagine it went something like this: originally there were just a few patches we wanted to add, and it snowballed from there. "It's easier to just add it internally than to send it upstream" is kind of a pernicious pattern: at the moment it seems like the path of least resistance, but if you keep doing it over and over, eventually you have more patches to rebase, it takes longer, you fall further behind, and there's a larger delta to rebase over. It snowballs out of control. With Icebreaker we're trying to have a clean slate, and I think we've done a pretty good job: we've been staying really close to upstream, actually coming pretty close to convergence, so I think we're meeting the goal, in the sense that we're getting closer and closer.

I think I see another hand. I was going to ask: with Icebreaker you're doing work on every kernel version, so say six versions over your two years, dealing with things within merge commits, but on your production kernel over those two years I assume you're essentially not using merge commits and doing proper rebases. So you've created two different ways of managing patches. Also, with the merge commits, for each new kernel version don't you essentially have to throw out that merge commit and recreate it, slightly different each time, with no real reason to go back to a previous version midway? It seems like there are a lot of deltas there that could be consolidated to make things a little easier.

So I think I heard two questions there: the observation that we have this prod kernel way of managing patches and this Icebreaker way of managing patches, and a second one about the merge commits, which I don't think I understood completely. Yeah, just the merge commits in general: you are capturing a delta there to allow your rebases to go on cleanly. To keep that relatively small and manageable, do you have a bunch of merge commits for different dispersed sets of commits, or one for all of the commits you're applying to that new version of the kernel?

All right, for the merge-commit part, you'll have to tell me at the end whether this answers your question. We might have a feature branch based on 5.10, and to get it to 5.11 we'll merge it with Linux 5.11, and there might be conflicts there, but they're scoped to the size of the feature. By virtue of the feature branch not being a long patch series, maybe ten or so patches, we're usually able to resolve the conflicts. Then we'll do that again for 5.12: we merge 5.12 into it. So there's never a need to go back and resolve something we've already resolved.
The only time that might happen is if we're trying to linearize the history, taking away the merge commits by rebasing onto the most recent Linux version. git rerere might not save you in that case, but you've already solved the conflicts once, so you can go back and look at them. I don't know if that answered that part of the question. That's fine, yeah; if you're only dealing with ten patches or so as your max, then it's not too bad. Okay, thanks.

And then for prod kernel and Icebreaker existing side by side: I think that's more of an organizational thing. While we're trying to get Icebreaker up and running, you can't put all your eggs in it; you need to give it time to gain traction and figure everything out. Eventually we want to shift everything over to the Icebreaker way of doing things, but we still have a concurrent prod kernel rebase going on. To add to that, I think there is kind of a liminal space: if you get your feature on board with Icebreaker and you make it to 5.15, and say the new prod kernel rebase is going to be based off 5.15, you can say, "hey, I have this feature branch, it's already on 5.15, can I just merge it into prod kernel?" Then you don't have to rebase those patches, because you've already done all of that work. So we're hoping that in the future you don't necessarily have to literally rebase every single patch: if you already have it caught up in Icebreaker, you can merge the individual feature branch into prod kernel, because you have it separated out. That's one of our big selling points: if you do this one-time effort to get your patches into Icebreaker, we have automation that will do as much as we possibly can, up to the point of resolving merge conflicts, and even those will be much, much smaller than what you'd deal with if you waited two years, because you can spread the work out over more time. For the most part, people are receptive to that. Cool, I think that's it, right?

So, yeah, we went over the automation that comes with testing: testing at every level, feature branch, staging, and release branch. But we also need some way to combine feature branches. You might ask what's so hard about merging branches together, but when you have a lot of them, it's a little annoying to have someone do it manually. So we have automated rules that merge feature branches into staging branches depending on maintainer preferences. Some maintainers want a feature branch merged into their staging branch any time its head moves; other maintainers want some control over when feature branches get merged, so we can set it up so that when a new tag appears on a feature branch, it gets merged into the staging branch. That takes care of getting things into staging branches. We also have a tool to generate an upload for proposed fan-ins of staging branches, which is pretty useful because it makes it really easy to get a lot of testing done across staging branches and subsystems. You might have concurrent development on subsystem A and subsystem B that would never be tested together until they made it to the release branch; having a tool that merges staging branches automatically makes it really easy to combine things and ensure they're actually tested together.
This automation also means the overhead after committing a patch to a feature branch is minimal. Imagine the point of view of someone solely invested in writing kernel patches, who doesn't want anything else to do with the process beyond committing to a feature branch; this is kind of an alluring proposition for them. We say: as long as you put your work on your feature branch, we have automation to take care of everything else. We can run your tests for you, your feature branch gets merged into the staging branch automatically, we make sure it makes it to our release branch, and we'll come tell you if it breaks something in another subsystem, as opposed to you having to keep an eye on it and watch for failing tests.

That takes care of the development side; we also need automation around the upgrading side. So far we've built automation to automatically resolve dependencies among feature branches. Some feature branches share dependencies: there might be a feature A that exposes some useful helper functions, and features B, C, and D all depend on it. So when we upgrade, we need to create a graph of features, so we know which ones to upgrade first; a toy sketch of that ordering is below. We also have automation to automatically attempt upgrading feature branches to the next version. The reason it can't be done fully automatically is that inevitably you run into build failures, test failures, and merge conflicts. The ideal case is that we merge the new version of Linux into a feature branch, there are no merge conflicts, all the build tests pass, and then all the other tests pass. This doesn't always happen; sometimes there's a build failure or tests start failing, and in that case we know who the feature branch owners are, so we can create a bug, assign it to them, and say, "hey, we need you to do this little bit of work to get the feature branch to the next version of Linux." I think this is a much better system than saying, "here are the patches, they need to be rebased over two years, and if tests fail you have to search across two years to figure out what's actually causing the failure." Doing it this way narrows the search space and makes it much more manageable to do every now and then, as opposed to at the tail end of two years.

And when we combine this with our testing automation, we basically get tests on these proposed upgraded feature branches for free: the automation is already set up so that uploading a patch triggers the testing, and attempting to upgrade a feature branch is basically the same thing as uploading a new patch set. When we combine it with the composition automation we have, we get combining feature branches for free too. The big idea here is that if you can build reusable automation, you can chain it and use it in slightly different but similar scenarios, meaning it's very useful to have infrastructure where you can run tests automatically and specify them easily.
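The dependency ordering mentioned above is just a topological sort, which even classic Unix can do; here is a toy sketch using tsort(1) with hypothetical feature names, standing in for whatever internal tooling actually builds the graph.

```sh
# Each input line is "<dependency> <dependent>"; the output is a valid
# upgrade order, and tsort fails loudly if there is a dependency cycle.
tsort <<'EOF'
feature/helpers feature/b
feature/helpers feature/c
feature/helpers feature/d
feature/b       feature/e
EOF
# => feature/helpers first, then b, c, d, then e: upgrade and merge each
#    branch onto the new kernel version in that order.
```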
Tying this back to our themes: for staying close to upstream, we have upgrade automation that makes it easy. I think this matters because the conflicts we encounter when staying close to upstream, merging from, say, Linux 5.11 to 5.12, aren't too terrible; people can usually solve them. The problem comes from the sheer number of feature branches and patches we're trying to move: having everyone do this manually would just take way too much time. Automation keeps it in the back of people's minds; it just runs until there's a conflict that needs a human. For encouraging upstream contribution, we have automatic testing at the feature-branch level, which makes it pretty easy, once your feature branch is in, to say, "I think I can propose this, because I'm confident it actually passes the tests I wrote." And for testing, we test at all different levels of granularity, from the release branch down to the feature branch, which gives us a good amount of confidence that even though we're merging in new code every single Linux release, it's actually still functionally correct before we deploy to prod. I'll hand it over to Dylan to talk about the current status and looking forward.

So, yeah, how are things actually going? Next slide. Right now we're on Icebreaker 5.15, and Linux 5.16 just came out, but it's looking like our time to get from one Icebreaker version to the next is quicker than an upstream release cycle, so we have some wiggle room and we think this process is actually sustainable. We're noticing that the more we automate things and the more we practice this, the better we get at it: we can upgrade quicker and quicker, even though we're introducing more and more code into Icebreaker. Looking forward, we've got to get everything fully automated. Right now we have a lot of CLI tools and an on-call rotation, and that's working okay, but it would really be great to have a persistent service, so that a member of the Icebreaker team just has to look at emails from a bot saying, "hey, there's a merge conflict here, there's a test failure here," and we don't have to worry so much about how and when we actually perform these merges. Another thing we want to get done: if we can upgrade quickly enough, we can upgrade onto Linux release candidates instead of just the actual releases, and then we can actually participate in the testing of release candidates as they come out, which we find much more exciting than just consuming the final release. And then, from here: a lot of what Icebreaker has been doing so far is just keeping up with upstream and being able to carry this technical debt. Moving forward, now that we are sufficiently caught up to upstream, we want to pay the debt down, so that we can start reducing our 9,000 patches. That will open the window to say, "okay, now that we've upstreamed a lot of Icebreaker features, we can introduce other features from prod kernel" and start paying more and more down. Did you want to do takeaways, Andrew, or should I do it?
Oh, I can do it. There seemed to be a question, if this is a good time. Yeah, sure. What kind of tests do you use to qualify the kernel, if you can share, besides the build tests for different configs, which are self-describing?

So, what's the best way to describe it? We have unit-test-style tests for, I want to say, every feature branch we have, but I know that's not true; Dylan's laughing because he knows this is an issue. We have unit-test-style testing for feature branches, and we run it on individual feature branches and then at the levels where they're combined, to make sure feature A doesn't break feature B. That's generally what we use to test before we release a kernel. Then on the release branches we have a bunch of performance tests; some of them are taken from upstream or open-source test suites, I can't recall their names, but we have a good amount from there. Beyond that, we have actual customers invested in the correctness of a lot of kernel functionality, and they give us tests that we also run. Obviously, the further away you get from testing an individual feature, the longer the tests usually take to run and the harder they are to debug, so there's a big spectrum of tests we run.

And I think there is a big problem with our current testing infrastructure, which we sort of inherited from prod kernel: it is an internal test suite that has to be separated out onto its own feature branch because of that, and it's causing a lot of problems right now. We would really like to move away from that and toward using kernel selftests, especially for feature branches: to be able to develop your feature, commit your tests into selftests on the exact same feature branch, and build it, run it, boot it, and do everything on the feature branch on its own would, I think, be super powerful. We're just not quite there yet.

That goal is music to my ears, because we have KUnit and kselftest, you know, so that's great. So I have another question: when you run tests on your feature branches, do they specifically target the features, or do you run a common set of tests across all your feature branches plus additional focused feature tests?

So, for the feature branches, there is some testing that's applicable just in general, for the kernel source the feature branch sits on top of; there's stuff that can still be tested that we don't want to break, like making sure the kernel still reboots. But usually, in general, it's feature-specific testing for that specific feature on its feature branch. And I guess in theory we could be running the full selftest suite on the feature branch, right? It probably won't cover your product-specific tests, but at least it makes sure you get the kernel coverage. Right, right. And is the full test suite run against the integration branch, as you mentioned, Dylan, once you have kind of an integration branch inside Icebreaker? Sorry, what was the question? I saw in your branching that you had a branch that was integrating all the features.
Yeah, yeah. So on the staging branch we can rerun all of the individual feature-specific tests, and at that point we can also run integration tests. Currently it's not the case that we run all the tests against the staging branch: we run pretty much all of our core feature-specific tests on staging branches and on our next branch, and then through our release process we end up running the bigger, customer-level integration tests against the release branch. Some of it is a timing issue: you don't want to force a subsystem maintainer to wait hours and hours for test results to come back, so it's a trade-off between how quickly you want them to be confident in their integration versus whether it's okay to make them wait a full day for everything to come back. That makes sense.

Yeah, right. So if you were to run KUnit or kselftest on those feature branches, it probably wouldn't add that much time, and it covers a lot of the kernel features you might be looking at. KUnit especially, since it's a UML kind of build; it'll help you with targeting certain features, if that's something you could consider. Yes, it's a balance of time: how long do you run the tests, and the ROI on that time. I totally get it, because with kselftest and KUnit that is the struggle we have: how long does the test suite take, because if it takes too long, nobody will run it. Yeah, and also we're using shared resources, so we can't just take all the machines all the time, which would probably be required to run that level of testing against basically everything. Yeah, if you want to run everything on your feature branches, it'll take a long time; I get it. Thanks.
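For reference, here is a hedged sketch of the two upstream test entry points discussed in this exchange, run from a kernel source tree; the suite and target names are just examples, not the ones Google uses.

```sh
# KUnit: build and boot a UML kernel around a chosen set of unit tests
# (--kunitconfig is available in recent kernels).
./tools/testing/kunit/kunit.py run --kunitconfig=lib/kunit

# kselftest: build and run the selftests for a single subsystem,
# e.g. a hypothetical feature living under the "sched" target.
make -C tools/testing/selftests TARGETS=sched run_tests
```

The appeal for the feature-branch model is that both of these live in the tree itself, so a feature's tests can be committed onto the same branch as the feature.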
Right, so back to the takeaways we hope people leave with. The first: staying close to upstream. One of the things we learned is that being close to the tip, where everyone else is, makes life easier, and it's a worthwhile goal despite the effort to get there. Automating these processes means you'll eventually get as close as you can to a set-and-forget lifestyle for kernel development. There are still a lot of patches we need to bring forward to have them all close to upstream, but from the time it takes to upgrade and the number of patches still coming into Icebreaker, we think it's a goal we'll achieve, and it'll be worth the effort, because the hard part is getting there; maintaining it once all the automation is in place should be much easier. Second, encouraging upstream contribution: while having a method to stay close to upstream is great for internal patches, it's even better if the patch lands upstream, because then, outside of responding to bugs people find in it, it's simply already in the tree we're basing everything on. And third, testing: you need to automate testing at all levels. No one wants to manually kick off tests. If you can make "git push" the last thing a developer types, you'll be in a really good position: they say, "okay, I'm done writing my code," they send it, and then either they get an email saying everything passed, great, or, if not, they'll probably be thankful that the email contains specific information about which tests are actually failing.

So, any final questions? There is a question in the Q&A box: do you work with feature flags, and if so, can you explain your process? So, I think feature flags are something we use in user-space binaries to control their behavior; I don't think that's applicable to kernel development.

There is a... yes, Carlin, you can go ahead and unmute. Yeah, thanks. I was curious about the subsystem maintainers. You mentioned that subsystem maintainers have to take care of their subsystem in the staging setup, and that would be once every two months now. Do you think they will be overloaded, or, as close as you get to upstream, will they not have that much work?

So I guess there are two responsibilities for a subsystem maintainer, or really anyone involved in Icebreaker. There's maintaining the health of their staging branch while holding the kernel version constant: if we're on 5.12, they're probably still ingesting merges and development into their staging branch, and I think that's not too much work; you make sure the tests pass and resolve conflicts within code you probably wrote and are very familiar with. As for the upgrade every two months: that's actually sharded over many more people than just the subsystem maintainers. While we might have one person responsible for a subsystem, just for ease of contact within that subsystem, each feature has its own owner, and it's not necessarily the subsystem maintainer, so I don't think anyone's in a position where they're upgrading an entire subsystem; it's much more spread out than that. Okay, understood; yeah, makes sense, thank you.

And I think another thing to acknowledge is that a lot of the problems we encounter when we do the upgrade are problems we would have hit during the prod kernel rebase anyway, but the advantage of doing it more often is that we're amortizing them, spreading them out over the full two-year period, as opposed to compressing them all into one critical period and putting all these problems in the critical path for deploying a new kernel. I also think something like 60 to 70 percent of our feature branches, when we try merging the next version of Linux, just have no conflicts and just work, so that's something people don't even have to worry about; it's only the other 30 or 40 percent.

That's great. If you can manage not having too many conflicts, that means the conflicts between your feature branches are being worked out ahead of time, before you need to go into the prod branch. So does this process really help build confidence for product teams to take a new upstream, I mean, rebase to an Icebreaker kernel? That's kind of the goal, right, that your product teams will be able to take it with more confidence?

Yeah. We're trying to convince people who are consumers of the kernel that it's tested in the same manner prod kernel would be, and people who contribute to the kernel that our way of doing things is an easier proposition than the long rebase.
And I think, prior to fully transitioning to Icebreaker, there's this benefit of being able to say: Icebreaker 5.15 has been deployed in production, we've identified a number of bugs, and we've fixed them. It's not deployed fully, just a small, very small percentage of the fleet, but we can say 5.15 has been qualified in production against Google workloads, so we can rebase prod kernel to 5.15 with more confidence, because we've already gotten a head start working with that upstream content. Before, we were rebasing completely blind; we had no vehicle to test upstream with until we had already done a significant amount of the rebase. Right, flying blind, fingers crossed, I guess, hoping everything works. Yeah, that's great.

So, you talked a little about the goal of participating in the release candidates, Linus's release candidates. Do you have thoughts on when you would do that? Would you participate in every single release candidate that comes out each week, or would you target the merge window, the rc1, which has the bulk of the changes, with the rest being incremental?

I think the way we view it hopefully happening in the future is that the initial jump we make is to rc1, and everything else is incremental after it. Obviously there are some timing issues there: rc1 is released, then we try to upgrade everything, and it would need to be blazingly fast for that to happen in the two weeks before the next rc comes out, in order to report any meaningful bugs. So we still need to do a little bit of thinking on how we're going to make the timing work, but I think we've shown we're able to jump from one version to another in less than a release cycle, so there should be some way to actually make this work for jumping to release candidates. Right, you're able to do this within the seven-to-eight-week release cadence, generally, if you target the official release. So, yeah, that's great.

It looks like there is a question: can you share one challenging situation while migrating from prod kernel to Icebreaker, and how you overcame that challenge? Do you have any that jump to your head, Dylan? Yeah: tests. Maybe that's not a good answer, because we're still dealing with it, but things are getting better. I mentioned before that we have an internal, not-upstreamable test suite, and migrating away from that and toward something more upstream-friendly is, I think, going to be essential for sustaining things in the future. But it is kind of a headache, because if you have a feature branch and you only want upstreamable things on it, it's frustrating to say, "no, I can't merge my own tests in; I have to do this weird ephemeral merge and then build and test everything." So I think it makes a lot more sense to leverage upstream test infrastructure. I don't know if you have anything, Andrew?

Yeah, I think one challenge that still pops up every now and then: there are applications and binaries that assume they're running on prod kernel, because up until this point that was a valid assumption; if you were running in a Google data center, you were running on prod kernel. So they would assume that certain kernel features are there, and then suddenly we bring Icebreaker into the picture and that feature is non-existent, so they try to make a syscall that doesn't exist or open a file that just isn't there. An interesting thing to deal with, I don't know if it was necessarily hard, but interesting to think about: is this feature worth porting over to Icebreaker, or is there a change we can make in user space to get around having to port everything? That had the benefit of helping us move patches from the bucket of "we need to rebase these" to "we don't need them anymore," because we were able to find a user-space solution. A rough sketch of that probe-and-fall-back idea is below.
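A minimal sketch of the user-space fallback idea: probe for the kernel feature instead of assuming it exists. The interface path and config symbol here are hypothetical, and the /boot/config location is a common but distro-dependent convention.

```sh
# Runtime probe: does the (hypothetical) prod-kernel interface exist?
if [ -e /proc/hypothetical_prod_interface ]; then
    echo "kernel feature present: take the kernel fast path"
else
    echo "feature absent (e.g. on Icebreaker): use the user-space fallback"
fi

# Config probe: was the (hypothetical) feature built into the running kernel?
grep -q 'CONFIG_HYPOTHETICAL_FEATURE=y' "/boot/config-$(uname -r)" \
    && echo "feature built in" \
    || echo "feature not built in"
```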
So, is it a combination of... oh, sorry. It's possible, you can just ask your question. So maybe I missed it somewhere in the flow, but: I work on a feature, it's been tested, and it's good. Is it my responsibility to submit the patch against the master branch, or is it part of the Icebreaker flow to do it for me?

Yeah, so we're trying to get it to the point where your feature branch is tested and good and you can just walk away, in the sense that as soon as you've submitted to that feature branch, it'll get pulled into the staging branch and eventually our next branch. That's basically what we're trying to get to. There are some caveats we haven't worked out yet: if it conflicts with another feature branch, you still need to come back and fix that conflict. But other than that, as it stands today, if you submit patches to your feature branch, all our subsystems have this auto-merge going on, so it'll get automatically merged to a subsystem branch, and at that point it's just the subsystem maintainer making sure the tests that ran automatically passed, and then clicking the submit button. Very cool. Yeah, and I would say it's not necessarily the feature maintainer's responsibility to make sure it gets merged into a staging branch, but it is their responsibility to support and maintain that feature when something goes wrong in that process, because they are, after all, the expert in that feature. That makes sense.

So, do you also run into configuration issues? You mentioned a feature's syscall not being supported, but those are all runtime feature differences; do you also run into configuration options, and how do you manage the config options needed to enable certain features across your branches? So, we have a way to do this, where you can add a file that turns on a config option, but I don't know how much we're allowed to talk about it; we do have a way of doing it. Okay. Well now, any other questions for either Andrew or Dylan?

There is one question: every decade new buzzwords pop up in the market; this decade it's AI, IoT, meta, etc. With the emergence of these technologies, where do you see kernel development going? Oh, I'm going to cop out of this one; I've only been doing this for about a year and a half, so maybe Dylan has more perspective. Well, yeah, I don't know if either of us are really experts on AI. There's been talk of, you know, wouldn't it be cool if we had AI attempting to resolve merge conflicts; I think that's kind of a long way off. Yeah, that's about all I have to add.
I mean, one interesting thing we've seen is that we're adding more ways to move what is currently logic inside the kernel out into user space. An example would be cooperative scheduling: we're able to write schedulers in user space instead of having to just defer to the OS scheduler. So that might be a big way things go, but that is a very narrow view, I guess.

It looks like there is a second question, maybe a follow-on: in the future, does embedded have scope for innovation? Honestly, I can't say; I haven't done much embedded programming in my life. Thank you, Andrew. That's one of those prediction-like questions, and there is a lot of innovation happening in the embedded space already, so I don't know if there could be more; there probably will be more. Any other questions? Does that answer your questions? Sort of, yes, but not completely. So, any other questions? I think we're about done. Do you have anything to add, Andrew? Yep, I think that's about what we wanted to talk about; hopefully it's helpful for other people. Yes, it is; some of the challenges you're facing, that's fascinating.

Dylan? I was going to say: if you have the chance, start out upstream-first; don't get into the situation that we're in now. It's one thing to try to do something clever to get out of the situation, but it's totally worth it to just invest and not get into the situation in the first place. Oh, I guess another thing: we have a bunch of open kernel developer positions; it would be nice to have more people on the team to share the work with. Great, yeah, I think people would like to hear that. Yeah, definitely, technical debt is a huge problem, and not just for you; it's across the board for all companies. So what I'm hearing is: you're doing a lot of automation, and you're making it easier for maintainers, and then for the product teams to also have confidence in going upstream. You can see product teams being on both sides of this: on one side, "it's working, I don't want to change this," but at the same time you're continuously maintaining patches that you don't have to.

It looks like there is a question: have you had situations where severe bugs made it into the prod kernel, and if yes, how did you handle it? Is that specifically about prod kernel, or generalizable to "we had a bug slip into Icebreaker, and how did we fix it"? It seems the question is about prod kernel. Yes, generally; I'd just like to know whether it had bugs and how you went about handling them. So, in general, the way severe bugs manifest is, obviously, if we'd caught it beforehand we wouldn't have released the kernel. So it's usually a big customer, or someone who runs a certain workload, coming back to us and saying, "hey, this isn't working, and we think it's a kernel issue." When that happens, once we've triaged it and figured out what feature is responsible, people are usually pretty receptive to figuring it out, debugging, and fixing it. Then, mechanically, in terms of getting the fix out, we can say, "okay, don't use this kernel anymore, because we know it's bad; go back to a previous known-good kernel," and then you just wait until the kernel with the fix rolls out. Okay... we cannot hear you, if you're speaking. No, I mean thank you, thank you. There is another question:
can you provide suggestions to millennials who want to build their career in kernel and device driver development in the long run? Is that something you want to field, Andrew, Dylan, or do you want me to field it? Seeing as we're probably in the same bucket as the person asking, perhaps you can impart some wisdom.

Okay. So, we have a lot of resources. The current generation is, in my opinion, probably in the best position ever to pursue an open-source career, in any of the open-source projects they'd want, including the kernel and device drivers. There's so much information out there, so many resources, mentorship programs, and webinar series like the one we're doing; it's all very accessible. I totally wish I'd had similar opportunities when I was struggling to become a kernel developer. So, to answer your question: there are so many resources, formal mentoring programs, internships, webinars, and training; so many opportunities out there for you if you choose to take them. Does that answer your question? Yes, it does. If possible, can you provide guidance on where one should invest their time: kernel subsystems, memory management, networking, drivers, virtualization, hypervisors? I would say whatever you find yourself naturally drawn to and passionate about, because if you don't identify what's best for you, something you want to do for the next five, ten, twenty years, nobody can really guide you that way. You just have to play with the different subsystems and find where your passion lies. That's what I would say. I could tell you "this device driver" or "that subsystem," but if you're not interested in it, it won't keep you engaged. Any other questions? Thank you, Andrew and Dylan.

Marisa, it looks like we're about to close out, if you'd like to do your closing. Wonderful; thank you so much, Andrew, Dylan, and Shuah, for your time today, and thank you so much, everyone, for joining us. Just a quick reminder that this recording will be up on the Linux Foundation's YouTube page later today, and a copy of the presentation slides will be added to the Linux Foundation website. We hope you're able to join us for future mentorship sessions. Have a wonderful day. Thanks again.