So, this is a project of ours which we've been doing for a while at Red Hat, and we obviously want to tell you about it. I can't hear you. You can't hear me. No. You can try speaking louder. So, once again, we're going to talk about a project that we've been doing for a while at Red Hat. It's called CKI, which is sometimes pronounced "cookie". I'm Nikolai Kondrashov, I'm a software engineer at Red Hat, and otherwise I maintain the UTMAM project and do embedded and electronics as a hobby. And my name is Major. I've been at Red Hat for about a year. Before that, I worked a lot on OpenStack. And I own a lot of domain names. So, has anybody ever used icanhazip.com before? Oh, a few. Okay, cool. All right, I run that one in a very terrible way. But don't give me any other ideas of domain names to buy. Every time I talk, someone gives me an idea, so I don't want to go there.

Okay, so, as Nikolai said, we as a group have built this project called Continuous Kernel Integration, CKI. It's a group of folks from around the world who come together to change the way that we test kernels. And so we call it "cookie". Someone stopped me in the hall earlier and said, what kind of cookies are you talking about? I was like, oh, you mean like the variety, like what the flavor is? And they're like, no, no, what does a cookie mean to you? And I was like, what do you mean? Like, it's a sweet cookie or whatever. It made me think of something: when Samsung first set up the emoji for cookie, they were the only ones who set it up as what I would call a cracker, you know, the kind you'd put something on. So when we talk about cookies, we're talking about all the other ones that are up here. These are ones that you would want to have, and that would improve your day. That actually made it really funny, because for all the people that use Samsung, when Cookie Monster was talking about how much he loves cookies, all of a sudden there's a bunch of crackers on there. So, when we talk about cookies, we're talking about the sweet things that you like to eat.

So, everything behind this talk is about maintaining stable kernels and how hard that is. How many people in here run the stable kernel directly from Greg's tree on their computers? Okay, a few. How many of y'all run mainline? About the same. Okay. So this is very difficult, and one of the reasons is that you don't see a lot of the bugs that go into upstream kernels until it has taken a long time for them to trickle down. If you think about it, a patch goes into some developer's -next tree, and then Linus merges it in. Then you get to the point where Linus says, let's do a release, and it ends up in an RC. And then finally it will end up in a stable kernel that Greg will maintain for a period of time. And that creates a lot of problems.

So, as I think about this, let's say you're a developer and you write a patch set, and it gets merged into mainline. This is where patches go before they get brought into the next kernel release. Not a lot of people are going to be running that right off the bat. A lot of people are going to use kernels from their distribution, or use a stable kernel perhaps. So that patch goes in there and time passes. Could be a month, could be a couple of months. Finally, that patch set becomes part of Greg's stable kernel release, which a lot of OS distributions will go and pick up and work with. And also, Greg is probably one of the most efficient kernel developers on Earth.
He moves more patches than anybody I've ever seen, and he can also respond to an email in like 15 minutes. I don't know how he does it; I can't do my email at all. So then more time will pass. Those patches will sit in that stable kernel for a certain period of time, and eventually a Linux distribution maintainer will pick it up. And they'll find an issue — a security issue or a performance issue or something like that — and they'll eventually narrow it down to that patch. And that's after doing git bisects and tons of compiles; sometimes performance issues can take hours or days to tease out. So that's a lot of extra work. Then the distribution maintainer says, let me contact the original developer and find out where they were going with this, or what the original intent was. Usually time passes, because that developer has other things to do. And then finally the original developer replies, but they can't remember why they wrote the patch or what it was for. Maybe it was to fix one small bug they found in a corner case, or maybe to add support for new hardware. And they're like, I haven't worked with that hardware, I'm on a new team, there are other people that do it, I can't help you, I'm sorry. And so for a maintainer of this kernel, especially if you work on an OS, it's really frustrating, because you get these bug reports and you're like, I have 1,200 kernel commits I've got to go through to find this one problem that this person reported to me on this one piece of hardware. It can be really frustrating.

So the question we asked ourselves was: what if we could find that patch before it ever made it in? That opens up a whole lot of additional questions, like how do we take a look at these patches before they ever get merged into the tree, and how can we test them adequately enough that we actually know whether we found a problematic one or whether the patch is fine.

So naturally we build the kernels here. What we do is we watch kernel mailing lists. We use Patchwork. How many of you know what Patchwork is? I guess most of you know. Not many, actually. Okay, so Patchwork is a system which watches a mailing list, or a bunch of them, picks out the patch series that are posted there, parses them, figures out what's the cover letter and what tags are put on commits, like Acked-by or Reviewed-by, then puts it all into a database and provides a web interface so that maintainers and contributors can take a look at those patches, mark a status on them, check the status, or do whatever manipulations are part of the maintainer's workflow. The Patchwork system also has an API, which lets us watch specific mailing lists for specific kernel trees, notice when there is a new submission and pick it up, then put it into our GitLab pipeline where we merge it and test it, and then generate a mail report and send it back to the mailing list as a reply to the original series thread. So the people who were concerned with that series — the contributor who was copied on those messages, the maintainer — would know that there was a test run and see the results. We don't normally send successes to the mailing lists; we only send failures, so it's low traffic and people are only alerted when something fails.
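To make that Patchwork step a bit more concrete, here is a minimal sketch of polling a Patchwork instance's REST API for new series and handing each one off for testing. The instance URL, project name, and trigger function are hypothetical, and filter parameters vary between Patchwork versions, so this is an illustration rather than CKI's actual code.

    import time
    import requests

    PATCHWORK = "https://patchwork.example.org/api/1.1"  # hypothetical instance
    PROJECT = "stable"                                   # hypothetical project name

    seen = set()

    def trigger_pipeline(series):
        # Placeholder: in CKI this would start a GitLab pipeline that merges
        # the series, builds the kernels, and schedules Beaker test runs.
        print("would test series %d: %s" % (series["id"], series["name"]))
        print("  mbox to merge: %s" % series.get("mbox"))

    while True:
        # Filter syntax differs between Patchwork versions; this assumes the
        # series list endpoint accepts a project filter.
        resp = requests.get(PATCHWORK + "/series/", params={"project": PROJECT})
        resp.raise_for_status()
        for series in resp.json():
            if series["id"] not in seen:
                seen.add(series["id"])
                trigger_pipeline(series)
        time.sleep(300)  # poll every five minutes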
We also test commits to the kernel tree repositories themselves, and we maintain a notion of a baseline where the kernel is considered working, so that we can differentiate test failures caused by test issues from failures caused by whatever is in the Git tree we are applying the patches to. Underneath GitLab we use OpenShift to download those kernels, merge the patches, and build them, and that lets us really shorten the build time — it's like 5 minutes now? Yeah, 5-6, yeah. 5-6, and we are talking about building several architectures at once. Then, when the kernels are built, we hand them off to the Beaker system. Beaker is a system for maintaining an inventory of all the hardware that we have available for testing, including the machines themselves, the peripherals they have, the parameters of the CPU, the architecture, stuff like that. It also maintains the distributions that we have for testing, which we can install on those machines. It gives us the ability to turn a machine on and off and install the operating system. It's very convenient for us, and most importantly for our testing, it also has support for specifying which hardware we want. For example, if you want a specific architecture, you can say that. If you want a specific CPU, you can say that, which allows us to target tests to specific hardware.

This has been one of the biggest parts of Red Hat's value throughout its history: the amount of testing that we do, the number of testers that we have, the number of people who maintain the tests and write new tests, and the sheer amount of work that's done for every release, whatever we produce. We of course have lots of hardware to test on, because we have partners that we care about and because we have customers that we care about.

Speaking of which, so far we have onboarded a number of test suites. These are not just individual tests, these are test suites, like LTP lite, which has a ton of tests, or the KVM unit tests, which are for KVM testing, or the Connectathon NFS tests, which have a lot of tests as well. There are a number of smaller tests testing upstream right now. For architectures, we obviously have x86_64 with AMD and Intel, but I guess nothing more extreme. We have the whole zoo of arm64. We have IBM POWER8 and POWER9 with ppc64, and finally we have the rarest Pokémon of them all, IBM mainframes, and that's s390x. Even on x86_64 there's a great zoo of hardware that we can test with, all kinds of types of machines, including laptops, workstations, and servers. And of course sometimes you even need to test on virtual machines, for specific changes to KVM, for example. For peripherals, we have hardware ranging from desktop class — audio cards, basic network cards, and GPUs — up to enterprise class, with InfiniBand adapters, storage controllers, high performance network cards, and high performance GPUs. And what's more, we can target those specifically in Beaker when we want to.
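To illustrate the kind of hardware targeting Beaker gives us, here is a rough sketch of a job with host requirements submitted through the bkr client. The XML is a simplified fragment based on Beaker's job format; the whiteboard text, distro family, and task are made-up examples rather than CKI's real job definitions.

    import subprocess
    import tempfile

    # Simplified Beaker job XML: ask for any aarch64 machine and run a basic
    # install check. Whiteboard, distro family, and task are made-up examples.
    JOB_XML = """\
    <job>
      <whiteboard>CKI example: boot test on aarch64</whiteboard>
      <recipeSet>
        <recipe>
          <distroRequires>
            <distro_family op="=" value="Fedora29"/>
          </distroRequires>
          <hostRequires>
            <and>
              <system_type value="Machine"/>
              <arch op="=" value="aarch64"/>
            </and>
          </hostRequires>
          <task name="/distribution/check-install" role="STANDALONE"/>
        </recipe>
      </recipeSet>
    </job>
    """

    # Write the job definition to a file and hand it to the Beaker client.
    with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as f:
        f.write(JOB_XML)
        job_file = f.name

    subprocess.run(["bkr", "job-submit", job_file], check=True)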
So we want to get into talking about what we're doing for upstream kernels today. When we originally had this conversation about going upstream, we thought, man, we have to do this right, because kernel developers are very particular about how they want to communicate. Everything happens on the mailing list. If it doesn't happen on the mailing list, it never happened. That's one thing I've learned the hard way. And also, you don't send anything to the mailing list unless you have all your stuff together first. So we had to make sure that the results we were sending were complete, they were accurate, and they contained the right information for a kernel developer to go back and say, okay, I can compile with the same options and the same config you had and get the same kernel.

So we decided we would have a conversation with Greg KH and see if we could join in testing some of the stable kernel work. There are already some other groups in there: Linaro is doing some work, Google is doing some work with syzkaller. There are some other groups participating, but we said, look, we want to bring something a little bit different. We want to do multi-arch. We want to do some different tests that other people aren't running. And he said, sure, that's great, send it to the mailing list. And we were like, well, what do you want sent? He's like, just send what you've got. So we said okay. And the funny thing was that he would constantly give us feedback. Any time we messed up, he would reply in less than like 10 minutes, telling us exactly what he liked and didn't like, which scared us at first because we thought he was going to get mad. But actually he wanted to send the feedback, and he wanted us to change quickly.

So we started sending these emails, and the emails would contain really basic stuff at the top, and we started testing his RC releases. He's got a workflow where he will prep an RC — so let's say he's putting out 4.20.1, he'll make a tag, he'll do an RC, he'll get everything ready, but he won't put it into production or release anything yet, and he'll ask people to do tests on it. And so we said, well, why don't we just start running tests on that repo. So what we do is we send through a result that says, hey, what is the overall result? Did it compile okay? Did all the kernel tests run? Really basic. And then we also provide ways to reproduce the same compile that we did. We go and share and say, here's the exact config we used, here's the make options we used, you can go run this on your own. We also go through all the different hardware testing that we offer right now, and all the tests are open source. So if anyone says, hey, I want to run that test as well, it's there and you can go run it. So you can compile the kernel, get the exact same one that we had, and then go run it. And we actually offer up the kernel that we built, so you can just take it from the tarball if you don't want to compile it yourself, and run with that.

Then after that we said, hey, wait a minute, what if we went earlier in Greg's workflow? Greg has a workflow that's earlier than that, where he has a separate repo called stable-queue, and he takes patches and puts them in a directory, and the patch list gets bigger and smaller as people do testing and argue about whether a patch needs to go in or not. Sometimes this will have 20 patches in it, sometimes it'll be 100, sometimes more than that. And so what we said was, hey, every time he changes that repo, let's just test it and give him feedback before he even makes an RC. So now, every time he makes a change to that repo, within about 15 or 20 minutes we begin tests. So we can actually give him feedback before he even thinks about building an RC.
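As a minimal sketch of what watching the stable-queue repository could look like: poll the public repo with git and kick off a test run whenever the branch moves. The polling interval and the trigger function are illustrative assumptions; the real CKI trigger is more elaborate.

    import subprocess
    import time

    # Public stable-queue repository; branch and interval are illustrative.
    REPO = "https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git"
    BRANCH = "master"

    def remote_head(repo, branch):
        """Return the commit the remote branch currently points at."""
        out = subprocess.run(["git", "ls-remote", repo, branch],
                             check=True, capture_output=True, text=True)
        return out.stdout.split()[0]

    def trigger_tests(commit):
        # Placeholder: in CKI this would start a pipeline that applies the
        # queued patches, builds the kernels, and schedules Beaker runs.
        print("stable-queue moved to %s, starting tests" % commit)

    last = None
    while True:
        head = remote_head(REPO, BRANCH)
        if head != last:
            trigger_tests(head)
            last = head
        time.sleep(15 * 60)  # roughly matches the 15-20 minute latency above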
And so we go through and report all the patches that he has in the folder, in the order that he originally wanted to apply them. That way we can give him feedback before he goes and makes an RC that's going to have a lot of problems.

So, I talked about Patchwork, but that is something we're using inside Red Hat only so far, and there's one simple reason for that: while we test whatever Greg is working on, we trust Greg, but if you want to work with Patchwork, you've got to expose yourself to patches sent by anybody and run that code on your hardware. So right now we are figuring out how to do that, and obviously people are already doing similar things, running tests on code that has been posted there. It's just a matter of figuring out how we're going to do it. Inside Red Hat we watch Patchwork and Git, and we have an extra lint stage where we check whether the mails sent to the mailing list follow the process and have all the required information in them. We think we can also use that upstream, to run checkpatch or whatever is necessary, and so alert a developer sooner, before the maintainer has to go in there and see it for themselves.
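As a rough illustration of that kind of lint stage, here is a small sketch that runs the kernel's own scripts/checkpatch.pl over an incoming patch; the file paths and the pass/fail handling are made-up assumptions, not CKI's actual tooling.

    import subprocess
    import sys

    KERNEL_TREE = "linux"       # hypothetical path to a kernel checkout
    PATCH = "../series.patch"   # hypothetical patch pulled from the mailing list

    # checkpatch.pl ships with the kernel and checks patches against the
    # kernel's style and process rules; --terse gives one line per problem.
    result = subprocess.run(
        ["./scripts/checkpatch.pl", "--terse", PATCH],
        cwd=KERNEL_TREE, capture_output=True, text=True)

    print(result.stdout)
    if result.returncode != 0:
        # Failing here lets us alert the submitter before any build time or
        # hardware time is spent on the series.
        sys.exit("checkpatch reported problems with the series")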
What else are we working on? Of course, we are always adding new tests, and the list you've seen is smaller than what we have internally. We still have some tests that we need to prepare for putting outside, and some tests might never make it because they are secret sauce, but hopefully we will be able to open source most of them. We want to test more trees, and we want to reduce the latency — the time between a person submitting a series and the time when they receive the report — because that is important for developers. As you know, that's why we all have CI: so that we get the response as soon as possible, while we still have the same information in our heads as when we were working on that patch, and we can quickly respond and fix our things — not, as Major said, half a year later when we've forgotten all about it. So we are constantly driving that time shorter and shorter.

Furthermore, we want to open up parts of our process, our data, and our code. We have some code still inside Red Hat that shouldn't be, and we are putting it outside; a big part of the code is outside already. We want to have the issue tracker outside Red Hat and open to the public. We want to have meetings outside, perhaps on IRC in public. We want to send the test logs and the console logs outside when we send reports. We could easily do that right now, of course, and we do it inside Red Hat, but the problem is that the logs can contain confidential information, like special hardware which we are not allowed to talk about yet, or details of our infrastructure. And then we need to put out more documentation for people who have seen the failures, so that they can more easily go and run that exact test with the exact parameters we ran it with. We have already started doing that and we are going to improve it.

So, if you'd like to have your tree tested — if you are a maintainer, or if you work with a maintainer and you want them to do that — basically send us an email, tell us what you want to do and what you want to test. We'll see how that aligns with the targets we have for RHEL and what we are doing, and if it works out, we'll just start testing it and sending you emails.

And if you have any tests that you want us to run — and I know there are some people right here, right now, who do — again, send us a message. We will see how it fits with RHEL and with what we are doing, we will work with you to write a wrapper for your test so that we can run it on Beaker and in our system, and then basically we run it. We'll ask you to maintain it, because we don't scale to all the tests. We will of course help you figure out what went wrong in our system if it's our fault, but otherwise we will need to work with you to maintain your test. And it's not a big deal: we have a system in place where we can just disable a test for a while so that you have time to fix it. We don't throw it away; we will work with you to make it work.

And so, usually at the end, someone always asks: you're putting out a lot of hardware, you're putting out a lot of work, so why does Red Hat want to do this? Well, in the end we want to make a better RHEL, and we realize that if we want a more secure operating system, it starts with the kernel. And that means we have to go outside of what we're currently doing: we have to go upstream, we have to start working there and stopping the problems not only before they get to us, but before they get to other distributions as well. We should find a common way to do that so no one else has to go through the pain, because it doesn't make a lot of sense for kernel developers at Red Hat and SUSE and Canonical and wherever to all be working through the same issue without communicating with each other. We'd rather do that upstream.

And finally, if you do want to get involved, we have some projects on GitHub, some projects on GitLab, but you can always email us with any questions that you have, or if you want to add a test, as we said before, or if you want to add your tree — that would be great. We'd rather not go directly to maintainers and say, please add your stuff. What we would love is for people who constantly work with a certain tree to go to that maintainer and say, hey, look, we should get the same testing here; I would like to not have to do this on my own laptop or my own server or set of servers or wherever, I would just like to have someone else do this and give us the results in a consistent way. And so with that, we want to tell you thanks. And we also do have some real cookies to give away, if you ask a good question, maybe.

So the question was whether there is any interaction with Intel's test bots, and whether we are running the same tests or different tests than they do, because the service is quite similar to theirs. Yes, it is similar. I don't think we have coordination with them. We have so many tests to put in the open — tests that we developed inside and tests that we are using from the open source community — that we just haven't started doing that. I don't know if that's going to be possible.

So there was a conversation at Plumbers in November. There were folks there from syzkaller, and the Linaro folks who are doing kernelci.org were there as well, and we were all having an argument about what is the best way to provide the output. Because that's the big question: the output from everyone looks different, and the feeling from the kernel maintainers is, you are just blasting us with emails from all directions — and it's great, we never had feedback before and now we do — but we need to organize it into something. Then it came back around to kernel developers saying, well, if it's not on the mailing list I'm not going
to look at it, so we have to keep sending it to the mailing list; and then others say, well, we don't want it on the mailing list. So it kind of went in a cycle. There are some arguments going back and forth about some of the maintainer trees — some of the -next trees — going into something like GitLab, where the feedback could be provided directly there, based on the commit. But then of course that opens up arguments for and against: there are a lot of workflows that would be disrupted, and some people feel it would make it harder to contribute if you had to work through another system.

Yeah, so when we compile that kernel, we actually put it on a z13 box and we boot it and run through the tests there. So there's no emulated testing. It's not a small box — well, it's a big box, it's very large.

How do you decide what tests are run? I mean, you can't run all the tests, right — so which ones, and how far do those go?

So I think it's a trade-off, because, for example, I know we were looking at one of the tests — I can't remember which — that had like a four or five hour run time, and it was like, okay, there are good results there, but do we really want to drag that out? Because then it's more hardware getting used and it takes longer to get your feedback — that test had better be good. And so Nikolai and some other folks on the team got together and said, hey, wait a minute, couldn't we look at where the patch lands in the kernel, and then based on that decide which tests to run? So if a developer is changing something about how a Raspberry Pi manages its serial console, maybe we don't need to test the file systems, you know, that's not a big deal or whatever; but if we had tests for a serial console, okay, maybe that's a great time to go in there and make sure that still works. So I think that's part of it — and then the other half, I've forgotten your second half.
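As an aside, a rough sketch of the patch-aware test selection idea just described: map the paths a patch touches to the test suites worth running. The path patterns and suite names below are made-up examples of the idea, not CKI's actual policy.

    import subprocess
    from fnmatch import fnmatch

    # Made-up mapping from kernel source areas to test suites; the real
    # selection policy and suite names in CKI will differ.
    SUITE_MAP = {
        "fs/nfs/*": ["connectathon-nfs"],
        "fs/*": ["ltp-lite", "xfstests"],
        "arch/*/kvm/*": ["kvm-unit-tests"],
        "virt/kvm/*": ["kvm-unit-tests"],
        "drivers/tty/serial/*": ["serial-console-smoke"],
        "net/*": ["network-stress"],
    }
    ALWAYS = ["boot", "ltp-lite"]  # a baseline set that always runs

    def touched_files(patch_file):
        """List the paths a patch modifies, using git's diff statistics."""
        out = subprocess.run(["git", "apply", "--numstat", patch_file],
                             check=True, capture_output=True, text=True)
        return [line.split("\t")[2] for line in out.stdout.splitlines()]

    def select_suites(patch_file):
        suites = set(ALWAYS)
        for path in touched_files(patch_file):
            for pattern, extra in SUITE_MAP.items():
                if fnmatch(path, pattern):
                    suites.update(extra)
        return sorted(suites)

    print(select_suites("series.patch"))  # hypothetical patch file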
So we found some really oddball things, and some obvious things that someone should have just caught. I think in the middle of 4.20 development, basically all the architectures could not be compiled except for x86. Someone in our group found that — I think it was CONFIG_RETPOLINE — was only set up for x86_64, and there's no concept of it outside of x86_64, so when you went to compile on Power it would hit that spot and go, what are you talking about, this doesn't exist. That actually stayed in for like two weeks, but nobody noticed it, probably because the vast majority of development is on x86. So we find some of those things really quickly, just because we're able to compile it. We found some other things — there are of course a lot of merge failures, a lot of merge failures — and then there was this story with Greg just recently. Oh yeah, so I guess now that there are a few kernel CI groups doing work with Greg, he dropped in a patch that would intentionally cause a kernel panic, because he wanted to see if people were panicking. And we caught it. I can't remember who on the team — I think Rachel might have brought it up — and then another member of the team was like, oh yeah, look at this right here, and found the patch, and it's just a one-liner throwing a panic. And it threw us into a panic, because everything failed, and we were like, what is this, it's a stable kernel, this shouldn't happen. Then I emailed Greg and I was like, nicely done, and of course he replied back and he's like, I'm glad you're paying attention.

So yeah, we found some interesting stuff so far — some other stuff with huge pages, we had one with ARM moving huge pages between NUMA nodes, and stuff like that. So, to answer your first question: we're RHEL-specific, so a lot of the stuff has to be focused on what goes into that product — so a Raspberry Pi console... we have quite a few things, but that's where our focus is, in its entirety. Please go ahead.

Are you planning to make this more portable at some point? Because right now you're providing the service and operating the software — and you're great at the software — but it depends on you; it's not really portable, it's not the whole process, so for another distribution or other systems I can't organize it myself, I'm just using your service on top of all the software.

So the question was: are we planning to make it more portable to other distributions and systems, taking it outside Red Hat so people could run it themselves? This is a difficult question. We are of course putting everything outside, in the open source, as much as we can, and yes, Beaker is open source, but nobody has actually installed it outside Red Hat — I think people might have tried. This again ties in with the huge amount of hardware that we have and the way our systems are tied to the management of that hardware. The fact that we have all that special hardware, and that we need to manage it in our special way, kind of ties us to that. But we still have parts of the system which we try to keep separate and independent, which could be reused, and we're putting them outside — a big part of it is out, and we're putting out more. And as we reach our goals and boost our coverage and everything, we'll be able to concentrate on making it nicer. So I hope that helps. We are open to accepting patches, of course — only you will have a hard time testing it all; still, we have lots of test suites.

But having it be a service also adds one extra benefit, as we found when we were getting started: it gives us tight control over everything that's not the kernel in the system. And we do want tight control, because if we introduce too many variables, then we can't figure out what broke in the process. That was another thing that came up at Plumbers: kernelci has a thing you can install locally — you can install Jenkins with their scripts — and the problem they found is that some people have their BIOSes configured differently, or they have power management settings set up strangely, or something like that, and they were getting weird variations in their tests. One of the reasons was that some of the machines were just configured in very unusual ways. Some may argue that it's better to test on different platforms, but then you're going to get different results for the same kernel, and it's hard to narrow down exactly what caused it.

Why do you use OpenShift to build the kernel and launch it and test it, instead of GitLab CI or something like that?

Well, we build a lot of kernels. We're actually using GitLab CI — we're using GitLab CI on our own instance. We're using it to orchestrate everything, and we're using it to trigger the builds in OpenShift, to scale our building tasks better than... well, I don't know. Well, it's actually GitLab talking to OpenShift, right?
So in a way we're using GitLab, but OpenShift provides us a scalable way to build a lot of kernels. And internally it's a lot easier to consume OpenShift than other things, because you don't have to think quite so much about the capital expenditures and all that kind of thing; you can consume part of a cloud that's already being used, and that allows us to expand up and down. We can say, I need this many cores, I don't care what they're on, I just need this many so I can do my compiles. That way we can ask for infrastructure in a really generic way, instead of saying, I must have 20 of this type of machine, racked in this way, with this networking config — and then a release happens inside Red Hat with many patches, and we run out of those 20 boxes, and what do we do? With OpenShift we can just boost it up, turn the crank a little, and we get the compiles going. Yes, please.

The kernel test bots are all really, really easy for me to use as a kernel developer: I just receive an email, it usually says run this command and you will be able to reproduce this, and it should actually work. I understand that what you're doing is a bit more complicated than that sometimes, because you have this hardware, so it's not going to work that way. But if I get an email from you, then it's not so easy for me to be independent of your service and reproduce it on my own whenever I want, and that worries me, because I — and developers in general — might prefer something that is very obvious to reproduce and run again. But yeah, I understand your comment; there's the hardware situation and whatever.

So the comment was that it's difficult to run the tests that we run, compared to, for example, the 0-day bot — or maybe just that OpenShift is only building kernels. But yes, the problem is there: right now, as I said, we don't send the test logs or anything like that, but this is on our to-do list, absolutely. And it's not only for outside, but for internal developers as well, because we have a lot of tests, and kernel developers don't know all of them — they know the specific tests for what they're working on, but if they break something else, then they're in the same situation as you are. So we are working on having the instructions there, and having the commands there as well, as you say, but our focus right now is to just start doing it.

Yeah, and that is good feedback. That's also why, for all the emails that we send — the ones that go externally to the mailing list — the email address on there is the same one we put on the slide, so we welcome any feedback like that; that's one we've gotten before, to make it easier to reproduce. Any more questions? Okay, come get your cookies.