So, we are going to talk about this thing, and "us" is me, Nikolay Kondrashov, and Michael, who is sitting here and will join me later. Neither of us actually looks like that anymore. We come from the Red Hat CKI team, or CKI project, and we're a distributed team doing kernel CI at Red Hat.

So, why do we need kernel CI? Well, as most of you probably know, we do releases of distributions, and each release has a different kernel version. There are a lot of kernel versions, and moreover, we are one of the major contributors to the kernel, and certainly the biggest one among distributions. This shows a comparison of unique email addresses from SUSE, Red Hat and Canonical that contributed to the kernel, and this one is commits. Each of those is per year, and Red Hat is the blue one. So, somehow we have to make all of that consistent and reliable. And if you look at this, this is a big queue of how code goes through the pipeline towards Red Hat, and until not so long ago, our tests were only there, at the end. The developers would throw the commits through the builds over to QA, who would test them and come back, and then retest. That takes a long time. And if you consider how long it takes for the whole code pipeline to digest one patch and get to a release, it's a long way.

So, what we want to do is this, and we want to do it fast and provide as much feedback as possible, so that ideally the bugs are caught before maintainers even see them, at the moment the developers submit them. And that is hard, because there is just so much email. I mean, look at this. I don't have anything against this message, it just shows the complexity of these things. Somehow we are supposed to put our hooks in here: this is patch number 62 out of 114, and there is this amount of discussion going on. Somehow we are supposed to test those and provide feedback to developers. This has been an ongoing discussion in kernel circles recently. For example, at the last Linux Plumbers there was one of those presentations that have been happening lately, by Dmitry Vyukov, who gave a very good take on these issues. I recommend you watch it if you're interested in the kernel development process; he makes good points.

So, what we've built is something like this. This is simplified; I actually lost a slightly more complex slide on the plane here because of how slides.com works. But never mind, this is very simple, and we'll go around it. Normally, if you just want to check the changes being committed to a Git repo, that's kind of easy. We have a bunch of repos we track, we check whether there are new commits, and we test those; that's fairly trivial. Inside of that, we have a bunch of Git repos for different releases, like RHEL7, RHEL6, RHEL5, RHEL8, and we also track upstream repos, mainly stable at this moment, but a few others as well. We test those commits, of course, and that's relatively easy: we just pull the repo and run our tests, about which I'll tell you a little later.

Then there's the interesting part, which I started with. Turns out it's not that hard. Well, it is hard, don't get me wrong, but you can do it more easily than you'd think. So, here's a typical mailing list, the Linux USB mailing list, and a message from a series looks like this. Turns out there is a project called Patchwork, which probably many of you know, and which is used by maintainers. Most of all, they use it to track patches as they're being processed, reviewed and tested, and to check which patches were merged and which were not. It looks like this: if you go to patchwork.kernel.org, there is a bunch of projects, and those projects can be mapped to a particular mailing list, or even to a particular tag in the subject used on that mailing list. At least the Red Hat one is done that way; I'm not sure about upstream, actually. So, if you go to the same Linux USB mailing list here, you can see, again, those patches, but this time they're organized into series, and if you click on one of those links, you get to the particular patch series. There are two patches in there, and we can go to a specific patch and see what's going on, what the patch is, and we can download the mbox there on the right, for the patch or for the whole series. And the main thing that concerns us is that Patchwork has a REST API. We can go through those projects, we can extract the patch series, the patches, and everything, and we can track when they appear. This, of course, sounds very simple, but the devil is in the details: you have to expect that not all messages come through at the same time, so when you go and check, a series might not be complete yet; there are bugs; people send all kinds of messages in there, and sometimes they're not picked up; things like that. A patch series may or may not have a cover letter, and so on. But you can make it work. So, a typical Patchwork trigger would be tied to a particular Patchwork instance and a particular project there, and associated with the corresponding Git repo.
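To give an idea of what such a trigger boils down to, here is a minimal Python sketch, polling the Patchwork REST API for new, complete series in one project. The endpoint layout follows the public patchwork.kernel.org API as I know it, but the project name, polling interval and "seen" bookkeeping are invented for the example, and pagination is elided; this is not the actual CKI trigger code.

```python
import time
import requests

API = "https://patchwork.kernel.org/api/1.2"
PROJECT = "linux-usb"  # hypothetical project to watch
seen = set()           # series IDs we already triggered on

def poll_once():
    # Ask Patchwork for recent series in the project.
    resp = requests.get(
        f"{API}/series/",
        params={"project": PROJECT, "per_page": 20},
        timeout=30,
    )
    resp.raise_for_status()
    for series in resp.json():
        # A series may still be missing patches when we first see it,
        # so only trigger once Patchwork marks it complete.
        if series["id"] in seen or not series["received_all"]:
            continue
        seen.add(series["id"])
        print(f"new series {series['id']}: {series['name']}")
        print(f"  mbox for testing: {series['mbox']}")
        # here the real trigger would start a pipeline with this mbox

while True:
    poll_once()
    time.sleep(60)
```

The "received_all" flag is exactly the "series might not be complete yet" problem from above: you have to wait for it instead of assuming the first patch you see is the whole story.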
Further on, we also have triggers for our package builds in Koji, and for the developer builds in Copr for Fedora, and internally for RHEL, where it's called a little differently. Oh, pardon, let's go back. So, the Fedora build system looks like this. There are a bunch of packages being built there, prepared for releases and reviewed, so we can search for the kernel. There's our kernel; let's take this one, and here is the information on that kernel build, the specific revision and everything, and all the packages that were built, for all the architectures. And Copr looks like this (oh, wonderful, finally connected). Copr is more for developers: you can have your package built and put into an RPM repo, to be picked up by your users or by other developers, and we track those as well. So you can search for "/kernel" there, find one of those, go in there, into builds, see there's been a build, and here are the packages. But we don't talk through the web UI, of course. We listen to the Fedora message bus, which is used both by Koji and by Copr, and internally at Red Hat there's a message bus as well. We just listen to the messages. Here's a log from our trigger, and it checks: OK, there's a "build completed" message coming through the bus; we're not interested in this one, nor in this one, but here is the kernel. We pick it up and we trigger our pipeline.
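As an illustration of that message-bus side, here is a rough Python sketch of a consumer using the fedora-messaging library, reacting only to completed kernel builds from Koji. The topic name and body fields follow the public Fedora messaging schema as far as I know it; treat the details as an approximation, not as our actual trigger.

```python
from fedora_messaging import api

KOJI_TOPIC = "org.fedoraproject.prod.buildsys.build.state.change"
COMPLETE = 1  # Koji build state "COMPLETE"

def on_message(message):
    # Everything on the bus flows through here; skip what we don't
    # care about, just like the trigger log shown above.
    if message.topic != KOJI_TOPIC:
        return
    body = message.body
    if body.get("name") != "kernel" or body.get("new") != COMPLETE:
        return
    nvr = f"{body['name']}-{body['version']}-{body['release']}"
    print(f"kernel build completed: {nvr}, triggering pipeline")
    # here the real trigger would start a GitLab pipeline for this build

# Blocks and consumes messages according to the broker configuration.
api.consume(on_message)
```

The point is that the trigger is purely reactive: no scraping of the web UI, just filtering a firehose of bus messages down to the kernel builds we care about.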
So, we also have to test our own CI, so we have two special kinds of triggers, for GitHub and for GitLab, for testing contributions to our CI repos, of which we have many. As you can see, we have some on GitHub and some on GitLab, well, for historical reasons. It works simply like this: you submit a PR or an MR, and the bot comes and says, hey, I'm a bot, send me a message and I'll test this for you. So the developer says, "test please, bot", the bot says "testing", and it triggers the pipelines for the various repos that we have. The developer can go have lunch or two, and when it finishes, the bot says "passed" or "failed". Well, as it happens, here's an example with my pull request. The bot comes in and tells us: here I am, this is what you can do. It's a little different between GitLab and GitHub and between repos; it can say, add this keyword or that keyword and we'll test this and that, things like that. And yeah, I'm asking the bot to test, so the bot says, yeah, I'm on it, and then it's supposed to report passed or failed, of course. That's how we do testing for our CI, because, yeah, there are many repos, and because our CI is itself a separate pipeline; we internally have two GitLab instances, actually, handling this. That's a complication, a final one, yeah.

Further on, these are two major parts of our pipeline: the test database, and the tool which lets us pick which tests to run, and which organizes everything into trees and other dependencies, like what we can run with this build or with that build, on this architecture or on that architecture, et cetera, and whether developers want to test this or that. So, basically, the data flow is very simple. We have kpet-db, which is a repo with the test information. It's currently private, because there's all kinds of RHEL-internal stuff in there; we're still intending to open it up, but it's difficult to separate the upstream tests from the downstream tests in a complicated data structure and somehow merge it all back together. So, we have the database, which is basically YAML, and we have a tool, kpet, which takes it, plus a bunch of parameters, and then spits out the XML for the Beaker system, which actually runs our tests. That system is installed internally. It's open source, but you would have a very hard time actually trying to install it yourself, which we are working on right now, very hard, and hopefully we'll be able to let you enjoy it too. It's all open source, it's out there, and so is the documentation; it's just that nobody has succeeded in installing it by themselves yet.

So, the database contains information about architectures, of which there are five right now. Then host types, which describe what we want a host to have: do we want it to have this much RAM, this many CPUs, this much storage, or even a particular PCI card or network card. We organize hosts into host types because that's easier. We have trees, obviously, the particular repos or types of repos we want to test, and those affect which tests will run where; for example, one test can run on RHEL7 but not on RHEL5, and some run upstream. Some tests are still internal, not many though, most of them are actually out there. And there are components, which describe what a build contains: an upstream build only contains a kernel image, but internal builds are built as RPMs, and there could be debuginfo, the internal kernel headers, things like that, or tools that some tests need. For example, some tests need debuginfo, and some tests don't run on debug builds.
And then we organize tests into sets, of course, like network tests, file system, memory, et cetera, et cetera, virtual machines. And of course there are the descriptions of the tests themselves, of which there is quite a big number, soon to be a hundred. These range from simple tests (just a shell script that, say, restarts the kernel, tests something, done) to very big ones like LTP and usex, which can contain thousands of tests; the top ones are listed there, I guess, but those are not all.

So, here's an example of a test with its data. There's a description of where it appears in the report (this is actually quite outdated, but anyway, the essence is there). There's where it is located: this one, for example, is in our test repository on GitHub, where most of our tests are. There's which host type it runs on, and additional information like: I want a very specific host for this. That can go down to a specific host name in our Beaker system, as in, I want to run it exactly on this machine, because only it has this hardware. Such a description could look like this; this one actually says, don't run on these ARM systems, because they don't work there. Then there's information on maintainers and whatnot; well, that's a discussion for upstream, let's not go into depth for now. These are the test maintainers, who look after the test, check whether it's working or failing, and actually receive copies of failure reports. They are supposed to take a look as soon as something happens and tell the developer, OK, sorry, that's my bad, the test failed, or say: this is your problem. They are responsible for those tests, and that is going to be important upstream, because upstream developers don't get much visibility into our machines and such, so they have a hard time figuring out what actually happened; we're working on that. Then there are the conditions for the test to run, like the sets it belongs to (this is also outdated, my gosh). And here's an interesting part: we specify which source files a particular test covers, more or less, so that we can avoid running it when a patch doesn't touch those files. That's how we can contain the runtime, at least a little bit, and make it shorter when we don't need a test; it lets us describe when to run each test, for which code. Then come the architectures the test will run on, and the trees it belongs to (there are no components here, because this is old). And there can be multiple cases of a test, like: I want to run it with this file system or with that file system, for a file system test, for example, or with additional parameters.

As for invoking kpet: normally people don't invoke it by hand, it runs in the pipeline, but you can say: generate me the XML for this run, for this kernel tarball, for the upstream tree, on aarch64, with this patch, and highlight the output. It looks something like this, and it goes on and on; I'm not going to bore you with the details. This is the input to Beaker, saying what to run, where to run it, and in which order.
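To make the source-file matching described above concrete, here is a small illustrative Python sketch of the idea: each test declares patterns for the sources it covers, and the files touched by a patch decide whether it runs. The test names and patterns are invented, and this is not kpet's actual implementation or data format, just the selection logic in miniature.

```python
from fnmatch import fnmatch

# Hypothetical test entries: name -> source patterns they cover.
# A test with no patterns is assumed to always run.
TESTS = {
    "usb-storage": ["drivers/usb/*", "include/linux/usb*"],
    "memory-ltp": ["mm/*", "include/linux/mm*"],
    "boot-smoke": [],
}

def tests_for_patch(touched_files):
    """Pick the tests whose covered sources intersect the patch."""
    selected = []
    for name, patterns in TESTS.items():
        if not patterns or any(
            fnmatch(path, pat) for path in touched_files for pat in patterns
        ):
            selected.append(name)
    return selected

# A patch touching only USB code skips the memory tests:
print(tests_for_patch(["drivers/usb/core/hub.c"]))
# -> ['usb-storage', 'boot-smoke']
```

That is the whole trick for containing the runtime: a patch against drivers/usb/ never pays for the memory test suite.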
So, moving on to Beaker. It's a big system which maintains an inventory of the hardware, down to the components, and lets you match that hardware. It has access control: particular groups have access to this hardware, others to that hardware, and for example some NDA hardware can be kept in there, protected. It also does the provisioning: it sets up the machines and installs the operating system from scratch, normally using Anaconda, because we don't support running from images. That's hard to do, and we are a distribution, so we have to test the whole distribution, starting from the install. So it installs everything, it talks to the test harness, it extracts test results and things like that, and it watches over the machine, so that if the machine locks up, it releases it and erases everything. Here's what the system inventory can look like; we can find machines in there. These particular ones are not very useful right now; this is an Itanium, we still have those. You can go into a machine and look at the details; this is just one tab of the host information, and there's also CPU info, storage, peripherals, things like that. And this is an example of one of our jobs, running for the stable repository of the Linux kernel. This one job is for just one architecture, and it has four hosts; here's an example of one host executing those tests. This is the Beaker UI, with a bunch of tests in it.

Further on, and now we're approaching the user-visible stuff, we have a reporter, which watches over the pipelines, checks which stage and which job they've reached, and sends the email reports to developers, or whoever is interested. Sometimes it sends an early email saying, OK, we have started testing this, watch out; or, we've done the testing; or, something failed in the pipeline. So, there's an example of a successful report that was sent to the stable mailing list. It starts by saying: we took this repo, we took this commit; then there's the summary, everything went fine, the builds actually compiled, and we used these configs. Then: we ran them on these hosts, on these architectures; aarch64, first host, second host; ppc64, two hosts; and x86_64 got more hosts, four of them.

We also have a notion of waived tests. That's a test which you mark in that kpet-db, saying "this test is waived", which means: run this test as normal, maybe at the end of the run, but ignore the result, don't take it into account when giving a verdict on whether testing failed or not. We use this for tests which were just introduced into the system, or which are being fixed, so that we can track how they are performing, whether they're doing OK, and the test maintainer can look after one until it stabilizes; then we remove the waived status. This is done manually, because tests are different and you have to look after them. And that's an example of a report that we send upstream. Our internal reports are a little more elaborate: you actually get links to the Beaker results and can explore the logs and everything. But these runs do have artifacts too, yeah: there's a blue link there, and it contains the binaries, the config files, the logs, things like that.
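On those waived tests, here is a tiny Python sketch of what the logic amounts to when computing the overall verdict. The result structure and values are invented for illustration, not taken from our reporter.

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    name: str
    passed: bool
    waived: bool = False  # marked as waived in the test database

def overall_verdict(results):
    """Fail only on non-waived failures; waived tests still run and
    get reported, but don't affect the verdict."""
    failures = [r for r in results if not r.passed and not r.waived]
    return "FAIL" if failures else "PASS"

results = [
    TestResult("boot", passed=True),
    TestResult("ltp-lite", passed=True),
    TestResult("new-shiny-test", passed=False, waived=True),
]
print(overall_verdict(results))  # PASS: only the waived test failed
```

The new, flaky test keeps producing data for its maintainer without ever blocking a developer's patch, which is exactly why the waived status is assigned and removed manually, per test.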
And then, finally, we have the data warehouse. It's a system which uses PostgreSQL and collects all the information about our runs, so it's similar to the reporter, and it's kind of a duplication of effort at this moment, but we are working on that. It watches over all the jobs and collects information like how a run went, what the status is, and how many tests ran. There's a web UI, which looks something like this, and it provides statistics: how many pipelines failed, how many succeeded, and the various statuses. This is pulled from GitLab using the GitLab API. Here's a particular pipeline, listing all the tests and all the hosts, and you can go and see in Beaker how it went. We also maintain test statistics, how tests have been failing or passing, exactly for the purpose of deciding when to waive a test. If it's been doing badly, we send it back to the maintainer and say, OK, deal with it; or we can take a test out of the waived state if it's doing fine. And the same for hosts: some hosts misbehave in Beaker, and that's a problem, because there are just so many hosts that some of them are always breaking, and you have to watch out for that, and the host maintainer, whoever maintains a particular host, doesn't have time to look after it. So we look at those, and we say: OK, this host should be excluded from the runs, and we add a "don't run on that host".

So, finally, the title of this talk... no, not yet; this part actually concerns Guillaume's talk. There is this thing you've probably seen on the last slide, KernelCI.org, with lots of tests, and it was recently accepted as a Linux Foundation project, to advance the state of kernel CI. We joined that effort, and right now we're working on a database and a system for aggregating testing information from various CI systems. The ultimate goal is that there is a single place to go and check kernel CI results, from whoever runs those tests, so that developers get one single email with all the results, and not five emails from everyone. Right now this is mostly the KernelCI folks and us, but others are joining, and hopefully we'll soon start aggregating more data; we already have a tool. You can take a look at how this could look (I don't know if Guillaume showed this, but this is an example of what test reports could look like there, at the top level). We took Google's BigQuery system for storing those results, so they are more readily available to the public, and people can go and explore the data, see how the kernel is doing, and do research if they need to. This is our repo with the code for that, and this is what it looks like when it's pushing the data. We're also working on a dashboard, to show all of this off and give it to the developers; it's very rudimentary at the moment.

Finally, the interesting part; it took me a little while to get here. We store our CI pipeline definition in YAML, but we keep it in separate repos, because of the way we trigger the pipelines: to trigger GitLab, we actually make commits to the repo, and I'll show that in a moment. Basically, these are two repos, and the repo on the left only includes pieces from the other repo. This lets the triggers make commits, carrying the information about what we want to test, into that repo, and we need two repos so that these commits don't interfere with the development commits we have. Because every time you want to test something, there is a new commit, and it's an empty commit, it doesn't carry any data; we use it just to identify the particular pipeline in the GitLab view.
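Here's a minimal Python sketch of that trigger mechanism, assuming a local checkout of the trigger repo: an empty commit, whose message records the trigger details for humans, pushed to the branch whose pipeline we want to start. The repo path, branch name and message format are invented for the example, and how the actual job parameters reach the pipeline is elided.

```python
import subprocess

def trigger_pipeline(repo_dir, branch, details):
    """Create an empty commit on the given branch and push it;
    GitLab then starts a CI pipeline for that commit."""
    def git(*args):
        subprocess.run(["git", "-C", repo_dir, *args], check=True)

    git("checkout", branch)
    # The commit changes no files; the message exists only so that
    # humans can later tell what this pipeline was triggered on.
    message = "Trigger: new revision to test\n\n" + "\n".join(
        f"{key}={value}" for key, value in details.items()
    )
    git("commit", "--allow-empty", "-m", message)
    git("push", "origin", branch)

trigger_pipeline(
    "pipeline-trigger-repo",  # hypothetical checkout of the trigger repo
    "stable-queue",           # hypothetical branch per tested tree
    {"KERNEL_REV": "abc123", "ARCHES": "x86_64 aarch64"},
)
```

One empty commit per test request is what keeps the trigger repo's history clean of real development work, which is the whole reason for the two-repo split described above.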
So, for example, the baseline trigger, the Git repo trigger, comes in, sees that there are changes in two trees, in one repo and in another, and makes commits to separate branches in that trigger repo. This next trigger is actually retired now, but it was quite interesting: it also checks, finds something, and makes its commit. There's the trigger that finds patches and commits to that same branch. And finally, the GitHub bot comes, finds that there is a new merge request, and puts it into all the branches we're interested in testing it on. It can look like this: for example, the stable branch has these commits, all with pipelines running, and this branch has its own commits here. And the commits themselves can look like this. There is data in there, but it's not for GitLab's consumption, it's only for us, for debugging: it lists all the variables we put in, all the descriptions, what we triggered on, things like that. This one is huge, and this one is big as well; it's abbreviated here.

So, we use GitLab's "extends" property a lot. It lets us separate the general pipeline into a menu of jobs and stages on one side, and tree-specific information, our pipeline-specific information, on the other. This is our shortest pipeline, and it says: OK, pick these four stages from the general pipeline (we have ten or more). And this says: pick this "prepare" step, where we download all the stuff needed for the execution, all the dependencies; this part says, this is "prepare", and this is stage "prepare". And in this one we say: OK, again, pick this "createrepo x86_64" and extend it. And this is one of those: it says, take this template of the job, plus a bunch of variables and conditions, and create a concrete job for this specific pipeline. This time we're merging things with YAML merge keys. We use "extends" over there because it's a separate YAML file, so we cannot use merge keys, and we use merge keys here, where it's all within the same big YAML file. This one would be createrepo: creating an RPM repository with the build results for testing, which are then installed in Beaker. The next stage is composed a little differently: we have a huge script which is split into a few YAML objects. And finally, the last stage looks similar to that, and so on. We have pipelines which are much longer and more involved than this one.
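Since the split between the two mechanisms matters here: "extends" is a GitLab CI feature, resolved by GitLab itself and usable across included files, while merge keys ("<<") are plain YAML and only work within one document. Here's a little Python sketch using PyYAML that shows the merge-key behavior; the job and template names are made up, not taken from our pipeline.

```python
import yaml

# "<<" pulls the anchored mapping into the job, and keys defined
# locally override it, similar in spirit to GitLab's "extends".
PIPELINE = """
.createrepo_template: &createrepo
  stage: createrepo
  script: ["createrepo_c $REPO_DIR"]
  variables:
    ARCH: x86_64

createrepo_aarch64:
  <<: *createrepo
  variables:
    ARCH: aarch64
"""

jobs = yaml.safe_load(PIPELINE)
print(jobs["createrepo_aarch64"]["stage"])      # createrepo
print(jobs["createrepo_aarch64"]["variables"])  # {'ARCH': 'aarch64'}
```

Note that a merge key replaces a conflicting key wholesale (the "variables" mapping here), whereas GitLab's "extends" does a deeper merge, which is one more reason the two are not interchangeable.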
So why did we pick GitLab? Well, we started out with Jenkins. We had a Python script controlling Jenkins, which had a job written in Groovy, which controlled another Python script, which checked out the kernel, built it, and then fed it off to Beaker. That was not very reliable, hard to debug, hard to understand, hard to maintain. In contrast, with GitLab we got a relatively straightforward system: we can keep everything in the Git repo and keep changing it faster, and, well, it's more reliable than Jenkins, for us. And I hope Michael is able to say something; we have ten minutes left. OK, now you will hear me complaining about GitLab. So, I don't know, how many people here have used GitLab? And how many people have used GitHub? Yeah. So it's a very familiar system. It's nicely documented, it has a huge API surface, people are familiar with it. If I say "GitLab", people actually know what I'm talking about. If I mention some other CI technology, people look at me funny; if I mention Jenkins, people just walk away, like, no, we don't want you on our team. Hopefully there's nobody from Jenkins here. Sorry.

But then, we are actually testing the kernel. That's quite similar to testing other software in some aspects, but in other aspects, especially around the testing itself, there are some interesting issues, as you will see. One is that GitLab, like most of these general CI systems, doesn't have any concept of a pipeline failing because of infrastructure issues. If you look at what distributions do for gating, most of them actually distinguish "a test failed, maintainer, go fix it" from something like "our infrastructure failed" or "our test system failed", which is for somebody else to fix. Now, most of you probably know that kernel maintainers don't react too well if you email them without a good reason, and if you email a Linux kernel list with your infrastructure issues, they get pissed quite easily. So we really want to avoid that, and that is not easy: you actually need to build stuff around GitLab to make it happen. On the slide on the left, you see what the test system, Beaker in our case, actually gives you, and there's one status missing. A "panic" result in Beaker means the kernel didn't boot: maybe the hardware messed up for whatever reason, or the distribution didn't boot, or there was some power surge or whatever; or actually we messed up, our general infrastructure had networking issues, stuff like that. That distinction is not in the system, and it might never get in there, because it's not something you would normally need for your average software project.

Just to add a little bit about that (oh, there's a camera right there; yeah, OK, sorry). The thing is, GitLab CI does know about its own infrastructure issues, and you can select: restart on this failure, on this failure, on this failure. But it doesn't actually matter to me which of GitLab's own failures it restarts on, because those are GitLab's things. Our testing can fail in various ways, but GitLab doesn't allow me to say, restart on this issue of ours, or on that one; I can only say the test passed or failed. And that probably comes from what GitLab is intended for: a test run where nothing much can happen, just running tests on plain software. But for us, if you remember that job, that was four hosts for just one architecture, and there are two, three, four architectures, more than that. And what GitLab does is simply kill us. Yeah, and that stays there; that's a separate slide, I'm confusing the issues, but basically we cannot tell GitLab: OK, we had an infrastructure issue, can you restart? That's a big deal for us.

So, I mean, there are more interesting things. As of the beginning of January, we are producing 30 gigabytes of artifacts a day: kernel builds, RPM builds, all kinds of stuff. And if you use a shared GitLab instance, like gitlab.com or whatever we have at Red Hat (we have a couple of those), people might not have that storage available. So you want to store the artifacts outside, like in S3: if you build in the cloud, you want to keep them there, because you don't want to incur the transfer costs of moving them in and out. And that is not possible at the moment in GitLab: you can configure artifact storage per instance, but not per project. And the list goes on. So there are certain things where it doesn't really match very well with kernel testing.
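To illustrate the kind of glue you end up building around GitLab for this, here's a hedged Python sketch that buckets raw test-system results into genuine test failures versus infrastructure failures, so that only the former ever reach a developer's inbox. The status names are rough approximations of Beaker's, and the mapping is a policy we'd have to implement ourselves, not anything GitLab provides.

```python
# Outcomes a hardware-backed test system can produce; GitLab CI
# itself only understands "job passed" or "job failed".
TEST_FAILURE = {"Fail", "Warn"}       # the kernel change is at fault
INFRA_FAILURE = {"Panic", "Aborted"}  # machine/distro/infra broke

def classify(beaker_results):
    """Return (verdict, retry): hide infrastructure noise from
    kernel developers, and only report real failures to them."""
    if any(r in TEST_FAILURE for r in beaker_results):
        return "fail", False  # genuine failure: email the developer
    if any(r in INFRA_FAILURE for r in beaker_results):
        return "error", True  # infra issue: retry, never email the list
    return "pass", False

print(classify(["Pass", "Panic"]))  # ('error', True)
print(classify(["Pass", "Fail"]))   # ('fail', False)
```

Since GitLab only sees pass or fail, the "error" verdict and the retry decision have to live outside of it, which is exactly the complaint here.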
And you try to work around it. It's possible, but it gets more ugly. You can take a look at the code, it's on GitLab; don't blame us for how we did it. It's also really hard to upgrade: if you have pipelines running for a day, and you do a GitLab upgrade, the runners stop accepting jobs, so you don't get any builds or any tests for a day. You can work around that too, upgrade different runners at different times, stuff like that. What else do we have? Let me just skip down. Oh yeah, that's an interesting one: most test systems don't expect the test to reboot. I don't know why, but kernel tests actually reboot a couple of times; it's something that kernel developers think is useful. So you need to boot into your kernel, but then the test might also restart the machine a couple of times in there. So you can't really have your test harness, the GitLab part, be restarted along with it. You need another indirection: start another VM, or keep the GitLab part outside of your testing system, in this case Beaker. Otherwise you could just put it inside your hardware lab, which you can't do at the moment. Yeah, maybe we stop here and take some questions; otherwise I'd just keep complaining about other bugs.

So, the question is: how much time do you actually gain from having CI? I think that, depending on who you ask, there might be a different answer. Developers would most likely say it doesn't help us at all, especially in the beginning, like now, where you might actually get infrastructure issues giving you false positives. But I think we find a couple of issues a week, actually four, maybe. Four a week, where kernel developers were really sure that they got it right, but they didn't. That could be patches posted to the mailing list, or something merged into stable, for example, into stable Linux. The ultimate goal is actually to free resources inside of RHEL, because upstream patches break things, so we want to provide feedback on a patch before it ever goes through the whole pipeline; right now we only find out about a problem when it's already merged and built into an RPM. The ultimate goal is to have the work done upstream, as was already said. OK.

You mentioned we were running Jenkins before, and the question was how painful the migration was. We just rewrote everything and then switched. We took some of the tools that Jenkins was using and reused them in the new pipeline, but everything that lived in Jenkins itself we rewrote, because you can't really run that in GitLab. So we had to rewrite a big part, and we had to rewrite the triggers and things like that; we replaced the separate tool which was controlling Jenkins with those little triggers that I showed you. Any more questions? Yes?

The question was: is GitLab working on those issues? On some of them, I think. There is one issue that particularly pisses me off: GitLab simply kills the runners. Yeah, this one. They simply kill the runners with SIGKILL. For us, that's a runner controlling Beaker resources, like, I don't know, ten hosts that are running those tests for hours, and we just forget about them because of that. GitLab just forgets, like, ah, whatever. And those hosts stay occupied for all those hours, so we cannot clean up. This bug has been open for years, I think, and they're promising they will fix it soon, so I hope they will. There's somebody over there hoping that too.
Okay, our time is up, so catch us in the corridor. Thank you.