Hello. So I'm Alexandre Belloni. I will talk about the Yocto Project autobuilders and the SWAT team.

First, something about me. I'm the co-owner and COO at Bootlin. Bootlin, very quickly: we do embedded Linux. We provide engineering services, mostly BSP development on custom boards, things like that. We port the kernel, we port bootloaders, drivers. We also do Yocto Project, OpenEmbedded and Buildroot integration, so that's where we are relevant here. And we also do real-time, boot time, and so on. The other part of what we provide is training services, where we train companies and engineers on embedded Linux, kernel drivers, and then Yocto Project, Buildroot, and other topics. Myself, I'm a kernel maintainer for the RTC and I3C subsystems. I used to do a lot of work on those topics, and I still maintain them. I'm also the co-maintainer for the Microchip ARM and MIPS SoCs, so I'm kind of winding down there; they are taking that back, but that's also something I used to do. And the other part is that I'm the Yocto Project SWAT team lead, so that's where I'm relevant for this topic, right? I'm living in Lyon, but whatever.

So, the Yocto Project autobuilders. This diagram comes from the Yocto Project documentation. Basically, we have a controller that is there to manage the build requests, and it will push those build requests to some workers. Actually, the controller is also one of the workers, but whatever. And on the side, we have a NAS. On that NAS we have the shared state cache, and that one is shared across all our workers. We also have a source mirror on the NAS, which is shared by all the workers as well; all the workers are basically sharing the same location.

We also handle the hash equivalence server, so we do have a hash equivalence server. That will not be the topic of today, but basically this is very useful for us, and you will see that later on, because our workers are not all identical: we are building on many different distributions, and hash equivalence allows us to know when outputs are actually equivalent. The basic idea is that the base hash can be different, but we end up with the same binary, so we can say: OK, even if the hash is different, we have the same binary, we have an equivalence there. Joshua, just over there, will know more about that than me.

So basically, the whole project is based on Buildbot. Buildbot has been chosen because it's a Python continuous integration framework, and because it is Python: BitBake is Python, so the thinking was, OK, we have people that know Python in the project, so they will probably know how to change Buildbot. We then have two different repositories. The first one is yocto-autobuilder2, and that's the one that lists and defines all the builders. By builders, I mean an available type of build, something you can then create a build from. And then we also have some schedulers; I will go over them later on. This is not a tutorial about how to set all that up, because it's actually easy: there is a README, and it will take maybe five to ten minutes to set up your own autobuilder and workers. So it's very easy. I even managed to do it, so everybody can do it, right?
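Coming back to the hash equivalence server for a moment: to make that part a bit more concrete, here is a minimal local.conf sketch of how a build can point at the project's public server. This is my own illustration, not the autobuilder's actual configuration, and it assumes a recent Poky where BB_HASHSERVE_UPSTREAM is available; the server address is the one mentioned later in this talk.

    # Use the hash-equivalence-aware signature handler
    BB_SIGNATURE_HANDLER = "OEEquivHash"
    # Run a local hash equivalence server and federate it with the public one
    BB_HASHSERVE = "auto"
    BB_HASHSERVE_UPSTREAM = "typhoon.yocto.io:8687"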
Then there is more configuration, which is more interesting in our case, in yocto-autobuilder-helper. It's a different repository. The main file there is config.json, and basically that config adds more steps to the builders. So we have defined builders, but we can configure them a bit more through that config file. It is simply consumed by the run-config script, which is what yocto-autobuilder2 invokes, and basically that's how we set up Buildbot.

So there is a nice UI, and if I'm lucky, I can actually show that to you. Yes, the UI looks like that, so we are lucky. We have a lot of green builds now, so that's good. Basically, that's the console. Each of the dots is a build. A green build is successful. A yellow build that is pulsating is ongoing. Then we have red, which is failed; we do have one, yes, that one has failed. And then we have the pink ones, which have been cancelled. They are not really failed, but at the same time they didn't really finish either. So we have a bunch of those, as you can see. I didn't do a screenshot because, well, we have 81 different builders, and we are building most of them, not all of them, but most of them multiple times per day. Let me get back there.

So, the workers. We currently have 26 workers. You have their names there; it's not very interesting. What is interesting is that we are building on many different distributions: AlmaLinux 8, CentOS 7, CentOS Stream 8, Debian 11, Fedora 35 and 36, openSUSE Tumbleweed, and Ubuntu 18.04, 20.04, 21.10, and 22.04. The goal there is really to be able to push builds to many different distributions and know where things are working and where they are failing. And we will see later on, I have some examples of very weird issues that did happen and that are not that easy to find. Sometimes it's very easy: OK, there is a missing package, or a package has changed name, or something like that. Very easy to fix: you install the package, that's fixed. And sometimes you have issues that are not that easy to find and to really pinpoint, like, OK, this is because that particular build is running on Tumbleweed, and things like that. Obviously, we are not running all the builds on all the distributions all the time. We schedule those builds and they get kind of randomly assigned to workers. It's not always random, because we have some builds that are Ubuntu-specific, CentOS-specific, Fedora-specific, but apart from that, we are kind of randomly assigning builds to workers.

Those workers have two different architectures; that is also there to make things build on many different setups. So we have two types of workers, basically x86 and ARMv8, so AArch64. Most of the x86 workers have 28 cores, which means 56 threads; they are Intel Xeon E5. We have some that have fewer cores, so 24 cores and 48 threads, and we even have two workers that have only 12 cores and 24 threads. The amount of RAM goes from 128 to 384 gigabytes, so that's nothing, right? And then we have two ARM workers: one with 64 Cortex-A72 cores and 256 gigs of RAM, and another one with 32 cores and 128 gigs of RAM. So that's good. There are new ARM workers that are not yet in the pool; ARM gave us new workers, which is very nice of them, and we will put them in the pool soon.
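To give a rough idea of what those core and thread counts mean for a single build, a worker with 28 cores and 56 threads would typically be driven with something like the following in local.conf. These exact values are illustrative, not the autobuilder's real settings:

    # Number of BitBake tasks to run in parallel
    BB_NUMBER_THREADS = "56"
    # Number of make/ninja jobs inside a single compile task
    PARALLEL_MAKE = "-j 56"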
So the main drawback of all that heterogeneity is that it creates a fair amount of maintenance, because it's not like we have something homogeneous where we can say, OK, we just have to install these packages on that server and that will be it. All the workers are kind of different, and that creates a lot of work; we will see that later on. Sometimes we have issues with permissions that are not the same: for example, on Debian and Fedora they will be the same, but on Tumbleweed there will be different permissions for device files, different group names or user IDs, things like that. That creates a fair amount of maintenance, because this is something you have to know, and those are the kinds of things that also change when updating. Some of them are rolling releases. We do update every Friday; every Friday we have a scheduled maintenance window, we update, and sometimes things break, so we need to fix those.

We also have two workers that are specifically reserved for build performance testing, which means they will not get any work scheduled apart from the build performance testing. Something else to point out is that about half of the workers are now working from SSDs, which improved not only the performance but also the reliability of our builds. I will talk about that later. For a few months it had been quite an issue to avoid false positives, because things were getting slow and so builds were failing. So we'll see that.

So, builders: what kind of builds can we do? We currently have 81 different builders defined. I don't expect you to read through all of them; I have some information on that. So how do you know what gets built by a particular builder? Well, basically, most of the builders will build core-image-sato, which is a fairly large image, and they will do that using Poky. Poky is kind of designed to be configured to build as much as possible. The goal, obviously, is to extend the build coverage as much as possible: we want to build the whole of OpenEmbedded-Core and Poky and things like that. Otherwise, you can have a look at config.json, where you will see exactly what a particular builder will do.

So we have that just there. That's qemux86-64, for example. For that one we have a particular machine, so we will be building for the qemux86-64 machine, and there is a template. On top of the template, we are appending new image filesystem types, but what is interesting there is the template. The template looks like that: that was the arch-qemu template referenced there, and this is the definition of that arch-qemu template. So what do we say there? Basically, we enable buildinfo, buildhistory, things like that. And then we have the targets we are going to build, and you can see that right now we will be building core-image-sato, core-image-sato-sdk, core-image-minimal, core-image-minimal-dev, and some others; it doesn't fit on the screen, but that's fine. Then we have sanity targets, which means: what kind of tests do you want to run on those targets? So core-image-minimal runs testimage, core-image-sato runs testimage. Those are the kinds of tests. And then you have that multiple times, so later on we also build the SDK and the eSDK, and we also run some self-tests. But again, it's up to you to look at exactly what is done. I have a quick summary there.
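To put that template in more familiar terms, here is roughly what a builder like qemux86-64 ends up doing, expressed as local.conf settings plus bitbake invocations in comments. This is an approximation for illustration, not the actual config.json syntax, and the appended image types are just examples:

    # Roughly what the qemux86-64 builder does, in local.conf terms
    MACHINE = "qemux86-64"
    # The builder appends extra image filesystem types on top of the template
    IMAGE_FSTYPES:append = " wic wic.bmap"
    # Template targets, built with something like:
    #   bitbake core-image-sato core-image-sato-sdk core-image-minimal core-image-minimal-dev
    # Sanity targets, i.e. tests run on those images:
    #   bitbake core-image-minimal core-image-sato -c testimage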
We have two parent builders, named a-full and a-quick. "a", because alphabetically they come first in the list of things that get built, which makes them easy to select in the console; those were the first two columns. If I go back to the console, there you have a-full, which is still building, so that's the yellow one, and then a-quick. So that's why it's "a", and then we have full and quick. Basically, those are parent builders: when you ask for those builds, you are not actually building anything yourself, you are just starting other builds. On the right of a-full, you have all the builds that have been started by a-full, and obviously a-full triggers more builds than a-quick. The goal of a-quick is to quickly test something; the goal of a-full is really to validate that the current branch is OK.

We also have AUH, for the Automatic Upgrade Helper, which tries to upgrade all the recipes to the latest upstream version. I can guarantee you that that one never finishes successfully, but it's still useful, because some of the upgrades build fine and then we have some failures, so we have a look at those. Basically, that one is scheduled on the first day of the month and on the 15th.

We then have machine-specific builders. Those build the images for the Yocto Project members' machines: BeagleBone for TI, EdgeRouter for, I guess, Cisco, if I'm not mistaken, and then the Intel ones, so genericx86 and genericx86-64. We also have the "alt" variants: when you see "alt" there, it means we are using poky-altcfg, which basically uses systemd instead of SysV init. Then we have the QEMU-based machines, which are very important, because by default OpenEmbedded-Core only supports the QEMU-based machines, and that allows us to actually run the images we are generating on QEMU. We will get to that.

Then we have the performance builders. Those are basically there to record the build time and other performance-related metrics, so we are trying to understand whether a particular modification makes the build quicker, hopefully. But quite often the build will be a bit longer; as soon as you add stuff, obviously, it gets longer. Then you have documentation, so we are building the documentation. We also include the Yocto Project members' layers, so we check those. We also have two builders, check-layer and check-layer-nightly, which check whether the included layers are Yocto Project Compatible. This is very important; I will not go into what that is about, but basically we want all the layers to work well together, and this is something we can check.

We then have metrics, which is a pretty new builder. Metrics will go and check CVEs for all the packages that it knows about, so everything that is in Poky, basically, will be checked against CVEs, and it will report when we have CVEs that are not fixed. Obviously, we have that result, and then it requires some manual work to decide: OK, we don't care, because we are not actually using what the CVE is about; or sometimes we will just backport patches, and in that case we have to tell the CVE check that the patch has been backported, so we are not affected by that CVE.

We have ptests; I will get to those later on. Ptest is basically running tests on the actual image; we'll see what those are. We have LTP, the Linux Test Project, so we are running LTP on those machines, again QEMU-emulated machines, but we do run that.
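As a quick aside on that CVE checking: in your own build you can get the same kind of report from OpenEmbedded-Core's cve-check class. A minimal sketch, assuming a recent release (older releases use different variable names, and the CVE identifier below is just a placeholder):

    # local.conf: check all built recipes against known CVEs
    INHERIT += "cve-check"
    # Ignore a CVE that does not apply to this build (placeholder ID)
    CVE_CHECK_IGNORE += "CVE-2021-00000"

A backported fix is picked up when the corresponding patch carries a "CVE: CVE-XXXX-NNNN" tag in its header, which is essentially how you tell the CVE check that a patch has been backported.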
Next in the list of builders is reproducible, which ensures that the generated packages are bit-for-bit reproducible. Very important: all the packages generated by OpenEmbedded-Core are fully reproducible, apart from Go and the Ruby documentation; we'll see that. Then we have wic, which generates multiple disk images using wic, just to test that. And then non-GPLv3, which builds with INCOMPATIBLE_LICENSE set so that anything that ends up GPLv3 is excluded, which is very useful for some companies that want to avoid GPLv3. Obviously, this includes meta-gplv2, to have access to all the necessary tools in GPLv2 versions instead of GPLv3.

So those were the builders. Now we have some schedulers, so we have builds that are scheduled. Any builder can be triggered manually through the Buildbot interface: if I go back to the interface, I can select a particular builder, select the branches, and send a new build to the autobuilders. But we also have a-quick, which runs every day of the week except Sunday and builds master. Then we have a-full, which also builds master, but only on Sundays. Then we have check-layer-nightly, which checks all the layers every day. Metrics also runs every day. We also do a check-layer-nightly for Kirkstone and Dunfell. AUH, like I said, runs twice a month, on the first and the 15th. And build performance runs four times a day, at 3 AM, 9 AM, 3 PM, and 9 PM, so that we can actually average what is happening; obviously you may see a bit of variation in a single build, but sometimes that is caused by external factors, like: OK, I had a slow internet connection at the time, so I couldn't download my packages that fast, things like that. And finally, we have docs, which runs on every commit. It is the only builder that runs on every commit pushed to the repository.

Why? Because an a-full build takes five to nine hours to complete, and it loads most of the workers, meaning that for five to nine hours, all your workers will be pretty loaded. So it's not very practical to start a build automatically for every commit made on master. And it is even less practical to do so on patches, because the goal there would be to test all the patches sent to the mailing list; we don't do that, because if it takes nine hours every time, we would run out of resources quite fast. So build testing is a manual process, and this is where the SWAT team comes in; this is what I do. Richard Purdie used to do that on his own, and this is something he managed to offload, so I'm doing that now, and I also have a colleague, Luca, who is doing it. That is for master and master-next; for the stable branches, we have other people who take care of them. Steve Sakoman, for example, takes care of Dunfell and Kirkstone, which are the LTS releases.

The process starts by reviewing and collecting patches from the mailing lists. We have multiple mailing lists, because Poky is composed of multiple projects: we have BitBake, we have OpenEmbedded-Core, but we also have meta-yocto, which contains all the Yocto Project specifics, so the definition of Poky, for example; there are actually two layers in there. And we have yocto-docs, the documentation. From the patches we collected, we do a quick review. We are not testing all the patches that we can see, but we do a quick review: the patch looks good.
We create a new branch: we apply those patches on the individual repositories, then we create a new Poky branch using combo-layer. Combo-layer will basically take all those repositories and create a new repository, which is Poky, from them. It is not a very nice tool. No, it is not a very nice tool, and honestly it doesn't work very well, especially once there are multiple people working on the same branch. It's helpful, but this is not the topic here. Then we push that branch upstream. My branches are on poky-contrib, which is a publicly available Poky repository; you can have a look at it. And then I tell the autobuilders to start an a-full build on my own branch. That build runs, and if we have build failures, then hopefully I can find very quickly which patch created that failure, which usually should be the case: master is building just fine, master-next is not building properly, so probably I have a patch in there that is making my build fail. In that case it's very easy: I remove the patch, and maybe I will collect other patches and so on until I have a stable branch. Once my a-full is successful, I provide my branch to Richard. He does a final review of those patches, and he merges everything into master. That's the usual workflow.

Yeah, OK, this slide should have come before. We have self-test builders, which run the BitBake self-tests, testing BitBake and its API, which includes the parser and the fetchers. We also have oe-selftest, which is the target that contains the reproducible tests, because the reproducible test takes a lot of time; most of the nine hours is actually because of the reproducible build, because basically the reproducible build builds twice and then compares those builds. It tests other things too; you have the list of what is tested there. Honestly, it doesn't test enough. We'll see that: we need more tests. It also runs pylint, so when pylint3 is available, it will run it on the Python modules, so BitBake and anything written in Python in OpenEmbedded-Core. Yeah, I'm missing a slide I was looking for. OK, no matter.

We also have other tests. We do test the SDK, and we have two targets for that, testsdk and testsdkext. They basically do the same thing, except that testsdkext tests devtool on top of the regular SDK. The tests are at that location, in OpenEmbedded-Core, in lib/oeqa/sdk. They assume that the SDK environment is set up, which means that if you want to run those tests on your own, you first have to populate the SDK and then source the SDK environment script. What do they test? Basically, whether you can actually generate binaries for your target: the goal is, OK, I have an ARM target, I want to be able to generate ARM binaries using my SDK, and not MIPS binaries.

The autobuilders don't have a board farm. Like I said, we are building for machines, but we are not actually running those builds on real hardware. But we have powerful servers, so we can run QEMU, and we do run the QEMU images on the workers. So we have the testimage task, which you can also run on your own if you want. It runs QEMU: it boots the generated kernel with the generated root filesystem, and it tests many things.
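If you want to try that testimage flow yourself, a minimal local.conf sketch looks roughly like this, assuming a QEMU machine; the exact way the class is enabled varies a bit between releases:

    MACHINE = "qemux86-64"
    # Enable image testing; on older releases this was INHERIT += "testimage"
    IMAGE_CLASSES += "testimage"
    # Either run the tests automatically after each image build...
    TESTIMAGE_AUTO = "1"
    # ...or trigger them manually:
    #   bitbake core-image-sato -c testimage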
This is also part of why it is so difficult to get workers working inside the autobuilder farm: to be able to use runqemu, you also have to have the tun/tap interfaces and things like that that runqemu will look for. So this is yet another complexity in the maintenance of those workers. I said that we have the LTP builders, which run LTP inside that image: they install LTP in the root filesystem and run it in QEMU.

Then we have the ptests. What are the ptests? Well, the ptests are basically package tests. Those are tests that come with the packages, so with the upstream release: we take the unit tests from upstream and we run them. As you can see, OpenSSL, glibc, LTTng, Python 3, they all have tests, and we are running those tests. You have the full list of tests that are run in that include file, meta/conf/distro/include/ptest-packagelists.inc. And again, in your own layer, nothing prevents you from adding your own ptests; it is very easy to do. If you want to ensure in your CI that what you do is still working, you can provide your own ptests; there is a small recipe sketch a bit further down.

Then reproducible, like I said, does two builds. The first build is allowed to reuse the shared state cache, because then it goes quite a bit faster, but the second one is not, because we actually want to rebuild those binaries from scratch. Then both outputs are compared, and all the package types are tested: for those two builds, we are actually building all the package types we can, so IPK, deb, and RPM. That generates a huge amount of packages. When we have failures, we upload them to that location, repro-fail, and we even have the diffoscope output and things like that. We even have the binaries, so if you want to compare the binaries yourself, you can do so. And we have those results, which are on yoctoproject.org, on the reproducible build results page. Those results are actually quite nice because, as I said, 100% of the packages that are tested are reproducible. The only ones that are not tested are Go, which is an upstream issue, and the Ruby documentation, which is also an upstream issue: basically, the Ruby documentation generator reorders some sections, and it does so randomly, so it's very difficult to fix without knowing how rdoc works. It is what it is.

So what are the saved results from the autobuilders? The standard output is saved, so you always have access to the standard output. Actually, if I go to the console and click on one build, so that one was one of the failed builds, it shows me the standard output of that build. So this is what failed; that one is quite short, so that's nice, but you can see this is basically the output of BitBake, so if you know how to read the output of BitBake, it is there. We also have the shared state; obviously, that goes to the NAS. We also have hash equivalence, like I said; that one is also exported, so you actually have access to it if you want, on typhoon.yocto.io on port 8687. We also have the build history that is pushed, so if you want to look at the autobuilders' build history, you have access to that. I don't know if you all know what build history is, but basically it records the difference between your current build and your previous build, and it does so in a git repository.
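Here is the small ptest recipe sketch I mentioned, showing roughly what adding a ptest to one of your own recipes looks like. The recipe name and the installed test directory are made up for illustration; the mechanism, the ptest class plus a run-ptest script shipped in SRC_URI, is the standard one:

    # myapp_1.0.bb (excerpt, illustrative)
    SRC_URI += "file://run-ptest"
    inherit ptest

    # Anything the test suite needs on the target at run time
    RDEPENDS:${PN}-ptest += "bash"

    do_install_ptest() {
        # Copy the upstream test suite next to the run-ptest script
        cp -r ${S}/tests ${D}${PTEST_PATH}/
    }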
Every commit in that build history repository is a separate build, so that allows you to know what is different. We also have test results; those include the results of LTP and the ptests. We have the build statistics, so those are the ones coming from the performance test builds, and I guess I can show you what we do. Nope, I don't have that. Let me get that from somewhere. That will be the one. Nope, that's the one. So those are build statistics, and we actually generate nice graphs: the build time is there, and we don't record the actual version, but that's the number of commits since the first commit on master. And if we have a look, it's about one minute of difference over the previous 16 runs, so that's not that bad. Then we have the size of TMPDIR, the rootfs size, the build time for different images, and we also have the do_rootfs time, so whether we take more time to generate the root filesystem from the packages; that's also very good information. It's quite stable right now on that particular build, which is nice.

Those are also available at that address: autobuilder.yocto.io/pub/non-release, or pub/releases, depending on whether you're interested in a release build. So for example, if you're interested in 4.0.0, you could go to releases, but if you are interested in the master builds, you can go to pub/non-release, and you will find basically all the builds with what happened: you get the performance report, the ptest logs, the build history, and you also have all the failures that are recorded, things like that. A lot of data is recorded, and we'll see that it's still not enough. So yeah, that was the build output I just showed you; if you want to look at it, the URL is there.

So finally, I come to the SWAT team. Basically, the SWAT team looks at all those build failures. Like I said, sometimes it's very easy: the build failure is caused by a patch, you remove the patch, you have removed that build failure, that's nice. It's not always the case, because quite often we have what we call the intermittent issues, the AB-INT issues; that would be that case there. In that case we need to track those, and preferably we want to solve them too.

So what kind of issues did we have? We track 230 issues that have been closed. Those are all intermittent issues, so not caused by a particular patch. Some of them are caused by the infrastructure. For example, you have that issue: that was basically virgl not running, and it was a permission issue. If you look at bug number 14551, you will see that basically, on openSUSE, they changed a particular device from one group to another, and the builder user was not in that group; and at the time, that group didn't exist on Fedora, for example. So we had to add that to the openSUSE setup.

We also find issues that are present upstream. That's one of those: there was a Perl install race, where basically pod2text was installed before getting built, which is very weird. We had a look at the makefile; it was very difficult to find a race condition there, and actually there are no race conditions in the makefile itself, but make had an issue. In that case it was make 4.1 that had the issue, and make 4.1 was shipped on Ubuntu 16.04 and 18.04. So that one took a lot of time to find, because it would mostly work, and only sometimes, when that build ran on Ubuntu 16.04 or 18.04, would it fail.
But it would not fail every time. So I gathered some statistics; I'm responsible for that, so I'm gathering those statistics, and I'm trying to find out: OK, this only happened on Ubuntu, what are the commonalities between those two Ubuntu releases, things like that. That's how we found out.

We also had kind of the same thing where sometimes the build would hang, and this was caused by make 4.2.1, which was shipped by CentOS, AlmaLinux, CentOS Stream, and openSUSE. We did report that bug upstream to CentOS. They said, OK, you are doing weird things with make or whatever; but no, we are just compiling the kernel. So they wanted us to report it to Fedora, which basically means Red Hat, but Red Hat didn't care too much, because they had already moved off make 4.2.1. So basically, the solution was to disallow make 4.2.1. That one was also painful to find.

For a while, like I said, we had a lot of performance-related issues, because multiple builds are allowed to run on each worker, so you may have multiple BitBakes running in parallel on the same worker, and that increases the load. What we found out is that we sometimes had RCU stalls in the kernel when it was running in QEMU. Why? Because we were compiling, I don't know, WebKit or Node.js or something really load-intensive, and QEMU didn't get any cycles to make the virtual machine go forward. So basically, the kernel running inside QEMU was just bailing out: OK, it's been 20 seconds since I last ran, let's crash.

How did we solve that? Now we are using make's load awareness, so that is the -l option just there, -l 52: if the load goes higher than 52 on the worker, make will automatically stop spawning more work. We also lowered the limits for xz. Those settings are specific to the Yocto Project autobuilder; they are different from the current defaults for OpenEmbedded-Core, but they may be useful for your own CI. For xz, we are limited to 8 threads with a maximum of 5% of memory, because xz, when it's compressing, takes a huge amount of memory, and it was also starving other processes. We are also copying the rootfs image to a tmpfs so that we can avoid I/O when running QEMU. And finally, the workers are getting switched to SSDs, which also improved the reliability of those builds.

Finally, yep, I'm out of time, so just my last slide: what needs to be worked on? That's where we need your help. We need better logging, because sometimes it's very difficult to pinpoint what the issue is, because we don't have the logs, or because we are building so much that we remove the build directories and sometimes it's too late to go there and fetch the logs we are interested in. So we need to improve that, and in particular collect the relevant output when something fails; if you are up for some development, you could do that. We also need better ptests and better oe-selftest. Ptest is useful for everyone, because again, the ptests are the unit tests from the upstream packages, so you actually contribute to upstream, which also improves the Yocto Project. And finally, some tests would benefit from being more robust and especially less timing-dependent, because again, we are loading those workers so much that sometimes a test fails just because something times out, and it's not actually a failure, it just timed out. And I'm also timing out.
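For reference, the load-related tuning I just described looks roughly like this in local.conf. The -l 52, 8-thread, and 5% values are the ones from this talk, while -j 56 is just an example matching a 56-thread worker:

    # Cap parallel make jobs and stop spawning new ones above a load of 52
    PARALLEL_MAKE = "-j 56 -l 52"
    # Keep xz from starving other builds on a shared worker
    XZ_THREADS = "8"
    XZ_MEMLIMIT = "5%"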
So I will take the questions, I guess. I will take the questions off stage.