Hello, happy to see you here today, and thank you for coming to our talk. My name is David, and co-presenting with me will be Erika. About me: I love free and open source software. I've been working with it since I was about 15, when I started playing with Ubuntu, and I've touched everything from GNOME to Linux kernel libraries and such, and I like it a lot. I'm giving this talk mostly because I love to optimize for performance, and in recent years also for simplicity, because having fast code isn't always the best thing if no one understands it.

Let's go to the next slide. So, the outline of this talk. First, an introduction: what software and what hardware we're using. Then I'll go very quickly through level one and level two, which is the basic usage of CI — the stuff you usually do if you've ever used CI. Then in level three I'll get to the more interesting things, like the devices under test and how we test on real devices. And Erika will continue with level four, which is running a farm of real hardware and taking care of it. Let's go, then.

So, what is Mesa3D CI? Mesa3D itself is the graphics drivers for Linux: one part of each driver lives inside the Linux kernel, and everything that takes care of rendering, OpenGL and Vulkan is in Mesa. And Mesa3D CI is a solution built on GitLab CI, because we use GitLab — we have our own instance, the freedesktop.org GitLab — and it's built on top of that to test on real hardware.

Before we start: my talk will be focused mostly on pre-merge testing. We do other testing as well, but the important part is that when a developer submits code to the project, we need to test it quickly and make sure it doesn't break anything. On the other hand, we don't want to block other developers or their merge requests for too long, so we have to do it in a reasonable time frame; our goal is about 20 minutes for the testing. And of course, we want to keep developers happy, because sad developers and broken testing never go well together.

So, level one: we do the basic things. We build containers for a few distributions — we're using Debian, Fedora, and Alpine right now. For the testing later we only use the Debian images, but we also build the other distributions because, for example, Alpine uses musl libc, so it's a different environment we want to build against. And Fedora, you know, because why not. For this we use ci-templates, which help us prepare the images and so on; it's provided by the freedesktop.org GitLab infrastructure, so that part is off-the-shelf tooling. We build with GCC and Clang; we have around 20 build jobs just for building different combinations with different options, like LTO. We use address sanitizers and memory sanitizers — we test everything we can offline, without hardware. And for linting we use the Rust lints and clang-format, and for our CI scripts we use ShellCheck — we'll get to why.

Level two: testing without hardware. After we compile Mesa, on some jobs we run unit tests, so the basic functionality of the libraries. And we test shaders with shader-db, where we kind of fake the GPU: we pretend we have one and compile the shaders through the backends, so we know the generated shaders are OK. Then we do runtime testing, which we can do without hardware because we have a few drivers that run on the CPU, so we can test Vulkan and OpenGL purely on the CPU.

So, this is our small pipeline. Some people have 8K monitors.
So they can fit the whole pipeline on one screen. You probably won't be able to read the individual jobs — don't worry, when it gets late at night and we're tired, we can't either. This is around, what, 200 to 250 jobs? Not every job is on the slide, so I had to try to make it fit. And this is basically level three: most of these jobs run on real devices.

So what do we use? We have multiple solutions for testing, because many companies contribute to this, so we have several approaches integrated into GitLab CI. The first one is the LAVA farms. LAVA is the Linaro Automated Validation Architecture; it was originally built by Linaro for testing ARM devices, and these days we're using it for AMD64 as well — for pretty much everything. These farms have some advantages: you get monitoring on top of them, you can set priorities for jobs, so you have a lot of options for handling things, which we sometimes use in our CI. For example, we use priorities a lot: since we allow developers to manually run jobs and test whatever they need, but the merge pipeline — the one that runs pre-merge, before the code gets in — still has to fit into those 20 minutes, if someone starts excessively testing their jobs on the CI they would block the other people who want to merge. So the prioritization in these farms is very useful for us.

Then, for example, we have the Valve farms, which use a slightly different approach: containers. The devices boot into a minimal environment and just load a container with the tests. These tests are the same ones we use on the other devices, just packaged differently, into a container — and we'll talk about containers versus rootfs, which we use on other devices, very soon.

Then we have the bare-bones devices, which have no prioritization and things like that. There are multiple farms of these: some people run these devices at home without any farm software, just for their own testing, and some companies also run bare-bones setups — Google, for example, because they didn't want to use LAVA. Every farm handles things a little differently, so on every device you have to expect a slightly different environment. When you're writing tests, or you want reliable results, you always have to test against everything. What's good is that one test can be run everywhere — the tests are shared, so if we write a test, it's the same on every farm. And sometimes devices have different kernels: the Raspberry Pi, for example, has a custom vendor kernel for reliability reasons, but on the other hand, if you want to enable some kernel feature that would be very useful for our testing, you can't, because these kernels are shipped as-is and you cannot update them.

On top of this hardware we also use some smart logic. For example, if you push code into our repository and it only touches Intel code and not the shared parts, only the Intel jobs get tested, so you don't excessively waste cycles, CI capacity, energy, and time testing everything. Also, we have kill switches for the farms; we currently have about seven of them.
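Conceptually, both the path-based filtering and the kill switches boil down to a small decision before each hardware job. In Mesa this lives in the GitLab CI configuration; the sketch below only illustrates the idea in Python, and the path lists, variable names, and function names are made up for the example, not the real configuration.

```python
import os
import subprocess

# Illustrative path lists only -- the real mapping lives in the CI config.
DRIVER_PATHS = {
    "intel": ["src/intel/"],
    "amd":   ["src/amd/"],
    "lima":  ["src/gallium/drivers/lima/"],
}
SHARED_PATHS = ["src/compiler/", "src/gallium/auxiliary/", ".gitlab-ci/"]


def changed_files(base="origin/main"):
    """Files touched by the merge request, relative to the target branch."""
    out = subprocess.run(["git", "diff", "--name-only", base],
                         capture_output=True, text=True, check=True)
    return out.stdout.splitlines()


def should_run(driver, farm):
    """Decide whether the hardware jobs for `driver` on `farm` should run."""
    # Kill switch: one flag per farm takes all of its jobs out of the pipeline.
    if os.environ.get(f"{farm.upper()}_FARM") == "offline":
        return False
    files = changed_files()
    # Changes to shared code affect every driver, so always test those.
    if any(f.startswith(p) for p in SHARED_PATHS for f in files):
        return True
    # Otherwise only run jobs for drivers whose own code was touched.
    return any(f.startswith(p) for p in DRIVER_PATHS[driver] for f in files)
```

The important property is that a single flag per farm can take all of its jobs out of the pipeline without touching anything else.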
The idea is that when a farm starts failing — because of the network, because it ran out of space, because something breaks randomly — we can shut that farm down and keep working and testing without having to disable the whole CI.

Now, the environment. When you run a job on every kind of device, it will probably run a little differently on each one, so you need to test on almost every device when you're developing for it. You also have to take extra care with variables and such, because a variable might not be set for your test on that particular device. So there's some extra complexity in doing all of that.

Containers versus rootfs: we have two approaches. One is the container — you just load the container on the device. Valve uses it, for example, and the non-hardware jobs use it, and it's great because a developer can just download the image and run it locally on their own computer without setting anything up. On the other hand, it's a little slower. For the LAVA farms and some bare-bones devices we use a rootfs instead, whose advantage is performance: you just unpack the rootfs onto an NFS server or send it to the device, and that's it, no overhead. Over time we'll probably move towards containers, because they're more useful for developers.

Every test suite is different, right? Because we need to cover a lot of areas, every testing suite has different inputs and outputs, so you need to handle that somehow. Some suites handle flakes differently, some handle failures differently, the reports are different. For some things we wrap the suite in something that provides a uniform output; for some tests we just adapt the test a bit — we send patches upstream or carry patches on the side. But we always try to keep as few patches as we can, because we have to maintain them, and that's a huge pain.

Let's switch to the next slide. Nice. The most interesting part for us is stability, because when you test graphics hardware, stability isn't its strong suit. First, let's talk about parallelism. We're using a huge set of tests that would take something like eight or ten hours. If you want to run them in 20 minutes before the code gets merged, that can be a bit of an issue. So the first thing we do is use parallel jobs: we simply shard the tests over eight or ten devices, so we also shard the time needed to run them. That's the first part. The second part is parallelism inside the jobs, because the GPU tests usually don't utilize the system at 100%, so we run tests in parallel inside the runners as well — for example, eight threads of tests at once.

And there's a cost to that: flakes. The tests usually aren't meant to be run with that much parallelism, so sometimes something fails, and it's very hard to debug what failed and why. We know what failed, but it may only happen when one particular test runs next to another one, once in a thousand runs. That's a hard thing to handle, and we do handle it — we'll get to that in the flakes part.

So, flakes: yes, yes, no, yes. You fix something, it works — but once in 10, in 100, in 1,000 runs, it doesn't. We figured out that we cannot get to a state where everything works every time, because we run many thousands of tests, so we have multiple layers for handling this. First, GitLab is a wonderful piece of software, but sometimes things just fail, and you can tell it to retry the job when that happens. That's one level of handling our flakes.
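Both the sharding across devices and the in-job parallelism happen at the test-runner level, and one common way to spot flakes inside a job is to rerun the failures once. Here is a minimal Python sketch of that idea — the `./run-test` command and the function names are placeholders, and the real suites in Mesa CI are driven by dedicated runners rather than a script like this.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def shard(tests, shard_count, shard_index):
    """Deterministically pick this device's slice of the full test list."""
    return [t for i, t in enumerate(tests) if i % shard_count == shard_index]


def run_one(test):
    # Placeholder: the real runners invoke the dEQP/Piglit binaries here.
    return subprocess.run(["./run-test", test]).returncode == 0


def run_shard(tests, shard_count, shard_index, threads=8):
    mine = shard(tests, shard_count, shard_index)
    # In-job parallelism: a single test stream rarely keeps the GPU at 100%.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = dict(zip(mine, pool.map(run_one, mine)))
    failures = [t for t, ok in results.items() if not ok]
    # Rerun failures once, serially: a test that passes the second time is
    # reported as a flake rather than failing the whole job.
    flakes = [t for t in failures if run_one(t)]
    real_failures = [t for t in failures if t not in flakes]
    return real_failures, flakes
```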
The reason for that GitLab-level retry is that when we want to merge something and it just fails once for some inexplicable reason, we don't want to block the developer, so we want to retry at least once — and we do retry once. But just recently GitLab had a bug where a retried job would simply get stuck in the queue, and it stayed stuck until you sent another job. We're constrained by time limits when we're merging, so that meant a failed pipeline and an unhappy developer. What we did was send dummy jobs to it just to get things moving; GitLab has since fixed the issue. So that's the first level.

Then you have the infrastructure level. Our farms are in different locations, connected over the internet, and the internet sometimes fails — even a data center fails, or a switch somewhere fails. So sometimes you have issues transferring the rootfs or the test jobs; it happens. Sometimes storage fails. Sometimes the GitLab runner itself fails. It happens, so you have to retry, and that's still handled by retrying the job.

Then you have the device itself, where you're getting data over the serial port. Originally, when you boot the device, you get everything over serial, and the adapters that convert the serial output from the device to USB for the machine that takes care of it sometimes fail or misbehave, which is very unpleasant — and it only happens from time to time. So recently a colleague of mine implemented SSH: we use the serial port only at the beginning, and as soon as we can, we switch to SSH, to be sure we get all the input and output of the machine correctly so we can parse and understand it.

And then, of course, you have GPU-level flakes. As I said, we run a lot of things in parallel; sometimes the driver, on some rare occasion, doesn't handle a corner case, and then one test out of ten thousand fails and you have to rerun. We're able to mark these tests, usually inside the testing suites: we just mark them as a flake, and if a flake fails it gets reported, but it doesn't fail the job, which is nice. What's very useful is the part where we monitor everything: every day we have reports of which tests were flaking and how often, and based on that we can update the expectations, report to developers, and handle it somehow. When that got into place, it was the single most useful thing for increasing reliability, I think.

The conclusion of my part is: if you have a CI like ours for developing GPU drivers, you need at least one dedicated CI developer. The community can help you a lot — plenty of people who are not CI developers just come to us and send patches: upgrade this dependency, fix this, improve that script. That's amazing, but you still need some people working on it full time. Another thing: several companies collaborate on our CI, so if you're merging some bigger change across all the devices and all the farms, it takes much longer, and you have to be aware that the longer you wait to get merged, the bigger the chance that someone else has pushed a merge request that breaks your changes. The CI is still changing very fast, even though it's really huge and already covers almost everything. And what's most important: a reasonable reliability of the CI can be reached. We still get some failures and some issues, but at the scale we're testing, it's still very good. The developers are more or less happy, they're getting their code in, so everything works out.
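To make the expectations and flake lists a bit more concrete: the comparison at the end of a job looks roughly like the sketch below. This is only an illustration — the names and the real runner logic in Mesa CI differ — but the rule is the same: known flakes get reported without failing the job, and anything that deviates from the recorded expectations (a new failure, or an unexpected pass) does fail it.

```python
def evaluate(results, expected_fails, known_flakes):
    """Compare one job's results against the checked-in expectation files.

    `results` maps test name -> "Pass"/"Fail"; the two sets come from the
    expectation lists kept in the tree (all names here are illustrative).
    """
    unexpected, flaked = [], []
    for test, status in results.items():
        failed = status != "Pass"
        if test in known_flakes:
            if failed:
                flaked.append(test)          # reported, but never fatal
        elif failed != (test in expected_fails):
            unexpected.append(test)          # new failure, or unexpected pass
    if flaked:
        print("flaked this run:", ", ".join(sorted(flaked)))
    if unexpected:
        print("unexpected results:", ", ".join(sorted(unexpected)))
        return 1                             # fails the job
    return 0
```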
Anyway, thank you for your attention, and I'll pass it over to Erika.

So hi everyone, I'm Erika. I work for Red Hat, and I'm going to talk about my project, which is to run my own farm — I actually run it at my home. This work is not really related to my day job; it's something I'd classify more as community work. A few years ago, maybe four or five years ago, I started participating in the Lima project — I'm not sure if anyone has heard about it, but it's a community driver for the first generation of ARM Mali GPUs. It's a GPU that's a little old at this point, but it's still used by a number of embedded devices. For example, maybe you went across the hall and came across this device — it's a pretty popular device these days, called the PinePhone — and it happens to be running with this GPU driver that we developed in the community. There was fairly active development on this over the past years, both on the Mesa side and on the kernel side. I mean, we got SuperTuxKart to run, so I guess that's basically job done.

So now we're in a situation where the driver is more or less stable, we have very good coverage in the OpenGL ES conformance tests for this device, and development has more or less slowed down a little. We got games to run, so I guess we could say job done. But there's still one thing we can do as a contribution, which is to care about regressions, because people are actually using this. If the developers are no longer pulling it every day and running the tests every day, something that would be really, really great to have is coverage in CI, so that whenever people push code to the shared infrastructure of Mesa, or new code to implement a feature or fix a bug in the driver, we have coverage for it. But who is going to maintain it? There's no company backing this up; nobody is getting paid to do this work.

At one of the conferences I attended, we were actually giving some of these boards to the speakers, and there were a few left over, and I was offered to bring them home and maybe set up a CI farm somewhere. They're actually this device — I have a stack of those at my home, and they are the farm that runs the jobs. Can you put it on the stand?

Nowadays we're at a state where this is getting tested as part of the big pipeline matrix that David was talking about. If you can see here, these are the Lima Mali 450 jobs; they're running on these boards at my home. You can also see some of the parallelism things David was talking about. For example, the Piglit tests take a long time, and we have this rule that we should try to keep jobs somewhere under 10 minutes, so we split them: they actually turn on two boards and run half of the tests on each board so we can reduce the run time.

So what did I have to set up to get this working? I decided to use a LAVA farm, because I didn't want to implement yet another set of "power the board on, connect to the serial port, read from the serial port, type in some commands" scripts. Also, these boards don't have a lot of storage, so they use an NFS root and need to download the kernel over TFTP, and all of those things I basically get for free by setting up LAVA. And since Mesa is a GitLab project, I'm also running a GitLab runner as a separate runner.
To set up the hardware side, I needed the actual boards, which I happened to have because I got them from the community, I guess. I need to run a LAVA host somewhere and a GitLab runner somewhere; I have a separate server running those as a couple of virtual machines, sitting next to the boards. You also need some solution to power the boards on and off — they're not actually on all the time; every time there's a new job for Mesa, they need to be powered on and they pull the kernel and everything. My current solution for power, instead of going for some super-expensive power control device, is one of those Wi-Fi-controlled plugs: I just send an HTTP request to it and it turns on or off. That's a one-line script I can run, and I can just plug that script into LAVA, and LAVA takes care of that part. I also need a serial connection for this; I'm using USB serial cables for that — I have a picture coming up. And then there's the whole network infrastructure, which — no, not yet — I'm going to talk about later.

This is the view I have when accessing LAVA directly. It's not visible from outside, because I don't even have to expose LAVA to the internet, but this is what I can see: the jobs being run by the different merge requests people are submitting, or sometimes by people testing their own branches. In the last month it ran 1,682 jobs, not counting the jobs LAVA runs on the boards itself just to check that a board hasn't disconnected or anything.

Some of the challenges in doing this: setting up the initial secure network. Basically I'm putting these boards on the internet, and by definition they are downloading code from the internet and running it inside my home. And that's something that, well, could be a little insecure, I guess, running code from the internet. So I set up full isolation for this network: these devices are blocked at the firewall and at the switch level as well, so they cannot see any of the other computers connected to the same network. Some of the other challenges were related to infrastructure reliability. All of those things I have listed there have failed at some point, and nowadays, since this runs as part of the actual CI, if my network is down and someone wants to merge something that touches the common part of Mesa, it will not be merged, because the tests are going to fail. So it took some time to get to a reliable state — I actually set up the boards well before I put them online as part of the pipeline. Each one of those things — I don't have time to go over all of them now — eventually had some flakes, and I had to replace something or change something. People ping me on IRC like, "hey, your lab is not responding", and if I don't fix it in something like 10 minutes, they just disable the farm, which is completely fine.

As for the results I got from this: a lot of new developer engagement. Sometimes someone is enabling some new feature in the common part of Mesa, and they don't really care much about the driver we're developing, but they go, "hey, you know what, I'm developing this feature, and I'll just add this one line to your driver", and they actually enable the feature for you as well, and the CI is happy about it. So we get a lot of new patches just by doing the work of having the tests upstream. That's something super cool that we have now.
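As an aside on the power control mentioned above: a script like that really can stay tiny. This is only a sketch — the IP address and URL scheme are made up, since the exact API depends on the plug's firmware — but the point is that a single HTTP request per state change is enough, and LAVA can then be configured to call a script like this as its power-on and power-off commands.

```python
import sys
import urllib.request

# Made-up address and URL scheme -- the real API depends on the plug's
# firmware. The point is just "one HTTP request per power state change".
PLUG = "http://192.168.10.50"


def set_power(state):
    assert state in ("on", "off")
    with urllib.request.urlopen(f"{PLUG}/relay?state={state}", timeout=10) as r:
        r.read()


if __name__ == "__main__":
    set_power(sys.argv[1])   # e.g. `python3 power.py off`
```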
We started with just super simple OpenGL ES 2 tests — well, at first just the OpenGL ES 1 ones — and it has grown to include the EGL tests and the Piglit tests as well. So many regressions were prevented. I found a couple of kernel bugs that nobody had noticed before, because CI was running the tests every day — some of the kernel bugs we actually discovered because we tried to bump the kernel version in Mesa CI and the tests failed. And the nice thing is that it's very easy for anyone to disable the lab. I actually checked: the lab has been up for two years, and people disabled it six times over those two years. It can happen that my network is down or something, and someone can quickly flip a switch and I'm not blocking anyone, and I think that's a good thing.

Hopefully I can inspire someone: if you participate in some project that needs some very specific hardware, it is actually possible to do this, even in a project as big as Mesa, with so many contributors from many big companies. I was talking to someone this week and said that LAVA was actually relatively pleasant to work with: I basically followed what's in the LAVA documentation, set up my own LAVA lab, and it hasn't been a hassle since then. So I'm quite happy working with LAVA — it takes care of all the boring things I didn't want to care about. Mesa has some documentation as well, coming from the other LAVA labs, which is basically a collaboration at this point. I did have to learn some new things, especially on the network side — for example, to provide isolation I was happy enough with to put this on the internet. But like I mentioned a few times already, having a good way to disable me in case I'm causing trouble is perfectly fair, and I think it's a must-have if you actually go ahead and do something like this. And network isolation is really a must — we had incidents, and I can tell you about them if someone is interested, but fortunately these boards aren't even able to see what's in the rest of my network. And with that, I'll pass the word back to David. Thanks.

So, I had to put in a slide about what's coming next. There are a lot of things coming, but what we're working on right now is, for example: in Mesa, when we want to add a new test or update some dependency, we have to rebuild a roughly one-hour-long pipeline on at least three architectures. It started with developers saying, "OK, I need this dependency a little newer than what's in Debian", and "let's compile this, and let's compile that", and after some time you have a one-hour pipeline compiling ccached software, and it takes a long time. So right now we're trying to split it up, so that if developers need to change some dependency, or test not a different Mesa but a different version of some library we use, they can have it in a few minutes instead of an hour. We're also trying to increase testing coverage. For example, we have traces — imagine them as replays of games or applications: we take one or two frames from a game and just feed them to the GPU, so we don't replay the game itself, we don't have to run it, we just run the frames that go to the GPU. And of course, adding more devices, because more devices is always better. Let's switch to Q&A. That was quick. Yeah, I'm trying.
Oh right, wrong button — a little too far back. Yeah, questions. Do you have any questions for us? Here's one question; I'll answer. All right, I'll repeat the question: what happens when the GPU starts crashing the kernel?

First, we usually only enable testing in our CI once the kernel driver is already in place and at least somewhat stable. But for some devices that never had great kernel support coverage and that we still test, it does sometimes happen. There are two types of crashes. One is where just the driver crashes but the kernel keeps working — that's completely fine, and we can even continue if the device supports restarting the GPU. And when the whole device crashes, it's not a problem either, because we have timeouts set up: when the SSH or serial console doesn't print anything for, say, five minutes, we just power the device down and rerun the test. And if it keeps crashing, the developer has to go and fix their stuff. Does that answer your question?

Yeah — kernel crashes under GPU load aren't really a problem, because every time we download the kernel fresh and restart the board, so it's not like one test can run and then affect the next test that runs. Also, Mesa maintains its own kernel that runs on the boards, and every time we update that kernel, we rerun the whole pipeline multiple times to make sure the kernel we're publishing for Mesa CI to download is stable on all of the boards. Part of the CI work is maintaining this kernel as well, to make sure it's not going to cause any problems.

Other questions? Yeah. So the question was how we define the priorities for the merge pipeline, to get the right jobs run at the right time. For LAVA, we're currently using an approach where the GitLab runner that serves LAVA has a sort of queue: if we have, say, 80 devices, we keep something like 24 job slots open, everything that gets pushed goes into those slots, and each job has its own priority. The merge pipelines have the highest priority, user-triggered tests have a lower priority, and the nightly runs, or the runs on the main branch after the code is already merged, have the lowest priority. The trick is that when the jobs get queued up in the GitLab runner, the runner picks the ones with the highest priority first, and the others have to wait longer. That's for our LAVA farm, yes. So thank you for your time, and thank you for coming.