Hello everyone. Nice to see so many faces here. I know this is the 2pm talk, so I have a lot of images and pie charts, and hopefully we will all stay awake. My name is Laura, I work at Collabora, and today I would like to share with you some of the struggles and challenges of growing and maintaining a laboratory for upstream testing.

First of all, we're going to take a virtual tour of the Collabora LAVA lab: we're going to see what devices are inside, how it is set up, and we're going to judge its cable management a little bit. Next, we are going to discuss how laboratories are relevant for testing upstream code, especially for open source projects, and we're also going to see which projects are currently leveraging the Collabora LAVA lab. Then we're going to share some of our daily headaches, so what can go wrong in the lab, and what the challenges are of adding more and more devices. We're also going to discuss maintenance and monitoring, and a little bit of performance tracking, and finally we're going to see what's next.

So, Collabora maintains a LAVA laboratory in Cambridge, in the UK. Our goal in building this lab was to have an ecosystem of devices that open source projects could use for automated testing; basically, we wanted to be a big test bed for open source projects. It runs LAVA, the Linaro Automation and Validation Architecture, for controlling all the devices and running tests on them.

As of August we had 158 devices of 32 different types. Actually, shame on me for not updating this slide, because in the past few days we added 10 or more devices. These devices are spread across 15 different racks; each rack is controlled by its own server, and of course on top of all these devices we have all the hardware infrastructure to control them: network switches, power supplies, USB hubs, tons and tons of cables.

This is the architecture distribution as of last month. At the moment, most of the tests that run in the Collabora lab target x86-64 and arm64 platforms, so that's why we have so many of them. As for the device distribution, the majority of devices that we currently have are Chromebooks. We also have some embedded SBCs and some QEMU instances, that's the yellow part here. So this is roughly the distribution that we have at the moment.

This is the growth our lab has seen in the past few years. The chart goes back to April 2020, when we had around 50 devices, and we ended up with 100 more. You can also see that every time we add a new device type, we add more than one device of that type to the lab; this is to ensure a little bit of device redundancy.

Here you can see what it actually looks like. These are some of the latest racks that we have set up. The last one is a little bit empty, but it has been filled up by now. I mentioned cable management just because I knew it looked good in this picture. The really important thing, apart from cable management, is that everything is labeled: it happens quite often that we need to unplug and re-plug stuff, and if there is a label there, it's going to be easier.

So, as I mentioned, our laboratory runs LAVA. LAVA is a CI system for deploying operating systems to both physical and virtual devices. It handles the power control, it handles access to the serial consoles, and of course it also schedules the test jobs on the devices.
Examples of tests that you can run in LAVA are checking that your kernel changes boot correctly on different platforms, or testing user-space changes against different kernel versions. LAVA is designed for validation during development: the main idea is that developers push their changes to a development branch, the development branch gets tested, and results are fed back to the developer, so that you can check that your changes do not introduce any regression on different devices and have no other side effects. It also has a very scalable scheduler; it can run dozens of tests every day on the same instance, so it's really good for adding more and more devices.

As I said, LAVA helps with automating the deploy and boot phases, but it's not a complete CI solution. In order to close the CI loop, you will need to figure out a way to automatically build your binaries and artifacts, to actually submit the test jobs to LAVA, and finally to get the results back to the developer in a formatted way. You cannot really expect developers to go and check the LAVA dashboard to see how a test job is doing; it's much better if results are well formatted and sent back automatically, over email for example.

So let's see what the base requirements are for a device to be enabled in LAVA. If, for example, we want to run an NFS test, that is, a test that boots a system over NFS: we power on the device; the device starts one or more bootloaders; we need to stop the bootloader at some point; we need to load a kernel image, a device tree if needed, and a ramdisk image; then we need to set up any command line options that might be needed for this test; and then we kick off the boot and let the device run again. The device will run the kernel, load the ramdisk, and mount the root filesystem, over NFS in this case. Once we get to the login prompt, we log into the system and run whatever script we want to run. Finally, we power off the device.

So, for devices to work in LAVA, we need to fulfill some base requirements: the device needs to be able to be turned on and off remotely; we need access to a reliable text output console, for example a serial console; and we need to be able to load a kernel, a device tree, and whatever else we need, remotely.

From LAVA's point of view, the device configuration consists of a series of Jinja2 and YAML files. You have the device type template, which outlines the requirements to boot the device; things that go in the device type template are, for example, the type of bootloader the device runs and any extra command line options that may be needed. Then we have the device dictionary, which is a template where all the device-specific commands go: for example, which command is needed for powering the device on and off or for accessing the serial console, as well as other device-specific characteristics such as the IP address, if needed. And the last piece of configuration is the health check, which is a special kind of test job used to check that the device is functional: it deploys and boots a test image and simply checks that the device works correctly. This is a job that is supposed to run on a regular basis, and its frequency can be configured in LAVA.
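To make this a bit more concrete, here is a minimal sketch of what a health-check-style job definition could look like. This is not our actual lab configuration: the device type, the artifact URLs, and the prompt string are placeholder assumptions.

```yaml
# Minimal sketch of a health-check-style LAVA job definition.
# Device type, artifact URLs and prompt are hypothetical placeholders.
device_type: some-arm64-board
job_name: health check example
priority: high
visibility: public

timeouts:
  job:
    minutes: 15
  action:
    minutes: 5

actions:
- deploy:
    to: tftp                          # load all artifacts over TFTP
    kernel:
      url: https://example.com/artifacts/Image
      type: image
    ramdisk:
      url: https://example.com/artifacts/ramdisk.cpio.gz
      compression: gz
    dtb:
      url: https://example.com/artifacts/board.dtb

- boot:
    method: u-boot                    # bootloader method depends on the device
    commands: ramdisk
    prompts:
    - 'root@'

- test:
    definitions:
    - from: inline
      name: smoke-test
      path: inline/smoke-test.yaml
      repository:
        metadata:
          format: Lava-Test Test Definition 1.0
          name: smoke-test
          description: "Basic sanity check that the system came up"
        run:
          steps:
          - lava-test-case uname --shell uname -a
```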
What we usually do in our lab is run a health check on every device we have, every day, to make sure that all devices are stable.

Just to give a little bit of context, I took some bits and pieces from a KernelCI job that ran in our lab recently, to show how a test job in LAVA is structured. You have a set of information describing the type of test we're running; in this case, we're targeting a Chromebook. Then we have a job name; this is a really long one, actually, but it's really descriptive. We also set the priority and the timeouts: for example, in this case we're saying that the power-off action should not take more than 30 seconds and the overall job should not take more than 10 minutes. We're also saying that this job should not stay in the queue for more than two days, after which it gets canceled automatically. This is a publicly visible job, and in this case we're also defining some extra command line options for the test.

Every LAVA test job has three actions: the deploy action, the boot action, and the test action. Here you can see the deploy one. We're basically defining where LAVA should look for the device tree blob, the kernel image, the modules archive, and the ramdisk as well, and we're saying that all these images will be loaded over TFTP.

Next, the boot action. Here we're defining the type of boot that we're going to do. This is a simpler job, which does not boot the system over NFS but just uses a ramdisk. The method is Depthcharge, because this is a Chromebook, so it runs Depthcharge as its bootloader. Here we're also defining the prompt string that LAVA will look for to know that the boot process has finished and the device is ready to run our test script. We're also overriding the default timeouts, saying that this boot action should not take more than five minutes.

Finally, there's the test action. Actually, this job has two test actions; I just took one because they're pretty similar. This example shows an inline test, so pretty much everything is defined inside the YAML file describing the test job. You can also have test definitions stored in a separate repository and have LAVA fetch them. In this case, we have a lot of properties describing the test: a name for it, the type of operating system it runs, and metadata in general. And then we have the steps for the actual test, which in this case just run a dmesg script that checks the kernel log for errors. So basically this whole test job will fail if any error or warning is detected in the kernel log.

So this is how a job is structured, but how can we actually submit a job to LAVA? We have a few options; these are just some examples. You can use the API directly: LAVA provides two APIs, the XML-RPC one and the REST API. You can also use the Python command line client, lavacli, which allows you to interact with all the LAVA objects: you can easily push device type templates or a device dictionary, or submit jobs. It's really useful during development, when we're still defining, for example, the device dictionary: we can easily push it and quickly test whether it works. Another thing that we use a lot is the lava-gitlab-runner. It serves as a bridge between GitLab and LAVA: it allows you to submit a job, monitor it, and retrieve the results as job artifacts.
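The lava-gitlab-runner has its own job contract, which I won't reproduce here, but just to sketch the general idea of closing the loop from a pipeline, here is a hypothetical GitLab CI job that submits a job with lavacli instead. The CI variables, the container image, and the file names are all assumptions.

```yaml
# Hypothetical .gitlab-ci.yml job: submit a LAVA job with lavacli and
# collect its log as an artifact. $LAVA_URI, $LAVA_USER and $LAVA_TOKEN
# are assumed to be configured as CI variables; job.yaml is the job
# definition checked into the repository.
lava-test:
  stage: test
  image: debian:bookworm
  before_script:
    - apt-get update && apt-get install -y lavacli
    - lavacli identities add --uri "$LAVA_URI" --username "$LAVA_USER" --token "$LAVA_TOKEN" default
  script:
    - JOB_ID=$(lavacli jobs submit job.yaml)   # prints the new job id
    - lavacli jobs wait "$JOB_ID"              # block until the job finishes
    - lavacli jobs logs "$JOB_ID" > lava-job.log
  artifacts:
    when: always
    paths:
      - lava-job.log
```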
With a setup like that, you can have pipelines in GitLab schedule jobs automatically whenever new changes are pushed to a certain branch or whenever a merge request is opened. Under the hood, it uses the LAVA REST API.

So what's the process we use to add a new device to our lab? First of all, we prepare the device to run tests. This usually includes reflashing the firmware in some cases, or enabling some debugging functionality, for example if we need to access the serial console. Once the device is ready, we prepare the LAVA device configuration: we write the device type template, the dictionary, and the health check, and finally we also add the network configuration for it. Next, before even thinking of installing the device in the lab, we do a lot of stress testing: for every new device type that we add to our lab, we run about 1,000 tests, 1,000 health checks, on it, just to check that the device is stable enough to be moved to production. And finally, when the device is ready, we install it in the racks. If the device type we defined is a new one, we also send it upstream.

I wanted to give you an example of how a device can be prepared. The process for Chromebooks is very different from the standard procedure you would have for a single-board computer, and I chose to show this one because we have so many Chromebooks in our lab. Capabilities such as powering the device on and off and accessing the serial consoles are locked by default on Chromebooks. These capabilities are part of what's called Case Closed Debugging (CCD). So in order to power the device on and off remotely, for example, we need to unlock these capabilities, which are implemented by the Cr50 firmware running on one of the chips inside the Chromebook. Here you can see we have the AP, the application processor, which is the main processor that runs ChromeOS inside a Chromebook. Then we have the embedded controller, which takes care of power, the sensors, and keyboard interaction as well. And finally we have this Google security chip, which runs the firmware implementing CCD.

So, to unlock the capabilities, as I said, we need to interact with the Google security chip. For Chromebooks, this happens through a special kind of USB-C cable called a SuzyQ cable. What it does, basically, is instruct the Cr50 to enter debug mode and expose a console; from that console, we can unlock the capabilities that we need for this device. The actual software tools that we use to power the device on and off and to access the serial consoles of all the different chips are the hdctools, which are developed and maintained by Google. So with this SuzyQ cable and these software tools, we're able to fulfill two of the requirements: we're able to turn the power on and off remotely and to access the serial console.

We still need a way to boot an arbitrary kernel, DTB, and ramdisk combination. To do that, we need to look quickly at what runs on a Chromebook out of the box. You have coreboot, which is the main system firmware, which loads Depthcharge as a payload; Depthcharge is the ChromeOS bootloader, and that in turn loads ChromeOS. So, to be able to load whatever kernel image we want, we need to stop at the Depthcharge phase, somehow drop into a bootloader prompt, and execute our commands.
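On the LAVA side, this is the phase that the Depthcharge boot method we saw in the KernelCI job earlier drives. As a rough sketch, assuming a ramdisk-based job like that one (the prompt string is a placeholder):

```yaml
# Rough sketch of a boot action for a Chromebook driven through
# LAVA's depthcharge boot method; the prompt is a placeholder.
- boot:
    method: depthcharge
    commands: ramdisk        # boot the ramdisk deployed over TFTP
    prompts:
    - 'root@'
    timeout:
      minutes: 5
```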
So what we usually do is take ChromeOS out of the picture, basically, and build and flash a new version of Depthcharge which has support for the command line interface and for the TFTP protocol. Once we have that, we're able to stop at the Depthcharge phase and load whatever we want. This slide is just a summary of what I just said: with these tools we're able to fulfill all the requirements, and the Chromebook is pretty much ready to run tests in LAVA.

As I said, once the device is prepared, we need to do some stress testing. In the Collabora LAVA lab we have two different instances: a staging instance and a production instance. The staging instance is the place where we move devices while they're being prepared: once a device is ready, we move it to staging and run a lot of health checks on it, just to see that it's stable enough. Staging is also the place where we test every LAVA patch before upstreaming it: if we change the LAVA code base, we push our changes to the staging instance first, and this instance is kept updated to stay as close as possible to upstream LAVA. And it's also where we move faulty devices: if we have a faulty device in production, we move it back to the staging instance and run all kinds of tests on it before moving it back. In the production instance we only have production-ready devices, so it's really important that these are continuously monitored to make sure they're in good health and functioning as expected.

So we have seen a little bit of how the lab is set up and what kind of devices we have inside, but what do we actually use it for? At the moment there are two big open source projects leveraging the Collabora LAVA lab, among other labs as well: KernelCI and Mesa CI, which are currently submitting hundreds of test jobs every day. Having a lab is really important for these kinds of projects, especially large-scale projects, because it allows testing the code on a lot of different platforms in a standardized way. It also helps find regressions and identify their root causes, for example through automated bisection. It improves the overall long-term maintenance and code quality, and it also helps catch mistakes earlier. So if you push a patch that causes, say, a kernel oops on a different device type than the one you use for your local tests, with the lab you're able to track down the error and find the cause of the issue.

So, more pie charts. Here you can see how many jobs were run throughout August by KernelCI alone; the load is well distributed across the devices we have at the moment. As you probably know, KernelCI is a project focused on continuous testing of the mainline Linux kernel. And it's not limited to just checking that a device boots correctly: you also have all kinds of baseline tests that check basic functionality, subsystem tests, such as the libcamera compliance tests and v4l2-compliance tests, and some user-space tests, such as the Tast tests, which target the ChromeOS user space.

The other open source project currently leveraging the Collabora lab is Mesa CI. Mesa CI is focused on Mesa pre-merge conformance tests and post-merge automated performance tracking. Here you can see a list of all the APIs and drivers that are currently covered; there are quite a lot, and the list keeps growing and growing.
Mesa CI jobs mostly target the x86-64 platform, so the majority of those tests run on that platform.

With so many jobs running every day in our lab, a lot of things can go wrong. I'm a software developer, so of course I'm going to blame the hardware first. You have a lot of hardware degradation issues: faulty cables, batteries dying or power supplies that don't charge the batteries properly, or even SD cards getting degraded. Of course you have network issues from time to time, and these are especially critical because they can affect both the devices in the lab and the LAVA servers, so they need to be addressed as quickly as possible. Then you have all kinds of issues related to how the devices are set up in the racks: from time to time you may need to move a device from one rack to another, or you may need to unplug cables, and in the process it's only natural that sometimes cables get knocked out of position, or the lid of a device ends up a little too closed, or you get overheating because the devices are too close to each other. And finally you also have firmware bugs, of course: bugs in the firmware running on the devices, and bugs in the firmware running on the hardware debug interfaces that you use.

It's only normal that all of these problems happen, but they need to be addressed as quickly and as effectively as possible, so as not to interfere with the test results and potentially block merge requests. We saw that, for example, Mesa CI uses the test results for pre-merge conformance, so if a merge request from a user gets blocked, we need to make sure the reason is that the submitted changes actually make the tests fail, and not that the infrastructure is not working as expected.

A lot of errors are automatically detected by LAVA and marked as infrastructure errors. Here we can see an example of one: this is a Chromebook where, during the bootloader phase, the Ethernet interface had some problems and we were not able to reach the bootloader prompt. In this case, the action taking care of loading the kernel image times out, and an infrastructure error is raised automatically. Also, every time an infrastructure error is raised by LAVA, a health check is scheduled on the same device right afterwards. So if this was just a temporary network glitch, the subsequent health check will pass and the device will stay online; but if it's a real problem, the health check will fail right afterwards and the device will be taken offline automatically.

For all the common issues that cannot be detected automatically by LAVA, you may want to add a test to the health check. For example, checking the battery capacity of a device to make sure the battery is in good health is a good example of a test that can fit inside a health check (there's a sketch of this below).

Some of the things we have learned in all of these projects are that monitoring is really, really important to make sure that any issue is addressed quickly, and that devices get replaced when they're not functioning correctly. LAVA provides quite a lot of APIs to detect events and status changes on the devices, so you can build all kinds of metrics on top of them to monitor the overall status of the devices and send notifications when devices go offline.
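As a minimal sketch of that battery example, assuming the device exposes its battery through sysfs (the sysfs paths and the 70% threshold are illustrative assumptions, not our actual health-check code):

```yaml
# Hypothetical inline test for a health check: fail when the battery
# has degraded below 70% of its design capacity.
- test:
    definitions:
    - from: inline
      name: battery-health
      path: inline/battery-health.yaml
      repository:
        metadata:
          format: Lava-Test Test Definition 1.0
          name: battery-health
          description: "Fail the health check on a degraded battery"
        run:
          steps:
          # Some batteries expose energy_full* instead of charge_full*.
          - full=$(cat /sys/class/power_supply/*/charge_full | head -n 1)
          - design=$(cat /sys/class/power_supply/*/charge_full_design | head -n 1)
          - lava-test-case battery-health --shell test "$((100 * full / design))" -ge 70
```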
It's also really important to write health checks that are robust enough to detect any malfunctioning, so that devices are taken offline right away and we don't need to intervene manually. And it's really important to monitor the infrastructure errors: that's especially useful to spot an issue with a specific rack, or with the way a specific device is set up inside the lab. Finally, it's really important to provide enough device redundancy, so that all the projects' pipelines stay fed and the side effects of one device going down are not too harsh.

Adding more and more devices of course means you're going to need more space, a lot more hardware equipment, and a lot more maintenance. This is just to make sure that all devices are kept online, and that if they go offline, they come back online quickly. Of course, the more devices you have, the harder it is to track the status of all of them. And having more devices usually means having more test coverage, so more tests are running and they need to be tracked. The log for each test run is stored in the LAVA database, so more devices mean more tests, and more tests mean more load on LAVA's database. This can also have an impact on your monitoring processes.

Some of our recent achievements: we added quite a lot of statistics to monitor how the servers running LAVA are doing, and also some statistics on database usage. With this analysis, we were able to identify, for example, that LAVA was spending quite a lot of time on API response pagination, which gave us a good opportunity to improve some of the SQL queries and to upstream optimizations to the pagination. We also now have a dummy load generator: this is used to emulate high database load scenarios and spot any possible LAVA performance regression when we do an update.

So, these are the next steps. We're looking forward to keep adding more and more devices, to increase the overall lab capacity and cover more and more platforms from different vendors. In the process of adding more devices, we will of course keep improving our infrastructure and our monitoring tools. This also means adding more automation to the process of recovering devices that go offline, and automating operations that are currently done manually, such as re-plugging cables. So while adding more and more devices, we have more opportunities to introduce more automation and reduce the manual operations. We of course look forward to keep reporting issues and sending more patches upstream, and, with more devices, to adding more tests and increasing the test coverage. I think that's it; I should have time for some questions.

Not really, that's not what LAVA does; it's maybe more what labgrid does. Oh, sorry, yes: the question is whether there's any interactive mode to debug the boards in LAVA. As far as I know, that's not within the scope of LAVA; it's just for automated tests, so developers cannot really access the boards interactively. Yeah, I mean, it really depends on the machines that you use to run LAVA on. We have quite a lot of USB hubs, just to add more and more ports, if that's the question.
So, besides the type of exception that is raised by LAVA, for example the infrastructure error, you also get a different message based on the type of error that occurred. LAVA defines quite a lot of different exceptions, so depending on when an error occurs, during the boot process for example, and on the symptoms, it will raise a different exception. You have a few base exception types, but you also get a nice message describing the error. Of course, this does not catch all the possible errors you may have. In general, it's much easier to automatically detect issues that manifest with a clear error message; if, for example, you stop getting output from the console in the middle of the kernel boot, it's much harder to automatically tell whether the kernel is hanging or the serial cable just got disconnected. So there are cases where you're not going to be able to catch every single possible error.

We track them internally: we track which USB hubs, and which types of USB hubs, we have on every server. But no, I don't think we monitor their performance or anything like that.

I don't have that information at the moment, but maybe I can add a slide with some references.

I don't think you have a way to do that from inside LAVA. LAVA will just export the results in various formats, but you'll need some other external process to fetch the results and send them. KernelCI, for example, sends emails to the developers whenever something breaks or there's been any kind of regression, but that needs to be figured out outside. That's the reason why LAVA is not a full CI system: you will need another tool external to it.

At the moment I'm working more on the software enablement of the Chromebooks and the other devices in the lab. We have two people working on site, physically accessing all the devices and unplugging cables when needed.

Which tool, sorry? I don't think so; I'm looking at the other members of the team, but I don't think so. We'll check it out, though.

There are still five minutes, so let me see if there's any other question. Are you sending any results to KCIDB? I'm not sure, actually; I'm still looking into it. I'm not sure if the KernelCI pipeline is currently using it; I would assume so, because it's all integrated, but I'm not sure. Yeah, probably. Thank you so much.