Hello, my name is Joshua Watt, and today I'm going to talk to you about doing continuous integration with the Yocto Project in Kubernetes. A little bit about myself: I've been working at Garmin since 2009, and we've been using OpenEmbedded and the Yocto Project since 2016. These are the various ways of contacting me if you would like to do that.

I'm going to start with our typical development workflow. We write some code, build it, do some testing, and finally deploy it somewhere. In the Yocto Project, this means writing some recipes and layers, feeding them into BitBake to have it build an image for us, performing some on-target testing of the resulting image, and finally deploying the image for our end users to consume. When building on a local PC, we generally find this workflow to be efficient and performant: builds are not re-run unnecessarily, and when they do run, we see reuse of previously built components.

The naive approach to scaling up for continuous integration is to simply replicate how we do local builds in our CI builds. However, when doing this, you may discover that your CI builds run quite slowly; they do not seem to exhibit either the performance or the efficiency of local builds. This performance degradation primarily comes from the parallel and stateless nature of CI builds. Unlike a local build, CI tends to run many builds in parallel, and it also tends to do clean builds from an empty workspace instead of having access to the build state from previous builds.

The Yocto Project has implemented several build acceleration techniques over the years. Many of these techniques are an optimized form of caching, and they are automatically enabled with a local cache when doing a local build. These caches greatly speed up build times, but a local cache doesn't work well with the parallel and stateless nature of CI builds: a local cache prevents sharing of the cached data between the parallel builds, and the cache is lost every build cycle owing to their stateless nature. As such, we need to establish centralized caches that can be shared by multiple builders when scaling up CI builds. Fortunately, all the caching mechanisms used by the project are designed to be shared in this way.

The primary and most effective cache that the project uses is called the shared state (sstate) cache. When BitBake builds certain tasks in a recipe, it archives the output of the task into the sstate cache for later use. If, in a subsequent build, BitBake sees that a task it wants to run already has an archive in the sstate cache, it simply extracts the archive from the cache in lieu of performing the task and all of the tasks that would have led up to it. In this way it is possible to skip the entire download and compile cycle for a recipe, effectively skipping to the end and saving a significant amount of time. The sstate cache can be accessed in a few different ways, but the recommended way to share the cache between multiple simultaneous builders is to host it on an NFS file share. Build pipelines should mount the NFS share and configure the build to access the cache from the mounted path. The sstate logic correctly handles simultaneous readers and writers to the cache when it is accessed this way. This is the method used by the Yocto Project's own CI system, the Autobuilder, and thus it is one of the best tested.
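To make that concrete, pointing a build at a cache on a mounted share needs little more than a single assignment in local.conf. A minimal sketch, assuming the share is mounted at a path like /nfs/yocto (the path is just a placeholder):

```
# Hypothetical local.conf fragment: keep the sstate cache on a mounted NFS share
# so every builder that mounts the share reads and writes the same cache.
SSTATE_DIR = "/nfs/yocto/sstate"
```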
Similar to the sstate cache, BitBake can also use a download cache, which caches source code downloaded from the internet and prevents slow download tasks from re-running when a recipe needs to be rebuilt but its source code hasn't changed. Like sstate, this cache can be shared over NFS; in fact, it can use the same NFS share as sstate if the two are put into distinct sub-directories.

Both the sstate and download caches can also pull from an HTTP server as a read-only upstream source. Exposing the CI NFS caches over HTTP can thus be a huge performance win for your local developers, since they can point their local configuration at the CI HTTP cache as an upstream source. This means that if a developer tries to build a task that CI has previously built, it is simply pulled from the CI cache over HTTP, skipping the need for the developer to build anything locally. For extensive CI systems, this can cut a lot of time out of local developer builds, especially for complex recipes that take a long time to compile. For this to work, however, we need to run an HTTP front end for both the download and sstate NFS caches.

Another, more recent caching mechanism is the hash equivalence server. This server records hashes of the actual outputs of tasks stored in sstate and determines whether two runs of a task that BitBake thought should be different are in fact the same because they produce the same output. When an equivalence like this is detected, the server records it and instructs future instances of BitBake to reuse the older sstate artifact instead. This can dramatically increase the reuse of sstate archives, particularly when there are trivial or non-functional changes to recipe metadata. Hash equivalence uses a simple JSON-over-TCP protocol and can service many clients simultaneously, so this is another service we should run centrally.

The final service we may need to run is a PR service. This service ensures that packages generated by BitBake have a monotonically increasing version number every time they are changed. This behavior is important if you have devices that pull packages from a package feed to do software updates, as they need to know which versions of a package are more recent than others. The PR server was recently changed to use the same JSON-over-TCP style of protocol as the hash equivalence server. If you are doing updates based on package feeds, you'll need to run a centralized PR server so that it can correctly track version updates across all your CI jobs.
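Putting those pieces together, the client-side configuration for all of these services lives in local.conf. Here is a rough sketch with placeholder host names and ports; a developer would typically use only the mirror and hash-equivalence lines, while the PR server entry is for the CI builders that publish package feeds:

```
# Hypothetical local.conf fragment showing how builds might consume the shared services.
# Host names and ports are placeholders.

# Use the CI HTTP front end as a read-only upstream for sstate objects.
SSTATE_MIRRORS = "file://.* http://cache.example.com/sstate/PATH;downloadfilename=PATH"

# Use the CI download cache as a source mirror before hitting the real upstreams.
INHERIT += "own-mirrors"
SOURCE_MIRROR_URL = "http://cache.example.com/downloads"

# Ask the central hash equivalence server about task output equivalence.
BB_SIGNATURE_HANDLER = "OEEquivHash"
BB_HASHSERVE = "hashserv.example.com:8686"

# CI builders that publish package feeds also point at the central PR server.
PRSERV_HOST = "prserv.example.com:8585"

# (CI builders that mount the NFS share directly would instead set SSTATE_DIR and
# DL_DIR to the mounted paths, as shown earlier.)
```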
So, as you can see, we end up needing quite a few ancillary services to go along with our CI pipelines. Many of these services are fairly simple, but we still need to maintain them and ensure that they are available, reliable, and scalable if we want reliable CI builds. In addition, we also want our CI pipelines themselves to be available, reliable, and scalable. Fortunately, there is a system that excels at being available, reliable, and scalable, and that is Kubernetes. For those unfamiliar, Kubernetes is a platform designed to run containerized workloads and services. Many complex web services are built on Kubernetes, and it is perfect for running the ancillary services we need for our CI pipelines. However, we can go even further: while it might seem a little strange at first, I believe Kubernetes is also an excellent solution for running the CI pipelines themselves. Using Kubernetes lets us leverage its extensive abilities to improve both our CI pipelines and our ancillary services.

First of all, Kubernetes can run in many different environments, anywhere from Raspberry Pi clusters to desktop PCs to local bare-metal servers to huge cloud providers. Regardless of where the services run, the description of how to run them is the same, which also makes it much easier to share configurations between different users and get consistent results. Additionally, Kubernetes services can be configured for high availability and reliability, so we can be sure our services will be there when we need them. Finally, we can use Kubernetes to schedule our CI pipelines efficiently, since it already has an advanced, cluster-aware scheduler.

Before we start looking at how to leverage Kubernetes to do scalable builds, let's first take a look at the hardware I use at home to test and develop these methods, starting with the cluster of testing hardware. Here is what my test cluster looks like. I have a weakness for devices with the Raspberry Pi form factor and have collected a number of them. The most interesting one of the group is probably the Rock Pi X, primarily because it's the only non-Arm board in the cluster, being based on an x86 chip. It also happens to be the one on the top of the tower, so it will feature prominently in my demonstrations since it's the easiest one to see.

All of the devices in the test cluster are connected to a control PC through a variety of interfaces. First, each device's power adapter is plugged into an outlet that is controlled via a USB relay from the PC, which allows the PC to independently turn each board on or off. Next, each board has a USB serial adapter, which lets the PC see the main console output as the board boots. In addition, each board is connected to an Ethernet switch; the PC is also connected to this switch via a USB Ethernet adapter. Finally, each board has an SD card multiplexer attached. This device lets the PC control, via USB, whether the SD card (visible on the reverse view) is attached to the PC itself or to the board. The Rock Pi X doesn't have an SD card slot, so its multiplexer is connected to a USB port and the BIOS is configured to boot from there; for all the other boards in the cluster, the multiplexer is plugged directly into the SD card slot. From the back, we can see the other components of the test cluster, including the USB relay and the outlets it controls, as well as the 16-port USB hub that connects everything to the control PC. The control PC is connected to the cluster via a single USB cable to the hub and an Ethernet cable to the switch.

To test an image on a board in the cluster, the control PC first turns off power to the board using the USB relay, instructs the SDWire multiplexer to attach the SD card to the PC, writes an image to the SD card, instructs the SDWire to attach the SD card back to the board, and then turns power to the board back on. As the board boots, the console output is monitored via the USB serial adapter; once the login prompt is seen, the controller knows that the device has successfully booted. Finally, the control PC SSHes into the device to validate that the network interface is working.

Here you can see my home server cluster that I run builds on. It consists of three used Dell R610 servers. They aren't particularly powerful or well-equipped by modern server standards, but they work well enough for my purposes. I'm running bare-metal Kubernetes, meaning I'm not using any virtual machines to make up the cluster.
Each of the three servers is directly attached as a cluster node.

Now that we know the hardware involved, let's look at the major software components I use for the CI builds themselves. For the actual build pipelines I'm using a project called Tekton, and for the test portion I'm using a project called LabGrid. Tekton is a cloud-native framework for implementing continuous integration and continuous delivery (CI/CD) pipelines. All objects managed by Tekton are defined as Kubernetes resources, meaning you can use all of the standard Kubernetes tooling, like kubectl, to construct and run pipelines. Tekton breaks pipelines down into tasks, with each task running in its own pod; tasks are further broken down into steps, where each step runs in its own container within the pod. LabGrid is an open-source project run by Pengutronix intended to provide a library for testing, control, and automation of embedded development boards. It provides board farm management software, as well as a Python interface library for accessing devices in the farm. It also provides pytest integration, making it very easy to write tests against devices in the cluster. Clients communicate with the board farm over SSH, with public/private key authentication used to provide access control.

Now we can combine this all together to make our CI pipeline. Tekton is used to construct and run our build pipeline in Kubernetes. First, Tekton builds the code using a slightly modified version of the CROPS build container. Once the build is complete, Tekton also runs the LabGrid tests, using a container that contains the LabGrid pytest client. This container connects to LabGrid running on the test cluster over SSH.

That leaves the ancillary services we need to accelerate our builds, and here using Kubernetes really shines. These services are in essence pretty simple, so it's not very difficult to write service descriptions for Kubernetes to run them in pods. This lets us leverage the high-availability features of Kubernetes to ensure that these services are always available for our builds, and potentially even add support for load-based scaling in the future, although that may require some support in the services themselves. For the download and sstate NFS caches, you can choose to provide an external NFS server or run an NFS server in Kubernetes itself. Meanwhile, providing an HTTP front end for the NFS shares is a trivial Kubernetes service.

Tekton provides workspace functionality, based on Kubernetes persistent volume claims, to persist data between the different tasks and steps in a pipeline. I tried a few different methods for allocating this workspace. First, I tried using an NFS volume; however, BitBake was unhappy using this as its build directory. Next, I tried using Ceph volumes allocated by the Rook operator running on my Kubernetes cluster; however, the disk performance with this method was unacceptable. Although BitBake is very I/O intensive, I think this was more due to the disk and networking limitations of my servers than an inherent problem with CephFS or network block devices in general, and I suspect that on a production-quality cluster, network block devices would yield acceptable performance. Finally, I settled on using local persistent volumes, which bind-mount a local disk from the server into the containers. This method works well for my setup, as it gives me the performance I need, but it may not work as well in another setup, since it requires significant local disk space on the Kubernetes nodes, which is uncommon. Fortunately, this shouldn't pose too much of a problem in practice: because volumes are allocated through the Kubernetes persistent volume claim interface, the storage is abstracted in such a way that it is simple to swap in a different allocation mechanism optimized for the cluster you are running on.
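For reference, here is a rough sketch of how a local-volume setup like this can be described. The storage class name matches the fast-disks class I mention later in the demo, but the node name, capacity, and path are placeholders:

```
# Sketch of a local persistent volume setup (node name, capacity, and path are placeholders).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-disks
provisioner: kubernetes.io/no-provisioner   # local volumes are statically provisioned
volumeBindingMode: WaitForFirstConsumer     # bind only once the pod is scheduled to a node
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: build-workspace-0
spec:
  capacity:
    storage: 200Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: fast-disks
  local:
    path: /mnt/build-disk0                  # local disk on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["node-1"]
```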
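And on the test side, the LabGrid pytest integration mentioned above keeps the actual test code very small. The following is only an illustration of that integration, not the exact tests from my demo; the driver names depend on how the board is described in the LabGrid environment file, which is passed to pytest with --lg-env:

```
# Minimal LabGrid pytest sketch: check that the board reaches a shell and has networking.
# The 'target' fixture is provided by LabGrid's pytest plugin.

def test_boots_to_shell(target):
    # CommandProtocol is typically provided by a ShellDriver on the serial console
    command = target.get_driver("CommandProtocol")
    result = command.run_check("uname -a")
    assert "Linux" in result[0]

def test_network_interface_up(target):
    command = target.get_driver("CommandProtocol")
    # Assumes the board's primary interface is eth0; adjust for the board under test
    result = command.run_check("ip link show eth0")
    assert "state UP" in " ".join(result)
```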
The final piece of the puzzle is how to set up the SSH connection between the LabGrid pytest container running in Tekton and the LabGrid coordinator controlling the board farm. Kubernetes runs pods in a virtual networking environment, which makes it difficult for the LabGrid coordinator to expose itself to the pods running in the Kubernetes cluster. Additionally, I wanted the coordinator and test cluster to be able to be physically located anywhere in the world while the pytest container running in Kubernetes still has access to them. The primary reason for this is to demonstrate the real-life situation where your builds run in a cloud provider but your test cluster is on-premise. I evaluated several different options for exposing the LabGrid coordinator to the Kubernetes virtual network, including VPNs and a project called KubeEdge, but eventually settled on an HTTP/2 tunneling utility called go-http-tunnel. The LabGrid coordinator connects to this publicly exposed service and then reverse-proxies SSH from inside the cluster back to itself. This allows the LabGrid pytest container to SSH to the internally exposed SSH proxy service, which gets forwarded across the HTTP/2 tunnel to the LabGrid coordinator. I've been really happy with this method since I switched to it: it was pretty simple to set up, it has good security through mutual authentication, and it has been very reliable. I had originally intended to demonstrate that this works as intended by bringing the test cluster with me in person to the conference and doing a live CI build in which my Kubernetes cluster back home would run tests against the local devices, but that unfortunately didn't pan out.

Now that we have everything set up to perform CI builds, we need something to test. I decided that the best way to demonstrate the effectiveness of the CI pipeline was to have it compile images that run a selection of Doom-based games via GZDoom for every board in my cluster. Shown here is the Raspberry Pi 4 booting Chex Quest, and also the reason why I won't be quitting my day job to become an elite FPS player anytime soon. The repository uses whisk to do build management, which allows each board to be built against multiple versions of Yocto from the same branch of the repository. Using this, I can build against both the Yocto 3.2 and 3.3 releases for each board in the cluster. Additionally, the CI jobs are set up to build each release against both a known-good commit for each layer the board consumes and the latest head commit of each respective release branch. This lets me very easily validate that the latest release branches of each layer haven't broken anything before I bring them into my project.

So now I can show you a demo of all these pieces working together. Here you can see the Tekton web interface running on my Kubernetes cluster. You can see that it is empty, because I don't have any previous pipeline runs.
Switching over to the terminal, we can start by showing the ancillary services that I have running on my cluster. You can see that I have the Yocto cache service, which is the HTTP front end for the combined sstate and download NFS caches, and I also have a hash equivalence server running. I'm running the NFS server for the sstate and download caches on one of my cluster nodes instead of inside Kubernetes itself; we can look at the definition of the Kubernetes volume and see that it mounts the share from the Galactica cluster node's NFS server.

Tekton pipelines are described by YAML files, which we can take a look at here. I need to start a lot of pipelines that are all very similar with slightly different parameters, so my YAML file is actually a Jinja2 template that I expand before passing to kubectl. You can see here that at the top I'm defining several loops in Jinja: one over each product, one over each version of Yocto it should build against, and one over whether it should pull the known-fixed layer versions or the latest upstream. When this is expanded, it creates four pipeline runs for each product. Here you can see that I've configured the pipelines to use the fast-disks storage class, which I've defined in my cluster to use local persistent volumes. And here at the bottom I have a simple check task, which does some quick checks on the repository code; only one of these is created, since it sits outside the Jinja loops. We can use j2 to expand the Jinja2 template in the file and then pipe the result to kubectl to create the actual pipeline runs.

The pipelines have been created, and now we can switch over to the Tekton web UI and watch the pipelines start to fill in there as well. We can go in and look at what one of these pipeline runs looks like while it's executing. This particular one is building the Rock Pi X against the Hardknott branch, with the latest commit of each upstream layer. You can see the pipeline run is divided into tasks on the left. The first task it is currently running is the clone task, which is further divided into prepare and clone steps. Here we can see the validate task, which does some simple validation of the build configuration. You'll notice that this pipeline run doesn't start the build task as soon as the validate task completes. This is because of the "exceeded node resources" status here on the task: I've started far more pipeline runs than my cluster can build simultaneously, so this particular task has to wait until more compute time is available on a cluster node before it can start. One of the nice things about Tekton is that you can define the cluster resource requirements per task, which is why the previous two tasks were able to execute. In my pipelines, I've stated that all tasks other than the build task use a single CPU and a small amount of memory, whereas the build task needs 10 CPUs and 20 gigabytes of RAM.

We'll go back to the terminal and see if we can find a pipeline in the build task so that we can see what it looks like while it's running. I want to find a Rock Pi X build so I can show you what it looks like when the tests run on the device. I don't see one here, so we'll check back later. Okay, it looks like the cluster has started a build task for a Rock Pi X, which you can see here, so let's go look at that in the Tekton web UI. And here we can see that the build task is actually running.
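(As an aside: those per-task resource requests are declared on the steps of the Tekton Task definitions. A rough sketch under the v1beta1 API, with an illustrative image name and a placeholder build script, might look like this:)

```
# Rough sketch of per-step compute requests in a Tekton Task (v1beta1 API;
# the image name and script body are placeholders).
# The Jinja2 template is expanded and submitted with something like:
#   j2 pipelineruns.yaml.j2 | kubectl create -f -
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: build
spec:
  workspaces:
    - name: build-dir
  steps:
    - name: bitbake
      image: example/yocto-builder:latest   # stand-in for the modified CROPS container
      resources:
        requests:
          cpu: "10"      # the build step gets the big reservation;
          memory: 20Gi   # other tasks request only ~1 CPU and a little memory
      script: |
        #!/bin/sh
        # placeholder: source the build environment and run bitbake here
        echo "bitbake goes here"
```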
I've got a camera pointed at the Rock Pi X in my cluster in the lower left of the screen, but it won't do anything interesting until we start running some tests. The build has been running for a few seconds already, so we can scroll through the build output. You can see the standard environment setup output, and we can see that BitBake is parsing recipes down here. Here you can see the standard header that BitBake prints out before it starts building, and now the build starts running. We expect that this build will consist almost entirely of setscene tasks, which are the tasks that run when BitBake is restoring artifacts from sstate in lieu of actually building. Nothing should really have changed since the last time I built this product, so I would expect everything to be restored from sstate. And there, as I expected, nothing was actually built; instead it all got restored from sstate. BitBake now runs several tasks to construct the final image files. These take a few minutes, so I'll fast-forward through them. The build task has now completed, and we can see here at the top that it took 8 minutes and 57 seconds.

Now we can go to the LabGrid reserve task in the pipeline. This task reserves the Rock Pi X board for exclusive use by this pipeline run; since I have four pipeline runs that need to access the same physical board, this ensures that only one is using it at a time. Now the pipeline run has reserved the Rock Pi X in the cluster, so it can start to run the tests. Here you can see the pipeline starting to run the actual tests. The first thing it does is cut power to the Rock Pi X. Now it has switched the SDWire to the controller PC and is flashing the image to the SD card; you can tell because of the flashing blue LED on the SDWire. This takes a while, so I'm going to speed it up. The image flashing is complete and the SDWire is now attached back to the Rock Pi X, which you can tell from the green LED. Power will be turned back on to the Rock Pi X shortly and it will begin to boot. As the Rock Pi X boots, the test looks for a login prompt on the serial console, and once it sees one, the test passes. After that, it SSHes into the board to validate the networking. All the tests have now passed, so the last thing the pipeline does is release its reservation on the Rock Pi X board in the LabGrid cluster so that some other pipeline run can use it.

We can come back later and look at the Tekton web interface after all the pipeline runs have completed. Here we can see that the longest run took 1 hour, 8 minutes, and 31 seconds. We did 24 builds in that time span, for an average of about 3 minutes per build.

I have a few observations about my use of Tekton as a CI engine in these examples. As Tekton itself states, it is a CI/CD framework, not necessarily a full CI/CD solution. As such, I'm not sure I would use Tekton alone as a production CI pipeline for Yocto builds. I suspect many of the issues I ran into are simply because Tekton isn't really intended for the kind of builds I'm doing, or because I'm not using it correctly. It probably works great for the more cloud-native workflows it's designed to handle, but I found it's missing a few key features that one might expect. For starters, there is no automatic cleanup of pipeline runs. Instead, they stay around in their final state until they are manually deleted from either the web UI or one of the command-line utilities such as kubectl.
This wouldn't be too much of a problem on its own, since the task pods are in a finished state and no longer consuming CPU or memory, but the mounted workspace volumes remain attached to these pods. Keeping these large volumes attached to completed pods consumes a lot of disk space, and in my case it means I must delete all previous runs before starting a new set, since I have a limited number of local persistent volumes on each node. Related to this, Tekton has no provisions that I'm aware of for persisting the logs from a build after the pipeline run is deleted. I believe this is because Tekton simply uses the Kubernetes pod logging as its logging mechanism, and once the pods are deleted, the logs disappear too. Finally, Tekton has trigger support for automatically starting builds on specific conditions, but I can't really enable that for my GitHub repository since it would quickly consume all my persistent volumes. As such, I have to manually trigger every run, which is not ideal.

This leads me to my final thoughts. In the future, I plan to try out a few other Kubernetes-based CI/CD systems to see if they offer a better experience. I know GitLab can use a Kubernetes cluster as an executor, so I'm curious whether that can tie into the other ancillary services. I've also recently been really excited about Zuul, which seems like an interesting CI/CD concept. Finally, Jenkins X is a cloud-native continuous delivery system that actually uses Tekton under the hood and seems promising for providing a lot of the features that Tekton itself is missing. Ideally, when all is said and done, one could write a Kubernetes operator that would make setting up a Yocto build pipeline in Kubernetes simple. Operators are a powerful Kubernetes concept in which a Kubernetes service is written to manage the installation of other Kubernetes services. For example, one could write a Yocto build operator that ensures the requisite CI/CD system is correctly installed and configured, and that also sets up and configures all the ancillary services required to make it run smoothly. This would allow anyone to easily set up a Yocto build pipeline on just about any infrastructure.

If you would like to take a look at what I've done for this presentation, you can access the Git repository shown here. It includes the Ansible setup for my servers and LabGrid cluster, as well as the YAML files for my Kubernetes setup. It's probably not in a state where you could use it directly to set up your own cluster, but it might provide some inspiration to try it yourself. Are there any questions?