As she said, my name is Iñaki, and I'm here with Michael. We are from the CKI team, also known as Continuous Kernel Integration. What do we do? We basically test kernels for a living. We run CI for the Linux kernel, which means we grab a commit, trigger a pipeline, build the kernel, test it, and report the results of what we just tested. That can't be too hard, right?

Well, some time ago we started like this. It was basically a proof of concept: we needed to show that testing the kernel in a continuous integration workflow was possible. So we sat down beside a big behemoth, of course, Jenkins, because you can never go wrong with Jenkins. That was really useful to show us, and other people, that testing the kernel was possible. But it involved a lot of Python, OpenShift projects, clicking, and stuff that really didn't scale that well.

After showing that this was something we could maintain and scale up — testing the Linux kernel — we needed to revisit our problems. We started gaining more and more responsibilities. With kernel developers, we needed to onboard new kernel trees. We started testing kernels from Brew and Koji, which are the build systems that Red Hat and Fedora use. We also test Git repos. We are now testing GitLab merge requests. We have Patchwork, which is an interface for mailing lists and the patches sent to them. We also provide the gating for RPM packaging. And we are working on the kernel workflow infrastructure — I don't know if you've seen any of those talks; it's a really interesting change in how Red Hat works on the kernel. But we also work with test maintainers: we help them onboard new kernel trees and new tests, we configure targeted testing, and we give them feedback about how those tests are going.

So we needed to get somewhere like this, where everything is automatic and everything stays on track, and we don't have to go around clicking buttons and putting cars back on the tracks. But yeah, it's really not that easy. So we found out that we were going to need some fancy keywords, like site reliability engineering, to help us keep our machinery working.

We started building our services under this main lemma: any component or dependency that can fail will fail — and some of them will fail even more than others. We needed to make sure that all the failures can be retried successfully. And if there are some that cannot be retried, we need to know which ones they are and keep them to a minimum. So first, failures need to be prevented: having fewer components and simpler dependencies helps us prevent important issues. But once failures happen, they also need to be detected, and we need to recover from them.

Now, some background about what we have. Basically, we run on a lot of different infrastructures outside of our control: different OpenShift clusters and Beaker clusters — Beaker is a hardware provisioning system, so that's what we use to get machines for the testing — but we also run on AWS, on gitlab.com, and many other platforms. To communicate between all these services running on different platforms, we use an AMQP cluster, which is RabbitMQ with a lot of queues connecting all the services. We also have internal microservices running on all these different clouds: services to trigger the pipelines, to send the reports, to babysit the pipelines until they finish.
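To give an idea of what that glue looks like, here is a minimal sketch of a service publishing a pipeline-trigger message to RabbitMQ with the pika library. The host, credentials, and queue name are placeholders for illustration, not the actual CKI setup.

```python
import json

import pika

# Connection parameters for the RabbitMQ (AMQP) cluster; host and
# credentials here are made up, not the real CKI configuration.
params = pika.ConnectionParameters(
    host="rabbitmq.example.com",
    credentials=pika.PlainCredentials("trigger-service", "secret"),
)

with pika.BlockingConnection(params) as connection:
    channel = connection.channel()
    # Durable queue so queued messages survive a broker restart.
    channel.queue_declare(queue="pipeline.trigger", durable=True)
    # Ask for a pipeline run for one commit.
    channel.basic_publish(
        exchange="",
        routing_key="pipeline.trigger",
        body=json.dumps({"repo": "kernel", "commit": "abc123"}),
        properties=pika.BasicProperties(delivery_mode=2),  # persistent
    )
```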
We also have really important pipeline components like the GitLab runner, which we use to run the tests, a test database where we store all the results and information we generate, the Beaker provisioning, et cetera.

So the first thing is how to prevent these problems from happening in the first place. Basically: minimize the essential pieces. If you have fewer critical pieces, fewer things can go wrong. We divided our whole stack into three main categories. Essential components are the pieces that are needed for the service to run; they are our single points of failure and the things we need to keep to a minimum. On the next layer we have the necessary components: things that need to run at least some of the time for operations to work. For example, the reporter component doesn't need to run all the time with four nines or nine nines of availability; as long as it runs sometimes and can pick up the work that was generated before, that's not a problem for the operation. And we also have optional components, which provide observability and increase the reliability of the system. We cannot fail our testing just because, for example, the log aggregation system or the metrics are down.

The first main decision we made was the message queues. We tried to decouple all the pieces of code. For that, we translated all the REST APIs into message queuing, which abstracts the code from the place it is running: as long as it can connect to the RabbitMQ server, everything turns into a much more reliable and distributed system. As I said, it increases service portability, because you're not tied to the place the code is running. And it allows us to automatically reprocess failed messages after some time. We built a system on vanilla RabbitMQ that lets us retry failed messages after a delay (there's a small sketch of the idea after this section). By default, if a worker returns a message to the server, that message goes back to the front of the queue, so it gets reprocessed immediately, right after it was returned — and that can block your services if something is wrong with that particular message. In our system, the message goes to the back of the queue instead and gets reprocessed after some time.

So, as I said, we have this RabbitMQ cluster hosted on AWS; as it's one of our key components, we need to make sure it runs with as much availability as possible. We have a webhook bridge for those applications where we don't get to decide between REST APIs and message queues: we built a very reliable bridge that turns webhooks into messages in the queues. And for the retries, we are converting external message buses like UMB, which is an internal Red Hat message bus, or the Fedora message bus, into our own queues where we can retry the messages successfully.

Another example is the S3 buckets. They provide a very generic way to store artifacts: we don't need to rely on NFS volumes or Git hosting or the other platforms we were using before. S3 is universal, it's fast, and it's really reliable. We use AWS S3 for external files, and OpenShift with MinIO for internal files. It can also be used as a poor man's database: we use it to store some files and to distribute state across pieces.
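Here is a hedged sketch of that retry-to-the-back idea; the header name, retry limit, and handler are made up for illustration, not the exact CKI implementation. Instead of NACKing a message back to the front of the queue, the consumer re-publishes it to the tail with a retry counter and ACKs the original delivery, so other messages keep flowing and the bad one comes back later.

```python
import json

import pika

MAX_RETRIES = 5  # hypothetical limit before giving up on a message

def process(data):
    """Stand-in for the real work a service does with a message."""

def handle(channel, method, properties, body):
    retries = (properties.headers or {}).get("x-retries", 0)
    try:
        process(json.loads(body))
    except Exception:
        if retries < MAX_RETRIES:
            # Re-publish to the *back* of the same queue with a bumped
            # counter, instead of basic_nack(requeue=True), which would
            # put the message right back at the front and block the rest.
            channel.basic_publish(
                exchange="",
                routing_key=method.routing_key,
                body=body,
                properties=pika.BasicProperties(
                    headers={"x-retries": retries + 1},
                    delivery_mode=2,
                ),
            )
    # Either way, ACK the original delivery so it stops blocking the queue.
    channel.basic_ack(delivery_tag=method.delivery_tag)
```

To get the "after some time" part with vanilla RabbitMQ, one common approach is to publish retries into a TTL'd retry queue that dead-letters back into the work queue.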
And again, it increases service portability a lot, because you don't care where things are running. With an NFS volume, for example, you would need to mount that volume everywhere you want to use it; with S3, as long as you have access to the server, it's all fine. For example, we use it for ccache, for caching Git repositories so we don't clone all the kernel trees every time, for pipeline artifacts, and as that database for configurations and some static files.

The third thing is containers. We try to turn everything into a container. This allows us to forget about infrastructure: don't worry where things are running. We don't install packages; we just put them into an image and use that image everywhere we need it. For example, we use AWS Lambda for the webhook bridge, but we also use GitLab runners, and we don't care where they run — on Docker, on Kubernetes, or on some disposable machines.

The next thing is detection. As we said, we have a lot of moving pieces: the build-and-test pipeline, microservices, cron jobs, everything running on AWS, OpenStack, and different Kubernetes clusters. So we need to log and monitor all of it. For the logs, we tried piping everything to /dev/null, but that didn't solve our problems, so we had to improve on that. We found that Loki was a really good solution: it's basically like Prometheus, but for logs. It helps us index the logs and set easy retention policies so the logs get rotated easily.

For the metrics, Prometheus is excellent. It's so good that we realized every application deserves a /metrics endpoint. It lets you monitor the internal state of the services: you can know how a service is doing, how long things are taking. Most of our code is Python, so for our Python applications the Prometheus Python client is a really good solution. It handles everything; you just need to define the data you want to expose (there's a small sketch after this section). Everything is visualized in Grafana. We have dashboards like this, which let us see what's going on, with alerts if something goes wrong.

At a higher level, we use Monit. Monit is a simple monitoring solution: you can write your own shell scripts and make them fail, and Monit will alert you about that. It also has some basic but really useful checks like host uptime and file system size and capacity. With our custom scripts we check things like Beaker host queues, S3 bucket sizes, RabbitMQ queue sizes, and unacknowledged messages. This is really important for tracking these problems down, because as you can see here, we keep a good record of what happened and when.

And lastly, if everything explodes, you want to be the first one to know that something went wrong. So we use Sentry, which catches all the exceptions in our code, and we get some nice tracebacks and stay aware of what's going on with our applications. We use sentry.io for that, plus an internal instance for the internal code. All of this is not really useful if you don't have alerting, so all these Loki, RabbitMQ, and Monit alerts go to IRC, which is our main channel of communication, and are also sent to the mailing list, if anyone reads that. But the most important one is IRC, because that's where we spend all our time.
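This is roughly what the Prometheus Python client makes possible; the metric names and port are invented for the example. You define the data you want to expose, and the client serves the /metrics endpoint for you.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Define the data you want to expose; the client handles the rest.
MESSAGES = Counter("messages_processed_total",
                   "Messages this service has processed")
DURATION = Histogram("message_processing_seconds",
                     "Time spent processing one message")

if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics on port 8000
    while True:
        with DURATION.time():             # observe how long the work takes
            time.sleep(random.random())   # stand-in for real work
        MESSAGES.inc()
```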
But the thing is, even after you realize that something went wrong, you need to recover from it. At the most basic level, the worst problem is always the network — just like that cat was afraid of. We have a lot of network issues, so we added retry loops to every network access in the pipeline and in all the services, which means any transient network failure gets retried several times (a small sketch follows this section). We also use the re-queuing of messages in RabbitMQ, as I said; there's some information on our blog about this system. Basically, we just re-queue all the failed messages indefinitely, because nothing is as motivating for fixing a bug as a queue full of failed messages. With all the alerting, we get notified every time a message fails, so you don't lose the failures: you can retry them, and you can go and test locally with the exact message that is making things fail. It's really helpful.

And third, for when all of that goes wrong and doesn't catch the problems in time, we have the pipeline herder, which is a bot that babysits the pipelines. It knows about many failures that we can safely retry, so it goes and checks the logs of the failed jobs and retries them for us, and we don't have to click anything. If that's not enough, we have fallbacks. As I said, we have different OpenShift clusters, so if something goes wrong on one of the clusters, we can switch the runners and fall back to the others really easily, with just the push of a button.

So Michael is going to tell us how we manage to change all of this.

That was a good introduction. So Iñaki talked about how we keep it running, but of course the system would be most stable if we never changed it. So — I need to say "next slide", Iñaki, thank you. We thought about how to approach our DevOps culture; Stef Walter has also talked about this a lot in the last couple of years. I basically made these points up here, but one aspect we consider pretty important is that it is as open as possible. That's about code, about documentation, about how we work, about communication. But it also needs to be safe: everybody is scared of certain pieces of code, and we try to minimize those, so that if you change something, or have to change something, you feel safe and are more or less eager to clean up pieces as it becomes apparent. And the third piece is that deployments need to be painless — production deployments, but also deployments locally, even if it's not really deploying but just running the code yourself to figure out whether it should work. That also goes for staging and canary deployments, which might be hard to do if you don't have a lot of experience in a certain area of the code.

So I think the next slide starts with the openness aspect. Openness in our case means, of course, open code. We have a lot of repositories for all these pieces; we've tried to consolidate them where it makes sense, but it's still a lot. And "open by default" is actually something we try to live by. Now, there are certain pieces we are not really comfortable putting on gitlab.com — mostly secrets, internal documentation, and configuration. Our deployment YAMLs are still internal; that is something we want to change.
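The retry loops are conceptually as simple as this hedged sketch; the delays, attempt count, and exception handling are illustrative, not our exact code.

```python
import time

import requests

def fetch_with_retries(url, attempts=5, backoff=2.0):
    """GET a URL, retrying transient network failures with backoff."""
    for attempt in range(attempts):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == attempts - 1:
                raise  # out of retries: let it fail loudly
            time.sleep(backoff * 2 ** attempt)  # exponential backoff
```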
The same goes for the Ansible code. We've learned a lot from what the Fedora infrastructure people put online, and we want to do the same, but at the moment there is still this encrypted secrets file in there, and we don't want just a password standing between our internal infrastructure and the world. So this piece is in progress.

We've worked a lot on documentation, and there's far more to do. We're trying to find out who our audiences are. We are a service, so we actually have users: kernel developers who are more or less forced to work with us, and we want to make that as painless as possible. But we also want to document the individual pieces. There are README files — or there should be README files — in all of those repositories. And something we've worked on is trying to inventory how all these pieces fit together; I will show an example in a minute. And again, we still have internal documentation, so we are not there yet. And yeah, one of the things that works well is having documentation Fridays, so there's actually an excuse to work on documentation.

So this is an example of what the inventory looks like. It tries to describe a service, or a cron job in this case: what it's good for, where you can find the code, where you can find the deployment, what it depends on, and what other pieces might depend on it. This gets rendered on our website — on the next slide you see an example of how this looks on our homepage — so you can actually start to explore things and click through. The idea is also to use this for monitoring, so that in this structured YAML format you can define your monitoring or even alerting, the connection between deployment and code, and more metadata about how it all fits together in an abstract way, let's say — it should make it easier to get a feeling for how everything fits together. (A hypothetical example entry follows this section.)

We try to be open in our daily operations. That means moving issues from Jira to gitlab.com, which is also quite a bit nicer: the few features we actually use from Jira are really nicely covered by gitlab.com. We have an open channel that's mostly populated by the bot alerting about everything, but all the discussions happen there too. Everything is merge-request based, so there are no direct pushes, and everybody can subscribe to the merge requests or the projects and chime in.

We've tried to formalize our feedback process a bit, because sometimes discussions happen on IRC and you might not be online, so you might miss them. For more general things that might be a good idea, or where you just feel you want feedback, we have an RFC process, which basically means you put up a merge request to the documentation repository — the next slide has an example — and ask for feedback. It's explicitly a feedback process. It's not like ADRs, architectural decision records, where you actually ask for decisions. In this case, we are an SRE team and we trust each other well enough that you can just ask for feedback, and we trust that whoever put it up will do the right thing afterwards. They are the domain expert: they get feedback, and then they can carry on with whatever they want to do.

The next big chunk is safety — in this case, safety to change. We want to make changing things as safe as possible, and we want to make reviews as painless as possible. That means linting and testing as much as you can.
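To make that concrete, a hypothetical inventory entry might look something like this; the field names and values are invented to illustrate the idea, not our actual schema.

```yaml
# Hypothetical inventory entry for one cron job.
gitlab-cleanup:
  kind: cronjob
  description: Remove stale merge-request deployments from OpenShift.
  code: https://gitlab.com/example/infrastructure
  deployment: internal Ansible repository
  depends-on:
    - openshift-cluster
    - rabbitmq
  used-by:
    - pipeline-herder
```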
We try to lint shell code, YAML code, all kinds of Python linters we can get our hands on, unit tests, markdown linting, link checking — everything we can think of to automate. That also means there's less to review, because if it's automatically linted, there's already automated feedback coming. We also have a linting script that you can just run locally, and it hopefully works independently of your particular Python setup. We're still working on using formatters; the Packit team is quite a bit further along that trajectory. Using something like Black for our Python code might actually take the pain out of the formatting aspect, instead of just using flake8, but that's work in progress. One other aspect of the testing is the dependencies: we have shared libraries that are reused in other places, and if somebody changes a library, we don't yet run the testing pipelines of the dependent projects. That's still on the to-do list, but should also happen in the next couple of weeks.

As I said, we split things up into microservices, partly because I'm pretty old and need to fit pieces into my head, so I have a tendency to simplify things. That also makes it less scary: if you have a good understanding — or at least think you have a good understanding — it's much more fun to change code and to clean it up. And in our experience, microservices are structured better: there are interfaces that you more or less need to define, explicitly or implicitly, between the pieces, and that actually makes the structure clearer.

What's on the next slide? I've no idea, actually. Oh yeah. The last aspect is how you can actually deploy code. So we say we are an SRE team, and for us that means living on the edge: if you merge something to main, it gets deployed one way or the other. The way that works for us is that you get a review, but you are responsible for merging and for making sure any fallout is handled, or for reverting whatever you changed. It's your code; reviewers can give feedback, but they are not responsible for what you merge. It's a process that works quite well. It is very important to have some restraint, though: at the end of the day or just before the weekend, it's a good idea to shift a deployment to the next day or the next week.

Iñaki already discussed the structure of what we have. For deployment, there are two types of things we run — the microservices and the pipeline components — and they are slightly different in how you deploy them. The microservices are containers; we package them up. They should be policy-free, and the code should be structured so that there's a non-production mode that makes running the code safe. For example, if a service normally sends emails, then with production set to false it shouldn't send emails — otherwise you get kernel developers yelling at you. (A small sketch of this follows below.)

We centralized all our deployment YAMLs for OpenShift and all the Ansible scripts into one place, which allowed us to clean them up, get some structure in there, and get an overview of what we run. And as everything is deployed automatically from there, if you change something, everything is redeployed. It's idempotent Ansible and OpenShift or Kubernetes YAMLs, so they shouldn't mess things up. But it also means that it has to work.
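The non-production mode can be as simple as this sketch; the environment variable name and SMTP host are our invention for illustration.

```python
import os
import smtplib
from email.message import EmailMessage

# Treat anything but an explicit "true" as non-production, so a fresh
# local checkout is safe by default. The variable name is hypothetical.
IS_PRODUCTION = os.environ.get("CKI_IS_PRODUCTION", "false") == "true"

def send_report(recipient, subject, text):
    message = EmailMessage()
    message["To"] = recipient
    message["Subject"] = subject
    message.set_content(text)
    if not IS_PRODUCTION:
        # Non-production mode: log instead of emailing kernel developers.
        print(f"[dry run] would send to {recipient}: {subject}")
        return
    with smtplib.SMTP("smtp.example.com") as smtp:
        smtp.send_message(message)
```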
So if the deployment code is brittle, you will notice, because it runs quite often. It also keeps people from messing with the infrastructure manually, because their changes get overwritten the next time such a pipeline runs.

Now, normally for a microservice you want to try it locally. We have a helper to build the container image, but you can also just pull it down from a merge request: all our merge requests build all of those container images continuously, tagged per merge request and per pipeline. So whatever you are interested in, you can get from the GitLab registry, which might be faster than building it locally. Running those container images is always painful, though, because you need to put some configuration on them to make them do what you want. Most of the pieces have a command line interface that lets you check how the microservice processes certain things. For example, the kernel webhooks check merge requests for whether they have certain labels set, stuff like that — so they have a command line interface to check a specific merge request, and you can see what they would do. If you think of it as function-as-a-service: most of our microservices mostly do event processing, so you can basically do a one-shot run on some event, a cloud event, something like that (a sketch of such a CLI follows this section). That works reasonably well, but getting them up locally is a bit of a pain, because you need that configuration, and it's not something you can easily share, because there might be production credentials in it.

The next aspect is how you can actually do a testing deployment of them. You have this non-production mode, so you either run it locally in non-production mode or remotely on a Kubernetes cluster; it doesn't matter, you can just move it there. There's a shell script for that which nobody runs, because running a shell script is painful. So the last thing we've worked on is making that automated at the press of a button in GitLab: you have a merge request, you have this beautiful button that says "deploy into a merge request environment", and it spins up a comparable deployment config on an OpenShift/Kubernetes cluster to try it out. The idea is that you can find a link and then inspect the logs, for example.

The same goes for production deployments. Normally code is deployed automatically when the merge request is merged, but sometimes, to find out whether something is really fixed by your code, you can also deploy a merge request into production, and that will survive a redeployment of the deployment repository. Basically it tags a container image as being production, and within two minutes it should appear on OpenShift, and you can figure out whether you solved the problem — or press another button, another job, to roll back, which should also be quite painless and done within a couple of minutes.

Now for the other big piece, which is the pipeline. The pipeline is basically a lot of Python tools duct-taped together with a lot of Bash — 2,000 lines of Bash, I think, at the moment — which is really hard to test. But you can test the components, and that's actually the way you should test them: they all have a command line interface, you pip install them, and there you go. That is also the way these tools are developed.
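Such a one-shot command line interface could look roughly like this; the handler and argument names are made up for the example.

```python
"""Hypothetical one-shot CLI for an event-processing microservice."""
import argparse
import json
import sys

def process_event(event):
    # Stand-in for the real event handler, e.g. checking a merge
    # request for required labels.
    print(f"would process merge request {event.get('mr')}")

def main():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("event_file",
                        help="JSON file with one event to process")
    args = parser.parse_args()
    with open(args.event_file) as handle:
        process_event(json.load(handle))

if __name__ == "__main__":
    sys.exit(main())
```

This keeps the service runnable as a plain command, so you can replay one captured event locally instead of standing up the whole deployment.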
For actually running the pipeline itself, we support canary deployments — that's on the next slide. You can run your non-merged code in a pipeline: normally these pipelines are just containers that do a pip install of the main branch at the beginning, and you can override that to say "I want my branch instead of the main branch" (a sketch of this follows below). And we have a bot that will help you: if you're nice to it, you can ask it quite complicated things, basically saying "retest this old pipeline that passed a while ago, using the same kernel revision" — it should still pass — and you can also modify parameters if that's necessary for testing your changes. If you mess with the pipeline YAML, for example, or if you change how the Beaker provisioning works, you can actually figure out whether you broke the production setup before it gets merged.

And then production deployment for the pipeline is basically: you merge it, and as soon as it hits the Git repository, it will be used for the next pipeline that comes around. And if you need to revert code, it's a Git revert and a button press away. So that's also quite easy.
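The override mechanism boils down to something like this sketch at the start of the pipeline container; the variable name and repository URL are illustrative, not the real ones.

```python
import os
import subprocess
import sys

# By default the pipeline installs its tools from the main branch; a
# canary run overrides PIPELINE_TOOLS_REF (a hypothetical variable)
# with an unmerged branch to test it against a real pipeline.
ref = os.environ.get("PIPELINE_TOOLS_REF", "main")
url = f"git+https://gitlab.com/example/pipeline-tools@{ref}"

subprocess.run(
    [sys.executable, "-m", "pip", "install", url],
    check=True,  # fail the job early if the install breaks
)
```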