All right, good afternoon, morning, evening, whatever time of day it is for all the attendees of DevConf.CZ. My name is Petr Muller, and together with my colleague Hongkai I would like to give you an overview of the challenges we needed to solve when OpenShift CI, the system that we develop and operate, grew enough that we needed to adopt a multi-cluster topology instead of running everything on a single cluster.

To give you a very brief overview of where we are coming from: we are both members of the Test Platform team in OpenShift Engineering. We develop, maintain, operate, and support OpenShift CI, the continuous integration system that serves both developers of OpenShift itself and developers of software that is supposed to run on top of OpenShift, giving them feedback by testing their software. You can think of it as a kind of Travis CI, but a Travis CI that automatically installs an OpenShift cluster for you, lets you deploy your software on top of it, and executes serious end-to-end tests, for example. That is what most of our workflows look like. OpenShift CI itself is a fairly large instance of Prow, the upstream Kubernetes CI system, which we also take part in developing upstream, and we operate it on a fleet of OpenShift clusters; I will get to the topology a little bit later. Size-wise, at peak times the clusters in our fleet have something around 200 nodes. We support more than 800 repositories across more than 60 GitHub organizations, we run over 100,000 jobs per week, and we build more than 200 images per week.

Speaking about topology, I will give you a very brief overview of what we have. At the moment we have a central control-plane cluster, which we call app.ci, that serves as the main cluster coordinating everything else. Until some point, this central cluster was all we had: we were running everything on top of a single cluster, but we eventually scaled enough that we needed to adopt multiple clusters. We did that by introducing the concept of so-called build-farm clusters. Right now we have many; the image shows just three of them.

So how do we distribute our workload among clusters? When there is a job to be executed, it is represented as a ProwJob resource on the control-plane cluster, app.ci. When that job gets executed, Prow schedules a pod on one of the build-farm clusters. Pretty much all of these pods execute a tool called ci-operator, our test orchestrator binary that knows how to build and test OpenShift. ci-operator works by creating a temporary namespace to run the actual testing workloads in. In most cases the test workload does something like: set up, which means install an ephemeral OpenShift cluster; run some tests against it, where the tests either exercise the cluster itself, or install something on top of the ephemeral cluster and test that thing; and then tear the ephemeral cluster down. To support everything in this area, we need a bunch of auxiliary tooling that makes sure everything needed by the jobs themselves is present: we distribute content and make sure the build-farm clusters contain all the deployments, all the services we need, et cetera.
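To make that flow a bit more concrete, here is a minimal sketch of the pattern ci-operator follows on a build-farm cluster: create a throwaway namespace, run the test workload in it, and clean up afterwards. This is illustrative only, not the real ci-operator code; the kubeconfig path, test image, and script name are placeholders.

```go
// Minimal sketch of the ci-operator pattern: run a test workload in a
// temporary namespace on a build-farm cluster and tear it down afterwards.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client for the build-farm cluster the job landed on (placeholder path).
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/build-farm.kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// 1. Create a temporary namespace for this test run.
	ns, err := client.CoreV1().Namespaces().Create(ctx, &corev1.Namespace{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "ci-op-"},
	}, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	// 3. Tear the namespace down when we are finished.
	defer client.CoreV1().Namespaces().Delete(ctx, ns.Name, metav1.DeleteOptions{})

	// 2. Submit the actual test workload: a pod that installs an ephemeral
	// OpenShift cluster, runs tests against it, and tears it down.
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "e2e", Namespace: ns.Name},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:    "test",
				Image:   "registry.example/ci/tests:latest", // hypothetical image
				Command: []string{"./run-e2e.sh"},           // hypothetical script
			}},
		},
	}
	if _, err := client.CoreV1().Pods(ns.Name).Create(ctx, pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}

	// Poll until the pod finishes (a real orchestrator would watch instead of polling).
	for {
		p, err := client.CoreV1().Pods(ns.Name).Get(ctx, pod.Name, metav1.GetOptions{})
		if err != nil {
			panic(err)
		}
		if p.Status.Phase == corev1.PodSucceeded || p.Status.Phase == corev1.PodFailed {
			fmt.Println("test finished with phase:", p.Status.Phase)
			return
		}
		time.Sleep(10 * time.Second)
	}
}
```

The real tool does much more, such as building images, wiring in secrets, and collecting artifacts, but the create-namespace, run-workloads, tear-down shape is the same.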
So I'll just walk briefly through the challenges that we will be speaking about today. One of our major challenges is how to make sure that each build-farm cluster contains everything that the CI jobs need in order to actually execute. For example, to install an OpenShift cluster you need a bunch of images that constitute OpenShift itself, you need a release payload image that represents the version of OpenShift to test, and you need a bunch of secrets that allow the job to talk to AWS, GCP, or another cloud platform, et cetera. We need to make sure that everything needed is there. We also need to deal with managing the fleet of build-farm clusters: we want to add new ones, and we need to make sure that every part of the system knows about, for example, a new cluster, et cetera. Then there is a class of problems around deciding which jobs should run on which build farm. Running multiple clusters with different roles in the system is also more difficult to monitor and maintain, and it changed the way we handle incidents, et cetera. Lastly, we would like to speak a little bit about what is still missing right now, which is mostly future work that awaits us.

So I will start with the first class of challenges, which is making sure that all the needed content is actually present on the clusters. Starting with making sure that the clusters themselves have all the deployments, all the configuration, and all the services deployed: we use a GitOps methodology for that. We wrote a tool called applyconfig, which is basically a glorified oc apply variant: it iterates over Kubernetes manifests stored in a directory structure that represents the individual clusters in our system based on their role, and applies them to the cluster, basically doing something like oc apply. A big use case for us is making sure that the manifests are actually valid before we try to apply them to a cluster. applyconfig has a dry-run mode that we run in a presubmit job (presubmits are jobs that are executed against pull requests before they are merged), and that mode does extensive validation of the candidate manifests before they merge in. So a big thing for us was having a single tool that both validates and eventually applies the manifests on the clusters.

The second big category of content that we need to make available on the CI clusters is images; we deal a lot with images. We have something like a central registry, which is the internal registry on the app.ci cluster, and it holds the canonical, current, bleeding-edge set of images that should be used by all the individual CI jobs. We built a controller called testimagesdistributor that distributes these images from the central cluster to the internal registries on all the build farms, so that a CI job that ends up running on a build farm only needs to use the internal registry of that build-farm cluster. When we merge some code and the code is successfully tested, we build new versions of the images on the build farms, and they get promoted, so-called promoted, back to the central registry on the app.ci cluster. This closes the loop, and by this process the images are made available for other CI jobs to consume, in a continuous-integration fashion.
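Coming back to the applyconfig step described above, here is a minimal sketch of that idea: walk the manifest directory for one cluster and oc-apply every file, with a server-side dry run for presubmit validation. The directory layout, kubeconfig path, and flag handling are illustrative; this is not the real tool.

```go
// Minimal sketch of the applyconfig idea: walk the manifest directory for one
// cluster and "oc apply" every file, optionally as a server-side dry run for
// presubmit validation.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
)

func applyConfig(clusterDir, kubeconfig string, dryRun bool) error {
	return filepath.WalkDir(clusterDir, func(path string, d os.DirEntry, err error) error {
		if err != nil || d.IsDir() || !strings.HasSuffix(path, ".yaml") {
			return err
		}
		args := []string{"--kubeconfig", kubeconfig, "apply", "-f", path}
		if dryRun {
			// A server-side dry run validates the manifest against the live
			// API server without persisting anything.
			args = append(args, "--dry-run=server")
		}
		cmd := exec.Command("oc", args...)
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		fmt.Println("applying", path)
		return cmd.Run()
	})
}

func main() {
	// In a presubmit we only validate; after merge we actually apply.
	dryRun := len(os.Args) > 1 && os.Args[1] == "--dry-run"
	if err := applyConfig("clusters/build03", "/path/to/build03.kubeconfig", dryRun); err != nil {
		os.Exit(1)
	}
}
```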
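And to make the image-distribution part concrete, here is a sketch of the effect testimagesdistributor has: for each build farm, point that farm's image stream tag at the canonical image in the central registry, so jobs on the farm pull it from their local internal registry. The real controller works through the Kubernetes API and watches image streams rather than shelling out to oc; the registry host, namespaces, and kubeconfig paths below are placeholders.

```go
// Minimal sketch of the image-distribution idea: for every build farm, make
// its image stream tag reference the canonical image in the central registry.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

const centralRegistry = "registry.example.com" // placeholder for the central registry host

func distributeTag(namespace, stream, tag string, buildFarmKubeconfigs []string) error {
	src := fmt.Sprintf("%s/%s/%s:%s", centralRegistry, namespace, stream, tag)
	dst := fmt.Sprintf("%s/%s:%s", namespace, stream, tag)
	for _, kubeconfig := range buildFarmKubeconfigs {
		// --reference-policy=local makes the build farm cache the image in its
		// own internal registry, so jobs pull it locally.
		cmd := exec.Command("oc", "--kubeconfig", kubeconfig,
			"tag", "--source=docker", "--reference-policy=local", src, dst)
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Run(); err != nil {
			return fmt.Errorf("failed to update %s via %s: %w", dst, kubeconfig, err)
		}
	}
	return nil
}

func main() {
	farms := []string{"/path/to/build01.kubeconfig", "/path/to/build02.kubeconfig"}
	if err := distributeTag("ocp", "release", "latest", farms); err != nil {
		os.Exit(1)
	}
}
```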
The last big part of the content that has to be made available for CI jobs to consume is secrets. Our users who set up CI jobs for their repositories often need to provide custom secrets, not to us, but to their CI jobs: for example, keys to specific accounts on cloud platforms and other kinds of secrets that only their CI jobs need. In order to give them self-service, we built a solution based on HashiCorp Vault where they can manage their secrets without needing our support. And we built another tool called ci-secret-bootstrap, whose job is to distribute the secrets to the individual build farms so that they can actually be consumed by the CI jobs running on those build farms. So that was the part of the talk about making content available on build-farm clusters, and the rest of the talk will be delivered by my colleague Hongkai, to whom I now hand over.

Hello, everyone. My name is Hongkai, from the OpenShift Test Platform team. In this part we will illustrate how to turn an OpenShift cluster into a build-farm cluster. Next slide, please. Prow is the core part of OpenShift CI, and it has various components such as deck and hook. There are also the other tools for GitOps and for managing secrets and images, as we introduced before. For a cluster to become a CI build-farm cluster, these parts of CI have to recognize the new cluster, and all we need to do is add kubeconfigs to the right secrets. For example, hook is the Prow component that gets involved when a GitHub event takes place. Suppose we have a new cluster called build03: we need to create a service account for hook on that cluster, generate its kubeconfig, and save it in Vault. Our tool ci-secret-bootstrap will retrieve it from Vault and use it to create the hook secret in the ci namespace on the app.ci cluster. That secret holds the kubeconfigs for hook on all clusters; they are mounted into the file system so that the hook instance can load and use them. Hook recognizes the new cluster build03 once the secret contains the build03 kubeconfig. We do the same for the other Prow components and the other CI tools, and then the cluster is part of our CI system: it is a CI build-farm cluster. Next slide, please.

Once a cluster becomes a CI build-farm cluster, we can run jobs there, meaning Prow can create ProwJob pods on that cluster. The goal of job dispatching is to distribute the workload evenly among the clusters, which is achieved by two CI automations: one is called the dispatcher, the other the sanitizer. A configuration file describes where a job should run; job names and file names are used in it. That configuration file is the output of the dispatcher, which uses some heuristics to keep the workload even. We cannot simply assign jobs and files randomly, because jobs differ in how frequently they run. The sanitizer reads the configuration file and sets the cluster field in each job definition accordingly. Eventually, Prow creates the pod for the job on the cluster given by that field. This configuration file is also used to handle failover if some cluster is down. Next slide, please.
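To illustrate the kubeconfig mechanism: the mounted secret effectively gives a component one kubeconfig file per cluster, and the component builds one client per file. Here is a small sketch of that pattern, assuming a hypothetical mount path of /etc/kubeconfigs with one file named after each cluster; the real Prow components use shared flags and libraries for this.

```go
// Minimal sketch of how a component can pick up one client per build-farm
// cluster from kubeconfigs mounted out of a secret (one file per cluster).
package main

import (
	"fmt"
	"path/filepath"
	"os"
	"strings"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func loadClusterClients(dir string) (map[string]kubernetes.Interface, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	clients := map[string]kubernetes.Interface{}
	for _, e := range entries {
		// Skip directories and the hidden bookkeeping entries a secret volume mount creates.
		if e.IsDir() || strings.HasPrefix(e.Name(), "..") {
			continue
		}
		cluster := e.Name() // e.g. "build03"
		cfg, err := clientcmd.BuildConfigFromFlags("", filepath.Join(dir, cluster))
		if err != nil {
			return nil, fmt.Errorf("kubeconfig for %s: %w", cluster, err)
		}
		clients[cluster] = kubernetes.NewForConfigOrDie(cfg)
	}
	return clients, nil
}

func main() {
	clients, err := loadClusterClients("/etc/kubeconfigs") // hypothetical mount path
	if err != nil {
		panic(err)
	}
	for name := range clients {
		fmt.Println("can schedule pods on:", name)
	}
}
```

Once ci-secret-bootstrap adds a build03 kubeconfig to the secret, build03 simply appears in that map on the next load, which is all that "recognizing the new cluster" means here.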
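And here is a sketch of the dispatching and failover idea just described: a config maps job files to clusters, the sanitizer stamps the cluster field into each job definition, and draining a broken cluster is just rewriting its entries in the config. The types, field names, and file names below are made up for illustration; they are not the real configuration format of the dispatcher or sanitizer.

```go
// Minimal sketch of the dispatcher/sanitizer idea and of failover by reassignment.
package main

import "fmt"

// Config is the (hypothetical) dispatcher output: which cluster runs the jobs
// declared in which job-config file.
type Config struct {
	Default      string            // fallback cluster
	FileClusters map[string]string // job file -> cluster
}

// Job stands in for a Prow job definition; Prow creates the job's pod on the
// cluster named in Cluster.
type Job struct {
	Name    string
	File    string
	Cluster string
}

// sanitize sets the cluster field on every job according to the config.
func sanitize(cfg Config, jobs []Job) {
	for i := range jobs {
		if c, ok := cfg.FileClusters[jobs[i].File]; ok {
			jobs[i].Cluster = c
		} else {
			jobs[i].Cluster = cfg.Default
		}
	}
}

// drain handles failover: move everything off a broken cluster onto a healthy one.
// In practice this is done by a pull request that changes the config file.
func drain(cfg Config, broken, healthy string) {
	for file, cluster := range cfg.FileClusters {
		if cluster == broken {
			cfg.FileClusters[file] = healthy
		}
	}
}

func main() {
	cfg := Config{
		Default: "build01",
		FileClusters: map[string]string{
			"installer-presubmits.yaml": "build02", // hypothetical file names
			"origin-periodics.yaml":     "build03",
		},
	}
	jobs := []Job{{Name: "e2e-aws", File: "installer-presubmits.yaml"}}

	sanitize(cfg, jobs)
	fmt.Println(jobs[0].Name, "runs on", jobs[0].Cluster) // build02

	drain(cfg, "build02", "build01") // build02 is down: fail over
	sanitize(cfg, jobs)
	fmt.Println(jobs[0].Name, "now runs on", jobs[0].Cluster) // build01
}
```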
So from this slide on, we will talk about the routine operations that our team does on those clusters. Next. The clusters in the CI system are dogfooded with candidate versions of the coming release, so we might hit some issues with these pre-release versions. That is the reason the build01 upgrade always goes first, while build02, which is the cluster for failover, is always upgraded manually. After the version has been stable on build01, we upgrade build02 to that version. A cron job keeps build01 up to date on z-stream versions; if it is a y-stream upgrade, it is manual, and there is usually someone from our team who watches y-stream upgrades. The soaking time on build01 is one week for a y-stream upgrade and one day for a z-stream upgrade. Failover is simple with the configuration file from the last slide that determines where jobs run. We have recently started to use OSD (OpenShift Dedicated) clusters for our build farms, for the simplicity of cluster provisioning and the expert support from the OSD team, and the versions of those OSD clusters are kept up to date automatically. After all build-farm clusters are successfully upgraded, we start to upgrade app.ci. Next slide, please.

A multi-cluster system is definitely challenging for our team. We need a central place for notifications and alerts to converge, and for this purpose we deploy a monitoring stack for Prow. It has its own instances of Prometheus and Alertmanager, and the Alertmanager is integrated with PagerDuty and Slack. Someone from our team will be notified if a service is down, a critical job fails, or some build-farm cluster is down. If a cluster cannot be fixed in a reasonable time, we migrate the jobs away from the broken cluster. This can be achieved by simply creating a pull request that modifies the configuration file for the dispatcher and sanitizer. All the effort we put into Prow and the other tools to support multiple clusters pays off in this scenario: any single build-farm cluster is less critical. Moreover, isolating the workload of CI jobs from Prow makes app.ci more stable, and less CI downtime makes the users of the CI system, the OpenShift developers, more productive. Next slide, please.

An OpenShift cluster can be created and destroyed easily, like a resource in the cloud. We want to catch up with that, and we have done some work in this direction. Joining a new cluster to the CI build farm can be done with a few hours of work, and there are more steps we want to automate. Our dream is that one day we have an auto-scaler for our CI build farms: a cluster joins and retires from the CI system as needed. We are certainly not there yet. Prow is running on app.ci, which is still a single point of failure in our system; OpenShift CI would face an outage if app.ci went down. Fortunately, app.ci is an OSD cluster managed by the Red Hat OSD team, and their support is quick and good. Another potential bottleneck comes from testimagesdistributor, which distributes images from app.ci to the other build-farm clusters. It watches every image stream on every cluster so that it can keep every image up to date. As we have more and more clusters in the CI system, it uses more and more resources, and we might need some powerful, dedicated nodes to host that deployment in the future. Next slide, please.

So this concludes our presentation. Feedback and questions are welcome.