All right, thank you very much for the invitation. It's really good to be at a conference again, with all the challenges of rediscovering how to do presentations live. Today I'll be talking about how we've been using GitOps at CERN, and especially the needs we have for multi-cluster, multi-cloud deployments, some quite large deployments we have to handle: where this all comes from, and then hopefully I'll leave some time for Q&A. Very quickly: I'm a computer engineer at CERN. I work on containerization, Kubernetes, and also some machine learning and accelerators. I also do some work in the CNCF on the TOC, and I co-lead the CNCF research user group. CERN has a quite large private cloud, and the Kubernetes deployments are also quite significant: more than 600 clusters, thousands of nodes, more than 10,000 cores today, much of it on Kubernetes. So it's pretty significant, and this is our on-premises world, which is where I'll start today: describing what we need to support this kind of infrastructure, what we do to ensure high availability and proper upgrades, and how we use GitOps to help with this. Starting with our on-premises world, there are quite a few reasons we decided to go with multiple clusters instead of a few very large clusters. The first is to isolate workloads, to get some sort of multi-tenancy just by splitting clusters. The state of Kubernetes today would probably allow us to do it differently, but there are still quite some advantages to doing it like this. Also, we had some issues with cluster control planes not always working very well, so splitting clusters reduces the blast radius as well. Upgrades are another area where we struggled quite a bit, so doing cluster replacement instead of in-place upgrades is another way to handle this, and I'll talk a bit more about that.
Initially, we also had the need to split clusters to get access to heterogeneous resources. This is not necessarily true today, but it's still something we need in some cases. When you start handling deployments over multiple clusters, being able to do it in an automated and fast way is key, and GitOps is the answer we used. If we look at the past, before we moved to containerized deployments, we were using traditional VMs or physical machines and deploying with things like Chef or Puppet, mostly Puppet at CERN. And in reality, if you were a service manager, most of your work would already be done in Git, just by updating the Hiera values for Puppet, the configuration in Puppet, and then this would propagate to the nodes. So the notion of doing GitOps was already kind of there in this way of interfacing with deployments. This is also a very nice way to introduce people to containerization and Kubernetes. A lot of them go through a quite steep learning curve to start using Kubernetes, and if they can start just by editing YAML and making configuration changes to their deployments this way, they can more confidently build knowledge of the Kubernetes stack as well. And once we do this, we also gain all the benefits of using Kubernetes and containerization: avoiding configuration drift across multiple deployments, and the much better reconciliation Kubernetes allows by having a declarative state that propagates to the different applications and services. So how do we do this internally? We have this notion of clusters as cattle, very much like VMs as cattle before. We try to have multiple clusters serving the same application or service. On the left side, we see a load balancer on top that splits the load of one service across different clusters with a 50-50 distribution.
If we want to upgrade, say we have two clusters running Kubernetes 1.21 and we want to move to 1.22. We usually don't do in-place upgrades, apart from a few exceptions. We just add a new cluster, configure in our GitOps deployments that the service should direct, say, 20% of the load to this new cluster, and gain some confidence with the new deployment. Eventually this allows us to slowly get rid of the old clusters and move the workloads. We can also do things like cluster autoscaling to size the clusters according to the load they have. One key thing we do here, and this is also something you configure in our Git configurations, is that you need a way to expose Services of type LoadBalancer through a single, central load balancer instance. We do this with a slight change to the normal cloud provider, where we say this Service should link to the load balancer pool with this UID, and this allows us to spread the load. So from the state where we had three clusters, one on 1.22 with 20% of the load, we eventually move to what we have on the left, where we got rid of one of the old clusters and split the load half-and-half between the 1.21 and 1.22 clusters. With time, we deploy a new 1.22 cluster and get rid of the old ones. This is how we do cluster upgrades. One thing that is pretty obvious here: if you don't have very good automation of your deployments, this is extremely hard to maintain. So this is where we put our investment. The other part I'd like to discuss regarding our on-premises deployments is keeping components up to date. Even if you version your clusters along with the Kubernetes versions, there are a lot of components that need security patches and small upgrades over time, even within the same cluster.
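To make the central load balancer idea concrete, here is a minimal sketch of what such a Service might look like. This is illustrative only: the annotation key shown is the one used by the upstream OpenStack cloud provider to attach a Service to an existing load balancer, while CERN's modified cloud provider may use a different key, and the pool UID is a placeholder.

```yaml
# Hypothetical sketch: a Service that, instead of creating its own load
# balancer, attaches to a shared central load balancer pool by UID, so
# several clusters can serve the same endpoint with weighted traffic.
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    # Annotation name from the upstream OpenStack cloud provider; the
    # exact key in CERN's patched provider may differ.
    loadbalancer.openstack.org/load-balancer-id: "<central-lb-pool-uid>"
spec:
  type: LoadBalancer
  selector:
    app: my-service
  ports:
    - port: 443
      targetPort: 8443
```

Deploying the same manifest in the 1.21 and 1.22 clusters then lets the central pool split traffic between them, e.g. 80/20 during validation.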
So we have this idea of release channels, where each cluster is linked to a branch of our GitOps configuration, and the branches evolve independently. For example, if you have a 1.22 cluster, you link it to a 1.22 branch that has all the configuration we want for that specific Kubernetes version for the base components in the cluster. You can also choose between a stable branch and a more development-oriented branch, which we call QA. That allows us to roll out changes gradually, first in QA, then in stable. The recommendation here is that you have your versioned clusters, but also maybe a QA and a stable cluster, so you can validate all the changes in your deployments as we roll them out gradually. This is how we structure our deployments in a way that gives us confidence in upgrades of the different components. One key thing is that when you deploy a cluster at CERN, you actually also deploy what we call an umbrella Helm chart. This chart has a bunch of dependencies covering all the base components we need in a cluster. You can see here base, which is again a kind of multi-component chart, then EOS, which is our storage system, the GPU deployments, Fluentd to forward the logs to our central collection, some internal components for storage, and there's a bunch more. When you deploy your cluster, all you have to do is specify a values file; the chart itself has the base definitions with the versions we recommend, but users can override these values if they want. They might disable and enable some of the features, or, for example, for Fluentd we have this notion of a producer that tags the logs with the application that is forwarding them, so we can then correlate between the different services.
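As a rough sketch of the umbrella chart idea: a parent chart lists the base components as dependencies, and a per-cluster values file toggles or tunes them. All names, versions, and value keys here are hypothetical, chosen only to mirror the components mentioned above.

```yaml
# --- Chart.yaml (umbrella chart, hypothetical) ---
apiVersion: v2
name: cluster-base          # hypothetical chart name
version: 1.22.0             # tracks the release channel for 1.22 clusters
dependencies:
  - name: base              # multi-component chart with core add-ons
    version: 1.4.0
  - name: eos               # CERN storage system integration
    version: 0.9.2
    condition: eos.enabled
  - name: fluentd           # log forwarding to central collection
    version: 2.1.0
    condition: fluentd.enabled

# --- values.yaml override supplied by a cluster owner (hypothetical) ---
eos:
  enabled: false            # this cluster does not need EOS
fluentd:
  enabled: true
  producer: my-batch-app    # tags forwarded logs with the producing application
```

The defaults ship in the chart itself with the recommended versions; users only override what they need, which keeps the 600-plus clusters consistent.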
All these pieces give us a quite simple way of managing all the dependencies and all the multiple deployments we have, which as I mentioned is more than 600 clusters today. Now, on getting people into GitOps: a lot of our users will probably just start by writing some YAML files with their deployments and all the Kubernetes resources they need. But when you want to onboard them with GitOps, it's quite essential to make a very big push for training and dissemination. This is what we've been doing for the last two or three years. People need to know: if I have multiple clusters, how do I handle the load distribution across my services? If you're running some sort of batch workload, how do you redirect batch jobs to the multiple clusters? How do you handle the queues in the different clusters, things like this? So we wrote a couple of simple getting-started tutorials. I think they are open, so you can have a look. They are getting outdated, because these tutorials are still for Flux and Argo CD v1. We actually started by writing one with Flux, which was the main tool we were using. Some users started using Argo CD as well, and one of them forked our tutorial and wrote the equivalent for Argo CD. These tutorials cover everything from the recommended repository structure and how to handle environments, using either multiple directories or branches, to managing secrets with something like SOPS. In both cases we rely on SOPS for secrets. This is something we push internally quite a bit, and it's the key to being able to maintain the multi-cluster deployments we want to keep. After we did all this, we started jumping to the public cloud, and this brings quite a lot of challenges. I will mention a few.
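To show how the SOPS-based secrets handling ties into a GitOps tool, here is a minimal sketch using today's Flux v2 `Kustomization` API (the tutorials mentioned above predate it); the repository name, path, and key secret are hypothetical.

```yaml
# Hedged sketch: a Flux v2 Kustomization that reconciles a directory of
# manifests from Git and transparently decrypts SOPS-encrypted secrets.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/production     # hypothetical per-environment directory
  prune: true
  sourceRef:
    kind: GitRepository
    name: cluster-config          # hypothetical GitRepository object
  decryption:
    provider: sops                # decrypt SOPS-encrypted manifests on apply
    secretRef:
      name: sops-keys             # Secret holding the age/GPG private key
```

Secrets stay encrypted in Git; only the controller in the cluster holds the decryption key, which is what makes a single shared repository workable.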
But there is one good thing: because we already rely on the Kubernetes API and the Kubernetes ecosystem for most of these deployments, and most public cloud providers offer managed Kubernetes services, a lot of it can be reused. This was the key to easing our transition to the public cloud. There are quite a few motivations for an organization like CERN to use the public cloud, even with a very large data center and access to a lot of resources internally. For one thing, CERN has a fairly constant baseline load on our servers, but we also have physics conferences, for example during the summer, where people need to run a lot more analysis than usual, and this comes with periodic load spikes. If we provisioned our on-premises resources for this kind of peak load, we would have quite a lot of waste during the rest of the year. So there's a motivation to use on-demand resources to cover this. Then, with a lot of machine learning happening everywhere, including at CERN, we need a large number of accelerators: GPUs and dedicated accelerators like TPUs and IPUs. For GPUs, we have some in-house, but we need a lot more, so we go to the public cloud for the extra resources. TPUs, the tensor processing units from Google, for example, we can't have on premises, but they still have value for us, so we use them in the public cloud. We also use public cloud resources to get access to new types of resources and evaluate them before doing big tenders on premises. And disaster recovery is another of our use cases. So it's quite a different use from what we had on premises, but because we already had these multi-cluster deployments, we kind of knew how to manage this in a similar way. Now, a very brief word on the scale of our public cloud needs, just to give you an idea of why we need this.
A couple of years ago, we did a keynote at KubeCon where we analyzed 70 terabytes of data in five minutes using a Kubernetes stack. It worked very well, and we proved that we can scale our deployments to this kind of load. And yes, we rebuilt, live on stage, the Higgs plot behind the 2013 Nobel Prize. This was a really nice way to say: okay, we can use the public cloud for real things. But regarding the deployment, this was kind of a stunt. It was a one-off deployment, which is not really useful for anything else, and we didn't even get halfway to what we need to use the public cloud in a generic way. Things we knew we would need: supporting multiple cloud providers, since we cannot depend on one single provider. Supporting multiple regions; if you use the public cloud, you know that the resources available in different regions can vary, so you need to be flexible. Then integration with our in-house tools; I mentioned storage systems, but we have a lot more that we need to propagate to the public cloud for our applications to be able to run. We need centralized monitoring, log collection, and tracing of the applications: deployed across multiple public clouds, but collected centrally. And then accounting and costing. You can keep adding things that you need when you start using the public cloud. But again, we had these components well structured, and we had an API that is available in all the cloud providers, so there was quite a lot of work done already. In the same way that I had a slide about doing things at scale to strengthen the case for the public cloud, on the other hand we also learned that resources are not infinite even in the public cloud. If you use CPUs in the tens of thousands, you're probably okay, but if you go larger than that, you might need to go multi-region.
If you're using GPUs, this can happen even at 1,000 GPUs or a couple of thousand; you probably want to start splitting across regions. And there's one more thing: you can get a huge benefit in the cloud from preemptible or spot instances, which are really cost-effective, but their availability is also quite limited, so you need to be smart and explore multi-region to reduce cost here as well. The main message is that these are all reasons to stay flexible in your choice not only of clouds, but of regions within the same cloud. And in the same way that I explained that multi-cluster on premises needs automation and GitOps, if you go multi-cloud and multi-region, this is even more important. I'll finish this part with an example; I was talking to Cornelia about machine learning just before. We have some cool machine learning use cases, and to give an idea: with one GPU, a use case might take an hour to do a single epoch, and just by using the public cloud and exploiting more resources, we can speed that up 100 times at the same cost. So there is really a push to explore these resources. I'll spend the rest of the presentation describing how we handle the public cloud. It's quite similar to the stack we had on premises. What we add is that we start managing the clusters themselves using GitOps. This wasn't the case before: people would deploy their clusters and then register, using Flux or Argo CD, where the applications should come from, and add this to the cluster. But in the public cloud, because we need this flexibility of multiple clouds and multiple regions, we started looking at whether we can do the same for what we call the underlay, the clusters themselves. So there are basically three parts we need.
We need to support clusters across multiple clouds and multiple regions and manage them in an automated way, have the base services I mentioned, and then manage the applications themselves. We end up with three groups of components. The first is again the underlay: the clusters themselves, the infrastructure. Then the base services we need on all clusters for applications to be able to run. And then the services and applications themselves, which are cluster-specific and which we allocate to different clusters as needed. We do this with a single Git repository that has all the configuration for the clusters, all the definitions of the base services, and the association of services and applications with each of the clusters. So all the complexity of a very large deployment, where you might have AWS, Azure, and Google Cloud in multiple regions, comes down to looking at a single repository to manage all of this. Here is a view in Argo CD, where we have, again, the underlay, then the infrastructure, and then the services and applications running. We use the notion of app of apps, which I think in Argo CD v2 evolved into ApplicationSets, but it's the same concept of aggregating things into groups. This is what the deployment looks like at CERN. We have one cluster at CERN that is responsible for managing all the public cloud deployments, and it holds the definitions of the clusters and then the definitions of the base services. So you might have a cluster in GCP, say in the europe-west4 region, and this cluster will have multiple node groups, node groups that autoscale, for example, for different types of GPUs, and then maybe a TPU. And then we have a similar one where we see the need for multiple regions.
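The app-of-apps grouping can be sketched as a single Argo CD `Application` that points at a directory of child `Application` manifests in the management repository. The API shown is Argo CD's real `Application` kind; the repository URL and paths are hypothetical.

```yaml
# Hedged sketch: a parent "app of apps" aggregating one layer (here the
# base infrastructure) of the single multi-cloud configuration repository.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: infrastructure
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.org/cloud/multicloud-config.git  # hypothetical
    targetRevision: main
    path: infrastructure        # directory of child Application manifests
  destination:
    server: https://kubernetes.default.svc   # the management cluster itself
    namespace: argocd
  syncPolicy:
    automated:
      prune: true               # remove resources deleted from Git
      selfHeal: true            # revert manual drift
```

Parallel parent apps for the underlay and for the services give the three groups described above, all reconciled from one repository.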
For example, in west4 you won't find NVIDIA T4s; we need to go to west1 to get the T4s. So we already need two regions just because of access to different types of resources. And then we need AWS, because this is where we get Arm Graviton2, and we need Azure for GPUs or IPUs, for example. This can be quite complex to manage. But in reality, what we have at CERN is one very big YAML file that says, for each cluster I need: I call it like this, this is on the GCP cloud, and then the region I want, the cluster version, and then the different node pools I need. In this case, I have a node pool for NVIDIA GPUs, and you would see this continuing. The key thing here, if we're talking about clusters, is that the cluster has to become a Kubernetes resource. A Kubernetes cluster being a Kubernetes resource is kind of complicated to wrap your head around at first, but it's actually a really good idea: the same declaration and reconciliation we do for any kind of resource in Kubernetes, we can do for cloud resources as well. For that, we rely on a project called Crossplane, which has the notion of a provider for the different clouds. In this case, you see a template where we define a GKE cluster with a name, and then, for each of a bunch of node pools, a template that defines the max and min node count, which kind of accelerator it should use, if any, and the type of image and disk it should use. This is expanded based on the YAML file I showed you. So again, one single Git repo allows us to do all of this. If you look at our Argo CD deployment in this case, it is really large, but with this app-of-apps navigation, you get a nice feedback loop all the way up on the status of the resources.
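As a rough sketch of what the expanded Crossplane manifests might look like: a managed `Cluster` resource plus an autoscaling `NodePool` with an accelerator. This is hedged heavily; the exact API groups, versions, and field names vary between Crossplane provider-gcp releases, and all names and sizes here are hypothetical.

```yaml
# Hedged sketch of Crossplane-style GKE resources; field names are
# illustrative and depend on the provider-gcp release in use.
apiVersion: container.gcp.crossplane.io/v1beta2
kind: Cluster
metadata:
  name: gke-europe-west4
spec:
  forProvider:
    location: europe-west4
    initialClusterVersion: "1.22"
---
apiVersion: container.gcp.crossplane.io/v1beta1
kind: NodePool
metadata:
  name: gpu-pool
spec:
  forProvider:
    clusterRef:
      name: gke-europe-west4      # bind the pool to the cluster above
    autoscaling:
      enabled: true
      minNodeCount: 0             # scale to zero when idle
      maxNodeCount: 10
    config:
      machineType: n1-standard-8
      guestAccelerators:
        - type: nvidia-tesla-t4   # GPU type available in this region
          count: 1
```

The Crossplane controller reconciles these declarations against the cloud API, exactly like any other Kubernetes resource, which is what makes the cluster itself GitOps-managed.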
And you can see the reconciliation happening in two cases here. In this case, we added an EKS cluster that is being reconciled, and for the west4 example cluster with all its node pools, you can see they are already reconciled. The layer above this is the infrastructure layer. Again, this is extremely similar to what we do on premises, but it has some particularities. For example, at CERN we have a registry based on Harbor, and we have Prometheus; once we have this very distributed infrastructure, we had to look at how to handle monitoring, so we actually deploy Thanos. And in each cluster on the cloud, in each region where we have a cluster deployment, there is a replica of all these services. The registry is also deployed there, and in addition we define the replication rules for which images should be replicated in advance, as well as proxy caches. Then there are the policies we need: for example, things we want to express like who can run in this cluster, in this cloud, based on labels that have to come with the applications; some storage systems we need; and the monitoring as well. This is quite easy to deploy: in most cases these are Helm charts or Kustomize-based deployments. In other cases, like the replication rules, we have small tools that make API calls to the systems. And this is a view of what a dashboard looks like. With on-premises GPU monitoring, you would see a single cluster or a couple of clusters in the same view, and you can see the same services here; you can also aggregate the queries per cloud. In reality, for the service managers this is all hidden. They don't have to bother with the details of the deployments; all they care about is this dashboard.
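The label-based placement policies mentioned above map naturally onto Argo CD's `ApplicationSet` with a cluster generator: an application is deployed only to clusters whose registration carries a matching label. The API is real; the label keys, repository, and paths are hypothetical.

```yaml
# Hedged sketch: deploy an application only to clusters labeled as
# permitted to run it, across any cloud or region.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: gitlab-runners
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            workloads/gitlab-runners: "allowed"   # hypothetical policy label
  template:
    metadata:
      name: 'gitlab-runners-{{name}}'             # one Application per cluster
    spec:
      project: default
      source:
        repoURL: https://git.example.org/cloud/multicloud-config.git  # hypothetical
        targetRevision: main
        path: services/gitlab-runners
      destination:
        server: '{{server}}'                      # target cluster's API server
        namespace: gitlab-runners
```

Adding or removing the label on a cluster's registration is then all it takes to move a workload in or out of that cluster or cloud.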
One key thing is that our users don't really see all this complexity. What they care about is that we integrate with, for example, GitLab. Here I give an example of GitLab CI. We have a bunch of runners to run, for example, jobs on GPUs. We already had the integration with Kubernetes on premises, and we integrated the runners with our external cloud deployments as well. So GitLab runners would be one of the applications running across all these clusters. Our users just submit their CI jobs via GitLab, and the jobs run wherever is appropriate, without the users realizing it. It's a nice way to expose all of this to the end users. We do this with machine learning frameworks as well, where a distributed training job might run on premises or on the public cloud, depending on what is happening, without much trouble. So I come to the end. Basically, GitOps is the key for us to be able to manage hundreds of clusters on premises, thousands of deployments, and tens of clusters already on multiple clouds and regions. We use GitOps for the base services, but we have started using it for the clusters themselves. You can add a cluster to our infrastructure on any public cloud and have it serving production workloads in less than 15 minutes, which is quite an achievement with all the stack we rely on. The flexibility also allows us to have much more cost-effective deployments. And the Kubernetes API is really the common language, and the whole ecosystem is the enabler for all of this. If you maintain or contribute to any of these projects, we have to say thank you; you are helping us in a huge way. And with that, I come to the questions, if there are any; hopefully there's time. We have several questions. Can our attendees access these slides later on? Yeah, sure, I can upload them to the agenda. Any other questions?
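From the user's side, the multi-cloud machinery is invisible: a CI job just names a runner tag. A minimal sketch of such a `.gitlab-ci.yml` job, with a hypothetical tag name and image:

```yaml
# Hedged sketch: the user only picks a runner tag; whether the job lands
# on premises or in a public cloud GPU cluster is decided by the platform.
train-model:
  image: tensorflow/tensorflow:latest-gpu   # illustrative image choice
  tags:
    - k8s-gpu          # hypothetical tag matched by GPU runners in any cluster
  script:
    - python train.py --epochs 10
```

The runners registered across on-premises and cloud clusters all advertise the same tag, so scheduling follows capacity rather than location.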
Given that you're connecting from your on-premises system to multiple different cloud providers, how do you handle the networking between your own data centers and the public cloud? Right, so it depends on the service. For most of the use cases, like the GitLab CI runners I mentioned, they will just pick a job and run it on the public cloud, so it really depends on what the job needs. If the job only needs, for example, access to our storage systems, we have replication in place for whatever storage system is needed; one of the base services is this replication that extends the storage to that region. In some cases, if the service really needs tight interconnectivity, you need some sort of VPN set up between the different data centers. We don't have a lot of use cases onboarded into our public cloud deployments that need that; we really use them for batch, totally independent workloads that fit much better without this need for interconnectivity. Thank you.