Hello! We are junior site reliability engineers from CERN, the world's largest particle physics laboratory. I am Konstantinos, and this is my colleague Rajula, and we would like to take you sailing with us today through our first experiences of building Kubernetes operators. This is CMS, one of the four big experiments at CERN. We try to understand nothing less than the origins of the universe. At the heart of CMS, collisions happen many times per second between particles of two opposing particle beams, and out of those collisions an explosion of secondary particles washes through the building-sized three-dimensional camera that is the particle detector. This produces a lot of data, which then takes a large computing infrastructure to transform into knowledge, and that is a big part of CERN engineers' activities. But because this is a big organization housing more than 12,000 physicists, we want to take care of every aspect of their work, and that includes public outreach. So let's see what exactly we are going to talk about today. First of all, we want to explain what kind of infrastructure we are building; it is related to public outreach. Then, how we use operators and what exactly they are. After that, we want to show you upgrading some websites with our operators, and finally we'll see what we have learned. So let's see what kind of infrastructure we are building. This is home.cern, our organization's home page, and it is hosted on the infrastructure that we are replacing with this design, together with slightly more than 1,000 other websites, all Drupal websites, made by a lot of people around the organization with different requirements, and which every day host about 80,000 unique visitors; the peaks can reach even 1.5 times that much. So Drupal at CERN is actually a rather complicated ecosystem. On top of core Drupal we provide a set of curated modules that are accessible to everyone. But that "everyone" is a very wide-ranging set of users.
These people come from many backgrounds: physicists whose responsibilities also include site building, administrative personnel, and communications experts. Only a very small minority of the site administrators are actually web developers, people with site-building experience, or Drupal experts. All of these people need to have the same easy experience of very reliable hosting, where they don't have to take much responsibility upon themselves to respond to security incidents, upgrade their websites, or recover from common failures. At the same time, they need the flexibility to use custom modules and themes that are not part of the standard CERN Drupal distribution. All of these requirements are what our infrastructure has to satisfy. We are not talking about a basic Drupal hosting service, but about a managed and complex software-as-a-service. Drupal itself is very complex because there are a lot of modules that can be included with core Drupal, and in our case they need to be injected by the users at the time a site is instantiated. A single website also has two pieces of state: a database and a site directory. So take 1,500 instances of a complicated thing, try to automate the business and operational logic, let users self-provision websites, and do all this with a very small engineering team. That is more or less the recipe for the problem we are solving today. So how do we actually create our sites on Kubernetes? Let's see our design. First of all, on an OpenShift Kubernetes cluster, we have a DrupalSite. This is an entity, which we'll go into in a bit, that represents each Drupal site we are hosting. This entity is operated on by an external component that is aptly named an operator, and we will see this external component creating the other Kubernetes resources that the DrupalSite needs to be instantiated.
But the DrupalSite on its own also needs to integrate with external services provided by other CERN teams. The way we organize CERN IT is around small teams that assume ownership of their services and provide them with APIs to the rest of CERN. In this case, we are integrating with the CERN authorization service, which provides authentication and authorization and couples with the CERN SSO page, and with a service that provides us with databases hosted and administered by them. When all of these components, which are operators, are put in action, they instantiate a fully fledged Drupal website, which we can see here. You can see the site directory; the database that is provisioned externally; the step where we actually build the image that is going to run for the website, by injecting user configuration and running a source-to-image build step; and, once those modules, the injected configuration, are baked into the final image, the serving deployment that serves it in production. That is not enough, though, because the site needs to pass through an initialization phase that ensures, for example, that the SSO integration has been set up. Only after that can it be made accessible to the public through an ingress route. All of this is functionality handled by the DrupalSite operator. So we've been talking about operators, but what are they exactly? Let's try to understand them. This is etcd, Kubernetes' key-value store. It holds all the information about what should be in the cluster. And actually, I believe that you are familiar with many of these entities: a Pod, a Job, an Ingress resource. All of these are core resources of Kubernetes, and each one specifies a property that the cluster should represent in the real world.
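To make this concrete, here is a minimal sketch, not our actual controller code, of the sub-resources such an operator would ensure for each site. The names and the gating of the ingress on initialization are illustrative assumptions based on the description above:

```go
package main

import "fmt"

// neededResources lists, in order, the sub-resources the DrupalSite
// operator would ensure for one site. The names are illustrative,
// not the actual Kubernetes object names used at CERN.
func neededResources(initialized bool) []string {
	r := []string{
		"PersistentVolumeClaim (site directory)",
		"Database (provisioned via the external DB operator)",
		"BuildConfig (source-to-image build injecting user modules/themes)",
		"Deployment (serving pods)",
		"Job (site initialization, e.g. SSO setup)",
	}
	// The site is only exposed to the public once initialization is done.
	if initialized {
		r = append(r, "Ingress route (public access)")
	}
	return r
}

func main() {
	for _, res := range neededResources(true) {
		fmt.Println(res)
	}
}
```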
So when a user creates a Pod resource, the Kubernetes API emits an event that a piece of code called a controller is watching for, in this case the Pod controller. The controller is a piece of code that knows the internals of how this resource should behave, and when it is notified that something has happened to the resource it is watching, it takes action. Its action is to make sure that the world, the actual state of the cluster, represents what is specified by the resource. In the case of a Pod, the resource specifies that a set of containers should be running on a node, and if the controller finds that the assigned node does not have those containers running, it might, for example, call the kubelet and instruct it to instantiate them. This process of comparing the specification of the resource with the actual state of the world is called reconciliation, and it happens in a loop: the controller is always active, running, listening for events, and every time an event happens, it runs its reconciliation logic. These components are spread around the Kubernetes cluster: etcd, the API server, and the controllers live in the control plane, while the actual state of the world is mostly represented by the worker nodes. And this is more or less what an operator is as well. An operator essentially takes everything we have just described for basic Kubernetes and extends it to custom types that we, as users, have defined. That's it. In this case, the custom type, which is actually called a custom resource definition, or CRD for short, is the DrupalSite, and this piece of code is the DrupalSite controller. Together, with a way to deploy them in the cluster, they form the DrupalSite operator. From a developer's perspective, we can think of the custom resource in the following way: it is analogous to a class.
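A minimal, self-contained sketch of this reconciliation pattern, using plain Go with toy types instead of the real Kubernetes client libraries, could look like this:

```go
package main

import "fmt"

// PodSpec is a toy stand-in for a resource's desired state.
type PodSpec struct {
	Containers []string // containers that should be running
}

// Node is a toy stand-in for the actual state of the world.
type Node struct {
	Running map[string]bool
}

// Reconcile compares the spec against the node's actual state and
// starts whatever is missing, returning the containers it started.
// In a real controller, "starting" would be a call to the kubelet
// or the API server rather than a map write.
func Reconcile(spec PodSpec, node *Node) []string {
	var started []string
	for _, c := range spec.Containers {
		if !node.Running[c] {
			node.Running[c] = true
			started = append(started, c)
		}
	}
	return started
}

func main() {
	// The node already runs "web"; the spec also wants "sidecar".
	node := &Node{Running: map[string]bool{"web": true}}
	spec := PodSpec{Containers: []string{"web", "sidecar"}}
	fmt.Println(Reconcile(spec, node))
}
```

A real controller runs this comparison every time it receives an event for the resource it watches, which is what makes the loop converge the cluster toward the declared state.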
The custom resource itself is like the data fields of the class, and the controller is like the methods that know how to operate on it. So this is what operators are, but how do we actually make them? Do we have to start writing them from scratch, including all the logic that knows how to talk to the Kubernetes API, for example? No, not really. We can benefit from software development kits, the Operator SDK and Kubebuilder, which put all the scaffolding in place for us and make it much easier. We write most of our operators in Go, with a few exceptions in Ansible. The scaffolding already runs the reconciliation logic in a loop and accesses the Kubernetes API; essentially, all we have to do is write the reconciliation logic itself. So now that we have described how operators work, it is time to let my colleague Rajula take you on a deep dive into how we have implemented operators for this design. Thank you, Konstantinos. Now that we have covered the fundamentals and some concepts about operators, let's talk about the operators we have developed. The first one is the DrupalSite operator. Its function is to create Drupal sites. Here you can see the custom resource of a sample site: we have the name of the site and we have a spec. I would like to focus on three specific fields here: the Drupal version, the published field, and the site URL. The Drupal version defines what the version of our site should be. The published field tells us whether we want the site to be published to the internet or not. And the site URL defines the published URL of our site. At some point we may want to change any of these. Say I want to change the URL: I just have to make the change here, the change is propagated to the infrastructure, and the respective ingress is changed accordingly.
Now say I give an incorrect URL, one that doesn't abide by DNS rules. The operator will try to enforce it, see that it's an error, and then set the state back to the previous working URL. This is exactly how an operator syncs with the world and maintains the state. We also have an object called status in the same CR, the same custom resource. The status represents the current status of the DrupalSite: if something goes wrong internally, the operator updates the status accordingly, so that users can understand what actually went wrong. So what is the capability of our operator? What exactly can it do? There are five levels of capability defined by the Operator Framework. We are currently at the end of level 2 with the DrupalSite operator, but our goal is to be at level 4. Currently we are able to provision sites and do basic upgrades. We already have designs for backups, metrics, and recovery from some common failures, and we'll be implementing them in the coming weeks. Remember the other operators mentioned earlier, like the authorization operator and the external DB operator. These operators are not providing the application itself, like the DrupalSite operator does, or any other user-facing service, but they are very critical infrastructure: they take care of SSO integration, they take care of the database. As already mentioned, we have different teams providing different services at CERN, so this is an essential part of the bigger picture. Say we have other use cases besides Drupal sites, such as other web services; all of these use cases require the same infrastructure operators, so we can reuse the same authorization operator with them. Given that a CRD is essentially an API, and that's how operators work, this makes an operator composable and really easy to integrate with other services. Okay, now let's dive into the interesting part: the demo.
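The status updates mentioned above are typically expressed as a list of conditions. Here is a simplified stand-in for that pattern; real controllers use the standard Kubernetes condition type rather than this toy struct:

```go
package main

import "fmt"

// Condition is a simplified version of the status conditions that
// Kubernetes resources carry (type, status, and a machine-readable
// reason). This is illustrative, not the actual DrupalSite status.
type Condition struct {
	Type   string // e.g. "Ready", "UpdateFailed"
	Status bool
	Reason string
}

// SetCondition replaces an existing condition of the same type in
// place, or appends a new one, and returns the updated list. This
// keeps at most one condition per type, as Kubernetes conventions do.
func SetCondition(conds []Condition, c Condition) []Condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

func main() {
	status := []Condition{{Type: "Ready", Status: true}}
	// A failed build would flip a failure condition for the user to see.
	status = SetCondition(status, Condition{Type: "UpdateFailed", Status: true, Reason: "BuildError"})
	fmt.Println(len(status), status[1].Reason)
}
```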
So we'll try to upgrade two of our Drupal sites live in the demo. Do note that we've been using the words "update" and "upgrade" interchangeably; this is a mix of Drupal and Kubernetes conventions, but what we essentially mean is that we want to change the version of the website. So here's our first site. It's called kipconnay, and it's already installed. When I go to the YAML spec, I see that it's running version 8.9.13, and that it's running on the URL kipconnay.webtest.cern.ch. If I go to the web page, the page is accessible, and if I confirm the version from the web page, I see the same version, 8.9.13. Now let's see if we can do an upgrade on this. I'm just going to go here, edit the version in the spec of the custom resource to 9.1.x, and save it. I see that the update status in the conditions has been modified, and I also see new builds that have been triggered, which are going to build the new images for the new version. While this happens, I'm going to walk you through the workflow of our upgrade. So what does our workflow look like? The operator first puts the site in maintenance mode, which is basically a state where it doesn't accept writes, but the site is still accessible. Then the operator takes a DB snapshot of the current state, and then it rolls out new images of the new version, and these images eventually roll out new pods. Once the new pods are running, the operator tries to update the DB schema; this is a required step for a Drupal update. If something fails in any of these steps, the operator performs a rollback and also records that in the status of the custom resource. After this, it takes the site back out of maintenance mode and continues to reconcile. So at the end, if we have a successful update, we'll have the new version of the Drupal site serving requests; if not, we'll still have the same old version serving requests.
Now, coming back to the demo. The workflow is still running, and while it happens, there is one other thing I want to talk about. We tried to migrate around 1,000-plus websites to our pre-prod just for this demo, but we were only able to do around 420-plus. We've seen some scaling issues that prevented us from migrating more websites, which we are still figuring out. So in a way, with every step we take, we have to look at our infrastructure and adapt accordingly. I guess the update has been completed now. If I go to the web page and refresh the same URL, I should see the new version, and indeed I see the new version here, which is 9.1.5. And if I go back and check the status, I see that the update status has also been set back to false. So this is a successful upgrade. Now let's try something else. We have another site, kipconb, which is already serving. If I look at the web page, it is serving, and if I confirm the version of Drupal from the web page, it's 8.9.13. I'm just going to change the version as I did with the previous one, change it to 9.1.x, and save it. I see that there are new builds running; I can see them on the right. While this happens, I'm going to delete a build pod, so the build should eventually fail. The build pod has terminated now, and if I verify that by running `oc get builds`, I see that there's an error in the build. When there's an error in the build, the upgrade workflow should not go ahead. So if I go back to the web page, it is still serving the older version, as it should: because the upgrade didn't go ahead, the old version keeps serving. And I see that there is a new status condition that tells the user that there has been a failure. So we have seen two scenarios here: in the first one, a successful upgrade, and in the second one, a failed upgrade. Now back to Konstantinos.
So now that we've taken you through our design and how we use operators, let's see what the outcomes of this investigation are. What have we discovered? First of all, I'd like to pause on our development practices, because they are really what has enabled us to deliver on this. The most important of all is that we use GitOps. We define the configuration of all of our clusters as one big Helm chart with a lot of subcharts as components, and then we keep the state of the running cluster in sync with what has been defined in Helm using Argo CD applications. This is critical because it allows us to run end-to-end tests from CI at the click of a button. It essentially enables auto-provisioning clusters for development purposes or for CI purposes, for running the end-to-end tests. Taken together, this gives us development clusters that are really close to the production environment; essentially, the difference between our production environment and a development cluster is a few Helm values, and that is all. If we take these things together, plus the learning material from the Kubebuilder book and the Operator Framework, we really had a very good head start, even though prior to this project we had no experience writing controllers or custom logic for Kubernetes. So in the end, we have managed to provision a highly automated infrastructure that solves a rather complicated problem, and all that with a very small team of engineers; essentially, we are four people working full-time on this. We have used the operator model as a principal component of what we are doing. Why? Because in the end we can use Kubernetes as a common API to control not just containers, as was once thought, but many different kinds of resources. And I believe that this is the true core value of what Kubernetes is transforming into today.
So with a final remark, I really invite you to visit our project at the link here and give us your thoughts and feedback. Thank you very much. I hope this was enjoyable.