I'm Jack and I work at CERN. For those of you who haven't heard of CERN, it's the European Organization for Nuclear Research, based in beautiful Geneva, Switzerland. In this picture you can see Lake Geneva in the middle (that's the blue thing), the Alps in the background (that's the white stuff), and on the right the main CERN campus, right at the border with France. Approximately a hundred meters underground we have this thing called the LHC, the Large Hadron Collider. That is where we accelerate protons to almost the speed of light, let them collide into each other, and measure and analyze what happens, to understand the origins of the universe. But we don't just do physics: some of you might know this guy, Sir Tim Berners-Lee, who also used to work at CERN, and one of the things he did there was invent the World Wide Web.

On campus we also have our own data center, where we run all of the physics workloads but also our websites and web infrastructure. I want to bring you back to the year 2014, when the computing landscape looked much different. At CERN there was a general push to adopt more continuous integration practices, that is, automatic building and testing, and at the time one of the best choices for this was Jenkins, so we wanted to encourage users to set up automated builds with it. But because a Jenkins instance cannot really be shared easily, getting one meant requesting it via a ticket; someone would then set up an OpenStack VM for you and provision it via Puppet. This was all a very manual and tedious process, and since each Jenkins server and agent had its own VM, we were looking for ways to consolidate those resources, but also to simplify the management of all of these distributed, independent Jenkins servers.

Around this time we were evaluating the various options that were available, and OpenShift came up as one of the interesting ones, especially because back in the day it already had a built-in Jenkins template, which proved to be an ideal starting point for spinning up a small Jenkins instance with a couple of agents. But it was also realized relatively quickly that you can't just offer Jenkins as a service: you still need to give the admin a lot of control over the instance, for example to set environment variables or put special files in place. So the scope expanded, and it was quickly understood that OpenShift is in fact capable of much more: instead of using it only for our own services, we could offer it to our users as a platform as a service. At the time there were lots of small web applications, and this still holds true today, written in Python or PHP, that by themselves don't need a lot of resources. It really makes sense to put them in containers and host them on a shared hosting platform, instead of giving each app its own dedicated VM, which you of course need to over-provision so that you have enough CPU and memory capacity at any time.

There were three main requirements for this kind of platform as a service. It should have a low setup overhead, so that you can get started easily even for small projects. It should be resource-efficient, so that unused resources can be shared between the individual projects. And the ongoing maintenance cost for the user should be as minimal as possible: ideally the user just takes care of updating their application, and everything surrounding it, like infrastructure, certificates, and operating system updates, is handled by the administrators, seamlessly to the user. This is in fact what we were then able to do with OpenShift Origin.

Another great feature of OpenShift, at the time and still today, is the source-to-image workflow, which
allows you to have continuous deployment, which at the time was simply not possible with OpenStack VMs and Puppet. You couldn't just push your code to a repository and have it automatically built by some component, pushed out, and installed on a server, or in this case on a container platform. This is what the BuildConfigs and DeploymentConfigs in OpenShift really brought forward.

But while this was very easy to set up, back in 2015 and 2016 a lot of knowledge sharing still had to happen: we had to teach users how to use containers, Docker, Kubernetes, all of these things. Even though they are relatively well abstracted away, when there's a problem you still need to understand how to fix it. Also, we didn't want to deploy OpenShift as an island that is really pretty inside but has no connection to the rest of CERN's computing environment, so we had to develop integrations so that applications running on OpenShift could access and integrate with the rest of CERN's computing infrastructure, such as DNS, firewall, and storage systems.

We've been running OpenShift in production since roughly 2015 or 2016, when the first bigger apps arrived, and around 2021 we started a big push towards moving to OKD 4. OKD is the community distribution of OpenShift, and fast-forwarding to the year 2023, we can really say that OKD 4 has become the foundation of the web services infrastructure at CERN. Why is that? Because OKD provides us with a multi-tenant, highly available, and secure base out of the box. Yes, we could also achieve this with other Kubernetes distributions, but we would need to put in a lot of effort and do a lot of things ourselves that we already get out of the box with OKD.

In addition, we take vanilla OKD and enhance it with additional features and integrations to fit it into our computing environment: things like hostname registration and DNS setup, certificates, backups, and various storage integrations. We do all of this using various operators, controllers, and webhooks. Some of these are regular third-party open source components, such as Velero, which some of you might be familiar with, or cert-manager; other operators we have developed ourselves, either because they are specific to our use case or because they use internal services. We also use lots of webhooks, which we'll get into a bit later.

Based on this kind of own distribution of OKD, we then provide different cluster flavors. Our biggest cluster, at least in terms of the size of the compute nodes, is the platform-as-a-service cluster. This is really just standard containers as a service: users put their pods in, we run the pods, we take care of managing all of the infrastructure, and the user just needs to take care of their application. Essentially we're just exposing the Kubernetes API to the user.

We also have more advanced use cases, such as our app catalog cluster, where we offer particular applications as templates, for example Grafana, WordPress, or Nexus. Here a user can, with a few clicks, create for example an instance of Grafana without having to configure it themselves: out of the box we provide a pre-configured version that is already integrated with CERN's SSO, so that you can log in with your regular account, and we also take care of updating these applications for security updates and so on. The user in this case really just takes care of the configuration parameters, while we are in control of the software.

We also have our WebEOS cluster, which offers static site hosting and CGI. In terms of compute size it's actually our smallest cluster, but it hosts more than 4,000 websites, which really shows how resource-efficient this is compared to lots and lots of small VMs. Here again the user doesn't have
any control over the software that's running, but just provides small configuration snippets that basically say "I would like to expose this directory over HTTP", and we take care of the rest.

Finally, the most advanced use case is our Drupal cluster. Drupal is a content management system, the most widely used CMS at CERN, and here we have really built up a lot of automation with a very advanced operator, which takes care of not just running the software but also taking database backups, performing the database migrations that are necessary, and making sure that the website is available. If you're interested in more details about this Drupal operator, there is a talk tomorrow afternoon from my colleagues on the Drupal team, called "How CERN Scales SaaS with Operators", where they will go more in depth on this operator in particular.

Now, all of this is well and good, but we have these different clusters that are all a bit independent from each other, and we also don't expect all of our users to know how to interact with Kubernetes or how to write YAML manifests. So we have something called the web services portal, which offers a stateless web UI on top of these different clusters: it federates them and is the entry point for non-technical users. What you can do there, shown here in the middle, is create one of those static websites I was referring to before. You basically just put in the hostname you want the website to be exposed at, a description, and which path from a shared file system it should serve, then you click Create, and that's all you need to know. You don't even know that it's using Kubernetes in the background, and after a couple of seconds your website is online. Something similar is also available for the Drupal case; here again it's more advanced, so you can also choose which Drupal version you want to run, take backups, and see which backups have been taken of your site, so that you're in control.

Behind the scenes, what's actually happening is that the web services portal does nothing other than talk to the Kubernetes API and create a custom resource there. We can see one example here: the spec holds the configuration the user put in, such as the version and the hostname, and at the bottom we can see that the operator has already populated the status of the custom resource. It says when the last backup was taken, when the database was last checked for any migrations that need to be applied, and whether the deployment has the expected number of replicas. This custom resource gets picked up by the Drupal operator, which puts all of these things in place, but all the user sees is basically this.

Now I want to talk a bit more about the infrastructure, how we actually take care of our OKD clusters. Our production clusters are kind of our pets; they are stateful, because we as the administrators only take care of the infrastructure side of things, while the users put in workloads from their side, and we don't have any control over those workloads. If someone creates a deployment, we need to keep running it; we cannot just reinstall the cluster overnight and have their deployment be gone. Additionally, each cluster is completely self-sufficient and isolated: it has a separate OpenStack project, and all of the other resources it uses or interacts with are separate as well. This is very important so that we can reduce the blast radius in case anything goes wrong, gets misconfigured, or goes haywire.

We have an internal tool called okd-ctl that allows us to quickly create clusters, delete them again, and perform other common operations on these clusters. Paired with the ability of OKD to upgrade a cluster in place, from inside the cluster, which is completely seamless and fully
automated, this gives us the opportunity to really spin up clusters as we need them and configure them. All of this is enhanced by the fact that we're using Argo CD. Argo CD is a tool for doing GitOps: you put all of the state that you want to have into a Git repository, and Argo CD will deploy those resources and also continuously ensure that they stay as described. If that is not the case, either because there were manual actions in the cluster or because for some reason the resource state cannot be enforced, for example you want to spin up more machines but you ran out of quota, we also get automatic alerts. It fits really well with the operator-driven cluster management that OKD already has out of the box.

What this allows us to do is fully automated cluster provisioning: we can completely automatically spin up a cluster with all of the dependencies and resources it needs (it takes a while), and then we can also run integration tests on these clusters. This lets us test almost every feature, both from a user perspective, for example "as a user, am I allowed to create a website, am I allowed to create a deployment?", and from the administrator's perspective: are our cluster backups working, are ingress and DNS hostnames being populated properly? This allows us to deploy changes to our clusters frequently and reliably, because we have this extensive suite of integration tests that we can always run on a fresh cluster.

Overall, the deployment looks something like this: at the bottom we have our private OpenStack cloud, on which we put OKD, and then we put Argo CD inside that OKD. Argo CD then manages OKD itself, because you need to configure certain parameters, but it also manages the various other components that we deploy in addition, and this is just a small selection of those components.

One particular focus I would like to put on the Open Policy Agent, which is an admission webhook: you can intercept requests as they come into the Kubernetes API server, which allows you to put glue between different operators and controllers. For example, when a pod request comes in, you can say "this pod should have a particular annotation", or "if a particular label is present while creating this route, then change these and those parameters of the route". It's really nice for gluing different controllers and operators together when they expect particular annotations or labels to be present.

Now, to round up, I would like to talk about what we like about using OpenShift Origin and OKD after so many years. As already mentioned in the beginning, the fact that it's multi-tenant, secure, and highly available out of the box is a great plus for OpenShift. Also the fact that it's so stable, and by that I don't mean that it doesn't have any bugs, but you know that in the Kubernetes ecosystem there's a lot of churn, which in fact we just heard about in Michael and Bridget's talk, with CSI and CCM being moved from in-tree to out-of-tree. OpenShift protects us from a lot of this, or at least makes the transitions much easier. Also, the powerful web UI is widely used by a lot of our users who are not that Kubernetes-savvy; they just want to create a deployment or deploy something from a Git repository, and with a couple of clicks they can easily do that without having to be a YAML engineer. And finally, the fact that it's fully open source is really a game changer, because we can troubleshoot and fix issues ourselves. If we see that a particular component is not behaving as we want it to, we can look at the source code and understand why it's doing what it's doing. But that also means that we can contribute back, and over the last couple of years we have contributed back several fixes for things we discovered that weren't quite working, or where we needed additional features
for our particular deployment.

One takeaway I would also like to share with you: while OpenShift has great, really extensive documentation, if you're offering something like a platform as a service, it's still useful to have your own internal documentation for your users. Our users really appreciate having an easy getting-started point, and when they want to look into more advanced things, we can provide them with external references. Also, writing your own operators can be a great way to alleviate some of the daily churn that you have as an admin, and it's not that difficult thanks to things like the Operator SDK. It was also really worth it to automate the full cluster provisioning, which is not exactly self-evident in an on-premise environment like ours. It took a lot of work while we were moving to OKD 4, but in the end, because we have this fully automated provisioning, we can run excellent integration tests that allow us to catch a lot of bugs very early on.

With that, I would like to thank you for your attention, thank everyone working on developing OpenShift, and give a special shout-out to my fellows from the OKD working group. Thank you very much. Yes, I see questions.

[Audience] Maybe because you're using some dedicated nodes for some of the workloads?

So the question was about us having a lot of nodes in our cluster. Yes, I can just quickly go back to the slide.

[Audience] I also noticed that the density of cores and memory is quite low, but that could be because of the virtualized environment.

Yes, we are running on VMs, and these use cases are really very different. WebEOS, for example, is really lightweight in terms of compute and memory requirements, whereas on the platform as a service we have some relatively big apps running that consume a lot of memory, for example. So it just depends, but our standard is usually VMs with somewhere around 16 to 32 gigabytes of memory.

[Audience] Because if you check, you can see that on average you have 8 to 15 cores per virtual machine, which can be low.

It depends. In general we're not running any really compute-intensive workloads, because CERN also has a dedicated Kubernetes service where you get a dedicated cluster for yourself.

[Audience] So this is just support infrastructure, basically?

No, this is the actual infrastructure, but OpenShift is more for what we call small and medium web applications; we're not running intensive data analysis in this environment.

[Audience] What is then your oversubscription ratio for these resources?

I cannot tell you that off the top of my head, but it would be good to know; we can surely have a chat. There's another question over there.

[Audience] (inaudible question about the integration tests)

Yes, so we are actually using BATS, the Bash Automated Testing System, for that, which also has an integration that makes Kubernetes tests a bit easier. But in the end it's just BATS with some Bash scripts, where we're basically creating Kubernetes resources and making sure that they're in the state that we want them to be in.
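[Editor's note] The tests described in that last answer follow a create-then-poll pattern: create a resource, then repeatedly check the cluster until the resource reaches the desired state or a timeout expires. The team implements this in BATS/Bash; the sketch below illustrates the same pattern in Python under stated assumptions (all names, such as `wait_for_state`, are hypothetical, and the fake state sequence stands in for real Kubernetes API queries).

```python
import time


def wait_for_state(get_state, desired, timeout=60.0, interval=1.0):
    """Poll get_state() until it returns `desired` or the timeout expires.

    Returns True as soon as the desired state is observed, False on timeout.
    In a real integration test, get_state() would query the Kubernetes API,
    e.g. read the `status` field of a custom resource or check whether a
    Deployment has the expected number of ready replicas.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_state() == desired:
            return True
        time.sleep(interval)
    return False


# Example with a fake "cluster" whose website becomes Ready after two polls;
# a real test would first create the resource, then call wait_for_state.
states = iter(["Provisioning", "Provisioning", "Ready"])
result = wait_for_state(lambda: next(states, "Ready"), "Ready",
                        timeout=5.0, interval=0.0)
```

A BATS test would wrap the same loop in a `@test` block and fail the test when the poll returns a timeout.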