Hello, everyone. First of all, thank you all for joining today's session, "How CERN Developers Benefit from Kubernetes and the CNCF Landscape". My name is Antonio Nappi. I've been a computing engineer at CERN since 2015, and I work daily with Kubernetes and Java. As you can probably tell from my accent, I'm Italian, in particular from the place where pizza was born, even though I've promised myself that I will try the Chicago one.

For the people who don't know what CERN is: CERN is the European Organization for Nuclear Research. What we do is study fundamental particles and how they interact, in order to understand the fundamental laws of nature. To do that, we built the largest particle accelerator in the world. We call it the LHC, the Large Hadron Collider: a 27-kilometre ring underground, on average around 100 metres deep. CERN is also the place where the World Wide Web was invented. And we like to say that it's a place where we do science for peace, because people from different countries, different cultural backgrounds, and different religions all work together in a respectful environment. What we don't do is make black holes, whatever you may have seen in movies or YouTube videos.

As for the agenda today: I'm first going to say what I do at CERN, what our services looked like when they ran on VMs, what our challenges were, what the architecture is now, and all the techniques we have adopted since we moved to Kubernetes.

So, what do I do at CERN? I'm in a team that is in charge of hosting critical Java applications for CERN's daily life, in different fields: finance, administration, engineering. We also host the single sign-on infrastructure for the whole of CERN, which is based on Keycloak. So our users are mostly developers from different departments, as I said, plus identity and access management engineers. We run more than 80 applications, mostly Java, with more than 3,000 pods, more than 400 Kubernetes nodes, and around 35 or 36 clusters.

How was our life when we were running these services on top of VMs? Everything was working; there was no real reason or motivation, nothing breaking, that forced us to move to Kubernetes. But I can easily explain why we decided to move anyway. We were wasting a lot of time on repetitive and easy tasks, on upgrading and provisioning new infrastructure, and on maintaining the custom scripts and Puppet code that automated all our operations. This was taking so much time that we could not do anything else. One important aspect for us is that we are the platform team in this environment and we work with a lot of developers, but the sizes of the teams are very different, and the only way to survive is through automation. Another really important aspect is that the infrastructure was completely hidden from developers. They had no access to machines, pods, or anything else, and they also had no power to customize the running environments.

This is a bit of the timeline. We started to look at Kubernetes at the end of 2015, let's say in our free time. Then we understood the potential and decided to build a production service on top of it, and in 2020 we moved all our production services onto Kubernetes. That was not the end of the journey, but actually the beginning of a new one, because we started to adopt a lot of different technologies: we went through a lot of migrations, the adoption of GitOps, and so on.
Now I want to show you a bit of the current architecture running on Kubernetes. This is a simplified version. You can see that we have one Kubernetes master cluster where we run all the tools used to manage the rest of the infrastructure: Argo CD, Argo Workflows, Logstash, OpenSearch, and Prometheus. Then we have a set of clusters where we actually run the applications of the developers. As you see, clusters are the building block of our infrastructure, and we have a lot of them.

As for how we deploy them: at CERN we don't have a public cloud; we use a private cloud based on OpenStack, and there is a component called Magnum that provides Kubernetes as a service. So we deploy a Kubernetes cluster using this service, which is provided by another team at CERN. All our Kubernetes clusters are Terraform files inside a Git repository, and each time we want to create, update, or delete a cluster, we just do a commit, and a Terraform pipeline creates or updates the cluster.

We also spent a lot of time evaluating alternatives, because this approach is Git-centric, yes, but still not as nice as we would like. For example, we looked at Crossplane, but the main issue was that there was no provider available for OpenStack. I remember I also gave Terrajet a try in order to generate one, but it didn't really work. And there is Cluster API, which actually works, but it is really focused only on Kubernetes clusters. Why were we looking for a more generic solution? Because we still have some VMs to manage, and we like having a technology like Terraform that can manage both Kubernetes clusters and VMs.

One of the principles we follow for the clusters is clusters as cattle. As I said, each application is deployed in multiple clusters, in multiple availability zones, and then we put a cluster of load balancers in front of them to redirect traffic. Why do we use this paradigm? First, each cluster is just an entity that we can easily replace with another one. Users are isolated: developer teams have their own set of clusters, so we increase security by isolating users from each other. And it is extremely resilient. We actually experienced this in the past, when the team in charge of the Magnum service accidentally killed more than half of all our production clusters. I think we had around 400 nodes, and 250 nodes were gone. There was no downtime; nobody noticed anything, because two out of three production clusters were killed, but one survived. So for a few hours we were basically in a degraded mode, but still up and running. If we had been using just one cluster, it would have been a matter of luck; we could simply have gone down.

But having so many clusters also brings a maintenance overhead, because you need to upgrade them all, and upgrading around 40 clusters is a mess. That's why we also started to look at virtual clusters, with vCluster, to consolidate some of those clusters into virtual ones while keeping the same level of security, by isolating users through virtual clusters.
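To make the cluster provisioning I described a bit more concrete, here is a minimal sketch of what a "clusters as Terraform files" pipeline can look like. This is illustrative only, assuming a GitLab-style CI file; the stage layout and image are hypothetical, but the idea is what I described: every commit produces a plan, and only the main branch applies it.

```yaml
# .gitlab-ci.yml -- hypothetical sketch of a Terraform pipeline for the
# cluster lifecycle: a commit triggers a plan, merging to master applies it.
stages: [validate, plan, apply]

default:
  image:
    name: hashicorp/terraform:1.5
    entrypoint: [""]            # the image's default entrypoint is `terraform`

validate:
  stage: validate
  script:
    - terraform init -input=false
    - terraform validate

plan:
  stage: plan
  script:
    - terraform init -input=false
    - terraform plan -out=tfplan
  artifacts:
    paths: [tfplan]             # hand the plan over to the apply job

apply:
  stage: apply
  script:
    - terraform init -input=false
    - terraform apply -input=false tfplan
  rules:
    - if: $CI_COMMIT_BRANCH == "master"   # only the main branch mutates clusters
```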
One of the important aspects of this architecture is application deployment, and here we differentiate the old way and the new way of deploying. In the past, we used to have a JSON file that described all our infrastructure: Kubernetes clusters, applications, proxies, and the relationships between them. This was the only source of truth for us, and we had custom scripts that took the information from this JSON and translated it into Kubernetes resources. The developers would just go to Argo Workflows, run a job, and deploy the application. The problem was that every time there was a change in the configuration, we had to trigger this pipeline again, with a restart of the application, bringing some downtime; this was somewhat mitigated by the fact that applications were spread across different clusters.

So we started with our users a new process, a new way to let them deploy, based on GitOps. The difference is that our source of truth is no longer our JSON but a Git repository, where they define the Helm charts and the Kubernetes resources they have. Argo CD is in charge of continuously reconciling what is in Git with what is in the cluster, and if there is an update to a small, minor part of the configuration, it does not require a restart of the application. This way we empower the users; it gives them a bit more flexibility and us a bit more time, but also a lot of new challenges that I will explain a bit later.

This is how the monitoring infrastructure looks. You can see there are different layers. The bottom layer is all the Kubernetes clusters hosting the Java applications; each has a Prometheus that is volatile, so stateless: it gathers the metrics of all the Java applications deployed there and publishes them. Those are then federated by another layer, the middle one, which we call the central Prometheus cluster, one per developer team. This one actually has state, because all the metrics are stored in Mimir. All the configuration of Prometheus and Alertmanager at this level is again managed via Git: developers can commit recording rules and alerting rules, and these then get configured on the Kubernetes clusters. The last layer is our own Prometheus, which federates all the others. It's just a way for us to have an overall view of all the metrics and to set alerts, so as soon as there is an issue somewhere, we are aware of it.

The last piece, the logging system, is pretty simple. Each Apache Tomcat pod has a Fluentd sidecar container that ships logs to Logstash, and Logstash sends them on to OpenSearch. Again, the configuration is fully managed via Git and Argo CD.
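Since everything here comes back to Argo CD reconciling Git with the clusters, let me make the new way of deploying concrete. This is roughly what one of those Argo CD Application resources looks like; the repository URL, path, and namespace are made up for illustration:

```yaml
# Sketch of an Argo CD Application: Argo CD continuously reconciles what is
# in the Git repository with what is running in the cluster.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-java-app            # hypothetical application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://gitlab.example.ch/team/my-java-app.git  # hypothetical repo
    targetRevision: master     # we work with a single branch
    path: chart                # the Helm chart the developers define
  destination:
    server: https://kubernetes.default.svc
    namespace: my-java-app
  syncPolicy:
    automated:
      prune: true              # remove resources that disappear from Git
      selfHeal: true           # revert manual drift in the cluster
```

With `automated.selfHeal`, a change committed to Git reaches the cluster without the full redeploy the old JSON pipeline required.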
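And for the monitoring layers I described, the federation between them is plain Prometheus federation. Here is a sketch of the scrape job a central, per-team Prometheus could use to pull from the per-cluster instances; the hostnames and the match[] selector are hypothetical:

```yaml
# Sketch of a federation job on a central Prometheus that pulls the series
# already collected by the per-cluster, stateless Prometheus instances.
scrape_configs:
  - job_name: federate-clusters
    honor_labels: true          # keep the labels set by the leaf Prometheus
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~"tomcat.*"}'   # hypothetical selector for the application series
    static_configs:
      - targets:                # hypothetical per-cluster Prometheus endpoints
          - prometheus.cluster-a.example.ch
          - prometheus.cluster-b.example.ch
```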
One central pillar of our infrastructure is GitOps: we really manage all our infrastructure via Git repositories, via commits. What is GitOps? GitOps is an approach where Git is used as the source of truth for everything. Everything is declarative, applications and infrastructure alike; it is all defined in Git as YAML. There is nothing completely new in that compared to the past; the main difference is the concept of continuous reconciliation. In the past, you saw this with our Terraform: you pushed a commit and the cluster was created, but there was never any sync phase between what was running and what was in Git. With a GitOps controller, what is in Git is constantly checked and synchronized with what is in the cluster.

Why did we start to adopt this? Because we get all the advantages of Git: tracking and versioning. It's really easy to roll back; you just revert the commit and you are back in the previous situation. It also puts in place a formalized review process. When you commit something, it goes through a pipeline that validates the change, and then, depending on the criticality of that piece of infrastructure, a human actually validates things again and accepts the merge request or not.

This is our current infrastructure. We have one Argo CD that manages everything: our monitoring and logging infrastructure, the cron jobs, the infrastructure itself, and also the other Argo CD instances. Those instances are used by our developers to deploy changes, for example recording and alerting rules, or to update their applications that now run via GitOps.

Now, there are a lot of nice books about GitOps and its concepts, but there are no real golden rules on how to implement it. We started this back in 2020, when the situation was even worse than today, and I struggled a lot to really understand it and to find an optimal solution, at least for us. What I understood is that there are no golden rules; it usually depends on what you run and how you run it. For example, one thing we did was adopt a naming convention. We have tons of repositories, distinguished by suffix. There are the repositories that gather sources, in the form of Helm charts, Kustomize, pure YAML, or Jsonnet; and then the repositories we suffix "-application", which only contain Argo CD Application resources. There we have an ApplicationSet, and we also implement a lot of the app-of-apps concept that comes with Argo CD; I'll show a sketch of this in a moment.

As you can see, we use the approach of having multiple repositories instead of one, because this way we can isolate use cases and users. The problem is that when you have so many repositories, it's really hard to follow which repository does what, in particular if you don't work with these things every day. Another thing we decided was to use a single branch instead of multiple ones, because it simplifies things: you just commit a change to one branch and it gets propagated everywhere. The bad side is that if you merge something wrong, you can easily kill your whole infrastructure. It happened once to me: I was playing with the Prometheus sources and accidentally merged to master something that was not supposed to go there, and I killed the Prometheus setup across all the clusters. It took me a while to figure it out, and even to realize I had done it, because bringing down my monitoring infrastructure also brought down the alerting, so I could not actually see it. But rolling back brought everything back.
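Here is what I mean by the ApplicationSet in a "-application" repository: one template stamped out per cluster, which is also how the same application ends up on several clusters in the cattle model. The cluster names and repository URL are hypothetical; a minimal sketch:

```yaml
# Sketch of an ApplicationSet: the list generator produces one Argo CD
# Application per cluster from a single template.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-java-app
  namespace: argocd
spec:
  generators:
    - list:
        elements:                       # hypothetical clusters
          - cluster: cluster-a
            url: https://cluster-a.example.ch:6443
          - cluster: cluster-b
            url: https://cluster-b.example.ch:6443
  template:
    metadata:
      name: 'my-java-app-{{cluster}}'
    spec:
      project: default
      source:
        repoURL: https://gitlab.example.ch/team/my-java-app.git
        targetRevision: master          # the single-branch model
        path: chart
      destination:
        server: '{{url}}'
        namespace: my-java-app
      syncPolicy:
        automated:
          selfHeal: true
```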
GitOps is really nice; it gives us a lot of flexibility and helps us automate a lot, but it also brings a lot of challenges. One of the biggest ones for us is secret management. There are mostly two philosophies. The first: you store your secrets in Git, encrypted. This is nice because you keep the secrets together with your deployment, but it requires a lot of maintenance. First, you need to take care of key rotation: every time you update the private key, you need to re-encrypt all the secrets in all the repositories. It may also require additional components in the infrastructure; if you think about sealed secrets, it requires components in the Kubernetes cluster that maybe you don't want to install, because they also consume resources. And a commit: if for any reason you have the same secret in different repositories, you have to do a commit on all of them, and you can see this does not really scale well.

The other solution is to have an external store, so that in Git you only have a placeholder pointing to the secret. For us, at least, this is good because we delegate these operations to another team: they take care of key rotation, backups of the secrets, and so on. You change a password in only one place: you update the value in the external store, and Argo CD will automatically update the secret in all the clusters. But it creates an external dependency. At runtime this is not a big deal, because once everything is deployed, it's fine; but in case of disaster recovery this external store becomes a big dependency that has to be available in order to recover.

Our solution was to go for the second option. We have an external store that is CERN-specific, that we built ourselves, called Teigi. The way we talk to it is through the Argo CD Vault plugin, an open source tool widely used in the Argo CD community, for which we implemented our own code to work with our external store. But since this is a kind of ad hoc solution that we use at CERN, we want to move to something more standard. That's why we started planning to move everything to Vault. We were actually almost done, everything was ready, but then there was a license change, so we are still evaluating what the impact is and will decide based on that. In general, what I see is that the CNCF landscape is really missing a tool that does secret management.

Another challenge is security. As I said, in the old model developers didn't have many permissions; they could not do anything. Now they have access to the infrastructure: they can customize Kubernetes objects, define their own images, download whatever they want from the internet, and potentially create a security breach in our Kubernetes clusters. So we have to find the balance between improving their user experience and making sure that everything stays secure and under control. The way we started to mitigate this was first to sit down with the developers and agree on some policies: look, we are fine with you providing Docker images, but use our base images, don't run as root, use a private registry, and things like that. Of course, talking is not enough; you need to enforce those policies. That's why we started to implement them with OPA, in particular with Gatekeeper, which is its integration for Kubernetes. The issue with that is mostly the syntax: this funny Rego syntax is a bit of a nightmare sometimes. That's why we also started to look at Kyverno, because it's pure YAML and specific to Kubernetes. And in general, in the context of security, as I said before, the clusters-as-cattle paradigm helps a lot, because we isolate users by cluster.
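To give a flavour of why the Kyverno approach appeals to us: a policy like "don't run as root" is just YAML. This sketch is illustrative, not our exact production policy:

```yaml
# Sketch of a Kyverno ClusterPolicy enforcing one of the policies we agreed
# on with developers: containers must not run as root.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-run-as-non-root
spec:
  validationFailureAction: Enforce   # reject non-compliant pods outright
  rules:
    - name: check-run-as-non-root
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "Containers must set runAsNonRoot: true."
        pattern:
          spec:
            securityContext:
              runAsNonRoot: true
```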
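And going back to secret management: with the placeholder approach, the manifest committed to Git contains no secret material at all. It looks something like the following, using the generic argocd-vault-plugin placeholder convention (in our case the backend is our own Teigi code); the path and key here are made up:

```yaml
# Sketch of the placeholder approach: the Secret committed to Git holds only
# a reference, resolved against the external store when Argo CD renders it.
apiVersion: v1
kind: Secret
metadata:
  name: my-java-app-db
  annotations:
    avp.kubernetes.io/path: "secret/data/team/my-java-app"  # hypothetical path
type: Opaque
stringData:
  password: <db-password>   # replaced by the plugin; never stored in Git
```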
Another challenge of GitOps is that we completely lose the full picture. As I said before, you can easily get confused: you don't know which repository defines what. And if new people join the team and have to, I don't know, even just run a pod, they get completely lost: which repository should I commit to? Where should I go to do this? The way we are trying to mitigate this is, first, the naming convention, and labelling all the Kubernetes objects so we know where they come from. But also, since we have only one Argo CD managing everything, we would like to query its API, extract all the information, and put it in a form that is easily queryable, via JSON or jq scripts, for example.

Now, to finish: what are our feelings after a few years in production? Operations: before Kubernetes, we were really busy with a lot of simple, repetitive tasks. After Kubernetes, we increased our efficiency and productivity a lot, and this gave us the possibility to focus on other projects. For example, at the start I said that we have historically run this hosting platform for Java applications, but recently we also took over the Keycloak infrastructure. Because we had the Kubernetes experience, and our colleagues from Keycloak just wanted to deploy on Kubernetes, they asked us to take it over; this was only possible because we were less busy than in the past. In the past, creating a new infrastructure environment for a new application took days, sometimes even a week, to configure everything. Now it really takes a few hours.

Another big advantage: in the past, applications would just get stuck. Not crashing, just stuck for whatever reason, and the load balancers would keep redirecting end users to those clusters, and only human intervention could fix the issue. With Kubernetes liveness and readiness probes we implemented health checks: pods that are stuck or not working get cut off from traffic and then killed by the liveness probe. This way we get a completely fresh environment that just works, and it is now completely transparent for end users.
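Those probes are ordinary Kubernetes health checks on the Tomcat containers. A minimal sketch, with a hypothetical health endpoint, port, and image:

```yaml
# Sketch of the probes that replaced manual intervention for stuck apps:
# the readiness probe cuts a stuck pod off from traffic, the liveness
# probe eventually restarts it.
apiVersion: v1
kind: Pod
metadata:
  name: my-java-app
spec:
  containers:
    - name: tomcat
      image: registry.example.ch/base/tomcat:9   # hypothetical base image
      ports:
        - containerPort: 8080
      readinessProbe:              # stop routing traffic when this fails
        httpGet:
          path: /health            # hypothetical health endpoint
          port: 8080
        periodSeconds: 10
      livenessProbe:               # restart the container if it stays stuck
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 60
        failureThreshold: 3
```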
Configuration tracking: in the past, even though we had everything in Puppet and automated scripts, there was always someone who connected to a VM and made some changes by hand to fix things. Then, two years later, you replaced the VM with a new OS and things stopped working. Why? Because someone had made those changes manually. With Kubernetes this is not possible: first of all because containers are immutable, and we always know which Docker images are running, and with which tag.

Automation: we were doing a lot of automation before as well, but with a lot of custom scripts that we implemented ourselves. Now we have adopted multiple CNCF tools with a wide community around them, so if we have an issue we can ask and get help; that was not the case with our custom scripts. At the same time, we can be part of that community and contribute our own work.

User flexibility in the past was basically minimal: developers didn't have any power. Nowadays it's the opposite, and we delegate simple tasks to them, like defining alerting rules in Prometheus. Why should they open a ticket asking us to add a new alerting rule when they can do it themselves? Of course, as I said before, this comes with some drawbacks in terms of security.

And then disaster recovery and business continuity. I mentioned before the incident where 250 of our nodes were killed. If I think of the VM world, with 250 VMs killed from one day to the next, it would have taken weeks, I think, to recover everything. In our case it took literally a couple of hours to bring everything back and be in a production state again.

Now, the takeaways. We are extremely happy with our journey, even though it's not finished and is actually always evolving. Sometimes it's really hard to keep up with all the new tools and everything happening around them: one day you use Prometheus, and the day after there is OpenTelemetry, so it's pretty hard. We can say the service is more reliable than in the past, and this is not coming from us but from the developers. I can say that Kubernetes helped us a lot to increase the team's productivity and efficiency: we could shift our attention to other tasks and implement new things. We also replaced a lot of ad hoc solutions with standard approaches. This makes it much easier to bring in new people: before, people were not keen to work with us, maybe because we were using some fancy Perl script from 20 years ago, while now we use a lot of new technologies, so we can easily find people who are experts in these fields. And it doesn't make any sense to reinvent the wheel: when there is such a big community that has more or less the same issues we have, why not benefit from the work of all of them?

Of course, this migration, this journey to Kubernetes and GitOps, was not easy at all. First of all, the documentation. I've been using a lot of tools from the CNCF landscape, and one common factor is that documentation is always the last thing to be updated. A lot of the time I really had to go to the source code of a tool to check whether I was doing something wrong or it was actually a bug in the tool, and this is not optimal. Plus, between versions of those tools there were a lot of breaking changes. For example in Argo CD, at least as I remember it, between versions 1.8 and 2.3 it was a nightmare: every upgrade was basically a breaking change that took a lot of time to migrate applications from one version to another. Now it has become more mature and much better; lately I can update versions without problems, without breaking anything. But in the past this happened quite often, and I mention Argo CD because it's one of the tools I use most, but it happened with other tools as well.

And last but not least, we had to spend a lot of effort convincing first the developers, but also other colleagues, that the direction we took was the right one. When you have something that is running and working, even though it's time-consuming and not extremely efficient, people ask you: why should I change something that is running? Why should I risk breaking everything in order to innovate? One of the sentences I found online really captures this: a lot of people want progress, but they don't like change. And this is what we face every day, still now. I see a lot of colleagues who, when you mention that you run your production services on Kubernetes, just start to yell: are you crazy? You're running your services in production on Kubernetes? So this is kind of a fight, every day.
And since we had to convince those people, we tried to do all the changes of these years in phases. First we moved to Kubernetes, but keeping the same technology stack from the VMs, and this was a nightmare. Then we started to move to a new technology stack, because we got feedback from the developers: OK, we are on Kubernetes, everything looks good, you can go ahead. So we updated the technology stack, and they started to see its benefits, with Prometheus and so on. And then we moved again, to GitOps. Doing all of this in phases made our work extremely harder and slower. If I could go back, maybe I would do it in one shot, but I believe that would not really be possible, because there is also a lot of human interaction, a lot of convincing people that you are doing the right thing; and as soon as you do something bad, they come back to you asking why you did that, when before everything was working. This, I think, is the biggest challenge in this world today, for us at least.

And that's it from my side. Thank you all for your attention, and if you have any questions, please. Any questions? Yes, please.

Not really. The question was whether, when we containerized the applications, we saw any performance issues. We have never seen any performance issue, also because these applications are web applications, so they are not really critical in terms of performance; we don't do heavy I/O or things like that. So there was no difference at all.

The question is whether these clusters are for the accelerator controls. No, because we are focused on the IT side; we work with developers. But I think the accelerator sector recently started to use Kubernetes clusters; well, they are planning to do it. We will probably collaborate and try to help, because we have this expertise, and I know that someone in my group is already working with them, so this is happening. We are getting more and more experiments and accelerator people who want to run things on Kubernetes.

The question is which Kubernetes version we are using. Right now we are on 1.25. The thing is that we are bound to the version that is provided with Magnum, so it doesn't depend on us. Recently the team provided us with 1.27, but we need a bit of time to plan the upgrade, because it will take a while for all the clusters: every time we upgrade the clusters, we also upgrade all the components, so Argo CD, Prometheus, and we need to validate everything.

Thank you all for joining, and for your attention and your time. Have a good day. Thank you.