We'll start the session, then. Thanks so much for showing up — I'm really happy to see a packed room. Today we'll be talking about something a bit specific: an eventful day we had at CERN about a year ago. My name is Ricardo. Today we'll be talking about how we deleted a significant amount of our production workloads, everything that comes with that, and how we lived through the day. Hopefully this will be a bit entertaining as well. So, welcome to our session — I think that's a good way to describe it. This is also our way of dealing with it, a year later.

Again, I'm a computing engineer at CERN. I work a lot with Kubernetes, containers, and a bit of machine learning as well. I also have a couple of roles in the CNCF: I'm on the TOC, and I also chair the CNCF Research User Group. And I work in the cloud team at CERN, working a lot with Kubernetes, OpenStack and networking.

A few numbers about the infrastructure that we tried to eliminate. We currently have — after the incident we'll describe — more than 400 clusters and almost 3,000 nodes, and quite a few cores and quite a bit of RAM, as you can see: around 15,000 cores and 36 terabytes of RAM. In the first pie chart you see the grouping of clusters per user group: we have a few, let's say, power users that have quite a few clusters. On the second pie chart you see the number of nodes each of them has. You can see that the second user — the colors match between the charts — has quite a few more nodes with fewer clusters, so they have bigger clusters.

About the production Kubernetes service we have at CERN: it's a central API service where users can provision clusters — it's one-click provisioning — and then they can scale them up and down with different kinds of nodes. The purpose of the service is to allow users to create clusters with different flavors and different Kubernetes versions, to select an HA control plane or not, and to pick which add-ons they want: which CNI, which CSI drivers, monitoring enabled or not, what kind of ingress controller, and so on.

In our IT department, but also in other departments across the organization, we have multiple teams consuming the service. Each team runs multiple IT services, and they usually have some production clusters, at least one QA cluster and a few testing clusters. There are also personal environments for users to experiment with new technologies and learn new things.

One of the reasons users or teams have multiple clusters is that they treat our clusters as cattle. They try to implement best practices and avoid single points of failure. Usually they create one cluster, then start adding more, and those clusters typically sit behind a single load balancer, or multiple ones, so they can distribute the load, do blue-green deployments, and roll out new versions on new clusters. As you can see here, for example, they can have clusters on 1.21 and 1.22, and later add 1.23, and so on.
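To make this blue-green cluster pattern concrete, here is a minimal sketch of what such a fleet can look like — the format and every name below are purely illustrative, not our actual service API:

```yaml
# Illustrative only: a team's "fleet" of interchangeable clusters behind
# one load balancer, rolled forward blue-green style by adding a cluster
# on the new Kubernetes version and draining the old one.
loadBalancer: myservice.example.ch   # made-up endpoint
clusters:
  - name: myservice-121    # older cluster on 1.21, being phased out
    version: "1.21"
    weight: 50
  - name: myservice-122    # current production on 1.22
    version: "1.22"
    weight: 50
  - name: myservice-123    # new cluster on 1.23, starts with no traffic
    version: "1.23"
    weight: 0
```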
All right — so, as was said, we have quite a few clusters. We have this model where we offer clusters as a service, a bit like a public cloud provider would do, except we offer it internally. I'll go quickly through some of the use cases we have for this. The important thing is that what we're talking about here is mostly services. There are also some specialized deployments of Kubernetes that are single use case and can be quite large scale; I'll mention one as an example.

The first example of something that runs in this service — where we tried to delete the full thing — is ATLAS, one of the big experiments at CERN on the Large Hadron Collider. It's a big physics detector sitting underground, and they have pretty big numbers: they generate a lot of data. Just to give an idea of the scale, they move something like 50 petabytes of data a week. They generate a lot, but they also move it around, between CERN and other centers and within CERN between different places. The control plane for that, and all the bookkeeping needed to do all of this, is run and managed through the Kubernetes service.

Another one: physicists publish a lot of articles, and they need a nice way to find this data. This is also something we run in this service. There's a team that runs the INSPIRE-HEP and HEPData projects, a nice portal where they also do some machine learning over the papers and allow people to discover nice content to go through for their research topics.

Then there are even the campus services. I've seen a lot of people at the conference who are either still at CERN or were at CERN before. If you haven't been at CERN, these logos probably don't mean much to you, but if you've worked there before, all of this is very familiar. These are the internal services we use for all sorts of things. They're kind of critical, and we also run them in the service. Even if it's just campus management, because it's a large organization it can be quite big: a total of around 400 nodes just for this, across the different services.

All right — as I mentioned, this is the managed Kubernetes we offer on demand. There are also use cases that are much larger scale and more critical, things like running the actual triggers — the event filters — for the experiment I mentioned before; you can see a picture of the detector there. It generates one petabyte per second and needs to reduce that to something like 10 gigabytes per second, a reduction factor of roughly 100,000. Traditionally they've done this with a large CPU cluster, and they're evolving that to use GPUs and other resources. But the big transition they'll make in the next couple of years — we did an evaluation together — is that instead of their traditional way of managing applications there, they want to run a large Kubernetes cluster. The reasons are the flexibility in changing the workloads and in managing them. Just to give an idea of what they're trying to do: they run something like 30,000 applications at any moment while there's beam. When the beam stops, they want to switch to simulation workloads, and they have to do this in one or two minutes. When beam comes back up, they have to relaunch the 30,000 applications, again within a minute. So we did a lot of work, also with SIG Scalability in Kubernetes, to make this scale. We had a test cluster with two and a half thousand nodes, where we produced the numbers you see here and verified that this can work. The production cluster will actually be a single cluster with five thousand nodes, serving the critical part of the ATLAS experiment computing.
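As an aside, one standard Kubernetes mechanism that maps naturally onto this beam-on/beam-off swapping is pod priority and preemption. To be clear, this is just our illustration of the idea, not necessarily how the trigger farm will be built — a minimal sketch:

```yaml
# Illustrative sketch only: preemptible simulation "filler" work versus
# trigger workloads that must start immediately when beam returns.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: trigger-critical
value: 1000000
description: "Beam-on workloads; scheduled first, may preempt lower priorities."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: simulation-filler
value: 100
description: "Beam-off simulation; evicted quickly when trigger pods arrive."
```

Pods pick a class via `spec.priorityClassName`; when high-priority pods are created, the scheduler preempts the low-priority ones to make room, which is one way to get a fleet-wide swap within a minute or two.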
So let's get to the event we're describing today. Spoiler alert: in the end, things went fine — but even seeing it now, it still gets me a bit stressed. It kind of worked, and our colleagues were quite nice to us and very understanding.

The incident was that, by accident, a maintenance tool started to delete our full production. Thankfully it tried to do that one by one, not everything in one go; it actually deleted almost 120 clusters. And thankfully — initially we didn't know this — the result was only degradation and loss of capacity, with no downtime: no production service had downtime, only some testing instances.

The root cause was this maintenance script. We had implemented a tool that we run regularly to clean up orphan clusters. Orphan clusters are Kubernetes clusters that have had their VMs, load balancers and other related compute and storage resources deleted, but still have entries in the database of the service. That makes it cumbersome for us to do operations, do monitoring and keep track of the state of the service, so we regularly clean them up — similar to what a control loop would do, but as an external tool. The resolution was simply to recreate whatever the tool had deleted, but that was quite challenging, as we'll see in the next slides.

Yeah, so I'll start with the timeline of the day. The day started as most days at CERN start, which is over coffee and a lot of discussion. This was early in the morning. Over coffee we usually talk about dark energy, dark matter and the expanding universe — actually, not really. This particular day we were discussing the change we were about to apply. Even though we didn't feel it was a big change, we still reviewed what we were about to do during the day. There were discussions like: okay, everything is fine, let's do it, it's a small change, nothing big, we have tested this, all the return codes and tests are correct, we spent two or three weeks validating it, so everything will go fine. That was before lunch.

So at 2:30 we said, okay, we'll go for lunch, everything is fine. And then we clicked it and we started. We were waiting a bit for the tool to kick in and starting to monitor the logs. A minute or two after the start of the script, a colleague from the identity service came in and said: I think we're hacked — all the identities of the clusters are getting deleted. I wasn't worried: it's fine, we're just doing this to clean up the orphan clusters, it's expected. Then another colleague — the container registry runs in a Kubernetes cluster, we eat our own dog food — said: I don't see the registry clusters, something must be wrong. So now I was a bit worried, and I clicked the abort button to stop all of this. Then I tried to list the set of clusters, because I thought maybe my colleague was looking at the wrong project, or doing something wrong in his environment. It turned out he wasn't, and the next minutes felt a bit like the next slide. It lasted a few minutes in this mode, and then I passed by Ricardo's office.

Yeah — so for the next few minutes after this. This is a very important point in any organization, and CERN takes a lot of care about everyone's safety. This is actually a screenshot from the learning catalog at CERN. Anyone can apply for this training.
It's a first aid and life-saving course. It covers things like how to react in case of external hemorrhage or loss of consciousness, which is pretty much what was about to happen here. It's also very important that the target audience is all personnel working at CERN, which covers IT people too. So this is what we had to do for the next couple of seconds, at least. This is an actual picture of the defibrillators and fire extinguishers you can find at CERN. So this lasted a couple of minutes, then we calmed down and started looking into the issue.

What's important here is the timeline: we started the operation at around 2:35, at 2:39 we detected it, and we got the first report from a user that something was wrong — really wrong — around 2:41. We were also getting alarms and other signals, but we were largely ignoring those while trying to understand exactly what was happening, because there was way too much coming in. Some user reported that the UI for the registry was failing; for me that was the point of: okay, something big is going on.

So we spent the next 45 minutes trying to do an assessment. I understood something had gone really wrong — how do we deal with this? Structuring this was really important. The first thing to see is what's still up and what's not. We immediately identified that one service was down — and actually the only service that was down — and that was the registry. The reason is that we got unlucky, let's say, and deleted all the registry clusters behind it, which ended up bringing down the service. Normally — and we've tested adding new clusters — this wouldn't be a big issue, and we could get the clusters and the service back in a couple of minutes. In this case it wasn't quite like that, and we'll go into a bit more detail on why.

At the same time, we were lucky that the registry was the only service down. One thing that became very clear after this experience is that the registry is completely critical: it's needed to launch new applications. Of course, workloads that were already running and that we didn't delete kept running without a registry. But if you delete users' clusters and ask them to recreate everything, having the registry running is critical. So this was the first priority, and what we did was branch out the team: some people focused on getting the registry back, doing whatever was needed, and the rest of us started doing what was needed on the other side. This incident was way too big for the normal procedure, so we started contacting the different teams at CERN to fully understand the impact on their services. This was really essential.

The COVID era actually helped a lot here, because we had direct communication channels, including Zoom, with many of the service teams. We established this communication and had quick calls with them: what's up, how can we help, is your service down? And what we started realizing is that no one was down. I think there was one case of a service that was down, but most services were saying: we're still up. We're degraded, but we're still up.
So this helped us relax a bit. The reason these services were still up comes back to treating clusters as cattle, and to the dissemination we had been doing internally of best practices around GitOps and automation. I'm showing here our best user: a team that manages multiple services. Actually, I see the person in the room, so even better. This is one of the most well-behaved users, and you can see what the impact of this event was on the team's services. For each of the services they maintain, they have something like eight clusters in total; we deleted six of them, and still the service kept running. For the other teams, in some cases nothing was impacted, but in no case did we have a complete loss of capacity. This is really a shout-out to everyone in the CERN teams for doing the right thing. Contacting everyone took quite a long time, so it was really important to branch out and split the tasks. On the right plot you can see a similar view in number of nodes. Of course the services were not behaving exactly as expected, but it helped quite a bit.

Yeah, so as Ricardo mentioned, the first part was to bring the registry up. We hadn't realized that we had created a circular dependency in the Kubernetes service by introducing the Harbor registry. Before Harbor, we were using the GitLab registry. Then we set up Harbor and advertised it to users, set up rules for immutable tags, vulnerability scanning and so on. And then we thought we should follow the same best practices ourselves and started using it too — but by doing this we created the circular dependency I mentioned. So it was quite critical for our team members who were on that task to bring it up, and they managed to do it quite fast. We were also somewhat lucky that the load balancer was untouched, so the DNS names, certificates, all of that was easy to bring back. This took around 40 minutes, I think, and then users were able to start recreating their clusters.

Yeah — usually this would take maybe five to ten minutes to bring back, but because of the circular dependency and some workarounds we had to apply, it took something like 45 minutes. For the majority of services, as soon as we notified users that everything was fine with the registry and they could move on to adding capacity back, most of them took around 15 minutes, maybe 30 for some. Then at the end of the long tail there were cases that needed more time, because they had special firewall rules not managed centrally, or DNS name propagation that was special, and other manual steps. Which shows that everything involving manual steps was slower: those services had been built that way over a very long time, and in the incident they had to recreate everything very fast.

Afterwards we did a survey to understand what happened, how it was fixed, and how everything came back. GitOps really helped a lot here, and as we can see from the survey, even for cluster provisioning there were quite a few best practices in use by our users.
They were using Terraform, for example. Others wanted to use Terraform, but because of the setup of their clusters and missing features in the provider they were not using it yet — they wanted to. Crossplane is the ultimate goal, but it's not available yet in our environment. Other users had very detailed documentation for their own team on how everything is built and how to recreate it, so even without automation they managed to recreate everything very fast using the existing documentation.

But also on the application side, it was quite a success for implementing best practices: almost everyone was doing GitOps, with Argo CD or Flux v1 — this was one year ago; now a lot of users are also on Flux v2. Other users had Helm charts, but everything was packaged and configured in the values, so it was very easy to reproduce the deployment. That's why they were able to bring the capacity back at such short notice.

Yeah, another thing to highlight here is how diverse the tools we use are, and that's because we have very diverse teams. We have best practices, and for the most critical services we follow the same policy in terms of tools. But at the same time, different people have different requirements and different teams have different kinds of knowledge, so we don't currently enforce the kind of tool they should use.

For the rest of the day — so now we're at the end of the afternoon — we spent the time looking at the tail of the issues. For most services, people realized the problem pretty quickly and reacted quickly. We did notice that some impacted things didn't get an immediate reply, which probably means they're less critical, but we wanted to make sure those were followed up as well. So that was the tail of the work.

As a summary of all of this, there are highlights on both sides. Some things went quite well. The main one is that we had no data loss — that would have been the main problem. We'll talk later about why, but it's also true that we unfortunately don't yet have well-defined policies to guarantee it. The deployments are really well managed; this is something we invested quite a bit of time in, disseminating GitOps processes — not necessarily enforcing a tool, but making sure people are aware that they should be able to redeploy their applications very easily. Another thing that is really good, at least in one way, is that cluster creation has been optimized over the years and is really fast: it takes only a couple of minutes to get a new cluster, which means that even in an event this big we can get the capacity back pretty quickly. Multi-cluster setups, clusters as cattle, and workload splitting do work: even when we hit ourselves really hard, reducing the blast radius for the full service is a big thing, and we ended up with degradation and almost no downtime. And direct communication was also very important.
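To make the GitOps point from the survey concrete, here is a minimal sketch of the kind of Argo CD Application definition these teams rely on — the names and repository URL are made up for illustration:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service            # hypothetical service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://gitlab.example.com/team/deployments.git  # made-up repo
    targetRevision: main
    path: charts/my-service   # Helm chart kept in Git
    helm:
      valueFiles:
        - values-prod.yaml    # all configuration packaged in the values
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service
  syncPolicy:
    automated:
      prune: true             # keep the cluster converged on Git
      selfHeal: true
```

With everything declared like this, pointing a freshly created cluster at the same repository reproduces the full deployment — which is essentially what made the recovery so fast.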
On the what-went-wrong side, the circular dependency was the biggest issue, and the one that caused us the most stress after the initial assessment. Another thing that "went wrong" is that cluster deletion is also optimized: it's very fast and very reliable. I don't know if we should do something about that, to make it less efficient. Then there are the things I mentioned: some changes were manual, like DNS updates, firewall rules, and so on. And where we're slightly lagging behind, apart from the Terraform use cases, is that cluster bootstrapping is a manual process. If we had something that consolidated all of this, those manual steps would largely have been addressed by themselves. Yeah — this is one of the main investments we're making, and we'll talk a bit more about it in the demo.

Finally, the "where we got lucky" part of this postmortem. We got overconfident about what was clearly seen as a small change. We knew the impact could be big, but it looked trivial enough, and well tested enough, to let it go. This has been constantly on our minds over the last few months; we're a lot more careful now, and we've defined criteria for how to handle these changes. The other place we got lucky is that we identified this circular dependency. What happened is that we wanted to consolidate the service, and we ended up running the registry in the service itself — the idea had kind of grown by itself, without necessarily being verified. So what we did afterwards was a detailed analysis of where the other circular dependencies are that we missed over time. That effort is quite important as well.

The final one is that we had no visibility on data persistence or backups. One nice thing in Kubernetes is that you have the possibility to declare snapshots and even generate backups (a small sketch follows just after this postmortem part). This is not something we have in place right now, due to some limitations in the integration we have with the storage systems, and it's an area where we're investing a lot, because it will allow us to look at the clusters, see where the data is, what's backed up and what's not, and trace it to prevent data loss in the future — in this case, we simply didn't have any. Another thing we invested time in, apart from making cluster provisioning more automated, was changing our policy: never, ever delete anything on behalf of a user, even if they ask for it. We should always delegate to them and tell them how to do it. We've blocked ourselves from even being able to do it.
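On that snapshot point: with CSI snapshot support, a snapshot is declared like any other resource. A minimal sketch, assuming a CSI driver with snapshot support and an existing VolumeSnapshotClass — both the class and PVC names here are placeholders, and this is precisely the storage integration we don't have in place yet:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: myservice-data-snap
spec:
  volumeSnapshotClassName: csi-snapclass      # placeholder; depends on the CSI driver
  source:
    persistentVolumeClaimName: myservice-data # the PVC to snapshot
```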
Right, so this will be a very quick demo. I must confess that the initial idea was to do a real live production deletion of a real CERN service, but discussing it with colleagues — involved and not involved in this incident — they said: if this is a therapy session, that's not a good idea, it might just ruin it. So we'll do the second-best thing: I'll try to demonstrate a bit of what we're advertising as automation of cluster management. This is really the part we're missing: the services are well done, and we understand well how to manage them; the cluster part is still a challenge. Ideally we'll get to integrating this with Crossplane very soon as well. First, a simple description of how this works.

In this case, we're relying on Argo CD. We have a bootstrap step, which means we can pop up a cluster, deploy this into it, and we get an Argo CD deployment that also has a secret management service internally and manages Argo CD itself. So that's good. We also deploy Argo Workflows in this case, because there are a lot of manual tasks where you might want to write a CRD and a controller and do proper reconciliation, but that can be quite a lot of work, and not all users will be willing to develop and maintain those. The second-best thing is to offer them a way to integrate with workflows.

Then we have the cluster descriptions. This is the dream: one YAML file — or a couple of YAML files — where all the clusters are defined. You can see here that I have a couple of clusters, and for each cluster we have the configuration, which is basically the version we want to run, some flavors for the nodes and the masters, and the node group definitions. A lot of these clusters are heterogeneous, so we want to create nodes in different availability zones or with different flavors: GPUs, CPUs, things like this. And then something really important is this part here: you can define Argo CD labels, which describe the type of workload this cluster should be running. This allows us to do a kind of matchmaking with the applications without hard-coding the mappings.

You can see the other clusters here. If we look at the existing Argo configuration, we can see the clusters: I have 124, 125, and an extra one that has nothing deployed right now. And you can see all the details about how they're deployed. Again, we don't have the Crossplane integration yet — it will come soon — so the second-best thing is to use a workflow for that.

And then we have the services. If we go here quickly: this is a very simple example of what a service could be — a serving service for a machine learning application. The really nice thing here, in addition to all the things I'm sure those of you who use Argo know about, is that we're using this label selector, so we can match workloads to the clusters that are saying "please give me this type of workload". This is a separation that is very useful in this case (the sketch just below shows the shape of it). And I think that's it for the description.
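Since this label matchmaking is the key piece of the demo, here is a minimal sketch of what it looks like as an Argo CD ApplicationSet using the cluster generator — the application name and repository are made up, but the selector mechanism is the documented one:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: ml-serving
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            ml: "true"             # only clusters labelled for ML workloads
  template:
    metadata:
      name: 'ml-serving-{{name}}'  # one Application per matching cluster
    spec:
      project: default
      source:
        repoURL: https://gitlab.example.com/ml/serving.git  # made-up repo
        targetRevision: main
        path: deploy
      destination:
        server: '{{server}}'       # filled in by the cluster generator
        namespace: ml-serving
      syncPolicy:
        automated: {}
```

Any cluster registered in Argo CD with the label `ml: "true"` automatically gets its own Application instance, so adding the label to one more cluster — or registering a new labelled cluster — is all it takes, which is exactly what the demo does next.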
So, what I'll do here: if I look at the applications and select a specific project, we see we have two instances of this ML service application, in two different clusters. This is what we want to provide. So we'll do two things. The first is to add the same label — this ml: true, which means "deploy ML applications here" — to an additional cluster. And at the same time, I'll also deploy a new cluster. Let's call it 10. I'll keep the same configuration to speed things up, and I'll say that I also need ML in this cluster.

So now I'll quickly add this and push it. And if I go back to my clusters and do a refresh, you can see it's starting to launch these new workflows. There are two: one for the previous cluster, because it needs to update the labels, and a second one deploying the new cluster — the 010 cluster I just showed. In the new one we see the first pods starting. Looking quickly at the logs, we can see the cluster creation is already in progress. This takes a couple of minutes, and because we want to leave time for questions, what I'll show you quickly are the changes being applied in the other cluster, where it's basically applying a bunch of resizes; I'll come back to it when it registers as well. The way the workflow works is: it tries to create the cluster — if the cluster already exists, it just ignores it — and then it goes through all the node groups, which is what it's doing here for each node group. In this case it's trying to resize the cluster and then applying all the changes. Again, ideally we'd have this done with proper reconciliation so it's applied faster; workflows are the second-best thing, and honestly there are a lot of internal processes where it's much easier to offer workflows and this flexibility to users.

While we wait, I'll just highlight this from the Argo documentation: when you start doing ApplicationSets, which are the key here, there are different generators. One of them is the cluster generator, and this is where the magic happens: you define your ApplicationSet and then have the labels saying where the application should be mapped.

Let me see if this will be fast enough to actually appear. Yep, so this is the one — it's about to change the labels. It did the update, ml: true. So if we go back to applications — I need to turn off the CERN VPN, sorry about that — we see an additional ML application there. If I filter on ml, instead of two we now also have the application running in the new cluster, fully up and running, and it took just a couple of minutes. This is really what we're trying to advertise to users, and to convince them it's useful for their own sake as well. So yeah, it worked in the end.

All right, I think that's pretty much what we have, and we have a few minutes for questions — happy to answer. Just ending here with another quote from one of our nice users, who said that in the end it was a chaos monkey test, and we kind of passed it. So that's really nice. Any questions?

Yeah, I have a question. I imagine you had persistent volumes in the clusters, and I know that, for example, with Ceph you have a reference to the image ID. But if you lose the cluster, how did you recover that image ID, to recreate the persistent volume connected to the right Ceph image?

Right — so for Ceph there are two steps. One is the provisioning of the volumes; the second is the attaching of the volumes.
In most cases people use dynamic provisioning, but then they reuse the provisioned volumes, meaning you can actually mount the same data in multiple clusters. If you're doing RBD, this is much more complicated. That's actually why the usage of CephFS at CERN for Kubernetes is much larger than the usage of RBD itself: you can have this multi-attach much more easily.

Okay, so basically you didn't have these RBD images, right?

There were cases, but you can also have multi-attach with RBD — it's just done differently, more like an active-passive mode rather than true multi-attach. Okay. Yeah, thank you.

Quick question: how did you end up solving your circular dependency on Harbor? What's the solution you chose?

That's a very good question. Should I answer? So, yeah — the solution was kind of obvious: you keep a second copy of the images outside, in another registry. That's the solution in the end. Yeah, and the service itself allows us to specify the registry the core images should come from, so we can flip that: when you launch a cluster, you can choose which backend registry should be used for the core images. So right now we have the main registry, a replica on premises, and a replica of the core images in an external registry outside CERN as well, for disaster recovery. Thanks, makes sense. Yeah, it's a good question.
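To illustrate the volume-reuse point from the answer: a pre-provisioned volume can be declared statically as a PersistentVolume pointing at an existing CephFS share, then claimed from any cluster. A rough sketch — the driver name is the standard ceph-csi one, but treat the volumeAttributes and secret reference as placeholders, since the exact keys depend on the ceph-csi version and deployment:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: myservice-data
spec:
  accessModes: [ReadWriteMany]      # CephFS allows multi-attach, unlike plain RBD
  capacity:
    storage: 100Gi
  csi:
    driver: cephfs.csi.ceph.com
    volumeHandle: myservice-data    # any unique ID for a statically managed volume
    volumeAttributes:               # placeholders; exact keys depend on ceph-csi
      fsName: cephfs
      staticVolume: "true"
      rootPath: /volumes/myservice-data
    nodeStageSecretRef:
      name: cephfs-secret           # made-up secret holding Ceph credentials
      namespace: ceph
  persistentVolumeReclaimPolicy: Retain   # never delete the data with the claim
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myservice-data
spec:
  accessModes: [ReadWriteMany]
  resources:
    requests:
      storage: 100Gi
  volumeName: myservice-data   # bind directly to the pre-provisioned PV
  storageClassName: ""         # disable dynamic provisioning for this claim
```

The ReadWriteMany access mode is what makes the CephFS path so much simpler than RBD for surviving a cluster loss: the data sits outside the cluster and a new cluster just claims it again.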