How's everybody today? This is the last talk of the day, so thank you for coming. This is "Bring Back Production from Scratch in Under One Hour Using kops, Argo CD, and Velero." My name is Andre Marcello Tanner. A little bit about me: I'm a staff DevOps engineer, which really means "distinguished YAML engineer." You can find me on these Slack groups. I have taken the CKA and CKS exams. How many people here have taken these exams? All right, awesome. For anybody who hasn't, I'd really recommend them. They're practical exams that test you, under pressure, on how to repair Kubernetes clusters, and that will become relevant in the coming slides. Also, I'm originally from the Philippines, and I now live and work in Canada.

I work for a company named Ada. We're an AI-powered customer service automation company; we've been around since 2016. We help enterprises resolve their customer service inquiries in any language or channel: WhatsApp, Facebook, text, email, browser. Ada has automated more than four billion customer conversations for companies like Meta, Verizon, AirAsia, Yeti, and Square, and we serve companies and their customers across 85 countries.

Why is that relevant? We've been running Kubernetes in production for almost six years, and I've been there for four of them, so I've seen a lot, and a lot has changed. About a year ago, we had an incident that gave birth to this talk.

Picture this. It's the end of the day. You're on call. You're about to go home, log off, turn off your laptop, and suddenly one of the teams contacts you: "Hey, we've got a problem. We're trying to deploy our service to this one production cluster, and it's not working." So I give my typical reply: "Have you tried running it again?" They do, and they come back: "It's not working." So then I say, "Well, have you opened a ticket?" Because we don't do work without Jira tickets, right? So I started investigating.
I go into the cluster and try to see what's going on, and the pods are not coming up; they're Pending. It seems like it's been this way for a few days, which is strange. The current deployment is rather old, but it's still running — it's Kubernetes, it's highly available. I look into our autoscaling infrastructure: OK, it's bringing up new nodes, that's working. I check those nodes in our cloud infrastructure. They come up... and then they go away. They're not connecting to the cluster. That's strange. It's not supposed to do that. So I reach for our cluster management tool to figure out what's going on. I'm going to talk about each of these tools — what they are, how we use them, and why they matter here.

Our cluster management tool at the time was kops, which stands for Kubernetes Operations. It's its own Kubernetes distribution: a tool that lets you create a production cluster in a single command on a public cloud — AWS, GCP, Azure, and more — and get it up and running. We drove it with declarative configuration; it has a YAML syntax similar to Helm and other tools out there. We'd been using it for years, with lots of playbooks and experience. We create all our production clusters with it: we have a source of truth, and we create our different clusters from copies of it. There are a lot of commands, but the main ones are kops create cluster and kops update cluster.

So I run the kops update cluster command, which doesn't update things right away — it tells me whether everything is as expected compared to my source of truth, the declarative file. I run it, and it says it wants to make all these changes. Something doesn't seem right. That's strange. So I get another engineer on board, and we debug it together: what's going on? Why does it want to change all these things?
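For anyone who hasn't used kops, the workflow in question is its dry-run-then-apply loop. This is a sketch; the bucket and cluster names are hypothetical, and it assumes an S3 state store:

```shell
# kops reads its source of truth from a "state store" in object storage.
export KOPS_STATE_STORE=s3://example-kops-state-store

# Without --yes, this only previews the diff between the live cluster
# and the declarative spec in the state store; nothing is changed:
kops update cluster --name prod.example.com

# Apply for real only once the preview looks sane:
kops update cluster --name prod.example.com --yes
```

On a healthy cluster that matches its spec, the preview step reports no changes needed — which is why a preview full of unexpected changes was a red flag.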
We run it on our other production clusters just to make sure it's not just us. And it is only this cluster. Hmm. The other clusters don't want to change anything, but this specific cluster wants to change certain files and recreate some cloud resources. So we debug some more. We need to fix it, because if new nodes can't join your cluster, eventually a node goes away and a service running on it can never come back. We need to get it running.

So we made a decision: OK, let's go ahead and run the update. We've done this a lot of times before; it should just work. Worst that could happen? Maybe it breaks things, but we can get it back running, right? That's not going to happen. So I run kops update cluster with the apply flag, and it asks me: do you want to apply these changes, yes or no? I type in "yes." Enter.

The worst thing happened. PagerDuty alert. PagerDuty alert. I'm already on call, so we start our incident process. We have a process at our company where we get on a call; there's an incident commander, there's communications, there are different people talking to customers. We start the process, and I'm like, "Yeah, I'm looking into it. I caused this. It's me. But we're trying to figure it out."

So we keep debugging, and now we're under pressure. Remember those exams? They actually come in handy. We look at what's going on, and suddenly it's like: OK, our networking layer — we're using Calico — is throwing an error. CoreDNS has another error; it's not coming up. That doesn't make sense. It was working a while ago; why does it not work now? Maybe let's try running it again. So we run kops update cluster. It's supposed to work, right? Well, it usually works. Nothing changes. So we're like, OK. What do we do?
Well, thankfully we have our disaster recovery process. How many people here run disaster recovery planning on their Kubernetes environments? No one? OK, maybe I'm doing something wrong. So we have disaster recovery plans in case things happen. We run them once or twice a year with a certain scenario, and we use our tools to bring back our clusters when they go down. So we pull out the guide: what do we do?

This is where Velero comes in. Velero is backup and restore for the state of Kubernetes, backed by VMware. It can regularly back up your cluster on a schedule, including persistent volumes — EBS volumes or whatever the volume type is in your cloud — and then you can restore that to another cluster. We had used it during our disaster recovery testing to migrate clusters and to bring up new ones. It's basically like running kubectl get on all the YAML and storing it somewhere; it stores it in an object store.

So this should work, right? We take the Velero backup from before things went wrong and we start to restore it. It restores pretty quickly. Some things start working, but not everything — our main services are still down. We investigate more and more, but it didn't get things working.

So we had to make a decision. When you're under pressure, under fire, you can't really stall. You have to talk to the team that's there, make a decision, and move on: talk about the consequences, the trade-offs, the options. Of course, we had our leaders on the call, and people from the teams helping us figure out what was going on. We did have a backup plan, because we did our disaster recovery planning: we knew that if worse came to worst, we could recreate the entire cluster. Of course, that might take time, right?
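The Velero flow we leaned on is roughly the following. This is a sketch, not our exact setup: the schedule and backup names are hypothetical, and it assumes a backup storage location and a volume snapshotter are already configured in the cluster.

```shell
# Back up the whole cluster every night at 02:00, keeping backups 30 days.
velero schedule create nightly --schedule="0 2 * * *" --ttl 720h0m0s

# In a fresh cluster (pointed at the same backup storage location),
# list the available backups and restore the latest good one:
velero backup get
velero restore create --from-backup nightly-20230101020000
```

The restore itself is quick; as the incident showed, the hard part is everything a backup tool can't see — out-of-band installs and drifted runbooks.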
So we made a decision as a group, as a team: we can't figure this out, we don't know what's going on or how it started, and our restore plans are not working. Let's just recreate the cluster — at least we know how that works. And so we did. We pressed the big reset button: we deleted the cluster, on purpose, and then recreated it with kops. kops update cluster is really quick; it brings it all up. But it was empty. Now what?

We did have Velero to restore things, but our next tool is what brings everything together: Argo CD. We had been migrating to it and had moved our core applications over. So we installed Argo CD on the cluster, it installed our main applications, and things came up.

How many people here know about Argo CD? All right, a quick intro, since most people know it. It's a GitOps tool that lets you deploy your applications from a single source of truth. Before Argo CD, we were doing deployments via imperative commands from our CI service, or from a tool we built in-house — basically kubectl apply. We moved because we needed to be more scalable, and we needed things to be reliable — not "someone deployed something here and we don't know how they deployed it." So we deployed Argo CD, it brought up our main applications, and everything was good. Services actually came back.

Well... not exactly. Our disaster recovery plan wasn't perfect. Our migration to Argo CD wasn't perfect. Our main customer-facing services came up and were restored, but there were other things in the cluster that weren't there. We had outdated disaster recovery guides.
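To make the single-source-of-truth idea concrete, an Argo CD Application boils down to a manifest like this — the repo URL, paths, and names here are hypothetical, not our actual setup:

```yaml
# A minimal Argo CD Application: "deploy what's at this path in Git
# to this cluster, and keep it that way."
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: main-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/gitops-repo.git
    targetRevision: main
    path: clusters/production/main-service
  destination:
    server: https://kubernetes.default.svc
    namespace: main-service
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual drift back to the Git state
```

With selfHeal and prune on, the cluster converges to whatever Git says — which is exactly why reinstalling Argo CD on an empty cluster brought the core applications back.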
We had outdated processes. Some things we had to install manually; for some things we had to go find the GitLab CI job and click the play button to get them installed. A lot of things had to be fixed after the main services came up. But eventually we got back to a working state and could stop the incident call.

In total, it was about two hours of outage. It took us about 41 minutes to delete and bring back the cluster, which by our standards was good — when we do DR, we budget about two hours. There were many interdependent services, and we had to figure out what was working and what wasn't. But things were good. We were back.

Now, what was actually the problem, in case you're wondering? This is what we found out afterwards. kops has a state store: when you create a cluster, it stores the state of the cluster in object storage. We were on AWS, so we had created an S3 bucket for that. When we did, we were using a new process to create that bucket — one we hadn't used for this before — and it attached a lifecycle policy. A lifecycle policy deletes files after a certain amount of time. It turns out that when a new node comes up, it accesses this object storage to get the secrets and certificates it needs to connect to the cluster. We had created the cluster with a state-store bucket whose expiry policy was 90 days. Guess what happened 90 days ago? That's when we created the cluster.

In hindsight, we could have fixed this if we had known where to look — if we had prepared for this problem, if we had known how our tooling fails. Actually, one of the kops maintainers helped us find this; they had probably seen it before. And now we know.

So what did we learn through this process? The first thing is: complete your migrations. We had several CI and deployment processes. We were figuring out how to move over to Argo CD.
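To make that root cause concrete: the offending rule was the same shape as this S3 lifecycle configuration. This is a reconstruction for illustration, not our exact policy — an unfiltered expiration rule on the very bucket kops uses as its state store:

```json
{
  "Rules": [
    {
      "ID": "expire-after-90-days",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Expiration": { "Days": 90 }
    }
  ]
}
```

Harmless on a logs bucket; fatal on a state store, because 90 days later the certificates and secrets that new nodes fetch on boot silently disappear.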
We were figuring out how to do things, and we didn't have enough time — other things came up. So we had multiple deployment pipelines: this service was deployed the old way, that service the new way, this other service — I don't know. So: complete your migrations.

The second lesson is: be experts in your tooling. There's a lot of open-source tooling that we all use. If you're not paying someone — a vendor — to support it for you, then your company is paying you to be the maintainer of that software, to be the expert. So when we take on open-source software, we have to learn how it works, learn how it fails, and learn all the bad things about it. Have guides, have training, teach new team members how to use it. Over time, we had accumulated a lot of software that some long-tenured people knew how to run and others didn't, so we had to go over the critical pieces and learn how they worked.

The third point is: always be practicing your disaster recovery. We had outdated guides. We mostly did them for compliance — we went through the exercise — but when we actually had to use them, that's when we found out they were outdated; they had commands that didn't work. And what if someone else had been on call who didn't write the guide, or hadn't been around for a while? If you're always practicing your processes, there's less chance of them drifting.

So we learned from this, and it wasn't quick, but we fully migrated over to Argo CD.
We took the time to do it properly, to learn how to do it well, and to improve our process. This is roughly what it looks like today — maybe we're doing it wrong, but this is what we picked. When someone deploys, it goes through an abstracted CI pipeline job that commits to our GitOps repo. Every cluster runs Argo CD, and each one deploys whatever it's supposed to deploy for that environment. So if something is not in Argo CD, it does not exist in our cluster. There are no more manual deployments, no imperative deploys. And we were able to move all teams over to this new method. It took time: we did critical services first, then we took other teams through it; some teams had to migrate themselves. And if there was a service that didn't get migrated — well, it must not be that important. Although, when we moved over, we did find some, and we fixed those.

On being experts at our processes: now, when we spin up a new cluster, we have a single workflow. We're actually on EKS now, and we use Terraform to spin up the cluster. After that, the workflow installs Argo CD, and Argo CD takes over all the applications. We're using ApplicationSets now, with label selectors, and that lets the Argo CD in each cluster pull in what's right for that environment and deploy it. So now we can deploy an entire cluster end to end in just a few minutes.

This is what I mean by "DR is our deployment process": we don't run the process once in a while, we run it quite frequently. We used to be really bad at upgrading clusters — as you know, there's a new version of Kubernetes all the time, and old versions go out of support, so you have to get used to upgrading. With our new process, we made sure we got good enough at it that we can upgrade all the time. So we always have a blue and a green cluster.
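The ApplicationSet setup described above looks roughly like this — a sketch using Argo CD's cluster generator with a label selector, so each registered cluster pulls in only the apps for its environment. The repo, labels, and names are hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: production-apps
  namespace: argocd
spec:
  generators:
    # One Application per registered cluster whose cluster secret
    # carries the matching label.
    - clusters:
        selector:
          matchLabels:
            env: production
  template:
    metadata:
      name: 'main-service-{{name}}'   # {{name}} = matched cluster's name
    spec:
      project: default
      source:
        repoURL: https://github.com/example/gitops-repo.git
        targetRevision: main
        path: apps/main-service/production
      destination:
        server: '{{server}}'          # matched cluster's API endpoint
        namespace: main-service
```

With this pattern, standing up a blue or green cluster is just registering it with the right labels; the ApplicationSet fans the workloads out automatically.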
Maybe in the future we'll have more. Just last week, we went from 1.22 to 1.25. Of course, we had to figure out the incompatibilities, but bringing up the clusters and all the applications was a breeze. We put up the new cluster — we call it the green one — and it grabs from the same source of truth in our GitOps repo; Argo CD deploys it. Then, when we're ready to switch traffic over, we switch our load balancers, or we do weighted load balancing. We've done this for two different version upgrades now, and we just keep improving the process.

So the same process we use for creating clusters and for restoring them is our deployment process. We're always running it. There is no other — well, we still have backups just in case, but we hope we don't have to use them.

Now, it took a team to get here. Or a lot of teams — a village. This wasn't just one small team; this was several teams in our company, three teams working together over a long time. And we're not done; we're going to keep improving this process. To give you a sense of scale: we've been running Kubernetes in production for almost six years, Argo CD for two, and since our incident last year, it has taken almost a year of work with many people to get here. So wherever you are in your journey — and we're still on a journey — it'll take time, but you'll get there, and it's really good. And we're still learning. I met someone yesterday, Michael from Stockholm, and he showed me how he set up his clusters with Argo CD and Cluster API, and I learned something there that I probably want to implement when I get home.

Are we going to do a demo with real production? I didn't prepare for this.
No. But I do have this repo up that I use for a lot of demos of what it's like to set up Argo CD in one go, end to end. It uses a kind cluster, so you can run it on your local laptop with Docker, and it sets up Argo CD managing Argo CD, plus other infrastructure services and a sample application. It's pretty much the same thing we use, just with less CI. I encourage you to check it out. That's all. If anybody wants to ask questions, please come up to the mic. If not, I'm available on Slack.

"Hello, hello. I have a question. Looking at the speed of your backups and recoveries, I'm assuming you didn't have too many stateful applications or large databases — or maybe you did, I don't know. What would you recommend if you're running something more stateful, like a large database? Because I don't think Argo would be sufficient — all it stores is the manifests, correct? So you'd probably need something like Velero, and it might be slower, I don't know."

True. So we had stateless and stateful applications, but our state was outside the cluster. Regardless of what happened to the cluster, as long as we could connect to our data stores, we were fine. You can use Velero for stateful workloads, but it depends on how the data is stored, and it's a different process for every database. With Velero, you can configure hooks for how it backs up data stores and PVCs — say, run this backup tool first to
Um, so it does have a lot of those features Cool, thank you If no one else wants to ask a question, I'll do um following up on that Uh, given if you wouldn't be able to connect to your stateful persistent volumes How would you restore that in terms of using argo and maybe velero in combination? Because what I think about is if as soon as you install argo and you restore from a backup Using velero, you now have the conflict that Velero as well as argo want to apply their manifests How do you solve this conflict? So the question is if you're using both velero and argo and you restore them both which one Would override or which how how would they conflict? So there there is settings in argo to let you back up. Sorry let you Override if someone something else has deployed it Or not override or something is owned by something else, but typically Argo would override anything that it it thinks it's owns So if we use velero and then we have argo in that cluster and they're both Like velero would only run once you would only restore it once that's a one-time thing But argo would constantly be reconciling the state of your cluster based on what's in get So it would override it So you would basically restore from velero and then install argo afterwards and let us do a thing Yeah, if we if we if we had to but for now like Because we're stateless we just use Argo to get the applications running and then the state is outside the cluster If there's any data, okay, thanks. Thank you Just a quick question How would you deal with a whole region failure in that scenario? A whole region failure and With the ks Yeah, okay. So Yes, so we we have done disaster recovery Testing for like a region failure like failure for one region to another This would probably involve your data stores. So Um a simplest way is for us. 
We do have a central data store, and it's in one region. We'd have to make sure it was running in multiple regions if we wanted to cover that use case. So it really depends on your business needs. For us, we've weighed whether covering it is worth the cost of constantly maintaining a cross-region setup versus accepting the downtime. It also depends on your data store — we use Mongo, and we also use Postgres. It's really more of a business decision, based on your mean time to recover. It depends. "OK, thank you."

"How do you manage secrets? Because the thing is, if you restore everything but you cannot access your secrets anymore..." What type of secrets — to start the applications? "Yeah, for example, to access your data stores." So the applications need access to the secrets to reach your data stores? Yes, true. So you make sure those can be deployed via Argo CD as well. One way: there are plugins for Helm, like helm-secrets, that encrypt your secrets using a tool called sops. It encrypts the values file, and Argo can decrypt it at runtime, when it templates the chart. There are other tools, like External Secrets Operator, that can fetch from external stores like AWS Secrets Manager or GCP Secret Manager. As long as you deploy those all together with your application through Argo, it will pull in the secrets and deploy them too. The goal is that Argo deploys everything it needs to get your application up and running, including your secrets. If you want to include infrastructure, you can do that as well. "Thank you." Thank you.

"How many nodes are in your cluster, and how many applications are running on it?" Number of nodes and applications? So, we have several production clusters of different sizes. Our biggest ones sometimes run 300 to 500 nodes — we have autoscaling,
so it depends. For applications, we have some number of microservices and some macroservices — somewhere around 50. "Yeah, thanks." Thank you.

"Thank you for the talk. One question about this diagram here: you're not using anything persistent, right? You don't have persistent volumes?" Sorry, could you say that again into the mic? "So in this case, for example, if you have StatefulSets, how would you handle that?" Oh, if we have StatefulSets, how do we migrate them from one cluster to the other? We do have some services that cannot run on both clusters at the same time, and when we do those migrations — well, specifically, those teams handle it differently. We can't have the service running on both, so they have to schedule some kind of migration window, where they either go down or have a highly available path; it really depends on the service. In Argo, we deploy them with selectors. We use the cluster generator with selectors to deploy only to a specific cluster — say, the blue cluster at the top — and then, when they're ready to move and the other cluster is live, they change their selectors to deploy to the other cluster. So most of our applications don't have to know what cluster they're running on; they just deploy through our GitOps pipeline. Some services — we have some Kafka services and other things — do have to know what cluster they're on; they actually need the cluster's name.

"No, but I meant more about data — what do you do with the data, to transfer it between the two?" Oh, the data stores. "Yes, I mean volumes where you store some data." Yeah, like I mentioned, we mostly don't run stateful services with their own PVCs. We do have some, but they're not critical; most of our data stores are outside the cluster.
They're within our VPC in AWS, and the applications can access them from both clusters. So we keep our data stores separate, such that multiple clusters can access the same data store in one environment. "Thank you." Thank you.

"Yeah, thanks for the presentation. Does your disaster recovery plan also cover third-party dependencies — like, say, Cloudflare, or..." Sorry, could you say that again into the mic? "Does your disaster recovery plan also cover Cloudflare or any other external dependencies?" Yeah, this is something I'd like to talk to more people about. For disaster recovery, we have a certain scope of what we would consider a disaster, and it may not be the same every time we run the exercise — and then we plan for that. Of course, there are things like "what if our CDN goes down?" — that's a different plan. We have certain plans; some things we've thought about more than others. What we usually consider is: if a cluster is down, or if a region is down, can we move somewhere else? Can we mitigate it? Sometimes it's "DNS is down — let's just go take a break," you know? It really depends on the business and what it's willing to accept. For us, we can accept a certain amount of downtime — sometimes our customers are down, too. "Thank you." Thank you.

"So, you talked about moving the load balancer to the new cluster. Do you do that manually?" Currently, yes. When we're migrating clusters — let's say migrating traffic — we do either DNS load balancing or, specifically in our case, since we have CDNs in front of them,
something else, because DNS doesn't always apply or work as easily. There's a global load balancing service we use in front of our CDNs, and we use it to shift traffic over from one cluster to the other. So basically, we manually shift traffic over, but it's something we do whenever we upgrade a cluster. And if, while migrating traffic, we see that something isn't working properly in the new cluster, we just move back. So it's manual right now, but we're fine with that.

"So with the load balancing service — do you create new load balancers in front of EKS and then move over?" Yeah, each cluster has its own environment. We have an ingress controller, and that creates load balancers in the cloud, and then we shift traffic at either the DNS layer or a higher point, depending on what we use. "Thank you." Thank you.

"Thank you. About the GitOps repo: is it one god repo with all your deployments in it, or multiple ones?" Yeah, this is something we learned along the way. Initially, we were debating: should we have multiple repos — one per service, one per environment, one for production, one for pre-production? What we went for was one massive repo. We actually call it the kraken repo, and it houses all our GitOps files for the different environments. The good thing about it is that all the commits are there: we can say, OK, someone deployed something here, someone did this, someone did that, and it's all in one place.
The overhead is in how you manage your Git repos and how you manage access to them, because what we consider a deployment now is a commit to that Git repo. So we have to lock it down, because whoever has access to that repo can essentially change our clusters. For now it's central, and we've locked it down to just the CI process and specific authorized users. Yeah, it's a monorepo, and there are specific settings in Argo CD for when you have a monorepo. "OK, thank you." Thank you.