Hi, what an IT crowd. I'm going to talk about how we went from canary deployments to canary clusters: basically, how we do failover clusters at Lunar. It will be slightly technical, but I hope that's OK.

I work at a company called Lunar. I don't think many people here know about it, so I took the liberty of bringing this slide. We have half a million customers and 650 employees, almost 700 now. We make an app with a bank behind it, and you can invest in stocks and things like that. We are a tech company, so at Lunar I guess you can call me a Lunar tech or something. We are apparently a unicorn now; it's something about valuation, I'm not an expert. And apparently we are a unicorn times two. I don't know how that works with a unicorn, but we'll leave it at that.

My name is Henrik Høy. I'm from the Nordics, a slightly different climate; my genes are not made for this climate, at least, I've learned. I'm a co-organizer of the Cloud Native Aarhus group, together with my colleague Casper, who's the organizer. I speak at different meetups, conferences, and so on, and I really like doing that, sharing knowledge. I worked as a consultant for many years, but now I work at Lunar, where I've been for nine months. And my hobby is Dungeons & Dragons. For people who don't know what that is: I paint small plastic figures, and then, with other grown-ups, we play with them, roughly. One of the things I really like about Dungeons & Dragons is that it teaches you that a strong team is a diverse team. Four wizards won't survive the tavern; that's apparent to everyone who has played it. So that's me.

These are the meetup groups we have in the Nordics, and we have actually combined them into Cloud Native Nordics. So if you like cold weather or rain, or you're in the area, by all means join us. Go to cloudnativenordics.com. We also have a Slack workspace, I think it's called, where we communicate and exchange speakers and things like that. It's a really awesome community.

The agenda for today: I'll talk about our tech stack. I'll talk about how we did failovers, basically how we achieved the capability of doing a failover using GitOps, because that is not as easy as we thought it was. Then I'll cover some of the key changes that gave us speed and confidence and removed complexity, and finally what we are planning to do on the road ahead.

It was a journey. I tried to keep Dungeons & Dragons and work separate, so you know. Basically, Casper and Bjorn did an amazing job transforming the company from working on a monolith to microservices, which was a great help when doing failovers. We also transitioned from deployment pipelines to a GitOps way of working, and that is key, because you don't want to trigger 1,400 pipelines when you do a failover.

So, our tech stack. We use Kubernetes, no big surprise there. I wanted to make a joke about something else, but I won't. We use GitOps, and we have a monorepo. Just so you know, there are different strategies, but ours is a monorepo. We used Flux 1 at the time we did these things; we have since migrated to Flux 2, which I could do a whole talk about, and which is really awesome. And yes, we did pay AWS an S3 bucket full of money for managed databases. You could call it cheating; we call it convenience. It's really nice. And we use RabbitMQ to decouple our services, not just in space but also in time, which is also a great help when you do failovers, at least if you want to do them without downtime.
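Since the rest of the talk revolves around Flux syncing manifests out of that monorepo, here is a minimal sketch of what pointing a Flux 1 daemon at a branch and path of a monorepo can look like. The repository URL, branch, and path below are hypothetical; only the flags are real Flux 1 options.

```yaml
# Minimal sketch of a Flux 1 daemon synced against a GitOps monorepo.
# The repo URL, branch, and path are hypothetical examples.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flux
  namespace: flux
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flux
  template:
    metadata:
      labels:
        app: flux
    spec:
      containers:
        - name: flux
          image: fluxcd/flux:1.25.4
          args:
            - --git-url=git@github.com:example-org/gitops-monorepo
            - --git-branch=master        # which branch this cluster follows
            - --git-path=clusters/prod   # subdirectory this cluster syncs
            - --git-poll-interval=1m
```

Which Git branch a cluster follows is just the --git-branch flag, which becomes relevant in a moment.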
We also use external-dns to control our endpoints in DNS and to shift the routing weights in Route 53. We are primarily in AWS, but we are also in Google Cloud and in Azure (or Azure, depending on your pronunciation).

Then we made some stuff ourselves. We made something called Shuttle, and it's kind of hard to explain what it is in very little time, but it's basically a thin layer we put between the developers and our systems. It decouples our services in Git from Jenkins, for instance, and from the way the pipelines work and all that. It's a great tool; it helps us a lot, and we can change things really quickly with it. It's open source. We also have Release Manager. It's a three-headed thing, and one of the most important heads is hamctl. It's not the meat: the first chimpanzee in space was called Ham. Now you know; we get that question a lot.

So let's look at the first generation of failovers. This is what gave us the capability, and then we could start working from there, right? We would have a GitOps repo with all our manifests, a cluster with Flux pointing at it, and users. It's not to scale, just so you know.

What we did was create a branch, because there are things in the new cluster that have to be slightly different, right? Like the cluster name, the routing weights, and all that. I think this is the way many people do it, and it's the way we did it in the beginning. We would make some edits in the branch, like the cluster name. We would then spin up another Kubernetes cluster, point Flux at the branch, and spin up all the services, and that was fine: no traffic was routed to it yet. We would then federate the two clusters; two-way federation is needed, unless you want to throw away some messages. Then we made some edits in both of these branches (not repos, branches), which shifted the routing weights on the annotations that external-dns uses. So we would now have maybe 80/20 or something like that, and we could shift it again and again until we didn't actually use the old cluster anymore. Then we could remove the services, and once we were sure that every service was down and gone, we could remove the federation and actually remove the old cluster. And then the fun part came, where we needed to merge this branch back into the main branch and point Flux at that.

That was complex. It sounds and looks simple, but when you're in incident mode, this is not the complexity you really want. But we had the capability, and a lot of work went into this, so it was really awesome.

The challenges: a lot of merge complexity in the GitOps repo is not a nice experience. New deployments would go stale during this exercise, because we had branched and new releases would only go into the main branch; they would eventually get through, though. A lot of people felt uncomfortable doing this because of all the complexity, and it just didn't seem in the spirit of GitOps. It should be more smooth; it felt like a hack, so to speak.
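The manifests themselves aren't on the slides, but to make those routing-weight edits concrete, this is roughly what a weighted Ingress looks like with the Route 53 annotations that external-dns understands. The annotation keys are real external-dns ones; the hostname, set identifier, and weight are hypothetical values.

```yaml
# Sketch: a weighted Route 53 record driven by external-dns annotations.
# Hostname, set identifier, and weight are hypothetical.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-service
  annotations:
    # Each cluster publishes the same hostname under its own identifier...
    external-dns.alpha.kubernetes.io/set-identifier: cluster-blue
    # ...and Route 53 splits traffic according to the relative weights.
    external-dns.alpha.kubernetes.io/aws-weight: "80"
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-service
                port:
                  number: 80
```

In the first generation, shifting traffic meant hand-editing that weight in two branches; keep that in mind for what comes next.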
Some of the observations: most of the edits were the cluster name. There are a lot of things in the cluster that need that barcode. It's still not a pet, it's still cattle, but we need the barcode so that, for instance, Fluent Bit knows to attach the cluster name, because when you have two clusters that are identical, you would like to know which cluster a log line came from, right? And there's a lot of other stuff: the AWS IAM Authenticator needs it, and the external-dns annotations need the cluster name added as well. The routing weights would also have to shift each time, because they couldn't be the same; otherwise you would have the same routing weights on both environments, right?

So, going forward to the second generation, we created two new controllers, because that's how you solve things today. We created one called the cluster identity controller. It has some strategies for finding the cluster name, and it then creates ConfigMaps in each namespace with the cluster name, so the applications can use it. We also created something called the routing weight controller, which controls the routing-weight annotations that are then used by external-dns.

This is an example of a cluster identity ConfigMap and what it looks like: you can see the cluster name here, in this case from 2022, and you can then use that in your application. And this is what the routing weight looks like, the CR where you specify the annotations that need to be put on Ingresses, with the right cluster name. It also carries the cluster name itself, because the controller only reacts to routing-weight objects that have its own cluster name. That way we can have multiple routing-weight objects in the same Git repository, in the same environment, but each cluster only reacts to the ones belonging to it. That's how we beat branching. It also has a dry-run mode, which is nice, in the beginning at least. So yeah, this is how we do it. And this is what an Ingress looks like when the routing weight controller is done: it has added the annotations, and external-dns then connects to Route 53 and makes the adjustments.

We use Shuttle to manipulate these. It's actually a repo just like any other microservice, and with Shuttle we can run commands like delete routing weight, add routing weight, and adjust routing weight. This is what it looks like before we do the failover: we can see in the YAML that we have one cluster with 100%. What we do is this. We have this extra repository, and the old cluster points at our main branch. We create a new cluster, spin up Flux, it points at the GitOps repository, everything spins up, and no traffic goes to the new cluster yet. Let's see if this works. Yeah: we federate the RabbitMQ, and then we run the Shuttle add-routing-weight command. It creates the entries, we release this like any other service, and the routing weights for our new cluster are now added. Then, simply by running these commands, we can gradually shift the traffic: adjust the routing weight to 50/50, then 80/20, then 100/0, and so on. It's just like releasing another application, which is a really nice workflow; it's what we're used to.

Once that's done, you can see in the routing-weight resource, so to speak, that we now have two clusters, and you can see the routing weights. What we then do is delete the old routing weight, release it, and then no traffic is routed to the old cluster, and we can start removing Flux and all the applications. When everything's down, we can remove the federation, and we can actually remove the entire old cluster. So this was much easier and much less complex.
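The resource slides aren't reproduced here, so here is a hedged sketch of what the two objects could look like. The API group, kind, and field names are assumptions based on the description above, not the actual schemas of our controllers.

```yaml
# Sketch of the ConfigMap the cluster identity controller writes into each
# namespace. Name, key, and value are assumptions based on the talk.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-identity
  namespace: my-service
data:
  clusterName: prod-2022-03   # the "barcode" an application can read at startup
---
# Sketch of a routing-weight custom resource. API group, kind, and field
# names are hypothetical; the idea matches the talk: a controller only acts
# on objects carrying its own cluster name, and stamps the listed
# annotations onto the matching Ingresses.
apiVersion: routing.example.com/v1alpha1
kind: RoutingWeight
metadata:
  name: prod-2022-03
spec:
  clusterName: prod-2022-03   # only this cluster's controller reacts to it
  dryRun: false               # the dry-run mode mentioned above
  annotations:
    external-dns.alpha.kubernetes.io/set-identifier: prod-2022-03
    external-dns.alpha.kubernetes.io/aws-weight: "50"
```

Shifting traffic is then just editing the weight in a file on the main branch and releasing it like any other service, with no branches involved.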
Some of the effort that went into this: we coded two new operators. That's also complexity, but it's code, which is much easier for us to handle. We only need to maintain it once, not every time we do a failover, where we want as little cognitive overhead as possible. We did 17 failover runs in roughly a two-month period, which was a lot, and a heavy investment, but I think it paid off quite well. Every iteration led to improvements. We also use spot instances in dev, which led to a lot of great findings.

Some of the results: everyone in Squad Odyssey, which is the platform container runtime squad, can now do a failover. The failover operation is down to five automated steps, and the reason it's not one is that we need to check some things along the way. No GitOps branching, which was awesome; we didn't really like the branching, so getting rid of it was great. And we went from spending a whole four hours on a complete failover without downtime to 40 minutes, which I think is OK. So that was really awesome, speeding things up.

Some of the problems: we have this identity controller that needs strategies to find the cluster name, and that seems like a bit much. If we could only have a name field in the cluster metadata in Kubernetes itself, we could remove an operator. So, if anyone is listening.

Some of the goals: we want to use Flux v2, where we can use dependencies and remove more complexity from our scripts. We want to migrate to Cluster API, and we want to migrate our Terraform to Crossplane, just to put everything behind a closed reconciliation loop.

On culture: what we did was the Deming cycle, Plan, Do, Study, Act, and we just did that again and again. And with that, I would like to say thank you. The controllers are open source, and you're more than welcome to contribute. Thank you so much for your time.