It's working. Hello, hello, hello. Gosh, there's a lot of you. Well, thank you. Thank you for attending. Obviously, this is the last one between you and the beer, so we're trying to keep this as short as possible. Welcome to our talk. So when Guy and I were talking about how we might pitch our talk to KubeCon, we kind of agreed that the vendor talks are great. They're good. You learn a lot. But actually, the thing that brings us to conferences is the war stories. We like to hear from customers. We like to hear from companies who actually talk about the things that they do in production, and the failure states that they've seen in production. Because we learn from it, they learn from it, it's great. So that was our pitch. We decided to take the second-biggest outage, revenue-wise, in Skyscanner's history and present it to you all, bare bones and everything. We're just going to talk about it. We're going to talk about how we got there, the actual technical failures, the cultural failures, all the big muck-ups, with the idea that this is essentially free therapy for us. We've got 20 minutes of you just sitting, listening to us, and we get to talk about it. And if you learn something from it, then it's less of a failure. You've got something from it, we've got something from it; it's a data point rather than a failure. But before we go any further, ah, I need this. Thanks, OK. We should introduce ourselves. I'm the top one on the slide. My name's Stuart Davidson. I'm a director of engineering at Skyscanner. I run the Production Platform tribe, which is a group of squads that run almost like a platform as a service. We run all the compute, traffic routing, observability, incident management, everything that our product teams use to get their product across the globe, and we do all the heavy lifting underneath. Guy Templeton. Can you hear Guy? No? All right. This is Guy Templeton. He is the lead Kubernetes engineer at Skyscanner. He's also the co-lead of SIG Autoscaling here. So that's him there. Well, there you go. All right. This is Skyscanner. Now, some of you might not know what Skyscanner is. Skyscanner is a website that brings together flights, hotels, and car hire from about 1,200 travel partners across the globe. If you want to go somewhere, maybe multiple places at once as part of a holiday, we put it all together for the cheapest possible price. To give you a sense of the scale of Skyscanner: last month, I think, we served about 75 million unique customers. So that gives you a sense of the size and scale of the website we're talking about. Today, we're going to go through three different areas. I'm going to talk about Skyscanner's cloud-native journey: how we got from the data centers, with a far smaller number of users, to where we were just before the incident. Guy's going to talk to you about the incident itself, the actual failure state and how we recovered it. And then we're going to go through it together and talk about some of the learnings, because that's ultimately what you're here for, right? We're going to give you some context, then you're going to get the learnings. Hopefully, we'll have lots of time for questions as well. So please think about your questions. This is a chance for you to learn from us and for us to learn from you. So this is a deliberate attempt for us to share some information.
It's maybe not the best forum for it, but there's beers after this, so we can do it there. Skyscanner's cloud-native journey. So I joined Skyscanner in 2015, about seven, eight years ago. At that time, we had just less than 1 million users a month. Then we got to 10 million users a month. And just before COVID hit, we were over 100 million users a month. So that's 10x, and then 10x again. And it has been a wild journey, right? The number of pieces of infrastructure, the number of systems that you trusted in, that had never failed in years, suddenly became bottlenecks, suddenly became things that wouldn't scale the way you needed them to, or suddenly added latency, or some sort of problem you hadn't foreseen. So it's been a really fun journey, iterating and iterating, re-establishing what you're going to do and thinking about it almost from the ground up several times along the way. We started off in data centers. We were about 15 years old as a company, so we were in data centers. We had five data centers across the globe. And yeah, we made this shift to the public cloud. We saw the potential of the public cloud. We were scaling so quickly that actually buying infrastructure and getting it plugged in just wasn't sustainable. We saw the opportunity of allowing our teams to decide on their own infrastructure, and all the benefits of infrastructure as code and that sort of thing. So we made this shift to the cloud. Now, I say that in one sentence, one bullet point, as if it was a simple transition. But we are still paying the price of our migration from the data centers to the cloud even now. At one point, we just decided not to renew our data center contract, because it was taking us such a long time to transition the workloads from a data-center-native into a cloud-native solution, with all the fun and games of auto scaling and failure states and such. So we lifted and shifted a ton of our workloads, and we're still paying that technical debt off at the moment. But we're fully in AWS now. So that's our cloud provider. We are fully AWS. At the same time, around 2015, 2016, this thing called Docker was starting to happen. It was at the time when people in conference talks like this would say, who here has heard of Docker? And hands would go up. And then, who's using it in production? And it was, oh, no. It was just about that time where it was all quite exciting but we hadn't quite decided what we were going to do with it. Skyscanner decided to try it out with our CI solution. It's quite a common first step to try a container-native CI solution. It was actually the precursor to Harness, a thing called Drone.io, an open-source, container-native solution. We really started to see the benefit of ephemeral build agents that the squads could define. Instead of release engineers mucking around with TeamCity agents and Ansible scripts and things like that, squads could define their own ephemeral build agents in containers and do their own builds. We thought this was great. It was groundbreaking for Skyscanner. And that was one of the real drivers for us starting to talk about using containers in production really pretty early on. We had a look at a bunch of options, and there were loads of container schedulers at that time. And we decided, because we were moving into AWS, we would go all in and use ECS. And this was the grand plan for our ECS solution.
So we asked squads to define the configuration of their service in YAML. Then we would point our internal tool, called Slingshot, at each Git repo. From there we would deploy a Route 53 entry in front of an ELB, and then we would deploy containers just by adding them to and removing them from the ELB. Nothing more complicated than that. Again, this was totally groundbreaking. It totally revolutionized how Skyscanner was doing things. Because it used to be that services and squads would have to talk to the release engineering team. They would get their Ansible scripts updated. They would actually have a release engineer asking, have you done your tests, et cetera, et cetera. With this, it was all self-service. Squads could just deploy their services without engaging with release engineering. And it got to this crazy state where we didn't actually know which services were being deployed. It was a bit scary for the release engineering team. Our OKR was something like, get eight services deployed on Slingshot, and we ended up with 62 by the end of the month. Our OKR stats for the quarter were thousands of percent; it was brilliant. But this really progressed our path towards using containers in production. And it was so important because squads could deploy with confidence. That was the main takeaway from it. They could deploy with confidence. It moved deployment away from being this really scary, really difficult process with lots of human interaction. It became this ubiquitous thing that just happened in the background, that squads could do several times a day. It became a strategic enabler, right? People used deployments to get out of trouble rather than causing trouble. It was such a big change for Skyscanner. ECS took us a long way, and in fact, we still use ECS to this day. Almost, almost we're out of ECS. But we were finding that we were having to rebuild a lot of components that we were seeing happening in the open-source community, and we decided to take on Kubernetes. We thought, this is the way forward. Now, again, there were a lot of schedulers at the time, but Kubernetes had quite a strong ecosystem, and as demonstrated here, it's continued that journey. And we went through many, many iterations of trying to get Kubernetes to work. Like I say, we haven't fully got all our services into Kubernetes yet; there are still some on ECS. For our first iteration, we got a consultant in and we spent a lot of money on it. And it was at a time, again, when no one really knew how to run Kubernetes at scale. So they learned a lot, we learned a lot, and ultimately the solution we ended up with lasted only about a month, a month and a half, before we started to see problems as our workloads grew. Then we iterated, and we did this typical Skyscanner thing of the time: because we'd built so much on ECS, we started to rebuild and build custom solutions on top of Kubernetes rather than leveraging the open-source community, which was one of the main selling points of Kubernetes. And again, we went through several iterations of Kubernetes architectures at that time. It got to a point where we got some critical workloads onto Kubernetes, but we had enormous clusters. We had one cluster per region, and we were in four regions. And these clusters were enormous. Well, what we thought was enormous.
I mean, we had the keynote by CERN, right? They're enormous. We thought we were enormous at the time. But because it was a single cluster, it meant any sort of slight failure and the whole thing went down. And we had one particular outage where the business asked us to slow down, stop changing things in Kubernetes, right? Stop upgrading Kubernetes. Well, we can't. The upgrade path of Kubernetes is a new release every 12 to 14 weeks, right? So we couldn't slow down. And that's just the core version of Kubernetes; you've got all the different components, different parts of the Kubernetes architecture. So we couldn't slow down. And I remember I was about to go on stage at GitHub Satellite, about to talk about our continuous deployment system and how rapid iteration reduces risk, because that's the way we wanted to go forward at Skyscanner. But Paul Gillespie, who's our senior technologist in production platform, phoned me and said, look, we've just had this big outage. Not the one we're about to talk about; we had another outage. And the business really wants us to slow down. But I couldn't say, let's slow down, and then literally 20 minutes later go on stage and say we should go fast because it reduces risk. So I said to Paul, there needs to be a different way. We need to approach this in a different fashion. Is there any other way of doing this? So all credit to Paul, and all credit to Guy and Guy's team: we went back to basics and started again. We started to re-evaluate all the technologies we were using, which ones were actually valuable, and tried to declutter all the different things we'd installed in our Kubernetes architecture. There's a really good book called Good Strategy, Bad Strategy, and I really recommend you read it. If you ever try to do a strategy, or someone asks you to do one, I really recommend it, because it starts with the diagnosis. What's wrong, right? Let's start with what's wrong. Not the solutions, but the diagnosis of the problems. And one of our big problems was that we weren't using industry standards. I talked about that. So we decided specifically to set a policy of moving to industry standards. Guy'll talk a bit more about this bit, but we started to adopt, like I say, more open standards, more open tooling and technologies. But the big fundamental shift was a move to a cellular architecture. So this is what our architecture looks like in terms of Kubernetes at the moment. What we decided to do was look at the failure states we had to work with and create an architecture around that. In this particular case, this isn't actually well named: that's a cluster, not a cell. OK, sorry about that, that's my fault. So these are different clusters. There's a cell per region, and within a cell there are many clusters. We have an even spread of clusters across the AZs, and each workload runs across N+2 clusters, to allow one cluster to be down for maintenance or upgrade and one cluster to be down for a failure of any sort. Even then, 100% of the requests coming through should be served by our architecture. Now, this has an amazing benefit beyond being a really resilient architecture, one that, other than the incident we're about to talk about, hasn't caused us a production incident. We use spot instances 100% in production. So our entire compute infrastructure is spot.
And that has saved us literally millions of dollars. The company has saved millions of dollars because of this resilience. If our nodes are taken out by spot termination, we've got that resilience. We can deal with that failure state really easily; it's ingrained, it's part of the default state. So this is the state we were in. Now, what actually happened? Pride comes before a fall. I've just told you that the cells architecture is the best architecture ever, ever, ever. And at 9:52 and 33 seconds on the 25th of August, there was an engineering all-hands. And there was a really good question, because we'd moved about 70% of our workloads into the cells architecture, and I was asked by another team that hadn't moved yet why it was a good idea to move to the cells architecture. And I said, well, it's the traditional theory-of-constraints idea of bottlenecks, but we have solved the compute problem. It is now the most available, resilient thing we have; we're going to move on to a different bottleneck, which at that point was traffic routing. And the solution for traffic routing will be based on cells, so you should move. But cells has fixed the availability problem. At 15:53, not one day later, no, the same day, Argo CD deleted every service in every cluster, in every region across the globe, ultimately because we told it to. And the site died. Thank you. So, how did we go from, can you hear me yet? No? Yeah, that sounds good, awesome. So, how did we go from 478 services running and serving travelers to zero? And it starts, as all good stories do, with a no-op change, just merged late in the day. And this is the change that killed Skyscanner. Yeah, so as you can see, it removed some curly brackets. What could possibly go wrong? Very quickly after that, rather than seeing our lovely homepage and being able to search for flights, accommodation and car hire, people didn't even see this nice error page. Instead, they saw this. Which, it turns out, is not the greatest traveler experience, and people get very confused as to why they can't get quotes for their flights. Very quickly, someone was able to guess what had happened and suspected the Istio mesh, because we use Istio, we have mTLS, we use authorization policies, and when we deleted everything, Istio started going, no, you're not allowed to call anything. In terms of what actually happened, though, going back to what Stuart mentioned, we have Slingshot, our deployment tooling. When we started moving to Kubernetes, we didn't want our developers to have to relearn tooling or completely change their application specification. So we kept the same Slingshot.yaml format and just added Kubernetes support on top, so they could add a few extra fields and now, instead of being deployed to ECS, they were getting deployed to Kubernetes for free. However, that meant we were in a weird in-between state, and this is where the trouble originated. We had some things which had moved to the industry standard of GitOps, and then we had all the things that users deployed onto the cluster, the things people care about: Deployments, Services, HPAs, and ServiceMonitors. Meanwhile, GitOps was deploying things like Istio objects and resource quotas but, critically, namespaces. Which means that GitOps was controlling something that contains things which are not GitOps-controlled.
So in terms of how that works, for those who aren't familiar with GitOps, and I'm assuming most will be: we have a repository where people declare their service and onboard it, saying which cells they want to deploy to, how many pods they'll have, what their resources look like, so we can do some templating of resource quotas to prevent runaway scaling, et cetera, for users. And then Argo CD is responsible for doing the Helm chart templating, based on the values and charts defined in the repo, and then applies them to the clusters. So if we go back to this one-line change, what this file is being used for by Argo is telling it which clusters to apply all of these objects to. Multiple layers of templating. This becomes, for Argo, the driver of which applications it needs to diff against the cluster and apply to the cluster. So when this became invalid, it suddenly went, oh, I don't need to apply any objects to any cluster. Oh, there's a lot of objects there that I don't need. I'm going to delete those now. In terms of our recovery, that meant that the GitOps side was nice and easy: revert the PR and all of our namespaces are back. But all of those objects that people care about, like deployments and services, are not back, which is a bit painful. So in terms of the recovery, we had to mitigate first. We got people our nice error page so that they could at least see a traveler with a surfboard instead of a raw error. Then we prioritized getting a single region functional, and also shed non-critical load. Things like price alerts for travelers: we want to prioritize getting users quotes for their travel, or information about the tickets they've already bought, but price alerts we can do without for a couple of days. So this is a graph of the namespaces, and you can see very obviously where we deleted everything, and how, once we reverted things, Argo CD slowly started going, oh wait, I've got thousands of objects to reapply here. You'll see that's a very short gap; it's definitely under an hour for some clusters to recover. However, this is the traffic graph for Skyscanner over that time period, and you can see that the outage is far longer, with a pretty choppy recovery back to basically normal levels afterwards. In terms of why that took so long, it was the other bits of restoring clusters, and that's far tougher, because it was manual recovery. We had dusty runbooks that were designed not for a cluster where we'd deleted all the services but kept the cluster itself, but for a scenario where we'd deleted the entire cluster, spun up a new cluster and restored into it. We also discovered our runbooks were not easy to follow when a stressed engineer was trying to copy and paste. These two blocks are very similar, and make it very easy to copy and paste the wrong thing, so you might be trying to restore the wrong etcd cluster. So this is what the recovery of the actual services looked like. This is the number of services running on each cluster, and you can see here we've got this stepped recovery as we did the manual restore in each cluster. Eventually we got back to a point where we could leave it overnight, where, thanks to the strength of that cell-based architecture, we were able to serve all of Skyscanner's critical traveler traffic out of just four clusters in a single region, and allow our support engineers to get a proper night's sleep before they resumed recovery the next day.
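To make the failure mode concrete, here's a minimal, hypothetical sketch of the pattern Guy describes. The file names, repository URLs and cluster names are invented, not Skyscanner's, but the shape is the same: a templated cluster list drives which Argo CD Applications exist, and an Argo CD that is allowed to prune treats "no Applications rendered" as "delete everything those Applications created".

```yaml
# values.yaml (hypothetical) -- the root list that drives everything below.
# If an edit makes this render as empty or invalid, zero Applications are
# generated, and a pruning Argo CD will remove every object they once owned.
clusters:
  - name: cell-1-cluster-a
    server: https://cluster-a.example.internal
  - name: cell-1-cluster-b
    server: https://cluster-b.example.internal
```

```yaml
# templates/applications.yaml (hypothetical) -- one Application per cluster.
{{- range .Values.clusters }}
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: workloads-{{ .name }}
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.example.com/platform/workload-charts.git
    targetRevision: main
    path: charts/workloads
  destination:
    server: {{ .server }}
    namespace: skyscanner-services
  syncPolicy:
    automated:
      prune: true   # with pruning on, "no longer rendered" means "deleted"
{{- end }}
```

As the Q&A later covers, one mitigation after the incident was to stop Argo pruning automatically, so a bad render can only leave objects orphaned rather than deleting them.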
So what have we learned from it, and what can you learn from it? The risk here came from us crossing this chasm between where we were and using all these standard tools. We want to be all GitOps, fully based in that world, but we were in an in-between state where we'd actually created potentially more risk for ourselves: GitOps could cause a complete outage, but GitOps couldn't then be used to resolve all of the outage. Yeah, I mean, we gave Argo CD the power to apply all these Helm charts, and it used that power and deleted everything. Finally, templates with logic are actually code. We made a single-line change to a templating file and managed to wipe out every cluster. This wasn't caught at PR time because we don't have tests on it, because it's just a template. Why would you write tests for that? Especially when you've got multiple layers of templating, I would recommend trying to find a way to test it and make sure that what you're changing does what you think it does. And this may seem obvious, but don't do global config deploys. Since we had this outage, we have made a change so that even changes to templating like that will only ever roll out to a small subset of clusters at a time, so that, theoretically, we can only take down a small subset of clusters. I mean, I'm now worried that I'm going to get paged and told that all of the clusters have gone down again. Yeah, because you just said it. OK, it's my turn. Switch my mic on. Can you hear me? Excellent. Right, I just wanted to quickly talk about the less technical side of this. Guy talked about the technical side. The change that was made to that file, it was right at the root, the root file of our templating. And the engineer that made the change was an experienced engineer, and he made a change that, looking at it, is a pretty easy one to miss. It also got reviewed by a very senior engineer, and they missed it as well. Now, the title of our talk was how a couple of characters brought down our site. That's a bit tongue-in-cheek. It's a couple of characters in terms of the curly braces; it's also a couple of characters as in people, right? Is it a bit of a joke? Yeah, OK, it's not a great one. Skyscanner really works hard at having a blameless attitude to incidents. Humans are fallible, they make mistakes, everyone does it. And that's why the cells architecture is so important to us, because it allows people to make mistakes. Now, we thought we'd covered all the failure states and put all the guardrails in place, but like Guy says, we didn't apply testing to our templating, and we did a global deploy in a way we never thought possible. That root file hadn't changed in about three years. So it was a failure state we hadn't considered, but we know that humans fail. So when it actually came to the incident management side of things, that's where I genuinely believe Skyscanner shone, because everyone that needed to respond to what had happened came together. And that's not just engineers; we're talking legal, we're talking user satisfaction teams, the CEO, et cetera, et cetera.
Everybody came into a room. Well, other than the CXOs, that's not fair, the CEO wasn't there; the VP of Eng did pop in for a little while and then left to let the engineers get on with fixing the problem rather than figuring out who made the mistake. Now, we still know who made the mistake, but we've not named them in any ILDs, incident learning documents, which are what we do after an incident and where we got some of these reflections and retrospectives from. It really wasn't valuable to, because everybody moved in one direction, to fix the problem, rather than trying to cover their asses and cover up what went wrong, which would ultimately have delayed resolving it. Now, I've got a couple of quotes from engineers. Everybody was tired and quite emotional, so these quotes are a little bit emotional, but I think they're awesome. Like Guy says, because of the architecture we got to send everybody home pretty early, about 11 o'clock at night, but people were tired: it happened at four o'clock in the afternoon, so people had had a full working day and then had to resolve a full outage. But yeah, the positivity and calmness to give us the space to triage and recover; these quotes are, to me, really inspirational, and I share them with you not to big us up, but to try and convey the sense of achievement of that blameless culture. So try it in your organization. Next time you have a problem, don't look at who, just look at what. Leave it out. Don't even talk about the person; you don't need to, it's not important, failures happen. So yeah, I'm starting to bang on about it a little bit, but I thought it was important. Folks, thank you for your attention. I know it's maybe one of the shorter talks, but we're really keen to hear some questions if you've got any. Thank you. We were kind of hoping someone would come with a mic. There was a question down there, but I don't know if there's a microphone. Is there someone with a mic? Is there a question? Yeah, all right, tell you what, mate. You, yeah, thanks Carlos. Thanks. First of all, thanks for sharing. Failure stories are always interesting. So actually I have two questions. The first one is about, you mentioned that you should never do global configuration deploys. Do you have stages, different stages, for GitOps? And the second one is, do you have a disaster recovery procedure, to maybe recreate a region at regular intervals? So in terms of the rollouts, yeah, we have a concept that we call channels. We have different clusters in different channels: effectively a dev, alpha, beta and main. And at PR time we enforce that changes only roll out to a given channel, a single channel at a time, so that we effectively cannot roll out global changes. I am sure there is some way, somehow, that we could cause it to happen; we've tried to catch everything we could. So yeah, we basically progress changes through those channels, and we have different testing mechanisms as well. Yeah, there was quite a debate about whether you want to separate each region and have a separate Argo CD and separate infrastructure for each region. It's that balance of efficiency against blast radius. Ultimately we have stuck with a single instance of Argo CD rolling out to the different regions, but like Guy says, we've got the different channel files.
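As a rough illustration of the channel idea (a sketch with assumed names and layout, not Skyscanner's actual repo), cluster membership can be grouped per channel so that a single PR can only touch one channel's list, and a templating mistake can only reach that channel's clusters:

```yaml
# channels.yaml (hypothetical): one block per channel, rolled out
# independently. A PR check rejects changes that touch more than one
# channel, so even a bad template edit is limited to that channel's clusters.
channels:
  dev:
    clusters: [dev-cluster-a]
  alpha:
    clusters: [cell-1-cluster-a]
  beta:
    clusters: [cell-1-cluster-b, cell-2-cluster-a]
  main:
    clusters: [cell-1-cluster-c, cell-2-cluster-b, cell-2-cluster-c]
```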
In terms of your regional question, about rolling out a new region: because we're in four regions, we haven't got a hit-this-button-and-roll-out-a-brand-new-region capability, but because we're in four regions, we're comfortable deploying to multiple regions. So if that were to happen, obviously there are a couple of layers under the cells architecture in terms of AWS and such, but all of that is infrastructure as code. So I think we could do it pretty quickly if forced. Maybe we should run a war game on it. It's certainly something we theoretically could do, but it probably wouldn't be our first option. Cool, sorry, we'll get to you next. Hi, yeah, great talk. Do you ever do any DR drills about recovery from things like this? Because that seems like something no one ever drills for, a total outage, but what happens if suddenly everything gets deleted? How quickly can you get back up? Is it even worth drilling? What do you think? Oh, it's really challenging, right? How much time do you spend on disaster recovery? But also, a backup isn't a backup until you've restored it. So there's definitely a challenge there, a friction there, right? Subsequently, we have done far more disaster recovery exercises than we had in the past, particularly security-based ones. We've done some really good stuff with our security tribe: OK, a malicious actor has come in and switched this off, what do you do? And for some of those theoretical scenarios you don't need infrastructure, you don't need chaos engineering or anything like that to run through them. All it was was a PowerPoint deck and some sugared-up actors going, oh no, there's things that are wrong. That was actually a lot of fun, and we spent an afternoon doing it, so it was pretty cheap. But do you want to talk about the backup issue that we had? Yeah, so as I mentioned, we had a runbook for doing restores of etcd. At the time we were using kops to manage our clusters, so we had our own etcd clusters. But as I mentioned, we had only ever practiced restoring onto completely fresh clusters, where we imagined the entire cluster had been deleted. We hadn't imagined this scenario, where we were effectively trying to roll etcd back in time, which did show up some flaws in the runbook and some places where we'd made incorrect assumptions, so we had to do some on-the-fly stuff. We have since moved to EKS, we're using Velero now as the backup tool, and we are validating those backups more often; we've covered these scenarios and know how it would behave both on a fresh cluster and on a cluster that is in a bad state. But there was also an issue to do with IAM policies. We had restored a bunch of stuff, Guy will go into what, but then security had done an audit of our IAM policies and locked them all down. So getting access to the backups suddenly became an issue. Yeah, trying to discover on the fly why kops had created, or had not cleaned up, old backups, so we were stuck listing thousands upon thousands of objects, going, no, just give me that one. I don't want you to carry on listing, I just want you to restore. The IAM policy had removed the ability to delete the deltas, so we had hundreds of deltas rather than the six that were meant to be the policy. So there are definitely things like that that will always catch you.
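For anyone picturing what that Velero setup might look like, here's a minimal sketch; the schedule name, frequency and retention are assumptions, not Skyscanner's actual configuration, but the resource kind and fields are standard Velero:

```yaml
# Hypothetical Velero schedule: regular backups of cluster state that can
# then be exercised with periodic test restores, covering both the
# "fresh cluster" and "cluster in a bad state" scenarios mentioned above.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: cluster-state-hourly
  namespace: velero
spec:
  schedule: "0 * * * *"        # every hour, cron syntax
  template:
    includedNamespaces: ["*"]  # back up all namespaces
    ttl: 168h0m0s              # keep a week of backups
```

A restore drill against a backup produced by that schedule is then just `velero restore create --from-backup <backup-name>`, which is the kind of thing that's cheap to rehearse before you need it in anger.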
Yeah, what extent do you go to, right? Carlos, this gentleman put his hand up first, so we'll grab him. Yeah, so as far as recovery is concerned, what was the first thing you did once you saw that you were running zero services? Did you just revert the commit and rerun the CD? So yeah, we immediately reverted the PR. And one of the joys of the blameless incident culture was that we had someone who was not an engineer involved in the actual service take on an incident commander role, doing the coordination between squads. That freed my squad to do the revert and start figuring out how to restore things, whilst a member of another squad, the traffic routing squad inside Skyscanner, was able to do the fix of, let's send people to a static error page until we have services back. Yeah, that incident commander role is really important, and there are some new incident management tools out there, kind of Slack-based, that drive this attitude of having an incident commander who isn't doing anything technical and is dealing with the comms to the CXOs and the legal department, because the legal department had to be told, and Twitter, our social media teams and things like that. They dealt with all of that so the technical teams could get on with fixing the technical issues. Why did it take three hours to restore traffic? Because of those issues we mentioned, kops, et cetera. It took us time to dust off the runbook, we had to figure out whether a restore was actually going to fix all of the services, and there had to be a discussion about the best way to approach the restore. So that was largely the lag there, and because the clusters were quite big, the restores were not the fastest either. One of my favorite things that one of my engineers said, shout-out to Caitlin: before you make a big decision, go and make a cup of tea. Because if you make a quick decision in a situation like that, you can make it worse, a lot worse. And there was one point where we thought, this is going to take days, because, like Guy says, GitOps replaced the namespaces, but then how do we deploy each service? Do we have to talk to each squad and figure it out? So there was definitely a moment where we stopped and went, right, what is it we're trying to do and what is the quickest way of doing it, rather than just going, do that. So yeah, it saved us days, but it did cost us, you know, some time. Can you talk about your data persistence? I'm here. Yeah, hi. Can you talk about data persistence, how much are you doing in the cluster and how much outside? You've mentioned restore, but that sounded like an etcd restore to me; what about other data persistence for your services? So we currently only run stateless workloads in our Kubernetes clusters. Everything stateful is pretty much an AWS managed service: RDS, ElastiCache, whatever it may be. The one thing we do run is Prometheus, and in that case we had some EBS volume snapshotting as well, I think. But again, we just ended up discarding the data because it was metrics data. We have Thanos for long-term retention, and it's generally metrics about a time when we had no traffic on the cluster. So it was, again, one of those things where we had the discussion and went, is it worth the time to try and restore that?
No, let's move on and do the restores of more regions instead. Yeah, and you're right, it's the third rail: how do you deal with stateful services? But in our case, such a large proportion of our services are stateless that we're benefiting from spot instances on Kubernetes and saving a lot of money doing it. When it comes to stateful services, we take a different approach and lean on AWS far more. Thank you to Carlos for this, by the way; he's just taken on running the mic. We're not quite sure why there's no track host or someone doing the questions, but thanks, Carlos. Hi, just wondering if you've had any cluster autoscalers that just annihilated your nodes and you had to start from like five nodes or something. No, I don't think the cluster autoscaler has ever done that to us, he says. Famous last words. We have, I think, had misbehaving HPAs. There used to be bugs in the HPA where it would happily scale things down based on what one metric said while the other metrics were unavailable, which led to, I think, a Prometheus outage causing some services to be scaled down, which was less than ideal, shall we say. Once we realized, we managed to get a patch in upstream, but also, as soon as we restored Prometheus, it scaled back up. Eventually, how did you get to testing templates in your case? Did you do it on other clusters? So we effectively have test values that we now use for templating. If people are changing templating, the idea is that we drive the test values through that templating engine and check against the expected output. Yeah, unit testing, actual unit tests. Treat it like code, so it needs unit tests. Hey, so if you use purely spot instances, have you got a backup plan in the unlikely event that there's not enough spot instances to handle all of your traffic? Interesting that you say that at this moment in time, with AWS people in the room as well. Yes, well, we use a diverse range of instance types; we still use the cluster autoscaler and auto scaling groups. We do need to revisit those instance mixes as new instance types are launched and add them into the potential pool. Effectively, if we needed to, we could change all of our auto scaling groups to on-demand instead of spot. We've also moved traffic to different regions. So eu-central-1 had a problem last year, sort of October time, where, I'm going to raise my fist at CERN again, well, no, it wasn't CERN that time, but every so often CERN will take all the spot instances from eu-central-1 and we're left wondering where our instances have gone. But yeah, we shifted traffic to our US region and that saved us a lot of problems. I mean, we've got a lot of savings plans and things like that, and we can set up node groups using reserved instances, and we have done that in the past, but because we use such a diverse range of spot instance types, we can get away with it. And we haven't really been in a situation where we've needed to move 100% to on-demand. We have used on-demand when there are big peaks as well. Wow, we filibustered you. I only put my hand up because it's kind of related. It was more just a question to both of you, because it's Skyscanner. You guys have, is it Turbolift? Yes. Yeah. Did Turbolift help with any of these things? Because when I saw the title of the presentation, I was kind of going, was it Turbolift? For those who don't know, Turbolift just mass-PRs to everyone's GitHub repos. It does.
Yeah, well, no, we didn't have to use it. It was one of the things we were considering when we thought, are we going to have to redeploy everything? Do we need to use Turbolift to raise PRs against all the repos in Skyscanner and tell them to deploy to different clusters, et cetera? But we sat down, had that conversation and went, we reckon the restores will fix it. It would definitely be useful in other kinds of outage, and we have used it where we've changed infrastructure and just need service owners to update their specifications, et cetera. There's one slight challenge with using something like ECR. So we use ECR as a container registry. We think it's amazing. We're not being paid to say that. It's just been so robust compared to other container registry products that we had to run ourselves in the past. But the one problem with it is that it's got the account ID in it. So if we were going through an issue where there was a regional failure or an account problem, and particularly with cells, each region is a separate account, we would maybe have to do a mass change of the account ID. We would love to see that getting changed in AWS. I have two questions, I think. The first one is, have you made any change to Argo after that? Because we had a similar situation about a month and a half ago, but it was only on the dev environment. It was with the first service that we let our developers deploy with Argo, and during the first week of deploying with Argo, they changed the name of the namespace, more or less the same situation, and they destroyed all the dev environment cells. Thanks. So we did disable auto-prune and, I can't remember the name of it, cascading prune. That does mean that when we make a change that requires deleting things or cleaning up objects, we have to go in and manually trigger the prune at that level. Yeah, my question was, how did you recover the StatefulSets? So we didn't. Well, we put the StatefulSets back in place, but we had that conversation: the only StatefulSets we have running are Prometheus and Thanos. And we went, well, Thanos has shipped the longer-term metrics to S3, and we had that conversation of going, it's metrics, we don't care enough to restore it for a period when the site was down. We know the site was down; we don't need Prometheus metrics to tell us that. So we just didn't. And what did you do with the databases? I mean, presumably you use a database to run all this stuff at some point. The databases are RDS; it's an AWS managed service. We don't run databases on top of Kubernetes, so that data was never touched by the fact that we'd wiped out all these services. Hi. Have you actually had an outage from malicious actions by someone? And if so, how did you recover from that? Nope. Cool, good for you. No, I mean, we've war-gamed it. We've done it at a high level with our CXOs and talked about all the situations and scenarios, put them under a bit of pressure about making decisions. And then from a technical perspective, we've also talked through walking a scenario, discovering more and more problems, and then going, actually, this might be malicious rather than a failure, and figuring out what's going on from there. On that, there's a really good game day. I honestly don't work for AWS, but there's a brand new AWS game day in which the teams act maliciously towards one another.
So that's quite a good one, and quite a fun way to consider other failure states, but we've never had that happen. Right, folks, it's roasting up here. Maybe we'll see you for beers. Thank you very much for your attention. It's been fun. Thank you. Thanks.