I've got a quick lightning talk on how to avoid a Kubernetes doom loop. I'm David Colham, a staff solutions engineer at Jetstack. We're a cloud native consultancy specialising in training, consultancy, strategic advisory and the like. Today I'm going to tell a story about an issue we had at one of our customers, quite an interesting one, I hope. You'll probably recognise the logos of the tools we were using in the platform: Argo CD, Argo Workflows, Kubernetes, Helm, cert-manager and Flux.

So, we've all had this Friday deployment. What happened? Here's a graph of our Pub/Sub messages being published. The customer we were working with runs a large-scale web platform hosting solution: customers sign up on the website, and that schedules loads of pods, deployments and so on. Our normal use case in staging peaks around 12:00 UTC. There's a few things going on, but this Friday something went wrong: somebody did a deployment on the API layer on the Friday, and when we came in on Monday it had just been going on and on and on. It was about 60 messages a second, I think, sustained over the whole weekend.

Each one of those messages was doing something along these lines: we get a message from the front end saying "create an application", WordPress or whatever, Argo Events picks it up, the workflow triggers, we get a Helm install, and we have a little check step that asks, "is the application ready?" If not, we delete it. (There's a rough sketch of this loop below.) The specific feature we're looking at here is pool hydration. One of the challenges we had was that running the Helm installs and waiting for the deployments, ingresses and TLS certificates to be ready took a long time. So we used a pool mechanism: we over-provisioned five, ten, whatever applications already running in the cluster, which let us give our customers a really snappy response. The challenge was that the delete-app step was fire and forget, so it kept going around this loop over and over. It caused a lot of hassle.

The impact: we were unable to create any genuine applications in the cluster, because this pool was just continuously going around and around. Likewise, we were unable to delete any applications in the cluster, so we were in a stalemate. We also ran out of GCP instances within our Kubernetes cluster for our workflows and the workloads already deployed, even with autoscaling, HPAs, VPAs and such.

How did we resolve it? A manual cleanup. It took me six and a half hours: there were something like 16,000 deployments and 18,000 workflows, with pending pods everywhere. kubectl was ridiculously slow for cleaning that up, so I ended up writing custom Python scripts to collect objects and decide what to get rid of. It was interesting.

What was the root cause? An unresolved Python variable reference within an Ingress object: a simple mistake in the API layer whereby somebody mixed up a singular and a plural variable name.
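To make that loop concrete, here's a minimal sketch of what the provision/check/delete flow could look like as an Argo Workflows WorkflowTemplate. The names, images and chart path are illustrative assumptions, not the customer's real pipeline; the point is that the final delete step fires and nothing ever checks whether it succeeded.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: provision-app                     # hypothetical name
spec:
  entrypoint: provision
  arguments:
    parameters:
      - name: app                         # the customer application name
  templates:
    - name: provision
      steps:
        - - name: helm-install
            template: helm-install
        - - name: check-ready
            template: check-ready
        - - name: delete-app              # fire and forget: nothing watches the outcome
            template: delete-app
            when: "{{steps.check-ready.outputs.result}} != Ready"
    - name: helm-install
      container:
        image: alpine/helm:3              # illustrative image
        command: [helm]
        args: [upgrade, --install, "{{workflow.parameters.app}}", ./chart]
    - name: check-ready
      script:
        image: bitnami/kubectl:latest     # illustrative image
        command: [bash]
        source: |
          # Print only Ready/NotReady so the 'when' clause above can compare it
          kubectl rollout status deploy/{{workflow.parameters.app}} --timeout=120s \
            >/dev/null 2>&1 && echo Ready || echo NotReady
    - name: delete-app
      container:
        image: alpine/helm:3
        command: [helm]
        args: [uninstall, "{{workflow.parameters.app}}"]
```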
How could we have prevented it? We did tests, an integration test of course. But there are preventions and safety measures we could have put in as a platform team, not only from the developers and the API layer.

The first is dedicated workers. A cluster like this has a bit of a mixed role: you've got your workloads, the customer-related applications running in there, but you've also got the workflows orchestrating different actions and things like that. So we decided on dedicated workers: all of our Argo Workflows ran in what we called the system node pool in GCP, and all of the customer applications ran in the main pool.

You've got rate limiting: Argo Events has sensors that can rate-limit triggers. Parallelism limits make sure we're only running two steps at a time. Semaphores make sure that only one add-app workflow is running at a time, or only one deletion, or vice versa. Use work avoidance where you can as well; we had a lot of workflows doing work we didn't need them to do every single time.

Be careful with retries too, because our doom loop came from retries. Don't always put a retry policy in, and it doesn't have to be Always, there are various other ones. A limit is also good. We have had some of our workflows literally continue for seven days. Why? It turned out they were just doing a retry loop over and over again internally.

Disruption budgets are really good too. We use spot instances for a lot of our workflows, and you don't want them being descheduled halfway through because GCP needs that CPU back. It's really easy: just put pod disruption budgets in. Pod priority classes as well: we wanted to make sure some of our tasks were high priority, doing backups of MySQL databases for example. You don't want a backup sitting pending and never getting done until two hours later. Use metrics as well; they're really easy to do and they're so powerful.

Covered earlier on today as well: set workflow defaults by creating the workflow controller ConfigMap. As a platform team you can set everything up really nicely so developers and other engineers don't really have to care about it, and they've got those safeguards already in place. (There are rough sketches of these safeguards below.) So thank you. Ran out of time for the last one.
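To make those safeguards concrete, here's a rough sketch of what they look like on a single Workflow spec. The node pool label, semaphore ConfigMap, priority class and chart are assumptions for illustration; the field names (nodeSelector, parallelism, synchronization, podDisruptionBudget, podPriorityClassName, retryStrategy) follow the Argo Workflows spec, but verify them against the version you run.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: add-app-
spec:
  entrypoint: main
  nodeSelector:
    cloud.google.com/gke-nodepool: system   # dedicated "system" node pool for workflow pods
  parallelism: 2                            # only two steps running at a time
  synchronization:
    semaphore:
      configMapKeyRef:
        name: workflow-semaphores           # hypothetical ConfigMap holding the lock counts
        key: add-app                        # value "1" => only one add-app workflow at a time
  podDisruptionBudget:
    minAvailable: 1                         # don't let spot reclaims deschedule pods mid-run
  podPriorityClassName: platform-high       # hypothetical PriorityClass, e.g. for backup tasks
  templates:
    - name: main
      retryStrategy:
        retryPolicy: OnError                # not Always...
        limit: 3                            # ...and capped, so a bug can't retry forever
      container:
        image: alpine/helm:3                # illustrative image
        command: [helm]
        args: [upgrade, --install, my-app, ./chart]
```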
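And here's a sketch of setting the same things once as platform defaults, through the workflow controller's ConfigMap, so developers never have to remember them. The values are assumptions; workflowDefaults is the documented key for a default workflow spec, but check the exact fields against your Argo Workflows version.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  workflowDefaults: |
    spec:
      podPriorityClassName: platform-high
      nodeSelector:
        cloud.google.com/gke-nodepool: system
      podDisruptionBudget:
        minAvailable: 1
      ttlStrategy:
        secondsAfterCompletion: 3600      # clean up finished workflows automatically
      activeDeadlineSeconds: 7200         # hard stop, so nothing runs for seven days again
```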