Hello everyone. This talk is titled 10 More Ways to Blow Up Your Kubernetes, by Airbnb, and it's titled 10 more ways because we have found more ways to blow up since last year's talk, which was presented by our colleagues Melanie and Bruce.

So, who are we? My name is Jen. I'm a software engineer at Airbnb on the compute infrastructure team. This is the team that manages our Kubernetes clusters and the abstraction we have over them to make developing services easier. Hi, my name is Joseph, and I'm on the compute infra team with Jen. I've been on this team and at Airbnb for about eight months now.

The outline of this talk: one, we're going to give a brief intro to Kubernetes at Airbnb; two, we're going to dive into 10 cases of how we actually messed things up; and three, we're going to do a recap.

All right, Kubernetes and containers at Airbnb. Airbnb started our migration to Kubernetes at the beginning of 2018, and since then it has really taken off. Kubernetes is running in production, and we rely on it as a critical part of running Airbnb. Our environment is mostly Amazon Linux 2 for the actual hosts. We use Ubuntu images, Canal as the CNI plugin, NodePort services, and SmartStack for service discovery. And we support many languages, such as Ruby, Java, Python, and Go services.

Now let's dive into the real fun stuff. The first case we're going to talk about is called replicating replica sets. For context, we have a job that regularly scrapes our clusters, and it runs a command something like kubectl get pods across all namespaces. This job started getting out-of-memory errors on one specific cluster. So let's check it out. In most of our normal clusters we have maybe on the order of 1,000 to 2,000 replica sets. Any guesses as to how many this problematic cluster had? It was 56,000.

So let's see what else is up with this cluster. When we dive into all these replica sets, we notice that all of them are actually under one deployment, in one namespace. As we dig deeper, we also notice that the deployment object's collision count is up to 100,000. So at this point our theory is that the deployment object is creating replica sets that aren't getting adopted by it, so it keeps on creating more.

As we dug more into this incident, we realized that what was different about this deployment on this specific cluster, compared to its deployment on other clusters, was that on this one we were testing a new feature called topology spread constraints. We noticed that these specs were not getting picked up by the replica sets. And the theory here is that because the replica sets being created by the deployment didn't have these specs, the deployment didn't recognize those replica sets as belonging to it, and so it would just keep creating more and more. Good thing we have so many test clusters, so we could actually diff the specific deployments to see what was going on. In this specific incident, the actual fix was simply deleting the whole namespace, which is fine because this was just a test cluster. And good thing we were only testing this feature out in a test cluster. So the takeaway here is: do test new features in test clusters, because for us that's what they're for; we were testing this new feature.
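As an aside, here is one quick way to spot this kind of replica set explosion, sketched with the official Kubernetes Python client. This is illustrative only, not the actual scraping job from the incident, and the namespace name is made up.

```python
# Count how many replica sets each deployment owns in a namespace.
# A healthy deployment usually owns only a handful (bounded by revisionHistoryLimit);
# tens of thousands under one owner means adoption is broken.
from collections import Counter
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

owners = Counter()
for rs in apps.list_namespaced_replica_set("my-namespace").items:  # hypothetical namespace
    for ref in rs.metadata.owner_references or []:
        if ref.kind == "Deployment":
            owners[ref.name] += 1

for name, count in owners.most_common(5):
    print(name, count)
```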
And what we noticed was that it kind of blew up our cluster and just kept creating infinite replica sets. But not a problem: we know to investigate more before we roll this feature out to production.

Our second incident is called mutating time bomb. What we see here is a service experiencing a high error rate, and the service cannot deploy. And we get this cryptic Kubernetes error saying something like: unable to access invalid index 5. So we dive deeper and check the status, and what do we see? We see that two pods are running when the HPA thinks there should be nine replicas, and the service is panicking because, well, we only have two pods left. If we dig more into it, we notice there's only one deployment object but three replica sets, which is really weird. And the replica sets that actually have pods are 16 days old. As we dug more into it we realized, yeah, there's something really wrong going on with the HPA and the replica sets here.

So what did we try? Well, we tried deleting the HPA and manually scaling up, because the theory was that perhaps there was something wrong with the HPA. That did not work. We tried deleting those old, bad replica sets, thinking perhaps they were blocking the new deploy from happening. But that was not the problem. We also made sure not to delete the replica set that had our last two pods running. We also tried creating a new deployment object, just to kick it and see what happens. That did not solve our problem either. We thought about just recreating the namespace, but that would definitely delete the last two running pods, which is no good because this service is in production. So what do we do now? We page a teammate; we phone a friend.

One hour later, as we started digging more into it, we came back to the first page, where someone had called out this cryptic error message. We should have dug into this earlier, because it turned out to be the exact problem. When we googled the error, we found it was actually a problem with JSON patching related to our mutating admission controller, which had been rolled out within the last month. Specifically, the bug was this. Imagine we have an input spec, and our desired output is that spec with two items deleted from an array. The bug was in the patch our mutating admission controller generated: it produces two operations, remove index two and then remove index three. At first glance this patch looks reasonable, but the problem is that after the first operation, the array doesn't have an index three anymore. So the correct patch is to remove index three first, then index two. Luckily this issue was already well known, so once we googled it we could actually apply the fix.

And so, once we realized it was the mutating admission controller, to stop the bleeding we deleted the mutating admission controller from that specific cluster, which immediately resulted in successful pods coming up. So our immediate fire is resolved, and we're done, right? No. As we dug into it more, we realized our change had actually happened seven days ago. And we realized that for some deployments over the last seven days, new code actually wasn't even being deployed.
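Coming back to the patch bug itself for a moment, here is a minimal sketch of why the ordering matters, using plain Python lists to stand in for the JSON array being patched:

```python
# We want to delete "c" and "d", leaving ["a", "b", "e"].
spec = ["a", "b", "c", "d", "e"]

buggy = list(spec)
del buggy[2]    # removes "c" -> ["a", "b", "d", "e"]
del buggy[3]    # indexes have shifted, so this removes "e" -> ["a", "b", "d"]

correct = list(spec)
del correct[3]  # remove the higher index first -> ["a", "b", "c", "e"]
del correct[2]  # then the lower one -> ["a", "b", "e"]

print(buggy)    # ['a', 'b', 'd']  -- the wrong element was removed
print(correct)  # ['a', 'b', 'e']  -- the intended result
```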
So, back to the impact: services were being kept alive by their old replica sets, but over time, as the pods of those old replica sets slowly died off, services were slowly getting degraded. Yeah, this problem was way more insidious than we had realized. The takeaways here: be aware of the existence of your mutating admission controllers. This was something we had recently rolled out, and so for many of us, when we see an incident, it doesn't immediately come to mind to think, hey, it could be the mutating admission controller. Be willing to ask for help, because paging an additional team member definitely got us moving. And do log your error messages and do create alerts; when we dug more into it during the postmortem, we realized we actually had metrics that could have detected this problem earlier.

All right, our next incident is called one to autoscaling. We're going to begin with this Kubernetes issue, which is still open. The gist of it is: say you have a deployment with three replicas. Later, you set up an HPA, a horizontal pod autoscaler, with a much larger number. You then update your deployment and apply it. This is where the problem happens: the moment you apply or update that deployment, the number of replicas resets down to three. The suggested solution, outlined in a comment on the issue, was to remove replicas from the last-applied deployment and to remove replicas from the current and future deployments, to let the HPA do its thing. That sounded good to us, so we shipped it. Great. So our fix right now is: if there's a deployment and an HPA, we follow the advice of that comment by editing the last-applied deployment to delete the replica count from it, removing replicas from our current deployment, and applying the deployment.

But that was not the end of our autoscaling woes. This next case is called zero to autoscaling. If you actually look at the HPA algorithm, desired replicas is a function of your current replicas, roughly your current replicas scaled by how far the observed metric is from its target. Some of you might already see the problem: if current replicas is zero, desired replicas will always be zero. This sounds like a quick and easy fix, so we ship it, right? So now our fix looks like: if there's a deployment and an HPA, and replicas is not zero, then we apply that fix; otherwise, if replicas is zero, we can just leave it alone and the HPA will do the correct thing.

Zero to autoscaling, again. That was not our final problem with the HPA. In this incident, someone alerted us that when they did a deploy, their number of replicas dropped to zero. So let's see what happened here. Someone followed up and realized, yes, this is actually a known issue which we had already fixed, but we were in the middle of migrating our deploys to Spinnaker, and Spinnaker did not have this standard workaround implemented yet. So the lesson here is that migrations are hard. And the fix now is: oh, and remember to copy this logic over to Spinnaker.

And once more, we have more autoscaling issues. What do you think happens in this scenario? You have a service that just migrated to Kubernetes and still has old EC2 boxes running. We deploy this service to Kubernetes with replicas equal to zero. We then scale the service in Kubernetes using kubectl scale deployment. Nice, we see that Kubernetes is now taking traffic successfully, so we delete all of our old EC2 boxes. The service has now fully migrated to Kubernetes. And now we deploy again with an HPA.
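To see why this matters, here is the HPA rule described earlier, sketched in Python. The real controller also applies tolerances, bounds, and stabilization, which are omitted here.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    # Scale proportionally to how far the observed metric is from its target.
    return math.ceil(current_replicas * current_metric / target_metric)

print(desired_replicas(3, 90.0, 60.0))  # 5: scale up when the metric is above target
print(desired_replicas(0, 90.0, 60.0))  # 0: with zero current replicas, desired stays zero forever
```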
That's right, an incident happens. The actual bug here is that we had manually scaled this deployment with kubectl scale deployment, but that means the most recent proper deployment still had zero replicas. We didn't go through what our infrastructure team considers a proper deployment, because kubectl had been used directly to scale up the replicas. So now our fix is to add a warning to our kubectl scale deployment tooling to tell people that this is a temporary measure, that they should still go through a normal deploy process, and to actually check what the current replica count is. The takeaway here is that turning on the horizontal pod autoscaler for the first time is not trivial, so do test edge cases such as zero, one, and so on. And remember the non-paved paths: even though we had fixed this issue in our normal deploy tooling, Spinnaker was a new path that was still being paved.

Our next case is called: did we really delete all our master nodes? Yes. Yes, we did. What happens when you delete all the master nodes in a cluster is that there's no API server, kubectl commands don't work, there are no new deploys, and the cluster is just broken. Luckily for us, this was just a test cluster. As we tried to figure out how to bring this cluster back up, our options were: one, completely delete the cluster and bring it up all over again, or two, actually try to fix it in place. These are pretty bad situations to be in, but luckily this was just a test cluster. As it turned out, the situation was already documented in our runbook, where the suggestion was: one, delete our admission controller as an immediate action via kubectl, and two, terminate the kube-dns pods to get them rescheduled. The problem with trying to bring up your masters once they have all gone down is that the order in which you bring things back up really matters. In this specific case, following the runbook allowed the necessary services to actually start up again and get unblocked, so the masters could come up successfully. The takeaway here is: don't drain your master nodes. And if you do, do check your runbooks.

Our next case is called masters out of memory. In this specific incident, we noticed our masters' CPU usage going up. The kube-apiserver is going wild, using up lots of CPU and memory. etcd resource counts are shooting up, and we're getting out-of-memory errors on our API server. More alerts are firing, and more alerts are firing, and more alerts are firing. So this is a real incident. We also realized, as we dug more into it, that we had no access either: when we tried to SSH in to debug, we found the nodes were having real issues and we couldn't get in. So what is happening here? Our API server is OOMing and crashing, our control plane nodes are severely degraded, and we don't have SSH access to them. We wondered, is this an etcd problem, is etcd having issues? Because if etcd were having issues, things would be really bad. We saw an elevated number of objects being created, but etcd was actually still healthy and fine, and existing workloads were also still fine. So our response here was to spin up bigger instances for the control plane, for the masters. And this was done just in time, because our old instances just completely died.
And we noticed that memory was still getting eaten up super fast, even though these new instances were much bigger. In the postmortem, we realized this was actually one of our largest clusters that was having this issue, so we wondered: is this a new scaling limitation we've just hit? We also realized the incident lined up with a deploy that had dramatically increased its max surge value, and that deploy was crash looping. The theory here is that that deploy, with its high max surge value, was overwhelming our API server. The takeaway here is that restricting max surge would help protect the cluster, and that it's a good idea to be ready to scale vertically; we were lucky that we were able to spin up bigger instances for our control plane. We still have some ongoing unknowns, such as: one, what exactly is the breaking point in terms of pods or max surge? Two, we've had larger clusters in the past and didn't run into this issue, so what is different now? And three, why did memory usage jump so suddenly? If you remember from our graphs, the API servers were using up to 50 to 90 gigs of memory, which is odd because etcd can only use about 8 gigs. So what was actually using it all up?

Next: authentication and the world of containerization. At Airbnb, we rely on AWS IAM roles as the primary way to authenticate our services. This is simple enough in the traditional, non-containerized world, where each EC2 instance can have its own IAM role: it can assume that role and request temporary credentials from the EC2 metadata API. But what happens when you have multiple pods and services running on a single instance? We could let each pod assume any of the other pods' roles, or make one big IAM role that encapsulates all of the pods that will be running on the instance. But we don't want to give each pod more permissions than it needs.

To address this issue, we use an open source project called kube2iam. kube2iam provides IAM credentials to containers running inside a Kubernetes cluster, based on annotations. But what does this really mean? kube2iam runs as a daemon set within the cluster. When a pod makes a request to the EC2 metadata API, kube2iam intercepts the request; then, using a special IAM role of its own, it assumes the pod's IAM role, makes the request itself, receives the temporary credentials for the pod, and forwards them back.

Now, kube2iam wasn't perfect. It came with some race conditions. The expected course of events with kube2iam is as follows. First, the node starts up, then the kube2iam pod starts up. Then an iptables rule is added to forward requests for the EC2 metadata API's IP address to the kube2iam pod. Then kube2iam starts watching for new pods, after which an application's pod can start up. kube2iam notices the new pod and caches the mapping from the pod's IP address to its AWS IAM role. Then, finally, the containers in the pod can make requests to the EC2 metadata API. However, what we were seeing was that containers would make requests to the EC2 metadata API before kube2iam had even noticed the pod come up. Because of that, when kube2iam received the request, it could not find the pod's IP address in its cache, and it would just reject the request. This was especially problematic with init containers, which run before all of the other containers. So our solution: another init container. That's right.
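Roughly, the idea looks like this. This is a hedged sketch; the endpoint path and the script itself are illustrative rather than the exact internal implementation.

```python
# Keep asking the EC2 metadata endpoint for role credentials until kube2iam
# answers successfully, then exit so the application containers can start.
import time
import urllib.error
import urllib.request

METADATA_URL = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"

def wait_for_kube2iam(timeout_seconds: int = 300) -> None:
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        try:
            # The iptables rule redirects this call to kube2iam, so a 200 here
            # means kube2iam has seen our pod and cached its role mapping.
            with urllib.request.urlopen(METADATA_URL, timeout=2) as resp:
                if resp.status == 200:
                    return
        except (urllib.error.URLError, OSError):
            pass  # kube2iam hasn't noticed the pod yet; try again
        time.sleep(1)
    raise SystemExit("timed out waiting for kube2iam")

if __name__ == "__main__":
    wait_for_kube2iam()
```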
We made a new init container called kube2iam-wait, which basically just keeps pinging the EC2 metadata API until it receives a successful response back. This works because we had already set up the iptables rule to forward those requests to kube2iam. So if you get a successful response back, that means kube2iam responded successfully, meaning it has already noticed the pod come up.

Now, this is only a temporary solution. We're currently working on a new authentication scheme in which the API server uses its own private key to sign a token and inject it into a pod on startup. Then, when a pod needs to authenticate, it uses a different API to get credentials for the pod's IAM role using the signed token. STS verifies that the token signature is valid against the public keys, and after the validation is complete, STS returns the temporary credentials to the pod. Our main takeaways here are that init containers are very versatile despite their simplicity, and that they're good solutions even if they're only temporary.

Next case: CPU limit equals the number of CPU cores on the node, and we still see throttling. One day we got a message from an engineer asking, can you explain why we'd still see throttling when the limit is set to 36? With only 36 cores, how can we exceed 36? To give some context, let's first talk about how CPU limits work in Kubernetes. Kubernetes limits a pod's CPU usage using Linux cgroups and CFS, the Completely Fair Scheduler. Let's take the example of a 36-core machine with the default scheduling period of 100 milliseconds, and say we have, in our Kubernetes manifest, a CPU limit of 1024 millicores, roughly one CPU core. This gets translated into the cgroup's CFS quota: the 100 millisecond default scheduling period times one core gives us 100 milliseconds of CPU time per period.

Now let's take a few different scenarios within a single scheduling period. In our first scenario, we have one container running one thread, and it can run the full 100 milliseconds without any throttling. Now let's say the container actually has two threads: then threads one and two can run simultaneously for 50 milliseconds, after which they get throttled, because the combined usage within that container has reached 100 milliseconds. In our final scenario, we have threads one, two, and three running simultaneously, and threads one and two either terminate or get descheduled after 25 milliseconds. Then thread three can run for an additional 25 milliseconds until the container reaches the full 100 milliseconds of its quota, after which the threads from that container get throttled.

So, in our original scenario, we had a CPU limit of 36 times 1024 millicores, which gets translated into a CFS quota of 3600. In that case, we should be able to run our threads on all 36 cores for a full 100 milliseconds within any CFS period and not get throttled at all. To figure out what was going on, we ran a bunch of tests. We made one app with a limit of 36 CPUs and another app with no limits at all. We ran them on different nodes, swapped the nodes, and found that on average they ran with pretty much the same performance. But what was surprising was that despite the similar performance, we did see throttling on the app with the CPU limit of 36. So our main takeaways here were that some throttling is inherent and is okay, and that we should be more careful with our metrics.
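To restate the arithmetic behind this case, here is a quick sketch, assuming the default 100 ms CFS period (the actual kernel setting is cpu.cfs_quota_us, expressed in microseconds):

```python
CFS_PERIOD_MS = 100

def cfs_quota_ms(limit_cores: float) -> float:
    # The container may consume (limit in cores) x (period) of CPU time per
    # period, summed across all of its threads.
    return limit_cores * CFS_PERIOD_MS

print(cfs_quota_ms(1))    # 100 ms:  one thread can run for the whole period
print(cfs_quota_ms(36))   # 3600 ms: in theory, 36 threads can each run a whole period

# The scenarios above, expressed as quota consumed in a single 100 ms period:
#   1 thread  x 100 ms                     = 100 ms -> no throttling
#   2 threads x  50 ms                     = 100 ms -> throttled after 50 ms
#   3 threads x  25 ms + 1 thread x 25 ms  = 100 ms -> throttled after 50 ms of wall time
```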
On the metrics point: Docker's built-in CPU usage information may not actually be granular enough to fully capture what's going on. We also learned that the percentage of throttled periods is a better metric than the absolute number of throttled periods, since the latter is heavily affected by the number of threads running within the container.

CPU limits cause out-of-memory kills? In Kubernetes, we have memory limits and CPU limits. Intuitively, memory limits cause out-of-memory kills if they are exceeded, and CPU limits cause throttling. And they do. But we were also seeing that CPU limits were causing out-of-memory kills. Here's an example: we saw our pod's CPU usage spike a little after 12:00, immediately followed by a huge spike in memory usage and eventually an out-of-memory kill. So what's going on here?

Well, the first reason was that these were Java processes, and the JVM garbage collector is intensive and spawns a lot of threads. And the more threads there are running concurrently, the faster the container can get throttled. There are two main flags to be concerned with here: ParallelGCThreads and ConcGCThreads. These two parameters are used for different stages of garbage collection, but the number of threads used for garbage collection under either of these flags can easily exceed the number of threads in the main application. And when both the main application and the garbage collector are being throttled, we can end up in a situation where data isn't being processed, and any data that has been processed can't be freed.

Our second and third problems were that only certain containers get throttled, and that the back pressure we had in place to mitigate this wasn't working properly. Since Kubernetes relies on Linux cgroups to implement throttling, certain containers in the pod can get throttled while other containers aren't. In our case, the main container processing the data was throttled while another container, Envoy, accepting requests, was not. This meant we were continuously accepting requests while we were unable to process them, and because the garbage collector for the main container was also being throttled, we couldn't free memory properly either. We had back pressure in place to tell clients to stop sending requests in case something like this happened, but it wasn't working properly at the time. So our main takeaways here: be wary of what other processes may be running in the pod and adjust limits accordingly; tune the JVM garbage collector threads if necessary; and make sure there's proper back pressure to avoid overloading your services.

Rogue ECR cleaner. At Airbnb, we make a lot of Docker images. This data is from 2019, and at the time we were making over 2 million Docker images per day. Obviously we can't keep them all, but we also can't do something naive like deleting all the old images, because we don't know which images might be in use. So we built a service called ECR cleaner that runs as a cron job every hour. Basically all it does is find all the images in use, then loop through each of our ECR repos and delete all the old images except for the ones in use. Fairly simple. And it was working great, until it wasn't: ECR cleaner started deleting images in production. To figure out what was going on, let's go back and look at what it was doing, and specifically at the find-all-images-in-use function.
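Here is a rough sketch of what that function did; the names and structure are hypothetical, reconstructed only from the description that follows.

```python
def find_clusters():
    # Hypothetical helper that lists our clusters. After a recent change,
    # an error here was swallowed and an empty list was returned instead of failing.
    return []

def images_used_by_pods(cluster):
    # Hypothetical helper: the set of image references used by pods in one cluster.
    return {f"{cluster}/some-service:abc123"}

def find_images_in_use():
    active_images = set()
    for cluster in find_clusters():       # empty cluster list -> loop never runs
        active_images |= images_used_by_pods(cluster)
    return active_images                  # empty set: every image now looks unused

print(find_images_in_use())   # set() -- nothing is "in use", so everything looks deletable
```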
Basically, it would find all the clusters, keep a list of active images, loop through each cluster, figure out what images are being used by all the pods, add those to the list, and return the list of active images. Well, in a recent change, an error case was introduced, and on error it would just return an empty list. You can guess what happens afterwards: since our list of clusters is empty, it finds no images in use and returns an empty list of images. Then, coming back up a level, it loops through each repository and deletes all the old images except for... nothing. So it just started deleting all the old images.

Our second mistake was a lack of proper alerting. ECR cleaner was failing, but we weren't getting alerts for that. We also weren't getting alerts for the ImagePullBackOff errors happening in production across multiple pods. Once we figured out what was happening, our fix was to first shut down ECR cleaner to prevent it from deleting more images, then look for crash-looping containers, look through the ECR cleaner logs, cross-reference them to figure out which images were missing, and manually rebuild them all. On the right, you can see a teammate rebuilding all these images and keeping track, and this is only a short snippet of all the images we had to rebuild. It was very tedious, so thank you, kind teammate. Our takeaways here were that the more critical the service, the more crucial thorough testing and review become; that we should have proper error handling and alerts for infrastructural issues; and that we should make sure our fixes aren't causing problems elsewhere. In this case, because we were rebuilding so many images, we were slowing down other people's builds a lot, and we should have foreseen that. Be careful what you break.

As part of our oncall readiness goals, we started doing fire drills. This is what they looked like: someone would set something up and send a message in our Slack channel. For example: fire drill, CI is broken and we cannot do a build for ECR cleaner; we need to ship a new version to production that displays a new log line on startup, without relying on CI. Disclaimer: CI is not really broken. Do not panic. This is a drill. This is only a drill. We did a lot of these, and we found a lot of bugs and a lot of places to improve our code and our documentation, and it was great. But there was one particular fire drill that was pretty interesting and also pretty insightful, and it had to do with the admission controller.

This was actually one of the first fire drills I did as a new member of the team. It was pretty simple. A change was made in the admission controller to block any creation of replica sets. It looked something like this: a new rule that said, if the resource being applied is a replica set, return an error. The fix was supposed to be simple: just revert the change, commit, and deploy that back onto the test cluster we were using. And that's exactly what I did, and that was that. It was all going fine until, a few days later, the same person who had set up the fire drill sent this in our team chat: I deployed onto the test cluster, but it's as if the admission controller didn't get deployed, even though the deploy went through. Has anyone seen this before? Well, I didn't know it at the time, but as it turned out, I had seen it before, when I went back to deploy that old commit.
I thought the deploy had gone through, but the admission controller didn't actually get deployed. What happened was that there was a bug in our deploy process that reported success for failed deploys, specifically for the admission controller in this case. But why was the admission controller failing to deploy in the first place? Well, it was because the admission controller had not whitelisted itself. The admission controller is itself deployed via a replica set, so by blocking all replica sets it was also blocking itself from being deployed. So we were locked out.

Our solution: just delete the ValidatingWebhookConfiguration for the admission controller, and that way we should be able to bypass the admission controller step. But this didn't work either. That's because when a request is sent to the API server, it goes through a bunch of steps and eventually lands on the validating admission step, at which point it looks at the ValidatingWebhookConfiguration resource, which is what we had deleted, to figure out which webhooks it should get confirmation from. But as it turns out, in our admission controller deploys, the ValidatingWebhookConfiguration is applied before the webhook itself. This meant that by the time we got to updating our webhook, which is where the bad rule was, the ValidatingWebhookConfiguration had already been put back in place. So the actual solution: just nuke the whole thing. Delete the admission controller's webhook configurations, delete the deployment, and then redeploy the entire thing from scratch.

So our takeaways here: double check that your deploys actually went through. Make sure that the thing that controls your deploys can actually fix itself with another deploy. Be aware of Kubernetes apply ordering. Even a simple fire drill can reveal a lot of insights and bugs. And, TL;DR, never, never write a rule that blocks all replica sets.

All right, to recap, some of our top 10 takeaways: one, do test new features in test clusters. Two, be aware of the existence of your mutating admission controllers. Three, remember the non-paved paths, or whatever new paths you might be paving. Four, don't drain your master nodes. Five, be ready to scale vertically. Six, init containers are versatile despite their simplicity and are good solutions, even if they're only temporary. Seven, some throttling is inherent and is okay. Eight, be wary of what other processes may be running in the pod and adjust limits accordingly. Nine, have proper error handling and alerts for infrastructural issues. And ten, be aware of Kubernetes apply ordering.

Right. Thank you, everyone. This is the end of our presentation. If you would like to learn more, check out our engineering blog. Airbnb is also still hiring, and a lot of the work we shared today was not done by just me or Joseph; it was done by our wonderful team. So if you would like to work with us, reach out and apply. And if you have any more questions, feel free to contact us to follow up. Thank you.