Hello everyone. I'm very happy to virtually present today. I'd rather be in the same room as you, but we can hopefully do that next year. First of all, let me introduce myself: I'm Laurent Bernaille, I'm a staff engineer at Datadog, and I mostly work on Kubernetes topics. Today I'm going to share a few stories of incidents we had, from which we learned a lot, and I'm going to share what we learned during these incidents.

In case you don't know Datadog, we're a SaaS-based monitoring company. I gave some figures on the left-hand side of the slide, but today we're not going to talk about Datadog the product; we're going to talk about our infrastructure. To give you an idea, because this is what will matter today, we're running dozens of Kubernetes clusters, some of them with up to 4,000 nodes, and we have tens of thousands of nodes overall. And we're growing pretty fast. In terms of context, we've shared a few stories about our migration to Kubernetes over the last two years, and I gave some examples there. What has changed in the meantime is that our infrastructure has kept growing, and as it gets bigger we've found new scalability issues and also much more complex problems. Doing this, we've faced a lot of issues, and today I've picked three different incidents that I personally like for different reasons.

The first one is actually the title of this talk, and we're going to talk about this incident because it's interesting how interactions between different components made this very surprising event happen. This incident started when an engineer reached out to us and told us: my namespace and everything in it is gone. Of course, the first thing we did was help the team recover and redeploy everything. But then we wanted to understand what had happened, right? So we started the investigation, and the first thing we did was look at the audit logs to see what happened. And of course, we found a lot of deletion calls for every single thing in the namespace, from ReplicaSets to Events and Deployments. This was matching perfectly the issue the engineer had seen. However, we didn't see the delete namespace call, the initial call that should have triggered this whole event. So of course, the question became: why was the namespace deleted?

We kept looking at the audit logs and couldn't find anything, so we zoomed out, looking at all the events for this specific namespace. And after zooming out enough, we discovered a delete call on the namespace four days before the deletion calls highlighted there. This is when the namespace was successfully deleted. However, this successful namespace deletion happened on the 13th, while the resource deletions happened four days later. So of course, we reached out to the engineer who made the delete call and asked: why did you delete the namespace? What happened? And he told us: well, I didn't delete the namespace. So it was all very weird. We kept discussing and he explained to us that he was migrating to a new deployment method.

So let's take a look at how deployments work at Datadog. The V1 version of our deployment method is the one on this slide, where engineers would modify the charts, the change would go through CI, and we would create a rendered chart with everything in it. At a later time, another engineer would trigger a deployment using Spinnaker, which would retrieve the rendered chart and use kubectl apply to apply it to the cluster. This was working fine, but we were trying to make things better.
And we decided to move to a new version that is described here, where the first part is the same, but at the end, to deploy to the cluster, we use helm upgrade instead of kubectl apply. So it looks very similar, right? But there's a difference. Say you use the first method and you have a chart with three resources, A, B, and C, and you create a new version of this chart with new versions of B and C and where you remove A. When you do kubectl apply, B and C will be updated, of course, but A will remain, because kubectl is never given it. However, when you use helm upgrade, Helm keeps track of the resources it deployed. So it's going to see that A is not there anymore and it will remove it. That's a big difference, right? If you remove something from your chart, it's not going to be in the cluster anymore. And what happened here is that the chart was refactored and the namespace was moved out of the chart. So when the chart was reapplied using Helm, the namespace was deleted.

So now we understand why the namespace was deleted. But what was very surprising was: why did it take four days? To understand what happened, we had to look at the namespace controller code. Here on this slide, you can see the function responsible for handling the delete namespace call, and this function is made of two important parts. The first part is getting all the available API resources, and the second part is iterating over all these API resources and deleting them in the namespace. Pretty simple, right? Let's zoom in on the "discover all API resources" part. This is how the namespace controller discovers all the API resources that are namespaced, and in the part I highlighted there, you can see that if the discovery call fails, the function immediately returns. Which means that if you can't discover all the resources in the cluster, the content of the namespace won't be deleted. Note that this behavior changed in Kubernetes 1.16, but this is what happened here.

When we saw that, we looked at the controller manager, where the namespace controller runs. And in its logs, we saw many occurrences of this line saying it was unable to retrieve the complete list of server APIs, because one of them, the metrics API, was not answering. The problem with this log, and what makes it a bit hard to parse, is that there's no context: you don't know that it is coming from the namespace controller, so it's hard to link to the incident. But now we know that this is the error from the discovery call, and we have a likely cause: the metrics API.

So let's talk about the metrics API. If you're familiar with Kubernetes, you've probably heard about metrics-server. Metrics-server is an additional controller providing metrics about nodes and pods, so you can do kubectl top, for instance, or use horizontal pod autoscaling. And it registers an APIService, which is an additional set of Kubernetes APIs, in that case the metrics API, managed by an external controller. When this APIService is registered, two pieces of information are provided: the additional APIs you're serving, and where to forward these API calls, in that case the metrics-server service in the kube-system namespace.
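Putting the two pieces together, here is a minimal sketch of the pattern I just described, using client-go's discovery interface. It is illustrative only, not the actual namespace controller code, and the function and comments are mine:

```go
package controller

import (
	"fmt"

	"k8s.io/client-go/discovery"
)

// deleteAllContent sketches the pre-1.16 behavior of the namespace controller:
// if discovering the namespaced API resources fails (for example because an
// aggregated API like metrics.k8s.io is unreachable), it bails out and nothing
// in the namespace gets deleted until discovery succeeds again.
func deleteAllContent(d discovery.DiscoveryInterface, namespace string) error {
	resourceLists, err := d.ServerPreferredNamespacedResources()
	if err != nil {
		// Early return: the namespace stays terminating, but its content
		// survives, which is what we observed for four days.
		return fmt.Errorf("discovery failed, skipping deletion: %w", err)
	}
	for _, list := range resourceLists {
		for _, r := range list.APIResources {
			// The real controller deletes every instance of each resource
			// in the namespace at this point.
			fmt.Printf("would delete all %s in namespace %s\n", r.Name, namespace)
		}
	}
	return nil
}
```

If an aggregated API like the metrics API doesn't answer, the discovery call returns an error and, before Kubernetes 1.16, the deletion loop was simply skipped.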
So let's go back to the series of events so far. On the 13th, the namespace was deleted. Over the next four days, the namespace controller failed to list resources and, as a consequence, didn't delete the content of the namespace. Then something happened on the 17th that unblocked the namespace controller, and everything was deleted. So we looked at events on metrics-server, and we discovered that on this date metrics-server was OOM-killed and restarted, and afterwards it was working fine, the namespace controller was unblocked, and everything was deleted.

So the next question was: why was metrics-server failing? To understand this, we need to look at how metrics-server is deployed in our setup. Metrics-server uses a certificate to verify its identity, and we have to rotate this certificate, so we use a sidecar to generate the certificate on a regular basis and signal metrics-server to reload it. That was completely fine, but it requires a specific Kubernetes feature called shared PID namespace, so the sidecar can actually find the metrics-server process. And for this feature to work, you need to modify the spec of your pod, but you also need feature gates enabled on the API server and, most importantly for our example here, on the kubelet.

In terms of how metrics-server was deployed, it was running on a set of nodes where we run all our controllers, and this was working completely fine. Except it wasn't exactly like this; it was actually looking like this. What I want to highlight in this new diagram is that the nodes were a bit different: we had recent nodes where the feature gate was enabled and old nodes where the feature gate was missing. When we had deployed metrics-server, it had landed on a new node and everything was fine. At one point it was rescheduled onto an old node and things started failing, because the sidecar wasn't able to signal metrics-server, so metrics-server didn't reload its certificate, and then the API server was failing to contact it.

So here is the overall sequence of events. Sometime before the 13th, metrics-server was rescheduled onto an old node and discovery calls started failing. After being OOM-killed on the 17th, it picked up the latest certificate that the sidecar had generated on disk, and everything was working fine again. And in our case that was a bad thing, because when everything started working fine again, everything was deleted.

So key takeaway: API server extensions are great, because you can extend your Kubernetes cluster with new APIs, but they can be tricky. If the API is not used very much, these extensions can be down for days without any impact, right? However, some controllers rely on these APIs being available to work properly, and that can trigger issues. In addition, the communication between the API server and API services is tricky to do right in terms of security, and if you're interested, I really invite you to go and see this talk by Tabitha, a colleague of mine, where you'll learn a lot about TLS and API service extensions.

We're now going to talk about a very different incident, and I really like this one because it's a low-level bug that also involves networking, which I tend to like quite a bit. Luckily, this incident only happened in staging, but it's still interesting, so I'm sharing what we learned from it. So we run nodes for Kubernetes, and to do that, of course, we have to build images. An engineer on the team pushed a small change, and when we were testing this new image, we noticed that nodes were becoming NotReady after a few days.
And what was weird is that this change was in no way related to the kubelet or the runtime, but of course, the nodes were not working, so we couldn't promote the image to production. So we started the investigation, and the simplest thing to do was to describe the node. We can see this message here, where the kubelet was not ready because the container runtime was down. Container runtime is down? Well, we use containerd as the runtime, so the first thing we did was check whether containerd was working properly, and crictl seemed to say: yeah, containerd is working fine. So we went back to the kubelet logs and we found these interesting lines, especially the one highlighted in the middle of the slide: status from runtime service failed.

So we wondered, maybe we can use crictl to make the same status call. It turns out there is a CRI method to do that, and we wanted to invoke it with crictl. There's no crictl status subcommand; however, there is an info subcommand that looks very close, and we checked in the code: when you call crictl info, it actually invokes the status CRI method. And look, we ran crictl info and it was completely blocked. So now we knew that there was definitely an issue and that the kubelet was right.

So we looked at the code of the status function in containerd, and it doesn't do much except invoke this function here, netPlugin.Status. The next step was to look at this function. It is not really doing much either, as you can see: it's basically verifying that the CNI configuration was successfully loaded. And this error, we already know it: it's going to tell you that the network is not ready, and it's very clear in the kubelet logs. However, that's not what we were seeing there. We were seeing a timeout, and if you look at the code closely, you will see that on line 126 we have a lock attempt. We have something that hangs, we have a lock: could this be related?

What we did next was take a dump of all the goroutines from containerd to find blocked ones, and we found quite a lot of them. As you can see here, all these goroutines are blocked trying to acquire a lock, and there's a new one every two minutes, which is interesting because the errors we were seeing in the kubelet were also happening every two minutes. And if we look at the stack trace from one of them, we can see that the lines match exactly what we were looking at before in the code. So that's exactly the problem, right? The lock is the problem.

It definitely seemed CNI-related now. So let's bypass containerd and the kubelet and see if we can reproduce more simply. What we did is simply create a network namespace and use this very convenient tool, cnitool, that invokes CNI plugins without going through the kubelet and containerd. It worked fine, and as you can see at the bottom of the slide, the networking was set up okay and we were able to ping outside of the network namespace, so everything was fine. The next step was to try to delete the network namespace, and this completely blocked too. So we had identified at that point that the delete call was the problem. After doing that, we looked at the process tree and we found this command here that had been blocked for quite some time.

So let's talk about how we do CNI. In this cluster, we use the Lyft CNI plugin, which we've been extremely happy with. And the delete call fails in one of the chained plugins, the unnumbered ptp one here. So what we did is modify the code of this plugin to add additional logs.
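As an aside, here is a small, self-contained Go illustration of what was happening as far as the lock is concerned. It is not the actual containerd or CNI library code, just a sketch of how one stuck network teardown holding a lock makes an otherwise cheap status call hang:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type plugin struct {
	mu sync.RWMutex
}

// DelNetwork simulates a CNI DEL that never returns (for example a netlink
// call hanging in the kernel) while holding the write lock.
func (p *plugin) DelNetwork() {
	p.mu.Lock()
	defer p.mu.Unlock()
	select {} // hangs forever
}

// Status only checks in-memory state, but it still needs the read lock,
// so it queues up behind the stuck DEL.
func (p *plugin) Status() error {
	p.mu.RLock()
	defer p.mu.RUnlock()
	return nil
}

func main() {
	p := &plugin{}
	go p.DelNetwork()
	time.Sleep(100 * time.Millisecond) // let the DEL grab the lock

	done := make(chan struct{})
	go func() { p.Status(); close(done) }()

	select {
	case <-done:
		fmt.Println("status ok")
	case <-time.After(2 * time.Second):
		fmt.Println("status blocked: the runtime would be reported as down")
	}
}
```

This is the pattern we saw in the goroutine dump: one delete operation stuck forever, and a new status call queuing up behind it every two minutes. With that picture in mind, back to the plugin logs.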
And after a few tests, we traced the problem to this call here. What's weird is that this is a call into the netlink library, a low-level library; it's not code from the Lyft CNI plugin itself. So we went to the netlink repo to look for clues, and we found this pull request, which says that this specific function doesn't work starting with kernel 4.20 and needed to be fixed. And I really want to thank Daniel Brotman for finding the problem and fixing it, because after updating the netlink library in our CNI plugin, everything started working perfectly again.

So as a summary, what happened here is: we had configured Packer to use the latest Ubuntu 18.04 image, and at one point the latest Ubuntu 18.04 moved from kernel 4.15 to 5.4. And as we've just seen, we had an issue with kernels newer than 4.20. The reason the behavior was so weird, and why it took a few days for things to break, is that this was a test cluster and the bug only happened on pod deletion. So you need a pod to be deleted to actually trigger it.

Key takeaways from this incident: we tend to consider nodes as abstract compute resources, but we need to remember that nodes are actual instances where you have a kernel, a distribution, hardware, and all of these things can fail. Also remember that error messages can be pretty misleading. Here the kubelet logs were saying that the runtime was down, when actually only a subpart of the runtime was down, and that made it hard to diagnose.

We're now getting to the last story I want to share with you today. This one, you're going to see, is pretty straightforward in terms of what happened. However, we learned quite a lot from it, and I want to share that with you. The problem started with users reporting connectivity issues to a cluster. When we checked the API servers, they were not doing very well, and they were not doing very well because they were restarting: they couldn't reach etcd. So the next step was: let's look at etcd. As you can see from this memory usage graph here, etcd wasn't doing very well either. It was even getting OOM-killed, which is not a great thing for etcd, right?

What we knew to start the investigation is that the cluster size was basically the same as it was a week before, and we hadn't done any upgrades to the control plane. So it's very likely that something changed in the cluster, and especially something that interacts with the API server. Let's look at the number of requests the API servers are serving. If we look at this graph here, we can see spikes of requests at the time of the incident, and especially spikes in list calls.

So we're now going to talk about list calls, because list calls are very expensive. To understand why they are very expensive, we need to understand how API server caching works. API servers, to make it very simple, are big caches in front of etcd. When they start, they list resources and then start watching these resources to keep their cache up to date. That's pretty simple, right? Then, when clients want to access the API servers, they can do get or list calls. And what is very important here, as you're going to see in the next few slides, is that when you get or list a resource, you basically say which resource version you want, because every single resource in Kubernetes has a resource version. This allows you to get a consistent view: whichever API server you connect to, if you specify a resource version, you're going to get the same answer.
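To make that a bit more concrete, here is a tiny client-go sketch; the pod name and namespace are just placeholders, and it assumes it runs inside the cluster:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Build a client from the in-cluster service account (assumption: we run in a pod).
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Every object carries a resource version in its metadata. It is what get,
	// list, and update calls use to agree on which version of the object they
	// are talking about.
	pod, err := client.CoreV1().Pods("default").Get(context.TODO(), "web-0", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("resourceVersion:", pod.ResourceVersion)
}
```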
Also, if you do an update, you specify the resource version you want to update, which prevents conflicts when two processes want to update the same resource, right? Because if the resource has changed, your update call will fail, and you will get this error that you've probably already seen, saying the resource version is too old, and then you have to retry. So if you do a get or list and set the resource version to something, you'll get this version from the cache of the API server. There is a specific case where you set the resource version to zero, which will give you the latest version from the cache. However, what about when you're not setting a resource version at all? In that case, the API server completely bypasses the cache and actually gets the data directly from etcd. And the reason it does that is that, since you're not specifying anything, you're getting the most consistent data you can get. What's important is that this is the behavior you have when you use kubectl get, but it's also the default behavior you get when you use client-go, the default library used to build Kubernetes clients: it's going to be exactly the same.

So let's illustrate the difference. As I was saying, kubectl never sets the resource version, so we have to use curl. In this example, I'm getting all the pods in the cluster. In the first case, I'm not setting a resource version, and it takes more than four seconds. In the second case, when I set resourceVersion=0, it's much faster, almost three times faster. I did this test on a pretty large cluster, which explains why it takes so long, but I'm also using the table view: I'm not getting the full JSON objects, just a summary version. I did that to minimize transfer time, because on this cluster the full JSON was about one gigabyte.

Something I want to point out is that sometimes you want to get very specific resources and you use label selectors. In that case, the amount of resources you're going to get back from the API server is going to be very small, a few pods. In this example here, I'm getting only the pods from app A, and in that case it's fewer than five pods. And you can see something interesting here: if I don't set a resource version, it still takes 3.6 seconds. It's faster than before, but not that much faster. However, if I set the resource version to zero, it's extremely fast. The reason for this is that when you're not setting a resource version, all pods are still going to be retrieved from etcd and the filtering will happen on the API server. Whereas when you set a resource version, the API server will filter locally from its cache, and of course this is going to be extremely fast.

If you've used client-go to build controllers, you've probably heard about informers. Informers are a way to have clients that are much more efficient in the way they talk to the API server, because they maintain a cache of the resources they're interested in and they receive events when these resources change. And they're optimized to be very nice to the API server: when they start, they do a list with the resource version set to zero, so they get data from the cache, and then they start a watch based on the resource version they got. There are some edge cases and reconnections that can trigger a list without the resource version set, which is bad, as we said before, because then the call will make it to etcd. And in Kubernetes 1.20, this should be much better, because watch reconnections should avoid these very expensive calls.
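Here is what that looks like in practice: a minimal informer sketch with client-go. It's a simplified example, not our production code, and it assumes it runs in-cluster:

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// A shared informer does one initial list (served from the API server
	// cache) followed by a watch; the local cache is then kept up to date
	// from events, so the controller never re-lists on every decision.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			fmt.Println("pod added:", pod.Namespace+"/"+pod.Name)
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	cache.WaitForCacheSync(stop, podInformer.HasSynced)
	select {} // keep running and consume events
}
```

The important part is that after the initial list, the controller answers everything from its local cache instead of hammering the API server.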
So as a summary, remember that list calls go to etcd by default and can have a huge impact on your cluster, even if you use label filters and only get back a very small set of pods or nodes, for instance. The main message is: avoid list calls and use informers. Something I also wanted to point out is that when you use kubectl get, you actually do a list with the resource version not set, which is bad for etcd. I believe kubectl could have an option that would allow you to set the resource version to zero, so you get data from the cache. It would be much faster and make for a much better user experience, and it would be much better for etcd and the control plane in general, with a small trade-off: you would have a small inconsistency window, because the API server cache can be a bit behind etcd. I really want to thank Wojtek for his help getting this right. I'm sure a few imprecisions remain, but he helped a lot, so thanks again, Wojtek.

So, coming back to the incident: we know the problem comes from these list calls. The next step is to identify which application is making them, so we're going back to the audit logs. As I was saying before, audit logs are very convenient because they can tell you what happened, when, and who did it. But they also contain information about when a request started and when it finished, so you can compute the query time. In this graph here, we're showing the cumulative query time for this call, broken down by user. As you can see, we have a single user that is responsible for more than two days of query time over 20 minutes, which, if you do the math, is about 100 threads processing list calls at any given time during this window. That's a lot. This is the same view aggregated over a week, in table form, and you can see that this specific user is responsible for about 40% of the query time on the API servers. That's a lot.

As you may have seen on the previous slides, the user responsible for this was called node group controller. So let's talk about the node group controller. It is a controller we built to allow teams to create pools of nodes, where they can specify the instance type they want to use, for instance, and we've used it extensively for two years without any problem. However, we had just deployed a recent upgrade to provide deletion protection and avoid deleting node groups that were still running workloads. What this feature does is check whether workloads are still running and refuse the deletion if they are. Technically, this is implemented as an admission controller. When it sees a node group delete call, the controller lists all the nodes of the group based on labels. Remember that when you do that, even if the number of nodes you get back is small, you're still going to hit etcd and retrieve all the nodes. And what's important for the next step is that some node groups can be a bit big. The next step is to list all the pods for these nodes. But remember that getting all the pods for these nodes also means getting all the pods from etcd and filtering on the API server. To make things worse, all these calls were made in parallel. So that's a lot of calls making it directly to etcd, and of course, this is what broke etcd and triggered the incident.

So what we learned here is that these list calls are very dangerous, because the volume of data can be extremely large and they will hit etcd most of the time. I said it before, but I'm saying it again because it's an important message: use informers whenever you can.
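And if something really does need an occasional list, here is a hedged client-go sketch of the difference I described earlier; the label selector is just an example:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.TODO()

	// Default behavior (no ResourceVersion): a consistent read that goes all
	// the way to etcd, even though the selector only matches a few pods.
	fromEtcd, err := client.CoreV1().Pods("").List(ctx,
		metav1.ListOptions{LabelSelector: "app=a"})
	if err != nil {
		panic(err)
	}

	// ResourceVersion "0": served from the API server watch cache. Much
	// cheaper for etcd, at the cost of a small inconsistency window.
	fromCache, err := client.CoreV1().Pods("").List(ctx,
		metav1.ListOptions{LabelSelector: "app=a", ResourceVersion: "0"})
	if err != nil {
		panic(err)
	}

	fmt.Println(len(fromEtcd.Items), len(fromCache.Items))
}
```

The second call never touches etcd, at the price of potentially slightly stale data.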
Another thing: audit logs are extremely useful. Not only can they tell you who did what and when, they can also tell you which users are responsible for query time, which is very helpful when your API servers are overloaded, because then you can know what is happening.

So in conclusion, I shared these incidents because I found they were a very good occasion to learn a lot about the inner workings of Kubernetes, and that's something I wanted to share. As a very quick summary of what we learned today: API server extensions are extremely powerful, but they can harm your cluster, so be very careful with them and monitor them. We tend to forget that nodes are not abstract resources, which is tempting, especially when you run in the cloud, right? You just create a VM and it works. But you can have kernel issues and other low-level issues. And especially if you're running large clusters, you probably need to understand how API server caching works, because it's very important and it drives the performance of your cluster. Be extremely careful with how clients interact with the API server and avoid list calls; I know I've said it a few times, but it's important.

We're done for today. If you're interested in diving into Kubernetes and managing large clusters, we're hiring. Don't hesitate to reach out to me either by email or Twitter or through the Kubernetes Slack. Thank you very much.