Welcome to day two of KubeCon Chicago. It's great to be here. My name is Hemant, and today we're here to talk to you about the hardest incident we ever had to deal with at Datadog.

What is Datadog? Datadog is a cloud monitoring, security, and observability company. And at Datadog, we run Kubernetes at a large scale. We run hundreds of self-managed Kubernetes clusters, which help us manage tens of thousands of Kubernetes nodes. And some of our Kubernetes clusters have 4,000-plus nodes in them.

So what happened? What was this hardest incident? On March 8, 2023, our platform experienced a global outage, and our users were unable to access the platform for around 10 hours. And it would take us another 12 hours to recover the last major service. And this happened because we lost 60% of our compute capacity, across multiple cloud providers, distributed in several availability zones, in the span of one hour. This is the story of how we lost almost everything, everywhere, all at once, at Datadog.

During the incident, our own Datadog employees realized that they were not able to access the Kubernetes control planes. And our admin teams were having trouble trying to even SSH into the nodes to try and debug what was happening. So at this point, it seemed like a widespread network outage. So did we roll out a bad change, fleet-wide? No. We actually have explicit policies in place to prevent our engineers from deploying code too fast. So how did this happen? In fact, we run completely independent stacks with no explicit dependencies between them. So there's no way we could have deployed something that could impact all of our data centers at the same time.

Luckily for us at Datadog, we have a strong culture of incident response. So it was pretty easy for us to get all hands on deck pretty quickly. Our engineers who build these services own them end to end. And as you can see from the screenshot, we had around 400-plus engineers working in different breakout rooms trying to figure out what was happening.

Soon enough, one of our engineers figured out that we were able to recover some of the lost nodes by simply restarting them from the Google Cloud Console or via the GCP API. And once we recovered those nodes, we were able to SSH into them and take a closer look at the system logs. And those system logs told us that there had been an unattended upgrade on these nodes. And soon after the unattended upgrade, these nodes were losing network connectivity.

So what are unattended upgrades? Unattended upgrades are our legacy way of getting critical security updates onto our nodes. At the same time, at Datadog, we've also been working very hard to build a node lifecycle automation platform that can replace thousands of nodes every day with minimal impact to applications. And it does that in a Kubernetes-native way. But as you can imagine, this is a long migration, so at this point in time, we had both of these systems enabled. And if you're interested in learning more about this node lifecycle automation platform, my colleagues are giving a talk later today. Do check it out.

Using this node lifecycle automation platform, we were upgrading our fleet from Ubuntu 20.04 to Ubuntu 22.04. And as you can see from this graph, we started our migration in November 2022, and by March 2023, 90% of our fleet was upgraded to Ubuntu 22.04. So we wanted to take a closer look at what exactly was included in these unattended upgrades.
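For readers who want to see what that node-recovery trick looks like in practice, here is a minimal sketch of resetting an instance through the GCP API, assuming the google-cloud-compute Python client. The project, zone, and instance names are placeholders, and this is not the exact tooling used during the incident.

```python
# Sketch: hard-reset a broken Compute Engine instance via the GCP API,
# the programmatic equivalent of pressing "Reset" in the Cloud Console.
# Assumes the google-cloud-compute client library; names are placeholders.
from google.cloud import compute_v1

def reset_instance(project: str, zone: str, instance: str) -> None:
    client = compute_v1.InstancesClient()
    operation = client.reset(project=project, zone=zone, instance=instance)
    # Recent client versions return an operation object we can wait on.
    operation.result()
    print(f"Reset requested for {instance} in {zone}")

if __name__ == "__main__":
    reset_instance("example-project", "us-central1-a", "broken-node-1")
```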
Turns out there was a patch for a vulnerability in systemd that was being addressed by these unattended upgrades. But this patch had nothing to do with networking. And all we knew so far was that any node that received these patches seemed to be broken, and that this had something to do with systemd. So we tried restarting just systemd. And that alone was enough to break networking consistently on one of our recovered nodes.

So is there a relationship between networking and systemd? Yes. There's actually a component called systemd-networkd that manages default networking on these nodes. So we tried restarting just systemd-networkd, and that was enough to break networking on these nodes as well. And after some more investigation, we realized that this was happening only on Ubuntu 22.04 nodes. So we wanted to look at what really changed between Ubuntu 20.04 and 22.04, and we started digging into the commit history of systemd-networkd. And we discovered this very interesting commit. According to this, systemd-networkd would wipe out any IP rules that were installed by anyone but systemd-networkd itself. And now things started to make a lot more sense.

In order to understand how this would impact Datadog, let's take a closer look at what our Kubernetes networking looks like. At Datadog, we use Cilium for our container networking. And in Kubernetes, we know that every pod gets its own IP address and its own network namespace. So most CNI plugins install a few route table entries or IP rules to make sure that the kernel can do the necessary routing and pods can send and receive packets. So on a systemd-networkd restart, all these IP rules that were installed by Cilium were completely wiped out. And that explains why these nodes were losing network connectivity.

So it turns out systemd-networkd never needs to restart in the happy path. And a patch for a completely unrelated CVE in systemd triggered a restart of systemd-networkd. And that restart of systemd-networkd broke networking on these nodes. On top of that, unattended upgrades run everywhere on our fleet between 6 AM and 7 AM. And that explains why we lost so much compute capacity in such a short span of time.

So how did we recover from this? In order to walk us through what happened next, I would like to invite Laurent, principal engineer at Datadog, to walk us through it.

So as you can imagine, recovering from this incident was pretty challenging, and it involved multiple steps. As Hemant was saying, all our regions were impacted, but they were impacted differently. For instance, on GCP, the web pages were not even loading. And on AWS, we had a web page, but with limited contents. So we decided to focus first on GCP. And the first step we had to go through was to get our Kubernetes clusters into a healthy state.

As Hemant explained, we had discovered pretty early on in the incident that we could simply fix a GCP node by rebooting it. That's simple, right? So first, well, we had to recover our Kubernetes control planes so we could know what the state of the clusters was. And this is what we started to do early in the morning: we started to recover the nodes of the Kubernetes control planes. And at this moment, we started to get a full picture of the impact of the incident, because we discovered that 60% of the nodes were not ready and down. And at this time, of course, we had to recover all these nodes. And this graph is showing all the nodes we had to restart, and each color is a different cluster.
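To make the root-cause failure mode above concrete, here is a small diagnostic sketch, assuming a disposable Ubuntu 22.04 test node with iproute2 and systemd available. It snapshots the policy routing rules before and after a systemd-networkd restart, so rules installed by anything other than networkd (for example, a CNI plugin) show up as removed. It is an illustration, not the tooling used during the incident.

```python
# Diagnostic sketch: compare "ip rule" output before and after restarting
# systemd-networkd, to see whether rules installed by the CNI survive.
# Run as root on a *test* node only; restarting networkd may break its networking.
import subprocess

def ip_rules() -> set[str]:
    out = subprocess.run(["ip", "rule", "show"],
                         capture_output=True, text=True, check=True)
    return {line.strip() for line in out.stdout.splitlines() if line.strip()}

before = ip_rules()
subprocess.run(["systemctl", "restart", "systemd-networkd"], check=True)
after = ip_rules()

removed = before - after
if removed:
    print("Rules removed by the restart:")
    for rule in sorted(removed):
        print(" ", rule)
else:
    print("No rules were removed.")
```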
And of course, you can imagine that we have thousands of nodes in this environment, and it took us time. I said earlier that the AWS regions were in a better state because the web page was loading. So very quickly after we started working on fixing the GCP regions, we started looking at the AWS ones. And we discovered that instances were running and healthy, but they were very recent. I mean, they had been up for only a few hours at best. And by investigating a little bit more, we actually discovered that the instances had been replaced.

And the reason for this is, if you're familiar with GCP and AWS: on GCP, when you run an instance in a managed instance group, there are no health checks. So if the instance loses its network, it just remains there without network. And if you restart it, it gets fixed, in our case. On AWS, however, the autoscaling group will detect the instances as being unhealthy, and it will actively replace them. That's amazing, right? Because we auto-healed.

Well, except we actually run a lot of data stores at Datadog. And these data stores quite often use local disks. Once again, if you're familiar with AWS, you will know that when you use local disks and you replace the instance, you lose the data. And here is a list of examples of data stores we use. We use a lot of open source ones and internal ones. Of course, all of them are very resilient to the loss of a single node or a few nodes. But given the impact of this incident, where we had lost 60% of our nodes, you can imagine that in many cases we had lost quorum and actually lost data. We don't store any source of truth in these data stores, so we were able to rebuild the data, either from backups or from other sources of data. But of course, it took us time.

So at this point, our Kubernetes clusters were healthy and we had a pretty good idea of what the state of the world was. However, we had accumulated a lot of backlog because, of course, our users kept sending us data. And so we needed to process all this backlog. And to do this, we needed to scale up many of our applications to process it all in a timely fashion.

And luckily, at Datadog, we're cloud-first. And cloud is elastic. Well, the problem is, at our scale, elasticity is not always an easy promise. A very good example is that instances don't run in a vacuum. They run in subnets. And these subnets have a finite size. And we always make sure that we can absorb the loss of an availability zone. However, we scaled so fast during this incident that we actually ran out of IPs in some subnets.

Another problem we faced is quotas. Of course, we are very proactive about quotas with the cloud providers at Datadog. And one of the very typical quotas we're careful about is the maximum number of instances you can run in a project on GCP, the red line on this slide. And as you can see, during the incident, we always remained below that line. However, you can see a flat blue line in the middle. It turns out, during the incident, we actually hit a quota we didn't know about, which is the maximum number of instances you can have in a peering group, when you have multiple VPCs peered together. Once we had discovered this, thanks to our very strong relationship with GCP, we were able to create a ticket and very quickly get the quota raised. But, of course, it took us time.

Another example: even though we run our own Kubernetes clusters, we heavily depend on the cloud providers' APIs.
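As an illustration of how the replaced-instances symptom on AWS can be spotted, here is a hedged sketch using boto3 that lists running instances launched within the last few hours. The region and time window are placeholders, not the actual queries run during the incident.

```python
# Sketch: list running EC2 instances launched in the last few hours, a hint
# that an autoscaling group may have replaced them (and local disks are gone).
from datetime import datetime, timedelta, timezone
import boto3

def recently_launched(hours: int = 6, region: str = "us-east-1"):
    ec2 = boto3.client("ec2", region_name=region)
    cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
    recent = []
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    ):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                if instance["LaunchTime"] > cutoff:
                    recent.append((instance["InstanceId"], instance["LaunchTime"]))
    return recent

if __name__ == "__main__":
    for instance_id, launch_time in recently_launched():
        print(instance_id, launch_time.isoformat())
```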
And, of course, when you scale very fast, you're going to ask a lot from these APIs. And in this example here, we can see an extreme one, where we have a controller that is responsible for allocating network interfaces to nodes. And this controller was trying to allocate interfaces. But, of course, we were creating hundreds, thousands of nodes. And at some point, this controller started to get rate limited. And because it was actively retrying, we got into a state where about 100% of its requests were being rate limited. To recover from this, we had to decrease the pressure, by reducing the number of nodes requiring additional network interfaces, and by asking AWS to increase the limits, which they also did pretty fast.

And this gets us to step three. Now that we had healthy clusters that were able to get additional nodes, we needed to recover all the applications we have. And, of course, we wanted to do this as fast as possible, so we wanted to do it in parallel. However, as you can imagine, we have a very complex microservice architecture. And as you can see in this example, the chain of dependencies is often pretty complex. So we did our best to do things in parallel, but, of course, we had to make sure that we were recovering things in order to bring data back.

So we learned a lot of lessons from this incident, and we're going to share a few today. The first thing is, we're doing our best to make sure that our regions are isolated. But the problem is, they run the same operating system with the same configuration. And in this case, this is why we had a shared fate, where all the instances worldwide failed at the same time.

Another thing we learned the hard way is that very simple abstractions, like instances and autoscaling, can actually leak and have subtle differences between providers. I mentioned the difference between the AWS autoscaling group and the managed instance group on GCP, which, of course, is a small difference. But in our case, it made debugging and understanding what was happening a bit harder.

I mentioned the complex chain of dependencies we have between our applications. And, of course, during this incident, we had to restart the stack from scratch, right? And we don't do that often. So we actually discovered dependencies we didn't know about, which, of course, also made recovery a bit harder. And, finally, this one is kind of obvious, right? When you run at scale, everything is a challenge, because when you want to do large-scale operations, you're going to put so much pressure on the underlying dependencies that you're going to face very interesting issues.

So we really want to do better in the future. We want to continue to decrease the blast radius and make sure that our regions are very well isolated, and that the zones within a region are very well isolated. We also want to degrade more gracefully, because if we have limited capacity, we want to prioritize the data that matters most. And it's usually live data, because it allows you to see what's happening in your systems and to have monitors. And, finally, we already do a lot of game days and chaos tests, but we've never tried something that big. Usually, when we run these tests, we fail a single zone, nothing of that scale. So we plan to run much larger-scale chaos tests to make sure that we exercise our resiliency, but also our incident response process.
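To illustrate the "in parallel, but in dependency order" recovery idea described above, here is a minimal sketch using Python's standard-library graphlib. The dependency graph is invented for illustration and is not Datadog's actual service graph or recovery tooling.

```python
# Sketch: restart services respecting dependencies, parallelising within each
# "ready" batch. The dependency graph below is made up for illustration.
from graphlib import TopologicalSorter

# service -> set of services that must be healthy before it restarts
dependencies = {
    "kafka": set(),
    "datastore": set(),
    "metrics-intake": {"kafka"},
    "metrics-query": {"metrics-intake", "datastore"},
    "web-frontend": {"metrics-query"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()
while sorter.is_active():
    batch = list(sorter.get_ready())  # these have all dependencies satisfied
    print("restarting in parallel:", batch)
    # ... trigger restarts and wait for health checks here ...
    sorter.done(*batch)
```

Each `get_ready()` batch only contains services whose dependencies have already been marked healthy, so everything within a batch can safely be restarted concurrently while the overall ordering is preserved.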
Something we really wanted to do here is thank our partners. First, the cloud providers we mentioned earlier, because they were extremely helpful and went out of their way to help us that day. We also wanted to thank the Cilium community, of course, because the Cilium team very quickly created a patch to make sure that IP rules created by the Cilium agent would not be deleted by systemd-networkd, so that this incident would not happen to anyone again. Thank you very much. It was a bit challenging to summarize this incident in 15 minutes. So, if you're curious and want more details, we wrote a very detailed blog post. And also, Hemant and I, along with many other people from Datadog, are going to be around for the conference. Thank you very much. Thank you, everyone. Have a great day. See you soon.