Welcome to KubeCon. Welcome to our talk. It's so great to be here in person and to see everyone here. It's been a tough two years and it's really great that we're doing these things again. So today we're going to talk about the time that it wasn't DNS. Everything seemed to tell us that it was DNS: the logs. It felt like DNS. It had to be DNS. And it wasn't.

So first off, we're here representing Datadog. This is a talk about Datadog's infrastructure, not about the product that we have, but just in case you're not familiar, we're an observability platform. We monitor all the things. We have a sweet booth in the exhibition hall and you should go check it out. We're doing a raffle for a Nintendo Switch this afternoon, so you should go get your badge scanned and see what we have to offer.

Anyway, the numbers that I want to focus on for this talk are about the way that we run Kubernetes. We're really big on that technology and we run it at a big scale. Just to give you an idea, we have some figures in the right-hand column here: we run tens of thousands of nodes and hundreds of thousands of pods, across dozens of Kubernetes clusters with anywhere from hundreds to thousands of nodes each. So we're running this at a very large scale, we run on all the major cloud providers, and we're growing very quickly.

But before we dive too deep into the talk, I'll let you know who we are. I'm Elijah Andrews. I'm a software engineer at Datadog, and I work on our service discovery and network infrastructure. And I'm Laurent Bernaille. I'm a staff engineer. I work on infrastructure also, and I like to focus on and dive into weird and fun networking issues like the one we're going to discuss today.

All right, so let's talk about how this all started. We have a service called the metric service at Datadog, and the people who operate the service noticed that when they did a rollout, a rolling restart, they would see a large spike in errors. You can see there are two lines on that top graph. One of the lines, the purple one, corresponds to the server, and you can see that during a rolling restart, which is what was happening here, there's a huge spike in errors. The red line is showing this from the client's perspective. The client actually retried, so it didn't show up as an error rate on the client. However, our latency went up by a lot because the clients had to retry, and this was degrading performance in our application.

So as we always do when we have problems with our infrastructure, we checked the logs. And when we traced the requests that were failing, we saw that they were telling us that it was DNS that was having a problem. And it really looked like a DNS problem. It had to be DNS. It's always DNS, right?

Okay, so let's talk about DNS now. Before we go too far into the details, I'm just going to give you a high-level view of the applications involved here. At the center, we have the metric service. This is the service that powers our time series data system at Datadog. When you open up a dashboard in our product and it shows you time series data, that's all collected by the metric service. It also powers our alerting engine: in Datadog, you can set thresholds and monitors on time series data, and when those conditions are met, it can alert you. The alerting engine polls the metric service in order to evaluate your monitors. Behind the scenes, the metric service has a few dependencies.
It has to talk to an index for our time series data and to our storage layer. So what it's really doing is stitching together all of the time series data from the different sources that it knows about.

So here's what was happening. We were seeing the clients of the metric service get query errors when a rolling restart was happening, and the DNS errors we were seeing corresponded to the service discovery the metric service does to find its dependencies, the index and the store. We use DNS for service discovery, and here's what our DNS setup looks like. This is a somewhat conventional Kubernetes DNS setup. We run the metric service in a pod in Kubernetes, and over UDP a DNS request flows to node local DNS. Node local DNS is a caching and forwarding DNS resolver; it's a stripped-down CoreDNS configuration. If node local DNS receives a query and it doesn't have it in its cache, the query then flows to cluster DNS, which is a cluster-wide CoreDNS deployment that also does caching and is authoritative for resources that exist within the cluster. However, as you saw on our first slide, we run dozens of Kubernetes clusters and they all talk to each other, so sometimes resources that exist outside of the cluster need to be discovered. To do that, we use an external DNS provider. In this case we're running on EC2 in AWS, so our external DNS provider was Route 53, which is their DNS solution. Any resource that the metric service needs to discover that doesn't exist in the same cluster is resolved by Route 53.

And so here we were seeing the DNS errors emanating from the metric service. And interestingly, we saw that the failure was happening in node local DNS itself, which is kind of weird, because that hop just happens over the node's local interface: it's not going over the network, and it's not really where we expected to see the problem. So we dug in and we actually saw that node local DNS was running out of memory. And this was really surprising to us, because it should never happen. The reason it should never happen is that we set a concurrent request limit in node local DNS: we say if you have more than 1,000 requests in flight at a time, you should reject new ones and put back pressure on the client so that you don't run out of memory.

So this implied to us that the sizing we had for node local DNS was wrong; maybe we weren't giving it enough memory to serve 1,000 concurrent requests. The first thing we did was increase the amount of memory we were giving it. It runs as a DaemonSet, so it's kind of expensive to give it too much memory, but we quadrupled it: we went from 64 megabytes to 256 megabytes. And this stopped the OOM kills, but, very surprisingly, we still saw the errors during the rolling restart.

So at this point we started doing a little bit of math, and nothing we were seeing really made any sense. We looked at the number of queries that node local DNS was receiving on that node, and you can see it's usually around 100 requests per second. When we do the rolling restart, there's a little spike here, but we're not going that high: the highest number of requests per second that it's serving is 400. And we allow 1,000 requests concurrently.
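For context, that 1,000-request limit lives in the forward plugin of the CoreDNS configuration that node local DNS runs. As a rough sketch only (the max_concurrent value and the 10-second connection expiry are the numbers from this talk; the zone, cache sizes, health check interval, and the upstream cluster DNS address are placeholders, not our production values), the relevant stanza looks something like this:

    cluster.local:53 {
        cache {
            success 9984 30
            denial 9984 5
        }
        forward . 10.96.0.10 {
            force_tcp            # upstream queries to cluster DNS go over TCP
            expire 10s           # pooled upstream connections expire after 10 seconds
            max_concurrent 1000  # beyond 1,000 in-flight queries, reject and push back
            health_check 5s      # periodically probe the upstream resolver
        }
    }

With a setup like this, a query that misses the cache has to wait for an upstream connection, which is why a connect timeout of a few seconds can pile up in-flight requests so quickly.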
And if we say that each DNS request takes something like five milliseconds to resolve, which is quite generous (that's not even taking caching into account), node local DNS should in theory be able to serve hundreds of thousands of requests per second. But we were hitting max concurrent with under 400 requests per second. Very strange.

So we looked more at all of our graphs and tried to figure out what was happening, and this one stood out to us. Node local DNS has a forwarding layer: the forward plugin, a CoreDNS plugin that is responsible for forwarding requests to cluster DNS. And we noticed that the forward plugin was telling us that node local DNS was having health check failures when trying to talk to the cluster-level resolver. The way that works is that those connections happen over TCP and they're reused across requests, but every 10 seconds those connections expire and have to be recreated. And node local DNS was unable to connect to cluster DNS. This explained why we were hitting max concurrent: we looked at the timeout for creating a connection from node local DNS to cluster DNS and saw that it was five seconds. So if an incoming query arrives and there's no connection to cluster DNS, that request will block for five seconds while we time out trying to create the connection. And that means we'll hit the max concurrent limit of 1,000 with only 200 requests per second. So it was starting to make a bit of sense, but it was still really weird.

At this point, we thought maybe we were having a networking issue; perhaps we were saturating the network on this instance. So we looked at what we should be able to achieve. We run this on m5.4xlarge, which is an EC2 AWS instance type, and that instance is allowed to burst to 10 gigabits per second of network throughput and should be able to sustain 5 gigabits per second. You can see from our graph of the throughput here that we're nowhere near even the sustained guarantee. So that all looked okay, but we were still seeing things that suggested our network was saturated, right? We were seeing TCP retransmits, we were dropping packets. And maybe we were seeing microbursts instead: there could be really spiky traffic coming in, but the graphs we're looking at here are aggregated over 15-second intervals, so that data could be getting smoothed away by our reporting.

When you see this type of thing in networks, what you normally do is look at the counters in the network driver, which count events as they happen and aren't subject to the kind of time-aggregation observability issue we see a lot in networking. And luckily, about two weeks before we hit this incident, we had added support for the Elastic Network Adapter metrics. If you're not familiar with the Elastic Network Adapter, it's the virtualized network interface that AWS's hypervisor presents to the VM. So part of this lives outside of the VMs that we run in AWS: they have big physical machines and a hypervisor that provides the platform components, and the Elastic Network Adapter is one of them. And AWS added a feature where if you run ethtool -S on that interface, it will actually give you counters about the virtualized network interface. We're going to tell you more about what we saw there in a second.
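Concretely, with a recent enough ENA driver, those hypervisor-side counters show up alongside the normal NIC statistics. Something like the following (the interface name, values, and the exact counter list depend on the instance and driver version):

    $ ethtool -S ens5 | grep allowance_exceeded
         bw_in_allowance_exceeded: 0
         bw_out_allowance_exceeded: 0
         pps_allowance_exceeded: 0
         conntrack_allowance_exceeded: 0
         linklocal_allowance_exceeded: 0

Each of these counters increments when the hypervisor drops or queues packets because the instance exceeded one of its allowances (bandwidth in or out, packets per second, tracked connections, or link-local traffic), which is exactly the kind of signal that 15-second averages can hide.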
But just to summarize what was happening so far: when the metric service was being restarted, we were seeing DNS errors, and we noticed that node local DNS couldn't establish connections to the cluster DNS resolver, which implied some sort of network saturation issue. And this gets us to chapter two, AWS networking, where we try to look at what's happening on the networking card of the instance.

So our first hypothesis was that we were saturating the instance's network during rollouts. As Elijah was just saying, we imagined that it was due to microbursts, because the averages were good, so it would have to be very short spikes. Once we had instrumented the instances to look at the low-level ENA metrics, we saw the promising graph on the bottom left side of the slide, which showed that we were actually exceeding the limit. However, if you look at the other graph, it's not correlated with the deployments. We go over the limit occasionally all the time, but that's probably just normal TCP behavior (TCP is self-regulating), and it's not correlated with the errors. So that wasn't the problem.

Once we had enabled the new AWS metrics, we had quite a few of them, so we looked at all of them, and this one really stood out because it was completely correlated with deployments: every time we had a deployment with errors, this metric was spiking up. I'm talking about the metric called conntrack allowance exceeded here. At that point, we had absolutely no idea what this metric was, because we had never encountered it before. So we went to the AWS documentation, and AWS explains that in order to implement security groups, which are stateful firewalls, they have to do connection tracking at the hypervisor level. This metric is telling you that the connection tracking table used by the hypervisor is full. The interesting thing is, as we were saying before, we're running tens of thousands of AWS hosts, and we had never encountered anything related to conntrack at the AWS level before. So that was very surprising to us. I mean, it made sense that the limit existed, but we had never seen it before, which was a surprise.

So the first thing we did was try other instance types. The first instance type we tried was network-optimized instances, because they have higher throughput, so that's very good. As you can see here, the two lines in the middle show ingress traffic and ingress drops for the network-optimized instances, so it's much better; very promising. However, it had no impact on the conntrack issue and on the errors. It's better in terms of packet drops and throughput, but it's not impacting the conntrack at all, and we were still seeing issues. So we tried bigger instances instead. We took an instance that was twice as big, and as you can see on this graph, it solved the issue completely. You can compare the light blue line and the purple one, and the purple one is much better because everything is basically zero: no conntrack exceeded errors, no errors. Extremely promising for us. Except, well, it addresses our issue, but it would mean that our metric service infrastructure would get twice as expensive, which was a bit of a hard sell. So we wanted to understand exactly what was happening and how to address it.

So we reached out to AWS, because AWS mentions that there are limits to the connection tracking system but there's no public number, and they told us: don't worry, you can track hundreds of thousands of flows, it's usually not an issue unless you have very weird behavior.
They also told us that, yes, bigger instances get a bigger table, which made sense based on our tests. So given that AWS had told us they were able to track hundreds of thousands of flows on this instance type, we looked at the host where we were running the metric service. This host has its own conntrack, the Linux one, and we were trying to see if things aligned. The host, as you can see on this graph, is usually tracking about 13,000 connections and spiking up to 50,000 during rollouts. That's pretty high, but it's an order of magnitude lower than what we expected the hypervisor to be able to handle. So that was very weird. At that point, we had no idea what to do, because things made no sense.

And so we went even lower level and started looking at VPC flow logs, which, if you're familiar with Cisco, are basically NetFlow-type data, where you have information about TCP connections. You get one flow in each direction, because they are not stitched together, and you get very detailed information on what's happening on the network. Of course, it's a huge amount of data and it's pretty hard to parse, but it's very detailed because you get all the flows coming in and out of an instance.

The first thing we did, because we knew that node local DNS was not able to establish connections to upstream DNS servers, was look at egress flows, and we grouped flows by source IP. As you can see on this graph, we have flows from the old IP during our rollout and then flows from a new IP, which makes sense: we roll out, we replace a pod, the IP changes, so we see flows created by the old pod and then flows created by the new one. What's pretty weird is the very big spike you're seeing, where we spike up to 50,000 flows.

As I was mentioning before, in this data we have two flows for each connection, one for egress and one for ingress. So we then looked at what was happening for ingress flows, flows actually coming into the instance. And the graph is almost identical, which makes sense, because with TCP you have a flow in each direction. Except there's a second spike that is not aligned at all with what we see on egress. So everything is the same except that spike, where there's no matching traffic egressing the instance. So what we did at that point was focus on the ingress traffic to the old IP to understand exactly what was happening.

On this graph, what we're showing is the TCP flags we were seeing on these flows. The blue line is showing flows without any flag, which makes sense for long-lived established connections because there's no TCP flag set. The red line is flows terminating, which makes sense: when we roll out, we're stopping our application and flows are terminating, so we're seeing FIN packets. And the yellow line is SYNs, so connection attempts. And this is the one that's very surprising, because it's the one spiking very high, and the total is above 100,000 connections over 90 seconds. And if we compare this with what we're seeing for egress traffic, there's something very interesting here: you can see that the first spike of SYNs is actually matched by resets. So SYNs are coming to the instance and the instance is sending resets because there's nothing listening anymore. But after that, we see nothing in response to the new incoming SYNs.
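As a side note, one way to slice flow logs like this, assuming they are delivered to S3 and queried with Athena and that the custom flow-log format includes the tcp-flags field, is a query along these lines (the table name, column names, IP address, and time range here are all illustrative, not our actual setup):

    SELECT tcp_flags, count(*) AS flows, sum(packets) AS total_packets
    FROM vpc_flow_logs                      -- hypothetical Athena table over the flow logs
    WHERE dstaddr = '10.1.2.3'              -- the old pod IP (placeholder)
      AND "start" BETWEEN 1650000000 AND 1650000090
    GROUP BY tcp_flags
    ORDER BY flows DESC;
    -- tcp_flags 0 = established traffic (no flags recorded), 1 = FIN, 4 = RST,
    -- 2 = SYN: the line that spikes past 100,000 flows in the graph above.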
So this was starting to give us a good idea of what was happening. The resets were kind of expected, because the metric service is a Go application using gRPC, and it does a graceful stop with a timeout of 10 seconds. What happens when you do a graceful stop in gRPC is that the server stops accepting new connections, which is why we were getting resets, waits for existing connections to finish, and tells the clients to terminate. So during these 10 seconds, it makes complete sense to get resets, because that's what gRPC is going to do. However, after these 10 seconds, the pod is deleted and the IP is not there anymore. It's gone. So there's nothing to answer, and that's why we're not seeing any answer to the SYN packets.

Another thing we knew at that point: given that we have all the flows, we could identify the application connecting to our metric service, and we identified that the alerting engine was actually making those connections. We looked at the conntrack on the alerting engine hosts and, as you can see here, it's spiking very high, which seemed to confirm that this is what was happening. So at that point, what we knew was: we have DNS errors because node local DNS can't connect to upstream; we're actually saturating the AWS conntrack, because we've seen hundreds of thousands of SYNs; and we know which application is SYN flooding the metric service. But it doesn't explain why what we're seeing on the instance is so different from what AWS is seeing at the hypervisor level. So let's dive into node networking.

On our nodes, we use Cilium as our CNI. The way this works is that the Cilium operator allocates additional IPs to nodes, which are used for pods, and we allocate those IPs on an additional interface: not the main interface of the host, but an interface dedicated to pods. The operator is responsible for maintaining an IP pool for pods, and when you create a pod, the Cilium agent grabs a free IP and allocates it to the pod. That's all good, but then you need traffic to actually flow into the pod, so the Cilium agent also adds a route entry saying that traffic to this pod IP should be sent to this virtual interface. We also need to route traffic out of the pod, and we need to use the right interface. To do that we use source routing, which we achieve with an IP rule: Cilium creates an IP rule saying that traffic from this pod IP should use this route table, and that route table uses the additional interface to get traffic out.

So in a stable state, when an alerting node tries to connect to a metric service pod, it sends a SYN, and you can see the conntrack state is consistent everywhere: the connection is starting to open. Then we get the SYN-ACK, the connection transitions to the established state, and again it's aligned everywhere: on the node, in the hypervisor, and on the alerting node.

Now, what happens when we delete a pod? When we delete a pod, we still get traffic to the old IP, because it takes some time for service discovery information to propagate, so clients will try to reconnect to the old IP. And the IP is still held by the interface, so the VPC fabric will still send traffic to the node. The thing is, we have this SYN arriving, and the IP is not known anymore, because all the routing information has been garbage collected. So we were wondering what happens to this packet. We have this SYN packet coming in, and we don't know where it's going.
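To make that routing state concrete, here is roughly what the per-pod entries the agent programs look like, expressed as plain ip commands. The pod IP, device names, gateway, and table number are illustrative, and this is a simplification of what Cilium actually installs:

    # ingress: traffic arriving for the pod IP is delivered to the pod's veth
    ip route add 10.1.2.3/32 dev lxc_pod_a
    # egress: source-route traffic coming from the pod IP out of the pod interface
    ip rule add from 10.1.2.3 lookup 11
    ip route add default via 10.0.32.1 dev ens6 table 11

When the pod is deleted, entries like these are cleaned up, which is why a late SYN for the old IP arrives on the pod interface with no route and no rule left that knows about it.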
So what we did is we tried to make a connection ourselves and simulate it. We connected from another node and captured traffic with tcpdump, and we see the SYN packet coming in, but no answer whatsoever, nothing at all. At that point we're like, well, we have a SYN packet, but where should it be routed to? So we asked the kernel: if you see a packet like this, with this source IP, to this destination, coming in on this interface, what are you going to do with it? And this is where things start to get a bit more interesting and more fun, because of this error message here, which shows that we're hitting reverse path filtering. The kernel refuses to do anything with this packet, because according to the kernel it's doing something wrong. Reverse path filtering is a security feature we're going to talk about in a moment, and this was confirmed in the kernel logs, where we see this warning: we're seeing a martian packet, a packet we should never see on this interface.

For those of you who are not familiar with reverse path filtering, it's a security feature in the kernel to prevent IP spoofing, so you can't send a packet with a source IP that's not supposed to be there, and it has different modes. The standard mode is: if the return path would use a different interface, drop the packet. That's strict mode, and it logs these as martian packets. And there's loose mode, which only drops the packet if there's no return route at all. So in our case, well, we're seeing the martian packet warnings in the kernel logs, and now it makes sense: the SYN packet comes in and it's dropped because we're hitting this.

What does this mean in terms of connection tracking? What's interesting here is that when applications attempt to connect, they fill their own conntrack and they fill the hypervisor conntrack, but then the SYN packet hits reverse path filtering in the kernel and is simply dropped, so it's never added to the node's conntrack. And that explains why the node conntrack is so different in size compared to the others.

So everything made sense except one thing, which was very confusing to us. You remember before I was saying that reverse path filtering can be either in strict mode or in loose mode? We set it to loose mode, which is the default in most distributions. And the thing is, when it's set to loose mode, it means you only drop a packet if there's no possible egress route for the source IP. But of course we have a default route on the node, through the main interface of the node, so we should be able to route this traffic. The traffic should come in on ens6, the pod interface, egress on ens5, and then just be dropped by AWS. But that's not what was happening, so that was very confusing to us. At that point, we had absolutely no idea what to do.
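For reference, the "ask the kernel" step mentioned a minute ago is just ip route get with an explicit source and input interface, and the reverse path filtering mode is a per-interface sysctl. A rough reproduction, with placeholder addresses, looks like this (the exact error text can vary by tooling, but an EXDEV-style failure is the reverse path filter rejecting the packet):

    # would the kernel accept a packet for the old pod IP, from the client,
    # arriving on the pod interface?
    $ ip route get 10.1.2.3 from 10.9.8.7 iif ens6
    RTNETLINK answers: Invalid cross-device link

    # reverse path filtering mode: 0 = off, 1 = strict, 2 = loose
    $ sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.ens6.rp_filter
    net.ipv4.conf.all.rp_filter = 2
    net.ipv4.conf.ens6.rp_filter = 2

    # the "martian source" warnings in the kernel log come from this knob
    $ sysctl net.ipv4.conf.all.log_martians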
So we went to look at the reverse path filtering code in the kernel. It's actually not that complicated. We knew which error we were getting, so we just looked at the code and worked backwards. To hit that error, you have to go through this label, e_rpf. To get to that label, you need to go through the last_resort label. Okay, all this makes sense now, and this is where it starts to get interesting: the only way to get to this error was if this variable, no_addr, was set to true. And this variable is set very early in the function, and it's set to true if the interface has no IP address.

So it turns out our pod interfaces don't have an IP address assigned, because they don't need one: we just transit traffic through them and we don't set an IP on them. So at this point, maybe this is the problem we're hitting. So we simulated it. The first test is the one we did before, and then we said, well, let's add a random IP, whatever, on the additional interface and see what happens. And as you can see here, as soon as we added a random IP to the additional interface, it actually works: we now have an egress route through ens5.

So to summarize: we were hitting reverse path filtering because the pod interface has no IP. If it had one, traffic would be routed out the main interface and dropped by AWS. That wouldn't be great either, but at least the conntracks would have been consistent and we would have understood what was happening much earlier. Something interesting is that once we noticed this, we made a small pull request to Cilium so that we can notify clients when this happens. Now, when you delete a pod, you can tell Cilium to send an ICMP error message saying that this IP is not reachable anymore, which means the client will know very early that there's an issue.

So here's the status now. You remember we have DNS issues because node local DNS can't connect to upstream. We're actually SYN flooding the AWS conntrack, but we're not impacting the conntrack on the host, because packets are being dropped by reverse path filtering, due to a weird edge case in the kernel where the interface has no IP. So now let's get back to the original issue, which is: why are we SYN flooding anyway?

In order to understand why we were sending so many SYNs, we have to look at the way our RPC system is set up. We have two main questions here. Why were we sending SYNs for so long? There was a long period where we were sending a bunch of these requests. And also, there were some really big spikes in there, and we were wondering why those were happening. So just a reminder of what our RPC setup is here: we use DNS for service discovery, which goes out and hits the external DNS provider, in this case Route 53, and we use gRPC as our RPC mechanism. And one important thing, which is going to matter in a second, is that we do client-side load balancing. This means that every alerting engine talks to a bunch of metric service pods directly: they have a bunch of IPs and they talk to them. There's no load balancer or any connection pooling in the middle.

Okay, so first: why were we sending SYNs for so long? To understand why that period was as long as it was, we looked at the way our ExternalDNS was configured. If you're not familiar with ExternalDNS, it's a controller that you can run in your Kubernetes clusters. What it does is watch pod events, like pods coming online and pods going offline, and then it takes the IPs and puts them into an external, cloud-provider DNS service. So what was happening here was that when a metric service pod was being deleted, ExternalDNS would capture that event and then go and update Route 53. And we actually found a really interesting piece of behavior in the version of ExternalDNS we were running. So this first step here: the metric service pod is deleted, the pod receives SIGTERM, and we give it a 10-second grace period. At this point, we call gRPC's graceful stop: it sends GOAWAYs on its existing connections and resets to new connections coming in.
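For the gRPC side, here is a minimal sketch of that shutdown sequence in Go. This is the generic pattern, not Datadog's actual code; the listen address is a placeholder, and the 10-second grace period mirrors what the talk describes:

    package main

    import (
        "net"
        "os"
        "os/signal"
        "syscall"
        "time"

        "google.golang.org/grpc"
    )

    func main() {
        lis, err := net.Listen("tcp", ":8080")
        if err != nil {
            panic(err)
        }
        srv := grpc.NewServer()
        // ... register services here ...
        go srv.Serve(lis)

        // Kubernetes sends SIGTERM when the pod is being deleted.
        sig := make(chan os.Signal, 1)
        signal.Notify(sig, syscall.SIGTERM)
        <-sig

        done := make(chan struct{})
        go func() {
            // Stop accepting new connections (new SYNs to the port get RSTs),
            // send GOAWAY, and wait for in-flight RPCs to drain.
            srv.GracefulStop()
            close(done)
        }()
        select {
        case <-done:
        case <-time.After(10 * time.Second): // the grace period from the talk
            srv.Stop() // force-close whatever is left
        }
    }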
But one thing we found is that in our version of ExternalDNS, it was not when SIGTERM was sent to the pod that it was deregistered from DNS. That would actually only happen when the pod itself was deleted, which meant that it normally took about 10 seconds after receiving SIGTERM to even start removing it from DNS. And then, after this, ExternalDNS runs in a sync loop. There's a fun balancing act here between how frequently you want updates and how big you want your batches to be, because you can hit rate limits in your cloud provider if your batches are too small. Here we set our sync loop to 15 seconds, so it took up to 15 seconds for that loop to run and for Route 53 to be updated. And then again, in DNS you have to set a TTL on a record, and it's a balance between how frequently you want things to be queried and how up to date you want them to be. We set a TTL of 15 seconds. And the last thing we needed to figure out was how frequently the alerting engine re-queries that DNS record. It turns out that this is bounded by a gRPC setting called min time between resolutions, and we use the default value, which is 30 seconds.

The interesting thing about this is that the updates the alerting engine would see were actually quantized in 30-second increments. So it would see an update either after 30 seconds, or, more likely, after 60 seconds if you add up the averages of all the other delays, and sometimes even 90 seconds, but nothing in between. That quantization was really confusing to us when we were looking at all the graphs and seeing that everything took a minute or 90 seconds, and this solved that mystery. And this lines up with the propagation time we were seeing in the graphs Laurent was showing earlier: the deletion starts, we see a huge spike in SYNs, then the clients progressively start using the new IPs, and eventually, after 60 or 90 seconds, no clients are using the old IPs anymore. So that makes sense. This is really a balancing act, though: if we wanted to make this shorter, we would have to put more load on our DNS infrastructure and on the cloud provider, so we were okay living with this.
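Putting the knobs from this chapter in one place, here is an illustrative sketch. The flag and annotation are the standard ExternalDNS ones, but exact names and defaults can vary by version, and the arithmetic in the comments is just the worst case described above:

    # ExternalDNS controller (deployment args): how often the sync loop runs
    - --interval=15s

    # Record TTL, set per service through the ExternalDNS annotation
    metadata:
      annotations:
        external-dns.alpha.kubernetes.io/ttl: "15"

    # Client side: gRPC's DNS resolver won't re-resolve more often than its
    # "min time between resolutions", 30 seconds by default.
    # Worst case for a client to stop using a deleted pod IP:
    #   ~10s (grace period) + 15s (sync loop) + 15s (TTL) + up to 30s (re-resolution)
    #   = roughly 60 to 90 seconds, quantized by the 30-second resolution step.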
To understand why we were seeing those huge spikes, well, this one is a bit more complicated to explain, and it requires you to understand the way we use gRPC at Datadog. We started using gRPC at Datadog many years ago, and when we started, the way we would shard our applications was by writing really thick clients. The clients were completely aware of how their servers were sharded; they would download partitioning tables and things like that. And we would actually resolve the DNS entries into IPs ourselves and just pass the IPs to gRPC. But the way that gRPC is really intended to be used is that you give it a hostname, it resolves that hostname in the background, and your application doesn't have to worry about it at all. And we actually started having a bunch of incidents because we were using gRPC in an unconventional way that people weren't used to: we were passing it IPs and we were using some weird settings to support that. So at one point we switched to the normal way of using gRPC, where you just pass it a hostname. And when you make that change, one thing you have to change is the gRPC load balancing policy.

The default load balancing policy in gRPC is called pick-first. What it does is just pick one IP, form a connection to it, and use it, and when that connection fails, it picks another one. But when you're trying to do client-side load balancing and you have a bunch of IPs behind a single hostname, you need a different load balancing policy, and that policy in gRPC is called round robin. We had made that change about six months before this incident started, and we traced the behavior back to when we switched the gRPC load balancing policy. At first we thought, oh, okay, it's obvious: we were using pick-first and we switched to round robin, so all of the alerting engines were talking to one metric service pod and then they started talking to all of them. But that actually wasn't the case; it wasn't that simple. Remember, we were still doing client-side load balancing in either case, and in either case we still effectively had one gRPC connection per IP. The only thing we actually ended up changing was the layer in which the DNS resolution takes place, which made this super weird.

So we dug a bit more. We read the gRPC code, and we realized that pick-first and round robin have a very subtle but important difference in the way they handle connection failures. They manage connections in the background for you, so you don't have to do that in your application code. But in pick-first, when a connection is severed, it doesn't try to reconnect until you try to use it. It does on-demand reconnection, and this means that when you make a request, it will block the request while it tries to reconnect. In round robin, rather, they tried to be a bit smarter: they said, wouldn't it be cool if a background thread automatically tried to reconnect, so that by the time your application went to use that connection, it was ready?

And one thing we had done back when we were using pick-first load balancing, many years ago, was set the default max reconnect backoff time to 300 milliseconds. This made a bit more sense when we were doing on-demand reconnects, because we didn't want to block our requests for very long when a connection didn't exist: a request would come in, and we would retry every 300 milliseconds to form the connection, and that worked. However, when we started using round robin, instead of that happening just in the request path, on demand, it was happening in background threads everywhere, for every single connection. And we did some math: we had thousands of alerting engines that were all trying to reconnect to each metric service pod every 300 milliseconds, which meant we were sending tens of thousands of SYNs per second. We were just SYN flooding ourselves. And that explains those huge spikes we were seeing in those graphs.

So the fix here was, surprisingly, just deleting a few lines of configuration. We went back to the default reconnection settings in gRPC, where it tries to reconnect on the order of seconds, not milliseconds, and we just let it do its job as intended. It turns out that was the core problem here: that stopped our SYN floods.
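In grpc-go terms, the before and after look roughly like this. This is a sketch, not our actual client code: the target hostname is a placeholder, and the backoff numbers simply illustrate a 300-millisecond cap versus the library defaults.

    package client

    import (
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/backoff"
        "google.golang.org/grpc/credentials/insecure"
    )

    func dialMetricService() (*grpc.ClientConn, error) {
        return grpc.Dial(
            "dns:///metric-service.example.internal:8080", // hostname; gRPC resolves it
            grpc.WithTransportCredentials(insecure.NewCredentials()),
            // Client-side load balancing across every resolved address.
            grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
            // The problematic override: capping reconnect backoff at 300ms meant every
            // broken subchannel retried several times per second, in the background,
            // from every client. The fix was simply deleting this block and keeping
            // the library defaults, which back off to seconds, not milliseconds.
            grpc.WithConnectParams(grpc.ConnectParams{
                Backoff: backoff.Config{
                    BaseDelay:  100 * time.Millisecond,
                    Multiplier: 1.6,
                    Jitter:     0.2,
                    MaxDelay:   300 * time.Millisecond,
                },
            }),
        )
    }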
So here you can see we did a rollout. Thank you, yes, we were very happy when we finally fixed this; it took us months to find. You can see here: client errors are good, server errors are good, we don't see a spike in response time, and the conntrack is totally sane. And that ended up being the problem.

So we learned a lot from this incident, because as you've seen, it was a bit complex. The first key lesson is, well, sometimes it's not DNS. I promise you, sometimes it's not. I know it's rare, but sometimes it's not DNS. More seriously, we use very powerful abstractions: cloud networking, Kubernetes networking. These abstractions are very powerful and magical, in a sense, when they work, and honestly, most of the time they work perfectly. But when they don't, and they leak the underlying complexity to you, you have to dive deep into it, and that's sometimes pretty difficult. Also, gRPC setups can be complex, and we've seen that making changes to them can be dangerous. And we noticed that very low-level instrumentation, very low-level metrics and logs, can be extremely interesting: here, the ENA metrics and the VPC flow logs actually helped us make sense of what we were seeing. As Elijah was just saying, this required a very involved team effort; it took us weeks to fix. We're only two on stage today, but Wendell, Matt, and others also helped quite a lot.

So as you can imagine, debugging this incident was long and painful, but we really learned a lot, and that's also why we're here sharing it, because we believe that many of you could be interested in this. And we're just over time, so we won't be able to take questions, but we're going to stick around for a while if you want to chat. If you're interested in this kind of fun debugging issue, we're definitely hiring, and we have a lot of other subtle problems like this one. And of course, you can reach out by email or on Twitter if you want to get in touch. Thank you.