Hello, everyone. We're very happy to be in Vancouver with you today. Today we're going to talk about an interesting networking kernel bug we found a while back: how it impacted our workloads, how we debugged it, and how we fixed and mitigated it. Before we start, let's introduce ourselves. I'm Laurent Bernaille. I work at Datadog in the infrastructure teams. And I'm Eric Mountain. I also work in the infrastructure teams; more specifically, I work in the Kubernetes teams. We deliver the Kubernetes clusters that run all the services Datadog runs on. This talk isn't about Datadog, but I'll introduce Datadog very briefly. We're a cloud-based monitoring and security platform, so we have lots and lots of integrations that allow monitoring different pieces of software. We're now over 5,000 strong, and we have millions of hosts reporting into Datadog, sending us trillions of points every day. That makes for a lot of traffic, and quite a sizable infrastructure: tens of thousands of nodes, split over dozens of Kubernetes clusters.

So, as I said, we're not here to talk about Datadog, but about a problem that we observed. It started with folks coming to us saying, hey, we're being paged at night, our application is lagging. We're seeing issues with some applications consuming from Kafka, and the throughput we get does not align with the instance type. That's a nice way of saying that the throughput is really bad, because no one would complain if it were too high, right? Looking a little more closely, one of the things that comes to light is that we have an eight-queue network card, but all of the traffic is leaving on one single queue, queue 0. Based on this, we start doing bandwidth tests, quickly cobbled together with iperf. We see that if we go host to host, we're at about 23 gigabits per second, which is close to the limit for this instance type. But if we go from a pod to another host, we're down to 16 gigabits per second for exactly the same test. Taking this a little further, we start using Flent. Flent is an opinionated framework for network bandwidth and latency testing. The advantage over our cobbled-together iperf tests is that, as I said, it's opinionated: it has its own structure, it knows how to launch a number of flows that will saturate links, things like that. So that's pretty good. And what we see here is that we're at about 16 gigabits per second, which is the same as what we saw on the previous slide for the pod-to-host traffic, but we also get very high latency, around 10 milliseconds.

So at this point, what we've got is folks complaining that they're getting paged, so far for two applications running in AWS. The symptoms are packet drops and therefore retransmits. And this is a situation that changed, right? Initially people weren't being paged, and then they started being paged. We're now seeing lower throughput than the setup that works, and we've got higher latency. The consequence is that we have to run more processing pods to compensate: we scale horizontally, and so it's costing us a lot more money. So if we backpedal a little to this single-transmit-queue business, why is that a problem? For one thing, it means that a single CPU is handling all the completion IRQs.
Once a packet has been sent by the network card, the card reports back with an IRQ, and all these IRQs land on the same CPU because they all come from the same transmit queue, and transmit queues are mapped to CPUs for IRQ handling. This is something we talked to AWS about as well, and one of the things they told us is that it's harder to reach the maximum instance throughput if you're using a single transmit queue. You can get full bandwidth on a single queue on certain instance types, like the very large ones, but it also depends on the instance family and generation; on very old generations you will never manage to get full bandwidth on a single transmit queue. I don't know exactly how it works internally, but my guess is that it's a bit like a CPU with pipelining, where different pipeline stages can process instructions in parallel: the network card is likely putting a packet from one transmit queue on the wire while it's dequeuing from another transmit queue, and so on. That's the parallelism the card can have — at a guess.

So how does this work if we look at the stack on Linux? We have the network card — the virtual network card. The card presents a certain number of transmit queues, let's say eight, because that's what we have in our examples. Each of these has a queuing discipline in front of it. In our setup — the default setup for the operating system we were using, in fact — it's fq_codel. FQ-CoDel is Fair Queuing Controlled Delay, an algorithm for shunting packets around and deciding which packet should be handled next; I'll talk about it a bit more in a minute. Ahead of this is the root queue discipline, which is mq (multi-queue), and it's a very simple, pretty transparent one. It expects to receive from the IP stack a packet that already has a transmit queue set, and it then simply hands the packet to the queuing discipline that matches that transmit queue.

So if we look at the queue discipline statistics, we see that indeed all of the packets are being sent on queue 0 and not on the others. We haven't shown all eight, but you get the drift. Essentially, we've got all these queuing disciplines and all these transmit queues, and everything is just going where the red line is at the top here. If we go back to the numbers, one interesting thing is that we actually have dropped packets on that queue discipline zero, and it's quite a significant number — 2%. That's an awful lot.

So let's look at FQ-CoDel in a little more detail. How does it work? There's a stream of packets that need to be sent, coming from different flows. The first thing that's done is that a hash is calculated on the flow, and that is used to dispatch the packet onto one of a number of per-flow queues. From there, a round-robin algorithm picks packets off these different queues, based on how recently bytes were taken off each of them, and those are the packets that get forwarded onto the transmit queues themselves.
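To make the flow-hashing idea just described a bit more concrete, here is a minimal user-space sketch of the dispatch step: hash each flow onto one of a fixed number of flow queues, which a round-robin scheduler then services. This is only an illustration with made-up names and a toy hash, not the kernel's fq_codel implementation (which lives in net/sched/sch_fq_codel.c).

    /* Toy sketch of fq_codel-style flow classification: hash a flow's
     * 5-tuple into one of a fixed number of flow queues. The scheduler
     * then services those queues round-robin. Illustrative only. */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_FLOW_QUEUES 1024

    struct flow_key {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint8_t  protocol;
    };

    /* Toy mixing function standing in for the kernel's flow hash. */
    static uint32_t flow_hash(const struct flow_key *k)
    {
        uint32_t h = k->src_ip * 31u + k->dst_ip;
        h = h * 31u + (((uint32_t)k->src_port << 16) | k->dst_port);
        h = h * 31u + k->protocol;
        return h;
    }

    static uint32_t classify(const struct flow_key *k)
    {
        return flow_hash(k) % NUM_FLOW_QUEUES;
    }

    int main(void)
    {
        struct flow_key a = { 0x0a000001, 0x0a000002, 40000, 9092, 6 };
        struct flow_key b = { 0x0a000001, 0x0a000003, 40001, 9092, 6 };

        /* Different flows usually land in different flow queues, which
         * the round-robin scheduler then services one after the other. */
        printf("flow a -> flow queue %u\n", classify(&a));
        printf("flow b -> flow queue %u\n", classify(&b));
        return 0;
    }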
One thing about CoDel — Controlled Delay: this is active queue management. It actively decides to drop packets if they've been queued for too long. The idea behind CoDel is to prevent bufferbloat. Bufferbloat is, for instance, when you reach a nominally stable state where you've got rather high throughput, but your queues are fairly full, so the latency is actually quite high even though you're getting good throughput. CoDel tries to prevent that kind of situation from lasting: it breaks it by dropping packets, and therefore forcing TCP to manage its congestion window, back off, slow down, that kind of thing. So CoDel is likely why we've got dropped packets here.

Now let's try to home in on what exactly is broken. Our network setup looks like this. The host has its host network namespace, with a primary network interface through which all the host traffic comes and goes. The pods have their own network namespaces and their own IPs, and all their traffic goes through secondary NICs. Here there's only one; if we had many pods we might need more secondary NICs, because you can only have so many IPs per ENI. What we know is that the pod egress traffic is impacted, so that's the traffic going out through ens6. But what about receive? Receive is actually fine: it's balanced across all the receive queues, so no problem there. Host traffic? We've done a benchmark, and we can see that works fine. Okay, what if we send the host traffic through ens6 instead? That works fine too.

So maybe it's our CNI plugin — we use Cilium. Let's try a simple network namespace setup by hand. For one thing, this is a setup where we won't have the Cilium BPF programs loaded, so if we reproduce the issue, we know it's not the Cilium BPF programs. We start by creating a network namespace. In there, we create a veth, which is basically a wire with two ends, and we put one of those ends in the network namespace we've created. We allocate an IP to the end that's inside the network namespace, and we arrange for packets going out of the pod to go to the host. We then add a route at the host level so that traffic coming from the pod is routed out through ens6, and a rule saying that anything destined for the pod goes over the veth into the pod network namespace.

So now we have a network namespace that's fully configured — what happens? From our test, we see that we have exactly the same issue. So we've at least managed to prove that Cilium's BPF programs are not the cause, but we still have something going on with our transmit queue setup. Another thing we can do, now that we know how to set up our own manual network namespaces, is try putting more transmit queues on the veth inside our pod-like network namespace. And that's interesting, because it actually fixes the issue if we have the same number of transmit queues as on the actual, physical network card. But it's still not ideal, because if the number isn't exactly the same, the behavior reverts to everything going through transmit queue zero.
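To make the CoDel behavior described above a bit more concrete, here is a toy sketch of the drop decision: only start dropping once the time packets spend in the queue has stayed above a small target for a whole interval. The constants, names, and structure are illustrative only, not the kernel's actual CoDel implementation.

    /* Toy illustration of the CoDel idea: if packets keep sitting in the
     * queue longer than a small target delay for a whole interval, start
     * dropping to force TCP to back off. Illustrative constants only. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define TARGET_US    5000    /* 5 ms target sojourn time   */
    #define INTERVAL_US  100000  /* 100 ms observation window  */

    struct codel_state {
        uint64_t above_target_since_us; /* 0 = currently below target */
    };

    /* Decide whether to drop a packet that has spent `sojourn_us`
     * microseconds in the queue, given the current time `now_us`. */
    static bool codel_should_drop(struct codel_state *st,
                                  uint64_t now_us, uint64_t sojourn_us)
    {
        if (sojourn_us < TARGET_US) {
            st->above_target_since_us = 0;      /* delay is fine again */
            return false;
        }
        if (st->above_target_since_us == 0)
            st->above_target_since_us = now_us; /* start the clock */

        /* Only drop once the delay has stayed high for a full interval. */
        return now_us - st->above_target_since_us >= INTERVAL_US;
    }

    int main(void)
    {
        struct codel_state st = { 0 };
        /* Simulate a queue whose packets all sit 10 ms in the queue:
         * drops start once that has lasted longer than the interval. */
        for (uint64_t t = 50000; t <= 350000; t += 50000)
            printf("t=%6lums drop=%d\n", (unsigned long)(t / 1000),
                   codel_should_drop(&st, t, 10000));
        return 0;
    }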
Another thing we tried was to use FQ (fair queuing) instead of FQ-CoDel. Although it might not be intuitive — this is something I discovered — FQ is actually more recent than FQ-CoDel. The idea is that within data centers, or for servers, you don't necessarily need the CoDel part, the active queue management part, of FQ-CoDel. FQ is basically that part ripped out: you take packets, distribute them across per-flow queues, and the round-robin algorithm picks them off and pushes them onto the transmit queue. When the queue is full, what you expect is simply back pressure leading to local drops, so the sending application is forced to slow down because it can't queue more packets. And interestingly, with FQ we do actually mitigate the issue: we get the native throughput of the instance type, 25 gigabits per second, and around two milliseconds of latency, which is five times less than the other test.

Now, looking across the fleet, we're in a situation where it's easy to check different nodes. What we see is that AWS instances with Ubuntu 20.04.2 and later are impacted, 20.04.1 is not impacted, and GCP instances are not impacted regardless of the version. So at this point: AWS Ubuntu 20.04.2 nodes have the issue where traffic goes over a single transmit queue; we're using FQ-CoDel; if we switch to FQ, it mitigates the issue; and GCP nodes are not impacted. We could stop here and say, okay, we'll use FQ, but that would be relatively unsatisfactory. So now we need to figure out how to debug this, and Laurent is going to talk about that.

Yeah, so as Eric said, we had mitigations, but they were not satisfactory. The first one, FQ, we could use, but we weren't convinced it was the best approach. And having multiple transmit queues on the veth device would be challenging, because we use instance types of different sizes: some are pretty big with eight transmit queues, some are very small with a single one, so we'd need logic to make sure the number on the veth device matches the number on the physical device. But more importantly, it makes no sense: it works on old kernels, doesn't work on new ones, and we want to understand why.

So the first question we want to answer is: how do we pick the queue? And it's actually easy — you can just look at the kernel code. Well, I say easy; to be honest, it took us quite some time. This is a very simplified transmit sequence. When the routing decision has been made and a packet has to be sent, the first thing that happens is a call to dev_queue_xmit, which is going to send the packet through the device. Two important things happen there. First, we call a function named netdev_core_pick_tx, which is supposed to pick the queue. Then we go to the transmit path itself, which first runs the qdisc and then invokes the transmit function of the driver itself.
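Here is a user-space sketch of that simplified transmit sequence, just to keep the shape of it in mind. All the types and functions are stand-ins with simplified signatures, not the real kernel API.

    /* User-space model of the simplified transmit sequence described above:
     * dev_queue_xmit() first picks a transmit queue (netdev_core_pick_tx),
     * then runs that queue's qdisc, and finally hands the packet to the
     * driver. Stand-in types only, not the real kernel code. */
    #include <stdint.h>
    #include <stdio.h>

    struct sk_buff { uint16_t queue_mapping; };
    struct net_device { unsigned int real_num_tx_queues; };

    static uint16_t netdev_core_pick_tx(struct net_device *dev, struct sk_buff *skb)
    {
        /* Single-queue devices (like a default veth) have nothing to pick. */
        if (dev->real_num_tx_queues == 1)
            return 0;
        /* Multi-queue devices go through the driver's select-queue hook or
         * the kernel default; this is where our story goes wrong later on. */
        return skb->queue_mapping; /* placeholder for that selection */
    }

    static void qdisc_enqueue(uint16_t txq, struct sk_buff *skb)
    {
        printf("enqueue on qdisc of txq %u\n", txq); (void)skb;
    }

    static void driver_xmit(uint16_t txq, struct sk_buff *skb)
    {
        printf("driver sends on txq %u\n", txq); (void)skb;
    }

    static void dev_queue_xmit(struct net_device *dev, struct sk_buff *skb)
    {
        uint16_t txq = netdev_core_pick_tx(dev, skb);
        skb->queue_mapping = txq;
        qdisc_enqueue(txq, skb);
        driver_xmit(txq, skb);
    }

    int main(void)
    {
        struct net_device nic = { .real_num_tx_queues = 8 };
        struct sk_buff skb = { .queue_mapping = 0 };
        dev_queue_xmit(&nic, &skb);
        return 0;
    }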
So let's dive into netdev_core_pick_tx. I'm sorry, it's going to be a mouthful, because kernel function names tend to be long. This function is actually pretty simple — not even 20 lines. There's a simple test at the beginning: does the device have more than one queue? Of course, if it's a single-queue device, it makes little sense to pick a queue. And something interesting happens next. You see the test there: ndo_select_queue. This checks whether the driver has provided its own select-queue function. If it has, we call the driver's select-queue function; otherwise, we use netdev_pick_tx. Remember this one, because we're going to see it quite a few times.

At this point we thought: well, maybe this can explain the difference between AWS and GCP, because the drivers are different. So we looked at the kernel code for the two drivers. On AWS we use the ENA driver, and on GCP we use the virtio one. As you can see here, in the GCP driver you have the full net_device_ops struct, and it does not implement select_queue, so it uses the default from the kernel. However, on the AWS side, the ENA driver does supply a function. So maybe we're onto something. Let's look at this function. ena_select_queue, the function provided by the AWS ENA driver, defaults to using netdev_pick_tx — the same default function used on GCP. But it looks like we're not going down that path, because it works on GCP and not on AWS. So maybe something is happening in this test here.

Before we dive into that test, let's take a quick step back, because I'm going to mention SKBs quite a few times. An SKB — sk_buff for long — is the main networking structure in the kernel. It holds metadata about packets: there's no actual packet data in it, only pointers to the data, but all the metadata is there. The structure is very, very big, so I've only kept a few parts of it, the ones we'll use later: the device, which is the device used to transmit the packet or the device on which it was received; and at the very end, all the pointers to the different parts of the packet — the network header, where you find the IP information, the transport header, where you find the TCP or UDP headers, and so on. Something that will be very important for the rest of the presentation is this small field here, called queue_mapping. It seems very interesting, right? It says "queue mapping for multi-queue devices" — exactly what we want to look at. Something important is that, depending on where you are in the kernel and which function is using this field, it means a slightly different thing. When you're sending the packet, actually putting it on the wire, it means which queue you're going to use. However, when you're picking a queue, a non-zero value means the queue has already been picked by another component in the kernel, and zero means you have to go through the process of selecting a queue.

So let's get back to ena_select_queue. The first check here, which tests whether queue_mapping is different from zero, is verifying whether the queue has been recorded by another component. If it's true, it means we already have the information, in which case we restore it. The way it works is: to store queue information, you add one, and to restore it, you subtract one. So zero means no queue has been selected, and anything greater than zero means a queue has already been picked. What we want to do now is see what the value of queue_mapping is at different steps in the kernel.
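Here is a small sketch of that selection logic: the "store queue plus one, restore by subtracting one" convention (modeled after the kernel's skb_record_rx_queue / skb_rx_queue_recorded / skb_get_rx_queue helpers), and an ena_select_queue-style driver hook that only falls back to a hash when nothing was recorded. This is a simplified user-space re-implementation for illustration, not the actual driver code.

    /* Sketch of the queue-selection logic just described: queue_mapping == 0
     * means "no queue recorded"; recording queue N stores N + 1; restoring
     * subtracts 1. A driver hook in the style of ena_select_queue only
     * hashes the flow when nothing was recorded. Simplified, not the ENA code. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct sk_buff { uint16_t queue_mapping; uint32_t flow_hash; };

    static void skb_record_rx_queue(struct sk_buff *skb, uint16_t q)
    {
        skb->queue_mapping = q + 1;          /* store queue + 1 */
    }

    static bool skb_rx_queue_recorded(const struct sk_buff *skb)
    {
        return skb->queue_mapping != 0;      /* 0 means "not recorded" */
    }

    static uint16_t skb_get_rx_queue(const struct sk_buff *skb)
    {
        return skb->queue_mapping - 1;       /* restore by subtracting 1 */
    }

    /* Kernel default: hash the flow across the available queues. */
    static uint16_t netdev_pick_tx(const struct sk_buff *skb, uint16_t num_queues)
    {
        return skb->flow_hash % num_queues;
    }

    /* ena_select_queue-style logic: prefer a recorded queue, else hash. */
    static uint16_t driver_select_queue(const struct sk_buff *skb, uint16_t num_queues)
    {
        if (skb_rx_queue_recorded(skb))
            return skb_get_rx_queue(skb);
        return netdev_pick_tx(skb, num_queues);
    }

    int main(void)
    {
        struct sk_buff skb = { .queue_mapping = 0, .flow_hash = 0xdeadbeef };
        printf("nothing recorded -> txq %u\n", driver_select_queue(&skb, 8));
        skb_record_rx_queue(&skb, 0);        /* what veth_xmit ends up doing */
        printf("queue 0 recorded -> txq %u\n", driver_select_queue(&skb, 8));
        return 0;
    }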
Seeing that used to be pretty difficult, but we're very lucky now, because we have eBPF in the kernel, which means we can use bpftrace and add hook points on different functions in the kernel to see what's happening. You can see here this very simple bpftrace program — it's about 10 lines — which allows us to look at what's happening at different steps in the kernel, and in particular at the content of the queue_mapping field of the sk_buff struct. It's pretty simple: this probe is on dev_queue_xmit, the initial function called when we're sending a packet on a device. It takes an sk_buff as its first argument, which is why we're looking at arg0. We read the queue_mapping field, of course, which is the one we're most interested in, and we can also pick up the network device. The last few lines are just a very ugly way to make sure we're filtering traffic for a specific IP, which makes testing easier.

So what do we get? We load this bpftrace program and then run a simple ping from a network namespace on one of our nodes. You can see here that we hit this function twice, which makes sense if you remember our setup: we hit it once in the pod network namespace and then once in the host network namespace. The numbers after "dev" are the device indexes of the different devices, and I've put the names because it's easier: eth0 is the name of the device inside the pod network namespace, and ens6 is the one in the host network namespace. What's interesting here is that the queue_mapping field has a different value in the two places: it's zero in the pod network namespace and one in the host network namespace. And it's always the same — of course it could be one just once by chance, but then it should vary, and it never varies.

At this point we have a program that works, but it doesn't give us much. What we can do, because we have bpftrace, is instrument many functions in the kernel, and this is what we did: we instrumented a few key functions to see what was happening. Here we instrumented all the functions responsible for sending the packet inside the pod network namespace, and you can see that we start with dev_queue_xmit and end up at veth_xmit, which is the function of the veth driver that sends a packet. Once we reach veth_xmit, we actually leave the pod network namespace and enter the host. At this point, if you remember what Eric said before, we have to route the packet, which is why we go through all the routing functions in the kernel, and the kernel decides at this point that we need to send the packet out on ens6. Then we get to the final part, which is transmission again, but this time in the host network namespace, on ens6.

Now that we're here, let's look at the queue_mapping field. In the pod network namespace it's zero, which kind of makes sense: we have a single transmit queue, so there's no point picking one. At the other end, ena_select_queue actually sees queue_mapping set to one, and if you remember the code, that means the queue is considered recorded, so we restore it back to zero. And this means that every single time we're going to pick zero: every time we hit ena_select_queue, queue_mapping is set to one, the queue is considered recorded, we subtract one and use zero, which is the first transmit queue. So now we know why we have the issue. What makes very little sense, though, is this transition here.
Why do we go from zero to one when we cross the network namespace boundary? We looked at a few functions, and in veth_xmit we discovered this line here — you can see at the very end that we're actually recording the queue. We read it a few times and thought: well, if this is here, we should always be recording the queue, so we should be impacted everywhere, and it should have been the case forever. So maybe something changed in the kernel. We used git, and we found this very small patch. There was an optimization idea: in some cases, when you're using XDP, you might want to record the receive queue so you have that information when you transmit, so you can run the XDP code on the same CPU for optimization purposes. Digging deeper, it turns out this code was introduced in kernel 5.11.11 and also backported to a few Ubuntu kernels — which happen to be the ones that are impacted, 20.04.2 and 20.04.3. So all these things are starting to make sense.

But if you remember, there's something very specific happening with AWS, and at this point we wanted to be sure. So, back to ena_select_queue: if you remember, this function here, netdev_pick_tx, is the one we should be using. Before doing anything, we asked ourselves whether using this function would be okay for us — after all, it's also the one used on GCP, so it should work. We looked at the code. Once again, the code is pretty simple, and there's this very promising function here, which seems to be computing a hash and associating a flow with a queue — that seems perfect, right? Except, if we look at that function itself, there's the very same logic at the end that reminds us of what ena_select_queue was doing: if the queue was recorded, restore it. And we were starting to get extremely confused, because based on this code we should have the issue on GCP too — the queue is recorded there too.

So we got back to this function and noticed this call here, get_xps_queue. I don't know if you're familiar with XPS — it's Transmit Packet Steering, which maps flows to queues based on the transmitting CPU, so you get some form of data locality and some optimizations. It's a good optimization, but we don't use it anywhere; at least, we haven't configured it anywhere, so that's kind of weird. Still, to be extra sure, we connected to a GCP host, and it turns out GCP has a magic daemon that does some configuration on nodes and sets up XPS by default on all of them. And it turns out that if we disable XPS on GCP, we get exactly the same issue, which is good news, right?

So, a quick summary of what we found. The code in veth_xmit records the queue on impacted kernels. At some point after routing, we go through this function here, netdev_core_pick_tx, which is responsible for picking the queue. Then we have two possibilities. If we're on AWS, we always use ena_select_queue, which will see that the queue is recorded and always use the first queue. If we're on GCP, we use the virtio driver, and then there are two cases: if XPS is enabled, we use the queue decided by XPS; if it's disabled, we use the skb_tx_hash function, which does exactly the same thing as ena_select_queue and uses a single queue. Everything starts to make sense. That's good, but now we need to fix it, right?
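To tie the GCP side of that summary to code, here is a rough model of the default path: XPS is consulted first if configured, and only then does the flow-hash fallback run — a fallback that itself honors a previously recorded queue. Names only approximate the kernel's get_xps_queue / skb_tx_hash / netdev_pick_tx; this is a simplified user-space sketch, not the actual kernel code.

    /* Rough model of the default (virtio/GCP) path described above:
     * netdev_pick_tx() consults XPS first and only falls back to
     * skb_tx_hash(), which itself reuses a previously recorded queue.
     * Simplified; names only approximate the kernel's. */
    #include <stdint.h>
    #include <stdio.h>

    struct sk_buff { uint16_t queue_mapping; uint32_t flow_hash; };

    /* -1 stands for "XPS not configured", which was our default setup. */
    static int get_xps_queue(int xps_queue_for_this_cpu)
    {
        return xps_queue_for_this_cpu;
    }

    static uint16_t skb_tx_hash(const struct sk_buff *skb, uint16_t num_queues)
    {
        if (skb->queue_mapping != 0)            /* rx queue was recorded... */
            return skb->queue_mapping - 1;      /* ...so reuse it: queue 0 here */
        return skb->flow_hash % num_queues;
    }

    static uint16_t netdev_pick_tx(const struct sk_buff *skb, uint16_t num_queues,
                                   int xps_queue_for_this_cpu)
    {
        int q = get_xps_queue(xps_queue_for_this_cpu);
        if (q >= 0)
            return (uint16_t)q;                 /* GCP nodes: XPS wins */
        return skb_tx_hash(skb, num_queues);    /* no XPS: recorded queue wins */
    }

    int main(void)
    {
        struct sk_buff skb = { .queue_mapping = 1, .flow_hash = 12345 };
        printf("with XPS    -> txq %u\n", netdev_pick_tx(&skb, 8, 5));
        printf("without XPS -> txq %u\n", netdev_pick_tx(&skb, 8, -1));
        return 0;
    }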
So we're kind of lucky, because we work pretty closely with the Cilium team, and when we started having the issue we pinged them and explained what we had found, and Daniel Borkmann immediately said: that sounds like a bad bug, let me fix it. He created this very simple commit, which got accepted in a matter of hours and merged into the kernel. If we go back to the sequence of events as we pass through the different functions in the kernel, you can see that with a patched kernel, the veth_xmit code is not storing the queue anymore, so queue_mapping remains zero. And once we reach ena_select_queue, we actually see that the queue is not recorded, so we now go into netdev_pick_tx, which we were not using before, and pick a queue between zero and seven — in our case, because we have eight queues. That's great news.

However, we also have existing nodes, and as Eric was saying, we have tens of thousands of them, and getting a kernel patch onto tens of thousands of hosts is pretty challenging. It turns out, once again, we're lucky, because we use Cilium. We have a daemon set loaded on every single host, and it already instruments the veth devices with eBPF code to perform Kubernetes load balancing and network policies. And because we have eBPF code there, we can actually modify the SKB. So what Daniel suggested is that we could use the eBPF code we already had to force queue_mapping to zero, always. And this is what this code does: you can see this very simple function called reset_queue_mapping, which is called on every single packet and just sets the field to zero.

So what does it look like now? Remember, this is on a kernel that is impacted. As we go through the veth_xmit code, we store the queue, so queue_mapping is one, which is what we want to avoid. However, because we now have this eBPF code in Cilium, it's reset to zero as we go through the veth device, which means we can pick a queue again — and this time it was five.

So we've been through all this; now, what are the results? The first one is, well, it looks very good: we're now using all the queues, which is very satisfying. But that's not the only thing we wanted; we also wanted higher-level metrics to look better, because this on its own is not really meaningful. So we started looking at throughput, and throughput is now much better. You have host-to-host on the left-hand side and pod-to-host on the right-hand side, and you can see that the numbers are pretty similar: we're maxing out the instance, which is good news. Also, if we rerun the test with Flent — this time using FQ-CoDel in both cases — we can see that it's much better once we've patched the node: instead of 16 gigabits per second, we're getting 25, which is exactly what the instance provides.

What about our applications? Key network metrics start to look much better. The first metric here is TCP retransmits by flow, and you can see that they go down immediately after we deploy the patch. And the TCP latency — this is the SRTT as measured on the sockets — was around eight milliseconds, and after the patch it's now four milliseconds, which is of course much better.
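For reference, here is a minimal standalone sketch of what a tc eBPF program doing that queue_mapping reset can look like — in the spirit of the Cilium mitigation described above, but not the actual Cilium code, which folds the reset into its existing datapath programs. Note that writing skb->queue_mapping from a tc program requires a reasonably recent kernel.

    /* Minimal standalone tc-BPF sketch: clear skb->queue_mapping on packets
     * leaving the pod's veth so the real NIC picks a transmit queue again.
     * Illustrative only; not the actual Cilium implementation. */
    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>

    static inline void reset_queue_mapping(struct __sk_buff *skb)
    {
        /* 0 means "no queue recorded", so drivers select a queue normally. */
        skb->queue_mapping = 0;
    }

    __attribute__((section("tc"), used))
    int clear_queue_mapping(struct __sk_buff *skb)
    {
        reset_queue_mapping(skb);
        return TC_ACT_OK;
    }

    char _license[] __attribute__((section("license"), used)) = "GPL";

A standalone object like this could be built with something like clang -O2 -target bpf -c and attached on the veth with tc; in practice, since Cilium already attaches programs to those devices, the reset simply became part of Cilium's existing code, so no extra program was needed.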
And what about business metrics? This is actually the reason we started doing all this in the first place: the teams were complaining because they had issues with Kafka, or in this example here, issues when transferring data to S3 buckets. You can see that the p99 transfer latency used to be between two and four seconds, and after the patch we're now below 0.5 seconds, which is of course much better.

And this gets us to our conclusion. What we wanted to share is that achieving top network performance is still hard. It seems like a solved problem, except that cloud providers now give us instances that can do 25 gigabits per second or more, and if you want to use that efficiently on a node, you have to tune your operating system. And of course, the interactions between the different components of the network stack are very complex. You can see how the commit that created the issue was extremely simple, yet understanding its exact impact took us a lot of time and was pretty challenging. We also wanted to share that this totally sold us on bpftrace as a systems team, because it allows us to debug very complex behavior in ways that are pretty easy. And in some cases, especially on the networking side of things, we can actually act on packets and mitigate issues without waiting for a new kernel release. One more thing we wanted to say: sometimes it actually is a kernel bug. Of course, when we started, nobody would have guessed it would be a kernel bug — we assumed it was probably something we had messed up in our configuration. Turns out, this time, it was a kernel bug. If you're interested, there are many more details in the issue on the Cilium repo.

Thank you very much. If you're interested in debugging this kind of weird and funny issue, we're always hiring, and we have a few minutes for questions if you have any for us. No questions? It was all perfectly clear. We'll try our best. Okay, thank you very much. Thank you.