Hello and welcome, everyone, to our talk. It's really cool to see a fully packed room. Today's talk is about better bandwidth management with eBPF. I'm Daniel Borkmann. I co-maintain eBPF in the Linux kernel, and I work on the Cilium team; I'm giving this talk together with Christopher and Nikolai, who are also from the Cilium team.

Before we look into bandwidth management, I wanted to start out with an interesting metric, which you can see here on the right side. The metric is basically the density, that is, the number of pods per node. And what you can see here: the more people use Kubernetes, the higher the density gets. For example, it's probably reasonable to assume in 2022 that there is a median of around 50 pods per node. That increasing density also means that there is competition for resources on a given node, for example CPU and memory. So operators have to tackle how to allocate and efficiently use resources.

One tool that you can use in the pod spec is resource requests and limits. Resource requests define how much CPU and memory a given pod requires, and Kubernetes will then make sure that the pod is scheduled on a node which can actually satisfy that constraint. Limits work the other way around: if a pod overshoots its memory limit, for example, it will get OOM-killed.

So now the question is: what about networking? TCP, by its nature, sends as fast as possible. What you can see here on the right is a typical TCP congestion control algorithm that tries to push more and more TCP segments into the network. There is an exponential growth phase in the beginning, until it experiences a packet loss, and then it backs down. And then it tries the same thing again. So really, the output contract for TCP is to send as fast as possible, and shaping is typically done by device output queues. The queue limit of those output queues, as well as the remote receive window size, determine how many packets from TCP can be in flight.

But who actually limits a pod's network usage in Kubernetes? There is infrastructure for that in Kubernetes: bandwidth enforcement. So far, unfortunately, this has only been experimental. If you look into the pod spec, there are Kubernetes-specific annotations: an ingress annotation and an egress annotation (kubernetes.io/ingress-bandwidth and kubernetes.io/egress-bandwidth). You can, for example, specify that a pod should get an egress bandwidth of 50 megabits per second. Support for those pod annotations has so far only been implemented by the bandwidth meta plugin, a plugin from the CNI plugins collection, and it implements a basic token bucket filter from the Linux traffic control subsystem.

So how does it look? On a typical node, which you can see here, there is a pod, and the pod is typically connected to the host namespace with a veth device pair: one leg of the veth device in the host namespace, one leg in the pod namespace. Token bucket filter queues are attached there. But the real problem is that this is not scalable for production use, and I will show you why. If you, for example, set an ingress bandwidth rate of 50 megabits per second in your pod spec, then the bandwidth meta plugin will attach a token bucket filter (TBF) qdisc to the veth device in the host namespace, attached from the TC subsystem in Linux on the egress side.
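As a refresher, here is a minimal userspace sketch of the classic token bucket algorithm that the TBF qdisc is built around. This is purely illustrative, not the kernel's implementation; names and parameters are my own. Tokens accrue at the configured rate up to a burst size, and a packet may only pass if enough tokens are available:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative token bucket: rate in bytes/sec, burst in bytes. */
struct token_bucket {
    uint64_t rate;        /* refill rate, bytes per second */
    uint64_t burst;       /* bucket capacity, bytes */
    uint64_t tokens;      /* currently available bytes */
    uint64_t last_ns;     /* last refill timestamp, nanoseconds */
};

/* Returns true if a packet of pkt_len bytes may be sent now;
 * otherwise the packet has to wait in the queue. */
static bool tb_admit(struct token_bucket *tb, uint64_t now_ns,
                     uint64_t pkt_len)
{
    uint64_t elapsed_ns = now_ns - tb->last_ns;

    /* Refill: elapsed time converts into newly earned tokens. */
    tb->tokens += elapsed_ns * tb->rate / 1000000000ull;
    if (tb->tokens > tb->burst)
        tb->tokens = tb->burst;
    tb->last_ns = now_ns;

    if (tb->tokens >= pkt_len) {
        tb->tokens -= pkt_len;
        return true;
    }
    return false;
}
```

Note that the shaping state (tokens, last refill time) is shared by all traffic under the same limit, which is exactly why the kernel's TBF qdisc needs a lock around it, as we'll see next.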
But if you look at it from the logical traffic flow point of view, that traffic is ingressing into the pod. So if there are many clients connecting from the internet to that pod, they will all hit that token bucket filter qdisc. And it has actual design issues: given that the token bucket filter qdisc has to track its shaping state, there is a single lock that has to be taken across all CPUs. That is a huge contention point if you receive traffic from multiple CPUs at the same time, and it completely defeats the multi-queue capability of physical NICs. A NIC typically has many receive queues; once packets arrive on the different receive queues, CPUs pick them up in parallel, process them in parallel, forward them in parallel, and then they all hit that single token bucket filter qdisc.

The other issue is that queuing on the receive side is actually a no-go, because it consumes resources. The packet already made it over the wire, it already consumed your network, only to be received on the node, where it then sits waiting in that qdisc instead of being processed, and possibly only to be dropped in the end. So that really causes bufferbloat in your network.

If you look at the other direction: if you install an egress bandwidth of 50 megabits per second, the bandwidth meta plugin will set up an additional device, of the IFB device type, and all the traffic from the host-side veth device is redirected to that IFB device. On that IFB device, the token bucket filter qdisc is installed with the given rate. Why that workaround, you might ask? The issue is that when traffic egresses the pod, it arrives in the traffic control subsystem of the Linux kernel in the host namespace on the ingress side, and on the ingress side the Linux kernel cannot shape. So you need to redirect it to a different device, only so that you are at the TC egress layer, and only there can you attach the token bucket filter qdisc.

So if you have multiple applications in your pod, they can send traffic from multiple CPUs, and they will all hit that same token bucket filter qdisc. This also has design issues, because now you are actually queuing twice: you are queuing on the IFB device, but then also on the physical device, because there is typically also a qdisc attached there that handles all your outgoing traffic; on Linux, by default, that's fq_codel. So that really causes bufferbloat, which is a big issue. Then again, you have the single-lock contention point across all the CPUs, so you cannot make it scale across CPUs.

And then there is also a mechanism in the TCP stack called TCP small queues (TSQ). TSQ is there to reduce bufferbloat, that is, excessive buffering in queues: the TCP stack tries not to push too many packets into the lower layers at once. But now those packets are stuck in that intermediate queue, and once they have been handed off there, the feedback to the upper stack basically fools the TCP stack into thinking the packet might already be on the wire, even though it is still on your host.

And last but not least, you now need three net devices instead of just two. So overall, I would say it's a latency killer, because you have to deal with so much queuing, and it doesn't scale across CPUs.
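To make the contention point concrete, here is a small userspace analogy; this is purely illustrative and not kernel code. No matter how many RX-queue CPUs feed the qdisc in parallel, every enqueue has to pass through the one lock, so the NIC's multi-queue parallelism collapses into a serial section:

```c
#include <pthread.h>
#include <stdint.h>

/* Illustrative analogy: a shaper whose state is guarded by one lock. */
struct shaper {
    pthread_mutex_t lock;    /* stands in for the single qdisc lock */
    uint64_t tokens;         /* shared shaping state */
};

/* Called concurrently from every RX-queue CPU -- but the critical
 * section below serializes them all, defeating multi-queue NICs. */
void enqueue_packet(struct shaper *s, uint64_t pkt_len)
{
    pthread_mutex_lock(&s->lock);
    /* ... refill tokens, then admit, queue, or drop the packet ... */
    if (s->tokens >= pkt_len)
        s->tokens -= pkt_len;
    pthread_mutex_unlock(&s->lock);
}
```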
And yeah, it's not really ready for production use. The other thing is the nature of TCP, sending as fast as possible. If you look at queuing theory, once you get to the point where the utilization of the bottleneck link approaches 100%, what you can see in this graph is that the wait time of packets in the queue basically skyrockets towards infinity.

So there has been some interesting research from Google folks, and they asked themselves: can we get rid of queues entirely? They came up with a really cool idea called the earliest departure time (EDT) model. They basically said: let's replace queuing entirely with two simple core pieces. One is the earliest departure time itself, a timestamp on the network packet in the host which dictates the earliest possible point in time at which the packet may be delivered to the network. The other is the timing wheel scheduler, which enforces this constraint and then sends the packets out.

So how can this model be applied to Kubernetes? Enter eBPF. You've probably heard about eBPF many times at this conference; it's basically a way to make the kernel programmable. You have small programs that get verified for safety and can be attached to different points in the Linux kernel. And Cilium is an eBPF-based CNI, or better, a networking platform, because it does many things: it takes care of pod connectivity, but also service load balancing, network policies, and so on and so forth. I don't want to go too much into details. The one detail I do want to go into is bandwidth management.

So we implemented the Cilium bandwidth manager, and that bandwidth manager allows for lockless, earliest-departure-time-based pod rate limiting with eBPF. What you can see here is that we moved the enforcement point from the veth device to the physical device, so that you don't need to queue twice; you avoid the additional bufferbloat by shaping on the physical device. You also don't need the upper stack: we have a mode in Cilium called BPF host routing, where all the routing can remain in the tc eBPF layer, and packets can be forwarded there directly to the physical device. You can reuse certain functionality of the networking stack, such as the FIB lookup, right out of eBPF itself. And the good thing is that this keeps the TCP small queues feedback intact.

So how does the architecture of the bandwidth manager look? If you set in your pod spec that the pod may send, say, 50 megabits per second, then in the end the packet arrives at the physical device, where an eBPF program orchestrated by Cilium is attached, and that program takes care of setting the packet departure timestamps. Cilium also sets up a multi-queue (mq) qdisc with so-called fair queue (fq) leaf qdiscs on the device, and those fq leaf qdiscs implement the timing wheel scheduler that I mentioned earlier.
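To give a feel for what such a program does, here is a minimal tc eBPF sketch of EDT-based rate limiting. It is loosely modeled on the kernel's tc-EDT selftest rather than on Cilium's actual code; the 50 Mbit/s rate, the single-entry map, and the horizon value are illustrative assumptions:

```c
// SPDX-License-Identifier: GPL-2.0
/* Minimal EDT rate-limiting sketch for tc egress (illustrative;
 * loosely modeled on the kernel's test_tc_edt selftest, not Cilium). */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

#define NSEC_PER_SEC 1000000000ULL
#define RATE_BPS     (50 * 1000 * 1000 / 8)  /* 50 Mbit/s in bytes/s */
#define HORIZON_NS   (2 * NSEC_PER_SEC)      /* drop beyond 2s of backlog */

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} edt_state SEC(".maps");       /* last scheduled departure timestamp */

SEC("tc")
int edt_rate_limit(struct __sk_buff *skb)
{
    __u64 delay_ns = (__u64)skb->len * NSEC_PER_SEC / RATE_BPS;
    __u64 now = bpf_ktime_get_ns();
    __u32 key = 0;
    __u64 *last = bpf_map_lookup_elem(&edt_state, &key);
    __u64 tstamp, next;

    if (!last)
        return TC_ACT_SHOT;

    tstamp = skb->tstamp;
    if (tstamp < now)           /* departure time must not lie in the past */
        tstamp = now;

    next = *last + delay_ns;
    if (next <= tstamp) {       /* under the rate: send at tstamp */
        *last = tstamp;
        return TC_ACT_OK;
    }

    if (next - now >= HORIZON_NS)   /* too far in the future: drop */
        return TC_ACT_SHOT;

    *last = next;
    skb->tstamp = next;         /* the fq leaf qdisc honors this stamp */
    return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";
```

A real implementation additionally needs atomic updates of the shared timestamp (this sketch ignores concurrency) and per-pod state rather than a single global entry.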
If you look at the performance of this architecture compared to the token bucket filter approach, it's really interesting. We ran an experiment with multiple concurrent flows in parallel, in this case 256. Those flows are ping-pong flows: they send one byte of data in one direction and one byte back in the other. We do that because we want to stress latency; we want to see the latency of the whole system. And each of those concurrent flows has a rate limit of 100 megabits per second.

What you can see here is the time in microseconds. The yellow bar is the token bucket filter approach, and the green one is the earliest departure time model that we implemented with Cilium. When you look at the median, it's around seven times lower latency than the traditional approach, which is based on queuing. And even if you look at the P99 latency, you get over 4x better latency. And if you look at the actual transaction rate (when you run netperf, one transaction is basically one ping-pong, so one byte of data in one direction and then the other), you get over seven times the transaction rate under this 100 megabit per second constraint.

All right, so far we have seen how we can implement scalable bandwidth management with Cilium based on the earliest departure time model. But what about thinking even more broadly: can we do better bandwidth management even across the internet? What else does the earliest departure time model enable? It enables BBR. BBR is a TCP congestion control algorithm developed by folks from Google Research. The internet today is basically what you can see on the left side of the picture: a loss-based TCP congestion control algorithm. If you don't change any defaults on your Linux laptop or server, you will have that model on the left side, the so-called TCP CUBIC congestion control algorithm. As I mentioned earlier, it tries to ramp up its congestion window until it experiences congestion loss somewhere, and then it backs down; you see the sawtooth pattern per connection. On the right side, you have BBR. BBR is modeled in a different way: it is built around the delivery rate at which traffic can be delivered to the receiver, as well as the round trip time. So it is not based on packet loss.

So when would it be useful to consider BBR? If you have a Kubernetes cluster where your clients connect from outside over the internet, then it is really worth looking into. It will significantly improve latency on low-end, last-mile networks, because bufferbloat typically happens on home routers, and it improves things there. But it will also significantly improve throughput on high-speed networks with long delays.

So I ran an example. I reserved a server in New York from packet.net and did an iperf measurement to our lab in Zurich. What you can see here with the defaults, TCP CUBIC and fq_codel as the qdisc: looking at the bitrate, it slowly ramps up until it reaches 400-something megabits per second, but then it experiences loss somewhere on the path. At that point it slows down again, and then tries the same thing until it reaches 400-something megabits again. Overall, the average bitrate is 270 megabits per second. Now, if you switch this experiment over to BBR, you will see that it ramps up to 400 megabits per second and plateaus there. So just by changing the congestion control algorithm for connections over the internet, you can go from 270 megabits per second to over 400.
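For reference, the congestion control algorithm can be switched system-wide via the net.ipv4.tcp_congestion_control sysctl, or per socket. Here is a small sketch of the per-socket variant (error handling trimmed to the essentials):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    /* Opt this socket into BBR. Requires the tcp_bbr module to be
     * available; for unprivileged processes the algorithm must be
     * listed in net.ipv4.tcp_allowed_congestion_control.
     * System-wide alternative:
     *   sysctl -w net.ipv4.tcp_congestion_control=bbr */
    if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, "bbr",
                   strlen("bbr")) < 0)
        perror("setsockopt(TCP_CONGESTION)");

    /* Verify which algorithm the socket ended up with. */
    char cc[16] = {0};
    socklen_t len = sizeof(cc);
    if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, cc, &len) == 0)
        printf("congestion control: %s\n", cc);

    close(fd);
    return 0;
}
```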
So now the question is: can BBR actually be used with Kubernetes pods? BBR works in conjunction with the fq qdisc, which implements the timing wheel, so you really need the timestamp on your packets. But the problem is, when you have network namespaces, the kernel actually clears that timestamp when the packet crosses from one veth device to another. So in that case, BBR cannot work: it will get an unstable rate because the timestamps are zeroed.

So why is that? It's because of a kernel limitation. In the kernel, we have a packet representation called the socket buffer (skb), and the socket buffer holds all possible metadata about the packet, including a timestamp. That timestamp has a different clock base on the receive side and on the transmit side: on the receive side it is the so-called CLOCK_TAI, an atomic clock base, and for traffic going out on the transmit side, the clock base is CLOCK_MONOTONIC. The problem is that the socket buffer is the most critical data structure in the Linux kernel networking stack; you have to keep it as small as possible to avoid cache misses and keep performance up. So there is just a single 64-bit timestamp field in the kernel, and that is why it had to be cleared.

People tried in the past to standardize on just CLOCK_TAI, but it didn't work, because the TAI timestamp is typically taken from hardware, for example from a NIC's hardware clock, which stamps the packet. If there is buggy hardware, or if you have clock skew, it messes up the timestamp too much. And if the timestamp ends up with too high an offset, the fq qdisc will drop the packet because it exceeds a certain horizon, and that breaks TCP. So standardizing on a single clock had been tried, but it had to be reverted. And so, when forwarding from receive to transmit, the limitation was that people had to clear the timestamp back to zero. Just as a note: the monotonic clock is not prone to such clock skews. So yeah, that's the reason why it couldn't work.

We worked together with folks from Facebook to fix this problem in the Linux kernel. We presented the issue, I think it was at last year's Linux Plumbers conference, and it turned out they had run into exactly the same thing, so we fixed it together. Now the outgoing timestamp for sockets that live in pods is actually retained: when packets traverse network namespaces, it is kept all the way to the physical device.
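Conceptually, the fix keeps the single 64-bit field but adds a flag recording which clock base the timestamp carries, so that a monotonic departure time can survive namespace crossings while receive timestamps are still cleared. Here is a simplified userspace sketch of that idea; the field names are simplified from the kernel's, so treat them as illustrative:

```c
#include <stdint.h>

/* Simplified model of the skb timestamp after the fix: one 64-bit
 * field, plus a bit that records which clock base it carries. */
struct pkt_meta {
    uint64_t tstamp;               /* rx stamp OR EDT departure time */
    uint8_t  mono_delivery_time:1; /* 1: CLOCK_MONOTONIC departure time */
};

/* Forwarding across a network namespace (e.g. pod veth -> host). */
static void forward_netns(struct pkt_meta *m)
{
    if (m->mono_delivery_time)
        return;        /* EDT departure time: retained for fq/BBR pacing */

    m->tstamp = 0;     /* rx timestamp (CLOCK_TAI base) is still cleared,
                        * so it can't be mistaken for a departure time */
}
```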
So now, with the bandwidth manager architecture that we implemented in Cilium, what basically happens is: you can do all the routing in the tc eBPF layer, the timestamp is retained, and so is the socket association of the packet, which the TCP stack needs for proper feedback. And on the physical device, there is the mq qdisc with the fq leaf qdiscs implementing the timing wheel, and then it can all work. Just as a side note: you only need to enable BBR on the server side; it's not necessary to have it on the client side, because typically all the bulk traffic goes from the server to the client.

So now to our demo. In our demo, we want to show you a streaming service that we implemented, a mini Netflix in that sense, and we want to compare it for CUBIC and BBR under different network conditions. This is how our setup looks: we have a mini cluster, that mini cluster has one node, and on that node there are two pods. There's an FFmpeg pod, which ingests a video and chunks it up into smaller pieces, and the nginx pod is the frontend which serves those video chunks to a client; it sits behind a Kubernetes service that we expose to the internet. And last but not least, we have an external client. It can be somebody with a phone, with a laptop, or just a regular workstation; it connects to that Kubernetes service over the internet. We emulate latency, and later on latency plus drop, with the help of a so-called netem qdisc. The netem qdisc in the Linux traffic control subsystem exists explicitly to simulate bad network conditions. We want to show this under two different configurations: one is the bandwidth manager with the CUBIC TCP congestion control algorithm, and the other is the feature that we implemented now, with BBR. And with that, I'm switching over to Christopher.

So what we have here is two clusters. We tried to make them as similar as possible; they're both running in Madrid in the same data center. The only difference between them is that one is enabled with CUBIC, the standard configuration, and the other is enabled with BBR. And when we stream against them, we see two different videos, okay? We got really excited when we had a lot of latency problems just walking around. I'm personally data-capped on my smartphone, so I can't watch a lot of videos or else they're constantly buffering.

So what we have here is one of our colleagues in Pontresina doing a little bit of sky surfing. On the left side, we have our cluster with the CUBIC setup, and what we see is that CUBIC is constantly stuck at a low resolution while we're loading everything. We did introduce a bit of distress: we're introducing packet loss, so 1% of packets are being dropped, as well as adding our own 100 milliseconds or more of latency. And on the right-hand side, with BBR, we're able to sustain a much higher resolution, even under distress. Both have exactly the same amount of latency injected into them. But what you see on the left side, if you open up your Chrome tools, is that even at 100 milliseconds, we're never able to pull down the high resolution with CUBIC. On the right side, with the same amount of latency, we always get HD quality.

And the important piece here: there's obviously still congestion and there are packets being lost, but when BBR hits that, it may back off for a bit, but then it goes right back up and hits the HD limit. Under this sustained latency, CUBIC will never cross over into the HD range, and it will often have a lot more rebuffering.

We can also look into a few of the statistics using the ss utility. What we see here with CUBIC is our congestion window being set a little higher initially, but it will sustain around seven or so segments that it's able to send during that window. Whereas if we do the same thing over on our BBR cluster, we get a much higher cap that is sustainable the whole time. In here, we're operating at around 416 segments, so much, much higher than we were able to send before.
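If you want to pull the same numbers programmatically rather than via ss, the kernel exposes them through the TCP_INFO socket option. A minimal sketch, assuming fd is a connected TCP socket (we use the kernel's linux/tcp.h header, since libc copies of struct tcp_info may lack the newer delivery-rate field):

```c
#include <linux/tcp.h>    /* struct tcp_info incl. tcpi_delivery_rate */
#include <netinet/in.h>   /* IPPROTO_TCP */
#include <stdio.h>
#include <sys/socket.h>

/* Print a few of the stats that `ss -ti` shows for a connected TCP
 * socket. tcpi_delivery_rate is the rate sample that BBR builds on. */
void print_tcp_stats(int fd)
{
    struct tcp_info ti;
    socklen_t len = sizeof(ti);

    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) != 0) {
        perror("getsockopt(TCP_INFO)");
        return;
    }
    printf("cwnd:          %u segments\n", ti.tcpi_snd_cwnd);
    printf("rtt:           %u us\n", ti.tcpi_rtt);
    printf("delivery rate: %llu bytes/s\n",
           (unsigned long long)ti.tcpi_delivery_rate);
}
```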
I should mention that the guy in the video was our colleague, Tom. He was flying his paraglider at one of our on-site events when we went skiing.

All right, so now the question is: is BBR the golden solution? Well, BBR has potential unfairness issues. For example, if you use it in combination with CUBIC, so some nodes in your cluster are on CUBIC and some others on BBR: BBR is pretty aggressive, so it can basically overrun CUBIC. You will also see a higher rate of TCP retransmissions, simply because it probes more aggressively. The Google Research folks are working on BBR v2 in order to overcome those limitations. I just want to call out an interesting talk from the Netdev conference, from someone at Dropbox, where they deployed BBR on the edge and also did some early measurements of BBR v2. What you can see here, on the green side, is around a 3% TCP retransmission rate with BBR v1, whereas v2 stays around 1%. So it is trying to overcome those issues. On the other hand, BBR has been deployed in production on a large number of fleets, so it is definitely something interesting to consider.

Yeah, so coming back to the earlier problem statement: how can you make sure that pods can be rate limited? The bandwidth enforcement in Kubernetes really doesn't have to remain in a poor state. The Cilium bandwidth manager implementation will be GA with the next Cilium release, which goes out shortly after KubeCon. It implements efficient and scalable eBPF-based enforcement with the help of the earliest departure time model. And with the work that we did around BBR, retaining the timestamps when packets traverse network namespaces, we are basically the first CNI to support BBR, which you can use and deploy your pods with. As a side note, I should mention that realizing this whole architecture around the bandwidth manager is possible thanks to eBPF, simply because with eBPF you can do flexible traffic classification and also set the packet delivery timestamp.

If you want to try out the bandwidth manager, there's a Getting Started guide in our Cilium documentation. The bandwidth manager itself, if you want the EDT-based rate limiting for pods, needs kernel 5.1. The support for BBR is a little bit newer, because we only got that merged into the Linux kernel around December, so it requires 5.18. And with that, I would love to thank a couple of folks, especially from Google, Facebook, and from our team, and the Cilium, BPF, and netdev kernel communities. And yeah, that's it for the talk. Thanks a lot. If you have questions, I'm happy to answer.

Yes. Not sure who will give you the mic. Otherwise, please shout, or I will. All right. Yes.

I should repeat the question. The question is: if there is a lot of bandwidth available and CUBIC is trying to ramp up quickly, is it the same with BBR? Yes, it is. It just has a different delivery model around it: it's not keyed to packet loss, but it will see the delivery rate based on its measurements, and it will plateau at that point. So yes. Also, if there's an increase in available bandwidth, yes, exactly, they will both have the slow-start ramp-up. Yeah. Cool.

OK, so the question is: if you have a load balancer in the middle, do you also have to set up BBR on the load balancer?
No, you don't. You only have to set it up on the server side, so on your backends, because it's actually transparent. You really need to have it on the socket, and that's on the backend, in the application. Yeah, exactly, because the packet is basically load-balanced at the L4 layer, and you just redirect it, you forward it; it's a node in the middle, yeah. Cool, all right. If you have further questions, we are still around here, so you can just come to us and ask, or ask in Slack; we will try to look there. And yeah, with that, thanks a lot.