All right. Hello and welcome to my talk. I'm Daniel Borkmann. I work at Isovalent, on Cilium itself and on eBPF in the Linux kernel. I co-maintain eBPF there and have been contributing for a very long time. Today the talk is about turning up performance to 11. I will talk about Cilium, a new veth device replacement that we built, and BIG TCP.

The goal, or the experiment I had for this talk, was basically: wouldn't it be nice to just turn up the volume knob and improve your performance? Unfortunately, it's not always that easy. The question was really: what would it take to get to maximum performance?

First, the question of why: why is it relevant in the first place? The use cases you might have can differ. One could be scale: you're adding more workloads to your Kubernetes environments, you're connecting multiple clusters in a mesh, and therefore the traffic being pushed around increases. Maybe sustainability, to make better use of your existing infrastructure and reduce your off-prem or on-prem costs. Or performance-wise, to reduce RPC workload latencies, or to better cope with potentially escalating bulk data demands that you have, maybe from AI or machine learning workloads. Actually, for the latter, there's quite a big push in the industry right now. We see new NICs coming up with 800 gigabit and beyond, and a big hype around AI and machine learning pushing data center innovation. The hyperscalers are increasing their capacity, and we even see switches coming to market with 51.2 terabit per second, which is really crazy. The question, coming back to the Kubernetes world, is what a platform would look like that could address those future demands. The more practical question is what we can benefit from today, and especially without having to rewrite existing applications, of course.

If you look at the standard Kubernetes architecture or setup, you have your host, there's a kubelet running, kube-proxy running, and if you deploy Cilium as a CNI, for example, there's the Cilium agent, the daemon itself, and a CNI plugin. Whenever a new pod is spawned, a handle to the pod's network namespace is basically given to the CNI plugin. The CNI plugin talks via RPC to the Cilium agent, and eventually it will set up networking devices, IP addressing, routing, and in our case also BPF programs. Then, when traffic is going in and out of the pod, it will basically use the upper-stack IP forwarding layer, netfilter, routing, and so on.

There are a couple of problems with that. If you look at the scalability and performance aspects of such a setup, one is kube-proxy scalability: if you have a lot of services, you run into issues. But there is also routing through the upper stack, which may not be very obvious. There are potential reasons for such a setup: maybe you initially deployed the cluster and just went with the defaults, which are very conservative so that you can run on a bigger variety of environments, older kernels and so on. Or maybe you have custom netfilter rules installed, so you have to go through netfilter, or you cannot replace kube-proxy for one reason or another. As for kube-proxy, probably many of you know it from debugging and troubleshooting in production: when you have a lot of services, it's a linear walk trying to match one of the service tuples, so it can be quite some overhead.
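To make that contrast concrete, here is a minimal sketch of the idea behind an eBPF-based service lookup; the key and value layout here are my own illustrative assumptions, not Cilium's actual data structures. The point is that matching a service is one O(1) hash lookup however many services exist, instead of a linear walk over rules:

```c
// Illustrative sketch, not Cilium's actual map layout: service matching
// becomes a single hash map access instead of a linear rule walk.
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

struct svc_key {
    __u32 vip;    /* service address (IPv4, network byte order) */
    __u16 port;   /* service port */
    __u16 proto;  /* IPPROTO_TCP / IPPROTO_UDP */
};

struct svc_backend {
    __u32 addr;   /* selected backend address */
    __u16 port;   /* backend port */
    __u16 pad;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, struct svc_key);
    __type(value, struct svc_backend);
} services SEC(".maps");

SEC("tc")
int svc_lookup_demo(struct __sk_buff *skb)
{
    /* a real datapath would parse the packet headers to build this key */
    struct svc_key key = {};
    struct svc_backend *be = bpf_map_lookup_elem(&services, &key);

    if (be) {
        /* here the service address/port would be rewritten to *be */
    }
    return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";
```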
For the upper stack: when the packet leaves the pod in the egress direction and goes up the stack, there's something in the kernel called SKB orphaning. For TCP traffic, for example, it is basically there to tell the TCP stack that the network packet already left the node. As you can see, that is not actually the case, because the packet is still inside the host namespace. It is there, and it is hard to remove, because netfilter and kube-proxy rely on it. But doing that too soon, before the packet actually left the node, breaks TCP back pressure, because the TCP stack thinks the packet already left the node and it can push more. And because of that, you can evade the send buffer limits.

When you look at the performance, what you can see here on the right side, the yellow bar, is when the application is inside the host itself and you run a TCP stream workload test with a 100-gigabit NIC over the wire; I had two machines back to back. With an 8K MTU you reach 100 gigabit per second, but if you look at the upper-stack forwarding with the veth devices, it's not that great: it's 63 gigabit, and the reason is that TCP back pressure breaks. So the question is really: can we achieve the same for Kubernetes pods as well? The answer is yes, and I will take you through the journey that we did.

Coming back to our journey: initially, when we worked on Cilium in the early days, the first thing we did was replacing the kube-proxy component with a BPF-based implementation in order to be more scalable. That covers all the Kubernetes service types. For the north-south direction, we have per-packet load balancing at the TC BPF layer. What you can see here is that the Cilium agent, when it spawns up, attaches BPF programs on the physical devices at the TC BPF layer of the stack. In east-west, we got rid of the per-packet NAT and are doing the backend selection at connect time. We also have Maglev hashing and host-port support.

The next logical thing after that was to add support for XDP-based service load balancing. Given we already had the load balancer on the TC side, it was just a matter of porting it over to XDP. XDP stands for eXpress Data Path, and it is basically an attachment point inside the driver where you can build high-performance load balancing, so that you can easily colocate workloads and still keep the high-performance aspect to better scale out. That also covers all the Kubernetes service types, and we have Maglev and DSR support for this as well. There was a nice blog post on the Cilium website with a production graph. What you can see there is a test run where initially there was IPVS in production, and then it got moved over to a single node doing the layer 4 load balancing with XDP. What you can see is the CPU overhead: it's really minimal. Once it was taken out of production again and moved back to IPVS, with actually two nodes handling that production traffic, it went really high again. XDP really allows for low overhead because it sits so early in the receive path.
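For orientation, this is roughly what the skeleton of an XDP program looks like; it is a bare-bones sketch of the attachment point only, and deliberately not Cilium's real load balancer logic:

```c
// Bare-bones XDP skeleton: the program runs inside the driver, before
// the kernel even allocates an skb for the frame.
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_lb_stub(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;

    /* bounds check required by the verifier */
    if ((void *)(eth + 1) > data_end)
        return XDP_DROP;

    /* a real L4 load balancer would parse the IP/L4 headers here, pick
       a backend, rewrite addresses, and bpf_redirect() the frame out */
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```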
The next step that we thought was really useful performance-wise was to add a bandwidth manager to Cilium. The idea of the bandwidth manager is that we support egress bandwidth limits in Cilium, so that you can say: for this kind of pod I want 50 megabit, for other pods 100, whatever, so that you have scalable egress rate limiting.

The way this works is that the Cilium agent sets this up on the NIC: you typically have multi-queue on the transmit side, and it sets up the FQ scheduler there, the fair queue scheduler in the kernel. We have a BPF program which tells that scheduler the departure time of each packet, so that you can set this departure time based on the rate you want to have. This allows for potentially lockless rate limiting. What you can see in this graph on the yellow bar is the classical way to achieve rate limiting, for example hierarchical token buckets; that's the usual way in the Linux kernel. Thanks to the FQ qdisc and setting the timestamps, the earliest departure time, from BPF, you can get a more than 4x better P99 latency, so it's really nice in this aspect.
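As a rough illustration of the earliest-departure-time (EDT) technique, here is a minimal sketch, assuming a single aggregate whose rate sits in an array map; Cilium's real implementation is per-pod and handles concurrency more carefully:

```c
// Minimal EDT rate-limiting sketch for TC egress. The single-entry map
// and the lack of atomics are simplifications for illustration.
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

#define NSEC_PER_SEC 1000000000ULL

struct edt_state {
    __u64 bytes_per_sec; /* configured egress rate */
    __u64 last_tstamp;   /* departure time given to the previous packet */
};

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, struct edt_state);
} edt_map SEC(".maps");

SEC("tc")
int edt_egress(struct __sk_buff *skb)
{
    __u32 key = 0;
    struct edt_state *st = bpf_map_lookup_elem(&edt_map, &key);
    if (!st || !st->bytes_per_sec)
        return TC_ACT_OK;

    /* wire time this packet occupies at the configured rate */
    __u64 delay = (__u64)skb->len * NSEC_PER_SEC / st->bytes_per_sec;
    __u64 now = bpf_ktime_get_ns();
    __u64 next = st->last_tstamp + delay;

    if (next < now)
        next = now;
    /* the FQ scheduler treats skb->tstamp as the earliest departure time */
    skb->tstamp = next;
    st->last_tstamp = next;
    return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";
```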
That's not the only advantage that the bandwidth manager provides. One thing that we added in kernel 5.19, a colleague from Facebook and myself, after we ran some experiments and fixed this in the stack, concerns applications inside the pod that would like to use BBR congestion control. That's the congestion control algorithm for TCP which was developed by Google; it's called Bottleneck Bandwidth and RTT, and it uses a different congestion signal than the default CUBIC one, which is loss-based. The problem was that it was not possible before to use it for pods, because whenever a network packet crosses the network namespace boundary, the departure timestamp that the congestion control algorithm sets is cleared. The fix that we did in the kernel unlocked this, so now you can use BBR from pods and benefit from it, with the FQ scheduler from the bandwidth manager then able to schedule the packet at the right time. At KubeCon some years ago, I did a demo where I compared BBR versus the default CUBIC over a lossy network; think of consuming services from a Kubernetes cluster from a mobile device or over Wi-Fi, for example. We had a streaming demo where the streaming server was inside a pod, and over the lossy network what you could see is that with BBR the video was staying in high definition, while with the default it was falling back to low resolution because the TCP congestion window was reduced too aggressively. So that's the bandwidth manager.

The next feature that we added to Cilium is called BPF host routing. The whole idea there is: don't use the upper stack for routing; we can do all the routing at the TC BPF layer. We can also make use of kernel facilities like the kernel routing table out of BPF, because it's exposed there as a helper, so your routing daemons can still work and you can benefit from this without having to go through the upper stack. There are two new helpers that were added on the kernel side for this. One is called bpf_redirect_peer, the other bpf_redirect_neigh. bpf_redirect_peer is for the logical ingress path into the pod: it allows a fast network namespace switch from the ingress side of the physical device to the ingress side of the veth device inside the pod. What happens internally in the kernel is that it just resets the device pointers from the physical device to the device inside the pod and then does another round in the main receive loop of the kernel, instead of going through the veth device, where the packet would be enqueued in a per-CPU backlog queue, which adds more latency and is slower. So those are the kernel internals, if you're interested. On the egress side, the new helper, bpf_redirect_neigh, allows you to inject the packet into layer 2 of the kernel, into the neighbor subsystem, and it helps to resolve the source and destination MAC addresses. This can be combined with the FIB lookup helper that exists in BPF to cover the outgoing path as well. The interesting thing is that, given this bypasses the upper stack, there's also no longer the problem I mentioned earlier where the packet would be orphaned too early; this fixes the TCP back pressure issue that was there before.
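A hedged sketch of the two helpers just described; the ifindex value and the header parsing are assumptions, and error handling is trimmed compared to real code:

```c
// Sketch of BPF host routing with bpf_redirect_peer / bpf_redirect_neigh.
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

#define AF_INET 2
#define POD_HOST_IFINDEX 5 /* assumed: host-side ifindex of the pod's device */

/* Ingress: attached on the physical device. Redirect straight into the
   pod's netns via the peer device, skipping the per-CPU backlog queue. */
SEC("tc")
int host_ingress(struct __sk_buff *skb)
{
    return bpf_redirect_peer(POD_HOST_IFINDEX, 0);
}

/* Egress: attached on the host side of the pod's device. Route via the
   kernel FIB, then let the neighbor subsystem fill in the L2 addresses. */
SEC("tc")
int pod_egress(struct __sk_buff *skb)
{
    struct bpf_fib_lookup fib = {};

    fib.family = AF_INET;
    fib.ifindex = skb->ingress_ifindex;
    /* a real program also fills the addresses from the IP header here */

    if (bpf_fib_lookup(skb, &fib, sizeof(fib), 0) == BPF_FIB_LKUP_RET_SUCCESS)
        return bpf_redirect_neigh(fib.ifindex, NULL, 0, 0);

    return TC_ACT_OK; /* fall back to the regular stack */
}

char _license[] SEC("license") = "GPL";
```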
So this is how the complete picture looks, and if you look into the performance numbers again, this allows us to push the TCP bulk stream workload to 90 gigabit per second instead of 63, so it gives a huge performance advantage. It's still not perfect, as you can see here: it's not 100% on par with the application running inside the host itself. So there are more pieces to the puzzle.

The next one, which I added into the kernel this summer actually, is called TCX. TCX is, in short, an express data path for TC, basically a new design for the TC datapath. Why? Because it is really crucial: it's the foundation for Cilium, where we attach everything in the main datapath that we have in the kernel, and the goal was to make it more efficient, more modern, and also more robust. For example, we see more and more users of the TC datapath, so when there are multiple users, they shouldn't step on each other; that got support through BPF links, a BPF concept that could not be added to the old TC given its internals. There's a more efficient fast path, and also dependency controls, where you can say: I want to attach my program before or after that other program, so that you have relative attachment points; from a user perspective this is nicer than what you had before with old-style TC. Back in the day, in 2015, when I added the initial support for TC BPF, it was the most natural fit at the time, but now it was a good point in time to make this more modern. With that I also added a framework inside the kernel called bpf_mprog, a multi-attach layer where multiple BPF programs can be attached in an efficient way in an array. An array is more cache-line friendly than a linked list, for example, where you would have more cache misses. The goal of this framework was to provide a common look and feel from an API perspective, because the idea was not to have it just for TCX but also at other attach locations, for example XDP in the future, so that we support multiple programs at the XDP layer, netkit devices, which is something I will mention later, cgroups, and so on.

Just to give you the picture, this is how it looked old-style, the classic TC example picture, where you have this fake qdisc on ingress and egress which doesn't actually do any queuing in that sense; it's just a container structure for BPF programs, or for the old-style TC that was there before BPF. This got really simplified into a more efficient array, with an efficient entry point for BPF. From a microbenchmark perspective, the entry into a BPF program got reduced by half, so we could cut the cycles in half when the cache is hot, and I think the benefits will be even better when you don't always have the data hot in the cache. So now everything has moved to TCX, for the physical devices and for the veth devices.
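To show what attachment looks like from user space, here is a small sketch against libbpf's TCX link API (libbpf 1.3 and later); the object file name, program name, and device name are illustrative assumptions, not Cilium's actual code:

```c
// Minimal TCX attach sketch. Assumes prog.bpf.o contains a program in a
// SEC("tcx/ingress") section named tcx_ingress_prog; names are made up.
#include <bpf/libbpf.h>
#include <net/if.h>

int main(void)
{
    struct bpf_object *obj = bpf_object__open_file("prog.bpf.o", NULL);
    if (!obj || bpf_object__load(obj))
        return 1;

    struct bpf_program *prog =
        bpf_object__find_program_by_name(obj, "tcx_ingress_prog");
    int ifindex = if_nametoindex("eth0"); /* assumed device */
    if (!prog || !ifindex)
        return 1;

    /* opts can express dependencies, e.g. attach relative to another
       program via relative_fd / relative_id */
    LIBBPF_OPTS(bpf_tcx_opts, opts);

    /* link-based multi-attach: no qdisc involved, and the program goes
       away with the link instead of lingering on the device */
    struct bpf_link *link = bpf_program__attach_tcx(prog, ifindex, &opts);
    return link ? 0 : 1;
}
```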
The next step in that journey, which got merged recently, is to replace the veth devices with something called netkit. The whole idea is that with veth devices there's still this point on the logical egress path, for traffic going out of pods, where the packet has to take a queue, the per-CPU backlog queue; this is internal to the veth driver in the kernel, but it adds latency. So the idea is to move the BPF program that Cilium attaches on the host side of the veth device into the device inside the pod, in such a way that applications cannot unload it. We move the BPF program closer to the source, and in the BPF program, when we know that we want to push the traffic out of the node, we can directly forward it without having to go through this per-CPU backlog queue, so we can remove this additional, artificial inefficiency. The other interesting thing while working on this was: why not make this an L3 device from the beginning? veth is L2, so you need to have an ARP resolver as well, but we can just get rid of that too. This got merged just recently, not too long ago.

It's interesting to compare this to veth and also to ipvlan, how netkit stands against those two. veth, as I mentioned, is L2; ipvlan is L3, and we also wanted netkit to be an L3 device so that you don't need to resolve ARP from the pod side. For veth and netkit you have both a primary device and a peer device; ipvlan works a little differently: you have to make one of your physical devices the master device, and then you can add ipvlan slave devices which you then move into the network namespaces. But the BPF programming for ipvlan is a bit of a hassle. In Cilium we once had ipvlan support, but later on we removed it again, because when you want to do policy enforcement, back then we added the BPF program into the ipvlan device that is inside the pod, and that is a problem: an application could just remove it again if it has the rights to do so, and this is something we definitely wanted to avoid. And if you don't add it inside the pod, then in the normal default ipvlan mode one ipvlan device could just talk to another ipvlan device directly, without somehow going through the host, so you cannot do any policy enforcement. That's another hassle, so essentially you have to attach your BPF program to the physical device for all the different pods. But then the problem is that users don't only have one physical device; that's one use case, but there can also be multiple ones, and then you have to pick, and then you have to add, for example, a dummy device and make that your master device, and it gets complicated very quickly. So basically we wanted this nice model from the veth devices, where you have one device in the host and one in the pod, and it's quite flexible, so we kept this for netkit as well.

As I mentioned, netkit allows this fast network namespace switch on egress, and you can combine this with the FIB lookup. In the case of ipvlan, it is, I would say, slightly less efficient in this regard, because if you look into the ipvlan code, it has an internal forwarding table that is not using the kernel FIB, just for internally deciding whether a packet has to go outside the node or to another ipvlan device in a different pod; it has an internal hash table where it does that lookup, and when you then actually want to route the traffic out of the node, there's an additional kernel FIB lookup. With netkit you can get rid of those two and only have the one lookup, or, if you have a use case where you don't need a FIB lookup at all, you can just redirect out of the node without doing one, so you're more flexible in this sense. It basically takes the best of both worlds.

If you look at the performance side, at the flame graphs, there's a problem you sometimes see under pressure with veth devices: you have an application that is sending something, at some point the packet is queued to the per-CPU backlog queue I mentioned earlier, and then another thread picks it up, processes it from there, and sends it out. Oftentimes this gets deferred to the kernel's ksoftirqd daemon, and that's not really nice. What we want is what's shown here on the right side in the case of netkit: everything stays in the process context all the way when you send the packet out of the node, so that the process scheduler in the kernel can account that time better and make better scheduling decisions. And if you look at the performance now, you see this purple bar that was added, and the throughput is as high as it is on the host, so it gives you the full power. Same for latency: P99 latency is as low as it is on the host. So now we have zero overhead for pod networking.
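Since netkit programs are also managed through BPF links, attaching from the agent side could look roughly like this sketch (bpf_program__attach_netkit is in libbpf 1.3+; the device name is an assumption, and creating the netkit pair itself, which happens over rtnetlink, is omitted):

```c
// Sketch only: attach a program to a netkit device from the host side.
// "nk0" is an assumed device name; prog must be loaded with a matching
// netkit attach type (e.g. from a SEC("netkit/peer") section).
#include <bpf/libbpf.h>
#include <net/if.h>

int attach_to_netkit(struct bpf_program *prog)
{
    int ifindex = if_nametoindex("nk0"); /* assumed netkit primary device */
    if (!ifindex)
        return -1;

    /* the link is owned by the host-side agent, so workloads inside the
       pod cannot detach the program from within their namespace */
    LIBBPF_OPTS(bpf_netkit_opts, opts);
    struct bpf_link *link = bpf_program__attach_netkit(prog, ifindex, &opts);
    return link ? 0 : -1;
}
```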
So the question is: can we push even further? We're not stopping here. There's an interesting technology that was added to the kernel by Google engineers, the TCP maintainers, called BIG TCP. It was initially merged for IPv6 first, in 5.19 (I think most of Google's traffic is running on IPv6), and later on it also got added for IPv4, for the rest of the world. The idea of BIG TCP is basically to aggregate more, to be able to better cope with the demands of future NICs, so 200, 400 gigabit and beyond, at those speeds. In Cilium we added support for this because I think it's a very interesting feature that gives you a nice performance boost. There are actually no changes that you need to make on your network or on the MTU; all of this is done on the local host, where it will try to aggregate many small packets into a big packet and push that up. When we added this for IPv4, some of the drivers exposed different limits, and there was a nice comment from an Intel engineer when support for their ice driver got added: it actually improved their request/response rates, off versus on, by 75 percent, which is really impressive.

The way this works, as I mentioned: this is a typical picture of the networking stack, and what you can see in those red boxes is the GRO engine, generic receive offload. What it tries to do is take MTU-sized packets, when you have a TCP packet stream, and aggregate the small packets into a bigger packet, here 64K, and push only that one big packet to the upper stack, so that you save resources: you don't have to process something n times, just once. The same is the case for the egress path of the stack: it pushes down a super-sized packet, and then, either in software or, as most network cards support, in hardware, TSO, TCP segmentation offload, chunks up this packet and pushes it out as small packets. And when you receive it somewhere else on another node, the same happens again: GRO will try to aggregate and push it up the stack. The whole problem with this is that 64K is an upper size limit, because the IP header has a total length field that is 16 bits, and therefore the upper limit is 64K. BIG TCP has a way to overcome this: for IPv6, for example, it inserts a hop-by-hop header in the kernel, as in a jumbogram, and then you have a 32-bit length, so you can have much bigger aggregation limits.

What we did in Cilium, this is in Cilium 1.14, supports both IPv4 and IPv6. We did some measurements, and 192K was a sweet spot where you get most of the performance gains that we saw, and that's what Cilium will set up underneath: it configures the physical devices for that aggregation, and also all the devices of the pods. It has probing as well, because the Mellanox NICs that we have allow a much higher limit for the transmit segmentation offload, so 192K can easily be supported, while the Intel ones are slightly lower at 128K; that's why we added probing, so that Cilium lowers this slightly to still support other NICs as well.

Running the performance benchmarks: the blue bar that you can see here is the latency in microseconds. Going from disabled to enabled gives a 2x better P99 latency, so even request/response type workloads benefit from it. The TCP stream workloads do as well; it would be nice to have larger NICs to test this out, but you can definitely see the improvements. And the transactions per second for a request/response type workload also improve very significantly. So this is a really nice feature, and it's low-hanging fruit, because it's easy to enable.

So that's basically the complete picture of our journey.
For the future, we are also looking into other features that have been merged just recently or will be merged later on. Just recently, there's the TCP microsecond timestamp resolution; that is interesting as well, and I will look into adding that to Cilium too. And then BBRv3, which is an improved version of the BBR TCP congestion control, basically to address the relatively high retransmission rate. There are still some discussions, but as far as I'm aware this will be upstreamed very soon.

So, to conclude: with some of the improvements we did on the eBPF and Cilium side, we can get rid of the network namespace overhead for Kubernetes pods. To be able to better deal with 100 gigabit and beyond, there's BIG TCP, which is a very interesting feature; you don't have to change anything in the network, you can just benefit from it with a software change. And to go even higher than that, even beyond, there's still work in progress on the kernel side; that would very likely be TCP zero-copy, because if you look into the remaining performance overhead when you only have BIG TCP, it's mostly on the copy side, copying the data from the kernel to user space, and that's where TCP zero-copy comes into play. A few weeks ago in Paris we had a Netconf workshop where the kernel developers from the community met up, and there are some interesting features there, but they're still in the works. For TCP zero-copy, what would be needed is support for header-data split in the kernel; right now there isn't really a good framework to configure it, or even to see where devices support it, but this is going to come. Header-data split is needed because you want to separate the headers onto one page, and the data, which you then want to memory-map into an application, onto page-aligned pages for the application. The combination of BIG TCP and TCP zero-copy is really interesting because it gives very low overhead for transferring megabytes of data. And what is then really interesting for AI and machine learning workloads is to memory-map this into GPU memory, so that you don't have this detour through host memory; there are also patches for that under discussion on the kernel mailing list right now. With that, my talk is concluded, and if you have any questions, I'm happy to take them. Thank you.

So, a lot of the optimizations that you mentioned today require relatively recent kernel versions. Does Cilium detect which features are available based on the kernel version, or does it have much stricter minimum kernel requirements now? Yeah, the minimum kernel requirement for Cilium is right now 4.19, so it's really old; before that we had 4.9, and we recently bumped that because 4.9 is, I think, out of date, it's super old. But we basically detect whether a feature can be used, and it's also up to the user, depending on what kind of setup or what kind of constraints they have, so you can opt into this.

Thanks for your presentation. On the BIG TCP example, you say you can turn it on and off; we're on Cilium 1.14.3, so how do you turn it on and off? In the Helm values, what flag are we looking for? Okay, so it should be in the documentation of Cilium; there's a Helm flag, and I don't have it in my head right now, but I have linked it here, so it's actually in the slides if you want to look it up.

Yeah, I have one question. The standard use case is to have Kubernetes in a VM,
right, in a public cloud. So does all this somehow work there as well? Yeah, it does. Even for the XDP-based load balancing: a lot of public cloud providers offer SR-IOV based networking, and then, for example for the Mellanox or Intel NICs, you can attach XDP natively there, so that's no problem. Okay, that's exactly what I was looking for. Yeah, and all the improvements that I mentioned here, the veth replacement, bandwidth manager, and so on and so forth, also BIG TCP, they all work in public clouds as well.

Hi, same question: what about VMware? So again, any issues with all these options on Open vSwitch with VMware? Oh, so with Open vSwitch, I mean, Cilium doesn't support Open vSwitch; Open vSwitch is a different datapath technology compared to eBPF. But I presume maybe you meant OpenShift? Okay, we can take it offline, it's fine. You mean SR-IOV, what does it require? Okay, I'm not too familiar with vSphere, to be honest, but the XDP one is the only one where you have specific NIC requirements. All right.

Hi, thanks for the presentation. My question is: if we have multi-tenancy and we are doing eBPF in the kernel, are there any security concerns there? I mean, eBPF itself is set up by Cilium as the CNI, right? There's one CNI running, and that's setting up the eBPF programs on the host. eBPF programs, when they are pushed into the kernel, are verified for safety so that you don't do harm to the kernel: the verifier does a static analysis of the program to understand what the program is doing, and rejects it otherwise. And that is basically also restricted to root only, so applications in the kube-system namespace, such as Cilium, will be able to use it, but other application pods not really. Thank you.

All right, one more question. Wonderful presentation. As I know, BIG TCP for IPv6 will set the length field in the header and add an extension header in the IPv6 header. So if a server runs outside the cluster without BIG TCP IPv6, and the client runs in the cluster with BIG TCP IPv6 enabled, can they communicate with each other, and if not, how do you solve this issue? Can they do what, can you repeat the last part? Can they communicate with each other? Oh yeah, this extension header that is being added is only local on the host, right? It's not going out to the network. It is really only there in order to push a larger packet up the stack, but other than that it's not going onto the wire. Oh, thank you. I didn't find the answer to this question in Cilium's documentation, so I asked it here today. Yeah, thank you.

Thanks for your talk. Quick question about express data path versus, sort of, eBPF: what led to your decision to use that as the load balancing piece? Yeah, so the express data path, I mean, that's basically an attachment layer for eBPF inside the driver, and all the major drivers by now do support it. Given we initially implemented the service load balancing and so on with TC BPF, the XDP extension was a natural fit to be able to use that, because you can benefit so much from it, the overhead is so low. Why? Because you can move this closer to the source; that's really the first possible point in the software stack where
something can be processed, where something gets picked up by the driver from the receive rings, and you can run and do some actions with it. So yeah, it was a natural fit. Oh yeah, absolutely; I was just under the impression that it was sort of an Intel express data path piece. Oh no, no, all major vendors support XDP, and cloud providers as well. Okay, gotcha, cool, that makes sense, thank you. Okay, perfect. All right, thank you very much.