Hello, welcome to my talk. I'm Daniel Borkmann. I work on Cilium and on the Linux kernel, where I co-maintain eBPF. Today's talk is about 100-gigabit-per-second clusters with Cilium: building tomorrow's networking data plane.

When looking at data center networks, what are the big challenges? I think you can put them into three big pillars: scale, performance, and operations. Each of them would probably deserve a talk of its own. Scale means scaling out to many nodes and many pods; performance means making the best use of your resources; and operations means making changes to your clusters frequently without disruptions, for example. So my question is: what if IPv6 could address both the scale and the performance requirements from those three pillars? That's what I'm going to elaborate on in this talk.

Before looking further into the future, I actually want to take a step back to 2016, when we first announced the Cilium experiment. We started out with IPv6-only container networking together with eBPF and XDP. For those of you who don't know what eBPF is: it's a kernel technology that makes the kernel programmable in a safe way. So this is how we started out: IPv6-only networking with scalable, flexible, global addressing, so you don't need NAT. We tried to abstract away from traditional networking models, focusing only on layer 3 and above, and we built all of Cilium on top of eBPF for maximum efficiency. We were really excited about IPv6 because of the way we could implement the data plane with it, until of course reality kicked in. Back in 2016, the state of IPv6 in Kubernetes, and in Docker in particular, wasn't quite there yet. So we had to go back and implement IPv4 support upon popular demand, which is probably what most people run Cilium with today. Fast forward to today: how does the situation look?
Kubernetes has supported IPv6 single stack for quite a while now, and dual stack as well. The hyperscalers have also made progress integrating IPv6 into their environments, for the most part as dual stack. If you look at the managed Kubernetes offerings, AKS, EKS, and GKE all offer various levels of IPv6 support, most of them dual stack, one of them single stack. That's already great.

Why do users want to go there? A lot of the time we hear that they want more IPAM flexibility: they don't want to run out of IP addresses, and IPv6 gives them enough headroom, in particular with large clusters that have lots of churn. There was a really interesting dual-stack adoption panel at the last KubeCon, which I can recommend watching. Right now dual stack is regarded as a transitional state: looking further out, we want to go to IPv6 end to end in order to avoid the complexity of running both IPv4 and IPv6 in your cluster.

The typical approach we've seen is that users try to build out islands of IPv6 single-stack on-prem clusters as a clean slate, and then successively move applications and services into them. So then you deploy an IPv6-only cluster, with no IPv4, and you face reality again: you will have to interface with IPv4 somehow, unless you're air-gapped or really lucky with your dependencies. If you look at the Alexa top 1000 sites, it's a bit of a sad state: roughly half of the sites support IPv6 and the other half do not, so it's really not quite there yet. You will hit a couple of ecosystem bumps in the road.
For example, when I looked at that list, I saw that GitHub, which developers use very frequently, is not IPv6-ready yet, although there is work in progress on their side.

So in a modern v6-only environment, how do you deal with IPv4? You will have to run a NAT46 or NAT64 translator somewhere in order to translate between the two worlds. Looking at NAT46 and NAT64 on Linux: with netfilter it's actually not possible, there is no such thing there, because translating between the two address families is genuinely complicated: a lot of protocols carry address information inside the packet itself. But we added support for this in eBPF. The eBPF environment is a bit more constrained, so it's less complex to do the translation there.

So now the question is: how do you ingress v4? If you have a v4 client somewhere on the internet and it wants to connect to your v6-only cluster, how do you approach that? The first option is to use Cilium as a stateful NAT46 gateway. The way this typically looks: on the left side you have an external node making a DNS request, and it gets back an A record with an IPv4 address. It then sends its SYN packet, which has to go through one dual-stack component: the NAT46 gateway, which maps the IPv4 address and port to some IPv6 address and port. From there you go into your IPv6 single-stack Kubernetes cluster, where you might have a Kubernetes load-balancer service with that front-end IPv6 address, which in turn has backends that Cilium manages. So the NAT46 gateway sits at the edge of the cluster and is the only dual-stack component. We essentially map from VIP to VIP, from an IPv4 VIP and port to an IPv6 VIP and port, and the only thing exposed from the Kubernetes cluster itself is the IPv6 address, which remains accessible without going through the extra hop to the v4 gateway. This has a couple of upsides.
The IPv4 VIP and port are completely decoupled from your Kubernetes cluster and your Kubernetes service. You don't need a special load-balancer service, you have no special IPAM requirements for the load balancer, and any public IPv6 prefix works as is. You can even load-balance across multiple Kubernetes clusters, for example via weighted Maglev. There are also downsides: you need a control plane to push down the IPv4-to-IPv6 mappings, and the original source IP is lost by the time the traffic hits the Kubernetes cluster, because the translation is stateful and needs to do both a DNAT and an SNAT. But thanks to eBPF you can push this into the XDP layer, that is, into the driver layer, and achieve high packet rates nevertheless: tens of millions of packets per second.

Option two on the table is a stateless NAT46 gateway. The way this works is quite similar, but here you only do the translation at layer 3. There is a special prefix used when translating to v6, and you need a load-balancer service with this special prefix in your Kubernetes cluster as well, which then load-balances to the backends. The advantage is that you hold no state: this is highly scalable because the gateway node keeps no per-connection state. It allows source address preservation, because the original client source address is encoded directly into the IPv6 address, and you can use things like load-balancer source ranges to filter external IPv4 clients. And no control plane is needed, because the translation is essentially transparent.
The downside is that you need a load-balancer IPAM which understands this prefix and places a publicly reachable IPv4 address inside it. You need this awareness because when the cluster replies back to the client, the reply has to be routable on the internet again, so the v4 address encoded in the v6 address needs to be a public one.

Now let's bring in one more layer of difficulty. What we just covered was the more straightforward part: v4 to v6 into the cluster. The other layer of difficulty is when your IPv6-only cluster wants to talk out to IPv4. Here DNS64 plays a key part. Going back to the earlier example: when you do a DNS lookup for GitHub, you get an IPv4 address from the A record, but a AAAA query returns nothing. So how do you talk to it? There are DNS64 proxies; Google runs a public one, for example, and it returns an address with an IPv6-specific prefix where the remaining bits encode the IPv4 address. The flow looks like this: on the left side you have your Kubernetes cluster, on the right side the internet. When a node wants to talk to the internet, it goes through the DNS64 proxy. The proxy makes an actual A query out to the internet, receives the reply, and encodes that reply into a AAAA record back to your cluster. Now the nodes in the cluster can send their SYN to this synthesized address, which goes through the NAT64 gateway; the gateway does the translation back to IPv4 and the packet hits the actual external node. CoreDNS, for example, supports this kind of functionality if you want to use it in your environment.
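The address synthesis that both DNS64 and the NAT64 translation rely on can be sketched in a few lines. This is just an illustration, using the NAT64 well-known prefix 64:ff9b::/96 from RFC 6052 and a documentation example address; a real deployment may use its own network-specific /96 prefix instead:

```python
import ipaddress

# NAT64 well-known prefix from RFC 6052; deployments may use a
# network-specific /96 prefix of their own instead.
PREFIX = ipaddress.ip_network("64:ff9b::/96")

def synthesize_aaaa(v4: str) -> ipaddress.IPv6Address:
    """What a DNS64 proxy does: embed the A record's IPv4
    address into the low 32 bits of the /96 prefix."""
    base = int(PREFIX.network_address)
    return ipaddress.IPv6Address(base | int(ipaddress.IPv4Address(v4)))

def extract_v4(v6: ipaddress.IPv6Address) -> ipaddress.IPv4Address:
    """What the NAT64 gateway does on the way out: recover the
    original IPv4 destination from the synthesized address."""
    assert v6 in PREFIX, "not a NAT64-synthesized address"
    return ipaddress.IPv4Address(int(v6) & 0xFFFF_FFFF)

aaaa = synthesize_aaaa("192.0.2.1")   # 192.0.2.1 is a TEST-NET-1 example
print(aaaa)                           # 64:ff9b::c000:201
print(extract_v4(aaaa))               # 192.0.2.1
```

The same /96 embedding is what lets the stateless gateway preserve the client's source address: the v4 address survives inside the v6 address instead of being rewritten away.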
The upside: with stateless translation this is again highly scalable, holding no state on the node, and all the traffic inside the cluster can be pure v6-only: nodes, pods, even the gateway IP, so there's no dealing with IPv4 at all. The downside is that IPAM becomes more complex, because the pods need a secondary IP address carrying this specific prefix, the NAT64 well-known prefix, in order to reply back. One way to overcome this limitation is to do a stateful translation on the way out: the pods reply with their primary address, the reply hits the gateway node, the translation happens there, and the gateway talks back to the IPv4 client.

I have a short demo of the NAT46 and NAT64 gateway. Given that the wireless in here was quite bad, I did some pre-recording the other day, so I'll just walk you through it. Here you have an IPv6-reachable web server; it has an IPv6 address and can only be reached there. Now, on the gateway node itself, which is running Cilium standalone, you can see that it has an IPv4 service prefix, and the backend is the actual web server. You run curl against the IPv4 address and you reach the IPv6 server. That's the one direction. Now the other way around, NAT64, is quite similar: you have an IPv4-reachable address that cannot be reached over IPv6. On the standalone gateway node you have an IPv6 front-end address in Cilium, all implemented with eBPF at the XDP layer, and if you run curl, you reach the IPv4 page.

Okay, so now we have bootstrapped our IPv6 cluster and it can talk to IPv4. What's next? As I said, IPv6 does not only address the scaling concerns but also future performance requirements. Enter Cilium with BIG TCP.
BIG TCP is a technology that was merged into recent kernels, in 5.19. It was developed by Google, and the goal, for Google but also for the broader community, is to support future data center workloads where a single socket pushes 100 gigabit per second, 200, 400, and beyond. You might ask: why care? Well, if you have big data, AI, or machine learning workloads, or other really network-intensive workloads with bulk transfers, it's really useful for that. But even if you don't, it frees up resources, so your applications can spend the cycles that would otherwise go to the kernel stack.

At 100 gigabit per second the budget is really constrained: with a 1500-byte MTU you only have about 123 nanoseconds per packet, and you would have to process around 8.15 million packets per second. That by itself is quite unrealistic for the kernel stack, so you really need batching. And the kernel has batching: GRO and TSO, that is, generic receive offload and TCP segmentation offload. Look at how the kernel receives and transmits packets today. For packets going up to the application, you have the network card, you have the eBPF layer, and then there's GRO, which tries to aggregate many packets: many MTU-sized packets come in, GRO aggregates them into a single super-sized packet of up to 64K, and pushes that up the stack, so you have to traverse the stack only once instead of multiple times. The same works on the way out: the kernel stack aggregates a large packet, and the hardware NIC, as most NICs support today, chunks it back up into individual packets. There's also GSO, the software equivalent of that.
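The line-rate arithmetic above is easy to check. A small sketch, assuming standard Ethernet framing overhead of 38 bytes per frame (header, FCS, preamble, and inter-frame gap) and a 1448-byte TCP MSS (1500 minus IP/TCP headers and the timestamp option):

```python
# Per-packet time budget at 100 Gbit/s with a 1500-byte MTU.
LINE_RATE = 100e9          # bits per second
MTU = 1500                 # IP packet size in bytes
WIRE_OVERHEAD = 38         # Ethernet framing overhead in bytes

bits_per_frame = (MTU + WIRE_OVERHEAD) * 8
ns_per_packet = bits_per_frame / LINE_RATE * 1e9
pps = LINE_RATE / bits_per_frame

print(f"{ns_per_packet:.0f} ns per packet")   # 123 ns
print(f"{pps / 1e6:.2f} Mpps")                # ~8.13 Mpps

# How much GRO/TSO batching saves: stack traversals collapse by
# the number of MSS-sized segments per aggregate.
MSS = 1448
print(65536 // MSS, "segments per 64K aggregate")        # 45
print(512 * 1024 // MSS, "segments per 512K aggregate")  # 362
```

The numbers match the ones in the talk: at 1500 MTU there is simply no headroom for per-packet processing, which is why the aggregate size matters so much.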
Once the NIC chunks the aggregate up into packets, they are usually received as a train of packets for a TCP stream, and GRO pulls them back together, so there's not much added latency from GRO having to wait. But there is an upper size limit: 64K is really the maximum aggregate size we have today. The reason is that the packet size is encoded in the IP header's total length field, which is 16 bits and therefore limited to 64K, and the same of course applies on the way down.

So can we make bigger batches? Here IPv6 comes to the rescue. BIG TCP overcomes the 16-bit limit with just a small kernel change: it inserts a hop-by-hop IPv6 extension header. This is purely local to the node, because it only affects the GRO/TSO aggregates; you don't have to make any changes on the wire or change your MTUs. The way it looks: the IPv6 header carries a payload length of zero, and the hop-by-hop extension header carries a jumbo payload length field of 32 bits, so in theory a packet could be 4 GB in size, which is of course unrealistic, at least today, but that's the theoretical limit. With that you can bump the aggregation size for GRO and also TSO, for the way in and the way out. Right now the kernel sets this to 512K, which can be raised in the future, but IPv4 is essentially stuck with the 64K limit. We implemented support for this feature in Cilium 1.13, to be released approximately towards the end of the year. There's just a single knob, and then all your pods will use it: Cilium sets this up for all the pods, for all their devices, and for the host. And perhaps less obviously, this will also help request/response-type workloads.
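As an illustration of what that node-local extension header contains, here is the 8-byte hop-by-hop header with the RFC 2675 Jumbo Payload option built by hand. The kernel constructs this internally on the aggregate; this sketch only shows the layout:

```python
import struct

def hbh_jumbo(next_header: int, payload_len: int) -> bytes:
    """Build an IPv6 hop-by-hop extension header carrying the
    RFC 2675 Jumbo Payload option (type 0xC2, data length 4).

    Layout, 8 bytes total:
      next header (1B) | hdr ext len = 0 (1B, extra 8-byte units
      beyond the first 8 bytes) | option type 0xC2 (1B) |
      option data length 4 (1B) | jumbo payload length (4B)

    The main IPv6 header's 16-bit payload length is set to zero
    when this option is present; the 32-bit field here takes over.
    """
    assert payload_len > 0xFFFF, "jumbo option is for >64K payloads"
    return struct.pack("!BBBBI", next_header, 0, 0xC2, 4, payload_len)

# A 512K BIG TCP aggregate carrying TCP (protocol number 6):
hdr = hbh_jumbo(6, 512 * 1024)
print(hdr.hex())   # 0600c20400080000
```

The 32-bit length field is what lifts the ceiling from 64K to a theoretical 4 GB, of which the kernel currently uses 512K.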
We did some latency measurements in our lab. The last bar here is Cilium with BIG TCP disabled, which is basically just stock Cilium; if you then enable it, we saw, at least in our tests, 2.2x lower p99 latency, which is quite good. It also helps transaction rates: our request/response transactions improved by 32%. Looking at bulk TCP streams, the default when you run, for example, iperf pod to pod: with BIG TCP off we were able to reach 52 gigabit per second on our test systems. Those test systems are just gaming machines in our lab, but with 100-gigabit NICs, because they had PCI Express 4. With BIG TCP on we got a plus of 8 gigabit per second, so we were able to reach 60 gigabit per second. And if you look at what's still missing, the trace profile shows a lot of copy-to-user and copy-from-user: that's a limitation because iperf does not yet use memory-mapped TCP. There's a good netdev talk by the Google TCP maintainers on memory-mapped TCP and how you can go even beyond this; that copy is essentially the limitation now.

More broadly, BIG TCP is one piece of the picture if I look at Cilium with tomorrow's data plane in mind. How can we get to a high-scale, opinionated data plane? On one side there is Cilium as a standalone gateway, which we achieve through eBPF at the XDP layer: a layer-4 load balancer with a programmable API that supports Maglev, DSR, graceful backend termination, and now the new NAT46 and NAT64 gateway pieces so that you can build v6-only clusters. On the other side there is Cilium inside Kubernetes as a networking platform, where there are a bunch of pieces of the puzzle to get it to really high scale. One of them is the eBPF-based kube-proxy replacement, which also uses XDP and a
socket load balancer in the kernel. Then we have a feature called eBPF-based host routing, which lowers latency; I'll go into that in a bit. We have a bandwidth manager; I gave a talk about it at the last KubeCon if you're interested. It installs fq qdiscs on your physical devices, which allows pacing and enables BBR for pods, which would otherwise not be possible, and you can do bandwidth limitation with lower latency than the usual means. And the new pieces we added in the new Cilium release are IPv6 BIG TCP support and, last but not least, a new meta driver as a veth device replacement, which I'll cover in the last part of the talk.

So what is a meta device? To explain the rationale for how we got there, I first want to show how the typical data path looks in the kernel. Typically, for packets going into and out of the pod, you have to traverse the upper stack: IP routing, netfilter, and so on. A while back we added an extension to the kernel where, from BPF, we can push traffic directly into the pod in one go, without having to traverse the kernel's per-CPU backlog queues for veth devices. That really helps latency: you can wake the application up right away without going through a rescheduling point. On the way out, we added a helper to push traffic into the neighbor subsystem so it can do the L2 resolution without going through the upper stack. That is basically the eBPF host-routing feature, and it really helps performance. The new component now is a veth device replacement. The goal is to have the eBPF programs be part of the device inside the pod, of course without the pod being able to unload or change them in any way, because you don't want that; you want to control them from the host namespace. The whole idea is to shift the Cilium eBPF programs from
the TC layer, where we originally had them, into the device itself, and this reduces latency even further. If you look at the kernel flame graphs for the typical veth case, what you can see is that it queues some of the packet metadata into a per-CPU backlog queue, also on the way out, and in the worst case the ksoftirqd kernel thread has to pick it up, so you have a rescheduling point: the call graph goes down, does an enqueue, and then has to pick the packet up again to process it. With the new device type we built, tailored for Cilium, you do everything in one go; you don't have this additional queueing and rescheduling, and that really helps to reduce latency. The eBPF host routing I mentioned solved this for packets going into the pod, and this device now solves it for packets going out of the pod. The goal we had in mind with the combination is to reduce latency and increase performance to the point where a pod performs the same as the application running directly in the host namespace. If you look at the latency here, the yellow line is the latency for applications running in the host, and the pod is essentially the same. Same with the transaction rate: request/response-type workloads are also on par with the host, and throughput as well. So that basically solves it.

So, the pieces of the puzzle for a high-scale data plane that are really critical in this scenario: the eBPF-based kube-proxy replacement, the bandwidth manager I covered at KubeCon in April, eBPF host routing, the new IPv6 BIG TCP support, and the meta devices. When you run Cilium, you get all of that. Last but not least, I want to thank a couple of folks in the Linux kernel community, the BPF and Kubernetes communities, and also those at Google who initially created the
BIG TCP support in the kernel, because I think it's a really exciting feature with a lot of potential, and as you can see, it's maybe not an obvious thing that you gain when you move to IPv6. With that, I'd like to thank you and open up for questions.

[Q&A]

Right, so if you have multiple clusters, then you would have to have such a gateway in front of your IPv6-only cluster, and they would have to talk to each other through that if you expose it as a service. Yeah, you would have to make the gateway aware of this.

Well, this is still behind network namespaces, right? You have to get out of the network namespace in the first place; you cannot just cross or break that barrier. And this will also work pod to pod, so it will also go directly into the device inside the pod without having to go through the backlog queue. No, it's not tied to that; this is definitely shared. It's a virtual device, in that sense, which overcomes the limitations that the veth device has.

Oh, one thing I forgot to mention: while I was preparing the talk, I noticed that there's actual progress, so GitHub will get v6 support, which is really great.

Okay, if you have more questions, please either come find us or ask me directly. Thanks.