Hello, everyone. Today we're talking about five ways with the CNI. I'm Stig Telfer. I'm CTO of a technical consultancy business called StackHPC. We specialize in high-performance cloud infrastructure and platforms for scientific computing. And I'm Erez Cohen. I'm responsible for cloud networking at NVIDIA. In this talk we will discuss and compare some of the most common Kubernetes networking configurations for performance-intensive workloads.

High-performance computing, or HPC, is a field of computer science that solves complex problems, such as fluid dynamics for aircraft design, large-scale weather forecasting, or drug discovery, through the use of large-scale compute simulations. HPC is one of the most compute-, network- and storage-intensive workloads, and technologies developed for HPC often make their way into more standard data center applications. Another example of a compute-intensive application is artificial intelligence, or AI. AI is probably the greatest revolution of our time. It allows computers to solve problems that only a few years ago seemed impossible. AI can create images and text from human descriptions, translate languages, recommend specific items from a wealth of options, and even write code. AI is very common these days and is used in many web services and other everyday applications.

Both AI and HPC are compute-intensive, and they typically cannot run on a single server. They require a cluster of servers and run as a distributed application. When we're running an application in that fashion, networking becomes a critical element for the proper execution of the application. When we look at the network considerations, the first and foremost is throughput, or bandwidth, and here we would like to have as much as we can. Today we're running at 100 gigabits per second, and going forward we're already deploying 400 gigabits per second. But latency, and the predictability of latency, is just as important for these kinds of workloads, and obviously we would like low and predictable latency across all packet sizes. When we're doing a lot of networking, the CPU is very busy with the network itself and consumes quite a lot of CPU cores. We would obviously like to free up those CPUs for running the application itself, and therefore we're looking for CPU offload capabilities, so that the networking is handled by the NIC and not by the CPU. And lastly, in many of these application environments GPUs are in play, and we would like to make sure that not only the CPU but also the GPU can access the network properly. Technologies like GPUDirect become very important here.

RDMA can address a lot of those requirements, so let me explain RDMA. RDMA, or Remote Direct Memory Access, is a transport service similar to TCP and UDP, but a more modern one. In addition to send and receive, it also supports memory read and write semantics, which allow us to write very efficient applications. It also natively supports kernel bypass, which significantly reduces latency, as the application can send and receive without going through the kernel. And it natively supports hardware offloads, with the right hardware, like NVIDIA's, for example, which allow us to transfer hundreds of gigabits per second without any CPU intervention. RDMA was designed initially for a technology called InfiniBand, but for a long time now it has also been available on Ethernet under the name of RoCE: RDMA over Converged Ethernet. In this talk we will see how RoCE can be used with Kubernetes.
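As an aside on the CPU offload point, here is a minimal back-of-envelope sketch in Python, assuming steady line rate and the 2.3 GHz base clock of the Xeon 8380 CPUs that appear later in our lab; the figures are illustrative, not measurements.

```python
# Cycle budget per packet on one CPU core at 100 Gbit/s line rate.
# Assumptions (ours, not measured): back-to-back frames, 2.3 GHz core.
LINK_BPS = 100e9   # 100 Gbit/s link
CLOCK_HZ = 2.3e9   # one core's cycles per second

for frame_bytes in (1500, 9000):            # standard MTU vs jumbo frame
    pps = LINK_BPS / (frame_bytes * 8)      # packets per second at line rate
    cycles = CLOCK_HZ / pps                 # cycles available per packet
    print(f"{frame_bytes}-byte frames: {pps / 1e6:.1f} Mpps, "
          f"~{cycles:.0f} cycles per packet on one core")
```

At 1500-byte frames that works out to roughly 8 million packets per second and only a few hundred cycles per packet on a single core, far less than a trip through the kernel networking stack typically costs, which is why NIC offloads and jumbo frames matter so much at these speeds.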
Now let's review the various network configurations that we will use for this talk, and we'll start with Calico. Calico is a popular open source networking and network security platform for Kubernetes, created by Tigera. Calico addresses pod-to-pod, pod-to-service and service-to-pod connectivity, and it includes ingress and egress load balancing. Calico also provides network security features, which are out of the scope of our talk today. Calico has a central controller that calculates the policies and routes and stores them in an etcd data store. On each node Calico runs three components: confd, which monitors the etcd data store for configuration changes and updates BIRD configuration files; Felix, which configures iptables rules and Linux routes based on the policies and connectivity; and BIRD, an open source BGP agent which advertises Felix's routes. Calico supports both BGP and VXLAN routing; in our talk we will use a VXLAN overlay. In addition, Calico has two data path implementations: standard Linux, which uses Linux routes and iptables as I said before, and eBPF, the extended Berkeley Packet Filter. For our talk we use the standard Linux option.

The next CNI is OVN-Kubernetes, which is, naturally, the OVN CNI for Kubernetes environments. It uses OVN, Open Virtual Network, an open source SDN controller from the same project as OVS. OVN uses logical components such as switches and routers, and it supports security and access control lists, load balancers and many other features. OVN programs OVS as its data plane. For those of you who are not familiar with OVS, it is a virtual switch that is flow based, meaning it can be programmed to implement most of the data plane elements of switches and routers.

The next CNI is SR-IOV. Single Root I/O Virtualization, or SR-IOV, is a PCIe technology which allows us to segment a PCI device into virtual devices and assign them to a pod or VM through a virtual function, which basically exposes the device, or a sub-device, into the pod or the VM. In networking terms the device is pretty simple, even though it can support advanced capabilities like RDMA: it behaves like a macvlan. There's no overlay and no SDN capability, other than basic switching and quality of service per virtual function. However, SR-IOV provides very, very high performance, close to bare metal, as it bypasses the entire virtualization stack. To use SR-IOV in Kubernetes we use the SR-IOV CNI and, in addition, the SR-IOV device plugin. The SR-IOV CNI cannot support all the network requirements of a pod, and therefore it is typically used as a secondary CNI alongside a primary CNI such as Calico or OVN, through the use of Multus, which is a meta-CNI plugin.

The next configuration is accelerated OVN. Before I explain the technology, let's first understand the challenges of software-based virtual switches. OVS is built of two components: a kernel module and a user space daemon. When a packet comes in, it goes to the kernel module, the fast path, which classifies the packet and checks whether there is a rule saying what to do with it. If there is a rule, it is executed, meaning the packet is sent to the pod, or whatever else the rule specifies. If there is a miss, the packet goes to the user space daemon for further analysis, and once a decision is made on what to do with this new flow, the daemon programs the kernel data plane. From that point on, packets for that flow just go through the fast path.
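That fast path/slow path split can be sketched as a toy model, purely illustrative, with made-up flow keys and decision logic:

```python
# Toy model of the OVS fast path / slow path split described above:
# a kernel-style flow cache in front of a slower userspace classifier.
fast_path = {}  # flow key -> action; stands in for the kernel module's cache

def slow_path_classify(flow_key):
    """Stand-in for the userspace daemon: decide an action for a new flow."""
    src, dst = flow_key
    return f"forward to pod {dst}"  # hypothetical decision logic

def handle_packet(flow_key):
    if flow_key in fast_path:              # cache hit: stay on the fast path
        return fast_path[flow_key]
    action = slow_path_classify(flow_key)  # miss: upcall to userspace
    fast_path[flow_key] = action           # program the fast path
    return action                          # later packets skip the upcall

print(handle_packet(("10.0.0.1", "10.0.0.2")))  # first packet: slow path
print(handle_packet(("10.0.0.1", "10.0.0.2")))  # subsequent packets: fast path
```

The first packet of a flow pays the cost of the upcall; every subsequent packet is just a lookup. The hardware offload described next moves that lookup from the kernel into the NIC itself.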
Now, both the kernel module and the user space daemon run on the CPU, and a general purpose CPU is not very good at processing a high rate of packets. Therefore OVS, like many other virtual switches, has fairly low bandwidth, high latency and high CPU utilization. In addition, because the interface to the pod or to the VM is a veth pair, none of the advanced features of modern NICs are supported, like RDMA, DPDK and so on.

To address these challenges, NVIDIA added an embedded switch to our NICs. OVS was enhanced through a set of APIs to program that embedded switch on the NIC as its data plane. The pods are now connected directly to the NIC through an advanced SR-IOV mode called eSwitch mode. Every packet that comes from the network hits the eSwitch first and is classified there. If there is a rule that defines what to do with it, the rule is executed and the packet is sent to the pod, and vice versa. If there is no such rule, the packet goes to the OVS kernel module and from there continues just like before. With this approach we can actually have our cake and eat it too. We have one primary CNI, OVN in this case, that supports all the capabilities needed for Kubernetes: all the Kubernetes services, all the network services, quality of service, visibility and so on. It supports bonding, which legacy SR-IOV does not, and it supports security. But unlike virtual switches and other CNIs that are software based, it provides the bare metal performance of the server, and in addition it exposes the extra capabilities the network adapter may have, such as RDMA and DPDK.
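When the offload is active, the flow rules live in the NIC's eSwitch rather than in the kernel, and Open vSwitch can report them. Here is a minimal sketch of counting them from Python, assuming root privileges and an OVS configured with other_config:hw-offload=true; the helper name is ours.

```python
import subprocess

def offloaded_flow_count() -> int:
    """Count datapath flows currently offloaded to NIC hardware.

    'type=offloaded' asks ovs-appctl to list only hardware-offloaded
    flows; this assumes OVS is running with hw-offload enabled.
    """
    out = subprocess.run(
        ["ovs-appctl", "dpctl/dump-flows", "type=offloaded"],
        capture_output=True, text=True, check=True,
    ).stdout
    return sum(1 for line in out.splitlines() if line.strip())

if __name__ == "__main__":
    print(f"hardware-offloaded flows: {offloaded_flow_count()}")
```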
So those were the configurations that we will run in our lab. Let me explain the lab we built. We actually built two labs: one for Kubernetes on bare metal and one for Kubernetes over OpenStack. In the Kubernetes on bare metal lab we have two worker nodes. The servers are quite high-end for the performance benchmarking: we're running HPE DL380 Gen10 machines with Xeon Platinum 8380 CPUs, two NUMA nodes of 40 cores each, so 80 cores overall, and half a terabyte of RAM. Those servers have NVIDIA ConnectX-6 Dx adapters with two ports of 100 gigabits each. For the OS we're running Ubuntu 20.04 with the latest kernel, plus NVIDIA OFED for the latest networking drivers. The Kubernetes version is 1.23.7, deployed with Kubespray, and the pods run Ubuntu again with the OFED packages. The network interfaces are bonded, and we're running with a 9000-byte MTU, connecting to an NVIDIA SN3700 switch: 32 ports of up to 200 gigabits per second, so plenty of interfaces, running Cumulus Linux 4.4. The master node is a lower grade server running Ubuntu 20.04 and the same Kubernetes version. So this is the bare metal environment.

The Kubernetes over OpenStack environment uses the same hardware, with two compute nodes. But this time, on the operating system, we installed OpenStack Yoga, and we created VMs running Ubuntu 20.04 with the latest kernels and OFED packages. Inside the VMs we deployed the same Kubernetes as before, and then, obviously, the pods run in that Kubernetes. The switches and all the rest are exactly the same. Of course, on the controller node we needed to install the OpenStack Yoga controller. Now, the networking in the nested environment, Kubernetes over OpenStack, is a little bit tricky. We have the ConnectX device, and inside the host we're running OVN with OVS offload, so this OVS offloads its data plane into the ConnectX as I explained before. Inside the VM we're running Calico, which is connected to the OVS non-accelerated, and that's the primary network for the pod. In addition we have a secondary SR-IOV network, managed by OVN, going to the pod. So the application can use RDMA, which goes straight to the virtual function and then out through the NIC, with the rules and SDN layers applied by the OVS. Or it can use TCP, either from the net device of the virtual function, in which case it's accelerated, or from Calico, which is non-accelerated. And with this I would like to hand over to Stig to discuss our benchmarks and results.

We have an awesome lab environment; now we need to define some representative test cases. Erez mentioned the key considerations for networking: high bandwidth, low latency, low CPU overhead and predictable fairness under contention. We created a suite of test cases that should be both insightful and representative for HPC and AI applications. This benchmark suite is also open source and available on GitHub today, under this link. It uses the Volcano job scheduler to run through a set of test cases and extract results in an automated and repeatable process, and it uses Kubeflow's MPI Operator to orchestrate the execution of the parallel applications that we're benchmarking.

First up, we can start on familiar ground by benchmarking TCP performance using the ubiquitous iperf benchmark; we use version 2 of the application here. We measure aggregate TCP bandwidth for increasing numbers of concurrent iperf clients. In these tests we're using pod-to-pod networking, which is more representative for these scenarios, rather than involving service IPs, which would not be representative for most of our use cases. I'm plotting the individual client bandwidths here as a stacked bar. This shows us fairness: even fairness results in even-width stripes, and it's fairly easy to see that. The aggregate bandwidth is the total height of the bar. All the graphs are scaled to the physical limits of our lab networking hardware: 200 gigabits a second per server. We set the scene here with the baseline performance of host networking, bypassing the CNI altogether.

Now, a common choice in high bandwidth networking is to increase the Ethernet frame size, or MTU, from 1500 bytes, which is the standard, to 9000 bytes, which is known as a jumbo frame. This has the effect of increasing application bandwidth while reducing the CPU overhead of packet processing. On this slide we have two charts measuring the effect of a 9000-byte MTU and its uplift on performance. It's actually not such a massive difference in results here, probably due to the high-end network cards that we're using. But what is not shown is the extra work behind the scenes that the host kernel is doing in packet processing for standard frame sizes. This hidden cost is not visible in the end results, but we'll return to expose and explore it a bit later. So we've adopted a 9000-byte MTU for the benchmarks in this presentation, as being representative of HPC networking use cases. Before we go on, though, it's just amazing to see that a modern hardware server can readily saturate 200 gigabits a second of networking.
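For reference, stacked-bar fairness charts of this kind can be produced with a few lines of matplotlib. This is a sketch of the plotting approach only; the per-client bandwidth figures below are hypothetical placeholders, not our measured results.

```python
import matplotlib.pyplot as plt

# concurrency level -> per-client bandwidths in Gbit/s (one stripe each);
# these numbers are made up for illustration
runs = {
    1: [30.5],
    2: [31.0, 30.0],
    4: [28.0, 27.5, 27.0, 26.5],
    8: [24.0, 23.5, 23.0, 23.0, 22.5, 22.0, 21.0, 21.0],
}

for x, (n, bws) in enumerate(sorted(runs.items())):
    bottom = 0.0
    for bw in bws:                      # one stripe per iperf client
        plt.bar(x, bw, bottom=bottom, edgecolor="black")
        bottom += bw                    # total height = aggregate bandwidth
plt.axhline(200, linestyle="--", label="200 Gbit/s physical limit")
plt.xticks(range(len(runs)), [str(n) for n in sorted(runs)])
plt.xlabel("concurrent iperf clients")
plt.ylabel("bandwidth (Gbit/s)")
plt.legend()
plt.show()
```

Even stripe heights within a bar indicate fair sharing between clients; the total bar height is the aggregate bandwidth.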
First up, let's start the CNI results with a couple of standard choices. Two of the most popular CNIs for Kubernetes are Calico and OVN. In this slide, Calico is the chart on the left and OVN is the chart on the right. Calico is configured with default settings; we've done nothing special here, and it's using VXLAN as the transport layer. OVN is configured without SDN acceleration at this stage, also at default settings; we'll return to the SDN acceleration shortly.

The difference in performance here is stark. We can see that Calico adds almost no overhead over host networking and is delivering good fairness. Both CNIs get 30.5 gigabits a second for a single iperf TCP stream, but for OVN, beyond a small number of concurrent TCP streams, performance tails off to a fraction of peak bandwidth. Why is that? We can speculate about the different fortunes of these two CNIs. For OVN, performance may be impacted by Geneve encapsulation or by the complexity of OVN's programmable SDN pipeline: routing, NAT, security groups and connection tracking are all parts of a flexible solution in OVN, but they make it harder to apply hardware offloads. OVN may also be affected by latent configuration for hardware acceleration in our lab environment; it's present here but not used for this test, which may be affecting the capacity of the network receive path. It's a theory, but we didn't have time to test it. For Calico, we appear to be getting benefit from the stateless offloads of the ConnectX-6 network cards we are using. In particular, the network card can do TCP segmentation offload into a VXLAN VNI, the kind of networking used by Calico, enabling the kernel's networking stack to process network data in much bigger chunks in a Calico configuration. There's clearly a need for further investigation here, but we lacked the time; with OVN, there may well be room to improve on these results.

A popular choice of CNI for high performance use cases is SR-IOV, in which pod networking has direct access to NIC hardware. Unfortunately, there is no bonding support in legacy SR-IOV, and our test configuration has bonding enabled, so this restricts us to only one port of the dual-port network interface in the test hardware. This is clear from the results, where the bandwidth quickly hits a cap of 100 gigabits a second. The other network interface, of course, can be used for Calico or another CNI, providing flexible options alongside the SR-IOV interface.
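As a concrete illustration of the secondary-network pattern Erez described earlier, here is a minimal sketch of a pod that attaches an SR-IOV network via Multus, using the official kubernetes Python client. The attachment name sriov-net and the device-plugin resource name nvidia.com/sriov_rdma are hypothetical and depend on cluster configuration.

```python
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="rdma-test",
        # Multus reads this annotation and attaches the named secondary
        # network in addition to the primary CNI (Calico or OVN).
        annotations={"k8s.v1.cni.cncf.io/networks": "sriov-net"},
    ),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="app",
                image="ubuntu:20.04",
                command=["sleep", "infinity"],
                resources=client.V1ResourceRequirements(
                    # The SR-IOV device plugin advertises VFs under a
                    # configurable resource name; this one is an example.
                    requests={"nvidia.com/sriov_rdma": "1"},
                    limits={"nvidia.com/sriov_rdma": "1"},
                ),
            )
        ]
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

Inside the pod, the virtual function appears as an extra network interface alongside the primary CNI's interface, and RDMA-capable applications can bind to it directly.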
The state of the art today is the SDN acceleration offloads, as Erez described. We can transfer OVN's SDN flow rules from the host processor to the network card hardware, directly offloading the packet processing work of Open vSwitch from host kernel to network card. This technology combines the flexibility of OVN software-defined networking with the performance of hardware offloading. But does it fulfil this potential? Well, on the left, we see accelerated OVN in a bare metal Kubernetes environment, and on the right, we see accelerated OVN in OpenStack, running Kubernetes worker nodes as VMs, with direct access to hardware using the macvlan CNI. In both cases, our single client TCP performance is over 50 gigabits a second, somehow exceeding the performance of the host networking configuration, which is hard to explain, but may be due to evolutions in our lab setup tuning. With accelerated OVN, we achieve very high performance, quickly saturating the 200 gigabits a second available. I should say the bare metal accelerated OVN graph on the left is actually the best result of five runs. Aggregate performance can be variable, in particular for smaller numbers of streams in a bonded network configuration. This is likely due to the static nature of LAG distribution functions on the bond, creating uneven balance in the flows between the two ports. The same effect would apply to the macvlan chart on the right, but unfortunately we only gathered one result here and didn't have the time to return to this configuration and repeat the experiment. One thing to note is the unfairness apparent in the OVN configuration. The different stripe widths of the bare metal accelerated OVN result imply significant unfairness between clients under contention, which requires some further investigation to understand a little better. One theory is that a reduced number of NIC receive queues used by the OVN virtual functions was leading to uneven performance. Cutting edge developments such as these push the capabilities of today's hardware and software to their limits.

The benefit we mentioned earlier of hardware offloading is clear when you examine the host CPU load during test execution. With the Calico test, on the top row here, the system's CPU load was measured at about 30% while the tests were underway. With accelerated OVN, the CPU load drops to only 15%, giving a delta of 15% of CPU load between the two CNIs. Now, the CPUs we're using in this test are 40-core Platinum Ice Lake Xeons: that's a part with an $8,600 list price. Saving 15% of the CPU cores in this system, on a dual socket configuration, would be 12 cores, or $2,600. If we are working these systems hard, as we really should be, then that's a lot of CPU resource that is not going into our workloads.

We can also observe the accelerated OVN flow offloads at work during the benchmarking. In our tests, up to 120 SDN flow rules were offloaded from Open vSwitch in the kernel directly to the NIC hardware. In fact, the ConnectX-6 can support hundreds of thousands or even millions of hardware offloaded flows; we're barely scratching the surface of this capability here. This can be particularly powerful for advanced networking applications concurrently serving many thousands of connections.

On to the next test. For HPC and AI applications, we usually implement parallelism using a library called the Message Passing Interface, or MPI. This defines how parallel processes can work together by sharing a dataset and communicating efficiently with one another. MPI embodies a very different paradigm; we don't really see so much of this in cloud computing, but it's vitally important for HPC and also large-scale AI. MPI's design also fits naturally with the RDMA network protocols that Erez was describing earlier. The network performance of MPI can be measured using a suite of benchmarks, a standard one of which is called IMB, the Intel MPI Benchmarks. One of the simplest of those benchmarks is a pair-wise ping-pong, in which messages of different sizes are bounced between processes on two servers. In this benchmark, we are most interested in the latency for transmitting short messages and the bandwidth for transferring large ones.
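To give a flavour of the ping-pong test, here is a minimal sketch in the spirit of IMB using mpi4py (our choice for illustration; IMB itself is a C code), launched across two servers with something like mpirun -np 2 python pingpong.py:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
REPS = 1000

for size in (8, 1024, 1024 * 1024):      # message sizes in bytes
    buf = np.zeros(size, dtype=np.uint8)
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(REPS):
        if rank == 0:
            comm.Send(buf, dest=1)        # ping
            comm.Recv(buf, source=1)      # pong
        elif rank == 1:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    comm.Barrier()
    if rank == 0:
        # one-way latency: half the round trip, averaged over REPS
        usec = (MPI.Wtime() - t0) / REPS / 2 * 1e6
        print(f"{size} bytes: {usec:.2f} us one-way")
```

Large-message bandwidth follows the same pattern, dividing bytes moved by elapsed time; IMB automates this sweep across message sizes and reports both metrics.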
Here we can see the performance of our range of CNIs for the MPI ping-pong benchmark. The left chart shows short message latency: lower is better. The lowest latency is legacy SR-IOV, at about 1.83 microseconds message latency. To get a lower latency than this, you would need to go to an advanced HPC network fabric like InfiniBand. Accelerated OVN delivers a message latency we measured at about 4.4 microseconds for bare metal and 4.8 microseconds for virtualised OpenStack configurations, quantifying the overhead of virtualisation as a small fraction of a microsecond of latency. The right chart shows large message bandwidth: higher is better. The highest performance is accelerated OVN, in both bare metal and OpenStack configurations. Accelerated OVN on bare metal reaches over 180 gigabits a second of bandwidth for transmitting a 32 megabyte message. In both latency and bandwidth, we can see that hardware offloaded configurations perform far better. We can also see clear advantages of RDMA over TCP, both for lower latency and higher bandwidth. On these charts, data points for RDMA protocols are marked with pluses and data points for TCP results are marked with crosses; the difference is clear.

So the results for synthetic benchmarks show a clear difference, but what about real world application codes? Computational fluid dynamics, the simulation of air or water flows, for example, is a domain that makes extensive use of parallel programming, and OpenFOAM is the leading open source code for it. We have a benchmark here where we model the vortices created when a fluid blows over the top of a square box, as shown in this animation. To simulate this scenario, we use 48 parallel processes that work together using MPI. We also wanted to explore the effect of pod organisation: what if we put all the processes of our parallel workload into one pod, versus using more pods with fewer processes in each? Within a pod, parallel processes will communicate using shared memory. Between pods, they should use local networking within the host. Between hosts, they will use the Ethernet networking and the CNI. We set up a few test cases to explore the effect of this.

So how did we get on? In this chart, we are plotting the runtime of our fluid dynamics model for different pod-to-process geometries. In all cases, we used 48 parallel processes. A lower runtime on this graph means better performance. Disappointingly, there is no obvious effect from the different pod configurations: our lines are pretty flat. Perhaps MPI was smarter than we thought and was using the same shared memory transport both within pods and between pods on the same node. Unfortunately, we did not have time to return to this setup and investigate further. What is clear is that, depending on the CNI and protocol used, our simulation took between 125 seconds and 407 seconds to complete. Network performance is clearly a critical factor for parallel simulations such as this. Highest performance was with accelerated OVN and legacy SR-IOV; accelerated OVN yielded about 125 seconds of runtime. Once again, RDMA protocol data points are marked with pluses and TCP data points are marked with crosses. One interesting thing: we can see the hardware accelerated CNIs that supported the RDMA protocol have a clear advantage over everything else. But when the same CNIs run with TCP instead of RDMA, the application performance drops from 125 seconds of runtime to 244 seconds; that is almost the same runtime as achieved with Calico. The implication here is that the full potential of the hardware accelerated offloads is only realised when using RDMA protocols. Once again, the non-offloaded, software-only OVN has a hard time in this configuration.

A very different but equally intensive workload is genome sequencing, something that played a vital role, of course, in the world's response to the COVID pandemic. We wanted to include a test case for base calling, the first stage in the pipeline for processing data from the DNA sequencer technology developed by companies such as Oxford Nanopore.
Base calling uses neural networks to extract base sequences from the noisy signal data coming from the sequencer device. This application harnesses the machine learning power of a GPU, and about a terabyte of signal data is processed for each genome sequenced. Using open source code from Oxford Nanopore, we created a Kubernetes deployment where the base calling process is presented as a service. We wanted to see what effect this would have on sequencing performance, so we measured sequencing performance for host networking and also for the CNIs. Unfortunately, for lack of time and due to various integration issues, we only managed to run the base calling benchmark on host networking and the Calico CNI. However, what we can see from this single comparison is encouraging: there was no obvious impact on performance from using Kubernetes service networking for base calling. Indeed, the base calling rate was actually about 1% higher when put behind a service IP in Calico, which is hard to explain, but is within the margin of variability of the benchmark.

In summary, we found that Kubernetes platforms perform well for HPC and AI application workloads at the small to moderate scale of our test lab. Kubernetes would make a convenient job execution framework when coupled with interactive software platforms like Jupyter, and this doesn't have to be a compromise of convenience. For infrastructure without SDN acceleration, our experiments showed Calico performed particularly well, although it is hard to separate out the uplift of all the different offloads and accelerations a modern high-end network card will perform. Further work is definitely needed to understand some of the issues we encountered with performance along the way. For the most demanding use cases, SDN acceleration offloads enable the highest performance, supporting the use of RDMA protocols, which our results showed make a significant difference. Our results also showed that with hardware accelerated networking, the performance of Kubernetes in OpenStack VMs can be almost indistinguishable from performance on bare metal. With the right networking in place, we can harness the advantages of Kubernetes either on bare metal or in VMs without sacrificing performance. SDN offloads are now at a coming-of-age moment, where the technology remains complex but delivers ultimate performance for cloud-native environments.

Finally, I would like to say a huge thanks, on behalf of Erez and myself, to the NVIDIA and StackHPC engineering teams who performed this research and made this presentation possible. These people are heroes. We hope you've enjoyed it and found it useful. Please do feel free to scan the QR code here and leave us some feedback. Thanks very much.