We have a few team members who could not make it today. One of them was ill with COVID, and one of them couldn't make it because they're in the UK. So we're going to do our best to cover for them — apologies if it's a little rocky here.

There is an age-old trade-off between having good abstraction and having the absolute best possible performance. One of the key benefits of Kubernetes is abstraction, making it easier to scale out applications using a cloud-native approach. Today, in our shorthanded panel, we will be discussing the use of Kubernetes in telco and edge, where workloads are highly latency-, throughput-, and jitter-sensitive and have historically been scaled up using techniques that squeeze every last bit of processing and network performance out of the computers. The need for speed, low latency, and scale has led developers to adopt Kubernetes to deploy workloads onto compute but to bypass Kubernetes networking. Instead, connectivity for carrier edge and telco networks is done largely with expensive routers and SR-IOV, single-root I/O virtualization. This allows edge and telco workloads to be deployed as containers and pods onto standard server infrastructure using Kubernetes, but they are not interconnected and scaled out using native Kubernetes networking. Last year at KubeCon 2022, several of our presenters, not including me, held a similar panel on how optimized software and new hardware accelerators — DPUs and IPUs — can be used to accelerate Kubernetes networking underneath the existing Kubernetes networking abstraction. What is running in the containers does not change, except that it experiences a much faster, lower-latency network. Our panel today applies this concept to edge and telco workloads.

So this is the panel: me and Nabil. My name is Ian Coolidge; I work at Google Cloud on Google Distributed Cloud. Nabil is from Bloomberg. I'll be covering the motivating use case of telco, Nabil will be covering the value of offload and the value of standardization, and then we'll be presenting a video for Nupur's section.

Of course, the graphic does not render. OK, I don't know why the graphic is not rendering, but it's basically just a NIC with some virtual functions. SR-IOV has been used for years to let containers and virtual machines believe they have exclusive use of a network card, and to give them direct access to the PCI bus so they can bypass kernel networking, typically using DPDK to meet high-throughput, low-latency requirements. The problem with SR-IOV is that many details of the hardware leak up into workload orchestration. This is a blow-up of the SR-IOV network node policy resource — I forget exactly what it's called — from the SR-IOV network operator. It's got a couple of fields that are always peculiar to the particular NIC you're deploying. Intel has a NIC, Mellanox has a NIC, and there's always some valid combination of these fields that works for that particular NIC. More importantly, though, the major problem with SR-IOV is that it doesn't let you use the powerful Kubernetes networking features like network policies or service load balancing. Instead, you have to roll your own stacks for these kinds of things, or just go without them. There are also some smaller nitpicks specific to the implementation of the SR-IOV network operator, such as the fact that changing the configuration of a node requires the workloads to be drained from the node.
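To make concrete how those hardware details leak up into workload orchestration, here is a minimal sketch of the kind of node policy involved, written as Go structs rather than the operator's YAML. The field names approximate the sriov-network-operator's SriovNetworkNodePolicy spec, and the vendor and device IDs shown are only illustrative; exact names and valid combinations differ by operator version and by NIC.

```go
package main

import "fmt"

// Rough sketch of the kind of node policy the SR-IOV network operator consumes.
// Field names approximate the operator's SriovNetworkNodePolicy spec; exact
// names and required combinations vary by operator version and by NIC.
type NicSelector struct {
	Vendor   string   // PCI vendor ID, e.g. "8086" (Intel) or "15b3" (Mellanox)
	DeviceID string   // PCI device ID specific to the NIC model
	PFNames  []string // physical function interface names, e.g. "ens785f0"
}

type SriovNodePolicySpec struct {
	ResourceName string      // name the VFs are advertised under to the device plugin
	NumVFs       int         // how many virtual functions to carve out of the PF
	NicSelector  NicSelector // which physical NIC on the node this applies to
	DeviceType   string      // "netdevice" for kernel VFs, "vfio-pci" for DPDK userspace
	MTU          int
}

func main() {
	// The values below are the hardware-specific details that end up in the
	// orchestration layer: vendor IDs, device IDs, PF names, driver bindings.
	policy := SriovNodePolicySpec{
		ResourceName: "intel_sriov_dpdk",
		NumVFs:       8,
		NicSelector: NicSelector{
			Vendor:   "8086",
			DeviceID: "158b", // illustrative only
			PFNames:  []string{"ens785f0"},
		},
		DeviceType: "vfio-pci",
	}
	// For NICs that need no extra firmware tooling, applying a policy like this
	// ultimately results in the operator writing NumVFs into the PF's
	// /sys/class/net/<pf>/device/sriov_numvfs file and (re)binding VF drivers.
	fmt.Printf("%+v\n", policy)
}
```

The point is simply that PCI vendor IDs, device IDs, and PF names — pure hardware details — end up as first-class orchestration inputs.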
Another nitpick: various network cards have different techniques or implementations within the SR-IOV network operator for enabling SR-IOV in the first place. For example, with the Mellanox NIC there's an mstconfig program that gets included and run automatically, and everybody's happy. With the Intel NIC, the same functionality is not included, so you have to do it yourself.

When carriers are looking to cloudify their RAN or core networks, what they're really trying to do is modularize infrastructure, and Kubernetes is the system to do this. Using Google Distributed Cloud as an example, we provide managed Kubernetes infrastructure that allows telco customers to have Kubernetes throughout all the different areas of their network, from the radio access network to the core network to the public cloud. They get the same experience all the way through. With this modularity goal in mind, ultimately we need to aspire to a fungible cloud environment where applications are not sensitive to the exact machines or network cards they're running on. This lets networks become truly cloud-like, where workloads are portable and transferable to different areas. For example, an operator could easily spin up a UPF in public cloud or on-prem without having to worry about what network card is deployed on the machine, or whether SR-IOV is supported at all. Nabil?

Thanks, Ian. What I'm going to talk about is the offload. First we'll do a quick refresher on what offload means, then we'll talk a bit about the control plane — what we're trying to do and what we've done so far — then share some experimental results that demonstrate the value of offload, and we'll conclude with some goals related to this part of the presentation. As a refresher, what we mean by offload is usually offloading packet processing, as well as some control plane functions, from the host CPU to the hardware, to the NIC. We'll talk about DPUs and IPUs in this context because they have general-purpose processing — usually an Arm cluster — that allows you to do control plane processing in addition to hardware offload for the data plane functions. Examples of things we'd like to do in hardware are policy enforcement, encryption, load balancing, and some statistics, in addition to, as I said earlier, network control planes such as BGP being offloaded to the DPU/IPU — in the context of Calico, for example. I'll use Calico here because some of the experiments I'm going to share a bit later are based on Calico. Solutions that exist in the marketplace for DPUs and IPUs come from several vendors; examples are the Intel IPU, the AMD Pensando DPU, as well as NVIDIA's DPU.

So why are we doing this? What are the drivers, and what are we trying to achieve? Some of these we could consider drivers, or anticipations of the value we could derive from offload. If you look at the origin of these graphs, the origin is host-based: everything being done on the host CPU, nothing offloaded. What we're trying to achieve, or trying to demonstrate and therefore prove the value of, is where the offload is. We're anticipating performance optimization across multiple dimensions. One is power, or rather total power — I want to put that in because some people anticipate, rightly so, that the DPUs and IPUs are going to use more power.
But it's not about the component power; it's about the total power on the server. The other dimension is security: how can we improve security as a result of offloading certain functions to the DPU/IPU? And throughput — that's what accelerators really give you. Another dimension is total CPU cost savings. The host CPUs are really made for application processing, so use them for the thing they're best at. How can we offload the CPU cycles spent on network functions — policy enforcement, forwarding, statistics collection, and so forth — to the hardware, to the DPU/IPU, and save the CPU cycles for application processing? And the last dimension is total cost, which cuts across multiple dimensions, whether it's power, real estate, density of compute, and so forth in a given footprint. For some of these, as you'll see later, we conducted experiments that show value. For others, at least in an even-handed comparison, we're still trying to show whether there is value or not. So it's about proving value, really.

So let's start here. We presented this slide last year during a similar panel that we did at KubeCon in Detroit. The aim here is really the security angle. We're looking at the target architecture for the control plane integration between the compute/network control plane — what we usually refer to as the SDN solution — and a server, a physical host, that has a DPU/IPU plugged into it. The intention is not to go through the whole slide in detail, but to point out that the path comes from the control plane through a trusted path, one that is not running the application workloads. We come in through the DPU/IPU, and then we control the instantiation of the workloads and observe what's being instantiated from the trusted domain into the less trusted domain, which is the host CPU. Therefore there are integrations that have to happen between the control plane running on the host CPU and the control plane running on the DPU, together with the compute control plane — the Kubernetes control plane — as well as the SDN control plane, such as what we'd have with Calico or other solutions.

That was the target. What we've done is an implementation — still within our four walls, if you want to put it that way — where we started that journey. In the initial phase we didn't do the full offload; rather, we aimed to take the current way Kubernetes and Calico run, mainly with their components on the host CPU, and develop what's needed to offload the data plane functions to the DPU/IPU. In that journey we developed agents that do that integration and allow you to offload the forwarding plane — the FIB in particular, which is learned by the control plane on the host CPU. We want to offload that FIB to the data plane on the DPU/IPU. The other piece is offloading the policies, and I'm going to talk about policies here. We also looked at encryption, and we'll look at that in the experimental results as well. When we say the data plane here: on the DPU/IPU there is what we call a networking slow path, if you want to put it that way, or compute complex — usually the Arm processors. So we do the integration between the control planes. That's what you see in yellow: the agents we implemented to splice the control plane between the DPU/IPU and the host CPU so that the information can be exchanged and programmed into the hardware.
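As a rough illustration of what one of those splice agents does, here is a minimal Go sketch. The DPUDataplane interface is a hypothetical stand-in for whatever programming API the DPU/IPU actually exposes (a vendor SDK, gRPC, P4Runtime, and so on); the agent simply mirrors FIB updates learned by the host control plane into the device.

```go
package main

import (
	"fmt"
	"net/netip"
)

// RouteEntry is the slice of FIB state we want mirrored into the DPU/IPU.
type RouteEntry struct {
	Prefix  netip.Prefix
	NextHop netip.Addr
	Removed bool
}

// DPUDataplane is a hypothetical stand-in for whatever programming interface
// the DPU/IPU exposes (a vendor SDK, gRPC, P4Runtime, ...).
type DPUDataplane interface {
	AddRoute(RouteEntry) error
	DeleteRoute(RouteEntry) error
}

// loggingDataplane just prints what it would program; a real agent would call
// into the device instead.
type loggingDataplane struct{}

func (loggingDataplane) AddRoute(r RouteEntry) error {
	fmt.Println("program route", r.Prefix, "via", r.NextHop)
	return nil
}

func (loggingDataplane) DeleteRoute(r RouteEntry) error {
	fmt.Println("remove route", r.Prefix)
	return nil
}

// offloadFIB drains route updates learned by the host control plane and
// mirrors them into the DPU/IPU forwarding path.
func offloadFIB(updates <-chan RouteEntry, dp DPUDataplane) {
	for r := range updates {
		var err error
		if r.Removed {
			err = dp.DeleteRoute(r)
		} else {
			err = dp.AddRoute(r)
		}
		if err != nil {
			fmt.Println("offload error:", err) // a real agent would retry/resync
		}
	}
}

func main() {
	updates := make(chan RouteEntry, 1)
	updates <- RouteEntry{
		Prefix:  netip.MustParsePrefix("10.244.1.0/24"), // example pod CIDR
		NextHop: netip.MustParseAddr("192.0.2.1"),       // placeholder next hop
	}
	close(updates)
	offloadFIB(updates, loggingDataplane{})
}
```

In a real agent the updates channel would be fed by something like a netlink subscription or the BGP speaker the CNI runs, and the agent would also handle resynchronization after a device restart.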
Once we get that state into the control plane on the DPU/IPU's Arm cores, we offload it to the ASIC on the DPU/IPU, and that's where the actual forwarding path and policy enforcement happen, in hardware. So we used this architecture to conduct an experiment using Calico. It doesn't have to be Calico, but in this case we used Calico. This is a very basic experimental setup where we had two servers interconnected via 25 gig. We have a source running in a pod and another pod running iperf. You see two colors here, red and yellow, that I want to call out: red relates to application-level processing, and yellow relates to network-layer processing in the host kernel. This is important, because what we're trying to look at is how much of the CPU on the host is taken up by the application — which, if you remember, is really what the CPUs are meant for — and which portion is taken up by network-layer processing. That's really what I want to call out.

This is a basic experiment; there is much more data behind it and many more experiments. But it shows an example where we had Calico running on the host — that's Calico as it is today, with iperf — then Calico offloaded, and then we generated a DoS attack. This is a single flow for which we're measuring performance — that's the blue triangles — and the performance here is throughput in particular. What I want to call attention to is the network-layer processing in the kernel, the yellow, versus the application-layer processing, the red. When everything is done with Calico on the host, you can see a good portion of a full CPU core is taken up by the network layer (this is measured in CPU cores) — that's the yellow — and about 90% of a second CPU core is taken up by the application; in total we achieve roughly 19 gigabits per second in this experiment. When we offloaded some of that, we decreased the network layer's CPU consumption on the host by roughly 10% and increased the application processing a little bit — and this is the simple case where we have a single flow. Then we generated a large number of flows, about 250K flows per second, which we consider to be an attack, although a large number of flows could be generated anyway, legitimate or not; in this case the action was to drop them, hence the attack framing. As you see in the third bar on the right, "Calico host under DoS", the network-layer CPU processing jumped up to 100%, and the CPU available for the application shrank from 90% to about 50% or so. More importantly, the goodput of the legitimate flow dropped from 19 gigabits to about 7 gigabits. That's the negative effect you see.
When we offload the control plane function — sorry, the data plane function — of L3/L4 policy enforcement to the hardware, we actually regain the application processing: you can see that the red jumped up and became more dominant than the yellow. That's application-layer processing, and we restored the goodput of the legitimate flow to 19 gigabits per second. That's an example of where offload can become very useful. The general observation is that consistent results were observed across this experiment and others, for packet forwarding and packets per second, under attack or legitimate traffic, and as a function of the number of flows per second — what we call the flow rate. There is more efficient utilization of the host CPU, because the host CPU spends more time doing application-layer processing and less time doing network-layer processing.

Another observation, although we didn't show it here: there are different approaches today to the networking solution — one is based on OVS, Calico is another, and so forth. Some of these approaches are based on what we call packet-driven data plane programmability. For example, in OVS you wait for the first packet; the OVS control plane knows what needs to be done for a given flow, or for given characteristics of a flow, and when it sees the first packet it programs the flow into the data path. From there on you exercise whatever actions you want on that flow — forwarding, packet drop, allow, whatever it is. That is programmability of the data plane driven by the first packet coming in. Control-plane programmability, in contrast, is driven top-down from the control plane, and that's what we do with Calico. The observation is that the control-plane-driven approach to programming the data path often yields better results in total, and we can talk about that on the side after this session if anybody is interested.

The other experiment we looked at is the value of encryption. Encryption is a pretty costly operation when it's done in software, even when you have AES-NI, the instruction set available on some of these CPUs like the Intel CPUs. Most of the experiments we conducted with and without AES-NI, so that we could look at the performance, and we also looked at offloading the encryption to the DPU/IPU in this case. There is a system under test here: a pod running a traffic generator, and we had NGINX running in the context of a VM as the system under test. What we did is initiate a couple of sessions at a time; on each session we made up to 1,000 requests, repeated that over 25 seconds, and repeated those tests to get statistically significant results, and that's what I'm going to show here. What I want to call your attention to: there is a lot of data being generated, as you can see on the x-axis. This is the encryption result as a function of the file size being transferred, and what we're measuring is throughput relative to CPU utilization — what you're achieving in throughput relative to how many CPU cycles you spend.
As you can see, the larger the file transfer — really, the longer the session duration — the better the offload does, and that's what I want to call out in the gray; I'll talk about what kTLS and so forth mean. Comparing the gray to the blue, as the file size being transferred over that connection increases, you get incrementally better performance with the offload, to the point that at one file size you get about twice the throughput with offload compared to host-based encryption, if you want to put it that way — that's at the far right-hand side. As you'll notice, we focused on encryption because that's really what we aimed to look at: what kind of performance improvement it shows, or not. We start from a neutral position, if you will, with the anticipation that we'll have better performance when we offload. We used kTLS, or TLS in general, in this case: we looked at user-space TLS, kernel-based TLS, as well as kTLS offload as a means of looking at encryption. Obviously the same applies to IPsec — the value here is about encryption, not about TLS specifically — but we used TLS as the conduit for this testing.

The other way we look at it is transactional: how many requests can you do for a given number of CPU cycles? Again, as the file sizes grow we get better performance, comparing the gray to the blue. But when we have very short transactions, you can see that user-space TLS becomes more performant, if you want to put it that way. The reason is that there is overhead in setting up the session in the hardware: the shorter the duration, the more flows you have, and you spend overhead programming the data plane for flows you don't use for very long. So the longer the duration of the TLS session, the better off you usually are.

So what are we aiming to do from here? I showed you an example of a target we're aiming for and what we've implemented so far, but in general — you could call these goals as well as a call to action — to enable DPUs and IPUs in a standardized way, we need to look at how we integrate the compute and network control planes for the DPUs and IPUs to achieve trust and security as well as performance; how we offload control plane functions and coordinate them between the host CPU and the DPU/IPU; and finally, for one of the data plane functions, policy programmability, how we drive it to be control-plane driven. The other thing: while we're talking about Kubernetes here, not all endpoints in a lot of environments are containers. We have a lot of cases with bare metal and VMs, so the idea is to enable this also for those types of endpoints — going beyond just container workloads to bare metal and VMs.
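To give a concrete feel for the kernel TLS path behind those kTLS numbers, here is a minimal Go sketch of the first step: attaching the "tls" upper-layer protocol to an established TCP socket. This is Linux-specific and deliberately incomplete — after the TLS handshake, the negotiated cipher state still has to be installed via a setsockopt at the SOL_TLS level (TLS_TX/TLS_RX) before records are encrypted in the kernel or handed to a NIC/DPU with kTLS offload; in practice NGINX and OpenSSL do that for you when built with kTLS support. The address used is a placeholder.

```go
package main

import (
	"fmt"
	"net"

	"golang.org/x/sys/unix"
)

// enableKernelTLS switches an established TCP connection over to the kernel
// TLS ("tls" ULP) path. The negotiated cipher state would still have to be
// installed with a setsockopt at level SOL_TLS (TLS_TX / TLS_RX), which is
// what lets record encryption run in the kernel or be pushed down to a
// NIC/DPU that supports kTLS offload.
func enableKernelTLS(c *net.TCPConn) error {
	raw, err := c.SyscallConn()
	if err != nil {
		return err
	}
	var sockErr error
	ctrlErr := raw.Control(func(fd uintptr) {
		// Attach the kernel TLS upper-layer protocol to the socket.
		sockErr = unix.SetsockoptString(int(fd), unix.IPPROTO_TCP, unix.TCP_ULP, "tls")
	})
	if ctrlErr != nil {
		return ctrlErr
	}
	return sockErr
}

func main() {
	// Placeholder endpoint purely for illustration; in the experiments the
	// server side was NGINX terminating TLS.
	conn, err := net.Dial("tcp", "192.0.2.10:443")
	if err != nil {
		fmt.Println("dial:", err)
		return
	}
	defer conn.Close()

	if err := enableKernelTLS(conn.(*net.TCPConn)); err != nil {
		fmt.Println("enable kTLS:", err)
		return
	}
	fmt.Println("kernel TLS ULP attached; cipher state still to be installed")
}
```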
So what's happening in the standards today to enable that? This is a slide from our colleague Vipin, who couldn't be with us today. There is an effort going on in OPI right now focused on DPUs and IPUs. One of the major initiatives they have is lifecycle management, because some people look at these as systems inside of systems: you have a server as a system, and then you have the DPU/IPU inside it, which usually has its own network interface as well as its own CPU complex, and you need to manage the software on it and manage its lifecycle. So OPI is looking at that, and there is a set of APIs being developed in OPI as well. IPDK is another set of tools, and there is work going on today around P4; this is work happening in a working group and being looked at to be carried into OPI for standardization. This slide has references to that material and to what OPI is doing in this particular case, in relation to the things you could take a look at. So, Ian, back to you.

Yeah, so next we have Nupur's section.

My name is Nupur, and I work on the Intel IPU team. The speakers before me have touched upon Kubernetes enhancement features, performance gains on specialized hardware, and the need for standardization. So where do we go from here? We have two open source efforts going on in parallel that are somewhat interdependent, to bring in a cloud-native approach while benefiting from hardware acceleration. Let me go over them and explain how. One of the Kubernetes enhancement proposals is for an agent model where the agent runs in a pod. It does not rely on CNI calls but gathers all the required data from the Kubernetes APIs by watching objects. This agent implements role-based access control, provides for loose coupling, and could be an abstraction point for supporting various hardware vendor SDKs. The hardware SDK APIs generally group functionality like hardware provisioning, rule offloads, PF/VF resets, function stats, and others. The second proposal is related to networking. It addresses multiple network attachments to a pod, which is required in edge clouds for segregating control and data plane traffic — for instance, the F1-C and F1-U interfaces between the CU and the DU. This proposal introduces two new objects, PodNetwork and PodNetworkAttachment. These two objects, with their parameters and provider, would enable the agent to uniquely identify and assign the hardware device functions to the pods. DPDK-based CNFs use devices like SR-IOV VFs to meet the stringent performance and latency requirements of various 3GPP service types; these need abstraction and a cloud-native approach. Today this is done through device plugins. One aspect that is not fully addressed is the draining of these devices on termination of a pod. There are proposals from the community to use health-check and readiness failures from the applications to redirect traffic. For this to work, it has to be done in conjunction with hardware SDK APIs that clean up flow rules and call function-level resets for proper queue draining. This brings us to the IPDK.io Kubernetes offload recipe, which uses the proposals mentioned earlier. The recipe leverages the agent model to offload networking flows to a P4 data plane. It has been done using the Calico CNI, as its implementation supports an external data plane. The offload includes service load balancing, network policies, as well as switching and routing on the primary network.
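Here is a minimal sketch of that agent model, assuming a hypothetical HardwareSDK interface as the abstraction point over vendor SDKs. It uses client-go to watch NetworkPolicy objects as one example of the state such an agent would mirror into hardware; the proposed PodNetwork and PodNetworkAttachment objects are still proposals, so they aren't modeled here.

```go
package main

import (
	"fmt"
	"time"

	networkingv1 "k8s.io/api/networking/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// HardwareSDK is a hypothetical abstraction over a vendor SDK — the kind of
// interface the proposed agent would hide behind (provisioning, rule offload,
// resets, stats), so the rest of the agent stays vendor-neutral.
type HardwareSDK interface {
	OffloadPolicy(namespace, name string) error
	RemovePolicy(namespace, name string) error
}

// loggingSDK is a stand-in; a real agent would wire up the vendor SDK here.
type loggingSDK struct{}

func (loggingSDK) OffloadPolicy(ns, name string) error {
	fmt.Println("offload policy", ns+"/"+name)
	return nil
}

func (loggingSDK) RemovePolicy(ns, name string) error {
	fmt.Println("remove policy", ns+"/"+name)
	return nil
}

func main() {
	// In-cluster config: the agent runs in a pod with RBAC-scoped access
	// rather than relying on CNI invocations.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	var sdk HardwareSDK = loggingSDK{}

	// Watch NetworkPolicy objects — one example of the state the agent
	// mirrors into hardware — via a shared informer.
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	inf := factory.Networking().V1().NetworkPolicies().Informer()
	inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			p := obj.(*networkingv1.NetworkPolicy)
			if err := sdk.OffloadPolicy(p.Namespace, p.Name); err != nil {
				fmt.Println("offload failed:", err)
			}
		},
		DeleteFunc: func(obj interface{}) {
			p, ok := obj.(*networkingv1.NetworkPolicy)
			if !ok {
				return // may be a tombstone object; skipped in this sketch
			}
			if err := sdk.RemovePolicy(p.Namespace, p.Name); err != nil {
				fmt.Println("remove failed:", err)
			}
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // run until the pod is terminated
}
```

Because the agent reads state through the Kubernetes API with its own RBAC-scoped service account, it does not depend on being invoked through the CNI call path, which is the loose coupling the proposal is after.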
The hardware-generated stats at flow and connection level give visibility into every flow and the number of connections per endpoint for intelligent steering. With the help of the community, we want to bring the service implementation to non-primary networks too. We are working towards standardization of the agent and manager components through OPI. The model allows the vendor SDK stack to be kept completely abstracted and running in the trusted IPU domain. The agent follows role-based access control semantics, bringing key security controls to access to the hardware SDK and resources. With the recipe benefiting from hardware acceleration, offloads, adaptive device queues, QoS, and congestion control, we can meet the 3GPP KPI requirements for edge cloud deployments while enforcing security across the various tiers of networking. The added benefit can be seen by offloading compute-intensive flows like IPsec to hardware, keeping latency low and freeing up cores. Intel supports the node feature discovery plugin to advertise features like supported instruction sets, RDMA, look-aside and inline encryption, and other node labels, so that these functions are only enabled on the nodes that support them. Work is ongoing to get all the device drivers upstreamed to the kernel. Intel is also working with the wider DPDK community to bring in more drivers for the next generation of Intel networking devices.

My name is Dan Daly, from Intel, and I'm so sorry that I'm unable to meet with you today. I got COVID and wasn't able to travel, but what I wanted to do was summarize some of the points that were made in our panel and also talk about where we go next. If you look at the types of things that our different panelists brought up, Ian talked about having multiple networks, being able to drain traffic as is done today with SR-IOV, and having security built into the system. Those three items are already in progress, both in the Kubernetes enhancement proposals and also in the working example that we have in IPDK, which is part of OPI. We're working through these items one by one, seeing how we can get them to work and how we can make sure we preserve those abstractions so that the end workload doesn't have to change. It can still use SR-IOV, it can still use the Kubernetes networking it's already taking advantage of, but it gets that faster network, that lower latency, so you can expand the set of workloads that can run on top of it. Then what Nabil was talking about is the benefits of these new devices: as you go beyond optimized software, you can have hardware — DPUs and IPUs — that gives you even better performance and lower latency, and that can start to integrate functions that are usually very expensive in software, like crypto, into your pipeline, and give you a hardware level of performance. These are also things we have been integrating into IPDK, and things that are part of these Kubernetes enhancements going forward. And then, once we get there, we need it to be standard, so that you're still buying into the Kubernetes ecosystem and not having to use specific interfaces for a particular NIC or device or special software solution. What we have found in IPDK is that Calico offers this data plane API, which we think is a great starting point for standardizing not just the interface for the applications, but also the interface for the folks deploying these clusters and maintaining them over time.
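As a rough picture of what such a data plane API boils down to, here is a simplified Go sketch. The type and method names are illustrative rather than Calico's actual dataplane-driver protocol, but the shape is the point: the control plane streams endpoint, route, and policy updates to a pluggable driver, and the driver programs them into software (VPP, eBPF) or hardware (a DPU/IPU).

```go
// Package dataplane sketches the kind of interface a control plane such as
// Calico could expose to an external, optimized data plane. Names are
// illustrative, not the real protocol.
package dataplane

// WorkloadEndpointUpdate describes a pod (or VM/bare-metal) endpoint that the
// data plane should know about.
type WorkloadEndpointUpdate struct {
	Name       string
	IPv4Nets   []string // addresses assigned to the endpoint
	ProfileIDs []string // policy profiles that apply to it
}

// RouteUpdate announces or withdraws a route learned by the control plane.
type RouteUpdate struct {
	Dst     string // destination CIDR
	NextHop string
	Removed bool
}

// PolicyUpdate carries a (heavily simplified) policy; real rules carry full
// match and action details.
type PolicyUpdate struct {
	Name         string
	InboundRules []string
	Removed      bool
}

// Driver is what an offload target — a software fast path or a DPU/IPU —
// would implement to receive desired state from the control plane.
type Driver interface {
	OnEndpointUpdate(WorkloadEndpointUpdate) error
	OnRouteUpdate(RouteUpdate) error
	OnPolicyUpdate(PolicyUpdate) error
}
```

Standardizing an interface of roughly this shape is what would let the same control plane drive VPP, an eBPF data path, or a DPU/IPU interchangeably.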
And so that brings us to where we should go next: these enhancement proposals have already started to come out, and they give us some very important features that we need in order to accelerate this network. We need to enable these applications to continue to have multiple networks, so you can separate your edge or telco workload connectivity from the regular connectivity into the pod. And we need to allow people to use their existing CNIs but also have the CNIs enable this agent model, so the CNIs can report to an optimized data plane such as VPP or a DPU/IPU and tell it: hey, there's a new route, there's a new policy, there's a new endpoint, so we can accelerate that when it's available. We want to take all the concepts being proposed and bring them into a working example. So we have open source code being developed in the open that you can use to see whether it works for your acceleration solution of choice, and to run it with your real workloads and see the benefit of this acceleration in your system. Once we have that working example running, there is a community being built around open programmable infrastructure, called OPI, which IPDK is a part of. This is where we want to standardize a lot of these different data plane APIs so they can be used not just for Kubernetes but for other environments — different control planes that should all end up using the same kinds of constructs for routing and forwarding, load balancing, and things like that. So to summarize, we're making great progress here, but we wanted to bring this forum to you to bring the potential to your attention. I'm going to give it back to the lucky gentlemen who are actually there with you in person to summarize and bring us home. Thank you so much, and have a good conference.

Yes, thanks, Dan. Hopefully it's clear that offloading Kubernetes networking paves the way for fungible telco cloud workloads with performance parity with optimized systems. Thank you, everyone.