So thanks everyone for joining and attending today's session. I am Ajay Karambur. I work as a principal engineer on the Cisco NFVI project. Along with me I have Yichen, who is a technical leader on the same Cisco NFVI project. Our product is an on-prem private NFVI cloud that supports both container (Kubernetes) workloads and virtual machine workloads. Today we're going to share our experiences deploying container network functions on Kubernetes.

The high-level agenda: we'll start with an introduction to NFVI. Then we'll talk a little about the requirements for deploying container network functions on Kubernetes. We'll cover features like the CPU Manager, huge pages, the Topology Manager, Multus, and SR-IOV. Then we'll walk through the steps to bring all of these features together in a sample Kubernetes deployment, and conclude with a demo of a sample CNF that leverages each of these features in a realistic scenario. The application we're going to deploy and test today is VPP, a fast virtual switch. Then we'll close with a summary.

So moving on, what is network function virtualization? Network function virtualization is the virtualization of the functions of networking nodes into software. Especially in the mobility space, you're seeing an increasing number of 4G and 5G applications being converted into virtual functions. That move has been underway for the last few years, and there are many private clouds running 4G and 5G stacks with lots of virtual network functions on top of them. vRouters, vFirewalls, vLoad balancers, and much of the mobility stack are standard examples of network functions. The benefits are reduced cost, accelerated service deployment, and easier lifecycle management; essentially, the benefits of virtualization brought to network functions.

Moving on, what are the general NFV requirements? Applications that are network functions typically need higher throughput, lower latency, and lower jitter. For people who have worked in OpenStack, a lot of these features should be familiar: CPU isolation, huge page flavors, SR-IOV acceleration, and FPGA acceleration have already been implemented in Nova, as an example. Today we'll talk about how some of these features can be integrated into Kubernetes and how we can deploy a real container network function.

So moving on, what is the difference between a virtual network function and a container network function? Virtual network functions are delivered in the form of virtual machines, with more resource overhead, slower provisioning time, and lower scalability and resiliency. Typically, virtual network functions are accompanied by a VNF manager, which does the orchestration and lifecycle management of those virtual network functions. Container network functions are delivered in the form of containers. The benefits of containers are lower resource overhead, faster provisioning time, and higher scalability and resiliency. And one other thing: container network functions can leverage the lifecycle management of Kubernetes for their functionality.

So moving on, this is our NFVI cluster.
These clusters are typically small edge clusters that are deployed with some of our mobility stack running on top of them. They are built on UCS or Quanta servers; we point out UCS or Quanta here, but in general it can be any server. There are two Intel two-port cards that are used for the SR-IOV and DPDK functionality. On top of that, we run Ubuntu with a real-time kernel. We run Docker 1.19 and Kubernetes 1.18.6. Then we have the standard Intel i40e (physical function) and i40evf (virtual function) drivers, and the vfio-pci driver for DPDK. We have huge pages pre-allocated via grub, and we have two physical cores reserved for the host, which with hyperthreading is four logical cores. On top of that, you have a standard kubeadm-based deployment with kubelet and kube-proxy. The alterations to that deployment: the kubelet is configured with the static CPU Manager policy, and the Topology Manager is also configured. From a networking standpoint, two CNIs, Multus and SR-IOV, are deployed on top of it, along with the SR-IOV device plugin to manage the SR-IOV virtual functions. Typical application pods on top of this have multiple network interfaces and use features like CPU pinning and huge pages. So that's the NFVI cluster. Let's now dig into the different NFV features, and then finally talk about how to bring them together in a real application on top of Kubernetes.

Moving on, what is the CPU Manager? The CPU Manager is for compute-intensive workloads. It's a mechanism by which you dedicate cores to your specific container network application, so it can run with high performance on dedicated cores and meet high-throughput and low-latency expectations. It's natively supported in K8s 1.18. It's enabled through a simple kubelet config that sets the CPU Manager policy to static, assuming you've already done the work at the host level to isolate cores and carve out specific cores for Kubernetes and host processes versus the application workloads. In 1.18, a new flag called reserved CPUs was introduced. When we say host CPUs, these are the CPUs reserved not only for the host processes but also for the Kubernetes control plane itself. We are also disabling the CFS quota; we've sometimes seen latency spikes with the CFS quota enabled, so it's disabled here. With the static CPU Manager policy, the pods that get dedicated cores are the ones running with Guaranteed QoS. This leverages the QoS feature in Kubernetes, which is broader than this particular use case, but here we're focusing on Guaranteed QoS. There are also Best Effort and Burstable QoS classes, which we won't be talking about in this presentation.

Huge pages are the second aspect. They require huge page configuration on the host, and Kubernetes natively supports both 2 MB and 1 GB huge pages. The nodes discover them automatically, so nothing extra needs to be done: as long as you have 2 MB or 1 GB huge pages configured on the host, they will automatically be discovered and advertised.
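For context, a minimal sketch of how those kubelet settings can be expressed in a KubeletConfiguration file is shown below; the reserved CPU list here is an illustrative example, not our exact cluster's.

```yaml
# Minimal KubeletConfiguration sketch (illustrative values, not the exact cluster config)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static                # give Guaranteed-QoS pods dedicated cores
reservedSystemCPUs: "0-1,40-41"         # example: CPUs kept for host processes and the K8s control plane
cpuCFSQuota: false                      # avoid CFS-quota-induced latency spikes
topologyManagerPolicy: single-numa-node # covered in the next section
```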
Moving on to the Topology Manager. This is a relatively new component. Before the Topology Manager existed, the CPU Manager and the device manager made independent resource allocation decisions. But as you know, when you run VNF or CNF applications that need high performance, you need NUMA-aware scheduling to optimize performance. NUMA-aware scheduling is supported for both the CPU Manager and the device manager, but today it is not supported for huge pages, so you cannot do NUMA-aware scheduling of huge pages yet; that's a work in progress. You bring in this feature by using the Topology Manager policy called single-numa-node. There are other policies, but this is the most restrictive one, and in our use case it's what we use to maximize performance within a single NUMA node. There's more documentation on the Topology Manager at the link we've provided.

So we've talked about the CPU Manager, the Topology Manager, and huge pages. Now, on the networking side: when you have applications that need PCI passthrough or DPDK, you need SR-IOV, and that means you need a pod with more than one network interface. That is where Multus comes in. It is used in the SR-IOV/DPDK context to facilitate multiple network interfaces for a pod. The way the pod will look is: you'll have a default interface, which is Calico, and you'll have additional interfaces used for SR-IOV and DPDK. Multus is responsible for calling the SR-IOV CNI and getting the plumbing done, and it also exposes the custom CRD, the network attachment definition. We'll see a sample of this in the later slides.

The next aspect is the SR-IOV CNI. The SR-IOV CNI is the one that plumbs the SR-IOV VFs into a pod's network namespace, and it cleans them up on pod deletion. It works with the SR-IOV device plugin for VF allocation and is used to set various VF parameters of interest, like VLAN, trusted mode, et cetera. Multus is the component that invokes the SR-IOV CNI with the right device ID. The flow is: once the SR-IOV device plugin allocates a specific virtual function, Multus hands that PCI address information to the SR-IOV CNI plugin, which then does the plumbing. On the right side is a sample network attachment definition. As you can see, there's information like the VLAN and the IP address range, and the type field tells Multus which CNI to call, which in this case is the SR-IOV CNI.

Moving on, the next aspect is the SR-IOV network device plugin. Its responsibility is simply to discover and advertise the SR-IOV virtual functions available on the K8s host. You can do resource groupings based on various parameters and fields. On the right side you'll see a sample ConfigMap where we advertise two sets of resources, one that says SR-IOV net device and one that says SR-IOV DPDK, and we advertise them based on the driver being either i40evf or vfio-pci. In this specific case we are only interested in the Intel XXV710 cards, as an example, with the vendor and device IDs as selectors. These resource names are user configurable and can be set through the resource map. The plugin also detects kubelet restarts and auto-registers, and the user creates a ConfigMap like the one shown on the right to set it up.
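As a rough illustration of the sample network attachment definition just described, here is a minimal sketch; the network name, resource name, and IP range are illustrative placeholders (the VLAN matches the one used later in the demo).

```yaml
# Minimal NetworkAttachmentDefinition sketch (name, resource name and subnet are illustrative)
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-net1
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/sriov_netdevice
spec:
  # "type": "sriov" is what tells Multus to invoke the SR-IOV CNI for this attachment
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "sriov",
      "vlan": 1590,
      "ipam": {
        "type": "host-local",
        "subnet": "192.168.15.0/24",
        "gateway": "192.168.15.1"
      }
    }
```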
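And here is a minimal sketch of an SR-IOV device plugin ConfigMap along those lines; the resource names and the VF device ID are assumptions for illustration and should be checked against your actual NICs.

```yaml
# Minimal SR-IOV device plugin ConfigMap sketch (resource names and device IDs are illustrative)
apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
      "resourceList": [
        {
          "resourceName": "sriov_netdevice",
          "selectors": {
            "vendors": ["8086"],
            "devices": ["154c"],
            "drivers": ["i40evf"]
          }
        },
        {
          "resourceName": "sriov_dpdk",
          "selectors": {
            "vendors": ["8086"],
            "devices": ["154c"],
            "drivers": ["vfio-pci"]
          }
        }
      ]
    }
```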
Moving forward: we've talked about Multus, the SR-IOV CNI, and the SR-IOV device plugin. Now, how do you bring all of these pieces together into a real deployment? You start with an Ubuntu real-time operating system, then you make the host-level changes: apply the tuning profile, pre-allocate the huge pages via grub, and install the i40e and i40evf drivers bundled into the image. Then you need something to create the SR-IOV virtual functions and bind them to DPDK. In earlier versions of the driver this was done dynamically as part of the SR-IOV pod creation, but right now you're expected to pre-bind these VFs to vfio-pci ahead of time.

The continuation of the steps on the next slide shows that once you bind these SR-IOV VFs, you deploy Kubernetes with the standard tools, make the kubelet changes for the CPU Manager and Topology Manager, deploy Multus, the SR-IOV CNI, and the SR-IOV network device plugin, create the ConfigMaps (we went over those in the last slides) and the network attachment definitions (we went over those as well), and then deploy the application pod with the needed resources and Guaranteed QoS. That's how you bring up a real application with all of these parameters.

Moving on to the next slide, you finally see an example of kubectl describe node with all the features enabled: you see huge pages, and you see the SR-IOV devices, both DPDK and normal net devices. You also see that, in this particular case, four CPUs, that is, four logical CPUs or two physical cores, are reserved for the host and Kubernetes, and the rest is what is allocatable.
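Before moving to the demo, here is a rough sketch of what those host-level preparation steps can look like; the interface name, VF count, PCI address, and CPU ranges are illustrative assumptions rather than the exact setup.

```bash
# Host preparation sketch (illustrative names, counts and addresses; adapt to your NICs and NUMA layout)

# 1. Pre-allocate 1 GiB huge pages and isolate CPUs via the kernel command line (grub),
#    e.g. something along these lines in /etc/default/grub, then update-grub and reboot:
#    GRUB_CMDLINE_LINUX="default_hugepagesz=1G hugepagesz=1G hugepages=248 isolcpus=2-39,42-79 ..."

# 2. Create SR-IOV virtual functions on each physical function (here: a PF named sriov0).
echo 16 > /sys/class/net/sriov0/device/sriov_numvfs

# 3. Load vfio-pci and bind the VFs intended for DPDK to it (example PCI address).
modprobe vfio-pci
dpdk-devbind.py --bind=vfio-pci 0000:af:02.1
```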
So with this, I'll hand off to Yichen, who's going to do the demo and walk you through the demo scenario.

Okay, thank you Ajay. Before showing the demo, I'm going to review our host environment: I'll show you the operating system and kernel version, the mechanisms we use for CPU isolation and optimization, along with the huge page allocation and the SR-IOV VF creation. Then we'll go to a sample CNF application. In this case we have a special image built that can utilize the CPU pinning, huge pages, and SR-IOV features, all the stuff we talked about before. For the SR-IOV portion, we demo both types of SR-IOV VFs: one type is for the Linux kernel, bound to the i40evf driver, and the other is bound to DPDK. In the Linux kernel case we can consume the interface directly as a net device in the container, so it's easy to consume and easy to see. In the DPDK case, we use an application called VPP to consume the interface. For people who aren't really familiar with VPP: VPP is the open source FD.io project under the Linux Foundation. Its full name is Vector Packet Processing, and it provides out-of-the-box, production-quality switch and router functionality. It delivers very fast performance, especially with a large number of flows, up to a million, compared with Open vSwitch with DPDK. And at the end, we're going to confirm and verify that the allocated CPUs and SR-IOV VFs are aligned to the same NUMA node, because we've configured the single-numa-node policy in the Topology Manager.

Okay, now let's go to the demo; let me share my terminal. This is my setup, an all-in-one node. I'll start by showing you the operating system information. You can see this is Ubuntu 18.04.5 LTS, and it's running the real-time, low-latency kernel; that's the real-time variant in the Ubuntu world. Next I'll show you the CPU pinning and CPU isolation settings. We can see them in /proc/cmdline. The interesting parameters here: we have skew_tick, we have isolcpus, we have nosoftlockup, nohz and nohz_full, and rcu_nocbs. All of these help dedicate and reserve CPUs for the CNF workload; together they make sure no other process can interrupt the CPUs reserved for the CNF workload, so we have full CPU isolation. In the same place, on the command line, we have the huge pages, highlighted here. You can see we have 1 GB huge pages, 248 huge pages allocated in total, and these pages are divided equally between the two NUMA nodes in this particular setup, which we can show here: for NUMA node 0 there are 124 huge pages, and for NUMA node 1, another 124. So the huge pages are allocated.

Next, the SR-IOV side. In this particular setup we have four PFs, and we name them sriov 0, 1, 2, and 3. On each of the PFs we have 16 VFs allocated, so that's 64 VFs in total. That's our setup on the host level.

From the Kubernetes side, we can run kubectl get node; you can see there's only one node, as I mentioned, it's all-in-one. Let's describe it, and I'll go through a couple of things with you. First is this section here, the one I highlighted in the earlier slides. You can see we have 80 vCPU threads in total, in this case with hyperthreading enabled, and we're reserving four of them for the host, which is why 76 remain for the CNF workload. We also have the huge pages here, and we have 16 VF interfaces for running DPDK workloads and 48 VF interfaces for running net device (Linux kernel) workloads. Scrolling down, we can see the Multus and SR-IOV components deployed; they're all running in this setup. That's the Kubernetes side.

Okay, now let's move to the real CNF. Before that, I want to show you how the networking is defined. There are two files here for the CRDs; let's look at them. This one is our network for the net device, the Linux kernel case. The key takeaway is that we have a network called sriov net one, it's on VLAN 1590, and the IP address is managed by IPAM; that's the IP address range over here. Similarly, we have another file for the DPDK network, which we name sriov net two, and we give it VLAN 1589. One thing to keep in mind: even though we define an IPAM section here too, this VF is bound to the DPDK driver, so nobody will actually consume the IP address allocated by IPAM. That section is essentially informational, because the real application has to consume the interface itself and IPAM has no control there. We'll see that in the demo.

The last file I'm going to show you is our pod definition file, this one. We have a pod, that's its name, and in an annotation we're saying we want two networks here, net one and net two.
Then we have the image, which has everything we need (we'll show that later), and we're asking for eight gigabytes of huge pages with a 1 GB page size, four CPU threads to be pinned, one net device SR-IOV VF, and one DPDK VF. Those are the resources we're asking for. One thing you'll notice is that there's a limits section and a requests section, and the values are the same: the CPU here is four and the CPU there is four, and the devices are all one in both. This is how we get the Guaranteed QoS class: if we do this, we make sure the CPUs get pinned and the resources are all dedicated. That's how the Guaranteed QoS class is defined in the YAML file. One final note: in the security context we're adding the IPC_LOCK capability. This is just for VPP; VPP needs this particular capability to run. And that's it.

Okay, let's bring it up. It runs fine, so let's go into the container. In here we're going to verify all the things we mentioned before, starting with CPU pinning. You can see that this particular container has been allocated CPUs 21, 22, 61, and 62. We asked for four, and exactly four are pinned here. That's the CPUs. We can also verify the huge pages; we have a script that allocates huge pages, so let's run it. We can allocate one 1 GB page successfully, and then go up to eight; it takes a while, but yes, now we can see the huge pages are being allocated and consumed correctly.

Now the last piece is SR-IOV. As I mentioned, there are two types. The first type is the net device, bound directly to the Linux kernel. Let's look at it: there's a net device type interface called net1, and you can see the IP address here is managed by IPAM; we know the IP address because IPAM assigned it to us. Let's ping the gateway. Yep, it pings fine. The second type is for VPP, and in that case we need to do some configuration. Let's first get the information: you can see there are two PCI addresses here, injected by the SR-IOV network device plugin. The first one is the net device we just talked about, and the DPDK one is the one we're going to use for VPP. Let's copy that PCI address; I have a VPP config file to look at, and then we start the VPP process, which takes a while. Okay, it's up. Now let's create an interface. By the way, AVF is the native VPP driver for these virtual function devices; there are two variations we can use, the DPDK version or the native VPP (AVF) version, and they both consume the same VF. Let's create the interface. Okay, the interface is created fine. Let's bring it up with set interface state up, give it an IP, and now let's ping the gateway. Okay, perfect. So by doing this we've verified that both types of SR-IOV interfaces are properly connected and can ping outside.
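As a rough illustration of the VPP steps just shown, the CLI sequence can look something like the following; the PCI address, interface name, and IP addresses are placeholders, and the actual interface name is whatever VPP reports after creation.

```bash
# Sketch of the VPP bring-up described above (addresses and names are illustrative)
# Create an interface on the DPDK-bound VF using VPP's native AVF driver.
vppctl create interface avf 0000:af:02.1

# Check the generated interface name, then bring it up and give it an address.
vppctl show interface
vppctl set interface state avf-0/af/2/1 up
vppctl set interface ip address avf-0/af/2/1 192.168.14.10/24

# Verify connectivity to the gateway.
vppctl ping 192.168.14.1
```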
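For reference, here is a minimal sketch of a pod spec along the lines of the one walked through above; the image, network names, and resource names are illustrative assumptions rather than the exact manifest.

```yaml
# Minimal CNF pod sketch (image, network and resource names are illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: cnf-vpp-demo
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-net1, sriov-net2   # net device + DPDK attachments
spec:
  containers:
  - name: vpp
    image: example.local/cnf-vpp:latest
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]           # required by VPP
    resources:
      requests:
        cpu: "4"
        memory: "8Gi"
        hugepages-1Gi: "8Gi"
        intel.com/sriov_netdevice: "1"
        intel.com/sriov_dpdk: "1"
      limits:                        # limits == requests -> Guaranteed QoS, pinned CPUs
        cpu: "4"
        memory: "8Gi"
        hugepages-1Gi: "8Gi"
        intel.com/sriov_netdevice: "1"
        intel.com/sriov_dpdk: "1"
    volumeMounts:
    - name: hugepage
      mountPath: /hugepages
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
```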
Okay, the last piece is the NUMA awareness. For that, let's go back to the beginning of my terminal. We run taskset -cp 1, and you can see these are the CPUs we've been allocated: 21, 22, 61, and 62. And when we look at lscpu, they all fall into NUMA node 1. So we know our CPUs are allocated from NUMA node 1. How about our SR-IOV VFs? For these two devices, let's check: cat /sys/bus/pci/devices/.../numa_node for the af:02.1 device gives 1, and the other one gives 1 as well. They're both from NUMA node 1. So this confirms that our CPUs and our PCI devices all come from NUMA node 1, and our policy does enforce it.

Okay, now let's quickly go back to the slides to finish with the final section: the summary. Today we talked about a lot of features at both the host level and the Kubernetes level: CPU isolation, huge pages on the host, and the real-time kernel; and in Kubernetes, the CPU Manager, the Topology Manager, and SR-IOV with all its CNI plugins. With all these features together, we showed you a demo and verified it by running VPP, so we can show that you can run high-throughput, low-latency applications with all these features enabled. That's today. Looking ahead, there is still work to do: the Topology Manager doesn't support huge pages yet, so we need it to support huge pages for better performance. We also believe FPGA support is going to be very useful, especially for telco 4G and 5G use cases. These are the references we used, and thanks for listening; we welcome your questions.