Hello, good afternoon. I'm Miguel Duarte, a software developer working on the oVirt networking team. I'm here with Jaime Caamaño, a software engineer working for SUSE on the networking team, and we're going to present a talk about how to connect the interface of a virtual machine running within KubeVirt directly to an interface on the host.

Let's go to the agenda. We'll start by introducing the KubeVirt project, and afterwards we'll describe our goals: exactly what problem we're trying to solve and, especially, why we're trying to solve it, the motivation behind all this. We'll then move to the implementation, where we'll describe our approach to tackling the problem and, eventually, how we solved it. We'll finish with a short demo and the next steps of this collaboration.

So, on to the introduction. The first question that needs to be answered is: what is KubeVirt? KubeVirt, in its essence, is an add-on that allows you to run and manage virtual machines on Kubernetes. It works in a simple way: it schedules a pod, and that pod will basically have a libvirt instance running within it, which runs a single virtual machine instance. This gives us the advantage of a common platform for virtualization and for containers. It has more use cases, but we'll focus on the migration path from a virtualization workload to a containerized solution. This means you can have your application composed of 5, 10, 15 virtual machines and move, step by step, little by little, to a microservices architecture by splitting your virtual machines into smaller pieces. This way you get the advantage of a single common platform where your developers and your operations teams will work.

We'll use this slide with the architecture of KubeVirt to introduce the main actors. We have the virt-controller, which handles cluster-wide virtualization. We have the virt-handler.
It's kind of KubeVirt's agent within the node. And finally, we have the virt-launcher. The virt-launcher is the pod that, as I said before, encapsulates a libvirt process and runs a single VM instance within it.

Now, in order to understand what we're trying to achieve here, the easiest and best way is to describe the current status quo of how to connect an interface of a virtual machine on KubeVirt to the outside world. What we do currently is this: we have an in-pod bridge, and we create a veth pair, connecting one end of the veth pair to the VM and the other one to the bridge. Then we need another veth pair to connect from the bridge to the outside world. In the outside world, on the node, what we have is a CNI black box that might be different things: a Linux bridge, an OVS bridge, different kinds of things. In essence, for each virtual machine we have one pod, one in-pod bridge, and two veth pairs. Too much stuff.

So basically, our objective is to remove the in-pod bridge and make this as simple as possible; as the name of the talk indicates, what we want to do is connect the interface of the VM directly to the host interface. There's a twist here: we want to do this without requiring any extra capability on the virt-launcher pod. The virt-launcher pod currently runs as a non-privileged pod, and the only capability it has is NET_ADMIN, because it requires an IP address via DHCP, and we do not want to change that; we want it to stay that way.

I'm now going to hand over to Jaime, who will guide us through the solution part.

Okay. So, one thing that is commonly used in the virtualization world to simplify networking on the host is macvtap interfaces. Macvtap interfaces are just like the traditional tap interfaces that we use with VM technology, but on the host side of the interface, instead of the typical standard virtual interface, what you have is a macvlan interface.
The macvlan interface allows you to set up a sub-interface of a physical interface and assign it its own MAC address on the same L2 segment, so that you simply have another functional interface on the same physical link as the physical interface you are setting it up on top of. Macvlan interfaces have different modes of operation: they can operate in bridge mode, in private mode, and so on. For the rest of this presentation we're going to assume they operate in bridge mode. What that means is that all the sub-interfaces set up on top of the physical interface are bridged among each other, and that bridging is done by the macvtap driver. So, in the picture depicted here, you can see that in a scenario where one VM wants to communicate with another VM on the same host, that communication is bridged on the host itself, whereas when the VMs want to communicate with the external world, they just reach out through the external switch connected to the host.

So, what we want to do with macvtap interfaces for the problem at hand is basically this: we want to create the macvtap interface on the host, on top of the lower device (whatever physical interface you have available on the host), and then we want to move that interface into the pod, in a way that it can be used by whatever virtualization technology runs inside the pod. And, as Miguel said before, we want to do that without requiring any privilege escalation for that pod.

The solution can be divided into three parts. The first one: in order to properly use the macvtap interface inside the pod without any privilege escalation, we are going to use the device plugin framework.
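As a point of reference (this is not something KubeVirt runs as-is, and the interface names are illustrative), creating such a macvtap interface by hand with iproute2 looks roughly like this; note the commands need root:

```shell
# Create a macvtap sub-interface in bridge mode on top of a physical NIC.
# "eth0" and "macvtap0" are example names; this is a sketch, not KubeVirt code.
ip link add link eth0 name macvtap0 type macvtap mode bridge
ip link set macvtap0 up

# The driver also exposes a tap character device whose name is derived from
# the interface index, i.e. /dev/tapN:
ls -l /dev/tap"$(cat /sys/class/net/macvtap0/ifindex)"
```

It is that character device that the VM technology ultimately reads from and writes to, which is why the device plugin described next has to get it into the pod with the right permissions.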
So we are going to set up a new macvtap device plugin, and this device plugin is going to allow us to move the interface, and not only the interface but also the tap character device, inside the pod, with the correct access permissions for the pod to be able to access it without any privileged or admin capability assigned to it. Then we are also going to use a macvtap CNI plugin; this is going to allow us to configure the interface itself and move it into the pod namespace. And then there are the changes required in KubeVirt to correctly wire that interface into the domain.

As a kind of user guide, we are now going to see some of the manifests used to configure the whole thing. First, the device plugin: it is not shown here, but it is deployed as a daemon set with privileges, so that it can create interfaces; we just didn't show it for clarity. This is an example of the configuration the device plugin takes. It is basically an array of the resources that you want to advertise through the device plugin. The name is just the name you give the resource you want to advertise; here it is the same as the master, but it could be anything more descriptive. The master is the physical interface on which you want to set up the macvtap interfaces. The mode is the operating mode you want them set up in. The capacity is the number of interfaces of this type that you want to have available on the host.

Why would you want a name here instead of using the physical interface name directly? Let's say you want to set up macvtap interfaces operating in different modes on top of the same physical interface. Then you could have the same master but advertise it under two different resources in different modes: for example, eth0.bridge and eth0.private.
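As a sketch of the configuration just described: the field names below follow the talk (name, master, mode, capacity), but the exact keys and the config map layout are assumptions for illustration, not the authoritative format:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: macvtap-deviceplugin-config
data:
  # One entry per resource to advertise: "name" is the advertised resource,
  # "master" the physical lower device, "mode" the macvtap operating mode,
  # "capacity" how many such interfaces may be handed out on the host.
  DP_MACVTAP_CONF: |
    [ { "name": "eth0", "master": "eth0", "mode": "bridge", "capacity": 50 } ]
```

The two-resource example mentioned above would simply be two entries in that array sharing the same master but with different names and modes.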
Basically, what this is going to do is give you 50 macvtap interfaces, on top of that physical interface, as resources to use in pods. We're also going to leverage Multus for this, because it allows us to pass information between the device plugin and the CNI. This is how a macvtap network would be configured: you specify that you're going to be using the macvtap CNI and give the network a name, and you specify as an annotation the resource, offered by the macvtap device plugin, that you want to use for that network. Here is an example of the virtual machine instance itself. Basically, on the back end you're using Multus, with the network name of the network attachment definition defined previously. On the front end, we have a new macvtap binding mechanism that's going to properly configure the interface.

This is a flow of how everything works together. It's just an overall description of the interactions between the different components; don't take it literally. The first thing that happens is that the macvtap device plugin advertises the macvtap interfaces you have available. When the kubelet wants to run a VM, it asks the macvtap device plugin to allocate one of those interfaces. What the plugin gives back to the kubelet is the macvtap character device, so that the kubelet can mount it into the pod and grant the correct access permissions through cgroups. Then Kubernetes runs the pod and, through the CRI (the container runtime interface), tells Multus to add a network interface. Multus goes back to the kubelet and asks it for the resource that was allocated for that pod, in this case the macvtap interface, and gets it back. Then, when Multus itself calls the macvtap CNI, it passes an extra parameter.
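A sketch of the two manifests being described, the network attachment definition and the relevant pieces of the virtual machine instance; the annotation key and resource name below are assumptions for illustration:

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: macvtapnetwork
  annotations:
    # Ties this network to the resource the macvtap device plugin advertises.
    k8s.v1.cni.cncf.io/resourceName: macvtap.network.kubevirt.io/eth0
spec:
  config: '{ "cniVersion": "0.3.1", "name": "macvtapnetwork", "type": "macvtap" }'
---
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  name: vm-example
spec:
  domain:
    devices:
      interfaces:
        - name: hostnetwork
          macvtap: {}              # front end: the new macvtap binding mechanism
  networks:
    - name: hostnetwork
      multus:                      # back end: Multus, pointing at the NAD above
        networkName: macvtapnetwork
```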
That parameter is the device ID: the interface name of the macvtap interface that was allocated by the device plugin. The macvtap CNI just moves the interface into the network namespace of the pod, renames it to the well-known name that Multus gives it, and applies any extra configuration required: MAC address, whatever. Finally, the kubelet runs, inside the virt-launcher pod, libvirt with the correct domain XML to plug that interface into the VM.

The last thing we're going to see is how that is done with libvirt. This is a recent change in libvirt: the capability to use existing macvtap interfaces. They are defined in the domain XML as interfaces of type ethernet; the target device is the macvtap interface itself, and managed is set to "no", which means that libvirt is not going to create the macvtap interface but will take an existing one.

We're now going to show a demo; I'll pass it back to Miguel.

So, in this demo we're going to see both traffic and the entire flow of this. We can see that the macvtap device plugin is already running, and we can see the configuration it has in the config map. As you see, it's the same as in the examples: we're creating it with the name eth0, on top of the eth0 interface. We now see the network attachment definition, where we require a resource named eth0, and we see that it invokes the macvtap CNI plugin. And now we see that we request an interface of type macvtap and also, as Jaime explained, the Multus network. We're now going to apply this via kubectl. It'll create the pods where the VMs will be running; it'll take a few seconds. Okay, it's done, and now we're going to see the virtual machine instances that we have running.
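The libvirt piece described above, consuming an existing unmanaged macvtap interface, looks roughly like this in the domain XML (the device name is illustrative):

```xml
<interface type='ethernet'>
  <!-- managed='no': libvirt attaches to the existing macvtap device
       instead of creating (and later deleting) one itself -->
  <target dev='macvtap0' managed='no'/>
  <model type='virtio'/>
</interface>
```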
It's interesting here that both of the VMs got scheduled on node number two, and what we're going to do now is look into each of the nodes and see the resources allocated on each of them. First, let's look at node number one, which doesn't have any VM on it: as we see, it has the capacity to create 50 devices but has none allocated, because it has no VMs. Now, if we go to node number two, we see the opposite: while it also has 50 devices that can be created on this interface, when we list the allocated resources we see two macvtap devices created on top of eth0, because the two VMs are running on this node.

We'll now see a tiny, very simple traffic test. Before that, we'll log into one of the nodes and see which subnet it is connected to; we'll see afterwards that both of the VMs got an IP address via DHCP on that same subnet, and we'll then try a very simple thing from within them. As we can see, the node is on 192.168.66.0/24, and we're now logging into the virtual machines to list their IP addresses. As we see, the first one has an IP address on that very same subnet; same thing for the other virtual machine. As we can see, pinging from one VM to the other works, and we'll now try to reach the outside world by pinging Google. As you can see, that also works.

So let's move back to the presentation and talk about the next steps of this collaboration. We developed all of this in December; we had to rush quite a lot, and basically most of it lives in private repos of ours. We chose to upstream this into the KubeVirt community, and not into, for instance, the containernetworking organization like any other CNI plugin, because the combination of the device plugin and the CNI doesn't make much sense on its own; they need each other.
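The per-node resource check in the demo can be reproduced with something like the following; the node name and resource name here are assumptions:

```shell
# Capacity of macvtap devices advertised on a node (assumed resource name):
kubectl get node node02 \
  -o jsonpath='{.status.capacity.macvtap\.network\.kubevirt\.io/eth0}'

# How many are currently in use shows up under "Allocated resources":
kubectl describe node node02 | grep -i macvtap
```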
For instance, what the device plugin does is create the interface and enable the kubelet to provide read and write access, via cgroups, to the character device that backs the macvtap interface, while the CNI actually moves the macvlan side of the interface into the pod. So one of them by itself would not be enough: the macvtap CNI only handles the macvlan side of things, and there's already a macvlan CNI plugin. We need both of them, so both of them will go into KubeVirt. The other thing is that there's an open pull request adding the binding mechanism, which basically creates the required domain XML to consume the character device from within the virtual machine.

And this concludes our talk. Here's the list of all the stuff we did and some references, and we welcome any questions you might have. Come on, at least one.

[Audience question about whether the VMs can communicate with the host they run on.] We assume, as a limitation of macvtap itself, that they won't be able to communicate with the host they run on. So if you have N hosts, a VM will talk with every other host except the host it runs on. This limits some use cases: you could not, say, add a VM as another node of the Kubernetes cluster. Well, you could, but one of them would be left outside of it. Okay. You had another question?

[Audience question about storage: volumes are ReadWriteOnce when using devices like Ceph, so how would live migration of the virtual machines be organized in this case?] I have no idea about that; could you repeat the question? No, we didn't cover storage; we only looked into the networking side of this. Maybe you want to catch us later.

Question. [So it sounds like you are somewhere close to getting it into the community, right?]
[But you're still not there yet, based on what I'm understanding, right? The proof of concept shows it's doable, but if this ends up in OpenShift and those kinds of products, it goes back to Red Hat, simply speaking; is there any agreement that they're going to take it in?] We do have buy-in from the community; the community is interested in the feature. For instance, the repositories we mentioned already exist. So now the plan is to port the stuff from our private repos into the repos that will host this. We have community buy-in; now we just have to go through the regular review process and get the stuff in. Exactly.

[Could we try it out?] You could, exactly: the first link, the demo, you can clone that and use it; it will do every required step. But keep in mind it's work in progress; it's even marked as such, it's a draft pull request at the moment.

Okay, we're closing. Thanks. Thank you.