Okay, welcome to the talk. It's about slirp, that long-lived slirp, and a new, shiny approach to user-mode networking in QEMU — and first, actually, what KubeVirt can do with it.

So, hi, I'm Alona Paz. I'm part of the Container-Native Virtualization team at Red Hat. I'm a maintainer of KubeVirt and, in the past, of the oVirt engine backend and UI. Stefano is the one who develops passt and knows all the details about how it works, but I know why we in KubeVirt really needed it and how we in KubeVirt are using it.

And I'm Stefano Brivio. I'm a kernel developer when I get the chance. I'm with the virtual networking team — I hope I say the name of the team right — at Red Hat. And when I hear about pods, I don't think of Kubernetes, I get hungry. So yeah, that's why we really need Alona for this, because otherwise I wouldn't know the context.

This talk is not about... well, you might have high expectations, so let's revisit them. It's just a tiny bit about slirp. It's not about Rust, I'm sorry. It's also not about microservices — could be worse. And it's not about the cloud, I swear. It is a bit, just a bit, about what Kubernetes is and what KubeVirt is. We'll explain the existing KubeVirt networking approaches, and why those approaches were not enough so that we needed to think about a new approach, passt. And finally, we will explain what passt is and how it works.

So what is Kubernetes? Kubernetes is an open-source system. It is used to automate the deployment, scaling and management of containerized applications. The smallest deployable and executable unit that Kubernetes has is called a pod. A pod is basically a group of containers that share resources. From the network perspective, a pod is a group of containers that share the same network namespace.

KubeVirt is an add-on that extends Kubernetes. It adds resources for virtual machines. So with KubeVirt, it is possible to run virtual machines alongside pods in a Kubernetes cluster. The virtual machines can be managed in the same manner that regular pods are managed in the cluster, with the same tools — kubectl, for example. Since the basic unit of Kubernetes is a pod, we run the virtual machine inside a pod. Actually, we run the virtual machine inside a container that runs inside the pod, and in this container we run libvirt and QEMU. We connect the virtual machine networking to the pod networking, so the virtual machine can communicate with the other entities in the cluster and the other entities in the cluster can communicate with the virtual machine — again, the same way that communication with pods is done.

Service meshes. Service meshes are one of the main reasons we needed passt, so I will explain a bit about them. There are different implementations of service meshes; one of the famous ones is the open-source project Istio. A service mesh is basically a way to control how different parts of the system share data with one another. The architecture of service meshes in a Kubernetes cluster is that inside each one of the pods managed by the service mesh, a sidecar proxy is added. A sidecar is a Kubernetes term for a container that is not running the main application of the pod. So to each one of the pods the sidecar proxy is added, all the sidecar proxies are connected to each other, and they form the mesh network. The sidecar proxies control the data that is sent to and from the applications in the pod they are running in.
Since this sidecar proxy is just a container running inside the pod, it assumes that the other applications whose data it manages are also just more containers in the pod. So the sidecar proxy may assume that it shares the same network namespace with the other applications in the pod; that it can see the sockets and the processes of the applications; that it can do socket redirection, port mapping and IP mapping; that it can see the addresses and the routes of the pod. And it may also assume that the data it controls comes from the user space of the pod. In the case of applications running in regular containers, it is easy to meet those assumptions. But in our case, when we are running our applications inside the VM guest, it is much harder to meet those requirements. The application in the VM guest runs in its own separate network namespace; it has its own kernel and user space. So it's not easy at all to meet those assumptions.

KubeVirt today has several ways to bind, to connect, the VM network to the pod network. The two main options that we have are called bridge binding and masquerade binding.

In the case of bridge binding, what basically happens is that all the networking information is removed from the pod interface and applied to the VM interface. We can see in the diagram that the interface of the pod — the one in the green pod — doesn't have an IP address, and it's not just the IP address: all the networking information was removed from it and applied to the VM interface. So we can see that the VM interface has the original address of the pod — it was removed from the pod interface and moved to the VM interface. The guest and the pod interface are directly connected with layer-2 connectivity: we have a bridge with one port connected to the VM interface and the other connected to the pod interface. With this kind of binding we cannot run sidecars — we cannot run extra containers inside the pod that runs the VM — because any data that enters the pod, that enters the pod's eth0, goes directly to the VM guest; there is a direct layer-2 connection.

The other binding that we have is called masquerade binding. In masquerade binding, the VM is basically sitting behind NAT. nftables rules inside the pod are used to masquerade and DNAT the traffic from and to the VM. We can see that the VM guest has an internal IP address. The VM is not directly connected to the pod interface; it is only connected to a bridge inside the pod. Any traffic that goes from the VM lands on the bridge — the bridge is the default gateway of the VM — and then it is masqueraded, because we have nftables rules; and in the other direction we have DNAT rules. With this binding we can have sidecars: there is no direct connection, so sidecars can run. But there are other assumptions that the service mesh makes that are not met here. For example, the data that arrives from the guest applications doesn't arrive from the user space of the pod — it never lands there, because it goes directly through the kernel space; it bypasses the user space. We are using nftables, so it goes only via the kernel space of the pod. So this assumption is not met.
In both of these bindings, a user or any entity in the cluster that wants to communicate with the VM uses the pod IP, the original pod IP.

Besides the fact that, as presented in the previous slides, we don't have seamless integration with service meshes with these two bindings, we have some more disadvantages with them. We had to implement them from scratch ourselves. We had to implement the DHCP server. If we want multicast, we have to implement it too. Another disadvantage is that our pod, the pod that runs the VM, is an unprivileged pod: it doesn't have NET_ADMIN, it doesn't have NET_RAW. As we saw in the diagrams, we add a tap device, we add a bridge in the masquerade binding, we edit the nftables rules — those require at least NET_ADMIN, and we don't have it. So we had to use some tricks to make it work; it wasn't straightforward at all.

Because of these disadvantages, we started thinking, together with Stefano's team, about a new binding. This binding is called passt. The passt binding implements a translation layer between layer 2 and layer 4. We can see in the diagram that we don't add any extra network devices to the pod: we don't have a bridge, we don't have a tap device, we don't have any of them. We just have a process, a regular process, the passt process. This process communicates with the guest using a UNIX domain socket, and it communicates with the pod interface using regular sockets. A sidecar proxy can run inside the pod, and all the other requirements that a service mesh has are met as well, because all the traffic that the VM sends is re-sent from the passt process — so the sidecar proxy sees it as if it was sent by an internal process running inside the pod.

So, the advantages of passt. The first, the obvious one we already mentioned, is that we have seamless integration with service meshes, because the assumptions the service mesh makes are met. Also, we share a tool that is universal: KubeVirt doesn't have to implement it from scratch. It is a universal tool; it will in the future be part of QEMU and libvirt. And it doesn't require extra network capabilities: CAP_NET_ADMIN and CAP_NET_RAW are not required. passt is already supported by KubeVirt, but in the future we are thinking of even replacing our existing masquerade and bridge bindings with passt.

So let's have another look, a deeper look, at that orange box that was in the previous diagram — that's passt. It essentially implements a translation between layer-2 frames and layer-4 sockets. What you have on the left is a UNIX domain socket coming from QEMU, carrying Ethernet frames. And magically, on the right, you have layer-4 sockets: TCP, UDP, ICMP echo — that's what the kernel allows. You can see that this needs internal knowledge of TCP and UDP, of course: when this thing sees Ethernet frames, it needs to infer that there is a TCP connection going on, and it will need to open a socket on the right. The right is the host, and the left is the guest.
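To make that translation a bit more concrete, here is a minimal sketch of the idea — not passt's actual code — assuming a hypothetical `qemu_fd` that delivers one Ethernet frame per read on an already-connected AF_UNIX descriptor, IPv4 only, with framing, error handling and the rest of the TCP handling left out:

```c
#include <stdint.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>
#include <net/ethernet.h>

/* Read one Ethernet frame from the socket connected to QEMU and, if it is
 * the start of a new TCP connection (a SYN), open a matching layer-4
 * socket on the host side. */
static int handle_guest_frame(int qemu_fd)
{
	uint8_t buf[65536];
	ssize_t len = recv(qemu_fd, buf, sizeof(buf), 0);

	if (len < (ssize_t)(sizeof(struct ether_header) +
			    sizeof(struct iphdr) + sizeof(struct tcphdr)))
		return 0;

	struct ether_header *eh = (struct ether_header *)buf;
	if (ntohs(eh->ether_type) != ETHERTYPE_IP)
		return 0;			/* only IPv4 in this sketch */

	struct iphdr *iph = (struct iphdr *)(buf + sizeof(*eh));
	if (iph->protocol != IPPROTO_TCP)
		return 0;			/* UDP, ICMP, ... handled similarly */

	struct tcphdr *th = (struct tcphdr *)((uint8_t *)iph + iph->ihl * 4);
	if (!th->syn || th->ack)
		return 0;			/* only new connections here */

	/* The guest wants a new connection: open a regular TCP socket on
	 * the host, no CAP_NET_ADMIN or CAP_NET_RAW needed. */
	int s = socket(AF_INET, SOCK_STREAM, 0);
	struct sockaddr_in dst = {
		.sin_family = AF_INET,
		.sin_port   = th->dest,		/* already network byte order */
		.sin_addr   = { .s_addr = iph->daddr },
	};
	if (connect(s, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
		close(s);
		return -1;
	}

	/* A real implementation would now track the flow, complete the TCP
	 * handshake towards the guest itself, and relay payload between the
	 * two sides. */
	return s;
}
```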
Now, you might have thought of slirp when you saw that, maybe. It does essentially the same thing: it maps a QEMU interface — not a UNIX domain socket, but a QEMU interface — to host sockets. So it implements user-mode networking. It's very simple, it doesn't need any security privileges, but it was really written for a different purpose. So there are a few problems with slirp, for our purposes in particular. There is no focus on performance in slirp. It forces NAT, so you cannot have the same addressing and routing in host and guest — which would actually be convenient for the pods you saw before in KubeVirt. It implements a full TCP stack. IPv6 support is there now, but it's rather partial: there is no port mapping to the guest, for example. It runs in the same process context as QEMU, which might be a bit problematic, and there are also a lot of advanced features that might be exposing quite a lot in terms of attack surface.

So, now that we talked about pods, let's get to the data path. Here I'm presenting the basic data path that passt implements. On the left, just imagine that you have QEMU — you have it in the other room — and it connects to passt with a UNIX domain socket. Here you have just one big buffer, no per-connection buffers; that's a notable difference from slirp, and it allows us to have no dynamic memory allocation. We are abusing the kernel buffers — the host kernel buffers, of course — to buffer things for us. So when we get frames, we just spread them there. And it also means that... sorry, this is a bit the other way around, that was the other slide. So this is host to guest, okay: the packets are coming from the sockets now, and it's anyway very, very similar to the next slide — they all end up in one buffer. And, wait a moment, this is host to guest, yeah, okay, right: we cannot dequeue those buffers — we cannot really remove the data from the socket queues — until it is acknowledged by the guest, by QEMU. So we need to peek at it first, and then flush the buffers once we know it has been received. This is true for TCP; it is not the case for UDP, of course.

This is what I was talking about earlier: QEMU sends Ethernet frames, you have one big buffer — again, no per-connection buffers, so no need for dynamic memory allocation — and passt takes those Ethernet frames, checks which socket they need to go to, and spreads them there. And we have a similar mechanism to avoid implementing our own congestion control, which slirp actually implements; we don't do it, we just ask the host kernel when we can acknowledge segments.

Another difference: at the bottom you see the usual three full TCP state machines. You have the one in the guest and the one in the host — those are implemented by the Linux kernel — and slirp actually implements a full TCP state machine as well. With passt, if we bypass slirp and go to the top, we have something much simpler: the fact that we know we have a TCP stack in the host kernel allows us to use just three states and a couple of flags, and that makes the code, the implementation, much, much simpler.
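As a small illustration of that host-to-guest trick — not passt's actual code, just a sketch with hypothetical helper names and no error handling — peeking leaves the data queued in the host kernel, and MSG_TRUNC (see tcp(7)) later drops whatever the guest has acknowledged:

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Peek at pending data without dequeuing it, so the host kernel keeps it
 * buffered and it can be sent to the guest again if needed. */
static ssize_t peek_for_guest(int tcp_fd, void *frame_buf, size_t len)
{
	return recv(tcp_fd, frame_buf, len, MSG_PEEK);
}

/* Called once the guest's TCP ACK tells us 'acked' bytes arrived: only now
 * is the host kernel asked to discard them from the receive queue. */
static void drop_acked(int tcp_fd, size_t acked)
{
	recv(tcp_fd, NULL, acked, MSG_TRUNC);
}
```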
So let's cover a few security topics. What passt provides is sandboxing: all the namespaces that we don't really need are unshared, so the passt process doesn't see anything after it starts. It doesn't require any capabilities, because it doesn't create network interfaces — that's the same as slirp. It doesn't run as root, for real. It doesn't do dynamic memory allocation, which is a notable difference. The seccomp profiles are rather strict — it's 26 syscalls. It doesn't have any external dependencies, and it ships with SELinux and AppArmor profiles. And it's not written in Rust, at least for the moment. There would be some advantages to writing it in Rust: we would ensure that there are no stack-based overflows, for example. On the other hand, it is probably possible, but quite hard, to completely avoid dynamic memory allocation unless we use the Rust core library instead of the standard one. And we rely anyway on some buffer alignment tricks that would mean we would need some unsafe code. So the usability of Rust here is a big question, even though it should be possible, and it's something we are still considering.

Let's have a look at performance. You might think: well, user-space, user-mode networking, okay, it's a toy. However, it's actually not that bad — it kind of competes. Those are tests we made against the existing bindings in KubeVirt; that's what you have in those bars on the left. And passt is actually faster if you stick to a single tap queue. If you go to eight queues — the bars in the middle — then we have a gap; that's still a factor of two, at least for big-ish packets. For small packets, passt is actually faster. Why? Because we have a few tricks. The guest can use whatever MTU, because we don't actually send the guest's packets: we just send flat data, and the host kernel will deal with segmentation. So there are essentially no packets when we write to sockets — it's just payload. We compute checksums using AVX2 routines, on x86 of course. We have some pre-cooked buffers: those buffers are allocated early, they are there forever, and they are almost entirely pre-populated — whatever we can write already, we write it. And we don't have any congestion control, additionally. We are now trying to work on the difference between those two bars, to close the gap that we have with the multi-queue tap, and one idea that looks quite promising is to add a vhost-user back-end, so we would have one copy instead of two.

Where do you get this? Okay, first off, it's Linux-only for the moment, because we really interrogate the kernel: we ask the kernel a lot of things, like what is your current sending window, how many segments have been acknowledged, how big is the buffer. This is now probably doable with most flavors of BSD, but we didn't do it yet. Fedora packages are available — well, that will actually be true in two days: they are in the repositories, but still in the testing repositories for now. For other distros, we have unofficial packages. There is the KubeVirt tech preview that Alona mentioned earlier, and we have a PR for Kata Containers. The QEMU integration — as far as I know that series is not applied yet, but there is a proper patch set implementing native AF_UNIX socket support; until then we use our wrapper — and I'm working right now on a libvirt patch. There is already a patch available, but I'm working on proper support, also using the native AF_UNIX socket.

Those are links to lists — the patch workflow is an email-based one. There is a bug tracker, there is a chat. And I wanted to quickly mention an incarnation of passt which is called pasta, which is the same trick, actually implemented by the same binary — it's just a different command — that does the same story for namespaces. So it's an equivalent to slirp4netns. I already proposed it to the Podman developers; they said great, but where are the packages? So yeah, then I started packaging. And that's it, actually.
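To make the "pre-cooked buffers" trick mentioned above a bit more concrete, here is a minimal sketch — purely illustrative, not passt's actual buffer layout — of a statically allocated frame slot whose Ethernet/IPv4/TCP headers are filled once at startup, so that only the per-packet fields are touched on the fast path:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>
#include <net/ethernet.h>

struct frame_slot {
	struct ether_header eh;
	struct iphdr iph;
	struct tcphdr th;
	uint8_t payload[65536];
} __attribute__((packed));

static struct frame_slot slot;		/* static: no dynamic allocation */

/* Fill the constant parts once, at startup. */
static void slot_init(const uint8_t *guest_mac, const uint8_t *local_mac)
{
	memcpy(slot.eh.ether_dhost, guest_mac, ETH_ALEN);
	memcpy(slot.eh.ether_shost, local_mac, ETH_ALEN);
	slot.eh.ether_type = htons(ETHERTYPE_IP);

	slot.iph.version = 4;
	slot.iph.ihl = 5;
	slot.iph.ttl = 255;
	slot.iph.protocol = IPPROTO_TCP;

	slot.th.doff = sizeof(slot.th) / 4;
	slot.th.ack = 1;
}

/* Fast path: only the fields that change per packet are updated.
 * Addresses and ports are assumed to be in network byte order already;
 * checksum computation is omitted in this sketch. */
static size_t slot_fill(uint32_t saddr, uint32_t daddr,
			uint16_t sport, uint16_t dport,
			uint32_t seq, uint32_t ack_seq,
			const void *data, size_t len)
{
	slot.iph.saddr = saddr;
	slot.iph.daddr = daddr;
	slot.iph.tot_len = htons(sizeof(slot.iph) + sizeof(slot.th) + len);

	slot.th.source = sport;
	slot.th.dest = dport;
	slot.th.seq = htonl(seq);
	slot.th.ack_seq = htonl(ack_seq);

	memcpy(slot.payload, data, len);
	return sizeof(slot.eh) + sizeof(slot.iph) + sizeof(slot.th) + len;
}
```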
How many minutes do we have? One minute for questions. So, I don't know, because... oh, sorry. Can it replace slirp — are you going to replace slirp? That was the question. I think it can replace slirp. Right now I'm not thinking of proposing it as an alternative for "-net user", because, well, it might be used like that, but yeah, it could be the case. I mean, it depends a bit; I didn't really ask the QEMU community until now. So I think so — you will tell me, you guys will tell me. Any other question? Yeah. Do you have to poll on the TCP sockets? I actually don't, but I have some timeouts; it's kind of opportunistic polling. So usually not, but there are some timerfd events which we put in an epoll descriptor, and yeah, if an ACK doesn't come for too long, then yes, we actually have to ask the kernel again: hey, what happened to this? Yeah. Any other question?

Does the traffic between the sidecar and the pod go through the kernel each time? Yes. Well, essentially — it depends. I mean, with the vhost-user back-end we would probably... I guess you mean between the sidecar and the VM. So, okay, the VM here becomes part of the pod. Right now, yes, and it goes back and forth an awful number of times, because it actually goes to the UNIX domain socket and back to user space, passt gets it, and there are two copies. Those are cheap, but still, there are two copies: there is a read and a send on the socket. With vhost-user we would reduce that a lot, because we bypass QEMU, essentially: QEMU would just be in charge of setting up the data path, and then we would get the data from guest memory and we would just have one copy, from that buffer to the send on the socket. For that part of the communication, it doesn't... yeah. There is no real win? No real win there, no real win in terms of performance, no, not really. But yeah, with vhost-user we should be on par with what's currently being done. So our goal is to match it, not to exceed it, unfortunately — but, I mean, it's pretty good, I think. That's what the multi-queue tap does, yeah. The big win I expect is for communicating inside. Outside? Outside, yeah, yes, yes. Any other question? We've been too fast. Yeah. Okay. The longest minute ever.
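To illustrate the timerfd-in-epoll answer above, here is a minimal sketch — hypothetical helper names, no error handling, not passt's actual event loop — of arming a per-connection ACK timeout on the same epoll descriptor that watches the sockets, so a late acknowledgment wakes the loop and the kernel can be asked about the connection again:

```c
#include <stdint.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/timerfd.h>

/* Arm a per-connection ACK timeout and watch it together with the sockets
 * on the same epoll descriptor. */
static int arm_ack_timer(int epoll_fd, int seconds)
{
	int tfd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK);
	struct itimerspec timeout = { .it_value = { .tv_sec = seconds } };
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = tfd };

	timerfd_settime(tfd, 0, &timeout, NULL);
	epoll_ctl(epoll_fd, EPOLL_CTL_ADD, tfd, &ev);
	return tfd;
}

/* In the main loop, the timer firing means an expected ACK is late: drain
 * the expiration count and go back to the kernel to see what happened to
 * the connection. */
static void on_timer_event(int tfd)
{
	uint64_t expirations;

	read(tfd, &expirations, sizeof(expirations));
	/* ...query the socket state / retransmit towards the guest... */
}
```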