Hello, everyone. Good afternoon. My name is Miguel Duarte. I'm here with my colleague Kike. We both work for Red Hat on the OpenShift Virtualization networking team, and we are here to present a talk titled "KubeVirt VMs all the way down: a customized networking solution for the Cluster API provider KubeVirt."

Before all of that clicks into place and you understand what we're talking about — because we really don't know how savvy you are with KubeVirt, Cluster API, and all that — we're going to introduce three projects: KubeVirt, the Cluster API provider KubeVirt, and OVN-Kubernetes. Once we have a common understanding of those three projects, we can actually explain our motivation — why we care about this and what problem we're trying to solve — and the goals for the network plugin we want to develop. After that, Kike is going to walk us through the implementation details of the solution and show us a demo of it.

OK, so first we're going to introduce KubeVirt. KubeVirt is a Kubernetes add-on that allows you to run virtual machines and pods on the same platform. It essentially runs a libvirt and QEMU process inside a pod, and that's pretty much what it does. The tricky thing here, if you spend a few seconds thinking about it, is that you have a virtual machine, which is inherently a stateful entity, scheduled and running inside a pod, which is essentially a stateless entity on the cluster. That combination will make things tricky later on. One last thing we should keep in mind is that the networking requirements for virtual machines are a lot tougher than the ones for pods, mostly because of live migration. That's the feature we will live or die by — live migration will be our bread and butter for this presentation. I'll skip this one.

Let me now introduce the Cluster API provider KubeVirt. Cluster API is something that, in their own words, provides a declarative, Kubernetes-style API for cluster creation, configuration, and management. All this means is that the same thing you can do with, I don't know, Ansible or Terraform or whatever to provision a new cluster, you can do with this tool, and it will hand you a new Kubernetes cluster. It has different types of providers — AWS, Google, Azure — and there is one particular provider, the one we care about, which is KubeVirt. This means that the cluster you get is implemented using KubeVirt virtual machines as its Kubernetes nodes.

This begs the question of why you would want to do this. One reason is cluster scale: you can have one very dense, huge cluster with thousands of nodes and tens of thousands of pods, but I'd say that's really hard to manage and you won't see many of those. It's a lot more common to have many smaller clusters interconnected between themselves. Another use case is having a cheap cluster provisioner that you can use for things like CI. You want to test a feature, or you want to test how your application survives, I don't know, a DNS upgrade or something: you just create a cluster, run your tests, tear the cluster down at the end, and you're done.

Finally, let's introduce the OVN and OVN-Kubernetes projects. OVN is essentially an SDN control plane that orchestrates a bunch of Open vSwitch instances running on your worker nodes.
Its value proposition is that it lets you work with higher-level abstractions than what you get from Open vSwitch directly. Instead of managing OpenFlow, what you manage are things like logical switches, logical routers, and ACLs, and these are afterwards compiled into OpenFlow and installed on the nodes of your cluster. So if that is OVN, we then have OVN-Kubernetes, which is a CNI plugin that provides an opinionated topology and essentially translates Kubernetes objects into OVN logical entities. Let's say you provision a network policy on your cluster: OVN-Kubernetes will translate that into a set of ACLs, and those ACLs will in turn be translated into OpenFlow that gets installed on the nodes. The same thing happens with, say, services. That's its task: translating Kubernetes objects into OVN logical entities.

OK, with all of this in mind, we're good to go on the motivation. What we want is to decouple infra node updates from the tenant cluster VMs using live migration. What do I mean by this? Remember that the Cluster API provider KubeVirt gives you Kubernetes clusters and implements their nodes as KubeVirt VMs, and KubeVirt is itself a Kubernetes add-on, so you essentially get Kubernetes inside of Kubernetes. We call the topmost cluster the infra cluster, and the ones underneath, the ones being provisioned by this tool, the tenant clusters. Now let's say you want to upgrade your infrastructure cluster. We don't want that to impact the workloads of your tenants underneath — that cannot happen at all. For that, we will rely on live migration the entire time. And essentially, what we have today does not provide live migration; it simply does not give us what we want. That is why you will see the wacky thing that Kike came up with.

Why OVN, in the middle of all this? Why should we go for OVN? Well, other projects like OpenStack are already using that technology with really good results: with some improvements, you get a migration downtime of around 100 milliseconds, which is extremely good, and those are the numbers we want to strive for.

OK, so we know what we want, but now we have to set explicit goals for our network plugin. The first one is that the TCP connections established on the node — basically for the kubelet and for your tenants' workloads — must survive the migration. Once the Kubernetes node, which is essentially a KubeVirt VM, migrates from one place to another, everything it is running must survive the move to a different node. Another thing: the IP and gateway configuration on that worker node must remain the same; it cannot be updated during the migration. Why? Well, for instance, the kubelet is bound to that IP address. If the IP changes, the kubelet will basically go bananas and your workloads will be impacted. Another goal is that a tenant cluster cannot access anything in another tenant cluster unless that tenant exposes it via services. Likewise, a tenant cluster cannot access anything in the infrastructure cluster unless it is also exposed via a service. And we need to support that for two types of services: NodePort and LoadBalancer. And now I'm handing over to Kike.

So hello. Again, my name is Kike. I'm a software engineer working on KubeVirt networking.
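Before the implementation details, here is a hedged sketch of the kind of OVN north-bound configuration that the translation Miguel described produces, driven through ovn-nbctl from Python. The switch, port, subnet, and priority values are illustrative assumptions, not the actual names or numbers OVN-Kubernetes uses.

```python
import subprocess

def nbctl(*args):
    """Run an ovn-nbctl command and return its trimmed stdout."""
    return subprocess.run(("ovn-nbctl",) + args, check=True,
                          capture_output=True, text=True).stdout.strip()

# The logical topology is declared at this level of abstraction; OVN itself
# compiles it down to OpenFlow on every node.
nbctl("ls-add", "tenant-a-switch")
nbctl("lsp-add", "tenant-a-switch", "tenant-a-pod-port")
nbctl("lsp-set-addresses", "tenant-a-pod-port", "0a:58:0a:f4:02:03 10.244.2.3")

# Roughly what a NetworkPolicy allowing ingress to TCP 8080 from one subnet
# could become: an allow ACL on the tenant switch plus a lower-priority drop.
nbctl("acl-add", "tenant-a-switch", "to-lport", "1001",
      "ip4.src == 10.244.2.0/24 && tcp.dst == 8080", "allow-related")
nbctl("acl-add", "tenant-a-switch", "to-lport", "1000", "ip4", "drop")
```

OVN-Kubernetes performs this translation automatically through its OVN database client rather than shelling out like this; the sketch only shows the shape of the resulting objects.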
We have tried multiple approaches to get some kind of live-migration proof of concept. What we are going to see now are the big points we need implementation-wise. What we need is to implement live migration on the cluster default network, not on a secondary network; we will see later why we want it on the default network. We also don't want to set the IP address on the pod. We want to bypass the networking inside the pod as much as possible and pass all the information to the VM, so that Kubernetes is not in the middle. For that we configure OVN's DHCP options on the logical switch port, which means we essentially prepare a DHCP server so the VM can consume that IP configuration.

We also copy a mechanism from Calico: they use something called ARP proxy. What it means is that on some parts of the topology you can configure a port so it answers ARP for a foreign IP address, one that doesn't belong to the subnet at that level. That is how we implement ARP for the default gateway, so the VMs always have the same default gateway regardless of the node where they are running, and the neighbor cache is exactly the same too.

Okay. Now we are going to look at the topology for north-south communication. The northernmost part is going to be exactly the same before and after the migration. The important parts here are this IP address here and here, because these are the IP addresses we use to redirect traffic — to do the point-to-point routing — during the migration, as you can see in these tags.

Next we see the lower part of the topology; the previous slide was the upper part. This is where the point-to-point routing is done, and for that we have two important OVN resources to configure, for egress and for ingress. We have something that in OVN is called a policy, and we use it for egress: we say, OK, for traffic coming from this IP, we want it to go through the node with this IP — the same IP address I pointed out on the previous slide. Then we have another thing, the static route, which we use for the ingress traffic. It's kind of the opposite: if traffic is destined for this IP address, it goes out over this port.

Another thing we configure in the topology is the ARP proxy, and, as you see — this is very important — the ARP proxy is exactly the same on both nodes. With this, the VMs have exactly the same default gateway regardless of the node, and, most importantly, the neighbor cache stays the same, so there is no need to update it and the downtime after live migration is lower. The idea is to minimize as much as possible the reconfiguration needed during live migration.

Another important part, as we said, is the configuration of the DHCP options. This is the OVN terminology for starting a kind of DHCP server: it serves the IP configuration to the VMs over DHCP. And another important detail is here: the pod interface is not configured at all, so this is just L2 communication. The IP configuration doesn't make, let's say, noise during live migration; it's just an L2 link between the VM and OVN, and of course the VM receives its IP address over DHCP.
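Here is a hedged sketch, again using ovn-nbctl from Python, of the three pieces just described: the DHCP options attached to the VM's logical switch port, the Calico-style ARP proxy for the shared default gateway, and the per-VM reroute policy and static route that do the point-to-point routing. The router, port, and address names are assumptions for illustration; the actual plugin configures OVN through its own client, not a script like this.

```python
import subprocess

def nbctl(*args):
    """Run an ovn-nbctl command and return its trimmed stdout."""
    return subprocess.run(("ovn-nbctl",) + args, check=True,
                          capture_output=True, text=True).stdout.strip()

VM_IP, NODE_IP = "10.244.1.5", "100.64.0.2"     # illustrative addresses

# 1. DHCP options: OVN answers the VM's DHCP requests itself, so the pod
#    interface stays a plain L2 link and Kubernetes never programs the IP.
nbctl("dhcp-options-create", "10.244.1.0/24")
opts = nbctl("--bare", "--columns=_uuid", "find", "dhcp_options",
             'cidr="10.244.1.0/24"')
nbctl("dhcp-options-set-options", opts, "lease_time=3600",
      "router=169.254.1.1", "server_id=169.254.1.1",
      "server_mac=0a:58:a9:fe:01:01")
nbctl("lsp-set-dhcpv4-options", "vm-port", opts)

# 2. ARP proxy (recent OVN) on the port that joins the node switch to the
#    router: the same gateway answers ARP on every node, so the VM's
#    neighbor cache never has to change after a migration.
nbctl("lsp-set-options", "node-a-to-router", "arp_proxy=169.254.1.1")

# 3. Point-to-point routing that follows the VM:
#    egress via a reroute policy, ingress via a /32 static route.
nbctl("lr-policy-add", "cluster-router", "1004",
      f"ip4.src == {VM_IP}", "reroute", NODE_IP)
nbctl("lr-route-add", "cluster-router", f"{VM_IP}/32", NODE_IP)
```

After a live migration, only the last two entries change: the reroute next hop and the static route are re-pointed at the destination node, which is exactly what the next part of the talk shows.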
Okay. Then we see how this bottom part of the topology looks after live migration. As you can see, it's kind of a mirror image; everything here is exactly the same. The only thing that changes for the egress traffic is this IP, which means: OK, now I'm on this node, so I want my egress traffic to go out through the node that has this IP address. And the same for the ingress traffic: it's the same VM IP, but now we redirect it to the port of the new node. And that's the topology.

Now we are going to do a real demo. I'll explain it very quickly with a couple of slides. It's super simple, but maybe it's good for illustrating things. We have a pair of nodes in the infra cluster, and we have the worker VMs. We have a client pod that opens one TCP connection to a server, using some super dummy software we put there called tcp-proof. We open just one connection, and if it gets broken, the server goes down, so it's easy to see that the TCP connection was broken, which is very important for us. Then we do a migration, and within the same infra cluster the VM goes from one node to the other, and the TCP connection is kept.

All right, let's go with the demo. Before I start, let me explain what we have here. The first part shows the latency between request and response as seen by the client. In the bottom part we see the two pods that implement the migration in KubeVirt, because in KubeVirt, as Miguel said, all the VMs are backed by a pod, and during live migration you have one pod on one node and another pod on another node. At some point the state is transferred from one pod to the other, then one of the pods dies and the migration has ended. So we will see the latency here and the KubeVirt pods here: one pod is on one node, the other is on the other node, and here we see the state.

Okay, let's start. Now it starts: we watch the latency, and at some point we will migrate the VM. You can see here in the status that the target pod on the target node is running. Now they are transferring the state between the two nodes using the libvirt mechanisms — memory pages and the like. And now the old pod on the old node is going to become not ready. Migration is done.

What is happening here is that this proof of concept is not perfect, but what we want to accomplish is that the TCP connection is kept, and that is good enough for us. That's what it is; there is nothing more here. In the rest of the recording we do another migration, so you can also see that the IP address stays the same, and the same thing happens but in the opposite direction, and then there is a latency spike up here. I know it's not perfect, but it's good enough for us for now. We have some ideas to improve it; for example, in OpenStack they use something called multiple requested-chassis, and we will see what we do with that. And that's it.
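For readers who want to reproduce something like the demo, here is a hedged, minimal stand-in for the single-connection check described above. It is not the actual tcp-proof tool from the talk, just an illustrative equivalent: the client holds one TCP connection open, prints the round-trip time of each ping, and exits if the connection ever breaks.

```python
#!/usr/bin/env python3
"""Minimal single-connection latency probe (illustrative, not tcp-proof)."""
import socket
import sys
import time

def server(port=9000):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", port))
    srv.listen(1)
    conn, _ = srv.accept()              # a single connection, as in the demo
    while True:
        data = conn.recv(64)
        if not data:                    # peer closed: the connection broke
            sys.exit("connection lost")
        conn.sendall(data)              # echo back so the client can time RTT

def client(host, port=9000):
    sock = socket.create_connection((host, port))
    while True:
        start = time.monotonic()
        sock.sendall(b"ping")
        if not sock.recv(64):           # empty read: the connection broke
            sys.exit("connection lost")
        print(f"rtt: {(time.monotonic() - start) * 1000:.1f} ms")
        time.sleep(0.1)

if __name__ == "__main__":
    # usage: tcp_probe.py server | tcp_probe.py client <server-ip>
    client(sys.argv[2]) if sys.argv[1] == "client" else server()
```

If the migration preserves the connection, the client only shows a latency spike around the switchover instead of exiting.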
Okay, conclusions. We have explained what we are doing on the default network, instead of using a secondary network where we would have more freedom to change things. In the world of tenant clusters, we need a lot of the Kubernetes mechanisms to implement communication with the API server that is running in the infra cluster, the management cluster. By using the default network we get a lot of that for free: we have access to services, we have isolation, we have network policies, we have all of this. Then we have seen that, using point-to-point routing on the primary interface, we can make the TCP connections survive and keep a consistent IP address that follows the VM during the migration. And with these points, what we discovered is that we know this proof of concept works, and we can start to implement it and improve it little by little. And that's it. Questions?

So the question, if I understand correctly, is about what happens when someone tries to access the port during live migration. It depends: if the TCP connection is already open before the live migration, as we have shown, the connection is not going to break, but you are going to see some extra latency on the packets. If you try to establish a new connection during the live migration, it's possible that your client will have to retry until it can establish the connection. Right now that window is something like half a second. I know it sounds super bad, but it's just a proof of concept. So that is how it's going to behave. Okay, thank you.

All right. And no, because in OVN — sorry, OVN-Kubernetes — you have a different logical switch per node, so the L2 traffic doesn't escape the node. Even though the MAC address and the default gateway IP are the same everywhere, since L2 is cut off at the node, it's not going to traverse to the other node; you have different switches for each node. Even if the ARP is the same: when you use the default gateway, what happens is that you just resolve it to the gateway's MAC address, and that never reaches the other node, because L2 stays within the node and you have the distributed router on top of it. Does that make sense?

Yes, we are aware of that — we are not exactly breaking it. The question, if I understand it correctly, is that this feels a little like we are breaking what is expected from Kubernetes networking: two different pods would normally just have different IP addresses. That's why we are using point-to-point routing; the pod can end up on a node that only knows its own subnets, and that's why we need these mechanisms. We are kind of — I don't know how to put it — making it more flexible, fighting it a little. It's not for general-purpose pods; it's for a very specific thing, a backend for KubeVirt. I don't know if we are going to use it for anything that is not KubeVirt, but yes, we know that, and people are happy about it, because we prefer to have live migration even if it is restricted to this use case. Yeah, let's see. Okay, anything else? Okay, thank you.