Okay, then hello everybody, and I guess if everybody's seated, we can start. My name is Felix Hötner. I want to talk today a little bit about OVN and running it at scale, and burning it not only once, but actually quite a lot of times.

Just to give you a short overview of what we are currently running: we are running an OpenStack environment with 550 hypervisors at the moment, and another one that's not running OVN yet, from which we will steal a few more hypervisors in the future. All of these are supported by nine network or gateway nodes, depending on how you want to name them, which host all the external traffic of OVN. So we are not using some kind of DVR setup; we are centralizing all ingress and egress traffic for floating IPs and external networks. All of that is running on OpenStack Yoga and OVN 22.12, which is quite recent, it was released just last December, and OVS 3.1, which I think is the latest release that's actually out there.

A little bit of an overview of the workload: it's currently just 6,000 VMs. Yeah, I know the description said a larger number, but it's not that many. A few thousand networks, a few thousand routers, and actually 16,000 ports. We were quite confused when we saw that number ourselves, but there are a lot of ports in all of these networks: for metadata, for router ports, multiple ones. Quite a lot of security groups and security group rules, and a lot of these are referencing each other. The peak we measured was 15 gigabits of external traffic running through these nine network nodes.

So I want to give you a short overview of the architecture of our OVN and Neutron setup, just so you have a feel for what I want to talk about now. The Neutron API connects to the northbound database of OVN. The northbound database is basically the OVN representation of what we have in Neutron, so it's basically a translation layer. There's the definition of switches (which are networks), routers, ports, and all of the things that you all know, just in a different representation so that OVN can work with it and is, let's say, OpenStack agnostic.

The next part is northd. northd is another translation layer: it translates from the high-level description of "I am a router", "I am a switch", "I have a virtual port that runs keepalived, for example", into flows and port bindings. These are then stored in the southbound database. So the southbound database is a quite a bit lower-level representation, and it's the main translation layer that OVN has.

Below the southbound database there are relays. These basically save your southbound database from dying all the time. We tried it without relays, and it doesn't go too well if you have a lot of client connections. What relays actually do is hold a replica of the southbound database data, so clients can ask the relays instead of the southbound database. So they distribute the load of client queries, which are the main cause of issues for the databases.

Then on each hypervisor and on each network node you have the OVN controller. The OVN controller talks to the southbound database, reads from it the representation of what it needs to install, and translates that into the even lower-level implementation that then runs in Open vSwitch, stored in a very local OVSDB. That in turn talks to the kernel to actually install flows and to actually process packets.
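To make that concrete, here is a minimal sketch of how a client such as the Neutron API talks to the northbound database, assuming the Python ovsdbapp library that Neutron itself uses; the address is made up and the table contents are illustrative, so treat this as a sketch rather than our production code:

```python
from ovsdbapp.backend.ovs_idl import connection
from ovsdbapp.schema.ovn_northbound.impl_idl import OvnNbApiIdlImpl

OVN_NB = "tcp:192.0.2.10:6641"  # assumption: address of a NB database member

# Build an OVSDB IDL for the OVN_Northbound schema and wrap it in the
# high-level northbound API.
idl = connection.OvsdbIdl.from_server(OVN_NB, "OVN_Northbound")
api = OvnNbApiIdlImpl(connection.Connection(idl, timeout=60))

# Every Neutron network shows up here as a logical switch.
for switch in api.ls_list().execute(check_error=True):
    print(switch.name)
```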
So we have a lot of translation layers that go from very abstract concepts like routers and switches down to very detailed flows and very detailed implementations. There's also the OVN metadata agent that hosts the metadata service all the VMs consume for cloud-init. That was previously centralized with the L3 or DHCP agents, but is now distributed and running on each individual compute node.

And just to make it easier for everybody, I wanted to have a short dictionary between Neutron and OVN, because the naming is quite a bit different. For example, networks are logical switches. Subnets don't have a representation in OVN at all; they are represented by IPs existing with different configurations at different locations, but there's no subnet resource in itself. Routers are routers. Ports are distributed between switch ports and router ports, although you can normally find a one-to-one representation easily. And security groups and security group rules are translated to a quite similar equivalent.

I'm not sure if everybody has had a chance to take a look at flows and what OVN is doing there, so let's take a look at a flow. This is taken from the southbound database, and it's a very, very rough version of what a router does. You see the individual tables in here. OpenFlow, or the flow idea itself, uses different tables: your packet comes in at table zero and is matched against a set of rules, and whatever rule it matches, that rule's action is executed, and the action might say go to the next table, or jump to a different table, or something like that.

So the very first table does something you normally have in your network interface anyway: if I get a packet, is it actually sent to my MAC address, or is it sent to somewhere else? If it's not sent to my MAC address, I can completely drop it, because it's most probably not something I can do anything about. And if the condition matches, we jump to the next table. I skipped a few, or actually quite a lot, of these tables, because there's a lot of magic happening in them for very different things, like fragmentation of packets, load balancing, lookups and so on, and I wanted to keep it a lot simpler.

So we jump to table three. Table three does the very first checks regarding routing: for example, if the time to live of the IP packet is too low, we send a "time to live exceeded" and can stop the processing there, because we don't care what else is going on. Also, if we receive a ping, we can answer it here, so the ping doesn't go through the whole routing pipeline; it's stopped very early in the process. All other packets just continue.

Tables 13 and 17 then do the actual routing magic. Table 13 asks: this destination IP, is it reachable via another interface than the one the packet came in on? If yes, I can decrement the time to live, change the source MAC address, and set the port where I want to send the thing out. If the packet is addressed to some IP where I have no idea where to send it, I just drop it, because I can't do anything about it. And table 17 is then the lookup for the destination MAC address: I need to see what the destination IP is. Is it a next-hop router, so do I need to look up the MAC address of the router I now need to send it to? Or do I need to look up the MAC address of, for example, the VM? I look it up and set it on my packet. If I don't know it, I send an ARP request, which is not nice, because then I actually drop the packet, just send the ARP request, and learn the answer for later, hoping the source will send another packet. For the next packet I then hopefully have an answer to this ARP request and can actually forward it.
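As a purely illustrative toy model (this is not OVN code, and the MACs and tables are made up), the pipeline just described boils down to a chain of match-action tables like this:

```python
from dataclasses import dataclass

MY_MAC = "00:00:5e:00:53:01"                       # assumed router-port MAC
MAC_BINDINGS = {"10.0.1.5": "52:54:00:aa:bb:cc"}   # learned IP -> MAC entries

@dataclass
class Packet:
    eth_dst: str
    ip_dst: str
    ttl: int

def table_0(pkt: Packet) -> str:
    # Is this packet addressed to our MAC at all? If not, drop it.
    return "goto:table_3" if pkt.eth_dst == MY_MAC else "drop"

def table_3(pkt: Packet) -> str:
    # Early checks: answer "TTL exceeded" here and stop processing.
    return "reply:ttl_exceeded" if pkt.ttl <= 1 else "goto:table_13"

def table_13(pkt: Packet) -> str:
    # Routing decision: decrement TTL, rewrite source MAC, pick output port.
    pkt.ttl -= 1
    return "goto:table_17"

def table_17(pkt: Packet) -> str:
    # Destination MAC lookup; on a miss, send an ARP request and drop.
    mac = MAC_BINDINGS.get(pkt.ip_dst)
    if mac is None:
        return "arp_request_and_drop"
    pkt.eth_dst = mac
    return "output"

pkt = Packet(eth_dst=MY_MAC, ip_dst="10.0.1.5", ttl=64)
for table in (table_0, table_3, table_13, table_17):
    print(table.__name__, "->", table(pkt))
```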
And you would think it might all just work without issues, but I guess we all know that that's not how it goes. There are a lot of different, and sometimes quite detailed, issues that we saw. Some are fixed, and I'll go into a bit more detail on them later. Some are not fixed, or are not easy to fix, because they are more of an issue on a conceptual level.

One thing is router distribution during failover. That's something we saw just recently: routers pile up on single network nodes while the other network nodes stay quite empty, causing that one network node to process basically all of the traffic.

One of the main issues we saw is OVSDB server clusters that are not too stable, or not too healthy. All of the OVN implementation is basically a single-threaded solution, or multiple single-threaded solutions. So you can quite easily overload your OVSDB server so that it doesn't answer heartbeats quickly enough for the other nodes. The other nodes then think it's down, they vote for a new leader of their cluster, and the cluster breaks apart a little. That's not nice.

OVN controllers also take a few seconds to reconcile. There is some incremental logic, so if there's a change, they try to recompute only a little bit of information, but for some things they actually need to recompute all of the logic they hold, and that can take a few seconds. Most of the time that's not a big issue, but keepalived failovers, for example, need such a recompute. And if I then need to wait three or four seconds, my keepalived failover might happen fast inside the VM, but from a network perspective it's delayed by a few seconds.

What we also saw is a lot of edge cases regarding SSL connections in the OVSDB clients. We are using TLS connections everywhere, but there's some error handling in very weird edge cases that we ran into. We also found an actual bug in the Linux kernel where you could kill the entire system by deleting a network namespace; I'll show you that at the end.

And then there's the MAC binding table. I told you earlier that OVN sends ARP requests if it doesn't know the destination MAC address for a given packet. The answers are then stored in the MAC binding table in the southbound database, and per default they don't have a time to live; they just stay there until the router that owns them is deleted. So we quite easily went up to something like 500,000 entries in that table and couldn't get rid of them easily. There's a feature called MAC binding aging, where these entries are removed after a given time, but they are not actively renewed. So the MAC binding is deleted, the next packet that comes in triggers a new ARP request, but that packet itself is dropped again, so you have a traffic outage. That's something that's actively being worked on upstream, to get rid of this not so ideal behavior.
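If you want to see how big that table has gotten in your own deployment, a minimal sketch is to count the rows of the southbound MAC_Binding table; this assumes ovn-sbctl is installed and can reach the southbound database, and the parsing is deliberately naive:

```python
import subprocess

# "ovn-sbctl list MAC_Binding" dumps every learned IP/MAC entry; each
# record starts with a "_uuid" field in the default output format.
out = subprocess.run(
    ["ovn-sbctl", "--no-leader-only", "list", "MAC_Binding"],
    capture_output=True, text=True, check=True,
).stdout

entries = sum(1 for line in out.splitlines() if line.startswith("_uuid"))
print(f"MAC_Binding entries: {entries}")  # we saw this grow to ~500,000
```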
So all of that sounded, let's say, not that great, but let's compare it to the ML2/OVS plugin. And I'll do that across a few different topics.

Let's take a look at maintenance. We previously had issues with restarting L3 agents, because when they restart, they need to synchronize and create network namespaces, keepalived processes and all of that stuff, which at least for us took 45 to 60 minutes. You need to take that number with quite a bit of care, because that's from our environment that's still running OpenStack Queens. I guess there have been some improvements since, but I guess you won't get to the point of OVN, where we are quite consistently below a minute.

HA routers, as you probably all know, use keepalived for failover, and you have the DHCP agents that use dnsmasq processes. So your network nodes run quite a few processes if you have a lot of routers. That's quite nice in OVN, because it's all flows: you don't have the issue of thousands of processes running on that system that might all need CPU time at exactly the same time. And failover in OVN is handled using BFD, so from our experience it's a lot more precise. For ML2/OVS, our stable maintenance procedure was to migrate everything away from the L3 and DHCP agents: our routers are manually rescheduled to another node, the DHCP namespaces are manually rescheduled somewhere else, and then we stop the agents, because then nothing bad can happen anymore. For OVN, we can just stop the OVN controller, because that stops these things cleanly and doesn't actually burn the whole environment.

Let's take a look at what happens under overload, so when a network node processes more traffic than it can handle, or needs more CPU time than it has. With ML2/OVS we had that a few times, and we spent a few hours on recovery each time, because if one of the network nodes gets overloaded, it randomly stops answering keepalived. Other network nodes then take over these keepalived instances and become masters, but the overloaded node isn't consistently dead, so it still sends out some VRRP announcements as well, and the keepalived instances start to flap. So one node can easily bring down other ones, just with the load of keepalived failing over all the time. With OVN, we have never observed an issue like that. A node might fail, but then that one node fails. You can try to reschedule routers to other nodes, so maybe you can isolate the issue, but the failure doesn't cascade to somewhere else.
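For that rescheduling, a hedged sketch of one way to move an OVN gateway to other chassis by rewriting the gateway-chassis priorities with ovn-nbctl; the router port and chassis names are made up, and in a Neutron deployment you would normally let Neutron manage this rather than editing the northbound database by hand:

```python
import subprocess

LRP = "lrp-external-gateway"   # assumption: the router's external gateway port
CHASSIS_PRIORITIES = [("gw-node-2", 3), ("gw-node-3", 2), ("gw-node-1", 1)]

# "ovn-nbctl lrp-set-gateway-chassis PORT CHASSIS [PRIORITY]" declares where
# the gateway may be bound; the highest-priority chassis whose BFD session
# is up carries the traffic, so reordering priorities moves the router.
for chassis, prio in CHASSIS_PRIORITIES:
    subprocess.run(
        ["ovn-nbctl", "lrp-set-gateway-chassis", LRP, chassis, str(prio)],
        check=True,
    )
```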
And if we take a look at control plane outages, ML2/OVS currently looks a little more stable if the control plane is completely down, but whether that's actually the case, I'm not so sure. Both of them obviously prevent updates: if the control plane is down, you can't change anything. What still works with ML2/OVS is failover of users' keepalived instances. That doesn't work in OVN for the cases where the packets to that keepalived IP go over a router, because that failover needs the control plane for updates. On the other hand, we don't have the issue of things going out of sync, which maybe you also saw with ML2/OVS, where some agent just randomly doesn't know the network nodes it needs to send traffic or broadcast packets to, and which randomly fixes itself after some time.

So let's go into a bit more detail on what actually broke. One thing we saw quite early: if you just restart an OVN controller, it takes a little bit of time, or quite a long time, until the Neutron API actually sees it as up again. That's a fix that's already merged, because that was just a missing notification.

We saw quite a lot of issues with the OVN metadata agent if you have a lot of ports it needs to monitor, because it basically asks the southbound database for all ports. Not only the ports of the node it's currently running on, but all ports. And since it's maybe not the most performant Python implementation in the world, it might take a few minutes, or a long time, to actually start up. We use monitor conditions for that now, so we just ask the southbound database to report what's on the local node, because we don't care about anything else.

Then there's the thing I mentioned earlier: OVN routers are not rebalanced when failing over to other nodes. We're still taking a look at that. And there's some issue with Neutron bumping revision numbers, so Neutron thinks some port in OVN is out of sync with the Neutron database, resynchronizes it every five minutes, and then still thinks it's out of sync. But since that only affects virtual ports, it doesn't hurt us that much.

As you can see, OVN broke more. For example, one thing that's also already fixed is IPv6 failover for external IPs, because no announcements were sent, so the physical switches didn't know they now needed to send traffic to a different physical switch port. If you want to use that, OVN 22.12 includes the fix. There were a few fixes for the Python OVSDB package regarding the handling of broken connections, where we could easily get to 100% CPU time trying to reconnect on a socket that's actually closed.

One thing that was an actual denial of service from the outside: a packet sent from the outside to the external router port itself was resubmitted. So the router tried to route the packet to itself and decremented the TTL, tried to route it to itself again and decremented the TTL, and did that until the TTL expired. This just causes a lot of load on OVN, because it doesn't know it can simply drop the packet; it needs to process it.

The recomputes in ovn-controller and northd are, I guess, a constant struggle, just because they are single-threaded processes, and northd currently doesn't have incremental processing; that's literally being implemented right now. So changes can just take a long time to propagate, which is especially annoying for failover of keepalived in user instances; otherwise you don't notice it too much.

Next is OVSDB cluster stability; I'll give you a few tips for that right after this slide. And what's also quite ugly is that there's a resubmit limit in OVS: if a packet needs to be processed by too many actions, OVS drops it, because it just can't handle it and wants to prevent an infinite loop. There's also quite a performance impact if OVS has to go through that for each individual packet, and there we need to look at each individual instance of it and determine the appropriate behavior.
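On the reconnect problem mentioned above: a minimal sketch of the kind of defensive loop that avoids spinning at 100% CPU on a dead socket. This is a generic illustration of the failure mode and its mitigation, not the actual fix that was merged into the Python OVSDB package:

```python
import socket
import time

def connect_with_backoff(host: str, port: int) -> socket.socket:
    delay = 0.5
    while True:
        try:
            return socket.create_connection((host, port), timeout=5)
        except OSError as err:
            # Without this sleep, a permanently closed endpoint makes the
            # retry loop busy-spin and pins a CPU core.
            print(f"connect failed ({err}), retrying in {delay:.1f}s")
            time.sleep(delay)
            delay = min(delay * 2, 30)  # exponential backoff, capped at 30s

# Example (hypothetical relay endpoint):
# sock = connect_with_backoff("192.0.2.20", 6642)
```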
So if you now want to run OVN at scale, let's take a look at the northbound database first. You see the three northbound cluster members; OVSDB uses a Raft cluster for that, which works quite nicely, and you have one leader in this Raft cluster and two followers.

One thing to do, if you want to take backups: don't take the backup from the leader, because the leader needs to process all writes, and it needs to write a snapshot before you can take the backup, which is not that great because it then fails over. If you want to use TLS, which I can strongly recommend, don't do the TLS termination in OVSDB itself. Put some kind of reverse proxy in front that terminates the TLS connection, just to get the load on the OVSDB servers down even further. The Neutron API or northd can then talk to those TLS termination processes, and through them to the northbound database.

For the southbound it makes sense to go even further. You have the bottom part, which is the same as before, but only northd connects there, and at the top you see relays. The idea of relays is to replicate the content of the database and serve read requests themselves, while forwarding write requests to the cluster. Most requests in an OVN deployment are reads, so you can get quite substantial benefits from that. And don't just go with one or two relays; go with a lot of relays. Relays are cheap: worst case they consume one CPU core and maybe one or two gigabytes of memory. We, for example, run 24 of these. It's there, use it. And point everything that's not northd at these relays: the Neutron API, the OVN controllers, the OVN metadata agents. Well, actually maybe not yet the Neutron API, because that doesn't work out of the box, but there's a patch that makes it work. A lot of client libraries try to connect to the leader of the Raft cluster, either explicitly or because they forgot to say "I can connect to anything", and that breaks with relays, because relays are never the leader of the Raft cluster.

Regarding timeouts: OVN has a lot of timeout settings between the different processes. Set them to large numbers; one or two minutes is a seemingly completely fine value. It doesn't feel good from my perspective, but some improvements are needed before you can reliably set them to lower values. Otherwise you have random things disconnecting, and that doesn't help availability either. Regarding versions: there are some LTS releases. I'm not sure why they exist. The amount of changes, benefits and performance gains in newer versions is so large: don't use LTS.

Then there are two things for when somebody else is active on your external network, other consumers or other people, because OVN tries to do a lot of things there. For example, you get GARP packets from the outside because of some kind of failover, but you actually don't care about them, which might be a lot of cases, because maybe the MAC address doesn't even change; it just moves to a new port, and from an OVN perspective it's still the same one port. It's really only relevant for the physical switch infrastructure in between. There's a new setting regarding broadcasting ARPs to all routers; you can disable that, and then OVN will just drop these packets instead of trying to send them to more routers than it can actually handle, because that's one of the instances where we hit that resubmit limit. Also, if you have a bunch of hosts out there, set the MAC binding age threshold so that the MAC binding table gets cleared. Otherwise you will end up with an amount of MAC bindings that will never expire. If the hosts out there are all static, there's no benefit in setting it, but if they are a little volatile, then I guess there's a lot of benefit in getting rid of these entries.
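A hedged sketch of those two knobs, applied with ovn-nbctl via subprocess. The option names match what recent OVN releases document (a mac_binding_age_threshold option per logical router, and an NB_Global option to stop broadcasting ARPs to all routers), but check the ovn-nb(5) man page of your version before copying this; the router name is made up:

```python
import subprocess

def nbctl(*args: str) -> None:
    subprocess.run(["ovn-nbctl", *args], check=True)

# Age out this router's learned MAC bindings after 300 seconds
# (0, the default, means they never expire).
nbctl("set", "logical_router", "my-external-router",
      "options:mac_binding_age_threshold=300")

# Drop ARP broadcasts from the external network instead of resubmitting
# them to every connected router (one way we hit the resubmit limit).
nbctl("set", "NB_Global", ".",
      "options:broadcast-arps-to-all-routers=false")
```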
Okay, and I promised you a kernel bug. There's a kernel bug. We actually had an issue in the start-up script of our metadata agent. Let's first take a short look at how the whole thing works. You have a VM on a host, with a tap device in the main network namespace of the host that is connected to the virtual port of the VM, and the tap device is connected to OVS. For the metadata agent there's also a veth pair: one end is in the main network namespace, the other end is in the metadata network namespace where an HAProxy and all that magic runs. If the VM now sends a packet to the metadata agent, it goes through the kernel, ideally via the accelerated datapath.

What we observed: if you delete that metadata agent network namespace, not the port, not the HAProxy process, but really the namespace, while it's still active, and the VM sends a packet there at exactly the same time, you can hang the kernel. The kernel then tries to send a packet to a veth port that doesn't exist anymore, or is in the process of being decommissioned, and there's a nice infinite loop that can trigger in that case, basically bringing your host down, because a lot of processes depend on things that then never actually finish. There's a fix for that in kernel 6.1, and in the one that we built.

Okay, and that's it from my side. Are there any questions?

We are currently in the process of building our upgrade path for OVN, so I guess that's a session for the next summit.

Any more questions? Yeah, we started that with Yoga, and we would probably now go for newer OpenStack versions, since they're already there. There are some incompatible changes that can happen between OVN and Neutron that both sides need to be aware of, so it makes sense to be quite current with both of them.

There was a question back there, but I don't know from whom. We are currently looking at offloading that work from the servers, basically by doing hardware offloading on the network card, to push these flows out of OVS and out of the kernel datapath onto the network card. But with bonding and VLANs and things like that, the documentation doesn't exist. If you want to know more, ask Luca back there.

We used that also in the past with OVS, so we didn't want to take a closer look at it again. Maybe that would make sense using BGP, but we wanted to keep all of this external connectivity, the multiple external networks, in just a few locations, so that these few locations get all the horribleness of stretched networks and all that fun stuff, while all the other nodes are fine and don't need to care about that issue. So we wanted to centralize the problem.

What we do for maintenance on a compute node is completely evacuate all VMs from it beforehand, and afterwards you can quite easily stop the OVN controller. There's nothing that I know of that you need to take special action on.

What we found interesting is the upcall statistics from OVS: OVS has an internal metrics tracker that tracks various things that are happening, and that's sometimes quite a good indication of what it's trying to do when it doesn't respond. Otherwise, take a look at the megaflows, and take a look at which packets are hitting kernel space and which packets are hitting user space. That's quite interesting, because if you see a lot of packets hitting user space, they can quite easily overwhelm Open vSwitch on the user-space side and therefore overwhelm the kernel side, and the kernel will then start dropping packets. That's one of the bigger issues. What we see is that around 99% of all packets are handled in the kernel. If that goes below 95%, the user space has no chance to actually keep up.
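A minimal sketch of checking that kernel-versus-userspace ratio, assuming "ovs-dpctl show" prints a line like "lookups: hit:123 missed:4 lost:0" for the datapath, as current OVS versions do; the exact output format may vary between versions, so the parsing here is deliberately simple:

```python
import re
import subprocess

out = subprocess.run(["ovs-dpctl", "show"],
                     capture_output=True, text=True, check=True).stdout

# "hit" = flows handled entirely in the kernel datapath,
# "missed" = packets that had to be upcalled to user space.
match = re.search(r"hit:(\d+)\s+missed:(\d+)", out)
if match:
    hit, missed = (int(g) for g in match.groups())
    ratio = hit / (hit + missed) if hit + missed else 1.0
    print(f"kernel datapath hit rate: {ratio:.2%}")  # we like to see ~99%
    if ratio < 0.95:
        print("warning: too many upcalls, user space may be overwhelmed")
```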
Choosing the timeout values is currently also rather based on experience. You get these log messages saying that some computation exceeded the planned time of a second, and if that value is exceeded regularly, I would take it, give it a lot of buffer on top, maybe double or triple it, and use that as the timeout value. One thing I also heard, but that was yesterday in another session: there's the Neutron agent heartbeat; set it to a large value, because it triggers just a lot of changes on the OVN side. We had already set it to one hour before we started this whole OVN endeavour, we just copied that setting over, and gladly we did.

If there are no more questions, then I guess we are done. Thanks very much.