Hello everyone, and thanks for joining us for a new episode of OpenInfra Live. OpenInfra Live is the Open Infrastructure Foundation's weekly live show airing every Thursday at 14 UTC. One recurring episode on the show has been the large-scale OpenStack show, organized by the OpenStack Large Scale SIG. We invite operators of large-scale deployments and get them to present how they solve a given operations challenge, and discuss live between themselves their different approaches. For today's episode, we decided to discuss Neutron scaling best practices. I meet with lots of new OpenStack users, and as they grow their deployment, scaling Neutron is a top concern. So for this episode, we invited developers and operators to explore early architectural choices you can make, recommended drivers, and features to avoid, if your ultimate goal is to scale to a very large deployment.

Our guests today are Ibrahim Deraz, Site Reliability Engineer at Exaion, who will drive the discussion; David Comay, Senior Cloud Engineer at Bloomberg, a very long-time OpenStack user; Lajos Katona, Master Developer at Ericsson and current OpenStack Neutron Project Team Lead; Slawek Kaplonski, Principal Software Engineer at Red Hat and previous Neutron PTL; Michal Nasiatka, Senior Technical Lead at StackHPC and Kolla Ansible Project Team Lead; and the man who needs no introduction, Mohammed Naser, CEO at VEXXHOST and OpenStack Technical Committee member. As I mentioned, this is a live show, and we will be saving time at the end of the episode for Q&A, so feel free to drop your questions into the comments section throughout the show and we'll try to answer as many as we can. I'll now pass the mic to Ibrahim, who has plenty of questions for our panel. So take it away, Ibrahim.

Hi, everyone. So as Thierry introduced me, we are Exaion, a new user in the OpenInfra Foundation. We deployed OpenStack in our infrastructure, but we are mainly new users. As we grew, we had some questions about Neutron and how to scale with Neutron, so that is today's episode. As you know, Neutron is a pretty complex subject in OpenStack; it's pretty hard to scale the network in general. We had a lot of questions that we asked the OpenInfra Foundation, and I thank everyone for being here to answer all our questions, and the community as well of course. So the first part we had questions on is the driver part: what are the drivers or plugins you would recommend if we want to reach a large scale? I don't know if anyone wants to begin now, or if we introduce ourselves first.

I think given that Thierry has done a great job giving us all a small intro, maybe we can dive into this first question. I guess I can start. So for me at least, I find that we run the OVS driver out of the box, which is just Open vSwitch, which has, I think, been the de facto for OpenStack deployments across the years. There have been some recent changes, that maybe some other people here can talk about, with more of an OVN-based thing, but for us OVS has been the thing that has reliably and always worked in terms of a driver. And so I would say, you know, that goes back to ML2, and really what I think a lot of people are running is the ML2 plugin and then whatever agent you use in there in combination.

Thanks, Mohammed. So I think I can share my thoughts from the developer point of view.
So basically, and from my experience previously as an operator also, there is no single best-for-everyone answer as to which driver scales the best. Basically each of them will more or less scale well if you configure it properly; each of them will require some fine tuning to do. As Mohammed said, the ML2 OVS backend was the default one in DevStack and is still the one which is used in most of the deployments. Now there is this OVN backend, which is I would say a hot topic currently and is the new default in DevStack and in the upstream CI, and we are putting a lot of effort into developing that, but still ML2 OVS can, I think, scale pretty well. Of course, there is also the Linux Bridge backend, which is, I would say, simpler than ML2 OVS or OVN. It maybe doesn't have feature parity with the OVS backend, so there are things which can't be done with the Linux Bridge backend, but on the other hand it's a much simpler topology on the compute nodes, for example, so some people may be more familiar with it, and I know that there are deployments and there are big players who prefer to use Linux Bridge instead of ML2 OVS because of that simplicity, for example. So yeah, that's more or less what I wanted to share from the developer point of view.

Yeah, perhaps I would like to highlight as well that there is no single best recipe for choosing the best driver. From a maintenance perspective, the drivers that get the most attention now are for sure OVS and OVN, but we have others, like Linux Bridge, which we still try to maintain, and there are users, as Slawek mentioned, who still work with Linux Bridge. And then there are even choices which are perhaps not the best now; for example, Ericsson still uses OpenDaylight in some deployments. That's kind of a matter of taste, or how to say it. So there is no best recipe. You have to consider what you need, how you can propose solutions if you find a bug, or whether you just want to use it and have no development time to work on it and help the community fix things if you find an issue with the driver you choose.

I think I'll jump in here to say that at Bloomberg we used Linux bridging in the past, with Neutron networking back in Mitaka. We have currently moved to Ussuri-based OpenStack, and when we did the jump we switched to using Calico. Calico, which is currently, I guess you'd call it, an out-of-tree driver, an out-of-tree implementation, which is unfortunate and hopefully can be fixed in the future, is a pure L3 solution. So in our case we're not using any sort of underlays or overlays; everything is L3. Our hypervisors are basically talking to a traditional leaf-spine architecture, a pair of ToRs, and we're using BGP to announce the routes for the VMs. We're able to support things like floating IPs as well, but what we found was that moving away from everyone using floating IPs to having overall connectivity has been a real plus, because we've basically eliminated one source of user confusion, but also just strangeness in the networking stacks where, for example, two hypervisors are both responding to ARP requests for the same host, and things like that. So that's another alternative that we've found very helpful.

Perfect, thanks for that part. So during this conversation we heard a lot about OVS and OVN, and the part with Calico from Bloomberg.
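Before going further, a quick illustration of one way to see which of these backends a given deployment is actually running: the set of registered Neutron agents usually gives it away. This is only a rough sketch using the openstacksdk Python library; the cloud name is an assumption, and the agent type strings vary by backend and release.

    import openstack

    # Assumes a clouds.yaml entry named "mycloud" with admin credentials.
    conn = openstack.connect(cloud="mycloud")

    # Group Neutron agents by type. An ML2/OVS cloud typically reports
    # "Open vSwitch agent" plus L3/DHCP/metadata agents, a Linux Bridge cloud
    # reports "Linux bridge agent", and an OVN cloud reports OVN controller
    # agents instead of the classic agent set.
    by_type = {}
    for agent in conn.network.agents():
        by_type.setdefault(agent.agent_type, []).append(agent)

    for agent_type, agents in sorted(by_type.items()):
        alive = sum(1 for a in agents if a.is_alive)
        print(f"{agent_type}: {len(agents)} registered, {alive} alive")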
I want to go more specifically into the choice between these two major options, OVS and OVN. I know that there is not feature parity between both; for example, VPN as a service was missing in OVN and was introduced with Xena. Do you know other areas where, for example, you can say you shouldn't go with OVN if you need this feature, or this feature will be implemented soon, or don't choose, for example, VPN as a service with OVN right now because there are missing features, missing technologies? Do you have any feedback on that, some users or clients that reported feature gaps between OVS and OVN?

Yeah, actually there is a page in the Neutron documentation about the gaps between OVN and OVS, and that's updated quite frequently when something new is merged into OVN which covers a feature in OVS. But actually, if you need special drivers like BGP or VPN, that's always hard. Next week we will have the PTG and we will have discussions around, for example, BGP and BGP VPN coverage for OVN. So yeah, some features are not in OVN actually, but if I understand well, mostly from the Red Hat side, there's quite an effort to have those filled in. But that has to be checked first if you have some special need, like VPN, and which driver is the right one for you.

Yeah, I find that even though OVN is not something that we use on our side, the fact that it's built up to be a network SDN out of the box, which is what it's meant to be, means Neutron becomes more of, the way that I see it with OVN, Neutron is more like OpenStack Nova and OVN is like libvirt and KVM, and all the fancy stuff happens there. Whereas with the current Neutron OVS, it's kind of a mishmash where Neutron does some data plane management and kind of goes back and forth. So I personally like OVN. There are also other advantages; for example, there's an OVN Octavia driver, so you can do load balancing a lot more effectively without having virtual machines running your load balancers as HAProxy instances; it simply uses OVN to build load balancers natively in OpenStack.
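As a rough illustration of that last point, this is roughly what requesting an OVN-provider load balancer looks like through the openstacksdk; the cloud name and IDs here are placeholders, and details such as waiting for ACTIVE between calls should be checked against your Octavia release.

    import time
    import openstack

    conn = openstack.connect(cloud="mycloud")  # assumed clouds.yaml entry

    # Ask Octavia for a load balancer handled by the OVN provider driver
    # instead of amphora VMs.
    lb = conn.load_balancer.create_load_balancer(
        name="web-lb",
        vip_subnet_id="REPLACE-WITH-SUBNET-ID",
        provider="ovn",
    )
    # The LB must settle before child objects are created
    # (similar waits may be needed between the following calls).
    while conn.load_balancer.get_load_balancer(lb.id).provisioning_status != "ACTIVE":
        time.sleep(2)

    # The OVN provider is an L4 balancer; SOURCE_IP_PORT is the algorithm it supports.
    listener = conn.load_balancer.create_listener(
        name="web-listener", protocol="TCP", protocol_port=80,
        load_balancer_id=lb.id)
    pool = conn.load_balancer.create_pool(
        name="web-pool", protocol="TCP", lb_algorithm="SOURCE_IP_PORT",
        listener_id=listener.id)
    conn.load_balancer.create_member(
        pool, address="192.0.2.10", protocol_port=80,
        subnet_id="REPLACE-WITH-SUBNET-ID")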
But it's also important to note that the Octavia OVN driver only supports TCP and UDP load balancing, right? Health checks have been added, I think, recently, so if you need something more advanced than TCP, then it's probably not the driver you should choose. Coming back to gaps in OVN, I think one of the most important ones, at least for us, is that OVN DHCP does not work for bare metal ports. There's a workaround to use the Neutron DHCP agent, which we are using, but we're looking forward to getting rid of that, if that's possible, in the next couple of cycles.

Yeah, so there are some gaps, as was already said, but the list of the gaps is becoming smaller and smaller every cycle, so we are working really hard on that. Besides what was already mentioned, bare metal provisioning and the BGP and VPN things, the biggest gaps which are still on the list are related to quality of service: if you need DSCP marking or QoS for layer 3, like gateways, then that's also something which is not supported in OVN yet. As I said, we are working on it, we are moving forward, closing those gaps every cycle. Now OVN is the default backend used in DevStack, so it's, I would say, a well-tested backend, because it's exercised by all the jobs used in Tempest, Glance, Nova and basically every other project in the upstream community. But on the other hand, I want to add one more thing about how to choose between the OVN and OVS backends: again, it may depend on what you really need, because ML2 OVS is using some well-known technologies like Linux namespaces, iptables, keepalived, dnsmasq, things which all of us are probably very familiar with, and when you need to troubleshoot some problem in production, it's much easier to do it with ML2 OVS than with ML2 OVN, where you are looking into hundreds or thousands of OpenFlow rules and trying to understand them. So OVN is, I would say, something which will be better for the future, but currently I can understand people who will choose OVS just because it's easier to troubleshoot and to understand what's going on underneath. So that's my point.

Okay, perfect. Now I want to ask a question about resilience. So we have chosen our driver, OVS or OVN. As everyone here knows, there is a resilience part: making sure that nothing goes down, making sure customers can access their services. There is a lot of work done in Neutron, for example with VRRP and DVR. What are, for you, on the physical side and also on the software side, the most resilient choices to make sure that if, for example, a node goes down or a service goes down, something else will take over and no services will be impacted?

I can start talking a little bit about this one. So for us, I guess one of the ways that we try to help our customers be as resilient as possible is that we allow customers to directly plug into the public network, and I think that for a lot of scenarios that gives you a lot of advantages, if floating IPs are not a big deal for you. Because most of the time, if you're doing something like ML2 OVS, to get a floating IP you need to have a router, and at that point the router could be using L3 HA so that it's actually active/backup, but there is still a small failover time, and you're kind of hairpinning through your network nodes. So you're bringing all this traffic through systems, and if you're mostly doing it only to get a floating IP, and you don't actually need to reuse this IP
address, you're going through an extra layer for not a very big advantage. So one of the things that we like to tell our customers is that we allow them to plug directly into the public network, which means all of our compute nodes are also directly connected to the public network. By doing that, we're eliminating the entire SDN path and having the user be connected directly to the switches that go to the gateway, and it's all physical hardware, so it's no different from having a physical system that's connected to a switch at that point. We find that has a tremendous improvement, because if there's any outage of the networking control plane, or the routers, or one of the controllers, you're completely unaffected by it; as long as the compute node is still running and operating, you have no issues. So that's the combination we find works: use L3 HA in Neutron to try to make sure your routers are highly available if you have to use that, otherwise just plug directly into the public network, so that we know that that path is pretty stable and not going through RabbitMQ or the database or upgrades or anything like that. Anyone else want to comment on that?

I think I have seen some posts that, with OVN for example, DVR and VRRP had some trouble at some point when mixed together, or something like this. I don't know if one of the Neutron PTLs has some feedback on this: is it possible to mix all these resilience choices together, is it recommended or not? Perhaps you have the answer?

Yeah, I just wanted to mention that, for example, Ericsson customers usually skip the whole virtual layer 3 stuff, so they have their own network fabric, and OpenStack just gives the layer 2 connectivity, and everything else over that is handled by some other routing and gateway fabric which is out of OpenStack's control, or they have some special drivers for that, I don't know exactly. But that's another option if you need it, because in that case that part is not affected by upgrades or issues with the network node or with the controllers, as Mohammed said.

Getting back to your question about OVN: in the OVN case, layer 3 traffic is by default distributed, so you have kind of DVR enabled by default, and traffic which is using floating IPs will go to the external network directly from your compute nodes, while SNAT traffic is still centralized, it goes through one of the nodes. So this is similar to what ML2 OVS with DVR is doing. And with ML2 OVN, you asked about VRRP traffic, but Neutron is not using VRRP at all when ML2 OVN is used, so maybe I'm not sure exactly what issue you were asking about. I think I mixed up OVS and OVN, I think it was on OVS.

I was just going to say, with our Calico implementation, for resiliency, as I mentioned earlier, each of our hypervisors effectively has a pair of ToRs that it talks to in the rack, and so we effectively implement ECMP that way, so that if one of the ToRs, for example, fails, we have an alternate path. Those ToRs themselves are connected to a set of spine routers, and then even above that we have a set of super-spine routers, and so on. So that tends to be how we have resiliency at the lower level, towards the VM side.

Okay, thank you very much for that. Another key part of the resiliency of our network infrastructure is the features to be enabled or not. We mainly want to use VPN as a
service, for example, but from our experience we don't know yet how it affects the network at a really large scale. We don't know: if, for example, we have 50 users for the moment it will go okay, but if we have a thousand users, we don't know how it will react. Do you have some feedback on features to enable or stay away from when we are trying to achieve large scale on an OpenStack deployment?

Yeah, so basically the short answer is that if you want to go to large scale, you should enable as few features as needed. Basically, the more features you enable, the more trouble you will have, I would say, so please enable what you really need. What I'm talking about is, for example, let's talk about the L3 agent: if you enable, I don't know, floating IP port forwarding, if you enable some QoS things and some other extensions and plugins there, everything will work fine until you, for example, restart your agent and it has to reconfigure everything from scratch for the 1000 routers which are on this node. It will be okay, but it will take more time when you have more extensions enabled, because each of them will need to configure some iptables rules, some TC rules for QoS, or something else. So basically, the less you enable, the easier it will be to get to large scale, I would say. And in that situation it would probably be good to mention that it will also put a lot of strain on the Neutron server API and the message bus, right? Yeah, that's also true.

Thank you for that. So I think we've answered all the questions on the architectural part, the software side, but we didn't talk a lot about physical routing, except on the Bloomberg side: they are using a spine-leaf architecture with super spines on top to ensure the most resiliency. On the redundancy side, on the physical part, I know that there are a lot of use cases, but for new users, if they want to achieve some kind of resiliency on the physical part, what would be the easy road to achieve that in hardware?

I can start a little bit, yeah. I think for us, the way that we build out any kind of cloud is that we use a spine-leaf architecture, so really making sure that you are building your network to be layer 3 out of the box helps you a lot down the line. And the nice thing these days is that, back then, switches that could do layer 3 were very expensive, very inaccessible things; now I would say most switches that you are going to get are going to be able to do layer 3, run BGP, run most routing services. So generally we have spines and leaves, where we have two leaves in every rack connected to both spines, so that means every single rack has at least four connections going to it from two different spines, and these are 100 gig links, so every rack has around 400 gigs a second coming in, and it's all fully redundant. Those switches are set up in an MLAG or CLAG or whatever your vendor calls aggregation of switches, and by doing that every system gets a link from each one of them, so your entire network stack at that point is all pretty reliable. Now, where we go a little bit further: a lot of people, when they deploy provider networks, use VLANs for provider networks. What we've done is we use flat networks, but we actually run BGP on every compute node, and we have a VXLAN for our provider networks, and a big part of that is because all of our routing infrastructure is also all Linux-based, so
everything is using VXLAN. So we don't have any problems with having one network with, you know, a hundred /22s; that's not a problem because there's no ARP that goes out anyways, all of the MAC addresses are being published by BGP. What we have is our compute nodes all talk to the top-of-rack switches, and every rack has its own BGP AS number, so it just naturally does eBGP to all the other racks. It's super effective, there's just no ARP traffic, and it's also really reliable and nice, because if we want to spin up a compute node for a customer in, like, another room in our facilities, we don't have to go and run cables or whatever: as long as it joins our layer 3 in some way and has reachability, it can get the VXLAN network published and it can start talking over that network. That last part is where I feel like you've got to do it right, or otherwise you might have some weird stuff come up, but I'd say the layer 3 plus MLAG approach is kind of the way to go, in my opinion.

Perfect, thanks for the answer. Perhaps Lajos, you? Actually, I have just a question: don't you use BGP VPN or neutron-dynamic-routing, I think, for this, or something not managed by Neutron? So it's not managed: we use BGP EVPN to propagate all this stuff, so it's not managed by Neutron. What we do is we present the physical interface and then we attach that physical interface to OVS, so as far as OVS knows, it's just a bridge. There is a little tricky thing that we have to do, because FRR (Free Range Routing) cannot read the MAC addresses and publish them off of OVS, so we have to create a Linux bridge, set up a virtual ethernet pair, and plug one end into OVS and one into the original bridge, and we're hoping that with time FRR can natively learn MAC addresses from OVS and publish those out. Thanks. Thanks.
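For readers who want to picture that workaround, here is a minimal sketch of the plumbing just described: a veth pair joining a Linux bridge (which FRR can learn MACs from) to the OVS bridge. Bridge and interface names are made up, and a real deployment would do this through its deployment tooling rather than an ad hoc script.

    import subprocess

    def run(*cmd):
        """Run a command and fail loudly; purely illustrative."""
        subprocess.run(cmd, check=True)

    # Hypothetical names: br-evpn is the Linux bridge FRR watches,
    # br-ex is the OVS bridge Neutron uses for the provider network.
    run("ip", "link", "add", "veth-frr", "type", "veth", "peer", "name", "veth-ovs")
    run("ip", "link", "set", "veth-frr", "master", "br-evpn")   # Linux bridge side
    run("ovs-vsctl", "add-port", "br-ex", "veth-ovs")           # OVS side
    run("ip", "link", "set", "veth-frr", "up")
    run("ip", "link", "set", "veth-ovs", "up")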
So now that you have answered all our architectural questions: performance-wise, we had some questions. At one point we're using SDN, but perhaps sometimes SDN can be helped with some hardware, a hardware card; for example, we wanted to offload some of the decapsulation and encapsulation of packets to SmartNICs. So have you already installed SmartNICs and used hardware offloading with OVS or OVN? Was it helpful, or did it not help at all and you just stayed with the software-defined path?

So we've been using Mellanox ConnectX cards, mainly ConnectX-5 and 6. The 6 are nicer because they can also offload connection tracking, so security groups and all that stuff work fine in hardware. We've been able to reach near wire-rate Ethernet speeds, and we're talking like 50 gigs, 100 gigs and so on, right, so that's nice. That is mainly working for us on ML2 OVS and OVN installations. If we don't have ConnectX-6, which offloads connection tracking, then we need to disable port security or use the stateless firewall that has been introduced in the latest OVN versions, and so on. So there are a lot of considerations as to whether hardware offloading will really help you do the job or not.

I actually have a question. We did dabble a bit with ConnectX-5 and 6, and I guess a question that's maybe related to what Ibrahim is also mentioning is: how much tuning did you have to do on the compute side? Because it's one thing that it has the ability to offload 25 gigs, but in our experience there was a lot of changing, maybe, I don't know, IO threads and all sorts of other tuning that you have to do to the virtual machine so that it can actually push that sort of traffic, or whether it has enough vCPUs. What was your experience like with that? I think, perhaps, Michal, did you hear that last bit that I mentioned? Yeah, it was a bit choppy, probably my internet. I'll quickly repeat: I was saying, I'm sure you had to do a lot of virtualization tuning for Nova, so that libvirt launches VMs with some specific options or whatever. How much time did you spend on that, and is there anything that you would point at and say these are some important things to give you a big boost, even if you have hardware offloading?

Yeah, so we've been doing some of that stuff, also contributing some patches to Neutron or Nova, at least some of my team members have been doing that. It's a very long topic; as I said, it's not really trivial to set that up. We've spent a considerable amount of time on doing that, and there are a lot of blog posts on the StackHPC website pointing to what we've done, so I don't want to talk for the next 15 minutes, but yes, as I said, it's really not trivial. We've spent very long days, sometimes weeks, to get that working as we'd like it to work. In most cases we've been using ASAP2 and the VF-LAG functionality for SR-IOV, so that also adds some resiliency: we're using a bond for the SR-IOV networking. But yes, it's not very easy, and in most cases it also requires some interaction with the vendor.

So I think the good takeaway from this is that hardware offloading is great, you can hit really high speeds, but it's not a buy it, install it, change the config, and now you're getting 50 gigabits a second for your VMs. Yes, unfortunately it's not that easy, and we would like it to be that easy; it would be very nice.

Thanks, Michal, for that feedback, I think it's really interesting. We will ask some questions internally, I think, for us to install this hardware offloading; I think it would be really interesting. Another question we had on the Exaion side was how to size our network nodes in order to not hit some limits. For the moment we don't have experience on that side: how many nodes should we have, how many CPU cores should we have on them, how much RAM? Is CPU an important factor if you want to
only do SDN for the encapsulation? Do you have a method, for all of us to know, for how many nodes you need for a specific deployment, or do you just monitor everything, and when you see that it begins to hit the limits, you add another node?

So maybe I will start answering this question. First of all, we have two factors here, I would say: first is scaling of the control plane, because of the number of resources in Neutron, and the second thing is scaling because of the data plane, how much traffic you want to send, and so on. I would like to talk first about this control plane and Neutron resources thing. Basically, Neutron, I would say, scales pretty well horizontally, but it won't scale vertically, speaking about ML2 OVS, right? The L3 agent and DHCP agent, the agents, are basically single-threaded applications, and no matter if you have 32 cores or 1 or 2 cores in your system, they will configure everything more or less one by one, and it will take the same amount of time. The second thing is that when everything is up and running fine, then it's running fine: you can have, for example, 100 routers or 1000 networks on a DHCP agent and all will be fine. The problem happens when you need to restart this agent, or when you have some problem with RPC messages and some timeouts in RPC messages in the agent, because such situations will trigger a full sync of everything, and then the agent will have to go through everything and configure it all again from scratch, and this may actually take a long time. Because of that it's better to have, for example, more nodes with DHCP agents or L3 agents and keep a smaller number of routers or networks on each node, rather than have a huge network node which hosts everything for you, because you will have easier maintenance later. So my advice is to scale horizontally. The same goes for API workers: you can have as many API workers or RPC workers as you want, and this should work pretty well. So scaling that way is, I think, better in terms of the data plane, sorry, control plane, and the number of resources.

Okay, perfect, thank you. So for the performance side I think we are okay, I don't have any more questions, but now I have more questions about how you run Neutron in production: the monitoring part, the common downsides and failures. So the main question that we have is, like I said, what are the common downsides and failures of Neutron, what have you encountered, what critical metrics did you monitor to see when everything was going down, and how did you react to that kind of failure or incident?

I can start this off. So there are a couple of things that we actually watch for when we are monitoring clouds, and especially for how to know if Neutron is doing the right thing or not. I'll start with what comes to mind first. One of the things that is actually pretty interesting is monitoring the IP availability of your subnets: you don't want to find out on the day it happens that your IPs have completely run out, especially if you're working with public IP space, for example. So monitor things, just making sure that you still have enough IP addressing in your cloud. Another thing that we like to monitor is checking the binding status of Neutron ports: sometimes, if problems start happening, you'll see a lot of Neutron ports going into the binding_failed status; usually that's a telltale that there's something not okay happening that will require a bit more investigating. As well, a lot of times, if the OVS agent is up, then you
know things are probably OK, or sorry, whatever agent you're using, if it reports up, that usually means things are working, but that doesn't necessarily tell the whole story. Sometimes an agent can report that it's up but it's having problems communicating, or maybe some race condition occurred and a bunch of iptables rules came in, and now Neutron is just looping trying to save the iptables rules but failing to save them. So being really mindful and looking for error- or trace-level messages, sorry, exceptions, in the logs is super important and can really uncover a lot of things. And I think this goes for a lot of OpenStack projects: the health of your RabbitMQ cluster is super critical to the health of everything. When things are not good in RabbitMQ, not only can they cause problems at the time, but they can start to introduce corner and edge cases that are just not considered, right? So what could happen is, if your Rabbit is acting up, you get a bunch of ports that might be half provisioned because the messages are not being fully delivered, so you end up with things that are not clean, and then the recovery becomes even more difficult. If anything, if you're trying to get out of a failure with Neutron or anything, I would be more in favor of completely shutting down RabbitMQ, if I have suspicions that it's causing problems, than trying small little things here and there, because in my experience the longer the outage goes, the more edge cases you're starting to enter, and you end up with weird stuff in the cloud. On the smaller things: also, if you use L3 HA, OpenStack or Neutron has a great API to tell you which router is active and which one is backup. Sometimes you'll see issues where all of them are backup, sometimes you'll see issues where more than one is active, so those could be telltales of, okay, there's probably a networking issue that we need to look into. Those are some of the things, other than what I said, that are a bit more specific and in the weeds. Obviously you've also got the normal: make sure services are running, make sure agents are up, make sure the API is responding, that sort of thing.

I'll just mention that I think we see the same sorts of issues, and RabbitMQ is very essential to everything, but for others, for example, Calico uses etcd, basically storing both the policy, the firewall policy it actually instantiates on the compute nodes, as well as information about networks and subnets. So if etcd is not healthy, that's an issue with respect to Calico; you trade off one thing for another. So Rabbit doesn't necessarily affect Neutron as much in our implementation, but we have another thing that is another failure point. The other thing we've seen in the past is cases where we run dnsmasq on our compute nodes, provisioned through Calico's DHCP agent, and sometimes dnsmasq didn't get updated for new tap interfaces that had been instantiated, so you'd get VMs that get created but basically get no DHCP responses, and so on. So we've had to take a look at things like: does the running dnsmasq process know about all the tap interfaces it's supposed to have? There have been some bug fixes recently in Calico to address some of those cases where things get missed at scale: there are so many VMs being created, so many interfaces, so many ports being created, that a missed message can basically mean that you get one VM out of a hundred that is there but not reachable,
and so on. Internally we have some internal metering, but we also use Telegraf and InfluxDB, Telegraf to send a lot of data back, and we do monitoring that way.

If I may add to what Mohammed said about shutting everything down when everything goes crazy and then starting it up: please, if you have a large scale, and for example a lot of routers, a lot of networks in DHCP, don't restart everything at once. Do it one by one, or in batches, because otherwise you will have something like a denial of service on the Neutron server, because everything will ask for networks or routers and all the other data, and then the Neutron server will basically do a kind of DoS attack on the DB, and everything will go crazy, and it will be like a snowball effect and you will have problems all the time. So please do it in small batches, or one by one, and then it should be easier to recover in such a situation. Yeah, and I would just quickly add: patience, I feel, is a really important thing. Sometimes you just have to let things take their time, start up, make sure it's stable, then move on. In an outage people are like, oh, let's go, but then you might cause more interesting problems.

Okay, thank you for all the answers, really interesting. I have just a small question, especially for Mohammed and David: do you have a dedicated network team for your production, on your side, that's managing everything and making sure that Neutron is running, or do you have one overall team that's managing everything? In our case we don't have a dedicated team; I think the responsibility is split. There's an operations team that is focused on the actual network, the switches, the routers and so on; there's our team, which is more looking at the cloud pieces, and we do look at, for example, the status of BIRD, the BGP implementation we're running on the hypervisors that talks to the pair of ToRs; and then we also have a network policy firewall team that is actually pushing policy onto the compute nodes, effectively. So that division is split between those three. If we have an issue, often we have to figure out, okay, is this a physical issue, is it a spine that's down that's causing some sort of issue, or does it seem to be a policy issue, in which case we go to the firewall team and try to find out, you know, was there a bad policy push, for example. And on our side, no, it's really the same team that figures it all out, and that's mainly because, like I said, our networking stack is all Linux-based, so it's a lot of the same tooling that we're interacting with, whether it's compute nodes, routers or switches, they all kind of run the same thing.

Perfect, thank you. I think I've finished with all my questions; thank you all for answering. I think Thierry can take the microphone.
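As a side note before the Q&A, here is a small sketch that pulls together a few of the health checks mentioned in the monitoring discussion above (subnet IP availability, ports stuck in binding_failed, agent liveness, L3 HA router state), using the openstacksdk Python library. The cloud name is an assumption, the thresholds are arbitrary, and exact attribute names may vary between SDK releases.

    import openstack

    conn = openstack.connect(cloud="mycloud")  # assumed admin clouds.yaml entry

    # 1. Network IP availability: warn before address space runs out.
    for avail in conn.network.network_ip_availabilities():
        if avail.total_ips and avail.used_ips / avail.total_ips > 0.9:
            print(f"network {avail.network_name} is over 90% allocated")

    # 2. Ports stuck in binding_failed: usually a sign something deeper is wrong.
    for port in conn.network.ports():
        if port.binding_vif_type == "binding_failed":
            print(f"port {port.id} on host {port.binding_host_id} failed to bind")

    # 3. Agent liveness: "alive" does not tell the whole story, but dead agents do.
    for agent in conn.network.agents():
        if not agent.is_alive:
            print(f"{agent.agent_type} on {agent.host} is down")

    # 4. L3 HA state: every HA router should have exactly one active instance.
    #    (raw REST call; the l3-agents listing reports ha_state for HA routers)
    for router in conn.network.routers():
        if not router.is_ha:
            continue
        agents = conn.network.get(f"/routers/{router.id}/l3-agents").json()["agents"]
        states = [a.get("ha_state") for a in agents]
        if states.count("active") != 1:
            print(f"router {router.id} HA states look odd: {states}")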
Yes, we had plenty of good questions in the chat. We'll present two of them and see if you have good answers. The first one is from Bassem on the YouTube chat, and it's: how is your experience with centralized routing in Neutron, where all the traffic goes to the Neutron server, for a large-scale OpenStack environment? Who wants to take it? Anyone using centralized routing?

Yeah, I mean, we run our L3 virtual routers on a bunch of systems and we don't use DVR. One of the big reasons why we don't use DVR, at least back then, was the fact that it consumes a public IP address for every compute node, and as you get to large scale that adds up to a lot of IP space being used for potentially one single system or one VM that's running somewhere, so that just doesn't work out for us in that sort of scenario. So we do centralized routing, and we don't have too many problems, like I said, because we try to push customers to be directly connected, so whatever leftover traffic crosses the centralized routing works pretty well. The only caveat, usually, is startup time: the Neutron L3 agents can take a little while as they go and rewire the routers, and, you know, that can take a little bit of time, which goes back to the comment about patience.

I can just say that recently, in one of our customers' deployments, we noticed an issue when there were hundreds of HA routers, so there was a lot of HA VRRP traffic sent between them, and when one of the nodes went down, keepalived started sending VRRP packets on other nodes, and there were problems: basically OVS was consuming a lot of CPU and everything was basically going crazy, and the only way to recover was to shut down and start the network nodes one by one again. But that's again something I already said: if you have large scale, you should have more network nodes and scale horizontally, rather than having so many routers hosted on one node; it should be better.

Thanks for your answers. We have another question, unless someone wants to add something to this, but I can't see it if you raise your hand. The next question is from Jo from Korea, and the question is very specific: the time to restart the Open vSwitch agent can become longer with DVR with hundreds of hypervisors; his case is that the agent is looking up a lot of ports from the database while restarting. Is there a way to solve this?

So maybe I will answer, because it seems like a more code-oriented question. The way it works is that it's not only because of DVR: every time the Open vSwitch agent is restarted, it will ask the Neutron server for details about each port. What is different with DVR, and what can be the root cause here, is that with DVR you have to enable L2 population, and basically every time ports are updated (and they are updated, because when the Open vSwitch agent is restarted and asks for details about a port, the state of the port is updated to down, or active, or built), then because of that port update the Neutron server will notify all other agents that the FDB entry for this port has changed. So because of this L2 population there is a lot of traffic and a lot of RPC messages sent to all other OVS agents on other compute nodes which have ports from the same Neutron network, and probably because of that this restart takes a much longer time in the case of DVR. If there is any way to solve this, I don't think so, to be honest; for now I don't remember any work upstream to improve that. We did some improvements with FDB add and FDB remove and this L2 population mechanism, last cycle or two cycles ago, or something like that, so maybe if you update your Neutron, maybe it will be better; I don't know exactly what version is used there, but there is nothing else we can do with that, as I remember. We had some bug fixes improving how the OVS agent requests these things and refreshes this information from the Neutron server; that was not specific to DVR, it was just
how these RPC things are working, but there were improvements in the last cycle, or last cycles perhaps, so it's worth following Neutron and OpenStack and having as recent a version as possible.

Okay, well, thanks everyone for answering all those questions. I think it's time for us to close this episode. I think it was really a great way of demystifying Neutron choices, and I hope the audience learned a lot from this discussion; I certainly learned a lot, so that's great. A few other news items coming up in the next weeks and in the near future. First, if you're an open infrastructure user, tomorrow is the deadline for nominating your organization for the 2021 Superuser Awards, so you can see the URL on this slide if you want to nominate your organization or someone else; please do so, the deadline is tomorrow, so not much time left. Next week we will not have an OpenInfra Live episode; we will have our Project Teams Gathering event happening all week. The Project Teams Gathering is where our project teams get together to discuss and organize the work of the upcoming months of development, so it's a very key moment in our development cycles for OpenStack, but also for all of the other open infrastructure projects. This one is happening virtually, so anyone can join for free, and there is still time for you to register to attend that event at openstack.org. This event is really made possible with the support of the Open Infrastructure Foundation member companies, so thanks to them for supporting the development of open infrastructure, and if you're interested in joining the foundation, please head to openinfra.dev. Our OpenInfra Live show will be back the week after, but also note that we'll have a very special two-day episode with keynotes next month, on November 17 and November 18. This will replace the traditional keynotes we have; it's free and you really don't want to miss it, so don't forget to register for this very special event, registration is open at openinfra.live. Special thank you to Red Hat, our headline sponsor, and to HiVolve, InMotion Hosting, Courage, Component Soft, and Cloud&Heat, our supporting sponsors. And finally, don't forget that this show is for all of the open infrastructure community, so if you have an idea for a future episode, we want to hear from you: you can submit your ideas at ideas.openinfra.live. So thank you all for joining, see you again in two weeks on Thursday at 14 UTC, and thanks again to all of our speakers who joined us today, and special thanks to Ibrahim Deraz for leading this session. See you soon on OpenInfra Live.