Hey, hey everybody, how y'all doing? Cool, that sounds like you guys are doing great. My name is Kevin. I work at Cisco, but please don't hold that against me. Just kidding. And seriously, thank you all for being here. You helped prove my mom wrong: she said no one would show up. So take that, Mom.

I'm here to talk about the Neutron L3 agent and some of the things that I have done, at this employer, previous employers, elsewhere, whatever, to implement some HA for it.

The first thing I want to touch on, and I want to make absolutely clear, is that I don't think there's any one right way to do this. I'm sure that y'all have different ways that you may or may not be doing it, but mostly I just keep hearing about Pacemaker. That's a fine implementation in some cases, but I don't feel like it's the only way to do it. So I wanted to offer up what we're doing, and maybe, if we have some time at the end, talk about what other people are doing as well.

At the end of the day the goal is simple: you have to move these L3 resources, these IP addresses, to new L2 resources as quickly and seamlessly as possible. That's more difficult than it might seem, but it's a really, really important problem to solve.

Start at layer 3. This is where internet happens, and that's sometimes a good thing, sometimes not. In my beautiful drawing here you see a typical setup: you might have multiple L3 agents running, all of them hosting different routers, with craziness going on. You can see router one over there; some tenant apparently created a router and just wanted to use up some of their quota without actually using it. This is what a normal setup looks like. So when you have an L3 agent fail, you have these routers that are left over, these guys here, and they've got nothing. They've got nowhere to go. So what do you end up doing with these things? That's really the problem we're trying to solve: where do you put them?

Layer 2 is what we actually have to deal with, and as you can see, the ARPing is definitely the hardest part.
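To make that concrete: when a router's IPs land on a new host, something has to announce the new IP-to-MAC pairing to the upstream switch, and that usually means gratuitous ARP. Here's a minimal sketch of that step, assuming the iputils arping utility is installed and using Neutron's qrouter-<uuid> namespace naming; the real L3 agent does effectively this for each IP it brings up:

```python
# Sketch: broadcast "this IP now lives at this MAC" from inside a router
# namespace. Namespace and device names are illustrative only.
import subprocess

def send_gratuitous_arp(namespace, device, ip, count=3):
    cmd = ["ip", "netns", "exec", namespace,
           "arping", "-A",       # gratuitous ARP reply mode
           "-I", device,         # e.g. the router's external qg- port
           "-c", str(count),     # repeat a few times for good measure
           ip]
    subprocess.check_call(cmd)

# e.g. send_gratuitous_arp("qrouter-<uuid>", "qg-abcd1234", "203.0.113.10")
```

Until the switch hears that announcement, traffic for the IP keeps flowing toward the old, dead host, which is why this step dominates the failover time.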
By and large, one L3 resource can only be tied to a single L2 resource at a time. You have one IP and one MAC, and that's the end of it; if you want to change that pairing, you have to tell the switch about it. There are technologies that try to work around this, HSRP, VRRP, CARP, basically various iterations on ARP, and we, the OpenStack community, are even working on implementing some VRRP-like functionality in Juno; there's a blueprint for that. But in the short term there's nothing integrated in OpenStack today that gives you a seamless layer 3 over layer 2 failover. So when those routers are orphaned, they don't have anywhere to go, and there's no way to quickly move them anywhere else.

Again, as I touched on, Pacemaker is the default everybody goes to, and the docs site automatically takes you there. Some problems I had with Pacemaker, and full disclosure here, the last time I tried to implement it was with nova-network still; when we switched to Quantum, now Neutron, I didn't even try, because of these problems. False positives were a huge problem. Maybe I just suck at tuning Pacemaker, I don't know, but it was causing more downtime than actual outages: a false positive would happen, the L3 agent would migrate to the failover, there'd be an outage, and people would ask what was going on. Sometimes you'd end up with split brain, which I'll talk about more later, where routers are on both agents and all kinds of craziness is happening. That was one of the big problems we had. And because of the split-brain possibilities, we didn't really want to implement a STONITH setup: we were having enough false positives that it would just have caused even more trouble. Again, maybe I suck at Pacemaker, I don't know, but that was the problem I had.

Another issue is that it basically assumes control of the L3 agent's start/stop function. By default, at least the way it's documented, you want Pacemaker to actually start your L3 agents, and you don't want your init scripts doing it anymore. So you run into issues where you can install the packages, but then you have to remove them from rc.d and put them under Pacemaker's control. It's totally doable if you're doing it through Puppet or Chef or Ansible or whatever, but it's an extra step, and it's like, why?
Then there's limited horizontal scale. I found it was more difficult to run a bunch of active L3 agents, and I'll touch on this a bit more, but mostly because it required entire service starts and stops, and it usually requires a mirrored pair of hosts: you've got two pieces of hardware that are just sitting there basically doing nothing. I look at it as a RAID 1 of layer 3 functionality. The active/passive model requires more hardware, and it really works at a per-agent level. You've got N routers sitting on an agent, but Pacemaker only knows about the agent; it doesn't really know about the routers on the agent, so it doesn't give you a very fine level of granularity. As I said, I liken it to RAID 1: you have to have two pieces of hardware sitting there, and you may scale it out horizontally, but if you've got three L3 agent nodes, you've really got six. And that may be fine, that may be what you want to do, and it certainly solves some issues like capacity, but still.

Looking back at our diagram, basically what ends up happening is you lose your L3 agent, and Pacemaker just fires up another one, moves those routers to another one, or clones the resources over. So you end up with the exact same layout: this agent goes away and comes right back on a different piece of hardware.

After messing with this for a while, and like I said, nova-network was where we did it last, we didn't really want to deal with that again, so we thought about a better way to approach this problem. What we did was create something we call the neutron HA tool, and you can see it's part of the Stackforge cookbooks currently. It's out there, you can grab it, it's free and fun. A few things about it: it's API-driven, meaning it uses native API calls to perform all of its functions, so anything the Neutron client supports, it can do. You can run it externally from your infrastructure, across sites, whatever. The way it works is it runs, checks the agent status from the API, and if the API says an agent is down, it does some jitter checking; if it determines the agent really is down, it makes API calls that say: remove the routers from this agent, and add them to a different agent, or reschedule them. The scheduler then splays them out to whatever L3 agents are still available, and everything is ideally hunky-dory. It doesn't always end up that way, but it's also easily extendable.
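For a feel of what API-driven means here, a minimal sketch of the kind of python-neutronclient calls involved. This is not the tool's actual code: the credentials are placeholders, and the round-robin target selection is just for illustration.

```python
# Sketch: find dead L3 agents and reschedule their routers onto live ones.
from neutronclient.v2_0 import client

neutron = client.Client(username="admin", password="secret",
                        tenant_name="admin",
                        auth_url="http://keystone:5000/v2.0")  # placeholders

agents = neutron.list_agents(agent_type="L3 agent")["agents"]
dead = [a for a in agents if not a["alive"]]
live = [a for a in agents if a["alive"]]

for agent in dead:
    routers = neutron.list_routers_on_l3_agent(agent["id"])["routers"]
    for i, router in enumerate(routers):
        # Pull the router off the dead agent, then pin it to a live one.
        neutron.remove_router_from_l3_agent(agent["id"], router["id"])
        target = live[i % len(live)]
        neutron.add_router_to_l3_agent(target["id"],
                                       {"router_id": router["id"]})
```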
It's written in Python and just uses the standard OpenStack libraries. But most importantly, it works at a per-resource level, by which I mean it uses the API to get a list of all of the routers living on an L3 agent and then actually moves them one router at a time. So you get the granularity of dealing with things on a per-router basis instead of a per-agent basis.

Going back to this beautiful diagram: if we lose that L3 agent in the middle, this is what ends up happening after the HA tool determines it's gone. It just reschedules the routers onto the L3 agents that are still alive. Going back to the RAID analogy, I'd say this is more like RAID 5 or RAID 6: you have this parity lying around over here and over here, and if you lose a disk you say, well, that disk is gone, I'll replace it when I can; in the meantime I'll move my data to the disks that are still valid and good and happy.

One of the nice things is that only the routers and IPs on the affected L3 agents are impacted. In my experience, when you're stopping and starting whole L3 agents, you always run the risk of something crazy happening. In this case it pulls the routers off and splays them back out onto other agents, so you're dealing with things on a per-router basis. With the Pacemaker approach, when you restart an L3 agent it has to take all of the routers on it, and it takes a while for them to come back up. That's still somewhat of a problem with this too, but because it splays them out across your available L3 agents, you parallelize that workload and the time the recovery takes. If you have a hundred routers on a failed agent, splaying them out across the agents you have lying around takes far less time than bringing them all back up in one place.

So again, the recovery time depends mostly on the number of routers and the number of IPs on each router. That's because the migration itself happens pretty quickly, but the routers have to re-ARP every IP to the upstream switch. It's back to that layer 2 problem: because we don't have VRRP or something like that, we've got to tell the switch, hey everybody, we're over here now. The metadata proxies migrate along with the routers, so the whole thing basically just moves along together.

So what's the catch? There are many of them.
It is certainly not the best solution that has ever been invented by man, and it's not seamless: it doesn't use some manner of MAC cloning or VIPs or anything like that. So there is downtime. The agent disappears, the router loses connectivity, and it takes some amount of time before it comes back up somewhere else. In Grizzly, prior to Havana, this was actually a serial process, but now the ARPs happen in parallel; it just does them as fast as it can. I remember one of the things I most hated doing was restarting an L3 agent, because you'd be watching the log and it's ARPing, ARPing, ARPing, and it just takes forever, and the more floating IPs that were up there and running, the longer it took. Now it happens in parallel. In my testing I haven't done more than maybe a thousand floating IPs migrating from one agent to another, but it usually takes 16 to 90 seconds once the migration starts happening.

The various as-a-service offerings further complicate things: the tool only accounts for L3-agent-controlled services right now. Load-balancer-as-a-service is another really interesting example that comes to mind, because I was looking at it recently: there's currently no API call to remove a load balancer from one agent and move it to another. That sort of exemplifies the constraint: you can only do the stuff the API can do, which certainly gives you some issues.

There's also no coordination between HA tools. So how do you HA the HA? You can have one tool running here and another one over there, they both detect an agent down, and you're going to have lots of race conditions. That's something I'm working on at the moment, and if anybody has any ideas, I'd love to hear them.

It's currently not daemonized; the way we run it, it just runs from cron. That means, thinking worst-case, you have to add 60 seconds to your total recovery time if you have it running every minute. Then you have the jitter protection, which adds additional recovery time. The way the jitter works now is you give it a minimum and a maximum time, it picks a random value between the two, waits, and then re-checks that the agent is actually still down before it does the migrations. If the agent came back up, it says, okay, cool, it's fine; if it didn't, it triggers the migration. But if your jitter protection and your 60-second cron cadence overlap, there's the possibility it can run more than once at a time. That hasn't been a problem in the past: the first run kicks off and starts looking for agents to migrate, the second kicks off behind it, and they tend to behave pretty well together. But it's certainly something that needs to be called out, because race conditions abound in this particular scenario.
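The jitter-protection logic is simple enough to sketch. migrate_routers() here is a hypothetical stand-in for the rescheduling calls shown earlier, not a function from the actual tool:

```python
# Sketch: wait a random interval, then re-check before migrating, so a
# transient blip doesn't trigger a full (and disruptive) rescheduling.
import random
import time

JITTER_MIN, JITTER_MAX = 10, 30   # seconds; tune to your environment

def handle_suspect_agent(neutron, agent_id):
    time.sleep(random.uniform(JITTER_MIN, JITTER_MAX))
    agent = neutron.show_agent(agent_id)["agent"]
    if agent["alive"]:                    # it flapped and came back: leave it
        return
    migrate_routers(neutron, agent_id)    # really down: reschedule its routers
```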
Finally, and this is sort of a problem with Pacemaker as well, there's no mechanism to ensure that the resources actually come back up. You may migrate a router, and if the router stays on the dead L3 agent, the next time the tool runs it's going to try to move it again if it's still there. But if it actually gets removed from the dead agent and then never comes up properly on the new one, you're not really going to know about it until the tenant says, hey, how come I can't get to any of my IPs, or my VMs won't route, or something like that. Again, that's a problem you have with Pacemaker too, but it's something you want to call out.

Cool. So, DHCP. It's sort of a layer 3 thing, so I figured I'd talk about it briefly. It's not really included in this, because DHCP agents can already be run active/active. You can run a ton of them: you basically just say how many you want in the config file, and it'll spin up two or three or five per network. One thing to keep in mind, though, is that each agent requires an IP in the tenant subnet. So if a tenant creates, I don't know, a /30 or something like that, running ten DHCP agents is going to take up a good lot, or all, of their space. CIDR math in my head is not my forte, but a /30 only has four addresses, two of them usable, so it doesn't take many agents to eat a small subnet.

Let's see here. Okay, so DHCP is broadcast-based, which again is why running them all works: the request just broadcasts out, everything is happy, and all the agents have the same lease files. I've been working on this a little bit with the guys upstream; the first agent to reply binds to the VM, and they can all resolve. In fact, there's a patch that just got submitted so that resolv.conf, unless the tenant specifies otherwise, will list all of the running DHCP agents as name servers, and they can all resolve all of the IPs internally and recurse upstream. So any DHCP agent can do that, and by default they'll each hand out a list of every agent as an available resolver.

One thing the HA tool does have currently is an option to replicate DHCP across all of the agents. So if you change the value, or you just want to spin up a lot of DHCP agents without going through and dealing with every network, you can run the tool with its replicate-DHCP option. It also has a dry-run option and some other things, so it'll tell you what it's going to do first.
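For reference, the "how many do you want" knob I mean is, as best I recall, a Neutron server option rather than something on the agents themselves:

```ini
# neutron.conf on the Neutron server (option name as I recall it; default 1).
# Each tenant network gets this many DHCP agents scheduled to it.
[DEFAULT]
dhcp_agents_per_network = 3
```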
All right, so moving forward. That's the end of what we did with the HA tool; it was one implementation we looked at, as opposed to trying to deal with Pacemaker. So what is Neutron doing going forward? Well, we're implementing VRRP-like functionality: you'll basically be able to specify the number of active L3 agents per router, and Neutron will set it up. It uses conntrackd and keepalived, I think. And really, where this factors into things is the point of diminishing returns, for my HA tool, for Pacemaker, for whatever: how much time is it worth spending implementing things? Which is pretty much why I came to the conclusion that all of those caveats, at least so far, for my scenario, are an acceptable risk. If we're going to get something upstream eventually, I'm just trying to get a stopgap that's good enough. It'd probably be better to call it a DR tool rather than an HA tool.
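For reference, here's a sketch of the shape that configuration takes. These option names are my understanding of where the blueprint is headed (the Juno-era names, to the best of my knowledge), so treat them as an assumption rather than gospel:

```ini
# neutron.conf (server side), assumed Juno-era option names.
[DEFAULT]
l3_ha = True                     # create new routers as HA (VRRP) routers
max_l3_agents_per_router = 3
min_l3_agents_per_router = 2
```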
However, this is the beauty of open source, and I really wanted to touch on this again: there's no "runway", no one right way, to do it. My goodness, I can't talk today. It's good to think outside the box, do cool things, and come up with your own ways of doing it. Just because you read that Pacemaker is the way to do it, or that this HA tool is the way to do it, or whatever, that doesn't mean it's the only way. So I'd encourage everybody: think about your scenarios, think about what you've got going on, look at creative ways to solve problems, and then come back, and I'll come listen to your talk about how you've made this better.

Cool. So, any questions, any comments, anyone want to tell me why I'm wrong? If not, the sound guy said if we could get out of here early, that'd be good. I'm just kidding.

Audience: I might answer that question, because we've been working on something very similar. We've been enhancing the HA tool that you're also using, upstreaming some changes, and then we added a Pacemaker resource agent for the HA tool, to monitor everything and also take care of restarting the agents. It's all upstream; well, the resource agent is still only a pull request on the upstream repositories.

Kevin: So there certainly is work being done. Like I say, from my perspective, given the issues that I saw, I didn't want to spend more time trying to eke out a little more performance. For me it's been good enough, and given the hopefully seamless failover that's coming upstream, I haven't spent specific time on that.

Audience: Instead of just rescheduling onto more L3 agents when they die, I was thinking of spinning up more L3 agents based on traffic. In our use case we support eScience users, where the traffic goes up; they download a lot of scientific data. So I was thinking of launching more L3 agents on demand when the traffic goes up, and then rebalancing. Is there any way to do that?

Kevin: Not in the tooling that we've created today. The way we look at that, again along the lines of the RAID analogy, is that we have a number of L3 agents running, and when they get to a certain capacity, based on how many are running and how many routers we have, we spin up more physical hardware. We're not currently doing anything with virtual hardware. Does that answer your question? So no, I'm not working on anything to spin up new agents automatically, but that certainly is a really good idea, and for people running virtual control infrastructure it would be really amazing: you could implement something like Heat to automatically expand out your L3 infrastructure. Good comment, thank you.

Audience: What do you think the behavior of the L3 HA structure would be if your DHCP configuration always puts two virtual routers in every VM's routing table?

Kevin: That's a really good question, and I'm not really sure, to be honest with you. I'm not actively working on the upstream L3 HA stuff; I'm working on some other things at the moment. My first consideration is that it would probably increase the load if you ended up with two virtual routers for the same VM landing on the same L3 agent. And I'd have to imagine that when the scheduler code finally gets written, where to put things will probably be a consideration. If not, we should definitely look at the blueprints and make sure that it is. Thank you, good comment.

Audience: The community is working on the distributed virtual router, right? There's some work ongoing; I guess they're pushing it to the next release. So how does this compare to the distributed virtual router?

Kevin: Sorry, I'm having a hard time hearing you, Changbin.

Audience: The Neutron guys are working on the distributed virtual router for the L3 agent, right? I just wonder how this compares to DVR, the distributed virtual router.

Kevin: Well, I sort of consider this to be a third-party, bolted-on, try-to-fix-it solution, so it probably has nothing to do with what people are working on there. I was going more for: what can we do today, in Grizzly and Havana, to overcome the shortcomings? What are some ideas? So I'd say it has nothing to do with what's being worked on for later. Did that answer your question? Okay, thanks.

Audience: Yes, sir. How is the DHCP lease file shared?

Kevin: Sorry, how is it checked by the DHCP agents?

Audience: How is it shared? Because you said there's one lease file for all the agents, right?

Kevin: Right. Currently the way it works is that any time a new VM is spun up or the lease file gets changed, the agent pulls the entire lease file for that subnet, the entire list of VMs, from the database and rewrites the file from scratch.

Anyone else? All right, well, I really appreciate you guys coming. Thank you. Hopefully you learned something, and if you didn't, then hopefully it wasn't a complete waste of your time. Thanks, you guys.