Good afternoon, everyone. Good to have you here. This is the talk about clusters, routers, agents, and networks: high availability in Neutron. My name is Florian. I run hastexo. We are a professional services company doing consulting and training around OpenStack. To my left is Assaf. He is from Red Hat and works on the OpenStack networking team there. To my right is Adam. He works for SUSE on OpenStack HA. You'll find our names and our Twitter handles at the bottom of this slide. If you have feedback to share about this talk, or questions, or anything like that that we can get to later, please contact us that way.

Also, just as a reminder, right off the top of the talk: if you go to sched.org and the OpenStack Summit schedule, there is a feedback button for every session. Please use that liberally. We, and the conference organizers, are thankful for any and all feedback about the talks, about the venue, about the setting, or about anything else you would like to comment on. We'd like to thank you in advance for that.

We're concerning ourselves in this talk, which is of course in the networking track, with high availability in one specific area of OpenStack, and that is OpenStack networking, aka OpenStack Neutron. Of course, we generally want high availability in just about every aspect of an OpenStack deployment and every OpenStack component. When we talk about Neutron HA specifically, though, we need to look at high availability for Neutron from not one, not two, but three different angles or perspectives, and we're going to cover all three in this talk.

One angle that we need to look at Neutron from, for high availability, is the Neutron API service, which for some reason no one really understands is called neutron-server, whereas every other API service is called something-something-API. So that's one thing we obviously want to keep highly available. Our users, our cloud operators, anyone who wants to use our OpenStack cloud needs to talk to our APIs, and one critical API that everyone needs to talk to is the Neutron API. We don't want that to go away because of, say, node failure or planned maintenance or something like that.

Another thing is high availability for virtual DHCP servers. In Neutron tenant networks and subnets we can enable or disable DHCP for IP address configuration. All of that is of course managed dynamically by the Neutron DHCP agent, and we need to take that into consideration for high availability as well.

And thirdly, we need to talk about high availability for Neutron virtual routers, and that of course is the L3 agent's home turf. This concerns itself with north-south and east-west routing in our cloud and any kind of layer 3 network connectivity, which we need not only in order to be able to talk inside and outside of our cloud, but also as a prerequisite for things like metadata acquisition and so forth.

So from those three different perspectives that we're going to look at high availability in a Neutron context, we'd like to give you insight into three different things. One is what's happening in this area upstream: what are the high availability considerations that go into upstream development?
And we also want to talk a little bit about the specific high availability value-add that comes from individual distro vendors. We're talking about the specific value-add that comes from two distros, Red Hat and SUSE. That's just two out of several that are available. Why just those two? Well, we can't cover everything. If there's anyone here in this room from Ubuntu or Mirantis or any other OpenStack vendor and you're badly disappointed about not being represented in this talk, we're sorry, but please feel free to submit a panel for Austin, and we'll be happy to revisit this topic then. So with that, I'll hand it over to Adam. There we go. And we're going to start out with HA considerations for neutron-server, the Neutron API service. If you want to use the clicker.

Thanks, Florian. Okay. So this is probably the simplest area to talk about, HA of neutron-server. I guess it's this one. Yeah. So this is pretty simple on the HA front, and not too much has happened in development recently. The essential thing is that as long as you have a highly available database, the server is stateless. So you can just run as many as you want, active-active, and you don't need to do anything special other than ensuring that the client has a reasonable way of reaching any of the server backends.

Okay. So this is just a simple architecture diagram. It probably won't be new to too many people here, but typically what you'll do is use something like HAProxy for load balancing. In this case, we've got HAProxy running on the first controller, and there's a virtual IP which the clients connect to to reach HAProxy, and then the connections go through that to one of the backends. You get load balancing and active-active that way, which is pretty simple. So then if one of the controllers goes down, HAProxy detects that and automatically reroutes traffic away from that backend, and everything from the client perspective carries on more or less happily as before.

The obvious problem with this is what happens if the node running HAProxy, that controller, dies. In this case we have Pacemaker, the cluster management software, which is the industry-standard Linux HA software. We've got some of the maintainers in the room, actually. Pacemaker takes care of managing HAProxy and the virtual IP, fails them over to another controller, and everything again carries on as before. And there is a resource agent that manages the virtual IP address, and it takes care of issuing a gratuitous ARP during the failover to reduce the downtime for clients. So this technique has been around for quite a long time. It's nothing particularly new, and not much really has changed on the neutron-server side.

I'm not sure I wanted to start that so quickly. But I thought I'd give you a quick demo of that, just to get your brains into the space and see what it looks like from at least the command-line perspective. So maybe you want to restart the video. So, a quick video. We're on an administrative server. We're going to SSH to controller number two, which at the moment is not running HAProxy or the virtual IP. And we'll just use the cluster command-line tool to show the configuration for the virtual IP. You can see the IP address there. If we load the openrc file, you can see the IP, and that corresponds. And if we connect to Neutron with a simple query, everything works as expected. But that's going through the virtual IP, so through HAProxy.
And if we just check using the cluster status utility, you can see that HAProxy is running on the first node, the one that ends in 01, and the neutron-server is running on both. The systemctl status is showing the same thing there. And you can see that there's no virtual IP running on that host. If we reconnect to the first controller, this is the one that is running HAProxy and the virtual IP. So if you run the same command there, you can see HAProxy is running, and the neutron-server again. And on this node, you can see the IP is there.

And now we're going to do something pretty nasty to that node: we just pause corosync. That will effectively stop cluster communication, and it will cause that node to be fenced. So if we go back to the other node and ping the first node, we can see it's being fenced. Why does that keep vanishing? Okay. If we look in the syslog, we can see the node has been fenced. And if we look at the cluster status, we can see HAProxy is now running on the second node, and the systemctl report confirms that again. And now we can see the virtual IP has failed over as well. So now if we just load the openrc to get the same config on that node and reconnect to Neutron, you see we get the same results as before. And you can see that the first node is still rebooting because it's been fenced. The network stack has just come up, and then if we attempt to connect over SSH, SSH will come up very shortly. And finally, the other node is back up. But at this point it will typically need some manual maintenance to get it back in the cluster and to check what went wrong. So that's a very quick demo. I'm aware that it went very quickly. If you missed it, don't worry, because you can watch this video: the slides are all available online, and you can watch it again as many times as you want. We'll provide the link to the slides again at the end of the talk.

To summarize this very briefly: ensuring proper failover of an API service such as neutron-server is actually relatively straightforward, because the API services in OpenStack are typically inherently stateless. So in fact, what you need to do is ensure that you have as many instances of these as you need, you put them behind a load balancer, you make that load balancer itself highly available by putting it under Pacemaker and Corosync management behind a virtual IP, and then you've effectively got your bases covered. Another thing worth pointing out is that, as you probably realized, the demo Adam was showing was on SUSE OpenStack Cloud 5 on a SUSE platform, but effectively pretty much the same concepts are being applied on the Red Hat side as well, and on the Ubuntu side. On the Red Hat side, we can actually confirm that, because the guy who's more or less in charge of this on the Red Hat side sits over there, I think in the seventh row, and he hasn't as yet thrown an egg or a tomato or a knife, which he normally would do if we lied about his work on stage, or sent us a horse head or something like that. Oh, I'm sorry. Yes. Don't talk about that. I don't remember. But basically, you will see this pattern in most OpenStack high availability solutions for API services.
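To make that pattern concrete, here is a rough sketch of what the load balancer and VIP side can look like. The addresses and resource names are made up for illustration, and the exact HAProxy and Pacemaker (crmsh) syntax differs between distros and releases:

    # /etc/haproxy/haproxy.cfg (excerpt): one listener on the virtual IP,
    # fronting the stateless neutron-server instances on each controller
    listen neutron-server
        bind 192.168.100.10:9696        # the virtual IP clients talk to
        balance roundrobin
        server controller1 192.168.100.11:9696 check inter 2000 rise 2 fall 5
        server controller2 192.168.100.12:9696 check inter 2000 rise 2 fall 5
        server controller3 192.168.100.13:9696 check inter 2000 rise 2 fall 5

    # Pacemaker side (crmsh syntax): the VIP and HAProxy are cluster resources,
    # grouped so they fail over to a surviving controller as a unit
    crm configure primitive vip-api ocf:heartbeat:IPaddr2 \
        params ip=192.168.100.10 cidr_netmask=24 op monitor interval=10s
    crm configure primitive haproxy systemd:haproxy op monitor interval=10s
    crm configure group g-api vip-api haproxy

The IPaddr2 resource agent is also what sends the gratuitous ARP on failover that was mentioned above.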
So with that said about HA considerations for neutron-server, we're going to take it up a notch in terms of complexity with another component where the HA considerations start getting a little more involved, and that's the Neutron DHCP agents.

Yeah. So probably the most significant development in this area was in Juno: this option got introduced, which allows having multiple concurrent DHCP agents for each network. It's easy to configure, and it looks like this when it's set up. So DHCP-1 represents a server for the first network, and you can see there are two servers in this case for that particular network, but it could be more than two. And another two for the second network, and another two for the third in this example. The way that works is that, because DHCP operates over broadcast, the client sends a DHCP discover out and says, somebody please help me out, I need an IP. The servers race to offer an IP, the offer that wins the race gets ACKed for the client, the client gets the IP, and the four-way handshake completes.

And actually this is the old version of the slide. Yeah, I take full responsibility. So you get to egg-and-tomato me now. Oh, right, thanks. So yeah, if you grabbed the QR code earlier, you actually already have the updated slides, but the last bullet point was a correction that I just failed to update my laptop for. Sorry about that. That's okay. Oh, thank you. So the key point here is that the host-to-MAC mapping is static. So whichever server wins the race, it doesn't matter. And if the lease runs out on the client for some reason and it wants to renew, even if it gets the renewal from another server, it gets the same IP every time because of the static mapping, essentially. And you can query this through the Neutron API. Here's an example: this is a query saying, show me all the hosts which are running DHCP agents for the floating network, in this case, and it shows that they're alive. And then if, for example, the first host goes down, the status updates there. And there is some very nice documentation on this upstream, which is there, so feel free to check that out. But it's a pretty straightforward mechanism, I think.

So with the API services covered and the DHCP agents covered, we're going to get to the point that has historically been the trickiest HA challenge for just about Neutron's entire existence up to this point, and that is ensuring high availability for L3 agents, which are effectively the agents that control and affect routing in our clouds, or L3 connectivity, as opposed to switched L2 connectivity.

Sure. So let's just quickly go over the problem, so we understand why we even need all of these complications. You can have multiple L3 agents, and you can schedule your routers to this agent or that agent. It's a single point of failure in the sense that if the node that hosts a specific router or a set of routers fails, you have a serious problem: all of your VMs behind those routers don't have external networking, and they're not accessible via their floating IPs. It's not very nice. So there's a bunch of different solutions to this problem. There's this really simple one, which is basically just a flag that you enable, and that's it. That's pretty much what you do. It's that flag over there.
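For reference, both the multiple-DHCP-agents option and that rescheduling flag live in neutron.conf on the neutron-server side. A rough sketch, using the Juno-era option names, so check them against your release:

    # /etc/neutron/neutron.conf (excerpt, on the neutron-server side)

    # Schedule every network onto this many DHCP agents instead of just one:
    dhcp_agents_per_network = 2

    # The "just a flag" L3 rescheduling option referred to above; what it
    # actually does is described next:
    allow_automatic_l3agent_failover = True

    # The query from the slide, listing the agents serving a given network:
    #   neutron dhcp-agent-list-hosting-net <network-name-or-id>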
And basically what it does — this is in-tree, and it basically mimics the solution that a lot of people already had out of tree via scripts that poll the API from crontab — is that you have a loop that runs every 60 seconds on the server, and it checks: hey, do I have a dead agent? Well, let me look at all of the routers that are scheduled to that dead agent and just reschedule them. There are a few issues. One issue is that failover will only work if your control plane is up: your neutron-server, messaging, and database. The other issue is that you're relying on the check — you basically check whether your agents are dead or not using the heartbeats, so you can have flapping agents and that sort of issue, which we wanted to avoid. The final issue is that it's really, really slow. If you have a thousand routers, it's going to take over an hour to move everything, and that's an hour's worth of downtime. I don't know how acceptable of a solution that is. I'll be polite.

So what we opted for, what we actually did during the Juno cycle, was to model something called highly available routers, which basically means that when you create a router, it's preemptively scheduled to multiple nodes. It's configurable. So for each router — in this case you have the blue router — it's replicated twice: for a single Neutron router you get two replicas, one on each node. And you have that nice command on the left that shows, just like DHCP earlier, you can issue it on a specific router and it'll tell you all of the agents that are hosting that router. And that right column, the HA state column, that's new, and we'll talk about it in a minute.

So a little bit about how that actually works. And yeah, failover works in this slide, which is cool. Just a little bit of detail, because this was introduced in Juno and I actually want to talk about the diff, the new stuff that we added, so I'll just briefly talk about how it works. Each router is spawned in a namespace. We spawn a keepalived process for that router. We tell keepalived all of the IP addresses and interfaces of the router, and all of the floating IPs as well. So we basically configure keepalived for that router with all of the IPs, and we just let keepalived float those IPs as virtual IPs in case of a heartbeat failure. The way it works is that, when the slides work, keepalived sends these heartbeats saying, I'm the master, I'm the master — it's very selfish. And if a neighbor stops receiving these heartbeats for a few seconds, it says: oh, the other guy is dead, I declare myself the new master. It's kind of morbid. And you just get keepalived to float these IPs. The actual keepalived traffic uses VRRP internally. And what we did there was, for every tenant, we create just a normal Neutron network which is used exclusively for the VRRP traffic, so it's segmented from your other traffic. It's just going to use your default tenant network type, so if you're using GRE or VLAN, that's what it's going to use — although we added an option so that you can actually select which segmentation type you want to use; you can backport it if you want. I think that's the bare minimum of detail that you need.

Now for the cool stuff that we added — because this is OpenStack, it doesn't actually have to work, we just have to add new stuff. That's what we do. Especially in Red Hat. So, is that quotable? Sure, yeah. Tweeted. So, this will be between us.
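Before moving on to the new stuff, to put some option names to the HA routers just described — again a rough sketch with Juno/Kilo-era names, and the values here are only examples:

    # /etc/neutron/neutron.conf (excerpt, on the neutron-server side)
    l3_ha = True                        # new routers are created HA by default
    max_l3_agents_per_router = 2        # number of replicas per router
    min_l3_agents_per_router = 2
    l3_ha_net_cidr = 169.254.192.0/18   # hidden per-tenant network for the VRRP traffic
    # Optionally pick the segmentation type for that VRRP network explicitly,
    # instead of the default tenant network type:
    # l3_ha_network_type = vxlan

    # The command from the slide, showing the replicas and the HA state column:
    #   neutron l3-agent-list-hosting-router <router-id>
    #   +-----------+--------+----------------+-------+----------+
    #   | id        | host   | admin_state_up | alive | ha_state |
    #   +-----------+--------+----------------+-------+----------+
    #   | 6efcf7... | node-1 | True           | :-)   | active   |
    #   | e23ed8... | node-2 | True           | :-)   | standby  |
    #   +-----------+--------+----------------+-------+----------+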
So what we did was, we said — you remember this, the nice table, yeah, this table here — we want the admins to be able to know what the state is on each node. For a given router, I really want to know where the active instance is. So in case of a failure, you see: oh, okay, I know where the active one is, because that's where the smiley and the "active" tuple are. And I know that the left node died, because it's showing those nasty X's, and the last state it reported was active. So it lets you troubleshoot your network. If you have, say, two agents and both of the agents are actually up and both are reporting active, you probably have some sort of configuration issue, or an incoming bug report.

It seems simple, because it's just a column, but it took a few months to code and review, because it turned out to be sadly really difficult to code. And it actually enabled us to add a bunch of new features that are dependent on it. So it's pretty important, because once the neutron-server and the database know the state on each agent, we can do a lot of cool stuff, which we'll talk about in a sec.

The overall design here was that we spawn a new process per router, called neutron-keepalived-state-change. What it does is monitor for IP changes in the namespace. So if keepalived performs a transition, it's going to configure new IPs, or maybe delete them, when a transition happens. In this new state-change process we run "ip monitor", which basically spits out events whenever an IP is added or removed. Whenever that happens, it notifies the layer 3 agent over a Unix domain socket. The layer 3 agent aggregates these over a period of roughly 15 seconds and then sends a single RPC message saying: okay, for each of these thousand routers, this is the last state change. So say another node dies and it had a thousand routers; it floats all of those thousand routers to my node, I'm going to get a thousand of these notifications into the layer 3 agent, it's going to aggregate them and send a single RPC message saying, you know, these thousand routers are now active on me. That is sent to the neutron-server, and it updates the database.

You could say, wow, this is complicated. keepalived has scripts that you can use — you can just tell it, execute this script whenever a state transition happens. We tried that for a very long time. It's problematic, because it executes the scripts before the transition starts and not after, and that produced really weird races that we just couldn't figure out a way to solve. So we just monitor for IP address changes in the namespace, and it turned out to be foolproof — as much as OpenStack can be.

So it adds value in the sense that it helps admins, which is cool. It also enables a bunch of new stuff. One issue that we had with HA routers is something called L2pop. So just a quick refresher on what L2pop does — and by the way, the slides are out of date, so this is extremely confusing, because... so maybe I'm confused. Call that an Easter egg. Okay, so this is not confusing. So L2pop does two things, and it's relevant only for GRE or VXLAN or Geneve tunneling, not for VLANs. If you're using tunneling, then it does two things.
It optimizes your tunnel interfaces. If you do an ovs-vsctl show, you're going to see all of the tunnel interfaces formed on this machine. Well, without L2pop, it's just a full mesh. So if you have 1001 nodes and you go to one of them and issue the command, you will see that you have 1000 VXLAN interfaces. L2pop optimizes that, because you don't need a tunnel to another remote node unless there exists a network where you and the other node both have a VM, right, in the same network. So in this case, the two blue VMs, or green one and green two, for example — you need a tunnel there. The other thing that it does: for example, the bottom two nodes actually have a tunnel interface; however, if the green-three VM, the bottom right green VM, sends out broadcast traffic, there's no reason why that traffic should reach the bottom left node. Even though they have a tunnel interface OVS-wise, the broadcast traffic just doesn't need to reach agent three, because it's not hosting green VMs. So that's what L2pop is about, and it's awesome.

The issue is that, in order to optimize the broadcast traffic and destroy and create the tunnels on demand, we look at the host binding of a port. So on the bottom left, we issue a show command on a port: it's bound to a specific host, and that's the information that L2pop uses to determine its logic. The issue is that, in our case, with HA routers, the ports aren't actually bound to a single machine. They're bound to a bunch of machines; they're replicated. And then what happens, in this slide for example, is that the tunnel is pointing to the wrong node, and you're not going to get connectivity, and it's going to be extremely hard to troubleshoot, and you're going to hate me. And it's not nice.

So we solved it, for some value of solved. What we do is use the information that we already have from the state visibility thing which I explained earlier, because we know where the active node is. So whenever a state transition comes into the neutron-server, we update the port bindings accordingly: we basically get all of the router ports and update the port bindings to point to the new node, the new active node. This only works if your control plane is up — your database, your neutron-server, your messaging (RabbitMQ), and the messaging network — so your control plane needs to actually be alive. Just to recap: if you don't have L2pop, you're not assuming that your control plane is alive and failover will work; with L2pop, you are. It also lengthens the failover time, from around eight seconds on my machines to 30 or 40 seconds. So we checked the box. We do have a backup plan, which I would be happy to share, to basically use the data plane to do this even with L2pop, and if anyone wants to do it, you're more than welcome, and I'll be happy to review it.

New stuff. That's what we do — we do new stuff. So, DVR and L3 HA integration: it's two patches. We merged the first patch a couple of weeks ago, and we're going to merge the second one hopefully next week. The idea is that if you're running DVR, your failure domain is reduced if you're using floating IPs or east-west routing. But north-south routing without floating IPs — that traffic is still centralized, and it's still going through a centralized network node, so we're going to HA that. So you create a router that's both distributed and highly available, and you're good to go.
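From the CLI, creating these routers looks roughly like this; the --ha and --distributed flags are admin-only, and the combined form assumes the DVR plus HA patches have landed in your release:

    # An HA router: replicated onto multiple L3 agents, one active via VRRP
    neutron router-create --ha True demo-router

    # Once the DVR + HA integration is in, the same router can be both
    # distributed (east-west / floating IPs) and HA (centralized SNAT):
    neutron router-create --ha True --distributed True demo-router-dvr-ha

    # The port binding that L2pop looks at is visible (as admin) with:
    #   neutron port-show <router-port-id> | grep binding:host_id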
The other thing: we thought it would be really difficult, so we didn't implement it, and then we did some digging and found out that migrating, or updating, a router from legacy to HA was actually really easy, so we just did it, and it was merged like yesterday. And conntrack integration: if you're doing a failover and you have a VM that's using SNAT traffic — so it doesn't have a floating IP — that type of NAT is stateful, so if you did a failover, that type of connection would break. We can solve it using conntrack, and again, if you want to help out, reach out, and we can do it.

So with the approximately 10 minutes that we have left in the talk: up to this point, everything we've been talking about is stuff that has either happened upstream — that's most of what Assaf covered — or stuff where what we talked about is not actually upstream, that is to say, in the OpenStack code base itself, but where pretty much the entire community and all OpenStack vendors agree on how things are done. And now we're going to briefly go over the small differences that do exist between SUSE OpenStack Cloud 5 and RHEL OSP 7, and actually also point out what the similarities are and where the overlap is. And Adam's going to kick off with SUSE OpenStack Cloud 5.

Right, and I think we'll also touch briefly on what's coming in our next versions, respectively, as well. So Cloud 5 actually came out in May 2015, but our HA strategy has been relatively stable ever since we first released it, in, I think, February or March 2014, two releases before that, as an update to our Cloud 3 release. So Cloud 5 is based on Juno, so anything that is in Juno with respect to Neutron HA generally applies. One thing that we do architecturally is try to be as accommodating as possible, so we leave it up to the customer how many clusters they want, what they want to put in each cluster, and how big they want the clusters to be. So there's some flexibility there, and there's an interface and an API for making that straightforward. As was mentioned before and shown in the demo, neutron-server is just active-active behind HAProxy, as are pretty much all the other OpenStack services. Anything that is stateless can be run active-active; we just put it behind HAProxy, and we have one virtual IP per network per cluster, plus a couple more for the database and message queue. And we use Pacemaker for managing HAProxy, like you saw in the demo earlier, and all OpenStack services, including the Neutron services listed here. So they don't get started and stopped through the normal systemd or init-script type processes, but are initiated by Pacemaker, and Pacemaker tracks and monitors those services.

So we have this setting that I mentioned before for DHCP HA, running multiple agents per network, and we just basically set that to the number of network nodes that we have available. And we don't use the option that Assaf mentioned. Instead we do something slightly different with the L3 agent, which is that we leave monitoring and fencing to Pacemaker, and we have a slightly unusual, sort of special trick for doing that, although I expect this to change in the future. What we have been doing so far is we have a custom resource agent that runs in Pacemaker, which is separate from the resource agent that looks after the L3 agent itself, and this just basically monitors on a regular interval to check whether any of the L3 agents are dead.
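Conceptually, the check and the rescheduling that this automates can be done by hand with the standard neutron CLI. A rough sketch — the real tooling adds error handling and does this in bulk:

    # Which agents are alive? Dead ones show "xxx" in the alive column.
    neutron agent-list

    # Which routers are still scheduled to a dead L3 agent?
    neutron router-list-on-l3-agent <dead-l3-agent-id>

    # Move one of them to a surviving L3 agent:
    neutron l3-agent-router-remove <dead-l3-agent-id> <router-id>
    neutron l3-agent-router-add <live-l3-agent-id> <router-id>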
If any are, then it uses this neutron-ha-tool script, which talks to the Neutron API and migrates routers away from the dead L3 agents. And the stop action — if you know anything about Pacemaker resource agents — the stop action doesn't do anything, because if the stop action fails for some reason, that causes fencing, and we want to make sure that doesn't happen in this particular case.

So DVR obviously is a very hot topic with Neutron and HA, and we have a technical preview of that in the current release. The next release, which we just announced the beta for — the beta is now open, and I think that was announced yesterday — will be coming out in the next few months. That's based on Liberty, and DVR will be fully supported in the next release. And a quick mention here: we're basically doing the same thing as Red Hat are doing. Red Hat have done some fantastic work on compute node HA, and we're converging with their solution in our next release. This basically works by extending the Pacemaker cluster to manage compute nodes, not just controller nodes, and the Open vSwitch agent, if you're using Open vSwitch, becomes part of that solution there. And the fencing again is based on Pacemaker. Yes. Yeah. So I'm looking to collaborate more with you.

For the Red Hat side, I opted to keep it fairly short, because as distros, what we do is we take upstream features and kind of mold them and beat them into submission, test them, and we get something that we're happy with. We picked and chose a bunch of features that we already talked about. So the DHCP HA and the highly available routers, that's what we're going with for OSP 7. And for the API services it's pretty straightforward, it's what we already talked about: you get HAProxy as a load balancer, and it's being made highly available by Pacemaker. For your agents, as I said, you have the highly available routers, which we discussed pretty in-depth; you have DHCP HA, where each network is being replicated and created on multiple network nodes; and all of your agents are being monitored in case they crash — because Neutron is perfect — and they just get automatically restarted.

One cool thing — we have a couple of cool things that are coming. One thing is instance HA. The idea is that, especially if you're using shared storage, it becomes really cool: if the VM on a compute node dies, we monitor it, using the Nova API, and we actually recreate the instance on another machine using the same volume. It's a widely requested feature, and it's coming soon. Another thing that you don't have a slide for? Yeah, another thing that I forgot to build a slide for is DVR. It's been in tech preview for OSP 6 and OSP 7, and that means that the installer cannot set it up for you: you do an installation, then manually do steps after the installation to enable DVR, and the support is in tech preview. We are aiming for full support, including installer integration. I'm not a marketing person, so I can't make any promises, but I promise that it will be in OSP 9 — we're aiming for OSP 9. So yeah, don't quote me, it's not for sure, but it's absolutely going into OSP 9. That's it on my end.

We have time for exactly one question, so make it count. How do you do the HA of the VM? Do you have shared storage for it, or is that also some other magic that you guys have done? So just to repeat the question: the question was, how exactly is the high availability for individual VMs achieved? So yeah, we have the experts in the room.
By the way, if you just want to talk to them afterwards and get all of the information, they're right over there. But to summarize — and there's a demo, apparently; I'm fully up to date on this stuff — there's a demo at the Dell booth of exactly this feature. But with shared storage it works better, because you can reuse the volume. Yeah, it works. So I'm sure that answered the question.

Okay, we're unfortunately out of time, and we'd like to clear the stage for the next speaker and not be rude to them, but we'll of course be happy to answer any more questions outside. I had the idea for this talk, so I'd like to thank my two co-speakers — please join me in thanking them — because even though they have different affiliations, they have been extremely gentlemanly and gracious working together here. And I'll be happy to repeat this one: I'd also like to specifically thank Assaf for pinch-hitting for a colleague on relatively short notice. And I'd like to thank you all for coming. Have a great rest of the conference.