Hi, everybody. Thanks for coming. Before I start, I'd just like to say that all the slides can be obtained through this QR code. It takes you to a Twitter link where you can, if you want, tweet that you're here and hopefully enjoying the session; you don't have to tweet, but the links to the slides are there. So please don't feel the need to take photos of the slides throughout the presentation, because they will be available online indefinitely afterwards.

Okay. So we're going to talk about high availability for pets and hypervisors, and give you a kind of state-of-the-nation overview of what's happening in the OpenStack world. My name is Adam Spiers; I'm a software engineer at SUSE specializing in high availability and OpenStack, and this is my colleague David from Intel, who has a similar area of expertise. Today we'll very quickly look at where OpenStack currently is in terms of high availability; then we'll look at when we need high availability for the compute plane, because that is a slightly controversial topic, and at some of the architectural challenges involved in implementing it. We'll go through several existing solutions, give you some hopefully completely unbiased advice on choosing a solution, and talk about what's coming in the near future and how you can get involved with the upstream community.

So today, high availability in OpenStack looks something like this. It's usually just on the control plane: you have active/active OpenStack services, the API services for example, and maybe active/passive for the database and message queue, although those can be active/active as well. The point is that the services get automatically restarted, and you get increased uptime and better manageability of your cloud. If you look a bit closer under the covers, typically you'll see Pacemaker and Corosync used as the underlying clustering technology, with HAProxy for load balancing, and quite often Keepalived as well. This is all very standard stuff; it's basically a solved problem, mostly, and it's outside the scope of this talk. It just sets the context of where we are currently.

So that's the picture on the left-hand side of the slide: the control plane is highly available. But you still have the challenge that if a compute node fails, things can go wrong. So when is it important to do something about this kind of failure mode? Do we really need to care about it? Some people say we actually don't. And for the benefit of any non-native English speakers in the room, you may not have heard the expression "elephant in the room": it's an uncomfortable topic that nobody really wants to talk about, so they just leave it in the corner of the room and pretend it's not there. We're not going to do that today. The elephant in the room is whether you should run pets in the cloud or not, so let's take a quick look at that. I'm assuming probably everyone here has heard of the pets-versus-cattle metaphor, but I'll run through it very quickly just in case anyone hasn't. It's a metaphor for the virtual machines in your cloud, and there are essentially two types, whose natures are given away by the names. Pets are typically given unique names, whereas cattle aren't.
This reflects the fact that pets take a lot of work to create and look after, whereas cattle don't; similarly, when something goes wrong with a pet, you need to invest a lot of effort to fix it, whereas with cattle you just get another one, which is simpler. So what does that mean in practice for compute node HA? Well, when a pet dies, you actually get service downtime, because it's not designed to be resilient to failures. Cattle are designed so that if one of them fails, the service keeps running, albeit maybe in some slightly degraded fashion. Pets are typically stateful, with mission-critical data, whereas cattle hold only stateless or disposable data, so you don't need to worry if their storage goes away. But both need some kind of automated recovery: for pets, because of that critical data, you need to make sure the data is protected if there's a failure; with cattle you don't have to worry about the data, but you do still need automated recovery. Why do I say that? Well, if your compute nodes are hosting cattle, then over time failures mean that your service becomes more and more degraded, and manually restarting instances is a waste of time and unreliable, due to the human factor involved. So we need some kind of automated way of doing this. Heat used to support this with HARestarter, but that was deprecated in Kilo, and I talked to the current PTL of Heat recently: apparently there are no plans to bring that kind of functionality back, although there are new convergence and self-healing capabilities being added to Heat, I believe; I'm not an expert in that area. You can do the restarting through the normal APIs, but you need to do it yourself; I'll show what that might look like in a moment.

In contrast, it's a bit trickier if you're hosting pets and a pet fails. (Apologies to anyone who loves kittens.) In this case we have to be much more careful when resurrecting the pet VMs, because when a compute node fails, we only think it has failed, maybe because we've lost network connectivity to it; it might still have access to the underlying storage, and might actually still be running and writing to that storage. So if you resurrect the VMs on another compute node, you will have the resurrected pet running on the new compute node and a kind of zombie evil twin pet still running on the unhealthy compute node, both writing to the same storage; you get data corruption, and your mission-critical data is now in trouble. This is a big problem, and we really have to deal with it carefully.

So, in conclusion, our opinion is that yes, there are really good reasons for doing compute HA: firstly, cattle need to be auto-restarted in some way, and secondly, there actually are valid reasons for running pets in OpenStack, even though a lot of people don't like the idea. The typical response is to say, oh, you should just migrate all your pets to cattle workloads, but in the real world that's a lot of effort that you can't do overnight, so it's best to have some solution while you still have pets around, especially if you want to consolidate all your workloads into one cloud rather than having, say, one VMware estate for all your pets and another OpenStack estate for all your cattle. There are manageability benefits there.
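As a concrete illustration of the do-it-yourself restarting mentioned above, a minimal self-healing loop for cattle might look like the sketch below, using python-novaclient. The credentials and the hard-reboot recovery policy are assumptions for illustration; a real version would need rate limiting, logging, and smarter failure classification, and as explained above it would only be safe for cattle, not pets.

```python
# Hypothetical self-healing loop for cattle: hard-reboot anything in ERROR.
# Credentials and the recovery policy are illustrative assumptions.
import time

from novaclient import client

nova = client.Client('2', 'admin', 'secret', 'demo',
                     'http://keystone.example.com:5000/v2.0')

while True:
    # Find instances that Nova has marked as failed.
    for server in nova.servers.list(search_opts={'status': 'ERROR',
                                                 'all_tenants': 1}):
        print('Recovering %s (%s)' % (server.name, server.id))
        server.reboot(reboot_type='HARD')
    time.sleep(30)
```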
So if this functionality is really needed, why hasn't it already been done? The answer is that it's surprisingly tricky to do in a reliable manner. The first challenge is configurability: every cloud is different, and every cloud operator has different ideas about what kind of SLAs they want to offer, so you might want high availability per availability zone, or per project, or even per instance. There are also some unusual failure cases. For example, if nova-compute fails, the VMs could still be running perfectly happily and still providing your service, but they're no longer manageable. In that case, do you kill the compute node? If you do, you're damaging your service when you didn't really need to; on the other hand, you've lost manageability. So there's a real problem to address there.

The second challenge is scalability. We want to be able to handle hundreds or even thousands of compute nodes, and related to that, clustering software typically forms a full-mesh cluster in which every node talks to every other node, so you can't just extend your cluster to include all the compute nodes: the number of connections between all the different machines becomes too large. For example, with Pacemaker and Corosync, the underlying messaging layer doesn't scale above about 32 nodes; the quick calculation below shows how fast the connection count grows. There are some obvious workarounds, but they're not good. You could divide your compute nodes into artificial chunks of clusters, but that creates problems; or you could try to do high availability within your guests, which is really ugly, because then you have to run a cluster stack inside your guests, which is intrusive, and you would need a different cluster stack for each distribution or version inside your workload. That's all extra work that we don't want.
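To put numbers on the full-mesh claim, the number of pairwise connections in an n-node mesh is n(n-1)/2, so it grows quadratically; a quick illustration:

```python
# Pairwise connections in a full-mesh cluster of n nodes: n * (n - 1) / 2
for n in (3, 32, 200, 1000):
    print('%4d nodes -> %6d connections' % (n, n * (n - 1) // 2))
# 3 nodes -> 3; 32 -> 496; 200 -> 19900; 1000 -> 499500
```

So while the roughly 500 connections of a 32-node control plane are manageable, extending the mesh to a thousand compute nodes would mean maintaining about half a million connections.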
The scalability issue is actually solved by a fairly new feature of Pacemaker called pacemaker_remote. This is an extension that lets you run a proxy daemon on the compute nodes, which can then be controlled by the core Pacemaker cluster; you can run any resources on the compute nodes and have them monitored and managed by Pacemaker in the normal way, but the full-mesh connections stay within the control plane, so this can scale very far out.

The next, and possibly biggest, challenge is reliability, because you have to deal with lots of different failure modes. For example, your hardware could blow up, and what you need to do then, to prevent the kind of zombie pet problem I was describing earlier, is fence the compute node: in other words, kill it forcibly through an out-of-band management solution like IPMI, to make sure the VMs really are dead, before you resurrect them on a different compute node. The same applies if you have a kernel-level or OS-level issue of some sort: you just fence and resurrect again. Another failure mode is that libvirt or one of the other hypervisor processes on the compute node fails, or nova-compute could fail. The control plane could fail, but hopefully we've already taken care of that, as I said at the beginning of the talk. If we have a recovery workflow controller of some sort to do the resurrection of VMs, then that could fail as well if you're really unlucky, so we have to worry about that too. Individual VMs could fail. And finally, a VM could be healthy while the workload inside it fails; this last one is actually out of scope for this talk, because it's a completely different problem that is very specific to how you want to do monitoring inside your VM, or whether you even want to do that. That's the last piece of the whole puzzle, which we'll maybe talk about at a future summit; it's probably been talked about elsewhere here as well.

This is a good time to introduce something called nova evacuate, which you may have heard of before. When we have a compute node failure, then after fencing the node we need to resurrect the VMs in a way that OpenStack is aware of, and luckily Nova provides an API for doing exactly this, called nova evacuate. We just call that API and Nova takes care of the rest. If we don't have shared storage it still works; it will simply rebuild the VM from scratch on another compute node. At this point I need to give a public health warning: nova evacuate doesn't really mean evacuation, unfortunately, which is slightly confusing. The reason I say that is, if you think about natural disasters: in the top picture here we have a hurricane offshore, the weather forecasters can see it coming, everyone has advance warning, so it's not too late to evacuate. In the lower picture, the devastation has unfortunately already occurred, so a typical evacuation would be too late by that point. If you translate this to Nova, the top scenario is a bit like Nova live migration, where you're doing planned maintenance, and the bottom scenario is unplanned: there was a big failure, and in that case "evacuation" is not really the right word to use. In summary, it's a bit of a misnomer, and in Vancouver the Nova developers were actually discussing renaming it to "nova resurrect" or something like that, but it hasn't happened yet and probably won't happen any time soon. So whenever you hear "nova evacuate", just pretend you saw "nova resurrect" and it will maybe make more sense.

Now we'll talk about some existing solutions in the free and open source space. The first is one that I've actually been working on quite a bit over the last year or so, and it's based on a Pacemaker concept called OCF resource agents. A resource agent is essentially a plug-in to Pacemaker that lets you manage resources of any type you want. What we do is use two resource agents: one called NovaCompute, which runs on the compute nodes and looks after the nova-compute service, and another running on the control plane called NovaEvacuate, which is in charge of the VM recovery workflow. It uses Pacemaker's own database, the CIB (cluster information base, on the left there), and there's a helper fence agent called fence_compute which stores state in the CIB when it sees that a compute node has failed. So the compute node explodes; it gets fenced; the fence agent marks in the database that the node needs recovery; and the NovaEvacuate resource initiates the workflow and calls the nova evacuate API to recover. A rough sketch of that final step follows below. In this solution we can handle failures of the compute node itself, the nova-compute service, the libvirt service, and any other part of the software stack on the compute node, but what we can't do is look after the individual VMs.
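To make that recovery step concrete, here is a rough sketch of the fence-then-resurrect flow using python-novaclient. This is an illustration, not the actual NovaEvacuate agent code; the client setup is assumed, and note that force_down requires Nova API microversion 2.11 or later.

```python
# Hypothetical post-fencing recovery flow; not the actual NovaEvacuate agent.
from novaclient import client

# force_down requires compute API microversion >= 2.11.
nova = client.Client('2.11', 'admin', 'secret', 'demo',
                     'http://keystone.example.com:5000/v2.0')

def recover_host(failed_host):
    # Tell Nova immediately that the host is down, rather than waiting
    # for it to notice by itself (the "mark host down" API).
    nova.services.force_down(failed_host, 'nova-compute', True)

    # Resurrect every instance that was running on the fenced node.
    servers = nova.servers.list(
        search_opts={'host': failed_host, 'all_tenants': 1})
    for server in servers:
        # With shared storage the disk is reused; without it, the VM
        # is rebuilt from its image on another node.
        nova.servers.evacuate(server, on_shared_storage=True)

recover_host('compute-2.example.com')
```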
This approach is commercially supported in Red Hat Enterprise Linux OpenStack Platform, I think from OSP 7 onwards; this is a screenshot from the OSP 8 documentation showing the beginning of the installation process, which is a sequence of commands that you just type, and it will set the whole thing up; once you've followed that, you're good to go. It's also in the product that I work on, which is SUSE OpenStack Cloud, and I've got this demo video here to show you, if I can start it.

So this is the web interface for managing SUSE OpenStack Cloud, and the first step is to start up a Pacemaker cluster from scratch; we'll call it cluster1. There are a bunch of options you can set here; we'll mostly take the defaults, except that we'll set up fencing. STONITH stands for "shoot the other node in the head", and we give it the IP of the hypervisor so that it can take down individual nodes. We assign the controller nodes to the Pacemaker cluster, and we also install Hawk, which is a web interface that lets us look inside the cluster; we'll see that a bit later on. Then we assign the compute nodes as remote nodes in the cluster, hit apply, and that takes a few minutes to do all kinds of configuration management. Then we come back and do the setup for Nova, which is as simple as assigning the cluster of controller nodes to the Nova controller role, giving us a highly available control plane, and then assigning the remote nodes to the KVM role, so that the KVM services are made highly available. This is the Hawk web interface, and you can see we have the two compute nodes running here, then the evacuation (recovery workflow) controller, and there's the fencing agent.

So that's the setup; now let's test it. We have a VM running here, just a CirrOS image, and the first thing we do is find out where it's running: compute node one or two? You can see it's not on one, and there it is on compute node two. So we ping compute node two and keep an eye on that, because that's the one we're going to kill to test the failover mechanism, and we also ping the VM running on that compute node: we get the network namespace and copy in the IP address from Horizon. So we're pinging the compute node in the first window and the VM in the second window, and in the third window we keep an eye on the log files to watch the nova evacuate workflow. Now we force a failover by killing the pacemaker_remote daemon, which is pretty much the same as killing the node. We can see that Pacemaker has noticed the failure; the pings to the compute node and the VM have both stopped, they're both dead; the evacuation workflow has started and completed; and then the pings to the VM recover. The compute node has actually already rebooted; obviously this video is sped up, because we don't have much time in the talk. You can see the compute node has been fenced, which is correct, and it has rebooted, but the hypervisor is not back up yet; the instance is now on the other compute node and it's pinging. That's the end of the demo. Sorry if that was very quick, but we're trying to cram a lot of information into one talk and couldn't figure out any way to make it slower paced.

So in summary, this approach using the Pacemaker OCF agents is ready for production use now; there's commercial support from Red Hat and SUSE, and the code is upstream in the openstack-resource-agents repository.
By the way, when you visit these slides online, all of these things are hyperlinked, so you can click straight through to the projects. I'm the maintainer of that repository, so if you have any ideas for improving it, or whatever, please get in touch. The downsides are that there are some corner cases, really small corner cases, where failures can be problematic, and, as I said earlier, it doesn't handle failures of individual VMs. But it's a pretty good solution, and we're going to do better in the future.

The next solution is Masakari, which has a really similar architectural concept and looks like this. The recovery workflow engine is the Masakari controller, which in this case runs outside Pacemaker, unlike in the last approach, and it has its own database. It also has extra monitoring processes on each compute node: one that monitors the host, one for the processes on the compute node, such as nova-compute, libvirt, the Ceilometer agent and so on, and one for monitoring failures of individual VMs, which is a nice extra feature. So it's slightly better in terms of which failures it can handle: it can handle the failure of a compute node in its entirety, as before, on the far right; it can handle the failure of nova-compute or libvirt; but it can also handle, as you see in the middle there, VM2 failing, and that's a nice extra. It's also available on GitHub, and there's a recent 1.0 release which added support for pacemaker_remote, so it now scales; CentOS support was added, and it uses SQLAlchemy now, which is nice. One caveat: if you're trying it on Ubuntu 14.04, then because pacemaker_remote is quite a new feature and the version of Pacemaker in Ubuntu 14.04 is pretty old, you will definitely need to compile it yourself; that's probably not the case on 16.04, but I'm not sure. So, in summary for Masakari: the nice things are that it monitors VM health externally, not inside the VM (remember, I said in-guest monitoring is out of scope), and there are a few other things about its recovery workflows that are quite nice and a bit more sophisticated. On the downside, it really only uses Pacemaker as a glorified host monitor, and there are some disadvantages associated with that; for example, it has to wait five minutes after the node has been fenced, which is not great.

Now I'm going to hand over to David, who's going to talk a bit about the Mistral-based solution that he's been working on.

Thanks, Adam. Is this working? Cool. Okay, so first of all, I'm not sure you're all familiar with Mistral, so I'm going to tell you a little about what Mistral is. As you can probably read, it's a workflow-as-a-service service for OpenStack. It enables users to create workflows, where a workflow is just a logical graph of tasks; for each task you can define what to do if it finishes in the success state and what to do in the error state. If the tasks that are already in Mistral are not enough for you, you can write your own actions, and these actions are literally Python classes, so you can do anything inside them; I'll show a small sketch of one in a moment. A workflow execution may be triggered based on an event, for example from Ceilometer; you can use Mistral as a kind of cloud cron, running workflows at a given time; or you can use an API call to run a workflow on demand, which is what this solution uses. So this is the architecture diagram that you have seen before, and Mistral fits into it too: you have Mistral as the workflow controller, we have the Mistral database, and we also have the small fence_evacuate script that is run when the node is fenced.
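As a rough sketch of what such a custom action could look like, here is a minimal Python action that evacuates a host. This is illustrative only: the class, its name, and the base-class import path (which has changed across Mistral releases) are assumptions, not code from this solution.

```python
# Hypothetical custom Mistral action; the base-class import path varied
# across Mistral releases, so treat this purely as a sketch.
from mistral.actions import base
from novaclient import client


class EvacuateHostAction(base.Action):
    def __init__(self, host):
        self.host = host

    def run(self):
        # Client setup is simplified; a real action would reuse the
        # authenticated session that Mistral provides.
        nova = client.Client('2', 'admin', 'secret', 'demo',
                             'http://keystone.example.com:5000/v2.0')
        evacuated = []
        for server in nova.servers.list(
                search_opts={'host': self.host, 'all_tenants': 1}):
            nova.servers.evacuate(server, on_shared_storage=True)
            evacuated.append(server.id)
        return evacuated

    def test(self):
        # Called by Mistral in dry-run mode.
        return []
```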
So this solution can handle a complete compute node failure, and also nova-compute and libvirt failures. In the case of a compute node failure, the node is fenced, as in the OCF agents solution, and then the fence_evacuate script is called, which just tells Mistral to launch the evacuate workflow that will be shown later on; as a result, Mistral communicates with the Nova API and tells Nova to evacuate the VMs. For nova-compute and libvirt failures it also uses the pacemaker_remote feature, so it's exactly the same as the OCF agents solution in that respect. The code for the solution is available on GitHub, and it has a lot of pros: it uses a component that's already in OpenStack, so you don't add any new components; it's very simple, as you'll see; it could potentially be integrated with Congress, which would enable us to run a different workflow depending on the type of failure that happened; and it has the ability to select only some VMs for evacuation. Of course there are some cons: it's experimental code, and Mistral resilience is a work in progress; we made huge progress on Mistral HA during the Mitaka cycle, but there is still some work to do during the Newton cycle.

So this is what the evacuate workflow looks like. It lists the VMs at the beginning; if that succeeded, it filters the VMs, so that we know which ones we want to evacuate and which not; and if that succeeded, we send an evacuate API call to Nova. If any of these steps fails, we can retry; of course it will not retry forever, and we can define how many times we would like to retry before failing the whole workflow. One thing worth adding: since we can be sure this workflow only runs after the node has been fenced, we should add a Nova "mark host down" call at the beginning, so that Nova knows the host is already down, which speeds up the whole process. We also have the ability to mark which VMs need evacuating, and we can do that in two ways: one way is to use the metadata of a VM, setting on that particular, very important VM that it needs to be evacuated; the other way is to mark a flavor, using the flavor's extra specs. There's a small sketch of the metadata approach after the demo.

And here is a demo of how it works. First of all, we launch two VMs on the same host and attach floating IPs to them, so that we can SSH to them or ping them. We ping the first one, and we SSH to the other one. After that, I create a file, so that you will see that, with the shared storage used in this setup, we get the same file back after evacuation; we will literally get the same VM back after a hardware failure. So there I'm creating a text file and checking that it's there; there it is. Now I start pinging this VM as well, so that we can see when we lose connectivity to it and when we get it back. And now the very important part: we want this one VM, on which we just created a very important file, to be highly available, so we mark it, using its metadata, as needing to be evacuated. We check that this took effect, and we can see "evacuate" set to true in the metadata. In another pane we watch the Pacemaker logs to see what happens when we kill the compute node. Everything is set, so we SSH over to the compute node and kill it, and after a while there are messages in the Pacemaker log showing that this compute node has been fenced, which means the fencing process succeeded. And we can see that the VM is now up and running again; we can SSH to it, and there is our file.
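Picking up the metadata-based marking from the workflow description above, something along these lines would flag a VM and let the workflow filter on that flag. The client setup is an assumption; only the idea of an "evacuate" metadata key is taken from the demo.

```python
# Flag a pet VM as needing evacuation, then filter on that flag.
# Client credentials are illustrative; the 'evacuate' key matches the demo.
from novaclient import client

nova = client.Client('2', 'admin', 'secret', 'demo',
                     'http://keystone.example.com:5000/v2.0')

# Mark the important VM via its metadata.
vm = nova.servers.find(name='very-important-pet')
nova.servers.set_meta(vm, {'evacuate': 'true'})

# Inside the workflow: keep only the flagged VMs from the failed host.
def vms_to_evacuate(failed_host):
    servers = nova.servers.list(
        search_opts={'host': failed_host, 'all_tenants': 1})
    return [s for s in servers if s.metadata.get('evacuate') == 'true']
```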
Okay, so the next approach uses Senlin. Senlin is a clustering service for OpenStack; it is designed to orchestrate collections of similar objects, such as Nova instances or Heat stacks. It has a number of policies that enable things like placement and load balancing, and scaling too, and during the Newton cycle the Senlin team is going to work hard on a health policy, which will let you simply keep an eye on your cluster of VMs and automatically bring them back to life when something bad happens to them. But it is not done yet, so right now it's only promising; keep an eye on this project, but it's not usable for this yet.

Okay, so this is a quick summary (well, maybe not quick, but a summary) of the first three solutions we talked about. Again, I would strongly recommend that you go and look at these slides online and take longer to digest them, because there's not enough time to go through all the details here. The highlights, at least for us: we're quite excited about the possibility of integrating the Mistral approach with Congress to do policy-based recovery, and there are those two capabilities of Masakari which are very nice and which we feel the best-of-breed solution of the future should definitely include. In general, these three solutions are really quite similar, and they all do the main job pretty well, so there are considerations other than just functionality worth thinking about.

There are also a couple of proprietary solutions. One is ZeroStack; they actually have a booth here, near the SUSE booth, so please come and visit us and them in one trip if you want. They presented in Tokyo, and it's basically a cloud in a box that you install in your data center, with a software-as-a-service management portal provided remotely; you need a port-443 connection between your cloud and their management service, but that's it, so it's simple to set up. They have VM HA coming as a feature, I believe quite soon; I'm not exactly sure when, but in the next release, I'm told, so that's definitely worth keeping an eye on. It also has some other very interesting features, like an adaptive approach where a node can magically turn from a compute node into a controller node: for example, if a controller fails and you want to keep quorum in the cluster, you might want to boost it from a four-node cluster back up to five by stealing a compute node, which is pretty clever. There's another solution that was presented in Tokyo as well; it's very different, doesn't use Pacemaker, uses other technologies, and has fencing through IPMI plus self-fencing. One of the really nice things about it is its action-matrix approach to dealing with different failure modes in different ways, which is configurable. Unfortunately the source code is not available, so it's not usable outside those companies at the moment, but who knows, that may change in the future.

So, which one should you pick? Here's where we give a highly unbiased decision tree, and I genuinely believe we tried very hard, even though obviously we have vendor allegiances and so on, to stick to the facts. The facts are these: if you want a validated stack that's already been well tested and is commercially supported, then you have options, depending on whether it matters to you that the solution is open source and upstream or not.
If you want it open source and upstream, you saw the Red Hat and SUSE solutions earlier; if you're not too bothered about using proprietary software, I've been told by Canonical that they have partners they can work with who provide proprietary solutions, and some time soon ZeroStack will also be able to offer it. If you want to do it yourself, then you've got these options, and rather than recommending one I would just say: to be fair, evaluate all of them and make your own mind up. Personally, we believe the Mistral approach has the most promise in the long run, maybe; certainly Masakari has the nice features I mentioned; and the OCF resource agents approach is possibly the easiest to deploy right now, maybe. It really depends on your individual case. But don't underestimate the work involved in doing this yourself: building the SUSE OpenStack Cloud version of the solution was a lot of hard work, and if it was hard work for us, it will probably be hard work for you too. And if none of those sounds good, then just wait for the community to come up with something better, which we will be working on, and that brings me nicely on to future work.

Obviously there's going to be a lot of interesting discussion this week. The Product Working Group really cares about this feature, which I find pretty significant, and we want to build a best-of-breed solution, possibly based on Mistral, maybe with elements of Masakari in there somehow. There's other work to be done too, design work and so on; it's all going on, and if you want to get involved, please do. Firstly, we have a lunch meet-up tomorrow at 12:30; just look for a table with the ClusterLabs sign, or just look for the guy with the shaved head, which might be easier. You can join us on IRC; I set up the #openstack-ha channel around the time of the Tokyo summit, maybe just before. There's also an official "[HA]" subject tag, so if you want to talk about high availability on the OpenStack mailing list, please use that tag so that people can filter for those emails and spot them easily, because it's a very high-traffic list. We have weekly IRC meetings, which are logged, so if you can't make it you can look at the logs afterwards; the link is there. And as I said before, there's my openstack-resource-agents project (well, I shouldn't say it's mine, because I inherited it from somebody else, but I'm maintaining it now), and there's the HA guide, which is under active development. So please get involved and tell us what you think.

Now we have, miraculously, about four minutes for questions, so please use the microphones, or I can repeat your question if you can't get to one.

Q: I just wanted to say thank you very much for reusing an existing HA solution and not writing your own again.

A (Adam): You're very welcome.

Q: The Mistral approach looked like it migrated the disk image; is that correct? And do the Mistral and OCF solutions also migrate disk images for the VMs?

A (David): You mean whether the disk was migrated? I didn't hear that clearly, sorry.

Q: The VM disk image, since it still had that same file on it.

A (David): No. When the node goes down, there is no way to reach it, so if you don't have some kind of distributed storage, like Ceph, then the disk is lost at that point. So to use this solution, use some distributed storage.

A (Adam): Any other questions?

Q: Can you show the QR code again?

A (Adam): Oh sure, great idea. Now, do I have a Home key? It's a new laptop.
Q: Thank you. So what's the timeline for the Mistral solution? A realistic timeline, I mean; you mentioned the Red Hat and SUSE solutions are production-ready, but what's the timeline here?

A (Adam): Well, from my perspective, SUSE is very interested in this solution, but we've just released SUSE OpenStack Cloud 6, so it certainly won't appear until version 7, which I guess will be a while away, and there are a few things we still have to figure out. But I think the work that David has done so far has proven that it really works; it's a sound approach with a lot of potential. So it's difficult to say right now.

Q: Hi. Can you please comment on the latency of failure detection, as well as the scale at which it has been tested?

A (Adam): Right, well, it depends on what is failing and how you're monitoring it. Typically, for example, if the compute node fails and you're using pacemaker_remote, it depends on the monitoring interval that you've set for pacemaker_remote from the core control plane; in our product I think we have that set to 10 seconds, I'm pretty sure that's right from memory, so it happens pretty quickly. And one of the things we do to improve the response time, which we didn't really mention, is that once Pacemaker has noticed that the compute node has failed, it forcibly tells Nova: it uses that special "mark host down" API to say to Nova, hey, you haven't noticed yet, but this host really is down, so you need to consider it down. That means we can start the recovery process instantly, which saves a good 60 to 70 seconds, I think, so it's a pretty important thing to use.

Q: Do you have scaling issues, with the number of compute nodes and so on?

A (Adam): I don't think so. I certainly wouldn't claim to be a pacemaker_remote expert, but I work closely with a lot of people who are, and they say it can scale up to large numbers; and I don't believe there would be any significant latency, since the monitoring is low-traffic. So I guess the honest answer is that I'm not entirely sure, but I don't think it's a big problem. Okay, I think we should probably wrap it up and let the next speaker set up. Thanks very much.