Okay then, welcome everybody, and thanks for coming. Seeing Mr. Bell in the audience is a great way to make me feel not at all nervous before presenting. It's great to be here. My name is Adam Spiers; I'm a senior software engineer at SUSE, and this is my colleague.

Yeah, my name is Sampath, and I'm from NTT.

So we're going to be talking about high availability for instances, and what we believe is the future of this feature in upstream OpenStack. We've been working on this with a smallish community for quite a while now, so these are the findings.

Firstly, this is a sequel to a talk that I gave with another colleague in Austin. If you didn't see that one, you're very welcome to go back and watch the video recording. I'm going to go over some of the same material to set the scene for the problem, just to make sure everyone's on the same page; there's a bit more detail on the background to the problem in that talk.

Today I'm going to quickly cover the difference between HA on the compute plane versus the control plane, make a case for why we need HA on the compute plane, go through some design goals for what we think the ideal solution looks like, and talk about some existing technologies. Then Sampath is going to tell you more about one of the solutions that he has been leading and working on. We'll talk about what we're up to currently and where we think we're going in the future, and we'll also warmly encourage everyone to get involved.

So firstly, the difference between control-plane and compute-plane HA. Most people are probably familiar with HA on the control plane. It's a pretty standard architecture where you have a cluster with all your services running in it, maybe multiple clusters; you can divide up the responsibility, for example API services in one cluster and the message queue and database in another. There are a lot of different possibilities, but essentially it involves some kind of cluster manager which does automatic restarts of controller services, and on average it increases the uptime of your control plane.

If you look a bit more closely, you've typically got an HAProxy load balancer, you might have Corosync and Pacemaker, maybe keepalived, especially if you're doing Neutron L3 HA these days. These are pretty standard practices which are documented in the official HA guide. I should mention that the HA guide is currently in a lot of flux; I'm one of the people involved in trying to revamp it and get it back up to date, because it has a number of issues which need sorting out.

So HA on the control plane is pretty much a solved problem, mostly. There's still ongoing work in some areas, like Neutron, but it's generally pretty well understood.

Now, if your control plane is the stuff on the left of this diagram, talking to all the compute nodes on the right-hand side, then a failure in the compute plane is not covered by any of the technology I just mentioned. So in that situation, is there a problem that needs solving? I would contend that there is; not for every single cloud out there, but certainly for some. And we all know the pets versus cattle metaphor, which I think I first heard of when Tim was presenting it a long time ago. So, yeah: pets.
I won't go into detail because I think everyone knows this, but pets are unique and stateful, and they take a lot of work to create and look after. Cattle aren't unique; they're stateless, and when something goes wrong you just get another one to replace it. So cattle are basically ideal for the cloud: they're designed to run cloud-native, and they're naturally resilient and disposable.

So if you have a bunch of hypervisors in your cloud and you start populating them with cattle instances, everything's happy. If one of them blows up, you don't panic; you just deploy another one and keep going. You might keep scaling your cloud, you might have different types of cattle in different projects, and anything can fail anywhere, but you just keep going and it's all fine. At some point, though, there's a bit of a mess that needs cleaning up. Ideally the application layer will take care of the cleanup; maybe it talks to the OpenStack APIs and cleans things up itself. But you might have a dumb-cattle situation where you actually want the cleanup to be done automatically by the control plane, by the infrastructure layer. That's still not a big problem. But what if, for example, a whole compute host fails? That really is something that needs to be handled by the infrastructure layer, because it's not a problem that's specific to any one workload. For example, you might want to reboot the host at that point, or react in some other way, or notify an operator. So failures aren't a big problem with cattle, but they do require some thought.

When it gets to pets, the situation is quite a bit more complicated, because now you have pets scattered around, and if a pet fails you can't just redeploy it elsewhere without thinking, because it has state, and that state needs to be protected. You need to make sure that the pet really is dead before you resurrect it somewhere else; otherwise you'll get two copies of the same pet conflicting over the same state. That's where things like fencing come in, to avoid data corruption. You might want to defend against failures in different ways; I'll get onto that in a bit. But in this case, if a compute host fails, we need to be very careful. We can't just reboot it and expect everything to magically heal itself; we need to think about fencing, and about a controlled recovery that is context-aware. There's a minimal sketch of that ordering just below.
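To make that concrete, here is a small, hypothetical sketch of the fence-then-resurrect ordering using python-novaclient. This is not the code of any of the solutions discussed later; it assumes power fencing (e.g. STONITH over IPMI) has already happened out of band, that instances are on shared or volume-backed storage, and that the usual OS_* admin credentials are set.

```python
import os

from keystoneauth1 import loading, session
from novaclient import client

# Authenticate with the standard OS_* environment variables.
loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(
    auth_url=os.environ['OS_AUTH_URL'],
    username=os.environ['OS_USERNAME'],
    password=os.environ['OS_PASSWORD'],
    project_name=os.environ['OS_PROJECT_NAME'],
    user_domain_name=os.environ.get('OS_USER_DOMAIN_NAME', 'Default'),
    project_domain_name=os.environ.get('OS_PROJECT_DOMAIN_NAME', 'Default'))
# Microversion 2.11 added the forced_down flag on services.
nova = client.Client('2.11', session=session.Session(auth=auth))

failed_host = 'compute-3.example.com'  # hypothetical hostname

# Step 1: fence. The host must already be confirmed dead (powered off);
# here we just tell Nova it is definitely down, so the scheduler stops
# using it and evacuation is allowed to proceed.
nova.services.force_down(failed_host, 'nova-compute', True)

# Step 2: resurrect. Rebuild each instance from the dead host elsewhere.
# Doing this in the opposite order risks two copies fighting over state.
for server in nova.servers.list(search_opts={'host': failed_host,
                                             'all_tenants': 1}):
    server.evacuate()
```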
So, as a summary, does OpenStack really need this? I would say that in the general case, yes, we do need compute-plane HA, and the quick summary of why is pretty much the reasons I've just stated. It's not just pets that we need it for; sometimes we need it for cattle too. And despite what some people seem to believe, there are valid reasons for running pets in OpenStack. There are manageability benefits, the whole nice ecosystem within OpenStack for managing your resources, and there's no reason that cattle should be the only thing to benefit from that. Otherwise you would have a cloud here for your cattle and then something else over there for your pets, which is not an economic way of handling your estate. And as developers, we know that it's expensive to completely convert a pet-based, traditional, legacy workload into something that is cloud-aware.

We kind of agreed on this as a community a while ago: we wrote a user story, and the Product Working Group discussed it and decided it was one of the higher-priority user stories out there. But it didn't really go very far in terms of how we actually respond to that user story and do some implementation. I'll talk about that shortly, but firstly, what do we really want from a solution? These are the design goals, at least as we see them.

Firstly, obviously, it has to scale; that's just a given. We seem to have come to a consensus that one way of scaling that works pretty nicely, if you already have Pacemaker running on the control plane, is a nice feature called pacemaker_remote, where a remote proxy daemon runs on your compute nodes and forms an annex to the control plane; it extends the control plane, so you can use Pacemaker on the control plane to control things running on the compute nodes. Because this is not a full mesh (the pacemaker_remote daemons on the compute nodes don't talk to each other; you only get communication between the control plane and each individual pacemaker_remote), it does scale. It's essentially a hub topology. So that's a brief response to the scalability question.

We want to handle different failure modes. Obviously we want to handle an individual compute host failing, which could be the OS, the hardware, or whatever. It could also be an individual process failing, like the libvirt daemon on a compute node, or nova-compute. And of course we also want to handle failures on the control plane, but like I said earlier, that's already addressed. If we have a recovery workflow controller responsible for handling failures on the compute plane, then we need to make sure that it is itself resilient; otherwise we could end up in a situation where suddenly our entire compute plane is no longer protected. We also want to be able to deal with individual VMs failing, from an infrastructure-as-a-service perspective. What we consider out of scope is failures inside the instance: if there's something wrong with the workload inside the instance, that's not the business of the cloud operators; that's the business of the consumers of the cloud. So that's the only one of those explosions that we consider out of scope for the rest of this talk.

Another clear design goal is operability. As operators, we need visibility into what kinds of failures are happening in the cloud and what the system is doing to automatically, or semi-automatically, respond to them. We also want a history of the failures that have happened before, so that, for example, we can see how many failures are happening on a weekly basis and whether that's what we would expect, or too high.
We also need to be able to configure the response to failures in a policy-driven manner, because not every cloud is the same shape or has the same SLAs or requirements. For example, we might decide to configure the response by availability zone, or by project or tenant, or by instance flavor, or even per individual pet: if we have a specific pet that needs special treatment in some specific way, we probably want to be able to cope with that. Every cloud is different.

There are other things we might want to configure. We might want to set aside a certain number of compute hosts as reserved, so that if there are failures on other compute hosts we always have some spare hosts to fail things over to. We might want to configure the retry thresholds on various recovery operations. We might even want to go into the exact workflows that are used for recovery and customize those.

There are clouds in production out there using some of the existing solutions that we're about to show you, and a smooth upgrade path is very important for those. For example, I work for SUSE, so obviously we have customers who are using our current implementation of compute-plane HA, and we want to make sure those customers can be upgraded in the future to an improved, best-of-breed upstream solution. Similarly, NTT have production clouds using their solution, and they need to be able to upgrade to an improved version in a smooth manner.

We want a recovery workflow controller that is intelligent and context-aware. Some examples: if nova-compute fails but the VMs are still running, the VMs are still healthy; you just can't control them, because Nova can't talk to the compute plane from the control plane. In that situation, do you want to automatically kill them and recover them elsewhere? If they're disposable cattle, then maybe it's safe to do that, as long as it's not too big an impact on the overall service. But if they're pets, you might do more harm than good by automatically killing them and trying to restart them elsewhere, because you might cause a significant service disruption. Things like that need to be considered.

There's also the possibility of multiple faults occurring at the same time, and then you need something intelligent enough to look at all of them and decide: do I handle these as separate things, or are they all related, so I should just focus on one of them? And if something goes wrong, maybe you want to set the compute host to maintenance mode until everything's back up and running; otherwise you might end up with new instances, or other activity, landing on that host when it shouldn't.

And this is a use case from the NTT world: shared-storage boundaries. Do you want to talk about this? Yeah, the basic concept is that we use shared storage for each cluster, and without shared storage, evacuation is not going to work as we expect. So we want to be able to define the shared-storage boundaries. I think Nova still doesn't support that, but it will in the future; that's why we specifically ask for this requirement.

Obviously the solution also has to perform: we need a quick response to failures. If you're polling, that might incur extra latency compared with pushing, so we thought that using the standard OpenStack model of notifications on the message bus makes sense, as sketched below.
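As a tiny illustration of that push model, this is roughly what listening for events with oslo.messaging looks like. It's purely a sketch: the broker URL, topic, and the way the handler reacts are assumptions for illustration, not what any of the solutions below actually subscribe to.

```python
import oslo_messaging
from oslo_config import cfg

# Connect to the notification bus (broker URL is illustrative).
transport = oslo_messaging.get_notification_transport(
    cfg.CONF, url='rabbit://guest:guest@controller:5672/')
targets = [oslo_messaging.Target(topic='notifications')]


class FailureEndpoint(object):
    """Events are pushed to us as they happen, with no polling latency."""

    def info(self, ctxt, publisher_id, event_type, payload, metadata):
        print('event %s from %s' % (event_type, publisher_id))

    def error(self, ctxt, publisher_id, event_type, payload, metadata):
        # A real consumer would hand this to a recovery workflow here.
        print('failure %s from %s: %s' % (event_type, publisher_id, payload))


listener = oslo_messaging.get_notification_listener(
    transport, targets, [FailureEndpoint()], executor='threading')
listener.start()
listener.wait()
```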
We might also want to handle certain faults at a higher priority than others. One example: if a host fails, it could well generate faults not just at the host level but at the instance level too, and you don't necessarily want to spend effort recovering an individual instance when you're already recovering the whole host that instance is on. And if there is some reason for delaying the response to a particular fault because you've given it a lower priority, then the operators need to understand that; otherwise they might wonder why an instance is not being recovered automatically, not realizing there's a good reason for it.

And finally, it should go without saying that any good upstream, best-of-breed solution should conform to the OpenStack standards, the Four Opens: open source, open design, open development, open community. We very much subscribe to that. So: typically Python, using all the standard libraries like oslo and so on, and the code should be hosted in the normal OpenStack way, with Gerrit review, CI, and all that. So that's a whirlwind tour through the design goals; that's the direction we've been trying to head in, and it has guided all our existing work.

Now a quick recap, or introduction if you haven't seen it before, of the open-source solutions we're aware of in the community that handle compute-plane HA. The first is one that I've been involved with for quite a while. It's the solution used by the product I work on, SUSE OpenStack Cloud, but the Red Hat product takes a very similar approach as well; we share code between them, and we collaborate on patches and so on.

This is the typical architecture template for these solutions: a recovery workflow controller, with some kind of state tracking what's going on so that it can make intelligent decisions, living in the control plane; and then, like I said, Pacemaker on the control plane and pacemaker_remote on the compute plane for actually controlling what goes on in the compute plane.

In this particular solution, the recovery workflow controller effectively lives inside Pacemaker itself, so the state is tracked in Pacemaker, and we have two components. NovaEvacuate is effectively the recovery workflow controller; it takes care of executing a nova evacuate command, which resurrects instances onto a new compute node once the old one has failed. And then we have a dedicated fence_compute fencing agent, which is responsible for flagging to the NovaEvacuate process that some clean-up and recovery work needs to be done. I'll maybe go into slightly more detail on this if we have time later on.
This solution handles the following types of failures. It handles the big explosion on the left there, which is the whole compute host blowing up, and it handles failures of individual processes, in the sense of trying to restart them if they crash. It does not handle failures of individual instances, so that is a weakness.

I should also mention there's a video demo, which we don't have time to play here, but I included it in the slides, and these slides are online; the QR code was shown at the beginning and I'll show it again at the end. So if you're interested in more details on this particular solution, there's a YouTube video you can watch to get a quick overview of how it works.

A quick summary, then: this is a working solution, and has been for over a year now or more. There's commercial support from SUSE and Red Hat, maybe from others, I'm not sure, and the code is upstream. But like I said, it doesn't handle failures of VMs, and there are a few other corner cases that are problematic; limitations of the design rather than actual bugs. So I'll hand over for the next solution, which is called Masakari.

Yep. So the Masakari project is what we use in production at NTT. It's pretty much the same architecture, but we have taken the recovery flow out as an external service. This is a rough diagram of how Masakari looks. Masakari itself consists of the masakari-api and the masakari-engine. We use Pacemaker for the control-plane HA, and we also use pacemaker_remote on the compute nodes. By default Masakari supports Pacemaker, but you may configure it with Consul or any other HA solution you have; it's a very flexible solution.
Besides the main Masakari project, we have another sub-project called masakari-monitors, and by default we support three types of monitors. The first is the host monitor, which monitors for host failures. The second is the process monitor, which looks for process failures, such as nova-compute failing, libvirtd failing, or other important daemons like the iSCSI daemon or multipathd; if such an important process fails, it sends a notification to the masakari-api. The third is the instance monitor, which uses the libvirt API to monitor individual VMs.

The notifications are sent to the masakari-api, which takes some intelligent decisions based on its rules and passes them to the masakari-engine to do the actual recovery. Currently we have only the Taskflow recovery driver, and it has three main recovery workflows, for process failures, VM failures, and host failures. We also have another project called python-masakariclient, which gives you a nice CLI to configure and control everything in Masakari: the workflows, the configuration, and the boundaries for the failovers.

Here's an example of a process failure. We have a process monitor on each compute node, looking for process failures. If it detects one, it tries to restart the process several times, and if it fails to get the process working properly again, it alerts the masakari-api. That goes on to the masakari-engine, and the end result is that the nova-compute service on that particular compute node is disabled.

The second example is the instance monitor. We use the libvirt API to monitor the activities of the instance, such as frozen I/O or other bad things happening in the VM; so the VM is monitored externally, via alerts from the libvirt API. If the monitor finds something, it alerts the masakari-api by sending a notification; that goes to the masakari-engine, which executes the instance-failure workflow to recover the instance. What that eventually does is stop and start the VM, which will hopefully recover the instance from the failure.
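All three monitors report failures the same way: a REST notification to the masakari-api. Here is a rough sketch of what a process-failure report might look like; the endpoint, port, and field names are our assumptions from reading the API at the time, so check the Masakari API reference for your release.

```python
import datetime

import requests

MASAKARI_URL = 'http://controller:15868/v1'  # assumed endpoint and port
HEADERS = {'X-Auth-Token': '...'}            # a valid Keystone token

# Report that nova-compute died on a host (payload fields are assumed).
notification = {
    'notification': {
        'type': 'PROCESS',  # other assumed types: 'COMPUTE_HOST', 'VM'
        'hostname': 'compute-1.example.com',
        'generated_time': datetime.datetime.utcnow().isoformat(),
        'payload': {'event': 'STOPPED', 'process_name': 'nova-compute'},
    }
}

resp = requests.post(MASAKARI_URL + '/notifications',
                     json=notification, headers=HEADERS)
resp.raise_for_status()  # masakari-engine takes it from here
```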
The third example is the node-failure scenario. If a node goes down, it is detected by the other nodes in the cluster; we use pacemaker_remote to detect those failures, and the host monitor, which is connected to pacemaker_remote, alerts the masakari-api. The API deduplicates the identical notifications coming from each host, aggregates them into a single notification, and hands it over to the host-failure workflow, which evacuates all the VMs to healthy hosts in the cluster.

For this, Masakari supports four types of evacuation pattern (there's a sketch of how these map onto the API just below). In the first, we use the Nova scheduler to select the destination host for the evacuation. In the second, we use reserved hosts: you configure Masakari with some reserved hosts, so for example if you have ten compute nodes, you can put aside another compute node as a reserved host, and it is used as the evacuation destination for the failed host. The third and fourth patterns are hybrids of those: first use the scheduler and fall back to the reserved host if that fails, or the opposite priority.

About Masakari itself: we started it as a GitHub project about two and a half years ago; it's now under the OpenStack namespace, so you can find all the details in the Masakari wiki. The current stable release is stable/ocata. The Masakari team has done a pretty good job of enhancing the recovery engine to support customization of several recovery patterns, and we also have retry-on-failure recovery workflows. The biggest chunk of work has been bringing it up to OpenStack standards, so if you take a look at the code now, it's pretty clean and very understandable.

Yeah, thank you. And by the way, there are quite a few links in this presentation, so when you visit it (it's just a website) you can follow any of the links to whatever information interests you, including these two here.
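Coming back to those four evacuation patterns: as far as we understand the API, they correspond to the recovery_method of a failover segment, and reserved hosts are registered with a reserved flag. The field names and values below are assumptions for illustration, so again, check the API reference; python-masakariclient exposes the same operations as a CLI.

```python
import requests

MASAKARI_URL = 'http://controller:15868/v1'  # assumed endpoint and port
HEADERS = {'X-Auth-Token': '...'}            # a valid Keystone token

# recovery_method selects the evacuation pattern; assumed values:
# 'auto' (Nova scheduler picks), 'reserved_host', and the two hybrids
# 'auto_priority' and 'rh_priority'.
segment = {'segment': {'name': 'rack-1',
                       'service_type': 'COMPUTE',
                       'recovery_method': 'reserved_host'}}
seg = requests.post(MASAKARI_URL + '/segments',
                    json=segment, headers=HEADERS).json()['segment']

# Register a spare compute node as the reserved evacuation target.
host = {'host': {'name': 'compute-spare.example.com',
                 'type': 'COMPUTE',
                 'control_attributes': 'SSH',
                 'reserved': True,
                 'on_maintenance': False}}
requests.post('%s/segments/%s/hosts' % (MASAKARI_URL, seg['uuid']),
              json=host, headers=HEADERS).raise_for_status()
```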
There is one more upstream open-source solution, which started off as a proof of concept and proved itself, but hasn't really been taken much further recently. It's based on Mistral, which, if you haven't come across it, is the workflow-as-a-service project. Any user can write workflows for any use case; in this particular case, obviously, we're interested in workflows revolving around recovery from failures on the compute plane. This example workflow is what you would do if a host fails: you look to see which VMs are on that host, optionally decide which ones you want to recover through some filter, and then do the recovery. And that's the GitHub repository for it, if anyone wants to look.

Okay, so those are the three upstream open-source solutions to this. There are proprietary ones as well, but we're not focusing on those in this talk. Here's a quick comparison of the first two, and the third column is what we're aiming towards now. I'm definitely not going to go through all the details of this matrix, but I do want to point out the most obvious thing, which is that Masakari has a lot of very nice features; it ticks almost all the boxes we want. The one area where the OCF-based solution is maybe slightly more elegant is the host and process monitoring, which is done natively through Pacemaker, and that is what Pacemaker was originally designed for. So the idea is basically to combine the two and get a best-of-breed solution which ticks pretty much all the boxes, or at least a lot of them in the short term.

How are we going to do that? We've been discussing it for quite a long time, over several design summits, and we decided that a divide-and-conquer approach is what's needed. We split the problem into key areas: host monitoring, host recovery, VM monitoring, VM recovery, process monitoring, process recovery. If we can tackle these independently and build them in a modular way, then anyone can implement any one of those components however they want, and it will still interface with all the others. People can mix and match to build whatever they want, and it also gives us a nice, smooth upgrade path. The gory details are in an etherpad going back to the Newton design summit; I can't promise how much sense it will make to you, but it's there if you want to look.

So what we did, after a lot of discussion, was start writing some specs. To be honest, the specs are still in progress; there's still a bit more knowledge in our heads from all our discussions. We've had weekly IRC meetings, and you can read the logs of those as well, so the specs are not a completely up-to-date, faithful representation of where we are right now. In fact, this talk is probably the first time we've exposed all of this thinking publicly, so we're trying to do a better job there. There are specs for the areas I just mentioned, and we also want specs, not yet written, for some other resource agents.

Okay, so that was all the preamble, and this is the interesting bit, or at least I hope it's interesting. The OCF resource agent solution, the first of the three we just covered, looks like this: pretty much the same diagram as before, just rearranged and not as colorful, but don't worry, the colors will come in a second. You have the state in Pacemaker.
I apologize for the small text, by the way, but again, you can look at these slides afterwards. The state here, tracking the failures and the recovery, lives only within Pacemaker. The dividing line across the bottom is between monitoring and recovery, which corresponds to the modular approach we're trying to take. And you can see that this existing solution isn't an elegant separation of those two things, because the NovaEvacuate process straddles the line.

And this is where it starts to get interesting. The way we want to change this is to retire the NovaEvacuate resource agent and replace it with a more capable thing that I've somewhat arbitrarily called nova-host-alerter. It's another resource agent, but one that is capable of alerting: its responsibility does not relate to recovery at all, only the monitoring side of things. It's only responsible for passing failures on to a system that is able to do some kind of recovery. We're aiming for a driver-based approach, so it can talk to Masakari, or it can send an arbitrary REST-style message to whatever HTTP endpoint out there wants to consume it. With this, and maybe some small modifications to the two existing components there in yellow, we can start to make the architecture a lot cleaner.

This also offers quite a nice upgrade path from existing solutions. For example, if one of my customers running SUSE OpenStack Cloud has the components on the left, what we can do in the upgrade path is add a resource agent for nova-host-alerter into the cluster while NovaEvacuate is still running. Then, for the cut-over, all we need to do is literally stop the NovaEvacuate resource and start the nova-host-alerter resource, and the responsibility for recovering failed compute hosts is handed over, with no impact on the control plane or the compute plane, and no downtime of any services. We were quite excited when we came up with that approach.

And it's a bit more flexible again, because if we bring back in a recovery solution based on Mistral, like the proof of concept I mentioned earlier, there's a similar mechanism where Pacemaker uses a custom fencing agent on the left, called fence_evacuate, which delivers the signal to Mistral to initiate the recovery workflow. We could reuse that, either directly from the nova-host-alerter, or by giving Masakari a new driver that implements recovery workflows through Mistral instead of what it currently uses, which is Taskflow.

So: the stuff in red hopefully goes away, the stuff in green appears, the stuff in yellow gets modified, and everything else stays the same. In terms of code changes from where we are now, from the SUSE and Red Hat side, from the NTT side, and for whoever else is using these upstream solutions, it's really quite a minor shift, but it delivers pretty much what we need. Incidentally, this diagram focuses mainly on host recovery, but a similar approach applies to process and VM recovery.
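Since nova-host-alerter is only a proposal at this point, here is a purely hypothetical Python sketch of the driver split being described; the class and method names are invented for illustration, not an existing API.

```python
import abc

import requests


class AlerterDriver(abc.ABC):
    """One driver per recovery backend; the alerter itself only monitors."""

    @abc.abstractmethod
    def alert(self, hostname, failure_type, details):
        """Deliver a failure report. Recovery never happens here."""


class HTTPDriver(AlerterDriver):
    """Generic driver: POST the failure as JSON to any HTTP consumer,
    which could be Masakari, Mistral, or anything else."""

    def __init__(self, endpoint):
        self.endpoint = endpoint

    def alert(self, hostname, failure_type, details):
        requests.post(self.endpoint,
                      json={'hostname': hostname,
                            'type': failure_type,
                            'details': details}).raise_for_status()
```

The cut-over described above would then amount to stopping NovaEvacuate and starting the alerter with whichever driver the deployment wants configured.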
So, I think we're close to time, so we'll just wrap up. I've been working on packaging Masakari upstream in the RPM packaging project; that's now almost complete. I've just got to finish off the masakari-monitors package, but the other two are merged, so any RPM-based distro should benefit from that. The next things are to integrate this new nova-host-alerter approach, and to update the specs so that anyone reading them gets the latest details.

Then, the future work for Masakari: we still need to write a lot of documentation, and we're on it right now; we have some documentation under review and will keep working on it. We have another spec for recovery-method customization, which lets you customize the recovery method to your needs; I'll skip the details. One other thing is that we will try to implement the back-end driver for Mistral. Another very cool feature we're thinking about is Ironic support: making volume-backed Ironic instances highly available through a similar workflow. And we're also looking to get into the big tent, because it's pretty important for our users to be assured that this project stays maintained and in good hands.

Yeah, and there are a few improvements that can be made on the Nova side, which I was talking to one of the Nova guys about yesterday; I think some of this is already in the pipeline, for exposing more details of the recovery process as it happens.

So that's pretty much it. If you want to get involved, please do. We have the dedicated #openstack-ha IRC channel. If you post anything HA-related to openstack-dev or any of the other OpenStack mailing lists, please include the [HA] tag, so that people like myself who are filtering the huge firehose of traffic for HA can spot it. We have weekly IRC meetings that everyone's very welcome to join. There's the Launchpad project there, and a similar one at launchpad.net/masakari, right? Yeah. And like I said, we're working on the HA guide, so please bear with us; there's stuff that needs fixing there. Please get involved.

There again is the QR code and the URL; everything you just saw is available from there, including hyperlinks. And that's it. I think we're out of time, so thanks very much. Oh, one more thing: we have another Forum session the day after tomorrow. It's not just about this topic or about any one company; it's about OpenStack HA as a whole, and any topic is welcome, including control-plane HA. It's like a birds-of-a-feather session for HA. That's Thursday morning, around 11, as part of the Forum. So please come to that if you want to talk more, and of course you can ask questions; just come up afterwards and we're happy to answer anything. Thanks very much. Thank you very much.