Good morning, everyone. My name is Pooja Ghumbre. I'm from Platform9 Systems, and I have here with me...

Hi, my name is Sampath. I'm from NTT.

So we are here to talk about zero-touch high availability for your OpenStack virtual machines using Masakari and Consul. Here's a quick look at today's agenda. I'll give you a brief introduction to what Platform9 Systems does, then we'll cover the need for VM HA today and the current status of the OpenStack project called Masakari. Then we'll touch upon Platform9's zero-touch solution to address that problem, and the future work that we are addressing in our roadmap.

So Platform9 Systems was founded in 2013 by a bunch of ex-VMware engineers. We started our journey with SaaS-managed OpenStack, and today we also offer the same for Kubernetes. We also recently introduced a project called Fission, which basically gives you serverless functions as a service on top of Kubernetes. This is all remotely managed, so we're basically a remotely managed hybrid cloud, since we also support public clouds like AWS or GCP in our environment. We work with any hardware or platform that you already have, so your existing KVM, vSphere, or Docker environment can easily integrate with Platform9. One of the main advantages of being SaaS-managed is that we can provide 24/7 monitoring of your infrastructure, so we take care of the entire life cycle, from deployment through seamless upgrades, and we help you with all the troubleshooting issues as well. So that's that about Platform9, and now we'll switch gears to VM HA.

Okay.
Thank you, Pooja. I'm going to explain what high availability for VMs is. I'm not going to go into much detail, because we have given many talks about this at this summit and at previous summits. You can refer to those talks and discussions, and at the end of the presentation we have a slide called "Related topics". Put simply, high availability for VMs is this: especially when you run traditional workloads in your cloud, you want to provide some SLA to your users, and you have to make sure that the VMs are not going to disappear suddenly one day. Those VMs can disappear for a lot of reasons, like host failures, VM (KVM) failures, and process failures, and high availability is about rescuing or resurrecting those VMs from those failures. As I said, there are many talks at this summit that cover these topics in detail.

We also have this long argument about cattle and pets in the cloud. This solution is basically targeted at the pets, but not only at the pets. You may also use it for the cattle, as a kind of alternative recovery mechanism. We have also been doing a lot of work on instance HA and control plane HA in OpenStack. You may refer to the HA documentation on how to configure a cloud with high availability, and we also have specs and user stories that define the use cases for HA.

And this is the rough architecture of Masakari. This is the project we started on GitHub about two years ago, and now it is under the OpenStack namespace. The Masakari team has done a great job bringing it up to OpenStack standards. We now have the API and the engine, which are the main components of Masakari. You also have python-masakariclient, which gives you a nice CLI to control everything. And we have three basic monitors, to monitor process failures, instance failures, and
host failures. By the way, you can configure it with Consul, Pacemaker, or any other HA solution you have. There are a lot of configuration options, and it's highly configurable, so you can refer to the Masakari wiki for the details. Just go to Google and search for "Masakari wiki", and you will hit the page. You will find more details there. I'm going to skip the rest of the details here because we have limited time, so I'll hand back to Pooja.

Okay, thanks, Sampath. So let's look at the Platform9 solution. We talked about using Masakari to recover from a node failure, but the first step before that is detecting that there's a failure in your system, and reconfiguring your system once you know that a node has gone down or a new node has joined your cluster, right? For detecting host failures we use Consul. I'll talk more about that. Now we have Consul and Masakari for failure detection and recovery. So the missing piece here concerns any of these clustering services that you may use for detecting failures.
They need admin intervention when something goes wrong, and that is a problem today that we are trying to address. So Platform9 offers this zero-touch solution, where we automatically reconfigure your HA cluster when a node goes down or when a new node is added to your availability zone. Keep in mind that when I say zero-touch, it does not mean zero downtime. This is still a traditional workload, and you have to expect minimal downtime for recovering from the failure and evacuating the VMs.

As I said, we chose Consul. It's a tool developed by HashiCorp that allows you to do service discovery and health monitoring. We chose it because it's very simple to set up and it inherently requires minimal admin involvement. Apart from service discovery and monitoring, it also provides a key-value store, which you can use for the leader election that Consul relies on. It basically uses two protocols: one is the gossip protocol and the other is a consensus protocol. As you can see in this diagram, there are multiple nodes in your data center. Think of them all as hypervisors running the Consul agent in either client mode or server mode. All of these nodes participate in what is called the gossip protocol, which helps with discovering a failed node. Among the server nodes there is also a leader, which is responsible for handling all the transactions and replicating the state to the follower server nodes. In Consul, consensus requires a quorum of at least (n/2) + 1 servers, and that defines your fault tolerance. If you have three servers, you can tolerate one failure; if you have five servers, you can tolerate two failures. So you can set up your environment accordingly.

So we have all these building blocks now; how do they come together? We added a service called Platform9 HA manager. The top three services are basically part of
your control plane, and HA manager is what is responsible for automatically reconfiguring your HA cluster. If you have a Nova availability zone, you can say "enable HA", and that basically creates a cluster within Consul. HA manager pushes the cluster configuration to the HA slave component that is running on all the hypervisor nodes. HA slave is basically a helper service for Consul that determines whether Consul should run in client mode or server mode. Also, as you can see, the requirement here is that all the KVM hypervisors use shared storage, so that Nova can evacuate once you notify Masakari that a node has failed and needs to be recovered.

Some of the items on the future roadmap include handling simultaneous node failures. When you're reconfiguring a cluster, there's a time window in which you cannot monitor any new host failures, so you need to handle that. That's something we're working on. Apart from that, there is monitoring service and instance health. We are currently looking at compute nodes going down, but an individual instance or a service on that host can also go down, so you need to address those failures as well.

Yeah, and of course on the Masakari side, we were very focused on coding over the last few years, so there is a lack of documentation at the moment. We have minimal documentation, but we are now working on more detailed documentation for Masakari, which is a high-priority item for this release. We also have a spec called recovery method customization, which gives you a highly configurable interface for configuring your custom recovery workflows. For example, for a node failure we currently only have evacuation, but you could configure it to send a mail to the operator before the evacuation and another mail to the operator after the evacuation, saying, "Okay, I'm finished with the evacuation."
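As a rough sketch of the mail, evacuate, mail workflow just described: the Python below chains recovery steps in order. This is not Masakari's actual API. The names `notify_operator`, `evacuate`, and `run_workflow` are hypothetical, and each step only records what it would do instead of sending mail or calling Nova.

```python
from typing import Callable, Dict, List

# Hypothetical step type: each step receives a shared context dictionary.
Step = Callable[[Dict], None]


def notify_operator(message: str) -> Step:
    """Build a step that records a mail to the operator (no mail is sent)."""
    def step(ctx: Dict) -> None:
        ctx["log"].append(f"mail: {message} ({ctx['host']})")
    return step


def evacuate(ctx: Dict) -> None:
    """Record an evacuation of the failed host (no Nova call is made)."""
    ctx["log"].append(f"evacuate: {ctx['host']}")


def run_workflow(steps: List[Step], host: str) -> List[str]:
    """Run the configured steps in order and return what they recorded."""
    ctx: Dict = {"host": host, "log": []}
    for step in steps:
        step(ctx)
    return ctx["log"]


# The workflow from the talk: mail before, evacuate, mail after.
workflow = [
    notify_operator("starting evacuation"),
    evacuate,
    notify_operator("evacuation finished"),
]
log = run_workflow(workflow, "compute-1")
```

Because the workflow is just an ordered list of steps, adding or reordering recovery methods only means editing that list.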
That's just one example, but you can combine a lot of recovery methods in your workflow. We are also looking forward to getting into the big tent, because it will bring a lot of advantages for users and also for the Masakari community, and it would be a nice thing to have.

Yeah, so just to summarize: we are using a distributed consensus protocol for monitoring the state of your HA cluster, which allows you to recover even traditional workloads, similar to what you're used to with your cattle. Basically the idea is to eliminate any admin intervention, so you get a seamless experience. Most of our vSphere customers are quite used to this feature, so this gives you equivalent functionality on the KVM side as well.

These are some of the links that you can take a look at. We have a VM HA demo on YouTube that you can check out. There are also links to the Masakari and Consul documentation. The pf9-ha project, which includes the HA manager and HA slave services I talked about, is part of our open source projects. You can go check out that code and contribute if you can. If you have any questions or comments, I think we don't have time, but we'll be at the back and we can address those. Thank you.
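For reference, the quorum rule mentioned in the Consul part of the talk (at least n/2 + 1 servers, using integer division) can be checked with a few lines of Python. The function names `quorum_size` and `tolerated_failures` are made up for this sketch and are not part of Consul.

```python
def quorum_size(servers: int) -> int:
    """Smallest majority of a server set: floor(n / 2) + 1."""
    return servers // 2 + 1


def tolerated_failures(servers: int) -> int:
    """Number of servers that can fail while a quorum is still reachable."""
    return servers - quorum_size(servers)


# The two cases from the talk: three servers and five servers.
three = tolerated_failures(3)
five = tolerated_failures(5)
```

The same arithmetic shows why even server counts add little: four servers need a quorum of three, so they still tolerate only one failure.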