Time's up, so good afternoon, everyone. I'm Wu Jiang, and I work at FiberHome in China. Today we'll talk about VMHA, under the topic of a better VMHA solution: split-brain solving and host network fault awareness. Here is the agenda — today's presentation consists of five parts, which you can see on the screen. OK, let's get started.

As we all know, high availability of VMs was born with virtualization technology and was a very creative feature at that time. In today's cloud scenarios, more and more applications are moving to the cloud and becoming cloud-native. But some legacy services are still unreformed, or cannot be reformed in the short term, so they still need HA features to ensure the reliability of their applications. We analyzed the traditional HA solutions, and they still have some disadvantages: they rely on IPMI, they can only handle single failures, and they have almost no solution to the split-brain problem. So based on this analysis, we wanted to build a better VMHA solution.

Before the implementation, we put forward two requirements for design and features. For design, it must integrate with our product, be independent of OpenStack, and avoid modifying native code as much as possible. For features, it must be able to handle the split-brain problem and some complex faults. Those are our goals, and next I'll begin to explain our method.

This is the architecture of our HA solution. It is based on CentOS and OpenStack, and the shared storage we chose is CephFS or NFS. As you can see from the graph, there are three main layers. The top layer is the controller cluster; at this layer we developed the HA Stack component, which acts as a brain to control the global HA behavior through its agents. The bottom layer is the compute nodes, where we also implemented a lock manager called FitLock for split-brain protection.
Finally, the three clusters built on etcd in the middle act as a bridge; they help HA Stack sense the status of each node. The combination of the components above constitutes our VMHA solution. This table shows some specific information about each component, including its description, deployment, and reliability requirements, and I'll describe them one by one here. This diagram shows where each component in the system is deployed and how it's connected. Notably, besides the IPMI network, the controller and compute nodes are connected by three major network planes: management, storage, and service. We'll use these in the following explanations.

Let's take a look at the usage scenarios. This graph illustrates how the system interacts with end users. General users can create HA VMs and modify HA attributes, and their HA VMs will automatically recover from a host failure. Admins can configure the HA strategy and other parameters, and they can also turn off the HA capability when needed. HA Stack itself provides host fault detection and HA task tracking capabilities, and in some scenarios it also needs to perform fencing actions. Those are the most common interaction scenarios of the system.

Here I also want to introduce the term fencing. It means the process of locking resources away from a node whose status is uncertain. In our solution, it means stopping the related VMs when the host's status is uncertain.

After introducing the scenarios, let's see the workflow of several HA actions. The first one is creating an HA VM. Similar to the common VM creation process in Nova, you only need to add the HA metadata when you create an HA VM; then libvirt will go to FitLock to register the VM before it runs. The second one is HA itself. HA Stack periodically polls the status of each node, and once a host network is abnormal, the HA process is triggered.
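The create-HA-VM flow just described — tag the VM with HA metadata at creation, then register it with the lock manager before it runs — could be sketched roughly like this. All names here, including the `HA_ENABLED` metadata key and the `FitLockClient` interface, are hypothetical illustrations, not the product's actual API:

```python
class FitLockClient:
    """Minimal stand-in for the FitLock registration step (names assumed)."""
    def __init__(self):
        self.registered = set()

    def register(self, vm_id: str, host: str) -> None:
        # In the real system this writes a lease record into the host's
        # lock space on shared storage before the VM starts running.
        self.registered.add((vm_id, host))


def create_vm(metadata: dict, vm_id: str, host: str, fitlock: FitLockClient) -> dict:
    """Sketch of the create flow: ordinary VMs skip the lock manager entirely."""
    is_ha = metadata.get("HA_ENABLED") == "true"   # hypothetical metadata key
    if is_ha:
        fitlock.register(vm_id, host)              # register before the VM runs
    return {"id": vm_id, "host": host, "ha": is_ha}


fl = FitLockClient()
vm = create_vm({"HA_ENABLED": "true"}, "vm-1", "compute-1", fl)
```

The point of ordering registration before boot is that the lease exists from the first moment the VM could touch its disk.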
After the necessary basic checks and storage detection, the HA VMs will be evacuated from the faulty host to a healthy host that complies with the HA strategy. HA Stack then uses task tracking to ensure the HA action completes. The last workflow is about fencing. The HA Stack agent continuously maintains a heartbeat connection with the shared storage. Once a host's heartbeat is interrupted, a fencing event is reported to HA Stack. If the agent gets a response in time, it follows the instructions; otherwise, it executes fencing by default.

Here I also want to clarify the scope of HA detection. When will it trigger HA? When an exception occurs on a host network plane and this exception conforms to the HA strategy, HA Stack will trigger the HA action. When will it not trigger HA? When the VM status is not active — stopped or error — or internal exceptions occur inside the VM, HA will not trigger. When the VM's virtual network is abnormal, we still won't trigger HA, because HA Stack only handles the physical network failures of a host; VM virtual network failures are still handled by Neutron. And by the way, when the core components themselves are abnormal, the HA process will also not be triggered.

I've now introduced the basic information about the system, and next I'll explain the two key features in our solution. The first feature is the split-brain solving method. There are two questions that need to be clarified first. What is split-brain? It's the result of a cluster partition where each side believes the other is dead and then proceeds to take over resources as though the other side no longer owns any. What's the influence on the system? I'll use the following graph to give an example. When the controller finds a compute node disconnected, it's difficult to judge whether the broken link is caused by a host failure or a network failure.
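The trigger scope above can be summarized as a small predicate: only a physical host-network fault that matches the configured strategy starts HA, while VM-internal faults and virtual-network faults are out of scope. This is a sketch with assumed event kinds and field names, not the product's real event schema:

```python
def should_trigger_ha(event: dict, strategy: dict) -> bool:
    """Sketch of the HA trigger scope described in the talk (field names assumed)."""
    # Only physical host-network faults are HA Stack's business.
    # VM-internal faults belong to the guest; virtual-network
    # faults belong to Neutron; neither triggers evacuation.
    if event.get("kind") != "host_network_fault":
        return False
    # The faulty plane must also be enabled in the configured strategy.
    return bool(strategy.get(event.get("plane"), False))


# Example strategy: evacuate on management/storage faults, ignore service plane.
strategy = {"management": True, "storage": True, "service": False}
```

A usage example: `should_trigger_ha({"kind": "host_network_fault", "plane": "storage"}, strategy)` is true, while a `vm_internal_fault` event is always false regardless of plane.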
If the controller rushes to trigger VM evacuation at that time, it may cause VMs with the same disk mounted to be running on both hosts, and those VMs concurrently writing the same disk may cause irreversible data corruption. That's unacceptable. To solve this problem, we implemented a read-write lock manager called FitLock, and in addition, to avoid invalid VM fencing actions, we also built a set of fencing protection mechanisms into HA Stack. Please see the graph on the right. In our solution, FitLock ensures that the write locks are unique, so the same VM cannot be started twice at the same time. This fundamentally avoids the occurrence of the split-brain problem.

OK, on this page, let's see the principle of FitLock. FitLock is a lock manager built on shared storage, like sanlock. As you can see from the graph, each host is allocated a unique lock space on the shared storage. Each host has write access to its own lock space and only read access to the others'. Each host periodically updates its own timestamp, and renewing the host lease is equivalent to renewing the leases of all the VMs it hosts. The most important point here is that while a host's lease is being renewed, the VM leases it owns cannot be acquired by anyone else until they have expired. That means a VM that is already running on one node cannot start simultaneously on another node, so there won't be two identical VMs in the system. That's our split-brain solving method.

What's the difference between FitLock and sanlock? First of all, the lock granularity of the two is different: sanlock uses each VM disk as a lock, while FitLock uses the VM itself. In addition, the behavior of the two when dealing with some scenarios is also different. For example, when the heartbeat is lost, sanlock will wait and then kill the HA VMs, but FitLock will ask HA Stack to judge whether to fence.
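The lease mechanism described above — per-host lock spaces, periodic timestamp renewal that covers all of a host's VMs at once, and leases that cannot be taken over until they expire — can be modeled in miniature. This is an in-memory toy with an assumed 60-second expiry, not the real shared-storage implementation:

```python
class LeaseTable:
    """Toy model of FitLock's per-host VM leases (expiry value assumed)."""
    EXPIRY = 60.0  # seconds without renewal before a lease may be taken over

    def __init__(self):
        self.leases = {}   # vm_id -> (owning host, last renewal time)

    def renew_host(self, host: str, now: float) -> None:
        # Renewing the host lease renews every VM lease that host owns.
        for vm, (h, _) in list(self.leases.items()):
            if h == host:
                self.leases[vm] = (h, now)

    def acquire(self, vm_id: str, host: str, now: float) -> bool:
        owner = self.leases.get(vm_id)
        if owner and owner[0] != host and now - owner[1] < self.EXPIRY:
            return False   # lease still fresh on another host: no second copy
        self.leases[vm_id] = (host, now)
        return True


table = LeaseTable()
table.acquire("vm-1", "host-a", now=0.0)    # VM registered on host-a
table.renew_host("host-a", now=30.0)        # host-a keeps heartbeating
```

With this model, `table.acquire("vm-1", "host-b", now=40.0)` fails because host-a renewed only 10 seconds ago, while `table.acquire("vm-1", "host-b", now=100.0)` succeeds once the lease has expired — exactly the "cannot start simultaneously" property.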
In process-restart scenarios, if sanlock is restarted, the locks are lost because part of its data is stored in memory; this is equivalent to a malfunction, and protection is lost. Under FitLock, however, we've built a fencing protection mechanism to prevent such failures, so it has much better reliability.

Next, let's see the behavior of FitLock in some specific scenarios. Case one is the simplest: when an HA VM is created, its lease is registered in the lock space. Case two: the shared storage becomes inaccessible. The HA Stack agent will report a fencing event to HA Stack and wait for a response. If the agent gets a response in time, it follows the instructions — fence or don't fence. In this situation, HA Stack will find that the storage itself is abnormal, so there is no need to fence; the agent gets a "no fencing" command and all HA VMs remain where they are. Case three: the compute node loses connection with the controller nodes. In this case, the original host can still update its heartbeat on the shared storage, so it still holds the leases; the VMs keep running and cannot be started on another host. That means split-brain will not occur.

The last case resembles an isolated island: the compute node is disconnected from all nodes, so neither heartbeats nor fencing events can be passed outside. In our solution, due to the split-brain protection, the HA VMs are stopped at the original host by the HA Stack agent. This is because when the host is completely cut off, all its HA VMs will be evacuated to another host. If fencing is not performed at the original host, then once the host's communication recovers, the old HA VMs would continue to operate on the same disks for a short time — and split-brain could occur. That is why we need to execute fencing in this scenario. OK, that concludes the introduction of FitLock.
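The agent-side decision running through these scenarios — report the fencing event, follow the controller's verdict if it arrives in time, otherwise fence by default — can be sketched as follows. The response channel and the timeout value are assumptions for illustration:

```python
import queue

def handle_fencing_event(responses: "queue.Queue[str]", timeout: float = 10.0) -> str:
    """Sketch of the HA Stack agent's default-fence behavior (timeout assumed)."""
    try:
        # Wait for HA Stack's verdict: "fence" or "no_fence".
        verdict = responses.get(timeout=timeout)
    except queue.Empty:
        # No answer in time (e.g. the isolated-island case): the safe
        # default is to fence, i.e. stop the local HA VMs so they cannot
        # corrupt shared disks after the partition heals.
        verdict = "fence"
    return verdict


# Case two from the talk: storage itself is at fault, controller says don't fence.
replies = queue.Queue()
replies.put("no_fence")
```

With a reply queued, `handle_fencing_event(replies)` returns `"no_fence"`; with an empty queue and a short timeout it falls back to `"fence"`, matching the isolated-island case.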
Now I'll explain the second key feature: host network fault awareness. As you can see from the graph, we introduced etcd between the controller nodes and the compute nodes. Corresponding to the three physical networks, we established three clusters on etcd: management, service, and storage. From the bottom side, the HA Stack agent updates the heartbeat of each network to etcd every 20 seconds. From the upper side, HA Stack obtains the connectivity of each node by reading the three network clusters from etcd. Once a host's network heartbeat has not been updated within two minutes, the HA process is triggered according to the configured strategy. Through this up-and-down update mechanism, HA Stack ensures awareness of each node.

About the communication method: the interaction between nodes goes through the etcd API. According to the different message types, we adopted two transmission methods. The heartbeat of each network is updated via the corresponding network plane, while key messages such as fencing events are reported through all three network planes together, and HA Stack removes the redundancy during processing.

About the HA strategy, we use a JSON template to configure it. The default HA strategy is shown in the table below. For example, we trigger the HA process when the storage plane is interrupted, and of course you can customize the strategy based on your own business needs.

So here, the two key features of the system have been introduced, and next I'll show you the related features we've built into the solution. Speaking of a long-running task, we naturally think of how to judge its execution result, so we implemented a task tracking module in HA Stack. All HA actions are tracked there, and failed tasks are retried five times by default.
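The up-and-down heartbeat mechanism above — agents write a per-plane timestamp every 20 seconds, and HA Stack treats a plane as faulty when no update has arrived within two minutes — could be modeled like this. A plain dict stands in for the three etcd clusters, and the key layout is an assumption:

```python
HEARTBEAT_PERIOD = 20.0    # seconds between agent updates (from the talk)
STALE_AFTER = 120.0        # two minutes without an update triggers HA

def record_heartbeat(store: dict, host: str, plane: str, now: float) -> None:
    # In the real system this is a write to the etcd cluster for `plane`,
    # sent over that plane's own physical network.
    store[(host, plane)] = now

def stale_planes(store: dict, host: str, now: float) -> list:
    """Planes on which this host has missed the two-minute deadline."""
    return sorted(plane for (h, plane), ts in store.items()
                  if h == host and now - ts > STALE_AFTER)


store = {}
for plane in ("management", "storage", "service"):
    record_heartbeat(store, "compute-1", plane, now=0.0)
# Only the management heartbeat keeps arriving after a partial fault.
record_heartbeat(store, "compute-1", "management", now=100.0)
```

At `now=130.0` the management plane (renewed 30 seconds ago) is healthy, while storage and service have been silent for more than two minutes, so HA Stack would consult the strategy for those two planes.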
If a task still fails, an alarm is generated to remind the administrator that the HA recovery failed. To prevent excessive HA parallelism, we also developed flow control in the solution. As you can see from the graph, we use a dynamic-length queue to control the global HA rate limit. The default length is 20, and it supports runtime modification — you can change it at any time with no need to restart the process. In addition, the parallel HA rate limit of one host is five by default; a host that exceeds this threshold is filtered out during host selection.

This slide introduces some of the protection mechanisms of the system. First of all, all the related processes are protected by a watchdog. In addition, we implemented two protection mechanisms for large-scale failures in the data center; on the right is an illustration of these features. If more than 50% of the hosts are down, the data center may have experienced a major failure, so HA Stack will stop itself to avoid wrongly evacuating. If the host recovery reaches 70%, HA Stack will automatically resume and continue to provide HA capabilities. These two parameters are also configurable.

We also implemented a feature called HA maintenance to turn the HA functions off or on. This is for certain maintenance scenarios, like OS updates or storage expansion or upgrade. The different behaviors of the system when using this feature are described in the table. Here are some events and alarms related to HA; there are four categories: VM, host, storage, and etcd.

OK, at last, let's take a look at the test results. In our lab, we used a total of 38 compute nodes. The test tools we chose were Rally and shell scripts; we used fio to pressurize the storage and Zabbix to monitor the performance.
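The flow-control behavior — a global HA concurrency limit of 20 that can be resized at runtime, plus a per-host limit of 5 used to filter targets during host selection — might look like this in miniature. The class and method names are invented for illustration; only the default limits come from the talk:

```python
class HAFlowControl:
    """Sketch of the global + per-host HA rate limits described above."""

    def __init__(self, global_limit: int = 20, per_host_limit: int = 5):
        self.global_limit = global_limit      # runtime-adjustable, no restart
        self.per_host_limit = per_host_limit
        self.running = {}                     # target host -> active HA tasks

    def resize(self, new_limit: int) -> None:
        self.global_limit = new_limit         # takes effect immediately

    def eligible_hosts(self, hosts: list) -> list:
        # Hosts already at the per-host limit are filtered out of selection.
        return [h for h in hosts if self.running.get(h, 0) < self.per_host_limit]

    def try_start(self, host: str) -> bool:
        if sum(self.running.values()) >= self.global_limit:
            return False                      # global queue is full
        if self.running.get(host, 0) >= self.per_host_limit:
            return False                      # this target host is saturated
        self.running[host] = self.running.get(host, 0) + 1
        return True


fc = HAFlowControl(global_limit=2)            # small limit to show the effect
fc.try_start("host-a")
fc.try_start("host-a")
```

At this point a third task is rejected by the global limit, but after `fc.resize(3)` it is admitted without restarting anything, which is the runtime-modification property the talk describes.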
In this test, we created a total of 1,000 VMs to be recovered by interrupting more than 50% of the hosts. With different rate-limit values and with or without storage pressurization, we divided the test into three scenarios. After testing, in the first two scenarios, without storage pressurization, the recovery time of a single VM was within one to two minutes, and the total recovery times were 41 and 20 minutes, respectively. In the last scenario, with storage pressurization, the recovery time of a single VM was almost two minutes, and the total recovery time extended to just over one hour. Here are some actual cases of our products covering hybrid and private cloud scenarios.

About future work on the solution: at the functional level, we plan to integrate QGA and HA Stack so that we can handle some faults inside the VM. At the usability level, we also intend to graphically display the HA strategy template so that admins can easily customize the required HA policy on the OM portal. At the performance level, the recovery time needs to be reduced as much as possible. These are the related references. So at this point, my presentation is complete. Are there any questions?

Hi, my name is Sampath, and I'm the current PTL for the project Masakari. In OpenStack we have a kind of similar project, and as far as I understand from your presentation, we already have most of the features. Do you have any plans to integrate with Masakari or contribute to Masakari?

Yes, I made some comparison of HA Stack and Masakari. How to say — firstly, the fault sources are different. Masakari supports VM failure, process failure, and host power failures, while ours deals with host network failures. Exactly. Yes, this is one.
Secondly, from the functional comparison, HA Stack provides the split-brain solving method and can communicate via the three network planes together, but we don't provide external APIs. And also, we don't provide VM internal fault detection yet. OK, thank you. As I said, we do the same thing in a different way. Yes, so maybe we can join forces — I'll discuss it with you later. Thank you. Thank you.

OK. Are there any other questions?

I want to know, are there any plans to build an open-source project from this kind of solution, to provide it to others?

You mean, can we provide it? How to say — right now, this is a private solution within our products, and we are preparing to put a repo on GitHub. We are preparing to provide the download of HA Stack, but it's not there yet, so we'll add it later and announce it on our official website.

I would like to extend an invite for you to join the activities of the Self-healing SIG, a special interest group. We would love to have your input, expertise, and experience — this is one of many self-healing scenarios, and we are trying to build upstream solutions for all of those within the community. The more people we have connected, working on the same stuff together, the faster we can go. So if you'd like to join that SIG, that would be great. Thank you.

Thank you very much. OK, that's all. Thank you very much. Thanks.