Hello, everyone. My name is Li Xiangyu, and I'm from China Mobile. I'm a software developer, responsible in our team for the development of OpenStack Ceilometer and for today's topic, Guardian. Today I will share Guardian, our resource high-availability service at China Mobile, and talk about some technical details about it. Since Guardian is not open source yet and few people know it, I will first introduce Guardian's background and some important concepts. Secondly, I will introduce the logical and functional architecture of Guardian. Then I will cover implementation details, such as how VM HA works. And last, I will talk about our plans for Guardian. First, the background of Guardian. In 2016, some of our customers raised the requirement that virtual machines must not be affected by physical machine failures, and must stay stable and continuously usable all the time. But at that time, none of our product lines had a service that could meet the customers' requirements, that is, a service that could detect, alarm on, and recover from failures of the physical machines, virtual machines, and so on. And in 2017, each project put forward higher requirements on the fault detection rate, fault discovery speed, and fault recovery time of the high-availability service. So as you can see, we faced a lot of challenges and problems, because we had to develop new software from scratch. The first difficulty is that in a complex environment there are multiple kinds of resources and multiple kinds of failures, which makes it very difficult to judge accurately whether a failure has really occurred. Secondly, in a large-scale environment, it is difficult to guarantee the effectiveness of high availability. And the last one: some failures, as you can imagine, are very difficult to find and very difficult to recover from.
So we faced big challenges, but we had to find a way out. Based on this background, our team started the research and development of Guardian in 2016 and released the first version of Guardian in the second half of 2016. We have been continually adding new features to Guardian and optimizing it, including the detection and troubleshooting efficiency for the NFV project, since 2017. Guardian was first applied to a private-cloud project in 2016 and has been running stably for two years. By now, Guardian has been applied in many projects, including public cloud, private cloud, and industry cloud. And recently we have been doing a full round of testing of Guardian, because we refactored the code and optimized Guardian's structure this year, and Guardian will soon be applied in a large-scale public cloud in China Mobile with about 5,000 nodes. That is Guardian's background. Next I will introduce four key concepts of Guardian, which should make it easier to understand Guardian: resource, strategy, failure, and action. So what is a resource in Guardian? Guardian offers high-availability services for resources at the IaaS level. That includes compute resources, both physical machines and virtual machines, as well as network, storage, services, certificates, and so on. Next, what is a strategy in Guardian? A strategy defines the way to monitor a resource, the way to define resource failures, and the way to recover from resource failures. And what is a failure in Guardian? A failure means a problem or issue of a resource in IaaS. It covers the abnormal state of a resource, an abnormal property of a resource, and abnormal related factors that may cause resource failures. The last one is action. Actions in Guardian include monitoring, alarm, record, and recovery. So, do these concepts make you more interested in Guardian? OK, let's take a deeper look into Guardian.
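To make the four concepts concrete, here is a minimal data model sketch. All class and field names here are illustrative assumptions for explanation, not Guardian's actual implementation:

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Illustrative model of Guardian's four key concepts.
# Names are assumptions, not Guardian's real classes.

@dataclass
class Resource:
    kind: str          # e.g. "vm", "physical_machine", "service"
    name: str
    properties: dict = field(default_factory=dict)

@dataclass
class Failure:
    resource: Resource
    reason: str        # e.g. "abnormal_state", "abnormal_property"

@dataclass
class Strategy:
    monitor: Callable[[Resource], List[Failure]]   # how to monitor
    recover: Callable[[Failure], None]             # how to recover

def run_strategy(strategy: Strategy, resource: Resource) -> List[Failure]:
    """One strategy cycle: monitor a resource, recover any failures."""
    failures = strategy.monitor(resource)
    for failure in failures:
        strategy.recover(failure)   # actions: recovery (alarm/record omitted)
    return failures
```

A strategy here is just "a way to monitor plus a way to recover", which mirrors how the talk defines it.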
Now I will use some diagrams to introduce the architecture of Guardian. First, where is Guardian in a cloud? As you can see, Guardian sits in the service layer, monitoring the state of the resources in the virtualization layer, such as the virtual machines, servers, KVM, and so on. Guardian also has close relationships with other services in the service layer, such as the compute service, Nova, which helps Guardian carry out its actions. And Guardian offers APIs to upper platforms. For example, a management platform can send HA strategies and resource properties through the API, and an operation and maintenance platform can receive the alarm and resource-state messages from Guardian. Next is the logical architecture of Guardian. There is a monitoring agent on each compute node. The agent has a VM pollster, a physical-machine pollster, a service pollster, and so on. A pollster is a function we use to monitor resource state and other factors. The monitoring agent gets samples from the pollsters and sends the samples through the management network to the control node. The collection agent on the control node collects the samples that come from the monitoring agents. The collection agent writes the samples to a database, such as MySQL, and can also write the data to Redis. At the same time, Guardian keeps checking the database for abnormal samples about resources. If Guardian finds failures on a resource, it will take actions to recover the resource from the abnormal state, for example through the OpenStack APIs. Guardian can also act on the resource directly, for example through on-site monitoring. Meanwhile, Keepalived and LVS keep the collection agent highly available, and the Pacemaker service keeps the Guardian service itself highly available. Next is the functional architecture of Guardian.
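The pollster-to-collection-agent pipeline just described can be sketched roughly like this. This is a simplified, illustrative sketch: the names and sample payload are assumptions, and in the real system the samples cross the management network and land in MySQL/Redis rather than an in-memory list:

```python
import time
from typing import Callable, Dict, List

# Illustrative sketch of Guardian's sample pipeline: pollsters run on
# the compute node, the collection agent stores the samples.
# An in-memory list stands in for the MySQL/Redis database.

def vm_pollster() -> Dict:
    """Pollster: probe one resource and emit a sample (stubbed here)."""
    return {"resource": "vm-1", "metric": "qemu_process",
            "value": "running", "timestamp": time.time()}

class MonitoringAgent:
    def __init__(self, pollsters: List[Callable[[], Dict]]):
        self.pollsters = pollsters

    def poll_once(self) -> List[Dict]:
        # Run every pollster and gather the samples to send upstream.
        return [p() for p in self.pollsters]

class CollectionAgent:
    def __init__(self):
        self.database: List[Dict] = []   # stand-in for MySQL/Redis

    def collect(self, samples: List[Dict]) -> None:
        self.database.extend(samples)

agent = MonitoringAgent([vm_pollster])
collector = CollectionAgent()
collector.collect(agent.poll_once())
```

The strategy engine would then query `collector.database` (in reality, MySQL) for abnormal samples.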
As you can see, Guardian consists of a user interface module, a resource module, a strategy module, and an action module, plus the monitoring agent and the collection agent. I already introduced the monitoring agent and the collection agent with the previous picture, so let's look at the user interface module. It's very easy to understand: it's an API layer that offers APIs to upper platforms for sending HA strategies and resources. Second is the resource module. It contains multiple kinds of resource classes, which define some resource attributes by default. In the strategy module, there are many HA strategies for different kinds of resources, such as VM HA, physical machine HA, service HA, and other kinds of HA strategies. In the action module, we can do alarm, recovery, monitoring, database access actions, and so on. Now I will go into more detail about these functional modules of Guardian and show you how they work. This is a typical case of VM HA. As you can see, at the bottom of the picture there is a compute node, and the monitoring agent is running on that node. The VM's QEMU process is running on the same node. The monitoring agent constantly monitors the state of the VM's QEMU process, so when the QEMU process is killed for some unexpected reason, the agent finds the failure and sends samples to the collection agent, and the collection agent writes the samples to the database or to Redis. At the same time, users and administrators can use the user interface module to define strategies, and define resource properties through the resource module, which writes the resource entries to the database. When the strategy engine gets the strategy definition, it starts the HA strategy.
First, it does a pre-check: through the action engine's database access, it checks whether there are abnormal samples about the resource in the MySQL database. If the strategy engine finds a failure, it then does on-site monitoring as a double-check, directly probing the VM's QEMU process to confirm whether the VM is up or down. Then it uses the action engine to do recovery actions, such as restarting the VM's QEMU process, and if the restart fails, migrating the VM to another compute node. After that, the strategy engine does an after-check to confirm that the VM has returned to the normal state. That is a typical VM HA case in Guardian. So next, I will talk about the implementation details of Guardian. The first one is VM HA due to physical machine failures, where the virtual machines are on shared storage. The first steps are very similar to the previous case. The monitoring agent constantly monitors the physical machine's storage network, management network, and business network, and it does not only send abnormal samples to the collection agent; it also collects and sends samples of the normal states, and the collection agent writes the samples to the database. The strategy engine keeps checking the database to see whether it can still get samples about the physical machines' network state. If the strategy engine finds that it cannot get samples, it checks how many nodes in the cloud have stopped reporting samples. The user can set a threshold on the ratio of failed nodes, because many nodes failing in one cloud area at the same time is a different case, and we will only take actions on the nodes if the ratio is below that threshold.
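The pre-check / double-check / recover / after-check cycle for VM HA can be sketched like this. This is illustrative only: the function names are assumptions, and in the real system recovery goes through the OpenStack APIs:

```python
from typing import Callable, List

# Illustrative sketch of Guardian's VM HA strategy cycle:
# pre-check against the sample database, on-site double-check,
# recovery (restart first, then migrate on failure), after-check.

def vm_ha_cycle(db_samples: List[str],
                probe_vm: Callable[[], bool],
                restart_vm: Callable[[], bool],
                migrate_vm: Callable[[], bool]) -> str:
    # Pre-check: look for abnormal samples in the database.
    if "abnormal" not in db_samples:
        return "healthy"
    # Double-check: probe the VM's QEMU process directly (on-site monitoring).
    if probe_vm():
        return "false_alarm"
    # Recovery: try a restart first; if that fails, migrate the VM.
    if not restart_vm():
        if not migrate_vm():
            return "recovery_failed"
    # After-check: confirm the VM is back in the normal state.
    return "recovered" if probe_vm() else "recovery_failed"
```

The double-check before any recovery is the important design choice: a stale or lost sample must not trigger a restart of a healthy VM.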
So if the ratio of nodes that cannot report physical-network samples is over the threshold, Guardian only raises an alarm and sends an SNMP or email message to the upper platform. If the number of failed nodes is not over the threshold we defined before, we use the monitoring module in the action engine to do on-site monitoring, checking whether we can ping the storage network and the other networks. If we can ping the networks and they are in a good state, but we still cannot get samples, then we conclude that some other reason caused the failure, and we just do the alarm action. If we cannot ping the node or get a healthy state for the physical networks, we check the physical machine's power state, and we alarm on the power state, whether it is up or down. Then we power off the bad node, because we must guarantee that the virtual machines keep serving users stably, so we must evacuate the virtual machines to a normal node that can continue serving them. The first step is to power off the bad node: we must make sure the bad node is really down before we can safely do the evacuation. When the VMs have been evacuated to another node, the strategy engine checks the VM states and records them to the database. So next is VM HA due to physical machine failures, but where the virtual machines are on local disks. This is one of our public clouds, in which the compute nodes have two disks and two compute nodes form a pair, like this: DRBD0 on node A is paired with DRBD0 on node B, and DRBD1 on node A is paired with DRBD1 on node B. First, when we cannot get from the database the samples that monitor the physical machine's networks, we check whether the node is in a DRBD environment and find its pair node. Then we look for a spare node; a spare node means a node we haven't used yet, with no virtual machines running on it. If there is no spare node, Guardian raises an alarm.
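The host-failure handling above can be condensed into a small decision sketch. It is illustrative: the threshold semantics and return labels are assumptions, not Guardian's actual code:

```python
def handle_missing_samples(failed_nodes: int, total_nodes: int,
                           threshold: float,
                           can_ping: bool, power_on: bool) -> str:
    """Decide what to do when hosts stop reporting network samples.

    Illustrative sketch of the decision path in the talk:
    mass failure -> alarm only; ping OK -> alarm only;
    otherwise fence (power off) and evacuate the VMs.
    """
    # Too many nodes failing at once: likely not a host fault, alarm only.
    if failed_nodes / total_nodes > threshold:
        return "alarm_only"
    # Networks answer ping but samples are missing: some other cause.
    if can_ping:
        return "alarm_only"
    # Host is unreachable: alarm on power state, then fence and evacuate.
    if power_on:
        return "power_off_then_evacuate"
    return "evacuate"
```

Fencing (powering off) before evacuation prevents a split-brain where the same VM runs on two hosts at once.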
Then, if we find an available spare node, we shut off the bad node and tear down the pair between node A and node B. Then we do the re-pair action between node B and node C, the spare node. If the re-pair fails, Guardian raises an alarm. Then we evacuate the virtual machines from node A to node C, and in the end we check the VM states. The process is more complex than the pictures can show, so I will describe the DRBD re-pair process. In normal times, as we said, the DRBD0 disk on node A is paired with the DRBD0 disk on node B, and the DRBD1 disk on node A is paired with the DRBD1 disk on node B. If something goes wrong on node A, Guardian will shut down node A and do the DRBD disconnect action on node B. Then Guardian copies the DRBD pair configuration with SCP and rewrites the DRBD peer addresses on node B and on the available spare node C. Then it restarts the DRBD service on node B and node C and does the DRBD connect between node B and node C. So now the DRBD0 disk on node B and the DRBD0 disk on node C form a new pair, and the DRBD1 disk on node B and the DRBD1 disk on node C form a new pair too. We mount the DRBD1 disk on node C to a directory, and we evacuate the VMs that were on node A to node C. Virtual machine HA is also used in the dedicated cloud in our company. As you can see, there are several zones: the bad zone, which collects the bad hosts, and the spare zone, which holds a few spare hosts with no virtual machines running on them, on which Guardian will act. Every user of the dedicated cloud has their own user zone. For example, user one has four nodes: VMs are running on three of the nodes, and only one spare host has no virtual machines running on it. If Guardian finds a failure on one of the hosts in a user zone, it takes a series of actions.
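The DRBD re-pair steps can be sketched as an ordered plan. This is illustrative only: the actual drbdadm/scp/service invocations and resource names Guardian uses may differ, so the steps are represented here as plain strings rather than executed commands:

```python
from typing import List

def plan_drbd_repair(bad: str, survivor: str, spare: str,
                     resources=("drbd0", "drbd1")) -> List[str]:
    """Build the ordered step list for re-pairing DRBD after a node failure.

    Illustrative sketch of the flow in the talk; the real Guardian runs
    actual drbdadm/scp/service commands, only named here as strings.
    """
    steps = [f"power off {bad}"]
    # Disconnect every DRBD resource on the surviving peer.
    steps += [f"on {survivor}: drbdadm disconnect {r}" for r in resources]
    # Copy the pair configuration to the spare node and rewrite peer addresses.
    steps.append(f"scp drbd config {survivor} -> {spare}")
    steps.append(f"rewrite peer addresses on {survivor} and {spare}")
    # Restart DRBD on both nodes and connect the new pairs.
    for node in (survivor, spare):
        steps.append(f"on {node}: restart drbd service")
    steps += [f"connect {r}: {survivor} <-> {spare}" for r in resources]
    # Mount the re-paired disk on the spare node, then evacuate the VMs.
    steps.append(f"on {spare}: mount drbd disk")
    steps.append(f"evacuate VMs from {bad} to {spare}")
    return steps
```

For the failure of node A described above, `plan_drbd_repair("A", "B", "C")` yields the fence-first, evacuate-last ordering.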
Sorry, the first step is to evacuate the virtual machines on the bad host to the spare host; then we remove the bad host from the user zone and move it into the bad zone; and then we add a new spare host into the user zone. Next I will talk about VM HA due to a VM OS crash. As you can see, we have a watchdog in the virtual machine, and the monitoring agent always listens to the watchdog events of the virtual machines. Feeding the dog means the watchdog writes some data at regular intervals; if the OS crashes, the watchdog cannot write the data, and by default the VM is restarted when the dog is not fed. The monitoring agent catches the restart event, makes a sample of it, sends it to the collection agent, and the sample is written to the database. The strategy engine checks whether there are abnormal samples for the VM; if it finds that the VM OS crashed, it raises an alarm and then double-checks the VM state in the end. Next is service HA inside the virtual machine. Sometimes the watchdog inside a virtual machine cannot feed the dog for its own reasons. So we have another agent, the QEMU guest agent. The QEMU guest agent reaches into the VM and checks the state of the watchdog service, and the monitoring agent listens for the event that tells whether the watchdog state is normal. If the watchdog service is abnormal, the monitoring agent sends samples to the collection agent, which writes the data to the database. The strategy engine sees the VM service failure and alarms that the VM service crashed, and the QEMU guest agent also restarts the watchdog inside the virtual machine. In the end, the strategy engine checks the VM service state again. This is an example with the watchdog, but we can monitor many other services in the VM; it depends on which ones the user chooses to monitor. The next case is similar.
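The watchdog and guest-agent flow can be sketched as a small event handler. This is illustrative: the real events arrive from QEMU/libvirt, and the event and action names here are assumptions used only to show the two paths:

```python
from typing import Dict, List

def handle_vm_events(events: List[Dict]) -> List[str]:
    """Turn watchdog / guest-agent events into Guardian actions.

    Illustrative sketch: real events arrive from QEMU/libvirt; here
    they are plain dicts, and actions are returned as strings.
    """
    actions = []
    for ev in events:
        if ev["type"] == "watchdog_reset":
            # OS crash: the watchdog was not fed and the VM restarted.
            actions.append(f"alarm: VM {ev['vm']} OS crashed")
            actions.append(f"after-check: verify VM {ev['vm']} state")
        elif ev["type"] == "watchdog_service_down":
            # In-VM service failure reported via the QEMU guest agent.
            actions.append(f"alarm: watchdog service down in VM {ev['vm']}")
            actions.append(f"guest-agent: restart watchdog in VM {ev['vm']}")
    return actions
```

The two branches correspond to the two cases in the talk: the whole guest OS crashing, and only the watchdog service inside the guest failing.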
This one is service HA on the physical machines. The monitoring agents on the compute nodes check the services running on those nodes, and we can also install the monitoring agent on the controller nodes. The monitoring agent checks the service state and sends abnormal samples to the collection agent, which writes them into the database. The strategy engine finds the abnormal service sample and raises an alarm, then takes actions to recover the service, for example restarting it, and so on. In the end, we check the service state once again. The next one is certificate HA on the physical machine. It's very similar to the other cases: the monitoring agent checks the certificate state and reports it, and if the strategy engine finds an abnormal sample, it raises an alarm. There is no recovery action in this picture, but we can add one ourselves; this case shows the good extensibility of Guardian. The next one is virtual machine network HA. We support network HA for OVS, SR-IOV, and DPDK ports for now. Most of the VM network HA process is similar to the certificate HA case. The biggest difference is the VM network-state pollster in the monitoring agent. The pollster is the function that determines how to monitor the VM's network state, and it supports OVS, SR-IOV, and DPDK ports. For a normal VM whose port is of OVS type, the pollster monitors the qvb and qvo ports. For an SR-IOV VM, since an SR-IOV port lets packets reach the physical network without going through the virtualization layer, we can only monitor the state of the physical network the SR-IOV port is on. And for a DPDK VM, we monitor the vhost-user port. The next one is VM network QoS HA. We support OVS and SR-IOV right now. When we create a new VM, we get the bandwidth QoS of the hosts, because nova-compute constantly collects the bandwidth QoS data and sends it to the database.
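The network-state pollster's behavior per port type can be sketched as a simple dispatch. The qvb/qvo and vhost-user names follow what the talk describes; the exact probe targets are assumptions:

```python
def network_probe_target(port_type: str) -> str:
    """Pick what to probe for a VM port, based on the port's type.

    Illustrative sketch: OVS ports are checked via their qvb/qvo
    devices, SR-IOV traffic bypasses the virtualization layer so only
    the physical network can be checked, DPDK uses the vhost-user port.
    """
    targets = {
        "ovs": "qvb/qvo ports on the host",
        "sriov": "physical network the port is attached to",
        "dpdk": "vhost-user port",
    }
    if port_type not in targets:
        raise ValueError(f"unsupported port type: {port_type}")
    return targets[port_type]
```

The SR-IOV branch is the interesting one: because the packets never pass through the virtual switch, host-side port checks tell you nothing, so only the physical network's health can be observed.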
Then, when we create the VM, we check the bandwidth QoS of the hosts, filter on the bandwidth state, and schedule the VM to a suitable compute node. Nova-compute keeps monitoring the bandwidth QoS on the compute node and sends network-QoS-over-threshold notifications to Guardian, and Guardian can raise an alarm. And the last part is our plan. We will continually optimize Guardian and add new features. We want to optimize Guardian so that it can do HA faster and more accurately, with better scalability, a better management and concurrent-processing mechanism, and physical machine HA based on CPU, memory, disk, network, and so on. We also hope for cooperation and joint development in the future, and we want to open-source Guardian and contribute our HA experience to the community. Thank you.