Good afternoon, ladies and gentlemen. I'm very glad to see you here. Today my topic is compute node high availability, done in a distributed health-checking way. The slides are a joint effort of Xu Hejie from Intel, who is also known as Alex Xu in the Nova community, Liu Junwei from China Mobile, which is the largest telecommunication company in China, and me, Zhou Zhengsheng from AWcloud, an OpenStack startup company in China. You can call me Zhengsheng.

So here is the agenda. Firstly, I'd like to give some context about why we are doing compute node high availability and what problems we solve, and I will also explain how we implemented the very first version of it. Secondly, I'll discuss the problems in that first HA implementation and how we can make all the checks and heartbeats distributed across the nodes. Then Alex Xu will give us some insights on what needs to be done in the Nova project, and Junwei will share some experience on handling HA events with Ceilometer. At the end, if we still have time, I can show you a live demo.

So what is compute node high availability and why do we need it at all? A compute node is a physical machine running instances. Compute node high availability means that when the machine hardware or a network link fails, or the host operating system crashes, the node should be fenced and shut down, and the tenant virtual machines on the node get relocated and rebooted on other compute nodes. But as far as I know, upstream OpenStack did not consider compute node HA its problem, because OpenStack was first designed to run public clouds, where we assume the workloads are cattle-like. When the compute node dies, the virtual machine dies. Who cares? We host cattle, not pets. This is exactly what cloud computing is about, right? OpenStack expects the tenant workloads themselves to provide fault tolerance and failover: you have to restructure your application in a modern microservices way, or deploy clustering and load-balancing middleware inside your virtual machines.

As OpenStack expands its horizon to private cloud today, you will see that a lot of customers already run virtual machines on proprietary virtualization platforms and depend on the virtualization platform to provide HA for the virtual machines, especially in traditional industries. Among our customers they rely on compute node high availability a lot, and the applications they have may be monolithic proprietary software, so you cannot change the code, or they developed the software themselves without considering HA and rely on the platform instead. In reality, some of our customers in China even consider compute node HA one of the key features: if we don't support it, they will not consider using OpenStack. This is why we tried to add this feature to the AWcloud OpenStack offering.

At first, we thought OpenStack compute node HA was easy. You can even do this at home: just run a cron job or a shell loop, then use nova service-list to find out all the nova-compute services. If a node crashes, or the management network of the node is down, the nova-compute service on it starts to show as down, so you can grep for the down services and pick out the host names, then call IPMI to shut down the physical machine, and finally call nova evacuate. The solution is perfect, and the customer and AWcloud live a happy life ever after, until one day a customer calls our service line to complain that we just killed one of their tenant virtual machines without any reason.
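A minimal sketch of that naive cron-job loop, assuming the classic nova CLI and ipmitool are installed and credentials are in the environment; the host-to-IPMI mapping and IPMI credentials below are placeholders, not the actual deployment values:

```python
#!/usr/bin/env python
# Naive "cron job" HA: find down nova-compute services, power the host off
# over IPMI, then evacuate every instance that was running on it.
import subprocess

IPMI_ADDR = {  # hypothetical mapping: compute host -> IPMI address
    'node-4': '192.168.100.14',
}

def down_compute_hosts():
    out = subprocess.check_output(['nova', 'service-list'], universal_newlines=True)
    hosts = []
    for line in out.splitlines():
        cols = [c.strip() for c in line.split('|')]
        # after splitting on '|': cols[2] is the binary, cols[3] the host, cols[6] the state
        if len(cols) > 6 and cols[2] == 'nova-compute' and cols[6] == 'down':
            hosts.append(cols[3])
    return hosts

def fence_and_evacuate(host):
    # Power the node off via IPMI, then evacuate each instance on it.
    subprocess.check_call(['ipmitool', '-I', 'lanplus', '-H', IPMI_ADDR[host],
                           '-U', 'admin', '-P', 'admin', 'chassis', 'power', 'off'])
    servers = subprocess.check_output(
        ['nova', 'list', '--host', host, '--all-tenants', '--minimal'],
        universal_newlines=True)
    for line in servers.splitlines():
        cols = [c.strip() for c in line.split('|')]
        if len(cols) > 2 and cols[1] and cols[1] != 'ID':
            subprocess.check_call(['nova', 'evacuate', cols[1]])

if __name__ == '__main__':
    for host in down_compute_hosts():
        fence_and_evacuate(host)
```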
Let's see what the problems are here. This is an example architecture that we deploy for our customers. We have a PXE network and a deployment node to install the operating system and OpenStack components automatically. We have three controller nodes; they run MySQL with Galera, a RabbitMQ cluster, and all the API and scheduler services in a highly available fashion. We use Ceph as the distributed storage backend for Glance and Cinder, and we boot our virtual machines from Cinder volumes. This is our typical deployment. We have an OpenStack management network: all the REST API calls and RPC happen on this network. The virtual machines access the Ceph storage over the storage network, and the network traffic between virtual machines runs in the tenant network; it may be a GRE or VXLAN tunnel, but usually we deploy VLANs for our customers because it's a private cloud. All these networks are physically separated. There may be a dedicated hardware switch for a particular network, and if they share a switch we allocate VLAN IDs for them respectively.

The point is that the nova-compute services send heartbeats over the management network. If the management network of a compute node is down, we cannot manage the life cycle of existing virtual machines and we cannot bring up new virtual machines on this node. But it actually does not affect the already running virtual machines and their applications, because the tenant network and storage traffic are not affected by a management network outage. If you consider that the ultimate goal of high availability is to minimize application downtime, then in this case we should neither shut down this compute node nor evacuate it. We should just leave it alone and have the system send an email to a human operator to check the management network on this node. Some vendors use Pacemaker or ZooKeeper and depend on the built-in heartbeat mechanisms of those projects to monitor the compute nodes. But Pacemaker and ZooKeeper require the remote node to send heartbeats, and by default people only deploy one heartbeat network for Pacemaker or ZooKeeper, and that heartbeat network usually runs on the OpenStack management network, so it runs into the same problem. Another problem is that if the storage network or the tenant network is down, the virtual machines might crash or the applications in the virtual machines lose connectivity. In that case you actually do want to evacuate the compute node, or live-migrate the virtual machines if possible. Again, if we only run one heartbeat network, we don't have enough information to make the correct decision.

So here comes our version one implementation to solve the above problems. We designed our own HA monitoring service. We run the monitoring service on one of the controllers, and the service pings all the compute nodes via the management network, the storage network, and the underlay of the tenant network. Then it gathers the compute node connectivity information and consults an action matrix; according to the matrix it may send an email, or shut down and evacuate the host. So let's see what the action matrix looks like. Here is an example matrix.
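A minimal sketch of what such an action matrix might look like as configuration; the rows and actions here are illustrative, and as noted below the right matrix depends entirely on the deployment:

```python
# Hypothetical action matrix: connectivity of a compute node on the three
# networks -> what the monitoring service should do about it.
# True means the node is reachable on that network, False means it is not.
ACTION_MATRIX = {
    # (management, storage, tenant)
    (False, True,  True):  'send_email',          # only management lost: leave the VMs alone
    (True,  False, True):  'fence_and_evacuate',  # storage lost, VMs boot from Cinder/Ceph
    (True,  True,  False): 'fence_and_evacuate',  # tenant network lost, applications unreachable
    (False, False, False): 'fence_and_evacuate',  # node is dead or powered off
}

def decide(mgmt_ok, storage_ok, tenant_ok):
    # Default to doing nothing beyond notifying a human operator.
    return ACTION_MATRIX.get((mgmt_ok, storage_ok, tenant_ok), 'send_email')
```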
Let's look at the first row: if the management network of a compute node is down, we just send an email without interfering with the virtual machines. The second row tells us that if the storage network is down, the compute node cannot access the Ceph cluster; if the virtual machines are booted from Cinder volumes, the guest OS has probably already crashed, so you can just fence and evacuate. But if your virtual machine images are stored on the node's local file system, in that case it should do nothing except send an email to a human operator. What I mean is that the action matrix depends on the OpenStack deployment configuration. It really differs from customer to customer; there is no single correct action matrix for all situations, you design your own.

So let's get back to this. We still had some problems. The first one is that the monitoring service is a single point of failure: it itself needs to be highly available. Secondly, if a compute node happens to lose its power supply, we will not be able to shut it down remotely, because the IPMI unit on the machine also loses power, and we lose control of this machine. And lastly, there is only one monitoring service instance, and it has to probe every compute node on all the service networks. That does not sound very scalable.

So we started implementing distributed health checking. We considered using Pacemaker or ZooKeeper, but finally we decided to build everything on top of Consul. Consul is a well-known project in the microservice and container community. It is primarily a service registry and discovery tool, but it is more than that. It provides a key-value store which is replicated among all the Consul servers. It also offers strong consistency among the server nodes, so you can use it as a distributed lock manager or perform leader election. You can register health checks in the Consul agent; I will explain this later. It can also manage thousands of nodes, according to the official documentation. All these features are exposed via a REST API.

So let's see its architecture. This slide shows how we deploy a Consul cluster in an OpenStack environment. Each controller runs a Consul server, and the Consul servers run the Raft algorithm and synchronize state between the servers. Raft is a consensus algorithm for state machine replication. In short, if we successfully put something into the key-value store on one Consul server, we can read the value on the other Consul servers immediately. This part is like ZooKeeper and Pacemaker. We then run a Consul agent on the rest of the nodes. All the Consul processes, whether server or agent, participate in the Consul cluster. The Consul processes exchange messages using the gossip protocol; for now you can just think of it as a messaging protocol, I'll explain it in the next slide. You can register health checks dynamically in each agent. The agent runs the check locally, and the check is edge-triggered. For example, if we ask the Consul agent to check that the CPU temperature is below 90 degrees Celsius, the Consul agent will not report the CPU temperature to the Consul server every time; the agent only communicates with the server when the CPU is too hot.

So here comes the gossip protocol. It is one of the most famous protocols in the world of peer-to-peer networks. When node 1, for example, wants to broadcast a message to the other nodes in the cluster, it picks a number of nodes at random and sends them the message. In this slide, it sends the message to node 2 and node 3, and node 2 and node 3 will also pick some nodes at random and relay the message.
Finally, the message gets delivered to all the nodes in the cluster. You might think the gossip protocol is not very efficient, that it's just message flooding. But the key point is that for each single node, the bandwidth and CPU cost do not grow much as the number of nodes in the cluster increases, and the convergence time is acceptable for health checking and membership probing. There is a link below to a simulator; it can calculate how long it will take for a message to be delivered to all the nodes, given the cluster size.

And how do we probe each other? The most interesting mechanism in Consul, I think, is how it detects node failure. In ZooKeeper and Pacemaker, the heartbeats happen between a static set of server nodes and a large number of remote nodes. In a Consul cluster, every node just picks some other nodes at random and pings them. If, say, node 1 cannot get a response from node 2, node 1 will pick, for example, node 3 and node 4 and ask them to ping node 2. If, and let's go to the next slide, the other nodes do not get a response from node 2 either, node 1 will broadcast a gossip message saying that node 2 is suspicious. If node 2 can communicate with at least one of the nodes in the gossip cluster, it will eventually receive this message and can respond: hey, I'm okay, I'm not dead. But if node 2 is actually dead, no one sends a refutation, so node 1 decides node 2 is offline and gossips that message to the whole cluster. In this way, every node in the cluster has full knowledge of the membership and status of the other nodes.

So how can we utilize this feature? This is the slide. Firstly, the monitoring service now fetches the membership and status information from the Consul servers instead of probing everything itself, and we solved the monitoring service HA problem. We also implemented a complementary fencing method using Consul's event mechanism.

So let's see the architecture. This is the architecture of solution v2. You can see there are three OpenStack controllers at the top and two compute nodes here. For every important service network, namely the storage, management, and tenant networks, we run a dedicated Consul cluster and perform the distributed probing there. As you can see, there are actually three Consul agents on compute node 1: the first Consul agent joins the gossip cluster dedicated to the storage network, and the same goes for the management and tenant Consul agents. On each controller node we run a monitoring service, so there are three instances of the monitoring service, and they elect a leader with the help of Consul coordination. The service instance which has taken the leader role consolidates all the health checks and membership information from the three Consul servers on its controller node, so it has the connectivity information of all the compute and controller nodes on all three networks. The monitoring service is then able to consult the action matrix I mentioned above and do whatever it should. The other two monitoring service instances just keep checking whether there is a leader or not. If the leader's controller goes down, for example controller 1 goes down, the leader lock is automatically released by Consul, then one of the surviving service instances becomes the leader and continues the monitoring. So here is a little bit more detail about the monitoring service itself.
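A minimal sketch of how the leader instance might pull that information from the three local Consul agents over the HTTP API and feed it into the action matrix, reusing the decide() helper from the earlier sketch; the port layout per network is an assumption, not the actual deployment:

```python
# Ask the three local Consul agents (one per network, each bound to a
# different HTTP port) which members they consider alive, then consult
# the action matrix for every compute host.
import requests

AGENT_PORTS = {'management': 8500, 'storage': 8510, 'tenant': 8520}  # assumed layout

def alive_members(port):
    # /v1/agent/members returns this agent's gossip view of the cluster;
    # Status == 1 is Serf's code for "alive".
    members = requests.get('http://127.0.0.1:%d/v1/agent/members' % port).json()
    return {m['Name'] for m in members if m['Status'] == 1}

def check_cluster(compute_hosts):
    views = {net: alive_members(port) for net, port in AGENT_PORTS.items()}
    for host in compute_hosts:
        action = decide(host in views['management'],
                        host in views['storage'],
                        host in views['tenant'])
        print(host, '->', action)
```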
Alex once asked me why the monitoring services have to elect a leader among themselves, since the Consul servers already elect a leader using the Raft algorithm: can't we just fetch the Raft leader information from the Consul servers and use that? The answer is that the Raft leader role cannot be controlled by the application, so we are not using it. Think of the following case. Suppose the Consul server on the management network of controller 1 is the Raft leader, and we then find that this controller node has lost storage connectivity. From this node's point of view, every other node has lost connectivity on the storage network, but that is not actually true. In this case we cannot trust the information fetched from the storage Consul server on this node, so we want to migrate the leader role to another node with healthy connectivity. So you can see the application-level leader is quite different from the Consul cluster leader. Consul provides a very nice, simple API for the application to acquire and release a lock, and on the official site there is a guide on how to implement leader election using the REST API.

We also keep a list of fenced nodes. Once the monitoring service shuts down a machine, it appends the machine to the fenced-node list and just skips the nodes in that list when it runs the action matrix. This gives the operator a chance to service the machine and fix the problem; otherwise the monitoring service would keep shutting the machine down in a loop. To share this information among all the monitoring service instances, so that we still have it after the leader role migrates, we save it in the Consul key-value store.

That is the main part. The monitoring service uses IPMI to shut down a compute node remotely, but if it fails to do so, it can fire a fence event from the Consul server. The Consul servers then deliver the event, using the gossip protocol, to the node that is to be fenced. In the Consul agent we can register a watch with a handler for the event, so upon receiving the fence event the node fences itself. We currently use the storage network as our Consul fencing network, because it is crucial for our service: once the Consul agent on a node detects it has lost connectivity on the storage network, the node commits suicide and shuts itself down. This self-fencing is a necessary piece: when we lose control of the node from outside, either it kills itself or it has lost power, so either way we are sure the node is shut down and it is safe to evacuate the compute node.

So that is all of our version 2 implementation. And what about a solution with Pacemaker or ZooKeeper? In fact, we use Pacemaker heavily, for MySQL, and I also investigated Pacemaker and ZooKeeper based solutions. But you can see there are a lot of limits in Pacemaker and ZooKeeper. Basically you can achieve the same effect using Pacemaker or ZooKeeper, just with a lot more effort; I think it is much simpler to implement the solution based on Consul. Besides, only Consul provides a scalable distributed probing mechanism.

Our version 2 solution is not perfect, and we want to add more features. The most important thing to do is to reserve some bandwidth for the gossip messages; you can use the gossip simulator I mentioned earlier to calculate the overall bandwidth for each node. So my part ends here. Next is Hejie.
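A minimal sketch of the session-and-lock leader-election pattern mentioned above, following the recipe documented on the Consul site; the key name, TTL, and agent address are illustrative:

```python
# Each monitoring instance creates a Consul session and repeatedly tries to
# acquire a well-known key with it; whoever holds the key is the leader.
import time
import requests

CONSUL = 'http://127.0.0.1:8500'
LEADER_KEY = 'service/ha-monitor/leader'   # hypothetical key name

def create_session():
    r = requests.put(CONSUL + '/v1/session/create',
                     json={'Name': 'ha-monitor', 'TTL': '15s'})
    return r.json()['ID']

def try_acquire(session_id):
    # Returns True only if this instance just became (or still is) the leader.
    r = requests.put(CONSUL + '/v1/kv/%s?acquire=%s' % (LEADER_KEY, session_id),
                     data='i-am-the-leader')
    return r.json() is True

def run():
    session = create_session()
    while True:
        if try_acquire(session):
            pass  # we are the leader: consolidate checks, consult the action matrix
        # keep the session alive so the lock is not released while we hold it
        requests.put(CONSUL + '/v1/session/renew/%s' % session)
        time.sleep(10)
```

If the leader's controller dies, its session expires and the key is released, so one of the standby instances acquires the lock on its next attempt, which matches the failover behaviour described above.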
I will give a quick review of the things we need in Nova to support such a distributed health-check system. This health-check system is built as an external system, and we were really missing some things in Nova to support that, so we had to hack some code in Nova. The first thing is the services API. We can use this API to find out whether a compute service is up or down. Underneath this API is a thing called the servicegroup API, and inside Nova, with its DB, memcache, or ZooKeeper backend, it implements a heartbeat for the nova-compute service. When the heartbeat stops, that is reported back through the API. The problem is that this only monitors the nova-compute service, or rather the nova-compute process, and normally it only runs on the management network. That is also the reason why we need to check the storage network and tenant network with our external health-check system. So we need a way to feed the status our external system has detected back into Nova; we really need a new API for that. This API already landed in Liberty, so after we upgrade our system to Liberty we will get this ability. Once we have marked the service as down, we can begin to evacuate the instances.

This is an API sample for evacuating an instance to a host. Basically you need to specify two parameters. The first one is the host, so your system needs to find out which host the instance should be evacuated to. The problem is that it is hard for an external system to know which host is best, and we may even be doing something that invalidates the original Nova scheduling policy. We really hope this can be done by the Nova scheduler, and this last part is already available since Juno: the host parameter has become optional, so when you evacuate an instance, the Nova scheduler will choose a host for you. But this is still not enough for the scheduling. As we know, there can be custom scheduling policies applied when you boot a new instance, but the problem is that those scheduling hints are not persisted in Nova. So when you evacuate the instance, that scheduling policy is missing, and we still have a chance of invalidating it. That is another thing in Nova, and we hope we can have it in Mitaka. The last thing is the on-shared-storage story. It is pretty similar: it is hard for our external system to know whether the target host is on shared storage or not. There is also in-progress work in the community to make this parameter easier. After that, evacuation becomes very easy, and this is good not just for us; I think it is good for all the people who want to implement their own external health-check system. Okay, that's all about Nova.

Hello everyone. I will talk about how we implemented the HA solution in the China Mobile private cloud. Our solution is based on Ceilometer, because as we know Ceilometer is a good framework for monitoring and alerting. We install a guest agent in the virtual machine as a channel for collecting health data. The guest agent pushes the health data to the Ceilometer compute agent, and finally the data goes into the Ceilometer database. We added two Ceilometer metrics. The first is instance ping delay, which represents the tenant network health, measured by pinging the virtual machine's default gateway. The second is instance disk health, which represents the storage network health. We also added two formulas to trigger Ceilometer alarms, as follows. On the right is our architecture; it is quite simple.
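A rough sketch of what creating such a threshold alarm could look like against the Ceilometer v2 alarm API; the meter name, threshold, endpoint, and alarm-action URL are assumptions for illustration, and a Keystone token is assumed to be obtained elsewhere:

```python
# Create a threshold alarm for a (hypothetical) instance ping-delay meter:
# averaged over 10-second periods, alarming after 3 periods over the threshold.
import requests

CEILOMETER = 'http://controller:8777'   # assumed Ceilometer API endpoint
TOKEN = '...'                           # Keystone token, obtained elsewhere

alarm = {
    'name': 'instance-ping-delay-high',
    'type': 'threshold',
    'threshold_rule': {
        'meter_name': 'instance.ping.delay',   # custom meter fed by the guest agent
        'statistic': 'avg',
        'comparison_operator': 'gt',
        'threshold': 500.0,                    # e.g. milliseconds, illustrative
        'period': 10,
        'evaluation_periods': 3,
    },
    # The alarm action can be any REST endpoint, for example one that
    # triggers the HA handling described in this talk.
    'alarm_actions': ['http://controller:9999/ha/handle'],
}

resp = requests.post(CEILOMETER + '/v2/alarms',
                     headers={'X-Auth-Token': TOKEN},
                     json=alarm)
print(resp.status_code, resp.json().get('alarm_id'))
```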
Now we introduce the two Ceilometer alarms in detail. Here you can see the ping delay alarm: we calculate the average value once every 10 seconds, and if three consecutive average values are all above the threshold, Ceilometer triggers a ping-delay alarm. The disk health alarm is similar, as you can see here. By default, Ceilometer supports three kinds of alarm handler actions: email, SMS, and REST API. In the China Mobile private cloud we use the email and REST API handler actions. For the REST action you can use one of these Nova APIs, or all of them, to satisfy your requirements; you can also use another high-availability system's API as the handler action. Finally, our solution has some advantages: it handles high availability at the virtual machine level, and it can deal with tenant network failures and storage network failures. However, it has some disadvantages: it doesn't deal with management network failures or IPMI network failures, it does a lot of duplicated checks and can itself fail, and it depends on a guest agent, for example the QEMU guest agent. Fortunately, combined with the Consul mechanism, we can overcome most of the disadvantages above. Thank you.

Let's see. We have connectivity, and this is our Horizon; we have done heavy customization of Horizon. The network is a bit slow. Okay, let's see. We are doing the experiment on node 4, so let me live-migrate the virtual machine to node 4. Here we see the logs of the monitoring service. We have three monitoring services running, and these two are in hot-standby mode; they are saying, oh, we are not the leader. And this one is the leader, so it prints the connectivity information. So let's look at node 4. There is a virtual machine on it, and now we just shut down the tenant network interface; it's eth2, a VLAN interface. Our monitoring cycle is about 30 seconds. Actually, by this time the Consul cluster has already detected that node 4 is down on the tenant network. Here you can see that when the monitoring cycle comes, it says it cannot connect to node 4, it uses IPMI to shut down the node, and it schedules host evacuation for node 4 after 65 seconds; we are waiting for the Nova service to be marked down. You can use the API to do this. And it continues checking. Let's look at this node. We have a client that can list the nodes we have fenced, and we can see node 4 is fenced; we fenced node 4 from this machine, and we can fetch the information there. You can see it has already triggered the host evacuation for node 4. This is our own hacked extension, but if you are using Liberty you can use the API instead. So let's see how it goes. Just refresh. Okay, it's rebooted on node 3. Let's see if we can get the connection; the window is over here, let me drag it. So it's up.

So, are there any questions? Please step up to the microphone.

What steps do you take to make sure that the evacuation actually completes?

For now we just wait for 65 seconds, but as I mentioned, I showed the slide with the suicide mechanism here. The suicide mechanism helps us make sure that if we wait at least a certain amount of time, the node has either shut itself down or has lost power.

That wasn't really my question. The nova evacuate call is not reliable; I think there are different ways that an evacuate request can get lost by the system.

Sorry?

There are like half a dozen ways in which a nova evacuate call can get lost.

You mean that the evacuation call itself fails, right?

Yes.
So here, actually, we send an email to the operator, so let's see if I can get an email. It's not 100% reliable, but at least you get the email: if your evacuation fails, you can check your email and see that Nova has a problem. So here, it sent an email to me just now. Thank you.

How many failures can you handle?

That is in our future plan: we want to add a configurable limit on the number of concurrent failures. Right now, if you are over-committing your hardware, handling many failures can be very difficult, but if you keep one or two hosts not so busy, you can plan that capacity in advance. And as long as the Consul cluster is alive, we can continue to operate. The Consul cluster needs more than half of the servers alive, so if we have three servers, as long as two of them are alive it can continue to operate. Even then you can still use the monitoring service, but that is a very strained situation. If you care about high availability, you can deploy more Consul servers on more machines.

You gave some reasons for not using Pacemaker. One was that typically it monitors on the management network.

Yes. And the server nodes of Pacemaker rely on Corosync. In Corosync you can actually use multiple heartbeat networks, but it's not very flexible: you either use them actively or in standby mode. You have seen we have an action matrix, so we have to make our own decisions, and Pacemaker with Corosync is not flexible enough for that.

Yeah. We don't necessarily recommend RRP; it's better to use bonding and VLANs for that sort of stuff, so you get the benefits without the drawbacks.

The situations you described are hard failures, like a network that is not pingable or a server that is powered down. How do you deal with hardware degradation, say your Ethernet drops from 1 gigabit to 100 megabit, or your hard disk has some block errors?

We have a separate monitoring solution for that kind of health monitoring; we can monitor the hardware status there, but that is outside this project's scope. Sorry.

I know I'm hogging all the questions. I'm wondering what limits you hit, because you said one of the reasons for not using Pacemaker was that it didn't scale as well. What limits did you hit?

Actually Pacemaker with pacemaker_remote is very good, I think, but as I said it's a bit harder; let me go back to the slide. Pacemaker originally limits the cluster size, and with pacemaker_remote it can exceed that limit, but if you are deploying multiple networks like this, like Consul does here, you have to add a pingd resource on each compute node, you have to add three of them, and whom do you ping? Because if a controller is down you have to ping another controller, so for the pingd resource you have to configure a target for it to ping, and this adds some complexity to the deployment scripts and management.

Why do you need pingd?

Because pacemaker_remote runs its heartbeat on only one network, so for, say, the storage network and the tenant network you have to use pingd.

Bonding would have helped with that as well.

You mean bonding all the networks and then using VLANs; but that is another kind of deployment. Thank you.