OK, thank you, everyone, for joining this presentation. The presentation will be given by the three of us. This is Xinhui from VMware; I'm a core developer of the Senlin project. This is Qiming, from IBM, the PTL of Senlin. And this is Peter, from Nokia; he's a core developer of Mistral. And there is a photo over there: that is Renat, the PTL of Mistral. He could not come this time, but he contributed a lot to this presentation and to this integration, so we appreciate him as well. Now we can start.

High availability is a very complicated problem to solve. In this graph, we show four types of high availability in an OpenStack cloud. In such a cloud, OpenStack manages a group of hosts; on those hosts, OpenStack creates and does lifecycle management for VMs and containers; and inside the containers and VMs, the applications run. The four types of availability involve different kinds of factors. For the host, we need to consider the availability of the physical hardware and the operating system, of course, and of network and storage. For the VM layer, we also need to take care of resource availability, such as network and storage, and of mobility, meaning migration and cross-zone or cross-region placement. For the application layer, we need to take care of service resilience; there are already very popular topics in these different scopes, such as quality of service, cost transparency, and data integrity, which we must consider when working on service resilience. And for OpenStack itself, of course, there are many things to do for reliability.

To help solve the auto-healing problem in this scope, Senlin has already done a lot of work in past cycles to provide a framework for the loop of failure detection, reporting, and recovery. This graph shows the auto-healing loop provided by Senlin. As a clustering service, Senlin can create a group of objects of the same type, such as VMs; we call it a cluster. The cluster can use a standby mode to provide availability through redundancy. After creating a cluster, we can create and attach policies to control placement: when I scale out a cluster or create a new node in it, the placement policy defines where the new node can be placed. More importantly, we provide the health policy, which lets the user specify what kind of detection they want for the target cluster and what kind of recovery action they want to run whenever a failure happens. For step two, we provide different kinds of detection. The first one is polling: after a cluster is registered with the Senlin engine, the engine polls the status of each node in the cluster. We can also monitor and listen for events that happen to the targeted VMs, and maybe to the applications running inside, depending on what kind of events the application can emit. And then Ceilometer and Aodh can send an alarm to a Senlin receiver; the receiver is a very important abstraction provided by Senlin. In this way, we form a complete loop of detection and recovery, and we already finished this framework in past cycles.
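To make this concrete, the health policy just described can be expressed as a YAML spec, roughly like the sketch below. The policy was still evolving at the time, so treat the field names as illustrative rather than authoritative:

```yaml
# Sketch of a Senlin health policy spec; field names are illustrative
# and may differ between policy versions.
type: senlin.policy.health
version: 1.0
properties:
  detection:
    # Poll node status from the Senlin engine at a fixed interval;
    # event-based detection (listening for VM lifecycle events) is
    # the alternative mode described above.
    type: NODE_STATUS_POLLING
    options:
      interval: 60        # seconds between polls
  recovery:
    actions:
      - name: RECREATE    # delete the failed node and create a new one
```

Attaching a policy like this to a cluster is what wires the detection side of the loop to the recovery side.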
But today we are trying to present some extensions we made in the past cycle. We want to turn this framework into an industry solution from two perspectives. The first is to use enterprise-level monitoring to improve the detection side: we should use reliable and scalable monitoring to generate the alerts and to know whenever a failure happens. More importantly, the monitor should be powerful enough to collect different data across different sources; that is very important. So, not only Ceilometer and Aodh: we need to integrate with an enterprise monitor. The other perspective for extending the framework is integration with a workflow service, and here Mistral is our choice. Whenever recovery happens, the action matters. Sometimes, or often, the action can be long-running, which means we need to care about the steps and the status of each step, and at the same time we need to handle parallelism and error handling. All of these things need to be considered as a whole, and that is the reason we chose to integrate with Mistral. On the monitoring side, vROps is just one example of an enterprise monitor; it is a very powerful product, and we collaborated with the Senlin team to provide an industry solution. In the rest of the presentation, we will introduce what Senlin is and what the advantages of Mistral are, and then we will present the vROps part and give a quick demo. Now I would like to invite Qiming for the Senlin deep dive. Thank you.

Okay, just a quick introduction to the Senlin project. Senlin was started about two years ago. The goal is to build a clustering service for OpenStack: a very generic service that helps you manage homogeneous objects on OpenStack, for example Nova servers or Heat stacks. Later on, we also extended Senlin to manage Docker containers. On this page, I'm showing the high-level architecture of the Senlin service. On the client side, we have a command line interface implemented as an OpenStack client plugin, and a dashboard implemented as a Horizon plugin. We also have Python and Java language bindings for interacting with the Senlin service. And we support multi-engine deployments, so that if you are managing large-scale clusters, multiple engines help ensure scalability. Senlin talks to the other OpenStack services through one project, one service: the OpenStack SDK. We are not relying on any of the individual python-*client libraries, for example novaclient or heatclient; we rely on just the OpenStack SDK to talk to any other OpenStack service, including Keystone. We talk to Nova to manage VM servers collectively, and we talk to Heat to manage combinations of everything else. As I just mentioned, we also support Docker containers today, in experimental status; we are still improving that. To make the service even smarter, we have developed a lot of policies that you can attach to and detach from a cluster. Examples include the scaling policy, which helps you determine how you want to scale out or scale in your cluster when certain events are received. We also have a health policy, in experimental status and still being improved, that can heal any node that has failed.
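For a feel of what these policy specs look like, the scaling policy mentioned above can be written roughly like this (a sketch; the Senlin documentation has the authoritative schema):

```yaml
# Sketch of a Senlin scaling policy spec.
type: senlin.policy.scaling
version: 1.0
properties:
  event: CLUSTER_SCALE_OUT      # which operation this policy tunes
  adjustment:
    type: CHANGE_IN_CAPACITY    # add a fixed number of nodes
    number: 1
    min_step: 1
    cooldown: 120               # seconds to wait between scalings
```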
We have deletion policies that help you decide which node to remove from a cluster when you detect that something bad has happened. We also have an affinity policy and other placement policies that help you decide where you want new nodes to be placed: in a given availability zone, in a given region, or with anti-affinity, all those kinds of things. We also wrap the load balancing service with a policy, because we believe users are more interested in the policy than in the service itself. So that's the current status. Senlin provides a lot of primitives, a lot of operations for operating your cluster, that is, a collection of things managed together. We provide basic management of cluster membership: you add a node into a cluster or remove a node from it; you scale it out, scale it in, or resize it. There are a lot of command line options you can specify, and most of the time these operations are sufficient for daily cluster management. We also have policy management support for user scenarios such as auto-scaling, auto-healing, placement, and load balancing, as I just mentioned. There are other use cases we are exploring, for example an active-standby cluster deployment, so that you can do rolling upgrades much more easily.

On this page is the command you would use to create a cluster. When creating a cluster, you specify the desired capacity, the minimum size, the maximum size, and the profile you want to use. The profile here is actually anything that can be abstracted: today we have implemented profiles for Nova servers, Heat stacks, and Docker containers. You can also add your own profile; we have an extension point for that. If you want to use Senlin to manage an array of integers, you can do that.

There are many different ways to use Senlin. I just mentioned that we have a command line interface and a Horizon plugin for interacting with the service. When you are writing policies, there are actually two ways. First, you can use the Senlin command line or web interface directly, in which case you will be writing a YAML file. On this page I'm showing an example: an affinity policy in which you specify whether you want your server group to honor affinity. This is the standalone policy specification you can use to manage your cluster. The other way is that we have fully integrated Senlin resources into Heat; on the right-hand side is a snippet of a Heat template file, and you can specify Senlin clusters and policies as resources in a Heat template. Next, I will invite our friend, a Mistral expert, to introduce how Mistral fits into the whole picture.

Hello, everyone. As mentioned, I'm jumping in in place of Renat, and I will talk about why you would want to use Mistral instead of your own scripted solution, or instead of just using simple Nova or Heat actions, for recovery. If you are not familiar with Mistral, it provides a robust, formalized DSL with all the common building blocks that would otherwise be available to you in a script. You can build your own structures by reusing workflows, and if you are still missing something among Mistral's built-in actions, you can create your own action plugin written in Python and add it to Mistral's action library. Mistral is an OpenStack project.
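If you have never seen the DSL, a minimal Mistral v2 workflow looks roughly like the sketch below; the workflow, task, and input names are illustrative:

```yaml
# Sketch of a minimal Mistral v2 workflow; names are illustrative.
version: '2.0'

reboot_node:
  description: Hard-reboot a server, then read back its status.
  input:
    - server_id

  tasks:
    reboot:
      # nova.servers_reboot is one of Mistral's auto-generated
      # actions wrapping the Nova client.
      action: nova.servers_reboot server=<% $.server_id %> reboot_type=HARD
      on-success:
        - check

    check:
      action: nova.servers_get server=<% $.server_id %>
```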
Mistral is very deeply integrated with OpenStack, which means it provides actions for most of the other OpenStack projects, and that basically opens the door for Senlin to interact with many OpenStack APIs. It also handles authentication to Keystone; if you are using the Keystone v3 API, that means you can use trusts, and that in turn means you can run really long operations if you have very complex recovery procedures. With a script, you would have to handle all of that yourself. From Mistral, you can use familiar actions, like a Nova reboot or a Heat stack update, which are well documented and basically follow the structure of the command line interfaces; you can also find examples online if you are not familiar with the workflow schema. And it is maintained by the community: new actions and resources are continually added to Mistral, so it keeps growing and stays compatible with the latest version of OpenStack. You don't have to maintain it the way you would have to maintain a script.

In the bigger picture, what you can do with Mistral templates, along with the Senlin specs and maybe HOT templates, is deliver the whole recovery solution together. You can also separate your business logic from your input parameters, which means you can have a deliverable package that you send out to multiple customers, while the input parameters are filled in on site. As already shown in the slides, Mistral is capable of parallel execution, which means that if you have multiple nodes to heal, healing can be faster, and this comes out of the box. Another upside is the graceful failure handling available in Mistral workflows: when one thing fails, another usually tends to follow, so this way you have a chance to escalate on an additional failure, or to take more drastic measures to heal your cluster. Also, because Mistral is very capable, maybe your other lifecycle management operations are defined in Mistral as well, so you can deliver them together. You also have the chance to cancel a recovery execution if you think an automatic solution is not favorable at that moment, and this can happen in a way that stops at a reasonable step, a stable state, from which you can even continue later if you wish. Mistral also lets you keep multiple versions of these workflows, which is very useful if you have a heterogeneous cluster: for example, during an upgrade, some nodes have already been upgraded and require a newer version of the recovery action, while other nodes are still running the old software. And something you might not realize: with Mistral you get built-in execution history, because every Mistral workflow run creates a new execution, and every execution records the list of tasks it has done or is about to do. This makes it easy to debug a failed recovery action, and it can serve audit purposes as well. One downside: this database is currently ever-growing, so you have to clean it up once in a while.
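As a sketch of the parallelism and failure handling just described (workflow, input, and task names are again illustrative):

```yaml
# Sketch: fan-out recovery with an escalation path; names illustrative.
version: '2.0'

recover_nodes:
  input:
    - server_ids    # list of failed server IDs
    - image_id      # image to rebuild from

  tasks:
    rebuild_all:
      # with-items fans the action out over the list; the items are
      # processed in parallel by default.
      with-items: server in <% $.server_ids %>
      action: nova.servers_rebuild server=<% $.server %> image=<% $.image_id %>
      on-error:
        - escalate

    escalate:
      # Stand-in for a real escalation step (notification, ticket,
      # or a more drastic recovery workflow).
      action: std.echo output="Automatic recovery failed, escalating."
```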
And I give back the mic to you.

So I will continue and give the introduction to vROps. This is an example we use to show how our extension works together with enterprise products. vROps (vRealize Operations) is a well-regarded product from VMware. This product can do different kinds of metrics collection and analysis covering health management, risk analysis, and efficiency management. The metrics can be collected from the hypervisor and the host level. From the health perspective, for example, we can know whether any network component is suffering from congestion or contention; for risk analysis, the user may want to know whether any CPU is under pressure; and for efficiency, vROps can report whether anything is suffering from low utilization. These are just simple examples; at the enterprise level, vROps provides many different kinds of monitoring and concrete metrics analysis. The more powerful thing about vROps is that the product provides an adapter mechanism that allows different network and storage managers to contribute and integrate their data as well.

This graph shows the improved auto-healing loop, compared with the one on the third slide. Here you can see we implemented a vROps plugin for Heat, which means a group of vROps resources is exposed through Heat. And since the Mitaka cycle, Senlin resources have been integrated into Heat, which means we can use a Heat template to create the Senlin cluster and set up the vROps alerting and monitoring. We then use vROps to monitor the OpenStack clusters created by Senlin and to notify us whenever the condition we care about is triggered. We have already implemented the Mistral plugin inside Senlin, so whenever vROps detects a failure, it notifies Senlin to perform the Mistral-based recovery. These seven steps form the improved auto-healing loop.

In the following slide, I will give an example of what the vROps template looks like. On the left side of the slide, you can see the symptom; that is one resource definition required by vROps. The condition part lists the properties used to define which metric you want to use. The example here is disk space: if the remaining space is less than 5 (the unit here is gigabytes), the symptom is triggered. On the right side, we use the symptom to create a vROps alert; here you can see the connection between the two resources. We use the symptom to define the alert: if the symptom is triggered, vROps generates the alert. And here are two very important resources, because through these two resources we connect the OpenStack-side objects with the vROps-side management. The notification connects to the webhook, which defines what needs to be done. The webhook is generated by Senlin: as Qiming just mentioned, Senlin provides a receiver, so we can generate a webhook that performs a specific action, with authentication. Here, we use a Mistral workflow as the recovery action, generate the webhook, and use it in the notification; the other thing the notification connects to is the alert definition. So the notification ties the alert definition and the webhook together. The custom group is the other very important resource: we use the custom group to identify which objects vROps should target and monitor.
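Putting the pieces together, the Heat template described here would look roughly like the sketch below. The OS::Senlin::* resources follow the Heat integration as I recall it; the VMware::VROps::* resource types and their property names are hypothetical stand-ins, since the plugin's exact schema is on the slide rather than in this transcript:

```yaml
# Sketch only: OS::Senlin::* types are the real Heat integration as
# best recalled; the VMware::VROps::* names below are hypothetical.
heat_template_version: 2016-10-14

resources:
  server_profile:
    type: OS::Senlin::Profile
    properties:
      type: os.nova.server-1.0
      properties:
        flavor: m1.tiny
        image: cirros
        networks:
          - network: private

  cluster:
    type: OS::Senlin::Cluster
    properties:
      profile: {get_resource: server_profile}
      desired_capacity: 2
      min_size: 1
      max_size: 4

  receiver:
    type: OS::Senlin::Receiver
    properties:
      cluster: {get_resource: cluster}
      action: CLUSTER_RECOVER          # exposed as a webhook URL

  disk_symptom:
    type: VMware::VROps::Symptom       # hypothetical type name
    properties:
      condition:
        metric: disk_space_remaining   # hypothetical metric key
        operator: LESS_THAN
        value: 5                       # gigabytes

  disk_alert:
    type: VMware::VROps::Alert         # hypothetical type name
    properties:
      symptoms:
        - {get_resource: disk_symptom}

  notification:
    type: VMware::VROps::Notification  # hypothetical type name
    properties:
      alert: {get_resource: disk_alert}
      webhook: {get_attr: [receiver, channel, alarm_url]}
```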
For the custom group, we can use filters. The filter here matches VM names that contain given keys, to pick out which VMs we need to monitor; other criteria, such as tags, are available as well.

Next is an example of how to use a Mistral workflow in a Senlin health policy to do the recovery. We can skip the other details here; just pay attention to the part defining the recovery action. We use a name to specify which workflow to execute, we use the type "workflow" to mark that this action is in fact a workflow, and under the params section we group all the inputs needed to execute the workflow.
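Extending the health policy sketch from earlier, the recovery section with a workflow action might look like this; the workflow-based recovery was experimental at the time, so the field names are illustrative, and resize_node is a hypothetical workflow name:

```yaml
# Sketch of a Senlin health policy invoking a Mistral workflow on
# recovery; field and workflow names are illustrative.
type: senlin.policy.health
version: 1.0
properties:
  detection:
    type: NODE_STATUS_POLLING
    options:
      interval: 60
  recovery:
    actions:
      - name: resize_node        # Mistral workflow to execute
        type: WORKFLOW           # marks this action as a workflow
        params:
          flavor: m1.small       # inputs passed to the workflow
```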
Now we can give a demo to show the whole flow we improved for the auto-healing purpose. In the demo, we use a Heat template to create a Senlin cluster with two nodes, and we attach a health policy to the cluster that points to the Mistral workflow to be used as the recovery action. Once the policy is attached, Senlin parses it and generates a receiver that uses the Mistral workflow as the recovery action, returning the webhook URL you just saw. That URL is then passed to the vROps notification resource, as I just mentioned. We also use tags: when Senlin creates the cluster, it tags every VM with the Senlin cluster ID, and with that ID vROps can filter out the targets it needs to monitor, so we use these tags to create the custom group. Heat then pulls everything together, creates all the vROps resources, and closes the loop.

Okay, now I would like to show the demo. Here we will skip the Heat template part and just list the receivers already created. You can see we already created a receiver that uses the Mistral workflow as the cluster recovery action, and here is the generated webhook. Then we can show the vROps part as I just presented: here is the symptom, which uses disk space to trigger the alert. Then we run the test through Heat. After the Heat stack runs, we show the different resources created inside vROps; this is the vROps UI. Here you can see the notification, alert, and symptom have been created; the alert definition is what the notification is built from, as I mentioned, and this is the custom group used to filter out the targets vROps needs to monitor. Then, to trigger the alert, we simulate a disk-exhaustion condition: in another thread, we use dd to write data onto the two nodes and exhaust their disks. After two cycles (the interval can be set in the Senlin health policy), the alert is triggered, and you can see a real alert detected by vROps. At the same time, in the Horizon UI, the two nodes are triggered to resize, because the alert tells us the remaining disk space is less than what we need, so we resize the disk to a larger one. The target flavor we use is m1.small; the original flavor has a five-gigabyte disk. And you can see the resize work happening underneath.

Afterwards, you can see all the nodes have been resized to the m1.small flavor. That's our demo. If you want to read more about our workflow, the code is on GitHub, where you can see what we did for the resize. That is the YAML we use on the workflow side: we list all the tasks to check whether the flavor is right and whether we have the capacity to do the resize, then do the resize, and at last confirm and verify that all the nodes have been resized to the target flavor. That is why we use Mistral here; it is a very good project to collaborate with. Okay, that's what we wanted to show. Now we are open to questions and any suggestions. Thank you.
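For reference, the resize workflow described above follows roughly this shape, sketched with Mistral's auto-generated Nova actions and hypothetical workflow and input names; the published workflow on GitHub adds the capacity checks and fuller verification:

```yaml
# Sketch of the resize recovery workflow; names are illustrative and
# the real workflow on GitHub does more checking and verification.
version: '2.0'

resize_node:
  input:
    - server_id
    - flavor            # target flavor, e.g. m1.small

  tasks:
    check_flavor:
      # Verify the target flavor exists before touching the server.
      action: nova.flavors_get flavor=<% $.flavor %>
      on-success:
        - do_resize

    do_resize:
      action: nova.servers_resize server=<% $.server_id %> flavor=<% $.flavor %>
      on-success:
        - confirm_resize

    confirm_resize:
      # In practice, poll until the server reaches VERIFY_RESIZE first.
      action: nova.servers_confirm_resize server=<% $.server_id %>
```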