Okay. Hello, everyone. Welcome to this session. Today we will talk about high availability and scalability management of VNFs. First, let's introduce ourselves. My name is Xu Haiwei. I'm from NEC Solution Innovators, and currently I'm mainly focused on the Senlin project as a Senlin core developer. Hello, everyone. This is Xinhui from VMware. I'm a core developer of the Senlin project and also a TSC member of the OPNFV community. Nice to see you here. Okay. There should be another speaker here, Xu Haifeng from ZTE Corporation, but he couldn't come for a visa reason, and we thank him for his contribution to this talk. Okay. First, the agenda. We will briefly introduce Tacker and show how Tacker manages VNF auto-scaling. We'll also cover why we need to integrate Senlin into Tacker and what Senlin can do for VNF auto-scaling and auto-healing. Okay. About Tacker. Tacker is an OpenStack project which provides a VNF Manager and an NFV Orchestrator to deploy and operate network services on OpenStack. In this talk we mainly focus on VNF management, and we only talk about VNF instances. So first, what is a VNF? A VNF is a virtual network function. With network virtualization, all the network service devices can be virtualized, so a VNF can be a virtual router, a virtual load balancer, or a virtual firewall. In the NFV way, for telecom use cases, there are many VNFs working together to provide a network service. So in most cases the VNFs work together, not separately, and for VNF management we need to think about it from a cluster standpoint. Okay, for VNF management, what do we need to do? Of course, VNF lifecycle management is very important, and it's already supported by the Tacker project. Besides that, consider for example a virtual load balancer with many VNFs running under it.
If the number of VNFs keeps increasing, the load balancer will be overloaded and its performance will go down. At that point we need to scale out a new load balancer to solve the problem. All these jobs should be done automatically, so we need the VNF to be auto-scaled. For another case, there is a virtual firewall working there, and for some reason the firewall goes into an error state. At this time we hope the VNF can detect its own state and recover itself, so we hope the VNF can be auto-healing. These are two very important functions for VNF management. Okay, let's first talk about VNF auto-scaling. VNF auto-scaling is already supported by Tacker. Tacker uses the Heat autoscaling group to support this function. Tacker uses a TOSCA template to create a VNF descriptor. In the descriptor there are the essential properties needed to create a VNF, like flavor, image, network, and monitoring policies. All these properties are stored in the database. When creating a VNF, the descriptor is translated by heat-translator, which Tacker uses to turn it into a HOT template. Tacker then uses the HOT template to request Heat to deploy the VNF. So we can see that Tacker's job is just translating the TOSCA template into a HOT template, and the auto-scaling job is left to the Heat autoscaling group. Okay, let's see how Tacker translates the TOSCA template into a HOT template. Here is a sample TOSCA template, containing only the policy part. There are two types of policies: the Tacker scaling policy and the Tacker alarming policy. These two kinds of policies are derived from TOSCA policies, so they can be parsed by the TOSCA parser. By translating these two kinds of policies, we get the Heat autoscaling group resource, the Heat scaling policy resource, and also the alarm resource. All these resources help to manage VNF auto-scaling.
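As a rough illustration of the policy section being described, the scaling and alarming policies in a Tacker VNF descriptor might look like the sketch below. This is a simplified, hypothetical fragment; exact property and trigger names vary between Tacker releases, and `VDU1` is a placeholder target.

```yaml
policies:
  - SP1:
      type: tosca.policies.tacker.Scaling
      properties:
        targets: [VDU1]          # the VDU(s) this policy scales
        increment: 1             # add/remove one instance per scaling action
        min_instances: 1
        max_instances: 3
        cooldown: 120            # seconds to wait between scaling actions
  - vdu_cpu_usage_monitoring:
      type: tosca.policies.tacker.Alarming
      triggers:
        scale_out_trigger:
          metrics: cpu_util      # Ceilometer metric to watch
          condition:
            threshold: 80        # fire when CPU utilization exceeds 80%
            comparison_operator: gt
          action: [SP1]          # invoke the scaling policy above
```

Translating a descriptor like this yields the autoscaling group, scaling policy, and alarm resources in the resulting HOT template.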
But there is a problem here, because Heat doesn't provide an API for the members of the autoscaling group, so Tacker can't manage VNFs that are newly scaled out. That means if a VNF is newly scaled out and goes into an error state, Tacker can't handle it. This problem can be resolved by integrating Senlin into Tacker. Okay, let's see what Senlin is. Senlin is an OpenStack project which provides clustering services. Currently Senlin provides container clusters and VM clusters. Before you create a cluster, you need to create a profile. The profile defines the essential properties needed to create a VM or a container, and the profile is used to create a node. A cluster can contain one or multiple nodes. To help manage the cluster, policies are provided. Senlin defines many kinds of policies, like the scaling policy, the load-balancing policy, and the deletion policy. These policies can be attached to a cluster, and the rules defined in a policy are triggered when the corresponding action happens to the cluster. For example, if the cluster wants to delete a node, the deletion policy is triggered, and the action will follow the deletion policy's rules to delete the node. There is another important module called the receiver. A receiver can receive alarms from monitoring tools like Ceilometer. Currently Senlin supports two kinds of receivers: webhook and message queue. Okay, let's see how to integrate Senlin into Tacker. We can see in this graph that there is only one change from the original one: we don't use Heat to deploy the VNF, but use Senlin to do it. To do this, we need to have the Senlin resources deployed, and the Senlin resources will be deployed by Heat. So we need to write all the Senlin resources into the HOT template. Heat will then deploy the Senlin resources first, and Senlin will deploy the VNF. Okay, let's see how to translate the template. You can see this template is a little different from the original one.
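To make the profile/node/cluster flow concrete, a minimal Senlin profile spec for Nova-server nodes looks roughly like this. It's a sketch; the name, flavor, image, and network values are placeholders.

```yaml
# profile.yaml -- spec for os.nova.server nodes
type: os.nova.server
version: 1.0
properties:
  name: vnf-node
  flavor: m1.small        # placeholder flavor
  image: cirros-0.3.5     # placeholder image
  networks:
    - network: private    # placeholder network
```

With a profile like this you can create a cluster of nodes (for example with `openstack cluster create --profile vnf-profile --desired-capacity 2 my-cluster`) and then attach scaling, deletion, or load-balancing policies to that cluster.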
And we add a new property, driver, to these policy properties. If the driver is heat, the Tacker scaling policy and Tacker alarming policy are translated to a Heat scaling group and scaling policy as before. But if the driver is senlin, the two policies are translated to Senlin resources: Senlin policies, a Senlin cluster, and a Senlin profile, as part of the template. Okay. We can see the VNFs become the nodes of the Senlin cluster, and Tacker can manage the whole VNF lifecycle via the Senlin API. So the VNFs will be under Senlin's control. Okay. The tasks on the Tacker side. There are three tasks. First, add a template translation function so that Heat can deploy the Senlin resources. This job is already implemented and under review. Second, modify the current APIs to support managing the VNF lifecycle manually. This job is not difficult; we just need to add a scaling option, and it will be done in the future. Third, suppose a user deploys a VNF and at that time doesn't realize the VNF will need to be scalable, but later realizes the function is needed. In that case, Senlin can adopt the VNF and make it scalable. This can also be done in the future. Okay. That's all for VNF auto-scaling. There's another important function: auto-healing. Currently Tacker doesn't support auto-healing, but by integrating Senlin into Tacker, the auto-healing function can also be supported. Xinhui will introduce this part for you. Okay. Thank you. Thank you for having me. Here I would like to give an introduction about what Senlin can provide for the auto-healing part. There are four types of availability involved in an OpenStack cloud, actually. You can see four parts here, listing all the factors of availability of an OpenStack cloud.
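When the driver is senlin, the translated HOT template carries Senlin resources instead of a Heat autoscaling group. The fragment below is a hedged sketch of what such a template might contain; the resource names, image, and capacity numbers are illustrative, not Tacker's actual output.

```yaml
heat_template_version: 2016-10-14

resources:
  vnf_profile:
    type: OS::Senlin::Profile
    properties:
      type: os.nova.server
      properties:
        flavor: m1.small          # taken from the VNFD properties
        image: vnf-image          # placeholder
        networks:
          - network: private

  vnf_cluster:
    type: OS::Senlin::Cluster
    properties:
      profile: {get_resource: vnf_profile}
      desired_capacity: 1
      min_size: 1
      max_size: 3                 # from the Tacker scaling policy

  scale_out_policy:
    type: OS::Senlin::Policy
    properties:
      type: senlin.policy.scaling
      version: '1.0'
      bindings:
        - cluster: {get_resource: vnf_cluster}
      properties:
        event: CLUSTER_SCALE_OUT
        adjustment:
          type: CHANGE_IN_CAPACITY
          number: 1               # the "increment" from the TOSCA policy
```

Heat deploys these Senlin resources first, and Senlin then creates the VNF instances as cluster nodes.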
Actually, OpenStack helps manage a group of physical hosts, does the lifecycle management of a group of VMs or containers over these physical hosts, and runs the applications inside the containers and VMs. The four areas list all the factors of availability. For the physical hosts, we definitely need to understand the availability of the nodes, the network, and the storage, including of course the operating system or hypervisor layer of the hosts. For the virtual machine part, we also need to be careful about the virtual network, virtual storage, mobility across different hosts, and manageability. For the application, we have the term service resilience. That's very important: we need to handle quality of service, cost, transparency, and data integrity all together to achieve resilience. And of course, for the OpenStack control plane itself, high availability is definitely a very important part, though we only have time to mention it. What Senlin can help with is the VM-layer availability and part of the application-layer availability. Here the graph shows the framework of Senlin auto-healing. As you can see, Senlin provides a clustering service, as Haiwei just mentioned: we can create a group of objects of the same type and manage it. Then we can attach a health management policy to the target cluster. Once attached, the cluster will be managed by the Senlin engine. The engine will poll, or use monitors, to understand the status of the nodes and the clusters. If any failure is detected, we trigger the whole recovery loop. Here we list Ceilometer and Aodh; those are the OpenStack-native monitors.
But we are not limited to accepting alerts and events only from Ceilometer-like services. We provide an abstraction named the receiver. That means we can use a webhook, or a Zaqar-based event queue, to accept notices from third-party monitors, both open-source ones and enterprise-level monitors; we can collaborate together to provide the auto-healing loop. Here is an overview of the Senlin auto-healing design. The Senlin engine can run as multiple instances, because we need to handle parallel request handling. Different abstractions are listed on the graph. You can see Senlin provides different listening and failure-detection mechanisms. Senlin can listen for specific events, or do polling: if a cluster is registered with the Senlin engine, we can detect failures by polling each node of the cluster to see whether the node is active or not. Then we expose the receiver; the receiver can receive messages and webhooks from third-party monitors. And we have policies: the policies help customize and control the placement and the health management part. Last but very important, Senlin provides a framework to manage different types of objects. We can support Heat stacks, Nova servers, physical hosts, and containers, of course. That means no matter what type of object it is, we can attach a health policy to it and then manage the health and auto-healing.
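A webhook receiver can be declared in the same HOT template, so a third-party monitor has a URL to call when it detects a problem. The sketch below is illustrative: it assumes a cluster resource named `vnf_cluster` from elsewhere in the template, and the triggered action (here a scale-out) would be chosen to match the failure being reported.

```yaml
  healing_receiver:
    type: OS::Senlin::Receiver
    properties:
      cluster: {get_resource: vnf_cluster}   # target cluster (hypothetical name)
      action: CLUSTER_SCALE_OUT              # cluster action to trigger (illustrative)
      type: webhook
```

Once created, the receiver exposes a channel URL; the external monitor simply POSTs to that URL to trigger the configured action on the target cluster.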
In the following slides I would like to go through the different runtime phases, from deployment to placement to the recovery route, because in each phase Senlin provides different help for the auto-healing purpose. Take deployment: this phase is definitely related to availability. Why? Because, for example, we need to handle affinity and anti-affinity, and placement across availability zones. This can only be done at deployment time, and it has a long-lasting impact at runtime. So how does Senlin handle this? We provide the affinity/anti-affinity policy and the cross-AZ placement policy. By attaching such a policy to the cluster, we can control where a new node is placed when we grow the cluster: we can balance the workload across different availability zones to increase availability, and we can specify which node groups should be anti-affined onto different hosts. That's very helpful. Once runtime starts, we can attach the health policy to a cluster. When a policy is attached to a cluster, underneath, Senlin registers this cluster with the health manager. That's a daemon service running inside the Senlin engine. The engine will check all the members of the registered cluster to see whether their status is right; if not, we trigger the recovery. You can see this is internal detection: Senlin handles it purely automatically. And it is also customizable: we can decide to use node status polling, or listen to specific VM lifecycle events. Since polling may have a negative impact on performance when the scale is large, we allow users to choose VM lifecycle events as the detection mechanism to find failures.
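The placement-time policies mentioned above are ordinary Senlin policy specs. A hedged sketch of an anti-affinity policy and a cross-AZ (zone placement) policy follows; the zone names and weights are placeholders.

```yaml
# anti-affinity: spread cluster nodes onto different hosts
type: senlin.policy.affinity
version: 1.0
properties:
  servergroup:
    policies: anti-affinity
---
# zone placement: balance new nodes across availability zones
type: senlin.policy.zone_placement
version: 1.0
properties:
  zones:
    - name: az-1        # placeholder zone
      weight: 100
    - name: az-2
      weight: 100
```

Attaching either spec to a cluster makes the Senlin engine apply its rules whenever nodes are created, so the placement decision happens automatically at deployment time.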
And we also allow integration with third-party monitors. Besides the built-in ways Senlin provides to find failures, we can integrate with open-source monitors such as Ceilometer, Monasca, and Nagios, and we can integrate with enterprise-level monitors such as VMware vROps, which we have already done. That's very powerful if you are keen to know more details about the network, the application's intention, and the underlying traces; that's very useful. All this integration with third-party monitors is done through the receiver abstraction. We have two ways to receive third-party alerts. One is message: that's a Zaqar-based message queue. The other is webhook: it's a URL with authentication and a target cluster action, so the URL lets Senlin know what needs to be triggered once the failure happens. And once a failure is detected, we need to do the recovery. Senlin provides very rich support for recovery, but the recovery actions definitely depend on the type of object Senlin is managing. For a Heat stack, we have actions such as recreate and update, because a Heat stack allows those kinds of operations. For Nova server objects, we have more operations on the option list: we can reboot, rebuild, and recreate. The recreate actually includes fencing, because we need to be sure a failed node is really dead, or some negative impact may happen. And sometimes we may need to handle the recovery by live or cold migration: maybe this host is not reliable and we need to migrate to another host. In the past cycle, we already integrated with Mistral.
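Putting detection and recovery together, a Senlin health policy spec might look like the following sketch. The interval and actions are illustrative, and as just described, the recovery actions actually available depend on the profile type being managed.

```yaml
type: senlin.policy.health
version: 1.0
properties:
  detection:
    type: NODE_STATUS_POLLING   # or listen to VM lifecycle events to avoid polling overhead
    options:
      interval: 60              # poll each node every 60 seconds
  recovery:
    actions:
      - name: RECREATE          # for Nova servers: e.g. REBOOT / REBUILD / RECREATE
```

Attaching this policy registers the cluster with the health manager; from then on the engine checks the members at the given interval and applies the listed recovery action when a node fails.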
That means once a failure happens, we allow the user to run a workflow they already use; in practice, they have already verified that their sequence of recovery steps is useful for their environment, so they can just reuse it inside the Senlin auto-healing loop. And of course we have many different kinds of hypervisors. For example, VMware has fault tolerance, and other hypervisors have their own specific processing for availability and auto-healing. We definitely should support or expose all these actions for the auto-healing scenario. So that's what we can provide. And how do we connect the different choices of recovery options with a failure once it happens? We still use the health policy, where we define what kind of failure counts as a failure for the target cluster, what interval should be used for detection, and what kind of recovery action you want. That's all provided by Senlin. Okay, thank you. That's my part. Thank you. And finally, we'll make a summary and outlook for this session. Okay, by integrating Senlin into Tacker, what can we do? Actually, we can make VNF auto-scaling more flexible and manageable. There's also the VNF auto-healing part; this function is not supported by Tacker yet, and we can do it with Senlin. Besides these two parts, Senlin has more policies, like the load-balancing policy and the deletion policy, and those policies can also be used to manage VNFs. These things can help VNF management in the future, so we can bring them into Tacker as well. Okay, that's all for the presentation. Are there any questions? Hi, just a curiosity. What is that TOSCA-to-Heat translator? Can you talk about it? This one? How does that conversion work?
The conversion is done by heat-translator; that's the name of the tool. If you google "heat-translator" it will come up. Thank you. Okay, thank you very much. Thank you. Thank you for this session.