Hello everybody, thanks for your time and thank you for coming. Our topic is the requirement analysis of platform reliability in a three-layer decoupling environment for NFV. I'm from China Mobile; my name is Zhiqiang Yu, and this is my partner. Thanks, Zhiqiang. My name is Ian Jolliff. I'm from Wind River, where I'm Director of Engineering and Platform Architect for our Titanium Cloud product, and I'm very happy to be here today with my colleague from China Mobile to talk about reliability in a VIM environment. Okay. As you know, my colleague Xiao Guang recently introduced the NovoNet project in China Mobile. We do the testing and verification of three-layer decoupling, and this topic is one part of that project work. The talk has several parts: the first is reliability on a non-NFV platform; the second is the NFV three-layer decoupling environment; the third is platform reliability in the NFV three-layer decoupling environment. Then we will walk you through several platform reliability scenarios, and the last part is the conclusion. Okay. So now, as we know, in a traditional IT environment, on a non-NFV platform, reliability is totally different from the decoupling environment. The different layers hook into each other through closed APIs. As you can see in this picture, once a fault happens in a specific layer, the other layers can detect it easily, but it's hard, even impossible, to introduce third-party applications. That's not what we want in the future NFV environment. This is the picture of the NFV three-layer decoupling environment, and if you are working in the NFV industry, you will be very familiar with it; it is from ETSI. There are three layers: the hardware layer, the NFV platform layer, and the application layer. Thanks very much.
So when you look at the three-layer decoupling model, there are a few things to keep in mind. A layered approach to reliability is very complementary to the three-layer decoupling approach my colleague was talking about, providing clear separation between the hardware, the VIM/NFVI layer, and the application layer. And in some of the conversations we've had, it came out how critical it is for the NFV orchestrator to be the high-level system brain for end-to-end service availability. There are lots of different layers and ways to approach end-to-end service availability, but by having different roles at the different layers, you get very strong fault isolation as well as no cascading failures between the layers, which at the end of the day gives us higher end-to-end reliability and availability. Here is the traditional picture, in a little more detail, with respect to the ETSI model. Many application sets will have a VNFM managing a set of VNFs. They communicate with the VIM, which is typically based on OpenStack, and there are lots of REST APIs available northbound and southbound into the VIM and the VNFM. So the VIM is able to detect and react to certain failures. Other approaches allow the MANO layer to detect the failures and do recovery. But the conversation we've been having is: is there a fast path for absolutely critical applications where you need faster fault detection and fault recovery? That's probably best done by the VIM. Then there are other situations involving end-to-end service availability, where the NFVO is more likely to be involved and has more context, perhaps in a multi-cloud, multi-site situation, and is better able to handle fault recovery. That might come with a longer time to recovery, right? So, on to the next few slides; we've got just a couple of minutes left.
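The fast-path versus slow-path split just described can be sketched in a few lines. This is a purely illustrative Python sketch, not an ETSI-defined or OpenStack API; the `ServiceProfile` fields and the `choose_recovery_path` rule are assumptions that encode the talk's reasoning: latency-critical, single-site services are recovered locally by the VIM, while multi-site services are escalated to the NFVO, which has the end-to-end context.

```python
# Illustrative sketch only: which layer should own fault recovery
# for a given service, per the fast-path / slow-path reasoning above.
from dataclasses import dataclass


@dataclass
class ServiceProfile:
    name: str
    latency_critical: bool  # needs fast local detection and recovery?
    multi_site: bool        # spans more than one cloud/site?


def choose_recovery_path(profile: ServiceProfile) -> str:
    """Return the layer that should drive fault recovery for this service."""
    if profile.latency_critical and not profile.multi_site:
        # Fast path: VIM detects and recovers locally, then reports upward.
        return "VIM"
    # Slow path: the NFVO has the cross-site context needed for
    # end-to-end recovery, at the cost of a longer recovery time.
    return "NFVO"
```

The point of the sketch is only that the routing decision is per-service policy, not a global choice made once for the whole platform.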
What we thought we'd do is run through a couple of different failure scenarios and talk through the fast-path and slow-path recovery models in each of them. We'll go through nodal power failure, network management link failure, service failure, hardware failure, VNF failure, and hypervisor failure, and then we'll talk about some of the trade-offs. On a nodal power failure: if a compute node has a power failure, the NFV orchestrator could learn from the VIM that the node has failed and look to schedule that VNF elsewhere in the cloud, or on a different cloud. An alternative approach, for an application or service that needs local recovery, is for the VIM to relaunch that VNF locally; that would be the fast-path scenario. Also, a lot of solutions use out-of-band management interfaces, so if you can't talk to a node anymore, you may need to query the system to look at service availability: is the service still up, and is action really needed, or is this an intermittent failure? Again, that case is probably best looked after by the NFVO. On the service side, again, the VNF can move from one compute node to another through an orchestration action; another option would be for the VIM to take that action itself and then report back up to the NFVO that it has done so. So again, there are a couple of different options for recovery, depending on the characteristics of your service. For VNF failure, again, it's very similar.
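The nodal power failure flow above can be sketched as a small decision function. This is a hypothetical illustration, not a real OpenStack or MANO call: `node_reachable_oob` stands in for a query over an out-of-band management interface (e.g. a BMC), and the returned action names are invented labels for the three outcomes discussed in the talk.

```python
# Illustrative sketch only: decide what to do when the VIM loses
# contact with a compute node, per the scenario walkthrough above.
def handle_node_failure(node_reachable_oob: bool,
                        service_still_up: bool,
                        local_capacity_available: bool) -> str:
    """Return the recovery action for an unreachable compute node."""
    if node_reachable_oob and service_still_up:
        # Intermittent management-link failure: the node and service are
        # fine, so no recovery action should be taken.
        return "no-action"
    if local_capacity_available:
        # Fast path: the VIM relaunches the VNF on a healthy local node,
        # then reports the action up to the NFVO.
        return "vim-local-relaunch"
    # Slow path: the NFVO reschedules the VNF elsewhere in the cloud,
    # or on a different cloud entirely.
    return "nfvo-reschedule"
```

The "no-action" branch is the important one: without the out-of-band check, a management-link glitch would be indistinguishable from a real power failure and could trigger an unnecessary, disruptive recovery.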
One of the other things, though, is that we've been working with our colleagues at China Mobile and in OPNFV on a couple of different ways to provide additional insight into what's happening in the VNF. We actually have a set of guest heartbeating APIs that let you communicate directly into the VNF and make sure the software processes are still working, and you can do guest heartbeating to make sure the guest itself is still alive. That, again, gives you very early detection and the ability to recover the service before a MANO function would be able to recover it. It's very similar for the hypervisor: again, you can have a slow path or a fast path. In this case it's the VNFM and the NFVO talking to the VIM and looking at the VNF's availability, but more focused on the hypervisor: is the KVM process that's running that particular VNF still alive? And again, guest heartbeating can be leveraged here as well. Hardware failure is also very similar, and the same approach is a good way to manage compute node failures. When we went through this analysis, we found that the NFVO is one approach, but the VIM is just as good an approach for detecting hardware failures; if you need fast fault detection, it's probably best done in the VIM. So again, we found that both solutions are viable. In conclusion, the three-layer decoupling model is a proven way to approach these recovery scenarios. And we really came to the conclusion that some apps and services will require fast fault detection and recovery, while others may prefer to be more tightly coupled into the NFVO. Okay, so for the scenario described in this presentation, when the NFVO is responsible for making all recovery decisions, it can result in high latency under heavy load. Recovery will be faster, with lower latency, when the NFVO also allows the VIM to make some recovery decisions.
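The guest heartbeating idea above amounts to a small monitoring loop. The sketch below is an assumption-laden simplification: the real mechanism (for example the OPNFV work the speakers reference) runs an agent inside the guest, whereas here a hypothetical `poll_guest` callable stands in for that channel, and the miss threshold is an invented parameter. The point it illustrates is that a few missed heartbeats let the VIM declare the VNF failed well before a slower MANO-level detector would notice.

```python
# Illustrative sketch only: declare a VNF failed after N consecutive
# missed guest heartbeats, enabling fast-path recovery in the VIM.
from typing import Callable


def monitor_heartbeats(poll_guest: Callable[[], bool],
                       max_missed: int = 3,
                       polls: int = 10) -> str:
    """Poll the guest; report 'failed' after `max_missed` consecutive misses."""
    missed = 0
    for _ in range(polls):
        if poll_guest():
            missed = 0  # a healthy reply resets the counter
        else:
            missed += 1
            if missed >= max_missed:
                # Early detection: hand off to the VIM's fast-path
                # recovery instead of waiting for MANO-level detection.
                return "failed"
    return "healthy"
```

With a sub-second polling interval and a threshold of a few misses, detection lands in seconds, versus the tens of seconds a periodic MANO-level audit might take; that gap is the fast-path advantage the talk argues for.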
Yeah, actually, this is from our testing, our real-life work. Well, I think we're at our ten minutes. Thanks, everybody, for coming, and if you have any questions, we're happy to answer. [Audience] So the VIM itself will handle all the failures and errors, right? Right. Whereas the ETSI recommendation, at least in, I think, Release 2, doesn't define that, correct me if I'm wrong. So is this an anti-pattern with respect to what the ETSI standard for NFV says, or is it in alignment with that? My view is that, well, I'd better repeat the question. So, just for the recording, the question is: there are a couple of options described here; are they in alignment with ETSI NFV or not? I think that's the crux of your question. In my view, we see this as complementary. For situations where the NFVO needs multi-site awareness and you're describing a service across perhaps multiple clouds, yeah, the NFVO needs more context. However, there are probably certain services where you need faster fault recovery, where you want local detection and recovery and you want it more autonomous. So it's really just different levels of reliability and service assurance. Exactly, yeah. So maybe I'll repeat your question just for the recording. The question was, go ahead. Okay, so I think the crux of your question was, is the VIM, I'm sorry, could you maybe repeat it? Oh, policy, yeah. Okay, so the question is: how does the orchestrator deliver policies for recovery down to the VIM layer? Yeah, I think that's some of the work that still needs to be done. Some of the APIs need to be further expanded and defined, because I think in ETSI NFV a lot of the APIs, certainly from an NFVO perspective, are still open to interpretation and need a bit more clarification.
But you can define it; my view would be that you'd want this to be configurable through your VIM, so you can define recovery policies just like the affinity or anti-affinity policies that are defined as part of the VNF.