Good morning, everyone. My name is Ashiq Khan from NTT Docomo. All four of us fellow presenters will walk you through an open source project in OPNFV, which was born out of a requirement from a telco, from Docomo, and has seen success in OpenStack. We'll explain the requirement part, and then I'll hand over to the actual developers, who will introduce which features they have developed in OpenStack and how they realized those features. So, the requirement. I'll get straight to the point. For telco nodes, this is a high-level architecture of the 3GPP-defined LTE/EPC core network, where you have the mobility management entity, which is responsible for controlling the rest of the core network. Every time you turn on your cell phone, your location registration goes to these nodes, and it stays there; it's anchored there. The mobility management entities allocate different data path nodes, and through those data path nodes your data traffic flows. Beyond these is the rest of the world; traffic goes out of the operator's network. The point here is that each of these nodes hosts a few thousand subscriber sessions. If one goes down, a few thousand customers lose network connection. That's already bad enough. What happens consequently is that all of those few thousand cell phones try to reattach, to re-register to the network, simultaneously. The nodes cannot take that load. So it gets into an irrecoverable scenario where a single fault eventually does not allow the network operator to recover the node at all. So what we require is a very fast failover policy and corresponding features in telco networks. To explain a bit more: you have the cloud infrastructure, you have a cloud manager, OpenStack, and then you have the telco nodes, the virtualized network functions, VNFs. In general, we aim to reduce service downtime to zero.
Generally speaking, these run in a hot active-standby mode. There is a manager, the VNF manager, which is responsible for switching from active to standby if there is any problem in the active node. Now we have virtualized this; Docomo rolled out its commercial virtualized EPC last month. By virtualizing, you insert a virtualization layer, the hypervisors, so hardware failures are no longer directly visible to the VNF managers. If there is a hardware fault, it takes time for the VNF manager to learn about it and actually perform the active-standby switchover. That's quite a long time, and we would have significant service downtime. We want to avoid that through an open source solution, as we are using OpenStack in our commercial system. So what this project needs to realize is this: if there is a hardware failure, it needs to be detected by OpenStack. Once it is detected, OpenStack needs to know which virtual machines are affected and then which manager owns them. There are many other virtual machines, virtualized network functions, and they have their own different managers. OpenStack needs to determine who the manager is, and once it knows, it needs to send a notification to that manager, and the manager will do the active-standby switchover. The active-standby switchover itself is an existing function that is outside the scope. What this project needed to realize, the requirement, is the three features I have shown in red. Now, how was this requirement taken up by the open source community, and how was it technically realized in OpenStack? Most of it has already been merged in the Liberty and Mitaka releases. To explain that, I hand over to the developers who actually wrote the code. Ryota will explain the architectural part from here. Thank you. Okay, I'll continue from here.
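The three features in red can be illustrated with a small sketch: detect a host failure, map it to the affected VMs, and notify each VM's manager. All names here are illustrative only; the real implementation lives in OpenStack (Nova, Congress, Aodh), not in code like this.

```python
# Hypothetical sketch of the three required features: (1) a hardware
# failure is detected, (2) the affected VMs and their managers are
# looked up, (3) each manager is notified. Names are assumptions.

def find_affected_vms(failed_host, vms):
    """Feature 2: map a host failure to the VMs running on it."""
    return [vm for vm in vms if vm["host"] == failed_host]

def notify_managers(failed_host, vms, send):
    """Feature 3: notify each affected VM's manager."""
    notified = []
    for vm in find_affected_vms(failed_host, vms):
        send(vm["manager"], {"vm": vm["id"], "event": "host_failure"})
        notified.append(vm["id"])
    return notified

vms = [
    {"id": "vm-1", "host": "compute-1", "manager": "vnfm-a"},
    {"id": "vm-2", "host": "compute-2", "manager": "vnfm-b"},
]
sent = []
notify_managers("compute-1", vms, lambda mgr, msg: sent.append((mgr, msg)))
```

In the talk, feature 1 is the monitor's job, feature 2 is the inspector's, and feature 3 is the notifier's, as the architecture section explains next.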
I'm Ryota from NEC, working in OPNFV as the Doctor project lead, and I'm also a core developer of Aodh, which is a telemetry project. Okay, I can provide more information about our high-level architecture for fault management. This is the high-level architecture of NFV. On the left side you can see the applications and the virtualized infrastructure, and on the right side you can see the manager of each. We're going to use OpenStack as the virtualized infrastructure manager. We detect a failure in the virtualized infrastructure and then notify it to the application manager. During that process, the information is masked, or transformed, to hide the low-level physical resources and present the failure in terms of virtualized resources. This is our focus. Here is more detail about the fault management function blocks and the sequence. You can see four components: monitor, inspector, controller, and notifier, which is fairly generic terminology. We may have several types of monitors to detect the various failures that could occur in the back end, using different back-end technologies. Those monitors detect a fault in the infrastructure and notify it to the inspector as a raw event. The inspector recognizes each fault as a failure according to its policy configuration, finds the affected resources, and changes the state owned by the controllers. The controllers would be Nova, Neutron, and Cinder. Those controllers receive the change request from the inspector and correct the state they own, and they also publish the state change on the OpenStack common message bus. That common bus is used for billing or audit, and in this case it is consumed by the notifier. The notifier watches those notifications on the OpenStack common bus, and if a failure event or state change is captured, it notifies the application managers.
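The monitor-to-notifier sequence just described can be sketched as a toy pipeline. All class and field names here are assumptions for illustration; the real components are separate OpenStack services connected by the actual message bus.

```python
# Toy sketch of the Doctor fault-management pipeline:
# monitor -> inspector -> controller -> common bus -> notifier.

class Bus:                      # stands in for the OpenStack message bus
    def __init__(self):
        self.subscribers = []
    def publish(self, msg):
        for cb in self.subscribers:
            cb(msg)

class Controller:               # e.g. Nova: owns resource state
    def __init__(self, bus):
        self.bus, self.state = bus, {}
    def set_state(self, resource, state):
        self.state[resource] = state
        self.bus.publish({"resource": resource, "state": state})

class Inspector:                # maps raw faults to resource state changes
    def __init__(self, controller, policy):
        self.controller, self.policy = controller, policy
    def on_raw_event(self, event):
        for resource in self.policy.get(event["fault"], []):
            self.controller.set_state(resource, "error")

class Notifier:                 # forwards state changes to app managers
    def __init__(self, bus):
        self.received = []
        bus.subscribers.append(self.received.append)

bus = Bus()
controller = Controller(bus)
inspector = Inspector(controller, {"nic_down": ["vm-1"]})
notifier = Notifier(bus)
inspector.on_raw_event({"fault": "nic_down"})   # a monitor's raw event
```

Note the key design point from the talk: the inspector changes state that the controller owns, and the notifier learns about it only via the common bus, never directly from the inspector.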
The alarm configuration is set by the application manager when the application is deployed. So this is the way we capture the various failures in the infrastructure, and also a very effective way to notify the failure to the user, here the application manager. Now you can see the OpenStack project names on the screen. We're going to use Nova, Neutron, and Cinder as controllers in this terminology, and we use Ceilometer and Aodh as the notifier. We have two options for the inspector: Congress and Vitrage. In this demo and in this presentation, we're going to explain in more detail how we can use Congress. So we had three challenges in OpenStack. As you can see, we had to work on multiple OpenStack projects to realize our designed architecture. We also had to let the OpenStack user know the corresponding resource state properly and immediately; what we mean by "properly" is presenting a failure as the state of virtualized resources. And we had to support the various OpenStack deployment flavors, which means someone may use Nova-network, others use Neutron with the OVS ML2 plugin, and some may use ODL, that kind of thing. The operators may also have various policies. So we worked in OpenStack proposing new features; we wrote draft blueprints and specs and developed the code in the OpenStack community, and it was merged in the Liberty and Mitaka cycles. I'm just providing high-level information here, and we'll switch to the other developers, who will give you more detail about how we achieved the enhancement of resource state awareness and the Congress-based inspection. So, Tomi. Hi. My name is Tomi Juvonen, I work for Nokia, and I'm going to tell you about the resource state awareness changes that we have been doing in the past two cycles, in Liberty and Mitaka.
The problem is that we have a use case of a highly available application, and we haven't had enough state information; it hasn't been reliable, and it changes too slowly. So I'm going to tell you about the changes that help with that. All right, here you can see the white box on the left: it's the compute node. If you follow the line over there, the periodic update, that is something that existed before the Liberty release. For us, we needed immediate action to change the state, so this periodic update wasn't enough. So I proposed this force-down API to Nova, so we can use an external monitoring service to monitor the compute node. As soon as it recognizes any problem, it can call the force-down API, and immediately the nova-compute service state changes and you are ready to do what you need to do. Okay, but even that was not enough for us. At the same time in Liberty, if we look again at those red lines over there, you can see what we had before: you could have polled some data, and you had the alarm evaluator that polls the database in Ceilometer, and only then does it trigger an alarm. That's not a very fast process; it doesn't suit us. If you follow those blue and orange lines, you can see what we did. Ryota proposed this event alarm evaluator, so you can send an event, and through Aodh, with the event alarm evaluator, you get an immediate alarm. Okay, that's cool. We are almost there, but we still have a problem with the states: they are not there if you are looking from the user's point of view. Okay, there's a lot of stuff in here. If we look again at the white box over there, the compute node, and what was there before: if you lose this periodic update and you have VMs running, what do you have? When you query your server via the server APIs, you see the VM still up and running, and the reality might be totally different. So you didn't get the information you needed.
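The difference between the old polling path and the event alarm evaluator can be sketched as follows. This is a toy model with assumed names, not the Aodh implementation: the point is simply that an event alarm is evaluated the moment a matching event arrives, instead of waiting for a polling period.

```python
# Toy sketch of an event alarm evaluator: an alarm callback fires as
# soon as a matching event is received, with no polling interval in
# between. Real Aodh event alarms match on event_type patterns.

class EventAlarmEvaluator:
    def __init__(self):
        self.alarms = []     # list of (event_type, callback)
        self.fired = []
    def create_alarm(self, event_type, callback):
        self.alarms.append((event_type, callback))
    def on_event(self, event):  # evaluated immediately on arrival
        for event_type, callback in self.alarms:
            if event["event_type"] == event_type:
                callback(event)
                self.fired.append(event["event_type"])

evaluator = EventAlarmEvaluator()
notified = []
evaluator.create_alarm("compute.host.down", notified.append)
evaluator.on_event({"event_type": "compute.host.down", "host": "compute-1"})
```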
To accomplish more awareness here, I proposed this get-valid-server-state change in Mitaka to Nova. Now when you are querying your server, you also get this host status. If the periodic update is there, you get it as UP: cool, everything is up, the host is working, and your VM states are showing as they should, up and running. Then if you lose the periodic update by some means, it's UNKNOWN, and unknown here means exactly that: you just lost the periodic update, so your VMs might be running or they might not. And the cool thing we had was the force-down API: now you have external monitoring, you call that, and it then shows in this host status as DOWN. Okay, so you lost your nova-compute connection; your VMs are still showing as up and running, but now you get the useful part with the host status: hey, the host actually has a problem, so immediately do a switchover. Then, also as part of Doctor, we are interested in maintenance. You can see on the top right there, the admin can call the service disable API, and if he does that, you see the host status as MAINTENANCE. Now in Newton I'm also proposing a change so that you can actually see the reason as well, so you get a more detailed maintenance state. Then there's just a special case: if you are in the middle of creating your server, it doesn't yet have a host, so of course you don't have a host status; you get an empty string. Okay, in a sense this is not exposing the host: you don't get the host name or anything like that. But still, it was normally considered that only the admin should have this API, so the default policy in Nova is admin-only. In telco, we need this information for the user, so in a telco deployment you would change the policy setting in Nova to make this host status visible also to the owner of the server.
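The state values just walked through (UP, UNKNOWN, DOWN, MAINTENANCE, and the empty string) can be summarized as one decision function. This is a minimal sketch with assumed inputs and precedence, not Nova's actual code path.

```python
# A minimal sketch of how the host status described above could be
# derived from the compute service's condition. Field names and the
# exact precedence are assumptions for illustration.

def host_status(has_host, forced_down, disabled, heartbeat_fresh):
    """Derive a host status string for a server's host."""
    if not has_host:          # server still being built: no host yet
        return ""
    if forced_down:           # external monitor called the force-down API
        return "DOWN"
    if disabled:              # admin disabled the service for maintenance
        return "MAINTENANCE"
    if not heartbeat_fresh:   # periodic update lost: state is uncertain
        return "UNKNOWN"
    return "UP"
```

The key contrast from the talk: the VM state alone may still say "active" in every one of these cases, while the host status is what actually tells the application manager to switch over.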
Okay, so now we have all these things enabled for something that you will soon hear about from Masahito. So I hand over to Masahito. Thank you. Thanks, Tomi. I'm Masahito, working for NTT as a cloud architect, and I am a core reviewer of the Congress project in OpenStack. In my part, I want to show you how we achieved Congress-based inspection. Before starting my part, I want to do a quick survey about the Congress project. Everyone, please raise your hand. Yeah, okay, thank you, and keep your hands up. If you know the name of the Congress project, please keep your hand up. Okay, next: if you know what kind of service Congress provides, please keep your hand up. Okay, thank you. Finally, if you have used Congress before, please keep your hand up. Okay, I should explain what Congress is first. Thank you, everyone, you can put your hands down. First of all, what is Congress? Congress is governance as a service. Congress offers admins the ability to define their policy for managing their cloud, and Congress controls the cloud according to the defined policy. Now I think you have a question: what kind of policy can Congress manage? Admittedly, the definition of the word "policy" differs depending on your background; there is no single answer for that. But that doesn't matter, because Congress aims to solve your problem with any policy, for any service. This is an overview of Congress. Congress is roughly divided into three parts. The first one is the API: the cloud admin can define their policy via API calls. The second part is the data source drivers, described at the bottom of this picture. A data source driver is in charge of collecting data from a cloud service. In this picture, Nova, Neutron, Keystone, and a security system run in the cloud, and the data source drivers collect the data from those services. The third part of Congress is the policy engine.
The policy engine is in charge of evaluating the policy defined by the admin against the data collected by the data source drivers. If any policy violation happens, the policy engine takes action to fix the violation in the cloud service. In this picture, the policy engine tells the Nova data source driver: hey, please fix this problem in the Nova service. So now I think you know what Congress is and how Congress works. This is the set of requirements for the inspector in Doctor, and this is the result of a gap analysis between the inspector requirements and Congress. In the requirements there are three items for the inspector. The first one is fast failure notification, and here there is one gap between Congress and the inspector: Congress pulls data from cloud services periodically and evaluates it periodically. So the gap is real-time policy evaluation and policy enforcement. The second requirement is mapping a physical hardware error to the logical failure. That is easy to do with Congress, because Congress has all the data about your cloud: if you write a rule to map the events to the logical failure, this requirement is achieved. So there is no gap there. The third requirement is adaptability. The inspector is in charge of mapping an event to the cloud service's error, but the definition of an error is different in each company, so we need this adaptability in the inspector. But adaptability is easy to achieve, because if you write a policy that fits your definition, it is achieved. So there is one gap between Congress and the inspector: real-time policy evaluation and enforcement. Next I want to show you how we solved this gap in the Mitaka release of Congress. I implemented a push-type data source driver, which enables services outside of Congress to push events to Congress. This is an overview of the push-type data source driver.
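The push-type idea can be illustrated with a toy sketch: instead of the engine polling on a period, an outside service pushes rows into the driver, which triggers an immediate re-evaluation. All names below are assumptions; real Congress drivers expose table schemas and the engine evaluates Datalog rules.

```python
# Illustrative sketch of a push-type data source driver: pushed rows
# trigger policy re-evaluation immediately, closing the real-time gap
# left by periodic polling.

class PushDriver:
    def __init__(self, engine):
        self.engine, self.rows = engine, []
    def push(self, new_rows):
        self.rows.extend(new_rows)
        self.engine.reevaluate(self.rows)   # no waiting for a poll cycle

class PolicyEngine:
    """Toy policy: flag any pushed row whose status is 'down'."""
    def __init__(self):
        self.violations = []
    def reevaluate(self, rows):
        self.violations = [r for r in rows if r["status"] == "down"]

engine = PolicyEngine()
driver = PushDriver(engine)
driver.push([{"hostname": "compute-1", "status": "down"}])
```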
Another service, at the top of this slide, can push data to the driver; the driver then sends the event to the policy engine, and the policy engine re-evaluates the policy immediately. We also implemented a Doctor driver for the inspector. This is the workflow of how Congress and the Doctor driver work as an inspector. First, the monitor in the gray box sends the hardware failure event to Congress. The Doctor driver receives this event and adds it to the events table of Doctor data. Next, the policy engine receives the list of hardware failure events and re-evaluates them into logical failures, meaning failures at the virtualization layer, and it detects which VMs are affected by the event. Finally, the policy engine tells the Nova data source driver: hey, call the force-down API for the failed host and call the reset-state API for the affected VMs. This slide shows the details of the Doctor driver schema: the upper table shows the schema of the events table in the Doctor driver, and the lower table shows an example of an event sent by the monitor. So let's move on to the demo from Ryota. Before handing over, I want to say that I think this is the first implementation that translates a hardware error into a virtualization-layer error. Thank you. Okay, well done. Thank you, Tomi, Masahito, and all the developers working to realize this framework. Now you can do the failure detection and you can perform the reactions. I'm going to show the demo using those features developed in the Liberty and Mitaka cycles, and you can do the same: it's open source. Okay, I'm going to show two scenarios. Scenario one detects a failure only when the whole set of redundant NICs is pulled down, and scenario two recognizes a failure when just one of the redundant NIC ports is pulled down. Both use the same sequence: first make the NIC fail, pulling the NIC port down; then the automated monitor detects the hardware failure event and notifies Congress.
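The inspector workflow just described, from hardware events to the force-down and reset-state calls, can be condensed into a small function. All names are illustrative; in the real system these actions are Nova API calls issued via the Nova data source driver.

```python
# Toy sketch of the Doctor inspector workflow: translate hardware
# failure events into controller actions, i.e. force the failed host
# down and reset the state of its VMs to error.

def inspect(events, vms_by_host):
    """Return the controller actions for a batch of hardware events."""
    actions = []
    failed_hosts = {e["hostname"] for e in events
                    if e["type"] == "host.failure"}
    for host in sorted(failed_hosts):
        actions.append(("force_down", host))
        for vm in vms_by_host.get(host, []):
            actions.append(("reset_state", vm))   # mark the VM as error
    return actions

actions = inspect(
    [{"hostname": "compute-1", "type": "host.failure"}],
    {"compute-1": ["vm-act"], "compute-2": ["vm-sby"]},
)
```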
Congress then changes the state of the affected VMs to error in Nova, using a new API and an existing API for the servers. Nova also sends out the notification of the state change, and a notification is sent to the application managers to tell them about the VM failure. Then the application manager performs the healing process. In this demo we are serving a video with a video server and switching between the active and standby nodes, and the end user can continue to watch the movie. Okay, let's start. We have seven screens here. You can see Horizon, which you may be familiar with; some explanation in the Congress console; a service panel; and the application manager logs at the top right. You can also see the end user's monitor, with the counter ticking, and we are showing the console for generating the failure. Okay, let's start from the Horizon view. We can see the hypervisors, two hypervisors right now, each enabled and up. Currently we have three VMs: two VMs serving in active-standby fashion and one VM running the application manager. Okay, then we'll issue the command to bring the NIC down. When we did a similar demo last year, we removed the cable; for this demo, we bring the interface down from the command line. Now you can see the green one, which is the data flow: the video goes through the active node and the user is consuming that data. First, we create one NIC failure. This is the rule for Congress: when the Doctor driver receives two events, with types host NIC 1 down and host NIC 2 down, then the host becomes error. That's the rule, okay. Now we bring one host NIC down, and the service remains running. Congress has already received the NIC 1 down event, but the instance remains active, and so does the hypervisor: it does not recognize a failure yet.
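The two demo policies differ only in whether both NIC-down events are required. A small sketch, with assumed event names, of that distinction (in real Congress these are expressed as Datalog rules over the Doctor events table):

```python
# Toy sketch of the two demo rules: scenario one declares the host in
# error only when BOTH redundant NICs are reported down; scenario two
# when a single NIC-down event arrives.

def host_error(events, require_both=True):
    """Decide host error from the NIC-down event types received."""
    seen = {e["type"] for e in events}
    nic_down = {"host.nic1_down", "host.nic2_down"} & seen
    if require_both:
        return len(nic_down) == 2     # scenario one: AND rule
    return len(nic_down) >= 1         # scenario two: OR rule

one_nic = [{"type": "host.nic1_down"}]
both_nics = one_nic + [{"type": "host.nic2_down"}]
```

This matches what the demo shows: under the first rule, one NIC down leaves the instance active, and only the second NIC down triggers the error state.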
And there goes the second NIC down. You can see that the application manager received the failure event and switched the state of the services. As for Congress, it received the two events: it received the second NIC down, and you can see the time here, within one second, maybe 800 milliseconds. We are refreshing this application manager view very frequently, but basically the application manager receives the event from OpenStack, from Aodh. And yes, the VM has gone to the error state, and the hypervisor as well. Now, this is case two. In case two, we recognize the failure with just one redundant NIC port down. The hypervisor is back to enabled and up, and the standby node is active again; this is the initialization. Congress is also re-initialized, cleaning up the old events we used in the previous run. Now the rule is changed: Congress recognizes a host error when it receives NIC 1 down, and likewise when it receives NIC 2 down it recognizes the host as error. So we've changed the rule. Now we bring the first NIC port down, and it fires the notifications; the service is switched, and the movie continues, as you can see. Congress also shows the time it received the raw event from the monitors, and the application manager shows the time at which it received the notification: again, less than one second. Okay, that's it. We made this demo with almost entirely open source code; we did a bit of scripting for the application manager, but it should be available in our repository. This means we are working in the open with the Doctor project, integrating OpenStack and other pieces: we start from a requirements study and gap analysis, propose concrete features to the OpenStack projects, develop them, and work as part of the OpenStack community. You may have questions, but first I would like to introduce a colleague who worked on this demo. Suzuki-san, can you stand up?
Yes, he and one more colleague worked on this demo. We also have Doctor project members here, so if you have questions, you can ask them right after this session. Okay, thank you all. Thank you for coming.