Hello, folks, and welcome to this presentation, "SFQM and Doctor: Keeping My Telco Cloud Afloat." I'm Emma Foley from Intel. This is Ryota Mibu from NEC, and absent today are Maryam Tahhan and Carlos Goncalves. During this presentation I'm going to go through an introduction and talk about the project formerly known as SFQM. I'll then hand over to Ryota to talk about Doctor and show a demo, and then we'll summarize.

As the internet becomes increasingly important in everyday life, data centers are also playing a huge part in our lives. Taking this into account, along with the cost of data center downtime and the rise of SDN and NFV, telcos and enterprises are becoming increasingly concerned with maintaining the same levels of service assurance, QoS, and SLAs as they previously had. As you move from physical hardware to virtual appliances, it's very important to be able to maintain the same level of service assurance as before, because you need to be able to monitor your systems for malfunctions and misbehaviors that could cause downtime and interruptions to your service.

With this in mind, the SFQM project was created; it has been known as the Barometer project since last week, when there was a scope change. It was created because the ability to monitor the NFVI is critically important for providing the required level of service assurance. You want to monitor your system so you can enforce SLAs, detect violations, and detect any degradation in performance that could cause an interruption in service or system downtime.

The output of the Barometer project will be the measurements and events required to enforce those service assurance levels. This means two separate sets of features: one in your platform and applications, and one in collectd, which will forward your statistics on to higher-level fault management systems. The platform-level statistics will include legacy statistics, meaning anything available via IPMI; some BIOS information, which is static information such as manufacturer, vendor, model, and so on; RAS (reliability, availability, and serviceability) statistics; and RDT, which is Resource Director Technology. It will also include Open vSwitch statistics and DPDK statistics, as well as output plugins for collectd, which will forward your data to OpenStack and also plug into legacy systems.

In more detail, the DPDK plugins: for anybody who's not familiar with it, DPDK is a set of tools and libraries for packet processing in user space. Additional statistics are provided through the xstats API as of DPDK 2.2, and these stats include detailed error statistics showing what's actually going wrong, as well as events such as link status. On top of these extended stats, we also have a collectd plugin for each set of platform features. The dpdkstat plugin was recently merged and will be available in collectd when it's released in December. It uses a DPDK secondary process to monitor what's going on in your primary packet-processing application. The DPDK events plugin will provide any events that need to be addressed immediately, such as link status going down; whereas with the statistics you can monitor them on the side and deal with them at a later time, events need to be addressed immediately.
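To make that stats-versus-events split concrete, here is a minimal sketch using collectd's Python plugin interface. It's illustrative only: the real dpdkstat and dpdkevents plugins are written in C, and read_port_stats() here is a hypothetical stand-in for the xstats query a DPDK secondary process would perform.

```python
# Illustrative sketch only: the real dpdkstat/dpdkevents plugins are C plugins.
# This uses collectd's Python plugin API to show the stats-vs-events split.
import collectd

def read_port_stats():
    # Hypothetical stand-in for the xstats query a DPDK secondary process
    # would perform against the primary packet-processing application.
    return {'rx_errors': 0, 'tx_errors': 0, 'link_status': 1}

def read(data=None):
    stats = read_port_stats()
    for name, value in stats.items():
        # Metrics: dispatched on collectd's configured interval and
        # consumed later by whatever write plugins are loaded.
        v = collectd.Values(plugin='dpdk_example', type='gauge',
                            type_instance=name)
        v.dispatch(values=[value])
    if stats['link_status'] == 0:
        # Events: pushed as notifications so they can be acted on immediately.
        n = collectd.Notification(plugin='dpdk_example',
                                  severity=collectd.NOTIF_FAILURE,
                                  message='link down on port 0')
        n.dispatch()

collectd.register_read(read)
```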
It's the same with the OVS plugin that is currently being upstreamed. Stats are available for whatever can be pulled from the OVSDB tables, and this can be used whether or not you're running DPDK as your datapath with OVS. Again, you also have events, so you can know whether the vSwitch is alive and running. I'll get to the status of the plugins at the end, but for now you can assume they're at various stages: being investigated, in progress, or already upstreamed.

We also have RAS events and statistics. RAS is a set of platform features that provide information on reliability, availability, and serviceability, usually in the form of events. What RAS features do is detect and correct faults, and when they find a fault, whether or not they correct it, you get a lot of information about what's going on in your system at that particular time. In the short term, you know something's going wrong; in the long term, you can see what happened in and around a fault and be able to prevent it in the future.

There are also RDT (Resource Director Technology) statistics, which are offered per core group, such as last-level cache occupancy and memory bandwidth; we currently have that plugin merged to collectd master. And I think this is the last set: the legacy collectd plugins. Anything available over IPMI will be monitorable via collectd, and there will initially be some basic static BIOS information available.

This is an example of how these statistics can be used in a use case on a particular system. In this case, we're running OpenStack, we have a VM running, and we're using an OVS backend to switch the packets. collectd is a system statistics collection daemon. It is plugin-based, so you load whichever plugins you want to use, and they're configured to run at a particular interval. There are a number of types of plugins. On the right, you will see read plugins; OVS is an example of these, as its stats are pulled from the system. And there are write plugins, which we're going to take advantage of, to format our data and pass it off to the different applications sitting on top.

In this case, we have collectd running. After a particular interval, it will go to the OVS stats and events plugins and ask: OK, what statistics are available? These plugins will go to the underlying application, query it for data, and return it. When they return to collectd, any available write plugins will be triggered. Here we have a plugin for OpenStack, which will format the data appropriately and pass it off to, at the moment, Ceilometer. In the very near future, it will be passing this data off to Gnocchi. This data will then be available to any application sitting on top of OpenStack, or in any format that is supported by collectd.

So this is the status update for the Barometer project, or SFQM. As you can see, the purple plugins are currently being implemented, and the orange ones have pull requests open upstream, so you can actually download and test them if you are interested in these capabilities. And we have the RDT plugin and the Ceilometer plugin, or rather the OpenStack plugin, upstreamed. In the future, we're going to take advantage of the notification plugin architecture within OpenStack so that you can post events directly to the notification bus. This gives you a faster path: OpenStack won't have to process all your stats, and you can simply post events to the bus and have them reacted to immediately.
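As a sketch of what the write-plugin side does, here is a hedged example that posts a single sample to Ceilometer's v2 meters API over plain HTTP. The endpoint, token handling, meter name, and resource ID are placeholders; the real collectd-ceilometer plugin handles authentication, batching, and unit mapping for you.

```python
# Minimal sketch: format a collectd value as a Ceilometer v2 sample and POST it.
import requests

CEILOMETER = 'http://controller:8777'   # assumed Ceilometer API endpoint
TOKEN = '<keystone-token>'              # obtain via Keystone in real use

def write_sample(meter, value, resource_id):
    sample = [{
        'counter_name': meter,          # e.g. 'ovs.port.rx_dropped'
        'counter_type': 'cumulative',
        'counter_unit': 'packet',
        'counter_volume': value,
        'resource_id': resource_id,     # e.g. the host or port being monitored
    }]
    r = requests.post('%s/v2/meters/%s' % (CEILOMETER, meter),
                      json=sample, headers={'X-Auth-Token': TOKEN})
    r.raise_for_status()

write_sample('ovs.port.rx_dropped', 42, 'compute-0.br0')
```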
We're doing performance, scalability, and aggregation analysis as we introduce each new step. Unfortunately, those are not complete, so we don't have any results currently available. And as I mentioned, Gnocchi integration is on the cards as well, so that you can use Gnocchi instead of Ceilometer for your data storage. And now over to Doctor, which is one of the higher-level fault management systems.

Thank you, Emma. So I'm going to cover our OPNFV Doctor project. As Emma explained, we can now get information from the lower-level network entities. But it's really difficult to figure out what is happening in the cloud. In most cases someone says, "I cannot use my VM," and behind it there is a connectivity issue. If you use such a platform for NFV or other telco use cases, it's really hard to figure out what is wrong in the network, because there are many packets going back and forth. It can be difficult to find out what was lost, or even to figure out which VMs are affected.

The OPNFV Doctor project is not an OpenStack project, but another open source project, for NFV. In the Doctor project, we are building a fault management framework for the higher-level entities. We also try to cover scenarios where the operator does maintenance on their environment. In these cases, when a fault occurs, or when operators want to disable some physical machines, they have to know which VMs will be affected, or which VMs need some treatment or a switchover. There are many VMs running on a huge cloud platform, and it's very difficult to figure out which resources are affected by a failure or a maintenance event.

In Doctor, we created a requirements document; along the way, we identified the requirements, did a gap analysis, implemented the missing pieces upstream, and then integrated and tested them. This is the rough architecture of Doctor's fault management. The VMs are sitting here, on a platform providing virtual machines and virtualized networks, and on top of them we have the applications; in NFV terminology we call them VNFs. The blue boxes are OpenStack, managing the virtualized infrastructure. A monitor watches the infrastructure, an inspector works out what is happening, and a controller, something like Nova, acts on the infrastructure when problems are caught, or when a user requests new resources to be booted. The controller also provides notifications to telemetry or alarming services, and those services provide information to the infrastructure users, such as an application manager or some other component.

As I explained, we have a mapping like this. We can use Zabbix or collectd for the monitors. Congress, Vitrage, and maybe Monasca can be used for the inspector. For the controller, we have Nova, Neutron, and Cinder in OpenStack. And for the notifier, we can use Ceilometer plus Aodh. In Liberty and Mitaka, we implemented the missing features in Nova and Aodh, and those are already available. In the Newton cycle, we also added a driver so that Congress can inspect failure events from the infrastructure.
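As a minimal sketch of the monitor-to-inspector path in this architecture, assuming Flask, an inspector endpoint might look like the following. The URL path and payload fields are illustrative, not the exact Doctor notification schema, and mark_host_down() is a hypothetical hook into the controller.

```python
# Sketch: a monitor (e.g. Zabbix or collectd) POSTs a failure notification;
# the inspector decides which resources are affected and corrects their state.
from flask import Flask, request

app = Flask(__name__)

@app.route('/events', methods=['POST'])
def receive_event():
    event = request.get_json()
    hostname = event['hostname']        # e.g. 'compute-0'
    event_type = event['type']          # e.g. 'compute.host.down'
    if event_type == 'compute.host.down':
        mark_host_down(hostname)        # hypothetical: call Nova force-down
    return '', 204

def mark_host_down(hostname):
    print('would force down nova-compute on %s' % hostname)

app.run(port=12345)
```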
One thing we did in Nova is state correction. OpenStack has to have consistent awareness of resource state, so that when a failure occurs, an alarm can be sent to the user. But when the user tries to get the status of a VM, it can sometimes still say it's active. That can happen, and we had to fix it. So we made sure that a monitoring process can correct the state of the servers and also of the host. What we did is "mark host down," which indicates that the host is not available. You can see one engineer here, Tomi, who did a great job.

We also created the event alarm feature in Aodh, the OpenStack alarming service. As you may know, Ceilometer, or even Zabbix, collects various statuses, analyzes them, finds a serious or bad situation, and then sends out a notification. In that process there is some sort of polling, which means it may be delayed by five seconds, sometimes a minute. That's very bad for fast notification. So we proposed the event alarm, which takes an event, evaluates it very quickly, and sends out the notification to the managers. It's very fast: certainly less than one second, sometimes around five milliseconds or 200 milliseconds. And now we have everything in OpenStack Newton. We are actually also using collectd; it's not native to OpenStack, but we can integrate it with OpenStack easily.
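Here is a hedged sketch of the two pieces just described, mark host down and the event alarm, using python-novaclient and python-aodhclient. The auth details, host name, event type string, and alarm action URL are all placeholders.

```python
# Sketch under assumptions: endpoint, credentials, and event type are examples.
from keystoneauth1 import session
from keystoneauth1.identity import v3
from novaclient import client as nova_client
from aodhclient import client as aodh_client

auth = v3.Password(auth_url='http://controller:5000/v3',
                   username='admin', password='secret', project_name='admin',
                   user_domain_id='default', project_domain_id='default')
sess = session.Session(auth=auth)

# 1. State correction: mark the nova-compute service on a failed host as down
#    so the API stops reporting its VMs as healthy. force_down needs compute
#    API microversion 2.11 or later.
nova = nova_client.Client('2.11', session=sess)
nova.services.force_down('compute-0', 'nova-compute', True)

# 2. Fast notification: an Aodh event alarm fires as soon as a matching event
#    hits the notification bus, instead of waiting for a polling cycle.
aodh = aodh_client.Client('2', session=sess)
aodh.alarm.create(alarm={
    'name': 'host-down-alarm',
    'type': 'event',
    'event_rule': {'event_type': 'compute.host.down'},  # assumed event type
    'alarm_actions': ['http://app-manager:8080/alarm'],
})
```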
I would like to focus on the inspector module. The module has to receive failure notifications from various monitors, like Zabbix, or perhaps a monitor provided by a hardware vendor. It also has to find the affected set of resources from the information in the failure notification and update, or rather correct, the status of those resources. And when we talk about failure, it can be varied and subjective. It depends on the application, the backend technologies, and the redundancy of the equipment or components; it can depend on operator policy or regulations, and also on the topology of the network or the power supply. So it's very hard to say what a failure is, and we have to have a flexible framework so that failures can be defined dynamically, case by case.

In this presentation, we are talking about OpenStack Congress. It was called "policy as a service," but now we are calling it "governance as a service." It provides dynamic data collection from various OpenStack services, and it has a flexible, declarative policy definition. We have a policy example here: "host down" can be identified when we receive an event saying that the type is "compute host down" and the state is down. When that happens, we execute a command, Nova's force-down for the service on that host, so the status of the host is corrected.

Congress also has a polling mechanism. This is the architecture of Congress. Congress has an API facing the users, so users, and other services too, can put policies in via this API. There is one policy engine and many datasource drivers; the datasource drivers get information from Nova, Neutron, and many other OpenStack services periodically, and that information is stored in Congress and used in the policy evaluation process. Masahito from NTT proposed a new feature, a push-type datasource driver, which enables information to be provided to Congress very quickly. As I mentioned, the previous drivers get their information periodically, but with a push-type driver, Congress can receive information as soon as it's available.

Here is the more detailed sequence. If we use a monitor here, it can push data, for example that a host is down, or that there are power issues on a specific host, and provide that failure or fault notification to Congress acting as the Doctor inspector. Congress receives this information through the Doctor datasource driver, collects the event, and does policy evaluation against the predefined policies. We had a bunch of features proposed to OpenStack, and they have mostly landed or been implemented; we still have a few things to do, but yes.

As for how to integrate SFQM plus Doctor, it's a bit difficult, but we are still trying to figure out the best way. In this figure, we put collectd here; it detects a failure and provides it to Ceilometer, and Ceilometer does some fast evaluation to check whether it could be a host down, or something not active, and so on. If we are talking about a network, someone may allow dropping one or two packets per second; no one cares. So Ceilometer has to figure out what might be wrong in the lower layer, then notify a critical error to the inspector, saying something like: 80% of the packets are being lost per second, so this must be an error. Then it provides that information to the inspector, and the inspector figures out that something is wrong with this host: we have to fence or destroy this host and let the VM owners know there is an issue with their VMs.

Then we have a demo. As I said, we used collectd as the monitor today, and we're just checking the port status; it's more straightforward, so the monitor reports to Congress if there is a host down, and Congress then reasons about the affected resources. In the keynote demo, we marked the host as down when the NIC was down, but it depends on the network configuration. Please think of this as one host: we have three VMs running on it and three ports, and we have two bridges. One bridge is connected to two VMs, but the other to just one. And please check the NICs: we have three NICs, and if we have an issue on NIC 1 and NIC 2, we can still communicate with VM 0. So in this demo we configured Congress to understand this situation, to mark only the affected VMs as down and leave the unaffected VMs alive. To do that, we still needed some quick hacking in Neutron, but this feature has now been proposed by Carlos, who was originally scheduled to speak here. He applied his patch to Neutron, so the port gets marked down and a notification is sent out to Aodh, and Aodh sends the port error to the manager. The manager then recognizes that something is wrong with those VMs and does the switchover.

This is the more detailed policy rule, and it seems like we have time, so let me explain. In Congress, we define three policies, and these two policies mean OR: if we get an event whose type is "NIC 1 down," we recognize it as a NIC down, and likewise, if we have a "NIC 2 down" event, it is also recognized as a NIC down. So Congress will see a single NIC going down and then determine whether the bonded NIC as a whole is down; with only one of the pair down, it actually is not. And it executes force-down of the port against Neutron when this NIC-down event happens, against the ports that are assigned on the specific host and that use the specific physical network.
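A hedged reconstruction of those rules, pushed through python-congressclient, might look like the sketch below. The Datalog table and column names, the event type strings, and the neutron:update_port_down action are illustrative stand-ins; the real demo used Doctor's push datasource driver and the Neutron port-down extension.

```python
# Sketch: define the "two rules mean OR" NIC-down policy and an execute rule.
from congressclient.v1 import client as congress_client
from keystoneauth1 import session
from keystoneauth1.identity import v3

auth = v3.Password(auth_url='http://controller:5000/v3',
                   username='admin', password='secret', project_name='admin',
                   user_domain_id='default', project_domain_id='default')
congress = congress_client.Client(session=session.Session(auth=auth),
                                  service_type='policy')

congress.create_policy({'name': 'doctor_demo'})

# Either physical NIC failing counts as "nic_down" (the OR from the talk).
for nic in ('nic1', 'nic2'):
    congress.create_policy_rule('doctor_demo', {'rule':
        'nic_down(host) :- doctor:events(hostname=host, '
        'type="hardware.%s.down", status="down")' % nic})

# React: force down the ports bound to the affected host via Neutron.
# update_port_down and the column names are hypothetical.
congress.create_policy_rule('doctor_demo', {'rule':
    'execute[neutron:update_port_down(port)] :- '
    'nic_down(host), neutron:ports(id=port, binding_host_id=host)'})
```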
We also have configuration in Aodh to say where it should send the failure notification. And let me change the screen. We have three VMs on the host; actually, there are two hosts, and three sets of VMs are running like this. At the beginning, one host is providing a video service to the user client, and you can see the globe turning. We also have standby VMs here, and we're showing the application manager log here. So let's start from here: yes, it's rotating. Once we remove one of the bonded NIC cables, OpenStack recognizes it, Congress inspects the situation and corrects the state in Nova, and Aodh sends out the notification. The application manager then receives the failure, and it automatically switches to the standby VMs. This set doesn't change, as they are not affected. OK. Thank you, Ryota. Thanks.

So to summarize the entire presentation: trying to manage a complex cloud solution without proper telemetry is like walking across a busy highway where you can neither hear nor see. You've no idea where you're going, and you can't make a safe move without complete disaster. The use case that we've demonstrated here is basically painting the pedestrian crossing. So thank you all for attending. I'd like to thank Ryota for filling in for Carlos, and Carlos for putting together the demo, and I'd like to thank Maryam as well for all the work she's done with collectd. If anybody has any questions, now's your chance.

Excellent presentation. One question related to VNF event streaming, a new project which you were talking about this morning, I think, in the TSC. When you stream the data, do you still need to collect and store the data, or do you just stream it? Do you pass it through?

I'm sorry?

So in a VNF, some event occurs, and I want the event to be acted on directly rather than collecting the data. How do you handle that? Do you bypass the collection?

Sorry, I'm not quite getting that.

Suppose you have a VNF and there's a failure, and I don't want to collect the data; I want to react to it without collecting the data. Does it bypass your data collection? Unless you have some form of data, you're going to... No, because for most of the streaming events, we want to not collect the data but react to it. How do you handle that? Do you bypass it?

collectd is completely configurable, and its plugin-based architecture means that you don't have to enable everything. You could, if you wanted, enable just your events plugins, so that when there is a disastrous event that you do need to take action against, you can do that, and you won't be bogged down with any of the stats and metrics.

Okay, thank you very much.
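To illustrate that "events only" answer, here is a minimal sketch of a collectd Python plugin that registers just a notification callback, so no metrics are gathered or stored and events are acted on as they arrive. The manager URL and the forwarding action are illustrative assumptions.

```python
# Sketch: react to collectd events without collecting any stats.
import collectd
import requests

MANAGER_URL = 'http://manager:8080/alerts'   # assumed receiver endpoint

def on_notification(notification):
    # Forward only severe events; ignore informational ones.
    if notification.severity == collectd.NOTIF_FAILURE:
        requests.post(MANAGER_URL, json={
            'plugin': notification.plugin,
            'message': notification.message,
        })

# Only a notification handler is registered: no read plugins, no metrics.
collectd.register_notification(on_notification)
```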
So I have a question on the policies. Policies can be very complex. I'm just wondering how you test the policies, because if a policy has a bug, then I think it's very hard to debug, right?

Yeah, it's a very good question. Actually, we still have to figure out a good way to validate policies. Sorry, I don't have an answer right now, but basically it's the sort of language developed in academia, so maybe we can find a solution there for validating policies. And of course, if we have any bugs in Congress itself, we can fix them in the Congress project.

So the policy language looks like Prolog?

Yes, yes. And I think the operator or the integrator has to take responsibility for setting the proper policy. There's no way to make sure everything is working, because the user can always write a wrong configuration; it's the same here.

Okay, just one more quick question. The backup VM, does it have any state? Or does it, for example, need to fetch some state from the primary when you do the failover?

Well, yes. Actually, we are running two video servers, and we enable one of them to provide the streaming data. We also have the application manager next to those two VMs to control the switchover. Once the failure happens in the infrastructure and goes through the whole Doctor procedure I explained, the application manager eventually gets to know that the VMs are not available anymore, and then it switches the data source, and we can see that the packets, the streaming, are still alive. Did I answer your question?

I think so, thank you.

Thank you, folks. If you have any further questions, we'll be around here for a few minutes afterwards. Thank you. Thank you.