Okay, can you hear me? Is my mic on? Yeah, now it's on. Okay, so good morning, everybody, and sorry for the delay. We're having some resolution problems, so the display is a little small. My name is Iris Finkelstein, and I have my colleagues here with me today, Ohad Shamir and Alexey Weyl. We are all from Nokia CloudBand, and we're here today to talk to you about our project Vitrage, or how to organize, analyze, and visualize your OpenStack cloud. Before we go into the details of what Vitrage really is, I want to take you on a little journey. Again, I apologize for the size of this, but we're going to dive into a very sophisticated cloud environment. Now, this is a demo that was produced by our Cloud Innovation Center at Nokia, and it's intended to simulate a large service provider's distributed architecture. In this cloud environment we can perform various types of operations: full lifecycle management, end-to-end solutions, service-level orchestration, and so on. If we dive a little deeper, under the city you're seeing right now we can see the physical infrastructure of our cloud environment, and we can zoom in on all of these servers and see exactly which server is connected to which virtual machine. What's really important about this simulation is that we can see both the physical and the virtual layers at the same time. That's really important, and we'll come back to it when we talk about Vitrage. Going back up to the city: each one of these cities actually represents a data center or a cloud environment. In the cities we have city blocks, and each city block represents a virtual network function or an application. On the city blocks we have buildings, and each building is a virtual machine, an instance.
You can see that the buildings vary in size; the variation in size reflects the amount of resources used by each virtual machine. Now, this is a video of an actual demo, and in the simulation we can toggle these resources between CPU usage, memory, storage, and so on. So if you're a network operator and you have this highly distributed, modular architecture, then you're doing application and data modeling, service chaining, orchestration, shared data layers, and so on. You can perform a lot of tasks on this cloud environment, but you need the right tools to be able to perform them. When we look at this environment, you can see that it's highly distributed: many cities, many cloud environments. Let's go back for a second and use this to simulate an imaginary scenario. If you recall, a minute ago I talked about the physical infrastructure. Let's imagine that in this physical infrastructure we had a failure, say a failure in a switch. You can imagine how this goes: a connectivity failure in your physical switch leads to a connectivity failure in your physical host, which in turn leads to a failure in the connectivity from your virtual machine to the public network, which leads to the application being disconnected from the virtual machine, and so on. You get the picture; it's utter chaos at this point. And you can imagine that if these failures happen in each one of these data centers, multiple connectivity issues occur. So if you're a network operator, you have no way of knowing what actually happened. Maybe you get an alert on your physical switch, but that's just one alert. What happens next? What did the physical switch failure actually affect?
Do you know which other components of your network were affected? Do you know what the implications are? Do you know how to fix it? And even if you do know how to fix it, do you know that your fix doesn't compromise the other affected elements that you don't know about? Again, we're simulating a telco cloud environment here, and in a telco cloud these things can happen and multiply upon themselves. So as a network operator, you don't know what just happened in your network, and that's really what Vitrage is all about. This question has come up several times over the last couple of days, so I'm going to pause and explain why we named this project Vitrage. A vitrage is a stained glass window, and a stained glass window is made up of many pieces of colored and translucent glass. If you just look at the pieces in a pile, they don't mean anything, but if you put them all together, they make up a beautiful window that you can see through. You get the picture: what Vitrage does is take all of these pieces of information, put them together, and provide you with a window that lets you gather insights about your system. At CloudBand, when we started about five years ago, we were thinking: how can we help service providers improve service agility, improve their operations, reduce their costs, and so on? Basically, we were talking about network function virtualization, of course. And if we're talking about network function virtualization, how do we do that? If we make the switch from hardware-based network functions to virtual network functions, and we add a shared off-the-shelf infrastructure, then we can reduce lead times for operators and increase automation to the level we want. But all of this requires a very extensive tool set.
And that's what we've really been concentrating on at CloudBand for the last five years. Recently, we identified a gap in this tool set, and this gap relates to what I was talking about when I showed you the video: how do service providers actually understand, monitor, and analyze everything that's happening in their system? And most importantly, how do they visualize it? Visualization is a big part of Vitrage. When we were thinking of this tool and how to build it, we asked: what would we use to build Vitrage? What was the right way to go? Now, at CloudBand we've been using open source basically from day one, so that was a no-brainer. But over the last couple of years we've also begun to contribute to open source, so when we were thinking of Vitrage, it was pretty obvious that it needed to be open source. We're very happy and very excited that we decided to build Vitrage on OpenStack and contribute it to the community. That's it for my part; I'm going to hand over to Ohad, who will go a bit more into the details of what Vitrage actually does.

Thanks, Iris. My name is Ohad, and I'm a product manager in CloudBand at Nokia. I want to start with a short story. Imagine that you are responsible for operating an application. You come to your office in the morning, you take a look, and you see new alarms on your screen. You get closer to the screen, and you see that you have three new alarms. The first one is high CPU load on a VM. The second one is a host connectivity failure. And the third one is high CPU load on host number two. Now you're trying to figure out what really happened. What is the status of my application? What is the status of my system? And what are the relationships between those three new alarms? It's really hard to figure out those answers. So let me take you behind the scenes, and let's see what really happened.
So this is the real picture. You can see that we actually have two major failures. We have a switch down, connected to host number one, which causes the VM on host number one to be down, which affects my application. And I have another failure on host number two, high CPU load, which affects the performance of VM number three. Now I can see the whole picture, and you can see that I was actually missing a few alarms, and missing the relationships between those alarms. So imagine that I can give you this information: this is exactly what Vitrage is trying to do. Systems today are getting more and more complex. There is a barrier between the physical and the virtual layer, and there are many monitoring gaps. There is no single monitoring tool, whether in OpenStack or external, that can provide the complete picture of your system, of your cloud. The bottom line is that it's really, really hard to know what is going on. So what are the challenges? What does Vitrage try to provide? Vitrage tries to provide a holistic, complete view of the system that reflects the relationships between the different entities. Vitrage enriches the states and the alarms of your system and helps you understand the root cause of failures. And because different customers and different users need to configure their systems differently, Vitrage supports different configurations. Before getting into the details of the Vitrage components, I want to take a short example to show what actions Vitrage will take. Let's take high CPU load on a host. You have a high CPU load alarm configured in Aodh, for example: you set the threshold, the threshold was crossed, and now we get the alarm. Vitrage receives this alarm. Hopefully the state of the host has also already been modified to suboptimal; if not, Vitrage will do it. And Vitrage then raises deduced alarms. So what is a deduced alarm?
A deduced alarm is an alarm that is not directly observed. Here, the deduced alarm is an alarm on the instance: we know that there is a failure on the host, so we raise a deduced alarm on the instance, because we know the instance will be affected by that failure. The second action Vitrage takes is to modify the state of the instance to suboptimal, because it might suffer performance issues due to the CPU failure. And the last action is to connect these two alarms with a link that says alarm number one causes alarm number two. These are the three actions Vitrage takes in this simple example. Now let's go through the requirement list and the Vitrage components. First, we need extendable input sources. In order to provide a holistic view of your system, you need as many data sources as possible; each data source brings more information about the status of the resources, the relationships between the entities, and the alarms. We also need fast response, so Vitrage supports both pull and push notifications. Currently, we support different data sources: OpenStack data sources such as Nova, Cinder, Neutron, and Ceilometer; static configuration files, for example to describe how the switches are connected to the hosts; and external monitoring tools like Nagios and Zabbix. The second requirement is intuitive modeling of cross-layer relationships, and for that we have the Vitrage graph. The system state is represented as a property graph, which is a very intuitive model of the relationships. The entities are the vertices, so each vertex is either a resource or an alarm, and the edges are the relationships between those entities. Every vertex and every edge can have additional properties like state, ID, and name. You can see it's very intuitive modeling. Next, we need configurable business logic.
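The graph model described here can be sketched in a few lines. Vitrage itself builds on a full graph library (NetworkX, as mentioned later in the talk); the class, vertex, and property names below are purely illustrative, not Vitrage's actual API.

```python
# Minimal sketch of a property graph: vertices are resources or alarms
# with a property dict; edges are labeled relationships between them.
class EntityGraph:
    def __init__(self):
        self.vertices = {}   # vertex id -> properties (category, type, state, ...)
        self.edges = []      # (source id, target id, relationship label)

    def add_vertex(self, vid, **props):
        self.vertices[vid] = props

    def add_edge(self, source, target, label):
        self.edges.append((source, target, label))

    def neighbors(self, vid, label=None):
        # All targets reachable from vid, optionally filtered by edge label.
        return [t for s, t, l in self.edges
                if s == vid and (label is None or l == label)]

graph = EntityGraph()
graph.add_vertex("host-2", category="RESOURCE", type="nova.host", state="SUBOPTIMAL")
graph.add_vertex("vm-3", category="RESOURCE", type="nova.instance", state="RUNNING")
graph.add_vertex("alarm-1", category="ALARM", type="aodh", name="high CPU load")
graph.add_edge("host-2", "vm-3", "contains")   # host contains instance
graph.add_edge("alarm-1", "host-2", "on")      # alarm is on the host
print(graph.neighbors("host-2", "contains"))   # -> ['vm-3']
```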
In order to do that, Vitrage has to react to any change in the graph. Every time we have a new instance or a new alarm, it's reflected in the Vitrage graph; it's actually added to the graph. Then we have the Vitrage evaluator, which listens to the graph and, upon each event, evaluates which scenarios are relevant and what actions it needs to take. The scenarios are stored as templates; we call them Vitrage templates. A template is a YAML file, very human-readable. As a short example: on the condition "alarm on host and host contains instance", the action would be to set the state of the instance to suboptimal. It's very readable and very easy to configure, and we use these templates to hold our scenarios. The next requirement is that we want Vitrage to notify other projects of the insights we have. If Vitrage knows that a state should be changed, or wants to raise a deduced alarm, we want to notify other projects like Aodh, and maybe Nova, Cinder, or Neutron. And last, we have REST APIs and CLIs to expose the insights from Vitrage: the topology, the deduced alarms and states, and the root cause analysis. We have Horizon plug-ins, screens to show the topology, the alarm list, the entity graph, and also the root cause; Alexey will present them in a minute in the demo. Putting it all together, let's take a look at the Vitrage high-level architecture. We have the data sources; currently, in Mitaka, we support Nova, Nagios, static configuration, Aodh, Cinder, and Neutron, and we are planning to add more data sources in the future. The information from the data sources is ingested and reflected in the Vitrage graph. We have the evaluator and the templates that listen to the graph, evaluate, and execute actions based on the relevant scenarios. We have the notifier to notify other projects, and we have the UI and API.
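The evaluator idea just described, matching a scenario's condition against the current system state and applying its actions, can be illustrated with a minimal, self-contained sketch. All function and state names here are hypothetical; this is not Vitrage's real code.

```python
# Sketch of scenario evaluation: for the scenario "alarm on host AND
# host contains instance", apply three actions per affected instance:
# raise a deduced alarm, set the state to SUBOPTIMAL, record causality.
def evaluate(host_alarms, contains, states):
    """host_alarms: {host: alarm name}; contains: {host: [instances]};
    states: {resource: state}, mutated in place.
    Returns (deduced alarms per instance, causal links between alarms)."""
    deduced, causes = {}, []
    for host, alarm in host_alarms.items():
        for instance in contains.get(host, []):
            deduced[instance] = "instance_may_be_affected"  # deduced alarm
            states[instance] = "SUBOPTIMAL"                 # state change
            causes.append((alarm, deduced[instance]))       # causal link
    return deduced, causes

states = {"host-2": "SUBOPTIMAL", "vm-3": "RUNNING"}
deduced, causes = evaluate({"host-2": "high_cpu_load"},
                           {"host-2": ["vm-3"]}, states)
print(states["vm-3"])  # -> SUBOPTIMAL
```

In the real system the evaluator runs on every graph event, so each deduced alarm it adds can in turn trigger further scenarios, which is how a switch failure cascades down to instances in the demo that follows.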
So I think it's now a good time to move to Alexey, who will present a live demo of Vitrage.

Thank you, Ohad. Hi, guys. First of all, I have to apologize, because our DevStack crashed about an hour ago when we tried to configure the display settings for this computer. I brought DevStack back up about half an hour ago, so hopefully everything will be all right. If not, you can always come to our booth in the marketplace and see it all working. I will start. So, what will we see today? Today we'll see a demo of two use cases in Vitrage. In the first use case, let's say we have a user that defined a threshold alarm in Aodh, of CPU usage above 0%, for example, on a specific instance. CPU usage above 0% is not a real use case; it's just so we'll get a quick alarm in Aodh. For those of you who aren't familiar with it, Aodh is the alarms engine of Ceilometer. Here on the right, we can see part of our compute topology: we have the OpenStack cluster, the cluster contains the zones, each zone contains the hosts, and each host contains its instances. Now let's go to the Horizon UI. [The speakers spend a moment resolving a dual-display problem.]
The display is going to be cropped like this so you'll be able to see it; sorry for the poor display. We can see that we have four instances on our desktop: three are running and one is suspended. Now I will go to the Vitrage tab. Here we have three sections: the topology section, where we can see the different hierarchies of the topology; the alarms section, where we can see all of the alarms in Vitrage; and the entity graph, where we can see the whole Vitrage entity graph with all its entities and the connections between them. I will go to the topology section. Here we have our sunburst representation of the compute topology. In this representation, each segment in the circle represents entities of the same resource type: the inner segment represents the OpenStack cluster, the next segment represents the zones in the cluster, the next segment represents the hosts in each zone, and the last segment represents the instances in each host. Most of the entities are green, which means they are running, and one entity is gray, matching what we saw before: one of the instances is suspended. I can click on each one of the entities, for example this one; sorry, because the whole screen is cropped, there is supposed to be a small sunburst up here that shows where exactly you are in the big sunburst. I clicked on the Nova host entity, and on the left we can see its details, the state, the name, and everything, and we can see all of the alarms on this entity: we have one alarm on this host. Now I will go to the command line and, as we discussed, raise a threshold alarm of CPU usage above 0% on one of the instances. I will raise it, and now I'll go back to the presentation. Sorry that it looks like this.
And now we'll see what is really happening in Vitrage. So what happens? Aodh raises the alarm, and Vitrage receives the alarm from Aodh and adds it to the graph. When Vitrage connects the alarm to the resource, the Vitrage evaluator runs and checks if there are any matching scenarios in the templates. In our case, I have defined a template that says: if I have an Aodh alarm on an instance, then change its state to suboptimal. Now we'll go back to the Horizon UI and see what has happened. I will refresh the page. We can see that VM1 is now yellow. I will click on it: its state is suboptimal, and it has one alarm on it. So what have we seen so far? We raised an alarm in Aodh of CPU usage above some percentage, and we changed the state of this instance to suboptimal, even though Nova is not aware of this problem and the state of this instance in Nova is still "running". Now I'll go back to the presentation. Here we can see a template for this kind of Aodh use case. Each template consists of two main parts: the definitions part and the scenarios part. The scenarios part defines the business logic in a human-readable way, as we said before. Here we can see: when we have an Aodh alarm on an instance, then set the state of the instance to suboptimal. Very easy to understand. The condition string that we have here is defined in the definitions section, which defines the entities and the relationships between the entities. If we have more time at the end, I will explain it in more detail. Now on to our second use case. In the second use case, let's say we have a physical switch failure, and we'll see what Vitrage does with this kind of data. For this, we have a Nagios data source; we take the data from Nagios.
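Before moving to the second use case, the two-part template structure Alexey describes (a definitions section referenced by the scenarios section) can be paraphrased as plain data. The real template is a YAML file; the field names below only approximate its layout and should be treated as illustrative, not as the exact Vitrage schema.

```python
# Illustrative paraphrase of a Vitrage-style template as a Python dict.
template = {
    "definitions": {
        # Entities and relationships that conditions can refer to.
        "entities": [
            {"template_id": "alarm", "category": "ALARM", "type": "aodh"},
            {"template_id": "instance", "category": "RESOURCE",
             "type": "nova.instance"},
        ],
        "relationships": [
            {"template_id": "alarm_on_instance",
             "source": "alarm", "target": "instance",
             "relationship_type": "on"},
        ],
    },
    "scenarios": [
        # Business logic: Aodh alarm on an instance -> set it SUBOPTIMAL.
        {"condition": "alarm_on_instance",
         "actions": [{"action_type": "set_state",
                      "target": "instance",
                      "state": "SUBOPTIMAL"}]},
    ],
}

def actions_for(template, condition):
    """Return the actions of every scenario whose condition matches."""
    return [a for s in template["scenarios"]
            if s["condition"] == condition for a in s["actions"]]

print(actions_for(template, "alarm_on_instance")[0]["action_type"])
```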
Nagios, for those of you who aren't familiar with it, is a lower-level monitoring system which can monitor both physical and virtual layers. Let's go to the Nagios UI, which we have here. We can see that we have two components: the host, with its checks, and the switch. I will go to the switch component and manually simulate a switch failure on one of the checks. I'll take this check, for example, and change its state to warning. So now let's see, again, what is happening in Vitrage. Nagios raises an alarm, and Vitrage receives it and adds it to the graph. When Vitrage connects the alarm to the switch, it checks if there are any matching scenarios in the templates. In our case, I have defined a template which says: if I have an alarm on a switch and the switch is connected to a host, then perform the following three actions. One, raise a deduced alarm on the host. Two, change its state to error. And three, add a causal relationship between the alarms. Now, when the deduced alarm on the host is added to the graph, the Vitrage evaluator runs again and checks if there are any matching scenarios in the templates. In our case, we have defined that if we have an alarm on a host which is connected to an instance, then raise a deduced alarm on the instance, change its state to suboptimal, and add a causal relationship. So now we will go to the Vitrage UI and see what is happening. Here we can see the host and the instances. The instances are in red, but one of the hosts is still gray, because in our case we have defined in our YAML files that the suspended state, as we can see here, is a worse state than error. Each of you can define this for your own use case, however you want. I will go to the Nova host, and we can see that its state is error.
We can see that we have another new alarm on it, and likewise on the instances we can see the error state and the alarm on each instance. Now I will go to the alarms section. Here we can see all of the alarms that we have in Vitrage: alarms from Aodh, alarms from Nagios, and the deduced alarms that Vitrage raised. Let's look at one of them, the alarm that I just raised, the uptime alarm. For each alarm we have its name, the resource on which it was raised, the ID of the resource, the severity of the alarm, the type of the alarm, which tells us where the alarm came from, and the RCA of the alarm. Let me zoom in on the RCA. We can see the alarm on the switch, which caused the alarms on the host, or on several hosts if you had more, and which caused the deduced alarms on each one of the instances on the host. This way, if you have many alarms, you can understand much better and more quickly what happened and what you need to do in order to solve it. Now I will sort by resource type and go to the RCA of an instance, for example, and here we can see the picture from the instance's point of view: we have an alarm on the instance, which was caused by an alarm on the host, which was caused by an alarm on the switch. Again, this is a very easy and quick way to understand what is going on. I will go to our last section; again, it looks a bit weird because of the resolution problems that we have. Here we can see the whole Vitrage entity graph. You can see that we monitor, as we said before, the Nova hosts and the VMs. On the left you can see the details of each entity. You can see that we monitor Nova instances, Neutron ports, Neutron networks, and the Cinder volumes; I will find one; here we have a Cinder volume.
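The causal chains shown in the RCA view can be reconstructed by following the "caused by" links between alarms until an alarm with no cause is reached. A minimal sketch, with hypothetical names rather than Vitrage's real API:

```python
# Root cause lookup: walk "caused by" links from any alarm to the root.
def causal_chain(alarm, caused_by):
    """caused_by maps each alarm to the alarm that caused it (or None).
    Returns the chain from the given alarm back to the root cause."""
    chain = [alarm]
    while caused_by.get(chain[-1]) is not None:
        chain.append(caused_by[chain[-1]])
    return chain

# The demo's cascade: switch alarm -> host alarm -> instance alarm.
caused_by = {
    "instance_alarm": "host_alarm",
    "host_alarm": "switch_alarm",
    "switch_alarm": None,   # the root cause has no cause
}
print(causal_chain("instance_alarm", caused_by))
# -> ['instance_alarm', 'host_alarm', 'switch_alarm']
```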
You can see the whole entity graph and the connections within it. You can see the alarms, the different alarms. In this way, again, you have one place which collects the data from many data sources, and you can see the whole system in this graph and the connections between the different entities, so you have much more knowledge of what is really going on in your system. I think my time is up, so I will give it back to Ohad.

Thank you, Alexey. I want to finish with what is next. We saw what we currently have in Vitrage; so what is next? What are we planning for Newton? First, we want to add more data sources. As I said before, we want to add drivers, maybe for Monasca, or for Zabbix, which is an external monitoring tool. Another thing we want to add is integration with other projects in OpenStack. Currently, the deduced alarms and deduced states are internal to Vitrage: we expose them via APIs and in our UI, but we don't actually modify the state in Nova. We want to work on integration with projects like Nova, Neutron, and Cinder, so that they take the insights from Vitrage and actually modify the state of the resources, or anything else that can make use of Vitrage. Next, we want to do alarm aggregation. You may get a lot of alarms that are connected to one failure, or you may want to group them by resource, et cetera, so we want to add this functionality to Vitrage. We also want to add more templates and use cases. Vitrage comes with a set of out-of-the-box templates, so you don't need to start from scratch; we provide the common use cases, and everyone can edit them or add their own new templates. And last: currently Vitrage uses NetworkX, an in-memory graph library. It's great for DevStack, but if you want to move to production and larger deployment environments, you may also want a persistent graph database, so we want to add this functionality too.
We are considering Neo4j or Titan as a persistent graph database. My last topic is to talk a bit about NFV. Vitrage is a root cause analysis project that is good for everyone: good for enterprise, good for IT. But it's also good for NFV, and it fits NFV perfectly. As Iris mentioned in the beginning, we are CloudBand; we have been dealing with NFV for five years, and we know the requirements for NFV. Vitrage was built to fit the NFV requirements: to provide this correlation between the three layers, from the physical to the virtual to the application layer, and to give fast, accurate troubleshooting in order to recover from a failure in the system. We have a good match with the ETSI NFV requirements and the OPNFV requirements. And I want to mention the Doctor project. Doctor is a fault management project in OPNFV, and we are working together with them; Vitrage is actually a reference implementation of the Inspector component of Doctor. I invite you to come today and listen to our next session, a joint session with Doctor, where we will talk a bit more about how Vitrage implements the Doctor requirements for OPNFV. So, last, here is what I want you to remember about Vitrage. We introduced Vitrage, a new project in OpenStack, and if you have to remember three things, these are its three main functions: Vitrage provides a holistic view of your system, Vitrage enriches the alarms and the states of your cloud, and Vitrage provides root cause analysis so you know what was the root cause of the failures. Thank you very much. Maybe we can take one or two questions, if someone has questions. Can you send the alerts to an external monitoring system like Zabbix instead of using the Horizon view? You mean to notify external monitoring tools? Yes, the architecture supports it: you would write another notifier, to notify not just OpenStack projects like Nova, Cinder, et cetera, but an external tool. So it should be supported.
It's just a matter of writing this driver, this notifier. Does Vitrage use any APIs from CloudBand? Can you repeat? Is CloudBand used in Vitrage at all? Yes: the CloudBand infrastructure software product will use Vitrage as part of its installation. OK, and in your demo, did you use CloudBand, or was it pure OpenStack? No, the demo was pure OpenStack. All right, thank you. Do you have any plans to add monitoring for containers, Docker containers, which might be running on Nova VMs? We have no current plans for Newton; this is absolutely something we should consider for the future. OK. Thank you very much.