Okay, can everybody hear me? Great. So hello everyone, and thank you for staying so late on one of the last days of the summit. We hope to give you an interesting talk today. My name is Dr. Alicia Rosenzweig, and I have here with me Ifat Afek, who is the PTL of Vitrage. I'm a core developer in Vitrage, and we're going to talk today about some of the ideas behind Vitrage and behind root cause analysis in general: what is out there, what can be done, and where we think we should be going with this. This is also supposed to be a thought-provoking discussion, to think about things in a broad way. So whether during the session or especially at the end, we'd be very happy to take questions. We want to make sure that we are aligned with the rest of you in terms of your vision as well.
So what are we going to talk about today? First of all, what is root cause analysis, and why should it be done? Oh, one second. I think we have a timed slideshow over here by mistake. Okay, let's hope that it's not doing that. Sorry. So, what is root cause analysis and what is it good for? Then we're going to talk about some of the aspects of Vitrage itself: what we had to do in order to build this engine, what the underlying components are, and the principles that guided us. Then Ifat will take over, show us a demo, and talk a little bit about what we see in the future of Vitrage, also in this broader perspective.
So what really is root cause analysis? If you look at Wikipedia, what is a root cause? Wikipedia defines it this way: a factor is considered a root cause if removal thereof from the problem-fault sequence prevents the final undesirable event from recurring.
This is a very interesting statement, because what it's saying is basically that we're trying to imagine what the world would look like if the event, the fault, did not occur. We're saying that had this event not occurred, the problem would have disappeared. So even by that definition you can already understand why determining the root cause of anything has certain difficulties. Oh, one second, I'm not sure what's going on here. So that's what a root cause is. Root cause analysis is the method of identifying root causes among system events, usually failures. Usually the reason we want to know why something happened is that something bothered us in what occurred, right? We're not usually going to look for root causes of good things, but of bad things.
Now, why is this important? Well, there are a lot of things we can do with root cause analysis. Once we know the root cause, this leads to a whole bunch of features. If we start on the right side over here, first of all you get understanding. You want to know what's happening in your system. It's your system after all, or you're relying on somebody else's system. The second thing is that once you know what the problem is, you can have a fast reaction. Even before you've fixed the problem, before you've taken care of the root cause, you can already address it.
You can say, oh, I have a problem on this host, so I guess I should do my healing on a different host. Whereas if all you knew was that your VM crashed, you might think, I'll just redeploy a new VM on the same infrastructure. So by understanding what caused the problem you already gain the benefit of being able to act immediately and have a very fast response.
Next up is accountability. After the problem occurs, your customers are going to ask why, what happened. And it's not only going to be your customers. At Nokia, Vitrage specifically is something that we were working on in the context of NFV, of taking networking infrastructure to the next level and moving it to the cloud. There, issues of accountability, of performance, of regulation are very important. They're central. You can't make certain changes without being accountable and reaching a certain level of reliability in your system. So it's very important to be able to say why things happened and how you're going to fix them in the future. Without that clarity, you can't even make that move, even if you want to.
Moving on, of course, we have actual fixing. We know the problem, we know what the root cause is, and we can go and address it. We can fix that host that is giving us trouble. We can fix the switch that's not functioning well.
And finally, perhaps interestingly enough, root cause analysis is simply the reverse process of prediction. With root cause analysis, we have a problem and we're asking ourselves why it happened. With prediction, we have a problem and we're asking what it is going to cause. What is going to be the impact of this problem?
So if in root cause analysis we say, we have problem X, what is the problem Y that caused it, then in prediction we say, we have problem Y, what problems X is it going to cause? So the moment you have the understanding of root cause analysis, you can just reverse the process and start talking about prediction. You can see how moving from a reactive approach, looking at what happened in the past, to a proactive approach all falls under the rubric of root cause analysis. We're only at the beginning in terms of Vitrage, but already we can see how much we could benefit if we had a very good engine that does precisely these things.
Let's see if I can stop it from doing what it usually does. Okay. So what approaches are there to root cause analysis? Well, in general we can classify them into three groups. Whoop, here we go again. One day I'll win, and not today, apparently. Yeah, what is the root cause of this problem? Okay, maybe if I press this. Let's see.
Okay, so we classify this into three groups. One of them is what we call expert judgment. This is what you did even before you had computers, right? We don't even need to have the cloud. You simply have people who are experts in a field, and they go and look and say what happened over here. This actually takes place in many companies. When something bad occurs, say there's a safety regulation that wasn't followed or a certain problem in the quality of a product, you go, you check, you see what happened. How did this happen? How can we stop it from happening again, right?
So this kind of approach, which you can think of as a human, manual kind of approach, relies on and reflects the expertise of the RCA investigator, the person who goes to do the investigation. And it's very subject to subjective bias. It has to do with what that person saw, the priorities that person has, what they know, how they approach these problems, etc. And the data that we have for this is simply the experience of that root cause analysis investigator.
Moving on to more automated systems, we have the statistical approach, which says, okay, we're going to look at the system, we're going to collect data, and we're going to check the correlations and see how much two events correlate with one another. When they have this correlation, we can say there's some link between them. The move from correlation to causation, saying A caused B and not B caused A, is not simple. It's definitely a tricky issue. We can try to use time-based information, for example that event A occurred before event B. But that doesn't always work, especially when you're talking about high-speed monitoring. Sometimes you can have a situation where one thing is being monitored at a sampling rate of once every 30 seconds and another once every 35 seconds. The fact that one reached your monitoring system before the other doesn't necessarily mean that the first one caused the latter. So it becomes more tricky when you take this to the next level, in terms of causation. But it definitely helps to have that kind of chronological ordering.
And finally, while statistical techniques look more at what probably happens and learn based on experience, you also have more analytical approaches, like forms of formal logic, that try to reason about causation and think really in terms of what's called counterfactual reasoning, which
is what I mentioned before, where you say, well, I see what happened, but if something had happened differently, what would have happened then? With counterfactual reasoning, once you have it in place and it works well, you can get things that you can really rely on. The difficulty, of course, is that a lot of times you don't have all that data. You don't always see the entire picture. So whatever you see is also limited, though the methods you're using are more well structured.
So what did we do in Vitrage? We definitely want to move on to more automated approaches, things that Ifat will also probably talk about in the next section. But what we started from was what we call automated expert judgment. If you recall from the previous slide, the first group was expert judgment, where you take people who have expertise and use their knowledge to debug the system and solve problems. The approach in Vitrage is that we want to automate that process. We want to take the information that we know about the systems. In most cases we understand each system individually. We have people who are experts in compute, people who are experts in storage, and each of them understands their piece of the system. The problem is that they don't see the full picture, and the full picture is critical in order for us to make sure that errors get to where they're supposed to get to on time.
So what do we do over here? We start over here on the right. We have expert judgment, and we codify that judgment into what we call Vitrage templates. Vitrage templates are YAML files, and you can see some examples in our wiki, and those of you who have visited our booth have seen them. These YAML files express the rules that the expert sees as, yes, this can happen: if I have a high CPU load here, I'm going to have CPU performance problems there; if a switch crashes here, that's going to create problems on the host; etc.
These are things you know in advance, things that you've experienced, and you take that experience and codify it in these templates. Now, you can see over here in these templates that I have these different shapes, and each of them represents a different error. We're going to talk about that in a second, for the second part.
So you have this expertise, but then you also need to have the data. When something happens, you have to know what happened, and so you have to collect information from OpenStack services like Nova, Heat, Neutron, Cinder. You also need to collect information from external sources. OpenStack doesn't have everything. We have things from Zabbix, Nagios, and other monitoring tools. We can collect a lot of information and pull it into one location, into what we call the entity graph. The circles, the round vertices, represent the different resources, and the other shapes are the alarms that are being raised on them. You can see over here that we have a bunch of alarms that we've received from different monitoring tools, and we've connected them because we understand that this alarm is on that host, the green alarm is on the VM, etc. So we just connected them.
Now we have these two components. How do they come together? If you look at what we have here, we have these rules saying that a blue triangle causes, that's the arrow pointing down, the yellow square, and a blue triangle can also cause a green circle, and a green circle can cause a red pentagram. Not pentagram, I meant... I forget what it is.
Sorry. And so we can connect these with these dashed arrows, which represent causal relationships, saying that in this system you can see that the blue triangle caused these two alerts, the yellow and... one second, the yellow... sorry about that. Okay, let's hope. You see here two things. One of them is the root cause analysis aspect, where we can see that the different alarms cause things. But also, we know that a blue triangle causes a green alarm, a green circle. Again, this is short form in this visual, but it always happens on a VM. You can see here we have two VMs. One VM already had a green alarm on it and the other one didn't. But if I know that having a blue triangle causes a green circle, then I can also raise that alarm on the other VM. And this is critical, because with these deduced alarms, and also with changing state, which we can also do, what we're really doing is not just adding more alarms to the system. Think about it. If we have a host, which is the node with the letter H inside it, who sees the host? That's the admin. Who sees the VM? That's the tenant. So really, if I didn't raise an alarm on the VM, the tenant wouldn't see anything. He wouldn't know there's a problem until he tried to use the system and discovered that it's not working. So the deduced alarms are not just making more alarms and adding more data to the system. They're making sure that the information gets to the right person, to the right user, and that whatever is happening on their level of the system is being reflected in their systems. Using these rules, we can propagate the information through the system as needed.
So what do we get from all this?
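To make the deduced-alarm step concrete before summing up, here is a minimal Python sketch of the idea from the slide. It is an illustration only, not Vitrage code: the `Alarm` class, the `CONTAINS` topology, the `HOST_TO_VM_RULES` table and the `apply_rules` function are all hypothetical names, and the shape names stand in for real alarm types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    name: str              # alarm type, e.g. "blue_triangle"
    resource: str          # resource the alarm is raised on
    deduced: bool = False  # True if raised by a rule rather than by a monitor


# Hypothetical topology: host "H" contains two VMs.
CONTAINS = {"H": ["vm-1", "vm-2"]}

# Hypothetical expert rule: this alarm on a host causes that alarm on its VMs.
HOST_TO_VM_RULES = {"blue_triangle": "green_circle"}


def apply_rules(alarms):
    """Return (all alarms, causal links) after applying the host-to-VM rules."""
    result = list(alarms)
    causes = []  # (cause, effect) pairs, i.e. the dashed arrows on the slide
    for alarm in alarms:
        effect_name = HOST_TO_VM_RULES.get(alarm.name)
        if effect_name is None:
            continue
        for vm in CONTAINS.get(alarm.resource, []):
            existing = next(
                (a for a in result if a.name == effect_name and a.resource == vm),
                None,
            )
            if existing is None:
                # No monitor reported it, so raise a deduced alarm the tenant can see.
                existing = Alarm(effect_name, vm, deduced=True)
                result.append(existing)
            causes.append((alarm, existing))
    return result, causes


# One VM already has the green alarm; the other gets a deduced one.
observed = [Alarm("blue_triangle", "H"), Alarm("green_circle", "vm-1")]
all_alarms, causal_links = apply_rules(observed)
```

In this sketch the alarm that was already on `vm-1` is only linked to its cause, while `vm-2` receives a new alarm marked `deduced=True`, which is the part that makes the problem visible at the tenant's level.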
So I've already mentioned a few of these things; I'll just repeat them very briefly. First of all, we get a holistic view. Although we have this local understanding of what happens in each individual component, we don't see the full picture. So even though we're not yet doing automatic discovery, statistically etc., we do see the big picture, the full picture, as a result of this kind of automated expert judgment. It also puts everything together automatically, so it happens fast. It doesn't take days or hours; it takes seconds or milliseconds. The second thing is that we propagate it through the whole system.
And this is really just the first step, right? It's not that we think this is the be-all and end-all of root cause analysis. But it is definitely the first step, because it gives you a fast ramp-up of root cause analysis. Most companies have this information; they just want to automate it. On the other hand, statistical and analytical approaches demand a lot of data. You have to have a lot of data stored up, and you can't get really meaningful results until you've done that. You might get all sorts of results, but you won't know if you're really finding the relevant information. So as a first step, to start working on root cause analysis and to add things as you discover them, this is very helpful. And about that adding, that's the second point.
It's configurable. Each customer may have a different system. They may care about things that other customers don't. Maybe one customer runs a system that constantly churns the CPU, and that's okay, and another customer thinks that high CPU churn is a problem. So it really allows you to configure it initially to be appropriate for what you need.
Okay, I'm going to finish up with one or two slides about the structure of Vitrage, and then I'm going to hand it over to Ifat. So we have the data sources. This is a little bit outdated, and we have a few more. Here they are grayed out, but you can also see some of them. These data sources are basically the information that we get. This is what builds the topology of the graph, what is connected to what. As you can see, we have here Nova, Nagios, and Aodh; we have a POC for Cinder and Neutron. We also have Heat and Zabbix. We have a lot of nice data sources. All this gets propagated into the entity graph, the graph you saw before that connects everything together. This is the heart of Vitrage, where all the information is stored. Then we have the evaluator. We give it the templates, and the templates, where the expert judgment was put in, are what gets evaluated when changes occur in the graph.
So any change is a reason to analyze and see whether it has any impact on my system. These are the templates; this is a small excerpt from a template. And of course, whenever an event occurs, we have notifiers that notify externally, to Nova, to Aodh, and we can write additional notifiers to notify other services. In the context of NFV, we can notify a VNFM or other services that do policy. So whatever new alarms we raise or discoveries we make, we can notify about them. And finally, of course, like every normal project, we have an API and a UI that can be used to see the Vitrage insights. So I think I went over most of these details, and now I think it's time for our demo.
Hi, so I'm Ifat Afek, and I'm going to show you a demo of Vitrage. You've heard about Vitrage and about root cause analysis, so let's see how it actually works. Okay, I'll just refresh everything.
So in Vitrage we added a few screens in the Horizon setup, and I'm going to show you these screens. The first one is the topology, which is a kind of dashboard that shows you the status of the system. You can see quite fast that everything is green, everything is working well. You actually see mostly the compute hierarchy, what we get from Nova, in a sunburst presentation. You can see the entire OpenStack cluster, and then in rings the availability zones, then the hosts and the instances. For example, I can drill down to an availability zone and see some details about it, and I can drill down further to a host and to an instance running on this host. We will see later on that when something is wrong, it is very easy in this view to identify where the problem is. But you don't see that many details.
So I'll move to a more detailed view, which is the Vitrage entity graph. As Alicia explained, Vitrage collects data from different data sources, like OpenStack data sources or external monitoring tools, combines everything into a graph, and shows you the relationships between all these entities. Here you can see the physical area: in this cloud there are two availability zones and the computes. You can see the virtual area, the instances. You can see an application, which is a Heat stack, the Cinder volumes, and the network, which is also connected to the instances. So it is very easy to understand the relationships between different entities and how they affect each other.
Now I'm going to simulate a failure in the system, and we will see how it is reflected in Vitrage. I assume most of you know Zabbix. Zabbix is a very common monitoring tool, and Vitrage integrates with the Zabbix and Nagios monitoring tools. I'm going to simulate a host NIC failure. It won't be a real failure, but I'm going to tell Nagios, I mean tell Zabbix, to let me know there is a problem with the NIC. I selected the wrong test, so I'm just modifying the definition of the test. This is a live demo, and I just modified something in Zabbix, so it takes Zabbix a while to understand there is a change and to notify me about the problem, because it's not a real problem. After we see the fault in Zabbix, which just happened, we will go back to Vitrage and see how it is reflected there.
So here we see that Zabbix thinks there is a problem with the NIC. Let's go first to the topology, and we see that not everything is green any longer. It's quite clear that there is an area that is problematic, and now we can drill down and see that there is an alarm on the host, that there are two instances on this host, and that there are alarms on the instances as well. If I go to the alarms view, I can see four alarms. But I only told Zabbix there is a problem with the host, so why are there four alarms? The reason is that three of them are deduced alarms that were raised by Vitrage. This view shows alarms of different types. Here we see a Zabbix alarm, which is the original alarm, and three alarms raised by Vitrage, and we could also see Nagios alarms or Aodh alarms here.
Now I want to understand what is wrong with my application, so I can find the application alarm. I see the application is not highly available, and I can open the root cause for this alarm. In this view I see a top-down causal relationship of the alarms. I see that the application is not highly available because there is an alarm on the instance that the application is using, and that alarm happened because there is an alarm on the host about a failure in the host. This view is context-aware, meaning that if I open it for another alarm, I may see a different causal relationship that is relevant for that alarm. So I'll open it for the original alarm that came from Zabbix. Here I see that the host failure affected two instances. One of them had an application running on top of it and the other didn't. So this is the entire effect that this alarm had. And if we go back to the entity graph, we can see the alarms in the entity graph as well.
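The context-aware root cause view from the demo can be sketched as two walks over the same set of causal links: backwards from an alarm to find its causes, and forwards to find its impact, which is also the "prediction is root cause analysis in reverse" point from earlier in the talk. This is a minimal illustration, not the Vitrage implementation; the alarm names and the `CAUSAL_LINKS`, `causes_of` and `impact_of` names are hypothetical.

```python
from collections import defaultdict

# Hypothetical causal links (cause, effect), mirroring the four alarms in the demo.
CAUSAL_LINKS = [
    ("host NIC failure", "instance-1 down"),
    ("host NIC failure", "instance-2 down"),
    ("instance-1 down", "application not highly available"),
]


def _reachable(start, edges):
    """Collect every node reachable from start by following directed edges."""
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def causes_of(alarm):
    """Root cause direction: walk the links backwards from an alarm."""
    reversed_links = [(effect, cause) for cause, effect in CAUSAL_LINKS]
    return _reachable(alarm, reversed_links)


def impact_of(alarm):
    """Impact direction: walk the same links forwards from an alarm."""
    return _reachable(alarm, CAUSAL_LINKS)


app_causes = causes_of("application not highly available")
nic_impact = impact_of("host NIC failure")
```

Opening the view on the application alarm corresponds to `causes_of`, which returns only the one instance alarm and the host alarm behind it, while opening it on the host alarm corresponds to `impact_of`, which returns both instance alarms and the application alarm.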
Okay, I'll just move them a bit so it will be clearer. Sorry. Here I also see how these alarms are connected to the resources. There is one alarm on the host, and the host state has also changed accordingly. Oh, sorry, I guess the session just expired. It happens in live demos, I guess. Okay, so back to the Vitrage entity graph. I can see here the alarm on the compute, and the compute state has also changed, and Vitrage notified Nova about this change, so the change is also in Nova. There is an alarm on one instance and another alarm on another instance. These alarms are connected to one another, and there is an alarm on the application. You can also understand from this view that this application has two instances related to it. This explains the alarm: the application is still running, but it's not highly available because one instance is down.
Another thing that I want to show you is a new view that we added in Ocata. It's not in this environment, so I'll switch to another environment, which shows the templates. You actually saw a kind of trick over here: there was one alarm, and new alarms were raised. I'll try to explain how we configure this behavior, because it's something that is controlled by the user of Vitrage.
Okay, so this is a new environment that has the template view. Over here you can see a list of the templates that are currently loaded in Vitrage and their statuses. In case you had an error, like a typo in the template, you see in the details some hint about what's wrong with the template. And you can open and look at the templates. Here you see the structure of the template. The first blocks are the building blocks of the template, and the interesting part is the scenarios. We have one scenario saying, and as you see it's very human readable, if the public NIC fails on the host and the host contains an instance.
This is the condition, and the actions that we want to execute are, first, raise an alarm on the instance, and second, set the state of the instance. Then we have another scenario saying that if all this happened and the alarm on the instance was already raised, we want to add a causal relationship to connect the two alarms. So the next time somebody asks, we know that one alarm is the root cause of the other alarm. Okay, so that's it for the demo, and I'll go on and talk about the future of Vitrage.
We have a very big roadmap for Vitrage. There are some functionalities that we would like to add. One of them is alarm aggregation. You saw there were three or four alarms in the alarms view, but imagine a very big setup, like a production setup with hundreds of VMs, where you could get hundreds of alarms. Okay, we have the root cause, but it's still hard to find the first alarm to start looking at. We would like to add some alarm aggregation, something that looks like a tree. We imagine you seeing just the root causes, or the most significant alarms, or alarms aggregated by resource, and then being able to expand and see all the details where it's interesting.
Another use case is the alarm history, and this is something that we also discussed in the design session yesterday. Right now you see the active alarms in the system and how they affect one another. But suppose there was some error that happened in the night, when nobody was watching, and the next day you would like to know what happened. Or suppose an alarm happened on the host and was fixed, but the application didn't recover. So there is still an alarm on the application, but the original alarm is no longer there. We would like a mechanism that allows the user to understand what happened, maybe some kind of slider in the UI showing how the root cause analysis graph looked over time.
This is something that we need to design and implement. And of course we can talk about auto-detection of alarm correlation. Right now, it's configured in the templates of Vitrage. It would be great if we had a mechanism that looks at the history and understands that whenever alarm A happened, right after it there was alarm B, so there must be a relation between them, and maybe suggests a correlation or does something automatically.
We also discussed usability issues. The entity graph is a great tool for understanding the relationships between resources, but in a larger setup it can be very crowded, and we want to find ways to highlight what's most important, or to let you see only the part of the graph that is relevant for a specific use case. This is something that we also need to think about. Time-sensitive root cause analysis is what I just talked about.
Then there is template creation and editing. Right now, you can edit a template as a YAML file. We would like to have a smart editor in the UI that allows you to edit a template and maybe corrects you if you make mistakes. The template is built with references between the parts of the template. It's very readable and very easy to understand, but you can make mistakes if you try to reference something that doesn't exist. We have template validation already, but it would be great to have a UI tool for that.
Regarding the language of the templates, there is a lot more we can add. Right now our scenarios support 'and' and 'or' conditions, but we would like to add support for a 'not' condition, which raises some logic questions.
For example, take a case of high availability where there are two switches. I would like to say that if two switches are down, there is a critical error, and if one switch is down, it is a warning. But I want to make sure that if two switches are down, I don't get both alarms. I only want to get the alarm about the critical error; I don't want to also get the alarm that one switch is down. So I would like to say 'if exactly two switches are down'. Right now the language of the templates does not support it. And I'm sorry, it keeps moving.
And we need more data. The more data Vitrage has, the more complex templates you can create and the more insights we can give you. First of all, that means more data sources. We would like to integrate with other OpenStack projects, as many as we can, and of course add external monitors. Everything that can give us more information is valuable. We would also like to add new consumers. Vitrage raises alarms and modifies the states of objects, but most of it is only in Vitrage. There is one case where we notify Nova when the host is down, and we actually change the state of the host in Nova. We would like to have more cases like this and more integrations with other OpenStack projects. We have a POC of an integration with Aodh, which is the telemetry alarming service. So we know how to raise Vitrage alarms in Aodh, but it doesn't work well. It requires some coding in Aodh.
We already discussed it with them, and this is something that we plan to implement, so that every Vitrage alarm can be used in Aodh. We also have a mechanism for Vitrage notifiers, where it is very easy to add another notifier if you have your own system that you would like to get notifications from Vitrage. But we want to write as many notifiers as we can on our own, to make it easier to integrate with Vitrage.
And we need use cases. We have a few use cases that are used in Nokia's CBIS product and could be common, like the host NIC failure that I showed you, but we would like to understand more real use cases from customers that we can support. Sorry again. We would also like to have out-of-the-box template libraries, which is not so trivial, because the templates depend on the alarms that you get and on the monitors that you have. If you have Zabbix, Zabbix is a pluggable mechanism and you can put different tests in it, so we could write templates that are relevant for specific tests and are not relevant for customers that have another set of tests. But we do want to supply some example templates and think of more interesting use cases that these templates can cover.
And we want to make Vitrage indispensable. We would like to reach a point where every cloud operator knows that they need Vitrage to operate the cloud, to understand the faults, and to be able to manage it as easily as possible. Thank you.
If you have any questions for me or for Alicia...
Okay, so I'll repeat the question in case you didn't hear it. The question was whether the entity graph is created automatically, with all the relationships of switches, computes and VMs, or whether we need to somehow configure it. Vitrage is made of different data sources, and each data source is responsible for managing its own relationships. As for the switch configuration, we get it from a static data source, because right now we don't have anything that gives us the switch configuration. So you define which switch is connected to which host. But most data sources do it automatically. If we connect to Nova, we get from Nova a list of hosts and their relationships to the zones and to the instances. If we connect to Cinder, we get the Cinder volumes, and for each volume we get the information about which instance it is connected to. So each data source knows about other data sources and connects to them, and altogether it creates a nice graph. Thank you. Any other questions? Yes. Yeah, for sure.
So on GitHub, as open source, you currently get integrations with the basic OpenStack projects, and we plan to add integrations with other projects. But if you are a service provider and you have your own data, which is not open source or is not interesting for anyone else, it's very easy for you to write a new data source, a matter of two or three weeks' work, and plug it in to Vitrage automatically to have your own data and see it in the graph. And there are people doing it in Nokia for another Nokia product.
Well, the view that you see depends on your cloud, your physical cloud. One thing is the code: you can write code that only you have, and nobody else will use it. And what you actually see in the graph is the topology of the cloud that Vitrage is running on. Okay, did I answer you? Okay, any other questions?
Again the slide moved. We are a relatively new project. We started a year ago, and in June we became an official OpenStack project. We are looking for contributors, and it's a very interesting project. So if you are interested, or if you have any questions, or if you want to give us feedback, or you have use cases that you would like us to implement, we will be very happy to hear about it. There is some contact information here: you can use our mailing list or our IRC channel. We are looking for feedback. Thank you.