Hi, good morning. Thank you for coming to this session. My name is Oad Shamir, I'm a product manager in CloudBand, Nokia, and I have been working with Vitrage, the official OpenStack project for root cause analysis, from day one. With me here is Yuval Adar, an R&D manager in CloudBand, Nokia, who is leading our analytics team. We will talk about automation: how we are advancing Vitrage toward auto-detection of RCA patterns with machine-learning algorithms. I intend to look at what root cause analysis is and why we care about it, then at how Vitrage works under the hood, and then Yuval will talk about auto-detecting RCA patterns in Vitrage with machine-learning algorithms: what we have already done, and what is still on the roadmap. So, let's start.

What is root cause analysis? If you look up the definition of a root cause, you will see that a root cause is a factor which, if removed, prevents the problem from recurring, meaning that if I knew the root cause of a problem, I would be able to prevent the problem from occurring. Root cause analysis is the method of identifying the root cause of system events, usually failures.

And why do we care about root cause analysis? Root cause analysis has many uses, and it dramatically changes the way we understand and operate our systems. If you look at the diagram, the left side is the past and the current situation of the system; the right side is the future. IT systems today are reactive only: they can provide the left side. We want to be able to be proactive, to predict the failure, to be able to prevent the failure from occurring, because we know the root cause analysis and we know how to predict and take proactive action in order to prevent failures from occurring.

So let's start from the left side. First, you want to understand the current status of your system. If you can't reach your VM, you want to know why. What happened? You want to know if there is a host problem, or maybe some other problem. You first want to be able to see the current status of the system, and you want your system to reflect the current and accurate status.

Then you want to react, and react fast. You want to fix your problem. If you can't reach your VM, you need to take some action. You may want to move it to another host or to restart it, and the action that you take in order to recover depends on the root cause. If the host is not available, you don't need to restart the VM; you have to move the VM to another location. So it depends on the root cause.

Then we have accountability, which is very important. You need to provide your customers with an analysis: what the current status of the system is, what happened, and even what the steps to overcome those problems are, i.e. what you are going to do in order to provide them the service that you promised.

Then, moving to the future, we want to be proactive. We want to treat not just the symptoms but also the root cause. So if in the fast-reaction step I treat the VMs that are unreachable, now I also want to go and take action on the host, or even on the NIC. If there is a NIC failure that causes the host to be unreachable, which causes the VM to be unreachable, I now want to treat the root cause problem and fix the NIC failure (see the sketch below).
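As a minimal illustration of the "recovery depends on the root cause" point (the alarm names and recovery actions here are hypothetical, not part of Vitrage):

```python
# Hypothetical sketch: the recovery step is selected by the diagnosed root
# cause, not by the symptom ("VM unreachable") itself.

RECOVERY_ACTIONS = {
    "host_down":   "evacuate_vm_to_another_host",  # restarting in place is useless
    "vm_crashed":  "restart_vm",                   # VM-local fault: restart is enough
    "nic_failure": "replace_nic",                  # fix the cause, not only the symptom
}

def plan_recovery(symptom: str, root_cause: str) -> str:
    """Map a diagnosed root cause to the recovery step for a given symptom."""
    return RECOVERY_ACTIONS.get(root_cause, f"open_ticket:{symptom}")

print(plan_recovery("vm_unreachable", "host_down"))  # evacuate_vm_to_another_host
```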
And last, and maybe one of the most interesting uses, is prediction. If you think about it, prediction is the reverse process of root cause analysis: in root cause analysis I have a problem and I work out what the root cause is, so if I understand the root cause, I can predict what will be affected by a problem. It's a reverse process, and we can use the root cause analysis patterns to understand, and also to predict, failures that can occur in my system.

So let's go under the hood of Vitrage. What we are doing in Vitrage, we call it automating the expert knowledge of root cause analysis. If you look at the right side of the slide, we have multiple sources in Vitrage that we get the data from. There is no single monitoring tool that can cover everything; you need to gather the information from multiple sources. So we take information from sources like OpenStack, with the statuses from Nova, Neutron, Cinder, and Heat, and also from external systems, for example Zabbix or Nagios, that monitor the hardware. But we are not collecting just the entities and the status of the system; we are also getting the data of all the alarms from the different sources. So we get alarm data from Aodh, OpenStack, Collectd, and external systems like Zabbix and Nagios. In the end, Vitrage can provide a holistic and complete view of the current status of the system, with all the alarms coming from the different sources. You can see that I have a yellow square alarm on the switch, a light blue triangle alarm on the host, and so on.

Then what we do in Vitrage is take the expert knowledge, the human knowledge, from the DevOps people. The DevOps people know a few patterns of alarms: they know that a problem on the host may affect the VM, they know that a problem on the switch may affect the host, and so on. So we take all those use cases and root cause analysis patterns and automate them, in a form that we call a Vitrage template. We will explain what a Vitrage template is in a minute, but take a look at the diagram on the left side. Every shape represents a different alarm, which may come from a different source. I understand that a green circle alarm causes a red pentagon alarm, that a light blue triangle alarm causes a green circle alarm, and so on. If I take this together with the information we collect from all the sources, I can now understand the relationships between the alarms: which alarm causes which alarm, the patterns and the relationships between the alarms.

But it's not just knowing the root cause. Look at the right side. I know that a blue triangle alarm causes a green circle alarm, so I know that the problem on the host causes a problem on the VM. But the second VM has no alarm. So I can deduce an additional alarm and add more information to the system: the dark green alarm is an alarm that was added by Vitrage based on the understanding of the root cause analysis patterns and the topology of the system.

So what is a Vitrage template? A Vitrage template is a very easy, readable way to capture this business logic of a root cause analysis pattern. We are using YAML files; they are very human-readable, very easy to add, and very easy to modify. In the very simple example that I have, at the top of the template we have a section of definitions, and then we have the scenarios: the condition, for example an unreachable host where the host contains an instance, and then the action that I want to take. In this case I want to raise an alarm on the instance called "instance down" and to set the severity of this alarm to critical. A sketch of such a template follows.
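To make this concrete, here is a sketch of what such a template can look like, following the structure just described (a definitions section with entities and relationships, then scenarios with a condition and actions). The field names follow the Vitrage template format as we understand it, so treat the details as illustrative rather than authoritative:

```yaml
metadata:
  name: host_unreachable_scenarios
definitions:
  entities:
    - entity:
        category: ALARM
        name: host_unreachable      # e.g. raised by Zabbix or Nagios
        template_id: host_alarm
    - entity:
        category: RESOURCE
        type: nova.host
        template_id: host
    - entity:
        category: RESOURCE
        type: nova.instance
        template_id: instance
  relationships:
    - relationship:
        source: host_alarm
        target: host
        relationship_type: on
        template_id: alarm_on_host
    - relationship:
        source: host
        target: instance
        relationship_type: contains
        template_id: host_contains_instance
scenarios:
  - scenario:
      condition: alarm_on_host and host_contains_instance
      actions:
        - action:
            action_type: raise_alarm    # the deduced alarm shown on the instance
            properties:
              alarm_name: instance_down
              severity: critical
            action_target:
              target: instance
```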
So, to summarize my part: what we are building in Vitrage takes the human expert knowledge, automates it together with the multiple data sources, and provides a holistic view of the current status of our system. We can understand exactly what the statuses are and what alarms are coming from the multiple sources, and we can also propagate our insights, raise additional alarms, and expose them to the users. This first step is a fast ramp-up: it's very easy to get an efficient system and to get good results very quickly, and we can configure it, as I said, very easily for the current configuration of each user or customer. But we want to take it to the next level, and the next level is to create those patterns, those templates, automatically, using machine-learning algorithms. So I will switch over to Yuval to explain the next step, using the machine-learning algorithms.

Hi everybody. Thank you very much for the overview of Vitrage and what it knows how to do; now let's look toward the future of Vitrage and how we envision it. The very next step for Vitrage is to overcome the limitations that we have today with our expert judgment. After all, our DevOps experts are only human: they have a certain bias, and they analyze problems in their own way, one that they know works for them, but that might not be the optimal way to do it. Usually the evolution of these patterns is pretty slow, so while it might be well suited for smaller environments, looking forward to larger environments it's not the best approach, and when we're talking about big data it's very hard to use our current approach. So what we want to do is apply statistical analysis: we want to discover connections between the alarms that we get from our systems, without being affected at all by the contextual bias. The potential we see here is that the new system will always be on (we don't have to rely on human experts, who do have work but also have families), and we can easily adapt this new approach to every single system. It's important to note that a part of our future plan is already in review in Vitrage, but a lot of the following slides are challenges that we face and general ideas on how to approach those challenges.

So let's talk a little about statistical causal analysis. What is the main challenge that we have here? When we're talking about causal relationships between certain events, we can always look at it in a very simple direction. We can have two events that occur in our system, event X and event Y, and we can create a connection between these two events and say that X causes Y. But it doesn't always have to be that way: Y can cause X in certain cases, or there might even be a confounding variable somewhere outside of the system which is the cause of both of those events. So correlation between the events is very easy to find: if we see a pattern where X appears and then Y appears after X, it's very easy.
We can say X causes Y, but the actual causation is very difficult to determine this way, and even very consistent correlations over time can be misleading. I can give you a very simple example. Let's say I'm walking outside and I notice that my hair is wet. There can probably be 20 different reasons for my hair being wet, but if I'm also holding an umbrella in my arm, I can say: well, it's raining, I have an umbrella in my arm, so the rain caused my hair to be wet. On the other hand, tomorrow morning I can notice that my hair is wet, I don't have an umbrella in my arm, and it's not raining, so I need to figure out why my hair is wet.

So let's look at how data is collected and represented in statistical RCA. We developed an algorithm in collaboration with Bell Labs. Every alarm or fault in the system can be represented as a segment on a timeline: we know when it starts, we know when it stops, and we know the time that the alarm was active on the timeline. We can group these segments over time according to their fault type and by resource ID, and that way we can create a timeline of the events as they occurred in the system. Once we start gathering these events, we get to the basis of our algorithm: we start comparing these events and their timelines, and we can find overlaps between them. We can easily notice that in a certain number of cases, the majority of cases, we have our X and Y events with a large overlap, and occasionally we have the Y event occurring without the X event. Then we consider the temporal evidence of this causation: logically, we can say that the cause precedes the effect, so the event X causes the event Y, while the event Y can also happen without the event X. This is the basic concept that we have behind the idea; a small sketch of it follows.
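A minimal sketch of this concept, assuming alarms arrive as (start, end) segments already grouped by fault type and resource ID. The scoring rule here is an illustration of the overlap-plus-precedence idea, not the actual algorithm developed with Bell Labs:

```python
# Alarms as (start, end) segments on a timeline. Score "X causes Y" by how
# often a Y occurrence overlaps an X occurrence that started first; Y is
# allowed to happen without X, which lowers the score.

def overlaps(x, y):
    """True if segments x and y share any time on the timeline."""
    return x[0] < y[1] and y[0] < x[1]

def causal_score(xs, ys):
    """Fraction of Y occurrences that overlap some X that started no later."""
    if not ys:
        return 0.0
    hits = sum(1 for y in ys
               if any(overlaps(x, y) and x[0] <= y[0] for x in xs))
    return hits / len(ys)

# Made-up segments, grouped by fault type and resource ID:
host_nic_down  = [(0, 50), (100, 160)]
vm_unreachable = [(5, 55), (105, 150), (300, 310)]  # one Y without any X

print(causal_score(host_nic_down, vm_unreachable))  # 2/3, X likely causes Y
print(causal_score(vm_unreachable, host_nic_down))  # 0.0 in the reverse direction
```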
Now let's go to the actual design of this RCA. The way the flow goes: first we need to collect the data. I already mentioned that Vitrage knows how to get the data from various data sources: we know how to query OpenStack, we can get our topology straight out of Heat and out of Nova, and then we can collect all of the events that occur in our system using OpenStack services like Ceilometer or Aodh. And I'm saying Ceilometer on purpose: even though upstream we're rushing ahead, we are also backported all the way down to Liberty, and we can gather all of these events there too. Then we go into the analysis phase: once we have all the events, we want to analyze them and find the causalities between the events. Once we do that, we want to make an update, and this is the new approach: instead of having our human experts sit down and write the templates themselves, we want to put the generated templates in a template repository, where we can utilize our experts in a better way: they will review the new templates and choose which ones really apply to our system and which ones they would like to use in production.

If you look at the diagram of the Vitrage architecture (I don't know how many of you are already familiar with the product), it looks pretty similar to what it used to be until now. We have the main Vitrage graph engine, which is connected to all the data sources, and it's connected to the Vitrage API, to the dashboard, and to the command line, and then we have the notifier service. But then we have two brand new boxes here. The first is the alarm accumulator, which we don't have right now: right now Vitrage has an in-memory database, and the moment an alarm is gone from the system it is removed from the database, so we don't have the history; we do keep this history in the alarm accumulator. And we also want to have the stats analyzer, which will, over time, analyze all of the events that have occurred in the system; we will then put the results in the template database, and from there they can always be pulled and applied, or discarded if that is the case.

Here are some notes and preliminary results from the tests that we made. First I want to mention two very important numbers: when we're talking statistics there are only two numbers that matter, and that's one and zero, and everything in between one and zero is what really counts for us. The main goal of our algorithm is to find the correlation score between two events, and we can always set thresholds for these scores: we can tell the algorithm "above 0.6" and it will present us with only those. In the data that you can see in this slide below (this is not exactly OpenStack data, unfortunately; it comes from one of our other products), you can see that we have events with a very, very high correlation. For example, we have an alarm for suboptimal performance in a machine which is correlated one-to-one with another alarm of the same type, and then we have medium correlations, above 0.1, where you have a lot more alarms that correlate based on their timeline. Right now we need bigger setups to test this algorithm, because we can take the algorithm, bring it to a system, and do reverse analysis of problems we already have, or we can just apply it to existing alerts that are on the system and get this data.

There are a couple of basic limitations, and there is very little we can do about some of them. The first and biggest problem is that temporal precedence between two events can be caused by different factors. If you look at the timeline and we have these two events, we might not sample them at the same time; that's the monitoring-frequency impact that we have on the system. A real-life example can be a system which is monitored by Nagios and also has Ganglia on top. We know that Ganglia will, by default, sample everything every 10 seconds and give us reports, while Nagios by default samples every 30 seconds, and then it's very hard to correlate this data: we can say event X started before event Y, but event X comes from Ganglia, which has a higher sampling rate than event Y, and this might not actually be the case. So we can see this overlap, but we have to overcome it somehow. The other limitation is that in a lot of cases there can be a time lag between the events: even if your sampling rate is exactly the same, there might be a glitch in the network, and your Nagios data might arrive 5 milliseconds later or 5 milliseconds earlier than the data you get from another system. This is something that we can actually easily overcome. First we need to find some overlap between these events, and our algorithm can do time shifts, and that's exactly what we do in this case: first we try to find correlating events, and then we can do a little time shift just to find the causality relations between the events; a sketch of that follows.
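A sketch of the time-shift idea, using the same segment representation as above; the shift range, step, and data are made up:

```python
# Compensate for sampling-rate skew / network lag: slide one alarm series by
# small offsets and keep the shift that maximizes the total overlap time.

def total_overlap(xs, ys):
    """Sum of overlapping time between every X segment and every Y segment."""
    return sum(max(0, min(x1, y1) - max(x0, y0))
               for (x0, x1) in xs
               for (y0, y1) in ys)

def best_shift(xs, ys, max_shift=30, step=5):
    """Try shifting the Y series by +/- max_shift; return (shift, overlap)."""
    candidates = ((shift,
                   total_overlap(xs, [(s + shift, e + shift) for (s, e) in ys]))
                  for shift in range(-max_shift, max_shift + 1, step))
    return max(candidates, key=lambda c: c[1])

# E.g. Ganglia samples every 10 seconds and Nagios every 30, so the Nagios
# alarm may consistently appear "late" relative to the Ganglia one:
ganglia_alarm = [(0, 50), (100, 160)]
nagios_alarm  = [(20, 70), (120, 170)]

print(best_shift(ganglia_alarm, nagios_alarm))  # (-20, 100): a consistent lag
```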
Another big limitation that we need to overcome is that when we use the entity graph as the engine, we can only find events that are connected to each other, whatever path they go through, and it's very hard to find external causes for these events. In this case we can have a non-monitored event that occurs somewhere in our system, and we know that we have these two elements, A1 and A2, which have a raised alarm. There is no direct connection between the two elements, but maybe here we can deduce that there is an outside element which directly caused these two events, and in that way we can find the relations. And here we come to our entity graph, which can help us overcome these limitations, because the graph engine we have right now really has tremendous potential as a troubleshooting tool in all of our environments. It can represent all of the relations that you have between the elements in the network, and if we do become smart enough to see those non-monitored events, we can represent them on the graph and create the causal relationships between those events. So here you can see that we have a very strong correlation between A1 and A2 which is caused by an external A0 event. Of course, this is on the roadmap; the code is open, everybody is more than welcome to contribute, and it will be highly appreciated if anybody wants to help us out.

And as I said about using the entity graph, it can also help us deal with the sampling-rate problem. Here we just need to add weight to certain elements in the entity graph, and we can also help it learn. In your system you always have your physical host and you have VMs running on the physical host, so you can always define that your physical host will most likely be the cause of problems that you experience in your virtual machines. I mean, we can always think of a different scenario where the relationship goes the other way, but we can probably all agree that in over 90% of the cases the physical hardware will cause issues on a virtual infrastructure, and we can add guidelines to bias from the start and help us get a better root cause analysis of the problem. A sketch of this kind of biasing follows.
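A sketch of what such biasing could look like, assuming we have a statistical score per direction and blend it with a topology prior; the resource types, numbers, and the blending rule are all illustrative:

```python
# Bias the causality direction with topology priors: if the entity graph says
# a physical host contains an instance, start from a strong prior that the
# host is the cause of problems seen on the instance.

DIRECTION_PRIOR = {
    ("nova.host", "nova.instance"): 0.9,  # hardware faults usually flow downward
    ("nova.instance", "nova.host"): 0.1,
}

def biased_score(raw_score, src_type, dst_type):
    """Blend the statistical score with the topology prior (simple average)."""
    prior = DIRECTION_PRIOR.get((src_type, dst_type), 0.5)  # 0.5 = no opinion
    return (raw_score + prior) / 2

print(biased_score(0.6, "nova.host", "nova.instance"))  # 0.75, pulled up
print(biased_score(0.6, "nova.instance", "nova.host"))  # 0.35, pulled down
```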
So, as we said, these are the first big steps toward using real machine learning to detect all the causal dependencies here. The initial results we have are very promising, but we still need to get feedback from the community and the industry to validate all of our results. So stick around, it will be interesting to see what happens next and to see everything on the roadmap, and of course to contribute. Thank you very much. If there are any questions I'll be more than happy to answer, but I just have to warn everybody: I'm just a manager, I'm not the most technical guy, but I probably have enough backup in the room.

Thank you, my name is Michael McCune. I'm just curious if you guys decided on what kind of processing engine you're going to use for the data analysis portion, like Spark, Flink, those kinds of tools?

I think that we're using our own engine, the Vitrage engine; actually the entity graph that I mentioned is the engine which does the analysis.

Okay, so you're not creating, you're not using some sort of external framework to do the data processing on the data that you're bringing in. Thank you.

How dependent are you on the OpenStack portion of the data collection, or the data sources? Can I just replicate the kind of output that you would get from Cinder, and if you have a standardized interface like a REST API, can that be what drives the data collection from the storage, for example?

Right now we are semi-dependent on OpenStack, but we also know how to work without OpenStack, and that's something that is on our roadmap. You need to be able to build the entity graph, and we do have an API interface for that, so as long as you can provide us the data that we expect to receive, it works. The example of the live data that I've shown here wasn't done with OpenStack; it was done with an NFV orchestrator that comes from Nokia, and we can gather data from that product and do correlation with all the alarms. So that is definitely on the roadmap, and there are already a few ideas on how to make a very open implementation of it. As I said before, we have multiple data-source plugins for Vitrage, and it's quite easy to add a new data source; we found it to be about a couple of weeks of work. So you can switch and bring your own data sources instead of the OpenStack sources. In the end you need to get all the information, so you have to make sure that you're covering all the components that you have: storage, network, compute, and everything. But it's possible, yes. And another thing to mention: right now the Vitrage user interface is dependent on Horizon, but on the roadmap for the next release we have a standalone UI which is completely independent of the OpenStack framework.

And what are you guys up to with the Sensu plugin right now?

I think it's on the roadmap. For the moment we cannot use Sensu to gather data, but it probably shouldn't be too difficult to implement, because we used to have a Nagios integration, and Sensu and Nagios are pretty much backwards-compatible with each other, so I believe it would be fairly easy to implement Sensu support.

So where can we go to contribute, if we want?

Well, the code is on GitHub. Vitrage is an OpenStack project, it's out on GitHub, so you can use the code and send us your patches. We have our IRC channel and weekly meetings, so you are more than welcome to join.

Yeah, we've got some guys that would like to contribute.

Perfect.

I have two questions. The first one is a follow-up to what the first gentleman asked about the analytics engine. As I understand, it uses the Vitrage graph; can it be, or is it, pluggable? For example, if I have my own implementation to do some analytics, can I just plug into the rest of the pieces of the infrastructure and still use my engine instead of the Vitrage default, or whatever comes with the community code?

I think at this stage some of this correlation...

Okay, and I'm very new to Vitrage, but let me put it this way. I see a lot of data coming from various monitoring agents or sources, but some of this information is also available in logs; for example, the monitoring may not catch it, but it is available in the logs of various services or components. Is it possible to feed some of that data into this infrastructure and get some correlation out of it?

Technically it would be possible. Right now you don't have this option, but there are ideas for the future to connect Vitrage with various log collectors like Elasticsearch and the whole ELK stack. If you look at our product, we do have ELK installed by default, which collects all the logs, and there are ideas to correlate this data with Vitrage, to give Vitrage even more data sources. I don't know if it's officially on the roadmap, but it will definitely happen sometime in the future.

Because currently what we do is a lot of this kind of analytics from the logs rather than from the live metric data, for many reasons, and we would probably love to use logs with the same infrastructure.

Yes, I absolutely agree with you. As I mentioned, the currently available version of Vitrage can only give you the current state of the system; we still don't have history.
We can only look at what is going on right now, and the moment we go into saving our own history, the next logical step will be to make a correlation with the logs that we have collected, so we can see what happened in the system 5 days ago, 3 months ago, or whatever. If there are no more questions, thank you very much.