Hi everyone. My name is Ifat Afek. I'm a system architect at Nokia and the PTL of the OpenStack Vitrage project, and I'm going to present together with my colleague Alexey Weyl, a core contributor in Vitrage. We are going to present the project updates in Vitrage: we will describe what we are doing in the Pike version and our plans for the future. I will start with a brief overview of the Vitrage project, then I will discuss the main features in Pike. Then Alexey will present a demo and talk about our roadmap and plans for the Queens version.

Vitrage is an official OpenStack service for root cause analysis. It helps organize, analyze and expand the OpenStack events and alarms. A cloud administrator that has a fault in the cloud may see a very long list of alarms, and it may be hard to understand what the root cause of these alarms is. This is where Vitrage can help. Vitrage has another role of creating new alarms on problems that are not directly monitored. For example, in case of a failure of a physical interface, Vitrage can identify the VMs, the instances, that are unreachable, and raise alarms on these instances. Nova is not aware of such problems; in Nova everything seems fine. But Vitrage can help the cloud administrator understand that there is a problem with the instances. Maybe there is also a problem with applications running on top of these instances; Vitrage will raise additional alarms on such applications.

Some background about the project: the Vitrage project was started during the Mitaka cycle, and six and a half months later it became an official OpenStack project.
This happened a year ago. The first official version was the Newton version, and now we are developing Pike. We have about ten contributors actively contributing to the Pike version.

Now I'll talk about the high-level architecture of Vitrage. Vitrage collects information from different data sources. Some of them are OpenStack services: Nova, Neutron, Cinder and Heat. We have Aodh, the telemetry alarming service, for alarms, and we also connect to external monitors: Zabbix, Nagios or collectd. All this information is combined into a topology graph, the entity graph. In this graph you can see the physical layer, the virtual layer and the application layer, and you can understand how they relate to one another and how they affect one another.

Whenever the graph is changed, the Vitrage evaluator checks if there are actions that should be taken. By actions I mean: raise another alarm on another problem in the system; modify the state of a resource, in case Vitrage identifies an error; or mark a causal relationship between two alarms. The rules for when to take these actions are defined in templates. These are YAML files that the user can define and edit, and I'll talk about them later.

If Vitrage raises an alarm or modifies the state of an object, it may notify external systems. Right now Vitrage notifies Nova if a host is down. We also send SNMP notifications, and we are working on notifications to Mistral, which I'm going to talk about later. Vitrage has an API to get the topology, to get the list of alarms and to get root cause analysis. On top of it we have a CLI, and we have a UI as part of the Horizon dashboard for Vitrage.

So this was the overview, and now I'll describe the main features that we are working on in Pike. One important thing that we started working on in Pike was an overall design of the high availability solution for Vitrage. We have several goals that we would like to achieve. The first one is to have a completely highly available solution. Another, related goal is to provide alarm history and root cause analysis history, which we'll discuss a bit later. Also, currently the Vitrage topology graph is held in NetworkX, an in-memory graph database. It performs very well, but if we get to very large deployments we might need a persistent graph database. So this is another thing we are checking: the option to support a persistent graph database instead of NetworkX.

This is the architecture that we are thinking of. It's currently under review; it's not final. In the current implementation, the Vitrage graph service has the data source drivers that connect to the different data sources, and a processor that helps transform the information from the data sources into the graph language. We are going to split it into different processors, and each group of processors will be highly available. This way we can make sure that no event is getting lost, and that if one process fails there will be another one to replace it. The detailed design describes specifically where we need master-master and where we need master-slave, and like I said, it's still under review.

In order to support alarm history, we are thinking of an event sourcing mechanism. We may store all the events that arrive to Vitrage in a database, and once in a while we may create a snapshot of the graph status. Then, if someone is interested in the state of the alarms yesterday or last week, we can do a replay and show the status of the past.

Okay, another main focus in the Pike release is collaborating with other projects. We added Vitrage to RDO, the RPM distribution of OpenStack, so it's much simpler for everyone to install Vitrage these days. And we have continuous work with OPNFV Doctor.
This is the fault management project of OPNFV, and Vitrage implements the specifications that OPNFV Doctor defines for how a fault management inspector should behave. We are currently working on installation in OPNFV, and we are trying to finish the requirements.

In Pike we added SNMP notifications: if you have a system that requires SNMP notifications, we can send them for every alarm that is raised by Vitrage.

And we just did a POC for Vitrage integration with Mistral. Mistral is the OpenStack workflow engine. It's a very powerful engine where you can define different workflows, and we showed a demo of the integration with Vitrage as part of a presentation in the summit this week. A very short explanation about the integration: Zabbix may report an alarm to Vitrage, for example an alarm of a switch failure. Vitrage can evaluate the alarm and deduce that there is a host that is actually unreachable, and raise another alarm on that host. Today, in such a case, Vitrage notifies Nova that the host is down, since Nova is not aware of it; this is part of the Doctor use case. What we want to add as new functionality is that Vitrage will also notify Mistral, and then, for example, Mistral can execute a workflow to evacuate the failed host and move all the instances to a host that is working well.

In Pike we are also going to introduce a POC for machine learning. Like I said, the behavior of when to raise alarms on other resources and when to mark a root cause relationship between two alarms is currently defined in template YAML files. These files are very easy to edit, and the cloud administrator can add new templates based on his experience. But we would like to make it automatic, so in Pike we are trying to make the very first steps towards it. We have an algorithm, developed together with Bell Labs, to find causal relationships between historic alarms. We examine the history of the alarms and try to find correlations between them. This is not so trivial, because it could be that two alarms always appear at more or less the same time, but neither of them is the root cause of the other. It could happen that both of them are caused by a third alarm that we just don't monitor. And it could be that the first alarm is not the root cause but the second alarm is; maybe they are retrieved from different monitors, and each monitor has a different frequency. So determining the cause is not trivial, and we are trying to do it. This work started in Pike and will continue in Queens.

Another issue that was raised was alarm equivalence. By alarm equivalence I mean that it could happen that two different monitors report the same alarm. For example, Zabbix and Nagios may both report high CPU load. In each of them the name of the alarm will be different, and the severity might be different, but basically it means the same thing. We are currently working on introducing a way to determine that
these two alarms are equivalent, mark this in the entity graph, and then we will be able to identify this equivalence and act accordingly. Another use case: Vitrage raises a deduced alarm on an instance, saying there is high CPU load on the instance, and then later on Zabbix reports the same problem. We would like to identify that this is really the same alarm and not two distinct alarms. There are several options for how to mark this case, and some of them are problematic. We selected the option of adding in the template a way to say that these two or three alarms are equivalent. Then, if you have a template with a root cause relationship for one of the alarms, we automatically generate a similar template for the equivalent alarms.

There are a few smaller features that we are working on, and some are important. In the beginning of Pike we added multi-tenancy support in Horizon. We already had it in the API, but not in the UI. So in the admin UI you can see the entire entity graph and the entire root cause analysis, and in the project tab you only see what is relevant to your tenant: you don't see the entire graph, and you don't see the entire effect on other instances that are not yours.

We added a new API for querying the resources that are known to Vitrage. You can query a specific resource, or a list of resources of a certain type. We also modified the resource ID: it used to be calculated from a list of fields, like resource type and resource name, and we are now using a UUID, like the rest of OpenStack.

We are going to implement a mechanism that allows registering to specific Vitrage alarms, so you could get notifications on specific alarms that are raised by Vitrage. And the last important feature that we are working on in Pike.
Actually, it is already finished: enhancing the language of the templates. We already had in templates an option to say "if one condition and another condition occur", or "if one condition or another condition". Now we added support for "not", so you can say "if one alarm and not the other alarm" in the condition of the template. This is specifically useful for high availability scenarios. For example, in the case of a Heat stack that has two instances, you can say: if one instance is down, I want to raise a warning on the Heat stack, and if two instances are down, it's a critical error. So like I said, this is supported now, and I invite Alexey to show a demo about this use case.

[Alexey:] Thank you, Ifat. So I'm going to show a demo in which we're going to see the use of the "not" term in the Vitrage templates, and also the integration of physical network components in the Vitrage entity graph. So let's start with the demo. We're going to the entity graph. Okay, so what we can see here: we have this availability zone with two hosts, and each host has two instances, as we can see here. We can see here that we have this Heat stack, which is comprised of two instances in a high availability mode. We can see the Neutron ports and the Neutron network. And on the right here, we can see that we have written in our deployment a script that discovers the OVS topology of the compute. So we can see here this OVS bridge that is connected to the host and to each one of the virtual ports. We also connected the OVS bridge to the physical OVS port and to the physical interfaces.

Now what I'm going to do is bring down the interfaces, and we'll see what happens in Vitrage. I'm looking for the interfaces and bringing them down. Now we can see our Zabbix. In Zabbix we have added a trigger that says: if we have an interface down on a compute, then raise an alarm. So let's see that the alarm is raised in Zabbix, as we can see here. And we can see that a new alarm was added here on the physical interface. We have written a template that says: if we have an alarm on the interface, propagate it to all of the instances that are sitting on the compute on which the interface is.

Now I'm going to bring down the second interface; I've changed the number of the interface. Now we'll see that there are two alarms here, two triggers, and now we'll see that in the graph. Okay, so what we can see here now: instead of the warning alarms that we had on the instances, we now have a critical alarm on the instances, because both of the interfaces are down. It means that the instance and the host are not working, and because of that we have also raised a warning alarm on the Heat stack. Now we are going to see the RCA, the root cause analysis, of what happened here. Let's see. Okay, so here we can see that we have two alarms on both of the interfaces, that caused a network error, a critical error, on the instance, which caused a suboptimal error on the stack.
This is all configurable, of course, and can be done in a different way. So now we will bring the interfaces back up and see that everything is getting back to normal. You can see that the critical alarms disappear, the alarm on the Heat stack disappears as well, and only the warning alarms remain. Now, when I bring up the second interface, it will all disappear. Great.

So what we have seen here are two use cases. One is the high availability use case, in which we have defined a template that says: if I have an instance on which one of the interfaces has an alarm and the other interface has no alarm, then raise a warning alarm on the instance; but if I have two interfaces and on each one of them we have an alarm, then raise a critical alarm on that instance.

Sorry about that, some presentation problem. Okay, so now let's talk about the Queens roadmap. The main changes that I want to talk about are three changes. The first one is the alarm and RCA history. Every system needs to have a history, and in our case there are three use cases that I want to talk about. One: let's say you have an operator that goes back home in the evening and then comes back in the morning, and sees that everything is fine in Vitrage. But actually, while he was at home, a CPU usage alarm was created on the host and then went back to okay. When he came to the office everything was fine, but actually there was some kind of problem in the system while he was at home. We would like to know about that, and the history will help us know that something was wrong in the system. Then he can take countermeasure steps, see why that happened, maybe fix it, and verify that everything is fine.

Another use case for the history is the machine learning. In order for the machine learning to work well, it needs to have as much data as it can, in order to analyze it, learn from it, and then take actions based on that. This is another reason why we need it.

The last reason that I want to talk about is the RCA history. Let's say we have a host here on which a host-down alarm was raised. Vitrage will propagate the alarm to the VMs, as we can see here. Then let's say that after some time, because of that alarm, a new alarm was created on the stack, because it has some CPU problems, or because it can't work because the VMs have problems. We add a new template that says that if I have such an alarm, then create a root cause relationship between this alarm and this alarm. But then let's say that the host-down alarm was deleted, as we can see here. The alarm here, the VNF-down alarm, is still on the VNF, because the VNF is not working, because it had some other problem due to the problems on the VMs. But now, when the operator comes back, he will see only this alarm, and he won't know what the real root cause of the problems was. This is why we need the RCA history: so the operator can know that this VNF-down alarm was actually caused by previous alarms that were in the system.

The second main change that I want to talk about is alarm aggregation. We need alarm aggregation in two types of places. The first is in the alarm list. Let's say you have a Zabbix alarm on the host, as we saw, that caused deduced alarms on the instances, which then caused a deduced alarm on the stack. What will happen is that in the Vitrage alarm list you will see a very big list of alarms. We would like to make it more human readable: not to see all of the alarms, but only the main alarm, and then you can drill down on this original alarm and see all the other deduced alarms that were caused because of it. This way you only see the original alarms, and you can drill down and see everything.

Another reason why we need alarm aggregation is in the entity graph. Ifat spoke before about alarm equivalence, in which we know that alarms that appear from different data sources, such as Aodh, Vitrage, Zabbix and Nagios, can be equivalent.
So then in the graph we will see, let's say, three different equivalent alarms on the instance, and it will make the graph very chaotic. We would like to make it appear a little bit better, and thus we can create an aggregated alarm. Then, if somebody wants, he can drill down again into the alarm and see all of the different alarms that arrived from the different data sources. As we can see here in our graph, we have this original alarm that caused three deduced alarms, which caused another deduced alarm on the Heat stack.

Okay, the third change that I want to talk about is the UI changes that we want to make. This is very important to us, because we still have many, many things to do in the UI. One of the main changes we need to make is in the entity graph, because at the moment, let's say, we have hundreds of entities, or maybe thousands, but later on we'll have many more. In order to make use of it, and not see thousands of entities in this small screen where you can't understand anything, we need to understand how we can make it more usable for an operator. A few things that we thought about: smart selection, in which you have a search box where you can write what you want to find and it will be highlighted; and maybe a layered view, where you can see the graph depending on the data sources that you choose. Another thing that we thought about is to have a small overview screen on the side showing the whole graph, and a big screen where you zoom in and zoom out, so you can see on the small screen where exactly you are. I also talked before about the RCA of the deleted alarm, in the previous slide. And maybe you would like to have a timeline slider, which means that we'll have the graph and this kind of slider, and then you can slide it, let's say, from one month ago to now, and you'll see all of the changes that we had while you move the timeline slider. Then you can see the history and understand what is going on.

Another important thing is the template editor. At the moment, in order to write templates, you need to write a YAML file, but we would like to add something in the UI that will make it much easier for a user to do that, and then many more use cases can be created.

Okay, so now what we're going to do is an open discussion. We in Vitrage have many questions to many different users all the time, and one of the main questions is: what are the main use cases that you use? We, as Nokia, are familiar with the telecom use cases, but we want to be familiar with the public cloud use cases and many other uses. So please, if you have any knowledge for us, we would like to hear it.

[Audience:] Hello, I have a question. I see that you are trying to extend the number of entities that you can monitor. I was wondering if this extension includes something like the IoT world: can I get to 200 million entities and get an alarm on a single device, or is that completely outside the scope of the project?

[Alexey:] I'll try to handle this. At the moment we are talking about systems that are comprised of, let's say, I don't know how many nodes, in which you have, say, 60 computes. How many computes would you have, a thousand computes?

[Audience:] Well, yeah, a thousand computes. So if you want an entity for each device, it can be millions. If you want to propagate your alarms down to the single device that is in your pocket, then it becomes millions and millions.

[Alexey:] When you say device, you mean, let's say, your phone?
[Audience:] Yeah.

[Alexey:] We actually never tested Vitrage with millions; we tested it with tens of thousands. But it's really up to you: you can control whether you want to propagate the alarm to all of the devices, or maybe it's not interesting for you. Anyone else? We can continue to the other questions that we have here, like: what kind of Vitrage functionality do you use the most? What would you like to see changed or added? From the roadmap that we showed you, do you think there are other things that we need to add to Vitrage for your use cases to be covered?

[Audience:] Okay, a new volunteer. With DreamHost we have a public cloud. So where I see Vitrage, you know, from an operator side it would certainly be interesting, and I don't know that I have a whole lot of feedback on that side; I mean, you guys look like you're on the right track. But it would be interesting to offer this kind of thing public-facing as well, which would require some tenant-level scoping, so I could, you know, say for each one of the hundreds of people on our cloud: here's Vitrage, here's your specific stuff.

[Alexey:] Yeah, actually we have it already. It supports multi-tenancy and it's already there.

[Audience:] So I guess, I don't know, we also have complicated network overlays and stuff like that. How much of that stuff are we able to expose to our particular tenants? You know, we may not want to expose our hypervisors, but maybe we do want to expose our switching plane.

[Alexey:] At the moment, what we expose to the tenant is everything that has their tenant ID, plus one level beyond. Which means that if I have an instance with the tenant's ID, and the instance is connected to the compute, but the compute belongs to the admins, it will show that the instance is connected to the compute and nothing else, because all the other stuff, let's say the physical network topology, belongs to the admins.

[Audience:] Okay, thank you.

[Alexey:] But again, if you want something else to be done, we would love to hear about the use case and see how we can integrate it into Vitrage.

[Audience:] Does your graph support multiple links, multiple links between entities, between the services? For example, your server may have multiple links, and if one link goes down you can switch to the other link. So I want to know whether your entity model will show that.

[Alexey:] So the entity graph is a multigraph, a NetworkX multigraph, which means that it supports multiple relationships between entities. In order to see the multiple links, you just need to add a new data source in which you connect those multiple links to whatever you need.

[Audience:] Right. So if a link goes down and you switch to the other link, will the model change immediately? The operator really should know which link is down and which one is active; it's important. The entity graph should show this to the operator.

[Alexey:] Okay, so I will just finish the presentation, because we are out of time, and then we can speak about it in a minute. I just have to say that we still have many, many things to do in Vitrage in order to improve it: many things in integrating with other projects, and specifically many changes in the UI. So we are looking for contributors, and if you want to help us and to help the community, please come. That's it.