Good evening, everyone. My talk is in the spirit of the previous talk by Arnak Paulossian on this topic, and in the spirit of the title of our workshop, data-driven, intelligence-driven systems. I will try to share experiences and challenges in dealing with the intelligent management of cloud environments and cloud applications, especially focusing on key performance indicators and diagnostics for those environments. This is joint work with the University of Duisburg-Essen (Professor Humfing) and my VMware colleagues and students.

So, what is the main vision in this space? We want to develop healthcare for cloud environments: from diagnostics to remediation, to fixing problems and resolving business issues. The ultimate goal can be formulated as building self-driving data centers. Why not? We have self-driving cars today or in the near future, so why not have a vision to build self-driving data centers: everything managed by AI, without human intervention, intelligently diagnosed, remediated, et cetera. Unfortunately, we are far away from that reality, and we will talk about the more specific problems we are dealing with in this space and the challenges and difficulties that exist in this domain.

It is a common truth that behind any modern business is a cloud environment, a data center through which the business is somehow serviced to users. In this era of digitalization, everything needs to be converted into a digital reality, which means that any application, any service needs to reside in a cloud data center. So keeping those data centers healthy means keeping those businesses healthy. That's very simple. But how do we actually maintain the performance of those systems, cloud environments and applications? The approach today, and tomorrow, is very straightforward: monitor and measure any data you possibly can from those environments to get visibility into them, and try to get insights from the monitoring data. But normally you end up in the situation shown with this guy on the slide (Arnak also mentioned it): you have a lot of data, a lot of different representations, many, many dashboards to support the visibility you want, but ultimately you get lost in the volume of information. That's why this is a necessary but not sufficient approach today for reliably managing cloud computing environments, in view of the complexity and scale we see in those systems.

Sorry, I think I did something wrong with the slides. OK. So, just an intermediate example of what it means to have specific analytics developed for monitoring data like time series. You can measure millions of these kinds of metrics from your cloud environment, in terms of its different parameters like CPU, memory, cache, et cetera, and you need to investigate each metric in terms of its typical baseline behavior and react to the anomalies, which are deviations from those baselines. Then, combining all the information coming from that anomaly space becomes another issue to handle, with another layer of analytics, another layer of AI. This is just an example of how people approach specific problems in cloud management today using ML, but always needing to put in an extra layer to get to the final desired state, which is automatically performing health status checks and fixing the issues occurring in the data centers.
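To make the baseline-and-anomaly idea concrete, here is a minimal sketch, not the actual VMware analytics but an illustration under simple assumptions: a rolling baseline with tolerance bounds per metric, where points outside the bounds are flagged as anomalies. The function name, window size, and column names are all hypothetical.

```python
import pandas as pd

def baseline_anomalies(series: pd.Series, window: int = 288, k: float = 3.0) -> pd.DataFrame:
    """Flag points that deviate from a rolling baseline.

    series : one monitored metric (e.g. CPU usage), indexed by timestamp.
    window : number of samples used to estimate the 'typical' behavior
             (288 = one day of 5-minute samples, an assumption).
    k      : how many rolling standard deviations count as a deviation.
    """
    baseline = series.rolling(window, min_periods=window // 2).mean()
    spread = series.rolling(window, min_periods=window // 2).std()
    upper, lower = baseline + k * spread, baseline - k * spread
    return pd.DataFrame({
        "value": series,
        "baseline": baseline,
        "upper": upper,
        "lower": lower,
        "is_anomaly": (series > upper) | (series < lower),
    })

# Usage: run the same check over every metric and collect the anomaly space
# that a higher analytics layer would then have to combine and interpret.
# anomalies = {name: baseline_anomalies(metrics[name]) for name in metrics.columns}
```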
As I said, anomaly detection is a core AI operations task. We don't have these self-driving data centers yet, but we have the era of AI operations, where you solve particular, specific problems to address the main management issues in the cloud or in an application you monitor. Change detection is a very important task: if something has occurred unexpectedly, then you need to react to it somehow; it can be a statistical approach, it can be an ML approach, et cetera. Forecasting any time series metric is a core problem: you want to know what will happen soon in order to have time to react accordingly. Any kind of prediction about the future state of your system is very important. These are specific tasks to solve on different data sources (as Arnak mentioned: metrics, logs, traces, and it can be more), but they are only modular, particular solutions.

One substantial problem is that whatever you monitor, whatever you measure, whatever you do in terms of anomaly detection and so on, people need automatic root cause analysis. It means: whatever insights you have learned from your environment using ML, please give me a recommendation for how to resolve the problem. What are the core events to look at? What are the core parameters, or the core processes I need to deal with, in order to bring the health of the system back to the normal state? This is a large and open problem, I guess, in general for these systems. But the particular scenario we have considered is key performance indicator analytics: whenever some degradation of a KPI happens in an application, how to diagnose the situation and try to explain it, that is, discover the conditions that explain the KPI degradation, as a particular AI operations task.

I have to mention several factors that hinder the design of effective root cause analytics in this space. Normally we don't have expert-validated, labeled, annotated data sets that we can leverage in our studies to build effective models that can be deployed everywhere and solve the diagnostic problems of applications, cloud environments, and cloud infrastructures at the different layers of the main problem. Knowledge does not generalize from one environment to another. And if you have a lot of data and can leverage very sophisticated models like deep learning, explainability is a problem, because you need to explain your recommendations in order to build the user's confidence that those recommendations can be acted on: those recommendations are going to change the status of the system, and the business and the service can be put at risk based on your recommendations. So the model should be explainable, so that experts trust it and permit intervening in the system with your recommendations.

Any system, environment, application or infrastructure is normally managed with some KPI metrics. These are time series metrics which are largely responsible for the status of the system. With reliable diagnostics in terms of KPIs, there are two use cases that we address. The first is a troubleshooting recommendation engine: if you have a solution deployed in a user environment, you can leverage it to diagnose and troubleshoot the environment it resides in.
The second: as a provider of cloud management solutions, we can provide proactive support for our software in customer environments. We produce software to solve the problems of others, but our software, which is itself a complex system, needs to be diagnosed as well. Support is again subject to human effort, support teams, and so on, a high volume of manual work that we want to get rid of. It means that by observing the behavior of our product KPIs in real time, we can diagnose the conditions responsible for the unwanted KPI behaviors and fix the product problems in the customer environments in real time. These are the two use cases we are interested in.

That's it? No, no, no. Sorry, it was a technical problem with the slides. OK. As I mentioned, labeled data sets which could be leveraged for training classification models or rule induction systems are not available and are hard to obtain. One approach that we adopted is a kind of self-supervised learning, which means that we generate labels artificially from the KPI behavior. We say: let's assume that we are unhappy with the outlying behaviors of our KPI metric. That can be a source of labels to attach to the whole time series space that we monitor. So if we have another 1,000 metrics from the same system, this outlying behavior, this kind of positive class, can be attached to the corresponding vector of values coming from those time series and stored in our data frame. If your KPI is in a normal state, defined somehow, then you have a negative class label for the same 1,000 metrics in your data set. (A small sketch of this data frame construction is shown below.) OK?

There is a fundamental assumption in this process, because nobody tells us that this outlying behavior is really anomalous in reality. We assume that by labeling this way and building the corresponding model from the data frame construction I mentioned, we can get a good enough solution to the diagnostics problem for this KPI whenever we encounter a real problem in a customer environment. Meaning: if these outlying behaviors are a good approximation of what will happen in reality, then the conditions, and the recommendations you will produce in reality, are well approximated too. Maybe there is a bias in this assumption, but this is the reality we are facing in terms of the lack of annotated data sets.

If we do this trick, we naturally rely on explainability methods such as rule induction algorithms or decision trees, also leveraging regression analytics and feature importance, which explain the long-term behavior of the particular metrics that highly impact the behavior of the KPI. We are interested in two dimensions of explainability, global and local. Globally, we want to know which processes most impact the KPI; and whenever some anomalous situation happens with the KPI, we want to diagnose that instance: who is responsible for this anomalous instance of the KPI?
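As a rough illustration of that labeling trick (a sketch only; the function name, column names, quantile threshold, and feature alignment are my assumptions, not the exact production pipeline), the supervised data frame could be built like this:

```python
import pandas as pd

def build_labeled_frame(kpi: pd.Series, metrics: pd.DataFrame, q: float = 0.95) -> pd.DataFrame:
    """Self-supervised labeling from KPI behavior.

    kpi     : the KPI time series (e.g. cross-node latency), indexed by timestamp.
    metrics : the other monitored time series (the ~1,000 or ~3,000 features),
              aligned on the same timestamps.
    q       : quantile above which the KPI is declared 'outlying' (here the top 5%).
    """
    threshold = kpi.quantile(q)
    # Positive class = timestamps where the KPI shows outlying behavior,
    # negative class = everything else (the "normal" state, defined this simply here).
    labels = (kpi >= threshold).astype(int).rename("kpi_anomalous")
    # Attach the label to the corresponding vector of metric values at each timestamp.
    return metrics.join(labels, how="inner")

# frame = build_labeled_frame(kpi_series, metrics_df, q=0.95)
# frame can then be fed to a rule inducer or any classifier.
```

Under this reading, q values of 0.97, 0.95, 0.93 and 0.90 would correspond to the 3, 5, 7 and 10% higher-quantile labelings used in the experiments described next.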
Just an experimental use case with the product support I mentioned. The product that we deploy in customer environments is a multi-node software solution that measures data from the cloud environment, but we need to keep its health in a green state in order to continue our service of managing the underlying cloud environment. It has self-monitoring metrics describing its own performance behavior, and we try to leverage those metrics to build a model that can serve customer support: diagnosing our product in the customer environment and deriving the conditions necessary to bring the health of our product in those environments back to normal.

The example covers 18 days of data with 15 million observations; 3,000 time series features are measured over that period. We pick a metric that is important for the experts, cross-node latency, as a kind of super KPI: how fast the nodes interact determines whether we are satisfied with the performance of our product in the customer environment. I already mentioned the trick of labeling outlying behaviors; in these experiments the higher 3, 5, 7 or 10% quantiles are separated from the time series data and claimed as positive class labels, which are attached to the rest of the 3,000 metrics. We experiment with different KPIs, like the ones mentioned here: node latency, the average or the maximum of those cross-node communications, but also another KPI related to the analytics service, which computes the baseline bounds of all time series data in the system. It is a large overhead in the product to wake up every 24 hours and try to derive what is typical for all the time series data in the system.

I will not go into details, but we have a lot of insights and experimental results in terms of the importance of metrics, which are globally interesting interpretations of different situations. What is important here: if we train a neural network, a multilayer perceptron, on the data set I mentioned, it gives 96% accuracy, but it is not usable according to our plan, which requires enough explainability to talk to users and customers. You see the result on the raw data set, which was not good, and there are two things we learned in this experiment: undersampling really helps in overcoming the noise, and so does feature ranking. We have 3,000 different features, but we realized that only a small number of them are really helpful in diagnosing the KPI, and the information-theoretic feature selection method called FCBF was really helpful in reaching the results in those last two tables and getting a good accuracy level for the CN2 rule induction method. I was talking about RIPPER earlier; in these experiments we used another information-theory-based inducer, CN2, which works well with noise, and we continued to extract rules from CN2 and speak to our experts to validate the results. So, PCA didn't work in this experiment, undersampling to a 35-to-65 proportion worked well, and feature ranking with FCBF, as I mentioned, was the most helpful trick in this study.
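To give a feel for that training step, here is a rough sketch of such a pipeline under my own assumptions: undersampling to roughly a 35/65 class proportion, an information-theoretic feature ranking (scikit-learn's mutual_info_classif as a simple stand-in for FCBF, which the study actually used), and then a rule learner on the reduced feature set. The CN2 step via the Orange tool is only indicated in a comment, and that API usage is an assumption on my part.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def prepare_training_set(frame: pd.DataFrame, label: str = "kpi_anomalous",
                         pos_fraction: float = 0.35, top_k: int = 30,
                         seed: int = 0) -> pd.DataFrame:
    """Undersample the majority class and keep only the top-ranked features."""
    pos = frame[frame[label] == 1]
    neg = frame[frame[label] == 0]
    # Undersample negatives so positives make up ~35% of the training set.
    n_neg = int(len(pos) * (1 - pos_fraction) / pos_fraction)
    neg = neg.sample(n=min(n_neg, len(neg)), random_state=seed)
    train = pd.concat([pos, neg]).sample(frac=1.0, random_state=seed)

    X, y = train.drop(columns=[label]), train[label]
    # Information-theoretic ranking (a stand-in for FCBF): keep the top_k features.
    scores = pd.Series(mutual_info_classif(X.fillna(0), y, random_state=seed),
                       index=X.columns)
    keep = scores.nlargest(top_k).index
    return train[list(keep) + [label]]

# reduced = prepare_training_set(frame)
# reduced.to_csv("kpi_frame.csv", index=False)
#
# Rule induction with CN2 is available in the Orange tool; roughly (my assumption):
#   import Orange
#   data = Orange.data.Table("kpi_frame.csv")   # exact data preparation omitted
#   rules = Orange.classification.rules.CN2Learner()(data).rule_list
```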
I showed some specific examples from CN2: rules explaining the outlying behavior of this overall threshold-check maximum duration, with quality measures and the class distribution of each rule. We see high-quality rules, and they are rather complex rules, not simple ones. They say, for instance, that if the resource symptom region update average duration is larger than a specific threshold, and some system attribute health is larger than this, then you have an anomalous KPI status. Then you see a more complex rule, with different conditions on different features, which again explains the outlying behavior of the KPI. This is interesting because nobody actually owns such knowledge: explaining a specific situation with different conditions combined, with specific values attached to each condition.

In other examples of high-quality rules, you can see different features from the study, like capacity reclamation and resource metadata, which were not present in the previous examples, starting to play a role in explaining this threshold-checking process. There are more examples for the latency KPI: again complex, high-quality rules dealing with different features, heap size, the volume of transmitted bytes, et cetera, which none of us data scientists understand how or why. So expert validation is really important for those rules, and for that purpose we initiated a simple initial validation of the discovered rules with our engineering team, with some interesting fragments. Generally, from their long-term experience working in support with the product, they know what the important metrics are, but they were surprised by the combination of different important metrics into conditions that can tell you what the problem is. Overall, they liked the discovered rules, with some eye-opening and surprising factors.

As I said, it remains to try this model in different environments and run massive tests of how exactly those rules point out the underlying problem, trying different metrics to measure the performance of this model, like mean time to repair: how fast you actually repair your issue using the rules. These are to be done for an extensive and rigorous validation of this global model. Of course, it is also a hypothesis whether global or local models will work: whether we can train one global model for our product, which is deployed in many thousands of customer environments, and it will still work for everyone, or whether every local environment is so specific that the model needs to be trained separately.

With this study, we learned that KPI degradations can be explained with this self-supervised trick of generating artificial labels: we explain behaviors which cannot be directly associated with real degradations in the customer environments but are approximations of them, and those models can be capable enough, so the hypothesis somehow works at this stage. What is also interesting is that this can be used to practically quantify the risk of misbehavior of the product while leveraging the rules we have discovered. From the conditions that are satisfied in a rule, we can quantify how close we are to full satisfaction of the rule, which would mean you will get the anomalous situation for sure: if one condition is fulfilled, then you have maybe one fourth of the risk, or 50% of the risk. So the customer, or our support, can be practically notified that something is going to go wrong soon.
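A minimal sketch of that risk idea (my own illustration, with hypothetical rule and metric names, not the product's implementation): represent each discovered rule as a list of threshold conditions and report the fraction of conditions currently satisfied as the risk factor.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Condition:
    metric: str
    check: Callable[[float], bool]   # e.g. lambda v: v > 1200

@dataclass
class Rule:
    name: str
    conditions: List[Condition]

def rule_risk(rule: Rule, snapshot: Dict[str, float]) -> float:
    """Fraction of the rule's conditions satisfied by the current metric snapshot.

    1.0 means the rule fully fires (anomalous KPI expected); 0.25 or 0.5 means
    a quarter or half of the way toward that situation.
    """
    satisfied = sum(c.check(snapshot.get(c.metric, float("nan"))) for c in rule.conditions)
    return satisfied / len(rule.conditions)

# Hypothetical example mirroring the kind of rule shown on the slides:
latency_rule = Rule(
    name="cross-node latency degradation",
    conditions=[
        Condition("resource_symptom_region_update_avg_duration", lambda v: v > 1200),
        Condition("system_attribute_health", lambda v: v > 0.8),
    ],
)
# risk = rule_risk(latency_rule, current_metric_values)  # e.g. 0.5 if one of two holds
```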
These are the main lessons learned, and this summarizes my talk. Thank you so much.

Q: Maybe I have missed that part, but could you please name the algorithm that helped you to generate these rules?

A: I talked about CN2. We have different algorithms, decision trees, but as you see, I mentioned the CN2 rule inducer, which produces the rules you see on these slides. It is an information-theory-based rule inducer; you can check the literature, it is very interesting. I didn't find an implementation in the common Python libraries, so you can find it only in Orange, which is a visual programming tool.

Q: The specific numbers that you got in those rules, do they need to be retrained for every system that you are trying to run your KPI diagnostics on?

A: I think the hypothesis is that, for our product, we have different versions: a 3-node small-size version, 6 nodes, 12 or 18 nodes, so different sizes of the product. Naturally, each takes a different volume of workload from the cloud, and I guess each of them needs its own model. This was a 6-node installation, so it is something to test whether the model can work equally reliably for a 12-node deployment; I am not optimistic about that. Different scales of monitoring data, different workload levels: it will be hard to generalize.

Q: Does that rule inducer only induce the numbers that you see on the right-hand side, or also the actual attributes that lead to, for example, KPI degradations? Do you need to specify the attributes yourself, or does the rule inducer do it?

A: You can say that you are interested in rules with no more than five participating conditions, for instance; you can specify how complex you want the conditions to be, but nothing else. It automatically finds the best-quality rules.

Q: Just very briefly, you mentioned that we need a bigger study. How do you plan to get there? Do you put that into contracts with your customers, and are they willing to participate?

A: Actually, in this product support use case it is not really about contracting with the customers but with our support teams, which support different customers in different regions. So it should be some internal contract, which is maybe harder to get than with the customers. That is challenging, another challenge that I wanted to emphasize.

No more questions? Thank you.

A: Regarding real time: it means two things, actually. One is that we can deliver the rule in real time, because we already have a pre-trained model deployed in the environment. We see that we are in an anomalous state, check whatever rules are satisfied now, and quickly take the corresponding blocks from the list to be recommended in real time, because nobody has that now: everything is manual. If something is wrong, they make a telephone call to support and say, hey, my KPI is wrong, and then it may take days to discover what happened to the system. So real time means that I have a real-time recommendation in terms of a specific rule, and I already see that it is satisfied in the system; that is why I am confident that it should explain the KPI behavior at this moment. The second is that it can be real-time and proactive, because I can always track the satisfaction of these rule conditions and return a risk factor for whether something will happen to the KPI soon or not.

A: As for the risk calculation, I think it is very straightforward, like the two conditions you see. If you already have analytics that can detect what a KPI anomaly is, or the user can specify "latency below this or above this I will not accept", then you know the threshold, and it is easy: you check the threshold and you get whether the KPI is anomalous or not.
But who is responsible for that? You have a million different metrics; who can be responsible for that anomaly? So checking the threshold and then checking the rules already discovered and stored in the system is a straightforward, linear operation. OK, thank you.