Hello everyone, and thank you for the introduction. I am Stephen Metai, and today I will present the work we did on anomaly detection for the CERN cloud infrastructure. This work was done at CERN together with my supervisors, and it was also the topic of my master thesis. All the code is available: here is the GitLab repository link, and I will also upload these slides, so you can see them later. This is the outline of the talk: first a short introduction and the motivation, then the approach we followed, the pipeline we built, the machine learning models, and finally the results, because we compared the old system which was in use at CERN with our new data-driven one, and we will see the comparison with regard to true positive rates.

So let's start with the introduction. For those who don't know, CERN is the European Organization for Nuclear Research. It is located on the Swiss-French border, close to Geneva, and it was founded after the Second World War to carry out fundamental physics research. CERN also runs a large data centre, which provides computing resources and web applications to all the CERN employees and users. So of course, given the dimension of the data centre, we also have a big monitoring effort: we have to monitor all the OpenStack components and all the hypervisors and virtual machines that run in the cloud. From these hypervisors we collect logs and metrics, and we store them in the monitoring infrastructure and in HDFS. On top of that we have many Grafana dashboards. These are very useful when something goes wrong, because we can do a postmortem analysis and try to find the root cause of the problem.

Starting from these Grafana dashboards, the idea was to detect anomalies in the cloud by analyzing these time series metrics. Until now this was done with static thresholds: for example, there was a static threshold at 900 milliseconds, and every hypervisor having values above this was considered anomalous. This is a bit inefficient, because we have to set all the thresholds manually, and because of the huge number of servers it is always difficult to go and find all these errors by hand. Moreover, there are also anomalies that never cross the thresholds. That is why we proposed a different, data-driven approach. The main idea is that we have many servers with the same hardware and software configuration, so we expect them to be correlated: the normal servers will have a similar behavior. And we will see at the end that indeed we outperformed the previous system. More importantly, what we wanted from the system is to proactively identify operational issues before the users. What do we mean? For sure this has happened to you: you receive many tickets from the users saying that their machine is a bit slow, and only then you go and investigate. What we want instead is to spot these problems before the users notice them and open a ticket.
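To make the old threshold-based approach concrete, here is a minimal sketch of that kind of rule. The 900 ms figure is the one mentioned above; the metric name, column names and data layout are illustrative assumptions, not the actual CERN schema.

```python
import pandas as pd

# Hypothetical layout: one row per (hypervisor, timestamp) with a latency-like
# metric in milliseconds; names are illustrative, not the real monitoring schema.
THRESHOLD_MS = 900

def flag_static_threshold(df: pd.DataFrame) -> pd.DataFrame:
    """Flag every sample whose metric exceeds the fixed threshold."""
    df = df.copy()
    df["anomalous"] = df["metric_ms"] > THRESHOLD_MS
    return df

samples = pd.DataFrame({
    "hypervisor": ["hv001", "hv002", "hv002"],
    "metric_ms": [120.0, 950.0, 880.0],
})
print(flag_static_threshold(samples))
```

The limitation the talk points out is visible even here: the 900 ms value has to be chosen and maintained by hand for every metric and every group of servers.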
So, how can we detect these anomalies? There are two main possibilities. The first one is about performing change detection: if I have a time series, a metric, and I consider this as normal behavior, I want to detect a change in the time series itself and consider that as anomalous. Or we can group the servers having the same configuration, as I was saying, same hardware, same software configuration and same assignment, and in this case we want to catch a different behavior of a specific server with respect to the other servers. This is the approach that we follow. To explain it, here we have a plot of the standardized CPU load over the data of February of this year, and we are plotting in blue a specific server that starts to have an anomalous behavior. What we wanted to highlight is that, having in red the maximum value of the load over the group, this server reaches the maximum only at certain points; it means that in all the other periods some other servers are reaching even higher values. And here we have the minimum and the three main percentiles, to show you the normal behavior. So of course, when this server starts to go above the normal values, we consider it anomalous, and with our system we want to catch exactly those examples.

So let's have a look at the system. First of all, we had to build a library. This library is mainly written in Python, and it is composed of two main parts. The first one is the ETL, which stands for extract, transform and load: we compute some normalization coefficients, we transform the raw data into a windowed, normalized version, and we save everything in pandas DataFrames, which will be the inputs of our system. The second part is the anomaly detection core, let's say. We call it ADCern, and it contains an analyzer that wraps PyOD, the Python Outlier Detection library. We wrote a wrapper in order to have an extensible module, so we can add many other models inheriting from the PyOD library. Here we also have the code that publishes the results into the CERN monitoring infrastructure, and we will see that later. All these modules are installable with pip, and we do that in our Jupyter notebooks, in the GitLab CI/CD system, and in Airflow, which is the orchestrator that we use.

So let's have a look at our pipeline. This is the main pipeline, and we can divide it into two parts. The first one, in blue, is the data analytics pipeline, which goes from the data sources to the publishing of the anomalies that the models find. Then, in orange, we have the annotation pipeline: given the feedback from the experts of the cloud and given a benchmark dataset that we had to build, we try to evaluate our system in order to make changes and improve the results. We use many technologies in this pipeline. We always want to exploit the set of tools that CERN gives us, for example HDFS and Spark, so all the frameworks that are already in use we try to use. The pipeline covers the whole flow from the data sources, using collectd, to the monitoring infrastructure, which is composed of Grafana and Kibana, and every step of the pipeline is containerized, mainly using Docker. If we map these technologies onto the pipeline, we get this picture: as I was saying, in the data sources we use collectd and HDFS.
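To make the ETL step a bit more concrete, here is a minimal sketch of the standardize-and-window idea, ending in a pandas DataFrame. The column names, the window length and the normalization choice are assumptions for illustration, not the actual schema of the library.

```python
import numpy as np
import pandas as pd

# Illustrative ETL step: standardize a metric across all servers of the same
# group and cut each server's series into fixed-length windows.
WINDOW = 48  # e.g. 48 samples of 5 minutes, roughly a 4-hour window (assumption)

def standardize_and_window(raw: pd.DataFrame) -> pd.DataFrame:
    # raw: columns ["host", "timestamp", "cpu_load"], one row per sample
    mean = raw["cpu_load"].mean()
    std = raw["cpu_load"].std()
    raw = raw.assign(cpu_load_std=(raw["cpu_load"] - mean) / std)

    windows = []
    for host, series in raw.sort_values("timestamp").groupby("host")["cpu_load_std"]:
        values = series.to_numpy()
        n_full = len(values) // WINDOW
        for i in range(n_full):
            windows.append({
                "host": host,
                "window_id": i,
                "values": values[i * WINDOW:(i + 1) * WINDOW],
            })
    return pd.DataFrame(windows)
```

Standardizing against the whole group is what makes the peer comparison described above possible: a server is judged against how its siblings behave, not against a hand-picked absolute value.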
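And for the anomaly detection core, this is roughly the wrapper idea around PyOD. The class and method names are hypothetical; only the PyOD calls (IForest, fit, decision_function) are real library API.

```python
import numpy as np
from pyod.models.base import BaseDetector
from pyod.models.iforest import IForest

# Hypothetical sketch of an extensible analyzer that wraps any PyOD detector;
# the real module is more elaborate, this only shows the idea.
class PyODAnalyzer:
    def __init__(self, detector: BaseDetector):
        self.detector = detector

    def fit(self, windows: np.ndarray) -> "PyODAnalyzer":
        # windows: shape (n_windows, n_features), the flattened windowed metrics
        self.detector.fit(windows)
        return self

    def score(self, windows: np.ndarray) -> np.ndarray:
        # Higher score means more likely to be anomalous (PyOD convention).
        return self.detector.decision_function(windows)

# Any PyOD model can be plugged in, e.g. the Isolation Forest used in the talk.
analyzer = PyODAnalyzer(IForest(n_estimators=100, random_state=42))
```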
Coming back to the pipeline: we then load the data in the data preparation task, and we use Spark to preprocess it. At the end of this we have a pandas DataFrame, which is the input of our machine learning core, built in Python and TensorFlow. Then we apply some ensemble strategies, and with Fluentd we publish the anomaly results to Elasticsearch. Thanks to Grafana, we get the data from Elasticsearch and build some dashboards, in order to look at the results and also to evaluate the old model. As I was saying, all the steps are based on Docker containers, and all the process in the middle is orchestrated and scheduled by Apache Airflow.

So let's have a look at this orchestrator. Apache Airflow is an open-source framework, and we run it in a Docker container itself, with Docker Compose. On the left we can see the whole pipeline, which I will explain in the next slide, but here we can see all the runs that Airflow scheduled, so it is very easy to find out if something went wrong: every task is green if it was successful, pink if it was skipped (and we will see why we skip some steps), and red if it failed. The ETL processes are submitted to the CERN infrastructure, so we don't have to run them ourselves, but the Airflow deployment and all the machine learning training and inference always run on a single VM. Of course you may also want replication, so you can use multiple VMs, but what we want to highlight here is that you can deploy everything on just a single VM.

This is the pipeline in more depth; I hope you can read it. Anyway, to write pipelines in Airflow there is some Python code to be written, and in this code you create the tasks and also define their order. Here we have the training step and the inference step, and in every step we check whether we already have the data saved locally. If not, we start the whole ETL process with Spark; if we have them, we just skip to the next big task. Also, in the model training task, if the models are already trained we don't have to do anything and we just skip to the inference step. The inference step has the same structure as the training one, but of course it will run every time, because every time we have to make some inference the data will be new; on the opposite, the training part will run just the first time. We want to highlight this with these two graphs: here we have the first execution of the pipeline, and here the following ones. We can see that the first time we run the pipeline we actually have to process the training data, train the machine learning models and save them, and then we can do all the inference steps. On the opposite, for the following executions, all the training part is very fast, because we just check that the data and the models are already there, and then we spend all the time in the inference part. This means that here we use more or less 35 minutes, and here the duration is just seven or eight minutes.

This is the input of our pipeline: the system can run multiple algorithms in parallel, and the input is a bunch of YAML files. Here, for example, we have the YAML file saying which algorithms the system must run and train, and for every algorithm we also have, of course, the parameters that we can set.
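To make the skip-if-cached orchestration concrete, here is a minimal Airflow sketch of that pattern. The DAG id, task names, schedule and callables are placeholders, not the real CERN pipeline; only the Airflow operators and the trigger rule are actual framework API.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import BranchPythonOperator, PythonOperator

def choose_branch() -> str:
    # In the real pipeline this would check local storage / HDFS for cached data.
    data_already_there = False
    return "skip_etl" if data_already_there else "run_etl"

with DAG(
    dag_id="anomaly_detection_sketch",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 */4 * * *",   # one run per 4-hour window (assumption)
    catchup=False,
) as dag:
    check_cache = BranchPythonOperator(task_id="check_cache",
                                       python_callable=choose_branch)
    run_etl = PythonOperator(task_id="run_etl",
                             python_callable=lambda: print("submit Spark ETL"))
    skip_etl = PythonOperator(task_id="skip_etl",
                              python_callable=lambda: None)
    train = PythonOperator(task_id="train_models",
                           python_callable=lambda: print("train models"),
                           trigger_rule="none_failed")  # run even if the ETL branch was skipped
    infer = PythonOperator(task_id="run_inference",
                           python_callable=lambda: print("score new windows"))

    check_cache >> [run_etl, skip_etl]
    [run_etl, skip_etl] >> train
    train >> infer
```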
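And here is a hedged sketch of how such a YAML configuration could be read to instantiate the configured detectors. The file layout, keys and registry are assumptions for illustration (PyYAML is used for parsing); they are not the actual configuration format of the system.

```python
import yaml  # PyYAML
from pyod.models.iforest import IForest

# Hypothetical config: which algorithms to run and with which parameters.
EXAMPLE = """
algorithms:
  - name: iforest
    params:
      n_estimators: 100
      contamination: 0.02
"""

REGISTRY = {"iforest": IForest}   # new wrapped models would be registered here

def build_detectors(text: str) -> dict:
    cfg = yaml.safe_load(text)
    return {a["name"]: REGISTRY[a["name"]](**a.get("params", {}))
            for a in cfg["algorithms"]}

detectors = build_detectors(EXAMPLE)
```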
And as I was saying, having a wrapper class around this Python Outlier Detection library, we can extend our library with new algorithms. On the other side, this is the output of the system: as I was saying, we use Fluentd to push a JSON document to Elasticsearch. In particular, we publish the top 20 anomaly scores that we find for every four-hour interval. Looking at the structure of the JSON, we indicate the details about the algorithm that we used, about the hypervisor and the group that we are analyzing, then the metrics that are in input to the model, and finally the details about the anomaly score.

So now we can look a bit at the machine learning models that are inside the system. The first step was to check several anomaly detection models in order to see which one was the best. In particular, we focused on some traditional anomaly detection methods, which you can see here in green, some deep learning ones, in orange, and then we also tried some ensemble strategies, in blue. Here you can see the ROC AUC; I will talk about that later, but it is just a metric to measure the performance of the models. We can see that, between the traditional and the deep learning models, the Isolation Forest is the best one from the traditional family, and the LSTM autoencoder is the best one from the deep learning family. So what we did was to keep only these two models and try to improve them, in order to improve on the old system's results.

Let me talk about these models in a bit more detail. The Isolation Forest is a traditional machine learning model based on point isolation. It means that if we consider x_i as normal, because it lies inside the dense region, we need many divisions of the hyperspace to isolate that single point. On the opposite, if we consider x_o, which is an outlier because it lies outside this, let's say, cluster of data points, we need fewer divisions of the hyperspace to isolate it. This is the main idea of the Isolation Forest.

Talking about the deep learning models, we are using autoencoders. These are neural networks that use a bottleneck to reduce the latent space of the input, and they are trained to reconstruct the input itself. Moreover, if we use LSTM or GRU units, we are able to reconstruct multivariate time series. What does it mean? It means that if in input we have this time series (for instance, it is still the normalized CPU load), we train the autoencoder and then run an inference step, this will be the reconstructed input, and it will be close to the original input because the model is trained. The main idea for the anomaly detection purpose is that if we train the model with normal data, then at inference time the model will have a higher reconstruction error for anomalous windows than for normal ones. So, all these models output anomaly scores, and those scores are higher the higher the probability of being anomalous. For instance, if we have these windows, all the scores will be close to zero because everything is normal; on the opposite, in the next one, these hypervisors should have a higher value of the anomaly score.
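As a rough illustration of this reconstruction-error idea, here is a minimal LSTM autoencoder sketch in TensorFlow/Keras. The window length, feature count and layer sizes are illustrative choices, not the configuration of the actual system.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

WINDOW, N_FEATURES, LATENT = 48, 6, 8   # illustrative sizes, not the real ones

def build_autoencoder() -> tf.keras.Model:
    inputs = layers.Input(shape=(WINDOW, N_FEATURES))
    encoded = layers.LSTM(LATENT)(inputs)                      # bottleneck
    repeated = layers.RepeatVector(WINDOW)(encoded)
    decoded = layers.LSTM(LATENT, return_sequences=True)(repeated)
    outputs = layers.TimeDistributed(layers.Dense(N_FEATURES))(decoded)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model

def anomaly_scores(model: tf.keras.Model, windows: np.ndarray) -> np.ndarray:
    # Higher reconstruction error -> more likely anomalous.
    reconstructed = model.predict(windows, verbose=0)
    return np.mean((windows - reconstructed) ** 2, axis=(1, 2))

# Usage sketch: train only on windows considered normal, then score new ones.
# model = build_autoencoder()
# model.fit(normal_windows, normal_windows, epochs=10, batch_size=32)
# scores = anomaly_scores(model, new_windows)
```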
Finally, before looking at the results, these are the metrics that we use as input. We selected them from experience, looking at the results and at the evaluation on our benchmark dataset, and they were also suggested by the cloud managers at CERN. So, namely, we use the context switches, the CPU load, the CPU system, the disk IO time, the pen, and the memory free. Of course you can add more inputs, but talking about unsupervised learning, it is better to keep the set of inputs as small as possible, in order not to have many correlated inputs.

So, let's see the results. Talking about the figures of merit: of course we have to decide how to measure the performance of the models. In the anomaly detection field we usually use the ROC curve, which plots the false positive rate against the true positive rate, and to summarize the performance of a model we use the area under the curve, which gives us a single number. What we did to compare the old system and the new one was to set a specific false positive rate, given by the cloud managers, which is usually quite small, because in production you don't want many false positives. Then, fixing that value, we look at the true positive rates of the different strategies to decide which one is the best. To do this we also needed a benchmark dataset. There is a lack of publicly available datasets on this topic, so we built our own: we looked at 40 hosts, 40 servers, over two months of data, and we labeled a total of 12,000 windows of four hours, with about 2% of them anomalous.

So these are the results. As I was saying, we set the false positive rate to 0.1%, which is low but acceptable for the cloud managers, and we looked at the true positive rates of the two strategies: 8% for the current system based on thresholds and 21% for the data-driven system. We also tried to increase this value, just to see what happens, for example, at a 4% false positive rate, and we saw that the current system was anyway not able to catch many anomalies: it reached 26% of true positive rate against our 92%. So we can say that our machine learning models and our system were indeed able to outperform the previous one.

So, summarizing: we engineered and designed this anomaly detection system, which is modular and extendable and can be used by others, both in the CERN context and in other contexts. The system uses three main machine learning models, namely the Isolation Forest and the LSTM and GRU autoencoders, and it leverages the collective behavior of the cloud servers. This is the main point, and we think this is why it works pretty well, and why it outperforms the threshold-based Grafana solution that was used by the CERN cloud managers. So this was it. Thank you for the attention. And of course we have more or less five minutes for questions.

Is it recorded? Or should I repeat the questions? Thank you for the presentation, it was really interesting. I have one question. All the metrics made sense to me, but I was wondering why you chose memory free over memory used; that seems a bit counter-intuitive. Is there any specific reason? Was memory free more specific, or is it just because it is easier in some way? There is no specific answer; we were just trying not to have correlated inputs, so either of the two would have been fine. It was a bit arbitrary actually: we just used memory free, and the cloud managers said that was okay. All right. Okay, thank you so much.