OK, our next presentation will be from Andrea Frittoli. He is an open source developer advocate at IBM, and he will tell us about his experience using machine learning for continuous integration. Please, go.

Thank you for the introduction. Welcome, everyone. My name is Andrea Frittoli, and today I'm presenting work done with a couple of colleagues, Matt Treinish and Kira Hulfert. The idea is that we wanted to learn more about machine learning. There are a lot of open source tools for machine learning, a lot of documentation, a lot of books you can find, and you have very powerful tools at your disposal. So we thought: let's pick something that is interesting to us and apply machine learning to it. We have been working with the OpenStack community, and there is a very large scale CI system there. The open source tools are there, but what is hard to find is an open source dataset, if you will. You really need a good dataset if you want to do something meaningful, and it's hard to get good quality data. But because we have this CI system in OpenStack, all the logs are publicly available. So we thought: let's look at the data we get from CI.

Here we are showing the number of tests that run daily, in millions (the scale is 10^7). I got this graph updated until the beginning of March. We do continuous integration continuously in OpenStack, so we have lots of data. One of the problems in the community has always been having enough engineers to triage all the failures and understand everything that is going on. So we thought we might use AI to try to solve this problem and help us, as humans, scale better. As you know, an open source community may have a limited number of engineers, and they may have a limited amount of time they can dedicate to the project. So we thought: maybe AI can help us here, looking at these log files from CI and understanding what is going on.

What we do for OpenStack is integration testing in virtual machines. OpenStack, if you don't know it, is a cloud management system: it allows you to manage a cloud, with virtual machines, networks, storage, and so forth. We have public cloud donors that donate resources to the OpenStack project so we can spin up VMs. Every time a patch is submitted and we want to run tests, we spin up VMs, we install the entire OpenStack on top of those virtual machines, and then we run end-to-end tests, which means we actually create virtual machines, storage, networks, and so forth inside the VM. All of this generates load on the virtual machine. We then gather system logs and application logs, and we also gather the logs produced by a tool called dstat, which stores, in a kind of CSV format, key statistics about your operating system: CPU utilization, disk I/O, memory utilization, the load average over one or five minutes, and so forth. There is a large number of statistics, and we get this data for all the CI jobs, the millions of tests that run daily. We thought we could use the dstat data because it is numerical data, which makes life a little bit easier in terms of normalization.
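As a concrete illustration, a dstat log can be loaded with a few lines of pandas. This is a minimal sketch, not the talk's code: the number of banner rows to skip varies by dstat version, the file path is hypothetical, and the column names follow dstat's usual conventions but may need adjusting for a given file.

```python
import pandas as pd

# Minimal sketch of loading a dstat CSV log (assumption: the file has
# a few metadata banner rows before the real header; the exact count
# depends on the dstat version, so skiprows here is a guess).
def load_dstat(path):
    df = pd.read_csv(path, skiprows=6)
    df.columns = [c.strip() for c in df.columns]   # tidy column names
    return df

df = load_dstat("dstat.csv")     # hypothetical path to one CI job's log
print(df[["usr", "1m"]].head())  # user CPU and 1-minute load average
```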
It's also interesting because, although we get this data by running tests on OpenStack, it's not necessarily OpenStack specific: the same kind of machine learning model could be applied to dstat data generated in any kind of use case.

One of the things that we recognized, maybe a bit late, is that you need data. Even if you're not sure exactly what kind of machine learning model you're going to use and you're still considering options, if you have a chance to start collecting data, do it as soon as possible: the more data you are able to gather, the better the outcome of your machine learning work will be. Because the storage space on the OpenStack CI side is limited and the logs there are regularly cleaned up, we set up a system that regularly fetches the logs that are interesting to us and stores them as raw data in S3 storage. We use a function that runs periodically, a couple of times a day I think, to get the latest logs and store them in S3.

We then built an application that lets us create what we call a dataset: a normalized dataset extracted from the raw data we collected. Actually, the first step, before you dive into writing Python code, is to explore your data and try to visualize it if possible. If there are multiple dimensions, you may try to cut it in some way, or aggregate it in some format, so that you can find where there is more information in your data. Sometimes you have a lot of redundancy: looking at system performance indicators, as in our case, you may see that the graph for the CPU matches the graph of the load average very closely, so using both of them is redundant because they carry the same kind of information. And since the more information you have, the more processing you have to do, it's better to get rid of the information that is not relevant. So gather a lot of data first, but then focus on what is actually relevant for your model.

Here is an example. It's a Python tool we call CIML, for CI Machine Learning, which takes the raw data we collected over time and extracts a dataset in a known format. We do two things. First we filter: we select the features we want to use, say the CPU or the memory, and we select the size of the dataset. Then we normalize. Since the data is already numeric, normalization mostly means making sure the numbers are between minus one and one, which makes the computation much cheaper, if you will.

This is an example of what the dstat data looks like; as I said, it's a kind of CSV, a big table. You have timestamps, and for every timestamp you have a lot of columns. What we do here is select the columns that are interesting for us: this one is the CPU utilization for user space, and this one is the system load average over one minute. Then we also downsample, because we have this data for every second, and we realized (I will go into more detail about this later) that you don't actually need that kind of granularity.
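To make the filtering, downsampling, and normalization steps concrete, here is a minimal sketch under the same assumptions as the snippet above (it reuses that df; the column names and the one-minute step are illustrative, and this is not the actual CIML code):

```python
import numpy as np

# Sketch of the three preprocessing steps: keep only the selected
# dstat columns, drop from one sample per second to one per minute,
# and scale each column into [-1, 1].
def to_minus_one_one(col):
    lo, hi = col.min(), col.max()
    if hi == lo:                      # constant column: no information
        return col * 0.0
    return 2.0 * (col - lo) / (hi - lo) - 1.0

selected = df[["usr", "1m"]]          # user CPU + 1-minute load average
downsampled = selected.iloc[::60]     # 1-second samples -> 1 per minute
normalized = downsampled.apply(to_minus_one_one)
```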
You may get good results with lower granularity. Then we do data normalization. The filtering we did before gives us a matrix, but for us this matrix is actually one sample, because it's the result of one CI test. So we do unrolling, which means that the matrix becomes a long vector of numbers, and then normalization. This is a sample of unrolled data: the three values in the CPU column that was here become three different columns, usr1, usr2, usr3. We have this for every instant in time that we selected, and the lines now correspond to different samples: this is CI job 1, CI job 2, and CI job 3. This makes it a bit clearer. You can see the numbers have arbitrary values, and after normalization they all fall between minus 1 and 1.

Another thing to take care of when you build the dataset is not to use everything for training. Training your model is important, but it's also important to have an evaluation phase where you test how accurate your model is at predicting the things you want to predict. You may also want a small dev dataset that you can use for fine-tuning the hyperparameters of your model. Something I wanted to mention is that "labels" is a bit overloaded as a word: here, the labels are just the names of the features, like usr1 and usr2, then there is the data itself, and the classes are the values of the things we want to predict. I'll give more details about that in a moment. The datasets are also stored in S3 storage.

The next thing we wanted was to be able to define an experiment. Again, we have a tool written in Python that allows us to define the hyperparameters of an experiment and store them in a format that lets us rerun experiments in a repeatable way. We have a dataset that we can recreate, and then we can run experiments. There are multiple hyperparameters: we can select the structure of the neural network, for instance how many hidden layers, how many neurons per layer, how many steps, and so forth. Then we have a wrapper command in CIML, train model, that actually runs the training.

The reason we split this into separate steps, data preparation and normalization first and training after, is that we use TensorFlow here, and TensorFlow has tools and APIs that let you do data normalization, but we wanted to do that directly with Python tools. When we started, we didn't know for sure whether we wanted to stick with TensorFlow or maybe switch to a different framework, so we wanted to be able to normalize our data into a clean state that we could then feed to any machine learning framework we chose.

With this we are able to run the training locally on a laptop, but we also integrated it with FfDL (Fabric for Deep Learning), an open source project driven mainly by IBM contributions, which basically allows you to take a model and distribute its training on a Kubernetes cluster. This picture shows the architecture of FfDL itself; I will not get into the details of that.
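Going back to the dataset preparation for a moment, a minimal sketch of the unrolling and of the train/dev/test split might look like this. The job_matrices input (one equally-shaped array per CI job) and the 70/10/20 ratios are assumptions for illustration, not the talk's actual numbers:

```python
import numpy as np

# Unrolling: each CI job gives a (timesteps x features) matrix, and
# each matrix becomes one flat row of the final dataset.
def unroll(job_matrices):
    return np.stack([m.flatten() for m in job_matrices])

data = unroll(job_matrices)
rng = np.random.default_rng(42)
idx = rng.permutation(len(data))            # shuffle before splitting
n_train, n_dev = int(0.7 * len(data)), int(0.1 * len(data))
train = data[idx[:n_train]]
dev = data[idx[n_train:n_train + n_dev]]    # for hyperparameter tuning
test = data[idx[n_train + n_dev:]]          # held out for evaluation
```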
We use the estimator API in TensorFlow, which is based on Keras, another open source library that gives you a very high level of abstraction; it's very good and very simple if you want to start defining models. We created the CIML wrapper, a Python script that invokes the TensorFlow APIs and runs the training, and we do the normalization and the preparation of experiments ourselves. As I was saying, the machine learning framework is interchangeable, and for training you can run locally, or in a container (we have a Helm chart to deploy the application), or on FfDL.

On the prediction side, the model we implemented is event driven. We have CI jobs that are executed, and the event is generated when a CI job completes: new data is produced, and we want to run inference on that data with the model we trained. So we don't have a request-driven inference model; we have an event that triggers it. For that we've written a Kubernetes application that includes an MQTT client. The OpenStack CI system generates MQTT events when jobs are completed, so we can listen to those events, download the logs, and run inference to predict the parameters we're interested in, as in the sketch below. And because the CI jobs from OpenStack are, for us, a trusted source of data, we can also use the new data that comes in to continue training the model.
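A minimal sketch of such an event-driven inference loop, in paho-mqtt 1.x style. The broker address, topic, event fields, and the fetch_dstat/predict helpers are placeholders for illustration, not the actual CIML implementation:

```python
import json
import paho.mqtt.client as mqtt

# When a "job complete" event arrives, fetch the new dstat log and
# run inference on it with the trained model.
def on_message(client, userdata, msg):
    event = json.loads(msg.payload)
    if event.get("action") == "complete":    # a CI job just finished
        raw = fetch_dstat(event["log_url"])  # hypothetical helper
        print(predict(raw))                  # hypothetical trained model

client = mqtt.Client()
client.on_message = on_message
client.connect("mqtt.example.org")           # placeholder broker
client.subscribe("ci/jobs/#")                # placeholder topic
client.loop_forever()
```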
Okay, that was all about the infrastructure and how we structured the project. Now I'll get into more detail about what kind of training and inference we've done. We've done two main types of experiments.

The first one is a binary classification: very simply, we want to predict whether the tests passed or failed. We've seen enough logs over the years that we thought: let's make a bet. Just looking at the system load, we should be able to predict whether a test passed or failed; let's build the model and see. It's supervised training, because we actually know whether each test passed or failed. We used the DNNClassifier from TensorFlow with two classes. We picked a specific CI job, called tempest-full. In OpenStack there are several pipelines, but two main CI pipelines. One is the check pipeline, which is executed when the code is initially submitted to the project, before it's actually reviewed by developers. There is a lot of variability in there, and we didn't want to use that data, because we didn't want to try to predict failures that are due to a typo in the code or something like that. Instead, we took data from the gate pipeline. The gate pipeline is executed on code that has already been tested by the automatic system, reviewed by the contributors, and approved. Just before the code is merged, the tests are executed again: that is the gate pipeline. It's clean data, and the failures are related to things like infrastructure issues; maybe a new version of a library is released, gets pulled into the test, and it fails; maybe there is a race condition that was not picked up during the check testing; and so forth. We had 3,000 examples, which we split into 2,100 for training and 900 for evaluation.

In terms of hyperparameters, we used fairly standard settings: the ReLU activation function, with sigmoid for the output layer, which worked very well for the binary classification. The default optimizer again, Adagrad, the adaptive gradient descent; it's a good one because, being adaptive, the initial learning rate you set doesn't really matter much, as it will adapt itself and select a good value. It's a neural network: we started with five hidden layers and 100 units per layer, and we ran 500 epochs, which means we go 500 times over our input set. We have 2,100 samples for training, we repeat them 500 times, and we use an input function that randomizes the order to make the training more effective.

Then we ran different tests. Our key metric is accuracy, and we tried different combinations of features: just the CPU, just the memory, the load average, or combinations of different features from the dstat data. We saw that, in terms of accuracy, the combination of CPU and system load average was the most effective one. The chart looks a bit the other way around, because for accuracy, the closer to one, the better, right? All these tests had reasonable accuracy, but they were close enough that if you displayed them as bars from zero to one you wouldn't notice the difference. So we actually display one minus the accuracy, so that you can appreciate the difference in accuracy between the different experiments. We also looked at the loss; the loss was slightly worse in this case, but still acceptable and not very different from the others. With this we achieved an accuracy of 0.992, which we think is pretty good: on a 900-sample evaluation set, it means about seven mistakes.

Another thing we wanted to see is how well this trained model could work with a different CI job. There are several CI jobs executed for OpenStack; for each we run end-to-end tests, which is the tempest-full job, and then there is a Python 3 version of it, which is very similar but slightly different, because one of the components, Swift, doesn't support Python 3. Those tests are not executed and that service is not running, which affects the overall system metrics. So what we did is train the model with the tempest-full dataset and run the evaluation on the tempest-full Python 3 data. The result was kind of okay: the accuracy went down from 0.994 to 0.953, which is still reasonable, but the loss doubled and the area under the precision-recall curve is much worse. This may mean that our data from the original CI job was too biased, or that a bit of overfitting was happening there.

To summarize the binary classification: we found that 10-second sampling is best, but one minute may be enough to get a very good accuracy and a very good precision-recall curve. The graph on the right side is the training loss as training progresses: it goes down smoothly and it's pretty consistent.
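Putting the pieces together, a minimal sketch of this estimator setup could look as follows (TensorFlow 1.x style; feature_names and the input functions are assumed to come from the dataset preparation steps sketched earlier, not shown here):

```python
import tensorflow as tf

# Sketch of the binary classifier described above: a DNN with five
# hidden layers of 100 units, two output classes (passed/failed),
# and the DNNClassifier's default Adagrad optimizer.
columns = [tf.feature_column.numeric_column(n) for n in feature_names]
classifier = tf.estimator.DNNClassifier(
    feature_columns=columns,
    hidden_units=[100] * 5,    # five hidden layers, 100 units each
    n_classes=2,               # passed / failed
)
classifier.train(input_fn=train_input_fn)  # input_fn reshuffles per epoch
print(classifier.evaluate(input_fn=eval_input_fn)["accuracy"])
```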
The second experiment is a multi-class classification. As I was saying in the beginning, the VMs are donated by different cloud providers around the world, so one thing we wanted to see is whether, just by looking at the system profile, we are able to detect which cloud provider is hosting the test. The classes in this case are 10, because there are 10 different regions providing resources. We used a similar setup: the same type of dataset and hyperparameters, again a resolution of one minute and the same features. The loss converges, but the accuracy we got was 0.6, which is not really good. Not good enough, definitely.

So we went back to doing different experiments. Again we tried different combinations, and we tried putting more data in: maybe we needed more system features, so we added the memory and tried other features as well, but there was no significant improvement in accuracy. The bottom value there is 0.4 in terms of one minus accuracy, which again is 0.6 accuracy. Then we thought: maybe we are downsampling too much. So we tried changing the resolution: 10 seconds, 30 seconds, and we also tried downsampling even more, going to one minute and ten minutes. Again, the best we could get was not very good; there was a slight improvement going down to 10 seconds, but nothing really significant. We tried changing the hyperparameters: maybe we needed a more complex, or a simpler, network. It turns out that using three layers instead of five is slightly better: we got 0.668, but that's still not really useful; we couldn't do a good prediction with that.

What changed things was going back to the data, and this is one of the key outcomes of this work and a good takeaway: the data is really important, so you should know your data, look at your data, and understand what is going on as much as possible. As I said before, we had 10 classes, but it turns out that some of these classes could be merged, because we had several regions from the same cloud provider. For example, OVH had two different regions, and another cloud provider had three different regions. We thought: they are the same cloud provider, so even if the regions are in different geographies, they will use a similar type of setup and similar hardware, so maybe we can collapse them into a single class. This way we reduced the classes to six, and when we ran our experiments again we saw the accuracy increase dramatically: we managed to achieve 0.9. Basically, the takeaway is that we had been trying to separate things that were not actually separable, because they're very similar; that's why we couldn't get a good accuracy, as it was essentially random for the model whether a certain test ran in region A or region B of the same cloud provider. We did some extra tuning, trying different network topologies, and we managed to get to 0.925 in terms of accuracy with a one-minute resolution, again using the CPU and the system load average.

We then tried applying the multi-class model to a different CI job, with the same setup as before, but this time it didn't work that well. The best accuracy we could get, training with one CI job and evaluating with a different one, was around 0.77, so not really good, and the loss increased very much.
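A toy illustration of the label cleanup that made the difference here: regions belonging to the same cloud provider are collapsed into a single class before training. The region names below are made up, and the mapping is smaller than the talk's actual 10-to-6 reduction:

```python
# Map each region label to its cloud provider, so that regions of the
# same provider (similar hardware and setup) share one class.
region_to_provider = {
    "provider-a-region1": "provider-a",
    "provider-a-region2": "provider-a",
    "provider-b-region1": "provider-b",
    "provider-b-region2": "provider-b",
    "provider-b-region3": "provider-b",
}
# Any region not listed keeps its own class.
classes = [region_to_provider.get(region, region) for region in regions]
```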
To summarize the multi-class example: again, the user CPU and the one-minute load average were the best selection of features from the dstat data for this prediction. It's an interesting takeaway that just by looking at the CPU and the system load average, sampled every minute, you can actually identify the underlying cloud provider that is providing the VM. This makes us think it could be used in the future for things like creating specific benchmarks that could be applied to different cloud providers, to recognize different types of issues there.

So, to conclude. I'm not sure how we're doing on time. Yeah, so to conclude, a key takeaway: collect data, you need a dataset. Even though there is a lot of open source software, I think a big challenge will be open source data. A lot of companies have a lot of closed data. The good thing about large open source communities is that they have the ability to produce a lot of data which is itself open, available for everyone, and you can actually use that to run machine learning experiments.

The other thing is that you need domain-specific knowledge of the data you're using. If you are a lawyer and you try to do a machine learning project on something related to accounting, it probably won't work. In our case, we had worked for several years on tuning the QA and CI systems and reading logs for OpenStack, so we knew the data; that's why it was a good fit.

I also strongly recommend working with cloud tools. In our case, because it's just a simple neural network, it worked well running on CPU; we tried running on GPU but didn't see much improvement. Nonetheless, we recommend running your training in the cloud; there are several ways you can do that. You don't want a dependency like starting your training on your laptop and then the battery runs out or you have to go somewhere. You want this running in the cloud, and you want repeatable experiments and the ability to collect the data and keep it in the cloud. Especially because maybe it starts as just you, but then some other colleagues or a friend join the project and you want to share data and collaborate; if you have everything set up in the cloud, that becomes much easier.

Also, we were able to confirm that the system load plays a role in the failures. Of course, with the binary classification we only get information about whether the tests failed or passed, which we know already, but you also get a confidence level with the prediction. In the future, we think we could use this confidence level about the pass/fail prediction to build a system that shows when a test, even if reported as passed, is not identified as passed with high confidence by the model, which might let you detect early that something is going wrong.
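For instance, with the estimator API used earlier, each prediction comes with per-class probabilities, so a watcher along these lines could flag low-confidence passes. This is a hypothetical sketch, not something from the talk's code: the classifier and input function are assumed from the earlier snippet, and the class id and threshold are made up.

```python
PASSED = 1         # assumed class id for "passed"
THRESHOLD = 0.8    # assumed confidence threshold

# Flag jobs the model calls "passed" but without high confidence:
# a possible early warning that the system under test is changing.
for pred in classifier.predict(input_fn=new_jobs_input_fn):
    confidence = pred["probabilities"].max()
    if pred["class_ids"][0] == PASSED and confidence < THRESHOLD:
        print("passed, but with low confidence:", confidence)
```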
A real-life situation we had with the CI system is that, patch after patch, maybe some tests are added which introduce race conditions, maybe the performance of one of the services degrades so it starts using more memory unnecessarily, or simply there are more services. So the quality of the CI jobs slowly degrades over time, until it gets to the point where the failure rate for tests that should actually pass becomes too high. That's not good for a CI system, because it gives you a very bad developer experience. Having this kind of confidence evaluation of whether a test passed or failed can be used to detect that something in your test system is getting worse, so you can predict that things are heading in the wrong direction.

Other future work we might do is to look at using similar techniques for other types of data, like application logs for instance. For other features, we could look at predicting other things, like the type of failure that happened. The difficult bit with that is that you then really need a human-created dataset. Basically, if you have engineers that usually look at the logs, they will say: well, this failed, yes, but it failed because, I don't know, cloud provider X had an outage, or because there was a network issue somewhere. You could start collecting this information, label your data with it, and then use it in a similar way to train a model to detect the type of failure. But this requires a lot of work, because you need to collect all this human input.

Another thing we thought about for future work is trying to define a metric that lets us do clustering on the data. Clustering is interesting, though it's very difficult. If you manage to find the right metric, you may be able to display your data in some n-dimensional, or two- or three-dimensional, space and see that specific test runs live in certain clusters, in certain areas of the space. That lets you identify that these CI jobs, these data samples, have something in common, and it's basically a way to automatically build classes without the human-created dataset I mentioned before. Other things we want to explore more are job portability, and if we manage to get interesting information out of the clustering, we could use that data in the real CI system as well.

The talk is available on GitHub, and all the code that we wrote for the CIML project is also available on GitHub. And if you have any questions, there might be some time.

Sorry, could you explain, go through again, why the prediction of pass and failure is useful? From what I could tell, the input data for the prediction is only the load levels of the test being run. Is that right? Did I miss something?

As it is now, it's information we already have. So in that sense it's useful in the sense that it proves to us that there is information embedded in the system data: from these load profiles we are able to infer something, namely whether the test passed or failed, with very good accuracy. And this, basically, is what prompted us in the beginning to continue, do other kinds of experiments, look at the multi-class classification for instance, and try to extract more from this load data that we have about the system. That's basically it.
The other thing I mentioned, which is not something we really tried yet: because you can get an estimate of the confidence of the prediction, you could build a system where the model predicts that a job passed or failed with a certain level of confidence, and then you use this confidence level in your CI system. If I'm running the same tests, the same type of CI jobs, and the confidence degrades, it means that something is changing in the system we are testing. So you could use this information as a kind of warning to look at what is going on. It could be, for instance, that the memory utilization of one of your services has increased, and that changed the load profile, which degrades the confidence in the prediction.

Okay, got it now, thanks.