Good morning, good afternoon everyone. So machines can generate a tremendous amount of log data, and this presentation is about reducing the complexity of log files to narrow down the interesting bits. If you take a picture of this slide you can use the QR code to download the slides with the speaker notes.

So we are OpenStack engineers. Dirk works for SUSE, he has been troubleshooting OpenStack since Folsom, and he spends too much time looking at log files. Thanks Tristan. So Tristan is working for Red Hat and has been a member of the vulnerability management team for over three years now, and part of his work is focused on CI/CD solutions.

So to give you a reason why we are here: we love OpenStack Health. OpenStack Health is a front end for the OpenStack CI system. It's a huge CI system supporting the OpenStack software development, and it produces daily a tremendous amount of log files, test results and so on, over a large variety of different jobs. And the OpenStack Health website gives you a good way to introspect the testing results. So in this graph you can see that fairly regularly one particular job, which I just picked at random, is being run in the infrastructure over a couple of days. And occasionally there are jobs that are not always passing. This is sort of normal, but it's really important to get to the root causes of those frequent or infrequent failures in order to have a stable and more predictable result when you're changing the software. So whenever jobs fail unexpectedly, people, engineers that work on OpenStack, have to look at the log files and figure out if there's a certain pattern in this failure that needs to be solved in the software or in the CI environment.

So usually OpenStack Health gives you this kind of overview, but then after some time you actually click on some link and you end up on something that looks roughly like this. Please raise your hand if you have had this experience: lots of text, grey, unwieldy, no idea what's going on. So I'll give you a hint, there's actually an error here. Who can see the error? Raise your hand. So I made it easy because I only extracted 2% of the total log file. The error is actually here: when you know Ansible, you can sort of see the structure of tasks getting executed and you can see a failure report at the lower end of the slide. And you can guess, okay, there has to be an error somewhere near that line. Okay, so who found it? Okay, there are a couple of people that are quicker or braver than I am, pretty good.

So to sum this up, a lot of the analysis of test results is a manual process that takes a lot of time. You're looking at the job results, you get a whole bunch of log files. Usually you have to start with the top-level log file, which you open in a separate browser window, then you scroll, you scroll, you scroll, you scroll. You find a failed task somewhere, and that failed task is associated with a timestamp. And then using the timestamp, you can look up the similar timestamp in the service log files that are siblings of that particular log file. So usually the failed task is not enough to figure out the root cause; you instead need to look at what the services were thinking about the state of the overall system at that particular point in time. So you open more browser tabs and you introspect the individual service logs, and you also have to do a lot of scrolling and matching in order to figure out how things led to that failure. And this is usually quite tedious and time-consuming.
And the actual root cause of the failure is not always labeled with a string called error or failure that you can easily grep for in your browser. Sometimes the failure is just a combination of things that haven't already been tagged as an error in the code before. So our vision that we want to present here is that you can go directly from the job results to finding a root cause without a lot of manual tasks in between. And we believe that most of the tasks that I presented before can be automated, or tooling can automate them for you. And even in non-obvious cases, like for example race conditions or feature flags being turned on and off depending on particular corner cases, an automatic process can help you isolate that better. And to give you a small sneak preview of the tool that we are going to introduce: this is the very same log file that I showed you before. The difference here is pretty obvious. You see a red line, which is a line of a log entry that is anomalous, or considered anomalous by the log analysis tool. And this actually gives you a really good starting point for further root cause analysis.

Thank you for this presentation, Dirk. Today's plan is in three parts. First we'll do a short introduction to machine learning. Then we'll look at the log classifier implementation. And we'll do a CI integration demo.

But first a bit of taxonomy. So, AI, Artificial Intelligence, is a growing branch of computer science with a very broad scope. Machine learning is a field of artificial intelligence that uses statistical techniques. And nearest neighbor, one of the simplest of all machine learning algorithms, is the one we will explore today. In this presentation, a build is defined as a generic process that generates logs, such as a CI job, a deployment script or a service operation. The baseline defines a nominal build, and the target will be the build that we want to analyze.

In this first section, I'll introduce the base principle along with two objects that can be used for log processing, namely the hashing vectorizer and the nearest neighbor model. Note that other models may easily be used while keeping the same workflow. So, let's look at a generic training workflow. This diagram shows how baselines are processed to train a model, and the raw text lines need to be converted before being used by a machine learning model. Then we can repeat the same process to test the target. After the model has been trained with nominal data, it can eventually detect the noise from the target data and report what went wrong. Next, we will see how to implement such a model.

We already know what is not relevant for log analysis. Random words, such as a date or an IP address, can be replaced by a known token, and we'll see that doing generic tokenization greatly reduces the complexity of this problem. After tokenization, log lines need to be transformed into something more convenient for machines. The hashing vectorizer converts each word into a numeric hash value using the hashing trick, then it encodes the token occurrence information into a sparse matrix, an array over all the possible hashes. So, each vector is very sparse as it only contains the line's tokens. And the vectorizer is used on each log line, whatever its source or structure. As an example, this is the job output of a DevStack job. On the Y axis, there are the log lines, so in this example it's 34,000 vectors, each represented horizontally. And on the X axis, it's the token occurrence position.
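To make the vectorization step concrete, here is a minimal sketch using scikit-learn's HashingVectorizer; the example log lines, file-free setup and the n_features value are made up for illustration, and this is not the exact code of the tool.

```python
# Minimal sketch: turning log lines into sparse vectors with the hashing trick.
# Illustration only; not the actual log-classify implementation.
from sklearn.feature_extraction.text import HashingVectorizer

baseline_lines = [
    "INFO nova.compute started instance DATE UUID",
    "DEBUG keystone token issued for user DATE",
]
target_lines = [
    "ERROR nova.compute failed to attach volume DATE UUID",
]

# Each word is hashed into one of n_features columns; every line becomes
# a very sparse row that only contains its own tokens.
vectorizer = HashingVectorizer(n_features=2 ** 18, alternate_sign=False)
baseline_vectors = vectorizer.transform(baseline_lines)  # shape: (2, 262144)
target_vectors = vectorizer.transform(target_lines)      # shape: (1, 262144)

print(baseline_vectors.shape, baseline_vectors.nnz)  # only a handful of non-zeros
```

Because the vectorizer is stateless, it can be applied to any log line regardless of its source or structure, which is exactly the property the workflow above relies on.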
So, the green dots show baseline vectors and the red dots show target vectors, and this representation shows all the vectors in order. What we want to do is look for the distance of each target vector to all the baselines, whatever the order. So, next, we will see how to interpret those vectors. We'll use the nearest neighbor model, which is a model that learns from the baseline vectors, our training data. It enables efficient regression analysis to quickly compute the distance of target vectors to the baseline, and we do that by using the k-neighbors query. So, this example illustrates a search from the previous DevStack example. The raw log line is converted to a vector of tokens, and the k-neighbors function returns the distance of that vector from the closest baseline, which is a value between 0 and 1. In this case, the line contains unknown tokens, shown in red, and that increases the distance a lot.

This is preliminary work, based on the hashing vectorizer, and it works very well for a large majority of log files, but it has limitations. The current nearest neighbor implementation actually uses a brute-force search algorithm, which means that the search complexity grows linearly with sample size. It's also kind of an unsupervised learning model, thus short failure messages may be indistinguishable from noise. And finally, nearest neighbor doesn't work well if the log file contains too many sparse features. For example, the Mistral service log may contain the full request body, and that generates very long log lines, which makes the vectors very different, and computing the distance does not result in relevant information.

However, most log files have a limited amount of possible variation. So, this graph shows the number of unique vectors found in jobs. On the y-axis, you have the number of vectors, and on the x-axis, the number of jobs used to collect those vectors. And this second graph shows the search time growing linearly with sample size. On the y-axis, you have the search time for 512 vectors in seconds, and on the x-axis is the number of samples being used by the model. So, according to those measurements, models being trained with more than a million vectors are not very practical, because it will take 40 milliseconds to compute one distance, which is too long for processing all the output of a single job. And even 500,000 vectors is kind of a limit for this implementation. The next section introduces an easy-to-use implementation of this technique.

Thank you, Tristan. And thanks a lot for allowing me to introduce the tool that Tristan here originally wrote. So, hopefully, I'm not going to butcher it too much. So, what is log classifier? Log classifier is a Python tool based on scikit-learn, which is a machine learning framework. Scikit-learn provides a set of algorithms for efficient data analysis, like big data and also machine learning techniques, and it has some support for text-based learning. So, it provides various standard models, like the bag-of-words approach or the hashing vectorizer that Tristan was explaining earlier. And in our experience, the hashing vectorizer is actually most efficient when you let it run on machine-generated input, that is, log files that are generated by automated jobs, rather than other text classification approaches that were originally created for analyzing human-written text, like literature or similar input.
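Before moving on to the tool itself, here is a hedged sketch of the nearest neighbor distance query that Tristan described, using scikit-learn's NearestNeighbors on top of the hashing vectorizer shown earlier; the example lines, the cosine metric and the reported distances are illustrative assumptions, not the tool's exact configuration.

```python
# Minimal sketch: score target lines by their distance to the closest baseline line.
# Illustration only; data, metric and output are made up.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.neighbors import NearestNeighbors

baseline_lines = [
    "INFO nova.compute started instance DATE UUID",
    "DEBUG keystone token issued for user DATE",
]
target_lines = [
    "INFO nova.compute started instance DATE UUID",         # known line, distance ~0
    "ERROR nova.compute failed to attach volume DATE UUID",  # unknown tokens, larger distance
]

vectorizer = HashingVectorizer(n_features=2 ** 18, alternate_sign=False)
model = NearestNeighbors(n_neighbors=1, algorithm="brute", metric="cosine")
model.fit(vectorizer.transform(baseline_lines))

# kneighbors returns, for each target vector, the distance to its nearest baseline vector.
distances, _ = model.kneighbors(vectorizer.transform(target_lines))
for line, dist in zip(target_lines, distances[:, 0]):
    print(f"{dist:0.2f} {line}")
```

The brute-force search here is what makes the cost grow linearly with the number of baseline vectors, which is the limitation mentioned above.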
And the only assumption the tool really makes is that you pass in text-based input; it should be line-formatted, it's not handling anything else. Installing it is actually fairly easy: you can use pip to install it, and you can also install it with your distribution-provided packages, if you want to. And when you invoke it, it gives you a help screen; I'll go over that fairly quickly. So, the main modes of usage are listed here. The first one is diff, which takes two parameters, two files, and it will try to reduce the diff between those two files and give you a reduced output. You can also give it a set of files from a directory, which is what the dir command is used for. And in order to tweak that more, you can separate the training and the actual run step into two different passes. And there are more advanced use cases, like for example the Zuul integration, so that you can directly connect it to a Zuul instance and it will download and integrate the log files that are stored there.

So, the assumption of log classifier is that you feed it build logs, and if the build result was successful, then you add that particular log file to your baseline. The baseline is a collection, a model of known good inputs, that you want to be subtracted, basically, from the target. The nearest neighbor algorithm that Tristan was explaining earlier is a form of non-generalizing learning, so the model is not actually trying to build a representation of the data, it's merely storing the data. So the baseline, or the model, is a store of all the data that you have fed into it. You build it basically from the successful runs and then you run it against the failed run as the target.

To give you this in a simple graphical overview, I took a picture of a pile of leaves, and on the right side there are a couple of differences. Can you see more than one? Can you see one? OK. Most people see one difference. Can you see more than one difference? Yes, there's someone saying two differences. And this is the problem: humans are actually really bad at this because you just can't see it. So, this is the binary diff of the two pictures, and you can see there are actually a couple of changes; it's fairly easy to spot the most obvious one but it's really hard to spot all the others. And sometimes when looking at log files, the non-obvious thing actually leads you more directly to the root cause than the obvious thing, not always.

So, to make it as simple as possible, we are running it against a DevStack job. I already prepared the log files, so I have a known good log file and I have a known bad log file that failed somewhere, and I'm just calling the diff command on them. As you can see, it processes them within a few seconds and it reduces the output of the bad log file to what it considers anomalous, or not part of the good log file. And you can see in this example, where it created a model on the fly in the background and didn't actually store it on disk, that it already provides a reduction from the original 35,000 lines to 700 lines. So, 700 lines is still a lot, but it's a lot less than the 35,000 lines before. And in this particular case one of the highest hits is actually the main error that you can find in this log file.
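As a rough idea of what the diff mode does conceptually, here is a small sketch that builds a model from a known-good log file and prints only the lines of a failed log that are far from the baseline; the file names and the 0.2 threshold are hypothetical, and this is a simplified stand-in for the tool's actual diff command.

```python
# Conceptual sketch of a "diff" between a good and a bad log file:
# keep only the lines of the bad log that look unlike anything in the good one.
# File names and the 0.2 threshold are hypothetical.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.neighbors import NearestNeighbors

vectorizer = HashingVectorizer(n_features=2 ** 18, alternate_sign=False)

with open("job-output.good.txt") as good:
    baseline = [line.rstrip() for line in good if line.strip()]

model = NearestNeighbors(n_neighbors=1, algorithm="brute", metric="cosine")
model.fit(vectorizer.transform(baseline))

with open("job-output.bad.txt") as bad:
    target = [line.rstrip() for line in bad if line.strip()]

distances, _ = model.kneighbors(vectorizer.transform(target))
for line, dist in zip(target, distances[:, 0]):
    if dist > 0.2:  # report only lines that are far from every baseline line
        print(f"{dist:0.2f} | {line}")
```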
And to visualize that a bit more, you can actually store that model and try to make a graphical representation of it. So, in this example I used a standard algorithm called truncated singular value decomposition; it's part of the scikit-learn framework, so I didn't have to implement anything myself, I just had to use the framework. What it's actually doing is reducing the model, which has like a million dimensions, because each word is being encoded as either a one or a zero in a row, down to a 2D graphic. So the 2D graphic doesn't actually tell you that much about the structure of the data, but it tries to flatten it out so that you can see some kind of relationship. So in this graph you can see the grey dots; the grey dots are the rendering of the raw input log file, and the red or orange colored dots in the picture are the reduced log file that we actually ran the model against.

So log classifier has, in addition to the vectorizing step, a feature extraction step where we ignore tokens that we consider completely random, like for example date strings or IP addresses or any other unique IDs. This reduction is unsafe, because we are ignoring a lot of input, but on the other side it increases the efficiency, so you don't have to train on a large collection of sample log files in order to get a good result.

So I already introduced that you can run it against a single file as a baseline, but you can improve the accuracy of the output by training a model on multiple sample inputs. The normal use case is that you do a two-phase approach: you first train on a collection of log files that you consider good, or part of your baseline, and then you run it against what you're actually looking for. In this case I was running it against eight log files, and one question that you could have in mind is why I picked eight. So I actually did a small test with DevStack log files, and you can see here that in my particular, randomly picked example, even training against one sample already gives you a large reduction of the total output; training against more samples gives you a much better result, and you can see it's actually plateauing. So in this case a good choice seems to be between four and eight log files, for example, depending on how short you want the output to be.
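Going back to the feature-extraction step mentioned a moment ago, here is a small sketch of how random-looking tokens such as dates, IP addresses and UUIDs could be replaced by fixed placeholders before vectorization; the regular expressions and placeholder names are illustrative assumptions, and the patterns the tool actually uses may differ.

```python
# Sketch of pre-vectorization tokenization: replace random-looking tokens
# (dates, IPs, UUIDs, long hex ids) with fixed placeholders so that lines
# differing only by such values hash to the same vector. Patterns are illustrative.
import re

RULES = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}\S*"), "DATE"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "IPADDR"),
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"), "UUID"),
    (re.compile(r"\b[0-9a-f]{32,}\b"), "HEXID"),
]

def tokenize(line: str) -> str:
    # Apply each replacement rule in order; the result is what gets vectorized.
    for pattern, placeholder in RULES:
        line = pattern.sub(placeholder, line)
    return line

print(tokenize("2018-11-14 09:12:33 GET http://10.0.0.3/v2/servers detail from 192.168.0.1"))
# -> "DATE GET http://IPADDR/v2/servers detail from IPADDR"
```

This is why a single good run can already serve as a useful baseline: most of the per-run randomness collapses into the same placeholders.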
A different use case that log classifier enables is to use it against journald, so you can directly use your local system logs. The way to use it is to run this command, log reduce journal with a range of a day. What it does in the background is create a model from the journal logs of yesterday and then compare it to the log entries of today. So the underlying assumption here is that anything that happened in the past and didn't annoy you too much is considered a good baseline, so any errors that frequently happen are filtered out and you see what is new today. And this gives you a reasonably good output. It's better if you train on a larger set of data, so similarly to what I explained before, you can use log reduce journal to train it on, say, the last month of your journal entries and then run this saved model against today's log entries. And yeah, I did that against my workstation, so this is a totally unfaked example. You can see here the typical problems that engineers learn to ignore. The first hit that is interesting is this 0.7 hit, which is a very high anomaly score: apparently I'm not able to receive email anymore because my inbox is too large, so there's an error about a file being too large. Okay, whatever. Then there's a different line that was printed coincidentally, saying that I apparently have an unreadable sector on one of my hard drives. And when I looked at it I thought, okay, there must be a software problem, because it's rating that line as 0, which means this is normal. And then I looked further and I realized, actually, I have had this error for quite some time already, so the tool is working perfectly, I just learned to live with the error. So keep in mind, it only reports anomalies compared to the baseline. And to demonstrate that, I just killed one of the processes running on my workstation and reran the last step, and you can immediately see that it finds two additional lines that it considers interesting, exactly the error that I just introduced into the system.

Another more advanced use case that you can think of is using it for analyzing SOS reports. These are tarballs of a complete system snapshot, so they contain a lot of log files, a lot of information about what was installed, what was running, what was going on on the system, and they are usually very large. They are usually taken when there's some kind of problem in the system, and it's not always obvious where the problem is coming from or what even the problem is; you just have this tarball and there was an issue. So you can use the same machine learning principle here to compare it against a tarball that was taken when the system didn't have a problem. That way you can remove the noise and just see what was different in the particular tarball that you are interested in. And in this particular use case the tool is actually building a machine learning model in the background for each log file type, so it's automatically trying to match log files that are similar and belong to the same service, and it builds separate models for each of those services. So it's not mixing and matching everything together, but separating them out, which gives you a much better result.

Also, one interesting thing to show here is that it has an HTML output option, so if you prefer to use a browser, which is recommended here, you can actually open the report and click and browse through it. You can very quickly go to one of the files in the supportconfig, which is called messages.txt, and it extracts the error that was actually the root cause of that particular case, coincidentally with a fairly high anomaly rating. So with just a few seconds of waiting you have reduced a supportconfig tarball of several megabytes and several million, or hundreds of thousands of, lines of text to find the root cause.

And one last example that I want to introduce on the CLI side is that you can also use it as part of your log rotate process. You can basically train a model on the older logs, the logs that you have archived, so the files that have a compression extension at the end, and then run it against the current log file. This way you can just see what is different from what you usually have as errors or anomalies in your system. And this is just one example; you can see this particular use case already saves you from having to read like 6,000 lines of log files. It's probably not good enough yet, but it's a significant improvement over the previous workflow. And one thing that we probably have to look at is integrating it into a centralized logging system, so when you have your logs aggregated to a central system; this is currently work in progress.
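To illustrate the idea of building one model per log file type, here is a hedged sketch that groups files from two sosreport-style trees by a normalized name and pairs each target group with the matching baseline group; the directory names, the normalization rule and the file filter are hypothetical, and the tool's actual matching logic may be more sophisticated.

```python
# Sketch: group log files from a "good" and a "bad" snapshot by a normalized name,
# then compare each target file only against baseline files of the same group.
# Paths, filters and the normalization rule are hypothetical.
import os
import re
from collections import defaultdict

def file_group(path: str) -> str:
    name = os.path.basename(path)
    name = re.sub(r"\.(gz|xz|bz2)$", "", name)  # drop compression extension
    name = re.sub(r"[.-]?\d+", "", name)        # e.g. messages-20181110 -> messages
    return name

def group_files(root: str) -> dict:
    groups = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for fname in files:
            if fname.endswith((".log", ".txt")) or "log" in dirpath:
                groups[file_group(fname)].append(os.path.join(dirpath, fname))
    return groups

baseline_groups = group_files("sosreport-good/")
target_groups = group_files("sosreport-bad/")
for name, targets in target_groups.items():
    baselines = baseline_groups.get(name)
    if baselines:
        # One model per group, e.g. with the vectorizer/NearestNeighbors sketch
        # shown earlier, trained on `baselines` and scored against `targets`.
        print(f"group {name}: {len(baselines)} baseline file(s), {len(targets)} target file(s)")
```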
Okay, so using the tool manually may become cumbersome for some use cases, such as continuous integration jobs, so we will now see different ways to integrate anomaly detection in a CI workflow. At the end of a run the user is presented with a list of build results, and the goal is to help them understand why some of the jobs failed. CI jobs are a great target for nearest neighbor regression because the build outputs are often deterministic and previous runs can be automatically used as baselines. So we will look into the Zuul CI system architecture, but the same process could apply to other CI systems.

This diagram shows that the user interacts with the Zuul scheduler through a code review system, and the scheduler executes builds through a remote executor service. Builds are executed on test instances that are ephemeral, and at the end of the run the executor retrieves the logs and uploads them to a log server. Finally, Zuul returns the log server URL to the user; those are the links shown in the code review system, and the link points at the log server artifact directory where all the logs and the job output are. But more importantly, Zuul also stores the build information in a database, and this is a key component to make log classify work. Now let's see how log classify can be used in this workflow.

So the first use case is to run the tool on the executor node directly, before the upload logs task. The main advantage of this implementation is that the job does not have to be adapted and the post-run can simply be added to the base job, but the con is that it adds memory and CPU overhead on shared resources. This is an example of a base job integration. The tool comes with Zuul job roles that are ready to be used to report the job's condensed output to the user. The role is basically performing three tasks: first, build a model of the job using the previous build executions; then produce a report of the logs and artifacts that are present on the executor; and finally return the report file URL to the user, so that the build result link does not point at the whole job's artifacts but at the report that has been produced.

Another use case is to deploy a standalone service that can be triggered after the build execution. The trigger could be automatic, perhaps part of the job: if a job fails, it can send a message to trigger an analysis, or this can be requested manually by the user. The main advantage of this use case is that it enables user interaction with the classification process; for example, we can feed back false positives to remove them from the report. It also provides a centralized index of anomalies for the jobs. The disadvantage is that it's an asynchronous process, which means that the build result will still point at the log directories and the user has to wait for the report to be generated.

So here is the status page of Zuul, and if we look at the build interface we can pick a job that failed. The log URL is the build result, which contains the job output, which is usually a long file of not so interesting information, but somewhere around the end there is an error. So instead of doing that, we can use the log classify service. What the service needs as an input is the build UUID as well as the Zuul service API endpoint. The endpoint is used to collect the baselines, so that it can learn what a nominal build is, and then it will use those baselines to extract what was the issue with that build. And then the report looks like this, where the distance is shown as a background color and the big block at the end is the issue of that job.
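To make the "build UUID plus API endpoint" idea more concrete, here is a hedged sketch of collecting baselines from a Zuul-style builds API; the endpoint URL and build UUID are placeholders, and the exact path (for example tenant scoping) and query parameter names can differ between Zuul versions, so treat them as assumptions rather than a definitive client.

```python
# Sketch: given a failed build UUID, find previous successful runs of the same
# job through a Zuul-style builds API and use their logs as baselines.
# The endpoint, the UUID and the query parameters are assumptions for illustration.
import requests

API = "https://zuul.example.com/api"        # hypothetical Zuul API endpoint
BUILD_UUID = "insert-failed-build-uuid-here"  # hypothetical failed build UUID

# Look up the failed build to find which job it belongs to.
failed = requests.get(f"{API}/builds", params={"uuid": BUILD_UUID}).json()[0]

# Previous successful runs of the same job become the baseline.
baselines = requests.get(
    f"{API}/builds",
    params={"job_name": failed["job_name"], "result": "SUCCESS", "limit": 3},
).json()

for build in baselines:
    # Each baseline's log_url points at the artifacts used to train the model.
    print(build["log_url"])
```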
And as I said, it can be used to provide user feedback. So in this case we can remove the false positives from this report and ultimately create an archive of that anomaly, which will contain the job output and all the distances of each log line.

So to conclude: log classify has been created in the context of the Software Factory project. This is an open source development forge, and the screenshot shows many of the components that can be easily deployed on premise or as a service. It includes components such as Gerrit, Zuul and log classify, which is enabled for processing job outputs. To contribute to the log classify project, the best way is to join us on the log classify IRC channel on Freenode. And here is a tentative roadmap. One of the key items is to be able to handle streaming logs, and to do that we need a way to have an adaptive model, perhaps another implementation that does not have the brute force limitation of the current one. We also want to produce a public domain data set of CI job outputs with the anomalous lines annotated; that will enable further research in this field and make it possible to work on a common data set to test new models and find new solutions. So this concludes the presentation. Thanks for your time, and if you have any questions...

That's really cool, first of all. And is there a way to take the model back in time, like analyze this log with the model that you had built up one month ago, even though you were continuously training it for the last three months? You mean like refit a model with new... So coming back to the example which Dirk gave, where the unreadable-sector error was marked as zero and he learned to live with it: had I analyzed that log with the model that was built a month ago, it would have said that this is a new error. Yeah, I think so. Basically what you are asking for, I think, is feedback-based learning. Like you are saying, okay, I noticed that there is an error that I tended to ignore some time ago, and now I want to actually flag this as an error and see when it started and stopped and so on. So I think this is one area of improvement for the web UI that Tristan was just showing: when you annotate a result, you could also say, okay, I actually want to raise the baseline, because now that I look at it, this is one thing that I'm interested in. Currently it's not possible to do that, because the baseline is already pre-built at the training phase and you cannot remove stuff, you can only add things to it, but it's definitely an idea that I was also thinking about. I don't have a solution for that at this point in time.

Thanks, it's really great work. Did I understand correctly that the ordering of log lines is not taken into account now, so the error could appear anywhere? Yes. So the model that is being built is designed to be inaccurate: it's completely ignoring the order of the words in a line as well as their relative relationship to each other. And a log line is taken as an independent line; it's not put into context at this point in time. Do you see any kind of obvious path forward that could add that in the future? Well, right now with this model, we found that's what worked the best, so we use only that, but we could imagine a way where the lines are processed by two models and then we try to mix the results of the two models; but we still need ways to validate a new model, so that's... Could I see the QR code again?
The QR code? I'd like to take your slides. We'll get back to the first slide. So all the slides are online, including everything that we should have said, and the speaker notes if you press S. And if there are no more questions, then I thank you very much for your attendance, and enjoy the rest of the conference. Bye.