Hi, everyone. Wow, that's loud. This is your obligatory machine learning talk. Every conference has to have one. Here it is. I'm Steph Walter. I work at Red Hat, and I wanted to show you how we're applying machine learning to finding bugs. This is a talk about a subject that feels very much in progress. I normally like to talk about things that are done, so you can play along and do stuff. The stuff I'm going to show here you can certainly run on your laptop; that's not the problem. But it's far from complete. So if anyone has interesting places to take this, or wants to get involved or work along, that's the intent: to talk about what we can do better here and how we can make this work. Let's see if the clicker works. Woo!

Okay, so first off, this is the Gospel of Martin Pitt: any sufficiently complex system will have bugs. Bugs are entropy, and entropy is fundamental to the universe we live in. If you don't have bugs in your software, your software is trivial. Really, what we want to do is minimize the bugs that have a significant impact, knowing that there will always be bugs. There will always be entropy; we want it to have less effect. In addition, anyone who's worked with testing to a decent extent will know that false positives, also known as flakes, are everywhere. They're a pain in the butt, and they constantly happen: you open a pull request, and some test fails in some unrelated part of the software, some other subsystem, something like that, and it's super annoying.

So here's some data and some stats from the Cockpit integration tests. Cockpit spools up somewhere near a million, well, maybe more like 800,000, virtual instances per month to test the various changes that come into the project. If you open a pull request for anything in the project, about 2,000 virtual machines will be started to test it. That includes booting Linux, testing some feature, and shutting it down, 2,000 different times. So we can see here, do I have a pointer? I do. Nice. In the last 90 days, this is up-to-date data, this is how many total test runs have happened. These are full integration tests, not unit tests; these are full Linux systems being booted. Of those, we have this many failures in total, which is about 5%, or 4.9%, something like that. The first thing you realize is that most of the time the tests succeed, which means that most of the time those virtual machines and those resources are basically being wasted. Theoretically, if we were to apply software optimization rigor to this, most of the time they're doing something useless, just doing the same thing over and over again. But they do that in order to find these test failures. When we break out those test failures, we find this: there are 42,000 real failures; 54,000 issues that are known or tracked somewhere else, where someone has already said, this is an issue, but it's not part of Cockpit, we can't fix it ourselves, we're going to file a bug somewhere else, track it, and ignore it from here on out; and then this many test flakes. We'll get into how we can tell the difference between test flakes and real failures in a minute.
What's interesting is that when you look at a different tab, at the Cockpit pull requests in the project, Martin Pitt popped up, you can see that very often a pull request will fail due to flakiness. In fact, it's very common that a pull request fails like this one here, this pull request by Zana. Hi, Zana. It fails due to unrelated reasons. The entire pull request gets marked as red due to this one test that failed on one operating system out of all the different ones it's tested on. So about 2,000 or 2,500 test runs get fundamentally collapsed by the CI system into one red. One red propagates up like that. That's why, with that many real failures, 42,000, the project looks very, very red. The test flakes have an effect that goes far beyond their numbers.

If you want to access this data, there's a URL, and it's updated every week. The data inside this file looks something like this. I've expanded it; it's JSONL format, and each line is a different record. We have data about the pull request that we're testing, the revision that was pushed to that pull request, the status of a test, whether it's a failure, where it was tested, in this case Fedora 27, the date at which it was tested, I'll get back to that merged field in a second, the test name, the URL for the log, whether that issue is tracked somewhere else, and the actual contents of the log.

The merged field is interesting. It's a piece of data that tells us whether someone ignored that failure or not, whether this record represents something that got merged into the project. So either a test failed and someone said, I'm going to re-trigger the tests, and then those same commits were merged, or someone just forced the thing to merge anyway and ignored the failure. So we can tell that a person, a human, thought that this failure was unrelated to the revision, and therefore we can guess with pretty good accuracy that anything with a status of failure and merged equal to true was, at least by the team, regarded as a flake, a false positive. Maybe wrong sometimes, but it was regarded as a flake. And this is a source of training data. Once you have 2,400,000 records, and this kind of data about some aspect of human behavior, you can start to train machines to learn from it.

Just before we go any further: Cockpit is not some little web UI that does a few little things. It interacts with all parts of the system. So when we're talking about testing here and finding bugs, we're not just talking about JavaScript bugs. We're talking about bugs all the way from systemd to SELinux, to accessing the Kubernetes API, to accessing /etc/passwd, the entire spectrum of what's happening on Linux. There's another Cockpit talk happening upstairs at 3:45 if you want more info on some of this. So what we're talking about here is finding bugs in Linux in general. When a bug happens in Cockpit, it's very often in some interaction of these components. These components all iterate at different versions, and sometimes people make changes that break someone else, or don't account for an older version of some other component, and so on.
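To make that merged heuristic concrete, here is a minimal sketch of deriving flake labels from records like the ones described above. The field names ("status", "merged") and the file name are illustrative guesses at the schema, not necessarily what the published file uses.

```python
import json

def load_records(path):
    """Yield one dict per line of the JSONL dump."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def is_probable_flake(record):
    """A failure whose commits were merged anyway was treated as a flake by a human."""
    return record.get("status") == "failure" and record.get("merged") is True

# "tests-train.jsonl" is a placeholder name for the weekly dump.
records = list(load_records("tests-train.jsonl"))
flakes = [r for r in records if is_probable_flake(r)]
print(f"{len(flakes)} of {len(records)} records look like human-acknowledged flakes")
```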
So this is a claim I'd like to make, and I'd like to prove. It's very much in progress; I'm making the claim here, but I don't feel I've proven it in a thesis sense yet. The claim is that test flakes, false positives, are by and large, within a margin of error, bugs. False positives, test flakes, are bugs. They are the entropy in your software, the latent bugs hidden in there that are causing these false positives to happen. All of these annoying flakes that plague your project when you test aggressively are really the tests finding those problems and being helpful, and by and large most projects are ignoring them. So we can turn that around. Either we can abuse hundreds of children in a sweatshop and make them read these records and figure out that there are bugs here, or we can use machines to do it.

And these tests are essentially fuzzing Linux. They are running through various races, various loads, various timing issues, different combinations of components, and all these mutations are essentially fuzzing once you're operating at that scale: putting in different inputs, not inputs to the software from the front end or the command line or whatever, but inputs to how it runs. And lastly, these same bugs that show up as false positives in your testing are the same ones that customers and users will run into when using the software, and you will never be able to reproduce them, because they happen once every thousand times or once every 10,000 times you do a particular thing under a specific load or whatever. And yet here they are, happening all the time. By taking advantage of this, we can actually prevent them.

So let's get into the machine learning part. This is how, at present, we try to take advantage of them. We've gone through several iterations of this; this is how it works right now. And for every topic, of course, there's an XKCD, and here's the guy saying: this is your machine learning system? Yep, you pour the data into this big pile of linear algebra and collect the answers out the other side. What if the answers are wrong? Just stir the pile until they start looking right. In many cases, machine learning has a bunch of knobs that you have to keep turning until you get the desired effect. But we can't be bashful about this. This is really what we're trying to do here. We're trying to avoid having to bring in hundreds of people to do this thing, and use machines to do it instead. That's what this kind of machine learning is: we're essentially emulating what a human could do, but at scale.

So this is the basic concept. We want to take all the test failures and cluster them. Some clusters are loose, some clusters are tight, and then there's noise, things that don't cluster very well. There's a little cluster here. And we want to make the claim that most of the test flakes cluster well, that the big clusters are bugs, and that the bigger the cluster, the more important the bug, because the more often it happens. Some bugs are bugs in your tests, obviously, and in the Cockpit tests, many of the bugs are bugs in the Linux stack. Some are in Cockpit, some are in the underlying layers. I'll show you some examples of this later.

So here's how this works. These are the techniques we use. Obviously, you have to pre-process the data a bit and clean it up. Then we use TF-IDF, term frequency-inverse document frequency. We use a very cool, very simple technique called normalized compression distance. We use unsupervised clustering in the form of DBSCAN, and then nearest-neighbor classification. Let's look at each of those.
All of this is implemented, by the way, with the simplest of tools: scikit-learn. If you haven't gotten into machine learning or any of these tools yet, that's a good place to start. This is nothing fancy that needs to run on spectacular hardware.

Much of the clustering currently revolves around the log input, the actual raw contents of each log. Let's look at a pull request, the example we were looking at here. It's on this pull request, and we look at the details of that, loading. You'll see that each of these things here is this kind of log output, just broken out. There are successful ones, there are failed ones, and this is exactly what we're clustering here. You take the raw output and pass it through some basic pre-processing that does some pretty generic stuff. It replaces any number with zeros, replaces GUIDs with Xs, and it also handles IP addresses, although I don't think there are any here. You'll notice it knows how to recognize a path and collapse it: it takes these file names here and collapses them down to their extensions. Really basic things that we do naturally in our heads when we look at these logs. If you were looking at this day in and day out, as some people unfortunately have had to do, you ignore these things. Most of the numbers are literally noise. For most of the file paths, you only look at the last part. And the GUIDs are just pure noise. So this is basically emulating what you do when you look at this stuff.

Then the next step is TF-IDF. What TF-IDF does is look at each term, in this case we treat each line as its own term, take the entire body of data that we're training from, those 2.4 million records, and figure out which lines are found across too many of them, say 30%, to be interesting information about this particular test. If we look back, we see that lines like this here happen very often, lines like "error timeout" or this bullshit down here. It's nice status output, but you really want to get rid of it. When you're actually looking for useful information, and you're doing this actively as part of your job, you just ignore that stuff. TF-IDF lets us do that really effectively. There's a knob to tweak for the percentage: if a line shows up in more than 30% of the logs, I think that's where we are right now, just trash it, just ignore it. There's another threshold for lines that show up in over a certain percentage of failures; trash those too, and what's left comes down to the actual data.

This is a really good example. Of course I chose a good example for this, and they're not all this beautiful. Sometimes they're longer and there's junk in there. And you can see one junk line didn't make it out of this one, I don't know why. But other than that, these lines, one, two, three, really are the data lines of that entire log, and it figured this out by itself. You didn't have to go and code in the fact that I like lines that say wait present, or test team, or any of that. It figured that part out itself.
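Here is a rough sketch of what that pre-processing and frequency filtering could look like. The regexes, the placeholder tokens, and the 30% threshold are illustrative, and this uses plain document-frequency counting rather than scikit-learn's full TF-IDF weighting; the real pipeline in the cockpit project differs in detail (for instance, it collapses file names down to their extensions).

```python
import re
from collections import Counter

GUID_RE = r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"

def normalize(log):
    """Collapse the noise a human ignores: GUIDs, IP addresses, paths, numbers."""
    log = re.sub(GUID_RE, "xxxx", log)                              # GUIDs become Xs
    log = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "x.x.x.x", log)    # IP addresses
    log = re.sub(r"(?:/[\w.+-]+)+/", "/", log)                      # drop leading directories
    log = re.sub(r"\d+", "000", log)                                # any remaining number
    return log

def boring_lines(logs, threshold=0.3):
    """Lines that appear in more than `threshold` of all logs carry no signal."""
    df = Counter()
    for log in logs:
        df.update(set(log.splitlines()))
    return {line for line, count in df.items() if count / len(logs) > threshold}

def extract_signal(log, boring):
    """Drop the over-common lines; what remains is the informative part of the log."""
    return "\n".join(line for line in log.splitlines() if line not in boring)
```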
So what happens next? Now we have what we think is, reasonably, the information content of the test log. We pass it into a distance algorithm called normalized compression distance, or NCD. What this does is figure out how similar two things, two pieces of data, are. This is a beautiful algorithm. Maybe I'm easily impressed, but it's really impressive to me how simple it is and how well it works. This works on images, music, logs, text, prose, whatever you want: any kind of data, and it tells you how similar two pieces are. Z is a simple function that takes any compression algorithm, passes the data in, and just takes the length of the result; you take the length of the compressed data. Using this basic capability, you take the length of X concatenated with Y, the two things you're trying to compare, the two test logs: one log is X, one log is Y. You concatenate them and get the compressed length. You do the same for X and Y individually, pass them all into this basic normalization, and you get a number between zero and one that indicates how similar the two things are. If you have A, B, and C, or X, Y, and Z, you can tell that X and Y are closer than X and Z. Actually, Z is a bad example because Z is already taken by the compressor, but you know what I mean. You can see that these are similar, these are not similar. This is the kind of input that you can feed into an unsupervised clustering algorithm.

And the only caveat is... well, there are two. One caveat is that compression is hard to optimize on GPUs or specialized hardware. We can scale this pretty well, because we're comparing across tons and tons of logs to get basically a matrix of how similar things are, so it does scale out, but the individual comparisons are really hard to speed up beyond just running them on a powerful CPU. That's the one limitation, and probably why you don't see NCD used more often. If someone comes up with a nice compression algorithm that works in an inherently parallelizable way, which may be fundamentally impossible, then NCD is the kind of thing you would see all over the place. The second thing is that you have to, of course, pass in uncompressed data. This is a little rat hole, but if you're using it for something else, decompress the data first. Obviously, don't pass two MP3s into this and expect it to work. But with the uncompressed audio data, or text data, or whatever you want, this will actually do a really good job of telling you whether two things are similar. It will literally find similar beats, like if a song has a similar beat in it or the same drums. Anyway, let's move on.

So the next thing we do is unsupervised clustering. The last thing you want to have to do is constantly go in and tell it, oh, put your networking failures over here and your virt failures over there, and define the clusters yourself. Defining clusters yourself is kind of a hallmark of neural networks and some of these other approaches; that's why we don't use them here, because you really want it to figure out on its own that there's a cluster here, I don't know what it is, but there's a cluster and the things in it are similar. And DBSCAN is just your very basic, entry-level algorithm for doing this. I think it would be nice to move past DBSCAN, because there are ways in which it's inefficient: it tries to calculate way too much of the matrix of distances between everything, and there are things like hierarchical DBSCAN and other techniques, but we haven't had time to use them yet. Right now we just burn CPU and do this.
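Here is a minimal sketch of NCD and the clustering step, assuming the pre-processed logs are already encoded as bytes. The eps and min_samples values are placeholders to tune, not the project's actual settings.

```python
import zlib
import numpy as np
from sklearn.cluster import DBSCAN

def Z(data):
    """Length of the compressed data, a stand-in for its information content."""
    return len(zlib.compress(data))

def ncd(x, y):
    """Normalized compression distance: near 0 for near-identical inputs, near 1 for unrelated ones."""
    zx, zy = Z(x), Z(y)
    return (Z(x + y) - min(zx, zy)) / max(zx, zy)

def cluster_logs(logs):
    """Pairwise NCD matrix fed into DBSCAN; -1 in the result marks noise."""
    n = len(logs)
    distances = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            distances[i, j] = distances[j, i] = ncd(logs[i], logs[j])
    return DBSCAN(eps=0.3, min_samples=5, metric="precomputed").fit_predict(distances)
```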
One aspect of this clustering algorithm is that it has a notion of noise. Similar to what we wanted in the first picture when we started, we want to be able to tell that there are outliers, and we don't want to put them in clusters. We don't need to force everything into a cluster; we want to ignore that noise data. I'd like to show that the noise data is the genuinely intermittent failures, the real false positives: someone pressed a button, some cable monkey jerked a cord in the data center, something happened that's completely outside of finding bugs. As it stands, some of those get clustered, because some of those things happen regularly, like disks filling up, but I think with enough tweaking and fixing of the actual test bugs we should arrive at a point where the noise really is stuff we can ignore, and the clusters are the bugs.

So that's all part of the training. You pass the data through there, you run the training, and you get a bunch of clusters. I can actually show you them here. I downloaded one set. There's a whole bunch of clusters, and each of them has the logs of all the things, all the collapsed data and so on. And for each of the clusters we can ask questions, such as, you can see it's cut off here, such as: how many items are in this cluster? What percentage of them were merged? Are any of them tracked? How many different tests fail this way? Oh look, this one only failed on one particular operating system, that kind of stuff. Then when a new test failure comes in, after the training has all happened, we ask: given the clusters that we know about, which one does this best fit into? We use k-nearest neighbors, which again is a very simple algorithm. We calculate the distances to these clusters and see where it fits. So we do all of that with the test logs.

It's all great and fancy to implement stuff like that, but making it actually do something in the real world is, of course, always a challenge. So what are we doing with this? Well, the first thing you obviously do is auto-retry flaky tests. You can see here that this particular test failed, and it was retried due to flakiness. The flakiness score was high enough that the system figured out, hey, this was ignored so many times by people, let's just retry it. And that is self-reinforcing learning: it retries, learns that the retry succeeded and the commit got merged, and so it produces better data for the next time the training happens. We also annotate the test logs so that people reading them can say, hey, this is probably a flake, or it's not, and someone can configure the system and tweak the different knobs to get better results.
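Here is a hedged sketch of the classification and retry decision, simplified to a single nearest neighbour rather than full k-NN, and reusing the ncd() function from the sketch above. The distance cutoff and the merged-fraction threshold are illustrative assumptions, not the system's actual values.

```python
def classify(new_log, logs, labels, max_distance=0.4):
    """Assign a fresh failure to the cluster of its nearest known neighbour, or None."""
    best_label, best_distance = None, 1.0
    for log, label in zip(logs, labels):
        if label == -1:                    # skip points DBSCAN marked as noise
            continue
        d = ncd(new_log, log)
        if d < best_distance:
            best_label, best_distance = label, d
    return best_label if best_distance <= max_distance else None

def should_retry(cluster_records, threshold=0.8):
    """Auto-retry if humans historically merged past this kind of failure most of the time."""
    if not cluster_records:
        return False
    merged = sum(1 for r in cluster_records if r.get("merged"))
    return merged / len(cluster_records) >= threshold
```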
And then we actually find real bugs. This is the interesting part. These are real bugs, in this case a bug in Cockpit, and there are pull requests that got merged showing that we were accessing various APIs in a completely racy way, and the tests were finding it. We ignored it for a very long while. It turns out the root cause was that we were executing useradd and the group commands, various commands to modify user info on the system, in a racy way; we should have serialized them. This is the kind of bug that a user would have found, and then we would have been like, well, it sucks to be you, we can't reproduce that. But the tests found it, and when we actually listened to those flakes, it was an interesting result.

Here's what's brewing right now. This particular cluster is very small, I have the number here, four. It's only happened four times. But four times in the real world, SELinux has screwed up with networkd, very recently, in the last two weeks, in a way that prevents the system from booting at all. It just does not boot. If we want to look at this, I don't have much time, but we have all the proof for it; it's not theoretical. There's another one where we were updating React state wrong. And here is one where we figured out that PackageKit was crashing intermittently. These are all recent things, but this has gone on for quite a while. There were LVM crashes. There were tons of different SELinux issues that happened intermittently during boot, with races and all sorts of stuff. This stuff comes out regularly, and you start to find these bugs. I don't have time to look at these and go into the details. Here are the various commands; it's very easy to run this stuff. You can download the data, do the machine learning, and actually put tests through there and get results out. And this is part of the Cockpit project.

One minute, so I've got to skip ahead. What's next? Tomorrow I want to work on containerizing this, putting it in a container. I have some code done for this. Essentially, the part that does the machine learning is generic enough to containerize so that other projects can use it. Eventually, at some point, we want to file bugs automatically. Once you have a big enough cluster, you can basically file a bug and say, hey, here's a bunch of stuff, here are all the core dumps, here are all the logs; this happens often enough that you really need to pay attention, it's the first or second or third most frequent flake, and the machine learning believes it's not a flake at all. We want to produce a set of non-flaky tests so that other projects can use the Cockpit tests and say, we know that these never fail wrongly; they're solid tests. And we want to track known issues automatically: start to say, well, this is specific to RHEL or Ubuntu or whatever, there must be a known issue in that operating system if it doesn't happen across all the others, so we should prompt people to file an actual bug there.

All right, and I'd like to close with a ludicrous claim: any testing system that does not have enough flakes should introduce more tests, or more mutations in their tests, so that they get more flakes, because otherwise your tests are not finding enough bugs. Of course, this sounds backwards, but once you figure out that the flakes are bugs, and you see that most of them are bugs, this is the conclusion you arrive at. All right, that's it. We have some time after the slot in this room, so can we have questions? Okay, that's fair, let's talk after. He would like us to leave; the next talk is starting in like 15 minutes, so we'll talk in the back. I'd love to hack on this together and work out the details. But I'd like to take questions; that's always a fun part of the talk.
So you mentioned that zlib is slow, well, not slow, but hard to parallelize. Have you thought about using one of those compression algorithms that are faster than memory access, like BLOSC? That would be cool. Which one? BLOSC. B-L-O-S-C. No, but we probably should; we should try that. Yeah, it's this idea that if you have compressed data on disk, you can read the compressed data faster, and if you decompress it fast enough on the CPU, the whole process can be faster than reading uncompressed data. So you need very fast decompression, and compression, algorithms for that. Interesting. And are they hardware accelerated, or are they just inherently fast? So, BLOSC, we should look at putting that in. It might not compress as well, right? That's fine, there's a trade-off there, and as long as it still has the same effect, that's cool. We chose zlib because it was really fast, and we use the compression level parameter, nine, or I forget which way it goes, zero or nine, to make it as fast as possible. But this might be faster, so that's interesting. Any other questions? Cool. Well, thank you very much.
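As a footnote to that exchange, swapping the compressor behind the earlier NCD sketch might look like this. lz4 is used purely as an illustration of a fast compressor in the spirit of BLOSC, and whether the resulting distances still cluster as well would have to be measured.

```python
import zlib
import lz4.frame  # python-lz4, one example of a very fast compressor

def Z_zlib_fast(data):
    """zlib at its fastest compression level (1)."""
    return len(zlib.compress(data, 1))

def Z_lz4(data):
    """lz4 trades compression ratio for speed; NCD quality with it needs measuring."""
    return len(lz4.frame.compress(data))
```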