So I want to talk to you today about how we've applied machine learning in our open source software projects, for a specific purpose: finding bugs. We came around to this in a roundabout way, so I hope you'll follow along. My goal today is not necessarily to convince you to use this, although you're welcome to; anyone here can use it, it's simple. It's to help you figure out how to think about this, how to use these techniques in your own projects, in places where they might be applicable. And Rado is doing just that: Rado recently contributed to this and started hacking on using some of these same concepts, and some of the same code, for the kernel. So I'm going to go through a case where we used it for Cockpit, and then I'll hand it over to Rado to explain how he's starting to use it for the kernel.

Take a step back. Martin Pitt has made this claim: any sufficiently complex system will have bugs. Bugs are entropy, and entropy is fundamental to this universe. Your software has bugs; get over it. The question is how bad they are and how easily you find them. In addition, I will make this claim: any sufficiently complete testing system will be plagued by test flakes. You will not be able to get rid of them.

Here's an example: the Cockpit integration tests. This is 90 days of data; actually, this is old data, so I apologize for that, I think this was from last year. Ninety days of data, running two and a half million integration tests: full integration tests, booting Linux, or booting multiple systems and having them talk to each other, then shutting everything down and running the next one, and so on. As you can see when we look at this data, we have 122 thousand failures. We track known issues, bugs in other people's code (everyone's code is shit except for your own, right?), and that accounts for 54,000. And when we look at the actual data, and I'll show you how we do this, we find that of the remaining failures, there are 25,000 test flakes. Which is a lot.

Why is this a lot? Because CI systems and testing systems typically go red on a single failure. If you open a pull request against Cockpit, something like 2,000 VMs will start up: 2,000 integration tests and 3,000 unit tests, lots of tests run. One failure means it goes red. So test flakes, even though they seem like a small percentage of the total, have a big impact on the project. So 25,000 test flakes, that's a decent amount. And this is the situation: any sufficiently complete testing system is plagued by test flakes.

So when you have this much data, what do you do? You apply machine learning. You pour the data into this big pile of linear algebra and collect your answers out the other side. If the answers are wrong, you stir harder. I'm going to show you how to do that stirring. But more to the point, here's what I'd like to say.
First, let's talk about how we make this data. We can tell whether someone thought a certain test failure was a test flake, a false positive, by whether that commit was merged into the master branch of the project or not. Let's say I open a pull request and I see test failures. If they're related to my change, I will typically make a change to the pull request and push another commit onto it, either a force push or an added fixup, and then that will get merged. If I believe the failures are a flake, the behavior will typically be that I either retrigger those tests, get it to green, and then push it, so that exactly the same commit that failed gets merged; or it just gets pushed anyway: someone ignores the results, waives them, and says go for it. So we can tell, by whether a certain commit, that revision up there on the second line, was merged or not, whether a human thought it was a flake or not. They're not always perfectly right, but we can use this as a source of data.

Just to give an idea of the scope of this problem: Cockpit is not some little web interface that does happy little JavaScript things. It talks to over 90 different parts of Linux, each of which is moving forward at its own pace, on its own release schedule, and across all sorts of different distributions. We try to make this work across all of them; Cockpit is talking directly to every one of these systems. So you can imagine the number of possible test flakes that could happen in this scenario.

So I would like to make the claim that test flakes are bugs. The bugs that Martin was talking about, the test flakes are those bugs. They are representations of that entropy in your software. The bugs that must exist show up as test flakes. We typically treat them as some kind of annoyance that we have to get rid of, make the flaking stop, when in fact they are showing us where the bugs in the software are. Not necessarily bugs that happen every time, but bugs that happen under load, bugs that happen under bad timing, bugs that happen once every ten thousand runs, that you will never, ever be able to reproduce when a user or a customer reports them to you. So here we have this treasure trove of data; how can we use it? These tests are fuzzing Linux. Much like a typical fuzzer will go in and change the inputs, mutating them to see if there's a bug related to input data, these tests are essentially fuzzing Linux from a different angle: timing, races, how close to boot, how much entropy is on the system already, all sorts of different variables.

So how do we apply machine learning to this? We went through various techniques, but eventually we settled on unsupervised clustering, where we basically ask the machine learning to cluster these failures in such a way that related ones come together into a cluster, and things that have no relation, the one-offs, are the noise behind the scenes; they're not in any cluster. You can kind of see a model here; this is just a random graphic, I'll show you a real one later. That's at least the theory: the bigger a cluster is, the more certain we are that that cluster is a bug, whether a bug in the tests or a bug in the software.
I mean, that's really the same thing; tests are software. And the ones that sit in the background are what we usually associate with false positives and flakes: a network outage, or running on a particular system that is actually broken, something unrelated to the software itself.

So how do we go about this? How do we cluster these? Here are the techniques that we use. Of course, we do some pre-processing on the data; that's typical. We use TF-IDF, term frequency-inverse document frequency; I'll show you how we use that effectively here. We use normalized compression distance to figure out how similar two things are; this is an amazing technique that's so simple. We then use DBSCAN for unsupervised clustering, plus multi-dimensional scaling. And lastly, to figure out how close a new result is to any of the existing clusters, we use k-nearest-neighbors classification.

So what does the raw input look like? The raw input is just garbage logs that come out of the tests. After you're familiar with a project, suddenly this starts to make sense to you; right now this probably looks a lot like line noise to most people. You can kind of tell there's a stack trace in there, but there's a bunch of useless data. What we typically see is that as you become more familiar with looking at these flakes, you ignore certain lines, just as any of us ignore certain lines. For example, me, having worked on the project for a long time, I would ignore these lines at the bottom (can I point to things?). I would ignore these lines here, because I know that after a test failure happens, these lines just happen by themselves; they dump out data. Yep, Adam? Could you speak up? I can't hear. Right, good. So someone who's familiar with Linux knows exactly what happened here; Adam says this is waiting for a network interface to appear, and it didn't appear.

So let's talk about this, though. This is what would seem like a test flake: a network interface that's not in the right place at the right time. However, it turns out this is a bug. This is a bug in Cockpit: if you ran Cockpit at the right time and your network wasn't showing up, it would crash, it would throw an exception. So it really is a bug, even though at first glance it looks like a networking outage or something like that. Once we did this clustering, we found out that this was a bug. So let's go through this example a little further.

What do we do next? The first thing we do is a bit of sanitization on the data. This is always a dangerous topic, because it's easy to micromanage your data and start to kill the source of your data, but this is the amount that we found actually works. It's easy to match with regular expressions: go through and say, all UUIDs, replace them with zeros; I don't care about those. When we look at data, we see UUIDs as exactly the same; we ignore that they're different when we're looking for this kind of thing. All numbers: gone. That might seem like a bold statement, but it actually worked for us; numbers are all replaced with zeros. File paths are shortened down to their final path component; in the stack traces you can see that. So there's a lot of stuff that's changed here, a lot of things that have been stripped down. These are generic transformations that are not specific to this particular stack trace; you could apply them to other data as well. They're relatively safe, and we found them to work here.
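As a rough illustration, a sanitization pass like the one described might look like this in Python; the regular expressions here are my own approximations, not the project's exact rules:

```python
import re

def sanitize(log: str) -> str:
    """Normalize a raw test log so incidental differences don't matter.

    Approximations of the transformations described in the talk:
    UUIDs and numbers become zeros, file paths collapse to their
    final component.
    """
    # Replace UUIDs (8-4-4-4-12 hex groups) with zeros.
    log = re.sub(
        r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
        r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}",
        "00000000-0000-0000-0000-000000000000",
        log,
    )
    # Collapse file paths to their last component: /usr/lib/foo.py -> foo.py
    log = re.sub(r"(/[\w.\-]+)+/([\w.\-]+)", r"\2", log)
    # Replace all remaining digit runs with zero.
    log = re.sub(r"\d+", "0", log)
    return log

print(sanitize("Traceback at /usr/share/cockpit/base1/test.js:142"))
# -> "Traceback at test.js:0"
```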
So what's next? After you do this, the data has cleaned itself up a bit, but it hasn't gotten very far. We apply term frequency-inverse document frequency. Typically, this is a simple technique in machine learning or data processing where you look at how often a certain term occurs across all of the documents, in this case test logs, and if it occurs across a large percentage of them, you ignore it; it's not unique data, we don't care about it. We're actually doing it more like line frequency-inverse document frequency: we treat a whole line as a term. So first we normalize the lines, zeroing out the numbers and the paths and the UUIDs and all that kind of stuff, and then we ask: which lines have shown up across too many tests to be interesting? These are the lines that you would typically ignore as a developer on the project. For example, if you see a log line like "Warning: added so-and-so IP address to known hosts file" in your output, you literally skip over it while you're reading; it's just noise to you. This is a machine doing a similar thing.

I think I might have turned the knob up a bit here on how much of the data it dropped, but I wanted to show you the effect; normally about twice that amount comes through. Basically, you can fine-tune this knob to say how much resolution you want in what data shows up. It's a setting you can change: across what percentage of the documents does a certain term, or a certain line, have to appear before you just drop it? And so we come down to a piece of useful information here.
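A minimal sketch of that line-frequency filter, assuming a hand-rolled document-frequency count (the real project presumably tunes this differently):

```python
from collections import Counter

def filter_common_lines(logs: list[str], max_df: float = 0.5) -> list[str]:
    """Drop lines that appear in more than max_df of all test logs.

    Treats each whole (already sanitized) line as a "term", as in the
    talk; max_df is the tunable knob mentioned above.
    """
    # In how many logs does each distinct line appear?
    doc_freq = Counter()
    for log in logs:
        for line in set(log.splitlines()):
            doc_freq[line] += 1

    threshold = max_df * len(logs)
    filtered = []
    for log in logs:
        kept = [l for l in log.splitlines() if doc_freq[l] <= threshold]
        filtered.append("\n".join(kept))
    return filtered
```

Turning `max_df` down drops more lines and lowers the resolution; turning it up keeps more of the log.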
So now, and maybe I'm easily impressed, but I really love normalized compression distance. This is a technique that allows you to take two pieces of data, any pieces of data, and ask: are they similar, are they dissimilar, and how similar are they in comparison to each other? It just uses compression. Z here is a compression algorithm, zlib, LZ4, anything, and Z(x) is simply the length of the compressed input, just the length. You take the length of x compressed, the length of y compressed, and the length of x concatenated with y compressed, and you pass them through this very simple little formula:

NCD(x, y) = (Z(xy) - min(Z(x), Z(y))) / max(Z(x), Z(y))

and you get a number between zero and one that tells you whether two things are similar or not. This works with any data that's not already compressed. You can do this with two songs and figure out whether they have similar beats; you can do this with two images and figure out whether they have similar areas of pattern or color. Compression works for this because similarity is the fundamental theory that drives compression. The sad part about this particular technique is that it doesn't work on GPUs, because compression is not parallelizable, so it's expensive. But it works beautifully, and it turns out we have more than enough CPU power in the world for this.

So we get a number between any two test logs: whether they're near or far. We make no claim about where they are or what cluster they're in; we can just ask, about any two test logs, how similar are they? Using that distance metric, we can use a very basic algorithm. All the techniques I'm going through here are some of the most basic machine learning techniques; if you load up scikit-learn and try out machine learning, these are the first things you touch, and you can use them to real effect. DBSCAN is your basic, fundamental, simple unsupervised clustering algorithm. It enumerates through the different nodes and asks: which ones are close, within a certain distance, to this one? It starts joining them into a cluster, and it works out that even though it doesn't know where these things are in space, the fact that they're all related to each other brings them together. One of the cool things about DBSCAN is that it also has noise. Many clustering algorithms try to force everything into some cluster, whereas DBSCAN has a noise concept, and that fits our theory very well: the theory that certain things will not cluster, they're just random failures. So the noise shows up as its own kind of outliers.

Then, because we wanted to make this approachable by people: all of that is enough to drive the use cases I'm going to show you in a second, but we also want people to be able to comprehend this. We are very poor at comprehending the idea that there's a bunch of stuff in space, and somehow we don't know where each thing is, but we know the distance between any two of them; your head explodes. So what Rado brought to this, a fundamental, simple little technique, is that if you treat every single one of those distances as its own dimension, you can use multi-dimensional scaling to make an image out of it, to make it approachable in two or three dimensions for our unfortunately simple brains. The reason for that is that we want people to be able to hack on this, to use it, to ask questions about it, to say: hey, how can I make this better? It's not working; what happens when I twist the knobs? And then lastly, when we have a new test log coming in after we've done all this training, we can pass it through the same pre-processing, the same TF-IDF, and then ask: which of the clusters that were found is it most similar to? So now we know all the techniques that are in play and how we do this.
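To make that concrete, here is a minimal end-to-end sketch of the pipeline just described: NCD with zlib, DBSCAN over the precomputed distances, MDS for the picture, and k-nearest-neighbors for new logs. The parameter values (eps, min_samples, n_neighbors) and the tiny placeholder logs are illustrative guesses, not the project's real settings:

```python
import zlib
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.manifold import MDS
from sklearn.neighbors import KNeighborsClassifier

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    zx = len(zlib.compress(x))
    zy = len(zlib.compress(y))
    zxy = len(zlib.compress(x + y))
    return (zxy - min(zx, zy)) / max(zx, zy)

# Placeholder stand-ins for sanitized, TF-IDF-filtered test logs;
# in reality there are thousands of these.
logs = [
    b"timeout waiting for interface to appear\nassert ok == False",
    b"timeout waiting for interface to appear\nassert ok == False!",
    b"timeout waiting for interface to appear\nassert ok == True",
    b"Traceback: KeyError in accounts page render loop",
    b"Traceback: KeyError in accounts page render loop again",
    b"random unrelated one-off failure about disk space",
]

# Pairwise distance matrix: O(n^2) compressions, CPU-bound.
n = len(logs)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = ncd(logs[i], logs[j])

# Unsupervised clustering on the precomputed distances.
# Label -1 is DBSCAN's "noise": failures that belong to no cluster.
labels = DBSCAN(eps=0.4, min_samples=2, metric="precomputed").fit_predict(dist)

# Project the distances into 2D for a human-comprehensible picture.
coords = MDS(n_components=2, dissimilarity="precomputed").fit_transform(dist)

# Classify new logs by k-nearest neighbors over precomputed distances,
# training only on points that landed in a real cluster (not noise).
core = labels != -1
knn = KNeighborsClassifier(n_neighbors=3, metric="precomputed")
knn.fit(dist[np.ix_(core, core)], labels[core])

def predict(new_log: bytes) -> int:
    """Return the existing cluster a new failure log is most similar to."""
    row = np.array([[ncd(new_log, log)
                     for log, keep in zip(logs, core) if keep]])
    return int(knn.predict(row)[0])
```

Note the O(n^2) pairwise compression cost; as the talk says, this is the part that burns CPU rather than GPU.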
Let's make this do something. One of the first things we did was to take this and retry tests that we think were flaky: if there was a pretty good chance, according to the machine learning, that the test was flaky, we'd retry it. What this does is also help reinforce the machine learning: if it retries and passes, then it's even more likely that it's a flake, and so on; it builds up the source of data. This has been running for about a year now, retrying flaky tests and allowing them to be handled separately, so they don't interrupt the workflow so much. For new flaky tests, remember, this is a sliding window of about 90 days of data that's constantly moving; the project is constantly changing, and the kinds of test logs that are failing are constantly moving. So obviously humans continue to train this; they continue to have their default behavior of saying "this failure is not related to my change, and I'm going to prove it" or "no, this is different, I didn't even touch that part of the code", and then either rerunning the tests to get them green or merging the code anyway. That's another source of data that comes back into the model.

Obviously we annotate test logs with how likely it is that the failure is a flake (this is an old screenshot, actually). And then, it's a shame this one is old too: when it is flagged as a flake, we actually ask the person reading this to file a bug if it's not a flake, because that's a bad thing. We're trying to interact with people and make sure they understand what's going on. Wow, what happened there? Is that a bug in the slides? That's a flaky slide, goodness gracious. Okay, hold on.

Obviously each test has a name, and it's trivial to figure out, using this model, which tests have always been solid, which ones are flaky, and so on. We started to use that to good effect in the last month or two: Marius made a change to the bots that use the machine learning, so that they start to file bugs for the biggest clusters. The idea is: yes, this is flaky, but it's one of the ten biggest clusters, so it gets a bug, and people go and investigate. This is one that was recently filed, and it gives you the information; I think we could get more information in here, actually. If you click through to the source material, there's much more information about the flake itself, so we can actually start to investigate these, percolate these things up and say: this is almost certainly a bug, either in the tests or in the software, and we need to investigate it, we need to fix it.

So let's look at this. This is running in Kubernetes, in our OpenShift cluster, and there's a standard way to post data and predict things; I'll show you the README for that in a second. I actually have this open over here. It dumps out all of the various clusters that are found, the log for the machine learning, all of that stuff, and various statistics in this file. Here's the actual machine learning model; here's the data that was used as input. You can see it's a decent amount of data, even compressed. So we can go in and investigate this, play with it, and ask questions about it rather easily. One of the things in that directory is a visualization; every time this runs, there's a visualization output, and this one is actually very recent.
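As an aside, a picture like the one described can be produced from the MDS coordinates in a few lines. A sketch with matplotlib, using dummy coordinates and labels standing in for the real model's output:

```python
import numpy as np
import matplotlib.pyplot as plt

# `coords` (n x 2, from MDS) and `labels` (from DBSCAN) as in the
# earlier pipeline sketch; dummy values here so the snippet runs alone.
coords = np.array([[0.1, 0.2], [0.15, 0.22], [0.8, 0.9],
                   [0.82, 0.88], [0.4, 0.6]])
labels = np.array([0, 0, 1, 1, -1])

# Noise (label -1) is drawn in gray; each cluster gets its own color.
noise = labels == -1
plt.scatter(coords[noise, 0], coords[noise, 1], c="lightgray", s=10)
plt.scatter(coords[~noise, 0], coords[~noise, 1], c=labels[~noise],
            cmap="tab10", s=25)
plt.title("Test failures: NCD distances projected to 2D via MDS")
plt.show()
```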
So that's kind of cool. This visualization exists for no other reason than to allow people to say: wait a second, what happens if I change this? Or: that doesn't look right. And to be honest, right now it doesn't entirely look right, well, a lot of it does look right. Keep in mind that this is like the stars in the sky: two stars that look close to each other may not actually be close to each other, because we're looking at it in two dimensions. It would be nice to look at this in three dimensions, of course, and add a little more fidelity, and we can do that with D3 or other techniques, but I only get to hack on this during the holidays, so wait for the next holiday, or maybe I won't get to it. But nevertheless, we see some of the clusters. We see this big cluster here of failures that are all related; a single color here means related. We see tons of other little clusters that have come together. And we start to be able to ask questions such as: what the hell is going on here? Something is wrong. It's possible that this is noise scattered across all these different dimensions that somehow is being compressed into one spot, but at least we can ask the question and figure out: wait a second, why is all this noise not being clustered? Is it really far apart in distance, or is there a bug somewhere? This allows us to start to interact with this and figure it out.

Goodness gracious. Wow. I think I can do this. Okay, make it do something. So here's an example of a couple of bugs that were found thanks to this. The one that was actually really interesting was this one here. For a while in Fedora 29, just randomly, every thousand boots or so, there was a race between the network, DNS, and SELinux, and the system literally would not boot; it just stopped the boot and went into emergency mode. You might think of that as a flake: ah, our shitty testing system again, just falling over, what the hell. But no, that was a real bug that someone would have hit on their laptop or on their server, and it would have left them with a system that literally does not boot. And when you can't access that system, it's a painful thing to have a non-booting system. So that was one that was found, and I thought that was pretty cool.

There's another one here, which is a little more embarrassing. For about two years, maybe even three, before we did this, there was this stupid test flake that kept coming up in Cockpit as we were developing it, and it was the most annoying thing, because every once in a while this whole user accounts page would, for no reason, end up in these strange states. And we were always like: these tests are such crap, you know, let's fix the tests; and we were constantly trying to fix the tests in different ways. Once this was implemented, we looked at it and went: my god, that's a lot of related tests, a lot of related failures together. It turned out there was a real problem in Cockpit, a real problem with races when invoking tools, invoking the shadow-utils useradd, reading the password file, and so on, too fast. When you did it too fast, or the connection was slow enough that the results didn't come back in time, they raced with each other. There were significant bugs there. And Zana fixed that page.
I think it's mostly fixed now. But it shows that when you're using machine learning, things that are not intuitive to you often come to the fore, and you have to listen a little bit. You can't just force your opinion onto these techniques; you have to listen to what's coming back, or else you keep a mindset that is sometimes hard to change. There are several other bugs that were found; there are lots of these, these are just a few examples: PackageKit crashing, or UI states being invalid. And these keep coming. I believe there are many more that we can find here as we scale up the amount of testing that we do; the more data we have, the more of these we will find. You need to first scale your testing up to a significant level before this starts to make sense.

So what if you wanted to do this? How could you use it? Well, it's in a GitHub repository, and there's a README file for how to hack on it, how to work with it; that link leads there. It tells you how to prepare the data, how to format it and put it in the right form, that's always the important part, and then you can use these tools to train things. I'll take you through some of this on the next slides, but there's a README with all that information. So: you clone the repo, you pass the data into this training script, and it trains the model, tells you how many clusters you have, and outputs all the stuff you saw: the image, the various files, the clusters, and so on. Then you can use this command here to predict.

The prediction is interesting, because we try to train on very little; we try not to over-micromanage the training. We really just train on whether two test logs are similar. We could probably add a couple more variables there, but we didn't want to add too many; we prefer to ask questions about the clusters that come out, rather than forcing them to be segregated by various properties. So, for example, we pass this data in, which is in the same format as the training data, and here are the results. It says: this most likely matches failures that occurred in these two different tests, five times in this one, two times in that one. This is a simple example: there were seven failures, and they happened across these two, oh, there's obviously a typo in my slides, but across these two operating systems; let's pretend this says Fedora 29 or something. Oftentimes this is across all Linuxes, or across one very specific one, like RHEL 7 or something like that. Here are the dates between which it happened. Here's whether it was merged: whether the commits in this cluster were typically merged, in other words they were false positives, or not merged, in other words they're likely to be true positives, I mean, you know, not flakes; and null is unknown, we have no idea what happened there. And then also: we track known issues. We track different test failures and categorize them: hey, that's a bug over there, we're going to track it, we're going to associate it with its Bugzilla entry or its issue tracker. So this source of data allows us to say: how likely is it that this is related to another known bug? And if we scroll, well, I don't know how scrolling works here, but that's actually a known bug, and it tells you what percentage of this cluster matches it.

You can obviously use this in Kubernetes; it's all containerized, so you push it to your Kubernetes project and it will go and walk the data for you. Then there's a way to post data into it, to train it, using a standard PUT or POST request; and there's a way, again with a POST, to send your data in to be predicted, and you get back output similar to the previous page. You get that output back via an HTTP request, so it's really easy to integrate into your project.
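To illustrate how that integration might look, here is a hedged sketch using Python's requests library; the host and endpoint paths are hypothetical stand-ins, since the talk only says the service takes standard PUT or POST requests (the real interface is documented in the project's README):

```python
import requests

# Hypothetical service URL and endpoint names; see the project's
# README for the real paths and data format.
BASE = "http://learn.example.com:8080"

# Train: upload test logs in the same format the training script takes.
with open("train-data.jsonl", "rb") as f:
    requests.post(f"{BASE}/train", data=f).raise_for_status()

# Predict: send a failed test log, get back the most similar cluster
# with its merged/not-merged statistics, as on the previous slide.
with open("failure.log", "rb") as f:
    response = requests.post(f"{BASE}/predict", data=f)
response.raise_for_status()
print(response.json())
```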
What's next? Well, actually, this part is not merged yet, sadly, but there's an additional pull request, I don't think it's even done yet, well, mostly done, to automatically retrieve data from GitHub: to make it really easy for you to add this to your project, should you want to, by pulling in data from all your GitHub statuses and failures and constantly shoving that into the training. Then, tracking known issues automatically would be good. Right now we file bugs for the biggest failures, the biggest clusters, but what we should probably do is try to associate those with the known problems and start to interact with them: put comments on them, reinforce some of the bugs, say "this is happening every other day" or "a whole bunch of this just happened with this change", and give more feedback back into Bugzilla and all the other places where these known issues are tracked.

And now, let me hand over to Rado.

Hello, hello. So when I first saw Stef's presentation about this machine learning project, I was thinking: wow, this is a cool toy, I want to have it too. And since Stef is talking about finding Linux bugs in general, there's no way we can avoid talking about the kernel itself. How do I use it? When it comes to debugging, the Linux kernel kind of stands out compared to user-space applications. The usual approach, turn it off and on again, doesn't really scale well with Linux. Debugging is almost impossible in real time, although we have these great applications like SystemTap. Usually the machine sits somewhere behind the customer's firewall, and we have no direct access to it. When Linux does crash, all we get is a cryptic error message on the console, and an image of the corrupted system if we have kdump enabled. And when it does come from the customer, it has to be manually picked up and analyzed by a human being. We have special tools for that, like crash, and it requires a certain skill set to go through all these structures inside the Linux kernel; you have to be really skillful and understand what's happening there, expect certain values. But all that stuff is not relevant to the customer. He's not a user of the arcana; all he wants is a single-button solution. And sometimes, even for the engineers looking at a certain bug: if you are stuck on a bug, it can help to have a similar bug that can lead you to a solution. So what I came up with, using Stef's machine learning libraries and the data we have gathered from customers: I put together, well, it's not exactly a tool yet.
It's more of a technology preview of a tool that will come. It uses kernel messages as input, and using clustering and normalized compression distance, it tries to tell you the nearest, the best, matches that have solutions inside our knowledge database. Yeah, cool, that's all.

Yeah, and I'm excited to see that; I'm excited to see us applying this technique to other places. It's not hard: the total amount of code here is 800 lines of Python, it's scikit-learn, and it's running on bog-standard CPUs, no GPUs involved. This is easy. It does take practice to apply these techniques, and I think we went through some bad techniques first. We tried to apply neural networks to this; it didn't work. Neural networks don't have a concept of noise. They do this weird thing: in the typical example, you try to classify human handwritten digits, right? So you put a hundred thousand digits of training data into the network, and suddenly it's telling you pretty reliably: oh, that's a three, that's a four, that's a five. But when you put trash into machine learning, I'm sorry, into neural networks, when you put line noise in instead of a digit, it will confidently tell you: oh yeah, that's a four. It's not unsure; it has less of a concept of uncertainty. So it was a poor choice for modeling this sort of data: tests, and a constantly moving, unsupervised set of data. But we learned from that, we applied different techniques, and it's really easy once you get practice with this; it becomes more natural, and you can just apply it to your project.

So I'd like to make a ludicrous claim, and I think this is a little bit forward-thinking. Eventually, I'd like to prove that test flakes are almost always bugs, and that if your testing system does not have enough flakes, then your testing system is not working correctly. Linux has bugs; Martin Pitt said so. The universe has entropy. If your testing system is not finding that, and you're not being forced to react to it using techniques like this, or other techniques to mitigate it, it means your testing system is not working well enough. So rather than flakes being a source of annoyance, something to be squashed and gotten rid of, they're a source of value, something that we can actively use to make our software better. It's telling us, it's shouting at us: here are the bugs; and we're not listening. So I'm excited about this. The first step is to ramp up your source of data, how many tests you're running, to a high level: tons of kernel stack traces, tons of test results, millions of test results. Both of these cases allow us to apply these techniques. We can't apply them to small amounts of data; it doesn't really work. Once we get up to that level, and we know this is just the beginning, there are so many tools at our disposal to use all of that data to our advantage.

All right, questions? Yeah. So that's a good question, and the question is about whether we validate our model.
Do we check that a known issue is not ignored, that the data is not lost and is actually clustered correctly in the model? I'd say that today we do that poorly, and the way we do it is via testing: we have a model dataset that we pass through and expect it to make certain decisions about, roughly putting things in these clusters and so on. That doesn't work well with the sliding window and the various knobs I was talking about: every week it's getting new data and expiring old data. I mean, it's a stop-gap measure to prevent what you're talking about, but I feel like we could get better at that.

Right, mm-hmm. So this person was highlighting that this same container is usable via the AI libraries, and you can call it with OpenWhisk serverless computing to do this stuff as well, so you have another choice there. In addition, they're encouraging everyone to start simple and try different techniques to see what works and what doesn't; just because, for example, neural networks didn't work for this case doesn't mean they won't work for your case. Try the basic techniques that the person in the previous talk was describing.

I think Adam had his hand up earlier. That's an interesting idea. So Adam was saying that he's seen, over the course of his career, that every test flake has a real bug behind it, but the question was: can we figure out where these flakes, where those bugs, come from? Which commits introduced the flakes, and thereby introduced a bug? I think that's worth looking at. The scale at which this is happening allows you to do that, perhaps not as a gating measure before the commit goes in, but within the next day or two you would have enough data to do something like that. It's all about making sure that we give enough data to the model to make such decisions, so that we're not flooded with noise and can pick out the ones that are pretty much guaranteed to fit that criterion. Yep, that's right. Right. Yeah, I agree with that.
Let me see if I can summarize for the recording: the comment was that it would be good if we could move away from calling these "flakes" and call them what they really are, whether that's a race, an unreproducible bug, or some other failure. And I agree with that, and that's part of the reason for doing this: not just to have fun with the data, but also to help change our mindsets, and to allow people to prove this for themselves, to interact with the data. That's why that image, and hopefully the D3 model where we can touch and feel it, is so important. We can do a lot of stuff here with machine learning, but I'd also like machine learning to help teach us new things. This goes back to Moravec's paradox: machine learning, and machines in general, the way they act and behave, are fundamentally unintuitive. Moravec's paradox, in so many words, states that the things we believe to be simple, such as walking down a staircase, are in fact very hard for a machine to accomplish, while the things we believe to be difficult, such as playing a game of chess, are pretty trivial for a machine in comparison. The underlying nature of how unintuitive it is to have an intelligence thinking in a completely different way from us, and allowing us to learn from it, interact with it, and let it affect our mindset, is a very valuable thing. That's the underlying theme here, and why we ended with that claim: because we really want to start thinking about this problem differently.

Yes, Steven. So the question is: could we integrate this into the container verification pipeline that we use for Fedora, RHEL, and OpenShift, and can we put this into other systems? This is part of the reason I'm pushing so hard on everyone to ramp up the scale of their testing; that's what's needed to use this effectively. I would say we should be running about a million integration tests a month on a project. That may seem like a high bar, but it's not that hard once you apply basic engineering to the task. Cockpit runs about a million, 800,000 actually, integration tests per month, on five or six beefy systems; it's not running on a massive cluster of stuff. And that's because basic engineering was applied to this problem.

One of the reasons this is possible is that CI is essentially very, very repetitive. Extremely repetitive. Look at those numbers, look at these numbers here: two and a half million tests, and the real failures plus the test flakes, which are the bugs, actually, all of these are bugs, that's a tiny amount. That's orders of magnitude smaller; 99% of what's happening here is green, just churning, burning CPU power for the same results every single time. In theory, if we got in and started micromanaging this as humans, we would say: wait a second, 99% of the effort of these machines is wasted in order to produce this tiny amount of result. The root cause is that CI is extremely repetitive. It's also very easy to get the kernel to optimize the I/O and scale CI up to insane densities that you would never think possible, once you just take some things out of its way. That's a whole other talk that we can get to, but in other words: yes, it's possible, but the requirement is to scale up our testing to a level where there's enough data. And we know that it's possible to do that.
We have the techniques. Yep. Yeah, and that's a good point: this is again another example of the unintuitive nature of machines interacting with our projects, and I agree with that completely.

Exactly. So the comment was: maybe we should vary some responses and actively introduce fuzzing. It turns out that for most projects, that's not necessary. But I will say, as we said at the end here: if your testing system is not producing any flakes, and you've tried everything else, you should probably start introducing variations in your timing, your network responses, and so on, to find the bugs that you know must be there.

So, are we out of time? Three minutes? We can go on. Any more questions?

So, all of this is available for you to play with. There are GitHub pull requests open, Rado is working on kernel stack traces, and Prasanta has an API for calling this in a serverless fashion. And even if you end up playing with this and say, wait a second, this is not the kind of problem I'm trying to solve: get familiar with the techniques, get familiar with how this can affect your mindset, and start to change the way you develop software. I'm excited about what we all come up with. Thanks.