Hi everyone, hope you had a good lunch. The next session is on using AI/ML to process automated test results from OpenQA, from Tim Flink.

All right, well, as you said, I'm Tim. I work for Red Hat with Fedora Quality, and that's what I'm going to be talking about. Before I get started: I am a really big believer in informal presentations. If you have questions, please ask them. I was a little unclear on whether I put in too much information or not enough, so I'm going to find out; if you have questions, please feel free to raise your hand, and if I don't see you, please interrupt me. That's fine, I don't really care. The whole point of this is to communicate, and sometimes that's the best way.

So what I'm looking to do is give a bit of an introduction, talk a bit about neural networks, go into the experiment that we've done and where we're looking to go in the future, and leave some time at the end for questions. Out of curiosity, is anyone here hoping that GPT was going to be making the presentation? Just me. I was kind of hoping, but it didn't actually work out.

Just to get started, because it's a very popular term right now and, in my opinion, there's a decent amount of misunderstanding of what it all can cover: what I'm going to talk about is, in general, an experiment we did to figure out whether it's even going to be possible to use AI/ML to do triage. In particular it uses data from tests that are visual, and I want to make that distinction, because a lot of testing is done primarily in the text domain, and the techniques used there don't necessarily work with visually oriented tests, and vice versa. Just to be clear, we haven't created a magical tool to automatically triage test failures. Maybe someday, but we aren't at that point yet. Nor will what I'm about to talk about, in my opinion, be able to completely replace human triagers. This is, again in my opinion, a good way to make people more efficient, to help make their jobs easier and their lives easier. But at the end of the day, it's always going to rely on some form of human expertise to actually get the job done.

So the system these tests are running in is OpenQA. OpenQA is a system that's primarily based on computer vision. What it's really good at is: it takes a screenshot of a machine that's running, and it has a little picture, and you basically tell it, look anywhere on the screen for this picture of, for example, a button. If you find this picture, this little snippet, somewhere on the screen, move the mouse over there and click on it, and you create your test cases based on that. Consequently, a lot of the information that comes out of it is visual. Yes, there is some text that comes out of it, but it's primarily in the form of screenshots, and it also produces videos of the entire run. The system itself started with SUSE as far as I know; it's pretty much the way they do all of their automated testing. But it's in Fedora now, and it is a critical part of our release process. Basically every update in Bodhi goes through OpenQA, and at this point every Rawhide update goes through OpenQA as well. So it runs a lot of tests.
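To make that "look for this little picture and click it" idea concrete, here is a rough Python sketch of the same concept using OpenCV template matching. OpenQA itself is built on os-autoinst and is not implemented this way; the file names and threshold below are made up, so treat this purely as an illustration of the needle-matching concept.

```python
# Illustration of the "needle" concept: look for a small reference image (a button,
# an icon) anywhere in a full screenshot and report where to click. This is NOT how
# OpenQA/os-autoinst is implemented; paths and the threshold are placeholders.
import cv2

screenshot = cv2.imread("screenshot.png", cv2.IMREAD_GRAYSCALE)
needle = cv2.imread("install_button.png", cv2.IMREAD_GRAYSCALE)

# Slide the needle over the screenshot and score the match at every position.
scores = cv2.matchTemplate(screenshot, needle, cv2.TM_CCOEFF_NORMED)
_, best_score, _, best_xy = cv2.minMaxLoc(scores)

if best_score > 0.9:                      # arbitrary confidence threshold
    x, y = best_xy
    h, w = needle.shape
    print(f"needle found, click at ({x + w // 2}, {y + h // 2})")
else:
    print("needle not found, so the test would wait, retry, or fail")
```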
So this particular idea started sometime last year. There was a chronic issue within OpenQA where tests would fail unexpectedly when Firefox, which was being used for part of the test, would just quit. And the test would fail because Firefox quit. We tried looking into it, and it took a while; at least when this started, we hadn't found the root cause. So we had a whole bunch of tests that were failing for reasons we weren't testing for, which is not terribly useful. The people doing the triage would go through, recognize that a test had failed for this relatively common reason, and mark it for rescheduling. The test would be rerun, and it would either pass or fail for some issue that wasn't this one, where Firefox just decided to stop working. It wasn't something that was causing the system to stop or anything like that, but it was overhead on the people doing the triage, and they usually have better things to do with their time than deal with a common issue that hopefully we can find a tool to work around. So we came to this question: can we train a machine learning system to detect that particular crash and then automatically do the rescheduling?

But this raises a question, because I was talking about this issue last year, and as it turned out, as I was starting on this, the bug was magically fixed. I'm not sure anyone ever found what the root cause was, but there was an update and they stopped working. Or rather, they stopped crashing. So there's a very valid question of, well, why do we care? This is no longer a live issue, so why continue with this? It's not going to be of direct use. And there's an easy answer: it created a rare situation where creating the dataset is cheap. By and large, when you work with machine learning, the really expensive part is creating your dataset. Something I've learned through doing a whole bunch of machine learning is: if you don't have to create your own dataset, don't. It's a lot of work and it's not fun. And at that point, there was a real question of whether this could even work. Can we look at the screenshots coming out of OpenQA and get enough information to start doing triage work? So this was a cheap way to create the dataset for the experiment that didn't involve me harassing the triagers to help me tag different runs until they stopped responding to my emails.

So at this point, does that make sense? The setup is: we have this recurring issue, and it's visual. I guess I haven't gotten much further than that, but any questions at this point?

All right. We did try a couple of things first. I'm generally a believer that more simpler is more better, but it didn't work. I tried to do OCR on the pictures and run that through a classifier, and that failed spectacularly, so I didn't write slides for it. But it was worth trying, because it's the easiest possible thing.

To get into some background on what I ended up doing: artificial neural networks, the way they work, were really inspired by how the human brain works. You have a bunch of interconnected neurons that work together to remember and process information. They're generally made up of multiple layers, at least modern ones, but the concept itself is not really that new. It was first proposed in the late 19th century, and the first computer-based neural networks were built in the 50s. As time went on, it would become popular in research and then people would realize, at least in the 50s, that they didn't have the compute power to do it, so it would die off for a while. Then people started talking about it more in the 70s, and then it kind of died off for the same reason.
The use of neural networks really didn't start taking off until we had the growth of GPUs in particular, and the ability to do a whole lot more math a lot more quickly, which turned the theoretical "oh, we can make a computer work like a brain" into something that's a lot more practical and doable.

Roughly, oh, sorry, I'm remembering that this is being recorded, so that shows up. Like I said, this is a very simple diagram and I don't have the time to go too deeply into it, but the basic idea is: you have your input layer and your output layer. Each of these would generally represent one bit of your input, and you present your input, it's connected to all of the different layers in between through various techniques, and then you get an output. That's the very, very high level of how a neural network works.

In terms of the types: what's called a multi-layer perceptron is very much like the diagram I just showed. It was meant to model how the brain and the neurons in the brain work. It's still very heavily used, but it tends not to be the primary thing; it tends to be a final layer gathering stuff together to produce, at least in a classifier sense, a single output. Another type is recurrent neural networks, which are usually used more for natural language processing, where you have things of indeterminate length. And then there are convolutional neural networks. These really became popular for image classification: give it a set of images and a set of categories, and see whether the neural network can correctly classify this image as a dog or that image as a bus, that kind of stuff. CNNs were state of the art in image classification until about 2019. As for the stuff since then, it's a fair question why I'm not covering it: I ended up using a CNN, and the reason is data. All the newer stuff, everything that has been state of the art since 2019, requires millions of pieces of data to get that performance. And getting back to the theme, this was an experiment and we wanted to do it cheaply; millions of data points was not an option.

So, a classifier. Like I said, it's: given an input, what class does it belong to? I think of it as very similar to a CAPTCHA. We've all seen these things: select all the images with crosswalks. That's using a human as a classifier. Does this image have a crosswalk, yes or no, go on. So each one of these pictures can be classified as "contains a crosswalk" or "does not contain a crosswalk". And a neural network is one potential implementation of an image classifier.

So, getting into machine learning. At a high level, you show it a bunch of examples, it's supposed to learn from that, and then it should be able to replicate basically what you showed it. And that's how a neural network works too: you give it a bunch of data and the correct answers, it learns from "this was correct, that wasn't correct", and eventually, well, you hope, you end up with something that can repeat what you've taught it. A common term is deep neural networks, which just means more layers: more layers, more computation. That's really what it comes down to. There's no strict definition from what I've seen; generally it's three or more layers to be considered deep. But that's about it.
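As an illustration of that "show it examples and the correct answers" loop, here is a minimal PyTorch-style sketch of training a small multi-layer network as a binary classifier. The layer sizes, data, and training settings are invented purely for illustration; this is not code from the experiment.

```python
# Minimal sketch of supervised training: a small stack of fully connected layers
# learns from labeled examples (input -> correct answer). All numbers are made up.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 1),            # single output: "is this class X?"
)

inputs = torch.randn(200, 64)                     # 200 fake examples, 64 features each
targets = torch.randint(0, 2, (200, 1)).float()   # the "correct answers" (0 or 1)

loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)  # how wrong the network currently is
    loss.backward()                         # propagate the error back through the layers
    optimizer.step()                        # nudge the weights to be a little less wrong
```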
So, does my very quick overview of neural networks make sense? Does anyone have any questions? Go ahead. Hold on a second, wait for the microphone, I forgot about that.

Could you go back a slide? You were saying it's multiple layers. Does that mean multiple passes of the same? Is it multiple passes to determine the image, to classify the image, or? In retrospect, I probably could have found a better diagram. In this case, this is a network with three layers: you have your input layer, one hidden layer, and an output layer. More layers would add more steps, if it was fully connected or something, so it's more processing, more steps before it finishes. It goes from the input layer to the next layer to the next layer, and eventually you get to your output. So it's not more passes, but it is adding more steps before you get from beginning to end. Does that make sense? Thank you. Regarding this diagram, would each point in the first hidden layer point to all of the points of the consecutive hidden layers? For a fully connected network, yes. This is a simplistic diagram; there are cases where you would use that, but that stopped being state of the art probably in the 90s. So it's a good way to think of it, but it doesn't always happen unless you specify it. Does that answer your question? Yes. Any other questions?

All right, so let's get into the experiment. First, the dataset. Like I said before, one of the big advantages of looking at this particular problem is that it was easy and cheap to create the dataset. We had an issue with the tests, and if the triagers saw that failure, they would reschedule the test. So I made a couple of quick assumptions: that all the jobs that failed for this reason were rescheduled, and that anything that failed and was rescheduled failed due to this issue. Did I probably catch a few that weren't supposed to be in there? Did I miss a few? Yes, but over this time period that holds true enough to produce a valid dataset. From that, it's easy to just point some code at the OpenQA instance, download all the pictures, download all the videos, and then we have our dataset, categorized into two sets: jobs that were supposed to be rescheduled and jobs that weren't.

And again, I'm harping on the cost because that's one of the biggest things: after the people doing the triage described the issue and confirmed that I understood it, they didn't have to sit there for hours and hours telling me, okay, this job number, this job number, this job number, those were the ones you were looking for. So the cost in terms of human time was very much minimized. I gathered all the jobs from August 30th to September 13th of last year, grabbed all the screenshots, any text that was produced, and the videos, and it ended up being a little over 31,000 jobs. The full dataset is 208 gigabytes. Eliminating the video data brings it down to about half, 102 gigabytes of mostly pictures.

A convolutional neural network doesn't really have a sense of time; you have to give it all of the data at the same time. So I can't just feed it the first screenshot, then the second, third, fourth, and so on. What I ended up doing was creating a composite screenshot. I'm going to go back and forth, but basically this. You can kind of see the different screens; this is from part of a job, and these are all the screenshots it produced: here's the first one, second one, third one, fourth one, and so on. So this ends up being one image that contains all of the screenshots, and I can feed that into the network all at once.
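The script that builds these composites is not shown in the talk; a minimal sketch of the idea with Pillow might look like the following. The tile size, grid layout, and paths are assumptions for illustration, not the values that were actually used.

```python
# Sketch of the composite-screenshot idea: tile all of a job's screenshots into one
# big image so a CNN can see the whole run at once. Tile size, grid layout and paths
# are illustrative placeholders.
import math
from pathlib import Path

from PIL import Image

TILE_W, TILE_H = 256, 192  # each screenshot shrunk to a small tile

def composite_for_job(job_dir: str) -> Image.Image:
    shots = sorted(Path(job_dir).glob("*.png"))
    cols = math.ceil(math.sqrt(len(shots)))      # roughly square grid
    rows = math.ceil(len(shots) / cols)
    canvas = Image.new("RGB", (cols * TILE_W, rows * TILE_H), "black")
    for i, shot in enumerate(shots):
        tile = Image.open(shot).convert("RGB").resize((TILE_W, TILE_H))
        canvas.paste(tile, ((i % cols) * TILE_W, (i // cols) * TILE_H))
    return canvas

composite_for_job("job_123456_screenshots").save("job_123456_composite.png")
```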
Just getting into some technicalities: I did some sub-sampling. In general with neural networks, if you give them something that's rare, and this was rare, out of the 31,000 jobs I think it was about 900, so roughly one in thirty, if you just feed that ratio into a network while you're training it, it's not going to do very well, because it can just always say false and be correct 29 out of 30 times. So I took all of the samples where the job was supposed to be rescheduled, and I think I left it at a ratio of about five to one: for every job that needed to be rescheduled, I took five random ones that didn't, and created the dataset out of that. Then I split the dataset 80/20: 80% of it was used for training, the testing was done on the remaining 20%, and I used a three-layer-deep CNN to do the classification. And again, this is the composite screenshot that gets fed in.

Hopefully this representation won't be too confusing. You were asking, was it Jeff, about adding layers. In this case, this is one layer, this is a second layer, this is a third layer, and this part at the end you could count as a third and fourth; we can get into technicalities. Basically, this last part is dimensionality reduction: this is a vector of length 100, and I needed a binary answer, so it has to be brought down to one. So that's what this last one is, dimensionality reduction from 100 to 1, which is why I was saying whether it's another layer depends on how you look at it. I'm not going to spend too much time on the specifics of what all of this is doing, but it uses convolution to steadily go through the data that starts in the layer and essentially do feature extraction, trying to extract information out of it, and then move on to the next layer. Some of the specifics were purposely left open: there's a parameter search space I left for this, the sizes of the different layers, the fully connected part at the end that pulls the information together and eventually reduces the dimensionality, and the size of the kernel used for convolution. The max pooling has to do with keeping the network from concentrating all of the information in one part of what it's looking at; it would take longer than I have to really go through it, but it's a commonly used component, part of almost every CNN. I did a grid search over all of that, which came to 432 runs.

And I just want to make a bit of an aside. There's been some conversation on some of the Fedora lists about whether we really need GPUs, and I want to point at this: on the same machine, I ran the same network on the same data, and it took four minutes with the GPU and 24 minutes on just the CPU. So it is dramatic, and even at four minutes per run, the search still took more than a day running constantly. And this is a 13th-gen i7 with a 24-gigabyte RTX 3090; in terms of what people use for machine learning it's tiny, but I think that says more about the clusters they use for training stuff like GPT, which cost millions of dollars. Still, it does bring home why we might really need an accelerated option for doing AI/ML in Fedora, and it's an illustration of what I said earlier: the concept of neural networks has been around for a while, but it didn't really take off until we had machines that could do that much extra math very quickly.
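The exact architecture came out of the grid search and isn't spelled out in the talk; the following is a rough PyTorch sketch of the general shape described, three convolution-plus-max-pooling stages for feature extraction, then fully connected layers that take a 100-wide vector down to a single yes/no output. Channel counts, the pooling-to-fixed-size step, and the input resolution are placeholders.

```python
# Rough sketch of a three-layer CNN classifier as described: conv + max-pool feature
# extraction, then a fully connected part reducing a length-100 vector to one output.
# Channel counts and input size are invented; only the overall shape follows the talk.
import torch
import torch.nn as nn

class RescheduleClassifier(nn.Module):
    def __init__(self, kernel_size: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size), nn.ReLU(), nn.MaxPool2d(2),    # layer 1
            nn.Conv2d(8, 16, kernel_size), nn.ReLU(), nn.MaxPool2d(2),   # layer 2
            nn.Conv2d(16, 32, kernel_size), nn.ReLU(), nn.MaxPool2d(2),  # layer 3
            nn.AdaptiveAvgPool2d(1),   # collapse spatial dims (a simplification for the sketch)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32, 100), nn.ReLU(),  # the vector of length 100
            nn.Linear(100, 1),              # dimensionality reduction from 100 to 1
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# One fake composite image, scaled down for the example: batch of 1, RGB, 384x512.
logit = RescheduleClassifier()(torch.randn(1, 3, 384, 512))
print(torch.sigmoid(logit))  # probability that the job should be rescheduled
```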
Leaving aside the last one that took hours, do you have an estimate of the power consumption of the four minutes versus the 24 minutes? I imagine I could go look at it, but I don't know off the top of my head. I'd be interested to know, because... Oh yeah, my suspicion, and this is a gut guess, is that it's less energy with the GPU, but I can figure it out. I don't know off the top of my head, though.

One question, two slides before. If you were to not subsample, if you were to just take the composite images, how many composite images would you still require to get 80% efficacy for one single failure? I don't understand what you mean by 80% efficacy. So for example, if you are training on 80% and you subsample the dataset, if you were not to subsample... I don't know 80% of what, I don't understand. You subsample the split, right, to 80/20, 80% for training and 20%? No, the subsampling is separate. I can go check the code, I don't remember off the top of my head, but I believe it's five to one, so there are five jobs that did not need to be rescheduled for every one that did need to be rescheduled. And that's just for training purposes, because the issue is too rare otherwise. Like I said, and I've seen this happen, the neural network will just return zero for everything and you'll end up with ninety-something percent accuracy, because 90% of the jobs were zero. So that's where the subsampling comes from. The split is just 80%: I took the original data, subsampled it so there was a five-to-one ratio of not-rescheduled to rescheduled, and then from that dataset took 80% for training and used the remaining 20% for testing. Makes sense? Okay.

Okay, and go forward, please. So if I understand correctly, it took 28 hours to completely train the network? No, with the GPU it took four minutes to train. But I can run it with different parameters, so doing an exhaustive search of this entire list, with two tries each, was 432 iterations. Okay, and the 28 hours is the total time? That's the time to exhaustively search through that parameter list to see which combination of parameters performed the best. Oh, I see, okay. So if you go through this whole experiment, after the 28 hours you know which neural network configuration is the most efficient for the task, so if you want to train it on a different failure, you wouldn't have to go through the 28 hours again, you could just... You wouldn't even have to go through the four minutes. Once you have the network trained, you can just run it on more data. That four minutes is to train on the 80% and then test on the 20%. I don't know off the top of my head how long inference alone would take, I don't want to try to do the math in my head, but it's much quicker once it's actually trained. Any other questions?
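The search harness itself isn't shown in the talk; below is a rough Python sketch of the kind of procedure described, i.e. subsample the negatives to roughly 5:1, split 80/20, then exhaustively try every hyperparameter combination and keep the best. The parameter names and values, and the placeholder training function, are invented for illustration and are not the real search space.

```python
# Sketch of the exhaustive grid search: subsample not-rescheduled jobs to ~5:1,
# split 80/20, try every parameter combination, keep the best. All values are
# placeholders, not the real search space.
import itertools
import random

def subsample(rescheduled, not_rescheduled, ratio=5):
    """Keep every rescheduled job plus `ratio` random negatives per positive."""
    negatives = random.sample(not_rescheduled,
                              min(len(not_rescheduled), ratio * len(rescheduled)))
    dataset = [(job, 1) for job in rescheduled] + [(job, 0) for job in negatives]
    random.shuffle(dataset)
    split = int(0.8 * len(dataset))
    return dataset[:split], dataset[split:]   # 80% train, 20% test

search_space = {                # hypothetical grid, not the one actually searched
    "kernel_size": [2, 3, 5],
    "conv_channels": [8, 16, 32],
    "fc_width": [50, 100, 200],
    "learning_rate": [1e-3, 1e-4],
}

def run_one(params, train_set, test_set):
    # Placeholder: train the CNN with these params and return held-out accuracy.
    return random.random()

rescheduled = [f"job_{i}" for i in range(900)]
not_rescheduled = [f"job_{i}" for i in range(900, 31000)]
train_set, test_set = subsample(rescheduled, not_rescheduled)

best = max(
    (dict(zip(search_space, combo)) for combo in itertools.product(*search_space.values())),
    key=lambda params: run_one(params, train_set, test_set),
)
print("best parameter combination:", best)
```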
OpenQA is configured such that the screenshots are all 1024 by 768. When I did the first try, I took that composite screenshot and shrank it down to the same size, 1024 by 768. The results were so bad that I did not write them down, and at that point I was actually convinced this was not going to work. Then I thought, all right, let's just try this one more time. Let's double the resolution: take the composite screenshot, which could be eight times the size of the original, and shrink it down to 2048 by 1536 instead. When I did that and then searched the parameter space, it was getting really good numbers.

Oh, crap, I was going to put a slide in here on precision and recall. The first thing, accuracy, is basically how many it got correct overall: it got almost 98% of them right. But the more important numbers are precision and recall. Precision is, of the jobs it flagged as needing a reschedule, how many actually did, so it tracks false positives; recall is, of the jobs that actually needed a reschedule, how many it found, so it tracks false negatives. This is much easier to explain if you include the diagram that I forgot, but suffice it to say it did well. It's not producing too many false positives, nor is it missing many things that should have been positives. In my mind that's important for this use case, because if we're trying to rely on something to say yes, this is triaged a certain way, no, this is not, and you get a whole bunch of false negatives or false positives, no one is going to trust it, and it's going to cost them more mental energy than if we hadn't done it in the first place.

Okay, so there's a typo in my slide. These are the configuration values that yielded the best results. The kernel size is supposed to be two; I'm not sure why it says 62,000, but that value is supposed to be two.

And then: that's great, we have something that's 98% accurate, but is it useful? Is this something that can be applied elsewhere? And this is a way of saying: not really. Like I said, this was an experiment to see if any of this is feasible. As an analogy, it's like I have this network and I spent all this time training it on pictures so it can identify crosswalks. That's wonderful, but the next CAPTCHA I get, I need to identify buses, and it's not going to find anything. So that's kind of where we're at. This experiment was, by analogy, pointing it at OpenQA and saying find all the crosswalks, which is great if you're looking for crosswalks, but that's not what you need all the time.

And, sorry, one of the potential issues is that because I had to use the large screenshots, it takes a lot of memory on the GPU. With the 24 gigs of memory I had on the GPU, I could only do, I think it was four or five images in a batch. Batch sizes are usually larger than that, but there's only so much memory. Do you have a question? Is offloading the images from system memory too slow for this application? Offloading the images from system memory is what you do if you don't have enough, but it has to be on the GPU in order to run. In order to get that speedup from 24 minutes to four, it has to fit within the graphics card's memory. Okay, thank you.
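To spell those metrics out, here is a tiny worked example in Python. The counts are invented purely to show how accuracy, precision, and recall relate; the talk only reports roughly 98% accuracy along with good precision and recall.

```python
# Invented counts, only to illustrate the relationship between the three metrics.
true_positive = 170   # jobs correctly flagged "needs reschedule"
false_positive = 10   # jobs flagged that didn't actually need it
false_negative = 12   # jobs that needed it but were missed
true_negative = 888   # jobs correctly left alone

accuracy = (true_positive + true_negative) / (
    true_positive + true_negative + false_positive + false_negative)
precision = true_positive / (true_positive + false_positive)  # how trustworthy a "yes" is
recall = true_positive / (true_positive + false_negative)     # how many real cases it finds

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```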
So, just curious on the composite image thing. As the human behind the curtain here, I happen to know that in classifying these failures, all you actually needed was the failure image. Did you try running the experiment just using that? Because I believe from the API you can get only the image that the test failed on. I don't know if you looked into that.

I did not know that was a thing. Yeah, it knows; it circles it in red in the web UI, and I believe from the API data you can figure out which one the test failed on and then run it on that. No, I did not know that was possible, so no, I did not run that. I did not know that was something you could do with OpenQA, or that OpenQA would tell you. Next time we do it, we can try it that way. It'd be interesting to try; my suspicion is that it would not work. Because, I mean, I've been talking about all this OpenQA triage, and it's mostly you. You've just been using they/them pronouns for me so far. Well, my understanding of the time period when this was happening is that you needed more information than just the one screenshot it failed on? No, for this one, from the failed screenshot you didn't get any more context of what job it was or what it was doing. The failed screenshot would always be: oh, it just dropped back to a command prompt and there's a bunch of Xorg server output there. That's the thing the network would have wanted to figure out. But there are other cases where that wouldn't work. Just for this specific failure, it probably would have done the job, I think. It'd be interesting to try. Like I said, my instinct, and we can get into the details of it, is that it wouldn't work, because I don't think there's enough context in that one image to be able to tell reschedule or don't, but it'd be worth trying because it'd be a lot less computationally expensive.

Is that another question? This is kind of a question for Adam: are there things that are successes that look like that failure screen for other kinds of tests? That's a question for Adam, I don't know. He asked: are there things that would look like the failure screen but would be a success in another test? Again, I think in this specific case, probably no. So yeah, as I said in the previous answer, but just in case anyone didn't catch it: in this specific case, when the test failed, what we would normally see is, instead of something in a web browser, which is what the test was expecting, we would suddenly drop to a console (this was actually running X for stupid reasons), so you would have the X server output on the console and probably a cut-off stream of error messages from X, usually. That's not something that's a success in any other case; there's no point at which we want that. So I think it would probably have been fairly clear-cut, but for instance, there's another problem I have right now where, if we were only looking at the failure screenshot, that wouldn't work, and you probably would need this composite screenshot thing, which is an ingenious approach that I like. Thank you. Yeah, and I'm going back to: I experimented with OCR, and the OCR had so many problems identifying the text correctly. That was part of my suspicion, but it'd be worth trying. Like I said, more simpler, more better.

Hey, so can you just go back to the experiment slide, the experiment method where you did... This, the experiment design? Well, the grid search one, where you mentioned that you performed the grid search, yeah.
So, in the grid search, the way you did it, as I understand: for every hyperparameter combination, you trained the network using the training set, then got the accuracy using the test set, continued that for all your combinations, and selected the best one based on the test set performance, is that right? Yes, that is correct. But do you think that would overfit, or influence your results? Because the test set is generally supposed to be kept separate until the very end. So maybe a split of, let's say, training, validation, and test, so that when you select the best hyperparameters you only apply the test set at the very end?

No, you're correct. And to be honest, I didn't do this as formally as I could have. Especially if I was trying to get this published, I would have to do that, because you're right; I'm kind of cheating, is a way to put it. I think for this particular context, given that we're not trying to produce something that's going to go into production, and it's just a question of whether there's enough information in these screenshots to try to triage for this specific purpose, I think it doesn't matter. But you're correct, I should have done that. Does that answer your question? I'm not trying to say, I mean, I think these results are valid, but. No, I think that's valid, because you have got quite a lot of data points, so essentially it's unlikely to overfit, I guess. And now that you mention it, I'm curious, because getting into more specific things: within that 80%, I think I did another 80/20 split, so I used that 20 for validation, and I did record the results from the validation parts. Now that you've mentioned it, I'm curious to see whether it would have picked the same hyperparameter combination from the validation set versus the test set. Yeah, thanks.

Does that make sense to everyone else, what I just said and what he was asking? Enough, anyway. All right. He basically pointed out that there are some flaws in my approach that certainly would not pass muster if I was trying to get it published or something like that.

All right, getting back to this, and I keep harping on it because I got some questions about when this is going to go into production: again, it's not going to go into production. It wasn't ever meant to go into production. This was an initial experiment, and the idea was: is there enough information in these screenshots to do triage just based on throwing them at a neural network? And I think the answer is yes, it's certainly possible and certainly worth looking into.

So, data and code. I forgot to push my code public because I found a bug last night, something that doesn't affect the results, but I need to push it. The data is available. It is a 50 gigabyte tarball that took four hours for my computer to compress down from, yeah, 108 gigabytes to 50. So you're welcome to grab it; it is a lot of data. And I can publish these slides afterwards. If you want to see the code sooner than I get around to pushing it, just come find me. I'm not trying to hide it, I'm just trying to make sure it's correct before it gets pushed. Yeah? Hold on a second, can you wait for the...
So you said the conclusion is: yeah, it's possible. Say I would love to do a new classifier to find the buses. The initial round took a lot of investigation, but the second time, when you want to do a classifier for the buses, how long would it take you, and how much time would it save you during regular QE?

The primary flaw in the approach I've talked about is that it cannot scale. To handle "Adam says there's a new issue, it'd be nice to be able to detect it", the problem is creating the dataset, because for this kind of approach to work you have to have at least 1,000 labeled instances of it happening, and then do all of that work. And that's if, for every single issue, we're trying to find 1,000 examples of it before we start recognizing it. So how long would it take you to create a new dataset for the buses? It's a hard question to answer. It depends on how frequent the error is. The long pole in the tent is the person who is going through marking things as yes, this is related, no, this isn't. So I don't know if I've misunderstood your question. I have no idea, even on the scale of: is it minutes, days, weeks, years? Again, it comes down to the issue. Once the dataset is created, it's going to be the same kind of process: say a day or two of computer time to explore the parameter space, and then four minutes to train each run, would be my suspicion, assuming everything is about the same size. But the most expensive part of this is not the neural network, it's not the time coding it, it is creating that dataset, getting 1,000 examples of that particular issue, however often it's happening. So say it happens 10 times a day: it's going to be 100 days before you get 1,000 examples. And do you need 1,000 examples? That's a hard question to answer, it depends on the issue, but what I'm trying to get at is that the expensive part is the dataset creation and collecting all of that.

So let's look at this from the perspective of recognizing failures that, let's say, we don't get in OpenQA. We can look at Fedora, a more general view on Fedora. For example, we have stack traces, and we have a system that uploads these stack traces from users' systems, so we have a collection of them. This could be used if we marked a certain stack trace as a real error, and we know the bug it's associated with: we could have a similar network identify stack traces like that one as belonging to the same bug. So roughly how long would it take to achieve this? This is probably something like what Miro is asking.

Before I answer the question you asked, I just want to answer a different question that was implied. That's a whole field of research in itself that is not terribly related to exactly what I talked about. But to answer the question you asked: honestly, it starts getting a little outside of my knowledge. The problem with honestly every machine learning problem is data, and it just comes back to the dataset, and how long that takes I honestly don't know.

The reason I'm taking the stack traces as an example is that they are more confined, and it's easier to identify similarity there. Remember when all of these GPT models became sort of publicly available for experiments, but not really run in an open source fashion: what people did was start looking into Stable Diffusion and similar network systems and start converting other types of information into the visual domain.
So effectively the beginning of that story was that they converted audio into video, into pictures, ran Stable Diffusion to generate something new from that audio, and then converted it back to audio. And this is effectively what I am implying here. Now we could take these stack traces, convert them into pictures that you analyze, stack them together, and run some sort of analyzer. If we know that this particular stack trace, for example, corresponds to a real problem that we have already seen in the past, and we have a collection of those problems already classified everywhere, we can take the data that is already associated with the solved bugs and train the system with that to figure out when a similar stack trace is happening, and apply that to the logs of OpenQA when the crash actually happens.

One of the issues I at least noticed is that OpenQA specifically doesn't always give you that in text form. The crashes it was seeing with X would show up on the screen, but they would not show up in any of the text logs. So, getting back to what you asked: I don't know. It reminds me of talking to someone who did some interesting research into trying to find malware by converting the bits in a binary to a bitmap and then running visual analysis on that. It's a bit of a hammer-and-nail situation, because you're trying to solve a much harder problem for something that could probably be solved much more easily, but that still seems to be where the direction of research is going today. And like I said, this is outside of my exact realm of knowledge. My instinct, and it'd be worth trying, is that because stack traces are more structured, information retrieval techniques rather than machine learning might be more effective, because, like I said, the downfall of most machine learning problems is that you need a hundred examples, a thousand examples of the same thing before you can train it to find it. And in a lot of cases, by the time you have that many, the bug is fixed or it's no longer an issue. And either way, finding someone to identify a hundred duplicates would be difficult. So, am I answering your question? It's an interesting way to do it; I don't know how likely it is to work.

I'm just looking at this from the other perspective. We have, roughly, upstreams, then downstream Fedora, then downstream CentOS Stream, then downstream RHEL, which have a span of time before a certain thing comes in. So by the time we've missed something in QA in RHEL, that problem might well have already been seen somewhere upstream, upstream meaning in Fedora, or upstream of that project. So if we could have some sort of analysis of these problems we see in the upstreams and in Fedora, done at that time, to generate these kinds of datasets and then reapply them downstream, we might actually get results that we don't see now in RHEL QA, or that we don't see in the actual support cases, because we see them maybe 12 to 18 months later.

No, and that could, yeah; I'm paraphrasing what you were getting at, but having a way to fingerprint some of those crashes to be able to identify them later, yeah. Just a side note: if you want to take these traces and compare them to the different issues already reported, you already have that; it's part of ABRT. FAF already does that, not based on AI but on edit distance.
It was never too popular, so if you want to dive into that, you can; it already exists. I'm using the traces example here just because people are aware of ABRT, but in reality I would take fragments of the actual execution logs, with the variation that's in there, like SSSD logs or the journal with certain things, and focus on those, because these are the things that we analyze in support every day. Some of them span a lot of time in execution, and you match across multiple logs to find the actual problem; it's not in a single place. So extracting these, correlating them, and finding them this way seems to be a bit promising. I think Stef Walter did this a couple of years ago with Cockpit, and they did find some promising results, but again, training and then reapplying the results fails because you never find the same problem. Whereas reapplying this across multiple distributions, from upstream to downstream, might actually give you repeatability of the problem; that's the reality we see. Customers might see the same issue again and again downstream, 12 to 18 months later, after we've probably solved the problem in the upstream, and this gives you a chance to actually catch something that you forgot to backport, for example. That's my point; maybe this is the real value here.

Okay, and I'm happy to talk about this more, but I have five minutes left and we'll see if I get through everything. I should be able to. So, it was an experiment, it was never meant to go into production, so what can we do with this? Can we make something out of it? From what I understand, and Adam, this is from a conversation we had, something that is possible, that I want to take as the next step, is: can we take those OpenQA failures and start grouping them by root cause? So that, again, revisiting this whole thing of making the people who do the triage work more effective, it takes less of their time, because they have other things they need to be doing rather than just triaging the same thing over and over again.

I think the value is not so much about taking up people's time. It's more that the Adam W. restart bot runs maybe every two hours during the day, and then there are these eight-hour intervals where I have to go sleep, and that means that if your update is blocked on one of these failures, you might be waiting eight hours before it gets restarted. If we can do something like this, then your update will get restarted immediately and we cut out that latency. I'm talking about something more general. So not just "this needs to be restarted", but can we change OpenQA so that there are suggestions like "these five failures look very similar", trying to group things by root cause or failure, not just "this needs to be restarted". Yeah, that could be helpful for sure. So yeah, I agree it's definitely worth looking into. And I think the thing that we have to look at is the whole question of how long it takes to get enough data together to classify things versus how long problems exist. But there are... I mean, the X thing lasted for several weeks, and this 404 thing has been going on for a couple of weeks now. Yeah, and I'm talking about something completely different; well, it's related and not. But the idea is, it's still the same kind of thing, it's just not always the same ones. But one of the things that... I'm sorry, I'm just trying to make sure I finish my slot, finish before time's up.
There's a system out there called Tango; it's been published in research. Basically it was developed to try to find duplicate results from screen captures of mobile apps, and it used video and text to try to find duplicates within its dataset. It's something I do want to look at, because I think it's promising and it's related enough to be worth looking at. But, as I'm repeating myself for the 50th time, the problem is data. Modern machine learning needs a lot of data, and we have only so many expert hours, and only so many emails I can send experts before they start ignoring my emails and never talking to me again. So it's a question of the real cost: is this going to look promising enough to spend the time to create the dataset to do more research?

One of the things I also want to look at is a technique, well, an approach, called active learning. The idea is that instead of sending every single sample to an expert to annotate, you analyze your data first and organize things so that you use the experts minimally and gain the most information from the least amount of expert time. So the idea from here is to look at Tango and to look at active learning, so that we can minimize the amount of time that we need from experts, and hopefully find something that can start grouping those failures by root cause with enough confidence that it doesn't end up being noise.

So, just repeating the conclusion: it's promising, in my opinion, but it is only a beginning, and going forward it's going to start getting expensive. One of the things I do want to harp on, and I'm leading into this: all of the work that I did was on an Ubuntu machine, because this stuff does not run well, if at all, on Fedora. We can get into a whole bunch of stuff about why and what can be done. There is kind of a solution coming up that someone's working on, basically the replacement for nvidia-docker. You used to be able to pass the GPU through to a container, have all of NVIDIA's proprietary crap in that container, and run it on Fedora that way; then we stopped shipping Docker. The replacement, which can use Podman, I think someone's working on packaging. That's a lead-up to: if you also don't like the fact that this all had to be done on Ubuntu, we're working on stuff to fix that, and maybe you should show up at 4:30 in the room next door, where we're going to do a bit of a meetup to start talking about some of this stuff and what we're doing to fix it.

I think I'm out of time, but questions, comments? I guess, I am out of time. If you have questions or comments, please come find me.

Thank you so much, Tim. And thank you to all of you who are here in the Fedora room, the Linux distribution development room. We will be picking up again in five more minutes. If you are on your way out for a moment, there is a badge sign on the back desk there; please stop by and get your Flock badge if you haven't already, you can scan the QR code. I just want to remind everybody of that. So we'll be picking up here in about five more minutes.