So, hey everyone, thank you for coming in. The topic we're trying to address here is optimizing long test run times using AIOps. As contributors to open source projects, most of us have made pull requests to GitHub repositories, and those PRs kick off CI/CD tests, which often take up so many resources that we can't possibly run all the tests, or we end up with a backlog. To answer how we do that, where we're coming from, and why we're doing this, we're here to talk about the optimal stopping point tool. I'd like to introduce my colleague, Hema.

Hello everyone. I'm Hema Veeradhi, and I'm working as a senior data scientist at Red Hat. Both Akanksha and I are part of the emerging technologies data science team, which is part of the CTO office at Red Hat, and I'm based out of California in the United States.

And my name is Akanksha Duggal. I'm also a senior data scientist in the emerging technologies group, and I come from Boston. To give you an overview of what we're going to talk about today: we start with AI for CI, which is the main project; the optimal stopping point tool is a subset of that project. I'll also go over the motivation behind the project, the solution we've come up with, the data sources we're using along the way, the workflow, some insights, and a final demo where we'll walk you through all the steps we've taken.

To introduce this AIOps toolkit called AI for CI, I'd like to start with the problem we're trying to address: there's a need for automated, AI-driven monitoring when it comes to testing data. Tons and tons of open source testing data is being generated but not collected or put to use, so we see a lack of AI-driven metrics around open source community health. That brings us to the opportunity of leveraging the datasets being made available by many open source communities. There's plenty of open operations data; for example, TestGrid, Prow, and GitHub have so much data that we could put to use to find interesting metrics that in return benefit community health.

The solution we've come up with is an open source AI toolkit that helps us collect and analyze all the CI/CD data being made available. As part of this project, we've also built machine learning models that address various use cases. One of the use cases is called time to merge: every time you make a pull request, it often takes a long time to be reviewed or merged into the main code base, so this model posts a comment on your PR predicting how long it will take for your pull request to be merged. Similar to this, the main topic of this presentation is the optimal stopping point classifier, which helps you understand how long it should ideally take for your test to finish running, and what the optimal stopping point is after which it should be safe to terminate the test, restart it, or inspect why it was behaving that way, and save resources in the process, be it cloud resources or engineers' time. These are resources you want to save for the right set of things.
Another ML service we have here is called the build log classifier. We all know there are tons and tons of logs being generated in Prow, and it's often difficult to go through so many of them, so what we'd like to do is classify these Prow logs based on the type of failure they belong to. Also as part of this larger project, we have KPI and metric dashboards: interactive dashboards where you can look at all the metrics we've collected so far. All in all, the aim of this project is to foster an open AIOps community that helps leverage all these amazing data sources. As part of the project, we have so far collected plenty of data from sources like TestGrid, Prow, and GitHub; we've collected metrics and created KPI dashboards; and we have machine learning services that support CI/CD processes, are integrated with GitHub projects, and are easy to use. Finally, it's a resource for everybody in the community: there are templates, notebooks, and scripts that anybody can put to use directly, basically for free.

To come back to the topic we're addressing, I'd like to mention why we're doing this. This is a graph of the run durations of tests collected from a GitHub repository called CodeQL. We see that most of the test runs finish within the first minute, which makes us believe it technically shouldn't take so long for these tests to complete. However, when we started looking at the failing tests specifically, we saw a couple of outliers that take up all the resources or sometimes take a seemingly infinite amount of time to finish. So wouldn't it be a good idea to come up with a point after which we know the test is probably just holding resources or blocked somewhere? It could be an outage or any other reason, for that matter. The aim here is to find the bottlenecks and point out where things could be going wrong. The ML solution we've come up with is that, since some of these tests take too long, we'd like to find an optimal stopping point after which a test is most likely to fail, and hence, in return, save and allocate our resources in the best way.

Talking about the data sources: we all know GitHub is home to a lot of open source projects, and a lot of people contribute to these repositories daily, so we're collecting our data from GitHub. We've also looked at Prow, which is a Kubernetes-based CI/CD system with a lot of data being collected; many checks that run on PRs report back to Prow, which makes it a good place to scrape data from. Then there's TestGrid, another platform that helps people visualize their CI processes. A lot of communities, even beyond Red Hat, keep their data on TestGrid because it's easy to visualize, but we still think we can build a better tool by scraping all this data from the back end and coming up with more insightful metrics and KPIs. Now I'll hand it over to Hema, who will elaborate on the solution approach we've taken. Thanks, Akanksha.
Now that we have a brief understanding of the problem at hand, let's take a look at the approach we've taken to build this optimal stopping point model. Given the different data sources Akanksha went over, the first step is of course to collect this kind of data from our CI/CD tests. The main data source we're looking at right now is GitHub. In GitHub, you'll have noticed that whenever you open a PR on a repo, it usually shows a bunch of checks that run at the back end as part of the review process: pre-commit checks, file-linting checks, and many others. For a PR to get merged, these are prerequisites, so a test has to be successful for the PR to eventually get merged. All of these checks are defined under workflows, which are part of GitHub Actions, and that's how we get these datasets from various repositories. As an initial experiment, we pick a repository of interest, confirm that GitHub Actions is enabled for it, look at the different workflows that have been set up, and then use the GitHub API to extract each workflow's ID; all the checks within a workflow can also be obtained through the API. That's how we get the test durations and the other features we start seeing in the data.

Once we have all of this data ready, we move on to feature engineering. In any ML model, you need to identify the important, relevant features that can be used as input. Here we mainly look at the test durations Akanksha showed in that earlier plot. For a given test, we look at the entire distribution of how long it takes to run, and we bucket the durations into time range intervals: for example, zero to 10 seconds, 10 to 20, and so on, all the way up to the completion of the test. Within each of those intervals, we find how many tests fail in that time range, which lets us calculate the percentage of failures over time. Once we've calculated those durations and split them into buckets, we can finally arrive at the optimal stopping point prediction. To do this, we define a threshold on the percentage of failures, which here is 75%: if the percentage of failure in a time range exceeds 75%, we say that anything beyond that point is likely a test that is holding up resources and is likely to fail. This threshold is just a default value we came up with, but it's customizable; we can tweak it depending on how the workflows and checks have been set up. It's the initial approach we started implementing to define what that point in time would look like.
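To make the two steps above concrete, here is a minimal sketch, under our own assumptions, of collecting run durations through the public GitHub Actions REST API and then applying a simple per-bucket failure-percentage rule. The function names, column names, and the exact failure-rate rule are illustrative placeholders, not the project's actual code.

```python
# Illustrative sketch only: fetch run durations for one workflow via the public
# GitHub REST API and derive a stopping point from per-bucket failure rates.
import os
import pandas as pd
import requests

def fetch_workflow_runs(owner: str, repo: str, workflow_id: str) -> pd.DataFrame:
    """Collect completed runs for a workflow and compute rough durations."""
    url = (f"https://api.github.com/repos/{owner}/{repo}"
           f"/actions/workflows/{workflow_id}/runs")
    headers = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}",
               "Accept": "application/vnd.github+json"}
    resp = requests.get(url, headers=headers,
                        params={"status": "completed", "per_page": 100}, timeout=30)
    resp.raise_for_status()
    rows = []
    for run in resp.json()["workflow_runs"]:
        started = pd.Timestamp(run["run_started_at"])
        finished = pd.Timestamp(run["updated_at"])  # rough proxy for finish time
        rows.append({"duration": (finished - started).total_seconds(),
                     "conclusion": run["conclusion"]})  # "success" / "failure" / ...
    return pd.DataFrame(rows)

def optimal_stopping_point(runs: pd.DataFrame,
                           bucket_seconds: int = 10,
                           threshold: float = 0.75):
    """Return the start of the first duration bucket whose failure share
    exceeds `threshold`, or None if no bucket crosses it."""
    runs = runs.assign(bucket=(runs["duration"] // bucket_seconds) * bucket_seconds)
    for start, group in runs.groupby("bucket"):
        if (group["conclusion"] == "failure").mean() >= threshold:
            return float(start)
    return None

# Toy usage (made-up durations rather than a live API call):
toy = pd.DataFrame({"duration":   [5, 8, 12, 45, 70, 130, 600],
                    "conclusion": ["success"] * 4 + ["failure"] * 3})
print(optimal_stopping_point(toy))  # -> 70.0 for this toy data
```

The 75% threshold and 10-second bucket width here are just the defaults mentioned in the talk; both would be configurable per repository.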
Once we have that final interval, the time slice at which the test has a greater chance of failing, we want to integrate this back into GitHub Actions. We'd like to use a GitHub bot here. Whenever a PR is open, the bot will run the model at the back end and then leave a comment on the PR saying, hey, this test is likely going to fail at this particular timestamp; you should probably stop it so that the rest of the checks don't get affected by it (a rough sketch of what posting such a comment could look like follows below, just before the demo). It's still a work in progress, but that's ultimately the plugin we'd like to enable for repositories where there are a lot of checks to pass. Maybe it's a big PR, and a lot of developers have been put on it to review it, but they don't have an idea of why a certain test is failing and you'd probably want to move on to the next phase of the review. That's the pain point the bot's comment is trying to help developers with.

So how does this whole workflow look in the end? You have PRs being opened against a repo, and we have the optimal stopping point model running at the back end. In the graph here, the x-axis shows the test duration buckets and the y-axis shows the percentage of failures. The first range is zero to ten seconds, then all the other ranges beyond that until the test completes, and you can see that over time the percentage of failure keeps rising and spiking. Anything beyond the point where it crosses that 70% mark, we say the test is eventually going to fail; or if we set the threshold to 60, it's the range where failures cross 60%. If we trace that back to the x-axis, we get the time interval at which it was going to fail, maybe around one minute ten seconds or so. That's how the output looks. Finally, this is the comment that the GitHub Action bot will eventually leave as part of our service: it will say that, for this particular check, beyond 50 seconds is when you should terminate it, otherwise it's going to hog your resources. The idea is that we can also allow users to take actions based on that; we'd like to add the capability of asking the user, should I go ahead and stop this test, or giving them a set of actions they can take as part of their CI/CD process. But as a very low-hanging fruit, we basically want to leave a comment to provide more feedback on the PRs being opened in different repositories.

Now we can move quickly to a small demo. In this demo we're just going to go over the code and a little bit of the workflow we follow. I'll hand it to Akanksha to start with the initial part. Awesome. All right, for this demo we've started looking into this repository called CodeQL, and like any of your repositories, it has a couple of pull requests.
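As a rough illustration of the bot step described above: posting such a comment can be done through the GitHub REST API's issue comments endpoint (pull requests are issues for commenting purposes). This is only a sketch under our own assumptions about naming and authentication; the actual AI4CI bot or GitHub Action may be implemented quite differently.

```python
# Hypothetical sketch of the bot's comment step; the real AI4CI integration
# (a GitHub App or Actions workflow) may look different.
import os
import requests

def post_osp_comment(owner: str, repo: str, pr_number: int,
                     check_name: str, stopping_point_s: float) -> None:
    """Leave a comment on a PR with the predicted optimal stopping point."""
    url = f"https://api.github.com/repos/{owner}/{repo}/issues/{pr_number}/comments"
    headers = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}",
               "Accept": "application/vnd.github+json"}
    body = (f"Optimal stopping point prediction: `{check_name}` is likely to fail "
            f"if it runs longer than {stopping_point_s:.0f} seconds; consider "
            "terminating or restarting it to free up CI resources.")
    resp = requests.post(url, headers=headers, json={"body": body}, timeout=30)
    resp.raise_for_status()
```

A call like `post_osp_comment("some-org", "some-repo", 123, "pre-commit", 50)` (all hypothetical values) would produce the kind of comment shown on the slide.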
If you take a look at any of the open pull requests here, we see there are a number of checks: some haven't completed yet, and some completed a couple of seconds ago. Things like these often take a lot of time because there are so many checks and workflows being run. If you go to the Actions tab, you can see more details about each of these checks and each of the runs that were made, how many failed, and so on. What we aim to do is get all of this data into Jupyter notebooks, which are the home ground for most data scientists, to start exploring the data and finding insights. Looking at this notebook, the main goal is to scrape all the information we get from the Actions tab on GitHub. After we get that information, we put it into a usable format for evaluation, and once we've collected all of it, we separate it into passing and failing tests, because it's easier to make a prediction once we know what both datasets look like. Once the passing and failing data frames are ready, we split them into train and test data and move on to the optimal stopping point prediction. Hema will now go over the approach we followed to find the optimal stopping point.

So, once we've collected the data in the previous notebook, and just to mention, if you're not familiar with Jupyter notebooks, they're basically an interactive way of writing Python code. We call it a notebook because everything is broken down into cells like this: you run the first cell, and the outputs get printed one after the other depending on how you write your code. It's the preferred tool for most data scientists, and if you're not familiar with it, you can read more about Project Jupyter to see how the tool is used. That's how this code looks in notebook format. Moving on to our analysis, we take the CSV files we saw in the previous code and get two different sets of data: one for all the passing tests and one for all the failing tests, and we read them separately.

We looked at this from two different approaches. The first approach was experimental; we don't use it actively right now, but it's something we researched and did some analysis on, so I won't go into too much detail. The idea in this approach is to look at the statistical distribution of the run times. You again have run duration on the x-axis, and we look at how many tests fall within the different buckets we see here; for example, about 20 tests in the zero to less-than-three-seconds range, and so on. That's where we figure out the distribution pattern, and we do the same thing for both passing and failing tests. For statistical distributions, there are libraries in Python that will automatically find the best-fitting distribution based on the values you have; here it's trying to fit these different types of distributions to our dataset.
It then tells you which one is the best-fitting distribution, and based on those fitted distributions we find the intersection point of the passing distribution and the failing distribution; that intersection point is essentially what we map on the x-axis as our optimal stopping point. That was approach one. But we wanted to move on to a different approach, which we find more suitable: one based on the probability of tests failing rather than just looking at the distributions, because ultimately that's what you want to predict. Coming back to those buckets, here you see some tests even going all the way up to an effectively infinite timestamp, which is not something you want; you want to get rid of those kinds of long-running tests. After understanding that distribution, we plot the percentage of failures, and that's the approach we ultimately focus on. We set a threshold of 70 after observing a lot of tests. This is just one test we're showing it for, but over time you're going to have multiple repositories, multiple checks, multiple tests, and each dataset is going to look different; in this particular code it's one test for one particular repository. So keep in mind that this threshold may not make sense in every situation, but at least it's a starting point, and it's where we can visualize things better. From a more data science perspective, we also try to normalize and scale the values rather than look at the raw values; these are just a couple more ways you can normalize your values and run the analysis on top of them. You see the graphs look a little different just because the values have been normalized, and you see some more intervals, but nothing too drastic. Ultimately we apply the threshold value, intersect it with the curve, trace it back to the x-axis, and find the time duration or timestamp it corresponds to; that's the point beyond which you should not let your test keep running. That's the overall goal of this particular code.

Coming back to our slides: if you want to engage more or learn more about this, you can scan the QR code here; it will take you to our GitHub repo, where we track all of our work. If you have suggestions or feedback, we're open to contributions, or even just opening an issue if you want to learn more, so please go ahead and do that.
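Going back to the first, experimental approach mentioned above (fitting distributions to passing and failing durations and taking their intersection), here is a minimal sketch of how that could be done with scipy. The notebooks may use a different fitting library and candidate set; the function names and candidate distributions here are illustrative assumptions.

```python
# Sketch of the experimental approach: fit distributions to passing and failing
# run durations, then take the first point where the failing PDF overtakes the
# passing PDF as a candidate stopping point.
import numpy as np
from scipy import stats

def best_fit(durations, candidates=(stats.expon, stats.lognorm, stats.gamma)):
    """Pick the candidate distribution with the lowest Kolmogorov-Smirnov statistic."""
    best = None
    for dist in candidates:
        params = dist.fit(durations)
        ks, _ = stats.kstest(durations, dist.name, args=params)
        if best is None or ks < best[2]:
            best = (dist, params, ks)
    return best[0], best[1]

def intersection_point(passing, failing, num_points=2000):
    """Duration at which the failing PDF first exceeds the passing PDF."""
    p_dist, p_params = best_fit(passing)
    f_dist, f_params = best_fit(failing)
    grid = np.linspace(0, max(np.max(passing), np.max(failing)), num_points)
    diff = f_dist.pdf(grid, *f_params) - p_dist.pdf(grid, *p_params)
    crossing = int(np.argmax(diff > 0))  # first grid index where failing > passing
    if diff[crossing] <= 0:
        return None  # failing PDF never overtakes the passing PDF on this grid
    return float(grid[crossing])
```

This is only one way to realize the "intersection of distributions" idea described in the talk; in practice the probability-of-failure approach above is the one the team prefers.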
We also have the larger project called AI for CI, which Akanksha mentioned at the beginning of the talk. It's a bigger open source initiative that collects all of these models we're building: one of them is what we presented today, and another is the time to merge model that some of our colleagues worked on. In that model, we predict the amount of time it will take for a certain PR to get merged into a project. We don't want this to scare people into thinking it will take very long for a PR to merge; the motivation is that new contributors to an open source project might be hesitant to participate and contribute code because they don't really know whether their PR will be reviewed. For those first-time contributors, if you can predict and tell them, hey, this PR is going to get merged in a couple of days, it may give them more confidence to contribute to your project. The second advantage is that it can help community managers look at their project and say, okay, it's taking a lot of time to review PRs; do we need to change something, and how do we make it more efficient? That's how we came up with these different ways to consolidate these models. If you want to learn more, I'd encourage you to look at the AI for CI repo; you'll find all of these resources there, like the notebooks, the dashboards we've built, the data sources, and so on. These are, again, some more references to all of our repositories. With that, I'd like to stop here. Thank you all for attending, and if you have any questions, we have a couple of minutes to take them.

Yes? That's a good question; just repeating it: how well would this work for a smaller GitHub repository? If you have a smaller repo, a less mature project, with smaller unit tests, how accurate would this be, and how much data do you need? For any machine learning model, the generic answer is that more data is better, so I don't want to just default to that answer, but I would say it might not be very accurate at first; at least over time, if we keep trying and see that it's predicting with somewhat reasonable accuracy, that would be a good start. Of course, if that particular check or test is also found in other repos, we can use that as training data too, so the data doesn't have to come just from your repo. If it's a test that's very specific to your repo, the training dataset may not be as large as we'd like, but many checks are generic enough that we can find datasets from other repositories with similar checks and use those to train the model better. That would be one way to approach it. Do you want to add to that?

Just to add to that: the more we retrain, the more feedback we get from the repositories, and the more tests we run, that's the feedback we look for in any machine learning model, so we know how well the model is predicting and can retrain it on the new data made available by new PRs and test runs. I think that would be useful. Does that answer your question?

Yes? So, repeating the question: if
you're taking tests from different repos and using them to train your model, how do you figure out whether that data is representative? Some of these checks, for example the file-linting and pre-commit checks you saw in that screenshot, are more generic and standardized to some extent, so even if we collect them from different repos, we can have some confidence that the data will look similar. But yes, some of the other checks are customized for a particular repository and might not work well for others; in those situations we might have to exclude that data so the model isn't training on the wrong thing. It's somewhat of an experiment to check which datasets the model learns well from and which it doesn't, and that's something we'll have to look at further.

Yes? That's definitely a good question; I'm going to repeat it, and please correct me if I've understood it wrong. You mean that at this point we're making predictions for long-running tests, but what about short-running tests that finish quickly: is it even correct that they finished so early? Is that the question? Short-running tests and false negatives. That's a good question. As of now we're only looking at the long-running tests, but that is one of the use cases we'd also like to address, and it ties into Prow logs, because if you don't get any information from the run duration, the best place to look for more information is Prow. Where do you go to look at the logs for these tests? Do you have logs being generated somewhere you can check? Okay. For most of the tests, especially for OpenShift repositories, they're linked back to Prow, and often when engineers working on these projects don't get an answer, they go back to Prow, look at the logs, and try to understand why something was behaving a certain way. That's where our early project on the Prow log classifier comes in: it classifies these logs into categories, what kind of failure or pass they correspond to and what the possible reason might be if they behave a certain way. So that's another use case, but this one is mainly focused on long-running tests. Go ahead.

Thank you for that plug, Oindrilla. To add to that, at this point it's a work in progress, but what we aim to do with this initiative is let anybody who wants to use this tool have a customization file where you put in what you want the tool to do, starting with the threshold you'd like to specify, anywhere between 75 and 95, whatever you want. The next step would be to have the tool automatically terminate these tests, but that's only something we can do if the user or the repository owner lets us; it's something we want to add, but definitely something the owner has to decide on. You can specify all these parameters, and once you do, the GitHub Action will automatically run on all incoming pull requests in the future. That's the plan, but most of these things are still
a work in progress; we were happy to present what we have so far. It looks like we're out of time, so I'd like to thank you all again for joining us. Please feel free to reach out with questions later during the conference. Thank you.