Good morning, everyone. Have you ever found yourself lost in logs and dashboards while getting to the root cause of your failures? Have you considered monitoring your CI processes using AI tools? How do you like the idea of having an intelligent monitoring tool that can help you analyze all these failures? Well, in this session, I, Akanksha Duggal, and my co-presenter, Oindrila Chatterjee, are going to introduce you to our project, AI for CI. We will be walking you through the tools we developed to support your CI/CD processes.

Well, hello, everyone. I'm Akanksha Duggal, from Boston in the United States. I'm a data scientist in the Office of the CTO; I work on the AIOps team at the AI Center of Excellence at Red Hat. And hello, everyone, I'm Oindrila Chatterjee. I also work at Red Hat, on the same team at the AI Center of Excellence in the Office of the CTO, with Akanksha, and I also hail from Boston.

So, by the end of this presentation, you should have an idea of how you can use AI for CI and its set of AI tools to monitor your CI/CD workflows. But before we go there, let's start by discussing what this project is about, the story of how we got here, and the various initiatives that support this process.

So what's AI for CI? Before getting into AI for CI, though, let's understand what AIOps is. AIOps stands for artificial intelligence for IT operations, and it is a critical component for supporting any open hybrid cloud infrastructure. In our opinion, though, it's mostly a cultural change, very much like what we saw with the DevOps movement. There, we combined two cultures, the dev culture and the ops culture, to create a new culture: the DevOps culture. The same cultural change is going to happen with AIOps. The DevOps culture and the data science culture use very different tools and have somewhat different mindsets. Our aim with AI for CI is to bring these two cultures together so that they can learn from each other. We aim to embrace the intelligent tooling that we have in the AI world and apply it to the operational domain.

And what's CI? CI, or continuous integration, is the practice of automating the integration of code from multiple contributors into a single software project. CI/CD is a solution to the problems that integrating new code can cause for development and operations teams. You've probably heard of this; it's called integration hell.

So what's AI for CI all about? Simply put, it's AI applied to CI/CD data. It's an intelligent open source AIOps toolkit that can be used to better monitor builds in order to help the development lifecycle. The goal of this project is to build AI tools for developers by leveraging the open data made available by the OpenShift and Kubernetes CI platforms. As part of AI for CI, we have built a collection of open source AIOps tools that can help support a CI/CD process. Open operations and CI/CD data coming from real-world production systems like these is a rarity among public data sets today, so this also presents a great starting point and an initial area of investigation for the AIOps community to tackle. We are also focused on cultivating an open source community that uses open operations data and an open infrastructure, where data scientists and DevOps engineers can come together and collaborate.

So let's look into some of our current initiatives. Firstly, we are heavily involved in collecting data from different open data platforms and creating a community around open CI/CD data sources.
We are also involved in quantifying and evaluating the current state of a CI workflow using key performance indicators, and we are building AI and ML techniques to improve the CI workflow. Finally, and most importantly, we are interested in creating a reproducible, end-to-end workflow around this whole data collection, analysis, and modeling effort, using technologies like Elyra, Kubeflow Pipelines, Seldon, and JupyterHub, all of which are developed, built, and operated on the Operate First environment.

So let's look into what Operate First and the Operate First cloud are. As I mentioned earlier, we build this AIOps community and develop the necessary tools on Operate First. Operate First is an initiative to operate software in a production-grade environment, bringing users, developers, and operators closer together. It uses the same community-building process as open source projects, which all of you are familiar with, but extends it to ops procedures and data. Operate First enables collaboration between open source developers and cloud providers, and AI for CI supports this collaboration by creating a new set of tools around CI/CD processes, which can otherwise be a pain point during the development and production of open source projects. If you're interested in learning more about how we open source the operation of software, I'd strongly recommend checking out the upcoming webinar by the Operate First team on October 5th to learn more about open source cloud operations. I will now hand it over to Akanksha, who will dive into the various open data sources that we are looking at today.

So, as we discussed earlier, one of the goals of this project is to build an open AI community around open data sources that originate from the different segments of a CI process. To understand these different segments, let's take a simple example: OpenShift. OpenShift is an enterprise Kubernetes container platform. There are hundreds of repositories associated with it, and thousands of contributors make pull requests to these repositories through GitHub. Every new PR with a code change is subjected to an automated set of tests and builds. Prow is the central component of this automation; it is a Kubernetes-based CI/CD system. The Kubernetes testing group defines Prow as a CI/CD system built on Kubernetes, for Kubernetes, that executes jobs for building, testing, publishing, and deploying. It integrates seamlessly with GitHub via webhooks, which can trigger automated CI/CD tests for GitHub PRs. TestGrid is a platform that is used to aggregate and visually represent the results of all these automated tests. OpenShift also collects cluster-level telemetry data, which consists of metrics about the resources being used while running these tests and builds, along with any issues encountered while running them. So the ultimate hope of this work is that we will be able to connect all these various data sources, starting from the GitHub pull requests, the bugs, the cluster-level telemetry data sets, TestGrid, and the Prow logs, and develop a complete picture of a CI process.

So let's start by looking into the first data source, TestGrid. As we briefly introduced earlier, TestGrid is a visualization platform for CI data. It is an open source project developed by Google to help people visualize their CI processes in a grid.
It is used by a number of communities to track the status of their tests and builds in a visually friendly format. TestGrid primarily reports categorical metrics about which tests passed or failed during specific builds over a configurable time window. So what does the TestGrid dashboard display, and where does it fit in the context of a CI workflow? We will look into this in more detail later when we present our demo, but this is what a typical TestGrid dashboard looks like. It's an aggregation of multiple tests over time, where the green cells indicate that tests were passing, the red ones indicate they were failing, and there are other possible states, like not running or flaking. So in a particular development lifecycle, whenever a developer opens a pull request with commits containing new changes, it is bound to go through the CI process of running the tests and builds, and the results of those tests and builds are displayed here on TestGrid.

Now, in order to quantify the current state of the CI workflow and better identify any gaps within the CI process, we calculate certain key performance indicator metrics related to these test runs. Calculating these relevant test metrics and KPIs helps us get to the root of a problem, discover recurring patterns within the running tests, and ultimately help developers and their managers with the development workflow.

So we start by collecting the data. As we saw, TestGrid contains a lot of information about multiple tests running over time; the data set has categorical metrics about these test runs. To gain more insight, we download this data programmatically in a JupyterHub environment, a tool data scientists use to run notebooks interactively. We access the visual grids from the TestGrid data in JupyterHub: we create a connection from the Jupyter notebook to the TestGrid URL and download all the available dashboards by scraping the HTML (a minimal sketch of this appears below). This allows us to dig deeper into the various features of the data and gain an in-depth understanding that would not be obvious from just looking at the dashboard.

Once we have collected the data, our goal is to apply AI or machine learning techniques to improve the CI workflow. But first, we perform some analysis, aggregating various tests and detecting patterns in the data, which helps us quantify and evaluate the state of the workflow. For instance, if developers and managers were to manage the allocation of resources for a CI process, it would be vital to know, in order to save those resources, whether a particular test has a higher probability of failing, or whether we could find an optimal stopping point for a test or a build after which it is certain to fail. To do that, we have calculated a number of relevant metrics and KPIs that not only help us evaluate any AI-based enhancements we make to the CI process, but also point developers to which areas need the most attention, the most improvement, or the most resources.

Once we have Jupyter notebooks for data collection, model training, and all the other parts of the machine learning workflow, we need to ensure that these tasks run sequentially and continuously. To do so, we use Elyra and Kubeflow Pipelines, a platform for building and deploying scalable machine learning workflows.
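Here is that sketch of the dashboard-collection step. It is a minimal illustration only, assuming the public TestGrid instance at testgrid.k8s.io and that dashboard links appear as anchor tags in the returned HTML; the exact URL and parsing logic in our GetRawData notebook differ.

```python
import requests
from bs4 import BeautifulSoup

# Illustrative TestGrid instance; our notebooks point at the Red Hat dashboards.
TESTGRID_URL = "https://testgrid.k8s.io/"

# Fetch the page and parse its HTML.
response = requests.get(TESTGRID_URL, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Collect candidate dashboard names from the anchor tags on the page.
dashboards = sorted(
    {a.get_text(strip=True) for a in soup.find_all("a") if a.get_text(strip=True)}
)
print(f"Found {len(dashboards)} dashboard links")
```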
We now run all our notebooks in an automated fashion using Elyra and Kubeflow Pipelines. And finally, in order to help developers and stakeholders view these KPIs, metrics, and aggregated test results visually, we created automated dashboards that help analyze the status of multiple tests and investigate problematic tests, builds, and jobs.

Now we'll move to another data source: Prow. Prow is a Kubernetes-based CI/CD system. Jobs can be triggered by various types of events and report their status to many different services. TestGrid provides the results of these tests, but if you want to triage an issue and see the actual logs generated during the building and testing process, you need to access the build logs that are generated by Prow and stored in Google Cloud Storage. The data set containing the log data for each build and job is laid out in this remote storage as directories and files. In order to understand this data as well, we download it programmatically in the JupyterHub environment, perform some initial exploratory data analysis, and apply NLP techniques like TF-IDF, that is, term frequency-inverse document frequency, in order to cluster these build logs by their content and automatically determine what kind of failure a log represents.

These logs are usually a rich source of information for automated triaging and root cause analysis. Unfortunately, they are very noisy and cannot be ingested by an ML model directly. So we clean these logs by encoding some subject matter expert knowledge, that is, a couple of well-defined regexes, before applying TF-IDF to each build log and encoding it into a vectorized representation that retains its contextual meaning in relation to the other build logs. We then apply a clustering algorithm to the jobs based on the term frequencies within their build logs, grouping the runs according to the type of failure in the log. For example, different clusters represent groups of builds that failed due to platform failures, or due to failures during the preparation steps before the tests start executing. The approach is also able to distinguish between logs with different syntactic structure that may not be critical to a developer, such as clustering builds into distinct groups based on whether or not they contain obfuscated data (see the sketch below).
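A minimal sketch of that TF-IDF-and-clustering idea, using scikit-learn on a few toy log lines; the real notebooks add the regex-based cleaning and a good deal of tuning on top of this, so treat the snippet as an illustration of the technique rather than our actual pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy build logs standing in for the real Prow logs pulled from GCS.
build_logs = [
    "error: failed to provision cluster on platform aws",
    "error: timed out waiting for cluster on platform gcp",
    "setup failure: could not pull image during test preparation",
    "setup failure: registry unreachable during test preparation",
]

# Vectorize each log into a TF-IDF representation.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(build_logs)

# Group the runs by the type of failure suggested by their logs.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = kmeans.fit_predict(X)
for log, label in zip(build_logs, labels):
    print(label, log)
```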
So that's all about Prow, and now we're going to move to another data source: GitHub. The builds and tests running on PRs are there because of the changes happening in the application's code base, and the goal of CI is to automatically identify whether any of these code changes are going to cause problems in the deployed application. Therefore, information on the running tests and builds, along with information from the GitHub code repositories, such as the metadata and the diffs of the PRs, can give us more insight into the overall CI process and ultimately lead us to the root cause of a failure. In an attempt to quantify critical metrics within the software development workflow, starting from the code contribution, we have prototyped a model that can predict how much time it is going to take for a pull request to be merged, that is, the time from the creation of a pull request until it is merged.

A metric like this can help us identify bottlenecks within the development process. For example, having an estimate of how long it's going to take for a pull request to be merged can help developers and engineering managers better allocate resources to a certain pull request and speed up the engineering process. So let's take a brief look at the workflow we follow for this. To predict how much time it's going to take for a pull request to be merged, we frame this as a classification problem: the time taken to merge a pull request falls into one of a few defined time ranges. As a first step, we collect a data set of pull requests and related metadata from the OpenShift Origin repository using one of our internal tools. Then we transform the input columns obtained from the pull requests, such as the size of the PR, the types of files added in the PR, and the description of the PR, into features that can be ingested by a machine learning model. We then explore some vanilla classifiers to classify the time-to-merge value for a pull request into 10 different bins or classes, using the features we engineered on the raw pull request data. Finally, we deploy the model yielding the best results as an interactive service using Seldon. This endpoint is available for anybody to interact with and test out on new pull requests. Once integrated with the GitHub repo, this service can provide newly submitted pull requests with a time-to-merge estimate.

Now let's take a look at the demo. This is the TestGrid UI. As you can see, a lot of communities have the results of their tests and builds dumped into this platform. We are going to click on Red Hat, and we see that we have a lot of grids covering various versions of OpenShift. I'm going to click on one of them and see what data we have in here. As we can see, we have various grids that indicate how many of the cells were passing, failing, or flaky; in this particular case, I see that most of them are failing. Let's go into further detail by clicking on one of these. This is what a TestGrid dashboard essentially looks like: a 2D dashboard that contains test names on one side and timestamps along the other. If you look at these cells, some of them are green, which indicates that a particular test was passing at a particular time, while the red ones mean that the test was failing at that time. If you click on one of these failing tests, you'll be taken to the Prow website, which contains a lot of logs that would probably not be very understandable just by looking at them.

For all of this, we access the data programmatically in a JupyterLab environment. We have a notebook called GetRawData. We start by importing some Python libraries, find the path for the TestGrid UI, make a connection to it, and then scrape everything using the BeautifulSoup library, which lets us scrape the entire HTML, much like the sketch shown earlier. Now we have a list of all the dashboards that we previously saw when we clicked on Red Hat. With all these dashboards in hand, we iterate through each one of them and collect data for all the selected dashboards. Once we are done with that, we save it to Ceph S3 storage so that we can use the stored data for our further analysis.
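A minimal sketch of that save step, assuming an S3-compatible Ceph endpoint; the endpoint URL, credentials, bucket, and key names here are illustrative, and the real values live in the Operate First environment configuration.

```python
import io

import boto3
import pandas as pd

# Illustrative S3-compatible Ceph endpoint and credentials.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.com",
    aws_access_key_id="MY_ACCESS_KEY",
    aws_secret_access_key="MY_SECRET_KEY",
)

df = pd.DataFrame({"test": ["test-a"], "status": ["failing"]})

# Serialize the DataFrame to parquet in memory (requires pyarrow) and upload it.
buffer = io.BytesIO()
df.to_parquet(buffer, index=False)
s3.put_object(
    Bucket="ai4ci-raw-data",
    Key="testgrid/raw_data.parquet",
    Body=buffer.getvalue(),
)
```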
Now we also have to calculate a lot of metrics and key performance indicators. To do so, we have a well-defined metric template that we use, which is friendly to all new contributors, in order to ensure consistency in how we store all of these metrics. It's called the metric template notebook, and it uses the example of flakes. We explain what the key performance indicator is, then we start by importing some libraries. One of these is essentially a helper notebook that contains a lot of helper functions, starting with a class that helps you connect to Ceph, upload, save, and so on. Then we import the data set that we have from the raw data. Next, there is an encoding that defines what each number in the TestGrid data corresponds to, along with some encoding functions, so that you don't have to do all that encoding yourself when you fetch the data from the 2D grid.

Finally, we perform the calculation. In this particular case, we are trying to calculate flaky tests, and for flaky tests the code is 13. We make a list of all the tests that had status 13, meaning they were flaking, and store them in a data frame. This data frame contains the timestamp, the associated tab, grid, and test, the test duration, and whether that test was flaking or not. We have also calculated flake severity, that is, which tests are more severe and more prone to flaking (a minimal sketch of this calculation appears at the end of this section). We've visualized it here using the Seaborn library, and we also have more visualizations in the Superset dashboard that we'll be looking at shortly. As always, once the metric is calculated, we store it in Ceph S3 storage.

Quickly going over another example metric, test passes and failures: again, the drill is the same. Describe the KPI metric you are contributing, import the raw data, and start with the metric calculation. Since we are doing test passes and failures, we first take the failures, which have the code 12, store them in a list, and move them into a data frame, where we can see whether a particular test was failing or not. We then do the same for the passing tests, taking everything with a status code of 1, which means the test was passing, and storing it in another data frame. We combine the passing and failing tests into one data frame and move on to further calculation. In this case, we have calculated the number of tests and the number of failing tests, and we see that the failure percentage is 73.6%. Then we calculate the total number of passing tests, and we see that the passing percentage is 52.46%. Finally, after calculating the metric, we save the results to Ceph again.

Now we'll look at how we run all these notebooks in an automated fashion. Did you have a question? So, the TestGrid UI itself has some explanation for the different tests that are running. When you're looking at a TestGrid dashboard, TestGrid gives you the ability to dig further into a particular test and see where it originates from, and it also links you to a Prow dashboard. That Prow dashboard can take you to the GitHub code repository: what the test suite consists of, and what automated tests it is actually running. And when we are running this in a JupyterHub environment, we automatically scrape all of the tests that exist for a particular grid and a particular dashboard, and we already have that metadata linked back to the particular test on TestGrid. Does that answer your question? Oh, the question, if I understood it correctly, was: we have all these different tests, so how do you know which test is what? How do you understand what it refers to and what it's doing?
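Going back to that flake metric for a moment, here is the promised sketch of the calculation, assuming an illustrative long-format data frame using the status codes described above (1 passing, 12 failing, 13 flaky); the column values and dashboard name are made up.

```python
import pandas as pd

# Illustrative long-format TestGrid data; the real frames are loaded from Ceph S3.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2021-09-01", "2021-09-01", "2021-09-02", "2021-09-02"]
    ),
    "tab": ["redhat-openshift-informing"] * 4,
    "test": ["test-a", "test-b", "test-a", "test-b"],
    "status": [13, 1, 13, 12],  # 13 = flaky, 12 = failing, 1 = passing
})

# Mark the flaky cells, then compute a simple flake severity per test:
# the fraction of that test's runs which flaked.
df["flake"] = df["status"] == 13
flake_severity = df.groupby("test")["flake"].mean().rename("flake_severity")
print(flake_severity)  # test-a -> 1.0, test-b -> 0.0
```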
All right. So Akanksha gave a great demo of what we are collecting and what we are calculating. But this is a lot of notebooks, basically 20 notebooks just for one data set, and we need to run them in an automated fashion on a daily basis. For that, we luckily have tools like Elyra and Kubeflow Pipelines, which help us automate these AI/ML workflows. As you can see here on the screen, we already have a very nice Elyra pipeline set up. On the left side, we have the raw data collection step, which runs first, followed by all the metric calculation steps; each node that you see here is a metric calculation step, and they all run in parallel to each other.

Running this is very simple, and I'm going to show it to you live in action. I just click on the play button here and give it a pipeline name; let's call it AI for CI. For this purpose, I have already created a Kubeflow Pipelines runtime; all of this is documented if you want to do it for your own AI/ML workflows. And I hit OK. The pipeline is packaging, it's connecting everything, and it's submitted. Clicking on the run details takes me to the Kubeflow Pipelines UI. If you're not familiar with the Kubeflow Pipelines UI, this is what it looks like; I'll increase the size a little bit. As you can see here on the left, the experiments tab aggregates the experiments you're submitting, and once you have submitted an experiment, it takes you to a run. What we are seeing here is essentially a run. This node here is GetRawData, the first notebook that we submitted, and the blue indicator there shows that it's actually running. Once you click on this running notebook and go to the logs, you can see the notebook running in action. Now, this is going to take an hour, and we don't have an hour, so we already ran a Kubeflow pipeline earlier, and this is what it looks like; it has a very neat UI representation. After all the steps have run, you should see a green check mark next to each step. All of this can be run and triggered on a daily basis, or at whatever frequency you like, and all of it can be automated for your CI/CD workflows (see the sketch at the end of this walkthrough).

Moving on to the Superset dashboard, where we have collected and display the previously calculated metrics. This is what the Superset dashboard looks like. At the top, you can filter by a particular tab, by grids, by tests, or by selecting metrics like build pass/failures and test pass/failures. Okay, I think I need to refresh it. There you go. So we have all these fun metrics: test pass/failures, build pass/failures, and the number of flakes for a particular test and grid. You can see the flake severities, the percentage of tests that are being fixed over time, and the persistent failures that exist. We can view metrics such as the mean length of a failure, which is basically how many times a test or build suite was run before a failing test started to pass again, and metrics like how much time it took for a failing test to start passing again. Metrics such as these can help engineering managers identify long-lasting failures. Using these dashboards, they can pinpoint which tests went from a pass-to-fail status or a fail-to-pass status in a particular time window, and other metrics like that.
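On the automation point: Elyra packages each notebook as a step and submits the whole pipeline to Kubeflow Pipelines. As a rough sketch, the programmatic equivalent with the kfp client might look like the following, where the host URL, pipeline file name, and run name are all illustrative.

```python
import kfp

# Illustrative Kubeflow Pipelines endpoint; Elyra keeps this in its
# runtime configuration.
client = kfp.Client(host="https://kubeflow.example.com/pipeline")

# Submit a compiled pipeline definition and run it once.
run = client.create_run_from_pipeline_package(
    pipeline_file="ai4ci.yaml",   # a compiled pipeline package
    arguments={},                 # no runtime parameters in this sketch
    run_name="ai4ci-daily-metrics",
)
print(run.run_id)
```

Kubeflow Pipelines also supports recurring runs, which is one way a daily trigger like the one mentioned above could be set up.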
Back in the Superset dashboard: for example, if we filter by a particular grid, it narrows down all the metrics and calculates them for the configured time duration. You can see this is going to take a minute to run, but you can see all these metrics changing for the particular grid of interest, the test of interest, and the dashboard of interest.

Yes? That's a very good question, and something we did not mention. The question essentially is: all this data that we are seeing on the dashboard, where is it coming from? Is it from the Elyra pipeline we just saw? There's actually a step in between that I did not go over. All the data from this pipeline is automatically stored in SQL storage; we are using a Trino database for this, and we have an automated script that takes all of these calculated metric data frames, in the form of parquet files, and stores them in the SQL database. Superset interfaces really well with SQL databases, so that's how we populate all of these charts.

Moving on: we saw a quick demo of how to calculate metrics, so let's look into how we calculate the time to merge of a GitHub PR. We can look at a particular pull request on the OpenShift Origin repository. You see that this was opened on April 22nd and has already been merged. Let's take this example and see how we interface with it in a JupyterHub environment. Moving on to the time-to-merge model that we created: essentially, we read all this pull request data programmatically from S3 storage, and you see here that we have the title of the pull request, the body, the size, who created it, when it was created, when it was closed, the number of commits, all of those attributes of a pull request. But a machine learning model cannot read this raw information directly, so we need to encode it, to convert it into features. We apply a bunch of feature transformations: encode the size of a PR into categorical features; find out whether the person contributing is a reviewer or an approver; calculate the day of the week the PR was created. Was it created on a Friday? Does that impact when a PR gets merged? Was it created on a Monday, on a weekday, or was it created in January or December? Do those things make a difference? We also look at the size of the body. Once we do that, we apply the transformations to this particular PR. We see that the PR we saw on GitHub was actually created by a reviewer, who is also an approver, created on this day of the month of April, and all that information. So the data frame that you see here is encoded and feature-transformed, ready to be ingested by a machine learning model.

We have already created a machine learning model using the steps that Akanksha mentioned, and we have deployed it using OpenShift and Seldon at an API endpoint. The endpoint you're seeing here is basically the time-to-merge endpoint, which is available for anybody to interact with. So we can actually see it live in action: we send a REST API request to this deployed endpoint, pick out this particular PR, and see how the model reacts to it.
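A minimal sketch of what such a request might look like, assuming the standard Seldon Core REST protocol; the endpoint URL and the feature vector values here are illustrative, not the real service's.

```python
import requests

# Illustrative Seldon endpoint; the real time-to-merge service runs elsewhere.
ENDPOINT = "https://ai4ci.example.com/api/v1.0/predictions"

# Encoded feature vector for one PR (values made up for illustration).
payload = {"data": {"ndarray": [[1, 0, 1, 4, 0, 1, 0, 250]]}}

response = requests.post(ENDPOINT, json=payload)
response.raise_for_status()
print(response.status_code)  # expecting 200

# Seldon returns class probabilities; pick the most likely time-to-merge bin.
probs = response.json()["data"]["ndarray"][0]
predicted_bin = max(range(len(probs)), key=probs.__getitem__)
print(f"Most likely time-to-merge bin: {predicted_bin}")
```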
So I'm going to run this cell. We get a response 200, which I'm happy about; it means the model is returning a response. What it returns here is basically the probability of the time to merge of that PR falling into each of ten categories. And what are these ten categories? They are basically: is it under three hours? Under six hours? Under 15 hours? Under a day? Depending on the different hour windows, we define ten independent categories. Moving on, we see which category the PR is most likely to fall into, depending on the prediction. As you can see here, it says it's in the second category, index two in the array, which is under 15 hours. So this is the time-to-merge model in action, and it can now be integrated into a GitHub PR using a bot to provide live time-to-merge estimates for a PR.

Moving on, there you go. We saw a demo of how we calculate key performance indicator metrics, how we train these models, and what these dashboards look like. Now it's time to understand how you can engage. If you want to engage with this project, there are multiple ways to get started, and we have compiled a list of them at that URL, which contains links to all the documentation and everything you would need to get started. For example, you can interact with and leverage the various open source CI data sources that we work with on this project; all the data collection scripts and exploratory analysis are available. There are also interactive and reproducible notebooks for this entire project that anybody can get started with, available on the Operate First JupyterHub environment. We also have interactive dashboards for you to view the various KPIs and metrics, and the interactive model endpoint is available for the GitHub time-to-merge ML model. We also run automated AI/ML workflows using Kubeflow Pipelines and Elyra, like you saw; if you wish to run ML workflows and automate your Jupyter notebooks, you can follow a guide that we have compiled. Finally, to learn more about the different analyses and notebooks within this project, we have a YouTube playlist where we record many of our demos and explanations for different parts of this project, so definitely do check that out.

And this is an open source project that we started within our small team at Red Hat. If this project is useful to you, or if you're working on similar efforts, we strongly encourage contributions. There are various ways you can contribute: build on our existing notebooks, or contribute additional KPIs, metrics, and analyses. If you would like to develop additional KPI metrics, we have a video tutorial and a helper notebook that outlines the template for a KPI metric contribution. If you would like to contribute to an existing ML workflow or model and improve it, we highly encourage that. Or, if you want to add your own ML analysis and model, we also have issue templates for getting started with contributing additional ML analyses.

So that's all for today. Here are the links to the various places where we talk and post about this project, so feel free to check them out. Thank you so much for joining us today. Please let us know if you have any questions. Yes? Thank you.
So, on the contribution side: in some of the contribution and getting-started guides, we have also outlined how you can onboard your own CI data source, independent of TestGrid. Maybe you're not using TestGrid; maybe you use Tekton Pipelines and report results to a different kind of storage. We have some guides compiled for you on how you can interface those S3 storages with these notebooks and connect them to a dashboard. Not everybody is using the TestGrid UI, and of course, if you plan on using the TestGrid UI, the TestGrid community also has some very good guides and documentation. But if you use an independent CI data platform, we have documentation on how you can onboard it into this ecosystem.

Yes? So the question, if I understand it, is: does this integrate with GitHub and GitLab CI/CD? Got it, so that's a good question. Technically, this should integrate with any pull request, be it GitHub or GitLab; the structure remains the same, and underneath it all are Git principles. This project lives on GitHub, which is why we have primarily been bootstrapping it with GitHub communities. But on that note, we have a tool called the MI Scheduler, an internal tool that essentially collects data from GitHub repositories, and it also interfaces with GitLab repositories. So I guess it's all about where we are collecting that data from.

Yes, Karsten. Thank you. You mentioned the mindset change and how we're bringing those cultures together. Can you talk a little bit about what you think that's going to take, especially given your experience in operations? What kind of things do you think it will take, what would be helpful, and what would be the next steps for me or other people to help make that change happen? As an operations person, I buy it: wow, yes, I'd love to have you use these tools.
Right, so I think that's a great question, and I think it's the same cultural shift I talked about with dev and ops: those two were very different cultures, and they came together to create this very cohesive DevOps culture. The set of AI tools and the whole data science domain likewise have some very interesting, very cool tools, but they are not being leveraged that often in the operational domain. That can also be observed from the fact that there are not a lot of CI data sources in the AI domain, so data scientists do not typically work with operational data. If you look into the various open data sources that exist today, you will not find real-world production data sources. That's a barrier we are trying to break with the Operate First cloud, which makes all this data available for data scientists to start using. I think that's the first step we are taking towards creating data science tools on a real system, rather than on fictional CI data sources or old, stale, no longer relevant ones. It's a small step, and it will take some time; I think we have a long way to go.

Yes, Mathew? No, no, this is all being bootstrapped. We are a very new project, just been around for a few months, but the goal is to feed this back into the OpenShift Origin community. We initially started this with some members from the OpenShift team, and the goal is to feed all of these tools back into their ecosystem, so that's definitely on the radar.

Just to add, I wanted to ask Deepthi: what is the CI tool that you guys are using? Okay. That's awesome; I think that's very much within our ecosystem, because some of our sister teams use GitHub Actions and log data on different platforms, so I think this is very relevant. We should definitely connect.

Alright, there are no other questions. Thank you so much, everybody, for joining, really appreciate it. Thank you, everyone.