We are back with the machine learning and AI track today, and now let's welcome Oindrilla Chatterjee and Aakanksha Duggal to speak about AI for CI: speeding up your CI/CD processes using AI operations. They have sent a prerecorded video that I'm going to share in a minute.

Have you ever found yourself lost in logs and dashboards while getting to the root of build or test failures? Have you considered monitoring your CI processes using AI tools? How do you like the idea of having an intelligent monitoring tool that can help you analyze all these failures? Well, in this session, I, Aakanksha Duggal, and my co-presenter, Oindrilla Chatterjee, are going to introduce you to our project, AI for CI, and walk you through the tools that we are building, which can be used to better support CI/CD processes.

Hello again. I am Aakanksha Duggal. I'm from Boston, United States, and I work as a data scientist in the Office of the CTO, on the AIOps team at the AI Center of Excellence at Red Hat. And hi everyone, I'm Oindrilla Chatterjee. I'm also from Boston, United States, and I also work as a data scientist in the Office of the CTO, on the AI Center of Excellence team at Red Hat, with Aakanksha.

By the end of this presentation, you should have an idea of how you can use AI for CI and its set of AI tools to monitor your CI/CD workflows. But before we go there, we will start by discussing what the project is about, the story of how we got here, and the various initiatives that support this process.

So what's AI for CI? Before we get into AI for CI, though, let's start by understanding the AIOps mindset. AIOps stands for artificial intelligence for IT operations. AIOps is a critical component of supporting any open hybrid cloud infrastructure. In our opinion, it's mostly a cultural change, very much like what we saw with the DevOps movement, where we combined the two cultural mentalities from dev and ops to create a new culture, the DevOps culture. The same cultural change is going to happen with AIOps. The DevOps culture and the data science culture use different tools and have somewhat different mindsets. Our aim with AIOps is to bring these two cultures together and learn from each other. We aim to embrace the intelligent tooling that we have in the AI world and apply it to the operational domain.

And what's CI/CD? Continuous integration, or CI, is the practice of automating the integration of code changes from multiple contributors into a single software project. CI/CD, or continuous integration and continuous delivery, is a solution to the problems that integrating new code can cause for development and operations teams, also known as integration hell.

So AI for CI is simply AI applied to CI/CD data. It's an intelligent, open source AIOps toolkit which can be used to better monitor builds in order to help the development lifecycle. The goal of this project, AI for CI, is to build AI tools for developers by leveraging the open data made available by the OpenShift and Kubernetes CI platforms. As part of AI for CI, we have built a collection of open source AIOps tools which can help support a CI/CD process. Open operations or CI/CD data coming from real-world production systems is a rarity among public data sets today, so this presents a great starting point and a first area of investigation for the AIOps community to tackle. We are focused on cultivating an open source community that uses open operations data and an open infrastructure for data scientists and DevOps engineers to collaborate.
Some of our current initiatives include collecting data from various open data platforms and creating a community around open CI data sources; quantifying and evaluating the current state of the CI workflow using multiple key performance indicators; and building AI and ML techniques to improve the CI workflow. And finally, we are heavily invested in creating a reproducible end-to-end workflow around our data collection, analysis, and modeling efforts using technologies like Elyra and Kubeflow Pipelines, all of which is developed, built, and operated on the Operate First environment.

What's Operate First? We build this AIOps community and develop the necessary tools on the Operate First cloud. Marcel Hild gave a great talk yesterday on how to open source cloud operations; I'd recommend going back and finding his talk recording, but in short, Operate First is an initiative to operate software in a production-grade environment, bringing users, developers, and operators closer together. It uses the same community-building process as open source projects, but extended to ops procedures and data. Operate First enables collaboration between open source developers and cloud providers, and AIOps supports this collaboration by creating a new set of tools around CI/CD processes, which could otherwise be a pain point during the development and production of open source projects.

So let's dive into the various open data sources being used in this project. As we discussed earlier, one of the goals of this project is to build an open AIOps community involving open data sources which originate from different segments of a CI process. To understand these different segments, let's take OpenShift, an enterprise Kubernetes container platform, as an example. OpenShift consists of hundreds of repositories, and thousands of contributors to the project contribute via pull requests on GitHub. Every new pull request to a repository with new code changes is subjected to an automated set of builds and tests before being merged. Prow is the central component of this automation. It is a Kubernetes-based CI/CD system. The Kubernetes testing group defines Prow as a CI/CD system built on Kubernetes, for Kubernetes, that executes jobs for building, testing, publishing, and deploying. It is seamlessly integrated with GitHub via hooks which can trigger automated CI/CD jobs for GitHub PRs. TestGrid is a platform used to aggregate and visually represent the results of all the automated tests. The ultimate hope for this work is that we will be able to connect these various data sources, like GitHub code changes, the filed bugs, the TestGrid visualization platform, and the Prow data sets, in order to develop a complete picture of a CI process.

The first open data set that we will look into is the GitHub data source. The builds and test runs in the CI process are triggered by code changes happening in the application's code base, and the goal of CI is to automatically identify whether any of these code changes will cause problems for the deployed application. Therefore, information on running tests and builds, along with information within GitHub code repositories, such as metadata and diffs about the PRs, can give us more insight into the overall CI process and ultimately lead us to the root cause of any build failures or issues within the software development process.
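To make this concrete, here is a minimal, illustrative sketch of pulling PR metadata from the public GitHub REST API. This is not the project's internal collection tool; the repository name, token handling, and field selection are placeholders.

```python
# Hedged sketch: fetch closed PRs for a repository from the public GitHub REST API
# and derive how long each merged PR took from creation to merge.
import os
import requests
import pandas as pd

REPO = "openshift/origin"  # example repository
headers = {"Accept": "application/vnd.github+json"}
if os.environ.get("GITHUB_TOKEN"):  # optional token to raise the API rate limit
    headers["Authorization"] = f"token {os.environ['GITHUB_TOKEN']}"

prs = requests.get(
    f"https://api.github.com/repos/{REPO}/pulls",
    params={"state": "closed", "per_page": 100},
    headers=headers,
    timeout=30,
).json()

rows = []
for pr in prs:
    if pr.get("merged_at"):  # keep only PRs that were actually merged
        created = pd.Timestamp(pr["created_at"])
        merged = pd.Timestamp(pr["merged_at"])
        rows.append(
            {
                "number": pr["number"],
                "title": pr["title"],
                "time_to_merge_hours": (merged - created).total_seconds() / 3600,
            }
        )

df = pd.DataFrame(rows)
print(df.head())
```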
In an attempt to quantify critical metrics within a software development workflow, starting with code contribution, we prototyped a model which can predict the time to merge of an open pull request, which is basically the time it takes from the creation of a PR to the merging of the PR. A metric like this can help identify bottlenecks within the development process. For example, having an estimate of how long it will take for a PR to merge can help developers and engineering managers better allocate resources to certain PRs and speed up the whole engineering process. Let's briefly look at the workflow we follow for this. To predict the time it will take to merge a new PR on a repository, we frame this as a classification problem: does the time taken to merge a PR fall within one of a few predefined time ranges? As a first step, we collect a data set consisting of pull requests and related metadata from the OpenShift Origin repository using one of our internal tools. Then we transform the input columns obtained from the pull requests, such as the size of the PR, the types of files added, and the description of the PR, into various features which can be ingested by an ML model. We then try various vanilla classifiers to classify the time-to-merge value of a PR into one of ten bins or classes using the features engineered from the raw PR data. And finally, we deploy the model yielding the best results as an interactive service using Seldon. This endpoint is available for anybody to interact with and test out on new PRs. Once integrated with a GitHub repository, this service can provide newly submitted PRs with a time-to-merge estimate. Next, my colleague Aakanksha will go over the other data sources.

Next up, we have the Prow data source. So what is Prow, and how is it linked to the tests and builds? Prow is a Kubernetes-based CI/CD system where jobs can be triggered by various types of events and report their status to many different services. TestGrid provides the results of tests, but if you want to triage an issue and see the actual logs generated during the build and test process, you need to access the build logs generated by Prow and stored in Google Cloud Storage. So what does the data look like? This data set contains all the log data generated for each build and job, as directories and text files in remote storage. In order to understand the data set of events and build logs from Prow, we download them programmatically in a JupyterHub environment, which is a tool used by data scientists to run interactive Jupyter notebooks, perform initial exploratory data analysis, and apply techniques like term frequency-inverse document frequency (TF-IDF) in order to cluster build logs based on the type of log.

The logs represent a rich source of information for automated triaging and root cause analysis. But unfortunately, these logs are a noisy data type: two logs may be of the same kind but come from different sources and differ at the character level, so traditional comparison methods are insufficient to capture their similarity. To overcome this issue, we use the Prow logs to learn log templates that denoise the log data and help improve performance on downstream ML tasks. So we start by applying a clustering algorithm to job runs based on the term frequency within their build logs, to group the job runs according to the type of their failure.
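As a rough sketch of what this clustering step might look like, assuming the build logs have already been downloaded as local text files; the paths, cluster count, and vectorizer settings below are illustrative, not the project's exact configuration.

```python
# Hedged sketch: cluster raw build logs by term frequency so that job runs failing
# for similar reasons land in the same group.
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Assume one text file per job run, previously downloaded from remote storage.
log_paths = sorted(Path("build_logs").glob("*.txt"))
logs = [p.read_text(errors="ignore") for p in log_paths]

# Turn each log into a sparse TF-IDF vector over its tokens.
vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
X = vectorizer.fit_transform(logs)

# Group job runs with similar term profiles, e.g. platform failures vs. setup failures.
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)

for path, label in zip(log_paths, labels):
    print(label, path.name)
```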
For example, different clusters represent groups of builds that fail due to platform failures, or failures caused during preparation steps before the tests start to execute. These kinds of clustering algorithms are also able to differentiate between logs which have different semantic structures that may not be critical to a developer, such as clustering obfuscated and non-obfuscated builds separately.

Now, let's look into another open data source, which is TestGrid. As we briefly introduced earlier, TestGrid is a visualization platform for CI data. It is an open source project developed by Google to help people visualize their CI processes in a grid. It is used by a number of communities to track the status of tests and builds in a visually friendly format. TestGrid primarily reports categorical metrics about which tests passed or failed during specific builds over a configurable time window. So what does it display, and where does it fit in the context of a CI process? We will look into this in further detail, but a typical TestGrid dashboard is an aggregation of multiple tests over a time period, and each cell in the grid represents whether a certain test is passing (green) or failing (red). In a typical development lifecycle, once a developer has opened a pull request with some commits containing changes, those changes go through a CI platform which runs a series of tests and builds, and the results of the tests are displayed here on TestGrid.

Now, in order to quantify the current state of the CI workflow and better identify any gaps within the CI processes, we can calculate certain key performance indicator metrics related to the test runs. Calculating relevant metrics and key performance indicators related to the running tests can help get to the root of problematic build failures, help discover recurring patterns within the running tests, and ultimately help developers with their development workflows.

Now let's take a look at the TestGrid workflow. We start by collecting data. As we discussed earlier, TestGrid has information about the status of multiple tests running over a period of time; the data set contains categorical metrics about these test runs. So again, to gain more insight and work with this data programmatically in a JupyterHub environment, which is a tool used by data scientists to run interactive notebooks, we need a way to access the visual grids displayed on TestGrid programmatically. For that, we create a connection from the Jupyter notebooks to the TestGrid URL and download selected dashboards from TestGrid by scraping the HTML. This allows us to dig deeper into various features of the data and gain an in-depth understanding that wouldn't be obvious just by looking at the dashboards. Next, we calculate the metrics and apply some ML and AI techniques. Once we have collected the data, our goal is to apply AI or machine learning techniques to improve the CI workflow. But first, we start by applying certain analyses, aggregating the various tests and detecting patterns in the data, which can help quantify and evaluate the current state of the CI workflow.
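For illustration, here is a small sketch of this kind of aggregation over a TestGrid-style grid. The grid itself is made up, and the status encoding shown (1 = pass, 12 = fail) follows the convention described later in the demo; any other value is treated here as "did not run".

```python
# Hedged sketch: treat a downloaded TestGrid grid as a 2D array (tests x runs) of
# status codes and rank tests by how often they fail.
import pandas as pd

PASS, FAIL = 1, 12

# Rows are individual tests, columns are successive build runs (example data).
grid = pd.DataFrame(
    [[1, 1, 12, 1, 12],
     [1, 0, 1, 1, 1],
     [12, 12, 12, 1, 12]],
    index=["test-a", "test-b", "test-c"],
)

ran = grid.isin([PASS, FAIL])                      # ignore cells where the test did not run
fail_rate = (grid == FAIL).sum(axis=1) / ran.sum(axis=1)

# Tests with the highest observed failure rates are the first candidates to investigate.
print(fail_rate.sort_values(ascending=False))
```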
For instance, if developers and managers want to manage the allocation of resources for CI processes and save those resources, it is vital to know whether a particular test has a higher probability of failing, or whether we can find an optimal stopping point of a build run after which the build would almost certainly fail. To do that, we calculate some relevant metrics and key performance indicators, which not only help us evaluate the AI-based enhancements we make to the CI processes but also pinpoint to developers which specific areas need the most improvement and therefore should be devoted more resources.

Next, we have an automated pipeline to do this in a recurring fashion. After the collection of data, model training, and all the other parts of the machine learning workflow, we have to ensure that these tasks run sequentially and continuously. So we automate the sequential running of the notebooks using a simple workflow, using tools like Elyra and Kubeflow Pipelines, which is a platform for building and deploying scalable machine learning workflows; this lets us run our notebooks in an automated fashion. Finally, in order to help developers and stakeholders view KPIs, metrics, and aggregated results of their tests visually, we create automated dashboards that help analyze the status of multiple tests and investigate the problematic tests, builds, or jobs.

Now let's take a look at the demo. For the demo, we can quickly go to the TestGrid platform itself, which is an aggregation of test and build results. If we click on Red Hat, we see that different versions of the OpenShift project use TestGrid to view the results of their CI processes. As you can see here, there are a number of different tabs for different releases of OpenShift, and each is further divided into informing, blocking, or broken, which are the different test suites being used here. If we go to a particular tab or dashboard, such as Red Hat OpenShift OCP release 4.1 blocking, we can see that there is a signal for whether the tests are passing, failing, or flaky; for this particular dashboard, most are failing. To look into these tests themselves and view them in a grid, let's just click on them. Now we can see the grid view, which displays the results of the tests run at different times. What you're seeing is really whether or not a certain test passed or failed at a certain point. So this is essentially a 2D array which we, as data scientists, can do some analysis on, and to get this data into a form suitable for analysis, we access it programmatically in a JupyterHub environment.

So let's take a look at that. This is the JupyterHub environment, and we already have some notebooks ready here. We're going to start with collecting the data through the notebook called get_raw_data. In this notebook, we fetch the relevant data from testgrid.k8s.io using basic HTML scraping techniques. We start by connecting to the URL and fetching all the associated dashboard names. Then we iterate through all the dashboard names to collect the associated dashboard data, and after collecting the data, we finally store it on Ceph to enable further analysis.
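A minimal sketch of what this collection-and-storage step could look like; note that the notebook in the talk scrapes the dashboard HTML directly, whereas this sketch assumes a JSON summary endpoint, and the dashboard name, bucket, endpoint URL, and credentials below are all placeholders.

```python
# Hedged sketch of the "get raw data" step: download TestGrid content for one
# dashboard and push it to S3-compatible (Ceph) storage for downstream notebooks.
import json
import os
import boto3
import requests

DASHBOARD = "redhat-openshift-ocp-release-4.6-informing"  # example dashboard name

# Assumed JSON endpoint exposing per-tab summaries for a dashboard.
resp = requests.get(f"https://testgrid.k8s.io/{DASHBOARD}/summary", timeout=30)
resp.raise_for_status()
summary = resp.json()

# Store the raw payload on Ceph via its S3-compatible API.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["S3_ENDPOINT_URL"],
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)
s3.put_object(
    Bucket=os.environ.get("S3_BUCKET", "ai4ci"),
    Key=f"testgrid/raw/{DASHBOARD}.json",
    Body=json.dumps(summary).encode("utf-8"),
)
```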
Now let's take a look at how we compute metrics from this raw TestGrid data. In order to ensure consistency, we have a helper notebook which outlines the template for a KPI metric contribution, if you would like to contribute to the work of developing additional KPIs and metrics. In this metric template notebook, we import the raw data stored in S3 storage. We also introduce some helper functions that encode which test status each value in TestGrid corresponds to, and then encode the data set into lists that can be used to create a data frame with the metric values. For example, in this notebook we have performed a calculation to find the number of flakes for a given dashboard, grid, or test.

To evaluate the current status of the CI workflow, we have a set of metric notebooks which compute a number of such metrics. Let's look at one metric notebook in detail: the build pass/failure metric. In this notebook, the key performance indicator that we would like to create greater visibility into and track over time is the percentage of builds that passed or failed. This can be used to capture the build success rate, that is, the number of successful builds or deployments relative to the total number of builds and deployments. We start off by importing the libraries and helper functions from the metric template notebook, and then we import the raw data from Ceph. In order to perform the metric calculation, we first find all the tests which are failing, that is, which have a status code of 12, as mentioned in the metric template notebook. After doing this, we collect all the tests that are passing and have a status code of 1. Now that we have lists of tests that were passing and failing, let's calculate the build pass and build fail percentages. As we can see here, the build failure percentage is around 48.68% and the build pass percentage is around 50.94%. After the metric calculation step, we use a Python visualization package to plot the build pass and build fail percentages across time, and this is what it looks like. We also have interactive visualizations for each of these metrics, which we will take a look at shortly when we demonstrate the Superset dashboard. After all of this, we combine the build passing and build failing metrics into one data frame and then store it in Ceph S3 storage.

We saw how to compute various metrics from the raw TestGrid data. Now it's time to run all of these notebooks in automation. We have already created an ML pipeline using Elyra. This ML pipeline consists of two steps which run sequentially: the first step is the get_raw_data notebook, which downloads data from TestGrid, and the second is the set of metric calculation notebooks, which run in parallel. We can trigger this pipeline from the Elyra UI and move over to Kubeflow Pipelines to view the running jobs. As you can see here, the pipeline called AI for CI demo was triggered and has started running. Once completed, the pipeline steps display a green check mark next to them, much like this. We can automate this pipeline to run on a recurring basis, such that the TestGrid raw data collection and metric calculation steps run daily and store the data on S3 storage.

We now move over to the Superset dashboard, where we have created interactive charts to display the previously calculated metrics from the data stored on S3. For example, these charts tell us the total number of test cases, the number of tests that passed, the number of tests that are failing, and so on. We can also view metrics such as the mean length of failure, which is how many times the build or the test suite was run before a failing test started to pass again, and the mean time to fix, which is how much time was taken before a failing test started to pass.
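To make the mean length of failure concrete, here is a small illustrative calculation over a made-up status history for a single test, using the same status encoding as above (1 = pass, 12 = fail); this is a sketch of the idea, not the project's notebook code.

```python
# Hedged sketch: compute the "length of failure" for one test from its ordered
# status history, i.e. how many consecutive runs it failed before passing again.
import numpy as np

PASS, FAIL = 1, 12
history = [1, 12, 12, 12, 1, 1, 12, 1]  # oldest run first (example data)

failure_lengths = []
run = 0
for status in history:
    if status == FAIL:
        run += 1                      # still inside a failing streak
    elif status == PASS and run > 0:
        failure_lengths.append(run)   # streak ended with a fix
        run = 0

print("failure streak lengths:", failure_lengths)            # [3, 1]
print("mean length of failure:", np.mean(failure_lengths))   # 2.0
```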
Metrics like this can provide engineering managers with insights such as which tests or platforms have the most long-lasting failures, or how long it takes for a failing test to start passing again. We can also filter this dashboard by tabs, grids, and tests to view metrics specific to our product and tests of interest. For example, if we filter by the grid periodic CI OpenShift 4.6 release, we will be able to see the number of test cases run for this particular grid and the associated metrics for it.

All right, so let's engage. If you want to engage with this project, there are multiple ways to get started, and we have compiled a list of them at the URL you see on the screen. You can interact with and leverage the various open source CI data sources that we work with on this project, along with the data collection scripts and the various exploratory analyses. There are interactive and reproducible notebooks for this entire project available right now for anybody to start using on the public JupyterHub instance on Operate First. We also have interactive Superset dashboards for you to start interacting with and viewing the public KPIs, and an interactive model endpoint available for the GitHub time-to-merge ML model which you can try out. We run automated AI/ML workflows using Elyra and Kubeflow Pipelines, so if you wish to run ML workflows and automate your Jupyter notebooks, you can follow the guide that we have compiled. And to learn more about the different analyses and notebooks within this project, you can check out our YouTube video playlist.

This is an open source project that we started within our small team at Red Hat. If this project is useful to you, or if you are working on similar efforts, we strongly encourage your contributions. There are various ways in which you can contribute to our existing notebooks or contribute additional KPI metrics or analyses. If you would like to contribute to the work of developing additional KPIs and metrics, we have a video tutorial and a helper notebook which outlines the template for a KPI metric contribution. If you would like to contribute to an existing ML workflow or model by improving it, we highly encourage that. Or if you wish to add your own ML analysis and model, we have an issue template for contributing additional ML analyses as well. So thank you for joining us for this talk. Please let us know if you have any questions.

Thank you, Oindrilla and Aakanksha. A great presentation and talk, and a very nice demo. And now we'll take any questions, if anybody has any.