Hello all, and thank you for coming. Today's session is about the data-driven methods and workflows we use to uncover project insights from your GitHub repositories. My name is Karan, and I'm a data scientist in the emerging technologies group within Red Hat. And hey, everyone, my name is Oindrila Chatterjee. I'm also a data scientist working with Karan in the emerging technologies group in the office of the CTO at Red Hat.

Before we move on, here's a quick overview of what we're going to cover today. We'll start by introducing project metrics and how they can support community health. We'll then introduce our project, AI for CI. After that, we'll give you a brief introduction to the Operate First Cloud, which we use as part of this project, and walk through our time-to-merge prediction tool and the ML workflow we followed to build it. Finally, we'll go over the next steps for this project and how you can engage.

It's no secret that Git repositories can reveal a lot about a project's health and community. We can derive insights such as the velocity of a project, the blockers during development, and community engagement. Metrics such as member churn, bottlenecks in contribution, average time to triage, and time to merge can help software development teams, and also open source program offices, better allocate resources to a project. They can also help them evaluate an ongoing project's success, help the team advocate for the work it is doing, and help analyze the growth of the development community as well as the journey and growth of the contributors associated with the project.

So how do we calculate these metrics? We derive the data from code repositories, that is, from the Git repos themselves. Now, what we've described so far is mostly retrospective analytics on a project's development. But how can you use AI or machine learning to aid a project's development? For that, we introduce a couple of tools we have built. The first is the time-to-merge prediction model, an ML model that predicts how long a newly opened pull request will take to merge. We have also built models such as the optimal stopping point for a given long-running test, which is the time after which continuing to run a particular test is no longer efficient, because it is most likely going to fail. You can use ML services such as these to aid your software development process and also to help with community growth.

That brings us to our project, AI for CI, and the whole toolkit. This project is a collection of open source tools that includes scripts, notebooks, dashboards, metrics, and data sources. For this project, we collect data from various CI/CD data sources, including GitHub repositories, display the metrics we collect on dashboards, and build ML services and tools like the ones I discussed earlier that can help with a project's development process. We'll shortly go over one of these ML services, the time-to-merge prediction model, and discuss the ML workflow we followed and how it can support an open source project. We built all of these AI tools and services on the Operate First Cloud.
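To make a metric like time to merge concrete, here is a minimal sketch, not our actual pipeline, of how it could be computed from raw pull request records with pandas. The field names follow the GitHub API's pull request objects, but the sample data is made up:

```python
import pandas as pd

# Toy PR records shaped like the GitHub API's pull request objects;
# all values here are made up for illustration.
prs = pd.DataFrame([
    {"number": 101, "created_at": "2021-05-03T10:00:00Z", "merged_at": "2021-05-03T14:30:00Z"},
    {"number": 102, "created_at": "2021-05-04T09:00:00Z", "merged_at": "2021-05-06T09:00:00Z"},
    {"number": 103, "created_at": "2021-05-05T12:00:00Z", "merged_at": None},  # never merged
])

prs["created_at"] = pd.to_datetime(prs["created_at"])
prs["merged_at"] = pd.to_datetime(prs["merged_at"])

# Time to merge per PR; unmerged PRs become NaT and are skipped by mean().
prs["time_to_merge"] = prs["merged_at"] - prs["created_at"]
print(prs["time_to_merge"].mean())
```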
The Operate First Cloud provides us with tools, services, and ML toolkits, such as Jupyter notebooks, Apache Superset, S3 data storage, the Trino database engine, and model serving infrastructure, that help with the different segments of the ML workflow.

So what is Operate First, and what does it actually mean? As data scientists in the emerging technologies group at Red Hat, we work closely with the Operate First initiative. Many of you might have heard about it in previous sessions. If you haven't, the main idea behind Operate First is that while open source has made software accessible to everyone, the knowledge of how to actually run it in a production cloud environment is not widely accessible. Things like how to install something on the cloud and how to perform updates, the SRE details, are not always made transparent and obvious by cloud providers. This creates a barrier to entry to running that application or software, and that's one of the things Operate First seeks to address. Under this initiative, we want to actually deploy, run, and manage applications in an open source community cloud. First of all, this makes operations and SRE best practices available and accessible to everybody. And since we are running the software ourselves firsthand, we can take the lessons we learn from this open source community cloud and put them back into building the application. This makes our applications easier to run and manage for everybody.

So why should you, as a developer, a data scientist, or a data science manager, care about Operate First? One of the major workloads we run on the Operate First Cloud is the Open Data Hub project, which is essentially a set of tools for doing open source, cloud native data science. If you are a member of an OSPO team or a data science manager, this is a great platform for enabling your data scientists. It means that as a data scientist, you have public access to a collaborative, reproducible, open cloud environment where you can do all of your data science work. So yay. You also have access to a real platform with instances of cloud services and applications such as JupyterLab, S3 storage, databases, and the model serving infrastructure I mentioned earlier. And secondly, since these workloads run publicly, you also have access to the data we collect about those open operations, from which we can create an operations data set. This can contain data such as what the memory usage patterns of a particular application look like before it starts to fail. That creates a great opportunity to leverage this open operations data set to better understand operations and build AIOps applications on top of it.

And this is exactly what we are trying to do in our project, AI for CI. So let's take a look at the workflow we follow and the components this project consists of, so you have an idea of how we went about building it. In the interest of prototyping this workflow, we started with the OpenShift Origin GitHub repository, which was one of the repositories of interest, and we collected its pull request data over the years.
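As a rough illustration of that collection step, here is a minimal sketch, not the MI tool we actually use, that pages through a repository's closed pull requests with the public GitHub REST API. The token and output path are placeholders, and it only fetches a few pages to stay within rate limits:

```python
import json
import requests

# Placeholder token; unauthenticated requests are heavily rate-limited.
TOKEN = "ghp_your_token_here"
headers = {"Authorization": f"token {TOKEN}"}

prs, page = [], 1
while page <= 3:  # only the first few pages, for illustration
    resp = requests.get(
        "https://api.github.com/repos/openshift/origin/pulls",
        params={"state": "closed", "per_page": 100, "page": page},
        headers=headers,
    )
    resp.raise_for_status()
    batch = resp.json()
    if not batch:  # no more pages
        break
    prs.extend(batch)
    page += 1

# Dump the raw PR data, similar in spirit to MI's JSON output.
with open("openshift_origin_prs.json", "w") as f:
    json.dump(prs, f)
```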
We then transformed the input columns coming from the pull requests, such as the size of the PR, the types of files in the PR, the number of files, the description, who created the PR, and various other attributes like that, into features that can be ingested by a machine learning model. These features can also be used to derive some interesting metrics about the pull requests themselves, which provides better visibility into the software development process.

Now, to view the statistics, KPIs, and metrics that we collect from this data in a visual manner, we create automated dashboards that give better visibility into aspects of the project such as the journey of contributors over time, triaging and merge statistics over time, and the growth of the project over time.

Let's take a look at the tools we use to create these dashboards and the visualizations I mentioned earlier. We collect data from Git repositories using a tool called MI, part of Project Thoth, which I'll discuss shortly. We then use Jupyter notebooks to explore the data, extract meaningful features from these pull request columns, and create SQL tables on a Trino database engine. A Jupyter notebook is a collaborative environment that data scientists are pretty familiar with; it helps you prototype notebooks and build proofs of concept. Finally, we import the tables we create in Trino into Apache Superset to build charts and visualizations. All of the tools I mentioned here have been deployed and are available on the Operate First Cloud.

For gathering the GitHub data, we use a project called MI, which is part of Project Thoth, an open source project within our group at Red Hat. We use this tool to collect data from the GitHub repositories of interest. It's also available as a PyPI package. You can specify the GitHub repositories of interest, the features you want to collect, like issues and pull requests, and the date range you want to cover, and it provides you with a JSON-formatted dump. You can also use this tool to collect metadata about the repositories. You can then store this data on S3 storage and use it for model training, visualization, and all sorts of things.

So let's take a look at the dashboards we created. One of the first dashboards we created, mostly for demo purposes, comes from the OpenShift Origin GitHub repository. As you can see here, it shows metrics such as the number of pull requests appearing over time, the mean time it takes to merge these pull requests, some size-related KPIs, like the size distribution of pull requests over time, and some contributor-related KPIs, for example, who the top PR creators or the unique PR creators are over time. You can also view the contributor stream over time and how it varies, as well as more detailed metrics, such as the number of commits on each PR and how that varies over time, the number of files changed over time, and so on.

We are also applying some of these workflows to one of our internal org repos, Thoth Station. As you'll see from this dashboard, this is a much smaller group or community, so the metrics differ vastly from the OpenShift Origin charts.
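As a hedged sketch of that transformation step, not our exact notebook code, here's how raw PR records could be turned into model-ready feature columns with pandas. The record shape and column names here are hypothetical, not MI's exact schema:

```python
import pandas as pd

# Illustrative raw PR records; in practice these would come from the JSON dump.
prs = pd.DataFrame([
    {"created_at": "2021-05-03T10:00:00Z", "additions": 120, "deletions": 30,
     "changed_files": 4, "body": "Fix flaky e2e test",
     "filenames": ["test/e2e.py", "README.md"]},
])

prs["created_at"] = pd.to_datetime(prs["created_at"])

features = pd.DataFrame({
    "size": prs["additions"] + prs["deletions"],        # overall PR size
    "num_files": prs["changed_files"],                  # number of files touched
    "desc_len": prs["body"].str.len(),                  # length of the description
    "touches_docs": prs["filenames"].apply(             # file-type signal
        lambda fs: any(f.endswith(".md") for f in fs)),
    "is_weekend": prs["created_at"].dt.dayofweek >= 5,  # weekday vs. weekend
})
print(features)
```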
So you can see issue-related metrics here, like the number of open issues, the issues currently being worked on, and the number of triaged and closed issues. All of this can be enabled by labeling your PRs or issues, and it gives the maintainers and the folks working on triage a higher-level view, which they can drill down into to see the current state of affairs. You can also see metrics such as the minimum time it has ever taken to close an issue, the maximum time, who the first-time contributors on the issues are, the mean time to close an issue, and the number of issues and contributors as a weekly trend. The same kinds of metrics are repeated for pull requests: the minimum time to close PRs, the maximum time, the time to first approval of a PR, and the weekly trend of contributors. We can also apply time filter ranges, filtering over the last quarter, the last week, and so on, and you can see how that changes some of the metrics. For some of these repos there were actually no PRs in the last quarter, only some issues; it's a small community, so of course the numbers vary greatly. These are some very simple metrics, but they can obviously be extended, and as a next step we want to show how you can bring in AI, with the help of some machine learning tools, to augment these metrics and provide more predictive analytics for GitHub PRs and issues. So with that, I will hand it over to Karan.

All right, thank you, Oindrila. Let me just go back to the slide. Cool. So just to recap, so far we've seen the workflow for collecting raw data from GitHub, cleaning it and pushing it to a Trino database, and then creating visualizations out of it to see some key metrics for your project. In the next part of the session, I want to focus on an orthogonal workflow that branches off from this step, using which you can create and deploy machine learning models, essentially intelligent services that assist you in project planning or in improving the community and developer experience. Specifically, we'll talk about how we extract features out of the raw PR data we collect, then the kinds of models we explored, and finally how we deployed our model as a service so that it can actually be used in your workflow.

There are different kinds of ML services you can create following this workflow. For example, you could have a service that tells you the best set of reviewers for a PR, or how much time you'll have to wait before someone comments on your issue, and so on. For this particular project, the service we created is called the time-to-merge prediction service. The idea is, given a PR, we want to estimate how much time, or how much effort, is going to go into getting that PR merged. One of the main motivations is to help a project lead, scrum master, or similar persona triage incoming issues and PRs on a repo, and to give them a sense of the workload their team is under over a given sprint.
And on the other hand, as a developer or contributor, having this kind of information helps you manage your own time and resources better. So those were some of the main reasons for building this service.

To frame this as a machine learning problem, we considered two main approaches. One approach is to set it up as a regression problem, where the outputs would be real values, something like "this PR will be merged in 36.75 seconds." The alternative was a classification approach, with less granular timeframes: time intervals instead of exact values, for example "this PR will be merged in one to two days," or "in three to five days," and so on. So which one is the right approach for you? We think it depends on the use case for your particular service or project. But based on our conversations and our analysis of the initial data, it seemed to us that splitting the target into discrete classes and treating it as a classification problem was the way to go.

This means our machine learning setup looks somewhat like this: a raw PR comes in, a model performs some feature transformations and pre-processing, and then it produces a prediction that looks like one of these labels, such as "merged in zero to three hours," "merged in three to six hours," and so on.

Cool, so that's the overview of the ML setup. Now let's talk about how we actually create the model in the middle. The first step is to engineer features out of the raw data. From the raw PR data, we look at things like the number of lines of code added, the number of files changed, and what kinds of files are being changed: are you changing a README, a config file, or making edits in the main.py file? We thought these features would provide a strong signal for predicting time to merge. We also look at the timestamps, for example whether the PR was created on a weekday or a weekend. The chart over here shows which features were important in making these predictions. This green bar shows the importance of the weekday-versus-weekend feature, and intuitively that makes sense: if you create a PR on a weekend, it's going to take longer to get it merged into your project. So we develop a bunch of features like that out of our raw PR data, then scale them and feed them to our machine learning models.

We explored some basic models for this task: SVMs, decision trees, random forests, and gradient boosted trees. We found that in most cases a random forest seemed to outperform the rest, so we went ahead with that model and saved it to our S3 bucket for deployment later on.
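To give a flavor of that model selection step, here's a minimal, hedged sketch with scikit-learn: binning merge times into discrete classes, comparing the model families we mentioned, and serializing the winner for upload to S3. The feature matrix, class bins, and bucket name are placeholders, not our production values:

```python
import joblib
import numpy as np
import boto3
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: rows of engineered PR features and merge times in hours.
rng = np.random.default_rng(42)
X = rng.random((500, 5))
hours_to_merge = rng.exponential(24, 500)

# Discretize the target, e.g. class 0: 0-3h, 1: 3-6h, 2: 6-15h, 3: 15-24h, ...
bins = [0, 3, 6, 15, 24, np.inf]
y = np.digitize(hours_to_merge, bins) - 1

models = {
    "svm": SVC(),
    "decision_tree": DecisionTreeClassifier(),
    "random_forest": RandomForestClassifier(),
    "gradient_boosting": GradientBoostingClassifier(),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")

# The random forest won for us; retrain on all data and ship it to S3.
best = RandomForestClassifier().fit(X, y)
joblib.dump(best, "model.joblib")
boto3.client("s3").upload_file("model.joblib", "my-bucket", "ttm/model.joblib")
```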
All right, so at this point we have a model that's been trained and is ready to go. Now let's see how we can create a service, a deployment, out of this trained model. The first thing we do for model deployment is to write a small Python class, a small script, that defines how your model is going to be loaded; in our case, it just downloads the model from an S3 bucket. We also define how the prediction is made for given input data. Since we had already packaged our model into a pipeline, for us that was just one or two lines of code calling model.predict on the incoming data. We write this small Python class and then bake it into a container image using S2I, or source-to-image. This container uses Seldon as the base image. Seldon is essentially a framework, a set of tools, that helps you deploy your models as services. It has all the logic for dealing with incoming requests, so it defines the HTTP server for you; in this case, it tells the service to pull the model from S3 and then call the predict function. Once we have defined this container image, we create a Seldon deployment, which is just one YAML file, and that Seldon deployment goes ahead and creates the machine learning service for us. Now we can just expose it on a route, and once you've done that, you have an HTTP endpoint to which you can send raw data and get predictions back as return values. So great: at the end of this step, you've created an inference endpoint that you can query for predictions.
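To make that wrapper step concrete, here's a hedged sketch of what the small Python class might look like for Seldon Core's Python server. The bucket and key names are placeholders, and in a typical s2i setup the class name would be supplied through the MODEL_NAME setting in the s2i environment:

```python
import os
import joblib
import boto3

class TimeToMergeModel:
    """Loads a trained pipeline from S3 and serves predictions via Seldon."""

    def __init__(self):
        # Download the serialized model once, when the server starts.
        # BUCKET and MODEL_KEY are placeholder environment variables.
        s3 = boto3.client("s3")
        s3.download_file(os.environ["BUCKET"], os.environ["MODEL_KEY"],
                         "/tmp/model.joblib")
        self.model = joblib.load("/tmp/model.joblib")

    def predict(self, X, features_names=None):
        # Seldon calls this for each request; X is the feature matrix
        # extracted from the request payload.
        return self.model.predict(X)
```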
Now, we could in theory stop right here. But if you think about how you want to integrate this into your development workflow, it's difficult for a developer to manually scrape the PR data every time they create a PR and then send it to this endpoint to get predictions. So to improve this user experience, we went ahead and created a GitHub bot. This bot does all of that for you: every time you open a pull request, it uses the GitHub API to collect the data from the PR, sends that data to the endpoint I showed earlier, takes the prediction from the endpoint, and puts it as a comment or a label or something like that on your PR. So this shows how you can embed ML services into your development workflow.

So far we have shown you what this logical pipeline, this workflow, looks like at an abstract level, but we haven't really talked about how you actually implement the entire workflow. So let's talk a little bit about that. To do that, we use a tool called Elyra. Elyra is a framework, a set of tools, that lets you create pipelines that you can deploy on Kubeflow or Airflow, or just run locally. Each step in the pipeline, or DAG, is a Jupyter notebook or a Python script, so it integrates really well with your data science working environment. For each of these steps, you can define the runtime image, the resources, and so on, so you can fully configure it quite easily. Elyra provides a UI for creating the DAG, shown on the left of the screen, and then you can deploy it to run as a pipeline on a Kubeflow instance, like I have on the right side.

I guess to show this working in its full glory, let me try to show a demo. It doesn't seem to agree with me. All right, there it is. This is the JupyterHub environment that I have running on the Operate First Cloud that Oindrila mentioned earlier, and we have integrated Elyra as a plugin in JupyterLab. With Elyra, you can quite literally just drag and drop the notebooks. Well, not quite like that: you have to actually start by clicking on plus, then pipeline editor. Now you can click and drag your notebooks like that, and connect them in any arbitrary fashion you want. For each of these notebooks, you can define what runtime image it should run in. If you have mostly pandas code, there are pre-built images, for example pandas 1.1; if you have TensorFlow or PyTorch code, you can use those images, and so on. In our case, we use a custom runtime image, so you can use any container that you create as a runtime for your steps; we're going to use this CI analysis image that we've created as the runtime. You can also define requirements like CPU, GPU, and memory, as well as any environment variables or file dependencies your step might have.

I'm not going to create my DAG all over again, but I do have one that I've created already, and it looks like this. This is the entire project workflow, essentially. The first notebook runs the data collection step that Oindrila mentioned earlier, using the MI tool that our sister team, the Thoth team, created. The second step does the data cleaning and feature transformations and pushes the transformed data to a database. The final step in the pipeline is a notebook that takes the raw feature data, trains a model on it, and pushes the trained model to an S3 bucket. If you want to run this, you click on the play button here, select which Kubeflow or Airflow instance you want to run it on, and click okay. It takes maybe a couple of seconds, and there you go: it sets this up to run as a pipeline on your Kubeflow instance. If you go to your Kubeflow instance, you should see, yes, there it is, the pipeline we just created. As you can see, it is starting to collect data and run the notebook. This is going to take a couple of minutes, so I'm just going to skip to a pipeline that has already run before. This is what it should look like once it has finished. At the end of this step, you'll have a trained model that has been pushed to an S3 bucket, and from there the Seldon deployment will pull it and then serve any incoming requests for your GitHub PRs.

Let me quickly show what those requests look like. This is the endpoint at which we have deployed our service. What I'm going to do is pull some PR data right here, there you go, then wrap it into a JSON that looks like this, and run. Then I send this JSON data as a payload in an HTTP request to the Seldon endpoint we created earlier. Hopefully it runs correctly, and there it is: it gives a 200 response in return, meaning it ran successfully and you have some output. The output should look like this: it tells you that this PR belongs to class three, which means it's likely to be merged in 15 to 24 hours, and so on. So there you have the service working in its full glory.
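In code, a query against a Seldon endpoint like the one in the demo might look roughly as follows; the URL and feature values are placeholders, and the payload shape follows Seldon Core's ndarray JSON convention:

```python
import requests

# Placeholder route for the Seldon deployment, not our real endpoint.
URL = "http://ttm-service.example.com/api/v1.0/predictions"

# One row of engineered PR features; the values here are made up.
payload = {"data": {"ndarray": [[120, 4, 1, 0, 2]]}}

resp = requests.post(URL, json=payload)
resp.raise_for_status()  # expect HTTP 200 on success
print(resp.json())       # e.g. a class index such as 3 -> "15 to 24 hours"
```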
And of course, we've also created the bot, or rather our teammate on our sister team created it: every time you open a pull request on your repository, the bot comes along and puts those predictions on your PR. In this way, you can have an ML service integrated into your workflow.

Cool, let me go back to the slides real quick. All right, so there are a couple of next steps we have planned for this project as well. This is a fairly new project, so there's quite a lot of work we're planning to do. The first is to make this service available as a bot for any repository in general; right now it's somewhat tailored to our specific repositories, but we want to expand that in the future. Next, once we're at that stage, we want to collect live feedback on this bot and service and update them accordingly. And finally, we want to iterate on the models themselves to make them more performant and more accurate.

So that's pretty much it. If you want to engage with this project, there are multiple ways to get started. You can find our data sets, the data collection scripts, the exploratory notebooks, and everything else in our GitHub repository. You can spin up a JupyterHub instance on the Operate First Cloud and actually run everything we have right now. You can also interact with the dashboards we've created, which are publicly available, and play around with the endpoint we've created, which is also publicly accessible. And lastly, if you want to look at the other analyses in the larger AI for CI project, outside of this one, we've put our YouTube playlist out there, so you can check that out as well.

So thank you all for joining. Here are some more links, if you need them, to the project website, the GitHub repository, and a bunch of social media channels where we frequently post updates. Feel free to connect with us there or over our emails. So yeah, that's pretty much it. Thank you all for joining, and if you have any questions, we are here to answer them. All right, if there are no questions, just feel free to reach out to us or open issues on our GitHub. And yeah, thanks so much for joining.