Hey everyone, thank you for joining me for the second-to-last session of the day, on the second-to-last day of the conference. I won't take too long; I'll try to keep it short, with some calls to action. So welcome. I'm going to be talking about uncovering project and community health metrics using data-driven techniques. My name is Oindrila Chatterjee. I'm a senior data scientist at Red Hat, and I work in the emerging tech group in the office of the CTO.

Let me give you a brief overview of the agenda for today. We're going to start with a brief intro on what project metrics are and how they can support community health. Then I'll introduce our project, AI for CI. I'll also briefly touch on the Operate First community cloud that we are using underneath all of this. Then I'm going to talk about some AI tools and techniques that you could use for your own open source projects. And finally, I'm going to do a little deep dive into the time-to-merge model that we came up with.

All right, so let's talk metrics. Why metrics? That's almost stating the obvious: it's no secret that sources like code repositories and communication channels can reveal a lot of crucial information about a project's and a community's health. We can derive information like the velocity of a project, the blockers in a development process, and community engagement. Metrics like member churn, average time to fix, time to respond, and contributions over time can help development teams and open source program offices better allocate resources and evaluate an ongoing project's success. They can also help advocate for the work that software development teams are doing, help with analyzing the whole software development community, and finally assess the growth of an open source project and its associated community.

So how do we calculate these metrics? Sources like GitHub repositories and communication channels, and the data stored in them, like user traffic, commits, issues, PRs, and the code itself, can all be used to derive these metrics. Most communities are already doing this in some form and already have some kind of dashboards that track their community health, and there are various types of tooling that folks are using to achieve this. In this talk, I'm going to describe a particular open source approach that we came up with, and then ask whether we can go a step further, from analytics to machine learning, to help a project's development process and further foster the growth of its community. Some examples of such a machine learning service: a model that can predict the time to merge of a new pull request on a repository; a model that can suggest the optimal point at which to stop a long-running test, beyond which it is most likely to fail; a model that predicts the time to respond on issues; and so on.

So I'm not here today to tell you which metrics to track, or why to use metrics to support your open source community. There were some excellent talks earlier in this conference that I saw: one by Vishi Sahar on Tuesday called Dev Team Metrics That Matter, and one by Callie Dolphy about community metrics, what to measure and why. I'm rather here to tell you how AI-driven metrics can possibly help your community, and how to use open source machine learning tools to achieve this.
So if you are a data scientist or a data engineer, or you care about how to stitch this process together, I'm going to talk about some tools and how this can all be done on an open source community cloud.

Now, looking at the project at hand, AI for CI. The problem this project seeks to address is that, firstly, there is a strong need for AIOps: automated monitoring, anomaly detection, alerting, and all sorts of AIOps techniques to help CI/CD processes and operations. And the main problem with building such AIOps tools is that software, even if it's open source, often gets operated behind closed doors in closed systems. That is, the data and logs generated by operating this software are not open source. This also presents an opportunity for us: open source communities like the Kubernetes testing infrastructure make their testing data public, and these kinds of data sets are a rarity among public data sets today. They are a great starting point and an initial area of investigation for the AIOps community to tackle.

So this project is basically a collection of intelligent open source data science tools that can support CI/CD processes. It consists of AIOps models like the GitHub time-to-merge service, optimal stopping point prediction, and build log failure classification. It also consists of some KPI and metric dashboards. The overall goal is to foster an open source AIOps community with open ops data, tools, and services.

As part of this project, AI for CI, we periodically collect open operations data originating from Prow, which is at the heart of the Kubernetes testing infrastructure; TestGrid, which is the visualization platform for passing and failing tests; GitHub; Bugzilla, which is the bug tracking system used at Red Hat; and other ops tools like these, and we make that data available for analysis. We then compute some key performance indicator metrics from the CI/CD data and display them on dashboards. We also build some ML services, like the ones I talked about before, and share them as notebooks, scripts, tools, and dashboards, along with the pipelines that make up these tools. And finally, we build all of these tools and services on an open operations community cloud called the Operate First community cloud, which provides the ML tooling required to build these services, like the underlying clusters, Jupyter notebooks, Superset dashboards, an S3 storage system, and database engines. We also make all the notebooks, templates, pipelines, and scripts open source.

So what is the Operate First community cloud? It's basically an initiative centered around open sourcing the operations of software. It provides us with a real production community cloud which can be used to operate software and applications openly. Thus, we are open sourcing the operations data and the SRE best practices, and generating tons of open source operations data, including logs, metrics, issues, and PRs, in the process of doing so. For a data scientist like me, it also provides a real platform, with instances of cloud services and applications, that can support my data science workstreams. If you're interested in learning more about the Operate First community cloud, you can find me after the talk at the Red Hat booth. I also gave a talk on this yesterday with Marcel Hild; you can definitely go back and check the recording.
So apart from the data coming from the Operate First community cloud, another important source of open operations data we are interested in is the data originating from the Kubernetes testing infrastructure, which is Prow, TestGrid, and all their associated logs and metric data.

When we talk about the Operate First community cloud being useful for data scientists, this is what I'm referring to. The Open Data Hub project is one of the projects being operated on the Operate First community cloud. It's basically an open source project that consists of various data science tooling: JupyterHub, Grafana, Spark, Apache Superset, Trino, OpenMetadata, Pachyderm, Airflow, Kubeflow Pipelines, all of these different open source tools that can help with different parts of the machine learning workflow. The other side of the same coin is the open operations data being generated by operating the Open Data Hub project. Data like logs, metrics, PRs, issues, all the SRE best practices, architectural decision records, and blueprints, all of that good information can be used for any sort of analysis, and it can also be used to enable any AI tooling that we want to build on top of it.

So here I'm now going to talk about the tooling that we typically use for capturing these metrics. Depending on your team or your open source community or project, there might be a metrics solution that is suitable for you. There are also existing end-to-end solutions where you can just plug your GitHub data or repository in and it already generates some templated, predefined graphs and charts. If you're looking at doing this the open source way and putting these components together yourself, there are different data sources you might want to look at: GitHub, GitLab, Gerrit. There are also different scraping mechanisms, the GitHub API of course being the thing that underlies a lot of these higher-level tools. There's the Thoth MI Scheduler, which we typically use; it's an open source project. There's the CHAOSS Augur project, which is again a large database of different GitHub repositories and their associated data. Then you might want to look at different tooling for where you want to crunch that data: JupyterLab or JupyterHub is the environment we typically use, and we store all of that data back in a SQL-queryable form using Trino. And then comes your dashboarding or visualization solution. We use Apache Superset for it, but you might have a different solution depending on your requirements.

So this is a high-level architecture diagram of the project, and this is basically the flow that we follow to do all of this. On the left, in yellow, are all the data sources that we're looking at: TestGrid, GitHub, Prow, Bugzilla. Then, in order to gain more insight into that data, we programmatically collect it and create Jupyter notebooks to crunch it. Then comes the metric processing part, where we work with folks in the community to actually understand what kinds of metrics are needed, what you want to capture, what you want to gain from the large sets of data that you have, and we use Jupyter notebooks to do all of that. The main storage backing all of this is Ceph S3 storage, which lies underneath everything. And if you follow the downward arrow, we store all of that data back in a SQL database engine and create visualizations on Apache Superset.
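Just to make that last step a bit more concrete, here is a minimal sketch of what pulling one of those metric tables into a notebook might look like. The host, catalog, schema, table, and column names below are placeholders I've made up for illustration, not our actual deployment:

```python
# Minimal sketch: query a metrics table exposed through Trino from a Jupyter notebook.
# Hostname, credentials, and table/column names below are illustrative placeholders.
import pandas as pd
from trino.dbapi import connect

conn = connect(
    host="trino.example.com",   # placeholder Trino coordinator
    port=443,
    user="metrics-reader",
    catalog="ai4ci",            # hypothetical catalog name
    schema="github_metrics",    # hypothetical schema name
    http_scheme="https",
)

# Pull per-repository PR metrics into pandas for further crunching;
# the same table can back a Superset chart directly.
query = """
    SELECT repo, week, opened_prs, merged_prs, median_time_to_merge_hours
    FROM pr_weekly_metrics
    ORDER BY week
"""
df = pd.read_sql(query, conn)
print(df.head())
```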
On the right is the AI, or the modeling, part of it. Some of that data is also fed back into developing machine learning tools and services, which are then deployed as an API endpoint or a service that can be integrated with a given tool. We use Seldon model serving on top of Red Hat OpenShift to do this. Everything inside the dotted line is automated using Kubeflow Pipelines, which helps us run recurring jobs and automate these bits, and again, we have some reference architectures and blueprints on how that can be done. All of these tools are available as part of the Open Data Hub operator, which is deployed on the Operate First cloud.

So let's look at two examples of machine learning models that we built that could help any open source project. The first service, optimal stopping point prediction, is focused on long-running tests on a given repository. For instance, if you're a DevOps engineer, or if you test your code often, you might see that some tests just keep on running, taking way longer than expected, and after a really, really long time they fail. Some of that might be flaky, some of that might be due to issues on the cluster or just something very random. There is actually a fairly straightforward technique you can use to detect such long-running tests and find an optimal stopping point beyond which, with a confidence score, the test is most likely to fail. A model like this can help better allocate resources and save some time on a project.

The other ML service I'm going to talk about is the time-to-merge prediction model. The concept is that for every new PR opened on a Git repository, this model is able to predict the time it will take for that open PR to merge, and it adds the time estimate as a label on the PR: zero to three hours, three to six hours, six hours to one day, one day to one week, and so on. The exact buckets really depend on the repository, the project, and what you're trying to get out of the label. Such a model can also learn from the actual time it took to merge, creating a feedback loop that makes the service better and more useful. A metric like this can help identify bottlenecks in the development process. For example, having an estimate of how long it will take for a PR to merge can help developers and engineering managers better allocate resources to certain PRs or speed up the whole process. It can also give new contributors an estimate of when their issue or PR will be reacted upon or merged, and this can encourage contributions to an open source community, especially from new contributors, if they have an estimate that, yes, this is going to be looked at, but maybe one week down the road or after a certain period of time.

So this is the current workflow that we follow to build such a service. As you can see on the top, we collect the data, engineer certain features from it, build a suitable model, create a service endpoint for it, and finally integrate it as a bot into the pull request workflow on the repository itself. Let me give a little more insight into each of these steps. For collecting the data itself, we started with some large repositories: we looked at the OpenShift Origin GitHub repository, and we are using a PyPI package called srcopsmetrics to collect all the data. It actually uses the GitHub API underneath.
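Just to make the collection step concrete, here is a rough sketch of what pulling the raw PR data looks like. This is not srcopsmetrics itself; it's a minimal illustration using the PyGithub library against the same underlying GitHub API, and the time-to-merge buckets here are only example values:

```python
# Minimal sketch: collect merged PRs from a repository and bucket their time to
# merge into classes. Not the actual srcopsmetrics code; PyGithub is used here
# purely to illustrate the underlying GitHub API.
import os
from github import Github  # pip install PyGithub

BUCKETS = [
    (3, "0-3h"),
    (6, "3-6h"),
    (24, "6h-1d"),
    (24 * 7, "1d-1w"),
]

def ttm_bucket(hours: float) -> str:
    """Map a merge time in hours to one of the example label buckets."""
    for limit, label in BUCKETS:
        if hours <= limit:
            return label
    return ">1w"

gh = Github(os.environ["GITHUB_TOKEN"])
repo = gh.get_repo("openshift/origin")

rows = []
# Look at a recent slice of closed PRs; only merged ones get a label.
for pr in repo.get_pulls(state="closed", sort="updated", direction="desc")[:200]:
    if pr.merged_at is None:
        continue  # closed without merging
    hours = (pr.merged_at - pr.created_at).total_seconds() / 3600
    rows.append({
        "number": pr.number,
        "author": pr.user.login,
        "changed_files": pr.changed_files,
        "additions": pr.additions,
        "deletions": pr.deletions,
        "title_length": len(pr.title or ""),
        "time_to_merge_hours": hours,
        "label": ttm_bucket(hours),
    })

print(rows[:3])
```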
srcopsmetrics then adds some more entities and processes the data a little, so that the raw data you're collecting becomes more suitable for analysis. Then comes the feature engineering bit. This is, I think, the most important part of this model, because here we are actually deciding what kind of data we want to capture from a repo or a PR in order to assign a label like this. For this we are not only looking at the PR itself, like when it was opened, the size of the PR, the number of lines changed, the size of the description and title, but also the code repository itself. For example: who is the contributor opening the PR, who is the author, are they a maintainer, have they contributed to a certain part of the repository before? We look at all those features, rank their importances, and keep the most important ones, which can actually impact such a model. Then we look at certain classification models. We also explored a regression technique, which would give a timestamp, but we settled on a classification approach, which basically buckets the time it will take to merge. We looked at some vanilla classifiers, and finally we deployed the best-performing model using the Seldon operator on Red Hat OpenShift. We expose a route to it, which can then be integrated into a GitHub repository directly, or you can access it from a Jupyter notebook, from your terminal, or wherever.

So that's the approach. I'm at 21 minutes, so, as promised, the next steps. We are currently working on integrating this model as a bot on GitHub PRs, so you can directly install it as a GitHub Action, or you can download the bot and set it up with your repository. We are also looking at ways to gain live feedback from the repository once a PR is actually merged, and to iterate on the models for better performance, looking at more advanced classifiers and neural-net-based techniques. And finally, we are working on turning this whole approach into a tool for a few of our machine learning models, so that you can use it like an API and also tweak it based on the features you want to lay emphasis on.

So, leaving you with some resources if you want to engage with this project. There is a list of open data sources which we have linked and stored with the repository. You can also interact with and leverage the different notebooks for analysis, and look at the open dashboards that we have as part of this project. We also run these models openly, so you can actually send payloads to the model endpoints and try them out for yourself; I'll show a rough example of what such a payload looks like in a moment. We run automated AI/ML workflows using Kubeflow Pipelines, so we have those pipeline configurations available, along with some guides on how to set this up for your own machine learning workflow. And finally, if you want to learn more about the different analyses and notebooks and the more ML-focused side of it, which I didn't go into in much detail, you can check out our YouTube video playlist.

And I think that's all I had for today. Thank you so much, and if you have any questions, you can let me know now.
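As promised, here's a rough idea of what trying one of the deployed model endpoints looks like. The URL, feature order, and values below are placeholders, and the exact payload shape depends on how the Seldon deployment is configured, so treat this only as a sketch, not the real service:

```python
# Minimal sketch: send a prediction request to a Seldon-served model endpoint.
# The route and the feature vector are illustrative placeholders; check the
# project's repository for the actual endpoint and expected feature order.
import requests

ENDPOINT = "https://example-route.apps.example-cluster.com/api/v1.0/predictions"

# Hypothetical feature vector for one PR (size, changed files, author stats, ...)
payload = {"data": {"ndarray": [[12, 345, 67, 4, 1, 0]]}}

resp = requests.post(ENDPOINT, json=payload, timeout=30)
resp.raise_for_status()

# A Seldon-style response echoes back a "data" field with the prediction,
# here interpreted as a time-to-merge class (e.g. "3-6h").
print(resp.json())
```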
Let me try to pull up a dashboard that we built for a project called Thoth. It's a Red Hat project for Python stack analysis, and it also has an associated community, so they wanted to look at some high-level metrics around issues and PRs. This is just an example of the kind of dashboard we've been looking at. There are some trend charts, if it loads up correctly, some contributor trend charts, some plain numbers, basically what they needed from us. But this is more just to show you what the UI looks like for Apache Superset. Of course, the same data can also be plugged into a different visualization platform. But yeah, that's all I had. Are there any questions?