Hello everyone and welcome to the session. My name is Karun and I work as a data scientist in the emerging technologies group at Red Hat. Hey everyone, my name is Oindrila Chatterjee and I'm a senior data scientist working in the same team as Karun. We are part of the emerging tech group in the office of the CTO at Red Hat. Cool, so today we want to talk about how you can uncover project insights from your GitHub PR data. So let me start off by giving some context around this project and some background on where we're coming from. As data scientists in the emerging technologies group, we work very closely with the Operate First initiative. At this point many of you might have already heard a lot about this in the previous sessions, but just to recap, the main philosophy behind Operate First is that, while open source has made software easily accessible to everyone, the knowledge of how to actually run it in a production cloud environment is not widely accessible yet. Things like how to install something on the cloud, how to perform upgrades, and so on are not made very obvious or transparent by cloud providers. This creates a barrier to entry for using these projects, and that is exactly what Operate First is trying to address. Under this initiative, we want to deploy, run, and manage all these applications in an open source community cloud. So first of all, this shows the community how to run and operate applications. And secondly, since we're running the software ourselves, firsthand, we can take the lessons learned from operationalizing it and feed those lessons back into the code. In this way, we can make our applications easier to run and manage for everyone out there. Cool, so that's a quick overview of Operate First.
And if you're interested in learning more, I would encourage you to attend Marcel Hild's talk on this tomorrow. But anyway, you might be thinking: what does this have to do with me as a data scientist or as a data science manager? Why should you care about this? Well, one of the main workloads that runs on the Operate First cluster is Open Data Hub, and Open Data Hub is essentially a tool set of cloud native data science tools. So this means that you, as a data scientist, have public access to this collaborative and reproducible environment where you can do all sorts of really cool data science work. So, yay for that. And secondly, since these workloads run in a public cloud, we can actually collect data around the operations of these workloads and create a sort of operations dataset. This dataset could have things like the memory usage patterns of a particular application, or the CPU usage patterns of some operator before failure, and so on. And this creates a really nice opportunity for data scientists to leverage this data to understand and improve operations over time. This is exactly what we're trying to do with one of our projects called AI for CI. Broadly speaking, the goal here is to improve CI pipelines and CI processes by adding some AI capabilities to them. This mainly involves two tasks, I would say. The first is collecting some relevant data, such as build logs, CI test failure data, or Bugzilla datasets, and bringing this data into a Python environment where you can actually analyze it. And the second part is building some machine learning models and machine learning workflows on top of this data. As you can imagine from the title of the talk, the data source that we're going to talk about today is GitHub data.
So specifically with this sub-project, what we want to show is how you can collect your GitHub data in a format that's suitable for analytics, then how you can calculate some interesting KPIs related to project development and track them over time, and finally, how you can create ML models to actually help you in the project development process. For example, you can have a model that predicts the time to merge of a pull request, so that you have a better idea of how much estimated effort would go into that PR. So that's a quick highlight of this GitHub PR data insights sub-project. And now, to talk about some of these steps in more detail, I'm going to hand it over to Oindrila. Awesome, thanks for that introduction, Karun. So as a first step in analyzing, we want to collect the data and have a method to extract and analyze it. We use Project Thoth's MI tool to collect data from the GitHub repositories of interest. The MI tool is able to collect data from Git repositories, such as pull requests over a date range, and provide us with a JSON formatted dump to analyze. The scheduler tool is a part of Project Thoth too, and it is used to run collections on a custom schedule, like daily, weekly, and so on. Now, to get a repository's pull request data, you can use the CLI tool called srcopsmetrics. We can use the create flag to create knowledge from a GitHub repository. We can use the is-local flag to store the results locally, because by default MI tries to store data on Ceph S3 storage. Then we specify the repository that we want to collect the data from with the repository flag. These are essentially repository metadata that are being inspected, for example issues or pull requests, from which specified features are extracted and stored as a data frame. So now let's take a look at the workflow of this project, or the various components that it consists of.
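Putting those flags together, the invocation looks roughly like this. This is a sketch based on the flags mentioned in the talk; the exact spellings may differ between versions of the tool, so check its help output before relying on them.

```shell
# Sketch: collect pull request "knowledge" from one repository with srcopsmetrics.
# Flag names follow the talk and may vary by tool version (see --help):
#   --create-knowledge : extract entities (issues, pull requests) from GitHub
#   --is-local         : store results locally instead of on Ceph S3
#   --repository       : the org/repo to inspect
python -m srcopsmetrics --create-knowledge --is-local \
    --repository openshift/origin
```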
So in the interest of prototyping this workflow, we started with the OpenShift Origin GitHub repository, and we collect its pull request data over the years using the MI tool that we discussed earlier. We then transform the input columns obtained from the pull requests, such as the size of the PR, the types of files that are added in the PR, the description of the PR, etc., into various features which can be ingested by a machine learning model. These features can also be used for deriving meaningful metrics, which can be used to gain insights into the software development process. As a next step, in order to view statistics, KPIs, and metrics related to the project visually, we create automated dashboards which can provide greater visibility into aspects of the project, such as contributor and merge statistics over time, and other metrics which can provide more insights into the software engineering process. The next component of this project is a machine learning model. Using the features obtained, we wanted to train a model which is able to predict the time it will take to merge new incoming PRs on a GitHub repository. To achieve that, we explore several classification and regression based models, which can either classify the time to merge values of pull requests into one of a few predefined time ranges, or predict a time to merge value directly. Finally, we deployed the model yielding the best results as an interactive service using Seldon. This endpoint is available for anybody to interact with and test out on new PRs. So now, in order to create the dashboards and visualizations that we discussed earlier, we follow three steps. Firstly, we get the repository data using the MI tool. Secondly, we use Jupyter notebooks to explore the data, extract meaningful features from the columns, and create SQL tables in a Trino database engine. Finally, we import the tables that we created and create visualizations using Apache Superset. So let's take a quick look at the dashboard that we prototyped.
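The classification framing described above, mapping each pull request's time to merge onto one of a few predefined time ranges, can be sketched with the standard library alone. The boundary values below are hypothetical; the real project derives its ranges from the observed data rather than from fixed constants.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

# Hypothetical class boundaries in hours: up to 1h, 6h, 1 day, 3 days,
# 1 week, and everything beyond. The actual project's ranges are derived
# from the distribution of observed time-to-merge values.
BOUNDARIES_H = [1, 6, 24, 72, 168]

def time_to_merge_class(created_at: datetime, merged_at: datetime) -> int:
    """Map a PR's merge latency onto one of len(BOUNDARIES_H)+1 buckets."""
    hours = (merged_at - created_at).total_seconds() / 3600.0
    return bisect_right(BOUNDARIES_H, hours)

# Example: a PR merged two days after it was opened lands in the
# "1 to 3 days" bucket (index 3).
opened = datetime(2021, 2, 1, 9, 0)
merged = opened + timedelta(days=2)
label = time_to_merge_class(opened, merged)
```

A classifier is then trained to predict this bucket index from the engineered PR features, instead of regressing on the raw duration.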
So I'm going to quickly share my screen here to show you the dashboard. So here is the dashboard that we compiled from the data collected from the OpenShift Origin repository; the dashboard provides visibility into the project and its development process. We can visualize several statistics and metrics, such as the size of the pull requests in terms of the content they add, the top contributors, and the trend of new and unique contributors over time. We also plot some pull request trends, such as the number of commits added across PRs and the number of files that are modified over time, and finally we also calculate certain metrics, such as the average time taken to merge pull requests over time. These metrics can be useful for engineering managers and can help identify any gaps or blockers within the software development process. And we are also able to filter this dashboard based on the time or the date range, and we can also filter by size. These are just ordinal variables which categorize the pull requests into six sizes. So if we filter it by size two, the dashboard should reflect pull requests within this size range. So that was a quick overview of the dashboard. I will hand it over to Karun, who will go over the deployment process. Cool, thank you, Oindrila. So yeah, now we've seen how to collect this data, analyze it, and how to make some simple ML models using the feature engineering process that Oindrila just described. So the next thing we want to do is to actually deploy this machine learning model and make it available as a service, so that it can be consumed by other people and other applications. And specifically for deployment, we're going to be using the Seldon operator. To do this, the first thing we do is save our pre-trained model onto an S3 bucket on Ceph. So here we have saved the model.joblib, which is our model, into this opf-datacatalog-zero backup bucket.
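Stepping back to the dashboard for a moment: a metric like the average time to merge shown above is straightforward to compute once PR timestamps are available. A minimal stdlib sketch, with illustrative record layout and field names rather than the MI tool's actual schema:

```python
from datetime import datetime
from statistics import mean

# Toy PR records standing in for the columns the MI tool extracts.
# The field names here are illustrative, not the tool's real schema.
prs = [
    {"created_at": datetime(2021, 3, 1, 9), "merged_at": datetime(2021, 3, 1, 15), "size": "S"},
    {"created_at": datetime(2021, 3, 2, 9), "merged_at": datetime(2021, 3, 4, 9),  "size": "L"},
    {"created_at": datetime(2021, 3, 3, 9), "merged_at": datetime(2021, 3, 3, 10), "size": "S"},
]

def avg_time_to_merge_hours(records, size=None):
    """Average merge latency in hours, optionally filtered by PR size,
    mirroring the dashboard's size filter."""
    hours = [
        (r["merged_at"] - r["created_at"]).total_seconds() / 3600.0
        for r in records
        if size is None or r["size"] == size
    ]
    return mean(hours)
```

In the actual pipeline this aggregation happens in SQL over the Trino tables, with Superset rendering the result over time.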
Cool, so then next we write a small Python class that describes how to load the model. Here you can see it's basically just downloading the model from S3. And we also define a function called predict, to tell it how to make a prediction on an incoming request. In this case, we're just calling model.predict on the incoming request. Then we add some runtime dependencies for our service into a requirements.txt file. In this case, we've added the scikit-learn Python package, because that's what our model is built on, and we also add the seldon-core Python package, because that's needed by the deployment service. In addition to that, we also specify some build time environment variables for S2I. So we specify the model name, which is the name of the class that's doing the loading and predicting, and some other setup environment variables. Cool, so once we have these three things, we put them into a folder like this, and then we build an S2I image with seldon-core as the base image and the folder containing these three things as the context directory. So this is the image that we've created for our case. And then once we have this image, we go ahead and deploy it as a service using a config file that looks like this. Here we've basically mentioned the URL to the Quay image repo, which is github-pr-ttm. And then we also specify some environment variables that the service will need at runtime; this is the S3 credentials in our case. And then we specify any resource requirements for the service, as well as the service orchestrator. And finally, we define what kind of endpoint it's going to be. For our case, it's going to be a REST endpoint. And once we have this config file, we can deploy it, and Seldon will take care of deploying a service for us. And once that service is deployed, we can expose it to the rest of the world by creating a route in OpenShift.
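The load-and-predict class described above follows Seldon's s2i convention: a class whose name matches the model-name build variable, exposing a predict method that Seldon calls for each request. The sketch below keeps loading local and injectable so it stays self-contained; the class name, S3 wiring, and stub model are all illustrative, and the real service downloads model.joblib from Ceph and loads it with joblib.

```python
class TimeToMergeModel:
    """Sketch of a Seldon s2i serving class (names are illustrative)."""

    def __init__(self, model=None):
        # The real service downloads model.joblib from the Ceph S3 bucket
        # (e.g. with boto3) and loads it via joblib.load(). A preloaded
        # model can be injected here so the sketch runs without S3.
        self.model = model if model is not None else self._load()

    def _load(self):
        # Placeholder for: download from S3, then joblib.load("model.joblib")
        raise NotImplementedError("wire up the S3 download + joblib.load here")

    def predict(self, X, feature_names=None):
        # Seldon invokes this on every incoming request payload and
        # returns the result to the caller.
        return self.model.predict(X)


# A stub model lets us exercise the wrapper without S3 or scikit-learn.
class StubModel:
    def predict(self, X):
        return [2 for _ in X]  # pretend every PR falls in time-range class 2


service = TimeToMergeModel(model=StubModel())
preds = service.predict([[10, 3, 1]])  # one hypothetical feature vector
```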
All right, so now let me go through an actual live demo of how you would go about doing this. So this is the YAML config file that I just mentioned in the previous slide, and it's pointing to the image repository that we've created. So that's going to be this repo for our case. And I've named this deployment demo-deployment, just so that we know it's not for production. So what I'm going to do is copy this config file, go to my OpenShift console, click on Installed Operators, go to Seldon, and create a new Seldon deployment. Here I'm going to just paste the content of the config file in the YAML view and then click on Create. So yeah, as you can see, it's starting to create this deployment for me. And now I can go into the services tab and take a look at how far along it's come in the deployment process. So yeah, here I can see that the demo service has been created for me. And just to make sure this is the right one: yeah, as you can see, it was created just now, and it's ready. So to expose it to the rest of the world, I'm going to go to Routes and create a new route with any name, then select my service, which is the demo classifier, and then select the port at which I've exposed the service. Then I can click on Create. I'm very sorry to interrupt you, but we have five minutes left, and this time is supposed to be for answering questions. Because we have strict time slots, I would ask you to either answer the questions or wrap up. So yeah, thank you. Oh yeah, I think we just have 30 more seconds left, so it should be almost done. Thank you. Cool. So yeah, now that my service is created, I'm going to copy its link and show that it actually works. So basically, I've created a notebook to pull some data from the MI scheduler, and I've pasted the link to my service.
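For reference, the query the notebook makes is just an HTTP POST against the exposed route. Seldon's default REST protocol wraps feature rows in a small JSON body; a stdlib-only sketch of constructing it, with a made-up feature vector:

```python
import json

def build_seldon_request(feature_rows):
    """Wrap feature vectors in the JSON body Seldon's default REST
    protocol expects: {"data": {"ndarray": [...]}}. The feature layout
    itself is whatever the model was trained on."""
    return {"data": {"ndarray": feature_rows}}

# Hypothetical engineered features for one PR (e.g. lines changed,
# files touched, commit count) -- purely illustrative values.
payload = build_seldon_request([[120, 4, 2]])
body = json.dumps(payload)

# To actually call the route, one would POST this body, e.g. with
# requests.post(route_url + "/api/v1.0/predictions", json=payload);
# the exact path and any auth depend on the Seldon version and setup.
```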
And hopefully, once it restarts, I should be able to show that I can actually query the service. Okay, in the interest of time I'm not able to reach the notebook, but I would encourage the viewers to check out the endpoint over here and see that it actually works with real data. Cool. So I guess that was all I wanted to show for the demo; it shows how we can get set up with the service in just a couple of minutes. All right, so I'm going to hand it over to Oindrila to finish off the presentation. Yes, I'm going to quickly wrap this up. So this is a fairly new initiative, and there are several next steps planned to extend and improve it. Firstly, we want to extend this prototype service to some of our internal repos and our Operate First community repos. We also want to integrate the time to merge model with bots such as Sesheta, adding predictions to new incoming PRs on the GitHub UI as labels. We also wish to use Kubeflow Pipelines to automate the services; this will help us version the model prototypes via Kubeflow pipeline metrics. And finally, we wish to iterate on the model developed for time to merge prediction, with an attempt to improve the model performance metrics that we achieve. So great, that's all we had for you today. Thank you for joining our session. Here are some links to the project website, the GitHub repository, and the social media channels for the Operate First initiative, where we regularly post updates. Feel free to connect with us on these channels. You can also reach out to us via the Slack channel, Data Science, where we usually hang out, or email us to continue the conversation. And thank you again; I hope you enjoy the rest of DevConf. Feel free to post any questions in chat. Thanks, Karun and Oindrila. There is one question, and I guess we have two minutes, which is enough to answer at least one question.
So the question is: any plans for supporting other code review systems, like GitLab, Bitbucket, Gerrit, etc.? So as far as the PR data collection goes, we're not exactly sure about that, because this is developed by the Thoth team, but I'm sure if there's interest, they'll be happy to support these as well. And as far as the service goes, it's agnostic to what platform you're calling it from. So if you have a bot in GitLab that's trying to query the service for time to merge, it should work just fine. Okay, thank you. That's all the questions we have. So thank you for your presentation. Cool, thank you for moderating. Thanks, Pavel. You're welcome.