All right. Hey, everyone. Thanks for joining us. My name is Oindrilla Chatterjee and I am a senior data scientist in the emerging technologies group at Red Hat, and I am coming from Boston, United States. Today we are going to be talking about uncovering new open source communities using graphical analysis and network analysis. And I'll let my colleague introduce herself.

Thanks, Oindrilla. So hello, everyone. My name is Hema Veeradhi and I'm also working as a data scientist in the emerging technologies group at Red Hat. I'm based out of California in the United States. If you want to connect with me later, you can reach me on LinkedIn or GitHub. Feel free to reach out to us after the talk as well.

So before we get started, I would encourage you all to answer this question that we have by scanning the QR code. This is just a live poll to understand what your role is right now in open source. So are you a developer? Are you a project owner? Are you a project maintainer? Whatever it may be, feel free to put in your answers over there, and we should start seeing it populate as well. So I see there's a software engineer, data scientist, more software engineers. I'm going to give it a couple more seconds. But yeah, as we can see, we all have different roles to play in this open source community. The majority here at this conference, I'm guessing, come from a technical perspective, so we are mostly developers and engineers and things like that, right? So all of us have this important role that we play.

And part of the motivation of this project, or the goal that we want to achieve in this project, is, firstly, we want to identify and get notified about the early and emerging open source projects that exist out there. We all know that the community is so large that it's sometimes hard for us to pinpoint and identify those new and emerging projects.
So that's one thing that we would like to achieve. Secondly, we would also like to look at the communities around these projects. So we want to track who the important user groups are, who the important core contributors are, who the main maintainers in your open source ecosystem are, and how these evolve over time. And finally, we would also like to graphically visualize all of this. That's where the network analysis and social graph analysis that we're going to talk about come into the picture. What we want to do here is identify the maturity of a project over time, and we want to see where the interrelationships between projects exist.

So how does network analysis help in achieving these goals? We talked about getting early notifications and trying to track those important projects. One way to do this is representing them graphically. So what we do in the first graph that you see is depict your projects, or basically your GitHub repositories, as your central nodes. And you're also trying to identify the contributors around these different projects, so you're trying to see how many contributors exist in a particular project and so on. And we want to dive into that graph a little bit more and look at what the important nodes in that graph are. That's where all those graph algorithms come into the picture. So you're trying to identify which are the most important nodes in that graph and to identify the influential users and so on. And then finally, we also want to look at it from a time perspective. So you want to see the growth of a project. Let's say you're starting a new project. You want to project what the trajectory might look like over time. So you also want to track all of this in a historical way.
And ultimately, we also want to incorporate some AI and machine learning capability to predict what the trajectory would look like for this project to succeed. So that's how network analysis can tie all of this together.

And all of this effort is part of a larger initiative called Project Aspen. Project Aspen is an open source project which is mainly developed at the open source program office at Red Hat. Aspen has a couple of components that are currently actively being developed, which we are also using in our work. Firstly, there is a tool called Augur, which is where we collect all of our open source project data from. Augur essentially scrapes all of the data from GitHub, and we mainly use that tool to collect all the information from various GitHub repositories. We also have a visualization tool called 8Knot. This is an interactive dashboard, so what you see on the graph over there on the right was produced by these dashboards. That's a tool where you can filter for the repository you're interested in, and you can see some metrics like community health metrics, project activity over time and things like that. So there are some built-in metrics in the dashboard that you can look at further. And then we also have the Rappel repo, which is the main repo where we do all of our research and open-ended experiments, and which is what we are contributing to. So that's a little bit about the larger initiative.

And now we can come to the representation of these projects that we talked about in a more graphical format. If you look at the first one on the left over here, we see that the central nodes are basically your projects, so each one represents a particular repository. And you can also find out who the contributors surrounding that particular repo are.
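A minimal sketch of this first representation, with repositories as central nodes and contributors around them. The record layout, the repository names and the opaque contributor IDs are all hypothetical and are not taken from Augur's schema.

```python
from collections import defaultdict

# Hypothetical (repository, contributor_id) records. The IDs are opaque
# placeholders, so no individual contributor is identifiable.
records = [
    ("repo-a", "c1"), ("repo-a", "c2"), ("repo-a", "c3"),
    ("repo-b", "c3"), ("repo-b", "c4"),
    ("repo-c", "c5"),
]

def build_contributor_graph(records):
    """Map each repository (central node) to the set of contributors around it."""
    graph = defaultdict(set)
    for repo, contributor in records:
        graph[repo].add(contributor)
    return dict(graph)

graph = build_contributor_graph(records)

# Number of distinct contributors surrounding each repository node.
sizes = {repo: len(members) for repo, members in graph.items()}
```

A repository with a larger value in `sizes` corresponds to a denser cluster of contributor nodes in the plot.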
So here we do not, you know, pinpoint contributors. We're completely obfuscating that part of it. So we're keeping it private and it's just looking at the contributor IDs, and that's how it looks. Some of these projects have a larger, denser set of contributors. Some of the smaller ones, like the purple one, have a smaller population around them. So this is one way to represent a particular project.

Now if you look at the second representation, again here, each of the nodes represents a repository. But what we're doing with the edges is looking at the number of shared contributions that exist between those repositories. So that's how it's slightly different from the first representation. The edge between two nodes is basically the weight of the contributions that exist between those projects. Those are defined by the activities that happen in the project, which we will come to a little later in the slides. But this is how we would represent it from a graphical standpoint.

Oh, yes, you have a question about the weight of the edge? Yes, and also the length. So if you see a larger distance, it's probably because they have fewer shared contributions. Whereas if you see them being a little closer, that means the weight of the contributions is a little higher. So yeah, that's how the representation works.

So next we can move on to representing projects as nodes and the shared contributions that we were looking at as edges. The main goal of this kind of representation is that you want to aggregate all of those shared activities in those nodes. And then you also want to drill down further and filter out those which might not have a lot of activity, so that you only focus on the key projects in those nodes, right? So, for example, here we're trying to look at some of the more popular Kubernetes-based repositories. So we have a couple of Kubernetes repos.
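The projects-as-nodes representation described above can be sketched as follows. This is a simplification under stated assumptions: the edge weight here is just the count of shared contributors, the drawing distance is taken as the inverse of the weight, and the repository names, contributor IDs and `min_weight` threshold are hypothetical.

```python
from itertools import combinations

# Hypothetical repository -> contributor-ID sets.
contributors = {
    "kube-a": {"c1", "c2", "c3", "c4"},
    "kube-b": {"c2", "c3", "c4", "c5"},
    "shift-a": {"c4", "c5", "c6"},
    "other": {"c1", "c7"},
}

def shared_contributor_edges(contributors, min_weight=1):
    """Build weighted project-to-project edges and drop weak connections."""
    edges = {}
    for a, b in combinations(sorted(contributors), 2):
        weight = len(contributors[a] & contributors[b])  # shared contributors
        if weight >= min_weight:
            edges[(a, b)] = weight
    return edges

def layout_distance(weight):
    """Assumed mapping: a heavier edge pulls two projects closer together."""
    return 1.0 / weight

# Keep only pairs of projects that share at least two contributors.
edges = shared_contributor_edges(contributors, min_weight=2)
distances = {pair: layout_distance(w) for pair, w in edges.items()}
```

Raising `min_weight` is one way to filter out low-activity connections so that only the strongly linked projects remain in the graph.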
We also have some of the OpenShift repositories. OpenShift is again very much related to Kubernetes. So these are some projects which we picked because of their known connections to each other, and we try to look at how close they are and how far apart they are. The goal is that when you visualize it like this, you can actually identify those which are emerging, right? So if you know one very well-known project, let's say Kubernetes, you want to know what the other projects surrounding that very well-known project are, because those can potentially be your next emerging project. So you want to identify those key links between them. That's what we're trying to do in this particular representation.

So what exactly counts as those shared activities, right? We saw those edge weights between the different nodes. What we count as an activity are things like issues, PRs, commits, PR reviews and things like that. And the weight that we defined reflects the strength of the connection based on the type of contributions being made. So basically you're looking at: are these contributions made by a maintainer? Are these contributions made by a developer or a core contributor, and so on? That's how those edge weights are defined. And then finally, you also want to find those emerging projects. So we look at some metrics for a particular project, like how many forks does this project have? How many stars does this project have? And then of course, the activity trend over time for a particular project. So that's how we look at all these different shared activities.

So again, if you can take the poll and answer this question, you don't have to scan it again. It should already be there in the previous tab that you had open. But I want to ask you all, what do you think makes a project rapidly emerging? So you have a couple of options to choose from.
For example, do you look at the growth in the number of stars that it has? Do you look at the number of issues? Or do you look at the external popularity of that project? So you're rating what you think is most important over here. I see some votes coming in. Some people are rating issues, PRs and commits as the most important. Some people are looking less at the number of forks and the number of stars. So I guess most of us are more interested in the activity of a project. I'm going to give a few more seconds for folks to add their responses. Okay, so we have some people who are interested in external popularity. That's nice. Apart from PRs and commits and all of that, you also want to look at external factors, right? So maybe you've heard about this project in some other outlets. That's also an important feature. Awesome. So yeah, thanks for participating. These are all good responses. And you can see that each of you, based on your role, is weighing things differently, right? That's what we can also take into account when we do this kind of analysis, to make sure that we're capturing the right set of activities that you think are important. And that's how you can gauge where your project lives.

So following up, one more question that we have for you is, what is the most important insight that you're looking for from your own community? Most of us participate in some kind of open source community. So what is it that you want to learn about these communities, right? Whether you're a first-time contributor or a veteran contributor, what do you look at and what is important to you when you are thinking about these communities? You can feel free to add your responses over here as well. Okay, so "what do we need to change to attract more users and contributors?" Yeah, that's a great point. So you also want to see how we can improve communities, right?
So what are some things that you should focus on to improve? That's definitely a good thing to look at. "What parts need help to connect them better?" Okay, yeah, so examples of use cases. "What's one next project that could be interesting?" Exactly. That's something that we also want to gain from this kind of analysis, right? You know that this project exists, but you also want to know how to make this project bigger. I see somebody asked, how do you make it bigger? What are some other things that we can contribute to make it interesting? So yeah, these are all great insights, and we hope to incorporate these kinds of insights into our analysis further. But yeah, thanks for all these suggestions. These are some things that we are also looking at. We're trying to understand what community managers are interested in, what contributors are interested in, what developers are interested in. Depending upon your role, you look at communities differently. That's the reason why we try to get data from different sets of projects and aggregate them in this graphical representation. With that, I would like to hand it over to Oindrilla. She's going to talk more about the algorithms that we've implemented.

Awesome, thanks, Hema. So now that we've looked at the questions and some of the goals that we want to achieve with these projects, let's look into some of the more technical details and how we get to assessing these goals. To do that, we first researched graph centrality algorithms. This is one of the most important areas of research when you're trying to identify key players or important players within a graph network. So what do important nodes within a graph network mean?
So in graph language, important nodes can mean, number one, that they have a lot of links or a lot of direct connections to them. Or it can mean that a particular node can reach other nodes in fewer hops, so it can reach other projects really easily. And another meaning could be that it really sits in between projects, so it lies on the shortest paths between different projects. We'll go a little bit more into what this means for GitHub repos, but before we get there, let's get into some of the algorithms that we researched.

The first algorithm that we looked into is PageRank. PageRank is one of the most popular algorithms. Google came up with it, and it was used to rank the web pages or websites which are the most popular within Google search rankings. PageRank has a variety of use cases, from social networks to molecular biology. For our use case, the nodes represent different GitHub repositories, and the edges between them are common contributors or common contributions. So PageRank can be used to detect the most prominent nodes within this graph network. However, when we tried to apply it to some use cases, it was great at identifying the important players, like the most veteran or the most well-established projects, but it was not good at identifying projects which are important in relation to the other projects in the community. So for example, if we knew a list of well-established projects and we were trying to understand, in this ecosystem, for example containers, what another project is that could fit into this group, PageRank wasn't really good at filtering out or finding projects in relation to other projects. So another algorithm that we looked at is betweenness centrality.
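As a rough illustration of how PageRank scores the nodes of a repository graph, here is a minimal power-iteration sketch on an undirected, weighted edge list. The toy graph, the damping factor of 0.85 and the iteration count are assumptions for illustration, and this is not necessarily how the speakers computed it; graph libraries such as NetworkX ship their own PageRank implementation.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on an undirected graph given as {(a, b): weight}."""
    # Build a symmetric weighted adjacency map.
    adj = {}
    for (a, b), weight in edges.items():
        adj.setdefault(a, {})[b] = weight
        adj.setdefault(b, {})[a] = weight

    n = len(adj)
    rank = {node: 1.0 / n for node in adj}
    strength = {node: sum(nbrs.values()) for node, nbrs in adj.items()}

    for _ in range(iterations):
        new_rank = {}
        for node, nbrs in adj.items():
            # Each neighbour passes on its rank in proportion to the edge weight.
            incoming = sum(rank[nb] * w / strength[nb] for nb, w in nbrs.items())
            new_rank[node] = (1.0 - damping) / n + damping * incoming
        rank = new_rank
    return rank

# Hypothetical graph: one hub repository sharing contributors with three others.
toy_edges = {("hub", "a"): 1, ("hub", "b"): 1, ("hub", "c"): 1}
ranks = pagerank(toy_edges)
top_node = max(ranks, key=ranks.get)
```

In this toy graph the hub collects rank from every neighbour, which mirrors the observation above that PageRank surfaces the most established, most connected projects.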
Here too, our main aim was to find influential nodes within this network. What this algorithm essentially does is try to find nodes which sit in between other nodes, so nodes which lie on the paths between the different projects within a network. We tried to test this out on Cloud Native Computing Foundation projects, so we applied this to a bunch of CNCF projects, and I'll go a little bit more into this later. But what we saw was that the more veteran or more well-established projects had higher betweenness centrality scores, and the other projects which were new or which were just introduced in CNCF had smaller betweenness. And something pretty interesting which we saw with betweenness centrality was that it was also good at emphasizing a node's popularity in the context of the repositories and the ecosystem that it is a part of.

The next algorithm that we looked at was closeness centrality. Again, this is a way of detecting nodes which are closest to each other, so it can be used to find nodes which are at the shortest distance from an existing node. We essentially use this to find nodes which are the most well connected or the closest to a well-established project, so that we get a better understanding of projects which are almost interdependent on each other. So these were the key centrality algorithms that we looked at.

Now that we have an understanding of these algorithms, we wanted to apply them to certain use cases and try them out on some real-world cases. One of the first use cases was identifying OpenShift, which, for those of you who are not familiar, is the enterprise version of Kubernetes. So we wanted to identify OpenShift as a downstream of Kubernetes. In the years 2011 to 2014, Kubernetes was a well-established project and OpenShift was emerging as a downstream of Kubernetes.
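The two centrality measures described above can be sketched in plain Python for an unweighted, undirected graph. The five-node path graph and its node names are hypothetical; betweenness uses Brandes' algorithm without normalization, and closeness is the number of reachable nodes divided by the sum of hop distances. Libraries such as NetworkX provide equivalent functions.

```python
from collections import deque

def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes) for an undirected graph."""
    score = {v: 0.0 for v in adj}
    for source in adj:
        order = []                              # nodes in BFS visiting order
        preds = {v: [] for v in adj}            # shortest-path predecessors
        sigma = {v: 0 for v in adj}             # number of shortest paths
        dist = {v: -1 for v in adj}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):               # accumulate dependencies
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: s / 2 for v, s in score.items()}

def closeness(adj):
    """Closeness centrality: reachable nodes divided by total hop distance."""
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result

# Hypothetical chain of projects: p1 - p2 - p3 - p4 - p5.
adj = {
    "p1": ["p2"], "p2": ["p1", "p3"], "p3": ["p2", "p4"],
    "p4": ["p3", "p5"], "p5": ["p4"],
}
between = betweenness(adj)
close = closeness(adj)
```

Here `p3` sits on the most shortest paths and is also the fewest hops from everything else, so it scores highest on both measures, while the end nodes score zero on betweenness.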
And we wanted to trace back in time and see if we could detect those patterns just using centrality algorithms. So we tested this out on three control groups. One was well-known projects, so we added Kubernetes and Docker, which are very much related and which were very established during that timeframe. In the emerging projects group, we wanted to detect OpenShift, so that's what we added in that field. And in the last group, other communities, we included certain projects which were also emerging, which were appearing in different news outlets and coming up in those years, like Apache Hadoop, Apache Mesos, the Jetty project and so on. And we wanted to see what these algorithms show us.

So the first thing that we did was look at what the trend was for these projects during those years. We picked one particular OpenShift project. And as you can see here, I can also just quickly show you the real dashboard. We saw that during that 2011 to 2014 range, the pull request activity, the commit activity and the contributor growth for this OpenShift repository were all growing, and we saw that this was actually a prominent project during those years. So we went back and represented these data points from that Git repository in graph representations, and we applied these algorithms to them.

The first thing that we saw upon applying betweenness centrality was that it was very effective at highlighting the Kubernetes and the Docker repos, which are obviously well known. The red nodes and the blue nodes that you see are the Kubernetes and the Docker repos. But we also saw that the OpenShift repos were showing in the graph. The green repos are OpenShift, and they actually came up. But the other community repos, like the purple ones, were insignificant. Although they were very key players, they were not really important in that ecosystem.
And that's why they were effectively filtered out. In the second representation, we saw that this was even better at filtering out those other community repos. Here we see that Docker and Kubernetes are pretty closely related, and the OpenShift repos are also there in this ecosystem, although a little further away.

The second use case that we tried, which I'll just go over very quickly, was representing CNCF projects using graph algorithms. The Cloud Native Computing Foundation ranks projects according to different maturity levels. Graduated projects are the most veteran projects and sandbox projects are the newer projects. So we wanted to see if we could use these graph techniques to essentially represent the maturity of a project. We tried a subset of CNCF repos, around 75 repos. When applying the second graph representation that we saw before, where the edge lengths reflect the closeness or the degree of connection between those projects, we saw that the graduated projects were central in the graph and the sandbox projects were a little further away. Incubating projects were somewhere in between, which is very similar to the maturity delineation that CNCF does. What was more interesting was the betweenness centrality scores. Here we saw that the graduated repos were the biggest blobs, definitely more prominent than the sandbox and the incubating repos.

And without going too much into detail, I will show the overall rankings that we collected from this. What we essentially did was, across all these different centrality algorithms, score each and every project based on the ranks the nodes were getting, normalize those scores, and add them all up to get a final score for each project within that control set.
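One plausible reading of that scoring step is sketched below: each centrality measure is min-max normalized to the range 0 to 1 and the normalized values are summed per project. The talk does not specify the normalization that was actually used, so min-max is an assumption, and the project names and score values are made up for illustration.

```python
def min_max(scores):
    """Rescale a {node: score} mapping to the range [0, 1]."""
    low, high = min(scores.values()), max(scores.values())
    span = high - low
    return {node: (value - low) / span if span else 0.0
            for node, value in scores.items()}

def combined_score(*measures):
    """Sum the normalized scores of every centrality measure per node."""
    total = {}
    for measure in measures:
        for node, value in min_max(measure).items():
            total[node] = total.get(node, 0.0) + value
    return total

# Hypothetical centrality outputs for three projects.
pagerank_scores = {"x": 0.5, "y": 0.3, "z": 0.2}
betweenness_scores = {"x": 4.0, "y": 2.0, "z": 0.0}
closeness_scores = {"x": 0.8, "y": 0.6, "z": 0.4}

totals = combined_score(pagerank_scores, betweenness_scores, closeness_scores)
ranking = sorted(totals, key=totals.get, reverse=True)
```

Normalizing first keeps a measure with a large numeric range, such as unnormalized betweenness, from dominating the sum.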
And here we were looking at about 75 CNCF repos, and we just picked a few top projects within each category that we already know. We saw that the scores were analogous to the maturity level of the project. So the graduated projects had a higher total score, incubating projects had a medium score and sandbox projects had a lower score. This was pretty interesting to see because it also means that such scoring, or these graph algorithms, can be used to identify which projects could be introduced next into the foundation, or to better quantify the decisions that we are making, or in the context of an organization.

So yeah, that's all we had to share with you today. In terms of ongoing efforts, we are also trying to extrapolate this work onto contributors or users. That means applying these algorithms to the maintainers and core contributors on a project and seeing the key players for each project. We're also trying to periodically come up with lists of new projects that we want to introduce into the database so that we can analyze them further, because our research is limited to the projects that are in the database. And finally, we also want to prototype these cool graphs and plots that we saw into the Aspen dashboard that we saw earlier. That would make it very easy to filter down and use these graph algorithms outside of a Jupyter notebook or a Python environment, making it easier for non-technical participants of a community to drill down. So yeah, here is the project repo where we are prototyping a lot of this work, and in that folder you'll find a lot of our notebooks. Again, there's a lot of documentation of this work, and this is a collaboration that we've been doing with the open source program office at Red Hat.
So if you have more questions or ideas, feel free to reach out to us or open issues or just look through our work. And yeah, that's all we had for you. If you have questions, you can ask us now or also enter them in Slido if you're shy and just want to type them out. It's up to you, but thank you so much.

That's a good question. So right now we are relying on the Augur database. Augur is a CHAOSS project, and I think its main data source is GitHub right now. But I think the APIs are very similar. So if Augur is able to ingest GitLab, I think the analysis is essentially the same. Or if there's a separate database that we can fetch data from, I think it's easy to just get GitLab data as well.

So you're using the data?

Yes.

What's the data from your GitHub?

Yes, we are not directly getting it from GitHub, because you know how API calls and API limits and everything are. Augur has been a very sustainable, great source for us because it also has different tables and a very extensive schema. It really helps us go from one project to another project, it has great links in terms of contributors and participants, and it's a good relational database schema that just helps us.

So you're using the source?

Yes. It's a relational database which we feed into more unstructured analysis. Our analysis is definitely more non-relational, more unstructured, but our source is all SQL queries, and then we ingest it into a Python environment.

So what do I need to do to carry out analysis of my own projects?

Reach out to us. If you want to open issues, or we can talk further. Honestly, there are more use cases like the ones we saw earlier that we are willing to try this on. So if yours is a community that you have ideas about, that you want to gain some insights on, we can work together and just spin something up. Yes.
Security or rational projects with one of the tools to present this.

Yeah, that's a good question. That's something that we are currently looking into. As of now we have looked at a specific, defined use case that we can test and validate. But the next step is exactly to your point: we want to do this internally at Red Hat. So we also want to identify what the top security projects are, what the top cloud native projects are. That is something that we are going to do next, to look at it from that particular vertical in a given organization. So yeah, good question. That's something that we're looking at. Yes.

One more question from me. So what practical actions does your data work at Red Hat or does it have to do with that data?

That's a great question. I don't know how much of this I can go into, but in general, we are a part of the emerging tech group and we work closely with the open source program office. Usually the goal for us is to look at what some emerging projects are which can be useful to Red Hat or which can be a good community for us to invest in. So the ideal goal for this is to inform those decisions, and also for us to validate some of our investments, to quantify them better and to see whether there is enough data backing, whether there is metric backing, or whether there are some indicators that we need to be aware of before we take the next step or the next action.

Right, right. And also, if there's any new incubating project that's happening at the company, just to keep an eye on things like: are we having the right resources for it? Is the project going how we expected it to go? What are the next milestones that we should be planning for? Things like that. So looking at it from that perspective.

Okay, we are at the end of our time. So thank you again. Thank you.