All right. Hello, everyone. Thank you for joining us today. We are going to be talking about discovering emerging open source communities through graph analysis. And with that, I will go ahead and introduce myself. My name is Hema Veeradhi. I'm working as a senior data scientist in the Emerging Technologies group, part of the Office of the CTO at Red Hat. I'm based out of the Bay Area, in Sunnyvale, California. If you wish to connect with me after the talk, you can reach out to me on LinkedIn, or on GitHub if you just want to see the work that we've been doing. And I'll hand it off to Oindrila. Thanks, Hema. My name is Oindrila Chatterjee, and I'm also a senior data scientist in the emerging tech group with Hema. And, of course, you can feel free to reach me as well if you have any questions. We're also going to be hanging out at the Red Hat booth right after this, so that's also somewhere to find us. I'm here from Boston, United States. Thank you all for being here. So what are the main goals that we are trying to address in this project? We would like to get early notification of new and evolving open source projects. We also want to track and monitor new and important open source communities, user groups, and ecosystems, and monitor their movement. And we want to visualize where a project is situated in relation to its peers in terms of maturity. With that, I want to let you know that this presentation is going to be interactive. If you would like to participate, you can scan the QR code right there or right here, and we'll have just three questions for you. So firstly, if you can scan that and just type in the free-form field: what's your role in open source in general, or within your community? What is your role and what do you do? I see data scientist, that's also my role. A community manager, that's awesome. Compliance engineer, OSPO leads, dev advocates. That's awesome.
This talk is supposed to be pretty general, addressed towards all these different personas, so it's great to know that we have different people looking at communities from different angles. So moving on to how network analysis can help. We discussed our goals earlier in terms of getting early notification for projects, quantifying a project's maturity, and tracking the communities around it, so how can network and graph analysis help? First, representation: representing open source communities as graphical networks can set the stage for various interesting insights and analyses, and can actually help us drill down into the interconnections within the community. Then, categorizing important nodes: once we have a network representation, we can leverage a plethora of graph analysis techniques to get answers to questions like which are the most prominent projects in an ecosystem, which projects have a lot of dependencies, things like that. Then we can also track these node groups over time, and I'll go a little bit more into what a node is later. Network analyses are typically done at a time snapshot, so once we have an understanding of these important players within the community, we want to track their movements over time and observe things like contributor drift. So I want to briefly introduce Project Aspen, which is the broader project that we are a part of. This is an open source project which was started in the open source program office at Red Hat, and it has two broad components. 8Knot is the Plotly Dash dashboard which visualizes open source community metrics, and it's built using a Python-native data science tool chain. The other component, Repel, is focused on open research, and the current focus area is developer social network analysis, which we are going to talk more about in this session.
If you have more questions about this project, we have Cali Dolfi and James Kunstle in the audience, so feel free to reach out to them after this talk. We also have a Birds of a Feather session this afternoon, so you can dig more into this project or chat with them about communities. This project is aimed at creating both community and business impact. It's aimed at measuring project health, measuring the impact of communities within the broader ecosystem, and early detection of risk factors such as drift, which can inform business decisions. These metrics can also help inform and target marketing initiatives, and can help measure the impact of those decisions. So where does the data behind all of this come from? I want to call out Project Augur. It's an open source project which was started by Professor Sean Goggins at the University of Missouri. Augur creates a database of events in OSS projects: it essentially collects GitHub data and all the associated things like commits, PRs, issues, reviews, and contributors. It's part of the CHAOSS community, the Community Health Analytics in Open Source Software foundation, and essentially it helps structure the mountains of knowledge coming in from GitHub into more structured databases that can be used by data scientists like us. So this is how these projects are connected; this visual will help you understand it. Essentially, the data coming from the Augur database is fed into the 8Knot dashboard service, and we are using this data to also inform our analysis. So now let's talk about the exciting stuff: networks. For those of you who are not familiar with networks or graphs, it warrants a brief introduction. A network refers to a structure which is represented by a group of objects or people and models the relationships between them. It's also known as a graph in mathematics. A network structure consists of nodes, which are the dots that you see, and edges, which are the lines.
Here nodes represent the objects that we are going to analyze, and edges represent the relationships between those objects. So how do we represent open source ecosystems as networks? We do that in two ways. The first way is by representing both projects and contributors as nodes in a graph. A connection between them essentially means that a contributor contributed to a project in the form of an activity, and we will discuss in a bit what an activity means. One thing to note is that in this first graph, all the edge weights are one; all the edges are uniform. In the second representation, all the nodes that you see are project repositories, and the edge weights are governed by the degree of connection between these project repos. So the stronger the connection between two projects, the closer they are to each other within the network. We did an experiment, which my colleague Hema will go over in more detail in a bit: over a historical time frame, we took some repositories like Kubernetes, OpenShift, Docker, and Eclipse, and we tried to visualize the relationships, if any, between these communities using the second representation type that we discussed earlier. If you look closely at the graph on the right side, this technique actually turned out to be a pretty effective way to filter down to repositories which are linked to well-known repositories and identify how close they are to each other. Over that historical time snapshot, we saw that, understandably, the Kubernetes, Docker, and OpenShift repos turned out to be closely connected, while some of the Eclipse repos got filtered out; we don't even see them on the graph. And we see that some repos, especially Docker, OpenShift installer, and Kubernetes, are some of the closest nodes, while some of the other nodes are a little further away. So what counts as a shared activity in the representation that we discussed earlier?
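To make the second representation concrete, here is a minimal sketch using Python and networkx. All repo and contributor names are made up for illustration, and this is not the actual Aspen code: we build the bipartite contributor-project graph, then project it onto the repositories so that each repo-repo edge is weighted by the number of shared contributors.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Bipartite graph: project repos on one side, contributors on the other.
# An edge means the contributor had at least one activity in that repo.
B = nx.Graph()
repos = ["kubernetes", "docker", "openshift"]             # hypothetical repos
B.add_nodes_from(repos, bipartite=0)
B.add_nodes_from(["alice", "bob", "carol"], bipartite=1)  # hypothetical people
B.add_edges_from([
    ("alice", "kubernetes"), ("alice", "openshift"),
    ("bob", "kubernetes"), ("bob", "docker"),
    ("carol", "openshift"),
])

# Project onto the repo side: repos become the nodes, and the edge weight
# counts how many contributors the two repos share.
G = bipartite.weighted_projected_graph(B, repos)
print(G.edges(data=True))
```

In this toy example, kubernetes and openshift end up connected because they share alice, and kubernetes and docker because they share bob, while docker and openshift get no edge at all.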
So issues, PRs, commits, and PR reviews by the same contributor count as shared activities and make the connection between two projects stronger. What we are also working on right now is altering the strength of that connection based not only on the quantity of these activity points but also on the quality of the contribution: treat the weight differently if the contributor is a maintainer versus a core contributor versus a developer. Like the personas that we saw earlier, which all of you answered, different personas in a community might have a different kind of impact on, or interaction with, the community. Then we use various techniques, which we'll discuss a little more later, to find the more prominent nodes within the network. And finally, to find emerging and new projects, we filtered these repos based on various criteria, like projects started recently, say in the last year, the number of forks, the number of stars, and growing activity trends. These are some of the things that we are filtering on right now. So I want to know: what makes a project rapidly emerging to you? These are some prompts; you will see them on your phone screen. You shouldn't need to scan again; you should see this pop up. If any of these seem important to you, you can just click on them and rank them. We want to get a sense of what people think is most important within their communities for making a project rapidly emerging. I see some votes for growing external popularity, growing activity trend, not so much stars and forks. Company investment is also important for some people. So it's mainly growing activity trend, external popularity, company investment, number of forks. Okay, cool. I see around 11 people have participated, and it seems like a growing activity trend (issues, PRs, commits on a certain project) is considered the most important factor in calling a project emerging.
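The role-based weighting idea is still exploratory on our side; here is a small sketch of how it might look, with entirely made-up multipliers and a hypothetical helper function, just to show how the same activity could count more when it comes from a maintainer:

```python
# Hypothetical role multipliers: illustrative numbers only, not the scheme
# actually used in the project.
ROLE_WEIGHT = {"maintainer": 3.0, "core contributor": 2.0, "developer": 1.0}

def edge_weight(shared_activities):
    """Compute a quality-adjusted edge weight between two repos.

    shared_activities: list of (activity_type, contributor_role) tuples for
    activities the two repos have in common via the same contributor.
    Unknown roles fall back to a weight of 1.0.
    """
    return sum(ROLE_WEIGHT.get(role, 1.0) for _, role in shared_activities)

activities = [("commit", "maintainer"), ("pr_review", "developer")]
print(edge_weight(activities))  # 3.0 + 1.0 = 4.0
```

With plain activity counting this edge would have weight 2; the role multipliers push it to 4 because one of the activities came from a maintainer.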
And that's also being factored into our analysis. Something which is, in our opinion, difficult to estimate just from GitHub is company investment or funding, and also growing external popularity, because you cannot see those directly from GitHub data. So, exciting. The next question I want to ask is: what's the one most important insight that you're trying to get from your community? Are you trying to observe contributor drift? Are you trying to measure the overall health of your project, or make business decisions by observing project health metrics? If there's one insight that you're trying to get from your community, we want to know what that is, and we can discuss how graphs and network analysis can possibly help. Sustainability: that's definitely an important theme that I also saw at CHAOSScon a couple of days ago. Risk, especially maintainer activity: we'll discuss risk a little bit more, especially in terms of contributor and core maintainer drift and how network analysis can help point that out. Project velocity: I think some of the work that Project Aspen is doing can very clearly help quantify project velocity, and also growth trends, emerging leaders, opportunities for contributions. That's also hard to quantify. In our parallel analysis of contributors and participants, we are trying to see how key players in the industry are involved with these different project ecosystems, and that can be a good indicator for it. But this is great; thanks for sharing what you're trying to track, and we'll try to keep that in mind while we go over the rest of the session. With that, I will hand it over to Hema. Thank you. Thanks, Oindrila. So now we have a good understanding of the problem that we're trying to solve, and as Oindrila mentioned, these are some of the graph techniques that we're looking at.
So let's dive a little more into the technical aspects and look at some of the solutions that we've explored so far. As part of graph analysis, we look at something called centrality algorithms. Centrality algorithms are among the older, more traditional graph algorithms, and they help you identify the important nodes in a given graph. Here, the importance of a node can be based on how many hops it takes to get from one node to another, how centrally located a given node is, and which node sits on the most shortest paths connecting all the other nodes in your graph. These are just some examples of how these algorithms try to identify the more important nodes in your graph. One of the first centrality algorithms that we looked at was PageRank. PageRank, which most of us use pretty much day in and day out today, is what formed the basis of the Google search engine. It was developed by Larry Page, co-founder of Google, and the main aim of the algorithm was to rank the most relevant web pages based on the quantity and quality of links a given page has. This algorithm is now popularly used well beyond web search analysis. For example, it can be used in social network analysis: nowadays we're all into social media, so we want to find out who the most influential users in a community are. And it goes beyond that, to things like molecular biology and road network analysis. So these kinds of algorithms are being stretched well beyond the web, and we tried to use this one for analyzing our data set. The graph that you see here shows some of the highly ranked nodes based on PageRank.
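As a rough illustration of how PageRank scores can be computed on a repo graph like the ones described above, here is a sketch with networkx. The graph and its weights are invented for the example; the actual analysis lives in the Aspen notebooks.

```python
import networkx as nx

# Toy repo-repo graph; edge weights stand in for shared contributor activity.
G = nx.Graph()
G.add_weighted_edges_from([
    ("kubernetes", "docker", 5),
    ("kubernetes", "openshift", 3),
    ("docker", "openshift", 1),
    ("kubernetes", "eclipse", 1),
])

# PageRank: nodes with many strong links to other well-linked nodes rank
# higher. The scores form a probability distribution over the nodes.
scores = nx.pagerank(G, weight="weight")
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy graph the kubernetes node carries the largest total edge weight, so it lands at the top of the ranking.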
These are some of the top nodes, say the top 10, ranked by the algorithm. So PageRank helps you identify those high-profile, popular nodes, but it doesn't really help you understand how much influence a node has within the graph, or how many user communities are branching out of those nodes. That kind of information PageRank fails to capture. Hence we moved on to another algorithm, called the betweenness centrality algorithm. Betweenness centrality is designed to identify those influential users I was talking about, or influential nodes, in a given network. It's a way of detecting the amount of influence a node has over the flow of information in your network. It's often used to identify nodes which act as a bridge connecting, say, one subgraph to another, because they are prominently placed in the network to pass information from one to the other. That's the main essence of what the betweenness centrality algorithm does, and each node in your graph receives a score. The score is based on the shortest paths between other pairs of nodes: how often the node in question lies on those shortest paths. The higher the betweenness centrality score, the more influential the node. The graph that you see here is from applying betweenness centrality to some of the CNCF projects, which we'll dive into a little more. You can see that the larger, more centrally and densely placed nodes have the higher scores, compared to the smaller nodes on the far outer side, which received the lower scores when we ran this algorithm.
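The bridge-detection behavior of betweenness centrality is easy to see on a toy graph. The node names below are made up, not the CNCF data: two small clusters of repos are joined by a single bridge node, and that bridge should receive the highest score because every shortest path between the clusters runs through it.

```python
import networkx as nx

# Two clusters of repos joined by a single "bridge" repo.
G = nx.Graph()
G.add_edges_from([
    ("a1", "a2"), ("a2", "a3"), ("a1", "a3"),   # cluster A (a triangle)
    ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),   # cluster B (a triangle)
    ("a1", "bridge"), ("bridge", "b1"),          # the bridge between them
])

# Betweenness counts, for each node, the fraction of shortest paths between
# other node pairs that pass through it.
bc = nx.betweenness_centrality(G, normalized=True)
print(max(bc, key=bc.get))  # -> 'bridge'
```

Every pair with one node in cluster A and one in cluster B must route through the bridge, which is exactly the kind of information-flow influence described above.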
Next we come to the third algorithm that we looked at, the closeness centrality algorithm. This algorithm goes hand in hand a little bit with betweenness centrality. It's used to detect nodes in a graph that are able to spread information efficiently through a subgraph, and it's measured by how long it takes for a node to spread information within the network, basically to pass that information on to other nodes. Here again, the higher the closeness centrality score, the better placed the node is to influence the entire network most quickly. Since here we're essentially measuring how quickly a node can reach everything else, it's slightly different from betweenness, which was based on the shortest paths passing through a node. We're looking at the time it takes for information to spread out and for the node's influence to be picked up by other nodes in the network. This metric can be useful for us to identify, say, in an open source ecosystem, which user communities are more influential, and which user communities we can use to help drive, promote, and build stronger communities around certain smaller open source projects that we might be looking at. The screenshot that you see here takes all three algorithms and assigns a score to each of the projects that we analyzed from GitHub. All of these scores are populated for every node, where each node corresponds to an open source project that we looked at. Now let's actually apply these algorithms to real-world scenarios. We looked at two different use cases. In the first use case, we were trying to identify OpenShift, which most of you might be familiar with or at least have heard of. It's Red Hat's enterprise Kubernetes container platform product, and it actually emerged as a downstream of the widely known Kubernetes project, during the time period of 2011 to 2014.
So we took GitHub data from that time frame, and we collected projects in three different categories: first, the well-known projects in that time period, which were Kubernetes and Docker; second, the projects emerging in relation to those well-known projects, where in our case we're trying to identify OpenShift; and finally, other prominent communities that were also arising in that time frame but were not necessarily directly related to Kubernetes. There were communities like Apache Hadoop and the Eclipse project which were also prominent in that time range, just not directly dependent on the Kubernetes project. That was our main use case, and these are some of the results we started getting. First, if you look on the left side, you see the betweenness centrality algorithm scaling all the nodes. The large red nodes correspond to the Docker repositories, the blue nodes correspond to Kubernetes, and the green nodes correspond to OpenShift repositories. That third category I mentioned, the emerging communities not directly dependent on Kubernetes, gets filtered out of this graph. They don't really show up, which means they don't have directly related contributor or user communities, and hence the algorithm ranks them out of this particular graph. In the second one, we are trying to get a closer view of which repositories or projects are closely related to Kubernetes. Here we see green ones like the installer project, source-to-image, and OpenShift installer, which are closely placed based on the edges connecting them, the edges here being contribution activity.
So, as you mentioned in the survey, things like PRs, issues, and commits are what drive contribution activity, and that's also what we're analyzing and depicting in our graphs through the edge lengths that we visualize here. This is a representation of some of the results that we got. Next we looked at another use case, which was to represent the CNCF projects. CNCF, as we all know (we're all here today at OSS), is the Cloud Native Computing Foundation, which is part of the Linux Foundation, and they provide support to growing and emerging cloud native projects. They have three categories of projects: graduated projects, incubating projects, and sandbox projects, which are defined by CNCF based on their maturity levels. You can see how they've grouped the projects in the diagram that CNCF has put on their website. Sandbox projects are the innovative, cutting-edge, bleeding-edge technology projects. Incubating projects are those which have gone beyond just innovation and are slowly getting adoption and growing their user communities; currently, some of the incubating projects are OpenTelemetry and Thanos. And then there are the graduated projects, which have gained a lot of traction and user community growth and are widely adopted by users around the world; for example, right now the graduated projects include Prometheus, Helm, and Kubernetes. Based on these different categories, we took data and applied our algorithms to a collection of about 75 CNCF projects. In the results that we saw from these algorithms, we again get to identify the more centrally located nodes. The blue ones that you see here correspond to the graduated projects.
The green ones that you see are some of the incubating projects, and the farther-out red nodes are some of the sandbox repositories. The sandbox repositories do have some contributors who have been working on more well-developed projects, but the placement of the nodes in the graph gives us an indication that they don't have that strong a user base yet, or don't have that many contributions yet. So this is one example of the representation. The second representation here is the betweenness centrality algorithm, which effectively filters for the projects we are more interested in seeing: the projects which are closely connected to the graduated projects that CNCF has defined. We try to see which lesser-known projects are emerging out of those well-developed projects. This can be a good way to effectively filter for emerging projects which are already in relation to some of the prominent communities that we have. We also have code: we've written most of these algorithms in Python and published them as Jupyter notebooks in a public repository under the Aspen project, so you can check it out later to see how these algorithms are implemented. Finally, the most important takeaway for us from these algorithms: as we saw, we had three different algorithms that we tested, and each of them produces a score. So we summed up all three scores, assigned every project a total score, and ranked the projects from highest to lowest. What's interesting to observe across the three categories is that the graduated projects ended up having higher scores, in the range of 1.5 to 3; sandbox had a lower score range, between 0.3 and 0.5 or 0.6; and incubating sat in between graduated and sandbox.
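The score aggregation described here can be sketched as follows. This is a toy graph with made-up edges standing in for the real CNCF network; the full pipeline is in the Aspen notebooks.

```python
import networkx as nx

# Toy repo graph standing in for the CNCF project network.
G = nx.Graph()
G.add_edges_from([
    ("kubernetes", "helm"), ("kubernetes", "prometheus"),
    ("kubernetes", "k3s"), ("kubernetes", "thanos"),
    ("helm", "prometheus"),
])

# Compute all three centralities, then sum them into one total score
# per project, as described in the talk.
pr = nx.pagerank(G)
bc = nx.betweenness_centrality(G, normalized=True)
cc = nx.closeness_centrality(G)
total = {n: pr[n] + bc[n] + cc[n] for n in G}

# Rank projects from highest to lowest total score.
for repo, score in sorted(total.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{repo:12s} {score:.3f}")
```

In this toy network the kubernetes node is the hub, so it comes out on top on all three measures and therefore on the combined score, mirroring how the graduated projects ended up in the highest score band.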
Again, these were tested on around 75 CNCF projects; these are not all the CNCF projects. A quick mention about the data set we've been using: this was also tested on the most recent three years of data that we collected, from 2020 to 2023. Based on those analyses, these are some interesting projects that we saw. For example, some of the popular incubating projects now are Thanos and Istio, and some of the sandbox projects that are gaining more traction now are the Dex project, the k3s project, the Kafka operator projects, and so on. So these are some ways you can use these algorithms to test your own communities. Even if you're building out your own open source project and trying to see where it stands in the wider ecosystem, you can run these algorithms and basically assign a metric that correlates with the importance of these projects when you're analyzing them from an ecosystem standpoint. With that, we also want to mention some of our ongoing efforts, along with some resources. This was our initial exploration and research into representing projects with a graphical approach. Ultimately, we also want to identify the important user communities, so we want to perform the analysis at a more individual contributor level and get an idea of which communities are growing. And finally, the ultimate goal is to have some kind of machine learning model implemented which is able to predict, as we go, the trajectory of a given project. Are we able to predict how this project is going to do? Some of you mentioned you want to know the sustainability of a project, or the growth trends of a project. Those are great indicators for us as well, because that's something we're aiming to achieve with the machine learning models that we are going to start building out.
Once we've taken this graphical approach, our aim is to build out that model and integrate it with some of the work that the Aspen project is doing. If you want to check out the ongoing work, as I mentioned earlier, we have a GitHub repo, so feel free to check that out and see the notebooks. You can also create any issues that you want, or just reach out to any of us, as we are maintaining that repo going forward. With that, I would like to thank you all for attending and listening on the last day of the conference. Thank you for tuning in, and if you have any questions, we're happy to take them now. What I would say is interesting is that as we expanded the data, there was definitely a change across the time ranges. When we were testing it for, say, back in 2010 to 2015, the ranking of projects was different compared to this year or the previous year. Some projects even ended up getting eliminated, which means that over the years, some projects which were getting a lot of traction declined. That's interesting because it will also help us build the predictive capabilities I was mentioning. So these are some interesting ways to see how the community is growing over time. That's one thing. Yes, just adding on to that: when tested over a historical time frame, we were able to see indicators of whether a project would be sustainable in the long term or not, and when we expanded the data set, we had clearer data points. The other interesting thing is the first use case that she showed. We have mainly been approaching this from a more historical time frame in order to validate our approaches, because we don't want to get into prediction before knowing for certain that this makes sense.
So definitely the OpenShift use case: the approach was able to effectively filter out communities which are important but not really relevant to the ecosystem we are tracking, which was Kubernetes and Docker. It surfaced that this small project, OpenShift, is pretty important in the graph: although it does not have that many contributors, it has strong connections with the existing projects which we consider important. So I think it's just interesting that when you take all of this data and represent it in the form of networks and graphs, you can actually start to model those interconnections pretty clearly, and you're able to get a clearer picture of those relationships. Like I mentioned, this is a research project, so we are not using it internally as yet, but essentially the goal is to inform our investment in emerging technologies and inform which projects we want to invest in. But again, the focus of this project is mainly open source, and we want to make it available as part of Project Aspen, which is essentially focused on community metrics, so that any user who wants to model their ecosystem or dig deeper into the groups of projects they want to track can use it. Right, so from one of the initial slides, we mentioned that the goal is to track emerging projects within an ecosystem, and that ecosystem, as we saw here, could be CNCF, or it could also be an organization. Think about modeling this by taking, instead of CNCF repos, all repos which are Red Hat repos or Red Hat repo dependencies. It's still in the very initial research phases, but we have been trying to work within Red Hat repositories and their dependencies to see if something stands out, if some projects which exist in our adjacent space stand out.
So that's one important relationship to those projects, and the ultimate goal is obviously to come up with a top-10 or top-20 list of projects which we are interested in for that year, or in the near time frame. Right. So this is going to be integrated into the nice visualizations which Project Aspen has. You can definitely leverage the Augur database, which can pull in any repos you're interested in, and that will work very neatly with these notebooks. If you're familiar with Python, you can run these notebooks to get these charts for your own repos, but eventually a way more seamless experience will come when we are able to integrate all of this into that nice dashboard which Project Aspen has. Yeah, just to add: right now the dashboard that Aspen has is more from a community health metrics perspective, so that's readily available; if you want to reach out to Cali or James, they'll be happy to get you connected with that. Our work is still getting onto that pathway to ultimately land in those dashboards, because from a compute perspective it does take a while to generate those graphs, but it's on our roadmap to get it integrated with their dashboard. As for the repo, you can definitely feel free to look at it, see the work that we've been doing, and follow along as well. Thank you. Yes, that is definitely something that we're going to look at next. As an initial first pass for these models and graph analyses, we mainly just picked GitHub data to start off with, because we had a lot of open source data readily available there. Other sources like Slack take some work to get across those APIs, and some teams may or may not give us access, things like that. So we do want to expand beyond GitHub; that is definitely something that we've been thinking about as well. Currently it's only focused on GitHub activities. All right, no more questions. Thank you.
Thank you for joining.