So hi, my name is Brian Proffitt. I'm the manager of the Community Insights team within the Open Source Program Office, and part of what our team does is gather metrics and data to figure out how healthy communities are and make sure they are growing and thriving the way open source projects should. One of the people I work with on my team, an amazing talent, is Kelly Dolphy, and I'll let her introduce herself. Hi, my name is Kelly Dolphy and I'm a data science intern here at Red Hat. I've been here for about a year, and a lot of what you'll see today is work Brian and I have been doing over the past six months. I'm also finishing up my degree in computer science at Boston University. Okay, great. So we'll go ahead and get started. The first question we always ask ourselves is how to discover community health and sustainability. This is really important. As I mentioned, making sure communities are healthy and growing is a big part of our job in the Open Source Program Office at Red Hat. But lately we're also very interested in the sustainability aspect of a given project, because as more businesses and organizations get involved in open source, they really want to know how strong and how healthy a community will be before they invest time, money, and resources into that project. Next slide, please, Kelly. Okay, so historically this has always been done by the seat of our collective pants. We've tried to gauge community health with signals that were fairly innocuous and seemed obvious: how popular is a given open source project? How many people are using it? How many downloads is that free software project getting? On the surface, those seem like really strong signs of community strength and health. As platforms like GitLab and GitHub came into being and you could look at the first-level activity of a project on those platforms, you might infer popularity and strength from stars, like GitHub has. And there's nothing really wrong with any of those metrics. The problem is this: they don't necessarily give you a true signal of the community's health. Popularity, downloads, and star counts look awesome when you're trying to market your project and tell people how wonderful it is, and they may very well be true indicators of that. But think about this. Suppose you have a project with a bazillion downloads, and there's no indication on the user consumption side that there might be infighting going on inside the project. There's been a massive battle on the project mailing list for months, the project is about to fork, and that will lead to a degradation in the quality of the software coming out of it. Nobody on the user side will know, because it's still being downloaded. So it's not that downloads and stars and popularity in general are a bad thing, but we have found over time that they are not solid, true indicators of how healthy a community is. Next slide, please.
So what we're doing in the Open Source Program Office, specifically on the Community Insights team, is trying to move beyond those gut feelings and deploy analytical rigor. We've been able to do that through the rise of two key movements within the broader open source ecosystem. One is the standardization of metrics, and the other, alongside that standardization that lets us look at all these different kinds of community projects in the same way, is the evolution of tools. We'll talk about each of these as we move through this. So next slide, please. Right now the tools look like this, and they all build on the standardization of metrics I referred to earlier, which comes from a project independent of Red Hat: the CHAOSS project. CHAOSS is a Linux Foundation-sponsored project that is very keen on defining metrics that can apply to any project, because in the past a lot of the pushback has been that you can't really apply the same metrics to, say, a database project and an academic project. That turns out not to be the case. All projects, no matter how they are constructed or what they produce, have similar aspects that can be measured. To measure them, we at Red Hat in the OSPO are currently using three sets of tools. One is known as Cauldron. Cauldron is an open source project based on Grafana, Elasticsearch, and the GrimoireLab tools, all of which are put together by a wonderful vendor in the open source community known as Bitergia. Bitergia offers Cauldron as a hosted service: you can go to cauldron.io now and start running your own metrics against projects out in the open source world, whether they are hosted on GitHub or GitLab. What Cauldron gives us is a very quick analytical and graphical snapshot of what a Git-hosted project looks like. What are the activity levels? How many developers are there now? How many developers were there last year? How fast are pull requests getting their first review? How fast are those pull requests and issues getting closed? These are all things Cauldron can do. In fact, we liked it so much that we have stood up our own instance of Cauldron so we can run our research projects that much faster. Alongside that, we are working with members of the CHAOSS project on another tool that does a lot of deep diving into data sets, called Augur. Augur does not give us a graphical picture of what a community looks like, but it gives us really solid, connected pieces of data that show how different contributors are working across projects, what those connections are, and how projects relate to each other, which is amazing when you are trying to look at an entire ecosystem. With Augur you can very quickly see not just the strength of a project, but how it connects to all the other projects related to it, which is an amazing set of insights. Beyond that, we take the information from these two tools and build what we call community report cards. These report cards, or analytical reports, give us the graphical and analytical material to describe how a community is doing. We can run them at any time.
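For readers who want a feel for the kinds of signals described here, below is a minimal sketch that pulls two of them straight from the GitHub REST API rather than from Cauldron or Augur themselves. The repository name is a placeholder, and a real analysis would handle pagination, rate limits, and identity merging the way GrimoireLab does.

```python
# Sketch: two Cauldron-style health signals pulled directly from the GitHub REST API.
# Assumptions: the repo name is a placeholder; only the first page of results is read
# and error handling (rate limits, auth) is omitted for brevity.
from datetime import datetime, timedelta, timezone
from statistics import median
import requests

REPO = "chaoss/augur"  # hypothetical target repository
API = f"https://api.github.com/repos/{REPO}"
year_ago = datetime.now(timezone.utc) - timedelta(days=365)

# Signal 1: unique commit author emails over the last year (a crude "active developers" count).
commits = requests.get(
    f"{API}/commits",
    params={"since": year_ago.isoformat(), "per_page": 100},
    timeout=30,
).json()
authors = {c["commit"]["author"]["email"] for c in commits if c.get("commit")}
print(f"Unique commit author emails in the last year (first page only): {len(authors)}")

# Signal 2: median time-to-close for recently closed pull requests.
pulls = requests.get(
    f"{API}/pulls",
    params={"state": "closed", "per_page": 100},
    timeout=30,
).json()
days_to_close = [
    (datetime.fromisoformat(p["closed_at"].replace("Z", "+00:00"))
     - datetime.fromisoformat(p["created_at"].replace("Z", "+00:00"))).days
    for p in pulls
    if p.get("closed_at")
]
if days_to_close:
    print(f"Median days to close a PR (recent sample): {median(days_to_close)}")
```

Cauldron and Augur compute these same ideas at much larger scale, with enriched and de-duplicated identity data; this is only meant to show what the raw signals look like.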
We are working to make these report cards as automated as we possibly can, so that somebody can run the report, look at the data, draw conclusions based on what they know about the project and its history, give that information back to the community itself, and the community can work on identifying the things that might be problems and need to be solved. So a combination of these three tools is what we are using now. I am going to hand it over to Kelly now and let her talk about some of the other things we are doing within our team. Looking at this next phase of tools, we are trying to take the information we are comfortable with and have gotten very experienced with, and find new perspectives with the new options coming onto our table. This gets into the next level of analysis: what a business needs to know about open source communities before it gets heavily involved. It is no secret that businesses these days have a growing interest in open source communities, but a lot of them lack an understanding of the little nuances of the open source world. You can measure health, and that is one portion of it, but it is a completely different thing to figure out how a community fits into your business in a way that is sustainable. Is the community sustainable, and is it something that is compatible with you? Coming from somebody who is relatively new to the world of open source, I have always been asking: what should I be paying attention to whenever we start looking at new angles on open source communities? How do you go from the intuitions people have built up in the past to making that knowledge data-driven? That brings us to the new tool we have created in partnership with IBM Research. It has three parts. The first is IBM Watson Discovery. This looks at GitHub in a completely new way, using AI-powered natural language processing. We can start to understand the industry's language: open source keywords and things along those lines. This gives us a way to group repositories together, not by certain metrics or by whether the same contributors work on them, but by seeing whether they use similar language. Are they talking about similar things in their READMEs? That brings a whole new way to group repos together and to see new trends happening across all of GitHub. The next stage is Project Debater, which is one of the pieces I am really excited about. Project Debater takes in a set of data where you can find arguments for or against a certain position. For example, you can type in "OpenShift is the best container platform" and see the arguments around it, see the weight on the positive or negative side, and see what people are saying about it. You can start to figure out where the weight sits: if you are looking at it from the project's point of view, what do I need to do to take a step up? If you are looking at it from an outsider's point of view, is this something I want to become involved in? From there, we can start to use the experimental method. This is where we run multiple different searches with a few of our different tools to see the impacts of the work you are doing: the impact of the different events we hold, of discussions like the one we are having here, on different search targets.
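To make the README-grouping idea concrete, here is a small stand-in sketch, not the Watson Discovery implementation, that clusters repositories by the language of their READMEs using TF-IDF and k-means. The repository names and README snippets are invented for illustration.

```python
# Conceptual stand-in for the README-grouping idea (not the Watson Discovery implementation):
# cluster repositories by the language of their READMEs using TF-IDF and k-means.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical inputs: repo names mapped to README text you have already fetched.
readmes = {
    "repo-a": "Operator for deploying containerized workloads on Kubernetes clusters...",
    "repo-b": "A CLI for building and pushing container images to a registry...",
    "repo-c": "Machine learning pipelines and notebook tooling for data scientists...",
}

# Turn each README into a weighted bag-of-words vector.
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
matrix = vectorizer.fit_transform(readmes.values())

# Small k for the toy example; a real run would choose k from the data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(matrix)

for repo, label in zip(readmes, kmeans.labels_):
    print(f"{repo}: cluster {label}")
```

The point of the grouping, as Kelly describes, is that repos end up together because they talk about the same things, not because they happen to share contributors.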
When we bring all of this together, you start to see open source communities in a new light and get a step ahead of everyone else in the community. Yeah, thank you. Taking the new tools Kelly just outlined, we are going to combine everything I mentioned earlier with the new tools we are building with IBM around Mode, Debater, and Watson, and start looking at things we have never really been able to do before. One of those is the business impact of a given community. In the past, measuring the return on investment for a community has been rather difficult, because you have to be part of an open source project and put time and effort into it, but what is the business getting back? You can say, well, we are getting a commercial product that we sell and make money from, and that is certainly true. However, there are more aspects you can pull out when you look at all the different elements of working with a community, and that is what our tool set is trying to do. We will be looking at things like whether a given organization is interested in raising a certain conversation within the broader open source ecosystem. With the Mode tools Kelly described earlier, we can look at conversations going on in Git-based comments, in issue trackers, and on mailing lists. Any place there are public conversations, we can quickly look for the terms and the conversations we want to see raised. Hypothetically, if a company were trying to talk up the container space and Kubernetes-based tools, like what we see around the OpenShift ecosystem, you could start looking to see whether those conversations were happening. Are they positive conversations? If they are not, maybe there is something going on; maybe there is a problem with your tool set that you were not aware of before. We can start dialing in and figuring out what those conversations are about. We can also look at how resources can be calibrated toward community health. If we see a problem within a given open source project, we can quickly ascertain what skill sets and what resources, in terms of infrastructure or anything else, need to be put in place to help that community solve the problem it is having. Another thing these tools will help us do is get into that element of sustainability. With all of these tools at our disposal, we can measure risk at many different levels. We can still measure internal project health, which is what we have been doing for quite some time with Cauldron and Augur, especially Cauldron. Now, with Augur and Mode, we can look at how that community interacts with the broader open source ecosystem. We can really get numbers and figure out exactly how important a given project is to its peers. Again, it is not a popularity contest. We are trying to define, very quantitatively, the strength of a project and also how important it is.
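As a rough illustration of the "watch the public conversation around your terms" idea described above, here is a minimal sketch using a generic off-the-shelf sentiment scorer, NLTK's VADER, rather than Project Debater or Mode; the watch list and the sample comments are made up.

```python
# Minimal sketch of the "track the conversation around your terms" idea.
# This is a generic stand-in (NLTK's VADER sentiment scorer), not Project Debater or Mode.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
analyzer = SentimentIntensityAnalyzer()

TERMS = {"kubernetes", "openshift", "containers"}  # hypothetical watch list

# Hypothetical sample of public comments pulled from issue trackers or mailing lists.
comments = [
    "OpenShift made our Kubernetes rollout much smoother.",
    "The upgrade broke our containers again, this is getting frustrating.",
    "Not related to containers at all.",
]

for text in comments:
    if any(term in text.lower() for term in TERMS):
        score = analyzer.polarity_scores(text)["compound"]  # -1 (negative) to +1 (positive)
        tone = "positive" if score > 0.05 else "negative" if score < -0.05 else "neutral"
        print(f"{tone:>8} ({score:+.2f}): {text}")
```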
None of us watching this want to see another OpenSSH situation, where a project is maintained by a very small number of people and yet so many other projects rely on it. If those people are no longer able or willing to maintain that project, there is a very large problem in the broader open source ecosystem. The other thing we are doing here is trying to catch these risk factors as early as we possibly can. The earlier we can figure out that there is going to be some kind of risk to sustainability for a given project, the faster all the businesses involved with that project can make a business decision and rescue it. You are not in constant firefighter mode; you are planning ahead and making business decisions based on project risk as early as possible. Kelly, tell us a little bit more about the other things we want to try to do. Absolutely. The next stage of this is looking at strategic investment and taking that a step ahead. By putting together the different tools we have here, we can start looking at new companies and new communities that show up in this contributor and project data, and start to see the anomalies: what new players are coming into the field that we have not necessarily paid attention to before? This can go from just being a community to being a buzzword. We can start asking what is going to be the next containers. If we start to see buzz, whether it is in GitHub activity or at different events, we can ask how this activity compares to the big, exploding buzzwords of the past, whether there are similarities between the two, and whether it is something we should try to get in on early. Once we start to see certain terms come up, we can track their trends, look at their GitHub data, and see what the issues and discussions around them are. What are people saying about them? We can look at that with Project Debater, see whether it is going up or down over time, and take that step-ahead look at where we can become involved. From here, we will go into a demo of these different projects and how they connect together. First, we will do a walkthrough of the Mode tool. If there is any chance you cannot see my screen, please let me know. Right here we have a demo of the Mode tool on the Discovery side. The terms being used to bring all the data in are, specifically, Red Hat, Fedora, and CentOS Stream. We can change those key terms; we could decide we want to look just at containers in general. For this tutorial, though, those are the terms we are looking at, and they define the large data set going into this. One thing I have really picked up on is that you can start examining the impact of events by looking at changes in GitHub activity around those dates. One example is DevConf.CZ, which usually occurs around January. Looking right after the 2019 event, we can see a jump in the number of projects created, and even more activity in the period leading up to the event and after it.
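Here is a minimal sketch of that event-impact analysis: bucket commit timestamps into weekly counts and compare activity before and after an event date. The event date and the timestamps are invented; a real run would pull them from the Mode subset or from Cauldron.

```python
# Sketch of the event-impact analysis: bucket commit timestamps into weekly counts
# and compare activity before and after an event date. All data here is made up.
import pandas as pd

EVENT_DATE = pd.Timestamp("2019-01-25")  # assumed conference date, for illustration

# Hypothetical commit timestamps gathered from the repos in the search subset.
commit_times = pd.to_datetime([
    "2018-12-30", "2019-01-05", "2019-01-12", "2019-01-20",
    "2019-01-28", "2019-01-29", "2019-02-02", "2019-02-03", "2019-02-10",
])

weekly = (
    pd.Series(1, index=commit_times)
    .resample("W")          # weekly buckets
    .sum()
    .rename("commits_per_week")
)
print(weekly)

before = weekly[weekly.index < EVENT_DATE].mean()
after = weekly[weekly.index >= EVENT_DATE].mean()
print(f"Average commits/week before the event: {before:.1f}, after: {after:.1f}")
```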
The number of commits per week is on a large upward trajectory, which I think is a huge thing to look into here, because projects being created is one portion of it, but if there is more activity around projects that are already established, we start to ask: what was being talked about during this event that is making activity go up? Is it something new, or is it something that has been on the stage for a little while but is really taking off because of the talks at this event? That is when you can get into a little more analysis and ask, where is the why? From here, we can also look at growing technologies and buzzwords. Hypothetically, say this event was really big early on about containers. We can look at a subset of this data; here we will use containers as the example and ask what in this set has to do specifically with containers. We can see right here how the data changes over time, and that the subset of data having to do with containers almost has more of a peak than even the overall set. You can start to think that maybe containers had a large impact here. That sends you down the rabbit hole, which really brings in new ideas. You can start looking into which contributors and companies are involved. Obviously, a lot of these are more Red Hat-centric, because right now we are looking at terms focused specifically on Red Hat. Overall, we can look at the top contributors as well: who are the big players here? We can also see what their activity looks like. Are they branching out onto some new projects as well? It has this branching effect that gives you many more ideas about what is going on in our communities. From here, we have some things that are starting to spark our interest, and this is when I think Cauldron really comes into play. Cauldron takes in a community's repo data. Say I want to learn a little bit more about which companies are involved in this project. We can go to the visualization tool and create a new visualization to see what is actually happening. When I am looking at which companies are in play, I personally go for the pie option, because you can see the percentages; it is a lot more visually appealing for me, but there are many different options to choose from. I also like to use the goal visualization to see how commit activity is tracking: is it reaching a goal or some threshold I have set for this community? Like I said, there are many visualization options. For this example, we are going to use the pie chart, and we are going to build it from the Git data source. Here we have just a plain circle to start. Now we choose the aggregation we want to use. We are going to take a unique count of author IDs; obviously people commit multiple times across different repos, so you want to make sure you count each person only once. I will add a label for this, unique IDs. Next we set up the buckets, which are the sections we want to split the chart into. We will go with split slices.
Our bucket aggregation is going to be a terms aggregation, and the field we are splitting on is the author domain. That is based on the author's email domain, which is not a perfect metric, but it can get you started on an idea. That is the biggest thing I am getting out of all these tools: they are not something where you look at them and, poof, they give you all the answers. They give you a new way to think about things and more data behind the intuitions you have, or sometimes they tell you that the intuition you had is not actually accurate. We will set the bucket size to 10, just to keep it visually manageable, and group the smaller, scattered email domains that do not make the cut into an "other" category. We hit update, and look what we have here. We can click in and see who is actively involved in this specific community. Gmail is a pretty common one, which does not tell us too much. We can see that Red Hat is actively involved. There are a lot of others, and there are also these smaller ones in the corner, sews.com, which I have personally never heard of, and backtick.net. There are smaller players in this field that you would not necessarily learn about otherwise; in other communities you might see more of the major players. Those are the two main tutorials we have to show today, to give an idea of what these tools do and how they could work together. From here, we will open the table for questions and see whether there is anything else you would like to analyze with the tools we have available. We would definitely be open to looking at some different keywords using Project Mode; Cauldron sometimes takes a little longer for specific searches, but if it is something quick, we can definitely make that happen.
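For anyone who would rather script the pie chart built in the Cauldron UI above, the same bucket logic can be expressed as a single Elasticsearch aggregation. The endpoint and the author_domain and author_uuid field names are assumptions based on a GrimoireLab-style enriched git index, not something shown in the demo.

```python
# Sketch: the "unique authors split by email domain" pie chart expressed as a raw
# Elasticsearch aggregation. The index URL and field names (author_domain, author_uuid)
# are assumptions based on a GrimoireLab-style enriched git index.
import requests

ES_URL = "http://localhost:9200/git/_search"  # hypothetical Cauldron/GrimoireLab index

query = {
    "size": 0,  # we only want aggregations, not individual documents
    "aggs": {
        "by_domain": {
            "terms": {"field": "author_domain", "size": 10},  # top 10 email domains
            "aggs": {
                "unique_authors": {"cardinality": {"field": "author_uuid"}}
            },
        }
    },
}

resp = requests.post(ES_URL, json=query, timeout=30).json()
for bucket in resp["aggregations"]["by_domain"]["buckets"]:
    print(f'{bucket["key"]}: {bucket["unique_authors"]["value"]} unique authors')
```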
Well, this is awesome, Kelly. I really love seeing you using this; for the health and sustainability of communities it is really key. A lot of the tooling you are showing here, especially the pie chart, is something we have been using in the OpenShift community for quite some time. One of the things I always say, and I try to preach it, I think I am very preachy about it, Brian might agree, is that we tend to think of community management and community development as an art, but this brings to bear some of the data-driven approaches we use in sales and marketing and everything else. Why shouldn't open source communities have access to the same sort of thing? So it is wonderful to see you using all of this. The other thing I loved seeing in this demo, and I am up in Canada now, I used to be in Massachusetts right near you in Beverly, and a shout-out to UMass, I know you are at BU, so we have a little competition going there, is this hockey puck idea of looking for new emerging technologies. One of the things we have been doing quite extensively is watching the migration of resources, and even of our end users, between different projects. Using these same tools, the Bitergia tools and the network analysis ones, if anyone knows me, they have seen me throw up what I call the jellyfish diagram, because it is always pink, with many tentacles, watching how people collaborate across communities. Those tools have been available to us for a while. But what is really interesting to me is the use of Watson and the AI pieces to do maybe some predictive things, to do more than just watch what I call the migratory processes. So I would like to hear a little bit more about how the Watson part plays into this, if you can explain, because that was slightly different from the pie charts and the breakdowns, which are tools I have had; the Watson and Mode stuff is really cool. And how you are working together with IBM, if you could talk a little more about that, that would be wonderful. Yeah, so this project has evolved over the past six to eight months. It started with IBM reaching out to us, wanting a bit more of a community perspective. They said, we are big on the research side, we have done a lot of this analysis on academic papers, but we understand that that is not how, and not where, we are going to find the useful data for understanding the open source ecosystem. So we brought in more of the community and open source perspective to figure out what the main things to look at should be when analyzing GitHub. If we go back to the visualization here: with this Project Mode tool, all of these terms are being grouped together using sentiment analysis, mainly on the READMEs. It also does a bit of community metrics, for example the share of these repos that have community guidelines, which is something very interesting to look at. But it takes the strength of Watson, of Debater and Discovery, to group repos together using sentiment analysis, not grouping them together just because they share contributors. You cannot just point it at the entirety of GitHub; when you are using the Mode tool, you say, okay, I want this specific subset. So right now, for this demo, the subset of terms we are looking at is anything that has to do with Red Hat, Fedora, and CentOS Stream. Maybe we wanted to look at containers, maybe hybrid cloud, maybe we wanted to look at what is going on with a direct competitor like Google or VMware. You can look at the repos and group them together in a completely different way, and there is a lot of flexibility in how you do that once you bring in the sentiment analysis portion and start looking at key terms. So if I hear you right, this Mode tool is looking at READMEs and contributor guidelines. Is it looking at, say, the mailing lists? What are the data sets? Is it mining Slack, the blogosphere, Stack Overflow, or any of that kind of content, or is this strictly the READMEs? Right now it is just the READMEs, but the mailing lists specifically are something we brought up and have talked about integrating, especially with Watson Debater, which we sadly do not have an interface to show today.
But we can take in the mailing list data, which is actually something I worked on, cleaning and preparing that data for a different project during my internship. So we have talked pretty seriously about taking the tools we already have for preparing mailing list data and feeding it into tools like this. For this demo we are not looking at that, but we can make it happen pretty easily. This setup can be applied to a much wider scope than just READMEs; READMEs are just the starting point. Yeah, this is great, because what we have been doing with the Bitergia tools is a wide range of things, well beyond just community health. One of the other things that is really important to emphasize to people watching this is the importance of domain knowledge. This happens in any data science tooling: if there is not somebody with a bit of domain knowledge about the domains and how they interact, the tooling is really lacking. I have worked in other spaces, finance, accounting, auditing and the like, and if you do not know what you are auditing or what is appearing, you can go down a wormhole, let's say, or make the wrong assumption. So I think taking it step by step, doing the READMEs and then adding a new step, is really the right thing to do. And as we torture you and make you learn all about open source and communities, we will hopefully eventually add in all of the CNCF projects, the Cloud Native Computing Foundation ones, which is really where most of the projects I work on and interact with live and breathe, along with the OpenShift ecosystem. So I am really looking forward to getting this with a wider data set. That will scare everybody over at IBM in terms of getting that set up, because we have a lot of that data, but not with the semantic or sentiment analysis going on. The other thing that I think is really important to understand is who the people are. I think you have a little bit of that; you can see some of the names of the people leading the repos, but is it possible here to do analysis that drills down to an individual contributor to a project, or is this really a higher-level thing? To see where Kelsey Hightower is playing these days, or what is going on in the sandbox at CNCF. I think there are now 45 projects in the CNCF sandbox, and those kinds of trends and analyses are key to really understanding the entire ecosystem around some of these projects. Yeah, so answering that question: it is a little bit of a fine line. The short answer is yes, we should be able to do that between the tools we have here and also Augur, which we have not demonstrated. Augur is very good at looking at an individual developer and seeing where that person is across GitHub; it is not limited by subset, it just looks for all connections and finds out where that person is working. So we can do it. We are a little bit hesitant about how we apply it, because we do not want to get into privacy issues.
Historically, and you know this too, Diane, from your work with Bitergia and the things the OSPO used to do, when we approached this as a giant fire hose of information, we were always very careful to keep the user data as aggregated as possible. It has always been a fine line, because when we look at things like the pie chart Kelly showed us earlier, where we are trying to identify, domain by domain, who is working for whom, are they working for Red Hat or Google or SUSE, we can do that. But to refine it, we kind of need to know a little bit more about the person, because at Red Hat, for example, we are not all required to use our redhat.com addresses when we participate in an open source project. So there could be more Red Hatters on any given project than just the redhat.com addresses suggest, and it might be similar for other organizations. So yes, and as we move forward, we are really trying to be very mindful of individual privacy, especially as we get into situations with GDPR and the California equivalent. Japan has one now, Brazil has one, and I just heard that another state in the United States has something in the works. There are a lot of individual municipalities and countries with privacy laws in place, and we have to be mindful of those as well. Yeah, definitely. That is really one of the things, like you were mentioning, about the Bitergia, SortingHat, and Cauldron projects: they have been very mindful of ensuring they follow those rules, and that is especially important when we talk about putting some of this tooling out there and making it available publicly as open source projects. That has really been key. I am a huge fan of this stuff, so I am totally thrilled. I really want to have you back to demo the Augur stuff, and take some time, Kelly, maybe, to look at where OpenShift lands in this, and where OKD lands in this. This is really very timely. I loved the emphasis early on, Brian, when you were talking about ROI, because there are always two sides to every open source initiative: the end users and the value they get out of the project from their participation and use of it, and then all the vendors who are collaborating and the value they get from participating in those open source projects. So when we have to justify resources being allocated to different projects, which, trust me, happens all the time inside of Red Hat, these kinds of tools really help with those judgment calls about where the resources should go. And as I said, I think the other big piece of this is using it to see where the hockey puck is going, where we should be shooting, because we are always in the present when we do this; we are always trying to suss out some current kerfuffle on a mailing list, or something going on over here because somebody is unhappy with it, and we often do not do any forward-looking work on what is coming down the pipe. Where are our end users playing? There was a great example of this a while back with a company, Amadeus, that is using OpenShift, and they came and gave a talk on their use of Kafka. It was really early, very early days for Kafka.
But I think, had we had tools like this, we probably could have seen them starting to log issues in Kafka, to do a PR, those kinds of things. So for vendors who are looking to see where their customers are playing, where their end users are playing, these kinds of tools matter, not just for Red Hat or IBM, but for everybody working in open source, and they really help us justify continuing to engage and do this. And as we all know, open source is part of pretty much every company on the planet these days; everyone is using something open source, whether you are making candy, manufacturing rocket ships, or writing software. So this is really part and parcel of every business organization's decision-making process now, and the work that CHAOSS is doing, and that all the different InnerSource and other communities are doing, is really very, very important. So I cannot say thank you enough, Kelly and Brian, for highlighting all this work. I am so thrilled to see it being done and getting the airtime and the resourcing internally at Red Hat, and I am really looking forward to picking your brain, Kelly, and running some OKD stuff through there. The other term I want to see run is operators, the Operator Framework, because that one has tentacles into so many different things beyond containers. And this is where, once you have these tools refined and we have some processes in place for them, getting the lead project engineers involved, whether in the emerging tech office at Red Hat or as project leads for one of the CNCF projects, so that they bring the domain knowledge and can tweak and interpret this, will be hugely helpful for community development efforts. So again, thank you very much for coming today and putting up with some of our technical issues this morning. We are definitely having you back, Kelly. This was great. Thank you all so much for having us. Yes, thank you. It has been a pleasure. We are excited about the work we are doing, and we are looking forward to showing it off again.