Hello everybody, my name is Kelly Dolphy, and I am a data scientist at Red Hat in the Open Source Program Office. My manager, Brian Profit, and I will be doing a presentation on connecting open source and businesses, and really looking into how we can start making data-driven decisions. Now I'll pass it off to Brian.

As Kelly said, my name is Brian Profit. For those of you who know me, I'm sorry about that. We are both from the Open Source Program Office at Red Hat, and specifically we are both on the community insights team. This is a new team within the OSPO that is really designed to quantitatively measure the health of communities. We've been working on this effort for a while, but recently we have been trying to expand our efforts, go beyond individual communities, and start looking at ecosystems and also the interface between community and business. That is what we are going to be talking about today.

So I'll get us started with the background information. How do we discover community health and sustainability? This is not a new topic. Some of this you have probably heard in past presentations from our office and from the wonderful folks at Project CHAOSS, which is another Linux Foundation project. A lot of this work heavily overlaps, and we'll focus on it today. Historically, if you're not aware, community health was really measured anecdotally. There wasn't really any quantitative analysis. It was done on such vague and fun terms as: How popular is the community? How many downloads does a community have?
What's the usage or consumption rate? Or my favorite, and I still hear this today: how many stars does a community have? If you want to see my eye twitch, just roll that past me sometime. It's awesome. Each one of these measurements is not horrible by itself, but if you put too much weight on any one of them, it's really not going to give you any kind of indication of pure community health. It gives you some idea of what's going on, but not the full picture.

So we need to deploy analytical rigor, and this is where things like Project CHAOSS, which I mentioned before, definitely come in. There have really been three things in the last three or four years that have made this kind of analysis possible. One is the standardization of metrics. I know I've mentioned them at least twice so far, but this is really what Project CHAOSS brought to the table. This is a conversation between really brilliant people who basically asked: how do we measure communities, even though they seem to be different kinds of communities? The old argument was, well, my open source community is different from your free software community, which is different from this one over here, so there's no way we can all get together and figure out a unified way of measuring things. The CHAOSS people said no, and they actually figured out how to do that. Parallel to that was the evolution of tools that would take those metrics and actually produce analytical results. So from Project CHAOSS, there is Augur. There is Cauldron.
There's a wealth of tools from the company called Bitergia that they're also using, based on GrimoireLab and Elasticsearch and other tools like that. Red Hat contributed one, which is a little dormant right now, called Prospector, that we built way back when. So there are a lot of tools in the CHAOSS environment that are going to be used, and we are taking those and doing some other things too. We'll talk about that a little bit later. I got ahead of myself.

The third thing, the one that really spurred the concerted effort we are making right now on the community insights team, is this need to have more objective analysis based on business and community needs. The key part of that sentence is "business and community." Before, it was just community. But now, at Red Hat and other companies, we are seeing a real need to figure out where communities fit on our bottom line. That's not to say that we're trying to spend less on communities and cut costs or things like that. We are trying to figure out how to make them more efficient with what we have. I can't speak for all y'all, but at our company, and this is still Red Hat, so we are believers in all this, scale is a problem. We have 23 people on our team. Yeah, that's right, roughly. And we have hundreds of projects in our company. So figuring out how to get the health of all of these projects is really important just as a matter of scale.

As I've mentioned already, Cauldron is a tool in our toolkit. It comes from the vendor Bitergia. It is a GrimoireLab-based tool with Elastic running in the back end, and it really does a good job of looking at one community at a time and giving us graphical feedback on what is going on as far as measurable results.
So we're looking at things like time to first response for pull requests, for instance, and we're looking at the demographics of organizations that are involved in a community. Anything that we want to plug in, we basically do that. Augur is another tool that is more text-based. This is a PostgreSQL-based tool that dives into multiple GitHub repositories and projects and looks at a wealth of information similar to what Cauldron does, but it does so across multiple projects, and we can get a very big picture very quickly of what's going on in communities. And then something that we do at our own company is community report cards. These are basically audit-based forms that we run through. We use the quantitative analysis, but we also ask things like: does this project have a website? Does this project's website make sense? You would think in 2021 that the answer would always be yes, but you would be wrong. So we do non-quantitative analysis as well as part of this. Okay, and with that I'm done, so I will turn it over to my colleague Kelly.

Yes, so the tools are one thing, and you can really change them. I mean, not everyone needs to use Cauldron or Augur to look at this information. A lot of it is what you're doing with the data and how you're using it to inform the decisions that you're making. So we can look at this through the business lens of seeing what businesses actually need. There are a lot of people who are very new to the open source space. When you have all of these different communities and all of this different information, how do you go and start to hone in to try to learn more about the different communities
you're trying to get involved in? Some of the questions that come up are about trying to see if this community is going to fit with the business model. You can start looking at the seasonality of communities, how the behavior goes over the span of a year, and seeing whether that is going to work for you. Is that period of high production or high evolution going to match when your company needs it? Then there is trying to see if the community is sustainable. Are you about to depend on something where, if two people win the lottery tomorrow, you are at the end of your road? So you start to see how many people are actually involved in the community, and how heavily you're going to depend on it versus the size of the development effort and the number of maintainers. More than anything, I feel like where we're at with looking at data for these communities is trying to see what you should pay attention to when you have such a big scope. How do you go in and find a single needle to focus on, to give you some more
information about these communities? One thing that should be noted as well is that looking at what businesses need also equips the communities to advocate for themselves a little bit better, in a language that better matches their audience. You can start to say things while backing yourself up with data, and we all know that's what everyone loves to see and hear. Now, obviously I'm a little bit biased as a data scientist, but in my opinion it can let you go a little bit more on the offense whenever you're going into these conversations and trying to bring people in. So I'll pass this off as well.

So as I said earlier, there are two things we're looking at here. We're looking at community impact, and we're also looking at business impact. We're looking for the ever-elusive return on investment, which is something that, historically, community managers have not done. It depends on where open source lands in your organization, of course. If you're in the marketing department, usually there's a little bit more focus on ROI, but if you're in the engineering department, as things are at Red Hat, that's not necessarily the case. So what we're trying to do is take all these tools and figure out how we can quantitatively measure the impact of what we do in the community. For example, one of the things we want to figure out is what happens when we come to events like this, or more appropriately something like SCaLE, which is the Southern California Linux Expo, an excellent community-oriented event, or All Things Open, which is happening in Raleigh next month. Those are very community-oriented events. Red Hat is not really trying to sell you anything there. Okay? You come to our booth, and we don't really have sales reps there.
We might, maybe, but not always. We're there to answer community questions, because we are trying to support the upstream. So then the question becomes: how well did we do with that? One of the things we want to try to do with some of these tools is look at traffic on our projects and conversations that are happening in public forums and say, okay, we were really trying to talk up Kubernetes when we did these three events. Did we see an uptick in conversations around Red Hat and Kubernetes? It's not a sales lead. It's mostly to say, what's our impact statement? That kind of thing. That gets to the targeted marketing initiatives, because with this new data analysis we're really digging into new sources that we've never used before. They gave us all a bunch of lead generation apps on our phones for the booth, right? So I can tell you who came by the booth and picked up a hat. That's the old data. Now I want to see: who was talking about us, and what were they saying after they came by the booth? Did we make an impact, positively or negatively, on communities after given events? And all of those resources can hopefully be calibrated toward community health.

Community impact, obviously, was probably the first thing we've always wanted to measure. Now we're going to measure risk factors, as Kelly pointed out, on different levels. We're still looking at internal project health, but now we have tools like Augur and other tools that we're starting to build and integrate together.
We're going to look at broader ecosystems too, so we can figure out what's going on in general. Again, not to pick on it, but Kubernetes is a big ecosystem. It would be nice to focus on individual projects, but it would also be nice to see what the general trend for health in that entire ecosystem is. And early detection of those risk factors can inform community decisions. If you've heard me speak about this in the last three years, you've heard me say this time and again: in the past, community managers, and we are all brilliant people, really, I swear to God, always brought our own skills to the table. It was always done anecdotally. I'm a writer by trade, so guess what? I'm going to focus on documentation. If an engineer comes in and becomes a community manager, they are probably going to focus on processes and agile or whatever. They're going to be on the engineering side of the table. That's not bad. It's just what you bring to the table. This sort of levels the playing field, so people who are strong in one area and weak in another might now have tools to make informed decisions. And now, back to the smart person.

Yes, I want to talk a little bit about how the data science workflow goes into how we can start making more informed decisions. When we're looking at analyzing communities, it takes a little bit more of an experimental approach. This is a space that really hasn't been dived into too far, so you have to figure out where your boundaries are and what type of questions you want answered. It always starts with: what is the type of data that you want to examine? Where are the different pain points you want to go into? What do you want to access in a quick way that can help get you some answers? And then also: how should it be analyzed?
This is when you start to go into whether, on a simpler scale, you want to look at the mean or the median, and what's going to actually represent the type of data you're looking at. For example, if you're looking at issue data and you just want to look at mean time to first response or something like that, some issues never get responded to, so that's not going to give you a really great view into what's happening within your community. You have to figure out what layers you need to look at to give an actual cohesive view. And I think it has been great to be so deeply involved in the OSPO, to learn a little bit more about how different open source communities work, and to get to talk to a bunch of different people. I came in about a year and a half ago as a pretty big novice in the open source field, so I was able to ask: okay, what are the questions I have where I know nothing? And I could start to figure out from the people around me how they would go about viewing those things from a community manager standpoint or a community member standpoint. So how can we make sure that all those viewpoints are going in, so we can start to make more informed decisions?

Next, looking more into strategic investment, looking ahead: a big thing with software like Augur is that you can start to see the overall ecosystem. You can start to look at which communities may be having some more buzz around them, things that we should be paying more attention to. If a few years ago we were looking at containers, and somebody could have told you, I don't know, six months in advance that this was going to be something that was going to explode, that would be great. That's something you can start to bolster by having different ecosystem
focused data and starting to see, okay, overall, what are people talking about, and where are the different things that I want to be a little bit more informed about? This can go into community buzz as well: what are the different things that they're talking about, and what's going on in, say, the entire ecosystem of ML and AI projects? What are other people learning about that can help you make better decisions about your own ML project? Trying to foster those types of connections is really something I think is huge here. Overall, you can start to see those emerging topics and trends, to see where you want to focus on the next go-around or which people you want to talk to. It's really hard whenever you have such a big space, and I can say that's something I felt when I first came into open source. It's like, where do I even go? So I wouldn't say data is always going to have the answer, but it might help you get a little closer to the type of questions you want to look into. As we'll see in a little bit of a demo, you might see a spike in the data that doesn't tell you why something is happening, but you can start to look into it and figure it out. As a community member, you might remember some big conversation that happened on an email thread, and you can actually see how that impacted community activity. Or you can look at an event like this: if you had some huge talk about your community and a large number of people came in, you know the type of impact those talks are having and the things you would like to do to bolster your community engagement. Or, at different times, if something negative is happening, you can see what you need to do to get ahead of it quickly, so it doesn't take your community down and you can be one step ahead of it.

So this is where we're going to look into a slight demo on some of the data
that we have been looking at. This is what I would consider the first step of looking at data. We're looking at commits, all of the normal ones, on a monthly basis and at different intervals. Right now you can use this to look at the different questions you want to answer, and then at how you want to aggregate this data for the next step. I'll go into some examples of things that this first step has informed that we want to look at in the future. For example, right here we're looking at new issue creators for a certain community. What we're actually looking at as a demo is the Augur community, because the Augur community has been a great help in this process for us, and that's the tool we're using to generate all this data. A little hype for them: great community. Something you can start to see here is a large jump in the number of new issue creators by day. You can see that there's something pretty big happening in this community around December of 2019, and when you look at it overall, you can really start to see that jump in issue creation. So then you ask: okay, is there another metric that confirms this jump in activity? Look at that, you have it with the PRs as well. So you're starting to see that these are new people coming into our community. This is when you can draw on the knowledge that your community managers have built up over time and ask: okay, clearly something has really impacted our community. What happened here? How do we bolster this so it happens again and brings more people in? And also: are we able to handle this growth? Are the maintainers keeping up with the number of issues being posted? Are they handling the PRs in a timely way? Are they responding to people? So while it's all great whenever new people come in and
you have a lot of activity around it, you want to make sure that you're making this a welcoming place for new contributors, because you don't want them to come in, see that nobody's responding to them, and leave and not stay a part of your community. This is when you can start to look at not just the number of people who are coming in but the number of issues being created overall. If we're looking at new issues, is there still a large jump in the number of issues, or just in the people who are new? So you can start to see by how much issues are jumping up. I think by month is the best way of looking at it here, in comparison with the data that we were looking at before. Here we can see a huge month for the number of issues opened. Is it staying that way with the number of issues closed? Are we keeping up with it? Here we can see that, for the most part, it seems like a pretty good ratio, but this is when you want to actually look at your backlog over time, and you can look at the pull request data in the same way. Right here is about the time period where that activity happened, and you can start to see that, yes, our community is actually able to keep up with the number of people who are coming in and asking questions. Whenever you start to see some large jumps in your backlog, you can start to ask whether you need to bring in more maintainers and whether the amount of growth you're having is sustainable. And this is when you can go into the next step: if there is something going wrong, or you're starting to see that backlog, how quickly are people getting responded to? If we're looking at issues, for example, how many people are getting an answer at all? So you can start to see the percentage of people who are getting an answer
or not on their issues, and then look at how long it's taking to get any type of response. Whether this is a good or bad number is going to be informed by the community itself. This is where I go back to my point about the data: it helps you focus in on something, but it's not always going to have the answers. This is why I really think it's great to equip community managers with this, because somebody who is deeply involved in a community is going to be able to look at this data and know a lot more about the community than somebody who's on the outside looking in. So I think it's great to be able to see this and to go into the next stage of looking at the seasonality of communities. That's something I'm really excited about: being able to take this data and look at trends over the week and trends over the years, seeing how your community's activity goes up or down. Whenever you do see a large spike, is that something that, from a seasonality standpoint, happens every year? Or is this a spike or dip that is new, so that you start looking a little bit more into your community and what's going on? So that's what we've got for today. Thank you all so much, and yeah, we'll take any questions if you have time. Over here.

So, repeating the question for those online: if we see [unclear], is that the first level of iteration that we're looking for? I would say it's more that, instead of just looking at a single month or a single year, depending on the range you're looking at, you can see the trends over time with your community. I wouldn't say it's a predictive portion. It's more that, okay, if I see a large jump in May of 2021, is that a jump that is seen every year, or is there a huge dip? You can start using it to read and look and start to aggregate some of the data together. I feel like that's kind of what the next
stages are: what do we need to group together to be able to get from a question to an answer as quickly as possible?

So the question was whether there are differences in data points between a more mature, well-rounded code base and something that's more immature, correct? The flippant answer would be nothing, because a mature code base and a mature community might still have the same kinds of troubles and strife as an immature community. There is a part of Project CHAOSS that does look into the evolution of a project, and that's the growth, maturity, and decline model, so where a community is on that curve certainly matters. If you're speaking strictly about the code base, though, no, it doesn't. If it's Mozilla Firefox, they could be blowing up right now, versus a small, you know, three-person startup project. But if I flip that question a little bit, evolution matters. Governance is a thing. A benevolent dictator for life is certainly an early governance model for a project. If you're doing that in the first two years, it totally makes sense. That's not a problem. If that project is 10 years old and it's still like that, and there's no real governance, and you've got a lot of political infighting and things going on, now you've got a problem.
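
To make the talk's point about mean versus median time to first response concrete, here is a minimal sketch in Python. It is not how Augur or Cauldron compute anything: the issue timestamps are made up for illustration, and `first_response_summary` is a hypothetical helper. It reports the share of issues that got any response and compares the mean and median wait among the answered ones, since an issue that never gets a response has no wait time to average.

```python
from datetime import datetime, timedelta
from statistics import mean, median


def first_response_summary(issues):
    """Summarize responsiveness for (opened_at, first_response_at) pairs.

    first_response_at is None when an issue never got a response.
    """
    # Wait times in hours, only for issues that were answered.
    waits = [
        (responded - opened).total_seconds() / 3600
        for opened, responded in issues
        if responded is not None
    ]
    total = len(issues)
    return {
        "response_rate": len(waits) / total if total else 0.0,
        "mean_hours": mean(waits) if waits else None,
        "median_hours": median(waits) if waits else None,
    }


# Made-up sample data: four answered issues (one very slow) and two ignored.
opened = datetime(2021, 5, 1, 9, 0)
issues = [
    (opened, opened + timedelta(hours=1)),
    (opened, opened + timedelta(hours=2)),
    (opened, opened + timedelta(hours=3)),
    (opened, opened + timedelta(hours=240)),
    (opened, None),
    (opened, None),
]

summary = first_response_summary(issues)
print(summary)
```

With this sample, one slow reply pulls the mean wait far above the median, and neither number shows that a third of the issues got no reply at all, which is why the response rate is reported separately. Whether any of these numbers is good or bad still depends on the community.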