Hello everybody, my name is Cali Dolfi and I'm a data scientist in the Open Source Program Office at Red Hat, and today we'll be talking about community metrics: what to measure and why. There are three parts to this presentation. First we'll go into the value of community metrics, why you'd even want to spend time doing this; then the methodology behind generating these metrics; and the part I'd most like you to leave with, the analysis lessons learned. So what can strong community metrics enable? First, I'm not going to tell you that data analysis or optimization is going to be the one thing that informs all of your community decisions. It's actually the opposite. This is meant to build on your own open source community knowledge and to incorporate other perspectives, including ones you hadn't considered, along with their potential biases. It lets you integrate people's specialized knowledge about your community. For example, there might be one person in your community who is really informed about events and how those impact community activity, maybe on the code base, maybe in your Matrix channel, while another person might know a lot about PR behavior: which tendencies show a positive trend for your community and which might be an indicator that things are not going as well. Another thing is that this can help you stay informed in a sustainable way. We all have a thousand and one things to keep up with, and it never feels like we have enough time in the day. If an answer about your community takes 10 or 15 hours to get, you aren't going to do it regularly, if ever. And the last thing is that there is so much data around repositories and every aspect of communities. How do you even start to work through it?
There is so much pressure around being data-driven, and you can go the route of putting a metric on a slide just to try to validate your expertise, but how can we start picking through this data haystack to find the needle, the things to look at that will actually inform your decisions? The first thing to think about with a metric is the perspective you want to gain, and there are a few things to start considering here. Is your main goal to gain information, or is it to influence action? Is there an area of your community that is not understood, and are you trying to take a first step toward understanding it? Or is there an initiative already in place whose impact you're trying to judge, or are you still trying to decide on the initiative itself? The next thing to consider is whether you're trying to expose areas for improvement or trying to highlight strengths. There are times when you want to hype up your community and show how great it is, especially when you're trying to show business impact or advocate for your community to people outside of it. But when it comes to informing yourself about your community, most of the time identifying shortcomings is where the true value is. There's no problem in highlighting strengths, but there's a time and a place. Don't use metrics and visualization as a yes man that just tells you how great you are. The last thing to consider is whether you're measuring community impact or business impact. The language that many businesses speak is numbers and data. That can make it incredibly difficult to advocate for your community in that landscape and to make sure people actually listen to your message instead of just debating whether you have a valid point to make. So when you want to do that, you want to start speaking their language, which is numbers and data.
Then on the flip side, you may be looking at community impact: maybe the impact of open source within your own community, or on the entire open source ecosystem. These aren't always either-or situations, but this framing helps you focus in on what type of metric you want to make. When people talk about general data science and machine learning work, some version of this graph is what you're going to see. For this presentation we're going to focus mainly on the first step, codifying problems and metrics, and a little bit on collection and cleaning. From a data science perspective, this presentation can be looked at as a case study of that step, and it's also the step that usually gets overlooked. It's not the fun or exciting one. Everyone wants to go straight to training models and doing the fancy stuff, but this is usually where the true value of your analysis comes from. You don't just wake up one day and know exactly what you want to measure and what's going to bring value. So let's really hone in on the goal of our time here, which is codifying problems and metrics. The tooling debate is for another time, or you can come find me after this talk and I will rant about it all day. This starts with truly figuring out what you want to know and what data you have, and how to get to the true goal of thoughtful execution of data analysis. We're going to go through a couple of different analysis angles and break down these different scenarios. These are just the main examples for this talk, and a lot of the points apply generally. The first scenario is building upon current data analysis. Say you're already going down the path of doing metrics, you already know what is generally useful for your community, and you want to build off of it and make it better.
The idea here is to start from your traditional community analysis, like commits over time. That's cool, and it's something we always see, but what does it actually tell you about your community? Let's look through a few examples. First, let's think about number of contributors over time. It's great to have on a slide that, for example, you have had 120 different contributors, but how can we take it one step further and actually start to inform ourselves? We can look at active versus drifting contributors. This is the same exact data, but you can start to see the breakdown: how many people total have been involved in our community, how many have been involved in, say, the last three months, and is that trend going down? Are people leaving the community, and can you start to investigate why? Another breakdown is repeat versus flyby contributors. Are there contributors who just come in, open a single issue or maybe patch a bug, and leave, or is the contributor base solid, staying consistent and active over time? Another example is the classic commits over time. Great, you know you had 200 commits last month and 100 commits this month. What else can we learn from this? The depth of commits can be a really important factor. Maybe those 200 commits just came from a hackathon where a bunch of college students decided to put a period in a bunch of different places to bump up their score, while those 100 commits might have been a huge overhaul of the entire code base. That's a completely different perspective, one you wouldn't have gotten just from commits over time. Another angle is commits by subsets of contributors. This is something I'm really focused on right now: figuring out what portions of your code base are dependent on one, maybe two people.
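To make the active-versus-drifting and repeat-versus-flyby splits concrete, here is a rough sketch in Python. The commit log, the names and dates, the 90-day "active" window, and the one-contribution "flyby" threshold are all hypothetical choices for illustration, not definitions from the talk; the point is just that the same raw data yields both breakdowns.

```python
from datetime import date, timedelta

# Hypothetical commit log: (author, commit_date) pairs pulled from any
# git-based repo. Names and dates here are invented for the example.
commits = [
    ("alice", date(2024, 1, 10)), ("alice", date(2024, 5, 2)),
    ("alice", date(2024, 6, 1)),  ("bob",   date(2023, 11, 3)),
    ("carol", date(2024, 5, 20)),
]

def classify(commits, today, active_window_days=90, flyby_threshold=1):
    """Split contributors into active vs. drifting and repeat vs. flyby."""
    by_author = {}
    for author, day in commits:
        by_author.setdefault(author, []).append(day)

    cutoff = today - timedelta(days=active_window_days)
    return {
        author: {
            "active": max(days) >= cutoff,          # seen inside the window?
            "flyby": len(days) <= flyby_threshold,  # one-off contribution?
        }
        for author, days in by_author.items()
    }

print(classify(commits, today=date(2024, 6, 15)))
```

With real data you would feed in the full history instead of five rows, but the two questions stay the same: when did we last see this person, and how many times have we seen them at all?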
How many people have touched these certain areas of the code, and for how long? If those people leave, what happens? You probably want to know this before those people leave, because once that happens all of that knowledge has been lost, whereas if you look ahead you can start knowledge sharing so it doesn't become a disaster for your community. Then scenario two, which is community campaign impact management. That's a lot of words, so let's break it down a little. Think meetups, conferences, or any community initiative. How do you measure the impact and establish goals? These two steps actually feed into one another. Whenever you establish your campaign goals, you should also be considering how to actually measure and detect the impact. A lot of times people get really excited about going forward and doing things within the community, and that's not a bad thing, but you can get six months or a year in and it can feel very unfulfilling if you have no way of determining whether anything actually changed from all the effort you put into it. So these two steps feed into each other over time. A really good example is some work I've been doing with the Fedora community and their goal of doubling their contributors by 2028. We have been discussing what it actually means to be a contributor. What is a contribution when it comes to counting these metrics, and what repos, sites, and chat channels do we need to look at to be able to measure it? By discussing what you're measuring, you can start to make more targeted goals around what you want to do to impact those numbers. And now we're going to go into scenario three, which is where we'll spend the majority of our time in this presentation.
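The knowledge-concentration question raised a moment ago, which areas of the code depend entirely on one or two people, can also be sketched in a few lines. The file paths, author names, and the two-author threshold below are all made up for illustration; real input might come from something like `git log --name-only`.

```python
from collections import defaultdict

# Hypothetical (file_path, author) pairs representing "who touched what".
# Paths and names are invented for the example.
touches = [
    ("core/parser.py", "alice"), ("core/parser.py", "alice"),
    ("core/engine.py", "alice"), ("core/engine.py", "bob"),
    ("docs/index.md",  "carol"), ("docs/index.md",  "dave"),
    ("docs/index.md",  "alice"),
]

def at_risk_files(touches, max_authors=2):
    """Files whose entire history comes from `max_authors` or fewer people."""
    authors = defaultdict(set)
    for path, author in touches:
        authors[path].add(author)
    return sorted(path for path, who in authors.items() if len(who) <= max_authors)

print(at_risk_files(touches))
```

Anything this flags is a candidate for deliberate knowledge sharing before, not after, a maintainer moves on.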
The prior examples can be viewed as different portions of this cycle, and it's very much a living one: you're going to hop from the end back to the beginning over time as you develop more of your visualizations. I'm honestly much more of an examples person, so we're first going to go through the entire methodology and then do a full example to see how it applies. The first step in this process is to break down your focus area and perspectives, and this is where you take into account the perspectives we discussed in the earlier slides. I like to think about this in three parts. First, think of a Magic 8 Ball. When I was a kid I had that little plastic toy, and you could shake it up and ask it anything. For example, I could shake it up right now and ask if Max Verstappen is going to win or get pole tonight at the Canadian Grand Prix, and it's probably going to say "certainly". Anybody who watches Formula 1 gets that reference. This can be applied to your analysis area: if you could get any answer about your community right now, no limitations, what would it be? The next part is to talk about the data. Given that Magic 8 Ball question, what are the data sources that could potentially have anything to do with it? You don't want to be limited; gather anything that could possibly apply. Now, with the context of the data, your Magic 8 Ball question, and your focus area: what questions could be answered that could bring you closer to your goal? Most of the time you cannot ask the direct question that you want, but you can start to get pieces of it. Be careful, though, because it can be very easy to take those pieces, bring them all back together, and claim that answers your original question.
There are usually a lot of assumptions that come into play there, so you need to be careful about how you aggregate your different analysis points, and sometimes you just want to take them piece by piece. The next step is converting a question into a metric, and you'll go through these steps once for each question from that final part of step one. The first thing you want to do is take the overall data you were looking at and get down to the specific data points you need. Maybe you started with all of the data around specific repos, and once you get here you realize you just want to look at the data around pull requests. The next thing is considering what type of visualization is going to best represent this data to answer the question you're looking at. Is it going to be a pie chart, or a bar chart with a couple of layered plots, maybe a line chart over the top? Then you want to look at the insights and actions that could come from this: hypothesize what the impact of this information would be and how you would incorporate it into your community. Then you go straight into the collaborative portion: take this to the most skeptical person you know in your community and have them tell you every single reason why you're wrong. This is usually the step where all the magic happens. The best analysis I've done has always come from this portion. You're not going to come up with your best ideas in a silo, and that skeptical person is who you need to bring you back down to earth. Next we're going to look at analysis and action. We won't go through every single step in every scenario, but we'll start to walk through it.
Now that you have this hypothetical visualization or metric, the first thing to consider is whether it aligns with your prior knowledge of your community. If it does align, step back and see if there were any assumptions you made that catered the results to what you already thought existed. And if it doesn't match, go back and check whether there was a data or calculation issue, and be very thorough here, or whether this was simply a previously misunderstood part of your community. From there we can look at community initiatives that could be implemented from this analysis. We want to inform those initiatives from the analysis and make sure they are geared to be measurable in some way. And then from here is where we would start to observe these community initiatives. In a scenario where you're not actually able to observe any change, is it that you're not measuring the right things? Are you looking for activity in commits and PRs when the actual activity being impacted is people in a Matrix channel communicating more with new users? Or does the initiative strategy itself need to change? That was a whole lot of information, so let's go into a concrete example. Let's say my focus area is to analyze new contributors and what their first actions are. My Magic 8 Ball question here is: are people having an experience in the community that converts them into consistent contributors? Now I want to consider what data I have that speaks to this analysis area and Magic 8 Ball question, and here I would like to look at individual contributor activity within repos, with timestamps. So now that we have our data and our Magic 8 Ball question, let's break it down into a couple of sub-questions and take each one all the way to the end.
Looking at this through the prior steps, we're going to go from step two to step three for each question, because if we try to hop around it just gets really confusing. At the end of this presentation I'll actually show what these visualizations look like implemented. The first sub-question is: how are people coming into my community? I want to look at new contributors and see what their first action is. The specific data I'm going to look at is the contributions, issues, PRs, comments, all of the above, by contributors over time, keeping only that very first action. The visualization I chose is taking first-time contributions and breaking them down by quarter. Now that we have that visualization, we can look at an extension of it. Let's say I start talking to different people and we realize we want to take this visualization one step further: breaking it down by quarter and by whether the contributor goes on to be a repeat contributor or whether their actions point to being a flyby contributor. With this breakdown we can look at whether there are any trends between that first action and whether somebody is going to open an issue, maybe fix a bug, and leave, or become an active member of the community. So let's think about the potential actions that could be informed by knowing which activity points to flyby or repeat contributors. We can consider whether our current documentation supports contributors in making the kind of first contribution that points to becoming a consistent member of the community. Is there a way we can support new contributors more in these actions? Is there a consistent contribution type that points more towards repeat contributors but isn't happening very often?
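A first-actions-by-quarter breakdown like the one described above could be computed roughly as follows, assuming an event stream of (contributor, action, date) rows; the contributors and action labels below are invented for the sketch.

```python
from datetime import date

# Hypothetical event stream: (contributor, action_type, date). Only each
# person's earliest event counts as their "first contribution".
events = [
    ("alice", "pr",      date(2024, 1, 5)),
    ("alice", "issue",   date(2024, 2, 1)),   # later event, ignored below
    ("bob",   "issue",   date(2024, 2, 10)),
    ("carol", "comment", date(2024, 4, 3)),
]

def first_actions_by_quarter(events):
    """Count each contributor's first action, grouped by calendar quarter."""
    firsts = {}
    for who, action, day in sorted(events, key=lambda e: e[2]):
        firsts.setdefault(who, (action, day))  # keeps only the earliest event

    counts = {}
    for action, day in firsts.values():
        quarter = f"{day.year}Q{(day.month - 1) // 3 + 1}"
        counts.setdefault(quarter, {}).setdefault(action, 0)
        counts[quarter][action] += 1
    return counts

print(first_actions_by_quarter(events))
```

The resulting nested counts map directly onto a stacked bar chart with one bar per quarter and one segment per first-action type.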
We might find that a PR is common for repeat contributors, but overall most people don't do that for their first activity. Maybe something we can do as a community is start labeling good first issues and make that a part of our contribution documentation, or start a PR buddy system for new members of the community. The second sub-question we can look at in this area is: what is the conversion rate from first-time contributor to an active or repeat contributor? The data we're going to use is the same as before, and the visualization I decided on is a pie chart of what percent of first-time contributors convert and what percent do not. Some of the questions that can be asked and answered around this visualization: is that number or percent going down? Is there something we're doing differently in the community, or something that has fallen off the table, that is bringing down that conversion number, with people starting to leave or first-time contributors not having as positive an experience within the community? This is another way to answer some of the same questions as the prior visualization, and a lot of times it's really helpful to have one question that can be answered by multiple different visualizations, because the different perspectives start to inform an overall answer. The last sub-question is: is our code base really dependent on flyby contributors? The visualization we could look at here is total contributions, broken down by repeat versus flyby contributors. And some questions we can start asking: is this a ratio that we like? Is a lot being done by flyby contributors?
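The conversion-rate pie chart boils down to a single ratio. Here is a minimal sketch, where the per-contributor contribution counts are hypothetical and "converted" is my assumed definition (came back after the first contribution), not a definition from the talk.

```python
# Hypothetical total contribution counts per contributor. A contributor
# "converts" if they contributed more than once, i.e. they came back.
contribution_counts = {"alice": 7, "bob": 1, "carol": 3, "dave": 1}

def conversion_rate(counts):
    """Share of first-time contributors who contributed more than once."""
    converted = sum(1 for n in counts.values() if n > 1)
    return converted / len(counts)

print(conversion_rate(contribution_counts))  # 0.5: half of them converted
```

Tracking this one number per quarter is exactly the kind of 10-minute weekly check discussed later in the talk.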
And maybe that's an underutilized resource, and we're not doing our part to make it a welcoming community and bring people in. Now we're going to go into analysis lessons learned. If you take anything from this talk, please let it be these last couple of slides. This might be a shock coming from a data scientist, but numbers and data analysis are not facts, no matter what people say. You can make them say anything you want, and your internal skeptic should be very alive and well when you start to build your own visualizations. The iterative process and bringing different people in to question what you're doing is what's going to bring the true value; you don't want your analysis to just be a yes man for what you were already thinking. Take time to step back and evaluate the assumptions you are making through this process. And if a metric just points you in a new direction to investigate, that's a huge win. You can't look at everything and consider every single aspect of the community, so if there are anomalies on the graph that you can start to investigate further, that's a huge win. You would never have known to look into those areas, and it can be a conversation starter that takes ideas to a completely new place. Sometimes exactly what you want to measure is not going to be there, but the analysis can still bring in valuable pieces of the puzzle. Just like forcing two puzzle pieces together can crack them, trying to force answers and solutions that are not there can take you down a very dangerous path. Leaving room for the goals of the analysis to change can often lead you to a better place than you thought you were going. Another thing to consider is that data analysis is a start. Each scenario we went through today is just starting at a different point of the same cycle.
The first example is that second go-around of analysis, after you've made a first pass. The second example is looking at when the goals and scope of a community initiative have been established to some extent, and starting the feedback loop between what that initiative is and how to measure it. And the third example is coming in with a brand new idea, starting your data analysis from scratch. You should never stop asking whether more context is needed or whether you're truly answering the questions you want; most of the time it's a "yes, and" situation. And sometimes you need to ask yourself if you're even asking the correct question. There's just so much going on, whether that's around data, your community, or just trying to be a human. If you can cut down the amount of time it takes to get information, or create an easy system to check things at a regular cadence, that is a huge win. If something is going to take you 15 or 20 hours, you're not going to do it regularly. But if you can create a system that takes 10 to 15 minutes a week to check, that is something you can stay consistent on. So, closing thoughts: data is a tool. It's not an answer, but it can bring together insights, information, and individual knowledge and make them accessible to everyone in a way that wouldn't have been possible otherwise. And the methodology is vital to the success of the overall analysis. You have to get comfortable with breaking down what you want to know into manageable chunks and building off of that. Taking a step back, open source community data analysis is a great example of the care that needs to be taken with all data science. You must take in the nuance of the subject matter, and open source communities are about as nuanced as it gets. The process of working through what needs to be asked, and what the answer is, and breaking it down at the very beginning, is something that's often overlooked.
People want to go straight into implementing a new visualization and playing with the data before asking what they want to learn in the first place. Knowing what to ask can truly be the hardest part, and when you can come up with something insightful and innovative after spending that time thinking about it yourself and bringing in other people, that's when you're going to bring something truly exciting along. So if you're a community member with no data science experience just looking for a place to start, I hope you see from this presentation how vital your perspective is. Even if you don't write a line of code or implement any visualization yourself, that insight into the community, and the different perspectives and nuance specific to whatever community you're part of, is extremely important. You are the one who brings the insights, not just the data, and I hope you can see all the places you can be involved in this. And if you're a data scientist, or the person in your community applying the analysis or visualizations, you need to listen to the voices around you, even if you're also an active member of that community. So thank you. I'll take questions, and I'll also pull up the graphs we were talking about a little earlier; these are for the Konveyor community. [Audience question, inaudible.] If they're all git-based repositories, we can put them into our graphs and see what happens. All of the visualizations we're building right now, as long as it's a git-based structure, we can use. The application here is actually called 8Knot. Oh, let me repeat the question: how is Red Hat using these visualizations? There have been a couple of different examples. These specific visualizations we haven't used as much directly within Red Hat.
Some people within the open source community who are community managers have been looking at these visualizations for the communities they work with at Red Hat. So people within the OSPO, when they've been working with their communities, have been using this to look at the activity around new contributors. There have been other visualizations where an assumption has been put forward by people involved in a community about how it impacts Red Hat products, and we've been able to use these visualizations to move past the question of whether that assumption is even true. A lot of times, if you don't have proof that something exists, especially from a business perspective, people want to stay there and just discuss the validity, and these graphs have allowed people to move from "is this actually happening?" to "okay, now what are we going to do about it?" That's been something really exciting; when we started on this project, I really didn't get how valuable it would be to just be able to say: this is what's happening, we can show it in the activity. Being able to show concrete evidence takes away the time spent going back and forth and lets you actually start discussing solutions. To repeat the question: where do these requests come from? Sometimes it's community managers, and sometimes it's solution architects on the product side who start hearing things from customers and go down the rabbit hole of what's going on in the community that's leading to customers not getting exactly what they want. So we're actually able to start solving these problems from the community side, not just dealing with them all the way at the product and customer end.
Right now we're working mainly with GitHub data. All of these graphs are supplied by a project called Augur, which takes all the data you could possibly want around a repository and puts it into a relational database, and that's what we plug into the dashboard we built. But we do want to extend outside of GitHub as this project grows. If there's nothing else, then we're good to go.