Hi everybody, my name is Cali Dolfi, and I'm a data scientist in Red Hat's Open Source Program Office. Today we're going to be going over community metrics: what to measure and why.

We'll go over a few different things. First, what is the value of community metrics? Why do you even want to spend your time producing these for your community? Next is the methodology behind generating community metrics. And last is the analysis lessons learned. That last part is really what I want you all to leave here with today, kind of like the little person on your shoulder saying some things to you as you're making your own.

So first, why do we want to make these? I'm not here to tell you that data analysis or automation is going to be the one thing that informs community decisions. It's actually the opposite. This is to build on your own open source community knowledge, incorporate others', and uncover potential biases and perspectives not already considered. You can also integrate people's special knowledge. Say one of your community members is really in the know around setting up events and initiatives, and understands the ins and outs and the impacts of doing those types of events, while other people might be really in tune with the review processes and the things that can help contributors have more success getting a PR merged. Those types of specialties can be integrated into visualizations so everyone can benefit from that knowledge.

The next thing is that we all have a thousand and one things to keep up with, and it never feels like we have enough time in the day. If getting an answer about your community takes hours on end, you're not going to do it regularly, if ever.

And lastly, there is so much data around repositories and every aspect of community. How do you even start to work through it?
And how do you start to work through the pressure of feeling like you have to use data? Everyone's talking about being data-driven, but how are you going to do that in a way that is actually worth your time, and not just one number on a slide so you can say, "Yes, I have used data"? So we want to look at how to pick that one needle out of the haystack, and what you even want to look at.

The first step when you're thinking about a metric is to consider the perspective you want to receive or give, and there are a few things to consider here. First, you need to ask yourself what the main goal is. Is the main goal to gain understanding or to influence action? Is there an area of your community that's not understood, and you're trying to take that first step of getting there? Or is there an initiative that you're trying to decide on, or are you measuring an initiative already in place?

The next thing to look at is whether you want to expose an area of improvement or highlight your strengths. There are times when you're really just trying to hype up your community and show how great it is, especially when it comes to showing business impact or advocating for your community. But when it comes to informing yourself and your community, most of the time identifying shortcomings is where you're going to get the most value from your metrics. There isn't a problem with highlighting strengths, but there's a time and a place. Don't use these metrics as a yes-man inside your community, just to tell you how great what you're already doing is. You want to boost morale and recognition for things, but like I said, there's a time and a place.

Lastly, you want to consider community impact versus business impact. The language that many businesses speak is numbers and data. That can make it incredibly difficult to advocate for your community and truly show its value. This can be a way to speak in their language and show them what they want to see while
getting the rest of your messaging across. These metrics might not change anything that you were originally going to say, but they might leave room for more listening. So that's the business side of things. Now let's look at community impact. Are you looking at how your community impacts open source overall, or how it impacts the ecosystem, or the people and code around it? With these perspectives, it's not always an either-or situation, but this type of framing helps you create a more deliberate metric.

When talking about general data science and machine learning work, some version of this workflow is what you're going to see people describe. For this presentation, we're going to focus on that first step, codifying problems and metrics, and a little bit of the second. From a data science perspective, this presentation can be viewed as a case study of that first step. This step is often overlooked, but the true value of your analysis comes here. You don't just wake up one day thinking, "I know exactly what I want to look at, exactly what data it's going to take, and how it's going to look." It takes some time to really work through these thoughts.

So let's hone in on the goal of our time here, which is codifying problems and metrics. Tooling is a debate for another time, or come and find me afterwards and I will go on all day. This starts with trying to truly figure out what you want to know and what data you have, to get us to the true goal of thoughtful execution of data analysis.

We're going to start breaking down the different analysis angles and scenarios for metric and visualization building. These angles are not the only ones, just the main examples for this talk, and they can be generally applied throughout.

The first example here is building off of current data analysis. Say you're starting down the path and you're already looking into something that
is generally useful to you or your community. Let's try to make it better. The idea here is to start building off of common or traditional open source community analysis. Commits over time is cool, but what does that actually tell you?

Let's look at this first example, number of contributors over time. Say you know that you have 120 total contributors over the lifetime of the project. This is a value you can put on a slide, but you can't make a serious decision off of that one value. From here we can start making incremental steps from just having a value to having insights.

Instead of just looking at the number of contributors over time, what if we start breaking it down into active versus drifting contributors over time? You see how much of your contributor base has stayed active in the community up to this date, and how many of them are drifting away. Is this a ratio that you like? Is the active contributor base staying consistent over time, or is it decreasing or increasing? That tells you a lot more than just one value of how many contributors have contributed to your project at some point.

Another way we can break this down further is looking at repeat versus flyby contributors over time. After contributors make their first contribution, how many of them come back and pass what you would consider your repeat threshold? Maybe for your community, if they go past making three or four contributions, you consider them a repeat contributor. How many of them, on the other hand, make one or two contributions, whatever your threshold is, and then disappear from the community?

Another example of this is the classic commits over time. We can see that there were maybe 40 commits this month and 100 commits last month. That gives you a little bit of information, but what if we start looking at the depth of commits over time? Say in those 100 commits there was maybe one period added here or one line of code changed there, and really the
density of it wasn't too deep, whereas in those 40 commits maybe there was a huge overhaul of your code base and a lot of stress was put on your maintainers. Those are two completely different stories told with the same data.

Another way we can look at this is commits by subset of contributors. Is there a part of your code base that is heavily dependent on one or two people, and you're not aware of that yet? You're going to want to know that the code base depends on a few people before maybe one of them has a kid and needs to take time away from your community, or for whatever reason is no longer able to be involved. Knowing that beforehand, before it turns into a complete fire drill, is going to take a lot of stress off your community.

So let's look at our second scenario, which is community campaign impact measurement. That's a lot of words, so let's figure out what that actually means. Think meetups, conferences, or any community initiative. How do you start to view your impacts and your goals? I view this as a two-step process where the steps actually feed into one another, and let's discuss why. Once you establish your campaign goals, you can determine what can be measured to detect impact. And figuring out what to measure can in turn feed into establishing the goals and how you're going to implement the campaign.

It's easy to fall into the trap of being a little bit hand-wavy and not concrete when it comes to campaigns or initiatives that you're going into. A lot of times it's just because people are excited. They want to go full steam ahead and start working on it. Then about six months later you're trying to look back and see what the impact of the work you've done is, and sometimes you can feel a little empty, because you can't see what you've done.

An example where this process has fed into itself is some work that I've done with the Fedora
community. They came to me with a goal of doubling their contributors by 2025. That seems like a pretty straightforward question and answer, until you start breaking it down. What is considered a contribution? Is it just a commit? Is it opening an issue? Is it a message in IRC? And where do you count those? Is it only certain repos? Only certain chat rooms? Only certain websites? What and where? That shows that you actually have to define the parameters of the initiative. Then you can have targeted efforts that push towards this goal, and be able to see the impact.

This last scenario is where we're going to spend the majority of our time, as the prior examples can be viewed as different parts of this workflow. This is a living cycle, and improvements and extensions can always be made. In those moments you can decide whether you want to continue down the path of extensions now or later, or recognize that they are simply there to go into at some point in time, or maybe never.

I don't know about y'all, but I'm a bit of an examples person. So we're first going to walk through this process at a theoretical level and then go into a very in-depth example.

Step one is breaking down your focus area and the perspectives. This is where you take into account the perspectives discussed on that earlier slide. I like to think about this as a three-step process. First, let's think about a Magic 8 Ball. I don't know about you, but as a kid I had this plastic ball that you could shake up, ask anything, and get some type of answer. So maybe I can ask it today, "Will Aer Lingus cancel my flight tomorrow?" and it may say, "My sources say no." We can use this idea for your analysis area. If you could get any answer right then and there, what would it be?
And now that you have your Magic 8 Ball question, let's talk about the data. What data sources do you have that could potentially have anything to do with this question or, more generally, the focus area? Literally just bring in anything you can think of. Now, with the context of the data and your question, what are some questions it could be broken into that bring you closer to that proposed 8 Ball question? One thing to really note here is that if you break this apart into multiple questions, don't automatically assume that you can bring them all back together and land exactly on that original question. There are a lot of assumptions that can come in there, and you really have to be careful with that.

So let's go into step two, which is converting a question to a metric. This step is repeated for each sub-part determined in step one, so for each of those sub-questions.

First, we want to select the specific data points needed. Say you identified that the data around your subject area could be all the GitHub data associated with a repository. Here is when you would say, "Okay, I want to look specifically at issues with their timestamps, and maybe just the unique contributor ID, not even the names." You're starting to really drill into which specific data points you want to work with.

From there, you want to decide what type of visualization or metric you want to use to represent this data. What can you build with the data that will answer the question for you, or for anybody who's looking at these visualizations?

The last thing is that you want to start to hypothesize the impacts of this information. What does it tell you about your community if you get one answer or the other? What type of impact do you want to make on your community depending on what you see?

Once you
have gone through those steps, you get that first work-in-progress metric. This is where the collaborative process, and the magic, truly happens. This is when you want to take your metric to the most skeptical person in your community. That one person who always has a "well, what about this?" is the person you want to bring into this process, along with anybody who could have some type of perspective to build in. That's when you start to iterate on that first visualization and bring it to a more mature place.

Now we're going to go to step three, which is analysis and action. You might not do all three of these things for every single visualization you make, but we'll break down all of them.

The first thing you want to determine is whether this metric follows what is currently known about your community. If it does, take a step back and make sure there were no assumptions that catered to that result. That's when you really need to be honest with yourself about how you generated this metric and what the parameters were, and make sure you had a good review process. If it doesn't align with prior knowledge, you might want to investigate further. Was there a data or calculation issue? Or is this something that was previously misunderstood about your community, a new piece of information for y'all?

The next thing to look at is how community initiatives can be impacted or implemented from this information. These initiatives should be informed by the data analysis and be measurable once you implement them, so you can see what happened around those data points.

Once those initiatives have been implemented, it's time to observe the community initiatives informed by the metric. If the impact is not observable, make sure that you're measuring the right thing. Maybe you think the impact is going to show up in the number of PRs that are opened or how quickly they are merged, and maybe it's
around the activity in your chat channels, or people are responding to issues quicker and there's more conversation happening in your community. The impact could be there, just not where you expect it to be. Or it could be that you need to tweak that initial strategy and see if some little tweaks here and there start to get the results you would like to see.

So that was a lot of information. Now let's look at a concrete example. Let's say I wanted to analyze new contributors for the first time. First, what is my Magic 8 Ball question here? What do I wish I knew if I could get a straight answer? In this case, my Magic 8 Ball question is: are people having an experience that converts many of them into consistent contributors?

Next, I want to look at what data could go into this analysis area and my Magic 8 Ball question. Here I decide that I want to look at individual contributor activity within repos, with timestamps.

So now we have our data, and we have our question and focus area. Now let's break it into those sub-part questions and take them all the way to the end. Thinking about the prior steps we went through, I'm going to take three sub-questions from that very first step all the way to the end, so we're not hopping around everywhere, which just gets confusing.

The first sub-question that I came up with here is: how are people coming in?
Looking at new contributors, let's see what the first action they take is. The specific data that I would like to look at here would be contributions, whether that be issues, PRs, commits, comments, whatever that may be, by contributor over time. And I only want to look at the very first action that they make. The visualization that I would make here is first-time contributions broken down by quarter.

I should have mentioned this before I went into these sub-questions: by the end of this presentation, we will get to see what these visualizations look like in implementation. Right now we're just going to look at them theoretically.

So this first visualization is going to be a bar graph where we look at first-time contributions broken down by quarter. Now I'm at that extension step, where I want to start considering how I could take this a step further. Who could I talk to and start bouncing ideas off of? When I talk to some people, we start thinking: now that we have that first-time contribution and we're seeing what those actions are by quarter, what if we look at what the first actions are for somebody who ends up being a repeat contributor versus a flyby contributor? Is there some action that signals whether they might become a more active member of the community and stick around, or whether they will be hopping in and out?

Then we can start to see potential actions informed by this. You can start to ask yourself: is our current documentation supporting our contributors with the most common first contribution for repeat contributors? Could we support them more so that they stick around? Or is there a contribution type that's not common overall but is a good sign for a repeat contributor? Let's say in this example that PRs are the most common first contribution for repeat contributors, but most people overall do not start there. Maybe we could start labeling good first issues consistently, and being really honest with
ourselves that each one actually is a good first issue, and link these issues to the contribution documentation. Or maybe we want to implement a program where we have PR buddies for our new contributors, so they have some type of connection to the community and support as they go through that process for the first time.

Another sub-question we can go into is: what is the conversion rate from a first-time contributor to an active or repeat contributor? Here we're going to be looking at the same data as before, and the metric that we would want to look at is the percent of these contributors who have converted to active versus those who have not. Some of the questions that we could ask around this metric are: Is that number or percent going down? Is there something that we were doing six months ago that somehow faded along the way, and the conversion rate is starting to drop? Or is there some type of trend among the ones that stay? Are they getting more communication and support from current members? Are their issues and PRs getting attention earlier? Those types of things.

The last example we'll go into here is: let's see if our code base is really dependent on those flyby contributors. The visualization we think about is all of our contributions broken down by flyby versus repeat contributors. If we're looking at this as a bar chart, is this a ratio that we like? Is a large amount of our contributions being done by people who are hopping in and out, those flyby contributors? And if so, is this something we view as an underutilized resource?
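The two metrics just described, the conversion rate and the share of contributions coming from flyby versus repeat contributors, could be sketched roughly as follows. This is a minimal illustration and not the implementation shown in the talk: the sample rows, the function names, and the threshold of three contributions are all assumptions, and it ignores timestamps, so it gives a lifetime snapshot instead of a trend over time.

```python
from collections import Counter

# Hypothetical sample data: one (contributor_id, action) pair per contribution.
# In practice these rows would come from your own community data source.
CONTRIBUTIONS = [
    ("c1", "issue"), ("c1", "pr"), ("c1", "pr"), ("c1", "commit"),
    ("c2", "issue"),
    ("c3", "pr"), ("c3", "pr"), ("c3", "comment"),
    ("c4", "comment"), ("c4", "issue"),
]

# Assumed threshold: three or more contributions counts as "repeat".
# Each community should pick its own value.
REPEAT_THRESHOLD = 3


def classify_contributors(contributions, threshold=REPEAT_THRESHOLD):
    """Label each contributor 'repeat' or 'flyby' by total contribution count."""
    counts = Counter(contributor for contributor, _ in contributions)
    return {
        contributor: "repeat" if n >= threshold else "flyby"
        for contributor, n in counts.items()
    }


def conversion_rate(labels):
    """Share of all contributors who reached the repeat threshold."""
    if not labels:
        return 0.0
    repeat = sum(1 for label in labels.values() if label == "repeat")
    return repeat / len(labels)


def contribution_share(contributions, labels):
    """Fraction of all contributions made by each contributor group."""
    totals = Counter(labels[contributor] for contributor, _ in contributions)
    overall = sum(totals.values())
    return {group: count / overall for group, count in totals.items()}


labels = classify_contributors(CONTRIBUTIONS)
print(f"conversion rate: {conversion_rate(labels):.0%}")
print(contribution_share(CONTRIBUTIONS, labels))
```

Keeping the threshold as a named parameter makes it easy to rerun the numbers with a different definition of "repeat" when the skeptics in your community push back on the first version.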
Or, if flyby contributors are doing a large share of the work, are we not doing all that we can to bring them fully in so they become consistent members of the community?

So now we're going to go into the analysis lessons learned. I will say that this is really the focus of the entire presentation for me, and what I would like y'all to leave here with.

First is the limitations. I don't know if you're prepared to hear this from a data scientist, but numbers and data analysis are not facts, and you can make them say anything. The internal skeptic should be alive and well whenever you're going through the process of making these. The iterative process that we have talked about is really what's going to bring the value. You don't want your analysis to just be a yes-man and tell you exactly what you already think. You want to take the time to step back and evaluate the assumptions that you've made.

If a single metric just points you in a new direction to investigate, that is a huge win. You can't look into everything and know everything that is going on. But if you can start to see some specific anomalies, the people who truly know your community can say, "Okay, there's a spike here or a dip here," and you're going to know what activities were happening right then. Was there some type of discourse happening in chat rooms? Was there a huge meetup?
You will know a little bit more than I would as an outsider looking at this visualization. This is just supposed to help inform you. And if it is just a conversation starter that brings you to a new place, that's again a huge takeaway from making these types of visualizations.

You have to take into account that sometimes exactly what you want to measure is not there, but you might be able to get valuable pieces of that puzzle. With that, you can't just assume that you can put all of those little pieces together and get the exact answer to your original question. If you start to force an answer or solution, you can start leading yourself down a dangerous path of assumptions. But if you leave room for the path or goal of the analysis to change, it can sometimes lead you to an even better place, or to more insight than your original idea would have.

Next is that this is just the start, not the solution. Each scenario that we went through today is at a different point in this process, which shows how it is a living process. The first scenario, building off of prior analysis, is the second go-around, a little bit farther into our data analysis life cycle. In the second example, looking at community impact around our initiatives, we already have our goal and scope established to an extent and the question is known. Now we can go into observing these community initiatives and looping back around. And the third example is coming in with a brand-new idea and building from scratch.

You should never stop asking yourself whether more context is needed, or whether this is truly answering the questions you want answered. It's almost always a "yes, and" situation. And I will repeat this over and over like a broken record: taking a step back to ask whether you are even asking the right question is an important part of this process.

There is so much going on, whether that be with data, other community responsibilities,
or trying to get your life together. I don't know about y'all, but I'm constantly doing that. But if we can cut down the time it takes to get information about our community and make it an easy system to check at a regular cadence, that's a huge win. Just think about how much that can inform the way you think about the community, even if it's not a direct conversation around that topic or metric. And this will create a sustainable process. If it's something that only takes 5 or 10 minutes every week to check into, that is much more realistic than the 10 to 15 hours it can take to do something like this manually time after time.

So here are my closing thoughts. Data is a tool, not the answer, but it can bring together insights and information that would not have been accessible otherwise. The methodology here is vital to the success and the value of this analysis. You have to get comfortable with the process of breaking down what you want to know into manageable chunks, building off of that, and taking a step back.

Open source data analysis is a great example of the care that needs to be taken with all data science work. You must take into account the nuance of the topic area, and as we all know, open source community is about as nuanced as it can get. The process of working through what to ask and answer is often overlooked in all applications of data science. Knowing what to ask can be the hardest part, but coming up with something insightful and innovative is much more important than any tool that you decide to use.

So if you're a community member with no data science experience just looking for a place to start, I hope this shows you how important and valuable you can be to this process. You bring the insights and the understanding of the data and the needs of the community, and I hope you can see all the places you can fit into this process without touching a line of code or a data point. And if you're a data
scientist or someone implementing metrics or visualizations, you have to listen to the voices around you, even if you're also an active member of the community.

So thank you, everybody. I'm going to put up the visualizations that are actually implemented and leave some room for questions as we go through them. Let me see. These are the three different examples that we went through earlier and how they can be visualized in actuality. If you have any questions about the tooling used for this, we're working with the Augur project, which is under CHAOSS, for our data source, and using Dash and Plotly to generate these visualizations. So, yeah, I'll open the floor for some questions.

I have a biased answer and then I have a more generic answer. My biased answer is going to be: come to the Red Hat booth right after this talk or at different points, and I'd love to hear about the different things that you want to analyze. We're starting up a project here at Red Hat to meet community managers where they're at, so they can work through the process that we described today. Then, if they open an issue on this project, one of those pretty visualizations can be produced for you in a couple of hours. Honestly, the hardest part, and the whole point of this presentation, is that coming up with what needs to be answered and what is useful to know is much more difficult and time-consuming than me going in and building a pretty graph. So that's going to be my biased answer.

My more generic answer is for if you are trying to figure out tooling that's a little bit easier to use and more cookie-cutter. You have different options. I think tools like Cauldron and Bitergia have more options where you can drag and drop and make some of the more high-level visualizations. But getting connection points to tools like Augur makes it so that these visualizations can
touch the data directly. All of these are made with me accessing the data from a Postgres database, so I can choose how to manipulate it, rather than it being a cookie-cutter tool where I can't see the data. So those are my two answers to that. And like I said, the main reason I'm here is that I hope people come and talk to us about it, so we can get more people involved and better visualizations out there for everybody. Let's see if there's anything else.

I think it definitely matters what data you're specifically looking at, and whether you're actually in the community versus an outsider looking in. Those are two very different experiences. If you're an outsider looking in, it's a lot more difficult to get to those data points, and you have to have a lot of conversations and build trust with that community to be allowed access to that data. If you're in the middle of that community, you need to look at what exactly you're trying to look at. If y'all use different chat platforms or email as the form of communication, it definitely depends on which ones. Email is a little bit easier, IRC is a little bit easier, and Slack is something I'm not as familiar with in terms of how to get the data out. But it definitely matters point by point. If you're inside the community, most of the time you can get at whatever you have implemented if you have the permissions for it.
It's not as difficult. If you're outside, there are some things that are just not accessible to you if you don't get community acceptance.

Yeah, the identity problem. I would say the first thing is that for making more generic tools overall, I'm never even going to touch that, because it is unrealistic for me to be able to answer it for every single community. It's managed differently everywhere, so making a generic tool for every community and saying that I'm going to be able to handle the identity problem is not going to happen. But like I said, a lot of these you can take and start to tweak for your own individual community. That's when you have to start considering, in my opinion, what the different cases are. Are you trying to match up communication channels with your GitHub users? Or are you a really large community where everyone has to get a login? Oh, no, don't do that. So I would say I don't have a wonderful answer for it, more that if you're trying to do this at a large scale, it should not even be touched, because I would rather give no analysis than wrong analysis. Anything else?

With that, I will be at the Red Hat booth today from 1 to 3 p.m. and tomorrow morning from 10 to 1 p.m. If you have any questions or want to talk about how to implement something like this for your own community, I would really love to hear from y'all. These types of conversations are literally why I'm here, so come find me and we can talk about this. Once we are a little bit farther along, I'm excited to talk a little bit more about the project itself. But we are a resource to y'all, and the ideas that people in this room come up with are beneficial to our project and to everyone else who ends up using it. So, yeah, thank you, everybody.
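For readers following along without the slides, the first sub-question from the worked example, "How are people coming in?", can be approximated in a few lines. This is a minimal sketch under stated assumptions, not the Augur, Dash, and Plotly implementation shown at the end of the talk: the sample rows and the function names are hypothetical, and calendar quarters are assumed as the grouping.

```python
from collections import Counter
from datetime import date

# Hypothetical sample rows: (contributor_id, action, date of the contribution).
EVENTS = [
    ("c1", "issue", date(2023, 1, 15)),
    ("c1", "pr", date(2023, 2, 1)),
    ("c2", "pr", date(2023, 3, 20)),
    ("c3", "comment", date(2023, 4, 2)),
    ("c2", "commit", date(2023, 5, 9)),
    ("c4", "issue", date(2023, 6, 30)),
]


def quarter_label(day):
    """Return a calendar-quarter label such as '2023-Q1' for a date."""
    return f"{day.year}-Q{(day.month - 1) // 3 + 1}"


def first_actions_by_quarter(events):
    """Count each contributor's earliest action, grouped by (quarter, action)."""
    first_seen = {}
    for contributor, action, day in sorted(events, key=lambda row: row[2]):
        # setdefault keeps only the earliest event seen per contributor.
        first_seen.setdefault(contributor, (action, day))
    return Counter(
        (quarter_label(day), action) for action, day in first_seen.values()
    )


for (quarter, action), count in sorted(first_actions_by_quarter(EVENTS).items()):
    print(quarter, action, count)
```

The resulting counts are the bar heights for a chart of first-time contributions by quarter, with one bar segment per action type.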