Okay, hello again, and welcome to this talk on open source metrics and analysis. As I said in the first iteration of this, it's meant to be a beginner's guide and introduction to working with data and metrics around open source projects. As you can see, I'm a program manager at Google focused on research and analysis. For those who aren't here in the room today and can't find me to ask questions later, you can find me on Twitter and reach out with questions, comments, feedback, all of the above. My quick background: I started my career as a market researcher at Forrester, writing surveys and reports around them and consulting with technology leaders, builders, sellers, and buyers; I then moved into a data and analyst role, and I'm now a program manager within our open source programs office. The continuous thread across all of these roles is that I've worked with a lot of different kinds of data.

So that's the theme of the talk today: working with and around open source data. And the central point is that data is really messy. I don't know if anyone has worked with a perfect data set; I would love to know what it is and how to get my hands on it, but nearly every data set I've worked with has had some kind of issue, whether it's a systematic issue, the way it was processed, or the policies set up around it, maybe access levels or gradations of access; if you work at a company, you may not have the authority to see certain pieces of the data. If you've ever written a survey, there's always the inherent bias that comes from building the survey, designing the panel, and filtering for that panel, and even how a respondent feels on a given day can affect how they answer. So there's always going to be issues and ambiguity in how data is collected, whether it's surveys or public data. Public data has issues like you would see in businesses: reporting gaps, systematic issues, policy issues. My first internship was at a private equity startup doing trend research, trending industries by looking at 10-Ks year over year, and as anyone who has tried this will have noticed, companies don't necessarily report on the same metrics every year. So there's a lot of inconsistency even in public reporting and public data sets.

Now, about a year and a half ago, I started working with the open source programs office, and I was brand new to open source at the time. I understood the concepts, but I was fairly new to the space and fairly new to working with data around it. On top of all the challenges you might face working with data generally, open source has even more ambiguity, even in the way we define a project. We could be talking about a code base, a piece of software, a data set, or a framework. If you work with open source hardware, it could be a series of designs and the processes to build something from those designs. I also work with the CHAOSS community, where metrics, and the language for talking about metrics, are essentially our end product, and even our definition of "project" can vary quite a bit. And then how you define a contribution to a project is a whole other can of worms that I will save for another talk, or refer you to another piece of research. I'm pointing to a couple of links throughout the presentation.
This one in particular is a piece of research that came out of the University of Vermont on how different contributions are counted, and on methodologies for understanding contribution models around projects.

When it comes to working with source code logs specifically, the biggest thing I've found is that there are a lot of inherent inconsistencies and gaps. One of the main ones I discovered early on is that your current records and your historical records may not match, depending on when you actually pull the data around your project. I learned that if your project has been around for a couple of years and a fork now gets merged into the main branch, that fork can overwrite your existing historical records for the project. So if you re-pull the data and run the same query over the same time period, you can get different results, which I found kind of shocking as someone who has worked with a lot of data: in the context of source code logs, historical data can actually change. If you've worked with the GitHub APIs, you might have bumped into rate limiting, which can result in lost or missing data. Not to mention that the schemas and the logs can differ in what they measure, when they measure it, and even what you're counting and how you're counting it. GitHub alone has multiple APIs: the event stream, the issue stream, the web stream. Other tools measure different things again: Gerrit data mostly focuses on changes, reviews, and change logs, whereas GitHub has a whole issue stream that covers bugs, issues, triaging, and comments.

A quick example of the inconsistencies you might see: a very simple query I ran on GitHub Archive, which is a comprehensive collection of the GitHub event stream, comparing tables that should return similar results over the same time period. As you'll note, the numbers are not the same. So we know something is missing; we don't know how much, but we can get a rough order of magnitude from the variation we see across the same time periods. How do we make this better? That's an ongoing topic of conversation around this project and around open source data collection overall.
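To make that concrete, the comparison looks roughly like the sketch below. This is an illustrative reconstruction against the public githubarchive dataset on BigQuery, not the exact query from the slide: it counts the same month of events from the daily, monthly, and yearly tables, which in principle should agree.

```sql
-- Sketch of a consistency check on GitHub Archive (illustrative, not the
-- exact query from the talk). The daily, monthly, and yearly tables all
-- describe the same events for January 2020, so the three counts should
-- match; in practice they can differ.
SELECT 'day tables'  AS source, COUNT(*) AS events
FROM `githubarchive.day.202001*`
UNION ALL
SELECT 'month table' AS source, COUNT(*) AS events
FROM `githubarchive.month.202001`
UNION ALL
SELECT 'year table'  AS source, COUNT(*) AS events
FROM `githubarchive.year.2020`
WHERE EXTRACT(MONTH FROM created_at) = 1;
```

Any gap between the three numbers gives you that rough order of magnitude for what's missing.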
Another part of the problem here is bots. In other kinds of data sets there's always going to be machine-generated data, but working with source code logs, I've found there's a lot more automation, with bots woven into what's being collected and counted as activity around a project. Bots may be labeled as bots, but they can also be scripts running from personal accounts. And then you can run into the situation where something is broken and generating far more activity than it should. I have a personal example from looking at our contribution data across Google: I found an individual account that was responsible for over 40% of push events in the month of July. That's pretty large, given that there are thousands of people at our company working on tens of thousands of repositories and generating hundreds of thousands of push events every year, so 40% really is a big number. I reached out to the individual and found out it was just a broken mirror that was trying to copy things, failing, and retrying over and over. So you can see this kind of activity causing a lot of noise in your data while not representing real activity at all.

For a little bit of context, I was curious how much bot activity there is in GitHub Archive in 2020. I wrote a super basic SQL script, whose structure you can see below, that identifies bots by finding "bot" in the actor login, i.e. the handle. So if anyone in here has a handle with "bot" in it, I'm counting you; know that there's a bit of a flaw here. Using this script, we estimate that bot accounts logged over 120 million events in the GitHub event stream, as captured by GitHub Archive, in 2020. As examples, you can see the top 10 that popped up: Dependabot, GitHub Actions, pull[bot]. Thank you to everyone who has put "bot" in their handle, because right now that's the best way we can actually identify them en masse.
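Roughly, that script looks like the sketch below. It's an illustrative reconstruction against the public githubarchive dataset on BigQuery rather than the exact query from the slide, but the heuristic is the same: anything with "bot" in the login gets counted.

```sql
-- Sketch of the bot-counting heuristic described above (illustrative, not
-- the exact script). Any actor whose login contains "bot" is treated as a
-- bot, so human accounts with "bot" in the handle get counted too.
SELECT
  actor.login AS account,
  COUNT(*)    AS events
FROM `githubarchive.year.2020`
WHERE LOWER(actor.login) LIKE '%bot%'
GROUP BY account
ORDER BY events DESC
LIMIT 10;  -- top 10 "bot" accounts by event volume
```

Summing the per-account counts (or dropping the GROUP BY) gives the overall bot-event estimate.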
People. So I want to talk a little bit about how people introduce additional complexity into the data we're looking at. Most of these systems are designed by humans, so there's always going to be some influence and flaw that we bring to the design itself: how we create it, how we control it, our personal backgrounds, experiences, and feelings, as well as the policies and processes we put in place to govern it. In open source in particular, there are different people at different layers affecting what you might see and when you might see it. The user or contributor generating the data can create their own handle, log in, and use or name it however they want. The platform that collects the information designs the policies around how things are collected and what is displayed. There's the collection mechanism, if you're working with a tool or pipeline that builds these things for you. If you're the collector or maintainer, you also have choices in how you collect the data, how you store it, and how you provide access to it. And then, very personal to me, the last piece is communication: how you talk about the data and the content you create around it. At each of these layers, people make choices about how they interact with the data and how they set it up, and those choices ingrain bias, context, or systematic nuances into how the data is collected, presented, and discussed.

Additionally, with open source there's less clarity about who is accountable for what. I have the pleasure of working at a large company, and we are beholden to policies that govern the data we collect about our employee base and our user base, and to any regulating bodies we work with that shape how we design those policies and practices. In open source projects, there's a lot less clarity about where these boundaries are and who's accountable for what. People sit in the middle of this, subject to policies coming from companies, organizations, governments, and projects; the point is just that it isn't always as neatly defined as it is when you're working within a company.

There's also the issue of PII. Any time we're talking about data privacy and data sensitivity, at the core it's about who people are and what information they're comfortable sharing in what context. Something I love about open source is that we can participate without revealing ourselves. We can be anonymous if we choose to be. We can choose to have multiple identities; I personally have two GitLab accounts, one for personal use and one for work, and there are plenty of other reasons people maintain multiple identities. But even though you can choose to be anonymous in a lot of these contexts, there is a body of research investigating the minimum number of things you need to know about a person before you can figure out who they are. I would stress, especially for small projects, that even if you try to be completely anonymous in your data collection and visualization, depending on the size of your project, people might be able to figure out who someone is anyway. So be sensitive about how you portray and collect information, to ensure everyone stays within the boundaries they want.

That said, there are always going to be use cases for collecting and sharing PII. Some projects use contributor activity to inform who gets nominated in the project's election process. If you're working on a diversity, equity, and inclusion initiative, you need to understand which groups are represented and which are underrepresented to know how you want to invest in that community. But when we talk about actually sharing this data, I always caution that what you share can influence what you're measuring. I like to draw parallels with the natural sciences where possible: if you're familiar with quantum entanglement, any measurement of an entangled particle irreversibly changes the original quantum state, so you literally cannot measure it without affecting it. Something similar is largely true of reporting and metrics. Say you create a leaderboard showcasing the number of commits on a project. You could be inadvertently encouraging people to create more commits instead of working on other parts of the project, like issue triage or event management. That's not to say you shouldn't report; in fact, I highly encourage it. Just do it purposefully, with a sense of the outcome you're trying to achieve. If you're working toward community growth or contributor diversity, be very transparent about what you're trying to achieve and measure against that goal, so you incentivize the behavior you want rather than whatever behavior happens to follow from the metric.

The last piece is probably the stickiest issue, so I'll be brief without overstating it. Again, I am an analyst, not a lawyer, so when in doubt, involve the lawyers, all the time; that's my recommendation here. But I mostly want to mention that a lot of the data we're working with is subject to terms, policies, and licenses. They may or may not exist, but when they do, it's important to know what they are and how they apply.
They could come from the data set itself, from the project the data is about, from the tooling, from the platform, from the privacy policy documentation of all of the above, from regional or country regulation, and probably from other layers I'm not even thinking about. This is just the general starting place for ensuring that you are allowed to use the data, that you're allowed to show the data, and that you're assuming responsibility for what you're collecting. I found the resource at the bottom thanks to a colleague; it's about how you exchange technology and information internationally. The Linux Foundation put out a report last year to help guide folks who are thinking about moving things across borders and understanding the policies and regulations at play.

Coming back to the persona conversation, I wanted to bring it in here because of the multiple layers of policy and regulation that can come up during data collection and management. The users themselves have the ability to opt in or opt out, to share or not share their information; that's a personal choice. But then the hosting platform, say something like GitHub, has an entire set of privacy documentation that specifies how data can be collected and used, and all the regulatory requirements it's subject to. And it doesn't stop there: once you start collecting, most platforms that collect data inherit the policies, practices, and licenses upstream of them. So even if it isn't stated in the tool you're using, the licenses things came in under still apply; it becomes a nested, inherited progression. And if you're the entity collecting data around projects, say me working at a company, you're also responsible for where the data ends up: the infrastructure selection, and ensuring it complies with local regulatory requirements, data sovereignty, and the other issues that come up any time you're storing data, especially PII about individuals. So know what those could be for your company and for your project, and if none of this exists yet, maybe that's an opportunity for the governance or leadership boards and committees to talk about how they want to maintain data in and around their projects.

And that's the last piece. I've worked with a number of projects and spaces where a lot of this is simply not defined, so we're hunting: what should we actually know about this data and about the projects that work with it? Does that documentation exist? Maybe it should. Where there's a lot of open white space, there's an opportunity for project leaders to create it. In the CHAOSS community I mentioned earlier, we've been working on our own privacy documentation this year. It came up when we started collecting data for our diversity, equity, and inclusion badging program: we were collecting more sensitive information about people and attendees, and we realized we needed a way to hold ourselves accountable and to be transparent with the community about what we're going to do with their information and how we're going to maintain its integrity, security, and privacy. A lot of it, in this case, is just about being transparent, declaring your objective, and saying why you're doing it.
I think people are generally a lot more willing to share information if they know what you're going to do with it. They don't want to be exploited, but they do want to help you create a community that's welcoming and inclusive of a variety of individuals. So the call to action here is: if you can't find anyone accountable and you're collecting data, it's probably you. It's good to keep these things in mind, and again, involve your legal teams if you have the luxury of having lawyers available.

For the last couple of slides, I want to talk a little bit about what we do now: what did we learn, and how do we implement it? Starting with the perspective of the analyst, I'm going to call these good data practices; I think "best" is a strong word, and this is a learning practice that will continue to get better. Always state what you're counting. I like to ask: if I stated this thing in isolation, could someone else recreate the metric and confirm the same result? It's okay to use descriptive terms; I know there are probably writers in the audience who like to be a bit more creative with how they describe things, and I think that's fine as long as you define what you're talking about, maybe by providing a glossary or appendix that spells out how you're counting contribution or engagement. It's also very important to state your sources, including any methods or assumptions that could introduce bias or boundaries around what you're looking at. Are you looking at only GitHub data? Are you looking at data from a number of different sources, communities, populations? It keeps expanding, so always state your universe, how that information was collected, and what it represents within what you're trying to convey. I have an example here: I recently wrote a blog post that included an "about the data" section, where I supplied some information about the terms and the sources. If you've worked in the academic community, there are plenty of existing templates for how to talk about your data and the process by which you arrived at your analysis. Outside the academic space, it's again up to you how you do this, so this is just one example of it in practice.

I always want to mention, too, that as an analyst it is very easy to mislead with metrics. People really like numbers, but if you remove a number from the rest of its context, it can be taken out of hand, or just become harder to understand and contextualize. So I want to come back to the bot example from earlier, looking at the number of bots in GitHub Archive. I pulled another query that looked at how many pull request events were triggered from bot-based accounts, based on that same designation, and found 38 million pull request events in 2020. Now, something is missing here. What is it? First, maybe it's where I'm pulling the data from; okay, so now we're talking about pull request events on GitHub in 2020. But that 38 million number on its own doesn't tell me much. Is 38 million a large number? It could be; maybe it's not. So I looked at the total number of pull request events on GitHub in 2020, and then at the bots as a portion of that, and found it was 44%. Okay, that's actually quite substantial: it's under half, but it's nearly half of all the pull request events logged in the GitHub event stream.
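That share calculation is, roughly, the sketch below. Again, this is an illustrative reconstruction against the public githubarchive dataset on BigQuery, using the same "bot in the login" heuristic, rather than the exact query from the slide.

```sql
-- Sketch of the bot share of pull request events for one year (illustrative,
-- not the exact query from the talk). Swap in a different year table to
-- repeat the calculation for other years.
SELECT
  COUNTIF(LOWER(actor.login) LIKE '%bot%') AS bot_pr_events,
  COUNT(*)                                 AS all_pr_events,
  ROUND(100 * COUNTIF(LOWER(actor.login) LIKE '%bot%') / COUNT(*), 1) AS bot_share_pct
FROM `githubarchive.year.2020`
WHERE type = 'PullRequestEvent';
```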
Now I know more, but I think I want to know a little bit more. After I saw that 44% number, I was really curious: what was it last year? So I ran another query and found it was only 25% in 2019. Personal anecdote here: this really piqued my interest, so I just kept running the same query year over year and found that in 2018 it was 6%, and in 2017 it was 3%. That's really significant growth, almost exponential depending on how you define it and which metrics you look at, but that's a point of investigation for a later date. The point is that the more context you provide, the more the sentence holds water. One thing I think is still missing is that this is an estimate. It is a real number, but it's not a hard, exact one, because of all the inconsistencies we know about in the data collection process, as well as in how I pulled the data: as I explained earlier, the query counts anyone with "bot" in their handle, so it's not just bots. So somewhere you might want to write down how you're defining this and where the number actually comes from; this could be an "about the data" section. Again, it's about providing more information and more context, because numbers can be a really powerful tool for influencing toward an objective, but if you leave things out, they don't mean much.

The last section I want to mention is that there are tools; you don't have to do all of this yourself. You can if you want to, but there is open source tooling available in the community. Again, CHAOSS has a number of projects, GrimoireLab and Augur, that are designed to help you build a pipeline or even a full visualization layer. We've also worked with CNCF DevStats, which uses Grafana as the visualization layer with a SQL database on the back end.

So, my last slide of recommendations. Again, assume you're accountable: if you can't find anyone else to point the finger at or to go ask questions of, assume it's you, so that you can answer all of these questions on your own in terms of knowing the licenses and policies at play. If you are collecting information where it hasn't been collected, displayed, or discussed before, it's wise to state your intent and the purpose for why you're collecting it, how you intend to use it, and how you intend to protect it, such that people feel comfortable sharing it with you. So really encourage transparency; especially in open source, we want to understand each other, we want to grow and make things better, and part of that is building trust and being transparent about why we're collecting information. That includes providing opt-in and opt-out scenarios where you can: some people don't like information being collected about them without their knowledge, but if they have the opportunity to say yes, it's always worthwhile to make that distinction. And then I always call out, for the researchers in the room: do you really need PII?
PII really adds another layer of complexity to what you're doing with that information, so if you remove it altogether, you remove a lot of liability in terms of potential exposure. And again, there are definitely use cases where knowing PII helps your analysis. I know of a piece of research looking at social models around projects, where the goal was to understand the social networking components of productivity in open source software development; that's a case where you need to know who the people are to understand the connections between them. But if you're just looking at aggregate trends, stats, and changes in how things are being built, you probably don't need PII.

So that's the end of my talk. I don't actually know how much time I have; one minute? Maybe one question. Yeah, absolutely. I'm going to try to fold the question into my answer, since I won't repeat the whole question in the limited time, but I took it as: how do you interpret what you're looking at? I'm a big fan of reading the methodology before the results. I know a lot of white papers want to put the flashy findings up front to catch your attention and get you involved, but often those findings don't mean a lot until you know where they came from, to your point. So I like to peruse the methodology section, make sure I understand the sources, make sure I understand the thing they're actually looking at. And to your point, most reports are built on some subset of information, or a survey run with, say, 200 people in a given year, so there are always constraints around information gathering; that's just the nature of the game. My recommendation for anyone reading a report is to know what those constraints are and try to understand them. If other existing sources look at similar things, you can start to compare their methodologies and the assumptions they've made. In academic research it will generally all be there, because that's part of the process of writing up academic findings, but white papers are a little less transparent. If there's nothing there, I would reach out to the writer and ask for more, because it can be very easy to state something without having the full picture. And I think I'm going to be migrated off stage now, so I'd like to welcome the next speaker, with one more plug for the other talks about metrics at this event if you're hanging around and interested in the topic.