Okay, it is 11 o'clock, so I'm going to get started. I'm going to start my own timer so I know how I'm doing. Thank you all for joining me today to talk about regulations, ethics, and messy data. As you can see from the title, and from the title I've used to describe myself, this is very much personal anecdotes from working with data in and around open source projects. I'm currently working at Google's open source programs office, and if you're not here in person and can't find me after the talk to ask questions or continue the conversation, you can find me on Twitter, or you can leave feedback and ratings through the online virtual platform.

For a quick introduction about myself and why I'm interested in this topic: I started my career as a market researcher working with survey design, building, and analysis. I eventually became an analyst at Forrester Research covering the data center and infrastructure market and infrastructure economics. In that role I served as a consultant for builders, buyers, and sellers of data center infrastructure solutions. Then, in about 2018, I moved to Google, where I continued to do market research and industry analysis, but focused on supporting our products. Last year I moved over to the open source programs office, where my title is officially program manager, but my focus in this role is on research and analysis. I get to look at a lot of our contribution data, and I still get to build and run surveys; now they just might be in the community. So I've been working with data in and throughout all of these various roles.

You might also know me from the KubeCon registration page. Thanks to the cancellation of all in-person events, my face has been stuck on the website since March of 2020. So I'm personally looking forward to the return of in-person events, getting some new faces out there, and maybe eventually being able to retire myself from this page.

The general theme of this talk: I've been working with all kinds of data for the last 10 years, whether it's survey data, business data, public data, or financial data, and in the last year I've been working with data around open source projects and communities. There was a fair amount that I didn't realize or know well, even though I thought I knew the space, and that comprises the theme of the talk: things I wish I knew before I started doing this job. I gave a similar talk at Open Source Summit a few weeks ago; today I'm focusing a bit more on the implementation details themselves. So if you were to go out and build a metrics program, a dashboard, or some kind of reporting and analytics tooling, that's what we're building toward.

One of the first things I had to come to terms with, and this shouldn't be groundbreaking for anyone, is that there is no single definition of an open source project. We could be talking about a code base, a data set, a framework, or a series of designs if you're working with open source hardware. I'm also an active member of the CHAOSS community, where we talk about metrics and building a common language to discuss things like community and project health. There are some sticklers who would say that open source is defined by the license it's governed by, but the beauty of a lot of these licenses is that they're not actually specific about what they apply to.
They apply to all of these different types of things and assets. And as varied as the definitions are around a project, you'll see even more variation in all the different ways you can work on and contribute to a project: from interacting with the code base, to community management and event planning, to documentation, localization, translation, marketing, communication, content development, legal, financial; we could go on. This is an active area of conversation in multiple parts of the community, where we're all trying to come together to better understand how to contextualize contributions beyond just code, because a lot of these things aren't in readily available systems that we can collect data from, so we need to artificially create templates and ways to recognize these other kinds of contributions.

If this is a topic you're interested in, I've put a couple of resources here. The first is a piece of research out of the University of Vermont that looked at ways you could start to categorize and model various types of contribution beyond code. The second is that within the CHAOSS community we're actively talking about metrics you could consider to measure all these different types of contribution. Side note: throughout the presentation I will continue to add little green boxes of research or further thinking and work on these topics.

I wanted to bring this up because when we start talking about building out metrics or analysis programs, thinking about all the different ways we can work on and interact with a project leads us into the question: what kinds of data sources are we going to be working with? Are you predominantly focusing on Git or issue streams? If we're thinking about troubleshooting and support, maybe we want to better understand how users interact with the project, which could mean looking at forums like Stack Overflow. If you're thinking about community engagement, or want feedback on your event (like today), or on a tutorial or documentation, maybe you'll run a survey. And if you're focused on marketing and promoting your materials, you might set up something like site analytics. These are just a few examples; there are so many different platforms you can collect data from to better understand your community composition and your project.

Now, even with that broad expanse of things, if we focus just on, say, GitHub and the repositories and the code base themselves, we don't necessarily see fixed parameters around what it means to be a project, or what set of repositories constitutes a project. While putting this talk together, I did a quick exercise: can I actually name all the repositories that are part of the Kubernetes project? Does anyone in the room know how many repositories are currently active today? When I looked six months ago it was about 220-something; today I want to say it's a little over 300, but what I'm unclear on is how many of them are still active, because I know some have been archived. So I reached out to Bob Killen, who works within the Kubernetes community and is also one of my colleagues at Google, and asked: hey, how many repositories are currently within the Kubernetes community? And he said: run this script, and then run it across all of these organizations.
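I don't have Bob's script handy, but a rough sketch of that kind of repository inventory, in Python with the `requests` library and a GitHub personal access token, might look like this (the org list here is illustrative, not the full set):

```python
import requests

# Illustrative subset; the real inventory spans more organizations.
ORGS = ["kubernetes", "kubernetes-sigs", "kubernetes-client", "kubernetes-csi"]

def active_repos(org: str, token: str) -> list[str]:
    """List non-archived repositories in a GitHub organization."""
    repos, page = [], 1
    while True:
        resp = requests.get(
            f"https://api.github.com/orgs/{org}/repos",
            params={"per_page": 100, "page": page},
            headers={"Authorization": f"token {token}"},
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        # skip repositories that have been archived
        repos += [r["full_name"] for r in batch if not r["archived"]]
        page += 1
    return repos

# total = sum(len(active_repos(org, MY_TOKEN)) for org in ORGS)
```

Even this sketch makes the point: you have to already know which organizations to feed it.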
So as I was sitting there trying to browse through GitHub and find all these things organically, I realized there's a bit of a tribal knowledge problem: you don't necessarily know the parameters around the project. If you're just looking at the GitHub event stream and trying to identify which pieces belong to which project, you may not be very successful on your own, because these things are constantly changing. We've seen this in the CNCF, where projects merge and become one, like OpenTelemetry, or break off and become more. We keep adding more things to the pie.

So I really wanted to expand the universe a bit, to acknowledge that any time you're building a metrics program you're most likely only looking at a subset of this information. It's not the full picture; it's a very focused piece of the puzzle. And that's completely fine. That's how we draw boundaries to create real analysis, things we can actively measure rather than just conjecture. But it's never going to be the full picture. One thing I'll note here: there's another area of research out of the University of Vermont that is trying to actually put a number on how much of open source is represented on GitHub, because there was, and still is, a lot of open source that exists outside of it. Even the copious research efforts that make use of the wealth of information happening on GitHub are drawing conclusions about open source from what isn't necessarily the full picture. This piece of research looks at how to contextualize that, and starts to investigate alternate trajectories: what we can learn about what's happening on GitHub versus other styles of work happening outside of it.

The next area I want to bring up is the data itself. When I started working with this kind of information, there's this assumption that, okay, it's public. You put information on GitHub about yourself, you perform activities, anyone can go to your page, look at what you've done, look at the information you volunteered. It's just public; what's the problem, right? Well, I'm going to put out a proposition here that even this kind of public data deserves classification. Maybe I'm coming from the background of working for a large company that thinks a lot about privacy and data regulation, but every company has its own way of associating risk categories with the data it collects, whether that's employee data, customer data, product usage data, or infrastructure monitoring data. Everything has its own classification and risk associated with it: risk of exposure, risk to the individual, risk to the company. There are so many different ways to do this, and it's very contextually driven by the individual parties. Then there are the regulating bodies that help influence and guide the policies implemented on the company side. And then you have the project communities. Something I love about open source is that every individual project gets to create its own governance structure, decide how decisions are made, and decide how to guide the evolution of the project. But that also means a lot of these questions are open-ended in the context of a project.
And the reason I want to call attention to this is that open source channels are chock-full of personally identifiable information, or PII for short. It's the right of the individual to decide how much they want to volunteer, and I love that we can accept and cherish active contributors who volunteer very little about themselves; we continue to work in a model that allows for high levels of anonymity. But at the same time, the more you learn about someone, either by looking at all their activity on GitHub or by starting to bring in other channels, the more you can infer about them. If you've ever worked with the GitHub activity stream: if you look at a push event payload, every single commit has the author's email attached to it, along with whatever name or login they associate with themselves (I'll show a small sketch of what I mean in a moment). And this is really important if you've been going to any of the secure supply chain or root-of-trust talks, because they rely on this piece of information to verify the person who made the original commit.

In addition to that, say I've also volunteered that I work at Google, I live in New York, and I have a Twitter handle where you can find me. Maybe you can see my GitHub page; it's very sad, I haven't actually done a lot there. But this is enough information to find me on another social media platform like LinkedIn, where now maybe you've confirmed this information and also learned where I went to college and where I worked before Google. These things can quickly snowball, so from a data aggregation standpoint, you're adding more and more information about someone. Even though they've opted into a specific platform and the level of data collection happening on that platform, in terms of your analysis program, the more you know; which is the suggestion for thinking about how you might want to classify and treat this information.

And this isn't to say there aren't valid use cases for collecting PII around projects. I know many projects use contribution information to inform who should go up for election in a given year, or to better understand the composition of the community so they can better support underrepresented groups. If you're like me and work for a company that's trying to contextualize employees contributing in and around open source, you might also bring in that data and add it to the pile to better understand, say, who within your organization is doing what, what products they're working on, and what their background is. For me, working in an open source programs office, we use this information to ask: are we serving our population? Do we know who our population is? Are we providing the right tooling, and is it possible for them to go out and work in these forums?

And as a data analyst, I will definitely state that it's so much better to work with huge comprehensive tables where everything is in the same place. It makes it easier to write queries, to design filters, and to explore the information. But when we're talking about this kind of data, a lot of it is sometimes best served by giving it back to the people it's about, whether that's for research purposes or to show their contribution levels more holistically. Whatever it is, there may be reasons to share some of this back with the community directly.
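Here's that sketch: a minimal Python example of pulling author names and emails out of push events, for instance from a GitHub Archive JSON dump, with field names following the public event schema:

```python
import json

def commit_authors(event_lines):
    """Yield (name, email) pairs from a stream of JSON-encoded GitHub events."""
    for line in event_lines:
        event = json.loads(line)
        if event.get("type") != "PushEvent":
            continue
        for commit in event["payload"].get("commits", []):
            # every commit in the payload carries the author's name and
            # email address in the clear
            yield commit["author"]["name"], commit["author"]["email"]

# e.g., with a downloaded GitHub Archive hour file:
# import gzip
# with gzip.open("2020-01-01-15.json.gz", "rt") as f:
#     for name, email in commit_authors(f):
#         print(name, email)
```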
And I don't think this is a problem, but when you start to marry all these various sources, you increase the risk level of that information, so you want to be sensitive about how much of it you expose at any given time. The general suggestion here is: if you're comfortable with it, tokenize and compartmentalize, and then individually authorize access to tables before you join them. You as the analyst can have full access to all of these things, but you can still portion out and share specific tables directly with the public if you want to. (I'll show a tiny sketch of what I mean in a moment.)

I also want to mention that collection mechanisms, in and of themselves, can impact the information you collect. This is more apparent in what I'm calling direct collection methods versus indirect ones. Indirect is when I'm pulling from the GitHub API or the Stack Overflow API: all the activity and information shared on those platforms is contained within the platforms, and the individuals interacting with them understand those parameters. But in direct collection, you are the person going up to someone and asking: where did you come from? What groups do you identify with? That interaction alone can potentially impact what people say, because sometimes it's uncomfortable. Maybe people aren't as comfortable sharing information with people they don't know; there's a high level of trust involved here. Or maybe they're just shy. There are many reasons why these formats can affect what you can learn about people.

So the general recommendation is to provide as much anonymity as possible, especially when you're asking about more sensitive or personal areas. Demographics, and even feelings, satisfaction, opinions, and feedback, are all better served in completely anonymous formats. To preserve that, either don't collect email addresses in these forms at all, or, if you need them for follow-ups, meet-ups, or contact purposes, collect them through a separate venue. There are ways to do this; just don't mix them, because mixing can make people uncomfortable and less likely to share information with you. Also, provide opt-outs. People don't have to answer every question if they don't feel comfortable; this is not something you should force people into.

There's also a caveat if you've ever thought about data set size and the ability to identify individuals: if your community is very tiny, then even though everything is completely anonymous, you might still be able to figure out who people are based on what they write. So, as the manager of that survey and the data collected around it, use your best judgment to decide whether things are appropriate to share back or whether they carry some level of risk and exposure.

The general recommendation across all of these things is to be transparent, especially in open source, where we're actively trying to build trust and work with each other. If you have reasons to collect this information, be upfront about them. What's your motivation? "I want to better understand the community so I know what to invest in to encourage more participation and more representation." How are you going to collect this information? How are you going to store it? How are you going to secure it? And how are you going to maintain accountability for it?
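To make the tokenize-and-compartmentalize idea concrete, here's a minimal sketch, one way to do it rather than a prescription: identities get replaced by keyed-hash tokens, the salt and the token-to-identity mapping live in a restricted place, and only tokenized tables get shared or joined broadly.

```python
import hashlib
import hmac

# Hypothetical secret; in practice, store this somewhere access-controlled,
# alongside the token -> identity mapping table.
SECRET_SALT = b"keep-me-in-a-restricted-store"

def tokenize(email: str) -> str:
    """Replace an email address with a stable, non-reversible token."""
    digest = hmac.new(SECRET_SALT, email.strip().lower().encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

# The shared table carries only tokens; an analyst-only table keeps the mapping.
row = {"author": tokenize("jane@example.com"), "commits": 42}
```

Using a keyed hash (HMAC) rather than a bare hash matters here: email addresses are low-entropy, so anyone holding a shared table could otherwise brute-force them back out.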
Again, another example: I work in the CHAOSS community, and we've recently been working on diversity, equity, and inclusion badging programs for events, covering speakers and other participants. But in order to do that, we're asking some pretty sensitive questions, so we thought it was important, as a community, to state what we're going to do with that information and how we're going to protect it and keep it secure, within the expectations of those who volunteered it to us.

I'm noting another piece of research at the bottom here. UC Davis recently investigated what they're calling sustainability forecasting; there's some debate about the use of "sustainability" there. They looked specifically at the graduation rate of projects in the Apache incubator program, to see whether any discernible metrics are better indicators of success than others. The overwhelming observation was that it's communication, just communication levels, not specific kinds of communication. This has basically opened up a whole new slew of research opportunities to understand the dynamics here. It's only tangentially related to what I'm talking about, but my gut says it's connected: the better we can talk to each other and communicate about what we're doing, the more collaboration and trust we can build within our extended community. For the record, this talk is being recorded. I know this; but if you do want to ask questions during the live stream of the event, just note that we are collecting that information to interact with anyone who registered for this event and beyond.

The last point I want to make here is a bit of an odd one, because I'm not really sure what the best thing to do is, but if you're working with, say, GitHub activity logs, they are filled with bots and automation. They could be accounts explicitly labeled as bots, or automated scripts running under personal accounts. For a sense of scale, I wrote a very basic query identifying bots as any actor login containing "bot"; so if anyone in the room has BOT in their GitHub handle, I'm counting you. This is not perfect, which is why I say it's an estimate, and why I rounded the number down severely: bot accounts performed over 120 million activities on GitHub, as logged by the event stream, in 2020. (A sketch of that query is below.) For a sense of who these bots are, and take the "who" with a grain of salt, Dependabot is the largest one; note that the chart here is in millions of activities, so this is quite substantial. This piqued my interest, and when I dug further I found that bot-related accounts were responsible for 44% of push events in 2020, up from 25% in 2019. That's a significant portion of activity now mixed in with human-driven activity. Depending on what you're trying to show in your program, you're going to have to deal with this at some point.

The current good practice here is to create some kind of referenceable list. In my case, working for a company that wants to understand the activities our employees generate, I have a known list of employees and can pull from there. On the flip side, you can keep a known list of bots and filter those out. That's what the DevStats project does; if you're not familiar with DevStats, it builds all of the wonderful metrics and analytics around CNCF projects.
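Before getting back to DevStats, here's roughly the shape of that estimate query; a sketch assuming access to the public GitHub Archive dataset on BigQuery, via the google-cloud-bigquery client:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Crude heuristic: count anyone with "bot" in their login as a bot.
# This both over- and under-counts, which is why the resulting number
# should only ever be quoted as an estimate.
sql = """
SELECT actor.login AS login, COUNT(*) AS events
FROM `githubarchive.year.2020`
WHERE LOWER(actor.login) LIKE '%bot%'
GROUP BY login
ORDER BY events DESC
"""

for row in client.query(sql).result():
    print(row.login, row.events)
```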
DevStats also keeps a background file of all the known bots across these various projects, so that it can report on human-centric activity.

The next point I want to make is about the measurement component itself: when you measure, you influence the thing you're measuring. This seems kind of straightforward and boring. I was thinking about this talk this morning and I looked at my Fitbit: 600 steps at 8 o'clock, and I hadn't even left my hotel room yet. It's now been a few hours, and I've clearly moved around a bit since I took that measurement; the act of taking it, while packing, was itself part of what got measured. But what I really mean here is that Git activity logs do not stay consistent over time. In fact, history can change. That was kind of mind-blowing for me as an analyst who has worked with a ton of business data, reported data, and financial data, where if historical information changed it usually meant somebody made a mistake, or forgot to count something and had to go back and correct it. But in this context, if a project has been around for a long time and has a lot of forks, and a fork ends up getting merged back into the main branch, that fork's history becomes the history of the project. So depending on when you pulled the data: if you wrote queries around a specific time period and rerun them in a few years, the results can look different, because the composition of the project has changed over time. Your historical and current states don't always match.

There are also all the inconsistency challenges that other data sets have. Specifically in this context, the GitHub APIs have rate limits. There are ways to get around or reduce this, but it often results in lost or missing data, or requires some sort of cleaning and correction process. To give a sense of some of these inconsistencies: GitHub Archive is a comprehensive view of the GitHub event stream, aggregated and consolidated so you can view it all in one place. I ran a very basic query that looked at the total activity count for the same time period, 2020, as shown in the yearly table, the monthly tables, and the daily tables. And those numbers do not agree. They're in the same order of magnitude, but there's a fair amount of variation. So something is probably missing somewhere. Do we know exactly what? Is something being over-counted somewhere? Potentially. Again, it comes back to asking: are we actually certain about this information, or is this just an estimate? If this is interesting to you, GitHub Archive is a public project on GitHub and is looking for contributors.

Publishing metrics also has the potential to influence the behavior it's reporting on. I've seen this come up in things like leaderboards. Here I'm showing the Kubernetes leaderboard of total contributions, which I'll come back to in a second, because I actually like this one. But there are other examples where leaderboards can influence things you don't want. Say you report on just commits: then maybe you're encouraging people to submit a lot of commits and not do the other things that support the project, like addressing issues or supporting event management. So be wary of what you're incentivizing by showing any kind of ranked list like this.
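For reference, that GitHub Archive consistency check was roughly this shape, again a sketch against the public BigQuery dataset, with table names following its yearly/monthly/daily naming convention:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Count the same year's events three ways: from the yearly table, the
# twelve monthly tables, and the ~366 daily tables.
sql = """
SELECT 'year'  AS source, COUNT(*) AS events FROM `githubarchive.year.2020`
UNION ALL
SELECT 'month' AS source, COUNT(*) AS events FROM `githubarchive.month.2020*`
UNION ALL
SELECT 'day'   AS source, COUNT(*) AS events FROM `githubarchive.day.2020*`
"""

for row in client.query(sql).result():
    print(row.source, row.events)  # in practice, these totals don't quite agree
```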
And coming back to the leaderboard: where you sit in the rank doesn't actually translate to the value or quality of those individual contributions. So if you do want to share something like this, it's important to include context with it. Here I'm showing the combined contribution metric, which I like because it combines a few things, push events, issues, comments, and reviews, so that we're looking at a more well-rounded picture than commits alone, even if we're still not looking beyond code in this sense. Pick a time period that's relevant to you: I could have pulled this for 10 years ago, but I chose the last year, because I think that's more real and apparent for folks at this conference today. And call out any particular externalities that could dramatically impact what you're showing, say, this past year of isolation and changing work and productivity patterns, which might affect what these numbers look like over time. So, not to say you shouldn't publish metrics; they can be a very effective tool to influence behavior. It's more about doing it purposefully: what outcomes are you trying to achieve by showing these metrics? In the same project, I really liked the focus on time to first review, as a way to demonstrate not just productivity in the project but also the responsiveness of the project.

The last section I want to address is around policies and regulations, for which I chose a big, angry red slide, because this is where I bring out the huge caveat: I am an analyst, I am not a lawyer. So if you're doing this in the context of your organization and you have access to legal teams, use them; consult with them and with your compliance organizations, because amassing large amounts of information with PII will raise a lot of red flags in your organization, or it could. To bring back the earlier slide, I want to again call out that there is a sort of ambiguous accountability structure around this public-ish information about people and projects. Depending on who you are, where you are, or where you work, there are implications for who is actually responsible for the data, who's accountable at the end of the day: is it just you and your individual project, or you and your company and that organization?

As a general recommendation, it's always good to do a little homework to understand whether there are specific terms, policies, or regulations this data is subject to, whether that's coming from the data set, the project, the tooling, or any privacy or policy documentation from a platform, a company, or an event. Then there are also local and regional regulations that can create issues around how and where you can collect and store information about individuals. I'm linking to a couple of resources here; the Linux Foundation published a report on how to consider export controls as they relate to interacting with international communities across borders. Because we're dealing, again, with self-organized groups, whereas if you work at a company there are very clear distinctions about who the customers, users, and employees are and how we treat those pieces of information. In open source, all these things are being built individually. So I think of them as various layers of choice, policy, and regulation.
Start with the individual in question, who chooses what they share with you, chooses which platforms to opt into, and in doing so opts into the collection mechanisms of those platforms. The platforms themselves dictate their own policies, and those policies are subject to regulation. Then there might be another layer, where we're using a pipeline or aggregation tool that can have its own policies while also inheriting everything that came before it. Just because the policies sit on, say, GitHub doesn't mean they don't apply to you once you've pulled the data from GitHub; you're inheriting everything stated in their privacy policies. Then there's the aggregating entity that's actively storing, aggregating, and amassing the information, and it can be subject to its own regulations and requirements, which could come from company policies or be dictated by the project or community leadership.

You could go through that entire exercise and find a lot of missing pieces; not everything is going to be clearly defined. In the absence of defined terms of use, it might be time for you to think about designing privacy documentation. Again, I'll bring up the CHAOSS example, because we had hit a point where we were starting to collect information beyond what people knew they were sharing with us by opting into GitHub. So we wanted to openly state: this is what we're going to do with the data, this is how long we're going to retain it, this is how we're going to hold ourselves accountable to it; essentially assuming that position for yourself as the individual, corporation, or entity amassing the information. If you have any doubt and you have access to lawyers, involve them. And as a recommendation to users: read the fine print, or else you'll end up like me, stuck on a website, going back to check my Barcelona registration to confirm that, yes, in fact, the KubeCon and CloudNativeCon organizing committee did have the right to use my photograph in promotional materials.

So what do you do now? I put together some good data practices. I don't call them best practices because this is definitely learning in progress, and they're focused, again, on the perspective of the analyst and the focus of this talk. This really relates to how we share this information back with the public once we've collected it, and the intricacies of that. Things I like to keep in mind: state what you're counting. A rubric I like to use is: could someone else recreate the metric from how you've described it? This isn't to say you can't use more descriptive terms like "contribution" or "engagement"; just make sure that somewhere in your write-up you've included how you defined these things, because there can be so much variety in them. Always state your sources; this is how people will contextualize what they're looking at. Sources include the method, the assumptions, and any baked-in bias or boundary that shapes this particular subset of information. Remember that the how and the when affect what we're actually looking at, and can affect the information itself. Maybe also openly state what kind or type of contribution or contributor this represents. And as a last call-out, there's a lot of funkiness in this data, so: is this a certain fact, or is this an estimate?
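As a hypothetical example of what "state what you're counting" might look like in practice, here's a contribution count defined precisely enough that someone else could recreate it; the choices here (the event types, the time window, the org, bots left in) are all illustrative:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Stated definition: one "contribution" = one PushEvent, IssuesEvent,
# IssueCommentEvent, or PullRequestReviewCommentEvent during calendar
# year 2020, on any repo under the kubernetes org, with bots NOT filtered out.
sql = """
SELECT actor.login AS login, COUNT(*) AS contributions
FROM `githubarchive.year.2020`
WHERE repo.name LIKE 'kubernetes/%'
  AND type IN ('PushEvent', 'IssuesEvent',
               'IssueCommentEvent', 'PullRequestReviewCommentEvent')
GROUP BY login
ORDER BY contributions DESC
LIMIT 20
"""

for row in client.query(sql).result():
    print(row.login, row.contributions)
```

Anyone reading that definition can reproduce the numbers, argue with the choices, or adjust them, which is exactly the point.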
If you've read academic papers, there's a fairly rigorous template for reporting on the methodology: how the data was composed and analyzed. In free-form content, it's kind of up to you as the writer. As an example, I wrote a blog a few months ago and included an "about this data" section, so you could see the information I thought was relevant to convey when sharing it.

If you're looking at building out a metrics program, I did want to call out that there are tools available in the community. I've mentioned DevStats a few times; if you're curious, it's based on the GitHub Archive project, again, that comprehensive GitHub event stream, loaded into a Postgres database with Grafana as the visualization layer. From the CHAOSS project, we have tools like GrimoireLab and Augur that are specifically designed for contribution metrics.

So, as a final slide: know your licenses, and know any policies that apply to the things you're working with. If you're going to collect data where it hasn't been collected before, it's probably worth stating your intention. If you're starting to amass this information where it hasn't been amassed before, maybe you want to create some policies or design privacy documentation, so you can openly communicate and keep yourself accountable to the community. If you're collecting more sensitive information, allow people to opt out; they shouldn't have to share things that aren't relevant to your use case. And then really question: do you need PII, or will an anonymized version be sufficient for your needs? For a fair amount of research that's just looking at behavioral trends, we probably don't need PII; whereas another piece of research I read recently looked at building social network models around projects and productivity, and in that case you probably do need to know a little more about the people. So it really depends on your use case. It's just a caution, because every time you add more PII, you add more risk and potential sensitivity to your data set.

So that's it for my talk. I think I have four minutes for questions. If you don't have any, you're welcome to leave; I'll be hanging out up here. Oh, I see one in the back.

You can go up through any of the steering or advisory groups; I know they do have some legal pieces to them, so if you're under a foundation, that could potentially be a support model. I know, for example, the Kubernetes steering committee was fairly influential in designing the community survey for Kubernetes around some of these more sensitive issues; they were definitely imparting their own guidance to encourage better, or less risky, data practices. So that's one starting point. Lawyers are pretty expensive; I actually have a friend at a startup that's building a legal consulting group that can be shared and distributed across a lot of smaller parties, but that's typically geared toward companies rather than projects. So I'd say this is an area where I hope more lawyers become interested in volunteering their time, because they are very expensive. Where possible, look for existing guidance, and again, lean on your foundation boards.

A lot, mostly because the question becomes how that regulation gets imparted on the collection and usage mechanisms. Yeah, it's a consideration.
So that's one of the things where I want to ensure that anything I collect is also compliant with GDPR, even though it's technically not business data or customer data. Because of that, we're treating any data we amass around open source as user data, even though they're not technically users; that puts it at a higher level of consideration and management, and also limits its exposure. So we're assuming a higher level of responsibility even though it's mostly public information. Additionally, GitHub explicitly describes how the data collection practices available through its API should be considered in the realm of GDPR. I think they've done a really good job of outlining how to apply all the specific policy elements required by GDPR if you're pulling from that platform, as well as with Privacy Shield. On other platforms, it's more up to you to understand how to apply those principles to your own practices.

Well, if that's it, you are welcome to leave. Thank you so much for joining us today.