I'm very happy to see so many of you here and to be here with you today. We want to talk with you about building and supporting open source communities through metrics. We'll talk a little bit about metrics and what kinds of metrics we have, give some examples of how communities have used metrics and how they support decision making, and talk about what it takes to actually get metrics about your communities, with some tools and advice on getting started as well. My name is Georg Link. I'm the Director of Sales at Bitergia, and this is Emilio, our marketing specialist.

So let's take a step back and consider why we are all here today at the Open Source Summit. At the core of open source, we care about using, sharing, and collaborating in the creation of software, with roots in the free software movement and in ensuring the rights of software users. Open source has evolved from being the realm of hobbyists and volunteers to the enterprise. Collaborative software development has taken on a new dimension in the last five to ten years. Today open source makes up 58% of software in the enterprise, and in fact 63% of companies in a 2021 survey indicated that they want to increase their use of and engagement with open source. It is now common knowledge that open source is present almost everywhere and forms the digital infrastructure we all rely on.

The Heartbleed incident really elevated that awareness. It was a vulnerability in an open source project that was used in many web servers to secure connections. The vulnerability existed for several years before being detected, and it exposed a lot of servers to security threats. Then we have another example, the Struts Equifax debacle, which exposed millions of U.S. citizens' personal information and also had open source software at its center. A more recent incident was the Log4j security vulnerability, similar to Heartbleed.
This vulnerability was in an open source software library that was used in a lot of other software. With these three examples, it was no wonder that after such high-profile incidents the U.S. Congress asked the open source community how to avoid these issues in the future. The U.S. issued a directive mandating more software supply chain security, and it looks like the European Union is now working on similar legislation and guidelines to avoid these problems.

To address this challenge, we need to understand how open source software is built. This typically involves an open source project, and there are different types of open source projects. As an example, the Mozilla Foundation released a report in 2019 on the different types of open source projects, showing that each is created for different reasons, has different governance, chooses different licenses, and engages users and other developers to different degrees. Our focus in this presentation is on open source projects that are built by a community. As we know, there are open source projects created with only one maintainer, or that are fully controlled by a company; we are going to exclude those to focus on the projects that have a community. Our specific focus will be on what challenges you may face and how to overcome them.

First we want to explain what our company already does: we have a history of working on this issue for more than 15 years. We are maintainers of the open source GrimoireLab metrics tools, and we are the official metrics partner of foundations like the OpenInfra Foundation and NumFOCUS. As interest in community health grew, we co-founded the CHAOSS project in 2017, in cooperation with the Linux Foundation and as a collaboration between industry, academia, and open source. The CHAOSS community has defined more than 70 metrics and maintains software for getting the insights that you need.
So now let's look at some examples of the metrics this community is measuring. In the CHAOSS framework, the metrics are divided across five working groups, and each group has its own focus areas.

The first group is the common metrics, where the goal is to understand what contributions organizations and people are making. They have focus areas such as contributions, time, people, and place. One example metric we can find in this group is types of contributions, with which we can measure what kinds of contributions are being made.

The second working group is the value metrics, where the goal is to identify the degree to which a project is valuable. The focus areas in this group are academic value, communal value, individual value, and organizational value. One example we have in this group is the project velocity metric, with which we can see the development speed of a project.

The third group is the evolution metrics, where the goal is to understand aspects related to how the source code changes over time and the mechanisms the project has to perform and control those changes. The focus areas here are code development activity, efficiency, code development process quality, issue resolution, and community growth. One example metric is new contributors: how many contributors are making their first contribution to a given project, and who they are.

Then we have the diversity, equity, and inclusion metrics, where the goal is to identify diversity, equity, and inclusion aspects of communities. The focus areas here are event diversity, governance, leadership, and project and community.
One example here, for virtual events, is time inclusion: are the organizers of virtual events being mindful of attendees and speakers in other time zones?

The last group is the risk metrics, where the goal is to understand how active a community exists around, or to support, a given software package. The focus areas here are business risk, code quality, dependency risk assessment, licensing, and security. An example metric here is the elephant factor, with which we can measure how the work is distributed across the organizations in the community.

Thank you for switching the mic. Now that we have seen the 70 different metrics that are defined, and there are a lot more metrics that have not been defined yet, there are many different options for what we could be measuring to support and grow our communities. So I want to show you some examples of what some communities have looked at before and what kind of decisions that drives.

I'll start off with new contributor and contribution metrics, and understanding the activity in a project through them. When we look at an open source community, one of the things that naturally occurs is that contributors become inactive after a time. They might change their jobs, they might lose interest in the project, they have personal things happening and just move on. That is normal and healthy, and that's okay, but we need to have new people coming on to the project for it to be sustainable and healthy. So what we are looking for is: are we bringing in new people, and are the activity levels staying healthy over time?

One example where we can see this is in a report from the Mautic community. I took this from the community report published in 2020, where this chart shows how many new contributors showed up. This is based on the commit log, so someone who made their first commit during these months, and this is a five-year analysis.
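To make that concrete: a first-commit analysis like this boils down to finding each author's earliest commit month. Here is a minimal sketch over hypothetical commit-log records (the emails and dates are made up for illustration; real tooling would read them from `git log`):

```python
from collections import Counter

# Hypothetical commit-log records: (author_email, commit month "YYYY-MM").
commits = [
    ("alice@example.com", "2016-03"),
    ("bob@example.com",   "2016-03"),
    ("alice@example.com", "2016-05"),
    ("carol@example.com", "2016-05"),
    ("bob@example.com",   "2016-06"),
    ("dave@example.com",  "2016-06"),
]

def new_contributors_per_month(commits):
    """Count authors by the month of their first commit."""
    first_seen = {}
    # Sorting by month ("YYYY-MM" sorts lexicographically) ensures the
    # first record we keep per author really is their earliest commit.
    for author, month in sorted(commits, key=lambda c: c[1]):
        first_seen.setdefault(author, month)
    return Counter(first_seen.values())

print(new_contributors_per_month(commits))
# Counter({'2016-03': 2, '2016-05': 1, '2016-06': 1})
```

Plotting those monthly counts over five years gives exactly the kind of chart the report shows.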
What we see here over five years is that the community went through different stages. At first there was a somewhat lower activity level, then the community grew and there were more people coming in, and then it dropped off again after a while. Alongside this, the community report also looked at the level of activity: during the time frame when a lot of people joined, there was also a lot of activity in the community, and then we saw it dropping off. When you read the report, they were implementing changes after seeing this drop-off, introducing new processes to build a stronger foundation for growth again. And you can see the impact here at the end, where they were starting to regain some of the momentum.

This is the conclusion written in the report, where these metrics have been used to really show that, hey, what we are doing is working, but then also to say, okay, we need to do certain activities. As we are growing and supporting our communities, it is good to have these metrics in place to see whether what we are doing is effective or whether we need to try something else; then we make small adjustments over time and keep building the momentum. That's how we grow our communities.

Another example is looking at organizational diversity: how many companies are involved in an open source project. Even if a project has a lot of people, if they're all employed by the same company, and the project becomes less important to that company or they pull their support, the project might just go away. So a strategic goal of some communities is to have a lot of different organizations working together, so that the project is not dependent on just one. One community that is very mindful of this is the Drupal community. Every year Dries publishes a report showing how many organizations are involved and what the level of activity is.
These charts are from last year's report, and there was a drop-off from the year before of 10% in the total number of contributors. But the number of organizations they were working for was only reduced by 2%. So the foundation of companies standing behind contributions to the Drupal community was still solid, not much different from previous years, even though COVID happened.

One of the things Drupal has done to track this, which is really remarkable, is a sophisticated credit system: when contributors make contributions, they can declare "I am doing this for me as a volunteer," "I'm doing this for my employer," or "I'm doing this as client work for a client." That's why I like to highlight Drupal: it is the most sophisticated system I've seen in all of open source for how this is tracked. I hope at some point this gets into GitLab, because Drupal is switching to GitLab and there is an open feature request being discussed; more on that, if you want, outside of this talk. Looking at these numbers, we see that the number of organizations supporting Drupal stayed mostly consistent even with the drop-off. It still shows there is healthy support.

Another community I want to highlight is the Kata Containers community. In this chart I excluded the founding companies, but we are seeing here that they had the strategic goal of growing the number of organizations contributing to the project, and we can see this upwards trend of more and more contributions coming from non-founding members. As more companies join, the graph becomes more colorful too. This is attributable to the strategy, and the metrics give support that yes, the strategy we are following is successful.

The third and last category of metrics that I want to provide some examples for is around change requests.
We have looked at contributors, contributions, and organizational diversity, which is about who is doing work. Let's take a look now at how the work is actually done in our community. A lot of us are using platforms like GitHub, GitLab, and Gerrit, where we have these change requests. In the CHAOSS community we use "change request" as a synonym for pull requests, merge requests, or change sets, because we want to be vendor neutral. The idea is that contributions are being made, and community members are asking for reviews from others or from maintainers before those changes make it into the main branch. So let's take a look at what we can learn from looking at this process.

There was another talk this morning about cycle time: the time it takes to get a first review, first attention, and so on. Looking at this StarlingX project, it takes on average about four days for a change request to be reviewed and then merged, four days for an interaction within the community. Now, this number by itself is interesting, but we need to look at the context around it. Looking at it over three years, we see that during the pandemic there was a slowdown of overall activity in the community; we see the dip in the graph, and so we might think things are getting slower. But looking at the time it takes to review a change request, it actually stayed steady and consistent throughout this entire three-year period, which shows that the community maintained its level of interaction. Maybe they were working on fewer items, but the activity level and the energy continued. So this gives us an idea of how things are going in the community, by pulling together different graphs and digging into the data about our community.
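A cycle-time number like "about four days to merge" is a simple computation once you have the open and merge timestamps for each change request. This is a sketch over hypothetical records; the timestamps and field layout are assumptions, not any platform's real API:

```python
from datetime import datetime
from statistics import median

# Hypothetical change-request records: (opened, merged) timestamps.
change_requests = [
    ("2021-05-01T10:00", "2021-05-04T09:00"),
    ("2021-05-02T12:00", "2021-05-07T12:00"),
    ("2021-05-03T08:00", "2021-05-06T20:00"),
]

def median_days_to_merge(requests):
    """Median time, in days, from opening a change request to merging it."""
    fmt = "%Y-%m-%dT%H:%M"
    durations = [
        (datetime.strptime(merged, fmt) - datetime.strptime(opened, fmt)).total_seconds() / 86400
        for opened, merged in requests
    ]
    return median(durations)

print(round(median_days_to_merge(change_requests), 1))  # 3.5
```

A median is often preferable to a mean here, because a few change requests that sat for months would otherwise dominate the average.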
And because we have this data, the community can say: okay, here is something that doesn't look right, so let's do something about it. But for that we need the metrics, we need the data, to make those kinds of decisions.

So let's talk about some challenges in actually doing this. From an organizational perspective, something to keep in mind is what the right metrics to look at are, because we are what we measure. Once we measure something, people start to game the system and adjust their behavior. So we want to be mindful of this: figure out what the strategy is, what our goals are, then work backwards to figure out what questions we have to answer to know whether we are reaching that goal, and then figure out the metrics that help us answer those questions. The metrics can also be simple. We were just talking yesterday at CHAOSScon, and Dawn had a really great example of just using four simple metrics to get an idea of a community, four metrics to give an overview of how things are going.

Then we need to think about what we actually do about these metrics. There are some metrics that are actionable, where we can change something in the community to quickly influence them, like the number of commits. I just had an interesting conversation with a community member: we were looking at the commit metrics, and one repo had really low numbers while another had really high numbers, and he said, but those cannot be compared at all, because we are squashing all the commits in our pull requests in this one repository and have micro-commits in the other. So we need to understand the context of the community when we're looking at the metrics, and sometimes making comparisons is not something we can do at all. To know what is good and what is bad, we need to establish a baseline within the same project, not try to compare ourselves to others.

And then finally, there is also the concern around personally identifiable information, PII.
When we are contributing to open source, we love transparency. That's one of the things we thrive on in our collaboration, and it's good that we know who is making contributions, especially when we talk about trusting the source and trusting the source code. But that means people are leaving their names and email addresses in the commit history, in email, in the chat history. We need to be mindful of that, because we have rules like the General Data Protection Regulation, GDPR, and other ethical concerns as well. So ideally we justify an opt-out solution, where we say: okay, this data has been provided by the contributors, they understand it is public, and we are analyzing it for the benefit of the community; we have a justified reason. Anyone who doesn't want to be included in this data collection can let us know and we'll exclude them, but we need to provide that option. If we want to analyze the data for any other reason, we might not have a justified basis and would have to do an opt-in process instead. Going around scraping emails and advertising to open source developers, for example, does not go over well.

One thing that has worked really well in our experience when starting with metrics is to be transparent and honest with the community: to say, hey, we are providing these metrics, here's the dashboard, take a look, here's what we know about the community, about you. By doing this, everyone feels a little bit better and more comfortable with what we are doing, and it becomes a resource for the community as a whole.

Now, let's move on to some technical challenges, and I have some better solutions for you here. When we want to start collecting metrics, the first question we have to ask is where we get the data from. Where is the community? I want to think not just about where the source code is being developed, but also about where the conversations are happening in the community: mailing lists, Slack, forums.
We want to be inclusive of all these spaces, because all that information, all that conversation, is important. Otherwise we create a bias towards only the code contributors, and there are so many other activities going on in an open source project that we want to be grateful for and elevate in the way we recognize contributors.

When we get the data, several steps need to happen: we get the raw data, we enrich the data, and then we present it and make it useful. So let's talk through these steps. Getting the raw data is almost the easiest step. There are APIs, there are archives, there are ways to get to the data. The challenge here is that if the source changes, we have to change the tooling that collects the data. Which, spoiler alert: if you use an open source tool, you don't have to do it by yourself; there's probably a community around it that helps with that, so we can all benefit.

Enriching the data is where we start making more out of it. We want to unify the date formats, and we want to look at the level of detail. If we get a git commit, are we just interested in who did what and when, or are we actually interested in how many lines were changed? We need to think about what kind of information we want to collect and store, what metadata we collect about it, and what context we store.

Another concern is managing identities. As we combine data from different platforms, there is a good chance your contributors are using different usernames or different email addresses. Maybe we want to combine those and say, hey, this same person is active in all these different channels, and be able to connect those. Again, a PII concern: maybe there are people who contribute with their personal identity and then with their company identity, and they don't want those to be associated. That is something to figure out with your community.

And then there's a calculation process. Some metrics don't come from the raw data.
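That identity-merging step, by the way, can be sketched as a small lookup table. Everything here is a toy assumption made up for illustration; in GrimoireLab this job is done by a dedicated tool, SortingHat, which keeps a curated database of such mappings:

```python
# Hypothetical identity records collected from different platforms:
# (platform, username, email)
raw_identities = [
    ("git",    "alice",     "alice@example.com"),
    ("github", "alice-dev", "alice@example.com"),
    ("slack",  "alice_w",   "alice@corp.example"),
]

# Manually curated merge table: aliases known to belong to one person.
merge_table = {
    "alice@corp.example": "alice@example.com",
}

def unify(identities, merges):
    """Group platform identities under one canonical email."""
    profiles = {}
    for platform, username, email in identities:
        # Map each email to its canonical form, defaulting to itself.
        canonical = merges.get(email, email)
        profiles.setdefault(canonical, []).append((platform, username))
    return profiles

print(unify(raw_identities, merge_table))
```

With that unified profile, metrics like "active contributors" count Alice once rather than three times, and honoring an opt-out or a do-not-associate request is a matter of editing the merge table.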
For example, we can get when an issue was opened and when it was closed, but we might be interested in how long it was open, so we have to subtract the two dates.

Then we want to make the data useful. The raw numbers and the raw data by themselves are not enough; how do we actually support what we want to do? Who is the user of the data, what do they want to do with the data, what kind of stories do they want to tell, and what visualizations help them do that? The way I think of this is: when you go explore this beautiful city of Dublin and want to get to know more about it, you can get a sheet of dates, in seventeen-whatever-hundred this happened, in 1204 the castle was built. Just the raw facts are not that interesting. But when you get a guide who walks you through the city, tells you the story, and uses the dates and facts to back up what he is telling you, then it sticks in your head, you get moved, and you understand much better. That's what we want to do with open source communities as well: we want to tell the story and use the data to support that story.

There are open source tools that have solved all of these challenges to an extent, and they are great starting places if you want to get started. In the CHAOSS project we have GrimoireLab and Augur; Cauldron is a platform built on top of GrimoireLab. Apache has the Kibble project, and the CNCF has its DevStats analytics. You're welcome to try these tools and use them as starting points. So we've walked you through metrics and some examples and how you can get started, and I'll let Emilio finish us off.

Yes. So, just a quick recap of some lessons and ideas from this presentation. First, we saw how to use metrics to identify where a community needs help and to track whether actions actually lead to changes. We also saw that you should track metrics early and establish a baseline, and go for the low-hanging fruit, the easy-to-get metrics, first, and then get more sophisticated later.
Also, present the metrics in context: tell the story of the community. And be transparent with the community about metrics; provide public dashboards and public reports. From us today... oops, sorry, that's okay. We're also at the booth; when you go into the sponsor area tomorrow and the day after, it's on the right. I think we still have some time for conversation. I always love to hear your thoughts on this, if you have something to share with others or have a question.

Yes. In your experience, what was the most interesting misrepresented metric, where somebody was perceiving the community wrongly because they were looking at the wrong metric?

So the question, for those online, is: what is the most interesting case we have seen of a metric being misrepresented, or where it created the wrong kind of understanding? One misleading one is around commits, where people assume that a commit means some value is being created, and equate one commit with another. When you do that, as in the example earlier, commits can mean many different things. You have merge commits, you have documentation commits; there might be a commit that just does clean-up of the code base, refactoring. Even when you look at the number of lines changed, it's really difficult to equate those, and that kind of comparison (I'm using commits as an example) is where misjudgements happen. So yeah.

Yes. In a community, how do you balance the developer work against doing metrics and a lot of data processing? I think some people will gravitate to one or the other, and these are useful, but some are maybe going to do it later. How do you balance that out? How do you arrange for people to actually focus on doing this additional work, even though it's not directly useful for the project, in terms of doing the actual commits?
So the question is how we balance contributors who want to work on the code and spend their time there versus spending it on creating metrics and getting insights for the community. I think that is where, in a community, you have many different kinds of contributors, and some are more focused on doing the actual coding. They don't need to do any extra work: all the data we were looking at today is created incidentally, just by having that trace of what was done in the project, through the commit log, through the mailing list archive. There's no extra work involved in creating the data. We just need someone who is interested in being a maintainer, having a vision for the community, and maybe doing some community management; they are the ones who would take a step back and look at the data collected. Or, if you are part of a foundation, it might already be provided: OpenInfra is providing the metrics, and the Linux Foundation has the LFX Insights platform, so maybe it's already provided for your project.

Yes. How many people are consuming? Do you look at consumption numbers as well in your collection?

So the question is about not just looking at the activity-level data that is in an open source project, but also at consumption: how many consumers there are of a project. It's a really difficult one, because there is no good data source; it's an unsolved problem we have. There are some solutions. The OpenSSF is working with some vendors that analyze open source usage in companies to get some idea of this. There's a project called Scarf, where you basically install or use a proxy on your downloads, for packages or for source code, to get an idea of how often the project gets downloaded.
GitHub provides some insights into how often a project is cloned, but even there it's a very inaccurate number, because it might just be an answer to how often someone builds their project and pulls a new version. So for these consumption numbers, there's no good way right now that I'm aware of to get them. Once we have more SBOMs, hopefully there will be a better way, but we still have a long way to go to get open source projects to adopt SBOMs on a broad scale. Yeah. Thank you.

Yes. Two questions? Okay, let's start with one. [First question inaudible.] And what are some things that developers could do to protect their own PII?

Okay. So the first question is: what is an SBOM? SBOM is short for software bill of materials. It's like the nutrition label on food that tells you what ingredients were used to make a meal; the SBOM, the software bill of materials, says these are the software components inside this piece of software. As we build on top of libraries and incorporate other software, we keep a list of everything we're using, and then we go to the consumer and say: here is my software, and here are all the pieces that I've reused from other projects. Ideally you also declare what licenses they use, and so on. So that's what the SBOM is, the software bill of materials.

And then the second question was how you balance the opaqueness, the contributors or maintainers who do not want to expose their PII. That's a conversation to be had in open source projects: whether you allow someone to use an anonymous account, or whether you actually want to have that transparency, to have trust in who's doing what. Because one of the things that can happen when we don't have that transparency is that someone comes in as a malicious actor and inserts a back door that then gets into all the projects that are using the library.
So, I don't know if you want to go there, but we might be able to come up with a good solution for that. If we use Hyperledger and blockchain to verify identities in an anonymous way, there might be some solutions we could work on, but I'm speculating; I'm not very good at this myself.

We still have about five minutes, so does anyone have experience looking at metrics, their own experience they can share with everyone here? Or, if you are using metrics, what kind of tools do you use?

Yes. [An audience member, partly inaudible, describes using activity metrics to judge which of several technologies is on the right path, notes that this has changed over time as projects evolved and took on new ownership, that it is very hard to get metrics on whether programs or activities are working, and asks how to compare individual contributors with those working as part of an organization, since those are very different.]

So, to summarize for those who couldn't hear: this use case for looking at metrics is to determine, if there are multiple technologies we could be using, which one is most likely to survive and be maintained long-term, looking at activity metrics and making the case for the best guess about which one to rely on now. Thank you for sharing that. Maybe you can use the microphone, if someone else has something they would like to share.

Thank you very much for your presentation, it was very good. Just a question: if the metrics say, for example, that a community is not doing well, what are the steps they take to reinvigorate the community? Is it dependent totally on the owner, as you said, for example if something is owned by the Linux Foundation, is it the responsibility of that owner? Or how does the
community get reinvigorated?

That's a good question: how do you get a community back to health? This is where I would recommend looking at some community management best practices for what to do. Probably look at the community, identify what's actually going on and what the history is, and understand the context a little better. It could just be that the technology itself has lost its appeal, and so you have to take one approach; or it could be that you have a toxic actor who is pushing everyone else away, and then you need to take a different approach. So you need to do some digging to understand why the community has died down, and then there are steps to mitigate that. That's around what brings people to the project and how you get the word out; maybe the project itself is set up in a way where it's really difficult for someone to get started, and so you need to work on making it easier to onboard new contributors. Community management is a really complex topic, so I don't have a cookie-cutter "here's your solution" answer, but looking at the metrics can help you identify whether the steps you are taking are effective or not.

All right, I just got my warning that we are out of time, so thank you so much for joining, and have a great conference.