Hi, everyone. Thanks for joining us for our talk about how we made GitHub's open source program office data driven and why that matters. I'm Ashley Wolf, and I lead GitHub's open source program office. And I'm Natalie, and I'm a data analyst at GitHub.

You may have heard of GitHub. It's the platform many of you use to develop software and work on open source. We're home to 4 million organizations, 200 million repositories, and hundreds of open source communities. At GitHub, open source is in our DNA, starting back in 2007 when we made our first commit to open source. Our journey has grown exponentially from there. We've open sourced thousands of projects and launched many programs to enhance the developer and community experience.

Last year, we created our very own open source program office, known as the GitHub OSPO. The OSPO's mission is to empower GitHub staff and the community to contribute to and use open source. The success of our mission depends on understanding how we're doing and where we can improve. We realized that having a system of measurement and real-time analysis was critical to accomplishing our goals, so we set out to create a data-driven program that we could rely on to carry us forward on key initiatives. We already partner with many different groups at GitHub and beyond, and to accomplish our goals, we needed to add a new partner.

That's where I come in. I'm on the data team, which manages the data warehouse, core business metrics, and business intelligence tools for teams across the company. We partner with teams to enable them to make decisions with data. Even though we're an internal team, we often support data requests for open source. We provide data to support research requests from open source foundations like the Linux Foundation, and we help with the annual FINOS reports.
We generate all the data and analysis behind GitHub's annual State of the Octoverse report, which analyzes data from millions of developers and repos to share trends across working habits, productivity, and career satisfaction. We partner with GitHub products like the Feed, Sponsors, and Education to improve the user experience with data. This was the first time our teams had worked together, and personally, I was really excited to participate in this initiative and learn more about how organizations measure success around open source. With our new alliance between data and the OSPO, we were ready to get going.

Now we'll dive into the framework we applied and some demos of the results. We'll also share some challenges that might help you with setting up your own data-driven OSPO.

Let's look at the framework we used. First is defining business objectives for the OSPO: what questions are you trying to answer? Next is defining data measurements. For this part of the framework, you're identifying and defining measurements that answer the overall objective. Can these even be measured, and how? Then comes data understanding and analysis: what data, if any, is available around the objective and measurements? This step is where you'll explore and learn about the data to formulate OSPO metrics. Document all data qualities or issues that you uncover while performing this step. Next is evaluation. This step is where you'll decide whether the metrics answer your overall objective. Is there anything else to incorporate in your definition? Were there any limitations uncovered? You'll want to summarize and document your data findings and your possible next steps. And last, share findings. This will vary among companies, teams, and individuals. Is it a single report or a reusable dashboard? You may want to think about what tool or output you need. This process may seem linear, one step to the next, but it's not.
Revisiting a step multiple times is not only expected but recommended. We'll share how this played out for us, too.

Within the OSPO, we took this framework that Natalie shared, OSPO questions, data questions, measurements, analysis, and reporting, and applied it to three verticals. First, the open source we use: our dependencies, the projects we rely on. Second, the open source we contribute to: these are contributions we make outbound to projects we rely on and ecosystems we want to help sustain. And third, the open source we release: these are new projects we create under company-branded organizations. For each vertical, we'll apply the framework.

First, we'll start with the open source we use. The OSPO questions we were interested in asking were: what are the things that we depend on? How healthy are they? Where are they used? Are there any security vulnerabilities? What are they licensed under?

Before we move forward, we need to define what a dependency is. At GitHub, we use something called the dependency graph to identify repositories that are used in production or development of our products and projects. This also powers some of our tools, like Dependabot. The big hurdle for us was collecting an accurate list of all the organizations we have across the company. Once we populated that, we were able to cross-reference it with our dependency graph to get the full list of all the repositories we depend on. We explored the data and made changes along the way to ensure we included all GitHub-managed organization dependencies. In the SQL, you'll see we define this as GitHub-owned repositories. We validated the results and often revisited previous steps of the framework along the way. Ultimately, this led us to create a dependency dashboard. Unfortunately, we can't share that snapshot with you, though we will share other dashboard snapshots later. But we do want to share some of the fields we were able to track in that dashboard.
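The cross-referencing step described above can be sketched in a few lines. This is a minimal illustration, not GitHub's actual pipeline: the `company_orgs` set and the `dependency_graph` mapping are hypothetical stand-ins for the real managed-org list and dependency-graph data.

```python
# Sketch: build a dependency inventory by cross-referencing company-owned
# repos with their dependency-graph entries. All data below is illustrative.
from collections import Counter

# Hypothetical inputs: the orgs we manage, and a map of repo -> dependencies.
company_orgs = {"github", "octokit"}
dependency_graph = {
    "github/docs": ["nodejs/node", "vercel/next.js"],
    "octokit/rest.js": ["nodejs/node", "sindresorhus/got"],
    "someone-else/app": ["nodejs/node"],  # not ours; excluded below
}

def is_company_repo(full_name: str) -> bool:
    """A repo counts as ours if its owning org is on the managed-org list."""
    return full_name.split("/")[0] in company_orgs

# Count how many of our repos depend on each external project,
# giving the dependent count tracked on the dashboard.
dependent_count = Counter(
    dep
    for repo, deps in dependency_graph.items()
    if is_company_repo(repo)
    for dep in deps
)

print(dependent_count.most_common())
```

In practice this same join was expressed as SQL against the data warehouse; the point is simply that the inventory falls out of one cross-reference between the org list and the dependency graph.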
So for our list of dependencies, we were able to add data around licenses, so we knew what everything was licensed under and whether there were any security vulnerabilities; whether the repository was eligible to be sponsored, based on whether it had a FUNDING.yml file; whether an employee created the repo or an external community member did; the last contribution date, to help us determine how recently active that repository was; and lastly, the number of times it's used, so we could determine a dependent count.

The next vertical we looked at is the open source we contribute to. The OSPO questions we had here were to better understand who is contributing to open source, who the top contributors at the company are, and whether we are doing the majority of contributions ourselves or others are. The OSPO questions translate to data questions, primarily: what is a contribution, and what is a contributor? We define a contribution as these activities done by users; a contribution to us is not just pushing code, but other activities too. We also noted that there were three contribution scenarios: first, contributions to projects we depend on; second, contributions to projects we create and release; and third, personal project contributions. To reduce privacy concerns, since we're monitoring what staff are doing in open source, we only looked at contributions made to our dependencies, both to reduce the creepiness factor and to exclude personal projects and anything that wasn't work related. At the end of this work, we only wanted to focus on work-related contributions, so we looked at the contributions made to dependencies. This is an example of a query that establishes those activities to calculate contributions. The data looks back at the past 365 days, which is the timeframe we determined.
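The filtering just described comes down to keeping contribution events that target a dependency, belong to an accepted activity type, and fall within the last 365 days. A rough sketch of that logic, with made-up event records and field names standing in for the real warehouse query:

```python
# Sketch: count contributions (code and non-code) made to our dependencies
# in the past 365 days. Event data and field names are illustrative.
from datetime import date, timedelta

dependencies = {"nodejs/node", "sindresorhus/got"}

# A contribution is more than pushed code: issues, PRs, and reviews count too.
CONTRIBUTION_TYPES = {"commit", "issue", "pull_request", "review"}

today = date.today()
events = [
    {"user": "mona", "repo": "nodejs/node", "type": "commit",
     "date": today - timedelta(days=10)},
    {"user": "mona", "repo": "nodejs/node", "type": "review",
     "date": today - timedelta(days=400)},   # outside the 365-day window
    {"user": "hubot", "repo": "mona/dotfiles", "type": "commit",
     "date": today - timedelta(days=5)},     # personal project, excluded
    {"user": "hubot", "repo": "sindresorhus/got", "type": "issue",
     "date": today - timedelta(days=30)},
]

cutoff = today - timedelta(days=365)  # the chosen timeframe

contributions = [
    e for e in events
    if e["repo"] in dependencies
    and e["type"] in CONTRIBUTION_TYPES
    and e["date"] >= cutoff
]

print(len(contributions))  # only the two work-related, in-window events remain
```

Restricting the scope to dependency repos is what keeps personal projects out of the count by construction, rather than by trying to classify each repo individually.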
Think about the timeframe you're interested in for contributions and how you may want to measure it. This led us to create the contribution dashboard for the OSPO. We organized it in a way that gives us an overview of contribution activity, whether that was a code or non-code activity type. We're able to map back to teams and see which teams are doing the most contributions, or the least, and we're able to see a trend, so over time we can see if there are any significant changes that we may want to address. We also created a leaderboard, which we're really excited about, to help showcase the individuals at the company doing the most contributions to open source, and we hope over time to build a recognition program around their contributions.

The third and last vertical is the open source we release. The questions we were interested in asking and addressing were around how healthy our projects are. We've created open source for many years, and we know that some of our projects might be stale or abandoned. We wanted to better understand this. We also wanted to make sure our projects were welcoming and inclusive, and to know whether or not they had core community health files like a code of conduct, a contributing guide, and a README.

Before answering these questions, we needed to define what an open source project is and what an open source maintainer is. We define an open source project as a public repository with a license file that is not archived and is not a GitHub Pages site. To determine whether a project is abandoned or not, we wanted to see if there were any employee maintainers left on it. We needed a standard definition for a maintainer, and we went with a user who has write, maintain, or triage access and who had at least one contribution in the past 28 days. The SQL is an example of how we translated our OSPO definition of an open source repository into a data definition to further measure and investigate.
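That maintainer definition (write, maintain, or triage access, plus at least one contribution in the past 28 days) translates directly into a predicate. A minimal sketch, with hypothetical inputs rather than real access data:

```python
# Sketch: is this user an active maintainer of a repo? The definition used:
# write/maintain/triage access AND at least one contribution in the past
# 28 days. Role names and inputs here are illustrative placeholders.
from datetime import date, timedelta
from typing import Optional

MAINTAINER_ROLES = {"write", "maintain", "triage"}

def is_active_maintainer(access_role: str,
                         last_contribution: Optional[date],
                         today: Optional[date] = None) -> bool:
    today = today or date.today()
    if access_role not in MAINTAINER_ROLES:
        return False           # read-only access doesn't count
    if last_contribution is None:
        return False           # never contributed
    return (today - last_contribution).days <= 28

print(is_active_maintainer("maintain", date.today() - timedelta(days=7)))   # True
print(is_active_maintainer("triage", date.today() - timedelta(days=60)))    # False: inactive
print(is_active_maintainer("read", date.today()))                           # False: no write access
```

A project with no user satisfying this predicate is the abandonment signal used later for cleanup.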
We created logic for a public repo indicator, which is set to true when the repo is public, is not archived, has a license, and is not a GitHub Pages site. This is our release dashboard, where we can see data for all of our managed repositories. This is a snapshot looking at the GitHub organization, and we can see whether our repositories have those core community health files like a contributing guide, a code of conduct, and a README. Ideally, they're all at 100%. Where they're not, we can drill in and see which ones need those files added. We're also able to look at the last contribution date and whether the contributions are being made by employees or by external community members.

We've already started to use this dashboard to do some cleanup. We were able to identify some projects that were reaching that stale and abandoned point. They hadn't had commits for two years or more, and they had no maintainers, so we decided to archive them.

We've learned a lot from creating these reports and dashboards. It's not always easy to get the data you need; even at GitHub, where we have our own data warehouse, we had to continue to refine our definitions and timeframes. But now we have a great starting point and a baseline on the state of open source at GitHub. Being data driven is important for any OSPO, not only the GitHub OSPO. With data, you can begin to make decisions around open source: the open source you use, contribute to, and release. You can build your inventory and know what's in your supply chain. You can encourage contributions back to key projects you depend on, and you can measure the health and sustainability of the projects you create and manage. And we're just getting started. On our roadmap, the next steps we want to take are around addressing data quality issues, centralizing metric definitions, adding innersource metrics, and improving latency.

If you're interested in building your own data-driven OSPO, we recommend you follow this framework.
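The indicator logic and the cleanup rule above can both be sketched as simple predicates over repo metadata. The field names here are hypothetical stand-ins for the warehouse columns, not an actual schema:

```python
# Sketch: release-dashboard logic. A repo counts as an open source project
# when it is public, not archived, has a license, and is not a GitHub Pages
# site; it is an archive candidate when it has had no commits for two years
# or more and has no maintainers left. Field names are illustrative.
from datetime import date, timedelta
from typing import Optional

def is_open_source_project(repo: dict) -> bool:
    return (
        repo["is_public"]
        and not repo["is_archived"]
        and repo["license"] is not None
        and not repo["is_github_pages"]
    )

def is_archive_candidate(repo: dict, today: Optional[date] = None) -> bool:
    today = today or date.today()
    stale = (today - repo["last_commit_date"]).days >= 365 * 2
    return stale and repo["maintainer_count"] == 0

repo = {
    "is_public": True, "is_archived": False, "license": "MIT",
    "is_github_pages": False,
    "last_commit_date": date.today() - timedelta(days=365 * 3),
    "maintainer_count": 0,
}
print(is_open_source_project(repo), is_archive_candidate(repo))  # True True
```

Keeping the two checks separate mirrors how the dashboard is used: the first defines the population of projects to report on, and the second flags the subset that is ready for cleanup.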
First, ask your OSPO questions. Next, define data questions and measurements. Explore the data and validate it, over and over. And finally, create and share your dashboards. You can have your own state of open source, too. Thank you for having us present at OSPOCon. Enjoy your time, everyone.