Hi, this is your host, Swapnil Bhartiya, and today we have with us Frank Nagle, assistant professor at Harvard Business School. Today we are going to talk about the Census II report on free and open source software application libraries. Frank, it's great to have you on the show.
Thanks for having me. It's great to be here.
What is the goal behind this census, behind the report? What kind of insight are you trying to gain through it?
Sure. We're trying to speak to three different audiences with this report. The overall goal is to better understand how widespread the use of open source is and, in particular, which packages are most widely used. The first audience is high-level organizations like the Open Source Security Foundation, or even government entities, that are looking to invest and help ensure the health, wellbeing, and security of open source. We're trying to give them some insight into where they should start. We all know some of the big projects that are widely used, but these application libraries are actually super important, so we want to give some sense of which ones we should be thinking about investing in if we're trying to enhance the whole ecosystem. The second group we're trying to speak to is individual companies themselves. We're trying to help companies understand whether they use open source that is on this list and is very widely used. If so, they may want to think more about giving back and contributing code, cash, or training to the individuals who actually maintain those projects, because they're so widely used. If some vulnerability were discovered in them, attackers would probably try to go after them pretty quickly, much like what we saw with Log4j, not necessarily because of exactly what the software itself is, but because of how widespread its use is.
And then the third group that we're hoping to speak to is the maintainers of and contributors to these packages. These small packages have become an important stepping stone and building block for the modern economy, yet some of the maintainers and developers don't actually realize how widely used their open source is. Our hope is that we can shed some light for that group on how widely used their projects are, and on how they may want to think about allowing companies or organizations to help them out in these various ways, since the package they've written, maintain, or contribute to is so widely used and important in the ecosystem.
What processes did you use to gain these insights? When you talk about government and other stakeholders, they have different accessibility, and different levels of information that is available or that they can share.
Yeah, so one of the things we decided to do early on was partner with a handful of software composition analysis (SCA) companies. Rather than our having to go talk to every single company individually, these SCA companies are brought in by clients to analyze what open source is in their systems. This can be for a variety of reasons: to think about what vulnerabilities they may have, what potential licensing issues they may be exposed to, or things like that. By working with just a handful of these SCA companies, we were able to see the open source usage of thousands of individual companies, and what they use and bake into the software they build, rather than looking at the high-level layer of individual applications, the things we usually think about, like open source browsers or open source email clients.
Instead, we focused on the middle layer, the application library layer, where we can think about what's actually getting baked in behind the scenes, which is even harder to count than some of the more front-end-facing things. So that was our hope in working with these SCAs: we could get insights into thousands of companies by getting the data from only a handful of companies.
Excellent. Thanks for explaining that. Now, let's talk about what you learned. What were the findings when you talked to these stakeholders? I call them stakeholders.
No, absolutely they are. I think that's one of the things about the open source ecosystem: there are many stakeholders and they play many different roles. There are the users, the developers, the maintainers, and the repositories, and all of them play a different role. I think the hope is that we can all come together and agree that open source is rather important, and that we should be thinking about how we can support this decentralized code in a way that is fairly decentralized as well. So when we think about what we learned, obviously the most prominent findings were the top 500 lists that we produced. We actually ended up making eight top 500 lists, which I realize is a little weird and a little confusing, but that's because of the intricacies of the open source ecosystem. Rather than have just one list that was dominated in some incorrect way, or biased by the way an individual language is used or things like that, we split up the list a little bit to give even more insight into what packages are being used and how they're being used. So that's the primary finding: these eight top 500 lists of open source projects. And then we also had five higher-level findings that look at things at a much broader level.
These are related to things like the fact that there are very few common naming schemes for open source. Even when we were trying to match packages on the back end, it was quite difficult because there's more than one package named debug, right? So if you just have "debug" and a version number, you don't necessarily know which package that is. So we need to think about common naming and also common versioning, because we saw some issues when individual companies forked a piece of software, maintained a version of it internally, and had their own internal versioning system. That could cause some oddities, especially when we think about the efforts behind a software bill of materials, which rely on the individuals using a piece of software knowing exactly what software is baked into it. So some of these higher-level findings, I think, are particularly important. There are a few more in the report, but the last one I'll highlight is that for these most widely used projects, there's actually only a handful of developers supporting them. In one cut, we looked at the top 50 projects on one of these lists, and more than 80% of the code for all those projects was being contributed by only 130 developers, which sounds like a very small number to me when we're talking about 50 projects and most of the code that's going into them. So thinking more about how these ecosystems are supported, at a broad, high level, was one of the main things we wanted people to take away as well.
While you were doing the report, were there any signs that were concerning or worrying? We do hear that adoption of open source is growing. But one other thing is that a lot of companies and organizations still do not know how to engage with open source, how to be good citizens.
At the same time, supply chain security is becoming a big issue, but there is no awareness; at some level people don't even understand what the supply chain is. So is there anything you found where you thought, hey, this is a good lesson that can benefit any of the three stakeholders, whether developers, maintainers, or organizations like the Linux Foundation?
Absolutely, yes. I think one of the things that surprised us, though maybe it shouldn't have, was how widely used antiquated versions of packages were. I'll use Log4j as an example, because that is a piece of open source that's been on everybody's mind for the past few months, since the vulnerability came out in December and the White House meeting last month. Because we could see the individual versions of the packages that companies were using, what we saw with Log4j was that, by almost a three-to-one ratio, most of the companies were using a version in the 1.x series, whereas the vulnerabilities that created all the headlines over the past few months were actually in the 2.x series. What's interesting, though, is that the 1.x series reached end of life in 2015. So there haven't been any updates since 2015 to the code base that many of these companies are relying on. While they're lucky enough not to be affected by the recent big vulnerabilities, there are many other vulnerabilities discovered since 2015 that exist in their software and will never have a patch, because the developers have moved on to the 2.x series. So when we think about the types of insights we can gain, one of the big things companies should learn and pay attention to is that if they're using older versions of software, those versions may never be updated again because they've reached end of life, and therefore those vulnerabilities are never going to be fixed.
And so I think that's something we should all be aware of and think more about as we're developing what we hope to be secure software.
I do remember there was a time when a lot of critical projects were maintained by just one maintainer, and the Linux Foundation pitched in and helped so that these projects became more stable from a financial point of view. Are there any projects where you found, hey, these are very widely used projects, but they run a sustainability risk and may need some help? Was there anything like that?
Absolutely, yeah. Some of the projects we found hadn't had any updates to their code base in over a year or two, so they're certainly not being actively developed anymore. For some projects that are small, especially in the JavaScript ecosystem or other ecosystems where you have small individual packages that do only one function, that may not be a problem. But in some of the bigger, more complex pieces of code, if nobody has looked at that code or really updated it in years, then that could be a problem, right? Individual companies may be using the latest version, but that version may have vulnerabilities in it that nobody has gotten around to fixing. And so one of the other things we were trying to shine some light on here is security best practices for individual maintainers and projects in the open source ecosystem.
That means thinking about tools like the Linux Foundation and OpenSSF badging project, which allows individual projects to give some indication of the layers of security they're building not only into the code but also into their development process. It also means thinking about how we can make sure we're using and relying upon open source projects that at least have some layer of security, and some thought going into their security, rather than projects where security is not there at all.
So now this report is concluded, but open source software projects are not products. They are processes. There will always be new projects, and the whole ecosystem will change. So will there be a Census III? What does the future look like? Will you be involved with other projects? Can you talk about that, if you can?
Yeah, absolutely. Indeed, open source is evolving. This census is just a snapshot in time, and in particular it's also a snapshot of one layer in the overall software stack. Census I looked at the operating system layer. This was Census II, and we looked at the application library layer. We can certainly think about, and are planning in the future to look at, other layers: perhaps the application layer at the highest level, or even the cloud, with all the container spaces that are being built into the cloud and the way we use open source there. So our plan is to do additional efforts of this kind in the future, both at the application library layer and in these other layers within the software stack. And we'd be more than happy to have folks who have data they are willing to share, or who want to learn more, reach out to us, because we'd love to have even more data providers than we had for this particular census.
Frank, thank you so much for taking time out today.
And, of course, for talking about not only this report but also sharing some insights that help us understand how widely open source is being used and what role we should play as good citizens to make it more sustainable. And once again, thanks for the work that the Linux Foundation is doing to make it a very healthy and, of course, growing ecosystem. So thanks for that and for those insights. I would love to have you back on the show because, as you said, there are so many other projects that you will be involved with. So thank you.
Absolutely. Thanks so much for having me, and I look forward to sharing future efforts with you.