Okay, our next speaker is Jeff McAffer from GitHub, and he will be talking about building confidence to move around and secure it. Let's see how.
Hey, how are you doing? Can you hear me in the back? Sorry, my voice is a little bit rough. I was already checked out, but I think they're just recording it. I don't think it's amplified. So can you hear me? Okay, unless you can't answer that because you couldn't hear me. Anyway, so yes, I'm Jeff McAffer. I work at GitHub. I do a bunch of stuff in supply chain management and helping people secure their use of open source. And just, you know, the preceding talks have laid a lot of good groundwork for this. I'm just going to give you sort of my perspective on it. You know, we've all got these components, right? Everybody's got these Lego pieces, which are the greatest thing. They've got APIs you can plug into. You can connect them to build lots of great stuff. And some of these have great big ecosystems where everybody's around. There's lots of support for the component. There's lots of components in the ecosystem. And some, you know, like not so much. So you're not really sure what's going on in that ecosystem. Some are very well organized. You know, the registries are well maintained and organized. You can find the information about them, et cetera. And some not so much. Some ecosystems have tons of people around to help you understand what's going on with the package you're using or other packages. There's lots of information, et cetera. And some kind of leave everything up to yourself. You've got to figure it out, read the doc, and figure out what's going on. Some ecosystems are really safe. They're really well demarcated. Construction's happening here. This is a good place to be, all that sort of stuff. And others, you know, have unicorns, but they've actually got like Trojan people inside, right? So it's really unclear what's actually going on there.
The thing that all of these have in common for the most part is they allow you to build really cool things. They've got all these pieces you can build on top of what other folks have done, put together as really amazing things that, you know, that solve your problem or appeal to your scenarios. This is actually Lego. And I can't quite tell the scale, but it seems quite large. So poor understanding of what's going on in these ecosystems, whether it's because they're so vast or because they're not well managed, et cetera, can lead to a bunch of problems, right? Failure to comply with simple licensing obligations. A profound sense of actual or perceived insecurity, not knowing what vulnerabilities you're being exposed to. And for you, a general feeling of inadequacy and lack of confidence. If you don't get the trend that's read here, it's like the self-help guides. Anybody know Tony Robbins? You know, try to be the best you can be. So I developed this idea that, like, it's a North American thing, maybe Cosmopolitan magazine has these things you can read, you know, 10 questions that will help you with your dating life and help understand your, build self-confidence and everything. So I developed this theory that we could have this idea of a dependency management quotient, like your emotional IQ, it's your dependency management IQ. So this is actually a participatory thing. You all should probably have a piece of paper that was given out earlier for your survey and everything. So go along, there's a bunch of questions. I want you to keep track of your answers to these questions and you just have to count them. You don't have to, like, remember words or anything. So just count up what you, you know, what you answer, what it feels to you. So the first question, my management, my package manager won't tell me what it did last night and that bothers me. 
So that is one if you agree with that, if that resonates with you; otherwise just don't count it at all. I have trouble understanding the packages around me. So, okay? So again, count one if, you know. I often have a sense of deja vu in looking at packages. I feel crushed by the debt of the packages I use. You can relate this to, like, your family, your friends, whatever. When packages cough, I sneeze. But wait, there's more. Hang on, here we go. I depend on packages I don't know and I'm okay with that. I often don't know what I'm supposed to do. I'm powerless to change the packages around me. Are you slowly sinking in your seats and feeling, like, less confident about yourself? See, that's the problem, right? We're lacking this confidence in engaging in the package management ecosystem. I may be paranoid, but that doesn't mean my packages aren't vulnerable. And finally, I'm overrun by packages I didn't know I needed. Oh, wait, that's not finally... Sorry, there's one more. All right. So, if you scored five or higher on that, you may be suffering from package-knowledge-gap syndrome, or PKGS for short. So who had five or more? Come on, it's a safe space. This is, you know, code of conduct. Okay, great. So, folks were honest, that's cool, and self-aware. These are first steps, right? The treatment for this syndrome is regular use of tools, a strong dose of data, followed by progressive automation. So the first step is knowing, right? Knowing that you have a problem is the first thing. All right. So, this now is about knowing what you use. If you're in a normal package management ecosystem, you've got hundreds of dependencies. In the earlier talks, we heard research about how big the dependency graph was. Some stats that I saw, you know, working at GitHub, it's like 180 dependencies in your average ecosystem.
The top 50 packages are depended on by 3.6 million other packages. The scope of this graph is massive. And so, you need automated tools to help you understand this. There's a whole bunch of different tools, and we'll talk about those in a second. But, you know, you run your package manager. Who uses a package management ecosystem where the resolver does dynamic code execution at resolve time? Come on now, you've got to, right? So, a few honest people here, right? So this means that when you run, you know, whatever the install command is, it actually goes and runs code to figure out what packages it's going to install. This is really challenging, because now, how do you understand what you're actually going to be using until you're actually using it? So there's this problem we have with build time, no, sorry, package managers, not dumping the results of the resolution. They don't tell you what's going on in the system, so you can't really get a decent understanding and do a decent inventory. So, use detection tools, or lock files if you've got them, right? If your ecosystem has lock files, take control over your dependencies and maintain the inventory of what you're using in some sort of searchable system so that, now that you know you've got these thousands of dependencies, you can actually go and correlate that dependency information to vulnerability data, to licensing data and so forth. So there's a bunch of different tools available. Thomas in the front here mentioned the OSS Review Toolkit previously. Tools like ScanCode and FOSSology do a great job of discovering both components and licenses in the software that you're using, et cetera. And SW360 itself does a great job of inventorying and maintaining an inventory of the components you're using, and Antenna does a good job of detecting them.
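The lock-file inventory idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lock file text below is a hypothetical, trimmed-down example in the style of npm's `package-lock.json` (the `packages` map keyed by install path), the package versions are illustrative, and `build_inventory` is a made-up helper name, not part of any of the tools mentioned in the talk.

```python
import json

# Hypothetical, trimmed-down lock file in the style of npm's package-lock.json.
# Names, versions, and licenses here are illustrative only.
LOCKFILE_TEXT = """
{
  "name": "demo-app",
  "lockfileVersion": 2,
  "packages": {
    "": {"name": "demo-app", "version": "1.0.0"},
    "node_modules/eslint": {"version": "6.8.0", "license": "MIT"},
    "node_modules/lodash": {"version": "4.17.15", "license": "MIT"},
    "node_modules/eslint/node_modules/lodash": {"version": "4.17.11"}
  }
}
"""


def build_inventory(lockfile_text):
    """Return a sorted, de-duplicated list of (name, version, license) tuples."""
    data = json.loads(lockfile_text)
    inventory = set()
    for path, entry in data.get("packages", {}).items():
        if not path:
            continue  # the empty key is the root project itself
        # The package name is whatever follows the last "node_modules/" segment.
        name = path.split("node_modules/")[-1]
        inventory.add((name, entry["version"], entry.get("license", "UNKNOWN")))
    return sorted(inventory)


inventory = build_inventory(LOCKFILE_TEXT)
for name, version, license_id in inventory:
    print(f"{name}@{version} license={license_id}")
```

Note that the same package can appear at two different versions (here `lodash`), and that a missing license shows up explicitly as `UNKNOWN`. Both are the kind of facts a searchable inventory is meant to surface before correlating against vulnerability or licensing data.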
A bunch of commercial offerings I won't go into, but the key point here is get something that helps you understand what your dependencies are and manage them, okay? Yes, data. So this is, I'm going to go into this a little bit more. Knowing what you've got is a good thing to know but knowing about those components is key, right? So compliance information. What is the license of the component I'm using? Who are the copyright holders? Because when I need to comply with my license, there's a good chance I need to say who produced the software. And the source location. This one's actually kind of funny. I did a little survey of data in ClearlyDefined, which I'll get to in a second. It took around 200,000 packages that are in active use around the ecosystem and looked at the packages to see whether, given a package, you could find the actual source that went into that package, right? Because packages are binaries in general. And it turns out that 42% of the time, you cannot go in any reasonable way from a package version to the commit that went into that package. So, given a package, you can't find what source went into making that package. So how can you do deeper vulnerability assessment? If you had a source code disclosure requirement by license, how can you do the source code disclosure? You can't. So this challenge of maintaining compliance information is something that we took to heart a couple years ago and started a project called ClearlyDefined. And I mentioned that one here. And I'm just going to pop over and show you ClearlyDefined for a second. You see that? It's probably not big enough for the back. Let me bump that up. How's that? Can you see in the back? Yes, no? Okay, one more? Let's try one more. How's that? Okay, well, that's as big as it's going to go, I think. So the idea behind ClearlyDefined is there's all these packages out there that have poorly formed metadata.
They don't have compliance information that's associated with them. And we have to chase that ball. There are things we can do to get ahead, and I'll talk about that in a second. But in general, we have to chase that ball of figuring out, given a package, what the heck is in there, who produced it, what's the license. So we made this system that basically took a bunch of open source tools like FOSSology, ScanCode, and others, and ran those tools over the code for the package or the package itself. So we go out to the registry, get the package, run these tools, try to figure out where the source code is. If we can find the source code, we go and get the source, run these tools over it, and analyze for what's the license, who are the copyright holders, where's the source located, what's the actual revision and commit, what was the release date, a bunch of other stuff. We put that all in a big database. And then we surface that to you here. So you can just go to ClearlyDefined and type in the name of a component. What should we do? We'll do lodash just for fun, because I know that's there. And, you know, here's lodash. Oh, I shouldn't have done a live demo apparently. All right, well, let's just reload. I should have just picked one that was already there. Dun, dun, dun, dun. All right, so ESLint, great, that's a good one. So, you know, we can pop into ESLint here and we can see that its declared license is MIT. We know where the source code is. It was released today, maybe yesterday. You know, we can see who the copyright holders are and also look at all the different files and see which files had licenses and which ones didn't. Now, it turns out that in many cases this data is incorrect. It's either missing or it's actually incorrect. And so what we did on top of ClearlyDefined, the raw harvesting of data, is we allow you to go in and fix it.
So as a community now, you can be discovering that, hey, this component that I'm using has a missing license, missing source location, whatever. And you can go in and basically change it. And I don't know, I'm just going to pick a random other one. You change it here and then you can go and say contribute. And when you say contribute, what's going to happen? You fill out the information. I'm not going to do it because this is live. But it actually goes and opens a pull request in GitHub on the ClearlyDefined project itself. And there's a community of curators, people who care about this stuff for whatever reason. Maybe they're paid to do it. Maybe they just love doing it, whatever. And they go and they curate your contribution. They say, oh, is that MIT? While we're here, we're proposing that the new license is MIT-CMU. Is it really that? Why? What's the justification? Is that reasonable? And so when they curate it, they eventually merge it like any other pull request. And that changed data now folds back into the database. So now everybody in the community benefits from the increased fidelity of the data. So we've done this for a bunch of legal and license compliance data. We're also looking, going forward, at doing this for security-related data. So one of the challenges that we've heard around security is the mapping between a package identity, so whatever this one was, ESLint, something or other, and the CVE identities, the CPEs. So who knows what a CPE is? You heard that? And you're still happy, smiling? So CPEs are the identifier that's used to identify a component in a CVE, a vulnerability. And the CPE definition is relatively obscure and doesn't always map in any intelligible way directly to the packages that you're actually using. So one of the things that we can do here, and ClearlyDefined fits in that scope, is define and help the community crowdsource a mapping between package identities and CPEs.
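To make the package-to-CPE mapping idea concrete, here is a minimal sketch. It is not how ClearlyDefined stores or serves such data; the mapping table, the vendor and product values, and the `to_cpe` helper are all hypothetical. The only thing taken as given is the general shape of a CPE 2.3 formatted string (`cpe:2.3:part:vendor:product:version:` followed by seven further attributes).

```python
# Hypothetical crowdsourced mapping from a package identity (type, name)
# to the vendor and product fields used in CPE names.
# The vendor/product values are invented for illustration.
CPE_MAPPING = {
    ("npm", "foo"): ("example_vendor", "foo"),
}


def to_cpe(pkg_type, name, version, mapping=CPE_MAPPING):
    """Return a CPE 2.3 formatted string for a package, or None if unmapped."""
    entry = mapping.get((pkg_type, name))
    if entry is None:
        return None  # nobody has contributed a mapping for this package yet
    vendor, product = entry
    # part "a" means application; the remaining seven attributes are wildcards.
    return f"cpe:2.3:a:{vendor}:{product}:{version}:*:*:*:*:*:*:*"


print(to_cpe("npm", "foo", "1.0.0"))
print(to_cpe("npm", "unmapped-package", "1.0.0"))
```

The `None` case is the interesting one: without a shared mapping, a package in your inventory simply cannot be looked up by CPE, which is the gap the crowdsourcing is meant to close.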
That will help everybody now understand, I'm using package foo, what's the version, what are the vulnerabilities for it? So that's the whole idea behind ClearlyDefined. I think it's one of the things that, if you're in an organization that needs to do compliance in any way, you can start using. You can use that data, you can contribute to that data, integrate it into your engineering workflows, etc. So I'm gonna go back to the slides here. Okay, so that was ClearlyDefined. Now that's all chasing the ball, right? And it's really annoying because people keep putting out ill-formed packages, badly constructed repositories. So to get ahead of the ball, who knows about REUSE? reuse.software? Oh, you should know about reuse.software. So the idea here is really super simple. There's a way for you, as a software developer, an open source practitioner, to put forward the needed information, the licenses and copyrights and whatever, in a really simple way, following just very simple guidelines and being helped by tools that keep you on the right track. So reuse.software defines a syntax, a way of putting the data in there, and they have linting tools that keep your repo up to date. So you'd end up with this being like a PR check on GitHub or something that says, like, hey, you just added this new file, but you forgot to put in a copyright or a license statement. And these sound like kind of annoying things because you're just a developer and you want to do this, but it turns out that if you don't have a license on your software, people aren't really supposed to use it because they don't have a license for your software. So it's kind of important. So I suggest you go take a look at reuse.software. So you've got an inventory of all the code you're using, the packages you're using, but now you need to know about the vulnerabilities in there.
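The kind of PR check described for REUSE can be approximated with a very small sketch. REUSE asks for copyright and license information on every file, typically as `SPDX-FileCopyrightText` and `SPDX-License-Identifier` tags in a comment header. The function below only looks for those two tags in a file's text; the actual REUSE tooling does considerably more, and `missing_reuse_tags` and the sample file contents are hypothetical names and data for illustration.

```python
REQUIRED_TAGS = ("SPDX-FileCopyrightText:", "SPDX-License-Identifier:")


def missing_reuse_tags(file_text):
    """Return the required SPDX tags that do not appear in the file text."""
    return [tag for tag in REQUIRED_TAGS if tag not in file_text]


# Hypothetical file contents: one with a complete header, one with none.
GOOD_FILE = (
    "# SPDX-FileCopyrightText: 2020 Example Author\n"
    "# SPDX-License-Identifier: MIT\n"
    "print('hello')\n"
)
BAD_FILE = "print('hello')\n"

for label, text in (("good.py", GOOD_FILE), ("bad.py", BAD_FILE)):
    missing = missing_reuse_tags(text)
    status = "ok" if not missing else "missing " + ", ".join(missing)
    print(f"{label}: {status}")
```

Run over the files changed in a pull request, a check like this is what would produce the "you forgot to put in a copyright or a license statement" message.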
Now you can go and look at the NVD, the actual National Vulnerability Database in the U.S., but it's kind of cryptic and relies on CPEs, et cetera. Fairly hard to use, but okay, there's the GitHub Security Advisories Database, and this is actually a superset of all the CVEs that are in the NVD, lots of acronyms. Plus all the advisories and warnings that maintainers have produced on GitHub. So since maybe six months ago, we started the maintainer security advisories program. So if your project's on GitHub and somebody tells you that you've got a vulnerability, they say like, hey, there's a problem in this file. You can create what's called a maintainer security advisory. And that gives you a private place to work inside of GitHub. So you're working on GitHub still, but it's in a private fork, effectively. You can go and work with your security team and your other developers to figure out what the vulnerability is and to actually develop a fix. You can have conversation and everything. And then when you've got a fix, you can publish now a pull request that fixes it and have it create a CVE into the GitHub advisories database, which is available in a browsable form here at that URL, but it's also got an API that you can call. So now you can have your engineering system look at, hey, I'm using these packages. Let's go off to GitHub and see if there are any vulnerabilities, like that kind of thing. And then there's a ton of domain and ecosystem-specific vulnerability databases. The other aspect of using packages that doesn't get a lot of, I guess in the sustainability discussion we had some of that in the earlier talk, but community health. So when you're taking a dependency on a package, you're actually taking dependency on a bunch of people. So who remembers, I guess it might have been about a year ago or something that a black hole picture came out, right, that people had developed the first picture of a black hole. 
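Having an engineering system check its packages against an advisory source can be sketched as below. This does not call the GitHub API; the advisory records are hypothetical stand-ins with invented IDs, and each one is simplified to "every version below `first_patched` is affected". Real advisories use richer version ranges, and the naive `parse_version` here only handles plain dotted numeric versions.

```python
def parse_version(text):
    """Parse a plain dotted numeric version like '1.2.3' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Hypothetical advisory records (IDs and packages are invented).
ADVISORIES = [
    {"id": "EXAMPLE-0001", "package": "foo", "first_patched": "1.2.3"},
    {"id": "EXAMPLE-0002", "package": "bar", "first_patched": "2.0.0"},
]


def find_vulnerable(inventory, advisories):
    """Return (name, version, advisory_id) for each inventory entry still affected."""
    findings = []
    for name, version in inventory:
        for advisory in advisories:
            if advisory["package"] != name:
                continue
            if parse_version(version) < parse_version(advisory["first_patched"]):
                findings.append((name, version, advisory["id"]))
    return findings


# Hypothetical inventory of (package name, version) pairs.
INVENTORY = [("foo", "1.2.0"), ("foo", "1.2.3"), ("bar", "2.1.0"), ("baz", "0.1.0")]
findings = find_vulnerable(INVENTORY, ADVISORIES)
for name, version, advisory_id in findings:
    print(f"{name}@{version} is affected by {advisory_id}")
```

The important design point is the input: this only works if the inventory from the earlier step exists and uses the same package names as the advisory source.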
Turns out that was all developed basically with software. I mean, obviously, there's a whole mass of data processing that went into doing that. And that was all done in Python based on a whole set of Python packages. And it turns out when you do the analysis of the number of people who contributed to the packages that were used to produce the black hole picture, it was like 20,000. So it was like 20,000 people wrote software that went into producing that image. So there's a ton of people in open source. Open source is all about people and communities, but are they healthy communities? When you're taking a dependency on a package, you're taking a dependency on the set of people who wrote that. So understanding that is a key characteristic of engaging with these packages. So CHAOSS is the Community Health Analytics Open Source Software project. Thank you. It's a project that's developed a bunch of metrics to help you understand what's going on in the communities, what's the bus factor in this community, etc. So there's a great set of metrics that you can use from CHAOSS. So putting this all together essentially, given the scale of what's going on, you have to automate. Even if you're a relatively small open source project, the set of dependencies you're taking on is going to overwhelm your ability to manage that manually. So automating makes everything much simpler, repeatable, more consistent, makes it auditable, etc. And one of the key points, I think, for automation is automating decisions and policies. So some of the work we're looking at going forward is how do we make it so you can automate the choices that happen when you're taking a dependency on a component. Your system automatically detects that and then tells you like, hey, you're taking a dependency on a component that hasn't been updated in a year and all of the developers have gone away.
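An automated policy of that kind could look something like the following sketch. Everything in it is an assumption made for illustration: the `ComponentHealth` record, its fields, the one-year threshold as a default, and the sample data. Real community health metrics, such as those defined by CHAOSS, are considerably richer.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ComponentHealth:
    """Hypothetical health record for one dependency."""
    name: str
    last_release: date
    active_maintainers: int


def policy_warnings(component, today, max_age_days=365):
    """Return human-readable warnings for a component that violates the policy."""
    warnings = []
    if today - component.last_release > timedelta(days=max_age_days):
        warnings.append(f"{component.name}: no release in over {max_age_days} days")
    if component.active_maintainers == 0:
        warnings.append(f"{component.name}: no active maintainers")
    return warnings


# Hypothetical data, evaluated against a fixed date so the result is repeatable.
TODAY = date(2020, 2, 1)
stale = ComponentHealth("foo", last_release=date(2018, 6, 1), active_maintainers=0)
fresh = ComponentHealth("bar", last_release=date(2020, 1, 15), active_maintainers=4)

for component in (stale, fresh):
    for warning in policy_warnings(component, TODAY):
        print(warning)
```

Running a check like this at the moment a dependency is added is what moves the discovery up front, rather than leaving it until the stale component shows up as a security problem.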
That's probably an interesting thing for you to know, and knowing that upfront is really good, as opposed to leaving it to later when you find it as a security vulnerability. So by automating that level of functionality, you can then take all your human time to pay more attention to: does this component actually fit architecturally, is this the right choice for my business or organization or my project, etc. So the prognosis then is that if you start taking some of these steps, your symptoms of package-knowledge-gap syndrome should start clearing up. It takes a while. It might be three to six months until you actually start seeing some improvements, but it will happen. And that's it. I'll take some questions.
Yeah. Right. The paraphrase is like, so you've discovered a connection between a binary and a source. How can you trust that connection is accurate? So that's an awesome question. And in some ecosystems, they've done a really, really good job of this. Like Debian and the reproducible builds is awesome, because you know there's a systematic way of linking back from the build to the source, that these are connected. Most other ecosystems don't have that, and that's a shame. And then there are ecosystems like Node and NPM. So when you publish an NPM package, the actual tools will tag your Git repo with the version number of the package. Now that's good, but it's not guaranteed, because you can package something that wasn't actually the same source, and it's a Git tag, so Git tags can move. So there's lots of challenges with that, but it's way better than most other ecosystems that don't have anything at all that talks about source location. So the ideal here is being able to build your package, being able to have reproducible builds. Not that you have to build them, but that there is that connection between the binary package and the source that's trusted.
But that's going to be one of these incremental things that we have to get better at as a community. Is there anybody here from a package management community itself, like NPM? Oh, good. So one of the requests I have for you is, we had a bunch of people together in a discussion and it turned out that, and you might be the exception so don't take umbrage with this, but it turned out that a lot of the package management community folks didn't understand a lot of the requirements here. They didn't really know about compliance requirements and the ability to get back to source, the need for a license. There's some communities we discovered where, even though the license says that when you distribute this package you must include the license, they didn't actually have any tooling that forced you to have a license in your package. So if you're not doing that, you're de facto making your users, your community, non-compliant. That's actually a challenge. So helping nudge your communities into best practices that help you be more compliant, more secure would be awesome. It's a request I have for you and I'd be more than happy to engage with how we can help you at GitHub to make that easier. We've got some other folks here from GitHub who can help with that as well.
How many additional dependencies do I take on by using the tools that you suggest? It is an interesting question. Build time dependencies versus run time dependencies too. So some of those tools have a massive number of dependencies because they're doing all sorts of things. There's a challenge. Those are build time dependencies. There's still challenges with those because they can inject stuff, etc., and change, etc. But the alternative is to not use any tools at all and do it all by hand, or not do it at all, which would be even worse.
Do you have a proposal for I was just asking. Stir in the pie. So how do you identify things from how to find that I thought CPE is a good idea but what do I use?
So component identity is actually a real challenge, right? There's syntactic ways there's the package URL specification that just gives you a syntactic way of doing it. But it's an interesting question. How many people here, if I just said I had an NPM called foo version 1 would know what I'm talking about? Oh, okay, I know which bytes that is. The problem is that you can get an NPM off of lots of different places. You can get it off of NPMJS.com you can get it off of GitHub you can get it off your local internal repo wherever. So identity for some people identity includes the repository from which it came the registry and for others they assume that foo version 1 is the same source of where they got it from. So actual identity, package identity is actually a very big challenge that most ecosystems that I know haven't really addressed. I would love to be told how to do it, right? Please, it's really a serious problem. Thanks a lot to you. Oh, jeez. Sorry, I wandered away with it. Thank you kindly.
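The identity problem raised in that last answer can be illustrated with the package URL (purl) syntax mentioned there. The sketch below builds purl strings of the form `pkg:type/name@version`, optionally with the `repository_url` qualifier. It is a simplified illustration only: `make_purl` is a hypothetical helper, the registry host is invented, and namespaces (such as npm scopes) and the full encoding and normalization rules of the specification are not handled.

```python
from urllib.parse import quote


def make_purl(pkg_type, name, version, repository_url=None):
    """Build a simplified package URL for a package without a namespace."""
    purl = f"pkg:{pkg_type}/{quote(name, safe='')}@{quote(version, safe='')}"
    if repository_url:
        # The qualifier records which registry the package came from.
        purl += f"?repository_url={quote(repository_url, safe='/')}"
    return purl


# "foo version 1" with no registry stated, and from a hypothetical internal registry.
default_foo = make_purl("npm", "foo", "1.0.0")
internal_foo = make_purl("npm", "foo", "1.0.0", "registry.example.com/npm")

print(default_foo)
print(internal_foo)
print("same identity?", default_foo == internal_foo)
```

The two strings differ even though both describe "foo version 1", which is the point made in the answer: whether the registry is part of a package's identity is a choice, and the syntax alone does not settle whether the bytes behind the two names are the same.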