example of a vulnerability in a library called Log4j. You may have heard of this library. So the free-form English text of this description is lacking a lot of crucial information for us to do this mapping. For instance, it doesn't mention that this is even a Java package, much less exactly which Java package this is talking about. Ideally, we need this kind of full package identifier that identifies the package in Maven, which is ultimately where most people are importing this package from. And there's another example here for Jinja2 in PyPI. The description is literally just one sentence: "In Pallets Jinja before 2.10.1," blah, blah, blah. There's no kind of easy mapping to the name Jinja2 in the PyPI package repository. There are also, attached to most CVEs, things called CPEs, and these are added by the NVD in their database. And this is intended to allow automation when it comes to mapping CVEs to people's software. And this is coming from a globally, centrally managed database of identifiers. The problem here, unfortunately, is that there is not a very good mapping from these CPE identifiers to the actual identifiers that people use in their manifest files, for example. And in general, these CPEs aren't very predictable and don't follow a very consistent pattern. So it's also very hard to generate them automatically. So what ends up happening is most people who care about the security of their dependencies have to either manually map their dependencies to CPEs, or rely on rather error-prone heuristics, both of which are not ideal. And there are actually a bunch of other problems as well, unfortunately. We've also noticed in many cases that CPEs are added several days after CVEs are first published. So there's a latency in terms of when people are first notified about CVEs via this mechanism. And this is obviously not ideal because it widens the window of exploitation for people depending on the affected package.
So lastly, CVEs also have this JSON schema. So I have here an example of the CVE JSON 4.0 schema. And unfortunately, it was also not very easy to automate on. There's a lot of free-form text involved as well. So for example here, Log4j is just referred to as Apache, space, log4j, space, 2. Oops, I'm having the same problem as the folks from yesterday. Yeah. And also, the way version constraints are specified in the schema is not very well defined. So there aren't very clear-cut rules for how you actually apply logical operators and how you can combine different constraints. So there's a lot of ambiguity here that kind of prevents us from using this as an automatable mechanism. Okay. Cool. I'm back for problem number three. I'm the problem person in this talk. So: many vulnerability databases. So we learned that we have these blurbs on a typical advisory. And it's like a paragraph, and it's not so machine readable. There are a bunch of databases that are adding on top of that. So they're taking that data in. They're trying to add more metadata. They're enriching it. Maybe they're making the description better. One thing we're doing at GitHub is we're adding vulnerable (affected) functions. So saying, hey, don't worry about this unless you're using this function. So lots of databases are trying to add to this data and make it something better. And none of these people are talking to each other. So you might get some data enrichment from one platform and different data enrichment from another platform. And they're all speaking a different language because they're all using a different format for their advisories. So this is actually a screenshot from... have you heard of Dependabot? Anyone? Okay. So this is Dependabot before it was acquired by GitHub; this is a screenshot of the code base. And you can see these are all parsers. So they're trying to aggregate from all these different sources, like RustSec and PyPI, to make this product.
And there was not a good way to do that. So they had to write all these different parsers to bring that into one platform and use it in one tool. So that's another huge problem. So OSV is a solution to many of the problems that we just described. So what exactly is OSV? There are actually two parts to it. The first part is a vulnerability schema that we developed that allows us to encode vulnerability information in a way that's automatable and consistent. And the second is tooling and infrastructure that aggregates and indexes this data and makes it useful. Now, I know what everybody's thinking. Everybody loves to bring up this XKCD comic whenever someone creates a new standard. And often for very good reasons. But I promise we do have some very, very compelling reasons why we ended up creating our own. So the OSV schema is something we developed in collaboration with GitHub and many other open source ecosystems that have actually since adopted our schema. So what we wanted to build is a schema that's focused on open source and that's as minimal as possible. So thinking about what is the minimal amount of information we need to encode in a vulnerability advisory to make it useful and actionable. And we wanted to provide a mechanism to map advisories consistently to packages in open source. We wanted it to be easily used by both humans, so humans should be able to read and understand what's going on, as well as automation, so machines can operate on that advisory. And we also wanted it to generalize to almost all open source ecosystems, but not in a way that requires consumers of these advisories to understand the intricate version constraint rules of each particular ecosystem. And more ambitiously, we want to build an ecosystem of distributed vulnerability databases and workflows for open source around this format, as well as easy-to-use tools around this format that everybody can use.
And we really just couldn't find any existing format that fits the bill and satisfies everything here. So here's a quick example of what a GitHub security advisory looks like in the OSV schema, and this is for a Go package. As you can see, this is a very simple schema. It's fairly easy to parse as a human, and it's very easy to parse as a machine. So there's the usual basic metadata, such as the ID, some English text descriptions, timestamps, references. But most importantly, we have this affected field, which allows us to unambiguously refer to package names and package versions. So taking a closer look at this, this advisory is clearly referring to a Go package with a given module path. The OSV schema provides very clear definitions for every single ecosystem. And in all cases, the name that's specified for that ecosystem matches the native way to refer to that package in that ecosystem. So there is no need to do any kind of mapping. Now, the more complicated piece here is how we deal with versions: how to describe which ranges of versions are affected. And we wanted to build a way that's very simple to understand, as well as being expressive enough to encode all the different complicated cases there might be when it comes to, say, encoding which branches are affected and things like that. So these versions should match exactly the version numbers that are uploaded to package repositories, or they can also be Git commits. So what we came up with was to mark events on a kind of version timeline or version tree: where the vulnerability was introduced, which we mark here in red, as well as where the vulnerability was fixed, which we mark in green. And this generalizes well to both linear version numbers as well as Git trees.
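To make the introduced/fixed events concrete, here is a minimal sketch (not the official OSV matching algorithm) of how a consumer might evaluate such an event list; the naive dotted-integer version comparison is an illustrative assumption and ignores pre-releases and other real-world versioning edge cases:

```python
# A minimal sketch of evaluating an OSV-style "introduced"/"fixed" event
# list against a concrete version. The dotted-integer comparison below is
# an illustrative assumption, not the schema's real per-ecosystem ordering.

def parse_version(v):
    """Turn '2.10.1' into (2, 10, 1) so tuples compare numerically."""
    return tuple(int(part) for part in v.split("."))

def is_affected(version, events):
    """Walk events in ascending version order, tracking the affected state."""
    v = parse_version(version)
    affected = False
    for event in events:
        if "introduced" in event and v >= parse_version(event["introduced"]):
            affected = True
        if "fixed" in event and v >= parse_version(event["fixed"]):
            affected = False
    return affected

events = [{"introduced": "1.0.0"}, {"fixed": "1.4.2"}]
print(is_affected("1.2.0", events))  # True: after the introduction, before the fix
print(is_affected("1.4.2", events))  # False: the fix landed in this release
```

A production consumer would rely on the version comparison rules the OSV schema defines per ecosystem rather than this simplified ordering.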
And the way that we encode this data with the fixed information also makes it useful to consumers, because it tells people which versions they need to upgrade to, to fix the vulnerability, or what patches they need to apply. So commit-level metadata for vulnerabilities is actually quite an interesting thing that we want to explore. If we think about what goes into a security advisory, we can really streamline this into a process that works closer to a developer's commit and development workflows. So we can imagine a world in which a developer fixes a vulnerability by pushing a commit. Now, this commit could have a reproducible test case, or could have a unit test, and then we could have automation infrastructure come in, perform a bisection, and give us exactly the commit ranges that contain this vulnerability. And then from there, once we have these commit ranges, we can correlate them to the Git version tags in the repository, and ideally the versions that are uploaded to the package repository as well. And finally, for a description of the advisory, you can just take that from the commit message. So what we end up with is essentially a security advisory that's generated in a much more streamlined fashion, aided a lot by automation. We've also seen many times in the past, in many open source projects, that people push a lot of fix commits without requesting a CVE, so this is something that could help with that as well. In fact, we already have some real-world examples that start to follow this workflow in a very limited way. So on the left here, we have an example from the Global Security Database, which is run by the two Joshes in the front. So they receive a lot of Linux kernel vulnerabilities from Linux maintainers, actually. And this is done at a commit granularity. So every vulnerability has the commit that fixes the vulnerability, and in many cases, the commit that introduced the vulnerability as well.
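As a rough illustration of the commit-to-release-tag correlation step, plain `git tag --contains` already answers "which releases include this fix commit"; the tiny throwaway repository created below exists only to keep the sketch self-contained:

```python
import os
import subprocess
import tempfile

# Sketch: given the commit that fixed a vulnerability, `git tag --contains`
# lists every tag (release) whose history already includes that fix.

def tags_containing(repo, commit):
    """Return all tags in `repo` whose history contains `commit`."""
    out = subprocess.run(["git", "-C", repo, "tag", "--contains", commit],
                         capture_output=True, text=True, check=True)
    return out.stdout.split()

# Build a throwaway repo: one "buggy" commit, one "fix" commit, one tag.
repo = tempfile.mkdtemp()
env = {**os.environ,
       "GIT_AUTHOR_NAME": "demo", "GIT_AUTHOR_EMAIL": "demo@example.com",
       "GIT_COMMITTER_NAME": "demo", "GIT_COMMITTER_EMAIL": "demo@example.com"}

def git(*args):
    subprocess.run(["git", "-C", repo, *args],
                   check=True, capture_output=True, env=env)

git("init", "-q")
git("commit", "--allow-empty", "-q", "-m", "introduce bug")
git("commit", "--allow-empty", "-q", "-m", "fix bug")
fix_commit = subprocess.run(["git", "-C", repo, "rev-parse", "HEAD"],
                            capture_output=True, text=True,
                            check=True).stdout.strip()
git("tag", "v1.0.1")  # the release cut after the fix landed

print(tags_containing(repo, fix_commit))  # ['v1.0.1']
```

A real pipeline would of course run this against the project's actual repository and combine it with the bisected "introduced" commit to bound the affected range on both sides.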
And from there, what our automation can do is populate the version tags that correspond to that commit range. And similarly, for OSS-Fuzz, which is the fuzzing platform that the Google open source security team runs, we do something very similar. So with fuzzing, we have reproducible test cases, and that allows us to perform bisections on every vulnerability we find to figure out which commit introduced the bug and which commit fixed the bug. And from there, we also do the same kind of version analysis, and we find that in this example, in the mruby software, this vulnerability affects the 3.0.0 RC version. Now, what about adoption? We've actually made some pretty good progress. So Kate here from GitHub has helped get the GitHub security advisory database using the OSV format. And several other open source ecosystems, such as PyPI, Go, Rust, and the Global Security Database, have started using OSV as well. And one other thing we did was we collaborated with the people behind the CVE 5.0 schema, where we successfully suggested a number of changes to help improve the way packages and versions are specified in the schema. And in the future, this will allow better interoperability between the two different schemas. So the second piece of OSV is OSV.dev, which is the tooling and infrastructure that makes all of this data useful. OSV.dev is completely open source, and what it does is aggregate and index all the OSV-formatted databases that are out there. And it also provides some of the automation to do some of the commit-based vulnerability workflows that I just went through in the previous slides. So right now, this works for OSS-Fuzz, but we're looking to generalize this into a mechanism that works for everybody. So to use OSV, you can either check out the web UI at OSV.dev, or we provide a very simple API you can query. So we have three examples here.
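As a sketch, the three query shapes for the OSV.dev API (POST requests to https://api.osv.dev/v1/query) might be built like this; the field names follow the public API, and actually sending the request is left commented out so the example stays self-contained and offline:

```python
import json

# Sketch of the three query shapes accepted by the OSV.dev query API:
# by ecosystem + package name and version, by package URL, or by commit.

OSV_QUERY_URL = "https://api.osv.dev/v1/query"

def query_by_package(name, ecosystem, version):
    """Query by native package name within an ecosystem (e.g. PyPI)."""
    return {"package": {"name": name, "ecosystem": ecosystem},
            "version": version}

def query_by_purl(purl, version):
    """Query by package URL instead of ecosystem + name."""
    return {"package": {"purl": purl}, "version": version}

def query_by_commit(commit_hash):
    """Query by commit hash alone; hashes are unique enough on their own."""
    return {"commit": commit_hash}

payload = query_by_package("jinja2", "PyPI", "2.4.1")
print(json.dumps(payload, sort_keys=True))

# To actually send a query (requires network access):
# import urllib.request
# req = urllib.request.Request(
#     OSV_QUERY_URL, data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"})
# with urllib.request.urlopen(req) as resp:
#     vulns = json.load(resp).get("vulns", [])
```

The response is a JSON object whose `vulns` list contains the matching OSV entries.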
You can query by a package name and version, you can query by a package URL, or you can query just by a commit hash, because commit hashes should be fairly unique. So it's a completely open API, open source, no rate limiting whatsoever, and there's a batch version, so this is something that we really want everybody to use. And the second piece of this is that we've started to work on user tooling. So the first part of this is just a simple vulnerability scanner that's able to scan SBOMs, package manifests, container images in the future, anything out there that we want this to be able to scan. And this is just a starting point. With a common format, we can all collaborate to build tools to make this vulnerability management work easier for everybody. Back to Kate. Cool, let's talk about GitHub specifically. So why did we take on OSV? What was going on? What was going through our minds? Remember before when we were talking about the life cycle and how it ended at publish, and then all these other people in the crowd who I used as examples were like, no, I have other things I wanna add to this. So we decided we didn't like that. And we decided to launch a new feature called community contributions. So now if you're looking at any advisory listed on github.com slash advisories, in this bottom right corner here, you can click this link which says suggest improvements for this vulnerability. It opens up this form, you can change exactly the metadata that you wanna change. It's not a paragraph. It's not an essay. It's not a black box, because what it does is open up a pull request. And those pull requests are reviewed by our delightful curation team, who work very hard. They're sitting in the back here. And so you have experts who are reviewing that and making sure this is actually a valid contribution before they merge it and change that advisory. So you can rest assured with that whole process. Here's the thing.
To have this pull request thing happen and to have that out in the open, that required us to have a repository. And the repository had to be filled with all of these security advisories that would potentially be changed. So it's one file per security advisory that is filling up this whole repository. And once you're thinking about that, the question becomes, how are we gonna format the data in each of those files? Because this is going to be very public and used by a lot of people. So we wanna be intentional about that choice. And that turned into a conversation that was like, it's about the schema, but no, it's actually not about the schema. And so it grew into a much larger conversation. This was very early in my time at GitHub. Oh, here's another picture of all the files. And so we decided this was the moment to take a step back and define really what our vision was for this team. I'm a cheesy person, this is about to be cheesy. If you are emotionally lactose intolerant, maybe just check out for like 30 seconds, and then come back. So we imagine this future where no human is impacted by security advisories. And that's a big, hairy, audacious goal, and it may not actually be possible. But if there are organizations who can push that goal along, it's GitHub and Google. So we have a responsibility to our users in order to aim big and make sure that we're advancing that cause as much as we can. And really when you think about it, it's just three things that you need, just three. So the first is you have to have all of the information, all of it. The second is that it has to be 100% actionable. So whatever information you have, you have to be able to do something with it. And the third one is that you have to put it in people's hands. So it can't just be on the GitHub platform because we know some of you host your repositories outside of GitHub and that's okay, nobody's perfect. 
But we need to be able to make sure that you too can resolve the security problems that you have in your code. And so how does OSV fit into this? It kind of affects those second two there. So let's dive into those a little bit more. So if we want to have 100% actionable advisories, again, this is a dream, this is the future, this is not tomorrow, I see the curation team nodding in the back row like, not tomorrow. But this is the dream. So that means that we have to match all of those relevant packages to whether or not you're dependent on them. So it's one thing to say, oh, the OSS-Fuzz package has a vulnerability, and it's another thing to say, do I care? And we need a machine readable way for us to do that. Right now the way we're formatting this is with this matching system of the ecosystem name, so something like composer, and then the package name, which is the name of the actual package that you're harnessing. And we match those up through a package registry, so it has to be spelled correctly; please don't put typos in your draft advisories. And so this was important, and when we thought about the schema that we were gonna choose, it had to maintain that relationship and be machine readable so that we could continue to match up, here's the advisory, with whether or not you're dependent on it, without any humans getting involved. Next, in their hands. So it has to be really easy for private parties to build tooling on. There are a lot of forks of this big open source repo that we have. There are a lot of people who are already building tooling on this. I talked to one last week who was basically building his own Dependabot. Like, he harnessed all of the data and then was building something such that he could alert repositories on both GitHub and Bitbucket that they were vulnerable to something. So we needed something that was predictable. We needed something with variables that made sense. Our own internal schema had a lot of shorthand in it.
That was not something we could have published without sprucing it up so people could understand what was going on. And we needed something that was machine interoperable. So we did this whole big comparison. We talked about using our own schema. We talked about our own schema but in JSON instead of YAML. We talked about OSV but with YAML instead of JSON; we just decided that was the lawful evil option. Sorry, Oliver. So we talked about a lot of stuff, and ultimately what we landed on was that OSV checked all of the boxes for GitHub. So they use that same ecosystem and package name relationship that allows us to match up this advisory with whether you're dependent on it. It's also easy to build tooling on, for the same reasons that I just said. And finally, last but not least, it sort of builds towards this bigger-than-GitHub universe, right? Because we have a lot of parsers in our code. We're also pulling in all the same sources as OSV, and we are interpreting a lot of data, including some stuff outside of OSV. So we have to change a lot of data to make it make sense in OSV. It would be really gosh darn nice if the whole industry would just use OSV. And then we wouldn't have to maintain those parsers. And other people wouldn't have to maintain their parsers. We could all just talk to each other with perfect communication in a machine readable format, and how often does communication become perfect with no misinterpretation? Isn't that an amazing world that we all wanna build? Everyone's laughing at this point, that's great. So this is the future that we want to get to here. So let's go through that life cycle of an advisory once again. We talked about it before. This time there's gonna be more things on there. So first, Jonathan in the back discovers a security vulnerability. He alerts Trevor in the front, who starts drafting that security advisory. They get a CVE because they're responsible.
And then next up, automatically, if it's created as a GitHub repository draft advisory, it generates an OSV file for you. You don't have to do any work; we are doing that for you. That means it can be picked up by OSV.dev, which a whole bunch of other people have built tooling on. So not only are you broadcasting that on your GitHub repository, you're broadcasting it to everyone who ingests OSV. You're also broadcasting that to Dependabot and all of the other competitors to Dependabot who take that advisory and say, hey, I see this version number, I see this package name, I'm gonna automatically send an alert to everyone who is using that package. So then you're broadcasting way more than you would by just listing this or tweeting it out somewhere. The end user actually gets that fixed, because that security advisory is in their face; in terms of GitHub, it's in their GitHub repository: please fix this. And then last but not least, if more people come along, like Sam in the front row, who says, no, I have more information that I need to tell you about, she can commit, excuse me, Sam can commit a community contribution, and that goes right back to the draft process. It updates everything there. It generates a new OSV file. The whole thing repeats. All of that, like, machine learning stuff happens. Not machine learning, machine readability stuff happens. Everyone gets more information in the world to resolve the security advisories that we're pursuing. Back to you. Cool. So what's next for OSV? There's still a lot of work to do here. The first is still more vulnerability feeds. So we have pretty good coverage of most language ecosystems, largely thanks to Kate and GitHub here, but we're still missing things like Linux distributions, like Debian and others. What we wanna build is the most comprehensive, distributed vulnerability database of all vulnerabilities in open source, and make it easy for anybody to publish advisories using this format as well.
And we've started to work with, for example, Debian on getting OSV advisories for them as well. So essentially what we wanna do is build an ecosystem around the OSV schema. If everybody uses this format, as Kate mentioned, everybody benefits, because everybody can collaborate on the same tooling, everybody has the same parser, and everybody works on pretty much improving the situation for everybody else. There's also actually still a lot of work left here to deal with false positives. So we've made some good initial steps with matching packages to advisories just by looking at the package name and the package version. But this will undoubtedly result in a lot of false positives in some cases. So VEX is an awesome initiative to try to address this, but we need to figure out how this can fit into open source. So perhaps there is something we can do here with automation. In the world of open source, we have a lot of visibility into what goes into a security patch, for example. If every vulnerability had the fix commit, for example, we could potentially use automation to figure out which code paths need to be called in order for that vulnerability to actually affect the end user. And then from there, we can do some kind of source code analysis, and perhaps we can reduce a lot of false positives this way before reporting vulnerabilities to everybody. So this is something that we want to explore. I don't know how successful this will be. So if anyone wants to talk about this, I'm very happy to chat about that. And finally, you can try out OSV. So again, you can go to OSV.dev or try out the API. And you can also try the GitHub security advisory database. Yeah, so you can jump on to github.com slash advisories and submit a community contribution, as mentioned. Or you can fork our new open source repository at github.com slash github slash advisory-database. Or if you have feedback, you can tweet it right to @kcatlin; there's my Twitter profile name. And that's it.
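To illustrate the reachability idea mentioned a moment ago, here is a deliberately tiny sketch: before alerting, check whether the user's code ever calls the vulnerable function. A real system would need a whole-program call graph; the `vulnlib` module and the function names here are made up for illustration:

```python
import ast

# Toy reachability check: scan one Python source string's AST for a direct
# call to a given function name. A real analysis would resolve imports and
# build a call graph; this only catches calls in this one file.

def calls_function(source, func_name):
    """Return True if `source` contains a direct call to `func_name`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            f = node.func
            if isinstance(f, ast.Name) and f.id == func_name:
                return True  # bare call: func_name(...)
            if isinstance(f, ast.Attribute) and f.attr == func_name:
                return True  # attribute call: module.func_name(...)
    return False

# Hypothetical user code that imports the vulnerable library but never
# touches the vulnerable entry point.
user_code = "import vulnlib\nvulnlib.safe_helper()\n"
print(calls_function(user_code, "dangerous_parse"))  # False: no alert needed
print(calls_function(user_code, "safe_helper"))      # True
```

In this toy case, an advisory scoped to `dangerous_parse` would be suppressed for this user, which is exactly the kind of false-positive reduction described above.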
So if you have any questions or any thoughts, you can file an issue in these three repos here. The first one is for the OSV schema itself. The second one is for the OSV.dev tooling and infrastructure. And the third one is for the GitHub security advisory database. We also have a channel on the OpenSSF Slack. So if you go to slack.openssf.org and look for the osv_schema channel, we're there. And we also have a mailing list as well. And that's it. Thanks, everyone. Do you have any questions for us? Jonathan. Sorry, I couldn't hear you. Was the question why are we not using package URLs? We do support package URLs in the schema, actually. So there's a field that I haven't shown in our example. There is a package URL field you can fill out. There's someone in the back, I think, I'm sorry. Could you speak a bit louder? Or is there a microphone in that part? I can be on the mic. Yeah, thank you. My hair is not as good as it used to be, sorry. So I'm not sure if we understand the question. So we support package URLs, which is one mechanism for linking, or talking about which package and what version is affected. And we also have our own kind of table of ecosystem and package name that you can specify as well. And well, the reason I was mentioning URIs versus URLs is: a URL is like where something is at. It's a pointer towards something. And a URI is the thing that identifies it. So if you're targeting the URI, that means, like, my company never pulls directly from a URL for a given package. Because what we do is we always pull things into our own repositories, and then we deliver from there. And so if we say, well, this URL, that URL may actually end up changing when moving from one environment to another.
But if it's a URI, an identifier as opposed to a URL, then that actually gives us something that's more canonical, that works across different repositories, different groups, while still allowing us to identify the thing. So I was asking if there's a place that we can come in, possibly give some of this feedback, and help with collaboration there, if it's not modeled like that. Yeah, so when we say package URL, it's not actually like an HTTP URL to that package. So this is a pretty generalized way of referring to the package in the ecosystem. So this is saying: in the Maven repository, this package here, the struts2-core package at this version. Right, and in that one, pkg:maven is like, you're saying this is the Maven repository, but mine might be pkg:maven slash my company slash, and then the canonical name is there. So having that format actually ends up breaking, like, that prefix on there, if you're tying that into your scanner, ends up breaking my ability to, like, I have to work out, where did this package actually originally come from? And that's the feedback I'm trying to give: when you start to work with internal systems, it's not only about supporting the open source, always online, always pulling from external. Like, I have regulatory concerns that disallow me from pulling from there directly; I have to pull it internally, perform my scans, put it in my own repos, which means we're not gonna hit the same names. Okay, perfect. Okay, I hear you. Okay, cool. And that's what I was trying to get to. Yeah, yeah. And in general, I think, because most vulnerability databases wouldn't know about your internal package repository, I think in those cases, you might need to figure out what the canonical package ID for that is to actually make it useful. Sorry, does someone else have a question here? Justin? Yeah, I just wanted to ask, what do you all think the role of human curators is?
I think the question is, what is the role of human curators for these data sets going forward? Kate, do you wanna answer that? Yeah, speaking from our experience at GitHub: indispensable. At this point in time, there's no substitute for that human touch, making that description better, making sure that it's explained how you can actually fix this advisory. There's so much that goes into that. There are ways that we're trying to harness machine learning so that we can make that faster, like maybe take out the stuff that doesn't require as much human thought, but at this point in time, there's just no way to take humans out of this process. So we do have a question online. It's from Reinhardt, saying: I'm thrilled to see GitHub and others are making it easier to report vulnerabilities. At what point do we need to start thinking about the quality and accuracy of those reports? Okay, so is the question about reporting new vulnerabilities, or is it about just in general, those vulnerabilities that we're pulling in? The question is, at what point do we need to start thinking about the quality and accuracy of those reports? Oh, at all points. Today, yesterday, last week. I think that's really how GitHub started to develop their niche in this space: by saying we're gonna choose quality over quantity at first. And so that's why we started with human curators. We had a huge focus on that and making sure that it's high quality. There have been a number of times where we've gotten an email that says, why aren't you all reporting on this CVE? And we say, because we looked at it, we read it, and we said, this isn't accurate, or we're not gonna send out alerts on this, or this isn't high quality, or this isn't really a security issue. So there are a number of those that we're pulling down and not sending alerts on. That's very much a thing that is not just on our minds; we're doing things about it.
There's also a second part to the question and it is what about malicious actors slash competitors? I'm assuming they mean a malicious actor submitting an advisory that isn't actually true. Okay, cool, gonna run with that. So yeah, that could potentially happen. And again, I would say that's why we have a curation team and we have this new community contributions function. So if you notice something that we may have missed, it's out in the world and we're publishing it and you realize, no, no, no, no, no, this is about a package I own, this is malicious, et cetera. We're gonna make sure that you have a way to flag that. Right now you can just write it into a community contribution. We have on our backlog, it's been there for a while, adding a functionality that just says dispute and making sure that it's really easy to flag and dispute advisories. And so yeah, I think that it's very possible and it is coming up and it is happening. So we're trying to make sure we have ways to handle that when it does come up. Awesome, so we're gonna cut off, so feel free to come up and we can chat.