I still have coughs that rack through me, which is a lot of fun. I got con flu from hacker summer camp — I know there were a lot of us who got con flu from hacker summer camp. Anyways, we're one minute past, so hi, everybody. Welcome to Scaling the Security Researcher to Eliminate Open Source Security Vulnerabilities Once and For All. This is a talk I've given all over the world, including Black Hat, DEF CON, Italy, Japan, you know, places. So I'm happy to have brought it all to you here. Thank you for coming. If you've seen it before, it's not new, but hopefully you will all enjoy it. So who am I? My name is Jonathan Leitschuh. I am a Senior Software Security Researcher for Project Alpha-Omega, and I work for the Open Source Security Foundation (OpenSSF). I was the first ever Dan Kaminsky Fellow. I'm a GitHub Star and a GitHub Security Ambassador. You can find me on Twitter slash X slash whatever it's called — that joke's been made a lot this week — and also on GitHub, at JLLeitschuh. A little bit of a disclaimer: this talk does discuss commercially available software that you can purchase. However, for open source, all of it is available for free. This project has been supported by Alpha-Omega and the OpenSSF, and it was also supported by the Dan Kaminsky Fellowship last year. For those of you who don't know — I sadly never got the opportunity to meet Dan — Dan Kaminsky was a hero, best known for a massive security vulnerability he found and helped fix back in 2008 in DNS that would have allowed anybody to hijack DNS records across the entire Internet. Dan passed away a couple of years ago, and the Dan Kaminsky Fellowship was created to celebrate Dan's memory and legacy by funding open source work that helps make the world a better and more secure place. I was honored to be the first ever Dan Kaminsky Fellow last year. Let's start with some spoilers: I generated pull requests.
I generated 164 pull requests last year to fix Zip Slip, a critical security vulnerability, across the Java ecosystem. But you want to know how the story started. Well, the story started with a simple vulnerability, and the vulnerability was this: I saw a line of code in my company's build logic that used HTTP instead of HTTPS to resolve dependencies. You might ask, why is that important? Well, if you are using HTTP instead of HTTPS to resolve your dependencies, an attacker in the middle can supply-chain attack your build. If you're running the build locally, they can hijack code on your machine; in CI, they can hijack your CI/CD pipeline. This vulnerability doesn't just exist in Gradle build files — that original one was a Gradle build file — it also exists in Maven build files. And it doesn't just affect downloading dependencies: it also appears where you upload them, where you're publishing your artifacts to your artifact server, and that usually has credentials associated with it. This vulnerability was everywhere. It impacted organizations like Spring, the Apache Software Foundation, Red Hat, Kotlin, JetBrains, Jenkins, Gradle, Groovy, Elasticsearch, the Eclipse Foundation, Oracle, the NSA, LinkedIn, and Stripe. It was all across the internet. So I reached out to Maven Central — Maven Central is to the Java ecosystem what PyPI is to Python or npm is to JavaScript — and they said that when they looked at their traffic, 20% of it in June of 2019 was using HTTP instead of HTTPS to resolve dependencies. So how do we fix this?
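To make the pattern concrete, here's a hypothetical pom.xml fragment — illustrative only, not taken from any of the actual affected projects — showing the vulnerable configuration next to its fix. The fix is literally one character:

```xml
<!-- Vulnerable: artifacts fetched over plain HTTP can be swapped
     out by a man-in-the-middle attacker -->
<repositories>
  <repository>
    <id>central</id>
    <url>http://repo.maven.apache.org/maven2</url>
  </repository>
</repositories>

<!-- Fixed: HTTPS authenticates the server and protects the
     artifacts in transit -->
<repositories>
  <repository>
    <id>central</id>
    <url>https://repo.maven.apache.org/maven2</url>
  </repository>
</repositories>
```

The same one-character change applies to Gradle `repositories` blocks and to the publishing configuration that carries credentials.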
Well, I pushed for an initiative. Unlike some other language ecosystems, Java has multiple major artifact servers instead of just one, and I pushed for all of the major artifact servers in the industry to decommission support for HTTP in favor of HTTPS-only on January 15, 2020. And it worked. This initiative worked. I got all of these organizations on board, and they all published blog posts announcing the deprecation of HTTP in favor of HTTPS-only. And, right, this was awesome — we were all going to do the same thing. However, I reached out to Maven Central again, and they said: yeah, so we've only seen about a 5% drop since all those blog posts were published. Most of that HTTP traffic was still there in January of 2020, and we were going to pull the plug in 15 days. So you can imagine what might have happened on January 15, 2020: broken software, lots and lots of broken software, posts on Stack Overflow asking "why is my build not working?" Yep. However, we stopped the bleeding. But what about the other repositories? I listed four major artifact servers in the Java ecosystem, but Java is one of the oldest language ecosystems, and there are more than just those. The Eclipse Foundation runs one; Jenkins runs their own server. They all host their own artifacts on those servers. There is one central artifact server, but plenty of other people host their own. So how do we fix the rest? How do we fix this code that's vulnerable across the entire open source ecosystem? And I said, well, let's just go fix it at the source level. Let's go give the maintainers the fix. So how? Well, the first thing you need to do is find the vulnerable code. So I wrote a CodeQL query — very, very simple, very straightforward. And what is CodeQL? CodeQL scans hundreds of thousands of open source projects every single day. People install it in their GitHub repositories to do code analysis.
But also, as a researcher, I can run these queries against as many projects as I want to figure out whether they're vulnerable. GitHub has a bug bounty program, and just for that little query and a bit of documentation, they awarded $2,300. Then I needed to actually fix the vulnerability. I had a list of vulnerable projects from CodeQL. My first bulk pull request generator was a Python-based wrapper over GitHub's hub CLI. It had one nasty regular expression and a lot of logic for bouncing off of GitHub's rate limiter. You can see the code here, but let's drill in a little. This is the regular expression I was using to fix Maven POM files. You might ask, why use a regular expression? We have XML parsers out there. The problem with XML parsers is that if you parse the file, modify the XML, and then dump it back out, it comes out in a different format than you originally put in. Most XML parsers — all XML parsers that I'm aware of — do not preserve formatting. So I had to use a regular expression. Of course, the problem with using a regular expression to solve your problem is that now you have two problems: the regular expression and the problem you originally had. But it worked. I generated pull requests to fix this vulnerability across the entire open-source ecosystem. Here's an example: a very simple change, HTTP to HTTPS, in the right locations in the code base. I generated 1,596 pull requests to fix that vulnerability, and as of the last time I updated that statistic, at the end of last year, I had about a 40% merge rate. Pretty good. And the GitHub Security Lab awarded me a $4,000 bounty for doing that work across all of GitHub.
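To give a flavor of that regex approach — this is a much simplified sketch in Java rather than the original Python wrapper, and the pattern here is mine, far tamer than the real "nasty" one — the trick is to rewrite only the matched scheme, so every other byte of the file, formatting included, survives:

```java
import java.util.regex.Pattern;

public class HttpUrlRewriter {
    // Matches "http://" only inside a Maven <url> element, so only
    // repository/publishing URLs get upgraded; everything else in the
    // file is left byte-for-byte untouched.
    private static final Pattern INSECURE_URL =
        Pattern.compile("(<url>\\s*)http://");

    public static String fix(String pomXml) {
        // Replace just the scheme, keeping the captured prefix, so
        // indentation, comments, and element order all survive.
        return INSECURE_URL.matcher(pomXml).replaceAll("$1https://");
    }
}
```

Because nothing is ever parsed and re-serialized, running `fix()` over a whole POM preserves the maintainer's formatting — exactly the property an XML round-trip would have destroyed.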
So I got hooked on this idea of bulk pull request generation as a way to fix security vulnerabilities at scale across open source. You can see my GitHub contribution graph for 2020: two massive spikes, because I actually ran two campaigns that year to fix different security vulnerabilities. So, I have a problem. I have ADHD. That's not really my problem. The problem is, I love chasing squirrels. I will look in an advisory database, see a vulnerability, and think: hmm, I wonder where else that is. And I can find that same vulnerability everywhere, because you just plug that code pattern into GitHub search or CodeQL or whatever, and it just keeps appearing. So the problem I'm having is that, as a security researcher, I'm finding too many security vulnerabilities. Here's an example. This is Zip Slip — we'll talk about what Zip Slip is later in the talk. This is on an old website called lgtm.com, which showed results for a CodeQL query, and there are pages and pages and pages of real results with real security vulnerabilities that you can scroll through. There are too many vulnerabilities. And the problem is, some of them are in someone's one-off project, and some of them are in really critical projects. I'm finding too many security vulnerabilities. I need automation to solve this problem. I need automated, accurate transformations at a massive scale. And this is where OpenRewrite comes in. OpenRewrite is an open source code base, written in Java, and it lets me manipulate the abstract syntax tree of code. So what is an abstract syntax tree? An abstract syntax tree is the tree representation of your code. When you compile your code, your compiler turns it into this tree representation and then turns that into bytecode.
The problem is, if you dump that tree back out, the compiler doesn't care about the whitespace, the comments, all that stuff — it drops it all. So OpenRewrite is a format-preserving abstract syntax tree. It keeps all the whitespace. It keeps the comments. And because of that, you've kept all this information that's really valuable for making targeted changes. One does not simply reformat the entire source file when trying to make changes. Maintainers will say: great, thanks for that change, but you don't match our formatting — can you run the formatter? And when you're doing this at scale, that really doesn't work. Again, this is a fix for Zip Slip; again, we'll go into that later. Additionally, you often have new code that you need to insert as part of fixing a vulnerability. Some projects use tabs, some use spaces, some put braces on a new line. OpenRewrite lets you insert new code into the code base while automatically matching the stylistic formatting of the surrounding code, so you don't need to think about that in the code you're writing. Additionally, it's fully type-attributed. For example: is this Log4j, SLF4J, or Logback? That might be really critical when you're trying to fix a specific vulnerability. I can't imagine there ever being a security vulnerability in a logging framework, though. That would never happen. Because it's fully type-attributed, you get all this really rich data — there are actually 6,000 nodes missing from that left image, because otherwise it would just be a blur. That is the amount of rich data you're working with when you manipulate ASTs at this level. And then the other problem is that even simple code transformations can create complex trees: Java method declarations, all these different little objects you need to create.
So, say we want to insert this chunk of code to fix a vulnerability. OpenRewrite has a templating language that lets you write raw Java in a string, and it will create all the trees you need to insert into the AST. It also has a coordinate system that lets you say: I want to place this at exactly this location in the AST. And it will do that for you. That lets you take the vulnerable code at the top and transform it into the non-vulnerable code at the bottom. So what's possible now? What vulnerabilities can we fix with this unlock that OpenRewrite provides? I'm going to talk about three security vulnerabilities that I tackled with this technology. The first is temporary directory hijacking. The second is partial path traversal. And the third is Zip Slip. First one: temporary directory hijacking. Take a step back. On Unix-like systems, the system temporary directory is shared between all users. That means that if you write into the temporary directory, anyone else who is co-resident on the same machine can see the file that was created. And this is the vulnerability you'll see a lot in Java. What's going on here? Well, people often want to create a temporary directory, not a temporary file. Unfortunately, prior to Java 7, there was no API to create a temporary directory. So what people did was create a temporary file, delete it, and call mkdir() to make a directory with that name. This is actually a suggestion you get on Stack Overflow when you ask how to create a temporary directory. The problem is, sometimes you ask Stack Overflow and you get vulnerabilities. So why is this a vulnerability? Well, there's a race condition here, and the race condition exists between the deletion and the mkdir().
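Here's that vulnerable pattern as a compilable sketch, next to the Java 7 API that fixes it (the method names and the "app" prefix are mine, for illustration):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TempDirExample {
    // Vulnerable pre-Java-7 pattern: there is a race window between
    // delete() and mkdir(), and mkdir() reports failure by returning
    // false rather than throwing -- so a hijacked directory goes
    // completely unnoticed.
    public static File insecureTempDir() throws IOException {
        File tmp = File.createTempFile("app", null);
        tmp.delete();   // an attacker can win the race right here
        tmp.mkdir();    // return value silently ignored
        return tmp;
    }

    // Fix: one atomic call that creates the directory with owner-only
    // POSIX permissions (rwx------) on Unix-like systems.
    public static Path secureTempDir() throws IOException {
        return Files.createTempDirectory("app");
    }
}
```

The fixed version has no deletion step and no window for another local user to claim the path first.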
Because if that mkdir() fails — if you fail to create the directory because someone else created it before you, with wider permissions — it doesn't throw an exception. It returns false. So one valid fix is to put it in an if block and throw an exception on failure. The problem is that this is still vulnerable to another vulnerability, called temporary directory information disclosure, because by default the directory that's been created is visible to all other users, and they can see the contents of the files inside it. It is not a private directory that only you have access to, so if you're putting sensitive information in that directory, other users can see it. So what's the actual fix? Well, this API, Files.createTempDirectory, was introduced in Java 7 — a very, very old API at this point — and it creates the directory with the right POSIX permissions. I have a bunch of CVEs from this. I actually reported it, but then I thought: OK, this is stupid; I should just go fix it in a bunch of places. So I did bulk pull request generation, and I generated 64 pull requests to fix that vulnerability across the open source ecosystem. It's a very simple fix: take the old vulnerable line and replace it with this one. But we can also clean up more complex code. See these deletes and mkdirs? We don't need them anymore, because it's all replaced with that one line. That's the unlock that OpenRewrite provides. The second vulnerability is called partial path traversal. Let's say you have two users on the file system: sam and samantha. Partial path traversal allows an attacker to access a sibling directory with the same prefix. Taking /usr/sam and /usr/samantha: you've tried to say that nobody may access anything outside of /usr/sam, but because /usr/sam is a prefix of /usr/samantha, it's a vulnerability.
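Here's that sam/samantha scenario in runnable form — both the broken string check and the Path-based fix that we'll walk through (the /usr/sam paths are illustrative; nothing needs to exist on disk for getCanonicalPath() to normalize them on a Unix-like system):

```java
import java.io.File;
import java.io.IOException;

public class PartialPathTraversal {
    // BROKEN check: getCanonicalPath() strips the trailing separator,
    // so "/usr/samantha/baz".startsWith("/usr/sam") is true as a
    // plain string comparison -- the sibling directory slips through.
    public static boolean insecureContains(File parent, File child)
            throws IOException {
        return child.getCanonicalPath()
                    .startsWith(parent.getCanonicalPath());
    }

    // FIXED check: Path.startsWith compares whole name components,
    // so "samantha" is never mistaken for a child of "sam".
    public static boolean secureContains(File parent, File child)
            throws IOException {
        return child.getCanonicalFile().toPath()
                    .startsWith(parent.getCanonicalFile().toPath());
    }
}
```

With `parent = /usr/sam` and attacker input `../samantha/baz`, the insecure check accepts the path while the Path-based check rejects it.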
So how does the vulnerability appear? This is it as it appears in Java. You might ask, why is that vulnerable? Well, this is a String.startsWith comparison. The problem is, when you take that string with the original trailing slash, wrap it in a File, and call getCanonicalPath(), getCanonicalPath() normalizes the path. It removes the "../" sequences; it removes any extraneous characters. But getCanonicalPath() returns a String, and that String, you'll notice, is missing the trailing slash we originally had up here. So now user-controlled input comes in. We've got /usr/sam, and the user-controlled input is ../samantha/baz. When we call getCanonicalPath() on that, it gets turned into /usr/samantha/baz — and "/usr/samantha/baz" does indeed start with "/usr/sam". So this IOException does not get thrown. What's the fix? Well, one option is to add the file path separator back on. The better fix is to use the Path API — getCanonicalFile().toPath() — which uses path operations to compare component by component, and then you will not end up with this mistake. So how do we find this vulnerability? First, we look for that String.startsWith call. Then we look at the qualifier and the argument to see if they are getCanonicalPath() calls. We also want to check that there isn't a file path separator character appended, because we don't want to "fix" non-vulnerable code, right? That's really important — maintainers will get pissed at us. But it can't just be that easy, right? Software developers write code in a lot of different ways. What happens if they write it like this?
Where they have that first argument extracted into a variable. Or the argument to startsWith is extracted into a variable. How do we determine whether this is vulnerable? Or what if they do have the separator appended, but it's extracted into a standalone variable, so it's not vulnerable? We need data flow analysis. Data flow analysis allows us to track an assignment to a variable through to where it gets used in that if condition. It lets us handle even more complicated cases, where values flow through ternaries or other variable assignments or things like that. Data flow analysis allows us to see what a variable will be at the point where it's being used. So data flow allows us to uncover hard-to-find vulnerabilities and helps us prevent false positives. If you've written any CodeQL — I don't know how many of you have; by show of hands, who's touched CodeQL, written any CodeQL, or is familiar with it? Okay — well, this API is designed after CodeQL. So if you learn CodeQL, you can apply your knowledge to this API, and similarly you can go the other way. And using that knowledge, we can fix these vulnerabilities even when the vulnerability spans multiple lines across multiple variable assignments; we can still fix it appropriately. Let's talk about a quick use case. This vulnerability existed in the AWS Java SDK. You'll notice they're using this bit of logic right here, leavesRoot, to check whether that startsWith clause allows this key — an AWS key; in a bucket you have keys for the objects — to be a path traversal attempt while you're downloading the entire contents of an AWS bucket. And that startsWith clause right there was this vulnerability, because they were using leavesRoot to decide "can't download: key as relative path resolves outside the parent directory."
So I reported this to AWS and they said: thank you, we will fix it. As with any good story, this had a little bit of vulnerability disclosure drama associated with it. This was a conversation I had with the AWS security team. They said: we'd like to award you a bug bounty for this; however, you'd need to sign an NDA. I said: I don't normally agree to sign NDAs — can I read it first before potentially agreeing? And this is the lovely response I got: we're unable to share the bug bounty program NDA, since it and other contract documents are considered sensitive by the legal team. Yes. I still have not received that bug bounty to this date. They did apologize for it, though. The third vulnerability is, as promised, Zip Slip, alluded to at the beginning of this talk. So what is Zip Slip? To start out: zip files are key-value pairs. The key is the path, and the value is the contents of the file inside the zip. So if somebody you do not trust gives you a malicious zip file, the key can be a path traversal payload — ../../bin/sh, right? It is a path traversal vulnerability while unpacking zip file entries. This is that vulnerability. There's a lot of code there, so let's narrow it down a little. The key elements are getName(), which is attacker-controlled, and this FileOutputStream, which is being created from that attacker-controlled file — a file which may end up outside the original destination directory, because there are no guards between these two places to make sure it has not escaped that directory. So Zip Slip is complicated. And Zip Slip is complicated because this is a valid fix for Zip Slip: you put a guard in place that checks that the entry hasn't escaped the destination directory. But the problem with Zip Slip is that while this is a valid fix, so is this. These are both valid guards for the vulnerability.
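A minimal sketch of what such a guard can look like (the helper name is mine; real extraction code would call this for every zip entry before opening the FileOutputStream). Note that it uses the component-wise Path.startsWith comparison from the previous section, so the guard itself doesn't introduce a partial path traversal:

```java
import java.io.File;
import java.io.IOException;

public class ZipSlipGuard {
    // Resolves a zip entry name against the destination directory and
    // refuses any entry that escapes it. Canonical paths are compared
    // so "../" sequences are normalized before the containment check.
    public static File resolveEntry(File destDir, String entryName)
            throws IOException {
        File dest = new File(destDir, entryName);
        if (!dest.getCanonicalFile().toPath()
                 .startsWith(destDir.getCanonicalFile().toPath())) {
            throw new IOException(
                "Zip entry escapes destination dir: " + entryName);
        }
        return dest;
    }
}
```

A benign entry like `docs/readme.txt` resolves normally; a payload like `../../etc/passwd` is rejected before any file is written.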
So how do we determine — because, again, we don't want to fix non-vulnerable code — how do we determine whether this is vulnerable or not? We need control flow analysis. Control flow analysis allows us to differentiate between these two bits of code. What is control flow analysis? Control flow analysis builds a graph, broken up into two basic parts: basic blocks, which are sets of contiguous operations that will occur in a program without a jump, and condition nodes, which are places in the code where you have branching. Using this, we can determine whether code is vulnerable. So let's take this chunk of code and run control flow analysis on it. We can see there's this startsWith guard that prevents reaching this IOUtils.copy call. As a result, we can guarantee that, because this guard exists, the vulnerable path is unreachable, and so the code is not vulnerable. And when we put all this together, we can generate the fix only for the code that is vulnerable, and leave non-vulnerable code alone. On top of that, we can clean it up a little further, too. A couple more examples: we can do this for other, more complicated code bases as well. So let's talk about pull request generation. You've got a security vulnerability? Everybody gets a pull request! Let's talk about the problems of pull request generation. How fast can we generate pull requests? Well, there's a set of steps you go through whenever you create a pull request on GitHub. You check out and download the code repository, apply the diff on a branch, and commit the changes. Then you fork the repository on GitHub and rename the fork. You might ask, why is it important to rename the repository every time you do this?
Well, if organization A has a project named rewrite and organization B also has a project named rewrite, and you try to fork both of them with the same GitHub account, GitHub says: you already have a fork with that name — even though they're different code bases, we can't fork that. So you need to deduplicate your repository names by renaming every time. Then you push your changes and create the pull request on GitHub. You'll notice there are three API calls here. Actually, GitHub recently made a change: steps three and four can be merged into one API call. However, you're still making two API calls per pull request. So if you're trying to generate tens, hundreds, or thousands of pull requests — they also expect you to wait at least one second between every single pull request. And then they have a secondary rate limit, not documented, on top of that. So you just kind of have to pray, or hope and back off. These are the three API calls where you will run into rate limiting. Yeah, if they could make their rate limiter a little less aggressive, it would make my life a lot easier. Anyways, so we made it this far: the vulnerability's been detected, the style's been detected, we've fixed the code, the diff's been generated, and the rate limit's been survived by praying and waiting and hoping that your timeout is long enough. How do we do this for all the repositories? Well, this is where Moderne comes in. Moderne is the company behind OpenRewrite. It's a SaaS that's free for open source. They have — actually, this number on the slide is wrong; it should be 26,000; I need to update it — about 26,000 open source Java projects indexed. It lets you run these OpenRewrite transformations at scale and generates and updates those pull requests. They have 800-plus recipes; if you want to do migrations from JUnit 4 to JUnit 5, or Spring 1 to Spring 2, they can do that for you with their SaaS offering.
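Going back to the rate-limiting problem for a moment: since the secondary limit is undocumented, about the best you can do is wrap each PR-creating API call in a generic retry-with-backoff. A sketch — the helper interface and delay values here are my own, not anything GitHub documents:

```java
import java.util.concurrent.Callable;

public class Backoff {
    // Retries 'op' up to maxAttempts times, doubling the wait after
    // each failure -- the "pray and back off" strategy for GitHub's
    // undocumented secondary rate limit.
    public static <T> T withBackoff(Callable<T> op, int maxAttempts,
                                    long initialDelayMillis) throws Exception {
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) throw e;  // give up
                Thread.sleep(delay);                  // wait, hope, retry
                delay *= 2;
            }
        }
    }
}
```

In practice you would start with at least a one-second delay (the documented minimum spacing between pull requests) and let the doubling absorb whatever the secondary limit is doing that day.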
You might ask, why is JUnit 4 to JUnit 5 important? Well, you actually can't use the latest version of Spring — which we all know is really important to keep up to date — without migrating your testing framework from JUnit 4 to JUnit 5. Your testing framework can actually be a blocker to getting your dependencies updated. So migrations like these can be relevant in contexts you might not consider. And then you can use it to generate these fixes, right? That's what I did. (I don't know if I'm going to have to restart this demo. Okay.) So I've run this recipe against a few hundred repositories, and then I can create a pull request. I can set the branch name, the commit title, the organization and project name. I can supply a pull request title. I can give my GPG key — including my GPG private key, which, yes, I know, inserting into a SaaS service, whatever. And I can generate these commits and create these pull requests in a way that will be verified, as if it's really coming from me — because it is. This is code that I've written to generate these pull requests. (Ah, come on, keep going. There we go.) But there are more than just those 26,000 open source projects on GitHub. What about the rest? Well, this is where CodeQL comes in again. CodeQL scans hundreds of thousands of open source projects and has 35,000 Java projects indexed that you can run queries against. So once you've written a CodeQL query to determine which projects are vulnerable, you can take that list and add it to this GitHub repository, which holds a CSV file: the list of projects that Moderne is aware of. And then you can write a recipe targeting those repositories that Moderne wasn't aware of but should be, because they're still vulnerable and you want to fix them as well. So let's go generate some open source software pull requests. That's what I did.
Temporary directory hijacking, Zip Slip, partial path traversal. And here are some statistics. I generated 1,596 pull requests back in the day to fix the HTTP-downloaded-dependencies vulnerability, with about a 40% merge rate. JHipster was a different vulnerability: a code generator, where 15,000 instances of the same vulnerable code appeared across GitHub because the JHipster code generator was generating vulnerable code. Had to fix that. GitHub used this work to fix that vulnerability in coordination with CERT/CC. I did temporary directory hijacking, and this past year, with Moderne, partial path traversal and Zip Slip. You can see the merge rates tend to be around 20-ish percent, depending. In 2022 I generated north of 600 pull requests to fix security vulnerabilities, and I've personally been a part of generating north of 5,200 pull requests in my career to fix various security vulnerabilities across open source. One unlucky project was the recipient of all three of my pull requests for this campaign — I think to this day they still have not merged them. I should go harass them about that. Also, note that star count up there: 1,300. And this is my contribution graph for 2022; you can see those massive peaks. So, to close out, let's talk a little about best practices for bulk pull request generation. I have been improving on this as part of the work I've been doing with the OpenSSF, but these are the things I've learned in general from doing this, at least prior to that future work. Messaging: all software problems are people problems in disguise. You can very easily piss off a maintainer with the wrong messaging when you're doing this work. So you need to make sure that you are kind, compassionate, and respectful of them. Also, you may look like a bot coming in. So you need to articulate to them: hey, no, no, I'm actually a real person. I will engage with you. I want to work with you.
We want to make the world better. I may have automated the generation of this pull request, but if you have questions, you can talk to me as a real person. And then some more mundane things. Lesson one: sign off on your commits. You might ask, why do I need to sign off on my commits? This is what the sign-off looks like. Why? There was a lawsuit, yada, yada, yada — TL;DR: lawyers. Otherwise, your pull request will get rejected by evil dragon bureaucrats. Lesson two: GPG-sign your commits. Then they show up as verified, like it does there, and you won't be like Linus Torvalds, who has had himself impersonated on GitHub multiple times. SECOM is a commit message format for security-related commits. I don't have the time to go into it deeply, but this is the format. It should probably become part of the OpenSSF eventually. This commit format includes things like the CVSS score, what detected the vulnerability, where the report was filed — all that in the commit message. If we could do more of this, it would be helpful for the community that's trying to look at this data. There are risks to using your personal GitHub account. Who here is familiar with GitHub's angry unicorn? The angry unicorn is what you see when something has broken, or GitHub has gone down, or something has crashed. This was my GitHub profile. So this is the angry unicorn. This was my GitHub profile for most of 2020: the first time you loaded my profile, until it was cached, you'd get this error, because I had so many commits against my account. That being said, I would still recommend using a personal GitHub account, because it makes it look like you're a real person, and the maintainer is less likely to get upset with you. However, when you're trying to do this with other people, or as a larger organization, that adds additional complexity, because then you become a bus factor. There are trade-offs here.
We're currently trying to deal with this within the OpenSSF Vulnerability Disclosure working group's Autofix SIG. Lesson five: coordinate with GitHub. Reach out — this is the Security Lab. Let them know you're going to do this. Talk to them. Maybe they'll review your changes and make sure you're not going to spam out stuff that's crap. And then lesson six: consider the implications. This is actually the point of some of the work we've been doing within the OpenSSF really recently. Shortly after beginning this work, I got this issue opened against my repository: "Is this responsible disclosure?" Now, I use the more nuanced term "coordinated disclosure," but whichever you call it, responsible or coordinated disclosure, the answer is no. This is full public disclosure of a security vulnerability against an open source project. As part of doing this work, you are potentially 0-daying a maintainer. That has implications. Fixing that requires a significant amount of infrastructure and quite a bit of code, because coordinating disclosure through automation is a large, complicated space — one that we at Alpha-Omega are trying to fix but have not yet fixed. GitHub has added support for creating private vulnerability reports, but that's still not perfect either, because it adds additional levels of complication that we, the people trying to do this work, have to deal with. So the answer is: I don't know what the right solution is. When I was doing this, it was the best option I had. It may not be the best option moving forward, but we have way more vulnerabilities than we can deal with. So I want to leave you with this final conclusion. As security researchers, as software engineers, as security professionals, I believe we have an obligation to society. We know these vulnerabilities exist. We have seen them in pen test reports. We understand them. We know how to talk about them. We know they're out there.
We are the ones who know how to communicate about these vulnerabilities. We are the people who understand this stuff. There's a statistic that GitHub put out: for every 500 developers, there's one security researcher. We are heavily outnumbered in this space. So how are we going to take our knowledge of math, science, technology, computer science, and security, and make sure it scales, given how outnumbered we are? I believe that automating our work as security researchers is the best way for us to have a positive impact on the wider security of the entire industry. I want to leave you with one final quote. This is from Dan Kaminsky. It's on his Twitter profile, and it remains there to this day: "We can fix it. We have the technology. Okay, we need to create the technology. All right, the policy guys are mucking with the technology. Relax, we're on it." I want to leave you with some action items. Learn CodeQL, seriously. It's a super powerful language you can use to find vulnerabilities at scale, across open source and in your own code. Contribute to OpenRewrite, and you can deploy your security fixes at scale. Join the GitHub Security Lab and OpenRewrite Slack channels. And also join the Open Source Security Foundation if you want to discuss securing open source. I run the OpenSSF vulnerability disclosure working group's Auto-Fix SIG. I will be out for about a month; I need to take some leave for reasons, but I will be back, hopefully, after that to continue to push forward and improve the state of the industry in this area. I want to thank the OpenSSF, Alpha-Omega, Moderne, Human Security, and then Lydia Giuliano, the speaker coach Black Hat offered me, and Shyam Mehta, who was my intern during the Dan Kaminsky Fellowship and created some of the graphics you saw for data-flow and control-flow analysis. This is me. Thank you all for coming, and thank you for being here.
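To give a taste of what "finding vulnerabilities at scale" with CodeQL looks like, here is a small query in the spirit of the plain-HTTP bug that started this whole story. This is a sketch, not a production query: the exact library classes and predicates can vary with the CodeQL version, and a real query would also check how the string is used.

```ql
import java

// Flag hard-coded plain-HTTP URL literals, e.g. repository URLs
// in build logic. Anything fetched over them can be tampered with
// by an attacker in the middle.
from StringLiteral s
where s.getValue().matches("http://%")
select s, "Plain-HTTP URL literal; use https:// to prevent tampering."
```

Run against one repository this is a lint; run through multi-repository variant analysis it becomes the kind of ecosystem-wide sweep described in this talk.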
I think we're the last talk of the day, so I have plenty of time for questions. Go for it! There's not very much that isn't on GitHub. Okay, I agree with you. I very much agree with you. We should do more than just GitHub, right? The problem is that most open source is on GitHub, so having the biggest impact means going where most open source is. It's not to say it's not important to focus on GitLab or Bitbucket or places like that; you've just got to go where you can have the biggest impact. Probably. Right. I'm actually curious enough to look this up: how many projects in the top 10,000 critical projects list are not on GitHub? Did the Harvard study only look at GitHub? I don't know. That's a good question. Probably. Yeah. For my work, you have to focus on what APIs are easily accessible and what systems already exist, but you're not wrong. You're absolutely not wrong. Yes, that's fascinating. I believe it. No, I'm not disputing that. Yes, there's a lot of open source that's not on GitHub. You might ask why. So: JHipster, which was the vulnerable code generator, is very commonly used to generate a one-off project that someone is trying out to see, oh hey, what is JHipster? They commit the repository and never touch it again. So that's that one. This was the first time this was ever done against C projects, and this vulnerability, I think, actually existed in a dependency. The problem is that in C, people more commonly vendor their dependencies, source-vendoring them into their code bases. So these may also be dead projects where someone just cloned another project, uploaded it, and never touched it again. But I do not have insight into that 7%.
That was not me; that was done by GitHub. These numbers, though, the 20% and then this 40%: the 20% we reached after I did this work in 2019, and the 40% was reached after, well, it's now 2023, right? So I think that data came from 2022. I still get pull requests merged. To this day, in my GitHub notification feed, I still have pull requests that get merged for all of these, every once in a blue moon. Anything related to Moderne is Java: Kotlin is beta, Python is beta, Groovy is stable, and then they also support XML and YAML. So OpenRewrite has written ASTs for, I think, also Terraform, which is HCL, right? They support a bunch of programming languages, predominantly Java, and then they're working on Python and a couple of others. I think they're also poking at JavaScript. Also, they have support for one other language that is only available commercially. It is a language used absolutely everywhere that nobody wants to write anymore. Care to guess? Yeah, there it is: COBOL. Because there's a lot of COBOL out there and nobody wants to touch it anymore. Yes, there's a lot of money to be made writing COBOL code, too. You know, the responses actually come in at a slow enough rate that it's not unreasonable to deal with as a solo person. Not something I'd want to do if I were targeting 26,000-plus projects, but at the scale I've been working at: you'd think 1,600 pull requests is a lot, but I only had a 40% merge rate, so cut that roughly in half, and then remember that all these projects' maintainers take a while to respond to you. At the beginning you're dealing with maybe 5 or 10 a day, maybe even more, but the rate at which notifications come in tapers off really, really quickly. Yeah.
Now the goal is, I mean, I'd love to increase the number of campaigns we can run continuously moving forward. The long-term goal, the dream I have for Alpha-Omega, is this: all of these things were done as one-offs, right? I did a snapshot-in-time fix. But what if somebody tomorrow introduces vulnerabilities into open source? How do we deal with that? I want to see this stuff done on a continuous basis, where we are constantly sweeping the internet and cleaning up the things we know are vulnerabilities and can fix, because we have the recipes to do so. That is the dream: a world where if you introduce a security vulnerability, we just take care of it and you don't have it anymore. That's the other thing, yeah: if we can prevent this code from showing up again and again in open source, it'll mean that LLMs are not getting trained on bad, vulnerable code. That is a topic for an entirely different talk. To answer your question, go look up a Black Hat talk from not this most recent year but the year before. It's called In Need of Peer Review, and it discusses that exact topic: GitHub Copilot generating vulnerable code, and what triggers it to generate vulnerable versus non-vulnerable code. For example, what if I put a name at the top of the file, a well-known, well-respected maintainer who has written lots of code, maybe the maintainer of Python? Does it generate better code because of the name at the top of the file? The answer turned out to be yes: you end up with better generated code if the file is attributed to a well-known author. So: In Need of Peer Review, Black Hat 2022. Yeah. Anybody else? Go for it. It is. Yes, I agree. I very much agree. Yeah. Though when you need to hand over keys to let someone else create the commit on your behalf, it's not a bad idea. Yeah.
Any final questions, comments, concerns? I do try. My dream. Yeah. Thank you all for coming.