Welcome, everybody, and thank you for coming. This talk is titled "Scaling the Security Researcher to Eliminate Open Source Security Vulnerabilities Once and for All." It's a talk that I've given all around the world, including at Black Hat, DEF CON, SEC-T in Stockholm, and CODE BLUE in Japan, so thank you for coming to see it here.
So who am I? My name is Jonathan Leitschuh. I am a software security engineer and software security researcher. I was the first-ever Dan Kaminsky Fellow at HUMAN Security. I'm a GitHub Star and a GitHub Security Ambassador, and, although I forgot to update this slide, I am currently the senior software security researcher at Project Alpha-Omega, under the Open Source Security Foundation, under the Linux Foundation. You can find me on Twitter and on GitHub at @JLLeitschuh, and also on the OpenSSF Slack.
So, a disclaimer. Two things. One, I'm talking about CodeQL, and I have a legal obligation to say that I'm sponsored by GitHub, because I signed a thing a while ago that's probably not relevant anymore.
The other thing is that this talk discusses a SaaS solution which is free for open source but not free for commercial use. If you're doing this for open source, it's available for free for anybody, and you can use it to fix vulnerabilities at scale.
This project is supported by Project Alpha-Omega and the Open Source Security Foundation. The original bit of this research was supported by the Dan Kaminsky Fellowship at HUMAN Security. For those of you who don't know, Dan was a famous security researcher who tragically passed away in 2021, very young. He was best known for a vulnerability in DNS back in 2008, and the Dan Kaminsky Fellowship was created to celebrate Dan's memory and legacy by funding open source work that makes the world a better and more secure place. I was the first-ever Dan Kaminsky Fellow and worked on this project for the past year under that, so I'm really grateful to HUMAN Security for sponsoring that work.
Spoilers: I actually generated pull requests and actually fixed security vulnerabilities at scale across open source. This is a vulnerability called Zip Slip, and I generated 164 pull requests to fix Zip Slip across the Java ecosystem. But there's a story behind how we got here, so let's start at the beginning.
This whole journey of bulk generating security fixes at scale to fix vulnerabilities across open source started with a simple vulnerability, and that vulnerability was this: the use of HTTP instead of HTTPS to resolve dependencies in my company's Gradle build. I looked at this logic one day and said, "How did that get there?" I had put that code there, so I asked, "Where did I get this code from?" I found that I had copied it from an open source project, and this vulnerability apparently existed everywhere. I found this vulnerability across open source projects. So why is HTTPS important?
First off, if you're not using HTTPS to resolve your build's dependencies, Gradle and Maven don't have any other artifact checking enabled by default. So you can very easily have an attacker in the middle compromising the JAR in flight as you're downloading it.
This is an example of that same vulnerability in Maven. The first one was Gradle; this is Maven. This is in artifact upload, so this is where you're actually uploading the artifact as a final release. Usually this has credentials associated with it, so you're also uploading it in plain text, with credentials.
This vulnerability is everywhere. It impacted the builds of organizations like Spring, the Apache Foundation, Red Hat, Kotlin, JetBrains, Jenkins, Gradle, Groovy, Elasticsearch, and the Eclipse Foundation. It also impacted Oracle, the NSA, LinkedIn, and Stripe. This vulnerability was all over the open source ecosystem.
I reached out to Sonatype, which runs Maven Central. (I don't see Brian Fox here; I should have harassed him to come see this.) They said that 25% of Maven Central downloads were still using HTTP in June of 2019.
So how do we fix this? I pushed forward an initiative so that on January 15th, 2020, all of the major artifact servers in the Java ecosystem would drop support for HTTP in favor of HTTPS only. I got commitment from Maven Central, JCenter, Spring, and the Gradle Plugin Portal to do this, and they all published blog posts announcing it across the entire industry.
So we announced that we were going to do this. But in January of 2020, about nine months after the original disclosure of this vulnerability across the open source ecosystem, Sonatype said that 20% of their traffic was still using HTTP. We had dropped from 25% to 20%. So you can imagine what might have happened on January 15th, 2020: broken software. Lots of broken builds, and lots of people being like, "What the hell, why did my build break?"
Posts on Stack Overflow, because we shut off the use of HTTP in favor of HTTPS only across the entire Java ecosystem. So we stopped the bleeding.
But what about the other repositories? These are only the most commonly used artifact servers in the Java ecosystem. To give a little bit of context: for pip there's PyPI (I'm saying it wrong, I know), and for npm there's the npm server. The Java ecosystem is older than all of these, so there are actually multiple centralized repositories that get used, and people will declare multiple servers in their build to resolve dependencies from. These are really popular ones, but other people publish artifacts too. The Eclipse Foundation has their own artifact server, Jenkins has their own artifact server, and there are other organizations.
So how do we fix the rest of the open source industry that is pointing at these other servers that have not been fixed? Well, I said, why don't we just go fix the source code? Why don't we just go bulk generate security fixes at scale?
So how? First, you need to define the vulnerability.
So I used CodeQL. I wrote this CodeQL query to find this vulnerability across open source projects and contributed it back to GitHub through the GitHub Security Lab bug bounty program. You can write a CodeQL query and scan hundreds of thousands of open source projects. The feature you want to look for is called variant analysis for CodeQL, which lets you run these scans against all of these projects from inside your VS Code IDE. For this very simple query, along with a little bit of documentation, GitHub awarded me a $2,300 bounty.
Then, when we wanted to fix the vulnerability, I wrote a bot. The bulk pull request generator was written in Python. It was a wrapper over GitHub's hub CLI, and it had one very nasty regular expression and a lot of logic for bouncing off of GitHub's rate limiter. This is the engine. You can see the commit message up here, the regular expression, and some of the "how do we fix the file" logic. And this is the regular expression.
Now, this is to fix this vulnerability in Maven POM files. POM files are XML. You might ask me, "Jonathan, why didn't you use an XML parser to fix XML files?"
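As a rough illustration of the regular-expression approach described here, the following Java sketch upgrades `http://` to `https://` only inside the `<url>` element of repository blocks in a POM, leaving every other character of the file alone. The original bot was written in Python and its actual expression is not reproduced in this transcript, so the pattern, the class name `PomHttpsFixExample`, and the method name `upgradeRepositoryUrls` are assumptions made for illustration only.

```java
import java.util.regex.Pattern;

class PomHttpsFixExample {

    // Hypothetical pattern, not the one from the talk.
    // Group 1: everything from an opening repository tag up to (not including) "http://".
    // Group 2: the tag name, reused in the lookahead so the match cannot run past the
    //          closing tag of the same repository block.
    private static final Pattern INSECURE_REPOSITORY_URL = Pattern.compile(
            "(<(repository|pluginRepository|snapshotRepository)>(?:(?!</\\2>).)*?<url>\\s*)http://",
            Pattern.DOTALL);

    // Returns the POM text with repository URLs upgraded; all other text is preserved as-is.
    static String upgradeRepositoryUrls(String pomText) {
        return INSECURE_REPOSITORY_URL.matcher(pomText).replaceAll("$1https://");
    }

    public static void main(String[] args) {
        String pom = String.join("\n",
                "<project>",
                "  <url>http://example.org</url>",
                "  <repositories>",
                "    <repository>",
                "      <url>http://repo.example.org/maven2</url>",
                "    </repository>",
                "  </repositories>",
                "</project>");
        System.out.println(upgradeRepositoryUrls(pom));
    }
}
```

Because the replacement only touches the matched scheme, indentation, comments, and attribute order in the rest of the file stay exactly as the maintainer wrote them, which is the property an XML parser round trip would lose.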
And the answer is: if you read an XML file into an XML parser, modify the XML, and then dump it back out, it will come out in the format of the XML parser. It will not come out in the format that you put it in with. So if you want to match the formatting of the input file, you have to use a regular expression. Or at least you did. The problem is, if you're using a regular expression to fix a problem like this, you now have two problems: the original problem you had, and the regular expression you're now trying to write.
But it worked regardless. I generated 1,596 pull requests across the Java ecosystem to fix this vulnerability. This is an example of the diff. You'll see it's a very simple replacement of HTTP with HTTPS in the correct locations: not across the entire file, but in the specific locations where it's needed.
I did this campaign in early 2020, and as of today we have a 40% merge rate. It's 2023 now, so it's been a few years. For this work, GitHub awarded me a $4,000 bounty for actually going out and fixing this vulnerability across open source. So thank you to the GitHub Security Lab for that.
I got hooked on this idea of bulk pull request generation as a solution to fix vulnerabilities at scale across open source. This is my GitHub contribution graph for 2020. You can see these two massive peaks where I did two different campaigns. It's actually having an impact on the open source industry.
So I have a problem. I have ADHD, and I don't consider ADHD to be the problem, but I love chasing squirrels. I love looking at security advisories and saying, "I wonder where else that vulnerability that somebody published is."
And so I can take a vulnerability and see that it appears all over the place. Using tools like GitHub code search and GitHub's CodeQL, I can get tons and tons of results. So, using what I know, I can find too many security vulnerabilities, more than I can reasonably report or fix.
This is an example of one CodeQL query for finding Zip Slip, as it showed up on a now-defunct site called LGTM.com. These are CodeQL results. You can still get these results now; it's just more difficult. You can scroll through pages and pages of these results showing real vulnerabilities in real open source projects.
So if I'm finding too many security vulnerabilities, I need a way to actually deal with them. I need automation. This is where automated, accurate transformations at massive scale come in. This is where we can actually fix these vulnerabilities.
This is where OpenRewrite comes in. OpenRewrite is an open source library written in Java. Ideally, in the future, you'll be able to write these recipes in any language using GraalVM. That's not quite supported yet, but it's the long-term goal.
So what is the problem that OpenRewrite solves? When you compile code like this, the compiler turns it into an AST, an abstract syntax tree, which is what the compiler sees. The problem is that the compiler doesn't care about whitespace, tabs, comments, all that stuff. So if you were to dump this AST back out into source code, this is what you'd get. We started with formatted code, and we've got nothing, right?
What format-preserving means is that, because OpenRewrite uses a format-preserving abstract syntax tree, it keeps the whitespace, tabs, comments, spaces, all of that information. So we can transform back and forth between the text format that is your source code and the structured format that you want to manipulate to fix these vulnerabilities.
One does not simply reformat the entire source file, because maintainers will not be pleased with you. They'll be like, "Thank you for fixing the vulnerability. It doesn't match our style. Can you do it again?" And that doesn't work at scale.
Additionally, we can generate new code that matches the surrounding source code. Take, for example, this code that uses spaces when the project uses tabs, or a project that puts braces on a new line. OpenRewrite has a templating engine that lets you generate new code that matches the surrounding source code of the entire file or project. That means you're not just generating new code that fixes the vulnerability; you're doing it in a way that makes it look like it was written in the style of the original developers.
Additionally, it's fully type-attributed. Is this Log4j? Is this SLF4J? Is this Logback? With type attribution you can determine which of these libraries it's coming from. And I can't imagine where it might ever be relevant to know that there's a potentially large vulnerability in a logging framework that everybody's using across the entire internet.
If you just have the syntax, you get this graph. Adding type attribution and formatting gives you a much, much richer graph. There are actually 6,000 nodes missing from this right side, because otherwise it would just be a fuzz. With all this information you can have very, very accurate transformations.
Then there's generating new code. The problem is that ASTs are complicated objects. There are lots of objects.
There are lots of trees. Generating new code is very, very complicated, so we need to be able to create complex ASTs very easily as software developers, and you can do that.
Let's say we want to introduce this fix for this vulnerability in an open source project. This is a fix for Zip Slip, as an example. We can do that with the templating engine that OpenRewrite provides. You write this code out as a template, and then you use it with a coordinate system that says where to inject the template into the AST. Here I'm saying, "After this statement, that's where I want to put it." It will generate the AST with the formatting that's relevant for that source location, put it in the AST, and create the diff. (You're taking a picture, so I'll give you a second. Good.)
That lets us transform this vulnerable code that has Zip Slip (I'll talk about Zip Slip later to give more context) into this fixed code very, very easily using this engine.
So what's possible now? What other vulnerabilities can we fix with what OpenRewrite unlocks? I'm going to talk to you about three security vulnerabilities, if I have time: temporary directory hijacking, partial path traversal, and Zip Slip.
The first one is a vulnerability called temporary directory hijacking. Temporary directories on Unix-like systems are shared between all users. That means that if you create a temporary directory on, say, a Linux machine, any other local user can see the contents of that directory with the default POSIX permissions that files get assigned when you create them.
This is the vulnerable code that appears in a lot of Java. It is the way a lot of people have been creating temporary directories in Java. And you might ask, "Why is this vulnerable?"
Well, the reason that you see this code all the time is that if you asked on Stack Overflow how to create a temporary directory, you'd get this answer for a long time. Yes, if you ask Stack Overflow, sometimes you do get security vulnerabilities.
So why is this vulnerable? It's vulnerable because there's a race condition right here. This is creating a temporary file with a random name, using an actual CSPRNG, a cryptographically secure random number generator, but then you're calling delete. So there's a time window where an attacker can see that this file was created, see that it was deleted, and then race to create that directory before you do. And what does this return if it fails? It returns false. But if you're not checking for that false, you don't know that it failed. So you're now working with a directory that someone else created and that has wider permissions than you intended.
Okay, what if we throw this into an if check? That can solve this problem, yes. But this is actually still a vulnerability, called temporary directory information disclosure, because this mkdir by default has the read bit set for all users. So any information you put into this directory, anybody else can read. You're still exposing the contents to local users.
So this is the correct way of fixing it. This API was introduced in Java 1.7.
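The slides are not part of this transcript, so here is a minimal Java sketch of the two patterns being described: the create-file, delete, mkdir sequence with its race window, and the `Files.createTempDirectory` replacement available since Java 7. The class name `TempDirExample` and the method names are hypothetical and chosen only for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class TempDirExample {

    // Vulnerable pattern: create a temp FILE, delete it, then mkdir() at the same path.
    // Between delete() and mkdir(), another local user can create the directory first.
    static File createTempDirUnsafe() throws IOException {
        File temp = File.createTempFile("example", "dir");
        temp.delete();
        // Return value ignored: if mkdir() returns false we never find out that
        // someone else won the race. Even when it succeeds, the directory is
        // created with default permissions, which are typically readable by other users.
        temp.mkdir();
        return temp;
    }

    // Safer replacement: one call that creates the directory itself. On POSIX file
    // systems the JDK creates it with owner-only permissions by default.
    static Path createTempDirSafe() throws IOException {
        return Files.createTempDirectory("example");
    }

    public static void main(String[] args) throws IOException {
        Path dir = createTempDirSafe();
        System.out.println("Created " + dir);
        Files.delete(dir);
    }
}
```

Wrapping `temp.mkdir()` in an `if` that throws on `false` closes the hijacking race but not the information disclosure, which is why the single `Files.createTempDirectory` call is the preferred replacement.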
It's a very old API, but because there's a lot of code written before Java 1.7, people are still using the old way. So this is the way to fix it. I actually have a bunch of CVEs that I got assigned for finding this vulnerability in projects across open source.
We can fix this with OpenRewrite, and I did. I have generated 64 pull requests to fix this vulnerability across open source. This was back when Moderne supported only 6,000 repositories in the set I can scan. We'll talk more about that later, but they now support 28,000. I haven't rerun these campaigns since then, but I presume we can get more.
This is what the diff looks like. It's very simple. You can see we're deleting those unneeded calls and replacing them with that one line. We can even do more complicated things. You see these if blocks? We don't need them anymore, because we're now replacing all of this with a single line.
The second vulnerability we're going to talk about is partial path traversal. Let's assume we have two local users on a file system: user Sam and user Samantha. And let's say we want to sandbox some chunk of logic to only access user Sam. Partial path traversal allows an attacker to access a sibling directory with the same prefix. Again taking our example of user Sam and user Samantha, the reason that you can access user Samantha is that user Sam is a prefix of it.
So this is the vulnerability, partial path traversal. You'll see that there is a guard that is attempting to protect against path traversal. The reason the bypass works is what happens when you take user Sam and call getCanonicalPath on that file.
It returns a string that looks like this, and you'll notice we're missing the trailing slash that we had when we created this file. So we have this user Sam directory that we're trying to sandbox to; when getCanonicalPath gets called, we get /user/sam without a trailing slash. Then we have an attacker value come in: ../samantha/baz. That gets canonicalized, and getCanonicalPath does normalize the path, so it removes the ../. But now we're asking whether /user/samantha/baz starts with /user/sam, which it does. So this IOException never gets thrown, and the attacker is able to bypass the logic.
So how do we fix this vulnerability? One of the fixes is just putting the separator character back on there. But there are better ways to do this than comparing strings. This is the better solution, where we're using Java's Path objects to determine whether the path is safe.
So how do we find this vulnerability? We look for String startsWith calls where the argument and the subject (also called the qualifier; it's the expression the method is called on in the method chain) come from getCanonicalPath. And we want to look for cases where they don't have this safety check, because if it is safe, we don't want to touch it. We don't want to fix non-vulnerable code.
It can't be that easy, right? Well, developers write code in a lot of different ways. So what if your developer writes code like this?
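Since the slide code is not in the transcript, the following Java sketch reconstructs the shapes being discussed: the string-based guard, the same guard with its parts extracted into variables, and a `Path`-based comparison. The class name `PartialPathTraversalExample`, the method names, and the `/user/sam` paths are assumptions for illustration, not code from the talk.

```java
import java.io.File;
import java.io.IOException;

class PartialPathTraversalExample {

    // Vulnerable: getCanonicalPath() drops the trailing separator, so the string
    // "/user/sam" is a prefix of "/user/samantha/baz".
    static boolean isInsideUnsafe(File parent, File child) throws IOException {
        return child.getCanonicalPath().startsWith(parent.getCanonicalPath());
    }

    // The same bug written a different way: both sides pulled into variables.
    // A purely syntactic search for the call chain above would miss this form.
    static boolean isInsideUnsafeWithVariables(File parent, File child) throws IOException {
        String parentPath = parent.getCanonicalPath();
        String childPath = child.getCanonicalPath();
        return childPath.startsWith(parentPath);
    }

    // Safer: Path.startsWith compares whole name elements, so "sam" does not
    // match "samantha".
    static boolean isInsideSafe(File parent, File child) throws IOException {
        return child.getCanonicalFile().toPath()
                .startsWith(parent.getCanonicalFile().toPath());
    }

    public static void main(String[] args) throws IOException {
        File sandbox = new File("/user/sam");
        File attackerControlled = new File(sandbox, "../samantha/baz");
        System.out.println("string check allows it: " + isInsideUnsafe(sandbox, attackerControlled));
        System.out.println("path check allows it:   " + isInsideSafe(sandbox, attackerControlled));
    }
}
```

The paths do not need to exist for the comparison to be illustrated, because canonicalization still collapses the `..` segment.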
Where they've extracted that call to a variable, or they've pulled the argument into a variable? This is where we need something new to help us out. Oh, and what if the developer has written code like this, where they've made it safe, but they've also assigned it to a variable?
We need this concept called data flow analysis. Data flow analysis allows us to track variables as they flow through the application and see what they could potentially be at runtime. Using this, we can follow the code and determine whether a vulnerability is present. Data flow analysis can be multi-step. It can flow through ternaries; it can flow through if blocks. It allows us to determine whether there is a flow through the application that is vulnerable. So data flow allows us to uncover hard-to-find vulnerabilities and prevent false positives.
And this is what it looks like. You don't need to read all this, but it's designed to look a lot like CodeQL's data flow analysis. So if you know CodeQL, you can map your knowledge to OpenRewrite very easily. I did this for my own sanity, but also for yours. And this is where we're actually fixing this vulnerability. You can see that we're applying data flow analysis here to target the vulnerability and fix it appropriately in the code base.
I've got a little bit of an example. This was a vulnerability in the AWS Java SDK. They were using this logic called leavesRoot to check, as you were downloading the entire contents of an S3 bucket, whether a key in that bucket was attempting to path traverse outside of the destination directory, and to guard against it if so. You can see that startsWith call right there. That's vulnerable. This leavesRoot check was being used as a guard that throws this exception that says, "Cannot download key; its relative path resolves outside the parent directory." But you can see that's not sufficient, because it relies on that startsWith call on a string. So they fixed it, and it got a CVE.
But as with any good story, there's a little bit of vulnerability disclosure drama. This is a conversation that I had with the AWS security team: "Hey, we'd love to award you a bug bounty. However, we need you to sign an NDA." I said, "I don't normally agree to NDAs. Can I read it first before potentially agreeing?" And Amazon came back with the line, "We're unable to share the bug bounty program NDA, since it and other contract documents are considered sensitive by the legal team." Yeah. For the Pokémon fans.
The end of that is that they said, "That's not our policy long term. We're sorry." I published a tweet with this same story, and it blew up. They have since, but I still haven't seen their NDA, and they also haven't paid me. I did get the offer of, "We can't pay you because you're not willing to sign the NDA, but we'll give you a thousand dollars' worth of AWS credit." I said, "I don't use AWS. Can you give me Amazon store credit?" They said, "No, but if you give us a wish list with a thousand dollars' worth of items on it, we will mail you those items." I still haven't done that, but it's one of those standing offers that I need to exercise.
The third vulnerability is called Zip Slip. Remember how we talked about partial path traversal before? This is a true path traversal vulnerability. You have an attacker supplying a value of ../../../../ where they're trying to traverse outside of the destination directory while you're unpacking a zip file. So what are zip files? Zip files are key-value pairs. They're maps of: this is the file name.
This is the compressed contents of that file. You can iterate through those entries and unpack them onto the disk. Zip Slip is where an attacker gives you a malicious zip file that they've intentionally crafted such that, if you unpack it without sanitizing the entry names, your logic will write the contents of a file outside of the intended destination directory.
So what does this look like? Well, it looks like this. That's a lot of code, so let's strip it down a little bit. It's mostly to do with these two call sites. e.getName() is the name of the entry, the key of that key-value pair, and it's untrusted. It is an attacker-controlled value, and you are then using it to create a FileOutputStream and copying the contents of that entry to that output stream. So they are able to write whatever they want into that file.
Zip Slip is complicated, and the reason is this. This is the vulnerability. This is a valid fix for it, where we have this guard in place that protects against it. But there are other valid fixes for this vulnerability. This is a valid solution, but so is this. So how do we differentiate? How do we write code to determine whether this is vulnerable or not? Because, again, going back to that earlier point, we don't want to fix code that's not vulnerable. We only want to fix vulnerable code, because otherwise maintainers will just get pissed at you.
So how do we check whether there's a guard in place protecting this from being vulnerable? We need control flow analysis. What is control flow analysis? Control flow analysis lets us differentiate between these two chunks of code, because it creates a graph. The graph is made up of two node types.
They're called basic blocks, which are sets of contiguous operations in a chunk of code that will occur without a jump, and condition nodes, which are where the branching points occur: if blocks, switch/case statements. So for this chunk of code, we've got a basic block that runs from here all the way down to this if, then we have two branch points here, and then we have different return values. Control flow is a graph. It's a graph that you can traverse, asking at each condition node, "Is this a guard or is it not?"
For this non-vulnerable code, we can build the control flow graph, navigate it, look at this node, and say, "There's a guard here that uses toPath().startsWith. That's safe." The failing branch only reaches this IOException; it doesn't reach the copy logic. There's not a vulnerability here, because it's been guarded against sufficiently. So we can determine whether this code is vulnerable and not fix it if it isn't. But if it is vulnerable, we need to fix it, so we add that guard.
And then when we combine it all together, we're just adding this guard. At the end of the day, that's what's going on.
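To make "adding this guard" concrete, here is a minimal Java sketch of a Zip Slip guard applied to an untrusted entry name before anything is written. The exact code on the slide and the exact code the recipe generates are not in the transcript, so the class name `ZipSlipGuardExample`, the method `resolveEntry`, and the exception message are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;

class ZipSlipGuardExample {

    // Resolves an untrusted zip entry name against the destination directory and
    // refuses any entry whose resolved location falls outside that directory.
    static File resolveEntry(File destinationDir, String entryName) throws IOException {
        File target = new File(destinationDir, entryName);
        Path destinationPath = destinationDir.getCanonicalFile().toPath();
        Path targetPath = target.getCanonicalFile().toPath();
        // The guard: compare Path objects, not strings, so a sibling directory
        // sharing a name prefix is rejected as well.
        if (!targetPath.startsWith(destinationPath)) {
            throw new IOException("Bad zip entry: " + entryName);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        File destination = new File(System.getProperty("java.io.tmpdir"), "zip-slip-demo");
        System.out.println("accepted: " + resolveEntry(destination, "docs/readme.txt"));
        try {
            resolveEntry(destination, "../../evil.sh");
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In an extraction loop, a check like this would sit between reading the entry name and opening the output stream, so the throwing branch can never reach the copy logic.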
You're just throwing a guard in there to protect the code. Here's a more complicated example, where we're cleaning up some of this logic too as a part of that, and then just slapping that guard in there to fix the vulnerability. Here's an even more complicated case, where we're changing things inside this try-with-resources block. We're extracting some variables as a part of it, because the variable doesn't exist yet; we need to create it so we can use it in the if block and then use it again later.
So let's talk about pull request generation. What time do I need to stop? What am I at? Perfect. Okay, awesome.
If you've got security vulnerabilities, everybody gets a pull request. Let's talk about pull request generation, especially in the context of GitHub. How fast can we generate pull requests? When you're generating a pull request, you've got three major components. You've got file I/O, which is stuff that happens on your local machine. You've got Git operations, which on GitHub are not rate limited. And you've got GitHub API calls, which are rate limited.
The first thing you want to do is check out the code and download the repository. Then you want to branch, apply the diff, and commit the changes. You want to fork the repository on GitHub. Then you need to rename the repository on GitHub. The reason is this: let's say I fork a repository named rewrite from one organization, but there's another organization with a repository of the same name, rewrite. If I fork both, the names collide, and GitHub says, "I can't create that fork for you, because you've already got a repository of that name." Thankfully, since this talk was created, these have become one API call.
They were traditionally two. Then you push the changes, and then you create the pull request on GitHub. You'll notice that there are three (thankfully now two) calls that use the GitHub API.
Let's talk a little bit about GitHub's rate limiter. GitHub's API documentation says that for write requests, they want you to wait at least one second between calls. So if you're trying to generate a thousand or two thousand pull requests, that's three API calls times a thousand, so at least three thousand seconds. But then there's a secondary rate limit that says, "You're actually doing this too fast. Here's a header back to you. Please wait at least that amount of time." Okay, you can handle that. Then there's a third rate limit that's like, "You're just using us too much. Please stop," and that's just arbitrary. It's very annoying to run into. As I've said to the people at GitHub: if there's any way you could stop rate limiting your API so aggressively, that'd make my life a lot easier.
We hope to use this same flow, but a little differently, moving forward at Alpha-Omega: to do this privately, not publicly. This is the way that I have done this so far, but the goal is to use GitHub's private vulnerability reporting (PVR) for this same flow.
So we've made it this far. The vulnerability has been detected, the style has been detected, we've fixed the code and generated the diff, and we've bypassed the rate limit, kind of. How do we do this for all the repositories?
This is where I'm going to introduce Moderne. I should have updated this number; I need to check my own slides more thoroughly. It's free for open source projects.
They have 32,000 projects listed as indexed, but I think only 28,000 of them have full ASTs for the languages and are not just raw files. One of the things you run into is that if you can't invoke the compiler, you can't build the AST, and that doesn't work for builds that use, say, Ant. It works for Maven and Gradle and things like that. So if a project is Ant-based, they just give you text files, which you can still modify.
Moderne lets you run OpenRewrite transformations at scale, and you can use it to generate and update pull requests. This is the SaaS. This is their UI. They have over 800 OpenRewrite recipes, including complete framework migrations.
Okay, why might a framework migration be relevant in security? Take, for example, a big vulnerability in Spring. (I can't imagine Spring ever being vulnerable.) There have been projects that were told, "You're running Spring 1.x; you need to migrate to Spring 2." Well, that's a big lift in and of itself. Part of that you may not think about is that if you're going to migrate from Spring 1 to Spring 2, you also need to migrate from JUnit 4 to JUnit 5, because they dropped support for JUnit 4 when they migrated. JUnit 5 was a massive rewrite of the testing framework: new imports, method parameters reordered, new annotations, new ways of invoking the thing.
It's a massive lift. You don't think about your testing framework being a security blocker, but it can be, when framework migrations are blocking you from using the latest version of Spring, which has a vulnerability fixed in it. So you can use OpenRewrite and the OpenRewrite recipes to help you migrate to later versions of JUnit, to get you running the latest version of Spring.
You can also use it to bulk generate pull requests across open source, which is what I've been using it for. Please start. So I can run a recipe across thousands of open source repositories. I can say I want to commit it with a pull request. I can put a branch name in there, I can put the commit message, I can put the commit title, I can put the organization that I'm going to fork it under, and I can supply my GPG key, both private and public. And then they will go off and generate pull requests as if they're me, using my account and my GPG key, across open source. So it's awesome for actually deploying these vulnerability fixes at scale across open source. Cool.
But there are more than just 7,000, well, more than just the 28,000 repositories they now support, across the world. How do you find the other vulnerable projects? Well, this is where CodeQL comes in. CodeQL indexes over a hundred thousand open source projects and 35,000 Java projects. So Moderne is getting close to that, but we're not entirely there. You can write a CodeQL query that lets you scan all of those open source projects for a vulnerability, and then say, "This is worth my time to create an OpenRewrite recipe to fix."
And if you want to add more projects to the set of projects indexed by Moderne, there's this GitHub repository.
Just contribute a line to this CSV file and they will start indexing it for you. So if you want to opt into the Alpha-Omega work too, where we're fixing vulnerabilities at scale, just make sure your repository is listed in this file, and we will include it in the set of projects we generate pull requests for.
So finally, let's go generate some open source software pull requests. That's what I did: temporary directory hijacking, Zip Slip, partial path traversal. I mean, I've done more, but these are the ones that I talked about in this talk. And these are some statistics from that work. Using that original bot that I wrote, I've generated 1,596 pull requests, with a 40% merge rate. JHipster was a different vulnerability; using some other technology, I've generated 3,467 pull requests. Another vulnerability, which was not done by me but by GitHub: they generated 1,885 pull requests. And then using Moderne I've generated 64, 50, and 164 pull requests to fix real security vulnerabilities across open source. In 2022 I generated north of 600 pull requests across open source, and in my career I've generated north of 5,200 pull requests to fix various security vulnerabilities across open source. There's one unlucky project that was the recipient of all three of my campaigns. And this is my contribution graph for 2022. And there's more to come.
So let's talk about some closing thoughts, some best practices for bulk pull request generation. First off, messaging. All software problems are people problems in disguise. This is very true in this work. You have real maintainers. I mean, we're all mostly part of the Open Source Security Foundation here; we're all dealing with real people, right? I'm throwing a technology solution at a real person problem, right?
We are, but we need to be cognizant that there are real people on the receiving end of these reports. So as part of that, you want to include messaging in the pull request that includes a full description of the pull request: not just the vulnerability, but why we fixed it, how we fixed it, and even context that you, as a security researcher, might not want to communicate but that is relevant. And, like, how do you opt out of this? You know, we're going to do this for you regardless, but if you want to opt out, here's how to do so.
So, some simpler lessons. Lesson number one: sign off on your commits. This is what the sign-off on your commit message looks like. Why, you might ask? A while ago there was a lawsuit, yada yada yada. TL;DR: lawyers. It just makes your life easier. Sign off on your commits; otherwise your pull request will be rejected by evil dragon bureaucrats.
Be a good commit citizen: GPG-sign your commits. Otherwise you can get impersonated, like Linus Torvalds has been multiple times on GitHub.
SECOM. SECOM is a commit format. I'm not going to go into it here, but basically it describes how to format your message such that it communicates all the information about the vulnerability that you're fixing.
Lesson number four: there are risks to using your personal GitHub account to do this work. Is anybody here familiar with GitHub's angry unicorn? This one. That was my GitHub profile page for most of 2020.
I broke my GitHub profile because of the number of pull requests that I generated. It would not load; you'd get a timeout, and this was the error that you'd get. They fixed this, but there are risks to doing this. However, I would still suggest using your personal GitHub account for this work. The reason being, if it looks like it's coming from a bot, maintainers are going to react negatively to you. You want an actual conversation; you want to be approachable. You're not a bot, you're a real person. And so I recommend using a personal account to do this sort of work.
Lesson five: coordinate with GitHub, or whichever repository host you're working with. Reach out to the GitHub Security Lab before you do this work, and they can help craft your messaging to make it less likely that you're going to be banned from GitHub. They can't prevent it, but they can make it less likely.
And then lesson six: consider the implications. Shortly after beginning this work, I received this message, this issue against my security research repository: "Is this responsible disclosure?" Now, I don't use the term responsible disclosure; I use the term coordinated disclosure, because it's more nuanced, yada yada yada. But the answer to either of those questions is no. This is full disclosure of a security vulnerability in an open source project. You are 0-daying a maintainer and their downstream users. There is an impact to that. Now, what I would have said, and have been saying at Black Hat and DEF CON, was: I believe the trade-off is worth it.
However, that was because at the time there was no support for automating bulk pull request generation in a way that was private. GitHub is actually working on a way to make this possible with PVR. They're adding an API that should be shipping in the next couple of months that will make it so that we don't need to 0-day maintainers. But at a certain point you have to make it public if the maintainer is not responsive, right? So you still have to engage in that conversation, in that dialogue. But we want to try to do so in a way that's more responsible, more ethical, if we can. And that's the work of a SIG that I'm working on called the Autofix SIG, under the Vulnerability Disclosure Working Group, to set up policies and processes that will help us have a better answer to this question than I did back in August of last year at Black Hat and DEF CON.
So in conclusion: as security researchers, I believe we have an obligation to society. We know these vulnerabilities are out there. We are the ones that understand these vulnerabilities. We've written them up in pentest reports. We've received them in our own emails. We understand how these vulnerabilities exist and how they express themselves. We have that knowledge. There's a statistic from GitHub: for every 500 developers there is just one security researcher. We are wildly outnumbered in this industry. We have this knowledge in our heads. Most developers do not. Spoiler: most developers don't watch Black Hat and DEF CON talks. We have this knowledge in our heads, and I believe we have an obligation to leverage it and use it to the best of our ability. And so with that, I think that we can use this knowledge of math, science, and technology to actually achieve something powerful and actually fix these vulnerabilities at scale across open source. With that, I'll leave you with one final quote from Dan Kaminsky.
"We can fix it. We have the technology. Okay, we need to create the technology. All right, the policy guys are mucking with the technology. Relax. We're on it."
So some final notes: learn CodeQL, contribute to OpenRewrite, and you can actually deploy your security fixes at scale. Join the GitHub Security Lab and OpenRewrite Slack channels. And then join the Open Source Security Foundation, which I know most of you have. And if you want to get involved in the conversations about how we can do this responsibly, there's a SIG meeting at 4 p.m. Eastern every Wednesday discussing this topic of how we can do work where we're automating security fixes at scale across open source, and doing it in a way that's respectful of maintainers but also actually gets this stuff fixed.
So thank you to the Open Source Security Foundation, Alpha-Omega, Moderne, HUMAN, Lydia Juliano, the Black Hat speaker coach that I worked with to create this talk, and Sham, my intern from last year, who created some of the graphics you saw about control flow analysis and helped actually create control flow analysis with me in OpenRewrite last year. So that's me. Thank you.
I have time for questions, but you probably don't have time for questions. You probably want to get lunch. Okay, does anybody have any questions that they want to pose to me? Go for it. Oh, there's a mic. You're not obligated to stick around; if you want to, feel free to leave. But I'm going to hang out here. I've got a bit of time. I do have a flight at some point; we're going on one of the harbor flights. But that's at like two, so we're good.
So 20%, you know, accept rate is maybe not bad, but have you looked at, have you sampled it at all to see if they were fixed in another,
you know, MR or PR, or if that project was defunct, or kind of where the other 80 percent of your PRs are going?
So, okay, for some of these... This one, I included closed pull requests, because a lot of the time these were closed but merged independently. For the rest of them I did not include the closed ones, because a lot of these people just closed them because they didn't want them. So that was a lot of manual curation of that data myself. But yes, we have had pull requests closed but merged. The other ones: a lot of dead projects. Most open source, at least on GitHub, I'd imagine, is just a project that somebody has as a one-off, right? That's not to say it's not worth fixing, because it's really hard to determine: is this a one-off, or is it actually important, critical software holding up the entire industry? So we fix it all, because it's just bits at that point. You're just shoving bits on the wire. This one is a lot of one-off projects. So JHipster: this is a vulnerability from a code generator that was vulnerable. It was pumping out copies, 15,000 copies across GitHub, of the same vulnerability, because the code generator was vulnerable. So I fixed the code generator, and then we had to go fix the code that had been generated by it. But a lot of these were people just standing up little projects, like for a class or something, and then going away. So that's why we got a 2.3 percent merge rate. That's fine. Does that kind of answer your question?
Yeah, it does.
I was wondering if we could start to use, you know, open PRs from Moderne or something as one indicator of project health.
Yes, that's potentially a long-term goal. Right, we're a known actor in this space, in the Open Source Security Foundation. Can we use those statistics as a way of indicating that this project is maybe deader than the other ones, and flag that in Scorecard or something like that? Yes, that is a potential discussion that we can definitely have. Anybody else, unless you have a follow-on? Yeah, to your right, and then we'll go to you. All right, just ice cream cone it. Nope, and that's off.
Okay, so thinking of two issues: false positives and false negatives. Yes. When I think about false negatives, I'm thinking of a developer trying to apply a fix but doing it incorrectly. Yes. With something like that getting past it. And false positives, I'm thinking of someone that's got intentionally vulnerable applications that they put out there.
So this has actually generated pull requests against OWASP WebGoat, which is very funny. They're like, "Yeah, we actually want to leave this vulnerability in there." And I'm like, "Yes, of course you do. Just close this pull request; it's not relevant to you." But I do use it as a testing ground, so, sorry. As to false positives, the goal of this work is that at best we're fixing a valid security vulnerability; at worst...
We're actually hardening the application, right? That's the goal. So we'd like to not be fixing things that we don't need to fix at all, but there may be places where this guard is not relevant because they're unpacking a JAR file that's known to be trusted. But it's still good to put that check in there anyway, because the file might be corrupted, and you don't want to be writing outside of the directory because it's a corrupted file, right? Code gets reused. It gets used in LLMs, and LLM output gets copied and pasted around. One of the things we find with LLMs, too: there was a talk at Black Hat called "In Need of 'Pair' Review", about how you can get GitHub Copilot to generate non-vulnerable code or vulnerable code. One of the things they determined was that if you have vulnerable code with, like, SQL injection vulnerabilities in it, you're much more likely to generate new, additional code in that same file that also has SQL injection vulnerabilities in it, because it's mimicking your file as it is, right? So if you have code that's not exploitable because all it's being used on is trusted files, and then that same code gets reused as a utility, now you've got a vulnerability in it.
There's someone over there; you've got a question. Yes, go for it.
Hi, so this is kind of tangential, and I don't know if this has already been answered, but you're generating a lot of pull requests on a bunch of repositories that you've identified, using CodeQL, as vulnerable. Yes. Are they linked in some sort of CVE database? Because you're already openly disclosing, so at that point the responsible disclosure question is somewhat moot. Yeah. Is it then worth maybe linking that those repositories are vulnerable,
so things like Syft and Grype can then notify dependents?
GSD is the plan, somewhat. So MITRE and CVE don't love CVE numbers that are not human curated. They don't love automation. The Global Security Database, GSD, is less, I don't know, less picky about the numbers that come in. And so one of the things we've talked about is we can assign GSD IDs to all of these and then maybe later get CVEs if they're relevant. It gives an identifier, even though it may not be the CVE number that everybody wants, and then we can add that on later as a part of this work. But the goal later is also to use private vulnerability reporting, which will give the maintainer the option of saying, "Yes, this is a valid security vulnerability," and they can then use GitHub to get a CVE number for it. So it's part of the process. GitHub won't issue CVE numbers if the maintainer is not in agreement that it needs one. Now, you can get CVE numbers if the maintainer is not in agreement with you; you just have to go to MITRE or another CNA to say, "This maintainer is not in agreement with me, but I still need a CVE number."
Is it part of the recommendation of the PR to perhaps...
Yes, it is currently part of the body: hey, if this is a real security vulnerability, not just in a test or something like that, yes, you should fix it, and we will help you get a CVE number if you think it's relevant.
One more, and I could probably ask you afterwards, but the Trellix guys recently created like 60,000 pull requests, and I'm just interested in how they bypassed the GitHub...
I think they said that they ran their thing for like a week. They just played the time game. Or maybe it was longer than a week.
Yeah, I was curious about that.
Yeah. I have talked to GitHub. There are ways of getting past the rate limit if you talk to the right PM and the PM is willing to work with you, yada yada yada, but it's not easy. You have to, you know, play the game. So, yeah, that's it. I'll be around. I'm going to go for a plane ride out there, and then I'll be back around. You'll see me; I'm the one with the duck. So don't hesitate to come up and say hi or ask me questions. And thank you all for coming. I really appreciate it, and have a wonderful rest of your con.