All right guys, welcome to the next talk here on this wonderful Sunday morning. Thank you for showing up, getting out of bed, and rolling into the track. Right now I've got the pleasure of introducing Dylan on stage, who's going to be talking about finding secrets in source with truffleHog. So please welcome him to the ToorCon stage.

Thank you for the kind introduction. Folks hear me okay? Cool. So basically I wrote this tool a little while ago that does as described: it helps identify secrets committed to source code. I'm going to be talking about why I wrote it, what my motivations were, how to use it, and what the path forward looks like.

At a high level, I need to start with why I'm trying to solve this problem in the first place. What is this problem? Is it a problem? I think most folks are probably at least at a high level familiar with certain security incidents and breaches that have occurred from secrets being committed to source code. That's generally a result of secrets being committed to source code and that source code becoming open source, but there are incidents that occur from internal source code as well.

This can also be used for lateral movement. If you move onto a machine, whether a developer workstation or a production machine, and that machine has source code on it, you can potentially extract secrets from that source code and then move laterally with those credentials. It can also help elevate privilege: same idea. You take a credential from the source code, you move to another machine in the environment, and you may have elevated your privilege that way. Sometimes exploitation of this can be hard to detect. If you grab a credential and use it against a public cloud endpoint, it can be very hard for an organization to tell whether that credential was used for a legitimate reason by the application or was being used by a malicious person.
If a workstation gets lost and the secrets are hardcoded on disk, it can be really easy to pull those secrets out if you don't have full disk encryption set up. And then the last bullet point here: source code is kind of leaky. You may accidentally expose your .git directory on your website, and then all of a sudden all your source code becomes public. Or someone internally may intentionally publish your source code. Or your source code may get pastebinned, or a hundred other reasons. I'm sure folks in the audience are very familiar with these types of things. Ultimately, source code ends up leaking out in ways that we can't really control. And so to help prevent the fallout from that, we try not to put secrets in source code.

So this is one example. Reed, the guy who runs BSides SF and who also works at HackerOne, had $2,000 paid out through the HackerOne bug bounty, because someone submitted an issue basically saying, "Hey Reed, you published your GitHub token to public GitHub," and they paid out two grand for that.

This is another example. Some researcher just went to GitHub and did a Google dork style search: "let me search for all the Slack tokens," and found 1,500 of them. So he was able to just squat in all those Slacks. You can imagine tons of those tokens probably belonged to companies, and he was able to grab all the chat history from all those companies.

There's another example: a developer accidentally committed an AWS token, and that organization was served with a $2,000 bill that month from AWS, because someone had taken that credential and used it to mine Bitcoin on the company's account. That's a pretty expensive mistake. But it does get worse. In this particular instance, again an AWS token committed to GitHub, the organization was served with a $64,000 bill. That's almost somebody's salary, because somebody accidentally committed an AWS credential to GitHub. So these can certainly be very expensive mistakes.
And then I'm sure most folks are familiar with this one. Uber: in the last year, a researcher identified an AWS credential that was committed to their GitHub. That researcher may or may not have extorted Uber for $100,000, and Uber paid out. It became national news because that credential had access to tons of user information, and it definitely got a high amount of visibility; I think the CISO of Uber ended up testifying in front of Congress about it.

What folks may not remember is that just two years prior, Uber had another AWS token committed to GitHub. This one was actually kind of interesting, because GitHub was subpoenaed by Uber for all the IP addresses that accessed that resource; they wanted to know if anyone had used it for nefarious purposes. So it speaks to how this mistake can lead to subpoenas, litigation, multi-thousand-dollar bills on your account. And it's sometimes really hard to figure out who's used a credential; in this case, they had to subpoena GitHub to get the answer to that question.

So this is not a talk telling you where you should store your credentials. There are a ton of options out there, and depending on your infrastructure, you should pick the option that makes the most sense for you. I'm not here to tell you to store them in environment variables or Unix domain sockets; you should do that research for yourself. This is a talk telling you not to store them in GitHub. You can see truffleHog up there as that tiny border collie in the corner, and all those lambs are your AWS tokens. truffleHog's goal is to herd them all into your secrets management solution.

So, at a high level, when we think about where source code lives, it sounds like there's an obvious answer: it lives in version control. But when you stop to think about it a little further, our source code actually lives in a lot of other places.
For example, package managers contain source code, and package managers and version control can be out of sync with one another. What's published to your internal or external npm or PyPI, take your pick, may not necessarily reflect what's in GitHub. Mobile applications: every time you download a mobile app, say an APK for an Android app, you're downloading a whole bunch of source code. Folks have spiked on identifying secrets that get packaged into those APKs and shipped out to users. Slack: folks will post snippets of code and upload files asking for help, so tons of source code ends up in Slack. Websites: this one's kind of a funny one, but if you start analyzing HTML and looking through comment blocks, JavaScript variables, and stuff like that, you'll find tons of credentials. Sometimes they're commented-out basic auth credentials that were used when the application was internal, just as the first round of auth, but the password is still there, and it's a sensitive password used in other contexts.

But the bulk of my talk is going to focus on the last bullet point: revision history. In version control, you have the most recent incarnation of the source code, and that's typically where people go to look for vulnerabilities. But we have this entire mountain of buried source code in the revision history that we don't tend to pay a lot of attention to.

So I have an example here. This is Facebook's React public GitHub repository. On the top, in green, you can see all the contributions that were added to the project. But more interestingly, on the bottom, you can see all the code that was taken away from the project since it started. There's as much code, if not more, buried in revision control on the bottom as there is in the current version on the top. And this is consistent across most projects you look at.
There's actually a tremendous amount of code buried in version control, and this is a problem when it comes to secrets in source code. For most vulnerabilities, this isn't an issue: you have a problem, you patch it, you fix it, and folks only download the most recent version of the code going forward. But if you commit a secret to source code and then push over the top of it, that revision is still accessible. And if that token is live, it can still be found and still be used.

This is really common, and one of the most common reasons, I think, is that a developer accidentally commits a credential and pushes it upstream. Other folks pull that down to their local workstations. The developer realizes the mistake, and they know that if they rewrite the history at that point, all the other developers will have to do a force pull, merge, and fix their Git history, and it'll be really visible. So to save themselves the embarrassment, oftentimes they'll just push over the top of it, rationalizing that no one will ever look back there; it can't be found because it'll be buried under a mountain of other commits. That's a really common pattern I've seen.

Another common pattern is that a new feature comes along that completely replaces an old feature. Maybe you're working on a new application and you decide to temporarily store files in S3. Then at some point you say to yourself, actually, I want to move to SQS and use its pub/sub instead of temporarily using S3. So you replace that large swath of code and put a new credential in, an SQS credential replacing the S3 credential. But the S3 credential is still there. It's still in revision control.
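The "push over the top" failure mode is easy to demonstrate locally. This is a minimal sketch, assuming `git` is on your PATH; the file name and the secret are made up. It commits a secret, "removes" it with a follow-up commit, and then recovers it straight out of history:

```python
import subprocess
import tempfile
from pathlib import Path

def run(*args, cwd):
    """Run a command in `cwd` and return its stdout."""
    return subprocess.run(args, cwd=cwd, check=True,
                          capture_output=True, text=True).stdout

repo = Path(tempfile.mkdtemp())
run("git", "init", cwd=repo)
run("git", "config", "user.email", "dev@example.com", cwd=repo)
run("git", "config", "user.name", "dev", cwd=repo)

cfg = repo / "config.py"
cfg.write_text('AWS_SECRET = "hunter2"\n')  # hypothetical hardcoded secret
run("git", "add", "-A", cwd=repo)
run("git", "commit", "-m", "add config", cwd=repo)

# The "fix": push a new commit over the top instead of rewriting history.
cfg.write_text('AWS_SECRET = os.environ["AWS_SECRET"]\n')
run("git", "add", "-A", cwd=repo)
run("git", "commit", "-m", "removed password", cwd=repo)

# Gone from the working tree, but trivially recoverable from history:
assert "hunter2" not in cfg.read_text()
assert "hunter2" in run("git", "log", "-p", "--all", cwd=repo)
```

The only real fix is rotating the credential and rewriting history; a "removed password" commit just leaves the secret one `git log -p` away.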
And later, when somebody does a review of the source code and finds that SQS credential, removing it and rotating it doesn't remove the old S3 cred that's still live and still buried in source control.

That leads into the next point. When folks go to do their open source review — if you work for a company and you want to open source something, usually there's an open source review process where security does an audit — the vast majority of the time, and this includes myself, we only look at the latest incarnation of the code. We don't go back and read all the old buried history; we don't read the negative commits that I showed earlier. And if devs know this going in, they're going to want to do some cleanup before it reaches the open source review. When they do their cleanup, they may fix some cross-site scripting vulnerabilities, they may do some last-minute tune-ups, and one of those tune-ups may be removing credentials. But they may use the pattern I mentioned before to do it: they'll just push over the top, and then the security person will get the code and do the review, looking only at the latest incarnation, not the buried commit history.

So, I have nothing up my sleeves. I took this screenshot an hour ago because I wanted it to be as fresh as possible. I went into GitHub and did the same Google dorking approach: I searched "removed password" on GitHub. You can see there are almost half a million commits of folks removing their password, and if you click one, you can see what the password was, because it's still in revision control. There's another one: "removed AWS key." Again, those breaches and bills I gave you were examples from a year ago; this is from a day or two ago, and the top example looks live. There's nothing stopping people from using these.
And if I were a bad guy and wanted to do some nefarious stuff, I wouldn't want to put my credit card down with AWS and give them a bunch of attribution on how to find me. I'd probably either use a stolen credit card, or just go find somebody else's credential and do it from their account. This is a really easy way to do that.

Say you want to man-in-the-middle some traffic: this is a really great way to get SSL certificates. It's really hard to figure out which cert goes with what without a tremendous amount of effort. The effort I'm envisioning is: maybe you pull all of these private keys, then go to all the top websites, maybe the Alexa top million, pull all their public keys, and cross-reference to figure out what goes with what. Most people aren't doing that. So there are probably a ton of live, valid creds here, and nobody's expending the effort to figure out what's live and what's just a test certificate that doesn't matter.

And this talk is on secrets, not specifically tokens or passwords. So here's another really good exploitation use of going through revision control. Let's say you're doing reconnaissance on a company because you want to break into them. One of the things you may want to do is enumerate all of their domains, both internal and external. A lot of times, when you're developing a project internally, before it becomes open source, you have a ton of references to internal hostnames, which, again, you'll strip out for the open source review, but they'll still be in revision control. So it's very easy to go through and find all of those old internal domain references, and then figure out all kinds of topology about the internal and external environment based on the domain names that were committed to source code.

So here's an example that Netflix gave me permission to show. Again, it's just the same thing.
They pushed over the top of this credential. It was buried. It's no longer live, but you can still access the non-live version on Netflix's public GitHub: somebody just committed an AWS credential.

So basically we need some way to scan these old commits. Nobody wants to go through and read all the old negative commits just to look for this one class of vulnerability. That's really the reason I made truffleHog. You can't really grep for these; I'm not exactly sure why, I'm not an expert in Git internals, but the blobs that Git stores in the .git directory are not in a format that can be easily grepped. So I just wanted a tool that could go through all the old revision history of all the branches and find secrets that are otherwise not in the latest version of the source code. It's an open source tool, and it does exactly that: it goes through all the branches, and its job is to identify secrets that were intentionally or accidentally committed to source code.

When I first started the truffleHog project, I wanted to prove the point with the most minimal amount of effort I could to find and identify these secrets. So the way I first set it up was: go through all the old revision history, and if anything looks high entropy, let me know. That way I could find a lot of secrets, but there'd be a lot of noise with it as well: URLs can have high entropy, and large blobs of base64 have high entropy. I put some restrictions on it, like it has to match certain character sets and certain lengths, and it was effective. It did find that Netflix AWS token I mentioned before. But one of the problems was that it false-positives like crazy. This is the exact same repo from Netflix, and just a couple of lines down it false-positives on a URL, because the URL had a bunch of entropy in it.
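That first-pass entropy heuristic can be sketched in a few lines. The threshold and minimum length below are illustrative values for the sketch, not truffleHog's actual tuning:

```python
import math

BASE64_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="

def shannon_entropy(data, charset=BASE64_CHARS):
    """Shannon entropy of `data`, counting only characters in `charset`."""
    if not data:
        return 0.0
    entropy = 0.0
    for ch in set(charset):
        p = data.count(ch) / len(data)
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy

def looks_like_secret(word, threshold=4.5, min_len=20):
    """Flag long strings whose base64-charset entropy exceeds the threshold."""
    return len(word) >= min_len and shannon_entropy(word) > threshold
```

A check like this would typically flag a random 40-character AWS secret key, but it flags high-entropy URLs and base64 blobs just as readily, which is exactly the noise problem described here.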
So this tool was really good for pen tests, and it was good if you wanted to do a one-off open source review of a piece of software. It's great for bug bounties, quite lucrative: a bug bounty hunter can run this and go through all the false positives, because they don't care, they're just looking for the one payout. But this is really bad for DevSecOps. This model doesn't scale well. If you deliver these results directly to developers, every time they push a URL to revision control they get an alert saying they pushed a secret, and eventually they just start tuning it out. The same goes for the security team: if security gets an alert every time this entropy detection fires on any source code anywhere in the company, they'll start tuning it out too, because it's tons and tons of false positives. So it was really good for one-offs, for sensitive assets, for bug bounties, but not that great for a company using it at scale.

So I pivoted a little bit. I did exactly what I didn't want to do, and I wrote a bunch of really high-signal regular expressions to look for specific types of secrets. That way we can suppress the entropy detection and only run the regexes, and when those flag, we can be a lot more confident giving the results directly to a developer, or setting up a triage queue for security to go through, and it won't be quite as bad. You can see a screenshot of some of the regexes up there. The big downside, though, is that there are a ton of different types of tokens out there that this doesn't flag on. You can see the whole list of regexes up there, and I'm sure you can think of public cloud API keys that aren't on that list. I do accept new pull requests, but that list is going to continue to grow forever.
And doing things this way will miss all of those tokens. Another downside is that it still requires some manual triage: after it identifies an AWS token, it doesn't know whether it's live or not, so somebody has to come in and figure that out. But one of the upsides of doing it this way is that it will in some cases detect low-entropy secrets. For example, I have a regex that I haven't pushed yet, but will in the next couple of days, that identifies when somebody hardcodes a password into a URL, before the domain. That password can be super low entropy, but the regex would pick up on it anyway; the entropy detection wouldn't.

So when I say high signal, this is what I mean. When I first started, on the left is what a regex for a GitHub access token used to look like: basically, if the string "github" followed by a 35-to-40-character hex string showed up on a single line, I'd have it flag. And that false-positived like crazy. The reason why: any GitHub URL usually has a commit hash in the URL, and that satisfies the regex, so it false-positives. A giant monolith of minified JavaScript would false-positive because somewhere in this two-megabyte file is the string "github" and somewhere else is a hex string.

So I spent some time refining and tweaking these to make them more accurate. What I ended up doing for this particular one is: if the string "github" shows up, and within the next 0 to 30 characters a quote or whitespace shows up, followed by the right character set at the right length, terminated again by a quote or whitespace, then alert me. So I'm really doing the best I can to make sure the rules I introduce are as high signal as possible, so these can be given directly to developers and we don't have a huge amount of false positives.
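Here's roughly what that refinement looks like as Python regexes. Both patterns are illustrative reconstructions of the rule described above, not the exact expressions shipped with truffleHog:

```python
import re

# Naive version: "github" plus a 35-40 char hex string anywhere on the line.
# False-positives on any GitHub URL containing a commit hash.
NAIVE = re.compile(r"github.*[0-9a-f]{35,40}", re.IGNORECASE)

# Refined version: "github", then at most 30 characters, then a quote or
# whitespace, then the token charset/length, terminated the same way.
REFINED = re.compile(r"github.{0,30}['\"\s]([0-9a-f]{35,40})['\"\s]",
                     re.IGNORECASE)

# A commit URL (hypothetical hash) and a hardcoded token (hypothetical value):
url = "https://github.com/org/repo/commit/4a7d9e5c1b2f3a8d6e0c9b4f7a1d2e3c5b6a7f81"
cred = 'GITHUB_TOKEN = "4a7d9e5c1b2f3a8d6e0c9b4f7a1d2e3c5b6a7f81"'
```

The naive pattern matches both strings; the refined one matches only the hardcoded token, because the commit URL never puts a quote or whitespace between "github" and the hex string.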
So when you hook all of that up, this is the model in my head of what that looks like, and I've implemented it a couple of times. You have some sort of hook that fires — this could be a build pipeline; you could have Jenkins or something like that kick it off — where every time a new commit or build comes in, truffleHog runs. You give truffleHog a flag telling it to run from a given commit onward, so you keep track of what you've already scanned; you give it the new commit, truffleHog scans it, and you deliver the results somehow to someone. That could be directly to a developer, to an SRE, or to a security engineer. Then you have to triage them: figure out what's live and what's not, deal with whatever false positives there are, and finally you've got to remediate.

When I say remediate, it's kind of a pain in the butt. First you've got to pull the secret out of source code; use something like the BFG Repo-Cleaner to do that. Then you've got to rotate the credential, keep track of which ones have been rotated and which haven't, and do all that in a secure way. That's another tricky nuance here: once you set all this automation up and you're identifying these credentials, it's a bit dodgy to have a big repository of "hey, look, here's exactly where all the cleartext credentials are." So you have to be clever about the way you do that. You don't want to log the credentials, for example, because then you've got another repository of cleartext credentials that you don't want. So you have to be careful with the way you're storing them and keep track of that. And finally, as I mentioned before, all the other folks have already moved forward in their Git histories, so if you go back and scrub something out, everybody has to do a forced merge and deal with the annoying merge conflicts that come from that.
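The hook-and-triage loop might be sketched like this. The flag names (`--json`, `--regex`, `--entropy`, `--since_commit`) follow truffleHog's CLI as I recall it, but verify against `trufflehog --help` on your installed version; the state file and the AWS liveness check (via `sts:GetCallerIdentity`, which any key with any permissions can call) are illustrative:

```python
import subprocess
from pathlib import Path

STATE = Path("last_scanned_commit.txt")  # hypothetical per-repo state file

def build_scan_command(repo_url, since_commit=None):
    """Assemble a truffleHog invocation covering only new commits."""
    cmd = ["trufflehog", "--json", "--regex", "--entropy=False"]
    if since_commit:
        cmd += ["--since_commit", since_commit]
    return cmd + [repo_url]

def scan(repo_url, head_commit):
    """Scan from the last recorded commit, then remember where we stopped."""
    since = STATE.read_text().strip() if STATE.exists() else None
    out = subprocess.run(build_scan_command(repo_url, since),
                         capture_output=True, text=True).stdout
    STATE.write_text(head_commit)
    # NOTE: `out` contains cleartext secrets -- deliver it to a triage
    # queue over a secure channel, and don't write it to ordinary logs.
    return out

def aws_key_is_live(access_key, secret_key):
    """True if the key pair can call sts:GetCallerIdentity.

    Only test credentials your organization owns -- testing a stranger's
    key may violate the CFAA and most bug bounty terms.
    """
    import boto3  # third-party; pip install boto3
    from botocore.exceptions import ClientError
    sts = boto3.client("sts", aws_access_key_id=access_key,
                       aws_secret_access_key=secret_key)
    try:
        sts.get_caller_identity()
        return True
    except ClientError:
        return False
```

Wiring `aws_key_is_live` onto the scan output is what turns the static findings into a queue of true positives for AWS keys specifically.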
But if you remember back earlier, I mentioned most of what I'd be talking about would be focused on cleaning up revision history, but there are a ton of other places we keep our source code. I gave that big list, and the second one on the list was package managers. If you remember what I said earlier, package managers can be completely out of sync with GitHub. The reason why, in the case of npm and PyPI, the two that I spiked on — and this is by no means a complete list of package managers with this problem — is that these package managers only look at the file system when you package to them. They don't look at what's in Git or what's in GitHub.

So for example, if you're an engineer working on a project, and you have some testing script or some environment-variable-sourcing script in the directory of your project that you don't commit to GitHub, then when you run your "let me package this up for PyPI" script, there's a good chance those scripts will end up on PyPI but won't end up in GitHub. And again, when the reviewer does the review — whether that's an open source review, an internal security audit, or just a code review — people are checking what's in GitHub. Nobody's pulling down the package, untarring it, and reading that source code when they're already reading what's in GitHub; nobody wants to double that effort. And these packages are also versioned, just like revision control: you have a long history of the same piece of code, pushed up again and again and iterated on, and all of those old versions could potentially have these problems. You may have an environment-variable-sourcing script that shipped in one of the old versions, and nobody has the time to go through all the near-duplicate incarnations of the same code looking for it. So recently I spiked on a way to go through and scan those in the same capacity, and what I found was basically: if you publish to npm or PyPI, and
anywhere in the description of your project you have the string "aws," there's about a 2% chance of that package having a live AWS credential. Now, when I first did this, I notified the folks where I could find them, but that string was the criterion I used to decide which packages to scan, so I'm sure there are a ton of other projects out there with live credentials, and I'm sure since that time more folks have committed live credentials.

I mentioned some of the reasons; I'll say them again. It could be environment variables. You could have test scripts, maybe tests you haven't staged yet. Maybe experimental code — I'm personally guilty of this: when I'm iterating, before I commit to version control, I'll just inline the credential because I want to test something first, when I'm first trying out an API and figuring out how it works. That code, if it's in an active project where you're committing stuff, could end up getting zipped up and sent to npm or PyPI.

So, not by any means saying this was a good name for it, but I have a new script called Santa Hog that's designed to do just that. I guess my thinking was that you're getting goodies out of a package, and that's kind of what Santa does, and I kept the "hog" for consistency. But name aside, it goes through all the old versions of a package on npm and PyPI, and it scans with the exact same regexes, and it's got the high-entropy flag as well if you want to do the entropy detection. It's also open source and available on my GitHub.

So here's an example of its output. It doesn't look quite as pretty as the truffleHog output, but here I'm running it on one of Uber's packages, TChannel, and you can see it flagged a couple of times: it flagged on RSA private keys, it flagged on an AWS credential, all in this one package. But if you look at it a little more carefully, you'll notice that the things that flagged were in a directory called node_modules. And if you know
anything about npm, node_modules is basically the directory where all of your dependencies get stored. So TChannel's dependencies live in the node_modules directory, and in this case what TChannel did was take their node_modules directory and push it to version control. You can kind of think of that as like a statically compiled binary, except most people don't do this; it was probably done by accident, and it was only in one of the versions of TChannel. But what they did was introduce a dependency of a dependency of a dependency of a dependency, and all the way down that chain, that dependency — somebody else's, way outside of Uber — committed an RSA private key and an AWS credential. So Uber took that other person's secrets, packaged them into their own package, and shipped them with TChannel. If you were to scan Uber's packages, you'd find credentials that actually have nothing to do with Uber and were just accidentally included because they included all their sub-dependencies.

So if we go back to our diagram, we have the same triage and remediation steps; we're just adding an additional hook from when we package to npm and PyPI to run the Santa Hog tool. This is what the new, revised DevOps pipeline could look like if you're scanning those two sources of code.

Because at this point I had two projects relying on these regular expressions — and a bunch of other projects started doing the same thing; for example, somebody wrote a Go version of truffleHog called gitleaks, and they had basically copy-pasted the regular expressions over — I decided to pull the regexes out and put them in their own package. That way the whole community can point to one common place and we can all write the same rules. Some other examples of projects doing similar things: Yelp's detect-secrets is sort of the same deal; it's like an enterprise-y version of
truffleHog. And Lyft did the same thing as well: they forked truffleHog, they have some enterprise features, and they've also submitted a bunch of features back to my version of truffleHog. But I pulled the regexes out so all these projects can point to one unified source of truth for the rules, and then they can use their own engines.

One thing I want to harp back on for a second: I mentioned that you have this manual triage phase where someone figures out whether or not a credential is live. truffleHog is, at its core, a static analysis tool. But when you combine static analysis with dynamic analysis, you can potentially remove this triage phase: you can automate the testing of these credentials and create a system that only outputs true positives. Basically, truffleHog runs, it finds an AWS credential, then you test whether that credential is live, all automatically, and if it is live, you get notified. That's really easy to do for public cloud — you can do it for an AWS credential. It's a little harder to do for, say, an RSA private key; you'd have to do something like I mentioned before, aggregating a whole bunch of public keys and cross-referencing them. So this can work in some situations; in other situations it doesn't work as well.

This is what that can look like. It's just a really simple Python script I wrote that takes an AWS key and calls GetCallerIdentity on it. That's an API endpoint that any AWS credential, with any permissions, can always call. If we get an exception, we know the key is not live, and if the key successfully makes the call, we know the credential is live. Now we can pretty much automate the whole process when it comes to AWS keys in particular: anytime a developer commits an AWS key, we'll know whether or not the credential is live, and we
can automatically start remediation without having to deal with any kind of false positive or any kind of triage.

That sounds good, but it comes with a huge asterisk. I'm not a lawyer, and I'm not going to give you any kind of legal advice, but if you remember back to the Uber example, that wasn't Uber's credential; it was some stranger's credential. And if you start taking those and authing with them, you're probably violating the Computer Fraud and Abuse Act. Again, not a lawyer, not going to tell you whether it is or not, but taking somebody else's cred and signing into something without them giving you permission to do it sounds like a CFAA violation. So you have to be really careful with that. And you probably shouldn't do this with bug bounties either, for the same reason: companies will typically say, if you find a credential, stop testing and report it to us; don't test it. If you were to set up this automated process of just automatically logging into services, you'd probably violate the CFAA and probably violate the bug bounty terms of service. So just be careful with systems like that. I say that knowing that the benefit you get from this is probably worth the accidental CFAA violation, but I guess it's something to be cognizant of nonetheless.

And as I mentioned before, source code lives in many places, and I've only spiked on a very small portion of those places. Some other folks wrote a tool that does basically the same thing for mobile apps: you give it an APK name, it pulls down that mobile app, extracts the APK, decompiles it, and looks for secrets in it. That tool is great, and I've definitely used it before. But just think of all the other places we have source code: Google Drive, email, wherever else. These are all potential future directions for this project or similar ones. And that leads right into this point: I could use the community
help for the rest of this. Of all the package managers out there, I spiked on two, and of all the places source code could live — all those bullet points — I've only looked at two. I could really use help building this out more. I also accept pull requests for new rules, provided they're high-signal enough, and I've seen contributions from companies like Netflix, Microsoft, Lyft, and others. It's all available on my GitHub, and I encourage community support; the regular expressions are available for other related and tangential projects. These are the resources for the things I've talked about so far — by all means check them out, take a picture. And that's pretty much all I have for you, so I guess I have another couple of minutes if there are any questions.

So there's a question — I had a little bit of a hard time hearing it, but I heard "plans to integrate with Gerrit"; it's very popular, but there are things Gerrit does that change some things, and it would be nice to have truffleHog as a plugin for Gerrit, so people can just use it rather than writing their own hooks. Gerrit's a code review tool, right? Yeah. So the question is basically: are there plans to take truffleHog and build it into existing code review tools? I don't have any direct plans to do that, but that being said, it sounds like a good idea, and I'd be happy to work with folks who would like to integrate it with other solutions. It is importable as a library, so if you're already in Python it's relatively easy to do, but otherwise feel free to chat with me after and we can figure out a way to do it.

Next question: do you have any benchmarks for how fast it is? I don't, but I know that it's changed a lot, and a lot of that has come from the community, who have added performance improvements. When I first rolled this out as a proof of concept, it used to go through each branch and scan each commit
even if the commits were the same across branches, even if the hashes matched up exactly, which is a huge waste of time. Someone from the community pointed that out and added functionality so it wouldn't rescan those commits, and the time dropped dramatically. A couple of other optimizations were added as well. So it went from, if you were to run this on the Linux kernel originally, probably a week or more to finish, down to, if you were to run it on the Linux kernel today, probably on the order of hours. And that's an extreme example; on most projects it'll finish in seconds to minutes.

Yes, up front? The question is: is there any way to run the entropy detection first and then filter those results down using the regexes — so you only keep what has high entropy and also matches a regex? There isn't native functionality for that, but like I mentioned before, you can just import the library and run it twice in that order, and that should be relatively easy to do. Happy to help you out with that if it's something you want to pursue further.

Any other questions? Yes — the question is: what's the format of the rules? It's currently JSON. A lot of people yell at me and tell me it should be YAML, but I do it in JSON because I really don't like YAML deserialization bugs. So it's just a JSON mapping of key to regular expression.

Other questions? Cool, well, thanks for having me, everyone.
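To make the two-pass idea from that question concrete: the composition is just two filters in sequence — entropy first, then a token regex over the survivors. This is a toy, self-contained sketch with an illustrative threshold and a single hex-token pattern, not truffleHog's actual rules or API:

```python
import math
import re

HEX_CHARS = "0123456789abcdef"
TOKEN_RE = re.compile(r"[0-9a-f]{35,40}")  # illustrative hex-token rule

def entropy(s, charset=HEX_CHARS):
    """Shannon entropy of `s`, counting only characters in `charset`."""
    probs = [s.count(c) / len(s) for c in charset if c in s]
    return -sum(p * math.log2(p) for p in probs)

def two_pass(candidates, threshold=3.0):
    """Pass 1: keep high-entropy strings. Pass 2: keep regex matches."""
    high_entropy = [s for s in candidates if entropy(s) > threshold]
    return [s for s in high_entropy if TOKEN_RE.fullmatch(s)]
```

A repetitive hex string like `"deadbeef" * 5` matches the regex but fails the entropy pass, and a high-entropy URL fails the regex pass, so only strings that look token-like on both axes survive.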