Understanding vulnerabilities outside of your own code can be challenging. They can exist in dependencies, such as other people's code that gets pulled in alongside yours. And they can also exist when secrets, such as passwords, are passed between applications. As monolithic applications get broken into bite-sized, sometimes single-function modules, the use of APIs skyrockets to tie all of this together. For an API from one app to interface with another app often requires passing these secrets. So you can imagine the use of secrets is also increasing, along with the potential for their exploitation. How big is this problem? Well, Mackenzie Jackson from GitGuardian is going to share a study on this very topic. You're going to see how scanning public Git repository commits uncovered 2 million leaked secrets in 2020. I'm sure this talk will include some great insights, so be sure to ask some questions.

Hi, everyone. My name is Mackenzie, and I'm the developer advocate here at GitGuardian. Today I want to take you on a bit of a journey inside public Git repositories and the sensitive information that we can find in there. Now, I'm sure you've heard stories about sensitive information, like secrets, being found inside public Git repositories. But last year, in 2020, we decided to really uncover how big a problem this is, and we started scanning as many public repositories as we could. And it turns out that secrets inside public Git repositories are a pretty big problem. In fact, as the title may suggest, we uncovered nearly 2 million secrets in that one year alone. In this presentation, I'm going to run you through the methodology of the research we did and how we detected and validated those secrets. We'll run through the results of what we found and where we found them.
And then lastly, I'll spend a little bit of time on some preventative measures and best practices that you can put in place to prevent your secrets from ending up out in the wild.

So let's start with exactly what secrets are and how we use them in software development. Well, the way that we build applications has changed over the years. It used to be true that we built monolithic applications where everything that needed to run was internal. Now, as we've become more reliant on the internet, we can leverage microservices. Microservices can be internal and external, and they can be things like payment systems, managed databases, your cloud infrastructure, really anything that we can outsource that does a particular job. This speeds up our development process and allows us to really focus on what it is that we're trying to solve. But it comes with a cost. You see, to be able to connect to all these different microservices, we need to leverage secrets, and we do this programmatically. This means that these secrets are used in code, but they're extremely sensitive. This creates a problem, because while we have all these secrets, and a company can have hundreds of thousands of different secrets depending on its applications, we need to both manage them and securely protect them, wrapping them in authentication layers. But we also need to distribute them, because our developers need these secrets to be able to build the application and test their systems. Our services need these secrets too: we need to include them in our CI pipelines and for testing. And this creates our problem, because as sensitive as these secrets are, and you can think of them as the keys to the kingdom of your organization, they're also widely distributed. This creates a problem that we call secret sprawl. Now, secrets can sprawl in a number of different places. They can sprawl inside internal documentation, internal wiki pages, even within your internal messaging systems.
Anywhere you find code, you can find secrets. But Git creates a particular problem on top of that. That's because the way that we use Git actively promotes this distribution of code, and that's great for development. But when sensitive information finds its way into Git, well, there's no way of knowing where it will end up and how many times it's been copied and distributed. You see, even if your secrets are inside your private Git repositories, everyone that has access to that private repository has access to the secrets inside it. It will still be cloned onto multiple different computers, and private repositories can be made public with secrets buried in the history. And this is because with Git, nothing's ever truly deleted. Yes, you can rewrite your Git history, but normally we just commit over it. That creates a trail of what we've done and makes it easy to go back, but it means that when a secret gets buried in the Git history and committed over, well, it's hard for us as software developers to find it and become aware of it. But if you're a malicious actor and you know what you're looking for, well, this is the perfect target. And that's why we've seen an increase in attackers really targeting Git repositories. We've seen this in supply chain attacks, such as the recent Codecov breach, where attackers were able to gain access to private Git repositories. Another reason why this is really problematic is because of how we use Git. Take this very standard Git tree that we have here, a very simplified version. We have a main branch and a development branch. And let's say that we're creating a new feature on our development branch. In order to get this feature working, we just quickly commit some secrets, hard-coded in there, because we're just testing to see if it works. And we have no intention of pulling this into our main branch.
But as we get closer to reviewing and merging this in, well, we clean up our branch, remove our secrets, add them in as environment variables, and handle them the way that we should. The problem is that when the reviewer comes to check out this code, well, they compare the latest version in that development branch with the latest version in the main branch. And if your secrets are buried hundreds of commits behind, then a manual review is not going to pick this up. Unless you want to check every single commit in there, but I would say that that's an unreliable way to ensure that there are no secrets inside your repositories. So we can see this in private repositories, but what about in public? Well, what we uncovered through our study is that public repositories are not immune to this at all, and code very regularly makes its way from private repositories to public, even when it's not intended to be there. If we look at just a very regular graphical representation of a Git repository, we see almost exactly the same amount of code added as deleted. You can see here the additions in green and the deletions in red. And what this shows is that there's almost the same amount of code hidden behind commits as there is publicly accessible. So the amount of code that we need to protect and scan through is a lot greater. And this creates one of the problems. In fact, if you really want to understand how big this problem is, a little example that you can do is to go to a public Git repository host such as GitHub and search for "removed AWS key" in the commit message. You'll find thousands of commits with this message. That's because we often commit over the top of something, thinking that the original version is gone. But this misunderstanding really creates some problems when it comes to the security of our applications.
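To make this concrete, here is a small sketch, in Python, driving git through subprocess, that replays the scenario above in a throwaway repository. The file name and the secret value are made up for illustration; the point is only that a secret "removed" by a later commit is still recoverable from the history:

```python
import os
import subprocess
import tempfile

def run(*args, cwd):
    """Run a git command in the given repo and return its stdout."""
    return subprocess.run(args, cwd=cwd, capture_output=True,
                          text=True, check=True).stdout

repo = tempfile.mkdtemp()
run("git", "init", "-q", cwd=repo)
run("git", "config", "user.email", "dev@example.com", cwd=repo)
run("git", "config", "user.name", "Dev", cwd=repo)

# 1. Hard-code a (made-up) secret just to get the feature working, and commit.
with open(os.path.join(repo, "config.py"), "w") as f:
    f.write('API_KEY = "EXAMPLE-NOT-A-REAL-KEY-123"\n')
run("git", "add", ".", cwd=repo)
run("git", "commit", "-q", "-m", "quick test of new feature", cwd=repo)

# 2. "Remove" the secret before review by committing over it.
with open(os.path.join(repo, "config.py"), "w") as f:
    f.write('import os\nAPI_KEY = os.environ["API_KEY"]\n')
run("git", "add", ".", cwd=repo)
run("git", "commit", "-q", "-m", "removed API key", cwd=repo)

# The working tree is now clean, but the full history still holds the secret.
history = run("git", "log", "-p", cwd=repo)
print("EXAMPLE-NOT-A-REAL-KEY-123" in history)  # True
```

A diff-based review only compares branch tips, which is exactly why the buried commit goes unnoticed by humans while `git log -p` surfaces it instantly.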
So this led us to embark on this project to really uncover and quantify how big this problem is: how many people are actually exposing sensitive information in public Git repositories? We did this by going to the largest available source of public data that we could find, which is the public repositories on GitHub, and we scanned every single commit that came through throughout the year of 2020. In fact, there were over a billion commits; we scanned nearly two and a half million every single day. One of the reasons why we chose this data set, and it raises an interesting thought, is that while people understand that a public repository is public in the sense that you can share the link and someone else can view it or find your project, what they don't understand is that it's not just public, it's actually often broadcast. GitHub, for example, has an API. This made it really easy for us to scan all of these commits, because every five minutes the GitHub API broadcasts everything that's been done: every commit, every public event. So it's very easy to monitor. And it's not just us that's monitoring it, it's also malicious actors. This is why, even if you have a public repository that you think no one will be looking at, well, malicious actors can easily find it. So with all this information, a literal fire hose of data coming through, we had to build a robust detection method. With a huge amount of data like this, and something as difficult to detect as secrets, we had to build individual detectors. You see, secrets are quite often high-entropy strings, strings that appear random, but there are a lot of high-entropy strings that aren't secrets: URLs, universally unique identifiers. So we had to be very, very certain that what we were finding was actually a true secret. So we built specific detectors for each secret type. We ended up building over 250.
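To illustrate why entropy alone isn't enough, here is a minimal Shannon-entropy estimate for a string. The token values are made up, and GitGuardian's actual detectors are not public; this is only a sketch of the heuristic the talk describes:

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Estimate bits of entropy per character from character frequencies."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A random-looking (made-up) token scores high...
print(shannon_entropy("x7Kq9ZtRb2mN4vLw8cJd"))
# ...but so does a perfectly harmless UUID, so entropy alone
# produces false positives on URLs, UUIDs, hashes, and so on:
print(shannon_entropy("550e8400-e29b-41d4-a716-446655440000"))
# A repetitive string, by contrast, scores near zero:
print(shannon_entropy("aaaaaaaa"))
```

This is why per-provider detectors that key on a known token format, rather than a single generic entropy threshold, keep the false-positive rate manageable.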
So, for example, an AWS cloud provider key had a specific detector that was searching just for that key. This enabled us to really fine-tune these detectors and ensure that we didn't get too many false positives during this initial gathering stage. Then we trained our detectors on historical GitHub data because, well, there's an awful lot of it, so we could really make progress in training our systems to weed out those false positives. The last step was to get rid of the final dregs of any false positives that we might have. Once we had collected our results, we then filtered them. We used weak signals that we found in specific cases, again using the individual detectors, to discount anything that might be a test credential, anything where it didn't make sense for that secret to be in that specific file type, or areas of files that just generated a lot of false positives. And the last thing that we did is we actually validated the secret, where possible, against the provider: for example, checking that a Slack token is actually a valid Slack token. This meant that not only were we finding true positive results, but we were also weeding out any secrets that were no longer active. So we had a pretty solid foundation to benchmark our results against. So what did we discover? Well, as we've already touched on, each day we found about 5,000 secrets through this method. When we compared this to the historical GitHub data that we could gather, it was an increase of about 20%, although I must say that the amount of code also increased by about the same amount, so relatively it's the same. And something that was initially not a surprise, but became increasingly interesting when we dug into it, was that 85% of the results that we found were on personal public repositories and 15% were on organizations' repositories.
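As a flavor of what one such per-provider detector plus filtering might look like: AWS access key IDs have a publicly documented fixed format ("AKIA" followed by 16 uppercase characters), so a detector can match that shape and then discount likely test credentials. The "weak signal" keywords below are assumed heuristics for illustration, not GitGuardian's actual rules:

```python
import re

# AWS access key IDs follow a documented format: "AKIA" + 16 chars.
AWS_KEY_RE = re.compile(r"\bAKIA[0-9A-Z]{16}\b")

# Assumed weak signals used to discount likely test credentials.
FALSE_POSITIVE_HINTS = ("EXAMPLE", "SAMPLE", "XXXX")

def find_aws_keys(text: str) -> list:
    """Return candidate AWS keys, filtering out obvious test credentials."""
    hits = AWS_KEY_RE.findall(text)
    return [h for h in hits
            if not any(hint in h for hint in FALSE_POSITIVE_HINTS)]

snippet = """
aws_access_key_id = AKIAIOSFODNN7EXAMPLE   # AWS's own documented example key
aws_access_key_id = AKIAQWERTYUIOPASDFGH   # made-up, but shaped like a real leak
"""
print(find_aws_keys(snippet))  # ['AKIAQWERTYUIOPASDFGH']
```

A real detector would go further, for instance validating the candidate against the provider's API to confirm it is live, as described above.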
Now, this wasn't surprising in itself, but what was surprising is that when we started looking into the results we were getting and validating them, we realized that a lot of corporate secrets were being leaked on personal public GitHub repositories. For instance, an employee might push to the wrong repository and inadvertently make the organization's code public in their personal repository. So this was really quite a surprising result that we didn't expect to see so much of. Another thing that we did was look at where these leaks were coming from. And, to be honest, we were quite surprised that India, followed by Brazil, were the top two results. These are countries that do have large engineering populations, so it does make sense, with the United States coming in at number three. But it's still quite interesting to see that this is a problem that's not contained to one area; it's really quite widespread everywhere. Next, we can take a look at exactly what secrets we found. Number one, topping the list, was Google keys. Now, this includes both Google Cloud provider keys, something that's really quite sensitive, and, on the other end of the spectrum, things like Google Maps keys. So there's a lot of variance in terms of sensitivity here, but an awful lot of Google-related keys were being leaked. The next category that we found a lot of was development tools. These are things like Django, RapidAPI, or Okta: things that developers use within internal applications. We found an awful lot of these. Now, these can be very sensitive, because they can provide access to the inner workings of your application and allow you to launch quite a sophisticated attack from the inside.
We also found a lot of data storage keys, so access to databases mainly, MySQL or MongoDB databases, which was quite concerning, because this can provide direct access to your customers' PII or can be used in multiple ways to manipulate data during an attack. Other key mentions are messaging systems, cloud providers, and also private keys. What's interesting, particularly about the messaging systems: when we were running through these results with pen testers and other white-hat hackers, we learned that one of the favorite playbooks attackers use is to try to get access to your messaging system. They can do this in a number of ways, but scanning an employee's public GitHub history is one way to see if you can gain access to these channels, because once you're in there, you can find a trove of additional sensitive information. Quite often keys are passed between colleagues on these systems, or you can also try to convince the administrators to give you access to additional services. So having these messaging keys in public places is really a great way for an attacker to get a foothold into your organization, move laterally between systems, and then elevate the attack from there. Now, we can also take a look at where we found these secrets. Again, at first glance this is probably not surprising, but as we dive into it, we can start to see some interesting results. The top two file extensions in which we found secrets were Python and JavaScript, the two most popular languages on public repositories. Within these, we found that secrets were either being hard-coded into the application itself or added into configuration files of the same file type. We also found a lot of data serialization files, so JSON, XML, YAML, or .properties files. These are often used to configure infrastructure or your environment. And within here, we also found a lot of downloadable files.
So, for instance, providers such as Google can give you a downloadable JSON or .properties file, which you can plug directly into your application, with the keys and their variables preset. It makes it easy for you to hit the ground running, but it does mean that if you don't take that secret and handle it in a proper and correct way, it's very easy to capture it with, say, a wildcard git add command. And the next thing you know, it's inside your Git repository. The last file type that we found a lot of was what we're calling forbidden files. These are things like .env files, environment variable files, and .pem files. These are highly sensitive. They often contain a lot of information, a lot of secrets, and they really should not end up in Git repositories. And this is where having a .gitignore file to make sure that these don't end up in your repositories is really quite important, something that we found a lot of repositories did not have and that could have prevented a lot of these secrets from ending up out in the public. So now that we have looked at the results from the study, we can start to ask the question of why this happens. Why are secrets being leaked so predominantly within these public Git repositories? Well, a lot of it, unfortunately, really comes down to human error. I say unfortunately because it's something that is quite difficult to protect against. You can accidentally push code to the wrong repository, and Git can be quite unique in this, in that we have to manage our local Git account and the account on our cloud repository. We can also use the same accounts for personal and professional use. This is particularly true on GitHub, where you tend to have one account that you use for everything. That makes it really easy to make the mistake of accidentally pushing something to the wrong host or to the wrong repository and releasing something that shouldn't have been released.
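A minimal .gitignore covering the "forbidden files" just mentioned might look like the sketch below. These patterns are common examples, not an exhaustive list, and should be tuned to your project (for instance, some projects legitimately version certain .properties files):

```gitignore
# Environment variable files
.env
.env.*

# Private keys and certificates
*.pem
*.key

# Credential files downloaded from providers (hypothetical names)
credentials.json
service-account*.json
```

Note that .gitignore only prevents *untracked* files from being added; a file already committed stays in history until the history itself is rewritten.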
Another reason is that secrets are buried in history. We saw a lot of repositories that were made public that were once private, companies wanting to open-source tools. But within these, buried in the history, as we've talked about, there are secrets, and just checking the current version is not enough. Unlike other vulnerabilities, such as cross-site scripting or injection, where once you update your application and your code, the latest version is really all that matters and the vulnerability no longer exists as long as you've released a new version, with secrets that's not true, because if you remove a secret, it's still in your history. That makes it quite a unique problem, and one that's easy to forget about. Another culprit that we saw was secrets being exposed in log files: debug logs, error logs, or other auto-generated files, your bash history, for example. People forget that these are in there, and again, capturing them with these wildcard git add commands makes it very easy to accidentally push them into your repositories. We also talked about having sensitive files within your repositories that shouldn't have ended up there. It can be a combination of relying on a .gitignore file or not having one, and these files accidentally ending up where they shouldn't. And then finally, the last example, which was very surprising, is that we saw a lot of repositories that were managing their secrets in Git. Now, of course, a lot of these were once private repositories, but you can see that there is a pattern of secrets actively being managed within Git. It's convenient, it's a central place to keep them, and it's easy to distribute them among the team, but it takes a very small mistake for all of those secrets to become public. So let's talk quickly about what we found and how you can prevent this. How can you prevent your organization's secrets from ending up in these public places?
Well, number one, let's use the correct tools to manage our secrets. If we have a lot, we need to invest in a KMS or a vault (slightly different things). And I know that something like HashiCorp Vault can be a bit of an investment to set up, but this is really fundamental when it comes to maintaining best practices in managing these secrets. What we have to realize is that Git is not a vault, and we shouldn't treat it as such. So, point number two: never store secrets inside Git. You know, secrets are just as sensitive as, if not more sensitive than, your credit card numbers. So ask yourself the question: am I comfortable storing my credit card number inside this repository? If the answer is no, then you shouldn't be storing your secrets in there either. Number three: try to use short-lived credentials. Put a time limit on them, and make sure that you don't create these god credentials, credentials that give permission to everything and last forever. We want to have a time limit on the credentials that we create. It enforces good hygiene, and it means that if something is buried in the history, well, hopefully, if it ever becomes public in the future, it isn't active and doesn't pose a threat. Additionally, on that same theme, we want to restrict access to our APIs to the minimum scope. I know it's convenient to have an API key that lives forever and provides access to everything. It requires less management and, ultimately, I guess, fewer secrets. But the problem is that when these get out, they can really be used to cause damage. Attackers can elevate their privileges and move laterally through systems. We want to prevent this by making sure that the permissions of our API keys are as minimal as possible. Does it need to be read and write, or can it just be read? And can we limit the range of IP addresses that can access it, so we know that only our own infrastructure can use that credential?
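As one concrete illustration of minimal scope, cloud providers let you express both the read-only restriction and the IP restriction in policy form. The sketch below uses AWS IAM policy syntax; the bucket name and IP range are made up for illustration:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadOnlyFromKnownRange",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::example-app-assets/*",
      "Condition": {
        "IpAddress": { "aws:SourceIp": "203.0.113.0/24" }
      }
    }
  ]
}
```

A credential scoped like this is far less valuable to an attacker even if it does leak: it can only read one bucket, and only from your own network range.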
And five, one of the important steps here: as we've just discussed, humans aren't very good at detecting secrets in their code, and there are lots of ways that secrets can become public or end up within Git repositories, so we want to detect them as soon as possible. The best way to do this is with automated detection. And there's a very good service for this: since we're at a GitLab convention, GitLab offers secret detection, and it's offered in all of their plans, including their free tier. They cover over 20 types of secrets, such as API keys, encryption keys, even U.S. Social Security numbers, and there is the ability to add custom rules in there too. And if you're on the Ultimate plan, with the security features in the security dashboard, you can also gain more visibility around the secrets. If you need more power than that, or a higher level of detection, then there is the commercial option. I won't go into this too much, but at GitGuardian we have 250 types of secrets that we cover, with developer tools like pre-commit hooks to prevent secrets from getting into repositories in the first place, and dashboards so that you can follow through on remediation. And then the final option is to build a secret detection service in-house. I have here some open-source repositories that are quite popular, TruffleHog and Gitleaks, and Gitleaks actually provides the underlying rules for the GitLab detection. These are great tools for building your own internal secret detection solution if you can't find one that meets your requirements. Now, particularly in large organizations, using these open-source tools out of the box can create a lot of false positives and a lot of frustration, which is why I put them in the category of build-your-own: there is some additional work that you need to do, but they provide a great foundation to start with.
But as a minimum, I think if you're using GitLab, then take advantage of what GitLab has already set up in their free plan, which is their secret detection. They were one of the first VCS hosts to actually include secret detection as a feature, and not only in their top-tier plan. So definitely, as a minimum, if you take anything away from this talk, I'll be happy if you just turn on the GitLab secret detection, which can be done through the pipelines. All right, that's it. I'll be around in the messages, so if you have any questions for me, I'm more than happy to answer them. I hope that you took something away from this talk, and I hope you enjoy the rest of GitLab Commit. See you, thank you.
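For reference, turning on GitLab's secret detection in a pipeline is done by including the built-in CI template, along the lines of the fragment below (template path as documented around the time of this talk; check the current GitLab documentation for your version):

```yaml
# .gitlab-ci.yml -- enables the built-in Secret Detection job
include:
  - template: Security/Secret-Detection.gitlab-ci.yml
```

With this include in place, a secret-detection job runs as part of the pipeline and reports any findings alongside the other job results.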