So welcome to Pillaging DVCS Repositories for Fun and Profit. A little bit about me: I'm Adam Baldwin, not to be confused with this Adam Baldwin. I co-founded nGenuity, a pentesting firm; I'm a pentester of webs, and I curate evilpackets.net. That's about it. So what the fuck is a DVCS, for those of you that might not be familiar with Git and the like? It sounds like a great tool for developers, for managing revisions and things like that. The problem is that this is being used by developers and by operations to push web apps out to production. It's a really popular tool, right? Like, git push, and now your shit's in production, right? The problem is that they're leaving these dot meta directories public. We're supposed to be blocking these; it's pretty standard practice to block dot directories. However, these are getting left in web roots. And as an attacker, we want access to all of the information that's in that repository. The situation might be that we're doing a pen test, or we're an attacker, and we want to get some foothold on this site. This might be that foothold. And no, that doesn't say dot gov. So in the situation that I ran into, a site had left the .git meta directory in their web root, and I wanted to see: okay, what type of exposure does this create? As you'll see later on, we can't just simply clone the repository out of the web root; it doesn't work that way. So the first step is identifying repositories, and then we need to figure out how to extract information from those repositories.
And then, of course, some random process, and then we'll profit, as you will see. So here's how we identify repositories. It's pretty damn simple. All three of them, Git, Mercurial (hg), and Bazaar, and there's others, but I focused on these three, have predictable file names. They all have something in the repository that you can access and then run a regex match against. These are very lightweight files, which is why I picked them: they have great patterns and they're small. There are other files that you could use, but they're of variable size, and it's a real pain when you have to download a couple of megs of files just to pattern match against them. As an example: example.com/.git/HEAD. And it could be any directory; it doesn't have to be just the root. If you want to see the patterns that we're using, there's a plugin for w3af, the web application attack and audit framework scanner. Just go look at the code; the patterns that we're using are in there. Very useful for identifying these. As far as I know, there are no other scanners that are looking for these. Please, if somebody knows of one, correct me and let me know afterwards. So we wanted to understand just how large a scope this problem was, right? Who's doing this? So we scanned the top million Alexa sites and we found roughly 2,000 repositories out of those million sites. So it's not real common, but as you will see, the impact and the information that we can extract from these repositories is pretty significant. Extremely useful to an attacker, and it gives you that foothold that you might need. Obviously, the most popular of all those was Git, so that's what we'll demo in a bit. So once you've identified that you have a Git repository, or a Mercurial or Bazaar repository, you need to clone that repository. And you can't simply run, say, git clone against that URL.
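The predictable-file fingerprinting described above can be sketched roughly like this. The patterns below are illustrative stand-ins modeled on the checks the talk describes; the actual regexes live in the w3af plugin source, and `identify_repo` is a hypothetical helper, not part of any released tool:

```python
import re
import urllib.request

# Lightweight, predictable files with recognizable contents.
# Illustrative patterns only; see the w3af plugin for the real set.
FINGERPRINTS = {
    ".git/HEAD": re.compile(rb"^ref: refs/|^[0-9a-f]{40}"),  # symbolic ref or detached SHA-1
    ".hg/requires": re.compile(rb"revlogv1"),                 # Mercurial repo requirements
    ".bzr/README": re.compile(rb"This is a Bazaar control directory"),
}

def identify_repo(base_url, fetch=None):
    """Return the exposed DVCS meta directory (".git", ".hg", ".bzr") or None."""
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read(4096)  # fingerprint files are tiny
    for path, pattern in FINGERPRINTS.items():
        try:
            body = fetch(base_url.rstrip("/") + "/" + path)
        except Exception:
            continue  # 404, timeout, etc. -- try the next candidate
        if pattern.search(body):
            return path.split("/")[0]
    return None
```

Because these files are tiny and their contents are distinctive, one GET per candidate path is enough to fingerprint a site.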
So here's the process that we came up with to actually extract data from the repositories. First thing you need to do is check for directory browsing. If directory browsing is enabled on the site, it's game over: you can simply wget the .git directory recursively, then restore the repository with git reset --hard, and you've got your contents. You've got the source code, you've got the entire tree. If that doesn't work, we need to get the predictable files. There's an index file in every single one of them, an index (Git) or a dirstate (Mercurial). That index file or dirstate contains a listing of all of the stuff that's in that repository, and for a web application, that's the web root. That's all the crap that they're sticking in that web root: source code files and anything else that might be included. So even if you're not able to extract any additional data, you can glean a lot of information from just that single index file, which is a predictable file in all instances. After we get that index and parse it, we need to download the references that the index points to, and then we can try to restore the repository. In some instances that's not 100% possible because of how the repository stores data, but you're going to get enough information. So, pillaging. Once we have our information, what kinds of things are we looking for? What are we going to find? Well, easy enough: we're going to find platform details. We're going to find backup files, SQL dumps, source code. It's a web app, right? That's what it's going to be. Credentials, certs, API keys, just to get you started thinking. Out of those top million sites, here's the kind of stuff that we found.
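As a rough sketch of that index-parsing step: a version-2 Git index starts with a 12-byte header (`DIRC` magic, version, entry count), followed by fixed-size 62-byte entry headers carrying each file's SHA-1 and flags, then the NUL-padded path. The parser below is a minimal illustration under those assumptions (index extensions and newer index versions are ignored; `parse_git_index` is a hypothetical name, not the toolkit's code):

```python
import struct

def parse_git_index(data):
    """Yield (sha1_hex, path) pairs from a version-2 .git/index blob.

    Minimal sketch: 12-byte header ("DIRC", version, entry count),
    then per entry 40 bytes of stat data, a 20-byte SHA-1, 2 bytes of
    flags (low 12 bits = path length), the path, and NUL padding so
    the entry's total length is a multiple of eight bytes.
    """
    magic, version, count = struct.unpack(">4sLL", data[:12])
    assert magic == b"DIRC" and version == 2
    pos = 12
    for _ in range(count):
        sha1 = data[pos + 40 : pos + 60].hex()
        (flags,) = struct.unpack(">H", data[pos + 60 : pos + 62])
        name_len = flags & 0x0FFF
        path = data[pos + 62 : pos + 62 + name_len].decode()
        yield sha1, path
        entry_len = 62 + name_len
        pos += entry_len + (8 - entry_len % 8)  # skip 1-8 NUL pad bytes
```

Even without fetching a single object, this listing alone tells you every file name sitting in the web root.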
And I'm not going to go through all of them, but some of the more interesting ones: SQL dumps of entire databases, htpasswd files, Excel documents, mail spools. You're kind of getting the picture of the stuff that we're going to find. And basically, these repositories shouldn't have this stuff checked into them. That's one of the core problems: not just that they left the meta directory hanging out in the web root, but that the developers are checking this crap in. Right? They're not being audited, and no one gives a flying fuck what they put into the repository. And even if they did, and they go, oh, I shouldn't put that in there, they just remove it. And they forget about the old revision that's hanging out in the tree. So you can find stuff that's not just in the current revision: if you look a few revisions back, you're going to find some really juicy stuff, potentially, just by looking at the logs. So here's the epic fail montage. Remember: pillage, then burn. Just some of the really, really fun stuff that we found. Database passwords, in shell scripts. Database dumps. That does not say dot gov. As well as user tables. Absolute piles and piles and piles of SQL dumps that people had done and just said, oh, we'll check them into the repository and push them to our Git server, because that's cheap backup. And it's a really good idea. Gitignore. The .gitignore file, or the .hgignore file. Every one of these has a different ignore file, but they're really useful. It's the developer's way of saying: we don't want to be bothered by this stuff in our repository, so we're going to ignore it. We're not going to check it into the repository, but we're going to put a pattern into the ignore file.
That's a really good place to look, because it's like: oh, these are the juicy things that you don't want us to know about, but now we know about them. It's kind of like the robots.txt thing. How about customer invoices? This particular site had, I'm pretty sure, every customer invoice they had ever generated hanging out in their repository. They weren't predictable file names in the web root, so you wouldn't have been able to easily brute force them with something like DirBuster, but you could extract them from the repository. How about some API keys? Apparently they still use MySpace. Google API, Facebook, an SSH key for their EC2 instance. Those aren't useful at all, are they? I don't know. htpasswd files. This is a really interesting one as well. It's a German site where, when you went to the actual file in the web root, it was htpasswd protected. But if you pull that file out of the repository, you can obviously get it. I'd really be curious about the discussion; I'd love to have this discussion with the EFF: is this actually bypassing access control? These files are public because they're in the .git directory. They're just sitting there. You don't actually have to bypass access control to access them. So what type of line are you crossing if you're accessing this stuff in the .git directory? In the meta directory? Coincidentally enough, these are account numbers and routing numbers for a very large pile of their customers. So yeah, no authorization required. How about credit card numbers in doc files? Seriously? What the fuck? Yeah, this is actually a non-U.S. company, so I don't know, PCI? Wow, that went fast. So we have a demo, and then a tool release to automate all of this, because I really didn't give you any technical details. So, demo time. All right. So what I did was I pulled down the DVCS Pillage toolkit. Once I push that up to GitHub, you'll get the URL in a bit.
What I included was a tool each for Bazaar, Mercurial, and Git. And I included the pycloud scripts that we used to scan the top million sites, so you can just use them; pycloud is very handy for scanning a lot of stuff simultaneously. And some regexes. And that's about it. So to start out, here's the .git directory of our target site. As you can see, there are a lot of different predictable file names. We've got configs, we've got the index file, the object store. This is how Git stores all its crap, right? It's the SHA-1 value of a particular object, and those are the files that we're going to go after. So let's show that git clone there. git clone: yeah, it doesn't work. Same thing with bzr and hg; it's not going to work. If we run the particular tool, gitpillage, what it does is it pulls down the index file, parses it, and lets you know: hey, you're about to make 1,900 requests. Depending on the site and depending on what they check in, this could take a really, really, really long time. Some sites had 20 or 30,000 objects in their data store. For those situations, don't download it all. Figure out what you might want to go after: just hit no, look in the index file yourself, and see what juicy information you might want. They might not have anything you actually want to download. So in our case, we're going to hit yes. It's just a lot of annoying scrolling text; it's basically telling you all the objects that it's going to get, and it's pulling down those references. For anybody who uses the tool: if you get 404s, that's normal. Git doesn't let you get some of them, just based on how it works. After it downloads those references, it tries to check out each file to restore it into your working directory. It works on Git, Bazaar, and Mercurial.
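The reason those references live at predictable URLs, and the reason some requests 404, comes down to how Git stores loose objects: each object sits at `.git/objects/<first two hex chars of its SHA-1>/<remaining 38 chars>`, zlib-compressed, unless it has been packed into a `.pack` file, in which case it isn't at that path at all. A minimal sketch, with hypothetical helper names:

```python
import zlib

def object_url(base_url, sha1_hex):
    """Map an object ID from the index to its loose-object URL."""
    return f"{base_url.rstrip('/')}/.git/objects/{sha1_hex[:2]}/{sha1_hex[2:]}"

def decode_loose_object(raw):
    """Decompress a fetched loose object into (type, payload).

    A loose object is zlib-compressed; the decompressed form is
    "<type> <size>\\0<payload>", e.g. "blob 12\\0...file bytes...".
    """
    data = zlib.decompress(raw)
    header, _, payload = data.partition(b"\x00")
    obj_type, size = header.split(b" ")
    assert int(size) == len(payload)
    return obj_type.decode(), payload
```

Objects that were repacked return 404 at these paths, which matches the tool's behavior of restoring as much of the tree as the loose objects allow.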
So it's kind of hard to see because of the large font, but once it checks out all the files, it just runs those file names against a regex file that we included in the toolkit, and I'd love some contributions to that. It kind of shows you: oh, these might be files of interest that you might want to go look at right away. Pretty cheesy, pretty error-prone, as you can see; there's crap we just don't care about. I'm going to sanitize this so you guys don't know who this is from. Well, first of all, there's all the source code for the site. Yay. So you've got everything in their web root that we were able to check out. So now you went from having no knowledge of the site to actually having the entire source code tree for that web application, which is really, really, really powerful from an attacker's perspective. Let's sanitize the wp-config. Great, this is a WordPress site, right? Let's hope that worked. There are some cookie hashes, the database password, you know, all the crap that you'd find in a WordPress config: auth keys, all that stuff. I sanitized it just so you don't know which particular site this came from; it's just a random WordPress site. So that's pretty much it. That's the tool. You can get the tool at the GitHub repository there, and I will be putting that up after the talk; I forgot to set it public before I got up here. Please fork and contribute. There are lots of bugs, there's stuff that it could use feature-wise, it could use some love. Please contribute. Questions and answers in the Q&A room. That's it.