Morning everybody. How are you guys awake? Or sober? Who slept in this room last night? That's the only reason you're here. One guy. Okay, so this is Git Digger: creating useful wordlists from GitHub. I'm Wick. I'm Mubix. Go ahead, explain it.

So last night, at random (well, not random for Mubix), we ran into a taxi line and decided to go with them over to Pawn Stars. Everybody know Pawn Stars? So inside we're walking around, we're looking at the souvenirs, and all of a sudden we notice this kiosk. Everybody's using it. What's that? Well, we walk up to it and it has a camera. You can take a picture of yourself, and they allow you to log in with your username and password to Facebook or Twitter to send the image to yourself or to tweet it out to the public. So I emailed it to myself; I'm not giving them anything. And this is the result on the screen. Legit, right?

So I did most of the research. I did all the research. That's me. Yeah, that's him. Here. So we're not the first ones to make wordlists. In 2009 and 2012, Sébastien Raveau (it's French, I'll butcher it) made wordlists from Wikipedia. He's an awesome guy; I'm not trying to make fun of him. There's also all of Matt Weir's stuff. If you haven't used Matt Weir's keyboard dictionary, it's one of the best ones for finding people who just walked their fingers along the keyboard. And the other people who make awesome wordlists are RockYou, and so on. Go ahead.

So we weren't the first to go digging through source code repositories either. Mavituna Security released SVN Digger, where they went through a ton of SVN repositories, looped through them, and then published the frequency count of all the files and all the directories they found and pulled down from... I forget exactly where they pulled them down from. Google Code. Just to point out really quick: if you take a picture of that QR code in the slide, we're not trying to hack you. It actually links to the information, so it's easier. I made them, not him, so they're good to go. The only problem with using Google Code and sites like that is they like to put these captchas in, which makes it really hard to automate stuff.

So this is how everything got started. Two o'clock in the morning, somebody posted a link to SVN Digger. Everybody thinks it's cool. I hadn't seen anything like it before then. Rob's like, that's awesome. And that one line is the only reason he's standing up here right now. Because of that one line.

So I'm like, this is awesome, I can do this crap in 30 minutes or so. I'll go to bed, wake up in the morning, the code will be done, and I'll have an awesome wordlist. My first problem was that I couldn't find, at two in the morning mind you, a good way to get all the repositories. So I started with GitHub's lists of the top repositories, the most forked. I used some basic Python and started web-scraping all of that, saving the usernames and the project names into SQLite, and then just set my computer loose cloning all the repositories. So now what do I do with it? I have these repositories. I'm using os.walk to go through each repository and keep a count of the user, the file names, and the directories. I'm doing a whole lot of sed, grep, and awk, just trying to clean everything up and make it nice and easy. There was a ton of manual review, because I thought it would be easy to go through and pull out all the usernames and passwords and email addresses I found in this code.
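Just to make that walk-and-count step concrete, here is a rough sketch of what it could look like. The database file name, table layout, and repository path are illustrative guesses, not the actual Git Digger code.

```python
import os
import sqlite3

conn = sqlite3.connect("gitdigger.db")  # illustrative database file name
conn.execute("""CREATE TABLE IF NOT EXISTS names
                (name TEXT, kind TEXT, count INTEGER,
                 PRIMARY KEY (name, kind))""")

def tally(name, kind):
    # "is this already in my table? yes? okay, add one to the count"
    row = conn.execute("SELECT count FROM names WHERE name=? AND kind=?",
                       (name, kind)).fetchone()
    if row:
        conn.execute("UPDATE names SET count=? WHERE name=? AND kind=?",
                     (row[0] + 1, name, kind))
    else:
        conn.execute("INSERT INTO names VALUES (?, ?, 1)", (name, kind))

def walk_repo(repo_path):
    # walk one cloned repository and record every directory and file name
    for root, dirs, files in os.walk(repo_path):
        for d in dirs:
            tally(d, "directory")
        for f in files:
            tally(f, "file")
    conn.commit()

walk_repo("/mnt/repos/example-user/example-project")  # illustrative path
```

A select-then-update round trip per file name is a lot of database calls, which lines up with the slowness he talks about next.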
So I spent about 17 hours total on my 30-minute project, plus all kinds of hours trying to pull out usernames and passwords, and I ended up with a mile-long line of sed that I'd just copy, paste, and come back to later. os.walk was taking forever to go through and find everything, and I'm like, there's got to be a better way to do this. After some Google-fu I found BetterWalk, which claims that os.walk makes unnecessary API calls (is this a folder? is this a file? we don't know, API, please tell me) and cuts those out of the loop, which speeds things up to two and a half times.

So the good news is I've got some awesome wordlists. I posted them on IRC, everybody loved them, and I was like, great. The bad news is I've only got some of the repositories, maybe the most popular repositories, and that's it. SQLite transactions were extremely slow; it took maybe 30 seconds to go, is this already in my table? Yes? Okay, let's add one to the count. And the 17 hours of manual labor really sucked, because I am the laziest bastard on the planet. If I could have gotten my goon to carry me in here, I would have. And my hard drive was full. I had terabytes of this data.

Everybody liked it, so I'm like, okay, let's get a little serious. How can I make this better? How can I streamline it? How can I not do 17 hours of manual labor? First problem: storage. How am I going to store all the data? My first thought, after some Googling, was Bitcasa. Awesome: $99 a year, unlimited space, built-in indexing, so I could give other people access to all the code and they could search for whatever in the world they want and get it. At that time, six months ago, there was only a Windows client. It crashed every time I tried to launch a robocopy or even a simple copy and paste, and it was extremely slow because they encrypt all the data on the way up. So what might have taken me six days to upload a terabyte on my slow-ass connection would have taken like a month. The next option, which I thought was the option, was to have a NAS. Everything would be stored in one place, it would be protected, and I could download directly to it. But it's hard to get free money for these things. I already had three terabytes, so my solution is right there: that's the first ten terabytes of all the data. That's awesome.

So the next problem is how I can make downloading these repositories better and easier. How can I get all of the repositories? When I was actually awake, I found the API, which I felt incredibly stupid for not knowing about. It's nice because the API gives you all kinds of useful information. The only thing I haven't found is the parent: it will tell you that a repository is a fork of a project, but it doesn't tell you which project it was forked from. So I can keep track of how popular a project is, but I have no idea which one was the original.

Database: SQLite sucks really badly when you're trying to store a lot of data, so I switched to MySQL. I've had questions in the past: why didn't I use Postgres? Well, I know MySQL. And again, I'm lazy. I didn't want to learn something new.

So let's put this all together now. I've now got two main scripts. The first Python script is threaded and goes through and downloads all the data; it's got another mode that will go through and process all that data. And then I have another script, which I'll talk a little more about, that just takes a long list of usernames, passwords, or email addresses; I pass it a table name and it goes and dumps all the data into that table.
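A minimal sketch of what that first, threaded downloader's main loop could boil down to, assuming the requests library for the API calls and a local clone directory (both my choices; the talk only says it's threaded Python):

```python
import subprocess
import requests  # assuming requests; the talk only says the downloader is Python

API = "https://api.github.com/repositories"  # GitHub's list of public repositories

def download_batch(since_id, clone_root="/mnt/repos"):
    # "give me the next batch of public repositories, I've already seen <since_id>"
    resp = requests.get(API, params={"since": since_id})
    resp.raise_for_status()
    repos = resp.json()
    for repo in repos:
        # in the real tool this is where the repo ID, name, and fork flag
        # would get recorded in the database before cloning
        dest = "%s/%s" % (clone_root, repo["full_name"])
        clone_url = "https://github.com/%s.git" % repo["full_name"]
        subprocess.call(["git", "clone", "--quiet", clone_url, dest])
    # hand back the highest ID seen so the next run picks up where this one stopped
    return repos[-1]["id"] if repos else since_id

last_seen = 5000  # in the real setup this last-seen ID lives in MySQL
last_seen = download_batch(last_seen)
```

Unauthenticated API calls are heavily rate limited, so in practice authentication and threading matter; that part is left out of the sketch.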
The MySQL database is what I upgraded the most. I've created a table to keep track of more project information, and the usernames and passwords and everything else now have their own tables. And I'm keeping track of the last-seen ID so that I don't have to start over or repeat myself.

So here's how the downloading works. The downloader goes out to the API and says, give me 100 repositories; I've already seen 5,000. GitHub comes back and says, okay, here's the next 100. It downloads that, records in the database that I've got it, and then automatically clones the repository to my hard drive. Unfortunately, the processing only got a little better; there's still a lot of manual work. The processor mode checks my database, going, okay, I don't have this repository but I know it exists, and downloads it. Or it goes through and walks it, does the BetterWalk on it. And if you notice the red line, that's all my manual work. I have to grep all this data, pull out usernames, passwords, emails, RSA keys, all kinds of fun stuff, and then clean it up, which, for one grep session over one day, can take four days for me to go through, clean up, and dump into the database. And then I have a bash script that will just connect to the database, dump everything, create the wordlists, and automatically push them back up to GitHub, which is the real irony: I'm downloading all their data and yet storing it on GitHub.

So, the updated news: I now have all the repositories; I can now get every single public one. Generating the wordlists with the bash script takes minutes once everything is in the database. Because of the updates I did to the database, I can now store the repositories on any hard drive I want, search the database, and it will tell me which drive to go to. The sucky part about that is if I want to go back and grep for more stuff, I have to get this giant hub and plug all these hard drives in at the same time. It's awesome, you should see it. I'm estimating that it's going to take about 30 terabytes to download all the public repositories; however, I am pulling that number out of my butt, based on the number of repositories I got in the first 10 terabytes. Because everybody is uploading new stuff every single day, I could probably continue this project forever and never see the end of GitHub. This is the big data drinking game: if you just heard me say big data, drink. But you guys are all hung over, so I'm not going to ask you to do it.

So obviously this is all a build-up to the actual wordlists. What did we get out of it? Anyone with kids knows exactly how this goes. So, all together. If you don't have kids, you can just get the movie and fast-forward to that part, because it is like the best part of the whole movie. All right, so: all directories, all files. These are pretty straightforward lists, but the cool thing about them is what you see inside them. And we're not just talking about a password list. The password list is the obvious use, right? I'm going to have a set of passwords that I'm going to use against something. The all-directories and all-files lists are awesome when you're talking about web application attacks. And the usernames: I didn't know that so many people loved Bob, but they do, more than admin. So, stats. Yeah. I promise this is the only stats slide.
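As a concrete picture of that dump-the-database-and-build-the-lists step: the real thing is a bash script, but a rough Python stand-in, with guessed table and column names, could look like this:

```python
import pymysql  # assuming PyMySQL; any MySQL driver would do

def dump_wordlist(table, outfile):
    # dump one table as a wordlist, most common entries first, one per line
    conn = pymysql.connect(host="localhost", user="gitdigger",
                           password="changeme", database="gitdigger")
    with conn.cursor() as cur:
        cur.execute("SELECT value FROM %s ORDER BY count DESC" % table)
        with open(outfile, "w") as fh:
            for (value,) in cur:
                fh.write(value + "\n")
    conn.close()

# table and output names are illustrative, not the published list names
for name in ("all_files", "all_directories", "usernames", "passwords"):
    dump_wordlist(name, name + ".txt")
```

From there, a git add, commit, and push of the generated files is all it takes to land the lists back on GitHub.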
I just wanted to give an overview of how many passwords are in the database versus how many of them are actually unique in each section.

So this is where it gets relevant to what I do. I'm a senior red teamer, and one of the things I do is just break stuff. I already talked about forced browsing; SVN Digger kind of started that whole thing. The great thing about forced browsing is that when you get a set of directories or a wordlist like this, you work exactly like DirBuster: you just go through it and find what's there. You can actually use these lists with DirBuster. The small default-password list is not exactly what I would have expected as default passwords. You start with root, toor, blah, but this one actually got a lot of different ones. Static salts: it's hilarious when you find a repository that has a salt for passwords and then that repository is used as an application out there in the real world. I actually stopped pulling out static salts because there are so many, and I went, I'm never going to get this done in time to submit a CFP on the project if all I do is pull out static salts.

So, five minutes. Number 22 on the list of files is exception.php. I have never, ever looked for that when I was looking at a web application, even a PHP one. But after Wick had done his research and shared the list, I used it and got code execution, because that exception.php was actually loading the exception information from a file and you could just specify any file you want. So it's on my list now. file.php... I'm going to keep going, because there are five minutes left. That's pretty awesome. And this is one of my favorites: NTLM SSO magic. Anyone know what an "NTLM SSO magic" PHP file does? It has your username and password statically assigned in it so it can do NTLM.

All right, so, real-world stuff. Anyone see this release? The secret tokens for Rails. If you have a secret token stored in your repository and it's also used in production without you changing it, it's direct remote code execution. This is the gentleman, and I'm going to butcher his name, so I won't try. He was nice about it and sent out an email to all 1,000 users who had this in their repositories. I am much too lazy to do any of that.

So, the not-so-obvious stuff. You start parsing every file from the revision history. Right now Wick isn't, but if you store your password and then, like the gentleman just said, remove it, you can still go back in the history and find it if you don't nuke it. Mass static code analysis, so you can find vulnerabilities in a ton of things really quickly. .svn and .settings are amazing things: for some reason, when you convert an SVN repository into a Git repository, sometimes people forget to delete those, and they can have configs in them, including database configs and all kinds of things. .gitignore is an amazing little file that tells your Git repository which files to never commit. Those are exactly the files that I want to look for, because those are the things that are important. So I usually look for that. 403s and empty directories: Git, as well as SVN, doesn't let you create a directory and commit it unless there is something in it, so an empty directory with a .DS_Store in it is usually how some people do it. Another thing is running OCR on all the images. We actually found a guy, or a girl, who had their password stored in an image in the repository. It was awesome.
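Back to the revision-history point for a second: it is easy to prototype with plain git plumbing. A small sketch, where the repository path and the "password" pattern are just placeholders:

```python
import subprocess

def grep_history(repo_path, pattern="password"):
    # list every commit reachable from any ref, then git-grep each one;
    # secrets that were "removed" later still live in the old revisions
    revs = subprocess.check_output(
        ["git", "-C", repo_path, "rev-list", "--all"], text=True).split()
    hits = []
    for rev in revs:
        out = subprocess.run(
            ["git", "-C", repo_path, "grep", "-i", pattern, rev],
            capture_output=True, text=True)
        if out.stdout:
            hits.append(out.stdout)
    return hits

for hit in grep_history("/mnt/repos/example-user/example-project"):
    print(hit)
```

Running that loop across every cloned repository is the "mass" part; the same idea works for hunting .gitignore entries, leftover .svn and .settings directories, and the rest of the not-so-obvious list.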
Using a list of text files, grabbing all the emails, which he already does... and I'm stopping there because it gives away all the ideas. And we're done. Thank you. I actually want to give a quick thank-you to NoVA Hackers. Are there any NoVA Hackers in the room? They suck. Without their help, support, and encouragement I would have never kept going with this project, because they helped me out with resources. I now have a file server that can store up to 34 terabytes of data, so once I get the original 10 terabytes switched over, I'm going to start downloading and pulling out some more stuff. Cool stuff. No, everyone's waiting for the next talk. Questions? All right. Cool. Thanks. Thanks for coming.

Did you do indexing on the SQLite database? No. No, I did not, and I probably should have. I'm not a programmer; I'm a problem solver. Yeah. Well, I've had one. I gave this talk at B-Sides, and I had one guy come up to me and mention, well, why didn't you just do it in memory? I'm like, I have three terabytes' worth of data to go through; I don't think my computer would live through doing that and storing the database in memory. I'm afraid. Maybe. I don't know. Thanks a lot, guys. I appreciate it.