Okay, everyone, we'll now get started on the last session of the day, and I will hand over to Michael very shortly. The title of the talk is up there: Supercharge Your Web Recon with Commonspeak and Evolutionary Wordlists. I'll hand over to Michael. Thanks very much.

Cool. All right, let's get talking about some wordlists. Just to note, it's just me flying solo today; Shubs unfortunately wasn't able to make it into the US, for reasons. So, a little bit about myself. Shubs and I co-founded a company called Assetnote. Previously I was the director of SpiderLabs in Asia Pacific. You may have seen me around; I've spoken at a bunch of conferences on various topics, and I also organized SecTalks and another conference, TuskCon, in Brisbane, Australia. Shubs also comes from a pen testing background. He's a prolific bug hunter, he's in the HackerOne top 50, and he's a co-founder of Hackers Helping Hackers. I'm not going to do too many plugs, but I do want to plug Hackers Helping Hackers, which is a charity based in Australia. Basically, we take people who are underprivileged or who don't have good representation in the industry and send them to cons, help them network, and run various industry events. So if that interests you, consider donating; it's a really good cause.

Cool. Let's get on to the meat of it. This talk is about web recon and content discovery, and I want to start by setting the scene with the current state of things. Say you're testing the security of a large network with zero knowledge of the network or what applications are on it. The first thing you want to do is some reconnaissance to get an idea of what applications and systems are running on that network. To do this, you'll typically load up a file and directory brute-forcing attack (well, not necessarily an attack, a reconnaissance exercise) and run it across all web assets to find what's out there. Pretty standard stuff for everybody in this room, I'm sure.

So, as a security tester, when you're conducting this directory brute forcing, you're typically using some sort of wordlist, probably SecLists or some wordlists that maybe you've created yourself, and then you pipe it into dirsearch, Gobuster, or whatever your favorite directory brute forcer is. Maybe it's Patator; I don't know many people who use that, but it's also a pretty good tool. So yeah, pretty straightforward; you'll all be used to that.

This is an example, and none of it should be shocking. This is just dirsearch with a basic wordlist, looking for ASPX, HTML, and JS extensions, and also including the cms and api subdirectories (a reconstructed command along those lines appears just below). Ten minutes into the brute force we find sales.aspx, and in this particular example, visiting this file returns a list of customers and sales; that's sensitive information. But what would have happened if we didn't find that? If sales.aspx wasn't in our wordlist, we would have missed this bug. So obviously the quality of your wordlist is really important for the quality of your recon, and by extension the quality of your pen test.
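Just to make that concrete, here's a rough sketch of that kind of dirsearch invocation. This is my reconstruction rather than the exact command from the slide; the target URL and wordlist path are placeholders, and the flag spellings should be checked against your dirsearch version:

    # Brute force paths with .aspx/.html/.js extensions, also under /cms/ and /api/
    python3 dirsearch.py -u https://target.example.com \
        -w wordlist.txt \
        -e aspx,html,js \
        --subdirs cms/,api/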
So what are the problems with current wordlists? Well, they're curated, i.e. they're created manually, often by an individual or a small group of individuals, so the quality of the wordlist really comes down to their time and effort, as well as their experience. They're not updated regularly, and they don't really evolve as new technologies are developed, at least not as quickly as they should. They're hard to customize and tailor to your needs, because they're just big, long text files that you have to cut up and rework yourself; not really customizable, at least not in an easy fashion. And if you're creating them, they require a significant time investment to produce anything of real quality.

So how can we address these problems? Well, first, I think we need to move away from curated wordlists. Under the current model, updates are slow because people have other stuff to do; nobody gets paid to create wordlists, typically, so they can't really maintain them and keep them up to date. We need to figure out a way to keep wordlists relevant to current technologies, to make sure we get the most effectiveness out of them. We need to reduce the amount of time and effort required to create an up-to-date, high-quality wordlist. And they need to be easily customizable to your needs.

That's where we introduce the concept of evolutionary wordlists. Addressing those problems I mentioned earlier was relevant to our business, so we developed this concept. The idea is that evolutionary wordlists are not static, curated lists; they evolve as the underlying technologies used by organizations shift. By automatically analyzing large public datasets, we can create these evolutionary wordlists. They're dynamic, they're ranked by occurrence, and they're generated regularly, on a schedule, as part of an automated workflow; none of this manual curation, updating, and maintaining.

One of the key areas is keeping up with the shifting tech landscape, because that really goes to the quality of your wordlists. If your wordlists are out of date with the technologies people are using to develop applications, then you're going to miss stuff. Just think about all the new technologies and frameworks that have been introduced in the last five years; there have got to be at least a bajillion JavaScript frameworks that have come out in that time, right? Do the available wordlists cover these technologies? It varies. How often are they updated? Not that often. And how are the curators choosing what to cover and what to put in their wordlists? Is it based on experience, or on something more empirical? Is it just a gut feel that these are the right things to include, because they're experienced and they see that stuff all the time? What if they're missing stuff, or they don't usually work in a certain area of technology? The point is, if the wordlists we use for web recon don't keep up with modern technologies, we'll inevitably miss significant vulnerabilities when using them for security assessments or bug hunting. Cool.
So, that's where BigQuery comes in. Looking for a way to bring this evolutionary wordlist concept to life, we looked into BigQuery. BigQuery is a Google Cloud offering that processes and analyzes large datasets using simple SQL queries. You can process terabytes of data in seconds, all handled on Google's side, and it can handle complex SQL queries, including regexes and user-defined functions. More importantly, for us and for the concept we're trying to get to, it offers a number of large public datasets that are updated regularly, so they're evolving as well, and these datasets can be queried in an automated fashion.

Just as an example of some of the datasets available on BigQuery that are quite useful: updated daily, there are the stories and comments from Hacker News, and every SSL certificate from the Certificate Transparency logs. Updated weekly, you have the contents of over three million open source public GitHub repositories. Fortnightly, you have the HTTP Archive dataset, which is obtained by crawling the Alexa top one million list. And quarterly, all publicly available data from Stack Overflow. So if you think about the idea of shifting with the technological landscape, these are pretty good datasets for that, right? And BigQuery doesn't cost much: you get up to one terabyte of queries free per month, and it's five bucks per terabyte after that. Pretty good for what we need to do, and it should be free for pretty much everybody's needs.

So, writing a BigQuery query. It's pretty straightforward: it's SQL, which most people at least should be familiar with. And as I mentioned, you can use regular expressions here to pull out extensions like PHP, HTML, and JS. Basically, all this query does is pull from the GitHub repos dataset, look for any path with a .php, .html, or .js extension, group by that path, order by the count, and take 200,000 results (a reconstructed version of this query appears at the end of this section). And this is kind of the output. Some of the obvious ones are there: index.html stands out as number one, which doesn't surprise anybody. But going back to the idea of keeping up with the technological landscape, look at things like Gruntfiles and gulpfiles. Grunt and gulp are fairly recent tooling that came with the rise of front-end development, and they're right up the top. So it gives you a good feel for how you can pull out current and relevant results from BigQuery.

You can also use JavaScript functions in BigQuery. You can create a temporary function; basically, all this one is doing is parsing a URI to get the path, and it's using a library that's just hosted on Google Storage. It's a very simple library for parsing URIs that Shubs wrote, and it's publicly accessible, so you could use it in your own BigQuery queries as well if you wanted to. Then the query just pulls out the URL and groups by the URL. Pretty straightforward kind of stuff. So now we've got this idea of evolutionary wordlists and the problems we're trying to solve with them, and we've got BigQuery, which we see could be a good fit for this.
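As promised, here's a minimal reconstruction of that extension query. The public bigquery-public-data.github_repos dataset is real, but this is my sketch of the query rather than the exact one from the slides:

    -- Pull file paths ending in .php, .html or .js from public GitHub repos,
    -- ranked by how often each exact path occurs across repositories.
    SELECT path, COUNT(*) AS occurrences
    FROM `bigquery-public-data.github_repos.files`
    WHERE REGEXP_CONTAINS(path, r'\.(php|html|js)$')
    GROUP BY path
    ORDER BY occurrences DESC
    LIMIT 200000;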
So, how do we take this concept and turn it into something that's repeatable and can be used in a workflow when you're doing bug hunting or security assessments or whatever? The initial attempts at automation were kind of promising, but also not that great. The first attempt at this was Commonspeak version one. It was functional, but not really ideal for a testing workflow. It covered directories, filenames, and subdomains from Stack Overflow, Hacker News, and the HTTP Archive, and then subdomains from the Certificate Transparency logs. The wordlists ended up being very large and very noisy, and it was essentially just a collection of bash scripts, which made it hard to integrate into a workflow. You can see here, this is Commonspeak version one getting the subdomains from Hacker News: just a shell script, right? (There's a reconstructed sketch of that style at the end of this section.) And it comes back with 67,360 domains. The problem with Commonspeak one was that it was very noisy and not easy to work with, and really the initial focus was on quantity, whereas it became clear that we needed to focus on quality as well.

So, that's where Commonspeak 2 comes in. It is way, way simpler. It's written in Go now, so it plugs into an existing testing workflow quite easily, and it currently generates three types of wordlists. We're adding more modules over time, and we're also getting heaps of interest and pull requests from people who want to add different modules once they get their heads around the idea. The ones that are currently in there: extension-based wordlists from GitHub, which are quite large and sorted by popularity; subdomains from the HTTP Archive and Hacker News, where we're focusing on pulling about 500K; and route-based paths from popular web frameworks, so Rails, Tomcat, and Node.js for now, pulling out paths like config/routes.rb and so on. And yeah, as I said earlier, the focus was on better-quality wordlists, which makes more sense from a bug hunting and security testing perspective; often a better-quality wordlist beats a huge wordlist of poor quality. As I mentioned, it's written in Golang, it's extendable and importable, it's really easy to embed in any kind of workflow, and it's open source. I'll get to the links at the end.

So, this is the extension-based wordlist. This is just the simple query pulling from GitHub, the first query we looked at, pulling out files with different extensions. And here, same sort of thing in Commonspeak 2. That's the command up the top. The project is just your Google project ID; that's ours, but you'll put in yours. Everybody gets one when you set up BigQuery and all that kind of stuff. The credentials are your credentials. And then it's generating an extension wordlist, in this case for ASPX, getting 100,000 results, and piping it out to aspx.txt.

[Audience question] What do you mean? So, yeah, back here: Hacker News is updated daily, I believe, stories and comments. This is not Hacker News hosting a wordlist; this is BigQuery pulling all the stories and comments from Hacker News into a dataset that you can query. So they're not hosting a wordlist or anything like that; we're creating the wordlist from that data. Yeah. Let me get back to where I was.

Cool, and this is just sample data, right?
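To give the general flavor of that Commonspeak v1 style of shell script, here's a rough reconstruction driving the standard bq CLI. The Hacker News public dataset name is real, but the regex and the script itself are my assumptions, not the original code:

    # Reconstructed sketch: rank hostnames mentioned in Hacker News stories/comments.
    # Requires the Google Cloud SDK (`bq`) and a configured project.
    bq query --use_legacy_sql=false --format=csv --max_rows=100000 '
      SELECT
        REGEXP_EXTRACT(text, r"https?://([a-z0-9.-]+)") AS host,
        COUNT(*) AS n
      FROM `bigquery-public-data.hacker_news.full`
      WHERE text IS NOT NULL
      GROUP BY host
      HAVING host IS NOT NULL
      ORDER BY n DESC' > hn_hosts.csv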
And this is the subdomains wordlist from the HTTP Archive. So, this is the query. Again, this one is a little bit more complex, but not that complex, right? If you want to extend this and add your own modules and your own queries, it's super simple: just a bit of JavaScript and a bit of SQL. This is very similar to the original query we showed earlier, where it's got a JavaScript function that's used to get the subdomains using that URI.js library Shubs wrote. It's basically just getting the subdomains from the HTTP Archive URLs and grouping by subdomain, and I think that one is limited to two million results. So, quite a few, right? (A reconstructed sketch of this kind of query appears at the end of this section.)

And this is what it looks like using Commonspeak 2 for that. Again, you've got your project ID and your credentials, and it's got verbose options as well. So again, this is not just a really simple bash script; it's a little bit more featureful and something you could actually use. And then it outputs into a subdomains text file. If you have a look here, there are 484,701 unique subdomains, and it's sorted by count. You can see here that www is up the top with, God, I'm not even going to bother reading that number. Makes sense, right? But you can see some interesting ones here. m makes sense, developer, whatever, but as you go on, bits.blogs is up there; that seems kind of weird to me, and I would probably want to put it in a subdomain wordlist. spectrum, I don't know, maybe that's related to some sort of technology. Some of these make sense to be up the top, but some of them are kind of surprising even in that top list. So it is sorting by relevancy as well, automatically, just by working on the count.

Cool. So, same thing... sorry, this looks like a duplicate slide. Never mind. Let's get on to the future development of Commonspeak 2. Framework-based route extraction, so things like config/routes.rb. More comprehensive file and directory wordlists for many languages and frameworks; we only support a few right now, as I mentioned, Rails, Tomcat, Node.js, but we're going to add support for more, and we'll also add different queries as people think of different things.

The other thing as well, before I keep going with this: it's not necessarily just useful for recon. There are other security domains where it's also useful. Think about fuzzing, right? With BigQuery, and even with Commonspeak, you can pull out all the PDFs that are on GitHub, let's say, and then you could use that as a corpus for fuzzing a binary or whatever. So it's not just web recon. We find use for it in web recon, but it's great in a whole bunch of other security domains.

Also in terms of future development: parameter and value-based extraction from URLs, sorted by frequency; a permutation engine to take different permutations of the wordlists, again focusing on quality; and scheduled wordlist creation, so the lists stay up to date and constantly evolving. We're still working on it; we were working on it right up until this presentation. But all of the features listed here are expected to be released within the next couple of weeks. So, yeah. Here we go.
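For the curious, here's a minimal reconstruction of that style of subdomain query, using an inline JavaScript UDF in place of the hosted URI.js library. The HTTP Archive table name below is an assumption (the snapshots are dated), so substitute a current one:

    -- Reconstructed sketch: extract subdomain labels from HTTP Archive request URLs.
    -- The real Commonspeak query loads Shubs' URI.js from Google Storage instead.
    CREATE TEMP FUNCTION getSubdomain(url STRING)
    RETURNS STRING
    LANGUAGE js AS """
      try {
        var host = url.split('/')[2].split(':')[0];      // crude hostname extraction
        var parts = host.split('.');
        // Keep everything left of the registered domain, e.g. "bits.blogs".
        return parts.length > 2 ? parts.slice(0, parts.length - 2).join('.') : null;
      } catch (e) {
        return null;
      }
    """;

    SELECT getSubdomain(url) AS subdomain, COUNT(*) AS n
    FROM `httparchive.summary_requests.2018_07_01_desktop`  -- table name assumed
    WHERE getSubdomain(url) IS NOT NULL
    GROUP BY subdomain
    ORDER BY n DESC
    LIMIT 2000000;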
In terms of where you can find this and find resources on it: labs.assetnote.io and blog.assetnote.io, where we'll be posting write-ups on using Commonspeak and obviously all the materials from this presentation, and GitHub, where Commonspeak 2 lives under the Assetnote organization. If you want to get in touch with us, our website or Twitter is probably the easiest way: @infosec_au for Shubs (it's actually underscore AU; sorry, that's a bit of a mistake on the slide) and @mgianarakis for me. And that's the URL. And that's everything; I'll leave the URL up for anyone else who wants to take photos. So, yeah, any questions?

[Audience question] Yes, we have used it in bug bounties to very significant success. It's also driving Assetnote; Assetnote is a platform that monitors your external attack surface for vulnerabilities, and it maps out your assets with recon using passive and active sources, and our active sources are based on wordlists generated by Commonspeak. So, oh yeah, definitely, especially on bounties; we've seen a lot of success on bounties. Shubs, if you don't know, is a very, very successful bounty hunter; he's in the top 50 on HackerOne globally. Where he stands out, where he gets his edge, and where the top bug hunters really get their edge, is in recon. And the stuff he's pulled out using Commonspeak has definitely been interesting stuff that I seriously doubt would be in SecLists; I haven't checked, but I seriously doubt it, because Commonspeak is up to date, and it's sorted by frequency and popularity as well. So it's not just a static wordlist. If you're pressed for time, or you don't want to do a huge brute force, you can take the top 100 or 200 entries, and you'd likely have more success with that than with an unsorted, random kind of wordlist. So it's definitely more flexible, and at least in our experience using it, both for our business and for bug hunting, it's definitely been successful. That said, we haven't compared it rigorously to SecLists, but we're confident it's a better approach. Any other questions? Okay, no further questions then. Thank you very much.