Hi everyone, I want to thank you for coming to my talk, titled "Let the Bugs Come to Me: How to Build Cloud-Based Recon Automation at Scale."

For the first half of this talk we're going to go through an overview of how you can build your own cloud-based automation ecosystem. We'll go through the framework, the methodology, and the lessons learned, provide an overview of a lot of the cloud services you can utilize, and talk about methodologies you can apply, so that you can avoid all of the challenges, mistakes, and lessons learned that I've had over the past year and a half as you build this out. For the second half of the presentation we're going to do some live walkthroughs of the actual environment and show how it works and how it runs. At the conclusion of this I'll make sure to release the majority of the code that you'll see today, so you can reference it and hopefully take it, build out your own environment, and drive that success for recon. Additionally, I posted a supplementary article earlier this week that I'll reference on the slide here; feel free to pull up the link and follow along as we go. The talk doesn't go line by line with it, but if you want to delve in and gain more information, feel free to jump in there and leverage those resources. There are a lot of live code snippets and pieces that may add some background context.

I wanted to share a little bit about myself. My name is Ryan Elkins. I'm a husband, I'm a father, I have two kids, and by day I'm an information security architect. I spend the majority of my time learning about cloud solutions and how to secure the cloud; cloud is my primary focus. Eventually, once the kids are in bed and 10 o'clock rolls around, I like to try to be a security researcher and hack things when I can and when I have the energy. That's really what led me into this talk: that limited-time factor, where a lot of times the free time I do have starts at 10 p.m. At the same time, I still want to be able to participate in learning and bug bounties and in doing a lot of the hobby security work that I really love, and not to mention that it helps me out in my day-to-day career as well. We'll share some different lessons learned, but leading into that, it reminds me of a story that I'll call the Blockbuster dilemma. I'll ask the audience here: does anybody remember renting movies by actually walking into a store, into a Blockbuster, and picking one out? If not, I'll walk through the process, and I think there are a lot of carryover traits to, at least, my experience with bug bounties. You're sitting there on a Friday night, you just finished eating your carry-out pizza for dinner, and you realize, hey, we have four hours to do something. Let's watch a movie tonight.
So you get the family in the car, you drive to the nearest Blockbuster or grocery store, wherever they have movies you can rent, and you go in. You spend about 45 minutes trying to find a movie that everybody agrees on, you eventually compromise, and you pick the selection. You drive home, you make some popcorn, you go upstairs and drag down all the pillows and blankets, and you get everything set up in front of the TV. You eventually start the movie about two and a half hours later, so you're down to maybe an hour and a half. You get the movie started and you fall asleep. You wake up the next morning and realize, hey, I missed the whole movie. You try to cram in the movie before you have to return it that day or get fined, and you eventually end up missing the deadline by about ten minutes and pay for an extra day.

That is exactly the challenge with bug bounties, and I feel like we are reliving the times of the VHS and DVD rentals. Picture bug bounty. This was me before creating this ecosystem, and it's what drove me to build what I'm sharing with you today. You get the kids to bed, it's 10 p.m., and I decide I want to hack tonight. I'm going to work on a program, try to learn, figure things out. So I open up the computer and log into every single platform to see what's available. I review all the programs, I look at my private invites, I look at the public programs, I dig deep trying to see if maybe there's a public program somebody missed. You do that every time, even though you don't always find something, and then you also look at what programs have incentives, and eventually you get to the point where you select a program. So you have your program; now you need to find all those SSH keys you forget where you stored on your computer, and what password did I use to log in? You get all that together, then you realize, hey, I haven't even started my VPS server, so you start the VPS server and wait for it to come up. You fire up Burp Suite or whatever proxy you use and start manually converting those scopes over into your Burp Suite console. You get everything ready, then realize, well, I don't want to just go for that root seed domain; we should do some recon. So you update Amass, you go look up the syntax, you load it up, you kick it off, you wait about ten minutes, you get tired, and you fall asleep. You wake up the next morning and realize you slept all night while Amass was running. You have this big arsenal of subdomains, but you're going to have to wait another few days to be able to tap into it. Your whole night was basically gone, because you only have a few hours; your time is so important. For me, I was living that over and over again: I would do this, I would have this recon, and then the next week I would just decide to go for a different program. It's really a systemic issue that probably a lot of us face, and I hope some of this resonates with you. I needed to figure out a different strategy, a better strategy. How do I solve this?
A couple of weeks ago, Codingo (Michael Skelton) and Heath Adams (The Cyber Mentor) were talking on Twitter, comparing poker to bug bounty, and I loved that, because it fit the exact objective of what I've been trying to accomplish with this automation. I wanted to build a scalable, cost-effective, cloud-based platform that can do full recon automation end to end, so that it takes all this raw data and transforms it into actual intelligence. For me the poker comparison is that the best poker players don't play every hand; the hands they do play, they want to make sure they have the best statistics and the best chance of winning, and those are the ones they play. In the same way, for the time I manually spend on a program or an analysis, I want the highest likelihood of signal versus noise that I'm going to find a vulnerability. Whether it turns out to be a duplicate or not, I don't care at that point; I'm just happy to have proper, actionable intelligence derived from the raw data that I can act on. That's the ultimate goal, and I feel like the platform I'm going to show and walk through comes pretty close to that, and we'll talk through that as well.

As I developed it, I categorized what I felt was important into nine tenets to always guide towards. I definitely didn't hit every one of them, and there are still iterations, MVP and post-MVP processes, but eventually I want to stick as closely as I can to these nine tenets, and I think it's important, as you think about this as well, that they all resonate and make sense.

The first is that it scales horizontally: this solution had to cover every bug bounty program in existence. I'm not technically smart enough to be writing my own exploits and zero-days, but I am smart enough to apply the OWASP Top Ten and search for those and for some of the common themes that we see, read about, and learn about. The problem is that I can't just hone in on five targets and expect to find anything, so I need to be able to scale that breadth of knowledge across every single program at once.

The other part is scaling vertically. Once I have the data, the intelligence, the fingerprinting, the server information about every single program, I want to be able to keep adding new checks. So "scales vertically" is essentially the bug classes you're checking for across the entire ecosystem of bug bounty programs.

The next piece is that I want it to be cloud-based. I don't want to stand up my own server racks and make the capital investments in hardware in my home office; that had no appeal to me. I want on-demand scalability. The same reasons businesses are driving towards cloud are the same things I wanted: I need that compute power when I need it, and I want to make sure I have it at my fingertips. So cloud-based, with no on-premise dependencies, is extremely important.

The next piece is infrastructure as code: being able to deploy your infrastructure repeatedly, with source control, version control, and all of that, is really important. I've failed quite a bit in this area, because a lot of times I'm just working on something, I want to make it work, and I can go into the GUI and click. That's probably the area where I have the most technical debt to revisit, but it's certainly something I'm working towards.

Next is robust documentation.
For me, I could write a line of code and forget what it does the next day. Being able to understand why I did something a certain way, and having articles explaining it, means I can revisit this and not have to keep relearning it. It also helps with sharing to the world and to the industry, and making us all better, by having statements and documentation around it.

Low cost is another piece. It can get really expensive as you start standing up multiple VPS servers and running containers and cloud services, so my goal is to keep it within the price of finding one low-severity vulnerability per month. I'm riding right on the edge between a low and a medium, but my goal is to optimize this as much as possible, so that one low finding per month will pay my cloud bills. Hopefully, once we scale vertically, it's going to incur more expenses, but that hopefully also means more success is being driven, so the financial impact is smaller because it's generating a profit at that point.

The other piece is aggregated data analysis and anomaly discovery. I want this raw data to be aggregated across the different tools, pulling everything together, so I don't have to individually look at one tool's output and then the next one, but instead bring it all together to produce the most relevant intelligence possible.

Being a security practitioner myself, security best practices are also important, and I think our program owners care about that as well. They're giving us the opportunity to do recon and find potentially damaging, multimillion-dollar impacts to these companies through high-severity and critical findings, so it's important that we take our own due diligence seriously around the data we're collecting and harvesting from these programs. That may mean applying least privilege, changing passwords regularly, doing encryption in transit and at rest, and just practicing general good security hygiene.

The last thing is version control. The last thing I want to do is spend a year on this, make a mistake, and have no idea where I broke it, or lose the code or delete something. Keeping it in GitHub under version control is extremely important and super valuable.

So I want to jump into the architecture: what am I actually talking about, and why are you spending your time today to learn this? One of the things I'll ask of you as you listen is to really look at this.
This is the actual diagram of the architecture that's built out in the cloud, and I'll share the code. You'll see it all working through follow-up articles and code that I'll release, and we'll walk through a few examples within it. It's a fairly large ecosystem so far, so we won't touch on everything in detail today. Even though this talk is focused on bug bounty and recon, this applies effectively to pretty much any area of information security, and I would even extend that to any area of IT or technology. If you work in incident response, do any type of forensic analysis or vulnerability management, or even if you're a risk analyst or someone who's maybe not developing code or not quite as technical overall, this is basically just taking data, turning it into intelligent decisions, and using orchestrated workflows. That's really the extent of it, but doing it properly can make you extremely successful.

Leading into this, I feel like across the industry, if you're looking for a niche to fill, it's leveraging cloud to further mature your company's security program. If you go out and search for articles, you're going to see a lot of information about how to secure the cloud and cloud services. You're going to see a few things about how to hack the cloud. What you're going to find very little of is how you can leverage the cloud to actually transform and mature your own security program and your own security processes. This is just taking one niche area, bug bounty and recon, and automating it with cloud services, but it could absolutely apply to all areas, and I would challenge you to start using this within your own job roles as well.

So what happens in this ecosystem is that I'm tying into all the major bug bounty platforms, and I'm using an existing GitHub repo by Arkadiy Tetelman. He's going through and making a lot of the API calls and scraping the programs and sites. Eventually I'd like to integrate directly with every single platform itself and not have that as a dependency, but for now I'm using the raw JSON output that he's developed and published out to the community. I'll reference it here for you, and I thank him for the awesome project he has out there, available to everyone. Through that I can bulk load every single program out there, at least on the major platforms. Then there are always situations where I need to load an external program ad hoc, so I also have the ability to make an API POST request that will load that program.

Once you have all the programs loaded, it's then literally just a GET request with the program name to initiate a recon cycle. It steps through every single thing you see on this reference architecture, all in automation: it kicks off HTTP scans, it does domain enumeration through Project Sonar, and it has an Amass module that I'm not going to talk through in detail today, since I'm not using it as extensively right now; I'm more interested in hitting some of the data sources directly through their APIs. Once it gathers all that data from the initial scope, it kicks off an httpx scan that gets your URLs, some of the base ports, fingerprint details, and the response codes.
After it has that, it circles back and does a web crawl of all those seeded base URLs. The nice thing is that, iteratively, whatever URLs it finds, it runs through the JS link finder: it looks for everything, including API endpoints, analyzes it all, pulls it back, and gathers a list of every URL found in there, whether in scope or out of scope. It then does a double check to make sure everything is back within scope, creates that list, feeds it back into the recon process, does recursive recon, and keeps cycling until it has gathered tons of information, all of the raw output. Once it's done, it runs commands that aggregate the raw output together, and you end up with structured formats that you can query, search, and sync to persistent systems, and you can run tools like Semgrep, sift, or regular expressions against them, whatever you want. I'm running Nuclei templates locally against the stored responses from millions and millions of web pages. There's an unlimited amount of depth you can go into once you have the data curated in a usable format.

The overall infrastructure is broken into three main components: program generation, program operations, and program analytics. That's how I've bucketed everything that goes on.

This is an example of what you would start with. If you want to do recon, in this example you can see I have Tesla, so I say I want to do Tesla. This is the initial operation, and the nice thing is that if I want to just do a crawl, I would say operation "crawl" and it's going to crawl the existing data I have for it, or I could say "httpx" and it will do an httpx scan against the URLs we have for it. In this case it just says do the entire recon, so it's going to kick everything off, and at the outcome, without touching anything else beyond this one request, I'm going to have dashboards like you see below, with real-time analytics and filtering capability around the URLs, their status codes, and their responses. This isn't going to tell me specific vulnerabilities at this point, but it gives me a way to filter through and structure things so I have the highest likelihood of signal through all the extensive noise we've discovered through the recon process.

I also think a graph database is a really powerful visualization tool to quickly spot anomalies and see relationships. This is Neo4j; it's the Neo4j cloud service, issued as a third-party offering through the Google Cloud API. I'm not using it as extensively because, cost-wise, I honestly can't afford to run it across all the programs. It's great for one or two programs, but once you start hitting the node limits it gets too costly to expand. Maybe once there's some return on investment we'll do more there. But I'm also using NetworkX locally with the Python libraries, and you can still generate some raw analytics around individual programs, which is fascinating.

So it really starts with program generation. It's important to have all the programs loaded and ready to go, and that's where Arkadiy Tetelman's bounty-targets-data GitHub repository comes in. It's just awesome, and you can go through it.
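To give a rough idea of what that bulk load can look like, here's a minimal Python sketch. The raw JSON path and the DynamoDB table schema are my assumptions for illustration; the repository publishes similar JSON files for each platform, and the exact field names in the released code may differ.

```python
import json
import urllib.request

import boto3

# Assumed location of the HackerOne scope data in arkadiyt/bounty-targets-data;
# the repo publishes comparable JSON files for the other platforms as well.
BOUNTY_TARGETS_URL = (
    "https://raw.githubusercontent.com/arkadiyt/bounty-targets-data/"
    "main/data/hackerone_data.json"
)

def bulk_load_programs(table_name="programs"):
    """Fetch the published program data and load one item per program into DynamoDB."""
    with urllib.request.urlopen(BOUNTY_TARGETS_URL) as response:
        programs = json.load(response)

    table = boto3.resource("dynamodb").Table(table_name)
    with table.batch_writer() as batch:
        for program in programs:
            scope = program.get("targets", {})
            batch.put_item(Item={
                "program": program["handle"],     # partition key (assumed schema)
                "platform": "hackerone",
                "scope_in": json.dumps(scope.get("in_scope", [])),
                "scope_out": json.dumps(scope.get("out_of_scope", [])),
                "invite_type": "public",
            })

if __name__ == "__main__":
    bulk_load_programs()
```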
That's just awesome and you can go through But I will highlight one of the challenges that I faced early on is with scope being so important that we stay within scope The bounty the program owners care about scope for a reason and a lot of times these companies They want to be able to have these programs and do this But they truly are sometimes systems that maybe our legacy or they know that it's not going to handle the traffic It might be that they're just not ready to have everything within their ecosystem or environment To be hit by thousands of hackers across the globe and I think that's fair and I think at the same time For us as researchers, it's important to honor that scope and make sure that we stay the best we can within scope And what I quickly found when you try to automate across everything is that It's difficult because the scope even if it's in the same exact field varies Significantly, so I've I've tried to build and iterate over different scope types where you might have a URL with a wild card at the end of it. You might have a domain with The sub domain wild card, but does that mean that I can also wild card other pieces of it? And then there's citer ranges and ip addresses and you'll kind of see the different variations here listed and one example I learned pretty quickly is Making sure that github is not included as the end scope whenever they have a repository there because I quickly learned that you start crawling github you get a lot of Responses and it can really take a long time for unnecessary scope So I made that mistake once and have fixed that pretty quickly So but tons of lessons learned like this that I never anticipated when I started on this journey of Developing this environment and then you can see if you want to load a new Program the post request payload is right here where all you have to do is provide the program The scope in the scope out the platform and the invite type so that way I can filter and protect any private Private repository tours or programs without whenever I'm doing demos and things I can better hide those so That's that's pretty much all it does and it is all loaded and stored inside an aws dynamo db So it's structurally maintained And then I have a timestamp field where I can then trigger and say once I have this and I run recon I set the timestamps so I know when the last time I ran this against so if I want to start doing more Automation around if I haven't touched the program in 30 days Let's automatically kick this off and those are all future enhancements that I could do but just trying to Attract the program so you can see in the screenshot right now. I have 653 programs loaded I know I have a couple other places to pull programs from so that's definitely not comprehensive yet But that's that's certainly enough to get started with And then the operations piece is everything is initiated via api So there's there's really there's nothing that you have to do outside of an api curl recall request if you have so it's Make the curl request to the aws api service. 
You can set your own custom CNAME; I have api.brevityinmotion.com, which is where I host my blog and other things, and I've leveraged that, so I make the request there. That API call gets proxied to AWS Lambda. Picture the AWS Lambda service as a serverless function, in this case Python functions; you want small, modular components that do one task and do that single task well. I proxy the API call, and the parameters I send in the POST or GET request get filtered in as arguments to that function. It's just a quick way to make a function API-based, and that's awesome. Think about that power alone: in your company, if you have a Lambda or a Python script, or anything that takes arguments and maybe runs on a server today, throw it in there, proxy it through API Gateway, and now you have an API call to some type of data processing that you can kick off. That's just so powerful in itself, and you can apply it to so many areas. Once those functions run, they output all the data into S3 buckets, which are object-based storage that scales to however much room we need.

The other piece is ephemeral workloads. Anything I perform actively against the target, so the web crawls, the Nuclei scans, or anything like that, is stood up dynamically on ephemeral droplets within DigitalOcean; I use DigitalOcean droplets for everything like that. The code essentially writes a custom startup script: I start the droplet automatically via the API, the startup script gets fed to it, and the last command in the script is shutdown. The startup script completes all the work: it pulls the data it needs from AWS S3, runs its processes, and then, as the last step, synchronizes the data back to S3 so we don't lose the output before it shuts the instance down. Then I have a Lambda that runs every five minutes, always querying my DigitalOcean account, and any time there's a droplet that's shut down, it terminates it, so I don't get charged any longer than the duration. Worst case, I have five minutes of extra runtime for every operation. Over and over, it sets up ephemeral infrastructure for every process, so it's very modular. That's how we do it, and I'll share all the code.
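Here's a minimal sketch of that ephemeral pattern, assuming the DigitalOcean v2 API and a pre-staged run script in S3; the region, size, image, and URL here are placeholder values rather than what the released code uses.

```python
import os
import requests

DO_API = "https://api.digitalocean.com/v2/droplets"
HEADERS = {"Authorization": f"Bearer {os.environ['DO_TOKEN']}"}

def launch_ephemeral_droplet(program, operation, run_script_url):
    """Start a droplet whose startup script does the work, syncs to S3, then powers off."""
    user_data = f"""#!/bin/bash
# Pull the dynamically generated run script for this program/operation
curl -s -o /tmp/run.sh "{run_script_url}"
chmod +x /tmp/run.sh && /tmp/run.sh
# Final command: power off so the scheduled cleanup Lambda can terminate this droplet
shutdown -h now
"""
    droplet = {
        "name": f"brevity-{operation}-{program}",  # naming ties the droplet to its program
        "region": "nyc3",
        "size": "s-1vcpu-1gb",
        "image": "ubuntu-22-04-x64",
        "user_data": user_data,
    }
    return requests.post(DO_API, headers=HEADERS, json=droplet, timeout=30).json()

def cleanup_handler(event, context):
    """Lambda run on a five-minute schedule: delete any droplet that has powered itself off."""
    droplets = requests.get(DO_API, headers=HEADERS, timeout=30).json().get("droplets", [])
    for droplet in droplets:
        if droplet["status"] == "off":
            requests.delete(f"{DO_API}/{droplet['id']}", headers=HEADERS, timeout=30)
```

The cleanup sweep can stay that simple because of the naming and shutdown convention: any droplet that has powered itself off is, by definition, finished with its work.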
You'll see snippets, and I think that's already in the supplementary article, so you'll be able to see how I do it. But it's really that just-in-time, dynamic approach that I've been leveraging.

The last piece is the program analysis. I know I already mentioned that the data is stored in AWS S3 buckets, but I have buckets for raw data, buckets for refined data, and buckets for datasets I pull in from third-party sources: think the MaxMind APIs, ASN lists, GeoIP, and so on. Then there's a curated, or presentation, area for the presentation pieces. Here's a quick site dashboard as an example: I point it at that data, and it correlates, aggregates, and scans through it all and presents it back to me with visualizations and summarized output, so I can make better use of my time and not dig through thousands of individual text files, which I know a lot of us face and struggle with today.

All right, with the time left, I think we should try a live demo. Let's walk through the recon engine end to end, the whole workflow, and as we go through it I'll explain what each step is doing and try to fill in all those pieces. I think that will probably wrap up the rest of this talk, but there's going to be tons of follow-up information. I'd advise you to follow me on Twitter and check out the blog; I'll continue to post more information so you can replicate big chunks of this, if not all of it, yourself.

So let's go ahead and jump into the AWS console. I know earlier I mentioned you can kick things off with curl requests, but you can also test directly within the ecosystem, so I'm going to jump straight into API Gateway. This is the API structure I currently have set up. Since we're going to test adding a brand new program that doesn't exist anywhere else so far, we'll check out our POST request, and we can see how it's set up: it goes through API Gateway, gets proxied straight through, and hits this back-end Lambda for the POST request. One thing I don't think I mentioned yet is that I do all of the development within Cloud9, which is another AWS service. That also means I don't need any crazy powerful laptop; all I need is something that can get to a browser. So we'll hop into the Cloud9 development environment. This is what the IDE looks like; it's very similar to something like Visual Studio Code. I have some test cases written out here, so as I'm deploying new code I can come in and grab some quick examples. We want to do a POST request, and I know this one is an example I can test, so let's copy it over and put it into the API Gateway test console. If it were a GET request we'd put it in here instead, and you can add custom headers or anything like that, but we're going to go straight to the request body and paste this in. Now, this program already exists, so just for demo purposes, to show something completely brand new, let's call this recon... actually, let's call it "defcon recon village." That's going to be our program name.
Here's our scope-in: we're going to give it a wildcard, to make sure we can do some additional recon steps, and add another site. Let's put in an out-of-scope site as well, and we'll call it an external platform; these are all my own sites. So what we're basically going to do is crawl my blog as the site. We'll kick this off, I'll hit test, and we should get a response in a second that tells us whether it successfully loaded or not. Let's hope for success... "program successfully created." So it created the program, and it has most likely entered our Step Functions workflow.

Let's go into AWS Step Functions; this is our orchestration unit, and we can see it's running right now. Let's click in and see what it's doing. I love how you can get a visual status of everything it's going through. As we go through here, it hits the first choice state and kicks over to the program build, based on the operation we sent it. It built the program and loaded it into DynamoDB, then it checks the build status, sees it was successful, and goes ahead and does a Sonar query against the Rapid7 forward DNS dataset. It does that as an Athena query over an S3 bucket, and it's in progress, pulling those results back. If we go into the Athena service we can look at our query history, and we should see the current query it's running against the Rapid7 forward DNS dataset. Here's exactly what it's running; it just loaded, I guess I clicked it, so we have it here. This is what it automatically generates to pull the results back, and it looks like it completed successfully, so it will have stored those results into an S3 bucket.
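The Sonar lookup it just ran is roughly this shape. I'm assuming here that the Rapid7 forward DNS data has already been registered as an external table in Athena; the database, table, and column names are placeholders for illustration, not necessarily what the released code uses.

```python
import boto3

athena = boto3.client("athena")

def query_sonar_fdns(program, apex_domain, output_bucket):
    """Pull candidate subdomains for an apex domain out of the Rapid7 forward DNS dataset."""
    # Table and column names assume the FDNS data is registered in Athena as
    # recon.sonar_fdns with the standard name/value/type fields.
    query = f"""
    SELECT DISTINCT name
    FROM recon.sonar_fdns
    WHERE name LIKE '%.{apex_domain}'
      AND type IN ('a', 'cname')
    """
    execution = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "recon"},
        ResultConfiguration={
            # Athena drops the result set into S3 as CSV, which the next step picks up
            "OutputLocation": f"s3://{output_bucket}/raw/{program}/sonar/"
        },
    )
    return execution["QueryExecutionId"]
```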
Now it's running httpx. Since this is an active process against the actual target, my website, it kicks off a DigitalOcean droplet to run the command, and it builds and installs everything from scratch. If we go over to my DigitalOcean account, you can see that this droplet just finished getting created, with its IP information, and I have a naming structure so I at least know which program it belongs to. While this is running it's probably loading everything up and doing all the installs, so we'll keep an eye on it, but we'll talk through some other steps as it goes.

Let's jump back to the step function. What is really neat about Step Functions is that you can do something called callbacks. One of the limitations of the Lambda service is that a Lambda can only run for 15 minutes, and you're charged for the duration it runs, so when you're doing something like a web crawl or an httpx scan, you really don't want your Lambda function waiting the entire time for that to return. So what we did was: we called the Lambda, the Lambda kicked off the program generator operation and built the droplet, and then the Lambda completed and closed. Right now we're not getting charged anything in the AWS ecosystem for running, and the Step Functions state machine is waiting on what's called a callback. When I built the droplet, I added a callback task to the startup script that says: when you get to the end, the last thing to do before shutting down is make an API call to the Step Functions service in AWS and tell it you've completed your journey, with success or failure. When it reads that in, it determines where to go next in the workflow. So this workflow is essentially in a paused state, just waiting on that callback to happen. We'll give that some time and see when it completes.

What else can we jump into while we're waiting? Let's check out some of the actual Lambda functions that ran, like the program build; I think that's pretty interesting. These are our Lambda functions and what they look like. We're running httpx, so let's look at operation-httpx. Basically it's listening for an event and context; every single Lambda has this same format. Events are sent into it, and you can see them get triggered as it goes down through. For the operation-httpx step we can click here, look at the step input, and see what was submitted to that Lambda function: the program variable and the operation. When we go back to the Lambda, we can see where we pull in the data parameters and then check "does the program exist" and "does the operation exist." As long as we have both of those, we know we're on track with the right status, and it's going to run. It creates some initial file names for the domains, stores them in S3, and then here are some other commands: prepare httpx, generate the install script.

Let's hop in and check out the install script. I think this is one of the coolest things, and it took a while to figure out: how am I going to manage scripts across 600-plus programs? What if I want to change the syntax, or there's a project update where I need to add a different command? So what this does is generate the scripts dynamically at runtime, just in time. I don't have a bunch of scripts sitting there from program build; the script gets generated on the fly, immediately before the operation is initiated, so I only have one central place in the code where I need to update these scripts, which makes it really easy to make changes, modify, and keep things up to date. If we jump down into the code that does that, you can see that generate-script-httpx is writing the file using Python code, generating this file here. This is the actual operation I'm using: you can see the httpx command, I'm passing in my program name parameters, it gives me a program-name-based output file so I know which program it came from in the JSON, and then I'm telling it the syntax of what I want and to store the raw responses.
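As a rough sketch of that just-in-time generation, assuming a particular bucket layout and set of httpx flags (the exact ones in the released code may differ), the idea looks something like this:

```python
import boto3

def generate_httpx_script(program, bucket, task_token):
    """Write the run script for this program just in time and stage it in S3 for the droplet."""
    script = f"""#!/bin/bash
# Pull the seed domain list for this program from S3
aws s3 cp s3://{bucket}/refined/{program}/domains.txt /tmp/{program}-domains.txt

# Probe every domain: status codes, titles, tech fingerprints, full responses to disk
httpx -l /tmp/{program}-domains.txt -json -title -status-code -tech-detect \\
      -sr -srd /tmp/{program}-responses -o /tmp/{program}-httpx.json

# Sync the raw output back to S3 so nothing is lost when the droplet dies
aws s3 sync /tmp/{program}-responses s3://{bucket}/raw/{program}/responses/
aws s3 cp /tmp/{program}-httpx.json s3://{bucket}/raw/{program}/httpx/

# Callback: tell the waiting Step Functions state machine this operation is done
aws stepfunctions send-task-success --task-token "{task_token}" \\
    --task-output '{{"program": "{program}", "operation": "httpx", "status": "success"}}'

# Final command: power off so the cleanup Lambda terminates the droplet
shutdown -h now
"""
    key = f"scripts/{program}/run-httpx.sh"
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=script.encode())
    return key
```

Because the script only exists in one generator function and gets rebuilt on every run, a syntax change or tool update only has to be made in that one place.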
So I'm storing the raw responses locally on the droplet that's running, and eventually it gets everything finished and processes it. At one point I was actually tarring up all the responses and storing them that way, trying to save space, but it became more difficult and time consuming to continually untar the files, especially once you're into millions of individual files, so now it just does a copy over to S3. This is the sync script; it's just another script in the supplementary materials you can reference. It does the sync, waits, and then when it's done, this is the callback to Step Functions saying: I finished everything, I wrote all my files back to S3, I'm all done here, move on to the next thing. When I write this file, it's all written to S3, so when the droplet runs, it downloads the file to the server and then knows, based on the variables, to run it directly. It's essentially a script to run more scripts.

Let's go back and check the status of our step function. It looks like it succeeded, both in getting the httpx data and in processing it. If we go back to DigitalOcean, check it out: if you remember, the first droplet we created was called brevity-httpx, and it's gone. That means the Lambda function that checks for shut-down droplets already ran and deleted it for me; it's completely gone, no history of it. And right now it's standing up the web crawl of the new data it found, so we'll let it crawl and jump back to Step Functions to see what happens.

One thing I'm doing with the httpx data that I think is helpful, and isn't out of the box from the command, is this: when you have so many different programs and so many output files, it's one thing to add the program name to the naming convention of the output file, but what happens when you start to aggregate the data itself and it doesn't carry the program reference? I found that to be fairly consistent across a lot of tools: the output has the information, the URLs, the domains, but when I do my aggregate analysis and see some random domain, I have no idea where it came from without trying to track down the source file. Honestly, that's one of the most difficult parts: tying it back to the original program. So the process-httpx step reads the JSON file in plain Python and adds two additional elements: the program name, and the base URL. The httpx output has the full URL, the domain, and the parameters, but it doesn't have that base. Why is the base important? For things like Nuclei templates, where I don't want to just send my payload against something.com/; I want it to go after something.com/api/whatever.
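A minimal sketch of that processing step could look like this; the only field I'm relying on from the httpx JSON output is the url, and the file paths are placeholders:

```python
import json
from urllib.parse import urlparse

def enrich_httpx_output(program, input_path, output_path):
    """Add the program name and a parameter-free base URL to every httpx JSON record."""
    with open(input_path) as infile, open(output_path, "w") as outfile:
        for line in infile:
            record = json.loads(line)  # httpx writes one JSON object per line
            parsed = urlparse(record["url"])
            # Tag the source program so aggregated queries can trace any URL back to it
            record["program"] = program
            # Keep scheme, host, and path, but drop query parameters for cleaner targeting
            record["base_url"] = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
            outfile.write(json.dumps(record) + "\n")
```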
Whenever I can generate the distinct base URLs, without any parameters or anything extra on them, it means fewer bad requests that I'm going to be sending to these programs when I do any type of fuzzing.

As the crawl goes, it follows a similar process: I dynamically write the crawl script, it feeds the crawl, writes the results back to S3, and does all the processing and analytics. One of the things I think is really neat is that Amazon has a service called Glue, which is basically a crawler that can crawl across an S3 bucket. As security researchers or bug bounty hunters, we probably have hundreds of output files from the same tools over and over, so I'll keep using httpx as the example. What happens is that I aggregate the 100-plus httpx outputs we have across programs, put them into the same directory structure within a bucket, and set up what's called a Glue crawler that crawls the bucket and automatically extracts the data. You can see here that it goes through and finds that I've run the recon on 91 programs so far; I don't have all 600-plus yet, I've been going slowly, trying to monitor and tune things as we go. So right now I have 91 programs, and you can see there are just over 10 million unique URLs indexed across those 91 programs. It does smart indexing: it creates a structured index table that points to the raw JSON output files from httpx. Now I can use the Athena service, with Presto functions and query syntax, and query programs based on the httpx aggregate. I believe the program name and the base URL I mentioned are in there as the custom fields; they're not out of the box, but I added them during processing so I can do smarter queries against it.

When we hop back, it looks like the step function is still running, so we'll give it a bit more time. Let's check... yep, you can see the server is still going. While we're waiting, let's jump into Athena and I'll show you what I mean by some of those queries. We pull in all the httpx data, and let's use Tesla as the example. This is a sample query I wrote, and you can see the table syntax; I'll expand it here, and here are all the different fields. Let's say you're doing fuzzing or brute-forcing APIs: you might want to look for anything with the word "api" in the URL, just like it's written here, and with a status code of 200. Or maybe you don't want 200; maybe you want to try to send payloads that bypass a 403 Forbidden, or something like that. So this query runs, I hit Run Query, and it returns what looks like two results. We'll watch it run: in about five seconds it scanned 12 gigabytes of data and returned these two results. Now let's adjust it. I haven't tested this one, so let's see what happens with a 403 status, whether we get any more or any fewer; I don't know what will happen here, and hopefully we get some results that would be fun to talk about. Let's watch it run... zero results. That's no good, but you can see what I mean: you can start building queries like this against the aggregated data.
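Here's a hedged sketch of running one of those queries programmatically and pulling the matching URLs back; the database, table, and column names are assumptions based on what's shown in the demo, not necessarily the exact schema Glue generated.

```python
import time
import boto3

athena = boto3.client("athena")

def query_httpx_index(program, output_bucket):
    """Query the Glue-crawled httpx table and return matching URLs for one program."""
    # Database, table, and column names are assumptions based on the demo output.
    query = f"""
    SELECT url, status_code, title
    FROM recon.httpx
    WHERE program = '{program}'
      AND url LIKE '%api%'
      AND status_code = 200
    """
    qid = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "recon"},
        ResultConfiguration={"OutputLocation": f"s3://{output_bucket}/athena-results/"},
    )["QueryExecutionId"]

    # Poll until the query finishes - usually a few seconds, even over gigabytes of JSON
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    # First row is the header; each remaining row is one matching URL record
    return [row["Data"][0].get("VarCharValue") for row in rows[1:]]
```

The same result set also lands as a CSV file in the OutputLocation bucket, which is what the downstream steps can pick up directly.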
From there you can start thinking about what you want to do downstream: do I want to look for specific URL parameters or anything else? You just write it in here. What's cool is that at the completion of every one of these queries, Athena outputs the results as a CSV file into an S3 bucket. So if you take that ephemeral approach for a Nuclei template or something similar, I can do the same ephemeral startup of a droplet, install Nuclei, and pull a version of this query written so it only selects the URL column. That way I have a list of URLs that meet the criteria, downloaded from the S3 bucket and automatically pulled in as the word list that feeds the URLs it targets, from an intelligence standpoint. I've just increased the likelihood of a hit without having to send millions and millions of requests that are never, ever going to produce a finding. I think that's pretty cool, and I'm really excited to keep digging into that.

All right, when we hop back into our workflow in Step Functions, it looks like our command finally completed, so let's check it out. I think one of the fastest ways to verify is the Glue and Athena search, so let's see if the table actually updated successfully. We'll do a quick Athena query; we'll use the previous one, create a new query, and select the URL and the content length. Let's take out the URL "api" wildcard and the status and just filter straight on the program, so we have WHERE program equals... what did we call it? defcon recon village. Order by content length; that looks fine. Let's run it and see if it worked. It takes about five seconds, and check it out: here's our program, and here are all of our URLs, ordered by content length, so we can see what returned and what didn't.

I think the last thing to show is the recon dashboard. You can see I can automatically update this dashboard to pull in the data; I have just over eight million rows imported into these analytics, and I've already custom-sorted by program here. I don't think I have it filtered on private versus public invites, so I'm not going to hit the dropdown, but you can see that I have 91 total programs and a count of URLs, with eight million URLs in there. I have a dashboard where I can look for unusual response codes that might be anomalies. If I scroll down to the bottom-ranked ones, we have some with a 300 response code; I don't even know what that is, but it sounds like something we should probably check out. Or a 202, or a 203: if I have one result out of eight million requests, it could be an anomaly, so we can go through there and dig in. You'll also see the different status codes we have, and you'll notice pretty quickly that some of these are out of scope. There are certain things I'm not going to attack or send payloads or exploitation against, but it's still interesting to see the third-party libraries a program points off to. What I think will be interesting in the future, as we cross-examine these using graph databases and similar tools, is how many commonalities we have across the programs as a whole. Are they reaching out to the same JavaScript libraries? If there's something like this connect.facebook.net sdk.js file, are there 60 programs out of the 600 that have that same relationship?
So if I approach it from the Facebook bug bounty program standpoint and find a flaw in that sdk.js file, I can submit that to Facebook, but at the same time let's submit it to all the other programs that share it and try to double up on some of the profits around it. Those are some of the relationship techniques you could use and apply.

As we wrap up today, I really want to thank you for checking this out and taking this journey with me. I don't think I mentioned it earlier, but I've probably put about 500 hours into this project, honestly. There's a lot of code, a lot of lessons learned, a lot of trial and error, and a ton of failures, and I'm actually starting to see some success; this is finally getting to the exciting part after the grind. So feel free to reach out to me. I love talking about this, and I'm excited to learn, meet a lot more people, share ideas, and hear how some of you are doing your own automation. I really appreciate your time today. Enjoy the rest of Recon Village, and I thank the team for having me. Thank you.