Hello, DEF CON Red Team Village. Thank you for taking the time to listen to this talk. I've really been looking forward to it, and I'm excited to share this information with you today. My talk is titled Combining Notebooks, Data Sets, and Cloud for the Ultimate Automation Factory. Throughout this presentation, I want to challenge you to push the boundaries of the art of the possible: to change how we do things in our day-to-day lives, to look at the manual processes and ask how we can be more efficient, how we can operationalize and streamline these activities so we can be more productive. And as a red team, a blue team, or a security practitioner, how can we better ourselves in our careers and elevate ourselves to the next step? It's definitely sad that we're not all together in person at DEF CON. I had everything booked and ready to go, and I was excited and looking forward to it, but I will say this virtual offering has really changed the dynamic of the information security and hacking landscape. It has leveled the playing field for a lot of people who didn't have the means to attend Vegas or make it to DEF CON or Black Hat. Now there's a free, online, virtual conference with direct access, through Discord, Slack, and other channels, to the industry experts, the people I've looked up to and learned from for years, and being able to share the stage alongside them is a pretty awesome time. The other cool thing is that I can now sit in Discord, ready for questions and answers as we're doing this talk, so feel free to interact as we go. We won't spend a lot of time on the About Me section, but by day I'm a security architect, and my main focus is cloud security. I get to see a lot of use cases around the latest emerging technologies in cloud: how they're being leveraged to help businesses drive innovation and success, decrease operating costs, and accelerate delivery. I'm taking a lot of those examples and combining them with my night job, which is security research. Security is my job, but it's also my hobby; any spare time that isn't spent working or with family, I'm probably doing security research in some way, shape, or form, so feel free to check out these projects. I have a blog on Medium, and any code, examples, demos, and the slides from this presentation will be available in GitHub, so you can start checking that out. If you want to follow along, a lot of the content, resources, and notebooks we'll go through are available there, so I encourage you to pull it up while I'm talking, check things out, and ask questions as we go along. Thanks again, and let's jump right in. So what are we actually going to cover? For the agenda today, there are a couple of things, but I really want to focus on inputs and outputs. I think it's a mindset. We always talk about the hacking mindset, the security mindset; within the scope of this talk, I want you to think about inputs. What are the objectives? What are the goals? What do you put in, and what are the outputs? How can we normalize those outputs so you don't have to remember how to run 500 different scripts?
How do you make these tools work? How do I manually parse through all of these outputs and data? Throughout this, we're going to show modular, repeatable, technology-agnostic designs, and then how to apply them with cloud-focused technology. We're going to spend our time in the AWS ecosystem; however, you could translate this to any of the major cloud providers. We're also going to provide some solutions for tying this all together so you can become more efficient. I'm hoping that if you watch this talk and apply these concepts, your key takeaways are: make yourself better, make the people around you better, and, overall, let's all work together and make the industry better. So this is the story of my life. Thank you, Jason, for sharing these tweets, because the timing was perfect. This is in the scope of bug bounty: if you've ever worked on bug bounty, you see these tweets of "I just made $5,000," or "I made a $15,000 bug," or "I just made $400." That's how it always feels: wow, let's just spend a couple of hours and make a couple thousand bucks. But that's not the case. I don't know if this resonates with any of you, but I'm the person where, if you ever saw me tweet "I just made $2,000 on a bug bounty," it's because I probably just put in 100 hours of reconnaissance, tooling, tests, failures, and report writing just to make that. That's what often goes on behind the scenes and gets overlooked. Our most valuable commodity is time. My biggest challenge is finding the time to really get in the weeds, dig in, and work toward these red teaming objectives. I'm always relearning things, looking things up, Googling, Stack Overflow, all the places where I'm trying to relearn what I've learned historically. How can we make it faster? How can we be more efficient? So when I have two or three hours in an evening to spend, I want to be focused. I want to look at my targets and know exactly what I want to do, not spend hours on reconnaissance research just making selections on what my targets are, because that takes up all my time, and then the value of that time is essentially zero, net nothing. I want to focus on how to be productive, how to make money, how to find issues, and how to further what I'm attempting to do. That's what drew me into building this automated ecosystem and taking advantage of the new technologies and capabilities out there, so that I can be the best I possibly can. So now let's jump into the architecture, the underlying design of the entire automated ecosystem we're going to build and talk about through the remainder of this presentation. We'll have a bunch of demos and code snippets that we're going to deep dive into, and I'm hoping you can leverage these as accelerators; they're agnostic to any specific tool or process. As you look at the different things you do and experience from a red teaming perspective, or even just from a general information security process or practice in your daily job, these are all components that can apply in a lot of different ways, which I think is really exciting.
And that's a big part of why, rather than just delivering a custom tool or a script that you routinely run and that does the same thing every time, what I want to teach is how to think in terms of building the underlying infrastructure that can support anything thrown at you. How can you move up from just knowing how to run a tool to actually understanding what's going on behind the scenes, piecing it together, and making it your own? That way you have a system that can potentially be better than others, bring you toward the best in the industry, or simply solve the challenges you have on a day-to-day basis. As we look at this architecture, it breaks down into three key pieces. On the left side is the user interaction layer. These are the cloud services you'll generally interact with, whether for input or output, and in the design we're going to walk through, it leverages four pieces. As we go further, I'll give a quick overview of each of these components, what it does and what it's used for, and in the back of your mind, keep thinking: how can I reuse this for some function in my daily job? How can I convert the manual efforts I do every day, save time, be more efficient, and automate so I can make my life better? You can essentially automate yourself, and then focus on the latest emerging technology, push your program to mature, and keep building on top, so we're not relearning and we're not doing the same manual tasks over and over. This ecosystem will really elevate that; it's modular, and it's really an ecosystem of a lot of microservices. The middle layer is processing and computation. The user interaction layer passes everything over to processing and computation, and this is a pretty big arsenal of different tools: big-data query and analytics services, and eventually, though we won't show any of these today, things like machine learning or image recognition. Think in terms of future opportunities: if you have screenshots of web pages and want to process a bunch of those images really quickly, this would be your processing and computation layer. The last layer, which I think is key and really important to keep in mind as you consider future opportunities, is the data storage layer. Look at the processes you have today: if you're doing reconnaissance, you might run five or six different tools. The inputs you submit might be a domain, an ASN, a CIDR range, various things. The outputs of your various tools and scripts are all going to be slightly different, and that's where the source data bucket comes into play. I anticipate eventually pulling 100 tools together; let's put all of that source data into the source data bucket, in whatever various formats it comes in.
And then what we want to do is think in terms of how we normalize this. What's the actual objective? We're going to look at it from a red team, bug bounty perspective, but think broader about your actual job, whether you're doing risk management or third-party analysis, or you're a security engineer or architect; think about how this can apply. Look at the inputs and outputs: a lot of our tools output things such as URLs, parameters for a web page, ASN numbers for network ranges, CIDR ranges, domain names, subdomains, company names, certificate assignments, all sorts of different inputs and outputs. Those are the various outputs you'll generally have across your tools, and we need to make them common so we can further process the data and make it useful, because one of the areas I struggle with the most is parsing through the data from every tool, and it just takes so much time. If you normalize the data once for a tool, you never have to do it again (there's a small sketch of what this can look like after this section). And the cooler part is that we only need one person in the industry to do it per tool and start to standardize and build it in. That's where, on my GitHub, you'll see Project Straylight. My goal is really to have a running, open network of accelerators and code snippets that have helped me along the way. It's maybe not so much a technology or a tool that you install; it's a bunch of ideas and things that have helped me over the past 10 to 12 years that I've either had to relearn or redo manually. I'm trying to aggregate it all into this ecosystem and architecture, so that the newer generation, the next people who are going to be the experts and leaders, can build on top of what we already have instead of continually rebuilding, and we can advance further and win as an industry. That's a lot of the intent there. So as we look at the data storage layers, you basically go from your raw data, which could be the output of any tool, to a more normalized version, and then into your presentation layer. That's your reporting, your analysis, your metrics, the outputs you want to show for the delivery of your work. And this could speak to any level depending on how you present it. You could do an aggregate of all the attacks you've had over the course of some number of days, summarized into charts and graphs, which might be more of a CISO-level view. You could have indicators of compromise or signatures that you feed to an engineering team as part of your presentation layer. It's about how you normalize and show this in a formatted way, so that if you have to do monthly vulnerability reports or metrics, or show how you've progressed, or analyze websites or targets and compare what you've done and haven't done, that's where the presentation layer comes in.
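As a rough illustration of what that normalized layer could look like, here is a minimal sketch. The tool names, output formats, schema (source_tool, finding_type, value), and bucket name are all assumptions for illustration, not a convention prescribed in the talk.

```python
import pandas as pd

# Hypothetical raw outputs from two different recon tools (names and formats are assumptions).
subfinder_domains = ["mail.example.com", "vpn.example.com"]
amass_records = [{"name": "dev.example.com", "addresses": ["203.0.113.7"]}]

rows = []
for d in subfinder_domains:
    rows.append({"source_tool": "subfinder", "finding_type": "subdomain", "value": d})
for r in amass_records:
    rows.append({"source_tool": "amass", "finding_type": "subdomain", "value": r["name"]})
    for ip in r["addresses"]:
        rows.append({"source_tool": "amass", "finding_type": "ip_address", "value": ip})

findings = pd.DataFrame(rows).drop_duplicates()

# Write the normalized findings to a "refined" S3 bucket (requires s3fs; bucket name is hypothetical).
findings.to_csv("s3://my-refined-data-bucket/findings/example.com.csv", index=False)
```

Once every tool lands in one common shape like this, everything downstream (queries, dashboards, notifications) only has to be written once.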
So we're going to jump in, and I think we'll do a demo and show, end to end, one path through this ecosystem and how it works, and after that we'll dive in further. I'm going to show you the accelerators, the pieces, the components, and keep in mind that all of this is also available in GitHub. You can access it as we go along, check it out, feel free to use it, create issues in the GitHub if you find anything, and reach out to me with questions. I really just want to keep building and teach you how to think in terms of building your own automated ecosystem. As you're watching this, you're probably thinking: why cloud? I have a really nice home lab; do I really need it? Beyond the architecture we just walked through, there are three key areas that make cloud pretty exciting for an individual researcher, where you don't need millions of dollars or tens of thousands of dollars of funding behind you to make progress. The first is the democratized accessibility of datasets. There are some pretty massive and interesting datasets out there. The problem is that our home labs, home computers, and home servers often don't have the horsepower to accomplish what we need; we need more memory or more CPU, or we're not close enough to the data, and we can't download petabytes or terabytes of data in a meaningful way to keep up, analyze, and do anything useful. That's where you run into data gravity: you need to build your compute closer to the data stores and data repositories, because it's costly and expensive to move them. The second piece: look across the world right now; we're all in quarantine and social distancing. Look at what's going on around the race to find vaccines and solutions to COVID, and the data around that. The world is rallying around cloud and the latest computational technologies to solve the biggest problems that impact everybody globally. So what if we take the same pieces that the top scientists in the world, the top people in their industries, are using to solve the world's most challenging problems? I'm sure we can apply them to our daily bug bounty targets or the daily security processes we need to improve, and in the same way, we don't have to spend tens of thousands of dollars to do it. The third thing is that it's financially viable. We can now stand up servers with 128 gigabytes of memory and use effectively unlimited storage. The nice thing is you don't have to pay thousands of dollars to buy this as a capital expense; it's all operating expense. So if you need to run something extremely memory intensive, and it's just one operation or one loop, you can literally stand up a massive server for just a few minutes; it might cost you $10 or $20 for five or ten minutes of processing.
But rather than needing to invest $80,000 in similar hardware, you can now do this from your home office or home lab through cloud capabilities and get charged that $20. If you think about it in terms of bug bounty investment, that's one P4 payout and you have it well covered, and that's on the low end. Looking at it from a business perspective, think in terms of how you can keep investing in your capabilities, technologies, and processes so you gain more time, and make the time you do spend manually doing things and evaluating results as beneficial as possible. So now let's take a look at the various tooling we can use. We've looked at the architecture diagram; I'd like to give a quick overview of the different components, especially if you're not familiar with the AWS services listed here: what they do, why they're important, and how they can apply to you. Going through the list, from the user interaction space, there's Amazon API Gateway. This is just an entry point into your ecosystem, and there's a lot you can do with it, but where I've really leveraged it is as a proxy into a Lambda function. You can do a GET or POST request through the API Gateway and pass it directly into an AWS Lambda function. If you think in terms of the Jupyter notebooks and Python functions where you pass something in, you can literally pass it through a GET request to the API Gateway into AWS Lambda. I view that as the top tier of the ecosystem: once you've really normalized your content and you have exception handling, you know you're ready to completely automate it, so promote it into a Lambda function, pull it out of your Jupyter notebook, and put it behind an API Gateway. Then you can kick it off and call it ad hoc as you need, and really start your ecosystem of automation there. Another tool is Amazon SageMaker. It's really a larger framework and compute system where you can do things like machine learning and training, but what I pretty much solely use it for is Jupyter notebooks. You can stand up the notebooks, run them, and adjust the instance size; if you need a lot more memory or storage for a particular function, a lot of times I'll ramp up to a larger instance for just a few minutes, do that process, and then turn it back down to the standard level. So Amazon SageMaker is the mechanism for running the Jupyter notebooks. Static website hosting in S3: you can set up an S3 bucket so it hosts web content and is publicly available. You set an index HTML page and you can run websites out of S3 buckets, which is really neat. Amazon Simple Notification Service is a way to send push messages, whether SMS texts, emails, or other things; it's really just a notification service. In this case, I'll show a code snippet where we can actually send ourselves a text.
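As a rough sketch of that kind of notification, here's what publishing an SMS directly to a phone number with boto3 can look like; the phone number, region, and message are placeholders, and this isn't the exact snippet from the talk.

```python
import boto3

sns = boto3.client("sns", region_name="us-east-1")

# Send yourself a text when a long-running job finishes (number and URL below are placeholders).
sns.publish(
    PhoneNumber="+15555550123",
    Message="Recon run complete: results at https://example-bucket.s3.amazonaws.com/results.csv",
)
```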
That way you know when something is going to take a little bit of time, or you just want to get notified, or you want the final URL where your data will be accessible: you can get texted whenever it's ready. So you can do bug bounty reconnaissance and all of this on the fly as you build the system, which is pretty cool. Now, as we move into the processing and computation layer, AWS Lambda is really what's used to run code. It supports a lot of different runtimes, not just Python but Node and others, and it runs on serverless compute. You put the code in there, it stands up the compute resources, and it only runs for a duration; I think the longest a Lambda can run is about 15 minutes, and you're charged based on the duration the computation is running. So if you think in terms of smaller Python functions that are ready to automate and scale, you can migrate them to Lambdas and they can always be ready to work off triggers (there's a small sketch of this after this section). Amazon Elastic Block Storage: there's a little bit of a caveat with S3, where you get charged on GET requests, PUT requests, and writes, whereas with EBS storage you pay for how much storage you have allocated. So if I think I'm going to do millions of reads or writes, sometimes I'll do a lot of my testing against Elastic Block Storage, and once I have something robust and ready to go that I want to leverage in the greater ecosystem, I'll move it over to the data storage tiers, whether the refined or the raw data levels, for more normalized and automated operations. AWS Glue is a way to set up crawlers over datasets. Athena, picture it like this: you write your SQL queries in its query language, and it runs them distributed against datasets. It can run against CSV files, relational data stores, structured data, all of those, so it's really powerful, and if you haven't ever dug into AWS Athena, I highly recommend it. You can literally sit in the console interface and write ad hoc queries against big datasets if you want. Glue, as I started to say, is what indexes a lot of those data points; you can create Glue catalogs so you don't have to query the entire dataset every single time. AWS Step Functions: you won't see anything in the demos today leveraging this, but it's a future capability I want to use. It's like setting up workflows and processes: if you need X to happen, then once X completes, send the output to the next step. Right now I'm just leveraging triggers, alarms, and Lambda functions, but you can really start to use AWS Step Functions as orchestration, as a central core piece, so that's something I'm going to dig into more in the future. AWS Secrets Manager is a place where you can move all of your secrets and API keys out of your code. It's a quick call-out: you can store them, along with other parameters as you need, and pull them back into your code so they don't sit in the code itself, but you can use them as variables on the fly. So I highly recommend digging in there.
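To make the API Gateway-to-Lambda idea from above concrete, here is a minimal sketch of a handler sitting behind an API Gateway proxy integration. The route, parameter name, and what you do with the domain are all assumptions, not the speaker's actual function.

```python
import json

def lambda_handler(event, context):
    # With an API Gateway proxy integration, query-string parameters arrive on the event, e.g.
    # GET https://<api-id>.execute-api.us-east-1.amazonaws.com/prod/recon?domain=example.com
    params = event.get("queryStringParameters") or {}
    domain = params.get("domain")
    if not domain:
        return {"statusCode": 400, "body": json.dumps({"error": "domain parameter is required"})}

    # ...kick off whatever automation you promoted out of the notebook here...
    return {"statusCode": 200, "body": json.dumps({"queued": domain})}
```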
There are some accelerators you'll see that talk to those. And then on the right side are the S3 buckets; you can create multiple S3 buckets for the different data stores, and it's just long-term data storage. So let's jump in with a demo. This demo shows how we can query Rapid7's Project Sonar, which is essentially the forward DNS of the entire internet, and it's sitting in a dataset within S3. Rather than trying to copy that entire dataset down into our own system, we're going to leverage a Jupyter notebook and run a distributed query against that external dataset, hosted not in our own S3 bucket but in the public datasets bucket, get the results back, and process them. It will take just a couple of minutes as we walk through and talk to the code, but the query itself against the forward DNS dataset for a domain wildcard generally takes about 26 seconds. On the screen here you can see my Jupyter notebook, and we'll walk through it, because I want you to think about how you can reuse these components and how this is applicable well beyond just this Sonar use case. So let's start by running this first section of code. This is what I mentioned earlier, which is nice: even if you're not an expert at code, you can break it down into cells and run each cell. We run this first cell, and you can see the execution ID here. We set our domain to microsoft.com, so we're querying all the wildcards of the Microsoft domain in the forward DNS space. Keep in mind that we are not touching any of Microsoft's systems; this purely utilizes the forward DNS dataset, and we get the results from there. Then we set some variables: the bucket where we want to store the results, the database we want to query against in Athena (the setup information is all contained in the GitHub if you want to set that up yourself), and the table we want to query against. Next we set up our query, which basically says: select everything matching wildcard microsoft.com, and select only the latest date in the dataset, so we don't go through every historical iteration. It issues the Athena query, it takes about 26 seconds to run, and we have the Athena query ID here. In this next component, and I'll actually jump back up to this %run command I'm highlighting, this is a way you can call other notebooks, which is kind of neat. Just as you call functions within programming code, this will actually load and run an entire notebook. What I've done is take anything that's not tool specific and put it into a common notebook that a lot of different things can use. So the query_athena function here is running out of that other notebook, because I've set it up so that you just pass in the results bucket, the query, and the Athena database, and it will kick it off.
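For reference, a stripped-down version of that kind of helper, written against boto3's Athena API, might look like the sketch below; the database, table, and bucket names are placeholders, and the speaker's actual query_athena notebook function may differ.

```python
import time
import boto3
import pandas as pd

def query_athena(query, database, output_bucket):
    """Run an Athena query and return the result set as a pandas DataFrame."""
    athena = boto3.client("athena", region_name="us-east-1")
    qid = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": f"s3://{output_bucket}/athena-results/"},
    )["QueryExecutionId"]

    # Poll every five seconds until the query finishes.
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(5)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {qid} finished in state {state}")

    # Athena writes the results as <QueryExecutionId>.csv in the output location (reading it needs s3fs).
    return pd.read_csv(f"s3://{output_bucket}/athena-results/{qid}.csv")

# Example with placeholder database/table names:
# df = query_athena("SELECT * FROM sonar_fdns WHERE name LIKE '%.microsoft.com'",
#                   "rapid7", "my-results-bucket")
```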
It doesn't matter what tool it's specific to; it will just run the query you tell it to run, which is why I moved it into that standardized function area. So we run this next cell, and it pulls the results back. You can see it has already succeeded; if we had run this immediately, it would loop and check for the results every five seconds. It pulled back 18,695 rows, and you can see the types of information we just got: 18,695 forward DNS entries for microsoft.com in under 26 seconds, with the IP addresses, CNAMEs, A records, MX records, all of those. Next, we leverage the MaxMind database and start querying the IP addresses to get that geolocation information back. So let's run this. It creates a data frame, which leverages pandas. Pandas is a library within Python that I completely recommend; it's one of the best things I've ever found for doing metrics, analysis, and automation. You can see it just processed 18,695 rows in a few seconds, and now we have the latitudes, longitudes, countries, and localities merged onto all 18,000 entries. Now, if you say, "Hey, I just did all this work and I don't want to risk losing it," even though theoretically you could redo it in about 30 seconds, you can use this command to save your data frame off to Excel locally. If you run this, it stores those 18,000 records into an Excel spreadsheet, which you can reload later on, which is handy if you want to keep your artifacts and data points. Next we run this command, which just aggregates, and you can see how fast that was; it was almost instantaneous when I pressed the button. It grouped up all the different latitudes and longitudes so we can get ready to plot them on a heat map, so we know 525 entries have this latitude and longitude, and so on. This also calls the other notebook, because it's centralized and not specific to the Project Sonar dataset, so you can reuse any of it. I am going to jump back up, though, because I think this is something that's pretty neat if you think about how you can reuse it: the get location function behind the scenes that pulled in and merged all of this together. If you look, I passed the entire data frame as part of the function call, and I passed the value column, which is the IP address column. So in terms of reusability: if you have an Excel spreadsheet of IP addresses, or a CSV, or anything like that, you load it into a data frame, and you can pass that entire data frame, no matter what the rest of it looks like, along with the name of the column the IP addresses are in, and it will process it (it will obviously fail on entries that aren't IP addresses, such as CNAMEs, as it goes through them).
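As a rough approximation of that reusable lookup, here's a sketch using the geoip2 library against a local GeoLite2 City database; the database path, data frame, and column name are assumptions, and the speaker's get location function may be implemented differently.

```python
import pandas as pd
import geoip2.database
import geoip2.errors

def add_locations(df, ip_column, mmdb_path="GeoLite2-City.mmdb"):
    """Append latitude/longitude/country columns to any DataFrame that has a column of IP addresses."""
    reader = geoip2.database.Reader(mmdb_path)
    lats, lons, countries = [], [], []
    for value in df[ip_column]:
        try:
            city = reader.city(str(value))
            lats.append(city.location.latitude)
            lons.append(city.location.longitude)
            countries.append(city.country.name)
        except (geoip2.errors.AddressNotFoundError, ValueError):
            # Non-IP values (e.g. CNAMEs) or unknown addresses simply get empty fields.
            lats.append(None)
            lons.append(None)
            countries.append(None)
    reader.close()
    return df.assign(latitude=lats, longitude=lons, country=countries)

# Example usage:
# ips = pd.DataFrame({"value": ["8.8.8.8", "www.example.com"]})
# enriched = add_locations(ips, "value")
```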
So anything like this is completely reusable, and it's just a few lines of code; it's in GitHub and you can pull it. Thinking from a cyber operations or SOC perspective, you could get a lot of value out of this kind of information: you could look at attackers, web denials, web traffic, who's hitting your web application firewalls, and build heat maps and other views from an attacker perspective, which is pretty cool. So let's go on down. We have that information, and now let's do this next piece. It's just a couple of lines of code, but it's really cool: it uses the Google Maps API, and you could see how fast that was. It plotted the latitude and longitude information we built up above and created an interactive heat map, so we can zoom in and see what's going on. This shows the concentrations of Microsoft's forward DNS space across the globe, and you can look at the different regions they're in. For example, take this big area here: you can zoom in, and even though it shows the Seattle area, you can see it's not all in Seattle; it looks like they have a few offices in Redmond, some in Seattle, and above. So it's pretty cool that it's all interactive. The other neat thing: say I want to show this to management, to my leadership, and I can't have them opening a Jupyter notebook all the time. One option is to click the download button and just save the map as an image, but the other pretty neat thing, if you think about the architecture and that presentation layer, is what this next command does: it takes the map and uploads it to an S3 bucket that's hosting static content as a website. So I run this, the map file is uploaded to the website, I navigate over to my bucket, and if I hit load content right here, it pulls up that map. Now I can always send my leadership, my managers, anybody that wants to see anything, a link, and I can run this on an iterative basis, say every 24 hours. If you think in terms of what you want your attack surface to look like visually for your company, run this against your company every time, have this dashboard always going, and then you could even trigger on anomalies. It's just pretty neat to always be able to see: here's our coverage from the internet presence we have. I think there's a lot of value and opportunity there, and the fact that you can literally save the interactive map to an HTML file and have people browse to it and navigate it is pretty exciting.
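For reference, pushing a rendered HTML map (or any report) to a bucket configured for static website hosting can be as small as the sketch below; the bucket name, region, and file names are placeholders, and the bucket itself must already have website hosting and public access configured.

```python
import boto3

bucket = "my-recon-dashboard"          # placeholder: an S3 bucket with static website hosting enabled
s3 = boto3.client("s3", region_name="us-east-1")

# Upload the exported map with the right content type so browsers render it instead of downloading it.
s3.upload_file("heatmap.html", bucket, "heatmap.html",
               ExtraArgs={"ContentType": "text/html"})

# The website endpoint format varies slightly by region; for us-east-1 it looks like this.
print(f"http://{bucket}.s3-website-us-east-1.amazonaws.com/heatmap.html")
```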
For this next demo, we're going to use one of my favorites, and I'm really excited to keep building on its capabilities. It leverages a dataset that's publicly available in S3 called Common Crawl. It's a project that's been going on for about eight years now: it crawls the internet, searches and indexes, and actually downloads the HTML of pages and stores them in S3 buckets. So it's a massive repository, basically of web archives that you can search; if you've used the Wayback Machine and similar tools, this is very similar. What we're going to do is search it for a domain and pull back the files, and it can eventually download the files so we can store them in an S3 bucket and run a bunch of different tools around them. If we want to find URLs, run hakrawler, or run anything that does a good job of analyzing HTML, whether for vulnerabilities, URLs, or parameters, we can do that against it. So let's walk through this. I'll show you the framework around it, and there's a lot we can still build into this capability. The biggest challenge I've had so far is the asynchronous programming against it, because a lot of the loops take too long when it's a large site and you're trying to download and pull the files, but as we get more efficient with this, I'm pretty excited, and I think it has a lot of promise. We have all of our parameters set up, and we're going to run this one against a smaller domain for demo purposes, because if a domain has 70,000 pages, it's just not fast enough at this point to download in a timely fashion; we would be sitting here for maybe 40 minutes or an hour. We can always speed it up as we go, but in this case we're going to run it against derbycon.com. We've kicked off our Athena query; it's running now, searching for all of the web pages that have been indexed, cached, copied, and saved within the Common Crawl dataset. So let's kick off our run; this will wait for the query to stop running and then load the results into a data frame for us, so we'll let it run for a few minutes. All right, it looks like we have our results: it pulled back 579 different pages. Since the crawl has been running over the course of eight years, there are a lot of different versions and iterations, which is actually pretty neat if you want to see how websites have evolved and look for changes, or maybe a vulnerability, an issue, or a comment that they later removed, and you can see the difference between versions. But in this case I'm interested in just the unique URLs, so we're going to run this next command, which sorts them, drops all the duplicates, and keeps the latest version of each web page. Let's run it right now: that was pretty much instantaneous, and now we're down to 73 rows, so there were a lot of different cached iterations of the site. Then we can run the next command and save it off to Excel, so we have our URL listing for future analysis and future loading. As we go on, this next cell will pull out and list only our URLs, which is what we care about a lot of the time, especially when we're trying to scope web crawls.
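The sort-and-deduplicate step is plain pandas; a minimal sketch, assuming hypothetical column names for the crawl timestamp and URL, looks like this.

```python
import pandas as pd

# Hypothetical Common Crawl query results; column names are assumptions.
results = pd.DataFrame({
    "url": ["https://derbycon.com/", "https://derbycon.com/", "https://derbycon.com/schedule"],
    "fetch_time": ["2015-06-01", "2019-09-12", "2019-09-12"],
})

# Keep only the most recently crawled copy of each URL.
latest = (results.sort_values("fetch_time")
                 .drop_duplicates(subset="url", keep="last")
                 .reset_index(drop=True))

latest.to_excel("derbycon_urls.xlsx", index=False)   # optional: save the URL listing off
print(latest["url"].tolist())
```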
And keep in mind that everything we're showing today is completely 100% passive: we are not touching any of these sites or web pages. These are all examples of what you can do with 100% passive reconnaissance; we never touch a system belonging to any of the domains we've looked at. For next steps, based on scope, allowances, and permissions, you can start to get more intrusive as you do further analysis, whether from a red teaming perspective or otherwise. But what we have right now is just a listing of URLs. This next big section is a lot of different functions and code, though for the value it provides it's not too bad in terms of lines. What it does is go through the query results; in this case we're dealing with about 73 records, so it's not that many, but I think I ran one earlier this year against defcon.com or .org and it came back with about 70,000 records, so for some of these there's a lot of data and content out there and it can take some time. This is where it's not that efficient or streamlined yet; there's a lot of improvement opportunity here. It basically uses a library in Python called Beautiful Soup, which can parse HTML; it's great for HTML processing and has built-in capabilities to look for comments, URLs, and links. We set this up with three parameters we can pass in, mostly for efficiency. First, do we want to search the files? Generally we always want to do that. Second, do we want to write the files? It's always going to go out and retrieve the files from the S3 buckets, but do we want to write them? In this case we're not going to write them for the demo, but you could write them to an S3 bucket, and then you have your HTML files that you can run additional tools against, so it's pretty valuable. And third, how many records? If we just want a subset for testing and don't want to pull down all 60,000 at once, we can set how many records it should loop through and download. Then we run this, and the function processes them; it's going through, and you can see I have a counter on it, because a lot of times when you're just waiting you get bored and don't know where it's at (you could clean this up so it doesn't print on every single count). We've gone through the 73 HTML files, we've processed them as byte streams, we've looked for URLs and comments, and now we can save off the different sections: we've grabbed all the comments, all the titles, and all the links. If I run this next one, it shows the listing of links, and what I like about it is that it also shows which website each link came from, which I think is really valuable, because sometimes when you crawl you don't always know where you found a link or where it came from. In this case it does all the mapping, so if you start building a bigger mind map of it, there's some value there. And if we want to look at the comments across it instead, we can do that and just print them out. This will print the first 10 comments we have; let's see what it looks like, and there's your list of comments from within the HTML.
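A cut-down version of that kind of parsing, using Beautiful Soup to pull the title, links, and HTML comments out of one document, might look like the sketch below; the speaker's actual function adds S3 retrieval, counters, and write-out options on top of this.

```python
from bs4 import BeautifulSoup, Comment

def parse_page(html, source_url):
    """Extract the title, links, and HTML comments from one page, tagged with where they came from."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else None
    links = [{"source": source_url, "href": a["href"]} for a in soup.find_all("a", href=True)]
    comments = [{"source": source_url, "comment": str(c).strip()}
                for c in soup.find_all(string=lambda text: isinstance(text, Comment))]
    return title, links, comments

# Example with an inline document:
# html = "<html><head><title>Demo</title></head><body><!-- TODO: remove debug --><a href='/a'>a</a></body></html>"
# print(parse_page(html, "https://derbycon.com/"))
```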
As you get this, you can then start looking further through it: are there any secrets or passwords in there? You can do regex and other searches against it. I guess that's the extent of it right now. I've started looking into NetworkX, and I think there's some promise there in terms of getting a larger map at scale, but I don't have a lot of that code done yet. There's a lot I want to build on, especially with this functionality, because I think there's so much value even from an automation perspective, where we can tie in additional tools that are really good at HTML processing and start to grab JavaScript libraries, code, external references, and all of that, and really start building a pretty big map of your target surface. Let's jump into another demo. This one searches autonomous system numbers, which is part of a typical reconnaissance methodology where you want to find the different CIDR network ranges assigned to different companies. Typically you enter the company name or a keyword for that company, and it returns the CIDR ranges, so you can further your reconnaissance, whether with Nmap scans or any other searches against those IP ranges. This uses the MaxMind databases as well, which we talked about earlier for the geolocation data, but here it uses their ASN databases. Looking at the Jupyter notebook now, these first sections prepare everything: if you want to set this up in an environment where it's completely automated, it will download the databases (we have the URLs here), and you can set it up on your own just by running this notebook, so you can continue to run tools against it. I really want to eliminate every manual step possible, and the last thing I want to do is have to log into their website to get the latest database. One noteworthy thing is that MaxMind recently changed this: you have to have a license key now. It's still free, but you have to register in the portal and get a license secret to be able to download those databases. Now, when I share this code, I don't want my license key in there. That's another benefit of using the cloud and Amazon: they have a service called Secrets Manager, and you can call Secrets Manager from your Python code and pull out the secrets stored there. As you start to build up a big list of API keys, even across different clouds (previously, when we made the call to the Google Maps API, I pulled the API key I use for that out of Secrets Manager within AWS), this becomes important. This function here is what does it; it's in my central notebook because it's reusable. You just pass in the secret name and the region name, and it loads the secret into a variable, so you don't have to have it in clear text. It's a security best practice, and it's definitely safer if you're putting your code into GitHub. As security practitioners, that's just something to get better at, because we push our developers to do it, and we push a lot of the people we talk to to do it, but it's hard and it takes extra time.
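A minimal version of that kind of reusable helper, using boto3's Secrets Manager API, could look like the sketch below; the secret name and the assumption that the secret is stored as a JSON string are placeholders rather than the speaker's exact function.

```python
import json
import boto3

def get_secret(secret_name, region_name="us-east-1"):
    """Pull a secret out of AWS Secrets Manager so API keys never sit in the notebook or in GitHub."""
    client = boto3.client("secretsmanager", region_name=region_name)
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])

# Example: a hypothetical secret holding a MaxMind license key.
# maxmind_license = get_secret("recon/maxmind")["license_key"]
```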
That's the kind of accelerator you can copy and paste into your Python code, especially if you're working within a cloud ecosystem. An alternative is called pickle, a library within Python where you can save objects to files locally. They're just binary files that can be loaded back with pickle, so they're not secured or encrypted or anything like that, but at least you can load the value from your operating system rather than hard-coding it into an environment variable or into your code; another option if you don't want to use Secrets Manager. As we go through this, it all gets set up (I already have all of this), but you can see that if you put an exclamation mark before a command, you can actually run Linux commands from the notebook, and SageMaker runs on a Linux server in the back end, so you can do a lot of that too, which is pretty neat from a flexibility standpoint. This manipulates all your data, does the wget, pulls it all down, and goes through a cleanup process. I hate having files just sitting on my computer that I don't need or want, so whenever it finishes, this does all the cleanup for you, loads everything into a directory, and then you have your files. What's neat is that, once everything is set up, we can run all of these searches with, what is this, about 10 lines of code? So it's actually pretty straightforward, and you can start to automate it by passing variables and calling it from other functions, from a reconnaissance standpoint. I'm going to restart and reset our variables so we don't have any existing output; you'll see this cleanup. Okay, now we're going to leverage Microsoft again as our example and search their org name for their ASN. We go ahead and run this, and what it's doing is calling out, pulling in the MaxMind data, and searching through it. It came back, and it looks like the latest row that had Microsoft in it was around row 430,000, so it went through a considerable amount of data, and then it loaded everything into a data frame, so we have all the entries for Microsoft's CIDR ranges in there. Now we can write that to a CSV file, or we can start to parse and manipulate it and pull it into the next step; maybe you want to kick off Nmap scans automatically, and you could leverage this data because it's normalized. This actually wrote it back to my recon page, that static S3 site, so now I have a listing of the different CIDR ranges and can see what I'm doing at an overview level. It's kind of neat that, collectively, you now have this dashboard you can watch as you do your own reconnaissance, bug bounty, or anything else you're working on, and it's kind of fun to see. So that's really that demo. We can also look at the org number; we see it's AS8075, so if we want to search for that org number, in case the names differ, we just swap it in. It actually ran both, but I'm going to comment one out so you can just see the results, and we run that against it, and we have the output for the org number pretty quickly too, so you can see the orgs; it just happens that they all have Microsoft in the name in some way, but that's another option you always have.
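For reference, searching a local copy of the MaxMind GeoLite2 ASN CSV for an organization keyword or an ASN is only a few lines of pandas; the file name and column names below follow the GeoLite2 CSV layout as I understand it and should be treated as assumptions.

```python
import pandas as pd

# Local copy of the GeoLite2 ASN block list (downloaded separately with your own license key).
asn = pd.read_csv("GeoLite2-ASN-Blocks-IPv4.csv")

# Match either by organization keyword or by ASN.
by_name = asn[asn["autonomous_system_organization"].str.contains("microsoft", case=False, na=False)]
by_asn = asn[asn["autonomous_system_number"] == 8075]

print(by_name[["network", "autonomous_system_number", "autonomous_system_organization"]].head())
by_name.to_csv("microsoft_cidr_ranges.csv", index=False)   # hand these ranges to the next tool in line
```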
All right, let's take a look at two more quick demos. You'll hear this from me over and over, especially if you've worked with me: I don't like doing manual effort in spreadsheets or CSV files, and from a metrics perspective I want to completely automate it. The whole basis of this talk is automating those mundane manual things that just take time away, so that we have real-time, or at least near-real-time, metrics and data points. I hate using Excel spreadsheets and building graphs and charts off of them; I'm fine with updating them as a way to maintain different components, but one of the neat things about pandas is that you can import an Excel spreadsheet with literally one command. And since we're on the subject, I figured of course we have to think about security awareness, which should always be in the back of our minds when we're working with co-workers. So I took the next step and said, absolutely, we need a mask for that. While we're working in quarantine, on our Zoom calls with our peers, I created these masks that you can pick up (they definitely help pay the AWS bills for this), and they remind your co-workers that they need to modernize, codify, and automate. So I recommend you rock these on your Zoom calls and remind people that we need to move faster, be more efficient, automate, codify, and move forward. As we go into this, let's check it out. We're going to do this one with BSIMM, the Building Security In Maturity Model; it's one of the methodologies around application security. If you're building an application security program and trying to measure how you're doing, this is one of the models you can leverage. All this really does is import the spreadsheet I've used to measure programs against over the years; I really like it, and I'm a big advocate and fan of BSIMM. We can load this spreadsheet in just a moment, and it is just a spreadsheet: even though we've shown massive datasets, terabytes and gigabytes of data being processed, you can use the same approach for single-megabyte spreadsheets and still automate. This loads it in, and we can see it from a metrics scorecard and dashboard standpoint (it's also in GitHub, so you can check it out outside of this presentation). Then, say you want to keep dashboards for your management or leadership of the scores: you literally just run this command, have your team members update the spreadsheets as you complete a project or program, and then you have your spider charts that show where you're at in the maturity loop and where you want to be. These are interactive; you can overlay them, and they measure how high you are in each quadrant, the same way BSIMM actually measures. So here's one cool, pretty lightweight example: your output chart, where you can see where you're currently at, where your target is, and how you're progressing.
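A bare-bones version of that kind of scorecard chart, assuming a hypothetical spreadsheet with practice, current, and target columns (not the actual BSIMM workbook from the talk), could look like this.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical scorecard: one row per practice with current and target maturity scores.
df = pd.read_excel("bsimm_scorecard.xlsx")       # columns assumed: practice, current, target

angles = np.linspace(0, 2 * np.pi, len(df), endpoint=False).tolist()
angles += angles[:1]                             # close the polygon

fig, ax = plt.subplots(subplot_kw={"polar": True})
for column in ("current", "target"):
    values = df[column].tolist() + [df[column].iloc[0]]
    ax.plot(angles, values, label=column)
    ax.fill(angles, values, alpha=0.1)

ax.set_xticks(angles[:-1])
ax.set_xticklabels(df["practice"])
ax.legend()
plt.show()
```

The single read_excel call is the point: once the spreadsheet loads into a data frame, the chart regenerates itself every time someone updates the scores.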
So you can build on this and think about all the different metrics spreadsheets you have: it's literally one line of code to pull an Excel spreadsheet into a pandas data frame and manipulate it, and then you never have to do this by hand again. That took two seconds to run the latest metrics for an application security program measurement. Here's another example, just with a CSV of the CVE library. It's a dataset you can download from MITRE's website, the full historical set in CSV format. I've already taken the approach of downloading the file, and we can just import it into a data frame. These few lines of code load up, what is it, 176,000 different lines of CVEs, with the data points around them. If you ever want to do analytics on those, like grouping by how many were issued per year, you can run it pretty quickly (there's a small sketch of this after this section). Over the past 20 years you can see how the counts have increased over time; you can look at which months are busiest with CVEs, and if you feel like you're always working over the holidays, you can use this to compare and forecast where you're most likely to need more patch management resourcing. So there are some cool insights you can share, and it gives you a data-driven backing for why certain things are happening; as we build programs in security, we should always think with a data-driven mindset. The other neat thing is you can load additional libraries. This loads Bokeh, a charting and graphing tool within Python that's used pretty commonly across the scientific computing community, and you can also write these out to interactive websites to show the dynamics. So you can chart these out, graph them, save them, and load them: just a couple of lines of repeatable, reusable code that you can build once and use over and over. Tons of efficiency gained, tons of opportunity. So I definitely recommend grasping that: don't do manual work in spreadsheets anymore, other than maybe populating data or loading it from an API; leverage code and automate.
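As a rough sketch of that per-year analysis, assuming a local copy of MITRE's CVE export and that the CVE identifier lives in the first column (the real download's header layout may need adjusting), it might look like this.

```python
import pandas as pd

# Hypothetical local copy of MITRE's CVE list export; header rows and column names may differ.
cves = pd.read_csv("allitems.csv", encoding="latin-1", low_memory=False)
id_column = cves.columns[0]            # assume the first column holds IDs like CVE-2020-12345

# The year is embedded in the CVE identifier, so extract it and count entries per year.
cves["year"] = cves[id_column].astype(str).str.extract(r"CVE-(\d{4})-", expand=False)
per_year = cves["year"].value_counts().sort_index()
print(per_year.tail(20))               # how many CVEs were issued in each of the last 20 years
```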
As we wrap up this presentation, I forgot to mention that as you build this automation and have services running, it's really easy in the cloud to build SNS notifications with the Simple Notification Service: you can text yourself, give yourself reminders, and track the progress of any of the processes you're running, so it's a handy snippet of code to leverage to your advantage. The other part I want to mention is that we've looked a lot at red teaming techniques and passive reconnaissance, but what I want you to think about is this: if you're working in an information security program right now, look at it from a holistic standpoint. How can you help out the other infosec domains you work side by side with? Whether you're in vulnerability or risk management, third party, a cyber fusion center, security engineering, or architecture, think about those outputs and inputs. How can you better take the data you're generating within your department, normalize it, and pass it on so that, say, the risk management team can make better decisions with better normalized data they can process? Think about those gaps; I think that's somewhere you can really accelerate in your career. It's a gap we have: we're really good at building out and shoring up individual silos within information security, but if you can start to show, from a career perspective, how you can branch out and understand the value statements and objectives of the different departments within your program, that's where I think you can really accelerate and really build and extend your career. So I encourage you to take these principles, platforms, and topics we've talked about and start applying them to your daily jobs; I think you'll see some great success. Again, I really appreciate the time today, and I hope you've taken something away from this. I'm definitely going to keep building out blog posts, code snippets, and accelerators in the GitHub over the coming weeks and months, so push me and help drive me to keep it populated, and definitely reach out if you have any questions as you go. I appreciate your time; enjoy the rest of DEF CON and the Red Team Village, and thanks again and have a great day.