 Hello DEF CON and welcome to my talk here at the Recon Village for DEF CON 28 safe mode. Super excited to be here. This is my first time ever as part of the DEF CON speaker community. I'm super excited. So I hope you're excited also while we start talking about Ambly, the smart dark net spider. So who am I that's talking to you about this really cool project? Well I am known by a few names online. This may or may not be all of them, but it certainly is the few that are most common. Some people will know me as Cytosis Eurydice, others maybe CyberSci. Most people in this community are going to know me as Levitonin or Levy. This is what I am online where I'm actually interacting with people most often. By day I am a Cybersecurity Incident Response Professional and by night I'm a dark net researcher by choice and trade. I am a self-proclaimed master of spiders also which may or may not be both on the computer and off of the computer. Just keep in mind that spiders in real life out in the wild, they're doing a lot of good stuff for us, but those online they're equally hiding in plain sight doing a lot of cool things that we need them to do every day. And I'm here hoping to create tools that are based off of these spiders and build upon them in order to help us in fields like open source intelligence threat gathering in other research areas. But before we get ahead of ourselves, let's talk a little bit about TOR because we're going to be talking TOR a lot today. Keep in mind though that TOR is not the only dark net access point so we're going to be mentioning a few others as well. Before we get too too far ahead of ourselves because I could talk TOR, dark net and spiders all day long, let's talk a little bit about the different layers of this presentation. We're going to be going over open source intelligence or OSINT. We're going to talk about cyber threat intelligence, which you'll often see abbreviated as CTI. 
We'll talk about the different layers of the internet, which I've already slightly touched on. We'll also talk about the difficulties of finding cyber threat intelligence on the dark net specifically. And finally, this is going to lead us into talking about Ambly, a smart dark net spider specifically designed for cyber threat intelligence. So let's get started. What's open source intelligence? Most people in this village specifically probably have some sort of idea about what open source intelligence is, but we're going to break it down a little bit anyway. Open source intelligence is anything that is accessible from original sources, broken down as accessible, original information or data. You can also explain this further by saying something that is posted, viewed or interacted with by a user, which is accessible publicly online or offline. It doesn't have to be specifically on the internet. This includes, but it's not limited to, the internet at the clear, deep and dark levels, which we'll get into later. Mass media, television, radio, books, journals, print of all kinds, video games, specialized journals, conference materials and think tank studies. These are out there. People are talking about them all the time. We're going to be posting about DEF CON for months after this, right? Then there's photos and videos, not just on YouTube or Snapchat, but in general that are posted publicly online. Finally, another area is geospatial information such as where they actually are located, their GPS, where on a map they may be, or even their IP addresses. So what is OSINT good for? Absolutely everything. If you thought about that with a tune in your head, you thought right because I was talking about that song that's now stuck in your head. You're welcome for the earworm. 
So OSINT can be utilized in an array of situations, including, but again, never limited to: designing internal training for a company, understanding your threat profile, beating Keith and Joe at an open source CTF that may or may not be hosted by Trace Labs today, later on after this presentation. Maybe, maybe not, I don't know. Volunteering, or finding out the title of your favorite book from when you were nine, that you forgot existed for about 10 years and then desperately wanted to read again sometime later. Yeah, all of these are actual areas that I have used OSINT for, on the clear net and down into the dark net. And that book, for anyone who's interested, was called Goddess of Yesterday, which I highly recommend. It's a good book. But now you may be asking, if you're not used to this field just yet, is open source intelligence gathering legal? Short answer is yes. Long answer is depends, which is a longer word than yes. The longest answer is, if you're in a country outside of the United States, I don't know. It's going to depend. You're going to have to look into your own local laws. Even in the United States, you're going to have to look into your laws state by state. This may be different in some areas. In Europe, I'm expecting that there are probably a lot stricter regulations on what is or is not open source intelligence than there may be here in the United States. That being said, in the US, there's a public law stating that open source intelligence is produced from publicly available information that is collected, analyzed, and disseminated in a timely manner to an appropriate audience, and that addresses a specific intelligence requirement. This is what's publicly posted on the CIA's website regarding open source intelligence. Now there are a few other sources that you can look into, including Public Law 108-458, enacted in December 2004. We also have the FOIA (b)(3) exemptions under 50 USC 403-1, which is on intelligence sources and methods. 
And finally, ATP 2-22.9, which is the Army's publication on open source intelligence. These are all laws and publications that are out there that you can read up on, and they will help you understand what is and is not legal in the United States. Again, certain states may vary. All right, we're going on to the next section now. We're going to start diving into cyber threat intelligence. But first, let's take a step back. Let's relax a second. Let's think through a scenario that's going to come up again and again through this presentation. One, let's go. During a pandemic, you're working from home. Everyone's working from home, we hope. Stay safe, right? Your company, which is global, just announced it's working on processing pandemic data to help drive towards a cure. Great, that's awesome work. But does that open you up to some threats that may be out there, some cyber threat actors? As part of the cybersecurity team, your job and your goal is to identify if there are any actors out there that you need to be aware of, and anyone who may be actively targeting not only your company but companies in your industry. So what do you do? How do you find that out? Keep this in mind. We're going to come back to this. All right, what's cyber threat intelligence? I just kind of threw a scenario at you, but what are we even talking about? Cyber threat intelligence is the collection and analysis of threat actors' motives, targets, and attack behaviors in the realm of cybersecurity. Often, automated techniques and machine learning are implemented for data collection and processing. This is how we are gathering this information. However, you can do it manually. This helps shed some light for preemptive actions, and it reveals adversarial motives and tactics, techniques, and procedures, or TTPs, which we'll be talking about a lot. The reason we do this is to help in three areas within a company or possibly even the government. 
And those areas are: tactical, which is where you're performing malware analysis and enrichment, or you're collecting threat indicators and trying to help out the defensive cyber teams. The goal here is to be able to talk to a semi-technical or more technical audience. Then you have the operational team. You have a team that's trying to understand adversarial capabilities, infrastructure, TTPs. You want to leverage this information to conduct targeted, prioritized operations. This team is probably more technical with the details, and they want to know about attacks and campaigns from the past, the present, and possibly, if you can anticipate it, the future. Finally, we're talking to the strategic team. This is the team that needs to know all of this information at a high level. They want to know about adversarial motives. They want to know who's targeting your industry. They want to be able to leverage this information to engage in strategic security. This is for a non-technical audience in most cases, but it's really important that we make sure that they understand what's going on and why it's important to focus in on this information. Now, are there companies out there that are focused on this? Absolutely. A few of the top ones are Recorded Future, CrowdStrike, FireEye, Mandiant, and SANS. And this is an area that more and more people are getting involved in, especially right now and in the scenario. When there's a pandemic or there's a work-from-home situation, you have to be aware of your cybersecurity. So the more people who are looking at this, the more we need to know about it and know how we can dig into it. So what are some tools that we can use right now if we wanted to get into open source intelligence and cyber threat intelligence? Well, we've got Maltego, SpiderFoot, MalShare, and the OSINT Framework, which everyone should start at at some point. It's right there, and it will lead you through the tree path to whatever you may need. 
We have Check Usernames or BeenVerified, which are great tools to see if a username is used across platforms. You can use them to track people's movements if you need to find a target, or if you're working on something like a CTF or Trace Labs. Or you can even use them for yourself, to see if the username that you want to kind of trademark as your official account or handle is used anywhere that you may need to grab. Have I Been Pwned? Excellent resource. See if you or anyone else has been pwned, if their email addresses have shown up in any leaks. We've got Censys, Shodan, which is one of my favorite sites, and then Google Dorks. If you know how to Google dork, you're probably a step ahead of a lot of people. Google dorking, which is Google hacking, or just knowing the syntax of how to do a proper search on Google, is an amazing, amazing ability. And it's magic. I don't really know any other way to say it. It's magic. We have Recon-ng, theHarvester, Nmap, Creepy, which is a little creepy. But it's branded well. There are so many tools out there. Trace Labs has a VM full of them. OSINT Combine is another source where you can learn stuff about OSINT. And they have a lot of tools they talk about. Tools are popping up over and over and over again, all over the place. And most of them are open source. Some of them may have paid features that you can get if you want. But you can do a lot of this stuff without paying a dime. And they're very useful. So I highly recommend digging into these tools. And if you want to talk about these at all, please let me know. I'd be happy to chat with anyone about any of these. But for right now, we're going to move on. We're getting to one of my favorite parts. I could talk about this way too long. So I'm going to limit myself to three slides. Four slides, I lied. Four slides. We're going to talk about the layers of the internet. Number one, out of the box. 
Please, please, do not believe everything you see in infographics. Because honestly, I don't know how many times I see this whole igloo, is that the word? Iceberg, that we see for the internet, for these three layers, or in some cases, up to 12 layers I've seen. But I don't necessarily agree with this one, based on my interaction with the layers of the internet and my understanding of it. But even more so than that, just like you can kind of do with stats, you can do almost anything with an infographic and be convincing. You can really get people's attention. So just keep that in mind. I'll get off my high horse about that for right now and just let you know that we're going to be talking about the clear net, the deep web, and the dark net. Clear net, everybody's favorite layer of the internet, even if you don't know it. Well, that might not be true. We'll see. The clear net is where you can access things like Google and Bing and DuckDuckGo. And if you look for something on Google and you can find the page, you're on the clear net. Otherwise, if you can find the front page but you can't find specific data, let's say you joined a forum and you're trying to figure out what's going on with one of the users, you might be able to search that user and see that that forum has a post, but you're probably not going to be able to see that post. Not without getting through an access point. And that brings you into the deep web. The deep web is anywhere where you're talking to people online. You're on Discord? You're at DEF CON, you're probably on Discord, by the way. You've gone to the DEF CON forums? Deep web. Like to play video games? Twitch or Steam, deep web. How about ever logging into a university or school portal? Deep web. You have a bank account anywhere? Deep web. All of these areas also have a front-facing website that is accessible via the clear net, Google, Bing, DuckDuckGo, whatever your preferred search engine. 
But because you have to have an access point, you have to go into this. You're running into this area where it actually is the deep web. It's not as accessible. It's not as indexable for those spiders and crawlers behind Google and other search engines, which is what makes it the deep web. If you go back to that infographic, this should be the largest section of the iceberg. That leaves us with the dark net, my favorite personally. There are a few different ways of getting onto the dark net, but the top three access points are those we see here. We've got the Tor Project, which is The Onion Router network, or just Tor. Like we said, we're gonna talk Tor today. And the focus is gonna be on Tor from here on out for right now. But keep in mind that we also have Freenet. We have I2P and bulletin board systems, or BBSs. All of these require specialized access and/or knowledge in order to get onto them and interact with users. For Tor, you need to get the Tor Browser to access the onion routing system. For Freenet, you need another piece of specialized software. You can also get specialized software for I2P, which will run your traffic through different tunnels via peer systems, and BBSs require Telnet client software. All of these require extra software and extra knowledge to even find these websites in most cases, or chat rooms, or what have you. And that's what makes this the dark net. It's not easy to explore, to travel. If you're trying to get from one site to the other, you need to know someone or something to get there. So what does that mean for us? What is cyber threat intelligence on the dark net? Like I mentioned, we're gonna focus on Tor, but first let's go back and look at this. How long does it take to find a website for us to use? If you're on the clear net, a website that's newly posted takes about four weeks if you get the SEO set up okay. If you get the SEO set up really well for that website, it can take about four days. 
And then you'll start seeing it populated on Google or other search engines, depending on how it's set up. So if I actually go and I log in and I try to find a website on Google, and there's not a hurricane, like there may or may not be outside, then what are we dealing with? A few seconds, maybe a minute. It's kind of based on your internet speed at that point, right? All right, what about the deep web? Well, let's say the deep web website has a clear net-facing access point, right? Maybe you're trying to find a certain forum post. So the forum you can find on Google, so there's the few seconds to a minute. Then you have to either make an account or log in. So that's the next chunk of time. Let's say that takes two minutes. So that's three minutes now. Once you're logged in, you then have to find the post you want. Now, a lot of these sites have internal indexing, which means you can search stuff internally. And that may take another minute or so. So that's four minutes, five minutes. That's not terrible, so long as you know what you're looking for. If you don't know what you're looking for, it may take longer. If you're specifically trying to find information on a user or a subject, you may have to do a lot of digging before you come across what you really want. And in those cases, it could be extended for a very long amount of time. That's fine. What about the darknet? How are you gonna find a darknet website? Some of them are posted on Reddit. If you look up darknet websites, you're usually gonna get Tor results, and they're usually on Google. Pastebin, GitHub, some of them are on GitHub, yeah. Well, even if you find it, what do you do with that? It can take some time to find websites. Are they really what you want? What are you looking for? And how do you access them? And how long does it take to access them as compared to the clear net? Well, let's take a look at our scenario from earlier and we can try to figure this out. So some extra details. 
You've uncovered information about a group targeting your company or companies like yours. The information you have indicates that this group is active on Tor, but not where they're active. How do we go about finding this group's activity? Let's walk through it. Manual investigation, step one. Let's figure out what's going on by using Google as an ally. And I'm gonna start with Google because most of us do. Personally, it's not my preferred search engine, but it is one of the better ones. Oops. So what do we have? I just searched here for Tor websites. I got a few of them. Here's a quick snippet of them. We've got nine best onion sites to visit. Awesome. List of some sites from Wikipedia. Cool. Very nice. We have best onion sites and how to access them safely. Awesome. How to find active .onion dark net sites and why dot dot dot. And finally, we have the best dark net websites you won't find on Google. Hmm. I don't know about you, but that one seems a little weird considering we found the link on Google. Personally, I think it's funny, so I'm gonna pick that one. So that leads us to step two. Pick a rabbit hole to dive down, which is what we're gonna do with the best dark net websites you won't find on Google, as found on Google. From there, we're going to go into what this article or this area has. So the first one they listed was the Hidden Wiki. Surprise, surprise, there are a lot of hidden wikis. Some of them are mirrors. Some of them are just lists. Some of them are kept up to date and some aren't. This one is the general, normal access point, but there are others and there are some that are better. They also have here Hidden Answers, which is another good one and generally anonymous. But one big thing I wanna note here. If you look at these two, and I've got my mouse over here just to show you, these two URLs, first off, we're on Tor. So all the URLs are gonna end with .onion. But you'll also see that this one is very short. It's a bunch of gobbledygook. 
You're not gonna be able to understand what it is in most cases, but it's short. And that's fine and dandy, but that's a v2 URL. What's going on right now is that we're switching over to v3 URLs, which are these longer strings, 56 characters long. Sometimes you'll be able to see a tag like this, where it's got an actual human-readable, English-readable word in the beginning. But generally speaking, we're just looking for these longer URLs. Those are the ones that are gonna be sticking around for a while. This shorter one is going away in 2021. So yay, we found this, but what does that mean? What are we gonna do about that, and how do we get an updated one if this is a good starting point? We're gonna talk about that too. So step four, we found the URLs, but how do we actually access them? If you go and plug those .onion links into your Google Chrome or Firefox or any normal standard browser, you're not gonna be able to reach them. You need to be part of the Tor network, or the onion routing system. In order to do that, you need either to set up the proxy or to download the Tor Browser. So this is Tor. This is the base browser that you have right now. It's updated to 9.5.3. It's based on Mozilla Firefox, and it allows the user to connect to the Tor network and onion routing system. Now you can change a few settings in here. Generally speaking, it's not recommended to have JavaScript on, for example. You wanna try to be in the safest mode you can be when going through different sites. And you start off with the search engine DuckDuckGo. Now DuckDuckGo is a clear net search engine. It doesn't necessarily help you with Tor websites, though it can give you some results similar to what you'd get if you looked them up on Google. But now we can quickly and easily pivot into using those URLs that we found earlier. So let's go to the Hidden Wiki. Here's the first thing that popped up when I went on the Hidden Wiki about two days ago. 
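For readers following along, the short-versus-long address distinction can be checked mechanically. The sketch below is an illustration and was not shown in the talk. It assumes a v2 address is a 16-character base32 label and a v3 address is a 56-character base32 label in front of `.onion`; the function name `onion_version` is hypothetical. It only checks length and alphabet, so it does not validate the checksum embedded in a v3 address.

```python
import re
from urllib.parse import urlparse

# Base32 alphabet used by onion addresses: a-z and 2-7.
V2_LABEL = re.compile(r"^[a-z2-7]{16}$")   # short, older format
V3_LABEL = re.compile(r"^[a-z2-7]{56}$")   # long, current format


def onion_version(url: str) -> str:
    """Classify a URL as 'v2', 'v3', 'unknown', or 'not-onion' by its hostname."""
    if "://" not in url:
        url = "http://" + url              # urlparse needs a scheme to find the host
    host = (urlparse(url).hostname or "").lower()
    if not host.endswith(".onion"):
        return "not-onion"
    # Take the label directly in front of ".onion", ignoring any subdomains.
    label = host[: -len(".onion")].rsplit(".", 1)[-1]
    if V3_LABEL.match(label):
        return "v3"
    if V2_LABEL.match(label):
        return "v2"
    return "unknown"
```

A crawler or an analyst's bookmark list could use a check like this to flag v2 links that need a v3 replacement before the older format is retired.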
Top of the page: Hidden Wiki, new URL as of 2019, 2020. Add this to your bookmarks and spread it. This is the current URL for the wiki. Now can you get there from the small one? Yes, which you can see right here at the top. I was accessing the small v2 URL. This is the v3 URL, and it's gonna help us stay connected to the wiki as long as it's viable and up. So when you do come across these v3 URLs, you want to grab them, or you may lose access to the site, beyond the normal loss of access on the dark net. So we're on the wiki. We're at step five. Visit the starter point. What are we doing on this website? Well, first we're trying to find information about the threat actors for our cyber threat intelligence scenario. So what areas of the wiki can help us? Well, we've got social networks, which, slight side note, includes Facebook, and for the life of me I cannot understand how a .onion site for Facebook is not going against their own terms and policies. If anybody could explain that to me, please, by all means, I would love to hear it. But I also personally wouldn't recommend signing into any of your real-life social media accounts, especially at the same time as doing other things on the dark net. The whole point of this is to stay private, to try to stay anonymous. And if you're logging into your personal accounts and your dark net accounts, it's possible that you could link those two. We wanna try to avoid that. Similarly, if you're doing OSINT online and you're not using the dark net but you're using maybe a virtual machine, you don't wanna log into your own stuff in that instance. You could run into a conflict. Anyway, we have social media. We've got Connect. We've got Galaxy3, which is rather new. We have TorBook, and we have Facebook, somehow. Facebook I wouldn't necessarily consider a dark net social media, but it's there. Then we've got hack, phreak, anarchy, warez, viruses and crack. 
We've got all of the stuff you need. Just try to rent a hacker, why don't you? Definitely isn't a scam. Definitely is not going to backfire on you at all, promise. Do we think the threat actors that we are looking at for the industry, for our cyber threat intelligence scenario, are really taking any action here on these websites? Are they maybe selling their abilities or any code that they've made online? Maybe. Personally, I wouldn't think they'd be on these sites, but it's possible. You can definitely look into them. It all depends on what you're trying to get into. Now, there's also another area, which is the introduction points. Earlier, I mentioned the search engines. Those are Google, DuckDuckGo, Bing, clear net search engines, awesome resources for the clear net, for open source, all that good stuff. These are considered, in most cases, to be dark net search engines. Now, DuckDuckGo, as we talked about, searches the clear net, but it's kind of working with Tor. We've got Ahmia, which is searching Tor websites, but it's on the clear net. Then we've got stuff like Torch or Not Evil. If we wanted to go into these search engines and try to find more information on cyber threat intelligence, how would we start? Well, that's going to bring us to step six. Let's begin the hunt. This is Not Evil. FYI, this will probably get posted after, but if you try to go on to Not Evil, it may be down for an upgrade on August 7. But otherwise, you can pretty much use this like Google: put in some keywords and start searching. Now, I can tell you, but I will not show you, that when searching Not Evil for anything related to the pandemic, you get, surprise, surprise, a lot of pornography. Most searches will get you a lot of pornography. If we're being serious, that's just kind of how it is. You can tag videos and images and everything with anything. And it kind of works sometimes as a way of hiding actual posts and data. And sometimes it's just because people like to share it. To each their own. 
But as Pornhub stats have shown in the past, when stuff is going on in the day to day, you tend to see it turned into pornography and get really popular. So that's fine and dandy. But we're not going to show that here. Instead, what we're going to show is when I looked for hacker. Does stuff show up? Yes. We have some sites. We see some messages here. We see some posts down here. We have Vincent Canfield, a hacker on Keybase, who specifically says, do not chat messages on Keybase. OK. We have the social hacker. We've got a warning on that site. Check before purchasing. What are they selling? What's going on here? That's something we might want to look at. Now, what do we get if we search for cyber terrorists, which is the next little block we have over here? Well, let's see. We've got Daniel's onion list. That's cool. We've got some other languages that are popping up here. That's nice. We have a mention of the CIA. OK. We've got rat wires. All right. Interesting. So we get a few hits. But is a cyber terrorist group going to label themselves a cyber terrorist group? Maybe. Cyber terrorism is kind of a perspectives game. Most things are a perspectives game, really, right? If I'm looking for information about someone that I think has done something bad, I think they've done something bad. They might think they're doing something good. Are they a hacktivist? How do we decide that? It's another thing we'll need to use some cyber intelligence skills and techniques to deal with. But let's get to the next step, step seven, part of the talk. Websites on Tor can be, and are, volatile. Torum may be up one day and down the next. Now, Torum's a great website for getting into the cybersecurity community on Tor. It's a very interesting forum. They've got that traditional, sweet, sweet terminal green on black going on. So it may or may not hurt your eyes. I would love to show you a picture of it. But when I went to get all the pictures, as the slide mentions, Torum was down. 
And that doesn't mean Torum is gone forever. Most likely it means it was down for maintenance, maybe they had a weather issue or some other issue, wherever it's being hosted. It is what it is. But this can make communication difficult. And if you aren't embedded, you can easily lose track of things if not be left behind completely. So there is a point at which we may want to talk about how do you safely integrate into the communities to get more information. You're already having a hard time searching the dark net. Is there a point at which you want to embed yourself in communities to start learning more? Is it safe to do? How do you do it safely? These are all really great questions that we don't really have time to get into today. But keep an eye out. This is definitely something I want to talk about with everyone more going forward. So summary slide, right? Cyber threat intelligence on the dark net. All right, so on the clear net and parts of the deep web. We can go through profiles, web pages pretty quickly. What do we say, like a minute to find a page on Google? Give or take? About five, up to five minutes for the deep web if you're just doing a shallow dive. Not too bad. Dark net's different though, huh? Although there are some tools in place, such as dark net search engines, like not evil, we don't have the same boosters here as we do everywhere else. It's not quite as easy. It's not quite as indexed. This is something we take advantage of in our day to day lives. A lot of us may even forget a time or not have been alive during a time before Google, before these things were indexed and easy to find. So if we're trying to do dark net research, there's a huge bottleneck here. If you're doing manual investigations, that's a slow process. And getting to websites and finding websites is the game of the dark net. You've got to trade stuff, right? You've got to interact to really dig in and find these. 
It's a huge hurdle when trying to hunt for cyber threat intelligence on the dark net. And this is where Ambly comes in. And I'm really excited that we've gotten to this part, because this is a project that I have been hoping to work on for a long time. And when I finally got the chance to do it with the sale lab, I think I just about lost my mind. I was so excited. So taking a step back, let's talk a little bit about Ambly. Ambly's a smart spider focused on the dark net to gather and identify cyber threat intelligence. Ambly's gonna be using some machine learning and artificial intelligence techniques to actually identify relevant websites and to make sure that they're valid and that they pertain to cyber threat intelligence. We're gonna be accessing websites, hopefully down the line, behind CAPTCHA blocks or user accounts and anti-robot protocols. The goal here is to use deep learning and natural language processing to identify and rank new URLs based on their potential for CTI. We're going to hopefully have a report that gives us web pages and identifies further investigation recommendations, so that an analyst, government, researchers, what have you, can actually use this to get rid of, or to reduce, that bottleneck that we talked about a moment ago. Reducing the bottleneck is the key behind Ambly's mission here. We want to make this easier and better, and not just for cyber threat intelligence. There are other tools that may or may not be in the works here, but this is the goal of Ambly. Ambly is for cyber threat intelligence, to identify and track this information on the dark net. Right now, specifically Tor, but that does not mean it's limited. But let's take a step back. That's the Ambly we want. We wanna be able to do a lot of stuff with Ambly. But where are we right now? Well, Ambly is in a prototype stage, and we'll get to that in a minute. But right now Ambly's going out and gathering data. 
Ambly's actually creating a database of viable dark net websites, specifically on Tor. So they're all .onion sites. So how does Ambly do that? Imagine, if you will, another scenario. You're dropped in a random city. You don't know where you are. You don't know what country, state, city, anything. You don't have a phone, you don't have a computer, and you don't know the language of the people around you. What do you do? How do you find safety? How do you get home? Well, personally, I would start walking, searching for any sign of a police station or some trusted resource that I can go to for help. Now, maybe there's this universal sign that we'll see, or some sign that'll help indicate that path. Maybe there's not. Maybe I just have to walk, and I go down a path and I keep following it any way it leads. And if it doesn't work, I go back. And we keep doing that almost recursive spin around until we find an area that we need to be in. This is kind of what Ambly's doing to create that initial data set on darknet websites. It's just going. It's free to move. It's collecting websites. It's identifying and collecting a whole database on darknet websites and when they were viable, verifying that we could actually go to the site, not only that the link was out there. And from there, we're going to use that data to start doing really cool things with machine learning. But let's talk about the Ambly of today. All right, so we're in a prototype stage, like I mentioned. We're beginning on an active and frequently updated Tor wiki. Not the one I showed you earlier, but a different one. One that's really up to date and monitored frequently. We've got two access points that Ambly has. One is instantly on the darknet: anything that goes through, any search, any website, is going to the darknet connection point. This makes sure that we are always connected to one of the nodes and it keeps us in the network. 
So even if by chance, and early on we ran into this, we found a clear net website, going to that site would still use the dark net IP addresses. We've since removed the ability to go to clear net sites, as the focus here is on dark net only, or dot onion, sites. The second connection is to the clear net for the actual database storage, which currently is a MongoDB Atlas format. We'll talk about that in a minute. Now, the crawls gather HTML, which includes URLs and text. The URLs are parsed out and stored separately. The text is parsed out and stored separately as well. And then the HTML is stored as is, but in a binary format. At no point, and this is very important, at no point are images specifically pulled. Illicit materials are avoided and they're not meant to be interacted with, including and especially CSAM, which is a term that's used frequently by people who are working against human trafficking and child pornography rings. This is specifically about sensitive materials regarding children that we do not want out there. We are not collecting that with Ambly. You may see in the HTML that an image was there, but the image itself is not pulled. Now the database, like I mentioned, is a MongoDB Atlas. File sizes are identified at the pull and marked so that we don't go over the limit. We did go over that. We'll see that in the next slide. The format of the data is important for avoiding those document limits. Again, we'll talk about that. And Ambly runs in a DigitalOcean droplet. All right, Ambly's first 12 hours. So, the first 12 hours that we ran the Ambly prototype, stage one. We had MongoDB Atlas and local. We found they both have a size limit of 16 megabytes per document. And this is what broke our cycle. The reason for that is originally the text was in an unbounded array format, which just grew too large for certain websites to actually be collected. So what you'll see here is the first 12 hours of Ambly. We collected in 12 hours 86,546 websites.
We got HTML pulled from 1,819 of those sites. And the text, we got to 1,818. The 1,819th is when the size limit was hit. That has since been rectified. But just imagine this: in the first 12 hours, we got so many URLs. People say all the time that the dark net is small. And it is, compared to the clear net, compared to the deep net, sure, but it is not tiny. There are so many URLs and websites out there. And Ambly's specifically designed to make sure that no two websites are pulled with the same URL. So even if two sites link to the same website or web page, only one of them, the first one, is stored. The other is dropped. That is 86,546 unique URLs pulled in the first 12 hours. I don't know about you, but I was ecstatic to see that. So let's get into the good stuff. Let's talk about some initial points of interest. We've got a few pulls that I went in and grabbed some interesting sites from. The fourth website to be pulled after going to the Tor wiki was the Liberated Books and Papers site, which claims to be a collection of books that are not easy to come by. The sixth website crawled was an error page for WordPress, which actually helped show that WordPress sites are being used on the dark net under the name of Torpress, which was really interesting. You can actually see that in the URL. At 32, we found the TMG Mirror List, TMG also being known as the Majestic Garden. This has a PGP for joining the Majestic Garden group, and there were multiple, not just these three, there were even more of these URLs that had the exact same thing on them, which brought up a very interesting question. What do we do about sites that are so similar? If they all have the same stuff, but they're just at a different URL, do we count them as one entity or combine them? We'll see. All right, the 61st website crawled: Relate List, New Era of Intelligence. Really interesting for looking up information on companies.
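To make the two storage rules from a moment ago concrete (first URL wins, and nothing goes over the 16 megabyte per-document limit), here is a small hedged Python sketch. It is an in-memory stand-in, not Ambly's database code. `PageStore`, `SAFETY_MARGIN` and the record fields are hypothetical, and the size check is a rough byte count, not the exact BSON-encoded size that MongoDB enforces.

```python
from typing import Dict

MAX_DOC_BYTES = 16 * 1024 * 1024   # MongoDB's per-document size limit
SAFETY_MARGIN = 1024 * 1024        # hypothetical headroom for field overhead


class PageStore:
    """In-memory stand-in for a page collection keyed by URL."""

    def __init__(self, limit: int = MAX_DOC_BYTES - SAFETY_MARGIN) -> None:
        self.limit = limit
        self.pages: Dict[str, dict] = {}

    def add(self, url: str, html: bytes, text: str) -> str:
        """Store one page; returns 'stored', 'duplicate' or 'oversized'."""
        if url in self.pages:
            return "duplicate"     # the first pull wins, later ones are dropped
        # Rough size estimate taken at pull time (assumption: not exact BSON size).
        size = len(url.encode("utf-8")) + len(html) + len(text.encode("utf-8"))
        if size > self.limit:
            # Mark the page instead of storing a document that would be rejected.
            self.pages[url] = {"url": url, "size_bytes": size, "oversized": True}
            return "oversized"
        self.pages[url] = {
            "url": url,
            "html": html,          # raw HTML kept as bytes, as is
            "text": text,          # extracted text kept separately
            "size_bytes": size,
            "oversized": False,
        }
        return "stored"


# Tiny limit so the behaviour is visible without megabytes of test data.
store = PageStore(limit=50)
first = store.add("http://a.onion/", b"<p>hi</p>", "hi")
second = store.add("http://a.onion/", b"<p>other</p>", "other")
third = store.add("http://b.onion/", b"x" * 100, "big")
print(first, second, third)  # stored duplicate oversized
```

In an actual MongoDB collection, a unique index on the URL field would be one way to enforce the same first-one-wins rule on the database side.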
I'm gonna skip ahead a little bit because I'm seeing we're short on time, so let's go into this. Initial points of contact. This is how we see the URLs. We see that the URLs are all stored. We see if each one is visited or not, and this is the binary format for the HTML. We crawled a lot of sites: the wiki, Bitcoin wallets, hacker ads, marketplaces, resellers. You name it, we saw it, it was great. And we still have all of this data to go through. A few other interesting ones: there's the Black and White Cards, which is a mysterious group open to members. This one was really interesting just to read through. We also have Rent a Hacker, which, I don't know if anybody else has noticed this, but there's at minimum about five versions of this website under different names and different URLs that say pretty much the exact same thing. So this is pretty much a mirror, with very minor edits if any, but if you need a hacker, I wouldn't go to this one. Again, you're at DEF CON, so really, do you need to go to them? Now, the Ambly of tomorrow. As a prototype, Ambly is on a road of continuous change. There's a few areas that we wanna get down in Ambly's development. One, integrate a new classifier, perm ID, currently working on it, that will help Ambly classify web pages during the initial crawl itself, telling us: is this relevant to cyber threat intelligence? Is it relevant to cybersecurity, drugs, bitcoins, you tell me. Then we wanna be able to identify web pages with CAPTCHA, of which we found a few, and test the CAPTCHA breaker component that I'm also working on. This includes specialized CAPTCHAs found on Tor sites, like this gorgeous Dread CAPTCHA. Dread's CAPTCHA methods are both magic beauties. They're beautiful, I can't speak to them, but they're also horrendous to actually deal with as a person. This one's nicer than the old one. Finally, we wanna implement a deep learning module using natural language processing to identify relevant links to prioritize. This is really great.
I'm really excited about getting that part up and running, and that's really the big next stage. But that's all I have time for today. I'm pretty much at my limit, so I'm sorry I had to run a little fast at the end, but thank you so much, DEF CON and the Recon Village and everyone who came in today to watch this. Thank you so much for spending some time with me. I also wanna say I'm thankful for the SAILors of the Secure and Assured Intelligent Learning lab that I work with and the KDD team of Kansas State University. Finally, I volunteer with the Innocent Lives Foundation, and I really wanted to just shout them out here in the last few seconds. I really hope you give them a look. They're great, and part of why I'm so passionate about this is because I get to work with them. Not specifically on this project, but on others. I really hope anybody who wants to reach out does. My social media is on this video, so let's chat. Let's talk some Tor. Thanks again, everybody. Have a great rest of DEF CON.