Hello everyone, my name is Apurv Singh Gautam. I would like to thank DEF CON Red Team Village for having me here. I will talk about automating threat hunting on the dark web and the things that surround it. I presented a short version of this talk at GrimmCon, and this is the extended version. So let's get started. I will switch off my webcam so that you can focus on the presentation.

Okay, a little about me. I go by the handle ASG Scorpion. I'm a security researcher. I got into threat intel and hunting two years back and I've been loving it since then. I'm currently pursuing my master's in cybersecurity at Georgia Tech. Recently, during the summer, I was a research intern at ICSI, UC Berkeley, doing research in threat intelligence. Some of my hobbies: gaming, I pretty much love Rainbow Six Siege and sometimes stream it; I love hiking; and I recently got into lockpicking and I'm enjoying it. Contributing to the security community is my passion. I'm a senior teaching assistant at Cybrary, and I contribute to StationX and to local security meetups. These are my socials if you want to contact me or hit me up.

So what's today's agenda? We will start with an introduction to the dark web: what the dark web is, how to access it, what Tor is, and what the difference is between the dark web and the deep web. Then we will discuss what threat hunting is and why it is crucial to hunt on the dark web. We will discuss the different methods to hunt on the dark web. Can dark web hunting be automated? What is the pipeline, or architecture, for automating it? Then we will discuss a little about the threat intelligence life cycle: how threat hunting on the dark web is analogous to it, and which steps correspond.
We will discuss a little about operational security, that's OPSEC, and why it is important to secure yourself when hunting on the dark web. And yeah, that's it.

So, introduction to the dark web. I'm sure you have seen this image a lot of times on the internet. It shows the difference between the surface web, the deep web and the dark web. The surface web is the sites that are indexed by search engines: Google, Bing, Yahoo, etc. The majority of the internet is the deep web. The deep web is any website that is not indexed by a search engine. This can include your databases, your server instances, or any other website that you cannot find from a search engine. The third one is the dark web, which we will mostly talk about today. This is the part of the internet where you need some kind of special software to get access. It can include anything related to drugs or weapons being sold, or some kind of research or books that are sold there. The dark web includes several forums and marketplaces where people sell these kinds of things. People confuse the dark web with the deep web, but as you can see, only 6% of the internet is dark web and 80 to 90% is deep web.

Moving on, how do you access the dark web? Several organizations offer their own dark web systems, you can call them. The famous one is Tor, that's The Onion Router. The second is I2P, that's the Invisible Internet Project, and ZeroNet is also becoming popular nowadays. Tor has the .onion domain, meaning the domain name ends with .onion, and I2P has .i2p. Talking more about Tor: it works like a three-layer proxy system. As you can see in the image here, there is an entry node, a middle node and an exit node.
The entry node is where your traffic goes first; then it goes through the middle node and the exit node, and then on to the destination. This way your identity is hidden. Also, the entry nodes are publicly listed. So Tor hides your identity, it hides your IP address, and these nodes are volunteer-run systems. Tor has about 6000 relays; I don't know the number for I2P, but it is also becoming more popular nowadays. The major thing about Tor is that each node only knows the IP address of the next node or the previous node. So here, the entry node doesn't know about the exit node, or vice versa. In this way the locations of the nodes are also protected.

There are many misconceptions about Tor and the dark web. The famous one is that whenever we talk about Tor, people think about criminals and the criminal things that go on there. Yes, there is a criminal side of Tor, but there is a good side of Tor, or the dark web, also. It's really popular among whistleblowers and activists. There are many countries where free speech is limited, and people from those countries use Tor to express what they think. The dark web also gives access to a lot of old literature and research that is not available on the open web. And it's safe for journalists, obviously. Many popular sites, like Facebook and the New York Times, have a .onion counterpart on Tor.

The second thing: many people think that it's illegal to access the dark web. It's not illegal to access the dark web. What is illegal is to indulge in the activities there, like purchasing drugs or any other illegal things being sold on the dark web. Accessing the dark web itself is completely legal.
The last thing I would talk about is that many people think Tor is really big. But if you look at the uptime, or the availability of sites, there are very few reachable onion domains on the dark web compared with the clear web. So the dark web is a very small part of the internet.

Now we will talk about the dark side of the dark web, I mean the criminal side. There are many forums and marketplaces on the dark web. These are some of the sites that are relevant to security researchers, or to people who want to access the dark web for their organization's benefit. Some of them are credit card markets, where credit cards are being dumped. Some are remote access forums, which include remote access Trojans or other kinds of remote access tools. And some are insider threat forums. Those are a recent, up-and-coming kind of forum where insiders, the people who are selling their companies' secrets, talk amongst themselves. So these are some of the relevant sites.

Now coming to the cost: how much does it cost to buy something from the dark web? As you can see, it's really easy to buy these things on the dark web. You can buy an SSN for one dollar, or a fake Facebook account with 15 friends, mobile malware, bank details, or exploits or zero-days. These types of things are easy to get on the dark web. That's why security researchers and other people are focusing more on it. You might have heard recent news like 500,000 Zoom accounts sold on the dark web, or 267 million Facebook user profiles sold on the dark web. There are many data breaches occurring day by day, and the data is being sold on the dark web. That's why researching the dark web is really important. These are some of the product listings from dark web forums. This is how the products are listed.
Here you can see the average cost of accounts for different online services: what the average cost is for bank services, or for video game services. And this is the average cost of tools being sold on the dark web. As you can see, for banking and financial the average cost is $74, so you can buy a brute-forcing tool for some bank at $74.

Now coming to why you should hunt on the dark web. Before that, let's talk about what threat hunting is. Threat hunting is proactive searching for cyber threats. Proactive means before the attack happens. You search for cyber threats in logs, in indicators of compromise (that's IP addresses, emails, domains, etc.), or in textual data, which is what we are doing when we search the dark web. It's basically hypothesis-based, because there is nothing fixed about the process. You take one use case and work on it, then you take another use case and work on it, and it goes on iteratively in the same way. Many times machine learning, natural language processing (NLP) and advanced analytics are used in this, because you need to scan through textual data if you are hunting on the dark web, or on the clear web for that matter.

So why is threat hunting on the dark web really important? What's so big about it? As I told you, there are many forums, marketplaces and dump shops. What criminals or actors do there is learn new methods and techniques. They monetize their skills. They trade their exploits or tools, or even drugs and weapons. And they communicate: they talk to each other and share their ideas for new attacks. A security researcher who is researching on the dark web can learn a lot while engaging with these communities.
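As an illustration of the indicators of compromise mentioned above (IP addresses, emails, domains), here is a minimal sketch of pulling them out of scraped text with regular expressions. This is not from the talk: the function name `extract_iocs` and the patterns are assumptions chosen for illustration, and they are deliberately simple. Real extraction would need stricter validation and defanging support.

```python
import re

# Illustrative patterns only (assumption): loose matching, no validation of TLDs etc.
IPV4_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
# Version 3 onion addresses are 56 base32 characters followed by ".onion".
ONION_RE = re.compile(r"\b[a-z2-7]{56}\.onion\b")


def extract_iocs(text):
    """Return de-duplicated, sorted indicators found in a block of text."""
    return {
        "ipv4": sorted(set(IPV4_RE.findall(text))),
        "email": sorted(set(EMAIL_RE.findall(text))),
        "onion": sorted(set(ONION_RE.findall(text))),
    }


if __name__ == "__main__":
    post = "Selling access, contact seller@example.com, panel at 192.168.10.5"
    print(extract_iocs(post))
```

Keeping the output as plain lists per indicator type makes it easy to store alongside the post it came from and to compare against your own organization's assets later.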
You can learn their techniques, or TTPs, that's tactics, techniques and procedures: how they think about an attack, how they plan an attack. If you do it correctly, you can identify attacks in the earlier stages, that's the planning and reconnaissance stages, and you can reduce the impact. Suppose your organization's data is being sold on the dark web. There are different kinds of impact that can have on your organization. Some of the direct impacts are personal information stolen, healthcare records stolen, or even your company's trade secrets. Some of the indirect impacts are damage to your organization's reputation, revenue loss, and nowadays the legal penalties when data is lost, and you have to cover the cost to the customers. That's why it is really important to research the dark web, to hunt on the dark web.

Along the same lines, these are the benefits of threat hunting. If you do it correctly, you can keep up with the latest attack trends. You can get new TTPs. You can identify insider threats. You can discover data breaches. The main thing is that you can prepare your SOCs and incident responders to deal with an attack, because they will know beforehand which TTPs the attackers are using. So they can reduce the damage and risk to the organization by acting quickly.

Coming to the methods to hunt on the dark web: we will discuss some tools that are used to hunt on the dark web, and then the human element that can be used. The first tool, and it's really, really important for this, is Scrapy. It's a web crawling framework. It's so popular and so important because it manages concurrent requests automatically.
So you don't have to spend much time on the multi-threading part yourself, because Scrapy already has that capability; it takes one or two lines of parameters. The second thing is Tor. Obviously, if you want to access the dark web, you need Tor. OnionScan is another tool, used to investigate onion websites. It can tell you whether a website is up or not, and the correlation between different websites on the dark web.

Coming to Privoxy. Privoxy is a web proxy. Before getting more into this: when you access the dark web, you need some kind of proxy, because, as I told you before, the entry nodes are publicly listed. Your ISP can have a blacklist to block those entry nodes, so you can't access the dark web. Or even if the ISP doesn't block it, they can see whether you are accessing the dark web or not. They cannot see what you are doing there, but they can see that you are accessing it. You might not want that. That's why you need some kind of proxy, and the majority of people use a SOCKS proxy. The basic difference between an HTTP proxy and a SOCKS proxy is that SOCKS is a lower-level proxy that works on the SOCKS protocol. An HTTP proxy only works for HTTP or HTTPS websites, but a SOCKS proxy can work with other protocols also. There are different tools for working with a SOCKS proxy, like tsocks, Polipo and Privoxy. I've been using Privoxy, I haven't had any problem with it, and it's good. Another thing is that Scrapy doesn't let you use a SOCKS proxy directly, because it doesn't support SOCKS. That's why you have to use tools like Privoxy, tsocks or Polipo to route Scrapy's traffic to the SOCKS proxy. There are other tools also, like the search engines Kilos or Recon, where you can find different onion domains. Apart from using a SOCKS proxy, you can also use a VPN with Tor for an extra layer of protection and for encrypting your data.
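To make the proxy chain above concrete, here is a minimal sketch of a Scrapy-style downloader middleware that sends .onion requests through a local HTTP proxy. It rests on assumptions that are not from the talk: Privoxy is listening on its default address `127.0.0.1:8118` and has been configured to forward to Tor's default SOCKS port (a Privoxy config line such as `forward-socks5t / 127.0.0.1:9050 .`), and the class name `OnionProxyMiddleware` is hypothetical. Scrapy does not require middlewares to inherit from a base class, so the sketch is plain Python.

```python
from urllib.parse import urlsplit

# Assumption: Privoxy runs locally on its default listen address.
PRIVOXY_URL = "http://127.0.0.1:8118"


class OnionProxyMiddleware:
    """Hypothetical downloader middleware: route .onion requests via an HTTP proxy."""

    def __init__(self, proxy_url=PRIVOXY_URL):
        self.proxy_url = proxy_url

    def process_request(self, request, spider):
        host = urlsplit(request.url).hostname or ""
        if host.endswith(".onion"):
            # Scrapy's built-in HttpProxyMiddleware honours request.meta["proxy"].
            request.meta["proxy"] = self.proxy_url
        # Returning None tells Scrapy to keep processing the request normally.
        return None


# To enable it, a project's settings.py would list the class, for example:
# DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.OnionProxyMiddleware": 350}
# ("myproject" is a placeholder module path, not a real package.)
```

Limiting the proxy to .onion hosts is a design choice for the sketch; a setup that sends all traffic through Tor would simply set the proxy on every request.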
Getting more into the Scrapy part: this image might seem a little confusing, but I will go through it step by step. This is why Scrapy is really important and so useful for hunting on the dark web. I will explain it in terms of Python code. Suppose everything you see here is a separate Python program. The spider is a Python program. The downloader is a Python program. The middleware is a Python program, and so on.

In the spider program, you give the onion domain that you want to crawl, that you want to get data from. The spider gives it to the engine. Think of the engine as just the program that manages every other program. So the spider gives the onion domain to the engine, and the engine gives it to the scheduler. Here the multi-threading concept comes into the picture: the scheduler gets different domains and schedules them to be fetched concurrently. So the onion domains go into the scheduler, the scheduler gives them back to the engine, and the engine gives them to the middleware.

The middleware program includes your proxy function and your login function. The proxy function is where you put your Privoxy IP or Tor IP, so that requests and responses go through that proxy and you can access the dark web. The login function is where you put your user agents or cookies. Nowadays, for all the forums, you have to have an account to access that particular forum, so to get in you need some kind of cookies. Also, many high-level forums implement CAPTCHAs, and Google's CAPTCHA doesn't work on the dark web.
So these CAPTCHAs are image-based or text-based ones that are easy to bypass. You can use any machine-learning CAPTCHA-bypassing service, or CAPTCHA-bypassing websites like Death By Captcha or Anti-Captcha. All of this code you write in the login function in the middleware program.

Now your request, your traffic, goes through the middleware to the downloader. The downloader is a simple program that fetches the HTML and gives it back to the engine. The engine gives it back to the spider. There is another function in the spider that extracts the HTML entities you want from the forum's HTML: say the forum name, the document ID, the text-based data, or the name of the author who posted a particular piece of content. These are called items in Scrapy. The spider sends these items to the item pipeline. The item pipeline is where your database is configured. I use Elasticsearch; you can use any database, whether SQL or NoSQL, and the pipeline saves the items directly to the database.

The important things to note here: as I told you before, multi-threading is automatically handled by Scrapy. The other thing is that you don't have to give multiple onion domains to the spider. When the downloader gets the HTML page from a particular forum, there is code, which you don't have to configure, that gets all the onion domains on that HTML page. The scheduler then automatically schedules those other onion domains to go through the same process again and again. This way you don't have to give extra onion domains to the engine. And this is why Scrapy is really useful for crawling data from the dark web.
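The link-discovery step described above, where new onion domains found on a downloaded page are fed back to the scheduler, can be sketched with the standard library alone. This is an illustration of the idea, not Scrapy's own implementation; the names `extract_onion_links` and `seen` are hypothetical, and in a real Scrapy spider you would normally use its response and link-following helpers instead.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class _HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_onion_links(page_url, html, seen):
    """Return new absolute .onion links on the page and record them in `seen`."""
    collector = _HrefCollector()
    collector.feed(html)
    new_links = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        host = urlsplit(absolute).hostname or ""
        if host.endswith(".onion") and absolute not in seen:
            seen.add(absolute)  # the set plays the role of a duplicate filter
            new_links.append(absolute)
    return new_links
```

Each returned link would be handed back to the crawl queue, which is how a single starting domain can fan out into many without being listed by hand.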
You can also specify which domains to crawl and which domains to block, and this way you can be safe from collecting illegal data or illegal images.

Moving on, now comes the human part. We discussed which tools you can use to hunt on the dark web. There is a human element also, called human intelligence or HUMINT. It's the process of gathering intelligence through interpersonal contact rather than through tools or a technical process. That's why it's the most dangerous and difficult form: you are directly talking to the actor on the dark web, which is not safe, because you don't want your identity or your organization's identity to be revealed to the actor.

It's important also, because you can identify and respond to attacks much more quickly. You can do post-attack investigation. Suppose there is a data breach at your organization, someone is selling the data on the dark web, and you want to confirm whether they are lying about it or telling the truth. You can activate your human intelligence, the person who is researching on the dark web, to go and ask the actors whether the data is genuine. That's post-attack investigation. You can also use it for new attack discovery: discovering new TTPs that attackers are using or discussing.

You can think of it as a high-tech equivalent of what an FBI agent does when he spends months or years working to infiltrate a criminal organization. That's why it's really hard to do, because you have to spend so much time on it, and that's why it's risky. For this you have to think like an actor: how they communicate within these communities, how they act within these communities. Another thing is that what you get from this source is really valuable to your organization's safety. Moving on, we talked
about tools, we talked about the human intelligence part. Now comes the pipeline, or the architecture, of how you can automate this threat hunting. Before that, I would suggest setting up a separate system: you don't want your personal data on the system where you are doing threat hunting. You can set up any lab or VM, whether physical or in the cloud. Just isolate the network and install the relevant tools like Scrapy, Privoxy and Tor; if you are using Elasticsearch or Kibana, then the ELK stack; and the different Python libraries that are necessary for your task.

This is the automated architecture that I have been using. I will go through it one by one. I have put this automated icon on the tasks that can be automated. The tasks without it, I think it's only the Scrapy setup and designing and training the NLP model, are hard to automate, so I will discuss those as we go.

First of all, you need to get the forum links or market links. You can write a simple script to gather them from different search engines, like I told you, Recon and others, where you can get all the forum links. So that can be automated. Another thing is the SOCKS proxy. Like I told you, you have to use some kind of SOCKS proxy for this, so you can again write a simple script to get the SOCKS proxy IPs. That can be automated.

Now comes the Scrapy setup. Here you write your login functions, do your proxy setup, and manage the Scrapy settings. You can't automate this, because when you get the onion links for the different forums, you have to go to the forums and sign in. Yes, you can write scripts for logging in or signing up using different accounts, but I found that difficult, which is why I have been doing it manually: creating different accounts, like four to five accounts per forum, and then noting that down, like the
username, password and cookies, into Scrapy. You also need different scrapers for different forums, because each forum is structured differently. That's why this step is manual: you first log in to the forum, analyze it, and then write a separate function for each forum for the HTML elements that you want to access.

Coming to the crawler part: the crawler, parser, analyzer and ELK are all part of the Scrapy system that I discussed before. I have just drawn them separately so you can understand what each part does. The crawler crawls HTML pages from the forums. The parser parses the HTML pages, getting elements like post, post content, author, etc. The analyzer is a part for which you write your own functions in Scrapy. Suppose you got the data from a dark web forum. Now you need some technique to pick out the content that is relevant to your organization, or relevant to your threat model. We will discuss what threat modeling is later in the presentation; for now, just understand that you can't focus on every threat that's out there. You have to focus on threats that are relevant to your organization, so you need some kind of analysis to get the relevant data from the dark web.

Here comes the NLP model that I have been using. You can design or train your NLP model so that it gets you just the content that is relevant to your organization. Suppose you are a bank: you don't want to focus on tools or on data breaches that are not relevant to your bank. You would most likely want to focus on the dump shops, where credit cards or debit cards are being dumped. In this way different organizations have different requirements, and you want to focus on
those. Now, designing and training the NLP model can't be automated, because you first need some content relevant to your threat model, and then you need to train your NLP model on it. Either you get the data first or you design your model first, so it's a chicken-and-egg problem. But nowadays there are NLP models, like seeded LDA, where you can provide some kind of context before training, so it's easier to do. And then you store the data in ELK. All these things can be automated.

Coming to the part after getting this data: what's the process after hunting? Now we'll discuss a little about the threat intelligence life cycle. The threat intelligence life cycle is the set of steps your organization takes, starting from getting the data through to presenting it. How does threat hunting on the dark web correspond to this? There are five phases, as you can see: direction, collection, processing, analysis and dissemination. We are doing the direction phase from the human sources; as you can see, dark web, social media, forums. In the direction phase we identify dark web forums, we register on those forums, we acquire access to those forums. In the collection phase you use Scrapy to establish access and collect raw data. The processing phase also uses Scrapy: you parse the raw HTML data and extract topics and authors. The analysis phase is where we use NLP and machine learning models to infer relationships in this data. We get the data that is relevant to our organization, we link data sources, we identify trends, hacks, leaks, etc. The dissemination phase is where we visualize the data in dashboards, if you are using Kibana or some other kind of dashboard, and we give out alerts and reports for our higher managers or other people in our organization to see. So this is the crux of how threat hunting on the dark web maps to the threat intelligence life cycle. Now,
threat modeling, as I told you I was going to talk about this later in the presentation. Threat modeling is about identifying your organization's critical assets and focusing on them. It's understanding the threats and how you can mitigate them when they happen to your organization in particular. You understand what an attacker wants, what critical assets you have in your organization, what types of actors can target you, whether they would be hacktivists, insiders or some kind of criminal group, and you know their capabilities.

Here you choose your targets on the dark web. If you are a bank, you focus on credit card markets. If you are some other organization, you focus on insider threat markets, or you focus on general markets. You prioritize risk; you can use the Pyramid of Pain for that. So you prioritize risk and focus on the IOCs that are relevant to your organization.

Another thing is that you don't use just one source. There are many, many forums on the dark web, so you don't focus on just one target; you focus on multiple targets. Apart from the dark web, you focus on multiple clear web sites also, like Pastebin or Twitter, and nowadays many actors are communicating on Telegram too. You cover both the dark web and the clear web to get everything you can for protecting your organization.

So again, data collection and processing: you collect data from the clear web and the dark web. On the clear web some of the sites are Pastebin, Twitter and Reddit; on the dark web it's the different forums and marketplaces. You can do all this using the Scrapy crawler and parser that we discussed before. For the analysis part of the threat intelligence model, like I told you before, you use NLP, machine learning or deep learning techniques, some of them being LDA, BERT and GPT, to gather information related to your organization. You
use social network analysis to analyze the different users on the dark web who post data related to your organization. There is clustering of products according to categories, and there is classification, like binary classification; that's how you classify the different products being sold on the dark web. All these things come under analysis.

I will touch a little on the MITRE ATT&CK framework. MITRE ATT&CK is a knowledge base of TTPs that was built using real-world observations. It contains the different tactics, techniques and procedures that attackers have used over the years. You use the ATT&CK matrix to map the intelligence you obtained, to understand the TTPs better and to protect your organization more.

Now coming to the operational security stuff. When hunting on the dark web, or doing human intelligence on the dark web, you need to follow a set of processes so that you don't reveal your data, your identity or your organization's identity, as I said before. So what is OPSEC? OPSEC is the practice of hiding yourself online so that you don't reveal your real self or compromise your own operations. The term comes from the US military: operations security. You need to hide your PII, that's personally identifiable information. You need to work on the dark web in such a way that you don't disclose your full name, driver's license, bank account, or even something as simple as an email address. This is what you need to protect, and that's why operational security is really important.

It's also a hard thing to do, because at the end of the day we are all humans. We like to be seen as knowledgeable and we like to impress others. All of this leads to gossiping, bragging and oversharing. That's why operational security is really hard. Most of the time people think of it as a process. People think, I have to do human intelligence stuff now,
I have to follow operational security. It should not be like that. It should not be seen as a burden or as just another job task to perform. It should be a mindset: you should always think about operational security before doing human intelligence or engaging with actors.

I will discuss some of the things you can do to maintain OPSEC in your daily routine; there are many others. The main thing you want is to hide your identity. The first thing you should do is use a separate system, like I said before, where you don't store any personal information, whether it's a lab, a VM or some other system. Use Tor with a proxy, or Tor over a VPN.

The main thing is maintaining different personas on the dark web. Like I told you, it's the equivalent of an FBI agent going undercover: he has some kind of persona, he has a backstory. You have to do that. You should have different personas for the different identities you have on different forums, and you should never mix them up. That's why you take extensive notes, so that you don't mess up the personas. You should always watch what you say and always think before posting.

Human intelligence is not a 9-to-5 job. You can't communicate with actors only during your job hours, because they will notice that you are doing this as your job. You can be exposed: they can easily guess that you are a researcher and not a threat actor, and that would be a tip-off for them. That's why you have to do this work 24/7, on weekends and after your work hours. It's not a 9-to-5 thing.

You also have to develop appropriate language skills, because actors don't talk formally, so you need the right language, the slang, also. There are many forums,
like Russian forums or German forums, so you might need to develop those language skills, like learning Russian or learning German. Another thing to note is changing time zones. Suppose you are in the US and you are engaging with a community on a Russian forum. You might want to change your time zone to Russia, because otherwise it would be a tip-off to the actors that you are a security researcher and not a real actor. So these are some of the things that should be noted before doing human intelligence on the dark web.

Now, that was it. To conclude: we discussed what the dark web is and how to access it. We discussed dark web forums and marketplaces, which products are being sold there and what the cost model around them is. We discussed threat hunting on the dark web and how you can hunt there. We discussed different tools, and the main tool was Scrapy, which is the main framework we work with to hunt on the dark web. We discussed human intelligence and how it should be used to support your tool-based data collection. We discussed the pipeline, or the architecture, that can be used to automate dark web hunting. We talked a little about the threat intelligence life cycle and how threat hunting on the dark web maps to its steps. And we talked about operational security, why it is so important and why it is so hard to do.

Some more points to note. Obviously, threat hunting on the dark web is hard, but it is worth the effort, because you do get intelligence on the dark web. You should always keep operational security in mind. And like I said before, you should look at more than one resource: you should look at clear web sources also, like Pastebin and Telegram as examples, and at the different forums on the dark web. It takes a lot of resources and a team effort; you cannot do all these
things alone, so we have a team for this. And we talked a little about using the MITRE ATT&CK framework and how it can be useful to map your TTPs to it.

These are some of the resources that I suggest you read if you want to know more about the dark web or dark web hunting. The major ones are Recorded Future, IntSights and Digital Shadows. They release blogs and white papers regularly, so read them and you will understand what is out there on the dark web.

So yeah, that was it, I think. Thank you so much. I hope you all liked my presentation. You can contact me on Twitter or LinkedIn if you have any doubts or want to discuss this stuff more, and I'm on Discord to answer all your questions. So yeah, thank you.