And I am the author of DataSploit, which is an open source intelligence framework. I also run a DEF CON village which focuses on reconnaissance and open source intelligence. I've been on the conference circuit for a while, and I'm here to share some of my experience around open source intelligence. Jumping directly to the agenda of the talk: first we will discuss what open source intelligence is and why security teams should use it. Following that, we will discuss how you should do continuous discovery of assets and monitor them, and then use those assets for periodic attack simulation; how you can discover sensitive information; how you should monitor breached passwords of your accounts as well as your employees'; and how you can use OSINT proactively, including on social media, where you can check the reputation of your organization, et cetera. And then ultimately we will discuss the countermeasures you should think about. Before we start, a few terms. Whenever I say brute force, I basically mean using a trial-and-error method for obtaining information such as a username, password, bucket name, OTP code or anything for that matter. Whenever I say black box, I mean we are referring to a system about which we have no information. When we say white box, we assume we have all the information about the system, so before we do any kind of engagement or exercise on an organization, we have all the access. And then gray box, which is a mix: assume you have passwords or maybe some access control policies, but not all the information. Whenever I use the word patch, it basically means fixing security vulnerabilities and other bugs; it could be functional bugs as well. So what is open source intelligence? Basically it is collecting information from the internet, analyzing it, applying intelligence on top of it, and then getting actionable intelligence out of it.
So if you just want a representation of that: find raw information, apply intelligence on top of it, analyze it, correlate that information, and then you have something you can use for taking action. And why should we do OSINT? Because nowadays everyone has a smartphone, everyone loves uploading pictures on Facebook, especially boarding passes when boarding a flight, which have a QR code, which can ultimately result in leaking information. Now this is one of my favorite tweets on Twitter, where a government person tweeted just to show off what kind of things he's up to, and the screen clearly shows the ID card he's holding, using which a physical security person could actually clone an ID card. It is also showing some call lists over here. I don't know if I can zoom here. Yeah, I can. So there are some call lists. You can figure out that they are using a Windows box, and that IE, Chrome and Skype are installed on the machine. You can figure out there are some predictable passwords for the Wi-Fi, and the camera placement, in case someone is interested in doing something nasty to this organization. And this is just an example of the kind of information we upload on the internet without even realizing that information is going out. For example, if you have a smart DSLR, you take a pic and upload it on some social media, and you don't realize that it actually appends your geolocation information. Unless the social media site is stripping that information, anyone can figure out where you actually took that photo, okay? Now if it is a photo from your trips, it's all right, but if you are just taking a selfie at your home, you are revealing your home address. And that is why OSINT is important. Now, why should security teams, defensive teams, do open source intelligence? Because hackers do, right?
And because hackers can do this and use this information to attack you, as a security team you should also do open source intelligence and be ahead of hackers in the game. A few more reasons: because people leak sensitive information on code aggregators like GitHub and Bitbucket, and there are so many untracked assets running around which are easy targets. Most of them are legacy boxes like a.abc.com, example.abc.com, and most commonly backup.example.com. Also, with every organization moving to the cloud, we have seen a lot of cases where people spawn boxes with private IP addresses, but they don't realize there's a public IP attached to that box as well. And eventually when you run a scan and find the list of assets, you realize you have so many IP addresses lying out in public which are not even required, and they're running with port 3306 open, inviting attackers to gain access to your database. Again, they don't have the passwords, but you can get passwords a lot of the time from GitHub, Bitbucket and such code aggregators, which we just talked about. Also, with open source intelligence, attacks become very targeted. If I know you are more interested in football and cycling, the dictionary lists I make for launching a brute force attack will have those keywords in them, instead of running a one-million-password list against you. Also, employees use breached passwords in corporate accounts. This is in line with the habit of using the same password everywhere, and people love that. And that is something hackers always keep an eye on. Also, even when you have a fully patched system, let's say you have all the latest patches installed on your machines, a credential leak can actually defeat all of that and still allow the attacker to get inside your network or your machines. There are a few case studies where people have found information about companies on the internet and then made a lot of money, because the bugs were really critical.
They have found Slack tokens and API tokens of organizations. Tesla had a bunch of their AWS keys leaked, which attackers used for logging into their AWS account, spawning instances and doing some crypto mining. So things like that are pretty common nowadays; especially as everyone moves to cloud environments, these kinds of issues are coming in. And that's the reason security teams should really, really include open source intelligence in their cycle. Now the question is, how do you do that? The first thing you should do is have a continuous discovery process where you continuously discover your assets and monitor them, and have a clear number for how many assets you actually own. There should be periodic attack simulation using open source intelligence again. For example, if you have found 50 passwords related to your organization and you have 30 domains or subdomains, you should actually spray these credentials across your domains or subdomains so that you know if any of them is working. Because if one is working, it's a critical issue. You should also monitor breached passwords. You should have a periodic check for discovering sensitive information. Now this could be an automated check, or you could have a person on your team doing this every month or every week, but these kinds of systems should always be in place. You should also be identifying security incidents using social media intelligence; we will get into that later on. So another question which comes up is: what is an asset? Earlier we have always been talking about assets having monetary value. They could be servers, your hard disks, routers, your organization's buildings, et cetera. These are the kinds of assets for which you can calculate a direct financial value. But how about your social media accounts? If you have an account which has a huge number of followers and something goes wrong there, do you count it as an asset?
If your password is being leaked in your repository, do you count it as an asset? A lot of people don't. But think about it: if that password can actually cause a lot of financial as well as reputational cost to your organization, it becomes an asset, right? So ideally these things should also be considered when you are making your asset inventories. They don't have any monetary value, but they can cause a huge reputational and financial loss to the organization. So picking up the first item from the things we should do, continuous discovery and monitoring of assets: these are a few of the assets, and of course it is not limited to these, but these are the kinds of things people should be looking at. IP addresses, which could be dynamic IP addresses in your data centers, and elastic IP addresses if you are using any cloud infrastructure. You should monitor your domains, subdomains, and cloud storage objects, which include your buckets, blobs and spaces where you actually store your information. Leaked credentials, API keys, social media accounts, third party API keys, and analytics tags, which we'll come to later. And the supply chain, especially if your company has acquired a company, or maybe you have a merger, or you have a vendor which has access to your internal domains. They actually increase your attack surface area, and hence you should always do some kind of OSINT on that part as well. Coming to IP addresses, here are a few of the sources, again not limited to these, but common sources from where you can fetch the list of IP addresses. You can use your cloud API if you are using AWS, GCP or Azure; they give you APIs from which you can fetch the list of your IP addresses. If you have a hybrid system where you have a data center as well, you should contact your admins, who can give you the list of IP addresses. Internet-wide scans are something pretty common these days.
You can scan the whole internet using tools like ZMap and Masscan, using which you can map the whole attack surface of the internet and say, okay, these are the subdomains, these are the IP addresses which exist, and these are the port numbers which are open. So you can use those databases. You can also use the ASN, the Autonomous System Number, which is assigned to an organization that has a huge range of IP addresses. You can also use whois reverse search, we'll come to that, and reverse PTR records: basically you can generate a list of IP addresses and then do a reverse pointer lookup on them, which will give you a list of domains, and then you can figure out whether those assets are actually related to the organization or not. I have a small demonstration around this. So here we are doing a whois lookup. First we have done a dig on yandex.ru. Dig is basically the command for looking up DNS records, or PTR records, for a specific domain. Using that, we figured out the IP addresses they have, and we can pick any one of them. We take that IP address, do a whois on it, and then we grep 'AS' out of it. So we get the ASN, which is 13238 in this case. So we go further, and then, I'm just switching windows. Yeah, so we got the ASN, which is 13238. And then we pass it to this command, in which we are doing a reverse ASN lookup: we are identifying all the IP ranges which have ASN 13238, and then I have done some grep, uniq and sort on top of it, so we just have a list of IP addresses. We run that, and these are the IP addresses which are assigned to this organization, which is Yandex. Now, of course, in the case of organizations which are using a cloud platform, you will get the ASN of those cloud platforms, like GCP or AWS, which will not be relevant, of course.
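The dig/whois/grep pipeline from the demo can be sketched as a small Python helper. The whois fragment below is an illustrative sample, not real output from the demo, and the `asn_for_ip` helper assumes a `whois` binary is installed on the machine.

```python
import re
import subprocess

def extract_asn(whois_text):
    """Pull the first AS number (e.g. 'AS13238') out of raw whois output."""
    match = re.search(r"\bAS(\d+)\b", whois_text)
    return f"AS{match.group(1)}" if match else None

def asn_for_ip(ip):
    """Run `whois` against an IP and extract its origin ASN (needs whois installed)."""
    out = subprocess.run(["whois", ip], capture_output=True, text=True).stdout
    return extract_asn(out)

# Illustrative whois fragment, similar in shape to what the demo greps through:
sample = """
route:      5.255.192.0/18
origin:     AS13238
descr:      Yandex enterprise network
"""
print(extract_asn(sample))  # AS13238
```

From here you would feed the ASN into a reverse lookup (as in the demo) to enumerate the organization's IP ranges.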
But if you're talking about a bigger organization, this is one of the ways to find their IP addresses, especially when you're interested in bug bounties, doing bug bounty hunting for big organizations. Okay, going back to the slides. The other thing we use is Project Sonar. Project Sonar is a project carried on by Rapid7 and the University of Michigan, in which they do internet-wide scans, not only of domain name records but also port scans, and they release the results every week. We'll have a demo around it, but to summarize what kind of data you can get: you can get forward DNS records, that is A records, AAAA records, CNAME records, TXT records, SOA records. Now, obviously, these databases are huge; they can range from 17 gigabytes to 300 gigabytes. I have uploaded them on Google BigQuery and I do searches on top of that, but if you want, you can have your own systems. The place you can download these datasets is scans.io. You can also find domains registered against an email address. I've used my own email address, which is quite funny, so don't laugh. Okay, I just said don't laugh. Anyway, this is the email ID which I'm using, and all of these domains I actually bought with this email ID. So you can not only find an email address from a domain, you can also find domains from an email address. But there's a disclaimer: this is not reliable information, because after GDPR, whois is not a reliable source anymore, but you always have to try your luck. Then you have to find the subdomains of the organization, and again, I'm still talking from a defensive perspective. It looks like I'm an attacker right now, but to defend your organization from attackers, you have to get into their shoes and do all the things an attacker does. So you can use search engines. You can run queries like site:abc.com.
It will give you a lot of results, and then you can keep removing the subdomains you have already listed. You can again use internet-wide scans. You can use certificate transparency reports. You can use brute force: for example, if you have the domain example.com, you can try all the subdomains like a.example.com, b.example.com, and so on and so forth. You can also use reverse IP lookup. We just saw how we can find the IP addresses of an organization using the ASN. So now we have a list of IP addresses which we can reverse lookup, and it might actually give us some subdomains of the organization. There are actually a lot of tools around subdomain enumeration. These two tools are my favorites: I use Sublist3r and Amass, because they have a lot of sources from which they fetch the list of subdomains. And I use aiodnsbrute, which is a really good tool to brute force domain names; the only catch is that it is heavy on network bandwidth, because it just sends so much traffic. We will quickly have a demonstration of how you can find subdomains from Project Sonar, and I'm running into the same problem again. All right. So this is the dataset which I have uploaded, the open data dataset. This is the table which I've created, and fdns_cname is the table where I have forward DNS records. If you look, this is just 11 gigabytes of data, and the other one is reverse DNS, in which we have PTR records, and this is 80 gigabytes of data. Now I am running these queries, first checking anything.nokia.com. Nokia.com because they have a bug bounty program and also they have a lot of data, so it is a good example to showcase. So when I run it on nokia.com, it gives me a huge list of subdomains which belong to them, as well as IP addresses which belong to them. So if you look at it, there are a bunch of subdomains coming up, and if you look at the results, there are 4,000 results.
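The a.example.com, b.example.com brute-force idea mentioned above can be sketched in a few lines of Python. The wordlist here is a made-up example; the `resolves` helper needs real network access, so treat the whole thing as an illustrative sketch rather than a production scanner.

```python
import socket
import string

def candidate_subdomains(domain, words):
    """Build candidate FQDNs like a.example.com, b.example.com, backup.example.com."""
    return [f"{w}.{domain}" for w in words]

def resolves(fqdn):
    """True if the name resolves in DNS; this part needs network access."""
    try:
        socket.gethostbyname(fqdn)
        return True
    except socket.gaierror:
        return False

# Single letters a..z plus a few common names, as described in the talk.
wordlist = list(string.ascii_lowercase) + ["backup", "dev", "stage", "mail"]
candidates = candidate_subdomains("example.com", wordlist)
print(candidates[0], candidates[-1], len(candidates))  # a.example.com mail.example.com 30
```

Real tools like aiodnsbrute do the same thing asynchronously with very large wordlists, which is exactly why they are so heavy on bandwidth.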
So we can actually work it out; of course there could be duplicates in this, but there's a good chance that there are more than 3,000 subdomains here, as well as IP addresses. Similarly, we can run this on the forward DNS data. So when I run it on forward DNS, we have 595 domains. These are just from the CNAME records dataset, and again we have so many domains, like scl.developer.nokia.com, scportal.networks.nokia.com. Now if you look at them, it is really difficult to brute force such domains, because they are not just subdomains, they are sub-subdomains, right? This is the easiest way to find such subdomains. So now we talk about cloud storage objects. A lot of companies store their data in buckets: AWS buckets, Google Cloud buckets, blobs, spaces. A lot of times they store sensitive data, intentionally or unintentionally, because sometimes they store sensitive data thinking the bucket is not publicly available, and sometimes they are just serving static assets like JavaScript and images, but by mistake they upload backup.tar files, because human, you know? And if you can list those objects, you can download the whole backup.tar file, including config.php kinds of files. I mean, these are just examples. But how do you do that? You can spider the websites, fetch the source code of every page, extract the names of the buckets or cloud storage objects based on regex patterns, and then check their permissions. The other approach is to generate possible bucket names. For example, if the company name is Example, you can follow common nomenclature and generate patterns like example-01-prod, example-stage, those kinds of patterns, and brute force based on that. Here is just a screenshot of how this could be done. So I created a bunch of patterns for Carbon Console, which is a big organization; don't try this at home.
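The bucket name generation step just described might look like the following sketch. The environment keywords and separators are assumptions for illustration, not an exhaustive nomenclature list.

```python
import itertools

def candidate_bucket_names(company):
    """Generate plausible bucket names in the example-01-prod / example-stage style."""
    envs = ["prod", "stage", "dev", "backup", "assets", "01"]
    seps = ["", "-", "_", "."]
    names = {company}
    for env, sep in itertools.product(envs, seps):
        names.add(f"{company}{sep}{env}")   # example-prod, example_prod, ...
        names.add(f"{env}{sep}{company}")   # prod-example, prod.example, ...
    return sorted(names)

names = candidate_bucket_names("example")
print(len(names))  # 49
```

Each candidate would then be checked against the storage provider for existence and public listing permissions, which is what tools like bucket_finder.rb automate.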
And then we run a brute force against it using bucket_finder.rb, and we can figure out that one of the buckets has some sensitive information. Going forward, apart from cloud storage objects, we talked about finding leaked credentials. These credentials, or leaked information, could be passwords, API keys, third party access tokens, DB credentials, internal domains, or a lot more, right? And where can you find all of this? On GitHub, on Bitbucket, on Pastebin, on onion websites, et cetera. So one of the things which I always do is, if I have a project, I figure out the name of the client and their GitHub repositories, list all the repositories they have, and then check for any sensitive information being leaked in them. Because otherwise, for any information you get on GitHub, you have to verify whether it belongs to the organization or not, which you can, but it's a pretty lengthy process. The good way is to just find the repositories of the organization and then, you know, go about it. You can also use a Google Custom Search Engine, because Pastebin is not the only website where such anonymous pastes are allowed. There are so many websites, including gist.github.com, psbdmp, pastie.org, and many more. So you can just compile a list of them and create a Google Custom Search Engine on top of it, so you have one place where you can search. Also, the Google Custom Search Engine gives you API keys, so you can automate this and include it in your automation suites. You can use automated tools, including Gitrob and TruffleHog. TruffleHog is the one which I use. But here is a manual example of what I was doing before I created this presentation. I had a quick search around API keys or access keys, and I came across one access key, which of course I have redacted.
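Searches like this can be partially automated with a regex pass over whatever text you collect. A minimal sketch: the AWS access key ID shape (AKIA plus 16 uppercase alphanumerics) is well known, while the other two patterns are simplified illustrations of the kind of rules scanning tools ship with.

```python
import re

# Simplified secret patterns, similar in spirit to what scanning tools use.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "rsa_private_key": re.compile(r"-----BEGIN RSA PRIVATE KEY-----"),
    "generic_password": re.compile(r"(?i)password\s*[:=]\s*\S+"),
}

def scan_text(text):
    """Return (pattern_name, match) pairs for every secret-looking string found."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((name, m.group(0)))
    return hits

# Fabricated sample input; the key below is not a real credential.
sample = 'aws_key = "AKIAABCDEFGHIJKLMNOP"\npassword: hunter2'
print(scan_text(sample))
```

Running something like this over repository contents and commit histories is essentially what the tools discussed next do at scale.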
But when I checked it with a tool called nimbostratus to identify the privileges of that key, I figured out that this was an AWS root account, which basically means I could do anything on the AWS account, in case they have billing set up. So this is one way you can do a manual search. And then we have a small demonstration around TruffleHog. TruffleHog is a small tool which takes a GitHub repository as input. When you pass it, it basically downloads the source code of the repository as well as the commit history, and then it uses regex patterns to figure out if there is anything sensitive in there. Now it has highlighted a bunch of things in this repository. It says there's an RSA private key which was updated: someone created it, updated it, and deleted it. That basically means there could be some juicy information. Let's copy that commit hash, open it in a browser, and you figure out that this is actually the private key of the person who committed it; he realized he had committed his private key, so he deleted it. Finding this manually is obviously possible, but when you have tons of repositories, it becomes a troublesome task. So it is good to have tools like this, where you can just run a for loop or some kind of automation to figure it out. The next thing is social media monitoring. Now this is a very important aspect, not only in terms of your leaked information, but also in terms of your organization's reputation. For example, you are the security team of example.com, and someone drops a blog post saying example.com was hacked. You would really like to get notified of such blogs, posts or tweets which are going around. So you should really keep an eye on that, for which you can use streaming APIs. You can write your own scrapers.
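A minimal keyword watcher of that sort might be sketched like this. The watch list and document field names are hypothetical, and the actual streaming API and email pieces are omitted; this only shows the filtering and the shape of the document you would index.

```python
import json
from datetime import datetime, timezone

# Hypothetical keywords a security team might watch for.
WATCH_KEYWORDS = ["example.com hacked", "example.com breach"]

def matches(tweet_text):
    """Return the watched keywords that appear in a tweet (case-insensitive)."""
    lower = tweet_text.lower()
    return [k for k in WATCH_KEYWORDS if k in lower]

def to_document(tweet_text, screen_name):
    """Build the JSON document you would index into Elasticsearch; fields are illustrative."""
    return json.dumps({
        "text": tweet_text,
        "screen_name": screen_name,
        "matched": matches(tweet_text),
        "seen_at": datetime.now(timezone.utc).isoformat(),
    })

doc = to_document("Breaking: example.com HACKED, data for sale", "some_user")
print(doc)
```

In a real deployment the `matches` call would sit inside a streaming loop, firing the email alert and the Elasticsearch insert whenever it returns a non-empty list.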
Also, you can use Google Alerts for your organization's name, so that whenever there is a new result about your organization, you get an alert. Or you can have page change detection: as soon as there is a change in one of your pages, you get a notification. And this is how you can do some kind of monitoring. I wrote a tool called Tweet Monitor long back, which basically takes a keyword and keeps looking for that keyword. If someone tweets around that keyword, it sends you an email, and it also dumps this to Elasticsearch, or any database of your choice. Now once this is dumped into Elasticsearch, apart from the tweet itself there are a lot of attributes of the tweet, which include screen name, time zone, et cetera. They are dumped to Elasticsearch, on which you can draw your dashboards and figure out which users are actually doing these kinds of activities around your organization. It could be used for other purposes as well, but I have been using this for reputation checks. I have uploaded a video on YouTube, which I have not included as part of this presentation, but you can go to this link and have a look at how this could be done. Another thing which should really be looked at is analytics tags. A lot of times we have administrator accounts which set up analytics for a domain, and then they set up analytics for multiple domains of the same organization. What they forget is that they are using the same analytics ID across all these domains or subdomains. So there could be confidential domains or subdomains which you don't want to leak out, but if you are using the same Google Analytics code, or any other analytics code for that matter, this can leak out all the assets which are using the same ID, because they are using the same tag across multiple assets. So you can actually do a reverse lookup on this.
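That reverse lookup depends on extracting tag IDs from page source in the first place, which can be sketched as below. The UA- shape is the classic Google Analytics property ID format; the G- pattern for newer GA4 IDs is an approximation of the typical length, not a spec guarantee.

```python
import re

# Classic Google Analytics IDs look like UA-XXXXXX-Y; newer GA4 IDs look like G-XXXXXXXXXX.
GA_PATTERN = re.compile(r"\b(UA-\d{4,10}-\d{1,4}|G-[A-Z0-9]{6,12})\b")

def analytics_ids(html):
    """Extract analytics tag IDs from a page's HTML source."""
    return set(GA_PATTERN.findall(html))

page = "<script>ga('create', 'UA-1234567-2', 'auto');</script>"
print(analytics_ids(page))  # {'UA-1234567-2'}
```

Once you have the IDs for every page you crawl, grouping pages by ID is exactly the reverse lookup described above: every asset sharing an ID almost certainly belongs to the same administrator.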
I'm using a tool called BuiltWith for that, in which, if you open any website, it tells you all the tags being used, and then when you click on a tag, it allows you to search for all the other assets in the world, or at least whatever they have crawled, that are using the same tags. These are a few of the things in terms of discovering your data. Now once you have discovered all these kinds of assets, on a regular basis of course, you should do periodic attack simulation. What I mean by that is you should classify the assets. You have identified that you have a bunch of IP addresses, domains, subdomains, cloud buckets, and of course all of these categories have different kinds of attack surface. So you should run custom scans. For example, if you have a server which is running Nginx, you don't need to run a brute force or dictionary attack trying every possible URL; you just need to check the sensitive URLs for Nginx. Similarly, if someone is using SharePoint, you would like to check for services.asmx files instead of checking for config.php. This way you can target your dictionary attacks when you are talking about an asset. Also, you can pass these assets to vulnerability scanners like Nessus, Nexpose, Burp Suite, or anything for that matter. And then you can review the reports. Again, reviewing reports is a troublesome task, so you might like to do some kind of automation there; you can set up centralized dashboards for all these reports. Also, whenever you have a new release, a new acquisition, a new merger, or any event which means there could be more assets coming in or going out, you should check for these new assets, check all the assets you have, check for vulnerabilities resurfacing, and run a complete cycle of the things we just talked about. So apart from that, what are the countermeasures which you should take?
Okay, so the first thing is you should do everything on yourself before someone else uses it against you. You should have OSINT awareness campaigns, especially for your employees, where you talk about not reusing the same password across multiple accounts, and making sure that when they upload pictures from the office space, they are not revealing something like a password sitting on sticky notes on their desk, things like that. Also, you should consider implementing metadata stripping, especially when someone is sending a file over email or uploading it somewhere on your portals. There should be metadata strippers, so you strip all the metadata, which includes local paths, author names, and the software used to create those files. Those should be removed. You can also use the Collective Intelligence Framework, which basically collects information from threat-intel feeds and checks whether an IP has a bad, malicious reputation, and then use it along with your SIEM integration. Suppose you have a SIEM in which you are logging all the IP addresses; apart from logging those IP addresses, you also check their reputation. Now if there's an IP which has a bad reputation, known from malicious networks, you can just check whether this IP is sending you any hits. And there's a new thing which I recently got to know about, honey credentials, just like honeypots. You create some fake credentials and spread them across multiple platforms. Now when a person uses these credentials against you, you know that this credential does not really exist, and it's a clear sign of someone trying to attack you, so you can simply block that IP address, or that person for that matter. Now, one important thing here: you should always try to identify the root cause instead of just fixing the issue. I'll give you an example. There could be a case where your company has a lot of new IP addresses coming up every scan, and you must be wondering why these assets keep appearing.
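Going back to the honey credential idea for a moment, a minimal sketch might look like this. The username prefix and the in-memory set are placeholders for whatever credential store your login path actually consults; the point is only that any authentication attempt with a planted username is a high-confidence attack signal.

```python
import secrets

def make_honey_credential(prefix="svc-audit"):
    """Create a fake username/password pair to plant on paste sites, repos, etc."""
    user = f"{prefix}-{secrets.token_hex(3)}"
    password = secrets.token_urlsafe(12)
    return user, password

# In-memory stand-in for wherever you record planted credentials.
HONEY_USERS = set()

def plant(cred):
    """Record a planted credential so the login path can recognize it later."""
    HONEY_USERS.add(cred[0])

def is_honey_login(username):
    """Any authentication attempt with a honey username is a clear attack signal."""
    return username in HONEY_USERS

cred = make_honey_credential()
plant(cred)
print(is_honey_login(cred[0]), is_honey_login("alice"))  # True False
```

In practice the `is_honey_login` check would sit in your authentication path and trigger an alert or an IP block rather than just returning a boolean.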
So instead of just getting rid of those assets, you should try to figure out which team is doing this. What is the problem, why is this team doing it? Is the reason a lack of an induction program, or that people are not aware of this, or that people haven't attended any security trainings, things like that? So try to identify the root cause instead of just fixing the issue. I mean, you should always fix the issue, but you should also go about identifying the root cause. And whatever I have talked about so far is not the ultimate way of achieving this. There are so many other things which cannot be covered in this duration of time, but the talk is just to give you an idea of what kind of attack surface is lying out there, and that you should really focus on it. This is how the process looks if you want to understand it in a nutshell. Have a security team; if you don't have one, you should try making one. It can comprise just one person, but there should be some security department in the organization. You should keep implementing open source intelligence countermeasures, and then you should have a process of identifying assets and asset data sources, implementing an asset discovery process, and then periodic vulnerability checks and scanning on top of these assets. And once you have this data, it should be passed back to the OSINT countermeasures, because now you have more insight into what kind of issues your organization is actually facing around these things. So you should pass it back to the phase where you are implementing countermeasures. This is the kind of process you can follow; of course not the exact process, not the bible for this, but it is something you can use. So now you have done all this, what is next for you? I have listed some resources. Again, there are so many resources on the internet, but these are some of my favorite ones.
Awesome Asset Discovery: this repository basically lists all the resources you can use for identifying assets, listing assets, those kinds of things. Awesome OSINT: this is a more generic OSINT repository where you will find resources not just for asset discovery but also for investigations, cyber investigations, marketing, and how you can use OSINT for those matters. Then there is DataSploit, a tool which I wrote long back. It is a kind of open source intelligence framework which automates the process: if you pass it a domain name, it will do a bunch of things for you in a customized manner. Handpicked weekly OSINT news: there is one chap who keeps writing about open source intelligence, so if you are interested in learning or picking up OSINT, this is one source you can follow; this person publishes weekly news on whatever happens in the OSINT domain. And then Open Data, which I talked about earlier in the presentation; this is the link where you can actually go and read more about the project, download the databases, et cetera. And that's it, so I'm open for questions and answers. Please ask any questions if you have them, thank you. Shiva here, just curious about your intelligence work on the dark web. How do you do it, or do you have any ideas or thoughts? So identifying information in that area is really difficult, but of course it is possible. There are a bunch of search engines for onion sites, like Ahmia and onion.cab, which allow you to search in that space. But it is always tricky, because of the kind of activities which happen there; they try to keep things very closed, and they don't allow you to find more information about them. So it is difficult. I haven't seen a lot of automation around it, but manually it can be done.
I have a different question: there are many open source projects which try to maintain their own infrastructure, and then something happens there around security and such things. So what kind of suggestions do you have for those smaller projects with fewer people? Can they use a couple of these ideas to make sure their infrastructure is secure? What kind of small projects are you talking about? I don't understand that. Let's say a project where we have only four or five people, and they can just spend money on one instance or a couple of servers; someone is running their own lab. Can they use similar ideas to secure the open source project? But I don't understand the objective here, like. Like, I understand your talk, that's a very nice way of explaining things. I'm asking whether a smaller group of people can use similar ideas to identify issues in open source projects. Yeah, of course. I mean, this is just one perspective of using open source intelligence, for organizations, but it can be used in multiple ways; people use it for investigations as well. So the answer is yes, they can. So you talked about assets, and you've listed out some entries which are generally not part of the asset list. Yeah. Is that all, or is there any other thing that you think we should also be looking at as an asset? As I said, it is not limited to those categories, and it will always be an evolving thing, you know; as the industry moves forward, the kinds of assets, the kinds of things which fall into that category will keep evolving. So this list will always have to be updated.