But we're going to get started right now, and it is my pleasure to introduce you to Ankur Tyagi. We're going to talk about visual network and file forensics. So without further ado, Ankur. Thanks a lot, everyone. Thanks for joining me. I will be talking about visual network and file forensics. Can you hear me? Can you hear me now? All right. The prime objective of this talk is to introduce you to visual forensics and show how visualization can be very useful when you combine it with static and dynamic analysis of network content, which at first is just a binary blob to you. You don't really know what exactly you are seeing on the network, and you would like to understand what kind of content is being transferred between the client and the server. Or you are trying to analyze a malware sample and would like to understand what it can do: what kind of packer it has used, whether it is packed at all, whether you are looking at encrypted content or compressed content. For all of that, you can use visualization in combination with static and dynamic analysis. So that's the primary goal of the talk: to see how visual analysis can complement static and dynamic analysis. During the presentation I will also demo a framework, a tool that I have created. It can analyze binary blobs, and right now it specifically targets two types of file formats: pcap and PE files. Pcap files let the tool extract network content and then visualize it. PE files are the prime containers through which malware is delivered on most Windows systems. You will see that once you statically analyze the Windows file format, you can combine that with a dynamic behavior report from a sandbox like Cuckoo, or Noriben, a minimal sandbox that you can use.
Reports from those sandboxes, combined with the static analysis, can be complemented by the visual analysis output that we will get. Okay. So primarily we will be looking at visual analysis in combination with structural properties, and these structural properties are common to any type of content. Say you come across an incident where you have a pcap file. You don't really know what actually happened when the compromise occurred on the system. All you have are some artifacts from the system and a pcap of the actual event from one of the SIEM or IDS devices. Now you would like to analyze these artifacts, and to do that we will look at the visualization techniques and use the demo; during the demo we will see how exactly the tool can be useful here. We will also look at the use cases. Primarily, visualization in combination with static and dynamic analysis can be a useful technique to classify or cluster content. It has some gaps, but it can definitely serve as a first-level filter, and we will look at that. So first of all, let's talk about the binary blob. What we are looking at is a binary blob: we don't really know what the file contains. We will look at it from a black-box standpoint, in the sense that we have no understanding of what kind of file we are looking at. Okay. So if we have a file about which we know nothing, we would first like to visualize it and see what it might contain. We don't want to submit it to our sandbox, because submitting to a sandbox and then waiting for the output is feasible for only a few files.
But you cannot do that when you get hundreds and thousands of files in a pipeline on a daily basis. You cannot just keep submitting them to the sandbox, have some analysts look at the sandbox reports, and then decide whether these files are bad or good, or whether they have to be submitted to other sandboxes or other analysis probes. Visualization can be one technique for quickly gathering some insight on a file and then deciding whether it actually has to be submitted to a sandbox or to some other CPU-intensive stage in the pipeline. The structure of a blob can be visualized, and this structure can then be used to classify content based on the file type. You can look at the visualization and see whether a file has certain characteristics, certain standard blobs. If these blobs are visually apparent in multiple files, it probably means those files have something in common: a common header structure or a common derivative; they are all following a certain standard. When this is used in combination with static and dynamic analysis, it can be genuinely useful for classifying content quickly. You would not need to submit it to a sandbox or ask your analysts to spend time analyzing the file and making a decision. That time can instead be used for something else: actually analyzing and reverse engineering a new malware sample you know nothing about. To quickly classify files, you don't need malware analysts doing it manually; you can use visual analysis in combination with static and dynamic analysis. Okay? To do all of this, basically, we will be seeing a framework that I've created called Rudra, and there is another tool that I've created.
It's called Flow InSpec. These tools can be useful when you are looking at certain file types. For example, if you have a pcap file as input, Flow InSpec can be useful. You can have your regular expressions or fuzzy strings matched against the pcap file. You can ask the tool to extract the network streams, reassemble them first, and match the regular expressions against them; if you would then like to save them to a new file, you can do that. It's basically a minimal IDS that you can use without deploying the really heavyweight tools. It is written in Python, so it exposes an API, and it is also a command-line tool, so you can use it in your own tool chain. Rudra, similar to Flow InSpec, also has the capability to analyze malware content. By malware, here, I mean PE files, the Windows PE file format. If you have a PE file as input and you would like to understand what exactly the file contains, there are hundreds of parsers available, right? But with Rudra you don't need to use a third-party parser and then add visualization heuristics on top of it. You can use Rudra's parsing capabilities, and in combination with the visual analysis you get a combined report, so you have one tool which can do both. The output you get can then be used as input to a heuristics engine. The heuristics engine can take the output, come up with a scoring model, and then say whether the output actually signifies that the file is bad or good. Okay. The tool has a plugin-based architecture and it provides JSON reports. With the plugins, based on whatever analysis output you get, you can write some rules, and the heuristics engine that you write on your own will be able to make a decision about whether the file is good or bad.
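As a sketch of what one of those plugins might look like, the following is a minimal, hypothetical scoring rule written against a Rudra-style JSON report. The field names and thresholds here are my own illustrative assumptions, not the tool's actual schema:

```python
def score_report(report: dict) -> int:
    """Toy heuristics plugin. Field names and thresholds are
    illustrative assumptions, not Rudra's real report schema."""
    score = 0
    if report.get("entropy", 0.0) > 7.5:
        score += 40  # near-random bytes: likely packed or encrypted
    if report.get("compression_ratio_pct", 100.0) < 10.0:
        score += 20  # barely compressible, corroborating the entropy signal
    if report.get("shellcode_found", False):
        score += 40  # the shellcode analysis engine reported a match
    return score

# A score near 100 would mark the sample for deeper (sandbox) analysis.
```

Whether a high score means "block", "sandbox", or "ignore" is exactly the kind of per-organization policy decision the tool deliberately leaves to you.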
So the tool will not make that decision for you. The tool will extract the properties: statistical properties, static properties based on the file type, and, if it's a PE file, the dynamic behavior. It will combine everything and give you a generic report. Based on that report, you can write your own plugins to actually classify the file as good or bad. In certain cases it might happen that some combination of these static fields and the dynamic behavior, together with the visualization, is an indicator of bad activity or a strong indication of a suspicious file. But this might not be the case for every file, and it might not be the case for everyone, right? Certain organizations may have a policy of filtering everything with really high entropy, and for someone else that might not apply. So the tool will give you the output, and this output can be used to write your own plugins to decide whether something is good or bad. Okay? Now, the structural properties, which are common for any type of input the tool gets, are things like hashes: you calculate the unique and the common hashes, and by common I mean similarity hashes like ssdeep. The tool will also calculate the entropy, the compression ratio, and the minimum file size. These are structural properties which can be calculated for any type of content provided as input to the tool, so they will be common for pcap files and PE files. When you visualize a PE file, you see some interesting patterns, and when you visualize a pcap file, you see some interesting structures; we will have a look at them. It is very important to remember that entropy and compression ratio are structural properties, and entropy is always inversely proportional to compression ratio.
So a file with very high entropy will always have a really low compression ratio. We will have a look at these properties when we see the report. Here is a snapshot of what the file metadata looks like. We have a pcap: a Windows executable, packed with the UPX packer, being transferred over the network, and the transfer has been captured as part of this pcap. The pcap has a size of 160 KB. When I see this pcap, I don't really have any context about what kind of content is being transferred or what kind of packets are in it. I could pass it through Wireshark; Wireshark will analyze the streams and, if the protocols are supported (and almost everything is supported by it), it will tell me what fields are there and give me the details about the metadata, the headers of the packets, right? But it will not be able to tell me what kind of content is there. It will not be able to tell me whether the content is packed or not. In this case, what you're seeing here is a pcap with really high entropy: the entropy is 7.6, and the compression ratio, like I said, will be very low because the entropy is high; it's 5.04 percent. So even with the best possible compression available, you will only be able to reduce the size of this file by about 5 percent. This is because the file has really high-entropy content: most of the content is random, so you cannot find patterns and minimize them, and that is why the minimum file size is only 152 KB. These facts alone are not really useful, but when combined with dynamic analysis or visualization, they can be used as part of a heuristics engine to classify certain content as good or bad.
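Both of these structural properties are cheap to compute. Here is a minimal sketch using only the Python standard library, with the compression ratio expressed as the achievable percentage size reduction, matching the convention used above (this is an illustration, not the tool's actual code):

```python
import math
import zlib
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 (constant) up to 8.0 (random)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

def compression_ratio(data: bytes) -> float:
    """Percent size reduction achieved by zlib at maximum effort.
    High-entropy (packed/encrypted) data yields a low percentage."""
    if not data:
        return 0.0
    return max(0.0, 100.0 * (1 - len(zlib.compress(data, 9)) / len(data)))
```

For random bytes the entropy approaches 8.0 and the achievable reduction approaches 0 percent, which is exactly the profile the UPX-packed sample above shows.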
The most important thing here is to understand that static and statistical properties are useful when they are visualized. There are two types of visualizations. The first one is basically just a histogram: we take the bytes and their frequencies and graph them out. This is just a fancy way of snapshotting a file without actually honoring the sequence of bytes. It tells you which bytes are present and what the frequency of each byte is, and it gives you an insight into whether the file has more non-printable binary characters, high-ASCII characters, low-ASCII printable characters, or control characters. You can also visualize the file by creating a bitmap. This is a really interesting way of visualizing a file, and it can be done for any type of file, any network content, any binary blob. You can do this for executable formats and for data files as well, and you will see really interesting patterns emerge. For a file which has a fixed file structure, you will always see patterns. If you take, say, the MP3 file format and visualize a set of MP3 files, you will always see some part of the file in common. In the visualization output you will always see similarity, because all the files are following the same standard, so the headers for all those files will more or less have a typical structure, and that can be visualized. In this case, what you're seeing is probably some binary content captured over the network, with all the bytes mixed together. This particular representation is taken from a blog by Curtis Mattoon; shout out to him. He has written a Python script that can take in any type of file and create such graphs for you.
The one on the left is in grayscale; the one on the right takes the byte values and uses them for an HSV calculation. This is a bitmap that you can create for any file format. There is also a tool called binvis. How many of you have used binvis, by Aldo Cortesi? You should probably look at it; it's a really awesome tool for visualizing your content. You can use it to visualize any type of file, and it gives you an interactive view. It's a website, basically: you go there, you upload your file, and it gives you an interactive webpage where you can select a certain section of your file and see whether that section has high-entropy content or looks similar to a certain file format. Shout out again to B.S. Manjunath. He is from UCSB; he and his team have created a tool called Sarvam. The idea with that tool is to use these graphs, these images, to create a model, and whenever you have something new, match it against the models using image similarity algorithms. If the images match, the malware samples or the binary content are more or less the same. That is the idea behind Sarvam. So what I would like to highlight here is that you can use visualization in combination with static and dynamic analysis. Visualization is the most important part: it helps you quickly look at content and classify it. You have your own insight; you have looked at a lot of different file types, and looking at a file visually will tell you whether it is good or not, because you will see certain patterns. You will expect certain structures to be visible for a known file format, and if they are not, you definitely know that something is wrong: the file has been modified or changed in some way, and this is not expected.
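Both visualizations are easy to reproduce with the standard library alone. The sketch below builds the byte-frequency table and writes the grayscale bitmap as a binary PGM image, one pixel per byte; it is a minimal illustration in the spirit of the scripts mentioned above, not their actual code:

```python
import math
from collections import Counter

def byte_histogram(data: bytes) -> list:
    """Frequency of each byte value 0..255, ignoring byte order."""
    counts = Counter(data)
    return [counts.get(b, 0) for b in range(256)]

def to_pgm(data: bytes, width: int = 256) -> bytes:
    """Render bytes as a grayscale image (binary PGM, one pixel per
    byte); structured regions show up as visible bands and patterns."""
    height = max(1, math.ceil(len(data) / width))
    padded = data.ljust(width * height, b"\x00")
    return f"P5\n{width} {height}\n255\n".encode() + padded
```

Opening the resulting `.pgm` in any image viewer gives the kind of grayscale plot shown on the left; mapping byte values through an HSV palette instead gives the colored variant on the right.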
So visualization gives you an insight which is not typically available with static or dynamic analysis alone. These are some of the features that the tool has. It does not only do visualization and statistical analysis; it also understands the file format. Like I said, right now the tool understands pcap and Windows executable PE file formats. First it will try to identify the type of the file and then invoke the type-specific parser: if it's a PE file it will use a PE-specific parser and extract certain fields; if it's a pcap file it will use a pcap-specific parser and extract those fields. For all file types, certain generic attributes or static properties are extracted, and these are mentioned here. If you would like to match certain regular expressions against a file, you can do that. If you would like to run your own rules against the binary content, you can do that. If you would like to test whether the binary content is actually shellcode, you can do that; there are libraries for this, and you can use them to understand whether the content you have is shellcode or just some random binary blob. Like I said, for the pcap file format the tool will do TCP reassembly and IP defragmentation. It will extract the streams, identify protocols, decode protocol fields, and then match your rules against them. This is very similar to what a typical IDS will do for you, but you can do it with Python alone, as a standalone tool. You don't need to deploy CPU-intensive tools that take time to process large numbers of packets before creating output for you. If you would like to quickly prototype your own tools for your specific use case, you can use the Python API rather than recompiling a whole tool set. For PE file formats, PE-specific attributes will be calculated. These attributes are extracted by a lot of tools.
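The idea just mentioned of matching rules over reassembled streams, rather than over individual packets, can be caricatured in a few lines. This toy sketch assumes the packets have already been parsed and arrive in order; real TCP reassembly must also handle sequence numbers, retransmissions, and IP fragments, which the tool takes care of:

```python
import re
from collections import defaultdict

def match_streams(packets, pattern: bytes) -> dict:
    """Group payloads by flow 4-tuple, concatenate them, and run the
    regex over the reassembled stream. Matching per packet would miss
    any pattern that straddles a packet boundary."""
    streams = defaultdict(bytes)
    for src, sport, dst, dport, payload in packets:
        streams[(src, sport, dst, dport)] += payload
    rx = re.compile(pattern)
    return {flow: data for flow, data in streams.items() if rx.search(data)}
```

Note that if the HTTP request line is split across two packets, only the reassembled view matches; this is precisely why the reassembly step matters.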
There is nothing special about the attributes on their own; what is special is the way we use them, in combination with the dynamic output as well as the statistical and visualization output that the tool generates. Used in combination, these properties can serve as heuristics, and you can write your own rules. A file may have sections with certain names, but section names can be changed easily; when names are used in combination with certain visual characteristics, though, they become a strong indicator. Someone can create a modified UPX-packed PE file which does not have the typical UPX section names. Whenever you pack a file with UPX, you expect sections named UPX0 and UPX1, but someone can change those names easily. The statistical properties of a UPX-packed file, however, will always remain the same: because it is packed, the file will have really high entropy, and the sections will not be very visible. When you visualize it, you will not see structures, you will not see boundaries and patterns, you will not see the headers. This is something the authors, the people who created and packed the malware, will not be able to change as easily. This, in combination with the static analysis output, can be a really good heuristic for classifying files. Let's quickly move to the demo. First of all, let me show you a quick demo of how to understand whether a particular file has shellcode in it. What we are doing here is using one of the tools, Flow InSpec. Can you guys see this? This tool, Flow InSpec, is more like ngrep; you can think of it as a Python-based IDS. You can pass it a file, a pcap file specifically, and ask it to look for certain regular expressions within the pcap file.
You can ask it to look for a fuzzy string. You can ask it to reassemble the streams and look for shellcode within those streams. That is what the tool does. So what I am doing here is running Flow InSpec on this file. This file actually has shellcode in it; we will see what kind of shellcode this is, and from the name you can guess it. Whenever there is a match, the tool will show you the streams that matched. You can see from these parameters that -m asks the tool to run the shellcode analysis engine. Once you do that, if it finds a match, it shows you the stream that matched. Here we found shellcode in the server-to-client traffic. On this particular TCP stream, 105.73 is the client and it is talking to 78.87 on port 80; this is HTTP traffic, and on this stream the server has sent some shellcode to the client. That is what the tool has understood. We could dig further into the pcap, extract the particular stream, and pass it on to other tools if we liked, and that would give more insight. But if you would just like to quickly see what exactly the shellcode is, the tool can create a profile for you as well. You can have a look at the command-line options; there are a lot of them, and you can use them as per your requirements. You can do shellcode matching, regular expression matching, or fuzzy string matching. For shellcode matches, if you would also like to disassemble the shellcode, or create a profile of what exactly the shellcode would do once it runs on the client, you can do that. And this is the output of the profile.
For the pcap that we saw, the tool first identified that there is shellcode within it, and it also identified the type of the shellcode. This is shellcode which will invoke the following calls: the LoadLibraryA API call is invoked with the ws2_32 parameter, which is the typical API call to load the Windows socket library. After that you see some socket API calls. The host attribute is 53.20 and the port is 4444. So, quickly: this particular shellcode will spawn a reverse TCP shell if it is executed on a Windows system. That is what the shellcode will try to do, and this was identified from a pcap. This is something you can do statically; the tool has not run the pcap through a sandbox, and it has not extracted anything special. What it does is reassemble the streams, identify the protocols, and then run them through the analysis engine. In this case it found shellcode, generated a profile for it, and from the profile we try to see whether it signifies something bad or good. Now let's move on to the second demo. Here we will run the tool on a Windows executable, a Windows malware sample, and see what report it generates. I have a few files that are actually suspicious; I got them from an intel source. Here they are; let's run this. All right. The tool will quickly parse a file and generate some heuristics for you. Here is the parsed output. It will run the files against the Adobe Malware Classifier. It will also check whether a particular file is part of the whitelist bloom filter. What we have done here is create a whitelist, and the whitelist is not just a static list of all the good hashes: we have taken hashes from Mandiant and from NSRL.
These are sources that give you lists of known-good hashes: file hashes for files commonly seen on uncompromised Windows systems, from Microsoft, Adobe, and other popular vendors. We have taken hashes from them and created a bloom filter out of it, and if a file matches the bloom filter, we say it is whitelisted: it is part of the hash set coming from the whitelist of files, and as such the file is most probably good. There is always a possibility of certain false positives when you are using a bloom filter, because it is a minimized set rather than the full, flat list of hashes. But in this case, what we do is quickly classify files as white or black, and this can be used as part of the scoring as well. Let's have a look at the web reports. The tool generates reports for you. You can use it as a command-line tool, or use the API and import it as a Python module; or, if you would just like to quickly look at the results, you can open the HTML reports that it creates. Here you can see, for the shellcode reverse-TCP file that we had, the byte-frequency graph and the bitmap visualization. Some of the fields are extracted and parsed out, and here we can see that the server-to-client traffic, the HTTP response sent by the server to the client, actually has Windows shellcode within it. A similar thing can be done for another type of file: this file also has shellcode, but this shellcode will spawn the calculator, the calc.exe program, on the client. Okay, and this is the report for a Windows executable: this is cmd.exe, a clean, unpacked file, and this is the last report, so we will quickly wrap up after it. This is how the byte-frequency histogram and the bitmap visualization look for cmd.exe.
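Going back to the whitelist check from the demo for a moment: the bloom-filter idea behind it can be sketched in a few lines of Python. This is a minimal illustration of the data structure, not the tool's actual implementation, and the filter size and hash count are arbitrary:

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: membership tests may yield false positives
    but never false negatives, which is why a hit means "most probably
    whitelisted" rather than "certainly clean"."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Loading a large hash set such as NSRL into a filter like this keeps the whitelist lookup in constant memory, at the cost of the occasional false positive mentioned above.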
Now, what is happening here is that in the reports, the colors actually indicate the structures and the boundaries between them: the color blocks indicate that the file format has a certain structure, a particular format that it is following. You can use the static attributes in combination with visualization to classify the file. All right. So yeah, that's it. Basically, the aim of the talk is to highlight the fact that visualization can be a good heuristic for you. You can use it in combination with static and dynamic analysis. It is not a replacement for static or dynamic analysis, but it can definitely be a complement: with static or dynamic analysis alone you will not get as much insight as you can with the addition of visual analysis. The tool is already available on GitHub. You can use it, you can extend it, and you can submit your patches if you would like; both tools are open source. If you have any questions or any feedback, feel free to get back to me. My handle is 7h3rAM; I'm on Twitter and GitHub as well. And that's it. Thanks.