We're going to get started right away. It is my pleasure to introduce an old friend of ours, Gita Ziabari.

Thanks, everyone. Thanks for your patience, first of all, and thanks for coming to my presentation. I'm going to talk about YALDA. It's an automated tool that I wrote recently, one of my recent projects, and I thought I would share it with you. I work for Fidelis Cybersecurity.

So we are going to have an introduction to YALDA: how it was created and what the objective was. Then how you could use YALDA and what it could be used for. Then we are going to have an overview of the architecture of YALDA, a quick demonstration, and the GitHub repository, so you can go ahead and download it and start using it on your own files. Using it is actually pretty simple. You just open a config file and add some directories to it.

So, introducing YALDA. The motivation was basically creating a tool that analyzes files and gives you indicators, which saves the time of repeating the same job for every single file. It was also about fully understanding what a file does, creating automated bulk intelligence collection, and being independent of other sources, of third parties. So that was the motivation.

That guy in the back, John Bambenek, is actually my boss, and he passed me a project to start with. He started giving me files, massive amounts of files. I mean, millions of files in directories. They were all compressed, and they were in JSON format. Each one of them had a dictionary inside it, and the dictionary had information about an email. Each dictionary corresponded to one malicious email. It had all the keys and values for that email, of course, and it had some keys whose values were MIME encoded. If you decoded them, you would be able to download the files. So that is how YALDA was born.
I started analyzing them manually, and then figured out that for emails from the same campaign, although the format and everything is a little bit different, the fingerprint and the way they introduce this stuff is almost the same. And I figured out that it could be automated.

This is just an example. Just looking at it, I found that it doesn't make sense. This is malicious. I mean, I would be able to say at a glance that this is malicious. Look at the from address. It's coming from Russia, and it's saying USPS Ground. I mean, come on.

So, analyzing the JSON file, each dictionary: this is the process I automated. I would get the base64-encoded MIME, find the key, and decode the value. That gave me a couple of files, and the most important one was the file that had the attachments. Then, looking at the attachments, for a particular campaign they were again almost the same. The domains were different, but the structure was the same.

So I started writing an automated tool for extracting the information. The steps pretty much followed whatever I had done manually. And, yeah, it works at least. So I was getting the JSON, analyzing the keys, analyzing the from address, analyzing the body, getting the base64-encoded data, decoding it, downloading the attachments, analyzing the files, and extracting the domains. That's how YALDA started working. So: manual analysis, a known process to automate, automating it, YALDA. And then adding more and more to it. YALDA was born.

As I was extracting information, I started talking to analysts and asking, what do you guys do to get more information? And then I built YALDA so that it does all of this: automated bulk intelligence collection. It's a file scanner and analyzer, it extracts data, it clusters files, and it categorizes files.
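As a rough sketch of the steps just listed (not YALDA's actual code), the snippet below decodes one JSON record, parses the MIME message, and pulls out the from address, the attachments, and any domains in the body. The key name `mime`, the function name `analyze_record`, and the domain regex are assumptions made for illustration.

```python
import base64
import json
import re
from email import message_from_bytes, policy

# Simplified pattern for domains inside http(s) links (an assumption for this sketch).
DOMAIN_RE = re.compile(r"https?://([A-Za-z0-9.-]+\.[A-Za-z]{2,})")


def analyze_record(record, mime_key="mime"):
    """Decode one email record and collect basic indicators.

    `mime_key` is a hypothetical name for the key holding the
    base64-encoded MIME message.
    """
    raw = base64.b64decode(record[mime_key])
    msg = message_from_bytes(raw, policy=policy.default)

    # Prefer the plain-text body, fall back to HTML.
    body = msg.get_body(preferencelist=("plain", "html"))
    body_text = body.get_content() if body is not None else ""

    # Each attachment becomes a (filename, bytes) pair.
    attachments = [
        (part.get_filename(), part.get_payload(decode=True))
        for part in msg.iter_attachments()
    ]

    return {
        "from": str(msg["From"]),
        "subject": str(msg["Subject"]),
        "attachments": attachments,
        "domains": sorted(set(DOMAIN_RE.findall(body_text))),
    }
```

In practice each record would come from `json.load` on one of the decompressed files, and the attachment bytes would be written to disk and handed to the file analysis stage.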
So, automated bulk intelligence collection tool: it has an automated process for analyzing files. It applies intelligence to the collected data and clusters files based on their similarities.

It's a file scanner and analyzer. YALDA scans and analyzes files and collects detailed information about each file. About 20 indicators are extracted from every single file. How many of you guys are analysts here? OK, so instead of hours of doing that sort of analysis, you just run YALDA on the set of files. Let's say you have even a million samples. It's going to analyze every single one of them and extract about 20 indicators for each. Then, based on whatever you are looking for and whatever interests you, you can apply a filter and pull out the information that is valuable for you.

It also extracts data: it collects embedded objects, and it collects malicious URLs and domains.

And it categorizes files. There is a flag and a severity that are used in combination. Depending on the algorithm that is applied to the file and the findings for the file, a file could be categorized as clear with a low severity, like 1 or 2, as suspicious with a higher severity, or as malicious with, of course, the highest severity.

It also clusters data. If it detects a malicious hash, it's going to cluster the strings from the file and insert them in a different collection. The same applies if a malicious file is an executable. It is going to get the names of the PE sections, convert them to SHA-1, and use that in combination with the hash and the Shannon entropy.

Now, what to do with YALDA and when to use it?
If you want to have a feed, or a collection of feeds, you can go ahead and submit files to YALDA and use it for extracting an MD5 feed, a domain feed, and a URL feed, all of them malicious.

Also, as I said, there are 20 indicators that are extracted from every single file. Depending on the file type, some of them might be empty. For example, if there are no PE sections in the file, that field is going to be empty. However, you are able to get all of these: MD5, SHA-1, SHA-256. Similar MD5: if there are some malicious MD5s that are similar to the particular file you are analyzing, it is going to give you a list of them. File type. If any malicious URL has been extracted from the file, it is going to be given to you. Magic literal. VirusTotal information: this basically gives you a permalink to the VirusTotal page, if the file exists in VirusTotal, with the number of AV engines that detect it. You don't have to have a key or use VirusTotal. You can just disable it, it's not mandatory.

Severity and flag are used together. The flag is clear, suspicious, or malicious, and the severity goes from one to five. One stands for clear, and five stands for very malicious. So it gives you a sense of where the file stands.

File name. Ingest time, the time that the file got analyzed and inserted in the database. Domain list, if it has any malicious domains. YARA list: I'm going to go through it in detail, but the list of rules in the YARA rules that match the particular file is also going to be extracted. The file path. The source is going to be YALDA. It is just an indicator to be able to collect the information from the database. Embedded files: the list of objects that are embedded in the file is also going to be displayed to you. PE sections, in detail, are going to be there.
Parent MD5 and parent file path: if the file is embedded in another file and is being analyzed, you are going to have a link to the parent.

So let's say you're analyzing the results and you're interested in extracting PDF files that have embedded objects in them, that also have malicious domains, and that are malicious with a severity of five. You can go ahead and apply a filter, and it's going to extract the data for you. Based on the indicators, it is possible to select exactly what you're looking for.

It's a good source for generating YARA rules, because it has detailed information about the file, and if the file is malicious, it also has the strings from it and the PE sections. So you can use it as a good source for writing your YARA rules.

It's a smart feed to Cuckoo Sandbox. Again, you don't want to submit everything to Cuckoo Sandbox, but if there is a specific malicious characteristic that you're interested in, you can go ahead and start feeding that category to the sandbox to see what happens next.

YALDA architecture. Basically, the architecture can be divided into four sections: you submit files, then it starts extracting files, then it analyzes and scans every single file that has been extracted, and then it inserts the data in the database.

Extracting files. Mainly what you need is a directory on your Linux box where you dump all of the subfolders and files that you want, of any type. It could be compressed, it could be JSON, it could be encoded MIME. It goes through every single one of them and starts extracting the files. If it is a compressed file, it's going to uncompress it. If there are many different subfolders, like what my boss initially passed me, it's going to walk through every single folder and extract every single file. And again, if it is in a mail format, it's going to download the mail attachments.
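The walk-and-uncompress step described above could look roughly like the following. This is a minimal sketch rather than YALDA's implementation: the function name `collect_files`, the choice to handle only zip and gzip, and the output layout are all assumptions.

```python
import gzip
import os
import shutil
import zipfile


def collect_files(root, out_dir):
    """Walk `root`, uncompress zip and gzip files into `out_dir`,
    and return the paths of all files ready for analysis.

    `out_dir` should live outside `root` so the walk does not
    revisit extracted files. Name collisions are not handled here.
    """
    os.makedirs(out_dir, exist_ok=True)
    collected = []
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if zipfile.is_zipfile(path):
                # Unpack the archive into its own folder and list its members.
                target = os.path.join(out_dir, name + "_unzipped")
                with zipfile.ZipFile(path) as zf:
                    zf.extractall(target)
                for sub, _dirs, files in os.walk(target):
                    collected.extend(os.path.join(sub, f) for f in files)
            elif name.endswith(".gz"):
                # Stream the gzip content to a file without the .gz suffix.
                target = os.path.join(out_dir, name[:-3])
                with gzip.open(path, "rb") as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                collected.append(target)
            else:
                collected.append(path)
    return collected
```

Mail-format files found this way would then go through the attachment extraction step, and every resulting file would be passed on to carving and analysis.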
So you're going to get a list of smaller files, and then for each file it's going to run foremost on it. When foremost is applied, the file may turn out to have children, embedded objects, in it. Not only is it going to analyze the parent file, it's going to analyze all the children of the file. Each one of them is going to be sent to the decoders and analyzed through YALDA.

Decoding files. Let me put a pause here. I just want to ask, how many of you use an automated process for extracting information from a file? The first thing you would do is manual analysis. You have to know what you are looking for, you have to know what this malicious file is doing, and then you need a couple of samples to see what they have in common. The decoder part is written based on exactly that.

So YALDA is designed around the file type. Based on the file type, it's going to send the file to a set of decoders, and if there is a match with one or more of the decoders, it's going to flag the file and go further to see what can be extracted from it. If there is any domain or URL that can be matched, it's going to extract it and flag the file.

Now, the way the decoders are written is, as I said, through manual analysis: detect the malware, detect the campaign, start analyzing the samples one by one, and see what the common part is that keeps being repeated. When you know that part, you can start the automation and write the decoder for it. That's actually how it's done in YALDA.

Analyze known malicious samples for fingerprints. CVE-2017-0199, who has heard of it? Yeah, cool. So we know that it's an RTF file with an embedded OLE object, and it has a link, something like this.
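The idea of sending a file to a set of decoders based on its type can be sketched like this. Everything here is illustrative: the registry, the decoder name `rtf_embedded_link`, and the pattern are assumptions, and the pattern is a placeholder that only finds plain-text links, while real samples may encode the link inside the object data.

```python
import re

# Registry mapping a file type to its (name, function) decoders.
DECODERS = {}


def decoder(file_type, name):
    """Register a decoder function for one file type."""
    def register(func):
        DECODERS.setdefault(file_type, []).append((name, func))
        return func
    return register


# Placeholder pattern: any plain http(s) link in the raw bytes.
URL_RE = re.compile(rb'https?://[^\s"<>\\{}]+')


@decoder("rtf", "rtf_embedded_link")
def rtf_embedded_link(data):
    """Hypothetical decoder: an RTF file with an embedded object and a link."""
    if not data.startswith(b"{\\rtf") or b"\\object" not in data:
        return []
    return [u.decode("ascii", "replace") for u in URL_RE.findall(data)]


def run_decoders(file_type, data):
    """Run every decoder registered for `file_type` and collect the results."""
    matched, urls = [], []
    for name, func in DECODERS.get(file_type, []):
        found = func(data)
        if found:
            matched.append(name)
            urls.extend(found)
    return {"matched": matched, "urls": urls, "flagged": bool(matched)}
```

Adding a decoder for a new campaign then means writing one more function and registering it for the relevant file type.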
And then, analyzing all of them, we can extract the common section out of them, because you see that each one has a link to download the embedded doc, and as you see, this is the common part. So if you parse the file and write a regex for it, checking whether it is an RTF file and whether there is a pattern match with that regex, then you can extract the URL that sits in the parentheses.

This is an analysis that was done on one of the RTF files. So this is the information that YALDA extracted: the MD5, SHA-1, SHA-256, size of the file, magic literal, file type, file name. And then the severity is five, the highest severity, and it's flagged as malicious. This is the domain that has been extracted from this particular file. There are no PE sections, of course. The source is YALDA mining data. This is just an indicator that I used. And the VT info, if you want to have that information: these are the AV engines that are able to detect it, and a permalink to get detailed information from VirusTotal. The YARA list is empty because I didn't apply any YARA rules to it.

This is actually the dictionary format of the YALDA output. Everything that I just told you is going to be in dictionary format and inserted in the database. If you're interested in having it in JSON format, that can be done easily, and you can have the information in Splunk.

When I talk to analysts, they usually say that malware delivery is a chain, and detection follows that chain. Usually it's not just one file, it's a combination of steps. So it's really important to detect the first link of the chain and then start analyzing from there.

There was actually an email campaign going on that was sending PDF attachments, and none of the AV engines were able to detect it. When it was analyzed, we found out that there is a URL embedded in the PDF file. The URL downloads JavaScript.
The JavaScript has a link to a URL, and that URL downloads an executable file. So it's a chain, right? But what if you detect the first link of the chain and understand that, okay, this is a PDF file and it matches this particular chain? Then we have the flag, right? We have the flag that this is suspicious at least, and we can even extract the first URL that it's pointing to, so that we have the information, and if we want, we can send it to Cuckoo Sandbox for detailed information.

Applying YARA rules. I am not giving you any YARA rules myself, but YALDA is able to apply YARA rules to the given files. There is an enable/disable line in the config file that you need to set, and then you need to put the YARA rules that interest you, based on your needs, in a directory. If it is enabled, YALDA goes to that directory and applies every single YARA rule to the file. If there is a match with one of the rules, it's going to record it and show you the list of rules that matched. Then, according to the match and the number of rules that matched, the file is going to get flagged as malicious, suspicious, or clear, with the appropriate severity.

YALDA scoring. This is actually the brain of YALDA, because for everything I've told you so far, you have to already know that there is a malicious file or a malicious campaign, and then you have to write the decoder. So those are known threats. It's not like something new gets detected. This method, though, is based on similarities. When YALDA finds a file and it's malicious, it starts clustering the strings. And if the file is an executable and it's detected as malicious, it gets the PE sections, the section names, and then the Shannon entropy, and it clusters that information. So it's something like this, see.
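For the executable case, the two values being clustered can be computed along these lines. This is a sketch under stated assumptions: the section names are taken as an already-parsed list (a PE parser would supply them), joining them with commas before hashing is a guess, and the entropy tolerance in `is_similar` is invented for illustration.

```python
import hashlib
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def section_fingerprint(section_names):
    """SHA-1 over the PE section names, in file order."""
    joined = ",".join(section_names)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def cluster_key(section_names, data):
    """Pair of (section-name SHA-1, entropy) stored for a known file."""
    return (section_fingerprint(section_names), shannon_entropy(data))


def is_similar(known, candidate, tolerance=0.1):
    """AND logic: both the section fingerprint and the entropy must agree."""
    same_sections = known[0] == candidate[0]
    close_entropy = abs(known[1] - candidate[1]) <= tolerance
    return same_sections and close_entropy
```

A new executable would get its own `cluster_key`, which is then compared against the keys stored for files already known to be malicious.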
The file is detected as malicious. I run strings on it and get all of the strings out of it, and also the Unicode strings, and then I make a list and cluster that information in another collection.

Now there is a file that I have no idea about. None of my decoders were able to detect it, and none of the YARA rules were able to detect it. I still run strings on it, and based on the similarities that I find between the list of strings found here and the list of malicious strings in my database, I start scoring it. How many files that were detected as malicious have strings similar to the ones in this file? If this score goes higher, I would say that, yeah, it is suspicious. It needs more analysis. So it's based on the similarities.

The same thing is applied to the PE section names and the Shannon entropy. If the file is an executable and it is malicious, I get the names of the PE sections and convert them to SHA-1. I also collect the Shannon entropy. I use AND logic for these two, and they are clustered somewhere together with the malicious hash. Then it's again a matter of scanning the file that I'm looking at right now and seeing what the similarities are. If it is an executable, what are the similarities between these two? Based on the similarities and the threshold that I set for the score, I'm going to flag it as suspicious.

Now, whitelisting. Who handles feeds? Yeah. So you guys know how life is when you have false positives, right? You start using all of the possible scenarios and all of the cases for whitelisting. You're going to have manual whitelisting. You're going to have a lot of different sources, and still you have some false positives, right? So to avoid that in YALDA, I said, okay, fine, I'm going to use the traditional way. I'm going to use a log file for whitelisting.
But at the same time, how about clustering the good guys instead of only clustering the bad guys? How about whitelisting the clear files, the ones that I know are clear and not malicious, and then extracting that information to use as a whitelist? As YALDA runs on more samples, it's going to extract more bad stuff and more good stuff at the same time, right? So the whitelisting gets stronger as the detection of malicious hashes gets stronger. As I said, I'm running strings on clear files and clustering them, and I'm also clustering the PE section names and Shannon entropy of executable files that are known to be clear. Then I'm using this information for whitelisting and for getting more accurate results.

Let's have a demonstration. Here I have three files in the folder, and it starts analyzing them. It extracts the files, and in this case it's downloading the attachments. You see that there is a list of domains that are all malicious, and it's extracting them. It also goes and analyzes every single file in detail. If there is any embedded object, it is going to extract it. In this case, the file does have embedded objects, so it lists the embedded files here. Not only does it have the detailed information about the file itself, it also analyzes the embedded file and gives you that information with a link to the parent. And everything is inserted in the database, as I mentioned.

So YALDA minimizes false positives, or better said, it has better correlation. With the way we are doing YALDA scoring, we get more accurate results, because as we analyze more samples, we get stronger results. The same goes for the automated whitelisting: we get better results as we analyze more samples. And then the clustering, where again we get better results.
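Putting the string clustering and the whitelisting together, a minimal sketch might look like the following. It is not YALDA's scoring formula: the class `StringClusters`, the four-character minimum, the fraction-based score, and the threshold are assumptions, and only ASCII strings are extracted here, not Unicode ones.

```python
import re
from collections import Counter

# Runs of at least four printable ASCII bytes, similar to the `strings` tool.
PRINTABLE_RE = re.compile(rb"[\x20-\x7e]{4,}")


def extract_strings(data):
    """Return the set of printable ASCII strings found in `data`."""
    return {m.decode("ascii") for m in PRINTABLE_RE.findall(data)}


class StringClusters:
    """Keeps strings seen in known-malicious and known-clear files."""

    def __init__(self):
        self.malicious = Counter()
        self.clear = Counter()

    def learn(self, data, is_malicious):
        """Add the strings of an already-classified file to a cluster."""
        target = self.malicious if is_malicious else self.clear
        target.update(extract_strings(data))

    def score(self, data):
        """Fraction of non-whitelisted strings also seen in malicious files."""
        candidates = {s for s in extract_strings(data) if s not in self.clear}
        if not candidates:
            return 0.0
        hits = sum(1 for s in candidates if s in self.malicious)
        return hits / len(candidates)

    def flag(self, data, threshold=0.5):
        """Flag an unknown file as suspicious when the score reaches the threshold."""
        return "suspicious" if self.score(data) >= threshold else "unknown"
```

Because both clusters grow as more samples are processed, the whitelist and the malicious string list strengthen together, which is the effect described above.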
And then the categorizing algorithm works in such a way that if a file doesn't match my decoders, I'm not going to say it's 100% malicious, but it gets a high enough severity that you can select it and say, okay, this is suspicious and there's a flag on it. So I had better start analyzing this set of files with these criteria, because you have 20 indicators that you can select from.

This is the GitHub repository, so you can go ahead and start downloading and using it. It's in the Fidelis GitHub. There are three folders: bin, source, and YARA rules. YARA rules is where you place your YARA rules if you want to. If not, you don't have to worry about it. Bin has all the modules and functions, as you see here. It's a very straightforward format, and if you want to add more decoders, you can do that. And the config file, which I'll go through in detail, is very simple. It's just a couple of lines where you need to specify the directories.

When you want to run it, run this guy, the YALDA file analyzer. You need to install the required Python modules, and these are the Python modules that you need. And this is the config file. You just go ahead and open it and add the directories that you want, and enable or disable options. If you don't want VirusTotal information, or if you don't want YARA rules to be applied, you can set it to zero to disable it. If you want a printout, you can set debug to one, or you can just make it zero. And of course, because I'm using MongoDB, you need to have MongoDB installed and specify the criteria that you're looking for. To run it, you go to source and run the YALDA file analyzer, and it starts analyzing every single file in detail and passes you the information.
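Once the results are in the database, the kind of selection mentioned earlier (malicious PDF files with embedded objects, malicious domains, and a severity of five) could be expressed roughly as below. The field names `file_type`, `flag`, `severity`, `embedded_files`, and `domain_list` are hypothetical and may not match YALDA's actual schema.

```python
# MongoDB-style filter document; it could be passed to a collection's
# find() call. All field names are assumptions for this sketch.
MONGO_FILTER = {
    "file_type": "pdf",
    "flag": "malicious",
    "severity": {"$gte": 5},
    "embedded_files": {"$exists": True, "$ne": []},
    "domain_list": {"$exists": True, "$ne": []},
}


def matches(record, file_type="pdf", flag="malicious", min_severity=5):
    """Pure-Python equivalent of MONGO_FILTER for records held in memory."""
    return (
        record.get("file_type") == file_type
        and record.get("flag") == flag
        and record.get("severity", 0) >= min_severity
        and bool(record.get("embedded_files"))
        and bool(record.get("domain_list"))
    )


def select(records, **criteria):
    """Return only the records that satisfy the criteria."""
    return [r for r in records if matches(r, **criteria)]
```

Changing the criteria, for example to suspicious executables with a severity of three or more, is then just a matter of passing different arguments.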
I'd like to say thanks to John Bambenek, my boss, Hardik Modi, Chad Robertson, and Jason Reeves. They gave me really great feedback for improving YALDA, and I really appreciate all of it. So if you need to contact me or if you have any questions, this is my email address and my Twitter. Thank you so much. If you have any questions or feedback, thank you.