Next up we have Mark Mager with "Stop and Step Away from the Data: Rapid Anomaly Detection via Ransom Note File Classification." We'd like to thank our sponsors, and a reminder: if you could please sit down in the seats, we don't want to have a fire code violation. With that, enjoy the talk.

Morning, everybody. So, just getting into things, a little bit about me: I'm not a data scientist, so take whatever I say up here on stage with a very big grain of salt, and please feel free to ridicule and embarrass me after the talk about the things that I get wrong. Anyways, a little bit about me: I'm a senior malware researcher at Endgame, where I do reverse engineering and detection development, and for the past two and a half years, pretty much since I've been at Endgame, I've been doing ransomware protection.

To get into the agenda: I'm going to provide a brief overview of ransomware and what current detection methodology looks like, then ransom notes. Then I'm going to delve into some of the exploratory research I did around detection, discuss in depth the proof-of-concept framework that I came up with, and then wrap things up with a conclusion, and hopefully have a little bit of time for questions.

So, if you don't know about ransomware: basically, it's software.
It's written to deny users access to data on their host. The most typical approach is encrypting individual files on the file system, and specific file extensions are what's going to be targeted: think of high-value documents like PDFs, text files, Word documents, Excel spreadsheets, things of that nature. So there are two typical types of output from ransomware: the encrypted files that I was just alluding to, and the actual ransom notes.

The detection methodology can be broken down pretty simply into two areas right now. You have static detections, which are either going to be signature-based, heuristics-based, or machine learning-based. The main benefit to this approach is that all data is preserved if the detection is successful, but the drawback is that you essentially have one chance to detect whether a binary is ransomware (or malware generally) or not, and if we miss that, then all data is going to be compromised on the host. For dynamic detections, basically the way those work is there's a process running in the background that monitors for any sort of anomalous behavior on the host. There can be a focus on detecting encrypted files in certain cases. Some approaches leverage canary files, which are files that are written to disk, kind of spread out in different locations, and if they're modified in particular ways, that can be a trigger for an alert. The main benefit of dynamic detections is that, hypothetically, as a process is executing there will always be an ongoing chance that it will be detected.
So it's not just one initial chance to detect and then your host is done after that; you should still be able to detect it later on. The drawback to dynamic approaches is that you're essentially sacrificing a large number of files in order to determine whether or not there's ransomware executing on the host. In certain cases it's easier to detect the anomalous behavior; in other cases it might not be possible, or it might take a very long time.

So how can we improve on the current state of the art? Probably the best approach is to combine the benefits of static and dynamic detections. In the ideal case, yes, you would detect everything with machine learning immediately, and nothing would ever execute on the host. But that's not always the case; there are definitely false negatives, so you need a robust dynamic detection to serve as a fallback. Leveraging a layered security approach is probably the most recommended way to make sure you're covered for ransomware. Optimizing your machine learning models to specifically classify ransomware, as opposed to just malware generally, can prove very beneficial for this problem. And then, to go back to dynamic detections, perhaps there might be a way to reduce the amount of time that's required for detecting this behavior.

So, getting into ransom notes, a little bit of background on this. Since I've been doing ransomware research for about two years or so, I've executed and detonated probably close to a thousand files manually in a virtualized environment and studied how the output typically looks. As sort of an aside, I was seeing ransom notes being written to disk in multiple ways, multiple directories, multiple formats. What kind of got the gears turning for the research that I'm presenting today is that
I started seeing a pattern in how the ransom notes looked, and so I wanted to explore that, to see if there was a way we could classify those, and to see if there was something that unites all of them and makes them easy to detect.

To go back a little bit: ransom notes are the files that are written to solicit the ransom payments. They come across in multiple file types. The most typical format is .txt files, plain text files, but you also see ones that are in formatted text formats such as HTML or RTF, and there are also images, or even little GUI-based dynamic programs. Ransom notes are going to be some of the first files written to disk, and sometimes they're even written to every directory. Essentially, the adversary is trying to be as noisy as possible, in the hopes that they frustrate the users enough and get the point across that their data has been totally compromised and they have to pay the ransom to get their data back.

So we'll go through here and look at a few ransom notes to get the general idea of what I'm talking about. This one's from CryptoLocker, and they lead off with just saying your files are encrypted, and then they talk about how you don't have access to the decryption key, so you can't recover your files. They want you to email them, and they're providing a specific time window for how long the ransom offer will be valid, essentially, and then they even get into talking about the AES encryption that they're supposedly using. Going on to the next sample, it starts out pretty much the exact same way: all your files have been encrypted. Then they say something similar about all your documents being encrypted and unrecoverable.
Please pay us 0.01 Bitcoin to a specific wallet ID, and then they also provide an email address. Finally, here's the actual image-based ransom note, and if you pay particular attention, they were requesting 100 Bitcoin, which is approximately 750 thousand dollars right now. So I'm not exactly sure how successful they were with this ransomware campaign, but they were at least pretty pricey.

As we saw from looking at even just three very disparate samples of ransom notes, you can see a template start to form. They typically lead off by saying something about how your files have been encrypted. Sometimes they provide a family name, and then they'll sometimes get into talking about the actual encryption that was implemented as part of the ransomware. They get the point across that files can't be recovered without a key that they'll provide only if the ransom is paid. Then they potentially provide an email address, and a time window for how long the ransom offer will be valid.

So, as I previously said in my intro, I'm not a data scientist, so exploratory research for me in this case was first just developing a better familiarity with data science concepts. Moving on from there, I needed to collect a big enough corpus of ransom notes in order to do some training, and on the flip side, we needed to put together a nice, representative benign data set to go with the ransom notes. The overarching goal of the exploratory research is to determine if this approach can possibly work for classification. For tools, I pretty much just used Anaconda for everything, which comes bundled with Python 3 and Jupyter notebooks, and I also used scikit-learn and spaCy.

Delving into the data sets a little bit: for benign data,
I just ended up using the 20 Newsgroups data set, which probably most of you are familiar with. That's a list of the actual 20 newsgroups that are part of it. For the ransom notes, it's definitely a little tougher to put together a large collection: not every ransomware family writes them out to disk, so going through and manually doing the research to figure out which families actually drop notes can be a little tedious. A lot of this involved manually detonating ransomware samples over a period of years, collecting the ransom notes, storing them off, and then digging them out for this project, but also searching through blog posts, Twitter, and things like that. I was able to collect enough samples that I had something representative of ransom notes in general.

The actual approach I took for the exploratory research: we're just going to go with unlabeled data. We're going to take the 20 Newsgroups data set, combine it with the ransom notes, and take a clustering approach using k-means, set to 21 clusters. With the 20 Newsgroups data and the ransom notes, we're hoping that with 21 clusters, the way they settle out, each of the newsgroups will end up distinctly in its own cluster, and the ransom notes will stick together in their own cluster. And to analyze the data a little more closely, we'll look at it through a couple of vectorizers.
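As a rough sketch of what that clustering setup looks like (purely illustrative: a tiny made-up corpus stands in for the 20 Newsgroups messages and the ransom notes, and 2 clusters stand in for the 21 used in the talk):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy stand-ins: a few "newsgroup-style" posts and a few ransom-note-style texts.
benign = [
    "the graphics card renders polygons quickly",
    "our hockey team won the game last night",
    "the new motherboard supports faster memory",
]
ransom = [
    "all your files have been encrypted send bitcoin to recover them",
    "your documents are encrypted pay the ransom in bitcoin for the key",
]

docs = benign + ransom
vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)

# 2 clusters here instead of the 21 (20 newsgroups + 1) used in the talk.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
labels = km.labels_
print(labels)
```

With vocabularies this disjoint, the ransom-note-style texts end up sharing a cluster while the benign posts fall into the other, which is the behavior the talk is hoping for at full scale.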
We'll take a look at the data using CountVectorizer and a TF-IDF vectorizer.

Getting started, we do some very basic data prep before tokenization: we strip out newline characters, convert to lowercase, strip out null bytes, things like that, just to get the data starting to make a little more sense. Then, when we do the actual tokenization, we limit it to alphanumeric characters only, we strip out any stopwords that are in the default spaCy stopwords list, and we do lemmatization. A very quick example: "encryption" would lemmatize down to "encrypt."

So here's a very quick overview of how the tokenization worked. For this example I took a very small portion of a ransom note and passed it in, and you can see how it breaks down to a very core set of words: file, encrypt, bitcoin, send, payment. That pretty much is very descriptive of exactly what they're going for.

Now, I'm not sure how well you can see up there, but breaking down the most common features that we're seeing in the 173 ransom notes, we see a lot of the same sort of words: things describing files and data being encrypted, bitcoin, and of course encrypt and decrypt, things of that nature. Even just looking through those words, you might be able to reconstruct what the purpose of the text is without having any sort of context. And then, when we break it out into bigrams, things make a little more sense, because you're working with phrasing; it's not just "files" in a vacuum.
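A minimal sketch of that pre-tokenization and tokenization step (the stopword list and lemma map below are tiny hand-rolled stand-ins for spaCy's defaults, just to show the shape of the pipeline):

```python
import re

# Tiny hand-rolled stand-ins for spaCy's default stopword list and lemmatizer,
# purely to illustrate the pipeline described above.
STOPWORDS = {"all", "your", "have", "been", "to", "the", "a", "in", "us"}
LEMMAS = {"files": "file", "encrypted": "encrypt", "encryption": "encrypt",
          "documents": "document", "payments": "payment"}

def preprocess(text):
    # Pre-tokenization cleanup: drop newlines and null bytes, lowercase.
    text = text.replace("\n", " ").replace("\x00", "").lower()
    # Tokenize on alphanumeric runs only.
    tokens = re.findall(r"[a-z0-9]+", text)
    # Strip stopwords, then lemmatize what's left.
    return [LEMMAS.get(t, t) for t in tokens if t not in STOPWORDS]

print(preprocess("All your files have been ENCRYPTED!\nSend Bitcoin."))
# -> ['file', 'encrypt', 'send', 'bitcoin']
```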
It's "files encrypted," "files decrypted," private keys, Bitcoin addresses, things along those lines, but that gives you a little bit of an idea of what the data looks like. Then, when we apply TF-IDF, it looks pretty similar to what we're getting from the count vectorizer, so that just gives you another view of the data.

Now, I'm not sure how well you can see this here, but essentially, with the 21 clusters, they broke out quite nicely for us. In cluster three, despite the ransom notes consisting of only 173 unique samples versus the roughly 11,000 messages in the 20 Newsgroups data set, the ransom notes all clustered together extremely well. That cluster (I believe that's the top ten features in that cluster) matches extremely well with what we just saw in the previous two slides. And that actually is a good test for the data set, because if you'll see, the top entry for the newsgroups in the image on the right side is sci.crypt, which is the encryption newsgroup. And if you see cluster six, it might be a little tough to tell, but you can get an idea of how old the data set is, because they're talking about Clipper chips, which were around the mid-90s or so. Either way, distinguishing between newsgroup discussions around encryption and ransom notes that discuss encryption at a higher level is a good initial test of how strongly the data correlates.

Delving into how the clustering actually worked, we want to get under the hood and pass in some sample data. I took another ransom note and passed it into the k-means predictor, and if you break out the results, using the square root of the sum of the squares we can calculate the distance from each centroid. In our case, that ransom note did end up in cluster three, which is what we were hoping for.
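That distance calculation is just the Euclidean norm between a sample's feature vector and each cluster centroid; a minimal sketch with made-up numbers:

```python
import numpy as np

# Hypothetical 2-D feature vectors: three cluster centroids and one new sample.
centroids = np.array([[0.0, 1.0],
                      [1.0, 0.0],
                      [0.9, 0.9]])
sample = np.array([0.8, 1.0])

# Square root of the sum of squared differences, i.e. Euclidean distance,
# from the sample to each centroid (equivalent to np.linalg.norm per row).
distances = np.sqrt(((centroids - sample) ** 2).sum(axis=1))
predicted = int(distances.argmin())

print(distances)
print(predicted)
```

The sample is assigned to whichever centroid it sits closest to, which is exactly what the k-means predictor reports.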
For a second example, we used something that more generically just talks about encryption but isn't specifically a ransom note, and in this case it actually ended up being a closer match to cluster four, which is actually entries from comp.graphics.

So what did we learn from our exploratory research? Well, as I mentioned before, we have a small set of data, but the ransom notes do cluster together very well. The second sample demonstrated that there is nuance in how the data gets clustered. And from all of that, we learned that the data appears to be appropriate for classification, so we can actually go forward with the actual proof of concept.

For a POC framework, we have a few requirements. First and foremost, we need to obtain file change events in real time. We need to take the file paths that are being created and pass them to a model that we developed. From there, we're going to read in the actual text data from the file paths that have been created, read in the file contents, and then pass all of that along to a classifier to determine whether or not the data consists of a ransom note. Then, if it is a ransom note, we need a way to mitigate the process. To reduce the problem space for this, we're going to put up a few restrictions: we're going to stick to English only, and .txt files only. As I mentioned, they're the most common ransom notes. That doesn't cover the entire world of ransom notes, but formatted text
is going to require parsing, and for images we'd have to use OCR to extract the data, which would probably require a little bit of cleanup beyond that, so at least for this research, I figured that was out of scope for what I was trying to accomplish. And then we're going to stick to files that are less than 20 kilobytes. The reasoning for this is that ransom notes are generally pretty small. Going back to the template I was discussing earlier, they're not really trying to get across too much; they're very utilitarian, just saying, hey, files encrypted, please send us a ransom. That's basically it. So reducing the problem space by keeping it to less than 20 kilobytes helps out with performance.

We can break the framework down into a couple of components and two pretty distinct processes. We'll have a file change event listener that reads in the events and places them into a queue for a second process, which will do the text extraction and the actual classification of notes. And then, if we determine that there's a ransom note, there will be a process mitigation handler that will operate.

Here's kind of a high-level diagram of how a typical infection scenario would play out with the framework in place. You'd have ransomware executing; it drops a ransom note to disk; the event listener is polling for events at that time; it'll see a file creation event for the ransom note, and it'll pass that file path along to the text extractor and classifier, which will read in the contents of the ransom note and do the actual classification, hopefully return a yes, and that will result in the ransomware process being suspended.

For the POC framework, we wanted to build out a more representative data set. For the benign side, we'll still stick with the 20 Newsgroups, but we'll take a smaller slice of it instead of the overarching 11,000, and then we'll supplement that.
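The listener-plus-classifier split described above can be sketched roughly like this, with stubs standing in for the real trained model and the real suspension logic (the file name and process name below are made up for the demo):

```python
import os
import queue
import tempfile

MAX_SIZE = 20 * 1024  # ransom notes are small, so skip anything 20 KB or larger

def is_candidate(path):
    # Restrict to the .txt-only, under-20 KB problem space described above.
    return path.lower().endswith(".txt") and os.path.getsize(path) < MAX_SIZE

def classify(text):
    # Stub: the real framework calls the trained classifier here.
    return "encrypted" in text.lower() and "bitcoin" in text.lower()

def suspend(pid, image):
    # Stub: the real mitigation handler suspends the process and alerts the user.
    return f"suspended {image} (pid {pid})"

events = queue.Queue()  # file-create events from the listener: (pid, image, path)

def drain():
    # Second process: pull events off the queue, extract text, classify, mitigate.
    alerts = []
    while not events.empty():
        pid, image, path = events.get()
        if not is_candidate(path):
            continue
        with open(path, encoding="utf-8", errors="ignore") as f:
            if classify(f.read()):
                alerts.append(suspend(pid, image))
    return alerts

# Demo: simulate the listener seeing a ransom note being written to disk.
note = os.path.join(tempfile.mkdtemp(), "RECOVER.txt")
with open(note, "w") as f:
    f.write("All your files have been encrypted. Send Bitcoin to recover them.")
events.put((4242, "ransom.exe", note))
alerts = drain()
print(alerts)
```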
To supplement it, we'll leverage some of the Windows text files that I was able to scoop up: typically log files, readme files, any sort of installer logs, things along those lines. Then for the ransom notes, I did my best to collect as many more as I could, and I ended up finding a bunch on Pastebin and a few other sources, so that was a great source. But still, we're left with only 350 ransom notes compared to 11,000 benign messages. So for the classification approach here, we want to address the data set imbalance, which is quite large, and we can use SMOTE to generate synthetic data for us, which can hopefully bridge the gap and make up for that pretty big imbalance.

For the approach to the classifier here, we're going to do feature selection via TF-IDF, and essentially what we have is a supervised learning problem. We're going to label the data this time as either benign or ransom note, and then we're breaking all of this down into a binary classification problem: does the text consist of a ransom note, or is it benign? For this, a Naive Bayes classifier is simple and straightforward, and that's the approach we went for immediately.

Before delving into the results, here's a very high-level overview of the data processing pipeline. We start with our labeled data set and pass it along to pre-tokenization, where we're stripping out characters, converting to lowercase, things along those lines. We do the actual tokenization, and then we sanitize the data a little bit by stripping out stopwords and anything that's not alphanumeric, and then do lemmatization. From there we pass it along to the TF-IDF vectorizer to vectorize the data, we use SMOTE to balance out the data sets, and then we do the actual training with our classifier.

For testing here, we're splitting the data into an 80/20 split: 80% of the data will be used for training, while 20% will be used for testing.
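A toy sketch of that classifier pipeline (note: plain random duplication stands in here for SMOTE, which actually interpolates synthetic minority samples rather than duplicating existing ones, and the five texts are made up):

```python
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy labeled corpus: 0 = benign, 1 = ransom note.
texts = [
    "installer log setup completed successfully",
    "readme this directory contains sample configuration files",
    "the hockey game went into overtime last night",
    "meeting notes budget review scheduled for friday",
    "all your files have been encrypted send bitcoin to this wallet",
]
labels = [0, 0, 0, 0, 1]

vec = TfidfVectorizer()
X = vec.fit_transform(texts).toarray().tolist()

# Stand-in for SMOTE: duplicate minority-class rows until the classes balance.
random.seed(0)
minority = [x for x, y in zip(X, labels) if y == 1]
while labels.count(1) < labels.count(0):
    X.append(random.choice(minority))
    labels.append(1)

clf = MultinomialNB().fit(X, labels)
test = vec.transform(["your documents are encrypted pay bitcoin for the key"])
print(clf.predict(test))  # expect the ransom-note label, 1
```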
We use train_test_split from scikit-learn to handle that. And just to give a brief overview of the terminology involved, this is probably extremely common and known to most of you, but: the accuracy score I'm going to be referring to here is the actual accuracy classification score; the F1 score is the harmonic mean of precision and recall; and a confusion matrix is just a great way to represent true and false positive and negative rates. For our cross-validation, we're going to use a Monte Carlo approach, where essentially we're running multiple runs, building fresh training and test data sets each time. In this case, we're testing the model's ability to generalize, to see how flexible this approach is going to be and that it isn't just overfitting.

For a single test here, we actually ended up doing extremely well: accuracy over 99%, F1 score over 91, and you can see from the confusion matrix zero false negatives, which is great, and a few false positives, but nothing too crazy. So that's encouraging, but how does it scale? We need to do some cross-validation to determine whether that was just an outlier or a pattern we can expect. We ran through cross-validation with 10 separate runs using varied training and test data, and that ended up with very similar results: accuracy was over 99%, the F1 score was over 90, and the confusion matrix looked about the same. So I think that validated the approach we're taking. Here's just some graphed data to give you a better representation of what we're looking at. As I said, I'm not a data scientist, but it's good.
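The Monte Carlo cross-validation amounts to repeating a fresh random 80/20 split and re-scoring on every run; a sketch on synthetic stand-in data (the talk's actual inputs were TF-IDF vectors of notes versus benign text):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, f1_score

# Synthetic stand-in data for this sketch.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

accs, f1s = [], []
for run in range(10):
    # Monte Carlo CV: a fresh random 80/20 split on every run.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=run)
    model = GaussianNB().fit(X_tr, y_tr)
    pred = model.predict(X_te)
    accs.append(accuracy_score(y_te, pred))
    f1s.append(f1_score(y_te, pred))

print(f"mean accuracy: {sum(accs) / len(accs):.3f}")
print(f"mean F1:       {sum(f1s) / len(f1s):.3f}")
```

Consistent scores across the ten runs are the signal that the single-split result wasn't an outlier.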
So, breaking out the other components of the framework: with the event listener, we need to monitor file change events. We're looking at all processes that are active on a host, we need a way to map each event to a specific process, and we're focusing specifically on file creates in our case. There are a few approaches you can take to get this data, including using the Python watchdog library, but as I said before, the most important things we need here are the type of file event, the process that's responsible for the particular event, and the file path. Python watchdog, in this case, is based off of the ReadDirectoryChangesW Windows API, which I believe doesn't actually return any sort of source process data, so in our case, that's not going to help. As alternative approaches, you could go through event logs, or you could write your own file system minifilter driver. Both of those would work, but developing your own driver was going to take way too much work, so for our case here, what I ended up wanting to do was leverage something pre-built and see if I could sift through event log data to get our file events in real time. In my case, I was able to leverage Sysmon. If you're not familiar with Sysmon, it's a tool used for monitoring event data on Windows, and there's a specific FileCreate event, event ID 11, that's perfect for our purposes. We don't have to worry about distinguishing between different types of file change events; we only have one type of event, which for us is just great. There's a very simple configuration file that I came up with, and I posted it to the GitHub repository for this project. We're limiting things just to .txt files, as I previously mentioned, and just trying to sift out other data.
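A Sysmon FileCreate event (event ID 11) carries exactly the fields the framework needs: the source image, its PID, and the target file path. Here's a sketch of pulling those out of the event XML (the sample record below is made up and trimmed down to just those fields):

```python
import xml.etree.ElementTree as ET

NS = "{http://schemas.microsoft.com/win/2004/08/events/event}"

# A made-up Sysmon event ID 11 (FileCreate) record, trimmed to the
# fields we care about: source image, PID, and target file path.
SAMPLE = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System><EventID>11</EventID></System>
  <EventData>
    <Data Name="ProcessId">4242</Data>
    <Data Name="Image">C:\\Users\\demo\\volcano.exe</Data>
    <Data Name="TargetFilename">C:\\Users\\demo\\Desktop\\KEY.txt</Data>
  </EventData>
</Event>"""

def parse_file_create(xml_text):
    # Collect the named <Data> fields, then keep the three the framework uses.
    root = ET.fromstring(xml_text)
    fields = {d.get("Name"): d.text for d in root.iter(f"{NS}Data")}
    return {
        "pid": int(fields["ProcessId"]),
        "image": fields["Image"],
        "path": fields["TargetFilename"],
    }

print(parse_file_create(SAMPLE))
```

Each parsed record is what gets queued up for the text extractor and classifier.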
So we're not trying to crowd the event logs There's a registry key that you have to add in order to Properly allow the event log to be queried at an in real time. So that's there and so basically what we're trying to do in this case is we're going to pull the The event log and we're going to use the wi query language And we're essentially just gonna be pulling every 10 milliseconds in order to try to get updates Of new file change events that are coming in in the in real time So we need to limit the size of the result set that we're getting And we're parsing any results we get with stuff in the class firework. That's what the query essentially looks like You know pretty self-expand right there And for the actual approach for process mitigation very straightforward here all we need to do to determine is Is that process currently active with that id and process name If it is active we'll suspend it and we need to alert the user that there was this activity on their host And giving the choice to Terminate the process or resume process. All right, so we're going to try a live demo here. So Let's see what happens Okay, so I have the framework here running in a single python file I have process monitors set up with a couple filters We're looking at volcano dot exe volcano is a common ransomware family And I renamed the xcubal to volcano dot exe to make this more simple and we're going to use We're just going to look strictly at Right file events for a process with that name. 
So as you can see no events at the moment And here is my volcano dot exe And I will execute that and we get our pop up So it provides us with a specific file path To the text file that it determined to be a ransom note And it went ahead and suspended Volcano dot exe with that specific pit and if we go back to here In process explorer we can verify that that process has been suspended And if we go through here, we can kind of look through how You know the progression of the ransomware as it's writing files to disk It's like a 540 22 that was the first activity And See around 540 25 is when it was When the process is suspended and there was no further offense So detection time within three seconds or so But we have to like for our purposes actually since we're not keying off of any other Files what we're only keying off of is the text files. So we can go through the process monitor and See if through the data to only look at text files to get a better idea of how long it took for us to Detect the ransomware Let's see so we want a path that actually ends with dot text files And so here what we can see is that there are multiple ransom notes that are written to disk as I mentioned before Ransomware is typically pretty noisy without their distributing ransom notes on disk. So in this case, we actually have 22 of the same file that are that's going to be written out or well Actually, I think we're only looking for a key dot text. 
so that count might even be lower, and I think some of those were actual files being directly encrypted. But that gives you an idea of just how noisy the ransomware was. So we still have that process suspended, and we can go ahead and click terminate, and as we'll see here, the process is done.

Okay, so getting into some more testing that I did with the framework. I was able to test against nine samples that were essentially holdouts, because their ransom notes weren't part of our training or test data set, and we were able to detect those nine specific samples from those families, as well as three samples I tested that already had notes in the training set. In order to get a better idea of how successful this approach is, I wanted to test against what's currently out there. For our case, we just wanted to do testing against anything that was free or trial-based; I didn't want to shell out any money for testing here. And we want to break it out into two different tests: does the product detect the sample, and if it does, can we run it side by side with the classifier framework that we just came up with, to get a rough estimate of what the detection speed looks like? There were definitely potential complicating factors in that particular test case, because things like driver altitude can definitely affect how products behave when they're running side by side. But it's just a way to get a rough idea of how the performance compares to what's actually currently available for download.

And the testing actually went extremely well. I kept things very generic; I don't want to call out any specific vendors or anything like that. But in our case, there was one specific product that did perform very well and was typically faster in detection
than the classifier framework that I developed. That being said, for the detections where that product did perform better, the framework was still close in performance; it lagged behind by only a couple of seconds or so. But surprisingly, there were two products out there that were very easily outperformed by the classifier framework. If you look at A1 and A2: while they detected pretty much all of the twelve samples that we saw (I think I was unable to run a test for one of the samples), they were outperformed nearly all the time by our framework. That's actually pretty amazing considering the sort of ad hoc approach that we took: sifting through event logs for data, doing all of this classification at runtime, and doing it all in Python, essentially going head to head with something that's running native code and probably leveraging a minifilter driver to obtain its input. So that definitely validates the approach that we took.

That being said, those results are great, but there are definitely limitations to this approach. There are plenty of ransomware samples that don't drop .txt notes; some don't even drop notes at all.
Some try to convey their ransom message just through a custom file extension that they apply to every single file. Some samples drop ransom notes much later in the game, after all the files have been encrypted. And then there are also samples that leverage some sort of persistence and will typically respawn even if you suspend the process; we might be able to detect it, but it's just going to keep going over and over. And of course, there is ransomware that takes different approaches to denying users access to their data: MBR modifications, anything with raw disk access, or just simple screen lockers. And of course, as we mentioned going in, we're sticking only to English.

For future work, we'd like to improve the data sets. More ransom notes, definitely; relying on less synthetic data would be nice, as would collecting new ransom notes as new ransomware families emerge. We'd also like to build out a more representative benign text data set: more log files, more installer files, things of that nature. If we could port our code base to a lower-level language, that would be great; we'd see very significant performance improvements, and we'd be able to improve our detection time as well. It would be nice to support other file types, like the formatted text I mentioned before, as well as extracting text from images, and expanded language support would be nice too, along with experimenting with the actual approach to classification.

So, to wrap things up: clustering gave us a good indication that the data would be suitable for classification, and we saw that ransom notes do share enough features for our solution to be viable. Going into this, we did realize this isn't going to catch all ransomware, but it could be a very integral piece of a layered detection approach. So yes, the proof of concept did work, but there are definitely many improvements to be made.
All right. Thank you very much