For our first speaker today, we have Brian on generating labeled data from adversary simulations with MITRE ATT&CK. Please give it up for him.

Thank you very much for coming so early on Sunday morning. That's awesome. I appreciate you guys coming and listening to this for a little bit. I want to talk about a couple of things and give you a little background on how I see this problem set. The general premise here is that whatever I'm looking at, whether it's Bro logs or whatever problem I'm trying to solve, I try to recognize the biases that I have: I looked at this last week, I looked at this last month, that kind of idea. If I can abstract away some of that bias and have a repeatable methodology, something that's based on math, maybe I can find some insights. And the interesting thing about what I'm talking about today is that for me it's not theoretical at all. We have an internal red team that's really proficient. Is anyone here from a red team? Our red team will sometimes perform activities based on MITRE ATT&CK, and whether that's DNS exfiltration, like we're talking about today, or some other technique, they'll hit a canary URL first. So think about white box and overt, out in the open, versus black box. If you've heard of threat hunting, the hypothesis is: I think there might be DNS exfiltration, and therefore you come up with a plan and look for the artifacts. We'll get into that in a minute. But for me it's not theoretical at all. I know, based on what I saw here, that those guys, my friends that I drank beer and bourbon with, ripped us off. They broke in and they stole some stuff on May 18th, 2018. That was the white box, overt time, when they hit the canary URL. And I know from patterns that that means they probably broke in again in a covert, black box way.
So when we talk about assumed breach, it's completely not theoretical. Whether you believe in that philosophy or not (and I do), I know that these guys who bought me a beer the other night are probably sitting on some data that they exfiltrated. That's the background we'll get into here with the threat hunting and how this ties in.

Here's the quick agenda. A very quick intro. Believe it or not, the MITRE ATT&CK part is going to be real quick. I love MITRE ATT&CK, absolutely do, but I think a lot of CFPs and a lot of cons are getting saturated with it, right? If anybody wants more than the information I'm going to provide in the slides here, please come up afterwards; I'll talk about it as long as you want. But I'm going to trim that down a little because everyone's probably heard a lot about it recently. Then: harvesting labeled training data, and I'll get into what I mean by that; EDA, exploratory data analysis; a machine learning worked example; and then, very candidly, some challenges that I've run into and a little bit about future work.

Before I do that, I just wanted to get a quick sense of the background in the room. I thought we could start here, go around, and everybody could say their name and what they do. Okay, well, how about, could I just see a show of hands? How many people do something like threat hunting? How many people do any kind of data analytics outside of Excel? Awesome. And how many people have some sort of program where you're doing adversary simulation, where you've got an actual purple team, both sides internal? Okay, cool. Thank you very much. That first idea was terrible; I don't know what I was thinking.

All right, real quickly about me. My name is Brian. I'm a threat hunting lead at a Fortune 100 financial services company. I also help out with threat intelligence and now security orchestration, automation, and response, or SOAR. But the bottom line for me is that there's one prime directive.
It's: find evil. Rob Lee from SANS talks about "know normal, find evil." I think about this all day, sometimes all night. I hear about that a little from my wife sometimes, but she's been very patient. It's almost an obsession.

So, the MITRE ATT&CK framework. We're probably mostly familiar with it, but just to level set: it's Adversarial Tactics, Techniques, and Common Knowledge, a curated knowledge base and model for cyber adversary behavior, reflecting the various phases of an adversary's lifecycle and the platforms they're known to target. My buddy Zach, the lead red team guy at our place in Milwaukee, and I did a talk at DerbyCon in 2016. It's a small world; did anybody see that talk at DerbyCon 2016? We were being very open kimono, very transparent: here's what we were trying to do with limited resources and budget and everything else, because there are a lot of techniques; here's what we tried, here's why we're doing it, and here were the results. I didn't put the ATT&CK timeline on here, and I know we have PRE-ATT&CK now, but at that point we were primarily focusing on the later stages of ATT&CK. So that's the context in which I'm talking about some of these techniques.

In particular, we're talking about Exfiltration Over Alternative Protocol. I didn't blow this slide up because I wanted to fit everything on there, so you don't need to read what's on it. I'll make sure my Twitter handle is on here, which is just at Brian Genz; I'll have all these slides up by Tuesday midnight, Central Daylight Time. So, Exfiltration Over Alternative Protocol: here I'm focused on DNS. Can anybody think of any tools you might use for DNS exfiltration? Go ahead and shout it out. Somebody said iodine, yep. Anybody else? Cobalt Strike fans? Or write your own. There are a lot of different ways you can do this, right?
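To make the "lots of ways to do this" point concrete, here's a minimal sketch of the encoding idea most DNS exfil tools share: chunk the stolen bytes into DNS-safe labels and send each chunk as a query. This is a generic illustration only; `c2.example.com` and the framing are made up, and real tools like iodine or Cobalt Strike each use their own encodings and protocols.

```python
import base64

def chunk_into_queries(data: bytes, domain: str, label_max: int = 60) -> list:
    """Encode raw bytes as Base32 and split them into DNS-safe labels.

    Generic illustration only: real exfil tools each do their own framing.
    DNS labels max out at 63 characters, hence label_max below that.
    """
    encoded = base64.b32encode(data).decode().rstrip("=").lower()
    labels = [encoded[i:i + label_max] for i in range(0, len(encoded), label_max)]
    # One query per chunk; a sequence number keeps the chunks ordered.
    return [f"{i}.{label}.{domain}" for i, label in enumerate(labels)]

queries = chunk_into_queries(b"secret file contents", "c2.example.com")
```

The defender-side consequence is exactly what the rest of the talk exploits: these queries come out unusually long and random-looking compared to ordinary hostnames.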
Interestingly, I don't want to create an overfitted model. Did anybody see that picture on Twitter from the Bay Area about overfitting a model? I can't verify whether it's true or not, but maybe somebody in the back can tell me if you've heard of it. Essentially, a lot of the models for Teslas in self-driving mode were trained on roads in the Bay Area, and when they were driving on roads outside that area, things didn't fit what the models were used to: there were lines of salt laid down by a salt truck, and that was messing with them. If anybody's heard about that, cool. If not, it's one example in my mind of the risk: if I try to detect my buddies breaking in and stealing data, and I focus only on what it looks like with Cobalt Strike, that's too narrow, right? Because if I can't detect them, I can't detect somebody else doing it. If you use iodine instead, maybe I'll catch it, maybe I won't. The point is that this is one of many techniques, but it's one we focused on because we had the instrumentation and the telemetry to dig into.

Really, for me, all MITRE ATT&CK is, is a true north, where I, and we, can sit in front of our CISO and executive leadership and say: certainly there are compliance requirements and other boxes that need to be checked, but let's focus on what the attackers are doing. They don't have to do something on the menu or off the menu; it's up to them. But let's start with known TTPs and tighten up our monitoring and our detection engineering efforts. I'm not going to spend a lot of time on this; if you follow this stuff, you're probably familiar with Katie Nickels and the other folks who work on it. But MITRE has Caldera, Red Canary has Atomic Red Team, Uber has Metta, and Endgame has Red Team Automation.
They automate to varying degrees, but these are open source tools that essentially help teams figure out repeatable processes for adversary simulation. And Cyb3rWard0g, Roberto Rodriguez, who I think is at SpecterOps now, had an article that will be linked when I send the slides out; basically they were using the API and hooking into it, letting you dig into that data. Hang on just a minute, I'll get some more vodka. Just kidding. And that talk we did is online if you want to look at it.

So there I was, no kidding. The red team is sitting here, I'm sitting on the right, and the red team is lighting things up, probably Cobalt Strike at the time. Honestly, I can see them: they hit Enter, they're waiting for something to call back, and then it starts popping up. I think they're counting the milliseconds until "okay, I got a callback here." And then they're looking at me, like: are your systems lighting up yet? Why aren't you guys hunting for this? Why aren't you looking at a ticket from Splunk or whatever SIEM you use? I looked back at it, and there were 300 rows that were specifically related to what my buddies Matt and Zach had done.

Has anybody heard of Locard's Exchange Principle? Every contact leaves a trace. If a burglar breaks in, they might break a window, might leave some skin, some hair, some kind of sample that law enforcement could use to trace back to a DNA sample; footprints outside if it's muddy. That's what we're trying to do with MITRE ATT&CK: identify those traces. Chris Sanders, who has the Investigation Theory course (we brought him in in December to do some training for our folks), talks about a pyramid of four different kinds of evidence sources, and he breaks it down into network, host, memory, and OSINT.
And I want to know: what are the digital artifacts that are left when my buddies, or somebody else, break in and steal stuff? Our architecture is not like yours, and your architecture probably isn't exactly like it was six months or a year ago, either. So I think there's a lot of value in seeing what this looks like. And why is everything moving on the screen? Because it's early; I know how this goes. It was slightly amusing to do that because it's parallax, and that's better than PowerPoint animation, so I think that counts. I don't think that's against the rules. But, something that moves.

So, how many people have heard of EDA, exploratory data analysis? Let's start with this, and again, don't worry about trying to read the small text. What I wanted to show is the entirety of this rectangle, which is from Corelight's Bro cheat sheet: all the DNS log fields, with the type and the description of each. In a minute, when we're trying to figure out how to represent the knowledge of what we're seeing on the network, how to convert it into a feature (think of a column in a spreadsheet), this is what we can hook into. And in the same way that maybe you're training a child, in some experiment, to classify a fork versus a spoon: if none of it's labeled, if you don't know what the ground truth is, you're dependent on somebody coming in and doing that labeling. Otherwise, all you can do is cluster things based on similarities, right? But the first thing we have to do, before we make those decisions, is understand what these fields are. What does AA, the authoritative answer flag, mean in your environment? Protocol, proto, and some other fields we'll get into in a minute. When I say EDA, I'm talking about starting with that. Jupyter Notebook, which used to be IPython Notebook; this is actually from Clarence Chio's book that I've got here. Shameless plug.
I say that because this is from Chapter 2 of one of my new favorite books. Did anybody else buy Data-Driven Security in 2014, like the day it came out? Yeah. This is something I've been digging into, and it's very helpful. It reminds me that there's always something to learn, and I always find it extremely valuable to get somebody else's perspective. So this is actually from the O'Reilly GitHub, and it's just an example of bringing in some imports, loading the data, the standard pieces. A pandas DataFrame, to oversimplify if you haven't heard of it before: I like to think of it visually as an Excel spreadsheet. I'm probably going to have pitchforks and torches after that comment, but it's tabular, so you can think of it that way for now; it does much, much more.

Has anybody heard of Kitware's Bro Analysis Tools? There's a guy named Brian who's one of the developers there, and I just can't say enough good things about these folks. I wasn't at BroCon last year, but I saw the video he'd done; again, there'll be a link at the bottom of this slide. And I contacted Brian because I was stuck on something in his open source code. I don't like to do that; I don't like to ask people to Google stuff for me. A few weeks ago I was playing around with something and said, you know, I'll figure this out, I'm just not sure how long it's going to take. So I just sent him my question, and without getting into the details, it was essentially: why can't I join two of these data frames together? It turned out to be something on the back end, the way they were doing the pre-processing in Bro Analysis Tools, which they describe as a software bridge: you can get from Bro to pandas, and then from pandas to scikit-learn, which we'll talk about in a minute. But he appreciated the feedback from somebody in the field, in the trenches, saying: this is what I'm trying to do.
And he explained what the workaround was, because it was a different data type. Just another great example of people pitching in in the open source community. I mean, I sent the dude an email at nine or ten at night, and he responded by midnight. It's really encouraging when you're working through something and somebody else helps you out a little.

So, feature engineering. Again, we're trying to figure out what things we can use to categorize something: you could talk about height, diameter, top. In the same way, we want to find ways to represent the knowledge that describes what's going on on the network with DNS. Hopefully that's going to let us figure out what features we can hook into, and then train a model so we can catch my buddies the next time they break in.

Okay, the Griffon data science virtual environment. This is Charles Givre's; he does a class here with Austin Taylor, and sometimes with Jay Jacobs from Data-Driven Security. Awesome folks. I don't know if this is accurate, but I've always thought of it as the Kali Linux of data science. I use it; it's pretty decent. Again, there'll be a link there.

So, I said we're going to do a machine learning worked example. I pulled some stuff out of this after listening to some of the other talks, because I wanted to make sure I don't cram some kind of crazy algorithm in and try to show everything I'm doing. Because, as I mentioned, I'm doing a little bit of orchestration and automation, I want to have that cycle going where I'm getting an internal IP address and then enriching it with friendly intelligence: okay, here's the IP address; which host had it (let's assume a 24-hour DHCP lease); which internal hostname has it; who's the last logged-in user; then go collect some other stuff.
The more you can find out, and the faster you can find it out, feed that back in. There might be a feature or column that you can compute, or some other insight. I'm going to say a reputation score (I know it's a terrible example), but some other verifiable piece of information that you can create another column from. And in the very fine print at the bottom, I'm being explicit about giving credit to Charles Givre, because I literally lifted this from his slide; I just made it a different color. So thank you, Charles. There are different descriptions of how this works, but I like this one from his training class because it's consumable for me: you get the data, you clean the data, you pre-process, you do feature engineering.

Now, some of this was naive of me. When I thought about Bro logs, I thought: Bro logs are pretty structured, right? I'm not going to have a lot of this work. No, because there's a lot of getting it from where it is to where it needs to go: the data engineering, the pipeline, that kind of stuff. And believe it or not, id.orig_h is the field I'm going to think of as the source IP that initiated the DNS request. Can anybody think of a problem when you start doing stuff in Python and the field or column name has a dot in it? You're going to throw an error, right? It's simple things like that; you'll see in a few minutes that we have a column that's renamed. Not a big deal, but I wouldn't have thought I'd run into it. It's just a different use than maybe the field name was originally designed for. Then, with the pre-processing and feature engineering, Bro Analysis Tools, which Kitware describes as an open source software bridge, does some of the behind-the-scenes heavy lifting, so you can use it as a gray box and move forward with what you're trying to analyze. And then advanced feature selection.
Then we have the data, which we split into train and test, and then we build the model and evaluate the model. I tried a couple of different things, so I'll show you a couple of differences. But the main thing that popped into my head when we started talking about this is: if I have labeled data when we train that model, I can move from unsupervised, which is clustering, to supervised. Now, from the 300 records of DNS exfiltration from Cobalt Strike, or whatever it is, I know what they did and when they did it, on May 18th. So I have another column that's one if it's known malicious and zero if it's not. Does that solve everything? No, there are some issues, because what do attackers and red teams want to do? You want to be stealthy. The better your tradecraft is, the more stealthy you are; and the more stealthy and quiet you are, the fewer artifacts I have, which leads to something we call class imbalance. You can correct for that, you can adjust for it, but I sometimes wonder: do I want to? Do I want to make that seem like a bigger part of the log data than it really is?

So: I'm importing pandas as pd. I'm importing numpy to get into the matrix operations. And then from bat, Bro Analysis Tools (which will be renamed at some point, because Bro itself is changing the name of its offering, so keep in mind this will be called something else), I import log_to_dataframe. A lot of times you'll say df =; I just put dns_df =, and I'm calling log_to_dataframe with the path to one hour's worth of logs. One hour's worth of logs. Next you see dns_df.rename(columns=...), which is what I was talking about with id.orig_h: if you have something like df.id.orig_h, you're going to throw an error.
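As a self-contained stand-in for the loading-and-renaming step just described (the talk uses bat's log_to_dataframe on a real dns.log; the two rows below are invented so the snippet runs anywhere):

```python
import io
import pandas as pd

# A minimal hand-rolled stand-in for what Bro Analysis Tools does when it
# loads dns.log into pandas, shown so the id.orig_h rename is concrete.
# The two sample rows are made up for illustration.
raw = io.StringIO(
    "ts\tid.orig_h\tquery\n"
    "1526601600.0\t10.0.0.5\twww.example.com\n"
    "1526601601.0\t10.0.0.6\tabc123.c2.example.com\n"
)
dns_df = pd.read_csv(raw, sep="\t")

# 'id.orig_h' is awkward in Python: df.id.orig_h reads as attribute access
# on a column named 'id', so rename it up front.
dns_df = dns_df.rename(columns={"id.orig_h": "id_orig_h"})
```

After the rename, every later step can reference `dns_df["id_orig_h"]` (or even `dns_df.id_orig_h`) without tripping over the dot.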
And with anything I say, there are two or three ways you could probably get around it; this was the quickest for me. filtered_dns_df: all I'm saying there is that the data frame we call dns_df is after the equals sign, so we're referencing that pandas data frame, and then in the square brackets we're saying id_orig_h, that's the host that initiated the DNS request, .str.contains. I masked this, but there's a large subnet that wasn't relevant to this, and I won't get into that, for OPSEC reasons of course. But the point is, you might segment that out, you might convert the IPs to integers, you can do ranges, you can do a lot of different stuff with that; I think that's actually covered in the Data-Driven Security book, among other places. And then I checked the types on the filtered data frame just to make sure, after I talked to Brian from Kitware. So I'm leaving myself some breadcrumbs as I go through. I pulled all the comments out of this just to have less on the screen.

This next one is just a nuance: filtered_dns_df.is_copy = False. Pandas is trying to be helpful if you don't do that, and says: hey, you keep slicing these things off and you're trying to do things on a copy. I had to Google that, and it turns out if you just set it to False, it stops throwing those warnings. Sounds pretty scientific, right?

filtered_dns_df['query_length']: here, for the current version of the data frame, the tabular data structure we're dealing with, I'm saying add a new column called query_length, and what I want in that column, for each row, is the length of what's in the query.
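A runnable sketch of the two steps just described, with made-up rows and a made-up 10.10.0.0/16 range standing in for the masked subnet:

```python
import pandas as pd

# Invented sample rows; the 10.10.x.x subnet is a placeholder for the
# masked, not-relevant subnet from the talk.
dns_df = pd.DataFrame({
    "id_orig_h": ["10.10.1.5", "192.168.7.9", "10.10.2.8"],
    "query": ["www.example.com", "short.io", "aGVsbG8gd29ybGQ.c2.example.com"],
})

# Drop the hosts inside the irrelevant subnet using str.contains
# with an anchored regex; ~ inverts the boolean mask.
mask = dns_df["id_orig_h"].str.contains(r"^10\.10\.", regex=True)
filtered_dns_df = dns_df[~mask].copy()  # .copy() avoids SettingWithCopyWarning

# New feature: length of the DNS query string, one value per row.
filtered_dns_df["query_length"] = filtered_dns_df["query"].str.len()
```

Using `.copy()` here is the modern idiom for the same problem the talk solves with `is_copy = False`: both silence pandas' warning about operating on a slice.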
So again, when you start talking about malicious URLs and message length, this might seem like one of the go-to things. However, I took a bunch of the other stuff off, because originally, when I'm trying to do this in a production environment, I want to know, for that IP address, for that time period, or maybe expanded to 24 hours: what does that look like in terms of conn.log? A lot of you, I'm sure, are familiar with conn.log. But if you're not, I think of conn.log in Bro as the closest I'm going to get to 100% NetFlow: basically the phone record instead of the phone conversation. So I'm trying to take an entity-based view of this, a user-360 and a device-360, and essentially understand what behaviors are being exhibited by that host during that time frame.

Just a real quick aside: how many people have heard of Black Hills Information Security's RITA? Is anybody using RITA? I think I've got a link in there; I'll make sure before I send it out. I've been using it for a while. Basically, you pipe the Bro logs into it: you import a Bro log for a day, or a directory full of Bro logs for a 24-hour period, I should be more precise. From there, it creates a MongoDB collection, and then you run the analyze step and it tells you about beaconing. John Strand and a couple of guys did a talk at DerbyCon and a couple of other places; they're using some crazy math behind the scenes, like a fast Fourier transform, looking at the signals. I just use the command line, but they've got an AI Hunter product too. What you get is basically a table or a CSV, and when I cat it out on the command line, what I see is a score on the left: yeah, we're 99% sure this thing's beaconing. Well, there's other stuff that looks like beaconing, right? So I don't ever want to have just one view into something.
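RITA's beaconing math is far more sophisticated than anything that fits on a slide, but a toy regularity score shows the intuition behind a "beaconing" feature you could later join onto the DNS data. This heuristic is mine, not RITA's, and the timestamps are invented:

```python
import statistics

def beacon_score(timestamps):
    """Toy regularity heuristic: connections at near-constant intervals
    score close to 1.0, irregular traffic close to 0.0.

    RITA's real analysis is much more involved (the talk mentions
    FFT-style signal analysis); this only illustrates the idea.
    """
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2:
        return 0.0
    mean = statistics.mean(gaps)
    if mean == 0:
        return 0.0
    # Coefficient of variation: low spread relative to the mean gap
    # means the callbacks are regular, i.e. beacon-like.
    cv = statistics.pstdev(gaps) / mean
    return max(0.0, 1.0 - cv)

regular = beacon_score([0, 60, 120, 180, 240])   # callback every 60s
irregular = beacon_score([0, 5, 200, 210, 900])  # ordinary bursty traffic
```

A score like this, computed per host from conn.log, is exactly the kind of enrichment column the next paragraph argues for.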
I want a more holistic approach: enrich these things by computing new features, adding a column, performing an operation on a different column, so that now I know something else about that entity and that record.

So, the next cell, In [22]. In [22] you have to put %matplotlib inline so the plot actually displays in a minute here; the rest of that cell is just formatting for how I want the plot to look. There are a lot of cleaner and more sophisticated-looking options; I just wanted the basics. Then in In [23] we're saying import math, and we're going to look at entropy. With entropy, you might have something along the lines of Base64- or Base32-encoded data, or you might have some encryption; either way, I find it to be pretty helpful. filtered_dns_df: again, we're creating a new column, entropy, and we're running a lambda. In a data frame, you don't have to do for loops, and shouldn't do for loops; you want to map or apply, or use a lambda function, and you're hitting that series, which is the column, right? So essentially, very quickly, I'm populating the value of this new column, entropy, with the results of that mathematical function. Now I know two more things about each of these rows: the message length and the entropy. When I look at the length after filtering out those segments I didn't need, we're at 14,000 rows.

In In [27], remember I was talking about that canary URL? It's not actually called "canary URL"; I did a sophisticated find-and-replace and masked it, because that's also scientific. I'm trying to understand the length of everything inside the parentheses. Which means, to zoom back out for a minute: my friends on the internal red team have 121 records, DNS requests or messages, that were logged between 8 and 9 a.m. Look at that: 121 out of 14,000. That's what I'm talking about in terms of class imbalance.
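Here's a minimal version of the entropy feature as just described: a Shannon entropy function applied to the query column with a lambda over the series. The sample rows are invented:

```python
import math
import pandas as pd

def shannon_entropy(s: str) -> float:
    """Shannon entropy in bits per character; encoded or encrypted
    payloads tend to score higher than ordinary hostnames."""
    if not s:
        return 0.0
    counts = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Invented rows: one ordinary hostname, one Base32-looking exfil query.
filtered_dns_df = pd.DataFrame({
    "query": ["www.example.com", "mzxw6ytboi5dqmrugezdg.c2.example.com"],
})

# Populate the new column with a lambda over the 'query' series;
# no for-loops over rows, as the talk says.
filtered_dns_df["entropy"] = filtered_dns_df["query"].map(lambda q: shannon_entropy(q))
```

The encoded-looking query scores higher than the plain hostname, which is the separation the scatter plot later relies on.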
So here's one of the ways MITRE ATT&CK fits in: it helps me do the detection engineering, helps me look for those artifacts. Every contact leaves a trace. It helps me dig in and start dissecting things, and I have one goal in mind: I'm trying to protect this house. I'm trying to find how they did the exfil, and then see if there are any similarities I can come up with. If that works out mathematically, then maybe I can run it against everything else from May 18th until yesterday and see what I find. The point is, I keyed in on that canary URL, and now I can isolate the traces they left based on that overt, white box attack. From there (this is a little bit early; I don't really need to add this column just yet, but that's where it was and I didn't want to mess with it), what I'm doing is: filtered_dns_df, I'm creating a new column called is_malicious. This is my label. If the query contains the canary URL, is_malicious gets a value of one. Notice I did a .map; it's going to hit everything.

The other interesting thing I saw: has anybody ever seen Bro DNS logs from DNS exfil where once in a while you'll see api dot, an encrypted-looking string 200 characters long, and then dot post? Anybody have any ideas what that is? Yes sir. No, I was hoping you'd tell me, man. So basically, again, this is a pattern. I don't know what it is; it's kind of a Hail Mary when I throw that out there. Normally I'm going to make sure these variables aren't correlated, and I'm going to look at the feature importance; we won't get into that right now because of time, so I need to hurry up a little bit. If it has "post" in it, now I'm looking for a string, right? And this is the thing about the spy-versus-spy game.
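The labeling step can be sketched like this; `CANARY` and the rows are placeholders, since the real canary value is masked in the talk:

```python
import pandas as pd

# Invented rows; the middle one stands in for red team exfil traffic
# that contains the (masked) canary domain.
filtered_dns_df = pd.DataFrame({
    "query": [
        "www.example.com",
        "mzxw6ytboi5dqmruge.canary-masked.example.com",
        "mail.example.org",
    ],
})

# Ground-truth label from the overt (white-box) red team run:
# 1 if the query contains the canary domain, else 0.
CANARY = "canary-masked.example.com"  # placeholder for the real, masked value
filtered_dns_df["is_malicious"] = (
    filtered_dns_df["query"]
    .str.contains(CANARY, regex=False)
    .map({True: 1, False: 0})
)
```

This is the column that turns the problem from unsupervised clustering into supervised classification.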
Anybody in this room who sees how I'm doing this is going to come up with a different way around it. That's why I have to keep doing this, making sure the model doesn't degrade and I don't get lazy in the detection engineering.

We talked about query length. I just said: for the data frame, filtered_dns_df, I want to know about the column that has the values we computed, which is query_length. So we computed the feature, populated the column for each row, and now we have a histogram. It might be kind of hard to see in the back; I didn't blow this up, and there are some different things you can do with the scale here, but the visualization didn't look much better. You see a preponderance (trying to work that word in on a Sunday morning at every possible chance), a high number, over 14,000 it looks like, of DNS requests whose length is, what, 25, 30. And then way over on the right you see just a few, I'm guessing not all 121, maybe like 106 or so, that are around 200. Can you write a signature, a rule that says anything with a DNS message length over 40 is malicious, and flag it? What's going to happen when you do that? Yeah, it's going to light up, right? Because there's benign stuff that looks like that.

Now, we computed two things, though. We also figured out the entropy: the degree of randomness. In a moment I'll show the ranges for the values, but look at the upper right-hand corner; that's weird, right? When we look at entropy against query length, something's definitely unique about those points. So, I'm going to bust through a lot of this; I'm short on time. Thank you. If anybody has questions about this, again, I'll have the slides up by Tuesday, at Brian Genz on Twitter; I'll put out the link there. But I'm going to push through the rest of this. So I said: here are the columns that I want for features, and now I made a new data frame.
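A sketch of the two plots described, with invented numbers that mimic the shape of the real data (the benign mass on the left, a few long high-entropy queries in the upper right). The Agg backend just makes it runnable outside Jupyter, where you'd use %matplotlib inline instead:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; in a notebook use %matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

# Invented feature values shaped like the talk's data: mostly short,
# moderate-entropy queries plus a handful of long, high-entropy ones.
filtered_dns_df = pd.DataFrame({
    "query_length": [25, 28, 30, 27, 200, 210],
    "entropy": [2.9, 3.1, 3.0, 2.8, 4.6, 4.7],
})

# Histogram of query length: benign mass on the left, exfil on the right.
ax1 = filtered_dns_df["query_length"].hist(bins=20)

# Entropy vs. length: the odd points in the upper right are the ones
# worth pulling on.
fig, ax2 = plt.subplots()
ax2.scatter(filtered_dns_df["query_length"], filtered_dns_df["entropy"])
ax2.set_xlabel("query length")
ax2.set_ylabel("entropy")
```

The scatter is the key picture: length alone over-alerts, but length plus entropy separates the exfil cleanly.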
I said: I may have a new tabular data structure here, but only give me the data that's in these columns and series, and that's my new features_df. We imported some other things; again, with scikit-learn, we had that software bridge from Bro to pandas to scikit-learn, and Bro Analysis Tools is doing a lot of the transformation for us.

Again, without us doing adversary simulation, I'm stuck with clustering, because I don't have any ground truth labels. So where's the evil? Where's Waldo? Where are my buddies Zach and Matt? Anybody have any idea which cluster is malicious? You shouldn't be able to tell. You might have some ideas, but this is one issue I run into with just clustering stuff. With the MITRE ATT&CK-based simulation, I've got a column that's one if it's known malicious, because my buddy just did it, and zero if I don't know. So what I'm doing here is creating another data frame, and I'm going to push past that. Essentially, I split it into train and test sets, train the classifier model, and make predictions. I'm using logistic regression, maybe not the kind you think about from stats, since here it's used as a classifier. Then I'm asking: how well is this model going to do once we get to the results? In this case, it was 99.85% accurate when we look at the model evaluation. But that's not the whole story. Overall, in the confusion matrix over the 2,775 test records, we're okay with the top left and the bottom right; but the four in the bottom left means there were four malicious records, ones I had told the model were malicious, that we missed. So again, it has to do with your threshold and how it works. That's the model we looked at. We need more signal, less noise. This is just something I've come across; if anybody has other perspectives on it, come see me afterwards, I'd be interested to hear them.
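An end-to-end sketch of the supervised step on synthetic data with the same heavy class imbalance. The feature distributions are invented, and class_weight='balanced' is one of the imbalance corrections the talk says you may or may not actually want:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(42)

# Synthetic stand-in for the real features: benign queries are short with
# moderate entropy; the rare malicious ones are long and high-entropy.
# Note the deliberate class imbalance, roughly 1% positive, as in the talk.
n_benign, n_mal = 2000, 20
X = np.vstack([
    np.column_stack([rng.normal(28, 5, n_benign), rng.normal(3.0, 0.3, n_benign)]),
    np.column_stack([rng.normal(200, 20, n_mal),  rng.normal(4.5, 0.2, n_mal)]),
])
y = np.array([0] * n_benign + [1] * n_mal)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Logistic regression as a classifier; class_weight='balanced' upweights
# the rare malicious class -- whether you want that is a judgment call.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
cm = confusion_matrix(y_test, clf.predict(X_test))
```

High accuracy on imbalanced data can be misleading, which is exactly why the talk reads the confusion matrix instead of stopping at 99.85%: the bottom-left cell (missed malicious records) is the number that matters.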
But I just think that the more stealthy attackers are, the fewer footprints, the fewer traces they're going to leave, which makes it harder for me to hook into something.

Future work, and I'll just push through this: like I said, looking at other Bro logs, I want to generate some features based on the presence or absence of beaconing, taking the insight I'm getting out of RITA from Black Hills Information Security (Active Countermeasures now) and enriching with that, and then doing some other enrichment on IP addresses. I'm also very excited to look at Neo4j and some graph database stuff as well.

So thank you very much for coming out on a Sunday morning. I appreciate your time. I'll be in the back, and I hope you have a great rest of the conference.