We're here to talk to you all today. All right, there we go — we're starting off nicely. We're here to talk to you all about data science from where we sit in the Splunk field organization, and about what we do. Now, before we go much further — there we go, electricity — I have one thing to say: we're not going to talk about anything here that requires you to purchase anything, which is good. But even if we did, don't buy anything based on what we say, because we lie often, mostly about our math abilities.

So, my name is Ryan Kovar. I'm a principal Splunk security strategist, and I've been doing this for a while. For those of you in the audience who are good at math: we have 30 minutes to go through 81 slides. Do the math — I'm going to move quickly. Now for some not-too-well-rehearsed choreography.

Hey guys, I'm Dave Herrald, also a security strategist at Splunk. I've also been around quite a while, and I do a lot with SANS. Ryan and I — and a bunch of other people up here, David included, and a bunch of people in the audience — work on this thing called Splunk Boss of the SOC, which we would love for you to check out sometime.

And my name is also David. I'm also a principal security strategist for Splunk, and I do a lot of the same things these guys do. I'm doing my best not to talk about MITRE today — that's my personal growth goal. However, I did give a talk at Black Hat, so feel free to go back to that.

So: we know a few things. We've been doing this for a long time. There are a lot of people up here with experience doing plumbing for networks; we've done some science on data and some research and development. But we are certainly not experts in the field — and that's exactly what this talk is about. We're going to go briefly over the things we've tried in the world of data science, what it means to be a citizen data scientist, and the things we're actually trying to do. Then we'll get into the works in progress from some of our ongoing research.

All right. So as Ryan just mentioned, this talk is all about applying machine learning from the perspective of security practitioners, not from the perspective of someone who's trained in data science. A lot of these things we did to teach ourselves how to do machine learning and how to think like data scientists. We consider ourselves citizen data scientists, which is a term from Gartner that we kind of like. We're going to talk about this thing called the MLTK, a toolkit that runs on top of Splunk. We generally use Splunk because we work there and they pay for our kids to go to college and things like that, but these topics are generally applicable across a wide variety of tool sets, so don't get hung up on Splunk. The MLTK is an add-on that you install on top of Splunk — a machine learning toolkit that lets you apply all kinds of different algorithms. It's a thin wrapper over some Python machine learning libraries, and it's super useful. There's a lot you can do with the MLTK, and I'm not going to go over it all, but you can do things like numeric prediction, categorization, forecasting, outlier detection, and a whole bunch of other cool stuff. And with that, Mr. Veuve is going to go over some things we can do with that toolkit, with some of the other pieces of Splunk, or with any other analytic toolkit. So yeah, let's talk about some actual detections.
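Before the detections, a quick taste of what that MLTK pattern looks like in practice. This is a minimal sketch, not a slide from the talk — the lookup, field, and model names are all made-up placeholders. The whole toolkit boils down to two SPL commands: fit trains an algorithm and saves a model, and apply runs that saved model against new data.

    | inputlookup labeled_training_data.csv
    | fit RandomForestClassifier is_suspicious from feature_one feature_two feature_three into my_first_model

Then you score new events with the saved model; the MLTK adds a predicted(is_suspicious) field you can filter on:

    index=my_data
    | apply my_first_model
    | where 'predicted(is_suspicious)' = 1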
So part of our day-to-day life is going out and talking with customers who want to deploy machine learning for security. Security machine learning is one of my favorite subjects to walk into a customer conversation with, because most of the time they don't really know what they're actually looking for. I can talk about the magic of bespoke machine learning and how we can deploy all this stuff, and a lot of the time what they actually want is really simple statistics — which should resonate with everybody, because this is the life of data science. Most of the time in security, the things we see people getting the greatest value from are rarity detections — looking for things that are uncommon or seen for the first time, like detecting the first time somebody goes to a domain, the first time somebody launches a particular process, the first time a particular process launches with a particular file hash, things like that. Or time-series spikes — tracking when something happens more often than it typically does. A lot of this can be done with simple things like standard deviation. And we really like describing this concept to end consumers, because in that spirit of the citizen data scientist, it's not something that takes a PhD. We can have the conversation about different types of distributions and whether standard deviation is a fair measure — Kovar loves to talk about IQR versus standard deviation — but at the end of the day, what we want is to give people a ton of content that's very straightforward to deploy. And I'm sure you're all sitting here reading every item on that slide — you have to look at all of them — because when we look at these simple categories, where we see people driving tremendous value is in the ease of customization. For example, take this detection here. It looks for new error codes being generated by users in AWS data. The way it works is it takes a corpus of AWS data and says: give me the first time and the last time this error occurred for this particular user. Then we check whether the first time was in the last day — or, if you're running it hourly, in the last hour; or, if you want to get fancy, since the last time the search ran. Functionally, it's a pretty straightforward detection. And the great thing about it is that if you want to spin it in a different direction, all you have to do is change the data source and the fields you're looking at. So instead of looking at error codes generated by users, which is casually interesting, one of my favorites is looking at the first time a particular user calls a particular API. If I've never created an instance before, there should be a record when David suddenly creates a bunch of instances. If I've never interacted with a particular region and all of a sudden I start spinning up a bunch of instances in that region, that's also really useful information. So that's what we talk about at a high, high level. But how do we actually apply it? How does it really work in the real world? (And you'll note, this is customized for Las Vegas — some local polish here.) This is one of the things we've actually seen customers deploy to great effect, and my favorite example is one we could talk about at great length — but first, two quick sketches below.
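Here's roughly what those two patterns look like in SPL — our reconstructions, not the exact searches on the slide, and the index, sourcetype, and field names are assumptions you'd swap for your own environment. First, first-time-seen: new AWS error codes per user in the last day:

    index=aws sourcetype=aws:cloudtrail errorCode=*
    | stats earliest(_time) as firstTime, latest(_time) as lastTime, count by userName, errorCode
    | where firstTime >= relative_time(now(), "-1d@d")
    | convert ctime(firstTime) ctime(lastTime)

And second, a time-series spike using nothing but simple statistics — flag any hour where a user's activity runs more than three standard deviations above their own average (swap in IQR fences if your data isn't remotely normal, which is Kovar's whole point):

    index=aws sourcetype=aws:cloudtrail eventName=RunInstances
    | bin _time span=1h
    | stats count by userName, _time
    | eventstats avg(count) as avg_count, stdev(count) as stdev_count by userName
    | where count > avg_count + 3 * stdev_count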
But as we mentioned: 81 slides, 30 minutes — no time for great length. My favorite example is from JPMorgan Chase. They presented at our user conference in 2017, and I have happily stolen the slide from them. They built a system for monitoring all of their critical users — whether it's an exec, a sysadmin, or somebody who can send a lot of money someplace, they want to put extra attention on those types of users. What they did is deploy a bunch of different first-time-seen detections: the first time an email comes in from an address, the first time a particular domain sends an email, the first time they see an attachment type — a PDF for the first time, or whatever. The first time a service gets installed, the first time the registry gets modified, et cetera, et cetera. And — this is in the Splunk nomenclature; we use SPL as our search language — it's the same search that gets applied for each one of these; you just change the fields and change the data source. The same idea applies on all other types of data analytics platforms as well. They then aggregate all of these risky, new events and look for activity occurring across multiple different phases of — in this case — the kill chain, because they presented this in 2017, before MITRE (I broke my rule; I wasn't going to say MITRE) became the buzzword du jour. The functional idea is: if we see a user with a new sending address and a new process created and a new connection and new attempted access, that's a lot of suspicious things at once, and we can say this should be analyzed immediately (there's a rough sketch of that aggregation after this section). It's really useful. But of course, we're presenting at AI Village, so we like machine learning too, and everything we're going to talk about for the rest of this presentation is actual machine learning. My favorite — because it was my first — machine learning detection, which I worked on with one of our data scientists, was this great buzzword bingo of a description: it uses standard deviation and IQR (so it can make Kovar happy), it uses k-means clustering, it uses PCA — lots of different fun things in here — to look for anomalies in Salesforce data. Which, in this case, meant me, because I had pulled something like 38 million results out of our Salesforce data, looking at opportunities and things like that, while trying to learn how to use Tableau. But it was a good use case, and we were actually able to find users acting anomalously, in a multivariate way, in Salesforce data. I felt particularly great about it because I had 110 million audit events from 90,000 different users, and I scoped my IQR down to 291 outliers — and 291 outliers over the course of 90 days is about three outliers a day. That's a decent number. That's something I can actually give to a SOC analyst without overwhelming them with noise. I felt great, I felt super happy — and then I actually looked at what I'd be sending to a SOC analyst, and it's a bunch of z-scores: how many standard deviations somebody was away from their own average on that particular day. No SOC analyst will use this. They'll see it, say "that's fun," throw it aside, and go deal with a vulnerability scan or something else that's also terrible to deal with but less complicated. So we said: okay, wait — why don't we make something a little bit simpler.
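Before moving on, here's that JPMorgan Chase-style aggregation step, roughly sketched — this assumes each first-time-seen search writes its hits into a summary index with a user, a detection name, and a kill-chain phase; every name here is a placeholder, not their actual implementation:

    index=risk_summary earliest=-7d
    | stats dc(kill_chain_phase) as phases, values(detection_name) as detections, count by user
    | where phases >= 3
    | sort - phases, - count

One new-sending-address event on its own is noise; the same user tripping first-time-seen detections across three or more phases in a week is worth an analyst's immediate attention.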
And when we're looking at simplicity, my favorite of these is kernel density — and not just because this meme makes me extraordinarily happy. Kernel density estimation, for those who don't know, is a machine learning technique that lets us model the distribution of our data and flag values that are very unlikely under it. And frankly, I don't care about the details of the machine learning, because I'm the citizen data scientist, not the actual data scientist. What I really care about is the ability to put a single piece of SPL up on the screen that works, lets me build a baseline, and then add one more command — invoking the Splunk Machine Learning Toolkit — to look for anomalies in the amount of data being sent out in our own firewall logs (sketched below). To validate that this actually worked — this was back before GDPR hit, and we still had access to some internal data — I tracked users sending data out of Splunk's own environment, and found one user who uploaded 130 gigs of data to Dropbox on a Friday afternoon, blowing out all of the per-hour baselines we'd established. It's totally legit. You can ask anybody: Dropbox is for sure a product that we use, definitely not something that is not a Splunk product, so you don't have to be concerned at all about that. It's fine. But this is a great result. It's clearly valuable, it's clearly understandable for an analyst, and it's literally one extra line of SPL added to a query I already used. That's the power of exposing these kinds of capabilities to the citizen data scientist: making it very easy to do. But of course, on our team we also like to do research. So let's talk about the research.

All right. This is a short overview of a talk that Ryan and I did at the Threat Intelligence Summit, I guess a year and a half ago, so you might want to check out that talk if you're interested in what we're about to give just a little taste of here. The inspiration for this: there's a gentleman, Mark Parsons — if you don't follow him, he's an amazing researcher and threat intelligence analyst who works at Microsoft — who had been doing a lot of work analyzing SSL certificates, using them as a new way to track adversaries and track behaviors, especially considering that most of the traffic we see on our networks today, when we look at wire data, is encrypted. That takes away a lot of visibility — but one thing that's left is the SSL certificates themselves. He's done an awesome amount of work on this topic, and he kind of inspired us to do some similar things in our tool set. Obviously everybody here knows, I think, what an SSL certificate is, but it's basically an artifact that binds a cryptographic key to some information in the subject of the certificate. It's super important for SSL, obviously, and we have visibility into it when we look at network data. We thought it'd be interesting to teach ourselves a little about machine learning and maybe do something useful with this huge data set. So we asked a very simple research question: can we build a model that predicts inclusion on an SSL blacklist? We have an SSL blacklist that comes from abuse.ch, and we have a huge corpus of SSL certificates that comes from Rapid7's Project Sonar.
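To make the kernel-density detection above concrete before we go on: here's roughly what "one extra command" means in MLTK terms. This is our sketch under assumptions — firewall logs with user and bytes_out fields; the index and model names are placeholders. The baseline search fits a density model per user over hourly outbound totals:

    index=firewall action=allowed earliest=-30d
    | bin _time span=1h
    | stats sum(bytes_out) as bytes_out by user, _time
    | fit DensityFunction bytes_out by user into user_upload_baseline

The detection is the same base search plus apply — the MLTK adds an IsOutlier(bytes_out) field marking values that are wildly unlikely under that user's fitted distribution, like 130 gigs to Dropbox on a Friday afternoon:

    index=firewall action=allowed earliest=-1h
    | bin _time span=1h
    | stats sum(bytes_out) as bytes_out by user, _time
    | apply user_upload_baseline
    | where 'IsOutlier(bytes_out)' = 1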
You should check out both of those projects — Project Sonar especially is extremely interesting if you're looking for a big data set. We took that and said: hey, we essentially have labeled data, in that every certificate is either on this list or not, and we have a huge corpus. We thought this would make an interesting first foray into supervised machine learning. We started off down the trail of feature selection. In Splunk, at least, it's nice because you can look at the data like on the left, which gives you a view that's easily consumed by humans, and on the right, a tabular view that we can feed into our machine learning model as features. And we did some very simple things — we started simple, right? Number of certificate extensions, number of issuer elements, number of subject elements, et cetera. You can read them off: a very simple list of quantitative features that we began to use. Honestly, we really didn't modify this list much, and we got a lot of mileage out of it, which surprised me. We put that into the Machine Learning Toolkit, did some PCA — principal component analysis, which I don't have a slide up here for — to figure out which of those features actually had an impact on the data, and then fed that into a whole bunch of different machine learning algorithms. And honestly, I'm a citizen data scientist, so I don't know the ins and outs of every single one of these. But I can tell how they perform against one another, so I can say: this support vector machine showed up with a high accuracy rate and a low false positive rate, which was super cool. We built the model. It didn't take too long — it took a long time to load all that certificate data I mentioned earlier, but building and applying the model didn't take long. This is the kind of thing you can do very simply with SPL. The SPL shown there maybe doesn't look simple, but if you read through it, it's actually not that complicated — it's quite repetitive; there's not much real complexity. And the cool thing was we found tons and tons of certificates that we predicted would be on the blacklist and that actually were. And this was very useful, because it could add context. Remember that we had that 4% false positive rate — if you're running this across a corpus of millions and millions of certificates, that's not that useful on its own, because 4% of a giant number is still a giant number. It is, however, super useful for adding context. In our day jobs, what we're trying to do is help customers see an alert and understand whether it's something they can ignore or something they need to investigate more, and to do that, you do a lot of enrichment. This is a great way to enrich data. You can say: this alert has certain characteristics — is it important to me or not? And you can also say: well, that certificate is not on a blacklist, because it would probably have triggered a match filter if it were. But we can also say: you know what, it shares a lot of characteristics with other certificates that are on that blacklist — see the sketch below.
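A sketch of that round trip — our reconstruction, with hypothetical lookup and field names. The label comes from whether a certificate matched the abuse.ch list, the features are the simple quantitative ones from the slide, and SVM is one of the MLTK's built-in algorithms:

    | inputlookup sonar_cert_features.csv
    | eval on_blacklist = if(isnotnull(abuse_ch_match), 1, 0)
    | fit SVM on_blacklist from num_extensions num_issuer_elements num_subject_elements validity_days into ssl_blacklist_model

Enrichment is then just applying the saved model to certificates you see on the wire; the MLTK adds a predicted(on_blacklist) field you can attach to an alert as context:

    | inputlookup new_certs.csv
    | apply ssl_blacklist_model
    | where 'predicted(on_blacklist)' = 1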
And that shared-characteristics signal gives us a little extra context, so we can be a little quicker in how we triage that event. That was an extremely fast overview — again, if you want a lot more detail about the process, definitely check out the talk we referenced earlier; we have a link here. And now I'm going to let Ryan talk a little about the whole goal of this.

Everything we're doing is to reduce your data set. I don't know what the hell Dave was just talking about for five minutes with PCA — I can't do any of that — but I can read a computer screen and look through it. What we're trying to do is really find the bears in your network, right? Because when you start off with tens of millions of events and you can narrow down to the things that actually match, that's where you start getting real return on investment from machine learning across a real data set. So what we learned: this stuff doesn't solve cyber, but it can really help you add context and bubble up the things that matter in your alerts, or whatever else you're looking at. So we had this going on, we tried it, and we had a lot of good success — and remember, the whole goal of this exercise for Dave and me was just to learn. Could we actually do data science? Could we actually do machine learning? I am literally someone who failed math in first grade, so for me this was a real challenge, and I found it was actually doable. And we had results that worked, compared to the bullshit marketing stuff we've all seen, right? So this was pretty exciting. So we said: all right, what's the next ML challenge? Let's go back in history a little bit — we have a bit of a theme here with bears. Mr. Podesta once received an email that looked just like this. And then there's another email that looks just like this. Now, there's one big difference between these two emails: one has a bitly link to go change his Gmail password, and the other goes to Google Mail. There is virtually no other difference in the apparent legitimacy of these emails — one bitly link versus one Gmail link. If you have a junior analyst who's maybe never seen this before, you're going to have some problems, because they are that closely matched. I've been doing this for 20 years; I can look at an email and something just feels weird. What's really happening is that I'm mentally running through these models, literally looking through it and finding the fake — we didn't know we were doing deep fakes this week, but this is the original "spot the difference" we all grew up with, right? And the difference sometimes is just a little sliver. It's really hard for a junior analyst — or a senior analyst — to look at that and know (see the sketch below). So we came up with an idea. Let's keep applying this research — we're still in the middle of some of it — and see if we can build a model to predict APT emails. I'm not going to get into nation-state versus commodity or anything like that. The idea is just: can we predict things that look different from the porn links or church emails or whatever else comes in with links that get through your spam filters. So here's the plan of action. We're going to look at these emails, and specifically at their headers, because SMTP headers are just a hell of a corpus of information, right?
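As an aside: the one tell in the Podesta example — a shortener link where a Google link should be — is the kind of thing you can at least surface mechanically for that junior analyst. A minimal sketch, with made-up index and field names, that pulls URLs out of message bodies and flags known shortener domains:

    index=email
    | rex field=body max_match=0 "(?<url>https?://[^\s\"'<>]+)"
    | mvexpand url
    | eval url_domain = replace(url, "^https?://([^/:]+).*$", "\1")
    | search url_domain IN ("bit.ly", "tinyurl.com", "goo.gl", "t.co")
    | table _time, sender, subject, url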
And what I did for feature selection is I got a whole bunch of senior analysts a little drunk around a whiteboard, and we just went through and asked: what gets your hinky filter going? What makes something click for you as weird in an email? That was my method of feature selection. So these are all the features: things that some of the really senior, badass analysts in the world have identified as what they've normally seen APT emails use, modify, or make look different. If you've ever seen some of these emails come in, you'll notice, for example: if someone really is a legitimate Outlook client on a corporate desktop, and the mail goes to a legitimate desktop in another corporation, you're going to see Proofpoint headers from one side — or Postini, if you're of a certain age — you're going to see it route through Gmail and come back down, and it's going to have all these awesome headers. But if they sent it from a DigitalOcean droplet with a Python script, it might have like five lines of headers, and that's it. Because SMTP is the sluttiest protocol that exists — you can just about say, "hey y'all, want to chuck an email on over?" and there it goes, right? So there's a lot of variation there, which is awesome for ML, because now we can actually select features that are interesting. So then we went through and started working on this, and here are some things we identified from an Office 365 corpus — we weren't really happy with all the data we could get out of there, so we kept looking. Then we compared it against the real things we wanted to focus on from Office 365. I'm going to throw this in just because I got angry at Office 365 and had to do this: old man Kovar yells at Microsoft cloud. The reason being, they get rid of all those headers — if you're trying to bring the data into your network, you have to do a lot of work to get that information back, and as a network defender, I need that data. So we turned to something called stoQ. If you're not familiar with stoQ — there are a lot of presentations on it — it's a file analysis framework, and one of the things it does really, really well is take data off the network, rip it apart, and turn it into a beautiful JSON blob. That JSON blob looks a little bit like this. Because at the end of the day, email is just key-value pairs, and that's beautiful for JSON. And remember all those headers from before? They just become values in an extensible format. So we kept going through this, and something I found: when I looked at email from stoQ — the exact same data set — I had 131 different SMTP fields I could actually do feature extraction on, versus what Office 365 gave me. So, like we said, this is our work in progress — we're literally walking you through how we're doing the research. We decided to go with stoQ. The idea here: we take good emails that we know are good, maybe from our personal accounts or things we're actually sending, and then we need a corpus of known-bad APT email. There are various ways to get that, one of which is going into VirusTotal and looking for emails that people have already identified; we pulled something like 7,000 items out of there looking for IOCs.
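Here's a sketch of turning that stoQ JSON into model features. The smtp.* paths are our assumptions about the output shape, not stoQ's documented schema — treat them as placeholders. The point is that the analyst tells become numbers: how many Received hops, whether there's an X-Mailer header at all, and so on:

    index=email sourcetype=stoq
    | spath path=smtp.received output=received
    | eval received_hops = coalesce(mvcount(received), 0)
    | spath path="smtp.x-mailer" output=x_mailer
    | eval has_x_mailer = if(isnotnull(x_mailer), 1, 0)
    | table _time, received_hops, has_x_mailer

The droplet-and-Python-script email shows up here as one or two Received hops and no X-Mailer, where legitimate corporate mail tends to have many hops and a pile of vendor headers.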
Side note: please stop uploading your emails into VirusTotal. I'm not joking — people are doing this in an automated way. So stop doing that, if you're in the audience and doing it. However, it's great for me, because I get some real data. So, on to problems we've experienced doing this. Am I just recreating a goddamn spam filter? Yes, I am. But it's my spam filter, and I get to tune it how I want, to look for the things I care about for my work. And that's really important to me. Second, I speak English, pretty poorly, so all the work I do is going to be inherently biased toward American English for an American organization. But that goes back to the same point: it's my spam filter and I'm tuning it for me. If you're a Spanish speaker, if you're Czech, if you're Polish, whatever it may be, all of this applies as well — most of the toolsets we use, like NLTK, the Natural Language Toolkit, have plugins for different languages. As we said, we need a labeled corpus of data. We've mostly gotten over that — we figured out a solution with VirusTotal. However, for those of you who've done real research in this, I'm not really sure my labeled corpus is going to be great. What I'd love is for someone at a giant organization who has a real labeled corpus to take the research we're doing, try it out, and see if it works. Next: at least for the Office 365 side, I'm having a real problem keeping up with the documentation, and the same goes for Google Mail. When you try to put this into production and get the headers out, you're going to run into problems — my research has already broken twice because of API changes from both Google and, sorry, Office 365. So keep that in mind as you go. As we said, this is a bit of a work in progress. The other part is, as Isaac Newton once said: cloud data at rest tends to stay at rest. It can be very expensive, or very difficult, to pull it out of the cloud. Fun things like: if you're using Google Mail, there's a 24-hour delay before you can pull mail out of the cloud, which is not exactly great for network defenders — it gives attackers 24 hours before you can find them. And we also have day jobs; this isn't our full-time work, so we're a little behind on some of the research we want to get to. However, we have had some initial successes. I call this the Texas-Ghanaian dataset. We set up — I don't want to say honeypots, but we put some email addresses out all over the place and started getting a whole bunch of spam in, and I started reading through it. The two major sources I saw were Texas churches, which made up a significant share of the spam, and Ghanaian hackers. I have no clue why. But we started running some of our ideas against this — n-grams, breaking apart the things in the emails and headers that might be of interest — and immediately we found things that were pretty interesting, like "receive," "library," "capital," "Ghana," "holding" (which I'm not really sure what that is), "talk," "contact," and "message." Because a key part of this, when you start doing NLTK work, is that every spear-phishing email has certain things that are near-identical: ASAP, as soon as possible, click, execute, run, download, unzip, zip, password, immediate, invoice.pdf — so many invoices and PDFs, right? These should be statistically different from the things you're normally sending. You can start analyzing that grammar: you can look for forced verbs, weird conjugations, left-justified greetings — see the sketch below.
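The real language work happens in NLTK outside of Splunk, but here's a rough single-token version of the same idea in SPL, assuming a hypothetical body field from the stoQ output: tokenize the corpus, count term frequency, and compare what bubbles up in the spam corpus against your known-good mail:

    index=email sourcetype=stoq
    | eval tokens = split(lower(body), " ")
    | mvexpand tokens
    | rex field=tokens mode=sed "s/[^a-z]//g"
    | where len(tokens) > 3
    | stats count by tokens
    | sort - count
    | head 50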
If you're over 35, you probably learned to write "Dear Ryan, comma, carriage return, tab" — try doing that in a Python-scripted email, right? Those are the kinds of tells I notice as an analyst that are almost impossible to codify without some sort of machine learning toolkit. So, in conclusion, on slide 76 of our 27 minutes of talking so far: we are simple country data scientists. We are network defenders. We are not data scientists — none of us has a math degree. This is all something people in this room can do: if you're not a data scientist, if you're a network defender who wants to learn how to apply math, everything we've talked about today, from Splunk at least, is free. You can download Splunk for free, you can install the MLTK app for free, you can put a giant data set into Splunk and start doing your own personal research. Everything's out there. I'll be talking later today, here, about taking our capture-the-flag data sets that we've open-sourced — six gigs and 20 gigs of network traffic data — and doing analysis against them. Take the DARPA data set from 1999, the NFS data set from 2009, the MACCDC data set from 2012 — there are a lot of data sets out there for you to start learning how to do machine learning against data that should look very similar to yours. Because when people say "I want machine learning" or "I want AI in my network," what they really want, honest to God, is statistics. They also usually don't know what their data looks like — which is why we harp on IQR over standard deviation, right? Very simple thing to learn. Multivariate predictions versus seasonality: I used to love this one, because for various reasons I used to get a break every February when I worked for the Department of Defense — for some reason, something happens in the middle of February and I just didn't get spear-phished. Look at that, right? Second, just apply a little bit of magic. That magic might be machine learning, it might be something else — we haven't entirely figured it out, but it's all there. Takeaways: you can absolutely do this, because we three average Joes did it, right? If you have a network, you can put tools together, and you can write Python at all, this is all achievable. Data is everywhere. It wants to be free. It wants you to run things against it and draw conclusions. Why not go for it? And I'll make a call-out to Splunk, and to every other vendor: you need to standardize your shit, and you need to make it available and possible for everyone to do this, because that's the only way we're going to win these battles and actually help people defend their networks. So think about that — if you're in the vendor space, if you can effect change, go yell at your vendors. We should be standardizing the output of logs so that instead of spending all our damn time processing and reshaping data, we can actually draw conclusions, run these analyses, and defend our networks. So special thanks to a lot of different people who've informed this talk, especially Mark Parsons, James Cole, and James Elliott; a special group I love, IKBD; Rapid7; and Censys — if you're not familiar with Censys, they have awesome data.
John Landkal, Marcus LaFerrera, and Lauren Deeson at PUNCH Cyber; Phillip Drager; and of course Splunk, for letting us do fun stuff like this. I'm Ryan. I'm David. David. Thank you. 81 slides, 44 minutes.