Okay, I'm going to go ahead and get started even as people are coming in. This talk is "Identifying and Correlating Anomalies in Internet-Wide Scan Traffic to Newsworthy Security Events," also known as the longest title of a talk I've ever given in my entire life. My name is Andrew Morris. I work for a company called GreyNoise. All the slides are supposed to be gray; now they're purple, so I guess I work for a company called Purple Noise. Again, my name is Andrew. I work at GreyNoise. I also am GreyNoise. There's no one else in the company; it's literally just me. It's kind of an open secret, and I like referring to the company as "we" because it makes us sound more legit and professional, but make no mistake: it's a guy in an apartment doing all of this. Before I started GreyNoise, I was on the R&D team at Endgame, and prior to that I was mostly private-sector red team doing various different things. I've been staring at internet scan traffic for a very, very long time. I'm not a data scientist. I'm not good at math, I'm not good at stats, I'm not good at machine learning. But I had to learn a little bit of very basic statistics to do the thing I'm trying to do and to write this big old bastard of a SQL query that I'm going to show you. So today I want to talk about the process of going from a bunch of firewall logs in disparate systems to some kind of indicator of an anomaly that can actually be correlated with a real event that makes sense.
So: going from a bunch of Apache and firewall logs across a lot of systems to, "Hey, at this time, in this place, there was a giant uptick in people scanning and probing for this thing, and it probably had something to do with this vulnerability, or this event that happened," right? And the way I did it is with this. That's it. This giant-ass SQL query. Will you do me a favor? If anyone in the room has already heard of GreyNoise, will you raise your hand? All right, so maybe 3% of the people in the room have already heard of GreyNoise. We do this thing, and I'm always going to say "we," it's just hardwired in my brain, where we write these tweets any time we see an uptick in scan traffic that is explainable and has value, and this is how we do it. We just run this query. It takes forever, but this is it. And I'm going to break this whole thing down and get there. So, what is internet background noise? It's basically the baseline omnidirectional scan traffic generated by all the people scanning the internet: the Shodans and the Mirais and the Censys-es and the WannaCrys. Everybody scanning the internet creates this thing called internet background noise. And what does scanning the internet actually mean? It just means querying all four-billion-odd routable IPv4 addresses, sending each of them a SYN packet or a UDP probe or something like that, to look for a certain open port or protocol. Way back in the day, if you had one box and you wanted to figure out what was running on it, you'd port scan it: one IP, many ports, and you'd get information.
Now, mass scanning is the inverse of that: one port, every IP, to figure out what's open on the internet. So why would people scan the internet? There are many reasons: find exposed devices, measure risk exposure, figure out how many Apache servers there are, how many hosts have this port open, how many places are running this version of some software, or "I just want to hack a bunch of devices." Who actually does it? Lots of good guys: Shodan, Project Sonar, BinaryEdge, Censys. I have many giant lists of all the labels I've built, and I'm going to go through that. A lot of bad guys too: Mirai, WannaCry, Satori, Muhstik. What's an anomaly? Anything more than we're used to seeing, an actual uptick, some amount above baseline. And what's a newsworthy event? Just a thing that happens that people talk about, something of interest. This is one list of the labels we've already made in GreyNoise right now. In GreyNoise, we're looking at all of this omnidirectional internet-wide scan traffic and trying to label all of it. These are some of the actors that do it, and here are even more; I took these screenshots right before I came up on stage. So before I go any further, I'll explain what GreyNoise is and how it works. It's a system that collects and analyzes all internet-wide opportunistic, omnidirectional scan and attack traffic. Why? Because every single one of those packets flying around the internet has its own unique and special story. And that's actually true: there's no such thing as random noise that happens for no reason. Every single thing is explainable.
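That "one port, every IP" pattern can be sketched as a simple connect-scan loop. This is only an illustration: real mass scanners like masscan fire raw SYN packets asynchronously instead of completing full handshakes, and the `tcp_probe`/`scan_block` names here are my own, not any tool's API.

```python
import ipaddress
import socket

def tcp_probe(ip: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if `ip` completes a TCP handshake on `port`."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

def scan_block(cidr: str, port: int):
    """One port, every IP: walk a whole block probing a single port."""
    for host in ipaddress.ip_network(cidr).hosts():
        if tcp_probe(str(host), port):
            yield str(host)

# e.g. list(scan_block("192.0.2.0/24", 445)) -> hosts with 445 open
```

Scaling this naive loop to all routable IPv4 space is exactly the volume problem discussed later; serious scanners randomize target order and send asynchronously.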
Everyone thinks it's just random garbage, but it's not. You can look at the trends of this internet-wide backscatter and scan traffic to figure out what's going on, to peel back why people are looking for these things. And from a security operations center's perspective, if you're a network defender looking at your network's firewall logs, you need to be able to differentiate between the things that matter because they're hitting you specifically, versus the things that are just hitting everybody on the internet. How many times have people responded to an incident that ended up just being some Chinese SSH bot? This is not a big deal; this thing is hitting everybody. This isn't some sexy APT thing. So really, what we're trying to do is provide rock-solid negative ground truth of what everyone should expect to see. That's actually something Alex Sierra over at Verizon said two days ago, and I love that phrase, "negative ground truth." That's exactly what we're trying to do. The technical mission statement of GreyNoise is: label every single internet-wide scanner as either good or bad. Take the category of "everyone" and try to label as many of them as humanly possible: what are you doing, and why are you doing it? Here's a state of the union on that. Right now, "good" is a fraction of a percent, about one tenth of a percent of internet-wide scanners. "Bad" is about 10 to 20%: it's labelable where we know they've done something violating the Computer Fraud and Abuse Act, they're logging into a system that doesn't belong to them, or they're slinging an exploit.
Or they're compromised and slinging an exploit on behalf of somebody else, and so on. And then "unknown" is everybody else. That's the grey noise, or I guess the purple noise. So how do we do it? We have a big network of nodes in a gajillion different networks all around the internet. They're constantly shifting around across AWS, Google Cloud, DigitalOcean, and all these different providers. They have no business value whatsoever, and they just hang back and wait for people to talk to them. They're completely passive. It's like a ton of people with their ears to the ground listening for these tiny, minute signals, then aggregating all of those together in one place and doing all this labeling and analytics to find the value in it. So again, we want to go from "all of the traffic that's hitting everyone" to something actionable: "hey, this thing happened, and this is probably why." This is what the raw data looks like a lot of the time: just iptables logs, ripped straight out of GreyNoise; at some point, a long time ago, those were GreyNoise nodes. So why is what we're doing challenging? What do we have to overcome? First, when you're doing this kind of thing you need a very diverse optic, a very diverse set of data. You need data in a lot of different places, and you need to make sure the anomalies you're measuring are observable across an equal number of the places you're watching, which is to say you have to avoid collection bias. If you have one little optic in one place and it starts seeing all kinds of crazy stuff, that is not an internet-wide anomaly; that's probably just somebody scanning your machine.
You need to be able to correlate across many different places that are different kinds of networks. It's insufficient to have only cloud providers; you really need residential IP space and business IP space as well, because there's an idea of macro-targeting that mass scanners do, which I'm not even going to get into right now. Another reason: you have to make sure you're getting an unbiased opinion from your data. You can't just install a honeypot on a network that you own, because that device has business value, and if something has business value, then bad guys are going to pay close attention to it; they're not just going to accidentally stumble across it. You need things that have no business value. You also need a lot of data, and managing that volume can be difficult, especially if you're not used to dealing with relatively large amounts of data, putting them into databases, and querying them. And then money: who pays for this kind of stuff, right? For me, GreyNoise is all I do. And what I'm talking about right now is more of an R&D thing; it's less of a thing that's easy to package into a product and make money with, which is mostly why I'm lobbing it up for free for everybody. So my proposed solution is: collect all this stuff, put it into a database, and average it over time.
Then, when we see any more unique IPs than two or three times the normal expected amount over some period of time, have it tell you, do a little research, and try to tie it to some kind of event. So you're going to parse out the data, and this is really all you need: the time, the source IP, the destination IP, the protocol, and the port. That's it. And if you really want to do this on a budget, you can use a unique constraint: you don't need to record any of these data points more than once. Once you've seen somebody scan a box you own on a certain port at a certain time, and then they do it again, you don't need to record it again. And then what we're going to do is: given a 30-day rolling average of the unique IP addresses scanning the internet for a given port/protocol pair, show me any increase in unique IP address count that is higher than three or four (or whatever) times the rest of the month's day-to-day baseline. This is where things get a little tough for me, because I'm bad at math and I'm still grasping this stuff myself. But: what is the average daily number of unique IP addresses scanning the internet for a given port/protocol over the past 30 days? And what is the difference from yesterday to today: is it a 0.9x multiplier of that rolling 30-day average, or a 1.1x multiplier? Then show me everything above four times the regular 30-day average. I can see some people right now thinking "this is so simple," and it is. I know this is easy stuff, but it's really, really effective when you have good, clean data.
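As a sketch of that budget-friendly five-field layout with a uniqueness constraint (SQLite purely for illustration; the table and column names are my own, not GreyNoise's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE scan_events (
        day       TEXT    NOT NULL,  -- date granularity is enough
        source_ip TEXT    NOT NULL,  -- the scanner
        dest_ip   TEXT    NOT NULL,  -- which of our nodes got hit
        protocol  TEXT    NOT NULL,
        port      INTEGER NOT NULL,
        UNIQUE (day, source_ip, dest_ip, protocol, port)
    )
""")

# Repeat sightings of the same scan on the same day are dropped for free.
row = ("2018-08-01", "198.51.100.7", "203.0.113.9", "tcp", 445)
for _ in range(2):
    conn.execute("INSERT OR IGNORE INTO scan_events VALUES (?,?,?,?,?)", row)

print(conn.execute("SELECT COUNT(*) FROM scan_events").fetchone()[0])  # 1
```

The `INSERT OR IGNORE` plus the `UNIQUE` constraint is one cheap way to get the "record each sighting only once" behavior described above.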
And then there are all kinds of statistical tradeoffs. If you decrease the window from 30 days to a week, you'll detect the anomaly faster, but it's going to be less accurate: more chaotic, sporadic, and volatile. And if you raise the threshold for the number of IPs you need to see, you're going to miss some of the smaller anomalies. So how did I actually do it? I did it in SQL, which is the literal worst possible way you can do this, because I hate myself. And it looks like this, so I'm going to break the whole thing down. I'm not a DBA; I'm not a lot of things today, but I'm really not a DBA. First, we take a window of the last 30 days, with the date, the port/protocol, and the number of distinct IPs we've seen hit one of our nodes on each day: WHERE date > current_date - INTERVAL '30 days'. That's easy to understand. Then we mash it all together, but with a HAVING clause, basically just to avoid another gross subquery; it filters the groups after the GROUP BY. It says: only count a unique IP address when it has touched more than one distinct node. Don't show me anything that's only touched one of our devices; it has to be two or more. That's the HAVING COUNT(DISTINCT ...) part.
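A condensed sketch of that first stage, runnable here through SQLite. I've used a subquery rather than the talk's single HAVING trick, invented the table and column names, and dropped the 30-day WHERE clause so the demo is deterministic; the filtering logic (a scanner only counts once it has hit two or more of our nodes) is the part that matches the description.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scan_events (day TEXT, source_ip TEXT, dest_ip TEXT,
                              protocol TEXT, port INTEGER);
    INSERT INTO scan_events VALUES
        ('2018-08-01', '198.51.100.7', '203.0.113.9',  'tcp', 445),
        ('2018-08-01', '198.51.100.7', '203.0.113.10', 'tcp', 445),
        ('2018-08-01', '198.51.100.8', '203.0.113.9',  'tcp', 445);
""")

rows = conn.execute("""
    SELECT day, protocol, port, COUNT(*) AS unique_ips
    FROM (
        SELECT day, protocol, port, source_ip
        FROM scan_events
        GROUP BY day, protocol, port, source_ip
        -- only count a scanner once it has touched 2+ distinct nodes
        HAVING COUNT(DISTINCT dest_ip) > 1
    )
    GROUP BY day, protocol, port
""").fetchall()

print(rows)  # [('2018-08-01', 'tcp', 445, 1)] -- .8 hit only one node
```

Here 198.51.100.7 touched two nodes and counts; 198.51.100.8 touched only one and is filtered out.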
So now that we've put that into its own little list, show me the average. I don't understand window functions very well; this is very much cobbled together from Stack Overflow. But basically, at the end I want to see the days, the port/protocols, the unique IP counts, and the 30-day rolling average for that given port/protocol on that day. And out of all of those, the final SELECT of date, protocol, port, IPs, and round(ips / mean) shows me everything that is above a 4x multiplier on the 30-day rolling average in the last five days. I never want to do that again. How could we make it better? By doing literally anything else. This takes forever to run. It's inefficient, it's in SQL, it's gross, it's slow, it's not real time, and it's limited to dates, not times, since we don't keep timestamps. It also uses very few factors: an anomaly here is dictated purely by how many unique IPs are looking at something, but there are way more ways to define an anomaly, like an ASN scanning the internet for a certain thing, or a given organization, or boxes that look a certain way, et cetera. And I'm sure any decent anomaly-detection library could just be crammed into this and would immediately do everything I spent the last six months working on. So then there's the correlation piece: okay, why did you see that? What cause most likely explains that increase?
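Once the daily distinct-IP counts exist, the detection step itself fits in a few lines of plain Python. A trailing-window sketch (function name and example numbers are illustrative, not from the talk):

```python
from collections import deque

def flag_upticks(daily_counts, window=30, multiplier=4.0):
    """daily_counts: (day, unique_ip_count) pairs for one port/protocol,
    in date order. Yields (day, count, ratio) for any day whose count
    exceeds `multiplier` times the trailing `window`-day average."""
    history = deque(maxlen=window)
    for day, count in daily_counts:
        if len(history) == window:
            mean = sum(history) / window
            if mean > 0 and count > multiplier * mean:
                yield day, count, count / mean
        history.append(count)

# 30 quiet days at ~10 scanners, then 120 on day 31:
series = [(d, 10) for d in range(30)] + [(30, 120)]
print(list(flag_upticks(series)))  # [(30, 120, 12.0)]
```

Shrinking `window` or `multiplier` trades accuracy for speed exactly as described: a shorter window reacts faster but is more volatile, and a higher multiplier misses smaller anomalies.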
Most often it's a big botnet that has just operationalized a new vulnerability and wants to capture as many devices exposed to it as humanly possible. Sometimes it's because a new CVE comes out, like Heartbleed or Shellshock, and everyone and their grandma goes, "let me scan the internet to figure out how many devices are vulnerable to this thing." Which is funny, because whenever this happens, the big first wave is always security researchers; you can see them all using the standard security-researcher tooling. Then the bad guys are way slower. They come in months later, because it's all these bottom-of-the-barrel bad guys for whom it's a numbers game; they're just trying to compromise as many devices as possible. It's always six months later before they get around to figuring out how many boxes are exposed, and there's no big jump. It's always the people in this room doing it first, basically. Or a worm breaks out, which we've seen with Chimay Red. Way back in the day, the great example would be Conficker and MS08-067. If we had had the same optic at the time, we would have seen that on one day some number of people were scanning for port 445, and the next day that number times a quintillion were scanning for port 445. We see other worm stuff now too, which I'm not going to get too deep into. So, some easy hacks for the correlation piece: Google the port number, or search GitHub for the port number, to try to associate it with something you're seeing.
Look on Exploit-DB for new exploits for a given vulnerability. Use Metasploit to figure out the default port number for different things; that's an effective tactic. Look at the CVE: sometimes the actual CVE page will contain the port the vulnerability is on. Most of the time it doesn't, almost none of the time actually, but one of the linked resources will include that information. Check out RouterSploit, which is like Metasploit but only for routers. So here are some of the things we found, some observations over the last seven months. These are just screenshots of tweets, but I'm going to go through a couple of the success cases. Top left was port 52869. We weren't the first to catch this; 360 Netlab was, and shout out to them, they do great work. It was a universal plug-and-play service that Satori had weaponized. So we tagged all of the new actors that were suddenly doing that and figured out a signature so it could be tagged as bad. Top right: GreyNoise observed port 5555. This one's whack. It's ADB, the Android Debug Bridge, which is what you use when you're debugging your Android device. It's not technically a vulnerability; it's a design thing where you can get arbitrary code execution if you have access to that daemon. Back in February, we detected a big uptick of people looking for it, but there was no real smoking gun at the time. Then, maybe a month ago, we saw people actually start to exploit it on the internet. Bottom left: the Belkin N750. That was some crappy router with a vulnerability; we started seeing this giant uptick, dug into it, and found that. And then the D-Link 2750B on port 8000.
We saw a 1,000% increase from one day to the next, and it was like, all right, something's going on here. This was the ADB thing; this is what the actual numbers looked like, how many people on average were scanning for some of these. The reason I'm putting a news article in here is not to show how cool we are; it's to show you that this works. It's effective. We wrote this up, and what we saw matched perfectly with what the rest of the world was seeing, with what other people were reporting. The same thing happened here. This one was gnarly: Chimay Red, the MikroTik vulnerability that I think might have been a Vault 7 thing. It was a nasty one, the uptick was gigantic, and no one was really looking for it. The other fun thing with GreyNoise is that once we spot these upticks, we can look back in time and figure out who has ever scanned for this thing, like who scanned for it last year. Then this is just a summary of some of the ones we've been able to find, the successful predictions made using this methodology; I think I just went over most of them. Drupalgeddon, that was gnarly. We do the same thing with the URIs that people request, so we were able to see exactly when Drupalgeddon began to be opportunistically exploited. Oracle WebLogic is the gift that keeps on giving: it's permanently screwed up and always being exploited. And I think I've covered all the rest. Failed predictions: there was one maybe four months ago where we saw an uptick on port 5647/TCP, and the people scanning for that port also had the same port open on their own devices, which is a golden indicator.
So we were like, we are clearly the smartest people to ever exist. Well, I was like, I'm clearly the smartest person to ever exist. I was so wrong. I worked with the Red Hat people; 5647 is the default port for Red Hat Satellite server, which this affected. I was like, yeah man, this is so cool, I'm saving the world. And then I ended up working with their team, and they said: that's not us at all, our devices are never exposed to the internet, they sit deep inside networks, and the IPs you published aren't running Red Hat; a lot of them are running Windows. And I was like, ah, okay. So I had to eat crow on that one. But they were awesome to work with, really great, super cool about it; they knew we were trying to do the right thing. But yeah, they were like, this does not affect us. So I just want to give a couple of tips if you're going to do the same thing, or want to. Filter out the known-good actors when you're doing this. Bob Rudis, hrbrmstr on Twitter, over at Rapid7, was the first person I know of to bring in known-good labels when calculating upticks. Which is to say: if Shodan is out there scanning the internet for a given thing and then they add a different probe from a thousand different places, he doesn't want that to affect the anomaly numbers.