So, thank you everyone for being here. We're really excited to be at DEF CON. This is my second time; it's Thomas's first. And it's amazing to be around people who care about security and sharing information. So our talk today is about malicious CDNs, and we're going to cover one particular one. I mean, there aren't many, but Zbot has been an interesting fast flux proxy network in the past few years. And we're going to show how we have been studying it using SSL scans, combining a few interesting heuristics that use graph theory and some basic statistics. Hi, my name is Thomas Matthew. I'm a researcher at Cisco Umbrella, which was formerly OpenDNS, and my main focus is on data science and machine learning. And I'm Dia. I'm the head of security research at Cisco Umbrella, and my interests are in graph theory and security overall. So what are we talking about today? Just a quick overview, a few words about CDNs. Most of us know about them. They are a very interesting and powerful technology that enables people who have content, especially popular sites, to deliver that content in an efficient way, so that people around the world can get it with low latency from the edge nodes closest to those users. For the sake of this talk, certain features or requirements of a CDN infrastructure are of particular interest to us. Specifically: the content of a customer will be delivered with low latency; the customer's website will also be protected against DDoS attacks; customers also try to hide their origin IP behind the CDN infrastructure; and if the site is communicating via HTTPS, then you can also have SSL certs deployed on the edge nodes, so you can guarantee end-to-end secure communication. Now, most of us know about the legit ones: Akamai, Cloudflare, Google Cloud, Amazon CloudFront. Cloudflare tends to be abused from time to time by some kinds of bad content.
But they are on the legit side, and we work with them to mitigate some of these threats. But then there are some purely criminal content delivery networks out there. They're more on the reverse proxy fast flux network side, and Zbot is one of them. In fact, we've been studying this network for the past few years. We've given talks at Black Hat 2014, 2016, and Botconf, and I invite you to go check the details on how to detect it and some of the other features of the infrastructure. So this is an overview of how this network operates. You have, in fact, around 30,000 to 40,000 compromised machines, mainly routers and access points in Ukraine and Russia, and they are maintained and harvested by the actors. It's not necessarily the case that the guy selling the service in the underground is the same one who compromised the machines. Usually you can buy installs for thousands of machines and then go and provision them to your customers. As we know, there's this big segmentation of expertise in the underground. And what happens is this actor will offer them in the underground as fast flux. So then as a customer, let's say you are looking to deliver malware, ransomware sites, phishing especially. We also saw a lot of these carding and cybercrime forums that will switch between these infrastructures to protect the content, so it's not taken down, so it's not revealed to security researchers, et cetera. In that case, what they do is hide behind the Zbot CDN, or fast flux proxy network. And when they buy the service, for a couple hundred dollars or in that range, they will be provisioned with, I would say, between 40 and 50, sometimes more, but in that range, a number of IPs or bots on which their SSL cert will be installed.
That way you guarantee that the end-to-end communication, whether it's victims talking to ransomware C2s, crimeware consumers, or researchers like us, will be over HTTPS end to end. So this is an interesting infrastructure, in fact. It's been around for years, and it's worth investigating from the SSL perspective because, as we said, all of the bots will have SSL certs installed on them. And when you have scans, you can figure out a lot of interesting patterns. Quickly, I mentioned crime forums, dump shops, malware; these are just screenshots that you can find anywhere on the web. And quickly, a cert, as we know, will have this section called the common name, a field which has to match the domain you're trying to protect with HTTPS. Obviously, I will not get into the details of wildcards and subdomains and things like that. But the main point is, as we said, if you want to protect your site end to end, you'll have to have the cert for, let's say, your crime forum deployed on the bots you got provisioned, so they can deliver your content with SSL encryption. So the main objective in today's talk is to provide researchers with a series of statistical tools that allow them to analyze large sets of SSL data. And all the data we're going to be discussing in today's talk is actually available at the following URL; it's collected by Rapid7 and the University of Michigan. So this is a high-level overview of how the talk is going to go. We're going to discuss what exactly is contained in the SSL sonar data. Once we do that, we can model that data using a bipartite graph which splits the data into common names and ASNs. Once we have that bipartite graph, we can start collecting what we call global information, through a series of histograms that calculate the relative frequencies of the popularity of both ASNs and domains.
These histograms then allow us to create micro, local features on a per-domain basis. Once we have those tiny histograms, we can use a bucketing scheme to convert them into a vector. This vector can then be measured against other domains in the same neighborhood of popularity in the graph. And ultimately we can use a very simple anomaly detection mechanism to see whether a domain in a particular neighborhood is unlike its neighbors. That's how we can identify whether a domain is potentially a Zbot host. With SSL, we don't really need to dig into the details too much. We just know that SSL is the Secure Sockets Layer; it's used for encrypting traffic over HTTP, and there's been an increase in websites employing SSL. The type of SSL data we're working with is the X509 certificate. The X509 certificate contains information regarding the issuer as well as the subject, and for today's talk, we're more interested in the subject, that is, the entity the SSL certificate belongs to. In particular, when we're looking at the subject information, we're interested in a field that Dia mentioned called the common name. Now, a common name can be any alphanumeric string, but we're interested in common names that are legitimate domain names, because we want to see which domain names are associated with a particular X509 certificate. So this is just an example of how an X509 certificate looks if you decode the base64. One of the useful things about SSL data is that it allows us not only to map out residential versus commercial IP space, but also to understand how a network can be spread out over a series of IPs. So let's say we have a set of X509 certificates and their corresponding IPs. If we look at the common name information, we can see how a particular common name is spread out over a series of IPs.
We can then make a guess that the entity behind that common name is somehow involved in the hosting or co-location at that IP. And the sonar data is a scan that runs roughly two to four times a month over the entire IPv4 space. In actuality, because certain network operators don't allow the University of Michigan to scan their IP ranges, we don't get information from certain ranges. It's just a basic scan on port 443, and the most important part is that we get an X509 certificate as well as information about the IP it was found on. This is the flowchart we use to get our data prepped before we perform the analysis. We take a raw monthly scan of sonar data, extract the common names, and then map each IP to an ASN. And then we come up with this quadruple: the SSL SHA, the IP, the common name, and the ASN that the IP belongs to. What's really great about the sonar study is that because it's produced on a monthly basis, we can see how hosting patterns emerge and change over a five-month period. So real quick here, what you can see is the range of, I would say, the number of SHAs we collect every month: between 250K and a million unique SHAs during a five-month period. From those, the data also has the raw cert, so you can decode it and extract the common name. The point here is that we have a big enough data set to try to find some interesting patterns, and to find these anomalies and, I would say, threats in general. The other thing is that it's difficult to manually inspect these domains. That's why we're looking for a large-scale abstraction model that can help us do this kind of analysis. And graphs, as you know, are very useful for a lot of things. Most of you know about bipartite graphs: you have two sets that are disjoint.
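The data-prep step just described, joining the raw scan rows with decoded common names and an IP-to-ASN map to get (SHA, IP, common name, ASN) quadruples, could be sketched roughly as follows. The field names and lookup tables here are assumptions for illustration, not the speakers' actual pipeline:

```python
from collections import namedtuple

# Illustrative sketch of the data-prep step: the input shapes and field
# names are assumptions, not the real sonar processing pipeline.
Record = namedtuple("Record", ["sha", "ip", "cn", "asn"])

def build_quadruples(scan_rows, sha_to_cn, ip_to_asn):
    """Join raw (sha, ip) scan rows with decoded common names and an
    IP->ASN mapping to produce (sha, ip, common_name, asn) quadruples."""
    quads = []
    for sha, ip in scan_rows:
        cn = sha_to_cn.get(sha)
        asn = ip_to_asn.get(ip)
        if cn is not None and asn is not None:  # skip rows we cannot resolve
            quads.append(Record(sha, ip, cn, asn))
    return quads
```

Rows whose cert or ASN cannot be resolved are simply dropped, which matches the idea that some ranges are missing from the scans anyway.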
In this case, we took a simple representation where you have the common name set connected to the ASNs, meaning the common name is hosted on an IP, and that IP belongs to an ASN. We found that lumping all of the IPs into the ASN node is more useful for our investigation than keeping the CN-to-IP bipartite graph. That's basically what you end up having, and it has been useful for the type of analysis we're going to describe in a second. So I guess the first takeaway is that the bipartite graph is a useful representation of this problem that helped us solve some of the issues. Classically, there are multiple methods for analyzing a graph and identifying various substructures. You can use a graph factorization technique, or you can identify the connected components within the graph and study each of them, or calculate minimum spanning trees. For this talk, we're actually not going to use any of those three methods; we're instead going to look at another set of statistics. But in general, our goal is to identify possibly anomalous substructures within the graph. A substructure within the graph can be thought of as a certain set of domains and ASNs that have some sort of odd shape. I know that sounds slightly vague right now, but as the talk goes forward, you'll start to see what we mean by a substructure within the graph. So when we're analyzing the graph, we first need to come up with a baseline metric of what we can consider normal. What we wanted to do was create a metric based on the topological features of the graph. That means we're looking at the relationship between domains and their mapping to ASNs, and vice versa. A really easy way to do that is to look at the popularity of each common name.
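As a rough sketch of the model just described (not the speakers' actual code), the common-name-to-ASN bipartite graph can be kept as two adjacency maps, with all IPs inside an ASN lumped into a single ASN node:

```python
from collections import defaultdict

def bipartite_graph(quadruples):
    """Build the common-name <-> ASN bipartite graph from (sha, ip, cn, asn)
    quadruples. All IPs inside an ASN collapse into one ASN node, as the
    talk describes."""
    cn_to_asns = defaultdict(set)  # edges seen from the common-name side
    asn_to_cns = defaultdict(set)  # the same edges seen from the ASN side
    for _sha, _ip, cn, asn in quadruples:
        cn_to_asns[cn].add(asn)
        asn_to_cns[asn].add(cn)
    return cn_to_asns, asn_to_cns
```

Keeping both directions makes the two mirror popularity measures cheap to compute: the degree of a common name is `len(cn_to_asns[cn])`, and the degree (type) of an ASN is `len(asn_to_cns[asn])`.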
The popularity of each common name is defined as, essentially, the degree count at each domain vertex. We then calculate the frequencies of how a particular domain name is distributed across a set of ASNs. And for each ASN, we model that ASN as having a particular type, where the type of the ASN refers to its popularity, and the popularity of an ASN is how many unique common names appear on it. So there's this kind of mirror relationship between the two popularity scores we create. Dia will now show with a simple example how this works. So let's break it down with a very simple example. We see our common name set, the red set, linking to the other side of the bipartite graph, the ASNs in blue. The analogy we're going to use in this talk is: take common names as people, individuals, and ASNs as cities or states. You can see that a person like John at the top lived in, let's say, one, two, three, four cities, which are four ASNs; he lived in four different states. What we try to do here is study the ASN side. In a sense, you're looking at the behavior of cities in terms of how many people they hosted. So what we have here is that the ASN at the top, the blue one, has three incident edges, which means it has a degree of three: it had three common names hosted on it. If you look at the following ASNs, the second one has one incident edge, which means it has a degree of one. Anyhow, what you end up with are the three bullets at the bottom: two occurrences of an ASN with degree one, three occurrences of an ASN with degree two, and two occurrences of degree three. In the simplified histogram at the bottom, you can see ASN degree on the x-axis and the number of occurrences of that event on the y-axis. And that's how you can scale this to a bigger dataset.
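The degree-counting exercise just walked through can be reproduced in a few lines. The toy data below mirrors the slide's example of seven ASNs with degrees 1, 1, 2, 2, 2, 3, 3 (the common names themselves are placeholders):

```python
from collections import Counter

def degree_histogram(adjacency):
    """Map degree value -> number of vertices with that degree, for one
    side of the bipartite graph (dict of vertex -> set of neighbors)."""
    return Counter(len(neighbors) for neighbors in adjacency.values())

# Toy data mirroring the slide: seven ASNs hosting 1, 1, 2, 2, 2, 3, 3
# common names respectively.
asn_to_cns = {f"AS{i}": {f"cn{i}_{j}" for j in range(d)}
              for i, d in enumerate([1, 1, 2, 2, 2, 3, 3])}
```

Calling `degree_histogram(asn_to_cns)` reproduces the three bullets from the slide: two ASNs of degree one, three of degree two, two of degree three. The same function applied to the common-name side gives the mirror histogram discussed next.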
So when we apply this technique to the entire January data set of around 22,000 ASNs, the histogram on the right is a stratified sampling from a set of 5K. What we notice is that there's a definite long tail to this distribution. The majority of the ASNs are lumped in the zero-to-five range, as you can see circled, but then there are a couple of ASNs way out to the right that host more than 50,000 unique domain names. The quick takeaway is that, as Thomas said, the majority of ASNs are hosting between one and a hundred domains. To use the analogy: the majority of cities in the U.S., for example, host between one and a hundred people. Here are some raw numbers that back up that statistic. If you look at the number of ASNs hosting just one unique common name, it's around 7,600. The number of ASNs hosting two common names is a little under 4,000. And if you look at the number of ASNs hosting under 20 unique common names, it's 19,000. So more than 97% of all ASNs fall within that 20-unique-common-name band. And ASNs hosting more than a hundred number fewer than a thousand. So let's take the mirror set now: the common names, which are, let's say, the people or individuals. In a sense you're trying to build a better understanding of their behavior. Here we can see that the first red dot has one, two, three, four outgoing edges, so it has a degree of four. The next one has five outgoing edges, so a degree of five. And you end up with that list of, I would say, events: two times we have a common name with degree one, et cetera. And you end up constructing the histogram at the bottom, with the common name degrees on the x-axis and the number of times each degree value occurred on the y-axis.
So what happens when we apply this metric to the global data set? Well, out of around 850,000 domains, we again did a sampling to represent it on the histogram. We can see again that there's very close clustering in the 1-200 range, and we can see this if we zoom in: that's the bottom histogram, where the majority of common names map to, say, one to three unique ASNs. Then there's a very sharp drop-off, and you get ticks at essentially every other count between 1 and 140. So let's take a really quick look at the outliers. You can see that one of the outliers is D-Link, and the other one is Google Video, both domains that people are pretty familiar with, and definitely not malicious. Yeah, the quick takeaway, I would say, is that Google Video, as a common name, is found on 2,000 different ASNs. That's basically the common name you'll find on the certs used for YouTube content delivery. Obviously, Google has deployed a lot of caching mechanisms on edge nodes around the world, so that's why you see this big diversity of ASNs for Google Video. Google Video is not fast flux; it's just a core CDN common name you find on the certs serving the content. I guess the one other point I'd mention is that there's an exponential drop-off in how domain counts are distributed across ASNs. You can see the jump from Synology to Example, then Example to D-Link, then D-Link to Apple iTunes is very rapid, which shows how quickly everything converges towards common names that map to very few ASNs. So why are we talking about all of this and doing all these histograms?
Well, the goal of this talk is to find substructures within the graph, and as we move towards the right of this graph, where a common name maps to more and more ASNs, we gain more information about that common name. For example, when we know that a common name maps to, say, a thousand ASNs, it's very easy to understand what that common name's behavioral role possibly is. But as we move towards a common name that maps to only one ASN, it's very difficult to understand what's going on. What we can then see is that around 97% of all domains map to just a single ASN. And the problem is that in that range of one to ten mappings, there's just not enough information to make any kind of inference. A quick thought here: in general, if you're trying to do data analysis, data is useful, but if the data is too sparse, there's no chance of finding anything interesting. And if the data is, let's say, Google Video and D-Link, they're so popular that you're not expecting to find anything useful there either. That's why we focused, like Thomas said, on a very small range that we believe holds the core, the essence, of the interesting patterns we're trying to track. In general, we could have taken a clustering approach, because this is an unsupervised method on a big data set. But we found that using simple statistical techniques like histograms is very good for building this understanding of the data step by step. You isolate your focus on specific regions that you can then go and peel apart with other techniques. So with the information we just discussed, it's very easy to come up with a simple heuristic to filter out 99% of all the domains: we just don't look at domains that map to fewer than 10 different ASNs, along with the mega-popular outliers.
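That filtering heuristic might look like the sketch below. The exact cutoffs (here 10 and 1,000) are illustrative guesses, since the talk only says that very sparse domains and the mega-popular ones are dropped:

```python
def filter_domains(cn_to_asns, low=10, high=1000):
    """Keep only common names mapped to a moderate number of ASNs: very
    sparse domains carry too little signal, and mega-popular ones (think
    Google Video, D-Link) are already well understood. The cutoff values
    are illustrative, not the speakers' production thresholds."""
    return {cn: asns for cn, asns in cn_to_asns.items()
            if low <= len(asns) <= high}
```

Since around 97% of domains map to a single ASN, the low cutoff alone already removes the overwhelming majority of the data set.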
To expand: it's like when you're analyzing a document. If a document has only a single word or a couple of words, it's very difficult to know whether that document was just created by some random process, somebody typing out a word and spitting out the document. But as the document increases in size, it's potentially a lot easier to find some sort of topic within it. The goal here is to find the topic of a domain: you want to see how the domain is deployed within this larger ASN structure, and in particular to understand whether that topic can be considered malicious. We've talked a lot about the macro level of the graph, but we haven't talked much about the micro level. The micro level is understanding how a particular domain is mapped to a set of ASNs. The two histograms here give frequency counts at the domain level. The x-axis denotes the type of ASN; just as a refresher, the type of an ASN means how many other domains are mapped to that ASN. The y-axis represents the frequency of that type of ASN for the particular domain. If you look at the top histogram, for Narana Market, we can see that it contains one ASN that is extremely popular: this ASN hosts more than 25K other unique domains. It also has a concentration of ASNs that host at least a thousand other unique domains, and that's where its general mass density is concentrated. But if you look at MinusQ and its ASN frequencies, we notice that the ASN hosting the most other domains hosts only a little under 2,000. And the majority of the ASNs it's found on host only one to five other unique domain names. So there's clearly a difference in how these two domains are distributed.
To bring back the earlier analogy of people living in cities or states: think about Narana Market as, let's say, John. John happens to have lived in one single city that had 25,000 people in it, but most of the time he lived in cities that had between one and, let's say, a thousand people. So you can see how this guy is migrating between different cities. Similarly for MinusQ: it happened to have lived in only one city, or ASN, that hosted 1,500-plus other common names, or people. But most of the time MinusQ has rotated around cities, or ASNs, that were in that smaller range at the bottom: very lowly populated ASNs. And this, as we will see, is very interesting, because it tells you what a common name is used for, depending on where it resides and how it moves around in the ASN ecosystem. I understand that pictures can sometimes be a little confusing, especially at the resolution I've had them at, so hopefully this numeric information can further highlight what we were discussing. We can see that for MinusQ, the ASNs it's hosted on all host just one, two, or three other unique domain names, while for Narana Market, there's not a single ASN that hosts fewer than a thousand other unique domain names. Given this general intuition, it's only natural to ask how we can come up with a mechanism to determine how far apart, or how different, Narana Market and MinusQ are. Real quick: MinusQ is in fact part of Zbot. That interesting pattern will be highlighted later; we were able to find it with an unsupervised method. So as I mentioned, you can't directly compare these two histograms because they're on completely different scales: MinusQ only has counts under a thousand, essentially, and Narana Market has counts over a thousand.
So we need to create some sort of representation of the entire spectrum of possible domain-ASN counts. This object will be unique per domain, and we can use this vector as a mechanism to determine similarity. In order to create this vector, we need a bucketing scheme which maps certain regions of counts to a particular dimension within the vector. In this case, we're interested in domains that might be mapped to a variety of different ASNs, but where the ASNs they're mapped to are actually quite unpopular. So what we're most interested in are ASNs at very low frequencies, and as a result we create a bucketing scheme that is incredibly sensitive to low frequencies. The best way to think about this is perhaps in a picture. Let's say you're interested in a certain color. You devise a filter that essentially blocks out the other colors and focuses on, say, the grays or the purples. In the same way, you can think of the domain-ASN distribution as belonging to colors: low frequencies are more like blues, high frequencies are more like reds, and we care more about blues. So when we bucket the histogram per domain, we bucket into nine different bands, and each band refers to an index of popularity. We have bands counting the number of ASNs mapped to 1 to 5, 5 to 10, 10 to 20, but as we go larger, we increase the size of each bucket. That means, for example, all the numbers ranging between 1,000 and 4,000 will be mapped to the same bucket in the vector. Again, this gives us much better resolution for understanding how a domain maps to low-frequency ASNs. And if I've messed up the explanation a little, this slide gives a really nice pictorial representation. Yeah.
So if we look here, we can see MinusQ, and it has that long, I would say, table, sorry, array. The array, as we can see, has all of these 1, 1, 1s. Based on the bucketing that Thomas described, for the 1-to-5 band we're basically counting how many numbers fall in the range of 1 to 5, so you can count the 1, 1, 1s all the way up to 5, which gives you 15 occurrences of those numbers. And you can keep going: for the 5-to-10 band, you have five numbers occurring in that range. That's how you build your vector, for both the top domain and the bottom domain. And that way you have two vectors that allow you to compare these two domains on the same scale. I guess takeaway three here is that the majority of domains, as we saw earlier, map to, are hosted on, live on between 1 and 200 ASNs. And as Thomas said, we had to devise a bucketing that is sensitive to low-popularity ASNs. In other words, you have variable resolution depending on the lower, I would say, bands of interest. Now, the next step is to go back and focus on the common names and how many ASNs each domain maps to. That's the step that will let us explain how we find these outliers, and hence these Zbot domains, within this very big data set. Okay, so we now go back to the original list of domains that were found in the X509 certificates. As we know, we can filter out the mega outliers, the D-Links, the Googles, because we already have a very good idea of what they are. Now what we want to do is come up with a mechanism to create neighborhoods of domains. In this case, the neighborhood of a domain is the other domains that share a very close count in how many ASNs they're mapped to. For example, in the picture you can see, let's say, the list of domains that map to 150 to 160 different ASNs.
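The nine-band bucketing can be sketched like this. The first edges (5, 10, 20) come from the talk; the remaining edges are illustrative guesses, chosen so that, for instance, all counts between 1,000 and 4,000 share one bucket as the talk describes:

```python
import bisect

# Nine popularity bands with finer resolution at the low end. The edges
# 5, 10, 20 come from the talk; the rest are illustrative assumptions.
BAND_EDGES = [5, 10, 20, 50, 100, 500, 1000, 4000]  # 8 edges -> 9 buckets

def domain_vector(asn_popularities):
    """Turn a domain's list of ASN popularities (how many other common
    names each of its ASNs hosts) into a fixed 9-dimensional count vector,
    so domains on different scales become directly comparable."""
    vec = [0] * (len(BAND_EDGES) + 1)
    for p in asn_popularities:
        vec[bisect.bisect_left(BAND_EDGES, p)] += 1
    return vec
```

Because every domain now yields a vector of the same length, a sparse-ASN domain like MinusQ and a popular-ASN domain like Narana Market can be compared with an ordinary distance metric.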
Well, in that neighborhood, you would have iTunes.Apple.com, ASOS Media, download.McAfee.com. All of these domains, we say, belong to the same neighborhood because they all map to about the same number of ASNs. So once we have a neighborhood, we also have the histogram vector that we created for each domain. And this is where we can apply a really simple pairwise Euclidean distance between any two of these domains using the domains' histograms. Real quick, a quick analogy again: think about these bands as income. You have people in an income band of, say, 150 to 160K, and you have these cities with neighborhoods, and you're trying to find how close the people within that income range are to each other. And as you'll see later with Thomas, some of them will be interesting outliers, maybe anomalous: they're making this much money, but maybe there's something fishy about them. So this is a hypothetical distance matrix for a band, or neighborhood, that contains only three domains: domain one, domain two, domain three. The value in each cell is calculated as the distance between two domains. Let's just look at the red column. The distance between domain one and itself is naturally going to be zero, right? Then the distance between domain two and domain one is going to be some value, and the distance between domain three and domain one is also going to be some value. What this means is that if I look at the red column, I can see the distance between D1 and every other domain in its band. And this naturally means that if we want to find domains that are very different from their neighbors, we just calculate the Euclidean norm of each column of this matrix. That allows us to figure out how different each domain is from its neighbors.
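The pairwise distance matrix and its column norms can be computed directly with NumPy. This is a minimal sketch of the mechanism just described, not the speakers' code:

```python
import numpy as np

def neighbor_distances(vectors):
    """Pairwise Euclidean distances between all domain vectors in one
    neighborhood, plus the Euclidean norm of each column. A domain whose
    column norm is large is unlike everything else in its band."""
    X = np.asarray(vectors, dtype=float)
    diff = X[:, None, :] - X[None, :, :]    # broadcast all pairs
    D = np.sqrt((diff ** 2).sum(axis=-1))   # D[i, j] = ||x_i - x_j||
    return D, np.linalg.norm(D, axis=0)     # one anomaly score per domain
```

The matrix is symmetric with a zero diagonal, so the norm of column i sums up exactly how far domain i sits from every other member of its neighborhood.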
And of course, the larger the value, the more different it is from its neighbors. So over the January data set, as a trial, we ran this a while ago, and we had one really interesting case in the neighborhood of 100 to 110. As you can see, the average in this band, in this neighborhood, is around 128 or so, but there's one very clear outlier, which has an overall distance from its neighbors of around 567. In the histogram, you can see how the averages are all bunched up in this really tall spike, and then you have these two outliers way out at 400 and around 500. It's easy to calculate the standard deviation of these and notice that they're definitely two standard deviations away. And what was great when we ran this: we found this domain called tangerine-secure.com, which, through some further manual probing, we were able to identify as a Zbot domain. But Zbot lives in other ranges as well. In the neighborhood of 30 to 40 different ASNs, we found a couple of other outliers. In this case, as the amount of information decreases, because we're going from, say, 100 ASNs, which is a lot more information, down to, say, 30, the spectrum of possible distances increases, so it becomes a little noisier. But at the same time, if you look at the tail of the histogram, there are still interesting domains. The majority of the distances are all nicely lumped together, but if you cross the 200 distance mark, you can see there are actually five domains, and out of these five domains, three actually turned out to be malicious: MinusQ, Secure Data SSL, and SecureTandrianAxis.com. The further validation was done through some more passive DNS down the road, and then some more active probing.
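The two-standard-deviation rule mentioned here is simple enough to sketch; `k=2.0` reflects the gap the talk describes, and the function and variable names are mine:

```python
import statistics

def flag_outliers(names, scores, k=2.0):
    """Simple anomaly rule: flag any domain whose summed distance to its
    neighbors is more than k standard deviations above the neighborhood
    mean. k=2 matches the two-standard-deviation gap noted in the talk."""
    mu = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    return [name for name, s in zip(names, scores) if s > mu + k * sd]
```

Applied per neighborhood band, this is what turns a few hundred distance scores into the handful of candidates (like tangerine-secure.com) that warrant manual probing.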
And what's really good about this method is that we were able to take a set of around 800,000 domains and reduce it to a far more manageable set of around eight domains that we can inspect manually by hand, or pass on to an analyst, and also gain IP information from. On the previous slide, the analogy you could think about is: all of these Zbot domains are trying to hide within the large, I would say, SSL, ASN, IP space ecosystem, and they're part of the same gang. However, you can see that some of them have lived at low to medium to high income, to use the analogy. And with this method, you were able to find them, through that whole pipeline of macro, micro, and distance measurement calculation that we described to you. I guess a few final thoughts here. When we got to this last list, reduced from 800K to, I would say, less than 10, we had to use some extra signals to verify the true positives and weed out the false positives. For that, we used some simple ones: how many SHAs the common name maps to, and also the ratio between the IPs the SHA was found on and the ASNs where those IPs live. That ratio was very revealing for finding all of these Zbot domains, like MinusQ, Secure Data SSL, and the Tangerine one: they happen to have an IP count over ASN count between one and two. In other words, that confirms the business model of the actor behind Zbot: when he sells you between 40 and 50 IPs, he will never give you IPs that belong to the same ASN. Usually you get one IP per ASN, as he tries to diversify the offering he gives to his customers. So yeah, this has been very useful for finding actionable intelligence, so we can block these domains or investigate them further.
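The IP-count-to-ASN-count verification signal could be computed straight from the same quadruples; a sketch, assuming the (sha, ip, cn, asn) field order used earlier:

```python
def ip_asn_ratio(quadruples, domain):
    """Verification signal: distinct IPs over distinct ASNs for a candidate
    common name. For Zbot customers (roughly one bot IP per ASN) the talk
    reports ratios between 1 and 2; legit CDN customers typically stack
    many IPs inside the same few ASNs, pushing the ratio much higher."""
    ips = {ip for _sha, ip, cn, _asn in quadruples if cn == domain}
    asns = {asn for _sha, _ip, cn, asn in quadruples if cn == domain}
    return len(ips) / len(asns) if asns else 0.0
```

A ratio hugging 1 is exactly the "one IP per ASN" provisioning pattern the actor uses to diversify each customer's bot allotment.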
Obviously we have some other systems to catch these Zbot domains, but the whole exercise we're trying to share with you is that you can start with a very large data set and peel it away by building understanding from the macro level down to the micro level, and then, as the intuitions strengthen, you end up with these interesting ways to reduce the set to a scale that's manageable by hand, by eyeballing.

A quick comparison here: you have secure-tangerine, which is Zbot, and Blue Apron, which is a legit domain, and they happen to live in the same neighborhood, like two households both making between 30 and 40K money-wise. The idea is that they both live in the band where common names map to 30 to 40 ASNs. But for Tangerine, all of the ASNs we mention here are Ukrainian ISPs; it's basically residential IPs, routers or access points, that have been leveraged for this infrastructure. Whereas Blue Apron, even though it's in the same band, is hosted on your legit Akamai, Orange, and Amazon.

As for the takeaways, I'll let Thomas go over those. One of the big takeaways is that you can use the global structure of the ASN-to-domain-name graph to help inform decisions at the local level. There are also really simple statistical tools you can use to winnow a pretty large dataset down to something far more manageable. And what was great about this is that we started off in January, then ran this approach every month on the subsequent Sonar SSL feeds and kept monitoring the IP space for the domains we found. Dia will show some more examples from the later months, April and June, of what else we found.
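The "neighborhood" bucketing that puts secure-tangerine and Blue Apron in the same 30-40 ASN band can be sketched like this. The function name `band_by_asn_count`, the band width, and the example domains (with `.example` TLDs) are all illustrative, not the speakers' code:

```python
from collections import defaultdict

def band_by_asn_count(domain_asns, width=10):
    """Group common names into 'neighborhoods' by how many distinct
    ASNs they map to (e.g. the 30-40 band), so each domain is only
    compared against peers with a similar-sized footprint."""
    bands = defaultdict(list)
    for domain, asns in domain_asns.items():
        n = len(set(asns))
        lo = (n // width) * width
        bands[(lo, lo + width)].append(domain)
    return dict(bands)

domain_asns = {
    "secure-tangerine.example": list(range(34)),  # 34 ASNs -> 30-40 band
    "blueapron.example": list(range(37)),         # 37 ASNs -> same band
    "smallsite.example": [1, 2, 3],               # 3 ASNs -> 0-10 band
}
bands = band_by_asn_count(domain_asns)
print(bands)
```

The point of banding first is the one made in the talk: the distance statistics only mean something relative to peers with comparable amounts of information, so a Zbot domain and a legit domain can share a band while differing sharply in *which* ASNs they map to.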
One more thought about the main takeaway. You might think, OK, you spent all of this time to catch just eight, or three, things. Fair enough, but the exercise is useful because you can take this generic method and apply it to any other dataset. Most research is not really about the particular problem you're trying to solve; it's about the whole methodology and the mental exercise you go through with your team and your peers. That's why we felt this might be useful as a general thought process you can apply to other datasets, wherever you can represent the data as a bipartite graph of X to Y.

Now, some bonus slides about the Zbot infrastructure itself. By studying SSL we were seeing some interesting patterns. The slide got messed up there, unfortunately, but the top timeline shows a malware C2 domain that operates with SSL in a different way than the domain at the bottom, privatezone.ws, which is a known cybercrime forum. At the top, you can see that aorospu.cc was created in April of this year. The first DNS queries we saw in our traffic were four days later. Two days after that it was hosted on another bulletproof hosting infrastructure we call ALEKS, which we actually covered on Thursday at Black Hat. Then on April 23rd a cert was created, deployed on the Zbot fast flux, and the domain started being hosted on Zbot. So you can see that as soon as you buy the service, you're immediately provisioned with an SSL cert: either you buy it yourself and provide it to the actor, who pushes it to the nodes, or he does it for you. Similarly for privatezone.ws, which has been a known cybercrime forum for years: the domain was created about four years ago, around 2014. For a long time it was hiding behind Cloudflare, with an origin IP that was unknown, let's say, unless you had other means like SSL and passive DNS probing.
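The timeline correlation described above, registration, first DNS query, cert creation, fast-flux hosting, is itself a useful signal: tight gaps between events suggest the buy-the-service-get-a-cert-immediately pattern. A small sketch; only the April 23rd cert date is stated in the talk, so the other dates here are illustrative guesses consistent with the "four days later" / "two days later" phrasing, and `provisioning_deltas` is a hypothetical helper:

```python
from datetime import date

def provisioning_deltas(events):
    """Given dated milestones for a domain, return the gap in days
    between consecutive events (sorted chronologically)."""
    ordered = sorted(events.items(), key=lambda kv: kv[1])
    return {f"{a}->{b}": (db - da).days
            for (a, da), (b, db) in zip(ordered, ordered[1:])}

# Approximate aorospu.cc timeline; only cert_created (April 23)
# is a date given explicitly in the talk.
timeline = {
    "domain_created": date(2017, 4, 17),   # illustrative
    "first_dns_query": date(2017, 4, 21),  # "four days later"
    "hosted_on_aleks": date(2017, 4, 23),  # "two days later"
    "cert_created": date(2017, 4, 23),
}
deltas = provisioning_deltas(timeline)
print(deltas)
```

Run over many domains, consistently short registration-to-cert-to-hosting gaps would separate freshly provisioned fast-flux customers from long-lived sites like privatezone.ws.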
Over that period of Cloudflare protection, they were using a variety of SSL certs provisioned by Cloudflare. Then on May 7th, 2017, they created a dedicated SSL cert, and the same day they started being hosted on the ALEKS fast flux infrastructure. On June 27th they moved to being hosted on Zbot with the same old SSL cert, and finally on July 19th they created a second SSL cert and pushed it to the edge nodes bought by the customer, which are around 40 to 50 machines in Ukraine and Russia. What we're trying to show here is how the actor sets up his backend infrastructure and how he maintains all of these moving parts: domain creation, SSL cert creation, hosting, changes of SSL certs, et cetera.

I guess the final slide here is the same one we showed earlier, just to bring it all together. Again, it's an infrastructure provided so customers can hide their content behind the scenes. You can have one SSL cert per domain. And oftentimes we actually saw that even if a bot is still holding the common name of a known cybercrime forum, that doesn't prevent any other domain from being hosted or delivered through that IP; but in that case it will not be using SSL encryption, because the domain of the new site will not match the common name of, let's say, PrivateZone. So yeah, that was it. Thanks again for your attention, and hopefully there are questions.

We just used Python and Seaborn. Yeah, and what's kind of funny is that you don't really need any fancy machine learning; you just use some basic stats and you find interesting stuff in the data. Another thing: the whole scans were pushed into HBase so we could do the search at scale. But yeah, like Thomas said, it's mainly a lot of numpy. And some good judgment and discussions with the team. Yeah, go ahead.
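The common-name mismatch point above, that a second customer's domain served from the same bot IP can't use the deployed cert, comes down to hostname-vs-CN matching. A minimal sketch of that check (simplified from the RFC 6125 rules: exact match, or a wildcard in the left-most label only; `cn_matches` and the example hostnames other than privatezone.ws are illustrative):

```python
def cn_matches(hostname, common_name):
    """Rough hostname-to-common-name check: exact match, or a
    single '*' wildcard as the left-most label. If the cert on a
    bot IP still names the first customer's site, any other domain
    served from that IP fails this check and gets no valid TLS."""
    host = hostname.lower().split(".")
    cn = common_name.lower().split(".")
    if len(host) != len(cn):
        return False
    if cn[0] == "*":
        return host[1:] == cn[1:]
    return host == cn

print(cn_matches("privatezone.ws", "privatezone.ws"))     # True
print(cn_matches("evil-shop.example", "privatezone.ws"))  # False
print(cn_matches("www.example.com", "*.example.com"))     # True
```

Real validation also walks subjectAltName entries and the CA chain; the sketch only shows why a shared bot IP can't transparently serve HTTPS for a second, unrelated domain.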
He asked whether we found anything interesting about the certificate authorities behind the SSL certs for the domains hosting Zbot. Well, unfortunately, it's your typical abused ones, like Comodo and Let's Encrypt, and I don't want to name names, but those are the ones you see used a lot, because they either offer free certs or have a lot of resellers. You also find a lot of these suspicious or bulletproof hosting providers who will offer you hosting plus SSL certs, so it's become a very common commodity to get a cert bundled with the hosting.

You're asking whether we used the same method to find other botnets? Good question. At the moment, Zbot aside, we are tracking other bulletproof hosting infrastructures that are distributed, but this one happens to be the only one that uses a CDN-like structure. The others have certs, but they're deployed on only one or two IPs. And as we saw earlier, if the information is too sparse, there isn't much to find with this method; you're better off using other, much simpler techniques, no need to complicate your life. But OK, sorry, you meant DDoS command and control. We haven't. That would be a good discussion; we can talk afterwards. Yeah, thank you.

Akamai was not a problem. Like we just said, they're good, and they were hosting good stuff. It was a bunch of residential users who were, I guess, unsuspectingly hosting a lot of these guys. Akamai for the most part really has a clean network. Yeah, the reminder is that most of these bot IPs are in Ukraine and Russia, so it's mostly residential ISPs being abused. So yeah, in the example we showed, Akamai was a legit, clean infrastructure. Thank you all.