I'm Mahdi Namazifar. Currently I'm a senior data scientist at Twitter, but before that I was lucky enough to be part of the amazing Talos team at Cisco, and this work on detecting random strings is something I did while I was with Talos. So first I want to give you the definition of the problem I'm trying to address. You're given an arbitrary string, and you want to decide whether that string is a random sequence of characters. One thing to note here: I said a random sequence of characters, and this work does not address strings that are random sequences of dictionary words. The other thing is that my focus is on strings that are at least eight characters long; for anything shorter than that, it is very difficult even for a human being to detect randomness. So I'm focusing on strings of length eight or more.

So why do we even look at this problem? Our motivation was detecting domain names that are generated using domain generation algorithms, and you know better than I do how those are used. This is not a new problem; it has been studied quite a lot, and there's a rich literature around it. I'm aware of some of the work in this space, including a bunch of works done at Cisco. Usually these works look at this as a classification problem and take a machine learning classification approach to solve it. My approach here is a little bit different. I'll give you the big picture of the approach in one slide and then go deeper into the details. The first thing I want to do is put together as many dictionaries as I can find from different languages, and out of these I want a word list, with as many words as possible. Once I have this, I put them all together into what I call the mega dictionary. How do I use it?
I take the arbitrary string that I'm given, take substrings of it, and look them up in this mega dictionary. Based on the number of dictionary hits I find among these substrings, the lengths of the substrings, and the different languages they come from, I compute a randomness score, and based on this randomness score I decide whether or not this is a random string. The idea is that if you can see how the substrings of a string can be covered by words from different languages, then you have an idea of whether it is a random string or not.

So let me get into some details, starting with the mega dictionary. First I try to find as many language-based dictionaries as I can on the internet. These are almost 70 different languages that I found dictionaries for. Some of them are constructed languages, and for some, such as English, I have multiple versions, like British English and American English dictionaries. These are the languages I was able to find free dictionaries for. I should note that out of these dictionaries I only want the keys, not the values: each dictionary entry has a word and its definition, and I don't care about the definition, I just want the words.

That was the languages. I also get some names based on census data: female names, male names, surnames. And I get a list of Scrabble words; these are words that are not necessarily in the English dictionary, and they can include acronyms. These two items were given to me by my good friend and former colleague Adam Katz, and they actually helped a lot. Next, I get the Alexa top 1,000 domain names and add them to my word list. This is again important: words like Yelp or eBay might not be in any dictionary, but they are actually important words.
I also add some numbers, and I got my hands on a list of texting acronyms: TTYT, BRB, things like that. For some of these words I need to do special treatment. For instance, for the words coming from Eastern European languages I need to strip the accents from the characters. For Mandarin I need to take the characters and transliterate them into Roman characters, and for that I use the Pinyin standard. For Russian and Ukrainian I needed to use special decoding, and I also had to take care of the fact that I and Y are used interchangeably in this decoding, plus a bunch of other special treatments like that.

The next thing to note is that the same word might appear in multiple dictionaries. The word "book" appears in at least these three dictionaries: English, Polish, and Dutch. To take care of this, I run a MapReduce job to find all the dictionaries that a given word appears in. The result looks something like this: for each word in these dictionaries, I have the list of dictionaries that the word appears in. In this example, "sui" appears in the French dictionary, the Catalan dictionary, and a bunch of others. So this is what my mega dictionary is going to look like in the end: it's a Python dictionary where the keys are the words and the values are lists of dictionary names. The lookup complexity is constant, so it's very fast to look anything up. In the end I have about 12 million words in this mega dictionary.

So that was the dictionary part. Now, how am I going to use this dictionary to detect random strings? I find substrings of a given string and look them up in this dictionary. How do I do that? By traversing strings, and this is how I define traversing a string: I can traverse a string from left to right this way.
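As a sketch of the MapReduce-style aggregation step described here, the following builds the word-to-dictionaries mapping; the dictionary names and sample words are placeholders for illustration, not the actual 70 source lists:

```python
from collections import defaultdict


def build_mega_dictionary(wordlists):
    """Aggregate per-language word lists into one mapping from
    word -> list of the dictionary names it appears in."""
    mega = defaultdict(list)
    for dict_name, words in wordlists.items():
        for word in words:
            w = word.strip().lower()
            if w:
                mega[w].append(dict_name)
    return dict(mega)


# "book" appears in (at least) the English, Polish, and Dutch
# dictionaries, so it maps to all three names.
mega = build_mega_dictionary({
    "english": ["book", "good"],
    "polish": ["book"],
    "dutch": ["book"],
})
```

Because the result is a plain Python dict (a hash table), each membership test is constant time on average, which is what makes the substring lookups described next cheap.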
I have two indices, one fixed at the end and one at the beginning. If I'm traversing from the left, I fix the right index and move the left index one step at a time, and as a result these are the substrings that I find. If I traverse the same string from right to left, I get these substrings. It will become a little clearer later why I do it once from right to left and once from left to right, but let's just fix the definition here, go to the next slide, and look at an example.

How do I find the substrings of a given string? Let's say we're dealing with the English dictionary only, and this is my given string. I start traversing the string from the left, and at each step I take the substring and look it up in my mega dictionary. I continue until I find a hit. None of these substrings appeared in my English dictionary until I hit this word, which did appear. So I take that word out of my string, put it aside, and now I'm left with this substring. I reset the indices and start the traversal from the left again. None of these substrings are in the English dictionary until I find this word; it's a hit, I take it out, I'm left with this substring, and so on. At the end, I get these words out of this string: "good", "to", "be", "here".

So I did this once from the left, and I do the same thing once from the right. From left to right I get this list, and from right to left I get this other list. You see that? Doing it once from the right and once from the left gives me a higher chance of finding the right words in the given string. Then I need to pick between these two results, and because the minimum length of the words in this list is four, I take this one: the longer the words you find in the substring, the higher the chance that they are not there by chance.
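The traversal is easiest to see on the slides; below is one plausible reading of it in Python. The function names, the `min_len` threshold, and the exact scan order are my own assumptions, not the original implementation:

```python
def peel_words(s, dictionary, from_left=True, min_len=2):
    """Fix one index at an end of the string, slide the other index
    one step at a time, and peel off the first substring that is a
    dictionary hit; then restart the scan on what remains."""
    found = []
    rest = s.lower()
    while len(rest) >= min_len:
        hit = None
        if from_left:
            # right index fixed at the end, left index moves right:
            # candidates are ever-shorter suffixes of `rest`
            for i in range(len(rest) - min_len + 1):
                if rest[i:] in dictionary:
                    hit = (i, len(rest))
                    break
        else:
            # left index fixed at the start, right index moves left:
            # candidates are ever-shorter prefixes of `rest`
            for j in range(len(rest), min_len - 1, -1):
                if rest[:j] in dictionary:
                    hit = (0, j)
                    break
        if hit is None:
            break
        i, j = hit
        found.append(rest[i:j])
        rest = rest[:i] + rest[j:]  # remove the hit, keep the rest
    return found


def best_extraction(s, dictionary):
    """Run both directions and keep the result whose shortest word is
    longest: longer hits are less likely to occur by chance."""
    left = peel_words(s, dictionary, from_left=True)
    right = peel_words(s, dictionary, from_left=False)
    shortest = lambda ws: min((len(w) for w in ws), default=0)
    return max(left, right, key=shortest)
```

For example, `best_extraction("goodtobehere", {"good", "to", "be", "here"})` recovers all four words from the concatenated string.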
So that was just looking things up in the English dictionary. What about the general case? We built a dictionary based on almost 70 different languages, right? How do we use that? Let's say we're looking at this string, and let's say I found these substrings in it, and these are all the dictionaries that each one of them appears in. The question here is: at the end of the day, how many languages do I need to cover the substrings that I found? If I take the union of these dictionary lists, it's going to be way too many. If I take the intersection, it's going to be empty; they have no common element. So I need to find a minimal set of dictionaries, or languages, that covers these substrings.

How do I do that? This is exactly the minimum hitting set problem. It's a very well-known, very well-studied problem, and it is basically the parent of a bunch of other well-known problems, such as the set cover problem. This is the formal definition of the problem; I don't want to get into it here, and you can always look it up. Unfortunately, it's an NP-hard problem, but the good news is that our sets are small enough that even a greedy search in the space of possible solutions finds a minimal hitting set. So that's exactly what I do: I have a very, very simple greedy algorithm for the minimum hitting set problem. It's by no means an optimal or even the best greedy algorithm for this, but I don't care, because it's fast enough: in a tiny fraction of a second it gives me what I need, and that's good enough for me.

Back to our example. We had these sets that we wanted to find the minimum hitting set of, and by applying a very simple greedy search, I find these minimum hitting sets, and the cardinality of these sets is two.
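A minimal sketch of a greedy search of this kind (my own implementation, not the original code): repeatedly pick the language that hits the most not-yet-hit sets until every substring's dictionary list is covered.

```python
from collections import Counter


def greedy_hitting_set(sets):
    """Greedy minimum hitting set. Not guaranteed optimal in general
    (the problem is NP-hard), but the per-string instances here are
    tiny, so a greedy pass is both fast and accurate enough."""
    remaining = [set(s) for s in sets if s]
    chosen = set()
    while remaining:
        # count how many remaining sets each element would hit
        counts = Counter(e for s in remaining for e in s)
        best = counts.most_common(1)[0][0]
        chosen.add(best)
        remaining = [s for s in remaining if best not in s]
    return chosen
```

On the slide's style of example, three substrings whose dictionary lists pairwise overlap but share no common language need two languages, so the hitting set number is 2.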
So my minimum hitting set number is two, meaning that I need at least two different languages to cover the substrings that I found. Based on this minimum hitting set number, the length of the string itself, the percentage of it that was covered by the substrings I found, and the sum of the lengths of those substrings, I define a randomness score, and that becomes my touchstone for detecting randomness. One last thing to mention here: I do this twice. I first run the string against the English language only, since English is fairly universal and many people use it. If, according to English alone, I don't have a verdict about the randomness of the string, I then go to all the languages. So it's basically a two-phase filter.

There are a bunch of other considerations. If you have a sequence of alternating vowels and consonants (try this exercise yourself: sit down and write a random alternating sequence of vowels and consonants), it looks pretty legitimate to you. You would think it has to be a word in some language somewhere. So I penalize the score when I see a pattern like that. Another thing I consider is that if I see a dash or an underscore in the string, it means there is a natural separation at that point; otherwise, why would there be an underscore or a dash in the middle of a string? So I treat it as a separator and look at each of the separate pieces individually.

Now I want to get into some results, based on the experiments we ran, and I look at both the false positive and false negative rates. Let's first look at false negatives. I'm using nine domain generation algorithms; these are from known malware and botnets, reverse engineered by the Talos team. They gave me the code, and I generated these samples from it.
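The exact scoring formula isn't spelled out in the talk, so I won't guess at it; but the two auxiliary checks mentioned here (the alternating vowel/consonant penalty and the dash/underscore split) can be sketched as follows. The minimum-length threshold in the alternation check is my own assumption:

```python
import re

VOWELS = set("aeiou")


def alternates(s):
    """True when the letters strictly alternate vowel/consonant,
    e.g. 'bonokelati': a pattern that looks pronounceable and
    word-like even when generated at random, so it earns a penalty.
    The >= 4 letter threshold is an assumption for this sketch."""
    kinds = [c in VOWELS for c in s.lower() if c.isalpha()]
    return len(kinds) >= 4 and all(a != b for a, b in zip(kinds, kinds[1:]))


def split_on_separators(s):
    """Dashes and underscores mark natural separation points, so
    each piece would be scored on its own."""
    return [p for p in re.split(r"[-_]", s) if p]
```

So a candidate like `my-random-string` would be scored as three independent pieces rather than one long string.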
This is the number of samples that come from each one of these algorithms, and this is the number of strings generated by each specific algorithm that were missed by my randomness detection. If you look at the missed percentages, and I don't know how clear it is on the screen, the rate is pretty low: across the board it's around 1%, or less than 1%, false negative rate.

How about false positives? For this I took the Alexa top 10,000 domain names and filtered out the strings shorter than eight characters, putting them aside as I mentioned in the first slide. That leaves me with almost 5,400 domain names, and I ran them through the code. The rationale here is that in the Alexa top 10,000 you're unlikely to see a lot of DGA domains, so hopefully most of those are legitimate websites. Out of the 5,400 domains that I checked, 42 were detected by my algorithm as random, and this is the whole list of them. Some of them, like this one, I would also call random if you showed them to me; and this other one looks pretty random to me too. But some of the others are not random: as a matter of fact, I know this one is a perfectly legitimate Turkish website, and this one is a legitimate Farsi website. But in the end, out of 5,400 we flagged 42, so we can say it's about a 1% false positive rate, which is not bad. So, 1% false positive and 1% false negative; overall I think that's a good rate compared to other studies I've seen on this problem. And that concludes my talk. Thank you very much.