So, who of you saw the talk about politician-speak this morning? Nobody? Okay. Yeah, it wasn't in German, so maybe, yeah, I wanted to reference it for the people who did. Apparently, talking gibberish in human-understandable language, you didn't hear about that today. But talking gibberish in electronic languages, you are probably familiar with that. So here is a security researcher with Check Point, and he will talk to you today about DGAs, algorithms that produce gibberish, but they got a bit smarter in the past. And he will tell you something about how to detect gibberish, which some people might want to have for politicians too, but you would have to use reason for that. He will give you an idea about how you can do that for DGAs. Okay, give a warm round of applause for Ben here.

Is this thing on? Oh, it is. Okay, first things first: if this slide makes any amount of sense to you, then I'm sorry to have to tell you this, but you're probably a robot. That's the bad news. The good news is that you've come to the right lecture, because once this is done, you'll be able to detect gibberish just like the rest of the humans. You'll be able to blend in, and no one will know a thing.

First I'm going to refresh your memory a bit about what a DGA is, and what the problem is that it was trying to solve. Let's look at a regular scenario, a basic scenario, where a system has been infected with malware and it wants to converse with its command and control server. That's what malware does nowadays; in the past it may have just done its own thing without receiving any commands, but today malware usually waits for commands and operates based on the commands that it receives. In this basic, usual scenario, the malware came with a built-in DNS address. It's hard-coded, and the malware queries the DNS server with this hard-coded address and receives a response: the IP address of the C&C server. Now the infected system contacts, across the rest of the internet, the C&C server. The C&C server very excitedly responds: yes! I have another machine under my sway. And the connection is complete; now the infected system and the C&C server can converse.

All of this is fine and good until one day the powers that be (the authorities, I don't know) find out about all of this, and they talk to the people in charge of the DNS server. That's probably your ISP, not necessarily. And they tell them: well, there's been this shady activity going on, and it's making use of your DNS servers. Would you kindly make sure that it stops? And the people in charge of the DNS server do not want any trouble, so they remove the record pointing to the IP address of the C&C server. Now the infected system, just as before, makes the DNS query to the DNS server: okay, where's the IP address of my C&C server? And the DNS server basically responds: go fish. Now the C&C server just stands there, fully functional, waiting to send commands. And it waits, and it waits, and it waits, and that's not very good for the campaign.

Now, a DGA is basically a mechanism that campaign managers came up with. They looked at this problem, the ease with which a DNS takedown can happen, and they said: we want something better, something that won't be taken down as easily. So, I could stand here for a lot of time and talk theoretically about DGAs and how they work, but I think a practical walkthrough of just how this works in practice is going to be more productive. So let's see how it actually works.
Our story begins with the C&C server, and it has access to a pseudo-random number generator, which is basically a creature that takes in a small amount of entropy, of randomness, and outputs a large amount of it. Now, this sort of random generator specifically takes in a publicly available seed, such as today's date, or maybe the headlines of today's newspaper, I don't know. The important thing is that it should be publicly available to everybody. It is a customized algorithm that takes in this small amount of publicly available information and outputs a large number of domains. These domains are not very understandable, and this is typical, and this is basically what this lecture is about.

Now, what the C&C server does is take one of those domains (at random, typically, that's what it does) and register it with the DNS server to point at the IP address that's relevant, the IP address that the infected machine can contact.

Now, what happens on the infected client side is something similar. The infected system also has access to the same pseudo-random generator, which came bundled with the malware, and it has access to the publicly available seed, because it's publicly available. So it consults the pseudo-random generator and asks: what are the domains available to me today? And it gets a list. How many domains are there? It varies: sometimes 15, sometimes 200. There's a lot, that's what I'm trying to say. Now the infected system knows that the C&C server has registered one of those domains to point at the IP address, but it doesn't know which one. So what is it going to do? There's really only one solution: contact all the addresses. It's going to iterate over all the addresses, one by one, and make DNS queries asking for the relevant IP address.

Now, most of those domains have not actually been registered, so what results is a very peculiar sort of conversation across the DNS protocol that kind of resembles the Cheese Shop sketch by Monty Python. If you're not familiar with it, it's a sketch that involves a guy walking into a cheese shop, and he tries to purchase various kinds of cheese, and as the sketch progresses it becomes increasingly clear that the shop does not actually hold any kind of cheese at all. The guy asks: do you have any Caerphilly? No. Well, how about Brie? No. And so on and so forth. And the DNS conversation going on resembles this exchange greatly, because what happens is that the infected machine asks the DNS server: well, do you have the IP address for this gibberish address? The DNS server responds: no. Well, how about this gibberish address? No. How about this gibberish address? No, sir, sorry. How about this gibberish address? No, not today, sir.

And this goes on and on. Here you can see a traffic capture depicting this process. You can see repeated "no such name", "no such name", "no such name". The DNS server says: what do you want from me? I have never heard of any one of those domains; please stop bothering me.
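To make that concrete, here is a minimal sketch of the kind of date-seeded generator being described. The hashing scheme, label length, and TLD are all made up for illustration; real DGA families each use their own custom algorithm:

```python
import hashlib
from datetime import date

def toy_dga(day, count=20, tld=".com"):
    """Derive `count` pseudo-random domains from a public seed (today's date)."""
    domains = []
    for i in range(count):
        # Hash the public seed plus a counter; any PRNG that both the
        # malware and the C&C server share would work just as well.
        digest = hashlib.md5(f"{day.isoformat()}-{i}".encode()).hexdigest()
        # Map hex digits onto letters to produce a gibberish label.
        label = "".join(chr(ord("a") + int(c, 16)) for c in digest[:12])
        domains.append(label + tld)
    return domains

# Both sides compute the same list independently: the C&C server registers
# one entry, and the infected machine queries them all until one resolves.
print(toy_dga(date.today())[:5])
```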
But recall that the C&C server has actually registered one of those domains. One of those domains is a valid domain that points to the IP address of the C&C server. So eventually the infected system is going to make the golden query, and the DNS server is going to excitedly jump up and down: oh my god, I know this one! It reaches down into the drawer and pulls up the IP response, and the infected system is elated. It finally has the IP address of the C&C server; it makes contact as before, and all is well with the world.

Well, you noticed that I kept saying "as before, as before, as before". What, all this work, this bloated mechanism of DGA, just to get the same result as before? Well, not exactly, because think about what you now have to do if you want to try to take down this infrastructure. Let's look at the domains generated in one day by this algorithm: that's a lot. And if you're trying to take down this infrastructure and you do not have access to the pseudo-random number generator, the algorithm, then basically all of this is random to you. So every day you're going to have to chase down and hunt all of those addresses being queried all over the world, and that's not going to be easy.

Now, if a DGA takedown happens at all, here's the more likely scenario of how it's going to play out. First you have your victim, and your victim gets infected, in some enterprise or something, and it contacts the C&C server, and there's exfiltration and so on going on. Eventually someone wises up to this, and this hot potato gets thrown over to incident response. Now, incident response, they do whatever they can with this; they try to put out this fire, and maybe, if we are lucky, they want to draw conclusions from it and make sure that the information they have obtained is relevant and can be used later to prevent further attacks of this nature by the same family of malware. So, if you're lucky, incident response is buddy-buddy with middle management at some security vendor, and this gets bussed over to middle management at some security vendor, which either buries it down somewhere where the sun don't shine, or, if you're lucky, passes it over to some reverse engineer, who is going to spend like a few weeks, or maybe a few months, poring over this file in IDA Pro, until, if you're lucky, this thing results in a report. The report either lies down at the bottom of the internet and no one pays attention to it, or, if you're lucky, either a streamlined process or some kind soul is going to take this thing and make sure that the data about how the pseudo-random number generator works is incorporated into a firewall somewhere, which will actually, once this process is done, block any future traffic based on the same domain generation algorithm. See? An easy, streamlined process. What could possibly go wrong?

But suppose that you had the ability to automatically detect DGA. Now you could aggressively cut out a lot of those middlemen. A lot of the links here have better things to do with their time: middle management, bless their soul, they have better things to do with their time, reverse engineers too. If you had the ability to automatically detect DGA traffic, you could theoretically put it straight into the firewall. The firewall is going to see the outgoing DGA traffic, and aggressively it's going to shut it down after like four or five queries and say: huh, I'm sorry, this traffic looks shady. It looks like DGA.
You're not getting through. So, automatically detecting DGA: it's useful and it's cool, and as a consequence there have been past attempts to solve this problem. We're going to look at some of the features that have been suggested in the past to identify DGA, and we're going to talk a bit about how those features do not necessarily work 100% well all the time, and how they could be improved, which is what we did.

Okay, let's talk about some ways to detect DGA. One way involves looking at letter frequency. It's a well-known fact that some letters are more common in the English language than others. So if, let's say, I take the common letters in English and color-code them as green, and the borderline letters I color-code as yellow, and the rare letters I color-code as red, and I take five words that I randomly picked out of the dictionary and five gibberish segments of comparable length, you can tell at a glance which five words came from the dictionary and which five didn't. So that's a useful feature.

Another useful feature is along the same lines: it's based on frequencies of pairs of letters instead of single letters. Some pairs of consecutive letters in English are more common than others. "ti" is common; "xz" is not so common.

Another feature is called the longest meaningful substring. It's been suggested in some research into this problem. It involves looking at your input, your domain name, and seeing what's the longest substring in there that you can actually find in the dictionary. So "bugmenot" contains the words "bug" and "not", so the longest meaningful substring is of length three. "amazon" is actually a word in the dictionary, so this is definitely going to register as not-gibberish by this metric. "ebay" contains the word "bay", so it's like three-quarters not-gibberish. And this actual gibberish here is, well, gibberish: I can't find any word from the dictionary in this thing. So, another useful feature.

And the last feature that has been suggested in the past that I want to talk about is the NXDOMAIN responses. Remember, just like the cheese shop, there are the repeated "no sir", "no sir", "not today sir": "no such name", "no such name". Just counting these is a useful feature.

So we have all of those suggested useful features for detecting DGA, and we are done, right? The problem is solved, and we can go shopping. Well, not exactly, as you might have imagined. The first issue with what I just said is what I like to call the tumblr conundrum. I mean, let's look at reddit. In reddit there's the word "red" and there's the word "dit" (that's from Morse code, you know, the little dot). And if you look at Google, it contains the word "go", and, my god, we lucked out, "ogle" is a word. So the longest-meaningful-substring criterion is going to look at Google and say: okay, "go", "ogle", seems legit. But let's look at tumblr. What is a tumblr? You're not going to find this in the dictionary. I mean, "tumblr" is not a word, and no substring of it is a word either: "umblr" is not a word, "tumbl" is not a word, "mblr" is still not a word either. And so there you have one issue, because the longest meaningful substring is going to look at this and say: this is gibberish. And if you got a human to take a look at this, the human is not going to be so hasty to say that this is gibberish, and we're going to touch later on the reason why. That's one issue.
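For illustration, here is a minimal sketch of that longest-meaningful-substring feature. The tiny WORDS set is a stand-in for a real dictionary:

```python
# Stand-in for a real dictionary; the feature scans every substring of a
# label and keeps the length of the longest one found in the word list.
WORDS = {"bug", "me", "not", "amazon", "bay", "red", "go", "ogle"}

def longest_meaningful_substring(label):
    best = 0
    for i in range(len(label)):
        for j in range(i + 1, len(label) + 1):
            if label[i:j] in WORDS:
                best = max(best, j - i)
    return best

for name in ("bugmenot", "amazon", "ebay", "tumblr"):
    print(name, longest_meaningful_substring(name))
# tumblr scores 0 here, which is exactly the conundrum: the feature
# calls it gibberish even though a human would not.
```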
The second issue is Kwyjibo. Now, Kwyjibo is a DGA engine that surfaced a few years ago, and it draws its name from an incident in an episode of The Simpsons where Bart is playing Scrabble against Homer. Bart is stuck with the list of letters that you see up there at the top of the slide, and he doesn't know what to do, until he says: well, you know what, I'm going to put on the board the word "kwyjibo". And he plays that word, and of course it's worth a billion points, because he used all his letters, and it's on a triple-word square, and so forth and so on. Homer is not happy, and he asks Bart what this word is, and Bart, without blinking an eye, says: well, kwyjibo, it means a stupid North American yellow ape.

So, much like Bart was able to pass this under the radar because "kwyjibo" sounds like a word even though it's not a word, Kwyjibo the DGA generator passes domain names under the radar because they sound like words, but they're not words. What Kwyjibo does, and this is stupidly simple, is make sure that every other letter in its output domains is a vowel.

Now you're sitting there and thinking: Ben, look, just this, every other letter a vowel, and all of the features that you talked about earlier are now suddenly useless? Let's look at the letter frequencies. Earlier, the gibberish generated by domain generation algorithms contained lots of rare letters, X's and Z's and J's. Now, if you average out the frequencies of the letters you're going to encounter, suddenly it looks much more peachy, because you have vowels everywhere, and vowels are common letters. You're telling me: okay, let's look at the pairs of letters, the bigrams? It's the same thing: pairs of letters with vowels in them are very, very common. Not all of them, but a lot of them. And with three letters you're going to run into more or less the exact same issue. So the letter-frequencies approach is now going to be significantly weaker than it was before.

And how about the longest meaningful substring? Well, you can play a game: you can look at the domains listed here, which I swear I pulled randomly from a Kwyjibo-based DGA, and start looking for words. By skimming this I found "give" and "nope", and, I swear I did not plan this in advance, "gated", which is very appropriate for this conference. So the point is that your features, which seemed so strong before, are now half-useful. And if you take half-useful features and feed them into a machine learning algorithm, you're going to get a result that's a total loss.

So, because of all of those problems: huh, okay, let's improvise. Because of all of those problems, we came up with our pretty idea for a solution, a pretty nice theoretical idea that involves looking at the input and deciding how close it is to a concatenation of words from the dictionary. So now tumblr, which was complete nonsense before, can, with just one edit, turn into "tumbler", which is actually a word from the dictionary. We have a demo here, let's see, does it work? Oh, excellent. So we insert just one letter and we get "tumbler", which is a word. Google, with two edits, becomes "googol", a word from the dictionary, and that's not a coincidence: this is the word that inspired the company name.
Reddit, with two edits, becomes "red it". And now we can finally make sense of those domain names. As for Kwyjibo: Kwyjibo generates strings of gibberish, and sometimes it's going to luck out and create a word that's actually in the dictionary, but it's not going to successfully create concatenations of actual words. One word might be in there, but you're not going to be able to look at the whole thing and make sense of it through the lens of this criterion.

So the way forward seems clear. Step one: we measure the minimum distance of the input from a concatenation of dictionary words, any concatenation, as long as we can reach it, using the criterion of edit distance. I skipped over this, but the edit distance between a word and another word is the minimum number of insertions of letters, deletions of letters, and edits of letters that you need to get from one word to the other. So we look at the minimum edit distance between our input and a concatenation of words from the dictionary, then a miracle occurs, and then we profit, because this new criterion is going to defeat Kwyjibo and mitigate the tumblr problem. The word tumblr is now suddenly going to make sense to us, and we're going to win the game.

So we were super happy, and then we actually tried to implement this thing in practice. Why did we run into trouble? Well, let's look at how you actually perform this computation of edit distance. The canonical algorithm to do this is called a flood search. What you basically do is take your input and perform a breadth-first search in the space of possible strings, by performing every edit that you can think of. It's basically a stupid brute force. So as the number of edits that you're willing to search for grows and grows, the size of the space that you're going to search grows exponentially. If you have an input of size eight, and you're actually willing to search it out exhaustively and see how close to the dictionary it is, then you're going to need (I don't remember the number by heart, but it's a large number of) lookups. So let's say that one lookup into the dictionary is going to take one microsecond; I imagine that's in the right ballpark. So we take the number of lookups, and we multiply it by the number of seconds that we can expect one lookup to take, and now we reach the conclusion that in order to get our answer for how close to the dictionary the input that we have on our hands is, one input, one domain name, we're going to have to simply plug it into the algorithm and sit and wait for two and a half days. Two and a half days! Oh, and by the way, that's a lower bound: the number of lookups that I mentioned earlier is not the actual number of lookups; the actual number of lookups is greater. Every calculation you see in this presentation is a back-of-the-envelope combinatorics calculation. Yes, it's not exactly precise; it's just to give you an idea. So that was a lower bound.
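To make the blowup concrete, here is a sketch of that naive flood search: a breadth-first walk over every string within k edits of the input. The implementation details are my own, but the exponential growth of the frontier is exactly the problem being described:

```python
import string

ALPHABET = string.ascii_lowercase

def one_edit_neighbors(word):
    """Every string reachable from `word` by one insert, delete, or replace."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    return deletes | inserts | replaces

def distance_to_dictionary(word, dictionary, max_k=2):
    """Smallest number of edits turning `word` into a dictionary word."""
    frontier = {word}
    for k in range(max_k + 1):
        if frontier & dictionary:
            return k
        # The frontier multiplies by roughly 50 * len(word) per step:
        # this is the exponential blowup.
        frontier = {n for w in frontier for n in one_edit_neighbors(w)}
    return None  # farther than max_k edits away

print(distance_to_dictionary("tumblr", {"tumbler"}))  # -> 1
```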
Oh, and by the way, we're implicitly comparing our input against the set of all possible concatenations of dictionary words, and there's an infinite number of them. "Laptop paper": a concatenation of words. "Bottle chair person podium": still a concatenation of words. There's an infinite number of them. So the problem is that Google's infinite-server project isn't going public for another half a year, so we're not going to be able to take this infinite database and fit it anywhere.

So at this point we realized that we're going to have to improvise. We needed to come up with some ugly hacks and cut some corners on this insane computation, to make it actually feasible to compute this thing.

The first ugly hack that we came up with, it was like a first-aid solution, is to use a greedy algorithm: instead of trying to match the whole input against the set of all possible concatenations of dictionary words, try to look for prefixes of the input and match them against the plain old dictionary. Now, first of all, by using this imprecise approximation you can expect your running time to drop by a lot, because the length of the input that you feed into the edit distance algorithm has, as I said earlier, an exponential relation to the time you can expect the search to take. So that's first of all; you can see the numbers here, the difference is drastic. But the more important thing is that now we're actually comparing against a finite dictionary, and that's progress. Now, it's not all sunshine and rainbows, because the expected time, using the back-of-the-envelope calculation here, is still, what, an hour and something per one traffic capture. That's unacceptable. Unacceptable! We're not going to sit there for an hour and wait for the computation on a single traffic capture to go through. So we need more ugly hacks.

The second ugly hack involves looking at the classical approach that I talked about earlier, that stupid breadth-first flood search. Now, what would happen if I took the dictionary that we have and bloated it to contain all strings of letters that are within an edit distance of two of the original dictionary? What's going to happen? Well, you're going to have a larger dictionary, but the interesting thing that's going to happen is that you will now be able to do a much smaller flood search for the same result. Because think about it: if your input is within an edit distance of four of the original dictionary, then it's within an edit distance of two of something that's within an edit distance of two of the original dictionary. So with just two edits you're going to be able to match up against that something. Now, the input to your flood search is the same length, but the number of edits that you're expected to make is much smaller. So you cut the exponent, and now suddenly the running time is drastically reduced. But the downside of this is that your dictionary is now larger. Much larger: by the back-of-the-envelope calculation, by a factor of something like 10,000. So it's unwieldy, but we can still live with it.
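A minimal sketch of that bloating step: precompute the distance-2 neighborhood of every dictionary word, once, offline, so that queries only need a depth-2 search. The neighbor generation mirrors the flood-search sketch above, and the toy two-word dictionary is just to show the size cost:

```python
import string

ALPHABET = string.ascii_lowercase

def one_edit_neighbors(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    return deletes | inserts | replaces

def bloat(dictionary):
    """All strings within edit distance 2 of the original dictionary."""
    bloated = set(dictionary)
    for word in dictionary:
        for n1 in one_edit_neighbors(word):
            bloated.add(n1)
            bloated.update(one_edit_neighbors(n1))
    return bloated

small = {"shoe", "shout"}
big = bloat(small)
# Two words explode into a huge lookup table: that's the factor-of-
# thousands size cost traded for a much shallower search at query time.
print(len(small), "->", len(big))
```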
So that's still not enough; let's go on to the third ugly hack. The third ugly hack is not so ugly, actually. It involves looking at the distance measure that we are using, the edit distance. Now, what would happen if we disallowed in-place editing of letters and only allowed insertions and deletions?

Now, on the face of it, the metric that you're going to get is not much less legitimate than edit distance. Actually, there's no reason to think that it's less legitimate: the insertion-deletion distance between two strings of letters is bounded between the edit distance and twice the edit distance. The proof of this is left as an exercise to the reader. So we've switched one criterion for another criterion, and the two criteria are, we think, more or less legitimate to the same degree. But now our flood search is going to contain far fewer options to iterate through, because all the in-place edits are gone; you don't have to worry about them anymore.

So, all of those ugly hacks are nice, but the nice thing really is not the individual ugly hacks; it's the way that they come together to form something greater than the sum of its parts. Now, let's look at the search that I talked about earlier and how it can combine with insertion-deletion distance. Let's look at the not-a-word "spooks". We can remove one letter, and then remove another, and get "shoe". And if we look at the word "shout", we can remove one letter, and then remove another, and also get "shoe". And those two words, after a fashion, have now met in the middle. So what we have now proved is that the insertion-deletion distance between "spooks" and "shout" is at most four: because you can take "spooks" and remove two letters, and then go backwards and add back the two letters that were removed from "shout", and "spooks" and "shout" are now connected by four edits.

Now, there's a peculiar something about this computation. You may have noticed that we didn't actually have to insert any letters; we only deleted letters. We deleted letters from "spooks" and we deleted letters from "shout", and eventually we reached this kind of lowest common denominator. And the truth is that it's possible to do this for any two inputs. So really we can forget about symmetric insertion-deletion and just talk about symmetric deletion. Now, this is a nice thing, because now we're only left with deletion, out of the insertion and deletion and editing that we started with.

Now, let's combine this with the bloated-dictionary idea that we saw earlier. We can keep a dictionary containing all the reduced forms that you can get from a word in the original dictionary by removing letters. So now you can take your input and start deleting letters and comparing against reduced forms, and eventually you're going to be able to meet in the middle, just like "spooks" and "shout" did. And the expected time to carry out this computation is much, much lower, because now you don't have the crazy flood search anymore; you only have to delete some letters.

Now, to really carry this algorithm all the way through, you'd have to look at all the deletions that you can perform on your input, but we decided to limit it to taking away three letters, because really, the performance gain from this was very nice, and otherwise: what word in English do you know that you can take away three letters from and there's still something sensible left of the original? There are some, but we thought that it was a good trade-off. So the limitation that we got out of this is that if actually meeting in the middle requires more deletions than the number of deletions we were willing to make (we set that to be three), we won't be able to do the meet-in-the-middle thing. But otherwise this thing is going to work. And the important thing is that insertions we get for free: all the insertions that need to be made in order to get to the word in the dictionary are going to be detected. Even if you need like eight insertions or something, it's still going to work, because you're going to find the reduced form that was created by deleting those eight letters from the word in the dictionary.
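Here is a compact sketch of that meet-in-the-middle trick (the same idea behind the SymSpell spelling-correction algorithm). The three-deletion budget mirrors the talk; the word list and everything else is my own illustration:

```python
from itertools import combinations

MAX_DEL = 3  # the "take away three letters" budget from the talk

def reduced_forms(word, max_del=MAX_DEL):
    """Every string reachable from `word` by deleting up to max_del letters."""
    forms = set()
    for k in range(min(max_del, len(word)) + 1):
        for idx in combinations(range(len(word)), k):
            forms.add("".join(c for i, c in enumerate(word) if i not in idx))
    return forms

def build_index(dictionary):
    """The dictionary of reduced forms, built once, offline."""
    index = {}
    for word in dictionary:
        for form in reduced_forms(word):
            index.setdefault(form, set()).add(word)
    return index

def candidates(query, index):
    """Dictionary words meeting the query in the middle, with ins/del cost."""
    hits = {}
    for form in reduced_forms(query):
        for word in index.get(form, ()):
            # deletions from the query + deletions from the dictionary word;
            # insertions come for free, exactly as described above.
            cost = (len(query) - len(form)) + (len(word) - len(form))
            hits[word] = min(cost, hits.get(word, cost))
    return hits

index = build_index({"tumbler", "shout", "shoe"})
print(candidates("tumblr", index))  # -> {'tumbler': 1}
```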
So how good is this thing? How quick is this thing, now that we've applied all of those ugly hacks? Well, obviously, we needed to implement it to find out. I was going to implement it in Perl, but just the other day I heard that Perl is horrible and you shouldn't write anything in Perl, so I went back to the drawing board to write everything in Python. And there are all sorts of calculations here of how many atomic lookups in the deletions dictionary (the bloated dictionary that we created) were required for this test run, and how much time the test took, including, I should say, the generation of the actual gibberish to be tested. You divide this by that, and you multiply this by that, to see how long it's going to take to do the same feat that we tried to do before, which is to see how close to the dictionary a string of eight characters is. Earlier it took us two and a half days, and now, by this calculation, it's going to take us a quarter of a second. A quarter of a second! That's an improvement by a factor of nearly a million. So we're very happy now, and, in the wise words of Hannah Montana, we can now enjoy the best of both worlds: the dictionary size is hefty, but it's not prohibitive, and our query time has improved by a really drastic amount. Okay, yes, we have this limitation on the number of deletions that we can deal with, but I explained why I think it's a worthy trade-off, and this feature is now something that can actually happen in the real world.

So we decided that what we need to do now is an experiment. Let's recap the three features that we decided to extract from a pcap, to see how close that pcap is to being DGA-generated traffic. First, there's the maximum number of DNS requests that got the same response, whether that's an IP address or an NXDOMAIN. Second, for ten of the involved requests, we calculated this feature that I have just spent like twenty minutes explaining: how close the domain is to a concatenation of words from the dictionary. This was computed for just ten of the involved requests to improve performance; I think that's enough. And finally, we computed the frequency feature based on bigrams. I explained this feature earlier, and it's mainly for comparison's sake, to see how our metric fares against it, but also so they can work in unison, to maybe jointly detect things that each one of them alone could not.
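Put together, the per-capture feature vector has roughly this shape. This is a hypothetical sketch: the field names are mine, the collision and bigram cutoffs fold in the rules of thumb quoted in the demo below (more than five collisions raises alarms; bigram surprise up to 1.5 is reasonable), and the lexical-deviance cutoff is purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class PcapFeatures:
    max_domain_collisions: int        # most DNS requests that got the same response
    lexical_deviance: float           # distance to a concatenation of dictionary words
    pronounceability_deviance: float  # bigram-frequency "surprise"

def looks_like_dga(f):
    # Hand-tuned rule-of-thumb classifier; the thresholds below are
    # illustrative stand-ins, not the ones actually shipped.
    if f.max_domain_collisions <= 5:
        return False  # few collisions: nothing alarming
    return f.lexical_deviance > 0.5 or f.pronounceability_deviance > 1.5

print(looks_like_dga(PcapFeatures(28, 0.6, 0.9)))  # the demo capture -> True
```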
So now we're going to see the resulting classifier classify a traffic capture; that's how this works. Okay, now we're going to run this on a traffic capture. Now, I set this script to basically pause the first three times something interesting happens, and then it's going to just blow past the rest, so we're not going to be standing here forever. Okay: we have the traffic capture, we start to analyze it, and we're going to extract the features from it. Now we find that the most common DNS response that was received during this traffic capture is an NXDOMAIN, so we're going to look at the requested domains that got this response, and now, I bet, you're getting a little suspicious looking at those domains. It's a list of 28 domains, and now we have our maximum-domain-collisions feature: this is the maximum number of domains that mapped to the same response.

Now we're looking at the relevant requests (they're actually all the same length here, so they're just sorted alphabetically), and we're going to start analyzing them using the features that we talked about before. Now, we're going to start with what's called the pronounceability deviance. That's just a fancy way of saying that we're going to look at the pairs of letters, the bigrams, and see how frequent they are and how this compares to what we would expect from English. So we start looking at the first domain name, and now the algorithm looks at this input, the domain name, and says: okay, it starts with a C. How surprised am I by the fact that it starts with a C? Well, as you can see, the answer is 5.6 surprised, more or less. Don't ask me about the measurement units on this, because that's a whole talk about probability theory, and we don't have the time for that. The same thing happens next, because after the C there's an I, and the algorithm asks itself: okay, I saw a C; how surprised am I by seeing an I after the C? Well, it's 4.3 surprised, less than before. And this goes on and on, letter after letter, until finally we have a bag full of surprise, and we are very surprised. We have a number, and we normalize it by dividing by a factor of the length of the input, and we get a score of how surprised we are generally, bigram-wise, pairs-of-letters-wise, by this input, by this domain name.

So we move on to the next domain name, and the same thing happens: we get a general idea of how surprised we are by it. And then the same for the next domain name. We went through three domain names, and now it's going to go whoosh: we went over all the domain names and we averaged them out, and now we've got a measure of how surprised we are by the bigrams, the pairs of letters, in all of the relevant domains here, generally, on average. And the answer is 0.9 surprised.
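A minimal sketch of that "surprise" score, assuming the standard information-theoretic reading (surprise is the negative log-probability of each bigram, summed and normalized by the label's length). The toy probability table is made up; a real model would be estimated from a large English corpus:

```python
import math

# Toy bigram model: probability of seeing the second letter after the
# first, plus a table for the first letter of a label.
BIGRAM_P = {("c", "i"): 0.01, ("i", "q"): 0.0001, ("q", "i"): 0.002}
START_P = {"c": 0.03}
FLOOR = 1e-6  # probability assigned to pairs we have never seen

def surprise(label):
    total = -math.log(START_P.get(label[0], FLOOR))
    for a, b in zip(label, label[1:]):
        total += -math.log(BIGRAM_P.get((a, b), FLOOR))
    return total / len(label)  # normalize by input length

print(round(surprise("ciqi"), 2))
```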
We keep on going, and now we're going to calculate the lexical deviance, which is really just a fancy name for the feature that I spent twenty minutes explaining how we put together: the closeness to a concatenation of words from the dictionary, as approximated by a host of ugly hacks. So we're going to start calculating it, and in order to calculate it, the greedy algorithm is going to iterate over every possible prefix and see which prefix looks the most promising to take away from the input, and say: okay, I imagine this, more or less, is my most promising candidate for something that used to be a word in the dictionary but got mutilated somehow.

So we perform the lookup on the prefix "c": we can delete nothing and stay with "c", or we can delete the "c" and stay with nothing, and it turns out that both of those options are in the dictionary. So obviously it's better to delete nothing and just say that "c" is in the dictionary. It's a candidate; you can just take it away and say: okay, that's a word from the dictionary. The downside, of course, is that "c" is just one letter. So it may be in the dictionary, but it may not be the best candidate to take away from the input, because we want a longer candidate, to cut down the input length. This is how the greedy algorithm operates: it wants to take away the longest, the best candidate, and that also depends on the length; we want to make the input smaller as we work on it. Now we look at the prefix "ci", and the same process happens, basically, with every prefix that's in this input, until finally the algorithm makes its choice: okay, I think taking the first two letters, "ci", is the best choice here; that's the closest thing I have to a word in the dictionary that I can take away. Now the same thing happens again: it takes away the prefix "li", and it takes away the prefix "qi", and so forth and so on, until finally it reaches a score for the similarity of this domain name to a concatenation of words from the dictionary.

Now the same calculation happens for the next domain name, and the next domain name, and the rest of the domain names, and they're all averaged out to find the final measure of the closeness to a concatenation of dictionary words of all the domains that were relevant, the ten domains that we extracted and decided to take a look at.

So we now have our final list of features. We have 28 domains pointing the same way; we have a lexical deviance of 0.6; and a pronounceability deviance of 0.9. Where the lexical deviance, I remind you, is the feature that we built here, and the pronounceability deviance is based on the pairs of letters and how frequent they are. Now the algorithm is going to be looking closely at those features, and it's going to look at the 28 domains pointing the same way and say: well, that's too much. 28 domains pointing the same way is really too much; anything more than five raises alarms already. So it says: that's excessive. Now, as for the closeness to a concatenation of words in the dictionary, it looks at the value (later I'm going to tell you where the parameters for the classifier came from), and it says: that's also excessive. It's not close to concatenations of words in the dictionary at all. And finally, it's going to look at the pronounceability deviance, how likely the pairs of letters that appeared seem to be, and it's going to say that it's actually reasonable, because we got the value of 0.9, and anything up to 1.5 is actually reasonable. So: lots of domains, and they look like gibberish to our feature, but the bigram feature looked at this thing and said, okay, seems fine. Why is this? I imagine some of you have guessed: it's Kwyjibo. This capture was generated by Kwyjibo. So now the classifier is going to look at the domain collisions and the lexical deviance and the pronounceability deviance and say: okay, Mr. Pronounceability Deviance, you said that this was reasonable, but I have another feature now that I can rely on, and this is definitely DGA.
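As an aside, here is a rough sketch of the greedy decomposition loop that produced that lexical-deviance score: pick the most promising prefix (the best trade-off of length against closeness to a dictionary word), strip it, and repeat. The scoring rule and the tiny word list are simplified stand-ins for the real thing:

```python
# Tiny stand-in dictionary; the real one is the bloated reduced-forms index.
WORDS = {"c", "ci", "give", "nope", "gated", "tumbler"}

def prefix_cost(prefix):
    """Cheapest ins/del cost from `prefix` to any dictionary word (brute force)."""
    best = None
    for word in WORDS:
        # insertion-deletion distance via the longest common subsequence
        m, n = len(prefix), len(word)
        lcs = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m):
            for j in range(n):
                lcs[i + 1][j + 1] = (lcs[i][j] + 1 if prefix[i] == word[j]
                                     else max(lcs[i][j + 1], lcs[i + 1][j]))
        cost = m + n - 2 * lcs[m][n]
        best = cost if best is None else min(best, cost)
    return best

def lexical_deviance(label):
    total_cost, total_len = 0, len(label)
    while label:
        # Greedily pick the prefix with the best length-minus-cost trade-off.
        k = max(range(1, len(label) + 1),
                key=lambda k: k - prefix_cost(label[:k]))
        total_cost += prefix_cost(label[:k])
        label = label[k:]
    return total_cost / total_len  # normalize by input length

print(lexical_deviance("givenope"))  # concatenation of words -> 0.0
print(lexical_deviance("ciqiliqi"))  # Kwyjibo-ish gibberish -> higher
```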
So this was the short demo of how this thing works, and now we're going to look at some pretty graphs. Now, we took 10,000 pcaps out of Check Point's malware lab, just to see how the data looks if we map it across the features that we created. These are your 10,000 pcaps, mapped across the closeness to a concatenation of dictionary words and the maximum domain collisions, the number of DNS requests that got the same response. Now, if that lump there to the upper right seems suspicious to you, then you're probably right, because when we took some test samples, like a hundred test samples, just to test the waters, and labeled them by hand, the DGA samples were all in the lump, and the clean samples aligned neatly along the vertical over there to the left.

And if you want a visualization of the classifier itself: I promised you that I'm going to tell you how it was generated. Actually, I tried all sorts of machine learning algorithms. I tried Gaussian mixture models and all that, but so far I got the best result by just looking at the test data myself. So really this is a case of a subclass of machine learning called Ben-learning. And I'm kind of bummed, because I really wanted a proper machine learning algorithm to make sense of this data, and I'm still looking for it. This is the classifier. It was not generated based on any of the data that you see mapped against the classifier; that's the test data, I mean, the 10,000 pcaps that we took. I generated the parameters that make up this classifier based on other traffic captures, of course: you don't test your classifier on the same samples that you used to generate the classifier; then you're going to have overfitting, and that's not cool. And this is how the classifier looks.

Now I'm going to talk a bit about the future of this project and what needs to be done further on. First of all, more testing, because testing on a hundred samples is nice, but I bet that there are a lot of surprises that DGAs have up their sleeves that will require this project to evolve: to get better features, and to fine-tune the features to be better. Actually, we're going to touch on one of those in like two slides. Next, as I said, more machine learning. I want to be using an actual machine learning algorithm, even though this finely hand-tuned classifier worked; when you saw the graph, I think you saw that it was a very reasonable conclusion to draw from the data.

And finally, there's gibberish detection 103. Because, as it turns out, domain generation algorithms, some of those at least, have already anticipated this kind of analysis that we have performed here, and you have DGAs, such as the one used by the Matsnu malware, that generate random domains by concatenating words from the dictionary. So the bigram feature and our feature are going to do nothing to detect this kind of DGA. How could we evolve to defeat this? Well, I'm just throwing an idea out there: maybe we could take a look at the words from the dictionary that we actually found and make some kind of semantic comparison, to see how likely they were to actually appear together in the same domain name. I mean, "panther" and "asphyxiation"? I guess I could see it as a good name for a rock band.
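For flavor, here is a toy of that dictionary-concatenation style of DGA. The word list and the two-word shape are invented for illustration; the real Matsnu algorithm differs in its details:

```python
import random

# Invented word list; a real dictionary-based DGA carries a much larger one.
WORDS = ["panther", "asphyxiation", "bottle", "chair", "person", "podium"]

def dictionary_dga(seed, count=10, tld=".com"):
    rng = random.Random(seed)  # a shared public seed, same as before
    # Concatenating real words keeps letter and bigram frequencies, and
    # even lexical deviance, looking perfectly ordinary.
    return [rng.choice(WORDS) + rng.choice(WORDS) + tld for _ in range(count)]

print(dictionary_dga("2016-12-28")[:3])
```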
So, how about undetectable gibberish? Can this arms race eventually end in undetectable gibberish? Well, my personal opinion is that an undetectable lump of gibberish is a contradiction in terms, because the idea with "undetectable" is that it resembles the legitimate distribution of what you expect to see in the real world, across every conceivable feature. Now, if something resembles the real world across every conceivable feature, it's not going to look like gibberish to you. But that doesn't mean that we're off the hook, because while I think that undetectable gibberish is an impossibility, undetectable auto-generated domain names are very much a possibility. I think that, in theory, DGA authors could eventually create an algorithm that generates random domain names that are very much like the domains that you see out there in the real world, across every feature that you could conceive of. And it's not going to run into any issues, because even if you impose that set of constraints, the space of possible domains is so large that, even with all those constraints, there's plenty of space there for auto-generated domains to prosper and flourish.

But, you know, that's all theoretical, in the future. In practice, you have total-gibberish DGAs running amok today; you have Kwyjibo running amok today; and at the top end of the hierarchy you have the dictionary-concatenation-based DGAs running amok today. You know what: let's first force all the DGAs to actually use undetectable gibberish, and if we do that, I believe that we will have done enough for that day. Yes, then we can think about what we can do from that point on.

Now, your next question regarding this nice classifier, which tells you whether traffic captures contain DGA or not, is: can I have it? The answer is yes. There's the address of the GitHub repo. I really hope that people try to use it and say: oh my god, Ben, it doesn't work for me at all, and it's useless, please improve it. Because I want to improve it; I want this thing to work and be available for anyone who wants to detect DGAs in their pcaps.

So, we can now summarize this whole journey. We can say that DGAs are a pain, and that automatic detection of DGAs helps. But if it is done naively, it gets confused by strategic placement of vowels. If you do it less naively, it gets less confused by strategic placement of vowels, and it becomes equipped to handle funny domain names like tumblr. And undetectable auto-generated domain names may be a possibility in the future, but, you know what, first let's force all the DGAs to use them, and then we can see what we can do next. So, thank you. Are there any questions?

Okay, thank you, Ben, for your very nice talk, very interesting talk. If you have to leave now, do so quietly, please, so we can have a nice and informative Q&A session. Thank you very much. So, we have a few minutes of time; a few, not so few minutes, we have a lot of time for Q&A. So if you have any questions, line up at the microphones, and we will also take questions from the internet. Leaving is okay; talking is not, unless you step up to a microphone. Thank you. So we will start at the front: left from my side, right from your side, at the
microphone.

Hi, so thanks for your talk. I was wondering: you had lots of issues with the dictionary approach because of size and so on, and then you built all these hacks. But I wondered whether you thought about getting rid of the dictionary altogether by including syllable information, for example. Because I also think you're rather interested in possible words of the English language than in the words that are listed in the dictionary. So, for example, "Mandalorian" or something like that is probably in a dictionary about Star Wars, but probably not in the dictionary you used, yet it is a possible word of the English language by the rules for syllables. So, yes, I just thought maybe you could get rid of the dictionary by using more syllable information.

So, well, actually, there has been one paper on this subject that used exactly the approach that you described, well, more or less, by stemming words and trying to look at morphemes and so forth and so on. It was very interesting. Their particular attempt didn't go so well, but I think it's a good approach. I don't know about getting rid of the dictionary completely, because I'm not very convinced about how good the morphemes, the word stems, or the syllables, or the atoms that you are proposing, are going to be at reconstructing words from the dictionary. But, you know what, I think it's an avenue worth pursuing. If this could actually help downsize the dictionary, then it's something that I'm interested in looking into and doing, because right now the dictionary weighs like 10.7 gigabytes, and anything to make it smaller is very welcome.

Thank you.

Okay, the left front microphone, please.

I was wondering how you manage domains like xkcd, or things which are valid but which aren't in the dictionary.

Things like xkcd, which isn't even like a word in the dictionary at all? Yeah, that's a really good question. Not every domain name is going to fit into this paradigm of domain names that are like a word in the dictionary, which is why this thing still needs to be tuned, and more approaches for recognizing valid domain names need to be added to it. Looking at the top domains, I mean, that's a given. You know what, let me ask you a question: what sort of approach do you see that would have foreseen, from first principles, that xkcd is a legitimate domain name? That's kind of the issue. So I think things like xkcd will either have to be manually whitelisted, or you could have something specific, I don't know. I really don't know how you could see in advance that xkcd is legitimate, except, you know, by taking this a-priori knowledge and applying it.

Okay, the rear right microphone, please.

Punycode domains, as well as some domains in use in other countries, can look like English-letter gibberish; China uses a lot of strings that we wouldn't necessarily recognize. And have you looked at other properties of these domains, such as time of first use (you know, has anyone gone to these domains before), as well as how recently they were registered?

Okay, so first of all, regarding the languages issue: this project was specifically about English. It is a work in progress and a proof of concept. I have thought about that, and if this thing is actually going to be used and is going to function properly,
I am interested in expanding it to be able to handle other languages. Now, can you remind me of your other question?

So, yeah, Punycode, or domain strings that are English characters but are used, or popular, in countries where they're not English words.

That's what I said.

And other characteristics of these domains: for instance, have they ever been accessed, as well as registration.

Right, yeah, yeah, I remember now. Well, plenty of projects trying to solve this broad problem in the past have relied on this sort of thing: keeping a sort of ongoing intelligence operation that will be able to tell you whether a domain is a DGA, or even in general is suspicious, based on this sort of intelligence. When was it registered? When was it accessed? And so forth and so on. These are all legitimate features. I specifically decided that they were out of scope for this project, and that this project should join hands with any sort of engine making use of an ongoing intelligence operation, instead of reinventing the wheel and reimplementing that sort of approach.

Thank you.

Okay, signal angel, are there questions from the internet?

I'm just looking at them right now, but it seems the internet has the same questions.

That's convenient. So, okay, the next question is from the microphone at the right front.

It sounds like the DGA algorithms will soon output some sort of dadaistic poetry, which will make your system have to identify dadaistic poetry, and in like 400 different languages, as many as the dot-com zone will allow you to register. That will be a problem. So my question is: in what place in the DNS ecosystem do you believe your system will live? I mean, where exactly do we have a use for this?

Okay, so actually this was conceived in an entirely different context than what I described in the first few slides. This was conceived as a feature that you can compute on a sample in a sandboxing context, so you can do machine learning operations and so forth and extract the feature: you can look at the sample and say, okay, it's DGA; that's one of the features that I can look at and use. But, as I mentioned in one of the first slides, I think the pinnacle of what this thing could do is, if the code were optimized for performance in that kind of context, which it currently is not, it could sit on a firewall and throttle DGA traffic before it manages to get out of the network.

Thank you.

Okay, and there's another question at the rear right microphone.

Yes, you mentioned that you tried other machine learning models for this. Did you try a Markov chain classifier?

Actually, that one I haven't tried. Markov chains, you're saying? Okay, I'm going to try that tomorrow.

I'm aware of some other research that does DGA classification using Markov chains, and there are a couple of advantages to this. If you're concerned about bigrams not being enough, you know, you can extend it to a third-order or fourth-order Markov chain assumption with smoothing. And also, you know, lookup time is linear, so it's pretty fast.

Okay, I'm going to have to take a look at that. Thank you.

Okay, at the left rear microphone there's another question.

Did you try using Bloom filters for the lookups? Because they might reduce the lookup time significantly.

To use what?

Bloom filters.

No, I have not. I'll look into that too. Lots of useful suggestions today, I see. I am hopeful. Okay. Ah, we have time for another couple of questions.
So I'm looking at the signal angel: nothing from the internet. Okay, then a question, maybe to wrap it up, from the right front microphone.

Thank you, I like your presentation very much. I'm more from the data science side, and not so experienced with the network perspective. Okay, I think it looks like a cat-and-mouse game, and I think, from the methods perspective, it's a losing game. Or I would like to hear more about the motivation from the beginning: the standard process that was described, how it is handled, with the middle management and the report that is written and the reverse engineering. Are there statistical methods involved today that play the same cat-and-mouse game, and you are only playing it better? Or is it just this game, which I personally believe is a losing game?

Mm-hmm, I agree with you completely. That's what I said: in theory, it really is a losing game, because authors of DGAs are going to evolve eventually; they have, and they will. And, as I said, my personal analysis (and it seems that you agree with me) is that eventually, theoretically, they win this game. Now, security is full of such games where the attackers theoretically, eventually win, because the mission, the burden on the shoulders of the defenders, is, in the worst theoretical case, equivalent to some NP-complete problem, or is generally, in theory, in the worst case, impossible. But while you are correct, I believe that we should be focusing on what's happening right now. As I said: okay, it's a losing game right now, but besides trying to look aside and think about what game-changer we can bring in here to make this ultimately not a losing game, we need to do our best to push forward and make it less of a losing game, temporarily, even if we know that eventually, in theory, we're going to lose. That's what I believe.

Okay, then all that is left is: give a warm applause to Ben, and thank you very much for your talk.