Welcome, everybody. I am standing between you and your lunch, and I will take some time, but I'm here to have a good conversation and share my experience of building a cybersecurity product from the beginning. I have been a data scientist, now an AI scientist, whatever you call it, for a long time. Let's not get into the history. What I want to know from you is how many of you are in cybersecurity. Okay, so I can pitch my talk accordingly.

Seven years ago I predicted in a conference talk that there would be conferences and networking events where people would talk about nothing but random forests and XGBoost. At that time people were saying, no, no, we are not there yet. Now I can predict that a few years from now everybody will be talking about PowerShell: what attacks you are getting, why you are getting them, and how to overcome them. I can see it coming because there are a lot of loose ends where it is very easy to take over.

So here is what we'll do. We'll start with the big problem: what is the problem definition, and what is its magnitude? Then we'll get into how we can address some of the problems using deep learning. Let's make it conversational. I don't mind questions in between; I would be more than happy to take them. Then we will go through real case studies and conclude with a takeaway message: if I am not from cybersecurity but want to apply this in some other field, what do I take away?

So, time to act. Think of the person next to you. You shook hands and introduced yourself just by telling your name. What if I say that if I just gather a bunch of data from your Facebook and LinkedIn, I can actually predict: I have a GAN, and that GAN can predict what kind of
last five passwords you may be setting up. It is actually possible. We haven't released that model. We have had the thought and we are experimenting with it, but we don't want to release it in the wild. It is very much possible, though, because deep learning and machine learning have reached that stage. There are only so many things we think of when we pick a password. What would you typically do? You will use a birthday, your father's name, your mother's name, a kid's name, or your city of birth. You just take those n dimensions (come on, all of you are data scientists, right?), you permute from there, you have some regex, and you feed that to the GAN. It is going to learn from millions of data points and figure out what that function is.

But attackers don't need to do that. They may be doing it, but they don't need to. They take the easy way; they are way sharper and cleverer. What they do is very simple social engineering. They figure out: here is Satnam, he likes rock climbing, so I'm just going to send him a link to some rock climbing festival, he's going to click on it, and the malware is going to drop. It will be based on a macro. Or, if the attacker also knows that I mostly use a Mac, he will use a different technique. He won't use a Microsoft Office-based one; he'll just send me a zip, and that's it. When Safari on the Mac unzips it, it is going to execute that payload, and he is going to take over. The story is over.

So what this particular graph shows is that it takes only a few minutes for any adversary to get into any enterprise, you name it. This is Verizon saying it in 2018, and they are saying the same thing in 2019 as well. The plots are different; I kept this plot because it tells the message much more clearly. Now, if you take a look at the right side of
the graph, it takes a few minutes for an adversary to get in, but to detect that you have been compromised takes hours to months. Correct? So the problem is highly skewed, highly asymmetric.

And when you talk about data breaches and attacks, you name it: every company has been breached at some point. Take any sector. One of the sectors that is predominantly attacked is the financial sector. Adversaries go where the money is. That makes sense, right? That's what they do. They also work in their own shifts, just in a different time zone. They also have holidays. They have a very structured process for doing all these attacks.

So let's try to understand what the problem is. I'm talking mostly about enterprises here. If you take a look at an enterprise network, this is what the typical enterprise looks like. You have various departments, depending on which enterprise you are talking about: let's say engineering, operations, sales, and HR. You do some basic hygiene. You do network segmentation, because you don't want threats to move around and you don't want malware to move around. You do all that, but then you have the question of multi-dimensional security: network security, endpoint security, and so on. The latest one is cloud security. If somebody with experience in cloud, and cloud security experience as well, is sitting in Santa Clara, just make a deck and go to a VC. I'm pretty sure you'll be able to raise your seed funding right there. That's how the market is right now.

Now, since I mentioned seed funding, VCs, and M&A, here is a picture. I'm sorry about the crowded slide, but my point here is that cybersecurity is a complex problem. It's not readable, but I'm going to read it for you. So on the
top left, what you see is network and infrastructure security. Then you see web security and endpoint security; in endpoint security alone there are some 50 vendors. Then you go to MSSP, risk and compliance, identity and access management, and security operations center. I could keep on reading, but I would encourage you to look up the Momentum Cyber report. My point here is that each of these silos needs data scientists. Network security started in the 80s with intrusion detection systems and so on. But now there is a next version of that: the data is being pumped to the cloud, and in the cloud you are doing the anomaly detection. So the problem is being reincarnated at a greater magnitude and a higher scale, but the problems are there, and each one of them needs AI and machine learning.

Now take the cybersecurity problem space. Last year 6.2 billion dollars were invested, around 183 M&A transactions happened, and there are 3,500 cybersecurity companies. We are just one of them. But the cybersecurity problem is so rich and so deep that you have to break it up into pieces and then address each piece. That is actually also one of the problems: these are all different silos, and somebody has to come and orchestrate them. That's where I see that machine learning and AI have to play that symphony role.

Now, if you go and talk to CISOs, they will say: I don't need another vendor, I don't need another technology, I'm going to make sure of the basic hygiene. So this is the basic hygiene, and I'm assuming you are going to take care of it. Obviously you need to keep patching your vulnerabilities; you have to keep up with it and keep updating. But think about it: you have 40,000 endpoints, Microsoft just announced one critical bug, and you have to
update all 40,000 of them. It will take time, right? During that time they are vulnerable, and that's where WannaCry and other ransomware take advantage and spread.

So we have talked about the magnitude of the problem. Which part of it can AI and ML address, and how? We will discuss that too. But before we discuss machine learning and deep learning, let's talk about data, right? 80% of what we do is data manipulation, preparation, and so on. What are the different data sources we are talking about? Here are a few that we are listing: from the network point of view, the endpoint point of view, and the authentication point of view. These are a few buckets; there are five more. I could probably present a workshop only on the data sources. Depending on which data source you are targeting, the volume and the size may be different. For example, with network logs, NetFlow is huge. You are watching each and every endpoint and the interactions among them, and how much traffic is going through each router and switch in the enterprise. It's a lot of data. With endpoint logs, you have fewer endpoints, and it depends on the machine and which OS it is, but you will have the generic things: what processes are running, what applications are running, what files got changed or created, and so on. If you take a look at those logs, that becomes something like thousand-dimensional categorical data. Another challenge there is unstructured logs: you have to convert them into structured data and then process it.

Now, is this something brand new? Not really. This has been addressed since the 90s. It is like a game between the adversary and the defender. Obviously adversaries are way ahead in the game. They are always ahead, because in order to hack a code
and take over, you may just need to write ten lines of code, but in order to cover all the different surfaces you need to secure, you may need to write ten thousand lines of code. And it's not only about the coding; it's also about the human angle, which is equally vulnerable.

So it has been addressed. If you look at the revolution and the evolution in the security space, we moved from rule-based systems to data mining to machine learning, and now we are moving more toward deep learning. But what was the driving force up until 2015 or 2016? It was UEBA, user and entity behavior analytics, which basically says: I am going to model the behavior of each entity in the enterprise network, and whenever the behavior changes, I am going to flag the anomaly. But modeling that behavior is not straightforward. There are different time zones, different places, different angles, and different attributes due to which the behavior changes. It also depends on the kind of entity you are talking about, such as different kinds of servers. So it is not straightforward.

Another thing: even if you generate the anomalies, somebody has to validate them. Do we have enough security analysts? Unfortunately not. In a typical security operations center you get 500 to 600 alerts in a typical day. How many analysts do they typically have? Four or five. And how many can each analyst investigate? Maybe 10 at most. So all you are able to address is about 50 of them, and I am saying this is a highly ambitious and highly optimistic scenario. You may be able to investigate 10% of the alerts, so 90% still remain. If machine learning and AI are just generating more anomalies, that is not going to help security.

Data mining, earlier, was more about looking at the clustering and then tagging it: okay, it is
malicious traffic or benign traffic. So there was much more engagement with the analysts there. When you moved to the machine learning way, you got a lot more of the tagging done, so now the machine learning classifier is able to say whether it is malicious or benign traffic. But even if it is able to tag it, somebody has to validate. If you want to shut down that particular switch, or take out that particular endpoint, you have to validate that, and that is the response part, which is still manual.

So it's like a human being, right? I've got a brain, I'm looking at the thing and identifying the thing. But what I felt some months back is that in this cybersecurity space, apart from machine learning, another aspect is the knowledge graph, because what I believe...

I'll bring that in. Your point is: how do we bring in the domain knowledge and weave it together with the machine learning?

Yes, there are actually two aspects to it. One is bringing in the domain knowledge. The other is that this cyber domain has almost 16 different variations, like web security, and then comes physical security. Suppose I'm a hacker. I send you a web packet, which you open up, and it then hits the physical layer. We have two different machine learning models. One is detecting web security, and it says: no, it's fine, it's a good packet. The other model says: fine, it was just a normal TXT file, it is also fine. But when you actually join these two things... What I'm asking is basically how to join it all. How do we bring the context?

Yes, I'll get to that part, thank you. So I think your point is: how do we bring the context together? I'll take the next question later, because I want
to move forward and show you some concrete examples. We have talked at a higher level, and I want to get to at least one or two examples and show you the real things.

So here are some examples. This one is from Cisco, and it is a very powerful example. It is in production, and you can look it up; I have the references here. How do you identify malicious traffic within encrypted traffic? I encourage you to have a look at that.

Malware detection: this is work by Sophos. I give them huge credit for publishing their work. They have at least four or five papers on arXiv that you can look up, and I have listed one of them here. They are leveraging deep learning for malware detection, because their product is endpoint-based. You can look at multiple attributes, and you don't need to actually define the features. That is the question you were raising: do I use domain knowledge to define the features and then do my classification, or do I use deep learning and go ahead without defining features? This is again a huge topic. I will give you one example, and for malware detection specifically I encourage you to take a look at their papers.

Another one is adversarial machine learning. The password prediction I started with is just one example. Then there are attacks on self-driving cars, and how you confuse a model with images. This area of adversarial machine learning is also huge, so I encourage you to look at the ML spot.

Now, what are the different use cases? We talked about the era of anomaly detection. The second era of AI and machine learning in security is using deep learning. Obviously you need to have labels; that's where it will be successful. Here are the use cases that I found very solid and crisp. In fact, for a bunch of them you can also find Kaggle competitions, get the data sets from there, and try it out
yourself. So let me walk through one case study. I'm going to show you the code and give you a glimpse of how we are applying machine learning to one of these problems. We started with breadth, and now we are narrowing down. We take a few specific examples and form a hypothesis about three things: number one, how we are going to tackle the problem piece by piece; number two, how I weave the pieces together; and number three, how I bring in the domain knowledge as well.

So, Tor traffic. Tor is used a lot by adversaries, and most of the traffic in Tor is malicious. Cloudflare is very successful in detecting Tor traffic. I'm not advocating any specific company, but that's what I found in my research. Here is a data set we got, a Tor and non-Tor data set. As you said earlier, getting a data set is itself a big challenge. In this case we reached out to them and we got the data set. It was collected by students over several weeks. They carried out activities over the Tor network and over a non-Tor network: whatever they did in the non-Tor case, the regular traffic, such as email, chat, and browsing, they did over Tor as well. The objective here is: how do we classify it?

Let's take a look at that in the Python notebook. We have all the libraries, and let's take a look at the data here. So, the benign URLs... the problem definition here is that you are trying to figure out... Sorry, just a minute. This one is here. My apologies, just a minute.

So here, we have certain attributes of Tor traffic and certain attributes of non-Tor traffic. In this particular case we have the data available, so let's look at some of the attributes that the students at this Canadian institute have collected: source
port, destination port, the IP addresses, the flow duration, and the flow statistics: the standard deviation, the mean, the max, and so on. These are the different attributes present in the data. It is the typical data analysis you do. Now you can say: okay, Satnam, is this the first step you always do? Yes, I always do it as a first step. Even if I need to fit a deep neural network, I like to have some understanding of the data, because that actually helps you bring in the domain knowledge.

In this case, the domain knowledge tells me that if I need to build a model, the source port, destination port, and IP addresses will keep changing, so I should not bring them into the model. I don't want to learn from those IP addresses; there is nothing in them to learn. They change depending on which network you are in, a /16 or a /24, which subnet you are in, and from enterprise to enterprise. You don't really want to bring that in, so you drop those columns and keep the others. Then you look at the data, at the mean and standard deviation, and realize: okay, I should do some preprocessing and scale the values between 0 and 1. So you take care of that. This is the basic hygiene stuff.

You keep doing that analysis, and then you look at your dimensions and your ratio of training data to test data. Do I have enough data for my malicious traffic? If not, let's use a sampling technique that compensates for that. We use stratified sampling here. For the first pass I took a smaller part of the data so that I could run it faster. We build a logistic regression, and that gives me 90% accuracy. My objective here is not really to get my best result right away, and I'll tell you what my objective is. And
then I say, okay, I'm going to fit a deep neural network. I'll build a feed-forward neural network and fit it. This is standard Keras: you write out the different layers of the feed-forward network, you say I have several hidden layers, you fit the network, and then you ask what the accuracy is for that network. It turns out I am getting somewhere around 96% or so.

The part I want to stress is that this is a beginning. For this kind of problem statement, we took the academic route: in this case we got the data from academia. But how does this typically happen in industry? If you are a startup, you may not have data, so you may need to generate it. If you are an enterprise, you may have data, but even then you may not get all of it in one go; you may get only some part of it. So the question really is: do I take a small bucket of the data and analyze it? That is what is shown here. But when we need to build a production model, obviously you need the whole thing, the big data we used to talk about.

Sure, question? I can't hear you.

So the result that you've shown there has higher support for non-Tor. Is that a general property of networks, or of the sampling?

This is more specific to this data here.

Okay, but how did you get the sample?

I'll take this question after the talk, because I want to move on and give the big picture. I will walk through this particular data set just after my talk. The reason is that I have two more demos, and I want to give a big picture of cybersecurity.

So what we have covered is that we took one problem and showed how I applied a feed-forward neural network. Now let me take the second problem, which is actually a problem you
see quite a bit: you click on a link, it takes you to a URL, and can I predict whether it is a malicious URL or a benign URL? This problem has been addressed quite a bit, depending again on the data size. But at the same time, adversaries are also evolving. I'll give you an example. If you take a look at the domains here, some of them are generated by a domain generation algorithm (DGA), so they are random strings. But adversaries came to know that the defenders have algorithms that can detect this kind of domain. So instead of random generation, they started using words from a dictionary to generate the domains. Those domains are very nicely named; see for example home, job, institute. Earlier, when you trained your classifier on random DGA-based data versus non-DGA data, it would learn the pattern. But now the challenge is: with this kind of data, would you be able to detect it? Another challenge is that they actually route the data through Twitter or Facebook. Everybody uses Twitter and Facebook, so how do you flag that as malicious?

So this is one typical pipeline. I'm not saying the models in my demos are the best; I'm just saying this is one approach, and showing how we address it, how we frame it as a machine learning problem, as a deep learning problem. You have benign and malicious data. In this particular case we got the data from various sources on the internet. If you are building it for academic purposes you may find some data, but for a production version you may need a lot of data, and you may actually have to buy it. The objective here is: can I tell a benign URL from a malicious one? So I have benign URLs and I have
malicious URLs. Can I build a classifier that can classify them? Sure you can. You can start very simple, with logistic regression or XGBoost, and then slowly build it up. In this particular case we applied an LSTM network, because we are treating each character as a word as far as the LSTM is concerned. Then we say: I'm going to learn that embedding, I'm not going to specify what features I'm looking for, and that should give me higher accuracy. We do all the basic hygiene stuff, like the train/test split. I'm going to rush through this, but I can take questions on this particular notebook after the talk. Then we build the model: we have an embedding layer and an LSTM layer, we apply dropout, add a dense layer, and predict whether the URL is benign or malicious. It gets you somewhere around 95% or so.

But in production, is this a good number? It's not. It has to be 99.999, something like that. The reason is the percentage of URLs you still have to take, blast in a sandbox, and validate: is this really malicious or not? These URLs are typically embedded in users' email, and we want the genuine, benign emails to reach our inbox. All these calculations happen within milliseconds. So there are very tight timelines within which these calculations have to happen, and a very tight bound on the allowed accuracy. These are the considerations that have to be taken into account when you build a production deep neural network.

Sure. In security as such, we want the highest precision. You are actually okay with having a false positive, because you want to be secure. So you are liberal toward having more positives, but you don't want to miss a detection. So that is
almost across the board: pick any problem, and that is how security looks at it.

Just two more minutes and then I will open it up for questions.

So we talked about the various use cases. Now I'm going to say something different. We talked about how I'm going to model things with machine learning and AI, but you still have anomalies, you still have alerts, and somebody has to investigate. The question is: can we do something different? If this is a chess game between the adversary and the defender, can we slow him down? If at each move he has to think about whether something is real or fake, then it's like a race, and if we can slow him down in that race, we as defenders get a lot more time for action.

This framework is called deception. Out of these doors, if the attacker comes in, he does not know which one is real and which one is fake. Think about it: on LinkedIn you have a bunch of email addresses and a bunch of employees, and a few of them are fake. The adversary does not know that. When the adversary comes into the network and does an nmap scan, a few of those computers are actually fake. They are being projected by some server somewhere, but they have real services, so when the adversary does a scan they look like very genuine computers. The names look genuine too. For example, here you have mum EPS 4343. Now I have another computer which is mum EPS 4353, and this 53 happens to be the real one, but I also have a 63, and that one is actually the decoy. These decoys make the adversary think about whether each one is real. It's like setting traps. If I set up the traps, can I do something better? So what we can do is set up these traps. I'm not advocating any particular product; what I'm saying is the
approach is much broader. You take your AI and machine learning, which was figuring out this classification and that classification, but you still don't know where the adversary is in the house. It's like having some fake jewelry: if somebody tries to steal it, you get a signal that you have a thief in the house. It's like a motion detector. In the enterprise we don't have a motion detector, so deception is going to act like one, and then I will combine it with AI.

In this particular example, there are a lot of hosts, a lot of endpoints. In this room we have a lot of Macs, and I would add some more Macs that are fake. I'm not getting into how we do it, but let's say they are there. When the adversary bumps into one of them and you interact with him, you can get a lot more data and information about him, and then you apply the AI.

I have this particular example that I'm going to show you. I will tell you how the technology works, but let me show you this attack first. So here is an attack. Typically you have a phishing attack, you take over one computer, and then you escalate privileges. In this case the adversary is taking one of the scripts. Has anybody heard of PowerShell and all that? Yeah. So the adversary has taken a PowerShell script and he is modifying it. But before he modifies it, he also checks on VirusTotal how many AVs are actually detecting it. He goes there and checks, and in this case about 20 of them have detected it. Then he goes and changes certain words: he changes some variable names and some
function names. Only this much; no code changes, just variable and function name changes. He is easily able to get those 12 out of 22 down to 5. Actually, now only three of them are able to detect it. So, just a function name and variable name change; that is the reality. And if he obfuscates, then only two of them are able to detect it. What I am trying to say is that it is as simple as that: you take any script from the internet, you change the function names and the variable names, and out of 60 antivirus engines only two can detect it. You do not need rocket science. This is all the adversary needs to do.

So the question is: can we do detection with deception and AI together? Sure you can. If you have this deception in place, the adversary does not know whether he is engaging with the real thing or with a fake, and everything he does is getting recorded. In this particular case all the PowerShell logs are getting recorded, and we have a machine learning classifier that is able to detect whether this is a benign script or a malicious script, what kind of tool has been used, and what the intent of the adversary is. In this case the adversary is escalating privileges; he is just executing it. On the defending side, let me show you that we are able to detect it. I am sorry if it is not visible, but what it is telling us is that there is a privilege escalation attack that the adversary carried out, and we are able to detect that. This is all done behind the scenes; it is all AI and ML.

So let me show you AI/ML and deception together and how it works. You have the PowerShell logs, you do the preprocessing, and then you have sets of classifiers. One set of classifiers will tell you
what the tactic is, another set can predict whether the script is obfuscated or not, and another classifier can predict which tool it is. All of this you can model as a machine learning problem.

So, to recap: we started very broad, we went in depth into very specific examples, and then we said there are two or three things we need to take into account. One, which you rightly pointed out, is weaving things together. I do not have this in the presentation, but in the recent approaches you model it as a graph database. You look at all the incidents together, you look at the source and destination, and you weave them together: this incident happened after that one, on this particular source only. Then somebody has to weave in the domain knowledge: yes, this could be because of this payload. So you typically model it as a graph, you check whether you have seen such behavior before, and then you say whether it is an anomaly or not. Another way to approach this is to bring in the domain knowledge as features when you are building your classifiers.

I think I have said enough. Let me take your questions; there were a bunch of questions here. Yes, sure.

Sure, correct, that's an excellent question. Let's take an example of ransomware: WannaCry. Before WannaCry came, all these machine learning and deep learning based vendors were not able to detect it. But as soon as they got a few samples, they were able to quickly retrain their models, update them, and deploy them, because they were mostly cloud-based. These machine learning and deep learning based models are still better than the rule-based ones, because those are really looking for a signature. The promise is that if somebody makes a small tweak, the model can still catch it. But if they exploit a new zero-day vulnerability,
then the machine learning and deep learning classifiers will fail, because that is something completely different. If the new sample is probabilistically nearby, the model is able to catch it. So when new malware comes that is from a similar family, the model will be able to detect it. But if it comes from an entirely different family altogether, built on a different zero-day vulnerability, you still need to wait for the samples to come in and for the models to get updated. Typically the vendors update them quite frequently. The advantage is this: earlier it was all about signatures. You blast the malware into a sandbox, you look for the signature, you look for what you have to look for, and then you code that signature into the antivirus. That is a very slow process. Now, with these deep learning based models, you are able to update quickly. But they still can't do it for a completely unknown thing. That is true for any domain: for any unknown data on which it has not been trained, the classifier will not be able to predict.

Some more questions? Okay.

Okay, so it is not fine. See, when WannaCry exploded and was spreading, the news went out very fast, but people were still getting attacked. If you take a look at the WannaCry spread, it is just like a blackout in a city: one part of Whitefield gets a blackout, and then it is blackout, blackout, blackout, spreading across the city. But you know that part of the town got a blackout, right? If I am able to update my model well in time, I am still able to stop the blackout for, say, 70% of my city. So what I am trying to say is that there is a time to react, and the deep learning or AI based models give you that ability. The next thing I said was that if you combine it with deception, then you can
get there much faster: even before WannaCry has spread, you can actually catch it. WannaCry uses a specific vulnerability, so in the enterprise network, if I deploy that vulnerability on some computers and some malware tries to exploit it, I can detect that: oh, here some malware is trying to spread. So you can detect it faster.

So there is a race to detect it faster, and then there is another race against zero-days. In the security domain there are what are called APTs, advanced persistent threats. They are typically carried out by nation-state actors, some particular country may be behind them, and they have very, very specific targets. Over there, what defenders, what companies do is try to stop the lateral movement. If something has been compromised, you can't stop that; as I showed you on my first screen, it takes only a few minutes to get compromised. So people assume that it will get compromised no matter what you do. What they try to do is ask: how do I contain it? How do I make sure that it does not spread? There are a bunch of techniques for that, and one of them, which I talked about towards the end, is the combination of deception and AI together.
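
As a rough illustration of the deception idea, here is a minimal Python sketch. It is not the product's actual implementation. It assumes connection events arrive as (source, destination, port) tuples; the decoy addresses, the `find_suspects` helper, and the sample events are all hypothetical. Any source that touches a decoy is flagged, and the other hosts it contacted are listed as places to check for lateral movement.

```python
from collections import defaultdict

# Hypothetical decoy hosts that deliberately expose the vulnerable service.
# No legitimate machine has a reason to connect to them.
DECOYS = {"10.0.0.250", "10.0.0.251"}


def find_suspects(events, decoys=DECOYS):
    """Map each source that touched a decoy to the real hosts it also contacted.

    events: iterable of (source, destination, port) tuples (assumed format).
    """
    contacted = defaultdict(set)
    for src, dst, _port in events:
        contacted[src].add(dst)

    suspects = {}
    for src, dsts in contacted.items():
        if dsts & decoys:  # the source connected to at least one decoy
            suspects[src] = sorted(dsts - decoys)
    return suspects


# Made-up sample events; port 445 (SMB) is used only as an illustration.
events = [
    ("10.0.0.5", "10.0.0.9", 445),
    ("10.0.0.5", "10.0.0.250", 445),  # touches a decoy
    ("10.0.0.7", "10.0.0.9", 443),
    ("10.0.0.5", "10.0.0.12", 445),
]

suspects = find_suspects(events)
for src, hosts in suspects.items():
    print(f"isolate {src}; check possible lateral movement to {hosts}")
```

In practice the decoy hit would be one signal among several, fed alongside the classifier outputs rather than used alone; this sketch shows only the containment bookkeeping.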