And I'm also a director here, so I take care of these projects as well; I do both kinds of work. Let me start with a quick intro, because Thales is a big company and they want everybody to know about them, so here is the quick intro of my sponsor. Thales today is a big company present in 68 countries, with revenues of nearly 20 billion euros. We create new projects, work on the abstract, math part, and then build new products. They are in many different spaces: aerospace, ground transportation, security, they are basically everywhere. You will find they were part of the Rafale deal, they were part of the Delhi Metro, so they are just there, everywhere. And there are lots of notable folks over there; Albert Fert, the Nobel Prize-winning physicist, is also part of the group. So we have good people over there. That was my sponsor's commercial break; let me get back to the agenda.

Here is what I'm going to talk about. Events are a very normal thing happening in the real world, so I'll first talk about what events are, and then why we need sessionization, mostly from a machine learning perspective: why is it important? I'll cover the traditional approach first, take some real-world use cases that we found in the data, and then talk about the applied data science way. As the speaker this morning said, we should first understand how things actually are in the real world, then understand the concepts and make them intuitive enough to apply. Applied data science is basically a mix of both: you know the math, you may not be solving the equations by hand, but you know what to apply. So I'll cover the basics first and then summarize how to apply these techniques and solve some problems with simpler methods. It's a balance of both approaches.

Talking about events: an event could be anything in the digital world. There could be orders placed in a financial market, users tweeting on social media, people clicking on websites, events from IoT devices, events from routers. Anything that comes with a timestamp is an event: any digital entity generating information over time. We want to capture and understand these events; that's the basic idea. In the digital world you have mobile devices, the internet, IoT devices, so you get lots of information, and everything is a stream arriving over time. There is an actor; the actor could be a user working on a website, a mobile device, or some other machine, and they all generate bits of information as streams over time. Time is the most important part, because these events arrive on a time axis and we need to somehow find meaning in them, extract the information from them. I'll give a few examples of how this information flows. Let's say we have one entity, and it's generating events at some fixed interval of time.
Most likely this first one is a server on the internet, say a DNS server, or some other server doing a periodic health check with a master server. That's a very organized, time-based event pattern, maybe something like a resolution query. That's one type of entity. Second, you could have a human, doing random things at random times; there is essentially no clean pattern in human behavior. Then you could have one more user who also has some pattern, but with something periodic happening in between his activities. There could be a time-based order in there, maybe a bot also working under his identity: a machine using his identity and doing various activities. So for some entities there is a mix of events, and for others the events are purely organized. If you look at these arrows, all three sets of things are very time-organized: you see the first event, second, third, and then the same thing again and again. Those could be real bots, maybe malicious bots doing activity that is probably not intended to do anything good. The idea is: how can we capture these patterns, understand the various actors working in the digital world, and somehow create a signature for each? They could be good guys or bad guys, but if, with the help of machine learning, we can find those signatures and their correlations, that would be really valuable.

That's where we start with the concept of a session. A session is basically a continuous stretch of activity, and I have two views of it here. One is the operations view we normally deal with: when you're working on a laptop, it's a session, a web session. That's continuity in activity. There is some mean activity period: we type for some time, then leave the laptop, then do some activity again and leave it again. Whenever a gap is longer than the mean activity period, we say that was one session, then another session, then a third. That's the ordinary meaning of a session: any contiguous activity.

Then there are the more time-based events. In your history of data, you might have seen multiple events of types A, B, C, D happening again and again inside some time-based window, and the same window appearing for different users at different points in time, repeatedly. From this you can do a time-based correlation and say the first event leads to the second, the second to the third, the third to the fourth. So from a mix of random events, based on their times of occurrence and their sequence of occurrence, we can statistically find that they are correlated. That's one way of finding correlations in a large amount of data. Most likely the humans will have a somewhat random pattern and the bots a more organized one. So the idea is: how can we separate machines from humans, and how can we model the machine traffic mathematically?
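To make the gap-based definition concrete, here is a minimal sketch (my own illustration, not code from the talk), assuming timestamps in seconds and a hypothetical gap threshold standing in for the mean activity period:

```python
import numpy as np

def sessionize(timestamps, gap_threshold):
    """Split sorted event timestamps (seconds) into sessions: a new session
    starts whenever the gap to the previous event exceeds gap_threshold."""
    timestamps = np.sort(np.asarray(timestamps, dtype=float))
    gaps = np.diff(timestamps)
    breaks = np.where(gaps > gap_threshold)[0] + 1  # indices where a new session begins
    return np.split(timestamps, breaks)

# Three bursts of activity separated by long idle gaps -> three sessions.
events = [1, 2, 3, 60, 61, 62, 300, 301]
for i, s in enumerate(sessionize(events, gap_threshold=30)):
    print(f"session {i}: {s}")
```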
That's the major idea: modeling machine activity into some kind of model where you can predict, find outliers, and do lots of other things. This could be a chain of events, maybe a web session activity: somebody logs in to a server, opens a file, transfers a file. That's one sequence that normally happens. An FTP server does it regularly, but if somebody attacks the network, he will probably use a different path, so the path diverges from the known one, and we can then find deviations from that path. There are multiple applications here, covering cases like fraud detection, cyber security, and other behavioral cases where you want to see what machine behavior looks like and how it deviates. But the core essential part is how we can find this chain from the data. That's the core emphasis: I'll focus on how we actually discover these chains automatically.

And there is another critical thing: if you have a chain, you can also find the root of the chain. For example, suppose that malicious bot's chain of events was an FTP or telnet backdoor login, then opening a backdoor port and transmitting data to some other network; it's basically stealing information, and you want to stop that. If you try to stop the first event, the chain may or may not stop. If you attended Swarjit's lecture yesterday: correlation can be causality, but it's not necessarily so. The first event may be causal, and if you stop it, it's possible you stop the entire bot; that's a possibility, not a necessity, because we are just observing data. Let's say the bot was entering through a telnet session; it could have another entry path, say over SSH or something else. So a chain of events may be correlated, and it could be causal; the idea is that you can try these things. If you get a correlation of events in time order, you know which one is the root event, and if you find the root event, you can try to block it and potentially block the whole chain of events. This could be a campaign on social media that you want to stop; you have seen lots of malicious campaigns, and if you can stop the originator, you can stop the entire chain. It may be causality or it may not, but finding the chains is definitely important, and they give us lots of information.

But talking about the real world: the time periods between events actually vary. That's how the real world is; it's not really easy to model. As I was saying in the previous talk, everything we do in a lab is pretty easy, but in the real world things vary a lot. The second event may come after a longer time, sometimes a shorter one, so it's hard to model. To model the real world, we add some kind of stochasticity: instead of assuming a fixed period, let's learn a distribution for the variation between events, modeled by a simple Gaussian distribution, say, or any other distribution. This can then be used as something called a stochastic period. The notion is that the real world is dynamic.
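As a toy picture of a stochastic period (again my own sketch, with made-up numbers), the gaps are drawn from a Gaussian instead of being fixed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical beacon: ~5-minute period with Gaussian jitter around it.
mean_period, jitter = 300.0, 15.0
gaps = np.clip(rng.normal(mean_period, jitter, size=20), 0, None)
timestamps = np.cumsum(gaps)  # event times of the simulated entity
print(timestamps[:5])
```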
And to model a changing phenomenon, and to connect changing phenomena, we need these changing periods too. How we can model a stochastic period is what I'll talk about in the next part, along with some of the math behind such a discovery. We expect these kinds of models to be more robust and applicable in the real world, where we can find different kinds of chains, different kinds of campaigns and bot activities, and we can find them in mammoth amounts of data. That would really help us find the interesting patterns.

There are different applications, as I told you, in different areas of business. Maybe somebody is doing market manipulation: you can find that via these chain discoveries. If some machine is placing orders and trying to manipulate the market, you can discover those chains from the sequence of actions. Similarly for tweets: you have seen political campaigns in the newspapers, where a program running in the background was analyzing activity and taking actions to bias users toward a political party. You can find those chains using this analysis, see who the originator is, and then maybe catch that and stop it. Similarly for websites: people may be doing fraudulent transactions or account takeovers, the same activity done in a different way, and you can again find those chains. There are many applications, but the core remains the same: do a time-based correlation of these events, find the chains, and find the root of each chain.

To convert this to a mathematical model, the first step is to convert the events to a pulse train. For every event of a particular entity (an entity could be a machine, a server, a user doing some activity, or anything generating events), translate the events into pulses: when we see an event, the pulse is 1; when we don't, it's 0. So we get an impulse train, and now we can use mathematics to analyze it. There are lots of ways to find such patterns; I'll describe the theoretical ones. One is the Fourier transform: from an impulse train you can take the Fourier transform and find the frequencies in it. The second is stochastic analysis, which I just described: look at the varying time periods and model the variation in them. The third is Gaussian mixture models, one of the standard ways to model stochastic periods. And there are more complex methods, because sometimes the current techniques are not suitable: in Gaussian mixture models you have to specify k, a parameter, and you don't know how many classes there are. I was showing you an example with, say, four kinds of relations, but I don't know beforehand whether there will be four, five, or six, so you can't fix k in advance. That's why there are more complex methods like infinite Gaussian mixture models, which can help us identify those patterns. I'll briefly cover these topics in increasing order of complexity, and then come down to the applied data science part.
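The pulse-train conversion is a simple binning step; a minimal sketch (my reconstruction), assuming timestamps in seconds and a sampling step dt:

```python
import numpy as np

def impulse_train(timestamps, dt=1.0):
    """Turn raw event timestamps into a 0/1 impulse train sampled every dt seconds."""
    timestamps = np.asarray(timestamps, dtype=float)
    train = np.zeros(int(np.ceil(timestamps.max() / dt)) + 1)
    train[(timestamps / dt).astype(int)] = 1.0  # 1 where an event occurred, 0 elsewhere
    return train

train = impulse_train([0, 10, 20, 30, 40], dt=1.0)  # one pulse every 10 s
```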
Just to summarize these techniques, I'm plotting a small diagram. As you increase the complexity, you get more information, but eventually the technique becomes too complex. Bayesian non-parametric methods, for instance, are really complex; they're hard to do, but they carry lots of information, they're very close to the real world, and they capture the uncertainty. They're just not easy to model. So we can strike a balance, and that's where applied data science comes in: take some assumptions, do a small EDA on your data, and if the assumptions hold, directly apply those techniques. At the end I'll summarize how you can quickly strike that balance with some of the known techniques, apply them, and solve the problem in an easier fashion than doing all the complex techniques. But, as the speaker this morning was saying, we need to get the essence and understand how things work. That's why I'll first cover the basic techniques: what does a Fourier transform do, and what's the intuition? If you know the intuition of a Fourier transform, you can modify it or use it differently. Similarly, if you know the intuition of Gaussian mixture models you can modify them, and the same goes for infinite Gaussian mixture models. So I'll quickly cover these topics as a brief introduction.

Here is what a Fourier transform does. If you look at this animation, this is a signal, and it could be a mix of lots of signals. The idea is: can we find the original sine signals? If we can split this signal into its basic frequencies, take them out, and note the frequency at which each one occurs, that's what a Fourier transform does. It's a simple intuition, and it shows us what's happening inside: we are able to decompose a complex signal into simpler signals. If there are multiple operations going on in a chain of events, this could reveal them, provided the signal is clean.

Talking about the Fourier series, it's the same thing. We have these sine and cosine waves and we just want to find the coefficients, because the coefficients (those vertical bars) are the strengths of the basic signals, and they tell us which frequencies are operating inside the complex signal. The idea is simple: you have your function f(t), and a periodic function p(t) which we know is a sum of cosine and sine waves. The Fourier series maps your f(t) to p(t), and in that process discovers the coefficients a and b, the weights of the cosine and sine signals. That, in short, is a Fourier series; there are a few variants, which I'll also mention. And the derivation uses a simple intuition: in a simple machine learning model, when we want to train something, we have labels and data, so we make an error function and take a difference. Here f(t) is the function we want to match and p(t) is the model function; f minus p gives us the error, and the mean squared error is the term we want to minimize. It's the same simple machine learning problem we always solve. What we do analytically is take the derivative and set it to zero.
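Written out (these are the standard formulas, reconstructed from the description), with period T:

```latex
p(t) = \frac{a_0}{2} + \sum_{n=1}^{\infty}\left[a_n \cos\!\left(\frac{2\pi n t}{T}\right) + b_n \sin\!\left(\frac{2\pi n t}{T}\right)\right]

\text{Minimize } E = \int_0^{T}\big(f(t) - p(t)\big)^2\,dt \text{ by setting } \frac{\partial E}{\partial a_n} = \frac{\partial E}{\partial b_n} = 0, \text{ which gives}

a_n = \frac{2}{T}\int_0^{T} f(t)\cos\!\left(\frac{2\pi n t}{T}\right)dt, \qquad
b_n = \frac{2}{T}\int_0^{T} f(t)\sin\!\left(\frac{2\pi n t}{T}\right)dt
```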
So you simply get these formulas, and this is what the Fourier transform does internally: it maps your signal onto known signals, finds the errors, and sets the derivatives to zero to find the coefficients. The a0 term here is important because it's the integral of the signal over a period; for a periodic signal, that gives its mean value. This matters because the 0th frequency will show up as a much higher peak than the others, so you should know it's just the mean of the signal and perhaps remove it to look at the other frequencies, which are more important. That was for a time period 0 to T; that's the Fourier series.

There is a compact way to say the same thing. Using Euler's formula, you can represent both cosine and sine with complex numbers, via the single expression e^{jωt}. So you can write the same series as a sum of Cn e^{jnωt}, and then you need to find only one set of coefficients, which gives you the relative strengths of those signals. The Fourier series is this on a time period 0 to T; if you extend the time period to infinity, that's the Fourier transform. It's just the same thing, with the time period no longer limited, and this is the CTFT, the continuous-time Fourier transform, for analyzing a continuous signal. There is a discrete version of it called the DTFT, the discrete-time Fourier transform, and that's the one we will use to analyze impulse trains, because events are discrete: something on the internet happens at some point in time, so these are discrete-time events, and the DTFT is what helps in analyzing them.

Let me explain the same thing with a diagram. Say we have a simple impulse train, a pulse every 10 seconds; that's the simplest function we can have. If you take its Fourier transform, the real part tells you about the magnitude of the signal and the imaginary part about the phase shift. Notice the gaps are 10 seconds each and the frequency peak is at 0.1: 1/10 is 0.1, so frequency is simply 1 over the time period. The Fourier transform straightforwardly gives the time period in operation in that signal; that's one easy way of finding the frequency operating in the original signal. And just to explain what I'm plotting here: it's a simple Fourier transform of the signal. Use the NumPy library, take the Fourier transform, and keep the absolute value, because we just want to see where the magnitude peaks. The first half of the output is the positive frequencies and the second half the negative ones, so just plot the first half; that's what you're looking at in the diagram. The frequencies are just the inverses of the time periods, so now you can actually see and understand what we are plotting.
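Roughly what that plot amounts to; a minimal sketch with NumPy and Matplotlib (my reconstruction, not the speaker's exact script):

```python
import numpy as np
import matplotlib.pyplot as plt

# Impulse train: one pulse every 10 seconds, sampled at 1 Hz for 1000 s.
t = np.arange(1000)
signal = (t % 10 == 0).astype(float)

spectrum = np.abs(np.fft.fft(signal))       # magnitude only
freqs = np.fft.fftfreq(len(signal), d=1.0)  # frequency axis in Hz

half = len(signal) // 2                     # first half = positive frequencies
plt.plot(freqs[:half], spectrum[:half])     # expect peaks at 0.1 Hz and its multiples
plt.xlabel("frequency (Hz)")
plt.ylabel("magnitude")
plt.show()
```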
I'll take some case studies from a public data set called CTU-13; if you just google it, you'll easily find it. It was made by the Czech Technical University: they ran a malware in their own network, something called DonBot, which attacks Windows services, and the data is available, so you can try it. I'll show some real examples of how to use real data to find these events happening in the background.

This is case one, an actual bot activity. This bot was deployed in the university network and was trying to contact the university's DNS server. Look at this pattern: it's a simple impulse train where, whenever the bot raises a DNS request, we record a 1, and otherwise a 0. As you can see, it's fairly organized; bots normally have an organized pattern. This is something called beaconing: they send beacons to check in, asking the DNS server what's happening, sometimes at a short time period, sometimes a longer one. Now, if you take a Fourier transform to understand the signal, the raw transform is really messy and hard to read, but if we zoom in a bit, we can see something interesting. Looking at the transform directly may not be meaningful; you have to zoom in, and knowing how far to zoom comes from experience. For example, these are 15-minute intervals and I can see 3 bars in 15 minutes, so approximately one interval every 5 minutes; 5 minutes is 300 seconds, and 1/300 is approximately 0.0033. If you look at the peak here, it's very close to 0.0033, just above 0.0025. So just by intuition you can discover that yes, this is actually a periodic signal, and you're seeing a peak at that frequency. There's also a property of the Fourier transform that a frequency shows the same peaks at its integer multiples, so you see multiple peaks in this chart, around 0.003, 0.006, 0.009 and so on, but they all signify a single frequency repeating at its multiples. And this is the phase shift, which I think is negligible here because the signal is mostly periodic and in sync at zero.

That was one way of looking at it. Let me now use the same example to introduce the second way, the stochastic period. The first was the frequency view; now we look in the time domain. In the time domain we look at the time deltas: simply subtract consecutive timestamps, as simple as that. You take the signal, compute the time deltas, and the deltas will vary a lot, but that's okay. If there is periodicity in the signal, the time deltas will vary within a band, and that band is what interests us. We can plot a density distribution of the time deltas, which shows how much they vary, whether in a small range or a larger one, and we can learn a probability distribution from that. That's the other way of looking at these timed events, in the time domain itself. So let me show the same example again. What I've done here is find the delta between every pair of consecutive bars: mostly you see a difference around 300, so it varies around 300, with a few much smaller deltas here and there.
This is just a plot of the deltas for the same activity, and if you look at the delta graph, it mostly lies in a band: the majority of the deltas are inside it. As we saw, the signal is very close to periodic, which is why there is a band, but it varies a bit within the band; it's not exact, and if you look for an exact period you won't find one. So what can we do? We can simply find the mean of this band, which gives us the mean period around which the band is centered, and we can also find its standard deviation. The standard deviation can be used to define a property we can call periodicity: for a distribution we have the coefficient of variation, which tells us how much the distribution varies, and the inverse of that can serve as a periodicity score. It's a dimensionless number that tells us whether the band is very narrow or very wide: is the signal really periodic, or just noise? These properties of the band help us analyze the signal. Based on this you can also find sessions: one stretch of continuous activity within that band is one session, and the next stretch of continuous activity is another. So based on the continuity of these time periods you can find multiple sessions, all from timestamps alone. We are just looking at the timestamps; we don't care about the event types or any other relationship. Just from the occurrence of timestamps on the time axis, we get information.

If we expand this further and plot the histogram of the deltas, like a density plot, most of the deltas are in the range of 300, so you see a peak around 300, and it looks very much like a Gaussian distribution. This we can use as the PDF, the definition of a stochastic period: this probability density explains the variation in the signal. DonBot's DNS queries can be modeled by this simple Gaussian distribution; you find its mean and sigma, and those two variables are now good enough to describe the DNS queries of our DonBot. I've also made the same histogram in a more detailed version: zoomed in and coarser, versus the smoother version, but either way you can clearly see a PDF coming out. So that's one simple way: find the deltas (subtract consecutive timestamps), get a PDF, and represent the PDF by a μ and σ. With just two numbers you compress the whole information into a PDF.
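In code, this delta view is only a few lines; a sketch on simulated beacon data, with the periodicity score taken as the inverse coefficient of variation described above:

```python
import numpy as np

rng = np.random.default_rng(1)
timestamps = np.cumsum(rng.normal(300, 10, size=200))  # near-periodic ~300 s beacon

deltas = np.diff(np.sort(timestamps))  # inter-arrival times
mu, sigma = deltas.mean(), deltas.std()
periodicity = mu / sigma  # high = narrow band (periodic), low = noise
print(f"period ~ {mu:.0f} s +/- {sigma:.0f} s, periodicity score {periodicity:.1f}")
```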
I'll take more examples to illustrate the same point in both the frequency and time domains. The next one is actually a real attack, a blacklisted IP. This IP was blacklisted some five years ago, when DonBot was very popular, and it used a backdoor port to transfer data from somebody's laptop to another server. Here is its activity over a span of about 2 hours: it contacts the master server and then transfers data very frequently. The raw view is busy, so look at the deltas; they are simpler. You can see some kind of pattern in its activity on the time scale, some high peaks and some low peaks, but mostly there is not too much variation: it sits in some kind of band, and the majority of the activity is still inside that band.

But there is something more interesting here; let me take another diagram. For the same signal we can also take a Fourier transform. You see a very large peak at the beginning, and because of it the whole spectrum gets compressed: you can't see the other peaks. They are there, but too small to read. That first peak is the 0th frequency, the mean of the entire signal, as I showed in the formula earlier. So the idea is simple: take out the 0th frequency, and the moment you do, a very nice pattern comes out. These peaks indicate what's happening inside the signal: they sit at around 0.025 to 0.03, a period of roughly 30 seconds, and the delta band is also around 30. So the Fourier transform is able to find the fundamental frequencies operating in the signal.

Now let's look at the same signal from the time-delta point of view. We plot the histogram and density graph again and see a single peak there, but there are multiple ways of looking at a particular data set. Density plots, like histograms, are biased by their smoothing: with a different smoothing parameter I get a different view. The same diagram may actually be not one band but two; see the blue and the green, one band at lower time deltas and one at higher. If we zoom in a bit, we can see two clear Gaussians. So it depends on how you plot the PDF, on the granularity; you can see multiple different bands coming out. It's subjective to how you interpret it, which is why you need to be cautious about whether you're looking at a smoothed version or a granular one. In this scenario the two-band reading probably makes more sense: it could be an interaction. You can say this is one stochastic period and this is another, and what we are seeing is the interaction of two stochastic random variables, a continuous chain of interactions. That's one way of reading this activity.
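On that smoothing bias, here is a small sketch showing how the KDE bandwidth alone decides whether you see one band or two (synthetic data, hypothetical bandwidths):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
# Two overlapping inter-arrival bands, e.g. around 25 s and 35 s.
deltas = np.concatenate([rng.normal(25, 2, 300), rng.normal(35, 2, 300)])

xs = np.linspace(10, 50, 400)
coarse = gaussian_kde(deltas, bw_method=0.5)(xs)  # smoothed view: one broad band
fine = gaussian_kde(deltas, bw_method=0.1)(xs)    # granular view: two Gaussians appear
print(xs[np.argmax(coarse)], xs[np.argmax(fine)])
```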
I'll show you more examples like this. Another case: this one is a genuine, systematic DNS query. The time pattern looks really complex, just a mix of signals, but look at the deltas; they are really interesting, because there are only four levels. And those levels are very evenly spaced; they are actually multiples of 0.8: the first delta is 0.8, then 1.6, then 2.4, then 3.2. It's DNS resolution happening: with recursive DNS queries, if a server can resolve the query it answers back, and if not, it sends it to some other server. So based on how many hops it takes to resolve the DNS query to an IP address, you get multiple time bands; a normal query has several of these hops, which is why you see the multiples. From a mathematical point of view you just see four bands in operation, so even without knowing the logic of DNS, we can still figure out how DNS works from this mathematical formulation.

Now the Fourier transform of this one is interesting, because I'm showing you more complex cases and here the Fourier transform has limited utility. I can see a peak at nearly 0.24, a period of roughly 4 seconds, but the Fourier transform is only able to catch that component; it's not able to catch the lower frequencies. So we get limited information from the Fourier transform here. It's a good tool, but it may not be good for all cases; for some you have to go to more complex, higher forms of modeling.

Let me show the same thing from the time-delta perspective again. As you can see, I can model it in two ways: a smoothed version gives one normal-looking PDF, but it may actually be four PDFs in action, four random variables interacting with each other. Let's call them a, b, c, and d. You can see it goes from a to b, b to a, b to c, c to d, so you can get the transition probabilities between these events. Now, if you model the whole DNS system as a state system of four states, everything that's happening is hopping from state a to b, b to a, with c and d sometimes in between. These four states summarize the entire signal, and all that complex DNS logic, in one state diagram. This is a way of compressing information by time-based correlation and modeling it as a simple state machine. You can then use a Markov chain to check whether a transition probability is normal or something unexpected is happening, so you can also find deviations from this kind of chain.

I'll show you one more example with an even more interesting pattern. Here even the deltas have a varying pattern, like a sine and a cosine wave: two signals in operation. This can be captured from the imaginary part of the Fourier transform, where you see a peak at a very exact frequency of 0.1. Every 10 seconds those two signals mix; they are of different phase, which is why you see the sine and cosine drifting, and every 10 seconds a cycle completes and you get a new pattern. So there are multiple ways of interpreting these; they are nice techniques for finding the patterns and modeling them mathematically.

More generically, though, the time deltas are the better way to model them: you get these PDFs, and if you can find each of those peaks, you can define one phenomenon, one stochastic variable. But the problem is how to find the right PDF. As I was telling you, there are different views, a zoomed-in view, a coarse view, a smoothed view, and which one is correct we don't know. How do you discover it automatically? That's the key, and the next part of the session is about how to find the right PDFs to model the right phenomena. We have to do some calculation, because it's subjective to the level of zooming and we want the best one. We want to automate the system; we don't want a human in the loop analyzing these plots.
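Coming back to the four-state DNS picture for a moment: estimating those transition probabilities takes only a few lines. A sketch with a made-up state sequence (the states a, b, c, d are encoded as 0 to 3):

```python
import numpy as np

# Hypothetical sequence of quantized delta levels (a=0, b=1, c=2, d=3).
states = np.array([0, 1, 0, 1, 2, 1, 0, 3, 0, 1, 0, 2, 1, 0])

n = states.max() + 1
counts = np.zeros((n, n))
for s, t in zip(states[:-1], states[1:]):
    counts[s, t] += 1  # count each observed transition s -> t

trans = counts / counts.sum(axis=1, keepdims=True)  # row-normalized Markov matrix
print(np.round(trans, 2))  # unusually low-probability transitions flag deviations
```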
The simple idea is this: you have a set of points, and you want to model them by some PDFs, say two PDFs or two histograms. It's one-dimensional data: if you do a clustering with something like DBSCAN, you'll get clusters, and there is also the technique called Gaussian mixture models, which will give you two Gaussians directly, as simple as that. You specify the k: give it k = 2 and it gives me the two PDFs, and I get back exactly the PDFs I wanted. This simple thing works. It can be done in two dimensions as well: the contours you see are the same PDFs projected down, where the peaks have the smaller circles and the circles get bigger as you go down, with the centers in the middle. That again can be modeled by Gaussian mixture models in two dimensions. So these techniques make it easy to find the PDFs if we know k, and knowing k is the important part.

Let me also give the intuition behind Gaussian mixture models, so we understand how to modify them too. A GMM is a variant of k-means: k-means does a hard clustering of points, while a GMM does a soft clustering, giving probabilities instead of exact labels. But we still need to define k here, and that's a hard problem; that's where the next set of methods comes in. There are heuristics like the elbow method to find k, but those are very manual things that can't be automated, which is why it's a hard problem, and why you move to the Bayesian world. You probably know this already: this is the prior, this is the likelihood, and this is the evidence; that's how you set up a Bayesian model, and you get a posterior. There are also some useful Bayesian properties: if the prior and the posterior are of the same family, they are called conjugate. These are properties we can utilize; I'll skip the detailed conjugacy example.

Here is how you construct a GMM in the parametric setting, where we define k. The parameters are, first, which cluster you pick the point from, the set of π's, and then each Gaussian's properties, the μ and σ. So it's a pair of choices: choose the cluster, then draw from that Gaussian. You can model it with a latent variable t that first selects a cluster and then the Gaussian; the equations are easy to find on the internet. Training is a simple thing: just maximize the likelihood with the EM algorithm, which is much like k-means. In the E step you find the distribution over cluster assignments, analogous to associating each point with a centroid in k-means, and in the M step you update the parameters, like recomputing the centroids. It's the k-means analogy, done in the probability domain.
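To make the fixed-k case concrete, a minimal scikit-learn sketch on synthetic one-dimensional deltas:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
deltas = np.concatenate([rng.normal(25, 2, 300), rng.normal(35, 2, 300)])

gmm = GaussianMixture(n_components=2).fit(deltas.reshape(-1, 1))  # k must be given
print(gmm.means_.ravel())                 # ~[25, 35]: the two PDF centers
print(np.sqrt(gmm.covariances_.ravel()))  # ~[2, 2]: their sigmas
print(gmm.predict_proba([[30.0]]))        # soft assignment of one point
```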
I describe this part because we want to do this association in the real world, and there would be lots and lots of users across the internet and across an enterprise network; there could be millions of entities. We can't do it manually: we want an automated method, discovery by an automatic clustering technique. And as you've seen, GMMs and k-means require the k, so we can't automate them. That's where the next level of methods comes in, the more complex ones like infinite Gaussian mixture models, where you can actually have infinitely many Gaussians. These are a class of Bayesian non-parametric approaches. There's a lot of math behind them; I won't go into most of it, but I'll quickly give the intuition.

To get into these, we use probabilistic programming. To understand that, take a very simple example, logistic regression: that simple equation with a sigmoid over weights multiplied with the data. In the Bayesian world, you convert the betas from scalar numbers to distributions: the weights are no longer numbers, but distributions with a mean and a standard deviation. That's the Bayesian setting, and in it you can use sampling: you sample from a distribution, and methods like MCMC give you samples and hence estimates of those parameters. You get a Gaussian for β1, a Gaussian for β2, and so on. What's the benefit? In an ordinary sigmoid you just get the value, the probability; now you also get a band, called a credible interval, which tells you how much it varies. For example, at the input value 50, I can be 80 to 100% confident that I'll get a probability of 1. That's the Bayesian world.

How do we map this to clustering? There is a distribution called the Dirichlet distribution; you can check it on Google if you don't want to go through the math. The interesting property is that in a Dirichlet distribution, the sampled values sum to 1. That's exactly what we want for weight vectors, which need to sum to 1, and that's the property the Dirichlet gives. The formulas are easy to look up; the idea is simply that any point you draw sums to 1. Picture a triangle: the point in the middle is equidistant from all three corners, 1/3, 1/3, 1/3, and other points have different weights but still sum to 1. So you can imagine it as a triangle on a plane with each corner at value 1; this is called a simplex, and it's how we can sample a weight vector for different values of k. There is a parameter α: different values of α give different distributions; a small α puts more density at the corners, a bigger α more density at the center.

There are some further interesting properties of the Dirichlet you can read about on the internet. One is that you can split a Dirichlet distribution: if you take a dimension π1 and distribute it by θ and 1 − θ, you get a new Dirichlet distribution with that component split into b and 1 − b. What does this mean? Say we start with a two-dimensional Dirichlet, k = 2 in the beginning; then we can split it into two parts, split again into four, and keep doing this for any number of splits, and this k can even go to infinity. That is the intuition behind the infinite Gaussian mixture model: you can keep splitting the weight vector into ever more component distributions, recursively.
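The α behavior is easy to see numerically; a small NumPy sketch (each sampled weight vector lives on the simplex and sums to 1):

```python
import numpy as np

rng = np.random.default_rng(4)
sparse = rng.dirichlet([0.1, 0.1, 0.1], size=5)   # small alpha: mass piles in corners
flat = rng.dirichlet([10.0, 10.0, 10.0], size=5)  # large alpha: mass near the center
print(sparse.round(2))
print(flat.round(2))
print(sparse.sum(axis=1))  # every row sums to 1
```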
This whole process is called a Dirichlet process, and it's what converts fixed-k Gaussian mixture modeling into an infinite Gaussian mixture. I won't discuss the math any further; instead, you can look at an example called the Chinese restaurant process. Say you have empty tables in the beginning, and people come in; anybody can sit at any table. The first guy sits at the first table. When a new guy comes in, he can either sit at an occupied table and share the food, or go to a new table; there is some probability, governed by α, of opening a new table versus reusing an existing one. That's how you model the restaurant, and it is the Dirichlet process I was talking about earlier. There's a related example called the Indian buffet. If you've been to north India (I'm from Delhi), the weddings are really grand; the buffets there have something like 500 dishes at times. That's also like a Dirichlet process: anybody can pick any dish and fill up their plate, and popular dishes attract a large number of people. These are common phenomena, and in fact you're going to observe one in the next ten minutes: the buffet out there is also a Dirichlet process. Here is a simple simulation of how the Dirichlet process works: people randomly come to tables and sit down, and you get a distribution out of that.

Let me also show you what's happening when I break it up; this is called a stick-breaking process. Choosing a cluster is like breaking a stick into pieces whose lengths sum to one: those are the weight vectors for choosing a cluster. Over different iterations I can have different combinations, and for each combination a mixture of Gaussians: a Gaussian with 3 peaks, with 2 peaks, 2 clusters, 3 clusters, many such combinations. This is a simulation of an infinite GMM actually running: as you can see, the clusters change, and the points are reallocated among the clusters. That's an infinite GMM in action.

If you want to use this, there are ready-made packages: PyMC3 and TensorFlow Probability have probabilistic modeling packages you can just run, and there's also a package in scikit-learn. It's a simple thing: you set a prior, and if the prior is a Dirichlet distribution you get a finite Gaussian mixture model, while if the prior is a Dirichlet process you get an infinite Gaussian mixture model. So that's the idea, but this is all complex, right? Probabilistic modeling is good, it can model lots of real-world things, but it's very computationally intensive and basically can't run on large data sets. So it's good, but not practical at scale.
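For the scikit-learn route just mentioned, a minimal sketch of the "infinite" mixture: BayesianGaussianMixture with a Dirichlet process prior, on the same kind of synthetic deltas (the component cap of 10 is an arbitrary upper bound):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(5)
deltas = np.concatenate([rng.normal(25, 2, 300), rng.normal(35, 2, 300)])

# Give a generous upper bound on components; the Dirichlet process prior
# prunes the ones the data does not need.
bgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
).fit(deltas.reshape(-1, 1))
print(bgmm.weights_.round(2))  # most weights collapse toward 0; about 2 survive
```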
That's where applied data science comes in, with an automatic clustering we can actually use. Based on the intuition we've developed so far, think logically: we want to cluster points on a simple axis. Instead of doing the clustering directly, let's split the points first and find the dense regions; that's more logical. We split based on density or continuity, and out of each region we then get the PDF, finding the main distribution of that region. This bottom-up approach is easier; it's the applied version of the same thing.

And how can we do that? To find the dense regions, you find a minimum distance with which to cluster them; that's the idea for finding dense regions. Then, once you have the regions, most likely some of them will not be unimodal; they'll have multiple peaks. So you just check each one: is it heavy-tailed or multimodal? If yes, go back and split it again, recursively. Heavy-tailed is a statistical check on the kurtosis: for a normal distribution the kurtosis equals 3, and if it's more than 6 the tail is very heavy. Similarly, from the kurtosis and related statistics we can check for bimodality. So it's a simple check: if you get a unimodal region, keep it; if you don't, go back and split again. A simple recursive operation of finding dense regions, and in the end you get your list of Gaussians.

As for finding d_min, the minimum distance to cluster, I covered this in the last talk, so I'll just summarize it quickly. Say you have dense clusters: the distances between points inside a cluster will be small, so there will be a peak at the beginning of the density plot over the distance matrix, with other peaks later. The first peak in that distance distribution will most likely represent the within-cluster distance. If a cluster is a bit sparser, it spreads out and the peak goes down, but the first peak still represents the within-cluster distance. So the idea is: make an assumption that you can verify with a simple EDA, that your data has locally dense regions, and then run this algorithm. Compute the distance matrix, take the first peak of its density, and that's the d_min you use for clustering; you can use the successive peaks for hierarchical clustering. That's the simple idea. This is the example I showed last time: I constructed this data from these points, took the first peak as the within-cluster distance, and using that d_min I recovered all the clusters.

And with that I'm just about done, so let me wrap up. There is also a simple way to automate this: do a grid search over the smoothing parameter. You want to find the smoothing that gives the optimal curve, so vary the parameter (you can obtain and attenuate it via a Fourier-transform-based smoothing) in a loop, like a simple grid search, and pick the best curve by maximizing a score such as the log loss. That's how you automate it, and eventually you get the better results. That's it; this was one technique. Thank you.
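To tie the bottom-up recipe together, a final illustrative sketch. Two simplifications to flag: it splits at the largest gap rather than deriving d_min from the first peak of the distance density, and the unimodality thresholds (kurtosis above 6, Sarle's bimodality coefficient above 5/9) are rough rules of thumb, not tuned values:

```python
import numpy as np
from scipy.stats import kurtosis, skew

def looks_unimodal(x):
    """Rough checks: heavy tail via Pearson kurtosis (normal = 3, > 6 = heavy),
    bimodality via Sarle's coefficient (> 5/9 suggests two modes)."""
    k = kurtosis(x, fisher=False)
    bc = (skew(x) ** 2 + 1) / max(k, 1e-9)
    return k <= 6 and bc <= 5 / 9

def split_dense_regions(x, min_size=10):
    """Recursively split sorted 1-D deltas at their largest internal gap until
    each region looks unimodal, then summarize each region by (mu, sigma)."""
    x = np.sort(np.asarray(x, dtype=float))
    if len(x) < min_size or looks_unimodal(x):
        return [(x.mean(), x.std())]
    cut = np.argmax(np.diff(x)) + 1  # split at the biggest gap
    return split_dense_regions(x[:cut], min_size) + split_dense_regions(x[cut:], min_size)

rng = np.random.default_rng(6)
deltas = np.concatenate([rng.normal(25, 2, 300), rng.normal(35, 2, 300)])
print(split_dense_regions(deltas))  # expect roughly [(25, 2), (35, 2)]
```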