Okay, everyone, welcome to anomaly detection algorithms and techniques for real-world systems. My name is Minaj Anandi. I'm the lead data scientist at StealthBit Technologies, and this talk is based on some of the work I've done at StealthBit over the past year and a half. A quick outline for the talk: first, a quick overview of anomaly detection and what makes it different from other data science and machine learning problems. Then I'm going to talk about detecting anomalies in three different settings. Data streaming settings, where data is coming in continuously and you need to do anomaly detection in real time. Density-based anomaly detection, which is when you have a bunch of data points and you want to figure out which one is unlike the others. And finally, anomaly detection in time series: if you have user activity over time, how do you find the anomalous peaks and spikes? Then I'll finish with two practical things: how you test these systems in the real world, and how you convey this information to end users.

So what are anomalies? That's the million dollar question. We all have some conceptual idea of what an anomaly is, but it's really hard to define. It's one of those "you know it when you see it" things, but if I asked you to define an anomaly, you'd probably give me something along the lines of: it's something that's noticeably different from what is expected, or different from everything else. And that kind of description lets us define anomalies flexibly. Anomalies are not some hard, steadfast thing that stays fixed over time; what is considered anomalous now may not be considered anomalous in the future. A couple weeks ago, kids running around a park at 3 a.m. would have been considered anomalous. Now there's this thing called Pokemon Go, and kids hanging around parks at 3 a.m. is completely normal.

There are very different approaches to anomaly detection. One thing we can do is develop a statistical model of normal behavior, and then test each observation as it comes in: how well does it fit the model? If it doesn't fit, we say, according to this model, you're acting anomalous. That's one way to do it. Another way is to take a more machine learning approach and use classifiers to label data points as normal or anomalous. The big issue with treating this as a machine learning classification problem is the huge class imbalance: more than 99% of your data will be normal, and less than 1%, maybe less than 0.1%, will be anomalous. That huge class imbalance is going to cause you big problems. You can't do deep learning on this because you don't have enough anomalous examples, and traditional machine learning algorithms like support vector machines and random forests will also have issues just because of the class imbalance. So this talk focuses on algorithms that are specifically designed to deal with anomalies and outliers that are very rare in your data set. The first setting we're going to talk about is anomalies in data streams.
In this setting, we have data coming in continuously and we need to be able to identify anomalies in real time or near real time. As soon as a data point comes in, within five seconds at most, we should be able to say this is normal or this is anomalous. That gives us a huge constraint, because when data is streaming you really can't keep track of everything. You can only keep track of maybe the last 100 events that happened. So you need quick and dirty methods that can label or identify things as anomalous or not very fast, while also working with the fact that you have limited memory.

Let's start with something that should be familiar if you've taken a statistics course: the z-score. The z-score is used all the time to measure how extreme an observation is. If you know the population mean and the population standard deviation, you can calculate the z-score, and it measures how extreme that value is, how likely it is. The general idea is that the mean tells you where the center is, what you should see the data centered around, and the standard deviation is a measure of spread: how far away can we go from the center before we say, hey, something's wrong with this point?

One way to use this is with moving averages and moving standard deviations. As the data comes in, you keep an average of, say, the last hundred data points and the standard deviation of the last hundred. As a new point comes in, you update the average, you update the standard deviation, and then you calculate a z-score for the new point. If that z-score exceeds some threshold, say 3 or 3.5, you flag the point as anomalous.

But there's actually a big problem with this. It's not so much in how we use the z-score; it's the fact that averages and standard deviations are kind of bad when it comes to extreme values. It turns out the standard deviation, and also the mean, are very sensitive to extreme values. A single extreme value in the data can drastically increase the standard deviation and drastically shift the mean. And as the standard deviation gets inflated, points that would otherwise be considered anomalous stop being flagged, because their z-scores get deflated.

To get a little into the mathematical theory of why this happens: what is the mean? Given a set of numbers, you want the mean as a summary statistic, a single number that represents the whole set. Mathematically, the mean is the number that solves an optimization problem. You can think of your set of numbers as a vector in a vector space, and the mean is the number s that minimizes the distance to that vector in the L2 norm. This may be too much math for 10 a.m., but the key thing to take away is that there's this (x_i − s)² term; it's quadratic. That means that when you have an extreme value, its influence grows quadratically. I'm going to show this on the next slide.
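As a rough illustration of the moving-window idea, here is a minimal sketch, with the window size, threshold, and function names being my own choices rather than anything from the slides:

```python
from collections import deque
import statistics

def make_rolling_zscore_detector(window_size=100, threshold=3.5):
    """Flag a point as anomalous if its z-score against a rolling window is extreme."""
    window = deque(maxlen=window_size)  # only the last `window_size` events are kept

    def score(x):
        is_anomaly = False
        if len(window) >= 2:
            mean = statistics.mean(window)
            stdev = statistics.stdev(window)
            if stdev > 0:
                z = (x - mean) / stdev
                is_anomaly = abs(z) > threshold
        window.append(x)  # update the window with the new observation
        return is_anomaly

    return score

detector = make_rolling_zscore_detector()
flags = [detector(x) for x in [10, 11, 9, 10, 10, 12, 10, 500, 10]]
```

The point of the closure is just that the detector carries its own bounded memory, which is the streaming constraint the talk describes.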
The median, on the other hand, solves a different minimization problem, optimization in the L1 norm, and it's pretty robust to extreme values. Let me show what the issue with the mean and standard deviation looks like. In these two plots, I created an array with 99 values of 10 and one extreme value. The extreme value is 10 to the power of the number on the bottom axis, so 10 to the two, 10 to the three, 10 to the four. You can see how drastically the mean increases with the value of that extreme point, and it takes a roughly quadratic shape; it's a huge spike. That shows the mean is super susceptible to a single extreme value. All I've done is add one extreme value, and it causes the mean to shift greatly. Same thing with the standard deviation: add one extreme value and the standard deviation shoots up really quickly.

So how can we get around this? Well, as I mentioned on the last slide, the median is pretty robust to outliers. So can we use the median to come up with a new way to define the center and the spread? Yes, we can, and it's called the median absolute deviation. It's a more robust cousin of the standard deviation. It works like this: you take the data, you calculate the median, you subtract the median from each value, and then you take the median of those absolute deviations. So it's the median of the deviations from the median, whereas the standard deviation is roughly the average of the deviations from the mean. This gives a more robust measure of spread, because it doesn't get affected by one huge outlier: the deviation from that outlier is not going to be the median, so it doesn't affect our median absolute deviation. And it's a pretty easy thing to code up. In Python it's just a few lines: you calculate the median, then you take the median of the absolute deviations from it.

Then there's a new version of the z-score, called the modified z-score. Using the median and the median absolute deviation, we can compute the modified z-score for each data point. You take your data point x and subtract the median (that's the x tilde in the formula), divide by the median absolute deviation, so that's your notion of spread where the standard deviation would normally be, and you multiply by the constant 0.6745. That's just a constant to make things work out mathematically; you don't really need more detail about why it's there. You can then use the modified z-score in place of the z-score to do threshold-based testing. With the modified z-score, the recommendation is roughly the same: if a point exceeds about 3.5, you flag it as an anomaly. This is nice because it's a quick and dirty method for real-time detection. You only need to keep track of two things, the median and the median absolute deviation, and you can quickly compute modified z-scores for each incoming data point and output normal or anomalous on the fly.

So now let's talk about density-based anomaly detection. Here we basically have a bunch of data points in some n-dimensional space.
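The slide's code isn't reproduced in the transcript, but a NumPy sketch along those lines would look something like the following (the function names and the 3.5 threshold default are my own choices):

```python
import numpy as np

def median_absolute_deviation(x):
    """Median of the absolute deviations from the median."""
    med = np.median(x)
    return np.median(np.abs(x - med))

def modified_zscores(x, threshold=3.5):
    """Modified z-scores; values whose magnitude exceeds the threshold are flagged."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = median_absolute_deviation(x)
    scores = 0.6745 * (x - med) / mad
    return scores, np.abs(scores) > threshold

scores, flags = modified_zscores([10] * 99 + [10_000])
```

With the 99-tens-plus-one-outlier array from the slide, only the extreme value gets flagged, which is exactly the behavior the ordinary z-score loses once the outlier inflates the standard deviation.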
We want to know which point is noticeably different from the others. That's essentially the goal of density-based anomaly detection. If you look at this plot, you can kind of see which one is going to be the anomaly, right? It's that point up in the top-right corner.

A quick primer on density. In statistical methods and some machine learning methods, you'll hear this idea of density-based methods; DBSCAN, for instance, is a density-based clustering algorithm. So what do they mean by density? The statistical theory is that we assume all the data is generated according to some probability distribution, say a normal distribution or some other particular distribution, and every probability distribution has a probability density function: the likelihood of a particular value appearing. That's what we mean by density-based methods. We want to infer the true probability density function generating this data. We don't know what it is; no one knows what it is, except whatever higher power, if any, you believe in. But we're trying to estimate it. And the way we'll estimate density is a little heuristic: think back to your elementary science classes. What did you learn density was? Density was mass over volume. That's the heuristic we're going to use as we go along. Mass is going to refer to the number of data points, and volume is going to be the volume of the region of space we're looking at.

The algorithm here is called local outlier factor. It was introduced around 2000, it's a very famous density-based anomaly detection algorithm, and there have been a lot of variations spun off of it. The intuition is that anomalies should be more isolated than the normal points. That red point up there is more isolated from all the other points. The idea is that if we quantify the relative density around that point, it should be much less than the relative density around every other point. So the goal of local outlier factor is to estimate the density around each point: how many points are around it, and how close are they?

To do this, we need a first intermediate value, the k-distance. For each data point, we compute the distance to its k-th nearest neighbor, where k is specified beforehand. K could be three, k could be five, k could be 100. Remember that density is mass over volume; the k-distance is going to give us our idea of volume. If you look at the outlier, the plot shows its k-distance to its fifth nearest neighbor, so it has a large fifth-nearest-neighbor neighborhood. On the other hand, this little point in the center, the purple one, has a small magenta neighborhood. That's a normal point; it has a very small fifth-nearest-neighbor distance because it's close to the other points. The more isolated a point is, the larger its k-distance will be, since it's farther from all the other points and we have to look farther to find its neighbors.

And now here's an interesting idea: the reachability distance.
This is going to be a little confusing at first. The reachability distance from A to B is the maximum of the k-distance of B and the distance between A and B. It's a non-symmetric distance function, but it gives us an idea of how your neighbors see you. If you imagine your nearest neighbors as your friends, the question the reachability distance asks is: do the people you consider your closest friends consider you one of their closest friends? If you're a normal point, then yeah, your closest friends probably also consider you a close friend. On the other hand, if you're the anomaly, you can go up to your closest friends and say, hey, you're my friend, right? And they'll say, no, sorry.

To show this: here is the anomalous point's neighborhood, and these teal points are its five closest neighbors. Now let's expand the neighborhood of this one point, one of its neighbors. If you look at it, the anomalous point is not in its neighborhood. So this teal point doesn't consider the anomaly one of its close friends, even though the anomaly considers it one of its close friends. That's what the reachability distance captures: if your neighbors see you as one of their neighbors, you're fine; if your neighbors don't see you as one of their neighbors, you have a problem.

Like I said, all of this is in service of estimating density. We have an idea of volume, and we have an idea of mass, because we specified how many neighbors to look for: if k is five, the mass is five, because we have a volume that contains five data points. So for each point A, we calculate its local reachability density by taking the average reachability distance from A to its neighbors and taking the inverse of that. You might wonder why we take the inverse. Well, the average reachability distance is like the volume, and by taking the average you're dividing by the number of points, which is like the mass. So we have volume over mass, and to turn that into density we take the inverse. That's why the local reachability density is one over the average reachability distance to the neighbors. Is that confusing to anyone? Okay.

Now we have a local density for each point, and we can calculate the local outlier factor score for each data point. The LOF score is basically the ratio of your neighbors' density to your own. The idea is that if you're an outlier, you come from a less dense area, so this ratio should be higher for outliers. The local density of that red point is going to be very low, and if you take its five neighbors, their densities are going to be much higher. So the ratio is a large value divided by a small value, which is extremely large. The higher the LOF score, the more likely you're an outlier. In this example, the red outlier up there has an LOF score of 2.7.
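To make those definitions concrete, here is a naive, brute-force sketch of the LOF computation as just described; the function name is mine, and in practice you would use an optimized implementation such as scikit-learn's `LocalOutlierFactor` rather than this O(n²) version:

```python
import numpy as np

def lof_scores(X, k=5):
    """Naive local outlier factor: k-distance, reachability distance, LRD, then LOF."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Indices of each point's k nearest neighbors (column 0 is the point itself,
    # so skip it; assumes no duplicate points).
    neighbors = np.argsort(dists, axis=1)[:, 1:k + 1]
    # k-distance: distance from each point to its k-th nearest neighbor.
    k_dist = dists[np.arange(n), neighbors[:, -1]]
    # Local reachability density: inverse of the mean reachability distance
    # from the point to its neighbors, where reach(A, B) = max(k_dist(B), d(A, B)).
    lrd = np.empty(n)
    for i in range(n):
        reach = np.maximum(k_dist[neighbors[i]], dists[i, neighbors[i]])
        lrd[i] = 1.0 / reach.mean()
    # LOF: average ratio of the neighbors' density to the point's own density.
    return np.array([lrd[neighbors[i]].mean() / lrd[i] for i in range(n)])
```

Run on data like the slide's scatter plot with k=5, the clustered points come out with scores near 1 and the isolated point comes out well above 1, which is the behavior the next part of the talk interprets.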
So what does an LOF score like that mean? First of all, a normal point should have an LOF score between about 1 and 1.5. A score of 1 means the average density of your neighbors is about the same as yours: you're in the same density region as your neighbors. On the other hand, if the score is much higher, it means you come from a less dense region. Say a point has an LOF score of 3: that means the average density of that point's neighbors is about three times larger than its own local density, so it comes from a region that's only about one third as dense. And that's kind of strange, because it means the points nearest to you are nothing like you at all. That's our notion of an anomaly here: a point that's nothing like its neighbors.

Okay, so the next and final section is time-series-based anomaly detection. In this setting we have some value indexed by time, and we want to identify the extreme spikes and troughs in the time series. One thing to keep in mind is that we want to identify both global anomalies and local anomalies. Global anomalies are the ones that stand out over the entirety of the time series, whereas local anomalies only stand out within their specific, short time range. This figure from the Twitter engineering blog shows it: the global anomalies are these huge spikes, while the local anomalies look normal compared to the global picture, but within their specific time section they stand out.

The algorithm here is called seasonal hybrid ESD, or seasonal hybrid extreme studentized deviate. It was developed at Twitter and released last year. It has two components: a seasonal decomposition component, which deals with the time series element of the problem, and the ESD component, the extreme studentized deviate test, which deals with the anomaly detection element. The general idea is that we want to remove some of the temporal noise from the data before we try to detect outliers in it. The decomposition removes the temporal noise we don't care about; once the data is cleaned up, the second component identifies the anomalies in the remaining meat of the data.

So, seasonal decomposition. Time series decomposition is a pretty classic method in econometrics. You break the time series into three parts. There's the trend, which is the actual thing you care about in the time series. There's the seasonal component, which captures periodic patterns; for example, in economics you might notice that consumption of electronic goods goes up every winter because people are buying Christmas presents. That's a seasonal thing that happens every winter, and you don't really care about that noise. And finally there's a random, residual component, which is just the error in the time series, a catch-all for whatever doesn't fit in the other two. Like I said, the trend component is really the thing we care about, because it carries the important stuff in the time series.
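For reference, here is a rough sketch of what that decomposition step looks like in Python; statsmodels' `seasonal_decompose` is one common way to do it, and the synthetic hourly series below is made up purely for illustration:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic hourly series: upward trend + daily seasonal pattern + noise.
idx = pd.date_range("2016-01-01", periods=24 * 30, freq="H")
values = (np.linspace(0, 10, len(idx))                        # trend
          + 3 * np.sin(2 * np.pi * np.arange(len(idx)) / 24)  # daily seasonality
          + np.random.normal(0, 0.5, len(idx)))               # residual noise
series = pd.Series(values, index=idx)

# Additive decomposition with a 24-hour period: gives .trend, .seasonal, .resid.
result = seasonal_decompose(series, model="additive", period=24)

# Remove the periodic noise before running the anomaly test.
deseasonalized = series - result.seasonal
```

The three attributes of `result` correspond to the three panels described on the slide: the trend line, the repeating seasonal pattern, and the leftover random fluctuations.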
What seasonal decomposition lets us do is take the time series, break it down into these three parts, and then remove the seasonal component; we basically remove the periodic noise. If we're monitoring users' behavior over time and we notice that people don't do much on Fridays, we don't really care about that. It's not specific to one user, it's everyone: on Fridays people kick back and relax. So we want to remove it.

Here's an example of a decomposed time series: the top panel is the observed series, and below it are the three components. If you look at the observed series, it's increasing over time, so the trend is essentially a straight line increasing over time. You'll also notice periodic fluctuations in the observed data, and that's the seasonal component capturing those periodic spikes and troughs. And then you have the random component, which captures the noise and fluctuations that can't be explained by the other two.

So now we've taken the time series and removed the seasonal component, the temporal noise. Now, how do we find the anomalies? The extreme studentized deviate test is a statistical procedure to iteratively test for anomalies. We specify beforehand the maximum number of anomalies we think there are, say 10 or 50, and then we iteratively test for them. One thing we have to be careful about with repeated testing is the risk of false positives; multiple hypothesis testing leading to false positives is a very important issue in statistics. It's a huge problem in psychology, where people will run, say, seven tests at a time, one of them comes back positive, and they declare they've found a meaningful result. XKCD has a comic about this, the one about jelly beans causing cancer. What the generalized extreme studentized deviate test does is allow you to do this iterative testing while compensating for the fact that you're doing multiple hypothesis tests.

Like all statistical tests, you have to specify the alpha value you want to test at, say alpha equals 0.05 for the 5% significance level, and you specify the maximum number of anomalies you're looking for. The procedure works like this. For each data point, compute the G-score, which is basically the absolute value of the z-score: take the data point, subtract the mean, divide by the standard deviation, and take the absolute value. Then take the point with the highest G-score and, using your alpha value, compute a critical value, a critical threshold. If the G-score of your test point is greater than the critical value, flag that point as anomalous. Then, regardless of whether you flagged it or not, remove it from the data set. You repeat these steps for the number of anomalies you're looking for; if you're looking for at most 10 anomalies, you do this 10 times. So the procedure is: find the most extreme point, test whether it's anomalous compared to what you'd statistically expect, flag it if it is, and remove it either way.
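A rough Python sketch of that iterative procedure follows. The function name is mine, and the per-step flagging mirrors the steps as described in the talk; the textbook generalized ESD test instead reports the largest step at which the statistic exceeds the critical value, so treat this as a sketch rather than a reference implementation:

```python
import numpy as np
from scipy import stats

def generalized_esd(x, max_anomalies, alpha=0.05):
    """Iteratively test the most extreme remaining point against an ESD critical value."""
    x = np.asarray(x, dtype=float)
    idx = np.arange(len(x))          # original positions, so we can report which points
    anomalies = []
    for _ in range(max_anomalies):
        n = len(x)
        if n < 3:
            break
        # 1. G-score: largest absolute deviation from the mean, in standard deviations.
        g_scores = np.abs(x - x.mean()) / x.std(ddof=1)
        j = np.argmax(g_scores)
        # 2. Critical value from the t-distribution at significance level alpha,
        #    adjusted for the current sample size.
        p = 1 - alpha / (2 * n)
        t = stats.t.ppf(p, n - 2)
        critical = (n - 1) * t / np.sqrt((n - 2 + t ** 2) * n)
        # 3. Flag the point if its G-score exceeds the critical value.
        if g_scores[j] > critical:
            anomalies.append(idx[j])
        # 4. Remove the point and repeat, whether or not it was flagged.
        x = np.delete(x, j)
        idx = np.delete(idx, j)
    return anomalies
```

In the seasonal hybrid version, this test would be run on the deseasonalized series from the previous step rather than on the raw observations.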
So now you have the remaining data, you find the next most extreme point, test it, remove it whether or not it was anomalous, and keep going until you've done the specified number of iterations.

Here's an example of the result from Twitter: they ran the seasonal hybrid ESD algorithm on this time series with alpha equals 0.05, and it flags a handful of anomalies. You can see the anomalies correspond to these global spikes where the value is really high, or these troughs where it's really, really low compared to what you expect. It's very noticeable: you can look at that picture and say, yeah, that's kind of anomalous, it's really high or really low.

So how do we do this stuff in practice? One thing I've learned is that if you're designing a system for an end user, you want to provide them with risk scores. You don't want to treat this as a classification problem; in machine learning terms, you want to treat it more like a regression problem, because it's very hard to say whether something is truly anomalous or not. Saying yes or no, this is or isn't anomalous, is a big judgment to make. Instead, you want to give a risk score between zero and a hundred: I think this is a pretty risky value, so I'll give it a risk score of 50, or a risk score of 70. I recommend the zero-to-a-hundred scale because people intuitively understand it: a hundred is bad, zero is fine. How do you get these risk scores? One way is, when you're scoring events, to assign a probability to each one. Given the critical value or the LOF score, you can say the likelihood that this is anomalous is, say, 0.75; multiply that by a hundred and it has a risk score of 75. You can layer additional calculations on top of that, but you want to give people risk scores. People understand risk scores; they don't really understand probabilities.

And finally, how do you test these algorithms? One key thing is that you have to test them in completely different environments. Don't just test locally within your company. You want to test within your company, you want to test at a beta customer, you want a small environment with maybe 50 users, and you want a huge environment with thousands of users, so you can see how well it does across different environments. This matters because in machine learning you usually do the training, the cross-validation, and then one round of testing on a held-out test set. That doesn't really work for anomaly detection; you need to test in multiple environments. In addition, since anomalies are pretty rare, I also recommend creating some synthetic data where you know there are anomalies in it. If you can find the anomalies in that, good. If you can't identify the anomalies in your own synthetic data, you have a problem: if you can't find the anomalies you know are there, how are you going to find the anomalies you don't know are there?
One more thing: over time you'll be consistently testing these algorithms and fine-tuning them, so I recommend building a test harness to automate the testing. As soon as you make a change, you can easily run the tests and see how it performs compared to previous iterations. And I recommend building your own test harness, so you know exactly what you're looking for and testing for, rather than relying on some off-the-shelf harness. And so, yeah, that's my talk. I guess I'm done early, so if there are any questions I can take those now.

Why do you have to specify the number of anomalies? Theoretically you could run the test for every data point, but then you run into the multiple comparisons issue again. If you test enough times you're going to get a bunch of false flags, because of the way the critical value works: it decreases as you go. You test the first point against some high critical value, remove it, then the critical value gets a bit lower, so you're gradually relaxing the test. If you do it too many times, or for the entire data set, you're going to find too many false flags because your critical value gets too low.

How well do these scale? Let's take them in order. The first one, the modified z-score and median absolute deviation, scales pretty well, because a lot of it can be parallelized easily. You calculate the median globally, then you can calculate the deviations; if you have a distributed system like Spark, you compute the deviations on each node, bring them back together, and calculate the median absolute deviation globally. So this works very well and scales to large data sets, because it's a quick and dirty method: it's supposed to be quick and to scale without too many complex computations.

Next, the local outlier factor. How well does that scale? If you implement it correctly, the k-distance part scales reasonably well. In SciPy there's a data structure called the KD-tree, in the spatial library, which does nearest-neighbor queries really efficiently. If you use that, or some other efficient KD-tree implementation in whatever language you use, LOF scales well with the number of data points. The big issue with LOF is that it scales poorly with the dimension of your data. If you have around 10 dimensions it works okay; if you have 10,000-dimensional data points it doesn't scale so well, because you have to calculate distances in 10,000-dimensional space and you run into the curse of dimensionality. So it scales well with the number of data points and pretty poorly with the number of dimensions. There is an algorithm that deals better with high-dimensional data; I can't remember what it is off the top of my head, but it does exist. People have thought about this, and that's why there are variations of LOF.

And finally, the seasonal hybrid ESD. This scales very well with the number of data points.
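For the KD-tree point, a minimal sketch of computing k-distances with SciPy's compiled KD-tree might look like this; the data here is random and purely illustrative:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 10))   # 10,000 points in 10 dimensions

k = 5
tree = cKDTree(X)
# Query k+1 neighbors because each point's nearest neighbor is itself.
dists, neighbor_idx = tree.query(X, k=k + 1)
k_distances = dists[:, -1]          # distance to each point's k-th nearest neighbor
```

This avoids the O(n²) pairwise-distance matrix from the naive LOF sketch earlier, which is what makes LOF practical as the number of data points grows.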
I mean, it was invented at Twitter, so it has to scale to the amount of, you know, vomiting into the void that people do on Twitter every day. So it scales well to large amounts of data, and time series are usually just one timestamp versus one value, so you don't run into the dimensionality problem.

How do you pick k? That's the million dollar question, like in any of these algorithms, like k-means: how do you pick k? The best thing is to try different values and see what you get, because a point that looks anomalous for one k might not for another. Maybe that point wouldn't look anomalous at all if there were two other data points right next to it and I had set k to two. So you want to try different values, keeping in mind that the higher k is, the slower the algorithm will be. As a rule of thumb, if there are n data points, I generally start with log n as the value of k and work from there.

What do you mean by a mixture of distributions? So you're thinking of something more like using k-means to find small anomalous clusters, where there are two different data-generating processes. That's actually something LOF really can't handle; LOF assumes all the data is generated from the same distribution. I'm sure there are other algorithms that handle that, like mixtures of Gaussians, I'm just not sure exactly which to recommend.

On feature selection: I imagine it would be similar to how you do feature selection for other supervised learning. Because of the company I work for, I really don't have to do much feature selection; the company has proprietary security software, and that data gives us maybe four to five features for each user, so I can just use all the features it gives me. So I don't really have any experience with this, but I imagine it wouldn't be too different from feature selection in any other traditional machine learning problem.