Hello everyone, and welcome to our next talk, which is mine. To introduce myself: I did a postdoc in machine learning after getting my PhD in algebraic topology with some of the people who did the math behind the concepts in this talk, and I've extended that work for this project. That's enough about me; let me play my talk for you.

Welcome everyone to my talk on calculating concept drift very quickly with this library called Goko, which is based on cover trees. My name is Sven Cattell. I am a senior data scientist at Elastic, and my Twitter handle is comathematician.

The key takeaway of this talk is that this is a novel way of calculating drift. It's done with a new technique using cover trees, and because it's so fast, log n time to incorporate a new sample, we can use concept drift detection in new ways it couldn't be applied before.

We're going to go through the objective and why we care about concept drift, then mathematically how it's done, then a couple of results, finishing up with next steps.

So, why do we care about drift? Well, our data drifts. I work with malware, and there are new malware families coming out every month, and new examples of benign software coming out every month. Every time Adobe patches Photoshop, the pattern changes slightly and its location in our dataset shifts slightly. Every time a new piece of malware comes out, it's a new PE file, and it changes how the malware data is distributed slightly.

That gets into the second problem: drift lowers the efficacy of the models we deploy. Our models expect data in roughly the same general location as our training data, and every time we deploy, we fix our model's view of the world at a certain date and then test it on what is essentially future data. If it's trained in March, it has no perspective on the data coming in at the end of April, and it might not work very well on it.

One of the methods for dealing with this is to build a dashboard that tracks the error, and when the error gets too high, you discard the model and move on to a new one. But the problem is that I don't trust VirusTotal labels, and I don't think anyone really should trust VirusTotal labels, even though, given the sheer volume of data, trusting VirusTotal labels is the best approach we have. For other applications I wouldn't trust early labels on new data either, because we have to move very quickly, and sometimes we only get a label when a customer complains. We'd rather get ahead of that and figure out where the drift is happening and what's going on.
And then, additionally, aside from the fact that things are just changing all the time, our models are also under attack. People are trying to bypass them by doing new and weird things to the data, like packing a binary differently and seeing if that works, or posting with a different pattern to see if they can get past, say, the Facebook spam filter. They're constantly innovating with the specific goal in mind of bypassing our models. This may not be an adversarial example per se, as in the literature, but it sure acts like one sometimes. Additionally, we could be under attack via data poisoning, and maybe, if the detector is fast enough, we can get ahead of those things beforehand.

So here's a trivial example. We have two datasets, and one is our training dataset: 1, 1, 2, 1, 1, and so on, so our training dataset has 70% ones, 20% twos, and 10% threes. This is what we trained on. Realistically, there's no point in training a machine learning model on a sequence of numbers, but suppose we do. Then we deploy this model and we get this other sequence of data, 2, 2, 1, 2, 1, and so on. When we actually deploy, the data we see coming over the wire has 30% ones, 60% twos, and 10% threes. That's a little bit different from our training set. The real world is different from our training set.

In this case we can model the distribution very easily, because it's a discrete space. There are only three things our data could be; it isn't continuous like a lot of the data we have to deal with in data science. So we can actually compute the distribution fairly easily, and we can feel confident in our estimate of it. We'd have to do some Bayesian statistics to do this properly, which we're not getting into in this talk, but this is basically how it works.

Then, once we have a concept of our training distribution and our real-world distribution, or our test distribution as I'll be calling it today, we can take the Kullback-Leibler divergence. This is a measure of the distance between the two distributions, and it's very easy to calculate on this categorical, discrete data. It's just going to be 0.7 times the log of 0.7 over 0.3: that's the training probability for ones over the test probability for ones, take the log, then multiply by the training probability for ones, and that gives you the first term. The second term is the same thing for twos, and the third term is the same thing for threes. Then we add those all up, and the KL divergence between these two distributions is 0.16.
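As a sanity check on that arithmetic, here's a minimal sketch in plain Python (nothing to do with Goko; note that the slide's 0.16 comes out if you use base-10 logs, while natural logs would give about 0.37):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions given as {category: probability}.
    Assumes q[c] > 0 wherever p[c] > 0."""
    return sum(pc * math.log10(pc / q[c]) for c, pc in p.items() if pc > 0)

train = {1: 0.7, 2: 0.2, 3: 0.1}   # 70% ones, 20% twos, 10% threes
test  = {1: 0.3, 2: 0.6, 3: 0.1}   # what actually came over the wire

print(round(kl_divergence(train, test), 2))  # 0.16, as on the slide
```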
So that's all great, but the problem is that our data actually looks like this, and I don't know whether there's any difference here: are my blue points sampled from the same distribution as my orange points? I don't know. It's really hard to tell just by looking at this picture. Now, in this case they are: my orange points are sampled from a two-dimensional Gaussian, a two-dimensional normal distribution that's a nice sphere, and the blues are too. Both covariance matrices are the identity, and it gives you this nice distribution, and it looks okay. What I'm going to do now is drift the blues slightly over. Can you tell?

Well, every single time I put down a blue point, it's right next to some orange points, so this looks like it could be from the same distribution. I can't really tell with my eyes. One of the techniques we might try is to take the k-nearest neighbors and see whether the distance to them goes up over time, in which case we're out of distribution, but as you can tell here, that's not going to tell you much. If I drift further, I've got some outliers now, so maybe that will start to work. If I go further still, the distributions have really separated: the center of this distribution should be around here, and the center of that one is over there. That's real drift, but you've still only got some outliers over here. And what happens when there are millions of orange points and only a few hundred blue points? You might not be able to tell the difference in that case at all. Also, what I've described to you is a feeling; to actually get at the math of this, I'd have to model the distributions these samples came from, and that's quite a hard problem, especially in high dimensions. Things get complicated.

The solution I came up with is to use a cover tree. Now, why do I want to use a cover tree? The key takeaway for cover trees is that a cover tree is a k-nearest-neighbors data structure. There are many like it: there are KD-trees, there are ball trees, there are k-means trees, a whole bunch. But the cover tree has this wonderful property, proven in 2016: it can arbitrarily well approximate the underlying distribution of the data. There are caveats to that statement, and the precise probabilistic version would take a whole page, but the core concept of the proof is that it holds for a nice dataset. To know what a nice dataset is, you need to know what a low-dimensional manifold is, and you have to have enough data from it. But for a nice dataset, and most of our datasets are pretty nice, sort of, a cover tree will arbitrarily well approximate the underlying distribution of the data. You can go read the paper if you want to know exactly what that means. For us, what this means is that if I use a cover tree to build a k-nearest-neighbors model of my data, I can run k-nearest-neighbors queries, or I can just fiddle around with the properties of the cover tree itself to get information about my dataset. I can tell things like the local dimensionality, I can tell how things glue together, I can tell the rough shape of my data fairly well, and I can infer what the clusters are.
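Goko itself is written in Rust; to make the covering idea concrete before we walk through the slides, here's a toy Python sketch of a covering hierarchy. It keeps only the covering property (each node's ball covers every point stored beneath it) and drops the nesting and separation invariants a real cover tree maintains, so treat it as an illustration, not Goko's data structure; the `Node` type, scales, and thresholds are all made up for the example.

```python
import math
import random

class Node:
    """One node of a toy covering hierarchy: a center and a radius whose
    ball covers every point stored at or below this node."""
    def __init__(self, center, radius):
        self.center = center
        self.radius = radius
        self.count = 0        # points covered by this node; the key field later
        self.children = []

def dist(a, b):
    return math.dist(a, b)

def insert(node, point):
    """Descend into a covering child, or open a new child one scale down.
    Assumes the root's ball covers the data."""
    node.count += 1
    if node.radius < 1e-3:    # leaf scale: stop refining
        return
    for child in sorted(node.children, key=lambda c: dist(c.center, point)):
        if dist(child.center, point) <= child.radius:
            insert(child, point)
            return
    child = Node(point, node.radius / 2)
    child.count = 1
    node.children.append(child)

# build the "training" tree on a 2D standard Gaussian
random.seed(0)
root = Node((0.0, 0.0), radius=8.0)
for _ in range(100_000):
    insert(root, (random.gauss(0, 1), random.gauss(0, 1)))
```

The `count` field is the important part for this talk: everything that follows is bookkeeping on those counts.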
Let's go build a very simple cover tree on two-dimensional data. So here's an infinity sign, a bow tie, whatever you want to call it. I start my cover tree by picking a point at random and then building a sphere, in this case a circle, that covers everything. That's the start, but it doesn't help me that much yet: I haven't split up my data geometrically to divide and conquer the k-nearest-neighbors query. The naive way would take linear time, big O of n, and we want to divide and conquer so it takes log n.

So here's what I do: I shrink my sphere down, add another one, and cover the data again. You can kind of tell that this object is longer than it is wide; if you really squint, it's kind of a one-dimensional thing in this direction. That's great, but now I say, okay, I'm going to split it up further, to really divide and conquer. I've got these two nodes, both children of the node on the previous slide, and I'm going to build their children. So here are their children. Each of the two nodes has four children, and if you squint a little, the data is kind of square-shaped in this dimension, at this scale. At the top scale it was too blurry and I could only see a line; at this scale I can kind of see that it has a nice square shape, and so there are four children. The point is that the dimensionality is relevant to the number of children you have. Then I can go further, and it keeps splitting up nicely: I've got a one-dimensional object here again, because all these children are kind of one-dimensional-ish. I can split up some more, and you can see that the tree is conforming to the data nicely. At every single step of the way I can make inferences about the shape of my data, and they're very cheap, because it's basically counting.

So how do we build a probability distribution out of this? Earlier we built a very simple probability distribution over a discrete space. Now I've got a tree, and at every single node I can go down to one of its children. The number of children is variable in this case, but from a parent node I always have to go to one of its children, so I always have a discrete choice of where to go.

How does that apply? Here is a very simple cover tree, and I've colored it by the relative population of each node. About 66% of the data is underneath this node, about 33% of it is underneath that one, and you can see the colors change, so I get a picture of how dense my data is. That alone doesn't help me, because I need to make a choice at each step. This is what helps me make my choices: I've got 66% of my data over here and 33% over there, and once I take one of those two choices, say this one, I'm at this node with two choices again, and the relative probability of one choice versus the other is, say, 0.1 for this path and 0.9 for that path. Once I get to this node, I go straight down with probability 1, because I have no other choices. So at every single point I have a discrete distribution, like the one at the start, which tells me where I'm going. And because my tree is geometrically motivated, these choices, which are all discrete, are easy to compute, easy to count, and easy to keep track of. Now I can compute the KL divergence, because it's trivial: it's just counting, taking logs, and summing things up. As I said, we're going to use some Bayesian statistics, and this is my prior distribution: this is how it's built from my training set.
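Continuing the toy sketch (again illustrative Python, not Goko's API): the training counts stored at each node give the prior categorical distribution over its children, and a point's route down the tree is its sequence of discrete choices.

```python
def child_distribution(node):
    """Prior categorical distribution over a node's children, read straight
    off the training counts (e.g. the 0.66 / 0.33 split on the slide)."""
    total = sum(c.count for c in node.children)
    return [c.count / total for c in node.children]

def path(node, point):
    """The sequence of discrete choices a point makes going down the tree:
    at each level, descend into the covering child (the nearest, if several)."""
    choices = []
    while node.children:
        covering = [c for c in node.children if dist(c.center, point) <= c.radius]
        child = min(covering or node.children, key=lambda c: dist(c.center, point))
        choices.append((node, child))
        node = child
    return choices
```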
How do I include my test set? When I see a piece of data coming over the wire, I can make an inference about where it should belong in this tree: if it had been in my training set but hadn't been involved in building the tree, where would it have ended up? So I have a new point here, the point that just came over the wire, and I query the tree and find that this point is covered by this parent, and by that parent, and so on all the way up. The path for this point goes up this way, and for each of the nodes on that path I can increment my probability distribution and get a posterior, one that includes the new information I just got. Then there's a new point over here, and I can do that again with the other point. Cool, I have a posterior with that point added, and you can see the distribution change. Then I can add some more points, and some more points. These are all my posteriors, and I update the original tree by taking the path of each point and updating all the little discrete distributions along it.

So this enables me to quickly make a posterior distribution out of my prior distribution, and everything is discrete. Then I can take the KL divergence between my posterior discrete distributions and my prior discrete distributions. In this case, this node has a prior probability of 0.33 and this one 0.66, and afterwards, after I've included my test set, this one has a posterior probability of about 0.4 and this one about 0.6. Those are quite different, so I can plug them into the KL divergence formula and get a positive number for the KL divergence at this node, and I can do the same for this node, and this node, and this node, get all of the KL divergences, and add them up. That gives me the total KL divergence, over my original tree, of the posterior distribution with respect to the prior distribution, which is really useful. And the code for doing this lets you do it very quickly, because it's basically just counting, a couple of logs, and some sums.
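To recap the mechanics of the last few slides in code, continuing the same toy sketch: the posterior is just a table of extra counts keyed by node, updated along each test point's path in O(log n) per sample, and the drift score is the summed per-node KL of posterior against prior. For simplicity this re-walks the whole tree to score; a real implementation (Goko included, I'd assume) keeps the score updated incrementally and does the Bayesian smoothing properly.

```python
from collections import defaultdict

def observe(root, point, extra):
    """Fold one test point into a posterior: bump a count at every node on
    its path down the tree. O(depth), i.e. O(log n), per sample."""
    for _, child in path(root, point):
        extra[id(child)] += 1

def tree_kl(node, extra):
    """Total drift score: sum over nodes of KL(posterior || prior) of the
    discrete go-to-which-child distributions."""
    if not node.children:
        return 0.0
    prior = [c.count for c in node.children]
    post = [c.count + extra[id(c)] for c in node.children]
    ps, qs = sum(prior), sum(post)
    kl = sum((q / qs) * math.log((q / qs) / (p / ps)) for p, q in zip(prior, post))
    return kl + sum(tree_kl(c, extra) for c in node.children)

# 500 drifted test points (mean shifted to 1.5) against the tree built above
extra = defaultdict(int)
for _ in range(500):
    observe(root, (random.gauss(1.5, 1), random.gauss(0, 1)), extra)
print(tree_kl(root, extra))    # grows as the test distribution drifts away
```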
Let's go back to that original test set. I've got basically a blown-up version of those earlier slides with the Gaussian distributions: a large number of points from my training distribution, which is a single Gaussian sample centered in the middle, and a small number of points from a test distribution. In this case I have a hundred thousand points from my training distribution and only five hundred points from my test distribution, and both have an identity covariance matrix. And it's 20-dimensional, which actually makes it harder than most datasets, because MNIST is about 10-dimensional if you really get down to it. If you want to know what that means, you can contact me on Twitter and ask why MNIST might be lower-dimensional than it looks. I build the tree from the training distribution and fix it, and then I make a posterior distribution for a bunch of test distributions and take the KL divergence. You can see it's a bit noisy; the reason is that the measurement is fairly sensitive to the exact positions of your points.

That's one of the problems, and one of the things I'll talk about at the end: there are ways to normalize this and resolve it. I designed this for speed, not really for accuracy, and I'll get into why afterwards, because honestly I came to the drift application backwards. But you can see that the further you separate the two distributions, the higher the number gets, so the sanity test, the unit test of "does this work?", is correct.

Now I'll compare this to other methods. Wasserstein takes big O of n to the fourth, and Wasserstein is the only other drift calculator I know of that is model-independent and doesn't involve some funny business with building an estimated distribution initially. This takes O(k log n). Another big difference from Wasserstein is that the traditional way of computing it requires a test set and a training set of relatively equal size. You can't have the test set be 0.5% of the training set; that just would not work so great with the traditional methods I know of. This one, honestly, is online: the test set can be a tiny fraction of the size of the training set, and because it's online, you can do it in real time.

And this is stupidly fast. On the open-source EMBER malware dataset, you can build a reasonable cover tree and then track 16,000 new samples per second on my laptop. That's fast enough that this would not be the bottleneck; inference would be the bottleneck if you built a cloud model and put this as a guard in front of that cloud model. We built it so that it can go fast enough to defend.

And now, here is where the thing originally came from, before we backtracked onto drift: the test set attack. This was originally categorized by Ian Goodfellow, and some defenses have been proposed by Nicholas Carlini and others. Here's the gist of the attack. Normal users just query. They don't care whether they got a false positive or a false negative; they just keep querying. For us, think of a normal user as an enterprise user trying to figure out whether all the new binaries coming over the wire are malicious or benign: they'd be submitting every single binary they see on their network to a service, an AV product, that tells them whether it's malicious or benign. But a malicious user submits things with a specific goal in mind. They care deeply about the label, and they care deeply about finding a false negative: a malicious sample that we classify as benign. So what they'll do is try a bunch of stuff until they find something that is misclassified as benign, is actually malicious, and bypasses our model, and then they deploy it as widely as possible. If you apply the fast drift calculator to this, you can attach a little drift tracker to each user and, as samples come over the wire, track drift for that user, as in the sketch below.
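A toy sketch of that per-user guard, reusing `observe` and `tree_kl` from above; `THRESHOLD` and `alert` are hypothetical stand-ins you would supply, not Goko API.

```python
from collections import defaultdict

trackers = defaultdict(lambda: defaultdict(int))   # one posterior count table per user

THRESHOLD = 10.0                                   # placeholder; tune on benign traffic

def alert(user, score):
    print(f"possible test set attack from {user}: KL divergence {score:.1f}")

def handle_query(root, user, sample):
    """Fold the sample into that user's personal drift tracker and alarm
    if their KL divergence spikes far above benign levels."""
    observe(root, sample, trackers[user])          # O(log n) per sample
    score = tree_kl(root, trackers[user])
    if score > THRESHOLD:
        alert(user, score)
    return score
```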
If a user starts exploiting something, they're going to produce a very boring distribution: it's just one point repeated over and over again. Your KL divergence spikes up and then levels off, because they're essentially tracing the same path through the tree over and over and over again. In the results you can see an exploration phase, where the malicious user tries to look for a benignly misclassified sample, and then, when they find one, an exploitation phase where the KL divergence rises rapidly. Note also that this is on a log scale, so this is literally hundreds of times more KL divergence than the benign samples, and over here it's the same thing: hundreds of times more KL divergence than anything benign, once the attacker is successful. Because of this, you could probably build a very good online defense of your system against a test set attack.

So here's where the system was originally built, with this attack and others in mind, because I'm more interested in defending my models than in calculating drift. But because it's so bloody fast, it does drift detection really well, and because it's so bloody fast, you can retrofit it.

Here's where I'll talk about some next steps. As I mentioned previously, this thing is a bit noisy. There are ways to fix that: window sizes and some tuning things that I didn't go into in these slides, because I'm already at 20 minutes. There are tuning systems you can use to really clean up and de-noise the signal. There are also some interesting weighting functions you can apply inside the tree, because there's some noise from, for example, an overactive leaf node that can end up contributing a very high KL divergence all on its own and throw everything off. There are a couple of other little things like that that still need to be cleaned up before this goes into production, but the system itself is looking very promising, and it's honestly the fastest of its kind that I can see in the world.

Yeah, and that's basically it for Goko's ability to detect test set attacks and other drift. The library is at github.com/elastic/goko. It's named after my grandmother, because the base algorithm for this thing is called GMA, and if you say that while drunk you can kind of get "grandma", and "grandma" was a bit too on the nose, so we named it, I mean, I named it, Goko. Anyway, I'm looking forward to hearing from you guys on Twitter and Twitch and Slack or wherever. Thank you very much, and good night.