Okay, perfect. All right, so with that being said, it's my pleasure to introduce Sven Cattell, who will be telling us about the math of ML security.

Hi. So this talk is going to focus more on the security problem for next-gen AV as a whole. The math is actually really simple, but the way I approach it is probably largely motivated by my background; how I actually solved some of the problems I've encountered doesn't involve any heavy or interesting math, in some sense. To start off with, I got my PhD in equivariant algebraic topology from Johns Hopkins. My advisor was Nitu Kitchloo, and I didn't want to continue in algebraic topology; category theory is not quite my bag. So at the end of my PhD I had two options: leave academia, or go into teaching, and I wanted to make a livable wage. So, you know, in a state of severe panic at the end of my PhD, I started teaching myself machine learning, because the machine learning industry pays decent money. I had also been into DEF CON: when I graded math exams and math homework, I'd be watching DEF CON talks in the background. So I finally went to DEF CON after I finished my PhD and met some people there. The next year I was there, I met some more people and founded the AI Village at DEF CON, which is sort of a workshop inside of DEF CON that focuses on machine learning security, the actual practitioner side of things. It's a community; you can join our Discord. We've had 3,000 people through our doors, and pretty much the entire data science teams of Sophos, Cylance, Endgame, Elastic, all of the big players in next-gen AV, are in the village and active. Through that, I managed to get a job; I can give more details about that later. So that's my background.

I want to talk about the industry problems I'm experiencing, what machine learning security means and the terms I'm using here, and then how I solved some of these problems and where I think more thought and research needs to go in order to actually do machine learning security correctly. So first I have to define what this means: security background, defining what machine learning security means, and the threats we're dealing with, because we have to do correct threat modeling if we want to do security; then blue team objectives, which are slightly different from just defending against threats; and then how I solved some of the problems I mention in the first three parts.

Security background. A vulnerability is a bug in software that attackers can use to get in. A CVE is a vulnerability that's been registered with NIST, with the government: once a threat is known and indexed in the database, we can refer back to that particular CVE when some malware uses it or an attacker uses it to breach a system, and CVEs get used in insurance and all sorts of marketing materials. Red team is the offense, the people trying to break into systems. Blue team is the defense, the people trying to prevent break-ins, though that second term is evolving, in my opinion. An advanced persistent threat is a person or group that is trying to break in and has a lot of money; think the NSA, the KGB, the PLA, that level.
They are not necessarily governments, but they are well funded; very few criminal organizations reach this level. This is the main boogeyman in security, the thing we're really trying to defend against. A zero day is an attack that's brand new: before it's registered as a CVE it's a zero day, and once it's registered it's known, and then it's one day, two days, however many days since the threat became known. A script kiddie is slang for someone who doesn't develop their own attacks. I might drop these terms throughout the talk, so those are the definitions. I do machine learning blue team, so I'll be speaking from the blue team side.

So, antivirus. You've all installed antivirus on your computers, but the industry is changing. In the 1980s it was McAfee and Symantec. In 1999, at the turn of the century, there were about 100,000 known, registered malicious samples. Seven years later, there were 5 million. Now I manage a database that is severely downsampled and we have 150-ish million, but those are just the ones we train on; we've got more than that. We get millions of samples a day and select the ones we like from that; by some counts it's more than 500,000 new samples per day. That's a lot of malware. What happened was, in the 90s and 2000s we used signature-based defense: you pick a byte sequence that is vital to the software's workings. This required a reverse engineer to go into the binary, decompile it, figure out what's going on, and then pick a byte sequence that's significant and would be hard to change, because the mitigation for a signature is to figure out which bytes are matched and change one of them so the signature no longer matches. Then you play that cat and mouse game and layer on various heuristics and things. But with hundreds of millions of samples, that starts breaking down, so we started using machine learning.

On the left here is the PE file format. It's kind of hard to read, but what you should know about PE is that there's a DOS header from the 80s, from when that was how executables ran on Windows machines, a DOS stub, and all sorts of different things that go into this file that all have to be exactly correct or the file doesn't run. There are shenanigans you can play here, but if this thing doesn't run, if the entry point is slightly off, the code won't execute and you will induce a crash. So you have this very structured data, which is different from other machine learning disciplines. You may have seen images; those are unstructured data, and I'll talk more about that later. With an image you don't really care about the file format: you change one of the pixels and it doesn't break the image. So we have this very rigid thing. We have similar situations for network packets, file events, all sorts of other stuff: if you manipulate a byte in the IP address header of a network packet, it will no longer send and you have a broken packet. So we security people have to turn that structure into features.
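Backing up to the signature era for a second, here is a minimal sketch of that kind of byte-sequence matching; the signature bytes below are invented for illustration, not a real malware signature.

```python
# Minimal sketch of 1990s-style signature matching: scan a binary blob for
# a byte sequence a reverse engineer picked out as vital to the malware.
SIGNATURE = bytes.fromhex("e8000000005d8bc5")  # invented "vital" byte sequence

def is_flagged(data: bytes) -> bool:
    """Flag the sample if the signature byte sequence appears anywhere."""
    return data.find(SIGNATURE) != -1

sample = b"\x00" * 16 + SIGNATURE + b"\x90" * 8
print(is_flagged(sample))                                 # True: signature present
evaded = sample.replace(SIGNATURE, SIGNATURE[:-1] + b"\x00")
print(is_flagged(evaded))                                 # False: one byte changed
```

The mitigation the talk describes is exactly this cheap: flip one matched byte and the scan fails, which is part of why the approach collapsed at hundreds of millions of samples.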
So you do machine learning on this, and there are two places you can do it. One is on the edge, on the client's computer. There we're restricted in the amount of compute we're allowed to use, because it would be stupid if protecting someone's computer required a hundred percent of that computer a hundred percent of the time. So we have a compute restriction. The other option is to run in the cloud, where you ship the binary, or parts of the binary, or something about the binary, up to GCP, AWS, or Azure, do the processing there, and ship the results back down. The last thing you have to know is that labels are very expensive. You sometimes need an expert to read a binary to tell you whether it's malicious or not. There are cheat sheets they use as shortcuts, and there's a whole well-established process for doing reverse engineering. That's how we develop our features and labels: we use experts to help us build them, but those are two different problems. So that's the weird part.

Now let's talk about machine learning threats. Spam models are the best place to start thinking about machine learning threats, because they're the most realistic, most attacked models on the planet, and also the oldest. The Facebook and Gmail spam models are without a doubt the most attacked machine learning models on the planet. The first attack against a machine learning model demonstrated in the wild was against a spam model in 2004, and Facebook has a whole team inside its site integrity organization devoted to building and maintaining its spam models. Gmail, same thing. The way they get attacked is basically that millions of people try to bypass them every day, and somebody gets through. When that somebody gets through, they suddenly have cash in hand, essentially, a very valuable thing, because they know a way to get their spam through, and they'll sell it or communicate it to other spammers. Once they do that, those spammers start using the technique, and the spam model is essentially useless; you have to retrain and redeploy with the new stuff labeled. And they're going to keep trying the old stuff too, because maybe you messed up. This requires you to be fairly agile: you have to retrain your model very quickly, and these companies deploy new models at a very rapid pace.

Finally, the other reason to start with spam models is this company called Proofpoint that does enterprise spam and phishing detection for Outlook. Phishing is where you send an email and try to get someone to click on a link so you can steal their credentials: the classic version is an email supposedly from US Bank, you click on a login, and then the attackers have the ability to log into your bank and steal your money. It's much worse at the enterprise scale, because now they're trying to email your CEO and get your CEO to wire, like, 12 million dollars. There have been cases where 3 million dollars was wired to an offshore bank account in the Caymans because attackers phished a CEO with a fake bill, the CEO paid it, and then the money was just gone. So Proofpoint does enterprise security for email, for when you're running your own Outlook server and not using Gmail.
Researchers Will Pearce and Nick Landers stole the model, started building bypasses offline, and got past everything. It became the first machine learning CVE, back last year. Very exciting stuff. So machine learning actually has one of these registered CVEs, and there are going to be more coming down the pipeline.

Malware models are a similar story, but a little different; I work with malware models. With spam the target is fixed: if you want to spread spam in India, you have to spread it on Facebook, because that's what everyone there uses. They don't use email as much, and Facebook is so integral to the society that they need it. So you have to target Facebook. But for malware you don't really know, because there are two different modes of attack. An APT is going to use a targeted attack, trying to get at one person or one organization, and they are going to figure out everything they can about that organization: what security products it uses, what mitigations it has in place. They're going to find out the mother's maiden name of the CISO, the chief information security officer, at the company they're targeting, whatever they can use to get through. And they have a lot of money, so they're probably going to get through, but maybe not if you're good; you need to defend against that. That's one side. The other side is people who spray and pray: they create a piece of ransomware that they're trying to get onto everybody's computer, and they don't know or care which AV company you're using to protect your computer. They just try to get past as many products as possible. Those are the two main threats. The APTs you need to detect by behavioral means, because your protections are probably going to fail at some point, so you need layers; you put other types of rules in there. The spray-and-pray type you can probably detect with a machine learning model. They're not targeted, but they are aware that a lot of people are using machine learning models. So modern malware is tested against VirusTotal, communication between malware authors exists, they sell code to each other, there's a whole ecosystem; but it's not quite as fast as spam, because it's harder to write and modify malware than spam.

Microsoft Tay: I don't know if you remember this from 2016. Microsoft deployed this chatbot on Twitter, and neo-nazis and trolls started tweeting things at it. It had a really bad way of integrating new data into itself, and it became really anti-Semitic and nasty within hours. It was taken down 16 hours after launch, after Tay was tweeting out some really horrible stuff. So this is an example of data poisoning that happened in 2016; the way they deployed it made it possible, and they didn't know that at the time, but now we know you need to sanitize your data sets very carefully. The first attack on a machine learning model was against a spam model back in 2004, using this thing called Bayesian poisoning. If you have a spam message, you write your message in nice, clear, big black bold text, and then you put essentially a Wikipedia article at the bottom of your message in white text. When you read the email you see the spam message and don't notice the other text, but the machine learning model doesn't see the difference, and the benign text overwhelms the malicious text.
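As a toy illustration of why that works, here is a sketch of Bayesian poisoning against a bare-bones naive Bayes filter; the vocabulary and word probabilities are all invented for illustration.

```python
# Toy Bayesian poisoning against a naive Bayes spam filter.
import math

# P(word | spam) and P(word | ham) for a tiny hand-set vocabulary
p_spam = {"viagra": 0.9, "free": 0.8, "the": 0.5, "topology": 0.01}
p_ham  = {"viagra": 0.01, "free": 0.1, "the": 0.5, "topology": 0.3}

def spam_log_odds(words, prior=0.5):
    """Log-odds that the message is spam; > 0 means 'spam'."""
    score = math.log(prior / (1 - prior))
    for w in words:
        if w in p_spam:
            score += math.log(p_spam[w] / p_ham[w])
    return score

msg = ["viagra", "free"]
print(spam_log_odds(msg))                  # strongly positive: flagged as spam

# Pad the message with invisible white-text "Wikipedia" words:
padded = msg + ["topology", "the"] * 20
print(spam_log_odds(padded))               # benign words drag the score negative
```

The human reader only sees the bold spam text; the classifier sums over every token, so enough benign padding flips the verdict.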
If you look up naive Bayes classifiers and Bayesian poisoning, it's a documented thing. A similar attack worked against Cylance's malware model in 2019, so this attack is old and still effective.

The most popular thing you're going to hear about is adversarial examples, which mostly don't work for my industry because we have structured data; there are caveats I can talk about later. So, an adversarial example: you have a sample x and you create a perturbation. The perturbation here is built from the gradient, but the exact way you calculate it doesn't really matter. Basically, with the model fixed, you do gradient ascent on the image space to find a point with the most misclassification. This particular attack is called the fast gradient sign method: the perturbation is epsilon times the sign of the gradient of the loss with respect to the input. There are thousands of other attacks that are basically variations on this gradient ascent; literally several thousand papers were written about adversarial examples within a few years, because they became super popular. They make deep learning image classifiers completely unreliable, and there is no known defense that works well and has stood up to the test of time. There are lots of people who think they've figured it out, but when really pressed, their defenses don't work.

So, a definition of an adversarial example: you've got an x of class y and a machine learning algorithm f with f(x) = y, so your point x is correctly classified as y. An adversarial perturbation is a small perturbation z within an epsilon ball of zero such that f(x + z) ≠ y. This relies on having a known good. It says nothing about plain mistakes: you could be off in the middle of nowhere, where your model just doesn't know about that stuff and makes a mistake. Adversarial examples don't care about that. They care about the case where you classify correctly, move slightly off, and become misclassified. And security against this is impossible, in my opinion, because of dimensionality. x has several million dimensions, and high-dimensional spaces are weird. There's a lot of room in high-dimensional space, which is fairly easy for mathematicians to understand: epsilon balls stop behaving nicely in very high dimensions and become very unintuitive. You know the classic example: put unit spheres at the corners of a hypercube, with a single sphere embedded in the middle so that it barely touches the corner spheres, and increase the dimension. In two dimensions you've got the little sphere in the middle; in three dimensions the middle sphere is still inside the cube. But at ten dimensions, that sphere in the middle breaks out of the cube, and at millions of dimensions it effectively has infinite radius, which is weird. As mathematicians, you folks know that high-dimensional spaces have these unintuitive properties with distances. But a lot of computer scientists who immerse themselves in high-dimensional stuff don't, and then they try things that sort of, kind of work; they're kind of hacks.
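You can run the numbers on that sphere example directly. In the standard textbook setup (my framing, not the talk's slides), unit spheres sit at the corners (±1, ..., ±1) of the cube [-2, 2]^d, and the central sphere that just touches them has radius sqrt(d) - 1, while the cube's faces are at distance 2 from the origin.

```python
# The classic high-dimensional sphere example as a quick computation.
import math

for d in (2, 3, 9, 10, 100, 1_000_000):
    r = math.sqrt(d) - 1                  # radius of the central sphere
    status = "inside the cube" if r <= 2 else "pokes OUT of the cube"
    print(f"d = {d:>9}: central radius {r:10.2f}  ({status})")

# At d = 10 the "inner" sphere already breaks out of the cube, and at a
# million dimensions its radius is about 999: epsilon balls get weird.
```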
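And finding a misclassified point inside that huge ball is exactly what the fast gradient sign method does. Here is a minimal sketch on a toy linear classifier; the model, data, and epsilon are all invented, and for a linear logit the input gradient is just the weight vector, which keeps the sketch honest without autodiff.

```python
# Minimal FGSM sketch on a toy linear "image" classifier.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=784)                 # toy model weights, one per pixel
b = 0.1

def f(x):
    """Binary classifier: class 1 if the logit is positive, else class 0."""
    return int(w @ x + b > 0)

x = rng.normal(size=784)
if f(x) == 0:
    x = -x                               # flip so x starts correctly in class 1

# The gradient of the logit w.r.t. the input is just w for a linear model,
# so the FGSM step is eps * sign(gradient), pushed toward the other class.
eps = 0.25
x_adv = x - eps * np.sign(w)

print(f(x), f(x_adv))                    # 1 0: tiny per-pixel change, new class
```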
But then inevitably, someone finds a spot in that epsilon ball, which is huge: in terms of volume, that epsilon ball is absolutely enormous, and they find a point in it that's misclassified. They find an area where your model just misclassifies, because you're trying to correctly classify a huge space. There are ways to sort of mitigate this, like making your model Lipschitz, but again, that doesn't fix the reliance on a known good. And the reason this is hard to explain is that deep learning is not explainable.

Self-driving cars are the nightmare version of this. You see this car in the middle of a salt lake; that's just salt and tape. Friends of mine have done this on production cars from certain companies and gotten them to do dumb things with just tape on the ground, dumb things that cause accidents with certain car companies, and deaths. So that's just tape on the ground, and the car is misclassifying. And then you have the stop sign with just some tape on it, but to a machine learning algorithm it reads as an 85 mile per hour speed limit sign, not a stop sign. That can cause a self-driving car to accelerate to stupid speeds when it should be stopping.

Anyway, that's the threat landscape we're dealing with, and the terms. Do you have any questions about life as it is for machine learning people? An audience member asks: with this misclassification problem, it's not really that the model isn't confident it's a stop sign; it's that it's really confident it's something else, right?

So, that's exactly it. It's 57% confident that it's a panda, 93% confident that it's a gibbon. But this confidence score is bullshit. The way these models work is you have an input space of thousands of dimensions, one for each pixel, then convolutions, all sorts of stacked layers, a huge number of different architectures that are very complicated and change all the time. But the output space, for ImageNet, which has 1,000 classes, is a one-hot encoding. So panda is, I don't know, say the 87th class. What the 57.7% confidence means is that the output should be a list of zeros with a one in the panda slot, but it's actually a list of near-zeros with a 0.577 in the panda slot. So that confidence score is made up. The model is not actually that confident; that's just the output it's pushing toward. It's not real. And the 93% for gibbon is the same thing in another slot of the one-hot encoding. Of course, there are more complicated versions of this: you can take the gradient at your point and compute a slightly better confidence score using the gradient concepts derived from adversarial examples. But the confidence score on this particular figure is made-up bullshit, sort of.

Now, about arXiv. One of the things the AI Village runs into is that everybody's model is different, so this is my opinion: I have maintained a malware model, and I have built my threat model from our internal data and experience. We go to arXiv.
A lot of academics write about adversarial examples because they're super interesting, and they'll write in the introduction that they're working on this real problem, but they don't really care about it; they care about solving adversarial examples. When adversarial research started and exploded, every single paper mentioned malware models in its introduction, saying we need to solve adversarial examples because malware models exist and this is the threat model. And largely, the next-gen-AV malware model community treats adversarial examples as something maybe five years down the line that we'll have to worry about, maybe. So that's my disclaimer.

So, ML blue team. To protect my model, I need to know a few things, and some of this is easy and doable today. What's our current efficacy over the last week, day, hour? That's a question I can't answer, because labels for malware models are expensive: essentially, unlabeled data is coming in and I need to know how well I'm doing on the last day, week, hour. For a given sample, I need an actual confidence number that means something. The ideal answer for the first question would be the error rate, and for the second, the chance of this prediction being wrong under some probability distribution. But I don't have that. The actual confidence for deep learning? No idea; given adversarial examples, it's arguably zero for a lot of these models. Every time someone has come up with a way of producing a confidence number, someone has built an adversarial example that targets not only the model but also the confidence score. So it's arguably zero. There are no proofs about this; there are no good confidence scores for pretty much any machine learning model, and we desperately need them. How do confidence and efficacy change over time? Because I'm playing cat and mouse: the opponents are trying to bypass me, they keep moving to new parts of the probability distribution, and my model may or may not work on them. If they're moving, they're not targeting me exactly, so my model might generalize perfectly to the new space; but if they move to a new space and I happen to be one of the products they bypassed, I want to know about it. How do we track the model's efficacy? This is dashboards, and partly redundant with the first question. How do we detect threats? Like Microsoft Tay: I have seen poisoning attacks in the wild. They're not poisoning attacks exactly as defined in the machine learning literature, but they operate similarly. How do I detect when I'm getting poisoned? The classic model poisoning attack is that you make a lot of queries that are benign, but they're all in a certain region. Everyone labels them as benign, but there's a single sample in the middle of them that's malicious. The machine learning model draws a nice circle around that region saying, hey, this is benign, and your one payload gets through. How do I detect a reliable bypass? There's a way of bypassing pretty much every next-gen AV company; how do we detect when that's happening? I don't know, but it leaves a signature, and when it's happening at scale, maybe I can detect that signature.
And the last one is: how do we respond to a threat once it's detected? One classic answer is retraining the model, because the new model is probably not going to be as vulnerable to the bypass as the old one, but that may or may not work at scales like Facebook's, and they've got a whole different system that I haven't seen. Quora also has a very interesting, good system for detecting spam, and it's great. So that's the state of things; this is what we need to solve, and as I've said, pretty much all of these are unanswered, open questions.

So the first thing I want to tell the machine learning industry is: stop trying to make a model perfect. What we have is a statistical guarantee: our models are guaranteed up to a statistical validation set. The way we train a model is we pick, in my case, 80% of the data set and train on that; 20% of the data set is held out and I don't do anything with it. Before I go to production, I check how well the model trained on the 80% does on the 20%. If I selected my 20% uniformly at random from my data set, then the two should be from the same distribution, and I can say: on this distribution, on this day, I have a 99.9% accuracy number. If I want to make this adversarially robust, I'm essentially asking for 99.999, and to me there's not much difference in asking for 99.9999% robustness. At some point you've reached your target score and you don't need additional robustness; your model is a machine learning model and it's going to mess up. Someone is always going to find another adversarial example; there are always going to be parts of the space you cannot defend. Stop trying, and start working on the stuff you can actually do.

The second point is that measuring drift solves 80% of the problems. I'm playing cat and mouse; the opponents are constantly trying to move to areas of my data set that I don't have data for and don't classify well. If they've moved, then I'm going to start seeing data from the new area and stop seeing as much data from the old area: my underlying data set has drifted. If I can detect that, if I can measure drift, 80% of my problems actually go away. Poisoning also shows up as drift, because suddenly I see a lot of data from one location; it manipulates the data set statistically. The problem with measuring drift is the current technology. There are multiple ways to do it. You can use Wasserstein distance, optimal transport: basically solve a very big PDE that tries to move mass from one distribution to another, the earth mover's distance, whatever you want to call it. That takes O(n³ log n). With 100 points, call it 100 to the fourth operations; in seconds, milliseconds, or microseconds, 100 to the fourth is not too bad. But I don't have 100 points, I have 300 million points, and even at nanoseconds or picoseconds per operation, 300 million to the fourth is longer than the age of the universe. So Wasserstein is never going to work.
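A back-of-the-envelope version of that argument, assuming a generous 10^9 operations per second:

```python
# Why O(n^3 log n) optimal transport can't scale to a malware data set.
import math

ops_per_sec = 1e9                         # generously fast: one op per nanosecond
for n in (100, 100_000, 300_000_000):
    ops = n**3 * math.log(n)
    years = ops / ops_per_sec / (3600 * 24 * 365)
    print(f"n = {n:>11,}: about {years:.2g} years")

# 100 points finish in milliseconds; 300 million points take ~1.7e10 years,
# longer than the age of the universe.
```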
The other option is hacky deep learning methods, and they seem to sort of kind of work, but they're hacky deep learning methods and they're themselves susceptible to attack. So that's not what I want to do. So here's what I did.

A trivial example of data set drift: I've got a training distribution, one, one, two, one, one, and so on, so 70% ones, 20% twos, 10% threes. I have another distribution, from my test set a month later, with the model trained on the first distribution. This is what I'm seeing over the wire a month later: two, two, one, one. It's changed, and you can see the Kullback-Leibler divergence is large; you can measure the difference using that. This works really well if I have a discrete space, but instead I have this. So that's my data set: the orange points are my training data and the blue points are essentially my test data, what I'm seeing over the wire. That's my data set at the time of training, and I validated it, and all was good. A few days later, that's what it looks like. A few days later, that. A few days later, that. There's a big difference between the first and last one, and I have to be able to measure that difference. And this is not in two dimensions; this is in 3,000 dimensions. So how do I do that?

I use a cover tree. This is the thing I worked on a bit in my postdoc. A cover tree, and I'll talk about how to build one, essentially approximates the data set well. If your data set is enough data sampled from a nice high-dimensional manifold, you can put exact numbers on this: to approximate the probability distribution within an epsilon bound in some measure, you need some number of points, given a data manifold with a given curvature and dimension. So you can write down an exact statement. If you want to read about that, Mauro Maggioni and collaborators have a very nice paper from around 2017 that goes over exactly how to use a cover tree to build an approximation of your data set. I knew this from my postdoc. It's a wonderful result, and I would like many more people to look at it, but it's embedded in a very hard-to-read paper. The other part is that once you have this tree, trees are wonderful for computers because you have structure you can navigate. Ideally you'd want a graph, or a homotopy chain complex or something like that, because that would let you do even more, but trees are nice and easy to encode in a computer.

So let's build a cover tree on these points. I have my data set, which is this nice infinity sign, and I pick a point at random and give it a radius that's a power of two. This is radius two, and it covers my entire data set, so the root of my tree is done; the first layer covers my entire data set. Then I shrink that sphere down by half, so it's no longer radius two but radius one, and now this part isn't covered. So I pick a new point at random, add it, and now everything is covered. Do this again and I get this. If you squint, you can kind of see that this data set is longer than it is wide: at a very high scale it's sort of a one-dimensional thing, at a medium scale it's two-dimensional, and at a small scale it goes back to being one-dimensional.
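A bare-bones sketch of that construction, in the spirit of the figures: repeatedly halve the radius and greedily pick centers until everything is covered. This is my simplification (real cover trees also maintain nesting and separation invariants between levels), and the figure-eight data and all names are invented.

```python
# Greedy level-by-level covering of a figure-eight data set.
import numpy as np

def build_level(points, radius, rng):
    """Greedily pick centers so every point is within `radius` of one."""
    centers = []
    uncovered = points
    while len(uncovered) > 0:
        c = uncovered[rng.integers(len(uncovered))]   # pick a point at random
        centers.append(c)
        dists = np.linalg.norm(uncovered - c, axis=1)
        uncovered = uncovered[dists > radius]         # drop what c covers
    return np.array(centers)

rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 2000)
pts = np.c_[np.sin(t), np.sin(t) * np.cos(t)]         # a figure-eight

radius = 2.0
while radius >= 0.125:
    n_balls = len(build_level(pts, radius, rng))
    print(f"radius {radius:5.3f}: {n_balls:4d} balls")
    radius /= 2

# The ball count grows roughly like radius^(-d), so the growth rate reads
# off the (fractal) dimension of the data at each scale.
```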
And the cover tree reflects that fractal dimension in the number of branches in each part of the tree, and you can see it building this tree structure that lets you navigate things. Another thing you can see is that it focuses in and becomes very dense, with a lot of small balls (sorry about my cat) near the actual data, and very sparse away from it. So you can actually get a density measure from this: this ball here has radius 0.125, and I know how many points are in it, meaning I know the exact density of this particular region. Provably, I know the density. And I know how to get there: I start at this big ball, where I know the density of the large area, and I get more focused densities as I go down. You can do more complicated things, like add the labels back in and look at heterogeneous versus homogeneous regions, figuring out the density where classes are mixed versus where they're uniform, because eventually, going down, your sphere stops seeing elements of the other classes. In terms of security, that's a good or bad sign, and you can measure when it happens.

So this isn't really fancy math. I have this tree structure, and there's all sorts of cool stuff you can do with it. You could probably use this tree structure to accelerate topological data analysis; I have tried, but that's hard to code and I have a full-time job. What you can do with it is build a probability distribution, a density measure. You partition your space in a particular way, and for each part of that space you know its volume and its number of points, so you have an accurate density measure. So, oh, shoot, the arrows are off. What this is: you take the cover tree, and these are the probabilities of taking a particular path. If you're at the root, the probability you go right is low, like 0.3, and the probability you go south is 0.7. You get this tree structure where you know exactly the probability of getting to any particular point, and you can sample from this distribution. The density in a region is the number of points divided by the region's volume, and you can imagine using a more sophisticated measure of density. With a large number of points, you can get a fairly accurate measure of the probability distribution. So this distribution is my prior, and at every node I have a neat, discrete categorical distribution, so I can use the KL divergence calculation I mentioned previously. When I add a single point, it changes the probabilities along the path from the root through all the nodes that point touches, and doesn't change the probabilities in the rest of the tree. And I can keep adding points. So I have a prior distribution, built from a partition of my space given by the training data set, with all the path probabilities given by the training data set; that forms my prior. Then I use incoming data and some Bayesian statistics, and that's my posterior. And you can see that there's a big difference between the two.
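A toy version of that prior-to-posterior drift check, reusing the earlier ones/twos/threes example; the counts and the add-one smoothing are my invention for illustration.

```python
# Categorical prior from training leaf counts, posterior from new traffic,
# KL divergence as the drift signal.
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions with no zero entries."""
    return float(np.sum(p * np.log(p / q)))

# Training-set points landing in each leaf of the (toy) cover tree
train_counts = np.array([70.0, 20.0, 10.0])        # the 1s/2s/3s example
prior = train_counts / train_counts.sum()

# Stream of new points; add-one smoothing so no leaf has probability zero
new_counts = np.ones(3)
for leaf in [1, 1, 0, 0, 1, 2, 1, 1, 1, 1]:        # drifted traffic
    new_counts[leaf] += 1
    posterior = new_counts / new_counts.sum()
    print(f"KL(posterior || prior) = {kl(posterior, prior):.3f}")

# In the real tree each point only updates the path it touches, which is
# what makes the check O(log n) per point instead of n^3 log n overall.
```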
And then you calculate the KL divergence between them; that basically outlines how you do it. You can validate it with a Gaussian sandwich: take two 20-dimensional spherical Gaussians, overlay them at the origin, then take one of them and slowly move it to the side, and measure the KL divergence as you go, except with a hundred thousand points in the training set and 5,000 points in the test set. It measures the difference quite dramatically, even though there are only 5,000 points in the posterior. And you can update this online: adding a single point takes log n, which is a much smaller growth curve than n to the fourth. Wasserstein requires roughly equal-sized test and training sets; this doesn't, it's online, it's really great, and it's explainable to people.

And you can actually run this. This is on our industrial data set: 150 million points went into the training set and I used 60 million points in the test set, which is ridiculous compared to Wasserstein; you couldn't scale Wasserstein to this. I used my model and a training data set cut off on January 1st, 2019, because that's old enough that I can see how the model performs over time. In this plot, these are my false negative rate, overall error rate, true negative rate, and false positive rate, the efficacy scores we care about, and this red line is the overall KL divergence of the distribution shift we're measuring. You can see that I can measure that on my 300 million points and do a breakdown of efficacy versus drift, and it works out. So this strategy of using a cover tree works on a more-than-3,000-dimension, more-than-300-million-point data set, which is awesome.

Now, to describe an attack we might experience: the test set attack is the academic version, trying shifts until one works and then repeating it until it stops working. Repeating a single point doesn't look like a normal query distribution, and you take the KL divergence of a sequence of points. In this plot, the red lines are attacker sequences, the blue lines are normal user sequences, and the green line is the uniform baseline from the training set; the band is a single standard deviation over samples from the training set. You can see all of the attacker distributions explode with high KL divergences, while all of the real users have a very small KL divergence. This is sort of unrealistic, though, because no one is going to query from the test set in real life. My users do not care about the test set; they care about the stuff in their environment. They query my model for things in their environment, which is a completely different set of objects from someone else's environment. If I change the way I sample from the test set, I get this other picture. You can see that the red lines bouncing around down here never found an adversarial example, an example that was misclassified; all these other ones found an example that was misclassified, and you can see the spike happens very quickly when they do. Yeah. So that's it. So let's go ahead and give Sven a hand. Thank you very much.
Feel free to unmute and clap, or you can use the clapping emoji within the app. So if you want to ask me more about machine learning: I gave a very high-level overview, and there's a lot more, like the fractal dimensions of real-life data sets, and figuring out that security data sets do not form a manifold. They come from this weird, sparse, spectral, fractal thing that I don't think has been mathematically classified. There's more, but there's a whole bunch of security knowledge behind some of the decisions I made, so I decided to focus on that and make this more of an industry talk than a math talk. Do you have any questions for me regarding ML security, or leaving academia?

So I'm curious if you have a couple of words on, yeah, for folks that are coming up on the end of PhDs or postdocs and thinking about industry, what's a useful way to go about that process?

The first thing I would say is: go to the career counselors that are there for the undergrads. Sorry, I just saw there was a question in the chat about quantum computing. There shouldn't be any impact from quantum computing on my data sets. I don't think quantum computing will scale to this stuff in any reasonable amount of time, and I have no idea how it's going to end up working, so I can't really answer that question, but I'm pretty sure there's no impact. Back to career advice: go to the school counselors for the undergrads to do your resume. The biggest problem I've seen from people leaving math, or any sort of academia, even a CS PhD, is that they have a CV that is six pages long. Recruiters aren't going to read that; they aren't even going to open it. They're going to see it's six pages long and conclude this person doesn't know what they're talking about. When you're applying, you need your resume, which is a different thing from your CV, at one or two pages. And it needs to hit keywords, because your biggest hurdle is not your interviews. If you're half decent, you've taken some sort of Coursera class or done some projects in tech so you know what Python is, and you can do some of the Project Euler challenge problems in a reasonable amount of time, then your biggest problem is not getting past the tech interview; it's getting past HR. Because you don't have any tech skills on paper. You need to figure out how to get those tech skills on paper, into your resume, so HR decides you're not going to be a waste of their time. Then you can probably get to the second part and actually talk to the engineers, who may speak a little more of your language. One thing that's helpful early on, if you're a first- or second-year, is to try to do an internship. Maybe try to write a paper with machine learning in it, or talk to a CS PhD student who's doing machine learning and see if you can do something together, so you have something on paper that says: I am a machine learning person, I know how to code. Or make a portfolio. I made a Rubik's cube: I modified a JavaScript Rubik's cube to make it flat, a one-by-three-by-three Rubik's disk thingy.
I modified that JavaScript code, which took me a few days, because I needed it to teach a class on group theory and the department wouldn't buy the little flat Rubik's cubes from Amazon for me. So I did it with code. And that little project was about half of what got me my job: how did you do that? I needed it, so I learned JavaScript over three days and hacked it. It wasn't that hard, but yeah.

So there's some benefit to having things on paper, like actual courses, a written paper, an internship; that's geared more toward the first stage of the interview process. And then you're saying that having projects, things to talk about, is more for the in-person kind of interview?

The way HR works is they're given a list of keywords the candidate needs to satisfy, which are probably in the job description. If your resume doesn't have that list of keywords, you're never going to get past HR. The other way past HR is to get recommended by an engineer already at the company; then HR just shoos you through and you can start talking to the actual people. This doesn't guarantee a job, but it handles the hardest part, getting a foot in the door for interviews. We had three open positions for jobs this year, and one of them went to an academic I recommended through the system. We had over 500 applicants and interviewed eight. He applied, not through the normal channels, and HR wouldn't put him through to us, because one of the keywords was: have you deployed a model in production? And he said no, because he had only ever done academic work, never anything in production. HR said no, but I opened up his GitHub repo, saw that he had a paper and a project on malware models for Android, looked at the code, and thought, this is better than half the stuff we've got. So I wrote a recommendation with a link to it saying: this is better than our code, screw the production question. And he came through. He had handed in a giant CV, by the way. But sorry, he didn't sail through the rest of the interviews: when I interviewed him, I knew he was going to have difficulty with the business side of things. So, when there's a product decision, it goes like this: I've got some really cool math, I really want to try a hierarchical hidden Markov model to detect ransomware, and I really think it's going to work out, but it's going to take me three months to see if it works, and I don't know if it will. As far as the business is concerned: are we going to let Sven, who's expensive, dig around with some math for four months and cost us a lot of money, for a result that from their perspective has maybe a 10% chance of working, plus another several million dollars to get it to production if it does? Or are we going to have him do a short thing? I know they're making that distinction, there's that calculation, and there are ways for me to word things that make it easier for them to sign off.
And there are ways for me to do the project management and the project description that make it easier. If you're straight out of grad school, you're not expected to be able to make these business calls, but if you can speak that language intelligently, it goes a long way. Say things like: I wouldn't do the really complicated thing, I'd do the simple thing, because it's probably going to work and it gets us 80% of the way there for five days of work, instead of 99% of the way there for six years of work. Making that distinction, mentioning that your consideration is "screw deep learning, I think a little Markov model can do a very nice job here," actually gets you through the second part of your interviews, because you're going to talk to the manager, and the manager wants to hear: let's do the simple thing that might work and takes three days, rather than the complicated thing right away. Even though the complicated thing is the fun thing; you can introduce that later.

Another thing: if you interview at Google, it's going to include something like being asked to do the A star algorithm on the board. You're going to be given a dry-erase marker or a laptop, and a maze you have to navigate using A star. It's the standard kind of question; you know they're going to ask you to do A star or something like that, which is a graph traversal algorithm, so you just need to know that code (there's a sketch below), and you need to know enough data structures to do similar things. But once you're hired, I care way more about your unit testing, your development cycle, your development strategies, all that stuff. You're all mathematicians: you can learn the A star algorithm in a week, memorize the data structures you need in another week, learn to code them up in a couple of months, and get really good at Euler problems, which use these concepts, in four months. That will get you through some of the technical interviews. Project Euler, by the way, is a website with really nice programming challenges. And then the final hurdle is the business deciding whether to hire you. So you've gotten through HR, you've gotten through the technical interviews because you can do Euler problems, and the final question is: how good are you at development? Because if you're a developer who has never written a test and doesn't believe in the concept of testing, unit tests, or actual documentation, I will highly recommend never hiring you, because I don't want to work with you; you're going to produce shitty code. I have worked with people who don't write tests, and they're a pain in the ass, because once they leave, or six months later, the code stops working and there's no way to diagnose it. That's the final thing. So maybe take a class on development practices, maybe a class on data structures, maybe a class on databases on the side, if your advisor will let you; that gets stuff on paper. If you can hack together some projects, that's a good idea. But if you're later in the game: Coursera, lots of Coursera.
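For reference, here is a compact version of that canonical whiteboard question: A star on a grid with a Manhattan-distance heuristic. This is a study sketch of my own, not anything from the talk.

```python
# A* shortest path on a 0/1 grid (1 = wall), 4-connected moves.
import heapq

def astar(grid, start, goal):
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan heuristic
    frontier = [(h(start), 0, start, [start])]               # (f, g, pos, path)
    seen = set()
    while frontier:
        f, g, pos, path = heapq.heappop(frontier)
        if pos == goal:
            return path
        if pos in seen:
            continue
        seen.add(pos)
        r, c = pos
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                heapq.heappush(frontier,
                               (g + 1 + h((nr, nc)), g + 1, (nr, nc),
                                path + [(nr, nc)]))
    return None  # no path exists

maze = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
print(astar(maze, (0, 0), (2, 0)))   # routes around the wall row
```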
I think we probably have to wrap up in a minute or two here, but does anybody have any other questions? If not, let's thank Sven again for his awesome talk. Thanks so much. Yeah, it was a delight to have you. Let me go ahead and stop the recording.