And without further ado, here's Alex with his talk.

Thank you. Thank you for coming on a Sunday morning. We are gathered here today to talk about AI. My talk is entitled "AI DevOps: Behind the Scenes of a Global Antivirus's Machine Learning Infrastructure." And just a meta note real quick: in making this talk, I interviewed about four people, because I'm going to be talking about work that was performed by many people on my team. This work spans about two years, and I have about 20 minutes to tell you about it, so this is going to be a very distilled highlight reel of our story.

Who am I? I do a lot of things on the team: a lot of web dev, a lot of AWS backend stuff, some ML research, and I've done some Android stuff. That's a picture of my face and my contact information. This is a picture of my cats, which you have three seconds to enjoy.

This is the data science team at Sophos. Maddie Shihapa is actually in the crowd today, and Hilary Sanders, Constantine Berlin, and the two Matts at the bottom contributed a lot more than I did to the work that I'm taking credit for today. I also have this slide up here because we're hiring, so if you have any expertise at the crossroads of cybersecurity and data science, we'd love to hear from you.

Okay, so in the beginning, there was signature-based antivirus. And by "beginning," I mean 2017. That's about the time that the data science team was added to Sophos, and we were the first introduction of ML to their antivirus product. There were a lot of initial misconceptions, the big one being that machine learning is all you need. This is something we had to convince people was wrong, even though we are the data scientists.

To give you a way of thinking about this, imagine the space of all files, where the green ones are benignware and the red ones are malware, and similar files are closer together.
The malware is sort of clustered in the middle. With signature-based antivirus, your analysts are writing rules to draw circles around specific malware samples. The problem that has besieged the whole industry over the last 10 years or so is that malware authors will just mutate their malware. They create variations, and it overwhelms analysts. The process of trying to update your signatures to capture these new variations without capturing any false positives is very tedious and laborious, because it's all done by humans.

So ML's value is to fudge the lines a little bit: to generalize from those original signatures and learn from them. It helps counteract this main tool that malware authors are using, so it's actually very well suited to this challenge. But it's important to keep in mind that without the signatures, there would be no ground truth and nothing for machine learning to learn. So ML is not really all you need; you still need humans in the loop.

Some other things we've added to help ML are whitelists for critical system files that you never want to false-positive on, like explorer.exe; blacklists for things that you just know are malware; and what I'm calling fuzzy whitelists. We have a reputation system where, if we've seen a file on a lot of different computers a lot of times, we make the assumption that it's not malicious, so we give it a probabilistic "it's probably good."

When people at the company asked, "Well, why can't machine learning do all this?" we were like, "Well, math and stuff." Basically, if you imagine a neural network, the number of layers and the number of neurons it has give it a certain capacity to learn patterns.
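The layered checks just described — whitelist, blacklist, fuzzy whitelist, then the model — can be sketched as a simple decision chain. This is a minimal illustration, not Sophos's actual implementation; every name and threshold here is made up for the example.

```python
# Illustrative sketch of layering cheap deterministic checks in front of
# an ML classifier. Lists, threshold, and filenames are all hypothetical.

WHITELIST = {"explorer.exe"}        # critical files we never want to convict
BLACKLIST = {"known_bad.exe"}       # files we already know are malware
PREVALENCE_THRESHOLD = 100_000      # seen on this many machines -> probably benign

def classify(filename, prevalence, model_score):
    """Return 'benign' or 'malicious'.

    model_score is the ML model's estimated probability that the file
    is malicious; it is only consulted when the simple checks pass.
    """
    if filename in WHITELIST:
        return "benign"
    if filename in BLACKLIST:
        return "malicious"
    # Fuzzy whitelist: widely seen files get a probabilistic pass.
    if prevalence >= PREVALENCE_THRESHOLD:
        return "benign"
    return "malicious" if model_score >= 0.5 else "benign"
```

The point of the ordering is that the model's capacity is never spent on files the lists already decide — which is the argument that follows.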
And there's no reason to waste that capacity trying to learn patterns of files that you never want to convict, or files that you always want to convict, when you can just use these dead-simple whitelists and blacklists.

So the tone changed from "machine learning is all you need" to "Hey, you know that project you've got over there? You know how you could make it even better? Put some machine learning on it. And that project? That would be great with some machine learning." If you don't know, this is from Portlandia, from an episode where they're making fun of how everyone in Portland thinks things look better with birds drawn on them. The best part of the episode is the ending, when a bird actually flies into her shop and she's totally disgusted by it, which is sort of what happened to us. Machine learning is now alive and well at Sophos, but the reality of trying to deploy it in a production environment was a lot uglier than anyone envisioned. Which brings us to scalability.

The main issue with scalability is that when you're working with a product that's deployed across millions of endpoints, each scanning tens of thousands of files every day, that's a very different dataset than the one you trained and tested on, which may be tens of millions of samples in a static dataset that you're optimizing over and over.

Just as an example to illustrate this: was anybody here born on August 12th? Okay, so it's nobody's birthday today in this room. If you were very naive, you might take that statistic and say, "Well, nobody is born on August 12th," which is presumably wrong; roughly 1/365th of the population is born on August 12th. That demonstrates a small-sample-size issue, and the same thing occurs with malware analysis.
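The birthday arithmetic above is quick to check: with a small room, the expected count for any single day rounds to zero even though the true rate is nonzero. A tiny sketch (room size of 100 is an assumption for illustration):

```python
# Small-sample-size illustration: expected number of people in a room
# who were born on one specific day of the year.

def expected_birthdays(room_size, days_in_year=365):
    """Expected count of people born on one given day, assuming
    birthdays are uniformly distributed."""
    return room_size / days_in_year

# A ~100-person room: expect about 0.27 people born on August 12th,
# so observing zero tells you almost nothing about the true rate.
print(round(expected_birthdays(100), 2))   # -> 0.27
```

Rare malware variants behave the same way: at small sample sizes, their expected count in your dataset is effectively zero, so they simply never show up.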
If you have a small malware sample dataset, those tiny pieces of the pie just disappear; it may be a small variation of a family, or some critical system files. So what we've done is set up this testing gauntlet: a model gets, say, three million files to train and test on, and if it reaches a certain threshold, it gets five million, and if it does well there, it gets 10 million, and you keep ramping it up. That way you don't waste training time on large amounts of data. You start with three million, which is a test you can run relatively quickly, and if that goes well, you keep ramping it up.

In general, scalability just amplifies things. We also have to strive for ridiculously low false positive rates, because we're running our model so often. And the last thing that gets amplified is cost. If you're using something like AWS, which you should be, it's great because whenever you ask for resources, you get resources. The problem with AWS is that whenever you ask for resources, you get resources, and you don't always know how much you're really asking for. In one situation, we had a bunch of data on S3 that we were trying to move into Glacier to save money, and we neglected to notice that there was a small transaction fee on every move from S3 to Glacier, and we ended up blowing through our monthly budget in a couple of days. So that was our bad.

Another thing we did to address scalability was to switch from DynamoDB to Redshift. DynamoDB, if you don't know it, is great for row-based data; if you want to pull out individual rows, that works. But it's not great for aggregate statistics: you can't really run SQL queries over large datasets, because you have to pull down the entire table every time you want to do that. Redshift is column-based, so it supports that much more efficiently.

One other thing I'll touch on is the actual going into production.
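Before moving on: the ramp-up testing process just described (three million samples, then five, then ten, gated on a threshold at each stage) can be sketched in a few lines. The stage sizes and accuracy thresholds below are illustrative assumptions, and `evaluate` stands in for a real train-and-test run.

```python
# Sketch of a staged "gauntlet": a candidate model earns access to
# progressively larger datasets only by clearing a quality bar at each
# stage, so no training time is wasted on models that fail early.

STAGES = [
    (3_000_000, 0.95),    # (sample count, required accuracy) - illustrative
    (5_000_000, 0.97),
    (10_000_000, 0.99),
]

def run_gauntlet(evaluate):
    """evaluate(n_samples) -> accuracy from a train/test run on n samples.

    Returns the list of stage sizes the model passed, stopping at the
    first failure.
    """
    passed = []
    for n_samples, threshold in STAGES:
        if evaluate(n_samples) < threshold:
            break
        passed.append(n_samples)
    return passed
```

For example, a model that plateaus at 96% accuracy clears the first stage but never reaches the five-million run, so the expensive large-scale training is skipped entirely.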
One thing we constantly have to deal with is what we're calling concept drift, also referred to as model decay. Essentially, you put a model in production and, even over the course of a month, its effectiveness starts to drop perceptibly. This is just because new malware is being released, and new benignware is also being released. To compensate for this, we have to train with a sort of handicap: we train on older data and test on newer data, to predict how our model will decay.

There are also language discrepancies. All of our research is done in Python, and we've literally had to rewrite some parts of Keras in C++ to get it to run on Windows machines.

Releasing updates is tricky because you don't really know what you're releasing. You know certain ways your model has changed, maybe the architecture has changed, but you don't know how its behavior is going to change in production. It's not like releasing a code update, where you can be pretty certain about what's going to happen. This is a black box, and that's a little scary.

That ties into the cultural side, because now you have data scientists in the production bloodstream, which is sort of bad because we don't really know what we're doing there. There's been a lot of knowledge we've had to gather along the way, like the things I've mentioned about scalability, and a lot of learning we've had to do as a data science team; we've all had to take off our data science hats and put on engineering hats.

And then one anomaly that came up: someone wrote a Hello World program in C++, compiled it with certain settings, and our model said it was malicious. Management came to us and asked, "Well, why is that happening?" And we were like, "Well, math and stuff." But they wanted a fix, right? That's not good.
And in a production system, if there's something impacting customers, you do a hot fix: you surgically fix the code that's causing the problem without changing anything else. You can't really do that in machine learning.

Imagine a model as a pinball machine: the model is all these pegs, the samples we're running through it are the balls, and the classifications it produces are the bins at the bottom. You might run a bunch of samples through the model and it gets most of them right, but there's that one annoying little benignware ball that we've false-positived on. What are we supposed to do about that? The positions of all these pegs have been optimized through an automated process over hundreds of hours. We could pick some peg and say, "Well, I wonder what happens if we twist it and run it all again." And okay, yay, we got it. But oh wait, now we have a false negative. There's really no way that our puny little human brains are going to look at all these pegs and figure out the optimal solution that works for all samples. So if you're afraid of robot overlords, maybe you have reason to be.

But the actual takeaway, if you go back to the original thing I was talking about, is that machine learning by itself doesn't always work. It makes mistakes, and when it's fudging the lines, sometimes it fudges them in the wrong direction and actually loses something that a signature would have caught. So the robot overlords still kind of need us.

Cool. Okay, I have six minutes to demo something and answer questions. AI Total is an internal tool that we use for testing out our models. This is everything that it's built with.
Some of the interesting stuff: we're starting to use SageMaker, which, if you're not aware of it, lets you put your model in a Docker container and it handles everything else, which is pretty great.

Okay, so I wasn't able to connect to Wi-Fi, which I'm not blaming anybody for, but maybe it's a DEF CON-related issue. So I'm not able to give you a live demo, but this is AI Total. We've done a lot of work on things other than PEs, so, I don't know if you can read the top, but we have HTML, URL, PE, Office doc, and Android models. The way this works is that you can live-query the models we have. This is our URL model: you can type in a URL here and get instant feedback from the model. This lets us get a sense of how things are actually going to behave before we push them out into production. For the other models, for PE and Office docs, you can upload a file, but for URL, because it takes text, you can just type it in.

So yeah, that's what I have for you guys. Thank you, and any questions?