Well, welcome back to machine learning. One of the most exciting things this week, almost certainly the most exciting thing this week, is that fastai is now on pip. So you can pip install fastai. Thank you to Prince and to Karem for making that happen, two USF students who had never published a pip package before, and this is one of the harder ones to publish because it's got a lot of dependencies. It's probably still easiest just to do the conda env update thing, but there are a couple of places where it would be handy instead to pip install fastai: obviously, if you're working outside of the repo notebooks, this gives you access to fastai everywhere. Also, I believe they submitted a pull request to Kaggle to try and get it added to the Kaggle kernels, so hopefully you'll be able to use it on Kaggle kernels soon. And you can use it at your work or wherever else. So that's exciting. I'm not going to say it's officially released yet; it's still very early, obviously, and you're still helping to add documentation and all that kind of stuff, but it's great that it's now there. A couple of cool kernels from USF students this week. I thought I'd highlight two that were both from the text normalization competition, which was about taking text written out as standard English text (they also had one for Russian) and trying to identify things like "first", "second", "third" and say that's an ordinal number, or this is a phone number, or whatever. I did a quick bit of searching and saw that there had been some attempts in academia to use deep learning for this, but they hadn't managed to make much progress. And I noticed that Alvira's kernel here, which gets 0.992 on the leaderboard, which I think is around top 20, is entirely heuristic.
And it's a great example of feature engineering; in this case, the whole thing is basically entirely feature engineering. It's looking through the text and using lots of regular expressions to figure out, for each token, what it is. I think she's done a great job of laying it all out clearly: what all the different pieces are and how they fit together. She mentioned that she's hoping to turn this into a library, which I think would be great, so you could use it to grab a piece of text and pull out all the pieces in it. It's the kind of thing the natural language processing community hopes to be able to do without lots of handwritten code like this, but for now, well, it will be interesting to see what the winners turn out to have done. I haven't seen machine learning being used to do this particularly well; perhaps the best approaches are the ones that combine this kind of feature engineering with some machine learning. But I think this is a great example of effective feature engineering. And here's another USF student who did much the same thing, got a similar kind of score, but used her own different set of rules; again, that gets a good leaderboard position. So I thought it was interesting to see examples of some of our students entering a competition and getting top-20-ish results with basically just handwritten heuristics. And this is where computer vision still was six years ago: probably all the best approaches were a whole lot of carefully handwritten heuristics, often combined with some simple machine learning. Over time, the field is definitely trying to move towards automating much more of this.
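To make that regex-driven token classification concrete, here is a tiny hypothetical sketch; the rules, labels, and helper name are all made up for illustration and are nowhere near the scale of the actual kernels:

```python
import re

# A toy version of heuristic text normalization: try each handwritten
# rule in order and return the first class label that matches.
RULES = [
    (re.compile(r"^\d{3}-\d{3}-\d{4}$"), "TELEPHONE"),
    (re.compile(r"^\d{4}$"), "YEAR"),
    (re.compile(r"^\d+(st|nd|rd|th)$"), "ORDINAL"),
    (re.compile(r"^\d+$"), "CARDINAL"),
]

def classify_token(tok):
    for pattern, label in RULES:
        if pattern.match(tok):
            return label
    return "PLAIN"  # fall through: just an ordinary word
```

A real solution would have many dozens of such rules, carefully ordered so the more specific ones fire first.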
And very interestingly, in the safe driver prediction competition, which just finished, one of the Netflix prize winners won, and he invented a new approach to dealing with structured data which basically doesn't require any feature engineering at all. He came first using nothing but five deep learning models and one gradient boosting machine. His basic approach was very similar to what we've been learning in this class so far, and what we'll also be learning tomorrow: fully connected neural networks and one-hot encoding, and specifically embeddings, which we'll learn about. But he had a very clever technique. There was a lot of data in this competition which was unlabeled; in other words, where they didn't know whether that driver would file a claim or not. When you've got some labeled and some unlabeled data, we call that semi-supervised learning, and in real life, most learning is semi-supervised: normally you have some things that are labeled and some that are unlabeled. So this is the most practically useful kind of learning. And structured data is the most common kind of data that companies deal with day to day, so the fact that this was a semi-supervised structured-data competition made it incredibly practically useful. His technique for winning was to do data augmentation, which those of you doing the deep learning course have learned about: if you had pictures, you might flip them horizontally or rotate them a bit. Data augmentation means creating new data examples which are slightly different versions of ones you already have. The way he did it was, for each row in the data, he would at random replace 15% of the variables with values from a different row.
So each row now represents a mix: 85% of the original row, but 15% randomly selected from a different row. This was a way of randomly changing the data a little bit. Then he used something called an autoencoder, which we probably won't study until part two of the deep learning course, but the basic idea of an autoencoder is that your dependent variable is the same as your independent variable. In other words, you try to predict your input, which obviously is trivial if you're allowed to use, say, the identity transform, which trivially predicts the input. The trick with an autoencoder is to have fewer activations in at least one of your layers than your input. So if your input was a 100-dimensional vector, and you put it through a 100 by 10 matrix to create 10 activations, and then had to recreate the original 100-long vector from that, you've effectively had to compress it. And it turns out that that kind of neural network is forced to find correlations and features and interesting relationships in the data, even when it's not labeled. So he used that rather than doing any hand engineering; he just used an autoencoder. So these are some interesting directions that, if you keep going with your machine learning studies, particularly if you do part two of the deep learning course next year, you'll learn about, and you can see how feature engineering is going away. And this was just an hour ago, so this is very recent news indeed, but it's one of the most important breakthroughs I've seen in a long time. Okay. So we were working through a simple logistic regression trained with SGD for MNIST, and here's a summary of where we got to. We had nearly built a model module and a training loop from scratch, and we were going to try and finish that.
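Here is a minimal sketch of that swap-style augmentation in plain Python; the function name and data are made up, and the winner's actual implementation may well differ in detail:

```python
import random

def swap_noise(rows, p=0.15):
    """For each cell, with probability p, replace its value with the value
    from the same column of a randomly chosen other row."""
    n = len(rows)
    out = []
    for row in rows:
        new_row = list(row)
        for j in range(len(row)):
            if random.random() < p:
                donor = rows[random.randrange(n)]
                new_row[j] = donor[j]   # same column, different row
        out.append(new_row)
    return out
```

Because replacements always come from the same column, every corrupted value is still a plausible value for that variable, which is what makes this a sensible augmentation for structured data.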
And after we finish that, I'm going to go through this entire notebook backwards. So having gone top to bottom, we're then going to go back through bottom to top. So this was that little handwritten nn.Module class we created. We defined our loss, we defined our learning rate, and we defined our optimizer; that's the thing we're going to try and write by hand in a moment. So that stuff, that and that, we're getting from PyTorch, but this we've written ourselves and this we've written ourselves. The basic idea was: we're going to go through some number of epochs, so let's go through one epoch. And we're going to keep track of, for each mini batch, what the loss was, so that we can report it at the end. We're going to turn our training data loader into an iterator so that we can loop through every mini batch. So now we can say "for t in range of the length of the data loader", and then call next to grab the next independent and dependent variables from that iterator. Then, remember, we can pass the x tensor into our model by calling the model as if it was a function. But first we have to turn it into a variable. Last week we were typing Variable(blah).cuda() to turn it into a variable; a shorthand for that is just the capital V. So capital T for a tensor, capital V for a variable; that's just a shortcut in fastai. So that returns our predictions. The next thing we need is to calculate our loss, because we can't calculate the derivatives of the loss if we haven't calculated the loss. The loss takes the predictions and the actuals; the actuals, again, are the y tensor, and again we have to turn that into a variable. Now, can anybody remind me what a variable is and why we would want to use a variable here? "I think once you turn it into a variable, it tracks it, so then you can do a backward on that."
"And once you turn it into a variable, it can track its process: as the functions get layered on each other, it can track them, and then when you do backward on it, it backpropagates and computes the gradients." Yeah, right. So a variable keeps track of all of the steps used to compute it. And there's actually a fantastic tutorial on the PyTorch website: in the tutorial section there's a tutorial about autograd. Autograd is the name of the automatic differentiation package that comes with PyTorch, and it's an implementation of automatic differentiation. And the Variable class is really the key class here, because that's the thing that turns a tensor into something where we can keep track of its gradients. So basically they show how to create a variable, do an operation on a variable, and then you can go back and look at grad_fn, which is the function it's keeping track of in order to calculate the gradient. As we do more and more operations to this variable, and to the variables calculated from it, it keeps track of them all, so later on we can call .backward and then print .grad and find out the gradient. And notice we never defined the gradient; we just defined the calculation as being x plus 2, squared, times 3, whatever, and it can calculate the gradient. Okay, so that's why we need to turn that into a variable. So L is now a variable containing the loss: it contains a single number for this mini batch, which is the loss for this mini batch. But it's not just a number; it's a number as a variable, a number that knows how it was calculated. We're going to append that loss to our array, just so we can get the average of it later. And now we're going to calculate the gradient: L.backward is the thing that says "calculate the gradients".
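To see what "keeps track of all of the steps" means, here is a toy scalar version of the idea; this is a sketch of the concept, not PyTorch's actual implementation, which is vastly more general:

```python
# Each Var remembers its parents and the local derivative with respect to
# each, so .backward() can apply the chain rule back through the graph.
class Var:
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        # d(a + b)/da = 1 and d(a + b)/db = 1
        return Var(self.value + other.value,
                   parents=[(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        # d(a * b)/da = b and d(a * b)/db = a
        return Var(self.value * other.value,
                   parents=[(self, other.value), (other, self.value)])

    def backward(self, grad=1.0):
        self.grad += grad                      # gradients accumulate
        for parent, local_grad in self.parents:
            parent.backward(grad * local_grad)  # chain rule

x = Var(2.0)
y = (x + 3) * (x + 3)   # y = (x + 3)^2, so dy/dx = 2 * (x + 3) = 10
y.backward()
```

After `y.backward()`, `x.grad` holds 10.0 even though we never wrote down the derivative ourselves; that is exactly the service autograd provides.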
So remember, when we call the network, it's actually calling our forward function; that's going through it forward. And then backward uses the chain rule to calculate the gradients backwards. Then this is the thing we're about to write: update the weights based on the gradients and the learning rate. zero_grad we'll explain when we write this out by hand. And then at the end we can turn our validation data loader into an iterator, go through its length, grabbing each x and y out of it, and ask for the score, which we defined up here to be: which thing did you predict, which thing was actual, and check whether they're equal. And the mean of that is going to be our accuracy. "Could you pass that over to Chenxi? What's the advantage you found of converting it into an iterator rather than using a normal Python loop?" We are using a normal Python loop; this is a normal Python loop. So the question really is: compared to what? The alternative, perhaps, would be something like a list with an indexer. The problem there is that we want a few things. One key one is that each time we grab a new mini batch, we want it to be random, a differently shuffled thing. And you can iterate from this forever; you can loop through it as many times as you like. So this idea is called different things in different languages, but in a lot of languages it's called stream processing. It's the basic idea that rather than saying "I want the third thing" or "the ninth thing", you just say "I want the next thing". It's great for network programming: grab the next thing from the network. It's great for UI programming: grab the next event, where somebody clicked a button.
It also turns out to be great for this kind of numeric programming: I just want the next batch of data. It means the data can be arbitrarily long, because we're just grabbing one piece at a time. And also, I guess the short answer is that it's how PyTorch works: PyTorch's data loaders are designed to be called in this way. And then Python has this concept of a generator (I wonder if this search is going to give us a snake generator or a computer generator... okay), which is a way you can create a function that, as it says, behaves like an iterator. Python has recognized that this stream-processing approach to programming is super handy and helpful, and supports it everywhere. So basically anywhere you use a for...in loop, anywhere you use a list comprehension, those things can always be generators or iterators. So by programming this way, we just get a lot of flexibility, I guess. Does that sound about right, Terrence? You're the programming language expert. Did you want to grab the box so we can hear? Terrence actually does programming languages for a living, so we should ask him. "Yeah, I mean, the short answer is what you said. You might say something about space, but in this case all that data has to be in memory anyway, because we've got..." No, it doesn't have to be in memory. In fact, most of the time with PyTorch, the mini batch will be read from separate images spread over your disk, on demand; most of the time it's not in memory. And in general, you want to keep as little in memory as possible at a time. "And the idea of stream processing is also great because you can do compositions; you can pipe the data to a different machine, you can..." Yeah, the composition is great.
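A small example of that generator idea, with a made-up `batch_stream` helper: shuffled mini batches are produced one at a time, on demand, and you can keep pulling from it for as long as you like:

```python
import random

def batch_stream(data, bs):
    """Yield shuffled mini batches forever, one batch per next() call."""
    while True:                       # you can iterate from this forever
        idxs = list(range(len(data)))
        random.shuffle(idxs)          # a fresh shuffle each pass
        for i in range(0, len(idxs), bs):
            yield [data[j] for j in idxs[i:i + bs]]

it = iter(batch_stream(list(range(10)), bs=4))
batch = next(it)                      # just grab "the next thing"
```

Nothing is computed until you ask for it, which is why the underlying data can be arbitrarily large.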
You can grab the next thing from here and then send it off to the next stream, which can then grab it and do something else; which you all recognize, of course, from command-line pipes and redirection. Yes. Okay, thanks Terrence. It's a benefit of working with people who actually know what they're talking about. All right, so let's now take that and get rid of the optimizer. So the only thing we're left with is the negative log likelihood loss function, which we could also replace; actually, we have an implementation of that from scratch that Yannet wrote in the notebooks. It's only one line of code, as we learned earlier; you can do it with a single if statement. I don't know why I was too lazy to include it here. So what we're going to do is again grab this module that we've written ourselves, the logistic regression module. We're going to do one epoch again, loop through each thing in our iterator again, grab our independent and dependent variables for the mini batch again, pass them into our network again, and calculate the loss. So this is all the same as before, but now we're going to get rid of this optimizer.step and do it by hand. The basic trick is, as I mentioned, we're not going to do the calculus by hand; we'll call L.backward to calculate the gradients automatically, and that's going to fill in our weight matrix. So do you remember when we created our... let's go back and look at the code: here's that module we built. The weight matrix for the linear layer we called L1W, and the bias we called L1B; those were the attributes we created. I've just put them into things called w and b, basically to save some typing. So w is our weights, b is our biases. And the weights, remember, are a variable, and to get the tensor out of the variable we have to use .data.
So we want to update the actual tensor that's in this variable. So we say weights.data minus equals: we want to go in the opposite direction to the gradient, because the gradient tells us which way is up and we want to go down. Minus whatever is currently in the gradients, times the learning rate. So that is the formula for gradient descent, and as you can see, it's as easy a thing as you could possibly imagine: literally, update the weights to be whatever they are now, minus the gradients times the learning rate. And do the same thing for the bias. Does anybody have any questions about that step, in terms of why we do it or how? Did you have a question? Do you want to grab that? "Not that step, but when we do the next(dl) bit: when it's the end of the loop, how do you grab the next element?" So this is going through each index in range of the length: 0, 1, 2, 3. At the end of this loop, it's going to print out the mean of the validation set, then go back to the start of the epoch, at which point it's going to recreate a new iterator. So basically, behind the scenes in Python, when you call iter on this, it tells it to reset its state and create a new iterator. And if you're interested in how that works, the code is all available for you to look at. So we could look at md.trn_dl, which is a fastai ModelDataLoader, and take a look at the code of that and see exactly how it's built. And you can see here the next function, which is basically keeping track of how many times it's been through, in self.i; and here's the iter function, which is the thing that gets called when you create a new iterator. And you can see it's basically passing things off to something else, of type DataLoader; and then you can check out DataLoader if you're interested to see how that's implemented as well.
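Here is a stripped-down sketch of the iterator protocol being described: a hypothetical toy loader, not the actual fastai code, showing how `__iter__` resets the state and `__next__` tracks the position:

```python
class MiniLoader:
    def __init__(self, data, bs):
        self.data, self.bs = data, bs

    def __iter__(self):              # called whenever a new epoch starts
        self.i = 0                   # reset state: a fresh pass over the data
        return self

    def __next__(self):
        if self.i >= len(self.data):
            raise StopIteration      # end of epoch; a new iter() starts over
        batch = self.data[self.i:self.i + self.bs]
        self.i += self.bs
        return batch

dl = MiniLoader(list(range(6)), bs=2)
batches = [b for b in dl]            # -> [[0, 1], [2, 3], [4, 5]]
```

Because a for loop calls `iter` on the object each time, looping over `dl` in a second epoch starts cleanly from the beginning again.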
So the data loader we wrote basically uses multithreading to allow it to have multiple of these going on at the same time. It's actually really simple; it's only about a screenful of code, so if you're interested in simple multithreaded programming, it's a good thing to look at. Okay, now... oh, yes? "Why have you wrapped this in a for epoch in range(1), since that'll only run once?" Because in real life, we would normally be running multiple epochs. In this case, because it's a linear model, it basically trains to as good as it's going to get in one epoch; if I type 3 here, it won't really improve after the first epoch much at all, as you can see. But when we go back up to the top, we're going to look at some slightly deeper and more interesting versions, which will take more epochs. So if I was turning this into a function, I'd be writing something like "def train_model", and one of the things you'd pass in would be the number of epochs. Okay, great. So one thing to remember is that when you're creating these neural network layers, remember that as far as PyTorch is concerned, this is just an nn.Module. We could be using it as a layer, we could be using it as a function, we could be using it as a neural net; PyTorch doesn't think of those as different things. So this could be a layer inside some other network. So how do gradients work? If you've got a layer (which, remember, we can think of basically as a bunch of activations, some activations that get computed through some nonlinear activation function or some linear function), then from that layer it's very likely that we're then putting it through a matrix product to create some new layer.
And so each one of these activations is actually going to be used to calculate every one of these outputs. So if you want to calculate the derivative, you have to know how this weight matrix impacts that output, and that output, and that output, and that output; and then you have to add all of those together to find the total impact of this weight across all of its outputs. And that's why in PyTorch you have to tell it when to set the gradients to zero. The idea is that you could have lots of different loss functions, or lots of different outputs in your next set of activations, all adding to, increasing or decreasing, your gradients. So you basically have to say: okay, this is a new calculation; reset. And here is where we do that: before we call L.backward, we say reset. Take our weights, take the gradients, take the tensor they point to, and then zero underscore. Does anybody remember from last week what underscore does as a suffix in PyTorch? "I forget the word, but basically it changes it in place, right there." The word is "in place", exactly. And it sounds like a minor technicality, but it's super useful to remember: pretty much every function has an underscore-suffix version which does it in place. So normally zeros returns a tensor of zeros of a particular size, so zero_ means "replace the contents of this with a bunch of zeros". All right, so that's it: that's SGD from scratch, and if I get rid of my menu bar, we can officially say it fits within a screen. Of course, we haven't got our definition of logistic regression here; that's another half screen. But basically there's not much to it. Yes, Svesh?
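Putting those pieces together, here is the whole hand-written SGD loop in miniature for a toy one-weight problem, with the gradient worked out by hand instead of by autograd (the data and names here are made up):

```python
# Fit w so that w * x matches y, for loss L = mean((w * x - y)^2).
def train(w, lr=0.1, epochs=100):
    xs, ys = [1.0, 2.0], [2.0, 4.0]      # tiny made-up dataset (true w = 2)
    for _ in range(epochs):
        grad = 0.0                        # the zero_grad step: gradients
                                          # accumulate, so reset each pass
        for x, y in zip(xs, ys):
            grad += 2 * (w * x - y) * x   # "backward": accumulate dL/dw
        grad /= len(xs)
        w -= lr * grad                    # weights.data -= lr * grad, by hand
    return w
```

The three steps mirror the notebook exactly: zero the gradient, compute it, then step in the opposite direction scaled by the learning rate.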
"So later on, when the surfaces get more complicated, is the reason we have to do this multiple times that you might find a wrong minimum, a local minimum, and you have to kick it out?" Why do you need multiple epochs, is that your question? Well, a simple way to answer that would be: let's say our learning rate was tiny. Then it's just not going to get very far; there's nothing that says going through one epoch is enough to get you all the way there. So then it'd be like, okay, let's increase our learning rate. And sure, we'll increase our learning rate, but who's to say that the highest learning rate that learns stably is enough to learn this as well as it can be learned? For most datasets and most architectures, one epoch is very rarely enough to get you to the best result you can get to. Linear models are just very nicely behaved, so you can often use higher learning rates and learn them more quickly; also, you can't generally get as good an accuracy, so there's not as far to take them either. So yeah, doing one epoch is going to be the rarity. All right, so let's go backwards. Going backwards, we're basically going to say: all right, let's not write those two lines again and again and again; let's have somebody do that for us. That's the only difference between that version and this version: rather than saying .zero_ ourselves, rather than saying minus gradient times lr ourselves, these are wrapped up for us. There is another wrinkle here, which is that this approach to updating the weights is actually pretty inefficient: it doesn't take advantage of momentum and curvature. And so in the DL course, we learned how to do momentum from scratch as well. Okay, so if we just use plain old SGD, you'll see that this learns much slower.
So now that I've typed plain old SGD here, this is literally doing exactly the same thing as our slow version, so I have to increase the learning rate. Okay, there we go. So this is now the same as the one we wrote by hand. So then, all right, let's do a bit more automatically. Given that every time we train something we have to loop through the epochs, loop through the batches, do forward, get the loss, zero the gradients, do backward, and do a step of the optimizer, let's put all that in a function. And that function is called fit. There it is. So let's take a look at fit: go through each epoch, go through each batch, do one step, keep track of the loss, and at the end calculate the validation. And then step: if you're interested in looking at this, this stuff's all inside fastai.model, and here is step. Zero the gradients; calculate the loss (remember, PyTorch tends to call it "criterion" rather than "loss"); do backward. And then there's something else we haven't learned here, but we do learn in the deep learning course, which is gradient clipping, so you can ignore that. All right, so you can see now: all the stuff we've learned, when you look inside the actual framework, that's the code you see. Okay, so that's what fit does. And so then the next step would be: this idea of having some weights and a bias and doing a matrix product and an addition, let's put that in a function. This thing of doing the log softmax, let's put that in a function. And then the very idea of first doing this and then doing that, this idea of chaining functions together, let's put that into a function. And that finally gets us to this: Sequential simply means do this function, take the result, send it to this function, and so on. And Linear means create the weight matrix and the biases. Okay, so that's it.
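The chaining idea behind Sequential can be sketched in a few lines of plain Python; this is a toy stand-in, not the real nn.Sequential:

```python
def sequential(*fns):
    """Return a function that feeds each function's output into the next."""
    def chained(x):
        for f in fns:
            x = f(x)   # do this function, take the result, send it on
        return x
    return chained

double_then_inc = sequential(lambda x: x * 2, lambda x: x + 1)
```

The real nn.Sequential does the same thing with layers as the functions, which is why a linear layer, a ReLU, and a softmax can just be listed in order.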
So we can then, as we started to talk about, turn this into a deep neural network by saying: rather than sending this straight off into 10 activations, let's put it into, say, 100 activations (we could pick whatever number we like), put it through a ReLU to make it nonlinear, put it through another linear layer, another ReLU, and then our final output with our final activation function. So this is now a deep network, and we can fit that. And this time, because it's deeper, I'm actually going to run a few more epochs, and you can see the accuracy increasing. If you try and increase the learning rate here further (it's 0.1), it actually starts to become unstable. Now I'll show you a trick. This is called learning rate annealing, and the trick is this: when you're trying to fit to a function, you've been taking a few steps, step, step, step. As you get close to the bottom, your steps probably want to become smaller; otherwise, what tends to happen is you start bouncing backwards and forwards. And you can actually see it here: it got 93, 94 and a bit, 94.6, 94.8; it's starting to flatten out. Now, that could be because it's done as well as it can, or it could be that it's bouncing backwards and forwards. So a good idea later on in training is to decrease your learning rate and take smaller steps. That's called learning rate annealing. And there's a function in fastai called set_lrs; you can pass in your optimizer and your new learning rate, and see if that helps, and very often it does: about an order of magnitude. In the deep learning course, we learn a much, much better technique than this, to do it all automatically and at a more granular level; but if you're doing it by hand, an order of magnitude at a time is what people generally do.
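A sketch of what a step-wise annealing schedule might look like, with made-up numbers for the drop interval:

```python
# Drop the learning rate by an order of magnitude every few epochs.
def annealed_lr(epoch, base_lr=0.1, drop_every=3, factor=10.0):
    return base_lr / (factor ** (epoch // drop_every))
```

So epochs 0 to 2 train at 0.1, epochs 3 to 5 at 0.01, and so on: the same "divide by ten when progress flattens" move described above, just written as a function of the epoch number.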
So you'll see people in papers talk about learning rate schedules; this is a learning rate schedule. So this schedule (just a moment, Erica, I'll just come to Ernest first) has got us to 97, and I tried going further and we don't seem to be able to get much better than that. So here we've got something where we can get 97% accuracy. Yes, Erica? "So it seems like you changed the learning rate to something very small." Ten times smaller than we started with: we had 0.1, now it's 0.01. "But that makes the whole model train really slowly. So I was wondering if you can make it change dynamically as it approaches closer to the minimum?" Yeah, pretty much; that's some of the stuff we learn in the deep learning course, these more advanced approaches. "So how is this different from using the Adam optimizer or something?" That's the kind of thing we can do, but you still need annealing; as I say, we do this kind of stuff in the deep learning course. For now, we're just going to stick to standard SGD. "I had a question about the data loading. I know it's a fastai function, but could you go into a bit of detail about how it's creating batches, how it's loading data, and how it's making those decisions?" Sure. It'd be good to ask that on Monday night so we can talk about it in detail in the deep learning class, but let's do the quick version here. So basically, there's a really nice design in PyTorch where they say: let's create a thing called a Dataset. A Dataset is basically something that looks like a list: it has a length (that's how many images are in the dataset), and it has the ability to index into it like a list. So if you had d = Dataset(...), you can do len(d) and you can do d[some index]. That's basically all a Dataset is, as far as PyTorch is concerned. And so you start with a Dataset; it's like, d[3] gives you the third image, or whatever.
And so then the idea is that you can take a Dataset and pass it into a constructor for a DataLoader, and that gives you something which is now iterable. So you can say iter(dl), and that's something you can call next on. When you do this, you can choose to have shuffle on or shuffle off: shuffle on means give me random mini batches; shuffle off means go through it sequentially. So what the DataLoader does when you say next, assuming you said shuffle=True, is: if you've got a batch size of 64, it grabs 64 random integers between zero and the length, calls the indexer 64 times to get 64 different items, and jams them together. So fastai uses the exact same terminology and the exact same API; we just do some of the details differently. Specifically, particularly with computer vision, you often want to do a lot of pre-processing and data augmentation: flipping, changing the colors a little bit, rotating. Those turn out to be really computationally expensive; even just reading the JPEGs turns out to be computationally expensive. So PyTorch uses an approach where it fires off multiple processes to do that in parallel, whereas the fastai library instead does something called multithreading, which can be a much faster way of doing it. Yes, Yannet? "So an epoch is a real epoch, in the sense that it covers all of the elements, with a shuffle at the beginning of the epoch, something like that?" Yeah. And not all libraries work the same way; some do sampling with replacement, some don't. The fastai library actually hands the shuffling off to the actual PyTorch version, and I believe the PyTorch version does shuffle, and an epoch covers everything once. Okay. Now, the thing is, when you start to get these bigger networks, potentially you're getting quite a few parameters.
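That Dataset/DataLoader split can be sketched like this; these are hypothetical toy classes, not PyTorch's actual implementations:

```python
import random

# A Dataset just needs a length and the ability to be indexed like a list.
class ToyDataset:
    def __init__(self, items):
        self.items = items

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        return self.items[i]

# A shuffle=True style batch: sample random indices, index the dataset,
# and jam the results together into one batch.
def random_batch(ds, bs):
    idxs = [random.randrange(len(ds)) for _ in range(bs)]
    return [ds[i] for i in idxs]

ds = ToyDataset(list("abcdefgh"))
batch = random_batch(ds, bs=4)
```

The loader never needs to know anything about the dataset beyond those two methods, which is what makes the design so composable.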
So I won't ask you to calculate how many parameters there are, but let's remember: here we've got 28 by 28 input into 100 output, then 100 into 100, then 100 into 10. And for each of those, you've got weights and biases. So we can actually do this. net.parameters returns a list where each element is a tensor of parameters, and not just one per layer: if it's a layer with both weights and biases, that's two separate parameter tensors. So it basically returns us a list of all of the tensors containing the parameters. numel in PyTorch tells you how big each one is. So if I run this, here is the number of parameters in each layer. I've got 784 inputs, and the first layer has 100 outputs, so the first weight matrix is of size 78,400, and the first bias vector is of size 100. Then the next one is 100 by 100, then 100 by 10, and then there's my bias. So there's the number of elements in each layer, and if I add them all up, it's nearly 100,000, and so I'm possibly at risk of overfitting here. So we might want to think about using regularization. A really simple, common approach to regularization in all of machine learning is something called L2 regularization. And it's super important, super handy; you can use it with just about anything. So L2 regularization, the basic idea is this. Normally we'd say our loss is equal to, let's just use RMSE to keep things simple: our predictions minus our actuals, squared, then we sum them up, take the average, take the square root. So what if we then want to say: you know what, if I've got lots and lots of parameters, don't use them unless they're really helping enough? Like, if you've got a million parameters and you only really needed 10 parameters to be useful, just use 10. So how could we tell the loss function to do that?
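You can check that "nearly 100,000" figure with a quick back-of-the-envelope sketch. This is pure arithmetic, no PyTorch needed; calling numel on each of net.parameters would give the same numbers:

```python
# (input, output) sizes for each linear layer: 784 -> 100 -> 100 -> 10
layers = [(784, 100), (100, 100), (100, 10)]

# each layer contributes a weight matrix (n_in * n_out) and a bias vector (n_out)
sizes = []
for n_in, n_out in layers:
    sizes.append(n_in * n_out)  # weights
    sizes.append(n_out)         # biases

total = sum(sizes)
print(sizes)  # [78400, 100, 10000, 100, 1000, 10]
print(total)  # 89610 -- nearly 100,000 parameters
```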
And so basically what we want to say is: hey, if a parameter is zero, that's no problem; it's like it doesn't exist at all. So let's penalize a parameter for not being zero. What would be a way we could measure that? How can we calculate how un-zero our parameters are? Can you pass that to Chen, please, Ernest? You calculate the average of all the parameters? That was my first thought too, but it can't quite be the average. Close, though. Yes, Taylor? Yeah. Yes, you figured it out. Okay. So I think, assuming all of our data has been normalized, standardized, however you want to call it, we want to check that they're significantly different from zero. Not the data, the parameters. The parameters, rather, would be significantly different from zero. And the parameters don't have to be normalized or anything; they're just calculated. Right, so significantly different from zero. I just assumed that the data has been normalized so that we can compare them on the same scale. Oh, yeah, got it. Right. And then those that are not significantly different from zero, we can probably just drop. Okay. And I think Chen is going to tell us how to do that. You just figured it out, right? The mean of the absolute values. We could do that; that would be called L1, which is great. So L1 would be the average of the absolute values of the weights. L2 is the sum of squares (or the square root of the sum of squares). Yeah, exactly. So we just take this, and we don't even have to take the square root: we just take the squares of the weights themselves. And then we want to be able to say: okay, how much do we want to penalize not being zero? Because if we actually don't have that many parameters, we don't want to regularize much at all; if we've got heaps, we do want to regularize a lot. So then we put in a parameter. Except I have a rule in my classes, which is never to use Greek letters, so where normally people use alpha, I'm going to use a.
Okay. So this is some number which you often see somewhere around 1e-6 to 1e-4 ish. Now, we actually don't care about the loss, when you think about it, other than maybe to print it out. What we actually care about is the gradient of the loss. And the gradient of a times the sum of the squared weights is just 2a times each weight. So there are two ways to do this. We can actually modify our loss function to add in this squared penalty, or we can modify that thing where we said weights equals weights minus gradient times learning rate, to subtract that 2aw term as well. These are basically equivalent, but they have different names: the first is called L2 regularization, and the second is called weight decay. Weight decay is how it was first posed in the neural network literature, whereas L2 regularization is how it was posed in the statistics literature. And they're equivalent. As we talked about in the deep learning class, it turns out they're not exactly equivalent, because when you have things like momentum and Adam, they can behave differently. And two weeks ago, a researcher figured out a way to actually do proper weight decay in modern optimizers, and one of our FastAI students just implemented that in the FastAI library. So FastAI is now the first library to actually support this properly. Anyway, for now, let's do the version which PyTorch calls weight decay, but which, it turns out based on that paper from two weeks ago, is actually L2 regularization. It's not quite correct, but it's close enough. So here we can say weight decay is 1e-3. That sets our penalty multiplier a to 1e-3, and it's going to add that penalty to the loss function. Okay. And let's make a copy of these cells, just so we can compare.
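The equivalence of the two formulations (for plain SGD) can be seen in a few lines of numpy. This is an illustrative sketch, not a real training loop: adding a times w squared to the loss changes its gradient by 2aw, which is exactly what weight decay subtracts directly in the update step.

```python
import numpy as np

np.random.seed(1)
w = np.random.randn(5)      # current weights
grad = np.random.randn(5)   # gradient of the unpenalized loss w.r.t. w
lr, a = 0.1, 1e-3           # learning rate and penalty multiplier

# Version 1: L2 regularization -- add a * w**2 to the loss, so its
# gradient picks up an extra 2 * a * w term.
w1 = w - lr * (grad + 2 * a * w)

# Version 2: weight decay -- leave the loss alone, but shrink the
# weights directly in the update step.
w2 = w - lr * grad - lr * 2 * a * w

print(np.allclose(w1, w2))  # True -- identical for vanilla SGD
```

As noted above, once you add momentum or Adam the two stop being identical, which is what the recent weight-decay fix addresses.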
This actually works, so we'll set this running. Okay, this is now optimizing. Except, actually, I've made a mistake here, which is that I didn't rerun this cell. This is an important thing to remember: since I didn't rerun this cell, when it created the optimizer and said net.parameters, it started with the parameters that I had already trained. I actually hadn't recreated my network. So I need to go back and rerun this cell first to recreate the network, then go through and run this. Okay, there we go. That happens. So you might notice something kind of counterintuitive here, which is that that's our training error. Now, you would expect our training error with regularization to be worse. That makes sense, right? Because we're penalizing parameters that could specifically make it better. And yet, actually, it started out better, not worse. So why could that be? The reason that can happen is that if you have a function that looks like that, it potentially takes a really long time to train, whereas if you have a function that looks more like that, it's going to train a lot more quickly. And there are certain things you can do which sometimes can take a function that's kind of horrible and make it less horrible. Sometimes weight decay can actually make your function a little more nicely behaved, and that's actually happened here. So I just mention that to say: don't let that confuse you. Weight decay really does penalize the training set, and so, strictly speaking, the final number we get to for the training set shouldn't end up being better. But it can sometimes train more quickly. Yes. Can you pass it to Chen Xu? I don't get it. Okay. Why is it faster? Does the time matter, the training time? No. This is after one epoch. Yeah, right, so after one epoch.
And congratulations for saying "I don't get it". That's the best thing anybody can say, and so helpful. This here was our training without weight decay, and this here is our training with weight decay. So this is not related to time; this is about a single epoch. After one epoch, my claim was that you would expect the training set, all other things being equal, to have a worse loss with weight decay, because we're penalizing it: this has no penalty, this has a penalty, so the thing with a penalty should be worse. And I'm saying, oh, it's not. That's weird, right? And the reason it's not is that in a single epoch, it matters a lot whether you're trying to optimize something that's very bumpy or something that's nice and smooth. If you're trying to optimize something that's really bumpy, imagine it in some high-dimensional space: you end up rolling around through all these different tubes and tunnels and stuff. Whereas if it's smooth, you just go boom. It's like a marble rolling down a hill. Think of Lombard Street in San Francisco: it goes backwards and forwards, backwards and forwards, so it takes a long time to drive down the road. Whereas if you took a motorbike and just went straight over the top, you'd just go boom. So the shape of the loss function surface defines how easy it is to optimize, and therefore how far you can get in a single epoch. And based on these results, it would appear that weight decay here has made this function easier to optimize. So just to make sure: is the penalizing making the optimizer more likely to reach the global minimum? No, I wouldn't say that. My claim actually is that at the end, it's probably going to be less good on the training set.
And indeed, this does look to be the case. After five epochs, our training set is now worse with weight decay. Now, that's what I would expect. I never use the term global optimum, by the way, because it's just not something we have any guarantees about, and we don't really care about it. We just care where we get to after a certain number of epochs, and we hope that we've found somewhere that's a good solution. And so by the time we get to a good solution, the training set loss with weight decay is worse, even though it was better here, right? But on the validation set, the loss is better, because we penalized the training set in order to try and create something that generalizes better. So we've got more parameters, the parameters that are kind of pointless are now zero, and it generalizes better. So all we're saying is that it just got to a good point after one epoch; that's really all. So is it always true? No. If by "it" you mean does weight decay always make the function surface smoother, no, it's not always true. But it's worth remembering that if you're having trouble training a function, adding a little bit of weight decay may help. So by regularizing the parameters, what it does is smooth out the loss function? I mean, that's not why we do it. The reason we do it is that we want to penalize things that aren't zero, to say: don't make this parameter a high number unless it's really helping the loss a lot; set it to zero if you can, because setting as many parameters to zero as possible means it's going to generalize better. It's like the same as having a smaller network. That's why we do it. But it can change how it learns as well. So let's... okay, just one moment, Erica. I just wanted to check how we actually went here, after the second epoch.
So you can see here it really has helped. After the second epoch, before, we got to 97% accuracy; now we're nearly up to 98% accuracy. And you can see that the loss was 0.08 versus 0.13. So adding regularization has allowed us to go from 3% error to 2% error, so something like a third better solution. Yes, Erica. So there are two pieces to this, right? What are L2 regularization and weight decay? No, they're the same. My claim was they're the same thing. Weight decay is the gradient version: if you take the derivative of L2 regularization, you get weight decay. So you can implement it either by changing the loss function with a squared penalty, or by adding the weights themselves as part of the gradient. Yeah, I was just going to finish the questions. Yes, can you pass that to Devesh? Can we use regularization for a convolution layer as well? Absolutely. A convolution layer just is weights. Jeremy, can you explain why you thought you needed weight decay in this particular problem? Not easily, other than to say it's something that I would always try. You're overfitting, though. Well, yeah, okay, that's a good point, Yannette. So if my training loss were higher than my validation loss, I'd be underfitting, so there'd definitely be no point regularizing; that would always be a bad thing, and would always mean you need more parameters in your model. In this case, I'm overfitting. That doesn't necessarily mean regularization will help, but it's certainly worth trying. Thank you, Yannette, that's a great point. There's one more question. Yeah, Tyler, do you want to pass it over there? So how do you choose the optimal number of epochs? You do my deep learning course. It's a long story. Do you do it heuristically, or is there any...? It's a bit of both. As I say, we don't have time to cover best practices in this class.
We're going to learn the fundamentals. Okay, so let's take a six-minute break and come back at 11:10. All right. So something that we cover in great detail in the deep learning course, but is really important to mention here, is that the secret, in my opinion, to modern machine learning techniques is to massively overparameterize the solution to your problem, as we've done here (we've got, like, 100,000 weights when we only had a small number of 28 by 28 images), and then use regularization. It's the direct opposite of how nearly all statistics and machine learning was done for decades before, and still most senior lecturers at most universities in most areas have this background where they've learned that the correct way to build a model is to have as few parameters as possible. And so hopefully we've learned two things so far. One is that we can build very accurate models even when they have lots and lots of parameters. A random forest has a lot of parameters, and this deep network here has a lot of parameters, and they can be accurate. And we can do that by either using bagging or using regularization. And regularization in neural nets means either weight decay (also known as L2 regularization) or dropout, which we won't worry too much about here. So it's a very different way of thinking about building useful models. And I just wanted to warn you that once you leave this classroom, even possibly when you go to the next faculty member's talk, there'll be people at USF as well who are entirely trained in the world of models with small numbers of parameters. Your next boss is very likely to have been trained in that world too, along with the idea that such models are somehow more pure or easier or better or more interpretable or whatever.
I am convinced that that is not true: probably not ever true, certainly very rarely true. Models with lots of parameters can actually be extremely interpretable, as we learned from our whole lesson on random forest interpretation. You can use most of the same techniques with neural nets, and with neural nets they're even easier. Remember how we did feature importance by randomizing a column to see how changes in that column would impact the output? Well, that's just a kind of dumb way of calculating its gradient: how much does varying this input change the output? With a neural net, we can actually calculate its gradient. So with PyTorch, you could actually say: what's the gradient of the output with respect to this column? You can do the same kind of thing to do a partial dependence plot with a neural net. And I'll mention, for those of you interested in making a real impact: nobody's written basically any of these things for neural nets, so that whole area needs libraries to be written, blog posts to be written. Some papers have been written, but only in very narrow domains like computer vision. As far as I know, nobody's written the paper saying: here's how to do interpretation methods for structured-data neural networks. So it's a really exciting, big area. What we're going to do, though, is start by applying this with a simple linear model. And this is mildly terrifying for me, because we're going to do NLP, and our NLP faculty expert is in the room. So David, just yell at me if I screw this up too badly. NLP refers to any kind of modeling where we're working with natural language text. And interestingly enough, we're going to look at a situation where a linear model is pretty close to the state of the art for solving a particular problem. It's actually something where I surpassed the state of the art using a recurrent neural network a few weeks ago.
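As a hedged sketch of that "feature importance is a dumb gradient" idea: with a trained network you would use autograd to get the gradient of the output with respect to each input, but the same sensitivity can be estimated by finite differences. Everything below is illustrative; the function stands in for a trained model.

```python
import numpy as np

def model(x):
    # stand-in for a trained network: column 0 matters a lot, column 2 not at all
    return 3.0 * x[0] + 0.5 * x[1] + 0.0 * x[2]

def sensitivity(f, x, eps=1e-6):
    """Finite-difference estimate of df/dx_i for each input column:
    the 'how much does varying this input change the output' question
    that column-randomizing feature importance answers crudely."""
    grads = []
    for i in range(len(x)):
        xp = x.copy()
        xp[i] += eps
        grads.append((f(xp) - f(x)) / eps)
    return np.array(grads)

g = sensitivity(model, np.array([1.0, 2.0, 3.0]))
print(np.round(g, 3))  # approximately [3., 0.5, 0.]
```

With PyTorch you would get the same vector exactly (and far more cheaply) by calling backward on the output and reading the input's grad.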
But this is actually going to show you something pretty close to the state of the art with a linear model. We're going to be working with the IMDb dataset. This is a dataset of movie reviews. You can download it by following these steps, and once you download it, you'll see that you've got a train and a test directory. In your train directory, you'll see there's a negative and a positive directory, and in your positive directory, you'll see there's a bunch of text files. And here's an example of a text file. So somehow we've managed to pick out a story of a man who has unnatural feelings for a pig as our first choice. That wasn't intentional, but it'll be fine. So we're going to look at these movie reviews, and for each one, we're going to look to see whether they were positive or negative. So they've been put into one of these folders. They were downloaded from IMDb, the movie database and review site. The ones that were strongly positive went in positive, the strongly negative went in negative, and the rest they didn't label at all. So these are only highly polarized reviews. In this case, we have an insane violent mob, which unfortunately is "too absurd, too off-putting; those in the area will be turned off". The label for this was a zero, which is negative. So this is a negative review. In the FastAI library, there are lots of little functions and classes to help with most kinds of domains that you do machine learning on. For NLP, one of the simple things we have is texts from folders. That's just going to go ahead and find all of the folders in here with these names and create a labeled dataset. And don't let these things ever stop you from understanding what's going on behind the scenes: we can grab its source code, and as you can see, it's like five lines.
I don't like to write these things out in full; I hide them behind little functions so you can reuse them. But basically it's just going to go through each directory, and within that, each sub-directory, and then each file in that directory, stick the text into this array of texts, figure out what folder it's in, and stick that into the array of labels. So that's how we end up with an array of the reviews and an array of the labels. That's our data. So our job will be to take that and predict that. Okay? And the way we're going to do it is to throw away all of the interesting stuff about language, which is the order that the words are in. Now, this is very often not a good idea, but in this particular case it's going to turn out to work not too badly. So let me show you what I mean by throwing away the order of the words. Normally, the order of the words matters a lot: if you've got a "not" before something, then that "not" refers to that thing. But in this case, we're trying to predict whether something's positive or negative. If you see the word "absurd" appear a lot, then maybe that's a sign that this isn't very good. "Cryptic", maybe that's a sign that it's not very good. So the idea is that we're going to turn it into something called a term-document matrix, where for each document, i.e. each review, we're just going to create a list of what words are in it, rather than what order they're in. So let me give an example. Can you see this okay? Okay. Here are four movie reviews that I made up. "This movie is good" and "the movie is good" are both positive. "This movie is bad" and "the movie is bad" are both negative. So I'm going to turn this into a term-document matrix. The first thing I need to do is create something called a vocabulary. A vocabulary is a list of all the unique words that appear. So here's my vocabulary: this, movie, is, good, the, bad.
That's all the words. Now I'm going to take each one of my movie reviews and turn it into a vector of which words appear and how often they appear. In this case, none of my words appear twice. So "this movie is good" has those four words in it, whereas "this movie is bad" has those four words in it. So this is called a term-document matrix, and this representation we call a bag of words representation. So this here is a bag of words representation of the review. It doesn't contain the order of the text anymore; it's just a bag of the words: what words are in it? It contains bad, is, movie, this. So that's the first thing we're going to do: turn it into a bag of words representation. And the reason that this is convenient for linear models is that this is a nice rectangular matrix that we can do math on. Specifically, we can do a logistic regression, and that's what we're going to do: we're going to get to a point where we do a logistic regression. Before we get there, though, we're going to do something else, which is called naive Bayes. sklearn has something which will create a term-document matrix for us, called CountVectorizer, so we'll just use it. In NLP, you have to turn your text into a list of words, and that's called tokenization. And that's actually non-trivial, because what if it was actually "this movie is good." with a full stop? Or if it was "this movie is good" in quotes, how do you deal with that punctuation? Or, perhaps more interestingly, what if it was "this movie isn't good"? So how you turn a piece of text into a list of tokens is called tokenization. And a good tokenizer would turn "this movie isn't good" into something like: " this movie is n't good ". So you can see in this version, if I now split on spaces, every token is either a word, a single piece of punctuation, or this suffix, n't, which is treated like a word. That's how we would probably want to tokenize that piece of text.
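The toy example above can be written out directly. This is a hand-rolled, pure-Python version of the term-document matrix (rather than sklearn's CountVectorizer, which we'll use on the real data):

```python
reviews = ["this movie is good", "the movie is good",
           "this movie is bad",  "the movie is bad"]
labels = [1, 1, 0, 0]  # positive, positive, negative, negative

# vocabulary: all unique words, in first-seen order
vocab = []
for r in reviews:
    for w in r.split():
        if w not in vocab:
            vocab.append(w)

# term-document matrix: one row per review, one count per vocab word
term_doc = [[r.split().count(w) for w in vocab] for r in reviews]

print(vocab)        # ['this', 'movie', 'is', 'good', 'the', 'bad']
print(term_doc[0])  # [1, 1, 1, 1, 0, 0] -- bag of words for "this movie is good"
```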
Because you wouldn't want "good." with a full stop to be an object; there's no concept of good-full-stop. Or quote-"movie" is not an object. So tokenization is something we hand off to a tokenizer. FastAI has a tokenizer in it that we can use. So this is how we create our term-document matrix with a tokenizer. sklearn has a pretty standard API, which is nice; I'm sure you've seen it a few times before now. So once we've built some kind of model (we can kind of think of this vectorizer as a model, just-ish; this is just defining what it's going to do), we can call fit_transform to do it. In this case, fit_transform is going to create the vocabulary and create the term-document matrix based on the training set. transform is a little bit different: that says use the previously fitted model, which in this case means use the previously created vocabulary. We wouldn't want the validation set and the training set to have the words in different orders in the matrices, because then they'd have different meanings. So this is saying: use the same vocabulary to create a bag of words for the validation set. Could you pass that back, please? What if the validation set has a different set of words to the training set? Yeah, that's a great question. So generally, most of these vocab-creating approaches will have a special token for unknown. Sometimes you'll also say, hey, if a word appears less than three times, call it unknown. But otherwise, if you see something you haven't seen before, call it unknown. So that would just become a column in the bag of words: unknown. Good question. All right, so when we create this term-document matrix of the training set, we have 25,000 rows, because there are 25,000 movie reviews, and there are 75,132 columns. What does that represent? What does that mean? Can you pass that to Devesh? All the vocabulary. Yeah, go on. What do you mean?
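The fit_transform / transform distinction, and the unknown-word question, can be sketched with a tiny hand-rolled vectorizer. The names and implementation here are illustrative, not sklearn's:

```python
class TinyVectorizer:
    """fit_transform builds the vocabulary from the training set;
    transform reuses that same vocabulary, mapping any unseen word
    to a single 'unknown' column."""
    def fit_transform(self, docs):
        self.vocab = {}
        for d in docs:
            for w in d.split():
                self.vocab.setdefault(w, len(self.vocab))
        self.unk = len(self.vocab)  # extra column for unknown words
        return self.transform(docs)

    def transform(self, docs):
        rows = []
        for d in docs:
            row = [0] * (len(self.vocab) + 1)
            for w in d.split():
                row[self.vocab.get(w, self.unk)] += 1
            rows.append(row)
        return rows

v = TinyVectorizer()
train = v.fit_transform(["this movie is good", "this movie is bad"])
valid = v.transform(["this movie is absurd"])  # 'absurd' lands in the unknown column
```

The key point is that the validation matrix is built with the training vocabulary, so column j means the same word in both matrices.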
So, like, the number of words, the union of the words? The number of unique words. Yeah, exactly. Good. Okay. Now, most documents don't have most of these 75,000 words, so we don't want to store that as a normal array in memory, because it would be very wasteful. So instead we store it as a sparse matrix. And what a sparse matrix does is just store the whereabouts of the non-zeros. So it says: okay, document number one, term number four appears, and it has a count of four. Document one, term number 123 appears, and it's a one. And so forth. That's basically how it's stored. There are actually a number of different ways of storing it, and if you do Rachel's computational linear algebra course, you'll learn about the different types, why you'd choose them, how to convert between them, and so forth. But they're all something like this, and on the whole you don't have to worry about the details. The important thing to know is that it's efficient. And so we could grab the first review, and that gives us a 75,000-long, one-row sparse matrix with 93 stored elements. In other words, 93 of those words are actually used in the first document. We can have a look at the vocabulary by saying vectorizer.get_feature_names; that gives us the vocab. And here's an example of a few of the elements of get_feature_names. I didn't intentionally pick the one that had "aussie", but that's the important word, obviously. I haven't used the real tokenizer here; I'm just splitting on space, so this isn't quite the same as what the vectorizer did. But to simplify things, let's grab a set of all the lowercase words. By making it a set, we make them unique. So this is roughly the list of words that would appear, and that length is 91, which is pretty similar to 93. The difference will just be that I didn't use a real tokenizer. So that's basically all that's been done there: it's created this unique list of words and mapped them.
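That "only store the non-zeros" idea is, in its simplest coordinate-format flavor, just a list of (row, column, value) triples. A pure-Python sketch of the concept (not scipy's actual implementation, which offers several formats):

```python
dense = [
    [0, 4, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

# coordinate-format sparse representation: keep only the non-zero entries
sparse = [(r, c, v) for r, row in enumerate(dense)
                    for c, v in enumerate(row) if v != 0]

print(sparse)  # [(0, 1, 4), (1, 3, 1)] -- 2 stored elements instead of 12

def lookup(sparse, r, c):
    """Read one cell back out; anything not stored is zero."""
    for rr, cc, v in sparse:
        if (rr, cc) == (r, c):
            return v
    return 0
```

For the real term-document matrix, 93 stored elements per row replace a 75,132-wide row of mostly zeros, which is why the sparse representation is so much more efficient.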
We could check by calling vectorizer.vocabulary_ to find the ID of a particular word. So this is like the reverse map of the other one: that was integer to word; here is word to integer. And we saw "absurd" appear twice in the first document, so let's check train_term_doc[0, 1297]. There it is: it's 2. Whereas, unfortunately, "aussie" didn't appear in the unnatural-relationship-with-a-pig movie, so [0, 5000] is 0. So that's our term-document matrix. Yes? So does it care about the relative relationship between the words, as in the ordering of the words? No, we've thrown away the ordering; that's why it's a bag of words. And I'm not claiming that this is necessarily a good idea. What I will say is that the vast majority of NLP work done over the last few decades generally uses this representation, because we didn't really know much better. Nowadays, increasingly, we're using recurrent neural networks instead, which we'll learn about in our last deep learning lesson of part one. But sometimes this representation works pretty well, and it's actually going to work pretty well in this case. In fact, back when I was at Fastmail, my email company, a lot of the spam filtering we did used this next technique, naive Bayes, which is a bag of words approach. The idea is: if you're getting a lot of email containing the word Viagra, and it's always been spam, and you never get email from your friends talking about Viagra, then something that says Viagra, regardless of the detail of the language, is very likely from a spammer. So that's the basic theory of classification using a term-document matrix. So let's talk about naive Bayes, and here's the basic idea. We're going to start with our term-document matrix. These first two rows are our corpus of positive reviews; these next two are our corpus of negative reviews; and here's our whole corpus of all reviews. So what I can do now is create a probability.
We tend to call these, more generically, features rather than words. "this" is a feature; "movie" is a feature. It's more like machine learning language now: a column is a feature. We often call those f in naive Bayes. So we can basically say the probability that you would see the word "this", given that the class is one, given that it's a positive review, is just the average of how often you see "this" in the positive reviews. Now, we've got to be a bit careful, though, because if you never ever see a particular word in a particular class, so if I've never received an email from a friend that said Viagra, that doesn't actually mean the probability of a friend sending me an email about Viagra is zero. It's not really zero. I hope I don't get an email from Terence tomorrow saying, "Jeremy, you probably could use this advertisement for Viagra", but, you know, it could happen, and I'm sure it would be in my best interest. So what we do is we say: what we've seen so far is not the full sample of everything that could happen; it's a sample of what's happened so far. So let's assume that the next email you get actually does mention Viagra, and every other possible word. So basically, we're going to add a row of ones. That's like the email that contains every possible word, so that way nothing's ever infinitely unlikely. So I take the average of all of the times that "this" appears in my positive corpus, plus the ones. So that's the probability that the feature "this" appears in a document, given that class equals one. And, not surprisingly, here's the same thing for the probability that the feature "this" appears given class equals zero: the same calculation, except for the class-zero rows. And these happen to be the same, because "this" appears once in the positives and once in the negatives. Okay, let's just put this back to what it was.
So we can do that for every feature, for every class. So our trick now is to use Bayes' rule to fill this in. What we want is: given that I've got this particular document (somebody sent me this particular email, or I have this particular IMDb review), what's the probability that its class is positive? So for this particular movie review, what's the probability that its class is positive? And we can say: that's equal to the probability that we got this particular movie review given that its class is positive, multiplied by the probability that any movie review's class is positive, divided by the probability of getting this particular movie review. That's just Bayes' rule. So we can calculate all of those things. But actually, what we really want to know is: is it more likely that this is class 0 or class 1? So what if we took the probability that it's class 1 and divided it by the probability that it's class 0? Then we could say: if this number is bigger than 1, it's more likely to be class 1; if it's smaller than 1, it's more likely to be class 0. So in that case, we just divide this whole thing by the same version for class 0, which is the same as multiplying by the reciprocal. And the nice thing is that this puts the probability of getting this document, P(d), on both the top and the bottom, so we can cancel it out. That leaves the probability of getting the data given class 0 down here, and the probability of class 0 here. So what that means is we want to calculate: the probability that we would get this particular document given that the class is 1, times the probability that the class is 1, divided by the probability that we would get this particular document given that the class is 0, times the probability that the class is 0.
So the probability that the class is 1 is just equal to the average of the labels. The probability that the class is 0 is just 1 minus that. So there are those two numbers. I've got an equal amount of both, so they're both 0.5. What is the probability of getting this document given that the class is 1? Can anybody tell me how I would calculate that? Can somebody pass that? Please. Thank you. "Look at all the documents which have class equal to 1." Uh-huh. "And 1 divided by that will give you..." So remember, though, it's going to be for a particular document. So for example we'd be saying: what's the probability that this review is positive? You're on the right track, but what we're going to have to do is say: let's just look at the words it has, and then multiply the probabilities together for class equals 1. So the probability that a class 1 review has "this" is two thirds. The probability it has "movie" is 1, "is" is 1, and "good" is 1. So the probability it has all of them is all of those multiplied together. Kinda. And why only kinda, Tyler? Can you pass that to Tyler? I'm so glad you look horrified and skeptical. "Word choice is not independent." Thank you. So that doesn't hold. So nobody can call Tyler naive, because the reason this is naive Bayes is that it's what happens if you apply Bayes' theorem in a naive way, and Tyler is not naive. Anything but, right? So naive Bayes says: let's assume that if you have "this movie is bloody stupid, I hate it", then the probability of "hate" is independent of the probability of "bloody" is independent of the probability of "stupid". Which is definitely not true. And so naive Bayes isn't actually very good, but I'm teaching it to you because it's going to turn out to be a convenient piece for something we're about to learn later. And it often works pretty well. It's okay, right?
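The "naive" independence step, multiplying the per-word probabilities, looks like this in NumPy. The probabilities and vocabulary are the toy spreadsheet numbers from above, not real data:

```python
import numpy as np

# Hypothetical per-word probabilities P(word | class 1) for the vocabulary
# ["this", "movie", "is", "good"], matching the spreadsheet example:
p1 = np.array([2/3, 1.0, 1.0, 1.0])

# A document as a binary bag of words: "this movie is good" contains all four.
doc = np.array([1, 1, 1, 1])

# The naive assumption: words are independent given the class, so the
# probability of the whole document is just the product over its words.
p_doc_given_1 = np.prod(np.where(doc == 1, p1, 1.0))  # 2/3
```

Tyler's objection is exactly that this product is not how language works: "bloody" and "stupid" are far from independent.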
I mean, I would never choose it. I don't think it's better than any other technique that's equally fast and equally easy. But it's a thing you can do, and it's certainly going to be a useful foundation. So here is our calculation of the probability that we get this particular document assuming it's a positive review. Here's the probability given it's a negative. And here's the ratio. This ratio is above one, so we're going to say: I think this is probably a positive review. Okay? So that's the Excel version. And you can tell that I let Yannet touch this, because it's got LaTeX in it; we've got actual math. So here is the same thing: the log-count ratio for each feature, for each word f. And here it is written out as Python. Our independent variable is our term-document matrix; our dependent variable is just the labels for y. So using NumPy, this is going to grab the rows where the dependent variable is one. And then we can sum them over the rows to get the total word count for that feature across all the documents, plus one. Because that's the... somebody's going to send me something about Viagra today, I can tell. So that's that. Then do the same thing for the negative reviews. And then of course it's nicer to take the log, because if we take the log we can add things together rather than multiply them, and once you multiply enough of these things together, it's going to get so close to zero that you'll probably run out of floating point. So we take the log of the ratios. And then, as I say, in log space we add rather than multiply, so we add that to the log of the ratio of the whole-class probabilities. So in order to say, for each document, multiply the Bayes probabilities by the counts, we can just use matrix multiply. Okay?
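The log-count-ratio calculation he's walking through can be sketched like this, again on an invented toy matrix rather than the lesson's real term-document matrix:

```python
import numpy as np

# Toy term-document count matrix (documents x features) and 0/1 labels.
x = np.array([[1, 1, 1, 0],
              [0, 1, 1, 1],
              [2, 1, 1, 1],
              [0, 1, 1, 0]])
y = np.array([1, 0, 1, 0])

p = x[y == 1].sum(axis=0) + 1   # smoothed counts per feature, class 1
q = x[y == 0].sum(axis=0) + 1   # smoothed counts per feature, class 0

# Log of the ratio of normalized frequencies: in log space the per-word
# terms add instead of multiply, so long products can't underflow to zero.
r = np.log((p / p.sum()) / (q / q.sum()))
b = np.log((y == 1).mean() / (y == 0).mean())  # log of the class ratio
```

With equal numbers of positive and negative documents, b comes out as log(1) = 0.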
And then to add on the log of the class ratios, we can just add b. And so we end up with something that looks a lot like our logistic regression, but we're not learning anything, not in an SGD point of view; we're just calculating it using this theoretical model. And as I said, we can then compare whether it's bigger or smaller than zero, not one anymore, because we're now in log space. And then we can compare that to the mean, and we say, okay, that's 80% accurate. 81% accurate. So naive Bayes is not nothing; it gave us something. Now, it turns out that this version is where we're actually looking at how often a word appears, like "absurd" appeared twice. It turns out, at least for this problem and quite often, it doesn't matter whether "absurd" appeared twice or once; all that matters is that it appeared at all. So what people tend to try is taking the term-document matrix and calling .sign() on it. sign replaces anything positive with one and anything negative with negative one; we don't have any negative counts, obviously. So this binarizes it. It says: I don't care that you saw "absurd" twice, I just care that you saw it. And if we do exactly the same thing with the binarized version, we get a better result. Okay. Now, this is the difference between theory and practice. In theory, naive Bayes sounds okay, but it's naive, unlike Tyler. So what Tyler would probably do instead is say: rather than assuming that I should use these coefficients r, why don't we learn them? Does that sound reasonable, Tyler? Yeah. Okay, so let's learn them. We can totally learn them. So let's create a logistic regression, right?
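Putting the pieces together, prediction is a matrix multiply plus b, compared against zero, and binarization is np.sign. This is a self-contained sketch on toy data (the lesson also recomputes the counts from the binarized training matrix; here the same r is reused for simplicity):

```python
import numpy as np

x = np.array([[1, 1, 1, 0],
              [0, 1, 1, 1],
              [2, 1, 1, 1],
              [0, 1, 1, 0]])
y = np.array([1, 0, 1, 0])

p = x[y == 1].sum(axis=0) + 1
q = x[y == 0].sum(axis=0) + 1
r = np.log((p / p.sum()) / (q / q.sum()))
b = np.log((y == 1).mean() / (y == 0).mean())

# Classify: matrix multiply, add b, compare to 0 (not 1, we're in log space).
preds = (x @ r + b) > 0
accuracy = (preds == (y == 1)).mean()

# Binarized version: only "did the word appear", not "how many times".
x_bin = np.sign(x)
preds_bin = (x_bin @ r + b) > 0
```

On this tiny example both versions classify everything correctly; on the real IMDb data the binarized version is the one that does a bit better.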
And let's fit some coefficients. That's going to give us something with literally exactly the same functional form that we had before, but now, rather than using a theoretical r and a theoretical b, we're going to calculate the two things based on logistic regression, and that's better. So it's kind of like: why do something based on a theoretical model? Because theoretical models are pretty much never going to be as accurate as a data-driven model. Unless you're dealing with some physics thing or something where you're like, okay, this is actually how the world works, we really are working in a vacuum and this is the exact gravity, and so on. For most of the real world, it's better to learn your coefficients than to calculate them. Yes, Yannet? "Jeremy, what's this dual equals true?" I was hoping you'd not notice, but you saw it. Basically, in this case, our term-document matrix is much wider than it is tall. There is a reformulation, an almost mathematically equivalent reformulation, of logistic regression that happens to be a lot faster when it's wider than it is tall. So the short answer is: anytime it's wider than it is tall, put dual=True. With it, this runs in like two seconds; if you don't have it here, it'll take a few minutes. In math there's this concept of dual versions of problems, which are equivalent versions that sometimes work better for certain situations. Okay, so here is the binarized version, and it's about the same. You can see I've fitted it with the sign of the term-document matrix and predicted with that.
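A minimal sketch of this with scikit-learn, using a synthetic stand-in for the term-document matrix (the real code fits on the IMDb matrices). dual=True requires the liblinear solver with an L2 penalty, and C here is the regularization parameter discussed next:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic "term-document" matrix: 50 documents, 200 features, i.e. much
# wider than it is tall, which is exactly where dual=True pays off.
rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=(50, 200)).astype(float)
y = (x[:, 0] > 0).astype(int)  # a simple signal the model can learn

# C is the INVERSE regularization strength: C=1e8 effectively turns
# regularization off. dual=True solves the dual formulation of the problem
# (liblinear solver), much faster when n_features >> n_samples.
m = LogisticRegression(C=1e8, dual=True, solver="liblinear")
m.fit(x, y)

m_reg = LogisticRegression(C=0.1, dual=True, solver="liblinear")
m_reg.fit(x, y)

# The regularized model's coefficients are pulled toward zero.
smaller = np.abs(m_reg.coef_).sum() < np.abs(m.coef_).sum()
```

The learned coefficients play exactly the role r played in naive Bayes, except they're fitted to the data instead of derived from the independence assumption.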
Now, the thing is that this is going to be a coefficient for every term. There were about 75,000 terms in our vocabulary, and that seems like a lot of coefficients given that we've only got 25,000 reviews, so maybe we should try regularizing this. We can use the regularization built into scikit-learn's LogisticRegression class. C is the parameter it uses, and, this is slightly weird, a smaller parameter means more regularization. That's why I used 1e8, to basically turn off regularization before. So if I turn on regularization, set it to 0.1, then now it's 88%. Which makes sense: you would think 75,000 parameters for 25,000 documents is likely to overfit, and indeed it did overfit. So this is adding L2 regularization to avoid overfitting. I mentioned earlier that as well as L2, which looks at the weights squared, there's also L1, which looks at just the absolute values of the weights. I was pretty sloppy in my wording before: I said that L2 tries to make things zero. That's kind of true, but if you've got two things that are highly correlated, L2 regularization will move them both down together; it won't make one of them zero and the other non-zero. L1 regularization, on the other hand, actually has the property that it'll try to make as many things zero as possible, whereas L2 has the property that it tends to try to make everything smaller. We actually don't care much about that difference in modern machine learning, because we very rarely try to directly interpret the coefficients; we try to understand our models through interrogation, using the kinds of techniques that we've learned. The reason we would care about L1 versus L2 is simply the error on the validation set, and you can try both with scikit-learn's logistic regression. L2 actually turns out to be a lot faster, because you can't use dual=True unless you have L2, and L2 is the default, so I didn't really worry too much about that
difference here. So you can see that if we use regularization and binarization, we actually do pretty well. Okay, so, yes, can you pass that back please? "Before, we learned about elastic net, right? Like, combining L1 and L2?" Yeah, you can do that, but with deeper models I've never seen anybody find it useful. Okay, so the last thing I'll mention is that when you do your CountVectorizer, wherever that was... when you do your CountVectorizer, you can also ask for n-grams. By default we get unigrams, that is, single words, but if we say ngram_range=(1,3), that's also going to give us bigrams and trigrams. By which I mean: if I now go ahead and do the CountVectorizer and get the feature names, my vocabulary now includes bigrams like "by vast" and "by vengeance", and trigrams like "by vengeance ." and "by vera miles". So this is now doing the same thing, but after tokenizing, it's not just grabbing each word and saying that's part of our vocabulary, but each two words next to each other, and each three words next to each other. And this turns out to be super helpful in taking advantage of bag-of-words approaches, because we can now see the difference between "not good" versus "not bad" versus "not terrible", or even something like "good" in quotes, which is probably going to be sarcastic. So using trigram features actually turns out to make both naive Bayes and logistic regression quite a lot better. It really takes us quite a lot further and makes them quite useful. "I have a question about the tokenizers. You are setting some max_features, so how are these bigrams and trigrams selected?"
Right, so since I'm using a linear model, I didn't want to create too many features. It actually worked fine even without max_features; I think I had something like, I can't remember, 70 million coefficients, and it still worked, but there's just no need to have 70 million coefficients. So if you say max_features=800000, the CountVectorizer will sort the vocabulary by how often everything appears, whether it be unigram, bigram, or trigram, and it will cut it off after the 800,000 most common n-grams. N-gram is just the generic word covering unigrams, bigrams, and trigrams. So that's why trn_term_doc.shape is now 25,000 by 800,000. And if you're not sure what this number should be: I just picked something that was really big and didn't worry about it too much, and it seemed to be fine. It's not terribly sensitive. Alright, okay, well, we're out of time. So what we're going to see next week, and by the way, we could have replaced this logistic regression with our PyTorch version, and next week we'll actually see something in the fastai library that does exactly that. But also what we'll see next week, sorry, not next week, tomorrow, is how to combine logistic regression and naive Bayes together to get something that's better than either, and then we'll learn how to move from there to create a deeper neural network, to get pretty much that result for structured learning. Alright, so we'll see you then.