 Welcome back. And here is lesson four, which is where we get deep into the weeds of exactly what is going on when we are training a neural network. And we started looking at this in the previous lesson. We were looking at stochastic gradient descent. And so to remind you, we were looking at what Arthur Samuel said. Suppose we arrange for some automatic means of testing the effectiveness of any current weight assignment, or we would call it parameter assignment, in terms of actual performance, and provide a mechanism for altering the weight assignment, so as to maximize that performance. So we could make that entirely automatic, and a machine so programmed would learn from its experience. And that was our goal. So our initial attempt on the MNIST dataset was not really based on that. We didn't really have any parameters. So then, last week, we tried to figure out how we could parameterize it, how we could create a function that had parameters. And what we thought we could do would be to have something where, say, the probability of being some particular number was expressed in terms of the pixels of that number, and some weights, and then we would just multiply them together and add them up. So we looked at how stochastic gradient descent worked last week. And the basic idea is that we start out by initializing the parameters randomly. We use them to make a prediction using a function, such as this one. We then see how good that prediction is by measuring using a loss function. We then calculate the gradient, which is how much would the loss change if I changed one parameter by a little bit. We then use that to make a small step to change each of the parameters by a little bit, by multiplying the learning rate by the gradient, to get a new set of predictions. And so we went round and round and round a few times, and eventually we decided to stop. And so these are the basic seven steps that we went through. 
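Those seven steps can be sketched with a toy one-parameter example. This is just an illustration, not the notebook's code: the function name sgd_demo and the quadratic loss are made up here, and the hand-written analytic gradient 2*(w - 3) stands in for what PyTorch's autograd would compute.

```python
# A minimal sketch of the seven gradient-descent steps on a toy
# one-parameter function, loss(w) = (w - 3)**2, whose gradient is
# 2*(w - 3). All names here are illustrative, not from the lesson code.
import random

def sgd_demo(lr=0.1, steps=50):
    w = random.uniform(-10, 10)        # 1. initialize the parameter randomly
    for _ in range(steps):             # repeat the loop...
        loss = (w - 3) ** 2            # 2-3. predict and measure the loss
        grad = 2 * (w - 3)             # 4. gradient of the loss w.r.t. w
        w -= lr * grad                 # 5. step the parameter a little bit
    return w                          # 6-7. ...until we decide to stop
```

Each step shrinks the distance to the minimum at w = 3, which is the whole mechanism in miniature.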
And so we did that for a simple quadratic equation, and we had something which looked like this. And so by the end, we had this nice example of a curve getting closer and closer and closer. So I have a little summary at the start of this section, summarizing gradient descent, that Sylvain and I have in the notebooks in the book, of what we just did. So you can review that and make sure it makes sense to you. So now let's use this to create our MNIST threes versus sevens model. And so to create a model, we're going to need to create something that we can pass into a function like this one. So we need just some pixels that are all lined up and some parameters that are all lined up, and then we're going to sum them up. So our Xs are going to be pixels. And in this case, because we're just going to multiply each pixel by a parameter and add them up, the fact that they're laid out in a grid is not important. So let's reshape those grids and turn them into vectors. The way we reshape things in PyTorch is by using the view method. You can pass to the view method how large you want each dimension to be. And so in this case, we want the number of columns to be equal to the total number of pixels in each picture, which is 28 times 28, because they're 28 by 28 images. And then the number of rows will be however many rows there are in the data. If you just use minus one when you call view, that means as many as there are in the data. So this will create something with the same total number of elements that we had before. So we can grab all our threes, we can concatenate them, torch.cat, with all of our sevens, and then reshape that into a matrix where each row is one image, with all of the rows and columns of the image all lined up into a single vector. So that's our X. So then we're going to need labels.
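As a rough sketch of what view(-1, 28*28) accomplishes, here is the same flattening done with plain Python lists standing in for tensors; flatten_images is an illustrative name, not a fastai or PyTorch function.

```python
# A sketch of what view(-1, 28*28) does: each 2-D image grid becomes one
# long row vector, and the flattened rows are stacked into a matrix.
def flatten_images(images):
    # images: a list of 2-D grids (lists of rows); result: one flat row each
    return [[px for row in img for px in row] for img in images]

imgs = [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]   # two tiny 2x2 "images"
flat = flatten_images(imgs)                    # one row of 4 pixels per image
```

With real 28 by 28 images, each row would be 784 long, which is exactly the shape the linear model expects.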
Our labels will be a one for each of the threes and a zero for each of the sevens. So basically, we're going to create an "is three" model. That's going to create a vector. We actually need it to be a matrix in PyTorch. So unsqueeze will add an additional unit dimension wherever I ask for, so here in position one. In other words, this is going to turn it from something which is a vector of 12,396 long into a matrix with 12,396 rows and one column. That's just what PyTorch expects to see. So now we're going to turn our X and Y into a dataset. And a dataset is a very specific concept in PyTorch. It's something which we can index into using square brackets, and when we do so, it's expected to return a tuple. So here we're going to create this dataset, and when we index into it, it's going to return a tuple containing our independent variable and our dependent variable for each particular row. And to do that, we can use the Python zip function, which takes one element of the first thing and combines it with one element of the second thing, and then it does that again and again and again. And so then if we create a list of those, it gives us a dataset: a list which, when we index into it, contains one image and one label. And so here you can see, there's my label and my image. I won't print out the whole thing, but it's a 784 long vector. So that's a really important concept: a dataset is something that you can index into and get back a tuple. And here, this is called destructuring the tuple, which means I'm taking the two parts of the tuple and putting the first part in one variable and the second part in the other variable, which is something we do a lot in Python. It's pretty handy. A lot of other languages support that as well. Repeat the same three steps for a validation set, and we've now got a training dataset and a validation dataset.
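The dataset idea can be sketched with plain Python, using toy values throughout (three tiny "images" instead of 12,396 real ones):

```python
# A sketch of the dataset concept: zip pairs each image row with its
# label, and indexing into the resulting list returns a tuple.
train_x = [[0.0] * 4, [1.0] * 4, [0.5] * 4]   # three toy "flattened images"
train_y = [1, 0, 1]                            # 1 = a three, 0 = a seven

dset = list(zip(train_x, train_y))             # indexable, returns tuples

x, y = dset[0]   # destructuring the tuple into two variables
```

That is all a dataset needs to be: something you can index into and get back an (independent, dependent) pair.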
So now we need to initialize our parameters. And to do that, as we've discussed, we just do it randomly. So here's a function that, given some size, some shape if you like, will randomly initialize using a normal random number distribution in PyTorch. That's what torch.randn does. And we can hit shift tab to see how that works. And it says here that it's going to have a mean of zero and a variance of one. So then we multiply by std to scale its standard deviation to whatever is requested, which will default to one. And then, as we talked about, when it comes to calculating our gradients, we have to tell PyTorch which things we want gradients for. And the way we do that is requires_grad_. Remember, this underscore at the end is a special magic symbol which tells PyTorch that we want this function to actually change the thing that it's referring to. So this will change this tensor such that it requires gradients. So here's some weights. Our weights are going to need to be of shape 28 times 28 by one: 28 times 28 because every pixel is going to need a weight, and then one because we're going to need that unit axis to make it into a column. That's what PyTorch expects. So there's our weights. Now, just weights times pixels actually isn't going to be enough, because weights times pixels will always equal zero when the pixels are equal to zero. It has a zero intercept. So we really want something like wx plus b, a line. The b is what we call the bias, and that's just going to be a single number. So let's grab a single number for our bias. So remember, I told you there's a difference between the parameters and the weights. Strictly speaking, the weights are the w in this equation, the bias is the b in this equation, and the weights and bias together are the parameters of the function.
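Here's a rough pure-Python stand-in for that init_params idea, with random.gauss in place of torch.randn; in the real notebook the result would be a tensor and would also get .requires_grad_() so PyTorch tracks gradients, which a plain list can't express.

```python
# A sketch of init_params: draw from a standard normal (std 1) and scale
# by the requested std. In PyTorch this would be torch.randn(size) * std
# followed by .requires_grad_(); the names here are illustrative.
import random

def init_params(size, std=1.0):
    return [random.gauss(0.0, 1.0) * std for _ in range(size)]

weights = init_params(28 * 28)   # one weight per pixel
bias = init_params(1)            # a single bias number
```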
They're all the things that we're going to change, all the things that have gradients that we're going to update. So there's an important bit of jargon for you: the weights and biases of the model are the parameters. Yes, question: what's the difference between gradient descent and stochastic gradient descent? So far, we've only done gradient descent. We'll be doing stochastic gradient descent in a few minutes. So we can now calculate a prediction for one image. We can take an image, such as the first one, and multiply it by the weights. We need to transpose them to make them line up in terms of the rows and columns, add it up, and add the bias. And there is a prediction. We want to do that for every image. We could do that with a for loop, and that would be really, really slow. It wouldn't run on the GPU, and it wouldn't run in optimized C code. So whenever you're doing something like looping over pixels or looping over images, you always need to try to make sure you're doing that without a Python for loop. In this case, doing this calculation for lots of rows and columns is a mathematical operation called matrix multiplication. So if you've forgotten your matrix multiplication, or maybe never quite got around to it at high school, it would be a good idea to have a look at Khan Academy or something to learn about what it is. But I'll give you the quick answer. This is from Wikipedia. If these are two matrices, A and B, then this element here, row one column two of the output, is going to be equal to the first bit here times the first bit here plus the second bit here times the second bit here. So it's going to be A11 times B12 plus A12 times B22. You can see the orange matches the orange. Ditto for over here: this would be equal to A31 times B13 plus A32 times B23, and so forth for every part. Here's a great picture of that in action.
If you look at matrixmultiplication.xyz, another way to think of it is that we can kind of flip the second matrix over on top and then multiply each bit together and add them up. Multiply each bit together and add them up. And you can see the second one always ends up in the second spot and the first one ends up in the first spot. That's what matrix multiplication is. So we can do our multiply and add up by using matrix multiplication. And in Python, and therefore PyTorch, matrix multiplication is the at sign operator. So when you see @, that means matrix multiply. So here is our 20.2336. If I do a matrix multiply of our training set by our weights and then add the bias, here is our 20.2336 for the first one. And you can see, though, it's doing every single one. So that's really important: matrix multiplication gives us an optimized way to do these simple linear functions for as many rows and columns as we want. So this is one of the two fundamental equations of any neural network: some rows and columns of data, matrix multiply, some weights, add some bias. The second one, which we'll see in a moment, is an activation function. So that is some predictions from our randomly initialized model. So we can check how good our model is. And to do that, we can decide that anything greater than zero we will call a three, and anything less than zero we will call a seven. So preds greater than zero tells us whether or not something is predicted to be a three. Then turn that into a float, so rather than true and false, make it one and zero, because that's what our training set contains. And then check whether our thresholded predictions are equal to our training set. This will return true every time a row is correctly predicted and false otherwise. So if we take all those trues and falses and turn them into floats, so that'll be ones and zeros, and then take their mean, it's 0.49.
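What the @ operator computes can be spelled out with explicit loops. This is purely a conceptual sketch of the definition; in practice you would never write matrix multiplication this way, because the whole point is to let the optimized tensor library do it.

```python
# A sketch of what A @ B computes: output[i][j] is the sum over k of
# A[i][k] * B[k][j], i.e. multiply the bits together and add them up.
def matmul(A, B):
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]

A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]
C = matmul(A, B)   # row one column two is A11*B12 + A12*B22 = 1*6 + 2*8
```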
So, not surprisingly, our randomly initialized model is right about half the time at predicting threes from sevens. I added one more method here, which is .item. Without .item, this would return a tensor, a rank zero tensor: it has no rows, it has no columns, it's just a number on its own. But I actually wanted to unwrap it to create a normal Python scalar, mainly because I wanted to easily see the full set of decimal places. And the reason for that is I want to show you how we're going to calculate the derivative of the accuracy by changing a parameter by a tiny bit. So let's take one parameter, which will be weight zero, and multiply it by 1.0001. And so that's going to make it a little bit bigger. And then if I calculate how the accuracy changes based on the change in that weight, that will be the gradient of the accuracy with respect to that parameter. So I can do that by calculating my new set of predictions, and then I can threshold them, and then I can check whether they're equal to the training set, and then take the mean, and I get back exactly the same number. So remember that gradient is equal to rise over run, if you remember back to your calculus, or if you'd forgotten your calculus, hopefully you've reviewed it on Khan Academy. So the change in the y, y new minus y old, which is 0.4912, etc., minus 0.4912, etc., which is 0, divided by this change, will give us 0. So at this point we have a problem. Our derivative is 0, so we have 0 gradients, which means our step will be 0, which means our predictions will be unchanged. And with a gradient of 0, we can't take a step and we can't get better predictions.
And so, intuitively speaking, the reason that our gradient is 0 is that when we change a single pixel's weight by a tiny bit, we might not change any actual prediction from a 3 to a 7 or vice versa, because we have this threshold. In other words, our accuracy loss function here is very bumpy: it's flat, step, flat, step, flat, step. So it's got this zero gradient all over the place. So what we need to do is use something other than accuracy as our loss function. Let's try to create a new function. What this new function is going to do is give us a better value in much the same way that accuracy gives a better value. Remember, for a loss, small is better. So it will give us a lower loss when the accuracy is better, but it won't have a zero gradient: a slightly better prediction needs to have a slightly better loss. So let's have a look at an example. Let's say our targets, our labels, are one, zero, one. There are just three rows, three images here. And we've made some predictions from a neural net, and those predictions gave us 0.9, 0.4, 0.2. So now consider this loss function. For the loss function, we're going to use torch.where, which is basically the same as this list comprehension; it's basically an if statement. It's going to say: where target equals one, return one minus predictions. So here target is one, so it'll be one minus 0.9. And where target is not one, it'll just be predictions. So for these examples here, the first one, target equals one, will be one minus 0.9, just 0.1. The next one, target equals zero, so it'll be the prediction, just 0.4. And then for the third one, the target is one, so it'll be one minus the prediction, which is 0.8.
And so you can see here, when the prediction is correct, in other words it's a high number when the target is one and a low number when the target is zero, these numbers are going to be smaller. So the worst one is when we predicted 0.2: we really thought that was actually a zero, but it's actually a one. So we ended up with a 0.8 here, because this is one minus the prediction, and one minus 0.2 is 0.8. We can then take the mean of all of these to calculate a loss. So if you think about it, this loss will be the smallest if the predictions are exactly right. If the predictions were actually identical to the targets, then this would be zero, zero, zero. Whereas if they were exactly wrong, it would be one, one, one. So the loss will be better, i.e. smaller, when the predictions are closer to the targets. And so here we can now take the mean, and when we do, we get 0.433. Let's say we change this last bad one, this inaccurate prediction, from 0.2 to 0.8, and the loss gets better, from 0.433 to 0.233. So this is just this function: torch.where, then .mean. And this is actually pretty good. This is a loss function which pretty closely tracks accuracy: when the accuracy is better, the loss will be smaller. But it also doesn't have these zero gradients, because every time we change the prediction, the loss changes, because the prediction is literally part of the loss. That's pretty neat, isn't it? One problem is that this is only going to work well as long as the predictions are between zero and one; otherwise this one minus prediction thing is going to look a bit funny. So we should try to find a way to ensure that the predictions are always between zero and one.
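That worked example can be sketched in plain Python, with a conditional expression standing in for torch.where; the name where_loss is made up for illustration.

```python
# A sketch of the torch.where loss: where the target is 1, take
# 1 - prediction; otherwise take the prediction; then take the mean.
def where_loss(preds, trgts):
    picked = [1 - p if t == 1 else p for p, t in zip(preds, trgts)]
    return sum(picked) / len(picked)

trgts = [1, 0, 1]
loss_bad = where_loss([0.9, 0.4, 0.2], trgts)     # mean of 0.1, 0.4, 0.8
loss_better = where_loss([0.9, 0.4, 0.8], trgts)  # the 0.2 fixed to 0.8
```

The first call gives roughly 0.433 and the second roughly 0.233, matching the numbers in the lesson: a better prediction gives a smaller loss.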
And that's also going to make a lot more intuitive sense, because we like to be able to think of these as if they're probabilities, or at least nicely scaled numbers. So we need some function that can take these big numbers and turn them all into numbers between zero and one. And it so happens that we have exactly the right function. It's called the sigmoid function. The sigmoid function looks like this. If you pass in a really small number, you get a number very close to zero. If you pass in a big number, you get a number very close to one. It never gets past one, and it never goes below zero. And it's got this smooth curve in between, and in the middle it looks a lot like the y equals x line. This is the definition of the sigmoid function: it's one over one plus e to the minus x. What is exp? torch.exp is just e to the power of something. And e is just a number, like pi; it's just a number that has a particular value. So if we go e squared and compare it to torch.exp of two, converted from a tensor to a float, you can see that these are the same number. So that's what torch.exp means. So for me, when I see these kinds of interesting functions, I don't worry too much about the definition. What I care about is the shape. You can have a play around with graphing calculators or whatever to see why it is that you end up with this shape from this particular equation, but for me, I just never think about that. It never really matters to me. What's important is this sigmoid shape, which is what we want: something that squashes every number to be between 0 and 1. So we can change mnist_loss to be exactly the same as it was before, except that first we put everything through sigmoid and then use torch.where. So that is a loss function that has all the properties we want.
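Here's a plain-Python sketch of sigmoid and of the updated loss, with math.exp standing in for torch.exp and a list comprehension standing in for torch.where:

```python
# A sketch of sigmoid, 1 / (1 + e^-x), and of mnist_loss with the
# sigmoid applied first so the where-style loss only ever sees
# predictions between 0 and 1.
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def mnist_loss(preds, trgts):
    preds = [sigmoid(p) for p in preds]   # squash raw scores into (0, 1)
    picked = [1 - p if t == 1 else p for p, t in zip(preds, trgts)]
    return sum(picked) / len(picked)
```

Because sigmoid(0) is exactly 0.5, a raw prediction of zero sits right at the decision boundary, which matters for the accuracy threshold later.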
It's something which is not going to have any of those nasty zero gradients, and we've ensured that the input to the where is between 0 and 1. So the reason we did this is that accuracy is what we really care about, a good accuracy, but we can't use it to get our gradients to create our step to improve our parameters. So we change our accuracy to another function that is similar, in the sense that it's better when the accuracy is better, but it also does not have these zero gradients. And so you can see now why we have a metric and a loss. The metric is the thing we actually care about. The loss is the thing that's similar to what we care about but has a nicely behaved gradient. Sometimes the thing you care about, your metric, does have a nicely behaved gradient and you can use it directly as a loss; for example, we often use mean squared error. But for classification, unfortunately not. So we need to now use this to update the parameters. And there's a couple of ways we could do this. One would be to loop through every image, calculate a prediction for that image, calculate a loss, do a step, update the parameters, and then do that again for the next image and the next image and the next image. That's going to be really slow, because we're doing a single step for a single image, so an epoch would take quite a while. We could go much faster by doing every single image in the dataset at once: one big matrix multiplication, which can all be parallelized on the GPU. And then we could do a step based on the gradients looking at the entire dataset. But now that's going to be a lot of work to just update the weights once. And remember, sometimes our datasets have millions or tens of millions of items. So that's probably a bad idea too. So why not compromise? Let's grab a few data items at a time to calculate our loss and our step.
If we grab a few data items at a time, those few data items are called a mini batch. A mini batch just means a few pieces of data, and the size of your mini batch is called, not surprisingly, the batch size. The bigger the batch size, the closer you get to the full size of your dataset, the longer it's going to take to calculate a single set of losses, a single step, but the more accurate it's going to be: the gradients are going to be much closer to the true whole-dataset gradients. And the smaller the batch size, the faster each step will be, but those steps will represent a smaller number of items, and so they won't be such an accurate approximation of the real gradient of the whole dataset. Question: is there a reason the mean of the loss is calculated, rather than, say, the median, since the median is less prone to being influenced by outliers? In the example you gave, if the third point, which was wrongly predicted, were an outlier, then the derivative would push the function away while doing SGD, and a median could be better in that case. Honestly, I've never tried using a median. The problem with a median is that it ends up really only caring about one number, which is the number in the middle, so it could end up pretty much ignoring all of the things at each end. In fact, all it really cares about is the order of things. So my guess is that you would end up with something that is only good at predicting the one thing in the middle. But I haven't tried it; it would be interesting to see. I guess the other thing that would happen with a median is that, because it's picking the thing in the middle, and as you change your values the thing in the middle can suddenly jump to being a different item, you would get bumpy gradients. So it might not behave very well. That's my guess. You should try it.
Okay, so how do we ask for a few items at a time? It turns out that PyTorch and fastai provide something to do that for you. You can pass any dataset to this class called DataLoader, and it will grab a few items from that dataset at a time. You can ask for how many by passing a batch size. And then, as you can see, it will grab a few items at a time until it's grabbed all of them. So here I'm saying: let's create a collection that just contains all the numbers from 0 to 14, and let's pass that into a DataLoader with a batch size of five. That gives us something called an iterator in Python, something that you can ask for one more thing from. If you pass an iterator to list in Python, it returns all of the things from the iterator. So here are my three mini batches, and you'll see all the numbers from 0 to 14 appear. They appear five at a time, and they appear in a random order because shuffle equals true. Normally in the training set we ask for things to be shuffled, so it gives us a little bit more randomization. More randomization is good because it makes it harder for the model to just learn what the dataset looks like. So that's how a DataLoader is created. Now remember, though, that our datasets actually return tuples, and here I've just got single ints. So let's actually create a tuple dataset. If we enumerate all the letters of English, that returns (0, 'a'), (1, 'b'), (2, 'c'), and so on. Let's make that our dataset. If we pass that to a DataLoader with a batch size of six, as you can see it returns tuples containing six of the first things and the associated six of the second things. So this is like our independent variable and this is like our dependent variable. And then at the end, the batch size won't necessarily divide nicely into the full size of the dataset, so you might end up with a smaller batch.
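What a DataLoader does can be sketched in a few lines of plain Python. The name simple_dl is made up for illustration; it is not the fastai or PyTorch class, and it skips everything those classes do beyond shuffling and chunking.

```python
# A pure-Python sketch of a DataLoader: optionally shuffle the dataset,
# then yield it back one batch at a time, with a possibly smaller
# final batch when the batch size doesn't divide the dataset evenly.
import random

def simple_dl(dataset, bs, shuffle=False):
    items = list(dataset)
    if shuffle:
        random.shuffle(items)
    for i in range(0, len(items), bs):
        yield items[i:i + bs]

batches = list(simple_dl(range(15), bs=5, shuffle=True))
# three batches of five, covering 0..14 in a random order

letter_ds = list(enumerate("abcdefghijklmnopqrstuvwxyz"))
letter_batches = list(simple_dl(letter_ds, bs=6))
# 26 items with a batch size of 6: four full batches, then a batch of 2
```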
So basically, then, we already have a dataset, remember. So we could pass it to a DataLoader and then we can basically say this: an iterator in Python is something that you can loop through. So when we say for in DataLoader, it's going to return a tuple, and we can destructure it into the first bit and the second bit. That's going to be our x and y. We can calculate our predictions, we can calculate our loss from the predictions and the targets, we can ask it to calculate our gradients, and then we can update our parameters, just like we did in our toy SGD example for the quadratic equation. So let's re-initialize our weights and bias with the same two lines of code as before. Let's create the DataLoader, this time from our actual MNIST dataset, with a nice big batch size so we do plenty of work each time. And just to take a look, let's grab the first thing from the DataLoader. first is a fastai function which just grabs the first thing from an iterator. It's just useful for looking at an arbitrary mini-batch. So here is the shape we're going to have: the first mini-batch is 256 rows of 784 long, that's 28 by 28. So 256 flattened-out images, and 256 labels that are one long, because each is just the number zero or the number one depending on whether it's a three or a seven. Do the same for the validation set, so here's our validation DataLoader. And for our testing, I'm going to just manually grab the first four things, so that we can make sure everything lines up. So let's grab just the first four things; we'll call that a batch. Pass it into that linear function we created earlier. Remember, linear was just x batch at weights, matrix multiply, plus bias. And so that's going to give us four results: a prediction for each of those four images.
And so then we can calculate the loss using that loss function we just wrote, against the first four items of the training set, and there's the loss. Okay, and so now we can calculate the gradients. The gradients are 784 by one; in other words, a column where every weight has a gradient: what's the change in loss for a small change in that parameter. And the bias has a gradient that's a single number, because the bias is just a single number. So we can take those three steps and put them in a function, calc_grad: you pass it an x batch and a y batch and some model, and it's going to calculate the predictions, calculate the loss, and do the backward step. And here we call calc_grad, and, just to take a look, print the mean of the weights gradient and the bias gradient. And there it is. If I call it a second time, and notice I have not done any step here, these are exactly the same parameters, I get a different value. That's a concern. You would expect to get the same gradient every time you called it with the same data. Why have the gradients changed? That's because loss.backward does not just calculate the gradients: it calculates the gradients and adds them to the existing gradients, the things in the .grad attribute. The reasons for that will come to you later, but for now the thing to know is just that it does that. So what we actually need to do is call grad.zero_. Remember, the trailing underscore means it does it in place, so that updates the weights.grad attribute, which is a tensor, to contain zeros. So now if I do that and call it again, I will get exactly the same number. So here is how you train one epoch with SGD: loop through the data loader, grabbing the x batch and the y batch; calculate the gradient (prediction, loss, backward); go through each of the parameters, which we're going to be passing in.
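That accumulate-then-zero behaviour can be mimicked in plain Python. This toy Param class and backward function are illustrative only, not PyTorch's API; they just reproduce the fact that each backward pass adds to the stored gradient rather than replacing it.

```python
# A sketch of why we must zero gradients: like loss.backward(), this toy
# backward() *adds* to the stored .grad rather than overwriting it.
class Param:
    def __init__(self, data):
        self.data, self.grad = data, 0.0
    def zero_grad(self):
        self.grad = 0.0

def backward(param, grad):
    param.grad += grad        # accumulate, mimicking PyTorch's autograd

w = Param(1.0)
backward(w, 0.5)
backward(w, 0.5)              # same data, but the stored gradient doubles
grad_accumulated = w.grad     # 1.0, not 0.5

w.zero_grad()
backward(w, 0.5)
grad_fresh = w.grad           # 0.5 again once we zero first
```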
So those are going to be the 784 weights and the one bias. And then for each of those, update the parameter: p minus equals gradient times learning rate. That's our gradient descent step. And then zero it out for the next time around the loop. I'm not just saying p minus equals; I'm saying p.data minus equals. And the reason for that is that, remember, PyTorch keeps track of all of the calculations we do so that it can calculate the gradient. Well, I don't want my gradient descent step included in the gradient calculation; that's not part of the model. So .data is a special attribute in PyTorch where, if you write to it, it tells PyTorch not to include that calculation in the gradients. So this is your most basic standard SGD, stochastic gradient descent, loop. So now we can answer that earlier question. The difference between stochastic gradient descent and gradient descent is that gradient descent does not have this loop through each mini batch. For gradient descent, it does it on the whole dataset each time around. So train epoch for gradient descent would simply not have the for loop at all; instead it would calculate the gradient for the whole dataset and update the parameters based on the whole dataset, which we never really do in practice. We always use mini batches of various sizes. Okay, so we can take the function we had before, where we used to be comparing the predictions to whether they were greater or less than zero. But now that we're doing the sigmoid, remember, the sigmoid will squish everything to between 0 and 1, so now we should compare the predictions to whether they're greater than 0.5 or not. Just look back at our sigmoid function: what used to be zero is, after the sigmoid, 0.5. So we just need to make that slight change to our measure of accuracy.
So to calculate the accuracy for some x batch and some y batch (assume the x batch here is actually the predictions): we take the sigmoid of the predictions, we compare them to 0.5 to tell us whether it's a three or not, we check what the actual target was to see which ones are correct, and then we take the mean of those after converting the Booleans to floats. So we can check that accuracy: take our batch, put it through our simple linear model, compare it to the four items of the training set, and there's the accuracy. Then, if we do that for every batch in the validation set, we can loop through, with a list comprehension, every batch in the validation set, get the accuracy based on some model, and stack those all up together. This is a list, right? If we want to turn that list into a tensor whose items are the items of the list, that's what stack does. So we can stack up all those, take the mean, convert it to a standard Python scalar by calling .item, and round it to four decimal places just for display. And so here is our validation set accuracy. As you would expect, it's about 50 percent, because it's random.
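That accuracy metric can be sketched in plain Python, again with math.exp standing in for the tensor version; the numbers below are made up for illustration.

```python
# A sketch of batch_accuracy: sigmoid the raw predictions, threshold at
# 0.5 (which corresponds to a raw prediction of 0), compare each result
# to its target, and take the mean of the resulting 1s and 0s.
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def batch_accuracy(preds, trgts):
    correct = [float((sigmoid(p) > 0.5) == (t == 1))
               for p, t in zip(preds, trgts)]
    return sum(correct) / len(correct)

# two correct (2.0 -> three, -1.0 -> seven), two wrong
acc = batch_accuracy([2.0, -1.0, 0.3, -0.5], [1, 0, 0, 1])
```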
So we can now train for one epoch. Remember, train_epoch needed the parameters, and our parameters in this case are the weights tensor and the bias tensor. So train one epoch using the linear1 model, with a learning rate of one, with these two parameters, and then validate — and look at that, our accuracy is now 68.8 percent. We've trained an epoch. So let's just repeat that 20 times — train and validate — and you can see the accuracy goes up and up and up, to about 97 percent. So that's cool: we've built an SGD optimizer of a simple linear function that is getting about 97 percent on our simplified MNIST, where there's just the threes and the sevens. So, a lot of steps there — let's simplify this through some refactoring. We're going to do a couple of simple refactorings, but the basic idea is we're going to create something called an optimizer class. The first thing we'll do is get rid of the linear1 function. Remember, the linear1 function does x@weights + bias. There's actually a class in PyTorch that does that equation for us, so we might as well use it. It's called nn.Linear, and nn.Linear does two things: it does that function for us, and it also initializes the parameters for us. So we don't have to call init_params for the weights and bias anymore; we just create an nn.Linear object, and that's going to create a weight matrix of size 28*28 by 1 and a bias of size one. It will set requires_grad=True for us — it's all encapsulated in this class — and then when I call it as a function, it's going to do my x@w + b. To see the parameters in it — we would expect it to contain 784 weights and one bias — we can just call .parameters, and we can destructure that to w,b and see: yep, it is 784 and 1 for the weights and bias. So that's cool. It could be an interesting exercise for you to create this class yourself from scratch — you should be able to at this point — so that
you can confirm that you can recreate something that behaves exactly like nn.Linear. So now that we've got this object which contains our parameters in a parameters method, we can create an optimizer. For our optimizer, we're going to pass in the parameters to optimize and a learning rate, and we'll store them away. It will have something called step, which goes through each parameter and does that thing we just saw — p.data minus-equals p.grad times learning rate — and it's also going to have something called zero_grad, which goes through each parameter and zeroes out its gradient (or we could even just set it to None). That's the thing we're going to call BasicOptim. Those are exactly the same lines of code we've already seen, wrapped up into a class. So we can now create an optimizer, passing in the parameters of the linear model and our learning rate. And so now our training loop is: loop through each mini-batch in the data loader, calculate the gradient, opt.step, opt.zero_grad — that's it. The validation function doesn't have to change. And so let's put our training loop into a function that's going to loop through a bunch of epochs, call train_epoch, print validate_epoch, and then run it. And it's the same — we're getting a slightly different result here, but it's much the same idea. Okay, so that's cool: we've now refactored by creating our own optimizer and using PyTorch's built-in nn.Linear class. And by the way, we don't actually need to use our own BasicOptim. Not surprisingly, PyTorch comes with something which does exactly this, and not surprisingly it's called SGD. Actually, this SGD is provided by fastai — fastai and PyTorch provide some overlapping functionality, and they work much the same way. So you can pass to SGD your parameters and your learning rate, just like BasicOptim, and train it and get the same result. So as you can see, these classes that are in fastai and PyTorch are not mysterious; they're just pretty — you
know — thin wrappers around functionality that we've now written ourselves. So there were quite a few steps there, and if you haven't done gradient descent before, there's a lot of unpacking. This lesson is kind of the key lesson. It's the one where we should really stop, take a deep breath, and make sure we're comfortable: what's a dataset, what's a data loader, what's nn.Linear, what's SGD? And if any or all of those don't make sense, go back to where we defined it from scratch using Python code. Well, the DataLoader we didn't define from scratch, but its functionality is not particularly interesting — you could certainly create your own from scratch if you wanted to; that would be another pretty good exercise. Let's refactor some more. fastai has a DataLoaders class which, as we've mentioned before, is a tiny class: you pass it a bunch of data loaders and it just stores them away as a .train and a .valid. Even though it's a tiny class, it's super handy, because with it we now have a single object that knows all the data we have, and so it can make sure that your training data loader is shuffled and your validation data loader isn't — make sure everything works properly. So that's what the DataLoaders class is: you pass in the training and valid data loaders. And then the next thing we have in fastai is the Learner class. The Learner class is something where we're going to pass in our data loaders, our model, our optimization function, our loss function, and our metrics — all the stuff we've just done manually. That's all Learner does: it's just going to do that for us. It's just going to call this train_model and this train_epoch; that's all inside Learner. So now if we go learn.fit, you can see it's doing the same thing, getting the same result, and it's
got some nice functionality: it's printing it out into a pretty table for us, and it's showing us the losses and the accuracy and how long it takes. But there's nothing magic, right? You've been able to do exactly the same thing by hand using Python and PyTorch. So these abstractions are here to let you write less code, and to save some time and some cognitive overhead, but they're not doing anything you can't do yourself. And that's important, because if they're doing things you can't do yourself, then you can't customize them, you can't debug them, you can't profile them. We want to make sure that the stuff we're using is stuff we understand. Now, just a linear function is not great — we want a neural network. So how do we turn this into a neural network? Remember, this is a linear function: x@w + b. To turn it into a neural network, we have two linear functions — exactly the same, but with different weights and different biases — and in between, this magic line of code, which takes the result of our first linear function and then does a max between that and zero. A max of res and zero is going to take any negative numbers and turn them into zeros. So we're going to do a linear function, we're going to replace the negatives with zero, and then we're going to take that and put it through another linear function. That, believe it or not, is a neural net. So w1 and w2 are weight tensors, b1 and b2 are bias tensors, just like before, so we can initialize them just like before, and we can now call exactly the same training code that we did before. So res.max(tensor(0.0)) is called a rectified linear unit, which you will always see referred to as ReLU. PyTorch already has this function — it's called F.relu — and if we plot it, you can see it's as you'd expect: zero for all negative numbers, and then y = x for positive numbers. So here's some jargon: rectified linear
unit. Sounds scary, sounds complicated, but it's actually this incredibly tiny line of code, this incredibly simple function. And this happens a lot in deep learning: things that sound complicated and sophisticated and impressive turn out to be, normally, super simple — at least once you know what they are. So why do we do linear layer, ReLU, linear layer? Well, if we got rid of the middle ReLU and just went linear layer, linear layer, then you could rewrite that as a single linear layer: when you multiply things and add, and then multiply things and add again, you can just change the coefficients and make it into a single multiply-and-add. So no matter how many linear layers we stack on top of each other, we can never make anything more effective than a simple linear model. But if you put a non-linearity between the linear layers, then you have the opposite: this is where something called the universal approximation theorem holds, which says that if the weight and bias matrices are big enough, this can approximate any arbitrary function — including the function of how do I recognize threes from sevens, or whatever. So that's kind of amazing, right? This tiny thing is actually a universal function approximator, as long as w1, b1, w2, and b2 have the right numbers — and we know how to make them the right numbers: you use SGD. It could take a very long time, it could take a lot of memory, but the basic idea is that there is some solution to any computable problem. And this is one of the biggest challenges a lot of beginners have with deep learning: there's nothing else to it. There's often this, like, okay, how do I make a neural net? Oh — that is a neural net. Well, how do I do deep learning training? With SGD. There are things to make it train a bit faster, things that mean you need a few less parameters, but everything from here is just performance tweaks, honestly. So this is the
key understanding of training a neural network. Okay, we can simplify things a bit more. We already know that we can use nn.Linear to replace the weights and biases, so let's do that for both of the linear layers. And since we're simply taking the result of one function and passing it into the next, and taking the result of that function and passing it to the next, and so forth, and then returning the end — this is called function composition. Function composition is when you just take the result of one function and pass it to a new one, over and over. Pretty much every neural network is just doing function composition of linear layers and these things called activation functions, or non-linearities. So PyTorch provides something to do function composition for us, and it's called nn.Sequential: it's going to do a linear layer, pass the result to a ReLU, pass the result to a linear layer. You'll see here I'm not using F.relu, I'm using nn.ReLU — this is identical, it returns exactly the same thing, but it's a class rather than a function. Yes, Rachel? "By using the non-linearity, won't using a function that makes all negative output zero make many of the gradients in the network zero, and stop the learning process due to many zero gradients?" Well, that's a fantastic question, and the answer is yes, it does. But it won't be zero for every image, and remember, the mini-batches are shuffled, so even if it's zero for every image in one mini-batch, it won't be for the next mini-batch, and it won't be the next time around when we go through another epoch. So yes, it can create zeros, and if the neural net ends up with a set of parameters such that lots and lots of inputs end up as zeros, you can end up with whole mini-batches that are zero, and you can end up in a situation where some of the neurons remain inactive — inactive means they're zero — and they're basically dead units. This is a huge problem; it basically means you're wasting computation. So there are a few tricks to
avoid that, which we'll be learning about. One simple trick is to not make this thing flat here, but just make it a less steep line — that's called a leaky ReLU, a leaky rectified linear unit — and those help a bit. As we'll learn, though, even better is to make sure that we initialize to sensible values that are not too big and not too small, and step by sensible amounts that are particularly not too big. Generally, if we do that, we can keep things in the zone where they're positive most of the time. But we are going to learn how to actually look inside a network and find out how many dead units we have, how many of these zeros we have, because, as you point out, they are bad news: they don't do any work, and they'll continue to not do any work if enough of the inputs end up being zero. Okay, so now that we've got a neural net, we can use exactly the same Learner we had before, but this time we'll pass in the simple_net instead of the linear1; everything else is the same, and we can call fit just like before. Generally, as your models get deeper — so here we've gone from one layer to two (I'm only counting the parameterized layers as layers; you could say it's three, but I'm going to call it two, since there are two trainable layers) — I've dropped my learning rate from 1.0 to 0.1, because deeper models tend to be bumpier, less nicely behaved, so often you need to use lower learning rates. And so we train it for a while. We can actually find out what that training looked like by looking inside our learner: there's an attribute we create for you called recorder, and that's going to record everything that appears in this table — well, these three things: the training loss, the validation loss, and the accuracy, or whatever metrics you have. So recorder.values contains that kind of table of results, and item number two of each row will be the accuracy. And so the
capital L class, which I'm using here, has a nice little method called itemgot, which will get the second item from every row, and then I can plot that to see how the training went. And I can get the final accuracy, like so, by grabbing the last row of the table and grabbing index two — zero, one, two. And my final accuracy: not bad, 98.3 percent. So this is pretty amazing: we now have a function that can solve any problem to any level of accuracy, if we can find the right parameters, and we have a way to find, hopefully, the best — or at least a very good — set of parameters for any function. So this is kind of the magic. Yes, we have a question: "How could we use what we're learning here to get an idea of what the network is learning along the way, like Zeiler and Fergus did, more or less?" We will look at that later — not in the full detail of their paper — but basically, you can look in the .parameters to see the values of those parameters. And at this point, well, why don't you try it yourself? You've actually got the parameters now. If you want to grab the model, you can look at learn.model to see the actual model that we just trained, and you can see it's got the three things in it: the Linear, the ReLU, the Linear. What I like to do is to put that into a variable to make it a bit easier to work with. You can grab one layer by indexing in, and you can look at the parameters — that just gives me something called a generator, something that will give me a list of the parameters when I ask for them. So I can just go w,b = to destructure them. And the weight is 30 by 784, because that's what I asked for. One of the things to note here is that to create a neural net — something with more than one layer — I actually have 30 outputs, not just one, right? So I'm kind of generating lots of — you can think of it as generating lots of features. So it's kind of like 30
different little linear models here, and then I combine those 30 back into one. So you could look at one of those by having a look here — there are the numbers in the first row — and we could reshape that into the original shape of the images, and we could even have a look. And there it is, right? So it's cool: we can actually see here we've got something which is kind of learning to find things at the top and the bottom and the middle. And we could look at the second one — okay, no idea what that's showing. Some of them are, you know — I've probably got far more than I need, which is why they're not that obvious — but you can see, here's another thing that's looking pretty similar, and here's something that's kind of looking for this little bit in the middle. So yeah, this is the basic idea. To understand the features in later layers, rather than the first layer, you have to be a bit more sophisticated, but to see the first-layer ones, you can just plot them. Okay, so then, just to compare, we could use the full fastai toolkit: grab our data loaders by using ImageDataLoaders.from_folder, as we've done before, create a cnn_learner with a ResNet, and fit it for a single epoch — and whoa, 99.7! Right? We did 40 epochs and got 98.3. As I said, using all the tricks, you can really speed things up and make things a lot better, and by the end of this course — or at least both parts of this course — you'll be able to, from scratch, get this 99.7 in a single epoch. All right, so jargon, just to remind us. ReLU: function that returns zero for negatives. Mini-batch: a few inputs and labels, which optionally are randomly selected. The forward pass is the bit where we calculate the predictions. The loss is the function that we're going to take the derivative of. The gradient is the derivative of the loss with respect to each parameter. The backward pass is when we calculate those gradients.
Gradient descent is that full thing of taking a step in the direction opposite to the gradients, after calculating the loss. And the learning rate is the size of the step that we take. Other things to know — perhaps the two most important pieces of jargon: all of the numbers in a neural network that we're learning are called parameters, and the numbers that we're calculating — every value that's calculated, every matrix multiplication element that's calculated — are called activations. So activations and parameters cover basically the entire set of numbers that exist inside a neural net. Be very careful: when I say, from here on in these lessons, activations or parameters, you've got to make sure you know what those mean. Activations are calculated; parameters are learned. We're doing this stuff with tensors, and tensors are just regularly shaped arrays. Rank-zero tensors we call scalars, rank-one tensors we call vectors, rank-two tensors we call matrices, and we continue on to rank-three tensors, rank-four tensors, and so forth. Higher-rank tensors are very common in deep learning, so don't be scared of going up to higher numbers of dimensions. Okay, so let's have a break — oh, we've got a question. "Is there a rule of thumb for what non-linearity to choose, given that there are many?" Yeah, there are many non-linearities to choose from, and it doesn't generally matter very much which you choose. So choose ReLU, or leaky ReLU, or whatever — any one should work fine. Later on we'll look at the minor differences between them, but it's not so much something that you pick on a per-problem basis; it's more that some take a little bit longer and are a little bit more accurate, and some are a bit faster and a little bit less accurate. That's a good question. Okay, so before you move on, it's really important that you finish the
questionnaire for this chapter, because there are a whole lot of concepts that we've just covered. So try to go through the questionnaire, go back and re-read the notebook, and please run the code, do the experiments, and make sure it makes sense. All right, let's have a seven-minute break — see you back here in seven minutes' time. Okay, welcome back. So now that we know how to create and train a neural net, let's cycle back and look deeper at some applications. We're going to kind of interpolate: at one end, we've done the from-scratch version, and at the other end, we've done the four-lines-of-code version, and we're going to gradually nibble at each end until we find ourselves in the middle, having touched on all of it. So let's go back up to the four-lines-of-code version and delve a little deeper. Let's go back to pets, and let's think about how you actually start with a new dataset and figure out how to use it. For the datasets we provide, it's easy enough to untar them: you just say untar_data, and that'll download it and untar it. If it's a dataset that you're getting yourself, you can just use the terminal, or Python, or whatever. So let's assume we have a path that's pointing at something. Initially, you don't know what that something is, so we can start by doing ls to have a look and see what's inside. The pets dataset that we saw in lesson one contains three things: annotations, images, and models. And you'll see we have this little trick here where we set Path.BASE_PATH to the path to our data, and that does a simple thing: when we print paths out, it shows them relative to this path, which is a bit convenient. So if you go and have a look at the README for the original pets dataset, it tells you what these images and annotations folders are. And not
surprisingly, the images path — if we go path/'images', which is how we use pathlib to grab a subdirectory — and then ls, we can see here are the paths to the images. As it mentions here, most functions and methods in fastai which return a collection don't return a Python list; they return a capital L. And a capital L, as we briefly mentioned, is basically an enhanced list. One of the enhancements is the way it prints: the representation starts by showing you how many items there are in the collection — so there are 7,394 images — and if there are more than 10 things, it truncates the output and just says "..." to avoid filling up your screen. So there are a couple of little conveniences there. And we can see from this output that, as we mentioned in lesson one, if the first letter of the file name is a capital, it means it's a cat, and if the first letter is lowercase, it means it's a dog. But this time we're going to do something a bit more complex — well, a lot more complex — which is to figure out what breed it is. And you can see the breed is everything in the file name up to the last underscore, before this number. So we want to label everything with its breed, and we're going to take advantage of this structure. The way I would do this is to use a regular expression. A regular expression is something that looks at a string and basically lets you pull it apart into its pieces in a very flexible way — it's a simple little language for doing that. If you haven't used regular expressions before, please google "regular expression tutorial" now, because this is going to be one of the most useful tools you'll come across in your life; I use them almost every day. I won't go into detail about how to use them, since there are so many great tutorials, and there are also a lot of great exercises — regex is short for regular expression.
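As a quick taste of what a regular expression can do with filenames like these, here's a small example using Python's built-in re module (the filename below is a made-up sample in the pets naming style):

```python
import re

# a hypothetical filename in the pets dataset's naming style:
# breed name, underscore, a number, then the extension
fname = "great_pyrenees_173.jpg"

# r'' is a raw string, so backslashes aren't treated specially;
# (.+) captures one or more characters, _\d+ matches the trailing
# number, and $ anchors the match at the end of the string
matches = re.findall(r'(.+)_\d+.jpg$', fname)
print(matches)  # the captured group is the breed
```

findall returns the parenthesized capture groups, so here it gives back just the breed part of the name.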
There are regex crosswords, there's regex Q&A — there are all kinds of cool regex things, and a lot of people, like me, love this tool. There's also a regex lesson in the fastai NLP course — maybe even two regex lessons. Oh yeah, sorry for forgetting about the fastai NLP course; what an excellent resource that is. So regular expressions are a bit hard to get right the first time, so the best thing to do is to get a sample string. A good way to do that is to just grab one of the filenames, so let's pop it in fname, and then you can experiment with regular expressions. re is the regular expression module in Python, and findall will grab all the parts of a regular expression that have parentheses around them. In this regular expression — and r makes a special kind of string in Python which basically says don't treat backslash as special, because normally in Python backslash-n means a new line — I'm going to capture any character, one or more times, followed by an underscore, followed by a digit one or more times, followed by anything (I probably should have used backslash dot, but that's fine), followed by the letters jpg, followed by the end of the string. And if I call that regular expression against my filename's name — oh, looks good, right? So we've checked it out. Now that that seems to work, we can create a DataBlock, where the independent variables are images and the dependent variables are categories, just like before. get_items is going to be get_image_files, we're going to split randomly as per usual, and then we're going to get the label by calling RegexLabeller, which is just a handy little fastai class that labels things with a regular expression. We can't call this particular regular expression directly on the pathlib Path object — we actually want to call it on the name attribute — and fastai has a nice little function called using_attr, which takes
this function and changes it into a function which will be applied to that attribute — so it's going to use the RegexLabeller on the name attribute. And then from that DataBlock we can create the DataLoaders as usual. There are two interesting lines here: Resize and aug_transforms. aug_transforms we have seen before, in notebook two, in the section called data augmentation: aug_transforms is the thing which can zoom in and zoom out, warp, rotate, change contrast, change brightness, and so forth, and flip — it's like giving us more data, generated synthetically from the data we already have. And we also learned about RandomResizedCrop, which is a really cool way of ensuring you get square images at the same time that you're augmenting the data. Here we have a Resize to a really large image — well, by deep learning standards, 460 by 460 is a really large image — and then we're using aug_transforms with a size, so that's actually going to use RandomResizedCrop to a smaller size. Why are we doing that? This particular combination of two steps does something which I think is unique to fastai, which we call presizing. And the best way to explain is to show you this beautiful example of PowerPoint wizardry that I'm so excited about, to show how presizing works. What presizing does, in that first step where we say resize to 460 by 460, is grab a square, and it grabs it randomly: if it's a landscape-orientation photo, it'll take the whole height and randomly grab somewhere from along the width; if it's a portrait orientation, then it'll take the full width and grab a random bit from top to bottom. So then we take this area here — and here it is, right — and that's what the first Resize does. And then the second bit, aug_transforms, will grab a random warped crop, possibly rotated, from in here, and will turn that into a square. So
there are two steps: first, resize to a square that's big, and then the second step does a kind of rotation and warping and zooming stage to something smaller — in this case 224 by 224. Because the first step creates something that's square and always the same size, the second step can happen on the GPU. That matters because normally things like rotating and image warping are actually pretty slow. Also, normally, doing a zoom and a rotate and a warp is really destructive to the image, because each one of those things requires an interpolation step — it's not just slow, it actually makes the image really quite low quality. So we do it in a very special way in fastai — I think it's unique — where we do all of these coordinate transforms, like rotations and warps and zooms and so forth, not on the actual pixels; instead we keep track of the changing coordinate values in a non-lossy way, with the full floating-point values, and then once, at the very end, we do the interpolation. The results are quite striking. Here is what the difference looks like — hopefully you can see this on the video. On the left is our presizing approach, and on the right is the standard approach that other libraries use, and you can see that the one on the right is a lot less nicely focused, and it also has weird things: this should be grass here, but it's actually got its bum sticking way out; this has a little bit of weird distortion; this has loads of weird distortions. So you can see the presized version really ends up way, way better. And I think we have a question, Rachel? "Are the blocks in the DataBlock an ordered list? Do they specify the input and output structures respectively? Are there always two blocks, or can there be more than two — for example, if you wanted a segmentation model, would the second block be something about segmentation?" So yes, this is an ordered list: the first item says I want to create an image, and then the second item says I
want to create a category — so that's my independent and dependent variable. You can have one thing here, you can have three things here, you can have any number of things you want. Obviously, the vast majority of the time it'll be two: normally there's an independent variable and a dependent variable. We'll be seeing this in more detail later, although if you go back to the earlier lesson where we introduced DataBlocks, I do have a picture showing how these pieces fit together. So after you've put together your DataBlock and created your DataLoaders, you want to make sure it's working correctly. The obvious thing to do for a computer vision DataBlock is show_batch, and show_batch will show you the items, and you can just make sure they look sensible — it looks like the labels are reasonable. If you add unique=True, then it's going to show you the same image with all the different augmentations; this is a good way to make sure your augmentations work. If you make a mistake in your DataBlock — in this example there's no resize, so the different images are going to be different sizes, and it'll be impossible to collate them into a batch — then you can call .summary. This is a really neat thing which will go through and tell you everything that's happening: collecting the items, how many did I find, what happened when I split them, what are the different independent and dependent variables I'm creating, let's try to create one of these; here's each step — create my image, create my category; here's what the first thing gave me, an American bulldog; the final sample is this image, this size, this category. And then eventually it says: uh-oh, it's not possible to collate your items. I tried to collate the zero-index members of your tuples — in other words, the independent variable — and this one was size 500 by 375, this one was 375 by 500; oh, I can't collate these into a tensor because they're different sizes. So this is a super great debugging tool for debugging your
DataBlocks. "I have a question: how does the item transform presize work if the Resize is smaller than the image? Is the whole width or height still taken, or is it just a random crop with the resize value?" So, if you remember back to lesson two, we looked at the different ways of creating these things: you can use squish, you can use pad, or you can use crop. If your image is smaller than the presize value, then squish will really be a zoom — it will stretch the image — and pad and crop will do much the same thing, so you'll just end up with something that looks like these, but more pixelated, at a lower resolution, because it's having to zoom in a little bit. Okay, so a lot of people say that you should do a hell of a lot of data cleaning before you model. We don't — we say model as soon as you can, because, remember what we found in notebook two: your model can teach you about the problems in your data. So as soon as I've got to a point where I have a DataBlock that's working and I have DataLoaders, I'm going to build a model. And it also tells me how I'm going: I'm getting seven percent error — wow, that's actually really good for a pets model. And at this point, now that I have a model, I can do the stuff we learned about earlier, in notebook 02, where we train our model and use it to clean the data: we can look at the confusion matrix, the top losses, the image cleaner widget, and so forth. Okay, now one interesting thing here: in notebook four, we included a loss function when we created a Learner, and here we don't pass in a loss function. Why is that? That's because fastai will try to automatically pick a sensible loss function for you, and for an image classification task, it knows which loss function is the normal one to pick, and it's done it for you. But let's have a look and see what it actually did pick. We can have a look at learn.loss_func, and we will
see that it is cross-entropy loss. Why on earth is it cross-entropy loss? I'm glad you asked; let's find out. Cross-entropy loss is really much the same as the MNIST loss we created, with that sigmoid and the one-minus-predictions and predictions, but it's a kind of extended version of that. And the extension is this: the torch.where that we looked at in notebook four only works when you have a binary outcome. In that case it was: is it a three or not? But in this case, we've got: which of the 37 pet breeds is it? So we want to create something just like that sigmoid-and-torch.where combination which also works nicely for more than two categories. Let's see how we can do that. First of all, let's grab a batch. Yes, there's a question: "Why do we want to build a model before cleaning the data? I would think a clean dataset would help in training." Yeah, absolutely, a clean dataset helps in training. But remember, as we saw in notebook 02, an initial model helps you clean the dataset. Remember how plot_top_losses helped us identify mislabeled images, the confusion matrix helped us recognize which things we were getting confused about and might need fixing, and the image classifier cleaner actually let us find things like an image that contained two bears rather than one bear, and clean it up. So a model is just a fantastic way to help you zoom in on the data that matters: which things seem to have problems, which things are most important, and so on. So you would go through and clean the data, with the model helping you, and then you'd go back and train it again with the clean data. Thanks for that great question. Okay. So in order to understand cross-entropy loss, let's grab a batch of data, which we can do with dls.one_batch; that's going to grab a batch from the training set. We could also go first(dls.train), and that's going to do exactly the same thing. And so then we can destructure that into the
independent and dependent variables. The dependent variable shows us we've got a batch size of 64: it shows us the 64 categories, and remember, those numbers simply refer to indexes into the vocab. So, for example, 16 is a boxer. And that all happens for you automatically: when we say show_batch, it shows us those strings. So here's our first mini-batch. And now we can view the predictions, that is, the activations of the final layer of the network, by calling get_preds. You can pass in a data loader, and a data loader can really be anything that returns a sequence of mini-batches, so we can just pass in a list containing our mini-batch as a data loader. That's going to get the predictions for one mini-batch. So here are some predictions. Now, if we go preds[0].sum() to grab the predictions for the first image and add them all up, they add up to one, and there are 37 of them. So that makes sense, right? The very first one is: what is the probability that this is the first thing in the vocab, an Abyssinian cat? It's 10 to the negative 6, you see, and so forth. So it's basically saying: it's not this, it's not this, it's not this. And you can look through, and, oh, this one here is obviously what it thinks it is. Now, we obviously want the probabilities to sum to one, because it would be pretty weird if they didn't: it would say that the probability of it being one of these things is more than one, or less than one, which would be extremely odd. So how do we go about creating these predictions, where each one is between zero and one and they all add up to one? To do that, we use something called softmax. Softmax is basically an extension of sigmoid to handle more than two categories. So remember, the sigmoid function looked like this, and we used that for our threes-versus-sevens model. So what if we want 37 categories rather than two categories?
We need one activation for every category. So actually, for the threes-and-sevens model, rather than thinking of it as an "is three" model, we could say: that has two categories, so let's create two activations, one representing how three-like something is, and one representing how seven-like something is. So let's say we have six MNIST digits, and the first column contains my model's values for one activation, and the second column is for a second activation. My final layer now actually has two activations: this is how much like a three it is, and this is how much like a seven it is. So this one is not at all like a three and slightly not like a seven; this one is very much like a three and not much like a seven; and so forth. So we can take that model, and rather than having one activation for "is three", we can have two activations for "how much like a three" and "how much like a seven". Now, if we take the sigmoid of that, we get two numbers between zero and one, but they don't add up to one. That doesn't make any sense: it can't be a 0.66 chance that it's a three and a 0.56 chance that it's a seven, because every digit in the dataset is one or the other. So that's not going to work. But what we could do is take the difference between this value and this value, and say that's how likely it is to be a three. In other words, this one here, with a high number here and a low number here, is very likely to be a three. So we could say: in the binary case, what really matters about these activations is the relative confidence of being a three versus a seven. So we could calculate the difference between column index zero and column index one; here's the difference between the two columns, there's that big difference. And we could take the sigmoid of that.
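This "difference of the two activations, then sigmoid" trick can be sketched in plain Python (the activation pairs here are made up for illustration; in PyTorch you'd do the same arithmetic on a tensor):

```python
import math

def sigmoid(x):
    # squashes any real number into the range (0, 1)
    return 1 / (1 + math.exp(-x))

# made-up final-layer activations: (how "3-like", how "7-like") per digit
acts = [(0.6, 0.2), (4.0, -1.5), (-2.0, 1.0)]

for three_act, seven_act in acts:
    p3 = sigmoid(three_act - seven_act)  # probability it's a 3
    p7 = 1 - p3                          # probability it's a 7
    assert abs(p3 + p7 - 1) < 1e-12      # each pair sums to exactly 1
```

Only the difference between the two activations matters, which is exactly the "relative confidence" point above.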
This now gives us a single number between zero and one. And then, since we wanted two columns, we could make column index zero the sigmoid, and column index one one minus that. And now look: these all add up to one. So here's the probability of a three and the probability of a seven; for the second one, the probability of a three and the probability of a seven; and so forth. So that's a way we could go from having two activations for every image to two probabilities, each of which is between zero and one, and each pair of which adds up to one. Great. How do we extend that to more than two columns? To extend it to more than two columns, we use this function, which is called softmax. Softmax is e to the x divided by the sum of e to the x. Just to show you: if I take the softmax of my activations, I get 0.6025 and 0.3975, exactly the same thing. So softmax in the binary case is identical to the sigmoid approach we just looked at. But in the multi-category case, we end up with something like this. Let's say we were doing teddy bear, grizzly bear, brown bear. For that, remember, our neural net's final layer will have three activations; let's say they were 0.02, negative 2.49, and 1.25. To calculate softmax, I first take e to the power of each of these three things: e to the power of 0.02, e to the power of negative 2.49, and e to the power of 1.25. Then I add them up, so there's the sum of the exps. And then the softmax will simply be 1.02 divided by 4.60, this one will be 0.08 divided by 4.60, and this one will be 3.49 divided by 4.60. Since each of these is one number divided by the sum, the total is 1. And because all of these are positive, and each one is an item divided by the sum, all of them must be between 0 and 1. So this shows you that softmax always gives you numbers between 0 and 1 that always add up to 1. To do that in practice, you can just call torch.softmax and it will give you this result.
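Here is that bear calculation written out in plain Python, so every step is visible (torch.softmax does the same thing in one call):

```python
import math

def softmax(acts):
    # exponentiate each activation, then divide by the total,
    # so every output is positive and the outputs sum to 1
    exps = [math.exp(a) for a in acts]
    total = sum(exps)
    return [e / total for e in exps]

# teddy / grizzly / brown bear activations from the example
probs = softmax([0.02, -2.49, 1.25])
print([round(p, 2) for p in probs])  # -> [0.22, 0.02, 0.76]
assert abs(sum(probs) - 1) < 1e-9
```

Note how the largest activation, 1.25, ends up with by far the largest share of the probability.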
So you should experiment with this in your own time: write it out by hand, try putting in these numbers, and see how you get back the numbers I'm claiming you'll get back. Make sure this makes sense to you. One of the interesting points about softmax is this: remember I told you that exp is e to the power of something, and e to the power of something grows very, very fast. Exp(4) is about 54.6, and exp(8) is about 2981; it grows super fast. And what that means is that if you have one activation that's just a bit bigger than the others, its softmax will be a lot bigger than the others. So, intuitively, the softmax function really wants to pick one class among the others, which is generally what you want: when you're trying to train a classifier to say which breed it is, you kind of want it to pick one and go for it. So that's what softmax does. That's not always what you want, though; sometimes at inference time you want the model to be a bit cautious. So remember that softmax isn't always the perfect approach, but it's the default, it's what we use most of the time, and it works well in a lot of situations. So that is softmax. Now, in the binary case, for MNIST threes versus sevens, this was how we calculated the MNIST loss: we took the sigmoid, and then we used either one minus that, or that, as our loss. Which is fine; as you saw, it worked. We could do exactly the same thing here, except we can't use torch.where anymore, because the targets aren't just 0 or 1; the targets could be any number from 0 to 36. But we can do the same thing by replacing torch.where with indexing. So here's an example for the binary case. Let's say these are our targets, 0 1 0 1 1 0, and these are our softmax activations, which we calculated before from some random numbers, just as a toy example. So, instead of doing torch.where, I could
do this: I could grab all the numbers from 0 to 5, and if I index into the activations with all the numbers from 0 to 5 and then my targets, 0 1 0 1 1 0, then what that's going to do is: for row 0 it'll pick 0.60, for row 1 it'll pick 0.49, for row 2 it'll pick 0.13, and so forth. This is a super nifty indexing expression which you should definitely play with. It's the trick of passing multiple things to the PyTorch indexer: the first thing says which rows to return, and the second thing says, for each of those rows, which column to return. So this is returning all the rows, and these columns, one for each row. And this is actually identical to torch.where. Isn't that tricky? And the nice thing is that we can now use it for more than just two values. So here's the fully worked-out version: I've got my threes column, I've got my sevens column, here's the target, here are the indexes 0 1 2 3 4 5, and here's what gets picked out: 0.60, 0.49, 0.13, and so forth. So this works just as well with more than two columns. For doing the full MNIST, all the digits from 0 to 9, we could have 10 columns, and we would just be indexing into the 10. Now, this thing we're doing, where we take minus our activations matrix indexed by all the numbers from 0 to n and then our targets, is exactly the same as something that already exists in PyTorch, called F.nll_loss. As you can see, exactly the same. So again, we're seeing that these things inside PyTorch and fastai are just little shortcuts for stuff we can write ourselves. NLL loss stands for negative log likelihood. Again, it sounds complex, but it's just this indexing expression. Rather confusingly, there's no log in it; we'll see why in a moment. So let's talk about logs. This loss function works quite well, as we saw in notebook 04; it's exactly the same as what we learned in notebook 04, just a different way of expressing it.
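The indexing trick can be sketched in plain Python with made-up softmax outputs (in PyTorch the same thing is written sm_acts[range(n), targ], and F.nll_loss is just the negative of the picked values, averaged):

```python
# made-up softmax outputs: each row is (prob of 3, prob of 7)
sm_acts = [[0.60, 0.40],
           [0.49, 0.51],
           [0.13, 0.87]]
targ = [0, 1, 0]  # index of the correct class for each row

# for each row i, pick out the probability of the correct class:
# the row index and the column index move together
picked = [sm_acts[i][targ[i]] for i in range(len(targ))]
print(picked)  # -> [0.6, 0.51, 0.13]

# nll_loss (despite the name, no log inside it) is the negative of
# these picked values, averaged over the batch
loss = -sum(picked) / len(picked)
```

The same expression works unchanged for 10 columns, or 37, because targ just holds bigger column indexes.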
But we can actually make it better. Remember, the probabilities we're looking at are between 0 and 1, so they can't be smaller than 0 and they can't be greater than 1. That means that if our model is trying to decide whether to predict 0.99 or 0.999, it's going to think those numbers are very close together and won't really care. But if you think about the error, if there were, say, a thousand things, then this would mean ten things are wrong and this would mean one thing is wrong; so this is really ten times better than this. So what we'd really like to do is transform the numbers between 0 and 1 to instead be between negative infinity and infinity. And there's a function that does exactly that, called the logarithm. So the numbers we have are between 0 and 1; as we get closer and closer to 0, the log goes down towards negative infinity, and at 1 the log is 0, and since our probabilities can't go above 1, the log can't go above 0. In case you've forgotten (hopefully you vaguely remember logarithms from high school), the definition is this: if you have some number y that is b to the power of a, then the logarithm is defined such that a equals log base b of y. In other words, it tells you: b to the power of what equals y? Which is not that interesting in itself, but one of the really interesting things about logarithms is this very cool relationship: log of a times b equals log of a plus log of b. We use that all the time in deep learning and machine learning, because this number here, a times b, can get very, very big or very, very small. If you multiply a lot of small things together, you'll get a tiny number; if you multiply a lot of big things together, you'll get a huge number. It can get so big or so small that the precision of your computer's floating point gets really bad. Whereas this thing here, adding, is not going to get
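That identity is easy to check numerically; any positive a and b work:

```python
import math

a, b = 1e-6, 3e-8   # two smallish numbers
prod = a * b        # multiplying drives the value toward zero fast

# log(a*b) == log(a) + log(b), up to floating-point error
assert abs(math.log(prod) - (math.log(a) + math.log(b))) < 1e-9
```

So a long chain of multiplications becomes a well-behaved sum of logs.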
out of control. So we really love using logarithms, particularly in a deep neural net, where there are lots of layers and we're multiplying and adding many times; this tends to come out quite nicely. So when we take the probabilities that we saw before, the things that came out of this function, take their logs, and take the mean, that is called negative log likelihood. And this ends up being a really nicely behaved number, because of the property of the log that we just described. So if you take the softmax, then take the log, and then pass that to nll_loss (because, remember, nll_loss doesn't actually take the log at all, despite the name), that gives you cross-entropy loss. Which leaves an obvious question: why doesn't nll_loss actually take the log? The reason is that it's more convenient computationally to take the log back at the softmax step. PyTorch has a function called log_softmax, and since it's easier, faster, and more accurate to do the log at the softmax stage, PyTorch assumes that you use log_softmax and then pass the result to nll_loss. So nll_loss does not do the log; it assumes that you've done the log beforehand. Log_softmax followed by nll_loss is the definition of cross-entropy loss in PyTorch. So that's our loss function: you can pass it some activations and some targets and get back a number. And for pretty much every one of these kinds of functions in PyTorch, you can either use the nn version as a class, like this, and then call that object as if it's a function, or you can use the function version from torch.nn.functional (usually imported as F) directly. As you can see, they give exactly the same number. People normally use the class version, and the PyTorch documentation normally uses the class version, so we'll tend to use the class version as well. You'll see that it's returning a single number, and that's because it takes the mean.
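Putting the pieces together in plain Python (the activations and targets below are made up; in PyTorch this whole pipeline is nn.CrossEntropyLoss, or F.cross_entropy as a function):

```python
import math

def log_softmax(acts):
    # log(e^a / sum(e^a)) simplifies to a - log(sum(e^a))
    log_total = math.log(sum(math.exp(a) for a in acts))
    return [a - log_total for a in acts]

def cross_entropy(batch_acts, targ):
    # per-item loss: minus the log-probability of the correct class
    # (this list is what reduction='none' would show you)
    losses = [-log_softmax(acts)[t] for acts, t in zip(batch_acts, targ)]
    # the default reduction: take the mean over the batch
    return sum(losses) / len(losses)

acts = [[0.02, -2.49, 1.25],   # the bear activations from before
        [2.00, 0.10, -1.30]]   # a second, made-up item
targ = [2, 0]                  # correct class index for each item

loss = cross_entropy(acts, targ)
assert loss > 0  # confident correct predictions drive this toward 0
```

Exponentiating the log_softmax values recovers the ordinary softmax probabilities, which is why this is exactly log-then-index-then-negate-then-mean.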
A loss needs to be, as we've discussed, the mean. But if you want to see the underlying numbers before taking the mean, you can pass in reduction='none', and that shows you the individual cross-entropy losses before the mean is taken. Okay, great. So this is a good place to stop our discussion of loss functions and such things. Rachel, were there any questions about this? "Why does the loss function need to be negative?" Well, I guess it doesn't, but we want something where the lower it is, the better, and we kind of need it to cut off somewhere. I'll have to think about this more during the week, because I'm a bit tired; let me refresh my memory when I'm awake. Okay. Now, next week (well, not for the video: next week actually happened last week, so the thing I'm about to say has actually already happened) we're going to be talking about data ethics, and I wanted to segue into that by talking about how my week has gone. Because a week or two ago, as part of a lesson, I talked about the efficacy of masks, specifically wearing masks in public, and I pointed out that the efficacy of masks seemed like it could be really high, and maybe everybody should be wearing them. And somehow I found myself as the face of a global advocacy campaign. So if you go to masks4all.co, you will find a website talking about masks, and I've been on TV shows in South Africa and the US and England and Australia, and on radio, and so on, talking about masks. Why is this? Well, it's because, as a data scientist, I noticed that the data around masks seemed to be getting misunderstood, and it seemed that that misunderstanding was possibly costing hundreds of thousands of lives: the places that were using masks seemed to have orders of magnitude fewer deaths. And one of the things we'll talk about next week is
what your role as a data scientist is. I strongly believe that it's to understand the data and then do something about it. And nobody was talking about this, so I ended up writing an article that appeared in The Washington Post, which basically called on people to really consider wearing masks; that's this article. And I was lucky: I managed to put together a pretty decent-sized team of brilliant volunteers who helped build this website, and some PR folks, and so on. But what became clear, as I was talking to politicians, senators, and staffers, was that people weren't convinced by the science. Which is fair enough, because when the WHO and the CDC are saying you don't need to wear a mask, and some random data scientist is saying that doesn't seem to be what the data shows, then if you've got half a brain you'd pick the WHO and the CDC, not the random data scientist. So I really felt that if I was going to be an effective advocate, I needed to sort the science out. And credentialism is strong, so it wouldn't be enough for me to say it; I needed to find other people to say it. So I put together a team of 19 scientists, including a professor of sociology, a professor of aerosol dynamics, the founder of an African movement that studied preventative methods for tuberculosis, a Stanford professor who studies mask disposal and cleaning methods, a group of Chinese scientists who study epidemiological modeling, a UCLA professor who is one of the top infectious disease epidemiology experts, and so forth. A kind of all-star team of people from all around the world, and I'd never met any of these people before. Well, no, that's not quite true: I knew Austin a little bit, I knew Zeynep a little bit, and I knew Lex a little bit. But
on the whole, no. And, well, Reshma, we all know she's awesome, so it's great to have a fast.ai community person there too. I tried to pull together people from as many geographies and as many areas of expertise as possible, and the global community helped me find papers about everything: how different materials work, how droplets form, epidemiology, case studies of people infecting others with and without masks, and so on. And in the last week, basically, we wrote this paper. It contains 84 citations, and we worked around the clock on it as a team, and it's out. Some of the earlier versions, three or four days ago, we sent to some governments. One of the things I looked for in this team was people who were working closely with government leaders, not just that they were scientists. So this went out to a number of government ministers, and in the last few days I've heard that it was a very significant part of decisions by governments to change their guidelines around masks. The fight's not over by any means, and in particular the UK is a bit of a holdout, but I'm going to be on ITV tomorrow and then the BBC the next day. It's kind of required stepping out to be a lot more than just a data scientist: I've had to pull together politicians and staffers, and I've had to hustle with the media to try to get coverage. And today I'm starting to do a lot of work with unions, to try to get unions to understand this. It's really a case of saying: okay, as a data scientist, in conjunction with real scientists, we've built this really strong understanding that masks are this simple but incredibly powerful tool, but that understanding doesn't do
anything unless I can effectively communicate it to decision makers. So today I was on the phone to one of the top union leaders in the country, explaining what this means. It turns out that in buses in America, the air conditioning is set up so that it blows from the back to the front, and there are actually case studies in the medical literature of how people seated downwind of an air conditioning unit in a restaurant all ended up getting sick with COVID-19. So we can see why bus drivers are dying: they're right in the wrong spot, and their passengers aren't wearing masks. So I'm trying to explain this science to union leaders so that they understand that, to keep the workers safe, it's not enough just for the driver to wear a mask; all the people on the bus need to be wearing masks as well. So all this is basically to say: as data scientists, I think we have a responsibility to study the data and then do something about it. It's not just a research exercise; it's not just a computation exercise. What's the point of doing things if they don't lead to anything? So next week we'll be talking about this a lot more. But I think this is a really interesting example of how digging into the data can lead to amazing things happening. In this case, I strongly believe, and a lot of people are telling me they strongly believe, that the advocacy work that's come out of this data analysis is already saving lives. And so I hope this might help inspire you to take your data analysis to places where it really makes a difference. So thank you very much, and I'll see you next week.