So what did we see here? We easily get a factor of 3 between different trivial ways of implementing the very same function, and this is quite typical. It is another example of why deep learning works so well: there are lots of places where we gain ourselves a factor of 3, or maybe a factor of 10, by doing certain things right. We are a really big community now, with excellent developers working on everything from the hardware side to algorithms to good ways of implementing activation functions. By being a community that collaborates, we are incredibly effective, and that is why, if you look at the players in the deep learning space, they are generally really enthusiastic about this collaborative, open, code-sharing way of running deep learning. Deep learning works so well because lots of tricks came together.

So let's dive right into classification. What is classification? We have inputs and labels, x_i and y_i, and in the case of classification the y_i happen to be integers. Now what do we need? We need the right transfer function and we need the right loss function, so let's see what we can do about them.

First, the output. I want to know, for a given input x_i, which of the possible outputs could be the right one. Ultimately that means that, at best, I can hope to have a probability distribution over these integers. Of course I hope to be right; I hope to have an algorithm that is usually right, but no algorithm is perfect, and as we discussed before, going into a regime where we can estimate how good our algorithm is, is a very good idea. So how can we convert the output of the neural network, which is continuous between minus infinity and infinity, into a vector of probabilities? Just to be clear: in the end we want a vector whose length is the number of classes and whose entries are estimates of the probability that this item belongs to each class.

For that, we use the softmax function. What do we have there? We take e to the z_i. What will that do? Large values become very large; small values, down to negative infinity, are pushed towards zero. So e to the z_i produces a value between zero and infinity. Now we divide that by the sum of the same quantity over all the possible classes, e to the z_j. What will that mean? Every value ends up between zero and one, because one is the largest possible value, reached when the sum is entirely dominated by a single e to the z_j, and all the values add up to one. These are exactly the properties of probabilities.

Now that we have a way of formulating a distribution of probabilities, what is the right cost function? In a way, we want the predicted probabilities to be close to reality. In reality, one of the classes is true and the others are not. Usually reality means I am given ground truth, say "class three was the right answer"; there can be cases where even the ground truth, or the kind of labels we get, carry uncertainty, but let's leave that aside for the moment. So the target vector y can be a one-hot encoded vector, where y_ij is one if and only if x_i belongs to class j, and otherwise it is zero.
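To make this concrete, here is a minimal PyTorch sketch of what the softmax does to raw network outputs, and of a one-hot target; the specific numbers are just made-up illustrations, not values from the lecture.

```python
import torch

# Raw network outputs ("logits") can lie anywhere between -inf and +inf.
# These three values are purely illustrative.
z = torch.tensor([2.0, -1.0, 0.5])

# Softmax: exponentiate each entry, then divide by the sum of all exponentials.
p = torch.exp(z) / torch.exp(z).sum()
print(p)        # every entry is between 0 and 1
print(p.sum())  # and they add up to 1, like probabilities should

# The same thing via the built-in:
print(torch.softmax(z, dim=0))

# A one-hot encoded target saying "this sample belongs to class 0":
y = torch.tensor([1.0, 0.0, 0.0])
```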
Now, what that means is that if I have lots of inputs with index i, the target is encoded as a matrix where, for each input i, the probability of output j is zero everywhere except at the one place where the truth is. What we want is a cost function that measures how similar the prediction and the target are, and that is the cross entropy. So let's see what we have there. On the inner side of the equation we have a sum over j of y_ij times the log of N(x_i)_j, where N is the output of the neural network. Keep in mind that all the y_ij for the wrong classes are zero and only the one for the right class is one, so this sum boils down to the log probability that the network assigns to item i being of its true class j. Only the estimate of the outcome that actually happened matters. And then, of course, we sum over all the data points, from i equals one to S, the number of samples; the loss is the negative of this, so minimizing it maximizes the log probability of the true classes. Now there is a warning that is really important: PyTorch combines softmax and cross entropy into one function. I have seen a good number of people run softmax and then feed the result into the cross entropy function in such a way that softmax was run twice, which of course won't work very well.

Okay, so this is how we can take the problem of classification and formulate it as a loss function that really makes sense to us. Now I want to give a bit of an aside, because it is very important in real-world applications. I told you before that what we often do when we do modeling is use a loss function that says how good our model is, and we want a model that ultimately works well. So we use log p, in a way, as a starting point for how good our model is. Now, if I optimize the wrong cost function, I will always do worse than if I optimize the right one. For example, the payoff function for me might be: I take my deep learning system, I build it into a production system, and whenever I get an answer right I make money for my company. The payoff function in that case is how often I get it right. So now we are in a conundrum: I like log probability because it is smooth and it will optimize really well, but my ultimate payoff function is how often I get it right, which isn't even differentiable. What could I do in this case? One standard procedure, and you will generally get a lot of mileage out of it, is this: in the end you want to optimize for what matters, but you can optimize for the log probability first and then do a second optimization where you fine-tune the model so that it gets as many of the items right as possible. You will generally gain some advantage from that.

Now we're ready. What we have is a function, the neural network up to the softmax, and a loss function, the cross entropy. We just choose an optimizer (we will learn a lot more about optimizers next week), and now it's time to take some data and train on it. We will use a simple 2D case, because we can meaningfully visualize it and learn from it. So now let's train classification on the spiral dataset.
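Here is a minimal sketch of the cross-entropy computation and of the PyTorch pitfall mentioned above: torch.nn.functional.cross_entropy already applies (log-)softmax internally, so it must be fed raw logits, not softmax outputs. The tensors below are made-up examples.

```python
import torch
import torch.nn.functional as F

# Raw network outputs for 4 samples and 3 classes (made-up numbers),
# plus the integer class labels y_i.
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 2])

# Cross entropy "by hand": -1/S * sum_i log p_{i, y_i}, where p = softmax(logits).
# Only the probability assigned to the true class of each sample matters.
log_p = torch.log_softmax(logits, dim=1)
loss_manual = -log_p[torch.arange(4), targets].mean()

# The built-in combines softmax and cross entropy in one numerically stable call:
loss_builtin = F.cross_entropy(logits, targets)  # pass logits, NOT softmax outputs

print(loss_manual, loss_builtin)  # these two agree

# The common mistake: softmaxing first and then calling cross_entropy
# effectively runs softmax twice and gives a systematically wrong loss.
loss_wrong = F.cross_entropy(torch.softmax(logits, dim=1), targets)
```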
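And to make the last step concrete, here is a minimal sketch of what training a classifier on a 2D spiral dataset could look like. The dataset generator, network size, and hyperparameters are assumptions for illustration, not the exact setup used in the lecture.

```python
import torch
import torch.nn as nn

def make_spirals(n_per_class=100, n_classes=3, noise=0.2):
    """Generate a simple 2D spiral toy dataset (an assumed generator, for illustration)."""
    X, y = [], []
    for c in range(n_classes):
        t = torch.linspace(0.0, 1.0, n_per_class)
        r = t
        theta = c * 4.0 + t * 4.0 + torch.randn(n_per_class) * noise
        X.append(torch.stack([r * torch.sin(theta), r * torch.cos(theta)], dim=1))
        y.append(torch.full((n_per_class,), c, dtype=torch.long))
    return torch.cat(X), torch.cat(y)

X, y = make_spirals()

# A small MLP: 2D input -> hidden layer -> one logit per class.
# No softmax at the end, because CrossEntropyLoss applies it internally.
model = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 3))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(1000):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

accuracy = (model(X).argmax(dim=1) == y).float().mean()
print(f"final loss {loss.item():.3f}, training accuracy {accuracy.item():.3f}")
```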