Before we move on to regression problems, I want to stop and give you a little insight into how we calculate a loss function in a classification problem. With numbers, of course, it's very easy to state what the difference between two values is. There's a variety of ways we can do this: mean squared error, mean absolute error, et cetera. But what do we do if something is either a nodule or not a nodule, a cancerous growth or not a cancerous growth? Or, when it comes to imaging, is it a bus, a stop sign, another car? How do you calculate that difference? If the target value, the actual value, was a bus and our model predicted a stop sign, how do we calculate that loss? We do it through cross entropy, and we have the binary and the categorical cross entropy loss functions that you've seen before. So let's peek behind the curtain a little; I just want to give you some intuition about this.

We're going to start with the concept of entropy. Entropy is borrowed from physics, where it is a measure of the chaos, the disorder, of a system. If a system is left to its own devices, with no external energy input, its entropy is going to increase, so the chaos is going to increase. The example I use here is a house. If I start with a heap of bricks and bags of cement just lying there, it's not going to build a house by itself. You have to put some energy into that system. But if you start with a house and just leave it, it's slowly going to decay until you have nothing but rubble, with the bricks and the cement dust lying where the house used to be. So the entropy increases. We borrow that same concept here: if we have information about something, how can we quantify it? So imagine I have these 12 elements.
I've just grouped them all together, but they needn't be in any specific order. So I have six cars, four buses, one human, and one stop sign. You can imagine this is some classification problem for self-driving cars, and these are 12 elements in an image, or in different images. Now I draw one of these at random. Six out of the 12 are cars, so there's a high probability, 0.5, of it being a car. Four out of 12, a third, are buses. The human is one out of 12, and the stop sign is also one out of 12. So each class has a different probability of being drawn. Once I have drawn one, you have to ask me questions to find out what it was. How can you structure those questions to get to what I drew at random?

Here is one clever way to go about it. These are binary, yes-or-no questions. First you ask, is it a car? If the answer is yes, then it was a car. If it's no, it can be any one of the other three. The best second question is, is it a bus? Because if it's yes, then it is a bus. If it's no, you only need one more question, and you can ask either human or stop sign. If you ask human and the answer is yes, it is a human, and if no, it's a stop sign. So you had to ask one question for the class with the highest probability, two questions to find out it was a bus, and three questions before you found out whether it was a human or a stop sign.

We can put those together. I had to ask one question for something that had a probability of 0.5, so that's one times 0.5. Plus two questions for the thing that had a probability of a third, so two times a third. Plus three questions times the probability of one over 12, and another three times the probability of one over 12.
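The weighted sum just described can be sketched in a few lines of R. This is only an illustration: the variable names `p`, `questions` and `expected_questions` are not from the lecture and are assumed here.

```r
# Probability of drawing each class from the 12 elements
# (6 cars, 4 buses, 1 human, 1 stop sign).
p <- c(car = 6 / 12, bus = 4 / 12, human = 1 / 12, stop_sign = 1 / 12)

# Number of yes/no questions needed to identify each class
# with the questioning strategy described above.
questions <- c(car = 1, bus = 2, human = 3, stop_sign = 3)

# Average number of questions: each count weighted by its probability.
expected_questions <- sum(questions * p)
print(expected_questions)  # 1.666667
```

The result, one and two thirds, is the average number of questions for this particular strategy, not yet the theoretical minimum.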
And that gives me this value, the average number of questions. The idea comes from information theory. It says that if I did this many times over with those probabilities, you would have had to ask, on average, about one and two thirds questions to find out what I drew at random. That is the information locked away in the set.

Now we can look at it from a theoretical point of view, where we calculate the theoretical minimum number of questions. If all the probabilities here were powers of two, such as a half, a quarter, an eighth, then these two values, the average number of questions and the theoretical entropy, would work out exactly the same. But if not every probability is a power of two, the theoretical minimum at the bottom is going to be smaller. Here we can see about 1.626, which is slightly less than the 1.667.

And here's the equation for entropy. It says take the negative of the sum of all these products. Each product is a probability times the log base two of that probability. So I take a half times log base two of a half, then a third times log base two of a third, then a twelfth times log base two of a twelfth, and add to that another twelfth times log base two of a twelfth. So I plug in those four probabilities, form each product, add them all up, and put a negative sign at the front. The log of a fraction between zero and one is negative, and we want a positive value, so we put the negative out there. The vector D just refers to the vector formed by these probabilities: a half, a third, a twelfth and a twelfth.

So let's build a little function here inside of R. I'm going to call my function entropy. I'm going to use the assignment operator and say that I'm specifying a function.
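A minimal sketch of the function being built here, reconstructed from the spoken description rather than copied from the lecturer's screen. It assumes the probabilities are passed as a numeric vector `d` with no zero entries.

```r
# Entropy in bits of a vector of probabilities d.
entropy <- function(d) {
  x <- 0                            # running sum, starts at zero
  for (i in d) {                    # i is each probability in turn
    x <- x + i * log(i, base = 2)   # add p * log2(p)
  }
  -x                                # return minus the sum
}

# Probabilities for car, bus, human and stop sign.
d <- c(1 / 2, 1 / 3, 1 / 12, 1 / 12)
print(entropy(d))  # 1.625815
```

The loop mirrors the way the function is talked through; a vectorised `-sum(d * log2(d))` would give the same value.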
It's going to take a single argument, D. Inside, I create an internal variable called X, starting at zero. Then, for I in D, that is, for each of the elements in the D that I pass to my function, I take X, whatever it is now, and add to it I times the log base two of I. X is zero at the start, and the first I is one over two, so the first term is one over two times log base two of one over two. When I run through the loop again, X has a new value, and I add the term for the second element, then the third, then the fourth. Finally I return negative X, which is minus the sum of all of these. So if I run entropy and put my vector in there, it comes out to 1.625815. That's the entropy, the theoretical minimum average number of questions, or bits, required to find out what was drawn.

From there, we go to the idea of cross entropy. Cross, because we are not looking at a single distribution of probabilities here but comparing two distributions. And remember multi-hot encoding and one-hot encoding. Let's suppose that one of my target values, the actual value, is zero, one, zero. So it has three elements in it: the number of classes, the sample space of my target variable, has three elements, and in this case it was the second one. Maybe the classes were car, bus, and stop sign, and this one was a bus because the second element is one. So I'm taking a categorical variable and changing it into a vector. And my predicted values might be something like 0.1, 0.8, 0.1. We've got to see the target as a distribution and the prediction as a distribution, and we're just going to measure the difference between those two distributions. The way we do that is very similar to before. We take each of the elements in my vector of actual values times the natural log of the corresponding value in my predicted vector, add them up, and take the negative. Why natural log and not log base two?
Well, we're not asking binary questions here. And remember the change of base rule: log base A of B is just log B divided by log A. If A is two, I'm just dividing by a constant, the log of two. So there's little point in asking for log base two, because all you're really doing is dividing by this constant. Usually we just stick with the natural log, and that's why I put ln in there, to indicate the natural log.

So remember, this is my target and this is my predicted probability. It's very rare that one of the predicted values would be exactly zero, but you can't take the natural log of zero, so there will be code inside TensorFlow, or whatever framework you're using, that provides a little escape to still give you a value back should one of them be zero.

To create that little cross entropy function, you take two vectors, P and P hat; I'll just name them p and phat. Inside is exactly the equation that we have up there: we iterate over all of the elements to build up the sum, and return the negative of it. That's one way to do it; there are many other ways.

If I pass in my two vectors, zero, one, zero and 0.1, 0.8, 0.1, we see a categorical cross entropy value of 0.22. And from there we can do backpropagation and gradient descent to reduce this error, this loss, and that loss is the categorical cross entropy. Remember that for categorical cross entropy we need the derivative of the log function, and the derivative of the log of a value is one over that value. So the backpropagation is definitely not that difficult.

So that's it. I just wanted to give you some intuition, some idea of where we get this difference between two categorical values, and the easiest way to do it is in this format of categorical cross entropy.
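The cross entropy function described above might look something like the following sketch in R. The names `p` and `phat` follow the lecture. The `eps` argument is an assumption added here to stand in for the "little escape" a framework would provide when a predicted value is zero; it is not how any particular framework is known to implement it.

```r
# Categorical cross entropy between a one-hot target p
# and a vector of predicted probabilities phat.
cross_entropy <- function(p, phat, eps = 1e-12) {
  # Assumption: keep predictions away from exactly 0 so log() stays finite.
  phat <- pmax(phat, eps)
  x <- 0
  for (i in seq_along(p)) {
    x <- x + p[i] * log(phat[i])   # natural log by default
  }
  -x                               # return minus the sum
}

p    <- c(0, 1, 0)         # target: the second class (bus)
phat <- c(0.1, 0.8, 0.1)   # model's predicted distribution
print(cross_entropy(p, phat))  # 0.2231436

# Derivative of the loss with respect to each predicted value:
# d/dphat of -p * log(phat) is -p / phat.
grad <- -p / phat
```

Only the term for the true class contributes, so the loss here is simply the negative natural log of 0.8, and its gradient with respect to that prediction is minus one over 0.8.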