In this chapter, we will see in greater detail how the loss function takes its place in our learning system. We will start by drawing a basic diagram where we can identify our model, our input data, and the loss function, which can be expressed as a function of the output of the model and the label, or as a function of the parameters of the model. We will then revisit the mean squared error (MSE) loss function, discuss how to use it to perform regression and classification, and see the drawbacks of using MSE for classification. To cope with these limitations, we will introduce the sigmoid plus cross entropy loss function, which lets us train on classification tasks, and eventually the softmax plus negative log-likelihood loss, which is similar to sigmoid plus cross entropy but carries an additional assumption: that the output of the network is a conditional class probability distribution defined across the classes. At the end we will see how to craft a totally new loss function for a completely different task, recognizing identities across different poses of faces, for which we will introduce the triplet embedding loss. We will give the motivations for specifying a new loss function, show how to come up with its mathematical expression, and finally arrive at a working implementation.

So let's draw our basic diagram and see where the loss function takes its place. We have an input $x$ and a label $y$, which together form our input data. The $x$ is fed into a block which I call the network, and out comes a hypothesis $h_\Theta(x)$, based on the parameters of the network $\Theta$ and the specific input $x$ we fed. Then we define our loss function, which is fed with our prediction and our label: I can write it as $\mathcal{L}(h_\Theta(x), y)$, a function of the prediction on the example $x$ and the current label $y$. By definition this is a scalar. Since both the input $x$ and the label $y$ are fixed, instead of writing the loss in terms of the prediction and the label we usually write it as $J(\Theta)$, a function of the network parameters. With gradient descent we compute the derivative of $J$ with respect to the parameters and use it to minimize $J$, via the update

$$\Theta \leftarrow \Theta - \eta \, \frac{\partial J}{\partial \Theta},$$

where $\eta$ is the learning rate. (Strictly speaking I should use a transposition if I use the Jacobian notation, but it is not relevant here.)

The only loss function we have seen so far is the MSE, or mean squared error, which I can write as

$$J(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} e^{(i)},$$

a summation over the $m$ samples of per-sample squared errors $e^{(i)}$. That is why we also call it an error function: we specify a per-sample error $e^{(i)}$, and the loss $J$ is an average across all these errors. This can also be written as

$$J(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} \left\| h_\Theta(x^{(i)}) - y^{(i)} \right\|^2.$$
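To make this concrete, here is a minimal NumPy sketch of the MSE loss and one gradient-descent step. The simple linear model and all the toy numbers are my own assumptions for illustration; any differentiable network would do in their place:

```python
import numpy as np

def mse_loss(h, y):
    # J(theta) = 1/(2m) * sum_i ||h(x_i) - y_i||^2
    m = y.shape[0]
    return np.sum((h - y) ** 2) / (2 * m)

# Toy setup: a linear model h(x) = X @ theta (an assumption for illustration)
X = np.random.randn(100, 3)            # m = 100 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])     # synthetic labels
theta = np.zeros(3)

eta = 0.1                              # learning rate
h = X @ theta                          # hypothesis on all samples
grad = X.T @ (h - y) / X.shape[0]      # dJ/dtheta for the linear case
theta = theta - eta * grad             # theta <- theta - eta * dJ/dtheta
print(mse_loss(X @ theta, y))          # loss decreases after the step
```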
And if you'd like to write this in vector notation, you would simply write $\frac{1}{2m} \sum_{i=1}^{m} \big(h_\Theta(x^{(i)}) - y^{(i)}\big)^\top \big(h_\Theta(x^{(i)}) - y^{(i)}\big)$. The last weighted summation $z$, the last-layer vector, is also called the logits of the network, and each of its components is a logit.

If we'd like to perform regression, our hypothesis on $x$ would simply be the logits: we don't apply the last sigmoid, that is, we don't apply the final non-linear function. For example, we could predict bounding boxes, where $h_\Theta(x)$ would be four numbers giving the coordinates of the bounding box; or, if we'd like to learn the steering angle of a self-driving car, the output would be a scalar, so $h_\Theta(x)$ would simply be a scalar.

For classification, instead, we have $h_\Theta(x) = a^{(L)}$, the sigmoid applied to the logits. And we saw that if we'd like to use it for multiple classes, we can use a $y$ which is simply a one-hot encoding: for example it could be $(0, 0, 1, 0)$, and this would be the label for a specific example. The outputs all belong to $(0, 1)^K$, where $K$ is the number of classes; in this case $K = 4$. Each $a_k^{(L)}$ represents the probability of class $k$, and all these classes are independent: there is no mutual exclusivity whatsoever. We will see later how to estimate a different kind of probability distribution across all classes.

Furthermore, we can add another term, $\frac{\lambda}{2m} \sum_{l} \sum_{i} \sum_{j \neq 0} \big(w_{ij}^{(l)}\big)^2$: a triple summation, over all $i$'s, over all $j$'s except the one relative to the bias term, and over all layers $l$, of the weights squared. This introduces a regularization term in order to fight overfitting, that is, high-variance models.

Using MSE to perform classification creates some problems, related to the speed of convergence of the algorithm. Say we have a sample $x$ and a scalar label $y = 1$. If the network predicts $1$, the error is $\frac{1}{2}(1-1)^2 = 0$. If instead $h_\Theta(x) = 0$, so the prediction is completely wrong, the error is $\frac{1}{2}(0-1)^2 = \frac{1}{2}$. Unfortunately, although we are far away from the correct solution, the MSE contributes just one half. Moreover, since we have applied a sigmoid, this wrong solution sits far down on the flat tail of the sigmoid curve. What does that mean? It means that when we compute the derivative, the derivative is basically zero, and the system will do nothing to improve this solution. On the contrary, if the hypothesis had been, say, one half, I would have actually been able to correct the solution towards the correct value of one. But if the output is anything very, very small, my solution won't be able to move anymore, because outside the little region near the y-axis the derivative drops to a very, very small value. Can we fight this problem? Yes, we can.
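We can check this saturation numerically. Here is a minimal sketch (toy numbers, for illustration only) of the gradient of the per-sample MSE with respect to the logit $z$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_grad_wrt_logit(z, y):
    # d/dz [ 1/2 * (sigmoid(z) - y)^2 ] = (sigmoid(z) - y) * sigmoid'(z)
    s = sigmoid(z)
    return (s - y) * s * (1 - s)

y = 1.0
print(mse_grad_wrt_logit(0.0, y))    # h = 0.5: gradient = -0.125, learning proceeds
print(mse_grad_wrt_logit(-10.0, y))  # h ~ 4.5e-5: gradient ~ -4.5e-5, almost no update
```

The second prediction is far more wrong than the first, yet its gradient is three orders of magnitude smaller: that is exactly the slow convergence described above.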
We can fix this by using a different loss function, one which drastically increases the strength of the update. More specifically, if we work out the algebra, its derivative contains a factor that is the reciprocal of the sigmoid's derivative, so the two simplify away: we end up with a factor of one, instead of a factor that goes from zero up to one fourth and back down to zero (the range of the sigmoid's derivative). This loss function is called cross entropy. Let's see how we can come up with it in an intuitive way, which will also end up having very nice mathematical consequences, cancelling out, as I said, the derivative factor due to the sigmoid. Summarizing: using MSE with a sigmoid is terribly slow, and we should never use mean squared error for classification tasks. Therefore we introduce the cross entropy loss function to speed up convergence.

Let's start with an example where my label equals one, and the hypothesis of the network, after applying the last sigmoid to the logit, outputs a one. This is just OK: we are fine, we shouldn't be penalizing, so my loss should be zero. Instead, say the label is one and I predict zero. This is bad, and therefore my loss function should tend to plus infinity. I say "tend to" plus infinity because my hypothesis will never actually be zero: it can only tend to zero, since we apply a sigmoid, which is defined on the open interval from zero to one.

So let's see how we can make this happen. On the horizontal axis we have our output $h$, going from zero to one; on the vertical axis we have the loss, which should be only positive. We said we'd like the loss to be zero when we output one, and to blow up as we go towards zero. Do we know some function that works this way? Well, I think we do: the flipped natural logarithm, $-\log h$, behaves exactly like this. And we don't care what happens beyond $h = 1$, because we are never going to reach that area. So we can use $-\log h$ for when the label is one: if my label $y$ is one, I multiply $y$ by the logarithm of my $h_\Theta(x)$, and here we have our first term.

What if $y$ instead is zero? Again we have two cases: the case in which the output is actually zero, which is fine, and the case in which we output the totally opposite solution, which is bad. In the first case the loss should equal zero, and in the second it should tend to plus infinity. Let's draw this one as well: $h$ on the horizontal axis from zero to one, the loss on the vertical axis, and we'd like something that goes to infinity near one and is zero at zero. Well, we take the function from before and flip it: it was $-\log h$, we flip it by replacing $h$ with $-h$, and then we add one, so that we are subtracting $h$ from one and shifting the chart to the right by one unit. If I had simply flipped with $-h$ I would have had the mirrored function; replacing $-h$ with $-h + 1 = 1 - h$ brings the chart one unit to the right, giving $-\log(1 - h)$. So I can finish the equation below: I can call this one the cross entropy. We sum the first term with what? With the case in which we have $1 - y$.
So if $y$ is zero, $1 - y$ becomes one, and the first term goes away; this second term is the one that is non-zero for $y = 0$. We multiply $1 - y$ by the logarithm of $1 - h_\Theta(x)$, and we put a minus in front of everything, so that the whole quantity is positive. And there we go. We have made the assumption that $h_\Theta(x)$ is the sigmoid of the logits, which belongs to the open interval $(0,1)^K$, so we don't have to worry about reaching infinity: the output never actually reaches zero or one.

So let's write it down carefully:

$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log h_{\Theta}(x^{(i)})_k + \big(1 - y_k^{(i)}\big) \log\big(1 - h_{\Theta}(x^{(i)})_k\big) \right],$$

where, as we said before, we use the assumption that $h_\Theta(x) = \sigma(z)$, the sigmoid of the logits, which tells us that $h_\Theta(x) \in (0,1)^K$.

This expression $J$ is called cross entropy, and it is basically a measure of surprise: if the network is very wrong, our surprise is very big. If we expect a label of one and instead get a prediction that is very much towards zero, the logarithm of (nearly) zero is a very large negative number; with the minus in front, the surprise for that specific example is huge. Instead, if my prediction is quite close to one, the logarithm of (nearly) one is a tiny negative number; with the minus in front it becomes a small positive number, and therefore my surprise is small. In the same way, on the right-hand side, in case I expect a zero, the first part goes away, and if I predict a zero the logarithm is the logarithm of basically one, a tiny negative number. If instead I predict a one, $1 - h$ is very close to zero, the logarithm of a number close to zero is a very large negative number, and with the minus in front it turns into a very large positive number, connected to a very high surprise. In this way we fight the saturation problem: even if we are far down one of the tails of the sigmoid, this surprise, which grows very quickly when we are very wrong, helps us move away from the plateaus where the derivative of the non-linear function is very tiny.

In addition, as in the case before, we can add one more term, $\frac{\lambda}{2m}$ times the summation, over all parameters of the network but those relative to the biases and over all layers, of the weights squared; this is also called weight decay, or regularization.
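As a sanity check, here is a minimal NumPy sketch of this sigmoid plus cross entropy loss (the clipping constant is an implementation detail of mine, added to guard against `log(0)` in floating point):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_cross_entropy(z, y, eps=1e-12):
    # z: logits, shape (m, K); y: one-hot targets, shape (m, K)
    h = np.clip(sigmoid(z), eps, 1 - eps)   # guard against log(0)
    m = y.shape[0]
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m

z = np.array([[2.0, -1.0, 0.5, -3.0]])     # logits for one sample, K = 4
y = np.array([[0.0, 0.0, 1.0, 0.0]])       # the one-hot label from the example above
print(sigmoid_cross_entropy(z, y))
```

In practice, fused implementations such as PyTorch's `BCEWithLogitsLoss` compute the same quantity directly from the logits in a numerically stable way.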
Finally, I will introduce the softmax plus negative log-likelihood criterion, which is almost the same as our cross entropy criterion. So when someone speaks about cross entropy, it's better to make sure which kind of cross entropy they are speaking about: the sigmoid plus cross entropy, or the softmax plus negative log-likelihood. The one we have just seen is the sigmoid plus cross entropy. But basically they perform the same operation, and if you work through the algebra you get a similar simplification of the derivative of the final non-linearity, the factor that would otherwise slow down training.

Let's introduce this softmax layer, which is very popular in the literature. The softmax of my logits $z$ (the last weighted summation, a vector) is a vector whose $k$-th component is

$$\text{softmax}(z)_k = \frac{e^{z_k}}{\sum_{k'=1}^{K} e^{z_{k'}}}.$$

You can easily see that the exponential of one term, divided by the summation of the exponentials of all the terms, cannot be greater than one, so each component is always lower than one. Moreover, since the exponential is a positive function, each component is also greater than zero. So the softmax layer gives results that always lie between 0 and 1, and we can simply write $\text{softmax}(z) \in (0,1)^K$. And we can say something else nice: if I perform the summation over all $k$, $\sum_{k=1}^{K} \text{softmax}(z)_k$, this is simply the first component plus the second plus the third and so on, which by construction equals 1.

So what do we see here? Every component of the softmax of the logits gives us a value between 0 and 1, and all these values sum to 1. This can be thought of as a probability distribution over all the classes of the specific problem. So if we'd like to predict the likelihood of one mutually exclusive class for a specific sample, we can use this kind of layer, because each and every output will represent the likelihood of each and every class being the correct class for the specific example. For example, this has been used for the ImageNet competition, where we have 1000 classes and we have to estimate the correct label, one out of 1000, for a given image: we can use the softmax to get the probability of each and every class being the correct one for the specific sample.
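A direct implementation follows the formula; here is a minimal sketch (the max-subtraction is a standard stabilization trick, not part of the definition, since the constant cancels between numerator and denominator):

```python
import numpy as np

def softmax(z):
    # Subtracting max(z) leaves the result unchanged but avoids overflow
    # in the exponentials.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p)          # every component in the open interval (0, 1)
print(p.sum())    # sums to 1 by construction
```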
Associated with this softmax layer, we have to use a negative log-likelihood loss function, which is as follows:

$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \log h_\Theta(x^{(i)})_{y^{(i)}}.$$

This says: as the loss for a given sample $x^{(i)}$, compute the natural logarithm of the predicted probability corresponding to the correct class. So if my correct class is, say, the fourth out of 10, and $h_\Theta(x^{(i)})$ is a vector of 10 numbers between 0 and 1 which sum to 1, then this term corresponds to the fourth probability, of which I compute the natural logarithm; then there is a minus in front, and I average out all these negative log-likelihoods. Furthermore, since the logarithm of the exponential gives us simply the argument of the exponential, this can be rewritten as

$$J(\Theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ \log \sum_{k=1}^{K} e^{z_k^{(i)}} - z_{y^{(i)}}^{(i)} \right],$$

where the summation of exponentials is the denominator of the softmax layer, and $z_{y^{(i)}}^{(i)}$ is the logit of the $i$-th example selected at the correct class $y^{(i)}$.

Sometimes, instead of the softmax, another layer called log softmax is used, for numerical robustness. This is because the probabilities can easily become very small numbers: again, in the ImageNet competition we have 1000 classes, so we usually start, if the distribution is uniform, with a probability of $10^{-3}$, which can easily shrink to smaller values, and then we have problems with the machine epsilon. If instead we use log softmax, all these tiny, tiny numbers turn into very, very negative numbers, which are much less prone to underflow. Finally, we can always add our regularization term, $\frac{\lambda}{2m} \sum \big(w_{ij}^{(l)}\big)^2$, summed over all parameters but the biases and over all layers.
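Putting the pieces together, here is a minimal sketch of the negative log-likelihood computed on top of a log softmax, using the rewritten log-sum-exp form above:

```python
import numpy as np

def log_softmax(z):
    # log softmax(z)_k = z_k - log(sum_k' exp(z_k')), stabilized with the max trick
    zmax = np.max(z, axis=1, keepdims=True)
    return z - zmax - np.log(np.sum(np.exp(z - zmax), axis=1, keepdims=True))

def nll_loss(z, y):
    # z: logits, shape (m, K); y: integer indices of the correct class, shape (m,)
    m = z.shape[0]
    return -np.mean(log_softmax(z)[np.arange(m), y])

z = np.random.randn(5, 10)       # 5 samples, 10 classes
y = np.array([3, 0, 9, 1, 3])    # e.g. "the fourth out of 10" is index 3
print(nll_loss(z, y))
```

This pairing is exactly what PyTorch's `nn.CrossEntropyLoss` provides, combining `LogSoftmax` and `NLLLoss` in one module.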
Are there other loss functions? Yes, there are, and they are made for specific cases. For example, let's consider this problem: we would like to be able to identify different identities, so perhaps we would like to create a network that is able to recognize faces belonging to the same person. Why should this be difficult? Let's see why: the faces of two different people share more pixels across similar views than the faces of the same person across different views. This creates trouble for the network when learning what an identity actually means, because the input samples are really, really different. Although we humans can easily recognize that the faces, for example, in the first row belong to the same person, we can do so because we have a specific centre in our brain dedicated to this purpose; a neural network will struggle to make this association.

Let's see how to make the neural network's life a bit easier. We can come up with a new framework for training against identities. As we see here, we provide two examples of the same identity, for example two faces of Lisa, through the machine, and then we provide another face belonging to a different subject, like Bart. The network encodes these input samples into an embedding space, which for convenience and practicality lives on the unit sphere. We get three embeddings corresponding to the three input images, and we would like embeddings belonging to the same identity to come closer together, whereas embeddings belonging to a different identity we would like to push away. If we train this way on several samples belonging to several identities, we end up with multiple clusters across our embedding space, each defining a different identity; in this way the network will be able to recognize an identity across different poses of the face of the same subject.

So let's try to write down these equations. Let's call one input the anchor, one the positive sample, and one the negative sample; after going through the network they become the embeddings $a$, $p$ and $n$, where $a$ is defined as the embedding of my anchor sample, and so on for $p$ and $n$. I have my anchor here and my positive example here, with some distance between them; then I'd like to have an extra guard of size $\sqrt{\alpha}$, and my negative beyond it. That is, I'd like the distance from the negative to the anchor to be at least $\sqrt{\alpha}$ further than the distance from the positive to the anchor. I can write this as a loss function:

$$J = \frac{1}{m} \sum_{i=1}^{m} \left[ \left\| a^{(i)} - p^{(i)} \right\|^2 + \alpha - \left\| a^{(i)} - n^{(i)} \right\|^2 \right]_+,$$

the average across the whole dataset of the squared distance between anchor and positive, plus $\alpha$, minus the squared distance between anchor and negative. Inside the brackets I'd like the last term to be greater than the first two: I have three positive parts, a first positive element, a second positive element, and a third positive element, and to minimize this function I can make the anchor-negative term grow until it is bigger than the positive terms, and also shrink the anchor-positive distance. Even if this first distance shrinks down to zero, we still have the $\alpha$, which imposes a distance between the anchor and the negative that is at least as big as $\sqrt{\alpha}$. If we leave the equation like this, the loss function works pretty well, but it can become negative: the anchor-negative term can grow a lot and overcome the first terms, and then we start having a negative loss, which is something we don't really want. So, in order not to let the loss become negative, we simply take the positive part, $[\,\cdot\,]_+$, which is like applying a ReLU: whenever the anchor-negative term becomes bigger than the sum of the other two, the loss becomes zero, and there is no more contribution to the learning.
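Translating the formula directly, here is a minimal NumPy sketch of the triplet embedding loss, assuming the embeddings have already been computed and normalized onto the unit sphere ($\alpha = 0.2$ is an arbitrary choice for illustration):

```python
import numpy as np

def triplet_loss(a, p, n, alpha=0.2):
    # a, p, n: anchor, positive and negative embeddings, shape (m, d)
    # J = 1/m * sum_i [ ||a_i - p_i||^2 + alpha - ||a_i - n_i||^2 ]_+
    d_ap = np.sum((a - p) ** 2, axis=1)      # squared anchor-positive distances
    d_an = np.sum((a - n) ** 2, axis=1)      # squared anchor-negative distances
    return np.mean(np.maximum(d_ap + alpha - d_an, 0.0))  # positive part (ReLU)

def normalize(x):
    # project the embeddings onto the unit sphere
    return x / np.linalg.norm(x, axis=1, keepdims=True)

m, d = 8, 128
a, p, n = (normalize(np.random.randn(m, d)) for _ in range(3))
print(triplet_loss(a, p, n))
```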
I wrote an implementation of such a triplet embedding loss; you can find it on my GitHub account. There we create the criterion with `nn.TripletEmbeddingCriterion`, whose parameter $\alpha$ can be set as an argument; the loss is then computed over the embeddings $a$, $p$ and $n$ as one over the number of samples times the summation of the positive part of the sum of the three parts: the squared distance between the anchor and the positive, plus $\alpha$, minus the squared distance between the anchor and the negative sample.
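For a modern equivalent, PyTorch ships a similar built-in criterion; here is a minimal usage sketch, with random tensors standing in for real embeddings. Note that, unlike the squared-distance formulation above, `nn.TripletMarginLoss` uses plain (non-squared) Euclidean distances by default, so its margin is not numerically identical to the $\alpha$ used here:

```python
import torch
import torch.nn as nn

# loss(a, p, n) = max{ d(a, p) - d(a, n) + margin, 0 }, with d the L2 distance
criterion = nn.TripletMarginLoss(margin=0.2)

a = torch.randn(8, 128, requires_grad=True)   # anchor embeddings
p = torch.randn(8, 128)                       # positive embeddings
n = torch.randn(8, 128)                       # negative embeddings
loss = criterion(a, p, n)
loss.backward()                               # gradients flow back into the network
```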