What is the best loss function? The age-old question in machine learning. Will we solve this problem today? Nope, but we will talk about some loss functions, their pros and cons, and even discuss a recent paper on adaptive loss functions, which would mean you don't need to keep trying out different losses to find the one that best suits your needs. Fun stuff.

Okay, first, let's focus on regression. I've got this dataset here. It looks like a line can fit this data, so I'll train a linear regression model on it, and I'll choose the squared loss, the L2 loss, as the thing to minimize. My curve looks something like this. Okay, looks pretty good. And it's also pretty simple. But if I introduce some outliers into this data, my model responds by freaking the hell out and trying to fit those points better. This happens because the square term inflates the errors on the outliers, so the model really wants to get those extreme points right.

I'll just change the loss function to the absolute difference, the L1 loss. My model now treats the outliers like any other data point, so it won't go out of its way for them if that means compromising the rest of the model. This might lead to poor predictions on the extremes from time to time, but if you really don't care about those cases, this will do. Support vector regression uses this, by the way.

The advantage of the squared error is the ease with which we can compute its gradient during gradient descent. The gradient is not as simple in the absolute error case, because the absolute value isn't differentiable at zero. So the mean absolute error isn't optimized with plain gradient descent; it's optimized by computing subgradients instead. That adds a bit more complexity, and I'll add some reading material in the description down below.

So we've got two losses, one that obsesses over outliers and another that ignores them. If one doesn't work, you'd just use the other. And that might be fine in most cases, but consider this.
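To make that contrast concrete, here's a minimal numeric sketch (assuming NumPy; the function names and data points are my own, made up for illustration):

```python
import numpy as np

def squared_loss(y_true, y_pred):
    # L2 loss: residuals are squared, so one big outlier dominates the average
    return np.mean((y_true - y_pred) ** 2)

def absolute_loss(y_true, y_pred):
    # L1 loss: every residual counts linearly, outliers get no extra weight
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([1.0, 2.0, 3.0, 100.0])  # the last point is an outlier
y_pred = np.array([1.0, 2.0, 3.0, 4.0])    # a fit that ignores the outlier

print(squared_loss(y_true, y_pred))   # 2304.0 -- the one outlier dominates
print(absolute_loss(y_true, y_pred))  # 24.0   -- the outlier is just one term
```

Under the squared loss, nudging the fit toward the outlier pays off enormously, which is exactly why the model "freaks out"; under the absolute loss, that same outlier is just one more residual.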
Our data leans about 70% in one direction and 30% in the other. Technically, this data doesn't have any outliers, but our absolute loss may treat the 30% as outliers and ignore them altogether, while the squared loss will bend over backwards to capture them. Both decisions can lead to poor model performance. How do we compromise? With the pseudo-Huber loss, the best of both losses. If a data point has a relatively low error, it behaves like the squared loss; if the data point looks like an outlier, it behaves like the absolute loss. The result is a loss that reduces the effect of outliers on the model while still being differentiable everywhere. The main downside is that we now have an extra hyperparameter to play with. These are the most popular regression losses that you'll see built into regressors.

Now for classification losses. In classification, the output is obviously a class. But more precisely, it's a list of probabilities of belonging to the different classes, and we just choose the class with the highest probability, cuz duh. This list is a probability distribution. We compare it to the ground truth, and how we compare them depends on the loss we use.

So, cross-entropy loss. Entropy has its roots in information theory, so I'll explain it from that perspective. Say there's a weather station that sends you a forecast at the beginning of each day, telling you the day's weather using some number of bits. In the best case, say this information can be packed into as few as 3 bits on average: 2 bits for a sunny day, 4 bits for a rainy day, 3 bits for a cloudy day, and so on. The entropy of a distribution is the average number of bits required to convey a piece of information, like today's weather in this case. So the entropy in this example is 3 bits. But the station isn't perfect; it's designed by engineers, who have flaws themselves.
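The bit-counting story maps directly onto formulas. Here's a minimal sketch with made-up three-outcome weather distributions (so the numbers differ from the 3-bit example; the function names are my own):

```python
import numpy as np

def entropy(p):
    # Average bits needed under the best possible code for distribution p
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    # Average bits used when the code is built for q but the truth is p
    return -np.sum(p * np.log2(q))

def kl_divergence(p, q):
    # The wasted bits: cross-entropy minus entropy
    return cross_entropy(p, q) - entropy(p)

p = np.array([0.5, 0.25, 0.25])  # "true" weather distribution (made up)
q = np.array([0.25, 0.25, 0.5])  # the station's imperfect estimate (made up)

print(entropy(p))           # 1.5  bits: the best any code could do
print(cross_entropy(p, q))  # 1.75 bits: what the flawed station uses
print(kl_divergence(p, q))  # 0.25 bits of pure waste
```

Minimizing cross-entropy with respect to q, for a fixed truth p, is the same as minimizing the KL divergence, since the entropy of p is a constant.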
There is some wastage, and it turns out the station actually sends you 5 bits on average. This is cross-entropy: we're comparing the ideal average with the station's actual average. Entropy is 3 bits, but cross-entropy is 5 bits. This means we could have had a system that tells us the weather with just 3 bits, but the system we actually have uses 5 bits to do the same thing. Ideally, we want these numbers to be much closer to each other. This 2-bit difference is known as the KL, or Kullback-Leibler, divergence. This little weather station is a lot like a model we'd train to predict the weather as a classification problem, and so in many classification problems, cross-entropy and KL divergence are used as loss functions to minimize.

Another loss is the hinge loss, typically used in support vector machines for classification tasks. Minimizing it, we get a boundary that splits the data well and is as far away from every data point as possible. That is, it maximizes the minimum margin from the data points. This loss penalizes data points that lie inside the margin, even if they are correctly labeled. I've made several overly mathematical videos on kernels and SVMs. Check them out if you want to lower your self-esteem.

I'm going to wrap up this video with a paper discussion. We've taken a rough look at six common losses for classification and regression, but there are far more, some better suited to certain problems. Say we have a set of points we want to fit a regression line through. Squared loss does it decently well. We add outliers and fit again. It doesn't look too great anymore, so we try the pseudo-Huber loss, and this gives us better results. But I'm not satisfied yet, so let's try some other losses. We have the Welsch loss: the results are trash. The Geman-McClure loss: it fits this data better. Now the Cauchy loss: this fits the data even better. I like this.
It's nice that I found a loss function I liked, but I found it by trial and error. Is there a way I could have skipped the trial and error and somehow arrived at the loss I actually wanted? It turns out that all the losses I just mentioned can be generalized into a single equation by setting different values of alpha, a shape parameter. How do we choose alpha, though? Maximum likelihood estimation: we maximize the likelihood of the corresponding probability distribution, or equivalently minimize the negative log likelihood, treating alpha as something to learn. So the loss becomes adaptive. This technique is typically used to derive losses mathematically, and it leads to some interesting results. Here are some examples of images generated when we let a variational autoencoder determine its own loss. They aren't half bad. The idea of an adaptive loss sounds amazing.

Hope you all now have a better idea of loss functions, the differences between them, and a sprinkle of research on adaptive loss functions, so we can avoid trial and error when picking the most appropriate loss. I have resources in the description below. If you like these videos, please subscribe to keep the lights on in my little apartment, and I will see you soon. Bye bye.
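For the curious, that single generalized equation can be sketched directly. This is my own minimal port (scale c and the test residual are made-up choices); alpha = 2 and alpha = 0 are handled as separate branches because they are limits of the general expression, which divides by alpha and by |alpha - 2|:

```python
import numpy as np

def general_loss(x, alpha, c=1.0):
    # A sketch of the generalized robust loss: shape alpha, scale c.
    z = (x / c) ** 2
    if alpha == 2.0:               # squared-loss shape (limit as alpha -> 2)
        return 0.5 * z
    if alpha == 0.0:               # Cauchy shape (limit as alpha -> 0)
        return np.log(0.5 * z + 1.0)
    b = abs(alpha - 2.0)
    return (b / alpha) * ((z / b + 1.0) ** (alpha / 2.0) - 1.0)

# One residual, four shapes: lower alpha = less influence for big errors
x = 3.0
print(general_loss(x, 2.0))   # 4.5   -- squared loss
print(general_loss(x, 1.0))   # ~2.16 -- smoothed L1 / pseudo-Huber shape
print(general_loss(x, 0.0))   # ~1.70 -- Cauchy shape
print(general_loss(x, -2.0))  # ~1.38 -- Geman-McClure shape
```

Making the loss adaptive then amounts to treating alpha (and c) as parameters of a probability distribution and fitting them by minimizing the negative log likelihood alongside the model's own parameters.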