So what do we see here? We can look at the training loss. It first has a period of time where effectively nothing happens, then very quick learning happens, and then it converges. If we zoom in, we can look at the various modes. What do we see? We have a first mode that kicks in, then a second, orange mode kicks in, a green mode kicks in, and so on and so forth. Interestingly, all of these basically start at zero. Then the learning happens and then they converge. Why is that happening? Well, just as we saw before, if we start close to zero, we will be in that flat range, and it takes a while to escape it, which basically means that the weights move away from zero and start representing that mode. So one way of thinking about the dynamics of linear systems, and of nonlinear systems just as well, as we will see later, is that there exist these modes that basically kick in one after the other.

Now, let's zoom out a little bit. Today we of course focus on linear networks, but what are the components of neural networks? We have linear operations, initializations, loss functions, activation functions, optimizers, regularizers, and architectures. Those are arguably the building blocks that we build everything out of.

So it's time to think a bit about loss functions. Here's a generic idea: use log probability as a loss function. This is almost a philosophical idea. What does it mean for a model to be good? Ultimately, deep learning, machine learning, whatever we use, is about model building. What does it mean for us for a model to be good? Well, arguably, it means that we want to assign high probability to the data that actually occurs and low probability to other observations that could have happened but didn't end up happening. Using log probability is one good way of doing that.

And now you might ask: why do we keep talking about log probability? Why isn't probability enough? Well, we talked about abstractions a little bit earlier; there's a difference between the math and the implementation of that math. Probability and log probability are just monotonic functions of one another, so in that sense it shouldn't make any difference. However, probabilities are always between zero and one, and many times they are very close to zero or very close to one. If we used probability directly, we might get vanishing gradients. If we use log probability, it maps these probabilities onto a much more meaningful range, and therefore we can implement it more readily on the actual systems that we use.

So let's use a typical example from neuroscience to see where we might use log probability. What we often do in neuroscience is obtain k spikes during some interval. We know that within neuroscience there's a lot of randomness: if we repeat the same thing, maybe sometimes we get five spikes, maybe ten. There's some variance there. And we know that, to a first-order approximation, we have a Poisson distribution. So if you look at the Poisson probability of getting k events, we have lambda, the conditional intensity function, which is basically the baseline rate here, and the probability is P(k | λ) = λ^k e^(−λ) / k!. Now, all of these terms are potentially problematic. λ^k can easily, for large k, become infinity for the way a graphics card represents numbers; it could also become zero. The e^(−λ) term is similar: e^1000 is already infinity in floating point, so the exponential only gives us a very narrow range that we can play with. And k!, of course, has the same properties. Whereas once we take the log of that probability, what we have is k·log λ − λ − log(k!), and that is going to be in a reasonable range: each of these terms is nicely and meaningfully within a small interval for realistic data.
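To make that numerical argument concrete, here is a minimal sketch, not from the lecture itself, assuming NumPy and SciPy; the function names and the example spike count and rate are made up for illustration. It compares the naive Poisson probability with its log, for a case whose answer is perfectly representable but whose intermediate terms are not.

```python
import math
import numpy as np
from scipy.special import gammaln  # gammaln(k + 1) == log(k!)

def poisson_prob(k, lam):
    """Naive Poisson probability lambda^k * exp(-lambda) / k! -- each factor can blow up."""
    return np.power(lam, k) * np.exp(-lam) / float(math.factorial(k))

def poisson_logprob(k, lam):
    """Same quantity in log space: k*log(lambda) - lambda - log(k!) -- every term stays moderate."""
    return k * np.log(lam) - lam - gammaln(k + 1)

k, lam = 170, 70.0                # e.g. 170 spikes with a mean rate of 70 for this interval
print(poisson_prob(k, lam))       # inf: lambda^k overflows in float64
print(poisson_logprob(k, lam))    # about -54.3, a perfectly ordinary number
```

The naive version overflows in λ^k even though the true probability is on the order of 10^−24, while the log version returns an ordinary finite number.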
So another possibility is to use the mean squared error. We already saw it today: we use it when we do linear regression, where it is simply the squared loss. It's a quadratic function. But it is also the logarithm of the Gaussian. Now, we want you to show this yourself: if you write down the equation for a Gaussian and you calculate the log of that, you're going to get minus the squared error, up to scaling and an additive constant, of course.
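As a sketch of that exercise (assuming a Gaussian whose mean is the model's prediction ŷ and whose variance σ² is fixed, which the lecture doesn't spell out), the derivation looks like this:

```latex
% Gaussian likelihood of an observation y under a prediction \hat{y} with fixed variance \sigma^2
p(y \mid \hat{y}) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\!\left(-\frac{(y-\hat{y})^2}{2\sigma^2}\right)

% Taking the logarithm:
\log p(y \mid \hat{y}) = -\frac{(y-\hat{y})^2}{2\sigma^2} - \tfrac{1}{2}\log\!\left(2\pi\sigma^2\right)

% The second term does not depend on \hat{y}, so maximizing the log-likelihood is the same
% as minimizing the squared error (y-\hat{y})^2, up to the scaling factor 1/(2\sigma^2).
```

Summed over independent data points, this is exactly the mean squared error, up to that scale and an additive constant.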