Great. Now let's briefly step through the very, very rough logic of deep learning. What is deep learning made out of? We have the first component here, which is a function, a parametrized function, often written as f_theta, where theta are the parameters. This is what the artificial neural network is: it defines some output y, which is a function, with all those parameters, of an input x. So we have this function, which can be very complicated; in a ResNet, as we'll get to later, it might be hundreds of layers of different kinds, and so on and so forth. So we have a function, and then we have a loss function. That loss function measures what it means to have good performance; examples are mean squared error and mean absolute error. We've seen a few, and we will be seeing a lot of those. (A minimal sketch of these pieces follows at the end of this section.)

And now we will be optimizing. We are trying to find the set of parameters theta star, the one that minimizes (or maximizes) the loss, which again depends on theta. And of course, yes, in practice we will only find local minima, and often not even those, but that's not the issue here.

Now, our real problem is this optimization. How could we do it? We could use zeroth-order optimization, where we treat the function as a black box: we don't even need autograd, we just evaluate how good a given set of weights is, and then we try a different set. We could use things like simulated annealing, but of course, optimizing like that is insanely slow. However, when we do first-order optimization, where we use gradients, we can be massively faster, many, many orders of magnitude faster. One way of thinking about it: with a zeroth-order technique, we can ask on a given coordinate, do we get better here or there? Then we go to the next weight and ask the same. Given that we have millions of parameters, that is potentially on the order of a million times slower than it would be otherwise. So in any case, it helps a great deal to have the gradients. (The comparison sketch below makes this concrete.)

So, gradients. You've all seen gradients, but here is the intuition: how much would the loss change if I changed a parameter just a tiny bit? The gradient vector basically contains the partial derivative of the loss with respect to each component of the parameter vector theta. And before we let you go on and be happy with the automatic differentiation we get from PyTorch, I want you to remember how difficult it is to differentiate by hand. So we set up a simple problem where you can calculate the partial derivative. You'll do this one by hand and check whether you agree with PyTorch. In the future, we will fortunately have PyTorch do that for us.
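As a minimal sketch of the first two pieces named above, a parametrized function f_theta and a loss function, here is what they might look like in PyTorch. The network shape, the data, and the toy target rule are made-up illustrations, not anything from the lecture:

```python
import torch
import torch.nn as nn

# f_theta: a parametrized function y = f(x; theta).
# The layer sizes here are arbitrary illustrative choices.
model = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

# A loss function measuring what "good performance" means,
# here mean squared error (MSE).
loss_fn = nn.MSELoss()

x = torch.randn(32, 1)       # inputs (toy data)
y_target = 2.0 * x + 0.1     # made-up targets for illustration
y = model(x)                 # the network's output y = f_theta(x)
loss = loss_fn(y, y_target)  # L(theta): the quantity we want to minimize
```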
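To make the zeroth-order versus first-order contrast concrete, here is a toy comparison on a one-parameter linear model. The perturbation size, learning rate, and iteration counts are arbitrary illustrative choices, not a recommended recipe:

```python
import torch

def loss_at(theta, x, y):
    # Toy quadratic loss for a linear model y_hat = theta * x
    return ((theta * x - y) ** 2).mean()

x = torch.randn(100)
y = 3.0 * x  # made-up "true" relationship

# Zeroth-order: probe each coordinate separately, with a fresh loss
# evaluation per probe, and keep whichever perturbation improves the loss.
theta = torch.zeros(1)
eps = 0.1
for _ in range(50):
    for i in range(theta.numel()):
        trial = theta.clone()
        trial[i] += eps
        if loss_at(trial, x, y) < loss_at(theta, x, y):
            theta = trial
        else:
            trial[i] -= 2 * eps  # try the other direction
            if loss_at(trial, x, y) < loss_at(theta, x, y):
                theta = trial
# With millions of parameters, the per-coordinate probing above is roughly
# a factor of the parameter count slower than a single gradient step.

# First-order: one backward pass yields the gradient for every parameter
# at once, so each update step costs about as much as one loss evaluation.
theta = torch.zeros(1, requires_grad=True)
for _ in range(50):
    loss = loss_at(theta, x, y)
    loss.backward()                 # fills theta.grad for all parameters
    with torch.no_grad():
        theta -= 0.1 * theta.grad   # gradient descent step
        theta.grad.zero_()
```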
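In symbols, the gradient intuition above is the usual vector of partial derivatives of the loss with respect to each parameter:

```latex
\nabla_\theta L(\theta) =
\left( \frac{\partial L}{\partial \theta_1},
       \frac{\partial L}{\partial \theta_2},
       \dots,
       \frac{\partial L}{\partial \theta_n} \right)
```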
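And in the spirit of that exercise, here is a small self-check, using a made-up one-parameter loss rather than the actual homework problem: differentiate by hand, then ask PyTorch's autograd for the same derivative and compare:

```python
import torch

w = torch.tensor(1.5, requires_grad=True)  # the one parameter
x = torch.tensor(2.0)
y = torch.tensor(5.0)

loss = (w * x - y) ** 2   # L(w) = (wx - y)^2
loss.backward()           # autograd computes dL/dw

# By hand, via the chain rule: dL/dw = 2 * x * (w * x - y)
manual = 2 * x * (w.detach() * x - y)

print(w.grad, manual)     # both should print -8.0 here
```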