So, so that you don't have to do this again, let's talk a little bit about the magic of automatic differentiation. What we have in deep learning is a function y that depends on various parameters: y is f of x, and you generally have a human programmer take this function and convert it into, in your case, a PyTorch implementation of that function. Now, what you've of course seen before is symbolic differentiation, by human or by computer. You know, the way you take a function and differentiate it yourself: the derivative of sine is cosine, of x squared is 2x, you name it. That's symbolic differentiation. And of course, there are commercial tools like Mathematica that do that much better than certainly I do; I don't want to speak for you.

So what's automatic differentiation? What we could do is symbolically differentiate the function and implement those derivatives as a function. Automatic differentiation is instead going directly from the PyTorch program that implements the function to a program that implements the derivative of that function. And this transition, as we will see, is very efficient, so much so, in fact, that essentially nobody goes the manual route anymore. The alternative would be to take pencil and paper, calculate the derivatives, and implement those as a Torch function. Why don't we do it that way? It turns out it would be much slower to implement, and it makes life very, very difficult if you have to do that by hand. Thank God we are past those times.

So let's talk about the modern history of automatic differentiation. I should say, I first heard about automatic differentiation in 1999, when there was a package called AdLab, and I never managed to get it to work, despite the fact that I knew it would revolutionize my research. I never managed to get there. Then in 2015, autograd became a real thing, followed by Chainer, then Torch autograd. PyTorch came around 2017, TensorFlow got popular towards late 2017, and I should also mention JAX came up in 2018. So that's a brief history of autograd. Each of these packages led to massive progress in the field, allowing us to now be very agile with deep-learning-type systems, or in general with systems where we want to optimize non-trivial functions. The main reason they are so good is that they do two things: they make gradient calculations fast for the computers, and, maybe more importantly, they make them fast and efficient for ourselves. I remember back then, every time I implemented a function and its derivative, I needed to manually check whether it was actually correct. That problem has largely gone away these days.

So how does it work? A lot of these approaches are based on the computational graph. What's a computation graph? It's a data structure for storing the values and gradients of the variables used in a computation. So what do we have there? Each node stores the value produced during the computation; as you'll see in a slide or two, we can always describe a computation as such a graph. That value is used to calculate the gradient, and the node also stores the function that created it. A directed edge from u to v represents the partial derivative of u with respect to v, and to compute the gradient, we find the path from L to v and multiply the edge weights along it, where L is the overall loss. So let's look at exactly how such a thing works.
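To make that node idea concrete, here is a toy sketch in Python of such a graph. This is a hand-rolled reverse-mode example, not how PyTorch actually implements autograd, and the Node class and the add/mul helpers are made up for illustration: each node stores its value, the nodes it came from, and the local partial derivative on each incoming edge, and the backward pass just multiplies along the paths and accumulates.

    class Node:
        # each node stores its value, the nodes it was computed from, and the
        # local partial derivative sitting on each incoming edge
        def __init__(self, value, parents=()):
            self.value = value
            self.parents = parents   # tuples of (parent_node, local_derivative)
            self.grad = 0.0

        def backward(self, upstream=1.0):
            # chain rule: accumulate what arrives here, then push it upstream
            self.grad += upstream
            for parent, local in self.parents:
                parent.backward(upstream * local)

    def add(a, b):
        return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))

    def mul(a, b):
        return Node(a.value * b.value, ((a, b.value), (b, a.value)))

    # toy function y = (a + b) * b, with a = 2 and b = 3
    a, b = Node(2.0), Node(3.0)
    y = mul(add(a, b), b)
    y.backward()           # gradient at the output is 1, then walk the graph backwards
    print(a.grad, b.grad)  # dy/da = b = 3,  dy/db = a + 2b = 8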
So here we have a very simple function: we add the product of two variables to the sine of one of them. Let's step through it step by step and see what the compute graph looks like. We have internal variables w, and we have inputs x1 and x2. So how does it start? The first variable w1 is simply a copy of x1, and the second, w2, is simply a copy of x2; we just build internal versions of those variables. Then we have w3, which is the product of w1 and w2, and w4, which is the sine of w1. You see, every time we have a line in our Python or PyTorch code, it produces an extra node here. Then we have w5, which is w3 plus w4, and the output, which is w5.

Now, this is just a way of representing the flow of computation as we calculate something. Any calculation anywhere in your code can be written like this. Note that the output of each line in your code is a new variable that is set to something; we just need to make sure that every time we have a new variable, we instantiate a new node here. That way we can produce a computational graph.

Now, how would we calculate gradients in this case? Well, we know the output: w5 is the output, so the gradient with respect to w5 is 1. And then we follow this backwards. So what do we have here? We have w4, which is being added into this, so we now have to push gradients through this plus. The gradient at w4 is the gradient at w5 times the derivative of w5 with respect to w4, and that derivative is 1, because we just add two things: w5 is w4 plus w3. So here we have a 1, and the gradient arriving at w4 is just the gradient at w5; same thing for w3. Then we can go one step further. What is the derivative coming through this edge here? Since w4 is the sine of w1, the local derivative is cosine of w1, because the cosine is the derivative of the sine. Similarly, on this edge, the product node: the derivative of w3 with respect to w2 is w1, and with respect to w1 it is w2. And then it goes all the way down: the derivative with respect to x1 collects the contributions from both branches, so we get cosine of x1 plus x2, and the derivative with respect to x2 is x1. OK? So on this backward pass, which is really just the chain rule, we can calculate the gradients.

So why are these tools so useful? For a single neuron with n inputs, we generally need to keep track of n gradients. So for a standard vanilla feed-forward neural network for MNIST, which is 784 by 800 by 10, we need over 600,000 gradient computations. That's an awful lot of calculations, and we might have 60,000 training examples, so we'd better be fast. And as I told you earlier, in lots of ways the rate-limiting factor for us when we do deep learning is speed. Also, calculating a gradient is itself a computation, so it can have its own compute graph, which means we can have systems that learn to learn; but we will talk much, much later about learning to learn. So basically, by building compute graphs while we run functions, we allow computers to do automatically the same steps that we just did here by hand, and of course much more cleverly, using lots of tricks to make it more efficient. So when you run forward calculations in PyTorch, it sets up this whole compute graph, which allows it to very quickly calculate the gradient. Now what you'll do is take a simple example that we set up for you, calculate the gradients by hand, going the same way backwards, and then run autograd.
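As a sketch of what that looks like in PyTorch, assuming the same toy function y = x1 * x2 + sin(x1) and made-up input values x1 = 2 and x2 = 3: the forward pass builds the graph, .backward() walks it, and the result should match the hand-derived gradients x2 + cos(x1) and x1.

    import torch

    # inputs become leaf nodes of the compute graph
    x1 = torch.tensor(2.0, requires_grad=True)
    x2 = torch.tensor(3.0, requires_grad=True)

    w3 = x1 * x2          # w3 = w1 * w2
    w4 = torch.sin(x1)    # w4 = sin(w1)
    y = w3 + w4           # w5 = w3 + w4

    y.backward()          # gradient at the output is 1, then walk the graph backwards

    print(x1.grad.item())  # hand-derived: x2 + cos(x1) = 3 + cos(2), about 2.584
    print(x2.grad.item())  # hand-derived: x1 = 2.0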
And you will see if automatic differentiation gives you the correct results.
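One way to do that check numerically, sketched here under the assumption that double-precision inputs are fine for your example, is torch.autograd.gradcheck, which compares the analytic gradients from the graph against finite-difference estimates.

    import torch

    def f(x1, x2):
        # same toy function as above: y = x1 * x2 + sin(x1)
        return x1 * x2 + torch.sin(x1)

    # gradcheck expects double precision and requires_grad=True
    x1 = torch.tensor(2.0, dtype=torch.double, requires_grad=True)
    x2 = torch.tensor(3.0, dtype=torch.double, requires_grad=True)
    print(torch.autograd.gradcheck(f, (x1, x2)))  # True if analytic and numeric gradients agree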