I guess we can get started. All right, so today we're going to talk about backprop. And I'm sure for some of you, a lot of this is going to look familiar. I'm going to start with some refreshers about basic concepts, and talk about a more general formulation of backprop a little later. And then tomorrow Alfredo will go through how you use autograd and things like this in PyTorch. OK, so basic concepts. We have parameterized models. Parameterized models are nothing more than functions that depend on two arguments: an input and a trainable parameter. And there's no conceptual difference between the parameter and the input; they're both arguments of the same deterministic function. The thing is, the parameter is shared across training samples, whereas the input, of course, is different for every training sample. In most deep learning frameworks, the parameter is actually implicit in the parameterized function. When you call a function, you don't actually pass the parameter value; it's stored inside, at least in the object-oriented versions of models. But you just need to remember that your parameterized model is just a parameterized function: it takes an input, it has a parameter vector, and it produces an output. In simple supervised learning, this output goes into a cost function that compares the output of the model with the output you want. Here it's called C. The prediction, the output of the module, is called y bar. And the C function compares y and y bar, where y is the output you want and y bar is the output you get. So I'm giving here two very simple examples of parameterized functions, which I'm sure you're familiar with. The first one is a linear model. A linear model just computes a weighted sum of the components of its input vector, multiplied by weights. And if you do linear regression with square loss, the C function is just the squared Euclidean distance between the y vector and the y bar vector. y and y bar can be vectors or scalars or tensors or whatever. It doesn't matter. Things with numbers in them, basically, or things that you can compute distances between. That's actually all you need, technically. But here is a slightly more complicated parameterized function down at the bottom, which actually computes nearest neighbor. So here, there is the input x, and w is a matrix. Each row of the matrix is indexed by the index k. And to compute the output, we output the number k that corresponds to the row of w that is closest to x. So we compute the distance between x and a particular row of w, which is wk. Then we look over all k's, figure out which of those distances is smallest, and output that k. That's what the argmin function does: it returns the value of the argument that minimizes the function. So it's a function of k, and it returns the k that minimizes that function. The point I'm making with this, which is a complicated way of explaining nearest neighbor, is that the type of computation that takes place in your parameterized model could be very complicated. It doesn't have to be just a neural net, something you compute with weighted sums and nonlinearities. It can be something complicated that involves a minimization of something else. It could be the minimum of some function. And we'll come back to this in a few weeks.
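To make this concrete, here is a minimal PyTorch sketch of the two examples; the dimensions and the random values are arbitrary choices for illustration, not anything from the slides.

```python
import torch

w = torch.randn(5, 4)  # parameter: 5 rows (prototypes) of dimension 4
x = torch.randn(4)     # input

# Linear model: each output is a weighted sum of the input components.
y_bar = w @ x

# Nearest-neighbor "model": output the index k of the row of w closest to x.
k = torch.argmin(torch.norm(w - x, dim=1))
```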
Yes? [A question about notation.] You can denote it in a different way. I could have written the w matrix multiplied by, let's say, a z vector, where the z vector is constrained to be a one-hot, in which case you would select a column of w, and you could do a min over this z vector. Then it would be a different notation, but a similar effect. But then I would have to write another equation, like "z is one-hot", and explain what that means. Forget about the notation. Just remember the fact that there could be something complicated going on in this parameterized function. It's not necessarily a very simple thing. So what I've done with this diagram here is introduce a way of denoting, of writing, neural nets and various other models as block diagrams. And I'm using three different types of symbols here, or four, really. The bubbles represent variables. The bubbles that are filled in represent variables that are observed. So x is an observed variable: it's the input to your system, so you observe it on the training set and the test set and so on. y bar is a computed variable: it's something that's just produced by a deterministic function; you can compute it from the observed variable through a deterministic function. y, similarly, is an observed variable, because it's observed on the training set. It's not observed on the test set, but during training it's observed. And then you have two types of functional modules. One type is those blue round-shaped modules, which represent deterministic functions. The rounded side indicates in which direction it's easy to compute: here you can compute y bar from x, but it's considerably more complicated to compute x from y bar. If I give you y bar, you'll have a hard time giving me an x that corresponds to it. And then you have another type of module, usually used to represent cost functions, drawn as squares (red squares, to make them more visible in this case). They have an implicit output, which is a scalar, a single number. They can take multiple inputs and basically compute a single number, usually a distance between the inputs or something similar. So with those basic symbols, you can represent standard supervised learning systems. For those of you who are familiar with graphical models, this is similar to the notation used in what's called factor graphs, where the squares are factors. Factor graphs don't have those deterministic functions, because they don't care about which way dependencies can be computed, but in our case it's really important. Okay, so loss functions. Loss functions are things that we minimize during training. And there are two types of loss. There's the per-sample loss, in this case l of x, y, w: you give it a sample, a pair x and y, and a value of the parameter, and it computes a scalar value. In our case here, we use a very simple per-sample loss, which is just equal to the output of the cost module that we put on top of our system. This is the standard supervised learning paradigm, where the per-sample loss is simply the output of the cost function. It's not always the case. And then the loss that we actually minimize during training is the average loss over a training set. So a training set S is a set of pairs x of p, y of p, for p equals zero to P minus one. And the overall loss, which depends of course on the training set and the parameter values, is the average of the per-sample loss over all the pairs x, y belonging to S.
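In code, the two levels of loss might look like this sketch; the model argument and the squared-error cost are stand-ins for whatever C you actually use.

```python
import torch

def per_sample_loss(model, x, y):
    # L(x, y, w) = C(y, y_bar), here the squared Euclidean distance
    return torch.sum((y - model(x)) ** 2)

def training_loss(model, S):
    # the overall loss: the average of the per-sample loss over all (x, y) in S
    return sum(per_sample_loss(model, x, y) for x, y in S) / len(S)
```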
So machine learning is all about optimizing functions: most of the time minimizing functions, sometimes maximizing functions, occasionally finding Nash equilibria between two functions, as in the case of GANs, but most of the time we minimize functions. And we do this with gradient-based methods. Not necessarily gradient descent, but gradient-based methods. What is a gradient-based method? A gradient-based method is a method that finds the minimum of a function assuming that you can easily compute the gradient of that function. So that assumes the function is more or less differentiable. It doesn't actually have to be everywhere differentiable, technically. It needs to be continuous and almost everywhere differentiable, otherwise you run into trouble. But it can have kinks in it, as long as they're not too nasty. And gradient descent, of course, as you probably know, consists in computing the gradient. So you see a function here at the top. It's got a minimum at the top right. I've drawn the lines of equal cost of that function, and the arrows you see are the gradient vectors. The gradient points uphill at every location, and the gradient is always orthogonal to the lines of equal cost, equal altitude, if you want. Okay, so gradient descent is like being in the mountains in the fog at night: you can't see anything, but you want to go down to the village, so you look around you for the direction of steepest descent and you take a step. So in the little algorithm here at the top, the w vector, which is your position, is replaced by your current w vector minus some constant eta times the gradient vector. The gradient points uphill, so when you subtract it, you're walking downhill in the direction of steepest descent. Now, this is if eta is a scalar constant, but in sophisticated algorithms, eta can actually be a matrix. If it's a positive semi-definite matrix, you will still go downhill, except not necessarily in the direction of steepest descent. In fact, the direction of steepest descent is not necessarily the direction you want to go in. If you have a situation like the one at the top, where the valley is a little elongated, the gradient does not point towards the minimum; it points off center. And so if you want to go directly to the minimum, you don't want to follow the gradient. You want to be a little smarter than this, and by using etas that are matrices, you could in principle do this with so-called second-order methods, which are still gradient-based methods, but they're impractical in most cases. We'll talk about some issues with this in a few weeks. Now, there are optimization algorithms that are not gradient-based. When your function is not differentiable, when it's like a golf course, flat with a hole in it, or when it's staircase-like, where the gradient doesn't give you any useful information; or when it may be differentiable but you don't know the function, you didn't write the program that computes it, because that function might be the entire environment around you; then you cannot compute the gradient efficiently. So then you have to resort to other methods, called zeroth-order methods or gradient-free methods.
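Before moving on, here is what the basic gradient-descent update looks like in code, a minimal sketch; the bowl-shaped toy cost and the constants are arbitrary choices.

```python
import torch

w = torch.randn(2, requires_grad=True)   # current position in parameter space
eta = 0.1                                # scalar step size

for _ in range(100):
    loss = ((w - torch.tensor([3.0, -1.0])) ** 2).sum()  # toy cost, minimum at (3, -1)
    loss.backward()                      # compute the gradient
    with torch.no_grad():
        w -= eta * w.grad                # step downhill: w <- w - eta * gradient
        w.grad.zero_()                   # reset for the next iteration
```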
There's a whole family of those gradient-free methods, which I'm not gonna talk about at all. Deep learning is all about gradient-based methods. That said, if you're interested in reinforcement learning, most of reinforcement learning actually uses gradient estimation without the gradient. What you want is, I don't know, you want to get a robot to learn to ride a bike, and once in a while the robot falls, and you don't have a gradient for the objective function that says "don't fall", or the objective function that measures how long the bike stays up without falling. Nobody tells you what to do to minimize that cost function, right? So you have to try things. You can't compute a gradient of that function. Okay, so in RL your cost function is not differentiable most of the time, but the network that computes the output that goes into the environment is differentiable. From that point on, it's gradient-based; only the cost is not differentiable. So it'll be a situation like the diagram I showed earlier. Imagine here that G is differentiable: you can compute the gradient of the output of G with respect to the parameters and its input and everything, but C is not differentiable. In fact, it's completely unknown. The only thing you know about C is that if you give it a y bar and a y, it tells you the value, but it doesn't give you the gradient. That's kind of what RL is. There are other things about RL, but that's the basic difference between reinforcement learning and supervised learning. Yeah, well, the reward is just the output of C, that's all. So C is a black box, and what you get is the output of C. You don't get a y either, right? So you're not being told what the correct answer is. You just have a black box: you give it a y bar and it gives you a C. That's it. You can't compute the gradient of C with respect to y bar. That's right. So what you do is you change y bar a little bit and you see whether C goes up or down. If it goes down, you reinforce that; if it goes up, you do something else. So basically you're telling the system how well it's doing without telling it the correct answer, and it doesn't have access to a gradient. What that tells you is that RL is horribly inefficient, right? Because you don't have a gradient, you have to try things. If the output y bar is low-dimensional, then it's okay: you can try to make it larger, smaller, things like that; it's not too bad. But if y bar is a high-dimensional vector, there's such a huge space to search that there's probably no way you're going to find an optimal value for it unless you try lots and lots of times. So that's a huge problem with RL. I'm not gonna talk about RL in this course, actually, except today, maybe. But a very popular family of techniques in RL is so-called actor-critic methods. A critic method basically consists in having a second C module, which is a trainable module: you train this second module, which is differentiable, to approximate the cost function, the value function, the rewards you get. A reward is the inverse of a cost; I mean, it's the negative of a cost. It would be more like a punishment, actually. So that's a way of making the cost function differentiable, or at least approximating it by a differentiable function, and then you can just use backprop. So AC, A2C, and A3C are versions of this: actor-critic, advantage actor-critic, et cetera.
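The "twiddle y bar and watch C" idea can be sketched as a finite-difference estimate; the black-box cost below is invented purely for illustration, and in reality you would only be able to query it, not read its formula.

```python
import torch

def black_box_cost(y_bar):
    # stand-in for the unknown C: we can evaluate it, but we get no gradient
    return float((y_bar - 2.0) ** 2)

y_bar, eps = torch.tensor(1.0), 1e-4
# perturb y_bar a little in each direction and see whether C goes up or down
grad_estimate = (black_box_cost(y_bar + eps) -
                 black_box_cost(y_bar - eps)) / (2 * eps)
# with a high-dimensional y_bar you would need one such probe per dimension,
# which is exactly why this is so inefficient
```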
Okay, right, so what you have to know how to do is compute the gradient of your objective function with respect to the parameters. In practice, we use stochastic gradient, as you're probably aware. So instead of computing the gradient of the entire objective function, which is the average over all samples, you take one sample, compute the per-sample loss L, compute the gradient of this loss with respect to the parameters, and then take one step in the negative gradient direction. That's the second formula here: w is replaced by w minus some step size times the gradient of the per-sample loss with respect to the parameters, for the given sample x(p), y(p). In practice, people use batches. First of all, if you do this on a single sample, you're going to get a very noisy trajectory, like the one you see here at the bottom, where instead of the parameter vector going directly downhill, it oscillates. Some people say it should not be called SGD, which means stochastic gradient descent, because it's not actually a descent algorithm; it should be called stochastic gradient optimization. It's stochastic, so it's very noisy: every sample you get pulls in a different direction, and it's just the average that pulls you towards the minimum of the average. So it looks inefficient, but in fact it's fast. It's much faster than batch gradient, at least in the context of machine learning, when the samples have some redundancy between them. So, to the question of batching. What people do most of the time is compute the average of the gradient over a batch of samples, not a single sample, and then take one step. And the only reason for doing this, this has nothing to do with algorithmic convergence or efficiency or anything, the only reason is that the kind of hardware at our disposal, GPUs and multicore CPUs, is more efficient if you have batches. It's easier to parallelize. So you get more efficient use of your hardware if you use batches. It's a bad reason for batching, but we have no choice until someone builds a piece of hardware that is properly designed for this. The reason, again, is that the chips we have in NVIDIA GPUs are heavily parallelized, but they're parallelized in a simple way, and the simplest way to exploit that is to batch. Yes. If there was no... Yeah, but okay. So here is why stochastic gradient is better, right? Say I give you a million samples, but in this million samples I actually only have 10,000 different samples: I repeat those 10,000 samples 100 times, I shuffle them, and I give you this training set of a million samples. You don't know it's actually only 10,000 samples repeated 100 times. If you use batch gradient, you're going to compute the same quantities 100 times and average them, so you're going to spend 100 times more computation than necessary; whereas if you use stochastic gradient, by the time you've seen 20,000 samples you've already done two effective passes through your entire training set. So it'll be something like 100 times more efficient. So the question is: how much can you average in a batch without losing efficiency, given the redundancy? And some people have done experiments with this.
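Here is a minimal sketch of mini-batch SGD using PyTorch's DataLoader; the random data and the linear model are placeholders for a real training set and a real network.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# placeholder training set of (x, y) pairs
data = TensorDataset(torch.randn(1000, 8), torch.randn(1000, 1))
loader = DataLoader(data, batch_size=64, shuffle=True)  # shuffled mini-batches

model = nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for x, y in loader:
    loss = ((model(x) - y) ** 2).mean()  # average loss over the batch
    opt.zero_grad()
    loss.backward()                      # average gradient over the batch
    opt.step()                           # one parameter update per batch
```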
There's some empirical evidence that the number of samples you can put in a batch is roughly equal to the number of categories you have, if you're doing classification: somewhere between one and two times the number of categories. So if you train on ImageNet, you have 1,000 categories; you can have batches up to about 2,000, and beyond this you start losing convergence speed. [A question about samples with no redundancy.] That would mean they're completely random, basically. And that never occurs, right? Think about the following scenario: you give me that training set, and what I do is split it in half, using the first half as a training set and the second half as a validation set. If there is zero redundancy in your set, that means my machine is not going to work at all: it's not going to be able to generalize to the second half. So if there is any possibility of generalization, there has to be some redundancy. Okay, so let's start simple and talk about traditional neural nets. Traditional neural nets are basically interspersed layers of linear operations and pointwise nonlinear operations. In the linear operation, you have an input vector and you compute weighted sums of that vector with a bunch of weights. In this case here, we have six inputs and three hidden units in the first layer, so there are three different sets of weights with which we compute weighted sums of the six inputs. Conceptually, the operation that goes from the six-dimensional input vector to the three-dimensional vector of weighted sums is just a matrix-vector multiplication: take the input vector and multiply it by a matrix formed by the weights. It's going to be a three-by-six matrix, right? You multiply this by a six-dimensional vector and you get a three-dimensional vector. So that's the first type of operation in a classical neural net, and the second type of operation is that you take all the components of the vector of weighted sums and pass them through simple nonlinearities. In this case, the nonlinearity is called a ReLU; it's called half-wave rectification in engineering. There are different names for it, but basically it's the positive part, mathematically: it's equal to the identity when the argument is positive, and equal to zero when the argument is negative. And then you repeat the process. So the third stage is again a linear stage: multiply that three-dimensional vector by a matrix, in this case a two-by-three matrix, to get a two-dimensional vector, and pass the components through nonlinearities. I call this a two-layer network, because I think what matters are the pairs linear-nonlinear. So most people will call this a two-layer network. Some people will call it a three-layer network, because they count the variables, but I don't think that's fair. [A question.] You could do this without the nonlinearities, but you don't want to. If there are no nonlinearities in the middle, as I said last week, you might as well have a single layer, because the composition of two linear functions is a linear function, and so you can collapse them into a single one: a single matrix that is the product of the two matrices. Okay, so here it is in a little more detail. The weighted sum for unit i, si, is the sum over all the predecessors of i, which is denoted up(i): the index j goes over all the predecessors of i, and you sum wij times zj, where zj is the j-th output from the previous layer. This is a regular, stacked, layered neural net. And then you take a particular si, pass it through one of those nonlinear functions f, and you get zi.
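Written out with explicit matrices, the 6-3-2 network just described might look like this sketch; the random weights are for illustration only.

```python
import torch

x = torch.randn(6)        # six inputs
w0 = torch.randn(3, 6)    # 3x6 weight matrix: three hidden units
w1 = torch.randn(2, 3)    # 2x3 weight matrix: two outputs

s1 = w0 @ x               # first linear stage: s_i = sum_j w_ij * z_j
z1 = torch.relu(s1)       # pointwise ReLU: the positive part
s2 = w1 @ z1              # second linear stage
y_bar = torch.relu(s2)    # output of the two-layer network
```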
Okay, so now let's talk about how we compute gradients and things like this. There are two forms of it. There is a scalar form that I'm going to explain right now, which does not even require you to know what a derivative is, funnily enough. And then there is a slightly more general form, and there's an even more general form that maybe I'll talk about next week. Okay, so let's say we have a big network with a cost function; our thing has an x and a y, and it's got a cost coming out. But in fact you don't need to make this assumption. The only assumption you need to make is that you have some function that produces a scalar at the output, that's it. And somewhere in that network you have a nonlinear function h. I called it f in the previous slide, but I call it h here. It takes one of those weighted sums s, you pass it through this h function, and it produces one of those z variables. I'm not putting an index here, because I'm just taking one particular one of those functions out of the network, and I view the rest of the network as a black box. So we're going to use the chain rule. The chain rule, if you remember from kindergarten. Okay, high school. Okay, college. If you have two functions composed with each other, g of h of s, and you want to differentiate it: g of h of s, prime, is equal to the derivative of g at the point h of s, multiplied by the derivative of h at the point s, all right? Clear? Okay. But if you've gone through a couple of years of college, you can write this the way Newton wrote it, or Euler, or whoever, with the infinitesimal quantities. So you can write dc over ds, which means the derivative of c with respect to s, is equal to dc over dz times dz over ds: the derivative of c with respect to z, multiplied by the derivative of z with respect to s. And the nice thing about writing it like this is that it's obvious you can simplify by dz, right? You have dz at the bottom and at the top, and by cancelling the dz's, which is the second line, you get dc over ds. So you split the derivative by introducing an intermediate variable that you put at the bottom and at the top. It's a very simple symbolic manipulation. Now, dz over ds is just the derivative of z with respect to s, but z is equal to h of s, so that's just h prime of s. So dc over ds is equal to dc over dz times h prime of s. So if someone gives you the derivative of the cost function with respect to z, you multiply it by the derivative of your nonlinear function, and you get the derivative of the cost function with respect to s. Okay, so imagine you have a chain of those functions in your network. You can back-propagate by multiplying by the derivatives of all those h functions, one after the other, all the way back to the bottom. So basically, if you want to compute a gradient, you build a network that looks very much like this one, except the signals go backwards, and wherever you had an h function, what you have now is a derivative coming from the top: a scalar, just the same shape as z. You multiply it by the derivative of the h function, and you get the derivative of the cost function with respect to the input variable of h, which is s.
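You can check this chain-rule statement numerically. Here is a small sketch where h is tanh and c is an arbitrary scalar cost; both are choices made just for the check.

```python
import torch

h = torch.tanh                      # the nonlinearity (arbitrary choice)
c = lambda z: (z - 0.5) ** 2        # some scalar cost downstream of z

s = torch.tensor(0.3)
z = h(s)
# analytic: dc/ds = dc/dz * h'(s), with h'(s) = 1 - tanh(s)^2 for tanh
dc_ds = 2 * (z - 0.5) * (1 - torch.tanh(s) ** 2)

# finite-difference check: twiddle s by ds and watch c move
ds = 1e-5
dc_ds_check = (c(h(s + ds)) - c(h(s - ds))) / (2 * ds)
# dc_ds and dc_ds_check agree to several decimal places
```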
So basically what you have now is a transformed network that computes your gradient. Now, you can convince yourself of this, if you don't completely remember the chain rule (which I hope you do), by imagining that you twiddle s a little bit: we're going to perturb s by ds. As we go through h, h has a slope, which is h prime of s, so z is going to be perturbed by ds times that derivative: ds times h prime of s. And this is what is written here: perturbing s by ds will perturb z by dz, equal to ds times h prime of s. This will in turn perturb c: the perturbation of c is equal to the perturbation dz times the derivative of c with respect to z, which is dc over dz. But we've computed dz already; we know it's ds times h prime of s. So we substitute it in, pass the ds to the other side, and we get simply that dc over ds equals dc over dz times h prime of s. We've just re-derived the chain rule; I've done nothing more than that. But it's a little more intuitive if you think of it in terms of twiddling things around. And sometimes it's useful, when you're writing the backprop function for a module, to think in those terms, because sometimes it's easier than actually writing down the equations. All right. Now, we had two types of modules in our neural net, and the other one is the linear module. For this one, I'm again going to view the entire network as a black box, except for three connections going from a z variable to a bunch of s variables. An s variable is a weighted sum. s0, for example, is going to take the z at the bottom here and multiply it by its own weight, which I call w0 (I drop all the indices that are annoying). And then I ask again the question: if I twiddle z, by how much will c be twiddled? So if I twiddle z by dz, s0 is going to be twiddled by dz times w0, because z is multiplied by w0. If w0 is two and I twiddle z by dz, the value after the weight is going to be twiddled by twice that. But z actually influences several variables, in this case three. So it's also going to cause a perturbation of s1 and of s2. The perturbation of s1 is going to be dz times w1, and of s2, dz times w2. But now s0, s1, and s2 are all going to influence c. And the question is: by how much? c is going to vary by however much s0 varies, times the derivative of c with respect to s0. But c is also going to vary because s1 is varying, and because s2 is varying. If the variations are small enough, the overall variation is just the sum of the three variations. And so what you have here at the bottom is that the variation of the cost is equal to the variation of z multiplied by w0 (which is the variation of s0), times the derivative of c with respect to s0, which is dc over ds0; that's exactly what you see in the last equation, and you have to sum the contributions from the three components.
So dc over dz, in the end, is dc over ds0 times w0, plus dc over ds1 times w1, plus dc over ds2 times w2. When you have a branch like this, you perturb the input variable, all the branches are perturbed, and you have to sum up the effect on the cost function, which you assume you know: those are the dc-over-ds variables. Any question? Is that clear? [Question inaudible.] Okay, what does that mean? Look at this formula here. It says: if I have the derivatives of c with respect to s0, s1, and s2, all three of them, then I compute the weighted sum of those derivatives with the weights that go up on the forward pass, but I'm using them going down. And that gives me the derivative of the cost function with respect to z, which feeds those three weights. So basically, when you back-propagate through a neural net, you compute weighted sums of the gradients using the weights backwards. All right, so this gives a little bit of intuition, but there is a much more general formulation. Before we get to it, let's do it one step at a time. Because actually, the way you want to see a traditional neural net is more like this: you have an input variable, you multiply the input variable by the first matrix w0, that gives you s1; then you pass that through a nonlinearity, that gives you z1; then you multiply that by the weight matrix w1, that gives you s2; pass that through a nonlinearity, that gives you z2; linear again, blah, blah, blah. How many layers is this neural net? Three, yes. A layer is a pair, linear plus nonlinear. Most modern neural nets don't actually have clean linear/nonlinear separations; they're more complex things. Okay, so sk+1 equals wk times zk, where wk is a matrix, zk is a vector, and sk+1 is a vector; and zk equals h of sk, where h is the application of a scalar h function to every component. So, if you want to write this in PyTorch, you write something like this. There are many ways to write it in PyTorch: you can write it from scratch, you can write it in a functional way, or you can write it this way, which is more object-oriented and hides a bit of complexity for you. You import torch, you import nn from torch, you make an input, which is some order-three tensor. You count how many elements it has, and that's going to be the size of your input layer; we're going to turn it into a vector, but not yet. And then you define a class for your neural net. The constructor just initializes three linear layers. The linear layers need to be separate objects in this case, because each one contains a vector of parameters. The ReLUs don't need to be separate objects, because they don't have parameters. So that's the complexity hidden in those nn.Linear objects. nn.Linear actually does a little more than multiplying by a matrix: it also adds a bias vector, but that's okay. So you initialize those layers with the right sizes, which you pass as arguments to the constructor, and then you define a forward function, which is how you compute the output as a function of the input. The first line, x.view(-1), just flattens the input tensor into a vector; then you apply the first linear module to x, you get s1; then apply the ReLU nonlinearity to s1, you get z1; et cetera, et cetera, and then you return s3.
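Assembled into actual code, the network just described might look like the following sketch; the tensor shape and the layer sizes are arbitrary choices for illustration.

```python
import torch
from torch import nn

x = torch.randn(3, 4, 4)   # some order-3 input tensor (sizes arbitrary)
d = x.nelement()           # number of elements = size of the input layer

class Net(nn.Module):
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        # the linear layers are separate objects because they hold parameters
        self.m0 = nn.Linear(d_in, d_hidden)
        self.m1 = nn.Linear(d_hidden, d_hidden)
        self.m2 = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        z0 = x.view(-1)         # flatten the input tensor into a vector
        s1 = self.m0(z0)
        z1 = torch.relu(s1)     # ReLU has no parameters: a plain function call
        s2 = self.m1(z1)
        z2 = torch.relu(s2)
        s3 = self.m2(z2)
        return s3

net = Net(d, 10, 2)
y_bar = net(x)
```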
Okay, and the beauty of PyTorch, which Alfredo will explain to you tomorrow, is that you don't need to worry about computing the gradient: once you've written the forward function, PyTorch knows what it looks like and knows how to back-propagate gradients through it. It knows how to transform the graph that corresponds to your forward function into a graph that corresponds to the backward function, so you don't have to worry about it. But you still need to know how to compute gradients, because sometimes you have to write your own module. You invent some new type of neural net, and it's got this new multi-head, multi-tail, memory, attention, LSTM, whatever; you have to write your own thing, and you might even have to write your own CUDA kernel. But it's pretty simple. Yes? [A question.] So as I said, if you don't have a nonlinearity, the whole thing is linear, so there's no point having layers. Now you have to think: what is the simplest nonlinearity you can think of? It's going to be a pointwise, component-wise nonlinearity. And what's the simplest component-wise nonlinearity? Something that has a single kink. Now, the funny thing is, we're talking about gradient-based learning, and this is not even differentiable, right? Because it's got a kink. If you're a mathematician and you're obsessive-compulsive about it, you would call this not a gradient but a subgradient. But, you know, how many mathematicians are there here? Yes: at that point there are several subgradients. If you have a function with a kink in it, at the kink any slope lying between the slope on one side and the slope on the other is correct; all of those are good subgradients. And so the question is, should you use something in the middle, or just zero? It doesn't matter, because it's just one point; it has no practical impact. Okay, so here's the slightly more general form. We're going from the specific to the slightly more general. Here's a form of the chain rule for modules that may have multiple outputs and multiple inputs, or inputs that are vectors and outputs that are vectors. I don't give them different symbols here. The basic formula, dc over dzf equals dc over dzg times dzg over dzf, still holds. This is the same chain rule formula that we wrote previously for scalar functions; it also applies to vector functions. But there's one thing to remember. The gradient of a scalar function with respect to a vector is a vector of the same size as the vector with respect to which you differentiate; but if you write it this way and you want the notation to be consistent, it's a row vector, not a column vector. So we take a scalar function that depends on a column vector, differentiate the scalar function with respect to that column vector, and what you get is a row vector. Technically, the gradient is its transpose, but dc over dzf written this way is a row vector; that's the notation. And you can see that it checks out. So let's imagine that zg is a column vector of size dg by one, and zf is a column vector of size df by one.
Then this little chain rule equation says that a row vector of size df is equal to a row vector of size dg multiplied by a matrix with dg rows and df columns. And of course, the last dimension of the vector and the first dimension of the matrix have to match if you want the product to work out. A more convenient form would be to transpose everything: dc over dzf transpose, which is now a column vector, is equal to the transpose of the product, that is, the transpose of dzg over dzf times the transpose of dc over dzg. That would be a more convenient way of writing it, but it's simple this way. More consistent. Okay, so what's this funny animal here, dzg over dzf? We have a little neural net here that has two modules in it, f and g. The output of the f module is zf, and the output of the g module is zg. And basically, we want the gradient of the cost function with respect to zf. We assume we know the gradient of the cost function with respect to zg; we know how to back-propagate through c. To compute the gradient with respect to zf, given the gradient with respect to zg, we need to multiply by this matrix dzg over dzf, which is called the Jacobian matrix of g with respect to its input. g has two arguments, so we can differentiate it with respect to z or with respect to w; here we differentiate with respect to z. What is this matrix? The entry ij of the Jacobian matrix is the partial derivative of the i-th component of the output vector of the g module with respect to the j-th component of the input vector. So if I twiddle the j-th input, all the outputs twiddle, and that gives an entire column of the Jacobian matrix. Okay, that's backprop, right? If you have a network composed of a cascade of modules, you just keep multiplying by the Jacobian matrices of all the modules going down, and you get the gradients with respect to all the internal variables. Now, you actually need two sets of gradients: the gradients with respect to the states, but also the gradients with respect to the weights. And as I said, a module that has parameters has two Jacobian matrices: one with respect to its input state, and another with respect to its parameters. So you have the two equations here. Let's say now you have a slightly more general neural net, which is a stack of many modules. Each module is called fk, so k is an index for the module; its input is zk, its parameters are wk, and its output is zk+1. So zk+1 equals fk of zk, wk. Very simple. So how do I compute dc over dzk, the gradient of the cost function (or whatever function you want to minimize) with respect to the input of module k, assuming I already know dc over dzk+1? You just multiply by the Jacobian matrix of module k, which is dzk+1 over dzk; in other words, the Jacobian of fk of zk, wk with respect to zk. So that's just the chain rule again: dc over dzk equals dc over dzk+1, which I assume I know, times the Jacobian matrix of fk with respect to zk. The second line is the same thing with respect to w: dc over dwk is equal to dc over dzk+1, which I already had at the top, times dzk+1 over dwk, which is the Jacobian matrix of the fk function with respect to its weights, its parameters, whatever they are.
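Here is a small sketch checking one backprop step with an explicit Jacobian, using torch.autograd.functional.jacobian; the tanh-of-linear module and all the sizes are arbitrary choices for the check.

```python
import torch
from torch.autograd.functional import jacobian

# a tiny module f_k: z_{k+1} = f_k(z_k) = tanh(w_k z_k)
w_k = torch.randn(3, 4)
f_k = lambda z: torch.tanh(w_k @ z)
z_k = torch.randn(4)

# pretend this row vector is the already-known dC/dz_{k+1}
dC_dz_next = torch.randn(1, 3)

J = jacobian(f_k, z_k)       # Jacobian dz_{k+1}/dz_k, shape (3, 4)
dC_dz_k = dC_dz_next @ J     # one backprop step: a (1, 4) row vector
```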
That's all there is to backprop. Are you okay? Any questions? So here's a little concrete example. Let's say we have one of the simple functions here, G of x, w; we don't know what's inside, but that's okay. And it goes into a cost function. It's a graph. And through this manipulation of multiplying by Jacobian matrices, we can transform this graph into the graph that computes the gradients going backwards. Things like PyTorch and TensorFlow will do this automatically for you: you write a function, it turns it into a graph, and then there is something that turns that graph into the derivative graph that does the backprop and computes the gradient. So in this case, the gradient graph looks like the one on the right. You start with one at the top, and then you compute the Jacobian of C with respect to y bar, and you multiply this single number by this Jacobian. The Jacobian here is actually a vector (it's a gradient, a row vector), and that's dc over dy bar. Then you multiply that by the Jacobian of G with respect to its weights, and you get the gradient with respect to the weights. That's what you need for training. So that's an example of automatic transformation; that's what autograd does. Now, what becomes complicated is when the architecture of the graph is not fixed but data-dependent. Imagine that, depending on the value of x, you have a test in your neural net code: if x is a vector longer than a certain length, you do one thing, and if it's shorter, you do another thing. Then you're going to have a condition and two different graphs depending on the input. You still need to generate the graph for backpropagation. If you have loops, it becomes complicated, but you can still do it. Yeah. Yeah, it usually doesn't work very well if the number of loop iterations is more than, say, 50. And when I say 50, it could be 20; it depends. You've probably heard of LSTMs: what's special about an LSTM, compared to a regular recurrent net, is that it's one way of making them work for longer than about five steps. But they don't work very well past 20 or so. The point is that you can have a variable number of steps. It's specified by the program, and it can be variable; it depends on the size of the input. A lot of the neural nets people use nowadays are variable-size: x could be a variable-size multidimensional array, which means G has variable sizes inside, and you can have complexities going on there. So again, in terms of the sizes those things take: dc over dw is a row vector, which is one by n, where n is the number of components of w; dc over dy bar is one by m, where m is the dimension of the output; and dy bar over dw has as many rows as G has outputs and as many columns as the dimension of w, which is n. So it checks out; it's all fine. Okay, now, what kind of modules are we using in neural nets? As I said, the linear module and the ReLU (pointwise nonlinearity) module are just two examples of the things we use to build neural nets, and deep learning systems in general. But there are tons of them: if you look at the PyTorch documentation, there is a huge list of such modules.
And the reason why you need a lot of them (most of them can be built out of smaller, more elementary functions) is first that they have a name and they're debugged, but also that they're optimized. Sometimes there are hand-written CUDA kernels for them, or they're generated by a compiler or something. So here's a bunch of elementary modules. Let's start with the duplicate module. What is a duplicate module? It's basically a Y connector: if you want two people to listen to music on your iPhone, you need one of those Y cables. The first output is equal to the input, and the second output is also equal to the input: y1 equals x, y2 equals x. You would think that you don't even need a module like this, but you actually do. In PyTorch it's often implicit, but sometimes you need to make it explicit. So whenever you have a wire that splits into two, or n, the gradients get summed on the way back. It's exactly the same situation I explained earlier: you can think of the z variable splitting into three wires as one of those branch modules, and as the three wires converge on the way back, you have to sum the gradients. We already figured that out, but you can build it into this split module, duplicate, or triplicate, or n-plicate, whatever it is. So whenever you copy a variable, whenever you use a variable in multiple places, you need to sum the gradients. Again, autograd in PyTorch does this for you, but remember it. Then add: if you have two variables and you sum them, when you twiddle one of them, the output twiddles by the same quantity, and when you twiddle the other, the output also twiddles by the same quantity. What that means is: when you have the gradient of the cost function with respect to the output of a sum, the gradient with respect to each of the two branches you added up is equal to it, for both branches. So if you have a connection going this way and you get a gradient from the top, you just copy the gradient. You get the same influence from both sides, and it's independent of the values of the inputs. You have to think about it for a second, but it's pretty obvious. Actually, let me try this pen... Ha, it only works if I mirror my screen, because I can't write on a screen that doesn't exist. Okay, hang on with me for just a minute here. No, that's not what I wanted. Oh wow, okay, here we go. Sorry about that; Ctrl-plus actually works. Okay, let's see if this works. So if y equals x1 plus x2, then dc over dx1, let's say, is equal to dc over dy times dy over dx1. The first factor we assume we know. How much is the second? One, of course; and dy over dx2 is also one. So there you have it: dc over dx1 equals dc over dy, and dc over dx2 equals dc over dy. Just take dc over dy, copy it, and you're done.
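Autograd applies these rules for you; this small sketch shows the gradient being copied through an add, and summed when a variable is used in two places.

```python
import torch

# add: the gradient from the top is copied to both inputs
x1 = torch.tensor(1.0, requires_grad=True)
x2 = torch.tensor(4.0, requires_grad=True)
(x1 + x2).backward()
print(x1.grad, x2.grad)   # tensor(1.) tensor(1.)

# duplicate: x is used in two places, so the gradients from the
# two branches are summed on the way back
x = torch.tensor(2.0, requires_grad=True)
(3 * x + 5 * x).backward()
print(x.grad)             # tensor(8.) = 3 + 5
```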
Max, that's an interesting one. So y equals max of x1, x2, and dc over dx1 equals dc over dy times dy over dx1; it's just the chain rule. What is dy over dx1? [A student answers.] Yes, correct: dy over dx1 is zero if x2 is larger than x1, and one if x1 is larger than x2. But intuitively it's very simple, and you can completely understand it graphically. If you have variable x1 and variable x2, this max module is basically just a switch. I'm drawing an arrow here, but it's not an arrow, it's a switch: I can move it from left to right, choosing to connect x1 to y or to connect x2 to y. In this case it's the max that decides where to put the switch. But once the switch is on one side, it's just a wire: I've connected x1 to y, and x2, if I twiddle it, has no influence on the output. Therefore, the gradient of the cost function with respect to x2 is zero. And the gradient of the cost function with respect to x1 is, of course, equal to the gradient of the cost function with respect to y, because it's just a wire: it's the same variable, really. That generalizes to a switch over multiple variables. However many variables I have, if the output is determined by a switch connecting one of the input variables, then when I back-propagate, the gradient propagates through the variable that was connected, and the other ones just get zero. It's easier to draw it this way than to actually write the math; you'd have to use delta functions and stuff.
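The switch picture is easy to check in PyTorch (with distinct inputs, so the switch position is unambiguous):

```python
import torch

x1 = torch.tensor(2.0, requires_grad=True)
x2 = torch.tensor(5.0, requires_grad=True)

y = torch.max(x1, x2)     # the switch connects x2 to y here
y.backward()
print(x1.grad, x2.grad)   # tensor(0.) tensor(1.): the gradient flows only
                          # through the input the switch selected
```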
Okay, log softmax, that's a fun one. Oh, I have to use a new page; "next page" actually doesn't go to the next page. So softmax is a module where the i-th output yi is equal to e to the xi, divided by the sum over j of e to the xj. It's a module, which I should draw this way, that has as many outputs as it has inputs. I'm calling these the yi and these the xj, let's say. So that's softmax. It's a very convenient way of transforming a bunch of numbers into a bunch of positive numbers between zero and one that sum to one. The xj's can be any numbers; when I take their exponentials, I get positive numbers, and I normalize by their sum. So what I get is a bunch of numbers that are between zero and one and sum to one, which some people call a probability distribution. So you can interpret the yi as a vector of probabilities over a discrete set of outcomes. What is log softmax? Log softmax is the log of that. You get the log of the numerator minus the log of the denominator. The log of e to the xi is just xi, unless I'm mistaken, and then you get minus the log of the sum of the exponentials of the xj's. So that gives xi minus log of the sum over j of e to the xj. That's called log softmax. Now, the guy who invented softmax in 1989 or so, or maybe '88, I don't remember, is a gentleman by the name of John Bridle, from Britain, and he regretted calling it softmax. He said it should have been called softargmax. But it's too late; people call it softmax. So here's an interesting exercise for you. I'm not going to tell you how you back-propagate through this, but I want you to do the calculation. That's a very good exercise. Log softmax is actually a module in PyTorch, but do it on your own; it's a perfect exercise. So: compute dc over dxk, assuming that you know all the dc over dyi's. You're going to have a bunch of dc over dyi's. Here you only have one output, actually, but it's okay. Let's say there is only one yi, and you know the gradient of the loss with respect to this yi. What is the gradient of the loss with respect to all of the xk's? That's a good exercise. Is it an official homework? It's an official homework tonight, okay? It's more than just an exercise. You can find the answer, but it's more fun to try: you don't learn as much if you just look up the answer. Okay, so softmax-and-log is a combination of modules that is very commonly used in multi-class classification. You take a neural net whose last module is a softmax, so it normalizes all the outputs, makes them positive, makes them look like probabilities. And what you want is to maximize the probability that the model gives to the correct answer. You know the correct answer is "bird"; bird is number four in your categories. You want the fourth output of your softmax to be as high as possible. That's where log softmax comes in. Now, if you separate this into two pieces, a softmax followed by the log of the correct output as your cost function, you get numerical issues, because you can't take the log of zero: as the score gets very, very small, the log diverges and you get numerical problems. So you're better off writing log softmax directly as a single module, because then that numerical issue disappears.
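You can see the issue with a quick sketch; the scores here are chosen specifically so that the naive version underflows in float32.

```python
import torch

x = torch.tensor([100.0, 0.0, -100.0])

naive = torch.log(torch.softmax(x, dim=0))  # softmax underflows to exactly 0.0
                                            # for the smallest score
stable = torch.log_softmax(x, dim=0)        # fused module: x_i - logsumexp(x)
print(naive)    # tensor([0., -100., -inf])
print(stable)   # tensor([0., -100., -200.])
```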
Okay, that's a good question. In fact, it's a very good question for the next 20 minutes. Not just the next 20 minutes, actually. I stupidly put my pen back. Did I? What did I do with my pen? Ah, it's here. Okay. So let's say we have a neural net, and it takes an x variable, then there's w0, then a ReLU, then w1. And now we get a score, and we want to turn it into a number between zero and one. This network has only one output, so we can only do two-class classification, and the module we're going to put here is the sigmoid function, also called the logistic function. This function is h of s (since we've called the input s before) equals one over one plus exponential of minus s. When s is very large, the exponential is close to zero, so h is close to one; when s is very small, highly negative, the exponential becomes very large, so the overall function goes to zero. So the function looks like this: it crosses 0.5 here, the asymptote on this side is plus one, and on the other side it's zero. (The 0.5 is probably unreadable, sorry.) So I could take the output here, which I'll call y bar, and plug it into some cost function that compares it with y, where y is also a binary variable, zero or one. Now, what should this cost function do? I could use squared error, right? So c could be the difference between y and y bar, squared. Sounds perfectly reasonable. Doesn't work very well. The reason it doesn't work very well is the sigmoid. People in the early days of neural nets, in the 1980s, were doing this very commonly, and the network wouldn't converge, and they would say neural nets don't work; but they were just doing it wrong. The problem is that if y is equal to one for one class and zero for the other class, the system wants to get the output equal to one, and it can't, because it's an asymptote. So it tries to make the weights w1 very, very large so that it gets close to one or to zero: it has to make the weighted sum enormous to get close to the desired output. But there the gradient is very small, because the sigmoid is very flat there. So when you back-propagate, the gradient is basically zero, because the sigmoid is flat: you get this saturation problem. So people like me said, back in the old days, one of two things: either you set your targets in between, not at the asymptotes, or you use a different loss. So you say: here is the sigmoid function; the target for category one is going to be at, I don't know, 0.8, and the target for category two at 0.2. Those targets are attainable, so the weights won't go to infinity and you won't have those problems. But here's another idea, and the other idea is: just take the log of it. If you think about this sigmoid function, it's actually a softmax: a softmax between two inputs, one of which is s, and the other of which has an exponential equal to one. Let me write the function another way. I'm going to multiply the top and the bottom by e to the s. So I get e to the s, divided by e to the s plus e to the s times e to the minus s, and that last product is just one. This is a softmax where one of the exponentials is one and the other is e to the s, and what I'm looking at is the output corresponding to the s input. So softmax is just a generalization of the sigmoid to multiple outputs. Now, if you take the log of this, you get s minus log of one plus e to the s. And again, this is the special case of a softmax with only two inputs, where one of the exponentials is equal to one. So what's the effect of the log? Look at this second term, log of one plus e to the s: when s is very large, the one doesn't count in the sum, so you basically have log of e to the s, which is just s; for large s it's the identity function. And for small s, the one dominates, so it's log of one, which is zero. It's kind of a soft ReLU. But the point is, it doesn't saturate, so you don't get those vanishing-gradient issues. [A student comment.] Yes, but you have a log in front. So log of e to the s is s. Oh yeah, sure; I was just talking about the second term. If you take the entire function, s minus log of one plus e to the s, it's the exact mirror image, the other way around. Yeah, absolutely. Okay, do we have softmax also as one of the exercises or not? Yeah, right, okay. All right, so let's end with a few practical tricks; you'll see more of them tomorrow and as you start playing with backprop. First trick: use ReLU instead of hyperbolic tangent.
So hyperbolic tangent is just like the sigmoid, except that it's scaled by two and shifted down by one, so it goes from minus one to one instead of zero to one. But it's essentially the same shape; we talked about it last week. And they're both falling out of favor. ReLU tends to work much better when you have many layers. And probably the reason is that it's scale-equivariant: if you multiply the input by two, the output is multiplied by two but is otherwise unchanged. It's got only one kink, so it has no scale to it; whereas if you have two kinks, the input has to have a particular variance to fit those two kinks in the right place. So people use ReLU. Next: use the cross-entropy loss for classification; log softmax is a simple special case of the cross-entropy loss, and we'll come back to that. There's a word of caution there. Yes. [A student:] It definitely expects log-softmax, but it's super easy to accidentally miss that. Right, yeah. You want to use log-softmax, not softmax, definitely. If you feed a cross-entropy loss function, it expects outputs from a log-softmax, not a softmax. If you don't know this yet, you might waste a lot of time. Next: use stochastic gradient on mini-batches; we talked about this before. And shuffle the training samples: with stochastic gradient, the order of the examples matters. If you have, say, a 10-way classification like MNIST, classifying the 10 digits from zero to nine, and you present all the zeros, then all the ones, then all the twos, et cetera, it's not going to work. What happens is that on the first few examples of zeros, the system adapts the biases of the last layer to just produce the correct output, and it never learns what a zero looks like. Then you show it ones, and it takes just a few samples for it to adapt the biases so that it produces "one" without actually looking at the input. And it keeps doing this for eons and eons and never converges. So you absolutely need to shuffle the examples in the case of MNIST, but it's true for a lot of other datasets too. In a mini-batch, as I said before, you want examples of all the categories: if you use mini-batches, use samples from different categories in each one, and if you don't use mini-batches, present samples of different categories one after the other. There are debates as to whether you need to re-shuffle the samples at every pass through the training set. It's not entirely clear: some people claim it's better if you don't, some claim it's better if you do, with various theoretical arguments either way. Next: normalize the input variables. If you look at standard code people publish for training on ImageNet or speech recognition or whatever, the first operation they do is normalize the inputs. What do they do? An image is really three planes, R, G, and B. Think of it as a three-dimensional array where the first dimension is the color plane and the other two dimensions are space. (Sometimes it's the other way around, with the channel last, but it's better to think of it this way.) So what you do is take each of those planes, say blue: you compute the mean of all the values in this blue plane, and you do this for every single image in your training set.
Next, normalize the input variables. If you look at standard published code for training on ImageNet or speech recognition or whatever, the first operation is normalizing the inputs. What does that mean? An image is really three planes, R, G, and B. Think of it as a three-dimensional array where the first dimension is the color plane and the other two dimensions are space. Sometimes it's the other way around, with the channel last, but it's better to think of it this way. So you take each of those planes, say blue. You take the entire training set, or a good chunk of it, and compute the mean of all the blue values over the whole set. That gives you a single scalar, right? Call it mu B, the mean of all the blues. You do the same for the standard deviation: compute the variance of all the blues and take the square root, and that's sigma B. Do the same for green and for red. So you get six numbers, six scalar values.

Now you take whatever you see in the image. You take, say, the R component at position i, j, and to normalize it you replace it by itself minus the mean, divided by the standard deviation, or rather by the max of the standard deviation and some small quantity, so it doesn't blow up. And of course you do the same for the green and the blue. What does that do for you? It normalizes the contrast, and it makes each channel zero mean and unit variance. This is good for various reasons; in fact, it's a good idea to have variables inside a neural net that are zero mean, with variances that are more or less all the same.

[Student: is the mean computed across many images?] Yeah, across many images; it's a single mean. I mean, there are various ways to do it. You can do it for a single image, or for a group of images, which is what batch norm does. You can also do it on a small piece of an image, which is called high-pass filtering. But the simplest thing, and what almost everybody does in the standard ImageNet pipeline, the standard image recognition pipeline with convolutional nets, is this.

[Student: why per channel?] Yeah, the channels have very different statistics. In a typical natural image, outside versus inside, the components are very different; you get color shifts. The amplitude of blue is relatively low if you are in full sun, for example, and the amplitude of red is basically non-existent if you are under water. So if you want any kind of usable signal, you need to normalize; this is basically automatic gain control. The means are very different too, of course, because they depend on your overall luminosity, and you don't want a system whose recognition depends too much on the global illumination of the image. So it's a way of getting rid of global illumination, and of bad tuning of your exposure or contrast or whatever. But there are also very good numerical reasons for doing this, and we'll come back to why later.
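A minimal sketch of that per-channel normalization in PyTorch; the random `images` tensor stands in for a real training set, and the shapes and names are mine.

```python
import torch

# `images` stands in for the training set (or a good chunk of it),
# laid out channels-first: (N, C, H, W) with C = 3 for R, G, B.
images = torch.rand(1000, 3, 32, 32)

# One mean and one standard deviation per channel, over the whole set:
# six scalars in total.
mean = images.mean(dim=(0, 2, 3))
std = images.std(dim=(0, 2, 3))
eps = 1e-5   # floor on sigma so the division can't blow up

normalized = (images - mean.view(1, 3, 1, 1)) \
    / torch.clamp(std, min=eps).view(1, 3, 1, 1)
print(normalized.mean(dim=(0, 2, 3)))   # ~ (0, 0, 0)
print(normalized.std(dim=(0, 2, 3)))    # ~ (1, 1, 1)
```

In standard image pipelines, torchvision's `transforms.Normalize(mean, std)` applies exactly this per-channel operation.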
In most precooked code, you will also find a schedule to decrease the learning rate. First of all, most systems don't use plain stochastic gradient; they use adaptive methods like Adam, which automatically adapt the step size, or other tricks such as the momentum trick, Nesterov momentum in particular, which Adam integrates. And generally, if you really want good results, you need to decrease your learning rate as time goes by, and there are standard ways of scheduling that decrease that you can use.

Occasionally, not always, you can also use a bit of L2 or L1 regularization along the way. What does that mean? It means that in addition to your loss function, in addition to your cost, you have a regularization term that depends only on the weights; the cost depends on the sample as well, right? And you have some coefficient, alpha, that controls its importance.

L2 regularization means R of W is the square norm of W. When you compute the gradient of R with respect to a particular component W i, you get two W i. So in the update rule, W i is replaced by W i minus eta times the gradient of the overall loss with respect to W i, which is W i minus eta times dC over dW i, minus two eta alpha W i, because it's a minus gradient. I can rewrite that as W i times one minus two eta alpha, minus eta dC over dW i. So what does that mean? You take every weight, and at every iteration you shrink it by a constant that's slightly less than one. That's why people call this weight decay: in the absence of any gradient from C, the weights exponentially decay to zero. It tells the system: minimize my cost function, but do it with a weight vector that is as short as possible. Statisticians call this L2 regularization.

The other one is L1. L1 regularization is a regularization term equal to the sum over i of the absolute value of W i, which is the L1 norm. When you do the gradient update, you get W i minus eta dC over dW i, and then minus the gradient of the regularizer, which is the sign of W i, with the alpha in front. The sign is a constant that is positive if W i is positive and negative if it's negative, and there's a minus sign in front, so W i is being shrunk towards zero by a constant equal to eta times alpha. Statisticians call this LASSO, least absolute shrinkage and selection operator; there's some sort of pun in it, and they pronounce it "lasso" for a reason I never understood. So L1 shrinks all the weights towards zero by a constant, and what that means is that a weight that is not useful gets driven all the way to zero. That is very interesting when you have a network with a very large number of inputs, many of which are not useful: it basically eliminates those inputs, because the weights that connect to them go to zero.

[Student: when do you turn the regularization on?] Okay, so first of all, you don't want to crank it up at the start, because there's a curious thing with neural nets, which is that the origin of weight space is a saddle point. If you crank up L1 or L2 initially, the weights just go to zero and nothing works. You probably want to start with alpha equal to zero and then maybe increase it; how much you regularize depends on how much is necessary. And a lot of people just don't use any, either L1 or L2.
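Here is a toy sketch of those two update rules; eta, alpha, the weights, and the gradient values are placeholders I made up, not numbers from the lecture.

```python
import torch

eta, alpha = 0.1, 0.01                      # learning rate, regularization strength
w = torch.tensor([0.5, -0.3, 0.0])          # some weights
dc_dw = torch.tensor([0.2, -0.1, 0.05])     # gradient of the per-sample cost

# L2 / weight decay: every weight is multiplied by (1 - 2*eta*alpha),
# a constant slightly less than one, before the usual gradient step.
w_l2 = (1 - 2 * eta * alpha) * w - eta * dc_dw

# L1 / LASSO: every weight is shrunk toward zero by the constant eta*alpha.
w_l1 = w - eta * dc_dw - eta * alpha * torch.sign(w)
```

In PyTorch, the L2 version is what the `weight_decay` argument of the optimizers implements, e.g. `torch.optim.SGD(params, lr=eta, weight_decay=alpha)`, up to the factor-of-two convention on alpha.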
Actually, I forgot a very important trick in this list, which is that the weights of a neural net have to be initialized properly. There are various initialization tricks built into PyTorch; one is called the Kaiming trick, though it's really the Léon Bottou trick from twenty years earlier; it was reinvented multiple times. The idea is that you want the weights that go into a unit to be initialized randomly, but you don't want them too large or too small: you want them roughly the right size so that the output of the unit has roughly the same variance as its inputs. If the inputs to a unit are independent, the variance of the weighted sum is the sum of the variances of the inputs weighted by the squares of the weights. So if a unit has n inputs and you want its output to have the same variance as the inputs, you need the weights to be proportional to one over the square root of the number of inputs. And that's basically the trick: you initialize the weights to values drawn randomly with zero mean and standard deviation one over the square root of the number of inputs to that unit. That's built into PyTorch as well. Initialization is super important; if you do it wrong, your network is not going to converge.

Now, even the people who skip L1 and L2 do use Dropout. Dropout is another type of regularization, and you can think of it as a layer that you just insert inside a neural net. It's a box with n inputs and n outputs, and it randomly sets about n over two of the outputs to zero, with a new random draw for every sample. That layer basically kills half of its components. This sounds crazy, right? But in fact it makes the other variables more robust: it forces the system not to rely on any single unit to produce an answer, and to distribute information across all the units, because it knows that during training half of them can disappear. It's a trick that Geoff Hinton and his team came up with, and it turns out to be quite an efficient way of regularizing neural nets. A lot of people use it, and there are variations of it we'll talk about. Many of these tricks are in a paper, Efficient BackProp, that I wrote many years ago and that you are invited to read.

Okay, last thing for today, even though we're late. This whole framework of having a compute graph with backprop running through it doesn't work just for stacked modules. It works for any arrangement of modules, including dynamic ones that depend on the input. Question?

[Student: why do we care that ReLUs are scale-equivariant if we normalize anyway?] So the question is, why normalize too? Well, suppose you have sigmoids, hyperbolic tangents, say, and you normalize. If you normalize to a variance that is too small, the system is not going to be able to use the nonlinearity; if the variance is too large, the units are going to saturate. So what's the right scale? It's not clear. With ReLUs you don't care, as long as the variance is the same all over the network. If it's not the same all over the network, you're going to get issues: some layers will learn faster than others, and some will diverge while others are converging. So you want the variances to be roughly the same everywhere, and that's what things like batch norm do for you. We haven't talked about batch norm yet, but okay, that's it for today. Thank you. See you next week.
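To close out those last two tricks, here is a minimal sketch of the initialization rule and of Dropout as a layer; the sizes and names are mine, and `kaiming_normal_` is the PyTorch version of the scheme mentioned above.

```python
import torch

fan_in, fan_out = 1024, 512

# Zero-mean weights with standard deviation 1/sqrt(fan_in): the variance of
# each weighted sum then matches the variance of the (unit-variance) inputs.
w = torch.randn(fan_out, fan_in) / fan_in ** 0.5

x = torch.randn(10000, fan_in)
y = x @ w.t()
print(x.var().item(), y.var().item())   # both close to 1.0

# PyTorch ships this family of schemes, e.g. the Kaiming variant for ReLU nets:
w2 = torch.empty(fan_out, fan_in)
torch.nn.init.kaiming_normal_(w2, nonlinearity="relu")

# Dropout is just a layer you insert: in training mode it zeroes each input
# with probability p and rescales the survivors by 1/(1-p).
drop = torch.nn.Dropout(p=0.5)
```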