So again, you have two weeks for finishing up the homework; it might take up to two weeks depending on your skills. If you know how to program, it might take you half an hour; otherwise it might take a couple of hours to finish the thing. Again, it depends on your skills, right? So start early, so you finish early, perhaps. Or perhaps you take the whole two weeks, but, you know, it's up to your skills. All right, I'll step away now and leave Yann to give the lecture. Bye-bye. So the first one is the ReLU. We already talked about it. This is the positive part function, sometimes denoted with a kind of superscript plus, which is really just the max of 0 and x. So it's a function equal to 0 when x is negative, equal to the identity when x is positive. It's not everywhere differentiable because it has a kink, but as I explained, when we backpropagate, we just define the gradient as a subgradient, which is a perfectly fine mathematical concept. But there are variants of this. One issue with the ReLU is that when you're in the flat part, there is no gradient being backpropagated through that particular nonlinearity, right? So if the weighted sum, or whatever value is coming into a ReLU, is negative, that ReLU, when you backpropagate through it, will just produce a 0 as the gradient going through it, because it's flat, right? You can change the input and that makes no difference to the output. And that sometimes can be a problem. So people have come up with these other things, the leaky ReLU or PReLU or things like this. These are basically versions where the negative part also has a slope, which can be trainable, or can be adjusted, or can be set randomly. It's a good idea to have a function that has only one kink, and that's probably one reason why ReLUs and PReLUs and things like this have become so popular: they're basically equivariant to scaling. So what does that mean?
That means that when the input is multiplied by 2, the output is multiplied by 2 but otherwise unchanged. I think in mathematics that's also called homogeneous. So the advantage of this is that a ReLU doesn't care about the amplitude of the variables that come into it, right? It will fulfill its function regardless. Whereas if you had a function with two kinks, then the amplitude of the input signal really matters, because it will determine whether you're using the two kinks or only one, for example, right? And that seems to be an advantage, particularly in very deep networks. So the leaky ReLU has a small negative slope, which you can barely see here. The PReLU has a coefficient a here that determines the slope in the negative part. These are still scale-equivariant functions. So here's one that isn't scale-equivariant, that isn't equivariant to amplitude, if you want. And it's softplus. You can think of this as kind of a soft version of the ReLU, and the sharpness of the kink is determined by a parameter beta. So this function is 1 over beta times the log of 1 plus the exponential of beta x. When x is very large, the exponential dominates the 1 and the log cancels the exponential, so that basically becomes the identity function, right? The 1 over beta cancels the beta. And for x negative, the exponential of something negative is close to zero, so this becomes log of 1, right? And that's basically zero. So you can tell that this function smoothly transitions from being close to zero to being close to the identity. And the speed of that transition is this beta parameter. So if you have large beta, it looks very much like a ReLU. You know, it's a little soft at the corner, but it's very much like a ReLU. If it's small beta, then it's a smoother transition. That's nice. It's sometimes used as a cost function as well. I'll come back to that.
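A minimal sketch of these functions in plain Python; the leaky slope `a = 0.01` and the default `beta = 1.0` are conventional values I'm assuming here, not from the lecture:

```python
import math

def relu(x):
    # positive part: max(0, x); flat (zero gradient) on the negative side
    return max(0.0, x)

def leaky_relu(x, a=0.01):
    # the negative side gets a small slope a, so some gradient still flows
    return x if x > 0 else a * x

def softplus(x, beta=1.0):
    # (1/beta) * log(1 + exp(beta*x)): a smooth ReLU; larger beta = sharper kink
    # (this naive form overflows once beta*x goes past roughly 700)
    return math.log1p(math.exp(beta * x)) / beta

# ReLU-family functions are scale-equivariant: input * 2 gives output * 2 ...
print(relu(2 * 1.5) == 2 * relu(1.5))        # True
# ... softplus is not, except in the limits where it approaches
# the identity (x >> 0) or zero (x << 0)
print(abs(softplus(10.0) - 10.0) < 1e-3)     # True
```

In PyTorch these correspond to the `torch.nn.ReLU`, `torch.nn.LeakyReLU`, and `torch.nn.Softplus` modules.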
This is a different way of making the ReLU soft that also has a bit of a negative value, which may be interesting in some cases. So as I said, I'm not going to go through the entire list of all of those variants. You're welcome to play with them if you're interested. There are these various parameterized functions that can make a ReLU look more like a linear function, or look more like a hard ReLU. Things that have funny values for parameters and kind of try to stitch two pieces together. Lots of variations. Now this one, I'm showing this one because it is not monotonic. It sort of goes down a little near zero and then goes up. It's generally dangerous to do this, because if you have a non-monotonic function, it means there are several possible input values that will give the same output value. And backprop doesn't like that very much, because that is probably going to create some sort of saddle point or local minimum or something. So people tend to stay away from non-monotonic functions. They don't have to be strictly monotonic, but a monotonic function is a good idea. Okay, so this one is kind of a double saturating function, which in this case has the threshold at six. Why not? This one is used actually fairly widely, so it's useful to know about. It's called the sigmoid function. Physicists actually call this the Fermi-Dirac distribution if you put a parameter inside the parentheses. It's a very well-known function; it's sometimes called a logistic function as well. So it's a function that is equal to one in the limit of x going to infinity, and equal to zero in the limit of x going to minus infinity, and it smoothly transitions between the two.
So if you want some binary variable in your network, a switch, for example, something that decides A or B, like an elementary classifier, but you want this thing to be differentiable, this is a good function to use, because it will smoothly transition between zero and one. We'll see in a minute that if you want to activate or deactivate a particular part of the network, this could be a good way to compute a coefficient with which to activate or deactivate that part of the network in a differentiable manner. So keep this one in mind, because we're going to use it. This is essentially the same function as before except it has a different name, and it goes between minus one and plus one instead of zero and one; it's essentially the same function except you multiply it by two and you shift it by one, and then there's a coefficient on the x, which I'm not going to go through. It's called the hyperbolic tangent. It's also very popular. This was the standard function that a lot of people were using in neural nets going back to the 1980s and 90s. Most neural nets were using functions like this and not using ReLUs; ReLUs are actually a fairly recent development, about 10 years old. It turns out that ReLUs are better, because of this equivariance to scaling. They're better when you have very deep networks, for some reason that we don't fully understand from the theoretical point of view. But experimentally, you can train much deeper networks with ReLUs than you can with hyperbolic tangents. So for the hyperbolic tangent, I'm not going to write out the formula, but most numerical libraries have a call for the hyperbolic tangent that you can use. The advantage of this function is that because it saturates, it bounds the values that your network can take, which is good in many ways. But it has a disadvantage.
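The "multiply by two and shift" relationship just described can be checked numerically; a quick sketch (the test point 0.7 is arbitrary):

```python
import math

def sigmoid(x):
    # logistic function: smoothly goes from 0 (x -> -inf) to 1 (x -> +inf)
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_sigmoid(x):
    # same S-shape rescaled to (-1, 1): tanh(x) = 2*sigmoid(2x) - 1
    return 2.0 * sigmoid(2.0 * x) - 1.0

print(sigmoid(0.0))                                          # 0.5
print(abs(tanh_from_sigmoid(0.7) - math.tanh(0.7)) < 1e-12)  # True
```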
The disadvantage is that if the variable that goes into the hyperbolic tangent is very large, then you get into the flat spot of the hyperbolic tangent, and you basically get no gradient. When you backpropagate gradients through that module, you don't get much of a gradient at all. For the same reason the flat spot of the ReLU can be a problem; here you have two flat spots. What's more, the amplitude of the input variable can influence how much of the flat spot you hit during forward propagation. This may be why using those functions limits how many layers a network can have: one of those layers is going to hit those flat spots because the weights are too large or something, and that's going to kill the entire network. Essentially, you're going to get vanishing gradients below that. You don't have this invariance to scale. One advantage of this over the sigmoid is that it's symmetric. For the same reason I explained last week that it's a good idea to normalize the input variables of a neural net so that they have zero mean, it's also good for the internal variables of a neural net to have more or less zero mean. A hyperbolic tangent will produce variables, state variables, neuron outputs, that essentially have zero mean. Not necessarily, but they have a good chance of being centered around zero. Whereas with a sigmoid, or a ReLU for that matter, there's no chance at all that they will have zero mean, because they are strictly positive. Not strictly, but they're positive. Because of the saturation problem of the sigmoid and the hyperbolic tangent, our friends in Montreal a few years ago proposed this other kind of smoother function that doesn't transition as fast and still has a slope when it saturates. It's called softsign: x over 1 plus absolute value of x. This is, I believe, from a paper by Xavier Glorot and Yoshua Bengio. And, you know, that has some advantages, so some people use it. Hardtanh should really be called a saturating ramp.
It's just a saturation function, right? It goes between minus one and one, it's linear in between, and there are two kinks. So it again has the disadvantage that it's not scale-equivariant. Then you have weird functions that don't tend to saturate but sort of bring the value towards zero. So this is a function that, if you give it a small value, will basically return something that's really close to zero, and if you give it a bigger value, will return something close to the identity minus or plus a constant. So this one is called tanhshrink, and it's the difference between the identity and the hyperbolic tangent. There's a kind of hard version of it called softshrink. This is used in some fairly special cases that we may talk about when we discuss unsupervised learning, and sparse coding in particular, and then you have variations of those. So this is the log of a sigmoid. Why is it good to have a log-of-sigmoid module if you already have the sigmoid and the log? The reason is numerical instabilities. It's very often the case that you will have functions with exponentials in them and then you want to take the log of them, and if you write them in PyTorch as two different modules, one that computes the function with exponentials and one that takes the log, you're going to get numerical instabilities. There are some values of the input for which the output would be almost zero, and the log will basically want to be minus infinity. But if you do the computation in one swoop and rearrange the terms, you never get those numerical issues. So taking the log of a sigmoid is an example of that. Okay, now we're getting into modules that are not simple nonlinearities but are very important because they're used absolutely everywhere: modules that turn a vector into another vector. And there's two of them, softmax and softmin.
They are basically the same thing. They should really be called softargmax and softargmin, but historically they've been given the names softmax and softmin and the names stuck. It's really not a good name, and in fact the inventor of that technique in the context of neural nets, a guy called John Bridle, regrets calling them that; he said he should have called them softargmax and softargmin. So they're really the same function except one has a minus in it, and they're vector functions. Okay, so... And Yann is frozen. Is my computer frozen? Yes, my computer is frozen. No, Yann is frozen. Yann is frozen too. Okay, very good. So let me fill in until he comes back. The documentation of Torch is huge, and most of the time we get people who are overwhelmed about how do we use this, how do we use that. The documentation doesn't really tell you how to use these functions, and so we're trying now to briefly go over everything, such that whenever you're looking at the documentation, because you're going to be looking at the documentation quite often, you can navigate and figure out and understand why those things are there, okay. Can you provide some examples of how non-monotonic activation functions will lead to problems? I have no idea; that's the first time I hear about that. So we can ask Yann when he comes back; he's going to take a few minutes, so okay, fine, I'll entertain you for the moment. That I don't know; I have no personal experience with non-monotonic activation functions. Can I address other questions? Why is the ReLU used when it's not differentiable? So, okay, this is actually so cool. If you have a neural net made of ReLUs only, you can actually show that the output of this network is going to be simply a piecewise linear function, like a patchwork of several linear networks, okay.
And so what happens is that a given network will section the input space and apply a linear transformation from input to output within each region, okay. So for a given patch you're going to have a linear network, which is straightforward to train, and basically the only thing you are doing during training is figuring out where to position those patches and what kind of transformation you would like to apply to those regions, okay. But also, theoretically speaking, these ReLU-based neural nets are much easier to analyze. Is the ReLU differentiable? Actually, yes, the ReLU is differentiable almost everywhere; at the kink you have to use subgradients, yeah. Again, the issue with the ReLU is that you have exactly zero gradient for the negative part, and if you get in that region, things don't move, right, with gradients. So instead we use, for example, leaky ReLUs when we train, let's say, generative adversarial networks, which we cover towards the end of the class, such that you always have some gradients flowing through the network, okay. So again, ReLUs might give you some issues because of the zero-gradient side, so we just give the negative part a small slope such that you still have gradients. And then there is a question for Yann. Can you hear me, Yann? Yep. Okay. Can you mention a few issues about using non-monotonic activation functions? It was not clear. Ah, right. So let's say you're using a non-monotonic activation function, something like absolute value, for example, right. It's like the ReLU except it's the identity in the positive part and the negative identity in the negative part. So there, whenever the output of the network wants that particular unit to have a particular value, it's got two choices as to what the input should be, okay. And which one it chooses depends on just what the current value is.
And this may or may not be good, in the sense that the gradient could take you in one direction or the other. Let's imagine the nonlinearity you put in is a sinusoid, okay, with multiple periods. There are lots of different places the input can be that will produce exactly the same output. So it doesn't pin down the value of the input; it just says the input should be either this, or that, or that. And that necessarily creates local minima, or saddle points, because it means there are several choices that get the same result, which means there are several values of the parameters that will give you the same output, which means there are several local minima. Okay. So there is this sort of intuition that local minima are probably a bad idea, and that saddle points certainly are bad. And so the more non-monotonic nonlinearities you put in your system, the more of those local minima or saddle points you're going to create. And that's probably bad. Okay. And there is one more question. We mentioned before that using those saturating nonlinearities can lead to vanishing gradients. So isn't the ReLU perhaps leading to exploding gradients instead, since it's not saturating? So, not necessarily. Exploding gradients, okay. The gradients would explode if, so for example, imagine a network that has a single input, a single hidden unit, then another single hidden unit, et cetera, right? A very simple network with basically a single hidden unit at every layer, and you stack the layers. If you put a ReLU at each layer and the weight is one, let's say, for every layer, and let's imagine the input is positive, then all the activations are one, so all the ReLUs are in their linear region. When you backpropagate, the gradients will also be one, right?
Because you're going to get the gradient multiplied by the derivative of the ReLU, which is one, then multiplied by the weight, which is one, then by the slope of the next ReLU, which is one, et cetera. So the gradients are going to be one everywhere. Now imagine that the weights are not one but two, okay? So all the weights in each of the layers are equal to two. First of all, for an input equal to one, you get two, then four, then eight, then 16, et cetera, right? You get powers of two as you go up the layers, so the state is going to explode. And the gradient also explodes. If you start with a gradient of one at the output and backpropagate through a weight equal to two, your gradient is now two, and then four, and then eight, and then 16, right? So you get a huge gradient for the lower layers, which means the weights there are going to change a lot, and a relatively small gradient for the top layer, which means that weight is not going to change much. And that's a bad idea. What you want is for the amplitude of the states and the amplitude of the gradients to be more or less the same all the way through. And what people do to enforce this today is use a module called batch normalization, or group normalization; there are various normalization tricks to ensure that the activities of the units are more or less all the same, that they have roughly variance one, right? So that's a way of preventing this from happening. So the explosion is not necessarily going to be due to the ReLU. If you use an exponential for the nonlinearity, yes, you're going to get exploding gradients. But a ReLU just has a slope of one, so by itself it's not going to create issues. You can have that issue because of gains inside the network that are bigger than one. Okay. Can we take one more question? We can. Okay.
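The single-unit-per-layer chain just described can be simulated directly; here's a sketch with 8 layers (the depth is my choice for illustration) and every weight equal to 2:

```python
depth, w = 8, 2.0

# forward pass: input 1.0 through layers of (multiply by weight, then ReLU)
x = 1.0
for _ in range(depth):
    x = max(0.0, w * x)   # stays in the linear region since x > 0
print(x)                   # 256.0 = 2**8: the state explodes

# backward pass: gradient of 1.0 at the output, through the same chain
g = 1.0
for _ in range(depth):
    g = g * 1.0 * w        # ReLU slope (1) times weight (2)
print(g)                   # 256.0: the gradient explodes the same way
```

With `w = 1.0` both the state and every gradient stay exactly 1.0, which is the well-behaved case from the text.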
So could the non-monotonicity actually be a good thing? For example, it could mean that there are multiple weight initializations which could work, instead of getting stuck due to a bad weight initialization. Yeah, that's right. It's just that, again, with those non-monotonic functions you're going to have several saddle points, and saddle points are bad. So it may work; it depends what type of function you're using. Not a huge amount is known about this. What is known is that with monotonic functions, particularly with ReLUs, there's a little bit of theory: the intuition about why they're good kind of translates into some very weak theorems on why they're good. But really, those are complex systems with complex dynamics that people don't understand very well. Now, the way to think about the ReLU, if you know anything about electrical engineering, is as a diode. Okay, so it's something that lets through values that are positive but not negative. And this is the most elementary way of detecting something, if you want, right? The most elementary way of detecting that something happens versus does not happen, if you have some sort of linear representation of it, is to have a threshold: below that threshold things are equal to zero, above it they're positive. This is the basic principle of, say, a radio receiver: you build a radio and you want to detect the signal so that you can hear the audio and eliminate the carrier. You have to have a detection system. That's called a diode in electronics, and it plays the same role as a ReLU. And Alfredo is going, oh, funny, I never thought about this before even though I'm an electrical engineer. That's where the name comes from, right? The rectification.
Yeah, rectification. The diode is actually called a rectifier; that's the more general term for it. One last question. I heard about this paper: any thoughts about the SIREN paper, the one that is using sinusoids for the activations? Oh, that's different. So it's not just an activation. The SIREN paper, if I remember correctly, is a paper by Vincent Sitzmann, Julian Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. But there was an earlier paper, I think, from the NeRF people on Fourier feature representations. So I may be confused about what the SIREN paper actually talks about, but I think what it is, is basically a pre-processing layer. So let's say you have a neural net that has a single scalar input, okay? And you want to compute some complex function of that single input. You could just build a neural net with a number of hidden units and several layers and read out the output. But what they do, one thing you can do which seems to be very useful, is hardwire the first layer. So the first layer is basically hardwired. It's very similar to the basis function expansion I was talking about last week. And what you do is you make the activation function for the first unit be a sinusoid, and for the next unit a sinusoid with twice the frequency, and for the next, three times the frequency, and for the next, four times the frequency, et cetera, right? So you have activation functions that are sin(x), possibly with some coefficient, sin(2x), sin(3x), et cetera. And for good measure, you also add the cosines: cos(x), cos(2x). So you basically expand the dimension of the input, right? And that's basically a Fourier series expansion of your function.
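A sketch of that hardwired first layer for a scalar input; the number of frequencies `K` is arbitrary here, and this follows the description above rather than any specific paper's code:

```python
import math

def fourier_features(x, K=4):
    # hardwired first layer: expand a scalar input into sinusoids of
    # increasing frequency, sin(k*x) and cos(k*x) for k = 1..K
    feats = []
    for k in range(1, K + 1):
        feats.append(math.sin(k * x))
        feats.append(math.cos(k * x))
    return feats

# a 1-d input becomes a 2K-dimensional feature vector; a linear layer on top
# of this computes a truncated Fourier series of the input
print(len(fourier_features(0.3)))  # 8
```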
So it makes it easier then for the system to compute a complex function, by basically computing weighted sums of those basis functions. Okay. And you can do this with all kinds of stuff. Finally, why are we covering all these activation functions in today's lecture? We're not covering all of them; we're just covering the ones that people use. I put the slides for all of them because that's a good reference, but you can find all of this in the PyTorch manual. Yep. Okay. So here's what I wanted to talk about: softmax, right? By the way, my computer crashed when I tried to grab my little pen here; I disconnected one of the connectors that has the USB to everything else, and it just completely crashed the machine. Sorry about that. Okay. So softmax is this thing. It's a module that has n inputs and n outputs, and it computes the exponential of each input and then normalizes it by the sum of the exponentials of all the inputs. So basically, let's say we have three inputs. These are the x_i's, or x_j's I should say. And the outputs are, sorry, not the x's but the z's; I'm going to call them z_i. Okay. So z_i is equal to the exponential of x_i over the sum over j of the exponential of x_j. So for each of these guys, you take the exponential of the corresponding input and then you normalize by the sum of the exponentials of all the inputs. Now, what does that give you? It basically allows you to transform a bunch of numbers, the x_i's, whatever they are, into another bunch of numbers, the z's, that are between zero and one and sum to one. Okay. So it's quite obvious that when you sum all the z_i's over i, you're going to get one, right? Because you're going to get the same thing in the numerator and the denominator. So the sum of the z_i's is going to be one.
And they're all going to be between zero and one because, of course, the sum of the exponentials of the x_j's is larger than the exponential of x_i alone. So this is a way of turning, and we're going to use this absolutely everywhere, okay, so it's very important, a bunch of scores in some arbitrary units into something that looks like a probability distribution over discrete outcomes. Okay. So here's a special case of this. Let's imagine that we only have two inputs, x1 and x2. So z1 is going to be equal to e to the x1 over e to the x1 plus e to the x2. Okay. Now let's imagine that x2 is always equal to zero, constant and equal to zero. Then this turns into e to the x1 over e to the x1 plus one, because the exponential of zero is one, right? Now divide above and below by e to the x1, which means multiplying by e to the minus x1. In the numerator I get one, because they cancel. In the denominator, the first term also gives one, because they cancel, and the second gives e to the minus x1. So z1 is one over one plus e to the minus x1. And surprise, surprise, this is the sigmoid, or the logistic function, or, as physicists call it, the Fermi-Dirac distribution. Okay. So what we've just seen is that softmax is basically a multinomial generalization of the logistic function. Or conversely, the logistic function is a special case of softmax for two variables where one of them is always zero. Okay. And why are we using exponentials here as our nonlinear function? Because the exponential is a very simple way of smoothly turning a number with any range, from minus infinity to plus infinity, into a positive number. There are other ways to do this, I will admit, but the exponential is good. There are other deep reasons for this, which I'm not going to go into right now. We'll say a bit about this in a future lesson, right? Yeah. We'll talk about it when we talk about energy-based models and their relationship to probabilistic models. Mm-hmm.
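Both facts, the normalization and the reduction to the logistic function when x2 = 0, are easy to check; a sketch (the input values are arbitrary):

```python
import math

def softmax(xs):
    # z_i = exp(x_i) / sum_j exp(x_j): positive numbers that sum to one
    es = [math.exp(x) for x in xs]
    s = sum(es)
    return [e / s for e in es]

z = softmax([1.0, 2.0, 3.0])
print(abs(sum(z) - 1.0) < 1e-12)   # True: the outputs sum to one

# with two inputs and the second fixed at zero, the first output is
# exactly the sigmoid (logistic) of the first input
x1 = 1.3
z1, _ = softmax([x1, 0.0])
print(abs(z1 - 1.0 / (1.0 + math.exp(-x1))) < 1e-12)  # True
```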
But, you know, intuitively it's a good way of turning a bunch of scores into a probability distribution. So if you are interested in having calibrated scores that are between zero and one and sum to one, if you want to do classification, for example, you use a softmax module: you take whatever scores come out of your neural net and turn them into a bunch of positive numbers that sum to one, and you can interpret this as a probability distribution over categories. Okay. Why is it called softmax, and why should it be called soft argmax? The reason is that if one of the x's is significantly larger than all the others, then the output is essentially going to be very close to one for that input and close to zero for all the other ones. Okay. And so that basically creates a kind of competition, if you want, between scores. And the interesting thing about softmax is that it doesn't care about a shift in the variables. So if I take a vector x and I add a constant vector v to it, I mean, I add a constant to every component, right, a vector that has the same value everywhere, so let's say I add c to all the components, and I compute the softmax of that, it is actually equal to the softmax of x, okay? Take a vector x, add a constant to all the inputs: it makes no difference to the output. So what that tells you is that softmax actually only cares about the relative values of the inputs; it doesn't care about the absolute values. If your inputs are between minus one and one, or between one million minus one and one million plus one, it makes no difference to softmax.
It's going to make a numerical difference when you compute the exponentials, but that's a different story; in fact, PyTorch is smart about this, and it's not going to blow up. So that's a really important property, and that's quite interesting. So now the softmin function I was telling you about. So this is why it's called softmax, right, and why it should really be called soft argmax: because basically it tells you which of the values in the list is the largest one. It gives that one a value close to one, and it gives the other values a value close to zero, with some smooth transition. If you want a less smooth transition, you can apply softmax to x multiplied by some coefficient beta, right? Beta is a coefficient that you pick; you can also learn it if you want, but you can just set it. For large values of beta, the transition between zero and one, as one particular component grows larger than all the other ones, will be very sharp. Okay? Set beta to a thousand or something: if one x is only slightly bigger than the other x's, its softmax output will be one and the other ones will be zero. For a small value of beta, you're going to have a smooth transition: as you increase one of the x's relative to the other ones, the corresponding output moves smoothly from zero, if that component is much smaller than all the other ones, to one, if it's much larger than all the other ones. So softmin of x is just softmax of minus x. Okay? So that's a generalized form, right, which, when you have a beta, kind of takes care of negatives versus positives and things like that: if beta is negative, you get a minus in front, which may or may not be what you want. Right.
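Both properties, the invariance to adding a constant and beta controlling the sharpness, can be seen in a few lines; the helper and the beta value here are my own illustration:

```python
import math

def softmax(xs, beta=1.0):
    # subtracting the max makes no difference mathematically (shift
    # invariance) but keeps exp from overflowing for large inputs
    m = max(xs)
    es = [math.exp(beta * (x - m)) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# adding a constant to every input leaves the output unchanged
a = softmax([1.0, 2.0, 3.0])
b = softmax([1001.0, 1002.0, 1003.0])
print(max(abs(u - v) for u, v in zip(a, b)) < 1e-12)  # True

# large beta: nearly a hard argmax, even for a tiny score difference
sharp = softmax([1.0, 1.1, 0.9], beta=100.0)
print(sharp[1] > 0.999)                                # True
```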
So, for the same reason I told you that if you want to take the log of a function like this you'd better do it in one fell swoop, rather than computing the function and then taking the log: you might have very large numbers in between, and particularly when you backpropagate gradients, you get essentially zero gradients or infinite gradients. Backpropagate through a softmax when one x is much larger than the other ones, and the softmax basically saturates, so when you backpropagate you get gradient zero. And it could be that the output of the softmax is close to zero, so when you take the log, you get something that is basically minus infinity. When you do everything in one fell swoop, you don't have that numerical problem. There's a trick inside PyTorch: you rearrange the computation of the softmax by first computing the max over the x_i's, and then you can factor it out of what goes on inside and solve the numerical problem. So log softmax is probably one of the most important modules inside PyTorch. It's one that's used universally for classification. When you build your own net, you use the log softmax module essentially as a cost function. You don't use it directly as a loss function, but it's basically the main component of a loss function. So what does that do for you? As I said, the softmax itself turns the values, the scores coming out of the neural net, whatever they are, into a bunch of numbers between 0 and 1 that sum to 1. And then you take the log of that, and you're back in the same space as the original x inputs, basically, because the log essentially cancels the exponential. But what you have done is that you've removed the absolute values of the x_i's. As I said, softmax really only cares about the relative values of the x_i's.
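That max-factoring trick might look like this; a sketch of the idea, not PyTorch's actual implementation:

```python
import math

def log_softmax(xs):
    # one fell swoop: log(exp(x_i) / sum_j exp(x_j)) = x_i - logsumexp(x);
    # factoring out the max keeps every exp argument <= 0, so nothing overflows
    m = max(xs)
    log_z = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_z for x in xs]

# works even where a naive exp(x) would overflow...
ls = log_softmax([1000.0, 1001.0, 1002.0])
# ...and the results still exponentiate back to a distribution summing to one
print(abs(sum(math.exp(v) for v in ls) - 1.0) < 1e-12)  # True
```

The equivalent built-in is `torch.nn.functional.log_softmax`, which should be preferred over composing a log with a softmax.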
And so now what you have is a bunch of scores that really indicate the relative values of the x_i's without caring about their absolute values. And that's very important if you want to build a classifier. If the classifier has multiple categories, you need to have some competition between the different categories, because an input, an image, can only have one category, okay, if the categories are mutually exclusive. And so what you'd like is scores that basically behave like probabilities and kind of inhibit each other, right? If one category has a high score, the other ones should have low scores, and that's what the softmax normalization does. But then why do we need the log? Okay, so let me... So I guess it's time to talk a little bit about a particular type of standard classifier architecture, right, of neural net. You would have an input X, an image, audio, whatever it is. Think of it as a multidimensional array of some kind, but we're going to think of it as a vector. Then we'll have multiple layers of modules, whatever they are, I'm not going to specify here, it doesn't matter. In a standard neural net, again, they're going to be linear modules and ReLU modules, the last one being a linear, not a ReLU. And then we're going to put one of those softmax modules on top. Okay, and this output, which is the predicted output, is really going to be a vector of numbers between 0 and 1 that sum to 1. So I'm kind of redrawing this softmax module here. In come a bunch of numbers from the linear module before it; out come a bunch of numbers between 0 and 1 that sum to 1. So if you want to train this as a classifier, what you want is to train the system in such a way that the output for the correct class is close to 1 and the outputs for the incorrect classes are close to 0. So what you can use is a cost function.
So basically the output here is going to be, I don't know, 0.1, 0.2, 0.6 and 0.1, okay? And let's say the desired output is something like 0, 1, 0, 0. So first of all, your system got it wrong, because it gave the highest score to the third category, and the correct category is actually the second one, not the third one. So what we're going to need here is some sort of cost function that measures the divergence between those two things, okay? I don't want to say the distance, because this may not be a distance. But some measure of discrepancy between those two things. And that's, again, one of the most important modules we're going to use. So one cost that we could use is squared error, okay? Squared error says: I'm going to just compute the sum of the squares of the differences between the value I get and the value I want. And that might work. But in classification, people use something called the cross entropy. And that involves log softmax, okay? So, okay, before I go there, I want to tell you something about this. What happens here is that all those numbers compete with each other, right? Because they are normalized by their sum, they compete with each other. So if the network wants to increase the value of the second output here, which is the correct one, it will have to decrease the other ones, because it's normalized, okay? And that's the advantage of softmax for classification with mutually exclusive classes. I want to minimize the negative log of the correct output. So this target here, the one, says: if we want to translate this into an objective function, it's going to say, I want to pull this value up. And as a side effect, the other values will be pulled down, because it's all normalized. So I just want to make this large, okay? And what am I going to use as a loss function to make this value large?
The answer is: I'm going to use the minus log of it. Okay, so those are the y bars, okay? So my loss, my objective function, which I'm going to minimize, is the minus log of the activation coming out of the softmax for the correct output. Okay, so c here indicates the correct class, the correct category, the desired category, right? So in this case it's the second one, which is index one, right? Because we start counting at zero. So here c equals one. Because it's the second output, I want to make this output of the softmax as large as possible. And the way I make it as large as possible is that I plug it into a loss function that is the negative log of this output. And if I minimize this negative log, I'm going to maximize y_c: the 0.2 is going to get bigger. And as a consequence, the others are going to get smaller, because it's normalized, okay? That's why I need to compute the log softmax: because it's going to go into my loss function. So minimizing the negative log softmax of the correct category is the main way to train a classification system in deep learning. That's what most people use. I'll come back to this in a second. Okay, so that's why log softmax is so important. And in fact, we're coming to cost functions. So those are things we stick on top of a neural net, or a deep learning system, to tell us whether it is doing something good. The most common one is the squared error. The squared error is simply the squared difference between the output we want and the output we get. They're using a different notation here, and they're doing it over a batch. So this index 1 to n is indexed over a batch. And you compute either the sum or the mean; you can choose with this reduction parameter whether you compute the sum or the mean over the batch. But what you do is you compute the squared difference of each component.
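As a sketch of that classification loss (plain Python; in PyTorch itself this is what `nn.NLLLoss` applied on top of a log softmax, or equivalently `nn.CrossEntropyLoss`, computes):

```python
import math

def nll_of_log_softmax(scores, c):
    """Negative log softmax of the correct class c, computed in one
    fell swoop with the max trick for numerical stability."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[c]   # = -log softmax(scores)[c]
```

Raising the correct score lowers the loss, and the loss goes to zero as the correct score comes to dominate all the others.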
So very simple. Whatever comes out of your neural net is a bunch of outputs. The cost function is the MSE, the squared error. You have desired values and network outputs; let's call them the y_i bars and the y_i's. This is a different notation from the one used in the PyTorch manual. And you just compute one over... I don't want to call it P, actually, and I don't want to call it N either. K, which is the number of outputs here. One over K, times the sum over i of (y_i minus y_i bar) squared. Or you can just write this as one over K times the squared norm of y minus y bar. So it's really the squared Euclidean distance averaged over components. So if you're doing regression, basically if your neural net is trained to produce continuous values of some kind, not classification but regression, that's a perfectly good cost function to use. Very simple. This next one is the same thing, but instead of the L2 norm we use the L1 norm. So this is the sum of the absolute values of the differences. Not used very often, but used sometimes. There are variants of this, which I'm not going to go into the details of. And here is the negative log likelihood loss, a special case of which is the log softmax I was talking about earlier. So we have a target category. We want to use this for classification. We have a target category, which is this little c here, and we want to make the output for the correct category as large as possible, and the other ones as small as possible. And if the outputs come out of a softmax, we can reduce this to a single function, which is the log softmax. I explained that earlier: we want to make the correct output as large as possible, and the way we do this is by minimizing the negative log of it, which is the negative log likelihood, if you want, of the correct category. But if we do this as two separate modules, we compute softmax and then we have negative log likelihood. We get numerical issues.
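A bare-bones sketch of those two regression costs (plain Python, averaged over the K components as described):

```python
def mse_loss(y, y_bar):
    """Squared error: (1/K) * ||y - y_bar||^2."""
    return sum((a - b) ** 2 for a, b in zip(y, y_bar)) / len(y)

def l1_loss(y, y_bar):
    """L1 loss: (1/K) * sum_i |y_i - y_bar_i|."""
    return sum(abs(a - b) for a, b in zip(y, y_bar)) / len(y)
```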
So what we do is we compute the log softmax, and then we just maximize the correct output, or take the negative log softmax and minimize it for the correct output, which is the same thing. And that can be seen as a special case of the cross entropy loss. So here is the expression for the softmax. And this is the negative log softmax, which is a negative log likelihood loss applied to a module whose last stage is a softmax, right, in one fell swoop. So I can expand this. If I take the log of this ratio of exponentials, right, the log of a ratio is the difference of the logs, so I can write this as the log of the numerator minus the log of the denominator. Now the numerator is a log of an exponential, which cancel each other, so I just get the score of the class with a minus sign in front. Okay, so this is the first term. And then the second term is the so-called log partition function, which is the log of the denominator. So the overall objective function is something like this. And it says: make the score before the softmax of the correct class as large as possible, because you have this minus sign. Okay. And then make the log of the sum of the exponentials of all the scores, including the correct one and all the incorrect ones, as small as possible, as negative as possible. Okay. So the reason why this is interesting to look at is because this determines the gradient you're going to get when you backpropagate through a negative log likelihood loss through a softmax, through a log softmax in that case. This is the kind of gradient you get with respect to whatever variables go into the softmax. Okay. So to minimize this, you need to minimize the first term, which means make the x of the correct class as large as possible. Okay, so make the score of the correct class as large as possible.
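Written out, the decomposition just described (with beta set to 1 for simplicity) is:

```latex
-\log \mathrm{softmax}_c(x)
  \;=\; -\log \frac{e^{x_c}}{\sum_j e^{x_j}}
  \;=\; -x_c \;+\; \log \sum_j e^{x_j}
```

The first term pulls the score of the correct class up; the second term, the log partition function, pushes all the scores down.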
Simultaneously, make the scores of all the classes as small as possible, including the correct one. So the score of the correct class is going to get pushed down by the second term, simultaneously with being pushed up by the first, but it's going to get pushed up by the first term much more strongly than it's being pushed down by the second. And we'll see why in a minute. So you're going to get the desired effect in your neural net as the loss goes down, which is that the correct class is going to get pushed up and the incorrect classes are going to get pushed down. The scores, if you want. So there are various tricks, because people want to use softmax with a very large number of categories. For example, let's say you want to train a spell corrector or a language model. So something that predicts: you type an email and this thing is going to predict what word you're likely to type next. What you need is a language model, which is something that will basically predict a probability distribution over all the possible words in the language you're typing in. It's going to look at the previously typed words in the text and predict what word comes next, so that it proposes you the most likely one coming out of that model. A more useful application, actually, of those language models is when you're doing speech recognition or handwriting recognition or text generation, things like that. The system doesn't have perfect identification of the words, and what you want is a language model to correct the mistakes of your recognition system. So that's called a language model, a probabilistic language model. And the problem with this is that you need the system to produce a probability distribution over a very large number of categories, because the number of possible words in the English language, just the most common words, is like 100,000 or something like that, right? And the full dictionary is more like 300,000, if not more.
So you need to do a softmax over 100,000 variables, and that can become kind of expensive. Sometimes you might need to do a softmax over a million entries, and that's kind of numerically unstable and expensive. So there are tricks to do this fast, basically by ignoring the things that have low scores, and that's one of them that you're welcome to use. I'm not going to go into the details of how this works. So here is another kind of loss that can be used for classification as well. It's called a margin ranking loss, or ranking loss with a margin. It's basically a difference, the addition of a constant, and then a ReLU, okay? It's as simple as that. You can see the formula at the bottom. And what it says is this: if I have a network with multiple outputs and two scores come out of the system, I know that the correct category is one of them, and I want to make sure that my system produces the correct category. So what I want to do is make the score of the correct category, not as large as possible, but just slightly larger than the second highest scoring category, right? That's called a ranking loss. So you have a neural net; it may have 10 or a dozen outputs. And what you're doing is you're saying which output is the correct one, okay? I want to make that big. Which output is the largest one among all the outputs, whether it's the correct one or an incorrect one, I don't care. Let's say it's another one, okay? Maybe its score is higher, maybe it's lower, I don't care. But what I want is to make sure that the score of the correct class is larger than the score of that other class that has a high score, by at least some margin. So I can use this loss, okay? I pick x1 and x2 as being the correct one and the most offending incorrect one, for example, right? And I push the first one up and push the second one down in such a way that the difference is at least a margin.
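A sketch of that hinge, matching the description (a difference, a constant margin, then a ReLU; plain Python rather than PyTorch's `nn.MarginRankingLoss`):

```python
def margin_ranking_loss(x1, x2, margin=1.0):
    """Zero once the correct score x1 beats the competing score x2
    by at least `margin`; otherwise a linear penalty (a ReLU of the gap)."""
    return max(0.0, margin - (x1 - x2))
```

Here x1 would be the score of the correct class and x2 the score of the most offending incorrect class.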
And so now I'm guaranteed that if my network learns properly, the correct category will have a higher score than the other categories by at least that margin, okay? So that's another one of those losses that only cares about differences, doesn't care about absolute values, only cares about differences between scores. But it's not like softmax, because it doesn't take into account the entire set of outputs; it only takes a pair. So, architectures really are different ways of arranging modules to build neural nets. Any questions at this point, by the way? We've been answering questions on the chat on the fly. Awesome, thanks. Okay, we'll jump right in. So we've talked about linear modules, we've talked about pointwise nonlinearities, we've talked about softmax, we've talked about a few cost functions: the squared error, and the negative log likelihood applied to softmax, which gives you the log softmax, right? But here is another set of modules that are quite different from what we just talked about. Those are kind of quadratic modules, if you want. Okay, so here's an example of a quadratic module. It's a module where the i-th output is a weighted sum. So here, this is a linear module, right, where we compute just a weighted sum of inputs. But here what we're going to do is make the weights themselves functions of other inputs, okay? So we're going to have a linear module here that takes X, multiplies it by a matrix, and gives us an output vector. But the matrix itself is not a set of parameters; it's the result of applying a linear function of another vector Z, through a third-order tensor U, okay? So we're going to write each W_ij in our linear module as itself a weighted sum of individual slices of a three-dimensional, third-order tensor, right? The three-index tensor U_ijk. This is a multidimensional array with three dimensions.
And we multiply each slice of that tensor by a coefficient, which is a component of the vector Z. That gives us a matrix, and that matrix is W_ij. And we use that matrix W to multiply the vector X, okay? So overall, I can write it down this way: S_i is equal to the sum over j and k of U_ijk, this kind of three-dimensional tensor, times Z_k times X_j. This is a quadratic form, right? Basically, imagine Z were equal to X; this would be kind of a quadratic form. But it's basically a second-degree polynomial, if you want, a monomial, that's a function of X and Z through those coefficients, right? Okay, so that gives us a lot of power, because it basically allows us to have a little piece of a neural net whose function is determined by another piece of the neural net. It allows us to have switches. For example, let's say X is a vector with a bunch of components. We could design this whole thing in such a way that, depending on Z, certain components of X would be selected in the weighted sum, and some components of X would not be selected, by setting some coefficients in W to one or zero, okay? And that's basically what attention is. So you might have heard, by reading various things about neural nets, that there is this mechanism called attention. It's probably a bad word. What it means is using multiplicative interactions to switch certain variables in and out of a function that you apply, so that the network pays attention to the variables you switch in and doesn't pay attention to the ones you switch out. Here's a special case, which I'm going to explain here. And that special case is a switch. Okay, so let's imagine we have a module that has two inputs, X1 and X2, and a single output, which I'm going to call S. And depending on some other variables, which I'm not going to specify, this guy either chooses the first variable or chooses the second variable. Okay, so we can move this switch to select X1 or select X2, right?
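The quadratic module above, S_i = sum over j and k of U_ijk Z_k X_j, can be sketched directly (plain Python, nested lists standing in for the third-order tensor):

```python
def quadratic_module(u, z, x):
    """s_i = sum_{j,k} U[i][j][k] * z[k] * x[j]: the weight matrix
    W[i][j] = sum_k U[i][j][k] * z[k] is itself a function of z."""
    s = []
    for u_i in u:                       # one slice of U per output i
        total = 0.0
        for j, x_j in enumerate(x):
            w_ij = sum(u_i[j][k] * z_k for k, z_k in enumerate(z))
            total += w_ij * x_j
        s.append(total)
    return s
```

With a one-hot Z, each choice of Z picks out a different weight matrix from the slices of U, which is exactly the switch case discussed here.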
And a module like this, the switch module, you can think of as a special case of this product module where the W matrix is either... Well, it's not a matrix, it's a row vector, right? So W is either equal to this or equal to that. Okay, and I multiply this by (X1, X2). Here, I get X1, and here, I get X2, right? So I have some other variable here, which I called Z; it's not a scalar, it's a vector with two components, Z1 and Z2. And I can write: if Z1 equals 1 and Z2 equals 0, I get this matrix; and if Z1 equals 0 and Z2 equals 1, I get that matrix. Okay, that would be a way to write the U. The U would be a matrix with those two rows, essentially, and the Z vector would select one row or the other, depending on whether it's (1,0) or (0,1). And that's like a switch. Now, here's a very interesting thing about the switch: it's very easy to backpropagate gradients through a switch. Essentially, imagine the switch is in the first position, so S is at X1. What that means is that when I wiggle X1 by some value, S is going to wiggle by the same value. When I wiggle X2 by some value, S is not going to wiggle at all. What that means is that if I get a gradient of some cost with respect to S here, I just copy it: when I backpropagate through it, I get the same gradient here, and here I get 0. If the switch is in the other position, I get the opposite: I get 0 here, and I get dC/dS there. So it looks like a switch is an inherently non-differentiable thing, but it is differentiable with respect to its inputs. It may not be differentiable with respect to the Z, unless you use the kind of linear way of writing it, which I did here. So backpropagating through a switch is very simple: remember the position of the switch, and then just copy the gradient with respect to the output to the inputs that are switched in.
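A tiny sketch of the switch and its gradient rule (plain Python, scalar inputs for simplicity; z is assumed to be a one-hot pair):

```python
def switch_forward(z, x1, x2):
    """z = (1, 0) selects x1; z = (0, 1) selects x2."""
    z1, z2 = z
    return z1 * x1 + z2 * x2

def switch_backward(z, grad_s):
    """Backprop: copy the output gradient dC/dS to the switched-in
    input, send zero to the switched-out one."""
    z1, z2 = z
    return (z1 * grad_s, z2 * grad_s)
```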
But here's a slightly more interesting form of this attention module, which is probably the most common one. It's to make the coefficients sum to 1, to softly switch between one slice of the U tensor and another slice, which I write here as W1, W2. So think of W1 and W2 as slices of this U tensor, and think of the Z's here as the result of a softmax, and you get this kind of module, which is probably a more convenient way to write it. So here I take an X1, I multiply it by some weight to get W1 X1, and then I have another branch where I take X2, multiply it by another weight, and I sum the two results to get this S, right? So this is a weighted sum of two inputs multiplied by two separate weights. These could be the same input, by the way, okay? It doesn't matter at this point. Sorry, in this case, this is for scalar values, so W1 and W2 are scalars, okay? And I'm going to make W1 and W2 the outputs of a softmax. So the softmax has two inputs and two outputs, and the two outputs are between zero and one and sum to one. So I get two weights here that are between zero and one and sum to one, okay? And so what you get as S is a weighted sum of X1 and X2 with those weights. If one of them is one, and the other one is zero as a consequence, I only get X1. If it's the other way around, I only get X2. But for intermediate values, I'm going to get something in between, okay? So what this module is going to do is softly switch between paying attention to X1 or paying attention to X2, by either copying X1 into S, or copying X2 into S, or copying some weighted sum of the two, depending on the value of Z, okay? So Z is a way to basically control the attention that the system will pay to either X1 or X2.
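The soft switch can be sketched like this (plain Python; the two mixing weights come out of a two-way softmax over the control scores z1, z2):

```python
import math

def soft_switch(x1, x2, z1, z2):
    """S = w1*x1 + w2*x2 with (w1, w2) = softmax(z1, z2):
    two weights between 0 and 1 that sum to 1."""
    m = max(z1, z2)                 # max trick for stability
    e1, e2 = math.exp(z1 - m), math.exp(z2 - m)
    w1 = e1 / (e1 + e2)
    return w1 * x1 + (1.0 - w1) * x2
```

When the control scores are equal, S is the average of the two inputs; when one score dominates, S is essentially a copy of the corresponding input.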
And that turns out to be not only very useful, but basically universally used as a basic module in most natural language processing systems, and increasingly also in vision systems. It's a generalized form of this. Is that clear? Any questions at this point? No, so far everything's fine. Cool. So think of this as a soft switch, right? I told you here about a hard switch that just picks one of the two inputs. This is one that softly picks one of the two inputs, and kind of linearly interpolates between them otherwise. An interesting application of this is what's called a mixture of experts. Some of you may have heard of the latest gigantic model by Google, an enormous NLP system with a trillion parameters. It has a trillion parameters, but it's actually one of those mixture-of-experts things, so not all the parameters are used all the time. So let me explain what this is. Let's imagine that you have either a single input or two different inputs here, but let's say it's a single one: X1 and X2 are the same thing. We'll have two separate neural nets here, which are called expert one and expert two. Two separate deep learning systems that perhaps are each expert at a particular type of input, but not the entire thing. So let's say both X1 and X2 are equal, and X is spoken language, say. We want to do speech recognition, but one of our experts can understand Catalan, which is a language of northern Spain and southern France, and the other expert understands Provençal, which is a southern French dialect, if you want. Those two languages are actually very similar. They're also similar to Italian, actually, and Spanish and French, but French has more Germanic influence in it. So some speech comes in, and you don't know if it comes from someone from Provence or someone from Catalonia.
So what you want is to switch in the correct speech recognition system, but you need something to decide whether it's Catalan or Provençal. So you have another network here that looks at the same input, perhaps, or maybe a different input, but let's say the same input. It's called a gater, and that gater decides. It's got two outputs here, in this case, and the two outputs are (1, 0) if it's Catalan and (0, 1) if it's Provençal, or somewhere in between if it doesn't quite know, because many words are very similar, and maybe you can't tell in the first few words which one it is, or the pronunciations are very similar. So that softmax thing will basically decide on weights with which to combine the two experts so as to make a decision, initially not knowing which language this is, because they are so similar. You can redo this example with your local dialects, which are probably better examples for you than Catalan and Provençal. So that's called a mixture of experts. You can have as many of those experts as you want. Basically it says: I have multiple specialized experts, and I have a gater that decides which part of the space each expert is an expert at, and it switches in the correct expert at the right time for the current particular input. So let's take a very simple version of this, to kind of visualize it a little bit. Let's say that our input vectors are points in a plane, and category one is this, and category two is that. And let's assume our experts are linear classifiers, so a single-layer neural net, basically. What we'd like is one expert to tell us, in this part of the space, which category is blue and which one is red, and another linear classifier to tell us, in that part of the space, where is blue and where is red. And then what we need is the gater system, which is going to be another linear classifier, and in this part of the space it's going to tell us, use expert one, and in that part of the space it's going to say, use expert two.
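Putting the pieces together, a minimal mixture-of-experts sketch (plain Python; `experts` and `gater` are hypothetical stand-ins for the expert networks and the gating network):

```python
import math

def mixture_of_experts(x, experts, gater):
    """Output = sum_i w_i * expert_i(x), where the weights w are the
    softmax of the gater's scores (between 0 and 1, summing to 1)."""
    scores = gater(x)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum((e / total) * f(x) for e, f in zip(exps, experts))
```

If the gater is confident (one score much larger than the others), only that expert's output gets through; otherwise the outputs are blended.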
So very simply, here we can do a nonlinear classification with three linear classifiers in parallel, one of which decides which of the experts is the proper one to use. So this is the discriminant surface from the gater, and this one is for expert one and that one is for expert two. But this is not limited to this: you can have huge networks in those experts, and you can have modules of this type inside of a network if you want. Can you tell us about that Z, which is grey? Is it an observation? In this case it's an observation. So here, in the example I just drew, you basically have only one variable X, and it goes in parallel to the gater, to expert one, and to expert two. And then the gater has two outputs; those two outputs are weights between zero and one that sum to one. So you multiply the outputs of each of the experts, which are vectors, by those two things, and you sum up the results, okay? So that's the overall architecture. But this block is this: you have a softmax here inside, okay, and this is Z, this is X1 and this is X2, okay? But in general, so I used here a particular example where all three inputs are the same, but in general that's just a module you can stick in your network and do whatever you want with, with as many experts as you want. So the softmax is of what?
Softmax is a module that takes a bunch of inputs and produces a bunch of outputs, right? So in this case it takes two inputs and produces two outputs. And so whatever your gater is, you transform those outputs into a bunch of things that are between zero and one and sum to one. And the final question is going to be: how do you train this gater? Do you train it separately? You just do backprop. You don't have to worry about it; just backprop and get gradients through this, okay? So the question is, how do you do backprop? You don't have to think about it, you just write this down, right? You could try to be smart about it, but the best thing is to not be smart about it and just propagate gradients. This is a graph, like the other graphs we've written. Okay, let me draw this. So this is whatever, a neural net of some kind, this is another neural net of some kind, and this is another neural net of some kind; it doesn't matter what kind, okay? And then here... okay, so I'm writing this explicitly as multiplications, but really this is a multiplicative module of the type that we talked about, okay? And then it's just an addition. And you get some output, let's call it y bar. You plug this through a cost function, which takes the desired output y. Right, now I can encapsulate this into a single module; it doesn't matter what. I'm just telling you that there needs to be a softmax at the output, but it's just an architecture, whatever it is that you want. So how do we backpropagate gradients? We don't have to think about it. That's the cool thing about automatic differentiation and PyTorch and all those things. You just write your module this way, okay? You say: my output is the sum of this variable multiplied by that variable, plus this variable multiplied by that variable, okay? And those variables are equal to a softmax computed from z through some neural net, and those two variables are computed with f1 and f2. So you just write
this as a function, as a program. You know, it's a few lines in PyTorch, right? And automatically, PyTorch will figure out how to backpropagate the gradient. Basically it's going to say: okay, I'm going to get some gradient of my cost here, dC/dy-bar, right, which is going to be some value. Because this is a plus module, when you backpropagate a gradient through an addition, you basically copy the gradient to both inputs: dC/dy-bar, the same value, copied. When I backpropagate through a multiplication by a value, let me call this one w2, here I get w2 times dC/dy-bar; and if this one is w1, here I get w1 times dC/dy-bar. And then I take those, this is a vector, and I backpropagate it through the network, and we've already seen how we backpropagate through neural nets, right? So whatever happens inside here doesn't matter, and you're going to get a gradient with respect to the weights of this guy, which I'm going to call, I don't know, theta3, theta2, and this is theta1. Now, to backpropagate through this branch, it's the same thing. So let me call these z3 and z2, okay? So the gradient I'm going to have here is going to be equal to z3 times dC/dy-bar, because I have a product of two variables. So when I have a variable equal to the product of, in this case, z1 and z3... okay, I shouldn't call this y, I'm sorry, I should call this something else, because it's going to get confusing. So this would be this variable here, which I'm going to call v, v3, right? So v3 is equal to... well, it's equal to... I'm being extremely confusing here... w2 times z3, and again my numbers are horrible. So when I differentiate the product with respect to the first variable, I get dC/dv3 times dv3/dw2, and that part, the derivative of this with respect to w2, is equal to z3, right? So I get dC/dv3 times z3. And if I now differentiate with respect
to z3, going through the same process, I get dC/dz3... I'm sorry, I get dC/dv3, which I know, times w2, right? So if I have a gradient here, which is dC/dv3, when I backpropagate through this product in this branch, I get the product of that gradient by this weight; and when I backpropagate through that branch, I get the product of that gradient by this value. Okay, it's just simple derivatives. So then you will get a gradient with respect to those two outputs at the output of the softmax, and then you backpropagate through this other function. But you don't have to worry about it: PyTorch is going to do it for you. So here's an exercise, an interesting exercise, which I think will help you understand a lot of things. Let's say we have the softmax module: z_i equals e to the beta x_i, divided by the sum over j of e to the beta x_j, okay? Let's assume that I know dC/dz_i for every i. Compute dC/dx_k for any k, okay? All those guys are known; what are these? So what you can write is that they're going to be equal to the sum over i of dz_i/dx_k times dC/dz_i. This is the chain rule, right? We explained this last week. And the question is: what is this? This is the Jacobian of the softmax, okay? It's a matrix J, which for each pair of indices (i, k) has the partial derivative of the i-th output of the softmax with respect to the k-th input of the softmax. And it's a very good exercise to compute this, okay? I very much encourage you to do it. It's very easy to find the solution, you can find it in various places on the web, but it's much more fun to actually figure it out by yourself. You'll learn something by doing it, and it will tell you how you backpropagate through a softmax. Okay, here's something a little more complicated, as if that wasn't complicated enough, and it's reparameterizing the parameters. So this is a slightly more general instance of what I was just talking about, of product units where the weight matrix is the result of a
So here I'm making this a little more general. Imagine we have a network g(x, w), but the w's are not parameters that we adjust directly: they are themselves the result of applying a deep learning system to a set of more elementary parameters u — more elementary variables u. So this is an example of a neural net whose weights are the output of another neural net, if you want. It looks like a totally hairy thing, but it's actually used very, very commonly, in very simplified forms or in more general forms, and it's interesting to understand how it works — but in the end, PyTorch takes care of it for you.

So we have a neural net: it takes an input, applies some complex deep learning system to it — g(x, w), or maybe a simple one — and you get an output prediction ȳ, which you plug into a cost function of some kind — log softmax, whatever — together with a desired output. But now w is another input to that system. It's not an internal parameter; it's another variable, the output of some function h(u) that itself has inputs or parameters u that we're going to learn. These might be either learned parameters or inputs from some other source — let's say they're learned parameters, which is why I circled them in red.

So when we backpropagate the gradient: we have the gradient of the cost function with respect to ȳ, and we have two Jacobian matrices for g(x, w). One of them gives us the gradient of the cost with respect to x, which we don't care about very much, and the other gives us the gradient of the cost with respect to w. We interpret w as a vector — it could be a matrix or a tensor, but it doesn't matter — and we get a gradient vector of the same dimension that holds the partial derivatives of c with respect to all the components of w. Then we backpropagate that through h, and that gives us gradients with respect to u, and we can make an update of u using this. So the gradient with respect to u is the product of the gradient of the cost with respect to w and the Jacobian of h with respect to u. So u is updated by this quantity: eta times the gradient of the cost with respect to w times the Jacobian of h with respect to u — and that Jacobian is really the gradient of w with respect to u.

This results in an update of u, and we can ask what update on w it implies: because w is a function of u through h, we can compute what the update on w will be, and it's basically the update on u multiplied by the Jacobian of h — at least when the update is infinitesimally small. So the implied learning rule on w — which is just something we can look at without really worrying about it, because, as I said, PyTorch takes care of this, it's just backprop, nothing more — is this: doing gradient descent in u is as if we applied a gradient-based rule to w, where the update is: compute the gradient of the cost with respect to w, then multiply this gradient by a big matrix — the Jacobian of h multiplied by its own transpose, which is a positive semi-definite matrix — and then multiply by the step size. And in gradient descent, if you multiply the gradient by a positive semi-definite matrix, you're still doing gradient descent; it doesn't change that fact. So basically you're doing gradient descent in u, but as a side effect you're actually doing gradient descent in w, just in a different direction, where the gradient is transformed by this matrix. This particular point we'll come back to in a few weeks when we talk about optimization.
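To make that backprop path concrete, here's a minimal numeric sketch of the reparameterization, assuming the simplest possible h — a linear map w = A·u — so that its Jacobian is just A. All the names here (A, u, x, y, eta) are invented for the example, and g(x, w) is a bare dot product; it's a sketch, not the lecture's actual network.

```python
# Reparameterized weights: w = h(u) with h linear, i.e. w = A u.
# The network is g(x, w) = w . x and the cost is c = (g(x, w) - y)^2.

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

A = [[1.0,  2.0],
     [0.5, -1.0],
     [3.0,  0.0]]         # Jacobian dw/du of the linear h is A itself
u = [0.2, -0.3]           # the elementary parameters we actually learn
x = [1.0, 2.0, -1.0]      # network input
y = 0.5                   # desired output

w = matvec(A, u)                        # forward through h: w = h(u)
yhat = sum(wi * xi for wi, xi in zip(w, x))
dc_dyhat = 2 * (yhat - y)               # derivative of the squared cost
dc_dw = [dc_dyhat * xi for xi in x]     # backprop through g(x, w) = w . x
dc_du = matvec(transpose(A), dc_dw)     # backprop through h: J_h^T grad_w c

# Implied update on w when we take a gradient step on u:
eta = 0.01
du = [-eta * g for g in dc_du]
dw_implied = matvec(A, du)              # equals -eta * A A^T grad_w c
```

Note that dw_implied works out to −eta · A·Aᵀ · ∇w c: the positive semi-definite matrix A·Aᵀ is exactly the "Jacobian times its own transpose" from the lecture, specialized to this linear h.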
But the whole idea of making the weights a function of something else is used a lot. Here's an example where it's used: weight sharing. Let's say h(u) is a very simple function that takes a single scalar variable u and duplicates it into multiple copies — it makes it into a vector where all the components have the same value. So essentially we have some big network here that goes into a cost function of some kind, and we have a single variable u and a module which is basically a duplication module: it just copies that variable u into multiple copies of itself. It's a Y cable, right? You need to connect multiple headphones to your music player, you need a Y cable. It just copies the signal. Let me call these copies, these connections, w.

So, to get the gradient of the cost function with respect to u, what do I need to do? In other words: how do you backpropagate the gradient through a module that does nothing but copy a variable? Let me take a very simple example: z1 = u and z2 = u. If I have the gradient of some cost function with respect to z1 and with respect to z2, what is the gradient of that cost function with respect to u? The answer is the sum of those two gradients. Why is that? First of all, it's a direct consequence of the general form of the chain rule, which I worked out last week: if you have a variable that influences multiple variables, the gradient of some subsequent cost function with respect to that variable is the sum, over each of the variables it influences, of the gradient with respect to that variable times the derivative of that variable with respect to it — and this is just a special case. Why? If I perturb u by delta u, it perturbs z1 by delta u and z2 also by delta u. As a consequence, delta c = (dc/dz1) · delta z1 + (dc/dz2) · delta z2, and both delta z1 and delta z2 are equal to delta u. These are infinitesimal quantities, so I can divide both sides by delta u, and I get dc/du = dc/dz1 + dc/dz2. So a module that copies a variable into multiple copies — when you backpropagate through it, you just compute the sum of the gradients. It's as simple as that. In this little example, if I get a vector of gradients of the cost with respect to w, I just sum them up and get the gradient with respect to u: dc/du = Σi dc/dwi. Very simple — but again, PyTorch is going to take care of that for you.

Why am I telling you this? It's a very interesting special case, and it's a preview of the basic trick used in convolutional nets, which we're going to talk about next week. Let's say we want a system to detect a motif. Detecting a motif might mean detecting a face in an image, for example, or detecting a particular key phrase in a spoken sentence. Say you're Amazon or Google or Facebook, and you're supposed to have a very simple neural net that listens to the audio and detects "Hey Google", or "Alexa", or "Hey Facebook", or "Hey Portal". Now here's the thing — I just said "Hey Google" and of course my phone woke up. Or you want a face detector: you have an image and a little neural net that you train to be a face detector, and you'd like to apply it to every location in the image — but you'd also like to train it in that situation, where we don't necessarily know where the face is. Or, I don't know, you're playing the stock market and you want to detect a particular motif in the variation of a set of financial values, regardless of when it happens in your sequence. So what you need is a neural net that looks at a set of inputs — a piece of an
image, a segment of a signal, whether it's speech or financial values or whatever it is — and you need to slide it over your sequence of inputs, so that your detector turns on wherever it needs to turn on. You can think of this as different copies of the same neural net that all share the same weights. So you have one weight vector that is used by all of those neural nets, but when you backpropagate the gradient through this, each of those neural nets contributes a gradient to whatever objective function you're minimizing. And the way to compute the overall gradient with respect to w is just to sum up the individual contributions of the gradient from each of those instances, those replicas. So this is nothing more than an example of that fan-out, the Y connector, where one set of variables is used multiple times; because it's used multiple times, when I backpropagate I get multiple gradients — one for each use — and I sum them up to get the gradient with respect to that variable. Is that clear? Questions? No questions.

Actually, we're pretty much out of time — pretty much where I wanted to be, so that's great. We'll see everyone in class tomorrow for the PyTorch part, where we're actually going to be using PyTorch, finally, for training a network, and we'll see how running backprop is very easy. And we will be releasing the homework right after the lab, so that you have all the knowledge to start working. The exercise I gave you here — computing backprop through the softmax — is a good exercise. It's been written in the chat that these exercises are very much encouraged, right. I'll put the PDF version of the whiteboard on the drive so you have a record of it. That's great. See you, everyone. Bye!
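(A small numerical companion to the weight-sharing discussion above, for anyone following along at home: one shared weight vector w replicated at every position of a toy 1-D signal, like a miniature convolutional detector. The signal, window size, and squared-output cost are all made up for illustration — a sketch, not the course's actual convolution code.)

```python
# One detector, replicated at every position of a 1-D signal; the replicas
# all share the same weight vector w, so the gradient w.r.t. w is the SUM
# of the per-position contributions.

def detector(w, window):
    # score of the motif detector at one position: a plain dot product
    return sum(wi * xi for wi, xi in zip(w, window))

x = [0.1, 0.9, -0.4, 0.7, 0.2, -0.1]   # input sequence
w = [0.5, -0.2, 0.3]                   # shared weights, used at every position
k = len(w)
positions = range(len(x) - k + 1)

# cost: sum of squared detector outputs over all positions (arbitrary choice)
c = sum(detector(w, x[t:t + k]) ** 2 for t in positions)

# Backprop: each replica contributes its own gradient to the shared w.
dc_dw = [0.0] * k
for t in positions:
    s = detector(w, x[t:t + k])
    for j in range(k):
        dc_dw[j] += 2 * s * x[t + j]   # contribution of the replica at position t
```

The key line is the `+=`: exactly the Y-connector rule from the lecture, with each position t adding its own gradient contribution to the shared dc_dw.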