Welcome to the fifth lesson of the 2022 fall edition of Deep Learning. I just decided to make a new lesson for you today, so it's not yet available online, and you have the preview, right? As last week you had a preview of the book, now you have a preview of this new lesson. Before that, there are a few announcements; I'll go through them very quickly. So let's go to Twitter. Okay, the latest post was about showing you there is a new website for the Coding Train. Daniel Shiffman is a professor here with 1.5 million subscribers on YouTube. It's a really good website where he teaches you how to make animations in JavaScript, and also in BASIC on the Apple I, or whatever first computer they were using. One more thing that is noteworthy from this week is PyTorch: we now have a PyTorch Foundation. So that's actually Linus Torvalds, and this is Soumith, one of the three co-authors of PyTorch, together with my former undergrad student, Adam Paszke, plus a third one; I forgot the name. Anyway, if you go on the website of PyTorch, you can see they now have the PyTorch Foundation. They have officially made this a standalone... spin-off? How do you call it? Something standalone. All right, that's it, announcements done. Now back to the standalone... thank you, back to the lesson. And we are talking about ANN. What is this stuff? This is an artificial neural net, right? Supervised learning, classification, as I was saying before. This is not an introduction to classification; this is another take on classification with neural nets, given that you already know about classification. You are supposed to have already taken machine learning courses and to be aware of what this is about. Anyway, I'll give you my perspective on the topic and on how to perform this with neural nets. So the data here, the data points, whatever I use to make an example, are going to be this spiral, right? Those are points.
These are my x's, my inputs to the system, and the labels are going to go from 1 to K, where capital K is the number of classes. And then here I add some noise, such that I make them a little bit more real, okay? So what is the major problem here if I would like to classify these points with a linear classifier? Type it in the chat. So if I use a linear classifier, what's going to happen here? They're not linearly separable. So it means that we won't be able to split apart the input space by using these kinds of hyperplanes, okay? So the major issue here is that there are these kinds of overlaps between the data points and the decision boundary, right? Those lines were the decision boundary. So how do you fix this usually? How can we fix it? How can we avoid having those intersections? Okay, a kernel. So what does the kernel do? Or whatever, what's going to happen, right? So if those are straight lines, they're going to be intersecting those things. So what else can you do, right? There are two different things we can do. Either you can try to bend these decision boundaries, right? And that's usually what people do whenever they show this kind of representation of classification. What I like to do instead is going to be bending the space such that it becomes linearly separable, okay? So these are two different perspectives on the same thing. Let me show you a small animation about what I'm trying to tell you here, and then let's see how we can do that, right? So what I want to do with a neural network is what I'm showing you in this left-hand-side animation. The data was not linearly separable; therefore, I would like to learn a vector field which is moving my data and making it linearly separable. Okay, does it make sense? So the data, as it is in the original input space, is not linearly separable because it's very tangled up, right?
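To make this concrete, here is a minimal sketch of how such a spiral toy dataset can be generated: K arms, each arm a curve of growing radius, with Gaussian noise added to the angle. The exact constants (arm length 2.5 radians, noise scale, point counts) are my assumptions, not necessarily the notebook's values.

```python
import numpy as np

def spiral_data(n=100, K=3, sigma=0.2, seed=0):
    """Generate n points per class along K spiral arms, with angular noise.

    A sketch of the lecture's toy data; exact constants are assumptions.
    """
    rng = np.random.default_rng(seed)
    X = np.empty((n * K, 2))
    y = np.empty(n * K, dtype=int)
    for k in range(K):
        r = np.linspace(0.05, 1.0, n)                 # radius grows along the arm
        t = np.linspace(k * 2 * np.pi / K,            # each arm starts at its own angle
                        k * 2 * np.pi / K + 2.5, n)
        t = t + rng.normal(0, sigma, n)               # the added noise
        X[k * n:(k + 1) * n] = np.column_stack((r * np.cos(t), r * np.sin(t)))
        y[k * n:(k + 1) * n] = k                      # class label in {0, ..., K-1}
    return X, y

X, y = spiral_data()      # 300 points in the plane, 3 classes
```

With the noise term removed, the arms become perfectly clean curves; with it, nearby arms start to overlap, which is what makes the toy problem "a little bit more real".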
Therefore, I would like to learn a vector field, which basically associates to every location of the space a direction, right? A displacement, such that every location in the plane has a specific displacement which allows me to unwrap this warped input, okay? So to clarify, we are transforming the data to a new space to achieve linear separability. Yeah, so this is actually just an animation right now. This is cooked up, right? I made it by undoing this spiral in the input. But what I would like to do is train a neural network such that it gives me a transformation, right? Like a vector field, which tells me how to move each point such that, after applying such a transformation, the data lives in a linearly separable space, okay? Good. So let's try to do that. Usually, you see the other one, right? You usually see this one: whenever you see classification, you can see how the decision boundaries morph and then match the input space. So I would call the right-hand-side illustration the input perspective, where I look from the input, right? If the network is like that, I look from the input, from the bottom side, at how those linear decision boundaries get warped by the model, okay? What I prefer to do is to look from the top after I put in my non-linearly-separable, warped manifold, and see how this gets unwarped when looked at from the top, okay? I hope it makes sense. So before showing you this, let me recall last week. Remember what was the last thing I showed you last week? And I'm reading the chat. Do you remember? What did we talk about last week? Everyone's quiet. Yeah, inference and rotation, yeah, yeah. But remember what was the exact last thing that we covered, right? So we talked about the fact that these neural nets are like sandwiches, right? Sandwiches of linear and nonlinear, yes, okay? So it's a sandwich of linear and nonlinear layers.
And then we saw that the linear layer is basically performing a rotation and then some sort of change of the zoom factor, the scaling factor. And then there is the other part, which is the nonlinearity, which performs some kind of transformation that is nonlinear, right? So either we had that kind of boxing thing, or we had just the top quadrant surviving, right? And then I showed you the last thing: when I put several layers one after the other, we were getting an arbitrary transformation, remember? I showed you the one with the hyperbolic tangent, which was like popping out. Let me get it out; maybe I can show you again, so we remember. Remember? So this was an arbitrary stack of fully connected layers, and my cloud of points was getting warped somehow, right? So we get an arbitrary transformation. Today, what we are going to be learning is how we are going to be enforcing a specific warping such that data becomes linearly separable, for example, right? Such that it helps to perform a specific task, like classification in this case. So let me show you a video. So this is going to be my input data. In this case, I show you five spiral arms, right? Then I have a neural network which goes from two input units, okay, at the bottom. Then those two go, with an affine transformation, to 100 units in the middle. I apply a positive part, or ReLU. Then I go down to two, and then out to five, such that I have five classes, okay? So: two to 100 linear, ReLU, i.e. positive part, from 100 to two, and then from two to five. Why that two? Just so I can plot things on the screen. So here I show you how points move after I send them through a network that has been trained to make data linearly separable, okay? So this is how this transformation looks. Every point here is actually going through a linear transformation.
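The architecture just described can be sketched in PyTorch like this. This is my reconstruction from the spoken description, not the actual notebook code; the batch size and initialization are placeholders.

```python
import torch
from torch import nn

# The "fat" network from the animation: 2 -> 100 -> ReLU -> 2 -> 5.
# The 2-d layer after the ReLU exists only so the hidden representation
# can be plotted on screen.
wide_net = nn.Sequential(
    nn.Linear(2, 100),   # affine: rotation + scaling + bias, up to 100 units
    nn.ReLU(),           # positive part
    nn.Linear(100, 2),   # back down to 2-d for plotting
    nn.Linear(2, 5),     # one output score per spiral arm (K = 5)
)

x = torch.randn(8, 2)    # a batch of 8 points in the plane
scores = wide_net(x)     # one score vector of 5 entries per point
```

Note that the last two linear layers compose into a single affine map from 100 to 5 dimensions; the intermediate 2-d bottleneck is there purely for visualization, exactly as stated in the lecture.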
Remember we told you last time that a ReLU network, a network with a positive part, is simply a piecewise linear transformation. So each location here is moving on a line, okay? But different locations move along different lines because, again, it's piecewise linear. Every location goes through a linear transformation, but the linear transformation depends on the location. I hope it makes sense, okay? So every location in the plane will move according to one line, but every location has a different line, okay? One line per location. And these are the linearly separable points, okay? Like, you have these locations, and all those points are now in locations of the plane where they are linearly separable. And finally, maybe you can tell me what these five arrows are, okay? Let me know, not now, later, or even now if you want, okay? So this network, as I can show you here — and I won't tell you if it's correct or if it's wrong, just keep it in mind — was trained on a dataset with this particular shape. No. Okay, that's a good question: how was this trained? I'll tell you today, right? So yes, you're correct, Krishna: today's lesson is going to be about how we are going to train a system such that it performs such a transformation, okay? And next week, we are going to actually get the code running. This network here, I told you before, goes from two, linear, to 100 hidden, positive part, 100 to two such that I can plot, two to five such that I can perform the classification. The next network, this one here, instead goes from two units to two, to two, to two, so it keeps staying in two dimensions, with a positive part after each hidden layer. Then there is the final two-dimensional linear embedding, and then I go up to the output of three, okay? So this was the three spiral arms. So what are we trying to do here, compared to the one before? Before we had a very fat network, right? We had 100 units.
Here we have only two units per layer, but it's a deep network, okay? So which one is going to work better? The fat network of 100 units, a single layer, or the deep one with four hidden layers, okay? People say deep. Let me show you what happens here, okay? So I made this one such that you can see what happens exactly, okay? So right now you can clearly tell — what can you tell? Two things you can tell. What did I tell you this network was? It will never generalize outside the data; okay, piecewise, no, that is correct, right? So this is exactly the point that we were trying to make. Before you couldn't tell; you had to tell me why. Now you can clearly tell that regions of the input space are morphed through a linear transformation, okay? So all these planes move all together, right? What's the correct word? It's like a rigid transformation; it's like a global one, I would say: all the points in a region of the plane move in the same way, okay? Why weren't you seeing this in the previous video? No, both networks use the positive part; that is a correct observation, Martin. They were definitely more than two dimensions, but what's the point, right? Oh yeah, there you go, Nola is also correct. Before, there were so many small regions, okay? One near the other, and the transformations of nearby regions were slightly different, right? And so if you have many small regions with slightly different transformations, it looks almost like a continuous type of transformation, you understand, right? It's as if you have many segments, and you make the segments smaller, and you make the orientation of each segment slightly different — you change it just a little bit every time — then it looks like a curve, right? But still, it's piecewise linear. Let me fully play the animation, because there's one more thing. Second question: why do they seem to lie on a 3D plane? That is a very good question. Keep it in mind and let me know if you come up with an answer by the end of the class.
If you don't, I'll let you know by the end of the class. It's a good question, Yifeng. Second question here for you: was this network easier or harder to train with respect to the previous network? I wish. You can try. So people in the chat have been saying this was easier. This was really a pain in the... how do you say it in a nice way? In the neck, there you go — to train this network, okay? Yeah, I know, but I'm not supposed to say that with bad words, they've told me. Anyway, this was very hard to train. Second question for you, right? Take note of the questions I'm asking. Why? You have to tell me, I guess, by the end of the class. Yeah, just keep it in mind, think about it. I won't tell you if it's right or if it's wrong; everyone should think. Anyway, let's keep going. So all these kinks are really, really hard on the optimizer, okay? Because the gradients are... they are the same with respect to the whole region, right? They all change the same way. Is it correct to say zero? I don't know. But the gradients are going to be all the same for the whole region, right? So, okay, the gradient of the gradient is zero, right? So the gradient is constant across the whole set of points — that's what I meant. Okay, and as you can tell here, in order to be able to linearly separate those points, the network had to stretch them so much, okay? Also, one more point: why? This is really hard to train, yeah. So why do I see the yellow dots on the top right side? And also some purple points on the bottom right-hand side. So what are the singular values of those matrices? How do you get a large singular value? Yeah, you're correct. How do you get a large singular value out of a matrix? Would the matrix have small, small, small values or large values? Let's say it would be this way: large, okay? So this skinny network will have very large values. We don't like large values, okay?
Because they perform this kind of very strong stretching: the network is forced to pull those things apart, but things are packed very tightly together, because we are in a low-dimensional space. And so the network will just brute-force it with strength, and it will pull things apart, breaking, you know, breaking your code most of the time, right? I managed here not to get it to break. What you usually observe is going to be NaNs and Infs and all that sort of very annoying stuff, okay? And that's a small peek into the pain of training this system. Let me show you very quickly, and then I'll stop wasting your time with this stuff. This is the first affine transformation. Then this is the positive part — as you can tell, everything vanishes except in the first quadrant — then the second linear transformation. And as someone pointed out before, this kind of looks like it's in 3D, but it's not. So why not? You have to think about that. Now we have the third linear transformation — well, affine transformation. You have again the positive part, one more, the final one, and then we have the shifting, right? Oh, sorry. Okay. What happened here? Okay, Martin says a ReLU. Who can correct Martin? I mean, Martin is almost right, right? Why do I say almost? Maybe you cannot see there are two faint lines, right? I think it was a leaky ReLU. Yeah, Patrick, you're correct. Why is Patrick correct? Because there really is a negative part. Yeah. So there is this negative part, which is quite large because the previous singular values were very large, right? So it gets squashed quite a bit. And so again, this leaky ReLU basically compresses the negative side a lot, but doesn't kill it completely. Why did I use a leaky ReLU for this network? You're very good, I like you — you plural, right? Why did I use a leaky ReLU to train this model? Yeah, to avoid getting dead units, right?
If I had used just ReLU — I wasn't able to train this with just pure ReLU. You can try to train this stuff after I give you the notebook; actually, you already have the notebook on the website. If you try to train this with just ReLU, it doesn't work, right? It's too hard to train this model, and unwarping the spiral in 2D is too hard a problem. So how do we fix it? This is so bad, but that's the truth. How do we fix this hard optimization problem? That's the answer to all your questions in deep learning: add more depth, add dimensionality, right? Add weights. There you go. Okay, that is the truth. Anyway, that's the final thing. And now I'm going to perform a singular value decomposition (SVD) of each linear transformation, such that you can see: rotation, scaling, rotation, flipping, bias, ReLU. Rotation, scaling, whatever — I cannot keep up with the speed of the animation — bias, and then ReLU, right? So: rotation, zooming or reflection, rotation, reflection, bias, and then positive part. Anyway, we're very, very, very done. So this was all the fancy animations I wanted to show you today. Deeper is not better in this case. So the point is that — again, take this with a grain of salt — instead of having, let's say, 100 neurons in width, you can have two layers of 10, okay? So if you have 10 and 10, it's kind of analogous to having just one layer of 100. In our case, we had two, two, two, two. So two to the four... or is it four to the two? 16, right? That was a joke. No one got it. Okay, better this way. So we basically had a network worth 16 units rather than the other one with 100 neurons, right? And so we just had a comparison between a very undersized, underparameterized model and the overparameterized one. The use of depth allows you to exponentially reduce the width, right? Every time you add a layer, it's like multiplying the number of combinations, more or less.
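The decomposition shown in the animation can be reproduced in a couple of lines: any real weight matrix W factors as W = U Σ Vᵀ — rotation/reflection, per-axis scaling by the singular values, another rotation/reflection. A minimal NumPy sketch, with a random stand-in matrix in place of the trained weights:

```python
import numpy as np

# Stand-in for the weights of one 2 -> 2 linear layer.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 2))

# SVD: W = U @ diag(S) @ Vt.
# U and Vt are orthogonal (rotation/reflection); S holds the stretch factors.
U, S, Vt = np.linalg.svd(W)

# Large singular values mean the layer stretches space violently — the
# skinny 2-d network needs them to tear the spiral apart, which is what
# tends to blow training up into NaNs and Infs.
W_rebuilt = U @ np.diag(S) @ Vt
```

Applying the three factors one at a time to a cloud of points is exactly the rotation, scaling, rotation sequence the animation plays, with the bias and the positive part applied afterwards.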
Okay, so this is one of the first slides which is going to be hard, in the sense that these are definitions which are correct — and unfortunately, no one sticks to these definitions, not even most of my colleagues, right? But you see, these are the correct things, and everyone else does incorrect things many times. Still, in this class we'll try to be consistent and be correct. Anyway, let's move forward. I hope what I'm showing you makes sense. So this is my piece of data, okay? This kind of capsule here, all in white. My white capsule can be split into three different parts: I have the pink circle on the left side, the blue circle in the center, and the orange circle on the right-hand side. Okay, so far, just arbitrary colors for an arbitrary subdivision of the data. Now I'm going to be splitting this into two parts. The left-hand side is going to be viewable — it's observable — whereas the right-hand side is going to be unobservable, okay? So far, I think we can all agree this sounds okay, reasonable. Now, for the variables that are observed I'm going to be using a slight background, like a shading color, right? That means those are observed. Every time you see a circle with a shaded background, that means it's an observed variable. The ones with a transparent background are not seen; they are not observable, they cannot be observed. And now, unfortunately, the painful part — which is not painful if this is your first class, but if you have learned this from other courses, then it's going to be painful. So: the pink thing is going to be called x, and x is going to be your observation. Okay, that's what is given to you every single time. If there is an x, you are able to see the x. Then there is the blue ball y, okay, the blue ball y.
So now you have the color code for the whole course: the pink ball x is going to be the observation; the blue ball y is going to be the target, or what we would like to learn. The target is given to you during training; it is not given to you during inference, because during inference you would like to come up with the target, right, with a possible prediction. So far, we are good, I hope. Finally, in orange, which you never see, is going to be the latent variable z, okay? The orange ball z — you never observe it, so it's always going to stay hidden back in the unobservable space. Do you always have this latent variable? No, definitely not. If the x and y are always a deterministic pairing, then you don't need an extra variable that explains random stuff and unpredictable things. Now, a question for you: if I'm doing supervised learning, what am I trying to do in this setup over here? We'd like to predict — type! We'd like to predict — no one is typing. We'd like to predict y given — okay, now everyone is typing. Let's be exact. Thank you. We'd like to predict y given that we are observing x. So here we all agree. My next question would be: in unsupervised learning, what happens? What's missing in unsupervised learning? Okay, so yeah, yeah, yeah. Someone said the label — that is correct, we are missing the label. But the point I told you before is that y is the thing we would like to learn: y is given to us during training; it's not given to us during inference. Okay, x is always given to you. So in unsupervised learning, we don't have an observation x. You have only been observing targets during training, the y's. And then you will try to predict some targets, or you can do some form of generation, or whatever. So this is one of the major differences between this course and the other courses: y's are the thing we'd like to learn.
You can learn y from x, or you can just learn y regardless of the fact that there is an x. So if it's unconditional learning, it's going to be just learning y; if it's conditional learning, it's going to be learning y given x. I hope it makes sense. If it doesn't, well, too bad; moving on. So how is this classification data organized? Okay, so we have this pink ball x at the bottom, which is now going to be in our n-dimensional space. I can stack several examples: in this case, I have capital P distinct inputs. Each of these row vectors — sorry, yeah, row vectors — has n components, right? Because we said x_i lives in ℝⁿ, and we said we have P of them. This is a matrix; it's called the design matrix. It has n columns and P rows. For each and every x, we're also going to have a class label: y_p belongs to the columns of the identity matrix. The I is the identity matrix, right? So if y is one of the columns of the identity matrix, it's going to be one-hot. So if we use three classes: for the first class, we're going to use (1, 0, 0); for the second class, we're going to use (0, 1, 0); and for the third class, (0, 0, 1). Okay, so now I can create this matrix capital Y, where I stack all these rows, right? I have as many y's as x's, and each y is going to be all zeros except a single one — one-hot. So we have P of these targets of capital K elements. All right, so let's go through the neural network equations. So at the bottom, we're going to have this pink ball x. This is a two-dimensional vector, right? In the plane. It goes through a predictor, which is just an arbitrary name for this first module, and this one spits out the hidden representation. In a second, I will ask you: hidden representation of what? Don't answer yet, because you don't know what I'm talking about. This hidden representation goes through a decoder, which gives me this ỹ. What is ỹ? ỹ is an approximation of my y, okay?
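The design matrix and the one-hot targets just described can be sketched like this; the sizes and the label values are arbitrary illustrations.

```python
import numpy as np

# P samples, each with n components, stacked as rows of the design matrix.
P, n, K = 4, 2, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(P, n))          # design matrix: P rows, n columns

# Each label is encoded as the matching row/column of the K x K identity.
labels = np.array([0, 2, 1, 0])      # class index of each sample
Y = np.eye(K)[labels]                # one-hot targets, shape (P, K)
# e.g. class 2 of 3  ->  (0, 0, 1)
```

Indexing the identity matrix with the label vector is a common one-liner for one-hot encoding: row p of Y is all zeros except a single one in column labels[p].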
So the blue ball y was my target, remember? On the other side, I have my pink ball x, and then I try to make an approximation of this blue ball y. So now, how would I like my ỹ to be with respect to y? Far or close? Close, right? And so in order to make my ỹ close to my blue ball y, I just attach a spring, remember? I told you that last week as well. And so here I'm going to be adding this box, which represents the energy of a spring, and which is going to be making the ỹ close to y once I move my parameters a little bit, okay? So now, if I compute the force, there is a force pulling ỹ towards y, and that force, through back-propagation, is also exerted on the parameters of the model. I'll tell you about that in a bit as well. Let's keep going here. We can also write down the equations of these two modules, okay? So the first equation is this h, the green ball h. H stands for hidden representation. Hidden means it's internal, okay? Hidden and internal are synonyms. Latent is something else; we won't talk about latent variables until a few classes from now, okay? So let's forget for now about latent and the word latent. We just talk about the hidden h, or internal representation, which is the representation the model has of... something of the input, or something, right? Actually, I have to be more precise here: it's going to be the hidden representation of my y. So if I have my hidden representation of the y, I can decode it and get this ỹ, okay? So the predictor allows me to move from the x-space to the y-space — well, to the encoded, hidden y-space. And then the decoder allows me to go from the hidden space of the y down to the ỹ, okay? So the predictor allows me to jump from the x-space to the y-space: if you think about splitting the diagram in half, I go from one side to the other with the predictor. And then the decoder allows me to go from top to bottom, right?
From the hidden representation down to the actual prediction. Okay, moving on, so that's the equation here: ỹ is going to be this g. What is this g? So f and g are arbitrary nonlinear functions — could be positive part, sigmoid, hyperbolic tangent, soft argmax, whatever. "So again here, I'm a little confused about what exactly the decoder does. Does it predict the probabilities, predict the class? This seems to be an arbitrary splitting of the inference process." Yes, it's an arbitrary splitting; it's just semantics. And what the g does, I'll tell you in a bit, don't worry. So ỹ is going to be a function of my input, okay? So my prediction — meaning which of the five classes a point belongs to — is a function of the input. So it's like mapping an n-dimensional space, which here is my two-dimensional space, down to five dimensions or three dimensions, whatever number of spirals I had there. But what I usually think is better, in this case, is to think about two jumps, right? The first jump goes from my input space ℝⁿ to this large internal hidden d-dimensional space. And then we go from this large internal hidden d-dimensional space down to the target space, which is this K-dimensional space. And this d is supposed to be way larger, right? Very large with respect to the input and the output. What I just said will create some confusion when we think about images, where we have maybe a one-megapixel image. So how large should the hidden representation be now? Larger than one million? That is going to be hard. So that's why we have — and we've already seen — convolutional nets; we'll talk about that next week. So far, I think you have already seen all these things before. Now, something you haven't seen before: we're going to be introducing this level of incompatibility between my input x and the target y, and we call this F. It's going to be my energy.
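The two jumps — predictor f from the input space to the hidden space, decoder g from the hidden space to the target space — can be written out explicitly. This is a bare NumPy sketch with random, untrained weights; the sizes match the spiral example (n = 2, d = 100, K = 5) but are otherwise placeholders.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)        # the positive part

n, d, K = 2, 100, 5                  # input, hidden, and target dimensions
rng = np.random.default_rng(0)
W_h, b_h = rng.normal(size=(d, n)), np.zeros(d)
W_y, b_y = rng.normal(size=(K, d)), np.zeros(K)

def predictor(x):
    """First jump, h = f(x): input space R^n -> hidden d-dim space."""
    return relu(W_h @ x + b_h)

def decoder(h):
    """Second jump, y_tilde = g(h): hidden space -> K-dim target space."""
    return W_y @ h + b_y             # raw scores; soft argmax comes later

x = rng.normal(size=n)               # one 2-d point
y_tilde = decoder(predictor(x))      # one score per class
```

The split into predictor and decoder is, as the lecture says, a matter of semantics: composing the two gives one function from x to ỹ, but naming the intermediate h is what lets us talk about the hidden representation.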
So the energy expresses the level of incompatibility between my observation, the 2D location, and my target, which is going to be one of the — whatever the number is — capital K spirals. This level of incompatibility is going to be the same value as this cost term C. So the cost here measures the distance, or the divergence if you want, between my target, the blue ball y, and the violet ỹ. So the C is going to be the distance between those things; that was the cost. Now my energy F — the energy — has exactly the same value, but it's a function of different variables. So F expresses the level of incompatibility between my 2D input x and my target y. How much is this level of incompatibility? Well, to compute this level, I first have to compute my ỹ, which is my prediction given the input, and then I measure the distance between my ỹ and my y. So again, just definitions: F represents the level of incompatibility between the input and the target; C represents the distance of my prediction to the target. I hope these make sense as definitions; there's nothing more than definitions over here. This part you might not have seen before; everything else you should have seen. How do we train this network? How can we reduce that spring energy? How can we get the ỹ to approach y? Well, that was the F: F is basically this big box that encloses everything, so the whole model is part of the energy. Whereas C is just a little box, the F, the whole big box, encloses the whole thing. "I can understand incompatibility between y and ỹ, but I don't understand what it means for the input." Yes, I know; you will see in the upcoming slides and illustrations. On the note of why we need F: yes, to understand all the energy-based models later. So this is going to be the energy perspective on classification. So far, everything can be simplified and brought back to whatever you already know about classification.
Right now, I'll just give you additional hooks, additional hints about how to interpret this classifier as an energy-based model. Here we have this classification output, which is going to be my ỹ, we said before. Actually, I'm going to tell you now: it's going to be the soft argmax of s, the output of the linear summation, the linear module. This is going to be defined as — and we define it here — the ratio of the exponential of the specific item over the sum of all the exponentials. Usually people call this softmax; that's incorrect. I'll tell you more in the future. So I'm going to define here the loss as being the average of this per-sample loss function. So this is going to be my curly L, my loss for the whole dataset. It's going to be a function of the parameters: w basically represents all the weights of my model, and S is going to be my training set. This is going to be the average of the per-sample loss. So this one is going to be a function of the parameters, the weights, for a given x and y. And my per-sample loss function — a function of the weights, x, and y — is going to be defined as the energy, in this case. So I choose — and this is something you don't see often, well, I guess you don't see in other courses — we choose our loss to be the energy loss. So the loss is equal to the energy. Okay, this is a definition: we choose the energy loss. And the energy F, we said before, is equal to this C term, right? I'm going to define C, the cost, here as the negative logarithm of the inner product between the y, the target, and the ỹ. So if you take the inner product between y and ỹ, I extract one value, which is going to be the value of my prediction at that specific location, and then I take the negative logarithm. That's called cross entropy, negative log probability — it's called many things.
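The soft argmax and the cost C = −log⟨y, ỹ⟩ can be written in a few lines. A minimal sketch, with hand-picked score vectors as the illustration:

```python
import numpy as np

def soft_argmax(s):
    """exp(s_k) / sum_j exp(s_j) — what is usually (mis)named softmax."""
    e = np.exp(s - s.max())          # subtract the max for numerical stability
    return e / e.sum()

def cost(y, y_tilde):
    """C = -log <y, y_tilde>: with a one-hot y, this picks out the predicted
    value at the target class and takes its negative log (cross entropy)."""
    return -np.log(y @ y_tilde)

y = np.array([1.0, 0.0, 0.0])        # target: first of three classes, (1, 0, 0)

good = soft_argmax(np.array([5.0, 0.0, 0.0]))   # prediction roughly (1, 0, 0)
bad  = soft_argmax(np.array([0.0, 5.0, 0.0]))   # prediction roughly (0, 1, 0)

cost(y, good)   # close to 0+: prediction roughly agrees with the target
cost(y, bad)    # large: blows up toward +infinity as <y, y_tilde> -> 0+
```

Since the exponentials are strictly positive, the soft argmax never outputs exactly 0 or 1 — which is exactly why the lecture says "roughly" zero and "roughly" one, and why the cost stays in (0, +∞).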
So C — and F also, which was the level of incompatibility between my x and y — are equal, right? So what do we say here? The loss is going to be defined as the level of incompatibility between my x and y. So when we do gradient descent, what do we do? We minimize the loss, right? We can think about it, in this case, as minimizing the level of incompatibility between the x and the given y, okay? So this is a different take. In this case, we have L equal to F equal to C. Here it seems very intuitive. Yes, that's what we're going to have for this case; later on, things will be different, okay? So as you pointed out, everything is equal to everything else, right? L equals F equals C; everything simplifies. Later on, we will change these things: F is not going to be equal to C, and then L is not going to be equal to F. So here — a good point — we are introducing two degrees of freedom, okay? L, F, and C in this case are all the same; later on, we're going to change each of these two equalities in order to give our model, or ourselves, much more freedom to do different things. That's a good question. Okay, moving on. So I'm going to have an arbitrary x, and it belongs to my first class, so my target is going to be (1, 0, 0). Let's say my prediction, my ỹ(x), is going to be roughly one, roughly zero, roughly zero. I say roughly because the soft argmax never takes the values zero or one, right? It's always strictly inside. So if I compute the cost of this thing, what do I get? You're going to get zero, right? More precisely, this one here is actually a little bit less than one; the log of something a little less than one is going to be zero minus; then I have the negative sign in front, so this is going to be zero plus, okay? Otherwise, let's say my model gets it completely wrong, and my model says the second component is going to be roughly one, right? So what happens now when I compute this cost? This roughly zero means what? Roughly zero means zero plus, right?
Because again, we are defined only on the open interval here, right? So zero plus: you take the log of zero plus, what do you get? Minus infinity. You have the minus in front, so you get plus infinity. And so the loss is going to be zero plus whenever the model makes a prediction that roughly agrees with my target, and this loss will skyrocket to basically plus infinity when you completely miss the prediction, okay? So now we have this scalar value that tells me how bad a given set of parameters is. So L, the capital L in red, is a function of the weights: it tells me how bad a specific set of weights is. F tells me, after I train the model, how incompatible those two things, x and y, are, and C tells me basically how far the prediction is from the target, okay? And F is going to be used later, at inference, as well, yes. Is the sign of zero significant? Yes, because it's incorrect otherwise, right? You cannot get a negative zero here. So the point is, okay, why am I talking about these things? Good question, Patrick. Whenever you do computations with a computer, you work with floating-point numbers, not real numbers, right? So you will not get exactly zero. Will you get something slightly positive or something slightly negative? It matters, right? If you get something negative, ah, something got messed up. That's why it's important. Those are the small details that allow you to debug your code. Let me push through and finish this first part. And I guess next week we're going to see the new slides I made for you, which I didn't manage to get to; I didn't want to rush, okay? So W is going to be the weights of the network, which is just the collection of all the parameters, all the matrices and biases and everything, okay?
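The floating-point aside can be made concrete. A minimal sketch, assuming Python's standard `math` module (not part of the lecture), of why the sign at zero is a useful debugging signal:

```python
import math

# Floating-point numbers carry a sign even at zero, and the log is only
# defined on the open interval (0, 1] for our probabilities:
print(math.copysign(1.0, 0.0))    # 1.0  -> a "zero plus"
print(math.copysign(1.0, -0.0))   # -1.0 -> a "zero minus": a red flag

p = 1e-300                         # a "roughly zero" probability, still > 0
print(-math.log(p))                # huge positive loss, heading to +inf

try:
    math.log(-1e-300)              # a negative "probability" is a bug
except ValueError as e:
    print("debug signal:", e)      # math domain error
```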
I have this loss, which is, again, the average of the per-sample loss. So let's say we have this curve here. This is going to be my loss, on the vertical coordinate. On the horizontal coordinate, I have just a scalar w; you see it's not bold, so it's just a single scalar value. I start from an initial guess for my parameters when I randomly initialize the model; I call it w naught. I'm going to have a specific initial value for the loss. I can compute the loss at that value, which is not even needed; what I compute is the derivative. In this case, the derivative is positive. If I have a positive derivative, I know I want to move to which side? To the left, right? Because it's positive, I want to move in the negative direction. So I want to move by a fraction of the negative derivative: I would like to move proportionally to the negative derivative. This is called gradient descent, okay? So let's make a quiz. If I ask you, how are neural networks trained? Neural networks are trained with, answer: gradient descent, okay? If I ask you, what is back-propagation used for? Back-propagation is a way to compute the gradients. Good. If the question is, is gradient descent used only for training? The answer is no, okay? Is back-propagation used only for training? Ouch. Not exclusively; that's the correct answer. Okay, thank you. All right, you're good, you're keeping up. So the second part of the lesson, which of course we are running out of time for, we won't cover, but I will put it on Twitter as a preview, okay, if you want to check it afterwards. So how do we compute this partial derivative? How do I compute the partial derivative of the loss with respect to a specific parameter, the W_y one, or the other one, the W_h, okay?
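The update rule just described, moving proportionally to the negative derivative, can be sketched on a single scalar w. This is my own toy example with an illustrative quadratic loss, not the lecture's network:

```python
# Minimal sketch of gradient descent on one scalar parameter w.
def loss(w):
    return (w - 3.0) ** 2        # toy loss, minimised at w = 3

def dloss_dw(w):
    return 2.0 * (w - 3.0)       # its derivative

w = 0.0                          # w_0: the initial guess
eta = 0.1                        # learning rate: the "fraction"
for _ in range(100):
    w -= eta * dloss_dw(w)       # step along the *negative* derivative

print(w)                         # converges towards 3.0
```

With a positive derivative the step is negative (move left), with a negative derivative it is positive (move right), exactly the picture on the slide.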
And so there is this kind of repetition of operations, right? Some terms simplify. Again, this was not very clear to me until I actually started drawing things out and, how do you say, spending the time, right? So in order to better understand how this back-propagation works, we have a homework coming out tonight where you're going to be implementing back-propagation by hand, such that you acquire an understanding of how these algorithms work, okay? Is this useful later on in practice, whenever you're using a framework? No, in the sense that these frameworks already implement backprop for you. So why the heck are we asking you to do this by hand? Such that you understand exactly what's happening backstage, underneath the software. Because whenever you'd like to change something, or something doesn't work and you want to investigate, you have to have the knowledge of how to probe those things, okay? And so it's really important for you to actually get your mind around this algorithm, get familiar with it, and, you know, get a bit annoyed, I guess. But the understanding will pay back a lot, trust me, in the future, okay? So that was pretty much all I wanted to tell you today. Well, not really: I created a new lesson which I didn't get to talk about, but I guess we'll do that next week. So unless there are questions, I would say we are done with class for today. Have a nice evening and enjoy your weekend; it's going to be fresh this weekend. Okay, all good. Awesome. Let me know if there is any other feedback you'd like me to implement. These slides will go on the Google Drive now, I guess. And I'll see you next week. Bye-bye.
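In the spirit of the back-propagation-by-hand homework mentioned above (this is my own toy model, not the actual assignment), one can apply the chain rule manually on a tiny one-parameter model and sanity-check the analytic gradient against a finite difference:

```python
import math

# Tiny model: s = w * x, y_tilde = sigmoid(s), loss = -log(y_tilde)
# for a target of 1 (a one-class version of the cross-entropy above).
def forward(w, x):
    s = w * x
    y_tilde = 1.0 / (1.0 + math.exp(-s))
    return -math.log(y_tilde)

def backward(w, x):
    # Chain rule by hand: dL/dw = dL/ds * ds/dw,
    # where dL/ds = y_tilde - 1 for this loss, and ds/dw = x.
    s = w * x
    y_tilde = 1.0 / (1.0 + math.exp(-s))
    return (y_tilde - 1.0) * x

w, x = 0.5, 2.0
analytic = backward(w, x)
eps = 1e-6
numeric = (forward(w + eps, x) - forward(w - eps, x)) / (2 * eps)
print(analytic, numeric)   # the two gradients should agree closely
```

This kind of finite-difference check is exactly the sort of probing skill the homework is meant to build.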