As you know, we're going to talk about deep learning, and we're going to jump right in. Much of the practical application of deep learning today, and of machine learning and AI in general, uses a paradigm called supervised learning, which I'm sure most of you have heard of before. This is the paradigm by which you train a machine by showing it examples of inputs and outputs. Say you want to build a machine to distinguish images of cars from airplanes. You show it an image of a car. If the machine says car, you don't do anything. If it says something else, you adjust the internal parameters of the system so that the output gets closer to the one you want. So imagine the target output is some vector of activities on a set of outputs: you want the vector coming out of the machine to get closer to the desired vector. This works really well. As long as you have lots of data, it works for speech recognition, image recognition, face recognition, generating captions, translation, all kinds of stuff. This is, I would say, 95% of all applications of machine learning today. There are two other paradigms, one of which I will not talk about, and one of which I will talk about a lot. The first is reinforcement learning, which I will not talk about; there are other courses, including one by Lerrel Pinto, that I encourage you to take. The third paradigm is self-supervised learning, or unsupervised learning, and we'll talk about it quite a lot in the following weeks. But for now, let's talk about supervised learning. Self-supervised learning, you could think of as a kind of play on supervised learning.
So the traditional model of pattern recognition, machine learning, and supervised learning, going back to the late 50s or early 60s, is the idea that you take a raw signal, say an image, an audio signal, or a set of features representing an object, and turn it into a representation using a feature extractor, which in the past was engineered by hand. Then you take that representation, which is generally in the form of a vector, a table of numbers, or some kind of tensor, a multi-dimensional array, though it could sometimes be a different type of representation, and you feed it to a trainable classifier. That is where the learning takes place. This is the classical model, and it's still popular; it's still used a lot. What deep learning has done is replace this manual hand-engineering of the feature extractor with a stack of trainable modules, if you want. So the main idea of deep learning, and the only reason why it's called deep, is that we stack a bunch of modules, each of which transforms the input a little bit into something at a slightly higher level of abstraction, and then we train the entire system end to end. In my diagrams, the pinkish modules indicate the ones that are trainable, and the blue modules are the fixed, hand-engineered ones. So that's why deep learning is called deep: we stack multiple layers of trainable things and we train the whole stack end to end. The idea goes back a long time. The practical methods go back to the mid-to-late 80s, with the backpropagation algorithm, which is going to be the main subject of today's lecture, actually. But it took a long time for this idea to percolate and become the main tool people use to build machine learning systems; that's only about 10 years old. Okay, so let's go through a few definitions.
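To make the "stack of modules" picture concrete, here is a minimal sketch in plain Python (the specific layer functions and weights are invented for illustration): a model is just a chain of simple transforms applied one after the other.

```python
# A deep model is a composition of simple modules applied in sequence.
# Each module transforms its input into a slightly more abstract representation.

def linear(w):
    # returns a module that multiplies each input component by a weight
    return lambda x: [wi * xi for wi, xi in zip(w, x)]

def relu(x):
    # elementwise nonlinearity: positive part of each component
    return [max(0.0, xi) for xi in x]

def stack(modules, x):
    # run the input through every module, end to end
    for m in modules:
        x = m(x)
    return x

model = [linear([2.0, -1.0, 0.5]), relu, linear([1.0, 1.0, 1.0]), relu]
print(stack(model, [1.0, 2.0, -4.0]))  # [2.0, 0.0, 0.0]
```

In a real system every `linear` module's weights would be tuned by training, which is exactly what the rest of the lecture builds up to.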
So we're going to deal with parameterized models. A parameterized model, or learning model if you want, is a parameterized function g(x, w), where x is the input and w is a set of parameters. I'm representing this on the right with a particular symbolism: a function like this produces a single output, and you should think of the output as a vector, a matrix, a tensor, or perhaps even a scalar, but generally it's multi-dimensional. It can actually be something other than a multi-dimensional array, maybe something like a sparse array representation or a graph with values on it, but for now let's think of it just as a multi-dimensional array. So both the inputs and the outputs are multi-dimensional arrays, what people call tensors. That's not really the proper mathematical definition of a tensor, but it's okay. And that function is parameterized by a set of parameters w. Those are the knobs that we're going to adjust during training, and they determine the input-output relationship between the input x and the predicted output y-bar. Okay, I'm not explicitly representing the wire that comes in with w; I assume that w is somewhere inside the module. Think of this as an object in object-oriented programming: it's an instance of a class that you instantiated, it's got a slot in it that holds the parameters, and there is a forward function that takes the input as argument and returns the output. Okay. A basic learning machine will also have a cost function, and the cost function, in supervised learning but also in some other settings, computes the discrepancy, distance, divergence, whatever you want to call it, between the desired output y, which is given to you by the training set, and the output y-bar produced by the system. Okay, so a very simple example of a setting like this is linear regression.
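The object-oriented analogy can be sketched like this (the class and method names are my own, chosen to mirror the description, not a specific library's API):

```python
class LinearModel:
    """A parameterized model g(x, w): the parameters sit in a slot,
    and a forward function maps the input x to the predicted output y_bar."""
    def __init__(self, w):
        self.w = w               # the knobs we adjust during training

    def forward(self, x):
        # y_bar = dot(w, x)
        return sum(wi * xi for wi, xi in zip(self.w, x))

g = LinearModel([0.5, -1.0])
print(g.forward([2.0, 1.0]))     # 0.5*2 + (-1.0)*1 = 0.0
```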
In linear regression, x is a vector with components x_i, w is also a vector, and the output is a scalar that is simply the dot product of x with w. So y-bar is a scalar, and what you compute is the squared distance, the squared difference really, between y and y-bar. If w is a matrix, then y-bar is a vector, and you compute the squared norm of the difference between y and y-bar. That's basically linear regression. Learning will consist in finding the set of w's that minimizes this cost function averaged over a training set. I'll come to this in a minute, but I want you to think right now about the fact that this g function may not be something particularly simple to compute. It may not be just multiplying a vector by a matrix. It may not be a fixed computation with a fixed number of steps. It could involve something complicated: minimizing a function with respect to some other variable that you don't know, or many iterations of some algorithm that converges towards a fixed point. So let's not restrict ourselves to functions g(x, w) that are simple things; they could be very complicated, and we'll come back to this in a few weeks. Right, so this is just to explain the notation that I will use during this course. We have observed input and desired output variables; those are the grayish bubbles. Other variables, produced by the system or internal to it, are the empty circles. We have deterministic functions, indicated by this rounded shape here; they can take multiple inputs and have multiple outputs, and each of those can be a tensor or a scalar or whatever, and they have implicit parameters that are tunable by training. And then we have cost functions. Cost functions are functions that take one or multiple inputs and output a scalar.
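As a sketch of the matrix case (the numbers are invented for illustration): with w a matrix, y-bar is a vector, and the cost is the squared norm of the difference between y and y-bar.

```python
def linear_predict(W, x):
    # y_bar = W x : each row of W produces one component of the output vector
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def squared_norm_cost(y, y_bar):
    # C(y, y_bar) = ||y - y_bar||^2
    return sum((yi - ybi) ** 2 for yi, ybi in zip(y, y_bar))

W = [[1.0, 0.0],
     [0.0, 2.0]]
x = [3.0, 1.0]
y = [2.0, 2.0]                       # desired output from the training set
y_bar = linear_predict(W, x)         # [3.0, 2.0]
print(squared_norm_cost(y, y_bar))   # (2-3)^2 + (2-2)^2 = 1.0
```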
But I'm not representing the output; it's implicit. Okay, so if you have a red square, it has an implicit scalar output, and we interpret it as a cost or an energy. This symbolism is similar to what people use in graphical models, if you've heard of those, particularly the type of graphical model called a factor graph. In a factor graph, you have those variable bubbles, and you have factors, which are those square cost functions. Factor graphs don't have the notion of deterministic functions, because graphical models don't care whether a function goes in one direction or the other; but here we do care, so we have this extra symbol. Okay, so machine learning consists in finding the set of parameters w that minimizes the cost function averaged over a training set. A training set is a set of pairs (x[p], y[p]) indexed by an index p; we have P training samples, and little p is the index of the training sample. Our per-sample loss function is the cost of the discrepancy between y and the output of our model: L(x, y, w) = C(y, g(x, w)), as I said earlier. So C is a module, and L is just a way of writing C(y, g(x, w)) in a form that depends explicitly on x, y, and w; but it's the same thing, really. The overall loss function, the curly L, is the average of the per-sample loss over the entire training set: compute L for every training sample, sum all the terms, and divide by P. That's the average; that's the loss. Okay. So now the name of the game is to find the minimum of that loss with respect to the parameters. This is an optimization problem. Symbolically, I can represent this entire setup as the graph on the right. This is rarely used in practice, but it's a way to visualize it.
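Putting the definitions together, the overall loss is the per-sample loss averaged over the training set. A minimal sketch, using the dot-product model again (model and data are invented):

```python
def g(x, w):
    # a simple parameterized model: dot product of w and x
    return sum(wi * xi for wi, xi in zip(w, x))

def C(y, y_bar):
    # per-sample cost: squared difference
    return (y - y_bar) ** 2

def overall_loss(w, training_set):
    # curly-L(w) = (1/P) * sum_p C(y[p], g(x[p], w))
    return sum(C(y, g(x, w)) for x, y in training_set) / len(training_set)

data = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]
print(overall_loss([1.0, -1.0], data))  # perfect fit -> 0.0
print(overall_loss([0.0, 0.0], data))   # 1.0
```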
So think of each training sample as having an identical copy, a replica if you want, of the model and the cost function, applied to that particular sample, and then an average operation that computes the loss. Everything you can write as a formula, you can probably write in terms of those graphs, and that is going to be very useful, as we'll see later. Okay. So supervised machine learning, and a lot of other machine learning paradigms as well, can be viewed as function optimization. A very simple approach to optimizing a function, which means finding the set of parameters that minimizes its value, is gradient descent, or gradient-based algorithms in general. A gradient-based algorithm makes the assumption that the function is somewhat smooth and mostly differentiable. It doesn't have to be differentiable everywhere, but it has to be continuous, it has to be differentiable almost everywhere, and it has to be somewhat smooth; otherwise, the local information about the slope doesn't tell you much about where the minimum is. Okay. So here's an example, depicted on the right. The pink lines are the lines of equal cost, and this cost is quadratic, so it's basically a kind of paraboloid. And this is the trajectory of a method called stochastic gradient descent, which we'll talk about in a minute. For stochastic gradient descent, the procedure is: you show an example, you run it through the machine, you compute the objective for that particular sample, and then you figure out by how much and in which direction to modify each of the knobs in the machine, the w parameters, so that the objective goes down by a little bit. You make that change, and then you go to the next sample. Let's be a little more formal. Gradient descent is this very basic algorithm here.
You replace the value of w by its previous value minus a step size, eta here, multiplied by the gradient of the objective function with respect to the parameters. So what is a gradient? A gradient is a vector of the same size as the parameter vector, and for each component of the parameter vector, it tells you by how much the loss function L would increase if you increased that parameter by a tiny amount. It's a derivative, a directional derivative really. So among all the components, look only at w34, and imagine that you tweak w34 by a tiny amount. The loss function, curly L, is going to change by a tiny amount. You divide the tiny amount by which L changes by the tiny amount by which you modified w34, and what you get is the partial derivative of the loss with respect to w34. If you do this for every single weight, you get the gradient of the loss function with respect to all the weights: a vector which gives you that quantity for each component of the parameter vector. Since the days of Newton and Leibniz, this has been written dL/dw, because it indicates that there is this little twiddle, right? You can twiddle w by a little, there is a resulting twiddle of L, and if you divide those two twiddles as they become infinitely small, you get the derivative. That's been standard notation in mathematics for a few hundred years. So the gradient is a vector, and as indicated here on the top right, that vector is an arrow that points upwards along the direction of largest slope. Say you have a 2D surface: you have two parameters, w0 and w1, and the cost surface, some sort of quadratic bowl in this case, is a second-order polynomial in w0 and w1. Here on the right is the top-down view, where the lines represent the lines of equal cost, and the little arrows represent the gradient at various locations.
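The "twiddle" definition of the gradient can be checked numerically with finite differences. A sketch, using a made-up two-parameter quadratic loss:

```python
def loss(w):
    # a simple quadratic loss in two parameters: L(w) = (w0 - 1)^2 + 2*(w1 + 2)^2
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 2.0) ** 2

def numerical_gradient(f, w, eps=1e-6):
    # twiddle each parameter by eps and divide the twiddle of L
    # by the twiddle of w (central difference)
    grad = []
    for i in range(len(w)):
        w_plus = list(w); w_plus[i] += eps
        w_minus = list(w); w_minus[i] -= eps
        grad.append((f(w_plus) - f(w_minus)) / (2 * eps))
    return grad

g = numerical_gradient(loss, [0.0, 0.0])
print(g)  # analytic gradient at [0, 0] is [2*(w0-1), 4*(w1+2)] = [-2.0, 8.0]
```

This is far too slow to use for training (one loss evaluation per parameter), but it is exactly how people sanity-check their backpropagation code.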
So you have a long arrow where the slope is steep, a short arrow where the slope is shallow, and at the bottom the gradient is zero. And it points in the direction of steepest ascent, all right? So imagine you are in a mountainous landscape, in a fog, and you want to go down into the valley. You can't tell where the minimum is because of the fog, but you can tell the local slope of the landscape around you. So you can figure out the direction of steepest ascent and take a step, which would take you upwards; instead you turn around 180 degrees, take a step in that direction, and it takes you downwards. If you keep doing this, and the landscape is convex, which means it has only one minimum, this will eventually take you down to the bottom of the valley, and presumably to the village. Right, so those are gradient-based algorithms. They differ in how you compute the gradient and in what this eta step-size parameter is. In the simple forms, eta is just a positive constant, which is sometimes decreased as the system runs, but most of the time not. In more complex versions of gradient-based learning, eta is actually an entire matrix, generally a positive definite or semi-definite matrix, and so the direction adopted by those algorithms is not necessarily the direction of steepest descent: it goes downwards, but not necessarily along the steepest slope. And we can see why here. In this diagram, this is the trajectory that would be followed by gradient descent in this quadratic cost environment. As you see, the trajectory is not straight, because the system goes down by following the slope of steepest descent, and so it goes down into the valley before finding the minimum of the valley, if you want.
Right, so if your cost function is a little squeezed in one direction, it will go down into the ravine and then follow the ravine towards the bottom. In complex situations, when the loss function is highly irregular, the trajectory might be even more complicated, and then you might have to be smart about what you do. Okay, so stochastic gradient descent is universally used in deep learning. It's a slight modification of the steepest-descent algorithm: you don't compute the gradient of the entire objective function averaged over all the samples. Instead, you take one sample, compute the gradient of the objective function for that one sample with respect to the parameters, and take a step. Then you pick another sample, compute the gradient of the objective for that sample with respect to the weights, and make an update. Why is it called stochastic gradient? Stochastic is a fancy term for random, essentially, and it's called stochastic because the gradient you get on the basis of a single sample is a noisy estimate of the full gradient. Because the gradient is a linear operation, the average of the per-sample gradients is the gradient of the average loss, and so things work out: if you keep computing those noisy gradients and keep going, the average trajectory will be roughly the trajectory you would have followed by doing full gradient descent. But in fact, the reason we do this is that it's much more efficient in terms of speed of convergence. Although the trajectory followed by stochastic gradient is very noisy, and things can bounce around a lot, as you can see in the erratic trajectory here at the bottom, it actually gets to the bottom faster.
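A minimal SGD loop on the linear-regression setup from earlier (the data, step size, and number of steps are invented for illustration):

```python
import random

def sgd(training_set, w, eta=0.1, steps=200, seed=0):
    # stochastic gradient descent on L = (y - w.x)^2, one sample per update
    rng = random.Random(seed)
    w = list(w)
    for _ in range(steps):
        x, y = rng.choice(training_set)          # pick one sample at random
        y_bar = sum(wi * xi for wi, xi in zip(w, x))
        # gradient of (y - y_bar)^2 w.r.t. w_i is -2 * (y - y_bar) * x_i
        for i in range(len(w)):
            w[i] -= eta * (-2.0 * (y - y_bar) * x[i])
    return w

data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -3.0)]   # exactly fit by w = [2, -3]
w = sgd(data, [0.0, 0.0])
print(w)  # converges close to [2.0, -3.0]
```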
And it has other advantages that people are still writing papers about. The reason is that stochastic gradient exploits the redundancy between the samples. In a machine learning setting, the training samples have some similarity between them; if they don't, the learning problem is basically impossible. So they necessarily have some redundancy, and the more often you update the parameters, the more you exploit that redundancy between the samples. Now, in practice, what people do is use mini-batches. Instead of computing the gradient on the basis of a single sample, you take a batch of samples, typically anywhere between, say, 30 and a few thousand, though smaller batches are better in most cases, actually. You compute the average cost over those samples, compute the gradient of that average, and then make an update. The reason for doing this is not intrinsically algorithmic; it's that it's a simple way of parallelizing stochastic gradient on parallel hardware such as GPUs. So there's no good reason to do batching other than the fact that our hardware likes it. Okay. Question. Yeah. So, for real complex deep learning problems, does this objective function have to be continuously differentiable? Well, it needs to be continuous, mostly. If it's not continuous, you're going to get in trouble. It needs to be differentiable almost everywhere. But in fact, the neural nets that most people use are actually not differentiable; there are a lot of places where they're not differentiable. But they are continuous, in the sense that they are functions that have corners in them, if you want; they have kinks. And if you have a kink once in a while, it's not too much of a problem. But in that case, those quantities should not be called gradients.
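The mini-batch variant simply averages the per-sample gradients before each update. A sketch, reusing the squared-error linear model (batch contents are invented):

```python
def batch_gradient(batch, w):
    # average over the batch of the per-sample gradient of (y - w.x)^2
    grad = [0.0] * len(w)
    for x, y in batch:
        y_bar = sum(wi * xi for wi, xi in zip(w, x))
        for i in range(len(w)):
            grad[i] += -2.0 * (y - y_bar) * x[i]
    return [gi / len(batch) for gi in grad]

batch = [([1.0, 0.0], 2.0), ([0.0, 1.0], -3.0)]
print(batch_gradient(batch, [0.0, 0.0]))  # [-2.0, 3.0]
```

On a GPU, the loop over the batch becomes one parallel matrix operation, which is the whole point of batching.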
They should be called subgradients. Okay. A subgradient is a generalization of the idea of derivative, or gradient, to functions that have kinks in them. Wherever a function has a kink, any slope that lies between the slope on one side and the slope on the other side is a valid subgradient. So when you are at the kink, you decide the derivative is this, or that, or somewhere in between, and you're fine. Most of the results about minimization that apply to smooth functions often apply also to non-smooth functions that are differentiable almost everywhere. So then how do we ensure strict convexity? We do not ensure strict convexity. In most deep learning systems, the function that we are optimizing is non-convex. In fact, this is one reason why it took so long for deep learning to become prominent: a lot of people, particularly theoreticians, people who are theoretically minded, were very scared of the idea of minimizing a non-convex objective, and said this can't possibly work because we can't prove anything about it. Turns out it does work. You can't prove anything about it, but it does work. And so this is a situation, and it's an interesting thing to think about, where theoretical thinking limited what people could do in terms of engineering, because they couldn't prove things about it. But engineering without proofs can actually be very powerful. Yes, we want to optimize non-convex functions; like my colleague at Bell Labs who didn't like the non-mathy approach, it was a whole debate within the machine learning community that lasted 20 years, basically. All right. So, doesn't SGD get stuck in local minima once it reaches them? It does. Full gradient descent does get stuck in local minima. SGD gets slightly less stuck in local minima, because it's noisy.
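For instance, at the ReLU kink at zero, any value in [0, 1] is a valid subgradient. A sketch (the choice of 0 at the kink is one common convention, not the only valid one):

```python
def relu(s):
    # the half-wave rectifier: identity for positive arguments, zero otherwise
    return max(0.0, s)

def relu_subgradient(s):
    # slope is 0 to the left of the kink and 1 to the right;
    # at s == 0 any value in [0, 1] is a valid subgradient -- we pick 0
    if s > 0:
        return 1.0
    return 0.0

print([relu_subgradient(s) for s in (-2.0, 0.0, 3.0)])  # [0.0, 0.0, 1.0]
```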
The noise sometimes allows it to escape local minima. But the real reason we can optimize non-convex functions, and local minima are not going to be such a huge problem, is that there aren't that many local minima that act as traps. We're going to build neural nets, deep learning systems, in such a way that the parameter space has such high dimension that it's going to be very hard for the system to actually create local minima that trap us. Think about a picture where, in one dimension, we have a cost function with one local minimum and then a global minimum, a function like this, and we start from here. If we optimize using gradient descent, we're going to get stuck in the local minimum. Now imagine that we re-parameterize this function with two parameters, so we're not looking at a one-dimensional function anymore but a two-dimensional one. That extra parameter may allow us to go around the mountain and towards the valley, perhaps without having to climb the little hill in the middle. This is just an intuitive example to tell you that in very high-dimensional spaces, you may not have as much of a local-minimum problem as the intuitive picture of low-dimensional spaces suggests. These two-dimensional pictures are very misleading: we're going to be working with millions of dimensions, and some of the most recent deep learning systems have trillions of parameters. So local minima are not going to be that much of a problem. We're going to have other problems, but not that one. So there is a trend towards this overparameterization, right? It seems like the more neurons we have, the better these networks work, somehow. That's right.
So we're going to make those networks very large, and they're going to be overparameterized, which means they'll have way more adjustable parameters than we actually need, which means they're going to be able to fit the training set almost perfectly. And the big question is: how are they going to work on a validation set or test set that is separate from the training set? Or how are they going to work in a real situation, where the distribution of samples may be different from what we trained on? That's the real question of machine learning, which I'm sure a lot of you are familiar with. Two more questions. So, how do we escape saddle points? Right. There are tons and tons of saddle points in deep learning systems, a combinatorially large number of saddle points, as a matter of fact. I'll have a lecture on this, so I don't want to spend too long answering. But yes, there are saddle points. The trick with saddle points is that you don't want to get too close to them, essentially, and stochastic gradient helps a little bit with that. Some people have proposed explicit methods to stay away from saddle points, but in practice it doesn't seem to be that much of a problem, actually. Finally, how do you pick samples for stochastic gradient descent? Randomly? Okay, there are lots of different methods for that. The basic thing you should do is take your training set, shuffle the samples into a random order, and then pick them one at a time, cycling through them. An alternative is, once you get to the end, you reshuffle and cycle through again. Another alternative is to draw a random sample, using a random number, every time you need a new one. If you do batching, a good idea is to put into a batch samples that are maximally different from each other.
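The shuffle-and-cycle scheme can be sketched as a minimal epoch loop (the dataset here is a placeholder):

```python
import random

def epochs(training_set, n_epochs, seed=0):
    # shuffle once per epoch, then cycle through the samples one at a time
    rng = random.Random(seed)
    order = list(training_set)
    for _ in range(n_epochs):
        rng.shuffle(order)       # reshuffle at the start of each pass
        for sample in order:
            yield sample

data = [("x1", "y1"), ("x2", "y2"), ("x3", "y3")]
seen = list(epochs(data, n_epochs=2))
print(len(seen))  # 6: every sample is visited exactly once per epoch
```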
Things from different categories, for example, if you're doing classification. But most people just pick them randomly. Still, it's good to have samples that are maximally different near each other, either within a batch or during the course of training. And then there are all kinds of tricks people use to emphasize difficult samples, so that you don't waste your time seeing the boring, easy samples over and over again. All kinds of tricks, all right? But the simple method, which most people use, is: shuffle your samples and run through them. A lot of people now also use data augmentation, where every sample is distorted by some process. For an image, you can distort the geometry a little bit, change the colors, add noise, etc. This is an artificial way of adding more samples than you actually have, and people either apply those transformations randomly on the fly or pre-compute them. So lots of tricks there as well. Last question: how do you pick the batch size? The batch size? Oh, that's largely determined by your hardware. If you have a GPU, then for reasonably sized networks your batch size would be anywhere between 16 and 64, or something like that. For smaller networks, you might have to batch more to exploit your hardware better, to get maximum usage out of it. If you parallelize over multiple GPUs within a machine, say eight GPUs, then it would be eight times 32, so 256 or something. And then a lot of the big players parallelize over multiple machines, each with eight GPUs, some with TPUs, whatever, and then you might end up with batches of thousands of examples.
There are diminishing returns in doing this: when you increase the size of the batch, you actually reduce the speed of convergence. You accelerate the computation, but you reduce the speed of convergence, so at some point it's not worth increasing your batch size. So if we are doing a classification problem with K classes, what's going to be our go-to batch size? There are papers saying that if your batch size is significantly larger than the number of categories, say twice the number of categories, then you're probably wasting computation and slowing down convergence. So if you're training an image recognizer on ImageNet and your batch size is larger than about a thousand, you're probably wasting time. Well, wasting computation; you're not necessarily wasting time. Okay, so let's talk about traditional neural nets. A traditional neural net is a model, a particular type of parameterized function, built by stacking linear and nonlinear operations. Here is a depiction of a traditional neural net, in this case with two layers, but you can imagine there being more. You have a bunch of inputs here on the left. Each input is multiplied by a weight, different weights presumably, and the weighted sum of those inputs by those weights is computed by what's called a unit, or a neuron. People don't like using the word neuron in this context, because these are incredibly simplified models of neurons in the brain, but that's the inspiration, really. So one of those units computes a weighted sum of its inputs using its weights; this other unit computes a different weighted sum of the same inputs with different weights; and so on. Here we have three units in the first layer. This is called a hidden layer, by the way, because it's neither an input nor an output: this is the input, this is the output, and this is somewhere in the middle.
So we compute those weighted sums, and then we pass each of them individually through a nonlinear function. Here what I've shown is the ReLU function, the rectified linear unit; that's the name people have given it in the neural-net lingo. In other contexts it's called a half-wave rectifier, if you're an engineer, or the positive part, if you're a mathematician. Basically, it's a function that is equal to the identity when its argument is positive and equal to zero when its argument is negative. A very simple graph. And then we stack a second layer of the same thing, a second stage: again a layer of linear operations where we compute weighted sums, and then we pass the results through nonlinearities. We can stack many of those layers, and that's basically a traditional, plain-vanilla, garden-variety neural net, in this case fully connected. A fully connected neural net means that every unit in one layer is connected to every unit in the next layer, and you have this well-organized layered architecture, if you want. Each of those weights is one of the things that our learning algorithm is going to tune. And the big trick, the one trick, really, of deep learning is how we compute those gradients. Okay, so if you want to write this down: give a number to each of the units in the network. For unit number i, the weighted sum s_i is simply the sum, where j runs over the set of units upstream of i, which may be all the units in the previous layer or just a subset, of the product of z_j, the output of unit j, times w_ij, the weight that links unit j to unit i. Then you take this weighted sum s_i, pass it through the activation function, this ReLU or whatever it is you use, and that gives you z_i, the activation of unit i.
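The two equations, s_i = sum_j w_ij z_j followed by z_i = h(s_i), amount to this forward pass for a fully connected layer (a sketch with made-up weights; two inputs, three hidden units, one output unit):

```python
def relu(s):
    # the activation function h: positive part of its argument
    return max(0.0, s)

def layer_forward(W, z_prev):
    # s_i = sum_j w_ij * z_j, then z_i = h(s_i), for every unit i in the layer
    z = []
    for row in W:                                    # one row of weights per unit
        s = sum(wij * zj for wij, zj in zip(row, z_prev))
        z.append(relu(s))
    return z

W1 = [[1.0, -1.0], [0.5, 0.5], [-2.0, 1.0]]          # 3 hidden units, 2 inputs
W2 = [[1.0, 1.0, 1.0]]                               # 1 output unit
x = [2.0, 1.0]
hidden = layer_forward(W1, x)    # [1.0, 1.5, 0.0]
print(layer_forward(W2, hidden)) # [2.5]
```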
Okay, so with this notation, by changing the set of upstream units of every unit, by building a graph of interconnections, you can build basically any kind of network arrangement that you want. There is one constraint, which we will lift in a subsequent lecture: the graph has to be acyclic, in the sense that it can't have loops. If there are no loops, you can organize the units in layers; you can number them in such a way that whenever you want to compute a unit, you already have the states of the units upstream of it. If there are loops, you can't do that. So for now, we're going to assume that the w matrix represents a graph that doesn't have loops. Okay, so here's an intuitive explanation of the backpropagation algorithm. Backpropagation is the main technique used everywhere in deep learning to compute the gradient of a cost function, an objective function, whatever it is, with respect to a variable inside the network. This variable can be a state variable like a z or an s, or it can be a parameter variable like a w, and we're going to need both. This is going to be an intuitive explanation; afterwards there will be a more mathematical explanation, which is less intuitive but perhaps actually easier to understand. But let me start with the intuition. Say we have a big network, and inside this big network we have one of those little activation functions, in this case a sigmoid, but it doesn't matter what it is for now. This function takes an s and produces a z; we call it h(s). So when we wiggle z, the cost is going to wiggle by some quantity, right? And we divide the wiggling of c by the wiggling of z that causes it. That gives us the partial derivative of c with respect to z.
So there is a gradient of c with respect to all the z's in the network. And there's one component of that gradient, which is the partial derivative of the cost with respect to that single variable z inside the network. Okay? And that really indicates how much c would wiggle if we wiggled z by some amount. We divide the wiggling of c by the wiggling of z and that gives us the partial derivative of c with respect to z. This is not how we're going to compute the gradient of c with respect to z, but this is a description of what it is conceptually, okay? Or intuitively, rather. Okay, so let's assume that we know this quantity. So we know the partial derivative of c with respect to z, okay? It's this quantity here, dc over dz, okay? So think of dz as the wiggling of z and dc as the wiggling of c; divide one by the other and you get the partial derivative of c with respect to z. What we have to apply is the chain rule, the rule that tells us how to compute the derivative of a function composed of two individual functions that we apply one after the other, right? So remember the chain rule: if you have a function g that you apply to another function h, which is a function of a parameter s, and you want the derivative of it, the derivative of that is equal to the derivative of g at point h of s, multiplied by the derivative of h at point s, right? That's the chain rule; you learned that a few years ago, hopefully. Now, if I want to write this in terms of partial derivatives, it's the same thing, right? A partial derivative is just a derivative with respect to one single variable. So I would write this something like this: dc over ds. So c really is the result of applying this h function to s and then applying some unknown g function to compute c, which is kind of the rest of the network plus the cost. But I'm just going to assume that this dc over dz is known. Someone gave it to me.
So this is this variable here on the right. dc over dz is given to me. And I want to compute dc over ds. So what I need to do is write this: dc over ds equals dc over dz times dz over ds, right? And why is this identity true? It's because I can simplify by dz. It's as simple as this, right? So you have, you know, trivial algebra. You have dz in the denominator here, dz in the numerator here. Simplify, and you get dc over ds, okay? It's a very trivial, simple identity, which is basically just the chain rule applied to partial derivatives. Now, dz over ds, we know what it is. It's just h prime of s, okay? Just the derivative of the h function, okay? So we have this formula: dc over ds equals dc over dz, which we assume is known, times h prime of s. What does that mean? That means that if we have this component of the gradient of the cost function with respect to z here, we multiply this by the derivative of the h function at point s, the same point s that we had here. And what we get now is the gradient of the cost function with respect to s. Now, here's the trick. If we had a chain of those h functions, we could keep propagating this gradient backwards by just multiplying by the derivatives of all those h functions going backwards. And that's why it's called back propagation, okay? So it's just a practical application of the chain rule, right? And if you want to kind of convince yourself of this, you can run through this idea of perturbation. Like, you know, if I twiddle s by some value, it's going to twiddle z by some value equal to ds times h prime of s, basically the slope at s, right? So dz equals h prime of s times ds, okay? And then I'm going to have to multiply this by dc over dz. So I rearrange the terms and I get immediately that formula: dc over ds equals dc over dz times h prime of s. Okay, so we had another element in our multilayer net, which was the linear sum. And there, it's just a little bit more complicated, but not really.
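As a sanity check on the step dc/ds = dc/dz · h'(s), here is a small numerical "wiggling" experiment; the cost c_of_z is an arbitrary stand-in of my choosing for the rest of the network plus the cost:

```python
import math

def h(s):
    # sigmoid activation, as in the slide's example
    return 1.0 / (1.0 + math.exp(-s))

def h_prime(s):
    return h(s) * (1.0 - h(s))

def c_of_z(z):
    # stand-in for "the rest of the network plus the cost"
    return (z - 0.7) ** 2

s = 0.5
dc_dz = 2.0 * (h(s) - 0.7)       # assumed known, handed to us from above
dc_ds = dc_dz * h_prime(s)       # the backprop step: multiply by h'(s)

# "wiggle" s a little and watch how much c wiggles
eps = 1e-6
numeric = (c_of_z(h(s + eps)) - c_of_z(h(s - eps))) / (2 * eps)
print(abs(dc_ds - numeric) < 1e-6)  # True
```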
Okay, so for one particular variable z here, we would like to compute the partial derivative of our cost function with respect to that z, okay? And we're going to assume that we know the partial derivative of c with respect to each of those s's, okay? The weighted sums at the next layer that z is going into, okay? So z only influences c through those s's, okay? So presumably, by knowing how each of those s's influences c, multiplying by how z influences each of the s's, and summing up, we're going to get the influence of z over c, right? And that's the basic idea. Okay, so here's what we're going to do. Let's say we perturb z by dz. This is going to perturb s0 by dz times w0, okay? We multiply z by w0. So the derivative of this linear operation is the coefficient itself, right? So here, the perturbation, which is ds0, is equal to dz times w0, okay? And now in turn, this is going to modify c, and we're going to multiply this quantity by dc over ds0 to get the dc, if you want, okay? Now, whenever we perturb z, it's not going to perturb just s0. It's also going to perturb s1 and s2. And to see the effect on c, we're going to have to take the effect of the perturbation on each of the s's and then sum them up to see the overall effect on c. So this is written here on the left. The perturbation of c is equal to the perturbation of s0 multiplied by the partial derivative of c with respect to s0, plus the perturbation of s1 multiplied by the partial derivative of c with respect to s1, plus the same thing for s2, okay? So this is the fact that, you know, we need to take into account all the perturbations that z may cause. And so I can just write down that very simple thing: because ds0 is equal to w0 times dz, and, you know, ds2 is w2 times dz, I can plug this in there and just write dc over dz equals dc over ds0, which I assume is known, times w0, plus dc over ds1 times w1, plus dc over ds2 times w2, okay?
If I want to represent this operation graphically, this is shown on the right here. I have dc over ds0, dc over ds1, dc over ds2, which I assume are known or given to me somehow. I multiply dc over ds0 by w0, dc over ds1 by w1, and dc over ds2 by w2. I sum them up and that gives me dc over dz, okay? It's just the formula here. Okay, so here's the cool trick about back propagation through a linear module that computes weighted sums. You take the same weights and you still compute weighted sums with those weights, but you use the weights backwards, okay? So whenever you had a unit that was sending its output to multiple units through weights, you take the gradients of the cost with respect to all those weighted sums and you compute their weighted sum backwards, using the weights backwards, to get the gradient with respect to the state of the unit at the bottom. And you can do this for all the units, okay? So it's super simple. Now, if you were to write a program to do backprop for classical neural nets in Python, it would take like half a page. It's very, very simple. It's one function to compute weighted sums going forward in the right order and apply the nonlinearity, and another function to compute weighted sums going backward and multiply by the derivative of the nonlinearity at every step, right? It's incredibly simple. What's surprising is that it took so long for people to realize this was so useful, maybe because it was too simple. Okay, so it's useful to write this in matrix form. So really, the way you should think about a neural net of this type is that each state inside the network is a vector. It could be a multidimensional array, right? But let's think of it just as a vector. A linear operation is just going to multiply this vector by a matrix. And each row of the matrix contains all the weights that are used to compute a particular weighted sum for a particular unit, okay?
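The "half a page of Python" claim is easy to substantiate. Here is a sketch, with NumPy and function names of my own choosing, of exactly the two functions described: weighted sums going forward with the nonlinearity, and weighted sums going backward through the transposed weights, multiplying by the derivative of the nonlinearity at every step:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, weights):
    # weighted sums going forward, ReLU after every layer except the last
    zs, ss = [x], []
    for i, W in enumerate(weights):
        s = W @ zs[-1]
        ss.append(s)
        zs.append(np.maximum(s, 0.0) if i < len(weights) - 1 else s)
    return zs, ss

def backward(weights, zs, ss, dc_dout):
    # same weights, used transposed, times the nonlinearity's derivative
    grads, dc_ds = [], dc_dout
    for i in reversed(range(len(weights))):
        if i < len(weights) - 1:
            dc_ds = dc_ds * (ss[i] > 0)       # ReLU derivative
        grads.append(np.outer(dc_ds, zs[i]))  # gradient w.r.t. this weight matrix
        dc_ds = weights[i].T @ dc_ds          # gradient w.r.t. the layer below
    return list(reversed(grads))

# check one weight's gradient against a finite difference, for c = 0.5 ||output||^2
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
x = rng.normal(size=3)
zs, ss = forward(x, weights)
grads = backward(weights, zs, ss, zs[-1])     # dc/d(output) = output

eps = 1e-6
weights[0][0, 0] += eps
cp = 0.5 * np.sum(forward(x, weights)[0][-1] ** 2)
weights[0][0, 0] -= 2 * eps
cm = 0.5 * np.sum(forward(x, weights)[0][-1] ** 2)
weights[0][0, 0] += eps
print(abs((cp - cm) / (2 * eps) - grads[0][0, 0]) < 1e-4)
```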
Okay, so we multiply this by this matrix. So this dimension has to be equal to that dimension, which is not really well depicted here, actually. There's a question from the previous slide: you wrote ds0 — what is s differentiated with respect to? So there is a ds. What is ds, basically? ds0, you mean? Yeah. Okay, ds0 is a perturbation of s0, okay? An infinitely small perturbation of s0. It doesn't matter what it is, okay? And what we're saying here is that if you have an infinitely small perturbation of s0, and you multiply this perturbation by the partial derivative of c with respect to s0, okay, you get the perturbation of c, except that it's the perturbation of c that corresponds to this perturbation of s0, right? But we're not interested in just the perturbation of s0. We're also interested in the perturbations of s1 and s2. So the overall perturbation of c would be the sum of the perturbations of s0, s1, and s2, multiplied by their corresponding partial derivatives of c with respect to each of them, okay? It's a virtual thing, right? It's not an existing thing you're going to manipulate. Just imagine that there is some perturbation of s0 here. Okay, this is going to perturb c by some value, and that value is going to be the perturbation of s0 multiplied by the partial derivative of c with respect to s0, okay? And then if you perturb s1 simultaneously, you're also going to cause a perturbation of c. If you perturb s2 simultaneously, you're also going to cause a perturbation of c. The overall perturbation of c will be the sum of those perturbations, and that is given by this expression here. Now those infinitely small quantities, ds, dc, et cetera — think of them as numbers. You can do algebra with them. You can divide one by the other. You know, you can do stuff like that. So now you say, you know, what is ds0 equal to? If I tweak z by a quantity dz, it's going in turn to modify s0 by ds0, okay? And what is the quantity by which s0 is going to be tweaked?
If I tweak z by dz, because s0 is the result of computing the product of z by w0, then the perturbation is also going to be multiplied by w0, right? So the ds0 corresponding to a particular dz is going to be equal to dz times w0. And this is what's expressed here, okay? ds0 equals w0 dz. Okay, now if I take this expression for ds0 and I insert it here in this formula, okay, I get dc equals w0 times dz times dc over ds0, plus the same thing for 1, plus the same thing for 2. And I'm going to take the dz and pass it to the other side. I'm going to divide both sides by dz. So now I get dc over dz equals — the dz doesn't appear anymore because it's been put underneath here — w0 times dc over ds0, plus w1 times dc over ds1, et cetera. Okay, it's just simple algebra. It's differential calculus, basically. Right, so it's better to write this in matrix form. So really, when you're computing — if I go back a few slides — this is really kind of a matrix of all the weights that are kind of upstream of the zj's. So you can align the zj's as a vector, maybe only the zj's that have nonzero terms wij in w. And then you can write those w's as a matrix, and this is just a matrix-vector product, okay? So this is the way this would be written. You have a vector, you multiply it by a matrix, you get a new vector, pass that through nonlinearities, ReLUs, multiply that by a matrix, et cetera. Right? So symbolically, you can write a simple neural net this way. We have linear blocks, okay, linear functional blocks, which basically take the previous state and multiply it by a matrix, okay? So you have a state here, z1, multiplied by a matrix, you get W1 z1, and that gives you the vector of weighted sums, s2. Okay, then you take that, pass it through the nonlinear functions, each component individually, and that gives you z2, right? So that's a three-layer neural net: first weight matrix, nonlinearity, second weight matrix, nonlinearity, third weight matrix, and this is the output.
There are two hidden layers, three layers of weights. Okay, the reason for writing it this way is that this is symbolically the easiest way to understand what backprop really does. And in fact, it corresponds also to the way we define neural nets and run them in deep learning frameworks like PyTorch. So this is the sort of object-oriented version of defining a neural net in PyTorch. We're going to use predefined classes: the Linear class, which basically multiplies a vector by a matrix — it also has biases, but let's not talk about this just now — and another class, which is the ReLU function, which takes a vector or multi-dimensional array and applies the nonlinear function to every component separately. Okay, so this is a little piece of Python program that uses Torch; we import torch. We make an image, which is 10 pixels by 20 pixels with three components for color. We compute the size of it, and we're going to plug in a neural net where the number of inputs is the number of components of our image. So in this case, that would be 600. And we're going to define a class. The class is going to define a neural net, and that's pretty much all we need to do here. So we define our network architecture. It's a subclass of nn.Module, which is a predefined class. It's got a constructor here that will take the sizes of the internal layers that we want: the size of the input, the size of S1 and Z1, the size of S2 and Z2, and the size of S3. We call the parent class initializer and then we just create three modules that are all linear modules, and we need to store them somewhere because they have internal parameters. So we're going to have three slots in our object — m0, m1, m2: module 0, module 1, module 2. And each of them is going to be an instance of the class nn.Linear with two sizes, the input size and the output size. So the first module has input size D0, output size D1, et cetera.
And those classes — since there is a capital L, that means it's an object, and there are parameters inside that item there. So for example, the relu doesn't have a capital because it doesn't have internal parameters. It's not a trainable module; it's just a function. Whereas those things with capitals have internal parameters, the weight matrices inside of them. So now we define a forward function, which basically computes the output from the input. And the first thing we do is take the input, which may be a multi-dimensional array, and flatten it. We flatten it using this idiomatic expression here in PyTorch. And then we apply the first module to X. We put the result in S1, which is a temporary variable, a local variable. Then we apply the ReLU to S1, put the result in Z1, then apply the second linear layer, put the result in S2, apply the ReLU again, put the result in Z2, and then the last linear layer, put the result in S3 and return S3. And there is a typo. So the second line should have been S1 equals self.m0 of Z0, right? Z0 here, yes. Yeah, yeah, yeah. I know. Yeah, this is something that was going to be fixed, which I didn't fix. I know. This is Z0. Thanks for reminding me of this. Okay, but you'll see examples. I mean, I'll show you actual examples of this and you'll be able to run them yourself. That's all you need to do. You don't need to write how you compute the backprop, how you propagate the gradients. You could write it, and it would be as simple as forward. You could write a backward function, and it would basically multiply by the matrices going backwards. But you don't need to do this because PyTorch does it automatically for you. When you define the forward function, it knows what modules you've called in what order and what the dependencies are between the variables, so it can compute the gradients backwards, and you don't need to worry about it. That's the magic of PyTorch, if you want.
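For reference, here is a reconstruction of the class being described, with the slide's Z0 typo fixed; the layer sizes in the usage lines at the bottom are arbitrary choices of mine, not from the lecture:

```python
import torch
from torch import nn

class MyNet(nn.Module):
    # three linear modules with ReLUs in between, as described on the slide
    def __init__(self, d0, d1, d2, d3):
        super().__init__()
        self.m0 = nn.Linear(d0, d1)
        self.m1 = nn.Linear(d1, d2)
        self.m2 = nn.Linear(d2, d3)

    def forward(self, x):
        z0 = x.view(-1)            # flatten the multi-dimensional input
        s1 = self.m0(z0)           # this is the line the typo was about
        z1 = torch.relu(s1)
        s2 = self.m1(z1)
        z2 = torch.relu(s2)
        s3 = self.m2(z2)
        return s3

image = torch.randn(3, 10, 20)     # 3 color components, 10 by 20 pixels
model = MyNet(image.numel(), 64, 32, 5)
out = model(image)
print(out.shape)                   # torch.Size([5])
```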
That's called automatic differentiation. And this is a particular form of automatic differentiation. There's another way to write functions in PyTorch that is more functional, so you're not using modules with internal parameters; you're just calling functions one after the other. PyTorch has a mechanism by which it can compute the gradient of any function you define with respect to whatever parameters you want. Actually, these big guys with the capital L, like nn.Linear — inside there is going to be a lowercase linear, which is the functional part, which performs the matrix multiplication between the weights stored inside the object with the capital L and the input. So every capital-letter object will have the functional version inside. So one can decide to use either the functional form directly or use these encapsulated versions, which are more convenient to just use, right? Right. So in the end you can create an instance of this class. You can create multiple instances, but you create one here, which is called myNet, and give it the sizes you want. And then to apply this to a particular image, you just do output equals model of image. It's as simple as that. Okay. So this is your first neural net, and it does all the backprop automatically. But you need to understand how backprop works, right? It's not because PyTorch does it for you that you can forget about how you actually compute the gradient of the function, because it's inevitable that at some point you're going to want to assemble a neural net with a module that does not pre-exist, and you're going to have to write your own backprop function. So to do this — if you want to create a new module with some complex operation that does not pre-exist in PyTorch — then you do something like this. You define a class, but you write your own backward function, basically. Okay.
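As a sketch of what such a class looks like, here is a hand-written backward for the ReLU using torch.autograd.Function. Of course the ReLU does pre-exist in PyTorch; it just makes a compact example of the pattern:

```python
import torch

class MyReLU(torch.autograd.Function):
    # a module-like operation with a hand-written backward, as you would do
    # for an operation that does not pre-exist in PyTorch
    @staticmethod
    def forward(ctx, s):
        ctx.save_for_backward(s)
        return s.clamp(min=0.0)

    @staticmethod
    def backward(ctx, dc_dz):
        (s,) = ctx.saved_tensors
        # multiply the incoming gradient by h'(s): 1 where s > 0, else 0
        return dc_dz * (s > 0).to(dc_dz.dtype)

s = torch.tensor([-1.0, 2.0], requires_grad=True)
z = MyReLU.apply(s)
z.sum().backward()
print(s.grad)   # tensor([0., 1.])
```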
So let's go one step up in terms of abstraction and write this in a slightly more generic, mathematical form, if you want. Okay. So let's say we have a cost function here, and we want to compute the gradient of this cost function with respect to a particular vector in the system, ZF. It could be a parameter or it could be a state; it doesn't matter. Okay, some state inside. And we have the chain rule, and the chain rule is nothing more than what I explained earlier: dC over dZF is equal to dC over dZG times dZG over dZF, as long as C is only influenced by ZF through ZG. If there's no other way for ZF to influence C than to go through ZG, then this formula is correct. Okay. And of course the identity is trivial, because it's just a simplification by this infinitesimal vector quantity dZG. Okay. So let's say ZG is a vector of size dG by 1 — so this means a column vector, okay — and ZF is a column vector of size dF. If we want to write down the correct dimensions of this, you know, we get something a little complicated. Okay. So first of all, this object here, dZG over dZF — well, let me start with this one. Okay. This one, dC over dZG, that's a gradient vector. Okay. ZG is a vector, dC over dZG is a gradient vector, and it's the same size as ZG. But by convention, we actually write it as a row vector. Okay. So this thing here is going to be a row vector whose size is the same as ZG, but it's going to be horizontal instead of vertical. Okay. This object here is something more complicated. It's actually a matrix. Why is it a matrix? It's because it's the derivative of a vector with respect to another vector. Okay. So let's look at this diagram here on the right. We have a function G. It takes ZF as an input and it produces ZG as an output. And this quantity here, dZG over dZF, is the information about the derivative of that module.
There are a lot of terms to capture, because there are a lot of ways in which every single output, every component of ZG, can be influenced by every component of ZF. Right? So for every pair of components of ZG and ZF, there is a derivative term which indicates by how much that component of ZG would be perturbed if the corresponding component of ZF were perturbed by an infinitesimal quantity. Right? We have that for every pair of components of ZG and ZF. As a result, this is a matrix whose number of rows is the size of ZG and whose number of columns is the size of ZF. And each term in this matrix is one partial derivative term. So in this whole matrix here, if I take the component ij, it's the partial derivative of the ith output of that module, the ith component of ZG, with respect to the jth component of ZF. Okay? So what we get here is a row vector equal to a row vector multiplied by a matrix, and the sizes work out so that they're compatible with each other. Okay, so what is back propagation now? Back propagation is this formula. Okay? It says: if you have the gradient of some cost function with respect to some variable, and you know the dependency of that variable with respect to another variable, you multiply this gradient vector by that Jacobian matrix and you get the gradient vector with respect to the second variable. So graphically, here on the right, if I have the gradient of the cost with respect to ZG and I want to compute the gradient of C with respect to ZF, which is dC over dZF, I only need to take that vector, which is a row vector, multiply it by the Jacobian matrix dZG over dZF, and I get dC over dZF. Okay? Someone is objecting here. Isn't a summation missing here? Which summation? A summation over all the components of these partial multiplications. Here? Yeah. Well, this is a vector-matrix product. There are a lot of sums going on here, because when you compute the product with this matrix, you're going to have a lot of sums, right? Yep, so it's hidden, right?
Yeah, the sums are hidden inside of this vector-matrix product. You can take a specific example. Let's imagine that this G function is just a matrix multiplication: we just multiply ZF by a matrix W. So we have a linear operation. The Jacobian matrix of multiplication by a matrix is that matrix itself, so backpropagating through it means multiplying by the transpose of that matrix. So what we're going to do here is take this vector, multiply it by the transpose of the W matrix, and what we get is that vector. Okay? And it all makes sense, right? The sizes make sense. This matrix here is the transpose of the weight matrix, which, of course, has the transposed shape. We pre-multiply it by the row vector of the gradient from the layer above, and we get the gradient with respect to the layer below. Okay? So backpropagating through a linear module just means multiplying by the transpose of the matrix used by that module. And it's just a generalized form of what I explained earlier, backpropagating through the weights. But it's less intuitive, right? Okay, so we're going to be able to do backpropagation by computing gradients all the way through, propagating backwards. But this module really has two inputs. It has an input which is ZF, and the other one is WG, the weight matrix, the parameter vector that is used inside of this module. So there is a second Jacobian matrix, of ZG with respect to the terms of this weight parameter. Okay? And to compute the gradient of the cost function with respect to those weight parameters, I need to multiply this gradient vector by the Jacobian matrix of that block with respect to its weights. And it's not the same as the Jacobian matrix with respect to the input. It's a different Jacobian matrix. I'll come back to this in a second.
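This can be checked directly against PyTorch's autograd: for zg = W zf, hand-computed backprop through the transpose should match the gradient autograd computes. The cost built on top of zg is an arbitrary choice of mine:

```python
import torch

torch.manual_seed(0)
W = torch.randn(3, 4)
zf = torch.randn(4, requires_grad=True)

zg = W @ zf                      # the linear module: zg = W zf
c = (zg ** 2).sum()              # some cost built on top of zg
c.backward()                     # autograd's answer ends up in zf.grad

dc_dzg = 2 * zg.detach()         # gradient coming from the layer above
manual = W.t() @ dc_dzg          # backprop rule: multiply by the transpose
print(torch.allclose(zf.grad, manual))  # True
```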
So to do backprop again: if we have a vector of gradients of some cost with respect to a state, and we have a function of one or several variables, we multiply this gradient by the Jacobian matrix of this block with respect to each of its inputs, and that gives us the gradient with respect to each of the inputs. And that's what's expressed here. So this is the backpropagation of states in a layer-wise, classical type of neural net. dc over dzk, where zk is the state of layer k, is dc over dzk+1 — which is the gradient of the cost with respect to the layer above — times the Jacobian matrix of the state of layer k+1 with respect to the state of layer k. Right? Now we assume dc over dzk+1 is known, and we just need to multiply by the Jacobian matrix of the function that links zk to zk+1. That function is used to compute zk+1 from zk, and it may also be a function of some parameters inside, but here that Jacobian is the matrix of partial derivatives of f, whose output is zk+1, with respect to each of the components of zk. Okay, so that's the first rule of backpropagation, and it's a recursive rule, so you can start from the top. You can start initially with dc over dc, which is one, okay, which is why I have this one here on top. Okay, and then you just keep multiplying by the Jacobian matrices all the way down, and backpropagate gradients, and now you get gradients with respect to all the states. You also want the gradients with respect to the weights, because that's what you need to do learning. So what you can write is the same chain rule: dc over dwk is equal to dc over dzk+1, which we assume is known, times dzk+1 over dwk, right? And the dependency between zk+1 and wk is the function f applied to zk and wk, so you can differentiate the output of that function with respect to wk, and that gives you another Jacobian matrix.
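For a linear module s = W z, that second Jacobian works out so that the gradient with respect to W is the outer product of the upstream gradient and the input state. A quick check against autograd, with an arbitrary cost of my choosing:

```python
import torch

torch.manual_seed(0)
W = torch.randn(3, 4, requires_grad=True)
z = torch.randn(4)

s = W @ z                        # the linear module's weighted sums
c = (s ** 2).sum()               # some cost built on top of s
c.backward()                     # autograd's answer ends up in W.grad

dc_ds = 2 * s.detach()           # gradient coming from above
manual = torch.outer(dc_ds, z)   # outer product of upstream gradient and input
print(torch.allclose(W.grad, manual))  # True
```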
And so with those two formulas, you can backpropagate through just about anything. Really, what goes on inside PyTorch, and inside most of those frameworks — TensorFlow and JAX and whatever — is something like this. So let's take a very simple diagram here, where you have an input, a parameterized function that computes an output, that output goes to a cost function, and that cost function measures the discrepancy between the output of the system and the desired output. So you can write this function as c of g of w — I didn't put the x here, just for clarity. And for the derivative of this, again you apply the chain rule, or you can write it with partial derivatives this way, and then expand the dependency of the output with respect to the parameters as the Jacobian matrix of g with respect to w. If w is a scalar, then this is just a partial derivative. Okay, now you can express this as a compute graph. So you can say: how am I going to compute dc over dw? What I'm going to have to do is take the value one, which is the derivative of c with respect to itself — basically the loss with respect to itself. I'm going to multiply this by the derivative of the cost with respect to y bar, okay, and that gives me dc over dy bar, obviously, okay? This is the same as this, because I just multiply by one.
Then I multiply this by the Jacobian matrix of g with respect to w — which is just a derivative if w is a scalar — that would depend on x, and I get dc over dw. So this is a so-called compute graph, right? This is a way of organizing operations to compute the gradient. Essentially, there is an automatic way of transforming a compute graph of this type into a compute graph of this type that computes the gradient automatically, and this is the magic that happens in the automatic differentiation inside PyTorch, TensorFlow, and other systems. Some systems are pretty smart about this, in the sense that those functions can be fairly complicated; they can themselves involve computing derivatives and stuff, and they can involve dynamic computation, where the graph of computation depends on the data — and PyTorch actually handles this properly. I'm not going to go through all the details of this, but this is kind of a way of reminding you what the dimensions of all those things are. Right, so if y is a column vector of size m, and w is a column vector of size n, then this is a row vector of size n, this is a row vector of size m, the Jacobian matrix is of size m by n, and all of this works out. The way we're going to build neural nets — and I'll come back to this in a subsequent lecture — is that we are going to have at our disposal a large collection of basic modules, which we're going to be able to arrange in more or less complex graphs as a way to build the architecture of a learning system. Okay, so either we're going to write a class, or we're going to write a program that runs the forward pass, and this program is going to be composed of basic mathematical operations — addition, subtraction of tensors or multidimensional arrays, other types of scalar operations — or the application of one of the predefined complex parameterized functions, like a linear module, a ReLU, or things like that. And we have at our disposal a large library of such modules, which are things that people have come up with over the years, basic modules
that are used in a lot of applications. The basic things that we've seen so far are ReLUs; there are other nonlinear functions like sigmoids and variations of these — there's a large collection of them. And then we have cost functions like square error, cross entropy, hinge loss, ranking loss, and blah blah blah, which I'm not going to go through now, but we'll talk about this later. The nice thing about this formalism is that, as I said before, you can construct a deep learning system by assembling those modules in any kind of arrangement you want, as long as there are no loops in the connection graph — so as long as you can come up with a partial order on those modules that will ensure that they are computed in the proper way. But there is a way to handle loops, and that's called recurrent nets; we'll talk about this later. So here are a few practical tricks for playing with neural nets, and you're going to do that soon enough, perhaps even tomorrow. These are kind of a bit of the black art of deep learning. A lot of it is implemented already in things like PyTorch if you use standard tools, but some of it is more of the sort of oral culture, if you want, of the deep learning community; you can find this in papers, but it's a little difficult to find sometimes. So: neural nets use ReLUs as the main nonlinearity, this sort of half-wave rectifier. The hyperbolic tangent, which is a similar function, and the logistic function, which is also a similar function, are used, but not as much, not nearly as much. You need to initialize the weights properly. If you have a neural net and you initialize the weights to zero, it never takes off; it will never learn; the gradients will be zero all the time. And the reason is because when you backpropagate the gradient, you multiply by the transpose of the weight matrix; if that weight matrix is zero, your gradient is zero. So if you start with all the weights equal to zero, you never take off. And someone
asked the question about saddle points before — zero is a saddle point, and so if you start at this saddle point, you never get out of it. So you have to break the symmetry in the system; you have to initialize the weights to small random values. They don't strictly need to be random, but it works fine if they are random. And the way you initialize is actually quite important, so there are all kinds of tricks to initialize things properly. One of the tricks was invented by my friend Léon Bottou about 30 years ago — even more than that, 34 years ago. Unfortunately, now it's called something different; it's called the Kaiming trick, but it's the same thing. And it consists of initializing the weights to random values in such a way that if a unit has many inputs, the weights are smaller than if it has few inputs. And the reason for this is that you want the weighted sum to have some reasonable value if the input variables have some reasonable value — let's say variance one or something like this. If you're computing a weighted sum of them, the size of the weighted sum is going to grow like the square root of the number of inputs, and so you want to set the weights to something like the inverse square root of the number of inputs if you want the weighted sum to be about the same size as each of the inputs. So that's built into PyTorch; you can call this initialization procedure. What's the exact name of it?
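In PyTorch this scaling is available through nn.init; a sketch checking the inverse-square-root-of-fan-in behavior just described, assuming the Kaiming initializer (with the extra sqrt(2) gain its ReLU version uses) is the procedure meant:

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(400, 100)              # 400 inputs per unit (the fan-in)
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')

fan_in = layer.weight.shape[1]
expected_std = (2.0 / fan_in) ** 0.5     # gain sqrt(2) over sqrt(fan-in)
print(abs(layer.weight.std().item() - expected_std) < 0.005)  # True
```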
Alfredo: I can't remember. So there is the Kaiming, then there is the Xavier, and then there is also yours; we have them in PyTorch. Yeah, they're slightly different, but they do the same thing, more or less. Yeah, the Xavier Glorot version, yeah; this one divides by the fan-in and fan-out. There are various loss functions. So I haven't talked yet about what the cross entropy loss is, but cross entropy loss is a particular cost that's used for classification; I'll probably talk about this next week, or at the end of this lecture if I have some time. This is for classification. As I said, we use stochastic gradient descent on mini batches, and mini batches only because the hardware that we have needs mini batches to perform properly; if we had different hardware, we would use mini batch size 1. As I said before, we need to shuffle the training samples. So if someone gives you a training set with all the examples of category 1 first, then all the examples of category 2, all the examples of category 3, et cetera, keeping this order is not going to work. You have to shuffle the samples so that you basically get samples from all the categories within kind of a small subset, if you want. There is an objection here: for the stochastic gradient, isn't Adam better?
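Whatever the answer, both live in the same place: PyTorch's optim package exposes SGD (with schedulers for the step size) and Adam through the same interface. A minimal sketch, where the little linear model and the data are stand-ins of mine:

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(10, 2)

# plain SGD with a scalar step size eta, decreased according to a schedule
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.5)

# Adam instead: the effective eta is adapted per weight from running
# estimates of the gradients (swap it in for opt below to use it)
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(4, 10), torch.randn(4, 2)
loss = ((model(x) - y) ** 2).mean()
loss.backward()
opt.step()                       # one SGD step
sched.step()                     # advance the eta schedule
print(sched.get_last_lr())       # still [0.1]; halves every 10 scheduler steps
```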
Alright, okay, there are a lot of variants of stochastic gradient; they are all stochastic gradient methods. In fact, people in optimization say this should not be called stochastic gradient descent, because it is not a descent algorithm: stochastic gradient sometimes goes uphill because of the noise. So people who want to be correct about this say stochastic gradient optimization, not stochastic gradient descent. That's the first thing. Stochastic gradient optimization, or stochastic gradient descent, SGD, is a special case of gradient-based optimization, and the specification of it says you have to have a step size eta. But nobody tells you how you set this step size eta, and nobody tells you whether this step size is a scalar, a diagonal matrix, or a full matrix. So there are variations of SGD in which that eta is changed all the time, for every sample or every batch. In SGD most of the time this eta is decreased according to a schedule, and there are a bunch of standard schedules implemented in PyTorch. In techniques like Adam, the eta is actually a diagonal matrix, and the terms in that diagonal matrix are changed all the time; they are computed based on some estimate of the curvature of the cost function. There are a lot of methods to do this; they are all SGD-type methods. Adam is an SGD method with a special type of eta. So in the optim package in PyTorch there is a whole bunch of those methods; there is going to be a whole lecture on optimization, so don't worry about it. Next: normalize the input variables to zero mean and unit variance. This is a very important point: gradient-based optimization methods, when you have weighted sums, linear operations, tend to be very sensitive to how the data is prepared. So if you have two variables with very widely different variances, one of them varies between minus one and plus one, the other one varies between minus 100 and plus 100, the system will basically not pay
attention to the one that varies between plus one and minus one; it will only pay attention to the big one, and this may be good or this may be bad. Furthermore, the learning rate you are going to have to use, the eta parameter, the step size, is going to have to be set to a relatively small value to prevent the weights that look at this highly variable input from diverging: the gradients are going to be very large, because the gradients are basically proportional to the size of the input, or even to the variance of the input. So if you don't want your system to diverge, you are going to have to turn down the learning rate if the input variance is large. If the input variables are all shifted, all between, let's say, 99 and 101 instead of minus one and one, then again it is very difficult for a gradient-based algorithm that uses weighted sums to figure those things out. I will talk about this more formally later; right now just remember the trick that you need to normalize your input. So basically, take every variable of your input and subtract the mean, where you compute the mean of each variable over the training set. Let's say your training set is a set of images, the images are 100 by 100 pixels, let's say they are grayscale, so you get 10,000 variables, and let's say you have a million samples. You are going to take each of those 10,000 variables, compute its mean over the training set, compute its standard deviation over the entire training set, and the samples you are going to show to your system are samples where you have subtracted the mean from each of the 10,000 pixels and divided the resulting values by the standard deviation that you computed. So now what you have is a bunch of variables that are all zero mean and all standard deviation equal to one, and that makes your neural net happy; that makes your optimization algorithm happy. Actually we have a question: you keep repeating SGD-type methods, gradient-based methods, because
there are other types of methods? Yes, there are gradient-free methods. A gradient-free method is a method where you do not assume that the function you are trying to optimize is differentiable, or even continuous, with respect to the parameters, for several possible reasons. Perhaps it's a function that looks like a golf course: it's flat, and maybe it's got steps, and the local gradient information does not give you any information as to where you should go to find the minimum. It could be that the function is essentially discrete; it's not a function of continuous variables but of discrete variables. So for example: am I going to win this chess game? The variable you can manipulate is the position on the board; you can't compute a gradient of a score with respect to a position in a chess game, it's a discrete variable. Another example is when the cost function is not something you can compute; you don't actually know the cost function. The only thing you can do is give an input to the cost function, but you don't know the function itself; it's not a program on a computer, so you can't backpropagate gradients through it. A good example of this is the real world. You can think of the real world as a cost function: you learn to ride a bike, you ride your bike, and at some point you fall. The real world does not give you a gradient of that cost function, which is how much you hurt, with respect to your actions. The only thing you can do is try something else and see if you get the same result or not. So what do you do in that case? Basically, your cost function is now a black box, so you cannot propagate gradients through it. What you have to do is estimate the gradient by perturbing what you feed to that black box: you try something, a perturbation of your input to the black box, you see what resulting perturbation occurs on the output of the black box, the cost, and now you can estimate whether this modification
improved or made the result worse. So essentially this is like the optimization problem I was telling you about earlier. With a gradient-based algorithm, you are lost in the mountains in a fog: you can't see anything, but you can estimate the direction of steepest descent; you just look around, tell which direction is steepest, and take a step in that direction. But what if you can't see at all? Then, to estimate in which direction the function goes down, you have to actually take a step: you take a step, come back, take a step in the other direction, come back, and then maybe you get an estimate of where the steepest descent is, and you can take a step in that direction. This is estimating the gradient by perturbation, instead of by analytic means, backpropagating gradients, computing Jacobians or partial derivatives. And then there is a second level of complexity: let's imagine that the landscape you are in is basically flat everywhere, except once in a while there is a step. Taking a small step in one direction will not give you any information about which direction you have to go. There you have to use other techniques: taking bigger steps, walking for a while and seeing if you fall down a step or go up a step. Maybe you can multiply yourself into 10,000 copies of yourself and explore the surroundings, and then whenever one of them finds a hole, everyone can come there. All those methods are called gradient-free optimization algorithms; sometimes they are called zeroth-order methods. Why zeroth order?
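The take-a-step-and-come-back idea is just a finite-difference estimate of the gradient; here is a minimal sketch with a made-up black-box cost whose true gradient we happen to know, so we can check the estimate:

```python
import numpy as np

# Pretend this is a black box: we can evaluate it, but not differentiate it.
def black_box_cost(w):
    return (w[0] - 3.0) ** 2 + 2.0 * (w[1] + 1.0) ** 2

def estimate_gradient(f, w, eps=1e-5):
    # Perturb each coordinate forward and backward, watch how the cost moves.
    g = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        g[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return g

w = np.array([0.0, 0.0])
g_est = estimate_gradient(black_box_cost, w)
print(g_est)   # analytic gradient is [2(w0-3), 4(w1+1)] = [-6, 4]
```

Note the cost per estimate: two function evaluations per parameter, which is exactly why these methods do not scale to networks with millions of weights.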
Because first order is when you can compute the derivative; zeroth order is when you cannot compute the derivative, you can only compute the function, or get a value for the function. And then you have second-order methods, which compute not just the first derivative but also the second derivative; they are also gradient-based, because they need the first derivative as well, and they accelerate the process by also computing the second derivative. Adam is a very simplified form of second-order method; it's not a second-order method, but it has a hint of second order. Another method with a hint of second order is what is called conjugate gradient, and there is another class of methods called quasi-Newton methods, which also use curvature information to accelerate things. Many of those are not actually practical for neural net training, but there are some forms that are. If you are interested in zeroth-order optimization, there is an open-source library which originated at Facebook AI Research in Paris, started by Olivier Teytaud, but it's really a community effort, with a lot of contributors. It's called Nevergrad, and it implements a very large number of different optimization algorithms that do not assume you have access to the gradient. There are genetic algorithms or evolutionary methods, there is particle swarm optimization, there are all kinds of tricks, a whole catalog of those things. Sometimes it's unavoidable, you have to use them, because you don't know the cost function. A very common situation where you have to use those things is reinforcement learning. Reinforcement learning is basically a situation where you don't tell the system the correct answer, you only tell the system whether the answer was good or bad: you give the value of the cost, but you don't tell the machine what the cost
function is, okay? And so the machine cannot actually compute the gradient of the cost, and it has to use something like a zeroth-order method. So what you can do is compute the gradient of the overall cost function with respect to the parameters by perturbing the parameters. Or what you can do is compute the gradient of the cost function with respect to the output of your neural net using perturbation, and once you have this estimate, you backpropagate the gradient through your network using regular backprop. That's a combination of estimating the gradient through perturbation for the cost function, because you don't know it, and then backpropagating from there. This is basically the technique that was used by the DeepMind people in the first deep Q-learning type methods. Back to the normalization: do we normalize the entire data set or each batch? It's equivalent. You normalize each sample, but the quantities you're computing, the mean and the standard deviation, are computed over the entire training set. In fact, most of the time you don't even need to do it over the entire training set, because the mean and standard deviation converge pretty fast, but say you do it over the entire training set: what you get is two constant numbers for each component of your input, a number that you subtract and a number by which you divide. It's a fixed preprocessing: for a given training set you'll have a fixed mean and standard deviation vector. But maybe we can connect this to the other tool, right, the other module, batch normalization? Okay, okay, we haven't talked about that yet. Yeah, I'm saying that we can perhaps extend this normalization bit to both sides, the whole data set and the batch itself. Okay, yes, yes. So, again, there's going to be a whole lecture on this, but for the same reason it's good to have input variables that are zero mean and unit variance, it's also good for the
state variables inside the network to have zero mean and unit variance, and so people have come up with various ways of normalizing the variables inside the network so that they approach zero mean and unit variance. There are many ways to do this; they have cute names like batch normalization and layer normalization, and the idea goes back a very long time; batch norm is a more recent incarnation of it. Let's see, what was next... scheduling the decrease of the learning rate, yeah. As it turns out, for reasons that are still not completely understood, to learn fast initially you need a learning rate of a particular size, but to get good results in the end you need to decrease the learning rate, to let the system settle inside a minimum. There are various semi-valid theoretical explanations for this, but experimentally it's clear you need to do it, and again there are schedules pre-programmed in PyTorch for this. Next: use a bit of L1 or L2 regularization on the weights, or a combination. After you train your system for a few epochs, you might want to prune it, eliminate the weights that are useless, make sure that the weights have some kind of minimum size, and what you do is add a term to the cost function that shrinks the weights a little at every iteration. You might know what L2 and L1 regularization mean if you've taken a class in machine learning; for logistic regression and things like that it's very common. L2 regularization is sometimes called weight decay. These again are pre-programmed in PyTorch. A trick that a lot of people use for large neural nets is a trick called dropout. Dropout is implemented as a kind of layer in PyTorch, and what this layer does is take the state of a layer, randomly pick a certain proportion of the units, and basically set them to 0. So you can think of it as a mask, a layer that applies a mask to its input, and the mask is randomly
picked for every sample: some of the values in the mask are set to 0, some are set to 1, and you multiply the input by the mask, so only a subset of the units are able to speak to the next layer, essentially. That's called dropout. The reason for doing this is that it forces the network to distribute the information about the input over multiple units instead of squeezing everything into a small number of them, and it makes the system more robust. There are some theoretical arguments for why it does that; experimentally, if you add this to a large network, you get better performance on the test set. It's not always necessary, but it helps. There are lots of tricks, and I'll devote a lecture to this, so I'm not going to go through all of them right now; that requires explaining a bit more about optimization. So, really, what is deep learning about? I've told you the basics of deep learning, but I haven't told you why we use deep learning, and that's what I'm going to tell you about now: the motivation for why we need multi-layer neural nets, or things of this type. The traditional, prototypical model of supervised learning, for a very long time, was the linear classifier. A linear classifier for a two-class problem is basically a single unit of the type we talked about earlier: you compute a weighted sum of inputs, you add a bias (and you can think of the bias as just another trainable weight whose corresponding input is equal to 1, if you want), and then you pass that through a special function, the sign function, which outputs minus 1 if the weighted sum is below 0 and plus 1 if it's above 0. So this basic linear classifier partitions the input space of X's into two half-spaces separated by a hyperplane. The equation sum over i of w_i x_i plus b equals 0 defines the surface that separates category 1, which is going to produce y-bar equal to plus 1, from category 2, where y-bar equals minus 1. Why does it
divide the space into two halves? It's because you're computing the dot product of an input vector with a weight vector. If those two vectors are orthogonal, the dot product is 0, so the set of points in X space where this dot product is 0 is the set of points orthogonal to the vector W. In an n-dimensional space, W is a vector, and the set of X whose dot product with W is 0 is a hyperplane, a linear subspace of dimension n minus 1, and that hyperplane divides the n-dimensional space into two halves. So here is the situation in two dimensions: you have two dimensions, x1 and x2, you have data points, the red category and the blue category, and there is a weight vector plus a bias, where the intercept of this green separating line with the x1 axis is minus b divided by w1. So that gives you an idea of what W should be: the W vector is orthogonal to the separating surface. Changing b will change the position, and changing W will change the orientation. Now, what about situations where the red and blue points are not separable by a hyperplane? That's called the non-linearly separable case, and there you can't use a linear classifier to separate them. What are we going to do? In fact, there is a theorem that goes back to 1965, by Tom Cover, who died recently actually, that says the probability that a particular separation of p points is linearly separable in n dimensions is close to 1 when p is smaller than n, but close to 0 when p is much larger than n. In other words, take an n-dimensional space, throw p random points into it, randomly label them blue and red, and ask: what is the probability that that particular dichotomy is linearly separable, that I can separate the blue points from the red points with a hyperplane? The answer is: if p is less than n, you have a good chance that they will be separable; if p is much larger than n, you basically have no chance. Okay, so if you have an image
classification problem, and you have tons of examples, way more than the dimension, let's say you do MNIST: MNIST is a dataset of handwritten digits, the images are 28 by 28 pixels (in fact the intrinsic dimension is smaller, because some pixels are always 0), and you have 60,000 samples. The probability that a dichotomy of those 60,000 samples, let's say 0's versus everything else, or 1's versus everything else, is linearly separable is essentially zero. Which is why people invented the classical model of pattern recognition: take an input, engineer a feature extractor to produce a representation, in such a way that in that space your problem becomes linearly separable if you use a linear classifier, or separable in some other way if you use another type of classifier. Now, necessarily, this feature extraction has to be non-linear itself: if the only thing it does is some affine transformation of the input, it's not going to turn a non-linearly separable problem into a linearly separable one. So necessarily this feature extractor has to be non-linear; this is very important to remember, a linear preprocessing essentially doesn't do anything for you. So people spent decades, in computer vision and in speech recognition for example, devising good feature extractors for particular problems: what features are good for speech recognition? Or, for face recognition, can I detect the eyes, measure the ratio between the separation of the eyes and the distance to the mouth, compute a few features like this, and then feed that to a classifier to figure out who the person is? Most papers in computer vision between the 1960s or 70s and the late 2000s or early 2010s were essentially about that, how you represent images properly. Not all of them, but a lot of them, for recognition. And a lot of people devised very generic ways of building feature extractors. The basic idea is that you expand the dimension of the representation in a non-linear way, so that your number of dimensions becomes larger than the
number of samples, and now your problem has a chance of becoming linearly separable. Among the ideas, which I'm not going to go through in detail, are space tiling, random projections, and polynomial classifiers. Random projection is a very simple idea: you take your input vectors, you multiply them by a random matrix, and then you pass the result through some non-linear operation. That's called random projection, and if the dimension of the output is larger than the dimension of the input, it might turn a non-linearly separable problem into a linearly separable one. It's very inefficient, because you might need this dimension to be very large to do a good job, but it works in certain cases, and you don't have to train the first layer: you pick it randomly, so the only thing you need to train is the linear classifier on top. Then there are polynomial classifiers, which I'll talk about in a minute. So those are basically techniques to turn an input into a representation that will then be classifiable by a simple classifier like a linear classifier. So what's a polynomial classifier?
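Here is a small sketch of random projection on the classic XOR problem, which is famously not linearly separable in its original two dimensions; the projection size, the ReLU nonlinearity, and the least-squares readout are illustrative choices, not a prescribed recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: no single line in the (x1, x2) plane separates these labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1.0, 1.0, 1.0, -1.0])

# Random projection: a fixed random matrix, then a nonlinearity.
# The first "layer" is never trained; only the linear readout is.
D = 100                                   # expanded dimension
R = rng.normal(size=(2, D))
c = rng.normal(size=D)
features = np.maximum(X @ R + c, 0.0)     # nonlinear, higher-dimensional

# "Train" the linear classifier on top by least squares, for brevity.
w, *_ = np.linalg.lstsq(features, y, rcond=None)
pred = np.sign(features @ w)
print(pred)
```

In the 100-dimensional feature space, the linear readout fits all four labels, which no linear classifier on the raw two inputs can do; the price is that the width D typically has to be large, which is the inefficiency mentioned above.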
Basically, imagine that your input vector has two dimensions. The way you increase the dimensionality of the representation is that you take each of the input variables, but you also take every product of pairs of input variables. So now you have a new feature vector composed of x1, x2, a 1 for the bias, and also x1 times x2, x1 squared, and x2 squared. When you do a linear classification in that space, what you're really doing is a quadratic classification in the original space: the separating surface in the original space is now a quadratic curve in two dimensions, or a quadratic hypersurface in n dimensions, so it could be a parabola, an ellipse, or a hyperbola, depending on the coefficients. Now, the problem with this is that it doesn't work very well in high dimension, because the number of features grows as the square of the number of inputs. If you want to apply this to an ImageNet-type image, the resolution is 256 by 256 by 3, because you have color channels; that's already a high dimension, and if you take the cross product of all of those variables, that's way too large. So it's not really practical for high-dimensional problems, but it's a trick. Now, support vector machines, or kernel machines more generally, are basically two-layer systems where the first layer has as many dimensions as you have training samples. For each training sample you create a unit, if you want, and the role of this unit is to produce a large output if the input vector matches that training sample and a small output if it doesn't (or the other way around; it doesn't really matter, but it has to be nonlinear). So it's something like a product of the input with one of the training samples, passed through a negative exponential or a square or something like that. This gives you how much the input vector resembles one of the training samples, and you do this for every single training sample. And then you train a
linear classifier that uses those outputs as its input; you compute the weights of that linear classifier, and it's basically as simple as that (there's some regularization involved). So essentially it's kind of a lookup table: you have your entire training set as points, or units if you want, in your first layer, and they each indicate how close the current input vector is to them. You get some picture of where the input vector is from its relative position to all of the training samples, and then with a simple linear operation you can figure out the correct answer. This works really well for low-dimensional problems with small numbers of training samples, but you're not going to do computer vision with it, at least not if the X's are pixels, because it's basically template matching. Now here is a very interesting fact. If you build a two-layer neural net on this model, let's say an input layer, a hidden layer (I'm not specifying the size), and a single output unit, and you ask what functions you can approximate with an architecture of this type, the answer is: you can approximate pretty much any well-behaved function as closely as you want, as long as you have enough of those units in the middle. So there is a theorem that says that two-layer neural nets are universal approximators; it doesn't really matter what nonlinear function you put in the middle, any nonlinear function will do. A two-layer neural net is a universal approximator. And immediately you ask: why do we need multiple layers, then, if we can approximate anything with two layers? The answer is that it's very, very inefficient to try to approximate everything with only two layers, because many, many interesting functions we want to learn cannot be efficiently represented by a two-layer system. They can possibly be represented by a two-layer system, but the number of hidden units it would require would be
so ridiculously large that it's completely impractical. So that's why we need layers. This very simple point is something that took until basically the 2010s for the machine learning and computer vision communities to understand; if you understood what I just said, it just took you a few seconds. There is a last question here before we finish class: does the depth of the network then have anything to do with generalization? Okay, so generalization is a different story. Generalization is very difficult to predict; it depends on a lot of things. It depends on the adequacy of the architecture to the problem at hand: for example, people use convolutional nets for computer vision, they use transformers for text, and so on, so there are certain architectures that work well for certain types of data, and that's the main thing that will improve generalization. But generally, yes, multiple layers can improve generalization, because whatever function you're interested in learning, computing it with multiple layers allows you to reduce the overall size of the system that will do a good job, and by reducing the size you're making it easier for the system to find a good representation. There is something else, which has to do with compositionality; I'll come to this in a minute if I have time. Also, the minimum, the, how do you call it, the well, is larger, right? If we have over-parameterized networks, it's much easier to find a minimum of your objective function, which is why neural nets are generally over-parameterized: they generally have a much larger number of parameters than what you would think is necessary, and when you make them bigger they usually work better. It's not always the case, but it's a very curious phenomenon; we'll talk about this later. Okay, this is the one point I want to make, and it's the fact that the reason why layers are good is that the world is compositional. The
perceptual world in particular, but the world in general, the universe if you want, is compositional. What does that mean? It means that, at the level of the universe, we have elementary particles; they assemble to form less elementary particles; those assemble to form atoms; those assemble to form molecules; those assemble to form materials; those assemble to form structures, objects, etc., and environments, scenes, etc. You have the same kind of hierarchy for images: you have pixels, and they assemble to form edges, textons, motifs, parts, and objects. In text you have characters; they assemble to form words, word groups, clauses, sentences, stories. In speech you have speech samples; they assemble to form elementary sounds, phones, phonemes, syllables, words, etc. So you have this kind of compositional hierarchy in a lot of natural signals, and this is what makes the world understandable. There is a famous quote by Albert Einstein: the most incomprehensible thing about the world is that the world is comprehensible. And the reason why the world is comprehensible is because it's compositional: small parts assemble to form bigger parts, and that allows you to have an abstract description of the world in terms of parts from the level immediately below in level of abstraction. So to some extent, the layered architecture in a neural net reflects this idea that you have a compositional hierarchy, where simple things assemble to form slightly more complex things. For images, you have pixels that form edges, as depicted here; these are actually visualizations of feature detectors learned by a particular convolutional net, which is a particular type of multilayer neural net. At the low level you have units that detect oriented edges; a couple of layers up you have things that detect simple motifs, circles, gratings, corners, etc.; and then a few layers up there are things like parts of objects. So I think
personally that the magic of deep learning, the fact that multiple layers help, comes from the fact that the perceptual world is basically a compositional hierarchy. End-to-end learning in deep learning allows the system to learn hierarchical representations, where each layer learns a representation at a level of abstraction slightly higher than the previous one: at the low level you have individual pixels, then you have the presence or absence of an edge, then the presence or absence of a part of an object, and then the presence or absence of an object, independently of the position of that object, the illumination, the colour, the occlusions, the background, things like that. So that's the motivation, the idea of why deep learning is so successful and why it has basically taken over the world over the last 10 years or so. Alright, thank you for your attention. That's great. So for tomorrow, guys, don't forget to go over the 01 tutorial, sorry, the 01 notebook that we have on the website, so that we can all get on the same level, for the ones who are not familiar with NumPy and such. Otherwise, see you tomorrow morning and have a nice day. Take care everyone, bye bye.