And here you are. Welcome to the second lecture by Pratik Chaudhari on the principles of deep learning. Thank you, Pratik. Okay, good afternoon everyone, nice to see you again. Today we will take off from where we left yesterday, when we began to introduce neural architectures, and we are going to look a little more deeply into what kind of operations typical neural networks have. Let me just finish scrolling up. Yes. To refresh your memory: we said at the beginning of yesterday's lecture that a two-layer neural network looks like this. It has S, which we will use to denote the weights of the first layer, and w, which we will use to denote the weights of the second layer. It is a function involving some nonlinear operation, which I'll denote by σ, applied to Sᵀx. Here S is a matrix and x is a vector: if x is a vector in d dimensions and S is a d × p matrix, then Sᵀx is a vector in p dimensions, and σ acts element-wise on this vector, returning the nonlinear function applied to every one of its elements. We would like to think of this function as a classifier built using two sets of weights, w and S, and the two interact in non-trivial ways. If we did not learn any features, if we simply had x here, then this would be an easy problem: it would be a linear model, and fitting it on the training data set would be a convex optimization problem. But because w and S interact multiplicatively, if σ is not the identity we have on our hands a slightly harder problem: a non-convex optimization problem. In this lecture we will look at a few techniques for solving, or at least understanding, the non-convex optimization problems that are typically seen in deep learning. There is really one big thing to appreciate from this construction. We began with a linear model, but then we said: if you want to go beyond a linear model, you can use a feature map Φ that lifts your inputs into a larger feature space and creates more complicated nonlinear functions. If you did so, you would have to choose Φ; you would have to choose your features. This is what people used to do ten years ago. Say you get an image; this image could look like the digit five. What features of this image should I use to classify the five as a five? How will I make sure that the features of a five are distinct from the features of a six? The same question applies to characters that you write. This is the first letter of a word in my native language, and here is some other character; I do not know how to write Chinese, but I will try in this very rough way. If you wanted to build a machine that can recognize or read such images, you would have to featurize these images yourself, and because the space of problems is quite large, the space of features that you could construct is also very large, and it is genuinely difficult to come up with good features. The idea behind deep learning, and the reason why people find it so attractive, is that you do not have to choose the features yourself. You fit this model on the data set, and both w, the classifier, and S, the matrix that creates the features (this is a particular kind of feature we are creating), are chosen by the optimization procedure.
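To make the shapes concrete, here is a minimal sketch of this two-layer model in Python. The dimensions d and p, the choice of a ReLU for σ, and all the variable names are illustrative assumptions, not notation fixed by the lecture.

```python
import numpy as np

d, p = 4, 8                      # input dimension and number of features (illustrative)
rng = np.random.default_rng(0)

S = rng.standard_normal((d, p))  # first-layer weights, a d x p matrix
w = rng.standard_normal(p)       # second-layer (classifier) weights, a vector of length p

def sigma(z):
    # element-wise nonlinearity; a ReLU is used here purely as an example
    return np.maximum(z, 0.0)

x = rng.standard_normal(d)       # one input in R^d
features = sigma(S.T @ x)        # S^T x lives in R^p, sigma acts element-wise
y_hat = w @ features             # scalar prediction of the two-layer network
print(features.shape, y_hat)
```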
You can expect these features to be well tuned to the kind of data you have at hand, rather than generic, in some sense. People have noticed that the learned features are indeed nice in this way, and that is why deep networks are, so to speak, so attractive or so powerful in the eyes of many people. In what we saw above, I have replaced the classifier weights by v, just because I want to use w for something else later. So this is a two-layer neural network; I call it two layers because it has two different sets of weights, S and v, and there is one nonlinearity that sits in between them. You can do the same thing and create a multi-layer, deep neural network by composing the features many, many times. This is powerful because it is the ability to take old features and compose them, or take linear combinations of them (precisely that, in this case), in different ways, before you hand them off to the classifier. Of course, if we complained that optimizing the two-layer model was difficult, then this one is even harder, because it has L different layers that create the features, plus the last classifier layer. It has many, many more parameters to fit. It is a large, high-dimensional problem, because the total number of unknown parameters that you are searching for using your training samples is large. You know that if you want to pin down a function in high dimensions, you need exponentially many samples in the dimensionality of the function, so you should expect fitting such functions to be much harder than fitting the perceptron. Okay, so the basic theme of the problem does not change: we are still after the best parameters that minimize the loss. You can also use the hinge loss in this case if the true outputs are binary, so the labels y lie in, say, {-1, +1}: the output of the model, after you apply a sign function to whatever comes out of the network, will also be plus one or minus one, and that is why you can use the hinge loss. So this is the optimization objective for a binary classifier built using a deep neural network. When people build such models, they notice that the features seem meaningful. The features that people found: they take σ(S₁ᵀx), the first layer's output, and construct the visualizations by feeding many different inputs x into the network, or alternatively by looking at S₁ directly, which is in some sense the average feature seen over the entire data set. You will notice that the first layer's features look a little bit like edges, colors of different kinds, and so on. People in computer vision will know them as Gabor filters: filters that look like ellipses at different angles, where you vary the scale and the orientation of the ellipse, and that is what allows you to detect different kinds of edges in images. Features of this kind are not very specific to the data set you are using; this particular picture was created using features learned on a data set called ImageNet, which we talked about last time.
It is a data set of about a million images of many different objects, and those images tend to have colors, edges, low-level texture, and that is indeed what these features pick up. Even on a fairly different data set, say all the photos on your phone right now, the features learned in the first layer would not be too different, because the world is quite similar at that scale, and the network will not bother to learn different features there when it has many more layers on top to learn features specialized to your task. As you go up in the layers you see slightly more complicated features emerging. Remember that our second layer is, at the end of the day, something that takes linear combinations of the old features, using this matrix S₂, which is also learned, and then applies a nonlinearity element-wise. So it is going to take these edges, take linear combinations of all the features from the first layer, and apply the nonlinear function to them. The result could look a little bit like this, and you can notice certain patterns appearing already: you might read this one as an eye, or a football, who knows, and you can sit and interpret such patterns. If you go even higher up in the layers of a network, you see even more distinctive patterns; now you can even see objects. This one is surely an eye, because ImageNet consists of a lot of photographs that people took of their pets, so you start to find features that are very distinctive and very, very useful for classifying household animals, and this is the kind of feature you will see. Now, it is very surprising, and I guess reassuring in some sense, that the kinds of features that have been observed to form in the visual cortex are quite similar to the features we see here. The visual cortex is the area of the brain in charge of visual processing: after the information goes out from the retina, there are many stages of processing by the time it reaches the rest of the brain, and people have given names to these: V1, V2, V3/V4, and so on. The first stage of processing does tend to learn features that look like this; the second stage tends to learn more complicated features; and by the later stages it becomes a complete mishmash of features that you cannot interpret so easily, which is also the case for a neural network. It is not as if the network always learns these nice-looking features, but if you look carefully you will notice that the features do indeed make sense. This is reassuring, because we said we wanted to mimic how the biological brain is structured, and we used a very, very coarse model of a biological neuron: we said a neuron is a function that multiplies its inputs by some weights and then passes the result through a nonlinearity. Real neurons are much more complicated than this, but it is nice that the features we get do not seem very different, at least visually. Okay. And now, in a very real sense, the ability to not have to pick features is the real power of deep learning.
If you look at how people used to conduct research about ten years or so ago, someone working in computer vision had a completely different knowledge of what their data looks like. They would know good techniques in image processing; they would know how to use ideas from 3D geometry to understand physical scenes. All of this was very different from how someone working in natural language processing would use their data: they would be interested in representations of words, in how sentences are formed from words using some grammar, and so on. It was essentially impossible for someone working in computer vision to borrow ideas from NLP, and vice versa. The nice thing that has happened due to deep learning is that a lot of these fields have begun to use neural networks at such a fundamental level that the same kinds of neural architectures (and we will talk a little bit about neural architectures in a bit) that work well for computer vision also tend to work well for NLP. Now you can borrow a lot of ideas from, and lend a lot of ideas to, other fields. So the subfields of machine learning tend to come together, in spite of the fact that they work on different data modalities and ask very different types of questions: NLP is interested in understanding the structure of language, computer vision in understanding the structure of the scene underlying images, totally different concepts. But the way we study these questions today is quite unified, and that helps us make progress in all these fields at a pretty rapid pace. The most important point for you to remember is that deep networks are universal approximators. We will not say precisely what this means, but conceptually it means that for any given data set at your disposal, you can fit a neural network, in the sense that you can solve this optimization problem and get perfect predictions on that data set. Perfect training-set predictions; we are not saying anything about generalization so far. This is a nice property: you know that if your data were drawn from a cubic equation and you only fit quadratics, you would not be able to approximate it. Neural networks are not like this; in some sense they are a function class so rich that it will fit any data set you have, provided the network is large enough, with a large enough number of layers and large enough matrices S₁, S₂, ..., S_L. Cool. Any questions, perhaps, before we begin properly? Can you hear me? I do have a question, actually. You seem to contrast the power of deep learning to be used across different fields against the computational complexity of finding such a set of features. But one might also argue that there are other drawbacks: except possibly for vision, where you can sort of make sense of what the features are, in general this is the blackest of all black boxes, in the sense that you can hardly understand what is actually happening; you are predicting without understanding. Whereas if you go through the pain of constructing a set of features by hand, as people have been doing in past years, you get a handle on how you are explaining things while also predicting them. Is that right? Yes.
Yeah, I completely agree. If we created the features ourselves and they worked well, in the sense that they predict well, then we are in business: we get good performance, good predictions, and we also get some notion of interpretation of what these features are and how the probabilistic model is using them. Not having to create the features is nice, because it can give you good predictions, but then you also do not get much understanding of what the features are. The visualizations you will sometimes see on the internet are a bit sketchy in that sense: when you train the network once, it might learn something that looks like this, and out of the thousands of features that these networks have, some will look reasonable to you and all the others will basically be stuff that you cannot interpret in any way. The power to fit anything comes with the downside that you do not understand what this model has learned or how it is being used. In some cases this trade-off is not a bad one to make. If you are interested in parameter estimation problems, let's say a scientific problem where you want to find one term of a PDE or of a dynamical system, then using black boxes like this is a little difficult. But if you are interested in slightly less ambitious problems, where we only want to find out when two images are similar, or to predict when I will tweet next (which is unlikely in the short term), then these models make sense, these models are useful. Thank you. I think we also have a question in the chat. Yes: could you please explain once again, in Figure 4.1, how S is multiplied against different inputs x? What people do is this: we have many different inputs x in our data set. After I have learned my weights, which are v*, S₁*, ..., S_L*, I will feed in a lot of the images from the training data set and calculate what I get at the output of each layer. We will talk very soon about what a layer is, but for now a layer is simply one particular operation applied to the images; this is the second operation being applied to the images. So you can read off the output σ(S₁ᵀx) for many different inputs x and check what it is that the feature is selecting. If the output is large, then this is precisely what is happening: if you imagine a row of this matrix strung out as a big vector, its inner product with an image that also looks like an edge is large, while its inner product with an image that looks like just a blank color is small. That is why people say: this is the kind of pattern that is learned by this particular feature, and that gives you some way of understanding what the feature is. When we look at convolutional networks, this will become much clearer, because there the feature is simply the output of one particular convolutional kernel. So, let us look at a little bit of jargon. This is literally just jargon, nothing very deep about why things are defined the way they are (in some cases there is, and we will talk about it). Neurons have always been thought of as something that is on or off.
Depending on how many impulses, how many stimuli, a neuron gets from neighboring neurons, it either fires or it does not fire, and that was indeed the classical nonlinearity that people used. This is called the activation function; it is our function σ, and these threshold units are the first kinds of networks that people used to build, up into the 90s. But this one is difficult to use with modern optimization machinery, because the gradient of this particular nonlinearity is zero. If you take a neural network that predicts ŷ, and some loss of your true labels y and your predictions ŷ, let's say the hinge loss, then the derivative of that loss with respect to the parameters of your function, which is S in this case, can be zero essentially always, because this nonlinearity outputs only zero or one and is flat everywhere else. That is what makes training with these kinds of nonlinearities very hard. So people said, okay, maybe we can make it a tiny bit softer. This is a sigmoid function; it looks a bit like this: it tends to zero as x goes to negative infinity and levels out at one as x goes to positive infinity. Now, the issue with this kind of nonlinearity is that it plateaus at the ends. If your features give you a value out here for the product S₁ᵀx, then such a feature is stuck: the neuron always fires, and we do not like neurons that always fire. We would like a neuron to be active for some images and inactive for other images. On average, you want the activation of a neuron, which is σ(S₁ᵀx), to be in this middle region. If the neuron has large weights S₁, its average activation can sit well inside the saturation region; pulling it back requires the gradient of the loss with respect to S, and if the nonlinearity saturates you cannot do that easily. This is why sigmoids are a tiny bit better than thresholds, because they are at least differentiable in this region, but otherwise they have the same problems as thresholds. There is a question: is there an understanding of why a sequence of linear and nonlinear operations on x gives predictions that are close to y? Yes, this is the purview of what is called approximation theory. You will see theorems in approximation theory that say that a two-layer neural network, which was precisely the model above, can fit any function by itself; you do not even need a deep network. A two-layer network can fit any function y you want, so long as the number of features, the dimensionality of w, is very large. If you make the network very fat, then even with only two layers it can fit any function. People then study for which kinds of nonlinear operations this approximation is achieved more efficiently, but essentially the thesis is that if the operation is nonlinear and the network is large enough, you can show that it can fit any function you want. Yes, that is a sigmoid. Another issue with the sigmoid is that its values never become negative, and we would like these values to be both positive and negative. So people started using functions like the next one.
This is called tanh, the hyperbolic tangent, which you have seen many times before in physics class. It has the same issue as the sigmoid, it also saturates, but at least it is negative on part of the domain. Another question: is this theorem valid for any nonlinear function σ? No, it is not valid for any nonlinear σ; it is valid for certain kinds of nonlinear functions, and one can state conditions on what those functions must be. Roughly, I think the function has to... I cannot seem to remember exactly. One simple way to think about it is this: if the weights are allowed to be both positive and negative, then the nonlinearity can be only positive; but if, for instance, the weights are constrained to be positive, then the nonlinearity has to take both positive and negative values to approximate any function. That is one simple case I can think of right now that restricts the kinds of nonlinearities you can use. Okay. Over the years, people have experimented with many different kinds of nonlinearities, and the one that is very popular right now is called a ReLU, a rectified linear unit. A ReLU is simply a function that looks like this: it is the identity to the right of the origin and zero to the left of the origin. Pretty freaking close to a linear function, but it is clearly a nonlinear function. It has a huge inhibitory region, where everything we said about the gradients being zero still holds; those are neurons that are inactive. But the nice part is that on the positive side the neurons no longer saturate; the nonlinearity lets them grow in magnitude as much as they want. This is both a blessing and a curse, and if you think deeply about it sometime later you will realize it is not as rosy as it sounds: it is not as if a ReLU is clearly better than a tanh or a sigmoid. But it happens that the kinds of tricks we use to train neural networks today work well with ReLUs. Mathematically, it is the maximum of zero and x, for a scalar argument x. There are many variants of this. Some people do not like the fact that the ReLU is not differentiable at the origin, so you cannot prove certain theorems with it; they smooth it a little and use functions like x times sigmoid(x), which essentially looks like a smooth ReLU, with the smoothing near the origin coming from the sigmoid. Some people take the zero slope on the negative side and make it slightly negative instead, and with each such tweak comes a new name (leaky ReLUs and the like); you will see these in many networks that people use today, in papers and on the internet. Roughly speaking, rectified linear units work great, and all of these variants work essentially just as well; there is no particular reason to pick one of them over a plain ReLU if you are designing a network for a new problem yourself.
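For reference, here are the activation functions just discussed written out as code; a small sketch, with x times sigmoid(x) standing in for the smoothed ReLU mentioned above and a leaky ReLU for the variant with a small negative slope. The slope value is an arbitrary choice for illustration.

```python
import numpy as np

def threshold(x):                # the classical on/off neuron: gradient is zero almost everywhere
    return (x > 0).astype(float)

def sigmoid(x):                  # soft threshold; saturates near 0 and 1 for large |x|
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                     # like the sigmoid, but negative on part of the domain
    return np.tanh(x)

def relu(x):                     # identity for x > 0, zero for x < 0
    return np.maximum(x, 0.0)

def smooth_relu(x):              # x * sigmoid(x), a smooth relative of the ReLU
    return x * sigmoid(x)

def leaky_relu(x, slope=0.01):   # small nonzero slope on the negative side
    return np.where(x > 0, x, slope * x)

z = np.linspace(-4, 4, 9)
print(relu(z))
```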
So far we have been talking about binary classification problems, where y was plus one or minus one. Another cool thing about neural networks is that creating a multi-class classifier is pretty trivial. ŷ was supposed to be a single number, between plus one and minus one, when v is a vector. But if v is a matrix instead, then ŷ can be thought of as a larger vector. So if you have C classes, let's say 10 different digits or 26 different letters of the alphabet, then v can be a matrix; whatever the dimensionality of the features is, let us call it p. When you multiply the features by this matrix, your output is now a vector of length C. You can interpret every coordinate of this vector as the probability that the image is that of a cat, that of a dog, that of a giraffe, and so on. We will talk later about how to use such an interpretation, but getting a multi-class classifier out of a binary classifier is very easy for neural networks. This is not so for other machine learning models. For instance, if you have a standard support vector machine, a very classical model in machine learning, or even a clustering method that can separate two groups, then adapting it to predict multiple classes requires the whole business of one-versus-all over the classes, combining predictions by voting, and so on. For a network, it is much easier. Typical networks have numbers of classes that range from two, to ten, to even hundreds of thousands. In some sense, if you are Facebook, each of us might upload an image to Instagram; they would like to check the similarity between images, and they will use classification machines for this. They will try to guess whether the image came from Pratik or from someone else on this call, and use that as part of their other pipelines, to serve your ads or whatever. People have given names to different parts of a network. The output you get after applying a matrix S is called a feature: S₁ᵀx is the first layer's features, and S₂ᵀσ(S₁ᵀx) is the second layer's features. These are exactly the features I drew in the picture above, Figure 4.1. People in neuroscience will call the different rows of these matrices neurons, but that is just a different name for the same object; nothing big about it. In what we do next, we are not going to worry about the different layers of a network, because you cannot say very much about what one layer learns versus another; essentially it is a big mishmash of functions, and it helps to simply think of the entire neural network as one big, complicated function of many parameters. So we are going to forget, at least at a notational level, that there are layers. We will take all the weights of the network, the classifier v and all the matrices corresponding to the layers, string them up as one big vector, and call it w. I will say that this w has P dimensions, it lives in Euclidean space of P dimensions, just as we imagined our feature space was p-dimensional. So at the end of the day, after we abstract all this away, a neural network is just another classifier, which we will denote f(x, w): it takes an input x and makes a prediction ŷ using the weights w. We fit this model, again, using a loss: the average of the loss over all the samples in your training set. I will also use a shorthand, the loss of the network on the i-th sample of the training set. This is just notation. Now, just like we used the perceptron algorithm, or stochastic gradient descent, to fit the hinge loss, we would also like to use stochastic gradient descent to fit a neural network.
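Here is a small sketch of the abstraction just described: all the weights of a network strung up as one vector w, and the training objective as an average of per-sample losses. The tiny network, the squared-error loss, and the random data are placeholders chosen only to make the snippet run.

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector

# an arbitrary small network standing in for f(x, w)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

w = parameters_to_vector(model.parameters())   # all weights strung up as one vector
print("P =", w.numel())                        # its dimensionality P

X = torch.randn(16, 4)                         # a toy training set of n = 16 samples
y = torch.randn(16, 1)

# per-sample losses l_i(w); the training objective is their average
losses = ((model(X) - y) ** 2).squeeze(1)
objective = losses.mean()
print(objective.item())
```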
For the hinge loss of the perceptron, I could just write down the derivative by hand: it involves simply y times x if the perceptron makes a mistake, and it is zero otherwise; basic calculus. The network we have in our hands is a more complicated function of all its parameters. It is not that easy to simply write down its derivative. It is possible, it is simply the chain rule at the end of the day, but people would like some more automated way of writing this derivative down. There is a question that asks: how can we choose an activation function for a classification problem? The simple way to think about this is: you do not choose. You use a ReLU, and you do not worry about the activation function; there are other things you should be worrying about for your chosen problem. This works unless you know something very special about the problem you are solving. For instance, the way ReLUs work is that a piecewise-linear function is applied to a linear operator. If I compose this a couple of times, what am I going to get when approximating a function that looks like this? If this is my true function y(x), and say x is one-dimensional, then my ReLUs are going to approximate this function with a network whose output is also piecewise linear: because the nonlinearity is piecewise linear, ŷ has to be piecewise linear as well. So you are taking the nonlinear function you wish to fit, the outputs y(x) for each x, and chopping it up into little pieces that are each linear. This is reasonable: for every function, if you have enough little pieces you can fit anything you want. But you can sometimes get much better answers using different kinds of nonlinearities. For instance, if you knew that your true function consisted only of quadratics, you could choose σ(x) = x², which is also a nonlinear function, and use it to approximate your true function. You could get away with fewer parameters, or get a slightly better fit, if you knew that your true function had some special properties. There are other examples where I have done this in the past to good effect. Usually you do not need to do this: you know that if the network is large enough, every function can be written as many little linear pieces, and you will get good answers, so sticking to ReLUs is a good idea unless you know better. Okay. So, you cannot write down the gradient of the loss of a neural network, even if it is simply the hinge loss, as easily as we wrote down the gradient for the perceptron, and we would like a more automated way of calculating it. Why do we want the gradient? Well, we want to use all those ideas from optimization theory to fit the network, and all those algorithms require the gradient. The gradient is a pretty nice quantity: no matter the dimensionality of the weights, it gives you local information about the loss function, and the theory tells you that you can converge in so-and-so many iterations. So: backpropagation is an algorithm for computing the gradient of the loss function with respect to the weights of a neural network. This is an important sentence to think about. It is also a very simple sentence to think about: it is nothing other than the chain rule of calculus.
It is just implemented a little differently, and that is it. To give you an appreciation for how this works: we will not actually derive backpropagation here (in my own course I make the students do it), but because we do not have too much time, I will give you an example. Let us take a very silly neural network where the inputs, the outputs, and the weights of both layers are all real numbers: a one-dimensional regression between x and y, fitted with a function of this kind. At this very childish level, w is the weight of the first layer, just one scalar, and v is the weight of the second layer, again one scalar. When you write this down in, say, Python, this is the actual graph that the program creates. You write a function that takes x as an input; it uses the weight w and calculates some intermediate variable z (imagine how your computer would do this), and z is simply w times x. We can think of this as the first layer of the network. It then applies the nonlinearity σ to this intermediate variable and, let us say, calculates the feature h. You now multiply by the second layer to get ŷ = v times h, and then you use the true output y to measure whether or not you made a mistake; this is the loss. Any time you write down a function like this on your computer, the computer follows the steps of what is called the forward computation graph, from left to right. Now, the chain rule goes the opposite way, from right to left. What do we want from the chain rule? We want to calculate dL/dv and dL/dw, the two derivatives we need so that we can update w and v in the direction of the negative derivative and take one step of stochastic gradient descent. Just as you would use the chain rule to calculate these derivatives, backpropagation uses the forward computation graph to calculate them. If you simply wrote them down by hand it would obviously be an easy exercise, but the interesting thing to note is this: dL/dv is (y minus your prediction) times the derivative of this part, which is simply minus σ(w x). When you calculate dL/dv, you are reusing quantities of this form. And this is the important thing to observe: the derivative that comes out of the chain rule uses things that were already calculated in the forward graph. What is σ(w x)? It is precisely h, right? Similarly, v times σ(w x) is simply v times h, the prediction. Every term you will ever see in the chain rule of any function is built from little pieces that were created while performing the forward computation. So if you want to calculate the backward part of the chain rule, you can imagine remembering all these things, simply caching them while performing the forward computation, and then putting together these cached pieces to get the derivative dL/dv.
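Here is that toy example written out, caching the forward quantities and reusing them in the chain rule. To make the derivatives concrete I am assuming a squared-error loss of one half (y minus ŷ) squared and a sigmoid nonlinearity; the numbers are arbitrary.

```python
import numpy as np

def sigma(z):                 # the nonlinearity (a sigmoid, as an example)
    return 1.0 / (1.0 + np.exp(-z))

def dsigma(z):                # its derivative, needed for dL/dw
    s = sigma(z)
    return s * (1.0 - s)

x, y = 1.5, 0.7               # scalar input and target
w, v = 0.3, -0.8              # scalar first- and second-layer weights

# forward pass: cache every intermediate variable
z = w * x                     # first layer
h = sigma(z)                  # feature
y_hat = v * h                 # prediction
loss = 0.5 * (y - y_hat) ** 2

# backward pass: chain rule, reusing the cached z, h, y_hat
dL_dyhat = -(y - y_hat)
dL_dv = dL_dyhat * h                   # uses the cached feature h = sigma(w x)
dL_dw = dL_dyhat * v * dsigma(z) * x   # uses the cached pre-activation z

print(dL_dv, dL_dw)
```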
And that, in a nutshell, is how backpropagation works. Backpropagation is simply an algorithm that caches every intermediate variable you create during the forward propagation, when you run this computational graph, and then combines these cached values to calculate the derivative of the loss with respect to every weight inside the graph. The same thing is true here as well: you will notice that when you take the derivative with respect to w, you also get a term that is the derivative of the nonlinearity. The way backpropagation works in PyTorch, for instance, is that it is simply a bookkeeping exercise. PyTorch literally creates what is called a tape, which records all the operations performed on every variable inside the graph, and when it replays the tape it knows exactly how to apply the chain rule to every single step. It is a very mechanistic way of taking derivatives, and this is really the crux of why deep learning is seemingly so easy. The central part of deep learning libraries (there is one library called PyTorch, which I will show you in a bit) is what is called autograd, and it is called autograd because it is a machine for automatic differentiation. For any function you have, you can define a forward graph; the forward graph is trivial to define, it is simply the way you would calculate the function. And for every single forward graph, the library is able to automatically write the backward graph and automatically calculate the derivative. This is a huge deal. In fact, until about ten years ago or so, you would see research papers in machine learning where, out of the eight or nine pages of a conference paper, six pages would be spent deriving the derivatives of complicated models. All of that is one line now, with these automatic differentiation engines. We are not interested in derivatives of an arbitrary function; we are interested in derivatives of the loss of the network's predictions, ŷ = vᵀσ(S_Lᵀ ... σ(S₁ᵀx)). The pieces of this expression are what we will call layers: a layer is something that performs an operation on its inputs and returns an output. Multiplying by S₁ is a layer; applying a nonlinear operation to a vector is a layer; and there are many kinds of layers in deep learning. You only ever write the forward propagation: you tell the library how the activations of the previous layer and the weights of the current layer are combined to get the activations of the current layer. PyTorch automatically fills in a function called backward, which says how the derivative of the loss with respect to the output of this layer, the weights of this layer, and the activations that this layer created during the forward pass come together to give the derivative of the loss with respect to the weights of this layer and the derivative of the loss with respect to the activations of the previous layer. Once you have the derivative of the loss with respect to the activations of the previous layer, you call backward on the layer before this one, and you can back-propagate the gradient once again.
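The same toy example again, this time letting PyTorch's autograd do the bookkeeping: the forward pass records the tape, and backward() replays it. A minimal sketch, using the same assumed sigmoid and squared-error loss as before.

```python
import torch

x, y = torch.tensor(1.5), torch.tensor(0.7)
w = torch.tensor(0.3, requires_grad=True)    # first-layer weight
v = torch.tensor(-0.8, requires_grad=True)   # second-layer weight

# forward graph: autograd records every operation performed on w and v
z = w * x
h = torch.sigmoid(z)
y_hat = v * h
loss = 0.5 * (y - y_hat) ** 2

loss.backward()           # replay the tape from right to left (backpropagation)
print(w.grad, v.grad)     # dL/dw and dL/dv, matching the hand-computed values
```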
When people say they are doing backpropagation of the gradient, what they really mean is that they are using quantities like dL/dŷ to calculate dL/dv, and quantities like dL/dh_k, the derivative with respect to this layer's activations, together with the weights S_k, to calculate dL/dh_{k-1}: the weights of my layer and the derivative of the loss with respect to my output are used to work out how sensitive the loss is to the previous layer's output. You can see that this is a pretty mechanistic process. You will essentially never do it by hand, although it is useful to do it once at the beginning, for pedagogical reasons. But this is the reason why deep learning seems so easy: you do not have to take any derivatives anymore. Questions? No questions here in the hall. So, next. Now that we know the little bits and pieces that make up a typical neural network, we are going to specialize these networks to handle certain kinds of data, and we are going to look at convolutional networks. The problem with neural networks that consist of matrices multiplying vectors is that we cannot use them for images very easily, mostly because images are very large. A typical 100 × 100 RGB image has 10,000 × 3 = 30,000 dimensions right there. So even a one-layer neural network that takes 30,000-dimensional inputs and simply creates a 10-dimensional output, a perceptron with ten classes instead of two, is by itself a 300,000-parameter model. Networks built from dense matrices S₁, S₂, and so on are very parameter-hungry; they are big things to compute. Just to give you an appreciation for how bad things are: I talked about a 100 × 100 image, but you know the camera resolution of typical phones right now, the latest iPhones or the Pixel 7; it is 50 megapixels, that is 50 × 10^6 pixels times three for RGB, so roughly 10^8 dimensions right there at the input. You have to kill all this dimensionality very quickly in order to do computations with it. This is the central problem in computer vision: images are large, and images are highly redundant observations of the physical scene. If you want to understand the scene, how many objects there are, what they are doing, and so on, we need some way of killing away all this variation in the images. Here is what variation in images looks like. This is the office of my PhD advisor, and let us say you took a photograph of his office early in the morning. The exact same office would look a little different, the image would look quite different, if you took it at night with a different lamp on. If I just move around a little bit in the office, all of you will agree that this is the same physical office, but I have simply taken the photograph from a slightly different place. And this is exactly the same office again, except that I moved the potted plant in between me and the camera, and now obviously the photograph looks totally different. These are all things that make images different: the first is illumination, the second is viewpoint, the third is visibility, or occlusion. The exact same scene, which I will denote by ψ here, looks very different in different images, depending on what operations were performed when the image of the scene was created. And for the purpose of understanding the scene, these operations are not very useful.
I want to know whether there is a computer monitor on his desk; the fact that the room is illuminated this way or that, or where I am looking at the desk from, is immaterial to answering this question. Statisticians call such quantities nuisances. They are nuisances because they do not let us say anything about the true thing we want to talk about, the scene ψ, the physical scene. But they do play a role in how images are created, in how observations of the scene are formed, and so we need some way of mitigating the variability caused by all these nuisances if we are ever to understand the scene ψ. This is the name of the game in computer vision specifically, but in machine learning in general: data is always created from some concept, whether it is the sentiment you are trying to express when you write down a sentence, or the physical scene in front of you when you take a photograph, and it is corrupted by nuisances. In vision some nuisances have nice structure: viewpoint is a nuisance with a group structure; illumination and visibility do not have a group structure, but at least we know their names. In language the nuisances are a little more diverse: even the same sentiment can be expressed in many different ways, and different people here would write totally different sentences for saying "I am happy". That is exactly the nuisance variability we want to average away if we want to understand the concept people are trying to express. In some cases the underlying concept may itself be totally different. This is an image of an office; it is actually my office from when I was a student, during my master's, and this is an office at UCLA, this is an office at MIT: totally different offices, and obviously the images look totally different. We want to be able to say that this image and that image show different things. This is rather trite for this particular example, but hopefully you get the point. So, to understand the differences between scenes ψ, we need some way of killing the variability introduced by the nuisances. Certain nuisances are easy, and certain other nuisances are harder. In this chapter we are going to take a look at one particular kind of nuisance, which is simply translation. Translation is a group nuisance in the sense that it follows the definition of a group: there is an identity element, and I can invert a translation by composing it with the negative of the movement I performed when I took the image. For such kinds of nuisances, there are operations we can perform within the network that automatically make the network insensitive to those operations in the data. One answer to this is convolution, which is an operation that is equivariant to translations, and the way I will explain it is as follows. The basic building block of the networks before was a weight matrix multiplying the input x. The building block of a convolutional network is going to be a vector w being convolved with the vector x. I probably do not need to give you the definition of a convolution, but to be a tiny bit more pedantic: if x is a vector, you pad it with zeros on both the left and the right, out to infinity, and you perform the convolution summation, or the convolution integral; it is an infinite sum.
It takes the signal and, at each point, multiplies it by the mirror flip of the weights and sums up. In pictures it looks like this: if this is my x and this is my w, I first flip w, because the convolution tells me to flip w across the y-axis, and then at each position I take the inner product of these two vectors and sweep the kernel from left to right, which is what this integral over here says. So an input of length three, padded with zeros on the left and right, convolved with a kernel of length three, gives an output of length five. Now, there are various ways of handling this: people will sometimes drop the borders and say, no, I want an output of length three, and some people will keep the borders; it depends on the application, and for neural networks we choose whatever we want. One thing to note here is that the kernel I am using when I say I am performing a convolution has nothing to do with the kernel we saw in the previous lecture. It just happens to be the same word; signal processing people use "filter" and "kernel" interchangeably, and the two notions of kernel are totally different, with nothing connecting them. Typical deep learning libraries actually implement a slightly different operation: they do not flip the kernel before performing the summation, before taking the inner product. Technically speaking, this is something called a cross-correlation, where you run the summation over x(τ) as τ goes from negative to positive infinity with w(k + τ) instead of w(k − τ). This is a clever trick: since we are going to learn the weights w anyway, just like we learn the matrices S or the classifier, it does not matter very much whether we learn the actual kernel and flip it before convolving, or learn the flipped version to begin with and perform a cross-correlation. That is why deep learning libraries implement a cross-correlation and still call it a convolution operation.
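A small numerical sketch of the one-dimensional case: a length-3 input and a length-3 kernel, zero-padded, give a length-5 output, and the cross-correlation that the libraries implement is the same operation without flipping the kernel. The particular numbers are arbitrary; the kernel is asymmetric on purpose so that the flip is visible.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])    # input of length 3 (zeros padded implicitly)
w = np.array([0.0, 1.0, 0.5])    # kernel of length 3, asymmetric so the flip matters

conv = np.convolve(x, w, mode="full")     # true convolution: flips w, output length 5
xcorr = np.correlate(x, w, mode="full")   # cross-correlation: no flip

print(conv)     # length-5 output
print(xcorr)    # also length 5, but different values because w is not symmetric

# learning a pre-flipped kernel and cross-correlating is the same as convolving
assert np.allclose(xcorr, np.convolve(x, w[::-1], mode="full"))
```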
Two-dimensional convolutions work the same way: you slide the kernel across the 2D image from left to right and from top to bottom, and depending on what the weights of your kernel are, you can perform different operations. If the kernel is large at the center and has, say, ones on the four axes around it, then when you sweep this kernel over an image like this one (this is an image of Geoff Hinton), you get a slightly blurrier image; hopefully Zoom shows that the image on the right-hand side is a tiny bit more blurry, with the edges smeared out a little, because you are averaging over a local neighborhood. On the other hand, if you average with some negative weights in the kernel, you get a slightly sharper image, because the kernel is now checking whether the local intensity at a pixel is higher than that of its neighbors and summing with those specific weights to create a sharpened image. You can also detect edges: a kernel like this detects horizontal edges in this direction, and it is called a Sobel filter. You could of course do an entire class on signal processing studying these kinds of operations. The point to understand about all this is that we could have created such features by hand; if you were dealing with images, this is indeed what we would have done ten or fifteen years ago. The ability to learn these weights allows us to create features without having to choose them by hand. So, just as in a dense neural network we take the features σ(S₁ᵀx) and stack such operations on top of each other, we can do the exact same thing for a convolutional neural network: we replace the first layer by, let me call it, S₁ convolved with x. For a second, simply imagine that S₁ is a matrix where I think of every row as a kernel, and each row and the vector x are convolved together. I can now convolve the result with another such kernel, apply a nonlinearity, and do the same stacking business to get predictions out. This is a convolutional neural network. Now, the cool thing about this is the following. If you were building a dense network, you would need lots and lots of weights just to take all the pixels to the output: with 10^4 pixels and 1000 outputs, right there you have 10^6 to 10^7 parameters, ten million parameters. With a convolutional network you can get outputs for the entire image even if your kernel is very small. In some sense, a convolutional network is a subset of the networks we saw in the previous chapter: any convolutional network can be represented by a fully connected network, what is called a dense network (the ones with the matrices S). It simply corresponds to one particular way of choosing S₁, S₂, and so on. You know that convolution is a linear operator, so when I convolve a vector with another vector I can write this as some matrix S times x; this is called a Toeplitz matrix, or a circulant matrix, with an analogue for 2D images. So there is a dense network equivalent to any convolutional network, but the convolutional network has very few parameters, whereas that matrix S would be very large and I would not know how to pick it. That is what allows us to build very large networks that work on images in particular. This does not mean that convolutional networks are small: people stack so many convolutions and so many layers that the whole thing still ends up being a very large network, but we will get there in a bit. One more thing I will note is that you will often see operations called padding and strides. The convolution integral tells us to move the kernel to every location, after every unit interval on the real line, but you do not have to: you can do this summation with gaps in between, and that is what is called a stride. By using a stride, the output will be a bit smaller than the dimensionality of x, and that is a nice thing, because it lets you remove some of the small-scale, redundant structure you see in images. Images are locally the same: the color of my shirt is locally the same.
You do not need to know the feature at every single pixel of my shirt to classify the shirt; you are happy to take a patch that detects the color blue and move it with some gaps in between when creating the convolution, instead of moving it one pixel at a time. That is what a stride is. It is a very powerful way of reducing the size of the images as they move up through the different layers of your network. The 50-megapixel image can be reduced right away to about 12 megapixels using a stride of two, because the image becomes four times smaller with a stride of two. Questions? Not here. Okay, thank you. So now let us see how a typical network is engineered. The images that you take usually have three channels: the R, G, and B channels, red, green, and blue. A convolutional network will have three different kernels that run on each channel independently; each convolutional layer in the network has a red kernel, a green kernel, and a blue kernel. If you form the convolution of such a kernel with an image like this, you get an output. In this particular case I chose the padding of the signal and the stride such that an image that was five by five gives an output that is three by three; there are many ways of getting outputs of different sizes, but here they are really three-by-three images. Given outputs of the convolution like this, a typical layer in a convolutional network will sum them up: it sums the outputs of the convolutions on the different channels, and it adds a bias, just like we added a bias for the perceptron, wᵀx + b. The wᵀx part has been replaced by the w-convolved-with-x part, with the convolution happening for every channel independently, and then there is the bias. That gives you one particular channel of the output of that layer. The way to think about this is: I have an image of size width × height × channels, a three-dimensional array. It goes into a convolutional layer, and out comes an image of another size, W′ × H′ × C′; do not get confused by the W's and H's here, they simply denote widths and heights. Every channel of the output is created by combining the convolutions of all the channels of the input. So when we perform a convolution operation, we are mixing and merging the information in the different channels of the input activations to that layer. So the number of convolutions you perform in a layer that takes in three-channel images (C = 3) and returns three-channel images (say W′ and H′ stay the same size) is nine: for every output channel, you have a different filter for every input channel. That is why convolutional networks can get pretty large. Imagine the output had 10 channels instead, and 10 is actually a small number, typical networks have hundreds of channels; then you have 30 kernels. If the size of each kernel is five by five, each kernel has 25 weights, so that is 25 × 30 = 750 weights, plus the biases: 10 of them, one for every output channel.
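As a quick check on this counting, here is the same layer in PyTorch, with 3 input channels, 10 output channels, and 5 × 5 kernels; a sketch, with the sizes taken from the example above.

```python
import torch.nn as nn

layer = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=5)

print(layer.weight.shape)   # torch.Size([10, 3, 5, 5]): 30 kernels of 25 weights each
print(layer.bias.shape)     # torch.Size([10]): one bias per output channel

n_params = sum(p.numel() for p in layer.parameters())
print(n_params)             # 10 * 3 * 5 * 5 + 10 = 760
```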
Even if the number of weights in a convolutional layer is much smaller than the number of weights in a fully connected, dense layer, it can get large pretty quickly. It is very useful to think about how the convolutions are implemented, because that will help you get better results out of them; many people forget that the channels are merged together when they implement a network. Now, the reason we use convolutions, as I said, is to obtain what is called translational equivariance. That is a fancy word for a very simple concept. It simply says that if I have a kernel that can detect a star like this, I would also like it to detect the star when the star moves in the image. I have a filter, a kernel, whose output is large when the stuff within its receptive field is a star; I would also like the output to be large at this other location if the star is at this other location. Now, the issue is that even if the output of the convolutional layer is large here and large there, so even if the filter were perfect, exactly a star detector, we still would not directly know how to classify the star, because for this particular image these pixels of the feature map are large, and for that image those other pixels are large. Equivariance is the name given to the property that when I apply some operation to the input image, in this case translating it, the same operation is applied to the output features of that layer: if I translate the star, then the features corresponding to the star, which I get by running a star filter over the image, also translate by the same amount. Mathematically, this is called equivariance, and you can write it down like this: the k-th element of the convolution of x and w equals the (k + δ)-th element of the convolution of x′ and w, that is, (x ∗ w)(k) = (x′ ∗ w)(k + δ), if x′ is the image translated by δ; actually, more precisely like this. Good. So this is equivariance. While it is useful for detecting an object at different locations in the image, it is not, by itself, useful for classifying the object. Why? Because at the end of the day we are going to build a network that takes whatever features I have, takes an inner product with some weights, the weights of a classifier, and then applies the sign function. Imagine stringing the feature map up as a big vector: in the first example it is non-zero over here, and in the second example it is non-zero somewhere else. Just because I multiply by the same vector v does not mean I get the same output. But I would like the output of the model to be the same, because at the end of the day the model is detecting a star, and if there is a star in this image, there is also a star in that image. So when the features move, we would like the features to move but the output of the model to stay put: we want equivariance of the features but invariance of the output of the model. This is a pretty delicate property to get, because on the one hand you want sensitivity to changes in the input image, which is what convolutions are buying us, sensitivity to translations, and on the other hand you want the output of the model to be insensitive to those changes. Typical deep networks make a specific choice to create such invariance: they use an operation that is called pooling.
You can imagine that if you took a simple operation like this, find me the maximum of these features over the entire image, then the maximum would be the same in both cases. If there was a star in both images, the maximum would be, let's say, one; if there was no star, the maximum would be something smaller than one. So if we were doing some kind of maximization or selection operation, we would get such invariance even though the features themselves were varying with the transformation applied to the images. And that is exactly the idea that people in deep learning use: an operation called max pooling. Max pooling simply takes in an image that looks a little bit like this, where I have written numbers in the different cells. Max pooling over a window of size 2 cross 2 takes the maximum of the pixels locally in that window, and then it moves to the next window, with a stride of 2 in this case; a small sketch of this operation is given below.

The way max pooling creates invariance is that it does not create invariance completely, but it creates insensitivity. If the maximum of the pixels in this window is 6, the other pixels can take slightly different values and the maximum will still be 6, as long as none of the other pixels becomes larger than 6. We have created a tiny bit of invariance in how the output of this layer is created from the layer below. So typical neural networks will have a structure where they perform, let's say, one layer of convolutional features, this is my image X, then a non-linear operation, ReLU or something like it, and then a max pooling operation. What was made sensitive to translations by the convolutional layer is made insensitive by the max pooling layer. And you know that if you perform max pooling many, many times, if you take a large image of size 100 by 100 and keep pooling it all the way down to one pixel, then it is of course very, very insensitive to changes in the input image. So while convolutional neural networks do not get exact invariance, they get little bits of invariance after every application of the max pooling operator as you go higher up in the layers. A typical neural network will therefore have a structure like this: a couple of layers, convolutions and non-linearities, and then maybe another max pooling layer. And at the top, you simply do not worry about the fact that these pixels are still slightly sensitive and combine them anyway. You may get some smeared-out behavior when you collapse features like this, but that is good enough to detect large objects in images, and that is what people use. Any questions?

So, I think I have about five minutes or so. Let us now look at the other kinds of nuisances. Convolutions, of course, are a beautiful operation, but they only let us address one nuisance, translations, not even the entire viewpoint, which includes two-dimensional and three-dimensional rotations and translations. The other kinds of nuisances, like contrast or illumination or occlusions, we cannot get rid of with such elegant operations. And that is why people use something called data augmentation. This is a word given to a very, very simple idea. Let me show you some operations.
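As a quick aside before the augmentation examples, here is a minimal sketch of the 2 cross 2 max pooling with stride 2 described above; the numbers in the toy feature map are my own.

```python
# 2x2 max pooling with stride 2: take the maximum inside each non-overlapping
# 2x2 window of the feature map.
import numpy as np

def max_pool_2x2(x):
    """x: (H, W) feature map with even H and W; returns an (H//2, W//2) map."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1., 2., 5., 6.],
              [3., 6., 7., 8.],
              [4., 0., 1., 2.],
              [9., 3., 2., 1.]])
print(max_pool_2x2(x))
# [[6. 8.]
#  [9. 2.]]
# Small changes to the non-maximal pixels leave this output unchanged, which is
# the bit of insensitivity described above. A typical block then looks like
# max_pool_2x2(relu(conv(x))), repeated layer after layer.
```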
So, a cat is a cat if I rotate it, if I skew it a little bit, if I zoom towards its face, and so on. These are all nuisances that determine how images of a cat are created from the object cat. In order to force the network to be insensitive to these nuisances, to force it to average out this variability, we will create a dataset which augments a typical image of a cat from your dataset with such operations. So you will typically perform image processing operations: in this case a reflection; the top left here is a zoomed-in version of the same image cropped at a different location; the top right is a different zoom and a different crop. These are different images, but presumably all of them are a cat. And this forces the network to pay attention to all the features, and not to rely on a few specific ones, to call a cat a cat. There are many operations that people have discovered to work well, mostly because typical images have this kind of variability: this one is about brightness, and you can also think of contrast changes, different crops, affine operations applied to the image, and blurring is another one. Depending on the application, whether you are working with images or with text, the kinds of augmentations that we use are very different, and in a lot of ways this is the real crux, the part where your knowledge of the problem plays a role. A small sketch of such an augmentation pipeline is given after these examples.

To give you a few simple examples: if you have images of a house, the difference between the left-hand image and the middle image is that I flipped it in a mirror. Both of them are perfectly plausible images of a typical house. The image in the third column, I flipped using a water reflection, and this is not a plausible image of a house; it is an upside-down house, which you will only see in Alice in Wonderland or something. So we should not use augmentations that make the house look upside-down if our objective is to classify or detect houses. Why? Because at test time we are unlikely to see images like this. We should only use augmentations that we think will look like the test images, and this is really where our knowledge of the problem comes into play. If you want to classify cows, say you have images of a cow on a field of grass; these are typical images of a cow. Images of cows on beaches are slightly more unusual, and you would like to collect such images too. Of course, if you are only running the classifier in Switzerland, then you can be happy with the picture on the grass, you will never see a cow on a beach, so you do not need to collect images of cows on beaches. On the other hand, if you are running the classifier in India, you want to collect images of a cow on a street and next to people. So the kinds of images that we use to train machine learning models necessarily have to be informed by the kind of images we will see at test time. Augmentation is one specific way in which we get to apply image processing operations to the training images that express the idea that these are the images I will see at test time. If we make the mistake of augmenting too much, then we are imposing too many constraints on the network.
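To make this concrete, here is a minimal sketch of an augmentation pipeline on an image stored as a NumPy array of shape (height, width, 3) with values in [0, 1]. The particular operations, probabilities, and crop sizes are my own illustrative choices, and vertical flips are deliberately left out for the reasons discussed above.

```python
# A few plausible augmentations: horizontal (mirror) flip, random crop, and a
# brightness change. Each call produces a slightly different training image.
import numpy as np

def augment(img, rng):
    if rng.random() < 0.5:                       # mirror flip: still a plausible image
        img = img[:, ::-1, :]
    H, W, _ = img.shape                          # random crop to 90% of the size
    h, w = int(0.9 * H), int(0.9 * W)
    top = rng.integers(0, H - h + 1)
    left = rng.integers(0, W - w + 1)
    img = img[top:top + h, left:left + w, :]
    factor = rng.uniform(0.8, 1.2)               # brightness jitter
    return np.clip(img * factor, 0.0, 1.0)

rng = np.random.default_rng(0)
cat = np.random.rand(64, 64, 3)                  # stand-in for a real image of a cat
print(augment(cat, rng).shape)                   # e.g. (57, 57, 3)
```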
With implausible augmentations, the network is forced to also classify those images, which wastes its learning capacity; we will talk about this in the next lecture. If you apply too few augmentations, then you do not average out the nuisances, and you remain sensitive to the nuisances that you did not handle using augmentations. So augmentation is a catch-all operation that takes care of the nuisances that are left over after convolution. You do not need to augment images using translations, because convolutions already take care of those. I will end here, and in the next lecture we will begin with some loss functions and some more mathematical material. Any questions? Thank you. Any questions from the audience? Okay, I do not see any questions in the chat either. So, thank you again. We will reconvene tomorrow at the same time. Thank you. Bye-bye.