Hello and welcome to lecture seven of this introduction to machine learning course. Today we're going to talk about deep learning and neural networks. That, I'm sure, sounds exciting to many of you, and it is exciting, but we are only going to spend one lecture on this topic; we don't have more time in this course. So we'll just scratch the surface of deep learning a little bit, and you will definitely need to take other courses or read textbooks to really get to know deep learning and to use it in practice. Another comment: there are so many lectures on deep learning available online, because this is such a hot topic, that I was debating with myself whether the world needs another YouTube lecture on deep learning. In the end I decided to make it, because I want you to see how it fits into our course, so that it doesn't look like machine learning and deep learning are two separate things. You will see that it fits right in, and we'll be using the same concepts as everywhere else in the course when talking about it today.

All right, with that said, we can start. What is deep learning? I took a definition from a deep learning textbook that came out last year, I think, and which I can recommend to anybody who wants to read up on this. Goodfellow et al. define deep learning in this book as algorithms enabling the computer to learn complicated concepts by building them out of simpler ones. So there is an idea of a hierarchy of concepts built from small building blocks, and that is essential in neural networks and in deep learning. Neural networks implement computations by building hierarchical constructs out of small building blocks called neurons, which are very simple processing units. This hierarchical organization of the network, once it's trained on some data, allows it to form hierarchical representations of the data, and it turns out that for many different real-world problems this hierarchical organization of patterns is very helpful for solving the task, as we will see later on.

Okay. All of you have probably heard, or know, that this field has had a series of huge successes during the last several years, less than ten years.
Let's say roughly ten years, from around 2012. People even call it the deep learning revolution, and really, every year there's another big success that you hear about in the news. The latest one, just a few weeks ago, was that a Google team made huge progress in predicting protein structure from the protein's sequence data. In 2012 it started with a big leap in classifying images into cats, dogs, ships and cars. Since then you've all seen how neural networks can generate photorealistic images, photographs of people who don't exist, just dreamed up by a network. You've heard how reinforcement learning algorithms based on neural networks can teach themselves to play chess and Go better than anybody in the world. And recently you've probably heard how a neural network can be trained to generate text that looks really human-like, so that it's actually hard to tell whether a page of text was written by a program or by a human; I'm referring to GPT-3 from OpenAI.

Today we're going to talk about a tiny corner of this field: feedforward neural networks for image classification. This goes back to 2012, when a paper was published that overtook all competitors by a very large margin in the task of classifying images into 1000 different classes of objects depicted in photographs. So we're going to talk a bit about how this worked eight years ago.

In fact, in some sense we have already talked in this course about a neural network, a feedforward neural network classifier; I just didn't call it that. I'm referring to logistic regression. So let's briefly revisit logistic regression, and then we will build on this concept, adding more and more complicated steps, and gradually arrive at something that looks like a real-world neural network.

Here is the logistic regression loss function and definition, copied from an earlier lecture on this topic. We have a loss, which is here, and h(x) is the probabilistic prediction the model makes that a given input x_i belongs to class 1. This is a binary classification problem: we have class 0 and class 1; think about predicting cats versus dogs, and then h(x) is the probability that it's a dog, for example. We sum the log probabilities across all training examples: whenever the true class is dog, we take the log probability that the model (not a network here, just logistic regression) predicts a dog, and whenever it's a cat, we take the log probability that it predicts a cat. That's a very simple loss, right over here. And what is h(x), our prediction? How do we form it? It's a logistic transformation of a linear function.
So we have a linear function, beta times x, hidden in here, and we transform this linear combination of our predictors via the logistic transformation, which takes any number and makes it a number between zero and one, which is what we want from a probability; we pass it through the sigmoid function.

Now let me draw it like this. For logistic regression it looks a bit weird to draw it like this, but you will see where I'm going: I'm trying to represent it as a network. I have something that in the neural network literature we would call input units, or input nodes, or neurons, and I have p of them: imagine that my predictor space has p different predictors, depicted here as circles. Then I have an intercept term. Previously we handled it by putting a component into the x vector that always has the value one, which is convenient because then we can just write beta times x, and the beta that multiplies this one is just the intercept added to the model. For now I will draw it like this, and this node always has the value one, fixed at one. Then I take these values, multiply them by some betas, and add it all up; that's what this means, and I depict it like this: every line here has a weight, I take the value of my predictor, multiply it by this weight, and they all meet at this point, which means I add them all up. Here I have the value of beta transpose x, and I write "logistic" here, which means I take the result and transform it via some function: I put it through the logistic function and get the output. So this will be my depiction of logistic regression. You can think of different inputs coming in here, x_i with i going from 1 to n, my training set: the inputs come in, go through this transformation, you get h(x) out, you compare it to the true value, and you try to adapt the weights, for example using gradient descent, so that you make as few, or as small, mistakes as possible.

I will also use this shorthand notation for the same thing, without drawing every individual weight: we just go from this input layer to this output node over here. There's only one output, h(x), which is just a scalar, a single number that we get out.

In statistics, when talking about regression, logistic regression, anything, these betas are traditionally called coefficients: the coefficients in linear regression, the coefficients in logistic regression. The neural network community calls the same thing weights, and I will adopt this terminology; that's why I replaced beta by w here, just to emphasize it. But it is the same thing: whenever a deep learning person talks about weights, you can think of a statistics person talking about coefficients, no difference. These are the parameters of our model, the parameters that we want to adapt to fit the training data we have.
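If you want to see this picture in code, here is a minimal NumPy sketch of exactly what was just described: a linear combination of the p predictors plus an intercept, passed through the logistic (sigmoid) function, together with the log loss averaged over the training set. All the names and the toy data are purely illustrative, not anything from the slides.

```python
import numpy as np

def sigmoid(z):
    # logistic function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    # h(x) = sigmoid(w^T x + b): probability that each sample belongs to class 1
    return sigmoid(X @ w + b)

def log_loss(X, y, w, b):
    # negative log-likelihood of the binary labels y (0 or 1) under the model
    h = predict_proba(X, w, b)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# tiny usage example with made-up data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))    # n = 100 samples, p = 5 predictors
y = (X[:, 0] > 0).astype(float)  # labels that depend on the first predictor
w, b = np.zeros(5), 0.0
print(log_loss(X, y, w, b))      # about log 2 for an uninformed model
```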
So, logistic regression is a network that is, first of all, linear, and second, it only has the input layer, which is our data, and the output layer, which is the h(x) prediction. There's nothing else going on; it's kind of a stupid network, still. So let's make it more interesting: let's add something called a hidden layer, and from this point on it suddenly becomes much more interesting.

Here we have the same thing: this is the input layer, and this is the intercept node. Actually, there's another term: this is sometimes called the bias node in the neural network literature. I will try not to use this term, because it has nothing to do with the statistical bias we talked about earlier, but you may see it; so let's call it the intercept node. In the end we still want to get h(x), which is just a number, but we will do it in two steps: first we map all of this onto this hidden layer, and then we map the hidden layer onto the output. Again, every line is one weight. Let's say here we had p nodes, the dimensionality of the input space, here we have d nodes, and here only one. This means that W2 is in fact just a vector of dimensionality d (or d plus one, because there's also an intercept here), but W1 is a two-dimensional matrix: something that transforms a p-dimensional vector into a d-dimensional vector, so it's a p times d matrix of weights. You can think about how many weights this entire thing has: p times d here, plus d over here, plus some connections coming from the intercept. A more compact representation of the same thing is just this: W1, W2. This is the input layer, this is the first (and in this case the only) hidden layer, and this is the output layer. Depending on your definitions you can say this network has three layers or one layer; what usually matters is how many hidden layers it has, and this one has only one hidden layer.

Let's now write down the loss. We'll use exactly the same loss; it will not change in this lecture. This loss just comes out of maximum likelihood, as we discussed before, when we make probabilistic predictions for a binary response variable, and I will keep it fixed. What changes is this function: how we actually generate the probabilistic prediction given the input data. Previously it was just a logistic transformation of beta times x, and now we have this.

This is the linear case, by the way, I didn't say that yet: imagine that every value in the hidden layer is a linear combination of the inputs, and we do nothing else, and then the value here is a linear combination of those linear combinations; we have this matrix product here. And what is a linear combination of linear combinations? Just a more complicated linear combination. So we gained nothing by adding a linear hidden layer to this model; it is exactly equivalent to logistic regression. You can multiply W2 by W1, call the result beta, and you're back to logistic regression. This is still not interesting. It becomes interesting only when we say that the neurons, the nodes in the hidden layer, are non-linear, because then I can't write it like that anymore. Now I have to say: I am forming a linear combination of the inputs, that's W1 x, and then I am passing it through a non-linearity.
That's what people call it, a non-linearity. So I have some function phi, and I will assume it's always the same for all these neurons; that's how it usually is. Imagine you are this neuron over here: it computes a linear combination of all the inputs (that linear combination is formed here), and then this linear combination is passed through some fixed non-linear function that I will call phi. This is done in parallel in all these nodes, and then I can go ahead and form a linear combination of the results over here; that's why there's a W2 over here. W2 is just a vector, so once you've computed all that, you have one number over here, and this output neuron is also not a linear neuron: it takes this one number and passes it through its own non-linearity, which is the logistic function. All of these things together, and out of here comes h(x).

Okay, so what is phi? What function can we take as phi? This is a choice one makes when building neural networks, and there are different more or less standard choices. I'm going to describe just one, which is probably the most standard choice at the moment, but several are available. What do we want? We want a function that is not linear, obviously, as I explained, and we want a function that's easy to work with mathematically, because we will need to compute gradients of everything, so something that's easy to differentiate. You can choose different things; you can take the logistic function, and back in the 90s and 2000s people usually used logistic units for the hidden units. What is more popular right now is a function that looks like this. It's even simpler; it's almost linear, almost y equals x, but if x is negative you just output zero. So here it's the identity transformation, you don't do anything, but whenever your linear combination would be negative, you output zero. That's phi. This is called the rectifier function, and the neurons that do this are called rectified linear units, or ReLUs. Very simple. You can already think about what the derivative of this function will be. There is one slightly unpleasant thing, namely that the derivative is not defined at this point; however, the derivative here is just one and the derivative here is just zero. So the derivative is just a threshold: whenever your x is positive the derivative is one, and below zero it's zero. That's actually pretty convenient to work with, despite the non-smoothness at this point. There are other choices of phi that are used in practice, but I'm not going to mention them; we're just using ReLUs here.
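As a small sketch of what this one-hidden-layer model computes, assuming the ReLU non-linearity we just chose: W1 maps the p inputs to d hidden units, each hidden unit applies the rectifier, and a weight vector w2 plus the logistic function turns the result into a single probability h(x). I store W1 as a d-by-p array here so that it can act directly on a column vector; all sizes and names are illustrative.

```python
import numpy as np

def relu(a):
    # rectifier: identity for positive inputs, zero otherwise
    return np.maximum(a, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_hidden_layer(x, W1, b1, w2, b2):
    # x: (p,) input; W1: (d, p); b1: (d,); w2: (d,); b2: scalar
    hidden = relu(W1 @ x + b1)        # d non-linear hidden units
    return sigmoid(w2 @ hidden + b2)  # h(x), a single probability

p, d = 5, 16
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d, p)), np.zeros(d)
w2, b2 = rng.normal(size=d), 0.0
print(one_hidden_layer(rng.normal(size=p), W1, b1, w2, b2))
```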
Now we come to a very interesting statement. We have a neural network with one hidden layer with the rectifier non-linearity in the hidden layer, and we can ask: what kind of functions can we get if we set the weights in some particular way? The non-linearity is fixed; you change the weights going into the hidden layer and the weights going out of it. What can you get? What is the space of functions that this network can implement? And the amazing fact, a theorem, is that actually any function can be implemented with such a network: any continuous function f can be approximated to arbitrary precision by a neural network with one hidden layer, with pretty much any non-linearity you choose. There are some specific conditions on the non-linearity, but the rectifier, for example, satisfies them. In fact it's not one theorem; historically it's a bunch of them, with several people proving stronger and stronger results in this direction, but this is how I will formulate it for now, and that's what is important for us. You choose any continuous function f that you want, and I can give you weights W1 and W2 (I cannot modify phi: you choose phi and you choose f, and I give you W1 and W2) that approximate f to any precision you also give me, as an epsilon, for example. Another way to write this, which you will sometimes see, uses these circles; that's notation for composition of functions. W1 is a linear transformation, and you read it from this side: this is the composition of W1, then phi, then W2, and any continuous function can be approximated to any precision by that composition.
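Stated a bit more formally (this is my paraphrase of the standard result, with intercept terms included and phi applied elementwise): for any continuous f on a compact set K in R^p and any epsilon greater than zero, there exist a width d and weights such that

$$
\sup_{x \in K} \left|\, f(x) \;-\; w_2^{\top}\,\phi\!\left(W_1 x + b_1\right) \right| \;<\; \varepsilon,
\qquad W_1 \in \mathbb{R}^{d \times p},\; b_1 \in \mathbb{R}^{d},\; w_2 \in \mathbb{R}^{d}.
$$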
That sounds very strong, right? Why would we then need anything else? We seem to be done: any function we could possibly want on our input space can be implemented with just one hidden layer. Why do we need more? The answer, of course, is that this is a theorem which says it is possible, but it doesn't tell you what the weights will be, and in particular it doesn't tell you the size of the hidden layer you may need. It may be that you need a hidden layer of enormous size to approximate a particular function. It also tells you nothing about how easy it will be to find these Ws if you do something like gradient descent starting from random weights. And that's something I didn't say before: once you add a hidden layer that is not linear, your loss function is not convex anymore. Logistic regression has a convex loss function, which is great: you choose a random starting point, you run gradient descent, you arrive at the minimum. Here in this lecture you can forget about that. Once your model has a hidden layer, the loss is not convex, so you start somewhere, you run gradient descent, and you end up in a local minimum. Is it a good local minimum? Is it a bad one? We don't know; there are no guarantees. And in practice, models with one hidden layer do not perform very well. I'm pointing this out because I think it's important: adding just one hidden layer already increases, in principle, the capacity of your model from something that can only implement linear decision boundaries to something that can implement anything you possibly want; it just may be hard to implement and hard to fit.

The solution to that, at least the solution we're discussing today, is to add more hidden layers. What I had before, with only one hidden layer, is sometimes called a shallow network, and we can go deeper; by going deeper I just mean adding a bunch of hidden layers.

This architecture is called a feedforward neural network, as I mentioned earlier. There are no connections going back, no loops in this graph. There are networks with loops; they're called recurrent neural networks, and we're not talking about them today. So this is a feedforward neural network: the input enters over here, goes to the next layer, then to the next layer, then to the next, and then you get the output. It's a very simple, modular architecture. We assume that all the phis, the activations in the hidden layers, are the same everywhere, so the only thing you need to specify, once you've chosen some phi like the rectifier, like the ReLU, is how many neurons you want here, how many here, how many here: how many hidden layers of which size. Then you stack them together and you're done; this specifies your model.

Let's write down the loss for this model. Again, the loss doesn't change; what changes is our prediction h(x), and I will write it explicitly here. The expression becomes a bit ugly, but it's very simple in a sense: you take x, multiply it by W1, pass it through the non-linearity, multiply by W2, pass it through the non-linearity, multiply by W3, and again and again, until you multiply by W4, which in this case is just a vector, so you get one single number in the end, pass it finally through the logistic function, get your output, and you're done.

So in some sense it's still kind of logistic regression, but with this deep process inside. Well, no, it is not logistic regression, that's not correct, because as I said before, logistic regression is a generalized linear model. It is linear because this thing over here, now in red, used to be a linear function of the parameters, and it isn't anymore: this parameter over here, after passing through all these consecutive non-linearities, is of course nothing linear.

So what do we do with this? We want to fit it to the data. How did we do it with logistic regression? We had some training data, we computed the gradient of our loss, and we just ran gradient descent, and that's it. That's what we want to do here too, and to be able to run gradient descent we need the gradient. So now we need to discuss how to compute these gradients. As a reminder, this is the gradient for logistic regression; go back to that lecture if you're unsure how it works, it's pretty important. When you differentiate this loss (which doesn't change in this lecture), you get this, times the gradient of whatever goes inside the logistic function, and in that case the gradient of beta x with respect to the parameters is just x. So there you have it: the gradient of logistic regression. We now have this monster over here that we need to take the gradient of.
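Before working out those derivatives, here is a minimal sketch of the forward computation we are about to differentiate: a loop of "multiply by W, pass through the non-linearity", ending with the logistic function. The layer sizes are illustrative, and the intercepts are kept in separate b vectors.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # weights = [W1, ..., W_L], biases = [b1, ..., b_L];
    # every hidden layer is "linear combination, then ReLU",
    # and the final non-linearity is the logistic function.
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)
    z = weights[-1] @ a + biases[-1]   # final linear combination (one number here)
    return sigmoid(z)[0]               # h(x)

# illustrative layer sizes: p = 10 inputs, two hidden layers, one output
sizes = [10, 32, 16, 1]
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
print(forward(rng.normal(size=10), weights, biases))
```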
So we need partial derivatives of this with respect to these Ws: W4, W3, W2 and W1, and you can imagine that computing these derivatives becomes pretty messy. We will still do it; it's a useful exercise. To do it, I find it easier, at least pedagogically, to write everything down in coordinates. Here I wrote it as a matrix acting on a matrix; I will now write everything in coordinates, because I think it's easier to compute derivatives there. Every matrix multiplication becomes a sum, and all these a, b, c, d are indices that you sum over when you multiply two matrices together. You can start unpacking it from the inside: here we just have a matrix transforming a vector, where x_i is one particular training sample, and out comes another vector, so we went from dimensionality p, our input dimensionality, to whatever the dimensionality of the second layer is. We've got a vector over here. Then you pass it through the non-linearity (which simply means every element of this vector is transformed with this non-linearity), and out comes another vector; then you multiply by another matrix, out comes another vector, and so on, until here, where a vector is multiplied by a vector and you get a scalar output that I will call z. And z is just this entire thing over here.

This is not a gradient yet; it's the same expression rewritten in coordinates. So how do we compute the derivatives? Conceptually it's not complicated: we just need the chain rule. I wrote the chain rule down here, but in fact we already used it before: for example, when computing this derivative for logistic regression. If you go back to lecture five, you will see that we used the chain rule: you take the derivative of the loss, inside it sits the logistic, so you take the derivative of the logistic, and inside the logistic sits beta x, so you take the derivative of beta x; that's why you have this term over here. So this is just the chain rule, as it has been known for, I don't know, 300 years, and that's what we're going to use here. On the next slide, which is going to be a bit messy, I'm going to take this expression and compute the partial derivatives with respect to the different Ws. So let's do that.

That's the same expression as before, and now we start computing partial derivatives. It's easier for the Ws that are closer to the left here. If you want the partial derivative of this with respect to W4, a particular element of W4 that I call a: note that in this expression a is something I sum over, but here I don't sum, I just ask for the derivative with respect to, say, element five of W4. You can plug in five instead of a and this gives you the partial derivative with respect to the fifth element of the W4 vector. Well, this is actually just a linear combination over here, right?
It's a sum of these Ws times something, and if you want the derivative, you just pick the element that corresponds to the a you want, and this "something" is your derivative. It's a linear function that you're differentiating, so that's super easy: you just get this entire phi of whatever is inside. That's your derivative; you're done here. Pretty simple.

Then we proceed to the next one, the weights one layer down (W3). What do we do here? Now the chain rule enters. We take the derivative with respect to that weight; again we have a in here, so the sum falls away, and inside we need the derivative: the derivative of this entire thing with respect to this phi, which is just this W4, so it goes in here. Then, and that's the chain rule, you need the derivative of phi with respect to whatever is inside it, so you get this term over here. Pause for a second and remember that the derivative of phi is either zero or one, that's all; so even though this looks like a complicated expression (I even just put dot-dot-dot in there), it can only be zero or one. If it's zero, you're done, everything is zero, you don't need to compute anything more; and if it's one, it just drops out. So it's actually simpler than it looks. Then we need the derivative of what is inside that phi with respect to this W, and that's again just the phi from the layer below. Okay, so we've computed it.

Now we continue this exercise and compute the derivative with respect to W2. I'll stop explaining every step; you just see that you apply the chain rule once more, you go one level down, and this sum now actually survives, because only b and c enter here, so you have to sum over a, and you get something like this. I mark in this brick colour everything I get from the chain rule, and this part is kind of a leftover; it also comes from the chain rule, but it's the part of the expression we don't need to decompose further. Finally we get the derivative with respect to the first-layer weights, and then you go all the way down and there is no black colour left, because you need the derivative of the innermost thing with respect to W, which is just x, so you have x here and you're done.

This looks like a total mess, but conceptually it's pretty simple. Of course, if you want to implement it from scratch it's a pain, because you need to make sure all the indices are correct and so on, but conceptually it's not very complicated. And note: this is not backpropagation yet. Everybody has heard about backpropagation; you know that neural networks are trained using backpropagation. I haven't mentioned it yet, because so far this is just an application of the chain rule. In order to do gradient descent you need the derivative with respect to every single weight in your model; you could just compute them with these formulas and then update every weight with a gradient descent step in the direction of its corresponding derivative. No backpropagation here yet. So what is backpropagation?
Backpropagation is just a smart, efficient algorithm to compute all these derivatives. Because if you look at this, you see that many expressions appear several times, so if you naively compute this one, then these, then that one, you will be repeating some calculations over and over again, and for a large network that is super inefficient. Looking at it more closely, you see that you can compute these red parts starting from the top: here there's nothing to compute; here you have this thing; and once you have computed that (only the red part), then when you need this object here, it's the same thing, you just sum it over a with this additional factor. So you take something you already computed and sum over it, and you have this entire thing; and when you go one level down, you take this entire thing, multiply it by something else, and sum additionally. This happens from the top layer of the network, the fourth layer in my example, then proceeds to the third, the second and the first. So you can compute all these parts that come out of the chain rule by first computing them for W4, for all elements of W4, then combining them, multiplied by something else, to get the derivatives for W3, then combining those, summed with additional weights from the third layer, and you proceed from the end towards the input. That's called a backward pass through the network, and it can be implemented very neatly as a loop over layers that gives you all these gradients; in fact much more simply than this explicit derivation, if you reuse the chunks you computed previously.

So there is a forward pass and a backward pass. The forward pass: if you just want to compute the output, the first line over here, you take the inputs, multiply them by the weights, pass through phi, another weights, another phi, another weights, another phi. That's a forward pass through the network. And when you want to compute the gradient, it's efficient to do it with a backward pass; that's why it's called backpropagation.

So the summary is: backpropagation is just an efficient way to implement the chain rule. There are different ways to explain or introduce it; one can do it more abstractly, by explaining how to combine the gradients at level l to get the gradients at level l minus one. I thought it would be clearer to write it down very explicitly for this example; I'm not sure whether that's actually more helpful or not, but that's what it is.

Great, so now we know how to compute the gradient of this layered, hierarchical model, the feedforward neural network. What do we do with it? We just run gradient descent. There are two remarks I want to briefly make on this slide, but won't have time to discuss further, even though each of them could probably be a lecture of its own.
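Before getting to those two remarks, here is a minimal sketch of the forward and backward pass just described, for the binary case with ReLU hidden layers and a single training example. It stores the intermediate values during the forward pass and then walks backwards through the layers, reusing the accumulated error signal (the "red part") at every step. Shapes and names are illustrative.

```python
import numpy as np

def relu(a): return np.maximum(a, 0.0)
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, y, weights, biases):
    # Forward pass: keep the input of every layer, the backward pass needs it.
    activations = [x]
    pre_acts = []
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = W @ a + b
        pre_acts.append(z)
        a = relu(z)
        activations.append(a)
    h = sigmoid(weights[-1] @ a + biases[-1])   # h(x), shape (1,)

    # Backward pass: start from dLoss/dz at the output and move towards the input.
    grads_W = [None] * len(weights)
    grads_b = [None] * len(biases)
    delta = h - np.array([y])                   # derivative of the log loss w.r.t. the output score
    grads_W[-1] = np.outer(delta, activations[-1])
    grads_b[-1] = delta
    for l in range(len(weights) - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * (pre_acts[l] > 0)  # ReLU derivative is 0 or 1
        grads_W[l] = np.outer(delta, activations[l])
        grads_b[l] = delta
    return h, grads_W, grads_b
```

A handy way to sanity-check code like this is to compare each entry of the returned gradients against a finite-difference approximation of the loss.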
So, first of all: usually, or very often, it's not the raw, vanilla gradient descent that is used to train neural networks. First, the learning rate: there are many algorithms that adaptively change the learning rate, so that the learning rate for different weights can be different. If you see that one weight is not being updated much, you may want to increase the learning rate for it. Algorithms like Adam, for example, and many others, try to make smart changes to the learning rate adaptively, depending on how the weights behaved in the previous gradient descent iterations. Another very common thing is called momentum. Imagine you have some loss surface, and imagine yourself sitting on this loss surface; the gradient points down, so you make a step downwards, and now you're in a new place, the gradient points in a somewhat different direction, so you go there. That's what gradient descent does. A momentum term keeps track of the previous directions you followed and doesn't let you deviate from them too quickly. If you were going downwards in one direction and suddenly the gradient points to the right, vanilla gradient descent would turn right immediately; with momentum you take some compromise between continuing in the same direction as before and turning right. This makes your optimization trajectory smoother in a sense, and it can actually help a lot, because the optimization path is less noisy.

The second remark is that, in another sense too, gradient descent as I introduced it previously is not what is used: in practice something called stochastic gradient descent is used instead. That's easy to explain. Gradient descent, as we had it in previous lectures, computes the gradients with respect to the weights by summing over the entire training set; the gradient depends on the training data, obviously, so I always had this sum over i in the gradient formulas. So I need to take my entire dataset, put it through these formulas, get the gradient, and then make an update. If you have a lot of data, this takes a lot of time, and you need to do it at every single gradient descent step, so it's pretty expensive. What is usually done when training neural networks is to use small, so-called batches of the training data.
So you have training data with maybe a million examples. You split it into batches that are pretty small, maybe 10 examples each, or let's say a hundred. You take 100 examples randomly out of your entire training set, and you compute the gradient by summing over just these 100 examples. This gives you some direction, and you make a step in that direction. For the next gradient descent step you take the next hundred examples and look at the gradient you get when you sum over this next batch; then the next batch, and the next, and the next. So each gradient descent step uses a different part of the training data, as long as there is some left; at some point you run out of batches, and that's called an epoch. At that point you have gone over your entire training data, but you didn't make just one step; you maybe made a thousand steps, because you had a thousand batches. The word "stochastic" in stochastic gradient descent means that you never actually follow the full gradient that the training set would in principle allow you to compute; you always follow some approximation to it that you get from the different random batches of your data. And what happens when you have run through all the batches? You just do it again. The length of the training is usually measured in epochs; you can find statements in papers like "the network was trained for 100 epochs", which just means there were some small batches, you looped through the training set, then you did it again and again, 100 times. That's 100 epochs. So that's SGD, stochastic gradient descent, for you.
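Here is a minimal sketch of such a training loop, with a momentum term included as one example of modifying the raw update, as mentioned a moment ago. It assumes some function grad_fn(params, X_batch, y_batch) that returns the gradients of the loss on that batch, for instance built from the backward pass sketched earlier; the names, batch size and learning rate are all illustrative.

```python
import numpy as np

def sgd_with_momentum(params, grad_fn, X, y, lr=0.01, momentum=0.9,
                      batch_size=100, n_epochs=10, seed=0):
    # params: list of weight arrays; grad_fn(params, Xb, yb) -> list of gradients
    rng = np.random.default_rng(seed)
    velocity = [np.zeros_like(p) for p in params]
    n = X.shape[0]
    for epoch in range(n_epochs):                  # one epoch = one pass over all batches
        order = rng.permutation(n)                 # reshuffle the training set
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # one small random batch
            grads = grad_fn(params, X[idx], y[idx])
            for p, v, g in zip(params, velocity, grads):
                v *= momentum                      # keep part of the previous direction
                v -= lr * g                        # mix in the new (noisy) gradient
                p += v                             # update the weights in place
    return params
```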
Okay, another remark. If we go back to this 2012 paper that showed how a deep neural network can be super successful in classifying images: they did it on a dataset called ImageNet, which has small photos of different objects, with a lot and a lot of different classes. As I said before, there are cars, different animals, different breeds of dogs, and whatnot, and given an image you need to say which class it is. So it's not a binary classification problem, and it's not even 10 classes; I think there are a thousand different classes. We actually talked about this briefly in the lecture on logistic regression: if there are K classes, you just replace the logistic non-linearity with the so-called softmax non-linearity, and each of the softmax units predicts the probability that the input belongs to class k. Then you can very easily replace the loss function we had for two classes with one for any number of classes: every time, you take the logarithm of the probability that the model predicts for the true class of that sample. So if it's a picture of a ship, you take the log probability that your model assigns to "ship", that enters the loss, and you sum over all training samples. I won't explain this in more detail, but it's exactly the same as multinomial logistic regression, and it can be implemented in a neural network in exactly the same way.
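As a minimal sketch of that softmax output and the multiclass loss: given K raw scores per image, softmax turns them into K probabilities that sum to one, and the loss takes minus the log-probability assigned to the true class, averaged over the samples. The small score matrix below is made up purely for illustration.

```python
import numpy as np

def softmax(scores):
    # scores: (n, K) raw scores; subtract the row maximum for numerical stability
    shifted = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)   # each row now sums to one

def cross_entropy(scores, labels):
    # labels: (n,) integer class indices in {0, ..., K-1}
    probs = softmax(scores)
    n = scores.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels]))

# e.g. 4 images, 3 classes; a higher score on the true class gives a lower loss
scores = np.array([[ 2.0, 0.1, -1.0],
                   [ 0.0, 3.0,  0.5],
                   [ 1.0, 1.0,  1.0],
                   [-2.0, 0.0,  4.0]])
labels = np.array([0, 1, 2, 2])
print(cross_entropy(scores, labels))
```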
What I still want to talk about today is something called convolutional neural networks, because all these image classification tasks, in fact all image-based tasks in deep learning, use a convolutional architecture, so it's very important to understand. CNNs use something called weight sharing. Why do we want weight sharing, conceptually? If you have a photo and you need to say whether it's a photo of a dog or not, the dog can be in different places on the photo. You want something like a dog detector: if the dog is in this corner, it should say it's a dog, and if it's in that corner, it should also say it's a dog. There is a translation invariance: you can move the dog around on the image, and the dog detector should output exactly the same probability that it's a dog. This is not necessarily the case with a normal logistic regression model, because there each pixel is its own predictor, so an image with the dog over here is a very different input vector from an image with the dog over there, and the model will not necessarily pick that up; it will not necessarily build a translation-invariant model. So we can hard-build this translation invariance into the model, and that's what the convolutional architecture achieves.

Let's say this is the input image: this is the width, this is the height, these are individual pixels, and this slide explains one layer of a neural network, the convolutional layer. I do the following: I look at a small patch of this image, three by three, and I form a linear combination of these pixels. So I have nine weights; I multiply the values in here, the colour intensities, by these nine weights, and I sum it all up, and that's the value over here. So far that's a normal, fully connected feedforward operation: nine weights for these nine pixels. But then I move one step to the right, take the next patch of nine pixels, and apply the same, identical weights to get the value over here. That's the crucial part, and that's why it's called weight sharing. I'm using the same weights everywhere; this is called a kernel. I have this convolutional kernel of size three by three, with nine weights, and I sweep it through the whole image, always computing this linear combination using the same weights, hoping that this is, for example, a dog detector, or a detector of something else; I sweep it through the picture and at every position I check whether this looks like a dog or not. That's a very simple example; of course you won't be able to detect a dog linearly, but that's the idea. You use the same kernel everywhere, and you arrive over here. This process is called a convolution, and this is a convolutional layer.

Then, usually, what happens is this. If you do it like that, the value here and the value next to it will often be similar, so in practice one adds a step called max pooling: I take all the elements in this window of the convolutional layer, choose the maximum of them, and output it over here; then I move the window, choose the maximum there, and put it in there. The size of this layer, in terms of width times height, is now smaller than before, and this is simply because everything is very expensive: we have so many layers and so many weights that we at least want this spatial size to go down. So all of that picture describes just one layer of a CNN: a convolution plus max pooling. Maybe here you had a 200 by 200 image, and here you have just a 50 by 50 output coming out of it; it's not really an image anymore, but it's a 50 by 50 output.

Now there are two things that make it more complicated. First, if the image is in colour, then you have R, G and B channels: an entire two-dimensional picture in the red channel, one in the green, and one in the blue. So the input is in fact three-dimensional, with three different channels, as they are called here. How do we deal with that? It means my kernel is not three by three; it's three by three by three. I take these 27 numbers from the image, I have 27 weights, I multiply the 27 numbers by the 27 weights and add them up, and that's the linear combination that goes in here. Then I slide this kernel through the image, do the max pooling by taking the maxima to reduce the size, and then of course I put it through the non-linearity, which may be the ReLU non-linearity, and then I'm done with this one layer. And that's still not quite it, because I don't want to use only one kernel: maybe I have a kernel that is a dog detector, but I also want a cat detector. So I want a bunch of different kernels sliding through the image, and that's what happens here: I have one kernel, then another, then another; let's say I have 10 different kernels. I do exactly the same thing in parallel 10 times, and over here I get 10 of what are called feature maps.
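Here is a deliberately naive sketch of such a layer, just to make the weight sharing explicit: a bank of 3x3x(channels) kernels slid over the image with stride one and no padding, followed by 2x2 max pooling and the ReLU. Real libraries implement this far more efficiently; all sizes here are illustrative.

```python
import numpy as np

def conv2d(image, kernels):
    # image: (H, W, C); kernels: (n_k, kh, kw, C), the same weights reused at every position
    H, W, C = image.shape
    n_k, kh, kw, _ = kernels.shape
    out = np.zeros((H - kh + 1, W - kw + 1, n_k))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw, :]                    # one small patch
            out[i, j, :] = np.sum(patch * kernels, axis=(1, 2, 3))  # one linear combination per kernel
    return out

def max_pool(fmap, size=2):
    # fmap: (H, W, n_k); keep the maximum in each non-overlapping size x size window
    H, W, n_k = fmap.shape
    H2, W2 = H // size, W // size
    windows = fmap[:H2 * size, :W2 * size, :].reshape(H2, size, W2, size, n_k)
    return windows.max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))      # a small RGB "image"
kernels = rng.normal(size=(10, 3, 3, 3))  # 10 kernels with 3*3*3 = 27 weights each
layer_out = np.maximum(max_pool(conv2d(image, kernels)), 0.0)  # conv, pool, then ReLU
print(layer_out.shape)                    # (15, 15, 10): ten feature maps
```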
So the thing over here is, in a way, like the RGB channels: now I have 10 channels that correspond to the different kernels I was using. They survive the max pooling, and I have this three-dimensional block over here. All of that, the convolution, the max pooling and the rectified non-linearity, is just one layer of the convolutional neural network.

Here is how a convolutional neural network in its entirety may look, actually a very simple one; usually you have more layers. You start with an image, which has three channels; you go over here, and this can have, for example, 64 or 128 feature maps and a smaller spatial size. Then you do this again in exactly the same way: another convolutional hidden layer, so now these three-dimensional kernels slide here in two dimensions, and you usually keep increasing the number of feature maps until you have, I don't know, a thousand or even more different feature maps in a layer, while the spatial size of the image shrinks: the number of feature maps grows, the size goes down. You have a bunch of these blocks; these are all hidden layers. Then at the very end, at some point, you say: okay, I'm done with convolutions. You take all the values inside this three-dimensional object, lay them out in one dimension, and now you have a normal feedforward, not convolutional, network, where you form arbitrary linear combinations of these things to get here, and then to get here, and by this I denote a softmax layer corresponding to, for example, the thousand classes I want to predict. This is how a convolutional neural network typically works. You can get rid of the fully connected layers and use a fully convolutional network; that's possible. You can choose all these sizes; building a network like this is a bit like playing with Lego. You construct it out of these building blocks, and only practice can tell you what is more efficient: having more feature maps here, or maybe fewer feature maps there but taking this block and replicating it five times to get five additional hidden layers. You construct it somehow, then you run gradient descent and see what you get.

One note, maybe, on the convolutional architecture: in a sense, one can imagine the same architecture fully connected, meaning without weight sharing. In a CNN I map this onto this and use the same weights for all these mappings; instead, I could connect all pixels from here to all neurons over there, with many more weights, and do this everywhere. In a sense the picture would stay the same, but it would now be fully connected, and this would be a much, much more complicated model than the CNN, because the convolution reduces the number of weights by a lot: all these weights are forced to be the same.
I'm just using this one kernel that, as in the example before, has only 27 weights, and I apply it everywhere; it's always the same 27 weights. So you can see a convolutional neural network as a very strong simplification of the fully connected neural network with the same architecture. Or you can see it as a very strong prior, an infinitely strong prior, saying that the weights in different parts of the image in this layer should be the same. In fact, you can put a particular regularization on the network without making it explicitly convolutional, and something similar to a convolutional structure will emerge from that regularization. So conceptually, I think, one can see the convolutional structure we put in here as a kind of regularization, but a super useful regularization if you're working with images.

Okay, what do CNNs learn? I want to show a few images here. If you train this kind of network on an actual image dataset, then what happens in the first layer is pretty interesting, I think, and it's very easy to visualize, because in the first layer you take these little patches, and that's your kernel; so for each kernel you can directly plot what the patch looks like. A kernel in the first layer is like a small image itself. If there are 96 feature maps in the first hidden layer, I can plot these 96 kernels as images, and here is what comes out. It's pretty remarkable, because always, in all CNN architectures, trained somewhat differently, what happens in the first layer is that you get these kinds of kernels that look like Fourier components in the spatial domain; they're also called Gabor filters. It's actually reminiscent of how some neurons in the retina are attuned to the visual world, so I find it pretty interesting that this comes out of training a convolutional neural network with gradient descent. You also get some colour patches.

But that's the first layer, which is very easy to inspect. What is much more complicated is to analyze what happens in the deeper hidden layers, because once you're far into the hierarchy and you ask, okay, here I am in the fifth hidden layer, what does this neuron do, it's really hard to answer; it's not even clear how to begin answering this question. There's a lot of research that tries to do exactly that: you train a neural network, you're happy with how it performs, then you fix it, and then you do, in a way, the biology of the neural network, the neuroscience of the neural network, trying to understand in words what different neurons in different layers may be doing. That's an active area of research which I find fascinating, and you can read up on it a lot. Maybe there's a neuron that looks like a dog detector: it fires a lot when you show it a picture of a dog. But does it really detect a dog, or does it just detect, say, the nose of the dog?
Maybe it's a snout detector, so if you cut out the nose it will stop firing. Or something that looks like a roof detector may really be a sky detector. Very often it's not easy to answer this, but I think if you look at a network carefully enough and analyze it carefully enough, you can figure a lot of it out; it just isn't easy. But these are beautiful images, and you can navigate through the network, go to different layers, and see what different neurons respond to. It's fascinating, so I totally recommend playing around with it.

Okay, some historical remarks; we're getting towards the end of this lecture. A funny thing is that neural networks, convolutional neural networks, backpropagation, all of that was, in a sense, invented a long time ago. People argue about who exactly invented backpropagation, but it happened in the 60s, or 70s, or maybe 80s; a long time ago. So why did it take until the 2010s, until 2012, to become as popular as it is now? There is no easy answer. Usually people say that one ingredient is computing power: we now have these powerful GPUs that allow us to train these gigantic networks, and before we had them it wasn't clear whether this was actually going to be fruitful. Another ingredient is that you need very large labeled datasets: ImageNet consists of millions of images that have a label, and you did not have that 20 years ago, let alone 40 years ago, and to train a deep neural network it's essential to have a lot of data. And then there are all the different tricks, many of which I didn't even mention today: how exactly you optimize, whether you use momentum or not, how you initialize the weights (this can matter, and actually it matters a lot), what regularization you use, things like dropout, normalization, and so on. There are a lot of tricks that turn out to help with training. Usually people would argue that the first two things are more important than the last: what actually made the big difference is the rapid increase in computing power on which these networks can be trained, and the appearance of large amounts of labeled data. But I think it's really a positive feedback loop between all three: the networks started showing promising results, more and more people started working on them, the results got better, all these tricks got invented, and so on. It's hard to say what exactly started the process, but it was a combination of all of that.

Another thing: in some sense it is obvious that you can take a deep neural network, which has millions and millions of parameters if it's deep enough, so of course it's a very expressive model, and run gradient descent on it; that's clear. What is not a priori clear is, first, that it will not get stuck in some useless local minimum, that it will actually get somewhere useful, maybe not exactly the global minimum but good enough; that is not obvious. And the second thing is whether it will not overfit very badly, because you have such an expressive model.
Maybe it will just overfit, if it's deep enough, so badly that the test performance will be awful. These things are not a priori clear, and if you went back 30 years and asked whether this would work, maybe many people would have said: there's no hope, because it will get stuck; there's no hope, because it will just overfit if it's big enough. Now we know that this doesn't happen, and it was found out empirically.

A few words on this overfitting and regularization issue. We spent a lot of time in previous lectures talking about overfitting and about how you can regularize a model, linear regression, logistic regression and so on, and how important that is. It is also important for neural networks. There are different approaches to regularizing neural networks, which I cannot all cover in this lecture. You can use ridge regularization: you just add this penalty term for the weights of every layer. That's super simple and is usually done, maybe even by default; in the neural network world it's called weight decay, for the reasons I explained back in lecture four. Usually it doesn't play a huge role in this setting, though. For example, tuning the lambda parameter: for ridge regression it's very important which lambda you choose, and we discussed all these bias-variance tradeoffs, but here, surprisingly, it usually turns out not to matter that much.

Another thing I want to talk about is called early stopping, and in a way that's also a regularization approach. I want to show you a sketch of what happens when you train a network. I have epochs on the horizontal axis. I start with some loss over here on the training set, and then I run my gradient descent, and my loss improves and improves and improves; it always improves on the training data. What happens on the validation data, or the test data? My loss also improves at first, because my model gets better, but at some point it starts getting worse again. This means we are overfitting; that's what overfitting is: there's a gap between the training and the validation performance, and it increases. So if you train the network long enough, you start overfitting. That's pretty interesting, and in fact one can show that if you do this for linear regression, if you just run gradient descent on linear regression, then what gradient descent learns first corresponds to the large singular values in the SVD decomposition of X, and only later does it start picking up on the small singular values. So if you stop training at some point, even though on the training set you could still do better, this has an effect similar to the ridge penalty, because, as we discussed, the ridge penalty also mostly suppresses the directions with small singular values.

That was for the linear model. For a non-linear model, a deep neural network, it's impossible, or at least we don't know how, to analyze this mathematically, but the same thing happens: you fit and fit and fit, and at some point, often (not always, but often), you start to overfit if you train more. So what people do is keep training the network while watching the validation loss, and whenever it starts going up, you stop and say: okay, this is my model. So the number of epochs to train becomes a regularization parameter, which is pretty interesting.
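As a minimal sketch of this early-stopping recipe: train epoch by epoch, watch the validation loss, remember the best weights seen so far, and stop once the validation loss has not improved for a while. It assumes some train_one_epoch(model) and val_loss(model) functions; "patience" and everything else here is illustrative.

```python
import copy

def train_with_early_stopping(model, train_one_epoch, val_loss,
                              max_epochs=200, patience=10):
    # train_one_epoch(model) runs one pass of (stochastic) gradient descent;
    # val_loss(model) returns the current loss on the held-out validation set.
    best_loss = float("inf")
    best_model = copy.deepcopy(model)
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        loss = val_loss(model)
        if loss < best_loss:
            best_loss = loss
            best_model = copy.deepcopy(model)    # keep the best weights seen so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                            # validation loss stopped improving
    return best_model, best_loss
```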
Another comment, and the final series of comments, on overparametrization in neural networks. Modern neural networks are typically huge, enormous. They are so large that they can fit pretty much anything in the training data: they can overfit the training set, and you can train them until the error on the training set becomes zero or near zero. Now, if you take a real data set, like ImageNet or a small part of it, and you train your network and get to a loss that is close to zero, you don't really know whether you have overfit or not, because maybe you have just built a very good model. You can check that by taking some data and shuffling the labels randomly. By that I mean: I take some image data set with labels cat, dog, horse, car, and I shuffle them so that every image is now labeled randomly, and then I train a neural network on that nonsense data set. It turns out that if you use the same architecture as is normally used and you train long enough, you can get to a training loss of zero, or close to zero. It will just overfit; it will memorize, for each image, what its label is, even though the labels are nonsense. And this is a clear sign that the capacity of the model is so large that it simply overfits the data. Of course this depends on the sample size and the model complexity, but this is basically the p-much-larger-than-n regime that we discussed before for linear models. Neural networks, probably not all of them and probably not for all data, but very often, are in this overparameterized regime in practice.
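A self-contained toy version of this shuffled-label check might look as follows; it uses random inputs and completely random labels with a deliberately over-sized network, and it assumes PyTorch. It is only an illustration of the idea, not the ImageNet-scale experiment described above.

```python
# Toy, self-contained sketch of the random-label memorization experiment:
# random inputs, random labels, and an over-sized MLP that memorizes them.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, n_classes = 200, 50, 10
X = torch.randn(n, d)                    # "images" with no structure at all
y = torch.randint(0, n_classes, (n,))    # labels assigned completely at random

model = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, n_classes))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

print(f"final training loss: {loss.item():.4f}")  # typically close to zero:
# the model has enough capacity to memorize an arbitrary labeling.
```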
At the same time, you train them, your loss goes close to zero on the training set, but your generalization performance is still high: the network still performs pretty well, it can tell dogs from cats from cars on the left-out, hold-out validation data or the test data. This is what I called, or rather what people call, benign overfitting: you do overfit in some sense, but in some other sense you don't, because your test performance is pretty good. This only happens because there is some implicit regularization in the process: through how the network is structured and how it is trained with gradient descent, we arrive at a solution that does not just memorize all the labels in the training set and fail to generalize, even though that would be possible, as the shuffled-label experiment shows. We arrive at a meaningful solution. It's like the linear regression we talked about in the overparameterized regime: there are different beta-hat vectors, here different sets of weights, that can fit the training data equally well, but somehow the neural network arrives at a good set of weights that does generalize. Why exactly this happens is an active and fascinating area of research, and I think it is still largely not understood, or very poorly understood; you can analyze it in very simple toy situations, but why it works so reliably for these deep networks, nobody really knows. But that's important.

A funny thing that you can sometimes see: whenever you have a deep classifier that you overfit by training too long, as I showed on the previous slide, with epochs again on the horizontal axis, the test loss goes down and then goes up, so at that point you start overfitting by training more. But if you look at the accuracy, which just goes from zero to a hundred percent, it grows and then keeps growing, or at least it plateaus. So we see that the loss goes up, meaning we do start to overfit, but somehow you don't see this in the accuracy. That's a curious thing that actually happens very often in practice, and the reason it happens can be understood even in logistic regression itself; I briefly touched on that in the lecture on logistic regression. That's what happens when your classes are completely linearly separable: you run gradient descent, and logistic regression finds a good direction to tell one class from the other, and once it has found it, it can predict everything perfectly on the training set, which means it becomes more and more confident; it becomes overconfident. This corresponds to the norm of the beta vector growing to infinity. Then the loss on the test set goes up, because you are too confident, but the accuracy actually doesn't change, even on the test set.
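This effect is easy to see in a tiny numeric example: two classifiers that predict exactly the same classes (so they have the same accuracy), where the overconfident one pays a much larger log loss on the example it gets wrong. The numbers below are made up purely for illustration.

```python
# Tiny numeric illustration: same predicted classes (same accuracy),
# but the overconfident classifier has a much larger log loss.
import numpy as np

y_true = np.array([1, 1, 0, 1])                        # true classes
p_calibrated    = np.array([0.80, 0.70, 0.30, 0.40])   # P(class 1), moderate confidence
p_overconfident = np.array([0.99, 0.99, 0.01, 0.01])   # same predicted classes, extreme confidence

def log_loss(y, p):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def accuracy(y, p):
    return np.mean((p > 0.5) == y)

for name, p in [("calibrated", p_calibrated), ("overconfident", p_overconfident)]:
    print(name, "accuracy:", accuracy(y_true, p), "log loss:", round(log_loss(y_true, p), 3))
# Both misclassify only the last example, so accuracy is 0.75 in both cases,
# but the overconfident predictions roughly double the log loss.
```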
The last curious effect that I wanted to mention, and the last figure for today: one way to look at this overparameterization issue is to ask what happens if I make the network deeper and deeper, or the hidden layers wider and wider, starting from a really small network that can't fit the data very well. So this is the model complexity axis: imagine you have a fixed neural network architecture and you just make the hidden layers wider, which corresponds to moving to the right along this axis. What happens to the test loss? Here, on the left, you are underfitting: you have low variance but high bias, and performance is bad. Then the model becomes more complicated and performance improves. At some point you get to a regime of low-bias but high-variance solutions, until you reach the interpolation threshold, as we called it for the linear model. Beyond this point your model capacity is so large that you can fit any data you throw at it, even random data. And amazingly, the test-loss curve then goes down again. I mentioned that even for linear models this can happen, and it happens in practice for deep networks all the time. Not only does the curve go down again, it can actually drop below its earlier minimum. So you can get the best test performance by going into what might seem like a hopelessly overparameterized regime.

And this only works because there is some almost magic implicit regularization, which we don't fully understand, built into gradient descent and the model architecture, that despite this overparameterization gives you a good set of weights that do generalize.

Okay, so I want to finish with that, as a kind of little mystery, or actually a huge mystery. It also makes clear that this is very much ongoing research: we have these models, they work amazingly well, they drive cars and fold proteins, but in some sense we don't really understand how they work. Thank you