Just a quick heads-up on the plan for the coming couple of lectures. Today we're going to have a full lecture by Yann. Next week we'll have another lecture for the first hour, and in the second hour I'll be doing the practicum; then on Thursday there's going to be a practicum, and the second homework is coming out. All right, that's it for announcements. Have a nice lesson.

Okay — recurrent nets and convolutional nets. I'm going to talk about more architectures today than we covered last time, but last time we laid the groundwork for the components we're going to use in the big architectures that are very popular and widely used in various contexts.

I'm going to start with something I vaguely alluded to last week but didn't go into in detail: the idea of hypernetworks. This is the idea that you have a neural net whose parameters are themselves the output of another neural net, which may depend on its own parameters or something else, but also on the same input as the network itself. Some people have called these hypernetworks. This is really a general form of an idea that goes back to the 1980s, where, as I explained last week, you could have one neural net determining the weights of another in a linear manner, for example. That creates what are called Sigma-Pi networks, with sum-of-product units, where you have products of states instead of just weighted sums — weighted sums of cross products of different unit activations. That would be a very simple form of this kind of network. But we can have more complex forms, where this H network here could be an entire, very large network with millions of connections, and this G network could also be fairly complicated, although very often they are actually fairly simple. So this is just a little bleed-over from the last lecture into today's, if you want.

One other thing I talked about last week is the idea of shared weights. Suppose you have a module — a neural net of some kind; it could be an entire network, a single layer, a single unit, or a group of units — and you want to replicate it because you're interested, for example, in detecting a motif in an input regardless of where it occurs. Say you want to build a machine that monitors the heartbeat, the brainwaves, the electroencephalogram, or whatever, of a patient, decides whether it is normal, and detects an event — maybe a heart arrhythmia or something like that. You have the sequence of signals coming in, and what you want to do is train a detector that detects this motif regardless of where it appears in the input. What you do is build a little neural net and slide it over the input. Of course, if it's running in real time, you just reapply the neural net as the signal shifts. But during training you may not be able to tell the system exactly where the event occurs — you know it occurs somewhere in a region, but not exactly where. So the way you train a system like this is that you replicate the neural net over multiple locations and then take a max of the outputs. Imagine this is a detector, so it has a single output that can have a high score or a low score; you take the max over this region.
So the system will basically tell you "somewhere in there I detected something," and what you can do now is backpropagate the gradient through this entire module. If the system is partially trained and it detects a motif, that will be the largest output of all of the replicas here, and you will see it at the output. When you backpropagate — because a max is really a kind of switch, and we talked about that last time — the gradient will propagate through that particular instance, and you will change the weights of that particular instance to tell it: you actually detected a signal, I'm going to change your weights so that your output is even larger. It leaves the other ones alone. But the other ones are replicas — they are the same as the particular network you just updated. So you essentially have a single weight vector that is shared among all of those copies, all those replicas.

Now, this is a simple case where only one copy actually gets a gradient. But imagine that this operation here is something like a softmax. Then you don't know which motif within this window is the closest to what you're interested in detecting; you get a non-zero activation for each of those, you plug them into a softmax, and you get outputs for each of them without knowing which one is correct. So when you backpropagate you get gradients for every single one of them, and because the weights are shared among all those networks, you have to sum the contributions of the gradient from all of those copies. That's because the weight here is, in effect, replicated by a branching Y connection over all of these copies, and we've seen that when you have a wire that generates multiple copies of itself — a Y connection — when you backpropagate you have to add the gradients.

This is an important point, which I'm going to repeat. If you have a single variable x and this variable is replicated in the forward pass into y_1, ..., y_n, then in the backward pass, given the gradients with respect to the y_i, the gradient with respect to x is simply

  dC/dx = sum over i of dC/dy_i.

So a Y connection going one way is an addition going the other way: branching out in the forward prop corresponds to a sum in the backward prop, and vice versa — if you had a sum in the forward prop, it would correspond to a copy, a replication, in the backward prop. So this module is its own transpose, if you want. That's a way to explain why, when w is branched out to multiple modules, we have to do the sum when we backpropagate.

Now, PyTorch does this automatically for you. Whenever you backpropagate through a module, the gradient computed with respect to the parameters of that module is not written into the gradient field of that parameter vector — it is accumulated into it. The reason it is accumulated is so that you get this implicit sum when parameters are shared across multiple copies of a network, or part of a network.
Maybe we should repeat that last sentence once more, just so everyone remembers exactly what was said — the one about the PyTorch accumulation, because it was a surprise last time.

Right. So in PyTorch, as I think Alfredo may have shown you, when you do a backpropagation through a network, the first thing you have to do is set the gradients to zero — reset the gradients before you do the backprop. That's because, implicitly, when you backprop through a module that has parameters, the gradient with respect to the parameters is accumulated into the gradient field. So if there was a non-zero value there before and you do a backprop, the gradient you just computed gets added to whatever was there. That's why you have to set it to zero before you backprop. And the reason for this implicit accumulation is for situations like the one we're describing here, where you have multiple copies of the same network — you're doing multiple backprops, say on multiple samples in a batch, or because you have multiple copies of the network looking at different parts of an image — and implicitly you want the gradients to be combined into the gradient with respect to the parameters. That's really the reason why gradients are implicitly accumulated in PyTorch.

Also, Karl is a bit confused: you said we have a max as the reduction and not a sum, so can you clarify the two? Right — the max is the reduction of the output, while the sum is the reduction of the gradient.

Okay. Why the sum is the reduction for the gradient is just math — it's an application of the chain rule. If I have a variable x, and this variable influences other variables y_1, y_2, y_3, and those go into some function that produces a cost, call it C: if I wiggle x by some amount, it's going to cause y_1, y_2, and y_3 to wiggle by the same amount. Now, if I want to know by how much C will wiggle as a result of x being wiggled: C will wiggle by dC/dy_1 times the wiggling of y_1, plus dC/dy_2 times the wiggling of y_2, plus dC/dy_3 times the wiggling of y_3. But each of those wigglings is equal to Δx. So the wiggling of C is equal to Δx times (dC/dy_1 + dC/dy_2 + dC/dy_3). If I divide through by Δx and write it as partial derivatives rather than in differential form, I get exactly the formula above. So that's just the demonstration that if a signal is copied, when you backpropagate you have to sum up the contributions.
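To make the accumulation behavior concrete, here is a minimal sketch — the detector, shapes, and reductions are made up for illustration — showing that PyTorch adds gradients into `.grad` across successive backward passes, which is exactly what implements the sum over shared copies, and why you call `zero_grad()` before each new update:

```python
import torch
import torch.nn as nn

# A tiny detector whose weights are shared across every location it is applied to.
detector = nn.Linear(5, 1)

x = torch.randn(20)          # a 1-D input signal (illustrative)
windows = x.unfold(0, 5, 1)  # all length-5 windows, one per location

# Apply the same (shared-weight) detector at every location, then reduce with max.
scores = detector(windows).squeeze(-1)
scores.max().backward()      # max is a switch: only the winning copy gets gradient
print(detector.weight.grad)  # gradient from that one location

# Without zeroing, a second backward pass ACCUMULATES into .grad:
scores = detector(windows).squeeze(-1)
scores.sum().backward()      # a soft reduction: every copy contributes
print(detector.weight.grad)  # previous gradient plus the sum over all locations

# This is why training loops reset gradients before each step:
detector.zero_grad()
```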
Right — so, recurrent nets. Recurrent nets are another example of this. You know, I lied to you in a previous lecture when I told you that a deep learning system needs to be a directed acyclic graph, that you can't have loops in the graph of connections between the modules. As a matter of fact, you can have loops, but you have to play a trick, which is to unroll the loop, and this is what we generally do with recurrent nets. A recurrent net is a network with a loop, and a simple form of it is this one. You take an input x and run it through an encoder, which gives h(t) at a particular time t; that goes into a function G, which is yet another neural net. G takes as arguments this h(t), the previous state of the system — z(t) is the state of the system, so it takes z(t−1) — and a set of parameters w. That's why there is this delay block here: the delay block says, whatever z(t) comes out here, I'm going to store it so that I can reuse it at the next time step as an input to that module G. So it's a recursive equation, z(t) = G(h(t), z(t−1), w), where time is discretized. Then you take the state z(t), run it through a decoder, and that produces the output y-bar, which is the decoder applied to z.

Why is this interesting? Because it allows a neural net to have a memory, essentially. You can imagine that the state will store the past history of what happened to x, and this may be useful for processing sequences. Here t is a time index, but this can also be useful for sequences that are not temporal but spatial, and for situations where something has an internal state. Think of a finite state machine, for example — if you've taken computer science courses, you know what a finite state machine is. You can represent a finite state machine by a recurrent neural net with an appropriate definition of G.

So how are you going to train such a system, such a recurrent net? The idea of recurrent neural nets goes back a long time, by the way, and training them with backprop goes back to the 1980s, so this is a very old concept. The way we train a system like this is that we unfold it in time. The input is going to be a sequence x(0), x(1), x(2), and so on. x goes into the encoder and produces h; h goes into the G function, which takes into account the previous state z(0) and produces z(1), and so on. So now what we have is an unfolded recurrent net, and it is a directed acyclic graph. The key point is that the parameters of the G function are the same at different time steps. So we have another example of replication of a module inside a neural net: multiple copies of this G network with the same parameters. When we backpropagate through this structure, we have to sum up the contributions of the gradient from each of the time steps and accumulate them into the gradient for w. I didn't write it here, but of course the encoder and the decoder also have trainable parameters, and they too are shared across time.
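Here is a minimal sketch of that picture in PyTorch — the encoder, the G function, and the decoder are tiny linear layers with made-up sizes, chosen only for illustration. The same `G` module (hence the same w) is reused at every time step of the loop, and autograd accumulates the gradient contributions from all the steps into `G`'s single set of parameters when `backward()` is called:

```python
import torch
import torch.nn as nn

enc = nn.Linear(3, 8)    # encoder: x(t) -> h(t)
G = nn.Linear(16, 8)     # state update: [h(t), z(t-1)] -> z(t), shared over time
dec = nn.Linear(8, 2)    # decoder: z(t) -> y(t)

x_seq = torch.randn(10, 3)        # a sequence of 10 input vectors
z = torch.zeros(8)                # initial state z(0)

loss = 0.0
for t in range(x_seq.shape[0]):
    h = torch.tanh(enc(x_seq[t]))
    z = torch.tanh(G(torch.cat([h, z])))   # z(t) = G(h(t), z(t-1), w)
    y = dec(z)
    loss = loss + y.pow(2).sum()           # stand-in for a real cost

loss.backward()
# G was applied 10 times with the same weights; G.weight.grad now holds the sum
# of the gradient contributions from all 10 time steps.
print(G.weight.grad.shape)
```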
Okay, so here is how you train a recurrent net: you unfold it in time, you do backpropagation in this new network, which is unfolded in time, and because the weights are shared across different time steps, the gradients get accumulated over the different time steps. Again, in PyTorch you don't actually need to think about this explicitly: when you write your recurrent net as a loop, PyTorch figures out that this is a loop that needs to be unfolded in time, and when it backpropagates it accumulates the gradients. This is taken care of for you — but that's essentially what happens.

Now, it turns out it's very difficult to train recurrent nets, because very often you want to train a network of this type with lots and lots of time steps. It could be a long sequence — say you're doing speech recognition, for example — and the sequence may have several hundred or several thousand inputs. So when you backpropagate, you're backpropagating through hundreds of layers, essentially: you can think of the G functions as layers now. Every time you backpropagate through a layer, depending on the gain of that layer, your gradient may get smaller or bigger. Imagine the G function is linear: it takes z, multiplies it by some matrix, adds h multiplied by some other matrix, and that's the output. Imagine the matrix by which we multiply z to produce the next z is just a rotation — it takes a vector and rotates it. That would be good, because the z vector preserves its length as you multiply it by w every time; the only thing that happens is that it rotates. And when you backpropagate through a matrix that preserves the length of a vector, the norm of the gradient is preserved too: if you have a gradient of a particular norm and you multiply it by a matrix that is just a rotation, the gradient you get with respect to the input has the same norm as the gradient you were given with respect to the output.

But now imagine the weight matrix has small values, so that whenever you give it a vector, that vector gets shrunk by, say, a factor of two — just for the sake of argument. As you go through the layers, the states shrink progressively, and as you backpropagate, you may start with a large gradient, but the gradient also gets shrunk, because the transpose of the matrix shrinks the vector by essentially the same amount as in the forward pass. So you get what's called the vanishing gradient problem: every time you backpropagate through a layer, your gradient shrinks and shrinks, exponentially.
You start with a gradient of norm one and you get norm one half, one quarter, one eighth, one sixteenth, and so on, so you don't get much of a gradient after a hundred layers. Which means your recurrent net is going to act as if it didn't have that many layers, because by the time you get past a few layers your gradients are basically completely killed. Conversely, imagine that the weight matrix by which you multiply z has a large norm, so it makes the vector z longer, say by a factor of two. Every time you go through one of those time steps your vector grows larger, and when you backpropagate it's the same thing: the gradient gets larger and larger and explodes. That's the exploding gradient problem.

People realized this in the early 90s and wrote papers about it — Yoshua Bengio, Jürgen Schmidhuber, and various other people — and asked what could be done to fix it. Of course, if you have nonlinearities like sigmoids in the system, they will limit the explosion, but not the vanishing gradient. People pretty much abandoned the idea of using recurrent nets in the mid 90s — most people actually abandoned the idea of using neural nets altogether in the mid 90s, but recurrent nets in particular.

Can you say again what leads the norm to grow or to diminish?

So imagine you have a state vector, say z(t), and it passes through a module that just multiplies it by a matrix: you get z(t+1) = W z(t). If you have an input x(t), say you just add it, but let's set x(t) to zero so we can ignore it. So you have this recurrence equation where you take a vector and multiply it by a matrix to get a new vector, and what happens depends on this matrix. Say z is two-dimensional. If the matrix is diagonal with ones on the diagonal — the identity — then z(t+1) is just equal to z(t). If it's a permutation matrix, it takes z_0 and z_1 and flips them, so z(t+1) is the same as z(t) but with the coordinates swapped. If it's a rotation matrix — cosine theta, minus sine theta, and so on — it just rotates the vector. All of those matrices have eigenvalues of magnitude one; they are so-called orthogonal matrices, and they don't change the norm, the length, of a vector — they just rotate or rearrange it somehow. But what if I use a matrix like two times the identity? Now if I give it a vector (z_0, z_1) I get (2 z_0, 2 z_1): the vector gets longer. If I put one half on the diagonal, the vector gets shorter. And of course you can imagine all kinds of matrices with non-zero off-diagonal terms that rotate your vector in some ways and stretch it in others, making it either longer or shorter.
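A quick numerical illustration of that answer — a toy sketch with arbitrarily chosen matrices — showing how the norm of the state evolves under repeated multiplication by a rotation, a shrinking matrix, and an expanding matrix:

```python
import math
import torch

def run(W, steps=100):
    z = torch.tensor([1.0, 0.0])
    for _ in range(steps):
        z = W @ z                      # z(t+1) = W z(t)
    return z.norm().item()

c, s = math.cos(0.3), math.sin(0.3)
rotation = torch.tensor([[c, -s], [s, c]])   # orthogonal: eigenvalues of magnitude 1
shrink = 0.5 * torch.eye(2)                  # eigenvalues 1/2
grow = 2.0 * torch.eye(2)                    # eigenvalues 2

print(run(rotation))   # ~1.0     norm preserved
print(run(shrink))     # ~8e-31   state (and, via the transpose, the gradient) vanishes
print(run(grow))       # ~1.3e30  state (and gradient) explodes
```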
So the problem is this: if the matrix preserves the norm, the state keeps the same norm as you go through the layers, and when you backpropagate — you multiply by the transpose of the matrix to get the gradient — the length doesn't change either. But with matrices where the length of the vector changes, the expanding one causes an explosion of the length of the state, because the length of the state vector gets multiplied by two every time you go through a layer, and the gradient also gets multiplied by two every time you backpropagate, because you use the transpose of the matrix to do the backprop, which also multiplies by two. The shrinking one, on the other hand, shrinks the vectors: the state shrinks exponentially, and the gradients also shrink exponentially when you backpropagate. Does that answer the question? Yes, I believe so. Okay.

Right. So what do we do to prevent the gradients from exploding or shrinking? There are a bunch of tricks; I'm not going to go through the details of each of them, but I'll give you the basic ideas, and some of them are implemented in PyTorch so you can use them. You can clip the gradients: when you backpropagate, if the gradients get larger than a certain size, you just clip them to a maximum value, or you normalize the length of the gradient vector at every layer — this is gradient clipping or gradient normalization. Leaky integration — I'm going to skip this because I'll talk about it later. Momentum helps a bit, but not much. Initialization: it's very important that the weights be initialized in such a way that you don't, by default, from the start, get exploding or shrinking gradients. And the biggest tricks, in fact, are tricks that change the architecture in such a way that the gradients don't vanish — that's what we're going to talk about in just a second.
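For the clipping and normalization tricks, PyTorch provides utilities you call between `backward()` and the optimizer step; here is a minimal sketch (the model, sizes, and loss are placeholders):

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=4, hidden_size=16, batch_first=True)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 50, 4)          # batch of 8 sequences, 50 time steps each
out, _ = model(x)
loss = out.pow(2).mean()           # stand-in loss

opt.zero_grad()
loss.backward()
# Rescale all gradients so their total norm is at most 1.0 (gradient normalization),
# or clamp each element to a maximum value (gradient clipping).
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
opt.step()
```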
Before I get to that, here is another type of recurrent net which is somewhat popular for certain applications, particularly in physics, computer graphics, and various other areas: dynamical systems and time-series prediction. This is a situation where the type of signal you're trying to model is, say, a physical observation of something that occurs in the world and obeys a differential equation — you're observing, I don't know, the position of a ball being thrown, or something like that. The state is the state of the system you're observing: x would be the observation, and y would be some sort of prediction of what's going to happen. The internal state of the system — the position and momentum of the ball, for example — is represented by z(t), and you have a differential equation like this, which you'd like your neural net to learn by observing a sequence of x and y, or just a sequence of x. So imagine there is an equation that your data obeys; you're going to model G by a neural net.

What you can do is discretize this differential equation. By discretizing, you can write the equation here at the bottom: z(t) is equal to z(t − Δt) — the previous value of z, where Δt is a small time interval — plus Δt times the G function. If you move z(t − Δt) to the other side, divide by Δt, and let Δt go to zero, you get back the differential equation; this is a discretized approximation of the time derivative of the z vector. So there's nothing really different about this particular network compared to the one I talked about earlier, other than the fact that the update rule, instead of computing a completely new z, updates the old z with some function: inside this box you have something that copies the last state to the output and then adds the new increment computed through the G function. There have been a bunch of papers on this over the last few years; people call this a Neural ODE, a neural ordinary differential equation. I may come back to this in a later lecture.
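A minimal sketch of that discretization (a forward Euler step), where G is approximated by a small, arbitrary neural net — the network, step size, and shapes here are illustrative, not from the lecture:

```python
import torch
import torch.nn as nn

# G models dz/dt = G(z, x); here a tiny MLP chosen for illustration.
G = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))

def euler_rollout(z0, xs, dt=0.01):
    """Integrate z(t) = z(t - dt) + dt * G(z(t - dt), x(t)) over a sequence."""
    z = z0
    trajectory = []
    for x in xs:
        z = z + dt * G(torch.cat([z, x]))   # update the old state, don't replace it
        trajectory.append(z)
    return torch.stack(trajectory)

z0 = torch.zeros(2)         # e.g. position and momentum of the ball
xs = torch.randn(100, 2)    # observations at each time step
traj = euler_rollout(z0, xs)
print(traj.shape)           # (100, 2)
```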
Okay, but here is how we fix the issues with recurrent nets. The problem with recurrent nets, as I said, is the exploding and vanishing gradient problem, and it comes from an essential contradiction in recurrent nets: the reason you would want to use a recurrent net is to be able to store memories — to have the system remember what happened in the past in the form of a state vector z, which in this case is called h (it's unfortunate, but I'm using the standard notation from the papers here, so sorry for the difference in notation). You'd like the neural net to remember what happened in the past, and you can think of those recurrent connections as a sort of elementary memory. The problem is what happens if the system actually has memory.

So what is a memory, really? Say we want a system to actually be a memory. There's a very simple way of building a memory, which people who know something about electronics know very well: you construct a bistable system — a system that can be in one of two states. Think of a pen on a hinge: it's stable in this position and stable in that position, but if you put it in the middle it falls into one of the two. That's a bistable system, and you can easily build one with circuits or op-amps or electronics, by taking, say, a simple sigmoid, multiplying its output by a scalar weight, and feeding it back, adding it to the input x. Say the nonlinearity is a hyperbolic tangent, so the output can't go above plus one or below minus one. Say x is zero to start with, and the output is initially close to minus one — it may not be exactly minus one, but close — and this minus one gets multiplied by a positive weight w; imagine this weight is positive. Then here I get minus w, and here I also get minus w, because my input is zero, and minus w, if w is large enough, after the tanh is again close to minus one. So the system is stable: if I keep running around the loop it stays in the same state, it finds an equilibrium point, and that equilibrium point depends on the value of w — if w is very large, it will be very close to minus one. Now, if I start with plus one, the system also stabilizes, but now the state is plus w, I get plus w here, and the output is again close to plus one; that state is also stable. So that's a bistable recurrent neural net: it can take two states, either close to plus one or close to minus one. If the weight is less than one — less than the inverse of the gain, the slope, of that sigmoid — every time I go around the loop the state shrinks, and eventually it goes to zero; but if the weight is large enough, every time I go around the loop the weight makes the output bigger, that goes into the sigmoid, and at some point the sigmoid reaches its fixed point and you get a stable state close to plus one or close to minus one. That's an elementary memory. In fact, there are memories in computers that are basically implemented this way.

Now, if I change x — say the system is in the plus-one state, close to plus one, and I make the input minus 10 — that overrides the influence of the feedback, the system flips into the minus-one state, and then stays there. That's nice, because it means that while the input is not very large, the system remembers what the last large input was, and as soon as the input exceeds a certain threshold the system flips and remembers that last large value of x, whether it was large and positive or large and negative. So that's great: now we can have neural nets with memory in them.

But there is an issue, and the issue is what happens when you have such memory modules and you stack them — you take the state here and multiply it by the matrix... I'm just unrolling it; I probably shouldn't have drawn it this way, actually. So I have sigmoid, weight, sigmoid, weight, sigmoid, weight, and if I have an input, it gets added here at each time step. Here's the problem: if the system is in a stable state after a few steps, it will have forgotten its initial state. I can change the initial state a little bit, and it makes no difference to the output, because the system goes back to its stable state. So this necessarily has a vanishing gradient problem: wiggling the initial state makes no difference to the output, which means that when I backpropagate gradients from the output to the input I get zero — I can do whatever I want to the input, it makes no difference to the end state, because that's a stable state, an attractor.
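A tiny numerical sketch of that bistable loop (the weight and input values are arbitrary): iterating s ← tanh(w·s + x) settles near +1 or −1 depending on where it started, and a large negative input flips it:

```python
import math

def settle(s, x, w=3.0, steps=50):
    """Run the feedback loop s <- tanh(w*s + x) to its equilibrium."""
    for _ in range(steps):
        s = math.tanh(w * s + x)
    return s

print(settle(s=+0.1, x=0.0))    # ~ +0.995 : remembers the "+" state
print(settle(s=-0.1, x=0.0))    # ~ -0.995 : remembers the "-" state

s = settle(s=+0.1, x=0.0)       # sitting in the "+" state
s = settle(s, x=-10.0)          # a large negative input overrides the feedback
print(settle(s, x=0.0))         # ~ -0.995 : it has flipped and stays flipped
```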
So if you have a recurrent net that, when you run it to stabilization, actually settles into a stable state — which is what gives it memory — then when you backpropagate through that chain you get tiny, essentially zero, gradients. Yoshua Bengio, in the mid 90s, wrote a paper arguing that it's essentially impossible to train recurrent nets, because of this catch-22, this conundrum, this incompatibility: we want to use recurrent nets because they have memory, but we can't train them, because when we backpropagate through something that has memory, the gradient necessarily vanishes.

It turns out this argument is only partially true, because you don't actually need stable states to have memory. You can store information about the past in a system that just keeps running: if the system, for example, is a linear system where the matrix rotates the vector — I already talked about this — then no information is lost by rotating the vector; any vector gets rotated, and you can always rotate back by the inverse angle to recover the previous state. That would be a system that remembers where you started from, but it doesn't have any stable states. So the idea that you need a stable state to have memory is not correct. That's the first thing. The second thing — and this is where I'm going — is that you can make the system remember by default: make it the identity function by default, so the new state is equal to the old state, and then have a neural net that slightly modifies it. This starts to look very much like the idea I was talking about earlier with the differential equation, where the update equation at the bottom says the new state is the old state plus something computed by a neural net. That's the trick we're going to use.

So here is the gated recurrent unit, GRU, an elementary module that you can put in a recurrent net. It was proposed by Kyunghyun Cho, who is a professor at NYU, as you know. You take the previous state and you basically just run it through and copy it to the output, but it can get modified in two ways. The first is that you can add to it the result of applying some nonlinear function — a hyperbolic tangent or something like that — to an input, which itself might be the output of some neural net. It can also take the previous state into account: you can run the previous state through some weights or some nonlinear function, combine it with the input, and produce an output by which you modify the state.
So by default the system just copies whatever is before But it can modify it if it wants And there's also a gating so the gating is this multiplication here And this is why I talked about multiplicative interactions before and hypernetworks and things like that So basically here, this is a gating that can choose to So the the output here that gets fed to this is either a zero or one essentially Actually, it's minus Yeah, it's either zero or one because it's the output of a sigmoid Here it's inverted for some reason but You know into a rib at that It's somewhere between zero and one and it could be zero or close to zero close to one depending on the output of the sigmoid And this basically says, you know, if it's one then this wire is basically transparent And if it's zero then the the the wire is cut essentially right so the past is Forgotten and replaced by whatever comes out of this path Okay, so this is basically a Elementary memory unit if you want that either keeps its previous state or Can erase its previous state which is caused by this multiplication here Okay, if oops, sorry If the if the output is zero then the previous state would be forgotten and replaced by by something new and if it's one it will just, you know be copied and and And and updated Or it can be just updated, but but it's pretty close to the identity And so because it's close to the identity, you don't have this, you know, exploding and vanishing gradient So it's the imperfect memory if you want they can be kind of erased or or written into And you know, you have the the the formulas here at the bottom so There's a gating vector so unfortunately the again the notations are backwards from from mine, but the the Z is a gating The gating vector that is the result of applying sigmoid functions to sort of linear combinations of inputs and previous state and some bias Okay, you could you could have a whole neural net here if you wanted Then you have the resetting gates, so this is the one that can erase the memory again It's a sigmoid function applied to again some weight matrix that takes into account the input some weight matrix that takes into account the previous state and the bias Again, this could be an entire neural net not just a weight And then you have the update formula So the new state is equal to the old state multiplied term by term. This is the term by term multiplication Of the the gating vector. Okay, so this will selectively for each component turn on or turn off the previous state And then let's get linearly combined with the One minus the gating factor with the the sort of update. So this is the you know, hyperbole attention applied to You know the previous state And and and the input all combined with weight vectors. 
Okay, that's the basic unit; you have it as a module in PyTorch if you want to use it, and if you need a system that looks at a sequence and remembers things from far in the past, that may be a good way to go. This was proposed in 2014, but quite a long time earlier, in 1997, Hochreiter and Schmidhuber proposed something called the LSTM, which stands for long short-term memory. They were interested in solving this vanishing and exploding gradient problem, and they came up with an architecture which is slightly more complicated than the one I just explained — you can see GRUs as a bit of a simplification of the LSTM, if you want. This one also has ways to reset or forget the past, and to write new values into the activation, which is very similar to the previous one, with a few twists. The circles here, again, are element-wise multiplications.

Rather than going through the entire detail of this, the lesson is that you can build special architectures that have the property you want — you can engineer the architecture so that you get the behavior or property that you want. So LSTMs were proposed in 1997. They were used for a few things but not really demonstrated for any practical purpose until the late 2000s and early 2010s, when a former student of Jürgen Schmidhuber, who was working in Geoff Hinton's lab as a postdoc, demonstrated that using LSTMs in the right way you could get pretty good performance on something like speech recognition. That started a new wave of interest in recurrent nets, and LSTMs in particular.

In particular, that prompted people at Google — Ilya Sutskever, at the time, in 2014 — to build an architecture called the multilayer LSTM. Essentially, you take LSTM modules of the type I just explained, which are elementary memories, and you feed them with a sequence — in this case he was interested in translation, so you feed in the sequence of vectors that represent the words.
Okay — the words in a sentence. Each of them goes through one of those LSTM units, but then you take the internal state of those LSTMs at every time step and feed it to a second layer of LSTMs, and that's what a multi-layer LSTM is. You can think of this second stage of LSTMs as computing some more abstract representation of the sequence coming from the input. You stack multiple layers of those — I think he had something like four layers — and in the end you get a single state vector here that, if you train the system properly, represents the entire sequence of input vectors in an appropriate way. Then, to train that system, you have lots of parallel text: sequences of words in one language and sequences of words in the other language. You take that representation — which you hope represents the meaning of the sentence independently of language — and feed it to another multi-layer LSTM. The way this decoder LSTM works is that it takes a null marker, you run it through the multiple layers, and at the output it produces a word — actually it produces a distribution over words using a softmax, but you pick one, maybe the one with the highest score. You take that word and feed it to the input at the next time step, so now the decoder knows what word was produced at the previous time step; you run one time step through the layers of this LSTM, produce the next word, feed that word into the next time step, and keep going. This is so-called sequential, autoregressive generation of text using a multi-layer LSTM. This system, trained by Ilya Sutskever and his colleagues, was the first to show that you could reach the same performance — and in some cases slightly better performance — on language translation as classical translation systems based on statistical models. So this was kind of the first neural translation model that worked, but it was really unwieldy: it was huge, you needed a lot of GPUs to run it, and you had to be really smart about it.
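A very condensed sketch of that encoder–decoder setup in PyTorch, with made-up sizes and a greedy decoding loop; this is only meant to show the structure (a multi-layer LSTM encoder whose final state is handed to a decoder that feeds its own output back in), not a faithful reproduction of the 2014 system:

```python
import torch
import torch.nn as nn

vocab, emb, hid, layers = 1000, 64, 128, 4
embed = nn.Embedding(vocab, emb)
encoder = nn.LSTM(emb, hid, num_layers=layers, batch_first=True)
decoder = nn.LSTM(emb, hid, num_layers=layers, batch_first=True)
readout = nn.Linear(hid, vocab)                 # scores over output words

src = torch.randint(0, vocab, (1, 12))          # one source sentence of 12 tokens
_, state = encoder(embed(src))                  # final (h, c): the "sentence vector"

token = torch.zeros(1, 1, dtype=torch.long)     # index 0 used as the null/start marker
output = []
for _ in range(20):                             # generate up to 20 words
    out, state = decoder(embed(token), state)   # one decoder time step
    token = readout(out).argmax(dim=-1)         # greedy: pick the highest-scoring word
    output.append(token.item())                 # it gets fed back in at the next step
print(output)
```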
Very shortly thereafter, at the end of 2014, Dzmitry Bahdanau — who was an intern at MILA in Montreal, working with Kyunghyun Cho and Yoshua Bengio — came up with this idea: rather than representing the entire input sentence in one vector, when the system is about to produce a word in the output sentence, it should pay attention to the location in the input that corresponds to the word it is trying to translate. Say the input sentence is in German and the output is in English: the word order is very different, and when the time comes to produce a verb, in German that verb could actually be at the end, so the system should pay attention to the end of the input sentence to be able to translate that word and produce the verb at the right place in English. This attention mechanism is basically a multiplicative interaction, which I'll talk about again in a few lectures when we discuss transformers and associative memories, so I'm not going to go into the details here. Let's just say that the system is basically a soft switch: there's a score function that looks at each of those inputs and — in the style of the mixture of experts I was talking about — produces, through a kind of softmax, a score deciding which of them to pay attention to at this time. That actually pretty much fixed the issue: it allowed translation systems to be much smaller than what Ilya Sutskever had proposed, and it really started a whole effort in industry to build translation systems around this idea of attention. Since then, the idea of attention has really ballooned — it's used absolutely everywhere nowadays — but we'll come back to it in a future lecture.

I should say something here: because this idea of attention has really taken the world by storm, people are actually not using recurrent nets that much anymore. They prefer systems with attention, so the LSTM is not used nearly as much as it was, say, five years ago; it has been replaced by architectures like transformers and, to some extent, convolutional nets.

Okay — convolutional nets. This is the main topic of today, because you're going to have homework on this pretty soon, and practicums as well. We're going to have a couple of lectures on this: this half lecture and the next half lecture, I think, will be on convolutional nets. Convolutional nets again use this idea of shared weights, and they are good for sequences — say audio or speech, or other sequential signals like financial time series — and for images, volumetric images, and videos. Volumetric images are things like MRIs and CT scans, or images that come from a depth sensor. Video, and other types of natural signals — essentially, natural signals that come to you in the form of an array, a multidimensional array of one, two, three, or four dimensions. And this array has to have the property that nearby values in the array are similar, or correlated. If you take any photo — a random photo from the internet — and look at the color values of neighboring pixels, there's a good chance the color values of two neighboring pixels are pretty much the same, whereas the color values of distant pixels have basically no chance of being the same. That's the idea of local correlations: nearby values are highly correlated, dependent if you want. That's one thing; the other thing is that motifs can appear anywhere. The first slide I showed earlier was about training a detector that could find a motif anywhere on the input — convolutional nets are built around this idea that you want to be able to detect motifs regardless of where they occur. That's the diagram I showed you earlier, but we're going to generalize it a bit. Why do we need to do this? Because we want to build hierarchical representations of the real world — I'll come back to this. But let's take a very concrete example here.
Say I want to be able to recognize character images, but I'd like to do this independently of the location of the character in the image. I want to be able to translate the character, maybe change its size a little, and still recognize it as a C, and I might want to recognize D's as well, independently of location. There's an idea that people in computer vision and signal processing figured out a long time ago, which is that you build a detector — you can think of it as a template. Here this template is basically a pattern of coefficients, weights, where white indicates a negative weight and black a positive weight, and in the image black is a positive pixel and white is zero or negative. You take this template and you swipe it over the input, and wherever the dot product of the pixels with the values in the template is positive and large, you turn on the detector.

So let me take this little template and put it at the left corner. I have an equal number of positive and negative weights, so I get essentially zero, and as soon as I start hitting a black pixel the template can react — but it reacts maximally when it's centered on the endpoint of the C here. Maximum output is represented by black. This thing here, which is called a feature map, represents the activation of the unit corresponding to the dot product of those weights with the corresponding little square on the input — a 5-by-5 square in this case. So I take these 5-by-5 coefficients, compute the dot product with 5-by-5 pixels, and put the result in the corresponding location. Then I shift the template by one pixel, recompute the dot product, and put the result in the next location, and if I do this for all locations, I get this: a high response here, a negative response where it's white, and basically zero response outside. The corner detector gives a vaguely positive response here, but it's not very large, and mostly small responses everywhere else. Same for this one.

Now, if I run this through a nonlinearity that thresholds — say a ReLU with a threshold, or maybe a binary detector — I get activations wherever the response was large, so at the endpoints of the C that match this template, and zero everywhere else. And now it's very easy to plug those three feature maps into a linear classifier and decide whether it's a C or a D: if it has activations here, it's probably a C; if it has activations there, it's probably a D, because those templates detect the corners of the D. This one detects the upper corner: I get high activation here when I threshold, same here, and if I shift the D I also get high activations, just shifted. My classifier maybe just computes a sum of all those values and doesn't care where they are, and so in the end I can classify a C versus a D by just looking at whether I have positive activations here or there.

Now, this operation I just described — where you take a small pattern of coefficients and you swipe those coefficients over an image —
that's called a discrete convolution. It's a particular type of linear operation where the weighted sums have two special characteristics. First, the weights are localized in the 2-D space: not every unit in one layer is connected to every unit in the previous layer; each unit here is only connected to a small neighborhood in the previous layer. Second, all those units share the same weights — they all share those 25 weights. So there is this idea of weight sharing.

What I described here is a very elementary convolutional net — think of it as the simplest convolutional net you can imagine: a layer of convolutions with three convolution masks, three sets of weights, each of which detects a feature, a particular motif, on the input. At the first layer I get feature maps that indicate where those motifs have been detected, and I have a second layer, which is basically just a linear classifier: it essentially counts how many activations there are in this map or that map and tells me, is this a C or a D? Very simple. You could train this with backprop — if you put a sigmoid at this layer, or maybe even a ReLU, and you have biases, this would work — and you can imagine that the system would learn those templates for feature detection by itself. So here's the trick for convolutional nets: if you apply backpropagation to an architecture like this, where you have units that have local connections and share their weights across locations, that's a convolutional net, and the system will learn local feature detectors that perform a particular task.

Now, this system would be invariant to shifts and distortions and changes in shape, because it can detect the motifs wherever they occur, so it doesn't really care where they are. But you might care where they are: if you want to distinguish not just C from D but also other characters, you probably care about the relative locations of those features.

Okay, so let's define what a convolution is. The usual definition of a convolution — this is the one-dimensional case — is that output number i, y_i, is a weighted sum of inputs, where the sum runs over the size of the template of weights (it could be 5, or 25 in the case of the image we saw earlier — it could be any size, really), and you look at the inputs before position i:

  y_i = sum over j of w_j x_(i−j).

A very similar form, called cross-correlation, looks forward instead:

  y_i = sum over j of w_j x_(i+j),

and in fact, in PyTorch and in a lot of deep learning systems, what people call convolution is actually cross-correlation; it's just a convention of how you interpret the index j, whether you have a subtraction or an addition here. This is actually what PyTorch does. So y_i is a weighted sum of x's over a window, the weights are those w_j's, and then you shift the window by one and apply the same set of w's to the shifted window. In 2-D you have two indices, and you just add them. That's the definition of the convolution operation.
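A small sketch to make the convention concrete — the input and kernel values are arbitrary — comparing cross-correlation (what deep learning libraries call "convolution") with the textbook convolution, which is the same thing with the kernel reversed:

```python
import torch
import torch.nn.functional as F

def cross_correlation_1d(x, w):
    """y_i = sum_j w_j * x_(i+j)  -- what deep learning libraries call 'convolution'."""
    k = len(w)
    return torch.stack([sum(w[j] * x[i + j] for j in range(k))
                        for i in range(len(x) - k + 1)])

x = torch.arange(8.0)                 # input sequence
w = torch.tensor([1.0, 0.0, -1.0])    # a small template / kernel of weights

print(cross_correlation_1d(x, w))
# PyTorch's conv1d computes exactly this cross-correlation (no kernel flip):
print(F.conv1d(x.view(1, 1, -1), w.view(1, 1, -1)).flatten())
# A "true" convolution, y_i = sum_j w_j * x_(i-j), is the same with w reversed:
print(cross_correlation_1d(x, w.flip(0)))
```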
So think of this as a layer Okay, where you get, uh A linear layer essentially you get an input uh a sequence or image Or even the 3d image you can have as many indices as you want Uh, let's say a sequence And it applies the same the same motif of weights to every location of your input and produces an output Which is essentially the same size minus border effects Uh as the input And it's a result of applying the the w To the the the filter w So the set of weights w is called either a filter for people who are familiar with single processing Or a convolution kernel if you if you're a mathematician Okay, so we'll we'll talk about a kernel essentially um Yeah, uh those pictures for those of you who've played with uh, you know photoshop and other Image manipulation programs, you know this there is an operation called convolutions in most of those things And there is like, you know emboss filters and edge detection or that stuff and they basically use convolutions, right? So this looks if this looks familiar it's because You've seen this before in in this context Okay, so we have an input sequence an output sequence to produce one output We computer weighted sum of our window and we multiply by the weights. Okay How do we back propagate through this right? So if we get gradients with respect to the output How do we get gradients with respect to the inputs? And the answer is trivial because it's a linear Module and so to back propagate through a linear module Uh, you imagine that this is a weight matrix and you transpose that weight matrix And what that means is that you use the same weight but backwards. Okay, so to get the gradient of The cost with respect to this particular x here this x is going to influence multiple y's through through different weights Okay, you're going to take the gradients with respect to all those y's and then multiply by all those weights And you get the gradient with respect to x and I written it here in a formula So the gradient of some cost with respect to xj is equal to the sum over k, which is the index running over the weights Of the gradient of the same cost Uh with respect to the outputs now that uh, Is that location j minus k? Okay, so you look You look to the to the left Okay, because all the Uh a a a particular, uh x here Uh will send Uh its output to y's that are to the left of it, right? Because of this formula And so you take all the y's that are to the left of that x and as well as above And you compute the weighted sum of the gradients backwards And you get the gradient with respect to the input and that's this formula and this is just like multiplying to the transpose Uh matrix of w if you think of w as a matrix But uh i'll further will explain this to you in more details Um, maybe actually yes the the the linear one. Yeah, but perhaps this one is going to be more in the homework Yes, we'll see Okay, so there's two backups because you know, this is a module with weights So, uh, we also need to back propagate with respect to the weights. So this is where the formula for shared weights Intervenes so the gradient of some cost with respect to weight wj Is going to be equal to Of course, you know, you have one input here And a weight that you know connects with So input x i x i plus j and uh output y and you have a weight that connects the two We've seen that for a linear module you take the gradient with respect to the output multiply by the input And that gives you the the gradient with respect to to the weight, right? 
Okay, there are two backprops, because this is a module with weights, so we also need to backpropagate with respect to the weights, and this is where the formula for shared weights comes in. Take the gradient of a cost with respect to a weight w_j: you have an input x_(i+j), an output y_i, and the weight that connects the two. We've seen that for a linear module you take the gradient with respect to the output and multiply by the input, and that gives you the gradient with respect to the weight — so it should be dC/dy_i multiplied by x_(i+j). The problem is that this w_j is used everywhere — the same w_j is reused at every position along this dimension — and so, because the weight is shared, we have to sum the gradients from each of the individual instances of this weight to get the overall gradient with respect to that value, which is shared across locations. That's what this formula tells you: the gradient of the cost with respect to w_j is the sum over locations i of the sequence of the gradient of the cost with respect to y_i, multiplied by the x that the weight w_j connects to y_i, which is x_(i+j):

  dC/dw_j = sum over i of (dC/dy_i) x_(i+j).

It's very nice to understand this, but PyTorch implements it for you, of course: you have a convolutional layer that you can just plug into your net, and PyTorch knows how to do all of this.

Now, there are different types of convolutions, and when you create a convolutional layer in PyTorch it asks you for several parameters. Here is a dense convolution — the regular, plain-vanilla, garden-variety convolution — where one output is influenced by a window of three inputs, so the kernel size is three. You have the same three weights that are reused over and over again, and the window is shifted by one step each time: this output is influenced by this window of three, the one next to it by that window shifted by one, and so on. That's a regular convolution.

Then there is convolution with stride. A convolution with stride is one in which each output is influenced by a window that is shifted by more than one step. In this case the stride is two: this output looks at this window of three, then we shift the window by two steps, compute the weighted sum, and that gives us the next output. This is strided convolution, or subsampling convolution, and it basically reduces the spatial or temporal resolution of the output: it looks at all the values, but it skips over some output locations.

And then there is the so-called skip convolution, or "convolution à trous" — that comes from French; it's not pronounced "atras," it's "à trous," in theory spelled with an accent on the a, and it means "with holes" in French. It comes from French because the people who proposed it in the context of applied math and signal processing called it that. The idea is that some of the weights in the kernel are equal to zero: here the window over which this output is influenced is a window of five, but it skips every other input — it looks at the first, the third, and the fifth input, but not the second and fourth. This is interesting if you want your convolutional net to be able to look at a wide window of inputs without having too many parameters, too many weights, in your system.

So whenever you create a convolutional layer inside a network, you have to give it those parameters: the size of the kernel, which could be one-, two-, three-, or four-dimensional (maybe it stops at three, I'm not sure); the stride, which is how much you shift; and the dilation, which is how much you expand or dilate the kernel — it's also called dilated convolution sometimes, and in fact I think that might be the name in PyTorch, dilated convolution.
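These are exactly the arguments of PyTorch's convolution layers; a small sketch (input length 10 chosen arbitrarily) showing how kernel size, stride, and dilation affect the output length:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 10)   # (batch, channels, length): a length-10 sequence

plain   = nn.Conv1d(1, 1, kernel_size=3)              # window of 3, shift by 1
strided = nn.Conv1d(1, 1, kernel_size=3, stride=2)    # shift the window by 2
dilated = nn.Conv1d(1, 1, kernel_size=3, dilation=2)  # "à trous": skip every other input

print(plain(x).shape)    # torch.Size([1, 1, 8]) : 10 - 3 + 1
print(strided(x).shape)  # torch.Size([1, 1, 4]) : roughly half the resolution
print(dilated(x).shape)  # torch.Size([1, 1, 6]) : effective window of 5, stride 1
```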
All right, so how do we put this into a neural net? That's basically a convolutional net: convolutional nets are architectures built by alternating layers of convolutions, pointwise nonlinearities, and sometimes pooling. Pooling is an operation I haven't talked about yet, which I will in just a second.

This is a depiction of an early neural net called LeNet-5 — actually it's a kind of subset of it that's represented here. Again, this is a depiction of the convolution operation: we take a five-by-five convolution kernel, compute the dot product of those weights with the pixels in a particular window on the input, and that gives the value of a particular activation in the resulting feature map; we slide that window over all locations and the result gives us a map where each pixel value indicates an activation, if you want.

So the first layer of a convolutional net is going to be a set of such convolutions with different convolution kernels. Here each of those four convolutions uses a different kernel, and those kernels are going to be trained with backprop. The one I'm showing actually results from training — it's not something I built by hand like the example I showed earlier. We're going to train all of this with backprop, and the advantage is that the system learns by itself what the good motifs to detect are, in order to determine the category or solve whatever task we're training it for. In this particular example the outputs pass through a hyperbolic tangent nonlinearity — this is the old style of neural nets from the 1980s and 90s; we used hyperbolic tangents at the time, whereas now we're more fond of ReLUs. So what you see here are feature maps, which are the result of doing a convolution with a kernel and then, after adding a bias, passing the result through a hyperbolic tangent.

That's the convolution. Then there is a pooling operation, and pooling is essentially an aggregation of the activations within a neighborhood of a feature map, with subsampling. Let me explain this in one dimension. I have a one-dimensional signal and I apply a convolution to it, say with a kernel of three, and so on. A simple pooling operation consists in taking two consecutive values, combining them in some way to produce an output, then skipping ahead and computing the next one. I'm drawing a particular type of pooling where there is no overlap between the pooling areas, but you could very well imagine having overlap between the pooling areas —
where you pool over four values, for example, but only shift by two, so the next pooling area overlaps with the previous one. But generally there is some level of subsampling in the pooling, which means that if you have n values in the signal and the subsampling ratio s is equal to two, then what you get out is n/2 values — more generally n/s, if you don't have overlap; if you do have overlap it's a little more complicated, there are border effects.

Now, what kind of operation qualifies as pooling? A pooling is basically an aggregation of values whose output is independent of their order — it's permutation invariant. So any function f of some number of arguments that doesn't care about the order of those arguments: I can apply a permutation p to the indices, f(x_{p(1)}, ..., x_{p(k)}) = f(x_1, ..., x_k), and get the same result. That's a proper pooling operation.

What pooling operations do we have? A very simple one is just a sum or an average, y = (1/k) * sum_k x_k — that's average pooling. Another fairly popular one is Lp pooling, y = (sum_k x_k^p)^(1/p). It's called Lp pooling because we're computing the Lp norm of the values seen as a vector, and a particularly popular case is p = 2: square all the inputs, sum the squares, and take the square root. Then there's max pooling: just figure out which value is the largest and output that. This is very popular — in fact probably the most popular of all pooling functions.

Then there is something a little funny: a soft version with a parameter beta, and depending on the value you give to beta it behaves like max pooling or like average pooling. If beta is very close to zero, it's very much like computing the average of the x's; if beta is very large, it's very much like computing the max; and if beta is very large and negative, it's like computing the min. You can come up with other ways to do pooling, but the most popular is max pooling, and perhaps Lp pooling with p = 2 — there is some theoretical argument for why that's a good idea, but it's not very widely used; it shows up in other contexts, but not necessarily for pooling. Average pooling is used very rarely, and only if you precede it by some nonlinear operation like a ReLU or a square.
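Here is a small sketch of mine, not from the slides, showing average, max and Lp pooling on a toy signal, plus one common way of writing that beta-controlled "soft" pooling as a log-sum-exp so you can see the average, max and min limits:

```python
import math
import torch
import torch.nn.functional as F

x = torch.tensor([[[1.0, 3.0, -2.0, 0.5, 4.0, 4.0]]])        # (batch, channels, length)

print(F.avg_pool1d(x, kernel_size=2, stride=2))               # average pooling
print(F.max_pool1d(x, kernel_size=2, stride=2))               # max pooling
print(F.lp_pool1d(x, norm_type=2, kernel_size=2, stride=2))   # L2 pooling: sqrt of sum of squares

# One common way to write the beta-controlled "soft" pooling described above:
# y = (1/beta) * log( mean( exp(beta * x) ) ) over each window (a log-sum-exp).
def soft_pool(x, beta, k=2, s=2):
    windows = x.unfold(2, k, s)                                # (batch, ch, n_windows, k)
    return (torch.logsumexp(beta * windows, dim=-1) - math.log(k)) / beta

print(soft_pool(x, beta=1e-3))   # ~ average of each window
print(soft_pool(x, beta=1e3))    # ~ max of each window
print(soft_pool(x, beta=-1e3))   # ~ min of each window
```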
Okay, now what is the role of pooling? The role of pooling is basically to eliminate some of the information about the position at which a motif appears on the input. If I have a motif that has been detected on the input, whether it appears in the first three values or in the second three values, by the time I get to the pooling area the pooling unit will turn on regardless of whether the motif was in the first location or the second location. So by doing local pooling I throw away a little bit of information about the precise position at which a particular motif occurs. And that's what makes convolutional nets so interesting: they are robust to variations in the locations of features.

So we take this input, run it through the convolutions and through sigmoids, and then here what's actually being done is average pooling, with the result passed through a hyperbolic tangent — a funny kind of pooling, average pooling with a nonlinearity afterwards. There's a stride of two, so the pixels at that level are twice the size of the pixels on the input: we shifted by two, and I've expanded the size of the pixels so the image is drawn at roughly the same size as the input, but it's really half the resolution.

And here's the general idea of a convolutional net: you stack this. The second layer again performs convolutions: this unit here is the result of applying one convolution kernel to this feature map, another kernel to that map, another to the next one, and so on, summing up the results and passing them through a hyperbolic tangent nonlinearity. Each feature map at the second stage looks at the feature maps of the previous stage through different convolution kernels, so they detect different combinations of features. You can imagine that if the first layer detects very simple motifs — a corner, an endpoint, an edge at a particular orientation — then the second layer can detect combinations of those: it can combine the detections from all, or a subset, of the feature maps of the previous layer and detect conjunctions of different motifs, which you can think of as higher-level features. If the previous layer has a detector for horizontal edges and another for vertical edges, the second layer might detect corners by combining the two, for example.

Then again you do pooling and subsampling — again, average pooling with a nonlinearity afterwards — and you get a representation of the input that now has almost no spatial information in it. You do convolutions again, you get to the output, and now you can train this entire neural net to classify character inputs — here the input is a zero.

Then there's an animation, due to Andrej Karpathy, of what multiple convolutions do, in an animated, graphical way. He's represented the version where you pad the input so that the output has more or less the same size as the input; in this case the stride is two,
so this is a subsampling convolution, and the input is padded so that the output doesn't get too small. Let me restart the animation. You take a window, compute the weighted sum of the values in the window with the weights, and put the result in the output. The input here is not a single feature map but three feature maps, so you have to compute convolutions with three different weight kernels, one for each of those three feature maps, each with different weights; you add up the results and that gives you the output feature map. Then you have a different set of weights for the second output feature map, the one in green at the bottom. So if you have some trouble visualizing this, that may be a good way of seeing it.

Here is another way. This is a visualization of an old neural net that I trained back in the early 90s — the animation itself was produced in 1997, if I remember correctly. Here's the input layer; six convolution kernels are applied to the input, followed by a hyperbolic tangent, and you get those six feature maps. I'm translating the digit 3 here to see the effect of translation on the representation. Then there is pooling: whenever a particular motif here shifts by one pixel, after the pooling it shifts by half a pixel; if it shifts by two pixels, the pooled version shifts by only one pixel. Again, those use average pooling and then pass the result through a hyperbolic tangent with a bias, so they can be thought of as low-resolution versions of the previous layer, with somewhat different intensities because of the squashing function and the bias.

Then we repeat the process. Each feature map here is the result of applying convolutions to some of those six feature maps of the previous layer — five-by-five convolutions in two dimensions — adding up the results of all those convolutions with all those five-by-five kernels, and passing the result through a hyperbolic tangent. You do this with a different set of five-by-five kernels for each output map; you get twelve of them, I think — or sixteen, I guess it was sixteen. And then again pooling and subsampling. And here the system chooses to keep a fairly linear response, by staying within the linear range of the hyperbolic tangent,
so what you get is a relatively small amount of variation here when there's a large amount of variation there, because of the average pooling and the progressive reduction in spatial resolution. Then there are convolutions again. Here the convolution kernel has the same height as the feature map, so when I do the convolution I can't shift vertically — the kernel has the same vertical size as the feature map — I can only slide it horizontally. As a result, the resulting feature map is only one-dimensional, a vector if you want, with one value for each horizontal location on the input but only one vertical location, because I can't shift the kernel: that feature map is already five pixels high and my kernel is five by five. There are 120 of those, and this is the representation of the input image before it goes into a couple of fully connected layers for classification.

What you can observe is that the kind of variation you see here is relatively mild: there are values that go from black to white, or white to black, or from more gray to less gray, but there is no value that goes from black to white and back to black — or very rarely. Whereas here, if you take this particular pixel that I'm pointing at, as the character shifts it goes from white to black to white. What that means is that the line formed in input space by the shifted, distorted versions of that 3 is a very curvy line: some values swing between one and minus one and back, so the curve is very wiggly. By the time you get to the top layer, the line in this high-dimensional space corresponding to the shifted versions of the 3 is not exactly straight, but it's considerably straighter. And that means that when you apply a linear classifier to it, the classifier will be able to separate all those shifted versions of a 3 from the shifted versions of a 4, because they form lines that are well separated in that space, whereas in the input space those lines can be quite intertwined.

So a general convolutional net looks like this: something like a normalization — I'll come back to this when we talk about batch norm in a later lecture — then filter banks, which are multiple convolutions, then a pointwise nonlinearity, then some sort of pooling; and you take that stage and repeat it multiple times. That's a basic convolutional net. Very often the nonlinearity these days is a ReLU, which you're familiar with, and the pooling is very often max pooling, occasionally Lp pooling, or sometimes one of those soft, exponential-based poolings. But don't feel like you're limited to any of those: you could imagine much more complex kinds of modules at every layer.

Here's an example of how you create a convolutional net — but again, Alfredo will tell you a lot more about this. You create a class, and the way you create a convolutional layer here is nn.Conv2d:
That's a 2D convolution, and what you pass is the number of feature maps on the input — which in this case is one, because it's just a grayscale image — the number of feature maps on the output, and the size of the kernel, which could be two-dimensional, with different sizes for x and y; here they're the same, so we just give one number. The last argument is the stride, which in this case is one. So you have four modules: a 2D convolution, another 2D convolution, and then two fully connected layers. The way you compute the forward is that you apply a ReLU to the first convolution applied to the input; then you apply the pooling — pooling doesn't have any parameters, so you don't need to create an instance, you just call the function — then ReLU again on the second convolution, pool again, then reshape the output into a vector, apply the two fully connected layers, and apply a softmax because you want to do classification on the output; and then you can have a loss function afterwards. So that's a very simple convolutional net.

It's depicted here a little more graphically, although the number of feature maps in the picture isn't right — it's too large for what we're talking about. Here there is no padding, which means we're doing what's called "valid" convolutions: the output of each convolution layer is slightly smaller than its input, so that the kernel doesn't run off the boundary. So you get some shrinkage of the layers because of that, and a bigger shrinkage because of the pooling.

Now here's one thing that's interesting about this architecture. If you take one of the outputs on the layer at the right and ask how many input pixels are connected to that feature, each of those units, when you back-project, is influenced by an entire window of 32 by 32 pixels. The input is 32 by 64, but each one of those units is influenced by an entire 32-by-32 window — so it can essentially see an entire character. And that's because of the pooling and subsampling at every layer.

There's another way of defining this: someone proposed a different way of writing convolutional nets, giving a name to each layer, using the nn.Sequential container in PyTorch and passing the layers as arguments. An nn.Sequential is basically an architecture where you just stack layers one on top of the other, so you don't really have to write a forward: implicitly, each of those layers is called sequentially, which is why it's called Sequential. You can give a name to each of them by passing it a dictionary. And then the forward function is super simple — you just call the convnet, essentially.
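To make that concrete, here is a minimal sketch of both styles — the class with an explicit forward, and the nn.Sequential version. The numbers of feature maps, the hidden-layer size and the 28-by-28 input are my own placeholders, not the ones on the slide:

```python
from collections import OrderedDict

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=5, stride=1)   # 1 input map (grayscale)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=5, stride=1)
        self.fc1   = nn.Linear(16 * 4 * 4, 64)                  # assumes a 28x28 input
        self.fc2   = nn.Linear(64, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))          # convolution + ReLU
        x = F.max_pool2d(x, 2)             # pooling has no parameters: just call the function
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = x.view(x.size(0), -1)          # reshape the feature maps into a vector
        x = F.relu(self.fc1(x))
        return F.log_softmax(self.fc2(x), dim=1)

# The same idea with nn.Sequential: layers are applied in order, so no forward() is needed,
# and each layer gets a name through the dictionary.
net = nn.Sequential(OrderedDict([
    ("conv1", nn.Conv2d(1, 8, 5)), ("relu1", nn.ReLU()), ("pool1", nn.MaxPool2d(2)),
    ("conv2", nn.Conv2d(8, 16, 5)), ("relu2", nn.ReLU()), ("pool2", nn.MaxPool2d(2)),
    ("flatten", nn.Flatten()),
    ("fc1", nn.Linear(16 * 4 * 4, 64)), ("relu3", nn.ReLU()),
    ("fc2", nn.Linear(64, 10)),
]))

print(SimpleConvNet()(torch.randn(1, 1, 28, 28)).shape)   # torch.Size([1, 10])
print(net(torch.randn(1, 1, 28, 28)).shape)               # torch.Size([1, 10])
```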
Okay, so here's another example of an old convolutional net — this one is applied to multiple characters. Very early on we realized that convnets could be applied not just to a single object or a single character, but to multiple characters, and basically, if you make the output itself a convolution, you get an answer for each location on the input. Remember, I told you that each unit here is influenced by a 32-by-32 window on the input; the neighboring unit is influenced by another 32-by-32 window shifted by four pixels. Why four pixels? Because there are two layers of pooling of size two, so there is an overall subsampling ratio of four, which means the next unit at this layer looks at a window shifted by four pixels on the input. I should probably do a drawing to explain this in more detail.

Before I do that, this is a fun video you can find on YouTube. This is me, by the way, back many years ago, around 1992; this was my phone number when I was working at Bell Labs. This was a demo system we put together on a PC to do semi-real-time character recognition: we grabbed images from a camera, I'd hit a key, and the system would locate each of the characters, pass them through the neural net, and display the results as it computed them. It was able to handle very unusual styles, which really no other system at the time could do. So this is me when I was your age, this is Donnie Henderson, the research engineer who put the system together, and Rich Howard, who was our lab director at Bell Labs — Bell Labs in Holmdel, New Jersey, not very far from here.

Now, where does all of this come from? The whole idea of convolutional nets actually comes from biology — it's inspired by biology. When you look at the architecture of the visual cortex, you see that much of the processing it performs for recognizing everyday objects looks very much like what we now call a convolutional net, and in fact neuroscientists use convolutional nets as a model of how the visual cortex performs object recognition. So in your visual system, images are formed on the back of your eye, on your retina, and then the signal goes through the optic nerve. The optic nerve has roughly a million fibers, so you get an image that is roughly a million pixels. In fact, the resolution of your retina is much higher than that — more like 60 or 100 million photoreceptors — but it gets compressed by four layers of neurons sitting in front of your retina, which squeeze the signal so it fits onto one million fibers, because if you had 100 million fibers coming out of the back of your eyes there would be too many wires. Also, vertebrates are really badly evolved in this respect, because the fibers of the optic nerve have to punch through the retina to get out, so you have a blind spot in your retina where the wires exit the eye. Invertebrates like octopus and squid are much better designed: their wires come out at the back of the retina,
which is a more logical way of building a retina, so they can have optic nerves at essentially the full resolution of the retina. Then the signal goes into something called the lateral geniculate nucleus, which performs something a bit like a convolution and contrast normalization, a little bit of attention, etc.; it uses filters that look like center-surround patterns — I'll come back to this. Then it goes to the back of your brain, the back of your head, where your primary visual cortex sits, an area called V1. V1 is very much like the first layer of a convolutional net — it's much more complex than that, but it performs a bit of the same kind of operation.

Then there are two different pathways for vision: one called the ventral pathway, which is the one depicted here, and another called the dorsal pathway, which goes toward the top of the brain. The dorsal pathway processes motion, localization, and things like that, whereas the ventral pathway handles recognition — how you recognize an object. There really are two tasks: one is recognizing that this is a pen; the other is that if I want to grab the pen, I need to know where it is, figure out its orientation, and if it's moving, predict its trajectory so I can grab it — that goes through the dorsal pathway. What we're talking about here is the ventral pathway. The diagram on the left shows how this is arranged: there's the "what" pathway, which is the ventral one, and the "where" pathway, which is the dorsal one, with different areas that have cute names like MT, which does motion processing; all of those areas interact with each other, of course. The ventral pathway is a hierarchy where you go from V1 to V2, and from V2 to V4; V4 represents sort of intermediate visual features — motifs, forms, parts of objects — and then the posterior inferotemporal cortex and the anterior inferotemporal cortex, which is where object categories are represented. And this whole process, when you start seeing something, the whole process of propagating through this hierarchy takes about 100 milliseconds — about a tenth of a second. So you're aware of what you're looking at about a tenth of a second after your retina has seen it, if you want. That's pretty fast; there's no time for much feedback. So there was this hypothesis that the process is essentially a feed-forward computation — it has to be fast, this is what you need to detect predators, for example, so you don't have time to think about it.

Okay, so the structure of the primary visual cortex, area V1, was studied by two gentlemen by the names of Hubel and Wiesel in the late fifties and early sixties, and what they discovered was that there are neurons in V1 that look at only a small part of the visual field and react to a particular motif, like an edge at a particular orientation.
So you would show an edge at a particular orientation to a cat, or whatever animal they were using, and poke an electrode at a particular neuron. If you move the edge, there's only one particular location where this neuron turns on; and if you rotate the edge, there's only one particular angle it will react to. Now you go to another neuron nearby and it reacts to the same edge at a slightly different location. You go to yet another neuron next to it and it reacts to an edge at a different angle, at the same location as the original neuron. So you have a block of neurons that all look at the same location and react to different motifs — edges at different orientations — and then another block of neurons that detects the same features at a slightly different location on the input. This is very much like the convolutional layer I talked to you about, and in fact the idea for convolutional layers comes from them: they didn't call it that, but that's more or less what they concluded. And it's quite logical to see why it has to be this way: local features can appear anywhere on your visual field, so you need detectors that are replicated everywhere on your visual field, so that you can detect those features wherever they appear. An object can appear here or it can appear there, and you'd like to recognize it regardless of where it appears, so you need the same feature detectors at various locations — you want shift independence. You also want scale independence, because the apparent location and size of features will change depending on how far away the object is, or its orientation, and things like that.

Then a gentleman by the name of Kunihiko Fukushima — whose name is down here — built, in the late 70s and early 80s, computer models that were inspired by Hubel and Wiesel. He built a system called the Cognitron, and later another one called the Neocognitron, which were also an inspiration for convolutional nets. He was really interested in reproducing the properties of the visual cortex, so his model had neurons that were somewhat biologically plausible, and so on. But this was the late 70s and early 80s, so he didn't have backprop — he couldn't train his system with backprop. He used an unsupervised learning algorithm to train the layers — some sort of competitive learning; you can think of it as clustering, a bit like k-means with local competition. The system would learn semi-useful features, and then he would plug a linear classifier, a perceptron, on top of it. It was very slow and awkward, but it was able to recognize simple shapes; it wouldn't beat records on character recognition, but it was an interesting model. I was inspired by this when I built my first convolutional nets in the late 80s, basically making it simple and trainable with backprop. Now, Hubel and Wiesel also had the idea of complex cells, which are basically pooling units, so this idea of pooling comes from them as well, and Fukushima had it too: he had simple cells and complex cells,
which he would alternate: simple cells would be convolutions and nonlinearities, and complex cells would be pooling-like operations, and the two would be alternated. So this is what the first convolutional nets I played with looked like, back in 1988 or so, and I showed that if you had a limited number of training samples — characters drawn by hand with a mouse, a very small training set — the performance was improved by using those shared weights and convolutional architectures. Then in 1988 I joined Bell Labs in New Jersey and started building larger versions of those, trained on real datasets — characters coming from zip codes, essentially — given to us by the U.S. Postal Service through the University at Buffalo, the State University of New York at Buffalo. A gentleman there by the name of Sargur Srihari, who is still there, was working with the postal service and got this dataset built. For the time, this was a huge dataset: it had about 7,300 training samples. By now it's tiny — this network is tiny by now — but at the time it was gigantic, and it took 10 days to two weeks to train on the computers we had. And this is a later incarnation of it that was trained to recognize multiple characters. Okay, I'm going to actually stop here, keep the rest for next time, and take questions if there are any.

So, a question: how do you get the network to be both invariant and equivariant — convolutions are equivariant, but the pooling is invariant, so how can a network be both? Right, so maybe we should first define what invariant and equivariant mean. Saying that f is invariant to a transformation t means that f(t(x)) = f(x): I transform x by t, pass it through f, and get the same result. Equivariant means that f(t(x)) = t'(f(x)) — and I should call it t', because there is some transformation t', applied in the output space of f, that produces the same result as applying t to x first.

Now, for convolutional nets, convolution is equivariant to translation — to spatial translation. So if I have a signal x, indexed by i, and my transformation is t(x)_i = x_{i+a} for some constant a — I've shifted the signal — then I have the property that f(t(x)) = t(f(x)), if this is a dense convolution. If it's a convolution with stride two, then the corresponding translation of the output would be by half the amount of the input translation. But for a regular convolution, if I shift the input, the output gets shifted but is otherwise unchanged. That's equivariance.

And pooling is locally invariant to translation. This is not a formal statement, because it's not entirely invariant, but you can imagine that if I have a pooling operation over a window of size two — or let's say bigger, like size five — and I shift the input by one, it will not affect the output of the pooling very much. It will affect it a little bit, which is why I say this is not a formal statement, but not much. So it has a kind of smoothing effect with respect to translation: if I translate the input, the input itself may change a lot — there can be a very large distance between the input and its translated version — but after the convolution and the pooling, the distance between the two outputs, of the input and its shifted version, will not be that large, because of the pooling. So the pooling builds in a bit of robustness, or smoothness, with respect to shifts of the features on the input.

Let me take a more concrete example. Say I have a one-dimensional input signal that is all zeros except for a little motif: two ones separated by one blank. And I have a convolution kernel I'm going to apply to it that basically matches this pattern. When I compute the dot product anywhere else on the input, I get essentially zero, but wherever it matches, the output turns on like crazy. So if I sweep this convolution kernel over the input, the output is zero everywhere except at that one place where it's large and active. Now if I pool with a stride of two — a subsampling of two — I can shift the input by one pixel, and what changes is that a different unit before the pooling turns on, but the same pooling unit turns on: it makes no difference to the pooled output, because of the pooling. That's an example of local shift invariance. If I shift by more than that, then the next pooling unit turns on — so it's only local, for small shifts and distortions.
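Here is a little sketch of mine (not from the lecture) that illustrates both properties on exactly that kind of toy signal: the convolution commutes with a shift of the input, while after a stride-2 max pooling a one-sample shift leaves the output unchanged:

```python
import torch
import torch.nn.functional as F

x = torch.zeros(1, 1, 16)
x[0, 0, 5] = x[0, 0, 7] = 1.0                 # the motif: two ones separated by a blank
w = torch.tensor([[[1.0, 0.0, 1.0]]])         # a kernel that matches that motif

def shift(s, k):
    return torch.roll(s, k, dims=-1)          # circular shift; fine here because the motif
                                              # stays well away from the borders

def conv(s):
    return F.conv1d(s, w, padding=1)

def conv_pool(s):
    return F.max_pool1d(conv(s), kernel_size=2, stride=2)

# Equivariance: convolving the shifted input equals shifting the convolved output.
print(torch.allclose(conv(shift(x, 1)), shift(conv(x), 1)))   # True

# Local invariance: after stride-2 max pooling, shifting the input by one sample
# gives the same pooled output -- the motif still lights up the same pooling unit.
print(conv_pool(x))
print(conv_pool(shift(x, 1)))
```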
Okay, any other questions? No one is asking anything — I think we handle everything backstage — so that's it for today. Great, thanks.

Let's see. I don't want to start on multiple characters and applications now, so next week we'll talk about applications of convolutional nets in practice: special types of convolutional net architectures that are very popular, things like ResNet, which use some of the ideas I talked about earlier in the context of recurrent nets, of copying the input onto the output. We'll talk about how we use convolutional nets for things like detecting objects in images, semantic segmentation, and various other applications. I'm not going to go into a huge amount of detail there, because if you really want to learn about this you should probably take a computer vision course — Fergus's computer vision course, in which he goes into gory detail about exactly how you do this. But it will give you some idea. There are also uses of convolutional nets for image generation; I won't talk about that very much, but it's a very interesting area.

This particular convolutional net — one of the first practical convolutional nets for recognizing characters — does not have separate pooling layers; it's basically only convolutions with stride.
I think this one had basically two layers of convolution with stride two, with kernels that are five by five: eight feature maps in the first layer — sorry, twelve feature maps in the first layer — and then sixteen feature maps, I think, in the second layer, then 30 units, and then the output. So the convolutional layers had a stride of two, and the reason we didn't have separate convolution and subsampling/pooling layers is that we just couldn't afford it: the computation was much more expensive with separate convolution and pooling, and it's much cheaper to directly use subsampling convolutions with stride. That's what we used at the time, because that's what we could afford. This system, if I remember correctly, had on the order of 40,000 parameters, and maybe a hundred thousand connections, something like that, and it took about 10 days to two weeks to train on the computers of the time. Now it would train in seconds, essentially. Yeah — next week we're going to train a small convolutional net in a few seconds; probably the one you'll train is LeNet-5, which came a few years after this one — actually one year after. So the one I showed here, let's see, this one, had separate convolution and pooling, and this one came a year later; that was NIPS '89, so just one year later. And then a few years later we had the LeNet-5 architecture, which was basically a slight variation on this. This 1989 architecture actually ended up being used quite widely for recognizing amounts on checks and various other applications — I'll talk about that a bit next week.

I think people don't realize that the longer it takes to train a system, the fewer iterations one gets to improve parameters, hyperparameters, and so on. Right now we can easily get a good network because we can just run it a few times; before, you had to wait a week or whatever. I still remember at the beginning, when we were training on ImageNet around 2012, it took like a week to get a single data point — a single observation.

Okay, so the most complex neural nets that people train, at any point in history, for the last 30 years and even more, always take about two weeks. And that's because two weeks is about how long you're willing to wait for something to run without it becoming completely impractical. So regardless of how many GPUs you have, regardless of how much data you have, regardless of anything, at any point in history the biggest neural nets applied to the biggest applications take about two weeks to train — somewhere between one week and three weeks. It was also true of speech recognition systems, by the way — or it was true; now not so much anymore. So that's pretty much a constant: as hardware gets better, we just make the networks bigger, we train on more data, we run more iterations, things like that.

So what I should tell you is that back in 1988-1989 it was very difficult to even collect data. It was complicated to even plug a camera into a computer — that was a major undertaking, just to collect data. And scanners didn't just plug in over some simple serial port either —
there was no USB, right? It was usually a SCSI port, or a parallel port, or something like that, or RS-232. Usually you had to buy what was called a video grabbing card, which was pretty expensive, and put it in the computer. This was very expensive — multiple thousands of dollars — plus a video camera; there were CCD cameras, but that's about it. And just having enough memory on the card to store the image was already a challenge, because it's not like people had hundreds of megabytes. An image that is 256 by 256 was already quite a lot of memory, even in black and white — and it was all black and white. So that's the first thing: there wasn't a lot of data, and you had to talk to people who had spent months setting up a system and a huge amount of money collecting it.

Then the computers: it was just about the time that computers started having floating-point coprocessors, so they could do floating-point multiplications relatively fast, and that started appearing in standard workstations. Standard workstations were Sun workstations; they cost tens of thousands of dollars each. I had one for myself at Bell Labs and I couldn't believe it, because when I was a postdoc back in Toronto there was one for the entire department. Those things could do about one megaflop — one million floating-point operations per second — so the systems we have today are more than a million times faster. To the point that the deep learning framework I wrote, called SN, actually had a mode for integer arithmetic, because floating point was just too slow — and by the way, things like that are now coming back.

And then: there was no Python, there was no MATLAB, there was no Jupyter notebook or anything like that. You were running Unix on a workstation, you had Emacs as the editor, and you could write C programs. So what Léon Bottou and I decided to do in 1987, when I was finishing my PhD and he was starting his, was to write something like PyTorch for the time, and we wrote this thing called SN. This was the ancestor of PyTorch, TensorFlow — all the frameworks you see today are basically descendants of that early system. But because there was no Python, we had to write our own language. Python-like — well, in fact it wasn't Python-like at all: it was Lisp.
It was a Lisp interpreter. So we wrote a Lisp interpreter, and then we plugged into it a library — which we also wrote — of tensor operations. Eventually we wrote a compiler for it, a little bit like TorchScript if you want. That's what we used, and nobody else had it, so it gave us superpowers. It had the big advantage that we were the only people able to train convolutional nets, because we had invested all this time in writing the software. But the downside was that nobody else could train convolutional nets, so a lot of people simply didn't want to get into this: they didn't want to spend the time writing their own neural net framework and convolutional net training environment, because that would have required a year or two of writing software and not doing anything else. So we gained this reputation that only people like us could train convolutional nets, but it was just because we had invested the time to write the tools and the software. Now you can just download PyTorch; back then you had to write your own interpreter. So this is why those ideas, which were pretty self-evident, were really, really hard to implement at the time, and you had to really believe in them to be willing to invest the time to do it.

All right, that's pretty much it, so we'll see each other again next week. Again: office hours tomorrow, office hours Friday; the deadline for the homework is going to be Thursday, and we have one more lecture and two more labs before that. Have a nice week, everyone. Bye bye.

Yeah, to answer that question: this was coded in C, not C++ — which didn't really exist yet back in 1987. All right, bye bye. Take care.