In this case we have a network which has its input on the left-hand side. Usually you have the input on the bottom or on the left. They are pink in my slides, so if you draw nodes, make them pink. I'm just kidding. And how many hidden layers do you count here? Four hidden layers. So overall, how many layers does the network have? Six, because we have four hidden layers plus one input plus one output layer. In this case I have two neurons per layer. So what does that mean? What are the dimensions of the matrices we are using here? Two by two. So what does a two-by-two matrix do? Come on, you know the answer to this question. Rotation, then scaling, then shearing, and reflection. Fantastic. So we constrain our network to perform all of its operations on the plane. We have seen the first time that if I allow the hidden layer to be a hundred neurons wide, training is easy.

So we started from this network, which has intermediate layers that we forced to be two-dimensional, such that all the transformations are constrained to happen on a plane. This is what the network does to our plane: it folds it on specific regions, and those foldings are very abrupt. This is because all the transformations are performed within that 2D layer. This training took me really a lot of effort, because the optimization is actually quite hard. Whenever I had a hundred neurons in the hidden layer it was very easy to train; this one really took a lot of effort, and you have to tell me why. If you don't know the answer right now, you'd better know it for the midterm, so take note of the question.

Right, so this is the final output of the network, which is also a 2D embedding; I have no nonlinearity on my last layer, and these are the final classification regions. So let's see what each layer does. This is the first layer, an affine transformation. It looks like a 3D rotation, but it's not: it's just a 2D rotation plus reflection, scaling, and shearing. And then what happens? Do you see? We had the ReLU, which is killing all the negative sides of this space. This is the second affine transformation, and then you apply the ReLU again; you can see all the negative subspaces have been erased and set to zero. Then we keep going with the third affine transformation, which zooms in a lot, and then again the ReLU layer kills three of the four quadrants; only one quadrant survives every time. Then we go with the fourth affine transformation, which elongates a lot, because given that we confined all the transformations to live in the plane, the network really needs to stretch and use all the power it can. This is the second to last, and then we have the fifth and final affine transformation, and we finally reach linearly separable regions. Now we're going to see how each affine transformation can be split into its components. So we have rotation.
Then we have squashing, zooming; then rotation and reflection, because the determinant is minus one; and then the final bias. Again you take the positive part with the ReLU, the rectified linear unit. Again rotation, flipping (because again we had a determinant of minus one), zooming, rotation, one more reflection, and then the final bias. This was the second affine transformation. Then we have the positive part again, and we are at the third layer. So rotation, reflection, zooming, and then the composition, as we did before; you should be aware of that by now. And then the final piece is the translation, and again the third ReLU. Then we had the fourth layer: rotation, reflection because the determinant was negative, zooming, another rotation, once more a reflection, and the bias. Finally a ReLU, and then we had the last, fifth layer: rotation, zooming (we didn't have a reflection there because the determinant was plus one), then a reflection in this case because the determinant was negative, and then finally the final bias.

And so this is pretty much how this network, which was just a sequence of layers with only two neurons per layer, performs the classification task, with all the transformations constrained to live on the plane. Okay, so this was really, really hard to train. Can you figure out why? What happens if the bias of one of the five layers puts my points away from the top-right quadrant? Exactly: if one of the biases pushes the points away from the top-right quadrant, then the ReLUs kill everything and everything gets collapsed to zero, and from there you can't do anything anymore. So this network was really hard to train. If you just make it a little bit fatter, instead of constraining each hidden layer to two neurons, then it is much easier to train. Or you can do a combination of the two: instead of just a fat network, you can have a network that is less fat but has a few more hidden layers. Okay, so that was pretty much it.
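As an aside, the rotation, scaling, reflection, and shearing factors we keep reading off each two-by-two weight matrix can be recovered numerically. Here is a minimal sketch, not the code used for the slides, with a made-up weight matrix: the SVD factors it into a rotation or reflection, an axis-aligned scaling, and another rotation or reflection, and a negative determinant signals a reflection.

```python
import torch

# A 2x2 weight matrix from one of the affine layers (hypothetical values).
W = torch.tensor([[1.5, 0.3],
                  [-0.2, 0.8]])

# SVD factors W into (rotation/reflection) @ diag(scaling) @ (rotation/reflection):
#   W = U @ diag(S) @ Vh
U, S, Vh = torch.linalg.svd(W)

print("scaling factors:", S)                  # the "zooming" part
print("det(W):", torch.linalg.det(W).item())  # negative determinant => a reflection
print("det(U):", torch.linalg.det(U).item())  # each factor with det -1 flips the plane
```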
Yes, question? So the fatness is how many neurons you have per hidden layer, right. Okay, so the question is: how do we determine the structure, the configuration, of our network? How do we design a network? The answer is, that's what Yann is going to be teaching across the semester, so keep your attention high, because that's what we are going to be teaching here. That's a good question, right? There is no mathematical rule; there is a lot of experimental, empirical evidence, and with a lot of people trying different configurations we found things that actually work pretty well. Again, we're going to be covering these architectures in the following lessons. Other questions? Don't be shy. No? Okay, so I guess we can switch to the second part of the class.

Okay, so we're going to talk about convolutional nets today, and I'll start with something that's relevant to convolutional nets, but not just to them, which is the idea of transforming the parameters of a neural net. Here we have a diagram that we've seen before, except with a small twist. We have a neural net G(x, w), w being the parameters and x being the input, that makes a prediction about an output, and that goes into a cost function. We've seen this before, but the twist here is that the weight vector, instead of being a parameter that's optimized directly, is actually itself the output of some other function, possibly parameterized; in this case it is a parameterized function H whose only input is another parameter u. So essentially what we've done here is make the weights of that neural net a function of some more elementary parameters u, through a function H.

And you realize really quickly that backprop just works there: if you backpropagate gradients through the G function to get the gradient of whatever objective function you're minimizing with respect to the weight parameters, you can keep backpropagating through the H function to get the gradients with respect to u. So when you're updating u, you are multiplying the Jacobian of the objective function with respect to the parameters by the Jacobian of the H function with respect to its own parameters. You get the product of two Jacobians here, which is just what you get from backpropagating. You don't have to do anything in PyTorch for this; it will happen automatically as you define the network, and that's the update that occurs.

Now of course, w being a function of u through H, the change in w will be the change in u multiplied by the Jacobian of H: Δw = (∂H/∂u) Δu. The update to u itself is gradient descent, Δu = -η (∂H/∂u)ᵀ (∂C/∂w)ᵀ, so the effective change in w that you get (you're not updating w directly; you're actually updating u) is Δw = -η (∂H/∂u) (∂H/∂u)ᵀ (∂C/∂w)ᵀ. The product of the Jacobian with its own transpose is a square matrix, N_w by N_w, where N_w is the dimension of w.
Okay, so the Jacobian ∂H/∂u has as many rows as w has components, and the number of columns is the number of components of u; the transposed one, of course, is the other way around, N_u by N_w. So when you take the product of those two matrices you get an N_w by N_w matrix, and when you multiply this by the N_w gradient vector you get an N_w vector, which is what you need for updating the weights.

Okay, so that's a general form of transforming the parameter space, and there are many ways you can use this. A particular way of using it is when H is what we talked about last week, a sort of Y connector. Imagine the only thing H does is take one component of u and copy it multiple times, so that you have the same value, the same weight, replicated across the G function; the G function uses the same value multiple times. It would look like this: let's imagine u is two-dimensional, (u1, u2), and w is four-dimensional, with w1 and w2 equal to u1, and w3 and w4 equal to u2. So basically you only have two free parameters, and when you're changing one component of u, you're changing two components of w at the same time, in a very simple manner. And that's called weight sharing: when two weights are forced to be equal, they are actually both equal to one more elementary parameter that controls both. That's weight sharing, and it's the basis of a lot of ideas, convolutional nets among others, but you can think of it as a very, very simple form of H(u).

So again, you don't need to do anything special for this, in the sense that when you have weight sharing, if you implement it explicitly with a module that does this kind of Y connection, then on the way back, when the gradients are backpropagated, the gradients are summed up. The gradient of some cost function with respect to u1, for example, will be the sum of the gradients of that cost function with respect to w1 and w2, and similarly the gradient with respect to u2 will be the sum of the gradients with respect to w3 and w4. Okay, that's just the effect of backpropagating through the two Y connectors.
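To make the Y connector concrete, here is a minimal sketch of that exact example in PyTorch, with made-up values for u and x: two free parameters, four effective weights, and the gradients from the shared copies summing up automatically.

```python
import torch

# Two free parameters u; the effective weights w replicate each one twice,
# i.e. w1 = w2 = u1 and w3 = w4 = u2 (the "Y connector" H).
u = torch.tensor([0.5, -1.0], requires_grad=True)
w = u.repeat_interleave(2)          # w = [u1, u1, u2, u2]

x = torch.tensor([1.0, 2.0, 3.0, 4.0])
cost = (w * x).sum()                # some cost that depends on all four weights
cost.backward()

# dC/du1 = dC/dw1 + dC/dw2 = 1 + 2;  dC/du2 = dC/dw3 + dC/dw4 = 3 + 4
print(u.grad)                       # tensor([3., 7.])
```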
Okay, here's a slightly more general view of this parameter transformation, which some people have called hypernetworks. A hypernetwork is a network where the weights of one network are computed as the output of another network. So you have a network H that looks at the input, has its own parameters u, and computes the weights of a second network. There are various names for this, and the idea is very old; it goes back to the 80s, with people using what were called multiplicative interactions, or three-way networks, or sigma-pi units, and they're basically this idea. This is maybe a slightly more general formulation of it: you have a function that is dynamically defined, because in G(x, w), w is really a complex function of the input and some other parameter. This is a particularly interesting architecture when what you're doing to x is transforming it in some way: you can think of w as being the parameters of that transformation, so y would be a transformed version of x, and the function H basically computes that transformation. But we'll come back to that in a few weeks. I just wanted to mention this because it's a small modification of what we had before: you just have one more wire that goes from x to H, and that's how you get those hypernetworks.
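Here is a minimal sketch of that wiring, with made-up dimensions; it's an illustration of the idea, not a reference implementation. H looks at x, uses its own parameters u, and emits the weight matrix that G then applies to x.

```python
import torch
import torch.nn as nn

class HyperNet(nn.Module):
    """H looks at the input x and emits the weights of a second network G."""
    def __init__(self, x_dim=8, g_in=8, g_out=4):
        super().__init__()
        self.g_in, self.g_out = g_in, g_out
        # self.h holds u, the parameters of H itself
        self.h = nn.Linear(x_dim, g_in * g_out)

    def forward(self, x):
        # W is a function of the input (and of u, through h)
        W = self.h(x).view(self.g_out, self.g_in)
        # G(x, W): here G is just a linear map using the generated weights
        return W @ x

net = HyperNet()
y = net(torch.randn(8))
y.sum().backward()   # gradients flow back into u automatically
```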
So the idea that you can have one parameter controlling multiple effective parameters in another network is useful, and one reason it's useful is if you want to detect a motif in an input, and you want to detect this motif regardless of where it appears. So let's say you have an input. Let's say it's a sequence, but it could equally be an image; in this case it's a sequence of vectors. And you have a network G(x, w) that takes a collection of three of those vectors, three successive vectors, and is trained to detect a particular motif over those three vectors. Maybe this is, I don't know, electrical power consumption, and sometimes you might want to be able to detect a blip or a trend or something like that. Or maybe it's financial instruments of some kind, some sort of time series. Or maybe it's a speech signal, and you want to detect a particular sound that consists of three vectors that define the audio content of that speech signal. So you'd like to be able to detect, if it's a speech signal, that a particular sound occurs; for doing speech recognition you might want to detect, say, the sound "p", wherever it occurs in the sequence. You want some detector that fires whenever the sound "p" is pronounced. So what you'd like to have is a detector that you can slide over the input and that detects this motif regardless of where it occurs. What you need is some parameterized function of which you have multiple copies that you can apply to various regions of the input, and they all share the same weights, but you'd like to train the entire system end to end.

So for example, let's talk about a slightly more sophisticated case, where you have a keyword that's being pronounced. The system listens to sound and wants to detect when a particular keyword, a wake-up word, has been pronounced. So this is Alexa, right? You say "Alexa" and Alexa wakes up; it goes bonk. So what you'd like to have is some network that takes a window over the sound and keeps detecting in the background. But you'd like to be able to detect the keyword wherever it occurs within the frame that is being looked at, or listened to, I should say. So you could have a network like this, where you have replicated detectors that all share the same weights, and then the outputs, which are the scores as to whether something has been detected or not, go into a max function, and that's the output.

The way you train a system like this: you have a bunch of audio samples where the keyword has been pronounced, another bunch of audio samples where the keyword was not pronounced, and then you train a two-class classifier: turn on when "Alexa" is somewhere in this frame, turn off when it's not. But nobody tells you where the word "Alexa" occurs within the window that you train the system on, because it's really expensive for labelers to look at the audio signal and tell you exactly where the word "Alexa" is being pronounced. The only thing they know is that within this segment of a few seconds, the word has been pronounced somewhere. So you'd like to apply a network like this, with those replicated detectors; you don't know exactly where the keyword is, but you run the scores through this max, and you want to backpropagate gradients through it so that the system learns to detect "Alexa", or whatever wake-up word, wherever it occurs.

So what happens there is you have those multiple copies, five copies in this example, of this network, and they all share the same weights. You can see this as just one weight vector sending its value to five different instances of the same network, and so when you backpropagate to the five copies of the network, you get five gradients, and those gradients get added up for the parameter. Now, there is a slightly strange way this is implemented in PyTorch and other deep learning frameworks, which is that this accumulation of gradients in a single parameter is done implicitly. It's one reason why, before you do a backprop in PyTorch, you have to zero out the gradients: because there is this implicit accumulation of gradients when you do backpropagation.
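Here is a minimal sketch of that pattern, with made-up sizes (a length-7 signal, windows of three values, and one linear detector standing in for the whole network): the same weights score every window, a max picks the best position, and backprop accumulates gradient into the shared parameters.

```python
import torch
import torch.nn as nn

detector = nn.Linear(3, 1)              # one detector, shared across positions
x = torch.randn(7)                      # a short 1-D signal

# Apply the same detector to five overlapping windows of three values each
windows = x.unfold(0, 3, 1)             # shape (5, 3): five shifted windows
scores = detector(windows).squeeze(-1)  # five scores, one per position
out = scores.max()                      # fire if the motif occurs anywhere

out.backward()
# The shared weights receive the summed gradients of every copy that
# contributes to the output (here, just the winning window of the max).
print(detector.weight.grad)
```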
Okay, so here's another situation where this is useful, and this is really the original motivation behind convolutional nets in the first place: the problem of training a system to recognize a shape independently of the position where the shape occurs, and of whether there are distortions of that shape in the input. Here is a very simple type of convolutional net that has been built by hand. It's not been trained; it's been designed by hand, and it's designed explicitly to distinguish Cs from Ds. You can draw a C on the input image, which is very low resolution. What distinguishes Cs from Ds is that Cs have endpoints (there is a stroke that ends, and you can imagine designing a detector for that), whereas Ds have corners. So if you have an endpoint detector, something that detects the end of a segment, and a corner detector: wherever you have corners detected, it's a D, and wherever you have segments that end, it's a C.

So here's an example with a C. You take the first detector, the little black-and-white motif at the top, which is an endpoint detector: it detects the end of a segment. The way this is represented here is with black and white pixels; think of this as some sort of template. You're going to take this template and swipe it over the input image, and compare the template to the little patch of image placed underneath. The way you determine whether they match is a dot product: think of those black and white pixels as values of plus one or minus one, say plus one for black and minus one for white, and think of the input pixels the same way. When you compute the dot product of a little window with that template, if they are similar you get a large positive value; if they are dissimilar you get zero, or a negative value, or at least a smaller value.

So you take that little detector, compute the dot product with the first window, the second window, the third window, and so on; you shift by one pixel every time, for every location, and you record the result. What you get is this: the grayscale here is an indication of the matching, which is just the dot product between the vector formed by those template values and the patch at the corresponding location on the input. This output image is roughly the same size as the input image, minus border effects, and whenever the output is dark, there is a match. So you see a match here, because the endpoint detector matches the endpoint, and you see sort of a match at the bottom, and the other values are not as dark, not as strong if you want.

Now, if you threshold those values, setting the output to plus one if it's above the threshold and zero if it's below (you have to set the threshold appropriately), you get those maps: this little guy detected a match at the two endpoints of the C. So now if you take this map and you sum it up, just add all the values, you get a positive number, and thresholding that gives you a C detector. It's not a very good C detector, it's not a very good detector of anything, but for these particular examples of Cs and Ds it will work well enough. For the D it's similar: those other detectors are meant to detect the corners of the D, so this detector, as you swipe it over the input, will detect the upper-left corner, and that one will detect the lower-right corner. Once you threshold, you get those two maps where the corners are detected, and then you can sum those up, and the D detector will turn on.

Now, what you see here is an example of why this is good, because this detection is shift-equivariant. If I take the same input D and shift it by a couple of pixels and run the detectors again, they will detect the motifs wherever they appear, and the output will be shifted. This is called equivariance to shift: the output of the network is equivariant to shift, which means that if I shift the input, the output gets shifted but is otherwise unchanged. Invariance would be if I shifted the input and the output were completely unchanged; but here it is modified.
It is just modified in the same way as the input. And so if I just sum up the activities in the feature maps here, it doesn't matter where the matches occur: my D detector will still activate if I just compute the sum. So this is a sort of handcrafted pattern recognizer that uses local feature detectors and then sums up their activity, and what you get is an invariant detection. This is actually a very classical way of building certain types of pattern recognition systems, going back many years.

But the interesting thing, of course, would be to learn those templates. Can we view this as just a neural net, backpropagate through it, and learn those templates as the weights of a neural net? After all, we're using them to do a dot product, a weighted sum. So this layer here, going from the input to those so-called feature maps, which are weighted sums, is a linear operation, and we know how to backpropagate through that. We would have to use a kind of soft threshold, a sigmoid or something like that, instead of the hard threshold, because otherwise we can't do backprop.

Okay, so this operation of taking the dot product of a bunch of coefficients with an input window, and then swiping it over the input, is a convolution. That's the definition of a convolution; it's the one up there. In the one-dimensional case, imagine you have an input x indexed by j. You take a window of x at a particular location i, and you do a weighted sum of that little window of x values, multiplied by the weights w:

y_i = Σ_j w_j x_{i+j}

The sum presumably runs over a small window, so j here would go from one to, I don't know, five, something like that, which is the case in the little example I showed earlier. And that gives you one y_i. So: take the first window of five values of x, compute the weighted sum with the weights, that gives you y_1; then shift the window by one, compute the dot product of that window with the w's, that gives you y_2; shift again, etc.

Now, in fact, in practice, when people implement things like PyTorch, there is a confusion between two things that mathematicians think are very different, but that are in fact pretty much the same: convolution and cross-correlation. In a convolution, the convention is that the index goes backwards in the window when it goes forwards in the weights, y_i = Σ_j w_j x_{i-j}; in cross-correlation they both go forward. In the end it's just a convention; it depends on how you organize the data in your weights, and you can interpret cross-correlation as a convolution if you read the weights backwards, so really it doesn't make any difference. But for certain mathematical properties of convolution, if you want everything to be consistent, you have to have the j in the w with the opposite sign to the j in the x.

The two-dimensional version of this: if you have an image x with two indices, in this case i and j, you do a weighted sum over two indices k and l. You have a two-dimensional window indexed by k and l, and you compute the dot product of that window over x with the weights, and that gives you one value y_{ij} of the output:

y_{ij} = Σ_{k,l} w_{kl} x_{i+k, j+l}

The vector w, or the matrix w in the 2D version (there are obvious extensions of this to 3D and 4D, etc.), is called a kernel, a convolution kernel.
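A quick numeric check of the two conventions, as a sketch with a made-up signal and kernel: what F.conv1d actually computes is the cross-correlation above, and reading the kernel backwards turns it into a true convolution.

```python
import torch
import torch.nn.functional as F

x = torch.arange(8.0)               # input signal x_j
w = torch.tensor([1.0, 2.0, 3.0])   # kernel w_j

# What deep learning frameworks compute is cross-correlation:
#   y_i = sum_j w_j * x_{i+j}
y_corr = F.conv1d(x.view(1, 1, -1), w.view(1, 1, -1))

# Flipping the kernel gives the true-convolution convention,
# where the index goes backwards in x:
y_conv = F.conv1d(x.view(1, 1, -1), w.flip(-1).view(1, 1, -1))

print(y_corr.flatten())  # first value: 1*0 + 2*1 + 3*2 = 8
print(y_conv.flatten())  # first value: 3*0 + 2*1 + 1*2 = 4
```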
Okay, so far so clear; I'm sure this is known to many of you. What we're going to do with this is build a network as a succession of convolutions. Where in a regular neural net you have an alternation of linear operators and pointwise nonlinearities, in convolutional nets we're going to have an alternation of linear operators that happen to be convolutions (or multiple convolutions), then a pointwise nonlinearity, and then a third type of operation called pooling, which is actually optional.

Before I go further, I should mention that there are twists you can make to this convolution. One twist is what's called a stride. A stride in a convolution consists of moving the window from one position to the next not by just one value, but by two, or three, or four. So let's say you have a one-dimensional input of size one hundred, and a convolution kernel of size five, and you convolve this kernel with the input, making sure the window stays within the input of size 100. The output you get has 96 values: it's the number of inputs minus (the size of the kernel minus one). The kernel is five, so that makes four, and you get 100 minus four, which is 96. That's the number of windows of size five that fit within this big input of size 100. Now if I use a stride, what I do is take my window of five, where I apply the kernel, and shift it not by one value but by two (let's say values; they're not necessarily pixels). The number of outputs I get is then roughly divided by two: instead of 96 I get a bit under 50, 48 in fact; you can figure out the exact number in your head.

Very often, when people run convolutions in convolutional nets, they actually pad the convolution, because they sometimes like the output to be the same size as the input. So they let the input window slide past the end of the vector, assuming it's padded with zeros, usually on both sides. Does that have any effect on performance, or is it just for convenience? If it has an effect on performance, it's bad; but it is convenient. That's pretty much the answer. The assumption that's bad is assuming that where you don't have data, it's equal to zero. When your nonlinearity is a ReLU, that's not necessarily completely unreasonable, but it sometimes creates funny border effects, boundary effects. Everything clear so far?
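Checking that size arithmetic in PyTorch, as a small sketch with made-up channel counts: a kernel of five gives 96 outputs, a stride of two roughly halves that to 48, and zero-padding by two keeps the output at 100.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 100)                        # 1-D input of size 100

conv = nn.Conv1d(1, 1, kernel_size=5)             # no stride, no padding
print(conv(x).shape[-1])                          # 96 = 100 - (5 - 1)

strided = nn.Conv1d(1, 1, kernel_size=5, stride=2)
print(strided(x).shape[-1])                       # 48 = (100 - 5) // 2 + 1

padded = nn.Conv1d(1, 1, kernel_size=5, padding=2)
print(padded(x).shape[-1])                        # 100: zero-padding keeps the size
```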
Okay, so what we're going to build is a neural net composed of those convolutions, used as local feature detectors, followed by nonlinearities, and then we're going to stack multiple layers of those. The reason for stacking multiple layers is that we want to build hierarchical representations of the data, of the visual world. Convolutional nets are not necessarily applied to images; they can be applied to speech and other signals. They can basically be applied to any signal that comes to you in the form of an array, and I'll come back to the properties that this array has to satisfy.

So why do you want to build hierarchical representations? Because the world is compositional, and I alluded to this in the first lecture, if I remember correctly. It's the fact that pixels assemble to form simple motifs like oriented edges; oriented edges assemble to form local features like corners, T-junctions, gratings, things like that; those assemble to form motifs that are slightly more abstract; those assemble to form parts of objects; and those assemble to form objects. So there is a natural compositional hierarchy in the natural world, and this hierarchy is not just a matter of visual perception; it's true at the physical level. You start at the lowest level of description with elementary particles, and they come together to form less elementary particles, and those form atoms, and atoms form molecules (or macromolecules, or polymers, and so on), and molecules form materials, and materials form parts of objects, and parts of objects form objects. There is this natural compositional hierarchy; the world is built this way, and it may be why the world is understandable.

There is a famous quote from Einstein that says the most incomprehensible thing about the world is that the world is comprehensible. It seems like a conspiracy that we live in a world that we are able to comprehend. But we can comprehend it because the world is compositional, and it happens to be easy to build, in a compositional world, brains that can interpret a compositional world. It still seems like a conspiracy to me. There's a quote from a not that famous, but somewhat famous, statistician at Brown, Stuart Geman, who says essentially that this sounds like a conspiracy, like magic, but if the world were not compositional we would need even more magic to be able to understand it. The way he puts it is: the world is compositional, or there is a God. You would need to appeal to superior powers to explain how we can understand the world if it were not compositional.

Okay, so this idea of hierarchy and local feature detection comes from biology; the whole idea of convolutional nets was inspired by biology. What you see here on the right is a diagram by Simon Thorpe, a psychophysicist who did some relatively famous experiments showing that the way we recognize everyday objects is extremely fast. If you flash images of everyday objects to a person, one every 100 milliseconds or so, you realize that the time it takes for the person to identify, in a long sequence, whether there was a particular object,
let's say a tiger, is about 100 milliseconds. So the time it takes our brain to interpret an image and recognize basic objects in it is about a hundred milliseconds, a tenth of a second. And that's just about the time it takes for the nerve signal to propagate from the retina, where the image forms in the eye, to what's called the LGN, the lateral geniculate nucleus, which is a small piece of brain that basically does contrast enhancement and gain control and things like that. Then that signal goes to the back of your brain, to V1, the primary visual cortex in humans; then to V2, which is very close to V1 (there's a fold that puts V1 right in front of V2, with lots of wires between them); then V4; and then the inferotemporal cortex, which is sort of on the side here. And that's where all the categories are represented: there are neurons in the inferotemporal cortex that represent generic object categories.

People have done experiments on this. If you take patients who are in hospital and have their skull open, because the exact source of their epileptic seizures needs to be located, they have electrodes at the surface of their brain, and you can show them movies and observe whether a particular neuron turns on for particular movies. You show them a movie with Jennifer Aniston, and there is this neuron that only turns on when Jennifer Aniston is there; it doesn't turn on for anything else, as far as we could tell. So you seem to have very selective neurons in the inferotemporal cortex that react to a small number of categories.

There's a running joke in neuroscience about a concept called the grandmother cell: the one neuron in your inferotemporal cortex that turns on when you see your grandmother, regardless of her position, what she's wearing, how far away she is, whether it's a photo or not. Nobody really believes in this concept; people believe in distributed representations instead. There is no such thing as a cell that turns on just for your grandmother; there is a collection of cells that turn on for various things, and together they serve to represent general categories. But the important thing is that they are invariant to position, size, illumination, all kinds of different things, and the real motivation behind convolutional nets is to build neural nets that are invariant to irrelevant transformations of the input. You can still recognize a C, a D, or your grandmother regardless of the position and, to some extent, the orientation, the style, etc.

So this fact that the signal only takes 100 milliseconds to go from the retina to the inferotemporal cortex seems to suggest that, if you count the delay to go through every neuron or every stage in that pathway, there's barely enough time for a few spikes to get through. So there's no time for complex recurrent computation; it's basically a feedforward process.
It's very fast, and we need it to be fast because it's a question of survival for us. For most animals, you need to be able to recognize really quickly what is going on, particularly fast-moving predators, or prey for that matter. So that suggests the idea that perhaps we could come up with some sort of neural net architecture that is completely feedforward and can still do recognition.

The diagram on the right is from Gallant and Van Essen; it's a sort of abstract, conceptual diagram of the two pathways in the visual cortex: the ventral pathway and the dorsal pathway. The ventral pathway is basically the V1, V2, V4, IT hierarchy, which starts at the back of the brain and goes to the bottom and to the side. The dorsal pathway goes through the top, also towards the inferotemporal cortex. And there is this idea that the ventral pathway is there to tell you what you're looking at, whereas the dorsal pathway basically identifies locations, geometry, and motion. So there is a pathway for "what" and another pathway for "where", if you want, and they seem fairly separate in the human or primate visual cortex, although of course there are interactions between them.

So various people had the idea of using this. Where does the idea come from? There is classic work in neuroscience from the late 50s and early 60s by Hubel and Wiesel; they're on the picture here, and they won a Nobel Prize for it, so it's really classic work. What they showed with cats, basically by poking electrodes into cat brains, is that neurons in the cat brain in V1 are only sensitive to a small area of the visual field, and they detect oriented edges, contours if you want, in that particular area. The area to which a particular neuron is sensitive is called a receptive field. You take a particular neuron and show it an oriented bar that you rotate, and at one point, for a particular angle, the neuron will fire, and as you move away from that angle the activation of the neuron diminishes. Those are called orientation-selective neurons, and Hubel and Wiesel called them simple cells. If you move the bar a little bit, out of the receptive field, that neuron doesn't fire anymore, doesn't react to it. But there's going to be another neuron, almost exactly identical to it, just a little bit away from the first one, that does exactly the same thing: it reacts to a slightly different receptive field, but with the same orientation. So you start getting this idea of local feature detectors replicated all over the visual field, which is basically the idea of convolution. So those are called simple cells, and another discovery that Hubel and Wiesel made is the idea of complex cells.
A complex cell is another type of neuron that integrates the outputs of multiple simple cells within a certain area. It takes different simple cells that all detect contours at a particular orientation, edges at a particular orientation, and computes an aggregate of all those activations. It will do either a max, or a sum, or a sum of squares, or a square root of a sum of squares: some function that does not depend on the order of the arguments. Let's say max, for the sake of simplicity. So basically a complex cell will turn on if any of the simple cells within its input group turns on. That complex cell will therefore detect an edge at a particular orientation regardless of its position within that little region, so it builds a little bit of shift invariance: the representation coming out of the complex cells is insensitive to small variations of the positions of features in the input.

A gentleman by the name of Kunihiko Fukushima (no real relationship with the nuclear power plant), in the late 70s and early 80s, experimented with computer models that implemented this idea of simple cells and complex cells, and he had the idea of replicating this over multiple layers. The exercise he did was very similar to the one I showed earlier with the handcrafted feature detectors. Some of the feature detectors in his model were handcrafted, but some of them were learned, and they were learned by an unsupervised method. He didn't have backprop; well, backprop existed, but it wasn't really popular and people didn't use it. So he trained those filters with something that amounts, a little bit, to a sort of clustering algorithm, separately for each layer. He trained the filters of the first layer with handwritten digits (he also had a dataset of handwritten digits), then fed this to complex cells that pool the activity of the simple cells together, and that formed the input to the next layer, where he repeated the same learning algorithm. His model of the neuron was very complicated; it was inspired by biology, so he had separate inhibitory neurons, with the other neurons having only positive weights, etc. He managed to get this thing to kind of work. Not very well, but it sort of worked.

Then, a few years later, I basically got inspired by similar architectures, but trained them supervised, with backprop. That's the genesis of convolutional nets, if you want. And then, more or less independently, Max Riesenhuber in Tomaso Poggio's lab at MIT rediscovered this architecture too, but also didn't use backprop, for some reason; he calls his version HMAX.

So these are the early experiments I did with convolutional nets when I was finishing my postdoc at the University of Toronto in 1988.
That goes back a long time. I was trying to figure out whether this works better on small datasets. If you have a tiny amount of data, and you train a fully connected network, or a linear network with just one layer, or a network with local connections but no shared weights, and you compare these with what was not yet called a convolutional net, where you have shared weights and local connections: which one works best? It turned out that, in terms of generalization ability (those are the curves on the bottom left), the top curve is basically the baby convolutional net architecture, trained on a very simple dataset of handmade digits that were drawn with a mouse; we basically had no way of collecting images at that time. If you have local connections without shared weights, it works a little worse; if you have fully connected networks, it works worse still; and a linear network not only works worse but also overfits, it overtrains, so the test error goes back up after a while. This was trained with 320 training samples, which is really, really small. Those networks had on the order of five thousand connections and a thousand parameters, so this is a million, maybe a billion, times smaller than what we do today.

Then I finished my postdoc and went to Bell Labs. Bell Labs had slightly bigger computers, but what they really had was a dataset that came from the postal service: zip codes from envelopes. We built a dataset out of those zip codes, trained a slightly bigger neural net for three weeks, and got really good results. This convolutional net did not have separate convolution and pooling; it had strided convolutions, convolutions where the window is shifted by more than one pixel. The result of a convolution whose stride is more than one is an output map whose resolution is smaller than the input, and you see an example here: the input is 16 by 16 pixels (that's what we could afford), the kernels are 5 by 5, but they are shifted by two pixels every time, so the output is smaller because of that.

And then, one year later, this was the next generation of convolutional nets. This one had separate convolution and pooling. The pooling operation at that time was just another neuron, except that all the weights of that neuron were equal: a pooling unit was basically a unit that computed an average of its inputs, then added a bias, and then passed the result through a nonlinearity, which in this case was a hyperbolic tangent. All the nonlinearities in this network were hyperbolic tangents; at the time, that's what people were doing. And the pooling operation was performed by shifting the window, over which you compute the aggregate of the outputs of the previous layer, by two pixels. So here you have a 32 by 32 input window, and you convolve it with filters that are 5 by 5 (I should mention that a convolution kernel is sometimes also called a filter). What you get are outputs that are 32 minus four, so 28 by 28. And then there is pooling, which computes an average of the pixels over a two-by-two window, and then shifts that window by two. So how many such windows do you have?
Since the image is 28 by 28, you divide by two: it's 14 by 14. So those feature maps are 14 by 14 pixels, basically half the resolution of the previous layer, because of the stride.

Now it becomes interesting, because what you want next is for the following layer to detect combinations of features from the previous layer. The way to do this is to have a different convolution filter applied to each of those feature maps; you sum up the results of those four convolutions, pass the result through a nonlinearity, and that gives you one feature map of the next layer. Because those filters are five by five, and those images are 14 by 14, those maps are 10 by 10, so as not to have border effects. Each of these feature maps, of which there are 16 if I remember correctly, uses a different set of kernels to convolve the previous layers. In fact, the connection pattern between the feature maps at this layer and the feature maps at the next layer is not full: not every feature map is connected to every feature map. There is a particular scheme of different combinations of feature maps from the previous layer combining to form feature maps at the next layer, and the reason for doing this was just to save computer time; we simply could not afford to connect everything to everything, it would have taken twice the time to run, or more. Nowadays we are more or less forced to have complete connections between feature maps in a convolutional net, because of the way multiple convolutions are implemented on GPUs, which is sad.

Then, the next layer up: those feature maps are 10 by 10, and the next layer is again produced by pooling and subsampling by a factor of two, so those are 5 by 5. Then there is again a 5 by 5 convolution, but of course you can't move a 5 by 5 window over a 5 by 5 image, so it looks like a full connection, but it's actually a convolution (keep this in mind); you basically get only one location. Those feature maps at the top are really the outputs: you have one spatial location, because you can only place one 5 by 5 window within a 5 by 5 image, and you have 10 of those feature maps, each of which corresponds to a category. You train the system to classify digits from 0 to 9, so you have 10 categories.

This is a little animation that I borrowed from Andrej Karpathy, who spent the time to build it, representing multiple convolutions. You have three feature maps here on the input, six convolution kernels, and two feature maps on the output. The first group of three kernels is convolved with the three input feature maps to produce the first of the two output feature maps, the green one at the top. And then (the animation stops here) you switch to the second group of three convolution kernels, which you convolve with the three input feature maps to produce the map at the bottom. So that's an example of n feature maps on the input, m feature maps on the output, and n times m convolution kernels to get all the combinations.
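Here is a sketch tracing those feature-map sizes in PyTorch. The channel counts and the use of average pooling with tanh follow the description above, but this is an illustration of the size arithmetic, not the original network: 32 → 28 → 14 → 10 → 5 → 1.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)                 # 32x32 input window

c1 = nn.Conv2d(1, 6, kernel_size=5)           # 5x5 kernels
p1 = nn.AvgPool2d(kernel_size=2, stride=2)    # 2x2 average, shifted by two
c2 = nn.Conv2d(6, 16, kernel_size=5)
p2 = nn.AvgPool2d(kernel_size=2, stride=2)
c3 = nn.Conv2d(16, 10, kernel_size=5)         # kernel = map size: one location left

h = torch.tanh(c1(x)); print(h.shape)         # (1, 6, 28, 28)
h = torch.tanh(p1(h)); print(h.shape)         # (1, 6, 14, 14)
h = torch.tanh(c2(h)); print(h.shape)         # (1, 16, 10, 10)
h = torch.tanh(p2(h)); print(h.shape)         # (1, 16, 5, 5)
h = c3(h);             print(h.shape)         # (1, 10, 1, 1): one score per digit
```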
Here's another animation, which I made a long time ago, that shows a convolutional net in action after it's been trained, trying to recognize digits. What's interesting to look at here: you have an input, which I believe is 32 rows by 64 columns, and after doing six convolutions with six convolution kernels, and passing the result through a hyperbolic tangent nonlinearity after a bias, you get those six feature maps, each of which activates for a different type of feature. For example, the feature map at the top turns on when there is some sort of horizontal edge; this one turns on whenever there is a vertical edge. And those convolution kernels have been learned through backprop; the thing has just been trained with backprop, not set by hand. They were set randomly initially. You see this notion of equivariance here: if I shift the input image, the activations on the feature maps shift, but are otherwise unchanged. That's shift equivariance.

Okay, then we go to the pooling operation. The first feature map here corresponds to a pooled version of the first one, the second to the second, the third to the third, and the pooling operation here, again, is an average, then a bias, then a sigmoid-like nonlinearity. So if this map shifts by one pixel, the pooled map shifts by half a pixel. You still have equivariance, but shifts are reduced by a factor of two, essentially.

And then you have the second stage, where each of those maps is the result of doing a convolution on each (or a subset) of the previous maps with different kernels, summing up the results, and passing them through a sigmoid. You get those kind of abstract features that are a little hard to interpret visually, but they are still equivariant to shift. And then again you do pooling and subsampling, with a stride of two, so the maps you get here shift by a quarter of a pixel when the input shifts by one pixel. So we reduce the shift, and it may become easier and easier for the following layers to interpret what the shape is, because you exchange spatial resolution for feature-type resolution: you increase the number of feature types as you go up the layers, while the spatial resolution goes down because of the pooling and subsampling. So you make the representation a little more abstract, but less sensitive to shifts and distortions.

The next layer again performs convolutions, but now the height of the convolution kernel is equal to the height of the image, so what you get is a single band for this feature map; it basically becomes one-dimensional. And now any vertical shift is basically eliminated: it turns into some variation of activation, but it's not a shift anymore; it's some simpler, hopefully flatter, transformation of the input. In fact you can show it's simpler, flatter in some sense. Okay, so that's the sort of generic convolutional net architecture.
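The shift-equivariance property described above is easy to verify numerically. Here is a minimal sketch with a random image and an untrained convolution: shifting the input by two pixels shifts the feature map by two pixels, away from the borders.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 1, kernel_size=5, bias=False)
x = torch.randn(1, 1, 32, 32)

y = conv(x)                                        # (1, 1, 28, 28)
y_shift = conv(torch.roll(x, shifts=2, dims=-1))   # shift the input 2 pixels right

# Away from the borders, the new feature map is the old one shifted 2 pixels
print(torch.allclose(y[..., :, :-2], y_shift[..., :, 2:], atol=1e-5))  # True
```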
This is a slightly more modern version of it, where you have some form of normalization (batch norm, group norm, whatever), then the filter bank, which is the multiple convolutions (in signal processing they're called filter banks), then a pointwise nonlinearity, generally a ReLU, and then some pooling, generally max pooling, in the most common implementations of convolutional nets. Because of course you can imagine all kinds of pooling. I talked about the average; the more generic version is the Lp norm, which is: take all the inputs to a complex cell, raise them to some power p, sum them up, and then raise the sum to the power one over p:

( Σ_i x_i^p )^(1/p)

(On the slide, the sum should be inside the p-th root.) Another way to pool: and again, any good pooling operation is an operation that is invariant to a permutation of its inputs; it gives you the same result regardless of the order in which you present the inputs. Here's another example, a function we've talked about before:

(1/b) log ( Σ_i e^{b x_i} )

That's again a kind of symmetric aggregation operation that you can use. So that's one stage of a convolutional net, and then you can repeat that. There are various ways of positioning the normalization; some people put it after the nonlinearity, before the pooling. It depends, but this is typical.
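Both aggregations exist in PyTorch; here is a small sketch on a made-up feature map: nn.LPPool2d computes the Lp form above, and torch.logsumexp gives the soft aggregation, which approaches the hard max as b grows.

```python
import torch
import torch.nn as nn

x = torch.rand(1, 1, 4, 4)            # one small feature map, positive activations

# Lp-norm pooling over 2x2 windows: ( sum_i x_i^p )^(1/p)
lp = nn.LPPool2d(norm_type=2, kernel_size=2)
print(lp(x).shape)                    # (1, 1, 2, 2)

# Soft aggregation: (1/b) * log( sum_i exp(b * x_i) )
b = 10.0
flat = x.view(1, 1, -1)
soft = torch.logsumexp(b * flat, dim=-1) / b
print(soft)                           # approaches max(x) as b grows
```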
So we want to make this a Vector so that's what the X view minus one does and then apply a value to it and You know in this the second fully connected layer and then apply yourself max if you want to do classification And so this is somewhat similar to the architecture you see at the bottom The numbers might be different in terms of you know feature maps and stuff, but but the general architecture is It's pretty much what we're talking about Yes Say again, you know, whatever a gradient descent decides We can look at them, but If you train with a lot of Examples on natural images the kind of filters you will see at the first layer Basically will end up being mostly oriented edge detectors very much similar to what people to what neuroscientists observe in in the cortex of animals individual cortex of animals The word change when you change the model. That's the whole point. Yes Okay, so it's pretty simple. Here's another way of defining those. This is I guess it's kind of a outdated way of doing it right not many people do this anymore But it's kind of a simple way also there is this class in in pytorch called and then sequential and It's basically a container and you keep putting modules in it and it just You know automatically kind of use them as being kind of connected in sequence, right? And so then you just have to call You know forward on it and and and it will just it will just compute the right thing In this particular form here, you you pass it kind of a bunch of pairs It's like a dictionary so you can give a name to each of the layers and you can later kind of Access them. It's the same architecture was talking about earlier. Yeah, I mean the the backdrop is automatic, right? You you get it By default you just called Backward and It knows how to backpropagate to it Well, the class kind of encapsulates everything into you know into an object, you know where the parameters are there's a particular way of sort of, you know getting the parameters out and and Kind of feeding them to look Feeding them to an optimizer and so the optimizer doesn't need to know what your network looks like It just knows that there is a function and there is a bunch of parameters and it gets a gradient And you know, it doesn't need to know what your network looks like Yeah, you'll you'll you'll hear more about this Tomorrow Okay, so here's a Really interesting aspect of convolutional nets and it's one of the reasons where they've where they've become so Successful in many applications. It's the fact that If you view every day on a convolutional net as a convolution so there is no full connection so to speak You don't need to have a fixed size input you can vary the size of the input and the network will vary its size accordingly because When you apply a convolution to an image You feed it an image of a certain size you do a convolution with a kernel You get an image whose size is You know related to the size of the input, but you can change the size of the input And it just changes the size of the output And it's just true for every convolution every convolutional like operation, right? So if your network is composed only of convolutions, then it doesn't matter what the size of the input is It's going to go through the network and the the size of every layer will change according to the size of the input and The size of the output will also change accordingly. 
Okay, so here's a really interesting aspect of convolutional nets, and it's one of the reasons why they've become so successful in many applications. If you view every layer of a convolutional net as a convolution — so there is no full connection, so to speak — then you don't need a fixed-size input. You can vary the size of the input, and the network will vary its size accordingly, because when you apply a convolution to an image of a certain size with a kernel, you get an image whose size is related to the size of the input; if you change the size of the input, it just changes the size of the output. And that's true for every convolution-like operation. So if your network is composed only of convolutions, it doesn't matter what the size of the input is: it's going to go through the network, the size of every layer will change according to the size of the input, and the size of the output will change accordingly.

So here is a little example. Say I want to do cursive handwriting recognition. It's very hard, because I don't know where the letters are, so I can't just have a character recognizer — I mean a system that first cuts the word into letters, because I don't know where the letters are, and then applies the convolutional net to each of the letters. The best I can do is take the convolutional net, sweep it over the input, and record the output. You would think that to do this, you would have to take a convolutional net that has a window large enough to see a single character, take your input image, and compute your convolutional net at every location, shifting it by one pixel, or two pixels, or four pixels — a small enough number of pixels that, regardless of where the character occurs in the input, you will still get a score on the output whenever it needs to recognize one. But it turns out that would be extremely wasteful, because you would be redoing the same computation multiple times. So the proper way to do this — and this is very important to understand — is that you don't do what I just described, with a small convolutional net that you apply to every window. What you do is take the large input and apply the convolutions to the input image directly. Since it's larger, you're going to get a larger output; you apply the second-layer convolution to that, or the pooling, whatever it is, and you get a larger output again, and so on, all the way to the top. Whereas in the original design you were getting only one output, now you're going to get multiple outputs, because it's a convolutional layer. This is super important, because this way of applying a convolutional net with a sliding window is much, much cheaper than recomputing the convolutional net at every location. You would not believe how many decades it took to convince people that this was a good thing.
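[A sketch of that trick, assuming the "fully connected" head is rewritten as convolutions, as suggested earlier; the layer sizes are hypothetical. The same net then accepts a wider image and emits one score vector per window position:]

```python
import torch
import torch.nn as nn

# Fully convolutional version: the "fully connected" head becomes a convolution
# whose kernel covers the whole 5x5 feature map it used to see.
net = nn.Sequential(
    nn.Conv2d(1, 6, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(6, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 120, 5), nn.ReLU(),  # was Linear(400, 120)
    nn.Conv2d(120, 10, 1),             # was Linear(120, 10): a 1x1 convolution
)

print(net(torch.randn(1, 1, 32, 32)).shape)  # -> [1, 10, 1, 1]: one output
print(net(torch.randn(1, 1, 32, 64)).shape)  # -> [1, 10, 1, 9]: one output per
                                             #    32-wide window, stride 4
```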
So here's an example of how you can use this. This is a convolutional net that was trained on individual digits, 32 by 32 — it was trained on MNIST, with 32-by-32 input windows. It's LeNet-5, so it's very similar to the architecture I just showed the code for. It's trained on individual characters to classify the character in the center of the image, and the way it was trained is that there was a little bit of data augmentation: the character in the center was shifted a little bit to various locations and changed in size, and in many samples two other characters were added to the sides to confuse it. It was also trained with an eleventh category, "none of the above": either you show it a blank image, or you show it an image where there is no character in the center but there are characters on the sides, so that it detects whenever it's in between two characters. Then you do this thing of computing the convolutional net at every location on the input — without actually shifting the convolutional net, just applying the convolutions to the entire image. And that's what you get. So here the input image is 64 by 32, even though the network was trained on 32 by 32 with those generated examples. What you see is the activity of some of the layers — not all of them are represented — and what you see at the top are those funny shapes: threes and fives popping up. They are basically an indication of the winning category at every location. The eight outputs that you see at the top are the outputs corresponding to eight different positions of the 32-by-32 input window on the input, shifted by four pixels each time. What is represented is the winning category within each window, and the grayscale indicates the score. So what you see is two detectors detecting the five until the three starts overlapping, and then two detectors detecting the three as it moves around — because within one 32-by-32 window the three appears to the left, and within another 32-by-32 window, shifted by four, it appears to the right. So those two detectors both detect that three, or that five. Then what you do is take all those scores at the top, do a little bit of very simple post-processing, and you figure out it's a three and a five.
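[The "little bit of post-processing" could be as simple as the following sketch — take the arg-max category at each window position, drop the none-of-the-above class, and merge consecutive detections of the same character. This is an assumed reconstruction for illustration, not the original code:]

```python
import numpy as np

def decode(scores, blank=10):
    """scores: array of shape (positions, 11) - per-window class scores.
    Returns the digit sequence, merging repeats and dropping the 'blank'
    (none-of-the-above) class."""
    winners = scores.argmax(axis=1)   # winning category per window position
    out, prev = [], blank
    for w in winners:
        if w != blank and w != prev:  # a new character starts here
            out.append(int(w))
        prev = w
    return out

# e.g. windows seeing: blank, 3, 3, blank, 5, 5  ->  [3, 5]
```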
And what's interesting about this is that you don't need to do prior segmentation. Something people had to do before in computer vision was, if you wanted to recognize an object, you had to separate the object from its background, because the recognition system would get confused by the background. But here, this convolutional net has been trained with overlapping characters, so it knows how to tell them apart, and it's not confused by characters that overlap. There are a whole bunch of those animations on our website, by the way — those animations are from the early 90s. What do you mean? Oh — yeah. No, that was the main issue; that's one of the reasons why computer vision wasn't working very well. It's because the very problem of figure-ground separation — detecting an object and recognizing it — is one and the same: you can't recognize the object until you've segmented it, but you can't segment it until you've recognized it. It's the same for cursive handwriting recognition. So here's an example. Do we have pens? Doesn't look like we have pens... oh, here we go. That's true, I'm sorry. Maybe I should use the tablet, if this works. Oh, of course. Okay. Can you guys read this? Okay, I mean, it's horrible handwriting, but that's also because I'm writing on the screen. Okay, now can you read it? "Minimum." Yeah. There is absolutely no way you can segment the letters out of this, right? It's just a kind of random number of waves. But just from the fact that the two i's are identified, it's basically unambiguous, at least in English. So that's a good example of the interpretation of individual objects depending on their context, and what you need is some sort of high-level language model to know which words are possible. If you don't know English, or a similar language that has the same words, there is no way you can read this.

Spoken language is very similar to this. All of you who have had the experience of learning a foreign language have probably had the experience that you have a hard time segmenting words out of a new language, and then recognizing the words, because you don't have the vocabulary. So if I speak in French — si je commence à parler français, vous n'avez aucune idée où sont les limites des mots ("if I start speaking French, you have no idea where the boundaries of the words are") — except if you speak French. I spoke a sentence; it's made of words, but you can't tell the boundaries between the words, because there is basically no clear break between the words unless you know the words in advance. So that's the problem of segmentation: you can't recognize until you segment, and you can't segment until you recognize — you have to do both at the same time — and early computer vision systems had a really, really hard time doing this. That's why this kind of stuff is big progress: you don't have to do segmentation in advance; you just train your system to be robust to overlapping objects and things like that. Yes, in the back? — Yes, there is a background class. So when you see a blank response, it means the system says "none of the above". It's been trained to produce "none of the above" either when the input is blank, or when there is one character but it's too far from the center, or when there are two characters with nothing in the center, or when two characters overlap but there is no central character. So it's trained to detect boundaries between characters, essentially.

Here's another example. This is an example that shows that even a very simple convolutional net with just two stages — convolution, pooling, convolution, pooling — and then two more layers afterwards, can solve what's called the feature binding problem. Visual neuroscientists and computer vision people had this issue; it was kind of a puzzle: how is it that we perceive objects as objects? Objects are collections of features, but how do we bind all the features of an object together to form that object?
Is there some kind of magical way of doing this? Psychologists did experiments like: draw this, and then that, and you perceive the bar as a single bar, because you're used to bars being occluded by other objects, so you just assume it's an occlusion. And then there were experiments to figure out how much you have to shift the two bars to make you perceive them as two separate bars. In fact they may not be perfectly aligned, and if you do this — with something maybe exactly identical to what you see here — you now perceive them as two different objects. So how is it that we seem to be solving the feature binding problem? What this shows is that you don't need any specific mechanism for it. It just happens: if you have enough nonlinearities and you train with enough data, then as a side effect you get a system that solves the feature binding problem without any particular mechanism for it. So here you have two shapes, and you move a single stroke, and it goes from a six and a one, to a three, to a five and a one, to a seven and a three, etc. Right — good question. The question is: how do you distinguish between the two situations — two fives next to each other, versus a single five being detected by two different framings of that five? Well, this is explicit training: when you have two characters that are touching and neither of them is really centered, you train the system to say "none of the above". So it's always going to produce five, blank, five — even a one and a one can be very close, and the blank will tell you the difference.

Okay, so what are convnets good for— yes? Okay, so what you have to look at is this. Every layer here is a convolution, including the last layer. It looks like a full connection, because every unit in the second-to-last layer goes into the output, but in fact it is a convolution — it just happens to be applied at a single location. Now imagine that this layer at the top is bigger, which is what's represented here. The size of the kernel is the size of the image you had there previously, but now it's a convolution with multiple locations, and so what you get is multiple outputs. That's right — each of which corresponds to a classification over an input window of size 32 by 32, in the example I showed. Those windows are shifted by four pixels, the reason being that the network architecture I showed has a convolution with stride one, then pooling with stride two, then a convolution with stride one, then pooling with stride two — so the overall stride is four. To get a new output, you need to shift the input window by four, because of the two pooling layers with stride two. Maybe I should be a little more explicit about this; let me draw a picture that will be clearer. So you have an input like this, and you do a convolution — let's say a convolution of size three, stride one. I'm not going to draw all of them.
Then you have pooling with subsampling of size two: you pool over two, and you subsample with stride two, so you shift by two with no overlap. So here the input is of size eight — one, two, three, four, five, six, seven, eight. Because the convolution is of size three, you get an output of size six, and then when you pool with subsampling at stride two, you get three outputs, because that divides the size by two. Let me add another one — actually two. Okay, so now the input is ten, this one is eight, this one is four. I can do convolutions now also — say of size three — and I only get two outputs. Okay... oops. Hmm, I'm not sure why it doesn't draw. It doesn't react to clicks — that's interesting. The tablet is not responding; I guess it crashed on me. Well, that's annoying. Yeah, definitely crashed, and of course it forgot the drawing. So: okay, we had ten, then eight, because of the convolution of size three; then we have pooling of size two with stride two, so we get four; then we have a convolution of size three, so we get two; and then pooling again, size two with subsampling two, and we get one. Okay, so: ten at the input, then eight, four, two, and one after the last pooling. This is a convolution of three, this is a pooling of two, and so on. Right — now let's assume I add a few units here. That's going to add, let's say, four units here and two units here; this one goes like this and like that, so I get four here and another one here. So I had only one output, and by adding four inputs — the input is now 14 — I've got two outputs. Why four? Because I have two strides of two: the overall subsampling ratio from input to output is four — it's two times two. So now this is 12, this is six, and this is four. So that's a demonstration of the fact that you can increase the size of the input: it will increase the size of every layer, and if you have a layer that had size one and it's a convolutional layer, its size is going to be increased as well.
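[The blackboard arithmetic is easy to check in code. A minimal 1-D sketch with the same kernel and pooling sizes — untrained, just for the shape arithmetic:]

```python
import torch
import torch.nn as nn

# conv(kernel 3, stride 1) -> pool(2, stride 2) -> conv(3) -> pool(2, stride 2)
net = nn.Sequential(
    nn.Conv1d(1, 1, kernel_size=3),  # 10 -> 8
    nn.MaxPool1d(2, stride=2),       #  8 -> 4
    nn.Conv1d(1, 1, kernel_size=3),  #  4 -> 2
    nn.MaxPool1d(2, stride=2),       #  2 -> 1
)

print(net(torch.randn(1, 1, 10)).shape)  # one output
print(net(torch.randn(1, 1, 14)).shape)  # two outputs: four extra inputs give
                                         # one extra output (overall stride 4)
```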
Yes — change the size of the letter? Like, vertically, horizontally? Yeah. So first, you have to train for it: if you want the system to have some invariance to size, you have to train it with characters of various sizes. You can do this with data augmentation if your characters are size-normalized. That's the first thing. The second thing is that, empirically, simple convolutional nets are only invariant to size within a relatively small factor — you can change the size by maybe 40%, plus or minus 20%, something like that. Beyond that, you might have more trouble getting invariance, though people have trained with objects whose sizes vary by a lot. So the way to handle variable size, if you have an image and you don't know what size the objects in it are, is this: you apply your convolutional net to that image; then you take the same image reduced by a factor of two — just scale the image down by a factor of two — and run the same convolutional net on that new image; and then you reduce it by a factor of two again and run the same convolutional net once more. The first convolutional net will detect small objects within the image. Say your network has been trained to detect objects of size, I don't know, 20 pixels — faces, for example, that are 20 pixels. It will detect faces that are roughly 20 pixels within this image. Then, when you subsample by a factor of two and apply the same network, it will detect faces that are 20 pixels within the new image, which means they were 40 pixels in the original image — which the first network would not see, because the face would be bigger than its input window. And the next network over will detect faces that are 80 pixels, etc. So then, by combining the scores from all of those and doing something called non-maximum suppression, you can do detection and localization of objects. People use considerably more sophisticated techniques for detection and localization now, which we'll talk about next week, but that's the basic idea.
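[A rough sketch of that multi-scale scheme, assuming `net` is any fully convolutional detector returning a score map; the helper and the scale count are made up for illustration:]

```python
import torch.nn.functional as F

def detect_at_scales(net, image, n_scales=3):
    """Run the same convolutional net on an image pyramid.
    Each halving of the image doubles the object size the net can 'see'."""
    score_maps = []
    for _ in range(n_scales):
        score_maps.append(net(image))  # detects objects at the trained size
        image = F.interpolate(image, scale_factor=0.5,
                              mode='bilinear', align_corners=False)
    return score_maps  # combine across scales with non-maximum suppression
```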
So let me conclude: what are convnets good for? They're good for signals that come to you in the form of a multi-dimensional array, but that multi-dimensional array has to have at least two characteristics. The first one is that there are strong local correlations between values. If you take a random natural image and take two pixels within this image that are nearby, those two pixels are very likely to have very similar colors. Take a picture of this class, for example: two pixels on the wall will basically have the same color. It looks like there are a ton of objects here — animated objects, even — but in fact, statistically, neighboring pixels are mostly essentially the same color. And as you move the two pixels apart — as you compute the statistics of how similar pixels are as a function of distance — they become less and less similar. So what does that mean? Because nearby pixels are likely to have similar colors, when you take a patch of pixels, say 5 by 5 or 8 by 8, the type of patch you're actually going to observe is very likely to be a smoothly varying color, maybe with an edge. Among all the possible combinations of 25 pixel values, the ones you actually observe in natural images are a tiny subset. What that means is that it's advantageous to represent the content of that patch by a vector with perhaps fewer than 25 values: is there an edge, is it uniform, what color is it — things like that. And that's basically what the convolutions in the first layer of a convolutional net are doing. So if you have local correlations, there is an advantage in detecting local features. That's what we observe in the brain, and that's what convolutional nets are doing. This idea of locality matters: if you feed a convolutional net permuted pixels, it's not going to be able to do a good job at recognizing images, even if the permutation is fixed, whereas a fully connected net doesn't care about permutations.

The second characteristic is that important features may appear anywhere on the image. That's what justifies shared weights. The local correlation justifies local connections; the fact that features can appear anywhere — that the statistics of the image, of the signal, are uniform across locations — means that you need repeated feature detectors at every location, and that's where shared weights come into play. It also justifies the pooling: pooling is for when you want invariance to variations in the location of those characteristic features, so if the objects you're trying to recognize don't change their nature by being slightly distorted, then you want pooling.

So people use convnets for all kinds of stuff: images, video, text, speech — convnets are actually used a lot in speech recognition — time-series prediction, things like that, and also biomedical image analysis. If you want to analyze an MRI, for example — an MRI or CT scan is a 3D image. As humans we can't really apprehend a 3D volume, because we don't have good visualization technology, but a 3D medical image is fine for a convnet: you feed it the 3D image and it will deal with it. That's a big advantage, because you don't have to go through slices to figure out the objects in the image. And then the last thing, here at the bottom — I don't know if you know what hyperspectral images are. A hyperspectral image is an image where— well, most natural color images, images you would acquire with a normal camera, give you three color components, RGB. But we can build cameras with way more spectral bands than that. That's particularly the case for satellite imaging, where some cameras have many, many spectral bands going from infrared to ultraviolet, and that gives you a lot of information about what you're seeing in each pixel. Some tiny animals that have small brains find it easier to process hyperspectral images at low resolution than high-resolution images with just three colors — for example, this particular type of shrimp (the mantis shrimp): they have those beautiful eyes with 17 spectral bands or something, but super low resolution, and a tiny brain to process it. Okay, that's all for today. See you.