So what is desirable in a representation? The first is a disentangling objective: I want similar stimuli to produce similar representations, because that makes learning the final recognition much easier. The second is that I want the representation to be predictive of the future world; a good representation allows us to predict what happens around us. We also want the representation to be predictive of other parts of the world: what I currently see should tell me something about what happens in other areas. I want representations to be cheap to compute. I want them to allow for compositionality, so that I can represent the world in terms of the objects that are in it. And I might also want a representation to represent the world causally. Some of these features are very complicated, but many of them can be used for training representations in an unsupervised way, and as we shall see, the representations induced by ConvNets have many of these desirable properties. Not all of them, though.

So what is a convolutional neural network? I want to say here that I will be using the description that Brandon Rohrer uses, and he was so nice to give me permission to use his slides for this. So what is a ConvNet? We are in a regular supervised learning setting: a two-dimensional array of pixels goes into the ConvNet, and a classification comes out, say an X or an O. What's the setting? We might have patterns like the upper one, an X, which we want mapped onto the label X, and others, like the little ball at the bottom left, which we want mapped onto an O.

In a way this seems like an easy problem, until we start thinking about the properties of the visual world. Take the X: the stimulus I show you on the left-hand side is clearly very similar to the one I show you on the right-hand side; in fact there is just a tiny rotation between the two. What we would like is a system that can represent that what we have on the left is similar to what we have on the right, and the question is how we can do such a thing. Now, how could we measure similarity? Well, we could just compare the two images point by point. It turns out that if we look at the pixels of these two stimuli, there are more pixels where they differ than pixels where they agree, and yet to us they are clearly perceptually very similar; a naive comparison does not reveal that similarity (a small code sketch below makes this point concrete).

So how could we build a system that lets us see that these two, pixel-wise rather different, images are in fact very similar? ConvNets are the trick we use for that. Here is the intuition: the two stimuli are locally very similar. Look, the green box here reappears in this part, the orange box reappears in this part, and the violet box reappears in this part. In a way, this is already the intuition that drives much of ConvNets. What I want is a representation that locally detects that there is an edge, in the green case, or a crossing, in the orange case, and then represents that if these pieces are moved a little relative to one another, it is still essentially the same image.
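To make that naive point-by-point comparison concrete, here is a minimal sketch with a toy example of my own (not the images from the slides): the same small stroke, shifted by a single pixel, already stops lining up pixel by pixel, even though to us it is obviously the same pattern.

```python
import numpy as np

def pixel_agreement(a, b):
    """Fraction of pixels on which two equally sized images agree."""
    return float(np.mean(a == b))

# A small diagonal stroke: +1 for ink, -1 for background.
base = -np.ones((5, 5))
np.fill_diagonal(base, 1.0)

# The same stroke shifted one pixel to the right.
shifted = np.roll(base, shift=1, axis=1)

print(pixel_agreement(base, base))     # 1.0 -- identical images
print(pixel_agreement(base, shifted))  # 0.6 -- none of the stroke pixels coincide any more
```

With a denser pattern like the X from the slides, the disagreements can even outnumber the agreements, which is exactly the situation described above: point-by-point comparison is blind to the fact that the two images contain the same thing.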
So the question is: how could we implement something like that? What could local features look like? Here are three possibilities: one feature contains one diagonal, one contains the checkerboard pattern, and one contains the other diagonal. What we want to do is detect such a feature in the image, and by eye we can do it: the feature fits here, and it also fits perfectly here. So let's implement something that does this kind of search.

How can we locally ask how similar the feature is to a patch of the image? We take the feature and lay it on the image in an area of the same size, and then we calculate how similar they are: where the feature has a one and the image has a one, multiplying them gives a one, and we write that product down. We simply pointwise multiply the two patches. If we do this for every position within the patch, it is always the same here, either one times one or minus one times minus one, so the products sum to nine. Now we normalize: the result of the local filtering is nine, meaning feature and image always agree, divided by nine, giving us the value one here.

Then we can do the same at other locations. What if we apply the feature right in the middle? Applied to the middle, it agrees in these seven spots and disagrees in these two, so we get seven times plus one plus two times minus one, which is five, and divided by nine that is five ninths, so we write 0.55 into the result of the convolution here. We then do this for all possible locations. And when I say all possible locations, look at what happens at this location: why didn't I compute it? Because the feature would go outside of the image, where the operation is not defined. We will talk about padding, which solves that problem, in just a second. But we apply the feature at every possible location on the inside, and that gives us this as the result of the convolution. What does it tell us? It tells us that this feature is very strongly present here, here, here and here, somewhat present in certain other places, and really not present anywhere in those areas. We will make this concrete in a small code sketch below.

Now let's do a convolution exercise by hand. Like I said, it is always nice to do these things by hand and interact with them. I want to convolve this pattern here with this filter; how do I do that? I take the upper-left patch and multiply it with the filter: we have three ones that each meet a one, giving us three, and two non-zero entries that each meet a minus one half, so we subtract one from the three and get a two. Then we apply it on the right-hand side. What is different there is that we have only one one and six minus one halves, so we get one minus three, which is a minus two. We do the same operation here, giving us a minus one, and the same operation in the lower right, giving us a plus one. That is convolution by hand. In practice you will never do convolution by hand again; that is exactly why we have an exercise that gives you the chance to do it by hand once, just so that you get the intuition. You will see that ConvNets and convolutions are so common in deep learning that I really want to make sure you have an intuition for how exactly they work.
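Here is a minimal sketch of that filtering operation in code. The 9x9 X image is my own reconstruction of the toy example (the exact pixels may differ from the slides), and filter_image is just an illustrative name; the point is the mechanism: slide the 3x3 feature over every valid position, pointwise multiply, sum, and divide by nine.

```python
import numpy as np

# The toy "X" image: +1 where there is ink, -1 for background.
X = -np.ones((9, 9))
for i in range(1, 8):
    X[i, i] = 1.0       # one stroke of the X
    X[i, 8 - i] = 1.0   # the other stroke

# One of the three local features: the "\" diagonal.
feature = -np.ones((3, 3))
np.fill_diagonal(feature, 1.0)

def filter_image(img, feat):
    """Slide the feature over every valid position (no padding) and record the
    normalized agreement: +1 means a perfect match, -1 a complete mismatch."""
    fh, fw = feat.shape
    out = np.zeros((img.shape[0] - fh + 1, img.shape[1] - fw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = img[r:r + fh, c:c + fw]
            out[r, c] = np.sum(patch * feat) / feat.size  # pointwise multiply, sum, normalize by 9
    return out

result = filter_image(X, feature)
print(result[1, 1])  # 1.0      -- the feature matches the upper-left arm of the X exactly
print(result[3, 3])  # 0.555... -- at the middle: seven agreements, two disagreements, 5/9
```

Two small notes: what deep learning libraries call convolution is implemented exactly like this sliding pointwise product, i.e. as cross-correlation without flipping the filter, and for a symmetric feature like this diagonal the distinction would not matter anyway. The output is smaller than the input because positions where the feature would stick out over the border are skipped; padding, mentioned above, is what fixes that.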