So, convolutional neural networks today. All right, so Foundations, that's me; I post nice things on Twitter, follow me — I'm just kidding. Again, any time you have no idea what's going on, just stop me and ask questions; let's make these lessons interactive so I can give you the information you need to understand what's going on. All right, so convolutional neural networks: how cool is this stuff? Very cool. Mostly because before we had convolutional nets, we couldn't do much, and we're going to figure out why and how these networks are so powerful — they make up a very large chunk of the networks used these days. More specifically, we're going to repeat, several times, the three words that are the key to understanding convolutions; we'll figure those out soon. So let's get started and see what these signals, these images, these different kinds of data look like. Whenever we talk about signals, we can think of them as vectors. Up there we have a signal representing a monophonic audio signal. Given that the signal unfolds over only one dimension, the temporal dimension, this is called a 1D signal, and it can be represented by a single vector, as shown up there. Each value of that vector represents the amplitude of the waveform. For example, if you have just a sine, you're going to hear a pure tone; if it's not just a sine, you're going to hear different timbres, different flavours of the sound. Moreover, you're familiar with how sound works, right? Right now I'm pushing air through my windpipe, where some membranes make the air vibrate; the vibration propagates through the air, hits your ears and the ear canal; inside you have the cochlea, and depending on how far the sound propagates through the cochlea, you detect the pitch. By combining the different pitch information you figure out what sound I was making over here, and then you reconstruct it using the language model you have in your brain, right? It's the same thing Yann was mentioning: if I start speaking another language, you won't be able to parse the information, because you use both a speech model — the conversion between vibrations and the signals in your brain — and a language model in order to make sense of it. Anyhow, that was a 1D signal. Now let's say I'm listening to music; what kind of signal do I have there? If I listen to music, it's going to be stereophonic, right? So how many channels? Two channels, right? Nevertheless, what type of signal is this one? It's still a 1D signal, even though there are two channels. And regardless of how many channels — if you had Dolby surround you'd have what, 5.1, so six, I guess — that's just the thickness of the signal, and time is still the only variable that keeps moving forward, okay? So those are 1D signals.
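To make that channels-versus-dimensions point concrete, here is a minimal sketch in PyTorch; the shapes and the 16 kHz sample rate are just my own illustration, not something from the slides:

```python
import torch

T = 16_000                      # one second of audio at an assumed 16 kHz sample rate

mono   = torch.randn(1, T)      # 1 channel:  monophonic signal
stereo = torch.randn(2, T)      # 2 channels: stereophonic signal
dolby  = torch.randn(6, T)      # 6 channels: 5.1 surround

# All three are 1D signals: however thick the channel dimension is,
# the information only unfolds along the single temporal axis of length T.
for x in (mono, stereo, dolby):
    print(x.shape)              # torch.Size([channels, T])
```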
All right, let's zoom in a little bit. On the left-hand side we have something that looks like a sinusoidal function, and a little bit later you see the same type of function appearing again. This is called stationarity: you see the same type of pattern over and over and over again across the temporal dimension, okay? So the first property of this signal — which is a natural signal, because it happens in nature — is stationarity. That's the first one. Moreover, what do you think: if I have a peak on the left-hand side, how likely is it that there's also a peak very nearby? How likely is it to have a peak right there, rather than a few seconds later, given that you had a peak before? There should be some common sense here: if two points are close together, there's a larger probability that things look similar. A specific sound has a very specific shape, but if you go a little bit further away from that sound, then there's no relation anymore between what happens there and what happened before. You can check this by computing the cross-correlation between a signal and itself. Do you know what a cross-correlation is? Hands up, who doesn't know? Okay, fine, so that's going to be homework for you: you take one signal, an audio signal, and you correlate it with itself. (With a convolution you'd additionally flip the signal before sliding it across and multiplying.) Whenever the two copies are overlaid with zero misalignment, you get a spike, and as you start shifting them around, the two sides decay. You're basically performing a dot product at every shift: things have a lot in common when they're very close to one specific location, and as you go further away, things start averaging out. So here is the second property of these natural signals: locality. Information is contained in specific portions and parts of, in this case, the temporal domain, okay? So before we had stationarity, now we have locality.
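If you want to try that homework, here's one minimal way to do it, assuming NumPy — the signal itself is just a synthetic sine plus noise, my own stand-in for a real recording:

```python
import numpy as np

# A fake "natural" signal: a sine buried in a little noise (illustrative only).
t = np.linspace(0, 1, 1000)
x = np.sin(2 * np.pi * 5 * t) + 0.1 * np.random.randn(t.size)
x = x - x.mean()                         # zero-mean it first

# Cross-correlation of the signal with itself (its autocorrelation):
# a spike where the two copies are perfectly aligned, averaging out as the lag grows.
ac = np.correlate(x, x, mode="full")
ac = ac / ac.max()

lags = np.arange(-x.size + 1, x.size)
print(lags[ac.argmax()])                 # 0: the maximum sits at zero misalignment
```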
Boom — bless you. All right, how about this one? This is completely unrelated to what happened over there. Okay, let's look at the nice little kitten. What dimension does this signal have? What's your guess? Someone said two dimensions, someone said three dimensions. It's two-dimensional — why is that? Sorry, noise. Why is it two-dimensional? Because the information is spatially laid out, right? The information is encoded in the spatial location of those points. Each point may itself be a vector of three values, or, if it's a hyperspectral image, several planes; nevertheless, you still only have two directions along which points can move. The thickness doesn't change across the image: it's a given thickness, and it stays the same. You can have as many planes as you want, but the information is spread across the plane, so this is two-dimensional data. Okay, I see your point: a black-and-white or grayscale image is definitely a 2D signal, and it can be represented by a tensor of two dimensions. A colour image has RGB planes, but the thickness is always three — it doesn't change — and the information is still spread across the other two dimensions. You can change the size of a colour image, but you won't change its thickness. So when we talk about the dimension of the signal, we mean: along how many directions is the information spread? In the temporal case, whether you have Dolby surround, a mono signal, or stereo, the signal still only extends over time, so it's one-dimensional; images are 2D. So let's have a look at the nice little kitten and focus on the nose. Oh my god, it's a monster now — okay, nice big creature. We observe some kind of dark region near the eye, and you can see that a similar pattern appears over there as well, right? So which property of natural signals is this? I told you two properties. This is? Stationarity. Why is this stationarity? Because the same pattern appears over and over again across the domain — in this case, two dimensions. Moreover: given that the colour of the pupil is black, what is the likelihood that the pixel at the tip of the arrow is also black? I would say quite likely, because it's very close. How about that point? Yeah, less likely, right? And if I keep clicking further away — hmm, the other pixel is completely bright now. So the further you go in the spatial dimension, the less likely you are to find similar information. This is called locality: the information is contained in specific regions, and as you move around, things become much more independent. All right, so we have two properties; the third one is the following. What is this? Are you hungry? You can see some doughnuts here — no, not doughnuts, what do you call them, bagels, right? All right, for those of you who wear glasses: take your glasses off. Now answer my question. The third property is compositionality, and compositionality means that the world is actually explainable, right? Okay, enjoy, then get back to me — I'm just trying to keep you alive. For those who don't have glasses: no, don't try on other people's glasses, that's not good, I'm just kidding — you can squint instead. Okay, question, yeah. So stationarity means you observe the same kind of pattern over and over again in your data. Locality means those patterns are localised: you have some specific information here, some information there, and as you move away from this point, the value over there becomes almost independent of the value here. Things are correlated only within a neighbourhood, okay? Okay, everyone has been experimenting now, squinting and looking at this nice picture.
Okay, so this is the third property, compositionality. Here you can tell that you can actually still see something if you blur it a little bit, because things are made of smaller parts, and you can compose things this way. Anyhow, these are the three main properties of natural signals, and they can be exploited to design an architecture that is better suited to extracting information from data that has these properties. So we are only talking about signals that exhibit those properties. Finally, there was one last example I didn't talk about: an English sentence, "John picked up the apple," whatever. Here, again, you can represent each word as a vector — for example a one-hot vector, which has a one in the position where that word happens to appear in a dictionary, okay? If you have a dictionary of 10,000 words, you just look up the word and use the position of its page in the dictionary, plus an offset, as the index. So language also has these properties: things that are close by have some kind of relationship, things far away are less correlated, and similar patterns happen over and over again. Moreover, you can compose words into sentences, sentences into full essays, and finally into your write-ups for these sessions — I'm just kidding, okay. All right, we've already seen this next part, so I'm going to go quite fast; there shouldn't be many questions, I think, because everything is also written down on the website — you can always check the summaries of the previous lessons there. So, the fully connected layer. This is perhaps a new version of the diagram. That is my x, which is at the bottom: the low-level features. What's the colour of the x? Pink, okay, good. Then we have an arrow which represents — yeah, fine, that's the proper term, but I like to call them rotations — and then there is some squashing, which is the non-linearity. Then I have my hidden layer, then another rotation and a final squashing, okay? The last squashing isn't strictly necessary: it can be a linear final transformation, for example if you perform a regression task. There you have the equations, and those functions can be any of those non-linear functions, or even a linear function, again if you perform a regression. And you can write these layers out expanded: the bottom guy is actually a vector — I usually represent it with just one ball, and here I show you all five elements of that vector. So you have the x, the first layer; then the first hidden, second hidden, third hidden, and the last layer. So how many layers do we have? Five, okay. And you can also call them activations: layer one, layer two, three, four, and so on. The matrices are where you store your parameters, so you have those different W's. And to get each of those values — you've already seen this, so I'll go fast — you perform a scalar product: you take all those weights, multiply each input by its own weight, and sum, and so on; and you store the weights in those matrices. So, as you can tell, there are a lot of arrows, right?
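As a reminder of what all those arrows actually compute, here is a minimal sketch of that kind of fully connected network in PyTorch; the layer sizes are made up purely for illustration:

```python
import torch
import torch.nn as nn

# x -> rotation (W_h) -> squashing (ReLU) -> rotation (W_y) -> output
model = nn.Sequential(
    nn.Linear(5, 4),    # first "rotation": every input feeds every hidden unit
    nn.ReLU(),          # the "squashing" non-linearity
    nn.Linear(4, 3),    # second rotation to the output layer
)

x = torch.randn(5)      # the input vector drawn as one ball per element
y = model(x)
print(y.shape)          # torch.Size([3])

# Every arrow in the drawing is one entry of a weight matrix:
print(model[0].weight.shape)   # torch.Size([4, 5]): 20 arrows for the first layer alone
```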
And regardless of the fact that I spent too many hours doing that drawing, all of this is also very computationally expensive, because there are so many computations, right? Each arrow represents a weight, which you have to multiply by its own input. So what can we do? Given that our data has locality as a property, what does that mean? If something happens here, do I care what's happening way over there? Some of you are shaking their heads; the rest of you are kind of non-responsive — do I have to ping you? We have locality, so things live in specific regions. Do you actually care about looking far away? No? Okay, fantastic. So let's simply drop some connections. Here we go from layer ℓ−1 to layer ℓ using five, ten, fifteen weights, plus three more from layer ℓ to ℓ+1, so in total we have 18 weights, 18 computations, right? How about we drop the things we don't care about? For example, for this neuron, why do we have to care about those guys at the bottom? So I just use those three weights and forget about the other two; then again I use three weights, skipping the first and the last, and so on. Now we have just nine connections — nine multiplications — plus three more at the top. And as we go from the left-hand side to the right-hand side, as we climb the hierarchy, we get a larger and larger view: although these green bodies here don't see the whole input, as you keep climbing the hierarchy you end up seeing the whole span of the input, right? In this case we define the RF, the receptive field. My receptive field here, from the last neuron back to the intermediate layer, is three, which means the final neuron sees three neurons from the previous layer. So what is the receptive field of the hidden layer with respect to the input layer? The answer was three — correct. And what is the receptive field of the output layer with respect to the input layer? Five, right? Fantastic. So the whole architecture does see the whole input, while each sub-part — the intermediate layers — only sees small regions. And this is very nice, because you spare computations that are unnecessary, since on average they carry no information whatsoever. So we manage to speed up the computation, and you can actually compute things in a decent amount of time. Clear? We can talk about sparsity only because we assume that our data shows locality, right? Question: if my data doesn't show locality, can I use sparsity? No — okay, fantastic.
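The receptive-field bookkeeping from that diagram can be written down in a couple of lines; this is just the arithmetic for stride-1, unpadded layers, nothing more than a sketch:

```python
# Receptive field of stacked convolutions with stride 1:
# each extra layer with kernel size k sees (k - 1) more inputs than the layer below it.
def receptive_field(kernel_sizes):
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

print(receptive_field([3]))      # 3 -> the hidden layer w.r.t. the input
print(receptive_field([3, 3]))   # 5 -> the output layer w.r.t. the input, as in the diagram
```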
All right, more stuff. We also said that these natural signals are stationary, and given that they're stationary, the same things appear over and over again — so maybe we don't have to learn the same stuff all over again every time, right? We already dropped those two lines; now, how about we take the first connection, the oblique one going down, and make it yellow — all of those are yellow; then these are orange; and the final ones are red. So how many weights do I have now? Three, and over here nine, and before we had fifteen. So we dropped from fifteen weights down to three distinct ones. That's a huge reduction. Perhaps by itself this won't actually work, so we'll have to fix that in a bit. But anyhow, in this way, when I train the network, I only have to train three weights — the yellow, the orange, and the red — and it's actually going to work even better, because you now have much more data for training those specific weights. Those three colours, the yellow, orange, and red, are going to be called my kernel, and I store them in a vector over here. So when people talk about convolutional kernels, those are simply these weights: the weights we get by using sparsity and then parameter sharing. Parameter sharing means you use the same parameters over and over again across the architecture. Combining the two gives the following nice properties. Parameter sharing gives us faster convergence, because you have much more information with which to train those weights. You get better generalisation, because you don't have to learn, for every region, a separate version of something that happens everywhere — you learn something that makes sense globally. Then, we are not constrained to the input size — this is so important; Yann said it three times yesterday. Why are we not constrained to the input size? Because we can keep shifting the kernel over, right? Before, in the fully connected case, if you have more neurons you have to learn new weights; here I can simply add more neurons and keep using the same weights across. That was one of the major points Yann highlighted yesterday. Moreover, we have kernel independence: for those of you interested in optimising computation, this is so cool, because this kernel and that other kernel are completely independent, so you can compute them in parallel to make things go faster. Okay. Finally, the connection sparsity gives us a reduced amount of computation, which is also very good. All these properties are what allow us to train these networks on a lot of data. You still require a lot of data, but without sparsity and parameter sharing you wouldn't be able to finish training these networks in a reasonable amount of time. So let's see, for example, how this works when you have an audio signal — which is how many dimensional? One-dimensional, right? Okay, so: kernels for 1D data. On the right-hand side you see my neurons again, and I'm going to be using my first kernel, which I store there in that vector. I can also have a second kernel, so now we have two kernels: the blue-purple-pink one and the yellow-orange-red one. Let's say my output lives in R², which means each of those bubbles, each of those output neurons, has a thickness of two — they come out of the board, right? And let's say the other guys here have a thickness of seven — they come out of the screen, seven deep. In this case my kernels are going to be of size two by seven by three. Yeah, hold on, my bad — let me say this properly: the two means the output is in R², because you have two kernels.
So the first kernel gives you the first column here and the second kernel gives you the second column. Then the kernel needs seven, because it has to match the whole thickness of the previous layer. And it has three because there are three connections, right? So maybe I confused you before — does this sizing make sense? Given the kernel tensor is two by seven by three: two means you have two kernels, and therefore two values coming out for each of those columns; seven because each of these inputs has a thickness of seven; and finally three because there are three connections going to the previous layer. So 1D data uses 3D kernels, okay? And if I call this my collection of kernels and store them in one tensor, that tensor is a three-dimensional tensor. Question for you: if I'm playing with images instead, what is the size of a full pack of kernels for an image convolutional net? It's going to be the number of kernels, then the thickness, then the number of connections in height and in width, okay? So when you check the convolutional kernels later on in your notebook, you should find exactly these kinds of dimensions. All right, questions so far? Is this clear? Yeah? [Student asks about the performance trade-off when choosing the kernel size.] Okay, good question — the trade-off in sizing those convolutional kernels, is that it? So three by three seems to be the minimum you can go to if you actually care about spatial information. As Yann pointed out, you can also use a one-by-one convolution — a convolution with only one weight per input plane — and those are used so that the final layer is still spatial and can still be applied to a larger input image. These days we mostly use kernels of size three, or maybe five. It's largely empirical — there's no magic formula — but we've been trying hard over the past ten years to figure out the best hyperparameters, and if you check each field — speech processing, image processing — you'll find the right compromise for your specific data. Yeah? [Student asks why the kernel size is an odd number, such as three.] Okay, that's a good question: why odd numbers, why does the kernel have an odd number of elements? If you have an odd number of elements, there is a central element; with an even number there is no central value. With an odd size you know that, around a specific point, you consider the same number of items to the left and to the right. With an even-sized kernel you don't know where the centre is — the centre ends up being the average of two neighbouring samples, which creates a low-pass-filter effect. So even kernel sizes are not usually used, because they imply an additional lowering of the quality of the data.
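You can check those dimensions directly; here is a minimal sketch assuming PyTorch's Conv1d and Conv2d conventions, with the channel counts taken from the example above:

```python
import torch
import torch.nn as nn

# 1D data: 2 kernels, each looking at a thickness of 7 through 3 connections.
conv1d = nn.Conv1d(in_channels=7, out_channels=2, kernel_size=3)
print(conv1d.weight.shape)        # torch.Size([2, 7, 3]): 1D data uses a 3D kernel tensor

# Parameter sharing means the same 2x7x3 weights slide over a signal of any length:
print(conv1d(torch.randn(1, 7, 50)).shape)    # torch.Size([1, 2, 48])
print(conv1d(torch.randn(1, 7, 1000)).shape)  # torch.Size([1, 2, 998])

# 2D data (images): number of kernels x thickness x connections in height x width.
conv2d = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
print(conv2d.weight.shape)        # torch.Size([16, 3, 3, 3]): images use 4D kernel tensors
```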
Okay, so one more thing that we also mentioned yesterday: padding. Padding is something that, if it has any effect on the final result, makes it slightly worse — but it's very convenient on the programming side. As you can see here, when we apply the convolution to this layer, we end up with — okay, how many neurons do we have here? Three, and we started from five. So if we use a convolutional kernel of size three, how many neurons do we lose? Two — one per side. If we use a convolutional kernel of size five, how many do we lose? Four, right? So that's the usual rule for zero padding: you add (size of the kernel minus one) divided by two extra neurons on each side — here, (3 − 1) / 2 = 1 — and you set them to zero. Why zero? Because usually you zero-mean your inputs, or you zero-centre each layer's output using some normalisation layer. There should be an animation playing here: you add one extra neuron there and one extra neuron there, so you end up with these ghost neurons, but now you have the same number of inputs and outputs. And this is so convenient, because if you started with, say, 64 neurons, you apply a convolution, you still have 64 neurons, and therefore you can use a max pooling of two and end up with exactly 32 neurons. Otherwise you'd end up with 62, then 31 after pooling — an odd number — and after a bit you don't know what to do, right? Okay, so, yeah: you keep the same size.
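Here is the same size bookkeeping in code, a minimal sketch assuming PyTorch:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 64)                 # 64 input neurons, 1 channel

no_pad   = nn.Conv1d(1, 1, kernel_size=3)              # loses (3 - 1) = 2 neurons
same_pad = nn.Conv1d(1, 1, kernel_size=3, padding=1)   # pad (k - 1) // 2 zeros per side

print(no_pad(x).shape)     # torch.Size([1, 1, 62])
print(same_pad(x).shape)   # torch.Size([1, 1, 64]): same size as the input

# With the size preserved, a max pooling of two lands exactly on half:
pooled = nn.MaxPool1d(2)(same_pad(x))
print(pooled.shape)        # torch.Size([1, 1, 32])
```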
All right, let's see how much time we have left — a bit. So let's see how we use these convolutional networks in practice. This was the theory behind it: we said we can use convolutions — and I didn't even define what a convolution is. We just said that if our data has stationarity, locality, and compositionality, then we can exploit that by using weight sharing, sparsity, and by stacking several of these layers to get a hierarchy, right? This operation is a convolution; I didn't define it, I don't care right now, maybe next class. Now let's look at some practical suggestions for how we actually use this. Say we have a standard spatial convolutional net, which operates on which kind of data? It's special because it's my network, right? No, I'm just kidding: spatial, as in space. In this case we have multiple layers — of course, we stack them; we also talked about why it's better to have several layers rather than one fat layer. We have convolutions, of course. We have non-linearities, because otherwise — okay, next time we'll see how a convolution can be implemented with matrices, but convolutions are just linear operators with a lot of zeros and replications of the same weights, and if you don't use non-linearities, a convolution of a convolution is just another convolution: the whole stack would collapse into a single linear operation. So we have to put barriers in between in order to avoid the collapse of the whole network. We have some pooling operator — which Geoffrey says is something really bad, but we are still doing it; Hinton, right, Geoffrey Hinton. Then we have something that, if you don't use it, your network is not going to train, so just use it — although we don't know exactly why it works; I think there is a question on Piazza, and we'll put a link there about this batch normalisation. Yann is also going to cover all the normalisation layers. Finally, we have something quite recent called residual, or bypass, connections: these extra connections let the network decide whether to send information along the bypass or forward through the main path. If you stack many, many layers one after the other, the signal gets lost a little bit after some time; if you add these extra connections, you always have a path from the bottom to the top, and also for the gradients to come down from the top to the bottom. That's very important: both the residual connections and the batch normalisation are really, really helpful for getting these networks to train properly. If you don't use them, it's quite hard to get these networks to actually work during training. So how does it work? Here we have an image, where most of the information is spatial: the information is spread across the two dimensions. There is also a thickness, which I call the characteristic information, meaning it provides the information at that specific point. What is the characteristic information in this image? Let's say it's an RGB image, a colour image: most of the information is spatial — like me making funny faces — but each point, since it's not a grayscale image, carries an additional piece of information, its specific characteristic information. What is it in this case? It's a vector of three values, RGB, but what does it represent overall? Just tell me in English without weird words — the colour of the pixel, right? So the characteristic information here is just the colour; otherwise the information is spread around spatially. As we climb the hierarchy, we end up with some final vector — say we're doing classification, so the height and width become one by one, just a single vector — and there you have the final logits, where the highest one represents the class that is most likely to be correct, if the network is trained well. In the middle, you have something that is a trade-off between spatial information and this characteristic information, okay? So there is a conversion from spatial information into characteristic information: you go from the input data to something that is very thick but has no more spatial information. And here you can see, with my ninja PowerPoint skills, how you get a thicker and thicker representation while you lose the spatial one.
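Putting those pieces together — convolution, batch norm, non-linearity, and a bypass connection — here is a minimal sketch of one such block in PyTorch; it's an illustration of the idea, not the exact architecture on the slides:

```python
import torch
import torch.nn as nn

# One residual "stage": convolution -> batch norm -> non-linearity, with a bypass
# connection added around it so the signal (and the gradient) always has a path through.
class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn   = nn.BatchNorm2d(channels)
        self.act  = nn.ReLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)) + x)   # output = f(x) + x

block = ResidualBlock(16)
x = torch.randn(8, 16, 32, 32)      # batch of 8 feature maps: 16 channels, 32 by 32
print(block(x).shape)               # torch.Size([8, 16, 32, 32])
```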
Oh, okay — one more thing: pooling. Pooling can be performed, for example, in this way — there you have some hand drawing, because I didn't have time to make it in LaTeX. You have different regions, and you apply a specific operator to each region, for example the p-norm; if p goes to plus infinity, you get the max. That gives you one value; then you perform a stride — you jump two pixels further — and you compute the same thing again, getting another value, and so on, until, starting from data that was m by n with c channels, you still get c channels, but now m over two by n over two, okay? And this is for images. There are no parameters in the pooling; you can nevertheless choose which kind of pooling you want: max pooling, average pooling, whatever pooling you like — though, yeah, pooling itself is also part of the problem, as we'll see in a second.
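In code, that size bookkeeping looks like this — a minimal sketch assuming PyTorch; note that none of these pooling layers has any learnable parameter:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 28, 28)             # c = 3 channels, m = 28 by n = 28

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)              # the "p -> infinity" case
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
lp_pool  = nn.LPPool2d(norm_type=2, kernel_size=2, stride=2)  # a finite-p norm

for pool in (max_pool, avg_pool, lp_pool):
    print(pool(x).shape)    # torch.Size([1, 3, 14, 14]): c stays, m and n are halved
```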
Okay, so this was the main part with the slides. We're going to look at the notebooks now; we'll go a bit slower this time — I noticed that last time I kind of rushed. Are there any questions so far on the part we covered? Yeah? So, Geoffrey Hinton is renowned for saying that max pooling is just wrong, because you throw away information: whether you average or take the max, you throw things away. He's been working on something called capsule networks, which have specific routing paths that choose better strategies to avoid throwing away information. That's basically the argument. Yeah? Yes — the main purpose of using pooling, or stride, is to get rid of a lot of data so that you can compute things in a reasonable amount of time. Usually you need a lot of stride or pooling in the first layers, at the bottom, because otherwise it's just too computationally expensive. Yeah? Yes — these network architectures are, so far, driven by the state of the art, which is completely empirical. We tried hard, and we've now arrived at some kind of standard: a few years back my answer would have been "I don't know," but right now we've determined some good configurations, and especially with those residual connections and batch normalisation, we can get basically everything to train. Yeah? So you have your gradient coming down at a specific point, and then the other gradient coming down; there was a branching, and if you have a branch, what happens with the gradients? That's correct, they get added: the two gradients coming from the two branches get added together. All right, so let's go to the notebooks, so we don't rush too much. Here I just go through the components. First I load the MNIST dataset — I show you a few characters here, okay? — and I train a multi-layer perceptron, a fully connected network, and a convolutional neural net which have the same number of parameters, okay? These two models have the same size: if you save them to disk, they weigh the same. So I'm training this guy here, the fully connected network; it takes a little bit of time and gets about 87%. This is trained on classification of the MNIST digits — from Yann: it actually downloads them from his website, if you check. Anyhow, then I train a convolutional neural net with the same number of parameters. Do you expect a better or a worse result? My multi-layer perceptron gets 87% — what do we get with the convolutional net? Yes, better — why? Okay, so what is the point here of using sparsity? What does it mean? Given that we have the same number of parameters, in the second case we manage to train many more useful filters, because in the first case the parameters are spent trying to capture dependencies between things that are far apart and things that are close by — and those are completely wasted; basically, they learn zero. Instead, in the convolutional net, all the parameters are concentrated on figuring out the relationships between neighbouring pixels. All right, so now I take the pixels and shake them: everything gets scrambled, but I scramble all the images the same way. I perform a random permutation — the same random permutation of the pixels for all of my images. What happens if I train both networks? So here I have my original images, and here I've scrambled all the pixels with the same scrambling function. My inputs are these scrambled images; the output is still the class of the original. So this is a four — you can't see it, but this is a four; this is a nine, this is a one, this is a seven, this is a three, and this is a four. I keep the same labels, but I scramble the order of the pixels, the same way every time. What do you expect as performance? Who does better, who does worse, who stays the same? The perceptron — does it see any difference? No, okay: that guy still gets around 83%. And Yann's network, the convolutional one — what do you guess? Oh, oh — sorry, that's the fully connected one, I changed the order; okay, there you go. So the fully connected guy performs basically the same; the differences are just due to the random initialisation. The convolutional net, which was winning by a large margin before, now performs only about the same — much worse than before. Why is the convolutional network now performing worse than my fully connected network? Because we broke it: we destroyed the locality, okay? So every time you use a convolutional network, you have to ask yourself: can I actually use a convolutional network here? If the three properties hold, then yes, of course, it should give you better performance. If those three properties don't hold, then using a convolutional network is BS, right? — which stands for... bias? No, okay, never mind. All right, bye-bye, good night.