So I apologize, as I said before, I just patched up those slides just before class, so I knew they looked bad, and sorry about that. In fact, I just added an additional slide, which complements the explanation I gave before, which was lacking this particular aspect. So, kernels. We didn't talk about kernels. Again, I apologize. Let's see what these guys are. In this case, we are going to be using one-dimensional data. Everything carries over just fine when we use two-dimensional data, which are images. Images are really 2.5-dimensional because they also have some thickness, the channels, but you can think about images as two-dimensional spatial data. Here we think about 1D data. And we said we go from one layer to the other layer by using these kernels, which we apply first here, then the same kernel again. So the kernel is the collection of these three weights. I apply this guy to the second chunk here, and to the third chunk. It is the same guy applied to different locations, because we, again, exploit the fact that we are using data which is stationary. The same feature can appear in different locations of the data; whether that's a spatial location or a temporal location we don't care, it's just different regions of your data.

All right, so. There was a question before: is it usually normal that you have fewer neurons in the next layer, or is that just for illustration? OK, that is also correct. Yes, I apologize. Whenever you apply convolution without padding, you're going to be lacking a neuron here, because otherwise, where would you send this connection to? I would have this connection, and I don't have anything here. So there is another technique, which is called zero padding: you add as many zeros here and here so that you have the same number of neurons in every layer. This is what we are doing, basically, most of the time.

Or you can do it cyclically? Yes, but that doesn't really work if you have images: if you are at one side of the image, you don't want pixels from the other side coming in. So it doesn't really work for images, for example. Most of the time, we just do zero padding, which means I add as many zeros on each side as the kernel size divided by two, floored. So here we have three divided by two, which is 1.5, floored to one. So I add one more zero here, one more zero here, so that I have five neurons here and five neurons here. So usually convolution won't change the dimensionality; if it does, it's because I didn't pad my input, OK? You can pad, or you can decide not to pad.

[Student question, partly inaudible, about the padded neurons.] Or you mean, since it's the same kernel, it will be a duplicate? Oh, OK. OK, I didn't understand what you said, but I'm happy you understood what you said.

All right, so again, here we go from this layer to this layer by using this kernel, the collection of these three colored weights, over and over and over again, OK? So this one here is my first kernel associated to this layer. Of course, we can have multiple kernels per layer, but I didn't mention that. So here we have my second kernel: the blue, the purple, and the pink. And so, given that we have two kernels associated to this specific layer, these guys here are not scalars anymore; they are vectors in R².
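In code, a minimal sketch of what we just described might look like this, assuming PyTorch's nn.Conv1d (the sizes match the drawing: five samples, kernels of size three, two kernels):

```python
import torch
import torch.nn as nn

# Minimal sketch: padding=1 (kernel size three, halved and floored) keeps
# five neurons per layer; two kernels make each output location a vector in R^2.
x = torch.randn(1, 1, 5)                       # one input feature, five samples
conv = nn.Conv1d(in_channels=1, out_channels=2, kernel_size=3, padding=1)
print(conv(x).shape)                           # torch.Size([1, 2, 5])

# Without padding, we lose one neuron on each side:
print(nn.Conv1d(1, 2, kernel_size=3, padding=0)(x).shape)  # torch.Size([1, 2, 3])
```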
The first element is associated with the convolution with this first kernel. So I have these three guys multiplied by these three, summed together: I have the first value. These three guys multiplied by these three guys, summed together: the second value. These three guys multiplied by these three, summed together: the third value. This is the first item of the R² vectors here. The second item is going to contain the scalar product, the projection, of the input on this second kernel, OK? And this number here represents the thickness of this layer; these balls are somehow coming out of the screen.

So these different kernels, are they taking the same inputs from the previous layer? Is it that the inputs are the same and the weights are different, or are the inputs different? The first one. Since you see there are different colors, there are different numbers, but all these guys here are the same guy. They're the same input, right? So you have your signal here distributed across five samples. And here you multiply this guy with the first kernel. So you have a projection, right? Whenever you do a projection, you see how much of this kind of feature is located here. This is the first answer. So this one tells you what is the projection of this first location of your input on this kernel. You have a kernel, and you have your input. You do the projection of your input on that kernel, you get the first output here. Then you apply the non-linearity, and you can go on. This guy here, instead, is going to be another kernel, another space onto which you project, again, your input data. So here we have two kernels; therefore, these guys are not scalars, but vectors of size two. And here we are using three connections; therefore, I have three values within my kernel. We have the number three here because of the connections, and the number two because there are two different kernels. Yes? No?

Yeah, question. Can we have kernels with different sizes? Like four values in one kernel? Different sizes? Yes, you can. But then you're going to be changing your network. You're going to have your network branching out, and it's going to have, for example, four branches, and each branch has its own kernel size. But within each convolutional layer, every kernel has the same number of items, because it's going to be just one tensor; I'm going to show you soon. Again, if you'd like to have different receptive fields, then you have different branches in your network. You can easily do that in PyTorch: you simply send the same input to different convolutional kernels, and at the end you may concatenate everything; there is a sketch of this a bit further below. OK? Yes?

Can kernels end up with very, very similar weights, so that they are really the same thing? Good question. As long as you initialize these guys with random numbers, it's very likely that these two guys start out as two different numbers. So in your weight space, those guys are different points, and those guys are going to be stepping towards wherever the loss goes to zero, right? Given that one starts here and the other guy starts here, both of them are trying to step in order to minimize the function, but they are still far apart. It's very unlikely that they are going to hit each other and match, because this loss surface has a lot of ups and downs. So one will stop here.
The other one will stop at a different location. But it may also happen that there is only a certain number of local minima in your weight space. What if you start with more kernels than there are local minima in your space? Then definitely you're going to have some kernels colliding onto the same solution.

All right, so that was supposed to be very short. OK, right: three connections, two kernels, five inputs. One more thing: who said those inputs are scalars? This can also be the output of another convolution, and they may be five-dimensional. If we are, let's say, in an RGB image, this number is going to be three, because every pixel has R, G, and B components. If these are whatever other features, they're going to have a different dimensionality. So in each location, the descriptor at that specific location has a specific dimensionality. As you can understand now, these kernels here are also five-dimensional, given that we have to project this guy onto them. So we have these three guys, which are 15 elements now; they enter a scalar product with this guy here, which is also three in this direction and five in this direction.

So finally, you can collect all your kernels in some tensor, which is of dimension three. We have two kernels, one and two, of five elements each, one, two, three, four, five, times three neurons: first row, second row, and third row. So any time you deal with one-dimensional data, you end up with a collection of kernels which is a three-dimensional tensor: number of kernels, number of input features (five), and, sorry, not outputs, number of connections (three). For example, before, we were going from a one-layer input image, grayscale, to six layers. So we had kernels that were five by five, because we are dealing with images, and they have a thickness of one in this case, because my input has just one layer. So each is one times five times five. How many of these do we have? We said six. So in two dimensions, when we play with images, this guy here is going to be six times one times five times five. That was the first convolutional module. For the second one, the kernels now get an input which has six dimensions, so each kernel has to be six by five by five. How many do we have? Also six. So this guy here is going to be six times six times five times five. So: one-dimensional data, three-dimensional collection of kernels; two-dimensional data, four-dimensional collection of kernels; three-dimensional data... Yay, you understood, right? So pay attention: one-dimensional data uses a three-dimensional collection of kernels, and so on. Yes, please.

Each line in the kernel is supposed to be a weight, right? What about the biases? So, the bias is going to be just... R² times three here, right? So each of these will have a bias term? The bias term goes to the output node, not to the connections, right? Yeah, the bias is like the offset. So basically, let me think. I think there are just two. Just two, right? One bias for the first kernel and one bias for the second one, right? Any other questions? Yes? So if you had another kernel, another set of three differently colored weights, it would be three by five by three? Yes, it's going to be three by five by three. If you have seven kernels, you're going to have seven by five by three: seven kernels, five because that's the number of input features where you start, and a kernel size of three, right?
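To see this tensor in practice, here is a small sketch with PyTorch's nn.Conv1d and nn.Conv2d; the shapes printed below are exactly the ones we just counted:

```python
import torch.nn as nn

# Sketch: PyTorch stores a layer's kernels as a single tensor of shape
# (number of kernels) x (number of input features) x (number of connections),
# plus one bias per kernel.
conv1d = nn.Conv1d(in_channels=5, out_channels=2, kernel_size=3)
print(conv1d.weight.shape)  # torch.Size([2, 5, 3])
print(conv1d.bias.shape)    # torch.Size([2]); one bias per kernel

# The two 2D convolutional modules from the example above:
c1 = nn.Conv2d(1, 6, kernel_size=5)
c2 = nn.Conv2d(6, 6, kernel_size=5)
print(c1.weight.shape)      # torch.Size([6, 1, 5, 5])
print(c2.weight.shape)      # torch.Size([6, 6, 5, 5])
```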
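And here is the branching idea from the earlier question about different kernel sizes, as a hypothetical sketch (the module and branch names are made up): the same input is sent through convolutions with different receptive fields, and the outputs are concatenated.

```python
import torch
import torch.nn as nn

# Hypothetical two-branch module: same input, different receptive fields,
# results concatenated along the feature dimension.
class TwoBranch(nn.Module):
    def __init__(self):
        super().__init__()
        self.branch3 = nn.Conv1d(1, 2, kernel_size=3, padding=1)  # receptive field 3
        self.branch5 = nn.Conv1d(1, 2, kernel_size=5, padding=2)  # receptive field 5

    def forward(self, x):
        return torch.cat([self.branch3(x), self.branch5(x)], dim=1)

print(TwoBranch()(torch.randn(1, 1, 5)).shape)  # torch.Size([1, 4, 5])
```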
I was just thinking about the fact that there can be various minima in that weight space. Since you initialize them randomly, each kernel is expected to converge to one of those minima with the gradient, right? But I was thinking: in this dynamics of training, what if we added something that makes each weight point, which represents each kernel, repel the others? I think that would ensure that even if there is a smaller or equal number of minima, two kernels don't converge to a similar solution.

Right, right. So if you add additional terms to the loss, the network will try to perform well on whatever you ask. But usually what I'd like is for it to perform well on that specific loss I define. If you add additional terms, it will do worse on the original task, because it has additional terms to deal with. But you can try adding such terms and see how they introduce complexity into this kind of optimization problem.

So this was just to show you that every layer here has a collection of kernels, which are delimited by the number of connections, and the inputs don't have to be scalar values. That was just for completeness. We are going to go back to the notebook example, where we have trained the convolutional net and the fully connected network. Both of them have the same number of parameters, but we have seen that the convolutional network performs much better than the fully connected one. Because we are, again, exploiting those three specific properties of the data, which were... no one is listening... locality, stationarity, and one more. Given that there are these three specific characteristics, we exploit them in order to improve performance, as we have seen. What happens if we don't have those three assumptions? If those three assumptions are not valid anymore, what happens? We're going to be seeing that soon. Are you ready? Are you excited? You should be.

All right. So yeah, just as Alfredo said, we trained a fully connected neural network and a convolutional neural net with the same number of parameters, and we've seen that the CNN performs much better. Now, when we constructed the convolutional neural net we made some assumptions, which are stationarity, locality, and compositionality. What happens if we relax those assumptions? We can test that by permuting our data. The data that we used in the previous task looked like this, the top rows, which are basically the handwritten digits. Those are actually grayscale images, but we plot them in color. And we scramble each one by flattening it to a 784-component vector, applying a random permutation, and then putting it back into a square image. It then looks like this: we don't see any numbers anymore. But it's the same permutation for every image, so can the network still tell what it was? We simply go on to training the convolutional neural net on the data with permuted pixels, and here's what we get. We do see the training working: the training loss is decreasing, and the training accuracy was good. But the generalization, the test accuracy, was about 80%. I would say it's quite high, given how the images looked. Let's see how the fully connected network performs: it's 89%. We don't remember how it was in the previous experiment, so let's just plot it.
So we see that, for the convolutional neural network, sorry, the fully connected neural network, the first and the last bars: nothing actually changed. And in fact, if you repeated the training multiple times and averaged the results, you would see that it is almost exactly the same. For the convolutional net, we see a significant performance penalty: the performance dropped by over 10%. So that's what happens when we remove the assumptions that made the convolutional neural network work so well on images. All right. Questions? Yes?

When you randomly scramble, I'm not sure I understand what it does. If I then write a four, will it make out the four? Or do I have to scramble it in the same way? So, we know the label. We know the image of a four is a four. But when we see an image of a four, we see basically local patterns: the pixels nearby have similar intensity, and we see the edges. Back when Alfredo was talking about the cat picture: when we scramble it, we no longer have locality; the pixels nearby are different. But we still know that it's a four. That's right, though: next time, if somebody writes a four and I want to test whether it's a four, I have to scramble it in the same way. Yeah, you have to apply the same scrambling transformation, right? Same permutation; we fix the random seed, yeah.

The whole point is to show that if those assumptions we had before, locality, stationarity, and compositionality, are not valid anymore because we scrambled the positions, we cannot exploit those specific characteristics of the input anymore in order to improve performance. Whereas the fully connected layer doesn't have any such assumption: connections, you can scramble them, they are still the same connections, right? You have to preserve the same scrambling... Oh, yes, of course, it's a deterministic scrambling. It doesn't look like a four anymore, but it keeps the label four.

Yeah, I don't really know how to phrase this as a question, but those scrambled images don't look any different from each other, right? Pretty much. So even if the label is still the same, it's surprising that it could say, with 80% accuracy, oh yeah, that middle one, that's a four; the one just to the right of it, that's a one. That's why convolutional neural networks work so well: they can do magic even though we tried to kill them, right? You can still think about the convolutional neural network as being a fully connected layer for those central pixels, right? So although we have so many parameters, the network managed to learn something even without using the neighboring information. Right now, every kernel is just tuned to find specific pixels in specific locations. That's why it still performs fine-ish, 80-something percent, whatever it was.

Is the scrambling random? No, the scrambling is a deterministic function, so every time you have the same scrambling, okay? It's just to show you that if I change the location of those pixels, the fully connected layer doesn't see it, it's the same. The convolutional neural net, instead, is going to perform worse: you don't have that advantage anymore with respect to the fully connected network. So the scrambling here is basically a random permutation with a fixed random seed.
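A sketch of that scrambling, under the assumption that the notebook does something along these lines (the seed value is arbitrary):

```python
import torch

# One fixed random permutation (fixed seed, hence deterministic),
# applied to every flattened 28x28 image.
torch.manual_seed(0)
perm = torch.randperm(28 * 28)  # the same 784-element permutation for all images

def scramble(img):
    """Flatten a 28x28 image, permute its pixels, reshape back to 2D."""
    return img.view(-1)[perm].view(28, 28)
```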
And before we permute, we flatten the image, we permute the pixels, and then we put it back into 2D. Yes?

My question, I think, goes back to the difference between the fully connected and the convolutional neural net in terms of training. If we took the fully connected one and just ran the training for more epochs, could we still get the same performance as with the convolutional one? No. If you have the same number of parameters, the modeling power of the convolutional network on natural data is much higher. Given the same number of free parameters for both models, the convolutional neural net can exploit those parameters much, much better, because it doesn't use extra parameters to try to figure out connections with far-away values, or the same feature separately at every position. You can reuse your parameters in order to extract the same kind of feature in multiple locations. So, given the same number of parameters, and given that you're dealing with data for which those three main assumptions are valid, the convolutional neural net is going to have a higher modeling power. On the other side, if you have the same modeling power, a convolutional neural net and a fully connected network which perform with the same accuracy, then you're going to find that the convolutional neural net has much, much less parameters.

To add to what Alfredo just said, as you can see from the result, the fully connected network actually had pretty high training accuracy, which is along the lines of what you said: if we train it longer, we can get very high training accuracy, but the generalization power will be very bad, and on the test set it's actually pretty low. Which means, with the same number of trainable parameters, those parameters are less meaningful, as Alfredo said.

So one could devise a strategy where you start with a fully connected network, you see which weights go to zero, or really close to zero, and use that as a way to guide your convolutional architecture? Potentially, but if you just start with many fewer weights, you're much faster; you have much less computation. Yeah, but maybe you didn't put in enough weights to begin with. Okay, sure, I could give a full course on this. Initially, you're going to increase the dimensionality of your network, perhaps increasing this six here. You have to increase it until you see 100% here: you increase your model capacity up to the point where it is able to fit your training data 100%. But then it's going to perform very poorly on the validation set, because it's basically memorizing your input data. And that's when you start introducing other regularizing techniques, namely dropout, which turns off some neurons, or maybe you're even willing to use batch normalization, which is not a regularizing technique, but it works as a regularizer. Again, you should be using batch norm all the time, because it performs some kind of regularization as a byproduct, and it's also going to speed up your training procedure. So first, you try to overfit your training set. Then, when you're sure that you can completely fit your training set, you start adding regularization blocks, which are going to improve your performance on the validation set.
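As a generic sketch of that recipe, not the notebook's actual model, this is what adding those regularization blocks might look like in PyTorch (the layer sizes are illustrative):

```python
import torch.nn as nn

# A model with enough capacity to overfit, then dropout added as a
# regularizer, and batch norm (a regularizer only as a byproduct; its main
# job is to speed up training).
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly turns off neurons during training
    nn.Linear(256, 10),
)
```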
And when you reach the highest validation accuracy, or the lowest validation loss, you stop there. This is like the cross-validation technique. So, actually, about batch norm: it will speed up convergence, but the training time per epoch itself won't be sped up, right? Yeah, of course, yes. It speeds up convergence. And it also performs this regularization; again, I should tell you more about why it does. So if someone asks you, let's say in a job interview, is batch normalization a regularization technique? Your answer is going to be: no, but it performs regularization as a byproduct. Bless you. All right, any other questions? No? Yes?

So, what is the information that the fully connected network can learn from a picture? Just how many pixels are of one color with respect to another? Right, right, right. So if you think about that... can you scroll to the 700-something number? Where is the 700? You mean...? There's a number, there you go, here. So, this number here: you can think of one of those images as one point in a 784-dimensional space. And those neural networks are simply moving those points into different regions, so that the last layer can just chop them apart with linear boundaries, with hyperplanes. We have seen that yesterday, right? Whenever you have those neural networks, you are basically warping the space. In this case, you have a 784-dimensional space with those dots, where each dot represents a specific image. You're just moving things around so that points of the same class end up together in different regions, and then you carve out each region in order to tell, oh, that point belongs to that class. That's how the fully connected network works. The convolutional one does something similar, but in a more principled way, because it exploits those other properties of the images that we mentioned before. Yes?

Are there ways, within the framework we have here, to feed it just one image and see what it outputs, to test what you've trained at the end? As opposed to just saying, okay, cool, I know that it's ninety-something percent accurate. Can you simply give it things? Yes, yes, of course. That's actually here. If you scroll down to the test loop, wherever the test loop is... You want the test loop? The test loop is here. Right, so this is the testing loop. You just take your model and provide it data: you can write your number on a piece of paper, take a picture with your phone, resize it to 28 by 28, then do a NumPy image load, something like that, and put it here. And this guy is going to spit out the probability. The probability distribution over the 10 classes is going to tell you how likely it is that the drawing you made represents each of the 10 digits. So you can actually do that. Or maybe, okay, next time I'm going to add one extra cell, I'll do that tonight, where you can actually load your image and perform that test. Any other question? If not, thank you for listening, and I'll see you, I guess, tomorrow for one more lesson on machine learning. Thank you.
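A hypothetical version of that extra cell might look like this; it assumes the trained `model` from the notebook takes flattened 28 by 28 inputs, and the file name and preprocessing are assumptions for illustration, not the actual notebook code:

```python
import numpy as np
import torch
from PIL import Image

# Test the trained `model` (assumed to exist) on your own handwritten digit.
img = Image.open('my_digit.png').convert('L').resize((28, 28))
x = torch.from_numpy(np.array(img, dtype=np.float32) / 255).view(1, -1)

model.eval()
with torch.no_grad():
    out = model(x)
probs = out.softmax(dim=1)      # use out.exp() instead if the model already
                                # returns log-probabilities
print(probs)                    # distribution over the 10 digit classes
print(probs.argmax().item())    # the most likely digit
```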