All right, so again, I'm Alfredo; since that's too long, you can just call me Alf, it's fine. Also, feel free to follow me on Twitter and post a picture from the talk, or, if you found something that is useful for you, share your opinion; it means something to me.

All right, so today we are going to be talking about convolutional neural networks. Basically, we are going to be exploiting the stationarity, locality, and compositionality of the input data, which is most of the time natural data: data that is captured with some sensors from reality. Given that these are basically transductions of some natural signal, they will have some properties, namely these three here, which can be exploited in order to improve the performance of our model. Okay? All right. Again, like last time, I just made these slides a few hours ago, so there are going to be some mistakes. If you don't understand what's going on, just stop me and ask me anything. Don't let me keep going, because if I lose you it's very hard for me to get you back on board. Thank you.

All right, so let's get started: nice picture of a cat. So, signals can be represented as vectors. For example, if we deal with a one-dimensional tensor, which, as we have seen last time, means a vector, we can represent a waveform where the height of every sample x_t may represent the air displacement measured by a microphone. There is a transpose there because I usually have vectors as column vectors in my mind; although I wrote them as a row, these are column vectors.

In the other case here we have a cute cat, and we can also think about it as a one-dimensional tensor: a vector where the elements are all the pixels. For example, first I have the first row, then I have the second row, and so on. So if this kitten is, let's say, 10 pixels by 10 pixels and grayscale, I can think about this cat as one point in a hundred-dimensional space. Another point may be another kitten, and so on. In this case my x_ij are pixel values most of the time.

Perhaps we could represent this image with a better representation, which would be a two-dimensional tensor, a matrix, where down the rows we have the pixels that are vertically distributed, and across each row we have the pixels going horizontally. Maybe that's a better representation than having everything in just one long vector. Why do I represent it as one long vector anyway? Because yesterday we have seen that for playing with basic neural networks I just have vectors and matrices, and these affine transformations of vectors. So, as we are going to see later in the notebook,
we are going to be playing just with these kinds of vectors and matrices in order to move kittens around.

Finally, we can also represent the sentence "John picked up the apple" as a sequence of words, where every x_t is perhaps a one-hot vector which represents which word out of a dictionary is being used in that specific position. So x_1 may correspond to "John", x_2 may correspond to "picked", and so on. For example, "John" may be whatever word in the dictionary: you open your dictionary, you look for the word "John", you count its position from the beginning, and that is going to be the position of the one in your vector, which has a size of as many words as there are in your dictionary. Okay, does it make sense? It's the one-hot encoding from yesterday.

[Student] Is that how a lot of natural language processing works, just having the same long vector with all the different words?

Yeah. So you start with that one, and then you may collapse it into some more condensed version, which is basically an embedding representation of that one-hot; but again, you start with the one-hot at the beginning.

All right, so these are different kinds of signals: we can represent audio, we can represent images and text, and everything can be represented in a vectorial format, in this case just a one-dimensional tensor, a vector.
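Before moving on, here is a tiny sketch of that one-hot encoding. The five-word dictionary is made up purely for illustration; a real dictionary would have as many entries as there are words in your vocabulary.

```python
import torch

vocab = ["John", "picked", "up", "the", "apple"]          # toy dictionary
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str) -> torch.Tensor:
    """A vector as long as the dictionary, with a 1 at the word's position."""
    x = torch.zeros(len(vocab))
    x[word_to_index[word]] = 1.0
    return x

print(one_hot("John"))    # tensor([1., 0., 0., 0., 0.])
print(one_hot("picked"))  # tensor([0., 1., 0., 0., 0.])
```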
So what are the particular properties of natural data? Again, this is an audio signal, this is an image, and this is natural language. All of these signals have some specific characteristics: they are natural, they look natural to us. What does that mean?

Let's pay attention first to the audio signal. On the left-hand side we have this sinusoidal kind of waveform, but later on we have the exact same sinusoid happening in two different locations. So the same kind of small detail may happen in two different parts of the signal. If we look at a larger portion, we can see for example that this peak here is happening again there, and maybe that same waveform is also there, and we don't really care where it is. So we observe the same kind of pattern in multiple locations across, for example, the temporal axis of this chart. Natural signals show some kind of repetition of the same patterns over their domain.

All right, so let's go back to our kitten and focus on a particular detail. Like before, where we observed the sinusoid when we zoomed into the waveform, let's focus here on the nose. Let me zoom in a little bit. Okay, like a monster, right? What we can notice is that this particular shape in this circle may be very similar to whatever happens over there: a darker region surrounded by lighter pixels. So similar things normally appear again in different regions of our signal; in this case we move spatially, in these two dimensions, whereas before we had time across the x direction.

Also, we can ask: what is the probability that those two pixels here, this one and the other one, are of the same colour? Question to you. It's very high, right? So pixels that are close together tend to have the same colour. What about these two? Not that high, right? Because you go further away from that neighbourhood, so many things may change. And what about now? They are completely different: we get a black pixel at the beginning and now we see an almost white pixel. Of course, the further you go away, the less related the pixels are.

So here we can identify two different characteristics. The first one is that we have specific features happening at the same time in different locations. The second is that things that are close together are very similar, and the further you go away, the less similar they look.

Basically, the first one is stationarity of the signal: the signal presents the same kind of patterns over and over again. Stationarity is the first word you want to keep in mind. The second one is locality: natural signals are local, so there is information defined in a neighbourhood of your signal, and as you go away the information starts changing. Okay, so locality and stationarity.

Moreover, how do we make up this kitten? With multiple kinds of patterns, right? We have patterns, and patterns of patterns, and patterns of patterns of patterns. So there is also a notion of compositionality: there are several layers of abstraction. If you go down to the pixel space you can see edges, but as you go a bit further away you can see some kind of shapes, and if you go further again you are going to see the whole cat.

So there are three main concepts we want to keep in mind. The first one was, help me, stationarity, which means we observe the same pattern over and over again. Then we have locality, because things are meaningful only in small regions; we don't care about things that are further away. And the third one is compositionality, which means things are made of smaller things, which are in turn made of smaller things. Very well. So let's keep this in mind, and let's see how we can exploit these three specific characteristics of natural signals in order to improve the performance of our neural network. Okay, are you excited? Very well.

All right, so, recap from yesterday's lesson: the fully connected layer, just the classical neural net we have seen yesterday. They are also called multi-layer perceptrons, or dense layers; you can call them fully connected, you can call them whatever you want. I use this notation here: we have the input at the bottom because, as I said last time, you should be putting the input at the bottom all the time. This makes sense because here we have the low-level features, and as we go higher in the hierarchy we have higher-level features; you go up in the hierarchy as you proceed up in the network. If you have it upside down it doesn't make sense; going up by going down doesn't work. So we are going to have the input at the bottom. Here we apply an affine transformation, perhaps with a matrix W, and then we apply a first non-linearity f. After that we have the hidden representation; we perform another affine transformation and another non-linearity, and then we get our prediction ŷ. The hat means this is not the actual target; it is my prediction of the target, whereas y is my target.
All right, and these are the formulas, which are just representing this pretty chart over here. Okay, nothing new, the same stuff we have seen yesterday. The non-linear function can be the positive part (ReLU), the sigmoid, for example the logistic sigmoid, the hyperbolic tangent, and perhaps the Boltzmann function there.

All right, so how does this neural network work? Here my x has five elements, so my input vector has dimension five: it is a one-dimensional tensor of size five. Then I'm going to have my first hidden layer, so I write h1, and then I may have a second hidden layer, right? So we start making deep neural networks. Whoa. How can we make it deeper? Let's put a third hidden layer. Whoa. All right, let's finish: let's put an output there, my ŷ.

[Student: why is it drawn sideways?] Yeah, because there was no space; so either from left to right or from bottom to top, never from right to left or top to bottom. [Student: do the hidden layers have to have the same size?] They have the same size here only because I did copy and paste. Good question, though.

All right, let's go on. So the input layer x is also called the activation at layer one: the first layer corresponds to the input, and it can also be called the activation at layer one. In the same way, hidden layer number one corresponds to the activation at the second layer; then we have the activation at the third layer, the activation at a generic layer l, and the last layer, as you can guess, is going to be the activation at layer L, where capital L is my total number of layers. So we go from a1 to a2 by using the first weight matrix, W2; to go from a2 to a3 we use W3; and so on, up to WL.

So let's pay attention to the first output, the first element of this vector. How do we compute this guy? It is simply the first row of the matrix times my vector, so a scalar product, plus the first bias, which is the first scalar. More generally, if I consider the j-th element, I take the j-th row. Okay, you look kind of perplexed? Okay, very nice. And this one is just the usual definition of the scalar product: you multiply and sum all the components.

So from the input we have these edges going up, where each of these lines represents one of these weights; if we consider the second element, we have all the other edges, and the same for the remaining ones. Each of these nodes is also called a neuron, or an activation, and each of them is basically the scalar product between the input and one single row of the matrix; then we apply the non-linearity, and so on for every one of them. So this neuron here sees all the input, this one sees all the input; every neuron sees everything that happens in the previous layer. The same if we want to go to the second hidden layer, the third activation: I'm going to have those connections, which are stored in the W2 matrix, and so on for the third layer with W3; finally we have our output layer with all the connections in WL. Okay, so this is again just recap, but I'm showing you a lot of connections, which took ages to draw.
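To make the recap concrete, here is a minimal sketch, my own illustration rather than the slides' code, of the two affine transformations plus non-linearities written in PyTorch. The sizes (five input features, three hidden units, three outputs) are placeholders matching the drawing, nothing more.

```python
import torch
import torch.nn as nn

# h = f(W_h x + b_h);  y_hat = g(W_y h + b_y)
model = nn.Sequential(
    nn.Linear(5, 3),    # first affine transformation (W_h, b_h)
    nn.ReLU(),          # non-linearity f, the "positive part"
    nn.Linear(3, 3),    # second affine transformation (W_y, b_y)
    nn.Softmax(dim=1),  # output non-linearity g, e.g. for classification
)

x = torch.randn(1, 5)   # one sample with five features
y_hat = model(x)        # the prediction, not the target y
print(y_hat)
```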
Right, pay attention: since I'm lazy, I'd like to draw fewer lines. Let's see how we can improve this diagram. All right, next one: locality. So what is locality? Can you remind me, please? Just speak, you don't have to raise hands; I don't know what raising hands means.

Right, so: features. We just care about things that are close together; we don't care about looking further away, because what is far from here is basically unrelated. [Student: what happens at the boundaries?] Let's say you pick a pixel: you can decide different sizes for how far away you look, so you can look three pixels away, you can look five pixels away. I'm actually going to talk about this in a second.

All right, so let's go from layer l-1 to layer l. Here every one of these five inputs goes to the first neuron, and the same for the second one and for the third, so overall we have five times three, fifteen connections. Now, as we said before, we don't care too much for a neuron to look at what happens very far away, so in this case we are just going to have fewer connections: we drop these two connections to this neuron, the same for this one, and the same for that one. So overall we now have nine connections, and it took less time to draw all those lines. From 15 to 9, just by dropping the connections that come from inputs further away.

As we move from left to right, we increase our global view. What does that mean? It means that this neuron here observes just three inputs; but as you move one layer to the right, this neuron effectively sees five inputs; move one more to the right and you are going to see even more. So as we go up, we increase our understanding of the overall big picture we are looking at.

So we now define something called the receptive field, which is what your friend was mentioning before. The blue neuron here sees only three neurons from the previous layer, so its receptive field is three: it sees three items on the other side. This other neuron also sees only these three, so its receptive field is also three. Now, what is going to be the receptive field of this one with respect to the input? Can you guess? [Students: three times three? five?] Five; this is five. Sorry, the slide says something else because of copy and paste. This neuron sees five inputs. So the more you move up through the layers, the more the receptive field grows with respect to the input.

All right. So that was locality: we just care about things that are close together.
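As a small, hypothetical illustration of how that growth works: with stride-one layers that each look at k neighbours and no pooling, the receptive field after n layers is 1 + n·(k − 1), which gives 3 after one layer and 5 after two, as in the diagram.

```python
def receptive_field(num_layers: int, kernel_size: int = 3, stride: int = 1) -> int:
    """Receptive field w.r.t. the input for a stack of identical conv layers."""
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump  # each layer widens the view
        jump *= stride                  # stride compounds across layers
    return rf

print(receptive_field(1))  # 3: one layer of 3-wide kernels
print(receptive_field(2))  # 5: two layers, matching the drawing
```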
What is the next one? Stationarity. So, we start right now with this smaller number of connections: nine connections, three times three, rather than fifteen, after dropping those extra two connections per neuron. What do we do now? We are going to draw the first group of connections here in yellow, the second group in orange, if you can see the colours, and the third group in red. So what does it mean? This neuron treats these three inputs in the same way as this neuron treats its three inputs, and in the same way as this third neuron treats its inputs. This set of weights is a kernel, and I'm using the same set of weights for every neuron: I don't want to have nine different weights, I just have three weights, which I'm going to be reusing across my different neurons. Okay, so big difference: before, we had a different weight for every edge; here we are sharing the weights. That's why it's called parameter sharing: we share parameters across different neurons. And here I collect these three arrows as my kernel; in this case it is just a vector of three elements, shared across the different neurons.

[Student question] Yes. So every arrow here, every edge, represents a weight. These three are three numbers representing my three weights, and together they are called a kernel (I should actually write "kernel" there), and it is reused by these three neurons. So one kernel is associated with the layer, and it is reused at every position. And if you're familiar with convolutions, this is simply a convolution: you have a kernel here, and it slides across the input.
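A minimal sketch of that parameter sharing in PyTorch, my own example rather than the slides': a 1-D convolution with a single three-element kernel slides the same three weights (plus one shared bias) across every position of the input.

```python
import torch
import torch.nn as nn

# One kernel of size 3, shared across all positions.
conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=3)

x = torch.randn(1, 1, 5)   # batch of 1, 1 channel, 5 samples
y = conv(x)                # 3 outputs, each computed with the same 3 weights
print(y.shape)             # torch.Size([1, 1, 3])
print(sum(p.numel() for p in conv.parameters()))  # 4 parameters: 3 weights + 1 bias
```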
This sharing allows us to have faster convergence of our training algorithm, because we have fewer parameters, fewer things to tune in order to reach convergence, and fewer computations. We also get better generalisation: when we learn a specific pattern for this neuron, this other neuron already knows what the first one learned, because they share the parameters. So once you learn those parameters somewhere in your input data, you can reuse that knowledge elsewhere, and you get better generalisation to unseen patterns.

It is also not constrained to the input size. What does that mean? If I add one more neuron to a fully connected layer, I have to add new edges and retrain the network, because I don't know what the values for the extra connections should be. Here, since I'm just reusing the same values across the whole input, I can simply add the new input and use the same kernel values everywhere.

[Student] A naive question: since you have fewer weights, wouldn't this thing lose performance compared to the fully connected layer?

And I say naive because you probably already suspect the answer. You're going to see something very soon in the demonstration in the notebook. If you have the same number of parameters, this kind of weight sharing is going to perform much, much better. With the same number of parameters, this one can learn many more patterns, because in the fully connected case the weights on things that are further away never really learn anything meaningful: things that are far away can change, they are not related to the local part. If you look at something close, you have some uniform statistics that keep happening again and again around you; things that are further away can be anything, so they are unrelated. So those far-away weights basically learn to be zero, sort of: they want to be invariant to whatever happens further away. In a fully connected layer, when you use natural data, you're basically wasting connections, because those connections don't bring any information into the game. So even if you use the same size, an exact one-to-one correspondence, and we drop the number of parameters here, you don't see any worse performance, because those extra connections don't bring much improvement; they connect to things that are far away. Of course it depends on which kind of signals you are using and on how many connections you keep: here I just showed you three, but you can choose five, you can choose seven; you don't have to limit yourself to three.

[Student question] That's correct, and you're going to see very soon what happens if you are evil and start messing with the data.

So: faster convergence, better generalisation, not constrained to a specific input size, and kernel independence. Oh, you have done parallelisation, right? Someone was talking about parallelisation yesterday; you're nodding, so I understand you've done it, good. Given that these kernels work independently, you can parallelise this drastically, because they don't rely on each other; they are independent. You know better what I'm trying to say. Last one: given that we are using sparsity, we have a reduced amount of computation with respect to the fully connected layer.

[Student question] Okay, I understand what you're saying. So here you're trying to look for this pattern, and maybe you have it, so you get a high activation here and here, and there you don't see the same pattern, so you're going to have a low activation. But you're just looking for this three-long pattern; you look for this three-long pattern everywhere. If you have a long signal, let's say a signal with 100 units, you just have this little kernel going down the whole span of your signal, and you get a higher activation wherever you find the pattern, perhaps just in one location.

[Student] In that context, when you're training, it sees the pattern somewhere, and because the weights are shared, backprop learns to change the weights; but some other part of the image doesn't see that pattern. Wouldn't that screw the weights up?

That's why you have multiple kernels: in every layer you have multiple kernels, and every kernel tries to specialise on a specific pattern with which it resonates. So when it gets some excitation, "I found my pattern!", it gets very excited, this gets rewarded, and it starts growing. If it doesn't see the pattern, then, when you perform back-propagation (we didn't talk about back-propagation, but anyhow), when you compute the gradients you multiply by the input, the activation itself. If the activation is zero, there is no gradient.
So the gradient gets killed when the activation is zero. Every time a kernel gets excited, its gradient gets boosted, and that kernel gets to improve; if a kernel doesn't see anything, it doesn't even get to change its own values. But good point.

[Student] Since it starts with random weights, how is it supposed to know what it's looking for?

Okay, I have a lesson on that too, but you should come to my class; just look at my video recordings. Anyhow: when you start with random weights, you have some chance of a specific kernel responding to a specific pattern. Any time you observe that pattern, one of your kernels is going to respond a bit more strongly than the others; the one that is a bit higher gets improved, and as it goes up, the others get pushed down. So basically, when one of your randomly initialised kernels starts resonating with a specific pattern, it strengthens its own response to that specific pattern while everyone else gets pushed down for that pattern. Does it make sense? No? You can say no, and I'll try again. Okay. Other questions? No? Should I keep going, you like it?

[Student question] Yes; I didn't say it explicitly, but that's what happens if you check how the training mechanism works, which I didn't actually explain. Yes, every kernel is going to specialise on a specific feature, and whenever it specialises it gets to improve its own performance on that feature. Every kernel is going to be focusing on a specific kind of pattern. Okay? All right. Other questions? No? Shall we go on? People are sleeping. Hello? Oh, hello, you're back.

All right, let's go on; more pictures. Okay, so how do we use this convolutional neural network with, let's say, images? Because that's what I played with for most of my PhD. This is a standard network, and I'm going to try to explain some of its components.

First, we have multiple layers. Here I have these deep blocks, labelled D, each containing some convolutional module or layer. So we have deep networks, with multiple layers, which are convolutional because we'd like to share parameters and make them sparse, so that we don't have so many computations. These two things basically summarise the three specific characteristics of natural data: we said it is local, so locality and stationarity, which are exploited by the convolution, and then compositionality, which is exploited by the multiple layers and the receptive field concept. So this is how we exploit those particular statistics.

Then we have non-linearities. Why do we have non-linearities? If you have just linear layers, what is the output of a matrix multiplied by a matrix multiplied by a matrix multiplied by a matrix? Just a matrix, right? It all collapses. So you need non-linearities in order to keep those layers from collapsing into one.

Then we have pooling. Huh, what is this stuff? You're going to see soon what it is. Then, okay, batch normalisation: this is something you're going to be using every time you train networks. It improves the speed of convergence, and it also adds a kind of very good noise which allows your network to generalise better.
I know I said many things; if you didn't understand, it's fine. Just use this stuff, it works very well.

Last one: residual connections. One more thing that has been very popular recently is skip connections. Here you go from the bottom layer to the second layer to the third layer, but every time you initialise this stuff at the beginning, before training, it is randomly initialised, so whatever comes in gets distorted, more distorted, more distorted; up here you just have crap. Given that you add this extra connection and sum it in here, the network can decide to just turn off this block, and therefore you start with a smaller network; then, if the network needs some more modelling power, it actually starts using some of the modelling power of this block. So these residual connections are something that are really, really good, because your network starts with a small task and then grows in modelling capacity. These two, batch normalisation and residual connections, are something you're going to be always using in your networks.

[Student: what are the D blocks?] Yes, these come from a copy and paste, I guess: those are discriminator blocks, discriminative blocks, and that's why I call them D, because I also had a thing coming down which was some kind of generator. So you can just think about those as layers of your neural network, nothing fancy; each of them contains a convolution, a non-linearity, and maybe a pooling, which I haven't explained yet.

All right, so let's see what poolings are. Okay, sorry, actually let's first see how the whole thing works. Here I provide my image at the bottom, which is maybe a three-layer image, because it has RGB channels, and it has a spatial dimension. At the end I have something that is more like a vector, where each element may tell me the likelihood of observing a specific category, if we are doing classification as we have seen yesterday. In yesterday's case we had a vector of size 3, where each element expressed how likely that point was to belong to one of the three spirals we had defined.

So we start with something that is flat and large, very thin and very wide, and we end with something that is one by one: very narrow, but very long in depth. In between, what do we have? Something in between: not that much spatial information, but much more thickness. The information contained in this part (using "information" in the everyday English sense, not any specific mathematical sense) is the same information that is up there; but down here the information is expressed spatially, and as you go up the information gets encoded in the depth of the representation. Down here the spatial information may represent where the position of the cat's ear is, where the nose is, where the eye is, and the depth tells you the colour, something that is locally defined at that specific point. [Student question] Yes, this top thing is one by one pixel for your entire image, just one for the whole image, and each element may tell you what the likelihood is of observing a cat, a dog, a horse, or anything.
So up here you have descriptive information, which is no longer spatial but tells you the likelihood, the probability, of having a specific item, if we are doing classification for example; whereas down here the information is spread across the spatial dimension. So we go from spatial to descriptive, and in between you get a halfway representation. And we go from thin to thick. All right, I'm almost done.

So, pooling, I said, right? What is pooling? Let's start by drawing something. I've just drawn the top corner of my kitten picture, so here I can represent the four pixels in the top-left corner. I can take those pixels and compute the p-norm. What is the p-norm? Just that thing there: the p-th root of the summation of each element raised to the power p. If you put the number two there, you basically have the Euclidean norm: the square root of the summation of the squares of the items. If you have the 1-norm, you just have the summation of the absolute values. And if you let p go to infinity, you basically have the max: the max norm. So, given those four pixels, we can extract the maximum value, or extract the norm, or extract whatever you want with this p-pooling. Most of the time you're going to be using max pooling, which is basically that last case.

So we take the first window there, we perform the Lp norm, and we get our output over here. Then we move our window across to the next four pixels, and again we perform the Lp norm and we get our next value. Okay, and we do this across the whole image. So if we perform this operation on the input image, what is the output of this operation going to be? A spatially smaller image, right. Sweet. For example, if I started with an image which is m by n, I end up with m/2 by n/2 if I use these kinds of parameters. My channels c here are going to be the same channels there: we didn't change the depth of the information, we just changed the spatial direction.

Something you want to keep in mind: this was my activation at a specific layer, I apply pooling, and how much information did I lose, as a fraction? Okay, or how much information did I retain? One fourth, okay, thank you; I didn't ask for a percentage. Anyhow, we have retained one fourth of the information. This is considered a very, very strong bottleneck, which means you're going to be losing information; really, that's bad. So what is very good practice? To have the convolutional layer actually expand the channels: if you start with c channels, you end up with 2c channels, so that when you cut the x and y dimensions in half you only go down to one half of the information, not one quarter. Does it make sense? No? Hold on, just let me check first that people understand: first we said that by applying pooling we shrink the image by half in each direction, so we keep one quarter; so if I double the channel size, I'm only losing half, right? Okay.

[Student] By channel size, do you mean the number of values that index i can take? Is that the RGB, the amount of R, the amount of G, and the amount of B?

No, no, these here are simply the four pixels, one, two, three, four, inside one window. The operation is applied to every channel: for every channel I perform this kind of shrink.
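A tiny sketch of that pooling step in PyTorch, with made-up values: a 2×2 max-pool with stride 2 keeps one value per window, so the spatial size is halved in each direction while the number of channels stays the same. (PyTorch also has nn.LPPool2d for general p.)

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)   # the p -> infinity case of Lp pooling

x = torch.arange(16.0).reshape(1, 1, 4, 4)     # 1 image, 1 channel, 4x4 pixels
y = pool(x)
print(y.shape)   # torch.Size([1, 1, 2, 2]): spatial size halved, channels unchanged
print(y)         # each entry is the max over one 2x2 window
```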
I haven't told you why we do this; you can ask why. All right, very good: so why are we doing that? I didn't sleep much, so maybe I forgot. We do it because there is too much data at the beginning: we have so much spatial data, and we are trying to go from that large representation to something that is narrow. Remember the slide from before: we were large, and then we go narrow. We increase the depth dimension by using convolutional kernels, where every kernel gives you one plane of that stack, and by using pooling we reduce the spatial information. So we go from our spatial image at the bottom to something that doesn't have any more spatial information but is just a descriptor of the image.

[Student] Yes, that's like when you have some print in a newspaper: if you look at it from very close you will not see anything, but if you move far away you will see the picture.

Yeah, that's what we are doing. [Student] So we are producing different layers in order to see different sizes of objects? That's correct. We perform this for different reasons; one of them is to have different zoom factors on our data: the more you apply these operations, the more you have a global view, the further away you move from your newspaper. The other reason is that we are lowering the amount of data we are dealing with.

[Student] And then at the end we have classification, right? Yeah, we start from the image, we end up with a classification. [Student] And this is done through the neural net, those layers, whatever convolutions we do; but on the next slide you're showing us how to do it by hand? You change this dimension from large to smaller by using the pooling operation, and you increase the thickness by using the convolution. Yes, this is my neural network, and each of these blocks contains a convolutional layer, a non-linearity, and a pooling operation; it's like a sandwich. What is your question, sorry?

[Student] Then is everything going to be the same size as you go through the layers of the neural network? Is the pooling something additional you add, or is it always there?

If you don't apply any pooling from here to here, you're going to have the same spatial size; we just increase the thickness. Every time I do a convolution between this input and one kernel, I get the first plane; then I do the convolution between this input and another kernel and I get the second plane, and so on. So here I have multiple outputs from multiple kernels, but the spatial information is the same; it doesn't change.

[Student: what do the units represent?] The units represent the thickness. Before, I was showing you fully connected layers: in that case, despite the fact that here there are three and here there are five, you can think about this side also having five, where the extra ones get their inputs from outside. Whenever you apply a convolution like that, you're going to have the same number of inputs and outputs. So if we go back: if I apply one convolutional kernel to this image, I get something with exactly the same spatial dimension; if I apply a second convolution, then I get the second plane.
So if I apply ten kernels, I'm going to have something that is ten planes deep. But then again, I would like to remove this spatial information so that I end up with one vector. [Student] Is that the c on the next slide? Yes, the c is the height of this stack here; I should put that notation on the slide. There was another question there, yeah?

[Student] So here I have three layers, one, two, three, in this first block, but here we have maybe many more?

Yes: if I have three here, the thickness of this one might be, say, 20, and this last one may be the size of the number of objects I'm trying to look for. Yes, that's correct, if I use colour images; and yes, if I want to perform a classification task. [Student] And if you don't do pooling, you will never end up with one by one? Exactly; I have to change the size of this thing, right? If I apply pooling, this becomes half by half; I apply pooling again, it's half by half again, and then I end up with one by one at the end. The final decision is going to be applied on the whole image.

All right, so I think that was it, and I was also on time. That was the final thing: how you perform the pooling operation. At the beginning, in the first layer, we start with three channels, and maybe we apply a convolution and end up with ten channels, or let's say we go from 3 to 12; but then again we cut the spatial size in half and half. So with one operation I try to increase the thickness, and with the other I try to remove the spatial information.

So here we have seen how a convolutional neural network is built and how we apply the different kinds of layers, and before that we have seen a comparison with the fully connected layer, which has many more parameters. Right now we are going to show you a quick example about training two networks with the same number of parameters: one is a fully connected network, the other is a convolutional neural network. Given that they have the same number of parameters, but the convolutional neural network is sharing those parameters, you can have many more kernels; I mean, you have much more modelling capability. So we are going to see how they compare by using natural images, let's say pictures of numbers drawn by some individual, and we are going to see how the convolutional network and the fully connected network perform. Then we are going to see what happens, as your colleague was mentioning before, if one of these hypotheses is not there any more: what happens if locality, stationarity, or compositionality goes away? We are going to see the result very soon. Other questions? Yes, please.

[Student] I'm just curious: for the residual bypass connection, do you use that for the same reason you would implement dropout?

Yeah, so, okay, let's imagine we have 100 layers and there are no bypass connections. When we initialise those networks, we usually initialise them with some small values towards zero. When you perform the computation for getting the partial derivative of the loss function with respect to the parameters, you basically multiply the derivative of the final loss by these weights: when you perform the chain rule with matrices, you basically multiply the gradient by the input.
So Given that those values are very very tiny you have a tiny signal at the end Multiply by something of which eigenvalues are all small small right towards zero So you have some signal here. It is basically Dying through the network. So all these guys down there. They will never be trained You know what? Where is your input coming from the bottom? And where do you send your input through? trash because it's noisy Noisy initialized, right? So you have your very nice input and you just destroy all the statistics You send in it through a random crap and you know something starts changing here in the upper part But those things really never change when you apply this kind of bypass connection receive a connection or receive Receive by pass or shortcut connection three way three names for the same shit Sorry, so in this case you have the gradient coming down and then hope it goes down here So this guy is going to be using the same gradient that this guy is observing in order to optimize its own parameters So these guys are absolutely essential if you start doing You know networks that are kind of deep Otherwise you have some problem that is called vanishing gradient those those gradients basically given that are multiplied by Metrices with small eigenvalues. They vanish down or even worse if those weights are getting you know You try okay, I don't want to lose my my my gradient I increase those weights those weights are very nice. They are large and you have like eigenvalues They are larger than one and then boom You have exploding Gradients, so it's very kind of finicky thing. So if you just add those connections everything will just work fine Also, that's why we use this one sort of speaking. This is so good If you are familiar with like fiber optics sometimes you have those repeaters in the fiber optics Are you familiar with fiber optics optics? No, are you familiar with transistors? Yeah, all right. So why are you using a high impedance? Transistor after another transistor You don't want the transistor that is on the afterwards to load your first one. Otherwise, you just drag everything down Right, so this guy here are basically some kind of impedance matching and they are separating every block So this block doesn't load my previous block and so we put everywhere I put everywhere in those batch normalization which allow my network to be like independently training each separate block without having this kind of Loading side effect. Okay, so these two things are really really Essential I could have a lesson just on those two, but I just put them there because you have to use them I mean, I reckon I highly recommend last last summer school. I thought two weeks ago There was just one guy who apply this guy here and he won the challenge of the summer school He won like the prize. So I was like yay I think I think it was a book from a machine learning stuff. It's very nice 40 bucks for the dollars All right, so This is what that was it for the introduction to convolutional neural networks right now We are gonna see this exercise if everyone is okay And you don't have for their questions where we're gonna be playing a little bit with the notebooks and see how they perform. 
All right, so that was it for the introduction to convolutional neural networks. Right now we are going to see this exercise, if everyone is okay and you don't have further questions, where we are going to be playing a little bit with the notebooks and see how they perform. Okay, so let me introduce my colleague.

[Second speaker] All right, so Alfredo gave a nice overview of fully connected and convolutional neural nets. Basically we compared convolutional and fully connected neural networks, and how, in the fully connected case, every neuron from the previous layer has to be connected to every neuron in the layer after it. In convolutional neural networks we take advantage of locality, which is one of the three properties of images (locality, stationarity and compositionality), to reduce the number of edges and connections, and we have the receptive field. So I'm just repeating a little bit what he said; right now let's look at a specific example and get hands-on.

You can follow along on your laptops, or, for those who have succeeded with the setup of JupyterLab, you can just go into JupyterLab. So I'm going to start a terminal window. Here I have my exercises in the workspace, and I did have to do a git pull to get the latest changes from yesterday, as we added one more notebook. The notebook we're going to need is notebook number five. If you made some changes in the notebooks, which you most likely did, because as you run a notebook it saves the outputs, then when you try to do git pull it won't let you; so you can do git reset --hard and then git pull, or you can just follow along as a demo.

Let me ask first: how many of you have used any deep learning frameworks in the past, not counting this school? Keras, TensorFlow, PyTorch? How many of you have used PyTorch? Just a few. How many of you have used Keras? Yeah, so most of the people who have used deep learning frameworks have used Keras. For this exercise you're going to use PyTorch, but I'm going to try to make connections to Keras.

In the first cell there is nothing special: we just import torch. Then we import the nn module, which is what we're going to need to construct our neural network. The optim module is for optimizers; this is where all the optimizers live, like Adam, RMSprop, SGD. We're going to need the torchvision datasets to import MNIST, which is what is going to be used for this exercise. And of course matplotlib for plotting and numpy, since as we are going to be plotting we'll be converting tensors back to numpy. And torch.nn.functional is going to be used for some of the activation functions, as you will see.

All right, so the first step is simple: let's just load the dataset. This is just a copy of what's recommended in the PyTorch MNIST tutorial. You can see the familiar steps here where we load the data: we physically download it to a folder, we normalise it, we shuffle it, and we batch it, assembling it into batch-sized tensors. Strictly speaking, the shuffling is not mandatory for the test set.
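For reference, here is a hedged sketch of what such an MNIST loading cell typically looks like; the folder name, batch sizes and normalisation constants are placeholders, not necessarily the values used in the notebook.

```python
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),  # commonly quoted MNIST mean/std
])

train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=True, download=True, transform=transform),
    batch_size=64, shuffle=True)                 # shuffle the training set

test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=False, transform=transform),
    batch_size=1000, shuffle=False)              # shuffling the test set is optional
```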
Let's take a look at a few images: those are familiar to you, the handwritten digits from the MNIST dataset. Nothing special here; those are 28 by 28 pixel images.

So, the purpose of this exercise: we are going to construct a fully connected and a convolutional neural network, we want to make sure that they have roughly (almost exactly) the same number of parameters, we will see how well they do on this MNIST dataset, and then we will try to break some of the assumptions that Alfredo was talking about, in particular locality, by scrambling the pixels, shuffling them into some random order, and see how well they do after that.

There are, strictly speaking, two APIs to construct a neural network in PyTorch. One of them is very similar to what we use in Keras: in Keras we have two APIs, sequential and functional, but in both of them we put the layers one by one like Lego blocks. That's what I have here: I'm defining the whole logic of the neural network as one sequence, and this is for the fully connected network, which will consist of two linear layers, each followed by a ReLU non-linearity, with one more layer followed by the softmax output activation, which is what we need for handwritten digit classification. It will output a vector of floating point numbers, of which we take the maximum probability and assign the label accordingly.

In PyTorch, the other way is a little bit lower-level than Keras: we typically prepare a class derived from nn.Module, which is the base class for all the neural networks you write in PyTorch. Here you override the constructor, invoking the base class constructor through super(), and you add additional members which are not present in the base class, in this case the input size and the logic of the network. Finally, we need to override the forward method, which is the logic of our forward pass. As we train, we have the forward pass, which in this case is going to be a bunch of affine transformations followed by non-linearities, and there is also a backward pass, the chain-rule differentiation with partial derivatives, which comes for free to us in PyTorch; but we still have to call backward on the loss, as you will see later in the training loop.

So here the forward pass is super simple: we just take those square 28-by-28 images and, to be able to feed them to the fully connected network, we have to flatten them. view in PyTorch, as Alfredo introduced yesterday, is essentially like a reshape in NumPy, but it doesn't make a copy; that's why it's called view. We flatten the image to the input size, and we have to be sure that the input size we pass in is actually 28 by 28.
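Here is a hedged sketch of what such a fully connected model might look like as an nn.Module subclass; the hidden sizes and the use of LogSoftmax (which pairs with the negative log-likelihood loss used later) are my placeholders, not necessarily the notebook's exact code.

```python
import torch.nn as nn

class FC2Layer(nn.Module):
    """A small MLP with two hidden layers for flattened 28x28 images."""
    def __init__(self, input_size: int, n_hidden: int, output_size: int):
        super().__init__()
        self.input_size = input_size
        self.network = nn.Sequential(
            nn.Linear(input_size, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden),   nn.ReLU(),
            nn.Linear(n_hidden, output_size),
            nn.LogSoftmax(dim=1),            # log-probabilities for the NLL loss
        )

    def forward(self, x):
        x = x.view(-1, self.input_size)      # flatten 28x28 -> 784 without copying
        return self.network(x)
```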
All right, and next is the convnet. Here, on purpose, I use a slightly different approach: I didn't put it all in one sequential model, but broke it down, creating a member for each layer. We will have two convolutional layers, Conv2d layers, followed by two linear layers. This is actually exactly what Alfredo had in that picture, read from bottom to top, except he was calling the blocks D, which I'm guessing comes from some generative adversarial setup with a discriminator and a generator. And don't pay attention to this "multiply by four": the goal here was just to tune the number of parameters so that the fully connected network and the convolutional one match, to prove our point.

If you define the network like that, as opposed to sequential, you have a more interesting-looking forward method, which you mandatorily have to override in each network you write. It basically goes from the outermost to the innermost layer: first the convolution, then ReLU, then pooling, then convolution, pooling, flatten, fully connected, ReLU, fully connected, softmax.

All right, so then we just hit shift-enter. The thing is, in PyTorch, similarly to Keras, we are actually writing pretty much device-agnostic code: if you have a GPU, it will pick it up. There might be a difference in dependencies; Keras depends on TensorFlow, so to run on the GPU you have to pip install tensorflow-gpu and make sure the CUDA drivers are in place. But if you don't have a GPU you just run on the CPU, and you don't have to make any changes to your code. Similarly, if you define a device string like that, it will switch to CUDA if you have a GPU; if it's not available, it will just run on the CPU, and again you don't have to make any changes.
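And a hedged sketch of the convolutional counterpart plus the device-agnostic setup: the class and argument names are illustrative placeholders, and in the real notebook n_features is tuned so that the parameter counts match the fully connected model. With 5×5 kernels and two 2×2 poolings, a 28×28 input comes out as 4×4 before the flatten.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNN(nn.Module):
    """Two Conv2d layers followed by two Linear layers, for 28x28 MNIST digits."""
    def __init__(self, n_features: int, output_size: int):
        super().__init__()
        self.conv1 = nn.Conv2d(1, n_features, kernel_size=5)           # 1 input channel (grayscale)
        self.conv2 = nn.Conv2d(n_features, n_features, kernel_size=5)  # keeps the channel count
        self.fc1 = nn.Linear(n_features * 4 * 4, 50)                   # 4x4 is what remains spatially
        self.fc2 = nn.Linear(50, output_size)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)   # conv -> relu -> pool: 28 -> 24 -> 12
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)   # conv -> relu -> pool: 12 -> 8 -> 4
        x = x.view(x.size(0), -1)                    # flatten
        x = F.relu(self.fc1(x))
        return F.log_softmax(self.fc2(x), dim=1)     # log-probabilities for the NLL loss

# device-agnostic: use the GPU when available, otherwise the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```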
So these kernels are 25 Values and I have As many as this guy how many how many is that? I don't know Maybe if this is three, I'm gonna have three kernels of 25 elements. Okay, so the second one We are gonna be Having we start from one layer one one layer thickness input image and we end up now with We set three three layers, right? So afterwards we are gonna be using convolutional kernels, which are gonna be going from three Thickness to again three So this means that my kernel in this case here was five by five in this kernel This case here is five by five by and three and how many how you have this many so this is three Okay, if this is a different number Let's say this is four. So you have Five by five times three because this was the size of this guy and then I have as many as this guy So I have four five by five by three Okay, does it make sense? No, you can say no, I can try again I see what you're saying. So So the three different colors have been processed separate. Oh here. We just have one there's only one color It's a grayscale image stores. There's one input channel But if that would be three, I would just process them all together So if this would have been three this would be a would have been five by five by three and then I have this three dimensional tensor Moving around and performing convolution around the image Okay, yes other questions See this light projector. I can't even see anything Yeah, so the the channels control the depth of the tensor and The kernel will be in the x y plane All right, so let's Go, let's move on and look at the training law. So, you know in Keras Keras is a high-level Deploying framework and we usually work with something called estimator estimator it's basically something you define and then you just called dot feed on it if you want to train it or if you have a for instance if you Want to pass it on a general if you want to train it if your data is a generator Python generator you can do call dot feed underscore generator and then once you train it saved weights you want to Run a prediction step So actually if you do the have a validation step at the end of each epoch you would call dot evaluate And for prediction if you want to get the actual predicted labels rather than just the metrics like accuracy and loss You would call dot predict So this is all nice and simple Until you want to do something custom When you when you want to do something something custom it's very nice to have access to actual training for loop and That's why actually by torch we don't have estimators Or at least not yet, and we have to write our own training and testing loops So here is the training loop for convenience. We put it in the function So here this term of it basically it's the permutation Which is something that we're going to be using for the exercise to scramble the pixels by default it's a Range so it does no permutation So here as you can see so Steps as follows we zero the grad the grad so basically as we run the training steps For We're going to accumulate the gradients for each weight and we have to zero those gradients before the next Optimizer step. Otherwise it will Calculate incorrect values So then we get the output of the model, which is the essentially out what the forward pass gives us and here We can define the loss Outside of the training loop, but this is just fine for this exercise. 
We use the negative log-likelihood loss and give it the target and the output values of the forward pass. Once we have that, we call the backward pass on the loss and complete our optimizer step, which is just a weight update: once the backward pass has completed we have the gradients, and in optimizer.step() we essentially apply the gradients to the weights to get the new weights. All right, questions?

You apply the gradient, or rather the negative of the gradient with some sort of step-size factor, to the current weights to get the new weights? Yeah, in the simplest case. In the simplest case of stochastic gradient descent we would just have the weights at iteration i+1 equal to the weights at iteration i minus the learning rate times the gradient, w(i+1) = w(i) - lambda * grad L(w(i)). If you use SGD with momentum there would be an extra term, or with an optimizer like Adam the learning rate is adapted per parameter to improve convergence, but in the simplest case that weight update is exactly what happens here at optimizer.step().

Sorry? The permutation? In this case it does nothing, because by default it's the identity range, which means we keep the pixels in the same order as they come. This is actually for our second exercise, where we use a random permutation. I will show the pictures, so there is not going to be any magic involved; I will show the pictures.

Okay, so the test function is essentially the same. One thing that differs between the training and test steps is that we won't have a backward pass, right? So in fact you can put the model in evaluation mode with model.eval() here and switch off the gradient tracking, since no gradients are required. Alfredo skipped over the details of automatic differentiation in the lecture yesterday, but basically, yeah, we can switch off the gradients on the variables. So here again we do the for-loop over the testing data, we do the forward pass, we calculate the loss and accumulate it. In this particular case we have a softmax output activation, so in NumPy that would look like doing np.argmax over the output of the softmax: you want to pick the index of the vector that corresponds to the maximum probability. Then we count the correct predictions; we have the correct values, which are basically the handwritten digits 1, 2, 3, 4, 5, 6, and pred.eq(), which we could substitute with just an equality sign; I will show how to do that later. This last bit is just to get the value back from the tensor to an actual floating-point scalar, so it looks a little bit ugly, but we have to do it. All right, I'm going to hit Shift-Enter.

So finally, before we go and train the small fully connected network: here we have already defined FC2Layer, and we call its constructor with the input size, the number of hidden units, and the output size (sorry, the number of hidden layers is fixed, of course). We define the optimizer as the simplest one, stochastic gradient descent; as before, we pass the model parameters to it and set the learning rate and the momentum. And that's all. We make sure we take note of the number of parameters, to compare it to what we are going to use in the convolutional network, and then we just train. Whoops, and I'm running out of time, I don't know how it happened. Oh, okay, so this is the time for the break. I was like, that was too fast.
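For reference, a minimal sketch of the test loop just described, again not the exact notebook code. It assumes a test_loader, plus the model, device, and perm convention from the sketches above, and uses argmax and eq the same way the np.argmax analogy describes.

```python
import torch
import torch.nn.functional as F

def test(model, test_loader, perm=torch.arange(784)):
    model.eval()                               # evaluation mode
    test_loss, correct = 0.0, 0
    with torch.no_grad():                      # no backward pass needed, so skip gradient tracking
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            data = data.view(-1, 28 * 28)[:, perm].view(-1, 1, 28, 28)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()
            pred = output.argmax(dim=1)        # index of the maximum (log-)probability
            correct += pred.eq(target).sum().item()   # could also write (pred == target).sum()
    test_loss /= len(test_loader.dataset)
    accuracy = 100.0 * correct / len(test_loader.dataset)
    print(f'Test loss: {test_loss:.4f}, accuracy: {accuracy:.1f}%')
```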
All right, so it's running. The value printed at the output is the number of samples processed so far, and we also take note of the training loss; as long as the training loss goes down we know we're doing okay. And I was expecting it to be faster, but hey.

A nice check to do is actually to look at this first number; you always have to check what this number is. Can you raise e, the Napier number, to this one? Yeah, can someone use a calculator to compute e to the power of this guy? You tell me. Okay, about eleven, right? So basically what you expect this first number to be is the natural logarithm of the number of classes you are training on. So if this number is anything far away from the natural logarithm of 10, which is about 2.3, then you must be kind of worried: something is screwed up. This is due to the fact that we are using the logarithm. Remember, from yesterday, we were checking the log of the correct class, and given that at the beginning all the classes have equal probability, because the network is not yet trained, you expect the correct class to get one tenth. Therefore the loss is minus the natural log of one tenth, which is the same as the log of ten. So this first number should match that; if it doesn't, you should be kind of worried, okay?

And, yeah, so the training for the fully connected network finished; we got 96 percent training accuracy and 86 percent test accuracy. Before we go, let's just start the training for the ConvNet and go to the coffee break. So here we basically have to do the same steps: we construct the model using the class we already defined above, and we use the same optimizer with the same learning rate and momentum. The number of features here is adjusted, again, to guarantee that we have a similar number of trainable parameters: see, the number of parameters is 6,422 here, and in the previous case it was 6,442, which is almost exactly the same.

And one more thing. We mentioned here that the number of features is six, so just keep the numbers in mind: we have six kernels of one times five times five, so a collection of six 1 by 5 by 5 kernels for the first convolutional layer; this is conv1. Then we have conv2, which has, how many, still six, right? So it's going to be six kernels, and since conv1 was going from one channel to six, each of these kernels is going to be six times five times five: six, because it has to match the thickness coming out of conv1, just as the one in conv1's kernels had to match the single input channel. Okay?

So do these six kernels, say for conv1, move together? Let's look at the picture. You have the initial picture here, and you have this guy here, and this guy moves across this direction and this direction, and it gives you the first feature map here. Then you have the second kernel, so you're going to have the second one, and so on, how many, six, so you go up to the sixth guy here.
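Before finishing the answer about how the kernels move, two things from this passage can be checked directly in code: the expected initial loss, and the kernel and feature-map shapes for the two convolutional layers. The 28 by 28 dummy input and the n_feature value of six are taken from the setup above.

```python
import math
import torch
import torch.nn as nn

# Initial-loss sanity check: an untrained 10-class classifier gives each class
# probability ~1/10, so the negative log-likelihood should start near -log(1/10) = log(10).
print(math.log(10))            # ~2.30

# Kernel shapes for the two conv layers described above (six features each).
conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
conv2 = nn.Conv2d(in_channels=6, out_channels=6, kernel_size=5)
print(conv1.weight.shape)      # torch.Size([6, 1, 5, 5]): six 1x5x5 kernels
print(conv2.weight.shape)      # torch.Size([6, 6, 5, 5]): six 6x5x5 kernels

# Each kernel slides over the image and produces one feature map.
x = torch.randn(1, 1, 28, 28)  # a dummy grayscale image (batch of 1)
print(conv1(x).shape)          # torch.Size([1, 6, 24, 24]): one feature map per kernel
```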
So you go from the input, my x, to my first hidden representation h1 here, and then you are going to be using these kernels here in order to convolve this guy and get the second one.

What forces them to be different from each other? In other words, what prevents those six kernels from all looking for the same feature? The constraint is automatically induced by the loss function: whenever you have one kernel finding a specific feature, the others will try to find other features. Because, as I said before, as soon as one kernel resonates with one specific kind of feature, it is going to strengthen its own response to that particular feature. So if two kernels were initialized the same way they might learn the same stuff, but at the beginning we have random initialization, so it is very likely that each kernel is going to end up resonating with a different kind of feature, and therefore optimizing for a specific target.

So if there are no other questions, I think we can take a small break, get a coffee, and try to run the exercises some more on your machine if you didn't manage to follow along, and we come back here at, right, 3:30.

Just before we go to the break: the convolutional neural network finished training, and as you can see we got 96% accuracy on the training set and 94% on the test set, which is out of sample. As a result, we can say that the convolutional neural network performed significantly better on this type of task. And let's see how well it does if we basically remove the locality from our assumptions; let's continue after the break. No, no, actually, no need, I will say something else right away. I actually...
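As a pointer to what comes after the break, here is a minimal sketch of how the "remove locality" experiment can be set up under the perm convention from the training and test sketches above: a single fixed random permutation scrambles the pixel positions of every image in the same way, destroying locality while keeping all the pixel values. The names and the 28 by 28 shape are assumptions matching the setup used earlier.

```python
import torch

# One fixed random permutation of the 784 pixel positions, shared by all images.
perm = torch.randperm(784)

# Scrambling a (batch of) 28x28 image(s): flatten, permute the pixels, reshape back.
x = torch.randn(1, 1, 28, 28)
x_scrambled = x.view(-1, 28 * 28)[:, perm].view(-1, 1, 28, 28)

# With the train/test sketches above, the same scrambling is applied consistently:
# train(model, optimizer, train_loader, perm=perm)
# test(model, test_loader, perm=perm)
```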