Actually, before you disappear: I wanted to wish a happy birthday to our TAs. They were born on the same day. All right, so good luck with your presentation on the other side, and I'll keep the audience here. Enjoy it. Okay, great. Take care everyone. Bye bye.

Okay, so back to us. I'll basically restart from the first slide that we covered today, so let me share the screen. So yeah, both TAs have the same birthday. Maybe there was a birthday criterion for selecting them, so we don't have to keep too many things in mind. Just kidding. It's not the first time, either: in my previous lab, four of us shared two birthdays — my birthday was the same as a co-worker's, and my previous advisor's birthday was the same as another co-worker's. So this kind of thing is more common than you think.

All right, so, getting back to the actual lecture topic, class topic, whatever. What do we talk about today? We're going to be talking about, again, deep learning, but we start with convolutional networks. Why do I talk about convolutional nets? Because last week, in the last lesson with Yann, we started with recurrent neural networks, and I was thinking of restarting there and telling you more about recurrent nets today. I think I'm going to push that to tomorrow, since what Yann talked about today is really, really connected to what I'm going to be digging into further in this lab. So I think it makes more sense for you to hear more about the same topic, so we don't get too confused.

So, convolutional neural nets: what do they do? They basically exploit the stationarity, locality, and compositionality of natural data. This is what the first slide of today's lesson with Yann was about, and he went perhaps a little bit fast over these three concepts.
We are going to be building some deeper intuition and understanding of what they actually mean. Okay, so, can I click here? Okay, fantastic. So: input layer, or input samples. What are these? What kind of input data can be used for a convolutional net — what kind of data is effective to feed to a convolutional neural net?

Usually, until now, we have seen that our data points are those pink, bold x vectors. When they are bold — this guy over here — it means it's a vector. So I have that the i-th vector x_i belongs to R^n: it's a column vector of n components, and it's bold because it's a vector. The curly brackets mean a set: the set of all the x_i's of n components, such that each x_i is a data sample, is going to be my curly X. So my curly X is the collection of training samples that I have access to. In this case there are m of them: my training data set has m samples, from 1 to m. Each of these vectors can be thought of as an arrow going from zero, or you can also think of it as one point in this n-dimensional space.

But we can also think about some other type of data, which is a little bit different from this. So, instead, my curly X can now be the set of these x_i's, which again are these kinds of items, but the x_i's are functions now. These functions go from this capital Omega here to R^c: Omega is the domain, and R^c is the image, or co-domain. Basically, these functions map my lowercase omega — which belongs to capital Omega — to a lowercase bold x, which belongs to R^c.
And of course this is a function of omega, and how many functions do I have? m functions, with i going from 1 to m. So what is this Omega? Omega is called the domain — it's my signal domain. This x over here is called a signal, and the signal has a specific domain, Omega. R^c instead gives the number of channels — C stands for channel. So the signal x_i maps my domain Omega to the channel values in R^c.

Okay, so, first question: what is this Omega? Why do we need this? We need this for the following reason. If you try to just treat an image as one data point — let's say you have a one-megapixel grayscale image — this means we are in a one-million-dimensional space, and you have one data point in this one-million-dimensional space. As I told you before, another data point is going to be very close to that one, and a third data point is going to be very, very close to that one, so everything is very crammed together, and it's really hard — I would even say impossible — to train a fully connected network (also called a multi-layer perceptron, but we don't like that word) to perform anything on this type of data. It's way too crazy to operate in this million-dimensional space, because, as I told you, we have to go into a higher-dimensional intermediate representation — and if you start with a million dimensions, where do you go? It's not feasible in terms of computation, and also in terms of the number of samples required to train the system, because the number of samples needed increases proportionally with the number of dimensions you're dealing with. So I'm going to get there.
I'm answering this question in a second. Okay, so instead of thinking about images just as data points, we want to think about an image as a function — which, in this kind of context, we are going to call a signal. A signal gives us some information on top of a domain: there is a domain, and on top of this domain the signal gives us some information in the channels — I call it descriptive information. For every location omega in the domain, the signal tells us the specific, characteristic information of that location of our domain.

So, what can this domain be? Let's go through a few examples. Omega could be the ordered set {1, 2, 3, ..., T/δt}. Say I have a sequence that lasts three seconds, and I have a δt of 0.1 s, or 100 milliseconds, so the sampling frequency is 10 Hz. Then, overall, this set goes from 1 up to 30. I start from one because that's how I count in mathematics: one, two, three. If we do Python or C programming, we count from zero. So my Omega here has discrete values: it's not 1.2 or 1.4 — it's only 1, 2, 3. It's an index; you cannot go in between. It exists at discrete intervals, and I have a finite number of elements.

So this is a first type of signal. Capital T is the total length of the signal, which could be the three seconds I was mentioning, over which we perhaps do some prediction with a predictive model, I don't know.
This is something we're going to see in the future. Or it could be an audio signal: I want to listen to three minutes of sound, and then δt could be, say, 1/(22 kHz). That's a possible sampling frequency — actually, I think now we use 44.1 kHz as the sampling frequency for audio — and δt is the sampling interval, one over the sampling frequency.

So we figured out what this Omega is: Omega is the discrete index with which we can index our signal. We have x(ω), and the lowercase ω tells us a location. This can also be thought of as a chain — one item after another — or a one-dimensional grid.

What is C in this case? If we are talking about audio, we can have the following. If C is one, can you guess what an audio signal with only one channel is called? You should be typing in the chat. Mono — yeah, that's correct, this is a mono signal. Instead, if I have two channels per sample, what is that called? Stereo — someone else can answer too. Yeah, that's definitely stereo, and that's a different person answering. Can someone guess what the 5 + 1 is? Can you explain it in English, so I can tell whether you are following? Five channels plus one subwoofer, right? Yeah, one subwoofer, that's correct. That's basically Dolby 5.1.

All right. So these types of signals are said to be one-dimensional, because the domain has only one direction. That's called a one-dimensional signal — 1D, not 2D.
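To make this concrete, here is a minimal NumPy sketch of how these 1D signals could be laid out as arrays — time along the first axis, channels along the second. The sizes follow the 3-second, 10 Hz example above; the layout convention itself is just one common choice, not something fixed by the lecture.

```python
import numpy as np

# A 3-second signal sampled at 10 Hz: the domain Omega = {1, ..., 30}.
T, dt = 3.0, 0.1
n_samples = round(T / dt)          # 30 discrete locations (avoid int() float issues)

# The number of channels C distinguishes mono, stereo, and 5.1 audio.
mono   = np.zeros((n_samples, 1))  # C = 1: mono
stereo = np.zeros((n_samples, 2))  # C = 2: stereo
dolby  = np.zeros((n_samples, 6))  # C = 5 + 1: five channels plus one subwoofer

# Each discrete location k carries a vector in R^C: the signal's value x[k] there.
print(mono.shape, stereo.shape, dolby.shape)   # (30, 1) (30, 2) (30, 6)
```

Indexing a single location, e.g. `stereo[4]`, returns the 2-component vector x[k] for that sample — exactly the "function from the domain to R^c" view of a signal.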
So, moving on, we can have this other type of Omega — a different type of domain. In this case, we have the Cartesian product of two discrete, ordered sets: {1, 2, 3, ..., H} times {1, 2, ..., W}. As you can guess, H stands for height and W stands for width. So in this case we don't have the one-dimensional chain; we have a 2D grid — a classical grid — and these are images.

So let's see what C can be here. Again, number one: C = 1. What is it? Someone who hasn't talked so far, please talk back to me. I know it's early. Grayscale — okay, very good. Someone spells it with an 'a', someone with an 'e'; I prefer the British spelling. Just kidding.

It's not black and white. It would be black and white if the co-domain were actually binary: zero or one, so you would have {0, 1}^1. Here I've written R, and R means all continuous values, so in this case it's grayscale. If we replaced this R with the set {0, 1}, then you would have a black-and-white image — black and white meaning the salt-and-pepper binary image, a bitmap. So grayscale is correct.

Number three: C = 3. RGB, correct. And then C = 20 is something Yann also talked about today. Don't cheat and check the slides before following the class: the guessing part is part of the learning, so if you're not cheating by looking at the slides that are already online, you actually absorb the material better. So these are the hyperspectral images. Oh, sorry — I was skipping this one.
Okay, let me click further. These are the hyperspectral images. And these 20 bands are not just a set — they are actually ordered (Yann called them a set before, but they're ordered). The channels in hyperspectral images are usually sorted by wavelength λ, such that you can move from one end of the spectrum to the other, and they go outside the visual field, so you can capture all types of spectral radiation. These are very useful, for example, in satellite imaging, to identify regions of desertification or potential fires, because each specific type of soil or ground will have a very particular spectral footprint, which is not that evident when you just use the color information. Given that we have sensors whose input range is much larger than that of our retina — our visual system — we can leverage that larger spectrum to capture more meaningful radiation patterns, or even the albedo; the albedo is the reflection of the light from the Sun.

Okay, the color image — I actually skipped the slide in between, so let me give you a slightly more precise understanding, or explanation, of what's going on. My bold x — this character over here; bold means it's a vector, it has a vectorial output, in this case three components — is a function of the location (w1, w2). w1 and w2 can only take discrete values: w1 takes values in {1, 2, 3, ..., H}, and w2 takes values in {1, 2, ..., W}. And note that the width W and the lowercase ω are two different letters, okay?
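The RGB case can be sketched in a few lines of NumPy — a toy array with made-up H and W, just to show that indexing a location of the 2D domain yields a 3-component vector, and that each single channel is itself a grayscale image over the same grid.

```python
import numpy as np

# An RGB image: the domain is the 2D grid {1..H} x {1..W}, the co-domain is R^3.
H, W = 4, 6
img = np.random.rand(H, W, 3)      # x(w1, w2) is a 3-component vector

# Indexing one location of the domain returns the (R, G, B) vector there.
pixel = img[0, 0]                  # shape (3,)

# Each single channel is a scalar field over the same grid: a grayscale image.
red = img[..., 0]                  # shape (H, W)
print(pixel.shape, red.shape)      # (3,) (4, 6)
```

A hyperspectral image would be the same structure with 20 (ordered) channels instead of 3: `img.shape == (H, W, 20)`.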
I need to put this curly bracket a bit earlier, otherwise you don't understand what's going on. All right, so (w1, w2) just specifies a location in this 2D matrix, and given that location, you're going to have a vector of three components, specifying the value of the R channel, the G channel, and the B channel. The R channel on its own is going to be a scalar field — a function whose co-domain is one-dimensional. So R by itself is basically a grayscale image; it has one channel. G is the same, and B is also one channel. But if you put them together, you can have the purple and the pinkish colors, like here on my background. Cool.

So we have one more, which is this one: in this case the domain is R^4 — a space you cannot even draw or easily picture. What are these? Do we have physicists in class? These are simply spacetime, and the other one is the four-momentum. So here we are talking about physics, and we can use convolutional networks to play with physics: we can try, for example, to predict, let's say, the Hamiltonian of the system. Okay, so we can do fancy things here. This is the type of data we can use convolutional networks on to get actual stuff to work.

So let's start simple, and let's see how this first signal works. This signal I sometimes also write as a lowercase x with square brackets, x[k], where k is my indexing variable — which would have been the ω in the previous slide. And again, this can also be considered as one big, long vector of specific items happening at specific locations.
And so there is no intermediate value. Okay, so we mentioned that we can apply convolutional nets if the data has some specific properties, and we are going to be covering these right now.

Once again: the 1D type of signals are those in which the domain — the Omega — goes in just one direction. This could be, for example, the waveform in the first example here. In the second case we have a grid, because the domain goes in two directions: still discrete values, but in two directions, so it's a two-dimensional type of signal — an image. And the third one is a one-dimensional signal where each item in the sequence corresponds to one of these x1, x2, x3, which are very long vectors — let's say each of x1, x2, and x3 is a 10,000-dimensional vector, representing the index in a dictionary, representing which specific word happened to be there. So the word 'John', say, sits at some index in a given dictionary, and the collection of these indices can also be thought of as a signal over a one-dimensional domain.

Okay, so let's focus now on the first signal — the one whose domain goes in one direction, represented here with a waveform. Let's zoom in a little bit. In the first part, I can see this specific pattern happening in the early part of the signal. Later on, in the same signal, I will also see the same type of pattern. So this is one characteristic of these natural signals: similar patterns happen over and over again — the same type of pattern appearing in different locations of the signal.

Moreover, let's zoom in a little more. There we go. Question for people at home: if I see a peak on the left-hand side, how likely is it that I see another peak coming up
quickly? Or, to quantify this better: how likely is it that a peak happens very quickly after a first peak, rather than further away, or even further? What is the likelihood that, given a peak on the left-hand side, there is another peak right there, or much further away?

We can reason like this. If I say a word, the word is made of sounds that attach together — these are called phonemes. If I want to say a specific word, like the word 'word', it has one type of structure, and if I say it multiple times, it will always have its information condensed in that specific temporal interval. The interval can be a bit larger or smaller depending on how fast I say the word, but, as you can figure, I said this specific word so many times that it appears several times over the whole temporal domain. And the likelihood that you find a specific phoneme right after another phoneme — very close together in time — is very high. But whatever I say somewhere else, later on, is quite decorrelated from the fact that I have this specific pattern happening over here. Nevertheless, you observe the same pattern over and over again, and there is some type of local structure in the signal.

So there are two things here. What I was trying to say is, again: the information has a strong correlation within local regions, and then it happens again and again. This does not only happen for audio signals, as I just showed you, but also for the second type, the two-dimensional one. So let's look at this cute little kitty here, and let's zoom into the central region — and then you have a monster, it's huge. Okay, all right. In this type of signal, you can also see the discrete locations: these are called pixels. In the other case, with the audio,
they are called samples, but they're the same thing. In 3D types of data, you're going to hear them called voxels. All these different names just say that there is a unit element of the domain, with a value associated to it.

So, the first point: let's look at this crevice, or whatever this type of pattern is, that we have next to the eye. We can observe that this pattern is similar to this other one that happens further away. Again and again, we can see that similar patterns recur across the domain — in this case a two-dimensional domain — over and over. So we have this property, which I'm going to call stationarity: the same patterns happen again and again over the domain.

Nevertheless, the other property we were mentioning is the following: how likely is it that, given that the center of the pupil is black, you're going to find another pixel nearby that is also black? I would say it's quite likely: you have a high probability — a high likelihood — of finding another black pixel within the pupil, in the hole of the eye, because all the pixels within the pupil are not reflecting the light. Otherwise you wouldn't see anything, right? It's a hole; the light goes inside the hole, and you don't see anything coming back, because there is a chamber behind. But if you move further away from the pupil — from the hole — to the iris, then you're actually going to see some color next to the pupil. So it's less likely that the iris is also black — although, if you have black irises, you could still have a high probability of finding the same color. But now let's take a step even further.
Once you're completely outside the ocular area, the likelihood of finding another black pixel — unless you have a black cat — is rather low. And the further away you go, the less likely you are to find a similar value. So, again: the information in the signal has this local property — it has locality — and then similar types of patterns happen again and again over the domain — that's the stationarity.

So we need the third property. Let's see who remembers it. I don't know if you had breakfast or not, but here I'm going to show you some donuts. There are many, many donuts: there are chocolate ones, ones with strawberries, some with — what's it called — blueberry, and so on.

Okay, so, an exercise for those of you in class who wear glasses. You can raise your hand so I can actually see. How many of you wear glasses? I see a few people. Okay, six people. What, does everyone else have perfect sight? No? Okay, ten people. Okay, twelve, fantastic. Now, those of you who are wearing glasses: please take off your glasses and look at the screen once again. You can also react with that — usually this works very nicely in class. Among the people who don't wear glasses: do you know what's going on here? No? Okay, very good. So, someone who cannot see, please explain what's going on here. If you don't wear glasses, squint your eyes — just squint a little bit. Yes — that's Marilyn Monroe. You squinted, very good.

Okay, okay. So this is the third part.
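Before we move on to the third property, the locality argument above can be made quantitative with a tiny NumPy experiment. This is not from the slides — just an illustrative sketch: smoothed white noise stands in for a natural 1D signal, and we check that the correlation between a sample and its neighbors is high at small distances and vanishes at large ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Smoothed white noise: a toy stand-in for a natural 1D signal with local structure.
noise = rng.standard_normal(5000)
x = np.convolve(noise, np.ones(5) / 5, mode="valid")   # 5-sample moving average

def corr_at_lag(x, lag):
    """Correlation between the signal and a copy of itself shifted by `lag`."""
    return np.corrcoef(x[:-lag], x[lag:])[0, 1]

# Nearby samples are strongly correlated (locality); far-away ones are not.
print(corr_at_lag(x, 1))    # high, close to 1
print(corr_at_lag(x, 50))   # near 0
```

The decay of this correlation with distance is exactly the reason a neuron only needs to look at a small neighborhood of the input.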
So this is compositionality: the property that, if you look very close by, you don't see any specific major structure, but as you start gathering and integrating the information, you obtain some structure — some meaning — out of these individual values. As we can see here, there is high local structure. Or step back to a distance — or just remove your glasses, if you're blind like myself. Here you can clearly see there are many chocolate... what's this stuff called? I forgot the name. Doughnuts? Are these doughnuts or bagels? Let's call them doughnuts; some look like doughnuts, some like bagels. Anyhow, you have chocolate ones.

Oh, someone sees the lips of the face — okay, very good. The lips of the face are the pinkish ones, the red ones: those are like strawberry doughnuts. The fact that you have a strawberry doughnut here determines, with a very high likelihood, that other strawberry doughnuts will follow. Similarly, here you have a brown chocolate doughnut, and many chocolate doughnuts nearby. But the further you go, the less likely you are to find similar doughnuts. And if you put all this information together — by integrating over the spatial domain, or just stepping further away and squinting your eyes — you can see how the overall local structure gives you an understanding of the entirety. So when you say "oh, this is the face of a lady", it's the togetherness of all these specific pieces of local information that gives you that cumulative, overall information. Okay, everyone can lower their hands.
You still have your hands up? The blood is going to drain down your arms. All right, all right.

Okay, so, moving forward. We have already seen how fully connected layers work, just last lesson, so I'm going to skip this part and go directly to — okay, maybe I can just show you the slide. This was how we compute these specific values; you remember this animation from last time. Each neuron here looks at every neuron in the layer before, and these weights are used to weigh the input neurons. So a given neuron here — the first one — gets the summation of all these guys, weighted by the coefficients stored in these connections, which are then stored in this matrix over here. And then we possibly apply a non-linear function. As you can tell, there are too many connections, and a huge amount of computation.

Given that we now know that natural signals have these specific properties — the three properties being typed in the chat; we've repeated them a few times — help remind me, what do we have? There are three properties; we've mentioned them like seven or ten times so far. Type them in the chat. Locality; the second one — the fact that things happen over and over again in the same manner — is called stationarity; and the third one, the one I showed you just now — the fact that the overall meaning comes from a hierarchical combination of individual stimuli — is compositionality.

So now we're going to combine, or exploit, these properties in order to introduce some inductive bias into the architecture, so that we can remove computation and speed up the convergence of the learning algorithm. And so, first, we're going to see how locality can induce sparsity — or how we can use sparsity, given that our signal is local. What does that mean?
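As a quick refresher on the fully connected layer we just revisited, here is a minimal NumPy sketch — the layer sizes, the tanh non-linearity, and the weight initialization are just illustrative choices, matching the "rotation plus squashing" picture used in this course.

```python
import numpy as np

rng = np.random.default_rng(0)

# A fully connected layer: every output neuron looks at every input neuron.
n_in, n_out = 5, 3
W = rng.standard_normal((n_out, n_in))   # one weight per connection: 15 in total
b = np.zeros(n_out)

def fully_connected(x):
    # Affine transformation ("rotation") followed by a non-linearity ("squashing").
    return np.tanh(W @ x + b)

x = rng.standard_normal(n_in)
h = fully_connected(x)
print(h.shape, W.size)   # (3,) 15 -- the connection count grows as n_in * n_out
```

The problem to notice: `W.size` scales as the product of the layer widths, which is exactly what sparsity and weight sharing will attack next.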
So, in this case, I have the pink layer — my input layer is going to be layer l−1 — then I have my layer l, and then the following one is going to be layer l+1. When we have the fully connected layer, each input is weighed by the coefficient stored in its connection. So here we have five connections, then ten connections — overall, we have 15 connections. And then on the other side, at layer l+1, we're going to have three more, right? So what's the total amount of connections? 18 connections. Sorry, I cannot count in English.

Okay, but now we know that our signal is local: the information is present only in a specific region of the domain. The domain here is one-dimensional, and it goes in this direction. So this neuron over here only cares about checking for patterns that happen within a small region. We don't really care what happens here or here, because the further away you go, the less relevant those neurons are — those pixels, or samples, or whatever you want to call them, voxels. So the first part is going to be: I'm going to use just three connections for the first neuron, and just look for specific patterns there. For the second neuron, I'm going to look at just three inputs over there; for the third neuron, the same. So here we have nine connections. On the right-hand side, we still have those three connections.
So in total, again, we have 12 rather than 18. If you increase the number of neurons, you're going to see how these numbers increase and how they compare.

As you move from the left-hand side to the right-hand side, you go up in the hierarchical view. These networks are usually drawn left to right, as in this case — although I should have drawn them bottom to top, since I actually have space here. Why do I draw them bottom to top? Because that's how you go up in the hierarchy: the network should point in the same direction in which you climb, from the base — where the input is — to the higher levels of the hierarchy.

And so I'm going to define something now, on this slide: RF, the receptive field. My receptive field here simply tells you how many neurons of the previous layer a given neuron can see. In this case, the receptive field of the output neuron with respect to the hidden layer is three. So, question for people at home: what is the receptive field of the hidden neurons with respect to the input neurons? Type it. Three, okay, that's good. Final question: what is the receptive field of the output neuron with respect to the input? Five, okay. I don't know if you're checking the slides in advance or just reasoning along with me — I hope you're reasoning with me.

And so, the more you go to the right, the more you have this global view. And similarly, as you could figure before — why five? Because you can count that this body here, this blue one, can see up to five pink inputs. There you go. Yes, thank you for answering. So again, in order to actually be able to, let's say, classify the previous image — the face of a lady — you need to have several layers in this case, right?
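The receptive-field arithmetic we just did by counting can be written as a tiny helper — a sketch, assuming a stack of 1D convolution-like layers with the usual formula (each layer of kernel size k grows the field by k−1 times the accumulated stride; the function name is mine, not from the slides).

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field of the last layer's neurons with respect to the input,
    for a stack of 1D convolution-like layers."""
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1   # jump = distance between adjacent outputs, in input units
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf

# Two layers with kernel size 3 and stride 1, as in the diagram:
print(receptive_field([3]))      # hidden w.r.t. input: 3
print(receptive_field([3, 3]))   # output w.r.t. input: 5
```

This is why stacking layers gives the "global view": each extra layer widens what a single neuron can see.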
You can't really do it with one layer. You can have several layers, and then you need to reduce the amount of information again, which we can do, for example, with pooling — we're going to see that in a second.

Okay, so: how is a neural network made? We have two things, we said. Remind me — we saw this last time — what is a neural network made of? Type it in the chat. "Linear layer" — yeah, but in my labs, what terms do I use? Squashing, okay, and the other one: rotation. So we have a rotation and a squashing. The rotation is, in proper jargon, an affine transformation, which is achieved by doing a matrix multiplication, plus the displacement — the offset.

And so, in this case, I just showed you that we start with a fully connected layer, which is a matrix, but I'm dropping some connections. Dropping connections simply means writing zeros in correspondence with all the connections that don't exist. In a coming lesson, I'm going to actually write down the math to show you how this works. But nevertheless: the convolution is still a matrix multiplication, a linear operation — just with a lot of zeros. When you put a lot of zeros inside a matrix, that's called a sparse matrix, and that's where this sparsity term comes into play. But again, it's still a matrix multiplication, still a rotation — and it still has to be a rotation in a high-dimensional space: these convolutions have to rotate your data, but each one just looks at a specific small region and expands it into multiple hidden representations. Just that small region — you cannot take the whole thing: too many points, too high-dimensional an input.
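The "convolution is a sparse matrix multiplication" point can be checked directly in NumPy — a small sketch with made-up sizes: we build the mostly-zero matrix whose rows are shifted copies of the kernel, and verify that multiplying by it equals sliding the kernel over the signal (cross-correlation, which is what deep learning calls "convolution").

```python
import numpy as np

rng = np.random.default_rng(0)

# A 1D convolution written out as a sparse (banded) matrix multiplication.
n, k = 7, 3
x = rng.standard_normal(n)
w = rng.standard_normal(k)                 # the kernel: 3 shared weights

A = np.zeros((n - k + 1, n))               # mostly zeros: a sparse matrix
for i in range(n - k + 1):
    A[i, i:i + k] = w                      # the same kernel slides along the band

# The matrix product equals sliding the kernel over the signal.
assert np.allclose(A @ x, np.correlate(x, w, mode="valid"))
print(A)
```

Note that `A` has only 15 non-zero entries out of 35 — and those 15 are just the same 3 numbers repeated, which is the parameter sharing we discuss next.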
Otherwise, cool. Second part: stationarity, which implies parameter sharing. What does that mean? In the first case I showed you, we start with the sparse connections on the left-hand side, where all the arrows are white, and the dashed grey lines show the connections we dropped, which are no longer with us — those are zero connections. On the right-hand side I'm going to draw the same thing, but using colours. I'm going to draw the first edge in yellow, so all of these are going to be yellow. Similarly, I'm going to draw all the horizontal connections in orange, like this. Finally, all the last connections are going to be drawn in red. So these are my three colours, and these three colours represent the same weight. Given a neuron here, the neuron one above it will always be multiplied by the yellow coefficient: this neuron here has the neuron one above multiplied by the yellow coefficient, and similarly for this one. They all share the same parameter — that's why it's called parameter sharing. Then I collect all these values — the yellow, orange, and red — into a collection, which is called a kernel. So that's our first definition of the word kernel. Parameter sharing allows us to have several benefits. The first one is faster convergence. Why is that?
Because the same weight gets gradients from different regions. Before, a given weight received gradient only from one specific location; now this yellow weight gets gradients from multiple parts, so you have much more information for moving faster towards the best location. You get better generalisation, because you're not looking for a specific thing that happens in only one location: you try to find a more general pattern that can be used all across the domain, rather than overfitting to one specific region. And it's not constrained to the input size — this is really important. If you make the input larger, even after training, you already have the weights that can handle that extra part. So what happens is that you can train, say, a face-recognition system on just faces, like my face here, and then apply the convolutional net across this whole big image, having trained it only on that small type of input. There's no issue with a larger input. If instead you treat this big image as one big vector, you have no idea how to handle the fact that you now have more pixels than you used for training. So this is something that should not be forgotten. Finally, the kernels are independent: if you have multiple kernels, you can compute these convolutions — these matrix multiplications — in parallel. They have no dependencies, so you can compute all of them together. But here we've only seen one kernel. So where are the other kernels?
I'll tell you in a second. Connection sparsity, instead, basically reduces the amount of computation. You can implement a convolution as a matrix multiplication, which basically means having a lot of zeros inside the matrix — but when you multiply by a zero, you already know the answer, right? Nevertheless, having the convolution implemented as a matrix multiplication can be convenient if computation is essentially free to you. Say you use a GPU — a graphics card — you have a given budget of computation, and matrix multiplications are implemented very well there. If you can afford those larger chunks of memory and you still have the compute budget, a plain matrix multiplication may even run faster than the alternative that slides across the whole input. So in one case you do one big matrix multiplication: you go faster, but it takes a lot of memory as well. The other option is slower, but perhaps takes less memory. Given a large enough compute budget, you can still use the dense form. So: connection sparsity means you have zeros in your matrix multiplication.
The matrix you're multiplying by has some values set to zero — those are the dashed lines. All right, so we mentioned kernels before; let me go into these kernels a little more. In this case, we said we have the yellow-orange-red kernel, which is applied to the first triplet of inputs, then to the second triplet, and then the third triplet of neurons. These values are stored in this collection here: my kernel in this case has three values — yellow, orange, and red — stored in this vector over here. Nevertheless, I may have a second kernel: the blue-purple-pink one. So for this specific case I have two kernels, each of which has size three. The kernel dimensionality over the domain is three — it has three items — and I have two of them. How does it work? Because I have two kernels, running the first kernel gives me the first set of outputs, and running the second kernel gives me the second set of outputs. So you can think of these neurons over here as having two layers: one on the screen, and one coming up out of the screen. When you apply the first kernel, you get the first output; when you apply the second kernel, you get the second output on top. So in this case we move, say, from a mono type of signal — if these pink circles have only one value — to a stereo type of hidden layer. But in this case my input is actually seven-dimensional: it has seven channels. So each pink circle here is basically seven stacked circles — one, two, three, four, five, six, seven — one on top of the other. Don't write private messages — I cannot see them.
Just write to the public chat, so the TAs can answer too. So you have seven channels in the input, and then with two kernels you will have a two-channel output, right? Usually, if you're working with a stereophonic signal, you'll have just two channels in the input, and then perhaps you want to bump that up to, say, 16 intermediate channels for the hidden layer. You usually want to go to a higher-dimensional space when you move to the hidden layer, such that things are easier to move around, as we've mentioned a few times already. Okay, I'm talking too much. The fact that the input has seven channels means that this yellow item over here also has seven components. So the yellow star, the orange star, and the red star — all three of them are seven-dimensional vectors, again coming out of the screen — such that I can perform the dot product, the scalar product, between the kernel and the input. So my kernel sizes are going to be the following: I have two kernels, as you can see over here, each element of which has seven dimensions — that's the seven here — and overall there are three items in this kernel: one, two, three. The seven has to match the input dimensionality, such that each value in the input vector has its own coefficient. The three is the actual length — well, the actual span — of the kernel: how much of the domain it covers. I don't count the seven as the kernel's dimensionality over the domain: the three tells you that you cover three samples of the domain, and the seven has to be the number of channels, matching whatever the source is.
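This 2 × 7 × 3 collection — number of kernels × channels × span over the domain — is exactly what PyTorch stores for a 1-D convolution. A quick sketch, with the sizes from the slide and an arbitrarily chosen input length:

```python
import torch
import torch.nn as nn

# Two kernels ("yellow/orange/red" and "blue/purple/pink"), each spanning
# 3 positions of a 7-channel input: the weight collection is 2 x 7 x 3.
conv = nn.Conv1d(in_channels=7, out_channels=2, kernel_size=3)
print(conv.weight.shape)  # (out_channels, in_channels, kernel span)

# A 7-channel input of length 10 (the length is just for illustration):
x = torch.randn(1, 7, 10)
y = conv(x)
print(y.shape)  # 2 output channels, length 10 - (3 - 1) = 8
```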
You have to be able to weight each channel with its own coefficient. So the seven is given to you by the previous layer — it's the number of channels of the previous layer. The three is how many components of the domain you want to consider: it's a parameter that depends on the locality of the signal. If you have signals that are more spread out, you want a larger number here, such that you can check for these larger, more extended signals. And the two here is how many channels the next layer will have. Okay — so in this case, one-dimensional data uses 3-D kernel collections: the data is 1-D because the domain is one-dimensional, and the kernel collection has three dimensions — the number of kernels, the number of channels, and the extension over the domain. Padding: so what is padding? Padding is necessary when you want to keep the same dimension across the network, even though you apply a convolution. As you can tell here, I apply a convolution of size three — the size is, again, over the domain — and you get a reduction of the output, due to the fact that you apply this convolution. The reduction is equal to the size of the kernel minus one: the size is three, and therefore one sample is missing over here and one is missing over here. You can see this, right?
So how much do we need to pad? One extra sample here and one extra sample there. And how do we get there? The amount of zero padding is equal to the kernel size minus one, and we usually do half of it on one side and half on the other. That's why we like to have odd numbers for the length, the extension over the domain, of this kernel. The other reason to prefer an odd number is that a given neuron then corresponds to a given location: if you have an even number of edges here, one neuron corresponds to two locations, and you get a blurring effect. You could also just copy the information over to the same location. So we need one extra input per side: I'm going to input a zero here and a zero over there, such that now I can perform the convolution at the edges and get one extra output on each side. Now the number of neurons in the hidden layer is the same as the number of neurons in the input layer, and this is helpful for not going crazy keeping track of the sizes across the network. It's arguably a bit silly, though, because you insert zero information at the edge, which wasn't there — we add something. All right, so can we use different types of padding? Yes, you can do that.
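A quick sketch of that padding arithmetic in PyTorch (the input length is illustrative): kernel size k eats k − 1 samples, and padding (k − 1)/2 per side restores the original length — hence the preference for odd k.

```python
import torch
import torch.nn as nn

k = 3                               # odd kernel size: a central location exists
x = torch.randn(1, 1, 10)           # one channel, domain of length 10

no_pad = nn.Conv1d(1, 1, kernel_size=k)                          # 10 -> 8
same_pad = nn.Conv1d(1, 1, kernel_size=k, padding=(k - 1) // 2)  # 10 -> 10

print(no_pad(x).shape)    # k - 1 = 2 samples lost
print(same_pad(x).shape)  # length preserved by one zero per side
```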
But usually your input is going to be zero-mean anyway, so a zero is a good value to use at the edges. You usually zero-mean the input and divide by the standard deviation, and whenever you use something like batch norm or layer norm, the average value of each layer is always going to be zero as well — so padding with zeros is not too bad. By convention, each kernel with one channel will have an odd size, for this reason: we like odd extensions such that there is a central location. If you have an even number, there is no central location, and that can create a blurring effect. So here is a bit of a practical suggestion about how to use these convolutions, which this diagram represents. You want to have multiple layers, such that you can create this hierarchical structure of convolutions (which are the rotations), non-linearities (which are the squashings), perhaps some pooling in order to reduce the spatial dimensionality, and then batch normalisation.
Batch norm is something that was introduced more recently and works like magic: it basically allows us to train very deep networks without convergence issues. We also found out that residual bypass connections are really important for getting very, very deep networks to train well. So what the network does is convert this input, which is very flat in this case — its thickness is tiny, but the information is spatial, distributed across the domain. I usually call the information along the thickness the characteristic information: a given location has a specific characteristic. What is the characteristic information of a given location in this pink image here — what do you know about that specific location? The colour, right? So the RGB is the characteristic information of that location, of that specific pixel. On the other side, at the end of the network — say we do classification — we have just one big vector, and all of it is characteristic information. There is no more domain, right?
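As a sketch of the recipe just described — convolution, batch norm, squashing, pooling — with purely illustrative channel counts (this is not a full network, and the residual connection is left out for brevity), you can watch the domain shrink while the characteristic axis thickens:

```python
import torch
import torch.nn as nn

# One "standard" 2-D block: convolution (rotation), batch norm,
# non-linearity (squashing), then pooling to shrink the domain.
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # domain size preserved
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # halve the spatial domain
)

x = torch.randn(8, 3, 32, 32)  # a batch of 8 RGB "images"
y = block(x)
# Thinner domain (32x32 -> 16x16), thicker characteristic axis (3 -> 16 ch.):
print(y.shape)
```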
We just collapse everything into one specific thing, and then a given location across that vector tells you: this is a hippopotamus versus a giraffe. In between, you have tensors of whatever intermediate size, midway between very thin and very long. So the network basically changes the shape: you have a reduction of the domain information and an increase in the characteristic information. Pooling — and that's the last part, then we're going to cover the notebook. Basically, we start here with an image — in this case a two-dimensional signal — and we have what is also called a kernel: the area we look at, in this case with an even number of items. I look at some possible operation that summarises the content: for example, I can use LP pooling, where I take the p-norm, or I can use max pooling, which is the limit as p goes to plus infinity. So I apply this LP norm, which condenses the information — or, if I use max, actually throws information away, more or less — and I get one value. Then I apply a stride, such that I can decimate the information across the spatial... well, across the domain. This also allows me to reduce the amount of computation. Similarly, you can do something comparable with a strided convolution, where the convolution is applied only at jumping intervals, so you don't consider exactly what happens everywhere across the domain. In this case the number of channels stays the same, because you perform the pooling only across the domain coordinates, keeping the same number of channels. That was pretty much it for the slides, so in the last ten minutes we're going to cover the notebook. As I was mentioning — I think in the chat, or maybe not...
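A small sketch of these pooling options in PyTorch (the kernel sizes and the large p are illustrative; max pooling is the p → ∞ limit of LP pooling, and a strided convolution decimates the domain the same way but with learnable weights):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.rand(1, 1, 8) + 0.5   # positive values keep the p-norm well behaved

max_pool = nn.MaxPool1d(kernel_size=2, stride=2)
lp_pool = nn.LPPool1d(norm_type=100, kernel_size=2, stride=2)  # large p ~ max

# For large p, the LP pool approaches the max pool:
print(torch.allclose(max_pool(x), lp_pool(x), atol=0.02))

# A strided convolution also halves the domain (8 -> 4), channels unchanged:
strided = nn.Conv1d(1, 1, kernel_size=2, stride=2)
print(max_pool(x).shape, strided(x).shape)
```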
I forgot. We always keep a terminal open, where we can use it, run it, as a calculator. Anyway — okay, the bar is in front of my nose again, so let's move the bar, let's move my face... I cannot move my face. There we go. Okay. So we go into the working directory, GitHub, pDL; then we do `conda activate pdl`, and then `jupyter notebook`. Right now we're going to be looking at the convolutional net notebook. We select the course kernel, and I'm going to just execute everything and comment while my computer is running. So what do we do in this notebook? Let me go full screen — there we go — and let me zoom in. Here we import some plotting libraries and set some default visualisation settings. We import torch and then the nn package; I also import the dataset utilities and pyplot as plt. Here I have some helper functions, and we set the device to whatever is available on the current platform — I have a GPU in this case. Here I load the MNIST dataset — this is the modified NIST dataset, which Yann modified some 30 years ago. Then we apply some normalisation, computed across the whole training set: we subtract the mean and divide by the standard deviation. Here I'm plotting ten digits, showing you in yellow the values that are close to one, and in purple the values set to zero.
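That normalisation step can be sketched like this (0.1307 and 0.3081 are the commonly used MNIST training-set statistics; the batch here is random stand-in data, not the actual digits):

```python
import torch

# Standardisation as used for MNIST: subtract the training-set mean,
# divide by its standard deviation.
mean, std = 0.1307, 0.3081

x = torch.rand(64, 1, 28, 28)     # stand-in for a batch of digit images
x_norm = (x - mean) / std

# Background pixels (0) map below zero, bright strokes (1) well above zero,
# which is why the plotted digits range from purple (low) to yellow (high):
print((0.0 - mean) / std, (1.0 - mean) / std)
```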
Okay — but since we've removed the mean, I believe the yellow is going to be above zero and the purple below zero. In this case, we are going to use a different way of defining a neural network. Last time we saw that we can use Sequential; this time we use the more extended version of building a neural network — a more flexible way of building a neural net. So I have my class, which is going to be the fully connected network. It has an `__init__`, to which you pass whatever parameters you want later on; I assign the input size to an internal variable, and then I define my network — again, in this case, also with Sequential, but defined inside this `__init__` function. After the `__init__` is done, I have to define the forward function, and the forward simply does the following: it gets the input and reshapes it into one big long vector, because this is a fully connected layer and it doesn't know how to deal with anything that is not a vector. Then I simply feed my reshaped x — this one big vector — to the network. In the other case, instead, I use a convolutional net, and I don't use Sequential: here each individual item is a separate module. So I have a Conv2d, a Conv2d.
These are in the nn package, with a capital C — capital C means it's an object, which means all the parameters, all the weights, are inside; similarly for the other one, and for the capital-L Linear. There we use nn.ReLU as an object in the nn package. Instead, in this case, since we're using the extended forward, I send x through this conv1 — conv is lower case here because it's just a function call, although this module also carries the weights — and then I apply relu from F, the functional package. This relu is just a function that has no weights, so I don't need the nn version. The nn package gives me a module, which is necessary whenever you use Sequential, because you have to stack modules; here I just apply my functions to a given input, one after the other. You can do this, or you can just use Sequential. How much time do we have left — where is the timer? I don't see the timer; I think we have a few minutes left. So here I'm creating these two networks such that they have the exact same number of parameters. Here I have my training loop, and then I do the following: I may or may not permute my data points — let's see. I train my networks first on the normal dataset, and then I change the order of the pixels, just to see what happens. We're out of time, so I'm finishing up. I train the small fully connected network here — about 6,000 parameters — and we get 96% on the training set and 87% on the test set. Then I train my convolutional net on the same type of data, with a similar number of parameters — 6,422 here versus 6,442, similar, right? — and we get 95%.
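Here is a sketch of the two styles — a Sequential stack defined in `__init__` versus stateless functional calls in `forward` — with illustrative sizes, not the notebook's exact architectures or parameter counts:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FullyConnected(nn.Module):
    def __init__(self, input_size=28 * 28, hidden=32, n_classes=10):
        super().__init__()
        self.input_size = input_size
        # Modules (capital letters) own their weights; Sequential stacks them.
        self.network = nn.Sequential(
            nn.Linear(input_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x):
        # A fully connected layer only understands vectors, so reshape
        # each image into one long vector first.
        return self.network(x.view(-1, self.input_size))

class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(8, 8, kernel_size=3, padding=1)
        self.fc = nn.Linear(8 * 7 * 7, 10)

    def forward(self, x):
        # F.relu / F.max_pool2d are stateless functions (lower case):
        # no weights, so no module and no entry in __init__ is needed.
        x = F.relu(F.max_pool2d(self.conv1(x), 2))  # 28 -> 14
        x = F.relu(F.max_pool2d(self.conv2(x), 2))  # 14 -> 7
        return self.fc(x.view(x.size(0), -1))

x = torch.randn(4, 1, 28, 28)  # a stand-in batch of four digit-sized images
fc_out, conv_out = FullyConnected()(x), ConvNet()(x)
print(fc_out.shape, conv_out.shape)  # one 10-class score vector per image
```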
So we can see here how the convolutional network has better accuracy on the test set than the fully connected one. But now I'm going to shuffle the pixels: I take all the pixels and apply a deterministic transformation. Whenever I apply it, I get the following: instead of this kind of five, the five is going to look like this; instead of a zero that looks like this, I'm going to have a zero that looks like this; the four looks like this, the one looks like this. Cool. So now I train the conv net on this shuffled data, and we go down from 95% to 83%. Why is that? Because there is no more local information — we completely scrambled the data. If instead I train the fully connected network on the same shuffled data, we don't observe such a drop in performance: it gets 85% here, and before it was 87% — that's just statistical variation. If we put it all together and compare the two, we have the following — let me zoom a little. The fully connected network performed this well before, while the convolutional net performed much better. After scrambling the pixel order, the convolutional network performs much, much worse, and the other one performs roughly the same as before: there are no major differences for the plain neural net.
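The shuffling experiment can be sketched as follows (a random stand-in batch instead of MNIST; the point is that the permutation is fixed, applied identically to every image, and information-preserving — only the locality is destroyed):

```python
import torch

torch.manual_seed(0)

# One fixed random permutation of the 784 pixel positions.
perm = torch.randperm(28 * 28)

x = torch.randn(4, 1, 28, 28)                      # stand-in batch
x_perm = x.view(4, -1)[:, perm].view(4, 1, 28, 28) # same pixels, new places

print(x_perm.shape)
# Nothing is lost, just moved: the totals match (up to float rounding).
print(torch.allclose(x.sum(), x_perm.sum(), atol=1e-3))
```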
Okay. Why did the plain neural net's accuracy change at all? Because of statistical variation in the initialisation of the parameters. If you run it again, this one might even beat the first run. So the fully connected network doesn't see any difference; the variation you get is simply the normal, expected variation you may get when you run the same training procedure multiple times with different initialisations — just statistical fluctuation. The convolutional net, instead, really drastically lost performance, because those little kernels can no longer look at things that have been moved around: there's no longer a way to extract the local information. Okay, some information is still extracted, because we have several layers, but nevertheless the kernels are much less effective — they have a limited view, and we moved pixels away, so that kind of information can no longer be captured. And that was pretty much everything I wanted to tell you about convolutional networks for today. Tomorrow we meet at the same time, 9:30, and we're going to talk about recurrent neural networks; these two lessons are going to be, I think, quite helpful for the homework that's coming out tomorrow. Okay, I think that was it. If you have more questions, ask them on Campuswire — we will answer every question you have. I wasn't checking the chat for the last few minutes, so as not to waste too much of your time. Okay. All right. See you tomorrow. Have a nice day. Bye.