So we want to cover one of the most successful types of networks, convolutional networks, and I try to find things that are not covered online and in YouTube videos, so that we get a better understanding of the ideas. Hopefully we can finalize it today; we will see how much we can cover.

So we still have a problem, even with autoencoders and RBMs: we still have way too many weights, and we still need features. These two problems could not be solved by autoencoders, which means if I have a large input, suddenly I have to put a million numbers into the autoencoder, and that is not easily possible. So we have not solved this problem. The solution will be something that we call convolutional neural networks.

A little bit of history again. In 1959, Hubel and Wiesel did some research on the cat's visual system. It seems that whenever we achieve something, we have to inflict some suffering on poor animals. They put electrodes in the brains of cats; I imagine that's not very pleasant. That's one of the reasons I think we still cannot call ourselves a very intelligent species: we cannot figure out how to make progress without causing suffering. Why can't we simulate it? Why can't we do it on dead people or dead animals? Why do we have to take an animal and put electrodes in its head while it's screaming? Well, they did it, and we are sitting here using the results, so we are complicit in that crime. They realized that there are simple and complex cells in the primary visual cortex, and based on the experiments they did, they used a cascaded model for the two types of cells. We don't want to talk about the ethical side of it; I guess many people are still doing it, and I'm not just blaming them, I'm blaming us. So: two types of cells, simple cells and complex cells, and you cannot really understand the visual cortex, or at least part of it, if you don't make that distinction. Sorry, I messed that up a little bit; I will try to write larger.

So that was biology. Then 1979 — it usually takes 20 to 30 years before somebody starts thinking: OK, that was biology, what can we do with that in engineering and computer science? The knowledge that there are specific cells, simple cells and complex cells, arranged in a cascaded way in an animal's vision system — what can we do with that? In 1979, Professor Kunihiko Fukushima came up with the first idea that is the prototype for the CNN. You could say it was the first CNN — not really, the C was not there yet, but maybe something else, maybe an SNN, I don't know. Professor Fukushima came up with the idea of the Neocognitron, and he proposed it for digit recognition. It had some problems. The main problem was that he did not investigate very far what the learning procedure for this type of network should be. So the Neocognitron did not go anywhere; it's just part of the literature now, part of the history of AI. Fukushima was the one who started thinking about this and brought it into computer science, but he did not do his homework to the very end, if I may say so: he put a lot of emphasis on the topology and not much emphasis on the learning algorithm. So nobody knows the Neocognitron — but whoever came up with the CNN knew the Neocognitron.
Everybody else who was in the field knew it. And then it took an additional 10 years: around 1989, Yann LeCun and others came up with the CNN, the convolutional neural network. Well, that's a really rough sketch; I cannot claim that's the full history, but roughly, that was what mattered.

The idea of complex cells was that you get a stimulus — a pattern — and then you have some sort of cells, and these cells look at your pattern, and you understand what is going on. This is the example from one of the works of Hubel and Wiesel, when they looked at the topology of simple and complex cells in the cat's visual cortex. They projected different patterns on a screen, the cat was looking at it with electrodes in its head, and they measured which part of the brain gets activated. And if you torture enough cats, you get some idea of what's going on. At some level there are cells responsible for the first step: simple cells. Simple cells look at the big picture — "big picture" at a low level is a little bit confusing — and then you go a little bit higher and you get more cells, and these cells again look at different things: these are complex cells. Then you get to another set of cells, highly complex cells. What does that mean? It means the simple cells can tell you: this is black, this is white. You go one level higher: OK, I can separate this from that — which means you can say: this is a line, this is a corner. And you go one level higher, to the highly complex cells, and you can say: that's a square. Details to abstraction. They saw that this is what is at work in the visual cortex of cats. The visual cortex of humans is comparable — at some levels more complicated, at some levels simpler. Simple cells just give you rough features that cannot by themselves say "I'm looking at a square." Highly complex cells can tell you it's a square, it's a circle, it's whatever — but they need the low-level information coming from the simple cells. That was the idea; this is the work of Hubel and Wiesel.

What Fukushima did was say: you know what, I can take an image, or data — today we always think in terms of images, but it could be anything, data, input — and send it through simple cells, then through complex cells, then again through simple cells, again through complex cells, and we do that until we get somewhere where there is a decision of some sort. Of course, you have to come up with a structure: what is simple, what is complex. This was, again, trying to mimic the biology. The problem was that Fukushima did not provide a learning algorithm like backpropagation. When was it, 1979? There was no backpropagation; backpropagation came in 1986. So, poor Fukushima. He got the idea, but he couldn't really close the loop. He got the structure — this is the prototype of the CNN, more or less. Because when you look at the biological pattern, you see: how many different ways are there to implement this? I have to look at edges, and then at blobs, and then at shapes — you go from low level to high level.
Why didn't we know that since the 1970s, when David Marr came up with his theory of computational vision? Marr has a book, hundreds of pages, called Vision. He wrote that book when he knew he had cancer and was going to die: I have to sit at home, I have to finish this book. And even in that book, he wrote: we have to do it this way. We have to go from low level to high level. Everybody knows that. The image on the retina of the eye is low level; it's just photons hitting the sensors, which send some signals, and then you go a level higher and a level higher until the abstract idea: I'm looking at a face. Everybody knows that. But how should I implement it? That was the challenge.

OK. Before we go into the details of what CNNs are, let me give you a list of differences: what are MLPs compared to CNNs? Before we look at the implementation of a CNN, I want to know: what is it that you are promising yourself from this new type of network that you call a convolutional neural network?

What do you have as input for an MLP? Feature vectors. MLPs cannot process raw data if the problem is too complicated. Somebody has to get their hands dirty, extract features, and give them to the MLP. We had that before: we had SIFT and SURF, we had LBP — we still have them. So you extract features; somebody else, the computer vision guy, does the feature extraction and gives it to an MLP with three layers, five layers, six layers, and says: OK, can you classify these features for me? In a CNN, the input is the data — the raw data, almost. The CNN says: don't bother. I don't trust these computer vision guys; they sit down for ten years, they come up with an operator, and it doesn't work in 90% of the cases. I will figure it out myself. That was a huge departure. Nobody had thought before that we should give feature extraction as a task to the neural network; we did it with our computer vision algorithms. We somehow thought — and that was the main problem — that computer vision is one thing and artificial neural networks are something else. The CNN closed this gap and said: no, it's the same thing.

Then what about connections? Connections are quite dense in MLPs; if you go deep, you get millions of weights, densely packed in every layer. In a CNN, connections are sparse: just a few weights connect one layer to the next. But how can you learn? Learning is about adjusting the weights, and that density was the curse: with millions of weights you couldn't converge. You cannot apply backpropagation if you have millions of weights. So it has to be sparse. It's as if I'm writing my wish list for Santa Claus: what is it that I want? I want you to take the raw data, and I want you to be sparse, so that training is easy. We still don't know what a CNN is; this is our wish list. I want the CNN to be this.

Weights. Connections are not necessarily weights — when we talk about connections, the question is how you organize them. In an MLP, the weights are independent. In a CNN, I want to share the weights. How can I share the weights? Why should I share the weights?
Because if I share the weights, again, I reduce the number of weights, and learning becomes easier. Maybe this is the way to design any AI technique: sit down, come up with a wish list of what it should be, and then work backwards towards how to implement it.

Learning. How do I learn? Well, the MLP learns with backprop. How does the CNN learn? Everybody cooks with water: backprop. No new learning algorithm. What? You want to revolutionize AI, but you're keeping the same learning algorithm? Do you have a better idea? I'm listening. Anything better than backpropagation and gradient descent? Let me know. Backprop; no new learning algorithm.

Filtering. Sometimes we need to filter stuff, right? The input may be features, but those features may still need filtering. For the MLP, that was also done outside: any filtering of the features that was necessary was done before anything was put into the network. Here it's done inside. This guy wants to be in charge. This guy is in charge: just give me the data; I filter it, I extract the features, I organize them, I classify them, I give you the result. Why do you think that today you suddenly have packages with which everybody can run face recognition? Because now you have one method that does everything: just feed it the raw data. Too good to be true. So what is true?

What is then the general scheme? In MLPs, we directly learn a non-linearly separable problem via features of the input. Somebody gives me the features of the input, but those features are still non-linear, so I have to learn a non-linear problem based on something somebody else has given me. You are going to build a bridge, but the foundation has been built by somebody else. You cannot say: why did you do it this way, I want to do it differently. The foundation is made; those are the features. So the problem is still non-linear when it's given to the MLP. Here, again, the wish list for the CNN: I want to learn — OK, if not a 100% linear problem — a quasi-linear problem. I want a linear problem, because linear problems are easy compared to non-linear problems. They still need computational power, but they are easy. Learn a linearly or quasi-linearly separable problem. Via — and now I cannot just keep writing the wish list, now I need a crucial idea — via what? Raw data, it has to be sparse, the weights have to be shared, I'm still stuck with backpropagation, I have to filter inside, and I need a quasi-linear problem. I will solve that problem via — what is the most significant principle of computer science for solving tough problems? What is it? Divide and conquer. No, I thought you would give me alcohol to drink; you're still giving me water, and everybody cooks with water. Divide and conquer — what do you mean? I know divide and conquer is very important; I know it's the only tool we have to attack and solve difficult problems, but what does that mean here? Well, these guys told us: those poor cats did not suffer for nothing. Low level, mid level, high level. OK, and what does that mean for signal processing? Multi-resolution, pyramids. This is old; we have known it since the 70s, of course. New ideas are made by taking old things and putting them together in a new way. Of course we know multi-resolution. So: via divide and conquer on the raw input. That is quite an ambitious wish list.
I'm trying to go over two or three decades of research step by step and describe what happened in the minds of many researchers so that we got here. So let's make some observations, and maybe we find some motivation — without motivation, nothing works.

Both MLPs and autoencoders have this problem. I'm sorry, I know, autoencoders are fantastic, but they don't solve our problem, because they still take the raw input, and I cannot feed a million inputs into them. If I want to put an image into an autoencoder, 1,000 by 1,000 is a million inputs — still, even today, that would be a gigantic challenge. It was nice, it was good, we still use them for compression, but these networks still have way too many weights.

Let's say I have a simple image, 30 by 30. What is a 30 by 30 image? A digit, a letter; you want to do digit recognition, character recognition. That means you have 900 inputs, 900 pixels — and I'm assuming it's black and white, not even color; otherwise I'd have to multiply by three. That means the first layer has approximately 80,000 weights if I fully connect everything between the input and the first hidden layer. 80,000 weights just to recognize a tiny letter A or B, or the digit two or five. OK. But what I want to do is actually more sophisticated. Digit recognition is done, it's over; we can recognize digits. I want to do face recognition. Let's say a 256 by 256 image. That's roughly 65,000 inputs — and again I'm keeping it black and white; ignore color for the time being. Then your first layer, MLP or autoencoder, has around 600,000 weights. Good luck training that. Not going to happen. We know this problem is intractable; we tried it in the mid 80s, early 90s, mid 90s. We know it doesn't work.

So MLPs and autoencoders have a huge number of weights because of full, dense connectivity. This is one of those crucial moments where you have to come up with an idea and say: less is more. How can I make it less dense? How can I make it sparse without losing recognition capability? With MLPs and autoencoders, I cannot. And images — 256 by 256 is ridiculous. Most applications at the moment, most deep networks, work in that region, maybe up to 300 by 300. But you have satellite images, medical images, images from astrophysics that are 200,000 by 200,000 pixels. So we are at least a decade away from deep learning understanding astrophysics, digital pathology, satellite imaging. Take it easy. At the moment we cannot even do this; if we go this way, we cannot even do face recognition.

OK, we need an idea. Why be greedy, for God's sake? Use some weight sharing. Share some weights. Don't be greedy, don't keep all the weights for yourself — I want to be the neuron that recognizes the nose, the eye. Share the weights with others. But, to be skeptical: if I share weights, how do we then get enough information — which is features — from the input? You are telling me to share weights, but if I share weights, I cannot get enough information from the input, and you are forcing me to take the raw input.
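Just to make the weight counts above concrete, here is a quick back-of-the-envelope check in Python. The hidden-layer sizes are my own illustrative assumptions chosen to match the lecture's approximate figures, not numbers from any specific network:

```python
# Rough parameter count of a fully connected first layer:
# every input pixel connects to every hidden neuron.
def dense_weights(n_inputs: int, n_hidden: int) -> int:
    return n_inputs * n_hidden

# 30x30 grayscale digit, assumed 90 hidden neurons: ~80,000 weights
print(dense_weights(30 * 30, 90))      # 81000

# 256x256 grayscale face, assumed 10 hidden neurons: ~600,000 weights
print(dense_weights(256 * 256, 10))    # 655360
```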
If I get the raw input, and for every point, every section, every fragment, every region of the input, I share the weights with — I don't know how many neurons you want me to share the weights with — then how should I get enough information to understand the data?

Well, I don't like it, but I have to go back to Hubel and Wiesel, because the damage is done, the animals have suffered, and now I can at least read between the lines of what they found out. What was it? One of the things they found we were actually already using, but it was not clear to the AI people. And this is always interesting. One of the phenomena I see when I read papers and blogs and watch videos by people who are apparently new to the AI field is that they use terminology as if it had been invented five years ago, whereas the terminology is at least 60 years old. This is one of the ways you recognize a newcomer — which is not a crime, which is perfectly all right. But they talk about "stride" as if it were an invention of neural networks. That's signal processing 101; it was invented decades ago. Bringing it into the context of neural networks — that's what's new.

So, going back to my favorite scientists, Hubel and Wiesel: they discovered the so-called receptive fields. They projected shapes on a screen, and the poor cat is sitting somewhere here — I'm not drawing the cat, maybe just its retina — and they realized that for every type of information there seems to be a receptive field that gets activated when you display that character, that sign, that small shape. You show a small bar, then at 45 degrees, then 90 degrees, then 135 degrees, then you show a dot, then a circle — you play with simple stimuli. And you see: sometimes it lights up here, sometimes it's activated there. So there are receptive fields. It seems that the vision system is specialized: every part of it is specialized to capture a specific type of signal. So there is specific sensitivity towards specific patterns in small fields of the cat's visual cortex. For humans it's a bit more complicated, but let's not talk about humans. Cats can see; they navigate, they understand, they recognize; they can tell the difference between a dog and a mouse. They are smart — let's just imitate that, and not get caught up in the discussion of how it would look for humans. Probably something similar.

The emphasis here is on small. This is not a million neurons; these receptive fields are a tiny part. So if you want to imitate them, you have to imitate them with something small. What is that? Simulate it with convolution. Again, that's not new. Convolution is signal processing; we have known convolution forever. But now I want to use it in the context of a connectionist approach. I want to filter the raw data. There are a hundred thousand questions in my mind that I cannot answer. If somebody coming from signal processing talks to a neural network guy and says, I want to use convolution in a neural network, the answer is: what? How would that work? OK, what is convolution? We knew that with convolution you can design and implement any type of filter. That's a very powerful concept.
You can design and implement any type of filter. Nobody needs to explain filters to us engineers; filters are our business. We design filters left and right, hardware and software; we understand filtering. That should not be the problem.

So what is convolution? In a physical system — and that may make it a little bit difficult — an input acts on a linear system to produce an output. That's one of those definitions that doesn't say anything. OK: if my output is y(t) — now I'm going back to engineering, now we are in the time domain — then my input is a function of time, x(t), and I convolve it, that's the star sign, with a function h(t). I can write that as an integral from minus infinity to plus infinity — I'm doing analog, so I have to integrate over everything:

y(t) = x(t) * h(t) = ∫ x(τ) h(t − τ) dτ,  integrated from τ = −∞ to +∞.

τ is a dummy variable — and my markers are running out of ink — some sort of dummy variable; it doesn't matter what you call it, the actual variable is time. Here I have the output, here I have the input, and h is what we call the unit impulse response of the system. At this moment, when I talk like this, I have totally confused the neural network guy, because now I'm talking in the terminology of engineering: system, signal, impulse. What are you talking about? Everything is a weight for me. Where are the weights? OK, relax: this is a collection of weights. Happy?

So the convolution sum — now coming back to the discrete case — is

y[n] = Σ x[k] h[n − k],  summed over k in a neighborhood N.

That N defines the neighborhood in which we apply our convolution, our filtering — the specific field which should mimic the receptive field. Would it make it easier if I forget about convolution and call it filtering? Filtering. So, in what field should I apply the filter? Historically, we tried to apply the filter all at once: the image comes in at a thousand by a thousand, I apply a filter which is a thousand by a thousand, which means I have to do a matrix multiplication of a thousand by a thousand with a thousand by a thousand. Good luck with that. Convolution says: you don't need to do that. Why are you doing this to yourself? Do it in a tiny place, but slide that over the entire image — it has the same effect.

I'm not so sure about that. So let's say we come up with an h that is: minus one, minus one, minus one; zero, zero, zero; one, one, one. The first convolution mask, or convolution kernel, that I'm writing on the board — for some reason it has these numbers, whatever they are. And compared to the numbers I gave you, images being 256 by 256, this is three by three — it's really small. I would say Hubel and Wiesel would not complain about this: receptive fields are small, and this is basically the relationship. For us, this is a vertical impulse response — again, I'm continuing to confuse the neural network guy by talking like an engineer. And now I have an x, I have an image. I cannot write out the entire image, because the image is very big; let's say I take tiny parts of it, which are three by three.
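To connect the convolution sum just written down to code, here is a minimal sketch assuming nothing beyond NumPy; the loop is the sum y[n] = Σ x[k] h[n−k] written out literally, and np.convolve is the library routine that does the same thing:

```python
import numpy as np

# Discrete convolution sum, written out explicitly.
def convolve_sum(x, h):
    y = np.zeros(len(x) + len(h) - 1)
    for n in range(len(y)):
        for k in range(len(x)):
            if 0 <= n - k < len(h):   # stay inside the neighborhood
                y[n] += x[k] * h[n - k]
    return y

x = np.array([1.0, 2.0, 3.0])
h = np.array([0.5, 0.5])              # a simple two-tap averaging filter

print(convolve_sum(x, h))             # [0.5 1.5 2.5 1.5]
print(np.convolve(x, h))              # same result
```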
Let's say the image is 256 by 256, and I'm only looking at three by three parts of it, because my mask, my convolution kernel, my filter, is three by three. If somebody gave me a five by five kernel, I would have to choose five by five patches. So I choose two patches and write two samples here; call them A and B. A is: 10, 10, 10; 50, 50, 50; 100, 100, 100. And B is the other way around: the 10, 50, 100 now runs horizontally, so every row is 10, 50, 100. Again, I'm giving you numbers that make my life easy. If these were real images, those values would be between 0 and 255, because we usually encode with eight bits: 2^8 = 256 levels. If you have color images, you have three of those matrices, R, G, B — 24 bits, not eight.

Now I want to convolve A with h, and I want to convolve B with h. Should not be a big deal. For A: minus one times 10, three times, is minus 30; the middle row gives zero; plus one times 100, three times, is plus 300. So minus 30 plus zero plus 300: 270. In signal processing we say the response of the filter is 270 — a very strong response. And I do it for B too: minus 10, minus 50, minus 100 is minus 160; plus zero; plus 10 plus 50 plus 100 is plus 160. So it's zero. The filter responds with 270 for the first patch and with zero for the second. Basically the cat is saying: I see something; I don't see anything. There is no response.

For this filter — which means what? I wrote "vertical impulse." Patch A is an edge: dark, gray, bright, going this way. Grab an image, if you have Photoshop or GIMP or any other photo editing tool, zoom in until you see the pixels, and choose a three by three patch — you will find structures like this. In B, the 10, 50, 100 goes the other way. So this filter is specialized in one orientation: the receptive field that sees this edge cannot see the perpendicular one.

You want to impress me? This is signal processing; we know that. Yes, but now we want to give the control to the neural network. Before the mid-90s, we would sit down and manually design filters like this. What type of filter detects edges? What type detects blobs? What type detects corners? We designed them from the design requirements we had; everything was manual. Now the wish list says: I want to do everything myself, so I have to do the filtering myself. I have to understand the concept, but I don't like that you're giving me these numbers. I don't trust humans. I want to figure them out myself. What, you want to figure out the filter coefficients? Yes. What's wrong with that? But there's a lot of engineering knowledge in that! Really? Generate three by three random numbers, slide them over the image, and you get some edges. Sorry on behalf of all of us engineers, but it's not magic, it's just convolution: you convolve a signal with a filter.

So what is the idea? Now we are at roughly 1996, having started in 1959 — LeCun is quite close now. Another idea: use the weights as filter coefficients. Use weights as filter coefficients — which means what?
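Here is a minimal sketch checking the worked example above with NumPy. As in the lecture, the response is computed as an elementwise product and sum over the patch (strictly speaking that is correlation rather than flipped convolution, which is also what CNN layers actually compute):

```python
import numpy as np

# The mask from the board.
h = np.array([[-1, -1, -1],
              [ 0,  0,  0],
              [ 1,  1,  1]])

# Patch A: intensity 10 -> 50 -> 100 from top to bottom.
A = np.array([[ 10,  10,  10],
              [ 50,  50,  50],
              [100, 100, 100]])

# Patch B: the same ramp, but running left to right.
B = np.array([[10, 50, 100],
              [10, 50, 100],
              [10, 50, 100]])

print(np.sum(h * A))   # 270 -> strong response: the filter "sees" this edge
print(np.sum(h * B))   # 0   -> no response: wrong orientation
```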
Which means I have my input x1, x2, x3, x4, x5, x6, x7, x8, x9, and I want to convolve it — not the way engineers do, where they give me some numbers and call it the Sobel operator, the Prewitt operator, the Kirsch operator, the Roberts operator — no, no. The AI operator: w1, w2, w3, w4, w5, w6, w7, w8, w9. I made my life easier with simple indexing instead of double indexing, but you get the picture: each of them has a specific location in the gigantic matrix I will be operating on. This is one of those small, tiny three by threes in a big image. My filter is small; I cannot have big filters. For many decades we tried to do big filters, and big matrix multiplication kills us, still today. So we don't want that; we want small filters.

But what type of filters? How do I know what the cat sees? What about a vertical one? What about 30 degrees? What about 45? What about 90? What about a small circle? What about a corner? There are many things we can see, and for each one of them we should have a filter. That's the low-level, simple-cell layer that Hubel and Wiesel were talking about: understand the signal at the lowest possible level and then take it one level higher, and one level higher, until you see: oh, that's a face. There are fantastic visualizations — just do an image search and you see what the layers of a network see. I'm struggling to make the point here on the board.

So: x1, x2, ..., x9 go in. I have weights w1, w2, w3, ..., I have a beautiful neuron, some logistic function, and out comes the result. Back to Frank Rosenblatt and the perceptron. Now we have a bridge; now comes the reconciliation of neural networks with signal processing. Fantastic. Why didn't we think of this earlier? Because we are egotistic. We stay confined within our own domains: the computer scientists think they're the best, the engineers think they're the best, and everybody thinks the biologists are the idiots. We have to talk. If you talk, you say: oh my God, there is so much we can learn from each other. And of course, this is what we know: you want to add a bias, add a bias, who cares?

So what is then the weight sharing? One filter for all. All what? All pixels. I design one filter and I apply it everywhere. What, how big is your image? A thousand by a thousand, a million pixels — in color, three million. And I apply one filter: nine weights, shared among a million pixels. If that works, it would be fantastic.

But there's a problem. There's always a problem, isn't there? And then people give up. You get to here and say: yes, that's weight sharing, exactly — but then there is a problem, and you go drink coffee, and you go on sabbatical, and you go to conferences, and your papers get rejected, and your grants get rejected, and the whole department thinks: why did we hire this guy? The problem: we need many features. Excuse me — now that you are getting rid of the engineer who gave you handcrafted filters and you want to do it with weight sharing: what is this one filter? Is it the vertical line? The horizontal line? 45 degrees? You need a lot of features. Relax, I'll take many filters. Yes, but you need many. OK, I take many — are 64 enough? The signal processing guy would tell you 64 is way too much; that had no precedent in the literature. In the literature we worked with four filters, six filters — maybe, if you went insane, eight filters. Now we have 128 filters. Who should manage all of that?
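Here is a minimal sketch of the bridge described above: one 3x3 patch x1..x9 feeding a single perceptron-style neuron whose weights w1..w9 play the role of the filter coefficients. The random weights and the bias of zero are placeholder assumptions; training would set them:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.uniform(0, 255, size=9)      # one 3x3 image patch, flattened
w = rng.normal(0.0, 0.1, size=9)     # the "AI operator": learned coefficients
b = 0.0                              # optional bias, as mentioned

z = np.dot(w, x) + b                 # the filter response (convolution at one spot)
out = 1.0 / (1.0 + np.exp(-z))       # logistic activation, back to Rosenblatt
print(out)
```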
The network. People say AI makes people lose their jobs. Have I lost my job? I'm an engineer; they took the design of filters away from me, but I'm still here, and I report proudly that they took it away from me. So the solution: learn several filters. 32, 64, 128. Wow, are there that many situations? Well, if I only went after edges — but there aren't just edges; there are corners, blobs, small circles. A corner can be like this, or like this. There are many, many different things. We didn't think of all those possibilities because we were doing it manually. Now that it's automatic, let's go insane. It's just 128 filters — why not?

Which means what? The image comes in — usually, let's say 256 by 256, which is quite common, with three channels, a color image most of the time — and you get many three by three filters to be applied to that image. But we don't design those filters; we just let the network go. We do have to think about how we initialize those weights. Shall we put them all at zero, or all at one, or all at minus one? No: if everything generates the same result, the network cannot go anywhere. You have to do it randomly. We call that symmetry breaking: you break the symmetry, because to learn, the neurons have to be different. Initialize randomly, from a zero-mean distribution with low variance. Zero mean, because I want it normalized, the average around zero; and low variance, because I don't want big fluctuations in the filters. There are a hundred thousand details in there, and as a newcomer you do not figure them out easily; it takes time. TensorFlow is a blessing. All of us — we all Google. They did that; of course, then we learned their platform, and basically everybody is working for Google, but God bless them, because the package is fantastic. It also makes us lazy, though — and other packages do too — because we still have to understand the details of why we initialize this way. If you are a designer, you have to answer that question.

So if I do this, then your 256 by 256 image grows after every filtering, right? From one image I generate many, many images, and I put them together, making a box: a gigantic tensor. 32, 64, 128 — I filter the image, which has three channels, 128 times, times three, and I stack the results together. It becomes a gigantic box, and the management of it is quite challenging. But now we are handing over control: these are our weights, and these few weights are shared across 256 by 256 pixels — for a neural network, that's not much.

So then, this is the convolution: you do this and you get the responses of many filters. Apologies for my bad drawing; Van Gogh would like it, because he was breaking the laws of perspective all the time, and his color compensated for it. You apply many filters because you want to understand the scene: how many edges do I have in this direction, this direction, this direction? Faces — we have techniques in computer vision for recognizing pedestrians, because if a pedestrian is walking, what do you, as a computer, see most? You see vertical lines moving.
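Since TensorFlow came up, here is a minimal sketch of exactly this step in Keras: 128 filters of size 3x3 applied to one 256x256 color image, with the weights initialized randomly from a zero-mean, low-variance normal distribution (the symmetry breaking above). The stddev of 0.05 is an illustrative assumption:

```python
import tensorflow as tf

# One convolution layer: 128 learnable 3x3 filters, randomly initialized.
conv = tf.keras.layers.Conv2D(
    filters=128,
    kernel_size=(3, 3),
    padding="same",
    kernel_initializer=tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.05),
)

image = tf.random.uniform((1, 256, 256, 3))  # a batch of one color image
feature_maps = conv(image)

print(feature_maps.shape)   # (1, 256, 256, 128): the "gigantic box" of responses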
So vertical filters would be fantastic there. But if the image is new and I don't know it — is vertical important? Horizontal? 45 degrees? I don't know. So I have to apply many filters, and then all the information is in there. We assume something like that happens somewhere between the retina in our eye and the visual cortex: some filtering is happening, but it's very complicated and we don't understand most of it.

OK, so you do this, and now you give it to the MLP? You're too fast. This is still too difficult. It was never just "filter it, give it to the MLP, and the MLP will do it." It's still too difficult. Was there a question? Yes — convolve with a set of filters? Yes, exactly: this is the first convolution that we applied. I have some filters here for horizontal, for vertical, for this direction, for that direction, and so on; for many, many different unit impulses we get filter responses. And then we assume: OK, now I've broken it down into many, many filters, now I give it to the MLP. That is still way too difficult, for many reasons. First of all, it's too big. Second of all — the weight sharing is here, but I get a lot of information to process, and it is still non-linear. You cannot just hand me the filter responses and say: do it.

OK, so what should I do? Whenever this happens, good scientists and good engineers and good computer scientists go to the library and grab the 101 book of the corresponding field. Signal processing 101: what did we do with signals? We filtered them, we combined them, we upsampled them, we downsampled them. Wait a minute, what did you say? We downsampled them. Oh — downsample. Can we downsample this? We can try.

So do downsampling, which people call pooling. Again, it's one of those terminologies where newcomers think: oh, we invented something, it's called pooling. No — it's downsampling; it has been around for quite some time. It's fine to put a neural network terminology on it, but we should know where it comes from. Convolution alone does not do the dividing; to enable the conquering, divide it. Downsample it. My image — actually, you added to my pain with the filtering: I had to deal with one image, and now you're giving me 128 images. Where's the help? I was expecting you to help me. OK, take this, now downsample it. Make it 128 by 128 — cut it in half. Why in half? Signal processing 101: the Nyquist sampling theorem. "I haven't taken signal processing." What have you taken? Have you taken linear algebra? "No, not really." What are you doing in AI, then? Go grab a textbook.

So I have an image — let me grab a four by four part of that gigantic image; I don't want to draw the whole thing. The numbers are: 2, 1, 1, 7; 3, 2, 5, 9; 1, 6, 8, 7; 5, 10, 10, 12. And I want to downsample. Downsampling, like filtering, has its own window — receptive field, if you like. I want to go two by two, and I want to downsample every two by two block to one by one: every four pixels become one pixel. So I have four by four, 16 values, and I want to make four pixels out of them. In the first block, I take the maximum, which is 3.
In the next block I take the maximum, which is 9. Here, the maximum is 10. And here, the maximum is 12. Done: grab the maximum — there is a small code sketch of this at the end of this passage. You want to discuss why the maximum — why not the minimum, why not the median, why not the average? We can have that discussion. The maximum is easy: it gives you the highest intensity. We see in brightness, not in darkness; the minimum would mean darkness, darker colors. Let's go with the maximum.

So I can downsample: I can take this 256 by 256 box and make it a 128 by 128 box. Is that good enough for you? No, not really, it's still too big. OK, you know what: repeat it. What do you mean? Convolution, pooling, convolution, pooling, convolution, pooling. For how long? Until you get what you want. Down to two by two? Really, I can downsample to two by two? You could, if you wanted to. Of course, this is max pooling; there are other types, but this is the one that has become established.

So now we have two components, convolution and pooling, and they should serve an MLP. Now the guy puts everything together, writes a paper, gives it to his supervisor and says: what do you think? Convolution is there to get features. Pooling is there to reduce dimensionality. And the MLP is there, of course, to classify. So it seems we now have something in place; it seems we have a chain. I still don't know — you're telling me: the image comes in, convolution, pooling, convolution, pooling, until you are comfortable, until the data is small enough that you can handle it. OK, experiment with that. But every experiment will take forever! No — because we have weight sharing. You still need GPUs; I'm not lying to you. It's not going to work on your laptop; you cannot train on ImageNet on your laptop. You still need some badass Tesla GPUs. But now it's possible. Now it's tractable. Now it comes back and gives us something.

But we still have problems — oh my God, we have so many problems; we are not out of the woods yet. What is the problem this time? So far I have the image — my kindergarten drawing with the house and a river and a mountain — and it goes into the convolution, then pooling, and then into the MLP. The problem is still non-linear, hence impossible for a (shallow) MLP. We still don't have deep networks, basically — the convolution-pooling part is deep, but the MLP is still shallow, in CNNs at least. The solution goes back to my favorite biologists, Hubel and Wiesel: cascade it. The way the Neocognitron did. The Neocognitron cascaded: simple, complex, simple, complex, simple, complex. You're doing convolution and pooling? Cascade it: convolution pooling, convolution pooling, convolution pooling. It gets smaller and smaller and smaller. We know this concept: the wavelet transform. In the late 90s you could go to any conference — there was no word about AI; everybody was talking about wavelets, multi-resolution, pyramids, image representation. In theory, you can compress an image down to a single pixel, according to wavelet theory. Convolution pooling, convolution pooling — this is exactly what the wavelet transform does; look it up. You can go down to one pixel, and if you keep the coefficients, you can reconstruct the entire image back from that one pixel. The theory is sound. Of course, in practice you cannot go down to one pixel.
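Here is the promised minimal sketch of the 2x2 max pooling example from the board, again assuming nothing beyond NumPy. The reshape trick groups the 4x4 patch into its four 2x2 blocks and takes the maximum of each:

```python
import numpy as np

# The 4x4 patch from the board.
x = np.array([[2,  1,  1,  7],
              [3,  2,  5,  9],
              [1,  6,  8,  7],
              [5, 10, 10, 12]])

# Split into 2x2 blocks and take the maximum of each block.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))

print(pooled)
# [[ 3  9]
#  [10 12]]
```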
To come back to that point: depending on the image content, maybe you get to 50 by 50 and then you have to stop — because you started with a face, and on the way back you get an elephant. Everything has a limit, but the theory was sound.

OK. So I take my image, say 500 by 500, and we apply the convolution, then the downsampling, then another convolution — it's getting deeper, of course, in the number of channels — then downsampling again, and so on, several rounds: convolution, downsampling, convolution, downsampling, convolution, downsampling. Then give it to a shallow MLP — meaning two or three layers of fully connected neurons. We know MLPs: two layers, three layers. Why should this make the problem linear? Because I'm breaking it down: I'm going from low level to mid level to high level. (There is a small sketch of this whole pipeline at the end of this section.) Again, the first thing you should do, if you have not already done so: type "CNN convolutional network visualization" and search for images. You'll see those colorful samples of what is extracted in every convolution layer. It's fantastic.

So, how much time do we have? OK, two minutes. There are many details about CNNs that we have not mentioned, but let me list some common networks that you must know by now. The first one was AlexNet, 2012. The breakthrough came with AlexNet, because that was the first time somebody went through the trouble of showing what you can do with this. Until 2012, it was theory. In 2012, for the first time, neural network people found the courage to attack the biggest problem we had, which was ImageNet: millions of images, over a million of them labeled across 1,000 classes for the challenge — cats, dogs, cars, bicycles, people. Nobody had attempted neural networks on that in a serious way; it was SVMs all over the place. Manually designed filters with an SVM had the best results; SVM with LBP was among the best. Then a student said: to hell with everybody, I will do this. You need that type of student, because as a student you don't have much to lose, do you? No reputation yet — just go insane, try something crazy. He did it: apply a neural network to ImageNet. What do you mean, we cannot? Then let's write it from scratch. Then ZFNet, 2013 — not much to say about that one. Of course Google came along: GoogLeNet, 2014; this is basically the Inception network. VGG — I cannot hear "VGG" anymore — was one of the most successful ones, because they came up with 16 and 19 layers and substantially improved the results. They said: look, we can do better. Now the race was: how do I best design a convolutional neural network such that I can train it really efficiently and get high accuracy without overfitting? And of course ResNet, which came in 2015, and many more: DenseNet and so on. These are just architectures. Somebody sat down with a fantastic tool called TensorFlow, and you put your blocks together — this is my convolution, this is my pooling, I put this connection here, this layer there; let's do two branches in parallel, one here, one here — and then you train. It has become really experimental, which is good.

So we still have a little bit about CNNs to cover, maybe half an hour; we will do that next week, and then we will start with reinforcement learning.
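To close the loop, here is the promised minimal sketch of the overall scheme from this lecture — repeated convolution plus downsampling, then a shallow MLP on top — in TensorFlow/Keras. The filter counts, layer depths, and the 10-class output are my own illustrative assumptions, not a specific published architecture:

```python
import tensorflow as tf

# Image in: one 256x256 color image.
inputs = tf.keras.Input(shape=(256, 256, 3))

# Convolution + pooling, cascaded three times: 256 -> 128 -> 64 -> 32.
x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = tf.keras.layers.MaxPooling2D()(x)   # 2x2 max pooling: cut in half
x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = tf.keras.layers.MaxPooling2D()(x)
x = tf.keras.layers.Conv2D(128, 3, padding="same", activation="relu")(x)
x = tf.keras.layers.MaxPooling2D()(x)

# The shallow MLP at the end: two fully connected layers.
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Dense(256, activation="relu")(x)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.summary()   # convolution for features, pooling for size, MLP to classify
```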