Hello! Welcome to How Convolutional Neural Networks Work. Convolutional Neural Networks, or ConvNets, or CNNs, can do some pretty cool things. If you feed them a bunch of pictures of faces, for instance, they'll learn some basic things like edges and dots, bright spots, dark spots. And because they're multi-layer neural networks, what gets learned in the first layer feeds the layers above it: in the second layer are things that are recognizable as eyes, noses, and mouths, and in the third layer, things that look like faces. Similarly, if you feed it a bunch of images of cars, at the lowest layer you'll get things, again, that look like edges. Higher up, you get things that look like tires and wheel wells and hoods. And at a level above that, things that are clearly identifiable as cars. CNNs can even learn to play video games by finding patterns in the pixels as they appear on the screen and learning what the best action is to take when they see a certain pattern. A CNN can learn to play video games, in some cases, far better than a human ever could. Not only that: if you take a couple of CNNs and set them to watching YouTube videos, one can learn objects by, again, picking out patterns, and the other can learn types of grasps. Coupled with some other execution software, this can let a robot learn to cook just by watching YouTube. So there's no doubt CNNs are powerful. Usually, when we talk about them, we do so in the same way we might talk about magic. But they're not magic. What they do is based on some pretty basic ideas applied in a clever way.

To illustrate these, we'll talk about a very simple toy convolutional neural network. What this one does is take in an image, a two-dimensional array of pixels. You can think of it as a checkerboard, where each square on the checkerboard is either light or dark. By looking at that, the CNN decides whether it's a picture of an X or of an O. So, for instance, on top there, we see an image with an X drawn in white pixels on a black background, and we would like to identify this as an X. And the O, we'd like to identify as an O.

How a CNN does this has several steps in it. What makes it tricky is that the X is not exactly the same every time. The X or the O can be shifted. It can be bigger or smaller. It can be rotated a little bit, thicker or thinner. And in every case, we would still like to identify whether it's an X or an O. The reason this is challenging is that, for us, deciding whether these two things are similar is straightforward. We don't even have to think about it. For a computer, it's very hard. What a computer sees in this checkerboard, this two-dimensional array, is a bunch of numbers: ones and minus ones. A one is a bright pixel; a minus one is a black pixel. What it can do is go through pixel by pixel and compare whether they match or not. So to a computer, it looks like there are a lot of pixels that match, but some that don't. Quite a few that don't, actually. It might look at this and say, I'm really not sure whether these are the same. And because a computer is so literal, it would say: I'm uncertain. I can't say that they're equal. Now, one of the tricks that convolutional neural networks use is to match parts of the image rather than the whole thing. If you break it down into smaller parts, or features, then it becomes much clearer whether these two things are similar. Examples of these features are little mini-images.
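To make that point concrete, here's a minimal sketch in Python with NumPy. The talk itself shows no code, so the language, the 9-by-9 image size, and the exact pixel layout here are my own assumptions:

```python
import numpy as np

# A 9x9 image of an X: white pixels (+1) on the two diagonals of a
# black (-1) background. (Illustrative layout, not from the talk.)
x_centered = -np.ones((9, 9))
for i in range(1, 8):
    x_centered[i, i] = 1        # down-right diagonal stroke
    x_centered[i, 8 - i] = 1    # up-right diagonal stroke

# The "same" X, shifted one pixel to the right.
x_shifted = np.roll(x_centered, 1, axis=1)

# A literal pixel-by-pixel comparison: plenty of pixels disagree, so a
# purely literal computer would say "I can't tell that these are equal."
matches = x_centered == x_shifted
print(f"{matches.sum()} of {matches.size} pixels match")
```

The tricks below, matching small features instead of whole images, are how a CNN gets past this literal-mindedness.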
In this case, just three pixels by three pixels. The one on the left is a diagonal line, slanting downward from left to right. The one on the right is also a diagonal line, slanting in the other direction. And the one in the middle is a little X. These are little pieces of the bigger image. And you can see, as we go through, that if you choose the right feature and put it in the right place, it matches the image exactly.

Okay, we have the bits and pieces. Now, to take a step deeper, the math behind matching these is called filtering. The way this is done is that a feature is lined up with a little patch of the image, and then, one by one, the pixels are compared: they're multiplied by each other, added up, and divided by the total number of pixels. To step through this and see why it makes sense to do it, start with the upper left-hand pixel in both the feature and the image patch. Multiplying one by one gives you a one, and we can keep track of that by putting it in the position of the pixel we're comparing. We step to the next one: minus one times minus one is also a one. And we continue to step through pixel by pixel, multiplying them all by each other. Because they're always the same, the answer is always one. When we're done, we take all these ones, add them up, and divide by nine, and the answer is one. So now we want to keep track of where that feature was in the image, and we put a one there: when we put the feature here, we get a match of one. That is filtering.

Now we can take that same feature, move it to another position, and perform the filtering again. We start with the same pattern: the first pixel matches, the second pixel matches. The third pixel does not match; minus one times one equals minus one. So we record that in our results, and we go through and do that through the rest of the image patch. When we're done, we notice we have two minus ones this time. So we add up all the values, seven ones and two minus ones, which comes to five; divide by nine, and we get 0.55. This is very different from our one, and we record the 0.55 in the position where it occurred. So by moving our filter around to different places in the image, we find different values for how well that filter matches, that is, how well that feature is represented at each position. This becomes a map of where the feature occurs.

By moving it around to every possible position, we do convolution. That's just the repeated application of this feature, this filter, over and over again. What we get is a nice map across the whole image of where this feature occurs. And if we look at it, it makes sense. This feature is a diagonal line slanting downward from left to right, which matches the downward left-to-right diagonal of the X. So if we look at our filtered image, we see that all of the high numbers, the ones and 0.77s, are right along that diagonal. That suggests the feature matches along that diagonal much better than it does elsewhere in the image. As a shorthand notation, we'll use a little X with a circle around it to represent convolution, the act of trying every possible match. And we repeat that with other features: with our X filter in the middle and with our upward-slanting diagonal line on the bottom. In each case, the map we get of where that feature occurs is consistent with what we would expect, based on what we know about the X and about where our features match.
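Here's the filtering and convolution arithmetic just described, as a short Python/NumPy sketch. The function names are mine, and I assume the feature always stays fully inside the image, with no padding:

```python
import numpy as np

def filter_match(patch, feature):
    """Line up a feature with an image patch: multiply pixel by pixel,
    add up, and divide by the total number of pixels."""
    return np.sum(patch * feature) / feature.size

def convolve(image, feature):
    """Try the feature at every possible position, recording the match
    value at each one. The result is a map of where the feature occurs."""
    h, w = feature.shape
    rows, cols = image.shape[0] - h + 1, image.shape[1] - w + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = filter_match(image[i:i + h, j:j + w], feature)
    return out

# The downward left-to-right diagonal feature from the talk.
feature = np.array([[ 1, -1, -1],
                    [-1,  1, -1],
                    [-1, -1,  1]])

# Build the 9x9 X again and map where the feature occurs.
image = -np.ones((9, 9))
for i in range(1, 8):
    image[i, i] = image[i, 8 - i] = 1
print(convolve(image, feature).round(2))
```

On a 9-by-9 image, a 3-by-3 feature fits in 7-by-7 positions, which is why the filtered image in the talk is 7 by 7, with the highest values along the matching diagonal.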
This act of convolving an image with a bunch of filters, a bunch of features, and creating a stack of filtered images is what we'll call a convolution layer; a layer because it's an operation we can stack with others, as we'll show in a minute. In convolution, one image becomes a stack of filtered images: we get as many filtered images out as we have filters. So a convolution layer is one trick that we have.

The next big trick is called pooling. This is how we shrink the image stack, and it's pretty straightforward. We start with a window size, usually two by two or three by three pixels, and a stride, usually two pixels; in practice, these just tend to work best. Then we take that window and walk it in strides across each of the filtered images, and from each window we take the maximum value. To illustrate this, we start with our first filtered image. We have our two-pixel by two-pixel window, and within that window, the maximum value is one. So we record that, then move by our stride: two pixels to the right, and repeat. Out of that window, the maximum value is 0.33, and so on, 0.55. When we get to the end, we have to be a little creative: the window hangs off the edge and we don't have a full set of pixels, so we take the max of what's there. We continue doing this across the whole image, and when we're done, what we end up with is a similar pattern, but smaller. We can still see that our high values are all on the diagonal. But instead of the 7 by 7 pixels in our filtered image, we have a 4 by 4 pixel image; it's about half as big as it was. This makes a lot of sense to do: imagine if, instead of starting with a 9 by 9 pixel image, we had started with a 9,000 by 9,000 pixel image; shrinking it makes it much more convenient to work with. The other thing pooling does is that it doesn't care where in the window the maximum value occurs, which makes it a little less sensitive to position. The way this plays out is that if you're looking for a particular feature in an image, it can be a little to the left, a little to the right, maybe a little rotated, and it will still get picked up. So we do max pooling with our whole stack of filtered images and get, in every case, a smaller set of filtered images. That's our second trick.

Third trick: normalization. This is just a step to keep the math from blowing up and to keep it from going to zero. All you do here is, everywhere in your image there's a negative value, change it to zero. So, for instance, looking back at our filtered image, we have what are called rectified linear units; that's the little computational unit that does this. All it does is step through: here's a negative value, change it to zero; another negative value, change it to zero. By the time you're done, you have a very similar-looking image, except there are no negative values, just zeros. We do this with all of our images, and this becomes another type of layer. So in a rectified linear unit layer, a stack of images becomes a stack of images with no negative values.

Now, here's where it gets really fun; the magic starts to happen when we take these layers, convolution layers, rectified linear unit layers, and pooling layers, and stack them up so that the output of one becomes the input of the next. You'll notice that what goes into each of these and what comes out looks like an array of pixels, or a stack of arrays of pixels, and because of that, we can stack them nicely.
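Continuing the sketch in the same spirit (plain NumPy, names of my own choosing), here are the pooling and rectified-linear-unit steps:

```python
import numpy as np

def relu(images):
    """Rectified linear units: everywhere there's a negative value,
    change it to zero."""
    return np.maximum(images, 0)

def max_pool(image, size=2, stride=2):
    """Walk a window across the image in strides, keeping the maximum
    value in each window. Where the window hangs off the edge, just
    take the max of what's there."""
    rows = -(-image.shape[0] // stride)   # ceiling division
    cols = -(-image.shape[1] // stride)
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            window = image[i * stride : i * stride + size,
                           j * stride : j * stride + size]
            out[i, j] = window.max()
    return out
```

Run `max_pool` on the 7-by-7 filtered image and you get the 4-by-4 image from the talk; run `relu` on any filtered image and the negative values become zeros.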
We can use the output of one as the input of the next, and by stacking them, we get these operations building on top of each other. What's more, we can repeat the stacks; we can do deep stacking. You can imagine making a sandwich that isn't just one patty, one slice of cheese, one lettuce, and one tomato, but a whole bunch of layers: double, triple, quadruple, as many as you want. Each time, the image gets more filtered as it goes through convolution layers and smaller as it goes through pooling layers.

Now, the final layer in our toolbox is called a fully connected layer. Here, every value gets a vote on what the answer is going to be. We take our stack of images, now much filtered and much reduced in size, break them out, and rearrange them into a single list, because it's easier to visualize that way. Then each of those values connects to each of the answers we're going to vote for. When we feed in an X, there will be certain values here that tend to be high; they tend to predict very strongly that this is going to be an X, so they get a strong vote for the X outcome. Similarly, when we feed a picture of an O into our convolutional neural network, there are certain values at the end that tend to be very high and to predict strongly that we're going to have an O, so they get a lot of weight, a strong vote, for the O category.

Now, when we get a new input and we don't know what it is and we want to decide, the way this works is that the input goes through all of our convolution, rectified linear unit, and pooling layers and comes out here at the end. We get a series of values, and then, based on the weight that each value gets to vote with, we get a nice weighted vote at the end. In this case, this particular set of inputs votes for an X with a strength of 0.92 and for an O with a strength of 0.51. So here, X is definitely the winner, and the neural network would categorize this input as an X. So in a fully connected layer, a list of feature values becomes a list of votes.

Again, what's cool here is that a list of votes looks a whole lot like a list of feature values, so you can use the output of one as the input of the next. So you can have intermediate categories that aren't your final votes; sometimes these are called hidden units in a neural network, and you can stack as many of these together as you want, too. But in the end, they all vote for an X or an O, and whoever gets the most votes wins. If we put this all together, then a two-dimensional array of pixels goes in, and a set of votes for each category comes out at the far end.

Now, there are some things we have glossed over here. You might be asking yourself where all the magic numbers come from. Things I pulled out of thin air include the features in the convolution layers, those convenient three-pixel by three-pixel diagonal lines and the little X, and also the voting weights in the fully connected layers. I really waved my hands about how those are obtained. In all these cases, the answer is the same: there is a trick called backpropagation. All of these are learned. You don't have to know them; you don't have to guess them. The deep neural network figures them out on its own. The underlying principle behind backpropagation is that the error in the final answer is used to determine how much the network adjusts and changes.
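Here's the fully connected voting step as one more sketch. The sizes and the random placeholder weights are hypothetical; in a real network the voting weights are learned, which is exactly what backpropagation (described next) is for:

```python
import numpy as np

# Hypothetical sizes: a stack of 3 filtered 2x2 images, flattened into
# 12 feature values, each voting for 2 categories (X and O).
rng = np.random.default_rng(0)
stack = rng.uniform(0.0, 1.0, size=(3, 2, 2))   # much-filtered, much-reduced images
flat = stack.ravel()                             # rearrange into a single list
weights = rng.uniform(0.0, 1.0, size=(flat.size, 2))  # placeholder voting weights

votes = flat @ weights       # every value casts a weighted vote
x_vote, o_vote = votes
print("X" if x_vote > o_vote else "O", votes.round(2))
```

And because `votes` is just another list of numbers, you could feed it through a second weight matrix to get hidden units, stacking fully connected layers the same way.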
So in this case, if we knew we were putting in an X: we got a 0.92 vote for an X, and since that should be a one, that's an error of 0.08. We got a 0.51 vote for an O, and since that should be a zero, that's an error of 0.51. Add those up and we get a total error of 0.59. What happens with this error signal is that it helps drive a process called gradient descent. If there's one other bit of special sauce in deep neural networks, it's the ability to do gradient descent. For each of these magic numbers, each of the feature pixels and each voting weight, they're adjusted up and down by a very small amount to see how the error changes. The amount they're adjusted is determined by how big the error is: large error, they're adjusted a lot; small error, just a tiny bit; no error, they're not adjusted at all. You have the right answer; stop messing with it.

As they're adjusted, you can think of it as sliding a ball on a hill slightly to the left and slightly to the right. You want to find the direction where it goes downhill; you want to follow that slope, the gradient, down to the very bottom, because the bottom is where you have the least error. That's your happy place. So after sliding it to the left and to the right, you find the downhill direction and you leave it there. Doing that many times, over lots of iterations, lots of steps, lets all of these values, across all the features and all the weights, settle into what's called a minimum. At that point, the network is performing as well as it possibly can; if it adjusts any of those values a little bit, its error will go up.

Now, there are some things called hyperparameters. These are knobs that the designer gets to turn, decisions the designer gets to make; they are not learned automatically. In convolution layers: how many features should be used, and how big those features should be, how many pixels on a side. In the pooling layers: the window size and the window stride. And in fully connected layers: the number of hidden, or intermediate, neurons. All of these are decisions the designer gets to make. Right now, there are some common practices that tend to work better than others, but there is no principled way, no hard and fast rules, for the right way to do this. In fact, a lot of the advances in convolutional neural networks come from finding combinations of these that work really well. In addition, there are other decisions the designer gets to make, like how many of each type of layer to use and in what order. And for those who really like to go off the rails: can we design entirely new types of layers, slip them in there, and get fun new behaviors? These are all things people are playing with to try to eke out more performance and address stickier problems with CNNs.

Now, what's really cool about these: we've been talking about images, but you can use any two-dimensional data, or even, for that matter, three- or four-dimensional data. What's important is that in your data, things closer together are more closely related than things far away. What I mean is that if you look at an image, two rows or two columns of pixels that are right next to each other are more closely related than rows or columns that are far apart.
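That nudge-and-check description translates almost literally into code. Here's a minimal sketch with a fixed nudge size; note that the talk scales the adjustment by the size of the error, and real frameworks compute the downhill direction analytically with backpropagation rather than by nudging each value:

```python
import numpy as np

def nudge_downhill(weights, error_fn, step=0.01):
    """For each magic number, slide it a little to the right and a little
    to the left, see which direction the error goes downhill, and leave
    it there (or leave it alone if it's already at the bottom)."""
    for idx in np.ndindex(weights.shape):
        base = error_fn(weights)
        weights[idx] += step           # slide right
        right = error_fn(weights)
        weights[idx] -= 2 * step       # slide left
        left = error_fn(weights)
        weights[idx] += step           # back to the start
        if right < base and right <= left:
            weights[idx] += step       # downhill is to the right
        elif left < base:
            weights[idx] -= step       # downhill is to the left
    return weights

# Repeated over lots of iterations, the values settle into a minimum:
#     for _ in range(1000):
#         weights = nudge_downhill(weights, error_fn)
```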
Now, what you can do is take something like sound and chop it up into little time steps, and for each piece of time, the time steps right before and right after are more closely related than time steps far away; the order matters. You can also chop it up into different frequency bands: bass, mid-range, treble, and you can slice it a whole lot more finely than that. Again, frequency bands that are closer together are more closely related, and you can't rearrange them; the order matters. Once you do this with sound, it looks like a picture, it looks like an image, and you can use convolutional neural networks with it. You can do something similar with text, where the position in the sentence becomes the column and the rows are words in a dictionary. In this case, the column order clearly matters, but it's hard to argue that words adjacent in a dictionary are more closely related than words far apart in all cases. So the trick here is to take a window that spans the entire column, top to bottom, and slide it left to right. That way it captures all of the words, but only a few positions in the sentence at a time.

Now, the other side of this is a limitation of convolutional neural networks: they're really designed to capture local spatial patterns, spatial in the sense that things next to each other matter quite a bit. So if the data can't be made to look like an image, they're not as useful. An example of this is some customer data where each row is a separate customer and each column is a separate piece of information about that customer, such as their name, their address, what they bought, and what websites they visited. This doesn't really look like a picture: I can rearrange those columns and rearrange those rows, and it still means the same thing; it's still equally easy to interpret. If I were to take an image and rearrange the columns and the rows, it would result in a scramble of pixels, and it would be difficult or impossible to say what the image was of; I would lose a lot of information. So, as a rule of thumb, if your data is just as useful after swapping any of the columns for each other, then you can't use convolutional neural networks.

So the take-home is that convolutional neural networks are great at finding patterns and using them to classify images. If you can make your problem look like finding cats on the internet, they're a huge asset. If you'd like to continue your study of CNNs, I recommend looking at the notes from the Stanford CS231n course by Justin Johnson and Andrej Karpathy, and checking out the writings of Christopher Olah, who is an exceptionally clear writer. And feel free to check out another presentation I did, Deep Learning Demystified, which talks about some of the properties of deep neural networks in general, for someone who is new to the topic. If you'd like to dig even deeper and play with some of these, there are a variety of toolkits. They each have their strengths and weaknesses; I invite you to dig deep into them and learn them all. Thanks for listening. Stay connected with me online; I would love to follow up with you.