Neural networks are good for learning lots of different types of patterns. To give an example of how this would work, imagine you had a four-pixel camera. So not four megapixels, but just four pixels, and it was only black and white. And you wanted to go around and take pictures of things and determine automatically whether each picture was of a solid all-white or all-dark image, a vertical line, a diagonal line, or a horizontal line. This is tricky because you can't do this with simple rules about the brightness of the pixels. Both of these are horizontal lines, but if you tried to make a rule about which pixel was bright and which was dark, you wouldn't be able to do it. So to do this with a neural network, you start by taking all of your inputs, in this case our four pixels, and you break them out into input neurons. And you assign a number to each of these depending on the brightness or darkness of the pixel. Plus one is all the way white, minus one is all the way black, and gray is zero, right in the middle. Once you have these values broken out and listed like this on the input neurons, the list is also called the input vector, or array. It's just a list of numbers that represents your inputs right now. It's a useful notion to think about the receptive field of a neuron. All this means is what set of inputs makes the value of this neuron as high as it can possibly be. For input neurons, this is pretty easy. Each one is associated with just one pixel, and when that pixel is all the way white, the value of that input neuron is as high as it can go. The black and white checkered areas show pixels that an input neuron doesn't care about. If they're all the way white or all the way black, it still doesn't affect the value of that input neuron at all. Now, to build a neural network, we create a neuron. The first thing this does is it adds up all of the values of the input neurons.
So in this case, if we add up all of those values, we get 0.5. Now, to complicate things just a little bit, each of the connections is weighted, meaning it's multiplied by a number. That number can be one or minus one or anything in between. So for instance, if something has a weight of minus one, it's multiplied and you get the negative of it, and that's added in. If something has a weight of zero, then it's effectively ignored. So here's what those weighted connections might look like. You'll notice that after the values of the input neurons are weighted and added, the final value is completely different. Graphically, it's convenient to represent these weights with white links for positive weights and black links for negative weights, and the thickness of the line is roughly proportional to the magnitude of the weight. Then after you add the weighted input neurons, they get squashed. And I'll show you what that means. You have a sigmoid squashing function. Sigmoid just means s-shaped. And what this does is you put a value in, let's say 0.5, and you run a vertical line up to your sigmoid, and then a horizontal line over from where it crosses, and where that hits the y-axis, that's the output of your function. So in this case, slightly less than 0.5. It's pretty close. As your input number gets larger, your output number also gets larger, but more slowly, and eventually, no matter how big the number you put in, the answer is always less than 1. Similarly, when you go negative, the answer is always greater than negative 1. So this ensures that the neuron's value never gets outside of the range of plus 1 to minus 1, which is helpful for keeping the computations in the neural network bounded and stable. So after you sum the weighted values of the neurons and squash the result, you get the output. In this case, 0.746. That is a neuron. So we can collapse all that down, and this is a neuron that does a weighted sum and squashes the result.
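The weighted-sum-and-squash neuron just described can be sketched in a few lines. This is only an illustration: the talk doesn't name the exact squashing function, so the sketch assumes the hyperbolic tangent, which is s-shaped and stays between minus 1 and plus 1, and the pixel values and weights below are made up.

```python
import math

def neuron(inputs, weights):
    """Weighted sum of the inputs, squashed into the range -1 to +1."""
    total = sum(x * w for x, w in zip(inputs, weights))
    # Assumption: tanh stands in for the s-shaped squashing function.
    return math.tanh(total)

# Hypothetical pixel values and connection weights, for illustration only.
pixels = [0.5, -0.25, 1.0, -0.75]
weights = [0.8, -0.2, 0.6, -0.4]
value = neuron(pixels, weights)

# Putting in 0.5 with a weight of 1 gives slightly less than 0.5.
squashed_half = neuron([0.5], [1.0])
```

A weight of zero drops an input from the sum entirely, which matches the "effectively ignored" case.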
And now instead of just one of those, assume you have a whole bunch. There are four shown here, but there could be 400 or 4 million. Now to keep our picture clear, we'll assume for now that the weights are either plus 1, white lines, minus 1, black lines, or 0, in which case they're missing entirely. But in actuality, all of these neurons that we created are each attached to all of the input neurons, and they all have some weight between minus 1 and plus 1. When we create this first layer of our neural network, the receptive fields get more complex. For instance here, each of those ends up combining two of our input neurons, and so the receptive field, the pixel values that make that first layer neuron as large as it could possibly be, now looks like pairs of pixels, either all white or a mixture of white and black, depending on the weights. So for instance, this neuron here is attached to this input pixel, which is upper left, and this input pixel, which is lower left, and both of those weights are positive. So it combines the two of those, and that's its receptive field, the receptive field of this one plus the receptive field of this one. However, if we look at this neuron, it combines this pixel, upper right, and this pixel, lower right. It has a weight of minus 1 for the lower right pixel, so that means it's most active when this pixel is black. So here is its receptive field. Now because we were careful about how we created that first layer, its values look a lot like input values, and we can turn right around and create another layer on top of it the exact same way, with the output of one layer being the input to the next layer. And we can repeat this three times or seven times or seven hundred times for additional layers. Each time the receptive fields get even more complex. So you can see here, using the same logic, now they cover all of the pixels, with more special arrangements of which are black and which are white. We can create another layer.
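Stacking layers, with the output of one layer being the input to the next, can be sketched like this. The pixel ordering, the weight tables, and the use of tanh as the squashing function are all assumptions made for the illustration; the weights are limited to plus 1, minus 1, or 0 as in the simplified picture.

```python
import math

def layer(inputs, weight_rows):
    """One neuron per row of weights: weighted sum, then squash."""
    return [
        math.tanh(sum(x * w for x, w in zip(inputs, row)))
        for row in weight_rows
    ]

# Assumed pixel order: upper left, upper right, lower left, lower right.
pixels = [1.0, 1.0, -1.0, -1.0]

# Hypothetical first-layer weights: each neuron combines two pixels.
first_weights = [
    [1, 0, 1, 0],    # upper left plus lower left
    [1, 0, -1, 0],   # upper left minus lower left
    [0, 1, 0, 1],    # upper right plus lower right
    [0, 1, 0, -1],   # upper right minus lower right
]

# Hypothetical second-layer weights: each neuron combines two first-layer neurons.
second_weights = [
    [1, 0, 1, 0],
    [-1, 0, -1, 0],
    [0, 1, 0, 1],
    [0, -1, 0, -1],
]

hidden1 = layer(pixels, first_weights)    # first layer reads the pixels
hidden2 = layer(hidden1, second_weights)  # second layer reads the first layer
```

Because every value coming out of `layer` is back in the range minus 1 to plus 1, the same function can be applied again and again for as many layers as you like.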
Again, all of these neurons in one layer are connected to all of the neurons in the previous layer, but we're assuming here that most of those weights are zero and not shown. It's not generally the case. So just to mix things up, we'll create a new layer, but if you notice our squashing function isn't there anymore, we have something new called a rectified linear unit. This is another popular neuron type. So you do your weighted sum of all your inputs, and instead of squashing, you do rectified linear units, you rectify it. So if it is negative, you make the value zero. If it's positive, you keep the value. This is obviously very easy to compute, and it turns out to have very nice stability properties for neural networks as well in practice. So after we do this, because some of our weights are positive and some are negative, connecting to those rectified linear units, we get receptive fields and their opposites. You look at the patterns there. And then finally, when we've created as many layers with as many neurons as we want, we create an output layer. Here we have four outputs that we're interested in. Is the image solid, vertical, diagonal, or horizontal? So to walk through an example here of how this would work, let's say we start with this input image shown on the left. Dark pixels on top, white on the bottom. As we propagate that to our input layer, this is what those values would look like. The top pixels, the bottom pixels. As we move that to our first layer, we can see the combination of a dark pixel and a light pixel summed together get us zero, gray. Whereas down here we have the combination of a dark pixel plus a light pixel with a negative weight. So that gets us a value of negative one there, which makes sense because if we look at the receptive field here, upper left pixel white, lower left pixel black, it's the exact opposite of the input that we're getting. And so we would expect its value to be as low as possible, minus one. 
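The rectified linear unit described above is short enough to sketch directly. The incoming value of minus 0.96 and the weights are made-up numbers, chosen only to show how a positive and a negative weight on the same input give a receptive field and its opposite.

```python
def relu(value):
    """Rectify: negative values become zero, positive values are kept."""
    return max(0.0, value)

def relu_neuron(inputs, weights):
    """Weighted sum of the inputs, then rectify instead of squash."""
    return relu(sum(x * w for x, w in zip(inputs, weights)))

# Hypothetical value coming from the previous layer.
previous = [-0.96]

kept_out = relu_neuron(previous, [1.0])   # negative sum is rectified to zero
flipped = relu_neuron(previous, [-1.0])   # negative times negative is positive, so it is kept
```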
As we move to the next layer, we see the same types of things: combining zeros to get zeros, combining a negative and a negative with a negative weight, which makes a positive, to get a zero. And here we have combining two negatives to get a negative. So again, you'll notice the receptive field of this is exactly the inverse of our input. So it makes sense that its value would be negative. And we move to the next layer. All of these zeros, of course, propagate forward. Here, this has a negative value and it has a positive weight, so it just moves straight forward. Because we have a rectified linear unit, negative values become zero. So now it is zero again too, but this one gets rectified and becomes positive. Negative times a negative is positive. And so when we finally get to the output, we can see they're all zero except for this horizontal, which is positive. And that's the answer. Our neural network said this is an image of a horizontal line. Now, neural networks usually aren't that good, not that clean. So there's a notion of, with an input, what is truth? In this case, the truth is a zero for all of these values, but a one for horizontal. It's not solid, it's not vertical, it's not diagonal. Yes, it is horizontal. An arbitrary neural network will give answers that are not exactly truth. It might be off by a little or a lot. And then the error is the magnitude of the difference between the truth and the answer given. And you can add all these up to get the total error for the neural network. So the whole idea with learning and training is to adjust the weights to make the error as low as possible. The way this is done is we put an image in, we calculate the error at the end, then we look for how to adjust those weights higher or lower to make that error go either up or down. And we, of course, adjust the weights in the way that makes the error go down.
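The total error described above, the magnitude of the difference between truth and answer added up over the outputs, can be sketched as follows. The answer values are made up for the illustration, and the output order is an assumption.

```python
def total_error(truth, answer):
    """Add up the magnitude of the difference for every output."""
    return sum(abs(t - a) for t, a in zip(truth, answer))

# Assumed output order: solid, vertical, diagonal, horizontal.
truth = [0.0, 0.0, 0.0, 1.0]    # the image is a horizontal line
answer = [0.1, 0.0, 0.2, 0.7]   # hypothetical answer from an imperfect network

error = total_error(truth, answer)  # roughly 0.6
```

A perfect answer gives a total error of zero, so training amounts to pushing this number down.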
Now the problem with adjusting the weights this way is that each time we go back and calculate the error, we have to multiply all of those weights by all of the neuron values at each layer. And we have to do that again and again, once for each weight. This takes forever in computing terms, on a computing scale. And so it's not a practical way to train a big neural network. You can imagine, instead of just rolling down to the bottom of a simple valley, we have a very high dimensional valley and we have to find our way down. And because there are so many dimensions, one for each of these weights, the computation just becomes prohibitively expensive. Luckily there was an insight that lets us do this in a very reasonable time. And that's that if we're careful about how we design our neural network, we can calculate the slope directly, the gradient. We can figure out the direction that we need to adjust the weight without going all the way back through our neural network and recalculating. So just to review, the slope that we're talking about is this: when we make a change in weight, the error will change a little bit. And that relation of the change in weight to the change in error is the slope. Mathematically, there are several ways to write this. We'll favor the one on the bottom. It's technically most correct. We'll call it dE/dw for shorthand. Every time you see it, just think the change in error when I change a weight. Or the change in the thing on the top when I change the thing on the bottom. This does get into a little bit of calculus. We do take derivatives. That's how we calculate slope. If it's new to you, I strongly recommend a good semester of calculus just because the concepts are so universal. And a lot of them have very nice physical interpretations, which I find very appealing. But don't worry, otherwise just gloss over this and pay attention to the rest and you'll get a general sense for how this works.
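The expensive approach described at the start of this passage amounts to nudging one weight, rerunning the network, and seeing how far the error moved. Here is a minimal sketch of that idea. The one-neuron stand-in network, the tanh squashing function, and all the numbers are assumptions made for the illustration.

```python
import math

def forward(pixels, weights):
    """Stand-in network: a single weighted-sum-and-squash neuron."""
    return math.tanh(sum(x * w for x, w in zip(pixels, weights)))

def error(pixels, weights, truth):
    """Magnitude of the difference between truth and the network's answer."""
    return abs(truth - forward(pixels, weights))

def slope_by_nudging(pixels, weights, truth, index, nudge=1e-6):
    """Estimate the change in error per change in one weight by recalculating."""
    nudged = list(weights)
    nudged[index] += nudge
    change_in_error = error(pixels, nudged, truth) - error(pixels, weights, truth)
    return change_in_error / nudge

# Hypothetical input, weights, and truth.
pixels = [-1.0, -1.0, 1.0, 1.0]
weights = [0.1, -0.2, 0.3, 0.4]
truth = 1.0

# One full recalculation of the network for every single weight.
slopes = [slope_by_nudging(pixels, weights, truth, i) for i in range(len(weights))]
```

With four weights this is cheap, but the cost grows with the number of weights times the size of the network, which is what makes it impractical for a big one.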
So in this case, if we change the weight by plus one, the error changes by minus two, which gives us a slope of minus two. That tells us the direction that we should adjust our weight and how much we should adjust it to bring the error down. Now to do this, you have to know what your error function is. So assume we had an error function that was the square of the weight. And you can see that our weight is right at minus one. So the first thing we do is we take the derivative, change in error divided by change in weight, dE/dw. The derivative of weight squared is two times the weight. And so we plug in our weight of minus one and we get a slope, dE/dw, of minus two. Now the other trick that lets us do this with deep neural networks is chaining. And to show you how this works, imagine a very simple, trivial neural network with just one hidden layer, one input layer, one output layer, and one weight connecting each of them. So it's easy to see that the value y is just the value x times the weight connecting them, w1. So if we change w1 a little bit, we just take the derivative of y with respect to w1, and we get x. The slope is x. If I change w1 by a little bit, then y will change by x times the size of that adjustment. Similarly for the next step, we can see that e is just the value y times the weight w2. And so when we calculate dE/dy, it's just w2. Because this network is so simple, we can calculate from one end to the other: x times w1 times w2 is the error e. And so if we want to calculate how much the error will change if I change w1, we just take the derivative of that with respect to w1 and get x times w2. So this illustrates, you can see here now, that what we just calculated is actually the product of the first derivative that we took, dy/dw1, times the derivative for the next step, dE/dy, multiplied together. This is chaining.
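The chaining example above can be checked with a few lines of code. The values of x, w1, and w2 are made up for the illustration; the sketch multiplies the two small slopes together and compares the result with what you get by nudging w1 and recalculating.

```python
def error(x, w1, w2):
    """The trivial network from the example: e = x * w1 * w2."""
    y = x * w1
    return y * w2

# Hypothetical values.
x, w1, w2 = 0.5, 2.0, -3.0

dy_dw1 = x                 # slope of y with respect to w1
de_dy = w2                 # slope of e with respect to y
de_dw1 = dy_dw1 * de_dy    # chaining: multiply the slopes of each step

# Check by nudging w1 a little and recalculating the error.
nudge = 1e-6
nudged_slope = (error(x, w1 + nudge, w2) - error(x, w1, w2)) / nudge
```

Both routes give x times w2, which is minus 1.5 for these numbers.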
You can calculate the slope of each tiny step, and then multiply all of those together to get the slope of the full chain, the derivative of the full chain. So in a deeper neural network, what this would look like is, if I want to know how much the error will change if I adjust a weight that's deep in the network, I just calculate the derivative of each tiny little step all the way back to the weight that I'm trying to calculate, and then multiply them all together. This computationally is many, many times cheaper than what we had to do before, recalculating the error for the whole neural network for every weight. Now, in the neural network that we've created, there are several types of back propagation we have to do. There are several operations we have to do. For each one of those, we have to be able to calculate the slope. The first one is just a weighted connection between two neurons, a and b. So let's assume we know the change in error with respect to b. We want to know the change in error with respect to a. To get there, we need to know db/da. So to get that, we just write the relationship between b and a, take the derivative of b with respect to a, we get the weight w, and now we know how to make that step. We know how to do that little nugget of back propagation. Another element that we've seen is sums. All of our neurons sum up a lot of inputs. To take this back propagation step, we do the same thing. We write our expression, and then we take the derivative of our endpoint, z, with respect to the step that we're propagating to, a, and dz/da in this case is just one. Which makes sense. If we have a sum of a whole bunch of elements and we increase one of those elements by one, we expect the sum to increase by one. That's the definition of a slope of one. One-to-one relation there. Another element that we have that we need to be able to back propagate is the sigmoid function. So this one's a little bit more interesting mathematically.
We'll just write it shorthand like this, the sigma function. It is entirely feasible to go through and take the derivative of this analytically and calculate it. It just so happens that this function has a nice property that to get its derivative, you just multiply it by one minus itself. So this is very straightforward to calculate. Another element that we've used is the rectified linear unit. Again, to figure out how to back propagate this, we just write out the relation. b is equal to a, if a is positive, otherwise it's zero. And piecewise, for each of those, we take the derivative. So db, da is either one, if a is positive, or zero. And so with all of these little back propagation steps and the ability to chain them together, we can calculate the effect of adjusting any given weight on the error for any given input. And so to train, then, we start with a fully connected network. We don't know what any of these weights should be. And so we assign them all random values. We create a completely arbitrary random neural network. We put in an input that we know the answer to. We know whether it's solid, vertical, diagonal, or horizontal, so we know what truth should be. And so we can calculate the error. Then we run it through, calculate the error, and using back propagation, go through and adjust all of those weights a tiny bit in the right direction. And then we do that again with another input, and again with another input for, if we can get away with it, many thousands or even millions of times. And eventually, all of those weights will gravitate. They'll roll down that many-dimensional valley to a nice low spot in the bottom where it performs really well and does pretty close to truth on most of the images. If we're really lucky, it'll look like what we started with, with intuitively understandable receptive fields for those neurons and a relatively sparse representation, meaning that most of the weights are small or close to zero. 
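Here is a minimal sketch of how those little back propagation steps chain together and drive training. It is not the network from the slides: it assumes a tiny chain of one weight, a sigmoid, a second weight, and a rectified linear unit, with made-up starting weights instead of random ones so the result is repeatable. It also assumes the logistic form of the sigmoid, 1 / (1 + e^-a), which is the form whose derivative is itself times one minus itself.

```python
import math

def sigmoid(a):
    # Assumption: logistic sigmoid, whose slope is sigmoid * (1 - sigmoid).
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w1, w2):
    """Tiny chain: weight, sigmoid, weight, rectified linear unit."""
    h = sigmoid(x * w1)
    o = max(0.0, h * w2)
    return h, o

def slopes(x, w1, w2, truth):
    """Chain the local slopes from the error back to each weight."""
    h, o = forward(x, w1, w2)
    de_do = 1.0 if o > truth else -1.0       # error is |truth - o|
    do_dsum = 1.0 if h * w2 > 0 else 0.0     # rectified linear unit: one or zero
    dsum_dw2 = h                             # weighted connection, with respect to the weight
    dsum_dh = w2                             # weighted connection, with respect to the input
    dh_dsum = h * (1.0 - h)                  # sigmoid: itself times one minus itself
    dsum_dw1 = x                             # first weighted connection
    de_dw2 = de_do * do_dsum * dsum_dw2
    de_dw1 = de_do * do_dsum * dsum_dh * dh_dsum * dsum_dw1
    return de_dw1, de_dw2

def train(x, truth, w1, w2, rate=0.01, steps=100):
    """Adjust both weights a tiny bit in the downhill direction, many times."""
    for _ in range(steps):
        de_dw1, de_dw2 = slopes(x, w1, w2, truth)
        w1 -= rate * de_dw1
        w2 -= rate * de_dw2
    return w1, w2

# Hypothetical input, truth, and starting weights.
x, truth = 1.0, 1.0
start_w1, start_w2 = 0.5, 0.5

error_before = abs(truth - forward(x, start_w1, start_w2)[1])
end_w1, end_w2 = train(x, truth, start_w1, start_w2)
error_after = abs(truth - forward(x, end_w1, end_w2)[1])
```

The error after training is lower than the error before. A real network repeats the same idea over every weight and over many different inputs.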
It doesn't always turn out that way, with tidy receptive fields and sparse weights. But what we are guaranteed is that it'll find a pretty good representation, you know, the best that it can do adjusting those weights to get as close as possible to the right answer for all of the inputs. So what we've covered is just a very basic introduction to the principles behind neural networks. I haven't told you quite enough to be able to go out and build one of your own, but if you're feeling motivated to do so, I highly encourage it. Here are a few resources that you'll find useful. You'll want to go and learn about bias neurons. Dropout is a useful training tool. There are several resources available from Andrej Karpathy, who is an expert in neural networks and great at teaching about them. Also, there's a fantastic article called the Black Magic of Deep Learning that just has a bunch of practical, from-the-trenches tips on how to get them working well. If you found this useful, I highly encourage you to visit my blog and check out several other How It Works style posts. And you can get the links for these slides as well, to use however you like. There's also a link to them down in the comments section. Thanks for listening.