When you hear about artificial intelligence, about half the time what people are actually talking about is convolutional neural networks. Understanding how they work is really helpful for getting a peek behind the curtain at the magic of artificial intelligence, so we're going to walk through it in some detail.

Convolutional neural networks take images and, from them, learn the patterns, the building blocks, that make them up. For instance, in the first layer of a network you might learn things like line segments at different angles, and at subsequent layers those get built into things like faces or elements of cars, depending on the images you train the network on. You can pair this with reinforcement learning algorithms to get algorithms that play video games, learn to play Go, or even control robots.

To dig into how these work, we'll start with a very simple example, much simpler than all of those: a convolutional neural network that can look at a very small image and determine whether it's a picture of an X or an O. Just two categories. For example, the image on the left is an 8-by-8-pixel image of an X; we want our network to classify it as an X. Similarly, with the image of the O, we want the network to classify it as an O.

Now, this is not entirely straightforward, because we also want to handle cases where the inputs are different sizes, or rotated, or drawn heavier or lighter, and every time we'd like the network to give us the correct answer. A human has no problem looking at these and deciding what to do, but for a computer this is much harder. When a computer tries to decide whether two images are equal, it goes through them pixel by pixel. Black pixels might be a minus one and white pixels a plus one, and it compares them pixel by pixel, finding the ones that match; here the red pixels are the ones that don't match. A computer looking at this would say no, these are not the same: they have some matches, but they have a lot that don't
match. One of the tricks convolutional neural networks use is to match pieces of the image instead. You can look at these pieces and shift them around a little bit, but as long as the tiny bits still match, the overall image is still considered a pretty good match. These tiny bits might look like this; we'll call them features. The one on the left looks like a diagonal arm of the X leaning left, the one in the middle looks like the center of the X where it crosses, and the one on the right looks like a diagonal arm leaning right. You can see how these different pieces, these different features of the image, match different patches within the overall image.

The math behind finding this match, behind applying features, is called filtering. It's pretty straightforward, but it's worth walking through. The way it's done is you line the feature up on the image patch you're concerned with, multiply pixel by pixel, add up the values, and then divide by the total number of pixels.
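As a rough sketch of that recipe (plain Python with invented names like `match_score`, `arm`, and `smudged`, not any library's API):

```python
# A toy version of the filtering recipe above: line a feature up on an
# image patch, multiply pixel by pixel, add up the values, and divide
# by the number of pixels. Pixels are -1 (black) or +1 (white).

def match_score(feature, patch):
    """Average of the pixel-by-pixel products of feature and patch."""
    total = 0
    count = 0
    for f_row, p_row in zip(feature, patch):
        for f, p in zip(f_row, p_row):
            total += f * p
            count += 1
    return total / count

# A 3x3 "left-leaning arm" feature: white along the main diagonal.
arm = [[ 1, -1, -1],
       [-1,  1, -1],
       [-1, -1,  1]]

# The same patch with two pixels flipped.
smudged = [[ 1, -1, -1],
           [-1,  1, -1],
           [ 1, -1, -1]]

print(match_score(arm, arm))      # a perfect match scores 1.0
print(match_score(arm, smudged))  # two bad pixels give 5/9, about 0.56
```

A perfect match averages out to one, and every mismatched pixel pulls the average down.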
That's one way to do it. For instance, we start with this feature, the arm of the X leaning left, and align it with this arm on the image. We begin with the upper left pixel and multiply the two values: one times one equals one. Because we started with the upper left pixel, we can keep track of our answers here: this pixel, when multiplied, equals one. The upper center pixel is minus one in both the feature and the image, and minus one times minus one is also one. Whenever you multiply them and get a one, that indicates a strong, perfect match. We can continue doing this through the entire feature and the entire image patch, and because they are a perfect match, every single one of these products comes back as a one. To find the overall match, we just add up all nine ones and divide by the total number, which is nine, and we get a match of one.

Now we can create another array to keep track of how well the feature, when placed in this position, matches our image. The average value here is one, so we'll put a one right there to keep track of that.

You can see what it looks like if we move this feature and align it to a different patch. Let's say we move it down to the center of the X. We go pixel by pixel and find what matches, and after a few pixels we find one that doesn't: we end up with a minus one times a plus one, giving us a minus one back. This indicates a non-match for those pixels, and as we go through the rest of the feature we can see there are a couple of pixels that don't match. So here, when we add these up and divide by nine, we get 0.55, a number less than one. It indicates a partial match, but not a perfect one.

It turns out you can go through and do this for every possible location in the image. You can chop it up into every possible image patch, compare the feature to each one, and here's what you would get in this particular case. This is what convolution is: it's taking a
feature and applying it across every possible patch of a whole image. And you can see here why it's called filtering. What we have is a map of where this feature matches the image: a strong run of plus ones along the diagonal from the lower right to the upper left, and lesser values everywhere else. It's a filtered version of the original image that shows where the feature matches.

We can represent this with notation: we just invented this little convolution operator as shorthand. And we can do this with our other features as well. We can see where our center-of-the-X feature matches; not surprisingly, it matches strongest in the center of the image. We can see where our right-leaning arm matches, and not surprisingly it matches along the diagonal from the lower left to the upper right. Now we have three filtered versions of the original image.

This is what a convolution layer in a convolutional neural network does. It has a set of features in it (it can be three, or thirty, or three hundred, or three thousand), and it takes the original image and returns a set of filtered images, one for each of the features. This is how we'll represent it. That's the number one ingredient in convolutional neural networks, the magic special sauce, the trick that handles non-exact matches. The algorithm is able to say, okay, it's not a perfect match, but it's still a pretty good match, because it does this convolution, moves the feature across the image, and finds everywhere that it might match.

Another piece of this is called pooling. We took our original image, and now we have a stack of images; what this step does is shrink them down a little bit.
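Before moving on, here is the convolution step sketched in toy code: slide one feature over every possible patch of an image and record the match score at each position (names and values invented, nothing like a real library's optimized routines):

```python
# A sketch of convolution: apply a feature at every possible position in
# an image, producing a "filtered" version of that image.

def convolve(image, feature):
    n = len(feature)                       # feature is n x n
    rows = len(image) - n + 1
    cols = len(image[0]) - n + 1
    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            total = sum(feature[a][b] * image[i + a][j + b]
                        for a in range(n) for b in range(n))
            row.append(total / (n * n))    # average product = match score
        out.append(row)
    return out

# A 4x4 image with a white diagonal, and a 2x2 diagonal feature.
image = [[ 1, -1, -1, -1],
         [-1,  1, -1, -1],
         [-1, -1,  1, -1],
         [-1, -1, -1,  1]]
feature = [[ 1, -1],
           [-1,  1]]

filtered = convolve(image, feature)
# The strongest matches (1.0) land along the diagonal of the map.
```

The output map is exactly the "filtered image" described above: high wherever the feature fits, low elsewhere.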
We start by picking a window size, usually two or three pixels, and picking a stride; two pixels has been shown to work well. Then we walk this window across the filtered images and, from each window, take the maximum value we see. This is called max pooling.

To see how this works, we start with one of our filtered images. We have our window, which is two pixels by two pixels, and within it the maximum value is one. So we create another little array to keep track of our results, and we put a one in it. Then we step the window over by our stride, which is two pixels, look at the window, choose the maximum value (in this case it's 0.33), record it, and go again. We keep doing this, recording the maximum value each time, all the way through the image, and when we're done we have something which, if you squint, looks like a shrunken version of the original. We still have the strong set of plus ones on the diagonal from upper left to lower right, and everywhere else is less than that. So it maintains the original signal but shrinks it down; it picks off the high points. This gives us a smaller image that's still similar to the original, and we can represent it with this little shrinking arrow. We can do this with each of our filtered images, and again you can see that, very roughly, the pattern of the original is maintained. So in a pooling layer, a stack of images becomes a stack of smaller images.

Now the last ingredient we need is normalization. This keeps the math from breaking by tweaking these values just a little bit: it takes everything that's negative and changes it to zero. This keeps values from becoming unmanageably large as you progress through subsequent layers. The function that does this is called a rectified linear unit, a fancy name for something that just takes anything negative and makes it zero. So a 0.77 is not negative.
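Backing up to pooling for a moment, the window-and-stride walk above can be sketched like this (the function name and the filtered values are invented for illustration):

```python
# Max pooling: pick a window size and a stride, walk the window across a
# filtered image, and keep only the maximum value seen in each window.

def max_pool(image, window=2, stride=2):
    out = []
    for i in range(0, len(image) - window + 1, stride):
        row = []
        for j in range(0, len(image[0]) - window + 1, stride):
            row.append(max(image[i + a][j + b]
                           for a in range(window) for b in range(window)))
        out.append(row)
    return out

filtered = [[ 1.00,  0.33, -0.11,  0.33],
            [ 0.33,  1.00,  0.33, -0.11],
            [-0.11,  0.33,  1.00,  0.33],
            [ 0.33, -0.11,  0.33,  1.00]]

pooled = max_pool(filtered)
# pooled keeps the strong diagonal but at half the size:
# [[1.0, 0.33], [0.33, 1.0]]
```

Notice that the 4x4 input becomes a 2x2 output, but the strong diagonal survives.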
The rectified linear unit doesn't touch it. But a minus 0.11 is negative, so it gets bumped up to zero. By the time you've gone through all of your images, all of your pixels, and done this, here's what you have: everything that was negative is now zero. It's just a nice little bit of normalization, some conditioning to keep things numerically well behaved. So a stack of images becomes a stack of images with no negative values.

Now, you can notice that the output of one layer looks like the input to the next. They're always arrays of numbers, and an image and an array of numbers are the same thing; they're interchangeable. So you can take the output of the convolution layer, feed it through the rectified linear unit layer, feed that through the pooling layer, and when you're done you have something that has had all of these operations done to it. In fact, you can do this again and again. You can imagine making a Scooby-Doo sandwich of all of these different layers, again and again and in different orders, and some of the most successful convolutional neural networks are almost accidentally discovered stacks of these that just happen to work really well, so they get used again and again. Over time, each convolution layer filters with a number of features, each rectified linear unit layer changes everything to be non-negative, and each pooling layer shrinks things. So by the time you're done, you get a very tall stack of filtered images, with no negative values, that has been shrunken down in size.

Once we've gone through several iterations of this, we run the result through a fully connected layer. This is more of a standard neural network, where every input gets connected to everything in the next layer with a weight. You can think of it as a voting process: every single pixel value that's left in these filtered, shrunken images gets a vote on what the answer should be, and this vote depends on how strongly that value tends to predict an X or an O.
When this pixel is high, is the output usually an X, or is it usually an O? For this particular input, the input was an X; here's what the (imaginary) convolved and filtered values are, and over time we would learn that the values that are high when the network sees an X get a strong vote for the X category. Similarly, if you have an input that's an O, the final pixel values that tend to be really high when the right answer is O get a strong vote for the O category. The thicknesses of these lines represent the weights, the strength of the vote between these pixels and these answers.

Now, if we get a new input that we've never seen before, these might be the final pixel values. We can take these votes, do a weighted voting process for both categories, and add them up. In this case we get a 0.92 total for X and a 0.51 total for O. 0.92 is obviously more than 0.51, so we declare X the winner, and this input gets categorized as an X.

So that's a fully connected layer. It takes a list of feature values, in this case our filtered, shrunken pixels, and turns it into a list of votes for each of our output categories, in this case X or O. These can also be stacked. You can have what are called hidden layers, like secret hidden categories in between: the first layer votes on the first set of hidden categories, those vote on the next layer, and so forth until you get to your final answers. We'll dig into this more in just a second, but these all stack onto the end.

Now, to go into the next level of detail on these neural networks, set aside our X-and-O detector for a while. There we had 8-by-8-pixel images, so 64 pixels in all. Consider now a 2-by-2-pixel image.
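Before leaving the X-and-O example, the voting process above can be sketched like this (the pixel values and weights are invented for illustration, not learned):

```python
# Fully connected "voting": each remaining pixel value casts a weighted
# vote for each category, the votes are added up, and the category with
# the biggest total wins.

def vote(pixels, weights):
    """weights maps each category name to one weight per pixel."""
    return {cat: sum(p * w for p, w in zip(pixels, ws))
            for cat, ws in weights.items()}

pixels = [0.9, 0.65, 0.45, 0.87, 0.96, 0.25, 0.64, 0.55]
weights = {"X": [0.6, 0.1, 0.2, 0.5, 0.4, 0.1, 0.3, 0.2],
           "O": [0.1, 0.5, 0.4, 0.1, 0.2, 0.6, 0.2, 0.4]}

totals = vote(pixels, weights)
winner = max(totals, key=totals.get)   # "X" wins for these made-up numbers
```

Whichever category collects the larger weighted total is declared the answer, just like the 0.92-versus-0.51 vote above.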
So, just a four-pixel camera. What we'd like to do is categorize the images it takes as either solid (all light or all dark), vertical, diagonal, or horizontal. Now, the trick here is that simple rules can't do it. Both of these are horizontal images, but their pixel values are completely opposite, so you can't say, well, if the upper left pixel is white and the upper right pixel is white, then it must be horizontal, because that rule is violated by the other one. Of course, you could write more complicated rules to handle this; the point is that when you go to larger images, you can't make simple rules that capture all the cases you want.

So how do we go about it? We take these four input pixels and break them out. We call them input neurons, but they just take the pixels and turn them into a list of numbers. The numbers correspond to brightness: minus one is black, plus one is white, zero is middle gray, and everything else is in between. This takes the little image and turns it into a list of numbers; that's our input vector.

Each of these inputs has what you can think of as a receptive field: the image that makes the value of that input as high as possible. If you look at our very top input neuron, the image that makes its value as high as possible has an upper left pixel that's white. That makes its value one, and it doesn't care what the other pixels are; that's why they're shown checkered. So each input has its own corresponding receptive field, the image that drives its value as high as it can go.

Now we're going to build a neuron, the thing people are talking about when they talk about artificial neural networks, and we'll build it bit by bit. The first thing you do to build a neuron is take all of these inputs and add them up. In this case, here's what we'd get: the neuron value at this point is 0.5. The next thing we do is add weights.
We mentioned the weighted voting process before. Here, each of these inputs gets assigned a weight between plus one and minus one, and each value gets multiplied by its weight before it's added in, so now we have a weighted sum of the input neurons. We'll represent this visually by showing positive weights as white lines and negative weights as black lines, with the thickness of the line roughly proportional to the weight; when a weight is zero, we'll leave the line out to minimize visual clutter.

So now we have a weighted sum of the inputs. The next thing we need to do is squash the result. Because we're going to do this a lot of times, it's nice if we can always guarantee the answer stays between plus one and minus one after each step; that keeps it from growing numerically large. A very convenient function for this is an s-shaped squashing function. This particular one is called the hyperbolic tangent. Confusingly, there is something else called the sigmoid; it's a little bit different, but it has the same general shape. The characteristic of this curve is that you can take your input, draw a vertical line to see where it crosses the curve, track that over with a horizontal line to the y-axis, and read off the squashed version of your number. In this case, 0.5 comes out just under 0.5, and 0.65 comes out at about 0.6. As you go up the curve, you can see that no matter how large your number gets, what comes out will never be greater than one. Similarly, it will never be less than minus one.
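Putting the pieces together, one weighted-sum-and-squash neuron might be sketched like this (the pixel values and weights are hypothetical; `math.tanh` is Python's hyperbolic tangent):

```python
import math

# One artificial neuron: a weighted sum of the inputs, squashed by the
# hyperbolic tangent so the output stays strictly between -1 and +1.

def neuron(inputs, weights):
    return math.tanh(sum(x * w for x, w in zip(inputs, weights)))

pixels = [1.0, -1.0, 1.0, -1.0]        # the toy 2x2 camera, flattened
weights = [0.5, -0.5, 0.5, -0.5]       # weights between -1 and +1

out = neuron(pixels, weights)          # tanh(2.0), roughly 0.96
```

However large the weighted sum gets, the squashed output never escapes the interval between minus one and plus one.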
So it takes the infinitely long number line and squashes it so that everything falls between plus one and minus one. We apply this function to the output of our weighted sum, and then we get our final answer. This weighted sum and squash is almost always what people are talking about when they talk about an artificial neuron.

Now, we don't have to do this just once; we can do it as many times as we want, with different weights. This collection of weighted-sum-and-squash neurons is what you can think of as a layer, loosely inspired by the biological layers of neurons in the human cortex. Each of these neurons has a different set of weights. To keep our picture really simple, we'll assume these weights are either plus one (white lines), minus one (black lines), or zero (missing entirely).

So now we have our layer of neurons, and we can see that the receptive fields have gotten more complex. If you look at the neuron at the top of the first layer, you can see how it combines the inputs from the upper left pixel and the lower left pixel. Both of the weights are positive (those lines are white), so its receptive field is this: if both of the pixels on the left are white, it has the highest value it can possibly have. If we look at the neuron at the bottom of that layer, we can see that it takes its inputs from both of the pixels on the right, but it has a negative weight connecting it to the lower right input.
So its receptive field, what maximally activates it, is a white pixel in the upper right and a black pixel in the lower right.

Now we can repeat this, because the outputs of that first layer of neurons look a whole lot like our input layer: still a list of numbers between minus one and one. So we can add additional layers, and we can do this as many times as we want, with each neuron in one layer connected to each neuron in the next layer by some weight. In this case you can see how the receptive fields get still more complex, and now we're starting to see patterns that look like the things we're interested in, the solids, verticals, diagonals, and horizontals, by combining these elements.

There's one more thing we can do. Remember our rectified linear unit? We can use different neurons here: instead of a weighted sum and squash, we can have something that takes its input and spits out zero if it's negative and the original value if it's positive. For instance, if we take the neuron in the second layer whose receptive field is all solid white (the one on the very top) and connect it with a positive weight to the rectified linear unit neuron on top, then of course what maximizes that unit is an all-solid-white input. But if we look at the neuron just below that, the one connected with a negative weight, that flips everything around, and what maximally activates it is an input
that's all solid black. Now we're really starting to get the set of patterns we can imagine using to decide what our image is.

So we connect these again to a final output layer. This output layer is the list of all the possible answers we expect to get out of our classifier. Originally it was X's and O's; now it's four categories: solid, vertical, diagonal, and horizontal. Each of the inputs into them gets a vote, but you can see that very few of them are connected; this network assumes that most of those votes are zero.

To see how this plays out, let's say we start with an input that looks like the one on the left. This is obviously a horizontal image, with a dark bar on top and a light bar on the bottom. We propagate that to the input layer, and then we propagate that to the first hidden layer. You can see, for instance, that the neuron on the very top combines two input neurons, one light and one dark, so you can imagine it summing a plus one and a minus one and getting a sum of zero. That's why it's gray: its value is zero. Now, if you look at the neuron at the very bottom of that first hidden layer, you can see that it also sums one input that's negative and one that's positive. But it's connected to one by a negative weight and the other by a positive weight, so its weighted sum is minus one plus minus one. What it's seeing is the opposite of its receptive field, which means it's maximally activated, but negatively.
So that's why it's black. We move to the next layer, and you can trace these things through: anything zero plus zero is going to get you zero. If you look at the neuron at the very bottom of this second hidden layer, you can see that it's adding up a negative and a negative, both connected by positive weights, so it's also going to be negative. That makes sense, because its receptive field is the exact opposite of the current input, so it's maximally activated, just negatively. Then, when we track this to the next layer, following that bottom pair of neurons: because the value is negative, it goes through the rectified linear unit and becomes zero, so that one's gray. But the very bottom neuron there is connected with a negative weight, so its input becomes positive, the rectified linear unit really likes it, and it gives a maximum value. So everything is zero except for that neuron on the bottom, and finally the only output that's non-zero is the horizontal one. This network classifies the input image as horizontal.

Now, there's some magic here. Where did we get those weights? Where did we get the filters in between? This is where we get down to the learning in machine learning, and learning is all about optimization. These are learned through a bunch of examples over time. We're going to set that aside for just a minute and come back to how those get learned; we need to talk about optimization first.

So consider drinking tea. There is a temperature range where it's a delightful experience: warm and delicious and comfortable. If your tea is much hotter than that, it's very painful and not good, not fun at all. And if your tea is cooler than that, it's lukewarm and really meh, really not worth your time. So the area at the top of this curve is the peak.
This is the best, and this is what we're trying to find in optimization: the best experience, the best performance. Now, if we want to find it mathematically, the first thing we do is flip the curve upside down, just because this is how optimization problems are usually formulated. But it's the same type of thing: instead of maximizing tea-drinking pleasure, we want to minimize tea-drinking suffering, finding the bottom of that valley, the lowest possible suffering.

There are a few different ways we could do this. The first is to look at every point on this curve and just pick the lowest one. Now, the trick with that is we don't actually know what the curve is beforehand, so in order to pick the lowest point we have to do exhaustive search, which in this case means: make a cup of tea, have someone drink it, ask them how they like it, make another one, ask them how they like that one, and do it again and again for every possible temperature, then pick the one with the lowest suffering, the most enjoyment. This is very effective, but it can also be very time-consuming for a lot of problems, and so we search for a shortcut. Because this is a valley, we can use our physical intuition and say: hey, what if we just had a marble and we let it roll to the bottom of this valley?
We wouldn't have to explore every single piece of the curve. This is what's behind gradient descent. The way it works is we start not knowing anything about the function. We make a cup of tea, someone tells us how they like it, and then we change the temperature a little bit. We make another cup of tea, just a little bit cooler, we ask someone how they like that one, and we find out they actually like it just a little bit less. That tells us which direction we need to go: we need to make our next cup of tea warmer. And the difference between how much they liked those two cups tells us the slope, the steepness; it gives us a sense of how much warmer to make that next cup. So we make another one and repeat the process, and then again scoot a little way off to the side, make another cup of tea, and figure out which direction we need to go: warmer to make a better cup, or cooler? We repeat this until we get to the bottom of the curve. You'll know you're at the bottom when you change the temperature just a little bit and the tea drinker says, yeah, it's exactly the same, I like that just as much as the last one. That means you're at the flat bottom of the valley. So gradient descent is the first-level trick for brewing fewer cups of tea.
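Here's a sketch of gradient descent on an invented "tea suffering" curve, estimating the slope by brewing two cups just a sliver apart (the ideal temperature of 70 and the step size are made up for the example):

```python
# Gradient descent: compare two nearby cups, estimate the slope, and
# step toward less suffering until the curve flattens out.

def suffering(temp):
    return (temp - 70.0) ** 2          # a simple valley, bottom at 70

temp = 90.0                            # first cup: much too hot
step = 0.1                             # how strongly we react to the slope
for _ in range(200):
    # slope estimated from two cups brewed 0.02 degrees apart
    slope = (suffering(temp + 0.01) - suffering(temp - 0.01)) / 0.02
    temp -= step * slope               # move downhill
# temp has now settled at the bottom of the valley, about 70
```

Each pass brews only two cups to learn the slope, instead of sweeping every possible temperature.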
There's another thing you can do, which is to use curvature. This is a more advanced method: you make your original cup of tea, then make one a little bit warmer and one a little bit cooler, and you look at how the curve of your function bends. If it's very steep and getting steeper, then you know you can take a giant step, because you're probably not anywhere close to the bottom. Then you do it again, and if the curvature is starting to bottom out, you take a smaller step, because that's the signal that you're getting closer to the bottom. This helps you finish in fewer steps, as long as your curve is relatively well behaved, which is not always the case.

So, ways this can break. Imagine we're doing this on a hot day, and it turns out that if we were to cool our tea way down, we'd get a really nice iced tea, which turns out to be even more popular with our tea drinkers. Gradient descent would never find this. Gradient descent always rolls down to the bottom of the nearest valley; it doesn't hop around to see if there are any better valleys hiding anywhere else. Another problem: let's say there's a wiggle in our curve. Something is happening in the environment, noisy buses driving by, and it affects how people enjoy their tea. We might not be able to find that very lowest dip, because we might get stuck in a dip further up the curve. Similarly, if we ask our tea drinkers to rate their tea-drinking experience on a scale from one to ten, we get discrete jumps in our function, and if you imagine a marble rolling downhill, staircases are a problem.
It doesn't always work well; the marble can get stuck on a step without making it all the way to the bottom. Now, all of these things happen in real machine learning problems. Here's another one: imagine you have really picky tea drinkers, and if the tea is anything but perfect, they hate it, hate it, hate it. So you have these plateaus on either side, and there's no signal to tell you that if you move in a little, you'll find that deep valley.

For cases like this, of course, we can always fall back to exhaustive exploration. It will find the best answer in every single one of those cases, but a lot of times we just don't have the time. If I have to brew and measure the drinking pleasure of 10 million cups of tea to get a good answer, it's not going to happen in my lifetime. So luckily there are some methods in the middle that are more sample-efficient than exhaustive exploration but a little more robust than gradient descent: things like genetic algorithms and simulated annealing. Their defining characteristic is that they have a little bit of random jumping around, a little bit of unpredictability, so it's harder to slip things by them. They all have their strengths and weaknesses; they tend to be good for different types of problems, different pathologies in your loss function, but all of them help avoid getting stuck in local minima, the little small valleys that gradient descent will get stuck in. They get away with this by making fewer assumptions. They can take a little longer to compute than gradient descent, but not nearly so long as exhaustive exploration.

You can think of gradient descent as being like a Formula One race car: if you have a really nice, well-behaved track, it is fast, but if you put a speed bump in the track, you're done. Genetic algorithms, simulated annealing, evolutionary algorithms, those are like a four-wheel-drive pickup truck. You can take a fairly rough road with those and
get where you're going. You won't get there in record time, perhaps, but you'll get there. And exhaustive exploration is like traveling on foot: there is nothing that will stop you from getting anywhere, you can travel literally anywhere, but it just might take you a really, really long time.

To illustrate how this works, imagine we have a model that we would like to optimize. Our research question: how many M&Ms are in a bag of M&Ms? Answering this is easy. You buy a bag of M&Ms, you eat it, and you count those M&Ms: 53. So great, we know how many were in the first bag. Now, when I did this, I made a mistake: I bought another bag, I tried that one, and I got a different answer. So now I can answer 53, or I can answer 57, and either way I'm only right half the time, because I can't capture both bags with one answer. I could answer somewhere in the middle, but that's never right; I have never opened a bag that had 55 M&Ms in it, so it's unclear that's the right answer either. And the situation does not improve with the more bags of M&Ms I eat; it just gets a little bit out of control. So I changed my goal from answering the question right to answering the question in a way that is less wrong.

In order to do that, I have to get really specific about what I mean by how wrong I am, and to do that I have this distance function, this deviation, which is the difference between my guess and the actual number of M&Ms in a bag. So for bag number i, we call this deviation d sub i: it's just the difference between the guess and the actual count. Then I have to take this deviation and turn it into a cost. One common way to do this is to square it. It's nice because the further away things get, the more costly they are perceived to be, and it goes up faster: if one bag is off by twice as much as another, its cost is four times as large. So it really penalizes the things that are way off; things that are close,
it doesn't penalize so much. If we don't want to overly penalize the guesses that are way out there, we could use the absolute value of the deviation instead, so being off by twice as much just costs twice as much. But really, we could use anything: the square root of the absolute value, ten to the power of the absolute value of the deviation, anything that goes up the further away you get from zero. We'll stick with squared deviation. It's super common, it has some nice properties, and it makes for a good example.

So, for the total cost of any guess we make: if I guess an estimated number of M&Ms per bag, then the loss function, this fancy curly L of that guess, is just the sum of the squared deviations associated with each bag of M&Ms, d_1 squared through d_m squared. Each deviation is actually the number of M&Ms in that bag minus the guess, so I square that, and we can write it with fancy summation notation like this. So this is my loss function. This is the total cost; this is how wrong I am when I make a guess.

Because we have computers, we can write a little bit of code and do exhaustive exploration. I can ask: if I guess anything between 40 and 70, how wrong would I be with this data? We can plot it, look at it visually, and say, hey look, there's the lowest value. Then we can ask: what is the value of the guess that gives me the lowest loss? That's what the argmin notation means right there, and this best guess is just about 55 and a half M&Ms. Problem solved.
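The exhaustive exploration just described can be sketched with a few lines of code (the bag counts below are invented; with them the argmin lands a bit above 55 rather than at the 55-and-a-half in the walkthrough):

```python
# Exhaustive exploration of the loss: try every guess in a range and
# keep the one with the lowest total squared deviation.

counts = [53, 57, 52, 58, 55, 56]             # M&Ms found in each bag

def loss(guess):
    return sum((c - guess) ** 2 for c in counts)

guesses = [g / 10 for g in range(400, 701)]   # 40.0 to 70.0 in 0.1 steps
best = min(guesses, key=loss)                 # the argmin of the loss
```

`min(..., key=loss)` is exactly the "pick the guess with the lowest loss" step, spelled out over a grid of candidate guesses.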
So this is an example of numerical optimization: we calculate this loss function and, because it's simulated, we can do exhaustive exploration and just pick off the lowest value.

Now, for this particular example there's a fun other way to find it. We know that at the bottom of this curve the slope is zero. It's the only place in the whole curve where that's true, where it's flat. We can use a little bit of calculus to find that point. Feel free to tune out if calculus is not your thing, but it's not too bad. We find the slope of the loss function with respect to our guess, set it equal to zero, and solve to find for what guess that's true. So we take our loss function, this sum of the squares of the differences between the count and the guess, and we take the derivative of it with respect to our guess. The derivative of a sum is the same as the sum of the derivatives, so we take the derivative of each term, bringing down the exponent: two times that term, summed. Because all of that is equal to zero, we can divide by two and it'll still be true. So now the sum of our deviations is zero. To further simplify, that sum splits into the sum of all of the actual bag counts minus our guess added up once for each bag; if we have m bags, that's m times our guess. Then we move that to the other side of the equals sign and divide both sides by the number of bags, m, and what we get is that our best guess is the sum total of the number of M&Ms we found in all the bags, divided by the number of bags: the average count per bag.

So this is a really slick result, and it's things like this that make people so excited about optimization. With a little bit of math and calculus you can get this nice theoretical result. Now, it's worth noting that this is only true if you use the deviation squared as your cost function.
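Written out, the derivation above looks like this (the symbols are mine: c_i for the count in bag i, g for the guess, and m for the number of bags):

```latex
\mathcal{L}(g) = \sum_{i=1}^{m} (c_i - g)^2
\qquad
\frac{d\mathcal{L}}{dg} = \sum_{i=1}^{m} -2\,(c_i - g) = 0
\;\Rightarrow\;
\sum_{i=1}^{m} c_i - m\,g = 0
\;\Rightarrow\;
g = \frac{1}{m}\sum_{i=1}^{m} c_i
```

The best guess is the average count per bag, exactly as stated.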
So that's one reason people like squared deviation so much: it tends to give nice results like this, and there is this analytical shortcut to find what the best guess is. We're going to come back to this in a few minutes.

Now, how does optimization change when we use it in our neural network to find these weights and these features? We know what our error function is: it's how wrong our guesses are. In this case we have a labeled data set, which means that a human has already looked at this input on the left and said, hey, that's a horizontal image. The truth values are what we know should be the right answer: zero votes for everything except horizontal, which should have a vote of one. Let's say that initially we've got a neural network where all the weights are random, and it gives us nonsense results. It says, well, everything has some number associated with it, but it's nothing like the right answer. We can find the error for each category and add it all up to find a total error, and this is how wrong our neural network would be for this one example. Here is our loss.
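The total-error step can be sketched as follows. The category names and the output votes here are invented for illustration, and I've used absolute differences as the per-category error; the talk doesn't pin down the exact cost used for the network.

```python
# One labeled example: the truth says "horizontal", everything else zero.
truth   = {"horizontal": 1.0, "vertical": 0.0, "diagonal": 0.0}
# Nonsense votes from a randomly weighted network (made-up numbers).
outputs = {"horizontal": 0.3, "vertical": 0.8, "diagonal": 0.5}

# Find the error for each category and add it up.
total_error = sum(abs(truth[cat] - outputs[cat]) for cat in truth)
print(total_error)  # 0.7 + 0.8 + 0.5
```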
Here's our error. Now, the idea with gradient descent is that we're not just adjusting one thing, not just our guess of the number of M&Ms. We're adjusting many things: we want to go through and adjust every single weight in every single layer to bring this error down a little bit.

That is challenging to do, because, short of finding an analytical solution like we did before, the obvious approach is to move each weight a little bit up and a little bit down and measure the slope, and that is really expensive. Consider that this is not a one-dimensional problem anymore; there might be hundreds or millions of different weights that we need to adjust. Calculating that gradient, that slope, this way requires hundreds or millions of additional passes through the neural network just to find out which direction is downhill. Enter backpropagation.

Remember, we found a nice analytical solution in the case of the M&M estimate. We would love to be able to do something like that again; if we had an analytical solution, we could jump right to the right answer. The slope in this case is the change in error for a given change in weight. There are lots of ways to write that: delta error over delta weight, d error d weight. We'll use the partial derivative of error with respect to weight, just because it's most correct, but all of these mean the same thing: if I change the weight by one, how much will the error change? What is the slope?
So in this case it would be minus two, and we would know that we need to increase the weight in order to get closer to the bottom. This tells us not only the direction we need to move, but gives us a sense of about how far we should go. It doesn't tell us exactly where the bottom is, but it tells us which way to adjust.

Now, if we do know the error function, as in this example, we can find an analytic solution and calculate that slope exactly. In this case the change in error for a given change in weight is just the derivative of our error function, which here is the weight squared. The derivative is two times the weight; the weight is minus one, and so the slope is minus two. That tells us what we need to know about which way to adjust.

With neural networks, of course, things are a lot more complex than that, but we can still analytically compute the slope of the function where we are. We don't know where the minimum is, but we can find the slope without having to recalculate the value of everything each time. Here is how it works. Imagine the world's most trivial neural network: one input, one output, and one hidden layer with one neuron in it. It's got an input x connected by a weight w1 to an intermediate value y, which is connected by a weight w2 to an output value. The intermediate value y is just x times that weight, so the derivative of y with respect to the weight is x. What that means is: if I change w1 and move it by one, then the value of y will change by the value of x, whatever x is. We have the slope of this piece of the function. Similarly, it's straightforward.
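The error-curve example above (error equals weight squared, evaluated at a weight of minus one) can be checked numerically. This is a sketch; the nudge-and-measure comparison is mine.

```python
def error(w):
    # The example error function: weight squared.
    return w ** 2

def analytic_slope(w):
    # Its derivative: two times the weight.
    return 2 * w

w = -1.0
h = 1e-6  # a small nudge up and down
numeric_slope = (error(w + h) - error(w - h)) / (2 * h)
print(analytic_slope(w))  # -2.0
print(numeric_slope)      # also -2.0, without needing the derivative
```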
We can just read off that whatever the value of y is, multiplied by the weight w2, gives us e. So if we want to find the slope of the error function for a given change in y, the answer is w2: if I change y by one unit, then the error changes by the amount w2.

Now, chaining means that we can take these two things and just multiply them together. By inspection we can see that in this little neural network, if we take x, multiply it by w1, and multiply that by w2, we get the error e. What we'd like to know is: if I change w1 by a certain amount, how much does the error change? In this case we just take that whole expression and take the derivative with respect to w1, and with a fairly trivial bit of calculus it comes out to be x times w2. What we can see then is that we can substitute in these steps: the change in y with respect to w1 is the same as x, and w2 is the same as the change in error with respect to y. What this breaks down to is that if we want to step down the chain, if we want to know how much a change in w1 affects the error, what de/dw1 is, we can break it into steps: if I change w1, how much does y change, and then if I change y, how much does the error change? This is chaining, and this is what lets us, if we know which way we want to change the error, calculate how much to change this weight to help that happen.

There's nothing to prevent us from doing this again and again. If I have a weight that's deep in my neural network, and I want to know how much my error is going to change if I tweak it up or down, I want to know the slope of my loss function with respect to that weight, then I can break it down: if I change the weight, how much does a change? If I change a, how much does b change? If I change b, how much does c change? And chain it all the way down. Now, it's called back propagation because, in order to calculate it,
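The chained slope for the trivial network can be verified numerically. The values of x, w1, and w2 here are arbitrary; the point is that the chain-rule product x times w2 matches what nudging w1 actually does to the error.

```python
# The world's most trivial network: y = w1 * x, then e = w2 * y.
x, w1, w2 = 3.0, -1.0, 0.5  # arbitrary example values

def error_from(w1):
    y = w1 * x
    return w2 * y

# Chain rule: de/dw1 = (dy/dw1) * (de/dy) = x * w2
chained = x * w2

# Brute-force check: nudge w1 and see how much the error moves.
h = 1e-6
numeric = (error_from(w1 + h) - error_from(w1 - h)) / (2 * h)
print(chained, numeric)  # both about 1.5
```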
we actually need the value at the end. We have to start with the error value in order to calculate each of these steps, all the way back down into the depths of the network. But still, we can do that.

The reason you have to go backwards is this. Let's say we want to know: if I change a, how much does the error change? And let's assume that I already know how much the error is going to change if I change b. What is this backpropagation step, the additional link I need to add to the chain? It's: how much does b change if I change a? If they're connected by a weight, then how do I incorporate that weight? We know that two neurons connected in this way are represented by b equals the weight times the value of a, and so we can take a little derivative here and get that the change in b with respect to a is w. So this step in the backpropagation chain can be represented by whatever that weight is. Cool.

Now, we know that we have sums in our neural network. That's another thing we have to handle. If I know how much my error changes with a change in z, then how much would it change with a change in one of the inputs to z, where that input goes into a sum? Well, I can write the expression for z adding up all of its inputs. If I want to know how much z changes with respect to a change in a, I just take the derivative, and it turns out to be one. So this is a trivial backpropagation step.

Now, the most interesting one of all: if I know how much the error changes with respect to a change in b, and I want to know how much it changes with the input a to that sigmoid function, then I can just say okay.
Well, a sigmoid function mathematically looks like this, and I can take the derivative of b with respect to a. One of the beautiful things about the sigmoid function is that its derivative is just the value of the function times one minus the value of the function, which is one of the reasons that sigmoids are so popular in deep neural networks. So this step is also straightforward to calculate.

In none of these steps have we had to recalculate all of the values in the neural network. We've been able to rely on things that have already been calculated: the values at each of these neurons. That's what makes backpropagation so mathematically efficient, and it's what allows us to train neural networks efficiently. That is why each element in a neural network, no matter how exotic it is, needs to remain differentiable: so that we can go through this exercise of finding what that link in the chain is when we're applying the chain rule to our derivatives, so that we can back propagate through it. And again, with rectified linear units: if we know how much the output affects a change in error and we want to know how that extends to the input, we can write the function of a rectified linear unit, take its derivative, and use that in our chain rule.

So imagine now that we have this labeled example. We calculate the answer from this random neural network that's not special at all; it gives an answer that's completely wrong. Then we back propagate the error and adjust every one of those weights a little bit in the right direction. We do that again and again, and after a few thousand iterations this stochastic gradient descent takes us from a fully connected, totally random neural network to something much more refined, something able to give answers that are much closer to the right answer.

So, coming back up to our convolutional neural networks.
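The sigmoid shortcut mentioned above is easy to verify: the derivative of the sigmoid really is the function's value times one minus its value. This sketch checks it, and the ReLU derivative, against nudge-and-measure slopes at arbitrary test points of my choosing.

```python
import math

def sigmoid(a):
    # The sigmoid: 1 / (1 + e^(-a))
    return 1.0 / (1.0 + math.exp(-a))

def relu(a):
    # Rectified linear unit: zero for negative inputs, identity otherwise.
    return max(0.0, a)

a = 0.7      # arbitrary test point
h = 1e-6     # small nudge for the numeric slope

numeric  = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)
shortcut = sigmoid(a) * (1.0 - sigmoid(a))
print(abs(numeric - shortcut) < 1e-6)  # True: the shortcut matches

# ReLU's slope is 1 for positive inputs and 0 for negative ones.
print((relu(2.0 + h) - relu(2.0 - h)) / (2 * h))   # about 1.0
print((relu(-2.0 + h) - relu(-2.0 - h)) / (2 * h)) # about 0.0
```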
These are the fully connected layers, and that's how they're trained. They can also be stacked. This backpropagation applies not only to the fully connected layers, but also to the convolutional layers and the pooling layers. We won't go through and calculate the chain rule for them, but you can do that as well. Going through this whole stack of different layers, the network gets trained on a bunch of examples, in this case labeled Xs and Os. We give it a bunch of inputs that we know the right answer to, and we let it adjust all of those connections. Not only that, it also adjusts all of the pixels in the features for each convolutional layer, so it learns not only the weights but also the features. Then, over time, those representations become something that lets it predict very well what is an X and what is an O.

On top of that, there are other things we can use optimization for, because there's a bunch of decisions here that we haven't addressed yet. How do we know how many features to put in each convolutional layer? How do we know how big those should be, how many pixels on a side? How do we choose the size and stride of our pooling windows? In our fully connected layers, how many layers do we have, and how many hidden neurons do we put in each?
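Choosing among those decisions is itself an optimization. Here is a heavily hedged sketch of exhaustively trying combinations: `train_and_score` is a hypothetical stand-in that, in real use, would train the whole network start to finish and return its accuracy; here it's faked with a toy formula just so the example runs.

```python
import itertools

def train_and_score(n_features, kernel_size, n_hidden):
    # FAKE scorer for illustration only. Real code would train the
    # network on all the images and report how well it did.
    return -(n_features - 8) ** 2 - (kernel_size - 3) ** 2 - (n_hidden - 32) ** 2

# Every combination of a few candidate hyperparameter values.
grid = itertools.product(
    [4, 8, 16],    # features per convolutional layer
    [3, 5],        # feature size: pixels on a side
    [16, 32, 64],  # hidden neurons per fully connected layer
)
best = max(grid, key=lambda params: train_and_score(*params))
print(best)  # (8, 3, 32) under this fake scorer
```

In practice the full grid is usually far too expensive, which is exactly why the recipes mentioned below get reused.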
Each of these decisions is called a hyperparameter. They are also values that we get to choose, but they're the next level up: they kind of control how everything happens below. And in order to see how well they perform, we have to train the whole thing on all the images, start to finish. But the same principles apply: we can adjust these and choose them to get the best result possible. It's worth pointing out that in a lot of cases there's just not enough computation available in the world to try out all the possibilities. So what we have right now are some recipes, some things that researchers have stumbled on, that seem to work well and get reused. But there are a lot of combinations of these hyperparameters that haven't actually been tried yet, and so there is always the possibility that some combinations work even much better than what we've seen so far.

Now, we don't have to use convolutional neural networks for just images. Any two-dimensional or three-dimensional data works well. The thing that matters is that in the data, things that are closer together are more closely related than things far away; it matters whether two things are in adjacent rows or adjacent columns. In images this is plainly the case. The location of a pixel in an array of pixels is part of the information; if you were to randomly jumble the rows and columns, you would lose the information that's there. That's what makes images well suited to convolutional neural networks. Anything that you can make look like an image may also be suited to convolutional neural networks. For instance, if you're working with audio, you have a really nice x-axis.
Your columns can be subsequent time steps. You don't want to jumble those, because the order in which things occur in time matters. And you can make your rows the intensity in different frequency bands, going from low frequency to high frequency; again, the order matters there. So by taking sound and making it look like an image, you can apply this processing to it and find patterns in the sound that you wouldn't be able to find conveniently any other way. You can also do this with text, with a little bit of work. You can make each of your rows a different word in the dictionary, and then you can make your columns the position in the sentence, the location at which each word occurs.

Now, there are some limitations here. Convolutional neural networks only capture local spatial patterns, so if your data can't be made to look like an image, or if it doesn't make sense to do so, then they're less useful. For example, imagine you have customer data with columns representing things like names, ages, addresses, emails, purchase transactions, and browsing histories, with one customer listed per row. If you were to rearrange the rows, or rearrange the columns, the information itself wouldn't really be compromised. It would all still be there; it would still be queryable, searchable, and interpretable. Convolutional neural networks don't help you here. They look for spatial patterns, so if the spatial organization of your data is not important, they will not be able to find what matters. Rule of thumb: if your data is just as useful after swapping your columns with each other, then you shouldn't use convolutional neural networks. That's a big takeaway from this.

So convolutional neural networks are really good at finding patterns and using them to classify images. That is what they are the best at. Now, the takeaway from this is not that you should go and code up your own convolutional neural networks from scratch. You can; it's a great exercise.
It's a lot of fun. But when you go to actually use them, there are a lot of mature tools out there that are helpful and just waiting to be applied. The takeaway is that you will be asked to make a lot of subtle decisions: how to prepare your data and feed it in, how to interpret the results, and how to choose these hyperparameters. For this, it helps to know what's being done with your data and what it all means, so that you can get the most out of these tools. All right. Good luck.