 We're going to try and build a comfortable intuition, a solid understanding of what backpropagation is and how it works. This is part of the end-to-end machine learning school library. You can find a lot more tutorials and courses at e2eml.school. We're going to use backpropagation to take the perfect shower. Imagine your shower head is finicky. If you don't get the flow just right, you get a terrible shower. Too little flow and it feels like it's dripping, too much flow and it feels like a fire hose. There's a very narrow range where it's comfortable. You have two valves that adjust the water flow rate, the shower handle, and the main valve for the house. By adjusting either one of them, you can adjust the flow rate through the shower head, making it faster or slower. We're going to use backpropagation to get them adjusted just right. The shower flow rate depends on the setting of both of these valves. We care about how sensitive it is to both valve settings individually. If we turn the shower handle a quarter turn, does the shower flow rate change a lot or just a little? Is it sensitive to the shower handle position or not very? We can put a number on this sensitivity. First, we start by measuring the shower handle position. Let's assume the positions on the shower handle are numbered 1 through 10. And if we adjust it from, say, a 2 to a 3, that's a change of 1 unit. We can also measure the flow rate at the shower head in units of cubic feet per minute. And then if the flow rate changes from, say, 4 to 5 cubic feet per minute, that's a change of 1 unit. We can adjust the shower handle by a certain amount and observe how much the shower flow rate changes. Let's say we change the shower handle position from 4 to 8. That's a change of 4 units. Then we notice that the shower flow rate changes from 3 to 5 cubic feet per minute. That's a change of 2 units. 
By dividing the change in shower flow rate by the change in the handle setting, we can find the sensitivity of the shower flow rate to the handle setting. In this case, it's 5 minus 3 divided by 8 minus 4, or 2 divided by 4, or one half. Sensitivity is a way of quantifying how much the shower flow rate will change if we adjust the shower handle up by one position. For each unit increase in shower handle position, we can expect the flow rate to increase by half a unit. This terminology quickly becomes awkward, but luckily math helps us out here. Sensitivity is a change in one thing per one-unit change in another thing: change in shower flow rate per one-unit change in shower handle position, for instance. This can be written delta shower flow rate divided by delta shower handle setting. Delta (Δ), the capital Greek D, is a common way to indicate a change in a variable. Or, if we want to talk about the relationship between very small changes in both of these items, we can use the calculus notation of d shower flow rate divided by d shower handle setting. This means the derivative of the shower flow rate with respect to the shower handle setting. Since the flow rate is actually sensitive to two different things, the shower handle setting and the house flow rate going into the shower handle, it's most accurate to write sensitivity as curly d (∂) shower flow rate divided by curly d shower handle setting. This means the partial derivative of shower flow rate with respect to shower handle setting, ignoring changes in the input flow rate. But they all mean the same thing. Conceptually, we're just expressing the sensitivity of the thing on the top to a unit change of the thing on the bottom. Because they are the most technically correct and because they look cool, we'll use the curly d's. To make our conversation even a little more concise, we can give things shorter names. We'll call the shower flow rate y and the shower handle position h.
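As a quick check of the arithmetic in the handle example, here is a minimal Python sketch of the sensitivity calculation. The function name `sensitivity` and its argument names are made up for illustration; the numbers are the ones from the example (handle from 4 to 8, flow rate from 3 to 5 cubic feet per minute).

```python
def sensitivity(setting_before, setting_after, flow_before, flow_after):
    """Change in flow rate per one-unit change in the setting."""
    return (flow_after - flow_before) / (setting_after - setting_before)


# Handle moved from position 4 to 8; flow rate went from 3 to 5.
handle_sensitivity = sensitivity(4, 8, 3, 5)
print(handle_sensitivity)  # 0.5
```

The same function works for any pair of measurements, as long as the setting actually changed (otherwise it divides by zero).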
We can call the flow rate in the house x, the pressure in the water main w, and the position of the main valve m. Now we can use these short names to call the sensitivity of the shower flow rate to the shower handle position ∂y/∂h. We can also find the sensitivity for the main house valve. We can measure both the main valve position m and the resulting house flow rate x, and we can observe how the latter changes when we adjust the former. By dividing our change in x by our change in m, we can calculate the sensitivity of the house flow rate to our main valve setting, ∂x/∂m. The output flow rate of the shower head, y, depends both on the shower handle setting h and the house flow rate flowing into the shower handle, x. We can put a number on this sensitivity too. Adjusting the main valve m changes the house flow rate x, which indirectly affects the shower flow rate y. By measuring the change in the house flow rate x and the corresponding change in the flow rate through the shower head y, we can find the sensitivity of the shower flow rate to increases in the house flow rate, ∂y/∂x. So now we have a few different sensitivities: ∂y/∂h, ∂y/∂x, and ∂x/∂m. This is a solid start. This is exactly what we need to start calculating what adjustments we need to make. One thing that jumps out is that we don't actually know what the sensitivity of the shower flow rate with respect to the main valve is. We don't know ∂y/∂m. However, with a little bit of puzzling we can figure it out. Imagine that we had measured ∂x/∂m to be 2. For every unit change in the main valve position, the house flow rate increases by 2. Also, assume that we had measured ∂y/∂x to be one quarter. For every unit change in the house flow rate, the shower head flow rate increases by a quarter unit. Now we can play out what would happen to the shower head flow rate if we increase the main valve position by one unit. ∂x/∂m says the house flow rate x would go up by 2.
And since we know that every unit of house flow rate increases the shower head flow rate by a quarter, we can multiply the two together to get the net result. Two times one quarter equals one half. So chaining these together, a unit change in the main valve position m gives an increase in the shower head flow rate of half a unit. This example is pretty straightforward to think about, so the power of what we just did may not be immediately obvious. We chained together two sensitivities by multiplying them: ∂y/∂m equals ∂y/∂x times ∂x/∂m. Calculus tells us that this doesn't just work for our example here, but it works all the time, everywhere, for any chain of sensitivities, no matter how long. Not surprisingly, it's called the chain rule, and it will be the secret to our success in backpropagation. Now we are in a good position to start making adjustments. We have the sensitivity of the thing we care about, the shower flow rate, with respect to the two things we can change, the main valve setting m and the shower handle setting h. Armed with these, we are ready to get our shower going. Let's say that our ideal shower flow rate is a special value of y, which we'll call y prime (y′). We can calculate our deviation, how far away from this ideal value we are, by taking y minus y′. To express our unhappiness with the current state of the water flow, we can use how far away it is from the ideal, the absolute value of y minus y′. We'll call this E, our error, and we would like for it to be zero. Our goal will be to adjust our valves, m and h, to make our shower flow rate perfect, to drive y to be y′, and to make E go to zero. Since we're in the business of calculating sensitivities, we can also find the sensitivity of E to changes in our shower flow rate. The derivative of an absolute value is straightforward. ∂E/∂y is 1 if y is greater than y′, and it's minus 1 if y is less than y′.
It's not actually defined at y equals y′, but we can just declare it to be zero. Now we can chain this with our other sensitivities to find the sensitivity of the error to our two valve positions. Now we have one thing we want to change, the error, and two ways to change it. How do we go about it? Do we try to get away with turning just one valve, or turn both valves the same amount? There are any number of ways to get the results that we want. Which one do we choose? This is where backpropagation comes in. The recipe that backpropagation uses to choose how much to adjust each valve is to weight the adjustments by sensitivity. Is the shower flow rate error twice as sensitive to the main valve position as it is to the shower handle position? Then adjust the main valve twice as much as the shower handle. The benefits of focusing adjustments on the most sensitive valve aren't obvious in this simple example, but in a more complex situation (imagine hundreds of showers connected through thousands of pipes to millions of valves) it helps encourage specialization. It leads to just a few valves and pipes being closely tied to a single shower. Knowing the sensitivities and the size of the change that we hope to make, it's tempting to try and get the perfect shower in just one go, touching each of the valves just once. There are three good reasons not to do this, though. The first reason is that we may not be able to get to a zero error. It's possible that the water main has low pressure and that even with both valves wide open, our shower would be slower than we want. This is the case more often than not. Depending on how our error, E, is defined, the best we can do might be some number other than zero. We don't know what the best value is, so we're kind of hunting in the dark for it, and it doesn't make sense to try to jump to zero in one go. The second reason is that our sensitivities are probably nonlinear.
This means that ∂y/∂h, the change in the shower flow rate with respect to the shower handle position, will be different depending on whether we're at position two or at position seven. Graphically, the relationship may not be a straight line with a constant slope. It's probably a curve. The implication of this is that if we make a large adjustment, we're jumping a long way along the curve but pretending it's a straight line. In all probability, we'll end up deviating wildly from the curve and getting an unexpected result. The third reason is that our sensitivities, remember, are partial derivatives, meaning they also depend on other values. ∂y/∂h might be a larger number when the house flow rate x is high than when the house flow rate is low. And because we're changing the main valve setting, m, at the same time, the house flow rate will almost certainly change too. The whole foundation on which we calculated our initial sensitivities will shift. The safest way to handle an uncertain, nonlinear, dynamic situation like this is to take tiny steps. Instead of trying to move the whole distance at once, move one one-hundredth or one one-thousandth or one ten-thousandth of the way. It means that you'll have to make a lot of steps, but at least your chances of settling into a good answer are much better. This fraction of the distance that we choose to nibble at is called the learning rate. Choosing a learning rate that's too large will cause us to bounce wildly around without finding a viable solution, and a learning rate that's too small will take an unreasonably long time to get to a solution. The baby steps will just be too tiny. A learning rate of one one-thousandth is not a bad place to start, but keep in mind that the ideal learning rate will be different for every problem you try. Now we finally know everything we need to adjust our valves and get our shower set up.
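To make the learning rate trade-off concrete, here is a small Python sketch with a single valve. Everything in it is an assumption chosen for illustration: the curved handle-to-flow relationship `flow`, its slope `flow_sensitivity`, the target flow of 4.9, and the three learning rates. It uses the exact slope of the made-up curve rather than a measured sensitivity, just to keep the example short.

```python
def flow(h):
    # Hypothetical nonlinear handle-to-flow curve, assumed for illustration.
    return h ** 2 / 10.0


def flow_sensitivity(h):
    # Slope of the assumed curve at handle position h (it changes with h).
    return h / 5.0


def error_history(learning_rate, steps=2000, h=2.0, target=4.9):
    """Take repeated small steps and record the error after each one."""
    errors = []
    for _ in range(steps):
        y = flow(h)
        # Sensitivity of E = |y - target| to y: +1 above, -1 below, 0 at target.
        de_dy = 1.0 if y > target else (-1.0 if y < target else 0.0)
        # Step against the sensitivity of the error to the handle position.
        h -= learning_rate * de_dy * flow_sensitivity(h)
        errors.append(abs(flow(h) - target))
    return errors


too_small = error_history(0.0001)
reasonable = error_history(0.01)
too_large = error_history(3.0)

print("too small, final error:", too_small[-1])
print("reasonable, final error:", reasonable[-1])
print("too large, worst of last 10 errors:", max(too_large[-10:]))
```

With these assumed numbers, the smallest rate has barely moved after 2000 steps, the middle one settles close to the target, and the largest keeps overshooting back and forth. The specific rates only make sense for this toy curve.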
Each valve adjustment will be proportional to the sensitivity of the error to that valve, and in the opposite direction, because we want E to go down, not up. And we multiply that by our learning rate, which we'll call by the Greek letter eta (η). So for our first iteration, our adjustment to the shower handle, Δh₁, is minus η times ∂E/∂y times ∂y/∂h. Similarly, the change to the main valve, Δm₁, is minus η times ∂E/∂y times ∂y/∂x times ∂x/∂m. Congratulations! We've just used the chain rule to backpropagate sensitivities through our little network and make small adjustments to the valves. Now we are roughly one one-hundredth of the way to a great shower. The bad news is that all of our sensitivities have now changed, and they may have changed a lot. We're going to need to recalculate them. The good news is that we have everything we need to do that already. We made our small changes to the valve settings, Δm₁ and Δh₁. All we need to do is note the changes in the house flow rate, Δx₁, and the shower flow rate, Δy₁, that resulted from these changes. We can use these to calculate the new estimates for the sensitivities in just the same way we did before. This is a computationally cheap way to recalculate our sensitivities. Now we are really off to the races. All that remains is to repeat the procedure of making a small change to each valve and updating our sensitivity estimates a hundred or a thousand times, until the shower flow rate gets close enough to the ideal that we just don't care about the difference anymore. Having a small fixed learning rate is only one way to make use of our sensitivities. If you imagine our error, E, as a valley, this approach is analogous to taking a step in the downhill direction. The steeper the hill, the larger the step. This is called gradient descent. There are other approaches that also use the results of previous steps to make better guesses about where to step next.
They have names like Adagrad and momentum, and have been shown to work really well. But every single one of them makes use of backpropagation to find the sensitivities of the error to each of the knobs we can adjust. And that's backpropagation in the world's simplest network. In our shower example, there were just two valves, two knobs to adjust. These were the weights of our little network. In a real neural network, there will likely be thousands or millions of these. In our example, we just had one hidden layer. That's the house flow rate. In a real neural network, there can be dozens of layers or more. However, the principles of backpropagation are the same. Chain sensitivities back through the network, make a small update, observe the effects, update the sensitivities through the network, and repeat. Now that you've seen how backpropagation works, you're ready for the next step: to code it up for yourself. Come and join me in the Build a Neural Network Framework course in my End-to-End Machine Learning School. The first link in the comments below will get you there. Together, we'll code up a whole neural network framework in Python, start to finish, including backpropagation. Thanks for stopping by. I hope this helps as you build your next project.
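For readers who want to see the whole loop in one place, here is a minimal Python sketch of the two-valve shower. It is not the course's framework, just an illustration under stated assumptions: the plumbing model (`house_flow`, `shower_flow`), the starting valve positions, the target flow rate of 4.0, and the learning rate are all invented, and the sensitivities are computed exactly from the made-up model instead of being re-measured after each adjustment.

```python
def house_flow(m):
    # Assumed: house flow x rises 2 units per unit of main valve position m.
    return 2.0 * m


def shower_flow(x, h):
    # Assumed: shower flow y depends on both house flow x and handle position h.
    return x * h / 20.0


def tune(m, h, target, learning_rate=0.01, steps=2000):
    """Repeat small sensitivity-weighted valve adjustments."""
    for _ in range(steps):
        # Observe the current flows.
        x = house_flow(m)
        y = shower_flow(x, h)

        # Sensitivity of the error E = |y - target| to the shower flow.
        de_dy = 1.0 if y > target else (-1.0 if y < target else 0.0)

        # Local sensitivities of the assumed plumbing model.
        dy_dh = x / 20.0  # depends on the house flow x
        dy_dx = h / 20.0  # depends on the handle position h
        dx_dm = 2.0

        # Chain rule: sensitivity of the error to each valve.
        de_dh = de_dy * dy_dh
        de_dm = de_dy * dy_dx * dx_dm

        # Small steps in the direction that lowers the error.
        h -= learning_rate * de_dh
        m -= learning_rate * de_dm
    return m, h, shower_flow(house_flow(m), h)


final_m, final_h, final_y = tune(m=3.0, h=4.0, target=4.0)
print("main valve:", final_m, "handle:", final_h, "flow:", final_y)
```

Because the error is more sensitive to whichever valve has the larger chained sensitivity at that moment, that valve gets the larger adjustment on each step, which is the weighting described earlier.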