Hello, everyone. This is Alice Gao. Welcome to our second lecture on artificial neural networks. In this lecture, I'm going to talk about how we can train a multi-layer feedforward neural network using gradient descent.

Throughout this lecture, I'm going to work with a simple three-layer feedforward neural network, one of the simplest cases I could come up with. We have three layers: an input layer, a hidden layer, and an output layer. The input layer has two real-valued inputs. The hidden layer has three nodes, but one of them, h0, is a dummy node. In the output layer, we have two output nodes, and I've labeled all the weights. Our goal is, given some training data, to learn the weights such that our actual output values, o1 and o2, are close to the expected output values given by the training data.

Let me first talk about gradient descent. Gradient descent is a well-known optimization algorithm. You might have learned it in a stats class or in another math class; if you have, the next few slides will be a review for you. Another way of thinking about gradient descent is as a local search algorithm. Think of a search space whose dimension is the number of weights we have, so quite a high-dimensional space. In this space, we're looking for the combination of weights that minimizes the error we're making on the training examples. Gradient descent is actually quite similar to greedy descent, except that with greedy descent we were working in a discrete setting where each node has a finite number of neighbors. Here, everything is real-valued, so we use the gradient, or the partial derivatives, to decide which direction to go in and also how big a step to take. In the case of a multilayer feedforward neural network, the function we're trying to minimize is the error function.
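To make the setup concrete, here is a minimal sketch of the forward pass for a 2-3-2 network like this one. I'm assuming sigmoid activation units and that the dummy node h0 outputs a constant 1 (the usual way a dummy node supplies a bias to the next layer); the slides specify the actual activation function, so treat these as illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w_hidden, w_out):
    """Forward pass for a 2-input, 3-hidden-node (h0 is a dummy),
    2-output feedforward network with sigmoid units (assumed)."""
    # Hidden layer: h0 is the dummy node with constant output 1.
    h = [1.0]
    for weights in w_hidden:            # one [w_from_x1, w_from_x2] per real hidden node
        h.append(sigmoid(weights[0] * x1 + weights[1] * x2))
    # Output layer: each output node sees all three hidden values.
    return [sigmoid(sum(w * hv for w, hv in zip(weights, h)))
            for weights in w_out]
```

With all weights zero, each sigmoid receives input 0 and outputs 0.5, which is an easy sanity check when trying the function out.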
Now, the error we're trying to minimize is the difference between the expected outputs, given by the training data, and the actual outputs, computed from the current weights in our network and the training inputs. The quote at the top of this slide gives a nice intuition about what gradient descent is doing: it is walking downhill. We're trying to find a place in the search space where the error is minimized, and we always take a step in the direction that goes down the steepest. The gradient, or the partial derivatives, help us find that direction.

Let's talk about the steps of the algorithm. First of all, we have a neural network with a lot of weights, and we initialize the weights randomly. Next, we figure out how to update the weights. For each training example, we calculate the partial derivative of the error with respect to the weight; this is the partial derivative I'm showing you right here. Then we multiply that partial derivative by eta, the learning rate. Note that I'm missing a summation here: for gradient descent, we need to calculate the quantity I've highlighted for every training example and add all of those together. Once we have this sum, we update the weight by decreasing it by the sum. Just to reiterate: for each weight, we calculate the partial derivative of the error with respect to that weight on each training example, multiply it by the learning rate eta, and add this quantity up over all of the training examples. The sum is how much we decrease that weight by. What I've just described is one weight update, and we will potentially do this update many times, so gradient descent is an iterative method.
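The update just described can be sketched in a few lines of Python. To keep the gradient easy to write down, I'm using a hypothetical one-weight model y = w*x with squared error in place of the full network; the summation over training examples and the subtraction scaled by eta are exactly the steps from the slide.

```python
def batch_gd_step(w, data, eta):
    """One gradient-descent weight update: sum the per-example partial
    derivatives of the error with respect to w, scale by eta, subtract."""
    # Error on one example: E = (y - w*x)**2 / 2, so dE/dw = -(y - w*x) * x
    total = sum(-(y - w * x) * x for x, y in data)
    return w - eta * total

# Gradient descent is iterative: repeat the update many times.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # consistent with y = 2x
w = 0.0
for _ in range(200):
    w = batch_gd_step(w, data, eta=0.02)
```

Because the toy data is exactly consistent with y = 2x, the weight settles at 2, which makes the behavior easy to verify by hand.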
After some number of steps, when the error is small enough or when the changes in the weights are really small, we can decide to terminate. Those are the high-level ideas of how gradient descent works for our neural network.

Now, when I learned gradient descent for the first time, I didn't find the formula very intuitive. It wasn't clear to me why the change has to be related to the partial derivative, and it also wasn't clear why we decrease the weight instead of increasing it. So I came up with some explanations that help me re-derive the properties of gradient descent whenever I can't remember all the details, and I thought it would be useful to share them with you. Of course, this is not mandatory. If you choose to memorize the algorithm, go ahead; that's fine. But maybe in the future there will be an occasion where, say, you're stuck in an elevator with your idol and they ask you to derive gradient descent on the fly. Maybe this will be useful then. I'm going to include the explanation in a separate short video, which you can watch if you like; the explanation is on the next slide right here. The main idea of the derivation is that I take a very simple function, the quadratic (the squared function), and use it to build some intuition about which direction we should change the value of x in when we're trying to minimize the function, and by what amount. Using this intuition, I can re-derive the formula for gradient descent any time I want.

Now that I've talked about the idea of gradient descent, here is something you might be wondering. I mentioned that gradient descent updates each weight after going through all of the examples: for each example, we determine a change in the weight, and we add all of these changes together; that sum is how much we update each weight by.
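The quadratic intuition can be checked numerically. For f(x) = x^2 the derivative is 2x, so the update x ← x − η·2x moves x toward the minimum at 0 from either side: when x is positive the derivative is positive and we decrease x; when x is negative the derivative is negative, and subtracting it increases x. A small sketch:

```python
def minimize_quadratic(x, eta=0.1, steps=100):
    """Gradient descent on f(x) = x**2, whose derivative is f'(x) = 2*x.
    Each step moves against the gradient, shrinking x toward 0."""
    for _ in range(steps):
        x = x - eta * (2 * x)
    return x
```

Starting from either a positive or a negative point, the result ends up essentially at 0, which is exactly the "decrease the variable by eta times the derivative" behavior the lecture's derivation explains.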
But what if we have a giant training set with lots and lots of examples? Doesn't that mean we don't update the weights very often? We have to go through the entire training set just to update each weight once. That doesn't seem very efficient: the weights get updated very infrequently, so we learn very slowly. How can we speed up the learning process?

One way to speed it up is to not wait until we've looked at all of the examples. Instead, let's update the weights after looking at just one single example. This variant of gradient descent is called incremental gradient descent: we look at one example, calculate the changes for all the weights, and update the weights immediately. A related version is called stochastic gradient descent. It is exactly the same as incremental gradient descent, except that each example is chosen randomly. Rather than systematically going through the examples one by one, every time we do an update we randomly choose one of the training examples and update the weights based on it.

There are pros and cons to updating after each example instead of after seeing all the examples. The pros: each step is cheaper, the weights are updated more frequently, and they become more accurate much more quickly. That's great; it means we're learning faster. The downside is that if we update after every example, the process is not guaranteed to converge, because a single example might take us in a direction that is away from the local minimum rather than towards it. So you can see there's a trade-off here: if we want to learn faster, we risk not converging to the local minimum. Can we change this trade-off in our favor? Certainly.
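Incremental and stochastic gradient descent differ from the batch version only in when the update happens: after every single example rather than after the whole training set. A sketch of the stochastic variant, again using a hypothetical one-weight model y = w*x so the per-example gradient is explicit:

```python
import random

def sgd_step(w, example, eta):
    """Update w immediately from a single example (x, y_expected),
    using the model y = w*x and squared error E = (y - w*x)**2 / 2."""
    x, y = example
    grad = -(y - w * x) * x      # dE/dw for this one example only
    return w - eta * grad

def stochastic_gradient_descent(data, w=0.0, eta=0.05, steps=1000, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        # Each example is chosen at random, not taken in order.
        w = sgd_step(w, rng.choice(data), eta)
    return w
```

On noise-free toy data the weight still settles near the true value; on real data each random example can pull the weight in a direction away from the local minimum, which is the convergence risk described above.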
So the following idea lets us manipulate this trade-off in our favor, using another variant called batched gradient descent (commonly known as mini-batch gradient descent). As you can guess from the name, it updates the weights after a batch of examples, and the size of the batch is up to us. On one extreme, if the batch contains only one example, this is exactly incremental gradient descent. On the other extreme, if the batch is the entire training set, it is the same as the original gradient descent. By changing the batch size, we can move between these two extremes. In practice, people often use batched gradient descent as follows: start with small batches, so that we learn very quickly, and then at some point start increasing the batch size. As the batch gets closer to the entire training set, the weights will eventually converge. So the batch size gives us a really nice way of trading off learning speed against convergence.

That's everything for this video. After watching it, you should be able to do the following: explain the high-level ideas of gradient descent and how we can apply it to learn the weights in a neural network, and explain how we can use variants of gradient descent to trade off learning speed and convergence to a local minimum. Thank you for watching. I will see you in the next video. Bye for now.
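The growing-batch idea can be sketched like this. The averaging of gradients within a batch and the batch-doubling schedule are my own illustrative choices, not something from the slides; the structure (small batches first, then batches approaching the full training set) is the strategy described above.

```python
import random

def minibatch_gd(data, w=0.0, eta=0.05, epochs=30, seed=0):
    """Mini-batch gradient descent on the toy model y = w*x.
    Starts with batch size 1 (fast, noisy updates) and doubles the
    batch each epoch until it covers the whole training set."""
    rng = random.Random(seed)
    data = list(data)                    # copy so shuffling is local
    batch_size = 1
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # Average the per-example gradients over the batch.
            grad = sum(-(y - w * x) * x for x, y in batch) / len(batch)
            w -= eta * grad
        # Grow the batch so later updates are less noisy and converge.
        batch_size = min(len(data), batch_size * 2)
    return w
```

With batch size 1 this behaves like incremental gradient descent; once the batch equals the full data set, each update is an ordinary gradient-descent step, so the schedule moves smoothly between the two extremes.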