Hello, everyone. This is Alice Gao. In the previous video, I talked about a little bit of the history of deep learning, then about the motivation for using a neural network rather than some other machine learning model, and finally about the structure of a neuron. In this video, we are going to look at a computational and mathematical model of a neuron.

An artificial neural network actually uses a very simple mathematical and computational model of a neuron. This model was first proposed by McCulloch and Pitts in 1943, which tells you that this is not a new idea; it's been around for a few decades already. It is a very simplified model. If you learn a little more neuroscience, or, for example, take a more specialized class on neural networks, you will know that an actual neuron works in a much more complex way than this. So there are many simplifications. But essentially, this model says that a neuron is a linear classifier.

To take a very simple example, suppose we have some data in a two-dimensional space, with some positive examples and some negative examples. If you try to use a neuron to classify these data, what are you trying to do? You are trying to fit a line which you can use to separate the data into two classes. But of course, whenever we use neural networks, the data is probably not two-dimensional; we probably have much higher-dimensional and much more complex data.

If I have to describe what it means for the neuron to be a linear classifier in one sentence, it means that the neuron decides whether to fire or not based on a linear combination of its inputs. So we calculate a linear combination of the inputs; that is what the summation in this picture is showing. And if this sum exceeds some threshold, then the neuron decides to fire.
Sometimes you can also adjust how much it fires, the strength of the output signal, as well. Let's look at this model in more detail.

Here is again our model of the neuron. We have a few input signals coming from other neurons; in general, we denote the signal coming from neuron i by a_i, and our current neuron is neuron j, with output a_j. The connection between our current neuron and the neuron sending the input signal has a weight; for example, the connection from a_i to a_j has the weight w_ij. The current neuron computes a weighted sum of its input signals: each input is multiplied by the corresponding weight, and then we take the sum over all of them. After calculating this weighted sum, neuron j applies an activation function, called g, to the weighted sum to produce the output signal. Depending on what the activation function is, the output signal may be different.

Something you might have noticed here, something maybe interesting, maybe odd, is that we have this input a_0, which is always set to one. This is sometimes called a dummy input or a bias input. The connection between this input and our current neuron also has a weight, called w_0j, the bias weight.

So why do we need this extra bias or dummy input? Some of you may know this already. Thinking back to the previous slide, the neuron is a linear classifier: we are trying to use it to represent a linear function. In a two-dimensional space, that's a line; in a higher-dimensional space, it's a hyperplane. In order to be able to represent all possible linear functions, we need the possibility of having a constant term.
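Putting the pieces of this slide together, here is a minimal Python sketch of the model just described: prepend the bias input a_0 = 1, compute the weighted sum in_j = sum over i of w_ij * a_i, and apply the activation function g. The function names and the numbers are illustrative, not from the slides.

```python
# A minimal sketch of the simple neuron model described above.
# Inputs a_i come from other neurons; a_0 = 1 is the dummy/bias input.

def neuron_output(inputs, weights, g):
    """Compute a_j = g(sum_i w_ij * a_i), with the bias input prepended."""
    a = [1.0] + list(inputs)                        # a_0 = 1 is the bias input
    in_j = sum(w * x for w, x in zip(weights, a))   # weighted sum in_j
    return g(in_j)                                  # apply the activation g

def step(x):
    """A step activation with threshold 0 (used as an example below)."""
    return 1.0 if x >= 0 else 0.0

# weights[0] is the bias weight w_0j; the rest are w_1j and w_2j.
# Weighted sum: 0.2*1 + 1.0*0.5 + 0.3*(-1.0) = 0.4, which exceeds 0.
print(neuron_output([0.5, -1.0], [0.2, 1.0, 0.3], step))  # prints 1.0
```

Changing the bias weight shifts the threshold the weighted sum has to clear, which is exactly the role of the constant term discussed next.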
With a linear function, each input is multiplied by some weight; that's the coefficient for that input. But you might also have a constant term. So how do we get this constant term? The bias input allows us to have a constant term in the linear function that we're trying to represent.

In this simple model of a neuron, the activation function is a pretty important component. Let's look at this component more closely and think about what desirable properties we want the activation function to have, and then, given those, what suitable activation functions we can use. So how should we choose the activation function for our neural network? Let's look at three desirable properties that we probably want it to have.

The first property is that the activation function should be nonlinear. The reason is that we are trying to use our artificial neural network to model complex relationships, and complex relationships are often nonlinear. A linear function is a really simple function, and most complex things in the world are nonlinear. So how do we get the neural network to ultimately represent a nonlinear function? Let's imagine that we choose a linear activation function. What's the consequence of that? Well, we have a bunch of neurons, say in a particular layer, and each neuron computes a linear combination of its inputs and then feeds that linear combination, the weighted sum, into a linear activation function. Then we take those output signals and maybe feed them into more neurons. This is what a neural network looks like. So this basically means we are stacking a bunch of linear functions together. And no matter how complex our neural network is, no matter how we stack multiple linear functions together, we're going to end up with a linear function. That's just how linear functions work.
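To see this concretely, here is a tiny numeric check, with made-up coefficients: composing two linear "layers" behaves exactly like a single linear function.

```python
# Two linear functions composed: f(x) = a1*x + b1, then g(x) = a2*x + b2.
# Their composition g(f(x)) = (a2*a1)*x + (a2*b1 + b2) is again linear.

a1, b1 = 2.0, 1.0    # made-up weights for the first linear "layer"
a2, b2 = -3.0, 0.5   # made-up weights for the second

f = lambda x: a1 * x + b1
g = lambda x: a2 * x + b2

# The equivalent single linear function:
a, b = a2 * a1, a2 * b1 + b2
h = lambda x: a * x + b

for x in [-1.0, 0.0, 2.5]:
    assert g(f(x)) == h(x)   # identical for every input we try
```

However many linear layers you stack, the same collapse happens, so the network never gains any expressive power without a nonlinearity in between.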
So if we don't have a nonlinear component in the neural network, we are never going to end up with a nonlinear function overall. That's why we want to choose nonlinear activation functions. Once we do that, a neural network with multiple layers of neurons is essentially interleaving linear and nonlinear functions: the linear parts are where we compute the weighted sums, we feed those weighted sums into a nonlinear activation function, and then we maybe feed the results into other neurons, which is again a layer of linear functions followed by a layer of nonlinear functions.

The second desirable property is kind of obvious, but we should still state it: we should choose the activation function so that it mimics the behavior of real neurons. How should the activation function do that? Well, a neuron takes the input signals, does some computation, and feeds the result into the activation function. Typically, in a very simplified way, we can understand the behavior of a neuron as follows: if the weighted sum of the input signals is large enough, the neuron fires, or sends a high output signal; otherwise, the neuron does not fire, or sends a low output signal. So typically, we want the activation function to look almost like a step function, where we have some sort of threshold and, given this threshold, the neuron either fires or not. As you will see later, a lot of activation functions are similar to a step function, but not exactly the same. Wanting the activation function to mimic the behavior of a real neuron doesn't mean we always have to choose something that looks exactly like a step function. We want to mimic the behavior, but it could be that there's no hard threshold.
Maybe for every possible weighted sum of the input signals, the neuron fires an output signal of some amount; it doesn't have to be 0 or 1.

The third desirable property says that the activation function should be differentiable almost everywhere. The reason for this comes from the perspective of learning. We are going to train a neural network using optimization algorithms such as gradient descent, and many such optimization algorithms require the function to be differentiable, because we need to calculate the gradient, which is basically the derivative, and use it to decide in which direction to change the parameter values during optimization. There may be other desirable properties that we want activation functions to have as well.

Now, given these desirable properties, let's look at a few examples of activation functions. The first activation function we're going to discuss is, not surprisingly, the step function. I already implicitly mentioned it when I talked about mimicking the real behavior of a neuron. I've plotted the step function right here so you can see some of its properties. First of all, it's very simple to use and to understand, and it does, in a very simplified way, mimic the real behavior of a neuron. But one critical problem with the step function is that it's not differentiable: it has a discontinuity at x = 0. So we cannot use it with gradient descent; if we use this activation function, we won't be able to learn the weights of a neural network using gradient descent. Honestly, the step function is not used in practice. But in lectures, I'm going to use it quite often, because it's simple to use, easy to calculate, and useful for explaining some concepts.

Next, let's look at the sigmoid function. The sigmoid function was a really popular choice of activation function for quite a while.
But at some point, people discovered some serious problems with the sigmoid function, so they started using other activation functions. We'll discuss the problem in a moment. In this plot, I've drawn two versions of the sigmoid curve: the flatter curve, the one in red, uses a relatively small value of k, and the black, steeper curve uses a relatively larger value of k.

So what are some properties of the sigmoid function? One property, as you can probably see intuitively, is that the sigmoid function approximates the step function. That's one of the original reasons people wanted to use it as the activation function: if we cannot use the step function directly, we should use something that approximates it. In this general version of the sigmoid function, you can change the value of k, and as k increases, the sigmoid becomes steeper and closer to the step function, so it's a better and better approximation of the step function as k increases.

Another property of the sigmoid function is that it gives us clear and bounded predictions: as x becomes very large or very small, the value of g(x) gets very close to one or zero. Often, when we have a classification problem, we want the prediction to be bounded within this range, at most one and at least zero, which makes our computation much easier. The sigmoid function is also differentiable, which allows us to use gradient descent to learn the weights of the neural network if we use it as the activation function.

So far, the sigmoid function probably looks like a pretty good choice. Next, let's talk about some problems with this function. One problem with the sigmoid function is that it causes what's called the vanishing gradient problem. If you look at this function, when the value of x is very large or very small, the value of g(x) changes very little; the curve is quite flat.
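To make the flatness concrete, here is a small sketch assuming the general form g(x) = 1 / (1 + e^(-k*x)), whose derivative is g'(x) = k * g(x) * (1 - g(x)):

```python
import math

def sigmoid(x, k=1.0):
    """General sigmoid g(x) = 1 / (1 + e^(-k*x)); larger k means a steeper curve."""
    return 1.0 / (1.0 + math.exp(-k * x))

def sigmoid_grad(x, k=1.0):
    """Derivative g'(x) = k * g(x) * (1 - g(x))."""
    s = sigmoid(x, k)
    return k * s * (1.0 - s)

# Outputs are bounded in (0, 1), and the gradient shrinks for large |x|.
print(round(sigmoid(10.0), 6))       # very close to 1
print(round(sigmoid_grad(0.0), 6))   # 0.25, the steepest point of the curve
print(sigmoid_grad(10.0) < 1e-4)     # True: the curve is nearly flat here
```

The last line is the vanishing gradient problem in miniature: far from zero, the derivative is tiny, so weight updates based on it are tiny too.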
So in those regions, the value of g(x) responds very little to changes in x. What does that mean? It means the gradient, which is basically the derivative of the function, is going to be very small. Because of this, when the value of x is very large or very small, the network either stops learning or learns very, very slowly: the gradient tells us how much to update the weights in each step, and if those updates are very small, then the network changes very little. This is one of the classic problems with the sigmoid function, called the vanishing gradient problem, and it motivated people to look for other activation functions that don't have this problem. Finally, compared to the step function, the sigmoid function is pretty computationally expensive to calculate. As I will discuss later, when we learn the backpropagation algorithm, and as you must know from many stories and news articles, it usually takes a long time to train a neural network. So we often prefer an activation function that is cheap to calculate over one that is expensive.

Next, let's talk about two versions of the rectified linear unit. First of all, the classic version; I've plotted the classic rectified linear unit here for you. A couple of properties of this function. First, it's computationally efficient: just by looking at the expression, you can see that it's pretty easy to calculate the result. Being computationally efficient means that when we use it to train a neural network, the network converges pretty quickly, which is a really nice property. This function is also, of course, nonlinear, and it is differentiable almost everywhere. So these are the good properties of ReLU. But recall that one critical problem with the sigmoid function was the vanishing gradient problem. Does ReLU fix this problem?
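As a quick sketch, here are the classic ReLU and the leaky variant that comes up next, with their derivatives. The slope alpha = 0.01 on the negative side is a common but illustrative choice, not a value from the slides.

```python
def relu(x):
    """Classic rectified linear unit: g(x) = max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for x > 0, otherwise 0
    (undefined at exactly x = 0; implementations just pick a value there)."""
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    """Leaky variant: a small positive slope alpha on the negative side."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Derivative of leaky ReLU: 1 for x > 0, otherwise alpha (nonzero)."""
    return 1.0 if x > 0 else alpha

print(relu_grad(3.0))         # 1.0: constant gradient on the positive side
print(relu_grad(-3.0))        # 0.0: the gradient dies on the negative side
print(leaky_relu_grad(-3.0))  # 0.01: still nonzero, so learning continues
```

Reading the derivatives off this sketch previews the whole comparison: both variants have a constant gradient of 1 for positive inputs, and they differ only in what happens below zero.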
Well, it seems to fix it on the positive side: when x is positive, this function has a constant derivative. But on the negative side, we still have a similar problem, also called the dying ReLU problem. When the value of x is zero or negative, you can see that the gradient, the derivative, becomes zero, and the network cannot learn. So we fixed the problem partially, but not completely.

Finally, we have another version of the rectified linear unit, called leaky ReLU. The main difference from the classic version is that on the negative side, when x is negative, we have a line with a small positive slope, rather than a completely horizontal line. Leaky ReLU has the same nice properties as the classic rectified linear unit, but it fixes the dying ReLU problem in the negative region: even when x is negative, the derivative is not zero, so the network can keep learning for negative input values.

That's everything I want to say in this video. Let me summarize. After watching this video, you should be able to do the following: describe the simple mathematical model of a neuron and explain why a neuron is a linear classifier; describe some desirable properties of an activation function; and give four examples of activation functions, then compare and contrast their properties. Thank you for watching. I will see you in the next video. Bye for now.