Alright, I'm going to explain this in multiple passes. The first pass will be a high-level overview of the whats, wheres, and hows of activation functions. In the second pass I'll fill in the gaps with more details, and we'll walk through a few examples. Exciting stuff coming up, so let's jump into it.

Case 1. We have this neural network here with a single hidden layer. Nothing fancy. Take the inputs, multiply them by the weights, add the bias term, and just pass the result to the next layer. This neural network can learn how to classify data that is separable by a line. Nice. But what if the data is slightly more complicated? I got this. I got this. Yeah. Wait, wait. No, no. No. Why won't it fit? Not so easy anymore, is it? Our network is too simple to capture the patterns in the data. So instead of just passing the data straight to the next layer, let's pass it through a non-linear function. Oh, yeah. The network was able to classify this data with a quadratic curve. This function that we passed the data through is called an activation function. It could be a sigmoid, a tanh, a ReLU, a leaky ReLU, a parametric ReLU, swish, ELU, maxout, and the list goes on. Any one of these functions would have worked here.

But what if the data was even more complicated? When data gets complicated, well, we add more layers. Let's say we use a sigmoid function for this and see how that turns out. Okay, nicer, a bit better... wait, wait, why? Even though the network is looking at more examples, it's not learning anything. This is the vanishing gradient problem. It happens because of the sigmoid function: it squeezes information, so during the backpropagation step the gradients become smaller and smaller until eventually they vanish. No gradients means no learning. A remedy is to use an activation function that doesn't squeeze information, like ReLU. Let's use ReLU in every neuron and see what happens. Okay. Nice. Oh yeah, going to town. ReLU works well in practice, and this is also why we use it almost everywhere today.

But in a few random cases, we may get this. Yes, yes, okay... why do I still have this problem? Looks like we're running into a similar problem of no learning, like we did before. This is the dying ReLU problem. It's similar to the vanishing gradient problem: back then the problem was caused by the sigmoid squeezing its input; now the problem is caused by ReLU completely blocking inputs less than zero. So a solution is to introduce some activation even in the negative case. ELU and leaky ReLU handle this pretty well. And so the day is saved.

But what about the activation of the output neurons? Well, that depends on the problem we're solving. In classification, we use the softmax output. The number of neurons in this layer is the number of classes, and the values represent the probability of belonging to a particular class. In the case of regression problems, we need a real-number output, so we don't tend to use activation functions at all. If the network spits out a single real number, we use one output neuron. If it spits out two real numbers, we use two output neurons with no activation. And I think you know where this is going.

That was the first pass over activation functions. We went over four cases and took a look at the softmax activation for the output layer. Now let's go through these cases again and add some more details.
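To make that first pass a bit more concrete, here is a minimal NumPy sketch of the kind of network described above: one hidden layer with a swappable activation and a softmax output layer. The layer sizes, toy data, and helper names are my own illustrative choices, not something from the video.

```python
import numpy as np

# Hidden-layer activations mentioned above (illustrative implementations).
def sigmoid(z):                     # squeezes values into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):                        # passes positives, blocks negatives
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):      # keeps a small slope for negatives
    return np.where(z > 0, z, alpha * z)

def softmax(z):                     # output activation for classification
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, W1, b1, W2, b2, activation):
    h = activation(x @ W1 + b1)     # hidden layer: linear step + non-linearity
    return softmax(h @ W2 + b2)     # output layer: one probability per class

# Toy example: 4 samples, 2 features, 3 hidden units, 2 classes.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2))
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)

print(forward(x, W1, b1, W2, b2, relu))   # each row sums to 1: class probabilities
```

Swapping `relu` for `sigmoid` or `leaky_relu` in the `forward` call is all it takes to change the hidden activation.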
Back to case one: the simple neural network with one hidden layer and no hidden activation. The output is softmax activated, and this leads to a line separator. How is this the case? Well, the output neuron can be written as w2·h + b2, where the w's and b's are weights and biases, and h is the output of the hidden layer. But we know that h is w1·x + b1, so we can substitute that into the output equation. Then we just expand the brackets and looky looky what we have here: the output is a linear equation of the input. This output is passed through the softmax activation. Since there are two output neurons, if we call them o1 and o2, the activation of the first can be written as e^o1 divided by the sum e^o1 + e^o2. We can divide both the numerator and the denominator by e^o1, and then substitute in the output equations, each of which is linear in the input, say o1 = u1·x + v1 and o2 = u2·x + v2. When we expand the terms, u1 minus u2 can be written as some u0, just another constant, and v1 minus v2 as some constant v0. In the end, it essentially boils down to a sigmoid function of a linear expression in x. If we use these values to plot a boundary, it's a straight line. Basically, it boils down to logistic regression. This animation captures it pretty well, and for a more psychedelic visual on logistic regression, I have an entire video on it, so you can check that out.

Now case two. We have data that can't be separated by a line, and what we did before was add a sigmoid activation. The output is now more complicated and we get better-fitting decision boundaries. Cool.

Case three: even more complicated data. For this, we increase the number of layers. But with functions like sigmoid that squeeze data, this leads to the vanishing gradient problem. Why does this happen? Let's look at the sigmoid function. At points with a large positive or negative value, the gradient comes really close to zero. This isn't good, because in neural networks these gradients are backpropagated. Look at this neuron in a later layer and say it has a gradient that is near zero. The neuron in the layer before it will have an even smaller gradient. This low value propagates backwards until finally it becomes zero. Once the gradient becomes zero, the neuron is useless and there is no learning. This is the vanishing gradient problem, and I think you can see why it's a problem. To solve it, we need to understand the root cause: the squeezing nature of the sigmoid function. What I mean is that the sigmoid takes a real number and squishes it into a small fixed range, zero to one. This squeezing nature creates small gradients, and we see the same with the tanh function, which squeezes real numbers between negative one and positive one. The solution is to use a function that doesn't squeeze values, like ReLU. And that's why we used it.

But in case four, we saw a situation where we might run into the dying ReLU problem. Why is this the case? Well, let's take a look at the ReLU function. For positive inputs, it lets information pass through unfiltered. For negative inputs, it's completely blocked, and that is the problem. During training, there may come a time when the bias becomes very negative, so w·x + b is negative for most neurons. Most neurons are off during the forward step, and hence most neurons are also off during the backward step.
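Here is a small numerical check of the case-one derivation, with made-up weights rather than anything from the video: two stacked linear layers collapse into a single linear map of the input, and a two-class softmax is just a sigmoid of the difference of the two outputs.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=2)                                   # one input with 2 features
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)     # hidden layer, no activation
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)     # two output neurons

# Claim 1: stacking linear layers is still linear in x.
o = W2 @ (W1 @ x + b1) + b2                  # o = (W2 W1) x + (W2 b1 + b2) = U x + v
U, v = W2 @ W1, W2 @ b1 + b2
print(np.allclose(o, U @ x + v))             # True

# Claim 2: softmax over two outputs reduces to a sigmoid of their difference.
softmax_o1 = np.exp(o[0]) / (np.exp(o[0]) + np.exp(o[1]))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(np.allclose(softmax_o1, sigmoid(o[0] - o[1])))     # True
```

Both prints come out True, which is exactly why the no-activation network can only draw a straight-line boundary.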
And what's worse is that if new inputs are not positive enough to overcome the bias, these neurons remain dead, and dead neurons don't learn. To solve this, we make sure that learning happens even with negative inputs, using leaky ReLU or ELU activations. Whew. So once again, the day is saved. I hope this video helped you understand more about activation functions. Check out my other content, and I will see you very soon. Bye bye.
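One last sketch before you go. To tie cases three and four together, here is how the gradients of these activations compare at negative pre-activations; the alpha values are common defaults, my own choice, not something specified in the video.

```python
import numpy as np

z = np.array([-6.0, -2.0, -0.5])   # strongly to mildly negative pre-activations

# Derivatives of each activation with respect to its input.
def d_sigmoid(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)                             # near 0 for large |z|: vanishing gradients

def d_relu(z):
    return (z > 0).astype(float)                     # exactly 0 for z < 0: the dying-ReLU case

def d_leaky_relu(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)               # small but nonzero slope for z < 0

def d_elu(z, alpha=1.0):
    return np.where(z > 0, 1.0, alpha * np.exp(z))   # smooth, nonzero for z < 0

for name, d in [("sigmoid", d_sigmoid), ("relu", d_relu),
                ("leaky_relu", d_leaky_relu), ("elu", d_elu)]:
    print(f"{name:11s}", np.round(d(z), 4))
```

Sigmoid's gradient shrinks toward zero (the vanishing gradient problem), ReLU's gradient is exactly zero for negative inputs (the dying ReLU problem), while leaky ReLU and ELU keep a small nonzero gradient, so a dead neuron can still receive updates.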