So, let us talk a little bit about initialization. As we discussed before, it's really important to have the right kind of initialization, and we saw that in some cases we can derive what the right initialization is. But keep in mind that with every different transfer function there should be a different way of initializing, all aimed at keeping the activations in the right range as we build networks that are deeper and deeper. Otherwise, we can very easily get vanishing or exploding gradients.

So let's talk about the Leaky ReLU, because that's just a nice example. In this case, mind you, we have f(x) = a·x for x < 0 and f(x) = x for x ≥ 0. The expectation of the pre-activation here will generally be zero, but the variance changes, and assuming that the probability that x is smaller than zero is 0.5, it just means that instead of seeing a positive activation we could just as well see a negative one. In that case, we can calculate the variance of the function applied to the output of the layer, which, keeping in mind that the mean is zero here, is the expected value of the square of the function. That gives us the variance of the positive activations plus a squared times the variance of the negative ones, divided by two. And now what we can do is pull out the factor (1 + a²)/2, so we have that the variance is (1 + a²)/2 times n times sigma squared times gamma squared, where sigma squared is the variance of the pre-synaptic activations and gamma squared is the variance of the weights. With that, we can then calculate how the weights should be initialized so that the activations stay in the right range. Now, check the Xavier initialization against this Leaky ReLU case, and you'll see the difference.
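To make the numbers concrete, here is a minimal sketch, not from the lecture itself, that checks this derivation numerically with NumPy. The names fan_in, a, sigma2, and the leaky_relu helper are illustrative choices: we set the weight variance gamma² = 2 / (n·(1 + a²)) so that n·gamma²·(1 + a²)/2 = 1, and then verify that the pre-activation variance stays roughly constant across two layers.

```python
# Minimal sketch, assuming zero-mean Gaussian inputs and weights.
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, a):
    # f(x) = a*x for x < 0, f(x) = x for x >= 0
    return np.where(x < 0.0, a * x, x)

fan_in = 512    # n: number of pre-synaptic units feeding each neuron
a = 0.1         # negative slope of the Leaky ReLU
sigma2 = 1.0    # variance of the pre-synaptic activations

# To keep the variance constant we need n * gamma^2 * (1 + a^2) / 2 = 1,
# so the weight variance gamma^2 should be:
gamma2 = 2.0 / (fan_in * (1.0 + a**2))

# Push zero-mean activations through two layers initialized this way and
# check that the pre-activation variance stays in the same range.
batch = 4096
x = rng.normal(0.0, np.sqrt(sigma2), size=(batch, fan_in))
W1 = rng.normal(0.0, np.sqrt(gamma2), size=(fan_in, fan_in))
W2 = rng.normal(0.0, np.sqrt(gamma2), size=(fan_in, fan_in))

pre1 = x @ W1                    # variance ~ n * sigma^2 * gamma^2
post1 = leaky_relu(pre1, a)      # second moment ~ (1 + a^2)/2 * Var(pre1)
pre2 = post1 @ W2                # variance ~ Var(pre1): neither vanishes nor explodes

print("Var(pre1):", pre1.var())  # ~ 2 / (1 + a^2), about 1.98 for a = 0.1
print("Var(pre2):", pre2.var())  # ~ same as Var(pre1)

# For comparison, a simple forward-only form of Xavier uses Var(W) = 1/n,
# which ignores the (1 + a^2)/2 factor coming from the transfer function.
print("Leaky ReLU gamma^2:", gamma2, " Xavier 1/n:", 1.0 / fan_in)
```

The last print line is the comparison hinted at above: the Leaky ReLU variance differs from the plain 1/n Xavier value by exactly the factor 2 / (1 + a²), which is what keeps the activations from shrinking layer after layer.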