So, in a way, this is a very simple example of feature importance, because it tells us how important each pixel is in driving the behavior of the Layer 1 neuron. Now, in general nonlinear systems, this is highly non-trivial, but it's a major way for us to understand the system. In fact, I should mention here, it's very popular in neuroscience, where people ask: given a neuron that I might record from in the actual brain, what does it tell me about the outside world? Feature importance is also highly relevant from an ethics perspective. I have a machine learning system, and I want to be able to tell you which aspects of the inputs our machine learning system really focuses on. It's highly relevant for explainable AI, where methods like Shapley values are used to explain how a machine learning system works to humans by basically telling them: these are the relevant variables.

So, in the network that we just built, we used ReLU. Why did we use ReLU? Well, a lot of people somehow believe in the superiority of ReLU, and there's a historical reason for that. When AlexNet came out, it propelled deep learning into the limelight by showing that big vision problems can be solved really well by deep learning systems. AlexNet used ReLU, and consequently a lot of subsequent work focused on it. But is that justified? Let's look at a range of transfer functions. Here's a nice list that Mishkin et al. compiled of different functions. So, let's look at a few of them.

Now, what would it mean for ReLU, or for transfer functions in general, to really matter? For them to be important, I expect that they have a relatively big influence relative to the amount of data that we have. What we have here on the left-hand side is an analysis of how important the amount of data is. We see that as we scale the data set down from 1.2 million images to 600,000 by removing images, we lose performance: we go from getting roughly 45% right to roughly getting 40% right.
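As a brief aside on the feature-importance point above: for a single Layer 1 ReLU neuron, the importance of each pixel can be read off the gradient of the neuron's output with respect to the input. Here is a minimal sketch with hypothetical weights and a random 16-pixel input (none of these numbers come from the network in the lecture):

```python
import numpy as np

# Hypothetical single ReLU neuron: one weight per input pixel.
rng = np.random.default_rng(0)
w = rng.normal(size=16)   # weights (illustrative, randomly drawn)
b = 0.1                   # bias
x = rng.normal(size=16)   # a flattened 16-pixel input "image"

pre = w @ x + b           # pre-activation
active = pre > 0          # ReLU gate: does the neuron fire for this input?

# Gradient of ReLU(w·x + b) w.r.t. pixel x_i is w_i if the neuron is
# active, and 0 otherwise -- this is the per-pixel feature importance.
importance = w * active

# Pixels with large |importance| drive this neuron's output the most.
top_pixel = int(np.argmax(np.abs(importance)))
```

For this one-layer case the answer is just the weight vector (gated by the ReLU); in deeper nonlinear networks the same gradient-based idea is what makes the question non-trivial.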
So, it's a small difference that a factor of 2 in training data makes, but it's a highly significant one. And it gives us a scale against which to compare the importance of transfer functions. What we see on the right-hand side is the effect of using different transfer functions. If we have a linear system, of course, we lose a lot of performance, maybe twice as much as we gain by that scaling from 600,000 to 1.2 million. If we use tanh, which is no longer very popular, we lose about 2% relative to ReLU, whereas if we use some of the more modern transfer functions, we gain 2% or 3%. Now, you can say, well, 2% or 3%, that's not all that much. But it actually is a very large effect relative to the amount of data: we could have gotten the same performance with half the data, and keep in mind that this kind of data is really expensive.

So, let's talk through them. Here we have tanh, the hyperbolic tangent, often pronounced "tanch". The function is (e^(2z) − 1) / (e^(2z) + 1). But the really important thing is that it's a smooth, differentiable function that goes from −1 to 1; it smooths out any discontinuities. It used to be very popular because people believed that it approximates a property of real neurons: that they saturate, that at high input their firing rate can't get any higher. However, the evidence that real neurons actually ever saturate is extremely limited, and tanh has largely been replaced by ReLU. Another intuition relates to probability. Tanh's output only spans the range from −1 to 1, while probabilities go from 0 to 1, so there's a simple scaling between them. And then there's a smooth transition where evidence makes us believe in one hypothesis versus another. So, if what's going on in a deep learning system is that the system internally basically operates on probabilities, tanh seems like a decent choice.
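The "simple scaling" between tanh's (−1, 1) range and probabilities in (0, 1) can be made concrete: (tanh(z) + 1) / 2 is exactly the logistic sigmoid evaluated at 2z. A small sketch, using the tanh formula from the text:

```python
import numpy as np

def tanh(z):
    # The formula from the text: (e^(2z) - 1) / (e^(2z) + 1)
    return (np.exp(2 * z) - 1) / (np.exp(2 * z) + 1)

def sigmoid(z):
    # Logistic sigmoid, mapping the reals to (0, 1)
    return 1 / (1 + np.exp(-z))

z = np.linspace(-3, 3, 7)
# Rescale tanh's (-1, 1) output into the probability range (0, 1):
scaled = (tanh(z) + 1) / 2
# This coincides with sigmoid(2z): the two functions differ only by
# shifting/scaling of input and output.
assert np.allclose(scaled, sigmoid(2 * z))
```

So a tanh unit can be read as a (rescaled, input-sharpened) probability estimate, which is the intuition the lecture appeals to.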
Tanh also works well on some meta-learning problems, in part because it's nicely differentiable multiple times. Here we have ReLU. Its big advantage is that it's very fast to compute, and it's the most commonly used function. The resulting functions are piecewise linear, as we saw. It's not actually differentiable everywhere, namely not at 0, but this is often okay. It's more of a problem when the derivative is 0, which gives us the dead ReLU problem: if we initialized a ReLU so that it was never active for any of the training samples, it would always have 0 gradient, and therefore it would never learn anything. But it's pretty unlikely for a ReLU to be dead for all inputs.

A very simple alternative is the leaky ReLU, where we basically replace the flat part with a small but still positive slope in the negative region. It fixes the problem of dead ReLUs, and in a way it helps us deal with vanishing and exploding gradient problems. It's not used very much in practice compared to ReLU, but it often makes problems go away. There are, of course, cases where we need functions to be twice differentiable. For example, in meta-learning, where we learn to learn, we have to calculate derivatives of gradients with respect to the parameters of the learning process. Or maybe we simply want to use an optimization method that uses the Hessian. You will hear some about this in the future.

Here we have the logistic sigmoid, another popular activation function. In this case it goes from 0 to 1. It's similar to tanh, but it asymptotes at 0 and 1 instead of −1 and 1. It's only used in special cases nowadays. Now, let us compare transfer functions on animal faces. The best way to know if this idea is meaningful is to ask: does the transfer function really matter? So, which ones work well, and do you have any intuition as to why?
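The dead-ReLU problem and the leaky-ReLU fix described above can be sketched in a few lines. This is a minimal illustration, assuming the common (but here arbitrary) leak slope of 0.01:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Small positive slope alpha in the negative region instead of a flat 0.
    return np.where(z > 0, z, alpha * z)

def relu_grad(z):
    # Subgradient: 1 where active, 0 elsewhere (we pick 0 at z == 0).
    return (z > 0).astype(float)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

# A "dead" ReLU: pre-activations are negative for every training sample,
# so the gradient is 0 everywhere and the unit can never learn.
pre = np.array([-2.0, -0.5, -1.3])
assert relu_grad(pre).sum() == 0.0        # no gradient flows at all
assert (leaky_relu_grad(pre) > 0).all()   # leaky ReLU keeps a small gradient
```

The leaky variant is still only once differentiable at 0, so for the twice-differentiable use cases mentioned above (meta-learning, Hessian-based optimization) a smooth function like tanh remains the safer choice.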