So we saw that we can solve the XOR problem, but can we solve anything? Can we stave off another AI winter coming from there being certain functions that we cannot approximate? Actually, yes: the problem of not being able to approximate some function is not going to come back. So what do we have? We have the universal approximation theorem, which is a little hard to get an intuition for when you are getting started, and yet it is super useful. So let phi be a function from R to R. It needs to be non-constant, okay? Otherwise, if I give you the zero function, you can't build anything out of it. It needs to be bounded, which just means it can't get outside of a certain bound, and it needs to be continuous. Now let I_m be the m-dimensional unit hypercube. The exact choice doesn't matter; it just says we want to approximate the function well on some region, and we can arbitrarily choose that region to be the unit hypercube. The space of real-valued continuous functions on I_m is denoted by C(I_m); that is the space of all possible continuous functions on that cube. Then, given any epsilon > 0 and any function f in C(I_m), there exist an integer N, real constants v_i, b_i in R, and real vectors w_i in R^m such that if we define F(x) = sum_{i=1}^{N} v_i phi(w_i^T x + b_i), which is basically just a one-hidden-layer neural network whose transfer function phi we only require to be non-constant, bounded, and continuous, then F is an approximate realization of f, that is, |F(x) - f(x)| < epsilon for all x in I_m. In other words, functions of the form F(x) are dense in C(I_m). So what that means is that if you give me enough basis functions, I can always approximate any function arbitrarily well on the hypercube. Now let's be clear about what it doesn't mean. It doesn't mean that this N is small; N could grow exponentially as we make epsilon smaller. So this is not a theorem that says we can practically approximate all the functions we are interested in. It just means that in principle, in the limit of infinite width, a single-hidden-layer neural network, regardless of what the transfer function looks like, as long as it is non-constant, bounded, and continuous, can approximate any function on the cube. So this is mathematically very reassuring to know. At least no one can prove to us that we fundamentally cannot solve certain problems, approximate certain functions. That doesn't mean that in practice this is always going to be achievable.

Okay, but let us convince ourselves in the case of ReLU that it can approximate any function. So what's the idea here? We take a function and we take its values at each grid point; I can of course make that grid finer and finer, and then the approximation that we have here in orange gets closer and closer to the blue ground-truth one. Okay, we can clearly do that. But let's see if ReLU can actually do that. So how would that work? Let's say here we have the sine function and we want to approximate it with ReLUs. How could we do that? Well, we take that first piece from 0 to 0.5 and approximate it with a linear function, and then approximate the next piece with another linear function, and the next with another, and so on.
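To make that grid picture concrete, here is a minimal sketch of the sample-and-connect idea; the target sin(pi*x) on [0, 2] is my assumption, chosen to match the break points 0, 0.5, 1, 1.5 discussed below. It samples the function at the knots and connects the samples with straight line segments.

```python
import numpy as np

# Target from the slide: sin(pi * x), sampled at the break points
# 0, 0.5, 1, 1.5 (plus the right edge at 2) and connected by straight lines.
f = lambda x: np.sin(np.pi * x)

knots = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
x = np.linspace(0.0, 2.0, 401)
piecewise = np.interp(x, knots, f(knots))   # the orange piecewise-linear curve

print("max error with 4 segments:", np.max(np.abs(piecewise - f(x))))
# Re-running with a finer set of knots makes this error shrink.
```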
Okay, this is not a great approximation, but here we just have four points; we could use more than four points and it would progressively get better. So how can ReLU do that? Well, let's see. We could have a first ReLU that is 0 to the left of 0 and then has a slope of 2 as soon as we are to the right of 0. That slope has exactly the property that at 0.5 it hits 1, and therefore it coincides there with the function we want. Now, as soon as we hit 0.5 we need the function to have a different slope: it now needs to slope downwards. So where before we had added a ReLU with a slope of 2 at 0, we now add a ReLU with a slope of minus 4 at 0.5, so that the total slope becomes minus 2. This new ReLU has no effect to the left of 0.5, but it has twice the magnitude of the first ReLU as soon as we are to the right of 0.5, so it brings us down towards the point where at 1 we reach 0. At 1 we don't need to change anything, so we can add a ReLU with a weight of 0 at 1, and then we again have to add a ReLU with a slope of 4 at 1.5. So you see how that works, and this idea is something that we can of course build into an algorithm, and with that we can approximate any function with ReLUs. If you think about it, you will quickly convince yourself that the multi-dimensional generalization of this also works. So let us now see how that works in practice: implement that intuitive strategy from the previous slide and plot the error we make as a function of the number of such segments, i.e. what we expect is that as we add more and more points, our approximation of the function will get better and better.
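One possible implementation of that slope-correction strategy is sketched below; the helper name relu_approx, the target sin(pi*x), and the interval [0, 2] are my assumptions rather than anything fixed by the slides. Each ReLU sits at a knot and its weight is the change in slope required there (the +2, -4, 0, +4 from above), so the sum of ReLUs reproduces the piecewise-linear interpolant, and the loop records the maximum error as the number of segments grows.

```python
import numpy as np
import matplotlib.pyplot as plt

def relu_approx(x, knots, values):
    """Sum of ReLUs reproducing the piecewise-linear interpolant of
    (knots, values). Each ReLU's weight is the change in slope at its knot,
    which is exactly the strategy from the slide."""
    slopes = np.diff(values) / np.diff(knots)        # slope of each segment
    weights = np.diff(slopes, prepend=0.0)           # slope change at each knot
    out = np.full_like(x, values[0], dtype=float)    # start from the left value
    for k, w in zip(knots[:-1], weights):
        out += w * np.maximum(0.0, x - k)            # one ReLU unit per knot
    return out

f = lambda x: np.sin(np.pi * x)
x = np.linspace(0.0, 2.0, 2001)

segments = [4, 8, 16, 32, 64, 128]
errors = []
for n in segments:
    knots = np.linspace(0.0, 2.0, n + 1)
    errors.append(np.max(np.abs(relu_approx(x, knots, f(knots)) - f(x))))

plt.loglog(segments, errors, "o-")
plt.xlabel("number of segments")
plt.ylabel("max approximation error")
plt.show()
```

On the log-log plot the error should fall off roughly quadratically in the number of segments, which is the standard rate for piecewise-linear approximation of a smooth function.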