So, let's talk about wider versus deeper networks. In particular, given this meme that, as long as we go deep enough, everything is going to be fine. In multi-layer perceptrons, we know that the architecture truly matters. Let us talk about how it matters.

The first way it matters is in terms of expressivity. When we encountered expressivity, we saw that multi-layer perceptrons with one hidden layer are universal approximators, but that some functions could require huge width, as we discussed. Given a particular architecture, we can look at its expressivity, that is, the set of functions that it can approximate. So, at some level, the expressivity question is: given an MLP, what is the set of all the functions that it could implement, or at least approximate?

So, here's an interesting one, the sawtooth function. Now, that's a function that just goes up and down and up and down and up and down. We know that we can express 2^n linear pieces with just on the order of 3n neurons and a depth of about 2n, which is a finding by Telgarsky. And you can see how it works below: it basically takes a local function that makes a little sawtooth and then iterates it in a really clever way, by basically folding the input back and forth over the regions where the function goes back and forth. So, a shallow implementation of this takes exponentially more neurons. So, here we have a function where it helps exponentially to make the network deep.
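Here is a minimal sketch of a simplified variant of Telgarsky's construction (illustrative code, not from the lecture; the function names and the piece-counting trick are my own): a single "tent" map, built from two ReLU units, doubles the number of linear pieces each time it is composed with itself, so depth k yields 2^k pieces from only O(k) neurons.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def tent(x):
    # One hidden layer of two ReLU units implementing the tent map on [0, 1]:
    # f(x) = 2x on [0, 1/2] and f(x) = 2 - 2x on [1/2, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def sawtooth(x, depth):
    # Composing the tent map `depth` times gives a sawtooth with
    # 2**depth linear pieces, using only 2 * depth hidden neurons.
    for _ in range(depth):
        x = tent(x)
    return x

xs = np.linspace(0.0, 1.0, 1601)  # grid chosen so the breakpoints land on samples
ys = sawtooth(xs, depth=4)        # expect 2**4 = 16 linear pieces
slopes = np.sign(np.diff(ys))
pieces = 1 + np.count_nonzero(np.diff(slopes))
print(pieces)  # -> 16, from just 8 hidden neurons
```

A one-hidden-layer ReLU network, by contrast, can produce at most one linear piece per hidden neuron plus one, so matching this function with a shallow network takes exponentially many neurons.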
Is this generally the case? Well, no. But there are some things we can say. We can ask: how many linear pieces could a network have? And here's a finding by Montúfar: the number of linear pieces that can be expressed by a deep piecewise-linear network is exponential in the depth and polynomial in the number of input dimensions, okay? Again, exponential benefits, at some level, from having depth. And here's another one: for a bounded activation function, a unit-length curve sent through a deep network can grow in length exponentially with the depth, while for a shallow network the length is only linear in the width. So, all these findings point to the advantages that we can sometimes get in expressivity by having deep neural networks.

Another one is the multiplication problem. This theorem says that for approximating the multiplication of n inputs, x_1 up to x_n, to within an arbitrary epsilon accuracy, a shallow network requires 2^n neurons, but a deep one only requires on the order of n neurons, which is linear in n. Again, yet one more of these: there exist problems for which depth is very, very useful.

Now, let me highlight one thing about these expressivity results. They show examples where it massively helps to have deep networks. That doesn't mean that this is the case for the kinds of functions we want to solve in everyday life. But also, expressivity isn't enough: it doesn't imply that we can find a solution in finite time. In fact, in many cases, we can prove that the learning problems are hard. So, great, you can solve the problem with that kind of function, but it would take you exponential time to learn it, which effectively makes it impossible just as well.

So, the question is: is expressivity even the problem with shallow nets? We know empirically that shallow nets don't work all that well. But there's a great paper by Ba and Caruana who showed that a wide shallow net can be trained to mimic a deep network, attaining significantly greater accuracy than training the shallow network directly on the data.

So, this is bizarre if you think about it. You take a shallow neural network, you train it on the data, and it does quite poorly. Then you train a big deep neural network, which does well enough, and then you take that big network and ask it to teach a shallow network: basically, the deep network labels potential data points and the shallow network learns to recreate those labels. And this procedure massively helps. So, it doesn't seem to be the lack of expressivity that holds the shallow network back from doing well. It looks like, for some reason, it cannot learn the solution directly.

So, just to be clear, in this space of theoretical understanding, of what neural networks can compute and what neural networks can meaningfully learn, there's still a lot of work to be done. We don't currently have a satisfactory understanding. But in any case, when we build neural networks, we do want to think about expressivity: if we picture what the right solution to the problem would look like, do we think that this kind of network could express it? And do we think there should be a way of efficiently finding that solution? So, let us try whether, on simple problems, deeper or wider networks are better, keeping the number of parameters the same.
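As a starting point for that experiment, here is a minimal sketch (in PyTorch; the helper names and the particular widths are my own choices, not from the lecture) of how to build a wide shallow MLP and a narrow deep MLP with roughly the same number of parameters, so that the comparison is about depth rather than raw capacity.

```python
import torch.nn as nn

def mlp(in_dim, hidden, depth, out_dim=1):
    # `depth` hidden layers of width `hidden`, with ReLU activations.
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

def n_params(model):
    return sum(p.numel() for p in model.parameters())

wide = mlp(in_dim=10, hidden=512, depth=1)  # one wide hidden layer
deep = mlp(in_dim=10, hidden=34, depth=6)   # six narrow hidden layers
print(n_params(wide), n_params(deep))       # 6145 vs. 6359 -- roughly matched
```

Training both models on the same simple task then isolates the effect of depth at a fixed parameter budget.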