Switch it off, no, that's okay. Thank you. So, good morning everyone, welcome. First of all, I would like to thank the organizers for inviting me. I would like to talk to you about joint work, actually two papers with different subsets of co-authors: Dennis Elbrächter, who is a PhD student of Philipp Grohs, a professor in Vienna who was formerly at ETH; Gitta Kutyniok from TU Berlin; Philipp Petersen, who was Gitta's student; and Dmytro Perekrestenko, who is my PhD student and who also generously helped me with the slides.

So, you didn't think you could just sit here and listen, right? I'm going to make you work a little bit. We just heard Jared talk about classification, so here is a classification problem for you to solve. Can you help me annotate the images? There are four groups, and each group contains three images. Any suggestions for people you recognize? Don't think science only. Dark chocolate for those who make correct guesses after the talk. Joachim. Gödel and Karajan, yes? Then there are two more. Von Neumann, yes? The lady, as the color suggests, is more recent. Not science. Any opera music lovers here? OK, I'll show you. This is Elīna Garanča, the famous mezzo-soprano.

Another problem: image annotation. This is what deep neural networks excel at, classification and image annotation. Can you help me annotate this image? This is to help you a little bit: he is considered one of the greatest conductors of the 20th century and at the same time was the least prolific. Von Karajan said that he conducts only when his fridge is empty. Any ideas? You probably recognize the background, no? Or the concert. It's a famous concert that you can listen to every year. The New Year's concert, yes? In Vienna, yes, right? But we were after the name of the conductor. You see, I think this just proves you can be a genius without being widely known. If you haven't heard about Carlos Kleiber, I recommend looking him up. Everything he ever conducted is on 11 CDs. And he was known to be somewhat special. He would, for example, agree to conduct the Vienna Philharmonic, not for the New Year's concert, but for a regular concert. Then he would come and do rehearsals, and on the day of the concert he would decide that they were not in good enough shape to be conducted by him. He would just leave a note on the day of the performance saying "drove into the blue, Carlos," and they would then struggle to find a replacement for him. Good. Conducting the Vienna Philharmonic's New Year's concert, and he conducted it twice; as the experts here know, conducting the New Year's concert is like winning a Nobel Prize. So there are two years, and there is actually enough information in the image to tell them apart. If you're interested, go to YouTube; there are two concerts, one in '89 and the other in '92. The way you can tell is by the flower bouquets in the background and their colors: this is '89, and '92 was yellow and green.

OK, good. But let's get serious now; I also do have technical stuff to talk about. What I would like to talk about is neural networks: mappings that are concatenations of affine transformations and nonlinearities — affine transformation, nonlinearity, affine transformation, nonlinearity, and so on. So W_l is an affine map, and then we have what is called an activation function, rho, which acts component-wise. The quantities to keep in mind in this talk are the network connectivity, which is the total number of non-zero parameters in those affine transformations and which defines the topology of the network; the number of layers, or depth, of the network, L; and the width of the network, W. The most relevant quantity for us will be M, the total number of non-zero parameters in the W_l's.
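To make these objects concrete, here is a minimal sketch in plain NumPy of a network as a concatenation of affine maps and a componentwise nonlinearity, together with the connectivity count M. The names network and connectivity are illustrative only, not notation from the papers, and the ReLU is just one possible choice of the activation rho.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def network(x, weights, biases):
    # Phi(x) = W_L( rho( W_{L-1}( ... rho(W_1 x + b_1) ... ) ) ) + b_L:
    # affine map, componentwise nonlinearity, affine map, and so on.
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)
    return weights[-1] @ a + biases[-1]

def connectivity(weights, biases):
    # M: total number of non-zero parameters in the affine transformations.
    return (sum(int(np.count_nonzero(W)) for W in weights)
            + sum(int(np.count_nonzero(b)) for b in biases))

# Toy example: depth L = 2, width W = 3, input dimension 2.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal(3), rng.standard_normal(1)]
print(network(np.array([0.5, -1.0]), weights, biases), connectivity(weights, biases))
```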
What I would like to do here is isolate one question, and that is the question of what we can fundamentally learn with a neural network. The way I'm going to do it is to assume that we have what I call access to infinite amounts of data, but what that really means is that we know the entire function we want to learn. You will say immediately: well, of course, this never happens in practice. Yes. But this talk is somewhat information-theoretic in spirit, in the sense that we isolate certain aspects and ask what is possible at all. So we know the function, and we assume the optimal learning algorithm is available; we don't care about the learning algorithm. This talk says nothing about learning from finite amounts of data and nothing about learning algorithms. All it asks is which classes of functions can be represented, and represented efficiently, as a concatenation of affine transformations and nonlinearities. What I mean by efficiently, I will tell you in a second.

For those of you who are familiar with Cybenko and Hornik, you will immediately say: wait a moment, what's the point of this talk? We know that any continuous function can be approximated by a single-hidden-layer neural network, a result that dates back to 1989 for sigmoidal activation functions. That's correct. What it does not tell you is how the width of the network scales as a function of epsilon: if you want to increase the precision, how does the size of the network grow? There's no result there. There is a more recent result by Lu et al. that tells you that an integrable function mapping R^n to R can be approximated by a ReLU network of width n plus 4 with an error of no more than epsilon, but there is no bound on the depth. So there are these two universal approximation results, and then there is a whole universe of follow-up results, starting with the papers by Andrew Barron in '93 and '94, which gave approximation error bounds for smooth functions. I'm not going to read out all these references to you, and the list is highly incomplete; there are literally hundreds of papers trying to answer the question of the complexity of these networks for specific function classes.

What I would like to point out are results from the 1990s by Charles Chui and also Mhaskar, who observed that in certain cases deep networks can perform better than single-hidden-layer networks. There is going to be something in this talk about what deep networks can do that shallow networks cannot do, with some precise statements. More recently, Eldan and Shamir showed that there exist functions that are expressible through small three-layer networks but only through very large two-layer networks. These are all observations that speak in favor of depth. Some more recent work I would like to mention, by colleagues from Zurich — Arnulf Jentzen, Christoph Schwab, and also Philipp Grohs — showed that deep neural networks can break the curse of dimensionality in approximating solutions of certain PDEs. The paper that inspired us for this talk is a paper by Shaham, Cloninger, and Coifman that talks about approximation with wavelet bases and connects this to neural networks.
The main philosophy of our talk is not very explicit in that paper — it's somehow quite hidden — but anyway, I saw that paper on arXiv and this is what got us started, so I would like to encourage you to have a look at that beautiful paper.

Good. So obviously, all these references suggest that there should be some kind of systematic framework that asks for the complexity of the approximating network. In particular, we will be looking at two different cases. One is the approximation of individual functions: for example, the first one we'll look at is the squaring function, then the multiplication function, polynomials, smooth functions, sinusoidal functions, and so on. In that context, the approximation of individual functions, we are interested in the number of nodes in the network needed to achieve a certain approximation error. In particular, we are interested in what is called exponential approximation accuracy: the approximation error decays exponentially in the number of nodes in the network. The main statement, or the main result, will be that deep networks provide exponential approximation accuracy for a wide range of functions. In fact, I was tempted to write "for anything you can think of," but then I wanted to be a bit more conservative.

OK, and the second part of the talk looks at the approximation of entire function classes. Those of you who followed the whole wave of approximation-theoretic results in the 1990s and early 2000s — Besov spaces and so on — know the result that wavelets optimally approximate functions from Besov spaces, and there are many other results of that kind. I'm going to talk about that framework. We are interested in neural networks that approximate functions from a given function class; it's going to be a worst-case theory, and we ask for the complexity of the network. Here we are much closer to reality than in the first part: we quantize the network, in the sense that we ask what is the number of bits needed to encode the topology of the network, and we quantize the weights and the biases. You want to store the entire network in a bit string: you encode the network into the bit string, and from that bit string you should be able to uniquely read out, reconstruct, the network. So the network acts as an encoder for an element of that function class. Immediately you will start thinking about Kolmogorov epsilon-entropy, covering numbers, and so on — and indeed, there is a deep relationship here.

Good. OK, so let's start with the approximation of the squaring function. The result is as follows: you can approximate the squaring function with a deep neural network, with an error of at most epsilon, with a network of width four and depth scaling logarithmically in the inverse of the approximation error. Here there is also a bound on the size of the weights. The size of the weights will play a very important role later on, because when we talk about Kolmogorov — actually, it's going to be Kolmogorov-Donoho — optimality, the scaling of the weights in the size of the domain over which you want to approximate the function turns out to be crucial; more on that later. Good. So the total number of parameters in this network is 4 times c times log 1 over epsilon.
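To give a feel for how a construction of this kind can work, here is a small numerical sketch in the spirit of the sawtooth constructions used in this line of work (including Yarotsky's, discussed next): x squared on [0,1] is approximated by repeatedly composing a ReLU hat function with itself, and the error decays exponentially in the number of compositions, i.e. in the depth. The constants and the exact network realization in the paper differ; this is only an illustration with my own function names.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hat(x):
    # "Hat" function on [0,1], built from three shifted ReLUs.
    return 2*relu(x) - 4*relu(x - 0.5) + 2*relu(x - 1.0)

def square_approx(x, m):
    # Approximate x**2 on [0,1] by x - sum_{s=1}^m hat^{(s)}(x) / 4**s,
    # where hat^{(s)} is the s-fold composition of the hat function.
    out = np.asarray(x, dtype=float).copy()
    h = np.asarray(x, dtype=float)
    for s in range(1, m + 1):
        h = hat(h)                 # s-fold composition: a sawtooth with 2**s teeth
        out = out - h / 4**s
    return out

xs = np.linspace(0.0, 1.0, 1001)
for m in (2, 4, 8):
    err = np.max(np.abs(square_approx(xs, m) - xs**2))
    print(m, err)                  # max error is 4**(-(m+1)): exponential in the depth m
```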
What that means, as you will see in a second, is that we get exponential approximation accuracy. But before I show you this result, let me tell you about Yarotsky's work, which contains this result on the approximation of the squaring function. We have some minor improvements or modifications of that result, but this is the paper you should be looking at, Yarotsky 2017. The two things I want to mention here, relative to his result, are: first, in our construction the weights of the network scale no faster than polynomially in the size of the domain over which we want to approximate the function, which we need for optimality later on; and second, in approximating polynomials, in contrast to Yarotsky's work, the width of our network does not scale with the degree of the polynomial. We will later use a Taylor series argument in approximating sinusoidal functions, and if the width scaled with the degree, this would not give us optimality. So these are the two things that may be new here.

All right, so finite width with depth scaling like log 1 over epsilon. Now, the total number of non-zero weights in the network is upper bounded by the depth times W squared plus W: W squared is the size of the matrices in the affine transforms, and W is the dimension of the bias vector, so L times (W squared plus W). If you plug in width 4 and depth c log 1 over epsilon and rewrite the equation, you get that the approximation error decays like 2 to the minus a constant times the connectivity to the power 1 over p, where p is the exponent of the polylog in 1 over epsilon that we will have later instead of plain log 1 over epsilon; here p is equal to 1 in the approximation of x squared. So finite width combined with polylog 1 over epsilon depth yields exponential error decay in the connectivity. That is going to be relevant later in the talk, where I will show you two function classes that are known to be very hard to approximate in classical approximation theory, namely the Weierstrass function and oscillatory textures. The best known classical approximation procedures give you a polynomial error decay, and it turns out that with a deep neural network you very easily get exponential error decay.

Good. So the slide is titled Optimal Neural Network Algebra. Why algebra? Well, once we know how to approximate the squaring function optimally — optimally meaning with this exponential error decay — we use it to approximate the multiplication function. Why the multiplication function? Well, once you have squaring and multiplication, you can build polynomials; once you have polynomials, you can do Taylor series approximations, and so on and so forth. All right, so multiplication. We write the product x times y via the identity x times y equals one half of (x plus y) squared minus x squared minus y squared: we know how to square, and we know how to form linear combinations — we have affine transformations available. So this is actually just a combination of three neural networks: one computes (x plus y) squared, another x squared, and the third y squared. You can show that optimality is preserved, and that's how you build multiplication.
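Continuing the numerical sketch from above — and again this is only an illustration with my own names, not the construction from the paper — the multiplication step of this algebra is three squaring networks combined by an affine map, via x*y = 2*((x+y)/2)**2 - x**2/2 - y**2/2 on [0,1]^2:

```python
def mult_approx(x, y, m):
    # x*y = 2*((x+y)/2)**2 - x**2/2 - y**2/2:
    # three squaring networks plus affine combinations, so the error
    # stays exponentially small in the depth parameter m.
    return (2.0 * square_approx((x + y) / 2.0, m)
            - 0.5 * square_approx(x, m)
            - 0.5 * square_approx(y, m))

print(abs(mult_approx(0.3, 0.7, 8) - 0.21))   # error on the order of 4**(-m)
```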
Then you go on and realize x to the k through a combination of squaring and multiplication operations; you build smooth functions through Chebyshev interpolation combined with polynomial approximation; and finally you build sinusoidal functions through a Taylor series approach. And there is actually a twist here that is important. I held back and didn't talk much about this, but this is where we fought the hardest with the problem: in order to get the strong optimality result that I'm going to show you later, we needed optimality for the approximation of sinusoidal functions. I would say 80% of the research time that went into the paper was spent on this problem. But never mind — you can read it up, it's done now, and in the end it's not so difficult. What it uses is the symmetry of the hat function that you can build from ReLUs — a linear combination of shifted ReLUs — together with the symmetry of the sine and the cosine.

All right. So, deep neural network approximation of function classes: we want to establish a relationship between the complexity of a function class C, a compact set, and the complexity of the corresponding approximating networks. First we have to define what we mean by the complexity of a function class — as I said, you will think of Kolmogorov covering numbers — and what we mean by the complexity of a network. The network is going to be encoded by a bit string, as I told you, and this bit string encodes the topology of the network and the quantized weights. Why do we want to quantize the weights? Because we have to quantize in practice; you cannot use infinitely many bits to represent real numbers.

Good. The framework we are going to use was introduced by David Donoho, based on beautiful work by Kolmogorov. The paper I learned this from is the one by Cohen, Dahmen, Daubechies, and DeVore from 2001, where it is used in the context of certain wavelet-based encoders. The basic idea is this. You have a function class, which is a compact set C in L2 of Omega. You have a set of encoders and decoders: an encoder maps an element of C onto a bit string of length L — just a sequence of zeros and ones — and the decoder takes this bit string and maps it back to a function in L2 of Omega. So think of Kolmogorov covering numbers: you encode the ball centers, and then you know the radii, so you know the maximum error. That is the philosophy behind this. We are going to use the minimax code length, which is the minimum code length such that there exists an encoder-decoder pair for which the worst-case error is no more than epsilon. In particular, we are interested in this minimax code length scaling as epsilon to the minus 1 over gamma, and we look for the largest such gamma. Larger gamma is better because it gives a smaller growth rate — growth rate meaning as the error goes to zero: if you want a more and more precise approximation, you ask how the number of bits needed to encode those functions scales as epsilon goes to zero. Larger gamma is better because the requirements for storing signals in C are more lenient.
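In symbols — my notation, modelled on the Donoho-style setup just described, not the exact definitions from the slides — the minimax code length and the optimal exponent look roughly as follows:

```latex
% Encoder/decoder pairs of code length \ell:
%   E : \mathcal{C} \to \{0,1\}^{\ell}, \qquad D : \{0,1\}^{\ell} \to L^2(\Omega)
L(\varepsilon,\mathcal{C})
  = \min\Big\{ \ell \in \mathbb{N} : \exists\,(E,D)\ \text{s.t.}\
      \sup_{f\in\mathcal{C}} \| D(E(f)) - f \|_{L^2(\Omega)} \le \varepsilon \Big\},
\qquad
\gamma^{*}(\mathcal{C})
  = \sup\Big\{ \gamma > 0 : L(\varepsilon,\mathcal{C}) = O\big(\varepsilon^{-1/\gamma}\big),\ \varepsilon\to 0 \Big\}.
```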
Good. OK. So here is the relationship to the Kolmogorov-Tihomirov epsilon-entropy. You cover your set and count the number of balls of radius epsilon; you take the log of that number — that's the Kolmogorov-Tihomirov epsilon-entropy. This number depends, of course, on the radius epsilon: if epsilon becomes smaller, you need more balls to cover. And it turns out that when C is not a finite set, H_epsilon of C goes to infinity as epsilon goes to zero. The question is how fast it goes to infinity, and in many interesting cases it grows like epsilon to the minus 1 over some alpha, potentially with a log term as well. In this talk, I am going to ignore whatever log term may be there and just look at this quantity, the optimal exponent. It's a crude measure of growth, if you like. So whenever we talk about optimality, we mean modulo whatever factors grow more slowly than epsilon to the minus 1 over alpha.

Good. All right. Now, switching gears: where was this framework used very successfully? In dictionary approximation. What is dictionary approximation? I give you a function class — for example, Besov spaces or modulation spaces, the spaces related to the short-time Fourier transform — and I give you a dictionary, for example wavelets or Gabor dictionaries (in physics, coherent states), and you try to find linear combinations of dictionary elements to approximate your functions. That's dictionary approximation. The question you can now ask is the following. Suppose you have a dictionary, a set of functions, and you want to represent any function in your compact set C as a linear combination of elements of that dictionary. You refer to those approximations as best m-term approximations — why? — because you allow m participating elements in your approximation. In particular, you ask for the error to decay like m to the minus gamma, and for the largest exponent for which you can get this error decay. That exponent is denoted gamma star of C comma D. Clearly, this exponent depends on the function class, but also on the dictionary — in particular, on how well the dictionary is matched to that function class. That matching problem was investigated very intensely some 20 years ago, and I'm going to show you some of the main results later on.

But before we do that, let's ask ourselves whether there is a fundamental limit on gamma star of C comma D. You fix your compact set and you are allowed to vary over all possible dictionaries — this is the dictionary design problem, if you like. Is there an upper bound on gamma star of C comma D? The answer is no, not really, because if you take a dense and countable D, you can get gamma star equal to infinity; you can approximate arbitrarily accurately. However, this is not terribly practical. Why? Because if you have a dictionary of infinitely many elements, you have to search infinitely deep into the dictionary, and you need an infinite number of bits to store the indices of the participating elements. That defies the whole purpose. A way around this is a concept Donoho introduced in '93, the concept of effective approximation. What he said is that for an m-term approximation you are not allowed to search the entire dictionary; you are only allowed to search the first pi of m elements, where pi is a polynomial — so only polynomially deep into the dictionary. And you can already guess what this buys you: first of all, searching becomes feasible, and second, the participating indices become cheap to encode.
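As a toy numerical illustration of best m-term approximation itself — with an orthonormal DCT standing in for a wavelet or Gabor dictionary, which is my simplification purely for the sake of a runnable example — one keeps the m largest coefficients and measures the error:

```python
import numpy as np
from scipy.fft import dct, idct

# A smooth test function sampled on a grid; the orthonormal DCT plays the dictionary.
n = 1024
x = np.linspace(0.0, 1.0, n)
f = np.exp(-5 * x) * np.sin(12 * x)

c = dct(f, norm='ortho')                       # analysis: expansion coefficients
for m in (8, 16, 32, 64):
    idx = np.argsort(np.abs(c))[::-1][:m]      # keep the m largest coefficients (best m-term)
    cm = np.zeros_like(c)
    cm[idx] = c[idx]
    fm = idct(cm, norm='ortho')                # synthesis from the m retained terms
    err = np.linalg.norm(f - fm) / np.sqrt(n)  # root-mean-square error
    print(m, err)                              # error decays rapidly as m grows
```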
Well, if there are only polynomially many candidate elements, pi of m of them, then the log of that number is a constant times log of m, so you can encode each index with on the order of log m bits; with m participating elements, that's m log m bits in total — I'll show you this later. The second thing is that we need a bound on the coefficients, because we want to quantize them, and this will also show up later in neural network approximation. What you need to do is control the error incurred by quantizing the coefficients in your representation, and there is literature on this, if you're interested. The way you do it is to show that the superposition is a continuous mapping, and — sorry — you can orthogonalize the participating elements, and then, if you quantize accurately enough, the approximation error is preserved because the condition number of an orthogonal matrix is equal to 1. These are fairly standard things. I'm mentioning this because for neural networks it is going to be a bit harder, since we have to deal with nonlinear mappings.

All right. So here is Donoho's concept of the effective best m-term approximation rate. You want a dictionary approximation with m terms, you want an error decay of m to the minus gamma, and you want a bound on your coefficients; you ask for the largest such exponent, which we call gamma star effective of C comma D, and refer to it as the effective best m-term approximation rate. So here is the question now: what, if anything, is an upper bound on gamma star effective of C comma D? What gives gamma star effective of C comma D operational significance is that you can indeed show it is upper bounded by the Kolmogorov exponent. And the question we are going to ask ourselves now is: which dictionaries, for a given function class, represent that function class effectively and optimally, in the sense that the corresponding effective rate achieves the Kolmogorov exponent? This is what I will refer to as Kolmogorov-Donoho optimality from now on.
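Schematically, and again in my own notation rather than the precise definitions on the slides (in particular the coefficient bound C_0 and the search polynomial pi are placeholders), the effective best m-term error and rate, and the bit count they anticipate, look like this:

```latex
\Gamma^{\mathrm{eff}}_{m}(f,\mathcal{D})
  = \inf_{\substack{I \subseteq \{1,\dots,\pi(m)\},\ |I|\le m \\ |c_i| \le C_0}}
      \Big\| f - \sum_{i\in I} c_i\, d_i \Big\|_{L^2(\Omega)},
\qquad
\gamma^{*,\mathrm{eff}}(\mathcal{C},\mathcal{D})
  = \sup\Big\{ \gamma : \sup_{f\in\mathcal{C}} \Gamma^{\mathrm{eff}}_{m}(f,\mathcal{D}) = O(m^{-\gamma}) \Big\}
  \;\le\; \gamma^{*}(\mathcal{C}).
% Anticipating the back-of-the-envelope count described next:
% O(m \log m) bits for indices and quantized coefficients, with error
% \varepsilon \asymp m^{-\gamma}, gives B(\varepsilon) = O(\varepsilon^{-1/\gamma}\log(1/\varepsilon))
%   = \tilde{O}(\varepsilon^{-1/\gamma}).
```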
And you can already see where the talk is going. I'm going to show you what kinds of results are available for classical approximation of function classes, and then we ask which function classes neural networks optimally approximate in the Kolmogorov-Donoho framework. We cannot apply the framework quite as is, because neural network approximation is something different, so we define an extension of Kolmogorov-Donoho approximation theory for neural networks. The neural network then acts like a dictionary, and we find the C's that neural networks approximate optimally. The answer to this question will be: essentially anything you can think of. That is a very strong universality result — neural networks are Kolmogorov-Donoho optimal for essentially any function class you can think of. Good, but I will need some more time to get there.

All right, so a back-of-the-envelope calculation of why the choices we made lead to optimality. Polynomial depth search means log m bits to represent the index of a participating element; you have m elements, so on the order of m log m bits to represent the indices of the participating dictionary elements. You quantize the coefficients by rounding to integer multiples of m to the minus gamma. Why does that cost log m bits per coefficient? Well, if the entire range is bounded, then dividing it into steps of size m to the minus gamma gives a constant times m to the gamma levels; take the log of that, and it's order log m again. With m coefficients, that's m log m bits to represent the quantized coefficients, and you can control the error incurred through this quantization. All right, and then you can show that there is an encoder-decoder pair reconstructing f from order m log m bits with epsilon scaling like m to the minus gamma. And if you have an optimal dictionary for a given compact set C, then m log m behaves like epsilon to the minus 1 over gamma star times a log factor, which we write as O tilde — this ignores the log — and you get the optimal exponent. OK, so that is the philosophy we are going to follow: we are interested in the dictionary D being replaced by neural networks, and then we ask which C's give you Kolmogorov-Donoho optimal approximation with neural networks.

OK, but before we get there, here is an incomplete summary of what is known in terms of optimality results for given pairings of dictionaries and function classes. So this is in Lp. We have Hölder classes, the bump algebra, bounded variation, Besov spaces, and modulation spaces as defined by Feichtinger and Gröchenig. These are the optimal dictionaries, and in square brackets you have suboptimal dictionaries. For example, for Besov spaces, wavelets are optimal and Fourier is suboptimal: with Fourier your gamma star is s minus one half, whereas with wavelets it is s. So it is really important, of course, to choose a matching dictionary.

As already indicated, at the end of the talk you will see the following. If you give me a function class — say any of those — or if you give me a data set, which I would view as you giving me a function class, then I need to know something about the structure of the data: is it in a Besov space, in a modulation space, what kind of smoothness do I have, and so on. Then I go and pick the optimal dictionary, and then I deploy my approximation algorithm. However, given what I said before, with a neural network I don't have to know anything about the structure of the data. I just deploy it with the optimal learning algorithm — which I cannot specify, and do not specify — and the neural network is going to find it; in some cases we can actually prove that it finds the optimal dictionary. Those are stylized cases, and I didn't want to put them up, because I don't think those results are strong enough to be shown here. But neural networks universally approximate all those function classes optimally, in the sense of Kolmogorov-Donoho.

So what do we need to do in terms of modifications if we want to carry this concept over to neural networks? We replace m-term approximation by m-weight approximation: the number of participating dictionary elements is replaced by the number of non-zero entries in the weight matrices, so the connectivity of the network plays the role of the number of participating terms. We have to encode the network topology and the quantized weights, and we will need to control the quantization-induced error. That leads us to the concept of the best m-weight approximation rate.
And again, we let the error decay like m to the minus gamma, look for the largest such gamma, and refer to it as the best m-weight approximation rate. Now, there were two ingredients in dictionary approximation. One was polynomial depth search — do we need an equivalent here? No, we don't. Why? Because the tree-like structure of the network already makes sure there is no problem there. How do you see this? Well, the total number of non-zero weights in the network cannot exceed the depth times (W squared plus W), obviously. So if the connectivity is m, neither the depth nor the width can scale faster than m, and therefore the number of possible weight locations scales at most like m cubed. Now, how many bits do you need to encode m locations of non-zero weights out of m cubed possible locations? You need log of (m cubed choose m), which is order m log m bits. So the tree-like structure naturally gives you what you had to enforce in dictionary approximation through polynomial depth search. That's one; we don't have to worry about it.

The second ingredient is the quantization of the weights, and there you do have to work; I'm not going to show you the details. The way you do it is to exploit the Lipschitz continuity of the nonlinearities and then derive error estimates. It's messy, but not very conceptual and not very difficult — just a lot of work — so I don't think it is interesting enough to be shown here.

How do we let the depth of the network scale? Inspired by the approximation of individual functions, we let it scale like log 1 over epsilon. We have different choices here: it cannot scale faster than m, the connectivity, and since our error is always proportional to m to the minus gamma, we let the depth grow polylogarithmically in m. OK, good. That, in summary, gives us best m-weight approximation subject to polylogarithmic depth and polynomial weight growth, and you can again do a back-of-the-envelope calculation; it yields Kolmogorov-Donoho optimal approximation.

All right, so the exponent we are interested in is gamma star effective, neural network, of C. Remember, in dictionary approximation we had gamma star effective of C comma D; now the dictionary D is replaced by the neural network — the neural network is now our dictionary — and therefore we have this quantity. We ask ourselves: is there a limit, an upper bound, on this quantity? It turns out there is, and it is the Kolmogorov exponent. So again you see that if you pick everything carefully, you can make it all work together, so as to give this exponent operational significance, in the sense that it is upper bounded by the Kolmogorov exponent. And not only that: you can also show that there are function classes for which gamma star effective, neural network, of C is equal to the Kolmogorov exponent. In fact, if you can find a function class for which you can prove this fails, I would be delighted to hear about it, because we couldn't.

All right. Before I go into showing you how we identified those function classes, let me turn things around a little and put what we have shown so far into perspective.
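A quick numerical aside before that, revisiting the index count from a moment ago: log of (m cubed choose m) really does grow like m log m. This is only an illustrative check, not part of the proofs.

```python
import math

# Bits needed to specify which m of the (at most) m**3 candidate weight locations
# are non-zero.  Both columns grow like m*log2(m) up to constants, so the network
# structure gives the O(m log m) index cost without any depth-search restriction.
for m in (10, 100, 1000):
    bits = math.comb(m**3, m).bit_length()     # = ceil(log2 C(m**3, m)), roughly
    print(m, bits, round(m * math.log2(m)))
```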
What you have seen so far essentially answers the following question: if you have a function class and your neural network is to approximate any element of the function class to within an error of at most epsilon, how fast does the connectivity of the network have to grow, at a minimum, for this to be possible? It turns out that the connectivity has to grow at least as fast as epsilon to the minus 1 over gamma star as epsilon goes to zero. In fact, there is what I call a strong converse — physicists would call it a phase transition — namely, if you exceed gamma star of C, then you violate this error bound infinitely often as a function of the connectivity. That is not surprising, because you cannot beat the covering-based converse. Good. All right. Clearly, in practice you use overparameterization and so on; this tells you what you minimally need, and again, we do not talk about training algorithms here.

OK. So let me tell you briefly how we prove that gamma star effective, neural network, of C equals gamma star of C for different function classes. Forget about the details you will see on the next three slides; the main idea is the following. The way we started was that we knew wavelets are optimal for Besov spaces, so we tried to navigate from known terrain. We asked: can we approximate the wavelet dictionary, which is optimal, with neural networks? The answer is yes. If you choose your mother wavelet so that it can be approximated well by a neural network, then what you need to know is how the other elements of the wavelet dictionary are generated: through dilations and translates, that is, affine transformations. Well, you have affine transformations in a neural network. You just have to make sure that the neural network does not destroy the optimality of the approximation of the mother wavelet, and you can show that it doesn't, because it exhibits an invariance property with respect to affine transformations and translations. That's it — it's all about invariances. So it is not so surprising that you can show, for affine dictionaries — not only wavelets, but also ridgelets, curvelets, shearlets, alpha-shearlets, whatever — that they are all Kolmogorov-Donoho optimally representable by neural networks.

What is more surprising is that — these affine dictionaries correspond to what I call the affine group — you can also do it for the Weyl-Heisenberg group, where translates are fine, but modulations are a very different group operation. The way we do it is always through a sandwiching argument. First we show what I just told you: if you approximate the dictionary by a neural network, the gamma exponent you get is at least as large as the one you would get in dictionary approximation — you can be at least as good. Then you use the Kolmogorov upper bound, and then you take the optimal dictionary for your function class, which is wavelets for Besov spaces. So gamma star effective, neural network, of C is sandwiched between gamma star of C and gamma star of C, and therefore you have optimality. I am not saying this is the only way to do it, but this is how we could do it. Then we went ahead and did it for Gabor dictionaries, where you have translates and modulations. And this is — if you remember, in the beginning I said we worked hard to get the sinusoidal approximation right — this is where you need that result to get the optimality.
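The sandwich argument, written out in my notation: if the dictionary D is Kolmogorov-Donoho optimal for C and can be emulated by networks without loss of rate, then

```latex
\gamma^{*}(\mathcal{C})
  \;=\; \gamma^{*,\mathrm{eff}}(\mathcal{C},\mathcal{D})
  \;\le\; \gamma^{*,\mathrm{eff}}_{\mathcal{NN}}(\mathcal{C})
  \;\le\; \gamma^{*}(\mathcal{C}),
```

so all the inequalities collapse to equalities and the networks achieve the Kolmogorov exponent for C.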
And also for the Weyl-Heisenberg group you can show that if you start with a generator that is well approximated by a neural network, and you then translate it and modulate it — shift it in time and in frequency — the optimality is preserved. You get optimal approximation of, for example, the modulation spaces as defined by Feichtinger and Gröchenig, where the optimal dictionaries are what are called Wilson bases. I understand Wilson was a solid-state physicist who won a Nobel Prize, but I'm sure some of you know this. OK, same arguments.

Good. So here is the whole list — this is what we could show. For all these pairings of optimal dictionaries and function classes, you can do all of it with neural networks, with one structure. So forget about what the data structure is: you don't have to choose and find the right dictionary. The neural network will learn it for you, if you have the optimal learning algorithm — which I don't, but if you do, it's going to find it. So this one structure has a very strong universality property: it finds all these optimal approximations with the minimum number of bits, so you are not paying a price for the universality either. Again, all this is granted that you know the function perfectly and you have this learning algorithm.

Now, two function classes that are hard to approximate classically — I mentioned this in the beginning: oscillatory textures and the Weierstrass function. Let me show you pictures. This is an oscillatory texture; it is defined as the cosine of a large constant times a function g of x, multiplied by a function h. This is hard to approximate because of its oscillatory nature. And the Weierstrass function I'm sure you know about — the fractal function, it looks like this. These functions are hard to approximate, and the best known results in classical approximation theory give you an error that decays polynomially in the number of parameters you use in your approximant. It turns out, though, that deep neural networks approximate both oscillatory textures and the Weierstrass function with exponential accuracy. This result was recently extended to more general fractals; in fact, one can show that whenever you can generate the fractal from an iterated function system, neural network approximation can be optimal.
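For concreteness, here is a truncated Weierstrass sum; the parameters a, b and the truncation K are my illustrative choices, not those from the slides.

```python
import numpy as np

def weierstrass(x, a=0.5, b=3.0, K=25):
    # Truncated Weierstrass series  sum_{k<K} a**k * cos(pi * b**k * x);
    # for 0 < a < 1 and a*b >= 1 the limit is continuous but nowhere differentiable.
    x = np.asarray(x, dtype=float)
    return sum(a**k * np.cos(np.pi * b**k * x) for k in range(K))

xs = np.linspace(0.0, 1.0, 2000)
w = weierstrass(xs)   # fractal-looking graph; classical approximants converge only polynomially
```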
So, to wrap up, the last item I want to show you is the case for depth, and I want to be very specific here — no philosophical discussions. What I want to show you is the following. Take f, a C-cubed function satisfying some additional condition. If you restrict your neural network to have a fixed depth L — let's call this a shallow network, because L is small — and you restrict the network to have a width that scales no faster than polylogarithmically in 1 over epsilon, then you cannot satisfy the error bound of f minus Phi being less than epsilon: the worst-case approximation error is going to exceed epsilon. The total number of parameters here is L times a polylog of 1 over epsilon. So with a shallow network — fixed depth, and width scaling no faster than polylog 1 over epsilon — you cannot do it.

However, there is a result in the two papers underlying this presentation showing that if you go deep and keep a finite width — namely 23 in this case, and again, forget about the details — you can do it. The only two quantities you should look at are the depth, which scales polylogarithmically in 1 over epsilon with exponent 2 here (this other factor is the size of the domain), and the width, which is 23. So with depth a constant times log squared of 1 over epsilon, you get the worst-case error below epsilon. So here is something that shallow networks cannot do and deep networks can: you stay with finite width and you go polylogarithmically deep.

All right, so in summary: deep networks give you exponential approximation accuracy for a very wide range of functions. Remember, we had this algebra — x squared, multiplication, polynomials — then smooth functions and oscillatory functions like sinusoids. They can, in principle, learn optimally vastly different function classes: universality, one structure optimal for all these different function classes. What mathematically underlies this is an invariance property of deep networks with respect to time shifts, scalings, and frequency shifts — the affine group and the Weyl-Heisenberg group — which means that, in principle, without knowing anything about your data, you can deploy this network and it will do as well as dictionary approximation would, provided you have the optimal learning algorithm and access to your function, that is, infinite amounts of data. These are the papers underlying all of this. Thank you very much.