Here, Yuri. All right, excellent. I'll share my screen. Can everyone see it? I can see it, yes. OK.

Hi, everyone. I'm a fifth-year PhD student at MIT, advised by Guy Bresler. I'll be talking about my recent work, "A Corrective View of Neural Networks: Representation, Memorization, and Learning." As we all know, neural networks are universal approximators: given any continuous function over a compact domain, we can approximate it to arbitrary accuracy in the L2 metric. We introduce a novel mathematical tool which allows us to obtain sharp bounds on the number of neurons required for representation: given ε, the error we want to achieve, what is the number of neurons required for different function classes? We obtain state-of-the-art memorization results, that is, bounds on how many neurons are required to exactly interpolate labels on arbitrary data points. And we also give subpolynomial bounds on the number of neurons required to learn low-degree polynomials via stochastic gradient descent (SGD) or gradient descent.

First, I'll briefly describe the results we obtain regarding memorization. We all know that neural networks can memorize arbitrary labels over arbitrary data very easily, with very few neurons. A long line of papers aims to understand why this happens, that is, why slightly overparameterized networks are able to efficiently memorize these labels, via the study of stochastic gradient descent or gradient descent on the data points. We set up the problem as follows: given n arbitrary points on the d-dimensional sphere, how many neurons are necessary to memorize their labels? We assume that the minimum distance between the points is θ, and the maximum error we can tolerate is ε. We show, via our techniques, that two-layer ReLU networks require Õ((n/θ⁴) log(1/ε)) nonlinear units, which are ReLU units. This is optimal in n for two-layer ReLU networks, and ours is one of the first works to achieve it via gradient descent. We note that the optimal count, without any overparameterization, would be n/d nonlinear units.

Here is a comparison of the various results. We see that the older results require a very high polynomial dependence on n and d, and some of them even have extra factors, possibly exponential in d or n, which are not enumerated here. The two other works that come very close to ours are the one by Amit Daniely, which shows that when n is only polynomially large in d and the data is uniformly distributed over the sphere, you need no overparameterization: you can memorize with n/d neurons. And Kawaguchi and Huang, who show in a similar setup that you need Õ(n) neurons to memorize. In our work we have a similar result, Õ(n) ReLU units required to memorize arbitrary data, but note that n can even be exponentially large in d in our case.

Coming to our results on representation: we want to approximate a real-valued function f over R^d, on a domain which is a ball of radius r, with respect to squared error, using a neural network. The technical condition we impose on f is a sort of smoothness condition: if F is the Fourier transform of f, then the tail of F must decay at least fast enough that |F(ξ)| ‖ξ‖^(ad) is integrable.
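In symbols, one plausible rendering of this condition, together with the rate stated next, is the following sketch; the exact constant and the precise form of the error functional are my assumptions, not a quote from the paper:

    \int_{\mathbb{R}^d} |F(\xi)| \, \|\xi\|^{ad} \, d\xi < \infty
    \quad \Longrightarrow \quad
    \min_{\hat f} \; \mathbb{E}\big[ (f(X) - \hat f(X))^2 \big] \;\le\; \frac{C(a, d, r)}{n^{a}},

where the minimum is over two-layer networks f̂ with n ReLU or smooth-ReLU units, and X ranges over the ball of radius r.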
Roughly speaking, because of the connection between the decay of the Fourier transform and the smoothness of a function, this means that f has Θ(ad) bounded derivatives. We show that under this condition there exists a two-layer network with n nonlinear units of the ReLU and smooth-ReLU kind such that the squared error between f and f̂, where f̂ is the output of the neural network, scales as 1/n^a. So what does a mean here? If f has Θ(ad) bounded derivatives, we can achieve a 1/n^a rate; that's what a is.

This may sound bad because of the constant, which might depend on the dimension. But in a lot of cases we have functions with specialized structure, where there is low-dimensional structure even though the function is high-dimensional, for instance low-degree polynomials. In that case we can replace d in all these instances with q, and q can be much smaller than d, for instance 2 or 3. In that case the result really shines.

Previous results use Taylor series approximation for such functions: if you have Θ(ad) bounded derivatives, you implement the Taylor series expansion by implementing addition and multiplication with complex deep networks. They, too, achieve a 1/n^a kind of squared error. But they use complex deep networks with no known training results, whereas we show similar results under similar conditions with a two-layer network, which can be trained very easily.

An application of our representation result is learning low-degree polynomials. Consider, let's say, degree-q polynomials over d-dimensional input. With a suitable random feature sampling, we can learn this class of functions via gradient descent up to error ε with d^q · subpoly(1/ε) neurons, which are of the ReLU and smooth-ReLU kind. These are the first subpolynomial learning bounds.

I have just described the main results of our work; let me go into some technical content. The main idea behind all these results is the corrective mechanism. We first divide all the available nonlinear units into multiple groups. The first group approximates the function under consideration. The second group approximates the error produced by the first group and corrects it. The third approximates and corrects the error produced by the first two groups, and so on. Under certain conditions, after a number of corrective steps, we get a rate of 1/n^a, as claimed. And we can do a number of corrections for exactly the class of functions which have Θ(ad) bounded derivatives.

This gives purely representational results. How do we go from representation results to learning results? We do this by random feature sampling. For instance, consider a two-layer network, that is, one with one linear layer and one nonlinear layer: f̂(x) = Σᵢ κᵢ σ(⟨wᵢ, x⟩ + tᵢ). Here the κᵢ are the recombination weights, and the wᵢ and tᵢ are the inner weights. We draw wᵢ and tᵢ at random from some tractable distribution; for instance, wᵢ could be uniform over the sphere in d dimensions, and tᵢ could be uniform over [−1, 1]. Having drawn them like this, we then optimize over the κᵢ only. This reduces a nonconvex optimization problem, which can be very hard, to a smooth convex optimization problem, which can be easy to solve.
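As a concrete sketch of this recipe in code (my illustration, not the authors' implementation: the sampling distributions match the examples just mentioned, and I solve the convex problem in the κᵢ by ridge-regularized least squares, which gradient descent on the κᵢ alone would also reach):

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def fit_random_features(X, y, n_neurons, ridge=1e-6, seed=0):
        """Two-layer net with frozen random inner weights (w_i, t_i);
        only the outer recombination weights kappa are trained."""
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        W = rng.standard_normal((n_neurons, d))
        W /= np.linalg.norm(W, axis=1, keepdims=True)   # w_i uniform on the sphere
        t = rng.uniform(-1.0, 1.0, size=n_neurons)      # t_i uniform on [-1, 1]
        Phi = relu(X @ W.T + t)                         # random ReLU features
        # Convex (ridge) least squares in kappa:
        kappa = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(n_neurons), Phi.T @ y)
        return lambda Z: relu(Z @ W.T + t) @ kappa      # the trained network

Freezing the wᵢ and tᵢ is exactly what makes the problem convex in κ; that is the whole point of the reduction.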
It has been shown theoretically that SGD for neural networks with a large number of neurons approximately reduces to this, because over a large number of training steps the inner weights do not change appreciably. So the basic modus operandi of our work is as follows: we pick wᵢ and tᵢ from some tractable distribution which does not assume knowledge of f. We show via the probabilistic method that there exist recombination weights κᵢ⁰ which achieve an error of at most ε. And the random-features optimization, that is, the linear regression over the κᵢ, gives κᵢ* which must do at least as well as κᵢ⁰; that is, it must produce an error of at most ε.

We will first apply this to a toy problem, memorization, which is a very simple setting in which to explain our correction phenomenon. Suppose we have labeled data points (x₁, y₁), ..., (xₙ, yₙ), where xᵢ is a point on the sphere in d dimensions and yᵢ is the corresponding arbitrary label. We do not assume any structure on the labels or on the data; they are completely arbitrary, as long as the points are on the sphere and the minimum distance between them is θ.

We construct the discrete Fourier transform of the labels as shown here, and we take ψ to be a normally distributed random vector, with σ of the order √n/θ, so that there are some cancellations: when we apply this inverse Fourier trick, it approximately gives you yⱼ and cancels out the contribution of the rest of the points. Doing this approximate inverse Fourier transform, we get that yⱼ equals this quantity, which is actually an expectation over one-dimensional cosine functions. Now we do the standard trick, introduced by Barron and used classically in the literature, where we replace the cosine function with an additional expectation over ReLUs by introducing a new random variable T, uniform in [−2, 2]. This is just replacing the cosine function with an expectation over ReLUs, and it follows from an integration-by-parts formula.

Now that we have a formula for each yⱼ in terms of an expectation over ReLUs, we can construct an empirical estimator: we draw ψₖ and Tₖ from this distribution and average. Via a Gaussian concentration inequality, we have the following contraction on the error in the labels. Here ŷ₁ is the prediction of our neural network for y, y is the label vector, both seen as n-dimensional vectors, and n₀ is the number of neurons used in the first step, the first group of neurons. When n₀ is larger than Õ(n/θ⁴), there is a contraction of ‖y − ŷ₁‖ relative to ‖y‖.

Now, y − ŷ₁ is just another set of labels over the same points, so we can replace y with y − ŷ₁ and estimate it with ŷ₂. That is the first level of correction, and the error after two correction steps contracts by the square of the same factor. We do not have to stop after two steps; we can continue for O(log(n/ε)) steps, and we show that the error is then at most ε.
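Here is a minimal numerical sketch of this corrective loop (again my illustration, not the paper's construction: each group of n₀ neurons is fit to the current residual by plain least squares over fresh random ReLU features, standing in for the Fourier-based estimator):

    import numpy as np

    rng = np.random.default_rng(1)

    def relu(z):
        return np.maximum(z, 0.0)

    def fresh_relu_features(X, n0):
        """One new group of n0 random ReLU units."""
        d = X.shape[1]
        W = rng.standard_normal((n0, d))
        W /= np.linalg.norm(W, axis=1, keepdims=True)
        t = rng.uniform(-1.0, 1.0, size=n0)
        return relu(X @ W.T + t)

    # n arbitrary labeled points on the sphere in d dimensions
    n, d, n0 = 200, 10, 100
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = rng.standard_normal(n)

    prediction = np.zeros(n)
    for step in range(12):                        # each step adds one corrective group
        residual = y - prediction                 # treat the residual as new labels
        Phi = fresh_relu_features(X, n0)
        kappa, *_ = np.linalg.lstsq(Phi, residual, rcond=None)
        prediction += Phi @ kappa                 # the groups' outputs are summed
        print(step, np.linalg.norm(y - prediction) / np.linalg.norm(y))

The printed relative error should decay geometrically, which is the contraction described above; the number of corrective steps needed to reach error ε then scales like log(1/ε).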
From this equation, we conclude that memorization requires Õ((n/θ⁴) log(1/ε)) activation functions, all of them ReLU.

Now, when we come to the representation theorems, we are dealing with continuous domains. The high-level idea is the same as for memorization, but there are some technical complications because the underlying measure is infinite, and we have to make the appropriate modifications to use the same techniques. One of the first things we do is consider not just ReLUs but ReLUs and smooth ReLUs. In particular, we construct SReLUₖ activation functions, which are the same as ReLU outside a neighborhood of 0 but are 2k times continuously differentiable; the construction is very specialized. We do this because the Fourier transform of the ReLU is not very well behaved, but the Fourier transform of the smooth ReLU has the right kind of decay, so that we can do the successive approximation trick.

The main idea behind the representation theorems is as follows. We first take the target function f and approximate it by f̂₁, which is a two-layer SReLUₖ network with n activation functions. We show, via classical techniques introduced by Barron, that the squared error is of the order C_f²/n, where C_f is a norm on the Fourier transform of f. After certain modification arguments, we show that the Fourier transform of a certain extension of f̂₁ is a pointwise unbiased estimator of the Fourier transform of f. Therefore, we can show that, roughly, the Fourier norm of the error function f − f̂₁ contracts by a factor of 1/√n.

[Moderator: Dheeraj, one minute.] Yeah, can I go ahead? [Yes, just one minute left.] Sorry, yeah. So the Fourier norm of this error contracts, and therefore we can approximate f − f̂₁ by f̂₂, and we get squared error of order 1/n². Continuing this way, for a steps, we get a rate of 1/n^a. After each corrective step, the remainder becomes less and less smooth, until further approximation is impossible.

We apply these learning results to learning low-degree polynomials of degree q over R^d. The state-of-the-art learning result requires poly(1/ε) neurons, but purely representational results with complex deep networks require only polylog(1/ε) neurons. Our learning result shows that with subpoly(1/ε) neurons we can actually learn these polynomials using SGD or GD with a two-layer network. So this bridges the gap between learning and representation results. This is just an application of the representation results: we show that polynomials are infinitely differentiable, so we can apply the correction step an arbitrary number of times by paying the cost of a higher constant. And we show that if we take n to be ε^(−1/a), for arbitrary a, we can achieve an error of at most ε with that many neurons; letting a tend to infinity slowly enough as ε tends to 0, we get a subpolynomial bound. Thank you. All right.
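To spell out that last step, here is the arithmetic with one illustrative choice of a(ε); the schedule a(ε) = √(log(1/ε)) is my example of "tending to infinity slowly enough," not necessarily the paper's choice:

    n \;=\; \epsilon^{-1/a}
      \;=\; \exp\!\Big( \tfrac{1}{a} \log \tfrac{1}{\epsilon} \Big)
      \;\stackrel{a = \sqrt{\log(1/\epsilon)}}{=}\;
      \exp\!\Big( \sqrt{ \log \tfrac{1}{\epsilon} } \Big),

which grows more slowly than (1/ε)^c for every fixed c > 0, since exp(√L) is eventually smaller than exp(cL) with L = log(1/ε) → ∞; hence the neuron count is subpolynomial in 1/ε.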