Thank you. Sorry, there was a small technical issue with the connection; let me present from my iPad instead. Okay, sorry about that, and thanks for the invitation.

In this talk, I will describe a stylized model to answer the question: when do neural networks outperform kernel methods? This is joint work with Behrooz Ghorbani, Theodor Misiakiewicz, and Andrea Montanari over the past several years.

In the past two years, a line of work has considered approximating multi-layer neural networks by their linearizations; this is the so-called neural tangent kernel. We have a multi-layer neural network f(x; theta), where x is the input vector in d dimensions and theta contains all the parameters of the network. Typically the number of parameters, capital N, is much larger than the sample size. Consider the linearization of this function around a random parameter theta_0. Under certain conditions, the linear term dominates the whole function, and this motivates us to define the neural tangent model. The neural tangent model is a linear function of the parameter beta, but it is still nonlinear in the input feature x.

Now consider the coupled gradient flow dynamics of the neural network and the neural tangent model, where the coupling is through the initialization theta_0. A line of work established that, under proper initialization and when the number of parameters in the network is very large, the neural network is well approximated by its linearization along the whole training trajectory. This condition, proper initialization and overparameterization, is a little subtle in practice, and people are still arguing whether it holds.

So this line of work established that, along the training trajectory, the neural network can be well approximated by the neural tangent model. But what about the generalization properties of these two models? Several works experimentally compared the performance of neural networks and neural tangent models. In the first of these works, they considered CIFAR-10 experiments and showed that the neural tangent model can have about 23% test error while the neural network has about 10% test error, so there is a performance gap between the two models. The same group then performed experiments on some small datasets and showed that, on those datasets, the neural tangent model can sometimes generalize better than neural networks. Later works performed experiments, also on CIFAR-10, that closed much of the gap and brought the test error of the neural tangent model down to about 10%. The conclusion from these experiments is that sometimes there is a large performance gap, while sometimes the gap is small. In this talk, I will use a stylized model to answer, theoretically, when there is a large performance gap between these two models.

I will mainly focus on two-layer networks. A two-layer network is a summation of a_i times the activation function applied to the inner product of w_i and x. The a_i's and w_i's are the parameters to be optimized over in the two-layer network. If we consider the linearization of this two-layer network, we can derive two function classes. The first function class is the random features function class.
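As an illustrative aside, here is a minimal NumPy sketch of this linearization for a two-layer ReLU network. The function names, the choice of ReLU, and the scaling are my own illustration rather than the exact construction in the talk; the point is only that the tangent model is linear in the new parameter beta but nonlinear in the input x.

```python
import numpy as np

def two_layer_net(x, W, a):
    """f(x; theta) = sum_i a_i * relu(<w_i, x>)."""
    return a @ np.maximum(W @ x, 0.0)

def tangent_features(x, W, a):
    """Gradient of f with respect to all parameters (a, W) at theta = (W, a).

    These gradients are the features of the neural tangent model; they are
    nonlinear in x even though the tangent model is linear in beta."""
    pre = W @ x                                      # <w_i, x>
    grad_a = np.maximum(pre, 0.0)                    # d f / d a_i
    grad_W = (a * (pre > 0))[:, None] * x[None, :]   # d f / d w_i
    return np.concatenate([grad_a, grad_W.ravel()])

def tangent_model(x, W0, a0, beta):
    """First-order Taylor expansion of the network around theta_0 = (W0, a0):
    f(x; theta_0) + <grad_theta f(x; theta_0), beta>."""
    return two_layer_net(x, W0, a0) + tangent_features(x, W0, a0) @ beta

# Toy usage with a random initialization theta_0 and a small displacement beta.
rng = np.random.default_rng(0)
d, N = 20, 50
x = rng.normal(size=d)
W0 = rng.normal(size=(N, d)) / np.sqrt(d)
a0 = rng.normal(size=N) / np.sqrt(N)
beta = rng.normal(size=N + N * d) * 1e-3
print(tangent_model(x, W0, a0, beta))
```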
For the random features function class, the w_i's are random and fixed, and the a_i's are the parameters to be optimized over. There is another function class, which corresponds to the second term, where we fix the top-layer weights and linearize over the bottom-layer weights. We call this the neural tangent function class. In the neural tangent function class, the w_i's are random and fixed, and the b_i's are the parameters to be optimized over. Ideally, we would like to analyze the summation of these two function classes, but that is more technically challenging, and the result would be almost the same, so I will analyze the two function classes separately.

We use these three function classes, the neural network class, the random features class, and the neural tangent class, to approximate a target function, and we assume the data distribution follows the so-called spiked features model. In the spiked features model, the feature x is a d-dimensional vector, where d is very large, and it contains two parts: a signal part and a junk part. The signal part is x1, in d_s dimensions. We assume the signal dimension d_s is much smaller than the ambient dimension d; we take d_s to be d to the eta, where eta is between 0 and 1. The dimension of the junk feature is therefore d minus d_s. We assume the variances of the signal feature and the junk feature are different: the variance of the junk feature is 1, and the variance of the signal feature is fixed to be SNR_f. This SNR_f is called the feature signal-to-noise ratio, and we assume it to be d to the kappa, for kappa greater than or equal to 0. We call x1 the signal feature because we assume the response depends only on x1.

Here is a figure illustrating the spiked features model, with the signal feature x1 and the junk feature x2; the target function depends only on the signal feature x1. In the first panel, the variance of the signal feature is larger than the variance of the junk feature, so the feature SNR is greater than 1. In the second panel, the variance of the signal feature is the same as the variance of the junk feature, so the feature SNR is equal to 1.

In the spiked features model, there are several important parameters. The first is the feature SNR_f, equal to d to the kappa and assumed to be always greater than or equal to 1. Another important parameter is the signal dimension d_s. We also define the effective dimension, which is the maximum of the signal dimension and the ambient dimension divided by the feature SNR_f. By definition, the effective dimension is always greater than or equal to the signal dimension and less than or equal to the ambient dimension. The feature SNR and the effective dimension are tied together: a larger feature SNR induces a smaller effective dimension. In the following, I will show you how the feature SNR and the effective dimension affect the approximation power of the three function classes.

Our theoretical result characterizes the approximation error when a target function following the spiked features distribution is approximated by the three function classes. For the random features class and the neural tangent class, the approximation error depends on the effective dimension.
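Here is a small sketch, with names of my own choosing, of how one might sample from the spiked features model as just described, together with the effective dimension max(d_s, d / SNR_f):

```python
import numpy as np

def spiked_features(n, d, eta, kappa, rng=None):
    """Sample n feature vectors from the spiked features model.

    Signal part x1: d_s = d**eta coordinates with variance SNR_f = d**kappa.
    Junk part   x2: remaining d - d_s coordinates with unit variance.
    """
    rng = np.random.default_rng(rng)
    d_s = int(round(d ** eta))
    snr_f = d ** kappa
    x1 = rng.normal(scale=np.sqrt(snr_f), size=(n, d_s))   # signal features
    x2 = rng.normal(scale=1.0, size=(n, d - d_s))          # junk features
    return np.concatenate([x1, x2], axis=1), d_s, snr_f

def effective_dimension(d, d_s, snr_f):
    """d_eff = max(d_s, d / SNR_f); lies between d_s and d when SNR_f >= 1."""
    return max(d_s, d / snr_f)

X, d_s, snr_f = spiked_features(n=1000, d=1000, eta=0.5, kappa=0.5)
print(d_s, snr_f, effective_dimension(1000, d_s, snr_f))
```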
When the number of neurons is between d_eff to the ell and d_eff to the ell plus one, the approximation error of the random features model is approximately the squared norm of the target function projected onto the high-degree polynomials. Here P greater than ell is the projection orthogonal to the space of polynomials of degree at most ell, that is, the projection onto polynomials of degree greater than ell. We have a similar result for the neural tangent model, but with P greater than ell replaced by P greater than ell plus one. In contrast, if we consider the approximation error of neural networks, it depends on the signal dimension: when the number of neurons is between d_s to the ell and d_s to the ell plus one, the approximation error is upper bounded by the same quantity. Moreover, the approximation error of the neural network is independent of the feature SNR.

The theorem is a little hard to parse, so let me simplify its statement. To approximate a degree-ell polynomial in the signal feature x1, a neural network needs at most d_s to the ell parameters, whereas the random features model needs d_eff to the ell parameters and the neural tangent model needs d_eff to the ell minus one times d parameters. Because the effective dimension lies between the signal dimension and the ambient dimension, the approximation power of the neural network is greater than or equal to that of the random features model, which in turn is greater than or equal to that of the neural tangent model.

We consider two extreme cases (a small numerical illustration follows below). In the first extreme case, we have a low feature SNR: the variance of the signal feature is the same as the variance of the junk feature. In this case, the neural network needs at most d to the eta ell parameters, whereas both the random features model and the neural tangent model need d to the ell parameters. So the approximation power of the neural network is much larger than that of the random features and neural tangent models. In the other extreme case, we have a very high feature signal-to-noise ratio: the variance of the signal feature is much larger than the variance of the junk feature. In this case, the neural network still needs d to the eta ell parameters, whereas the number of parameters necessary for the random features and neural tangent models decreases: the random features model needs d to the eta ell parameters, and the neural tangent model needs a little more. So when the feature signal-to-noise ratio is high, the worst-case approximation power of the neural network is the same as that of the random features model, but still larger than that of the neural tangent model.

We conclude with a numerical simulation. This figure plots the approximation risk versus the log number of parameters; different colors stand for different feature signal-to-noise ratios, with red for a high feature SNR and blue for a low feature SNR. The dash-dot lines are for the neural networks, the dashed lines for the random features model, and the continuous lines for the neural tangent model. We can see that the power of the neural network is greater than the power of the random features model, which is greater than the power of the neural tangent model. If you look at the dash-dot lines, the curves for the neural networks, you can see that they almost collapse onto each other; that means the risk of the neural network is independent of the feature SNR.
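To make the two extreme cases concrete, here is a short script that evaluates the parameter counts from the simplified statement above, d_s to the ell for the neural network, d_eff to the ell for random features, and d_eff to the ell minus one times d for the neural tangent model. The particular values of d, eta, kappa, and ell are illustrative choices of my own, not numbers from the talk.

```python
def param_counts(d, eta, kappa, ell):
    """Rough parameter counts, as stated in the talk, needed to approximate
    a degree-ell polynomial of the signal feature x1."""
    d_s = d ** eta                  # signal dimension
    snr_f = d ** kappa              # feature SNR
    d_eff = max(d_s, d / snr_f)     # effective dimension
    return {
        "neural network": d_s ** ell,
        "random features": d_eff ** ell,
        "neural tangent": d_eff ** (ell - 1) * d,
    }

d, eta, ell = 10_000, 0.5, 2
print("low feature SNR (kappa = 0): ", param_counts(d, eta, 0.0, ell))
print("high feature SNR (kappa = 1):", param_counts(d, eta, 1.0, ell))
```

With these illustrative numbers, the neural network count is unchanged across the two regimes, the random features count matches it only in the high-SNR regime, and the neural tangent count stays larger, mirroring the worst-case comparison above.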
If we look at the curves for the random features and neural tangent models, we can see that a larger feature SNR induces a larger approximation power for the random features and neural tangent models. We have similar results for the generalization error: when the feature SNR is low, the potential generalization power of the neural network is much larger than that of the kernel methods; when the feature signal-to-noise ratio is high, the generalization power of the neural network is, in the worst case, equivalent to that of the kernel methods.

Our theoretical result implies that, if the data distribution follows the spiked features model, then adding synthetic noise to the features will decrease the feature SNR, and the performance gap between neural networks and kernel methods becomes larger. We performed some numerical simulations on real datasets, here on the FMNIST dataset. Our underlying assumption is that the labels of these images depend primarily on the low-frequency components of the images and are independent of the high-frequency components, so that the images approximately satisfy the spiked features model. We look at the test error of neural networks and kernel methods as synthetic noise of varying strength is added to the images (a rough sketch of this kind of noise injection appears at the end of this transcript). We can see that when the noise strength decreases, which means the feature SNR increases, the classification error decreases for all methods, but the decrease is slowest for the neural network, whose error is the least affected by the noise level. So this simulation gives some evidence that the FMNIST dataset, an image dataset, approximately satisfies the spiked features model.

Let me conclude the message of my talk. In the spiked features model, we derived a controlling parameter for the performance gap between neural networks and kernel methods, and this parameter is the feature SNR. I would like to remark that this feature SNR is different from the label signal-to-noise ratio. For a small feature SNR, there is a large separation; for a large feature SNR, the performance of neural networks and kernel methods is closer. Somewhat implicitly, neural networks first find the signal features and then perform a kernel method on top of these features; this implication deserves further exploration to make it rigorous. This concludes my talk. Thanks for your attention.
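For completeness, here is the rough sketch of synthetic high-frequency noise injection referred to in the FMNIST discussion. The exact scheme used in the experiments is not spelled out in the talk, so the FFT-based construction, the function name, and the parameters below are assumptions of mine, meant only to illustrate perturbing the junk (high-frequency) part of the images while leaving the signal (low-frequency) part untouched.

```python
import numpy as np

def add_high_freq_noise(images, noise_std, keep_radius):
    """Add Gaussian noise to the high-frequency Fourier components of images.

    images: array of shape (n, h, w). Frequencies within keep_radius of the
    origin are treated as 'signal' and left untouched. This particular scheme
    is an illustrative assumption, not necessarily the construction used in
    the experiments discussed in the talk.
    """
    n, h, w = images.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    high_freq = np.sqrt(fy ** 2 + fx ** 2) > keep_radius    # junk frequencies
    spectra = np.fft.fft2(images)
    noise = noise_std * (np.random.randn(n, h, w) + 1j * np.random.randn(n, h, w))
    spectra = spectra + noise * high_freq                    # perturb junk part only
    return np.real(np.fft.ifft2(spectra))

# Toy usage on random 28x28 "images"; larger noise_std means lower feature SNR.
noisy = add_high_freq_noise(np.random.rand(8, 28, 28), noise_std=5.0, keep_radius=0.2)
print(noisy.shape)
```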