he will talk about the frequency principle in deep learning and how to relate it to physics. So please, Chen. Okay, can you help me, Hajin? Yes. Okay, thanks. First I want to thank Hajin for inviting me and giving me this opportunity to present our recent work. This is ongoing work, so any comments are very welcome. Okay, can you hear me? Yes. Okay, good. First I will give some takeaway information. I will talk about the frequency principle, which says that deep neural networks prefer low frequencies. Based on this understanding, we designed a multi-scale deep neural network. Am I muted or not? No, it's okay. You can hear me, right? Yes. Okay, I will continue. Second, I will talk about a phase diagram for two-layer neural networks, and try to understand deep neural networks from a systematic picture.

Now let's start with the macroscopic perspective. As John von Neumann famously said, with four parameters you can fit an elephant, and with five you can make him wiggle his trunk. So traditionally we always try to fit a model with as few parameters as we can, but this may not be true in deep learning. Let's take an example. Traditionally, if you fit training data like these blue points with a fourth-order polynomial, you cannot fit the data perfectly, but you can recover the true model quite well. If you increase the complexity of your model, say to a degree-15 polynomial, you can fit the training data very well, but you see oscillations: it overfits the data significantly. So in traditional learning theory, increasing the model complexity decreases the training error but increases the test error. That is why von Neumann said you should use as few parameters as you can. But deep learning poses a puzzle: even when the number of parameters is much, much larger than the number of training samples, deep neural networks seem to generalize well, and this is supported by a lot of experiments. I won't go into those experiments; instead we try to use a macroscopic perspective to understand why deep neural networks do not overfit even with so many parameters.

Here is an example. Suppose we are fitting a target function, denoted by this red curve, with training data denoted by these blue points. After training, the deep neural network fits the target function well, as we can see here. But what we care about is how the network fits the training data, so we want to see the training dynamics. Here is a movie of the dynamics. Before I start it, let me describe the curves: the red curve is the target function, and the blue one is the network output, starting from its output at initialization. Now let's watch the movie. As we can see, after some training steps the network captures the coarse landscape; after some time, details emerge; and as training continues, it converges to the target function.
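To make this concrete, here is a minimal PyTorch sketch of such an experiment (not the speaker's actual code; the target function, network size, and training schedule are illustrative assumptions):

```python
# Fit a 1D target with a small fully connected network and snapshot the
# output during training: the coarse profile appears before the details.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-1, 1, 201).unsqueeze(1)
y = torch.sin(2 * torch.pi * x) + 0.3 * torch.sin(10 * torch.pi * x)  # low + high frequency

net = nn.Sequential(nn.Linear(1, 200), nn.Tanh(),
                    nn.Linear(200, 200), nn.Tanh(),
                    nn.Linear(200, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

snapshots = {}
for step in range(5001):
    loss = ((net(x) - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    if step in (0, 100, 1000, 5000):          # frames of the "movie"
        snapshots[step] = net(x).detach().squeeze()

# Frequency-domain view: indices 2 and 10 are (approximately) the FFT bins
# of the two target frequencies for 201 samples on [-1, 1].
Y = torch.fft.rfft(y.squeeze()).abs()
for step, out in snapshots.items():
    A = torch.fft.rfft(out).abs()
    print(step, (A[2] / Y[2]).item(), (A[10] / Y[10]).item())
# Expectation: the low-frequency ratio approaches 1 long before the high one.
```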
So the deep neural network seems to capture the coarse landscape first, and then more and more details. To quantify this effect, it is natural to use Fourier analysis, so we perform Fourier analysis of the target function and of the network output. Here the red dashed curve is the FFT of the target function; it has many peaks because the target has sharp transitions, so it is not smooth. The network's FFT is the blue line. This axis is the frequency and this one is the amplitude. As training goes on, the network first captures the first peak, then the second, the third, the fourth, and so on toward higher frequencies. So we observe that the network captures the low-frequency components first, while the high-frequency components remain largely unfitted at the start, and then it gradually captures higher and higher frequencies. We call this the frequency principle, or F-Principle. The very simple idea is that deep neural networks prefer low frequencies. We use this to explain why there are not so many oscillations in the network output even though the number of parameters is much, much larger than the number of training samples.

In this talk I also want to show how to utilize this understanding to design deep neural network structures that fit high frequencies faster. In some problems, such as in computational physics, the high frequencies are quite important. So let's go through this application. Consider a Poisson equation with a Dirichlet boundary condition. You can solve this problem with a very simple central finite-difference scheme, which gives a linear system Au = g, and you can solve this linear system with, say, Jacobi iteration or Gauss-Seidel iteration. If you analyze this method carefully, you find that the low frequencies converge much more slowly than the high frequencies. This is very different from deep neural networks.

Now let's see how a deep neural network solves such a partial differential equation. Solving this PDE is equivalent to solving the following variational problem: the minimizer of this functional is the solution of the Poisson equation, and this part is a regularization term that enforces the boundary condition. We parameterize u(x) by a deep neural network whose input is x and whose output is u(x), so we only need to tune the parameters to minimize this variational form.

Now let's look at the convergence of the different frequency components. Here is an example: g(x) is the right-hand side of the Poisson equation −Δu = g, and we choose this g so that it has some high-frequency components. We track three frequency components: one high, one medium, one low. This is the Jacobi method, this axis is the iteration number, and the color indicates the relative error of each frequency component; a dark color means the relative error is very small. From this picture we can see that the high frequency converges very quickly, while the low frequency converges extremely slowly in comparison. A small sketch of this contrast appears below.
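Here is a small NumPy sketch (illustrative, not taken from the talk) of Jacobi iteration on the 1D Poisson problem −u″ = g with zero Dirichlet boundary values and a three-mode right-hand side:

```python
import numpy as np

n = 127                                   # interior grid points, spacing h = 1/(n+1)
h = 1.0 / (n + 1)
xs = np.linspace(h, 1.0 - h, n)
ks = [1, 8, 32]                           # low, medium, high frequency
g = sum(np.sin(k * np.pi * xs) for k in ks)

# Exact solution of the *discrete* system, mode by mode: the discrete
# Laplacian has eigenvalue lam_k = 2(1 - cos(k*pi*h)) / h^2 on mode k.
lam = {k: 2.0 * (1.0 - np.cos(k * np.pi * h)) / h**2 for k in ks}
u_star = sum(np.sin(k * np.pi * xs) / lam[k] for k in ks)

u = np.zeros(n)
for it in range(1, 2001):
    left = np.concatenate(([0.0], u[:-1]))    # Dirichlet boundary u = 0
    right = np.concatenate((u[1:], [0.0]))
    u = 0.5 * (left + right + h * h * g)      # one Jacobi sweep
    if it % 500 == 0:
        rels = []
        for k in ks:
            mode = np.sin(k * np.pi * xs)
            coef = 2.0 / (n + 1) * np.dot(u - u_star, mode)  # error in mode k
            rels.append(abs(coef) * lam[k])   # relative to exact amplitude 1/lam_k
        print(it, ["%.3f" % r for r in rels])
# The k = 32 error dies out almost immediately; the k = 1 error barely moves.
```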
For a deep neural network, however, this behavior is completely different: the lowest frequency converges fastest, then the second, then the third. Now, deep neural networks are being applied to many such differential equation problems because of their potential for high-dimensional problems, but they suffer from this curse of high frequency. How do we solve this problem? Our idea is very naive, very simple: just rescale the function. For example, if f is the original function, you do a rescaling in the Fourier domain. Suppose the function has a frequency component at αk with α larger than one. Rescaling gives another function whose corresponding frequency is k, smaller than in the original function. So we only need to fit the lower-frequency function, and then rescale back to recover the high-frequency one. This is stated in the Fourier domain, but it is completely equivalent to a dilation performed in the spatial domain, just in the opposite direction, applied to the rescaled form.

Since each rescaled component can be approximated by a deep neural network, we can use many subnetworks to fit the different frequencies: an ansatz of neural networks, where h_i is one subnetwork. So we construct a structure called the multi-scale structure, which consists of many subnetworks. Each subnetwork receives the input x multiplied by a scale factor: x, 2x, and so on up to nx. The subnetworks have no connections between each other, which saves computational cost. At the end, you simply sum the outputs of all the subnetworks to get the output, and you can use a procedure similar to the previous one to solve, say, the Poisson equation, the Poisson-Boltzmann equation, and so on. We have done a lot of experiments, and we found that the multi-scale network converges roughly uniformly across frequencies. And it is mesh-free: you don't need to build any mesh, so it can deal with very complex domains, and it is very easy to implement, since you only need a slightly modified network structure. A minimal sketch of this structure follows.
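Here is a minimal PyTorch sketch of that structure (the subnetwork depth, width, activation, and the integer scales 1, …, n are illustrative assumptions; the published MscaleDNN also considers other scale and activation choices):

```python
import torch
import torch.nn as nn

class SubNet(nn.Module):
    """One subnetwork h_i: a plain fully connected network."""
    def __init__(self, dim_in, width=200):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_in, width), nn.Tanh(),
                                 nn.Linear(width, width), nn.Tanh(),
                                 nn.Linear(width, 1))
    def forward(self, x):
        return self.net(x)

class MscaleDNN(nn.Module):
    """Sum of independent subnetworks, the i-th one fed the scaled input i*x."""
    def __init__(self, dim_in, n_scales=5, width=200):
        super().__init__()
        self.scales = list(range(1, n_scales + 1))
        self.subnets = nn.ModuleList(SubNet(dim_in, width) for _ in self.scales)
    def forward(self, x):
        # No connections between subnetworks: each one only has to fit a
        # downscaled (hence lower-frequency) piece of the target.
        return sum(net(s * x) for s, net in zip(self.scales, self.subnets))

model = MscaleDNN(dim_in=2)   # e.g. for the 2D Poisson example below
```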
Here is an example comparing the multi-scale network with a fully connected network. In the normal case we use a network with 1,000 neurons in each hidden layer; for the multi-scale case we use five subnetworks, each with 200 neurons per hidden layer. So in terms of the number of neurons the two are the same, but the multi-scale one has far fewer connections, so the computational cost is much smaller. In this 2D example, this is the exact solution, which has oscillations on many different scales, many different frequencies. The normal network can only learn the low frequencies: as mentioned, by the frequency principle it prefers low frequencies, capturing them very quickly but doing poorly at high frequencies. You can see that inside this circle and in these corners, compared with the true solution, the normal network's output is much smoother; it cannot faithfully reconstruct the oscillations. But in the multi-scale case, all these oscillations are well reconstructed, well captured. Now let's look at the error during training. This is the L2 error: for the normal network it decreases much more slowly than for the multi-scale one. We have performed extensive experiments in 2D, in 3D, on the Poisson-Boltzmann equation, and so on; if you are interested, you can look up the paper, which came out very recently.

So now we have this macroscopic perspective: the neural network output is learned in an order, from low frequency to high frequency. This sheds some light on the generalization ability and can help us design network structures. Now let's try to understand why this behavior occurs; let's go to a microscopic perspective and understand more. There are many related works trying to understand this frequency principle; let me present one.

Take a two-layer neural network. This h is a two-layer network; you can see there is a prefactor, a scaling factor; a is the output weight, w is the input weight, and b is the bias. This is slightly different from the commonly used form, but the analysis goes through almost identically in the normal case; the scaling, however, is important. We consider sufficiently large m. This is easy to understand: just as in statistical mechanics, when you consider an ensemble, you need to take a thermodynamic limit. In this setting we study the problem from first principles, meaning the gradient descent of each parameter; that is why we call this the microscopic perspective.

We can then derive the dynamics of the network output in frequency space. This is the Fourier transform of the network output, ξ is the frequency, and this is the coefficient. This coefficient has very interesting properties. The subscript ρ means multiplication by the sampling density; for a discrete sample set, ρ is just a sum of delta functions. These dynamics stop when the network output at the training points equals the training labels, so this is equivalent to our training process.

Now let's look at this coefficient; it is very interesting. The black curve is its mean with respect to the initial parameter distribution, plotted against frequency. As the frequency increases, the coefficient decreases, and this reflects the frequency principle. One more interesting thing: the long-time limit of this ODE is equivalent to the solution of a minimization problem, and this minimization problem gives us a lot of information. I won't go into the details of the proof, but it is actually very simple; it is essentially kernel regression. There is an inverse here, so the coefficient, with its powers of ξ, goes into the denominator, and the resulting weight increases as the frequency increases. The penalty is therefore larger for high-frequency components: among all solutions that fit the training data, the network finds one with as little high-frequency content as it can. It prefers low frequency. This also gives a very intuitive picture of how the network interpolates the training data: if you keep only one of the two terms in the coefficient (the more strongly decaying one), the minimizer is exactly a cubic spline; if you keep only the ξ⁻² term, it is a linear spline.
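Schematically, and as a hedged reconstruction rather than the exact statement (the precise coefficient γ²(ξ) is derived in the speaker's papers; the decay rates below are assumptions consistent with the 1D ReLU case discussed here):

```latex
% Gradient-flow dynamics of the network output in frequency space:
\partial_t \hat{h}(\xi,t) = -\,\gamma^2(\xi)\,\widehat{(h-f)\rho}\,(\xi),
\qquad
\gamma^2(\xi) \sim \frac{c_1}{\xi^{2}} + \frac{c_2}{\xi^{4}},

% and its long-time limit solves the constrained minimization
\min_{h}\ \int \frac{\big|\hat{h}(\xi)-\hat{h}_{\mathrm{ini}}(\xi)\big|^{2}}{\gamma^{2}(\xi)}\,\mathrm{d}\xi
\quad\text{s.t.}\quad h(x_i)=y_i \ \text{for all training pairs } (x_i,y_i).
```

Since 1/γ²(ξ) grows with ξ, deviations from initialization at high frequency are penalized most; keeping only the c₂/ξ⁴ term gives cubic-spline interpolation, and only the c₁/ξ² term gives linear-spline interpolation, as just described.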
So different parameter setups give different interpolation results, and it is therefore very important to study the different parameter regimes. We are going to draw a phase diagram for neural networks; as a first step, we do it for two-layer networks.

Take the example of the phase diagram of water. As we know, with coordinates like pressure and temperature you can draw a phase diagram for water, and we benefit a lot from this phase diagram: this region is solid and this region is liquid, and within a region there are dynamic similarities. If you know one point in the liquid region and you have another point in the liquid region, you know they must share similarities. That is the starting point, the motivation, for drawing a phase diagram for neural networks.

We consider a simple model, the two-layer ReLU network, in a very general form: there is a scaling prefactor, the input and output weights are w_k and a_k, and you can consider the high-dimensional case. At initialization, a_k is drawn from a Gaussian distribution with standard deviation β₁, and w_k with standard deviation β₂, and we train the network with the mean squared error. Why do we say this form is very general? Because if you take α = √m, that is the NTK scaling, which is very well studied; in the previous example we used this case. And if α = m, this is the mean-field scaling. These two have been very popular settings in the theoretical study of neural networks over the last few years.

To study this model, we need to do some normalization, a non-dimensionalization. To understand this, think about why we define the Reynolds number: you can have different scales, different parameters, and yet dynamic similarity; if two flows have the same Reynolds number, you know they share dynamic similarity. It is a similar idea here. We know a and w are initialized from Gaussian distributions with β₁ and β₂, so to derive a normalized model we rescale: a_k is rescaled by β₁⁻¹ and w_k by β₂⁻¹, and we can also rescale time. After this rescaling we obtain a normalized model, so for any model we can compute its normalized a and normalized w. What we care about is the long-time solution of the network. There are three scales here, for a, w, and t, but the scale of t is not important because we only care about the final, well-trained state. So there are only two independent coordinates for the final solution. From the gradient-descent dynamics, we can see there are only two important quantities: one is β₁β₂/α, which we call κ, and the other is β₂/β₁, which we call κ′. We use these two as indices.
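In symbols, the model and the two effective quantities just described read (a hedged reconstruction; the published phase-diagram paper may use slightly different conventions):

```latex
h_\theta(x) = \frac{1}{\alpha}\sum_{k=1}^{m} a_k\,\sigma\!\big(w_k^{\top}x\big),
\qquad
a_k \sim \mathcal{N}\big(0,\beta_1^{2}\big),\quad
w_k \sim \mathcal{N}\big(0,\beta_2^{2} I_d\big),

\kappa = \frac{\beta_1\beta_2}{\alpha},
\qquad
\kappa' = \frac{\beta_2}{\beta_1}.
```

For the NTK scaling α = √m with β₁ = β₂ = 1 this gives κ = m^{-1/2}, while the mean-field scaling α = m gives κ = m^{-1}; both vanish as m grows, which is why the next step replaces κ and κ′ by their decay rates.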
However, as we let the neuron number m go to infinity, κ itself goes to infinity or to zero, so it may not be a good index for the phase diagram. Instead, we can use the rate at which these quantities go to infinity or to zero with respect to the neuron number. So we define γ and γ′: γ is the decay rate of κ with respect to m, and γ′ is the same for κ′.

Let's take an example. Suppose γ′ = 0 and we take different γ, learning these four data points with each γ; you can see the learning results differ. So we want to classify these models into different regimes, and the first question is which regimes are the most natural to classify by. Here is the model. If σ were linear, or if this part did not change during training, this would be a random-feature model, and for a random-feature model you can simply linearize. (We ignore the initial output, because it can usually be offset to zero.) What is the meaning of the linear regime? It means the model can be linearized with respect to the initial parameters, that is, replaced by its first-order Taylor expansion. If the model is linear, everything becomes very simple.

So now let's understand in which cases the model is linear. Since the only nonlinear part is σ, if the parameters w do not change much during training, the model stays close to its linearization and the Taylor expansion is valid. So we only need to examine the relative change of w, where θ_W is the collection of all the w's. For γ = 0.5, as we can see, the relative change goes to zero as the neuron number m goes to infinity; in this regime the model is well approximated by a first-order Taylor expansion as m → ∞. For the second case, γ = 1, the relative change is of order one, so this is actually a critical point. And for γ = 1.75, the relative change goes to infinity as m goes to infinity, so this is not a good linear case; this is a nonlinear case.

We then normalize the model and use γ and γ′ to index the dynamical behavior. If we choose different parameters that share the same γ and γ′, the slope of this change, the vertical coordinate here, is almost the same; that demonstrates the effectiveness of our coordinates for the phase diagram. So next we study the relative change of θ_W, that is, we study this slope, and use it to classify the regimes. We found that over the (γ, γ′) plane a large region can be classified as the linear regime, shown in red. In the blue region the slope is larger than zero, so the relative change diverges as m goes to infinity: this is the nonlinear region. And on the boundary separating the two regions, the change is of order one. So we have a very simple way of classifying neural networks into a very simple phase diagram. A small numerical sketch of this diagnostic follows.
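As a rough numerical probe of this classification (illustrative assumptions throughout: 1D inputs, γ′ = 0 realized as β₁ = β₂ = m^(−γ/2) with α = 1, and a fixed learning rate and step count):

```python
# Train two-layer ReLU networks of growing width m and record the relative
# change of the input weights ||W(t) - W(0)|| / ||W(0)||, the quantity used
# above to separate the linear regime from the nonlinear one.
import torch

def rel_change(m, gamma, steps=2000, lr=0.05):
    torch.manual_seed(0)
    x = torch.linspace(-1, 1, 8).unsqueeze(1)   # a few 1D training points
    y = torch.sin(3 * x)
    beta = m ** (-gamma / 2)                    # so that beta1*beta2/alpha = m**(-gamma)
    a = (beta * torch.randn(m, 1)).requires_grad_()
    w = (beta * torch.randn(m, 1)).requires_grad_()
    w0 = w.detach().clone()
    opt = torch.optim.SGD([a, w], lr=lr)
    for _ in range(steps):
        out = torch.relu(x @ w.t()) @ a         # h(x) = sum_k a_k relu(w_k x), alpha = 1
        loss = ((out - y) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return (torch.norm(w.detach() - w0) / torch.norm(w0)).item()

for gamma in (0.5, 1.0, 1.75):
    print(gamma, [round(rel_change(m, gamma), 3) for m in (100, 1000, 10000)])
# Under the talk's picture, the gamma = 0.5 row should shrink with m (linear
# regime) and the gamma = 1.75 row should grow (nonlinear regime); the exact
# numbers depend on the learning rate and training time, which this sketch
# does not tune.
```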
This phase diagram is important because many other studies investigate the behavior or the properties of neural network dynamics, like the NTK, the mean field, or other works such as Professor Weinan E's, among many others. But in this phase diagram, each of them only studies a single point here or there. We already know their results share many similarities, but we have not had a conclusive picture of why they are similar, or whether they sit in the same regime. So this work is an initial step toward understanding this phase diagram for two-layer neural networks, and in future work we will try to go beyond two layers and figure out the phase diagram for deep neural networks.

Finally, the takeaway information again; if you are interested, you can go to my website. The key points are the frequency principle, namely the low-frequency preference and the network structure design it suggests, and finally the phase diagram for two-layer neural networks. Thank you. If you have any questions, please ask.

Thank you. So there is a question in the Q&A. Can you see it? Yeah, I can see it. Okay. Maybe you first read the question. Okay, I can describe the question first. The audience asks: many talks are about curve fitting; however, we know deep neural networks and machine learning were not designed just for curve fitting. What people really care about in deep learning is generalization power. How do you relate generalization power to your frequency analysis?

Okay. I didn't show this result, but we actually have other results on it. Let me relate it to this slide, for example, here. You can also perform this Fourier analysis for classification tasks. Here is an example on MNIST or CIFAR-10; you can use a fully connected or a convolutional network, either is fine. You perform Fourier analysis of the mapping from image to label, picking a direction to look along, such as the first principal component direction. You can see that in such a task, low frequency still dominates. These red points are the training data, and the green ones are the held-out data. The fitted network matches the low frequencies very well, while at high frequencies it differs slightly. Since the low-frequency preference of the network is consistent with the low-frequency dominance of the data, it works very well and the generalization is good.

However, a task like the parity function behaves very differently. Take the Fourier transform of the parity function: it is defined on points whose coordinates are only minus one or one, and the output is just the multiplication of all the coordinates. In this case, high frequency dominates. So if you do not have all the samples, the learned function will have more power at low frequency and less at high frequency, and the learned curve, the blue one, is totally different from the true one: there is no generalization power. So even for this real image data case, our Fourier analysis can be applied.
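A quick NumPy illustration of this last point (not from the talk): project the parity inputs onto the diagonal direction and look at the spectrum of the label as a function of that projection.

```python
import numpy as np
from itertools import product

d = 10
X = np.array(list(product([-1, 1], repeat=d)))   # all 2^d points with +/-1 coordinates
y = X.prod(axis=1).astype(float)                 # parity: product of all coordinates

t = ((1 - X) // 2).sum(axis=1)                   # position along the diagonal direction
g = np.array([y[t == k].mean() for k in range(d + 1)])  # label vs projection: alternates +1, -1
amp = np.abs(np.fft.rfft(g))
print(np.round(amp, 2))                          # amplitude grows toward the highest mode
```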
Okay, I also have one question. In the second part, you designed the method so that the neural network can learn all these different frequencies at the same rate. Yes. For this, would you need knowledge of the frequency distribution of the original data? Yeah. If you have information about the original data, you can actually design a much more efficient method to capture the high frequencies of the target function faster. Other researchers have developed a method called PhaseDNN, but in their method you need to know specifically where the high frequencies are, and then you shift those high frequencies down to low ones. In our case we don't require that: we just design an ansatz; you can understand it as an ansatz, and from the fitting we found that this ansatz has the capability to transform high frequency into low frequency. We don't know beforehand whether it really does; in experiments, it did. But yes, if you have information about the true target function, it will help a lot.

Also, Professor Tang has a question. Can you see it? Professor Tang says... let me check. Okay. In essence, the question concerns the relation between stochastic gradient descent and regularization. So far we have not worked on SGD; we work on full gradient descent. If you add some stochasticity or some noise to these dynamics, I believe the frequency principle still holds, because experiments show this phenomenon. Second, I believe that if you write out SGD more carefully, you can see more details, such as the relation of stochastic gradient descent to regularization, and I know there are many works in this direction. I actually haven't read much of Lenka's work; I read some of her work recently, which was very inspiring, especially the piece calling for physicists to study machine learning and deep learning theory. But I haven't read her recent work on SGD very carefully, so I cannot comment too much on it.

Okay. So maybe the time is up. We thank Chen again. And then we will talk about