this CSRC-ICTB webinar. The second session will focus on machine learning and related inference problems. We will have three invited talks. The first one is given by Dr. Chuang Wang from the Institute of Automation of the Chinese Academy of Sciences. Dr. Wang came back from Harvard University to the Chinese Academy of Sciences just last September. His background is in statistical physics and inference. So, Chuang, you can start now.

Okay, thank you. Thank you for joining me. Today I'm going to share my recent work on the dynamics of generative adversarial networks in high dimensions. My name is Chuang Wang, from the Institute of Automation, Chinese Academy of Sciences. My general interest is unsupervised models. Probably the simplest unsupervised model is principal component analysis, where a low-dimensional hidden variable z is connected to the high-dimensional observation variable x by a linear transform A. A similar but different model is so-called independent component analysis. In this model, the hidden feature variables z are also connected to the observed data x by a linear transform, with the additional assumption that the components of z are mutually independent of each other. Recently, neural networks have attracted a lot of attention, and many generative, unsupervised models have been built on them. One of them is the generative adversarial network. The basic idea is also very simple: from the hidden variable z, we apply several layers of transforms A_1, A_2, up to A_m, and get a high-dimensional observation variable x. This x mimics real-world data, something like a voice recording or an image, like this one, or a human face. The general objective is the following: we have a lot of real data from the real world, and we want to retrieve some low-dimensional, semantically meaningful features in an unsupervised way. That means we don't have any labels for the data.
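The linear generative models just mentioned can be sketched in a few lines. This is a minimal illustration in my own notation (the dimensions and distributions are arbitrary choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 100, 3                       # observation and latent dimensions (illustrative)
A = rng.standard_normal((n, d))     # linear transform (mixing matrix)
z = rng.standard_normal(d)          # low-dimensional hidden variable
x = A @ z                           # high-dimensional observation, as in PCA

# For ICA we would additionally require the components of z to be
# mutually independent and non-Gaussian, e.g. z = rng.laplace(size=d).
```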
One idea is to first build a generative model to approximate the real-world data, and then invert it to extract these low-dimensional, meaningful feature variables. Here is a line of my recent work on this unsupervised-modeling problem. Today I'm going to mainly talk about the first one, which was published last year at NeurIPS. Here are my collaborators: the first, Professor Yue Lu, was my postdoc advisor at Harvard, and the second is a PhD student in my previous group.

Now let's start with an introduction to GANs. A GAN consists of two parts: the generator and the discriminator. The generator G is just a map from a low-dimensional variable eta, which is random noise, to a high-dimensional variable x. This x tries to mimic some real-world data, like an image. The discriminator is just a binary classifier; its input can either be real-world data, like an image, or fake, synthesized data from the generator. Training a GAN is a two-player game, where the discriminator tries to distinguish whether an input is real or comes from the generator, and the generator tries to produce data as realistic as possible to fool the discriminator. GANs get surprisingly good results; for example, they can generate realistic-looking human faces. This is work done by NVIDIA two years ago, and it's already pretty good; more recently, they can generate images at very high resolution. But the challenge is that training a GAN is really difficult. One problem is that the training dynamics has multiple states it can reach: some correspond to a good result, and some to a bad one. The question is, how can we tune the parameters or control the training process so that we eventually get a good result?
Another problem is that during training, the process can oscillate, sometimes indefinitely. How can we avoid this oscillation? That's a second problem. The last one, a very important problem for GANs, is so-called mode collapse. When a GAN suffers mode collapse, it can only produce a very small set of outputs; for example, the GAN just memorizes a single image from the real data, at which point the discriminator cannot distinguish this single real sample from other real data. If the GAN falls into this mode-collapse regime, it cannot give a good result.

Mathematically, the GAN can be formulated as a minimax problem, where the min is over the parameters of the generator and the max is over the parameters of the discriminator. The expectation is over the real data, y_r, and also the fake data, y_g, which is produced by the generator. The utility function usually consists of three parts. The first part: given real data, the discriminator outputs the probability that the input is real. The second part, which enters with a negative sign: given fake data, it is the estimated probability that the input is real. The third term is a regularizer. In this way, the discriminator tries to maximize this objective, which basically rewards the correct answer, and the generator tries to fool the discriminator, so it tries to minimize the fake-data part. Solving this minimax problem exactly is usually very hard, but in the machine learning community the simplest approach is gradient descent. Since it's a minimax problem, we use a mixed descent-ascent method: for the discriminator we use gradient ascent, so we add the gradient at each step, and for the generator we use gradient descent, so at each step we compute the gradient and subtract it from the previous iterate.
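The alternating ascent-descent scheme just described can be illustrated on a toy bilinear minimax problem f(g, d) = g * d, a stand-in for the GAN utility (the objective and step size here are my own choices, not the model from the talk). On this toy problem the iterates rotate around the equilibrium rather than converging, which foreshadows the oscillation issue discussed above:

```python
# Toy minimax: min over g, max over d of f(g, d) = g * d (illustrative only).
g, d = 1.0, 1.0      # generator and discriminator "parameters"
lr = 0.1             # shared learning rate
traj = []
for _ in range(100):
    grad_d = g                # df/dd at the current point
    d = d + lr * grad_d       # gradient ASCENT for the discriminator
    grad_g = d                # df/dg after the discriminator's move
    g = g - lr * grad_g       # gradient DESCENT for the generator
    traj.append((g, d))
```

With this alternating update the iterates stay bounded but keep circling the saddle point at the origin instead of settling there.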
Our work tries to understand the whole learning process. This stochastic dynamics is very hard to analyze directly in the discrete-time domain. So what we want to do is, first, predict the performance, and second, find the optimal algorithm parameters so that we get the desired result, basically a good GAN. The overall picture of the method is to map this discrete-time process to a continuous-time process using stochastic modeling. That means we build a differential equation that describes the discrete-time process; it's like a diffusion process in statistical physics. Once we have a differential equation, we can transform the original iterative problem into the analysis of that differential equation. For example, we can analyze its fixed points and their stability, which eventually answers questions about the original optimization problem.

In order to get analytical results, we build a very simple, solvable model. The model is the following. We use one layer to model the real data: here c is an n-dimensional feature vector, c_k is a scalar random number that represents the hidden feature variable, a_k is high-dimensional random noise, g is some nonlinear function, and y_r is the real sample. For the generator we have a similar model, except that we don't know the true feature vector c; instead we use a model parameter, w_g, to estimate it, and the scalar variable plays the role of the hidden variable in the generator. The discriminator is also very simple: for an input y with parameter w_d, the model is just the inner product between y and w_d followed by a nonlinear function d-hat.
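One plausible reading of this one-layer model can be sketched as follows. The exact normalizations and nonlinearities are not fully specified in the talk, so the 1/sqrt(n) scalings and the sigmoid are my own assumptions, and the symbol names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500                                     # data dimension (illustrative)
c = rng.standard_normal(n) / np.sqrt(n)     # true feature vector (assumed scaling)
w_g = rng.standard_normal(n) / np.sqrt(n)   # generator's estimate of c
w_d = rng.standard_normal(n) / np.sqrt(n)   # discriminator weights

def real_sample():
    z = rng.standard_normal()               # scalar hidden feature variable
    a = rng.standard_normal(n)              # high-dimensional noise
    return c * z + a / np.sqrt(n)           # one assumed form of the one-layer model

def fake_sample():
    z = rng.standard_normal()
    a = rng.standard_normal(n)
    return w_g * z + a / np.sqrt(n)         # same form, with w_g in place of c

def discriminator(y):
    # inner product with w_d followed by a nonlinearity (sigmoid as a stand-in)
    return 1.0 / (1.0 + np.exp(-(y @ w_d)))
```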
Once we have this simple model, we use the standard stochastic gradient descent method to train it. In the past few years there has been a lot of analysis of this kind of GAN training process, but most of it was done in the so-called small-learning-rate regime. In that case, one keeps the dimension of the system finite and lets the learning rate go to zero; the discrete-time process then converges to a continuous-time process usually called the gradient flow. This is a first-order differential equation whose right-hand side is nothing but the exact gradient at the current point. In our work, we consider the high-dimensional regime: the dimension n goes to infinity and, at the same time, the learning rate goes to zero, with the constraint that n times the learning rate is a constant tau. In this case, we find that the continuous-time limit cannot be described by an ordinary differential equation, because the noise cannot be ignored. Instead, we reach a stochastic differential equation: the first term is essentially a gradient term, and the second is a Brownian-motion term. Basically, it's a Langevin equation from statistical physics. Here is the concrete form for this simple GAN model, where w_d and w_g are the continuous-time versions of the parameters in the discriminator and the generator respectively, and the tau's are learning rates. The coefficients g and h are deterministic functions depending on the time t as well as the distribution of the parameters w_d and w_g, and dB_t is a standard Brownian motion. This stochastic differential equation can be characterized by the so-called McKean-Vlasov equation, which is basically a nonlinear Fokker-Planck equation. By solving this nonlinear Fokker-Planck equation, we can compute the probability density of the parameters over time. Here are some experiments.
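The Langevin-type structure of this limit (a drift term plus a Brownian-motion term) can be illustrated with an Euler-Maruyama discretization of a generic one-dimensional SDE. This is only a cartoon: a quadratic potential stands in for the actual drift functions of the paper, and tau plays the role of the effective noise level:

```python
import numpy as np

rng = np.random.default_rng(2)

# Euler-Maruyama discretization of a Langevin-type SDE
#   dw_t = -U'(w_t) dt + sqrt(2 * tau) dB_t
# with the toy potential U(w) = w^2 / 2, so the drift is -w.
tau = 0.05        # effective temperature, playing the role of n * learning rate
dt = 1e-3         # time step of the discretization
steps = 20_000
w = 2.0           # initial condition, far from the potential minimum
for _ in range(steps):
    drift = -w
    w += drift * dt + np.sqrt(2 * tau * dt) * rng.standard_normal()
# w now fluctuates around 0 with stationary variance roughly tau
```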
The red dots represent the evolution of the weights in a real experiment, and the blue curve is the probability density estimated by solving this nonlinear partial differential equation. We can see that the theoretical prediction matches the experiment pretty well. This is what we call the microscopic description. Here, three vectors play a very important role: the first is c, the true feature vector of the ground-truth model; the second is w_g, the parameter of the generator; and the third is w_d, the parameter of the discriminator. But this nonlinear partial differential equation is usually very hard to solve; we can only solve it numerically. So in the next step, we try an even simpler way to characterize the whole training dynamics. We define three scalar quantities, which we call the macroscopic state, also called order parameters in statistical physics. These three variables are nothing but the cosines of the angles between pairs of these three vectors: q_d is the overlap between w_d and c, q_g is the overlap between w_g and c, and r is the overlap between w_d and w_g. We prove that when the system dimension n goes to infinity, these three scalars converge weakly to the unique solution of a system of ODEs. It is a first-order ODE system with a somewhat complicated nonlinear function g; I don't write g out here, but you can find it in the paper. In addition, we prove rigorously that for any finite system size n, the difference between the finite-n trajectory and the limiting value is upper bounded by a constant C over the square root of n. That means the convergence rate is 1 over the square root of n. And here is a simulation.
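The three order parameters just defined are plain normalized inner products and can be computed directly. The symbol names and dimensions here are illustrative (random vectors are used only to exercise the function):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
c   = rng.standard_normal(n)   # true feature vector
w_g = rng.standard_normal(n)   # generator parameter
w_d = rng.standard_normal(n)   # discriminator parameter

def overlap(u, v):
    # normalized inner product: the cosine of the angle between u and v
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

q_d = overlap(w_d, c)    # discriminator vs. ground truth
q_g = overlap(w_g, c)    # generator vs. ground truth
r   = overlap(w_d, w_g)  # discriminator vs. generator
```

For independent random vectors these overlaps concentrate near zero as n grows, which is why only a few scalars suffice to track the high-dimensional dynamics.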
The dots are the results of the real experiment, and the curves come from solving this ordinary differential equation; they match well. In fact, this ODE characterization is a standard tool in statistical physics, but for a long time people didn't really care about rigorous proofs. In our work, and in earlier work of ours published two years ago, we gave a rigorous proof that such an online learning process converges, in the large-system limit, to the solution of an ODE. These proof techniques have also been used for other models, for example a two-layer neural network model trained in an unsupervised way.

Now let's look at the ODE in detail. Once we have this ordinary differential equation, we can reduce the original problem to studying the dynamics and the fixed points of this differential equation. In a first experiment, we fixed the learning rate of the discriminator and changed the learning rate of the generator from 0.1 to 0.4, 0.6, and 1, and we saw that the dynamics are dramatically different. When the learning rate of the generator is small, we get a pretty good result: q_g converges to minus one, which means the generator is, up to sign, almost perfectly aligned with the true parameter, and the discriminator first aligns with the true model and then gradually decays to zero. That is, the discriminator initially learns the true model quickly, and then, as the generator gets better and better, the discriminator is fooled. This is a good result, and it is the result we want. But if we increase the learning rate a little, the whole training system is no longer stable; it starts to oscillate indefinitely.
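The qualitative picture, a fixed point whose stability is lost as a rate parameter grows, can be mimicked by a toy two-dimensional linear ODE integrated with the forward Euler method. This is a cartoon of the mechanism, not the actual ODE system of the paper; the parameter k plays the role of a learning-rate ratio:

```python
import numpy as np

def simulate(k, steps=2000, dt=0.01):
    # Toy linear system:
    #   dx/dt = (k - 1) x - y
    #   dy/dt = x + (k - 1) y
    # Its eigenvalues are (k - 1) +/- i: a spiral around the origin,
    # stable (spiraling in) iff k < 1, unstable (spiraling out) iff k > 1.
    u = np.array([1.0, 0.0])
    M = np.array([[k - 1.0, -1.0],
                  [1.0, k - 1.0]])
    for _ in range(steps):
        u = u + dt * (M @ u)    # forward Euler step
    return np.linalg.norm(u)

stable   = simulate(0.5)   # below the threshold: spirals into the fixed point
unstable = simulate(1.5)   # above the threshold: oscillates with growing amplitude
```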
If we further increase the learning rate, even the generator starts oscillating and completely forgets what it has learned. And if we increase the learning rate still further, surprisingly, we get another stable result; even though it's not as good as the first one, it is still a stable solution. So the whole training process is very complicated, depending on how the parameters are tuned.

To help explain these phenomena, we analyze the fixed points of the ordinary differential equations, and we show that, surprisingly, the fixed point is not unique: there are at most five fixed points, and each fixed point corresponds to a phase in a phase diagram. In the phase diagram, the horizontal axis is the learning rate of the discriminator, and the vertical axis is the ratio of the generator's learning rate to the discriminator's. We show that there are two non-informative phases: for example, this non-informative phase corresponds to the trivial solution in which all the overlaps are zero, meaning nothing is learned, while this informative phase corresponds to successful recovery. The four red points, one through four, correspond to the four figures here. We also show that there is a black region of the phase diagram in which no fixed point is stable: if you choose parameters in this region, the system will oscillate, as these two points and the corresponding two figures indicate. This picture is for the one-dimensional problem; in the paper we also analyze the multi-dimensional case. There, the phase diagram is similar, but the successful phase splits into two: in one, recovery is complete, and in the other, only half of the features are retrieved. Here is the result. We see that if the noise is very small, we get an oscillating result, and then, if we increase the noise, something really interesting happens.
If we increase the noise, the system becomes stable. If we increase the noise further, some features are lost and cannot be estimated from the data. These three points correspond to the black (oscillating) region, the informative region, and the half-informative region. If we increase the noise even further, we reach the non-informative region, meaning we cannot learn anything from the data. Moreover, we can analyze these phases quantitatively and characterize when the informative result is attainable, which gives this inequality: the learning rate times the noise level should be neither too small nor too large. If it is too small, we get the oscillating state, meaning the system falls into the black region; if the noise is too large, the system falls into the half-informative or even the non-informative state. So if you want to keep the system in the good, informative phase, you have to carefully tune the parameters into a rather narrow region of the parameter space. One take-home message is that noise helps convergence: if you see oscillation in the training process, you can try adding some noise artificially; maybe it will help. This is a practical trick already used in training GAN models.

As a conclusion: in this talk, we presented an exact and tractable analysis of the training dynamics of a simple GAN model in high dimensions. Specifically, we analyzed the training process at two levels. At the macroscopic level, the dynamics are deterministic and described by coupled ODEs. At the microscopic level, we showed that the dynamics remain stochastic: the evolution of the individual weights is stochastic and is characterized by an SDE. This is different from previous work, where the small-learning-rate analysis shows that even the microscopic dynamics are deterministic; in that setting, one cannot obtain any results about how the noise helps convergence.
Finally, we showed that the noise level is essential for convergence: strong noise leads to a failure of feature recovery, while weak noise causes oscillation. So we need to carefully tune the noise and the learning rate to get a good result. That's all. Thank you.

So, any questions? I have a question: do you see the phenomenon of collapse in this simple model? Could you repeat the question? Mode collapse, right? Yes, the collapse problem. So your question is, do I observe mode collapse in this simple model, right? Exactly, yes. Yes, sure. In this experiment it shows mode collapse: in the third figure, when the noise is very strong, when the variance of the noise equals four, we get mode collapse. We can compare this figure with the one in the middle. There, the two curves correspond to two features, and both features are almost recovered; but in the top figure, only one feature is recovered. Okay, thank you. Yeah, so here we can characterize when mode collapse occurs; basically, that is what we call the half-informative region. Yes. Are there other questions in the comments? Okay. If not, we thank the speaker again, and then we move to the next speaker. Thank you. Thank you. Yeah.