OK, so good afternoon, everybody. This will be the last lecture of what has been, I think, a very nice series. I'm finally beginning to understand a bit about machine learning, and I hope you have also gotten a lot out of it. I just want to remind you that there will be a picture taken of all the students with the speaker, so students should stick around for a photograph. Thank you, Adish.

So I will indeed give the last lecture here, which will be on more current applications related to the state of the art of neural networks. But before doing that, I'd like to make a few comments, in particular around what Adish just said: I think it's great if researchers here begin to understand machine learning, because the interfaces with physics and with math are very deep. I tried, at my level, to give a hint of why. Deep down, the problems are very close to the problems encountered in physics, for the same mathematical reason that I explained: high dimensionality. But with one twist, which is that it's much easier to do experiments. Doing experiments on a GPU is not very complicated; having an accelerator is somewhat more complicated, and doing an experiment in fluid mechanics is more complicated. So you have here an environment where experiments are easy to do in high dimension, problems are well posed, you can do math, and you can make the transfer and the connection with physics. This is why I believe these fields are of great interest to work in.

So the last lecture will be, as I said... Oh, I'm sorry. Yes. So the last lecture will be on the more experimental side. We're going to look at what deep neural networks are able to do, and go in the reverse direction: try to build, partly, a mathematical understanding of it, and relate it to what I've been speaking about in the previous lectures.

These are examples of images generated by deep neural networks: faces. These faces don't exist. They've been generated from a database of faces, and then you have an algorithm, which I'll describe, that generates new faces, and these are examples. So this is a field; each face is a field. It's the equivalent of a turbulence field: much more structured, much more complicated, totally non-stationary, non-ergodic. To think of this as a realization of a random process is very complicated. What is the underlying process? What is the underlying measure? How do you define it? These are complicated and deep questions. If you look at something like this, it's the same thing. These bedrooms don't exist; they've been generated by neural networks. So these are realizations, again, of random processes, like turbulence, like φ⁴, just more complicated, non-ergodic, non-stationary.

So the question will be, first, to try to understand how these things are done by a neural network, and then to ask whether that has any relation to what we've been speaking about, φ⁴, turbulence, in what sense, and what is behind all this. I'm going to first describe these score-diffusion algorithms, which are the state-of-the-art algorithms. Whenever you see these amazing images produced by neural networks, they are produced with the kind of algorithm I'm going to speak about.

Now, the first question that comes to mind is to say: OK, these are very good-looking images, but after all, making a good-looking image is not so difficult. I take an image-processing software, I take four or five images, I take some pieces,
I blend them together, and I get a new image. Are these just some kind of blending of the images that were used in training? In other words, is this just memorization and restitution? That's a question everybody wonders about with ChatGPT: is it just a big memory system with intelligent retrieval, or does it really generalize? Is it the same sort of thing as what I do when I regenerate φ⁴ by taking a probability distribution and randomly sampling from it? That's the first key question. And the second question is: if it does learn a probability distribution, what is the nature of this probability distribution? How come, same question as before, you can learn this probability distribution, given that you are in very high dimension and facing the curse of dimensionality, so it shouldn't be possible? Unless there are underlying assumptions; and what are these underlying assumptions? And then we're going to come back to the ideas of scale separation and so on. Does it relate to physics and the renormalization group? Yes, I will claim, and I'll try to explain why there are relations.

So basically, across these three days, I'm now trying to complete the trajectory where we began from statistical physics, ergodicity, the toy model φ⁴, stationary processes, moving to something more complex with, however, the same kinds of concepts behind it: try to build models with local dependencies, because that's what breaks the curse of dimensionality, scale separation, and capturing dependencies across scales.

So let me now begin with how these algorithms work. The idea is not very intuitive. It's related to the problem of noise removal, the problem I presented this morning. You have a field, an image that I call φ here, to which you add noise. Suddenly you have a noisy image, and the noisy image depends on the variance of the noise: if the variance is very big, you have a lot of noise; if the variance is small, you have less. The problem in denoising is to try to remove the noise, to reproduce an image which is as clean as possible. It's very important in many fields, such as medical imaging. What does that mean? It means you want to build an estimator, which I will call φ̂ here, computed from the noisy data, which is going to give you something as close as possible to the true data φ. In other words, what you want to minimize is the difference between what you compute, the estimator, and the true, clean image. And you would like to minimize the error on average over all possible images that you will see and over all possible realizations of the noise, which here is Gaussian white noise.

Now it happens that the solution of this quadratic problem is simple to write formally. φ̂, the estimator, is just the conditional mean of the field given the noisy observation. In other words, it is the mean with respect to the conditional probability of the field given the observation. In Bayesian terms, this would be called the posterior distribution. Sorry? Oh, dφ. Sorry, thank you. Yes, x and φ are the same; I've been mixing the notation. Thank you very much.

Now, there is an observation that dates back to the 1950s and has been rediscovered under different names, Tweedie and Robbins, in Japan the Miyasawa identity, which says the following.
If you know the probability distribution of the noisy data, in other words of φ_σ here, and you know its Gibbs energy, there is a simple way to compute this optimal estimator, this conditional mean. It just consists in taking φ_σ and moving in the direction of the gradient of the log-probability, which is what you want: you have a noisy observation, you want to create an observation which is more realistic, so more probable, so you move in the direction of the gradient of the log-probability. In other words, this translates into minus the gradient of U; it is, again, a gradient descent on the energy. That's what it says. You prove it in three lines, and the proof depends on the fact that the noise here is Gaussian: you do an integration by parts and you prove it.

Now, what does that mean? Let's reverse the problem. Suppose that what I'm interested in is computing the probability distribution, in other words this gradient of the log-probability. Then what can I do? I can try to build a denoiser, in other words an estimator of the image which minimizes the error. How am I going to build it? With a neural network. So I build a neural network which does noise removal, which suppresses noise. And what I know is that if this neural network is able to denoise the image, in other words to compute the optimal estimator, then the neural network has in fact learned the gradient of the log-probability. So I'm going to build a neural network whose objective is to minimize this mean-square error, in other words to compute the optimal estimator, and therefore I'm going to get the gradient of the log-probability. And once I have learned the gradient, I can try to sample from it and get new images. That is the scheme.

These are ideas that were ripe and were rediscovered by several groups at the same time across the world. Song and Ermon are one of them, Kadkhodaie and Simoncelli another; several groups developed that. And the idea is quite nice, very beautiful. You begin from your noisy image here, with the noise that you see. You find some noise-removal algorithm with a neural network which does the denoising. And now you know that the difference between this and this is, in fact, a good estimator of the gradient of the log-probability. That's the core idea.

Now, of course, comes the question of which neural networks, and this goes back to the previous lecture. The kind of neural networks that are used are convolutional neural networks, so cascades of convolutions and rectifiers, and in particular the ones that work best are called U-Nets. If you look at a U-Net, you see a multiscale representation. You begin with the image, you coarsen the image at different scales, you compute a denoising, and then the denoising goes back up to the denoising at the finest scale. In other words, the denoising happens at each of the scales in this U-Net. These are the architectures that are able to get good results on very large images; we'll come back to that. But just to say that, from a purely empirical point of view, people have been led to implement this with that kind of architecture. Now, I was just speaking about denoising and computing the gradient of the log probability.
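Written out, the identity and the denoiser-training objective just described are the following (a standard statement of the Tweedie-Miyasawa result, in the lecture's notation, not a slide from the talk):

```latex
% Tweedie--Miyasawa identity: the optimal (MMSE) denoiser is the noisy observation
% shifted along the score. Noisy observation: \phi_\sigma = \phi + \sigma z, z Gaussian white noise.
\hat{\phi}(\phi_\sigma) \;=\; \mathbb{E}[\phi \mid \phi_\sigma]
  \;=\; \phi_\sigma + \sigma^2 \,\nabla_{\phi_\sigma} \log p_\sigma(\phi_\sigma)
  \;=\; \phi_\sigma - \sigma^2 \,\nabla_{\phi_\sigma} U_\sigma(\phi_\sigma)

% so a network D_\theta trained to minimize the mean-square denoising error
\min_\theta \; \mathbb{E}_{\phi,\, z}\,\big\| D_\theta(\phi + \sigma z) - \phi \big\|^2

% implicitly learns the score:
% \nabla \log p_\sigma(\phi_\sigma) \;\approx\; \big(D_\theta(\phi_\sigma) - \phi_\sigma\big)/\sigma^2 .
```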
But what I would really like is to re-synthesize an image without noise. So then there is a second very nice idea, which is as follows. You're going to begin from an image and progressively add noise. Instead of adding the noise in one sweep, you add it progressively, and you do this with what is called an Ornstein-Uhlenbeck equation, where you add a little bit of noise, then a little bit more, and so on. If you do that, you begin with your image, which is a sample of a probability distribution. As you add noise, at any time t you have a noisy image φ_t which follows a certain probability distribution, which is now different because you have added noise. Ultimately, when you've added a lot of noise, so when you've run your Ornstein-Uhlenbeck equation for a long time, you converge to something which is pure noise. Because if you look at φ_t, there is the original image, but attenuated exponentially, and at the end it converges to a Gaussian white noise of fixed variance 1, which is here. When t is large enough, you can consider it pure noise.

There is something interesting: how can you think of this in terms of probability? What is the relation between p_t and p_0? p_t is essentially the probability of a field which is the original field plus an independent Gaussian noise. That means that the probability distribution here is equal to the probability distribution p_0 convolved with a Gaussian in high dimension. So as you go, you are progressively blurring your probability distribution, up to the point where the probability distribution is a Gaussian in high dimension.

Now, what we are going to do is stop at some time T which is big enough that we can consider this to be white noise, and then invert the equation. The inversion is possible, and the reason it is possible is precisely the previous argument: you can do denoising. And to do denoising, what should you do? You should move along the gradient of the log-probability, of the energy of the probability distribution, at every time. So what we are doing here is inverting time. We went from time 0 to time T; now we go from time T back to time 0. How do we invert time? We have a reverse equation which, at each step, does a little bit of denoising. At each step you have a gradient of the log-probability, in other words a gradient of the Gibbs energy. So this is a Langevin equation, with one difference: in the Langevin equation the noise level is constant, whereas here the noise progressively goes to 0 and you recover the original distribution in the limit.

So the math tells you this is invertible. Yes, but there is one difficulty: to invert it, you have to compute the gradient of the log-probability, which is called the score. This is why this whole technique is called score diffusion. The score is the gradient of the log-probability, and you have to learn the score; with this score, you do the denoising. And of course there are then technical questions: this was a continuous stochastic equation, you need to discretize it, and to discretize it you have to make sure that your Hessian is well conditioned and so on. We may have time to come back to that, but I want first to get to the core idea.
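Written as equations, the forward noising and its reversal described above take, for one simple Ornstein-Uhlenbeck choice of schedule (the exact coefficients vary between papers), the following form:

```latex
% Forward (noising) Ornstein--Uhlenbeck dynamics: attenuate the field and add noise,
d\phi_t = -\phi_t\, dt + \sqrt{2}\, dW_t ,
% whose solution interpolates between the data and white noise of variance 1:
\phi_t = e^{-t}\phi_0 + \sqrt{1 - e^{-2t}}\; z, \qquad z \sim \mathcal{N}(0,\mathrm{Id}),
% so p_t is p_0 convolved with a Gaussian, and p_t \to \mathcal{N}(0,\mathrm{Id}) as t grows.

% Reverse-time equation, run from t = T back to t = 0, which requires the score \nabla \log p_t:
d\phi_t = \big[-\phi_t - 2\,\nabla_{\phi_t} \log p_t(\phi_t)\big]\, dt + \sqrt{2}\; d\bar{W}_t ,
% a Langevin-like equation whose drift does "a little bit of denoising" at each step,
% with a noise level that shrinks to zero as t decreases.
```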
And this is how these images have been produced. How? First, you take a database of images, you progressively add noise to each image, and you learn the score for each time. Then you draw a new white noise and you run your reverse equation, and each time you give it a new white noise, you get a new face. These images have been obtained exactly that way: give it a Gaussian white noise, boom, it gives you a new image. You give me a new white-noise image, I give you a new face. That's how these images have been produced, and these as well.

Now, when you see that, the first reflex is really to think: all it's doing is taking patches of the original images, patching them together in a smart way, and producing new images. So is it indeed doing such memorization, in which case the images you are seeing depend on the database you began with, because it's a kind of patching? This is called overfitting: in machine learning, when your result depends on the training set, it means overfitting. Or does it generalize? Generalizing means that the result becomes independent of the training set, because you are really converging to expected values when you do your calculations. And if that's the case, then you have learned something interesting. I'll show that, in fact, there is generalization. The second question is: how is it doing the denoising? Denoising has had, since Wiener in the 1950s, 70 years of research.

Yes? You showed faces created from noise and bedrooms created from noise. Why do you create a face and not a bedroom? How does it work? Because you had training before? Let me clarify; I didn't insist enough on that. When you go from here to here, you do that on the whole database, and what you do at each time is learn the score. How do you learn the score? By training your neural network to do denoising. So the score is specialized to the database. If I give you a database, you are going to train a neural network to do the denoising on it. When you have a database, you know the original, clean image, and you train the neural network so that the error is as small as possible. Once it's trained, you give it a new image and you see whether it works, given that this image is not in the database. So the score training is always tied to a database. It's exactly the same thing as what I explained in the first two lectures: you have a database of examples from which you learn the parameters of your model. What are the parameters of the model here? The weights of the neural network. The network is trained on the database to minimize the error by optimizing the weights with gradient descent. That means that the database allows you to learn the score associated with the database. But nothing guarantees that this score is not just overfitted to the database. That's the issue. Thanks, that's really an important thing to clarify.

Yes? The definition of the score? The definition of the score is the gradient of the log-probability of the image, where the gradient is taken with respect to φ. That's the score. In other words, the score is exactly the gradient of the Gibbs energy (up to a sign), because when you take the gradient of the log-probability, the normalization constant disappears and you are left with the Gibbs energy. So the score is the gradient of the Gibbs energy.
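As a rough illustration of the training and sampling loops just described, here is a minimal PyTorch-style sketch; `unet`, `face_loader`, and the noise schedule `sigmas` are placeholders, and the sampler is one simple deterministic discretization among many, not the exact scheme used in the papers cited above.

```python
import torch

def train_denoiser(unet, face_loader, sigmas, steps, lr=1e-4):
    """Train a network to predict the clean image from a noisy one;
    by the Tweedie-Miyasawa identity this amounts to learning the score."""
    opt = torch.optim.Adam(unet.parameters(), lr=lr)
    for _, phi in zip(range(steps), face_loader):
        sigma = sigmas[torch.randint(len(sigmas), (1,)).item()]   # random noise level
        noisy = phi + sigma * torch.randn_like(phi)               # phi_sigma = phi + sigma z
        loss = ((unet(noisy, sigma) - phi) ** 2).mean()           # mean-square denoising error
        opt.zero_grad(); loss.backward(); opt.step()
    return unet

@torch.no_grad()
def sample(unet, shape, sigmas):
    """Synthesis: start from white noise at the largest noise level and
    progressively shrink the residual noise using the learned denoiser."""
    phi = sigmas[0] * torch.randn(shape)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        denoised = unet(phi, sigma)                                # estimate of E[phi_0 | phi_t]
        phi = denoised + (sigma_next / sigma) * (phi - denoised)   # keep a smaller amount of noise
    return unet(phi, sigmas[-1])                                   # final denoising pass
```

Each new white-noise draw passed to `sample` gives a new image, which is what the figures illustrate.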
OK, so that's how it was done. Indeed, there was one database, a celebrity database of images of good-looking, by Hollywood standards, people, and it outputs good-looking, by Hollywood standards, people. And these are bedrooms, probably of people having a certain level of wealth, and it produces bedrooms of people having a certain level of wealth.

Good, so denoising. What you would like to understand, and this is where I was, is how these things do the denoising. There are 70 years of research on denoising: does this have anything to do with what was done before or not? Can we test whether it learns the true score, which is equivalent to being an optimal denoiser? And then, how come it solves the high-dimensionality problem? And why use a U-Net structure, and what is the relation, potentially, with renormalization?

So let me attack the first problem, generalization. The denoising problem can be phrased as follows. Given the noisy signal, you want to compute an estimator which, added to the noisy signal, is going to denoise it, in the sense of minimizing the mean-square error on average. And of course, as I said, this expected value you are in fact going to compute on a training set: you have a training set on which you optimize. And what you can try to do is see how big your training set should be in order to generalize. You can try a training set with only one image; it's unlikely that you'll get anything very good, but you can try. Or a training set with 10 face images, 100, 1,000. You progressively increase the size of your training set and see at what point it begins to generalize.

What does generalizing mean? It means you compute the error on the training set and you compute it on a test set. What is a test set? It means you take new face images that are different from the original ones: you essentially take your dataset, split it in two, train on one part, and see whether the result carries over to the other part, which the algorithm has never seen. So you can look at the denoising and compare what happens if you give as input an image from the train set, as a function of the amount of noise you add. Here I'm reducing the amount of noise, and this is the PSNR, the Peak Signal-to-Noise Ratio: the PSNR increases, indicating that the quality of the image also increases. Now, if I have a database with only one image, this curve is the PSNR of the denoised image as a function of the input noise on the train set, and this is what you get on the test set. You see the two curves are totally different. That means it does very well only because it knew the image; it doesn't generalize. Progressively, you increase the number of training examples, and what you see is that on the train set the performance gets worse and worse, because it no longer has to denoise one image but a hundred, a thousand, a hundred thousand images, whereas on the test set the performance gets better and better. And at some point the two curves match. If you look at the orange curve over there, it's the same as the orange curve over here. What does that mean? It means that whether you take an image from the train set or the test set, the denoising performance is the same. That means your algorithm is generalizing.
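This train/test comparison is easy to reproduce in a few lines; a minimal sketch, assuming images normalized to [0, 1] and a placeholder `unet`:

```python
import torch

def psnr(clean, estimate, peak=1.0):
    """Peak Signal-to-Noise Ratio in dB (higher = better reconstruction)."""
    mse = ((clean - estimate) ** 2).mean()
    return 10 * torch.log10(peak ** 2 / mse)

@torch.no_grad()
def denoising_curve(unet, images, sigmas):
    """Average output PSNR as a function of the input noise level.
    Run it once on training images and once on held-out test images:
    when the two curves coincide, the denoiser has stopped overfitting."""
    return [psnr(images, unet(images + s * torch.randn_like(images), s)).item()
            for s in sigmas]
```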
How many face images do I need? 100,000. Below that, it doesn't generalize; the two curves are different.

Now let's do something different: the synthesis. Let me remind you what you do for synthesis: you begin from noise, and you use the score, the gradient of the Gibbs energy that you learned, at each step to re-synthesize a new image. You can do the following experiment with two different databases. I take one database on which I train a neural network, and a second database on which I train a second neural network, and I give the two neural networks the same noise to begin with. Now, suppose the two databases had only one image each. This was the image in the first database, and this was the image in the second database. It begins from the same noise and it does the synthesis, and what the synthesis reconstructs is, in fact, the image that was in each database. They're different. That means the algorithm doesn't generalize.

Now I give... Excuse me. Yes. Can you make a bit more precise what you mean here by "it does the synthesis"? It does the synthesis: I mean that I give a noise image and I run the equation, but the score has been computed over two different databases, and I look at the final images and ask whether they are the same. If it generalizes, they should be the same. Because you should think of this as a transport: you have a probability distribution, which is Gaussian, which you are transporting onto another probability distribution, and the transport is driven by the gradient of your energy. If you have two databases and the algorithm generalizes, that means you end up with the same transport, so if you begin from the same noise, you should end up with the same image; more or less the same, or exactly, because there is a deterministic equivalent of this transport. Now, in the cases we began with, the two learned scores were very different, so they led to very different places. In other words, it just memorized; that's another way to say it. The only thing it does is memorize an image and give you back that image.

Now, if the database has 10 images, same thing: you recover an image which is totally different in the two cases, because, in fact, the algorithm memorized. You increase; they are still different, still different, and then, boom, you converge to images which are very similar. And this is mind-boggling, because it means that you begin from noise and you are creating a face which has nothing to do with the faces in the database, in the sense that you can replace this database with a different database of the same kind of faces. These were white people with a certain style, a certain pose, and so on, and it reproduces a white person with a certain style and so on, but the same person in both cases, although the two networks were trained on different faces.

You can see the overfitting by the fact that, once you've done the synthesis, you can look for the closest image in the database. And you see that, of course, when you had only one image, you just reproduced the image in the database. Here, you again reproduced an image from the database; it's just memorization. But when you arrive here, this is the closest image in the database, and the synthesized image doesn't belong to the database; these two syntheses are the same, and this is the closest image in the other database, totally different. So you are truly learning a probability distribution.
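The same-noise experiment just described amounts to the following protocol, in a rough sketch; `unet_A` and `unet_B` are placeholders assumed to have been trained on two disjoint halves of the data, and the update rule is the same simple one as in the earlier sketch:

```python
import torch

@torch.no_grad()
def compare_transports(unet_A, unet_B, shape, sigmas, seed=0):
    """Give the two models the exact same initial white noise and compare the two
    synthesized images: a small distance means the learned transport no longer
    depends on the particular training set, i.e. the models generalize."""
    torch.manual_seed(seed)
    phi_A = phi_B = sigmas[0] * torch.randn(shape)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        d_A, d_B = unet_A(phi_A, sigma), unet_B(phi_B, sigma)
        phi_A = d_A + (sigma_next / sigma) * (phi_A - d_A)
        phi_B = d_B + (sigma_next / sigma) * (phi_B - d_B)
    return ((phi_A - phi_B) ** 2).mean()   # close to 0 when the two models generalize
```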
These are images synthesized with databases of 100,000 images, again with the same noise to begin with but two different databases, and each time it converges to the same face. What does that mean? It means that the face was in the white noise. It's not a memorization: somewhere, the white noise, given the probability distribution, encodes the face. You change the noise, you change the face.

Same thing for bedrooms. These are experiments done on bedrooms, and all this is the work of Zahra Kadkhodaie and Florentin Guth for their PhDs, with Eero Simoncelli, done at the Flatiron Institute. Same thing: when you don't have enough images, you reconstruct an image which belongs to the database, you just memorize; and after a while, 100,000 images, and the question is why 100,000, you reconstruct a new bedroom. And again, all these bedrooms have been reconstructed from two different sets of bedrooms, starting from the same white noise. So somewhere the system has learned what a bedroom is. That's quite impressive: it means you can associate a notion of probability distribution to bedrooms, and a transport from a Gaussian measure to this probability distribution, and this is what has been learned. So the answer is yes, these systems generalize, they don't just memorize. But if you don't have enough data, they will memorize.

OK, now that we know it generalizes, the question is: what is the nature of what it's doing? How does it compute this denoising inside the neural network? Let me summarize the classical approach that I mentioned this morning. The classical nonlinear approach, so not the most elementary one, for doing denoising consists in finding a sparse representation of the data. You look for an orthogonal basis and you decompose your noisy data in this basis, so you compute the inner products. Remember, this morning I showed the wavelet coefficients, and what I said is that the noise is going to spread into a layer of small coefficients, whereas if you can concentrate all the energy of the data on a few coefficients, those will be big, and you can just threshold: remove the noise and keep the energy of the signal. That's the strategy. So you do a thresholding, and the thresholding can either set everything below a threshold to zero or subtract an amplitude with a rectifier; these are minor variants. Now, the key requirement, because your noisy coefficient is equal to the original signal plus the noise, is to find a basis which compresses the original signal well, which produces a sparse representation. This is the key ingredient of efficient denoising.

At the time, that was the kind of thing that was done with wavelets. I showed you this morning in one dimension. This is a noisy image, this is the original one. This image in the wavelet domain produces many zero coefficients away from the edges. So you compute the coefficients of the noisy image in this 2D wavelet basis, you keep only the large coefficients, which are mostly located around the contours of the image, and then you reconstruct, and you get a cleaned image where the contours are preserved but the noise has been eliminated in the regular regions. This is the 2D equivalent of what I showed this morning.
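A minimal sketch of this classical wavelet-thresholding denoiser, assuming PyWavelets is available; the "universal" threshold below is one common textbook choice, not necessarily the one shown on the slide:

```python
import numpy as np
import pywt

def wavelet_threshold_denoise(noisy, sigma, wavelet="db4", level=4):
    """Classical denoising: transform to an orthogonal wavelet basis,
    shrink/keep the large coefficients (edges), kill the small ones (noise)."""
    coeffs = pywt.wavedec2(noisy, wavelet, level=level)
    thresh = sigma * np.sqrt(2 * np.log(noisy.size))              # universal threshold
    cleaned = [coeffs[0]]                                         # keep the coarse approximation
    for details in coeffs[1:]:                                    # (horizontal, vertical, diagonal)
        cleaned.append(tuple(pywt.threshold(d, thresh, mode="soft") for d in details))
    return pywt.waverec2(cleaned, wavelet)
```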
OK, but what is being done by a neural network? In the experiments I showed, we used a specific kind of neural network with no biases, just rectifiers and filters. The fact that there is no bias, together with the fact that the rectifier is piecewise linear, means that the operator you learn, the transformation from the input image to the output of the neural network, is locally linear. In other words, you can write it as the Jacobian of the network multiplied by φ, because locally your transformation is linear. It is very nonlinear globally, because of all the rectifiers, but locally it is linear.

The reason I'm doing this is the following. The denoised image is equal to the noisy image plus the estimate of the score. Now, the score is the gradient of the Gibbs energy, and here you have the gradient of the score, the Jacobian, so what you get is the Hessian of the Gibbs energy that you have learned. So you can write the denoising as identity minus the Hessian of the Gibbs energy that you are learning in this situation. Let me diagonalize this Hessian: I take the Hessian matrix, I can always diagonalize it, and I get an orthogonal basis which diagonalizes it. I rewrite the equation in this orthogonal basis. Here I get one minus the eigenvalue of the Hessian in the orthogonal basis, multiplied by the coefficient of my image projected on the corresponding vector, and then I reconstruct by multiplying by the vector. If you look at this, it is a kind of thresholding algorithm: it takes the image, projects it on an orthogonal basis learned by the network, which happens to be the basis diagonalizing the Hessian, and it attenuates each coefficient depending on its amplitude. So it's a kind of learned thresholding. The difference with what I did before is that the basis is not fixed: for each new image, you have a new basis. You learn the basis based on the input.

Now I would like to come back to the question: is it learning the true score? One thing I know, from the Miyasawa formula, is that it learns the true score if it achieves the optimal denoising. So I can test whether it is the optimal denoiser in a situation where I can compute the optimal denoiser mathematically, but the situation has to be complicated enough that it does something interesting. There is a class of functions which is interesting: functions which are piecewise regular. They have α derivatives everywhere outside a boundary, and the boundary itself is a regular curve. These are examples of structures with geometric regularity. A lot of math has been done on these cases, something I worked on with many colleagues, and you can prove that if you add an amount of noise σ, the optimal denoiser gives you an error with a certain decay that depends on the regularity. Not only that, but we know theoretically which bases achieve the optimal denoising. So we can take this test case and ask: is the neural network able to do that? If yes, then it is able to learn the score.

We did these experiments with profiles like that, having different levels of regularity. α is the level of regularity, the number of bounded derivatives; a big α means many derivatives. The dotted curves are the optimal theoretical curves, and the full curves are the errors produced numerically by the neural network. You can see that they match very well, which means that, yes, the neural network is doing a good job.
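Before looking at the learned bases themselves: the local-linearization analysis above can be probed numerically by diagonalizing the network's Jacobian at a given noisy image. A small sketch, with `unet` a placeholder bias-free denoiser and sizes kept tiny because the Jacobian is a dense matrix:

```python
import torch

def local_adaptive_basis(unet, noisy, sigma):
    """Eigen-decompose the local linear operator of a bias-free denoiser.
    The eigenvectors play the role of a learned, image-adapted orthogonal basis;
    the eigenvalues are the shrinkage factors applied to each coefficient."""
    flat = noisy.flatten()
    f = lambda v: unet(v.view_as(noisy), sigma).flatten()
    J = torch.autograd.functional.jacobian(f, flat)   # denoised ~ J @ noisy (no bias term)
    J_sym = 0.5 * (J + J.T)                           # symmetrize before diagonalizing
    eigvals, eigvecs = torch.linalg.eigh(J_sym)       # ideally eigvals in [0, 1]: 1 = keep, 0 = suppress
    return eigvals, eigvecs
```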
But now you can look at the basis that was computed, because we know what kind of basis will do a good job, and it's quite amazing. What are these basis vectors? Look, they oscillate; they are like harmonics, sine waves, which oscillate on one side of the boundary and on the other side, but never cross it. You don't want to cross the boundary, because if you cross the boundary, you cross a discontinuity and you are going to smooth it out. So these are geometrically adapted bases. The neural network is doing the right job.

You can do it on a simpler case, circles. You can train your neural network on noisy circles; it is going to produce new circles. How did it do it? Hidden somewhere are these orthogonal bases which diagonalize the Hessian that has been learned, and you can look at the eigenvectors. What do they do? The first eigenvector averages everything outside the circle, the second averages everything inside the circle, and then the next vector, in order to refine and get a very sharp boundary, takes the difference between the average here and here. In math, these kinds of vectors are what are called bandlets. And you have the successive eigenvectors with a very fast decay of the eigenvalues λ, meaning that you can project on four or five eigenvectors and you get your circle.

What does it do on a face? This is the clean face, the noisy face, and that was the noise. These are the different eigenvectors that are learned, and there you can see that it adapts completely: it builds vectors which take advantage of the geometric regularity, it makes global averages inside regions, it makes averages which never cross contours. You have a completely geometrically adapted basis. You see, wavelet bases are very primitive compared to that: they just make local oscillations on little squares of different sizes. What these networks build is completely adapted to the geometry. It's quite impressive.

Now, the next and final question we wanted to look at is: how come it is able to do that? Because, again, if you count the number of possible geometries, you get a curse of dimensionality; there is a combinatorial explosion. The claim is that it follows the same principle: you separate scales and you learn the geometry scale by scale. What I'm going to show is that this kind of thing can indeed be learned that way. You can look at the architecture of the neural network and see that it clearly does a separation of scales, because it aggregates information at different sizes and so on, but you are never sure exactly what it is doing, because you don't control the weights. So the strategy is to do the reverse: you impose on your neural network that it learn something with this structure, and you see whether, after imposing that, it is still able to reach high performance. That demonstrates that this is a prior adapted to what you want to learn. The prior here is that you decompose the learning into learning the low frequencies, and then learning only the scale interactions when you go from one scale to the next.
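In symbols, the factorization being imposed reads roughly as follows (my notation, approximating the construction described in the talk):

```latex
% Coarse-to-fine factorization of the probability of an image \phi:
%   \phi_j     = coarse approximation of the image at scale 2^j,
%   \bar\phi_j = wavelet (detail) coefficients needed to go from \phi_j to the finer \phi_{j-1}.
p(\phi) \;=\; p(\phi_J)\;\prod_{j=1}^{J} p\!\left(\bar{\phi}_j \,\middle|\, \phi_j\right)
% One learns (i) the distribution of the small low-frequency image \phi_J, the hard,
% non-stationary part, and (ii) for each scale a conditional Gibbs energy for the details
% given the coarser image, which can be modeled as stationary and local.
```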
And the key thing, for being able to learn these conditional probabilities, is that they do not involve a huge number of parameters; in other words, that they are local, that you've been able to localize the problem. So that's what is done. Again, the way it's done is that you take your image, you decompose it in an orthogonal wavelet basis, with the low frequencies themselves recursively decomposed, and each time what you learn is the transition probability, the conditional probability of the high frequencies given the low frequencies. In other words, it amounts to estimating these conditional Gibbs energies, which specify the conditional probability of going from one scale to the next over these wavelet coefficients. So you do the factorization.

The interesting thing in this problem is: do you know what is the most complicated thing to learn, when you learn a big image like that? It is this one, the very low frequencies. Once you have learned the very low frequencies, all the high frequencies given the low frequencies can in fact be learned with a stationary model, a convolutional model, so a stationary and local model. Because once you have the small image, which can be as small as eight by eight, you basically have the position of the face, the big structures, and then you just have to refine the structure. If you think about it, comparing the low frequencies to φ⁴ after the phase transition: the very low frequencies are where things are complex, where all the different phases coexist. You have to model that, but here we don't have just two phases; you have many images and you have to model all of that. Most of the weights of the neural network go into capturing the low frequencies. Then, going from eight by eight to a thousand by a thousand is much easier: you just go up the scales with your stationary model.

So first you learn the low frequencies with a neural network, which does the denoising on the small image; we use much smaller images, typically eight by eight. And then, conditioned on the low frequencies, you denoise the high frequencies to get a denoised image. These are the conditional denoisers, which allow you to compute the conditional probabilities. And you put everything together: the denoising at each step uses the estimated low frequencies to denoise the high frequencies, and then you use that to condition the denoising at the next scale, and so on.

The important observation is about the fine scales. Although this is where you have almost all the variables, because your image had a million variables and the small image only 64, the most difficult thing is to model the 64 variables. The remaining million minus 64 variables are much easier: you model them across scales, and you can use a conditionally stationary model. The main difficulty, again, is to model the larger scales, although they live in a lower-dimensional space. Yes, it takes many parameters, you need a big network, and we have no mathematical model of what's happening there. What we know is that at some point, eight by eight, going further down doesn't gain much; in any case, you are in a very low-dimensional space. So the problem reduces to trying to understand what is happening in this low-dimensional space, for this distribution. So if you want to do synthesis, you do what I described; a rough sketch of this coarse-to-fine loop is given below.
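A rough sketch of that coarse-to-fine synthesis loop, with `coarse_sampler` and the per-scale `detail_samplers` standing in for the learned score models, and PyWavelets doing the recombination:

```python
import pywt

def synthesize(coarse_sampler, detail_samplers, wavelet="db4"):
    """Coarse-to-fine synthesis: sample the small low-frequency image first,
    then, scale by scale, sample the wavelet details conditioned on the coarser
    image and recombine them to obtain the next finer scale."""
    phi = coarse_sampler()                            # small low-frequency image (e.g. 8 x 8)
    for sample_details in detail_samplers:            # one conditional model per scale
        details = sample_details(phi)                 # (cH, cV, cD) given the coarser image
        phi = pywt.idwt2((phi, details), wavelet)     # recombine: next finer scale
    return phi
```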
You begin from noise, which is used to synthesize a low-frequency image, which is then used to synthesize the high frequencies; you recombine them, you get the next scale, then the next scale, and so on.

Said in other words, one way to view this problem is that you are putting together two scale variables. One is the time of the denoising. The time of the denoising, as I said, is like taking your probability distribution and blurring it in high dimension, so it is a scale variable, but a scale variable in your very high-dimensional space. Then you have another scale variable, which is a scale variable in your 2D space, the space of the image, where you blur the pixels of the image. And what you do is the following: instead of running the algorithm directly on the image, you decompose the image into its wavelet coefficients and its low frequencies; you add noise to the wavelet coefficients; then you decompose the low frequencies again and add noise to their wavelet coefficients; and so on.

Why is this much more efficient than going directly? You see, the wavelet coefficients have a small amplitude, so you don't need to add much noise to bury them under the noise; that part is very fast. The original image, if you want to add noise directly to it, you need to add a lot of noise. The wavelet coefficients of the next scale, same thing, very fast. Finally, you have the low-frequency image: there you need to add a lot of noise, but because it is a very small image, instead of adding a little bit at a time, you can take big time steps; in other words, you discretize your Langevin equation with big time steps. Another way to view it is that, instead of running the Langevin equation with a single time, the time associated with the original image, you associate a different time with each scale. Time runs differently at fine scales and much faster at the larger scales, and that's what saves you computation time. Then you revert: as I said, you denoise to synthesize the low frequencies, you denoise the high frequencies to go to the next scale, and you continue like that.

So this is to show that you have a structure which is very interesting, because the way time runs is not the same when you look at different scales. Just to show an example of numerical results: if you indeed use a time which is different at each scale, you can go much faster. This is what was obtained with time steps adjusted to each scale, and if you do the same with a single fixed time step, you get a much noisier image. So, anyway, it is not just about understanding: you can make the algorithm more efficient when you begin to understand the structure behind this kind of thing.

One application, and I'll finish on this, is super-resolution. In terms of applications, what I'm going to show here is not state of the art, but it gives you an idea of the kind of thing you can do with it. Suppose you are given only a low-resolution image, a photograph that exists only at low resolution, and you want to blow it up into the same photograph at much higher resolution. This is a classic problem, and in fact that's the problem on which I did my startup at the time.
At the time, that's when flat screens appeared, high-definition television, around 2001. The problem was that the production of videos and television was in low resolution. Suddenly people were going into a shop, seeing these beautiful screens with beautiful videos; they were coming back home, watching their TV, and it was ugly, because the low-resolution images were just interpolated. So the problem was: how do you take a small image and make it big, while making it look good when it's big? Well, you are given this image at low resolution. Suppose you have learned a model of faces; then, given this image, you are going to synthesize the high frequencies, get the next scale, then synthesize the high frequencies associated with that one, and so on; again, it's the inverse process. And that's the kind of thing you get. That was the original image; this is the coarse image that you would see on your low-resolution TV or a bad camera; and then you progressively synthesize high frequencies and you get these kinds of images. These images are directly derived from this one: you just run your probability model and it refines all the structures.

OK, so let me conclude on this last part. The first observation, and it was a big surprise for me: I went into this trying to prove that these neural networks were in fact just memorizing, and the conclusion was the contrary. The conclusion, shared with all my colleagues, is that they do generalize if the size of the database is big enough. Now, the required size of the database is directly related to the number of parameters of the neural network. What we observed is that when the neural network was much smaller, it couldn't synthesize images that were as good, but, not very surprisingly, the required size of the database was reduced accordingly. But it does generalize. And this is really important if you think, for example, of current systems such as the large language model GPT-4. Is it just a stochastic parrot, randomly repeating what it has memorized from the internet, or does it generate new structures based on a learned probability distribution? What is complicated is that, as you saw, it depends on the number of examples. So probably, if you interrogate it in an area where there is very little content, a super-specialized question in theoretical physics, the database conditioned on that area is too small and it will just give you back what was there. But if you ask a question about cats and dogs, where you have millions, trillions of exchanges between cat owners and dog owners, with their videos and so on, there it can tell you very sophisticated and new things. Whether you are interested or not, I don't know, but the fact is that it has the ability to generalize, and this is why it's not just memorization.

Now, of course, it is very interesting to understand the nature of this probability distribution. Just to think that there is a probability distribution behind bedrooms is complicated for me. What does it mean? It's something extremely non-intuitive, because intuitively I would say there are far too many parameters; it means nothing. Maybe we are overestimating the number of parameters of our tastes, of our habits, and so on. Maybe we live in a relatively low-dimensional space.
Maybe our environment lives in a relatively low-dimensional space, because it is the product of the imagination of our designers, of our fashions, and so on. I don't know, but the fact is that it does generalize in the sense I just described.

The other thing is that this makes very interesting connections with other topics in analysis, namely the approximation of geometry. Many researchers, including myself, worked on building bases adapted to geometry, let's say from the years 2000 to 2010 or 2015. Around 2000 we all realized: OK, wavelet bases are nice, but they don't adapt to geometry, and that has a cost. So what do we do? The first idea that came to mind was: we are going to build dictionaries of bases adapted to the geometry and then choose the basis depending on the image, which is the kind of strategy we see appearing here. But the bases we produced were much, much worse than the quality you see here, and the algorithms were much heavier, so that, in the end, this was never really used. However, some of the concepts are there, and that is what is interesting about the link with math: even when the application doesn't work, the math is not totally useless; concepts come out of it, and they seem to be at work here. In other words, one way to analyze these networks is by looking at the orthogonal bases which diagonalize the Hessians, and at the properties of these orthogonal bases.

And finally, because this was the thread of all these lectures, we see again that the factorization of the problem across scales is the key element for avoiding the curse of dimensionality. It is what allows you to go from a problem with a combinatorial explosion to a problem where you remain within a reasonably small number of possibilities, because you have localized the problem by separating scales, and that goes back to physics, to the ideas of the renormalization group. In the previous lecture I mentioned several papers in the last slide; there is one trying to summarize all of this which should be ready on arXiv in about two weeks. And here are the collaborators for all this work, with our three conference papers about generalization, the geometry-adaptive bases, and the different properties I've been speaking about. Now, this is a huge domain. There is a lot of work, but most of it is still very experimental, algorithmic. So for those who are interested in math, I can only encourage you: there is a lot to be done. Thanks very much.

Thank you for this, again, very enlightening and very interesting talk. Are there questions? Yes? So you said that the difficult part is to learn the representation of the small image. Is what distinguishes faces from rooms, from cats, et cetera, contained in this small model? Not only; the other part is also different. Yeah, I was just thinking, I don't know if it's possible: with these neural networks, you make a transplant, you put the neural network of the rooms into the neural network of the faces; what do you get? OK, if you do just that at the bottom, which is the low frequencies, and you keep the high-frequency models of the faces: I didn't try, but I don't think it's going to work. It does learn a model for the high frequencies; I'm just saying this model is much simpler, and it's stationary.
That is what is amazing: faces are totally non-stationary, but once you basically know where the face is, you can have a stationary model for the high frequencies given the low frequencies. So the short answer is no, you need to change everything. But it could be that the high frequencies are quite similar, because after all it's about refining edges, refining textures. So you're right, they may be similar, and these are interesting experiments to do.

I just want to go back to this connection with the renormalization group a little bit. I understand the main analogy, and of course you want to say that it's more than an analogy, so I want to understand that a bit better. The analogy, as I understand it, is that in the Wilsonian renormalization group you understand that even though you could have a very large number of operators, only a finite number of relevant operators matter. Sorry, you said you can have a large number of...? You can have an infinite number of operators in your Gibbs energy, but only a finite number of them matter, if you are looking at phenomena at energies much below some typical scale; there is a scale separation, that's what you were saying, the high-frequency modes. But I think this point of view then allows you to divide phenomena into renormalizable versus non-renormalizable theories. For example, gravity is non-renormalizable. What that means is that the UV physics doesn't really decouple: what you think is high energy, and which you have integrated out, actually contaminates your low-energy physics. So, to show that the curse of dimensionality is really cured... since you're not keeping only the... you're keeping some finite number of operators which are not necessarily... yeah, yeah.

That's very good, because I understand your question now thanks to the conversation we had at lunch. The answer is that, in all of what I described, we do not work with a finite number of operators. At each layer, we add a new set of operators. So, contrary to φ⁴, which is renormalizable in the sense that it remains in a finite space, we do not make this assumption, because it is not true in most problems, and it is not true in turbulence. So how do we do it? Because if you don't work in a renormalizable space, you may end up in an infinite-dimensional space. What we do is: you first work on the low frequencies, where you can work in a low-dimensional space. When you go to the next scale, you keep the operators you have, but you add a set of new operators, and at the next scale a new set again. So we progressively grow the space of operators as we go. The result is that it goes to infinity, but slowly: as a function of the size of the field L, for turbulence for example, it was growing like (log L)³. It goes to infinity. So you're right, we don't want to stay within a renormalizable theory, because those are just toy models, just self-similar models. For faces, there is no way you'll have a renormalizable theory, because the large-scale structures of faces are very different from the fine structures of faces.

What I want to ask is: I agree completely that you don't want to keep a finite number. But you are keeping, let's say, 100 operators rather than a million operators.
Now, usually in non-renormalizable theories, if you try to do that, you cannot really do it when there is no clear scale separation, because the operators you have ignored actually contaminate the low-energy physics. So I'm asking: is there some notion of logarithmic renormalizability? No, no. Each time we add a new layer of operators, they re-contaminate all the other ones, so we update all the coupling parameters. There is a full interaction: each time you add a new set of operators, we do compute what you call these contaminations. We don't make the assumption that there is a weak coupling where you would just need to add the new terms to the previous weights. And in fact the coupling goes through the free-energy calculation; it's when you compute the free energy that you get it, but that goes into more technical elements. So my point, and I want to insist on it, is that it's not an analogy. Mathematically, it is precisely what we are doing: we are integrating out the high-frequency degrees of freedom when we go one way, and we are re-synthesizing these high frequencies when we go the other way.

But the key question is indeed what you are saying: building these models. And there, I think, lies the surprise; here I am on very vague terrain, so I'm sure I'm going to say things wrong, but I'll still say it. For turbulence, we know that the number of degrees of freedom is infinite. The question is how fast it grows to infinity. And in some sense, in the experiments we are doing, we are saying it does not grow very fast: (log L)³, which is certainly not the exact rate, but it doesn't grow fast, which means it is feasible. Because as long as the growth is logarithmic, you beat the curse of dimensionality, since the exponential of a logarithm is only polynomial. That's OK. I think, from a theoretical point of view, that is the key question: how fast does the number of degrees of freedom grow as a function of scale?

I guess the question then is: can you classify the set of problems for which it grows logarithmically, and the set of problems where it doesn't? There must be some classification like that. That would be beautiful. I mean, here we've just set up a program; anybody interested? You could classify physical phenomena by the growth rate of the number of degrees of freedom as a function of scale. Renormalizable theories are the ones where it is constant, and then the others grow. And the ones that cannot be calculated are the ones where it grows too fast. So somewhere we have a domain where we can compute, and which is not trivial. It would be interesting to see, in physics, whether all the physical phenomena we study are in that range, or whether there are physical phenomena outside that range. Turbulence, I think, is within that range. Ah, yes, that's very interesting; that would be a perfect framework to begin to study such a problem. OK, we'll have to discuss that at the end.

Yes? Thanks. I have a more practical concern about the application, for example to super-resolution with neural nets. I understand that it is not memorizing images from the database, but it is definitely inventing details out of the prior information from the internet. So does that limit its usage on scientific images, like images from space telescopes, if we want to denoise or do super-resolution?
Or can we maybe combine it with more traditional methods to make it more reliable for scientific imaging? OK, that's a very important question, thanks. If you have a system which memorizes, then it is just useless for scientific applications, because it is going to produce what you said: hallucination. It will produce amazing details, which are just memorized from another case that has nothing to do with yours. But if you show that it generalizes, then you are really in the equivalent of the maximum-entropy regime: you no longer depend on your database, so you don't hallucinate in that sense. The generalization is what guarantees you get out of that. However, generalization, the way I defined it, means independence from the dataset; it does not mean you have learned the right probability distribution. In statistics there are always two problems: variance and bias. You have solved the variance problem, you no longer vary as a function of the data used for training; but you may have a bias, in other words the solution may be biased by the choices you have made. For example, if the problem you began with was "create a random face of someone on Earth", then what I just showed is quite biased. So the bias is still there, and that you have to address by other means.

Question? Yes. Thank you, Stéphane, for the nice talk. Since you are emphasizing that the network generalizes, the natural question I can ask is: let's say we have two models, one trained on real data, real images, and one trained on data obtained from a diffusion model. Do you expect the two models to perform the same, given that one saw completely fake data and the other real data? Yes. Really? Yes. If you are in the generalization regime; and even in the memorization regime, actually, because in the memorization regime it reproduces images of the training set, so you will get the same kind of results whether you work from the training set or from the generated set. In other words, this is not, in fact, a way to discriminate whether it generalizes or not.

My question was more about what the network learns, in the sense that maybe there is some kind of information contained in the real images that is not in the fake images. Say I'm in the regime where I don't have access to enough data and I want to use generated images to complete my dataset: am I going to obtain, in the end, a model that is as performant as if it had been trained on completely real data? That was the sense of the question. Well, in that sense, if you have a dataset with only one image, your network will always reproduce this single image, so it's perfect, it kept all the information. In fact, the neural network is just encoding the training set when it is in the overfitting regime. When you get out of the overfitting regime, when you are in the generalization regime, it forgets the training set, and therefore, in some sense, you lose information, because you can no longer regenerate the training set. And there you may be right: there may be some very small detail that was important for this particular image and that was missed by the model.
Because the model, don't forget, is biased by the fact that you have a neural network with a particular architecture that has been used for training, and this neural network biases the family of probability distributions. So, in fact, the more you are in the overfitting regime, the better off you are from that point of view, because you encode the dataset.

Thank you for your very interesting talk. You mentioned earlier, discussing generalization between two networks: I was wondering if there is a threshold or an order of magnitude for when we can expect this generalization to happen. A very interesting question as well; that's open research. What we can say is that Zahra Kadkhodaie has been doing experiments on a small neural network and a bigger one, and observed that the required number of training examples grew by a factor of the order of the factor by which the number of parameters was multiplied. That's probably not precise, because with just two points you can always draw a line through them, but it indicates that there is an issue of the complexity of the model. The other question is whether there is an issue of the complexity of the data. Normally, it should take many more examples to learn something very complicated than to learn something simple. So intuitively, I thought it would take more examples to learn bedrooms than faces, which are always centered and so on. That hasn't been demonstrated. All of these are open experimental questions.

Thank you. What I understood is that one of the key features of this wavelet factorization is that it captures local properties of the data. Would it be possible to apply the same idea to other types of data, for example video, where we don't only have spatial dependencies between pixels? What kind of data did you say? Video. Video? Yes, video: we also have dependencies in time, from one frame to the next. OK, so again you are raising a very important question: time. How do we deal with time, which is of course fundamental in physics? That is, I think, indeed the next step. You have scales in space; the question is, do you have scales in time? I would tend to say yes. When you solve a simple nonlinear PDE, like the Burgers equation, you see that the moment you begin to separate scales, you also want to adapt the time step to the scale, because for the high frequencies, if you want a stable numerical scheme, you need a very small time step, whereas for the low frequencies you can take a much larger time step and still have a stable scheme. The two are coupled. And that's a little bit what I showed with the Langevin equation: time has to run very fast, and stops very quickly, at the high frequencies, and it can run with bigger time steps at the low frequencies. So the interaction between the time scales and the spatial scales, and whether by doing that you can really model the full Hamiltonian evolution, is very interesting; it is, I think, now the key question. This is the question behind meteorological prediction, which is obviously about prediction in time, and there are already very nice numerical results around. Thank you.

All right. If there are no further questions, let's thank again Professor Mallat for this very stimulating talk. Thank you.