Okay. Good afternoon, everybody. Please take a seat. It's a pleasure for me to welcome all of you to the 2024 Salam Distinguished Lectures, which have been a tradition at ICTP: an annual presentation of talks by renowned scientists to showcase important research developments and also to give some big picture and a forward view. I'm happy to say that this activity has been very generously supported by the Kuwait Foundation for the Advancement of Sciences, KFAS. In fact, two representatives of KFAS are visiting us here. Special thanks to Mr. Jabir Al-Saba and Ms. Ahad Al-Suzi, who came all the way from Kuwait. This collaboration has really been very important for ICTP, supporting many scientists. And as it turns out, the new Director General of KFAS, who has just taken charge, is a woman scientist — for the first time, I think, in the history of KFAS it is headed by a woman scientist — and she is also a scientist with close connections to ICTP. So it's a particular pleasure to have you here. Thank you.

This year we are very happy to have Professor Stéphane Mallat. He is an applied mathematician, theoretical physicist and professor at the Collège de France. He is a member of the French Academy of Sciences and a foreign member of the US National Academy of Engineering. He was previously a professor at the Courant Institute of NYU, at École Polytechnique and at École Normale Supérieure in Paris. He was also a founder and CEO of a semiconductor startup company. He developed the multiresolution wavelet theory and algorithms at the origin of the JPEG 2000 compression standard, and of sparse signal representation in dictionaries with matching pursuits. His current work develops mathematical models of deep neural networks for data analysis and physics. I'm not in his field, but from what I understand, even before the machines were learning by themselves, as is happening today, some of the pioneers like Professor Mallat were asking the algorithms to figure out what to look for in the data and what the relevant features are to make sense of it. This is especially important in vision and image processing. And thanks to the development of wavelets, which I guess he is going to talk about — a close cousin of the Fourier decomposition — we can compress and interpret natural images in a much more efficient way. This has been at the center of a numerical revolution. So I'm very much looking forward to his talk. I just want to mention also that he gave an online colloquium in 2020, during the COVID time, so we are very happy to see him in person here.

The overarching theme and title of this year's series of talks is Learning Multiscale Energies from Data by Inverse Renormalization. I had a good opportunity — we came back from Paris in the car, we shared the car together — and I was very fascinated to learn from him that there is a deep connection with the renormalization group, which theoretical physicists are very familiar with, but coming in a very different context. So I'm really looking forward to your talk. There will then be two more talks, two colloquia, following. Today's talk is titled Energy Estimation and Data Generation by Inverse Renormalization Group. Then tomorrow there will be a special talk — well, maybe I could tell you at the end of this talk, but I might as well tell you now.
At 11 o'clock there will be a second, more technical talk on log-Sobolev stability, wavelets and interaction energies across scales. And the second colloquium will be at two o'clock, on multiscale neural network models for generation by score diffusion. So I welcome all of you to attend these talks. And I welcome the students who are coming in: maybe this is your first colloquium — I don't know if it is your first colloquium, but it is certainly your first Salam Distinguished Lecture. So you can look forward to a very exciting talk, on a topic which is really at the frontier of the exciting developments happening around the world because of machine learning.

Before I give the floor to Professor Mallat, I want to take this opportunity to say that the 29th was actually Salam's birthday, his 98th birthday. In fact I was at Imperial College — I was invited there because they inaugurated a library in his honor, the Abdus Salam Library, to honor his legacy, and they had a public lecture and a number of very nice ceremonies. And we at ICTP have been giving the Spirit of Abdus Salam Award to anyone from the extended ICTP family — scientists and non-scientists, as well as our administrative staff — who has worked tirelessly to further Abdus Salam's humanitarian passion and his vision for cooperation, promotion and development of science and technology in the developing world. I'm very happy to announce it because all three of these people I know very closely and have worked with closely. One is Rosanna Sehn, who has been the secretary of our section for many years; when I was the head of the section I worked with her closely. Professor Luciano Maiani and Professor Fabio Zwirner are both very well-known theoretical physicists, but they also played an extremely important role as chair of the ICTP Steering Committee, which was Professor Zwirner, and as chair of the ICTP Scientific Council, which was Professor Maiani. You can read more about it on our website, and the actual award ceremony will be later, but we announced this prize during this week, which is the week of Salam's birthday. So I will not take any further time and I give the floor to you. Since you're starting a bit late, don't worry about the time. Thank you very much.

Thank you very much for this warm introduction. I will be speaking today indeed about machine learning, but also physics. These are in fact very close fields, although it doesn't look like it. Physics is really the field of high dimensionality: understanding the interaction of particles and atoms, how you get emerging phenomena at the macroscopic scale. That requires being able to understand how many, many variables interact. Well, machine learning is essentially about this also: a huge number of variables coming from data, out of which you have to do a diagnostic, you have to recognize, you have to understand the distribution of these data. This raises essentially the same questions. The mathematics behind them are essentially the same, and sometimes exactly the same. That's why it's not so surprising that we arrive at exactly the same kind of concepts, the same kind of tools that have been used in physics. And in fact physics is a very important source of inspiration, and a very important domain of application, for machine learning.
So that's why these talks are going to be at the interface between both domains. The problem I'll be speaking about in this talk and the following ones is learning the distribution of data. It can be a physical field. And when you present talks like this to a mixed crowd, one of the issues is notation. In data science the data is normally called x; here I'm going to follow the notation of physics. The data corresponds to a field — think of it as an image, or it can be a volume of data — and it's going to be φ. What you would like to understand is: you have a system, you have a field, which is at equilibrium. If it is at equilibrium, then it can be described by a Gibbs energy, which specifies the interactions of the different particles. And what you would like is to learn the energy. Learning the physics from data is about learning the energy. The energy defines the probability distribution of the data, through the fact that the energy is the Gibbs energy of the probability distribution. So the question is: can we learn the energy from just a set of observations, a set of data which have been observed or simulated?

Now, the typical and simplest model that has been used in math and in physics is to describe the field as having a Gaussian probability distribution, which essentially means that there is no structure — that's what you see on the left. Then of course much finer models have been introduced, in physics in particular, and I'll describe what it means to have a scalar potential, which allows you to describe physical models such as Ising and φ⁴, and I'll show another example on real data, called weak lensing, which comes from cosmology. But the frontier of physics is still to be able to describe systems which are much more complex, like turbulence, where you have very complex geometry, or — here you see the cosmic web, which is the aggregation of mass in the universe — to understand the statistics of that. And that we still don't understand well in physics, because there are very long-range correlations and you have to understand the geometry. So the question is: how can we model the energy of such systems? And if you can, then you can generate new samples, new data. The applications are very broad: applications in physics, to understand the nature of the interactions in your systems, but also applications to generate new data, new fields.

Okay, so that's the problem: estimate the probability distribution from samples. If you look at it from the point of view of machine learning, you really have two types of problems. The first is an approximation problem. You want to approximate your probability distribution, so you need to define models, and the models need to belong to a family which is not absolutely huge. You therefore need to model the problem — in physics this is often called an ansatz — or you may use a neural network to build your model. The second phase is the optimization: you need to specify the parameters of the model. The parameters are incorporated in this letter θ; it may correspond to millions of parameters in a neural network, or much fewer if you have a good model. Typically you want to do that in order to minimize the error with respect to the true distribution, and there are distances to do that. One is the Kullback-Leibler divergence, also called relative entropy in physics, or, as we'll see, score matching — different techniques.
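To fix notation for what follows, here is the setup in formulas (a compact summary in my own notation, which may differ from the slides): the Gibbs energy defines the distribution, the model is a parametrized energy, and the fit minimizes the relative entropy.

```latex
% Gibbs (Boltzmann) form: the energy E defines the probability distribution of the field \varphi
p(\varphi) \;=\; \frac{1}{Z}\, e^{-E(\varphi)}, \qquad Z \;=\; \int e^{-E(\varphi)}\, d\varphi .

% Parametric model (the "ansatz"): an energy E_\theta with coupling parameters \theta
p_\theta(\varphi) \;=\; \frac{1}{Z_\theta}\, e^{-E_\theta(\varphi)} .

% Fit \theta by minimizing the relative entropy (Kullback-Leibler divergence) to the true distribution
\min_\theta \; \mathrm{KL}\!\left(p \,\|\, p_\theta\right)
  \;=\; \min_\theta \int p(\varphi)\, \log \frac{p(\varphi)}{p_\theta(\varphi)}\, d\varphi .
```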
Once you have the model you can generate data; in other words, you can produce examples which are high-probability examples under your probability distribution, and that also requires solving an optimization problem. So these are the two classes of problems we are going to face.

Now, why are these problems very difficult? They are very difficult because we are in high dimension, and that's called the curse of dimensionality. Suppose you want to approximate a function which associates a value y to your data — for example a classification, or the probability distribution itself. And suppose you have examples, in other words data φᵢ for which you know the value of your function, and you would like to compute the value of the function for any data φ. How do you do it? The naive way is to think: let's look at the examples I know that are close to the point at which I want to compute the function, and let's do a local interpolation. Well, that will never work, because you have no close examples in high dimension. To understand why, take a cube in dimension d, so [0,1]^d, and imagine that you want to guarantee that there is always an example at a distance ε. How many examples do you need? ε to the power −d. Take ε = 1/10, which is not very small: for dimension 80 that's already 10 to the power 80, more than the number of atoms in the universe. And we're not going to be in dimension 80 but in dimension one million, or several million. So it's just hopeless. The whole issue will be about understanding how to reduce dimensionality — in other words, how to capture structure in your problem.

Now, one of the key structures that is going to run through all these talks, and runs through all of physics, is the fact that you have an organization across scales. Suppose you have an n-body interaction problem — sorry, here n is d — so d variables which are interacting. They can be particles, they can be agents in a social network, they can be pixels in an image. Typically you may have very far-away interactions, but the far-away interactions are often weak. If you look at a pixel, for example, or a particle here, the strongest interactions are going to be with the other particles that are close — let's say your family. The particles which are a bit farther away you cannot neglect, but the interactions are weaker, and the ones even farther away, even weaker. However, you cannot neglect them. Think of it, for example, as someone living somewhere in Moscow. You may think: I don't care what he's doing in his life, it's not going to influence me. Yes, but if you neglect all Russians like that, you neglect the fact that there may be tension between Russia and Ukraine, and tension between Russia and Ukraine can change the price of energy, and that can change your life. So there are global interactions, but what it means is that instead of looking at the interaction with each particle, you can look at the interaction with the global group, and look at the equivalent field on the central particle here. What does that mean? It means that instead of looking at the interaction of d particles, you are going to look at the interaction of log d groups of different sizes.
Now, the reason why it's difficult is that these groups are also going to interact among themselves, and you have to understand all these interactions across scales. But in some sense you've broken the curse of dimensionality, because you've gone from d to log d, and that's the key point of multiscale, which we are going to find in different ways — that's the main argument.

Okay, so as I said, today I'm going to stay in the world of physics. In the next colloquium, tomorrow afternoon, I'm going to move to much more complicated problems. You can generate — I'm sure you've seen examples — faces, bedrooms, incredible images, with essentially the same tools, and what I'm going to try to show you is that the same principles are still there. The reason why you can do it is because there is structure, and the structure comes in big part from the fact that there is a hierarchical organization: you need to do scale separation to see it, and to understand the interactions across scales. So the same tools that we are going to look at today will come up when we look at the neural network applications.

So now what I'm going to do is introduce the basic tools of machine learning, which are also used in physics, to do that kind of thing. As I said, there are two parts: the modeling part — the approximation — and the optimization part. For the optimization part, let's begin with the simplest problem. Suppose that you already know the probability distribution and you want to sample from it; in other words, you want to find a typical example which has high probability. High probability means low energy. How can you do that? Basically you do a gradient descent: you want to minimize the energy, so you change the value of your field in the direction of the gradient. But the problem is that there may be local minima in your domain, and you want to get out of the local minima. You get out of the local minima by adding some noise. What is this? This is a Langevin diffusion, which has a very strong physical meaning, because it corresponds to the movement of particles in a gas, in particular. If you do that, you are guaranteed to converge to a typical example of your probability distribution even if the distribution is not convex — but it may be very slow. So speed is important, and one of the key questions when you look at this optimization problem is to find conditions under which the convergence is going to be fast. One case in which you are guaranteed that everything will be fast is of course when you have no local minima — when the energy has a single minimum — which means that the Hessian of the energy is strictly positive, larger than α times the identity. Under that kind of condition you can prove that you have exponential convergence. Now, the exponent which appears here — and this is important to notice — essentially corresponds to the size of the covariance, which corresponds to the inverse of the Hessian. Why is that important? Because it means you are going to run into problems when you have a field with very long-range correlations. I'm going to come back to that; it's one of the topics we'll look at more technically tomorrow.

Now let's build a simple class of models. What are simple classes of models? In physics you usually decompose the energy into terms: you have the kinetic energy term, which corresponds essentially to the energy of the velocity, and you have the potential term, which may incorporate all the nonlinear terms defining your physical interactions.
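Before moving on to the models themselves, here is a minimal sketch of the Langevin sampling dynamics just described (my own toy illustration, not code from the talk): a gradient descent on the energy with Gaussian noise injected at each step, so the iteration can escape local minima. The energy, step size and field shape below are placeholder assumptions.

```python
import numpy as np

def langevin_sample(grad_energy, phi0, step=1e-3, n_steps=10_000, rng=None):
    """Unadjusted Langevin dynamics: phi <- phi - step * dE/dphi + sqrt(2*step) * noise.

    grad_energy : callable returning the gradient of the energy at phi
    phi0        : initial field (any numpy array)
    """
    rng = rng or np.random.default_rng(0)
    phi = phi0.copy()
    for _ in range(n_steps):
        noise = rng.standard_normal(phi.shape)
        phi = phi - step * grad_energy(phi) + np.sqrt(2.0 * step) * noise
    return phi

# Toy example (illustrative only): a double-well energy per pixel, E(phi) = sum((phi**2 - 1)**2) / 4,
# whose gradient is phi * (phi**2 - 1).  Real physical energies also couple neighbouring pixels.
phi = langevin_sample(lambda p: p * (p**2 - 1.0), phi0=np.zeros((32, 32)))
```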
One way to approximate the physical energy is to build a linear approximation. What does that mean? It means you define a family of functions Φ — your Φ_k here — and you expand the energy, or approximate it, as a linear combination of these functions. You can write that as an inner product between your vector θ and the vector of functions Φ; θ corresponds to the coupling parameters in physics. Building a model like that is what physicists call an ansatz: you have a prior, given your modeling of the physics, that your energy can be written like that, and the key question is to identify the coupling parameters, which will then define a probability distribution. Now, computing the parameters is one difficulty, but the really difficult problem in this kind of situation is to define the model — to define Φ. When you have several centuries of physical studies behind you, you may know Φ. When you look at a new problem, you may not, and that's where neural networks can come in: basically, neural networks will be about learning the ansatz, learning the Φ.

[In answer to a question from the audience:] Yes, exactly — the three coordinates are the physical coordinates, so you have a box of data and you may have millions of points in your box of data. A priori, no, it depends: for example, in a model like φ⁴ the Φ is going to be local, but if you are in a gravitational problem it is not going to be local.

So then the question is: you have a model and you want to optimize the parameters, and the goal is to make sure that your model is close to the true probability distribution. The natural metric for that is the relative entropy, also called the Kullback-Leibler divergence. If you do statistics, it means doing maximum likelihood: maximize the log probability of the φ, find the parameters which give the maximum probability. If you are in the situation of an exponential model, things are relatively simple, because minimizing this quantity is convex, and what you have are the examples. So what do you do? You want to minimize this — again, gradient descent. Gradient descent is essentially the tool of machine learning, but it is also a tool used everywhere in physics. So you take the gradient with respect to your parameters θ, and you update θ in the direction of the gradient. If you do a direct calculation, you will see that the gradient essentially gives you the difference between the expected value of Φ — of what is called the feature vector, or the potential vector in physics — under the model, minus the expected value taken with respect to the true distribution. So you can view this as a moment-matching problem. The expected value with respect to the true distribution you can compute, because you have examples, so you can do a Monte Carlo sum, which basically amounts to averaging the value of Φ over the data that you know. For the model whose expected value you want to compute, that requires computing samples and then averaging. This is very costly, because for each gradient step you need to compute samples from your probability distribution with your Langevin diffusion, average everything, compute the difference, and then move on.
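A schematic sketch of the moment-matching gradient just described, for an exponential model E_θ(φ) = ⟨θ, Φ(φ)⟩ (my own illustration; `features` and `sample_model` are hypothetical placeholders, and sign conventions are hedged). The expensive step is visible: every parameter update requires fresh samples from the current model.

```python
import numpy as np

def fit_coupling_parameters(features, data_samples, sample_model, theta0,
                            lr=0.01, n_iters=200):
    """Moment matching for an exponential model  E_theta(phi) = <theta, Phi(phi)>.

    features     : callable phi -> potential (feature) vector Phi(phi)
    data_samples : list of observed fields
    sample_model : callable theta -> list of fields sampled from p_theta
                   (e.g. with a Langevin iteration such as the sketch above)
    """
    theta = theta0.copy()
    # Expected features under the true distribution: a Monte Carlo average over the data.
    phi_data = np.mean([features(x) for x in data_samples], axis=0)
    for _ in range(n_iters):
        # Expected features under the current model: the costly step, since it needs sampling.
        phi_model = np.mean([features(x) for x in sample_model(theta)], axis=0)
        # The gradient of KL(p || p_theta) w.r.t. theta is the difference of the two expectations
        # (up to sign conventions); descend it.
        theta = theta - lr * (phi_data - phi_model)
    return theta
```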
Now, there is a technique to avoid that, called score matching, which was introduced in 2005 by Hyvärinen, and the idea is very simple. If you look at where the bad term comes from, it comes from the normalization constant of your probability distribution. So what you want is to kill this term. To kill it, you take the log and then the gradient with respect to φ. That means that instead of looking at the difference between the log probabilities, you look at the difference between the gradients of the log probabilities, and that corresponds to something called the Fisher information — that is what is going to be minimized. If you do that, you have a simple quadratic problem in the parameters, very fast to compute; you don't have to sample anything. It looks like a miracle. The problem is that miracles usually come with a cost, and the cost here is that minimizing this quantity is weaker than minimizing the true Kullback-Leibler divergence, the relative entropy, because you've lost information about the normalization constant. Understanding in what sense this weaker metric is equivalent to the other one is again related to this notion of the log-Sobolev constant, to the speed of convergence of Langevin dynamics; we'll come back to that next time.

Okay, so now we have our two legs. On one hand, the optimization — and what I'm going to show you is that most of the time the optimization is very badly conditioned; to avoid that, the solution will be to separate scales, and we will naturally be led to the renormalization group. We are going to be led to the renormalization group not just qualitatively but very precisely: the math is exactly the same, and that will force us to go into the mathematics of the renormalization group. The second problem is the approximation problem: understanding the nature of the model. What we're going to see is that what is important is not to describe the whole object, but rather to describe the interactions between the different scales — how atoms aggregate to make a molecule, how molecules aggregate to make some kind of material. It is these different steps that we are going to model; we'll see that this is much easier and much better conditioned, and that's how we'll end up with interpretable models.

Okay, so the renormalization group — what is the idea? The idea, which came out of the work of Kadanoff and Wilson in the 1970s, is to say: if you have a system at a very fine scale, what you may want to do to understand its properties is to reduce its dimension by looking at larger and larger scales. How can you do that? You take your field, your image; you average and subsample it, average and subsample, average and subsample. And now you want to look at the probability distribution of this field where things have been progressively aggregated and averaged. How do you do that? Your probability distribution at each scale you are going to describe with an energy; this energy is described by parameters, and these are called the coupling parameters. Now the only thing you have to do is relate the parameters at one scale to the parameters at the next scale, and that is called the renormalization group equation. If you have a phenomenon with a phase transition — a phase transition means you go from water to gas, or from water to ice — then you arrive at a very particular point where your iterations have a fixed point. That was the key observation of Wilson: at the phase transition, all the θ don't move anymore.

Okay, now how do you compute this renormalization group map? Well, that part is simple. Take the very fine, high-frequency image over there: you can decompose it into a low-frequency image and a complement. What is the complement? The complement, which is here, is all the high frequencies of your image which have disappeared. We will compute that in an orthogonal basis — but let's leave aside for one second which basis.
Suppose that you are able to do that. How do you compute the probability distribution here? You take the fine-scale probability distribution and you marginalize it; in other words, you integrate it over the high-frequency variables. Now observe one thing: this probability distribution I can always write as the low-frequency probability distribution multiplied by the conditional probability distribution, so effectively, when you do this integration, you are integrating the conditional probability distribution. And there is a beautiful observation that Wilson made: despite the fact that the original probability distribution is very singular, this particular integral you can compute easily and approximate with a Gaussian integral. Where is the miracle coming from? That's what we will see. But before looking at that, let us look at the problem we are interested in. In our case we don't know the energy — that is what we want to compute. In our case we have data. So we have data, and what we would like is to compute the energy. What we are going to do is the reverse. With data, it is not complicated to compute a model of very small dimension — one, two, three pixels; in low dimension it's easy. And then we are going to go in the reverse direction: progressively, we are going to refine the model by going up in scale. Now, how do you go up in scale? You need to compute the probability distribution at the fine scale from the probability distribution at the coarse scale. You get it from the conditional probability distribution of the high frequencies given the low frequencies — the same object Wilson was looking at. If you do that, you factorize your probability distribution, essentially, as a very coarse-scale distribution times all the conditional probability distributions. The key idea of this whole talk, and of everything that will come afterwards, is that it's very difficult to compute p, but each of these terms is not going to be so difficult to compute. So you separate your problem into subproblems which are much simpler. Now why is that so? Because if you look at this conditional probability distribution, you can represent it with an energy; this is called the interaction energy between the scales. What is the interaction energy? The interaction energy is the difference between the fine-scale energy and the coarse-scale energy. But when you subtract the two, you kill the singularity — you kill the high-frequency singularity; for physicists, you kill the ultraviolet singularity. And it is because you kill this singularity that you now have a well-behaved object, with a nice Hessian which is invertible, and you are going to be able to compute it. That's the work we have been doing with Tanguy Marchand, Misaki Ozawa and Giulio Biroli, on which I'm going to elaborate.
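Written out in formulas (again in my own notation, hedged): if φ_j denotes the coarse field at scale 2^j and φ̄_j the complementary high-frequency (wavelet) coefficients, the factorization just described reads

```latex
p(\varphi_0) \;=\; p(\varphi_J)\; \prod_{j=1}^{J} p\!\left(\bar{\varphi}_j \,\middle|\, \varphi_j\right),
\qquad
p\!\left(\bar{\varphi}_j \,\middle|\, \varphi_j\right)
  \;=\; \frac{1}{Z_j(\varphi_j)}\; e^{-\bar{E}_j\left(\bar{\varphi}_j \,\mid\, \varphi_j\right)} ,
```

where each interaction energy \bar{E}_j is, up to the free-energy constant, the difference between the energies at scales j−1 and j; the subtraction is what removes the high-frequency (ultraviolet) singularity and leaves a well-conditioned conditional factor.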
Okay, now we need to be a bit more practical. We have our high-frequency field that we want to decompose into a low-frequency field and these orthogonal variables. Traditionally in physics people almost always use a Fourier transform, and I'll explain why the Fourier transform is not as appropriate as you may think for doing that. What basis will we use? A wavelet basis, and I'll explain why. The key point is that it is going to be a basis with elements which are well localized in space. This is how a wavelet basis looks: it looks like a sine wave, but one which is local in space. I'm going to have several wavelets — three in two dimensions — and each of the wavelets I'm going to dilate by factors of two, by dyadic factors. Now — and this has been the work of many mathematicians in the nineties — you can build an orthogonal basis of your space by using these three wavelets, dilating them and, because they are local, translating them in space; and this is the orthogonal basis. What does that mean? It means that a data field can be described by its coordinates, in other words the inner products of your field with each of these wavelets. In physical terms, a wavelet is a wave packet, but you design your wave packet in order to get an orthogonal basis of your space. How does it look? This has been used for compressing images. You take your field: this is the low frequency, and what you see here, these images, are the sets of wavelet coefficients in three different directions. They look almost completely disordered, not much correlated, and that's going to be very important. You take the next image here, and you decompose it again into the high-frequency variables and the low frequency. If you want to see what it does to an image like a face, this is how it looks: the wavelet coefficients are zero whenever you are in regular regions; they are only non-zero where something is happening, near the edges, where they are very big, positive or negative. Locality is very important to capture the local structure of your image. And then you do the same thing: you have the wavelet coefficients at the next scale, and the next scale.

Now, what's the relation with the Fourier transform? If you take an image, this is the Fourier support; these are the frequency variables, often called k in physics — in math it's rather ω. You do a first subdivision. What does that mean in the Fourier domain? It means that you've decomposed your big frequency domain into a low-frequency domain, corresponding to this image, and three other images which correspond to three frequency bands. Then this image you decompose further into a lower-frequency part here and three frequency channels, and so on. So what you see is that you are doing something which is also local in frequency: you are local both in the spatial domain and in the frequency domain.
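A minimal sketch of this multiscale decomposition, assuming the PyWavelets library (the wavelet family 'db2' and the number of scales are arbitrary placeholder choices):

```python
import numpy as np
import pywt  # PyWavelets

rng = np.random.default_rng(0)
field = rng.standard_normal((256, 256))   # stand-in for an image or a physical field

# Cascade of 2D discrete wavelet transforms: at each scale the field splits into a coarse
# (low-frequency) approximation and three oriented high-frequency detail channels.
coarse = field
for j in range(4):
    coarse, (horizontal, vertical, diagonal) = pywt.dwt2(coarse, 'db2')
    print(f"scale {j}: coarse {coarse.shape}, detail channels {horizontal.shape}")

# Equivalently, pywt.wavedec2(field, 'db2', level=4) returns the full decomposition at once.
```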
So let's now look at what happens in a simple case which nevertheless raises a lot of difficulties: the Gaussian model. This is turbulence in 2D, and this is the Kolmogorov 1941 model, which essentially proposed to model turbulence as a Gaussian field having exactly the same covariance, the same power spectrum. If you sample that model, this is what you get; obviously it's not a very good model, but we'll begin with that. This is the energy — it's a quadratic energy. What is the Hessian? The Hessian is just the matrix K, the interaction matrix of your Gaussian field, and the covariance of the field — the expected value of the second-order moments — is the inverse matrix. When you have a field which is stationary, which means that the probability distribution doesn't change when you translate it, then the covariance is a convolution operator, and it is going to be diagonalized in the Fourier domain. How does that translate? It translates into the fact that you only need to look at the eigenvalues, also called the power spectrum. And typically in physics what we often observe is that these power spectra have a power-law decay — and that's exactly what's happening for turbulence — which means that at low frequencies it shoots up very high; in other words you have a singularity, and a singularity here means that you have very long-range correlations in your field. That means that all the optimization algorithms are going to behave very badly, because they are very badly conditioned. And if you look at the size of the log-Sobolev constant, or the norm of the operator, it's going to grow with the size of the field — exponentially; sorry, there is a plus there.

Okay, so why are things going to be better with the renormalization group, or with wavelets? Because what you are doing is limiting your domain to a frequency band. You only look at one frequency band, and within this frequency band the eigenvalues vary by a constant factor. The fact that you only look within this frequency band means that if you look at the interaction energy, it's an operator which is perfectly well posed. This has been the center of a lot of research in mathematics between the seventies and the early 2000s: understanding which classes of operators have that kind of property, and in particular, for operators with this kind of power-law decay, the proofs are not too complicated to show that once you go into the wavelet domain all your operators are well behaved. And not only that: the interactions are very local. In other words, one wavelet is going to interact with a neighboring wavelet, but it is not going to interact with wavelets very far away. You've gone around the long-range singularity problem.

So let me come back to the picture of the renormalization flow, and the work that was done with Giulio Biroli, Misaki Ozawa and Tanguy Marchand. You have an energy, and I'm going to suppose that I can approximate my energy with coupling parameters. What does the forward renormalization group do? It supposes that the energy at the very fine scale is known, and it computes the energy at the different scales, which amounts to computing the coupling parameters at the different scales. We are going to do the reverse: we don't begin from u0, the fine-scale energy — we begin from data, and we are going to start from a coarse-scale approximation and move up with these interaction energies. Now, how are we going to move up? We are going to move up by approximating these interaction energies with a model; in other words, the key step of the modeling will be to understand how to model the interaction energies, how to define the potential vectors to build this model. And what I will show at the end is that you can build much more sophisticated models, including models of turbulence, by going that way. You also have a free energy here, which is the renormalization constant — that is not so important. Now, if you did that, what did you do? As I said, you took your probability distribution, you sliced it into components which are much simpler, and each slice you estimate by estimating its Gibbs energy. In other words, the coupling representation is no longer the usual one used in physics, with the θ_j at the different scales: it is the very coarse-scale coupling plus all the interaction-scale parameters. That will be our description of the field and of the energy. How do you sample? Suppose you have a model like that. You first sample at a very coarse scale — a very coarse-scale image. Then you sample the conditional probability distribution, which gives you a sample of wavelet coefficients; from that you can reconstruct your field at the next scale. You sample the high frequencies, you reconstruct; you sample the high frequencies, you reconstruct; and so on — you go up the ladder. The key point, and we'll see this, is that it's also fast.
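The coarse-to-fine sampling ladder, in schematic pseudocode (a sketch with hypothetical callables: `sample_coarsest` and `sample_details` stand in for samplers of the learned coarse distribution and of the learned conditional interaction energies, for instance Langevin iterations like the earlier sketch):

```python
import pywt  # PyWavelets, for the inverse wavelet transform

def generate_field(sample_coarsest, sample_details, n_scales, wavelet='db2'):
    """Go up the scale ladder: coarse field first, then conditionally sampled details.

    sample_coarsest : callable () -> coarsest-scale field (low dimensional, easy to sample)
    sample_details  : callable (j, coarse) -> (cH, cV, cD) wavelet detail coefficients at
                      scale j, drawn from the learned conditional p(details | coarse)
    """
    phi = sample_coarsest()                         # sample the very coarse scale
    for j in reversed(range(n_scales)):             # refine one scale at a time
        details = sample_details(j, phi)            # high frequencies given the coarse field
        phi = pywt.idwt2((phi, details), wavelet)   # reconstruct the field at the next finer scale
    return phi
```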
You no longer have what is called the critical slowing down that you face when you work directly on the field. I'm going to take as an example φ⁴, which is a model of ferromagnetism — a model of how the spins are distributed in a ferromagnetic material. You first have the kinetic energy, and then the potential. What does the potential do? It imposes that the values of the pixels of the image are either close to minus one or to one, which is why you have two local minima, at minus one and one. And then you have β, which is the inverse of the temperature. If β is very small there is essentially no interaction, everything is disordered — it is as if you had randomly independent pixels. When you begin to increase β, to reduce the temperature, you begin to introduce correlations, up to a certain value where suddenly you have a phase transition, where you see these very long-range correlations appearing. After that it's winner-take-all: either most of the spins are one or most of the spins are minus one, and you have these two phases. If you look at the power spectrum, you see that when you reach the phase transition it is almost a straight line on a log-log plot; that means you have a power-law decay, and therefore a singularity at zero. Okay, if you take such a field and you want to sample it directly by doing a gradient descent on the energy, it's a nightmare, because the operator is very badly conditioned, it takes a very long time, and as the size of the field grows, the number of iterations of your Langevin dynamics grows more and more.

So what are we going to do? Exactly what I described: instead of describing the field from the energy at the fine scale, we are going to describe it from the interaction energies at all the scales, which we parametrize with the same parametrization — a two-point interaction and a parametrization of the scalar potential — and we are going to learn everything from data. To learn everything from data we either minimize the Kullback-Leibler distance or use score matching. Here it was done with score matching, which is much faster. Then you can sample, and what you observe is the following: the number of iterations that you need is much smaller, because you no longer have the singularity, and it is totally independent of the size of the field. In other words, you no longer have this critical slowing down, because you've decomposed your problem into problems which are totally stable. The second thing is that you learned everything — you were not given the energy — so you can verify that the learning is precise by comparing the original field with what you generated. Visually you don't see any difference, but you can also look at the statistics, the marginal probability or the power spectrum, and they are extremely close. You can reconstruct the full energy with the potential and look at the two-point operator, which is a Laplacian. The key point is: why is it precise? Because you have a basis which is local, because the difficult part of the energy is the potential, which is nonlinear, and the potential is local in space — and that's why we use wavelets, to which I'll come back.
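For concreteness, here is one standard lattice discretization of a φ⁴ Gibbs energy and its gradient (a sketch under my own choice of constants and conventions, not necessarily those used in the talk); it could be plugged into the Langevin sampler sketched earlier.

```python
import numpy as np

def phi4_energy(phi, beta=0.68):
    """Lattice phi^4 Gibbs energy: kinetic (gradient) term plus double-well potential.

    The potential pushes each pixel toward -1 or +1; beta plays the role of inverse temperature.
    (Constants and conventions here are illustrative, not those used in the talk.)
    """
    grad_x = np.roll(phi, -1, axis=0) - phi          # finite differences with periodic boundaries
    grad_y = np.roll(phi, -1, axis=1) - phi
    kinetic = 0.5 * np.sum(grad_x**2 + grad_y**2)
    potential = 0.25 * np.sum((phi**2 - 1.0)**2)
    return beta * (kinetic + potential)

def phi4_grad(phi, beta=0.68):
    """Gradient of phi4_energy with respect to the field: minus the lattice Laplacian plus V'(phi)."""
    laplacian = (np.roll(phi, 1, 0) + np.roll(phi, -1, 0)
                 + np.roll(phi, 1, 1) + np.roll(phi, -1, 1) - 4.0 * phi)
    return beta * (-laplacian + phi * (phi**2 - 1.0))
```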
So you can do that kind of thing with other similar fields — I say similar; this is the case of weak lensing. Weak lensing is a problem where what you see are astrophysical images; the very bright points correspond to aggregations of mass. How do you detect these aggregations of mass? By looking at the deformation of the light rays — that is called a lensing effect, a gravitational lensing effect due to gravity. So one of the questions is to be able to describe the probability distribution of such fields. You can simulate them, you can do exactly what we described — compute the energy from the data at each scale, then generate new fields — and these are examples where you can look at the synthesis with the same kind of model, a scalar potential.

But that is not going to work if you want to look at more difficult problems, if you want to get out of this type of toy model. And the problem with the renormalization group in physics is that it was beautiful when applied to ferromagnetism, but nobody has been able to apply it to more complex problems such as turbulence, because there is no good model of turbulence. Indeed, if you try to synthesize a turbulent field from such a model, it just doesn't work — and not only that, we know that these models require many more free parameters. So let's look at a complicated image — the image you have over there. Let's compute the wavelet coefficients, with different wavelets here, which are complex. This is the average, these are the high frequencies — you see the contours — then you go to the next scale, and the next scale. What do you observe? First, many coefficients are zero: wherever the image is regular, they are zero. Second, you can see the geometry emerging: that is where the wavelet coefficients are large, where the amplitude of the coefficients is large. The second thing, which is very important, is that when you have a contour, you don't just see it here — you see it at all scales. In other words, the scales are very dependent on one another: if you want to capture the geometry, you need to capture the interactions between the different scales. And that's the problem.

So let's look at the tools we have at hand. The Gaussian model: if you take a Gaussian model, as I said, you have an energy which is quadratic, like that. Let's look at the Gaussian model over the wavelet coefficients. It's just an orthogonal change of basis: you do your orthogonal change of basis on your field, and you get a new matrix, which is the wavelet transform of the matrix here. What is that going to give you? It gives you the interaction between a wavelet coefficient at one scale and one position, and another scale and another position. Now, one thing you can verify is that this is going to be essentially zero — totally decorrelated. Why? Because these are coefficients which live in different frequency bands; they oscillate at different speeds, and if you try to correlate them you are going to get zero — a very well-known phenomenon in physics. So you cannot capture interactions between scales with a linear representation. What you need to do is kill the phase — which is also why ReLUs are so important in neural networks. How can you do it? The first idea that comes to mind: you want to kill the phase, so let's use a modulus. I just take the modulus: I had a complex wavelet transform, I take the real part and the imaginary part, square them, add them, take the square root, and I have a modulus image at each scale.
And now I'm going to build the energy not just on the original field but also on the modulus of the wavelet coefficients; in other words, I'm going to look at how the moduli of the wavelet coefficients correlate across scales and across positions. What does that mean? The matrix K here gives you the correlations of the original field — that's the usual term, that will be the kinetic energy term — but then you have a potential, and the potential essentially gives you the correlation between your field and the wavelet moduli, and between the different wavelet modulus coefficients. Now, the big matrix here — what is going to capture a lot of information — is the matrix which gives you the dependence between the moduli of these coefficients. Not only are they going to be non-zero, but the bad news is that this is going to be non-local: even very far-away coefficients are going to be correlated, because if you have a contour which is very regular and very long, a coefficient here and a coefficient there are going to be correlated. You don't want that: long-range correlations mean that all your optimization algorithms are going to blow up. So you need to kill the correlation, and it is always the same idea: you have a problem which is not local and you want to make it local. How? Same rule: compute a wavelet transform. Why is it not local? Because it is regular. How can you kill regularity? You build a transformation across scales. So we are going to reapply a wavelet transform on the modulus; in other words, we are going to re-transform this matrix. That means we now have two scale variables. It looks complicated, but let me tell you, it is much, much simpler than what a neural net does. The original modulus, you re-transform it, so each time you introduce a scale variable to kill the interactions in space. That's the new scale variable, and you are going to look at all these interactions. That defines a matrix which now has very few interactions: if the size of the field is L, the number of parameters goes to infinity, but very slowly, like (log L)³. So I'm going to use few parameters.

One observation, totally different from what is usually done in physics: in physics you typically use high-order polynomial expansions — that was Wilson, and that is what most people do — but the problem is that it explodes very quickly, because you cannot estimate it. You can use derivatives of high order, which is equivalent — it's the Taylor expansion — but with all these techniques we are not able to approximate such fields. Here we go in a very different direction, and all this is totally inspired by neural networks: you never take powers, you take a nonlinearity, but the nonlinearity is homogeneous — you just kill the phase. And we are going to see now what it gives. I now have a model and I can try to estimate it. The problem is that it is going to be very unstable to estimate, for the same reason as φ⁴. So what do I do? I do the same trick again: I separate scales, I look at the scale-interaction energies, which I parametrize — I parametrize exactly with the model I described, which has few interaction parameters — and then, once you have all the interaction energies, you add them up to get back your original energy. You can look at how the potential looks, you can recover the two-point interaction — which in the case of turbulence, the one we calculated, is very close to a standard Laplacian — and you can simulate. These are examples; all of this work, by the way, including the model and the simulations, was done by Étienne Lempereur, a PhD student at École Normale Supérieure.
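A schematic sketch of this kind of statistic, assuming PyWavelets (the names, the pooling scheme and the normalizations are my own simplifications): take wavelet coefficients, remove the phase or sign with a modulus, and correlate the modulus fields between consecutive scales, which is where the geometric dependence across scales becomes visible.

```python
import numpy as np
import pywt  # PyWavelets

def modulus_cross_scale_correlations(field, wavelet='db2', n_scales=4):
    """Correlate wavelet-coefficient moduli between consecutive scales.

    The field's side length should be a power of two; 'periodization' keeps the detail
    grids halving exactly.  Large values mean that structures (edges, filaments)
    persist across scales, as they do for images with geometry.
    """
    coeffs = pywt.wavedec2(field, wavelet, level=n_scales, mode='periodization')
    # coeffs[0] is the coarsest approximation; coeffs[1:] are detail tuples, coarsest to finest.
    moduli = [np.abs(details[0]) for details in coeffs[1:]]   # modulus of one oriented channel
    correlations = []
    for coarse, fine in zip(moduli[:-1], moduli[1:]):
        h, w = coarse.shape
        # Average-pool the finer modulus onto the coarser grid before correlating.
        pooled = fine.reshape(h, fine.shape[0] // h, w, fine.shape[1] // w).mean(axis=(1, 3))
        correlations.append(float(np.corrcoef(pooled.ravel(), coarse.ravel())[0, 1]))
    return correlations
```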
This is the original field: you have a database like that. This is for 2D turbulence; this is an example of a 2D slice of dark matter, which comes out of simulations that were done at the Flatiron Institute. So you have a database, you learn the probability distribution, and now you just sample the model, and these are the kinds of samples you get. You can see that visually you have captured the geometry. How did you capture it? By capturing the scale-interaction energies. You can then do a statistical validation, which essentially consists in looking at the statistical properties: you can look at the marginals and verify that the original and the synthesis superimpose — you see one color for the turbulence, one for the dark matter; you can look at the power spectrum; you can look at higher-order moments, the bispectrum and the trispectrum, third- and fourth-order moments, and verify that the statistics are matched.

Okay, so let me conclude this first part. What I tried to show is that the reason why we can solve these very high-dimensional problems is that there is a very strong structure behind them, which is the organization across scales, and the domain which has been studying this structure is physics. There is a huge amount of research that has been done around the renormalization group, in quantum physics and in statistical physics; it is a key tool to understand how you can go from fine to coarse scales. However, we are looking at it from a slightly different point of view, because we don't suppose we have the model — we want to discover the model, but we have the data. That's why we go in the reverse direction, and when you go in the reverse direction you realize that the key objects, which are indirectly there in the other direction, are really the interaction energies, and these are the objects you can estimate. That is the key point of what I described. I showed that you can build relatively simple models just by looking at the moduli of the wavelet coefficients, but other models are possible, and we will look at much more sophisticated models with neural networks.

Then of course there are a lot of scientific questions behind this. Why wavelets? I said: because of locality in space and in Fourier. That's a question we'll look at in more detail next time — I'll introduce in much more detail what wavelets are, what their properties are, and what the issue is with the Fourier transform. More complex models? Yes, that is absolutely needed. Physics like turbulence — it's strange to say, but it is simple compared to the kinds of images that are being synthesized: faces, whole rooms, things like that. The big difference in particular is that the kinds of models I showed are stationary and ergodic; if you take a face, it is not stationary, it is not ergodic, there is a lot of structure — and still neural networks can do it. So the question is: why, and how is it related? What I'll try to show tomorrow — that will be tomorrow afternoon — is that it is very deeply related; the principles are similar, but of course the problem is much more complex, and there are beautiful questions around that. The other point that I didn't touch, which is totally fundamental, is of course time dynamics. What I spoke about is a physical system at equilibrium; there are no out-of-equilibrium questions here, and that is of course the key question if you want,
for example, to do meteorology. These are very active research questions that I'm not going to touch during these two days. Thank you very much.

Thank you very much, Professor Mallat, for this very inspiring talk at the interface of physics and machine learning — a very timely presentation. Thanks a lot. Is there any question from the audience?

Thank you, very inspiring. I have a naive question, but I would like to see whether this is connected in some form to what you just told us. If I think about learning something from sampling, and I imagine I do an equilibrium Monte Carlo simulation, I know that the number of samples that I have somehow limits — for a critical system, for instance — the length scale that I can sample. So if I have a thousand samples, I will only be able to see correlations up to some point. Does that imply something in terms of constraints on how this inverse RG flow works, or not?

When you speak of the samples, you consider that you have samples at very high resolution? — Yes, exactly. — So it is very related, because if you have an estimation which is unstable, the number of samples that you need to estimate the low frequencies and the high frequencies is much bigger. So it is very key to first restabilize the problem and then estimate each of the pieces. So yes, you can estimate much higher frequencies if you estimate interaction scale by interaction scale than if you try to estimate the field directly, because there you have a condition number which is ugly and your statistical estimator is not going to converge: everything is going to be driven by the low frequencies, which are exploding, whereas it shouldn't be, because that is a very small-dimensional problem.

Thank you very much for this very inspiring talk; I also have a naive question. Usually, in the direct renormalization group, you have to fine-tune in order to get to the critical point, because relevant operators are typically repulsive — the renormalization group flow is repulsive — whereas when you do the inverse renormalization group, do you flow naturally to critical points? And what about irrelevant operators? I don't know whether there is some simple intuition about how it generally works.

Okay. What's happening is that when you flow in the reverse direction, if you renormalize, you are not going to suffer from the critical point in the sense that you are not going to suddenly see a singularity developing and making your estimation unstable. But you are still, of course, going to have your bifurcation which suddenly appears, and that means that in the low frequencies — initially, when you were above, you had a simple low frequency which was essentially Gaussian distributed — suddenly, boom, it splits and you have two local minima, which essentially correspond to the two phases. I'm not absolutely sure that I'm going to answer your question, but let me follow this for one second. What is interesting is that what gets very complicated, or more complicated, is indeed the low frequency. What I'm going to show tomorrow is that for faces — if you want to synthesize bedrooms, faces, things like that — what is very complicated is the low frequency; once you have got the low frequency, roughly the face, when you go up across the frequencies then things are much, much simpler and you can build almost stationary models. So the low-frequency problem, which corresponds to the bifurcation, the fact that you have many
local minima in your low-frequency domain — you have it, this is the nature of the problem. But what you don't see is an explosion of an operator which makes your optimization explode, because you did the renormalization, as in the standard renormalization group. Now, because we don't quite speak the same language, I don't know if I answered the question, but that is how I best approximate it.

Hi. Thanks for a fascinating connection between renormalization and machine learning — so suddenly maybe machine learning can make sense to physicists; that is what it looks like. I just want to understand the translation. The basic idea is that you are flowing to some basin of attraction, so that you go from a large number of coupling constants to a smaller number. But usually the critical points are fairly well known. What I'm asking is: can you identify in this story well-known critical exponents that have been understood in physics, like the Wilson-Fisher fixed point or the Gaussian fixed point — the fixed points that would naturally occur, because they are determined by the number of spatial dimensions and the symmetries?

Yes, okay, I will give two answers to that. The first one: if it is about identifying the critical point and the exponents, you can identify them from the data by doing a regression, but you cannot identify them on a theoretical basis, because you don't have the exact energy to begin with. But I was answering in a different way; we don't ask the same question, and I'll say it frankly. I think that the game of computing critical points is of course very interesting, but it is the same game that has been played for the last thirty years, and you can get a little bit of refinement out of it. The question here is really not to study the phase transitions super finely. When you deal with turbulence there is no phase transition; it is about studying the distribution, about building models of things which are completely different. If the question is to understand very finely the critical point and the phases, then you need an explicit model of the energy — and you can do it for universality classes, because everything is equivalent — but then you are very constrained in your question. I think in that sense machine learning is giving a little bit of air to this question by saying: well, let's look outside; the main problems are not just to look at critical points but to understand what the distribution of turbulence around a wing is, what the sociological interactions are — I mean, to understand probability distributions. We still have all these multiscale properties, but they are not described by a phase transition. For the phase transition, yes, we can't, because we don't have the exact model; we can just compute.

Let me reformulate my question. What I want to ask is that the low-energy behavior, in the language of physics — the kinds of basins of attraction that can occur — is kind of classified by the critical exponents. So can you tell, given a system, which basin of attraction it belongs to?

No, because if you go back, again, to the idea of the basin of attraction: it means you have a map and a fixed point, and that means you are at a critical point. But most of these problems are not at a critical point; you are not going to have a self-similar probability distribution — in other words, that would mean restricting yourself to self-similar probability distributions,
and almost all probability distributions are not self-similar. So what you want is to describe not only the self-similar ones. It's like a non-equilibrium problem: you are not at equilibrium, you are not at a critical point, and then what happens is that you don't have a fixed point and an attraction towards a fixed point — you just have a flow across scales. But you still observe that the low frequencies capture the complexity, and frankly, right now, this is one of the totally open questions. I'll show tomorrow how, once you get the low frequencies of faces, boom, with a stationary model you can recover the whole face. But understanding the complexity — and we can see that all the computation of the network goes into the low frequencies — there is no good model of what is happening there, and maybe there is a relation that could be made with models at critical points. That I don't know. Thank you.

Hi, I'm here, this way. Thank you very much. I was particularly intrigued by the part on turbulence. These data that you use to learn the energies, the images, are obtained by solving the Navier-Stokes equations, and we know that they have a dynamics that we don't see from the images. We also know that turbulence, which is a strongly out-of-equilibrium system, is one of the few examples where you can derive from the Navier-Stokes equations some very specific relationships between two-point and three-point correlators, in the form of the Kolmogorov laws. So my first question is whether there is a signature of this in the way you reconstruct the data — which is a very strong requirement to fulfill — and, if this is the case, have we learned something about Navier-Stokes by looking at the way you have these correlations across scales, this specific choice?

Okay, so the first question is: have we learned something about the two-point constraint, have we recovered it? The answer is yes, but in some sense you may say by cheating. Why? Because we put it in the model. In the model, as I said, we describe the field by a matrix K that we are going to learn and to compress, and what remains is in particular a compression of the two-point interactions, which describes the two-point interactions of the Kolmogorov five-thirds decay law. So in some sense this law is there — it is not represented in the spectral domain but in the wavelet domain, but it is still there. What is not there are the three-point interactions provided by the bispectrum, and there we regress, and there we do match these three-point interactions. Now, in these examples — I'm certainly not going to pretend that this works in all cases, these are ongoing works — but yes, there is something very interesting that we saw in a number of studies on a number of fields: despite the fact that we don't explicitly capture the third-order term, we do recover it. And there is one reason: within this potential there are these two terms; this term is asymmetric, like the third-order term, and this term is symmetric, like the fourth-order term. In some sense, at the level of the field they behave like the third-order term, but instead of the third power — the problem with a third-order term is that you have a coefficient to the power three, and that is pretty bad when you have outliers, because everything explodes — so you can interpret what is being done as something like a third-order term, but for the modulus you don't take the power three, you keep a power one. The power-four term is like a
Okay, wavelets are great for analyzing signals, one-dimensional, two-dimensional, three-dimensional, for speech recognition, things like that. Are there cases in which an analysis based on wavelets in some dimension came as a surprise? I mean, it is not surprising that two-dimensional pictures can be analyzed by two-dimensional filters. Are there cases in which it came out as a surprise?

Okay. First of all, I would like to say that wavelets do not always work: if you have an oscillatory phenomenon such as wave propagation, highly oscillatory, it is going to be a catastrophe, and you need something much better localized in frequency. Where the surprise has been is the following: when you cascade these wavelet transforms, wavelet transform then modulus then wavelet transform, you can capture geometry, and that was a big surprise. I will explain it, and I would say it is still one mystery of neural networks. Wavelets are very local, so if you have a contour and you look at your wavelets locally, a priori it looks very difficult to capture the geometric regularity. This is why, for fifteen years, mathematicians including myself were trying to get rid of wavelets and were developing things called curvelets and bandlets, constructions with the geometry built in, so that we could follow the geometry of the structures. At the time we thought we had to abandon wavelets and go to something like that. The big surprise is that you can still rely on local filters, as neural networks do: they always use three-by-three filters that they cascade, and if you put the nonlinearities in between, you are able to capture the regularity of geometrical curves. In other words, it is a big surprise that with something like wavelets you can synthesize something which has regular geometry in it. To recover something which has a geometry, like the structures you saw, given that you only look at the field through little squares of different sizes, means that somewhere you have been able to capture that geometry. It is a surprise that it can be done; that would currently be the surprise for me. We begin to have ideas why, but the whole math community thought it was not possible for fifteen years, so essentially we lost our time. That is really bad, but okay.
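To make the "wavelet transform, modulus, wavelet transform" cascade concrete, here is a minimal one-dimensional sketch in Python; the Gabor band-pass filters, the frequency choices, and the plain averaging are placeholders of my own, not the actual scattering implementation discussed in the talk.

```python
import numpy as np

def gabor_filter(n, center_freq, bandwidth):
    """Band-pass filter defined directly in the Fourier domain."""
    freqs = np.fft.fftfreq(n)
    return np.exp(-((freqs - center_freq) ** 2) / (2 * bandwidth ** 2))

def scattering_1d(x, freqs1=(0.25, 0.125), freqs2=(0.0625,), bw=0.03):
    """Two-layer cascade: wavelet filter -> modulus -> wavelet filter -> average."""
    n = len(x)
    X = np.fft.fft(x)
    s1, s2 = [], []
    for f1 in freqs1:
        # First wavelet layer, followed by the modulus nonlinearity.
        u1 = np.abs(np.fft.ifft(X * gabor_filter(n, f1, bw)))
        s1.append(u1.mean())                # first-order coefficient
        U1 = np.fft.fft(u1)
        for f2 in freqs2:
            # Second wavelet layer applied to the modulus, then averaging.
            u2 = np.abs(np.fft.ifft(U1 * gabor_filter(n, f2, bw)))
            s2.append(u2.mean())            # second-order coefficient
    return np.array(s1), np.array(s2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(1024)
    print(scattering_1d(x))
```

The second, nonlinear layer is what lets such a cascade pick up dependencies across scales that a single linear filter bank would miss.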
Okay, thank you very much for a very inspiring talk. I think you partially answered what I am going to ask, but I would still like to come back to it, along the lines of Antonio's question. You gave us two paradigmatic examples: 2D turbulence, where there is no phase transition, and the phi-four model, where there is a second-order phase transition. What about first-order phase transitions, where you have nucleation of one phase inside another phase and all sorts of discontinuities associated with it? You may say that this is a cubic term, but I would like you to tell us a bit more about that sort of model.

Okay, Giulio Biroli is studying exactly that: he is studying active matter phenomena where you have nucleation, and he is studying it within this framework. They are right now able to do generation, but I am not the best person to describe it, especially because I do not understand the underlying physics. I think it is a very interesting question; it is still open, but there are positive signs in that direction.

Are there further questions? I can ask mine, actually, I have one. These wavelets seem like a very natural basis, given everything you explained to us. If you do not put these wavelets into the machine directly and you let the machine learn features, would neural networks actually learn wavelets? Do we observe this?

Yes, we observe it in several ways. First, the early generations of neural networks, like AlexNet, had pretty large filters in the first layer, so they had the capacity to represent such patterns, and if you look at that first layer of filters, and people did look at them, they all look like locally oscillatory patterns with different orientations. So that was observed. But there is something more. If you look at ResNet or architectures like that, they impose three-by-three filters, so very local filters, and then you cascade: filter, then subsample. Subsampling is equivalent to enlarging the filter, because suddenly the filter sees something which is twice larger; then you filter again and subsample again. I will show next time that this is exactly how a fast wavelet transform is implemented. What I mean is that in the architecture itself the multi-scale representation is indirectly implemented by the fact that you have small filters, subsampling, small filters, subsampling; it is almost built into the architecture. Now, are these orthogonal wavelets? No, absolutely not; they can look very weird. The reason orthogonality is nice to have is the mathematics: you are computing a probability distribution, you need a Jacobian, and with an orthogonal transform the Jacobian is one; otherwise you have to carry the Jacobian around. So for the mathematics it is important; for the actual computations it is not so important. But the multi-scale structure is there. For me, wavelets is a synonym for multi-scale and local, and in neural nets it is a multi-scale, local structure that appears. Thank you.
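A minimal sketch of the "filter, subsample, filter, subsample" point in Python, using Haar filters chosen by me for brevity: each level applies the same short filters to an increasingly subsampled signal, which amounts to filtering the original signal with filters that are twice as wide at every level, and this is the structure of a fast wavelet transform.

```python
import numpy as np

# Haar analysis filters: a short low-pass and a short high-pass kernel.
LOW = np.array([1.0, 1.0]) / np.sqrt(2.0)
HIGH = np.array([1.0, -1.0]) / np.sqrt(2.0)

def fwt_haar(x, levels):
    """Fast wavelet transform as a cascade of 'filter, then subsample by 2'."""
    details = []
    approx = np.asarray(x, dtype=float)
    for _ in range(levels):
        low = np.convolve(approx, LOW, mode="full")[1::2]    # filter + subsample
        high = np.convolve(approx, HIGH, mode="full")[1::2]
        details.append(high)   # wavelet (detail) coefficients at this scale
        approx = low           # coarse approximation passed to the next level
    return approx, details

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    signal = rng.standard_normal(64)
    coarse, details = fwt_haar(signal, levels=3)
    # Orthogonal transform: total energy is preserved across scales.
    energy = coarse @ coarse + sum(d @ d for d in details)
    print(np.allclose(energy, signal @ signal))
```

Replacing the Haar pair by learned three-by-three filters and inserting nonlinearities between the stages gives the subsampled convolutional cascade described above.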
Since I do not see another hand, I will ask a quick second question. These modern generative models, for example DALL-E, which generates the faces and artificial pictures you showed us, are based on these diffusion techniques. If you train these machines, are they also good at sampling pictures with this very multi-scale type of spectrum, or do we observe failures there, so that the techniques you propose are needed? Or do they compete with what you propose?

Okay, thank you very much for the question, because that is a perfect introduction to the last talk. The answer is that for score diffusion, when people did not explicitly incorporate a multi-scale representation in the architecture, they were not able to synthesize very large images. They began to be able to synthesize very large images with architectures like U-Net, which explicitly have these multi-scale properties. That will be exactly the topic of next time: to show what score matching is doing. And sometimes, that is one of the beauties of score matching, the mathematical description of score matching is really very clean: it is a Langevin-type diffusion where you also learn the energy, so it is really exactly the same framework, and we will be able to see in what sense it is similar or not similar. Okay, thank you very much. Thanks.

All right, so if there are no further urgent questions, let us thank Professor Mallat again. Thank you very much.