Right, so, hello everyone and welcome to this seventh seminar in our machine learning and physics seminar series. I'd just like to remind everyone that this is recorded and we publish these talks on the YouTube channel. Today we have the pleasure of welcoming Stéphane Mallat. Stéphane is an applied mathematician and professor at the Collège de France, where he holds the chair of Data Sciences; he is a member of the French Academy of Sciences and of the Academy of Technologies, and a foreign member of the US National Academy of Engineering. He was a professor at the Courant Institute of New York University for ten years, and also held positions at the École Polytechnique and the École Normale Supérieure in Paris. Among many achievements he co-founded and served as the CEO of a semiconductor startup company, and was the recipient of the Blaise Pascal prize from the French Academy of Sciences. He has kindly agreed to give us a talk today on the subject of Hamiltonian estimation by conditional renormalization group and convolutions. Thank you very much for joining us, Stéphane, and the virtual floor is yours.

Thank you very much. I'm very happy to give this seminar, because it is indeed a very rich interface between machine learning and physics. I myself come from applied mathematics, as you mentioned, but I came into physics because, as we were discussing, problems are often better posed in physics, and as we face the very difficult problem of understanding convolutional networks nowadays, physics often provides a better framework in which to study them. I'll try to show through this talk that there are very close relations between tools that have been developed in physics and the type of algorithms and mathematics behind these networks. So the problem I'll be looking at is the general, essentially ambitious, problem of learning physics from data and priors. The typically difficult problems are those which involve the interaction of many bodies, with fluctuations over many length scales. We have the toy model called φ⁴ that I'll be mentioning, but the applications we're really interested in are more in cosmology — I'll be speaking a bit about weak lensing and the distribution of matter in the cosmos — but also turbulence, flow turbulence as here or gas turbulence in astrophysics, and, at very small scales, quantum chemistry problems. So I'll be looking at these problems from the statistical physics angle, at problems at equilibrium, and the issue is to try to learn the Hamiltonian — in other words, the energy that defines the probability distribution of the states. In this machine learning framework, what we have are realizations, examples of states, from which we would like to estimate this energy. So the questions that immediately arise are modeling questions: what family of energies is reasonable to capture the physics we're interested in? And, as always in this type of problem, there is the optimization problem of estimating the parameters. One thing I'll be showing is the relation with image classification, which looks extremely different, but nowadays we know neural nets are behind both, so there should be something similar behind them.
So if we look at the physics of this problem, there has been a huge amount of work, and the simplest models that have been studied for large-scale interactions are Gaussian models. A Gaussian model means that the energy has a quadratic form, where K here is the inverse of the covariance. In particular for turbulence this was the model proposed in the famous Kolmogorov 1941 paper, where he derived from the Navier-Stokes equations the fact that, under a Gaussian hypothesis, you get a power spectrum with a power-law decay, the well-known k^(-5/3) spectrum. What you see here are images: at the top the original image, at the bottom the Gaussian model which has exactly the same power spectrum (a short sketch of this construction is given at the end of this passage). As you can see, we are obviously losing a lot of structure: it gives information about regularity but nothing about the geometry of the structures. The question is how to capture this information. Of course people tried to go towards higher-order moments, and that was essentially a failure, because there are too many higher-order moments and statistically the estimators have a very large variance. Now here come the deep neural networks. Neural networks, as I imagine most of you know, have mostly — but not only — been used for classification and regression. In this case you want to estimate a parameter y, which may be for example an energy or the class of an image, given the data x, and typically to do so you would like to estimate the probability of y given x. The Bayes classifier consists in taking the y which maximizes this probability. So essentially you can view a neural network as a parameterized estimator of this probability distribution. And what are the parameters? Well, when you build your network, you choose linear operators which are convolution operators that, in the case of images, have a very small support; being translation invariant, they are convolutions. They are followed by a rectifier, and you get all these images which correspond to different filters. At the next layer you recombine all these images with filters — still convolutional — to build the images of the next layer, with a subsampling every few layers. Again you apply your rectifier, you cascade, and you get very small images up to the final linear classifier — or rather the final linear operator — that builds a model of your log probability. So the parameter theta of your probability is the aggregation of all these matrices, which may correspond to hundreds of millions or billions of parameters. How do you optimize these parameters? Basically with a gradient descent, which simply maximizes the likelihood: you are trying to find the parameter theta which maximizes the probability of observing y given x over the whole database that you have. And the big surprise, as you know, is that this kind of architecture gives exceptional results, not only for classifying images or sounds, but also for language, for regression in physics, and for generating physical fields. So the question is to try to understand the relation between these networks, their architecture, and the underlying physics, given the fact that they can simulate some interesting physical phenomena.
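Going back to the Gaussian model above: a minimal numpy sketch, with arbitrary toy data, of how such a maximum-entropy surrogate can be generated — white noise is shaped in Fourier space so that the sample has the same power spectrum as the original field, and nothing else is preserved.

```python
import numpy as np

def gaussian_model_sample(x, rng):
    """Sample a stationary Gaussian field with the same power spectrum as x."""
    X = np.fft.fft2(x - x.mean())
    power = np.abs(X) ** 2                      # empirical power spectrum |X(k)|^2
    noise = np.fft.fft2(rng.standard_normal(x.shape))
    Y = noise * np.sqrt(power / x.size)         # shape white noise by the square root of the spectrum
    return np.fft.ifft2(Y).real + x.mean()

rng = np.random.default_rng(0)
# toy stand-in for a turbulence snapshot: any 2-D field would do here
x = np.cumsum(np.cumsum(rng.standard_normal((128, 128)), axis=0), axis=1)
x_gauss = gaussian_model_sample(x, rng)         # same spectrum, but all geometric structure is lost
```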
When you look at them, one thing is immediately striking: you have a kind of multiscale approach, because progressively, as you do the subsampling across layers, the filters acquire what is called a receptive field — in other words a support — which gets broader and broader. In other words you are progressively aggregating the information as you go from fine to coarse scales. The other thing is that the first layers are relatively simple: people have observed something like small localized waves, which we'll call wavelets, and we'll come back to that. So people tried, as you probably know, to model physical fields such as the turbulent fields that I showed. That is in particular the work of Matthias Bethge and his team, where basically they train a network on something totally different — cats and cars and so on. Then they take one such field, they decompose the field over the network, and they compute the correlation matrix, at any given depth, between the feature maps within that layer. And how do they restore an image? They try to synthesize an image which reproduces the same correlation matrices inside the network, and they do that by gradient descent. The surprising thing is that by doing so you can synthesize things that look very much like the original. On the other hand, the number of correlation parameters they use is huge compared to the number of pixels, so there are estimation issues here. And one question is basically to try to understand mathematically what all this is about. So what I'm going to try to do is to show the relation with a concept which is very important in physics, which is renormalization. This is an old idea, and many people have emphasized the relation between the two because of these ideas of scale. What I'm going to show here is not only that there is a similarity with Wilson's renormalization when you look at it within these wavelet bases, but that it gives a different outlook on the renormalization group. That is what I'll call here a conditional renormalization group, which is a bit different from the standard physical one, the one used by Wilson — and that may be one reason why you can get results that were not previously obtained. The second thing will be to try to understand what kind of potentials we can use to build these models. The key element I'll try to emphasize is that what you need to do is capture the interaction between scales, and to capture interactions between scales you need to be nonlinear in the phase; in particular, separating phase from amplitude is going to play a very important role, and we'll see how this leads to turbulence models. In the third part I'll show the relation between regression and classification problems in deep networks. One application will again be in physics, doing energy regression in quantum chemistry, but we'll also look at standard image classification problems, because it is very interesting to see the correspondence between the two. So let me first begin with the renormalization group, and I'll begin with the standard view as it was introduced by Kadanoff and Wilson in the '70s. The idea here is that you have a field: x — I'll be using the notation of machine learning, x is my field, not φ as a physicist might write. So x_0 is the field at a very fine scale. In this framework we suppose that we do know the Hamiltonian — in other words, we do know the energy.
And the idea is to try to reduce the number of degrees of freedom, and to do that you progressively build coarser and coarser approximations of your field by averaging and subsampling. Once you have reduced the size of your field, the idea is to compute the probability distribution of this field at the larger scale, and that defines a new energy function. Of course you can compute this probability distribution, from the fine scale to the coarse scale, by doing marginal integrations. The key idea of the renormalization group is to observe that you can parametrize these energies — these will be, in physics, the coupling constants that define the interactions within the field. Because the coarse field can be derived from the finer-scale field, you build a map from the coupling constants at the fine scale to the coupling constants at the coarse scale. So basically, as you move across scales, you see an evolution of these coupling constants. And the key observation of Wilson is that phase transitions correspond to a fixed point of this map, which defines the evolution of the coupling constants — essentially of the physics — across scales. Okay, so this is, very briefly, the renormalization view, and a typical application has been scalar potentials. Scalar potentials are the cases where the energy does not only have a quadratic term, as in a Gaussian distribution, but also a nonlinear potential which depends only on each point of the field: it is imposed pointwise on the scalar values of the field. So you have a potential that you can expand, for example, over polynomials. In the well-known case of the φ⁴ model, the potential is a combination of a power four and a power two, so that you have two minima near minus one and one, and the field values have a tendency to be close to either minus one or one. And as you know, if you modify the balance between the quadratic term — which is here the kinetic energy defined by a Laplacian — and the potential term, at some point you arrive at a phase transition, where very long-range interactions come in. One observation: in physics you usually expand this potential in polynomials, but you can just as well expand the potential as a piecewise-linear approximation, which amounts to expanding your potential over rectifiers. There is an advantage: the potential then does not dominate the quadratic term at infinity, so stability issues are much easier if you use rectifiers rather than polynomials. Okay. Now, to compute this renormalization group, the key point is to compute the probability distribution of x_j from the probability distribution of x_{j-1}. To do so, you need to do a marginal integration over all the free variables of x_{j-1} which are not within x_j. Now, the question is how to express these free variables. The standard approach is to use a Fourier basis. But if you come back to the publications of Wilson in the '70s — not the first one — this is not what he used; he used something like wavelets. So what is a wavelet? A wavelet basis is an orthogonal basis that characterizes these free degrees of freedom with waveforms which look like oriented sine waves, but which are localized in space. And with these wavelets, if you dilate them by a scale 2^j and you translate them, you get an orthogonal basis. What does that mean?
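Concretely, the simplest possible case is the 1-D Haar wavelet; the talk uses 2-D oriented wavelets, but the bookkeeping is the same. A minimal numpy sketch of one coarse/detail split and its exact inverse:

```python
import numpy as np

def haar_split(x):
    """One renormalization step: x_{j-1} -> (coarse field x_j, wavelet detail coefficients)."""
    coarse = (x[0::2] + x[1::2]) / np.sqrt(2)     # local averages: the low frequencies
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)     # wavelet coefficients: the high frequencies
    return coarse, detail

def haar_merge(coarse, detail):
    """Exact inverse of haar_split: reconstruct the finer field."""
    x = np.empty(2 * coarse.size)
    x[0::2] = (coarse + detail) / np.sqrt(2)
    x[1::2] = (coarse - detail) / np.sqrt(2)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal(256)
x1, d1 = haar_split(x0)          # x0 is equivalent to the pair (x1, d1): nothing is lost
x2, d2 = haar_split(x1)          # cascading: x0 is equivalent to (x2, d2, d1)
assert np.allclose(haar_merge(x1, d1), x0)
```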
It means that if you take your field x_{j-1}, you can expand it orthogonally into a coarser field plus the wavelet coefficients: the field is expanded into this coarse field here plus the wavelet coefficients, which are the high-frequency fluctuations that allow you to come back to the finer field. Then x_{j-1} can be recursively decomposed as coarse field plus wavelet coefficients, coarse field plus wavelet coefficients, and so on. So now we have a system of coordinates in which to do the integration, and we can therefore compute this probability distribution. Now here is the twist. In the situation we are in, we don't know the Hamiltonian; we need to compute the Hamiltonian in order to define the model. So instead of trying to build a model of the probability distribution of these coarse fields, we are going to build a model of the conditional probability of x_{j-1} given x_j. If you have decomposed your image in a wavelet basis, knowing x_{j-1} given x_j is the same thing as knowing the high frequencies given the low frequencies — in other words, learning the interactions between the high frequencies and the low frequencies. And that is the model we're going to build. Once you have that, you can build a model of the fine-scale grid, because it basically consists in taking the model of the coarse grid, then building a model of the wavelet coefficients to reconstruct the next grid, building a model of the next wavelet coefficients for the next grid, and so on. You get an expansion of your probability distribution as a product of these refinement conditional probability distributions. So now the key problem is to build a model of these conditional probability distributions — which, again, appear indirectly in Wilson's calculations — whose energies are really the interaction energies between high and low frequencies. To do so, you face an optimization problem, and that is where we'll see why the renormalization is so important and why this factorization is important: basically, it completely preconditions the problem. If you want to find a parameter theta of a probability distribution, you do it by minimizing the negative log-likelihood loss, or maximizing the likelihood. To do so you can run a gradient descent, which basically consists in updating the parameter until the expected value of the potential under your model's distribution equals its expected value under the true probability distribution. That is standard maximum likelihood, and we know that the rate of convergence depends on the Hessian, which corresponds to the covariance of the potential that defines the model. Now the big problem is that when you are close to a phase transition, this gets very unstable and the iteration almost doesn't converge; it becomes extremely slow. If instead you factorize the problem — instead of computing the parameters of the full field, you slice the problem into these conditional parameters — the first thing you get is that these instabilities vanish. And the reason is the following. You are dealing with fields whose probability distributions have a power spectrum with a power-law decay, so the ratio between the smallest and the largest eigenvalues is very large; this is essentially the bad conditioning of your Hessian.
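The gradient descent just described can be written in a couple of lines. Here is a deliberately tiny toy — a scalar "field" with a single potential term φ(x) = x², so that the model expectation has a closed form; for a real field that expectation would be estimated by MCMC. The point is only the moment-matching structure of the update, not the model itself.

```python
import numpy as np

# Maximum-likelihood fit of a toy maximum-entropy model p_theta(x) ~ exp(-theta * x^2).
# The gradient of the negative log-likelihood is a difference of moments:
#     grad = E_data[ phi(x) ] - E_theta[ phi(x) ],  with phi(x) = x^2 here.
rng = np.random.default_rng(0)
theta_true = 2.0
data = rng.normal(scale=np.sqrt(1 / (2 * theta_true)), size=10_000)
data_moment = np.mean(data ** 2)

theta, lr = 0.5, 1.0
for _ in range(200):
    model_moment = 1 / (2 * theta)      # E_theta[x^2], analytic for this toy model;
                                        # for a real field this is the expensive MCMC estimate
    theta -= lr * (data_moment - model_moment)
print(theta)                            # converges towards theta_true: model and data moments match

# The convergence speed is governed by the Hessian, i.e. the covariance of phi under the model.
# For fields with a power-law spectrum this covariance is badly conditioned, which is exactly
# what the scale-by-scale (conditional) factorization repairs.
```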
When you slice the problem with wavelets, you basically isolate different frequency bands. In each frequency band, the spectrum only varies by a constant factor, so within each frequency band the problem is well conditioned, and it is this preconditioning that allows you to compute these conditional parameters very quickly. The first thing you can do — and we'll begin with this toy model to try to understand what is happening — is to recover the φ⁴ model. For the φ⁴ model, what you want to recover is the quadratic term, which happens to be a Laplacian but you don't know that, so it is a singular operator, and you want to compute the scalar potential. The way we do it is: we decompose in a wavelet basis, we compute the conditional probabilities, and then we recascade all these conditional probabilities to compute the potential. At each scale you compute a potential: at the very coarse scale the potential looks like a Gaussian potential and, progressively, as you refine, you recover the φ⁴ potential. If you are at the critical temperature, the potential looks the same at all scales, precisely because of self-similarity. And with that kind of algorithm you have no critical slowing down and fast convergence, including for the estimation of the Laplacian, which is the quadratic term. Okay. Once you have this, you can sample your field very quickly, without suffering from any instability due to the fact that you may be close to the phase transition. How do you do that? You have expanded your probability distribution as a product, and you synthesize the field by beginning at the very coarse scale — very few pixels, so it is fast to simulate. Then, with the conditional probability distribution, you synthesize the wavelet coefficients given the low frequencies — those images — and from that you can automatically reconstruct the finer field. Then you synthesize the next wavelet coefficients and reconstruct, synthesize and reconstruct, and so on (the synthesis loop is sketched after this passage). Each of these problems is well conditioned because of the renormalization, and you get a very fast sampling whose cost does not depend on being at the critical point. This is just a comparison as a function of the size of the system: close to the critical temperature, the number of iterations of typical algorithms is well known to grow very rapidly, whereas for this kind of conditional renormalization group the number of iterations per pixel does not change with the size of the system. And when you have to estimate the expected values, or do the sampling, with an MCMC chain, the length of the MCMC chain does not change with the fact that you are close to the critical temperature: the number of iterations is of the order of 30 here, once it is preconditioned, as opposed to over 2000 iterations when the problem is unstable. Okay, so with that kind of technique you can recover φ⁴, but obviously this is not our goal. The goal now is to take this idea, show in what sense it is related to deep nets, and how you can do much more powerful things. The first application I'm going to look at before deep nets is weak lensing. In weak lensing, you have images which correspond to the light emitted by objects that are very far away.
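Going back to the coarse-to-fine synthesis mentioned a moment ago, here is a schematic sketch. The names sample_coarse and sample_details are hypothetical placeholders for MCMC samplers of the learned coarse model and of the learned conditional models of the wavelet coefficients given the coarse field; merge can be the haar_merge reconstruction from the earlier Haar sketch.

```python
import numpy as np

def synthesize(n_scales, sample_coarse, sample_details, merge):
    """Coarse-to-fine sampling: start from a tiny field, then refine scale by scale."""
    x = sample_coarse()                      # very few pixels: cheap to sample by MCMC
    for j in reversed(range(n_scales)):
        details = sample_details(j, x)       # draw wavelet coefficients given the coarse field
        x = merge(x, details)                # invert the wavelet split: twice as many samples
    return x

# toy usage with Gaussian stand-ins for the learned conditional models
rng = np.random.default_rng(0)
x = synthesize(
    n_scales=4,
    sample_coarse=lambda: rng.standard_normal(16),
    sample_details=lambda j, c: 2.0 ** (-j) * rng.standard_normal(c.size),
    merge=haar_merge,                        # from the Haar sketch above
)
```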
Now, because of the presence of masses, the light is going to be deformed, and this deformation of the light — which is what is called weak lensing — allows us to infer the presence of dark matter in the universe. The problem is to understand the statistics of these images, of these deformation and light fields. There is now the Euclid mission, which is going to be launched in particular by a European-American consortium, and one of the issues is to understand these statistics. One piece of work — all this work was done with Giulio Biroli, Tanguy Marchand and Misaki Ozawa — was to build a model of this probability distribution, basically by learning the quadratic term and showing that you can model the probability distribution with scalar potentials, but at all scales. The fact that you have a scalar potential at all scales means that you can capture not only local interactions but also global interactions, which is needed when you have a gravitational field. That is the work we did. Once you apply this slicing as a product of conditional probability distributions, you can estimate each term, recover the potential, recover the quadratic term, and re-simulate images with an explicit form for the Hamiltonian. Now, this kind of thing looks a little bit like a very naive network. If you look at a wavelet transform: you have an image, you compute the wavelet coefficients at the first scale and the average, that average is again decomposed into an average and wavelet coefficients, and so on. So it is a first tree, and you compute the potentials of interactions, and that gets you the results for φ⁴ or for weak lensing. However, this is not going to be sufficient for turbulence. It is not sufficient because the model is essentially based on scalar potentials, and the interaction between scales remains a little too simple. And that is what we're going to look at now. So the second part is about understanding how to refine these models by capturing the scale interactions better. In our framework, capturing the scale interactions better means building a model of the conditional probability of the high frequencies given the low frequencies — in other words, of the wavelet coefficients at one scale given the wavelet coefficients at all larger scales. So you need to define this interaction energy: what kind of potential would make sense to capture these interactions? Again, the first idea that comes to mind is to go to higher-order moments, but that is not possible — too many moments, too difficult to estimate, it doesn't work. Second approach: what about using a network? Possible. The problem is that it is a black box, and we'd like to do physics, so the question is whether it is possible, from the priors we have on physics, to infer information about the potential we want to put here. And what I would like to show is that for many physical fields you can do it; you don't need to go to a blind deep network. To understand that, I'm going to show images. First, this is an image with a lot of geometry. This is the average; these are the wavelet coefficients at the first scale; then you compute them at the next scale, and the next, and what you see is the following. First of all, the wavelet coefficients are very sparse. That is quite well known: in the regular regions the fluctuations are nearly zero, and the only places where they are large are near edges, near sharp transitions.
But there is something more important. If you look at different scales, whenever you have geometry, the different scales look very much alike: there is a very strong dependency across scales, and it is this very strong dependency that we need to capture. Now, there is a reason why it is not so easy to do. The first thing you may think of is to say: it's dependent, so let's correlate the wavelet coefficients at two different scales. The wavelet coefficients essentially consist in taking the image and filtering it with a wavelet at a given scale, and at another scale filtering it with a different wavelet. The problem is that these coefficients are going to be totally uncorrelated. The reason they are uncorrelated is that they live in two different frequency bands, so the phases oscillate at different rates, and if you compute these expected values they cancel out and the correlation is zero. So to exhibit the dependency, you need to be nonlinear, and what you basically need to do is separate the phase from the amplitude. The moduli — the amplitudes of the wavelet coefficients — are very strongly correlated. But that is not enough: the phases are also correlated, and to show it, what you need is to see how the phase at one scale correlates with the modulus at a finer scale; these correlations are nonzero. Basically, the idea is to express the interaction across scales through interactions between the amplitudes of these wavelet coefficients and the phases. Now, the problem we get is that these correlation matrices are very large, and if you want to do an estimation you would like to do something like a PCA, to almost diagonalize your covariance matrix. And here we get closer to deep nets, because one can prove that in order to nearly diagonalize this covariance matrix, what you need to do is compute a second wavelet transform — in other words, separate the different scales that are within this envelope. If you do that, you get many fewer correlation coefficients, which are the so-called scattering coefficients. So let me show now how this relates to a network. First we do a wavelet transform, computing the variations at different scales. Then to each of these images we apply a nonlinearity, which is here a modulus. Then we recompute a wavelet transform, and we renormalize each scale, exactly as in a renormalization group. Finally, we compute the correlations between the wavelet transform of the wavelet transform and all the other coefficients at any given scale. So here it begins to look a bit more like a neural net, but we do not learn the filters: they are all wavelets. It is essentially a deep network without learning the weights. Now, if you do that — if you take images of turbulence and build a model, which essentially consists in computing the parameters of these potentials obtained by calculating the interactions across scales — you resynthesize fields which have essentially the same statistics and which recover the geometry. The number of correlations is much smaller than what you would get in a deep net: instead of having 10^5 coefficients, we have about 1000 coefficients, which is fewer than the number of pixels in the images. And these coefficients are interpretable: we know exactly what they correspond to, what kind of interactions we are calculating across scales.
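A minimal 1-D numpy illustration of the two statements above: wavelet coefficients of the same field taken in two disjoint frequency bands are essentially uncorrelated, while their moduli are strongly correlated. The signal, the Gabor-type filters and the two scales are arbitrary toy choices, not those used in the talk.

```python
import numpy as np

def gabor(n, xi, sigma):
    """Complex Gabor filter (a Morlet-like wavelet) centred at frequency xi."""
    t = np.arange(n) - n // 2
    return np.exp(1j * xi * t) * np.exp(-t**2 / (2 * sigma**2))

def wavelet_coeffs(x, xi, sigma):
    """Band-pass x around frequency xi by FFT convolution with the Gabor filter."""
    h = gabor(len(x), xi, sigma)
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(h))

rng = np.random.default_rng(0)
# a field with sharp transitions, so that different scales "see" the same singularities
x = np.sign(np.cumsum(rng.standard_normal(4096)))

w1 = wavelet_coeffs(x, xi=1.0, sigma=8.0)       # fine-scale frequency band
w2 = wavelet_coeffs(x, xi=0.25, sigma=32.0)     # coarser band, two octaves below

corr_raw = np.abs(np.mean(w1 * np.conj(w2))) / (np.std(w1) * np.std(w2))   # ~0: disjoint bands
corr_mod = np.corrcoef(np.abs(w1), np.abs(w2))[0, 1]                       # clearly > 0: amplitudes align
print(corr_raw, corr_mod)

# The "scattering" idea: apply a second wavelet transform to |w1| so that these cross-scale
# dependencies are nearly diagonalized and summarized by a small number of coefficients.
```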
So that is the kind of thing we have been applying to cosmology. This work was done in collaboration with physicists at the École Normale Supérieure, in the astrophysics and cosmology group led in particular by Erwan Allys. This is a distribution of mass in the universe, where you see the filament structures; this is the maximum-entropy generation with that kind of model that captures these interactions, and here we have an explicit expression for the Hamiltonian. We have also been working on applying this kind of thing to regress the cosmological parameters, but I won't go much into that. So I would like to finish by looking at the regression and classification problem, to show how all this is related. In the case of regression and classification, what we want to do is estimate, let's say, an energy given the state x. And we are going to use the same structure: we compute the scattering transform by computing the wavelet transform and the modulus, then reapplying a wavelet transform and a modulus, as you see here, and we average these coefficients spatially. These are all the coefficients that we get out of this cascade of wavelet transforms and moduli. So it is again a kind of very naive network, and note that there is no combination across channels. Out of that, let's try to compute the energy — in other words, the log probability. There is one reason why doing this makes sense. There is a fundamental property, which you can prove you obtain by doing the scale separation: stability to the action of diffeomorphisms. If x is deformed by a diffeomorphism — in other words, imagine an image which is deformed, where tau is a translation which depends upon the position, so it is a deformation — there is one very important property: the coefficients at the output of your network are stable to deformations. In other words, they almost linearize the deformation — they linearize small deformations — which means that the distance between the representation of a deformed image and the representation of the original image is of the order of the size of the deformation. So it is Lipschitz continuous with respect to deformations. This is a very important property that you do not obtain with a Fourier transform, and we'll see that it plays an important role in almost all applications. So let me first begin with quantum chemistry. For quantum chemistry, what you have is the following: you know the positions and the charges of the atoms of a molecule — in other words you know the different types of atoms, carbons, hydrogens and so on — and what you would like is to learn the ground-state energy without solving the N-body Schrödinger equation. There is of course the beautiful approach introduced by Kohn and Sham to approximate the N-body Schrödinger problem, by observing that the energy only depends upon the electronic density, and by relating the electronic density to this energy. Now, the electronic density, of course, you don't know it either if you don't solve the Schrödinger equation. So what we're going to do is first use the prior information to structure the network, and then learn this energy. What kind of information do we have in this problem? What we're given is a list of atoms. This list gives, as I said, the charges and the positions, and the energy is invariant to the indexing of the list.
First, obvious property. Second property: the energy is invariant to translation of the molecule — translation of all the position parameters — and also to rotation. And if you slightly deform the molecule, the energy changes slightly. The other thing is that you fundamentally have a multiscale problem. Why? Because you have very strong covalent-bond chemical interactions at very short range, and at long range you have the electromagnetic forces, in particular the Van der Waals forces. What the kind of decomposition you obtain with a multiscale representation gives you is a kind of factorization of these forces: first evaluating the very strong, high-frequency interaction forces, and then regrouping, at different scales, the forces due to the electromagnetic field. All these properties, by the way, will essentially be the same for images, which explains in particular why the problems are somewhat similar. Now, in the case of quantum chemistry, what you're given is, as I said, the positions of the atoms, so we represent the electronic density in a very naive way by saying that each atom has exactly its charge located at its position: you have a Dirac whose amplitude is the number of charges. Then we compute our wavelet transform. To get an intuition of what's happening, take the convolution of your sum of Diracs with a wavelet: a Dirac convolved with a waveform gives you back the waveform located at the position of the Dirac, multiplied by the amplitude of the Dirac. So what happens is as if each Dirac were emitting a small waveform, a small wavelet, and these wavelets are going to interact. Then you have the phase collapse — the modulus — which produces the interactions. When the wavelets are very small, their supports do not interact and you just get blobs; when the wavelets get bigger, the interactions create these interference patterns, which capture the geometry of the molecule. You do that on two layers — this is the scattering we compute — then you average, and you get something which is invariant to translation and rotation and stable to deformation. And then you do a linear regression. That was work done by Michael Eickenberg, Georgios Exarchakis, Matthew Hirn, Nicolas Poilvert and Louis Thiry, who were postdocs and students at the École Normale Supérieure. Now, if you do that on a large database of organic molecules — a standard database that was used a few years ago — and you compare with state-of-the-art deep networks which learn all the filters, you essentially get the same results. You get the same results in these cases, but these are relatively easy problems. The reason they are relatively easy is that the molecules we're using have only of the order of 10 heavy atoms, and at most about 30 atoms, so we're dealing with small structures. When you have something simple like that, again, you know enough about the physics and the interactions not to have to learn, and if you build the appropriate scale interactions you will be able to get precise results. The situation gets very different when you have much more complex problems, and that is where I'd like to finish. To get to these complex problems I'm going to move to images.
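To make the molecular representation above concrete, here is a toy 2-D numpy version: charge-weighted Diracs on a grid, filtered at several widths so that small scales isolate atoms while large scales make them interfere. The grid, the scales and the difference-of-Gaussians "wavelet" are illustrative stand-ins; the actual work uses solid-harmonic wavelets in 3-D.

```python
import numpy as np

def density_on_grid(positions, charges, size=64):
    """Naive electronic-density surrogate: one charge-weighted Dirac per atom."""
    rho = np.zeros((size, size))
    for (i, j), z in zip(positions, charges):
        rho[i, j] += z
    return rho

def blur_at_scale(rho, sigma):
    """Convolve with an isotropic Gaussian of width sigma (via FFT)."""
    k = np.fft.fftfreq(rho.shape[0])
    kx, ky = np.meshgrid(k, k, indexing="ij")
    g_hat = np.exp(-2 * (np.pi * sigma) ** 2 * (kx**2 + ky**2))
    return np.fft.ifft2(np.fft.fft2(rho) * g_hat).real

atoms = [(20, 20), (24, 40), (44, 30)]                  # toy "molecule" on a 64x64 grid
charges = [6.0, 1.0, 8.0]                               # e.g. C, H, O
rho = density_on_grid(atoms, charges)

features = []
for sigma in (1, 2, 4, 8):                              # small scales: isolated blobs; large: interferences
    band = blur_at_scale(rho, sigma) - blur_at_scale(rho, 2 * sigma)   # crude band-pass "wavelet"
    features += [np.abs(band).sum(), (band**2).sum()]   # translation-invariant, roughly isotropic descriptors

# such multiscale invariant descriptors are then fed to a linear (ridge) regression of the energy
```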
So, what we are currently doing is defining a fixed network that implements this kind of conditional renormalization group approach with wavelets, followed by a linear classifier. If you do that on simple images — for example if you want to classify digits — you do as well as a deep network which learns everything. The reason is that when you deal with digits, the main sources of variation are translations, rotations and deformations: these are transformations you know about and have been able to deal with, and you will be fine. When you have stationary textures, it is like the turbulence problem: because of stationarity, things work out and you can do as well as deep nets. If you deal with images which are much more structured — cats, dogs, mushrooms and so on — there is a big gap relative to deep networks. If you do this kind of wavelet scattering, the error we get over 1000 classes is about 50%, whereas a ResNet — a bit deeper, but not much — gets an error which is four to five times smaller. The question is: what is learned, and how does it relate to everything we have been discussing? That is what I'm going to finish with. This is the work of Florentin Guth and John Zarka, who are finishing their PhDs. Until now, we have imposed the type of interaction model we put at each layer; now we learn it. What does it mean to learn an interaction potential? It means that we learn, in the language of neural networks, a 1x1 convolution: an operator that relates all the feature maps at different orientations and different frequencies. So when you go to the next scale you propagate, and you learn this interaction operator, which is this 1x1 operator. However, the spatial filters remain wavelets, so spatially you still just do a scale separation. The nonlinearity is still a modulus, but you also propagate the phase, and you iterate: at each layer you learn this operator. Once you have done that, you apply the linear classifier. So basically you get something which is much closer to a standard network; the only difference is that the spatial filters are fixed — they are wavelets — and the only thing you learn is the filter which is a 1x1 convolution across the channels: the wavelet-filter interactions, and so on. It is a neural network with no bias parameters. And then you learn everything with a gradient descent: you learn all these potentials together in order to optimize the classification. What we saw is that if these operators are the identity — so we don't learn them, we just fix them — you have a much bigger error on ImageNet, about five times more, and even on a smaller database called CIFAR you get an error which is still of the order of four times bigger than a ResNet-18. Now, if you just learn these operators, you reach the state of the art. What that means is that all the information is indeed captured within these potentials: you do not need to learn the spatial filters. The spatial filters are essentially there to separate scales, in order then to create the interaction operators — separate scales, create interactions, separate scales, and so on. And now, of course, the outstanding problem is to understand the mathematical nature of these potentials, these interaction potentials, that have been learned at each scale. But once you understand that, you basically have a framework where you have learned your energy, your Hamiltonian, entirely.
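A skeleton, in PyTorch, of the architecture just described: the spatial filtering stays fixed (here placeholder random filters stand in for oriented wavelets, and a plain absolute value stands in for the complex modulus), and the only learned weights are 1x1 convolutions that mix channels, i.e. the interaction operators. This is a paraphrase under those assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class LearnedScatteringLayer(nn.Module):
    """Fixed spatial wavelets + modulus; only the 1x1 channel-mixing operator is learned."""
    def __init__(self, in_ch, n_wavelets, out_ch):
        super().__init__()
        # placeholder fixed filters: in practice, oriented wavelets at one scale
        w = torch.randn(n_wavelets, 1, 5, 5)
        self.register_buffer("wavelets", w / w.pow(2).sum(dim=(2, 3), keepdim=True).sqrt())
        self.mix = nn.Conv2d(in_ch * n_wavelets, out_ch, kernel_size=1, bias=False)  # learned 1x1, no bias

    def forward(self, x):
        b, c, h, w = x.shape
        # depthwise filtering with the fixed wavelets, one bank per input channel
        y = nn.functional.conv2d(x.reshape(b * c, 1, h, w), self.wavelets, padding=2)
        y = y.reshape(b, -1, h, w).abs()                  # modulus nonlinearity
        y = nn.functional.avg_pool2d(self.mix(y), 2)      # learned channel interactions + subsampling
        return y

layer = LearnedScatteringLayer(in_ch=3, n_wavelets=8, out_ch=32)
out = layer(torch.randn(1, 3, 64, 64))                    # -> shape (1, 32, 32, 32)
```

Cascading a few such layers and ending with a linear classifier gives the structure discussed above: fixed scale separation, learned cross-channel interaction potentials.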
And what clearly appears is that, each time, what you have been learning are these conditional probabilities, which relate one scale to the next. So I'll conclude on that. My first — what I call important — conclusion is that it really seems that these deep network architectures are learning scale interactions. This topic of scale interaction is an old topic in physics: we know that turbulence is about scale interactions, small whorls creating larger whorls and the propagation of energy across scales. One can view these architectures as a way to learn these interactions. The second thing is that the renormalization group framework is, I believe, an appropriate framework to understand what is happening; in particular, it allows us to understand why you suddenly precondition the problem. And again, let me insist: there is a basic difference with the more standard way the renormalization group is viewed. When people define what is often called the Hamiltonian in physics, they define it on the field itself — the microscopic Hamiltonian, or the Hamiltonian, the energy, at the different scales. That is not what we do. We say this is much too complicated; what we model are the conditional probabilities — in other words, the interaction Hamiltonian between the high frequencies and the lower frequencies. Now, the important thing is that because this problem is simpler, you do not need to learn the networks when you have an essentially ergodic stationary field. And I do believe that one should progressively be able to capture most classical physical phenomena, as long as we deal with ergodic stationary fields — turbulence is a typical example, or the kind of scales you have in cosmology. Of course, there are beautiful problems in trying to understand how these Hamiltonian expressions relate to the underlying microscopic models of the system. And the last thing is: when you begin to reach the intermediate scales — this is, let's say, the domain of materials science and physics — the problem gets much more complicated, and I have a tendency to believe that there you do need to learn the interaction potentials; they are getting too complicated, or maybe someone will find a way to avoid learning, but let's say that at this stage we have no way to avoid it. And that may be where learning is really important; at the same time, interpretation should take its place, and there are very interesting problems in trying to understand the nature of these potentials that have been learned. So that's it. Thanks very much.

Thank you very much, it was a very interesting talk. Does anyone in the audience have any questions? Okay, well, maybe while they think about it, I do have one actually. I'm very impressed by your conclusion, how you manage, just by this simple trick, to find back the state of the art. And I was wondering: does that indirectly imply that the network — the ResNet in this case, which is learning filters — is learning something that is equivalent to these wavelet bases? If you have a look at the filters they have, are they in any way comparable?

Okay. So, to answer that, let me show this slide. When you look at these neural nets — whether it is a ResNet or any of the different architectures — what is easily accessible is the first layer, and if you look at the first layer, the filters look like wavelets. But in all the other layers you have three-dimensional filters, which are a product of the 2D spatial filters and of the potential.
And it is very difficult to separate what the spatial filter is and what the potential is. So: in the first layer, we see the wavelets; in the other layers, it is very hard, because you only see these 3D filters, which are basically the 2D filters multiplied by the 1D potential. Could you factorize wavelets out of that, and would that really reveal what they are doing? We do believe that it does, but that remains to be demonstrated. I would also say that these neural nets have a huge number of parameters, so they have a lot of flexibility. Whether they do that or some other combination of these vectors, I don't know; there is no reason why there should be a unique solution. What we are proposing here is a way to structure the problem and to propose, let's say, a simple solution — one solution — but you can always do a recombination and probably get other filters.

Very interesting, very impressive. I don't know if you have a question?

Yeah, I have a small question. Hi. I've read your book on the wavelet transform carefully, and I have one question. We know we have different wavelet filters, like the Gabor filters or the Daubechies filters, and all of them were formulated by humans, by mathematicians; they have a solid mathematical background. In the deep learning era, I'm wondering whether we can use deep neural networks, or machine learning methods, to help us design some new wavelet filters.

Okay, so in some sense they are learning: when you look at these filters, you observe, as I said, in the first layer something that looks like wavelets, and they are not the standard orthogonal wavelets or the Gabor wavelets. But I would say that it does not matter so much which wavelets you use. The wavelet is here essentially to separate the different frequencies and the different orientations, so whether you use this wavelet or that wavelet does not make a huge difference. What makes a huge difference is the building of the potential that creates the scale interaction. If you look back at the time of the research on wavelets in the '90s, there was a lot of work on trying to optimize this wavelet basis versus that wavelet basis, and most of that work has been forgotten; a few wavelet bases remained — in particular the Daubechies wavelets and the Gabor wavelets — because they have nice properties, and also because it did not matter so much how you optimized the wavelet in the end. I would rather say here that, essentially, you do not care so much which precise wavelet you take, as long as it does the scale separation and the orientation separation that you expect should be done to reveal the physical phenomenon.

Okay, thank you.

Does anyone else have another question? Right, well, I don't think so. So, right, thank you again very much for this super interesting talk, and very impressive results indeed. And yeah, thank you. See you another time then. Thank you.