So good morning everybody. I think we will continue from where we left off yesterday. There is a slight change in the talk: it will be wavelets, interactions, energy, and neural networks. So I look forward to another very interesting talk. Thank you.

So I indeed decided to slightly change one part of the content, to do something slightly less technical and to introduce the last part on neural networks, to give some of the basics for those who don't know so much about them. What I'm going to do first is revisit what we saw last time and try to explain why these wavelet bases are a key element for doing these scale separations, what their properties are, and in what sense the Fourier basis is not so well adapted, despite the fact that it is a major tool in mathematics and in physics, even in this context.

So let me come back to the problem. The problem is trying to learn the energy of a physical system at equilibrium; in machine learning terms, it is about learning the probability distribution of the data. If you have a model of the data, there are plenty of applications, both on the physical side, to understand the physics and the energy and to generate new examples, and in signal processing, where it allows you to suppress noise and to solve inverse problems, for example suppressing degradations resulting from measurements.

What we saw is that an important tool, when you are in very high dimension and your probability distributions concern fields that have structure at all scales, is to separate scales. The scale separation is done through the renormalization-group factorization of the probability distribution, and we will review that briefly. Then, of course, the question is what representation you are going to use for the field, the data: what kind of basis? Typically, as I said, there are two types of bases for that kind of scale separation, wavelets on one side and Fourier on the other. We will study the properties of wavelet bases, going back to the core ideas in relation with the uncertainty principle. And in the last part I will give an introduction to neural networks and deep neural networks, which will be the basis for this afternoon's talk on applying neural networks to data modeling and data generation.

Good. So, as I said, when you want to estimate, let's say here, the probability distribution, or the energy in the case of a physical system, there are really two types of problems. On one hand you have the approximation problem: what class of approximating functions are you going to use for this probability distribution, which is equivalent to asking how to approximate the energy with a family of energy functions. These are the so-called ansatz that you use in physics: you have prior information about the type of interaction, you put that into your model, and then you try to approximate your energy with it. To do so, you go through an optimization phase which consists in fitting the parameters. To fit the parameters, you need to minimize an error, and there are different ways to measure this error. I mentioned the relative entropy, that is the Kullback-Leibler divergence, and score matching.
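To make these two error measures concrete, here they are written out for a model p_theta of a true distribution p (standard definitions, consistent with how they are used later in the talk):

$$
D_{\mathrm{KL}}(p \,\|\, p_\theta) = \int p(x)\, \log \frac{p(x)}{p_\theta(x)}\, dx,
\qquad
\mathcal{L}_{\mathrm{SM}}(\theta) = \mathbb{E}_{x \sim p}\Big[ \big\| \nabla_x \log p(x) - \nabla_x \log p_\theta(x) \big\|^2 \Big].
$$

The score-matching objective involves only the gradient of the log-probability, so the normalization constant of p_theta drops out; this point comes back later in the talk.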
And if you then want to use your model to produce new data, you need to sample your probability distribution, which means that you need to find a field which has a low energy, so that it is a highly probable example. And that again goes through an optimization. So the key tool to do that, as I explained last time, is going to be the renormalization group, in order to separate scales. Why? Because the fields we are going to deal with are, as I said, multi-scale fields with very long-range correlations. If you work directly on the field, you are going to suffer from these very long-range correlations, in the sense that your optimization algorithm will never converge. It's not that it's slow; it's that it would take years for your CPU to converge. You just can't do it. So what you need to do is take your problem and slice it into problems which are well conditioned and which can converge.

Now, how do you do that? The idea of the renormalization group is to take your probability distribution and factorize it across scales. The renormalization group, more classically in physics, is a tool where you begin from the energy at fine scale — here you see the scale axis — and you want to compute the energy, or the probability distribution, at the different scales as you take your field and progressively reduce its resolution: you get coarser and coarser approximations. So basically, in the forward renormalization group you go from fine scales, say microscopic scales, to progressively larger and larger scales.

Now, in machine learning, as I explained, we have a different problem. We don't have the model to begin with; what we have is the data. So with the data, we begin by building a small model. How can we do that? We begin at a very coarse scale. Instead of having a huge image of 1 million by 1 million pixels, we begin from a very small image of, let's say, 2 by 2 pixels. There you are in dimension 4 and it's much easier; you can build a model. That means you are going to begin at a very, very large scale. Yes? Exactly, and I'll go back to that. You take your image, you average neighboring pixels and subsample, average and subsample, again and again. When you do that, that's the Kadanoff scheme. In the Fourier domain, this amounts to taking the image and reducing it to lower and lower frequencies, so that at the end you keep very few frequencies in your image, the very low frequencies. So you begin from this very low-frequency information; that's the only thing you model. And then you go up the ladder by progressively increasing the resolution — that's the inverse path. That means you build a model at very large scale, and then you progressively compute the high frequencies given the lower frequencies. So you need the probability distribution of the high frequencies — these will be the wavelet coefficients or the Fourier coefficients — conditioned on the lower-frequency coefficients. And the key observation, Wilson's observation, which we are still using, is that these conditional probabilities have an energy which is well conditioned, which means that the Hessian doesn't explode and the covariance doesn't have very long-range correlations, and therefore the optimization goes fast. Now the question is how to represent that: what basis will give you the best, the most well-conditioned problem?
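As a minimal sketch of the Kadanoff averaging-and-subsampling scheme just described (the function name and sizes are mine, for illustration):

```python
import numpy as np

def kadanoff_coarse_grain(field, num_levels):
    """Average 2x2 blocks of neighbouring pixels and subsample, repeatedly.

    Returns the list of progressively coarser fields, from fine to coarse.
    `field` is assumed to be a 2D array whose sides are divisible by 2**num_levels.
    """
    levels = [field]
    for _ in range(num_levels):
        f = levels[-1]
        # average each 2x2 block of neighbouring pixels, keep one value per block
        coarse = 0.25 * (f[0::2, 0::2] + f[1::2, 0::2] + f[0::2, 1::2] + f[1::2, 1::2])
        levels.append(coarse)
    return levels

# usage: start from a 256x256 field and go all the way down to 2x2
phi = np.random.randn(256, 256)
pyramid = kadanoff_coarse_grain(phi, num_levels=7)
print([p.shape for p in pyramid])
```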
And you have, as I said, two problems. You can ask that question for the optimization, and as we'll see, for the optimization the Fourier basis is very good. And that was — yes, sorry, at the very coarse scale, yes.

So essentially what you do is you take your image and you take big two-by-two blocks, of half the size of the image? — Not quite. Maybe I can go back to the slide of the previous talk for a second to really illustrate. You begin from an image which is here. How was it built? You begin from the original image, you average two-by-two pixels, or local groups of pixels, and subsample; you get this image. So this image is an average over a very large domain, subsampled. And the idea is that you are really going to begin from here. Maybe that's what you were saying, sorry. Yes, it's okay. And then you go up the ladder: you want to build these models, and building them amounts each time to computing the fine scale given the coarse scale, which means computing this conditional probability distribution. It gives you the high frequencies which are here but which have disappeared there. How do you go from this low-frequency image to this higher-frequency image? What you have lost are the high frequencies which are here and not there; that's what you need to compute, and of course these high frequencies depend on the low frequencies: they are given by this conditional probability distribution. Okay, thank you for the question.

Does phi bar again represent the frequencies that you discarded at the previous step? — Exactly. The phi bar always represents the higher frequencies that you need to go from one scale to the next. And you can either represent these high frequencies in the Fourier basis or in the wavelet basis, and we'll see why you would do one or the other; the scheme is the same. The only real question here is: what is the basis in which you are going to represent the high frequencies, in other words the phi bar which allows you to go up the ladder? So this is where we are.

Now, to briefly summarize the key algorithms used to do that: in the optimization phase there are two types of algorithms. First, suppose you have your model and you would like to get typical fields, to generate new data. How do you generate new data? You need to find a field which has a low energy, and to do that you do a gradient descent on the energy. This is the Langevin algorithm. And why do you add noise in this Langevin diffusion? Because your energy may have local minima, and you need to get out of them. So you do a gradient descent which tends to go down the energy landscape, but sometimes, because of the noise, you get out and explore another local minimum, and that allows you to explore the whole energy landscape. Now the problem is that it can be slow. It can be slow when you have very deep local minima, because it takes a long time to get out. That's the first point. Second point: even if the energy is perfectly convex, if the field is very highly correlated, if you have a Hessian which is very badly conditioned, gradient descent is going to be very slow; it's always like that with gradient descent.
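A minimal sketch of the Langevin algorithm just described, on a toy double-well energy (the step size and number of iterations are arbitrary choices of mine):

```python
import numpy as np

def langevin_sample(grad_energy, x0, step=1e-3, n_steps=10_000, rng=None):
    """Unadjusted Langevin dynamics: gradient descent on the energy plus Gaussian noise.

    grad_energy(x) returns the gradient of the energy E at x.  The injected noise is
    what lets the trajectory escape local minima of E and explore the whole landscape.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    for _ in range(n_steps):
        x = x - step * grad_energy(x) + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
    return x

# usage on a toy double-well energy E(x) = (x**2 - 1)**2, minimised at -1 and +1
grad_E = lambda x: 4.0 * x * (x ** 2 - 1.0)
sample = langevin_sample(grad_E, x0=np.zeros(1000))
```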
So you will need to improve the conditioning, to make sure that you don't have these very long-range correlations, that you don't have a covariance whose eigenvalues are of very different orders of magnitude. Now, the other optimization problem I mentioned: you have built a class of models, but you still need to optimize the parameters. In the case of a neural network, that means you can choose an architecture, but then you have to choose the weights. This is the learning phase. Learning is about optimizing the parameters, for example the weights of the neural network, which I call theta here. Theta is a set of potentially millions of parameters. To do so, you would like the probability distribution of the model to be as close as possible to the true probability distribution, and there is a very natural measure for that, the Kullback-Leibler divergence, in other words the relative entropy. You can minimize it by again doing a gradient descent on theta, but as I explained last time, when you compute this gradient you need to sample the probability distribution at each intermediate point, which is very slow.

There is another approach, and this approach is the basis of the neural-network generation algorithms. What you see right now, all these beautifully synthesized images, follow this second approach, which I'll describe in more detail this afternoon. The idea is, instead of looking at the distance between the log-probabilities of the true distribution and the model, you look at the distance between the gradients of the logs. The advantage of using the gradient of the log is that you eliminate the normalization constant, and the normalization constant is what takes a long time to compute. So these algorithms are much faster. But again, if your problem is badly conditioned, it is not going to work — not in the sense that it will take a long time, these algorithms are fast, but in the sense that it will take a lot of data to converge to a precise minimum. And that you absolutely want to avoid, because you don't always have a lot of data, and when you do, you want to use it well. So conditioning the problem is going to be key.

And last time we looked at this toy model, which nevertheless reveals very important properties of this problem: the phi-four model, a model of ferromagnetism. In physics, as I said, your energy can usually be decomposed into a kinetic energy, essentially corresponding to the energy of the velocity or of the increments of the field values, and a potential. In the case of a scalar potential, the potential is simple: it is just a sum of a function of each value of the field. And the potential function looks like this: it is minimal when the field equals minus one or one. That means it is going to push the value either towards minus one, and you see a black point, or towards one, and that's a way to model spins. It's as if the values had to be minus one or one, as if you had a spin, although values other than minus one and one are still allowed, just with lower probability. Now, the first case, as I said: beta equals zero, which means infinite temperature. That means the kinetic energy is negligible. So the values fluctuate randomly; it's as if you had random spins taking values minus one and one.
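A minimal sketch of the discrete phi-four energy just described, kinetic term plus scalar potential (the exact couplings and normalization are mine, for illustration only):

```python
import numpy as np

def phi4_energy(phi, beta):
    """Discrete phi^4 energy on a periodic 2D grid.

    Kinetic part: beta times the sum of squared increments between neighbouring sites
    (a discrete gradient energy).  Potential part: sum over sites of V(phi) = (phi**2 - 1)**2,
    which favours values close to -1 and +1, like spins.
    """
    kinetic = beta * (np.sum((phi - np.roll(phi, 1, axis=0)) ** 2)
                      + np.sum((phi - np.roll(phi, 1, axis=1)) ** 2))
    potential = np.sum((phi ** 2 - 1.0) ** 2)
    return kinetic + potential

# beta = 0: no kinetic term, values fluctuate independently (infinite temperature)
print(phi4_energy(np.random.randn(64, 64), beta=0.0))
```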
And this is not at zero temperature but at pretty high temperature, and you see it's very disordered. Now when you begin to decrease the temperature, so increase beta, the Laplacian, the kinetic energy, becomes more and more important, and you see correlations appearing, up to the point where you reach the critical temperature, where you see very long-range structures: very big domains of white and very big regions of black, the minus ones, all together. This is where you reach the phase transition. Beyond that, there is, as I said, a kind of winner-takes-all: either the minus ones take over or the plus ones take over, and that's what you are reaching here. Whereas before the average of the field was always zero, meaning no global magnetization, when you go beyond the phase transition the average is either close to one or close to minus one: suddenly the material has a magnetization.

Now, when you look at the power spectrum, you observe the following. At high temperature, as you go towards lower frequencies, it flattens out, which means that points that are very far away are essentially independent. When you reach the critical temperature, even very far away the spins are not independent; there are very long-range influences. And these very long-range influences mean that all the algorithms are going to take incredibly long to converge. It also means you essentially cannot estimate the parameters.

Okay, so as I said, we are going to factorize the problem into the low frequencies and then the transitions from low to high frequency, and each term we are going to model and approximate. The question now is: what is the appropriate basis to do so? So, yes.

With respect to the previous question, here you are assuming a Markov property across scales, in the sense that the distribution at a certain scale depends only on the scale before? — No, no, sorry, it's a question of notation. The phi j contains all the phi bar before. Why? Because phi j is the whole block of low frequencies. So I assume that the high frequencies interact with all the previous frequencies, which are regrouped within this low-frequency image phi j. That's important, thanks.

So, let me now go to these wavelet bases. Let me put aside the renormalization group for a moment and go back a bit to why these wavelet bases appeared and what kind of properties will be important for our problem. What I'm going to show is, first, that the key point is that these bases have a local support, which gives you a lot of good properties when you want to analyze structures in an image or a signal; and second, that they are able to eliminate the kind of long-range correlations created by a power spectrum with a power-law decay.

Okay, so what is a wavelet, and where did it come from? It came from a very different field. In seismology, people send waves below the ground, get reflected waves back, and ask what kind of wave will be the most informative. And there was an engineer at Elf Aquitaine at the time, Jean Morlet, who proposed to use these so-called wavelets. It's called a wavelet after the name of the waves used in seismology: basically a local wave that is going to be dilated.
Now, there were similar ideas coming from very different fields. In particular Alex Grossmann was a physicist working in quantum physics, where there is constantly this question of the uncertainty principle: what kind of wave packet can be used that is local both in space and in the momentum domain, in other words in the Fourier domain, and what kind of representation does that give?

So the main property of a wavelet is that it has a zero average, and I'll come back to that. Then the idea is that you take your little wavelet and dilate it: that's the scale parameter. So s is not yet a power of 2, it's a continuous scale parameter. When s is big the wavelet is wide; when s is small it's a very narrow wavelet. And because the support is local, if you want to cover all the information you need to translate the wavelet. So you have two parameters: the translation u, which says where the wavelet is located, and the size of the wavelet, the scale s. Then you take your signal — your function f here, it's not phi anymore, now f is the function I want to decompose — and you project it onto the wavelet. Projection means you compute an inner product; in other words you compute the correlation between your function and the wavelet, and that's the wavelet coefficient.

You can also look at that in the Fourier domain — you always have two ways to look at things, in the spatial or time domain or in the Fourier domain. What is the Fourier transform? This is the notation: I put a hat, and omega is the frequency, so the Fourier transform, as you know, is the integral of your function against a sine wave, an inner product with a sine wave. Now, the wavelet is built so that it is well localized in the Fourier domain. Because the wavelet has an integral equal to zero, if you set omega equal to zero you get the integral, which means the Fourier transform of the wavelet vanishes at zero. But you also build a very regular function, so the Fourier transform has fast decay at infinity; so it is a packet of frequencies somewhere in between. Now, the wavelet coefficient is an inner product, an integral, but you know from Parseval's formula that an integral in time can also be written as an integral in the Fourier domain, so it is the correlation of the Fourier transform of the signal with the Fourier transform of the wavelet. The fact that your wavelet is localized in the Fourier domain means that the wavelet coefficient also gives you local information in the Fourier domain about your function. So you have something which is local both in space and in Fourier — but the uncertainty principle tells you that you cannot be perfectly local in both domains. Perfectly local in space means a Dirac, totally delocalized in Fourier; perfectly local in Fourier means a sine wave, totally delocalized in the spatial domain. Wavelets are somewhere in between, but pretty well localized in both domains.

Okay, this is a wavelet transform of a function: something as a function of time, and you see it has discontinuities, regular parts, and a very irregular part. The wavelets shown in blue can be translated and have different scales, and this is a wavelet transform with a continuum of translations and a continuum of scales, so you get an image.
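A minimal sketch of how such a scalogram image can be computed (the particular wavelet, a Mexican hat, and all sizes are my own choices for illustration):

```python
import numpy as np

def mexican_hat(t):
    """A simple real wavelet (second derivative of a Gaussian); it has zero average."""
    return (1.0 - t ** 2) * np.exp(-t ** 2 / 2.0)

def cwt(f, scales):
    """Continuous wavelet transform: correlations of f with dilated, translated wavelets.

    Returns an array of shape (len(scales), len(f)): one row per scale, one column per
    translation, i.e. the scalogram image.  Scales are in sample units, and the wavelet
    support (about 8*s samples) is assumed shorter than the signal.
    """
    coeffs = np.empty((len(scales), len(f)))
    for i, s in enumerate(scales):
        half = int(4 * s)
        t = np.arange(-half, half + 1, dtype=float)
        psi = mexican_hat(t / s) / np.sqrt(s)       # dilated wavelet, normalised by sqrt(s)
        # psi is even, so convolution equals correlation with the translated wavelets
        coeffs[i] = np.convolve(f, psi, mode="same")
    return coeffs

# usage: a discontinuity creates a cone of large coefficients pointing to its location
n = 512
f = np.sin(2 * np.pi * np.arange(n) / 64.0) + (np.arange(n) > 300)
W = cwt(f, scales=2.0 ** np.arange(1, 6))
```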
Each point in this image is the inner product between a wavelet, which is shown here, and the function. What you see are these kinds of cones of high-energy coefficients that converge to the singularities. Why? Because when the wavelet oscillates in the neighborhood of something which has a sharp transition, boom, it gives a big coefficient. When the wavelet oscillates, like here, in a region where the function is very regular, therefore nearly constant, then because the wavelet has a zero average the coefficient is nearly zero; you see it as gray. White is highly positive, black highly negative. So you can see that most coefficients at fine scales are zero, because the function is mostly regular; there are a few places where the coefficients are big, near the singularities; and here, because you have a kind of Brownian motion, you have irregularities everywhere. By looking across scales you can see how the singularities progressively develop.

Now, the question is how these coefficients relate to the true properties of the function, and that question was studied a lot in mathematics between the 1970s and 2000. The other question is: this is a very redundant image, because you move your wavelet a little each time — can you build a totally decorrelated representation? Building a decorrelated representation here means building an orthogonal basis. So we are going to choose not a continuum of translations but a translation grid, shown over there, where the scales vary only like powers of 2, 2 to the j, and at each scale the translation step is proportional to 2 to the j. With this discretization grid you are going to build your orthogonal basis.

Now, the first orthogonal wavelet basis is not a new object: it was introduced in 1909 by Alfréd Haar, and it is called the Haar wavelet. It is a very simple wavelet: a function equal to 1 between 0 and 0.5, equal to minus 1 between 0.5 and 1, and 0 everywhere else. Now, here is a math exercise you can do: take this function, dilate it by factors of 2, translate it by 2 to the j times n, and try to prove that you get an orthogonal basis of the space of all functions. It's not an easy exercise, but it's feasible, and the way you do it is basically to take your function and progressively approximate it by piecewise constant functions; that is how these bases appear. The problem is that it is a discontinuous basis, and that's not so good.

Wilson, in his first paper on the renormalization group, used the Shannon wavelet instead. How does it look? He wanted a view in the Fourier domain, something relatively well localized in space, but also an orthogonal basis. So he built the function in the Fourier domain: he defines the wavelet in the Fourier domain to be equal to 1 between pi and 2 pi and 0 everywhere outside. Why? If you take this function and dilate it by factors of 2, you see that the wavelets obtained all have disjoint frequency supports, and therefore they are orthogonal: if you compute their inner product with Parseval's formula, you integrate two functions whose Fourier supports don't overlap. So you get another nice orthogonal basis, but it is now discontinuous in the Fourier domain, which means that in the spatial domain it decays slowly, like 1 over t, and you don't want that; you want something very well localized in the spatial domain. So the question is: can you do something well localized in between the two? And for a long time people thought that it was not possible.
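For reference, the two classical wavelets just described can be written explicitly (standard definitions, consistent with the description above):

$$
\psi_{\mathrm{Haar}}(t) = \begin{cases} 1 & 0 \le t < 1/2, \\ -1 & 1/2 \le t < 1, \\ 0 & \text{otherwise,} \end{cases}
\qquad
\hat{\psi}_{\mathrm{Shannon}}(\omega) = \begin{cases} 1 & \pi \le |\omega| \le 2\pi, \\ 0 & \text{otherwise,} \end{cases}
$$

with the dilated and translated family $\psi_{j,n}(t) = 2^{-j/2}\, \psi(2^{-j} t - n)$ forming an orthonormal basis.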
In particular, the commonly understood answer was no, until a mathematician called Yves Meyer tried to prove that it was indeed impossible. Now, how do you try to prove such a result? One technique is to try to build a counterexample and show that it is impossible to find one. So he tried to build a counterexample, and, big surprise, he found one. In other words, he found wavelets that are well localized in space and well localized in Fourier. He later got the Abel Prize, for many other results, but this one is also an important element. So what did he find? He found wavelets that are very regular in Fourier; this is the wavelet in space, it has fast decay, faster than any polynomial, and it defines an orthogonal basis. That means that if you take this function, dilate it by all powers of 2 and translate it by 2 to the j times n, you get an orthogonal basis of the whole space of functions.

And then the question was: now that we know counterexamples exist, let's find all the possible wavelets — are some better than others? So here again the function is defined in the Fourier domain: it is 0 before 2 pi over 3, then it increases, is flat on an interval, then decreases; all this is carefully designed so that you get your orthogonal basis. And that is how the function looks in space; you can write the expression of this function analytically.

As I said, the question then was what all the possible wavelets are, and that is something I worked on at the time, during my PhD. There is one way to attack the problem, through algorithms. You see why the Fourier transform is so useful: because there is a fast Fourier transform, which takes a signal of size n and computes all the Fourier coefficients in n log n operations. Question: can you have a fast wavelet transform? There again it was a very interesting time, because it was not just physicists and mathematicians working on that: signal processing people, for their own reasons, had been working on a similar problem. And when you put together two fields, that is where it is easiest to make a contribution, because you don't have to discover something yourself; you just have to say, hey, the people over there did something which, if you adapt the language, solves this problem. And that is essentially what I did at the time: realize that the algorithms used in signal processing, in a totally different context, provided a fast wavelet transform algorithm. It essentially amounts to taking the original signal, the original image, which lives on a large frequency interval, and splitting the frequency interval into the high frequencies, shown in blue, obtained with a first filter, and the low frequencies, shown in red — and that looks like the kind of algorithm you will also see for neural networks. You have a first filter, a low-pass filter, that gives you the low-frequency part, and a high-pass filter that gives you the wavelet coefficients. Then you take the low-frequency part and split it again into a high-frequency part and a low-frequency part, subsample, and keep the wavelet coefficients. Think of the renormalization group: it is always the low-frequency part that you keep decomposing, again and again.
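A minimal sketch of such a filter-bank cascade, using the simplest pair of filters, the Haar filters; the signal length is assumed to be divisible by 2**num_levels:

```python
import numpy as np

def haar_fwt(signal, num_levels):
    """Fast wavelet transform as a cascade of filtering and subsampling.

    At each level, a low-pass filter h = (1, 1)/sqrt(2) and a high-pass filter
    g = (1, -1)/sqrt(2) are applied and the outputs subsampled; the low-frequency
    part is split again, the high-frequency outputs are the wavelet coefficients.
    Total cost is O(n).
    """
    approx = np.asarray(signal, dtype=float)
    details = []
    for _ in range(num_levels):
        low = (approx[0::2] + approx[1::2]) / np.sqrt(2.0)   # low-pass filter + subsample
        high = (approx[0::2] - approx[1::2]) / np.sqrt(2.0)  # high-pass filter + subsample
        details.append(high)
        approx = low
    return approx, details   # coarse approximation plus wavelet coefficients per scale

def haar_ifwt(approx, details):
    """Exact reconstruction, inverting the cascade level by level."""
    for high in reversed(details):
        low = approx
        approx = np.empty(2 * len(low))
        approx[0::2] = (low + high) / np.sqrt(2.0)
        approx[1::2] = (low - high) / np.sqrt(2.0)
    return approx
```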
How many operations? You had n coefficients to begin with, and the signals you compute are twice smaller each time, n over 2, n over 4, so the total is O(n): it is even faster than the fast Fourier transform, which is n log n. And the question is: can you reconstruct? From these coefficients, can you go back? You can ask what the filters should be in order to invert the transform, and this is a relatively simple problem; if you solve it, you reconstruct an orthogonal wavelet basis. In other words, this is how wavelet bases are constructed, and that is also something important in neural networks: what you learn there are the filters, the weights. You take your image, you filter it with weights, and then you do something else; same thing here, it is the filters that do your convolutions. The filters satisfy a very simple reconstruction formula, and once you have the filters you cascade them and you get the wavelet. So the wavelet is obtained as a cascade of filters, and you can prove that you get an orthogonal basis. And then you can build as many orthogonal bases as you want: you just have to choose the filters, which can be complicated. There is beautiful work by Ingrid Daubechies: she found the smallest possible filters leading to wavelets that are regular, and as you increase the support of the wavelet it becomes more and more regular. These are different wavelets giving different orthogonal bases.

Okay, so let me come back to what is going to be useful: take a function — thank you, yes?

You mentioned the inverse problem in the reconstruction. Can we be sure that this is not an ill-posed problem? Do you risk, how to say, numerical instabilities? — No, because we can choose the filters. You see what I am doing here: I am choosing the filters so that — and I am going to show it here — the filters h and g cover the whole frequency domain. So if I split into the low and the high frequencies, you can see that you will be able to recover the original. The question is how you make sure they are orthogonal, and that is where you need a few more conditions, but you have no instability, because you design the filters to be orthogonal and to generate your whole space.

So now you have an orthogonal basis, you have your function, and you can decompose it into wavelet coefficients; they carry all the information. Like Fourier coefficients, these are coefficients in a basis. So how do they look? This is your function; these are the coefficients at the finest scale, then the next scale. A bar pointing down means a negative coefficient, a bar pointing up a positive coefficient. Here you have a thousand coefficients, but almost all of them are zero, so you don't see anything. Why? Because the function is very regular here, so the wavelet coefficients are almost zero; they are big here because you have a discontinuity. So now we can ask how the wavelet coefficients relate to the properties of the function f. Think of the Fourier case: there, this is a key question in functional analysis — how do you see that a function is regular? You see that the Fourier coefficients have fast decay. Here you have the same thing, but it is more subtle: with the Fourier transform, the only thing you can say is that the function is globally regular or globally irregular.
Okay, so this is the function here. At the top I show the wavelet coefficients, the inner products, ordered per scale: these are the fine scales, and the scale increases going up; and this is the spatial variable n, and for each n I show a wavelet coefficient. When the wavelet coefficient is zero you see nothing; a big bar marks the few big wavelet coefficients. And the question I am asking is: how do the coefficients evolve as you go across scales? You can view this image as a kind of discretization of the continuous image I showed before, except that here I look at coefficients on the grid: on the very fine grid you have coefficients here, here and here, there they are zero, and at coarse scales they are there. If you go back — sorry, because it goes here in the reverse order — this is the fine scale, where you see the few big coefficients, and this is the coarse scale.

Now, what you would like is not just to know that the function is irregular; you want to know where the irregularities are — the fact that this is a discontinuity, that this is a discontinuity of the derivative, the fact that here you have a very regular part. And how do you do that? You just follow locally the amplitude of the wavelet coefficients across scales, and they give you a pointwise regularity property: you know how regular your function is at a point just by looking at the decay of the coefficients there. So you have something much more powerful than in the Fourier case, because it is not a global characterization of regularity. And when you begin to deal with something which is not Gaussian, some parts of your field will be very regular and some parts singular; this is why you need something which is local in space.

Sorry, Stéphane — of course, this is the time axis? — Okay, you have two variables here. One variable is the scale, that's the first variable. The second variable, n, is the position of the wavelet — you are absolutely right, I should have labeled it — this is the n variable, in time: the position of the wavelet. Each time you have a wavelet at a new position, you compute the inner product, and when the wavelet is positioned wherever the function is regular, the coefficient is zero.

Excuse me, one more question. Couldn't you think, as an alternative, of a kind of local Fourier analysis, where you would cut your function into pieces, do a standard Fourier analysis piece by piece, and keep track of all of these? — Absolutely, you are absolutely right, you could do that, and many people did. But if you do that, you have to decide on a size: what is your localization? And once you fix it, within this interval you don't know where your singularity is — you don't know if it's there, there, there or there; in other words, the information is delocalized. Here, at the end of the game, I will know exactly that it is at t0, because you progressively refine the resolution. The problem with what you propose is that you have to fix a reference size, and when you have a multi-scale field there is no reference size; that is the problem. If you have a reference size, then it is a great idea. You could also increase the resolution each time, but then you pay the price of having more and more coefficients to keep in memory, whereas here you are orthogonal, you have no redundancy: the total number of wavelet coefficients is the same as the number of samples of your signal, you have an orthogonal basis. The price you would pay there is indeed a redundant representation.
I am confused about this point. Imagine that you have N points, a certain number of points on the t-axis; then your n variable would have the same resolution, so otherwise you would have N log N points? — Now, here I have N over 2 points, here N over 4, N over 8; don't forget that in the algorithm I subsample. Here I had N points, here N over 2 wavelet coefficients, here N over 4, because of the subsampling. It's again like your images — we'll see it on images — you subsample them, and when you sum N times 2 to the minus j over the scales, you get N. I rescale my position variable: at each scale the position step is bigger, so the number of points is smaller. Don't hesitate to stop me, because none of this is easy.

Now, there is one thing I am going to show: an application to denoising. Why? It may not look so interesting here, but you are going to see that denoising is at the heart of all image synthesis, of the state-of-the-art synthesis of images with neural networks; that will be this afternoon. So how can you use that kind of thing to suppress noise? You take a — oh sorry, there is one slide before that. You have your image here; you saw these were the coefficients. One thing you can do now: many of the coefficients are small, so I keep only 10% of the coefficients, the biggest ones, and I reconstruct from there. This is what you get, and that is why you can compress. What I did here is a thresholding: I only kept the coefficients above a threshold and I reconstructed, so it is a partial wavelet sum, but because it captures most of the energy, you recover something very close to the original image. This is like an adaptive grid. What you are doing here is very subtle: instead of saying "I have a signal, I want fewer coefficients, so I take a coarser grid", what you do is use more coefficients near the singularities — that is where you put your coefficients — and very few coefficients where the function is regular. So just by thresholding you are building a very subtle approximation of your function.

Now denoising. Suppose you have your image, a nice piecewise regular function, and these are the wavelet coefficients; there are not many, because they are located near the singularities. Suppose you add Gaussian white noise; that is what you see, with the random fluctuations. If you look in the wavelet domain, what does the noise do? It creates plenty of non-zero wavelet coefficients, which you see there. Now you want to eliminate the noise; how can you do it? There is a very simple idea: if the noise is not too large, it creates small coefficients, and if it is white noise they all have a fixed variance sigma squared. You just put a threshold above the noise level: you set to zero, exactly as we did before, everything below the threshold, and you keep the rest of the wavelet coefficients, and then you reconstruct. That is what you get. You killed all the noise here, because you thresholded, but you didn't smooth the signal. What you would do in a traditional Fourier approach is eliminate the high frequencies, but if you do that, all the singularities are smoothed. Here they are not smoothed — you see they are still very sharp — because you kept the very large coefficients. You still have a bit of noise, of course; it's not perfect, that's the noise that is kept there. But you have something very adaptive, in a subtle way, while the algorithm is simple: you just keep the coefficients above a threshold.
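A minimal sketch of this thresholding denoiser, reusing the haar_fwt and haar_ifwt functions from the earlier filter-bank sketch (the signal, noise level and threshold rule are illustrative assumptions):

```python
import numpy as np

def denoise_hard_threshold(noisy, num_levels, threshold):
    """Wavelet denoising by hard thresholding.

    Decompose with the Haar cascade, set to zero every wavelet coefficient whose
    magnitude is below the threshold, and reconstruct.  A classical threshold choice
    is proportional to the noise standard deviation, e.g. sigma * sqrt(2 * log(n)).
    """
    approx, details = haar_fwt(noisy, num_levels)
    details = [np.where(np.abs(d) > threshold, d, 0.0) for d in details]
    return haar_ifwt(approx, details)

# usage: a piecewise-regular signal plus Gaussian white noise of standard deviation sigma
n = 1024
t = np.linspace(0.0, 1.0, n)
clean = np.where(t < 0.5, np.sin(4 * np.pi * t), 1.0)
sigma = 0.1
noisy = clean + sigma * np.random.randn(n)
denoised = denoise_hard_threshold(noisy, num_levels=6,
                                  threshold=sigma * np.sqrt(2 * np.log(n)))
```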
So people were very excited about these things; there were many applications. That was the 90s and early 2000s. Now let me go to images. This is the image that I call f0. You apply the same algorithm, but along the rows and the columns of the image: you filter the rows and then you filter the columns. Think of it in the Fourier domain — here is the Fourier transform. What you do is take your image and split it into the lower horizontal frequencies and the higher horizontal frequencies, and then you go in the other direction and split again, and you get what you see here: the low frequencies over there, and the high frequencies in one direction, in the other direction, and in the three corners. You repeat that, and repeat that, and you get the wavelet transform that I showed in two dimensions. What it means is that you now have a wavelet which is a little two-dimensional function, local in space and local in Fourier; these are the different Fourier localizations of the wavelet at the different scales, and these show the wavelet coefficients. When it is gray, the coefficient is 0. Why? Because if you think of the image, which here was a square, and you put a wavelet where the image is constant, the wavelet coefficient is going to be 0. Where is it non-zero? Where you put the wavelet near an edge, near a singularity. This is why the wavelet coefficients are big along the singularities: you see the edges appearing, you see the singularities wherever they exist in your field.

Now, the last mathematical concept: vanishing moments. Think of it this way: if you have a function, you can expand it locally in a Taylor series, decompose it into polynomials. What we are going to do is build a wavelet which is orthogonal to polynomials up to a certain order m: if you integrate it against a monomial of degree up to m, you get 0. In the Fourier domain — take the Fourier transform of this relation — it means that all the derivatives of the Fourier transform of the wavelet vanish at 0, so it is very flat near 0; it looks like something like this. This is going to be key to killing the singularity of the covariance near 0. Why? This was your covariance; its eigenvalues are the power spectrum, and the power spectrum has a power-law decay, with a singularity here at zero frequency. When you multiply by that, there is one important point: you want to go to the wavelet domain. So now the field, which I call phi again, you project onto the wavelets to get the wavelet coefficients, and what you would like is that these wavelet coefficients be nearly decorrelated.
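Written out (the notation is mine, transcribing what is described just above and below), the vanishing-moment condition and the correlation of two wavelet coefficients of a stationary field with power spectrum S(omega) read:

$$
\int t^{k}\, \psi(t)\, dt = 0 \quad \text{for } 0 \le k < m
\;\Longleftrightarrow\;
\hat{\psi}^{(k)}(0) = 0 \quad \text{for } 0 \le k < m,
$$

$$
\mathbb{E}\big[ \langle \varphi, \psi_{j,n} \rangle \, \langle \varphi, \psi_{j',n'} \rangle \big]
= \int \hat{\psi}_{j,n}(\omega)\, \hat{\psi}_{j',n'}^{\,*}(\omega)\, S(\omega)\, d\omega,
\qquad S(\omega) \sim |\omega|^{-2} \text{ in the Gaussian model discussed next.}
$$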
What does decorrelated mean? It means that the correlation between a wavelet coefficient of the field at one scale and a coefficient at a different scale, or at two different positions, is almost zero. You would like things to be decorrelated. It looks impossible, because the field is very correlated, but I am going to show why it is possible.

Yes, this will be the power spectrum of the Gaussian model? — Exactly, it is going to be the 1 over k squared; omega is what you call k — it is always complicated to do the translation, that's why you need time — that is the k variable, omega is the frequency. If you write this correlation, you can write it in the Fourier domain: the operator is diagonal there, this is the power spectrum, and the correlation is the integral of the Fourier transform of one wavelet times the Fourier transform of the other wavelet times the power spectrum. But look at where the two wavelets are located. First of all, observe that the two wavelets are nearly zero where the singularity of the spectrum explodes, so when you take the product, the wavelets essentially ignore the low-frequency explosion; they do not suffer from it. Then, two wavelets at two different scales barely overlap in frequency, so they are not going to be correlated. And within the same scale, because the wavelets oscillate, the correlation decays very quickly to zero. Here again you touch a whole chapter of harmonic analysis in mathematics.

As a reminder: take a probability distribution which is Gaussian, so the energy is a quadratic form; the covariance is the inverse of the interaction matrix; the power spectrum gives the eigenvalues of the covariance, and it has this decay. Basically, what we are saying is that the wavelet picks out one chunk of frequencies, and over this one chunk the power spectrum varies only by a constant factor. You have no instability: the condition number of the covariance restricted there is of the order of 2 or 4, no explosion. You had an explosion because, if you look at all frequencies together, the ratio between the smallest and the largest eigenvalue is huge; but if you restrict yourself to this frequency band, the ratio is small.

And if you add the identity to the Laplacian...? — Yes, exactly; the identity doesn't change anything, it just shifts everything. If you shift like that, you are right, you kill the singularity. But the thing is, you have problems precisely when you don't have that: if you take phi-four at the critical point, it diverges; you have that critical point where it diverges. If you take turbulence, it diverges with a five-thirds exponent. So you could change the problem, but this is the problem we are facing. Now, this is at the center of a lot of work in math. Basically, people have been studying that kind of operator, with that kind of divergence, and what you can prove is that in a wavelet basis, if you look at the matrix representation of your operator and you factor out the diagonal — factoring out the diagonal means extracting the diagonal of the matrix and inverting it — then you get an operator which is almost of the order of the identity.
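A small numerical illustration of this conditioning argument, under the assumed 1/k-squared power spectrum (the grid size and the chosen octave are arbitrary):

```python
import numpy as np

# eigenvalues of the covariance for a power spectrum S(k) ~ 1/k^2 on a 1D grid
k = np.arange(1, 2 ** 10 + 1, dtype=float)
S = k ** -2.0

full_condition = S.max() / S.min()              # ratio over all frequencies: huge
octave = S[(k >= 64) & (k < 128)]               # restrict to one octave of frequencies
octave_condition = octave.max() / octave.min()  # of order 4 for a 1/k^2 spectrum

print(full_condition, octave_condition)         # roughly 1e6 versus roughly 4
```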
What does that mean? It means that the matrix behaves as if it were diagonal; it is almost diagonal. So you may say, why not make it exactly diagonal? Well, if you use a Fourier basis you will make it diagonal, but you pay a cost: you are not local in space anymore, and you are going to pay for that, as I am going to show.

Let us go back now to phi-four. Again, the problematic regime is the transition point, where you have long-range correlations. As I said, if you try to sample the field directly with a Langevin algorithm, you are going to wait forever if the field is large, because the number of iterations grows like the size of the field raised to some exponent: it gets very, very slow. Now you can use the renormalization-group strategy, which consists in separating the different scales and then sampling each one at a time. What do you observe if you count the total number of Langevin steps? First of all, it is much lower, whether you take a Fourier or a wavelet basis, and the most important thing is that the number of iterations does not grow with the size of the field. In other words, whether you are at the critical point, at high temperature or at low temperature, it is the same; it is totally flat. The different colors correspond to different types of wavelets: the Haar wavelets, the oldest ones, 1909, discontinuous; the Daubechies wavelets; and the Shannon wavelet. You see they are all roughly together; the best is in fact Shannon, because it is better localized in Fourier. And if you look at the performance of the Fourier transform, it is even better. So why not use the Fourier transform? Why does the Fourier transform do so well here? Because it is perfect for diagonalizing the covariance, for separating the frequencies. So why not go with it — and indeed that is why it has been mostly used in physics.

But if now your problem is that you don't know the energy, that you have to learn it from the data, then you don't know to begin with what this term and that term are; you want to learn them by parameterizing. And now you have a problem. The Fourier transform is very good for this term, because this term is a Laplacian, which is diagonal in the Fourier basis — perfect. But this term, the potential, is local in space, and what is the Fourier transform going to do? It is going to dilute all the information, and it becomes very hard to get precise information about the scalar potential. You can still try, and that is the kind of result you get. Which is best, in the sense of the smallest model error? It is in fact the Kadanoff strategy, the Haar basis, because the Haar basis is very well localized in space, so it gives very good information about the scalar potential, the part which is non-convex. The worst is the Fourier basis, and at one point the error shoots up: when the field gets large, the error becomes very big. A good strategy is the intermediate one, which is good both in terms of speed and in terms of precision of the model: something like the Daubechies wavelets with two vanishing moments, some intermediate position. And one comment: Haar is good for phi-four, but Haar does not work for turbulence, because turbulence has a faster Fourier decay and there Haar is not well conditioned, whereas the Daubechies ones stay well conditioned.

Let me now quickly finish, because we are late, with the introduction to neural networks.
So with neural networks we are going to look at a different kind of problem, typically classification and regression problems. You have an image and you want to know which class it belongs to: let's say this is an image of mushrooms, or cherries, or dogs, or Madagascar cats. So this is the class Y, and a classification problem is really about associating the two — sorry, I am going back now to the notation of machine learning, where the data is X; that is going to be a bit confusing, sorry, but the field that was called phi is now called X. X is the field, X is the data. These are examples of data and these are the corresponding outputs, and you can do all kinds of things: X can be the geometrical configuration of a molecule and Y the ground-state energy that you may want to compute in physics; you can compute that with DFT, solving the Schrödinger equation, or you can compute it with machine learning.

So what are the key ideas of a neural network? A neural network is an aggregation of very simple computing units. What does a neuron do? It takes an input, the data, and multiplies it by weights — these are going to be the parameters of the neural network — so it makes a weighted average of the data with the weights, possibly adds a bias, and then applies a nonlinear transformation. The nonlinearity mostly used nowadays is called a rectifier: it sets the output of the neuron to zero if the value is negative, and if it is positive it keeps it as is. Think of it as a biological neuron, without pushing the analogy too far: there is no output on the axon if the value is negative, and otherwise it lets it go through. And what a neural network does is take such neurons and organize them into layers: the input goes through a whole series of neurons with weights, their outputs go into the next neurons with weights, which have outputs, with weights again, and you cascade them until you get the final output. So what are the parameters, the theta, in this model? Theta is the set of all the weights in the network; when you optimize theta, you optimize the weights of your neural network. So we are in the same mathematical framework as before — it is just that instead of having 500 parameters you may have 500 million parameters. You still put some prior information into the network. Where do you put the prior information? You put it in the architecture, and that is where multi-scale structure comes back.
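A minimal sketch of the neuron and the layer cascade just described (the sizes and the random, untrained parameters are arbitrary illustrations):

```python
import numpy as np

def relu(z):
    """Rectifier: zero if the value is negative, identity if it is positive."""
    return np.maximum(z, 0.0)

def mlp_forward(x, weights, biases):
    """Cascade of layers: each layer is a weighted sum plus bias followed by the rectifier.

    `weights` and `biases` together play the role of the parameters theta; in a real
    network they are learned by gradient descent on a loss over training examples.
    """
    a = x
    for W, b in zip(weights, biases):
        a = relu(W @ a + b)
    return a

# usage: a tiny two-layer network with random (untrained) parameters
rng = np.random.default_rng(0)
weights = [rng.standard_normal((16, 8)), rng.standard_normal((4, 16))]
biases = [np.zeros(16), np.zeros(4)]
out = mlp_forward(rng.standard_normal(8), weights, biases)
```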
Why multi-scale? If you look at the architectures that have been developed, in particular the convolutional neural networks developed by Yann LeCun, the first idea is this: if you look at different parts of the image, you don't know where the object is — it may be there, it may be there — so there is no reason to use a different set of weights here and there. If you use the same weights in this little region and in that little region, it means that the matrix of weights, viewed as one big operator, is a matrix which is invariant under translation. A matrix which is invariant under translation is called a Toeplitz matrix, and the corresponding operator is a convolution. This is why these neural networks are called convolutional networks: the weights do not change when you translate the position of the neuron. Then you apply your nonlinearity, the rectifier, and then you again apply convolutions to all the outputs; because there are many outputs, you get a whole layer of different images. Observe that these images have been subsampled, a bit like in the fast wavelet transform: you do a filtering and a subsampling — but very different in that you apply a nonlinearity here — and then you filter again with weights which are very local, like wavelets; there is no global mixing of the information. Subsampling, and again your nonlinearity; convolution, subsampling, nonlinearity, up to the point where the images get very small. Think of that as a scale axis: you convolve, subsample, convolve, subsample, average, and progressively aggregate.

Your problem now is these weights: you want to optimize them, so you need a way to do that. What is your goal? Your goal is to have a function here which does what you want — here, for example, classification. So you want to minimize the error, the loss: you want to make sure that, for the good parameters, on the training examples that you have, the network gives the answer that you know is right. In other words, you know in advance that this is an image of a cat, and you have to make sure that, for the right parameters, the network will find out that it is an image of a cat. So you have a gradient, you have a minimization, it is always the same. How do you minimize? Gradient descent. Is that going to work? Yes, if you have a nice convex function, gradient descent leads to the minimum. If you have a non-convex function, you are going to have problems: local minima. How can you get out of the local minima? With noise, and indirectly that is what stochastic gradient descent is doing; I am not going to enter into that.

Now, the very impressive thing is that the same kind of architecture applies to very different types of problems: image recognition, audio and speech recognition, medical diagnostics, scientific computation such as the calculation of ground-state energies in quantum physics, generation of images, music and text, programming, mathematical proofs. Why? It has been a huge surprise over the last fifteen years. Why are these structures able to solve such different problems? The argument that I am making here is about what all these problems have in common.
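Going back to the convolution, rectifier, subsampling cascade described a moment ago, here is a minimal single-channel sketch (real networks use many learned filters per layer; the sizes and random weights here are arbitrary):

```python
import numpy as np

def conv_layer(image, kernel, stride=2):
    """One convolutional layer: a small local filter applied at every position
    (same weights everywhere, i.e. a Toeplitz / convolution operator),
    followed by the rectifier and by subsampling with the given stride."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)   # local weighted sum
    out = np.maximum(out, 0.0)            # nonlinearity (rectifier)
    return out[::stride, ::stride]        # subsampling, as in the fast wavelet transform

# usage: cascade a few such layers so the image gets smaller and smaller
rng = np.random.default_rng(1)
x = rng.standard_normal((64, 64))
for _ in range(3):
    x = conv_layer(x, kernel=rng.standard_normal((3, 3)))
print(x.shape)
```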
They have a hierarchical organization: you can separate scales, and if you take the problem and separate scales, the natural way to pose the question is to understand what the interactions across scales are. Essentially, it is the same kind of reason why the renormalization group appears all over physics: it is a natural way to organize the information.

Now, a few pieces of evidence for that, and I will finish by making a connection with neurophysiology, because of questions asked by the students last time. What you can do is look at what was learned by the network. What is easiest to look at is the first layer, because you can just display it; beyond that there is no ordering of the coefficients and it gets complicated. So let's look at the matrix W1. What does W1 do? It does local convolutions, so you can look at the filters. Here are the filters that were learned — this is one of the first neural networks where the first filters were relatively big, 10 by 10. How do they look? Like wavelets: they look like oriented, local filters in different directions. Strange? Well, maybe not so strange, because indeed the network has learned to separate the information into different frequency bands in order to then exploit and process it.

Let's look now at physiology, at biological vision. This is the brain of a macaque. You have the optic nerve, which takes the influx from the eye, and it goes to the first visual area, V1, which is at the back of your brain, here. Then the information, in your brain or in the macaque brain — yours is the same — propagates from V1 to V2, V4, IT. In V1, what do you have? Something relatively simple. That was discovered in the 1960s by Hubel and Wiesel; that is where they got their Nobel Prize. They discovered that the neurons are organized in hypercolumns, and you can test how a neuron reacts. What kind of experiments did they do? They show something to the retina of the monkey, they measure the neuron, and they measure the relation between input and output. And they observe, big surprise, that the relation is, to a first approximation, linear. So you can compute the convolution kernel, and that is how it looks — this comes from a physiology publication, and there have been thousands of publications about that. What did they observe? That within a hypercolumn you have different neurons. How are they related? They are dilated, they have different orientations, and all this is organized into the hypercolumn — again, these are photocopies of physiology publications. These were called Gabor wavelets or Gabor filters. Why Gabor? Gabor was a physicist, and these filters look very much like a Gaussian multiplied by a sine wave, which are the functions with the best uncertainty-principle properties: the most localized in space and in Fourier simultaneously. And it happened that they were found in V1, the primary visual cortex.

Then the question was: okay, V1 is well understood, but what about V2, V4, IT? IT is where object recognition happens, and there, in the 1970s, 80s and 90s, what they found was a lot of nonlinearities — it is not linear anymore, very complicated — and an aggregation of the simple cells into what they called complex cells, for lack of a better name. And that was it. Now, to answer a question asked by one of the students yesterday: what is the relation with neural networks? Obviously there were labs that began to work on that, and a very nice example, one of the first labs, was at MIT, the lab of DiCarlo.
In one of their publications, from 2018, what did they do? They wanted to compare the responses of real neurons with the responses of a neural network. What they already knew is that the first layer, in V1, looks very much like the first layer of a neural network, in the sense that both look like wavelets. But the question is: what about the deeper layers, and in particular what about IT, where object recognition happens? In a deep network you can also ask where object recognition happens: in the very deep layers, in the layer just before the classifier. So they compared the responses of the real neurons and the responses of the neurons in their network. They did not do this neuron by neuron, but population of neurons by population of neurons, and they found remarkable correlations. Meaning what? Meaning that it is probably not the same thing, but the same kind of properties are being extracted. And that has been confirmed by many experiments since.

If you have such a result, you may ask why. Is it just by chance, or is it because there was no choice? If there was no choice, it means that somewhere it is the structure of the problem that forces that kind of solution. And the claim here is: yes, it is probably the structure of the problem. What guides that structure is again these scale separations, which lead you to organize the computation into a cascade of filters that are local all along, with nonlinearities in between. In the last part of the previous talk I explained one problem: the phase. The phase is what hides all the interactions: if you compute correlations, things look as if they were independent. And what is a rectifier? A rectifier sets to zero all the negative coefficients and only keeps the positive ones; in other words, it is a phase killer. So what is happening there is an iteration of filters and phase killers, filters and phase killers, at multiple scales. How to model all that? There are beautiful problems in front of us, and many questions that can be addressed.

So, to conclude this part, the claim is the following. If you go back to the renormalization group and you only worry about optimization, that is, you already know the energy, you know everything about your system, then yes, you can use the Fourier transform; wavelets would be well adapted as well, but the Fourier transform can do it. The key point, however, is that you still need to separate the scales in order to have a well-conditioned problem. If, on the other hand, you want to discover the information, if you want to discover the energy, then you cannot use the Fourier transform anymore, because the Fourier transform is global: it mixes everything in your data. When you have an edge here and a very regular part there, it mixes everything, whereas you want local information. And you observe the same thing in neural networks: they do local aggregation of information, and that is exactly what wavelets do, which is why they are such a natural tool. Now, all this being said, neural networks are much, much more complex and have impressive performance, so there is a beautiful problem in trying to understand them.
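To see the "phase killer" point concretely, here is a minimal one-dimensional NumPy sketch under the assumption of a toy signal made of sparse bursts: two band-pass, Gabor-like channels at different scales look nearly decorrelated, but after a rectifier their coefficients are clearly correlated, because the rectifier discards the oscillating phase and keeps the local amplitude. The signal, filters, and scales are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy signal: a few sparse bursts, so the structure is localized in space.
n = 4096
x = np.zeros(n)
x[rng.choice(n, 40, replace=False)] = rng.standard_normal(40)

def gabor(scale, n_taps=129):
    # Real 1-D Gabor filter: Gaussian envelope times a cosine at frequency ~ 1/scale.
    t = np.arange(n_taps) - n_taps // 2
    g = np.exp(-0.5 * (t / scale) ** 2) * np.cos(2 * np.pi * t / scale)
    return g - g.mean()  # roughly zero-mean, hence band-pass

# Two band-pass channels at well-separated scales.
c1 = np.convolve(x, gabor(4.0), mode="same")
c2 = np.convolve(x, gabor(16.0), mode="same")

relu = lambda u: np.maximum(u, 0.0)  # the rectifier, i.e. the "phase killer"

print("correlation of raw coefficients:      ", np.corrcoef(c1, c2)[0, 1])        # close to zero
print("correlation of rectified coefficients:", np.corrcoef(relu(c1), relu(c2))[0, 1])  # clearly positive
```

The two channels occupy nearly disjoint frequency bands, so their raw coefficients are almost uncorrelated even though both are driven by the same bursts; rectifying exposes the shared amplitude structure across scales.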
This afternoon, what I am going to show is how you can take these neural networks, use the Langevin diffusion, and generate amazing images; this is what is called score matching, and it is behind all the generated images you have seen around, created by neural networks. So that will be for this afternoon. Thank you.

Are there further questions that were not yet asked? Thank you very much, a very nice talk. With respect to the last slide, where you showed this hierarchy in the visual ventral stream and in neural networks: what is remarkable in the brain is that the deeper you go, the more you develop invariant object recognition, in the sense that if you translate, if you rotate, and so on, the response does not change. Does this have any analogue from the renormalization-group point of view, so that this invariance is developed on general grounds?

Okay. First of all I will do the comparison with the neural network, and then I will come back to the renormalization group. The interesting thing is that in deep networks you see the same invariances as you go to deeper layers: the representation becomes invariant to rotation, to pose, and so on. And again, there is no choice: if you do not build these invariances, you have too much variability and you will never do your recognition. The question which is very interesting, indeed, is that these groups are very natural physical groups: how do they come into the renormalization group? They come in through the energy. That means that the energy you are going to build must have the symmetries you want to impose, in other words the invariance to rotations and so on. And if you think about it, that is where it is important to choose the wavelet well. One way to think of a wavelet is as one fixed function psi; then you look at the orbit of psi under the dilation group, which means you dilate it; then the orbit of psi under the rotation group, which means you rotate the wavelet; and then the orbit under the translation group. Now, when you compute the wavelet coefficients, if you want to impose invariance to any of these groups, you average the coefficients along the corresponding orbit. In other words, you can impose these invariances. Your question is very important, because it is indeed key in recognition to build the appropriate invariants. It is like in physics: the group invariance, the group structure, is what structures the basic laws; here it is the same. If you do not build the appropriate invariants, you will never reach the appropriate recognition.

Any further questions from the students? No? Then I have a more general question. You switched pretty fast to neural networks, which is very timely and we hear a lot about them, but I am curious to know what the main success stories of wavelets were, before the modern machine-learning era, in signal processing. You mentioned denoising; are there other applications?

At the time, the main success stories were compression, and inverse problems, as in fMRI and medical imaging: you get slices through X-rays and you want to invert the tomographic operator, which is unstable, so you need denoising, and you represent the operator in a wavelet basis. So: all kinds of inverse problems, noise removal, compression. Wavelets were then applied to fast computations, because the underlying operators, differential and pseudo-differential operators, can be represented in them; but there they were in competition with other types of algorithms, in particular multigrid algorithms, which are essentially equivalent, and the multigrid algorithms won the race. So I would say the biggest applications were the ones I mentioned. And you can view all this work, in some sense, as a one-hidden-layer neural network, because what were these algorithms essentially doing? A decomposition into wavelet coefficients, a kind of thresholding, smart thresholding, and then reconstructing the output. What deep networks do is iterate these transformations many more times.
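A minimal sketch of that "one hidden layer" picture, wavelet decomposition, soft thresholding of the detail coefficients, reconstruction, assuming the PyWavelets library; the toy signal, the db4 wavelet, the noise level, and the threshold rule are all illustrative choices.

```python
import numpy as np
import pywt

rng = np.random.default_rng(0)

# A piecewise-smooth toy signal plus Gaussian noise.
t = np.linspace(0.0, 1.0, 1024)
clean = np.sin(8 * np.pi * t) * (t > 0.5)
noisy = clean + 0.3 * rng.standard_normal(t.size)

# "One hidden layer": wavelet transform, soft thresholding, inverse transform.
coeffs = pywt.wavedec(noisy, "db4", level=5)
threshold = 0.3 * np.sqrt(2 * np.log(noisy.size))  # universal threshold for a noise level of 0.3
denoised_coeffs = [coeffs[0]] + [pywt.threshold(c, threshold, mode="soft")
                                 for c in coeffs[1:]]  # keep the coarse approximation, threshold the details
denoised = pywt.waverec(denoised_coeffs, "db4")
```

The analogy is that the thresholding plays the role of the pointwise nonlinearity; a deep network repeats this filter-plus-nonlinearity step many times instead of once.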
That is my answer. Alright, thank you very much. If there are no additional questions, let's thank Stéphane again for this wonderful talk.