It's wonderful. I'd like to thank, first of all, Gabrielle and Joan for their scientific vision, because this whole program was really assembled by them. We've seen incredible talks, three days of really beautiful talks, with amazing diversity: from Freddie's talk, written by hand, with this idea of going totally orthogonally to whatever other people are doing, to talks of technical mathematics, to beautiful empirical and algorithmic results. I think this is really the DNA of this community, and I'd like to come back to that.

It's strange to be crossing 60 years and to be here. You've heard it already: originally, after three years I just wanted to get out of there; my goal was essentially to go to the US. And this has been much more than just research. It's a style of life. It's meeting many friends who are here. And it also gives meaning: a sense of beauty sometimes when we do research, a sense of excitement. I think it's an incredible life that we're having. Those who took the RER B may have realized that half of France is on strike because people don't want to work two years more, while Ruzena Bajcsy is 90 and still hoping to continue working. I think most of us hope that this journey will be as long as possible, and that's a reflection of how lucky we are to do this work of research, the pursuit of knowledge, teaching wonderful students. That's a luck I didn't imagine for one second having.

Now, there is also a kind of sense of family that we all have: our advisors, our siblings, meaning all the ones with whom we've been working, our students. My position is a bit different, because most of you are only children: you had one advisor. I was in the particular situation of having two parents, two parents who were totally separated; I think they never even met, in fact. One of them, as you just saw, Ruzena Bajcsy, was an incredible engineer who had a vision of what computer vision and robotics should be. And she had a life which reads like a novel. At age 11 she escaped the Gestapo; in 1967-68, as the Soviet tanks were arriving, she went from Slovakia to the United States, became one of the first women ever to get a PhD at Stanford, and then, with a few pioneers, created the whole field. At the same time she had this love for mathematics, but from far away. And then I had an adopted father that you've heard about, Yves Meyer, this incredible pure mathematician who was also in love with applications, but it was a totally platonic love: he never even touched them, he just loved the idea of the possibility of applications. My DNA is a kind of crossing of these two approaches.

I think that's also the DNA of this community, and this is why, for me, it's an amazing community: this mix of engineering, mathematics, physics. I think that's what has kept it so youthful, why, 40 years on, we are still so many at this conference, with young people around. Dave's talk about the tension between theory and empirical work was very interesting, but I have a slightly different view of it. Yes, we are in the temple of mathematics, of pure mathematics, and yes, being able to freely work on pure ideas can be incredibly creative. But sometimes you end up in a hole, you end up specializing, and things get narrower and narrower. Experiment is what shakes all this up. It has no respect whatsoever for beautiful theory.
If it doesn't match the table, it just doesn't work. And I think this is extremely important. On the other hand, experiment without theory, without mathematics, has essentially little meaning. So for me, beauty is there: beauty is the meeting of these incredible ideas, as we've seen during these three days, with the experimental, in some sense with what's happening in the world.

The last general point, before going into something a bit more concrete, is that I find there is something a bit strange in the way ideas develop. Yann, you spoke about all these things that had to be wiped out and rethought. And indeed our first reflex is to say: let's get out, let's go. But what we all experience a bit is that we come back to these ideas. My impression is that there is no linear time in research. It's more like a spiral: we get out, we come back, and we revisit, but from a slightly different point of view. The amazing thing is that this spiral is moving and we have no idea where it's going, but we come back to similar ideas from totally different points of view. For me there are three ideas that I've been turning around, thinking I would free myself, and coming back to; three ideas which really have no origin in time. The first one, of course, is hierarchy, which was translated into scale separation, which became wavelets; I spent part of my life on that. The second one is the idea of simplicity, Occam's razor, that Rémi spoke about, which became sparsity and regularity. And the third one, the strangest to me at the beginning, is the idea of randomness: compressed sensing, random projections. What I like very much about deep networks is that they allow us to revisit all these ideas from a totally different point of view and force us to rethink all of it. That's a little bit what I'd like to speak about during these 15 or 20 remaining minutes.

The beginning is of course the architecture of convolutional neural networks, which is still this beautiful mystery: whether for supervised or unsupervised learning, what is the underlying mathematics? We know that there is structure, that there is a lot of prior hidden within this architecture, and at the same time I hoped, at the time, that learning was not necessary. I lost my bet, I lost the three-star restaurants, and I want to come back to that, because of course I want to get something out of these three-star restaurants, and there are many ideas coming out of it. What is the role of the architecture, the nonlinearity, the channels, and so on? What about the weights? This is incredibly complicated, because each time you relearn you get a new set of weights: how can we build models of these weights? And of course the Grail, which is to understand the output: what is the class of functions that corresponds to the output? What are the functional spaces? What can we learn, and what can't we learn? How do we deal with this famous curse of dimensionality? So, as I said, I wanted to go and look back, and maybe, in the sense of what Guillermo spoke about, it's always very interesting to go back and read the original fundamental work; that's what I spent part of my time doing during COVID.
So the problem underlying unsupervised learning is estimating probability distributions, but in physics this means understanding the physics, because it means understanding the energy: if you have the gradient of the energy, you have access to the forces, to the interactions. Can we estimate that, of course, without suffering from the curse of dimensionality? This is at the center of statistical physics. People have been trying to find the origin of wavelets, and nobody ever found a single one; each time you look in a different field you find another origin of everything. One of the origins is the renormalization group and the incredible work of Kadanoff and especially Wilson, who got the Nobel Prize for it. That is another way to view the idea of hierarchy, pyramids and so on.

The key idea of Wilson is that when you have a very high dimensional system, as you encounter in physics, and you want to look at the probability distribution of a field like this cosmic web, what you should do is average, sub-sample, average, sub-sample, coarse-grain progressively, and look at the evolution of the probability distribution at each scale. The key point is that this probability distribution has a very regular evolution across scales if you normalize the coefficients. What that means, and this is explicit in Wilson's calculation, is that one way to describe this probability distribution is to begin at a very coarse scale, where you have a very low dimensional system, and then progressively move across scales by looking at the conditional probability distribution at the fine scale given the coarse scale; you obtain a decomposition of your probability distribution as a kind of Markov evolution. The key idea, and we see it all over AI, is that these conditional probability distributions are much simpler, if you understand how to build the conditioning.

Now, this was done in statistical physics for relatively simple models, like ferromagnetic or Ising models, where basically the energy has only a quadratic term and what is called a scalar term, which is non-convex and forces the value of the image to be in the neighborhood of either 1 or -1, corresponding to spins. Depending on the temperature you can have a totally disordered system or very long range correlations. The beauty of Wilson's work is that he showed that everything can be understood simply if you look at these conditional probabilities. The only problem is that this never went much further: it was never applied to turbulence or other interesting things, because there was no model. And that's where machine learning came in, as we'll see.

Before doing that: one of Wilson's ideas was, in fact, that you should decompose with wavelets, and in the 1970s he came up with his own wavelets. The idea is that if you want to relate a high resolution image to a low resolution image, what you should do is extract the complementary information of the high resolution image, so that you can rebuild the high resolution image from the low resolution one. And what is this complementary information? It is the high frequencies, which you extract by doing convolutions with wavelets. What we now have are techniques to do that in orthogonal bases, and that's what a wavelet transform does: you get the low frequencies and the orthogonal wavelet coefficients, then you further decompose the low frequencies, and so on.
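Written as a formula (the notation is chosen for this note, it is not taken from a slide), the decomposition being described is

    p(x) = p(x_J) \prod_{j=1}^{J} p(\bar{x}_j \mid x_j),

where x_j is the field coarse-grained (averaged and sub-sampled) down to scale 2^j, x_J is the coarsest approximation, and \bar{x}_j are the wavelet coefficients carrying the complementary information needed to recover x_{j-1} from x_j. Each factor is a conditional probability of the fine-scale detail given the coarser field: the Markov evolution across scales just described.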
Now, the key idea is that this conditional probability of the high resolution given the low resolution, and this is in Wilson's calculation, can be viewed as the probability of the wavelet coefficients given the low frequencies; and the low frequencies you can again decompose into wavelets, so all of this translates into interactions across scales. The key idea is that you can understand this very complex probability distribution if you can understand the conditional probabilities of the interactions across scales. That's the decomposition.

One of the surprises, and again this appears indirectly in Wilson's calculation, is that these conditional probabilities, instead of being global, which would lead to the curse of dimensionality, are local. If you want to understand the conditional probability of the high frequencies given the low frequencies, you don't need to look at a very large neighborhood; and that is what Eero Simoncelli was referencing: you can get very local conditioning, so that you no longer have the curse of dimensionality. More than that, the field is totally non-stationary, but the non-stationarity is only in the low frequencies. The conditional probability can be stationary, which is why you can implement it with a convolutional network, and local.

Before going to convolutional networks: this is a harmonic analysis conference, so let's look for once at a Fourier transform. Here is an image, and here is its Fourier transform. A wavelet transform, if you draw it as a neural net, computes the low frequencies and all the high frequency bands that you see here by convolutions: next layer, next layer. Now, when you look at these images of a boat, you can see that they are very dependent across scales. The edges are essentially the same, because they correspond to the geometry, the contours of the boat. So each scale is very dependent on the others. However, and this is the difficulty, and the reason this problem of scale dependence had not been solved until these neural nets came out, this dependence is totally nonlinear. Why is it nonlinear? Because if you look at the correlation of a wavelet coefficient at one scale with a wavelet coefficient at another scale, whatever the positions, it is always going to be zero, because of the phase fluctuations: these coefficients live in different frequency bands, so their correlations are zero. So if you want a correlation which is nonzero, you need to apply a nonlinearity that kills the phase; the nonlinearity can be a rectifier, or, if you want to make it simpler from a math point of view, you can take the modulus, and then look at these correlation matrices. These correlation matrices are no longer zero, so you can capture these famous conditional probabilities across scales.

However, these matrices are huge, because you have to compute them for every scale and every position; but if you apply another transform, another wavelet transform, you basically nearly diagonalize these covariance matrices. What does that mean? It means the only thing you need to do is compute the correlations across the channels, and you get a correlation which is stationary, therefore a 1x1 convolutional kernel. So the only thing you need is to compute these covariances along the channels, if you use a ReLU or a modulus and, let's say, a skip connection. Okay, so that's the kind of work that was carried out with many of you.
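A minimal illustration of this phase-killing effect, as a 1-D toy with Gabor-like band-pass filters (the signal, filters and bandwidths are made up for this note, they are not the wavelets or code used in this work): coefficients in two disjoint frequency bands are essentially uncorrelated, while their moduli are not.

    import numpy as np

    rng = np.random.default_rng(0)

    # 1-D toy "image": a few sharp transients (edges) at random positions.
    n = 4096
    x = np.zeros(n)
    x[rng.choice(n, 20, replace=False)] = rng.standard_normal(20)

    def bandpass(sig, center_freq, bandwidth):
        """Filter with a complex Gabor-like band-pass filter built in Fourier."""
        freqs = np.fft.fftfreq(len(sig))
        gain = np.exp(-0.5 * ((freqs - center_freq) / bandwidth) ** 2)
        return np.fft.ifft(np.fft.fft(sig) * gain)

    # Two "wavelet" bands, one octave apart, with essentially disjoint supports.
    w1 = bandpass(x, 0.10, 0.0125)
    w2 = bandpass(x, 0.20, 0.0250)

    def corr(a, b):
        a = a - a.mean()
        b = b - b.mean()
        return np.real(np.vdot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Raw coefficients live in different frequency bands: correlation ~ 0.
    print("correlation of raw coefficients:", corr(w1, w2))
    # The modulus removes the oscillating phase: the correlation becomes clearly
    # nonzero, because both envelopes peak at the same transients.
    print("correlation of moduli          :", corr(np.abs(w1), np.abs(w2)))

The second number is large because the envelopes of both bands peak at the same edges; this cross-band dependence of the moduli is what the covariances across channels are meant to capture.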
In fact, one of the people who carried out that work, Rudy Morel, is here, and the idea was: can we do what Wilson couldn't do, meaning, can we now deal with very complex fields such as these turbulent fluids or this kind of cosmic web? The answer is yes: with only one image and about 500 coefficients, not only do you get good-looking images, but these are maximum entropy models, so maximum variability is guaranteed, and they reproduce third and fourth order moments, so you really have models.

But that works for this kind of unstructured problem; what about image classification? And that's where I come to the bet. If you want to do classification, the first idea, and that was initiated by Joan, is to do that kind of cascade of transformations; and since we know that dealing with the correlations across scales can be useful, we can also try to understand the properties that should be implemented across the channels. We can think about that, for example, in audio, or in different kinds of signals, and there was some very beautiful work done by Joakim Andén and Vincent Lostanlen; in the case of audio it works pretty well.

Now, there was this bet, so let me come back to it. The numbers were: the scattering transform at about 20% error on CIFAR-10 where a ResNet is at 8%, and at about 50% on ImageNet where a ResNet does four times better. So you have a huge gap, and that's where I think tables are interesting, because they ask deep questions. When you have such a gap, it means you are really missing something huge; it's not just a detail. So the question is: what is it? As Edouard said, I took a guy who is brilliant, very good and courageous, and said: please save the bet. That was his mission, and the conclusion was basically: too bad for you, you have to pay. So I had to pay, but at least I had a great question: what are we going to put in this hole? And two very brilliant students, Florentin Guth and John Zarka, came and said: okay, let's learn. What they proposed was to learn only the 1x1 convolutions here, keeping the wavelets in space, so the spatial filters are not learned. Is it enough to learn only the 1x1 convolutional filters? The answer is yes. So you're very happy, you fill up the table, you get the right percentages; the problem is that you've lost the math, because now you've learned, but what is it that you've learned? As I said before, when you are in a sparsity-and-wavelets state of mind, what you try to find are groups, convolutions along groups, and you try to interpret these linear filters as convolutions along groups. So we looked into the weights, and when you want to find something you always find it: of course we found a Fourier transform along the channels. The only problem is that we only found it in the first layer; beyond the first layer, the math was gone. That was the fault of the network, of course.
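A minimal sketch of that idea, in which the spatial filters stay fixed and only the 1x1 channel-mixing convolution is learned (the filters below are simple gradient filters standing in for a real wavelet filter bank, and the class is a placeholder for illustration, not the architecture that was actually trained):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FixedWaveletBlock(nn.Module):
        """Frozen spatial band-pass filters followed by a learned 1x1 convolution."""

        def __init__(self, in_channels: int, out_channels: int):
            super().__init__()
            # Fixed 3x3 oriented filters (horizontal / vertical gradients) applied
            # depthwise to every input channel: the spatial filtering is not learned.
            gh = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]) / 4.0
            gv = gh.t()
            filters = torch.stack([gh, gv]).repeat(in_channels, 1, 1)
            self.register_buffer("filters", filters.unsqueeze(1))  # (2*in, 1, 3, 3)
            self.in_channels = in_channels
            # The only learned parameters: the 1x1 convolution across channels.
            self.mix = nn.Conv2d(2 * in_channels, out_channels, kernel_size=1)

        def forward(self, x):
            # Depthwise convolution with the frozen filters.
            y = F.conv2d(x, self.filters, padding=1, groups=self.in_channels)
            y = torch.relu(y)   # pointwise nonlinearity that removes the phase/sign
            return self.mix(y)  # learned channel mixing (the 1x1 convolution)

    # Example usage: only block.mix has trainable parameters.
    block = FixedWaveletBlock(in_channels=3, out_channels=64)
    out = block(torch.randn(8, 3, 32, 32))   # -> (8, 64, 32, 32)

Gradient descent only ever updates the 1x1 weights; the spatial filtering remains a fixed prior, which is the sense in which the spatial filters are not learned.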
But okay, after a while you have to try to find some new ideas and models. When you look at the literature, there are some very beautiful things that have happened in the last three, four, five years. Let me come back to a one-hidden-layer network. A one-hidden-layer network, fully connected: you have your input, your hidden layer here, which you can think of as a feature vector, and you have the linear weights that define the output from inner products with the weights corresponding to the rows of the operator. Now, there is one observation, which was very much mentioned by Francis Bach: the kernel, in other words the inner product of the feature vectors at x and x', has a tendency to converge when the width goes to infinity. The reason is that what you are looking at is an average, over the hidden units, of products of these inner products, and if the weights look a bit random, are somewhat decorrelated, and are identically distributed, this converges to an expected value; that was the model of Rahimi and Recht. Why random? Because, first of all, you begin with a random initialization and run stochastic gradient descent, and you can show that when the width increases, because all these coefficients can be exchanged, the weights have a tendency to converge to identically distributed weights which become nearly independent.

Now, the consequence of that is something important, which Florentin underlined: if you look at these hidden layers, which are random and change from one training to the next, you know that the kernel converges to a deterministic kernel. This kernel you can write as an inner product of deterministic feature vectors, which means that all these random layers can be viewed as rotations of a deterministic feature vector. This is important if you want to understand deep networks. Why? Maybe one of the most difficult questions about deep networks is to understand the dependency among the weights: they are random in the different layers, but how does one layer depend on the other? One answer is that they are all rotations of fixed feature vectors, and it is these rotations that specify the dependency from one layer to the next.

To understand that, there is a very nice theorem proved by Florentin, which basically says: at the first layer, obviously, this is deterministic, it's the data, so phi of x is the identity. If you suppose the property to be true at layer j-1, in other words that it is some rotation of a fixed deterministic feature vector, and if the weights are themselves a rotation of i.i.d. feature vectors following a fixed law, then you can show that, as the width grows, the next layer is itself a rotation of a fixed feature vector; in other words, the kernels converge. Now, is that true? You can go back to the network and verify it. If you go, and that's the work of Florentin, Gaspar Rochette and Brice Ménard, to neural nets trained on CIFAR and look at all the layers, you can see that indeed, modulo rotations, all the activation layers converge to a fixed layer, up to a random rotation. The question then is: what is this fixed, deterministic layer? In that framework, another way to say it is: what is the probability distribution of the random features?

Now, another observation, at least on CIFAR, is that initially the weights are Gaussian, and as the iterations go on, it is as if only the covariance of these Gaussians changes, and it changes a lot; if you whiten the weights, you almost go back to the initialization. This has an important implication: the weights are almost Gaussian, conditionally on the previous layer. How do you verify that? You whiten the weights and look at the distribution of the eigenvalues of their covariance matrix: it should converge to a well-known distribution, the one you get for an i.i.d. Gaussian matrix. For the early layers it fits very nicely, and as you go on it remains almost true, up to some outliers, about 10% here, which can have an important role; I'll come back to them.
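That check can be illustrated on a purely synthetic matrix (a toy, not the code used on the trained networks): an n x p matrix of i.i.d. Gaussian entries is what whitened weights should look like if they really are conditionally Gaussian, and the eigenvalues of its empirical covariance follow the Marchenko-Pastur law.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic "whitened weight matrix": n rows (neurons), p columns (inputs),
    # i.i.d. standard Gaussian entries.
    n, p = 4000, 1000
    W = rng.standard_normal((n, p))

    # Eigenvalues of the empirical covariance W^T W / n.
    eigvals = np.linalg.eigvalsh(W.T @ W / n)

    # Marchenko-Pastur density for aspect ratio q = p / n and unit variance.
    q = p / n
    lam_min, lam_max = (1 - np.sqrt(q)) ** 2, (1 + np.sqrt(q)) ** 2
    grid = np.linspace(lam_min, lam_max, 400)
    mp_density = np.sqrt((lam_max - grid) * (grid - lam_min)) / (2 * np.pi * q * grid)

    # Compare the empirical spectrum to the theoretical one.
    hist, edges = np.histogram(eigvals, bins=50, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    print("theoretical support:", (lam_min, lam_max))
    print("empirical support  :", (eigvals.min(), eigvals.max()))
    print("mean |histogram - MP density|:",
          np.mean(np.abs(hist - np.interp(centers, grid, mp_density))))

On real trained weights, the few eigenvalues that escape this bulk would correspond to the outliers mentioned above.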
But then you have this model, so let me give it. The mathematical model is now the following: at each layer, the weights are Gaussian random variables, each row of the matrix is independent, but they are conditionally Gaussian. Why? Because the covariance matrix is a rotated covariance: you have a fixed covariance, and the rotation comes from the previous layer; the previous layers have been rotated, so they rotate the next layer. The only thing that is random here is, of course, the Gaussian random variable itself, and the covariance depends on the previous layers through these rotations. The consequence, and I suppose this has some relation with the first slide of Shihab, who opened the conference, is that it converges to a rainbow. Why a rainbow? Because at each layer these random weights converge, as the width increases, to a fixed feature vector. And how is this fixed feature vector computed? You have the square root of a covariance matrix, which in fact does a dimension reduction, and then you have the random features, which have been whitened; these random features define what is called a dot-product kernel, a kernel that depends on its entries z and z' only through their inner product. So you have the square root of a covariance, a fixed feature vector of a dot-product kernel, another covariance, and up it goes; and what changes from one training to the other is just the covariance matrices. It's just the colors of the rainbow that change when, within such a model, you do one training or another. And the function space here, which is a reproducing kernel Hilbert space, can now be entirely specified mathematically and is parameterized by the covariances.

Okay, so does it work? Back to the numerics. If it works, it means that once you've learned one network, you can estimate the covariances and create zillions of networks without learning, just by re-sampling your Gaussian matrices: you get totally different weights, and it should give the same performance. So you try: that's the original performance, that's the gap with no more learning, you just re-sample, and here you lose a bit, about 3%. Now, when you see a result like that, and you are used to reading papers in image processing or machine learning, you may say: hey, why do you show CIFAR-10 and not ImageNet? The answer could of course be "oh, I didn't try" and so on, but nobody would trust me if I said I didn't try. It gets more problematic. It gets more problematic because we can see, here in the layers, these outliers; on CIFAR these outliers are not so bad, it works pretty well, but they begin to look more difficult on ImageNet. But now you can do math and ask: what is the mathematical nature of these outliers? They look like what one would think of as some kind of fixed features, or whatever; interesting questions to think about. So, great, that's not the end of the story, and that's where I'll end this part. For me, however, there is something a bit new here, because I was always thinking in deterministic terms, and compressed sensing, random projections, whatever, came into the picture; so we have a new set of math problems.

I'd like to finish by thanking all of you, and especially, in some sense, the closer family, namely Ruzena, Yves, and all the students who gave me so much energy and so much pleasure in doing research, and all of you, siblings, who came. Thank you very much for coming to these three days. Thanks very much.
I mean, the conference is over; we just want to thank everyone for coming. I think for us it has been super fun, you know, it has been a pleasure to see your family, all the three generations we have seen, and maybe more. Happy birthday! Someone has to decide to be the next one, I hope in Israel maybe; if you want to organize a birthday, send him an email. Yeah, these are professionals now, you can almost spend your life doing that. Okay, bye everyone, thanks, goodbye.