So, I'm not going to be very technical, but setting these preliminaries aside, I'll be speaking about high-dimensional learning and data analysis. I'll try to present the basic problems, but also some very interesting advances that have occurred over the last five to ten years in relation with deep networks, which are raising pretty fascinating mathematical issues, and also specific work of people in the group at École Normale Supérieure; there is a whole group of people that I'll be mentioning along the way. So, let me first begin with the typical problem that we're facing, and I'll begin, for example, with a problem of image classification. In classification in general, you typically have a very high-dimensional vector X. Think, for example, of X as being an image, in which case there are, let's say, typically a million pixels, so it's an element of a space of dimension one million. And in classification, what you want to estimate, to approximate, is a function f of X, which associates to your data, for example, a label. In the case of image classification, the labels here correspond to anchors, Joshua trees, beavers, lotuses, water lilies. The big, big difficulty is that you are in very high dimension, and you have to estimate a function in very high dimension. So, what do you have to do that? You have examples: a set of Xi, which are examples of images, and for each of these images you are given the label, so the value of f of Xi. And now the problem is: you have a new X, and you have to compute f of X. As you can see, one of the big difficulties is due to variability. Within the class of anchors there is considerable variability, same for Joshua trees, beavers, and so on. And this variability is intrinsically due to the fact that you are in very high dimension. That will be, again, at the root of everything. So, let me give a very different example, which is audio classification. For example, here you have a sound coming from an accordion, and here a completely different sound. You have to recognize that it's the same instrument, despite the fact that you have a huge variability within the data, which is typically, again, several million points. If the problem is instrument classification, you want to find that this second sound is much closer to this one, in some sense, than to these two others. Now, you may be asked a totally different question, which is: what's the musical content? And if the question is the musical content, an accordion and a violoncello, a cello, may be much closer from the point of view of this question, if you don't care anymore about the timbre and which instrument is playing. So, the big difficulty is that we will be facing classification problems of very, very different sorts, and the question will be how to prepare yourself to answer that kind of question. The third class of examples I'll be looking at is physics. Can we learn physics? Learning physics in this framework means you have X, which describes the state of your system, and f of X, which you should think of as the energy functional. If you can compute the energy for any state, you can access the forces by computing derivatives, so you have access to most of the physical properties of your system. And the question is: can you learn this physical energy functional without any prior information, or very little prior information, besides a certain amount of data?
So, X, for example, may correspond in the case of astronomy to the distribution of masses in, let's say, a galaxy, or, if you go to the infinitely small, to the distribution of charge in a molecule. We'll look in particular at the issue of quantum chemistry, and I'll be showing that it is indeed possible to do that kind of learning, which gives a slightly different view of how you can approach physics. Okay, so why are these problems so difficult? The one simple reason is again related to dimensionality and the so-called curse of dimensionality. So, what's the problem? If you look at the problem from a slightly naive point of view, you can view it as a simple interpolation problem. You know, for a certain number of Xi, the value of f of Xi; you have a new X; and the immediate idea that comes to mind is to look at all the points which surround X and try to interpolate the value of f at X from the values at the neighbors. That will indeed give you a good approximation if f is regular and if you have close neighbors. And that's the problem: you will never have close neighbors. Why will you never have a close neighbor? Look at the unit cube [0,1]^D in dimension D. Suppose, for example, you sample this cube at a distance epsilon. How many points do you need to cover your cube? Obviously the number of points is going to grow like epsilon to the power minus D. Think of epsilon as being 1 over 10; 1 over 10 is not very small. Take D to be 100, which is pretty small. Then 10 to the power 100 is more than the number of atoms in the universe. So there is no way you'll be able to sample your cube, especially given that D is not going to be 100; it's rather going to be 1 million or even more. So you have to think of your data as completely isolated points in your space; you have a very, very coarse sampling. And that means that for any x, you will have no xi which is close, as measured with a Euclidean norm. These two images belong to the same class, but if you subtract them, obviously that gives you no information about the class; this one and this one may be much closer in Euclidean distance than this one and this one. So the problem you are facing is essentially to learn an appropriate metric. You are in a situation where your data, suppose for example you have two classes here, the blue and the red points, are completely spread out in your space; there is no obvious order. On the other hand, because you do have a classification problem, which tells you that the blue points form one class and the red points another, you know that there exists some underlying metric, delta, such that the red points are closer to one another than they are to the blue points. So the typical approach in learning will consist in trying to find a representation, a mapping phi, such that, by miracle, once you are in this new space H — think of phi of x as being a big vector of descriptors — your original data X is now represented by a set of features, typically a huge number of features. And what you hope, naively, is that in your new space a simple Euclidean distance will give you a good estimate of this very complicated similarity metric delta. In other words, your similarity metric becomes essentially equivalent to a Euclidean distance. This is called a Euclidean embedding of a metric.
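To make the covering-number argument from a moment ago concrete, here is a quick back-of-envelope sketch in Python; it is purely illustrative and just evaluates the epsilon-to-the-minus-D count quoted above.

```python
# Back-of-envelope check of the curse of dimensionality: the number of
# grid points needed to cover the unit cube [0,1]^D at spacing eps
# grows like eps**(-D).

def covering_number(eps: float, d: int) -> float:
    return eps ** (-d)

for d in (2, 10, 100):
    print(f"D = {d:3d}: about {covering_number(0.1, d):.0e} points")
# D =   2: about 1e+02 points
# D =  10: about 1e+10 points
# D = 100: about 1e+100 points -- more than the number of atoms
#                                 in the observable universe
```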
You have a metric, and you find a transformation such that suddenly, in your new space, this metric becomes a Euclidean metric. Another way to view it, from a probabilistic point of view, is that you can view these points, class 1 and class 2, as realizations of random processes. These random processes are absolutely not Gaussian. And when you map them into this space, they are going to regroup. The fact that a Euclidean distance describes the notion of similarity means that you can essentially describe these distributions as Gaussian distributions. So, by miracle, you have found a mapping which transforms your distribution into a Gaussian distribution without losing the crucial information. Now, what is the crucial information? It is the ability to separate the two distributions, to find out whether a point belongs to this class or that class. In that context, a simple classifier will just be a hyperplane separating two Gaussians. So the final stage, which consists in finding the optimal classifier, may be computationally very heavy and require heavy optimization, but it's conceptually very simple: it's about fitting a linear classifier. The real difficulty behind all of this is to understand how to build this feature vector, how to build this Euclidean embedding of a potentially unknown and complicated distance. Okay, so the problem of mapping a metric into a Euclidean space is not new in math; in fact it has a long history, but the frameworks in which it has been solved are very different from the framework in which we are working. The first type of result, which is typically well known and not so complex, is when you know that your data belongs to some low-dimensional manifold. Yes, potentially your data lives in high dimension, but it may regroup on a low-dimensional manifold. If that's the case, then the only thing you need to represent is the manifold, and that means mapping the metric over the manifold into a Euclidean space, and that we know well how to do. You can build a diffusion kernel over the manifold, basically a Gaussian function of the Euclidean distance, and if you take the Laplacian induced by this diffusion and compute its eigenvectors, you get a mapping into a Euclidean space. Now, this case is very far from the situation we are in. Why? Because the data does not belong to a low-dimensional manifold; that would mean the data is driven by a few parameters. If you take textures, typical scenes, sounds and so on, the number of variables is in the tens or hundreds of thousands, so the data does not belong to any low-dimensional structure.
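As an aside, here is a minimal sketch of such a diffusion-map style embedding, assuming a Gaussian kernel and a toy circle dataset; the function name and parameter values are placeholders for illustration, not a reference implementation.

```python
import numpy as np

def diffusion_map(X, eps=1.0, n_components=2):
    """X: (n, d) data. Returns an (n, n_components) diffusion embedding."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    K = np.exp(-sq / eps)                                # Gaussian affinity kernel
    P = K / K.sum(axis=1, keepdims=True)                 # Markov diffusion matrix
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    idx = order[1:n_components + 1]                      # skip the constant eigenvector
    return (vecs[:, idx] * vals[idx]).real               # diffusion coordinates

# Example: a circle (a 1-d manifold) noisily embedded in 10 dimensions
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
X = np.stack([np.cos(t), np.sin(t)]
             + [0.01 * np.random.randn(200) for _ in range(8)], axis=1)
Y = diffusion_map(X, eps=0.5)   # recovers the circle's intrinsic geometry
```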
The second kind of result that is well known for embedding metrics is when the number of points you have to embed is fixed, let's say n points. In this case you can build a Euclidean embedding of essentially a graph, in which the distances within the graph become Euclidean distances in your space. These are standard results by Bourgain and by Johnson and Lindenstrauss. Why does that not work in our case? It doesn't work because the embedding is adapted to the points you are considering: if you have a new point, you have no guarantee that the embedding will be appropriate for it. In other words, there is no guarantee that it's going to generalize. To guarantee generalization, you need some kind of local regularity between your metric and the mapping. And that's where we are going: not embedding a few points, but embedding a very high-dimensional space with some prior information on regularity. Okay, now to the other front. On the algorithmic front, a lot of things have evolved. You have this very classic idea, developed since the fifties, of trying to represent data with a neural network. These things never really worked very well until the last five years, when very, very spectacular results were obtained with these so-called deep networks. In particular, two people have been working a lot around that: Yann LeCun and Geoff Hinton. The idea is the following. You have your vector X, and a feed-forward neural network can be described this way: you apply a linear operator R1, which can be a convolution operator, and then you apply a nonlinearity to each of the values output by your linear operator — for example an absolute value or a sigmoid, something which just acts on each point individually — and that is called a neuron. And then you iterate: you reapply a linear operator and your nonlinearity, and you cascade until the point where you get your representation phi of x. Then, on phi of x, you just apply a simple linear classifier. Now, you have a huge number of parameters in this network, because each of these linear operators is potentially freely chosen; the number of parameters is the number of entries in each of these matrices. Typically, in such a network, it can be of the order of several billion parameters. How do people learn these parameters? They have examples, and they measure the classification error on those examples. Then they backpropagate the error in order to optimize the values of the linear operators. This backpropagation is essentially gradient descent. There is no guarantee of convergence, but the miracle is that it does converge to a reasonable solution. And not only a reasonable solution: as I said, the results are now quite exceptional. Face recognition is now done better by machines than by the human visual system. We now have cars which drive in the streets, avoid people, stop and so on, without any driver, because of that kind of visual system. Google apparently has over 25% of all its computational resources running on these things, which apparently means more than one nuclear plant is used just to power that kind of computation. Facebook uses this technology across its products, whether translation or face recognition, and voice analysis for Apple uses the same kind of technology. So it's not just an academic result; it's something which is very widespread. And the question is: why does it work so well? Somehow these things seem to be doing a very reasonable embedding, in the sense that it works well. Mathematically, the questions are very open. And that's what I would like to speak about. How come this kind of structure provides appropriate embeddings? What I would like to show is that it's very much related to the ability to build invariants, stable invariants, over very large groups. We'll see relations with multi-scale transforms, and natural reasons why it's very natural to build such networks.
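For readers who like to see the structure in code, here is a minimal sketch of the cascade just described, with random matrices standing in for the learned operators; in practice the operators are often convolutions and are learned by backpropagation, which this sketch does not attempt.

```python
# Minimal sketch of the feed-forward cascade: phi(x) = rho(R_K ... rho(R_1 x)),
# with R_j linear operators and rho a pointwise nonlinearity (here, |.|).
import numpy as np

rng = np.random.default_rng(0)

def deep_representation(x, operators, rho=np.abs):
    """Cascade linear operators with a pointwise nonlinearity."""
    for R in operators:
        x = rho(R @ x)          # linear step, then neuron-wise nonlinearity
    return x                    # phi(x), fed to a linear classifier

# Toy network: three random linear layers (stand-ins for learned convolutions)
dims = [1000, 500, 250, 100]
ops = [rng.standard_normal((m, n)) / np.sqrt(n) for n, m in zip(dims, dims[1:])]

x = rng.standard_normal(1000)   # stand-in for a million-pixel image
phi_x = deep_representation(x, ops)
```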
So we'll look at different kinds of applications: on images, and I'll show applications also on audio signals. But I will also show that this kind of structure brings a very different type of model for random processes. That's particularly important when you're modeling phenomena with highly non-Gaussian processes, which often happens in physics — turbulence, for instance — though here we'll rather be looking at quantum chemistry. And I'll finish on learning physics and how physical properties potentially relate to this kind of representation. Okay, so let me begin the analysis with a very simple case. Suppose you have an image of a digit like this. You want to recognize that this digit is closer to this one, and that these two digits are farther apart. A very natural distance that has been analyzed for a long time in vision, but not only in vision, is the idea of deformation. You say that these two digits are much closer because the deformation which takes this one to this one is smaller than the one taking this to this. That's something which is used all over classical mechanics. So you are going to look at deformations: the action of a diffeomorphism, which is basically going to warp the space to map this into this. And the natural distance that comes with it is: what's the distance between x and x prime? If you deform x into x prime, you can measure the residue, and you want to see how big the deformation is. And how big the deformation is, is basically the size of the derivative of tau. If tau is a constant, it's a simple translation; there is no deformation; the two objects are identical. If there is a warping, the maximum of the derivative of tau gives you the size of your diffeomorphism. A question from the audience: why is the distance proportional to the size of the diffeomorphism? In this case we impose it to be proportional; you could have squares instead; that's a matter of homogeneity conditions. I've put it this way because that's what people use, but the property that you really want — and you can phrase it in many different ways — is a representation which is Lipschitz continuous relative to this diffeomorphism metric; that will be the key property. In particular, obviously, the representation is invariant to translation: an object translated is identical to itself. Now, why is that kind of metric much too naive? If you go to really high-dimensional structures, like textures for example, you'll consider these two textures as visually identical, although obviously you cannot deform one into the other. Which means that you also want, within these very high-dimensional models, to be able to define distances between realizations of stationary random processes. And in particular, if two realizations come from the same — in this case stationary ergodic — process, then you'd like to say that the distance between them is very small. So underlying this problem is the issue of defining distances on non-Gaussian processes, and if you think of it in terms of representations, it means taking these processes and mapping them into a Gaussian process, so that a simple Euclidean distance then gives you an appropriate measurement of similarity.
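Returning to the deformation metric introduced a moment ago, here is a small illustrative sketch in one dimension, assuming a displacement field tau sampled on a grid; it warps a signal and measures the deformation size by the maximum of |tau'|, as in the definition above.

```python
# Hedged sketch of the deformation model: warp a 1-d signal x(t) into
# x(t - tau(t)) and measure the deformation size by sup_t |tau'(t)|.
import numpy as np

def deform(x, tau):
    """Warp signal x by displacement field tau (both sampled on a grid)."""
    t = np.arange(len(x), dtype=float)
    return np.interp(t - tau, t, x)   # evaluates x at the warped positions

t = np.arange(256, dtype=float)
x = np.sin(2 * np.pi * t / 32)

tau = 3.0 * np.sin(2 * np.pi * t / 256)        # smooth, small displacement
x_def = deform(x, tau)

deformation_size = np.max(np.abs(np.gradient(tau)))  # sup |tau'|, here ~0.07
# A constant tau (a pure translation) has zero derivative: no deformation.
```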
Okay, so let me put up the list of properties that we want. The first one is that we want, obviously, something which is stable to additive perturbations. In other words, phi of x minus phi of x prime should be bounded by x minus x prime. This is completely obvious. You want something which is invariant to translation: an object and the translated object are identical, so if you take an image and just translate it, you want the representation of the translated image to be identical to the representation of the original image. The real difficulty will be stability to deformations. And what I want to show is that this property will imply essentially all the structure of these deep networks. What you want is that if you have a signal — whether an image, an audio signal, or more general cases as we'll see — which is deformed, then the distance between the representation of x and the representation of x deformed should be of the order of the deformation. If the deformation is small, the two signals should be considered close; if the deformation is big, the distance can potentially be big. Okay? A question from the audience: by translation, for images, do you mean a shift, or increasing the scale? No, it's a translation of the argument u, a geometric translation; a scaling would be a multiplication by a constant. Yes, but then you have issues with the fact that your images are finite. Here I'm in L2 and have no boundary; you could work with periodic boundaries, but here I'm on an infinite domain; basically, these are functions in L2. Now, one of the very natural tools to deal with this kind of thing is the Fourier transform. Why? Because the Fourier transform is unitary, and the modulus of the Fourier transform is invariant to translation. So why does nobody use the Fourier transform in recognition? Because it's not stable to deformations. If you slightly deform a function, the high frequencies move very much, and if you compute the Euclidean distance on the modulus of the Fourier transform, the two may look very different although the deformation is very small — an issue which is well known in many problems in mathematics. Okay, when you have a problem related to instability to deformations, what you need to do is separate scales. You need to separate the variability at different scales, and the way we are going to do it here is with the so-called wavelet transform. The idea is the following: we are going to separate the information of your function into different frequency bands with filters. The filters are wavelets. You take a wavelet psi and dilate it, so this is a wavelet which is compressed or dilated, and in the Fourier domain a wavelet, in signal processing terms, can be interpreted as a filter; when you dilate it, you cover different frequency bands. In signal processing these are called constant-Q filter banks. Then what you do is take your function X and extract the component within each of these frequency bands. This is also called the Littlewood-Paley wavelet transform, which dates from the 1930s in mathematics, and at the very low frequencies what you extract is basically the average. So you take your function and separate components appearing at many different scales. And if you design your filters appropriately, you get energy conservation; it's very simple to verify.
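Here is a minimal numerical sketch of such a filter bank, with Morlet-like Gaussian band-pass filters; the particular filter shapes and constants are illustrative choices, and the energy-conservation identity holds only approximately for them.

```python
import numpy as np

N = 1024
w = np.fft.fftfreq(N) * 2 * np.pi                    # frequency grid

def psi_hat(w, xi=3.0, s=0.6):
    """Band-pass filter centred at frequency xi (Morlet-like, in Fourier)."""
    return np.exp(-((np.abs(w) - xi) ** 2) / (2 * s ** 2))

wavelets = [psi_hat(w * 2 ** j) for j in range(6)]   # dyadic dilations
low_pass = np.exp(-w ** 2 / (2 * 0.3 ** 2))          # the final averaging filter

# Decompose a signal into its frequency bands plus an average
x = np.random.randn(N)
x_hat = np.fft.fft(x)
bands = [np.fft.ifft(x_hat * f).real for f in wavelets]   # x * psi_j
average = np.fft.ifft(x_hat * low_pass).real              # x * phi

# Energy conservation corresponds to |phi_hat|^2 + sum_j |psi_hat_j|^2 ~ 1;
# with carefully normalised filters this Littlewood-Paley sum is ~constant.
lp_sum = low_pass ** 2 + sum(f ** 2 for f in wavelets)
```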
Now what about images? In images, we'll do exactly the same thing. The idea is that for an image, to build a stable representation, we'll need to separate scales. But in images you also have orientation. So you take a wavelet which is now a two-dimensional function. Think of it as a Gaussian modulated by a sine wave. You have the cosine part and the sine part; you get your cosine part which gets smaller and smaller, that's the wavelet, and then you change the orientation of the sine wave with a rotation, and you get wavelets with different orientations. Then you do the same thing. In the Fourier domain, this is a wavelet, a Gaussian centered at a certain frequency; when you rotate it, you cover a frequency annulus, and when you dilate it, you get all the frequency bands. Then you do the same thing as in one dimension: you make a separation into each of these frequency channels with a convolution with your wavelet, and you have the averaging of your function at very low frequencies. And, same thing, you conserve the energy. Okay, so how is this related to our original problem? Let me show you an example first, before arguing about that. This is an example image. These are, let's say, the details at different orientations at very high frequency. This is the low-frequency part of the image. This low-frequency part is split again into details at different orientations and an even lower-frequency image, which is split again and split again. That's a standard wavelet transform, with different components at different orientations. Now, the only component which is going to be invariant to translation is this one: the average of the image. All the rest is not. And that's the big problem: how to make everything invariant to translation.
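Here is a hedged sketch of how such an oriented two-dimensional filter bank can be generated; the wavelet parameters are illustrative, not the ones used in the experiments shown.

```python
import numpy as np

def morlet_2d(size, scale, theta, xi=3 * np.pi / 4):
    """2-d Morlet-like wavelet: a Gaussian envelope modulated by a wave,
    rotated by theta and dilated by `scale`."""
    g = np.arange(size) - size // 2
    X, Y = np.meshgrid(g, g, indexing="ij")
    u = (np.cos(theta) * X + np.sin(theta) * Y) / scale    # rotated, dilated axis
    v = (-np.sin(theta) * X + np.cos(theta) * Y) / scale
    envelope = np.exp(-(u ** 2 + v ** 2) / 2)
    return envelope * np.exp(1j * xi * u) / scale ** 2     # cosine + i*sine parts

# Filter bank: 3 dyadic scales x 4 orientations, as on the slides
bank = [morlet_2d(64, 2 ** j, k * np.pi / 4)
        for j in range(3) for k in range(4)]
```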
So let's look back at the problem. In one dimension, you have a function like this. If you want to make it invariant to translation with a linear operator, you don't have a choice: the only thing you can do is average. If you average over a domain of size 2 to the j with a kernel like this, you get a regular function which is almost invariant to small translations, small relative to 2 to the j. If you want complete invariance to translation, you average from minus infinity to plus infinity, and wonderful, you have an invariant quantity. But you've lost everything. That's the issue when building invariants: the difficulty is to make sure your set of invariants is sufficiently rich that you don't lose too much information, and with a linear operator the only thing you get is the average. Where did you lose the information? In the high frequencies. The high frequencies are highly oscillating, so when you average them you just get zero; nothing to recover. If you follow the same strategy in Fourier, you kill the phase. So take the modulus: what you get is an envelope like this. Now, this envelope is not invariant to translation: if x translates, the envelope translates. How can you make it invariant? You average the envelope, and you have a new set of invariants. However, you've again lost information: the high frequencies of the envelope. And where is that information? You can recover it by taking your envelope — the average of the envelope is in red — and computing the high frequencies of the envelope with its wavelet transform. And what are you doing then? You are cascading this operator. The wavelet transform each time computes the invariant, the average, and the next layer of high-frequency coefficients. And this is how it looks: you begin with x; first you have your invariant, by averaging, and then all the lost information in the next layer, separated into different scales. Each of these images, we are again going to make invariant: you get an invariant here and the next layer of wavelet coefficients. How can you interpret these coefficients? The first layer of wavelet coefficients, you can view as interactions between the different components of the image, because you send a wave and each component of the image interacts through this wave. These are interactions of interactions. Again, you average them with the next wavelet transform, which gives your new set of invariants, and the next layer. What you've built is a neural network. You need the nonlinearity in order to capture the remaining information, and now the question is: what are the properties of these things, and how can they be useful for classification? How did we build this representation, which I'll call a scattering transform? It's a scattering transform because you scatter the information through a very large network. You build it by iterating linear operator, modulus, linear operator, modulus. Your linear operator preserves the norm; the modulus is contractive; in fact you are just applying contractive operators one after the other. The first result you get is that, because you cascade contractive operators, the resulting operator S is contractive: the distance between Sx and Sy is smaller than the distance between x and y. So you get your stability. You can in fact prove — and that's more subtle — that all the energy of your signal is preserved by this representation. So, exactly like a Fourier transform, all the information is within the invariants. However, what you have here and don't have with a Fourier transform or any standard representation is Lipschitz continuity to diffeomorphisms, and that's the key property. If your signal is deformed, by any arbitrary deformation, and you look at your representation with a simple Euclidean metric, the distance in that space is going to be of the order of the deformation. That guarantees that whenever structures are deformed they will look similar, and you'll be able to classify them easily from that. A comment here: you need wavelets, because the property behind this stability is that your transform has to almost commute with deformations — the commutator has to be not too big — and you can prove that this essentially requires scale separation.
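To fix ideas, here is a minimal one-dimensional, order-2 scattering sketch following the cascade just described — linear operator, modulus, average, iterated. The filters are the same illustrative Gaussian band-passes as before; this is a toy sketch, not a production implementation.

```python
import numpy as np

def band_pass(N, j, xi=3.0, s=0.6):
    """Fourier transform of a dilated band-pass wavelet psi_j."""
    w = np.fft.fftfreq(N) * 2 * np.pi
    return np.exp(-((np.abs(w * 2 ** j) - xi) ** 2) / (2 * s ** 2))

def wavelet_modulus(x, j):
    """|x * psi_j|: linear filtering followed by the modulus nonlinearity."""
    return np.abs(np.fft.ifft(np.fft.fft(x) * band_pass(len(x), j)))

def scattering(x, J=5):
    S = [x.mean()]                                    # order 0: plain average
    for j1 in range(J):
        u1 = wavelet_modulus(x, j1)                   # first envelope
        S.append(u1.mean())                           # order 1: averaged invariant
        for j2 in range(j1 + 1, J):                   # envelopes live at coarser scales
            S.append(wavelet_modulus(u1, j2).mean())  # order 2: interactions of interactions
    return np.array(S)

x = np.random.randn(1024)
Sx = scattering(x)    # translation-invariant, deformation-stable descriptors
```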
Let me show a first example, developed by Joan Bruna, for digit classification. The case of digit classification is a little bit ideal here, because two digits are similar essentially if they differ only by position or a slight deformation. So what you do is take your digits — this is a well-known database of digits on which there have been hundreds of publications — build the scattering representation of each digit, and then apply a very simple linear classifier. If you do that, you basically get the state of the art. If you look at the performance of these deep networks, they do almost as well, but what is interesting is that they learn everything. Here, what did we say? We said: we know the source of variability; we know it's due to deformations; so we don't have to learn anything. We know that the filters should be wavelets, so we just use wavelets, we get the representation, and we get the result. They learn their filters, and that's the spectacular side, but of course you need a huge amount of computation, and the performance is almost as good, but not quite. A question: are the filters they learn wavelet-like in some sense? Yes; in fact, if you look at the way they structure their networks, they structure them as cascades of filters, in effect so as to build wavelets indirectly. There is a lot of know-how in building these networks: they don't put in wavelet filters, but they structure the network in such a way that the algorithm converges to something close to wavelet filters. Now let's look at problems of random processes. These are databases of textures, CUReT and a Berkeley database. We are going to do exactly the same thing: take a texture image, compute the scattering representation, fit a linear classifier, and do the classification. A very natural representation for classifying stationary processes is the Fourier spectrum. The Fourier spectrum, if you implement it well, gets an error of the order of one percent, and that was basically the bottleneck of all techniques up to that point. Why one percent? Because — and I'll show you examples — you may have textures which have exactly the same second-order moments but look completely different, and the Fourier transform cannot distinguish them, because it's essentially based on second-order moments. In this case, the error goes down to 0.2%, so down by a big factor. And the question is why. What are we capturing when we do that? What kind of underlying models of stochastic processes are we building? So let me go back to stochastic processes. Suppose you have a process X which is stationary. What did we do? We took X and computed convolutions with wavelets in different frequency bands, took the modulus, and then the next layer is the same thing: convolution with yet another wavelet, modulus, a third wavelet, and so on, for all possible wavelets. And at the end, we average all that with a filter. Now, if X has finite second-order moments, what you can show is that when 2 to the j becomes big, this converges to a Gaussian random vector, provided you have some ergodicity or decorrelation property. When j goes to infinity, with a little bit of ergodicity, this convolution — an averaging in time of a stationary process — converges to the expected value. So this vector converges to a set of moments. What you are really doing is representing your stochastic vector X by a set of moments, which are not first, second or third-order moments, but moments obtained with contractive operators. Why is that important? Why does nobody use high-order moments to characterize random processes in classification? Because there is too much variability: when you compute a high-order moment, you apply powers, which are dilating operators, and the variance of your estimator gets very big. In this case, everything is contractive.
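A toy check of this point, reusing the `scattering` sketch above (so run that block first): two independent realizations of the same stationary process yield nearly the same scattering moments, precisely because the estimators are time averages of contractive quantities.

```python
import numpy as np

rng = np.random.default_rng(1)

def ar1(n, a=0.9):
    """One realization of a stationary AR(1) process."""
    x = np.zeros(n)
    for i in range(1, n):
        x[i] = a * x[i - 1] + rng.standard_normal()
    return x

# Two independent realizations give nearly the same scattering moments:
# every operator in the cascade is contractive, so the time-average
# estimator has small variance.
S_a = scattering(ar1(2 ** 14))
S_b = scattering(ar1(2 ** 14))
print(np.linalg.norm(S_a - S_b) / np.linalg.norm(S_a))   # small relative error
```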
So all these numbers, you can estimate from a single realization with this averaging. Okay. Now you have your stochastic process and a family of moments. In what sense do they describe your random process? Your random process is a priori described by a probability distribution. What do you know about the probability distribution of your process? Well, you know these expected values: the expected values of X transformed by an operator U1, an operator U2, U3, which give you these coefficients. In other words, you know the projection of your probability distribution on each of these transformed values. Now, the very natural thing to do, which has been described before, is to compute the distribution which assumes no other information — in other words, which has maximum entropy. And if X is bounded in particular, though you can relax that, p of X is just going to be a Boltzmann distribution: an exponential of a linear combination of the constraints. So you can compute p of X. The maximum entropy is systematically going to be bigger than the entropy of the true probability distribution, but if it's of the same order, then you've been able to characterize your process. A question from the audience about the sign: shouldn't there be a minus in the entropy? That's a convention which depends on the community; in probability it's usually written with the minus. Okay, now this is an example of a texture. This is the model of that texture with a Gaussian process; it's very good, because this texture is nearly Gaussian. This is what you obtain with the scattering coefficients — I just take the first two layers of the network, first-order and second-order wavelet coefficients. Very similar. Now look at this case. Here you have a very highly non-Gaussian process with a lot of geometry. This is what a Gaussian process gives you. This is a realization obtained from these scattering moments, and what you see is that you capture the geometry, because you have all the interactions between all scales and all angles. This is a third example: this is the original; this is a realization of a Gaussian process; this is what you get with such a distribution. If you take turbulence — oh, that's not yet turbulence. This is a very sparse process. These two processes have exactly the same second-order moments; just to show you that with second-order moments, beware. Okay, that's the Gaussian model; that's what you get with these models. And this is a turbulence signal. This is essentially a Kolmogorov-type Gaussian model, and that's what you get by imposing these kinds of moments. And what you can see is that you are capturing something which is called intermittency, which emerges very naturally from these networks. Let me show you the same thing with sounds. First sound. You're going to hear a sound having exactly the same second-order moments, the Gaussian model, and then what you get from the first two layers of the scattering. [The Gaussian model plays, then the scattering reconstruction.] Obviously you don't have the words, but you have the prosody. Now, does it always work? No. Let me show you another example. [Scattering reconstruction plays.] You capture a little bit, but you've lost a lot of things. Worse: "I've been a Wikipedia editor since 2003, and I'm the founder of the..." — getting pretty bad. Worse. What does that mean?
You've lost a lot of the geometry of your signal. So far we've only looked at translation, but there is much, much more geometry than translation. In the case of audio, you have structure: you have harmonic structure. It's very well known that the natural way to describe this harmonic structure is with a spiral of octaves. If you look at the fundamental, it appears here; the next harmonic is one octave above; the next one is a fifth above that. This is the harmonic spiral, which goes back to Riemann — a different Riemann, the music theorist. The next one is one octave above again. And the natural geometry in which to look at this is on the spiral. So you now want to describe the variability on this spiral, and you have to face new groups. That's what we're going to see now, and it explains why these networks get much more complex. I'm going to show it in a simple case, which is images. With images, if you just apply what I described to this problem, you don't get 0.2% error, you get 20%. Why? Because there is a lot of variability in rotation, a lot of variability in scaling, and you didn't build invariants relative to those. So the problem now is that you want to enlarge the range of invariants. You essentially want to be able to be invariant to an arbitrary group, and we're now going to look simply at the group of translations and rotations — which is not so simple, because it's not commutative, but still. You have a signal and we look at the action of this group. If you just do what I did previously, a convolution with a wavelet and an average, what happens when you translate and rotate the signal? The average does not change: good, that's an invariant. But if you look at the wavelet coefficients, they get translated, and the orientation index of the wavelet also gets shifted. In other words, the rotation can be viewed as a translation on the circle of orientations. So to become invariant to rotation, you need to become invariant to this translation along the rotation parameter. What does that mean? It means that your next wavelet should not act just on the translation group, but on the translation group together with the rotation group. So you build a wavelet which now lives on the rigid-motion group, and that we know well how to do: you can build wavelets on any group. You make a convolution with your wavelet on the translation-rotation group, and you reapply exactly the same thing. So you have your first wavelet transform — initially you just have translation, but now the rotation parameter appears — and you do a wavelet transform on the roto-translation group, the rigid-motion group, and you cascade. And now you get your invariants. If you do that on this problem, the error goes from 20% to 0.6%. Why? Because you've dealt with your sources of variability. And that's the key problem in all these high-dimensional problems: how to deal with the existing variability and build invariants which are sufficiently rich to still preserve the information.
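Here is an illustrative sketch of that mechanism, with random numbers standing in for the first-layer coefficients; the angle filter is an arbitrary zero-mean choice, the point being only that circular convolution along the orientation axis commutes with rotations.

```python
import numpy as np

K = 8                                    # number of orientations
U1 = np.random.rand(K, 64, 64)           # stand-in for |x * psi_{j,theta_k}|

# A wavelet along the rotation parameter: circular convolution over the
# K angles (here a simple zero-mean band-pass filter on the circle).
h = np.array([1.0, -0.5, 0, 0, 0, 0, 0, -0.5])
W_rot = np.fft.ifft(np.fft.fft(U1, axis=0)
                    * np.fft.fft(h)[:, None, None], axis=0)

# Rotating the input cyclically shifts the K orientation channels, and
# circular convolution commutes with that shift; taking the modulus and
# averaging over the angle axis then yields rotation-invariant maps.
rotation_invariant = np.abs(W_rot).mean(axis=0)
```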
Now, that kind of thing has been used — this is the work of Edouard Oyallon — for the classification of objects; this was the original database. You take your different objects, you build a representation which has the capacity to be invariant to translations, but also rotations and potentially scaling, and then you apply a simple linear classifier. How does that compare to the state of the art? The state of the art is these deep networks. That's what we get here; they do better: the error there is 10%, the error here is 20%. However, if you look at this figure, 80% accuracy on 200 classes means that you've been able to capture pretty well the differences and the characteristics of each of these classes. What do they get more? That's the question. Apparently they are able to learn all the types of groups which matter in this classification problem, and one of the key questions is to understand the nature of the invariants these people are able to learn through these deep networks. Now I want to finish with the physics problem, because it will bring us back to things we know, in a different context. In physics, the problem is the following: X is now the state of your system. If you have an n-body problem, it's going to be the positions and the values — the masses, or let's say the charges in the case of quantum chemistry. Now, if you look at such a problem, you immediately see a potentially huge explosion of interactions: with d particles you potentially have d squared interactions. However, we know well that you can reduce the number of interactions. There are the so-called multipole methods, which are essentially based on the following idea: if you want to look at the interaction of a particle with all the others, you first look at its interaction with its neighbors individually; with the particles a little farther away, you look at the interaction with a summary of their group; with the ones even farther away, with a summary of a bigger group. The analogy would be: what's the impact of, let's say, one particular Russian on your life? If you pick a Russian at random in Russia, probably not much; but the impact of Russia as a whole on your life can be pretty large if there is political tension between France and Russia. What matters for structures that are very far away is generally not each individual element but the aggregation, and the effect is that the number of interactions goes from d squared to the order of d log d. This is the idea of multi-scale separation.
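A toy one-dimensional version of this aggregation idea, with made-up positions and charges; it is a sketch of the principle, not of an actual fast multipole code.

```python
import numpy as np

rng = np.random.default_rng(2)
pos = rng.uniform(0, 1, size=1000)     # particle positions on a line
q = rng.uniform(0.5, 1.5, size=1000)   # charges (or masses)
target = 0.5

def direct(target):
    """Exact sum of 1/r interactions: O(d) per target, O(d^2) overall."""
    return np.sum(q / (np.abs(pos - target) + 1e-3))

def multipole(target, near=0.1, n_groups=10):
    """Near particles individually; far groups replaced by their centroid."""
    close = np.abs(pos - target) < near
    s = np.sum(q[close] / (np.abs(pos[close] - target) + 1e-3))
    far_pos, far_q = pos[~close], q[~close]
    edges = np.linspace(0, 1, n_groups + 1)
    for a, b in zip(edges[:-1], edges[1:]):
        m = (far_pos >= a) & (far_pos < b)
        if m.any():
            centroid = np.average(far_pos[m], weights=far_q[m])
            s += far_q[m].sum() / (np.abs(centroid - target) + 1e-3)
    return s

print(direct(target), multipole(target))   # the two sums roughly agree
```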
Okay, quantum chemistry. In quantum chemistry, if you solve the Schrödinger equation, you get the probability density of the electrons, and the electrons define the chemical bonds between the atoms. Normally, to get the probability density of the electrons, you have to solve the Schrödinger equation, which is a very heavy numerical problem. You can state it as a variational problem: the energy of the electronic density has different kinds of components — the repulsion between the electrons, the attraction between the electrons and the nuclei, and a very, very nonlinear component which carries most of the complicated nonlinear terms of the Schrödinger equation. Now, to compute the true distribution, you minimize the energy: if there is no external force, you look for the minimum of the energy, which corresponds to the molecule in its ground state, and that consists in finding the density which minimizes the energy. So that's the energy we ultimately want to compute, and in our case what we want to do is learn this energy. What do we know about it? This energy is invariant to any rigid motion: take a molecule, rotate it, translate it, the energy does not change. If you act with a diffeomorphism — if you slightly deform the molecule — the energy slightly changes. So you have a problem very similar to the image-processing problem. What we are going to do, since we don't know the density corresponding to the ground state (that would require solving Schrödinger), is start from a very crude approximation of the density. We know the position of each atom, and we treat each atom as an isolated atom. We neglect all interactions, which gives an initial density completely different from the true density, the one that really carries all the chemical bonds. To perform the learning, we transform this approximate density and build a representation which is invariant to translation, invariant to rotation, and stable to diffeomorphisms. To show you the comparison, I'll show two representations: one with the scattering transform, the other with a Fourier transform made invariant to rotation by integration over the circle. And we learn the physical energy in a very naive way, as a simple linear regression over these quantities: we expand the energy over our invariant descriptors, and what we learn are the weights. How do we learn the weights? From training examples. And we try to use a minimum number of coefficients, so it's a sparse set of variables — I think Francis described that kind of problem. Okay, so here is the setting. You have a database of molecules, and you know the energy of each molecule. You train the system so that you can regress the known values as a linear combination of your invariants, and then you test on a testing set of new molecules, looking at the error as a function of the number of coefficients used in the regression. With the Fourier-transform invariants, the error basically saturates at 16 kilocalories per mole. To give you an order of magnitude, if you run a DFT code, a numerical code, the error is of the order of 1 kilocalorie per mole. If you just use the first layer of the network, the error is of the same order as Fourier, 14; if you include the second-order coefficients, it drops to 2.7 in this case. So you are not very far from what you can get numerically by solving the Schrödinger equation. Obviously, that also depends on the structure of the database — these are databases that were put together by chemists. What this indicates — and nowadays a lot of people are going into this field — is that it is indeed possible to directly learn physical functionals, without solving the underlying equations, by choosing invariants which look appropriate.
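Here is a hedged sketch of this kind of sparse regression, with synthetic placeholder data instead of the real molecular descriptors; I use scikit-learn's orthogonal matching pursuit as one standard greedy sparse solver, which may differ from the algorithm actually used in the experiments.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(3)

# Placeholder data: rows are molecules, columns are invariant descriptors.
Phi_train = rng.standard_normal((400, 200))
Phi_test = rng.standard_normal((100, 200))
w_true = np.zeros(200)
w_true[:10] = rng.standard_normal(10)          # energy depends on few descriptors
E_train, E_test = Phi_train @ w_true, Phi_test @ w_true

# Error on new molecules as a function of the number of coefficients kept
for n_coef in (1, 5, 10, 20):
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_coef)
    omp.fit(Phi_train, E_train)
    err = np.abs(omp.predict(Phi_test) - E_test).mean()
    print(n_coef, err)
```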
Now, why are these invariants appropriate? Because what you have here are quantities which decorrelate scales. In physics, we know very well that forces decorrelate across scales. In what sense? If you analyze a physical system, you can separate the chemical bonds, which act at very short range, from longer-range interactions such as van der Waals forces, or even longer-range interactions — which is exactly the kind of separation these representations perform. Okay, so let me conclude. The first conclusion, a somewhat global one, is that from a math point of view, in all these learning problems, the main difficulty is to find appropriate representations whose simple Euclidean metric is equivalent to the original similarity metric. This is called Euclidean embedding. Again, it's a relatively standard mathematical question, but in a totally different framework from the one in which it has been treated before; and if you look at the problem from a probabilistic point of view, it amounts to building Gaussian models of completely non-Gaussian processes. Now, if you are in a situation where you want stability to the action of diffeomorphisms — whether in space, along rotations, or along frequency — then you know you have to put in wavelets. You don't have the choice: you need to separate scales, so your filters will be wavelets. Then there are two situations. Either you are in a situation like the ones I described, where you know the geometry — you know that you have to deal with translations, rotations, whatever — and then you can build your operators by hand. But in many learning problems you don't know the geometry; you don't know the sources of variability, how to capture them, how to build invariants, and that's where you need to learn. This is a completely open field, mathematically and also algorithmically. There are a lot of very beautiful algorithmic results around that, and very little mathematical understanding. I insist on the physics problem because I think this is going to have a pretty strong influence in physics: it is now possible to build models in physics, more and more refined, directly from data. It's a completely different approach to building models of energy functionals. And there are all kinds of applications. I spoke here about audio and images; another very interesting domain where this kind of thing has been applied is natural language, classified with exactly the same algorithms. You take a text — a very beautiful paper did the following: each of the 26 letters, plus comma, period and blank, is coded by a number between 1 and 30; you get a big signal which is meaningless; you feed that into your neural net — and they got state-of-the-art classification on problems such as sentiment analysis, whether the text is sad or not sad, and topic analysis, whether it's about chemistry, math, whatever. How do they do it? Training on a huge amount of data. What are the underlying invariants? No idea. What is being built, I don't know. But the results are very spectacular, and that raises very nice questions. Thanks very much. [Question] Thanks a lot for your talk. I'm curious: if you want to learn physics from the data, you tie yourself to the available data, to the quality of the data that you have, right? So to me the question is: where does it break down? Exactly there. But that's the standard procedure. I mean, what is the normal procedure in research in physics?
You have experimental people doing measurements, and sometimes centuries afterwards you have people like Newton, like Maxwell, like Einstein, who put together all these experiments and try to find a theory which summarizes them. There is no way you can build a complete theory without experimental data verifying it at some point. Well, you can try to extrapolate, like string theory, but at some point you'll need to verify it against data. It's the same thing here: this kind of method can only learn in the region of the space where you do have data. [Question] You can do experiment design based on the model, and in this way construct the model based on the data that you have. Sorry, I don't understand: in what sense do you design the experiment as a function of the model? Well, if you have a physical model, you know how you should acquire the data, like the sampling rate. Yes, you're right, you need some prior information to know the range of results you'll get and so on, but a lot of experiments can also be carried out without any theory — in particular in materials science, where you have amazingly complex systems with very long-range interactions, very complicated. People know how to make measurements, but the difficulty is to predict the behavior of the resulting material. In materials science you are in a situation where you can do this kind of thing. The current limitation is that these techniques are not yet precise enough, but there are people in chemical companies, for example, who are working on this now, because they have huge numbers of molecules to screen, to see whether they have a chance of being stable before actually doing the chemical experiments. They normally do that with computational tools, but it's often too slow, and now they are trying this kind of technique. [Question] So, you showed us examples about images and examples about sound. You suggested images are somehow simpler than sound, and when you gave the applications, one had the feeling that it was less good on sound issues, like speech, like instruments and so on.
Although sound is a one-dimensional signal and an image is two-dimensional, how would you explain the fact that sound can be so tricky? Okay, I wouldn't say that sound is more complicated. The state of the art right now in speech analysis uses this kind of technique, deep networks; they hold the state of the art for music analysis and speech analysis. I didn't describe it because we didn't work specifically on it. Now, whether sound is more complicated than images is a very complicated question — you know, it doesn't really mean anything; you have two functions, and whether they live on R2 or on R it's the same thing; it depends on the nature of the underlying process. Sound has an amazing richness, and in fact you can obviously make an image out of a sound: you compute a time-frequency transform and you immediately have an image. In terms of technology, what works best right now is speech analysis — you have it on your telephone, and it works pretty well now. Image analysis is not at that level: it is beginning to work well, but the difficulty is that images are much richer than speech. Now, if you go from speech to music and arbitrary sounds, then you have the complexity of images, and you find the same kind of thing. What I wanted to show is that even in sound you have much more complex groups than just translations. [Question] Let's go back to this again: if there is some conversation and the whole conversation is shifted by a slight amount, we'll still recognize it, even if the shift is not an octave; if some people speak louder, we still recognize it, and so on. It seems to me that there are even more invariances than what you described. But the spiral is continuous: when you shift a little bit, you move a little bit on the spiral. The spiral is not discrete; you have a two-dimensional topology — the topology along the spiral and the vertical topology — and that carries exactly this kind of thing. That's why, in fact, music is very naturally written with octaves: with a musical score you can have a lot of richness, but music is very naturally organized as Do Re Mi Fa Sol La Si Do. You have the spiral and you have the octave. Why? Because it's a way to capture the topology of the harmonics. At the same time, you can make a very slight move in frequency — that's what happens when you go from, let's say, Do to Do dièse (C to C sharp) — so you can move very progressively. But that was one example, and I think there are much, much more complicated structures; that's what seems to be captured by these networks. We are right now working with relatively simple groups — even the spiral group is not so complicated — and that's where, again, I think there is a lot of richness for math: the types of transformations that we see appearing in these networks, we have a very hard time understanding. [Question] I just wanted to ask: is it possible to know the number of layers one needs for a specific problem? And in these deep networks, how many layers are there?
Okay. In what I described, the depth is basically scale, and that's in general what depth means: as you go to bigger and bigger depth, you are getting descriptors of larger and larger structures, which are very rich. Now, there are different ways to structure the filters in these networks: some networks announce that they have 20 layers, some have 7, some have 10, but it very much depends on how they compute their filters. If you have to think of depth, think of the scale of the descriptors you want. If you think of an image, and you want to recognize very large structures — for example, the fact that you have an auditorium here — you want to capture very large scales, so you want to go pretty deep. If you want to deal with speech, and you are interested in capturing a whole sentence, you also want to go to very large scales. So the depth is going to depend on the maximum scale which is important for your recognition problem. [Question] When you apply your structure to real images and you get 80% correct classification, 20% wrong, can you look at those 20% to try to figure out what kind of invariance you are missing? That's what the PhD student Edouard is doing, but we are hitting a kind of wall, in the sense that all the groups we are used to don't seem to let us penetrate these last 20%, and the deep networks are able to do it. So right now, I think, we don't understand what allows them to go beyond. There is also a phenomenon in learning: there are two types of learning, so-called unsupervised learning, where people learn without the labels, and supervised learning, with the labels. Unsupervised learning does not seem able to penetrate these last 20%. So it looks like what is learned describes not only the structure of the data, but the structure of the data in relation with the world of the labels. As I said, that's what we are trying to understand. And what is, for me, very spectacular is that they do this on millions of images. What is impressive is that these techniques scale — and it's the first time; what I mean by "they scale" is that with very, very large databases, their accuracy continues to increase, which is radically new; we didn't have that kind of thing before. [Question] Maybe one point: it seems a human can start with one image, one type of cat for example, and then understand all possible cats from that one; but in real life you have thousands of images of cats, so can it be useful to start from the information that you have? But that's what we do. As I said, even in our framework, the last classifier works from training data, and in the training data you don't have one cat, you have several hundreds of thousands of cats. If you just have one cat, there is no way you can do it, because you can't learn the variability. You absolutely need thousands of cats to recognize cats. Okay, so maybe we should thank the speaker for today; I think it was really an amazing talk. Thank you.