Thanks. In fact, there is one keyword missing here, which is mathematics, and the goal of the talk will be to try to show that there is a lot of very interesting mathematics emerging nowadays from all these problems, in particular neural networks, and also their links with physics; many beautiful, totally not understood problems are coming out of these domains. So that is what I would like to try to show, and I'll begin with a very general question which is fundamental in this field: basically, when you are dealing with problems such as image classification, or when you are trying to understand properties of large-scale physical systems, it is about trying to understand, to approximate, a function f(x) in very high dimension, where x is a vector in high dimension d. Think for example of a problem such as image classification: x would be an image, and to an image you want to associate, let's say, its class; so you have such a function f(x), and x is really a very complex, very high-dimensional vector, such as these images. If you think of the same problem in physics, then x describes the state of the system; for example, in quantum chemistry it would describe the geometry of a molecule, and what you would like is to compute, for example, the energy. If you have access to the energy then you have access to the forces by computing derivatives, so basically to the physical properties of the system. Can you learn physics by trying to approximate such a function, given only a limited number of examples?

A different problem is the modeling of data: in that case what you want to model, what you want to approximate, is a probability distribution, and again there are very well-known, very difficult problems in physics, for example turbulence. Since Kolmogorov's first papers in the 1940s, it has been a central problem in physics to understand how to define the probability distribution describing a turbulent fluid at high Reynolds number. But you can think of even more complex problems, such as faces: can you describe a random process whose realizations would be faces? In that case you have something which is totally non-stationary, totally non-ergodic.

Now, the reason we can ask these questions nowadays is because of these deep neural networks, which seem to be able to do such things. So let me remind you what these deep networks are. I'll take the example of an image. You want to do image classification, so x is the image that you input into the network. What the network is basically doing, in particular a convolutional neural network, is a cascade of convolutions: you have a convolutional operator with a very small support, typically a 3 by 3 or 5 by 5 convolution, and the output is transformed by a nonlinearity, a rectifier, which keeps a coefficient whenever it is above zero and sets to zero any coefficient which is negative. You do that with several filters, which produces a series of images: this is the first layer of the network. Then on each of these images you again apply a filter, you sum up the results, and you produce an output, which is one of the images of the second layer; you do that for different families of filters, which produces a whole series of images that you subsample in the next layer. Then you repeat: at each stage you define a family of filters.
You do your convolutions, sum, apply the nonlinearity and go to the next layer, until the final layer, where you simply take a linear combination and you get, hopefully, an approximation of the function f(x) that you would like to approximate (a minimal sketch of such a cascade is given below). Now, how do you train this network, which typically has hundreds of millions of parameters, the parameters of all your convolutional kernels? You update them in order to get the best possible approximation of the true function f(x) on the training data, and for this you have an optimization algorithm.

Now, what has been extremely surprising since, let's say, 2010 onwards, is that these kinds of techniques, these kinds of machines, have remarkable approximation capabilities on a very wide range of applications. Everybody has of course heard of image classification, but it goes much beyond that: sound and speech recognition, what you have in your telephones now, are based on such techniques; translation and analysis of text are done with such elements; regression in physics, computations in quantum chemistry, signal and image generation. Basically, when you begin to have a very large amount of training data, it looks like these systems reach the state of the art, and we essentially don't understand why. There is something very mysterious, because you have a single type of architecture which is able to approximate these very different classes of problems, which indicates, from a mathematical point of view, that these problems share the same kind of regularity, so that the same kind of algorithm can approximate them. So one obvious question is to understand what all these problems have in common. What kind of regularity allows this kind of machine to approximate these functions despite the curse of dimensionality, despite the fact that we know that in high dimension, in order to approximate a generic function, the number of samples needed explodes exponentially with the dimension?

You can take many different points of view to analyze this. A lot of work is devoted to the optimization side: how come you can optimize such architectures with stochastic gradient descent? That is not the kind of question I'll be asking. Here I'll try to understand why this kind of architecture can approximate a wide class of functions, and what that says about the underlying regularity which is needed in order to be approximated; it is basically going to be a harmonic analysis point of view, I'll analyze these as a kind of harmonic analysis machine. So the questions are: what kind of regularity do these functions have in order to be approximable, and why can this kind of computational architecture approximate such functions? What is really done inside the structure? What is the learning providing?

I'll show that several key properties come up. One is the fact that scales are separated: basically you compute a kind of multiscale representation. If you think of it, the depth that you have here is a kind of scale axis. Why? Because you first aggregate information over a very small, very fine-scale neighborhood, and then, because you cascade these aggregations and subsamplings, you progressively aggregate information over wider and wider scales. Deformation regularity, regularity to the action of diffeomorphisms, is also at the core of this kind of thing; it is at the core of physics, and we'll see that it appears also in image classification. The third is sparsity: when you look at the coefficients in these networks, many of them are zero because of the rectifier, which indicates a kind of sparsity property, and we'll see that this is fundamental in order to understand the kind of regularity that comes out.
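Before going further, here is a minimal sketch of the convolutional cascade just described, written in PyTorch. The filter sizes, channel counts, number of layers and pooling choices are illustrative placeholders, not the ones of any particular network from the talk.

```python
import torch
import torch.nn as nn

# Minimal convolutional cascade: small-support convolutions, rectifiers,
# subsampling, and a final linear layer that outputs the approximation of f(x).
class SmallConvNet(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # first family of 3x3 filters
            nn.ReLU(),                                     # rectifier: keeps positive coefficients
            nn.AvgPool2d(2),                               # subsampling to the next layer
            nn.Conv2d(16, 32, kernel_size=3, padding=1),   # second family of filters
            nn.ReLU(),
            nn.AvgPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                       # aggregate over the whole image
        )
        self.classifier = nn.Linear(64, n_classes)         # final linear combination

    def forward(self, x):
        z = self.features(x).flatten(1)
        return self.classifier(z)

# Training updates all the convolutional kernels by stochastic gradient descent
# so that the output best approximates the true f(x) on the training data.
net = SmallConvNet()
x = torch.randn(8, 3, 32, 32)            # a batch of images
y = torch.randint(0, 10, (8,))           # their labels
loss = nn.CrossEntropyLoss()(net(x), y)
loss.backward()                          # gradients used by the optimization algorithm
```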
So in order to look at the problem, I'm going to consider three different types of problems. The first one is classification. The second one, and in fact I'll begin with this one, is a slightly simpler but still extremely difficult problem: the modeling of totally non-Gaussian but ergodic random processes, such as turbulence. Why? Because people have been doing experiments trying to model turbulent fluids such as the ones you see here: this one is from astrophysics, this is a two-dimensional vorticity field, and this is bubbles, and the images that you see below are syntheses produced by such networks. How do they do that? They take an image, and they first train the network on a database of images which has nothing to do with these particular images, a particular database called ImageNet. Once the network is trained, you take the image, this one example, and you compute all its coefficients in the network, within the different layers. Then you look at one layer and you compute the correlations of the coefficients within that layer at different channel positions: you get a correlation coefficient, and you do that for every pair of channels. You then have a statistical description of your random process through these correlations. Now you generate a new realization of your random process by beginning from white noise and modifying it up to the point where it has the same correlation properties (this is the procedure sketched in the code below). Then you look at the realization, and it looks like these things: it looks like realizations of turbulence, of bubbles, and so on. So the question is why, what is happening here? It reproduces something that looks like a turbulent fluid; that one was an astrophysical image, and it reproduces an image which is totally different but looks like a turbulent fluid. This was computed from this one, and for any new input white noise, if you do the optimization you get a different image, so you get a random process which seems to have similar properties. Question: what happened, what kind of model did you build?

There is another type of modeling of random processes, for totally non-ergodic processes. What do they do? They take an image, and they use such a network in order to get an output which looks like Gaussian white noise. Then they invert the network, so that from this Gaussian white noise they recover something that looks like the original image. That is called an autoencoder. So they train it, let's say, on faces or bedrooms, and then they synthesize a new image by creating a new Gaussian white noise and applying the decoder. If you train it on bedrooms, for example, with a database of bedrooms, you put in a new white noise and you get a new bedroom, and with another white noise you get another bedroom. The most surprising thing is that you can then do a linear interpolation between these two white noises, just a linear interpolation, and from any of these interpolated noises you reconstruct, and what you get is a kind of new bedroom; at every stage it looks like a bedroom. What does that mean? It means that in that space you have a representation of bedrooms which is now a totally flat manifold, because the average of two bedrooms is a bedroom. So you have completely flattened out the set of points representing bedrooms, which is a wild set of points in the original space, into something flat. What is happening, and why?
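Here is a minimal sketch of the correlation-matching synthesis described above: start from white noise and push it, by gradient descent, to have the same within-layer channel correlations as the target image. It assumes a callable network(x, layer) returning the activations of a given layer as a (channels, height, width) tensor; that interface, the layer choice and the step counts are placeholders, not the actual implementation used in the experiments.

```python
import torch

def channel_correlations(features):
    # features: (channels, height, width) activations at one layer of the network.
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.t() / (h * w)            # correlation between every pair of channels

def synthesize(target_image, network, layer_ids, n_steps=500, lr=0.05):
    # Statistical description of the target: its channel correlations at the chosen layers.
    with torch.no_grad():
        target_corrs = [channel_correlations(network(target_image, l)) for l in layer_ids]
    # Start from white noise and modify it until it has the same correlation properties.
    x = torch.randn_like(target_image, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        loss = sum(((channel_correlations(network(x, l)) - tc) ** 2).sum()
                   for l, tc in zip(layer_ids, target_corrs))
        loss.backward()
        opt.step()
    return x.detach()
```

Each different starting noise gives a different image with the same layer correlations, which is why the procedure behaves like a random process model.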
These will be the questions organizing the talk. The first part will be about the modeling of random processes and these ideas of scale separation. What I'll show is that scale separation is at the core of the ability to reduce this curse of dimensionality, and that one of the very difficult problems in this field, mathematically, is to understand interactions across scales; I'll try to show why the nonlinearities in these systems provide these interactions across scales. The second topic will be the regularity to the action of diffeomorphisms. There we'll look at problems such as classification and the regression of energies in quantum chemistry, we'll see the kind of role this plays, and again what kind of mathematics comes out. The last part will be about the modeling of those non-ergodic random processes, such as these bedrooms; what we'll see is that in some sense these networks build a kind of memory, and the notion of sparsity will be important there. So these will be the three stages of the talk.

Let me begin with scales. Why is scale separation so important? This is very well known in physics when you have an n-body problem. A priori the interactions are long-range: all the bodies interact. You can think of the bodies as particles, but they could be pixels, or agents in a social network. How can you summarize, reduce, these interactions? Take a central particle here: it has very strong interactions with its neighbors, its "family", the neighboring particles or pixels. Then, for particles farther away, instead of looking at the interaction of each particle with this one, you can aggregate them, construct an equivalent field, and look at the interaction of the group with this particle. For particles even farther away, and these are called multipole methods, you can regroup even larger sets of particles and summarize their interaction with a single term. For example, think of a social network: we are six billion inhabitants on the earth, and you cannot neglect even people who are very far away, say some Chinese person living somewhere in China, because if you neglect every Chinese person then you neglect China, and if China has, say, some particular tension with France or your country, that can have an influence on your life; but you don't need to look at each Chinese individual, only at the aggregate, China. A toy version of this far-field aggregation is sketched below.
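A toy one-dimensional illustration of this idea, assuming 1/r-type pairwise interactions: nearby particles are summed exactly, while far-away particles are grouped into dyadic distance shells (groups get larger as they get farther away, as in a real multipole method) and each group is replaced by its total charge placed at its mean position. The numbers are arbitrary.

```python
import numpy as np

def direct_field(x0, positions, charges):
    # Exact sum of pairwise 1/r interactions felt at x0.
    return np.sum(charges / np.abs(positions - x0))

def multiscale_field(x0, positions, charges, near_radius=1.0):
    # Near particles: exact interactions.  Far particles: dyadic shells,
    # each summarized by a single term (total charge at the mean position).
    dist = np.abs(positions - x0)
    near = dist < near_radius
    total = direct_field(x0, positions[near], charges[near])
    far_pos, far_q, far_d = positions[~near], charges[~near], dist[~near]
    shell = np.floor(np.log2(far_d / near_radius)).astype(int)   # dyadic scale index
    side = np.sign(far_pos - x0)
    for s in np.unique(shell):
        for sg in (-1.0, 1.0):
            sel = (shell == s) & (side == sg)
            if sel.any():
                q = far_q[sel].sum()
                center = far_pos[sel].mean()
                total += q / np.abs(center - x0)                 # one term per group
    return total

rng = np.random.default_rng(0)
pos = rng.uniform(-100.0, 100.0, 10_000)
q = rng.random(10_000)          # positive charges, so the far field adds up coherently
print(direct_field(0.0, pos, q), multiscale_field(0.0, pos, q))  # close values, far fewer terms
```

The far field is summarized by a number of terms growing only logarithmically with the range, which is the log d reduction mentioned next.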
So this idea of multiscale aggregation allows you to reduce the interactions to of the order of log d components. What is very difficult is to understand the interactions between the groups, in other words the interactions across scales. What has been well understood for a long time is how to do scale separation, and wavelets are the best tools for that; what has essentially not been understood since the 1970s is how to model, to capture, the interactions across scales, and what I'll try to show you is that this is completely central in these networks.

So how do you build scale separation? You do it by introducing small waves, wavelets, which basically look like a Gaussian modulated by a cosine or a sine. You scale these wavelets, like that, and in two dimensions you also rotate them, so you get a wavelet for every angle and dilation; then you take your data and explode it along the different scales and rotations by doing convolutions, as in these networks. How does this look in the Fourier domain? A convolution is a product, so basically you filter the Fourier transform into a channel like that; when you change the angle of the wavelet you rotate the Fourier support, and when you dilate the wavelet you dilate the Fourier support. So, seen in the Fourier domain, you explode the information into different frequency channels. Now, if you want to model a random process through, let's say, correlations, what you observe is that the wavelet coefficients at two different scales or angles are not correlated if the random process is stationary. Why? Because they live in two different frequency channels, and a simple calculation shows that, because the Fourier supports of the wavelets are disjoint, the correlation is zero. Let me look at an example. This is an image, and these are the wavelet coefficients at the first scale: gray is zero, white positive, black negative; you have large coefficients near the edges, and this is the average. Then you compute the wavelet coefficients at the next scale, and the next. What you see is that most coefficients are very small, nearly zero, but they look very much alike across scales; yet they are not correlated, so you are unable to capture the dependence across scales with a simple linear statistic such as a correlation.

Now, in statistical physics, how do you model a random process? The standard way is to compute moments. What is a moment? It is the expected value of some transformation Phi_m of your random field. You compute these expected values, and then you define the probability distribution which satisfies these moments and which has maximum entropy, which is a way of expressing that you have no further information: you consider all possible configurations having those moments. What you can rather easily show is that you then get a Gibbs distribution, defined by Lagrange multipliers which are adjusted in order to satisfy the moments. What have people mostly been doing until now? Computing moments which are basically correlations, and that is exactly what the Kolmogorov model of turbulence is about. If you do that, then what you get in the exponent is a bilinear function, and therefore you get a Gaussian distribution. So if you look at a Gaussian model of turbulence, that is what you get: the images below have exactly the same moments as the one above.
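Written out, the maximum-entropy model just described is the Gibbs distribution whose Lagrange multipliers are adjusted to match the prescribed moments; the notation Phi_m, mu_m below is generic rather than taken from the slides.

```latex
% Maximum-entropy distribution constrained by the moments  E[\Phi_m(x)] = \mu_m
p_\theta(x) \;=\; \frac{1}{Z_\theta}\,
  \exp\!\Big(-\sum_{m} \theta_m\, \Phi_m(x)\Big),
\qquad \theta \ \text{chosen so that}\ \ \mathbb{E}_{p_\theta}\big[\Phi_m(x)\big] = \mu_m .
```

When the Phi_m are second-order moments, products x(u)x(v), the exponent is a quadratic form and p_theta is Gaussian, which is exactly why the correlation-constrained model loses the geometry.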
So they are the maximum entropy model constrained by second-order moments, with the same spectrum, but you have lost all the geometry of the structures. So what have people been trying to do in statistics? Go to higher-order moments. But if you go to higher-order moments you have many moments, they have a huge variance, and the estimators are in fact very bad because of that variance. Deep networks seem to give estimators which look much better. Why? The key point here will be the nonlinearity.

What I want to show is that the nonlinearity is what builds the relation across scales, and the key way you relate scales is through the phase: the phase is the link between scales. Take a wavelet which has a certain phase alpha, I'll call it that way, and I'm going to build a network where I impose that the filters are wavelets. So I take my x, I filter it with a wavelet, and I apply a rectifier. What happens if you do that? Look at this convolution: I convolve with a wavelet which has a certain phase; I can pull out the modulus of the convolution, and I am left with a cosine which depends on the phase of the wavelet and the phase of the convolution. Now, what happens when you apply a rectifier? The rectifier is a homogeneous operator: the modulus comes out, and the rectifier only transforms the phase, essentially by killing the negative coefficients. You can view the rectifier as a window on the phase: it eliminates all the phases corresponding to negative coefficients and keeps the phases corresponding to positive coefficients. Now, what if you take a Fourier transform with respect to this phase variable alpha? What you see appearing, after applying the rectifier, is the modulus of the output of the filtering, and the phase, but each phase is multiplied by a harmonic index k. So you do something very nonlinear: you create all kinds of harmonics of the phase.

Why is that fundamental if you want to model random processes? If you take, and I'll write it this way, a convolution, and you raise the phase to the power k, what you essentially do is move the Fourier support. Suppose I am in one dimension, I have a random process, and I look at its components on two different frequency intervals, because I have two wavelets living over different frequencies. These two components are not correlated, because their Fourier components don't interact. If you apply a harmonic, this frequency moves: k equal to two moves it here, to two lambda; k equal to three, here. Now if you look at these two components, they are correlated. So after applying the nonlinearity you create correlations, because you have moved the Fourier supports. What does that mean? It means that if you look over the domain where you have separated all the phases, all the orientations, after applying the rectifier, and you can view it in that domain or in the Fourier domain, which amounts to computing the harmonics, all these blobs are going to be correlated. You can correlate coefficients within a given scale by computing a standard correlation; you can compute a correlation across two orientations by using the appropriate exponent, which in that case is k equal to zero, the modulus; and you can compute correlations across two different frequency bands. If you look at this, it is very close to the calculations of the renormalization group.
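A small numerical illustration of this mechanism, with made-up filter parameters: for a stationary signal with sharp transitions, band-pass coefficients in two disjoint frequency bands are essentially uncorrelated, but their moduli (the k = 0 phase harmonic) are clearly correlated, because both are large near the same discontinuities.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2 ** 14
# A stationary, non-Gaussian signal with sharp transitions: a random telegraph wave.
flips = np.where(rng.random(n) < 0.01, -1.0, 1.0)
x = np.cumprod(flips)

def analytic_bandpass(sig, center, width):
    # Complex band-pass filter: a Gaussian window on positive frequencies only.
    freqs = np.fft.fftfreq(len(sig))
    window = np.exp(-0.5 * ((freqs - center) / width) ** 2) * (freqs > 0)
    return np.fft.ifft(np.fft.fft(sig) * window)

def corr(a, b):
    a, b = a - a.mean(), b - b.mean()
    return np.abs(np.mean(a * np.conj(b))) / (a.std() * b.std())

w1 = analytic_bandpass(x, center=0.20, width=0.02)    # fine-scale band
w2 = analytic_bandpass(x, center=0.05, width=0.005)   # coarser band, disjoint Fourier support

print("raw coefficients:", corr(w1, w2))                          # close to zero
print("moduli (k = 0 harmonic):", corr(np.abs(w1), np.abs(w2)))   # clearly positive
```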
The renormalization group is what allows you to compute, in the particular case of the Ising model, what kind of random process you are going to get, and it does so by looking at the interactions between the different scales. Numerically, what do you get? These are examples of random processes: I now compute a maximum entropy process conditioned not on correlations but on these nonlinear harmonic correlations, these ones, and that is what you get. In the case of Ising at the critical temperature you can reproduce realizations of Ising; for turbulence you produce realizations of random processes like these, where, contrary to the Gaussian case, you now see the geometrical structures appear, because you have restored the alignment of phases. One of the very beautiful questions is: can we extend the calculations of the renormalization group, which we know how to do for Ising, to much more complex processes such as turbulence, in order to better understand the properties of these random processes? That is work being done with people at ENS, in astrophysics in particular.

OK, let me now move to the second problem, which is about classification. You want to classify, for example, digits. One of the properties you see in classification is that when a digit moves, is deformed, then as long as the deformation is not too big it typically stays in the same class: a three stays a three, a five stays a five. And if you take, let's say, paintings: as long as you move on the diffeomorphism group and the diffeomorphism is not too big, you basically recognize the same painting; beyond that it becomes another painting, and by moving like that on the diffeomorphism group you can go across essentially all the European paintings you may find in the Louvre. So diffeomorphisms are a key element of regularity. If you want to approximate a function which is regular to diffeomorphisms, you want to build descriptors which are regular to the action of diffeomorphisms. How can you do that? x is a function, say in L2; if you deform it, its distance to x is going to be very large. So how do you build regularity to diffeomorphisms? A very simple way is just to average x, let's say with a Gaussian, and you get a descriptor which becomes very regular to the action of a diffeomorphism, as long as the deformation is not too big relative to the averaging scale 2^J. The problem is that in doing so you lose information, because you have been averaging. How can you recover the information which was lost? The lost information is the high frequencies, which you can capture with wavelets; but if you then average, you get zero, because these are oscillating functions. How can you get a nonzero coefficient? Apply the rectifier, which is positive, and these will be coefficients which are again regular to the action of diffeomorphisms. But these wavelet coefficients have been averaged, so you again lost information in the averaging. How can you recover the information you lost? You take these coefficients and extract their high frequencies, again with wavelets. And why are wavelets so natural here? Because a diffeomorphism is a local deformation, a local dilation: if you want to be regular to the action of diffeomorphisms, you have to separate scales, and that is what the wavelets do. So you get a new set of coefficients, and then again an averaging.
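A minimal 1D sketch of the cascade just described: averaging, then wavelet filtering and rectification at each stage, then averaging again. The Morlet-like band-pass filters are built directly in the Fourier domain; the scale choices are placeholders, and this is only meant to show the structure, not a full scattering implementation.

```python
import numpy as np

def bandpass(sig, center, width):
    # Complex Morlet-like band-pass filter defined in the Fourier domain.
    freqs = np.fft.fftfreq(len(sig))
    window = np.exp(-0.5 * ((freqs - center) / width) ** 2) * (freqs > 0)
    return np.fft.ifft(np.fft.fft(sig) * window)

def scattering_like(x, centers=(0.25, 0.125, 0.0625)):
    relu = lambda u: np.maximum(u, 0.0)
    coeffs = [x.mean()]                                    # averaging: stable but lossy
    for c1 in centers:                                     # recover the lost high frequencies
        u1 = relu(np.real(bandpass(x, c1, c1 / 4)))        # wavelet filtering + rectifier
        coeffs.append(u1.mean())                           # average the rectified coefficients
        for c2 in centers:
            if c2 < c1:                                    # again extract what the averaging lost
                u2 = relu(np.real(bandpass(u1, c2, c2 / 4)))
                coeffs.append(u2.mean())
    return np.array(coeffs)

x = np.random.default_rng(0).standard_normal(4096)
print(scattering_like(x))     # a small set of deformation-stable descriptors
```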
So that is going to look like a convolutional network, where you iterate convolutions with wavelets and nonlinearities, but in this network I don't learn the filters, I impose them, because I have a prior on the kind of regularity I want to produce. One thing you can prove is that if you build such a cascade, and you deform your function x, you get a representation which is Lipschitz continuous to the action of diffeomorphisms. In what sense? If x is deformed, and you look at these coefficients as a vector, then the Euclidean distance between the representations, the outputs of your network before and after the deformation, is of the order of the size of the deformation, where the weak topology on diffeomorphisms is defined by the size of the Jacobian of the deformation, of the translation field that depends upon position. So you can prove that you have something stable: building something which is regular to deformations naturally leads you again to scale separation, and to the use of these nonlinearities.

The question is how good this will be compared to a deep network: you have a kind of network, but you have not learned the filters; how good is it going to be compared to a network where you learn everything? The first problem I'll look at is quantum chemistry. Quantum chemistry is an interesting example because you have prior information on the type of function you want to approximate. What do you know? The problem is the following: x, the state of the system, is described by a set of atom positions and charges, and you want the energy of, let's say, a molecule. You know that if you translate the molecule the energy does not change, and if you rotate the molecule it does not change, so it is invariant to translations and rotations; and if you slightly deform the molecule the energy only changes slightly, so you have regularity to the action of diffeomorphisms. Question: can I learn such a function just from a database which gives me configurations of molecules and the value of the energy? In quantum chemistry, the way such energies are computed is with what is called DFT, density functional theory. The key idea is the following: you take a molecule and you look at its electronic density, that is what I am showing here; each gray level gives you the probability of the presence of an electron at a given position. The electrons are very close to the atoms, but they are also in between two atoms, because that is the chemical bond. Computing such a density requires solving the Schrödinger equation.

In this framework I suppose I don't know any physics beyond these basic invariances, so I don't know the Schrödinger equation. What we are going to do, and there is now a whole community in physics and machine learning doing this kind of thing, is to represent the molecule just by its state: the only thing I know in x is the position of each atom and the number of electrons on each atom. So I represent, naively, the electronic density as if each electron were sitting exactly where the nucleus of its atom is; you get a kind of electronic density like that, and at this point I have no idea what chemistry is about. Then you build a learning system: you take this density in 3D and you compute a representation by separating all scales and all angles and applying a rectifier, and you get these kinds of 3D blobs, which look a little bit like orbitals.
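A sketch of this naive density, with made-up atom positions, an arbitrary grid and an arbitrary smoothing width: each atom contributes a Gaussian blob at its nuclear position, weighted by its number of electrons.

```python
import numpy as np

def naive_density(positions, charges, grid_size=32, box=10.0, sigma=0.5):
    # positions: (n_atoms, 3) nuclear coordinates; charges: electrons per atom.
    # The density pretends every electron sits exactly at its nucleus,
    # smoothed by a small Gaussian so that it lives on a 3D grid.
    axis = np.linspace(0.0, box, grid_size)
    gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
    rho = np.zeros((grid_size,) * 3)
    for (x, y, z), q in zip(positions, charges):
        d2 = (gx - x) ** 2 + (gy - y) ** 2 + (gz - z) ** 2
        rho += q * np.exp(-d2 / (2.0 * sigma ** 2))
    return rho

# A toy "molecule": three atoms with illustrative electron counts.
positions = np.array([[4.0, 5.0, 5.0], [6.0, 5.0, 5.0], [5.0, 6.5, 5.0]])
charges = np.array([6.0, 8.0, 1.0])
rho = naive_density(positions, charges)
print(rho.shape, rho.sum())    # a 3D density, ready for the wavelet descriptors
```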
Then you apply your linear averaging, which builds a number of descriptors of the order of the logarithm of the dimension, which are invariant to translations and rotations and stable to deformations. And where do you learn physics? In the last stage, where you just learn the weights of the linear combination of all these descriptors to try to approximate the true energy of the molecule. How do you learn these coefficients? You have a database of examples, and you regress your coefficients on that database. There are databases which have been constructed to test this kind of thing, of typical size about 130,000 molecules, these are organic molecules; what people do is train a deep network where they learn everything, and compare its errors with the errors of a numerical DFT algorithm. What has been observed is that with this kind of technique, if the database is rich enough, you can get an error which is smaller than the error of a typical numerical DFT scheme. In our case we don't learn the filters: we just say that we know some regularity properties, the math leads to a certain type of filters, and basically you get an error of the same order. That shows that in this kind of example you basically know what is being learned: what is learned is the type of regularity encoded in the filters, so you can replace them with wavelets, and the only thing you need to learn is the linear weights at the output.

However, these are, let's say, simple problems, in the sense that the databases used until now are databases of small molecules, about 30 atoms. Yes? Sorry, yes: when I was comparing here, these are the deep nets; when I say deep net, that is when you learn everything, and when you learn everything you get an error of the order of 0.5 kilocalorie per mole; when you don't learn anything, if you use wavelets adapted to the kind of transformations to which you want to be invariant, you get an error of the same order. In that case the network is in fact smaller, because you know exactly what you want, you basically know the kind of filters, so you don't have to build a big structure. But again, these are not horribly complicated problems, and if you look at the world of images you can see the difference between simple and hard problems. What is a simple problem? Recognizing digits: you have an image of a digit and you have to recognize which digit it is. Or differentiating textures, which are homogeneous random processes. If you take a deep network and learn everything, or if you impose that the filters are wavelets, you get about the same kind of error. If you then move to something much more complicated, such as this kind of image, ImageNet, where you have 1,000 classes, then if you impose filters which are wavelets the error is about 50 percent, whereas if you learn the filters the error is much smaller: in 2012, and that was the big result which began to attract so much attention, the error was about 20 percent, and now it is about 5 percent. So the question here is: what is learned in these kinds of networks? That is the last part. What I would like to show is that there is a simple mathematical model which captures, to first order, what is learned: it is basically learning dictionaries to get sparse approximations.
If you think of this domain, it used to be called pattern recognition. What is a pattern? A pattern is a structure which approximates your signal and which is important for classification. How can you think of decomposing a signal in terms of patterns? x is my data; I define a dictionary of patterns, where each column of my matrix D is a particular pattern. Decomposing x as a sum of a limited number of patterns can be written as the product of this matrix with a sparse vector z, which is mostly zero besides the few patterns you select to represent x. How can you express such a problem? This is a well-known field, called sparse representations, which has been studied since basically the 90s. One way to specify the problem is to say: x is going to be approximated by D multiplied by z, and I want z to be sparse, so I impose that the l1 norm of z is small; I'm also going to impose that the coefficients of z are positive, and then you have a convex minimization problem.

Good. So how can you solve such a convex minimization problem? There are different types of algorithms; basically they are iterative algorithms which amount to doing a gradient step on the quadratic term, which leads to that kind of matrix, and then minimizing the l1 term by a nonlinear projection, and it happens that in this case the nonlinear projection is exactly a rectifier. So solving this kind of problem amounts to applying a linear operator, then a rectifier with what is called here a bias, where the bias corresponds to the Lagrange multiplier, and iterating. You essentially compute a deep network, a deep network where at each stage you apply your matrix, which is a linear operator, and a rectifier; and in this network the matrices are all the same, they all depend on the one matrix, the dictionary. A minimal sketch of this iteration is given below.

Now, how can you use that for learning? You can use it by saying: I would like to extract the best patterns, the ones that lead to the best classification. So I build a network where you compute your sparse code, then I put a classifier, and I get my classification result. What do you want to do? You want to compute the best dictionary, the one that leads to the smallest error. So you optimize the weights of the dictionary D and of the classifier so that, over a given database of data and labels, the classification loss is as small as possible, and you do a standard gradient descent. So this is standard neural network learning; the only difference is that you are doing something which is, from a math point of view, well understood: you are just doing a sparse approximation where you learn the matrix D, and you can study the convergence of that kind of scheme.
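Here is a minimal sketch of the iteration just described (ISTA with a positivity constraint): each step is a linear operator built from the dictionary, followed by a rectifier whose bias plays the role of the Lagrange multiplier, which is exactly one layer of a deep network with tied weights. The dictionary size and the test signal are arbitrary.

```python
import numpy as np

def sparse_code(x, D, lam=0.1, n_iter=100):
    # Solve  min_z 0.5 * ||x - D z||^2 + lam * ||z||_1  with z >= 0, by ISTA.
    # Each iteration = linear step + rectifier with a bias:
    # one layer of a deep network whose weights are all tied to D.
    eta = 1.0 / np.linalg.norm(D, 2) ** 2              # step size from the operator norm
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x)                       # gradient of the quadratic term
        z = np.maximum(z - eta * (grad + lam), 0.0)    # rectifier, bias = Lagrange multiplier
    return z

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))
D /= np.linalg.norm(D, axis=0)                          # normalized dictionary of 256 patterns
z_true = np.zeros(256)
z_true[rng.choice(256, 5, replace=False)] = rng.random(5) + 0.5
x = D @ z_true + 0.01 * rng.standard_normal(64)         # a signal made of a few patterns
z = sparse_code(x, D)
print(np.count_nonzero(z), "active patterns selected")
```

Learning then consists of optimizing D (and a classifier on top of z) by stochastic gradient descent through these unrolled iterations.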
OK, so this is what wavelets give you: if you don't learn anything in your network, you do a cascade with predefined wavelets, compute your invariants, do your classification, and you essentially get 50 percent error. What you could think is: OK, let's replace the wavelet representation with a dictionary, so you learn a dictionary optimized to minimize the error; and you don't improve much, you essentially get the same kind of error. Now, what if you cascade the two? You first compute your representation with your invariants, which are now regular to the action of diffeomorphisms, and then in that space you learn the dictionary. There you have a big drop of error, down to about 18 percent, which is essentially better than what was obtained in 2012 with the famous AlexNet. What does that mean? It means there are really two elements. One element is due to the geometry that you know, captured by translations, rotations and diffeomorphisms: you want to essentially reduce, eliminate, these variabilities. Once you have eliminated them, you can define a set of patterns; otherwise you would need a different pattern for every deformation, every translation, every rotation, and your dictionary would get absolutely huge. Now, why you get such an error reduction is an open problem; what you observe at the output is a kind of concentration phenomenon that we don't quite understand. But at that stage you can build a much simpler model, which is basically a cascade of two well-understood operators.

So the last application I'll show you briefly is about these autoencoders. These autoencoders are able to synthesize random processes which are absolutely not ergodic; I gave the example of the bedrooms, where you also see these deformation-regularity properties. One way to pose the problem is the following. You begin with a random process x. What the encoder is essentially doing is building a Gaussian white noise: you have found a map which produces a Gaussian white noise. This map must be invertible, so I impose that it is bi-Lipschitz over the support of the random process. The third property is that you want those deformation properties, so you want your map to be Lipschitz continuous to the action of diffeomorphisms. Questions: how do you build such a map, and how do you invert it? How to build such a map: we have just constructed something which is regular to the action of diffeomorphisms, by separating the scales and then doing a spatial average. Now, why would that build something Gaussian? Because when you average over a very large domain, you begin to mix your random variables, and if your random variables have only short-range correlations, averaging them over a very large domain gives you a central limit theorem which tells you this converges to Gaussian random variables. Then you apply a linear operator which whitens the Gaussian, and you can hope to get Gaussian white noise.

Now the question is the inversion: is this map bi-Lipschitz, can you invert it, and how does it relate to the previous topic? What is hard to invert here? It is not the first, nonlinear part: the nonlinearity is due to the rectifier, and if you keep the rectifier of a and of minus a, which correspond to two different phases, you can get back a; so that part is easy to invert, that way the transform is invertible, that is not a problem. What is hard to invert is the averaging: the averaging builds a Gaussian process by reducing the dimension and mixing all the random variables, and that is not an invertible operator. Now, how can you invert a linear operator which loses information? You can, if you have some prior information about the sparsity of your data: that is called compressed sensing. So how do you do that kind of thing? Same idea: you need to learn a dictionary in which your data is sparse. So the idea is the following: your data, in some dictionary which you don't know yet, is going to be sparse, so it can be represented as a sparse vector multiplied by D.
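A toy numerical check of this inversion idea, with made-up dimensions and under the simplifying assumption that the dictionary is already known: a dimension-reducing linear operator can be inverted when the signal is sparse in that dictionary, by computing the sparse code of the measurements in the projected dictionary (the compressed-sensing mechanism referred to above).

```python
import numpy as np

def nonneg_sparse_code(w, A, lam=0.01, n_iter=1000):
    # Same ISTA iteration as before: linear step + rectifier with a bias.
    eta = 1.0 / np.linalg.norm(A, 2) ** 2
    z = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = np.maximum(z - eta * (A.T @ (A @ z - w) + lam), 0.0)
    return z

rng = np.random.default_rng(1)
n, m, p = 256, 96, 384
D = rng.standard_normal((n, p)); D /= np.linalg.norm(D, axis=0)   # dictionary in which x is sparse
L = rng.standard_normal((m, n)) / np.sqrt(m)                      # dimension-reducing (lossy) operator

z_true = np.zeros(p)
z_true[rng.choice(p, 5, replace=False)] = 1.0
x = D @ z_true                    # the signal to recover
w = L @ x                         # the low-dimensional observation

# w = (L D) z, so the sparse code of w in the projected dictionary L D recovers z,
# and applying D gives back x, even though L alone is not invertible.
z_hat = nonneg_sparse_code(w, L @ D)
x_hat = D @ z_hat
print(np.linalg.norm(x - x_hat) / np.linalg.norm(x))   # small relative error if z is sparse enough
```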
If that is the case, then the white noise obtained by applying your operator L itself has a sparse representation: because Ux is equal to Dz, in that dictionary; the dictionary is now LD. What does that mean? It means that in order to invert your map, what you need is to compute the sparse code: compute the dictionary which sparsifies this w, and then by applying your dictionary D you recover x. How do you learn this dictionary? You learn it by taking your examples and, each time, trying to find the dictionary which best reproduces them; that is done by optimizing the dictionary in this neural network, again with stochastic gradient descent.

Let me show examples. These are examples of faces. The top images are the training examples on which you train your network and optimize the dictionary, and then you reconstruct these images; that is what you see here. Then you try with new realizations of your random process: these are the testing images; you decompose them, you now use the dictionary that was computed from the first images, you try to reconstruct, and indeed the reconstructed images look good. That means you have indeed inverted your bi-Lipschitz map. You can train your network on bedroom databases, that is typically what an autoencoder does: these are the training images, these are the reconstructed ones, this is what you do on the testing images, and these are the reconstructed ones. Now, what happens if you take two white noises, from which you compute two images, and you do a linear interpolation in the noise domain? The noise domain in this case is essentially the domain of the scattering coefficients which have been averaged, and which are again regular to the action of diffeomorphisms. When you do a linear interpolation and reconstruct an image, you see how one image progressively warps into the other. If you do it with another pair of images you see the same kind of thing; if you do it on bedrooms, you warp one bedroom into the other.

Now, synthesis. You have represented your random process with your scattering coefficients and you have learned the dictionary which represents these coefficients in a sparse way. The synthesis now amounts to drawing a Gaussian white noise at random, pushing this white noise through the generator, and computing an x. So you have defined a random process, and these are its realizations for different draws of the white noise: you have produced a random generator of faces. That is essentially what autoencoders are doing; what I am showing here is that you can do it by learning a single matrix, this dictionary. If you train it on bedrooms it does not look as good, because bedrooms are much more complicated, but you see that the realizations look a bit like geometric bedrooms; at least they don't look like faces. What you get here is a totally different type of stochastic process model, where everything is captured in the dictionary D, and the random excitation defines the space of your free variables, let's say the entropy of the random process.
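A schematic of the generation and interpolation just described, assuming some trained decoder g (the learned inverse map); the decoder here is only a placeholder linear map, not the actual construction, and the dimensions are arbitrary.

```python
import numpy as np

def generate(decoder, latent_dim, n_samples=4, seed=0):
    # New realizations of the learned random process: draw Gaussian white noise
    # and push it through the (learned, fixed) decoder.
    rng = np.random.default_rng(seed)
    return [decoder(rng.standard_normal(latent_dim)) for _ in range(n_samples)]

def interpolate(decoder, w1, w2, n_steps=8):
    # Linear interpolation in the white-noise domain; each intermediate point
    # is decoded back to the signal domain, warping one image into the other.
    ts = np.linspace(0.0, 1.0, n_steps)
    return [decoder((1.0 - t) * w1 + t * w2) for t in ts]

# Placeholder decoder: in the talk this is the learned dictionary followed by the
# inversion of the wavelet/averaging representation; here it is just a fixed linear map.
rng = np.random.default_rng(0)
G = rng.standard_normal((1024, 128))
decoder = lambda w: G @ w

samples = generate(decoder, latent_dim=128)
path = interpolate(decoder, rng.standard_normal(128), rng.standard_normal(128))
print(len(samples), len(path))
```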
You can also do things that are important in particular for simulations in physics: when you compute, for example, fluid dynamics, you may have a very coarse-grid approximation, for example in climatology, and you would like a model of what happens at fine scales. So this is the input, the very coarse-scale approximation; but you have learned your dictionary, now you put in noise, and these are all the possible realizations which have the same low frequencies, where each realization, corresponding to a different noise, corresponds here to a different facial expression. So we are in a world which is very unusual from a math point of view, but what I want to say is that there is a lot of math to be done here; it is not just algorithms.

Now, coming back to deep networks and the topic itself: essentially, these kinds of structures have the complexity of a Turing machine. A Turing machine is specified by a program; here the program is essentially these weights, and you have, let's say, hundreds of millions or billions of weights. With a Turing machine, if you want to write code which has no bugs, you structure your program; you don't build a program with millions of unstructured lines of code. Currently we are training these machines directly, by training the millions of coefficients, which makes them extremely complicated. In some sense, what I am showing here is that when you begin to understand the math, you can structure these machines, and you see different kinds of functions appearing: the first few layers, in this case, correspond to reducing the geometry of the problem; then you see phenomena appearing which are more related to sparsity; and there are probably many, many other phenomena that are absolutely not incorporated here. This being said, the math problems remain very widely open. The kinds of questions you would like to ask are, again: what is the regularity class, what are the sets of functions that can be approximated by such networks, what kind of approximation theorems hold, how many neurons do you need to reach a given precision epsilon? These questions are totally not understood. OK, well, thanks very much.

Question: Yes, well, thank you for your talk. First of all, when you were showing the example of the reconstructed images, between the training set and the testing set, it seemed to me the test reconstructed images were much closer to the originals than the training ones. So that means that I inverted the two; that is exactly what it means, because normally it is the contrary. Yes, you are right, and it shows that I swapped the two columns. Thanks; no, there is no miracle, which is too bad, but there is no way you can do better. Thank you very much.

Question: Wow, I'm very impressed; it's very interesting. For this last topic, of modeling the network in terms of wavelets and then sparsity, there is a kind of prediction for how much information there should be in a network that comes from statistical learning theory: you can relate the generalization ability, or the error rate of the network, to the amount of information in the network. It would be very interesting to see if the amount of information in your sparse model matches that prediction, because in some sense the wavelet part is universal, so that is not much information; all the information is in this dictionary. The information bounds that you get currently, I mean, if you are referring to the information bounds that people have been computing on these networks, they are essentially computed from the norms of the operators and are extremely crude. Yes, that's right, but there is this target that people want to reach, which is to relate such a control to the generalization, the testing error.
Yes, so you are right, that could be done; right now we indeed have to see how to understand the effect of this learning of a dictionary. The problem is much simpler than in a usual network, because the learning is aggregated in a single matrix, but because it is very nonlinear and you apply it at each layer, it is indeed not so simple. So yes, you are right, that is an interesting way to think about it.

Question: This is a question about the various parts of your talk. In one part you explain that with regression, with approximation, it is very difficult to capture higher-order correlations, and at the very end you have this random Gaussian noise, this data, that produces something with very high correlations. These are two points of view, but it is not clear they are the same. OK, so let me explain. The reason why you can compute correlations across scales, angles and so on is the nonlinearity, which essentially realigns, in some sense you can view it as a realignment of the phase. Now, in these syntheses, what you have is the dictionary, which is here, so you can view these coefficients as exciting different patterns, and the different ReLUs which are here are essentially selecting these patterns randomly; but you can also view it as a realignment of phase. What is absolutely not there in the second one is the maximum entropy principle, the optimization: the second one is much more complicated than the standard statistical physics framework, because there is really no quantity that you are optimizing, you are not controlling anything. The second one is in fact very close to something which is used all over signal processing, which is autoregressive models. Think of it: how do you build a simple Gaussian approximation? You begin with your random process, you whiten it with an operator, and then you invert this operator to reconstruct your random process; that is what an autoregressive filter does. This is the equivalent of an autoregressive filter, but in a totally nonlinear way, because you did not begin with something Gaussian. That is one way to view these autoencoders.

Question: Are these methods transferable to learning processes on graphs, social networks for example? That I don't know. You mean learning processes on graphs? Sorry, I didn't understand at first. OK: there are a number of groups who have been trying to do that. Basically what you need is to reintroduce the notions of translation, deformation and scale on a graph; all these notions can be defined, it is just that the translation operator is much more complicated. You can do it by going to the spectral domain: you compute the spectrum of the graph with the Laplace-Beltrami operator, and from there you can do similar things. So people have been trying to do that; the math is more complex because the translation operator on a graph is much more complicated. OK.