Welcome to the first session of this summer school. My name is Marc Le Large and I am the chairman of this first session. As Jean told us, we will have three 15-minute talks. The first will be given by Marie-Lou Gavrier, who did her PhD in Paris at the ENS. She is now a postdoc at NYU, at the Flatiron Institute. Hello, Marie-Lou. Hello, everyone. Can you hear me? Yes, that's fine. Perfect. Hello everyone. I would like to start by thanking the organisers for putting together this virtual version of the conference. As the first speaker, I will dive directly into the topic. As you know, deep learning has brought impressive progress to the practice of machine learning. And on the theory side, it is also a very active research direction, trying to understand the great efficiency of these models. One of the ways you can try to do theory on learning is through the angle of statistical mechanics. And that is what I want to talk to you about today. This is joint work with Jean Barbier, Florent Krzakala and Lenka Zdeborová. So, wait. So, everyone. Maybe I should... Can you see this as well? Ah. OK. Sorry. OK. So, first of all, I would like to remind you about the statistical mechanics framework for looking at learning, as I'm sure it will come up multiple times during the conference. And I will do so by looking at the elementary model of the perceptron, where we only have one output neuron. OK, so I think there are three important points to have in mind when we speak about the statistical mechanics of learning. And the first one is that we are going to focus on the teacher-student scenario. So what does that mean? It means that we are going to look at data that are actually generated from the model that we want to study.
So in the case of the perceptron, it means that we choose some input data of dimension n, and we have p samples of them that we can collect in a matrix. Then we multiply them with a given weight vector W star, which is going to be our teacher vector. And after a non-linearity, that gives us the associated targets Y, the scalar outputs, and we have p of them, so we combine them in a vector of size p. OK, so if you want... And then we are going to ask whether a student perceptron model is able, from the knowledge of X and Y, the training data, to recover the W star of the teacher. And if you want, it's a minimal model of a learning problem where we know that there actually is a rule that generated the data, and we are trying to learn it back. And it follows the tradition of physics of trying to find some minimal modelling of the phenomenon we are trying to study, here learning. The second point is to look at the typical case. So, fixing a given distribution for the inputs and for the teacher weight vector, we are going to look at the performance of learning on average, over the Bayesian posterior of the student weight vector given the training data. So for example here, we are looking at the generalisation error, which is the difference between what the student predicts and what the teacher, which gave the true outputs, actually predicts. And this again is taken on average over the posterior. And this gives the typical case, what you will typically observe, as opposed to the worst case, which is also a very popular way of studying a learning problem in statistical learning theory. So that's also a bit physical, in the sense that you are going to look at what you typically see in nature. Okay.
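To make the teacher-student setup concrete, here is a minimal sketch of the data-generating process just described. The Gaussian inputs, the binary teacher, the sign non-linearity and the 1/sqrt(n) scaling are assumptions of this sketch, not necessarily the exact choices on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 300                            # input dimension and number of samples
X = rng.standard_normal((p, n))            # p inputs of dimension n, collected in a matrix
w_star = rng.choice([-1.0, 1.0], size=n)   # teacher weight vector W* (binary weights here)

# After the non-linearity (a sign function in this sketch), the p scalar targets,
# combined in a vector of size p:
y = np.sign(X @ w_star / np.sqrt(n))
```

The student's task is then: given only `X` and `y`, recover `w_star`.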
And the last point is that you are going to compute those kinds of observables in the limit of infinitely large models, the so-called thermodynamic limit, using the tools from the physics of disordered systems, doing what we call a mean-field analysis. And in the case of the perceptron, we would choose the ratio of the number of samples over the size of the input, or the number of parameters in the weight vector, to be fixed as we take both of these dimensions to infinity. And we would find that the generalisation error in this limit converges to a function of precisely this parameter alpha that controls the difficulty of the task. And given simple enough assumptions on the distributions of X and W, we actually have a very good understanding of the mean-field picture of this model. And this means that we have the following points, which I will illustrate in the case of binary weights and Hadamard inputs for the perceptron. So the first thing that we have is an information-theoretic analysis of what it is possible to get given the training data. So it's really the optimal generalisation error, which is typically computed with a mean-field replica computation, and which is here plotted in red as a function of alpha. And you can see that it has an interesting phenomenology, because it decreases with alpha and at some point it even drops down to zero, to perfect generalisation. This is something that we call a first-order phase transition in statistical physics. On top of this mean-field replica computation, we now also have a rigorous mathematical proof that the replica formula is really giving the information-theoretic threshold. On top of this information-theoretic analysis, we have inference algorithms that are related to those mean-field methods, and those are the message-passing algorithms.
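As a small illustration of the kind of observable being computed: for a sign non-linearity and Gaussian inputs (an assumption of this sketch), the generalisation error of a student depends on a single scalar, its normalized overlap m with the teacher, through the standard formula arccos(m)/π. This is the mechanism by which mean-field theory reduces the error to a function of a few order parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

w_star = rng.choice([-1.0, 1.0], size=n)                 # teacher
w = np.where(rng.random(n) < 0.9, w_star, -w_star)       # a student agreeing on ~90% of signs

m = w @ w_star / n                     # teacher-student overlap (both vectors have norm sqrt(n))
eps_pred = np.arccos(m) / np.pi        # generalisation error predicted from the overlap alone

# Monte-Carlo check: fraction of fresh Gaussian inputs on which student and teacher disagree
X = rng.standard_normal((20_000, n))
eps_mc = np.mean(np.sign(X @ w) != np.sign(X @ w_star))
print(eps_pred, eps_mc)   # the two should be close
```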
For the perceptron, we have approximate message passing (AMP), and its performance on the inference, the reconstruction of W for the perceptron, is here given as the black dots. And finally we have the state evolution, which is a statistical analysis of the performance of such algorithms. So, in the average case, in the limit of infinitely large models, we are able to give a prediction of how the inference algorithm is going to behave, and this is the red line, sorry, the green line on the plot. And you can see that it perfectly follows the black dots, but it also confirms that there is a computational gap between what is information-theoretically possible and when inference is actually possible for AMP, which, I forgot to mention, runs in polynomial time. OK, so we have all these types of interesting phenomenology, and it gives you an idea of the kind of analysis that we are after. Now, how do we go from the perceptron with one neuron to a deep neural network? And in this talk I'm going to focus on the next simplest case, let's say, where I have just one hidden layer Z, and so I have the two matrices W1 and W2 to infer from a training data set. First let me tell you why this is a challenging inference problem, and second I will tell you about the direction that we have been exploring, which exploits the possibility of looking at structured weights. OK, so what is the problem at hand? We have a training set: let's say I have an input of size n and an output of size m, and I collect p pairs of them, so that I know a matrix X and a matrix Y. And then I want to infer the two weight matrices W1 and W2, where here I called k the size of the hidden layer. Well, the first idea to tackle this inference problem is to decompose it into a series of sub-problems, see if we can have a mean-field analysis of each of the sub-problems, and recombine them to have a global solution to the problem that we are interested in.
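The one-hidden-layer teacher just described, and its decomposition into sub-problems, can be sketched as follows. The tanh activation, the linear readout and the scalings are illustrative assumptions; the point is the shapes and what is or is not observed:

```python
import numpy as np

rng = np.random.default_rng(1)

n, k, m, p = 50, 20, 10, 200          # input size, hidden size, output size, samples
X  = rng.standard_normal((n, p))      # p known inputs of dimension n
W1 = rng.standard_normal((k, n))      # first-layer teacher weights (to infer)
W2 = rng.standard_normal((m, k))      # second-layer teacher weights (to infer)

Z = np.tanh(W1 @ X / np.sqrt(n))      # hidden-unit states: NOT observed in the training data
Y = W2 @ Z / np.sqrt(k)               # observed outputs

# Sub-problem 1: from Y, infer W2 and Z jointly -- a matrix factorization of rank k.
# Sub-problem 2: once Z is known, infer W1 from (X, Z) -- a series of GLM problems.
```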
Recently, this idea of combining mean-field solutions has brought a lot of interesting works tackling more and more complex models in the field, and, for those of you who may be familiar with it, it is for example a key idea behind the multi-layer approximate message passing algorithm. In our case, what is this decomposition? Well, first, from the knowledge of the outputs we need to infer both the second-layer weight matrix and the states of the hidden units, because these are something that we don't observe in the training data. Once we have the states of the hidden units, we only have to infer the values of the parameters of the first layer, as we know the inputs during training. So if we look at those two problems, well, the second one is actually like a series of GLMs, and we should be able to manage, but the first one is trickier. We really need to factorize a matrix into W2 times Z, and this matrix factorization has a rank given by the number of hidden units. To understand how difficult this inference problem is, what we need is to understand the scaling of the size of this hidden layer, the scaling of the rank of the matrix we need to factorize. So we take n to infinity, and let's say that we keep k of order 1. Well, in this case we are in the low-rank matrix factorization regime, and we have a good mean-field understanding of this model, thanks to Thibault Lesieur and collaborators in particular. And within the context of deep neural networks, it means that we have a finite number of hidden units. Those models are actually called committee machines, and they are well understood within the mean-field framework. There is a great body of work, with a few references that I give you here on the slide, that uncovers really interesting phenomena in this model, and I encourage you to check it out if you are not familiar with it. But now, if we think of deep neural networks in practice, I mean, how do they look?
Well, they look more like all the layers having the same kind of number of units; there is not an infinitely large input layer and then only a few hidden units. So can we tackle the case where k is actually scaling like n? Well, in this case we are in the high-rank matrix factorization regime, and for this we don't have, to this day, a good mean-field understanding of the inference problem. So how can we go around this problem and still tackle our inference of deep neural networks with an extensive number of hidden units? Well, the second idea that we have been exploring is to ease the problem a bit, and instead of inferring the complete W2 matrix, we look at a certain structure of the W2 matrix. So we will assume that W2 is decomposed into a known, random part W2 tilde and a diagonal matrix, that is, a matrix with non-zero entries only on its diagonal, and these diagonal entries are the degrees of freedom of the learning in the second layer. So of course by doing so we are losing something, because we used to have of order n squared parameters in the second layer and now we have of order n. Nevertheless, it's a model that has been considered in the literature before, both in practical papers, where the concern was about speed or memory, and in other theoretical papers, where this simplification was easing some theoretical computations. And another reason why we like it is that in the signal processing literature this inference problem is actually known as blind calibration. And additionally, with this blind calibration we can do a mean-field analysis. So let me just remind you what this blind calibration problem is. We have a set of observations Y: we have p observations of dimension m that correspond to the measurements of p input signals of dimension n, also of the same type as the perceptron, but of which we only know the forward model up to a diagonal matrix S that, in a sense, needs to be calibrated while we are reconstructing the X.
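A minimal sketch of this blind calibration setup, under one common convention where the diagonal S acts as per-measurement gains (the gain range, the row-sparse signal prior and the scaling are assumptions of this sketch):

```python
import numpy as np

rng = np.random.default_rng(2)

n, m, p, rho = 100, 60, 2, 0.3     # signal dim, measurement dim, # signals, sparsity
W = rng.standard_normal((m, n))    # known random part of the forward model (W2 tilde)
s = rng.uniform(0.5, 1.5, size=m)  # unknown calibration gains: the diagonal of S

# Signals to reconstruct, row sparse: only a fraction rho of the rows are non-zero
support = rng.random(n) < rho
X = rng.standard_normal((n, p)) * support[:, None]

# Observations: the forward model is known only up to the diagonal gains s
Y = s[:, None] * (W @ X) / np.sqrt(n)

# Blind calibration: given Y and W, infer both the gains s and the signals X.
```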
So this is why it's called blind calibration: we need to infer both X and S at the same time. And for this problem there was a calibration AMP (Cal-AMP) inference algorithm that was proposed in 2013 by Christophe Schülke and collaborators, and in a recent work we propose the corresponding state evolution and the replica free energy. So, if you want, for the mean-field picture of this model to be complete, what is missing is a rigorous proof of the replica computation giving the information-theoretic limits. OK, so to look in practice at how these mean-field methods for solving the inference problem are performing on blind calibration, we can focus on the case of sparse recovery. This means that, while inferring S and X, we are going to consider signals X that are row sparse, that is, with only a fraction rho of the rows being non-zero. And we will have two parameters controlling the difficulty of this task: alpha, the ratio of the output dimension over the input dimension, and the sparsity rho. And if we naively compare the number of unknowns to the number of knowns, we find a theoretical naive threshold, under which we cannot reconstruct, that I call alpha min, depending on the sparsity and the size of the training set, that is, the number of samples P that we have. And here, if we compare this alpha min with the performance of the Cal-AMP reconstruction, we can look at those two diagrams, which have the parameter alpha on the y axis and the sparsity rho on the x axis. In pink you have the alpha min curve, and in colour you have the performance of Cal-AMP: white is good reconstruction and darker colours are bad reconstruction, for S and X respectively. And you can see that, even with a very small number of samples, P equal to 2, whereas M and N are going to large values, like maybe hundreds of thousands, we are able to reconstruct very close to this alpha min. And the second thing that we can check is that indeed the state
evolution that we propose is matching the performance of the AMP algorithm. Here the state evolution is in solid lines while the AMP is in dotted lines, and for different input-to-output ratios and for different numbers of samples we indeed find a very good agreement. So, just to wrap up: the perspective of this work is to go back to this problem of weight inference in deep neural networks, and to combine the calibration AMP sub-problems in order to infer the weights of a neural network that is deep, with an extensive number of hidden units. And the challenge that remains is that, back in the teacher-student scenario, we will need to assume that we already know the random fixed parts of the weight matrices. So this is something we are still thinking about, and that we are happy to discuss with whoever is interested. And with this, I would like to thank you for your attention. OK.