Okay, let's get started with this last lecture today. But before the lecture: today is actually a very special day, because it is the birthday of Abdus Salam, and we normally schedule the Salam Distinguished Lectures to coincide with his birthday. Also on this birthday, the family of our founder and Nobel laureate Abdus Salam announces the Spirit of Salam Awards, and they have communicated to us the winners of the 2020 Spirit of Abdus Salam Awards. These awards recognize those who, like Salam himself, have worked tirelessly to promote the development of science and technology in disadvantaged parts of the world, apart from their own contributions to science; both of this year's winners are very distinguished scientists.

So this year we have M.S. Narasimhan. The citation for him reads as follows: for his long association with ICTP as head of the mathematics section and a member of the Scientific Council, and for his commitment to supporting mathematicians in the developing world, helping early-career mathematicians to develop very successful careers. These awards are actually presented in August, together with the Diploma Award ceremony, but the announcement is made today. So let's give a big hand to Professor Narasimhan. Just to say a few words about him: he is an eminent mathematician with a long and distinguished career and many prestigious awards, including the Chevalier de l'Ordre National du Mérite, awarded by the President of France, the Padma Bhushan, awarded by the President of India, and so on. And Abdus Salam's son, Ahmad, who coordinates these awards, wrote this nice message: he has known Narasimhan for a long time, and he says that Professor Narasimhan had the benefit of knowing and engaging with Salam firsthand and was inevitably greatly influenced by him. "As a citizen of a great country which showed great love for our father, we know how proud Abdus Salam was of his relationship with Professor Narasimhan."

The second award goes to another very close supporter and friend of ICTP, Erio Tosatti, for his leading role in the establishment of high-level training and research activities in condensed matter physics at ICTP, and for mentoring many capable young scientists from developing countries, working with them to elevate their research to international standards. So let's give a big hand to Professor Tosatti. Erio was in fact brought in by Salam to build the condensed matter group here, so he is one of the co-founders and senior members of ICTP's Condensed Matter and Statistical Physics section. He served as interim director from 2002 to 2003, and he is the recipient of many distinguished awards, like the 2018 Enrico Fermi Prize of the Italian Physical Society, which is that society's highest honor. He was also inducted into the Accademia Nazionale dei Lincei and was appointed a foreign member of the Chinese Academy of Sciences, as well as, I think, a foreign member of the US National Academy of Sciences. About him, Salam's son Ahmad writes: "My father mentored Erio and wanted Erio in turn to mentor and guide the young scientists who are the pride of ICTP. We know how much father valued and respected Erio. He was always very generous with his time and support, and thus is a wonderful example of the spirit of Salam."

So with these two very pleasant announcements, we can now move to the scientific part of today's session.
Today's talk is about statistical physics of inference and machine learning. I guess today we will hear the connection between the first talk and the second talk. Yes, hopefully. Thank you very much. It is of course a special honor and pleasure to be here lecturing for this special Salam lecture, on the birthday of Salam and with this announcement of the Salam Awards, and I really wish to congratulate my two colleagues and friends for receiving this award.

Today I want to talk about inference. I will start by opening the first textbook that I find on inference, and you will find that inference has to do with trying to find a hidden rule, or maybe hidden variables, from data. So it is a branch of statistics, mostly. If you look at it in a restricted sense, probably what you will find as a first exercise in inference is something like this. You have an urn with 10,000 balls in it. You draw 100 of them. You find that 70 of them are white and 30 are black. What is your best guess for the composition of the urn? How reliable is this guess? Okay, simple exercise. You sit down for ten seconds and you say: I will assume that there are only black and white balls in the urn, because that is the best guess I can make. I assume there is a fraction x of white balls, and so the probability to pick 70 white balls out of 100 is C(100,70) x^70 (1-x)^30. The log-likelihood of x is the logarithm of this number, and it is maximum at x* = 0.7. That is my best guess. But I get something else as well: what is the probability that the composition of the urn is actually 60% white balls? This too I can read off from the same distribution.

So this is typical of Bayesian inference. I have unknown parameters; here it was the composition of the urn. I have some measurements: how many white balls I have found. I have my prior; my prior here was a completely unknown composition, but there was some prior in the moment I said, for instance, I will assume that the urn is made of black and white balls and not pink balls. That is a prior. And there is a certain likelihood, which is: if I assume the composition x of the urn, what is the probability of a certain outcome? And then I used, without saying it explicitly, because that is what you would do naturally to solve this exercise, Bayes' rule. You have the posterior, which is the probability of the unknown parameter x, the composition of the urn, given the data: p(x|y) is proportional to p(y|x), the probability of the observation given x, times the prior. Very standard.

Actually we did exactly that yesterday when we were dealing with error-correcting codes, when I was reconstructing a message at the moment of decoding. I want to reconstruct a message; let me call x the message that I want to reconstruct and y the message that I have received. I have one piece of information, which is the channel: for instance, I knew that if a certain bit has been sent, it is flipped with probability p. That is my description of the channel, and it gives me, given the sent message, the probability that I receive something. And then I had a prior, and the prior was the code book, the famous code book on which we had to agree: the receiver knows that the message was drawn from a code book. So what I was using yesterday, without quoting it explicitly, was again just Bayes' rule. And you may remember, just to refresh your mind about what we were describing yesterday, these were LDPC codes: I had a channel in which each bit can be flipped with a certain probability p.
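[As an aside, here is a minimal numerical sketch of the urn exercise, assuming a flat prior over the composition x and a simple grid discretization; the grid size and the choice to report the posterior tail below x = 0.6 are illustrative, not from the lecture.]

```python
import numpy as np
from scipy.stats import binom

# Urn exercise: 100 balls drawn, 70 white.
# Likelihood of composition x = Binomial(100, x) evaluated at 70 white.
n_draw, n_white = 100, 70

# Flat prior over x in [0, 1], discretized on a grid.
x = np.linspace(0.0, 1.0, 1001)
likelihood = binom.pmf(n_white, n_draw, x)
posterior = likelihood / np.trapz(likelihood, x)  # Bayes' rule with a flat prior

x_map = x[np.argmax(posterior)]                   # maximum-likelihood guess, x* = 0.7
print(f"MAP estimate: {x_map:.2f}")

# Posterior weight on compositions of 60% white or less:
# how implausible an urn with x <= 0.6 is, given the draw.
mask = x <= 0.6
print(f"P(x <= 0.6 | data) = {np.trapz(posterior[mask], x[mask]):.4f}")
```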
And then I had this description of the code book, which told me that the bits satisfy some algebraic equations on GF(2). Statistical inference is in the same spirit, but we will be going into a regime where there are many unknown parameters. Instead of having just one parameter, the composition of the urn, I will have many; that was already the case in coding, where I had many bits to decode. To find many unknown parameters you need many measurements. So n will be the number of parameters and m will be the number of measurements, and very often a control parameter which will be key throughout the lecture is: how much data do we have? How many measurements? That is measured by m divided by n, the amount of data per unknown parameter. And the questions that we ask are of two types. What are good algorithms? How do we do this in practice, and can we do it efficiently or not? And then the prediction of the quality of the inference: what would be the performance of the best possible algorithm? Are there phase transitions? We know there are some, because we have encountered them already in coding, and where do they appear? And can we also monitor the performance of algorithms?

So I will spend part of this lecture explaining one example specifically; I thought it would be interesting to have one specific example in which to develop the ideas a bit. Then, in the second half of the talk, I will go to something a bit more general and also apply it to machine learning.

These are two images that probably look pretty similar to you. Actually, the right-hand image has been obtained from the left-hand image by filtering it: you decompose the image in wavelets, and among the 65,000 or so wavelet coefficients you keep only 25,000, and you reconstruct the image from these 25,000 dominant wavelet coefficients. You get something which is okay: you have lost a little bit of information, but for the eye it is fine. You do that every day; as soon as you put an image on your hard disk or wherever, there is a compression which works basically this way. So this is very well known: it is data compression.

One question that was asked about ten years ago, or a little more, in information theory is the following. In the process that I have described to you, I first acquired the image, with all its pixels, and then I processed it in order to compress it and put it in a file. So I did it in two steps. Is it possible to acquire the image directly in compressed form? That is called compressed sensing. For taking a picture of your vacation it is not so relevant, because you have enough disk space and so on; you should care because of environmental issues, but in practice we don't. But you care a lot if you think of acquiring images in some specific contexts, like, for instance, a magnetic resonance imaging scanner. Imagine that you can acquire the image directly with a certain speedup: maybe you spend half the time in the scanner and it is used much more efficiently, and so on. This is typically the kind of application people have been thinking of, and compressed sensing has become a very interesting and important topic in information theory because of that. So I will describe it. It is a vast field, but you can describe it in a specific framework, a mathematical problem, so that you understand it.
It's a kind of archetype of compressed sensing, very well defined in a mathematical sense, and one can, I think, understand what goes on. This is the case in which you are doing linear measurements. One typical example is tomography. You have a sample and you want to understand its composition. You send it to the synchrotron; the synchrotron sends light through the sample, you see how the light is absorbed by different pieces of the sample, and you do that at many angles. Each angle is a measurement, and from all these measurements you can reconstruct the composition of the sample. That is done routinely at synchrotrons. Mathematically it is easy if you have as many measurements as unknowns: what you measure is a linear transform of your sample, called the Radon transform, and you reconstruct by the inverse Radon transform, like Fourier and inverse Fourier, and it works.

But the question that I will ask is the following. Suppose that I have a signal x with n components and I do m linear measurements, so my measurements are defined by a measurement matrix F, an m times n matrix. My question will be: can I reconstruct the signal x from the measurements? I know my apparatus, I know what measurements I am doing, I know the matrix F, and I read off the measurements y. Can I reconstruct x from them? In principle that is just linear algebra, so it is very easy. But the question I will ask is: can I do it when m is less than n? The first answer you will give me is no, you cannot, because the solution is not uniquely defined; the system is underdetermined. But I will add another ingredient: I will suppose that x is sparse. Sparse means that a certain number of its components are zero. For instance, thinking again of the image I was looking at before: if you look at it in the right basis, like a wavelet basis. In a wavelet basis, typically, you take two neighboring pixels and represent them by their sum and their difference; but if you have a uniformly colored area, the difference of two pixels is zero. Because of that, you saw it, a lot of the wavelet coefficients were very close to zero. So I will assume that my unknown is represented by this vector x in a basis in which I know the vector is sparse, like the wavelets. Can I use this sparseness information to reconstruct the signal, in spite of the fact that I have a linear system with too few equations with respect to the unknowns?

So this is my linear problem: x is the unknown, y is the measurement, and I have m measurements, with m less than n. But I assume that only r components of x are different from zero, and the thing that I will do is exploit this sparsity by making a guess. For instance, let me assume, to start with, that the non-zero components are the last r components of the x vector, so that the first n - r components are zero. Then I look at my linear system; of course the zero part does not contribute, that block of the matrix is irrelevant, and I get basically that y equals F times the lower part of x. So now the thing that I will assume is that my number of measurements m is larger than r, the number of non-zero components.
So I am going into an over-constrained linear problem, and then there are two things that can happen. If the matrix F is chosen randomly, say with IID entries, either I guessed well, my guess for the zero components was correct, and then, although I have more equations than unknowns, all the equations are compatible with each other, so there is a solution; or I did not guess correctly which components were non-zero, and then I will find a contradiction. So this is an algorithm, an algorithm that will give you the solution. It tells us that as soon as m, the number of measurements, is larger than r, the number of non-zero components, I have an algorithm that finds the solution. Is it more or less clear? Okay.

So this is a nice algorithm. It has only one drawback: I have to try all the possible guesses, and there are n choose r possible guesses to check. So this algorithm gives me, in principle, actually, the best possible performance, but it is like the algorithm imagined in Shannon's theorem, you remember: it is not practical, because as soon as n and r are a few hundred, it is very clear that I will never make it this way. So I will look at this problem now, but this gives me a benchmark: I know that ultimately, with infinite computing power, there would be a solution to my problem as soon as the number of measurements is larger than the number of non-zero variables.

We will look at this problem in the thermodynamic limit, in which there are many variables, n goes to infinity, the non-zero variables have a certain density rho, and the equations have a density alpha; that is, everything scales with n: the number of equations (measurements) and the number of non-zero variables. So what I showed you with my hands, and it was easy to get, is what linear algebra tells you: there exists an algorithm, the problem is solvable, basically by enumeration, when alpha is larger than rho. But that is an exponentially slow algorithm.

Then people developed what is called the L1-norm approach. It is an approximation to this problem in which you turn it into an optimization problem: you want to find the n-component vector x such that the m equations y = Fx are satisfied, so it must be compatible with the measurements, and you want to minimize its norm. Ideally you would like to find the vector compatible with all the measurements which has the smallest number of non-zero components; that would amount to minimizing what is called the L0 norm, which is just the number of non-zero components. That is extremely hard to do, we don't know of any efficient way to do it, so what people tried is to relax it and minimize the L1 norm instead. The L1 norm is just the sum of the absolute values of the components. It means that you put each variable x_i in a potential like that, the absolute value, and you want to find the x that minimizes the sum of the |x_i|. Of course this potential tends to bring the x_i towards zero, but the x_i also have to be compatible with the measurements: they have to lie in the linear subspace defined by them. And this is good, because this is a convex problem: you have a linear subspace and a convex objective. So it can be solved efficiently; there are fast algorithms for finding solutions to convex problems.
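[To make the L1 relaxation concrete, here is a small sketch. It casts "minimize ||x||_1 subject to Fx = y" as a linear program over variables (x, t) with -t <= x <= t; the use of scipy's generic linprog solver, and all the sizes in the demo, are illustrative stand-ins for the fast dedicated solvers used in practice.]

```python
import numpy as np
from scipy.optimize import linprog

def l1_reconstruct(F, y):
    """Minimize ||x||_1 subject to F x = y, written as a linear program.

    Variables are (x, t) with -t <= x <= t; minimizing sum(t) then
    minimizes the L1 norm of x.
    """
    m, n = F.shape
    c = np.concatenate([np.zeros(n), np.ones(n)])     # objective: sum of t_i
    # Inequalities:  x - t <= 0   and   -x - t <= 0
    A_ub = np.block([[np.eye(n), -np.eye(n)],
                     [-np.eye(n), -np.eye(n)]])
    b_ub = np.zeros(2 * n)
    A_eq = np.hstack([F, np.zeros((m, n))])           # measurement constraints F x = y
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=[(None, None)] * n + [(0, None)] * n)
    return res.x[:n]

# Small demo: a sparse signal recovered from fewer measurements than unknowns.
rng = np.random.default_rng(0)
n, m, r = 50, 30, 5
x_true = np.zeros(n)
x_true[rng.choice(n, size=r, replace=False)] = rng.standard_normal(r)
F = rng.standard_normal((m, n))
x_hat = l1_reconstruct(F, F @ x_true)
print("reconstruction error:", np.linalg.norm(x_hat - x_true))
```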
That's one approach to the problem. The second approach, which is the one that we developed as physicists, uses statistical physics tools, and it starts by really defining an inference problem. I have done my measurements y_mu and I want to infer x, so I want p(x|y), the probability of x given the measurements. This p(x|y) is made of two pieces. I have a prior; here I have used a prior saying that I know I am looking for a sparse signal, so: with probability 1 - rho, x_i is equal to zero, and with probability rho I draw x_i, for instance, from a Gaussian. That would be a Gauss-Bernoulli prior, but I could take some other function here. And then I have all the constraints due to the measurements.

So this defines a problem of statistical physics: I have n variables x_i, they have their intrinsic measure, and they interact. I like to draw this statistical physics problem with the factor graph that I showed you yesterday, and the factor graph is the following. For each variable x_i, which are the circles here, I have my prior p_0, which is this Gauss-Bernoulli prior, and then there are the constraints, which amount to interactions between the variables. For each measurement I have a constraint, a square here, and you see that the square connects the values of the x_i by a linear constraint, and this linear constraint has weights which are the components F_mu,i of the measurement matrix; mu is the index of the measurement, i is the index of the vector component.

So this factor graph is a generalized spin glass, and I can run all the machinery that I described yesterday: I can write the belief propagation equations, the mean-field equations I was describing. It means that I will have messages running around this graph: there is a message m going from i to mu, and a message m going from mu to i. You can write the BP equations; they are exactly the equations that I wrote for you yesterday, and I iterate them. This is called BP, belief propagation, applied to this problem, and it has a long history; some steps are recalled here. It turns out that in this case it is a bit complicated, because my variables are real variables, so the messages exchanged are functions on the reals. But it also turns out that, because this system is infinitely connected (you see, every constraint is connected to all variables, and every variable is connected to all constraints), the message that is sent, for instance, from i to mu can be summarized by its first two moments: basically it can be substituted by a Gaussian distribution. So the messages really close on a small number of parameters, just two real numbers, a mean and a variance, for each message being sent. Forget about the exact equations; actually I have not written them out completely, they would take a couple of slides and they would not be illuminating. Just remember that you are sending messages along this graph, and there are four numbers sent along each edge. Actually this can be simplified even further, after a trick which in the jargon is called going to the TAP description, and you can reduce it to just 2n variables. This is an algorithm; it is exactly the mean-field equations described for many years by spin glass physicists, already in the 80s, but it acquired its own name in this special context: it was called generalized approximate message passing, GAMP.
Okay. The thing which is interesting about this algorithm, which is just a mean-field algorithm in which I have these messages sent along the graph and I iterate them until I reach a fixed point, from which I can reconstruct the unknown x that I wanted to know, is first of all that it is a very fast algorithm: its computer time scales linearly with the size of the system. And you can also study its performance analytically and know when it works and when it does not.

The analytic study has two aspects, and here I am really starting to use all the tools that I was describing yesterday. One aspect is to use the replica method. The replica method allows me to compute an object which is called, in this context, a free entropy: you can look at P(D), the probability that the reconstructed x is at a distance D from the original signal s. P(D) scales exponentially with n, so Phi(D) = (1/n) log P(D), and this Phi(D) tells us whether the Bayesian probability that I am looking at is concentrated in regions close to the original signal or far from it; that is what we will look at in the next slide. The cavity method, on the other hand, is a way to analyze the message-passing equations that I was describing with my hands in the previous slide, to see how they iterate and to monitor whether the messages converge or not. Basically, in the cavity method one finds that the iteration of BP creates a flow along the gradient of this free entropy. So it gives full analytic control of the algorithm, of the BP equations. That is spectacular; it is not so frequent that you can get full control of an algorithm like that.

So here is an example, an example in which my signal has density rho_0 = 0.4, so it has 40% non-zero components, and I am trying to reconstruct it with a number of measurements alpha between 0.56 and 0.62. If I start with many measurements, alpha = 0.62, that is 62% of measurements with respect to the number of variables, this is the free entropy: it is peaked at D = 0. It tells me that the Boltzmann weight of my statistical physics problem is concentrated at distance zero, concentrated around the pattern that I am looking for, around the x that I am trying to reconstruct. That is good, because it is the most probable outcome, the flow of the algorithm will follow the gradient of this free entropy, and there is no metastable state. Indeed, in this case, for alpha = 0.62, which is here, if I look at the number of iterations, after around 80 iterations I reach a fixed point, and this fixed point is exactly the signal that I wanted to reconstruct. So it works very efficiently. Now, if you have a somewhat smaller number of measurements, for instance alpha = 0.56, the free entropy still peaks around the signal that I want to reconstruct, it is still the most probable point, but unfortunately at a distance of around D = 0.15 from it there is a secondary maximum, and this maximum corresponds to the glassy states that I was describing before. Because when I start the algorithm I don't know where the signal is, I start far away, so I will flow into this secondary maximum and stabilize there.
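[For the curious, here is a highly simplified sketch of the message-passing idea in code. It is the soft-thresholding variant of AMP, the one associated with the L1 problem, rather than the full GAMP with a Gauss-Bernoulli prior discussed in the lecture, and the adaptive threshold rule is just one common heuristic choice. The key structural ingredient, the Onsager reaction term added to the residual, is what distinguishes AMP from naive iterative thresholding.]

```python
import numpy as np

def soft_threshold(u, theta):
    # Componentwise soft threshold: the proximal map of the L1 norm.
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def amp_sparse_recovery(F, y, n_iter=100, kappa=1.0):
    """Minimal AMP iteration for y = F x with sparse x.

    A sketch of the message-passing fixed-point iteration; the full
    GAMP algorithm with a Gauss-Bernoulli prior replaces the soft
    threshold by the posterior mean under that prior.
    """
    m, n = F.shape
    x = np.zeros(n)
    z = y.copy()
    for _ in range(n_iter):
        theta = kappa * np.linalg.norm(z) / np.sqrt(m)   # heuristic adaptive threshold
        x_new = soft_threshold(x + F.T @ z, theta)       # effective single-variable estimate
        # Onsager reaction term: (number of active components / m) * previous residual.
        z = y - F @ x_new + (np.count_nonzero(x_new) / m) * z
        x = x_new
    return x
```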
In fact, here is the relaxation time of the algorithm as a function of the number of measurements. You see that the algorithm works very well up to some point, and then, at a certain threshold which is around alpha = 0.59, the relaxation time diverges: the algorithm gets lost. It is a freezing of the algorithm due to this metastable glassy state. This is just to show the error as a function of the number of iterations: again, if I have many measurements, alpha = 0.62, the reconstruction error goes rapidly to zero; if alpha = 0.56, I have an infinite plateau and I will never find the solution.

From this I get the phase diagram, and the phase diagram is like this: rho is the density of non-zero components, alpha is the number of measurements per variable. I told you before that as soon as alpha is larger than rho, that is, above the diagonal, there is a solution with the stupid enumeration algorithm. But now I want to look at practical algorithms, the ones with a running time I can afford in practice, and there are really two of them plotted here. One is this famous L1 relaxation, which is a convex problem. It has a phase transition: above this curve there are enough measurements and I can reconstruct the signal; below this curve there are not enough. And the result of our statistical physics study is the performance of belief propagation, which has its own phase transition, described here: above it you can reconstruct perfectly, and below it you cannot. So there is an intermediate zone, this orange zone here, in which my BP algorithm is not able to reconstruct, but I know that with enough computer time I would be able to reconstruct the signal.

This is typical of the phase diagrams we have already seen in coding. If alpha is the number of measurements, I have three phases. There is a phase in which alpha is too small: I don't have enough measurements, so I will never reconstruct the signal. There is a phase in which alpha is rather large: it is easy to reconstruct the signal, and my belief propagation algorithm does it correctly. And there is this intermediate, hard phase. In the hard phase there does exist an algorithm that solves the problem; the question is whether I can find a polynomial-time algorithm that solves it. So it is again one of these pictures that I showed already for codes. If I have many measurements, I have a smooth landscape and I can find the ground state, which corresponds to what I call here the crystal; the crystal is the signal that I want to reconstruct. If I have too few measurements, there is no way I can solve the problem, because I don't have enough information. And there is this intermediate range in which the energy landscape has one good minimum, the crystal, which satisfies all constraints and has zero energy, but all algorithms are fooled by the existence of this band of glassy states; they fall into the glass trap. This is very ubiquitous in statistical inference.

Now, the interesting thing is that I am telling you about a glass, a crystal and so on, and all of this has to do with an algorithm. It is not a material; it is an algorithm, and the glassiness is in how the variables behave in the algorithm. So you could say: well, this guy has spent too many years of his life on glasses, and he is copying onto the algorithm the ideas about glasses that he has in mind, as a kind of vocabulary. But it is more than that, and the proof of that is the following.
Taking this analogy of crystal and glass seriously: if you want to make progress in this intermediate phase, the thing that you should do is devise a trick to nucleate the crystal. Nucleating the crystal is something that was actually done initially by people working on error-correcting codes, and it was then imported from error-correcting codes into this field. The idea is the following; basically: "we're just going to take a couple of ice crystals and put them in the top, see what happens... and there you have it." So that was the description of an experiment that can be done by anybody; at least, maybe in Trieste it's not so frequent, but if it is cool enough in the winter: you take some purified water, it can be supercooled, you hope that it does not freeze on its own, and if it is cold enough but not yet frozen and you seed a crystal in it, it will freeze. That is very well known.

So the idea is to build a specific measurement matrix, and my measurement matrix will have a certain structure. Basically, for the first block of variables I do a number of measurements which is large enough that I know the measure will concentrate there: for the first block I do as many measurements as unknowns, so I know that I will be able to decode this first block. But then I also have some coupling terms, measurements which couple the variables in the first block to the variables in the second block: you have these off-diagonal guys. So I have the measurements inside each block and the couplings between blocks, and if this is done correctly, the thing that I expect is that I will first nucleate the crystal in the first block, and then there may be a wave of crystallization that propagates to the other blocks.

And this has been realized. It has been realized this way: you have here the mean-square error, how much the reconstructed x differs from the original signal, and you look at it as a function of the block index; here there are 20 blocks, and you look at it as a function of the number of iterations of BP. After one iteration you are completely lost. Already after 10 iterations, this curve here, you see that the first block and the second block have been decoded perfectly, but you have a big error in all the other blocks: this is the crystal seed. Then you have a wave of crystallization: after t = 100 you have crystallized the first 9 blocks and the last 11 are not yet done, and if you wait long enough, after some time you crystallize everything. Actually, the nice thing with this algorithm is that you can prove that the critical threshold for reconstruction is alpha_c = rho: you go up to the ultimate limit, the Shannon limit for this problem, with an algorithm that runs in linear time and reconstructs the signal up to the ultimate limit, the one that could otherwise be reached only by exponential enumeration.
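[As a rough illustration of this seeding construction, here is a sketch of how one might build such a banded measurement matrix. The block sizes, the measurement rates alpha_seed and alpha_bulk, and the coupling strength are made-up illustrative values, not the optimized choices of the original papers on seeded (spatially coupled) compressed sensing.]

```python
import numpy as np

def seeded_measurement_matrix(n_blocks=20, block_size=50, alpha_seed=1.0,
                              alpha_bulk=0.5, coupling=0.2, seed=0):
    """Banded block measurement matrix for 'crystal nucleation' decoding.

    The first block is over-measured (alpha_seed ~ 1), so BP decodes it
    first; the sub-diagonal coupling blocks then let the reconstruction
    wave propagate from block to block.
    """
    rng = np.random.default_rng(seed)
    rows = []
    for b in range(n_blocks):
        m_b = int((alpha_seed if b == 0 else alpha_bulk) * block_size)
        block_row = np.zeros((m_b, n_blocks * block_size))
        # Diagonal block: measurements of block b's own variables.
        block_row[:, b*block_size:(b+1)*block_size] = \
            rng.standard_normal((m_b, block_size))
        # Sub-diagonal block: coupling to the previous block's variables.
        if b > 0:
            block_row[:, (b-1)*block_size:b*block_size] = \
                coupling * rng.standard_normal((m_b, block_size))
        rows.append(block_row)
    return np.vstack(rows)
```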
So this is one interesting example of what can be done; I wanted to have this first part to give you, in some detail, this example of compressed sensing. Now I would like to make the link between this inference problem and machine learning, so I will get back to machine learning, to the first talk. I have a neural network, built from artificial neurons arranged in layers; there is an input and an output, and I imagine that I am training this network by supervised learning. I have a database, and from this database I want to infer the weights, the weights of the connections between the various neurons; there is one parameter, one weight, for each of the edges of this graph. So I am in a large-dimensional space, I have a large amount of data, and each data point is a pair of an input image and a desired output; from this data I want to infer what the weights are. So it is typically a problem of statistical inference; it falls exactly within the description that I was giving before. I will show you a little bit of what can be done on this using statistical physics.

But I want to insist, in this part of the talk, on one aspect which I find absolutely crucial, which is data structure. In some sense, what has been done so far by us physicists approaching this kind of question relied on, focused on, cases in which the data structure is very simple, typically IID data. I will show you that this is certainly not the case for real data, and I think it matters, it matters a lot. We are used, in theoretical physics, to taking a problem and going to a kind of simplified version of it, but I think that in this case the IID data distribution was losing a large part of the problem. The question that I am after in these years is really trying to understand, and I think there is some evidence already, whether the good performance of these deep networks is also due to the fact that they address problems where the data is highly structured, and the geometry, the architecture, of the network is well built to make use of the structure of the data.

The statistical physics theory of machine learning was developed already in the 90s, and there were many results then, very acute results, basically dealing with relatively simple problems: the perceptron, which is just a two-layer network, or a committee machine, which could be a three-layer network like this one, in the framework which is called teacher-student. So let me give you an idea of what can be done and what had been done already in the 90s on this problem. Imagine that I want to study learning in a perceptron. A perceptron is a machine that takes an input x, applies a vector J through the dot product, and outputs y = sign(J.x). I will look at a problem in which the task, the task to be solved, is known by what we call the teacher. The teacher is characterized by a vector J, and I will look at the case in which the components of J are binary, plus or minus one. So I have a teacher defining a task; this vector J defines an orthogonal plane, and basically all the input vectors on one side of the plane should output plus, and all the input vectors on the other side should output minus. This is in a large-dimensional space. What I will do is present a number of patterns to what we then call the student: maybe this pattern here, this one here, this one here. These patterns will be my database, and there is a student, a network that I am training, which will try to find the weight vector K such that y = sign(K.x). Again, I will assume that the student has a good prior: he knows that the components K_i are plus or minus one.
In order to solve this problem, as all of us physicists did in the 90s, we assume that the components of the input vectors are IID: they are just Gaussian random vectors with independent entries. If you do that, then you are in good business: you can build a statistical physics of learning, and basically what you do is statistical mechanics in the space of the weights, of this vector K. You have the student vector K and you have two types of constraints (oh, sorry, this should read K_i, and this should read K_i): for each component K_i you know that it is binary, so you have a first constraint, K_i is plus or minus one; but then, for each pattern that has been shown, you also have the constraint that sign(K.x^mu) must equal the y^mu given to you by the teacher. You see, I have drawn on purpose exactly the same graph as before, because this is the same problem as compressed sensing: basically I have binary variables instead of continuous variables; there are some modifications, but philosophically, and also in practice, it is the same problem.

This problem had been studied already in 1990 by Györgyi, who analyzed it with the replica method. On top of the replica method, in retrospect, looking at it in recent years, you can also study the algorithm that you get from the mean-field equations, which is called BP, cavity, GAMP, whatever; it has a long history. And what you get is the following phase diagram. You look at the generalization error as a function of the size of the database: alpha is now the number of examples per weight that the teacher has presented to me. The generalization error is the error made by the student, after learning, on a new example. You see that, no surprise here, it decreases with the database: the more information you have in the database, the better the student is. Then, at some point... there are two curves here, a red curve and a green curve; let me follow the green curve. The green curve is what the algorithm does: you find that the algorithm converges, you iterate the mean-field equations, you get the values of the variables K_i, which are the weights of the student perceptron, and at some point it crosses over, well, not a crossover, it is a phase transition, to perfect generalization, meaning that the student has aligned perfectly with the teacher. This happens a bit below alpha = 1.5. That is what the algorithm does. There is another threshold: if I don't focus on the algorithm itself but ask what would be the best possibility, using all the information in my statistical physics problem, without algorithmic constraints, then I could learn perfectly from around alpha = 1.25. So you have again two phase transitions: the algorithmic phase transition, and before it the ultimate phase transition to perfect generalization that I would get without algorithmic constraints.

So this was started in 1990 by Györgyi. In recent times there have been a lot of developments on the mathematical aspects, let's say, of replica and cavity theory, and Jean Barbier, who is here, and his collaborators were able to prove this result. All of this was heuristic, in the sense that the replica method is heuristic, but now we have the mathematical tools to turn it into theorems, so this now has the nice status of a theorem.
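[To fix ideas, here is a minimal sketch of the teacher-student setup in code: a teacher with binary weights labels IID Gaussian patterns, and the quality of any student vector K can be scored by its overlap with the teacher, since for Gaussian inputs the generalization error of a sign perceptron is the angle between the two vectors divided by pi. The learning step itself, message passing over this factor graph, is omitted; names and sizes are illustrative.]

```python
import numpy as np

def teacher_student_data(n=100, alpha=1.5, seed=0):
    """Teacher-student scenario: a binary-weight teacher labels m = alpha*n
    IID Gaussian patterns; a learning algorithm would then search for
    student weights K in {-1,+1}^n consistent with these labels."""
    rng = np.random.default_rng(seed)
    m = int(alpha * n)
    J = rng.choice([-1, 1], size=n)      # teacher weights (the hidden rule)
    X = rng.standard_normal((m, n))      # IID Gaussian input patterns
    y = np.sign(X @ J)                   # labels provided by the teacher
    return J, X, y

def generalization_error(J, K):
    # For Gaussian inputs, the error on a new example is the angle
    # between teacher and student, divided by pi.
    overlap = np.dot(J, K) / len(J)
    return np.arccos(np.clip(overlap, -1.0, 1.0)) / np.pi
```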
All this is nice; it tells you: look, we have a scheme in which statistical physics tells us something, how much data we need, where the phase transitions are, and so on. All this is very nice, but it does not apply to reality: if you look at learning on a real database, it all goes away. So the hypothesis in all this that I am really challenging is the ensemble of data. I insisted a lot on the fact that we can do this computation here because the inputs x_i^mu, on which all these messages depend, are IID: there is an independent random number associated with each edge of the graph. That is very important for being able to do the computation; otherwise we don't know how. But I think that by going to this ensemble, it was nice, it leads to a nice result, but we threw the baby out with the bathwater.

So let us look at data. I get back to the MNIST data, the handwritten digits. You remember that each digit is standardized on a square of 28 by 28 pixels, so it is a vector in 784 dimensions. Now, not every vector in this 784-dimensional space is a handwritten digit; if I scribble something arbitrary like that, it is not a handwritten digit. So you can ask a first question: can you characterize the manifold, the subspace of this 784-dimensional space, where the handwritten digits lie? It is not a very well defined question, but you can try, and in order to try, the first thing you want to do is estimate the dimension of this manifold inside the 784-dimensional space. There is one trick for this which is relatively easy; to the best of my knowledge it was first used by Grassberger and Procaccia in the 80s to find the dimension of strange attractors in turbulence. The tool is the following. Imagine that I have p points in a space of dimension d. If I draw a ball of radius r, the number of random points inside it grows like r^d. This means that, for a database of p points in a space of dimension d, the nearest-neighbor distance between points should scale like p^(-1/d). So that gives me a tool: I take my MNIST database, I take 5,000, 10,000, up to 70,000 points, I look at the scaling of the nearest-neighbor distance, and I extract from that an effective dimension. This was done by several groups; I won't be precise about all the references, because it has a long history, but this is data from Spigler et al., I think. What you get is a log-log scaling which is reasonable, and it tells you that the effective dimension is around 15.

Okay, so now we start to have a better idea of what the MNIST problem, the handwritten-digit problem, really is. This is not a handwritten digit; this is not either. When the neural network sees this thing, it should answer: this is not in the manifold where I was trained; I won't tell you whether it is a 0 or a 5 or a 1, because I have not been trained on that. And you can be a bit more precise: you can look at the sub-manifold of the 5s, do exactly the same kind of nearest-neighbor scaling, and you find that the sub-manifold of the 5s has dimension around 12; the sub-manifold of the 1s has dimension around 7; and so on for each digit. There are several ways of estimating this, and maybe the estimate is off by one dimension, maybe I say 12 and it is 14, but it's okay.
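[Here is a sketch of the nearest-neighbor dimension estimate just described, in the Grassberger-Procaccia spirit: the typical nearest-neighbor distance in a sample of p points should scale as p^(-1/d), so a linear fit in log-log coordinates yields an effective dimension. The use of scikit-learn's NearestNeighbors and of the median as the summary statistic are implementation choices, not taken from the talk.]

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def effective_dimension(X, sample_sizes=(1000, 2000, 5000, 10000), seed=0):
    """Estimate intrinsic dimension from nearest-neighbor distance scaling.

    X is an array of shape (num_points, ambient_dim); sample_sizes must
    not exceed num_points. The slope of log(distance) vs log(p) is -1/d.
    """
    rng = np.random.default_rng(seed)
    log_p, log_d = [], []
    for p in sample_sizes:
        idx = rng.choice(len(X), size=p, replace=False)
        S = X[idx]
        nn = NearestNeighbors(n_neighbors=2).fit(S)   # neighbor 0 is the point itself
        dist, _ = nn.kneighbors(S)
        log_p.append(np.log(p))
        log_d.append(np.log(np.median(dist[:, 1])))
    slope = np.polyfit(log_p, log_d, 1)[0]
    return -1.0 / slope
```

[Applied to the 70,000 MNIST vectors reshaped to 784 components, an estimate of this kind gives the effective dimension around 15 quoted in the talk.]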
So the MNIST problem is like this: you have a roughly 15-dimensional manifold of handwritten digits, and within this manifold you want to identify 10 of what are called perceptual sub-manifolds, one connected with each digit, with dimensions between 7 and 13, all of this in a 784-dimensional space. This is the problem that we face, and it is very, very different from looking at IID inputs: with IID inputs I am filling the whole space, my data is not on a manifold.

So recently we have been trying to develop an ensemble that captures some of this idea of the structure of data, because I think it is a step which is really necessary, and the ensemble is built as follows. I will build my data, my input points, not by taking them randomly with IID Gaussian components, but as superpositions of R features. The features are defined by feature vectors, and the weight of feature r in a given pattern is called c_r, so I build a pattern by taking a superposition of the features. If I leave it like that, I have just a linear manifold of data; that is a very easy system, and it is trivial, because in some sense the learning takes place only within the manifold and everything that happens outside is completely irrelevant, so I am back to the original problem. The trick is to apply a folding function: we have the manifold of data, and we fold it, and we fold it with a non-linear function applied component-wise. So in the end, the value of the component x_i of pattern number mu is given by applying, component-wise, a non-linear function f to a linear superposition of the features. The data is thus characterized by what I call a latent representation: the latent representation is the set of coefficients that tells me how the data vector is decomposed on the features. You have to understand that my data points lie on a manifold, but the manifold is folded, so it is hidden and you don't know where it is, exactly like what you had in MNIST.

Now I have to define what a reasonable task could be, what the objective is, the equivalent of the teacher that I had before. We will assume that the task is well defined as a function of the latent representation: I will assume that the desired output is, for instance, a certain function of c.w-tilde in the latent space. Basically you have to imagine, of course all this takes place in large dimensions, so it is hard to draw, but I could have a task defined in a subspace, a space like that, and it could be defined by a kind of perceptron in that subspace: the output should be plus here and minus here. But then I take this thing and I fold it; I am not sure I am able to fold it on the blackboard, but I now have a completely curved object, and the task that was linearly separable becomes non-linearly separable. Here I did a little bit of folding, but you have to imagine that I fold it many times, like the baker's transformation, and if I do that, I have a much more complicated task. It is simple if you know the folding, in the same way that, if you had the MNIST problem and you were able to characterize the manifold of handwritten digits easily, the problem would become much, much simpler: in the subspace of handwritten digits you just have these 10 sub-manifolds to find, which is much easier, because you are in a 15-dimensional space, not in the 784-dimensional space.
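[A minimal sketch of this hidden manifold construction, under illustrative assumptions: tanh as the component-wise folding function f, and a sign teacher acting on the latent weights w-tilde. The model as studied allows general f and general latent tasks; every name and size here is just for illustration.]

```python
import numpy as np

def hidden_manifold_data(n_samples, n_dim, latent_dim, seed=0):
    """Generate data from the hidden manifold model sketched above.

    x_i^mu = f( sum_r c_r^mu F_{r,i} ): a component-wise nonlinearity f
    folds the latent linear manifold, and labels depend only on the
    latent representation c, not directly on x.
    """
    rng = np.random.default_rng(seed)
    F = rng.standard_normal((latent_dim, n_dim)) / np.sqrt(latent_dim)  # feature vectors
    C = rng.standard_normal((n_samples, latent_dim))                    # latent representations
    X = np.tanh(C @ F)                                                  # folded manifold
    w_tilde = rng.standard_normal(latent_dim)
    y = np.sign(C @ w_tilde)            # task defined in the latent space
    return X, y, C
```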
So this is typically the kind of problem that we addressed: we have examples whose output is defined according to the latent-space representation, and we look at a simple network, not a perceptron but a two-layer network, with an input layer, one hidden layer and one output layer. When you do that, what you find first of all is the following. I measure the generalization error of a neural network with K hidden units as a function of K; these are the blue points. But I can also run my algorithm twice, with completely different random initializations, and the two runs differ by basically the same distance as each of them is from the real task that I have to solve. So two random runs of the algorithm end up at a distance from each other which is the same as their distance to the optimum, which makes sense. This is exactly what we see also on MNIST, on the real database: exactly the same phenomenology. The green points are something different: having trained the neural network, I throw at it a completely random Gaussian vector, not one from MNIST, not one from my hidden manifold, and I ask what the generalization error is. It behaves very poorly, both on MNIST and on the hidden manifold model. So this is an interesting phenomenology, and it was not at all like that without the hidden manifold structure: there, you find that generalization is equally good on any data point. This tells us that the phenomenology of the hidden manifold model agrees with MNIST, at least; it does not have the pathologies of the teacher-student perceptron. And the nice thing about it is that it can be studied analytically: both the online learning and the phase diagram can be studied in some detail.

I will probably skip these details; let me just flash one slide to tell you that there is a solvable limit of this problem. It is the limit in which the number of neurons goes to infinity, the dimension of the hidden manifold goes to infinity, and the size of the database also goes to infinity, all of them simultaneously. In this limit we can derive the phase diagram, we can derive the dynamics of learning, and we can compare them to what happens numerically. So I think that with this hidden manifold model we have maybe the first structured model of data in which we can make some predictions, and hopefully we will be able to compare them to what happens in experiments as well. I will skip this relatively complicated aspect and just summarize the question of data structure, to stress once again that I think data structure is absolutely essential. Real data is structured in manifolds, sub-manifolds and so on, and it has a combinatorial structure; this to me is very important. What you find in real data, for instance in handwritten digits, is that a digit is made from a combination of strokes, and each stroke is built from a combination of elementary small pixels. You have this combinatorial structure, you can see it in all the data that is being studied, and it is missing in the theory so far. So the message is that it is now possible to define models of data, ensembles of random data, in which you have some aspects of the existence of a hidden manifold, an intrinsic dimension much lower than the full dimension, and at the same time you can study them analytically, both for the online learning and for the phase diagram.
If I want to give you a take-home message about statistical inference, maybe one thing that you could remember is that we are looking at the problem of inferring a hidden rule, or hidden variables, from data, in the case where you have many variables and many data points chosen from an ensemble. When you are in this regime, you can do statistical physics, and the physics approach has basically two main tools. One is the cavity equations, the mean-field equations, which can be turned into an efficient algorithm. The second is the replica method, which gives you the phase diagram and gives you control over what the algorithm is doing. And very frequently, I gave you the examples of error-correcting codes and of compressed sensing, but there are many more, and also the example of perceptron learning, we have three phases: a phase in which the data is not large enough, so it is impossible to get the correct result of the inference; a phase in which we have a much larger amount of data and we can do it easily; and an intermediate phase, which is hard, the phase in which your algorithms tend to be trapped in metastable states. This is relevant for machine learning provided one can work with an ensemble of data that has good characteristics, and I think this is one of the frontiers of the field at the moment. We are working on that; there are not yet so many results, because it is much more complicated technically, it is much easier to handle the case of IID data, but it turns out that in some regimes you can go beyond that and get quite interesting results. So I think it is an interesting perspective for the next few years, maybe decades, but let's say years, to be on the safe side. Thank you very much for your attention; again, it has been a great pleasure to be here for the Salam lectures.

Thank you, Marc. Do we have any questions? Well, I have a question about the hardware. So far, what we do is write down an algorithm for a Turing machine; all our computers are Turing machines, and they run on electronics implementing Boolean algebra, Boolean expressions. Now, the question that comes to my mind is: to optimize these kinds of things, wouldn't it be nice to have a machine model which is not necessarily the Turing model, not reusing electronics with Boolean expression calculations? Do you think that in the future we will have anything like that? Well, it is an interesting topic. Some machines have been built that were designed along the principles of artificial neural networks: the processor would just implement a sum of the inputs passed through a non-linear function. This had been done already in the late 90s; there were a few. They still used Boolean electronics, yes, the usual electronics, normally NAND gates, and you could do that, but also with threshold devices; you could do it with analog machines. But the thing that I want to stress is that in recent years, in all these games, learning in deep networks has been an extremely time-consuming task, because you have to run through the database so many times, and so it is extremely slow. One part of the progress has been progress in hardware, but on a completely different side: being able to use graphical processing units, which were invented for totally different things, for video games and so on, and being able to program them to do this efficiently.
Still, I think that at some point it will be interesting to get back to other hardware, not in terms of raw performance, because it will be very hard to compete with this technology; the GPU is one of the latest avatars of a technology that has been developed for many decades, very strongly optimized, with very big companies investing a lot of money. So the question will not be to fight about computer speed, but to see, qualitatively, what you can do with qualitatively different ways of computing, and this I think is interesting. Because, and this is why I asked, I have something else in mind: there were some claims that artificial neural networks, and sometimes other models like quantum computers, are more powerful than Turing machines, which would mean that you can simulate a Turing machine on those models and not the other way around; but given the progress so far, that is not the case, they are still simulated on Turing machines. That is why I asked this question. But if we eventually have something based on, well, recently some people have been talking about neuromorphic computing, but they are still using the same old electronics based on Boolean expressions. Your spin glass stuff looks promising; probably you can come up with some kind of hardware based on that, given that it is in the solid state. Well, maybe I have just one comment: there are existence theorems which, in some sense, only tell you about existence. For instance, there is a theorem that tells you that a neural network with three layers, one input, one hidden and one output, is a universal computer, let's say, if it is broad enough. This gives you some kind of existence proof, but in practice the width of this thing would have to be absolutely enormous in order to implement anything. So the fact that it can be turned into a universal computer is good to know in principle, but it doesn't help a lot for practical applications. And it is true that so far the best way has been to work within this new framework for computing that is being developed with these artificial networks, but to implement it on bona fide good old computers; that is the most efficient, so far. So far! Thank you.

Concerning compressed sensing: this algorithm based on nucleation, it is not just an algorithm, it also requires a specific form of the matrix F, right? So it requires also a specific form of the data, and it goes back a little to the issue of the structure of data. One question is whether these two things are related; maybe you can shed light on whether you can learn faster because you find the easy part of the data to infer first. And the second question is: has this been implemented, can it be implemented, in a technology? No; thank you for asking the question. In the compressed sensing part, F is the measurement matrix: it tells you, for each linear measurement, which variables are being measured and with what weight. Here, in the way that I presented it, I was designing it as a theorist: I said, okay, I am a theorist, I am allowed to design my linear measurements as I want, and I design them with this banded matrix structure, which is optimal. It turns out that if you want to do it in practice, for instance we thought about it for a concrete problem, tomography, the thing that I was showing before, how to reconstruct the composition of a sample from tomographic measurements, we never found a way.
In tomography, what the synchrotron does is send light through, and you measure what has been absorbed. This you cannot change; you can change the angles, but there are not so many things you can play with in order to define your matrix: your matrix is basically defined by a set of angles. So we never found a way, in this concrete case, taking into account the experimental constraints, to turn that into a scheme that would be solved by the seeded algorithm. It is a kind of theorist's dream, but it has not yet found an application, because you need to find a way that respects the experimental constraints on the measurements, and then things may turn out to be more complicated. And the second part of your question, whether there is a link between this and learning in neural networks, in the sense that some patterns could be easy to learn: it is true that banded patterns, in the sense of the measurements, could be easy to learn, but my point of view at the moment is not that I am looking for patterns that are easy to learn, but rather for a set of patterns with a more realistic geometric structure, one that compares to what we know of the geometry of real data sets.

Are there any other questions? I have a very simple question. When you were discussing the manifold, just the cartoon of the manifold of satisfiability, you talked about four phases: there is one big cluster, then there are many smaller clusters, then there are few clusters, and then no clusters at all. And when you were talking about image reconstruction, and actually in the lecture today, you talked about easy, hard and impossible. So how do these four phases relate to these three? Does it mean that "few clusters" belongs to "impossible", or what?
So you are asking a question about the intermediate phase. In inference I have this intermediate phase in which, if I plot the energy, which here would be the number of violated constraints, the number of measurements that I am violating, there is one solution which is here, and I have something like that, and I was describing it with a cartoon saying that there is a band of glassy states all over the place which create all the trouble. These glassy states would be the equivalent, in this problem: this would be one state, another state, a third state, and so on. In general, the number of such states scales exponentially with the size of the system; it has a certain entropy. And the question I was addressing when I described those four phases in satisfiability was whether, if I restrict the measure to these states, the measure is more or less uniformly spread over this exponentially large number of states, which means that the measure of each of them is exponentially small, or whether there are some big blobs that catch a large part of the measure. If there are big blobs, it is called condensation. This might exist in this case, but I am not sure; we really did not look at it in detail, for two reasons. One reason is that, in the end, from the point of view of statistical physics, the measure is concentrated here; this is the state that carries the weight. You could say: I look at the statistical physics at a certain distance d away from this point, and then I could study the problem you are raising, but it needs a specific study. The second point is that, from the point of view of inference, what really matters is the fact that all these states exist and that they trap the dynamics; whether you always condense into a small number of them or not is not so important from the point of view of the algorithm. So that is why this question has not really been studied so far.

I have one question. You talked about applications of statistical methods to machine learning, and machine learning seems to be very good at, what, pattern recognition, or when you have large data, like at the LHC. So can you identify some areas of theoretical physics which are particularly well suited for applications of machine learning, for using machine learning in physics? Yeah, well, there are many. First of all there are applications in experimental physics, that is very clear, as you said: especially in astrophysics, where you have a lot of data, and in the particle accelerator detectors. There is also quite a lot in quantum materials, and in quantum chemistry in general. In some sense, deep networks seem to be very smart interpolators. That is, if you are doing some materials science, or you work on proteins, I don't know, and you have a long list of proteins that perform more or less the same function, and you want to understand whether a small mutation will create a protein which is in the same family or not: you could do it with a protein-folding algorithm and so on, which would take a lot of time, but if you want to do it by a purely statistical approach, you can do it by training a deep network, and in general it gives you a good answer. So it is used in materials science, quantum chemistry and so on; there are quite a lot of applications in this sense. Can it be used to derive things, to help do theoretical physics? I don't see examples for now; we are still safe.
Okay, all right. So if there are no more questions, let's thank Marc for this wonderful set of lectures.