Montanari, and this colloquium today is part of a joint series of colloquia we are organizing together with SISSA; we've already had several events. The series started at the beginning of the lockdown and has been quite successful so far. Let me remind everybody that one of the colloquia that we couldn't hold because of a technical problem on the 8th of July has now been rescheduled: the colloquium of Scott Aaronson on quantum computing and quantum supremacy has been moved to July the 29th, always at 4 p.m., the usual time. Let me also remind everybody that the colloquium is live streamed on YouTube. Of course, if you are connected now you can follow it through Zoom, but on YouTube the colloquium will remain on record, so you can watch it even after it is finished. So now, before I give the floor to my colleague Jean Barbier for the introduction of the speaker, I'd like to give the floor to Matthew Diamond from SISSA. Matthew. — Yes, good afternoon to everyone. Stefano Ruffo wanted to be here; he couldn't because he's at an important meeting in Rome. He regrets not being able to give the greetings to everyone, and more importantly he regrets missing the conference, but I'm sure he'll see the recording. He just wanted me to convey how much he has enjoyed this series of meetings, and how enthusiastic he is for the synergy between SISSA and ICTP in this initiative about data science, life science, artificial intelligence, and natural intelligence, particularly because it is something that is very much growing up from the researchers, the investigators, rather than being imposed on us. It's something that many, many people at ICTP and SISSA are enthusiastic about. So let's begin the conference. — Okay, thank you. Jean, go ahead. — Hello everyone. My name is Jean Barbier, I'm a researcher at ICTP. First I would like to thank Andrea Montanari for being here today. This is a very special colloquium for me: it is a real pleasure to introduce Andrea because, as a young statistical physicist working myself in information theory and machine learning, the work of Andrea had a great impact on my own research. So let me tell you a few words about the biography of Professor Montanari. Professor Montanari received a Laurea degree in physics in 1997 and a PhD in theoretical physics in 2001, both from the Scuola Normale Superiore in Pisa, Italy. He has been a postdoctoral fellow at the Laboratoire de Physique Théorique de l'École Normale Supérieure in Paris, France, and at the Mathematical Sciences Research Institute in Berkeley, USA. From 2002 he was a chargé de recherche with the Centre National de la Recherche Scientifique (CNRS) at the Laboratoire de Physique Théorique in Paris. In September 2006 he joined Stanford University as a faculty member, and since 2015 he is a full professor in the Departments of Electrical Engineering and Statistics. A few words about some of the awards he received: he was co-awarded the ACM SIGMETRICS Best Paper Award in 2008. He received the very prestigious CNRS bronze medal for theoretical physics in 2006 in France, the National Science Foundation CAREER Award in 2008, the Okawa Foundation Research Grant in 2013, and the Applied Probability Society Best Publication Award in 2015. He was an Information Theory Society Distinguished Lecturer for 2015 and 2016. In 2016 he received the James L.
Massey Research and Teaching Award of the Information Theory Society for young scholars, and in 2016-17 he was elevated to IEEE Fellow. In 2018 he was an invited sectional speaker at the International Congress of Mathematicians, and he is an invited IMS Medallion Lecturer for the 2020 Bernoulli-IMS World Congress. Professor Montanari is famous for a number of contributions. Let me in particular cite his contributions to the statistical physics of disordered systems and its rigorous aspects. He then moved to the connections between statistical physics and information theory, in particular the theory of error-correcting codes for communications. He has also made great contributions in the field of signal processing, in particular compressed sensing, which is very important from the application point of view, and more recently he has made important contributions to machine learning and the theory of neural networks, which is more related to today's talk. So, if I'm not mistaken, Professor Montanari will first tell us what machine learning is. We have all heard a lot about it these recent years, and about the great successes of this field on the application side. But there is also a great deal of knowledge missing from the theoretical point of view: there is a large gap between what is going on in applications and our theoretical understanding, and Professor Montanari will tell us more about that. So, on that, whenever you are ready. — I'll share the screen. Thank you, Jean, for the nice introduction — okay, let me try to do that. I hope you can all see. Again, thanks for the invitation to this colloquium. I'm very happy to be a guest at ICTP because, as Jean explained (I didn't expect such a detailed introduction), my PhD was in physics, so it's good to be among physicists. I have a particular connection to ICTP because when I was a physics student I attended some summer schools and conferences there. I think the first one I attended was organized by Riccardo Zecchina, among others, about connections with computer science, and it greatly contributed to broadening my horizons and my interests, so I have a special connection to Trieste. Today's topic will be quite broad. My hope is to give an introduction to machine learning to people who don't know much about it, or have just read about it in newspapers, so I'll try to be gentle. As Jean said, even if you don't know what machine learning is, you have probably heard about it, because you have seen it on the front pages of newspapers. These are three front pages from The Economist; I could have picked many others. In these headlines, machine learning often goes under the name of artificial intelligence, which is clearly more mysterious and attractive for a front page, but under the hood, or under the title, you see that machine learning pops up as the real technology underlying it. And this, for instance, is the title of an article from MIT Technology Review that has the same title as my seminar, "What is machine learning?", and the definition given there is: machine learning algorithms find and apply patterns in data. Then there is of course the very enticing sequel, "and pretty much run the world", and then the connection with artificial intelligence: machine learning algorithms are responsible for the vast majority of artificial intelligence advancements.
Now, when I read these things I get, not upset, but a little disappointed by definitions like "find patterns in data". As a former physicist I wonder: isn't this just all of science? Haven't we been doing this for a long while? Physicists and biologists and statisticians all find patterns in data, right? For instance, if you ask statisticians what the birth date of their discipline is, at least a fraction of them will tell you it is probably when Gauss or Legendre first used least squares to fit data on the motion of the planets. That was more than 200 years ago; isn't that also finding patterns in data? So what's new here? Here is where I want to start: I want to tell a story about what is involved in our thinking about what it means to find patterns in data, and I'll go through three steps. First, classical statistics: this is the way these problems were conceptualized at the beginning of the last century, most prominently by Fisher, Pearson and many others who really created statistics as an independent intellectual discipline. Then a quite different point of view started in the 70s with Vapnik and the notion of statistical learning; a parallel line of work that very much shares the same concerns as statistical learning is non-parametric function estimation in statistics, which started even earlier. Finally, what all those titles and articles are referring to are really the advances in what is broadly called deep learning, and this is what, to some extent, we don't really understand. In the last part of the talk I'll try to describe some recent work that tries to address these questions. Throughout the talk I'll stick to a very canonical setting, really page one or two of a textbook on this topic: what in machine learning is called supervised learning and in statistics is called regression. This is a setting in which the data come in the form of pairs $(y_i, x_i)$. Here $x_i$ is a vector, typically called the feature vector or covariate vector, and the $y_i$ are real numbers, called responses or labels. What we want is to come up with a function that allows us to predict new labels given a new feature vector: given an $x$, it has to spit out our prediction for $y$. A simple example that most of you have probably played with: you have one variable $x$, your covariate, and one response $y$. This could be the behavior of stocks over time, or GDP over time, or the temperature over time, and you want to predict a new $y$, so you try to fit some curve. Of course there are much more complicated examples, and one of the main breakthroughs in machine learning has been progress in image recognition: there the $x$'s are images, so very high-dimensional vectors, very long vectors, and the $y$'s are labels telling whether the image represents a dog or a cat. So this is the setting of regression, or supervised learning.
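As a minimal sketch of this setup in Python (the sine target, the noise level and the cubic fit are illustrative assumptions, not anything from the talk):

```python
# Supervised learning / regression in one dimension: observe pairs (x_i, y_i),
# fit a function f_hat, and use it to predict the label of a new x.
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.sort(rng.uniform(-1, 1, n))                 # covariates
y = np.sin(3 * x) + 0.1 * rng.standard_normal(n)   # responses = signal + noise

f_hat = np.poly1d(np.polyfit(x, y, deg=3))         # fit a degree-3 polynomial curve

x_new = 0.25
print("predicted label for x_new:", f_hat(x_new))
```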
How does statistics conceptualize this? The idea is the following. There is a statistical model, that is, a collection of probability distributions $\{p_\theta\}$ parameterized by a vector of parameters $\theta$. Think for instance of all Gaussian distributions parameterized by mean and variance; there the vector $\theta$ is the vector of mean and variance. Then you assume that the data — these $(x_i, y_i)$ in our case — are independent and identically distributed draws from one distribution in this class: this is the main mechanism by which nature generates data. The job of the statistician is to estimate the true parameter $\theta_0$ of the distribution that generated the data: you produce an estimate $\hat\theta$, and you hope it is very close to $\theta_0$. Once you have this, you can come up with a predictive model, which for instance can be the conditional expectation of $y$ given $x$ under $p_\theta$: you set $\theta = \hat\theta$, and if you did your job right, $\hat\theta$ will be very close to $\theta_0$, so you will be able to compute the right conditional expectation. Just to give a concrete example, one very basic example in this area is logistic regression. In this case (I will not specify the distribution of $x$) the $y$'s are $\pm 1$, and the conditional distribution of $y$ given $x$ is given by the logistic function: $p_\theta(y = +1 \mid x) = e^{\langle \theta, x\rangle} / (1 + e^{\langle \theta, x\rangle})$, where $\langle \theta, x\rangle$ is my notation for the scalar product $\theta^{\mathsf T} x$. So how do you go about the estimation? There is one main approach, empirical risk minimization: you use an optimization approach in which you minimize a sum, over all the data, of a function $\ell$, called the loss function, that depends on the parameter $\theta$ and on the datum $z_i$: $\hat\theta = \arg\min_\theta \sum_i \ell(\theta, z_i)$, where I use $z_i$ to denote both the response and the covariate vector. What is the rationale for this approach? Your freedom is in designing this loss function $\ell$, and you try to design it so that, if you take the expectation over the data under $p_{\theta_0}$ — over the truth — you obtain what is called the population risk, and the minimizer of the population risk is the true parameter vector. Let me continue the example I was describing before. Remember that in that case $p_\theta(y = +1 \mid x) = e^{\langle\theta, x\rangle} / (1 + e^{\langle\theta, x\rangle})$. To determine $\theta$ I can do logistic regression, which means minimizing $\sum_i \ell(\theta, y_i, x_i)$, where $\ell(\theta, y, x) = -\log p_\theta(y \mid x)$: I'm just taking minus the log of the conditional distribution and minimizing the sum. This is the well-known maximum likelihood estimator.
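A minimal sketch of this maximum likelihood / empirical risk minimization recipe for logistic regression; the data-generating $\theta_0$, the sizes and the step size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 5
theta0 = rng.standard_normal(p)                     # "true" parameter (assumed)
X = rng.standard_normal((n, p))
# labels in {-1, +1} with P(y=+1|x) = e^<theta0,x> / (1 + e^<theta0,x>)
y = np.where(rng.uniform(size=n) < 1 / (1 + np.exp(-X @ theta0)), 1.0, -1.0)

def grad_risk(theta):
    # gradient of the empirical risk (1/n) sum_i log(1 + exp(-y_i <theta, x_i>)),
    # i.e. of minus the average log-likelihood
    m = y * (X @ theta)
    return -(X * (y / (1 + np.exp(m)))[:, None]).mean(axis=0)

theta_hat = np.zeros(p)
for _ in range(2000):                               # plain gradient descent
    theta_hat -= 0.5 * grad_risk(theta_hat)

print("estimation error:", np.linalg.norm(theta_hat - theta0))
```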
— Excuse me for interrupting: it seems that the slides are frozen. Did you write something, or are you still on the estimation slide? — Sorry, okay, let me try to fix it. Perfect. Yes, sorry, let me know if it freezes again; I don't know why this happens. Okay, good. So this is maximum likelihood; a simpler method of this type would be just least squares. Why does this work? The basic idea is the following. The population risk — the expectation of the loss, $R(\theta) = \mathbb{E}_{\theta_0}[\ell(\theta, z)]$ — is constructed in such a way that its minimizer is the true parameter vector, while the empirical risk is the empirical mean, $\hat R_n(\theta) = \frac{1}{n}\sum_i \ell(\theta, y_i, x_i)$. The estimator is constructed by minimizing the empirical mean, the true parameter minimizes its expectation, and the hope is that the two of them are very close together. Based on this intuition you show that the difference between the estimator and the true parameter is controlled by the gradient at the point $\theta_0$, then you do a bit of math, and you obtain that in regular situations the error in estimation is of the order $\|\hat\theta - \theta_0\| = O(\sqrt{p/n})$. Here I went a bit quickly, but $p$ is the dimension of the parameter vector — the parameter vector lives in $\mathbb{R}^p$ — and $n$ is the number of samples. Under this type of model this is a general finding: the squared estimation error scales like the number of parameters divided by the number of samples. This formalizes the intuition that if we want to fit a model we had better have many more samples than parameters; old papers and old textbooks on this topic would tell you that you probably need ten times more samples than parameters before you begin doing anything. Or you can remember the von Neumann quote about being able to fit an elephant with just a few parameters.
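A quick numerical check of this classical $\sqrt{p/n}$ scaling, using least squares as the simplest empirical risk minimizer (sizes and noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
p, sigma = 20, 0.5
for n in [100, 400, 1600, 6400]:
    errs = []
    for _ in range(20):
        theta0 = rng.standard_normal(p) / np.sqrt(p)
        X = rng.standard_normal((n, p))
        y = X @ theta0 + sigma * rng.standard_normal(n)
        theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # least squares ERM
        errs.append(np.linalg.norm(theta_hat - theta0))
    # quadrupling n should roughly halve the error, tracking sqrt(p/n)
    print(f"n={n:5d}   error={np.mean(errs):.4f}   sqrt(p/n)={np.sqrt(p/n):.4f}")
```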
So what is the point of view of statistical learning? Statistical learning starts with this concern. It says: look, this doesn't make any sense. The idea that you have a statistical model — a class of probability distributions — and that images are i.i.d. draws from these distributions... well, perhaps they are i.i.d. draws from some distribution, but the idea that you can really write down this model doesn't make any sense. I don't believe that you can write down your logistic model, or whatever explicit model, and that this is really a model for the distribution of pictures of dogs. And besides, even if it were, this distribution of pictures of dogs depends very much on where I sit in the world, because the background is going to change depending on the landscape, and so on. So the idea of writing down an explicit statistical model is meaningless according to this point of view; we should rather forget about modeling explicitly and just use the data to optimize over all functions. How do you go about that? I'll describe it going through three main ideas: empirical risk minimization, uniform convergence, and convex optimization. Empirical risk minimization: the idea is the following. What I really want to minimize, when I'm given a new sample, is the expected distance between the new label and my predictive function applied to the new image; think of this distance as a Hamming distance or an $\ell_2$ distance, whatever you want. Of course we don't know the probability distribution $p$ with respect to which this expectation is taken, so what we do is replace the expectation by the empirical average. This is the real rationale for empirical risk minimization. Now, an interesting thing is that in doing this we have to add a constraint: $f$ is not any function, but belongs to a certain function class that I call $\mathcal{F}$. Why do we need this constraint? If you think about it for a minute, the reason is obvious: if I have no constraint and I try to minimize the distance from the data points over all possible functions, I can come up with very silly ways of explaining the data — a perfect fit through the data points that is not a reasonable function and will not be a good predictor in most situations. Perhaps you can think for a minute about what a good example of a function class is, or of the distance function. I will not wait for answers, but for instance things people have studied in practice are polynomials of bounded degree — that's an old example — and much more sophisticated things, for instance Sobolev classes: functions for which the $L^2$ norm of the second derivative is bounded by a constant, and things like this. These are typical function classes studied in the context of non-parametric statistics and also of statistical learning. As for the distance function, for most of what I'll talk about, think of the square distance. The second pillar is uniform convergence, and the idea is the following: how can I hope to prove that such a method works? Perhaps I can try to prove that, over all functions $f$ in my class, the maximum distance between the empirical risk and the population risk, $\sup_{f \in \mathcal{F}} |\hat R_n(f) - R(f)|$, is upper bounded by something small. What does small mean? I want something that goes to zero as $n$ goes to infinity, and in particular I want to characterize the tradeoff between $n$ and the complexity of the class $\mathcal{F}$. Typically, results of this type tell you that this $\varepsilon$ becomes small when $n$ becomes much bigger than the complexity of the function class. I will not define what complexity is — there are several formalizations of it — but the important point is that I am no longer limited to fitting classes with finitely many parameters: in general the complexity is not the number of parameters needed to specify the function, and can be much less. For instance, the class of functions with the $L^2$ norm of the second derivative bounded by a constant is infinite-dimensional, but has finite complexity in this sense. The last ingredient of the statistical learning approach is algorithms. It's nice to say "I want to minimize this risk over a class of functions", but how do I go about it? I want to do this in practice, on a computer, not just in theory. The first thing you do is parametrize your function class; of course you need to do this if you want to come up with an algorithm. For instance, I was giving the example of the class of functions with second derivative bounded by a constant, the Sobolev ball. How do you parametrize this?
Well, in one dimension you can use a Fourier series, $f(x) = \sum_k \theta_k \varphi_k(x)$. This would be an infinite-dimensional vector of coefficients, but I'll truncate it at some level; you don't always need to do this, and it is not necessarily the best way of doing it, but to be concrete let's go with it. Once you do this, you have an optimization problem in $\mathbb{R}^p$, so a well-defined optimization problem — but you still have to solve it. What we classically teach students is: choose your parameter set, your distance function and your function class such that this whole objective is convex, so that you have a convex optimization problem. To come back to my example: if my loss is the square loss and my class is the Sobolev ball, I parametrize it as a Fourier series, and then the Sobolev constraint — the second derivative of the function bounded in $L^2$ — is the same as saying, modulo constants, that $\theta$ belongs to an ellipsoid, $\sum_k k^4 \theta_k^2 \le C$. So now I have a convex — in fact quadratic — optimization problem over an ellipsoid. This is how we approach these problems. So this took me a little less than half an hour, but I have given you a reasonably complete overview of machine learning up until perhaps 15 years ago. There is an interesting question in the chat: is there any lower bound on $\varepsilon$, on this supremum? Yes, this is a very well-studied thing. This is really a law of large numbers — what is called a uniform law of large numbers. For a fixed $f$, the statement that $\hat R_n(f) = \frac{1}{n}\sum_i \ell(f, z_i)$ is close to its expectation $R(f)$ is just the law of large numbers. But I don't want it for fixed $f$: I want to take the supremum, so I want a law of large numbers that holds simultaneously over all $f$, where $f$ ranges over an infinite-dimensional space. An obvious lower bound is what you get by taking a single point, which means that in practice you typically don't expect this to decay faster than $1/\sqrt{n}$. The complication is that when you take many functions you can get a bigger constant, and it is in some sense surprising that the bound can hold at all despite taking the supremum over many functions. But this is a very well-studied problem, very much related to topics in probability theory, in particular concentration of measure and chaining. I hid this, but the math behind empirical risk minimization theory is really suprema of empirical processes, and empirical process theory.
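A minimal sketch of the Fourier-plus-convex-optimization recipe described above. For simplicity it uses the Lagrangian form of the Sobolev-ellipsoid constraint, i.e. a $k^4$-weighted ridge penalty standing in for the hard bound on the second derivative; the target function, sizes and penalty strength are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, lam = 200, 30, 1e-6
x = rng.uniform(0, 1, n)
y = np.sin(np.pi * x) + 0.05 * rng.standard_normal(n)    # unknown target + noise

# truncated Fourier parametrization: f_theta(x) = sum_{k<=K} theta_k sin(pi k x)
Phi = np.sin(np.pi * np.outer(x, np.arange(1, K + 1)))
k4 = np.arange(1, K + 1, dtype=float) ** 4               # second-derivative weights

# convex (quadratic) problem: min_theta ||y - Phi theta||^2 + lam sum_k k^4 theta_k^2
theta = np.linalg.solve(Phi.T @ Phi + lam * np.diag(k4), Phi.T @ y)
print("training RMSE:", np.sqrt(np.mean((Phi @ theta - y) ** 2)))
```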
So now, what happened with deep learning? What happened is that there was this beautiful theory, which also had a large impact on practice, but at a certain point practitioners started ignoring it and came up with very powerful methods that don't seem to follow, or adhere to, any of these three prescriptions. These methods were very successful in many applications, and one reference point is 2010 and the ImageNet challenge, one of these image recognition challenges. It's not dogs and cats — it's more difficult than that: this is a dataset with about 14 million data points and about 20,000 classes, so 20,000 types of objects, not just two animals. The error in this challenge started going down dramatically, and this is when people — Geoff Hinton and others — started making neural networks work. The prototype to keep in mind (this is not really what people do in practice, so don't try to use this to beat everybody on the ImageNet challenge, because people have come up with more sophisticated models) is a multi-layer fully connected neural network. In this case the parameter space is just a set of matrices $W_1, W_2, \dots, W_L$, with dimensions chosen so that they can be concatenated: you can take products $W_L W_{L-1} \cdots W_1$. Your parameterized function simply alternates multiplication by these matrices with the application of a nonlinearity $\sigma$: $f(x) = W_L\, \sigma(W_{L-1} \cdots \sigma(W_1 x))$, where applying a matrix means multiplying by the matrix, and applying $\sigma$ means applying a scalar function componentwise. Standard examples for this scalar function are the hyperbolic tangent or, most popular, the ReLU nonlinearity $\sigma(u) = \max(u, 0)$. So a multi-layer fully connected network is a sequence of multiplications by matrices and scalar nonlinearities. Let's see how things go wrong here with respect to the previous theory. First of all, even if you have a very simple loss function, the objective is highly non-convex: there is no convex optimization here, because the function depends very non-linearly on the $W$'s. The second pillar that shakes is uniform convergence. There is a famous experiment by a group at Google Brain: they took a standard dataset (I don't remember whether this plot is for CIFAR-10 or ImageNet), ran their algorithm, and plotted the training error as a function of the number of steps of the optimization algorithm — this blue curve, which goes down pretty quickly and, interestingly, gets down to zero, meaning that the model fits all the data points perfectly. Then they replaced the true labels — cat, dog, mouse, and so on — by completely reshuffled random labels, and you see that the training error still goes to zero: the model is so rich that it can fit perfectly even pure noise. And despite this, the model generalizes well: here is the generalization error as you increase the fraction of labels replaced by random ones (this point is zero label corruption), and it degrades gracefully with the fraction of corrupted labels. So the model is interpolating the data perfectly, and still predicting well.
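A minimal sketch of the multi-layer fully connected network defined above; the widths and the Gaussian initialization are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda u: np.maximum(u, 0.0)                 # the ReLU nonlinearity sigma

dims = [10, 64, 64, 1]                              # input -> hidden -> hidden -> output
Ws = [rng.standard_normal((dout, din)) / np.sqrt(din)
      for din, dout in zip(dims[:-1], dims[1:])]

def f(x, Ws):
    # f(x) = W_L sigma( W_{L-1} ... sigma(W_1 x) )
    for W in Ws[:-1]:
        x = relu(W @ x)
    return Ws[-1] @ x

x = rng.standard_normal(dims[0])
print("network output:", f(x, Ws))
# even with a square loss, the objective is highly non-convex in the W's
```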
The one-dimensional picture of this would be: you have pure-noise data, and your model perfectly interpolates the data, and despite this it works well. You are not fitting the data with something reasonable; you are completely interpolating. Of course the test error — the error on unseen data, here around 10 percent — is much larger than zero, which is the training error, the error on the samples you have seen. So we are outside the uniform convergence regime in which test and train go hand in hand. Finally, empirical risk minimization is the last pillar to go down. Here I should say a word about how you optimize this cost function, these neural networks. It's a long story, of course, because there are lots of tricks and methods that were invented to accelerate this, but the underlying engine is this: you write a cost function that is exactly of the same form as before — a sum over samples of a loss applied to label $i$ and the function $f$. A standard choice would be the cross entropy, which is the same as the logistic loss I described before, but let's think of it as the square loss for simplicity; that is a perfectly fine example. Then you do a gradient-based method. The simplest gradient-based method is gradient descent: at each step you move opposite to the direction of the gradient. This is not what you do in practice, because each step would take computation of order $n$ — just evaluating this function takes computation of order $n$, and $n$ is, again, 14 million, for instance. So what you do in practice is, at each step, take the gradient with respect to just one sample, or perhaps a small batch of samples; this is what's called stochastic gradient descent. This seems to be just an empirical recipe, but if you think about it for a moment you realize that saying "I am minimizing $\hat R_n$" doesn't tell the full story. Why? Because this is a non-convex optimization problem, and, as we saw before, it has a global minimum that achieves $\hat R_n = 0$, zero training error. In fact it is clear that if it has one, it probably has many of them — the network is so over-parameterized that it has many — and so the outcome will depend on the initialization and all sorts of other details. So saying "I am minimizing $\hat R_n$" doesn't tell the whole story, and this is something that was clarified a lot, in particular, by work of Nati Srebro and collaborators. So this is more or less where we are with deep learning: we had an old theory, and the old theory doesn't apply to the new type of methods.
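A minimal sketch of the stochastic gradient descent engine just described; for readability the model here is linear in the parameters, but the same loop applies to the network above with its gradient swapped in (sizes, batch size and step size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, batch, lr = 5000, 50, 32, 0.1
theta0 = rng.standard_normal(p) / np.sqrt(p)
X = rng.standard_normal((n, p))
y = X @ theta0 + 0.1 * rng.standard_normal(n)

theta = np.zeros(p)
for step in range(3000):
    idx = rng.integers(0, n, size=batch)            # draw a small batch of samples
    Xb, yb = X[idx], y[idx]
    theta -= lr * Xb.T @ (Xb @ theta - yb) / batch  # gradient step on the batch loss

print("training RMSE:", np.sqrt(np.mean((X @ theta - y) ** 2)))
```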
In the last part of the talk I want to describe a piece of work that we did over the last year or so on a very simple model — random features ridge regression — with a group of fantastic students at Stanford: Behrooz Ghorbani, Song Mei, and Theodor Misiakiewicz. I should mention that many researchers have been working on this problem and doing similar things; here are at least a few references that were quite influential, at least for me, in particular the work of Misha Belkin, of Sasha Rakhlin and Peter Bartlett, and of many others. I only listed papers from 2019 or before, because from 2020 on there are many more. The distinction of this work is that we do exact asymptotics in a very simple neural network. So what is the model? I'm a physicist, so I like spherical cows, and this is a spherical cow model. The data are i.i.d., and my feature vectors — which should be cats and dogs — are uniform on the sphere, so completely isotropic: a sphere in $d$ dimensions, with $d$ very large. The response, the label, will be a function of $x_i$ plus noise, $y_i = f(x_i) + \varepsilon_i$; I'll take the noise Gaussian, but that doesn't really matter, we can do other noises, and $f$ is a generic function — it just has to be square-integrable. This is the model for the data. What is the class of functions I will try to fit? These random features models: I take $x$, multiply it by $w_j$, pass it through a nonlinearity, and add all of this up, $f(x) = \sum_{j=1}^N a_j\, \sigma(\langle w_j, x\rangle)$. If I draw this as a diagram, it is a two-layer fully connected neural network: the input is $x$, there is an intermediate layer with $N$ nodes, and then there is an output node. But here the twist is that I take the first layer to be random: I take these weights $w_j$ to be i.i.d. and uniform on the unit sphere, and I will only train the second layer. What is the advantage of doing this? First of all, I didn't invent the model: many people have studied it, and it was very much popularized by Rahimi and Recht. The theoretical advantage is that this is linear in the parameters: I only fit the $a_j$'s, and the function is linear in the $a_j$'s. How will I fit it? I'll do ridge regression, which means least squares regularized by the $\ell_2$ norm of $a$. And what does "fully connected" signify? You can think of the first layer as multiplying $x$ by a matrix $W$, and fully connected means that $W$ is a full matrix: all the entries of $W$ can be nonzero, and they are treated equally. So I'll do ridge regression; I minimize by this very simple method.
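A minimal sketch of this random features ridge regression, assuming a linear target function and illustrative sizes and regularization:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, N, lam = 400, 30, 600, 1e-3

def sphere(num, dim):                                # rows uniform on the unit sphere
    v = rng.standard_normal((num, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

X = sphere(n, d) * np.sqrt(d)                        # covariates on a sphere of radius sqrt(d)
y = X[:, 0] + 0.1 * rng.standard_normal(n)           # y_i = f(x_i) + noise, f linear here

W = sphere(N, d)                                     # random, untrained first layer
Z = np.maximum(X @ W.T, 0.0)                         # features sigma(<w_j, x_i>), ReLU

# ridge regression on the second layer: min_a ||y - Z a||^2 + lam ||a||^2
a = np.linalg.solve(Z.T @ Z + lam * np.eye(N), Z.T @ y)

X_test = sphere(500, d) * np.sqrt(d)
pred = np.maximum(X_test @ W.T, 0.0) @ a
print("test MSE:", np.mean((pred - X_test[:, 0]) ** 2))
```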
Now, why am I interested in this random features model? After all, it seems quite different from real neural networks, which are highly nonlinear and very complicated. Well, there is a line of work that was very much prompted by a paper of Jacot, Gabriel and Hongler — the idea apparently existed even before, but this is what spurred it, and there are a lot of papers on this topic — and the idea is that there is a regime (I will not give details) in which neural networks can actually be approximated by a linear model; linear here always means linear in the parameters, not linear in $x$. The heuristic argument goes as follows. Here I have drawn a cartoon of the parameter space $\mathbb{R}^p$, this very high-dimensional space, and this blue curve, this blue manifold, is the set of models that perfectly interpolate the data: the set of parameters that give me perfect interpolation, zero training error. We saw that if there is one of them, there are many of them; it is a very complex, high-dimensional manifold. Now, if this manifold is rich enough and I initialize my gradient descent at a random point, chances are that I am very close to an interpolating model. So gradient descent will probably just go to a very nearby interpolating model — this is the red path in my cartoon. And if this is what gradient descent is doing, I can as well zoom into this little window; but if I zoom into the little window, my model becomes linear, because every function, if you zoom in enough, becomes linear. This is the heuristic behind this sequence of works, and it has been proved in all sorts of settings: it has been proved that if you train the network in a certain regime, this linear approximation is correct. And why do I regularize with the ridge penalty — remember, I regularize with the $\ell_2$ norm of $a$? Because I am imagining doing gradient descent in a linear model: if you minimize the sum of squares for a model that is linear in the parameters, gradient descent converges to the interpolator that is closest to the initialization, and this is exactly what ridge regression gives you in the limit of zero regularization.
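A quick numerical check of this claim: in an over-parameterized linear model, gradient descent started at zero converges to the minimum-$\ell_2$-norm interpolator, which is the $\lambda \to 0$ limit of ridge regression (sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                                      # more parameters than samples
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

theta = np.zeros(p)                                 # initialize at the origin
for _ in range(2000):                               # gradient descent on ||y - X theta||^2
    theta -= 0.1 * X.T @ (X @ theta - y) / n

theta_mn = np.linalg.pinv(X) @ y                    # minimum-norm interpolator
print("training residual:", np.linalg.norm(X @ theta - y))
print("distance to min-norm interpolator:", np.linalg.norm(theta - theta_mn))
```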
Now, once you bring the model to this level of simplicity, you can do quite precise asymptotics — there are some open problems, but there are quite precise results that you can get. There are two kinds of asymptotics that are interesting to study. One is the polynomial asymptotics: say the number of samples is proportional to the dimension raised to some power, and, for simplicity, take the width to be infinite. I'll skim over this and just give you the picture of the result: in the regime where the number of data points is between $d^{\ell}$ and $d^{\ell+1}$, what you can prove is that the risk you achieve is the norm of the projection of the target function onto polynomials of degree larger than $\ell$, and this is true for any regularization parameter between zero and some maximum value. So we see that in this case the optimal error is achieved at zero regularization, and zero regularization corresponds to models that interpolate the data. The picture of how the risk behaves as a function of $\log n / \log d$ in this model is a staircase: each time this logarithm passes an integer, you progressively fit one more shell of polynomials, of one higher degree, and you sit on a plateau in between; and this is achieved by interpolants. By the way, this shows that you can achieve meaningful performance even when the number of parameters is infinite. I want to spend a few more minutes instead on the proportional asymptotics, in which I take the number of data points proportional to the dimension. Here we have three parameters: $n$ is the sample size, $d$ is the dimension, and $N$ is the number of neurons, which is also the number of parameters in this model. Before, I was taking the number of parameters very large and studying the tradeoff between the other two; here instead I take the number of parameters proportional to $d$ and study the tradeoff. More precisely, I take the two ratios to converge: $N/d \to \psi_1$ and $n/d \to \psi_2$. In this limit we can compute the limit of the risk exactly, and it depends on the activation function only through its projections onto constant, linear, and nonlinear parts: you decompose the activation function into a constant, a linear part, and a nonlinear part, where this is the orthogonal decomposition in $L^2$ of the Gaussian measure. In this asymptotic regime you get an explicit expression for the risk that is the sum of two terms, a bias term and a variance term: you see that one is a variance term because it is proportional to the noise level, while the bias term is proportional to $F_1$, the norm of the linear part of the target function. The two functions appearing in this expression, calligraphic $B$ and calligraphic $V$, are completely explicit and are given here. I will not go into detail, but the math behind this is random matrix theory for kernel inner-product random matrices: these are random matrices whose entries are a function of the inner products of the data points, $M_{ij} = g(\langle x_i, x_j \rangle)$. Perhaps I'll skip this because I'm a bit out of time. So in this simple model, in this regime, we can exactly predict the performance of this ridge regression method, and this is the typical plot that you obtain. Here I plot the test error of this model as a function of the number of parameters over the number of data points. Remember, the classical statistics point of view is that you should take the number of parameters much smaller than the number of data points, so you should work in the left part of this plot. And what you see is that the test error curve first goes down as the number of parameters increases, because your model is getting better and better at fitting the data, but at a certain point it starts increasing — and this is where classical statistics would have told you to stop, because your model is becoming too rich. But then, interestingly, the error goes down again when the number of parameters gets very large, and its minimum is achieved at a very large number of parameters. This has been called the double descent phenomenon: this curve is at zero regularization, and this one at positive regularization. It attracted a lot of attention, I think in part because of this funny shape; but what is really interesting about it is the interpolation property: here I am using zero regularization, which means that I am interpolating the data, and you see that this is what gives the optimal error. Let me show this in a different figure. Here I am plotting the same figure for various values of the regularization parameter, and you see that if you choose your regularization optimally — which means taking the lower envelope of these curves — the peak disappears, and the curve is monotonically decreasing with the number of parameters. So this adheres to the statistical learning, or non-parametric statistics, point of view that you can fit infinite-dimensional models with finite data.
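A minimal sketch of a double-descent experiment in this spirit: random features ridge regression with a tiny $\lambda$, sweeping the number of neurons $N$ past the interpolation threshold $N = n$. The sizes, target and $\lambda$ are illustrative assumptions; with larger $\lambda$, the peak flattens out, as just described:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 300, 20, 1e-6

def sphere(num, dim):
    v = rng.standard_normal((num, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

X, X_test = sphere(n, d) * np.sqrt(d), sphere(2000, d) * np.sqrt(d)
y = X[:, 0] + 0.3 * rng.standard_normal(n)          # linear target + noise

for N in [30, 100, 250, 300, 350, 600, 2000]:       # sweep width past N = n
    W = sphere(N, d)
    Z = np.maximum(X @ W.T, 0.0)
    a = np.linalg.solve(Z.T @ Z + lam * np.eye(N), Z.T @ y)
    pred = np.maximum(X_test @ W.T, 0.0) @ a
    print(f"N/n = {N/n:5.2f}   test MSE = {np.mean((pred - X_test[:, 0])**2):.3f}")
```

The test error should peak near N/n = 1 and then descend again as N grows.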
But one interesting fact here is that the regularization that is optimal is zero: zero regularization is this dot-dashed curve, and it achieves the optimum. So the optimum is zero regularization and very large over-parameterization. This happens at high SNR, not so much at low SNR: at low SNR the optimal regularization is not zero. This is the same kind of data, with the curves plotted as a function of the regularization: at high SNR, when the noise is low, the optimum is achieved down here at the tip of the arrow, where there is no regularization, while at low SNR it is achieved at positive regularization — and there is actually a phase transition between the two. So what is the origin of this phenomenon? It turns out that the origin is that the nonlinear part of the activation function acts as an effective regularizer. Here, for the same model in the very wide limit, I am plotting the bias and the variance as functions of two quantities: the regularization $\lambda$ in my ridge regression (remember, we are always using this kind of penalty), and $\zeta^2$, which turns out to be the only parameter characterizing the activation function — it is the ratio of the linear part of the activation, $\mathbb{E}[G\,\sigma(G)]$, to its nonlinear part; in other words, it measures how linear the activation function is. These lines are the level lines of the bias and the level lines of the variance, and you see that you get the same risk either by decreasing $\lambda$ or by increasing $\zeta$: putting in more regularization is equivalent to increasing the nonlinear part of the activation function. So in this model there is a kind of self-induced regularization that comes in because the model is nonlinear, and this is what allows for the interpolation phenomenon. This kind of idea has been extended in a number of directions, and these are a few examples related to what we did. A quick conclusion: I hope it was clear from this talk that there are lots of strange, surprising, new phenomena to be understood in this area, and that ideas from mathematics, and even from theoretical physics, can be very useful here. In particular, in the last part of the talk the main mathematical tool was random matrix theory, but random matrix theory of a new kind, because these are not just standard matrices with i.i.d. entries. In the other direction, machine learning is of course a very useful tool; it has been very successful in certain applications, in particular computer vision, robotics, speech recognition, and so on, but we still have to understand the precise boundaries of its applicability — in particular, can we use it reliably to analyze medical data, or biological data, DNA, and so on? This is pretty much an open question, with of course some progress. That's all, thank you.
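A small numerical aside on the quantity $\zeta^2$ mentioned above. One reading of the decomposition described in the talk is $\sigma(G) = \mu_0 + \mu_1 G + (\text{nonlinear remainder})$ in $L^2$ of the Gaussian measure, with $\zeta^2$ the ratio of the linear to the nonlinear component; that interpretation of the ratio is an assumption on my part. Here it is computed for ReLU by Gauss-Hermite quadrature:

```python
import numpy as np

# Gauss-Hermite quadrature for expectations over G ~ N(0, 1)
nodes, weights = np.polynomial.hermite.hermgauss(100)
gauss_E = lambda f: np.sum(weights * f(np.sqrt(2.0) * nodes)) / np.sqrt(np.pi)

sigma = lambda u: np.maximum(u, 0.0)                      # ReLU as the example

mu0 = gauss_E(sigma)                                      # constant part E[sigma(G)]
mu1 = gauss_E(lambda u: u * sigma(u))                     # linear part   E[G sigma(G)]
mu_star_sq = gauss_E(lambda u: sigma(u) ** 2) - mu0**2 - mu1**2   # nonlinear part

print(f"mu0={mu0:.4f}  mu1={mu1:.4f}  mu*^2={mu_star_sq:.4f}")
print("zeta^2 = mu1^2 / mu*^2 =", mu1 ** 2 / mu_star_sq)
```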
— Thank you very much, Andrea, for the wonderful talk, extremely clear. Maybe we can now spend some time on questions; there are some questions in the chat. I've tried to answer some of them, but here is one: Stefano noticed that in your model the problem was convex in the parameters that you are learning. — Yes, this problem is of course convex; it's even simpler, it's quadratic, so extremely simple. The excuse for that is this picture again, which I glanced through very quickly, so let me try to explain it again. I guess the question is: are we throwing away all the interesting aspects of the problem by looking at the convex one? If that is the question, the answer is: yes, perhaps we are oversimplifying. But the justification comes from this regime in which the model is so over-parameterized that close to any point there is an interpolator — a global minimum, if you want. This is a very non-convex problem, and there are many global minima, and this manifold is the manifold of global minima. If I had to draw the landscape — I've plotted it here in one dimension, so let me draw it in two dimensions — the function you are trying to minimize looks perhaps like this; of course this is really a high-dimensional phenomenon. So if you start from a random point, yes, the problem is non-convex, but you are very close to a global optimizer, and therefore you can approximate the cost by a quadratic cost in that neighborhood. Now, this turns out to be the case when the problem is really very over-parameterized: again, all of these papers prove theorems of this form — that if the model is very over-parameterized, then the approximation I am drawing in this cartoon is good. In particular, the paper of Chizat and Bach shows that even if the problem is not so over-parameterized, because of homogeneity, if I start with a very large initialization I can again do this linearization: basically I take a neural network, scale its output by a factor $\alpha$, compensate by $1/\alpha$, and for $\alpha$ very large I can approximate it by a linear model. So for an over-parameterized model there is always a way to drive it, to train it, in a regime in which it is approximately linear: basically you put a small parameter in by hand in such a way that you make it linear. Now there is an interesting question: is this really all there is in neural networks, or can neural networks do better than this? The answer depends a little bit on whom you ask. My answer would be no — in reality you can learn things that are outside what is called the lazy regime, or sometimes the NTK, the neural tangent kernel, regime. We have produced examples showing that there are things you can learn with neural networks that you cannot explain with this linearized model, and I think there is growing evidence of that. Nevertheless, I think this kind of model is useful conceptually, because at least it shares some of the phenomena of real neural networks. So again, it's another example of a spherical cow: it shouldn't be taken too literally, but it's a good spherical cow. — Thank you. There is another question, related to your slide 63, about this W-shaped double-
descent curve: is this type of shape only seen in this simplified two-layer neural network, or is there evidence that it appears in other architectures? — Okay, here again I was quick because I didn't have much time, but this kind of picture first appeared in simulations, really. (— It seems you're frozen again; we're still on the lazy regime slide. — Okay, let me stop sharing then.) Let me take the opportunity of this question to give some credit. This fact — that you can interpolate and still generalize — was sort of known to practitioners, but it was brought very much to the attention of the theory community in part by Srebro and co-authors: this was all simulations, some on simple two-layer networks, some on multi-layer networks. Also by Belkin, Hsu and co-authors; and there were simulations of multi-layer networks by a group of physicists, actually — Giulio Biroli, Matthieu Wyart and co-authors. So this picture was obtained in a number of models, and in particular Misha Belkin was the one who pointed out that it is nothing specific to neural networks: it happens with a lot of different models, for instance random forests and very general over-parameterized models. I think pointing that out was extremely useful, because it pushed people to look for simple models to explain this, and Belkin and co-authors had some models with nearest-neighbor classification in which they could reproduce some of these phenomena, and so on. So yes, it's general. Now, do you really see it in practice? No, because, as this picture shows, as soon as you turn on a little bit of regularization, the peak — which is basically a divergence — disappears; it becomes a little bump. The regularization here is actually a very small value, I don't remember exactly, but fairly small; if you crank it up a little more, as in this picture, the double descent disappears completely. Now, in practice, do people put in a ridge penalty? In practice people use all sorts of regularizers, like dropout, and the role of $\lambda$ is somehow played by early stopping. This peak is what you see when you train the model until you really reach a global minimum; if you stop just a tiny bit before, this is roughly equivalent to regularizing a bit, and again the peak disappears. So it's like one of those phase transitions that you only reach when you drive a parameter to zero; in reality you see a smoother behavior. — Okay, thank you. Maybe Erica raised her hand, so Erica, are you here? — Yes, can you hear me? — Yes, please. — I just wrote the question in the chat, but I was wondering if you could comment on whether we have the same problem in climate prediction, where we try to use machine learning to sort out a problem. This was very common already 20 years ago; then it became a bit less used, because the main criticism we always get is that these methods lack a little bit of process understanding. Machine-learning-based methods usually do not add much process understanding, so if we want to understand what is the reason for a specific
dynamical problem, and what the response to a specific forcing is, these machine learning methods can help you detect the connection between the forcing and the dynamical process, but they don't help you understand the process. We usually get this criticism, and most of the high-impact journals really don't like papers based on this. I would just like to know your opinion about it. — I think it's a very valid criticism. But the machine learning world — and statistics even more — is progressively moving towards addressing, or trying to address, questions of that type. I don't know whether we will succeed or not, but I think it's very important. The point is that historically this was not seen, because the focus was very much on vision, and within vision on specific datasets like ImageNet; but even there, people started realizing these things a few years ago. For instance, you train a neural network that works very well at distinguishing cows from dogs, but then if you put the cow on a pink background the neural network stops working, and you realize it was working not because it recognized the cow, but because the cow was always on a green background: it really recognized the green background — the grass behind — not the cow. This is a problem even in vision, because things break once you change the background; and even more so in science, I can imagine: if you predict something correctly, but not for the right reason, it will not work once you change the data a little. But now there is growing awareness of this, and people have come up with methods to construct somewhat more robust models. Putting in real scientific understanding is very difficult, of course, but putting in, for instance, invariance to the background — one simple hack is that you change the background and train not only on your data but also on data in which the background has been changed, and people have come up with datasets of this type. "Invariant prediction" is a keyword there; another phrase people use is "right for the right reasons": you don't want only to predict correctly, you want to predict correctly for the right reasons. Is there a universal, excellent solution to these problems? I would think not at the moment, but there are some interesting ideas. I don't know whether these ideas will rise to the level of addressing the concerns of the scientific community, but I think it will be very interesting to watch what happens. — Okay, thank you. I had to select questions because there are many, and unfortunately we lack time, so we'll just take the two last ones. One is a comment by Stefano again, who says: there are papers — I think you actually mentioned them, Andrea — showing that networks can learn cats and dogs, that is, complex data, and generalize well; however, they can also learn randomized labels, and then do not generalize at all. So the important thing is the structure in the data, not the network.
thing yeah so I showed the one of these papers and this is uh yeah I mean it's by no means the only one again I gave a few other pointers but let's see is it frozen again yeah so I give one example perhaps I will not unfreeze it but but I give one example that is was this uh young benjo hard uh rect etc paper there are others again the eyebrow more or less at the same time said or even earlier some papers etc now is it just the structure of the data of course not right because you know it's very simple to come up with a model that interpolates the data I just predict zero everywhere or I predict the dog everywhere except at the training points okay it's a completely silly point or silly model but or another is nearest neighbor right nearest neighbor you know predict look for given a new image look at the image that is closer closest to the new image in some metric and then predict the same so you are given a new picture you have a list of n training pictures you look at the picture about among the n training picture that is closer to the new one closest to the new one is it a dog so you say the new picture is a dog right so this is a method nearest neighbor is a method that only uses structure in the data if you want it interpolates it doesn't do so bad right you know in the old times it was you know considered in the 50s 60s was considered a decent methods method but actually it works much worse you know it was forgotten 30 years ago more or less right so so there is something there that is not just the structure of the data obviously thank you and so last comment so from alessandro from practical reasons could we say that we can leave out the occam's razor principle and create models with billions of parameters and leave the system to optimize in the best way possible because it can dot mark interagression mark sorry but this kind of approach is somehow opposite to the way science went since the beginning of the story where we try to simplify as much as possible okay so you know I said in question I mean I don't think that this okay so I I'm going to say I don't think that this deep learning approach you know is the panacea to solve or you know I heard that plenary talk at the lips once that of somebody from I don't know which which company I will not make name but the statement was okay humanity has a lot of problems you know hunger and this and that our approach to this is that we'll solve intelligence first we'll get you know machine that thinks and we'll let them solve you know hunger you know whatever pandemics etc I don't think that's that's gonna work right and I don't think this will solve science or whatever right and in particular it seems to me that there is a distinction or a kind of a gradation between problems in which you have a huge amount of data this data are very structured and quite low noise and and and problems in which the noise so for instance vision with clean images is a problem of this type the noise is very low why because if I see a dog or a cat I can recognize it so there is at least one classifier that is the human being that can do it very well right so there is not noise in the labels right at the at the other end you know genomics or or you know perhaps you know weather prediction or you know it's all sorts of problems that arise in classical statistics in biology and science these are high noise genomics is probably a very high dimension you don't have that those many data you have much noise here are much more complicated functions probably the noise 
there is very high. In those cases, this approach in which you throw in huge numbers of parameters — it's not clear that it will work; so far I haven't seen any convincing paper showing that it does. There was a paper in Nature — somebody predicting aftershocks after an earthquake, a group at Harvard, obviously a very interesting problem. They got their data, they made predictions using neural networks, very fancy, and published in Nature. And then, I think a few months or a year later, somebody — I think at EPFL, or at ETH — redid the same thing using a logistic regression with one parameter and achieved the same prediction accuracy; then they redid it with two parameters and beat the neural network. So no, it's not that this is the panacea. — Okay, thank you very much, Andrea. I think I speak for everyone in thanking you a lot; it was really an amazing talk. Unfortunately, you will suffer from your success, because in less than five minutes the students from ICTP will have the occasion to ask you more questions in another Zoom session. All the students who have access to this other room, please connect in five minutes, and you can have a chat with Professor Montanari without interference from other people. So please take this opportunity. — See you, Jean, thank you very much. — Sandro, maybe you want to say a few words? — No, I just wanted to thank everybody, and Andrea in particular, for being with us. There were a huge number of questions in the chat, and I'm actually grateful to you, Jean, for having answered most of them in private. So thank you again, Andrea, and I guess the next appointment is the colloquium next week, on the 29th of July at 4 p.m.: we'll have Scott Aaronson talking about quantum computational supremacy. So thank you very much everybody, and bye bye, see you soon. — Okay, see you everyone, thanks.