Okay, hopefully you played around a little bit. I should add that at the bottom there's a cell on a different kind of architecture, called a convolutional neural network. You can go look up what it is; the basic idea is that you do image filtering, and these tend to work better for images. But the point is: you have to go Google, you have to read. Even if you don't want to do that, you'll see that in the thing you opened up there are all these notebooks I wrote, with different kinds of examples — here's one with the SUSY data set, for instance — and you can just go through and play. What I wanted to show you is that it's not that hard to write this code, at least initially, to start playing. And you really do have to play: it's a numerical, experimental, empirical field. The best thing you can do is re-implement papers you read. It's not that hard. Open things up; the documentation is amazing, and you can usually find the answer to lots of things on Google. But you will also get confused. It's like all coding: there's some shape that doesn't agree, some funny little thing you got wrong. So whenever you take some code, try to change it. You'll find that you break it quickly, you don't understand why, and you go back. It's the usual thing with empirical, numerical work.
So in the last five minutes I have in this lecture, I just want to tell you how you should really go about building something. This is the deep learning workflow, I would say, for most problems. The first thing you want to know is: how well can anyone do on this task? You have some task — classifying something — and you want to establish whether your neural network is close to being that good. This is called finding the optimal error rate, or establishing a base error rate. You want some metric for how close your model is to the best you could possibly do, and the best thing is usually to ask a bunch of experts, or something like an expert. A lot of deep learning is about automating things humans can do very easily, so you ask: how good is a human at this? What's the best I could do? Then there are two things that can happen: you're either overfitting or you're underfitting. This is going to be the theme of the next talk too. The figures here are straight from the review — go read the review. And it's not just my review: a lot of people worked very hard on it, and I don't mean to take undue credit. I used to say I wasted my whole sabbatical on it, because I thought it would take a month and it took eleven months, and even then I couldn't finish it without everyone's help. But enough people have told me they found it useful that I feel less bad about spending my sabbatical that way.
So: if the training error is too high, your model is not complicated enough. That's called underfitting — your bias is too big. What you should do is train longer, try a new model architecture, or — and this is often what's missing from these discussions — get more data. You just might not have enough data.
If the training error is not high, the question is whether the validation error is high. Validation data is like a test set, but it's not the test set, because you're never allowed to tune anything on the test set. What you do is take your training data and divide it into a training set and a validation set, and the validation set is the test-like set you use to tune hyperparameters, architectures, things like that. The real test set is for after you're all done: you declare victory, and then you check your performance on the test set. You can't check against the test set before that, so if you have to change hyperparameters or anything else, you need this separate validation set to change things against.
Now, if the validation error is high but the training error is low, you're overfitting: you're fitting peculiarities of the training set that don't generalize. That means you have to regularize more, or get more data, or try a new architecture, and keep iterating. So underfitting means your model isn't expressive enough; overfitting means you have to regularize, and it shows up as a mismatch between training and validation error. When both are low, you're done. That's basically the workflow.
And then often you have to tune so many hyperparameters. Each of those arrows in the workflow is actually training lots and lots of models with different hyperparameters, and that's why it gets so computationally expensive. That's why you hear numbers like thirty or forty million dollars' worth of electricity to train whatever OpenAI's new thing is. It's not because you train it once — if you only trained it once, you'd be okay. It's because you have to try a million different architectures and a million different hyperparameters. And often you can have the right idea but the wrong hyperparameters, and then you think it doesn't work when it's actually the right idea. So really, you have to play.
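In code, that loop is only a few lines. Here is a minimal sketch of it in Keras — the `build_model()` function it assumes, the epoch count, and the error thresholds are hypothetical stand-ins, not anything from the notebooks:

```python
# Minimal sketch of the diagnose-and-iterate loop. Assumes some
# hypothetical build_model() returning a Keras classifier compiled
# with metrics=["accuracy"]; the thresholds are placeholders.
def diagnose(model, x_train, y_train, base_error=0.05):
    # Hold out 20% of the training data as a validation set;
    # the test set is never touched until we declare victory.
    history = model.fit(x_train, y_train, validation_split=0.2,
                        epochs=20, verbose=0)
    train_err = 1.0 - history.history["accuracy"][-1]
    val_err = 1.0 - history.history["val_accuracy"][-1]
    if train_err > base_error + 0.05:
        return "underfitting: train longer, bigger model, or more data"
    if val_err - train_err > 0.05:
        return "overfitting: regularize more, or get more data"
    return "done -- now (and only now) evaluate on the test set"
```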
The last thing I want to emphasize is that there are lots and lots of examples. Even if you just go to Keras, there are examples of the things you might want to do, and what's nice is that they're all less than 300 lines of code — you can see how much you can do with 300 lines of code. For example, here is image segmentation: a neural network that learns to segment images. All it takes as input is a picture, and the output is the outline of the object you care about. Then you have to come up with a loss function: the network is designed to take an image and produce an output, and you have to decide how to measure the loss between what the network outputs and the true segmentation. I don't remember exactly what they use in this particular one, but there's a very prominent architecture that's become very popular for this kind of thing, called a U-Net — you can go read about it. If you're just applying pre-existing methods, you don't have to build them yourself; they're off the shelf. In biology over the last three or four years it's been hilarious — three years ago you could get a Nature or Nature Communications paper with a U-Net, and now everyone has a U-Net. But my whole point is that you shouldn't be scared: it's 300 lines of code, that's all it is. And I can tell you what the output is just by reading the code: the way they produce the output is a softmax, so they're outputting the probability of each class, and they're using cross-entropy as the loss. I can just read that off.
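The pattern they're using — a per-pixel softmax trained with cross-entropy — looks roughly like this. This is a minimal sketch of the idea, not the actual 300-line Keras example, and the layer sizes here are made up:

```python
# Sketch of a per-pixel classifier head: a softmax over classes at
# every pixel, trained with sparse categorical cross-entropy against
# integer-valued masks. Not the full U-Net from the Keras example.
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 3  # e.g. object, border, background

inputs = keras.Input(shape=(160, 160, 3))
x = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
x = layers.Conv2D(32, 3, activation="relu", padding="same")(x)
# One softmax per pixel: output shape is (160, 160, num_classes)
outputs = layers.Conv2D(num_classes, 1, activation="softmax")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```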
There's much more complicated stuff too — you can just go through and look at all these examples, some of them pretty involved, and it's all done in about 300 lines of code. The other thing I should point out, which I didn't have a chance to talk about, is that computation matters: everything is faster on GPUs. But actually, coding for GPUs in these libraries is trivial — it's usually just a flag, essentially GPU=true. Back in the day it was really hard to do this, and now it's essentially trivial.
So I think this is going to be an important tool in everyone's toolbox going forward. I'm not one of these people who thinks deep learning is going to put science out of business — I don't think it has that capability any more than linear algebra has the capability of putting science out of business. But we know linear algebra is useful, and we know this is going to be really useful going forward. The sooner you familiarize yourself with these things, the better it will be for you.
All right, I think that's all I wanted to say about this. In the next lecture I'm going to do some more theory — there's only so much basic stuff I can do — so I'm going to give you a research talk, mostly because I'm not going to be here next week; I have to go back and teach in Boston. I'll tell you something about bias, variance, and double descent in neural networks, which is kind of interesting — at least, we're really proud of the papers we're going to talk about. So see you in fifteen minutes, I guess. Any questions? No — the notebook answered everything. All right, see you back here at eleven o'clock.
We'll now start lecture four, by Professor Mehta. — Maybe I'll just put it here and hope for the best. Is it coming through? If no one complains online, I'll assume it's okay.
All right. So in this last hour I thought we'd go beyond the basics and tell you about something — this research actually came out of the fact that I've taught this class many, many times. At some point, when you teach things enough, you realize you're saying lies; it's been my constant companion. Then you ask people around you and you realize everyone is saying lies — some people realize they're lying and other people don't. The more you teach stat mech, the more you realize you lie a lot; stat mech is mysterious.
So I'll tell you about something we did. This is really work that was done by Jason, a postdoc in my group. It's based on a couple of papers; this one is in Physical Review Research. We originally sent it to PRX and had the longest, most aggravating review process ever. It was kind of fun, sociologically interesting — except that I think it was so annoying for Jason that he's probably going to leave academia (even though you shouldn't) and join some startup where he'll get paid four times as much. In that sense I regret the process, but in every other sense it was amusing. Anyway, the paper is really good. We're very proud of this work; I think it's one of the best papers to come out of the group in the last couple of years, so I'm happy to share it. The bulk of the work was done by Jason.
All right. Let's start with what I always learned as a physicist, this quote from von Neumann: "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk." There's a funny little paper in the American Journal of Physics where they amusingly construct a toy model that makes exactly that happen — four parameters fit an elephant, a fifth wiggles its trunk. The point is that the received wisdom, until ten or twelve years ago, was that you should never have a lot of parameters. Parameters are bad for you, because you're going to end up overfitting.
So the central mystery of machine learning, I would argue — or one of the central mysteries; to me there are two — is: why can machine learning make good predictions despite having so many parameters? For example, here's ImageNet, which is no longer state of the art — it's now considered a medium-sized data set.
It was considered impossibly hard five years ago, and now it's medium-sized — just so you know the pace at which machine learning moves. It has 1,000 classes and 1.2 million training images, and the categories are pretty hard: different kinds of cats, different kinds of dogs. This plot shows how well people did on ImageNet against how many computational resources it took, in gigaflops, versus top-1 accuracy. You can see AlexNet basically beat everything else in 2012 — it's way down here on the plot — and then it just keeps getting better and better and better, and this is only up to about 2018. But what I want to point out is the number of parameters: five million to a hundred and fifty-five million parameters, against 1.2 million training images. So the question is why this works: something is wrong between the von Neumann quote and the empirical record, and anyone who thought about it for half a second asked what's going on.
Let's come back and think about a simple example: polynomial regression, because this is where all our intuitions were developed. Unlike the example you did, this is polynomial regression over the Legendre polynomials, just so we're clear — not over x, x², x³. That's not really important for most of what we do, but you might as well know.
So: how complicated a model should we use? This part is just a review of what we already did, which is why I feel comfortable giving this talk here. Here is my true function, and here's my training data — as you can see, there's noise. What do I do? I can start with a very simple model, just a few Legendre polynomials, and this is what I get; the red points are test data. I can calculate the training error and the test error, and you see what happens. Then I keep increasing the complexity — you did this exercise — and the training error and test error both go down. I make it more complicated still, and at some point I can exactly fit the training data, but my test error gets really, really big. We saw this earlier in the lectures.
The point where the training error goes to zero is going to play a central role in everything I tell you. At some point the model is complicated enough that I get zero training error; we're going to call that the interpolation threshold. The classical statistical intuition has always been that the optimal complexity is some intermediate level, where the training error is not zero: past it I'm overfitting, before it I'm underfitting, and the optimal model is the trade-off in between, as we discussed earlier.
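You can reproduce this whole sweep in a few lines. Here's a minimal numpy sketch — the sine function, the noise level, and the sample sizes are stand-ins I made up, not the actual settings from the talk:

```python
# Fit Legendre polynomials of increasing degree by least squares and
# watch training and test error separate as the model gets complex.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)               # stand-in "true" function
x_tr, x_te = rng.uniform(-1, 1, 20), rng.uniform(-1, 1, 200)
y_tr = f(x_tr) + 0.3 * rng.normal(size=x_tr.size)  # noisy training data

for degree in [2, 5, 10, 19]:                 # degree 19 interpolates n=20
    Z_tr = np.polynomial.legendre.legvander(x_tr, degree)
    Z_te = np.polynomial.legendre.legvander(x_te, degree)
    w = np.linalg.lstsq(Z_tr, y_tr, rcond=None)[0]
    print(degree,
          np.mean((Z_tr @ w - y_tr) ** 2),    # training error
          np.mean((Z_te @ w - f(x_te)) ** 2)) # test error
```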
As we discussed, the way we like to think about this is to decompose the test error into three components: a bias, a variance, and a noise term. The bias is the tendency to underfit — the inability to express the relationships underlying the data — and it decreases monotonically with model complexity. The variance is the tendency to overfit: to memorize details of the training data that are not general, for example noise. The variance tends to go up, because as I make the model more and more complicated, I need more and more data to train it. So on the right I've made the model too complicated for the amount of training data I have, and I'm fitting all these sampling fluctuations. That's bias on one side and variance on the other.
Until maybe three, four, five years ago — I don't know exactly when — this was the picture. Then there was a series of papers, one from Ben Recht's group and then from Mikhail Belkin's group, that basically pointed out that this picture can't be complete. Again: bias is the tendency to underfit — this kind of simple polynomial can't express a complicated relationship. Variance is the tendency to overfit — here I've fit all the training data points, which are in blue, but I'm obviously fitting random fluctuations rather than anything real. And the question is that modern models seem to just defy this picture. Naively you can try to tell stories about why classical bias and variance still work, but those parameter counts are suggestive of something fundamentally different going on. Yeah, go ahead — please interrupt.
It's because the variance is basically what overfitting is; overfitting is variance, they're the same thing. I'm sampling from the real distribution, and the more expressive my model is — the more parameters it has — think about the polynomial regression here: this simple model just never has enough parameters to fit all the fluctuations in the training data. I have the same training points, fifteen or twenty of them, but this model doesn't have enough parameters to overfit; it can't fit all those wiggles. That one, because it has so many free parameters, can make all the wiggles — and that depends on the amount of training data. No — the bias means that even with an infinite amount of data it wouldn't go to zero here; maybe that's what's confusing. And it's not the variance of the data. There are many ways to think about variance. One way is: if I had many different data sets of the same size, how different would the fitted models be?
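You can see that definition numerically: redraw the training set many times, refit, and measure how much the fitted functions move around. A minimal sketch, using the same toy Legendre setup as above:

```python
# Estimate bias^2 and variance by refitting on many redrawn data sets.
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(np.pi * x)
x_grid = np.linspace(-1, 1, 50)
degree = 15

fits = []
for _ in range(100):                    # 100 hypothetical training sets
    x = rng.uniform(-1, 1, 20)
    y = f(x) + 0.3 * rng.normal(size=20)
    Z = np.polynomial.legendre.legvander(x, degree)
    w = np.linalg.lstsq(Z, y, rcond=None)[0]
    fits.append(np.polynomial.legendre.legvander(x_grid, degree) @ w)

fits = np.array(fits)
variance = np.mean(np.var(fits, axis=0))                 # spread across data sets
bias_sq = np.mean((fits.mean(axis=0) - f(x_grid)) ** 2)  # systematic error
print(variance, bias_sq)
```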
Variance is a hypothetical quantity that says: if I had many different data sets — right here, if I drew different training points — they'd look slightly different, and this fitted function would look completely different each time. Do it again, and it's completely different again. That's why there's so much variance: the blue points move around, and this thing just adjusts itself to follow them, so for every data set the fit is very different. Whereas this other function — the simple one — doesn't care. Move the blue points around and it basically does the same thing; the variance is low. Every time I draw a different data set, it looks basically the same. And that of course depends on the number of model parameters, which is why the variance goes up over here while the bias goes down. Okay.
So, the moment people realized they couldn't ignore this anymore was this very famous paper from Ben Recht's group. What they showed was that if you take modern neural networks and randomize all the labels, or shuffle all the pixels, or use random pixels, or add Gaussian noise to everything — if you train long enough, you can get zero training error on all the architectures people use. They're expressive enough to capture relationships in random data in high dimensions. So we must be fundamentally missing something. After that, people started wondering what's going on, and there's been a flurry of papers about all this.
But I would argue — and I'll show you a simple example — that it has nothing to do with neural networks at all. I've never seen anyone write this example down; Jason and I constructed it for our talks, and I kind of feel we should put it in some paper to show people how simple it can be. Here's the same setup: 25 training data points, 25 fit parameters. Now, in this simple polynomial regression, I'm just going to keep increasing the number of parameters. Look what happens: I increase the fit parameters to 50 and all of a sudden my test error goes down. Nothing complicated — not a deep learning model, just polynomial regression. Keep going: 100 fit parameters, and it's doing even better.
You'll see that two or three years later Mikhail Belkin's group pointed out that this is probably a generic property of many models — they did it with random forests and other methods; really nice work. And now there's a zoo of papers — not a small zoo, a large zoo, the San Diego Zoo of papers — thinking about this, though I must say I have a hard time extracting intuition out of a lot of them. So you have this crazy thing in plain polynomial regression. Did we just get statistics wrong?
Yeah — great question: if I have many more parameters than data points, how do I fit at all? These are just minimum-norm solutions.
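Here's the demo in code — a minimal sketch of the same idea: min-norm (pseudo-inverse) Legendre fits with more parameters than data points, no gradient descent anywhere. The true function and noise level are again stand-ins; you should see the test error spike near 25 parameters and come back down past it:

```python
# Double descent in plain polynomial regression: minimum-norm fits via
# the pseudo-inverse, with the parameter count pushed past the number
# of training points (the interpolation threshold).
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(np.pi * x)
x_tr = rng.uniform(-1, 1, 25)
y_tr = f(x_tr) + 0.2 * rng.normal(size=25)
x_te = rng.uniform(-1, 1, 500)

for n_params in [10, 25, 50, 100]:        # 25 = interpolation threshold
    Z_tr = np.polynomial.legendre.legvander(x_tr, n_params - 1)
    w = np.linalg.pinv(Z_tr) @ y_tr       # minimum-norm least squares
    Z_te = np.polynomial.legendre.legvander(x_te, n_params - 1)
    print(n_params, np.mean((Z_te @ w - f(x_te)) ** 2))
```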
We don't even do gradient descent. I want to point out that in this paper, and in everything I'm going to tell you, there is no gradient descent, and the problem is effectively convex: once you require the minimum-norm solution — you choose the solution with the smallest norm — it's a pseudo-inverse. I use pseudo-inverses for everything. It's regression; the only problem is that I can't take an inverse, so I replace it with a pseudo-inverse, which is like setting to zero everything I can't determine. So that's how I do it: pseudo-inverse, min-norm solutions, everything convex, no SGD, nothing. Once I tell you it's a pseudo-inverse, it's a unique solution. It's the simplest setting you can imagine — nothing complicated. It's a great question; I should have said that.
So we wanted to understand what the hell is going on here. A lot of people had done this, but we like these concepts, so we wanted to understand it for ourselves. It was especially confusing because many of the people from physics — Montanari and the Italian–French statistical physics mafia, as I like to call them — had done a lot of calculations, and they didn't all agree with each other. We were unhappy because there were a lot of replica calculations and not a lot of intuition for our liking. Like a lot of papers I'm proud of, this one exists because I didn't understand what was going on and we wanted to understand it for ourselves. So I'm going to tell you how we thought about this; we did do all the calculations, in a different way.
There's a long tradition of studying these kinds of models in statistical physics, spearheaded by this aforementioned Italian and French statistical physics community, who really have been at the forefront of all this, doing great work. The paradigm they usually think about — along with Haim Sompolinsky and Daniel Amit and that whole Israeli school — is called a teacher–student setup. You hear a lot of stories about why this is happening and why that isn't, and how complicated things have to be, so we wanted to make the simplest model we could, so we could understand what's going on. I'm not going to show you the math — there's essentially no math in this talk — but there's a 40-page supplementary information, and we had to split the paper into two because it was too long: 40 pages plus another 15-page paper. There's a lot of math behind all this, but the math isn't what's interesting to me; the intuition is.
The basic idea is that we have a teacher model: this is how we generate the data. The data is generated by some function plus noise, and in the teacher we choose the features and the parameters to be random vectors. For most of what I'll show you, the teacher is just a linear function. You can ignore all the normalization factors — the standard deviations, the f′ — they make the calculations work out and save about thirty lines of notation later on, but they're not important for the story.
Then I fit the data with a different model, called the student model, and we consider two students. One is just linear regression — also called ridge regression, or ridgeless regression in the limit we'll use — where the output is just a weighted sum of the inputs. The other is essentially kernel regression, or a two-layer neural network if you want to sound sexier: the inputs get transformed through a hidden layer with some arbitrary nonlinear function, the hidden-layer parameters W are drawn randomly but then fixed — so it's a random kernel — and I train only the output layer. So I've introduced one hidden layer with a random matrix in it, and I train the readout. Very simple: it's just kernel regression. Questions? I'm going slowly, but slowly is better.
All right, so you can calculate a bunch of stuff. The white lines and black lines here are all analytics — we use the cavity method; it's pretty involved, it doesn't matter — and you can calculate the test error and the training error. These points actually have error bars, but you can't see them. What you see for this model is that the test error kind of diverges at the interpolation threshold and then comes back down, and the training error goes to zero exactly there.
The important point throughout is that there are three quantities that really matter: the number of parameters P, which for the two-layer model equals the number of hidden units; the number of input features, which I'll call N_f; and the number of training data points, N. So there are basically three ratios that matter, and as usual in this business — for the experts in the audience, and I know there are a few — we play the usual tricks: we send N_f, P, and N all to infinity while holding their ratios fixed. If that didn't mean anything to you, just ignore it. We also assume replica symmetry. Yes — that's going to turn out to be the fundamental trade-off: the fundamental trade-off turns out to be nothing about statistical fitting, but about resources, which is consistent with the idea that nothing is free. And yes — it's a second-order phase transition. You can show it's a classic second-order phase transition; you can calculate it; it's all in the paper; I'm not going to talk about it.
So we can calculate test and training errors, and the two models behave slightly differently. For the two-layer neural network, I already showed you: the test error diverges at the interpolation threshold and comes back down, and the training error goes to zero. For ridge regression, notice that the number of features and the number of parameters are the same thing. In the two-layer model I can change the number of parameters independently of the number of features just by growing or shrinking the hidden layer; in plain regression I can't: when I change the number of parameters, I also change the number of features.
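By the way, the whole student setup really is just a few lines of numpy. Here's a toy version — the tanh nonlinearity, the Gaussian data, and the sizes are my stand-ins, and the careful normalizations from the paper are omitted:

```python
# Teacher-student in miniature: a linear teacher generates the data; the
# student is a fixed random hidden layer (a random kernel) with only the
# linear readout trained, via the min-norm pseudo-inverse.
import numpy as np

rng = np.random.default_rng(3)
n_f, p, n = 40, 200, 100   # input features, hidden units (parameters), data

beta = rng.normal(size=n_f)                    # random linear teacher
X = rng.normal(size=(n, n_f))
y = X @ beta / np.sqrt(n_f) + 0.1 * rng.normal(size=n)

W = rng.normal(size=(n_f, p))                  # random, FIXED hidden weights
Z = np.tanh(X @ W / np.sqrt(n_f))              # model features (the kernel)
w_hat = np.linalg.pinv(Z) @ y                  # train only the readout

X_te = rng.normal(size=(2000, n_f))            # fresh test data
y_te = X_te @ beta / np.sqrt(n_f)
Z_te = np.tanh(X_te @ W / np.sqrt(n_f))
print("test error:", np.mean((Z_te @ w_hat - y_te) ** 2))
```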
And what happens for ridge regression is that, again, you get a divergence of the test error; it comes back down, but then it starts increasing again. Remember, though, that this means something slightly different, because I can't change the number of parameters without also changing the number of features: here I'm changing both the size of the feature space and the number of parameters, whereas in the two-layer model I can fix the data and change the number of parameters independently. That's the fundamental reason the two models differ. If you want to know how to do these calculations, they're pretty involved, but pretty fun.
The next thing we asked is: can we say something about the bias and the variance of these models? It turns out that, for whatever reason, people who should know better had defined bias and variance wrong in previous works — just completely wrong — and they got nonsensical answers. Defining bias and variance correctly is part of why we had a very long referee process: first we were told we were wrong, and then we were told we were irrelevant — which is pretty much how it goes.
So look at the bias and variance plotted here; remember we can decompose the test error into bias, variance, and noise. For the linear (ridgeless) regression model, the variance diverges at the transition, and the bias is zero below it but then actually increases. The bias is zero on the left because I'm generating the data with a linear model and fitting it with a linear model, and below the transition the number of parameters is less than the number of training data points. Above it, the number of parameters — which equals the number of features — is bigger than the number of data points, the bias goes up, and that's why the test error goes back up; the variance diverges at the threshold and comes back down. In the two-layer model, where I'm training on linear data with a nonlinear model, the bias just keeps going down and down, approaching zero monotonically, while the variance again diverges at the interpolation threshold.
Yeah — we'll come to that; the whole talk is explaining these curves. I just want you to know what the phenomenon is first, and yes, I think these are actually the right curves. We haven't shown the calculations, but it's the same machinery: I use minimum-norm solutions, pseudo-inverses. Remember the setup — I just want to emphasize how simple this is: I'm only training the top layer, so it's just regression with a random kernel, and I can write the solution down. The solution is ŵ = (ZᵀZ)⁺ Zᵀ y — I might have the transposes off, but roughly speaking there's a pseudo-inverse in there, and if the matrix is full rank the pseudo-inverse is just the inverse. That's the min-norm solution, and it's the same thing as doing ridge regression while sending the ridge parameter to zero — what they call ridgeless regression in statistics.
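That equivalence is easy to check numerically. A minimal sketch, with a random matrix standing in for the model features:

```python
# Check: ridge regression with the ridge parameter sent to zero recovers
# the pseudo-inverse (minimum-norm) solution in the overparameterized case.
import numpy as np

rng = np.random.default_rng(4)
Z = rng.normal(size=(30, 80))            # more parameters than data points
y = rng.normal(size=30)

w_pinv = np.linalg.pinv(Z) @ y           # min-norm solution
for lam in [1.0, 1e-3, 1e-8]:
    # Dual form of ridge: w = Z^T (Z Z^T + lam I)^(-1) y
    w_ridge = Z.T @ np.linalg.solve(Z @ Z.T + lam * np.eye(30), y)
    print(lam, np.linalg.norm(w_ridge - w_pinv))   # -> 0 as lam -> 0
```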
So I'm just replacing the inverse with a pseudo-inverse; that's the only trick I'm doing. The rest is just normalization — that's just the noise term, it's not really important for anything. I don't know what lecture notes you're talking about, but I promise you this is right; I'll just say there are a lot of wrong results floating around. I'm a hundred percent confident in this — well, you should never be a hundred percent confident, but to the significant digit where you can't tell the difference between a hundred and ninety-nine point whatever, we're confident.
All right. The whole point of these curves is to explain this. Why does the variance diverge and then come back down? That's strange — why are you less sensitive to sampling noise once you have more parameters? Weird. And why does the bias behave so differently in the two models? Normally you'd expect the bias to just go down and down as the model gets more complicated, but here you get something different.
Again, same graph: the U-shape is what we classically always thought about — the curves we've been drawing are the left-hand side of this plot. What's funny is that you extend the plot, and this is what it looks like: both bias and variance decrease in the overparameterized regime for the nonlinear model, and the test error is the sum of those plus noise. Yeah — so there's no trade-off there. There is a trade-off, as was already raised: computational cost. But there's no trade-off in the statistical sense.
All right, to understand this we need a little intuition — no math from now on, just intuition. As I said, you can think of all these models as follows: to make a prediction, I take the input features and transform them into the model features. I take the x, transform it, fit it. And what I can think about is decomposing the matrix ZZᵀ — that's how the model sees the data: what are its principal components? Who here doesn't know about PCA? Don't be shy. Okay — PCA is the standard thing. If I have high-dimensional data and I want to figure out in which directions it varies the most, I calculate the correlation matrix of the data and diagonalize it. The eigenvectors are what are called the principal components, and the eigenvalues tell you how much the data vary along those eigenvectors.
So imagine I have two-dimensional data, and it's a blob — but a blob shaped like an ellipse; most data is an ellipse. The principal components point along the axes of the ellipse — the eigendirections — and the eigenvalues λ tell you how much variation there is along each direction. So the eigenvectors correspond to the directions the data is extended in, and the eigenvalues tell you how extended it is.
Now I can do this not in the original feature space but in the space of the model's features. Z is how the model views the data — everyone understand that? If I go back to the picture: the model only sees the data through this layer Z; it doesn't have access to anything else. So I ask, from the viewpoint of the model, how does the data vary in different directions? And the important point to take away is: the larger the eigenvalue, the better sampled that direction is, because the eigenvalue tells you how much variation there is along it.
So, with the fit parameters and the model features, I can look at the spectrum over these directions; the spectrum tells me how well sampled I am. In the underparameterized regime, all the eigenvalues are far away from zero — every direction is well sampled, and in particular there's a gap in the eigenvalue spectrum. There are no directions I've sampled badly, so the variance is pretty low. Now what happens as I approach the interpolation threshold? The gap closes, and now I have all these directions in the model's feature space that I've sampled really poorly. I can't distinguish between poorly sampled directions and well-sampled directions, and if I make predictions using the poorly sampled directions, I make really bad predictions, because along those directions the training data is not representative of the real data. We'll come back to this in a second.
Now the real miracle is what happens as I keep increasing the number of parameters. I should say, I'm mixing up pictures for storytelling purposes: this isn't quite the right spectrum — this one is the Marchenko–Pastur distribution, which is for the simpler ridge regression case; the real spectrum is in the paper. It doesn't really matter; you can do this for everything. The interesting thing is that as I go into the overparameterized regime, a gap opens up again. I get a bunch of eigenvalues exactly at zero, which are the directions I haven't sampled at all — think about the rank of the matrix: I have more features than data points, so I must have a bunch of zero eigenvalues, and those pile up at zero. But now the directions I did sample, I sampled well. If you know anything about random matrix theory, this is the Marchenko–Pastur distribution shown here.
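You can see the gap story in a few lines. A minimal sketch with Gaussian random features standing in for Z — under the threshold there's a clean gap above zero, at the threshold the smallest eigenvalues crowd toward zero, and past it a pile of exact zeros splits off from a well-separated bulk:

```python
# Spectrum of the model-feature correlation matrix in three regimes.
# (Z^T Z has the same nonzero eigenvalues as Z Z^T.)
import numpy as np

rng = np.random.default_rng(5)
n = 100                                      # training data points
for p in [50, 100, 200]:                     # under / at / over the threshold
    Z = rng.normal(size=(n, p))              # stand-in model features
    eigs = np.sort(np.linalg.eigvalsh(Z.T @ Z / n))
    n_zero = max(p - n, 0)                   # rank deficiency -> exact zeros
    print(p, "smallest nonzero eigenvalue:", eigs[n_zero])
```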
So in the overparameterized regime it's actually really easy to tell the directions I haven't sampled from the directions I have, and the variance goes down, because I never end up in the situation where I can't tell whether a direction is well sampled or not. And this is generically what happens. Again, the spectrum I showed you was for ridge regression; the one we calculated analytically is in the paper, and it doesn't really matter. The point is: the variance diverges because an eigenvalue goes to zero at the interpolation threshold — meaning I have these very poorly sampled directions — and you can calculate all of this analytically. Over here it's really easy to tell the difference between things I've sampled and things I haven't; at the threshold it's impossible.
You can actually check this explicitly. We take all the data and project it onto the component of Z with the minimum eigenvalue. What I've shown here in orange is the spread of the training data along that direction, and in blue the spread of the data in the true distribution — the test set. In the underparameterized regime, the training and test sets look basically the same along the minimum-eigenvalue direction. But near the interpolation threshold, the training set collapses to this narrow orange line; if I make predictions using it, it predicts one relationship along this direction, while the test set is spread out and the real relationship is something else entirely. Below threshold they overlap; and if I go to the overparameterized regime, the spreads match again and I learn the proper relationship. That's really why the variance goes up and comes back down.
The last mystery is: why does the bias go up here, in the ridge regression model? The answer is straightforward. In this regime I've increased the number of parameters and the number of features together, so I have more features than data points. That means I always have directions in feature space that I haven't sampled at all: if the number of features N_f is bigger than the number of training points N, then there are N_f − N directions with no samples whatsoever. Whatever predictions I make along those directions just reflect whatever assumptions I put in — they're pure bias. As I increase the number of features while holding the number of data points fixed, there are more and more directions I've never seen, and for that reason the bias grows. That's not true for the two-layer model, because there I can change the number of parameters independently of the number of features: increasing the number of parameters doesn't leave a bigger unsampled input feature space. That's basically what's going on.
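The projection check above is also a one-liner once you have the spectrum. A minimal sketch, again with random features as stand-ins — near the threshold the training spread along the worst-sampled direction is tiny while the test spread stays order one:

```python
# The projection check: compare the spread of training and test data
# along the minimum-eigenvalue direction of the training correlations.
import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 95                            # just under the threshold
Z_tr = rng.normal(size=(n, p))            # stand-in training features
Z_te = rng.normal(size=(5000, p))         # stand-in for the true distribution

eigvals, eigvecs = np.linalg.eigh(Z_tr.T @ Z_tr / n)  # ascending eigenvalues
v_min = eigvecs[:, 0]                     # worst-sampled direction
print("train spread:", np.std(Z_tr @ v_min))   # tiny near the threshold
print("test spread: ", np.std(Z_te @ v_min))   # stays order one
```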
There's one more thing I want to point out: there's another source of bias. Even if there's no noise, if you do this whole analysis, it turns out overparameterized models can misinterpret signal as noise — so there's extra bias that comes in for these overparameterized models. It's not really important for the rest; it's a subtle point, and this is already getting complicated, so we'll end it there.
All right, so what are the four major lessons of our analysis — basically our understanding of double descent? Lesson one: there are two basic sources of bias under overparameterization. There's the usual source we always think about, the mismatch between model and data; but there's a second potential source, which is that if I overparameterize I can have unsampled directions in feature space, and this introduces extra bias. Lesson two: the variance comes from poorly sampled directions of feature space. — Yeah, that slide should say "unsampled"; good point, I should just fix it now before I forget. It's bad when your typo means the opposite of what it's supposed to. There you go: unsampled. Lesson three: the variance, but not the bias, diverges at the interpolation threshold — there were a bunch of papers claiming the bias diverges. And finally, lesson four, which I think is the really interesting one: interpolation is not overfitting. We can use many parameters when we can easily tell noise from unsampled directions. (I apparently did not proofread this slide — I think it's because I didn't want to believe it; it's so weird.) Basically, you just don't want poorly sampled directions, and it happens that when the number of parameters gets close to the point where the training error hits zero, by definition that's when I have just barely enough parameters to fit all my data. Think about it: the training error is getting lower and lower, and right when it hits zero is right when I have just barely enough parameters — so I'm not going to be sampling all the directions well. That's why there's always an eigenvalue that goes to zero there.
I'm not sure what you're asking, but we take the limits in the right order. We're doing a cavity calculation, so you don't have to worry about all that — and there's nothing wrong with the min-norm solution: you calculate everything with λ finite and send λ to zero afterwards, which is the right way to do it. Yes — that's exactly what ridge regression does; your intuition is completely right. That's part of why people had such a hard time seeing double descent: if you regularize, this whole divergence goes away. Exactly the right intuition. To see this, you really have to turn off the regularization. No, you're absolutely right — I was just trying to understand what you were saying.
Sorry, I'm sensitive after eight referees — seven? — something like that. Yes, it's the spectrum of the training data: the Z matrix — just think about linear regression — has one row per training data point, so it's (number of training points) by (number of model features). And P, the number of parameters, is the same as N_f for ridge regression, but not for the two-layer model.
In some sense — I don't know how familiar you are with the ideology of the field — we're working in what you'd call the NTK limit, because the kernel is fixed; we don't train the kernel. So insofar as you believe the NTK — neural tangent kernel — story, this is like a neural tangent kernel calculation. But I should point out it's lazy learning: there's not even any feature learning going on here. It's the dumbest setting you can imagine — pseudo-inverse, lazy learning, nothing complicated, basically a convex problem once you tell me it's the pseudo-inverse. It's convex: a unique solution, no many-minima business, none of that crap, none of the stories people tell. We don't train the kernel.
Yeah — there, the ratio is always equal to one, and that's why you can't increase the number of parameters without increasing the number of features. That's why, if I go this way, the bias goes up: not because of anything to do with the number of parameters, but because I can't sample the whole input feature space. They're just two different models, and they behave differently. I'm not trying to compare them; I'm just saying different behaviors are possible depending on the setup. Well — I can just tell you that in the literature, no one got the bias right, even for the left-hand model; a lot of people were saying the bias diverges at the interpolation threshold. The intuitions are all I care about.
That's fine, but it's still not clear why it diverges and comes back down in the first place. Okay, you can say this is bias, but it's not clear why it should do this. The whole point of double descent is that this minimum is usually below the classical one — but it's still weird that there's a minimum on that side at all. Maybe it's obvious to you; it wasn't obvious to us. I mean, the first people who did this calculation in some form were Andrew Saxe and Madhu Advani, and a lot of this spectral story is hidden in their paper — I was actually talking to them a lot when they were doing those calculations. Whatever it is, there are lots of people working on it. I'm just telling you: I don't care about the math so much as the intuitions, and these are the intuitions we got. And I should point out this is all lazy learning. Basically, because we use min-norm, it's just regression with a different basis — it's exactly the same thing.
You can go look at what happens: it's literally that the min-norm solution doesn't like to overfit, because it doesn't need to use all those directions. It can easily distinguish the directions at zero, which don't have to be used, from the directions that do have to be used — there's this big spectral gap — so it just throws away all the stuff at zero. The problem is when the gap closes: then I don't have an easy way to tell what's important from what's not. That's really all it comes down to. Once I know I can throw away the things at zero, I never use them — that's what min-norm does; in some sense the min-norm pseudo-inverse throws away everything at zero. And yes, it's linear.
One way of thinking about it is this: I minimize ||y − Zw||² + (λ/2)||w||², and then send λ to zero at the end. That's the same thing as taking a pseudo-inverse: the pseudo-inverse is this cost with λ sent to zero. The important point is that all that matters is the eigenvalues of ZZᵀ — but they come with a λ. This is what comes up in ridge regression; it's in all regression. If the eigenvalues are all much, much bigger than λ, it's easy to tell zero from nonzero; but if they're very close, how do I know? That's the essence of all of this: as I add more and more parameters, every direction I do sample, I sample well, and the directions I don't sample at all are easy to tell apart from everything else.
So this is my cost function: the squared error on the model features, plus a term proportional to the norm of the parameter vector squared. And the way ridge regression works is that its predictions go like one over (the eigenvalues of ZZᵀ plus λ). The part that diverges in the test error is exactly what happens when an eigenvalue goes to zero — that's why normally you have to regularize. But if I know all my eigenvalues are finite except for the exact zeros, I can just ignore the zeros. That's basically what all these models are doing; it's generic. That's also why the basic intuition is understood instantaneously once you see it: putting in a cutoff λ just cuts off the divergence. It's the only time my field-theory training has ever helped me.
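In terms of the SVD, the filter just described looks like this. A minimal sketch in my own notation, with made-up sizes:

```python
# Ridge as a spectral filter: with Z = U S V^T, the ridge solution scales
# each singular direction by s/(s^2 + lam). Directions with s^2 >> lam
# pass through; near-zero directions are suppressed. Sending lam -> 0
# keeps 1/s on the nonzero singular values: exactly the pseudo-inverse.
import numpy as np

rng = np.random.default_rng(7)
Z = rng.normal(size=(30, 80))            # stand-in model features
y = rng.normal(size=30)

U, s, Vt = np.linalg.svd(Z, full_matrices=False)
lam = 1e-2
w_ridge = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))
w_minnorm = Vt.T @ ((1.0 / s) * (U.T @ y))   # lam -> 0 limit on nonzero s
print(np.linalg.norm(w_ridge - w_minnorm))
```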
Yeah, so that's it — I think we're done early; it's a long four hours, it's a long lecture. What I wanted to point out is what's going on, but also the many things we didn't put in the model, which is the main thing I find interesting: no fancy optimizers, no fancy regularizers, not even feature learning. I think that's the deeper thing about deep learning: it not only has lots of parameters, but presumably those parameters are adjusted to learn useful, meaningful kernels and features. That's the more interesting and harder thing about deep learning — or, I don't know about more interesting, but it is kind of funny. What I like about this work is that it gives some intuition and it separates what's necessary from what's not necessary to get the phenomenon. That's something I always find frustrating: people tell me I need A, B, C, and D to get an effect, but what's the minimal setting?
As I said, the work is Jason's. This is my current group, and all these agencies give me money to think about stuff — well, mostly the NIH now. I wish the rest of them gave me money; I keep putting them on my slides in the hope they'll fund me again. I'm trying to get more money from the Sloan Foundation; I'll let you know in a month.
All right — let's thank Professor Mehta, and then give him a hard time with more questions.
Question: there's a notion of compressed sensing, where instead of ridge regression's L2-norm minimization you use L1-norm minimization and extract the feature vector uniquely — but the conditions are that the feature vector should be sparse enough, and the Z matrix has to satisfy a special condition, what's called incoherence, for the solution to be unique. Does this notion of compressed sensing have anything to do with your results?
Yeah — so once you've built up the cavity calculations, you can basically do the same calculations for compressed sensing and reproduce all the phase diagrams; the spectral story is similar. But I would say compressed sensing is much more about putting extra assumptions on the data — the L1 norm and the assumption that the signal is sparse. The Z matrix doesn't become sparse at all; what really matters is the data features. Look, it's a little more complicated than I can show here — let me see if I have the phase diagrams... okay, apparently I don't have them in this deck; I can show you afterwards. It's a little bit more tricky — it always is; it's always a little more tricky than is let on in a talk. But the transition here has nothing to do with the compressed sensing transition — nothing, except insofar as you can use the same cavity method to analyze compressed sensing, and there's a beautiful series of papers doing that. There are relationships, but the phase transitions are different.
I mean, they're both second-order phase transitions where the susceptibility in the cavity calculation diverges — but that just indicates they're second order; the nature of the transitions is very different.
Question: thank you for the very informative talk. All of this double descent business is about supervised learning. Is there any reason to expect something similar in broader categories of learning — unsupervised learning, for example?
Yes, I suspect it's true — but how to define it is very hard. We don't even really know how to do double-descent calculations properly for categorical data; it works roughly the same, we know that numerically. For unsupervised learning, even the numerical experiments that would definitively show it aren't there, because we don't know what overfitting means — we don't have metrics. You can kind of see it, if you do image stuff, but it's hard because no one can agree on what the metric for overfitting is. We know it exists, we know it when we see it, but we don't know how to quantify it well. And that's one of the most interesting questions to me: in unsupervised learning, what does overfitting mean? There must be something like double descent. How do we quantify it? I don't know — it's an open problem. I think our best bet for that is RBMs; that's somewhere to contribute. There are interesting calculations to be done, changing the number of hidden units and so on. We started to set them up, but honestly, sociologically, we got this one finished and moved on to other problems. I dabble in machine learning every three or four years, ask myself why I'm in this corner, and get out — until something confuses me and I have to do a calculation. But if you want to work on machine learning theory from statistical physics, I think that's a great question, and I think RBMs are the way to do it.
Other questions? — So my question is maybe similar to the earlier one. The way you do the interpolation is with a min-norm solution, and if we have many, many parameters, and we consider the situation where the data is extracted from a polynomial and the fitting is done by a polynomial, then shouldn't the fitted parameters be very sparse?
No — and it's definitely not in feature space. In fact, the whole point is that to see any of this you have to go to the PCA eigenvectors. You actually can't see it in the original feature space; you have to go to the PCA basis and look at the minimum eigenvector, which is a collective mode. It's a collective-mode phenomenon: the eigenvectors are the collective modes, and the story lives there, not in the raw parameter space. It's not sparsity — compressed sensing is the opposite, in a way.
So my question is maybe similar to Professor Hyun's question. The way you do something like interpolation, like on the blackboard: if we have many, many parameters, and if we consider the situation where the data is extracted from a polynomial and the fitting is done by a polynomial, then I think the fitted parameters should be very sparse.

And it's definitely not sparse in feature space. In fact, the whole point is that to see this you have to go to the PCA eigenvectors; you actually can't see it in the original feature space. You have to go to the PCA basis and look at the minimum eigenvector, which is a collective mode. It's a collective-mode phenomenon: the eigenvectors are the collective modes, and the effect lives in those collective modes, not in the raw parameter space. It's not sparsity, and compressed sensing is the opposite, right? I don't know if the mathematicians have done it, but we have a way of looking at compressed sensing. It's a little difficult because you don't know how to calculate the spectra, but there's a way to fake your way through a spectrum for compressed sensing, and you can see that what happens is similar in the sense that gaps diverge and things disappear. But I think that's just the general way of thinking about phase transitions in these disordered systems, which is kind of fun. We see the same thing in ecological models, which have nothing to do with this: we see phase transitions in ecology that look the same, and the same thing in eco-evolutionary dynamics. So I think it's just a property of these disordered systems with very distributed degrees of freedom. But I don't think it's sparsity, I really don't, because it's about collective modes; the minimum eigenvector has nothing to do with sparsity in the original basis. I might be missing something, but I would be very, very surprised. Just numerically, you don't see it. This is an easy experiment: you can go do it in Python in an hour. Just fit general polynomials and look at whether the solution is sparse. It's not. It's like 30 minutes of code.

So in that case, your red line has non-sparse parameters, but the dashed line, the ground truth, that model does have sparse parameters?

No. Just do the numerical experiment; it literally takes like 15 minutes. I don't know about your students, it should take you like 20 minutes. It takes me like six hours, because inevitably I have to Google every error and I've forgotten every command. That's the good thing about being a student and not a professor.
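Here is a sketch of the quick experiment the speaker suggests, with assumed specifics (a sparse low-order polynomial teacher and a minimum-norm overparametrized fit; the sizes are arbitrary): check whether the fitted coefficients are sparse in the raw monomial basis, then look at the minimum eigenvector in the PCA basis, which mixes many monomials, i.e. a collective mode.

```python
# The suggested ~30-minute experiment, sketched with assumed specifics:
# a sparse polynomial teacher, an overparametrized minimum-norm fit.
import numpy as np

rng = np.random.default_rng(2)
n_samples, degree = 20, 50          # p = degree + 1 > n_samples

x = rng.uniform(-1, 1, size=n_samples)
y = 1.0 - 2.0 * x + 3.0 * x**2 + 0.05 * rng.normal(size=n_samples)

# Vandermonde design matrix: columns are x^0, x^1, ..., x^degree.
X = np.vander(x, N=degree + 1, increasing=True)

# Minimum-norm interpolating solution (lstsq picks it automatically
# for underdetermined systems).
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("fraction of near-zero fitted coefficients:",
      np.mean(np.abs(theta) < 1e-3))   # typically small: the fit is NOT sparse

# Now in the PCA basis: the minimum eigenvector of X^T X is spread over
# many monomials rather than concentrated on a few, i.e. a collective mode.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # ascending eigenvalues
v_min = eigvecs[:, 0]
print("minimum eigenvalue:", eigvals[0])
print("non-negligible components in the minimum eigenvector:",
      int(np.sum(np.abs(v_min) > 1e-2)))
```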
Other questions?

Thank you for the lecture. I have a question: I'm actually interested in feature and representation learning. This neural-tangent-kernel-style explanation is very interesting, but in practice there are lots of neurons that have very low variance, which suggests some kind of feature compression. So I want to ask whether there is research connecting the overparametrized regime and feature learning.

Yeah, I agree. We were trying to break this down into bite-sized pieces. From the work I know, Haim's group has some papers, and Lenka and Florent, Lenka Zdeborová and Florent Krzakala, have some papers, I think. But I think it's underexplored: there are three or four papers, and I think they're good starts, but it's the harder problem. Honestly, my confusion is often that everyone says everything is necessary for everything else, so I just like to break it up into little pieces: this phenomenon requires this little part. So what do I get when I have feature learning that I don't get in the lazy learning regime? (A toy version of that contrast is sketched below.) That's what I'm more interested in, but I agree with you, that's the interesting question: once I've learned the kernel, what happened? And then there are also the training dynamics. We took all the training dynamics out of all of this; there are no training dynamics here. So what do the training dynamics get you? Then there's the question of how different forms of regularization affect all this; I feel the form of regularization is much more important for feature learning than it is for lazy learning, I would say. So I agree with you, that's the interesting thing. And I'm getting more and more convinced that feature learning is not super important for a lot of the things neural networks do. But again, that's because I realized we got the statistics wrong. By "we" I mean collectively: we just didn't understand the statistics. It's very surprising.

Thank you for the answer.

Other questions? No, everyone's ready for lunch now. Okay, thank you all for listening, and feel free to bother me.
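As an illustration of the lazy-versus-feature-learning contrast from that last answer, here is a minimal sketch (my construction, not the speaker's code, with an arbitrary toy task): a two-layer ReLU network with NTK-style 1/sqrt(width) output scaling, trained by full-batch gradient descent. In the wide, "lazy" regime the hidden weights barely move, so the learned features stay close to their random initialization; at small width they move substantially.

```python
# A minimal lazy-vs-feature-learning probe (illustrative construction):
# a two-layer ReLU net with NTK-style 1/sqrt(width) output scaling.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))
y = np.sin(X[:, 0])  # arbitrary smooth teacher

def relative_feature_movement(width, lr=0.1, steps=500):
    """Train f(x) = a . relu(W x) / sqrt(width) by gradient descent on MSE;
    return how far the hidden weights W moved relative to initialization."""
    W = rng.normal(size=(5, width))
    a = rng.normal(size=width)
    W0 = W.copy()
    n = len(y)
    for _ in range(steps):
        h = np.maximum(X @ W, 0.0)            # hidden features
        err = h @ a / np.sqrt(width) - y      # residuals
        grad_a = h.T @ err / (n * np.sqrt(width))
        grad_W = X.T @ (np.outer(err, a / np.sqrt(width)) * (h > 0)) / n
        a -= lr * grad_a
        W -= lr * grad_W
    return np.linalg.norm(W - W0) / np.linalg.norm(W0)

# Wider networks should show smaller relative movement: the "lazy" regime,
# where the effective kernel stays close to its value at initialization.
for width in [10, 100, 1000, 5000]:
    print(f"width {width:5d}: relative hidden-weight movement "
          f"{relative_feature_movement(width):.4f}")
```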