The next speaker is Mark Sellke, who is finishing up at Stanford and will then join IAS and Harvard. Okay, so Mark is going to talk about robustness and isoperimetry. Take it away.

Okay, thanks so much for inviting me, and thanks for coming today. Can people hear me okay? Okay, great. So I'll talk today about a universal law of robustness via isoperimetry, and this is joint work with one of my advisors, Sébastien Bubeck, who is at Microsoft.

The purpose of this talk is to say something about very large machine learning models. To give a sense of why this is a pertinent topic: machine learning models have been getting extremely large very quickly, as most of us are aware. To contrast with the statistical story from 10 or 15 years ago: classically, when we think about how large a statistical model should be, we think the number of parameters should be just right. If there are too many parameters we are going to overfit; if there are too few parameters we are going to underfit; there should be some Goldilocks zone. And if we think about the extreme of choosing as many parameters as would still be sensible, we would reason that if we are fitting n data points, then at most we are solving a system of n equations, so we should need n parameters. This is really not the case for large modern neural networks. Just to show some numbers: for the MNIST dataset, these digit images, there are about 60,000 images, an image has maybe a thousand pixels, and a typical model will use something on the order of a million parameters, far more than 60,000. It's a similar story for ImageNet, where all the numbers are bigger and the parameter count is bigger still, and recent large language models go up to almost a trillion parameters.
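To spell out the classical counting intuition from a moment ago, here is a minimal sketch: a model that is linear in its parameters can exactly fit n generic data points once it has p = n parameters, by solving an n-by-n linear system. The polynomial feature basis below is an arbitrary illustrative choice, not anything from the talk.

```python
import numpy as np

# Classical counting: with p = n parameters, memorizing n labels is just
# solving an n x n linear system (one equation per data point).
rng = np.random.default_rng(0)
n = 10
x = np.linspace(-1.0, 1.0, n)               # n distinct inputs
y = rng.normal(size=n)                      # arbitrary labels to memorize

A = np.vander(x, N=n, increasing=True)      # feature matrix, one column per parameter
theta = np.linalg.solve(A, y)               # p = n parameters suffice

print(np.max(np.abs(A @ theta - y)))        # ~ 0: exact fit at p = n
```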
There is a pretty prominent and successful line of work that has shed some light on this. It is not what I am going to talk about today, but I want to mention it first as a contrast. That story says that when I have a lot of data points in high dimension, with these two large numbers comparable, then mild overparametrization is fine. There are quite a few works on this; the last one was discussed a few days ago. These are the double descent curves: as you look at larger and larger model sizes, you first improve, then you overfit, and then the error goes back down again. What I am going to talk about today is a specific setting in which extreme overparametrization seems to be mandatory. In double descent, theoretical analyses often say that overparametrizing by a factor of 10 is fine, even good; here we are going to want to overparametrize by a super-constant factor.

Okay, so let me say what the setting is and what the result is. I'll give an informal statement that tries to reconcile the following three phenomena of neural networks. First, they are overparametrized. Second, they often memorize their training data almost perfectly. Third, they generalize well to the same type of test data, yet they are brittle against adversarial perturbations at test time: if I add a little adversarial noise to this image, I cannot tell the difference, but it gets classified very differently. Our result says that if you want to memorize and you want to be robust, then you need overparametrization, in some theoretical model.

To give an informal statement, which I'll explain more carefully over the next few slides: we fix a reasonable function class described by some number p of parameters. This is something like a deep neural network with a fixed architecture, where every trainable weight is a parameter. We sample n data points i.i.d. from a d-dimensional distribution; it can be fairly general, but it should be truly d-dimensional, not, say, a line living in d dimensions, though it can be some kind of Gaussian, not necessarily spherical. Then we add label noise: we take some possibly deterministic labels and add noise to them. We then try to memorize these labels using a function from the class we fixed beforehand, and we want this function to be Lipschitz, which is our proxy for robustness. The result is that we need at least order n·d parameters to do this. If our function class does not have enough parameters, it will not contain a robust memorizer. In short, overparametrization may be necessary for robust memorization.

Okay, let me now explain all of this in more detail, and please ask questions as I go through the technical setup. We are living in d dimensions; for now the data is on the unit sphere, and as we will see, this can be relaxed. You should think of the number of data points n as polynomial in the dimension, which is pretty flexible: linear is fine, cubic is also fine. The labels are given by some deterministic function g, which we do not really care about, and crucially there is noise: each label is y_i = g(x_i) + z_i, where the noises z_i have average variance σ². The noise distribution can even depend on the input x_i; it is pretty unrestrictive.

Now, memorization. We say a classifier f fits the labels perfectly if f(x_i) = y_i for all i; that is a perfect memorizer. We will also be fine with partial memorization, which means fitting the labels including some of the noise, that is, fitting the labels better than you could by ignoring the noise entirely. For example, if the mean squared error is less than half the noise variance σ², that counts as memorization.

Next, what is robustness? I will say a classifier is robust if it has a small Lipschitz constant. This is a sufficient condition for essentially any sort of robustness: it certainly implies that if I change my input a little, my classifier will not move very much. It is a somewhat strong condition, but it is mathematically convenient, so we define robustness this way for the purposes of this talk. I want to point out that the relevant scale here is constant: our inputs are on the unit sphere and our labels are of constant order, so it makes sense to hope for a dimension-free Lipschitz constant. This has caused occasional confusion, so let me address it. If you think about a real image, it has d pixels and each pixel is an order-one number, so this scaling may seem strange; but after rescaling everything down, all it says is that if I change one percent of my pixels, my classifier should not change very much. A small relative change in my input should be benign; there are no funny games going on.
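Here is a minimal sketch of this setup in code. The particular g and the Gaussian noise are my own toy choices; the talk allows much more general labels and noise distributions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 100, 500, 0.5

# x_i i.i.d. uniform on the unit sphere S^{d-1}
x = rng.normal(size=(n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)

g = lambda z: np.tanh(z @ np.ones(d) / np.sqrt(d))  # some deterministic labels
y = g(x) + sigma * rng.normal(size=n)               # y_i = g(x_i) + z_i

def memorizes(f, x, y, sigma):
    """Partial memorization: fit the labels better than ignoring the noise,
    i.e. mean squared error below half the noise variance sigma^2."""
    return np.mean((f(x) - y) ** 2) <= sigma ** 2 / 2

# g itself ignores the noise, so its MSE is about sigma^2 and this prints False
print(memorizes(g, x, y, sigma))
```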
Okay. Now, a fact: if I have a random dataset of (x_i, y_i) pairs generated this way, then there exists a perfect memorizer which is O(1)-Lipschitz. The reason is simply that all of my inputs are well separated, because they are random points in high dimension, so I can do something very simple: take a sum of bumps. For each input point x_i, I put a bump function supported right around it, fitting x_i but no other point, and I do this for each input separately. Very simple, very silly, not something we want to be interested in. This simple construction is Lipschitz because the points are separated, and it is a perfect memorizer, but it uses a lot of parameters: n·d real parameters, because to specify this function f, I have to tell you every single coordinate of every single input point x_i. In terms of parameters, it is much less efficient than you would think should be possible. What we are going to see is that this is already the best parameter efficiency you can get; this silly construction achieves it.

Q: So each x_i is a point in d dimensions?

A: Yes, and in this function f we are using all the points x_1 through x_n to construct f, so that is n·d real numbers.

Q: But those are the input points of your problem.

A: Yes, but the idea is that we want to fix a function class with some number of free parameters before we see our dataset. So in order for our function class to contain a function like this, it needs to contain all the functions of this form, for all possible configurations of the points x_i. Does that make sense? Okay, great, thanks for the question; the setup is a bit subtle. Any other questions? Okay, great.
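Returning to the sum-of-bumps construction: here is a minimal sketch of it (the hat-shaped bump is my own choice; any O(1)-Lipschitz bump supported near each x_i works). Random points on the high-dimensional sphere are at distance roughly √2 from one another, so constant-radius bumps have disjoint supports.

```python
import numpy as np

def bump(t):
    return np.maximum(0.0, 1.0 - t)        # 1-Lipschitz hat, supported on [0, 1]

def make_bump_memorizer(x_train, y_train, r=0.5):
    # The "parameters" of f are all n*d coordinates of x_train plus y_train:
    # exactly the n*d real-parameter cost discussed above.
    def f(x):
        dists = np.linalg.norm(x[:, None, :] - x_train[None, :, :], axis=2)
        return (bump(dists / r) * y_train[None, :]).sum(axis=1)
    return f

rng = np.random.default_rng(1)
d, n = 200, 100
x = rng.normal(size=(n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)   # random points on S^{d-1}
y = rng.choice([-1.0, 1.0], size=n)             # pure-noise labels

f = make_bump_memorizer(x, y)
print(np.allclose(f(x), y))   # perfect memorization: True with high probability
# Each bump is (|y_i|/r)-Lipschitz and the supports are disjoint,
# so f is O(1)-Lipschitz: robust, but at a cost of n*d real parameters.
```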
Okay, so it turns out no function class is more parameter-efficient, and let me now say more precisely what I mean when I talk about parameter efficiency. When I have a function class specified by p parameters, I mean abstractly that there is some parameter vector in p dimensions, and each function in my class corresponds to some parameter vector. I need some regularity conditions for this to make sense; otherwise my parameter vector could just be one-dimensional, because I can biject it with any set. We require, first, that the parameters are bounded in norm by some constant, which I will call J, and second, that the parametrization is J-Lipschitz, in the sense that if I change the parameter vector, the classifier changes in L∞ norm by at most J times the amount of change. The eventual dependence on J is logarithmic, so if J is d^10, that is completely fine; I think of these conditions as quite benign. On the other hand, they do prevent silly pathologies, because with this kind of regularity I cannot extract infinitely many bits from a single real parameter.

As a simple first example, for linear regression my functions are just inner products, f_w(x) = ⟨w, x⟩, and the number of parameters is the dimension of the original space.

Let's look at a more realistic example of this setup: feedforward neural networks. These are given by linear maps composed alternately with nonlinearities α, for example the ReLU, so f(x) = W_L α(W_{L-1} α(⋯ α(W_1 x))). In our setup, the parameter vector w encodes all the entries of all the weight matrices. One very nice feature of the setup is that a structured matrix such as a convolution, with many repeated entries, correspondingly decreases the number of parameters: I can treat each repeated entry as a single coordinate of my parameter vector. As for the Lipschitz constants in this case, both the Lipschitz constant of the parametrization and the Lipschitz constant of the classifier itself are upper bounded by a natural quantity, the product of the operator norms of the weight matrices. The Lipschitz constant in w is what needs to be reasonable; that is the number J. But as I mentioned, the dependence on it is logarithmic, so basically as long as the log of this product of norms is reasonably sized, we are in good shape. In particular, as long as the depth is not insanely large, this is handled well by our setup, and this norm product is often explicitly controlled anyway for other reasons, such as ensuring training stability. Any questions about this, or anything else from the past couple of minutes? Okay, great.
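Here is a minimal numerical check of the product-of-operator-norms bound just mentioned. The widths and random weights are stand-ins of my own choosing, not anything from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
widths = [100, 256, 256, 1]                # input dim, two hidden layers, output
Ws = [rng.normal(size=(m, k)) / np.sqrt(k)
      for k, m in zip(widths[:-1], widths[1:])]

def forward(x):
    # f(x) = W_3 relu(W_2 relu(W_1 x)): composition, not multiplication
    for W in Ws[:-1]:
        x = np.maximum(W @ x, 0.0)         # ReLU is 1-Lipschitz
    return Ws[-1] @ x

# Lip(f) <= product of the operator (spectral) norms of the weight matrices
lip_upper = np.prod([np.linalg.norm(W, ord=2) for W in Ws])

# sanity check on a random pair of nearby inputs
x = rng.normal(size=100)
delta = 1e-3 * rng.normal(size=100)
ratio = float(np.abs(forward(x + delta) - forward(x))[0]) / np.linalg.norm(delta)
print(ratio <= lip_upper, ratio, lip_upper)   # the bound holds, usually far from tight
```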
So it turns out there is a trade-off here. If you do not care about robustness and just want to memorize in the setting I described, you only need about n parameters, even with a two-layer neural network. But as I said, you need n·d parameters to robustly memorize, which is achieved by the sum-of-bumps construction, a very simple function, and if you try to improve on it, it is not clear how. The general trade-off here was first conjectured for two-layer networks in a previous paper by Bubeck, Li, and Nagaraj, and our main result is a very general form of this trade-off. The setup: fix a function class described by p parameters; let me just say the parameters have poly(d) size, and in general there is a log J factor. Take a dataset with noisy labels, which, as a reminder, means the x_i are i.i.d., say on the sphere, and the y_i carry some noise. Then any memorizer in the function class has Lipschitz constant at least of order σ·√(nd/p). Again, σ is the noise level in the labels, n is the number of examples, d is the dimension, and p is the number of parameters. In particular, to get a dimension-free Lipschitz constant you need at least n·d parameters; that is the main takeaway.

In my setup I said the x_i should be uniform on the unit sphere, but they can actually be much more general. First, you can take any distribution satisfying isoperimetry, something like a log-Sobolev inequality; I will say what that is in a moment. You can also take a mixture of almost n different such distributions and it is still fine, and you can have data-dependent label noise; you just need some expected noise level, so even if half your dataset is noiseless, it is fine. The trade-off is also tight in general, not just at the endpoints but at essentially any number of parameters, because you can project down to an intermediate dimension and use the same sum-of-bumps construction there.

So let me now say what I mean by isoperimetry. This is a key property of high-dimensional spaces, and the definition is that any Lipschitz function concentrates extremely well, with Gaussian tails: if f is L-Lipschitz, then P(|f(x) − E f(x)| ≥ t) ≤ 2e^{−dt²/(2cL²)} for some constant c. This is a common property of high-dimensional distributions: spheres, Gaussians, some slightly more exotic distributions, and it is implied by having a good log-Sobolev constant. So it is a strong property, but it is sort of generic in high dimensions; it is not a very rigid property.
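A minimal numerical illustration of this concentration, with my own toy function: f(x) = ⟨x, e₁⟩ is 1-Lipschitz, and for x uniform on the sphere its fluctuations shrink like 1/√d.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in [10, 100, 1000]:
    x = rng.normal(size=(5000, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)   # uniform on S^{d-1}
    f = x[:, 0]                                     # <x, e_1>, a 1-Lipschitz function
    print(d, f.std(), 1 / np.sqrt(d))               # empirical std matches 1/sqrt(d)
```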
Okay, as you can maybe tell, we tried to make this theorem somewhat realistic, not a spherical-cow result, so let me try to say how you could tie it to practice. Certainly real datasets are mixtures, so maybe uniform on the sphere is not very realistic, but I can think of my dataset as having a cat component, a dog component, maybe some more specific versions of these components; that seems fine. A trickier thing when trying to apply this in practice is that we will not know what the dimension is. The number of pixels in your image is obviously a very bad notion of dimension, but you could hope that there is some effective dimension, some data manifold, and that the result holds with that effective dimension. This leads to possibilities of extrapolating from small scales to large scales in various ways.

What is the noise? In this setup, if there is no noise, there is nothing to memorize: if I had this problem with no noise, I would just need one function in my function class, because we fix the function class before seeing the random dataset, and if the dataset had no label noise, we would already know which function to use. So this is really about memorization, and one interpretation of memorization is memorizing noise; that is the setup. I am not trying to argue that memorizing noise is good, but I think it is a good way to get a theoretical problem, and if you do not have a good a priori understanding of what your dataset will look like, it may be a reasonable way to model it. One related story is the long-tail hypothesis, which says that in a large dataset a significant fraction of the points will be essentially isolated: there is a long tail in the cluster-size distribution, so many of the images are unrelated to the other images, and from a learning point of view their labels can perhaps be thought of as noise. It is certainly an assumption that would be nice to remove, maybe with a slightly different setup.

Let me quickly say that if you look at the scale this type of prediction leads to, you do get something realistic. This is very naive, and you should not trust any of these numbers; I just want to show they are on the right order of magnitude. To extrapolate, you can look at MNIST, a relatively small dataset with relatively small models, where empirically good robust accuracy is achieved at something like a million parameters. If we want to extrapolate, we can say this means the effective dimension should be about 10, even though the naive dimension, the number of pixels, was a thousand. The discrepancy between a thousand and ten is a factor of a hundred, so maybe we just port this over: for ImageNet we use the same factor of a hundred and scale everything up. You could think of many other ways to do this extrapolation, but the point is that this gives something at the right scale for current practice, so this is a relevant scaling that we are looking at.
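The back-of-envelope extrapolation above, spelled out; the ImageNet figures are rough values I am assuming for illustration, and none of these numbers are measurements.

```python
# MNIST numbers quoted in the talk
n_mnist, pixels_mnist, p_robust_mnist = 60_000, 1_000, 1_000_000

# p ~ n * d_eff  =>  effective dimension implied by MNIST practice
d_eff_mnist = p_robust_mnist / n_mnist          # ~ 17, i.e. order 10
shrink = pixels_mnist / d_eff_mnist             # ~ 100: pixel count vs. effective dim

# port the same shrink factor to ImageNet-scale numbers (assumed rough values)
n_imagenet, pixels_imagenet = 1_000_000, 150_000
d_eff_imagenet = pixels_imagenet / shrink
p_predicted = n_imagenet * d_eff_imagenet
print(f"predicted robust parameter count ~ {p_predicted:.1e}")   # ~ 10^9, the right scale
```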
Okay, let me briefly say something about the proof, and then wrap up. I will give the proof in a very simple setting where the labels are pure noise: just i.i.d. ±1. The first step is to assume that the labels come out balanced, not too biased, which holds with high probability. This is assumed only once: we will use a union bound later, but this assumption is made once up front, so it does not get amplified by the union bound.

What does isoperimetry tell me? For any fixed Lipschitz function in my class, either the function is unlikely to be near +1 or it is unlikely to be near −1: one of the two label values is achieved by my classifier only on a very small fraction of the total state space. This is just concentration of measure on the sphere; it holds, for example, for a linear function on the sphere. And it means that if I fix the function f first and then look at my random dataset, f is quite unlikely to output both labels in a balanced way on the dataset: the probability is on the scale of e^{−nd}, because one of the labels is achieved on a very small fraction of the state space. So when we union bound over our function class, the class needs at least e^{nd} functions to contain a memorizer of these noisy labels, and roughly speaking a p-parameter function class is like a discrete function class of size e^p, so we need at least n·d parameters. Given the way things were set up, you just do a union bound over an ε-net of your function class, so it is actually not very technically sophisticated. Just to mention: if you want mixtures, you assume balanced labels within each mixture component, and if you want partial memorization, you think of adding sub-Gaussian variables in the right way, instead of the purely combinatorial way of thinking about it.
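The counting behind this sketch, written out with constants suppressed; this is just the union bound above, rearranged.

```latex
\[
  \underbrace{e^{O(p)}}_{\substack{\text{size of an }\varepsilon\text{-net}\\ \text{of the function class}}}
  \cdot
  \underbrace{e^{-c\,nd/L^{2}}}_{\substack{\Pr[\text{a fixed } L\text{-Lipschitz } f\\ \text{fits balanced } \pm 1 \text{ labels}]}}
  \;\gtrsim\; 1
  \quad\Longrightarrow\quad
  L \;\gtrsim\; \sqrt{\frac{nd}{p}}.
\]
% Tracking the noise level through the partial-memorization criterion
% upgrades this to the theorem's bound L >~ sigma * sqrt(nd/p).
```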
Let me skip to the end and mention a few open directions that I think are interesting. One is non-Euclidean geometries: abstractly the result still goes through, but it is much less clear to me in which cases it is reasonable to take this isoperimetry property, defined in terms of concentration of Lipschitz functions, as a realistic assumption. It would also be very interesting to understand what changes for other notions of robustness; maybe the Lipschitz condition can be weakened. One natural thing to try naively is a Sobolev norm, the L² norm of the gradient, but this does not lead to anything interesting in this setting, because even the silly sum-of-bumps function has an extremely small Sobolev norm: its gradient is supported on only a very small fraction of the state space. So it is important that we use the Lipschitz constant and not a Sobolev norm, though there may be other ways of getting around this. Finally, it would be interesting to see what happens with different architectures; maybe it is possible to prove different, more refined laws to compare with how the state of the art evolves as models continue to grow, and to prove a result that takes training dynamics into account, instead of just a uniform bound over a function class. Thanks very much for your attention, and let me know if there are any final questions.

Host: Any questions? Thank you, and thank you for the presentation.

Q: If you consider the effective dimension of the manifold, as you were suggesting, you would expect it to drop a lot relative to the number of data points in your set. So it seems that nowadays the number of parameters we have is way above the limit you need for robust classification. I was wondering whether you think it is a dynamical problem: are we reaching that solution, or are we ending up somewhere else in the space of parameters?

A: That is a good question, and I am really not sure. I think it is true, especially if you also consider an effective dataset size in addition to an effective dimension, that even this lower bound might still not get up to realistic levels. On one level, you can take it as a philosophical result: maybe it is not crazy to use lots of parameters; maybe there is some justification. But better or more refined lower bounds might also be possible with training dynamics. One relevant phenomenon is that you often want to train a very large model but can then remove a lot of the parameters afterwards. The lower bound here applies also to the number of parameters you end up with after sparsifying: it does not distinguish at all between the number of parameters used during training and the number in the final model. A version that takes training into account could potentially do that, which would be extremely interesting.

Host: Any other questions? Yes.

Q (on Zoom): Can you hear me? There was a slide with a parameter for the nonlinearity; can you go back to that slide? Yes, this one. What is the stability condition here? If we introduce the nonlinearity into the system, it holds under some bounds, right? What are the upper and lower bounds?

A: Maybe you are asking about the fact that I am assuming my nonlinearity is 1-Lipschitz when I write this product of matrix norms?

Q: Let me try to be precise: W_L is multiplied by α, and W_{L-1} is multiplied by α; what is the condition for stability of the nonlinearity?

A: Are you asking about the product of norms at the bottom, or the definition of the function f?

Q: The f(x).

A: Okay, so it is composition, not multiplication: you multiply by a matrix, then apply the nonlinearity to every entry, then multiply by the next matrix, and so on.

Q: I see, it is a composition; I was thinking it was a multiplication. What kind of nonlinearity is being considered in this scenario?

A: Essentially anything that is, say, 1-Lipschitz. There are a lot of standard nonlinearities used in practice; the ReLU is the most common, but you can make up all sorts of variations, and they are all nice Lipschitz functions.

Q: Thank you.

Q: Thanks for the talk. Concerning this effective dimension: if you have some kind of hidden manifold, do you need the effective dimension to scale linearly with the ambient dimension, or can it be smaller?

A: It can be anything, because it is just the log-Sobolev constant that enters.

Q: And does your result say anything when the perturbation is at the level of the examples rather than the labels, or is that much harder to analyze? Do you think the same final result would apply, or is it specific to label noise?

A: That is a good question, and I am not sure what happens if the noise is on the inputs.

Q: Thank you. Quick question: you showed the double descent curve at the beginning, and I was a little confused. Could you comment on how you see the connection between your result and that literature?

A: I was just trying to say that there is a lot of work on overparametrization at large, and that is a flagship example. I think the most distinctive feature of what I was talking about is that it looks at a much larger amount of overparametrization: most of those results consider overparametrization by a constant factor or so.

Q: A quick follow-up: could you comment on your last point, about how SGD training could be taken into account in this framework?

A: I do not have any proof ideas, if that is what you are asking. I would expect you would want to start with some very specific model; when we were thinking about this problem we also started with a very specific model, and somehow things became much more general, but it is hard for me to imagine that happening for a result about SGD training.

Q: Thanks. I also have one, about the scale of the data. What happens if you consider data that still lies on the d-dimensional sphere but with norm √d, so that every component is Θ(1), and you also scale up the label noise? How does the result change? Do you just blow up the Lipschitz constant by √d, or, if you still insist on Θ(1) labels, what would happen?

A: Sure. The way I think about it, maybe let me find the slide, is that there is a natural Lipschitz scale to aim for once you give me the scale of the inputs and the outputs. What I want for robustness is that if I change my input by a small amount relative to its norm, then the label should change by a small amount relative to its scale. That is a clean way to describe what is happening, and everything goes through if you use that scaling.

Host: Thanks. Any other questions? Yes.
Q: Thank you. You had some examples at the end with realistic networks, comparing MNIST and ImageNet. Were those networks also robustly trained?

A: Yes. The result I was citing for MNIST is the paper by Madry and his lab on robust classification, and they used projected gradient descent training. They did not verify the Lipschitz constant or anything like that; they just found that it was difficult to find adversarial examples.

Host: Right. More questions?

Q: Do you know how to compare this result with the result of the previous talk? There you have regimes with clustering phenomena among neurons or parameters, where you might enter the heavily overparametrized regime but then end up unable to memorize anymore, because training effectively reduces the number of parameters. Do you think that connects with what you discussed?

A: Let me see. I am really not sure; I have not thought about this, but maybe we can talk afterwards if you would like. I feel like I should not make things up on the spot.

Host: A final question, and then the group picture.

Q: I was trying to understand your result compared to the double descent curve, because there, even at the optimum of the generalization error, the regime is that the sample size is proportional to the number of parameters. From the robustness point of view, your lower bound on the parameter count says the model in that regime is not robust, but that is exactly the regime for double descent curves. Is there any explanation for that?

A: So maybe let us just think about linear regression for double descent. If you are doing linear regression over the sphere, then the natural scale for a classifier has Lipschitz constant like √d, because you want a typical point on the sphere to get a non-trivial label. So, it is a bit strange, but if you think about linear regression in the context I was talking about, it is simply not robust, if you want a random point to get a meaningful label.

Q: But even in that setup, when you increase the number of parameters, you would not get a robust classifier; you would always have to keep the constant of order √d?

A: Yes, that is right. I just think this model is not very well suited to linear regression on the sphere.

Q: Okay, thank you.

Host: Right, so let's thank the speakers once again.