Okay, yeah, thanks. So I'll be speaking about this work we have on the Deep Bootstrap framework, which is a different framework towards understanding generalization in deep learning. And this is joint work with Behnam Neyshabur and Hanie Sedghi.

So the motivation for this work is to understand why the deep learning methods that we use in practice actually work, and by "actually work" we mean give us functions with small test error or test loss. And we hope that such an understanding can help us predict how certain design choices in deep learning affect what we care about: how architecture changes things, how the learning rate or optimizer affects things, and so on. In this work we don't solve this problem, but we set out a framework towards solving it. The proposal at a high level is something like this: we introduce a conjecture, called the Deep Bootstrap conjecture, and assuming this conjecture, generalization reduces to certain problems in optimization. Which means that if we want to understand generalization, it's sufficient to understand these problems in optimization, and to prove this conjecture. We don't prove the conjecture, but we give experimental evidence for it, and the conjecture itself is somewhat surprising, so we'll get into that.

So, briefly, the setting we'll consider in this talk is supervised classification. We have some distribution D on inputs x and labels y — image classification, for example. And we want to find some classifier f that has small test error on the distribution, so a small probability of misclassifying a test example. What we actually do is sample a train set from this distribution and run SGD on some neural network to minimize the train error on our train samples. And the whole question is, why does what we do give us what we want? That is, why does minimizing the train error in this way actually give us a model with small test error?

The following is a framework to approach this. The main idea is to compare the real world to what we call the ideal world. So let me define this now. Let's fix a distribution D, like ImageNet, fix an architecture, and fix a number of samples n. We define the real world as follows. The real world corresponds to actually training a neural net on n samples for T optimizer steps. Concretely, you first sample a train set of n samples from the distribution, then initialize an architecture from the family. And then for T SGD steps you sample a mini-batch from the train set and take a gradient step on that mini-batch. In particular, you reuse the same train set — the same set of samples — for multiple mini-batch steps as time continues, so in the real world you do multiple epochs, multiple passes through your train set. And at the end you output the model. This corresponds to actually training a network as people do.

Now we define the corresponding ideal world as the exact same thing, except you never reuse samples. Every time you want to take a gradient step, you sample a fresh mini-batch from the distribution and step according to that. But besides that, everything else is exactly identical: the same architecture, the same mini-batch size, the same optimizer. Everything is identical, except you get fresh samples each time. So imagine a setting where you just have some infinite stream of samples from the distribution, and every time you want to take a step you get a fresh batch and step according to it. Okay. The sketch below makes this contrast concrete.
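Here is a minimal sketch of the two training loops in PyTorch. This is my own illustration, not code from the paper; the model, the population sampler sample_from_D, and the hyperparameters are placeholders I'm assuming.

    import torch
    import torch.nn.functional as F

    def real_world(model, train_x, train_y, steps, batch_size=128, lr=0.1):
        # Multi-pass SGD: reuse the same n train samples for all T steps (many epochs).
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        n = train_x.shape[0]
        for _ in range(steps):
            idx = torch.randint(0, n, (batch_size,))   # mini-batch from the SAME train set
            loss = F.cross_entropy(model(train_x[idx]), train_y[idx])
            opt.zero_grad(); loss.backward(); opt.step()
        return model

    def ideal_world(model, sample_from_D, steps, batch_size=128, lr=0.1):
        # One-pass SGD on the population loss: a fresh i.i.d. batch every step.
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(steps):
            x, y = sample_from_D(batch_size)           # never reuses a sample
            loss = F.cross_entropy(model(x), y)
            opt.zero_grad(); loss.backward(); opt.step()
        return model

Everything except the data source is identical between the two loops, which is exactly the point.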
So in other words, the real world is SGD on the empirical loss. And this is a somewhat unprincipled thing to do, because you care about the test error but what you're actually minimizing is the train error. On the other hand, the ideal world is exactly SGD on the population loss. Here you're directly minimizing the quantity that you care about; there's no notion of a train set even, you're just essentially directly minimizing the population loss using SGD. So this is a principled thing to do. And a priori these two objects are very different — this is essentially the difference between one-pass SGD and multi-pass SGD, and these could be very different objects. But our main claim is that it turns out that the models produced by both of these worlds have very similar, almost identical, test errors. It's important, though, that you run these worlds for the same amount of time. I'll define this claim more precisely now, and we'll see some experiments.

So here's an example to illustrate first. Here we're going to train three different models in both the real and the ideal worlds and compare their test errors. The x axis is SGD steps, and the y axis is test soft error — the soft error of the softmax classifier, so the probability mass on the incorrect labels. Now the solid blue line corresponds to training a ResNet in the real world. We're using a CIFAR-like problem, so the real world here is training on 50k samples for 100 epochs, and we're tracking its test error over time. The dashed blue line shows the exact same model training in the ideal world. The ideal world here corresponds to getting a fresh 50k batch of samples from the distribution in each of these epochs, so the ideal world is training on 5 million samples in just one pass. And the point is that these two lines are almost on top of each other, meaning that the real-world and ideal-world test errors remain close, even though the ideal world is seeing 100 times more samples and never reusing samples.

And this holds for other models as well. For example, an MLP is shown in red. An MLP in the real world usually generalizes poorly on image distributions, and we see that in the solid red line, where it has high test error. But it turns out that the fact that MLPs generalize poorly is exactly captured by the fact that they optimize poorly in the ideal world: even if you had infinite samples, the MLP just optimizes very slowly. So we see here that models which optimize faster in the ideal world generalize better in the real world. And this is especially interesting because all these models train to nearly zero train error. The MLP here has a very large generalization gap — the difference between train and test is very large — but the difference between real-world test and ideal-world test is very small. And this is the point of our work: it could be more meaningful to compare real world to ideal world than to compare train and test, because train error can always be zero and thus not meaningful, whereas this could be a more meaningful metric.

Here is the more precise claim. The claim is that SGD on deep nets on realistic distributions produces similar models whether they're trained on reused samples in the real world or fresh samples in the ideal world, where the similarity is measured by test soft error (sketched below), and the similarity lasts for as long as the real-world optimizer is still moving — by which we heuristically mean that the train error is at least 1%.
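For reference, here's one way to compute the test soft error used as the similarity metric. This is my reading of "probability mass on the incorrect labels", assuming a model that outputs logits; it's a sketch, not the paper's code.

    import torch

    @torch.no_grad()
    def soft_error(model, x, y):
        # Soft error: average softmax probability assigned to incorrect labels,
        # i.e. 1 - p(correct label). The hard error would use argmax instead.
        probs = torch.softmax(model(x), dim=1)
        p_correct = probs[torch.arange(len(y)), y]
        return (1.0 - p_correct).mean().item()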
Okay, so why do we need the second point? Because the real and the ideal world can't be close for all time. If we send time to infinity, then if the model is big enough, the ideal world will always continue to improve and will eventually just hit the global minimum of the population loss. Whereas if we send time to infinity in the real world, at some point you'll converge on the train set and stop making progress. So the real world will plateau after some point where the ideal world will not, if the model is big enough. That's why this criterion is necessary. We essentially say the worlds are close for as long as you could hope for, which is as long as the optimizer is still moving.

Okay, here's another way of stating the same claim. You can see it as proposing a different way to decompose test error. The classical way to decompose test error, in the empirical risk minimization framework, is to write test error as the train error plus the generalization gap — comparing test to train, plus the difference. We're saying: don't compare test to train; instead, compare the test error to the ideal-world test error — the error if you never reused any samples — plus the difference between worlds. This difference is what we call the bootstrap error: the gap between the real and ideal worlds. And it depends on all of the problem parameters: the number of samples, the distribution, the architecture, and the number of steps. But the main claim is that this bootstrap error is actually uniformly small in most settings — that is, for realistic problem parameters and all times less than the stopping time, this bootstrap error is small. The stopping time T(n) is the time when the real world has stopped moving, so realistically, when it reaches less than 1% train error.

Here's an experiment to illustrate the importance of the stopping time. Here we've trained the same model in the ideal world, and in real worlds with varying train set sizes. The dashed black line shows this model training in the ideal world, and each of the differently colored lines shows a different real world with a different train set size. All of the real worlds are stopped at the point where they converge on their train sets. So for example, if we train on 1000 samples, then the real world stops optimizing very quickly, but up until the point where it stops, the real world is close to the ideal world. And that's the point: the real-world and ideal-world lines are close up until the point where the real world has converged. There is some divergence for a small number of samples, but this gap goes down as the number of samples increases. So this is why the stopping time is important — the stopping time is the time until the real world converges, and the claim is that until that point, these two lines are close.

So here's another way to state the same claim. The three objects of interest are the following. There's the learning curve L(n), which is just the test error if we train to convergence on n samples in the real world. There's T(n), the time to convergence — the number of SGD steps it takes to converge on n samples. And finally there's L-tilde of t, which is the ideal-world learning curve — the test error after t online SGD steps.
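Written out in notation (my notation, which may differ slightly from the paper's), the decomposition and these three quantities are:

\[
\underbrace{\mathrm{TestErr}_{\mathrm{real}}(t)}_{\text{what we care about}}
\;=\;
\underbrace{\mathrm{TestErr}_{\mathrm{ideal}}(t)}_{\text{pure optimization}}
\;+\;
\underbrace{\varepsilon(n, \mathcal{D}, \mathrm{arch}, t)}_{\text{bootstrap error}},
\qquad \varepsilon \text{ claimed small for all } t \le T(n),
\]
\[
L(n) = \mathrm{TestErr}_{\mathrm{real}}\big(T(n)\big), \qquad
T(n) = \text{steps until train error} < 1\% \text{ on } n \text{ samples}, \qquad
\tilde{L}(t) = \mathrm{TestErr}_{\mathrm{ideal}}(t).
\]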
And the Deep Bootstrap claim is saying that these objects are related in the following way: the test error on n samples is approximately the ideal-world, online test error at the time it took to converge on n samples — that is, L(n) ≈ L-tilde(T(n)). And the point is that the left-hand side is a generalization quantity, while both of the quantities on the right-hand side are optimization quantities.

In particular, this implies that good procedures in deep learning have the following two qualities. First, they optimize quickly on infinite samples in the ideal-world setting — that's saying L-tilde(t) converges quickly. And second, they don't optimize too quickly on finite samples — that's saying T(n) should be large. These are somewhat in conflict, but the main point is that you can view all advances in deep learning through their effect on these two factors. For example, some advances — using large models, skip connections, batch norm, and so on — help primarily because they improve the online optimization speed. Whereas other things — like regularization and data augmentation — help primarily because they slow down the empirical optimization. Ideally you want to do things that have both of these properties. I think in general this kind of factorization is a nice conceptual way to think about things.

As mentioned, the significance of this framework is that if you believe this bootstrap error — this real-world versus ideal-world gap — is small, or you can prove it (which we can't do yet, but which would be great to have), then it reduces understanding generalization to understanding two things in optimization: online optimization and empirical optimization.

I'll briefly describe our experiments here, but there are more details in the paper. Roughly, we tried to do experiments to show that this bootstrap error is small in a variety of settings. To do this, we need to simulate the ideal world, and for that we need lots of samples, so we do two things. The first thing is, we use a good generative model for CIFAR and sample it 5 million times to get enough samples to simulate the ideal world — this is a synthetic dataset, but it's fairly realistic. The other thing is we use ImageNet, where we subsample ImageNet for the real world and use the entirety of ImageNet to simulate the ideal world. In all these cases we then vary a lot of these design choices in deep learning — the architecture, the optimizer, the learning rate, the loss function, and so on — and we show that the bootstrap gap is small, so that the real world is actually close to the ideal world. That's just a quick summary, but there are a lot more details in the paper.

So now, if we believe the bootstrap gap is small, we can use it as a lens to study other phenomena in deep learning. The general principle is: whenever you change something in the real world, like the architecture or the optimizer, you should go and look at its effect in the ideal world. I'll give a few examples of this, after the sketch below.
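First, though, here's a minimal sketch of how one could check the claim numerically. The 1% threshold is the heuristic from the talk; the error curves here are made-up toy numbers purely for illustration.

    def stopping_time(train_errors, threshold=0.01):
        # T(n): the first SGD step at which the real-world train error drops
        # below the heuristic 1% threshold, i.e. the optimizer "stops moving".
        for t, err in enumerate(train_errors):
            if err < threshold:
                return t
        return len(train_errors) - 1   # never converged within the run

    # Toy usage with made-up per-step soft-error curves:
    real_train = [0.50, 0.20, 0.05, 0.008, 0.001]
    real_test  = [0.60, 0.40, 0.30, 0.280, 0.279]
    ideal_test = [0.60, 0.40, 0.31, 0.290, 0.250]

    T_n = stopping_time(real_train)                      # -> 3
    bootstrap_gap = abs(real_test[T_n] - ideal_test[T_n])
    # The claim: this gap stays small for all t <= T_n, so in particular
    # L(n) = real_test[T_n] is approximately L_tilde(T(n)) = ideal_test[T_n].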
The first example is pre-training. We know that pre-trained models often generalize better in the real world, and in fact this behavior is captured almost exactly in the ideal world: it turns out that pre-training a model improves its optimization properties in the ideal world. In particular, pre-trained models generalize better, in some sense, because they optimize faster in the ideal world. Here's an experiment showing a Vision Transformer trained from scratch (the red line) and pre-trained on ImageNet (the blue line), and we see that the real and ideal worlds are still close in both cases. In particular, pre-training on ImageNet helps the online optimization of this architecture just as much as it helps the real-world generalization. Now, why pre-training helps optimization, I don't know, but at least this focuses our attention: if we want to understand this generalization behavior, maybe we can just understand what pre-training does to the online optimization.

One interesting detail here: we're not claiming that the real and ideal worlds produce identical models, or identical distributions over models. One way to see this is that the real-world models always interpolate their train set by the end, whereas the ideal world never interpolates. This is evident by looking at different metrics. We showed that the test soft error is close, and the hard error itself is also reasonably close. But if you measure the test cross-entropy loss, then the real world diverges significantly from the ideal world. This is because in the real world, if you train to convergence with cross-entropy on the train set, you'll blow up the logits, and the test loss will diverge. On the other hand, the test loss is always monotone decreasing in the ideal world, because that's exactly what we're optimizing. So this is a subtlety about the claim: the models are not actually identical in the real and ideal worlds; they're just close in test soft error.

Actually, since I have a bit of time, here's another interesting example of using this to study various things in deep learning: we can convert questions about implicit bias into explicit questions about optimization. Here's an experiment that illustrates this. We can take two architectures: first a ConvNet, and second a fully connected net that subsumes the ConvNet — that is, the fully connected net can represent all the same functions that the ConvNet can, and more (see the small demo below). Now let's train both of these to zero train error on the train set. We find that the ConvNet generalizes better. The traditional view on this is to say that some implicit bias of SGD is going on: there are many local minima, but for some reason SGD on the ConvNet finds a better-generalizing local minimum than SGD on the fully connected net. On the other hand, our perspective says it's not really about local minima — we can forget about the train set, because the difference between these networks is evident even in the online optimization setting. So it can be stated as an explicit property of just the population loss, without referencing local minima: the ConvNet just optimizes faster on the population loss, and that's why it generalizes better.
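As an aside, here's a tiny demo of what "subsumes" means: a conv layer (without bias) is a linear map, so a dense layer can represent it exactly. The shapes here are toy choices of mine, not the architectures from our experiments.

    import torch

    x = torch.randn(1, 1, 6, 6)                          # one 6x6 single-channel image
    conv = torch.nn.Conv2d(1, 1, kernel_size=3, bias=False)
    y_conv = conv(x)                                     # shape (1, 1, 4, 4)

    # Recover the equivalent dense weight matrix by probing with basis images.
    basis = torch.eye(36).reshape(36, 1, 1, 6, 6)
    W = torch.stack([conv(e).flatten() for e in basis], dim=1)   # (16, 36)
    y_fc = W @ x.flatten()                               # the same map as one Linear layer

    print(torch.allclose(y_conv.flatten(), y_fc, atol=1e-5))     # True

So the fully connected net's function class strictly contains the ConvNet's, which is what makes the generalization difference between them interesting.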
There are other implications as well, but let me conclude here. In the end, we reduced one hard problem — generalization — into two hard problems: online and offline optimization. But these two hard problems are different, so we've actually gained something: it might be more approachable to understand these optimization aspects than the generalization one.

In the future, I would like to see more research on the online optimization aspects of deep learning — similar to the last talk, actually — and there are a couple of reasons for this. First, the largest models these days are trained for less than one epoch, which puts them actually in the ideal world. Second, and perhaps most interestingly, many of the mysteries of deep learning still remain in the ideal world. There's no generalization problem when you have an infinite stream of samples, but you still have all the other problems: you still have to decide how to choose your architecture, why representation learning happens and what it means, how to train robust models, and so on.

Finally, I speculate that this claim could actually hold more broadly than just deep learning. The claim that good online optimizers are also good offline generalizers could hold more generically, for any kind of reasonable online learner, not necessarily just SGD on the deep nets that we actually use in practice. This is a speculation, and it would be interesting to see more work in that direction, either refuting or confirming it. Thanks.