Our guest here today is Ard Louis. Ard received his PhD in 1998 at Cornell University, went on to do a postdoc in Cambridge and stayed there as a Royal Society Research Fellow. Since 2006 he has been a professor of theoretical physics here in Oxford, where he leads an interdisciplinary research group focusing on connections between theoretical physics, chemistry, applied maths and biology. A central topic in his research is the study of how complex behaviour emerges in systems of many interacting objects. This theme has led him to investigate self-assembling nanostructures, the physics of evolution, and the interplay between Brownian motion and fluid dynamics. Today he will talk about emergent behaviour of a different sort: his talk will discuss some important properties that appear in large-scale artificial neural networks. And so, Ard, please take it away when you're ready.

Great. Philip, thank you very much for the invitation. It's a great pleasure for me to be here. Today I want to tell you about deep neural networks, one of the main themes of this seminar series and something I think many of you will know quite a bit about. I'm going to argue that they have a kind of inbuilt Occam's razor, and that this implicit bias is incredibly important for understanding why they work so well.

Physicists are taught from a very young age that having more parameters than data points is bad. This is nicely illustrated in a little article by the late Freeman Dyson, where he describes travelling from Cornell to Chicago to meet Enrico Fermi, because he had a theory he was very excited about that would explain some aspects of meson physics. He showed Fermi his theory, and Fermi asked him, well, how many free parameters are there in your theory? I think there were four, at which Fermi said, you know, my friend Johnny von Neumann says that with four parameters I can fit an elephant, and with five I can make him wiggle his trunk. That is about the ultimate put-down in the world of physics, because Dyson had put too many parameters into his theory. A group not too long ago actually checked this, and indeed von Neumann was right, as he often was: in a little paper in the American Journal of Physics they show that you can indeed draw an elephant with four parameters and make it wiggle its trunk with five.

The intuition we have is that if you use too many parameters, you're just going to overfit and get nonsense. I think the intuition goes deeper than that: we are very strongly taught that simpler theories are more elegant, more beautiful and more likely to be true, that this reflects something about the world. So it's not just an epistemological argument, something practical that we use, but something ontological that we believe is true about the world. Theoretical physicists, at least, tend to think this way.

That brings us to neural networks, because neural networks are in fact heavily over-parameterized: they can have millions or billions of parameters, typically many, many more parameters than data points. Let me give you an example of how we might teach a physics undergraduate why over-parameterization is bad. I give you these data points right here. First, this red dashed line, which is a fifth-order polynomial fit, looks something like this.
At this order I have some bias and some variance, but I'm not doing such a bad job on these points. If I then fit, say, a 20th-order polynomial, I can fit the data perfectly, but I'll probably get this wildly oscillating behaviour, which is probably nonsensical; my model is over-parameterized. There are several different versions of this polynomial fit with 20 parameters to this data, and I have no really clear way of adjudicating between which one is correct.

Contrast that behaviour with a fully connected network with one hidden layer and 1000 hidden units, so many more parameters than data points. This green line shows a whole series of networks with different numbers of hidden layers, from one to five. The neural network gives you this much smoother curve. And the question is: why did the neural network not give you a highly oscillating curve? There are theorems that tell us neural networks are highly expressive; in fact they can express essentially any function, including polynomials. So the network certainly could represent an oscillating function like this, but somehow it has chosen not to, even though you haven't told it to do so. I think that's surprising. Here we have the opposite of von Neumann and Fermi's dictum: you have not just enough parameters to make an elephant and wiggle its trunk, but enough parameters to make it do whatever you like. And we haven't done any kind of regularization, any trick that limits the number of parameters. Yet we see these neural networks fit this data in what, to our eye and intuition, seems like the right way of fitting it. Why does it work? Why does the classic bias-variance trade-off, which you might use to formalize this idea about the number of parameters, not hold for neural networks?

For that I want to take a step back and look at a very lovely paper by Lenka Zdeborová from Paris, which came out in Nature Physics recently. She says that to understand deep learning, the machine learning community needs to fill the gap between theoretically rigorous work and the end products of the engineering process, all while keeping scientific rigour intact, and that this is where the physics approach and experience come in handy: the virtue of physics research is that it strives to design and perform refined experiments that reveal unexpected yet reproducible behaviour, and it has a framework to critically re-examine and improve theories explaining the empirically observed behaviour. The point of this little commentary is that the field is somewhat split between mathematicians who focus on rigour and engineers who just want things to work, and there's a gap in between when it comes to explaining the big questions. One big question she mentions is in fact an old one: in 1995, she says, the influential statistician Leo Breiman summarized three main open problems in machine learning theory, and I'll just list one. Why don't heavily over-parameterized neural networks overfit the data? That is the question she is asking, and the question he asked in 1995, quite a long time ago, 25 years ago, and she points out that these questions are still open today and are the subject of much of the ongoing work in the machine learning community. So this question, why do over-parameterized neural networks not overfit, is an old one, and even in this very recent article it is considered an unsolved problem.
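To make the opening example concrete before moving on: here is a minimal sketch, not the talk's actual figure, of fitting the same noisy points with a low-order and an interpolating polynomial; the data, noise level and polynomial degrees are illustrative assumptions.

```python
# Illustrative sketch: low-order vs interpolating polynomial fit on noisy data.
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 15)                            # 15 data points
y = np.sin(2.0 * x) + 0.1 * rng.standard_normal(x.size)   # toy "truth" plus noise

low = Polynomial.fit(x, y, deg=5)     # some bias, some variance, but sensible
high = Polynomial.fit(x, y, deg=14)   # interpolates all 15 points exactly

x_dense = np.linspace(-1.0, 1.0, 400)
print("max |5th-order fit| :", np.abs(low(x_dense)).max())
print("max |14th-order fit|:", np.abs(high(x_dense)).max())  # typically much larger swings between data points
```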
So I'm going to take a typical physics approach. One of the things Zdeborová says in that paper is that we ought to look for models like the Ising model that physicists like to use, simple models that give us some understanding of what's happening. So I'm going to give you what I'll call the Ising model of supervised learning.

This is a little problem we'll call the doctor's decision table, for COVID-19 to make it practical. Let's say you are a doctor, and I present you with some symptoms. The doctor can ask seven questions: do you have a fever, do you have a cough, have you lost your sense of smell, are you old, do you have a heart problem, are you obese, do you have diabetes. If all those things are true, the doctor may say, I'm going to send you to the hospital. If all those things are true but you're young, the doctor may ask about your other co-morbidities; if you have a bunch of co-morbidities, the doctor might still send you to the hospital. If you have the symptoms but you're young and otherwise healthy, the doctor may say, sorry mate, stick it out at home, and so on. You can imagine there are lots of different ways these symptoms can present.

So one question you might ask is the following: you watch a doctor answering these questions for a certain amount of time. Can a machine learning algorithm, a neural network, learn from a certain subset of the answers the doctor has given what the doctor's answers will be on any possible set of inputs? We can think about this a little more formally. This is basically a function: the set of outputs the doctor gives to all possible combinations of the seven questions. It takes the inputs, of which there are two to the power seven, 128, because each question can be answered zero or one, and maps each of them onto a yes or no to the hospital. Then the question is: how many such functions are there? There are two to the two to the seven possible functions, that is, two to the 128, which is roughly three times ten to the 38. So although this problem may seem relatively small, the number of possible sets of answers the doctor could give to all possible questions is about ten to the 38, which is very, very large. And so a question might be: how does a machine learning algorithm learn this, and how does it do well, because typically these algorithms do well on relatively small amounts of data? How do they pick between all these different functions which they can represent?

To do this, I'm going to think about these neural networks as mappings. Here's a little example of a neural network with two layers. You have inputs, you have weights, these are multiplied, and there are some offsets; at the end you may have some nonlinearity, say a softmax, which gives you an output. So we have the space of all functions the model can represent, in this case the two to the 128 possible functions, and the model, your neural network, has a bunch of parameters in it.
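Spelled out, the counting behind those numbers is:

$$
\#\{\text{inputs}\} = 2^{7} = 128, \qquad
\#\{\text{functions}\} = 2^{2^{7}} = 2^{128} \approx 3.4\times 10^{38}.
$$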
Then we can define a parameter-function map as a map from the set of parameters of this model, all the weights you put in, to the function space: a certain set of parameters gives you a certain function, so the neural network with a certain set of weights, given each of those 128 inputs, gives you a yes or no to send you to the hospital. We called this the parameter-function map in a paper just a couple of years ago. This is a helpful abstraction because I can then study the parameter-function map: what happens if I randomly pick parameters, what kind of functions do I find?

What my student Guillermo found, which was perhaps somewhat surprising at first sight, was that if I randomly sample parameters, then although there are three times ten to the 38 possible functions, I find certain functions with much higher probability than others. In fact this one here I find about 10% of the time: the function that is all zeros or all ones, which basically says send everybody to the hospital or send nobody to the hospital. And as I keep sampling, I find a bunch of functions with remarkably high probabilities compared to the mean probability of ten to the minus 39. What we find, basically, is that if you randomly sample parameters, certain functions are much more likely to appear than others, and I'll show you a bit later that we have a theory that predicts which ones: the functions that are simple, where simple means they have a short description. If everything is all zeros or all ones, the description is "make everything zero" or "make everything one"; it's short, it's simple, and it can be captured by a compressor. So we compress these outputs with a Lempel-Ziv-style complexity measure, which is a measure of simplicity, and what we find is that the high-probability functions are all simple. In other words, this particular neural network has an intrinsic bias towards simple functions: it is much more likely to produce simple functions than complex ones.

Does that help with generalization? That is the next question, and the answer is yes. Here we have a whole bunch of different target functions; by target function I mean a doctor who has given a certain full set of answers. A very simplistic doctor sends everybody to the hospital or nobody to the hospital, and as the doctor gets more sophisticated, the answers get more complicated. What you see is that for simple target functions, simple sets of answers to the questions, this is the generalization error the neural network makes if I train it on half of the set of possible inputs, which is 64, and see how well it does on the other 64. If the target function is simple, the neural network actually makes relatively small errors, 5% or 10% errors, which is remarkably good given how many possible functions there are. The thing is, if I give you 64 inputs, there are still two to the 64 functions that fit that data, because the function can be anything on the unseen data. If I pick one of those functions at random, which is what a very naive learner would do, then the error I get is basically 50%.
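A minimal sketch of the random-parameter-sampling experiment described above, on the 7-bit Boolean problem; the architecture, width and sample count here are illustrative assumptions, not the exact setup behind the numbers quoted in the talk.

```python
# Sample random weights of a small over-parameterized net on all 128 Boolean inputs
# and tally how often each output function appears.
import numpy as np
from collections import Counter
from itertools import product

inputs = np.array(list(product([0, 1], repeat=7)), dtype=float)  # all 2^7 = 128 inputs

def random_function(hidden=40, rng=np.random.default_rng(0)):
    """Sample a 1-hidden-layer net and return its Boolean output string on all inputs."""
    w1 = rng.standard_normal((7, hidden))
    b1 = rng.standard_normal(hidden)
    w2 = rng.standard_normal(hidden)
    b2 = rng.standard_normal()
    h = np.tanh(inputs @ w1 + b1)
    out = (h @ w2 + b2) > 0
    return ''.join('1' if o else '0' for o in out)   # 128-character string = one function

counts = Counter(random_function() for _ in range(100_000))
for f, c in counts.most_common(5):
    # The most frequent functions tend to be trivial (all 0s or all 1s) or otherwise low-entropy.
    print(c / 100_000, f.count('1'))
```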
So the vast majority of functions that fit the data give me bad generalization, but the simple functions towards which the network is biased give me good generalization. Here's another way of looking at it: I plot the generalization error for a target of a given complexity, and each of these little dots is what the neural network finds upon training by SGD from a different random starting position. You see that most of the time it finds relatively simple functions, which generalize well, whereas the random learner, which picks a function consistent with the data at random, just finds a bunch of essentially random functions, none of which generalize very well. If, in contrast, I try to learn a very complex function, then neither the neural network nor the random learner does particularly well.

So neural networks have this inbuilt Occam's razor, and they work well on structured data. Let's take a step back and think about our doctor who's answering questions about patients. We already saw some patterns in those answers: if the person is old, they're very likely to be sent to the hospital; if they're young, they're only sent to the hospital if they have co-morbidities, for example. These patterns tell you that the kinds of answers the doctor gives are simple compared to the full set of ten to the 38 functions they could give. The bias towards simplicity is helpful if you're looking at structured data, and the claim would be that most of the data we care about is in fact structured in one way or another.

Now, why is there simplicity bias in this parameter-function map? Well, we showed it empirically in the previous slide simply by sampling. We had some very bright undergraduate students, including Chris Mingard, now a PhD student with us, who came to me and worked out a proof for some simple networks. You can show, for example, that if you have a perceptron, which is about the simplest network, and you sample the parameters, then if you look at the output and count the number of zeros and ones, you are equally likely to get zero ones as one one, as two ones, as three ones, and so on. But of course there are many, many ways to put 64 ones into a string of length 128, so this tells you that the individual strings that are simple, the low-entropy strings, are much more likely to appear than high-entropy strings. And we've been able to prove mathematically that this bias only gets stronger as you increase the number of layers. I've marked three names in red here because three undergraduates from Hertford, where I'm affiliated, did this really beautiful work.

Now, I want to ask a bigger question: we can prove this for these very simple neural networks, but do we expect it to be more generally true? I've used this term Occam's razor; I should probably spell Occam slightly differently here: Ockham. William of Ockham was a Franciscan friar, a scholastic scholar, who lived quite a long time ago, partly, most likely, in Oxford, possibly at Merton College; it's still contested whether he was there or not. His most famous statement, that entities are not to be multiplied without necessity, actually only appears around 1639 in a commentary by John Punch on Duns Scotus, another scholastic. What Ockham actually said, what you find in his texts, is this little quote, which basically means plurality is not to be posited without necessity.
And in fact, if you look through history, from Plato and Aristotle onwards, the idea that simple explanations are better than complex ones, or that you shouldn't multiply explanations without necessity, is very common. There are lots of modern approaches to this: there's a very nice Bayesian approach by David MacKay, which I don't actually think is correct but which is interesting, and I'll show you in a moment something from algorithmic information theory. But it's still contested, and philosophers who look at this, not unexpectedly for philosophers, disagree strongly with one another, though they have some very interesting things to say about this idea of Occam's razor. Regardless of whether the reason for Occam's razor is clear, in practice we know that it works remarkably well. So the fact that these neural networks appear to have a bias for simple functions raises the question I would like to ask: why do they exhibit this inbuilt Occam's razor, and is it a more general property?

What I am going to try to explain is an intuition that comes from the famous trope of monkeys at a keyboard; here's a chimpanzee at a word processor. I used to use a typewriter in this picture until I realized that younger people often don't know what those things are. Here the input is equal to the output, so how likely is the monkey to type the first 10,000 digits of pi? Well, it's one over the number of keys to the power of the length plus one, the plus one being for the decimal point. The fact is, this string is exactly as likely as any other string of that length; they're all equally likely, or rather equally unlikely, because there's a uniform probability of getting any given string of a certain length.

Now suppose that, instead of typing into a word processor, the monkeys type into a computer language like C. Then, interestingly, there are short codes that will generate pi; this one, from Dik Winter in Amsterdam, is a short C program that generates the first 15,000 digits of pi using a spigot algorithm. A monkey might by accident type this code, and suddenly pi would come out. This tells us that if you think about monkeys typing into some kind of computer, then certain outputs, like pi, or like 01010101, which is just "print 01 so many times", will be much more likely than strings of that length which are truly random.

This intuition has been formalized in a beautiful piece of mathematics often called algorithmic information theory. Here are two of its founders, Kolmogorov and Chaitin, who basically pointed out that if I have an individual string, I can describe its complexity as the length of the shortest program on a universal Turing machine that generates that string. A universal Turing machine is Turing's idea of a machine that can do any possible computation. It's universal because, if it can do any possible computation, then given one universal machine and another universal machine, one can always be compiled to the other. So this description is asymptotically independent of the actual machine that I use. The intuition is that if I have a string like 010101..., which is just "print 01 fifty times", its Kolmogorov complexity is low; it's simple. Whereas the string below it, to my knowledge, has no shorter description than just "print the string".
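In symbols, the definition being described is:

$$
K_U(x) \;=\; \min\{\,|p| \;:\; U(p) = x\,\},
$$

and for two universal machines $U$ and $V$ the invariance theorem gives $|K_U(x) - K_V(x)| \le c_{UV}$, a machine-dependent (compiler-like) constant independent of $x$.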
The caveat is that I can't know for certain that there isn't some shorter code that produces a given string; in fact Kolmogorov complexity is uncomputable, because computing it reduces to the halting problem, which relates to Gödel's undecidability. But in spite of the fact that I can't calculate Kolmogorov complexity exactly, I can approximate it, and it still gives me some interesting intuitions. For example, a random number can be defined as any number whose Kolmogorov complexity is equal to or greater than its own length; in other words, there's no shorter description than just printing the number. Also, the complexity of a set can be much less than the complexity of individual elements of the set: if I give you all the numbers from one to a googolplex, that's a very short description, but individual numbers in that sequence can be very complex. I'm going to use that intuition in just a minute.

I think the monkey intuition really comes from Ray Solomonoff, the third founder of AIT, who was thinking, even before Kolmogorov and Chaitin, about what happens if I feed random programs into a universal Turing machine. Consider a universal Turing machine that only takes binary inputs and uses a prefix-free code, which means you can easily decode it: no special codeword is needed to tell you where a program ends. Then look over all inputs and ask: how likely am I to produce a certain output x, like pi? Well, for each program of length l that generates pi, the probability of typing it at random is one half to the power l. The probability of getting pi is the sum of this over all programs that generate pi, and the shortest such program, whose length is the Kolmogorov complexity, is the dominant, most likely one. This gives me the intuition that simpler sequences are exponentially more likely to appear than more complex sequences if I randomly type into some kind of program-executing machine.

In fact Marvin Minsky, one of the founders of modern AI, said not long before he died that it seemed to him that the most important discovery since Gödel was the discovery by Chaitin, Solomonoff and Kolmogorov of the concept called algorithmic probability, which is what this is, and that everyone should learn all about it and spend the rest of their lives working on it. A bit of hyperbole, maybe, but I think the ideas behind this are super interesting. The next piece is due to Leonid Levin, also known for his work related to P versus NP, who proved in the 1970s that not only is this a lower bound, because there is at least one program that will generate x, but it is also an upper bound asymptotically. So this idea, that if you randomly type into some kind of computer you are exponentially more likely to get outputs that are simple, turns out to be bounded from above and below.

Now you might wonder why you were not taught this as undergraduates in physics, if it's such a powerful theorem. The reason is that it only holds for universal Turing machines, and many systems of interest are not universal; that Kolmogorov complexity is formally uncomputable; and that it only holds in the limit of very large x, because asymptotically these order-one terms, things like compiler constants, are fixed but you don't know how big they are. For these reasons, these ideas have languished a little bit in the fields of theoretical computer science and mathematics.
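Written out, Solomonoff's algorithmic probability and Levin's coding theorem, the upper-and-lower bound just mentioned, take the form:

$$
P(x) \;=\; \sum_{p\,:\,U(p)=x} 2^{-|p|} \;\ge\; 2^{-K(x)},
\qquad
-\log_2 P(x) \;=\; K(x) + O(1),
$$

so that $P(x) \approx 2^{-K(x)}$ up to a multiplicative constant: simple (low-$K$) outputs are exponentially more likely under randomly chosen programs.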
Now, I got very interested in this; actually my students pointed it out to me, because we were finding a lot of simplicity bias empirically and trying to figure out where it came from. We have a much simpler argument, which isn't as powerful but is much more general. Say I have some input-output map. I can generate any output with the following algorithm: run all inputs through my map, count how frequently each output appears, and code the outputs with a Shannon-Fano-Elias code, which is an efficient entropic code. That is one way of generating my outputs. And if my map is simple, so that its own complexity is an order-one term that won't grow with system size, then I can rewrite this as an upper bound on my probability of getting x, of the form two to the minus K of x given f and n. Note that this is not the full Kolmogorov complexity; it's a conditional Kolmogorov complexity, conditioned on the map f and the size n. For those who are interested, you can read more about this in our paper; we basically show that if the map is simple, and under certain constraints on the map, then up to order-one terms this conditional Kolmogorov complexity equals the true Kolmogorov complexity.

We still can't calculate the Kolmogorov complexity, but what we do is a typical physics move: we work out asymptotically what this should look like. We're used to taking things we can formally show asymptotically, with a certain type of scaling behaviour, and assuming they still hold outside the regime where we can formally prove them, and we're used to making approximations and then adding parameters to correct for them. So here I take some approximation, K-tilde, to the true Kolmogorov complexity, and introduce a prefactor and an offset, which capture the multiplicative and additive constants. What we show is that for maps that are simple, this works well as an upper bound on the probability of getting a certain output.

Here, just very briefly, I show you a whole bunch of different maps on which we test this. Here's a map from RNA, which I'll talk about in just a second. Here are what are called L-systems, which actually come from Utrecht, where I was an undergraduate, from a botanist called Lindenmayer, hence L-systems. They're used to model plant shapes, and they're also used in computer graphics. If you randomly pick rules in the world of L-systems, you find simple shapes much more frequently than complex shapes, and this line is our predicted bound as a function of complexity. And here is a set of differential-equation systems; a whole series of systems work this way.

The reason I got into this was that I was studying some evolutionary problems. There's a mapping from RNA sequences, and RNA only has this four-letter alphabet; some RNAs are coding, but a lot of RNAs are also functional in themselves: they may be catalysts, or they may have some kind of structural role. For those, it really matters what shape they fold into, so this sequence here will fold into a particular shape, say this particular catalyst. Predicting the full three-dimensional structure from the sequence is complicated, although neural networks might solve that problem, given what happened with protein folding recently at DeepMind.
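To make the "approximate the uncomputable complexity" step above concrete, here is a minimal sketch of a Lempel-Ziv-style complexity counter of the kind that can stand in for the approximation K-tilde in a bound of the form $P(x) \lesssim 2^{-(a\tilde K(x)+b)}$; the parsing details are illustrative, and the constants $a, b$ would be map-dependent and are not computed here.

```python
# Illustrative LZ-style complexity: count distinct phrases in a greedy dictionary parsing.
# This is a computable stand-in for the (uncomputable) Kolmogorov complexity.
import random

def lz_complexity(s: str) -> int:
    """Number of phrases in a simple LZ78-like parsing of s."""
    phrases, i = set(), 0
    while i < len(s):
        j = i + 1
        while j <= len(s) and s[i:j] in phrases:
            j += 1                      # extend until we hit a new phrase (or run out of string)
        phrases.add(s[i:j])
        i = j
    return len(phrases)

random.seed(0)
print(lz_complexity("01" * 64))                                        # simple, repetitive: low count
print(lz_complexity("".join(random.choice("01") for _ in range(128))))  # near-random string: higher count
```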
Coming back to RNA: we can solve a simpler problem quite easily, which is going from the sequence to the secondary structure, which tells us basically what the bonding pattern is. What's kind of interesting is that we found that if we randomly pick RNA sequences, we are much more likely to get simple secondary structures than complicated ones; we can get both, but the simple ones dominate. And what's fascinating is to look at nature. Here are RNAs of length 100, of which 932 have been found in nature as non-coding RNAs with some kind of functional role. The green dots show the probability of finding these structures upon random sampling of sequences, plotted against their complexity, together with our bound, and you can compare that with what is found in nature. You see that the structures most commonly found in nature are in fact the simple ones, and the structures that are less likely upon random sampling are significantly rarer in nature. So nature, evolution, is also following this inbuilt Occam's razor: simple things appear more often than complex things. That's how we got into this, and then we realized it applies to neural networks as well. Here's the example I showed you of simplicity bias in the Boolean system, and here I'm showing you the same thing for a convolutional network on CIFAR-10. The argument is very general, so it should hold, and does hold, for a wide range of different neural networks.

Now, some people in the audience are probably wondering about something I slipped past you, which I absolutely did, which is the following: neural networks are not trained by randomly sampling parameters, which is what we did to get these probabilities of functions, these priors as it were. They're trained by something called stochastic gradient descent, SGD, a method that follows gradients down a loss landscape. Actually, in the field, I'd say the dominant hypothesis for why neural networks work so well in the over-parameterized regime is that there's some kind of magic to SGD. One of my main claims today is that this is not true. There's nothing magical about SGD; SGD itself does not provide some extra implicit bias that explains the fundamental question of why these things work in the over-parameterized regime.

The intuition behind that goes roughly as follows. I've just told you that the probability of getting a function scales exponentially with a linear change in its complexity. So imagine some loss landscape, with zero-error solutions for a classification problem; the relevant quantity is roughly the probability that I find a set of parameters that gives me zero error on this particular training set. There can be many different functions with zero error, a lot of redundancy: function one, function two, function three. If function one has a much larger basin, that is, a much larger probability of being found upon random sampling, then its basin of attraction for SGD is typically also large, and so SGD is much more likely to fall into that basin. We first worked this out in evolutionary theory, in these RNA systems, a number of years ago, and showed that it works remarkably well for evolutionary dynamics. And just recently, again with two undergraduates from Hertford involved, we showed that the same dynamical principle holds for SGD.
That's not to say that SGD doesn't feel the subtleties of the landscape, or that tweaking hyperparameters doesn't give you slightly different behaviour, but this is the dominant behaviour you get when the volumes of the basins of attraction vary over many orders of magnitude.

What I'm going to do next is go a bit deeper into this function-based picture, because I've been a bit vague about what I mean by a function. For the Boolean system it's relatively straightforward: a function is what the outputs are on all the inputs. For something like MNIST, a bunch of handwritten digit images, 28 by 28 pixels with 256 values per pixel, the number of possible inputs is astronomical. So what I'll do is define my functions on a fixed set of inputs, which could be, say, my training set plus my test set; my function is simply what my neural network outputs, as a function of its parameters, on that set. So if I have these images and feed them through the neural network, and it gives me 5 0 4 1 9, which is correct, that's zero errors, and that's one function; if it gives me 5 0 4 7 9, I have one error, and that's a different function. So I'm going to represent my functions, given a certain set of inputs, by what the output labels are, for classification.

That's helpful because I can start making Bayesian arguments. Here's the classic Bayes theorem: the posterior is given by the likelihood times the prior, divided by the marginal likelihood. The prior over functions is the thing we just looked at: how likely am I to get a function, in this case upon random sampling of parameters. The likelihood is how likely I am to get the data given this function. Well, if I train to zero error, which is very easy to do with neural networks because they are so expressive, then the likelihood takes on a particularly simple form: it is one if the function gives zero error on the training data and zero otherwise. And that means I get something very simple: a posterior that is directly proportional to my prior, divided by a constant, the marginal likelihood. The marginal likelihood is simply the sum over the priors of all functions that give zero error on the training set: the total prior probability of all functions consistent with the training data. What's really nice in this picture is that I suddenly have a pretty concrete way of applying Bayes theorem, and I'm going to be interested in how likely I am to get a certain function given that I train on a certain training set. That's the Bayesian picture for supervised learning.

And what we showed was, I think, something quite surprising, at least to me at first. On the y-axis I plot how likely I am to get a certain function on MNIST; here I've got a test set of 100 images and a training set of 1000. Because I always train to zero error, the outputs on the 1000 training images are always identical, so the functions only differ on the 100 test images that I haven't trained on. What you see here is that this particular function is the most likely one to appear; it's a function with one error. On one axis is the probability that I get it from SGD, where I run something like a million different SGD runs with different random seeds; this is the most likely function, and these are progressively less likely functions. On the other axis is the probability that I get it upon random sampling of parameters.
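Written out, the zero-error Bayesian picture described above is:

$$
P(f \mid D) \;=\; \frac{P(D \mid f)\,P(f)}{P(D)},
\qquad
P(D \mid f) = \begin{cases} 1 & \text{if } f \text{ has zero error on } D,\\[2pt] 0 & \text{otherwise,}\end{cases}
$$

so for consistent functions $P(f \mid D) = P(f)/P(D)$, with marginal likelihood $P(D) = \sum_{f\ \text{consistent with}\ D} P(f)$.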
Now, I don't actually do random sampling of parameters, because that's far too expensive for this network; instead we use a Gaussian process to calculate it. As last week's beautiful talk by Yasha Sohl-Dickstein showed us, in the infinite-width limit these networks become Gaussian processes, and we figured out how to use that to calculate this Bayesian posterior. What's fascinating is that the probability that stochastic gradient descent converges on a particular function is very close to the probability given by this Bayesian posterior from random sampling of parameters. Here I'm showing you the probabilities over all possible functions: there are 100 images in my test set, and I've actually binarized the labels in this case, so there are two to the 100 possible functions. The vast majority of these functions are extremely unlikely to appear; what you're seeing here is the very small subset of functions, the ones with low error, which are the high-probability ones that both SGD and random sampling converge on. So SGD is behaving like a Bayesian sampler: it gives you functions with probabilities very closely predicted by the posterior.

So there are two kinds of questions about generalization. The first: why do neural networks generalize at all in the over-parameterized regime, where classical learning theory tells us they shouldn't? They do because they're highly biased towards simple solutions, and because SGD basically follows that bias. The second kind of question: given that they generalize at all, can I further improve performance by fine-tuning hyperparameters? The answer is yes, and that's an interesting question, but I don't think there are any universal answers there, even though in practice this may be super interesting and helpful. I just want to give you one example of where you might look at these second-order effects. Here's a convolutional neural network; we know that one of the ways we can improve CNN behaviour on images is by something called pooling. We have highlighted the probability of the zero-error function with no pooling, and here with pooling; this is on Fashion-MNIST, sorry, not on MNIST but on Fashion-MNIST. You see that both the SGD and the Bayesian probabilities increase upon pooling, which is the kind of thing you do to make your network work better. So this is an example of a second-order effect: given that it works well, can you make it work better? But primarily what you're seeing here is that SGD is following the Bayesian prediction.

The question then is: can I break the simplicity bias? I've told you that this bias is the reason why these networks generalize so well; it's the reason why, even though the network is over-parameterized and could represent many possible functions, it picks the simple ones. So can I break that bias? One of the things we know is that neural networks exhibit an order-to-chaos transition. This is something that Yasha and others also worked on, for example for tanh nonlinearities.
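For reference, in the mean-field analyses from that line of work, for a smooth nonlinearity $\phi$ such as tanh with i.i.d. Gaussian weights of variance $\sigma_w^2$ and biases of variance $\sigma_b^2$, the ordered and chaotic regimes are usually separated by a criterion of roughly this form; the notation here is an assumption for illustration, not taken from the talk:

$$
q^{*} = \sigma_b^{2} + \sigma_w^{2}\!\int\!\mathcal{D}z\;\phi\!\big(\sqrt{q^{*}}\,z\big)^{2},
\qquad
\chi_1 = \sigma_w^{2}\!\int\!\mathcal{D}z\;\big[\phi'\!\big(\sqrt{q^{*}}\,z\big)\big]^{2},
$$

with $\mathcal{D}z$ a standard Gaussian measure: $\chi_1 < 1$ corresponds to the ordered phase and $\chi_1 > 1$ to the chaotic phase, so increasing the weight variance at initialization pushes a tanh network into chaos.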
There's an ordered regime and a chaotic regime, particularly for wider initializations. ReLU networks don't have this problem, or this feature as it were, but tanh networks do, and so we can use this: our simplicity bias arguments break down for these kinds of chaotic systems, so we expect the bias to break down, and indeed that is what happens. As you increase the width of the initialization distribution you go into the chaotic regime, and what you see, and actually Greg Yang from Microsoft pointed this out to us first, although we should have known, is that the simplicity bias disappears: the bias towards simple functions is gone.

That's super interesting. You can see this as well in these rank plots, where I rank the functions by probability; here I've got a more chaotic and a non-chaotic system. For the Boolean system I do the same thing as before; this line is the one you saw earlier, and this is for the more chaotic system, which is much less biased towards simple functions. And you see that it generalizes much worse on the simple targets. The reason is that it's not nearly as strongly biased towards simple functions, and because there are so many possible functions, by a kind of entropy argument it picks from the vast majority of functions, which don't generalize well, and that's what it finds. You can see this as well in the same little scatter plots of generalization error against the complexity of the solution found. You run this many, many times and look at what kinds of solutions are found: the normally biased network, for a simple target, finds lots of simple solutions, on average close to the true target function, whereas the chaotic one, even though the target is simple, mainly finds complex solutions, because there are way more complex solutions than simple ones, so by sheer entropy that wins. But if we give them a complex target, then they both do badly and give poor generalization, because there's no such thing as a free lunch. Complex targets are where these networks don't work so well.

Now, in that article by Zdeborová that I showed earlier, one thing she also points out is that to think about this properly you have to look not only at the algorithm but also at the data, at the structure of the data. What's nice about this Bayesian picture is that I can look at the data. So, looking at the Boolean system again, I have a target function, a particular set of answers that my doctor gives, which in this case is a target of middling-to-high complexity. What I do here is pick a whole bunch of random functions and simply calculate their error relative to this target function. You see there's a whole bunch of functions at each complexity, and I'm interested in the one with the lowest error at each complexity, since one minus that error is what will enter the likelihood. This envelope of lowest-error functions tells me something about how my network engages with the data. And this is just Bayes theorem again, but now I'm going to average over different training sets: not just one training set as before, but an average over many training sets.
Then of course a given function may or may not give me zero error on a particular training set. The probability that it gives zero error is one minus its error, raised to the power n, where n is the number of training points, because I pick n instances of data points. So my likelihood, averaged over training sets, is no longer one or zero but (1 - epsilon) to the power n. The reason there's an "approximately" here is that I've done a slightly dirty trick: I've replaced the average of a ratio by the ratio of the averages, and the reason that's acceptable is that we're pretty sure the marginal likelihood does not vary much from training set to training set. We're going to test this.

Let's have a look at what the data does. Since this (1 - epsilon) to the power n factor, with n the size of the training set, means that the smallest-error functions at each complexity dominate, I'm only going to look at the best function at each complexity; that's an approximation that shouldn't be too bad. So here, for a simple target, are the curves of (1 - epsilon) to the power n as a function of complexity: for small n the penalty is mild, but as n grows, functions with worse errors get penalized more and more heavily by this exponential power. You see that as I increase the size of my training set, the functions that I would otherwise find are constrained by the training set towards functions that are closer to the target function. The same happens for targets of different complexity. That's the likelihood part of my Bayes theorem.

Now I want to look at the prior; I already talked about the prior before. The difficulty is that if I want the prior of every individual function, the priors of the rarer functions are extremely hard to calculate directly. So I'm going to do something slightly simpler. This is what my student Henry, another undergraduate, suggested: instead of looking at the probability of individual functions, just look at the probability as a function of the complexity. That's much easier to calculate. So here is a typical result: how likely am I to get a certain complexity upon random sampling? Although there are exponentially more functions with high complexity, the bias makes each individual one exponentially less likely, so for the normal, simplicity-biased system I get a complexity distribution that is relatively flat. Whereas when I go to a chaotic system, one without simplicity bias, the probability of my network producing high-complexity outputs becomes very large, because there are many more of them. That's how I calculate my P(K).

So what I'm going to do is calculate my posterior by multiplying my prior and my likelihood in these approximated forms. This is my prior, this is my likelihood as a function of complexity, and the only place where training comes in is through this (1 - epsilon) to the power n factor; that's how the data and the training interact with the system. And this P(K) is basically the prior, the bias that my network has towards certain types of functions.
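A minimal sketch of putting these two approximated pieces together; the complexity grid, priors and error envelope below are invented numbers for illustration, not the talk's data.

```python
# Approximate posterior over complexity: P(K | D) ∝ P(K) * (1 - eps_min(K))^n,
# combining the prior over complexities with the averaged likelihood.
import numpy as np

K = np.arange(10, 130, 10)                                  # complexity bins
prior_biased = np.full(K.size, 1.0 / K.size)                # simplicity-biased net: roughly flat P(K)
prior_chaotic = 2.0 ** (K / 20.0)                           # chaotic net: weight piles up at high K
prior_chaotic /= prior_chaotic.sum()
eps_min = np.clip(np.abs(K - 60) / 128.0, 0.0, 0.5)         # toy lowest-error envelope, target near K = 60

def posterior(prior, n):
    post = prior * (1.0 - eps_min) ** n                     # likelihood factor from n training points
    return post / post.sum()

for n in (16, 64, 256):
    print("n =", n, "biased:", np.round(posterior(prior_biased, n), 3))
print("n = 64, chaotic:", np.round(posterior(prior_chaotic, 64), 3))
```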
So what I see here is the comparison between a non-chaotic network and a chaotic network; as you increase the number of layers it becomes more likely to be chaotic. For this particular target, this is the histogram of how likely I am to get functions of a certain complexity if I train with SGD, for my normal network, and this is for my chaotic network, the one without simplicity bias; as you've seen before, the chaotic one tends to give these more complex solutions.

I show this for three different target functions. Here is the normal network, here the chaotic network; these are the P(K) priors and the (1 - epsilon) to the power n likelihood factors. For a simple target, this is what I get from my approximation, P(K) times (1 - epsilon) to the power n, and this is what I get from directly running the network with SGD. You see that even though the approximations are remarkably crude, the predictions are remarkably good: the top row is my prediction, obtained just from P(K) and the likelihood factor with nothing else put in, and the bottom row is SGD. So I'm able to predict pretty well, with a simple theory, the distribution of functions I'm going to get for different targets.

That's super interesting, because I've shown you that you need this simplicity bias for these things to generalize well, I've shown that I can break it, and not only can I break it, I can actually explain how the data interacts with the prior for this Ising model of neural networks, namely the Boolean system. Does this work on larger networks or more complicated data sets? Here's MNIST, the handwritten digits, again a classic data set. Here you see, as a function of the width of the initialization, that as you go in this direction you get more chaotic, and as you make the network deeper, chaos sets in earlier. What you see for MNIST is that as you go into the chaotic regime, the generalization starts to degrade very rapidly, just as you saw for the Boolean system. Now, I don't really know what the complexity of my target is here; I don't have a clear idea of the exact target, except that I want to give the right labels to the images. But what I do see, if I look at the complexity of the solutions found for normal MNIST with a normal, non-chaotic network compared to a more chaotic one, is that the chaotic one indeed finds much more complex functions. Here I'm using something called the critical sample ratio, which is an approximation to the complexity of the functions being found. So I get exactly the same story: I find much more complex functions in the chaotic regime. And here's the prior P(K), calculated for MNIST using the same complexity measure, for the normal network and for the chaotic network, and you see that the prior of the chaotic network is much more skewed towards complex functions. Now I corrupt the data: to make a more complex target function, I corrupt the data by messing up some of the labels.
Then what I see is exactly the same phenomenology I saw in the Boolean system: the complexity of the functions found goes up, and the error goes up as well as I move to a very complex data set, and basically my simplicity bias is no longer helping me very much, or at all; my generalization is almost the same as the chaotic network's. The phenomenology is exactly the same as what I can calculate with my Bayesian picture for the Boolean system. And here is CIFAR-10: I can do the same thing there, so it's not something particular to MNIST.

So, in summary, my Bayesian picture gives me this very nice decomposition. It tells me that I need simplicity bias: if I break the simplicity bias, I don't generalize as well. And then I can look at the data: as I increase the size of the training set, the probability that I converge on a certain function goes down for functions far from my target and is maximized near where my target function sits, and the whole effect of training is contained in this (1 - epsilon) to the power n factor. So I've been able to separate out the different aspects of what a neural network does: the data is this bit right here, and the network and the algorithm are basically this bit right here.

Now, the last thing I want to talk about is a little more of a specialized topic. If you start delving into this world of generalization, you'll find a very voluminous literature on bounds; this is a very frequentist kind of picture. What people have been trying to do for a long time is calculate rigorous bounds on the generalization error for certain hypotheses; if you're in this world you may have heard of PAC learning, VC dimension, Rademacher complexity; there's a whole literature and language around this. A classic, very famous result is that you can bound the error you're going to get for a certain hypothesis by the VC dimension of the whole hypothesis class. Just recently, with my student, we set out to understand this literature, partly because we kept being asked by people in the field how our work linked to it, and we've written a big review paper on all the bounds we could find that we thought were interesting. We classify all the different possible bounds, from the simplest ones, which are also the easiest to make rigorous, to the more complicated bounds that take the algorithm and the data into account and are therefore harder to calculate, and we also set out desiderata for these bounds.

I think what's quite exciting is that we've used something called a PAC-Bayes bound, which is fundamentally a frequentist idea but has Bayesian-type connotations. It was developed by David McAllester in Chicago, and it basically bounds the predicted error on your test set in terms of a KL divergence between two distributions. One thing we were able to prove in our paper is that the KL divergence, if you use a function-based description, is always less than or equal to the KL divergence you would get if you use a parameter-based description. That tells us that
if you want to make these bounds work, you shouldn't really look so much at the parameters of your system but at the functions, which is not what people have typically done. So we derived function-based bounds. Here's our version of the PAC-Bayes bound in function space, and the main thing you see is that we bound the expected error, given that you get zero error on the training set, by a few terms: some you can largely ignore, a factor that goes as one over the size of the training set, and the marginal likelihood of your system. The marginal likelihood is very nice because it brings in both the data and the set of all possible functions that fit it.

What we're able to predict, for example, is the generalization error of a fully connected network as a function of label corruption. As you corrupt the labels of image data sets, here MNIST and, up here in green, CIFAR, which has more complicated images, these dashed lines are the SGD results and these solid lines are the predictions from the bound. That's a genuinely hard prediction, and it works remarkably well. When I say remarkably well, it's because until recently people made a big deal of having a bound that even gives you a value less than one for, say, binary classification, let alone something that actually follows the data this carefully.

You can also look at learning curves: how does the error vary as you increase the size of the training set? What people have found is that if you look at the generalization error as a function of training set size, for a wide range of different data sets, you get scaling-like behaviour that seems to depend mainly on the data and not on the network. Here are a whole bunch of different neural networks, everything from fully connected networks to MobileNet to DenseNets; the dashed lines are the SGD results, and the top lines are our bounds, calculated directly with no fitting parameters. You see that we're able to reproduce these exponents not too badly, and if you remember this is a log scale, so we're actually doing rather well. These exponents are very important: they tell you, if I double my training set, which may be a lot of work, how much gain I get in my generalization error. We're able to at least reproduce this; we haven't been able to explain exactly why there is this scaling behaviour yet, but we can reproduce it with this much simpler description.

Let's compare across different types of network. Here you see a CNN with no pooling and here one with max pooling, sorry, the other way around, this is no pooling and this is max pooling, and the error goes down in the real system and it also goes down in our bound. Here you see that as you increase the number of layers, neural networks typically generalize better; that's what the neural network does on MNIST, and this is what our bound does. And here we compare the generalization error against the bound for a whole bunch of different networks: this is the large error for MobileNet, which is a small network, and these are DenseNets, which work better.
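Schematically, and simplifying the exact constants and logarithmic factors, so treat this as a sketch of the form rather than the precise statement used in these comparisons, the marginal-likelihood PAC-Bayes bound reads:

$$
-\ln\big(1-\epsilon\big) \;\le\; \frac{\ln\frac{1}{P(D)} \;+\; \ln\frac{2m}{\delta}}{m-1},
$$

with probability at least $1-\delta$ over training sets of size $m$, where $\epsilon$ is the expected generalization error of the zero-training-error posterior and $P(D)$ is the marginal likelihood, the total prior probability of functions with zero error on the training set; for small $\epsilon$ this is roughly $\epsilon \lesssim \big[\ln(1/P(D)) + \ln(2m/\delta)\big]/m$.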
And for these state-of-the-art networks we're able to more or less predict what their generalization error is going to be a priori. We think this is by far the best bound on the market, and it means we've explained something about the generalization of neural networks. What's really important is that our bound simply uses this marginal likelihood: there's nothing about SGD in there; it's basically just a sum over the probabilities of getting functions that give you zero error on the training set. And that quantity is enough to predict how SGD-trained neural networks behave. In other words, this function-based story is capturing something of the essence of what makes neural networks generalize.

So with that I want to thank you for your attention. I've told you that neural networks generalize because they have an implicit bias towards simple functions, as predicted by AIT, and we think this simplicity bias is a much broader principle in nature and ought to be taught in physics curricula. We've also said something that is quite controversial in the field, but which we're pretty sure is right: SGD itself is not the source of generalization; it acts like a Bayesian sampler. And of course many common intuitions from learning theory, ideas about the bias-variance trade-off, the idea that you should somehow limit the number of parameters, are incorrect as stated. That doesn't mean learning theory itself is incorrect; you just have to use more sophisticated versions, like our PAC-Bayes bounds, which capture the behaviour numerically and also give us some deep intuitions about why neural networks work so well.

I just want to walk through the people who did this work. Kamal and Chico were the ones who started out explaining to me how simplicity bias works. Guillermo is the one who taught me how neural networks work. Shuofeng and others who recently joined my group explained to me how flatness works and how information geometry works. David was a visitor, he's in statistics, and helped us with a bunch of our proofs. Two undergraduate or master's students also contributed: Henry did the chaos work, and the other worked on double-descent curves in Gaussian processes. And then Chris, Joar, Vlad and Isaac are four undergraduates from Hertford who did really extraordinary work. One of the great and exciting things about this field is that it's so wide open that undergraduates who are very smart, and some of our undergraduates are very smart indeed, are able to make really incredible breakthroughs and teach their supervisor something very exciting. So I want to thank you for your attention, and particularly thank all these collaborators, who are of course the ones who did the actual work. And I'm happy to take some questions.

Wonderful, fantastic, thanks a lot for teaching us so much about all of these things. If people have any questions, please raise your hands and we'll take it from there; you can also use the chat. Maybe, if there are no questions yet, let me go first, until people think of something. You mentioned, and showed empirically, that SGD basically does the same thing as Bayesian inference would do.
Yes, and my question is: do we have any sense of whether that is generally the case, a property of any optimization algorithm, or whether it has to do with the structure of the object being optimized? Are properties of the neural network itself coming in to make that statement true? Yeah, so the point is that what we're describing is fundamentally a property of the loss landscape. What the AIT arguments tell us is that the basin volumes in these loss landscapes vary by many, many orders of magnitude. So if you have basically any kind of optimization technique, as you walk along you're much more likely to fall into the basins that are large than the basins that are small. Even though there are many functions that will give you zero error on the training set, you're much more likely to fall into the simple ones. Now you may say, why does that matter? It matters because the data we're looking at is typically also simple, and I showed that by varying the complexity of my targets, essentially varying the complexity of my data. So this should work for any optimization technique you can find. For small systems we can show that empirically: you can do a really stupid optimization where you just randomly sample parameters, but that scales very badly. Where SGD is special is that it's an incredibly good optimizer; particularly if you use some tweaks, say something like Adam or other accelerated variants, it finds zero-training-error solutions extremely efficiently, and that is super important. But just because it finds them efficiently doesn't mean it's finding them in a way that is fundamentally different from any other optimizer. It's special because it's efficient, not because it finds better solutions. That's the first-order answer. At second order, we all know that you can tweak hyperparameters and get better generalization, and that, I would say, is a much more subtle question, probably linked to the details of how the landscape and the optimizer interact. You can see that by changing the type of SGD you use; there are many variants, and sometimes one works well on this system and another on that system. If you're an engineer actually trying to get something to work in practice, that matters, because even a few percent difference in generalization performance can be really helpful. But fundamentally, why does the neural network pick a certain small subset of functions out of the astronomically many functions it could express, given that it is so over-parameterized? That deep question is answered, to first order, by the structure of the landscape. And why does that structure help us? Because we're typically looking at structured data. That's the story in a nutshell.
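To make the "any optimizer falls into the big basins" point concrete, here is a minimal illustrative sketch, not the speaker's actual experiment, of the kind of small-system check described above: randomly sample the parameters of a tiny network, record which Boolean function each sample implements, and observe that the resulting probabilities span many orders of magnitude. All sizes and names here are illustrative assumptions.

    import numpy as np
    from collections import Counter

    rng = np.random.default_rng(0)
    n_inputs = 5
    # All 2^5 = 32 binary input vectors.
    X = np.array([[int(b) for b in format(i, "05b")] for i in range(2 ** n_inputs)], dtype=float)

    def implemented_function(width=10):
        # Sample a tiny MLP at random and return the Boolean function it
        # implements on all 32 inputs, encoded as a 32-character bit string.
        w1 = rng.normal(size=(n_inputs, width))
        b1 = rng.normal(size=width)
        w2 = rng.normal(size=width)
        b2 = rng.normal()
        h = np.tanh(X @ w1 + b1)
        out = h @ w2 + b2
        return "".join("1" if o > 0 else "0" for o in out)

    n_samples = 100_000
    counts = Counter(implemented_function() for _ in range(n_samples))

    # Empirically, a few simple functions (such as the constant functions) tend to
    # dominate, while most functions never appear: basin volumes are highly non-uniform.
    for f, c in counts.most_common(5):
        print(f, c / n_samples)
    print("distinct functions seen:", len(counts))

The point of the exercise is only to illustrate the claim in the answer above: the prior probabilities of functions under random parameter sampling are spread over many orders of magnitude, so any sensible search over parameters is overwhelmingly likely to land in the large, simple basins. One would then compare these frequencies with the functions found by SGD from random initializations.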
Okay, thanks. I see there is a question from Duncan, who has raised his hand, so please go ahead. Thank you very much, and thanks for the fascinating talk. I'm interested in using neural networks, amongst other techniques, for emulating processes in climate models, for example, and there we worry about the generalizability and robustness of these emulators. This idea of complexity I hadn't come across before. Sorry, I mean the chaotic behaviour in neural networks. Is there some way of diagnosing that? Can you find out that you're in this chaotic regime, where you're probably not finding the simplest solution and might not generalize so well? Yes, actually. If you look at the work of the Google Brain group and a number of people around them, also at Stanford, they've written a whole series of very beautiful papers about this, and you can see it because chaotic systems have positive Lyapunov-like exponents; they see positive exponents there, and typically those systems are hard to train. Maybe another thing to say is this: if you are interested in a problem like climate, then to first order what these neural networks have is a bias towards simplicity. But what you really want is a network with an inductive bias towards the kind of problem you're working on. One example of such second-order effects: convolutional networks, when you add pooling to them, are designed to pick up some of the symmetries we know are there in images. We know the images have that structure, so why not put that bias into the network so it does well on those images? And lo and behold, what you see is that both the Bayesian prior and the SGD-trained network are more likely to converge on, and give higher probability to, functions of lower error. So if you're looking at something like climate physics, it may be that vanilla neural networks don't have quite the right bias; they've got some bias that is helpful, otherwise they wouldn't do anything useful, but you might want to play with some of these ideas and build a network that has an intrinsic bias towards the kind of problems you're trying to solve. There's a lot of work in that direction in the field. The language people use is that you want an inductive bias towards solutions that are good. This also picks up a really fascinating, more philosophical question that Max Tegmark, who spoke in a recent seminar, has written about: why do neural networks work so well in physics? They do work remarkably well in certain physical contexts. My answer would be that a lot of the things we study in physics fundamentally have some kind of simple underlying effective theory. Because neural networks are biased towards simplicity, they find these simple effective theories, and that's why they work so well. As long as that's true, they're going to work very well. If you actually have something genuinely complex, like certain aspects of climate physics where you've got chaotic systems that are very sensitive to initial conditions, then that may not be a good bias. Thank you very much.
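As a rough illustration of the kind of diagnostic being discussed, here is a minimal sketch (a toy example, not code from the papers mentioned) that tracks how the distance between two nearby inputs grows or shrinks with depth in a random deep tanh network. A positive slope of log-distance against depth plays the role of a Lyapunov-like exponent and signals the chaotic regime; a negative slope signals the ordered regime. The widths, depths and weight scales are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)

    def depth_divergence(sigma_w, depth=20, width=200, eps=1e-3):
        # Propagate two nearby inputs through a random deep tanh network
        # and fit the slope of log(distance) versus depth.
        x1 = rng.normal(size=width)
        x2 = x1 + eps * rng.normal(size=width)
        log_dist = []
        for _ in range(depth):
            W = rng.normal(scale=sigma_w / np.sqrt(width), size=(width, width))
            x1, x2 = np.tanh(W @ x1), np.tanh(W @ x2)
            log_dist.append(np.log(np.linalg.norm(x1 - x2)))
        return np.polyfit(np.arange(depth), log_dist, 1)[0]

    # Small weight variance -> ordered regime (perturbations shrink);
    # large weight variance -> chaotic regime (perturbations grow).
    for sigma_w in (0.5, 1.0, 2.0, 4.0):
        print(f"sigma_w = {sigma_w}: slope = {depth_divergence(sigma_w):+.3f}")

This is the untrained, signal-propagation picture only; as the answer above notes, the practically relevant question is whether a network in that regime is hard to train and loses the simplicity bias.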
I see that Gerard has posted a question in the chat window; I'll attempt to read it out so that everyone can hear it. Yes: Papian and colleagues formulate neural network problems as a sparsity problem in which an L2 misfit with an L1 penalty on the model is minimized. From this they derive a gradient formulation and naturally obtain ReLU functions. They also suggest that deeper networks are equivalent to more iterations. Do you see any relations between their work and yours? So, there is an extremely large literature by more mathematically oriented researchers who look for special properties of neural networks, often phrased in terms of sparsity, or measure-theoretic properties, or a whole series of other tools that they're used to, and these are very powerful tools in those contexts. My take on a lot of that work is that it's interesting, but it's not getting at what is special about neural networks, because very often it is focused on particular properties of SGD. Now, this particular paper is not one that I am familiar with. What is true is that deeper neural networks are more expressive; that simply means you can describe more functions with them, but that's probably not that important for them performing well unless you're looking at a very complicated problem, because typically you're interested in the simpler functions, which smaller networks can already express. If you're looking at a very complicated data set, then you might want a very big or deeper network, something of that nature, because you need access to the more complicated functions. But I think, and maybe this is something you also saw in the paper by Lenka, that in this field you have a large number of more mathematically oriented scientists who are very concerned about proving certain types of things, and I'm just guessing this paper is in that same category; then you've got a whole series of engineers trying to make things work; and hopefully physicists can find their way a little bit in between. What I showed you today is a typical physicist's approach, somewhere in between: trying to use mathematical rigour where needed, but avoiding it where we think it's not needed. I see some more questions. Thank you very much for a very educational talk; I fear I may ask a very ignorant question coming off the back of it. In your conclusion slide, I can't help but note that there are some functions on the other side of your bound. Does that mean they're overfitted? You mean down here? Yes. That's a sampling issue. If you have a function which has very low probability but you only sample a few times, then if it happens to appear it's just an artifact; if you sample longer, that goes away. But more basically, I haven't told you about this: here I'm just showing you upper bounds. What you see is a plot of probability versus an upper bound. When we sample parameters, the highest-probability functions are the ones close to the bound, but I also have these low-K, low-P ones: functions with low complexity that are nevertheless rare. What we show, which is kind of interesting, is that these low-K, low-P ones are functions for which the set of inputs itself is highly structured. In fact, we have another result where we argue that the distance you sit below the bound, if you put it in log-2 space, the number of bits you are away from the bound, is equal to the number of bits of information by which your inputs are more compressible than random inputs. Most possible inputs look random, so to produce these functions you have to pick out the small subset of inputs that are not random.
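For context, the upper bound being discussed is presumably the simplicity-bias bound from earlier in the talk, and the bits-gap statement just described can be written out as a rough formalization rather than a quoted result. The constants a and b, the approximate complexity measure \tilde K, and the quantity \Delta K_{\text{inputs}} below are all assumed notation.

    P(f) \;\le\; 2^{-\left(a\,\tilde K(f) + b\right)},
    \qquad
    \log_2\!\left[\frac{2^{-\left(a\,\tilde K(f) + b\right)}}{P(f)}\right] \;\approx\; \Delta K_{\text{inputs}},

where \Delta K_{\text{inputs}} is the number of bits by which the relevant inputs are more compressible than typical random inputs: the further a low-complexity function sits below the bound, the more structured the inputs needed to produce it.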
And what that's really telling you is that this network is not the universal there are certain types of outputs that finds hard to make. For various structured inputs. So there's a lot of stuff of that nature that we're exploring at the moment, and it's telling us something about theories of machines right you know you might remember that in computer science you talk about hierarchy, you know, Chomsky hierarchy you've got, you know, finance state machines at the bottom and university machines at the top and as you walk up on the certain, you can get more and more and more. So you can do anything that's calculatable with the finance state machine, you can do a very small subset of things. And so this is picking up something of that nature. Nice, thanks a lot. There is another question by some key please go ahead. Hi, thank you for your talk. First of all, appreciate all the references. Um, could you please, you said it already but could you please elaborate what the work you were involved with that naturally led you to this you mentioned evolution or name mapping. That's right. Yeah, so the history of this was always different from the history of which the way I told the story. History is that we were started looking at these mapping from RNA sequences RNA structures. And what we found was that for every time that we tried it, that this mapping was very highly biased very similar to this. Okay, so certain structures were very likely to appear a certain structure very unlikely to appear. And so what really happened was I saw that in these RNA maps, we then did the dynamics of these so that if you think about evolution dynamics. It's also on some very complicated fitness landscape rather than a loss landscape but it's very similar to what SGD is doing on landscape moving around, right in some kind of stochastic way. So, what we found about five, six years ago was that the RNAs that you see in nature are the RNAs that are likely to appear program sampling. And the argument that we made was well that's because these ones are much more likely to appear in the first place so even if the network, even if the landscape is changing all the time over because the fitness landscape is not static is changing. So you're still going to get these are much more likely than others. It's kind of entropic argument they have much more entropy in, in sequence space and so you're more likely to land on them, even if they may not meet the fitness so we have this paper. We call this the arrival of the frequent where we basically frequent structures are much more likely to fix in nature, then less frequent structures even if their fitness is lower, because it's a kind of it's not really the entropic argument it's a kinetic argument but it's linked, you can think of almost like a free, very roughly think about this as like free fitness right fitness is that how likely you are to reproduce but there's also how likely to appear in the first place, just like the entropy term. And so what then happened is I then had my student Kamal was a mathematician. I said to him look at this can you give me an explanation, a general explanation for this is because we saw not only this system but in a wide range of biological systems everything seems to be the same thing. And that's how we got into the algorithm ration theory. 
And then, so that's that's the history of a history of is that we started looking at this in biological systems and then realized that the mathematics of neural networks were extremely similar to the mathematics of these evolutionary systems. And so that, in fact, and the nice thing about them is it's much easier to get data than is an evolutionary systems. And so that's why we started running off in that direction. Thanks, there is Ed, who also has a question please go ahead. Hey, thanks for a very interesting talk. I've just got another question about the chaotic and ordered regimes. Yeah, so you've shown kind of, as you go deep into the chaotic regime, you heard performance, because you end up using this bias. I've also seen people advocating that you, well I've seen people advocating that you should initialize on the boundary between the order of the chaotic regime. If you go deep into the order regime, then you get this kind of very flat loss unscathed. And I was wondering if you looked into what happens if you go into the audit regime from this, this bias perspective. The thing is, in the order regime, it doesn't seem to. So, this is the, the, the edge of chaos that arguments are popular arguments that they run on their biological digital server on people that coffee letters and So the first thing to note is that this order chaos, only holds for certain types of nonlinearities like earth, air functions or 10 functions not there for relative. It is something very specific to those kinds of nonlinearities and typically we use values in practice, which is why it's not often out used. Um, but yeah if you go too far into the order regime that you have difficulty, you may have difficulty converging as well with the chaotic regime, we have explanation for why doesn't work very well it's also hard to train in that regime. And so the, the, the, the early argument by these, the, the Stanford and the, and the Google rain groups on the chaotic systems that you just try to sit on the border between these. I don't think there's been enough empirical evidence to show that that is in fact the case and given that we made a use values for which this, this, this face condition doesn't exist. That's a lot less interesting, but I, you know, I think try it, you know, it could very well be the case that this is something important for us to keep in mind. Another question. Let's see. No, yes, please. Yeah, thanks. Thanks for the talk. It was interesting and so many different things to digest. So I just wanted to ask you if you could maybe reiterate one of them that I found particularly interesting which was on the marginal likelihood that turns up in the pack base bound as you kind of then derived it. I mean, I think that's something that was already hippos hypothesized by David McKay that like the marginal likelihood is a good indicator of generalization but not formalized like that. So could you maybe just explain a bit again how, how it ended up in the pack base bound that you got there because that's super interesting I find. Okay, so that's a bit of a longer story. I can, I don't have time right now to explain. But basically it comes out from the kale divergence turns out to reduce to the marginal likelihood, but you're right that David McKay a long time ago said you know this is the really important thing to look at for generalization. 
And actually Andrew Gordon Wilson, who is at NYU, has written a series of very nice recent papers where he looks in more detail at this intuition that you really should be looking at the marginal likelihood. If you look at our long paper on the generalization bounds, you can find the link through my publications website, we cite Andrew Gordon Wilson's paper and give a somewhat wider set of arguments for why the marginal likelihood is a good way of explaining something about the model. What Andrew says, and what David MacKay of course said, is that the marginal likelihood tells you how the functions you're producing, your implicit bias over functions, interact with the data; it's a single number combining the two. And the nice thing is that in the PAC-Bayes bound it falls out in exactly this way. There is also a linear classification version where it comes out in the same way, so there's a very deep and interesting link there, even though PAC-Bayes is not fundamentally a Bayesian argument; it's a frequentist story, yet out comes the Bayesian marginal likelihood. So there's something quite interesting there. Okay, fantastic. Thanks a lot, Ard, for this very educational talk, and thanks for taking the time to answer so many questions. Thanks also to the people who posed those questions. I think that's a natural end for today. Thanks again, and talk to you next week. All right, thank you very much.