So welcome back, everybody, to this afternoon session. We'll have two talks, and it's a particular pleasure to welcome Cengiz for the first one; a particular pleasure because you had to overcome quite a few logistical hurdles to be here, but you made it, we're very happy about that, and we're looking forward to your talk about deep learning theory at limits (not at its limits yet). Thank you, Cengiz.

Thank you very much, it's a great pleasure to be here. I'm Cengiz Pehlevan, a physicist by training; actually, the last time I was in this room I was a graduate student in string theory, attending a summer school here. Since then I started working on theoretical neuroscience, which is what I call myself now, and somehow I ended up in an applied math department and started working on machine learning too. As I was attending this meeting my talk got longer and longer, because I kept wanting to say something about this and something about that, which also means it got shallower and shallower, so we'll see how this goes.

The first thing I want to address is this: this is a theoretical neuroscience session, so why would a theoretical neuroscientist give a talk on deep learning theory? Your friends might have told you that backprop is not biologically plausible, so deep learning is not relevant, and so on. That statement about backprop being biologically implausible is itself a matter of debate, and we can talk about it later, but here I want to give you two general motivations, which have some connection to the rest of the talk.
The first one comes from an influential paper titled "Using goal-driven deep learning models to understand sensory cortex". This is a review of a whole industry of research that arose in recent years, drawing parallels between backpropagation-trained deep neural networks and the brain itself. Of course, historically, neural networks were themselves inspired by the brain in their architectural choices and so on, but the current philosophy is summarized in this figure. At the top you see the brain, in a very cartoonish picture, with these regions corresponding to several areas of the ventral visual cortex; at the bottom you see a convolutional neural network with its layers. The arrows are saying that maybe these layers correspond somehow to those areas of the brain, and that the hierarchy of the deep network has something to do with the hierarchy of the brain.

The data that backs this up is here; again, this figure is taken from that paper. Let's look at panel B first. What's happening is the following: a monkey is shown a series of images, and in parallel a convolutional neural network that was trained on some kind of classification task is shown the same images. The recording is done in an area of the brain called inferior temporal cortex, which sits high in the visual hierarchy, and neurons there are responsive to very high-level object categories. The black line is the response of a real biological neuron across a bunch of example images, and the red curve is the response of a linear combination of unit responses in a particular hidden layer of the convolutional neural network, and you see a very, very good match.

These authors went on to quantify the match between the biology and artificial neural networks, and they observed a few interesting things. One of them is summarized here: the y-axis shows how well an artificial model can predict responses in the cortex, and the x-axis shows the performance of the model on an image classification task, an artificial task. The better models do on the x-axis, the better they do on the y-axis; models trained to do well on artificial tasks are also good at explaining the brain. There are also finer-level comparisons; for example, they find that earlier layers of the convolutional network are better descriptors of earlier visual areas, and later layers are better descriptors of higher areas. So even though the representations in the brain came about through evolution, and maybe learning and development, and the network's representations came through backprop, there seems to be a match between some features of the representations themselves. Therefore one reason for a theoretical neuroscientist to study deep learning is to learn something about the representations in deep networks, in the hope that this sheds some light on representations in the brain. And of course it's much easier to study the network: I can simulate it on a computer, as opposed to a monkey, which involves lots of procedures, years of training, and things like that. So that's one motivation.

This is the Brain-Score website, hosted by Jim DiCarlo's lab at MIT. What you're seeing is a list of models ranked by their ability to predict brain activity, and all the models currently at the top are such deep learning models rather than older-fashioned ones, and people are
becoming more principled about the way they try to explain brain activity. OK, so that's one reason. The second reason I want to propose is that some problems seem to be common to the study of deep learning and of the brain. One of them, a subject brought up many times in this workshop, is the fact that modern deep neural networks are over-parameterized: the number of parameters is large compared to the number of training data points, and the state of the art has reached 100 million parameters. Now, the same could be said about the brain: our brains have about 10^14 synapses. Here is a quote from Geoff Hinton, from a Reddit discussion. The brain has about 10^14 synapses and we only live for about 10^9 seconds, so if you were to think of the brain as a parametric model trained in a supervised way, we would need about 10^5 labeled data points per second, which is obviously not happening. To be fair, Hinton was arguing that this gap could be filled by unsupervised learning. But if you talk to a neurobiologist, they will say things like: it's not even clear that learning is as important to intelligence as we think. Here is a quote from Tony Zador, a neurobiologist, from an opinion article he wrote about this: most animal behavior is not the result of clever learning algorithms, supervised or unsupervised, but is encoded in the genome; there are many animals that basically function at birth. Of course we do learn during our lifetimes, but for any given task, the number of examples we see is much smaller than the potential number of synapses that take part in that computation. So even within our learning paradigms we are living in this over-parameterized regime, and other authors have commented on this as well. For that reason, all the lessons coming from the study of deep learning and its inductive biases in the over-parameterized regime might have something to do with the brain, and I'll try to show an example of this idea applied to neuroscience today.

Here is my depiction of this idea of over-parameterization and inductive bias. This is a two-layer, or, depending on how you count, three-layer neural network with a hundred neurons in each layer, so about 10,000 parameters, and I trained it on a scalar-input, scalar-output problem with just two data points, so it is really vastly over-parameterized. I threw all of machine learning's optimizers at it, and even though this is a very powerful function approximator that could fit any crazy, wiggly, high-frequency function through these two points, it ended up fitting a line. There is some kind of Occam's razor going on: among all the explanations this machine could come up with for this data, it chose something simple. One might imagine that these kinds of principles are also at work in the brain, with strong reliance on inductive biases, so understanding what causes inductive biases is important both for the brain and for the study of deep learning.

With these motivations, I've been looking into deep learning a little more. In particular, for this talk we'll study a simple, or at least simply stated, scenario: a multi-layer feedforward neural network. I'll have a dataset where the x's are the inputs and the y's are the desired labels; for me P is the number of data points, which is the opposite of the statistics convention. There is going to be some loss, and I'm going to focus on gradient-flow dynamics.
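The two-point experiment just described can be sketched in a few lines. This is not the speaker's exact setup (he trained a multi-layer network with standard optimizers); as a stand-in, the sketch below builds the minimum-norm interpolant of a random-ReLU-feature kernel, which shows the same Occam's-razor behavior: a vastly over-parameterized model that interpolates two points with a nearly straight curve. All numbers here are illustrative assumptions.

```python
import math, random

random.seed(0)

# Two training points: scalar input, scalar output.
xs, ys = [-0.5, 0.5], [0.2, 0.8]

# N random ReLU features phi_j(x) = max(0, w_j * x + b_j).
N = 2000
w = [random.gauss(0.0, 1.0) for _ in range(N)]
b = [random.gauss(0.0, 1.0) for _ in range(N)]

def k(x1, x2):
    """Dot-product kernel of the random features, averaged over the width N."""
    return sum(max(0.0, wj * x1 + bj) * max(0.0, wj * x2 + bj)
               for wj, bj in zip(w, b)) / N

# Minimum-norm interpolant through the two points: solve the 2x2 system K a = y.
K11, K12, K22 = k(xs[0], xs[0]), k(xs[0], xs[1]), k(xs[1], xs[1])
det = K11 * K22 - K12 * K12
a1 = (K22 * ys[0] - K12 * ys[1]) / det
a2 = (K11 * ys[1] - K12 * ys[0]) / det

def f(x):
    return a1 * k(x, xs[0]) + a2 * k(x, xs[1])

def line(x):
    """The straight line through the two training points."""
    return ys[0] + (ys[1] - ys[0]) * (x - xs[0]) / (xs[1] - xs[0])

# The model interpolates both points exactly, and in between it stays close
# to the straight line rather than producing a wiggly high-frequency curve.
print(f(xs[0]), f(xs[1]))
print(max(abs(f(x / 10.0) - line(x / 10.0)) for x in range(-5, 6)))
```

The choice of a simple interpolant among infinitely many is exactly the inductive bias the talk is pointing at.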
Under gradient flow, the parameters θ of the neural network update along the negative of the gradient of the loss with respect to the parameters. I just want to point out that there is a corresponding equation for the network's output under gradient-flow dynamics. Here f(x) is the function the network expresses, as a function of x and of the parameters θ, and as the parameters evolve over time we get an expression for how the network's prediction changes over time; it follows from the simple chain rule. And here arises an object, which was mentioned a few times already during this meeting: the gradient of the network's output with respect to the parameters, dotted with itself, evaluated at two different data points. This is the neural tangent kernel, and I'm highlighting it because it will play an important role in this talk. So this is a differential-equation description, and one might hope to say something about that differential equation. Of course it is deceptively simple, because all the complexity of the neural network is hidden in this tangent kernel, which is itself evolving over time. So how do we study this? A good strategy that physicists have been exploiting for a long time is to study nontrivial but tractable limits and see if you can gain insight; that's where the title comes from, deep learning at limits. I was hoping to talk about three such limits today, but the talk got longer, so maybe I'll stick with two, and also illustrate a neuroscience application of one of them. The first of these limits is the well-studied and influential infinite-width limit.
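As a concrete companion to the definition above, here is a minimal sketch that computes the empirical neural tangent kernel of a toy one-hidden-layer network by hand. The architecture, width, and scaling below are illustrative assumptions, not the network from the talk.

```python
import math, random

random.seed(1)

# Toy width-N network f(x) = (1/sqrt(N)) * sum_j v_j * tanh(w_j * x), scalar input.
# Its parameters are theta = (w_1..w_N, v_1..v_N).
N = 500
w = [random.gauss(0.0, 1.0) for _ in range(N)]
v = [random.gauss(0.0, 1.0) for _ in range(N)]
c = 1.0 / math.sqrt(N)

def grad_f(x):
    """Gradient of the network output with respect to all parameters theta."""
    g_w = [c * vj * (1.0 - math.tanh(wj * x) ** 2) * x for wj, vj in zip(w, v)]
    g_v = [c * math.tanh(wj * x) for wj in w]
    return g_w + g_v

def ntk(x1, x2):
    """Neural tangent kernel: grad_theta f(x1) . grad_theta f(x2)."""
    return sum(a * b for a, b in zip(grad_f(x1), grad_f(x2)))

pts = (-1.0, 0.5, 1.0)
K = [[ntk(x1, x2) for x2 in pts] for x1 in pts]
# Being a Gram matrix of gradient vectors, K is symmetric and positive
# semi-definite by construction.
print(K[0][1], K[1][0])
```

At finite width this kernel changes as training moves θ; the infinite-width story that follows is precisely the statement that it stops changing.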
The idea is the following. Here is some more notation: φ is going to be my nonlinearity, and the inputs are cascaded through affine transformations and nonlinear transformations from one end to the other. In this limit you keep adding more and more neurons to each layer, so the widths go to infinity. If you initialize your network randomly in a particular way, for example by choosing the weights from a normal distribution with zero mean and variance one, and you also scale the inputs to the next layer by one over the square root of the width of that layer, to keep the variances of order one as the widths go to infinity, then you observe, and by now it has been proven in many different ways, that something very nice happens: all this complexity gets reduced to the following. What changed from here to there is that this object, the tangent kernel, became static: it doesn't evolve anymore but stays fixed at its value at initialization. And especially if you consider a square loss, this becomes a linear differential equation, which you can solve and study, and life is good. So let's do that, but before doing that let me illustrate in pictures what happened. This was the full complexity, where all layers are trained; in this particular limit it gets mapped to this other picture, where you take some input and map it through static nonlinear features, given by the gradient of the network's output at initialization, and then combine these features linearly to produce the output. If you're doing square loss, it turns out that this procedure is a kernel regression problem, where
the kernel is given by the dot products of these features. That's the neural-tangent-kernel-machine limit of deep learning. There are of course a lot of well-known results about this problem, because at the end of the day it is a linear problem, in the following sense: you're fitting a linear model, but the linear model's features are nonlinear functions of the input. So it's trivial but not so trivial, and it gives some insight, and that's what we're going to talk about.

Our particular contribution was the following. We took this model and, using some machinery from statistical mechanics whose details I won't go into (they are in these two papers), came up with a way to predict the generalization error of this kernel machine in a fairly generic way, for any task: the target function the network is trying to fit can be anything, and so can the data distribution. Of course this generality comes with a caveat: the whole theory depends on the spectral properties of the kernel, diagonalized on that distribution. What comes out is a formula: given a number of samples, it predicts how well this model will generalize over the full data distribution. The formula looks like this; it's complicated, I'm not going into its details, but it depends on the spectrum of the kernel. What I'll do is point out a few pieces of the phenomenology it describes, relating them to neuroscience, although many other phenomena come out of these equations as well.

First let me try to convince you that this actually works. Here are simulations. The solid line is the theory, evaluated from our equations. For the triangles, we took varying numbers of examples from the MNIST dataset, trained this kernel machine (all with square loss), and calculated the generalization error; the error bars are averages over different sub-samplings of the training set. More interesting are the circles: in this particular case a three-layer neural network of width 800, not infinite, trained with gradient descent, and those points show how well it performs. This is a statistical mechanics theory in the sense that it's not an upper bound or a lower bound; it's supposed to describe the data itself, and it does pretty well. We expect it to break at some point, and where it breaks relates to the other two limits I'm going to attempt to get to, but there exists a nontrivial parameter regime where it's a pretty good descriptor. The other panel is a slightly more complex dataset, CIFAR-10, and this is, I believe, a four-layer MLP of width 5000; it's the same story, circles are the kernel limit and triangles are the neural network simulations.

So this theory is descriptive in some parameter regime. Can we use it to say something nontrivial, to gain some insight, about deep learning?
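The object the theory predicts is a learning curve: generalization error as a function of sample size for kernel regression. The sketch below computes such a curve directly, with an RBF kernel and a smooth target standing in for the NTK and the real task; everything here is an illustrative assumption, not the closed-form prediction from the papers.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        piv = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        for r in range(i + 1, n):
            fac = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= fac * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def rbf(x1, x2, ell=1.0):
    return math.exp(-((x1 - x2) ** 2) / (2.0 * ell * ell))

def target(x):                      # the "teacher" function to be learned
    return math.sin(2.0 * x)

test_xs = [-3.0 + 6.0 * i / 200 for i in range(201)]

errs = []
for p in (2, 4, 8, 16, 32):
    train_xs = [-3.0 + 6.0 * i / (p - 1) for i in range(p)]
    y = [target(x) for x in train_xs]
    # Kernel regression: alpha = (K + small ridge)^-1 y, then f(x) = sum_j alpha_j k(x, x_j).
    K = [[rbf(a, b) + (1e-8 if i == j else 0.0)
          for j, b in enumerate(train_xs)] for i, a in enumerate(train_xs)]
    alpha = solve(K, y)
    preds = [sum(aj * rbf(x, xj) for aj, xj in zip(alpha, train_xs)) for x in test_xs]
    errs.append(sum((ph - target(x)) ** 2 for ph, x in zip(preds, test_xs)) / len(test_xs))

print(errs)  # the generalization error falls as the sample size p grows
```

How fast such a curve falls for a given kernel and task is exactly what the spectral theory in the talk quantifies.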
The first thing I'm going to focus on is the idea of inductive bias, or what I'll call spectral bias; I call it spectral bias because the fact that neural networks fit data with simple functions has to do with the spectrum of this kernel. It goes through the following fact: you can take any kernel and diagonalize it under any data distribution. The kernel then has eigenvalues and eigenfunctions; on this slide the etas are the eigenvalues and the phis are the eigenfunctions. The nice thing about this description, without going into technical details, is that a kernel defines a function space, the reproducing kernel Hilbert space, and these eigenfunctions constitute a basis of that space. Roughly speaking, you can think of the neural network in this limit as, at the end of the day, taking these eigenfunctions and combining them linearly to produce something, and similarly you can expand the target function the network is trying to fit in the same basis. In this particular limit the whole problem boils down to figuring out the coefficients, which depend on all of the network's complexity, with which the learned function matches the true target function.

OK, so why am I bringing this up? Because from these equations you can prove a few interesting things. The first is the following; again, we're talking about square loss here. You can prove that there is a spectral bias in the following sense: the kernel machine fits the data in an ordering of these eigenspaces; it will first fit the data in the space corresponding to the first eigenfunction, then the next, and so on. The precise mathematical statement is this one: we form these relative errors, where this is the weight the network learned and this is the true weight, and you find that the generalization error for a particular mode is smaller, for any dataset size, when the eigenvalue corresponding to that eigenfunction is larger. So I call this spectral bias: there is an ordering of functions.

Using this, you can formulate a metric I call the cumulative power, C(k): take the target function and add up the power of its coefficients along the kernel's eigenmodes up to some value k. If this cumulative power rises very fast, the target function is easy for this model to learn; if it rises slowly, there is no good alignment between kernel and task, and the network will not learn the function in a sample-efficient manner.

Here's an example of this fact. This is a very special case where we can analytically solve for the whole eigenspectrum of the kernel under the data distribution and make even sharper statements. Roughly, the data is sampled uniformly from a high-dimensional sphere, and you can show that the kernel's eigenfunctions in this case are the spherical harmonics, the Y_km's (I always forget which of the indices is the degree and which is the order, so I'll just say k is the first index). You can then consider target functions that have only a certain k content; this k is basically a frequency, and higher k means more and more complex functions. You can prove, or at least show, that in this limit it takes about d^k samples, where d is the dimensionality of the data, to learn such a target function. So if you had a target with only, say, k = 7 content, and in this particular case we have 15 dimensions, you would need a dataset of about 15^7 samples to even see a reduction in the generalization error. That's what this figure is showing: each line is a target function with a different k content.
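The cumulative power metric is easy to state in code. The coefficients below are made up purely for illustration; the point is only the shape of the two curves.

```python
def cumulative_power(coeffs):
    """C(k): fraction of the target's power in the first k kernel eigenmodes."""
    total = sum(c * c for c in coeffs)
    out, running = [], 0.0
    for c in coeffs:
        running += c * c
        out.append(running / total)
    return out

# Hypothetical target coefficients in the kernel's eigenbasis, ordered by
# decreasing eigenvalue; the numbers are invented for illustration.
aligned = [1.0, 0.8, 0.5, 0.2, 0.1, 0.05, 0.02, 0.01]   # power in leading modes
misaligned = list(reversed(aligned))                     # power in tail modes

C_easy = cumulative_power(aligned)
C_hard = cumulative_power(misaligned)
print([round(c, 3) for c in C_easy])   # rises fast: sample-efficient to learn
print([round(c, 3) for c in C_hard])   # rises slowly: poor kernel-task alignment
```

Both curves end at 1 by construction; it is the rate of rise that diagnoses how well the kernel is aligned with the task.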
For a target with k = 1 content, the generalization-error curve starts falling after about 15 samples, but for this one over here, which has k = 4 content, you need about 15^4 samples before it starts falling. This illustrates something about the inductive bias of neural networks: if you know what you're looking for, it's easy to fool them, to generate tasks on which they are actually not very data efficient.

Now we're going to take this insight from this domain and try to say something about the brain, and in particular we're going to use this data, which comes from an article by Stringer and colleagues. What the authors did was take a mouse, look at its primary visual cortex, and record from about 10,000 to 20,000 neurons through an imaging technique while showing the animal many, many different images, including images from the ImageNet dataset. We're going to treat this data as follows: think of a population of neurons, forget about everything that comes before it, treat the population as static, and train a readout from this population using some kind of delta learning rule. This boils down to gradient descent, and at the end of the day you can show that the whole procedure is again a kernel regression problem, with a kernel defined by the dot products of these representations. Then we can ask the questions I just asked. For example: can we reconstruct the scenes this animal was seeing? Here are examples of the pictures the animal saw; we look at this data, give our readout a bunch of examples, train it, and see whether we can reconstruct them.
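The claim that a delta-rule-trained readout ends up doing kernel regression can be checked in a toy example. The "population code" below is random numbers, purely an illustrative assumption: gradient descent on a linear readout from zero initialization converges to the minimum-norm interpolant, whose predictions depend on the code only through the dot-product kernel K = X Xᵀ.

```python
import random

random.seed(2)

# A static "population code": p stimuli, each with N neural responses.
p, N = 3, 8
X = [[random.gauss(0.0, 1.0) for _ in range(N)] for _ in range(p)]
y = [1.0, -1.0, 0.5]

# Delta rule: gradient descent on a linear readout w, starting from zero.
w = [0.0] * N
lr = 0.01
for _ in range(20000):
    for xi, yi in zip(X, y):
        err = sum(wj * xj for wj, xj in zip(w, xi)) - yi
        w = [wj - lr * err * xj for wj, xj in zip(w, xi)]

x_test = [random.gauss(0.0, 1.0) for _ in range(N)]
pred_gd = sum(wj * xj for wj, xj in zip(w, x_test))

# Kernel regression with K = X X^T: solve K alpha = y (tiny Gaussian elimination).
K = [[sum(a * b for a, b in zip(X[i], X[j])) for j in range(p)] for i in range(p)]
M = [row[:] + [yi] for row, yi in zip(K, y)]
for i in range(p):
    piv = max(range(i, p), key=lambda r: abs(M[r][i]))
    M[i], M[piv] = M[piv], M[i]
    for r in range(i + 1, p):
        fac = M[r][i] / M[i][i]
        for c in range(i, p + 1):
            M[r][c] -= fac * M[i][c]
alpha = [0.0] * p
for i in range(p - 1, -1, -1):
    alpha[i] = (M[i][p] - sum(M[i][j] * alpha[j] for j in range(i + 1, p))) / M[i][i]

pred_kernel = sum(alpha[i] * sum(a * b for a, b in zip(X[i], x_test)) for i in range(p))
print(pred_gd, pred_kernel)  # the two predictors agree
```

This is why the spectral-bias analysis transfers directly to the neural recordings: once the readout is linear, only the kernel of the population code matters.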
We did this in multiple ways; I'm just going to tell you one result that came out of it. One thing we did that turned out to be meaningful was, instead of reconstructing the original image, to band-pass filter it at several frequencies and try to reconstruct high-, mid-, or low-frequency filtered versions of the images. Long story short, what we find is that it's much easier to reconstruct the low-frequency content of the images from the recordings than the high-frequency content. And there is a belief that mice are not very good visually and rely on the low-frequency content of the world, so that's a biological insight that came out of this.

You can ask other questions too. For example, here are again some samples from the ImageNet dataset that this animal saw. Looking at this data, can we tell, from this area, between two classes? I believe the classes were birds and, I don't know, rodents. It turns out this code is terrible for that: if you plot the metric I described, the task-model alignment, it rises only linearly, meaning that with few samples you will not be able to do this classification problem from this code. So it again verifies the intuition that primary visual cortex mostly represents low-level features, like edges and so on, and learning these kinds of object-level features from it is hard.

OK, cool. Here's another thing this kernel picture pushed us to ask. One thing we noticed was: look, if everything depends on this kernel, which is a dot product, then I can always rotate these response vectors by some rotation matrix; it just doesn't change the dot products. (I'm told I have two minutes, OK.) So the question is: why choose one code over the others? Among all the possible rotations, is there a reason biology chose this one and not the other ones? Here is an experiment we did; I'm going to skip the explanation and just give you the result. We took the biological code and compared it to all these rotated versions. They have the same dot-product kernel, meaning that learning performance from the code is exactly unchanged. However, if you look at the total spike count that these different codes use, biology is very different. Here are histograms of the total spike count of randomly rotated versions of the biological code; the vertical black line is the biological code, and the vertical red dashed line is a code where we optimized for spike count while keeping the kernel fixed. These are something like a hundred standard deviations away from the histograms you get from the randomly rotated codes. So biology does something special in trying to be efficient.

All right. What happened is what I thought would happen: I ended up talking about only one of these limits. The problem with this limit is that there is no representation learning, and if you want to study representations you have to find other limits that are possible to study. One thing you can actually do, instead of finding a limit, is to look at these metrics of task-model alignment over training, in a network that is not in this limit.
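Returning to the rotation argument for a moment: the invariance it relies on is elementary to verify. The nonnegative matrix below is a stand-in for spike counts, and the orthogonal matrix is built by Gram-Schmidt; both are illustrative assumptions. Rotating the code leaves the dot-product kernel untouched while changing the total activity.

```python
import math, random

random.seed(3)

p, N = 4, 5
# Nonnegative "population code": p stimuli x N neurons (stand-in for spike counts).
X = [[abs(random.gauss(1.0, 1.0)) for _ in range(N)] for _ in range(p)]

# Random orthogonal matrix R via Gram-Schmidt on random Gaussian vectors.
R = []
for _ in range(N):
    row = [random.gauss(0.0, 1.0) for _ in range(N)]
    for q in R:
        d = sum(a * b for a, b in zip(row, q))
        row = [a - d * b for a, b in zip(row, q)]
    nrm = math.sqrt(sum(a * a for a in row))
    R.append([a / nrm for a in row])

# Rotate every response vector: X_rot[i] = R X[i].
X_rot = [[sum(R[k][j] * X[i][j] for j in range(N)) for k in range(N)] for i in range(p)]

def gram(M):
    return [[sum(a * b for a, b in zip(M[i], M[j])) for j in range(p)] for i in range(p)]

G, G_rot = gram(X), gram(X_rot)
kernel_gap = max(abs(G[i][j] - G_rot[i][j]) for i in range(p) for j in range(p))
activity = sum(abs(v) for row in X for v in row)
activity_rot = sum(abs(v) for row in X_rot for v in row)
print(kernel_gap)              # ~0: learning from the code is unchanged
print(activity, activity_rot)  # but the total activity differs
```

So any learning-performance criterion based on the kernel cannot distinguish these codes, and something else, such as metabolic cost, must break the tie.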
You find that this alignment increases over training time and across layers. But that is more of a descriptive study, so I'll just flash two recent papers and then wrap up. We have found two other limits in which you can study representation learning and which are analytically tractable. One is this paper, an ICLR paper from this year. What it shows is that if you initialize a neural network with a small initialization, you don't have to go to the infinite-width limit: it turns out that the descriptor of the network's learned function is not the initial tangent kernel but the final tangent kernel. Of course, to get the final tangent kernel you have to train your network, but there is still a match to a kernel machine, which has its own properties; I'll skip that. The second one I want to highlight is this paper, which just hit arXiv. It presents a different kind of infinite-width limit in which representation learning is possible. It involves introducing a new parameter, γ, and taking it to infinity together with the network width. What we do is take the gradient descent dynamics and pass them through the Martin-Siggia-Rose formalism (there was a talk about dynamical mean-field theory yesterday; it's a similar formalism), then take the infinite-width limit, and at the end of the day show that the evolution of all of the network's hidden-layer representations follows a stochastic process, and that stochastic process is very descriptive of the network's behavior. Here is what's happening: this is the training loss; at the top is the tangent-kernel limit, and here is the representation-learning limit, and the dynamical mean-field theory is able to predict both. It also shows that the dynamical mean-field theory predicts not just the losses but also the hidden-layer kernels, demonstrating representation learning at the beginning, during, and after training. This is very recent work and we haven't really used it yet to gain more insight; at this stage I'm showing that there is a theory and the theory matches data. But I'm very excited about this one, because I think there is a lot to be gained from studying its phenomenology, toward more insight into representations and what they reveal about learned features. With that, I want to acknowledge my group members, especially Alex, Blake, and Abdul, whose work today's talk was based on, and my funding sources. Thank you very much, and apologies for going over time.

Thank you very much, that was very interesting; there was a lot of stuff in this talk. Do we have some questions from the audience? Mario, let's start with you.

Thank you very much, very nice talk. I have a question based on your introduction. You said that we think that when we are born, our DNA gives us a brain that is already able to do some things. Do you believe that the wiring, the connections, are encoded in the DNA, and if so, how?

It's a very good question, and again this is debated. In one of the papers I've written, we tried to estimate the information content of the DNA, how much information you would need to specify all the connections, and found a big gap, though we can argue about whether that calculation is correct. But I guess the most common belief is that it's not the exact wiring that is encoded, but rules to develop the whole system to a level where it's ready to function with little training afterwards. I think that's reasonable.

So, like the fingerprint: the fingerprint is not encoded in the DNA, but there is a rule that makes your skin create a fingerprint.

Yes, similar ideas.

OK, thanks for the question. Are there any other questions from the audience?

I was wondering, when you showed
these plots on the efficiency of the biological code that you found. Do you think that you could add some kind of energetic cost and then get efficient representations with gradient-based learning as well, or do you think these are really signatures of some alternative learning rule that lends itself better to metabolically efficient codes?

Yes, it's a very good question. Here I'm taking the given code and trying to say something about it, and you're asking whether this arises from a particular kind of training. I believe you could certainly try to penalize non-sparse codes during training and see if something like this arises.

But you haven't tried it?

We haven't tried this one yet.

OK, then let's thank Cengiz once again for his talk and for answering our questions.