[Chair: ...it's going to be interesting.] Thank you very much. Yeah, like Manuel said, I'm not Yue Lu, I'm Sebastian Goldt, and I hope we get to see Yue's talk later in the program. For now, I'm very happy to be here, and I'd like to start by thanking Manuel, Jean, Prager and Schubo for organizing this very nice program and for getting so many cool people here to Trieste; I've really been enjoying it so far. (Let me also check: is my microphone on? Is that better? Okay.)

So, like Manuel said, what I'm interested in is the theory of neural networks, and in particular I'd like to understand what neural networks actually learn from their data. In other words: what kind of features do they extract from their data set, and what kind of features do they then use to classify their inputs? This is a question about feature learning, and in some sense this question is as old as deep learning itself; it's been with deep learning since the beginning. One such beginning is of course the famous AlexNet paper from 2012, when Krizhevsky, Sutskever, and Hinton trained a deep convolutional network to win the ImageNet challenge. If you look at that paper, it's actually quite an interesting read. They discuss how to distribute the training across GPUs and all kinds of things, and then at the end they actually show the filters that the network learned in the first layer by the end of training. This is the figure taken from that paper. If you look at these filters and you're a computer vision person or a neuroscientist, you go: ah, I know these guys, these are Gabor filters. So I guess the interesting observation here was that this combination of an interesting data set (ImageNet), the convolutional architecture, and SGD as the training algorithm gave you something that looks roughly like Gabor filters. But this "approximately equal" is important, because if you now think "I can just substitute this first layer with the mathematical definition of a Gabor filter and save some trainable parameters", you'll actually do worse. So there is something in the data that is really important here, and that some of these filters have picked up. This is one of the early examples, I guess, of the bitter lesson of deep learning: the features that we learn directly from data, if we manage to learn them, are better than features that we would hand-engineer using domain knowledge. And this idea is not just prevalent in machine learning; it also appeared in neuroscience, quite a long time ago, in the form of efficient coding: the filters are somehow adapted to the environmental stimuli. But I'm not going to expand too much on this. What I really want to focus on is how we learn these features, and in particular how neural networks learn them. The title of the talk says "beyond the Gaussian world" because what we would like, of course, is to have some kind of Gaussian theory.
And what I'm going to argue is that to learn these filters, the non-Gaussian fluctuations in the data are particularly important. Here's a small example of that. This is the test accuracy of a DenseNet, one of those off-the-shelf deep convolutional networks that people use in deep learning, and here I'm training it on CIFAR-10, the standard classification benchmark. You get this kind of learning curve, and you go up to 80, 90 percent accuracy. That's great. Now, to test the impact of various statistical properties of your data, we can construct clones of this data set. For example, we can construct a Gaussian clone: a Gaussian mixture fitted to the data set, one Gaussian for each class. I fit the mean and the covariance, and then I sample a new data set from that Gaussian mixture. I can now use that as a training set, train the neural network on it, and test it on real CIFAR-10 images. This tells me to what extent the mean and the covariance, so the first two cumulants of the data, are important for this task. If I do this, I get this curve. What's maybe not surprising is that there's a big gap at the end of training: the network sees lots of information in the higher-order cumulants of the data that my Gaussian mixture doesn't capture, and I pay for that with a lower performance. So this is just to highlight that the non-Gaussian part of the data is very important here. A second observation, which is maybe less trivial, is that at the beginning of training these two curves actually coincide: for the first, I don't know, 10 or 20 steps of training, at least in terms of its generalization performance, it doesn't make a difference to the network whether it's looking at the Gaussians or at the real images. We'll come back to that a bit later. So the observation is that these higher-order cumulants are important for generalization. This is true even in two-layer networks, and even in the perceptron, as I'll show in an example later. But of course we don't really have a theory for this yet. So what I want to discuss in this talk are roughly three questions, to make some headway into this problem. One: how, dynamically speaking, do neural networks learn about these higher-order cumulants? Two: how efficient are they at it? If the cumulants are important, are neural networks a good method to learn from them? And finally: how do these higher-order cumulants actually shape the neural representations? How do we end up with these Gabor filters at the end? I'll start with images and computer vision, and later I'll talk a little bit about natural language and transformers.

So, how do you learn about these higher-order cumulants? We saw in this plot that in the beginning of training, whether you look at the Gaussians or at the real images doesn't make too much of a difference in terms of your test error. So we wanted to test what happens in the middle.
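Just to make the clone construction concrete, here is a minimal sketch; the function name and the flattened-array interface are mine rather than the paper's, and the actual experiments train a DenseNet on the clone and then evaluate on the real CIFAR-10 test set:

```python
import numpy as np

def gaussian_clone(X, y, rng=np.random.default_rng(0)):
    """Build a 'Gaussian clone' of a labelled data set: for each class, fit
    a Gaussian to the empirical mean and covariance of the flattened inputs
    and resample from it, keeping the labels fixed. (Fine for small inputs;
    at CIFAR scale the d x d covariance needs more care.)"""
    X_clone = np.empty_like(X)
    for c in np.unique(y):
        mask = (y == c)
        mu = X[mask].mean(axis=0)
        cov = np.cov(X[mask], rowvar=False)   # class covariance, d x d
        X_clone[mask] = rng.multivariate_normal(mu, cov, size=mask.sum())
    return X_clone   # train on (X_clone, y), test on the real images
```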
Ideally, we think of these as different approximations to the CIFAR-10 data set, and ideally we would like to have another approximation in the middle: something where, say, the first four cumulants match CIFAR-10 and the rest are zero. The bad news is that this is not possible. There's a theorem that tells you that a distribution can have either one cumulant, two cumulants, or all of its cumulants; as soon as you turn on the third cumulant, you turn on all of them. So we cannot play this nicely structured game. You could do some kind of maximum-entropy model, fixing the first few moments and sampling from that, but that's really tough. So what we did instead, in true deep learning spirit, was to take a generative neural network and train it on CIFAR-10. In particular, we took a Wasserstein GAN. We took the Wasserstein GAN because we wanted to make sure that the means and the covariances of the generated images match those of the data, and for the vanilla GAN they're actually quite far off. Then we trained the neural network on the images sampled from this GAN. When we test that network on CIFAR-10, we get this blue curve. You can see that, interestingly, at the end of training I'm again losing something in terms of performance: the GAN doesn't capture all the information that's in the images, and you can see that the images sampled from the GAN don't look great. But for the first 1,000 steps of training or so, the network doesn't seem to care: for the first 1,000 steps, at least in terms of the test error, whether I train on the Wasserstein GAN images or on the real images, it's the same story. And I can play this game with different models. Here's another data set, sampled from a big diffusion model, and it's the same story: these curves always collapse at early times. So there seems to be some kind of ordering here. The neural network could have gone about learning these statistics in any order, but it really seems to go through them in a particular sequence: it seems to be learning distributions of increasing complexity through time. And this is a fairly robust observation.

[Audience question: does this mean the dynamics are exactly the same?] No, and that's why I emphasize that this is really from the point of view of the test error. You're absolutely right: this does not mean that the dynamics are exactly the same. One thing we want to do now, for example, is to look at the representations: to what extent do they correspond to each other? So this is just in terms of the test performance, which is, of course, in some sense the most important thing. Also, let me use this opportunity to encourage you all to ask questions. I know we've all just had lunch, so let's try to keep this as interactive as possible. (I know, I shouldn't have said that, Mark.)

[Audience question: how do you quantify the distance between the distributions? The synthetic images from the diffusion model look much better.] That's a good question: how do you quantify, in a sense, the distance between distributions? We discussed this quite a bit. For me, at the end of the day, the best test is: how much do I learn about CIFAR-10 from the CIFAR-5m images, or whatever the data set is? So for me, the test error of this particular convolutional network is a good proxy.
There are ways of quantifying this. In Yoshiyuki's talk, for example, he presented some of the measures that people use to evaluate the quality of generative models, and on those measures the diffusion model behind CIFAR-5m does much better, in terms of Fréchet inception distance and all these things. It's a much bigger model; this WGAN is like five layers or something. [Audience question: what exactly do you fit to the data?] So for the Gaussian, for the Markov random field, I assume stationarity, so I really just compute the mean and the covariance for each class, and then I plug those into the Gaussian. For the GANs and for the other generative models I don't have that kind of control; the only thing I can check is that they at least get the mean and the covariance right, and actually that's not trivial: often you get nice-looking images that are, statistically speaking, completely off. But I have no control over the cumulants specifically anymore. This is great, keep the questions coming.

Okay, so this is just an experimental observation, and the question now is, of course: can we make this a bit more quantitative? Maybe quickly first: we tested this on a variety of architectures, also transformers, which handle images in a very different way from convolutional networks, and they show roughly the same behavior, there in the middle. What I also found interesting is that we tested this with a network that was pre-trained on ImageNet. You could say that if I start from random weights, so I hit an image with a random set of weights, there are some strong CLT kind of vibes going on, so obviously at the beginning everything is Gaussian. But with a pre-trained network we get exactly the same behavior: here, this ResNet is pre-trained on ImageNet, and then we train the whole network end to end on CIFAR-10. The training is much faster, so the dynamics are not identical, which was kind of reassuring for us, but the stages the network goes through are the same. I think this actually raises some interesting questions about what you learn when you do pre-training. There are lots of anthropomorphic intuitions about learning concepts and transferring them, that kind of stuff; at least statistically speaking, we don't really see much of that here. It seems to be maybe more of an optimization issue, I don't know.

Okay, so this was just an observation. In the paper we also looked at this from a bit more theoretical point of view: we looked at a single neuron, the perceptron, on a synthetic data set. Given the questions, I don't want to go into too much detail on this. Let me just say that what we can do here is look at the gradient flow of this perceptron on a binary classification task, where almost all of the dimensions are random.
There's a two-dimensional subspace in which the points actually have a meaningful separation, and then you can basically compute the different classifiers: the one that depends only on the mean; the one that depends on the mean and the covariance, which is Fisher's linear discriminant; and you can compute the first non-Gaussian correction to them. And these correspond exactly to the states of the perceptron through time. So the perceptron, too, goes through this learning of distributions of increasing complexity. And I think the important observation here is that the non-Gaussian statistics matter even if you have just a single neuron: even for the simplest network, it's not true that it only looks at the mean and the covariance of the data. We can talk about that offline.

So this first point was that, at least empirically, neural networks seem to go through a very organized way of learning about the distribution of their data, and the higher-order cumulants are important. Now you can of course ask: are neural networks actually a good method to learn from these cumulants? Are they efficient, in some sense? This is some recent work where we thought about this question, and the way we thought about it is by reframing it as a hypothesis test, or as a classification task. If I want to know whether my network has learned about cumulants, I can ask: can it tell the difference, in the classification sense, between images of horses, for example, and Gaussians that have the same mean and covariance? So this is just a hypothesis test: you have some simple baseline distribution q, and you compare some other distribution p to it, in this case the distribution of horses. Of course, I cannot describe this distribution of horses very well, so we resorted to looking at some toy models. The way you do this is to look at the likelihood ratio; this is very classical statistical testing. You compute the likelihood ratio, which is just the ratio between the two distributions (for continuous random variables, their densities), and then you can use a classical result: you look at the norm of this likelihood ratio, evaluated over the simple distribution, the Gaussian. If the second moment of this likelihood ratio diverges as you go to the thermodynamic limit, you can distinguish the two distributions; if it stays order one, you cannot. Or at least not in polynomial time? Actually, the statement here is that you just cannot at all; the polynomial-time version is my next slide. Because now the question is, of course, how do I model p? I'm not going to model it with actual horses, I can't do that. So this is a statistical statement: it's about how much information is in the distributions, whether you can distinguish them at all. When we talk about neural networks, what we're interested in is an algorithmic distinction: can we do this with a polynomial-time algorithm?
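To fix the notation before moving to the algorithmic version, the classical criterion I just described reads as follows (this is standard second-moment material, nothing specific to our paper):

$$ \mathrm{LR}(x) = \frac{p(x)}{q(x)}, \qquad \|\mathrm{LR}\|^2 \;=\; \mathbb{E}_{x\sim q}\!\left[\mathrm{LR}(x)^2\right] \;=\; \mathbb{E}_{x\sim p}\!\left[\mathrm{LR}(x)\right]. $$

If this norm stays of order one in the thermodynamic limit, no test can reliably distinguish p from q (a contiguity argument in the style of Le Cam); if it diverges, statistical distinguishability is at least possible.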
And then there are some very nice ideas that appeared recently, inspired by the sum-of-squares hierarchy, which first appeared in the PhD thesis of Sam Hopkins. The idea here is not to look at the likelihood ratio itself, but at its projection onto a polynomial subspace. Here the baseline distribution q is Gaussian, so the polynomials in which you expand are the Hermite polynomials; this is the projection, and you expand in all the polynomials of degree at most D. Then the statement, and this is a conjecture, it's not proven, but it's been shown to make the right predictions on many models, is this: if p and q are nice enough, and if your degree is logarithmic in the input dimension, then as long as this low-degree likelihood ratio stays bounded, there is no polynomial-time algorithm that can distinguish the two distributions. A very simple but, I think, very nice idea for going from an information-theoretic quantity to something that tells you about a pretty big class of algorithms. The key here is that the degree scales logarithmically with the input dimension, which means that spectral methods and these kinds of things are covered by this framework with probability one as you go to high dimensions. There's a very nice review, from which I learned many of these things, by Kunisky, Wein, and Bandeira, which I recommend if you're interested in this.

So what are the data models we're going to use for p? A lot of work in this framework has been done on Gaussian additive models, for example a spiked Wigner matrix; the review by Bandeira et al. focuses on that. You either observe a spiked Wigner matrix or an unspiked matrix, and you want to distinguish the two. These are data models that we've seen quite a few times already, but there you have a square matrix, and we want to extend this to the case where you have d-dimensional samples and n of them. The natural Gaussian model to consider here is the spiked Wishart model that we've seen many times now; this morning there was a very nice introduction to it. So under p you observe a data matrix where each vector comes from the spiked Wishart model, and under q you observe white noise. It's pretty easy to extend the LDLR method to this model, and what you do is recover the BBP transition: the two are distinguishable above a critical signal-to-noise ratio, which scales like the square root of the dimension over the number of samples.

Now, we wanted to say something about learning non-Gaussian fluctuations, so we tried to make the smallest variation on the spiked Wishart model that we could, and we generate samples like this. Under p you observe a data matrix X, and you generate the vectors x^mu in two steps. The first step looks a lot like the spiked Wishart model: w is still a Gaussian noise vector, but the scalar g, which would have been a Gaussian random variable in the spiked Wishart model, we draw from something non-Gaussian. You can make your pick: you can draw it from a Rademacher, you can draw it from a Laplace, anything non-Gaussian works.
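In formulas, the first sampling step is (the precise normalization of the signal-to-noise ratio is my guess at the convention on the slide; check Lorenzo's poster for the exact one):

$$ x^\mu = \sqrt{\tfrac{\beta}{d}}\, g^\mu\, u + w^\mu, \qquad w^\mu \sim \mathcal{N}(0, I_d), \qquad \mathbb{E}[g^\mu] = 0, \quad \mathbb{E}[(g^\mu)^2] = 1, $$

with the spike u uniform on the unit sphere and the scalars g^mu drawn i.i.d. from a non-Gaussian distribution.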
This just means that my x^mu is now non-Gaussian: I have non-zero higher-order cumulants. But of course, if you match the mean and the variance of the Laplace to the Gaussian, you can still just look at the covariance matrix, and a simple spectral method is all you need. So in a second step we whiten the data; it turns out that in this model you can actually compute the square root of the covariance in a couple of lines, and our x's will be the whitened inputs from this spiked model. In other words, this is now a data set that has a spike in the fourth-order cumulant, but if you just run a spectral method, if you just look at the covariance matrix, it looks completely isotropic. This is the p that we're going to use. Then you can ask: how many samples do you need to distinguish the two distributions? You can deploy the LDLR toolset; Lorenzo did the heavy lifting here, and he has a poster on this. Basically, what Lorenzo found is matching bounds on this low-degree likelihood ratio. They look a bit forbidding, but what they basically tell you is that the likelihood ratio diverges if you have quadratic sample complexity. So contrary to the spiked Wishart, you now need a number of samples that is quadratic in the input dimension; if you have anything smaller than that, you're not going to be able to distinguish the two. On the one hand this makes sense, because you're estimating things from a fourth-order cumulant; on the other hand you could say: well, I still just need to estimate one vector, this spike u. But what the low-degree likelihood ratio tells you is that you need a number of samples that's quadratic in the input dimension.

I see; so I think Bruno first, and then Marco. [Bruno: if I understand correctly, this is from an information-theoretic perspective?] Not really. The information-theoretic perspective would be to look at the full likelihood ratio, which is actually much harder to compute here, because we don't have a closed expression for the distribution p. This is about polynomial-time algorithms; it's a computational statement. [Marco: I'm trying to connect this to the original metric you introduced on the first slide, which is performance. When it comes to performance, I think it will depend on the task, because for example one can show that in a spiked model like a Gaussian mixture, the performance is completely equivalent to a Gaussian one from the perspective of a perceptron; it depends on how the means correlate with the process that generated the labels, even above the BBP transition. So the test performance might still look Gaussian, depending on how you label the data.] Yeah. The important thing here is that there's nothing in the second-order statistics that helps you do the task, if you think about it as a task.
So here the key thing is that you force the network, or whoever is doing the test, to look at the cumulants, because they're the first information-carrying statistics. But indeed, the comparison to neural networks is actually subtle, and we can talk about it offline. Okay, thanks. Mark, I think you're next. [Mark: just a clarification: if you do this whitening in the case where g is Gaussian?] Then there's nothing left, exactly; if g is Gaussian, there's nothing left. [Audience question: mine is much more mundane, I'm trying to understand the model exactly. What is w?] So I'm writing it for one sample vector: w is a Gaussian noise vector, and this is exactly the spiked Wishart if g is Gaussian. [Audience: so with the Laplace, you're changing the prior of the signal?] No. There's the prior of the spike u, of the signal, which I'm not specifying; let's say it's uniform on the sphere. The prior that I'm changing is the prior over these scalar variables g, which I'm never trying to reconstruct. [Audience: isn't that the SNR?] It's related to the SNR, but here I'm doing it in such a way that the Laplace still has mean zero and variance one; I'm really turning on the cumulants by changing the distribution of g. [Audience: I don't see the connection to the BBP as in the other case. In the BBP setting I get a matrix, something like lambda u u^T plus noise.] Exactly, as in the spiked Wishart. If you take the average covariance of this thing, the variance of the g^mu is one and they're i.i.d., so that part goes away, and you get the usual beta over d times u u^T plus the identity. [Audience: but I don't see how you pass from the matrix formulation to the vector formulation of x^mu.] Ah, okay. It's just a way of writing how I sample these things; the two are completely equivalent. [Audience: and if you were to write the matrix formulation for the case where g is Laplace, how would it look?] Well, I don't have a closed form for the distribution there, so it's less natural to write it that way; that's exactly why, if you want to go this route, it's easier to write it as a vector. Thanks. Any more questions on this? Should we discuss it more? There are many other questions one can ask here that are fun, and we have some; I have some questions, at least, so I'm happy to discuss more over the next two weeks. For now, let me just check the time quickly. Okay.

What we wanted to connect to, of course, is the performance of neural networks. Like I said, there are some subtleties in going from this hypothesis test with a polynomial-time algorithm to the performance of neural networks, and I'm happy to discuss them offline. But what I can tell you is that together with Esther, who also has a poster here and who did many careful simulations, we basically just tried neural networks on these kinds of classification tasks.
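For concreteness, here is a minimal sketch of how such a data set can be generated, following the two steps above; the Laplace choice for g is one of the allowed picks, the beta/d scaling matches the covariance written a moment ago, and the closed-form whitening is the "couple of lines" I mentioned:

```python
import numpy as np

def spiked_cumulant_samples(n, d, beta, rng=np.random.default_rng(0)):
    """Step 1: spiked data with a non-Gaussian scalar g (here Laplace with
    mean 0 and variance 1). Step 2: whiten, so the covariance becomes
    isotropic and only the fourth-order cumulant carries the spike."""
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                        # spike, uniform on the sphere
    g = rng.laplace(0.0, 1.0 / np.sqrt(2.0), n)   # non-Gaussian scalars
    W = rng.standard_normal((n, d))               # Gaussian noise vectors
    snr = beta / d                                # population cov: I + snr*u u^T
    X = np.sqrt(snr) * g[:, None] * u[None, :] + W
    # Inverse square root of the covariance, in closed form:
    # (I + snr*u u^T)^(-1/2) = I - (1 - 1/sqrt(1 + snr)) u u^T.
    return X - (1.0 - 1.0 / np.sqrt(1.0 + snr)) * np.outer(X @ u, u)
```

A sanity check worth running: the empirical covariance of the output is close to the identity, while the excess kurtosis along u is non-zero unless g is Gaussian.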
Here, for example, we tried them on this model of images; I'll show how that works a bit later, but it's the same idea: you have some data which is not Gaussian, and you compare to its Gaussian clone. In particular, back to the efficiency question, we wanted to know: do you need neural networks for this? Could a kernel, a lazy method, do it just as well? And the quick answer is no. Here, for example, I'm showing you the test accuracy as a function of the training set size. Again, for small training sets there's nothing that either method picks up, but as I increase the data set size, there's a very sharp transition where the neural network suddenly picks up something in the data, and the lazy method, in this case random features, does nothing. Even if we go a bit further, the random features will at some point start picking up something, but they will have a much lower performance, and it will take them a lot of data to get to a decent performance. You can do this a bit more carefully: you can really interpolate between the neural network and the lazy regime using the alpha-trick that we learned from Lénaïc Chizat, and you get the same story. As I go from the feature-learning regime to the lazy regime, the transition where I pick up the structure moves further out. Later on, I'll show you what exactly changes in the data that the neural networks then detect. All of this is to say: neural networks are efficient at learning from these higher-order cumulants, at least as long as you compare them to these linearized neural networks, to these kernel methods.

But now, of course, what we really want is to understand feature learning, representation learning: how do these cumulants actually shape the representations that the neural network picks up? To do this, let's look at another data model; it's the next to last, we'll see. This is joint work with Alessandro Ingrosso, who is here at ICTP. With Alessandro, we were interested in questions around learning convolutions, learning from images, and so we wanted to have a good model for images; in fact, we wanted a minimal model for images. So you can ask: what does that mean, what do you need? One very basic thing that we started from is translation invariance. Images are roughly translation invariant; that's why convolutional networks are a good idea when you're dealing with images. So let's start with something that's translation invariant, and because we're statistical physicists, let's start with Gaussians. We have a Gaussian process with mean zero, and the covariance is translation invariant in the sense that the covariance between any two pixels depends only on their relative distance; the one free parameter in this model is the correlation length between the pixels. It's very easy to sample from this model, and the images you sample from it look like this: they look nothing like natural images.
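Since the data model is easy to state, here is a minimal sketch of it, in one dimension for brevity; the squared-exponential kernel, the periodic boundaries and the normalization are my assumptions rather than the paper's exact choices, and the gain nonlinearity is the ingredient I introduce in a moment:

```python
import numpy as np
from scipy.special import erf

def nlgp_samples(n, d, xi, gain, rng=np.random.default_rng(0)):
    """Translation-invariant Gaussian process with correlation length xi,
    pushed through a saturating pointwise nonlinearity whose slope (gain)
    sharpens the edges; gain -> 0 recovers the Gaussian case."""
    idx = np.arange(d)
    dist = np.abs(idx[:, None] - idx[None, :])
    dist = np.minimum(dist, d - dist)              # periodic boundary
    C = np.exp(-dist**2 / (2.0 * xi**2))           # depends on |i - j| only
    L = np.linalg.cholesky(C + 1e-6 * np.eye(d))
    z = rng.standard_normal((n, d)) @ L.T          # Gaussian samples, cov C
    x = erf(gain * z)                              # sharp edges at large gain
    return x / x.std()                             # keep the variance O(1)
```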
In particular, they have this typical Gaussian-process look where everything is very blurry. Now, there's actually a lot of discussion, especially in the neuroscience literature, if you go back to papers from the late 80s and the 90s, about what actually makes an image an image, and one thing that comes up repeatedly is the idea of edges. Edges are really important: sudden changes in luminosity, exactly the thing that the Gaussian is missing. So how do we get edges into these images? We put them through a saturating nonlinearity, a very simple pointwise nonlinearity; here we're using the error function, and we control the slope of this error function using a little parameter g (apologies, this has nothing to do with the g that we just talked about). This little gain factor g controls the slope of the nonlinearity around zero, and what that does is control the sharpness of the edges in my image. As you increase the gain, you make the slope steeper, and you get these images, which look a bit like a poor man's Ising-model samples; and of course they have the same mean and covariance as an Ising model, which is why we chose this covariance. Statistically speaking, what you do as you make the edges sharper is to increase the relative importance of these higher-order cumulants, and we'll see why that is important.

What we can do now is train a neural network on data sampled from this model. Like I said, the model has just a few parameters, so let's do a task where we want to distinguish images that have short-range correlations from images with long-range correlations. We did this with a standard, vanilla, two-layer fully connected neural network, and if you do that, you get pretty good accuracy. Then you can actually look at the weights that are learned in the first layer, or the features; for a fully connected network it's the same thing. And what you find are weights that look like this. Your neurons roughly split into two groups. Half of them look like the weight vectors plotted at the bottom; we call them oscillatory, but really they don't look like anything in particular. They are important, though: you can't just prune them away, the network somehow relies on them. What we were kind of excited to find is that, on the other hand, the other half of the neurons have this very particular weight structure, which is localized in space. Each dark blue pixel corresponds to one weight, and they're almost all zero, except for one blob which is very much non-zero. In other words, here we have localization in the weights. Actually, if you look a bit closer, you can see that the network has essentially learned a convolution, a single filter translated across the image, which is kind of interesting from a machine learning point of view, but I'm not going to go into too much detail on this today. If instead I train my network on the Gaussian clone of this task, so on the Gaussian process with the same correlation length, I find weights that look like this.
Okay, so again half of the neurons have these high-frequency oscillations, but the other half now look a little bit like sinusoidal waves, and indeed they are. If you think a little bit about it, this makes sense, because what you're looking at here is a covariance matrix which is circulant, so its eigenbasis is the Fourier basis; what the network is doing is PCA. But what we wanted to understand is: where does the localization come from? Why do you learn this localized structure here? We thought a bit about Gaussian equivalent models, but unfortunately this is a regime where the magic of these Gaussian equivalent models breaks down. What I'm showing you here in green is a measure of the localization of these receptive fields over time, and you can see that in the beginning of training nothing happens, and then there's a transition in how localized they are. Statistically speaking, the flip side of this localization of the weights is that the statistics of the pre-activations change. The pre-activation is just the scalar value that you get from dotting your input with your weight; that's what goes into the hidden neuron, and then you apply the nonlinearity. If you plot the kurtosis of this distribution, you can see that in the beginning the excess kurtosis is close to zero, and then it rapidly decreases: the distribution of the pre-activations suddenly becomes very non-Gaussian, at exactly the same time as this localization happens. So the bad news is that we can't use these Gaussian equivalence tricks here; but this is also hinting at something interesting: there's something going on in the non-Gaussian fluctuations which might be worth checking out. Does this make sense? Maybe I should pause here for a second: does it make sense what we're observing and what we're trying to do? Okay, I don't see any immediate questions.

So yeah, we want to understand where these localized weights come from: what's the signal in the data that tells my neural network, for this task you want to be convolutional? Again, to analyze this, we went back to the simplest model that we could think of, a single neuron, a perceptron, and we can look at the gradient flow of that network trained on this type of data. In particular, we just expand the loss here up to third order in the weights. The second order drops out because there's a point symmetry in the data, so it just goes away, and these are the first two surviving terms as I expand my gradient flow. In terms of the data statistics that come in, I get two-pixel correlations and four-pixel correlations. Now, if I want to understand the impact of the non-Gaussian statistics, I can reshuffle this and rewrite it in terms of covariance matrices and cumulants, so in terms of a Gaussian contribution and a non-Gaussian contribution, and I get this gradient flow. If I just integrate the Gaussian part, I get this oscillating weight. So the signal for the localization must come from the non-Gaussian part, from the cumulant.
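(As an aside, the two probes on this slide are simple to reproduce. A minimal sketch; the localization measure is the inverse participation ratio, which will come up again in the questions:)

```python
import numpy as np

def ipr(w):
    """Inverse participation ratio of a weight vector: of order 1/d for a
    delocalized/oscillatory weight, of order 1 for a localized one."""
    return np.sum(w**4) / np.sum(w**2) ** 2

def excess_kurtosis(X, w):
    """Excess kurtosis of the pre-activations a = X @ w of a hidden neuron:
    close to zero early in training, dropping sharply when w localizes."""
    a = X @ w
    a = a - a.mean()
    return np.mean(a**4) / np.mean(a**2) ** 2 - 3.0
```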
So let's look at that a bit more closely: what can you do with such a cumulant? It's a fourth-order tensor. We saw that the neural network on the Gaussian data was doing some kind of PCA, so let's do something similar for the cumulant: a tensor decomposition, a decomposition into a sum of outer products of factors. This is just a generalization of the spectral theorem to tensors. There are actually lots of nice things about the spectral theorem that you lose, but you can still do this decomposition; this is a symmetric tensor, so you have some guarantees that it at least exists. Then you can look at what these "eigenvectors" of the cumulant look like. If you have what I call the blurry images, so non-Gaussian inputs but with a small gain, where there's a little bit of an edge but not really, you get CP factors, or eigenvectors, that look like these grey lines. They're a bit of a mess: they're all over the place and don't have any particular structure. And likewise, the weight of the perceptron is also a bit all over the place; this is the blue line at the end of integrating this gradient flow. But as I increase the gain, as I make my data more non-Gaussian and the edges sharper, something interesting happens: my eigenvectors localize in space. There seems to be a sudden transition here, and that's important, because as the CP factors of my cumulant localize, so does the weight. So clearly the weight is picking up something in this cumulant; it's essentially following attractor dynamics towards one of those CP factors. Why is it doing that? Because it actually does a bit better when trained on the non-Gaussian data than on the Gaussian data. So again, it's relevant to the task to exploit these cumulants, and the perceptron does so, even though it's a very simple model.

Now, this is a purely numerical characterization: you compute these tensors, you compute this tensor decomposition, and it's a bit of a pain. But what I find interesting is that I'd really like to understand this transition a bit better, because there's a very rich mathematics around random covariance matrices, BBP transitions, all these kinds of things, and here we're seeing something a little bit reminiscent of that at the level of this random cumulant. In particular, here we're plotting (and this was a really heroic effort by Alessandro, who ran these numerics on the clusters of Columbia University, thank you very much) how localized the CP factors that you get from these tensors are, as a function of the number of inputs. Think about this in terms of the sample complexity: I can only detect that there is something in the cumulants, beyond what you would get from a finite data set of Gaussians, with enough samples, and there seems to be a pretty sharp crossover. Here we're plotting it, for historical reasons (and for the referees), as a function of the linear sample complexity, the number of inputs over the dimension. This is clearly not the right scaling for cumulants: you can see that not only does the transition get sharper, it also moves to the right as the dimension grows. So this is bad.
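For reference, here is a sketch of the numerical pipeline behind these plots: the empirical fourth-order cumulant and its CP decomposition. I'm assuming tensorly for the CP step, and the einsum builds the full d^4 tensor, so this only runs for small d, which is exactly why these numerics get heroic:

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

def fourth_cumulant(X):
    """Empirical fourth-order cumulant of data X with shape (n, d): for
    centered data, k_ijkl = E[x_i x_j x_k x_l] - C_ij C_kl - C_ik C_jl
    - C_il C_jk, with C the empirical covariance."""
    Xc = X - X.mean(axis=0)
    n = Xc.shape[0]
    m4 = np.einsum('ni,nj,nk,nl->ijkl', Xc, Xc, Xc, Xc) / n
    C = Xc.T @ Xc / n
    return (m4 - np.einsum('ij,kl->ijkl', C, C)
               - np.einsum('ik,jl->ijkl', C, C)
               - np.einsum('il,jk->ijkl', C, C))

def cumulant_cp_factors(X, rank):
    """CP-decompose the cumulant into a sum of rank-1 terms; the columns of
    the returned matrix play the role of the 'eigenvectors' above. Feed
    them to ipr() from the previous snippet to get the y-axis values."""
    weights, factors = parafac(tl.tensor(fourth_cumulant(X)), rank=rank)
    return factors[0]
```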
But yeah, something I'd like to discuss a bit more is to what extent we can actually capture what's going on here in more mathematical terms. And you wanted us to suggest some open problems: well, this is something I'd like to understand a bit better, and if we can discuss it over the next two weeks, that would be fantastic. Any questions on this? [Audience question: what's on the y-axis?] Sorry, I didn't explain that very clearly. What we're measuring here is how localized this weight vector, or CP factor, is. Remember: if you have few samples, or a small signal-to-noise ratio, the CP factors are all over the place; then they get very localized. So what we're measuring is the inverse participation ratio: the sum of the elements to the fourth power, divided by the square of the sum of the squares, IPR(w) = sum_i w_i^4 / (sum_i w_i^2)^2. It's something that people use, for example in quantum mechanics, to measure localization, these kinds of things. [Audience question: what's the impact of the dimension? Does the crossover move?] Right. So again, for historical reasons we plot this as a function of the linear sample complexity, and the observation is that the transition, or the crossover, moves to the right. Of course, if you wanted to look for a phase transition, you'd want something that stays at a fixed location; the scaling here is clearly wrong, this is not a linear-sample-complexity kind of effect. We had to plot it like this for, yeah, historical reasons. But clearly, the LDLR result would suggest that a quadratic scaling is probably the right one here; ideally you'd want these curves to collapse, of course.

Let me check the time; I still have a couple of minutes. So I guess the interesting point is that there seems to be an interesting way of thinking about your data here. Where does this localization come from? It comes from the symmetry that we put into the problem: the data is translation invariant, and translation invariance has a certain effect on the higher-order cumulants, it localizes their CP factors, and this is something neural networks seem to care about. The representations then depend on these higher-order cumulants. I think there are lots of things you can do here; you can look at different symmetries. I also find it interesting that this system looks localized at the level of the cumulant but delocalized at the level of the covariance. So for images, there are lots of questions here.

Let me spend the last couple of minutes thinking a bit out loud about to what extent we can extend this approach, thinking about data symmetries, cumulants, and representations, to other data modalities. And of course, if I say other data modalities, given what's been going on in deep learning in the last few years, what I have in mind is natural language. This has been the biggest breakthrough in machine learning in the last couple of years; I guess all of us have used ChatGPT to some extent in the recent past. (I also see that my laptop is about to die, sorry.) So the question is now: what does a transformer learn, from a statistical point of view?
Transformers are the state of the art in many domains now, but we want to know what they actually learn, again from a statistical point of view. An important point to consider here is that you usually train these transformers not on the final task, not on the translation that you want to do at the end, but in a self-supervised way. In the case of text, for example, you show the model a sentence split into tokens, you hide one of those tokens, and you ask the transformer to predict the missing token; this is the masked language modeling task. (GPT is actually called GPT because it stands for generative pre-trained transformer.) This self-supervision is great because it allows you to leverage the massive amounts of text that are out there on the internet. But what do you learn if you follow this kind of procedure with a transformer?

Let's focus on a single layer of self-attention. This is the simplest transformer you can think of: a transformer is a sequence of self-attention layers and MLPs that you stack on top of each other, so let's focus on just one of those layers of self-attention. That's akin to focusing on one convolutional layer in a convolutional neural network. What this network does is take the tokens, map them into vectors, and then add to each vector something called a positional encoding: just another vector that hard-codes where in the sequence this token was. So the p's are fixed, while the e's depend on the words and change for each sentence, and you get this set of vectors. What you want to predict is a new sequence of vectors, or maybe just another vector for the missing, masked word. The way the transformer does this is by first computing a linear transformation of these inputs, called the values; it's just a matrix multiplication. Then each predicted output token is a linear combination of these values, and the matrix that contains the coefficients of this linear combination is called the attention matrix. So you have this attention matrix, and my i-th output token is a sum over the values, with the coefficients coming from the i-th row of the attention matrix. It's a very simple procedure; it looks a lot like a non-parametric regression or a nearest-neighbour prediction algorithm. The way you compute this attention matrix is that you usually do two more linear projections of your data, the keys and the queries, and then you just dot them together. In other words, you could have just used the self-similarity, the overlap of each of these inputs with the others, but then you wouldn't have any trainable parameters. So this is what a single layer of self-attention does; it's a fairly simple transformation.
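Since the layer is so compact, here is a minimal NumPy sketch of it. The conventions (dimensions, the missing temperature factor, the absence of masking) are simplifications of mine, and the factored variant anticipates the rearrangement discussed in a moment:

```python
import numpy as np

def self_attention(E, P, W_v, W_k, W_q):
    """One layer of self-attention as described above. E: (L, d) word
    embeddings; P: (L, d) fixed positional encodings; W_v, W_k, W_q are
    the trainable value/key/query projections."""
    X = E + P                                   # add positional encodings
    V = X @ W_v                                 # values
    K, Q = X @ W_k, X @ W_q                     # keys and queries
    S = Q @ K.T                                 # overlaps q_i . k_j
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)           # row-wise softmax: attention
    return A @ V                                # y_i = sum_j A_ij v_j

def factored_self_attention(E, A_pos, W_v):
    """Factored variant: the attention coefficients A_pos depend only on
    the positions and are trained directly, while the values depend only
    on the word embeddings."""
    return A_pos @ (E @ W_v)
```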
So again, to understand what you learn from this, we trained these on synthetic data. In particular, we thought: you have a finite vocabulary, so you want spins that take a finite set of values, and a Potts model would be a natural model for this kind of data. Forget about the V for a second; you get the standard Potts Hamiltonian, with an interaction J_ij between the positions i and j in the sequence. But in the Potts model, the different spin values are all orthogonal to each other, and for language that's a poor model: different vectors representing different words have different degrees of similarity. Some words really mean completely different things; some are semantically related. So we introduced this matrix V to encode the overlap between the different words; you can think of it as some kind of similarity measure between the words.

Okay, let me go through this quickly. If you generate data from this kind of model and train a standard transformer on it, it's kind of a struggle; this is one layer, this is three layers, always on this prediction task. Because this is a Potts model, we can actually compute the conditional distribution of a single spin given the others, so we know the best possible prediction. And we can make a lot of progress by rearranging the attention mechanism a little bit: we rearrange the attention so that the attention coefficients depend only on the positions in the sequence, and the values depend only on the word meanings. So we're decoupling the positions and the inputs, and if you do that, you learn this task in no time. This is factored self-attention. Now, this is not some kind of magic; it actually makes a lot of sense if you think about it a little bit, because you realize that factored self-attention has the same functional form as the conditional distribution of one spin given all the others in the Potts model. And indeed, there's a huge community in statistical physics that's been looking at learning inverse Ising models, inverse Potts models, and so on, and it turns out that training a single layer of factored self-attention with this masked language modeling loss is something well known in that literature: it's just pseudo-likelihood maximization for the inverse Potts problem. These are just a couple of references; it's a very established method. Of course, nobody thought that by just stacking some of these layers you'd get the kind of performance that you get with transformers, but for understanding what one layer of self-attention does, I still think it's quite a useful mapping. And it's something that actually works on real data, too: people have seen this work well in protein structure prediction and in computer vision. So this is nice. We did some replica analysis on this; I understand that Federica already talked a little bit about it in her talk, and I should say that Riccardo has a poster on it. They did the heavy lifting on the replica analysis, so I just wanted to mention that at this point.

And let me finish with one last slide. If you remember, the very first thing I showed you was this learning of distributions of increasing complexity in the image models: going through the stages, learning a Gaussian model of your data first and then going up the hierarchy of generative models. So what you can ask now is: does this kind of ordering also appear in the transformer?
Here I just want to give a little preview of what we're doing at the moment, with Riccardo, Federica, and Alessandro Laio, who is at SISSA. The natural question to ask now is, of course: what do you learn with a deep transformer? What happens if I stack layers of self-attention? If you want to analyze that, you want a richer data model, and a natural extension here is some kind of p-spin model: you have pairwise interactions like we had before, but now you also have triplet interactions, quadruplets, and so on. What we did is we went up to, I think, fourth order: we sampled from this fourth-order data set, and then we trained a transformer with three layers and one with two layers on that kind of data. The kind of plots that you get look like this. The blue, the green, and the violet lines are the errors that you get from fitting just the best two-, three-, and four-body approximations to this kind of data, and the orange line is what you get from training this deep transformer. Again, I was very happy to see this: the transformer seems to go about its business in a very systematic way. You first learn the best two-body approximation, then you wait a little bit (actually quite a bit) and you learn the best three-body approximation, and then the best four-body approximation, and so on. So again, there seems to be an ordering in the learning of the neural network, trained with stochastic gradient descent on a very different type of task, with a different type of architecture, and again there seems to be some kind of systematic ordering. It's not trivial that you would exactly hit these plateaus; you could go about this any which way. So there is some convergence in terms of the themes here, and again, I'm happy to talk about this more offline.

All right, I should conclude. I started with this question: what do neural networks actually learn from their data? And the argument that I've tried to make is that the higher-order statistics in these data sets are actually very important if we want to understand this type of feature learning. There seem to be some common motifs emerging here across very different data modalities, very different architectures, and very different tasks. One of them, I think, is this idea that there is some kind of order in which you scoop up this information; there seems to be an ordering both in language and in vision. And neural networks seem to be good at this compared to other methods: there's something about non-Gaussianity, about looking for non-Gaussian projections of your data, and I want to investigate this a bit more in the future. And then finally, I think there's an interesting interplay, once you've understood this, where you can think about what kind of data symmetries there are, how those affect the cumulants, and how those then show up in the representations. One natural question to ask now is: what do these many-body interactions look like if you're looking at language? What are the features? What happens if I decompose them?

So, this is just a shout-out to the people who did all of this work. I mentioned Lorenzo and Federica; Alessandro Laio was part of the transformer project from day one.
Alessandro Ingrosso just appeared at the back there; we recently did some nice work on real data that I didn't get a chance to talk about. Maria worked on the perceptron and the distributions of increasing complexity, and Esther did a lot of the analysis of the neural networks for the hypothesis testing. Thank you very much. (This is the view from SISSA, by the way, I should say.)

[Chair: Okay, are there any questions?]

[Question: Thanks, Sebastian, it's a really cool line of research. Correct me if I'm wrong, but many of your analyses are in the online learning setting, where time is identified with samples. I wonder, with this higher-order cumulant business, whether if you actually re-see the data many times, using mini-batches and all that, it would help you extract this information faster, because you could exploit correlations in a smart way. How much would that affect the overall picture?]

That's a really good question, and the quick answer is: I don't know. Of course, the experiments that I showed you on CIFAR-10, and also the experiments at the end with the Potts model, were actually done on finite data sets, and I agree: I think that going through the batches again will have some non-trivial effect. But it's very hard to analyze SGD in that setting; the correlations that you get from revisiting samples are really hard to handle. So I don't really know how to go about this. [Follow-up: even empirically, I was curious whether you think it helps.] Empirically, it's kind of funny, and I say this because I'm very biased, because obviously I love online learning. But I think, empirically speaking, the online learning regime is now actually the relevant regime: these large language models essentially never see the same data point twice, because the data is just so enormous that they do one epoch and that's it; maybe they do two. It's actually really funny, because this whole story of "has the network converged or not", which was a big deal when you were looking at these computer vision models, doesn't even come up for language models. They don't even care; they don't train until convergence anymore. The loss curve goes like that, and at some point they stop, because it's been two weeks and they need to move on. So I think that, practically speaking, the online learning regime is pretty relevant; but from a theoretical point of view, it's a very interesting question.

[Question: These plateaus that you get, this learning of data of increasing complexity, resemble a lot the saddle-to-saddle, or staircase, call it as you want, phenomena that you observe even with Gaussian data, which are related to staying near saddle points of the optimization landscape for a long time. I wanted to hear your thoughts on whether you think it's the structure of the data here that is generating some kind of structure in the landscape, or whether there's something else going on.]
I think it's an interesting mix of the two. Of course, as in online learning, with the famous plateaus of Saad and Solla, the plateau is related to going from a linear model of your data to a nonlinear one, and if you want to scoop up any higher-order cumulants of your data, you need to do that; you need to go through these kinds of stages. But then, for example in the Potts model, I think a big part of why we see the plateaus so clearly is the structure of the interaction matrices: they're not low-rank, but they're sums of outer products of vectors, and we give all of them the same coefficient, so you can discover all of them at the same time. I think you could mitigate these plateaus a little bit by just giving them different spectra, for lack of a better word. But they are data-driven, I agree with that. Okay, thanks. Actually, if I may add one thing: when depth comes into play, there are also some interesting effects in terms of to what extent the network actually exploits its full depth. I didn't talk about that, but we can chat about it.

[Question: Nice talk. In the case of adversarial attacks, can we use your approach to understand why the prediction becomes so wrong?]

That's a really good question. I don't know, but I was recently discussing with Julia Kempe, who is at NYU and who studies adversarial attacks a lot with her group, and apparently there's a phenomenon in adversarial attacks where early stopping helps in making the models more robust. And so what they're testing now, using these clones, is the idea that maybe part of what makes you brittle, part of what makes you vulnerable to adversarial attacks, is related to when you scoop up these higher-order fluctuations. So I don't know, but I hope they will find out. That's a very good question.

[Question: Can you go back to your last slide? Yes, that one. I really want to emphasize the view from SISSA, it's very nice.] We're further from the sea, but yes. [Continued: I noticed that the orange curve hits the first two theoretical predictions, if you will, and then it dips below, and I was wondering if you wanted to comment on that.]

Basically, I didn't want to, but it's a very good catch from the last row. Yes, there's a bit of a gap. I'm going to go out on a limb: I think this is just because we used too little data. We actually used a fairly small data set here, and I think it's just related to that; I think if we use a bigger data set, we will match. [Audience: that would be consistent with the things that we ran into as well.]

[Comment: You know Claudia Merger, probably. We were also trying to learn cumulants, in a slightly different setup, and the higher the cumulant, the more data you need: roughly an order of magnitude more data for each next cumulant. And as far as I recall, batches were not super helpful.]

I'm actually glad that you mentioned this, because I usually have a slide on the unsupervised work. Claudia Merger, in the group of Moritz Helias, has some very nice recent work where they also look at how the distribution of the inputs changes as it propagates along the network, which I thought was quite interesting, and there's also Claudia's work on unsupervised learning: how can you actually learn the elements of these higher-order interaction tensors?
I strongly recommend her work. [Chair: we can take one more question.] [Question: Is there any connection to the older work on support vector machines learning higher orders?] It's funny, I was discussing that paper with Bruno yesterday. So the question was whether there's any connection to some of the older works looking at support vector machines using statistical physics. I think there are actually a lot of connections. You guys found this stepwise decrease of the kernel error, the different regimes, which we then sort of rediscovered in machine learning; it was a very nice paper. And I think the reason why I showed you the kernels doing so poorly on this hypothesis testing task is exactly because of that: to estimate each moment, they need another order of magnitude more data, and that's going to take forever, or it's going to take a lot of data, and that's exactly what's hiding in these tasks. So I think it's very much related.

[Chair: Okay, if there are no more questions, let's thank Sebastian again.] Thank you.