plus what we can learn from Gaussian models of data. Take it away. OK, so thank you very much. First of all, I would like to thank the organizers for inviting me to this very interesting workshop. So broadly speaking, today I'm going to talk about the interplay between data structure and machine learning performance. And in particular, I will try to shed some light on what we can learn from Gaussian models of data. So to start with, as we all know, the great success of deep learning can be mainly ascribed to two distinct factors: on one side, the design of efficient learning algorithms and architectures, and on the other side, the increasingly large amount of data at our disposal. However, up to this point, a lot of effort has been devoted to trying to understand the role played by both architectures and algorithms, but very little has been done concerning data. This has actually started becoming quite a hot topic within the machine learning community, as this workshop also testifies. And indeed, quite recently, Andrew Ng posted a tweet where he basically wondered whether it wouldn't be better to start pushing both researchers and companies to work on data while keeping architectures and algorithms fixed, rather than doing the opposite. So why is this the case? Well, because data play a central role in machine learning applications. However, if you look at the early statistical physics works in the field, like the seminal paper by Gardner and Derrida, you will easily realize that although these works managed to capture several aspects of learning, they completely ignore the role played by data. So basically, data are considered to be unstructured and sampled identically and independently from a Gaussian distribution.
However, we know that in machine learning applications, real datasets are instead structured, and it is precisely this structure that machine learning models try to grasp in order to learn how to generalize well on previously unseen data. Now, quite recently, there has been a huge activity within the statistical physics community, where we have basically tried to extend the previous statistical physics works so as to start including data structure in the game. At this point, we can deal with data that are drawn either from a single Gaussian or from mixtures of Gaussians, with non-trivial means and covariances. Okay, so even though we know that real-world datasets are not Gaussian distributed, given these recent advancements in statistical physics theory, and in particular in the replica method, one may still ask: are there machine learning settings where Gaussian data models can exactly describe the behavior of machine learning models trained in real-world dataset scenarios? And moreover, even if they do not, can we still learn something from Gaussian data models? So to start with the first question, it turns out that there are indeed some learning settings where the machine learning model is completely blind to the fine details of the input data distribution. All it sees are the first moments of the distribution itself. This is, for instance, the case of the example I'm going to show you on this slide and then discuss in the next couple of slides, which basically corresponds to the case of a dataset where the input data points are drawn from a mixture of K different Gaussians, and the labels are completely random, in the sense that they are not correlated at all with the input data distribution. Now, this dataset defines a task which is well known in machine learning applications, and that goes under the name of the storage capacity problem.
So in this task, we basically take a machine learning model, which in the specific example of this slide I will consider to be a single-layer neural network. And then, once you have this model, the goal is to train it so as to fit some randomly labeled examples with a given training loss plus L2 regularization. Now, the setting that I'm analyzing in this slide can be analytically solved, and by analytically solved I mean that we can exactly compute the training loss with the replica method in the high-dimensional regime, and then express this training loss as a function of simple scalar quantities plus some parameters that characterize the input data distribution. Now, if you do that, what we realize is that there are actually some Gaussian universalities going around, and in particular a Gaussian universality at finite regularization, which is well summarized by the following theorem. If the loss is symmetric, which is not such a strict requirement because almost all losses employed in machine learning satisfy it, and if the Gaussian mixture is homogeneous, in the sense that all the different modes of the mixture share precisely the same covariance matrix Omega, then the training loss of a homogeneous mixture of Gaussians is equivalent to the training loss that you would get by approximating the homogeneous Gaussian mixture with a single Gaussian with matching covariance Omega. Is that clear? Are there any questions up to this point? Okay, cool. So the proof of this theorem can be found in the paper, but this result can be further confirmed by numerical experiments.
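As an aside for the reader, the finite-regularization universality can be checked numerically in a few lines. The sketch below is illustrative, not the code behind the plots in the talk: it trains a ridge-regularized linear model (a single-layer network with square loss plus L2 penalty) on random labels, once with inputs from a homogeneous two-mode Gaussian mixture and once with a single Gaussian of matching covariance, and compares the two training losses. The dimensions, covariance spectrum, and means are all assumptions made for the example.

```python
import numpy as np

def ridge_train_loss(X, y, lam):
    """Fit w = argmin (1/n)||Xw - y||^2 + lam ||w||^2 and return the training loss."""
    n, d = X.shape
    w = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
    return np.mean((X @ w - y) ** 2)

rng = np.random.default_rng(0)
d, n, lam = 200, 300, 0.1

# Shared (homogeneous) covariance Omega for all mixture modes: decaying diagonal spectrum.
omega = np.diag(np.linspace(0.5, 1.5, d))
means = [rng.normal(size=d) / np.sqrt(d), -rng.normal(size=d) / np.sqrt(d)]

# Mixture data with *random* labels, uncorrelated with the inputs.
modes = rng.integers(0, 2, size=n)
X_mix = np.array([means[k] for k in modes]) + rng.normal(size=(n, d)) @ np.sqrt(omega)
y = rng.choice([-1.0, 1.0], size=n)

# Single Gaussian with the matching covariance Omega.
X_gauss = rng.normal(size=(n, d)) @ np.sqrt(omega)

loss_mix = ridge_train_loss(X_mix, y, lam)
loss_gauss = ridge_train_loss(X_gauss, y, lam)
```

At these (finite) sizes the two losses already agree closely, in line with the high-dimensional theorem.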
So to see that, what you can do is take a single-layer neural network and train it on a mixture of Gaussians with random labels, and then measure the training loss as a function of the size of the training set. If you do that, you will get precisely the orange dots that I'm showing you here on these three different plots. The solid black line instead corresponds precisely to the theoretical prediction that we get from replica theory by approximating the homogeneous Gaussian mixture with a single Gaussian of matching covariance. So as you can see, there is a perfect agreement between what the theory predicts and the numerical simulations, which further corroborates the idea that this Gaussian universality exists, and that in the end what the model is actually observing is not a mixture of homogeneous Gaussians but a single Gaussian with matching covariance, okay? Now, once we had observed this behavior, what we wanted to check was whether this sort of universality can pop up even when we consider real-world datasets. To do that, we performed the following experiment. You take a real dataset of your choice, for instance MNIST, Fashion-MNIST, CIFAR-10, or Tiny ImageNet, and you repeat precisely the same experiment. So you take your single-layer neural network, you train it on the input data points of one of these real datasets, and then you measure the training loss as a function of the size of the training set, okay? If you do that, once again you will get the colored points that I'm showing you here on this matrix of plots.
The solid black line, once again, corresponds to the theoretical prediction of the replica theory, which is obtained by approximating the true underlying distribution of each of these real datasets with a single Gaussian whose covariance matrix precisely corresponds to the empirical covariance of the specific dataset, okay? Is that clear, or are there questions? Okay. So as you can see, there is already quite a good agreement between the simulations and the theoretical prediction. And indeed, when I first saw this agreement, I had to rerun the simulations several times just to convince myself that there were no mistakes. But although the agreement is quite good, you can still observe some deviations from the Gaussian universality. As you can see, there are some cases in which the agreement is still not perfect. So here the central question is: why are we observing these deviations? Well, if you remember, one of the main assumptions of the theorem was that the different modes of the Gaussian mixture need to be homogeneous. However, if you take a real dataset, like for instance CIFAR-10 on this slide, and you plot the covariance matrices of the different modes, for instance here the three different modes of CIFAR-10, you will easily realize that the structure of the covariance matrices is far from homogeneous. The modes are pretty different from each other. So this is what is actually causing the deviations from universality on real-world datasets. But it turns out that if, instead of considering the real datasets as they are, you preprocess your datasets with some transformation, for instance either random features or a wavelet scattering transform, then the covariance matrices become more homogeneous than they were in the very beginning.
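For the reader, this homogenizing effect can be illustrated on synthetic data: take two classes with equal trace but very different covariance structure, push them through a fixed random-feature map (here a ReLU of a Gaussian projection, an illustrative choice rather than the exact map from the paper), and compare the class covariances before and after. The dimensions and sample sizes are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
d, p, n = 50, 60, 2000  # input dim, number of random features, samples per class

# Two classes with the SAME trace but very different covariance *structure*.
cov1 = np.diag(np.r_[np.full(d // 2, 2.0), np.full(d // 2, 0.5)])
cov2 = np.diag(np.r_[np.full(d // 2, 0.5), np.full(d // 2, 2.0)])
X1 = rng.normal(size=(n, d)) @ np.sqrt(cov1)
X2 = rng.normal(size=(n, d)) @ np.sqrt(cov2)

def rel_cov_distance(A, B):
    """Relative Frobenius distance between the empirical covariances of A and B."""
    Ca, Cb = np.cov(A.T), np.cov(B.T)
    return np.linalg.norm(Ca - Cb) / np.linalg.norm((Ca + Cb) / 2)

# Random-feature preprocessing: x -> relu(W x), with W a fixed Gaussian projection.
W = rng.normal(size=(d, p)) / np.sqrt(d)
relu = lambda z: np.maximum(z, 0.0)
F1, F2 = relu(X1 @ W), relu(X2 @ W)

before = rel_cov_distance(X1, X2)
after = rel_cov_distance(F1, F2)
```

Because the random projection mixes all input directions, the per-class feature covariances end up much closer to each other than the raw class covariances were.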
And if you now rerun the simulations after having applied this preprocessing, then you actually get an essentially perfect match. So what this is basically suggesting is that if you train shallow neural networks with random labels and apply this sort of preprocessing, what the learning model is observing, and basically trying to learn from the data, is just the covariance matrix, so the second moment. Is that clear, or are there any questions? Yes, please. What about standardization? Do you still obtain the same results? Yeah, exactly. What we did here in the simulations was to first preprocess the dataset by standardizing it, which I didn't mention because of course it is the standard practice that you typically follow. But if you also apply this further transformation, then you get a better agreement, because this transformation makes your covariance matrices more homogeneous, and so you are basically, let's say, satisfying the assumption of the theorem. Okay, thank you. Okay. Yes. One clarification: are these all still training losses? Yes, these are the training losses, for instance the square, logistic, or hinge loss, as a function of the size of the training set. And here the different lines correspond to different regularization strengths, because, if you remember, at the very beginning I was inserting an L2 regularization into the model, and lambda here is the strength of the L2 regularization. And did you look at some other metric, like generalization or another evaluation? No, we just looked at the training loss, because with random labels generalization cannot be defined. But I will actually show generalization later, with planted models. Sure. Okay. However, this is not yet the end of the story concerning this Gaussian universality.
Because it turns out that there is a second type of Gaussian universality, which is indeed much stronger than the previous one, that you get in the zero-regularization limit, and which is well summarized by this second theorem. This theorem basically states that, in the random-label setting, under the same assumptions as the previous theorem, so symmetric loss and homogeneous covariance matrices, if the optimization problem is convex and the data covariance is full rank, then the training loss of a homogeneous mixture of Gaussians does not even depend on the covariance matrix itself. So if you run the simulations once again, repeating exactly the same experiment and measuring the training loss as a function of the size of the training set in the limit of zero regularization for different datasets, what you will observe is that all the learning curves associated with these different real datasets collapse onto the same curve. So not only is the interpolation threshold the same, but so is the full learning curve. And the learning curve onto which all these data models collapse corresponds to the one of completely unstructured data. So there is this much stronger type of universality, which basically tells you that in the zero-regularization limit not even the covariance matrix matters. Okay. So going back to our central question: can Gaussian data models exactly describe the behavior of real-world datasets? Well, yes and no. I showed you that there are some settings in which this actually happens. But this is not the full story, because it is not the case for all machine learning settings. So, going to the second question: even in machine learning settings where you cannot identify some strict Gaussian universality, can we still learn something from Gaussian data models?
Because maybe they are still good qualitative approximations of the behavior of the machine learning model. To give you some intuition about that, I will talk about three different examples of Gaussian models that actually turn out to be useful for understanding some interesting machine learning settings. The first one is the so-called correlated hidden manifold model, which turns out to be very powerful for starting to understand something about transfer learning. The second model is the mean-field generalized Potts model, which turned out to be extremely useful for starting to approach self-supervised learning tasks, and in particular transformers. The third model is instead the teacher mixture model, which basically allowed us to start approaching the thorny issue of fairness in machine learning problems. Now this third model is extremely simple: it is simply a mixture of Gaussians with the labels provided by a teacher vector. But unfortunately, for time constraints and for the purpose of this talk, I'm not going to talk about it. I will specifically focus on the first two models, the correlated hidden manifold model and the mean-field generalized Potts model. But before doing that, let me give you some motivation for why these two settings, transfer learning and self-supervised learning, are so important, and why we devoted so much effort to designing Gaussian models of data to better investigate this sort of framework, okay? So basically, as we all know, deep learning is intrinsically data-hungry, in the sense that it requires lots of data to generalize well on previously unseen examples. However, if you think about it, there are some settings where collecting huge amounts of labeled data is simply impracticable.
For instance, in healthcare, one would have to assemble a pool of medical experts to label each single frame of each single patient's medical examination, and this, of course, has a cost in terms of both time and money. A possible solution, which can considerably mitigate the need for new labeled data, is transfer learning. This is a deep learning technique based on the idea that the generalization performance of a neural network that has to be trained on a data-scarce target task can be considerably improved by exploiting the knowledge that a second network has previously acquired on a related but data-abundant source task. The typical transfer learning pipeline works in the following way. You first train a network A on the source task. Then all the layers which are responsible for feature extraction are transferred to a second network, network B, which is then trained on the data-scarce target task, keeping the transferred feature map frozen and letting just the very last layer re-adapt to the target set. Now, deep learning practitioners typically go further with a stage of fine-tuning, where they unlock the transferred feature map and retrain the entire network on the target set. Despite being widely used in deep learning applications, transfer learning still remains poorly understood from a theoretical point of view, and indeed several questions still remain open. For instance, how related do the source and the target task need to be? Now, in this work, what we did was propose the correlated hidden manifold model as a model for structured and correlated datasets, where the correlations between the source and the target set appear explicitly and are directly tunable, okay? This basically allowed us to explore several transfer learning settings and therefore to delineate the boundaries of transfer learning effectiveness.
Now, as the name itself suggests, the building block of the correlated hidden manifold model is the hidden manifold model itself. This is a model that was proposed by Sebastian Goldt and collaborators in 2019, and it is based on the evidence that real-world datasets do not span the entire input space uniformly, but are rather confined to a lower-dimensional manifold. According to this model, each input X is constructed as a linear combination of some Gaussian coefficients C times some generative features F. And here L is precisely the dimension of the lower-dimensional manifold, so the intrinsic dimension of your dataset, if you want. The C can therefore be interpreted as the lower-dimensional representation of the input data points X in the feature space, okay? The labels are instead provided by a teacher vector theta that acts directly on the Gaussian coefficients C, so on the latent space. So you could think of this model as behaving along the lines of modern generative models, which, starting from latent Gaussian variables, can generate as many high-dimensional inputs as you want, okay? Are there any questions up to this point? Okay, perfect. So there is a salient trait of the hidden manifold model, which is that it provides direct access to the generative features, the teacher vector, and the intrinsic dimension of the dataset. In the correlated hidden manifold model, what we did was exploit this trait. In particular, we constructed the source task as a standard hidden manifold model, while the target task is constructed from the source task by applying three different types of manipulation on the generative features, on the teacher vector, and on the intrinsic dimension of the dataset. The first one is feature perturbation and substitution, the second one feature addition or deletion, and the third one teacher perturbation.
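For readers who want to play with the base model, here is a minimal sketch of the hidden manifold generative process in its simplest, purely linear form (variants of the model also apply an elementwise nonlinearity on top of the product C F); all sizes and scalings here are illustrative assumptions:

```python
import numpy as np

def hidden_manifold_dataset(n, d, latent_dim, rng):
    """Inputs confined to a latent_dim-dimensional subspace of R^d.

    C : (n, latent_dim) Gaussian latent coefficients (one row per input).
    F : (latent_dim, d) generative features spanning the hidden manifold.
    The teacher theta labels the data directly in the latent space.
    """
    F = rng.normal(size=(latent_dim, d))
    C = rng.normal(size=(n, latent_dim))
    X = C @ F / np.sqrt(latent_dim)              # linear combination of features
    theta = rng.normal(size=latent_dim)          # teacher vector acting on C
    y = np.sign(C @ theta / np.sqrt(latent_dim))
    return X, y

rng = np.random.default_rng(2)
X, y = hidden_manifold_dataset(n=500, d=100, latent_dim=20, rng=rng)
# X has ambient dimension 100 but intrinsic dimension 20: its rank equals latent_dim.
```

The correlated version of the model then builds the target task by manipulating F, theta, or the latent dimension of such a source task.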
So keep in mind that all these transformations are meant to describe situations which can concretely occur when you train machine learning models on real-world datasets. For instance, teacher perturbation would correspond to the case where you have two datasets sharing a common set of inputs, but labeled according to different labeling rules, okay? Given this data model, we then considered the following transfer learning setting, where we took a two-layer neural network that we trained numerically on the source task. Then we took the first-layer weights and transferred them to a second two-layer neural network that has to be trained on the target set, and it is trained on the target set while keeping the first-layer weights frozen and letting just the second layer re-adapt to the target set. We call this model the transfer feature model, okay? And the goal that this model has to achieve is to reach the lowest possible generalization error on the target set by performing empirical risk minimization over an L2-regularized logistic loss. Now, once again, this model plus the correlated hidden manifold model can be solved analytically, thanks to the recent advancements in statistical physics that include the data structure I was mentioning at the very beginning. And if you actually solve this model, that is, compute the generalization error through replica theory, you will realize that some Gaussian universality appears in transfer learning too, even if it is less strict than the one I was showing you in the very first example. To see that, you can consider the following experiment. You take the EMNIST letters dataset, which is simply a dataset of handwritten letters, and you consider just a subset of these letters.
Then the source task is constructed by dividing this subset into two distinct groups and assigning a different label to each group, just because we wanted to deal with binary classification tasks, okay? And then the target task is constructed from the source task by simply substituting one letter per group, okay? So given this dataset, we trained the transfer feature model on it. And what we got, if we measure the test error as a function of the size of the target training set, is precisely this light blue curve, okay? The orange curve instead corresponds to the random feature model, the green curve to a two-layer neural network trained completely from scratch, and the dark blue curve to transfer features plus fine-tuning. Now, there would be many things to say about these plots and about comparing all these different learning models, but at this stage of the work we were actually interested in checking whether the correlated hidden manifold model can concretely reproduce the scenario that we observed in the experiments on real datasets. To check this, we constructed the source task as a standard hidden manifold model, and the target task is constructed from the source task by perturbing 30% of the features, okay? If, with this dataset, you perform exactly the same experiment, so you measure once again the test error as a function of the size of the training set, you will get this behavior here. So there is a striking qualitative agreement between what we can observe with real datasets and what we observe with Gaussian data. Now, even if in this case there is no numerical agreement, this qualitative agreement is still there, and it actually allowed us to start investigating and deriving different phase diagrams for understanding transfer learning effectiveness.
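To fix ideas, the transfer-feature pipeline just compared can be sketched end to end on toy data. This is an illustrative numpy reimplementation, not the code used for the plots: it uses a square loss with plain gradient descent (the actual experiments used an L2-regularized logistic loss), and correlated teacher vectors stand in for correlated source and target tasks.

```python
import numpy as np

rng = np.random.default_rng(3)
d, hidden, n_src, n_tgt = 30, 40, 400, 60

def make_task(teacher, n):
    X = rng.normal(size=(n, d))
    return X, np.sign(X @ teacher)

# Source and target teachers with tunable overlap (here 0.9).
t_src = rng.normal(size=d)
t_tgt = 0.9 * t_src + np.sqrt(1 - 0.9 ** 2) * rng.normal(size=d)
Xs, ys = make_task(t_src, n_src)   # data-abundant source task
Xt, yt = make_task(t_tgt, n_tgt)   # data-scarce target task

relu = lambda z: np.maximum(z, 0.0)

def train(X, y, W1, a, steps=500, lr=0.05, train_first_layer=True):
    """Full-batch gradient descent on the square loss of a two-layer ReLU net."""
    n = len(y)
    for _ in range(steps):
        H = relu(X @ W1)                  # hidden-layer activations
        err = H @ a - y                   # residuals
        a = a - lr * H.T @ err / n        # second-layer (readout) update
        if train_first_layer:
            G = (err[:, None] * a[None, :]) * (H > 0)  # backprop through ReLU
            W1 = W1 - lr * X.T @ G / n
    return W1, a, np.mean((relu(X @ W1) @ a - y) ** 2)

# 1) Train network A on the source task.
W1 = rng.normal(size=(d, hidden)) / np.sqrt(d)
a0 = rng.normal(size=hidden) / np.sqrt(hidden)
W1_src, a_src, _ = train(Xs, ys, W1, a0)

# 2) Transfer the frozen first layer; retrain only the readout on the target task.
_, a_tf, loss_transfer = train(Xt, yt, W1_src, a0, train_first_layer=False)

# Baseline: a network trained from scratch on the scarce target task alone.
_, _, loss_scratch = train(Xt, yt, W1, a0)

# Generalization on fresh target data with the transferred features.
Xtest, ytest = make_task(t_tgt, 1000)
err_transfer = np.mean(np.sign(relu(Xtest @ W1_src) @ a_tf) != ytest)
```

The interesting comparisons, as in the talk, are on test error as a function of the target training-set size and of the source-target correlation, which here is just the overlap coefficient 0.9.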
And therefore we actually learned something from this theoretical analysis with the Gaussian model that turns out to be useful for deep learning applications. For instance, the first thing that we learned is that transfer feature models exhibit a delayed interpolation threshold. So they are able to fit much more data, and this is simply due to the correlations that are encoded within the transferred feature map, okay? The second thing is that transfer learning is highly asymmetric, in the sense that transferring from a harder to a simpler task is, in terms of generalization performance, way better than doing the opposite. A third thing is that transfer learning can actually be effective, but it needs to be used wisely, because its effectiveness strongly depends on how much the source and the target task are correlated. And this strong dependency on the correlations allowed us to notice that, contrary to what deep learning practitioners typically do, which is to transfer all the layers of the feature extractor at once, it can be more convenient, depending on the correlations between the two datasets, to transfer just smaller parts of the network. That is why in this very last paper we designed an algorithm that allows one to identify the optimal transfer depth, and we also proposed a strategy to select the optimal source task among all the available candidates. And all of this came from a Gaussian data model. Okay, so are there any questions up to this point? Yes, please. You could measure similarity in many different ways, I presume, so I'm wondering what you use here, and whether that changes your qualitative picture. You mean the... The similarity between the source and target. Ah, okay. Here, actually, if I go back to the data model, you measure the similarity between these parameters. For instance, here you see you have the teacher vector of the source task.
Then you have the teacher vector of the target task, and Q basically quantifies the overlap between the two teachers. This allows you to measure the correlations between the two datasets. Then, given these two datasets, you can check how this correlation propagates through the networks, and for that we used some well-defined similarity measures among neural networks, like for instance the information imbalance, which is a measure that has been developed quite recently in Alessandro Laio's group. Okay. So understanding how transfer learning works is extremely important, because deep learning models nowadays are never trained completely from scratch, and transformers are no exception. Transformers are a special type of artificial neural network that is currently achieving state-of-the-art results in various domains, for instance language modeling or image classification. But here the central question is: why is this type of network so special? Well, if you think about it, when we either read some text or look at a given image, we do not just blindly go through all the words or through all the elements of the image; we are also able to assign them a meaning depending on the surrounding context. So for instance, if I give you this sentence, "the animal did not cross the street because it was too tired", directly from the context you will realize that the pronoun "it" is referring to the word "animal". If I give you instead this other sentence, "the animal did not cross the street because it was too wide", again from the context you will easily realize that the pronoun "it" is referring to the word "street". Now, transformers are very good at playing this game, that is, learning context and meaning from sequential types of data. And once again, one of the key ingredients of their success is transfer learning.
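As a reader's aside, the context-dependent re-weighting just described is exactly what a self-attention head computes: each token scores every other token and forms a context-weighted average of their values. Here is a minimal single-head sketch with toy random embeddings; the shapes and weights are illustrative assumptions, not those of any specific transformer.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: each token re-weights all tokens by context."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]))  # (L, L) attention weights
    return A @ V, A

rng = np.random.default_rng(4)
L, d, dk = 6, 16, 8            # sentence length, embedding dim, head dim
X = rng.normal(size=(L, d))    # toy token embeddings for a 6-word sentence
Wq, Wk, Wv = (rng.normal(size=(d, dk)) / np.sqrt(d) for _ in range(3))
out, A = self_attention(X, Wq, Wk, Wv)
```

In the "it was too tired / too wide" example, a trained head would put most of the attention mass of the row for "it" on "animal" in the first sentence and on "street" in the second.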
Now, in the context of large language models, the source task is a special type of self-supervised learning task, which is called masked language modeling. This task consists in training a neural network to predict missing words in large amounts of raw text, and in this way the transformer is pushed to learn the interactions among the different words in a sentence. Now, this task is of course self-supervised, because the labels are directly sampled from the input. And it is data-abundant by definition, because we can always collect huge amounts of raw text, since we do not require any type of annotation. The only requirement asked of the user is simply to mask some words here and there at random. Okay. Now, once the transformer has been pre-trained on this task, it can be fine-tuned on some downstream task, for instance text generation, as in the famous example of ChatGPT. Now, despite transformers being widely used in deep learning applications, they are still considered a sort of black-box model. And indeed one may ask, for instance, what do transformers actually learn when they are trained on masked language modeling tasks, or how many samples are required to achieve good generalization performance when we train neural network models in self-supervised learning mode? Okay. Once again, to answer this type of question, in this paper we proposed the generalized Potts model as a good data model for, let's say, mimicking the interactions between the different words in a sentence. So how does this model work? Well, we model sentences of length L as sequences of one-hot encoded vectors. Here the dimension C is nothing but the size of the vocabulary, so, if you want, the total number of different words that you have in your dataset.
Once you have this, each sequence is then sampled from the Gibbs measure of the generalized Potts model, where the energy function differs from the standard Potts model because we do not just allow interactions among the positions of the different words, through the matrix J, but we also allow interactions with respect to the semantic meaning of the different words. Okay. Are there any questions up to this point? Yes, please. If you do this one-hot encoding, it seems that the dimension C would be very large for a realistic case like English. So would you have enough data to fit these models? Well, what people typically do in transformers is that after this transformation they apply some embedding layers, which reduce the dimensionality, for instance from, I don't know, 50,000 down to 500. And you also learn this matrix that performs the embedding, so the dimensionality is really reduced. So, roughly, I should be thinking that even though one is doing this one-hot encoding, not all words are actually distinct; there are many words that are pretty similar, so one can project down to a lower dimension. Yes, exactly, because you first have a stage of tokenization, as it is called in jargon, where you basically construct tokens from words, which are vectors made of some embedding plus a positional encoding. These are all technicalities of transformers, but yes, you can actually reduce the dimensionality in the end. Okay, so, having this data model, we then construct the masked language modeling task in the framework of the generalized Potts model by simply sampling M sequences from the Gibbs distribution. And then, to construct the dataset, we take these sequences of one-hot encoded vectors and we mask one one-hot encoded vector at random per sequence.
This defines the inputs of the dataset associated with the masked language modeling task, while the label will be the masked one-hot encoded vector itself. Given this dataset, the goal is again the usual one, namely to train a transformer to find good estimates J-hat and U-hat of the true interaction matrices of the model, in order to achieve the lowest possible generalization loss. Okay, so having defined the data model and the task, what about the first question, what transformers learn with masked language modeling? I'm not going to talk about this, but if you are interested, Sebastian Goldt is going to cover it during his talk. What I will focus on here is instead how many samples are required to achieve good generalization performance in self-supervised learning. Well, it turns out that statistical physics, as we all know, is extremely powerful in answering this type of question, especially in teacher-student settings where the labels are provided by a teacher vector T, and the goal of the student is to find a good estimate W-hat of the teacher T by minimizing the test loss. Okay, so as I was saying at the very beginning of the talk, once again the extension of replica theory to include data structure turns out to be extremely useful, because we actually know how to compute this test loss even in contexts where datasets are structured. And this turned out to be useful also for studying the masked language modeling task, because if we now make an assumption on our data model, so we basically make it Gaussian by relaxing the discrete nature of the one-hot encoding and substituting it with real variables, the masked sequences will now be sampled from a Gaussian mixture. And how the words are connected to each other is controlled by the covariance matrix of the Gaussian distribution.
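Under this Gaussian relaxation, the statement that the student must infer rows of the covariance matrix has a very concrete meaning: for jointly Gaussian "sentences", the optimal reconstruction of a masked coordinate is a linear function of the observed ones, with weights built from the covariance. Here is a small illustrative sketch; the particular covariance and sizes are assumptions for the example, not the setting of the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 8, 50_000   # "sentence" dimension, number of sampled sequences

# A correlated Gaussian sentence model: covariance couples nearby positions.
cov = np.array([[0.6 ** abs(i - j) for j in range(d)] for i in range(d)])
Lchol = np.linalg.cholesky(cov)
X = rng.normal(size=(n, d)) @ Lchol.T

m = 3                                   # index of the masked position
obs = [i for i in range(d) if i != m]   # observed positions

# For Gaussian data, the optimal reconstruction of the masked coordinate is
# linear, with weights built from rows of the covariance matrix:
#   w* = cov[obs, obs]^-1 cov[obs, m]
w_star = np.linalg.solve(cov[np.ix_(obs, obs)], cov[obs, m])
pred = X[:, obs] @ w_star
mse = np.mean((pred - X[:, m]) ** 2)

# Irreducible error predicted by the model: cov[m, m] - cov[m, obs] w*.
bayes_mse = cov[m, m] - cov[m, obs] @ w_star
```

The empirical masked-prediction error matches the model's irreducible error, which is exactly the kind of quantity the replica computation tracks as a function of the number of samples.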
Now, if you make this assumption, you can actually apply the whole machinery of replica theory also to masked language modeling, with the only difference that now the labels are not provided by any teacher vector but are instead directly sampled from the input distribution itself. So the goal of the student in this case is not to infer a generic teacher vector but precisely the rows of the covariance matrix. So, thanks to this approximation, we can compute the generalization loss in masked language modeling. And if you do that, what you get is precisely the solid black line here, where I'm displaying the test loss as a function of the number of samples per sequence length. Now, this is what the Gaussian approximation of the generalized Potts model is giving us. But we can actually run some simulations on the actual data model, the generalized Potts model. And what we get once again is a striking qualitative similarity between what the Gaussian data model predicts and what is actually observed with the true data model. Are there any questions up to this point? Yes, please. Yeah, yeah, because actually the appearance of the peak is something I'm going to discuss in a few minutes; it basically appears at the interpolation threshold. And this interpolation threshold, we actually discovered, strongly depends on the type of loss that you are using. Now, in the Gaussian data model I was using a square loss, because the values are real. While in the true data model, since it's one-hot encoded, so we have discrete variables, it should be interpreted as a classification problem. So you are actually using a cross-entropy loss in this case, and the interpolation threshold changes. But despite that, in both cases you actually observe the presence of this peak. And this peak is directly related to the double descent phenomenon.
But this time, the appearance of this peak is not due to label noise; it comes directly from the noise that is encoded in the input data distribution, because it's a self-supervised learning task, okay? And moreover, as you can see, there is an absence of the initial descent. And this is due to the higher level of noise that you have in the input, which you can't control because it's intrinsic to the input itself. So, to conclude, I started with these two questions. The first one, we have already seen. The second one, can we still learn something from Gaussian data models? Well, the answer is yes. There are some machine learning settings where you can still learn something. For instance, we have seen an example in transfer learning, but also in self-supervised learning. And with this I conclude, and I would like to thank all the people involved in this project, in particular Bruno and Ludovic, who are here in the audience. And thank you very much for your attention. Questions? Please, can you go back to the slide on task similarity? This one? No. Task similarity, the second part of the talk. Where you are designing the correlation in the hidden manifold model. Ah, sorry, yes. Okay, yes. This one. Yes. On the source task, you have the same view. On the target, you also have the same view. If, for example, I decide not to use the same view. Is this meaningful? I don't think so, because these are basically some Gaussian random variables, so they do not provide any type of information. The structure in the input is really encoded in the generative features F, because it's really like, if you have an image of a cat, these features are basically telling you that the face of a cat is made of the typical shape of a cat, with some noise. And the shape of the cat is controlled by these features.
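The generative-feature picture in this answer is the hidden manifold model; here is a minimal data-generation sketch (the tanh nonlinearity, the 1/√D scaling, and the sign teacher below are assumed toy conventions, not necessarily the talk's exact setup):

```python
import numpy as np

rng = np.random.default_rng(3)

N, D = 100, 10                        # ambient input dimension, latent (intrinsic) dimension
F = rng.normal(size=(D, N))           # generative feature matrix
w_teacher = rng.normal(size=D)        # teacher acting on the latent variables

def hidden_manifold_data(M):
    """Inputs live near a D-dimensional manifold embedded in N dimensions;
    labels depend only on the latent coordinates z (one assumed convention)."""
    Z = rng.normal(size=(M, D))               # latent Gaussian variables
    X = np.tanh(Z @ F / np.sqrt(D))           # nonlinear embedding into R^N
    y = np.sign(Z @ w_teacher)                # labeling rule on the manifold
    return X, y

X, y = hidden_manifold_data(M=200)
```

In this sketch, the feature matrix F, the latent dimension D, and the teacher are the natural knobs for changing the structure of the dataset.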
So this is what you need to change if you want to change the structure of your dataset. Then you can also play with the intrinsic dimension of the dataset, that is, how much the data points are spread in the lower-dimensional manifold. And also you can act on the teacher, because you could, for instance, take MNIST and divide it into even and odd digits. The images are the same, but the task is different because you want to classify even versus odd digits. And you could also take MNIST and try to classify numbers that are greater or smaller than five. Then this is another task where again the images are the same, but the labeling rule is different. So these are the three things that are basically important for the data structure. Okay, last question. On the language task, can you share the intuition behind the semantic interaction? Yes, okay. Because when you have, let's say, some words in a sentence, they occupy a particular position in this sentence, okay? But the meaning that we assign to a word is not just about the position that it takes in the sentence; there is also the intrinsic meaning that it has. For instance, if in a sentence I have the word apple, I know that this is a fruit independently of the position that it takes. And so when sentences are constructed, what matters is the position that the words occupy in a given sentence, but also the intrinsic meaning of each word, that is, the image that is triggered in your brain when you think about that word, okay? Hi, it's Peter again, thank you for this wonderful talk. I would like to come back to this collapse of all the curves in the random label setting. I wanted to understand whether my intuition is correct. So in a sense, what you're doing is, okay, when you learn, you can learn two things.
If I'm not mistaken, you can learn the rule linking inputs and outputs, what we usually call the teacher in these setups, and something intrinsic to the structure of the inputs, right? So in this setting, there is no rule because you have random labels. And by homogenizing the inputs through this standardization or whitening, you are also kind of destroying the data structure. So is it really surprising that in the end everything collapses on the same curves, because both aspects are kind of... Yeah, actually, you are not completely destroying it, because you still have the Gaussian clouds that are found in a given position. What is the same is just the covariance matrix. And actually, as you can see, even if you have random labels, the covariance matrices need to be the same, because even with random labels you are still learning something about the input. And this is what was surprising, because at first glance one could say, okay, yes, but if you have random labels, you are not learning anything about the input data structure. Instead, this is not the case, because if the covariance matrices are not homogeneous, then you start observing deviations from Gaussian universality. But I mean, you still have some structure, even if I agree with you that it's a little bit less evident in this case. But one thing to say, which is more a further branch of the theory, is that if you instead consider square loss functions, then in that case you observe Gaussian universality, so the curves collapse even if the covariance matrices are no longer homogeneous. So for the square loss there is this further, stronger universality, even when they are not homogeneous. For the other losses, this is not the case. But of course, we still have to understand so many things about these models and how they work.
And this collapse, this universality, holds in some specific scalings for the number of hidden units and things like this, right? Yeah, I mean, for the moment we see that it holds when the scaling is linear, so you have a number of samples that scales linearly with the input dimension. I don't know if maybe Ludovic and Bruno made some advancements later to include the quadratic regime. But actually we have also started working to extend the statistical physics theory to the quadratic regime. And Manfred also did beautiful work about that. Yeah, in the 80s, I think, okay. Any other questions? Okay, thank you for your talk. My question is about the Gaussian data model: can we use it for image classification in the sense of a breast cancer problem? A problem in the medical field, for example. Because I can see the examples you gave were just based on the MNIST and Fashion-MNIST data sets. Yeah, that would be cool. And indeed, I have some medical data at my disposal. So one of the future directions that I typically present when I talk about this transfer learning work is precisely to try to apply what we have observed with Gaussian data models to medical imaging. But actually we have already done this in this paper, but with toy medical imaging data sets, the kind that you can download from Kaggle. Sorry, it's here. So in this paper, we have applied this algorithm to medical data sets, starting from a previous analysis on Gaussian data models. And we observed once again the same behavior, that is, that the optimal transfer depth is found at a non-trivial position. But I personally would also like to try with other medical data sets that are not downloaded from Kaggle and which are real ones, the ones actually collected during experimental, let's say, projects. Okay, the reason why I ask the question is that in that domain, it involves feature extraction, like feature selection and things like that.
The dimensionality of the data is very high in those terms; I don't know if such a Gaussian data model would be able to handle that problem, since I'm currently working on that... Yeah, no, of course I'm not saying that Gaussian data models work in every context, but still you can learn something from them. And for instance, in this case, this is what we learned. Then of course, if you are interested in particular things, it could be that they do not say anything, because maybe the model is too simple, but in this case it worked. That's okay, I'll show you what I'm trying to say. Sure, thank you. Yeah, in the interest of time, maybe let's move the remaining questions to coffee time and then we reconvene at 10:50. Let's thank Federica again for the really great talk. Thank you very much.