Okay, so good morning, everybody. First of all, I would like to thank the organizers for inviting me to this very... Sorry, maybe I should mute myself because I see... Okay. Okay, perfect. Sorry. As I was saying, I would like to thank the organizers for inviting me to this very interesting conference. Today I'm going to talk about the research activity I've been focused on in the last, let's say, three years, which, broadly speaking, regards the interplay between data and machine learning performance.

As we all know, the great success of deep learning can be mainly ascribed to two distinct factors: on one side, the design of efficient learning algorithms and architectures; on the other side, the increasingly large amount of available data. However, up to this point, a lot of effort has been devoted to trying to understand the role played by architectures and algorithms, but very little has been done concerning data. And this has actually started becoming quite a hot topic within the machine learning community: quite recently, Andrew Ng posted a tweet where he basically wondered whether it wouldn't be the case to start pushing both researchers and companies to work on data while keeping architectures and algorithms fixed, rather than keeping on doing the opposite. Why? Because we all know that data are at the core of machine learning problems: they are used to fit billions of parameters, to implement trial-and-error strategies for hyperparameter fine-tuning, and so on and so forth.

However, if you look at the early statistical physics works of the late 80s, like the pioneering work by Gardner and Derrida, you will realize that, although these works managed to capture several aspects of learning, they completely ignore the role played by data: data are simply assumed to be unstructured, sampled independently and identically from a Gaussian distribution. But we know that, in real machine learning applications, data are structured, and it is precisely this structure that a machine learning model tries to grasp in order to learn how to generalize well on previously unseen data.

Now, the good news is that quite recently there has been a huge activity within the statistical physics community to extend the earlier statistical physics works, and in particular replica theory, up to the point that we can now deal with data sampled either from a single Gaussian or from a Gaussian mixture, with non-trivial means and covariances. Including data structure into the game has allowed us to start exploring several machine learning frameworks, and on this slide I list some of those on which we are currently active. For instance, we are trying to understand the gap in generalization performance between random features and neural network models when all the information needed to accomplish a given task is encoded precisely in the higher-order statistics of the input data distribution; if you're interested, Esther and Lorenzo have two posters on the topic. Then we also recently started getting interested in self-supervised learning, which, after this conference, we know is the dominant paradigm with which large language models are currently trained.
For instance in transformers, and Riccardo has a poster about it if you are interested. We also started approaching the burning issue of fairness in machine learning problems. But in today's talk I will focus on some advances that we have made with statistical physics theory on transfer learning.

So why did we get so interested in transfer learning scenarios? Simply because we all know that deep learning is intrinsically data hungry, in the sense that it requires a lot of data to generalize well on previously unseen examples. However, if you think about it, there are contexts in which collecting huge amounts of labeled data is simply impracticable. In healthcare, for instance, one would have to set up a pool of medical experts to label every single frame of every patient's medical examination, and this of course has a huge cost in terms of both time and money. A possible solution, which can somehow mitigate the need for new labeled data, is precisely transfer learning. This is a deep learning technique based on the idea that the generalization performance of a neural network that has to be trained on a data-scarce target task can be consistently improved by exploiting the knowledge that a second network has previously acquired on a related but data-abundant source task.

The typical transfer learning pipeline then goes as follows (see the code sketch below). You first have a network A that is trained on a data-abundant source task. Then all the layers responsible for feature extraction are transferred to a second network, network B, which is trained on the target set while keeping the transferred feature map frozen and letting just the very last layer adapt to the target set. Eventually, deep learning practitioners often add a further stage of fine-tuning, where they unlock the transferred feature map and retrain the whole network on the target set. Okay, are there any questions up to this point? Okay, I hope it was clear.

So, although transfer learning is widely used in deep learning applications, up to the point that nowadays nobody trains deep learning models completely from scratch, this technique is still poorly understood from a theoretical point of view. Several questions remain open: for instance, how do the source and the target task need to be related in order to improve transfer learning performance? Is the fine-tuning stage always beneficial, or are there conditions under which it can lead to overfitting? We believe that the key ingredient to answer these questions is a good model for data which can capture the non-trivial correlations between the source and the target set. So what we did was to propose the correlated hidden manifold model as a model of structured and correlated datasets, in which the correlations appear explicitly and are directly tunable. This allowed us to explore several transfer learning settings and to delineate the boundaries of transfer learning effectiveness. Now, as the name itself suggests, the building block of the correlated hidden manifold model is the hidden manifold model itself.
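To make the pipeline above concrete, here is a minimal PyTorch-style sketch of the three stages: transfer of the feature map, frozen retraining of the readout, and optional fine-tuning. The architecture, the dimensions and the toy data loaders are placeholders of mine, not the actual setup behind the results in this talk.

```python
import torch
import torch.nn as nn

# Tiny synthetic stand-ins for the source (abundant) and target (scarce) sets.
source_loader = [(torch.randn(32, 784), torch.randint(0, 10, (32,))) for _ in range(50)]
target_loader = [(torch.randn(32, 784), torch.randint(0, 2, (32,))) for _ in range(5)]

def train(model, loader, epochs=10, lr=1e-3):
    # Optimize only the parameters that are not frozen.
    opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

# 1) Train network A from scratch on the data-abundant source task.
net_a = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))
train(net_a, source_loader)

# 2) Transfer the feature-extraction layers to network B and freeze them.
features = net_a[:2]                      # everything except the readout
for p in features.parameters():
    p.requires_grad = False
net_b = nn.Sequential(features, nn.Linear(512, 2))   # fresh readout for the target

# 3) Train only the readout of B on the data-scarce target set.
train(net_b, target_loader)

# 4) Optional fine-tuning: unlock the transferred map and retrain everything.
for p in net_b.parameters():
    p.requires_grad = True
train(net_b, target_loader, lr=1e-4)
```

Freezing is handled entirely through requires_grad, so the same train() helper serves all three stages: the optimizer only ever sees the trainable parameters.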
You may have learned something about this model from Francesco. It is a model for structured data proposed by Sebastian Goldt and collaborators in 2019, based on the evidence that real-world datasets do not span the entire input space uniformly, but are rather confined to a lower-dimensional manifold. According to this model, each input X is constructed as a linear combination of some generative features F with some non-Gaussian coefficients C, so the coefficients C can be interpreted as the lower-dimensional representation of the input X on the manifold. This model goes precisely along the lines of generative models that, starting from a latent variable, produce high-dimensional inputs. What about the labels? The labels are instead provided by a teacher vector that acts directly on the latent space. Yes — it is applied component-wise; here I just wrote it in matrix notation. Here Sigma is some non-linearity, whatever you want: ReLU, tanh, any non-linearity you like. And D is instead the intrinsic dimension, which is nothing but the dimension of the lower-dimensional manifold.

A salient trait of this model is that it directly provides access to the generative features, to the teacher vector and to the intrinsic dimension of the synthetic data model, and we exploit precisely this feature of the hidden manifold model in the correlated hidden manifold model. In particular, we construct the source task as a standard hidden manifold model, while the target task is constructed from the source task by directly manipulating either the generative features, or the teacher vector, or the intrinsic dimension. We considered three different types of manipulation: feature perturbation and substitution, feature addition or deletion, and teacher perturbation (a small data-generation sketch follows below). Keep in mind that all these manipulations are meant to mimic situations which can concretely occur in real-world data experiments. For instance, the teacher perturbation corresponds to the case where the two tasks share a common set of inputs, but these are labeled according to different labeling rules; this is controlled in our model by the parameter Q, which defines the source-target teacher overlap. Okay, are there any questions up to this point? Okay, perfect.

Given this model for data, we then consider the following transfer learning setting. First of all, we take a two-layer neural network that we train on the source task completely from scratch. Then we take the first-layer weights of this network and transfer them to a second two-layer network, which we train on the target set while keeping the first-layer weights frozen and training just the second layer. We call this model transfer features. Given this model, the goal is the usual one: achieve the lowest possible generalization error by empirical risk minimization. Here we used the logistic loss plus an L2 regularization.
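As promised, here is a minimal numpy sketch of how I read the data construction. The square-root scalings, the use of sign(·) both for the non-Gaussian latent coefficients and for the teacher non-linearity, and the way the overlap q enters the perturbed teacher are assumptions of mine, not necessarily the exact conventions of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, D = 2000, 500, 100     # samples, input dimension, intrinsic dimension

# Source task: a standard hidden manifold model.
F = rng.standard_normal((D, N))            # generative features
C = np.sign(rng.standard_normal((n, D)))   # non-Gaussian latent coefficients
X_src = C @ F / np.sqrt(D)                 # inputs confined near a D-dim manifold
theta_src = rng.standard_normal(D)         # teacher acting on the latent space
y_src = np.sign(C @ theta_src / np.sqrt(D))

# Target task, obtained by manipulating the source ingredients.
# (i) Feature substitution: replace a fraction of the generative features.
frac = 0.3
idx = rng.choice(D, int(frac * D), replace=False)
F_tgt = F.copy()
F_tgt[idx] = rng.standard_normal((len(idx), N))

# (ii) Teacher perturbation with source-target overlap q.
q = 0.8
theta_tgt = q * theta_src + np.sqrt(1 - q**2) * rng.standard_normal(D)

C_tgt = np.sign(rng.standard_normal((n, D)))
X_tgt = C_tgt @ F_tgt / np.sqrt(D)
y_tgt = np.sign(C_tgt @ theta_tgt / np.sqrt(D))
```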
Now, we have a good news and a bad news. The good news is that this model, the correlated hidden manifold model, belongs to a family of Gaussian models that we know how to treat analytically by means of statistical physics tools; by analytical treatment I mean that we managed to compute the generalization error, the training loss and all the other observables of interest in a given machine learning problem. The bad news is that the model is still a Gaussian model, so in principle there is no guarantee that it can correctly describe situations that you concretely encounter while working with real-world datasets. However, there is a good news inside the bad news: there exist some Gaussian universalities in certain machine learning contexts, where the machine learning model is completely insensitive to the fine details of the input data distribution; what matters are only the first few moments of the input data distribution itself.

This is, for instance, the case I'm showing on this slide, which is precisely the case of a single-layer neural network trained with random labels on a given classification task. You can imagine taking your favorite real dataset, for instance MNIST, Fashion-MNIST, CIFAR-10, whatever dataset you want. Then you train your single-layer network with random labels on a given classification task, and you measure the training loss as a function of the size of your training set. If you do that, you obtain all these colored points that I'm plotting here in this matrix of plots. The interesting thing to notice is that the solid black lines correspond instead to the outcome of the statistical physics analysis: the theoretical predictions obtained by approximating the true dataset distribution with a Gaussian measure whose mean and covariance matrix coincide precisely with the empirical mean and the empirical covariance matrix of the true dataset distribution. If you look at it, you observe a striking match, which suggests that this learning model only cares about the covariance of the input data distribution; it doesn't look at anything else. Is that clear up to this point, or are there any questions? Okay, perfect.

It turns out that if you go into the zero-regularization limit, there are even stronger Gaussian universalities, up to the point that the learning curves corresponding to all the different real dataset distributions collapse precisely onto one and the same learning curve, which is the one of unstructured data (a minimal sketch of the random-labels experiment follows below).
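Here is a small, hedged sketch of that random-labels experiment, using sklearn's digits dataset as a stand-in for MNIST; the regularization strength and the training-set sizes are purely illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)

# Real inputs, standardized feature-wise.
X = load_digits().data
X = (X - X.mean(0)) / (X.std(0) + 1e-8)

# Gaussian surrogate matching the empirical mean and covariance
# (a small ridge keeps the covariance numerically positive definite).
mu = X.mean(0)
cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
X_gauss = rng.multivariate_normal(mu, cov, size=len(X))

# Train on random labels and compare training losses at several sizes;
# under Gaussian universality the two columns should track each other.
for n in (200, 400, 800, 1600):
    y = rng.integers(0, 2, n)
    losses = []
    for data in (X, X_gauss):
        clf = LogisticRegression(C=1.0, max_iter=2000).fit(data[:n], y)
        losses.append(log_loss(y, clf.predict_proba(data[:n])))
    print(f"n={n}: real={losses[0]:.3f}  gaussian={losses[1]:.3f}")
```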
Now, this was just an example, but I showed it because even in transfer learning you can observe some Gaussian universalities, although a bit weaker than the ones on this slide. To see that, we designed the following experiment. We construct the source task by selecting a subset of the EMNIST letters, grouping all the examples in this subset into two distinct groups, and assigning to each group a label according to group membership. This is the source task. Then, to construct the target task, we substitute one letter per group. In this way, some relevant traits that were previously present in the source task no longer appear in the target set.

Given this set of real datasets, we then measured the generalization error of the transfer features model as a function of the size of the target training set. If you do that, you get precisely what I'm showing on this plot, where we compare the generalization performance of this model with a two-layer network trained completely from scratch, which is the green curve in the plot, with a random features model, which is the orange curve, and with the transfer features model plus fine-tuning. If you look at this plot you can observe many interesting things. For instance, fine-tuning seems to be not so beneficial in the data-scarce regime. There are some peaks that appear here, which are directly related to the double descent phenomenon, and if you look, in the transfer features model the peak is delayed; this is simply a consequence of the correlations already encoded in the transferred feature map.

What we were interested in at this point was to check whether the correlated hidden manifold model can somehow capture the picture that emerged from the real-dataset experiments. To do that, we take the source task as a standard hidden manifold model, while the target task is constructed from the source task by substituting 30% of the features. If you now repeat precisely the same experiment that we did with real data, you get a picture like this: there is a striking qualitative agreement between the numerical experiments with real datasets and what is observed with the synthetic ones. And I would just like to stress that the solid lines on these slides correspond precisely to the replica predictions, so to the statistical physics computation.

Motivated by this striking qualitative agreement, and by the fact that in the correlated hidden manifold model the correlations appear explicitly and are directly tunable, you can imagine that, as theoretical physicists, we started drawing all the possible phase diagrams by tuning all the possible parameters, to explore many different scenarios. You can find all these phase diagrams in the paper if you are interested. (Answering a question about the double descent peak:) Yes, because it is shifted to the right: you see, here it occurs at this point, while for the transfer features model it occurs slightly later... Relative to the random features one? Ah no, sorry... no, don't worry: it is delayed between the two models.
(Still answering:) You mean the source-target teacher perturbation? Okay: the phase diagrams are obtained with the correlated hidden manifold model, simply because we ran experiments like the one here (I'm just showing you a single experiment with real data, of course, for time constraints), and what we saw by running all these numerical experiments is that there was a striking match obtained just by tuning the parameters of the correlated hidden manifold model. So at that point we simply trusted that it was reproducing things correctly, and we generated the phase diagrams with the correlated hidden manifold model, also because, once you have the replica computations, computing all this stuff is much faster than running simulations: you can get phase diagrams quite easily without waiting.

Okay, so for the purpose of this talk I would like to show you the phase diagrams related to understanding the effect of source-target relatedness in transfer learning problems. What you are observing on this slide are three different phase diagrams, which compare the transfer features model with the two-layer network, with the random features, and with the transfer features plus fine-tuning. The blue region corresponds to the case where the transfer features perform better than the random features. On the y-axis we have the size of the target training set, while on the x-axis we have precisely a measure of relatedness between the two tasks, which, as I was telling you, is for instance the overlap Q in the teacher perturbation. You can notice many interesting things: for instance, fine-tuning is overfitting in the data-scarce regime, and the transfer features model always performs better than all the other models when you do not have enough data and the two tasks are consistently correlated. But for the purpose of this talk I would like you to concentrate on this red spot in the left corner of the phase diagram. This is precisely an example of negative transfer. What does it mean? It simply means that in this regime the source and the target are so poorly correlated that transferring the features from the source task is not beneficial, because those features have nothing to do with what you would have learned on the target set with enough data; it is even better to keep the first-layer weights fixed at random values than to transfer something.

Given this statistical physics analysis, what we would have liked to do was to understand what the implications are in a deep learning setting. To do that, we considered the following deep learning setup. Concerning the datasets, we use three different clones of the CIFAR-10 dataset, constructed in the following way. The first clone, called isoGM, simply samples images according to a Gaussian mixture whose means coincide precisely with the empirical means of the true CIFAR-10 distribution. The second clone, GM, is instead constructed by sampling images again from a Gaussian mixture, but this time with both the first and the second moments matching: not only the means, but also the covariances match those of the true data distribution (a small sketch of this clone construction follows below). These clones are meant to form a hierarchical family of approximations of the true CIFAR-10 distribution of increasing resolution, let's say, and these two benchmark datasets have been proposed quite recently by Maria Refinetti, Alessandro Ingrosso and Sebastian Goldt.
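Here is a minimal sketch of how such moment-matched clones can be built class by class; using a unit isotropic covariance for the isoGM-like clone and a small ridge on the empirical covariance are simplifications of mine, so the actual benchmarks may use different conventions.

```python
import numpy as np

def gaussian_clones(X, y, rng):
    """Sample two Gaussian-mixture clones of a labeled dataset: one matching
    only the per-class means (isoGM-like), one matching per-class means and
    covariances (GM-like)."""
    X_iso, X_gm = np.empty_like(X), np.empty_like(X)
    for c in np.unique(y):
        m = (y == c)
        mu = X[m].mean(0)
        cov = np.cov(X[m], rowvar=False) + 1e-6 * np.eye(X.shape[1])
        X_iso[m] = mu + rng.standard_normal(X[m].shape)           # first moment only
        X_gm[m] = rng.multivariate_normal(mu, cov, size=m.sum())  # first two moments
    return X_iso, X_gm

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 50))   # stand-in for flattened CIFAR-10 images
y = rng.integers(0, 10, 300)
X_iso, X_gm = gaussian_clones(X, y, rng)
```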
Then, to go beyond the second moment, we instead propose to construct a third type of clone by playing with the bottleneck size of a deep autoencoder: the larger the bottleneck, the more accurate, let's say, the reconstruction of the original dataset. For architectures and algorithms there is nothing special to say, because we use standard protocols from computer vision deep learning. Yes, please. (Answering questions from the audience:) It is simply a deep autoencoder, so the function with which you generate the data is the network itself: you start from an original image, you train your deep autoencoder, and at the end you take the reconstructed image at the output of the network. The network first has an encoder stage, which encodes the information that is relevant in the input, and then a decoder stage. The bottleneck is fixed during training; you just play with its size to construct different types of clones. The point is just to corrupt your image a little, in such a way that you get an increasingly accurate approximation of the true underlying distribution. Okay, perfect. You're welcome.

So, given these three types of clones, we wanted to understand the effect of source-target correlations in a deep learning context. To do that, we first consider a transfer learning scenario where the source task is the deep autoencoder clone, while the target task is precisely CIFAR-10. What I'm plotting here is what we have called in the paper the defrosting profile (a sketch of how such a profile can be computed follows below). The first point on this curve corresponds precisely to the case where you train the neural network completely from scratch on the target set. The last point corresponds instead to the case where you keep the entire network frozen and just retrain the readout layer on the target set, so frozen to the features that have been learned on the deep autoencoder clone. In the middle you find intermediate points corresponding to intermediate situations: for instance, at this point you keep frozen all the layers up to here, and you retrain the remaining layers on the target set.

If you look at this plot, nothing special happens, because this is precisely what deep learning practitioners usually do, namely transfer all the feature-extractor layers; and with this transfer learning strategy you indeed get the optimal performance. However, this is convenient only if the datasets are reasonably well correlated with each other, because if they instead share just the first moment of the input data distribution, you start observing precisely the negative transfer effects that I was mentioning before: in this case, when the source task is isoGM, it is basically never convenient to transfer.
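Here is a hedged sketch of how one could trace such a defrosting profile for a simple nn.Sequential network, reusing the train() helper from the pipeline sketch earlier. Note that k = 0 here retrains all layers starting from the pretrained weights, which only approximates genuine from-scratch training.

```python
import copy
import torch

def test_error(net, loader):
    # Fraction of misclassified points on the target test set.
    wrong, total = 0, 0
    with torch.no_grad():
        for x, y in loader:
            wrong += (net(x).argmax(1) != y).sum().item()
            total += len(y)
    return wrong / total

def defrosting_profile(pretrained, target_train, target_test):
    """Freeze the first k layers of a source-pretrained nn.Sequential, retrain
    the remaining ones on the target task, and record the target test error
    for every k: k = 0 retrains everything, k = len(net) - 1 keeps all layers
    but the readout frozen."""
    errors = []
    for k in range(len(pretrained)):
        net = copy.deepcopy(pretrained)
        for i, layer in enumerate(net):
            for p in layer.parameters():
                p.requires_grad = i >= k   # freeze the first k layers
        train(net, target_train)           # train() from the earlier sketch
        errors.append(test_error(net, target_test))
    return errors
```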
However, what is interesting to notice is that there are some intermediate situations in which the optimal transfer depth is some non-trivial number: it is neither convenient to train completely from scratch, nor to keep the entire network frozen, and therefore transferring all the feature-extractor layers does not seem to be the optimal transfer learning strategy. By the way, we observe this sort of scenario (I apologize for the bad images of the retinal disease) also in a standard transfer learning setting where the source task is ImageNet and the target set is some medical dataset: there you observe a closely related phenomenon.

Given the robustness of this phenomenon, for us it would be very important, first of all, to identify the optimal transfer depth independently of the source and the target, so to have an algorithm that can identify the optimal number of layers to transfer; and then, possibly, a strategy to identify the optimal source task among all the candidate ones. This is extremely important, because you would then start from a better initial condition, which matters for later fine-tuning strategies.

Concerning the algorithm, we have a very preliminary proposal that is extremely simple. You start from the condition where the entire network is frozen and you just let the very last layer be retrained on the target set; then, in subsequent cycles, you keep unfreezing layers until you arrive at the condition where the entire network is trained from scratch on the target set. If you do this, you get the defrosting profile I was mentioning for a given source-target pair. The good thing is that, in all the experiments we did, this defrosting profile appears to be quite regular, so you could have an algorithm that samples just a few points on the profile and then infers the position of the optimum directly from them.

Concerning how to select the source task among all the available candidates, we used the information imbalance, a measure that has been developed quite recently in Alessandro Laio's group. To compute this measure and select the source task, we did the following. First, we took two neural networks: the first one pre-trained on the source task (one that you could easily download from the PyTorch model zoo, for instance), and the second one trained completely from scratch on the target set. Given these two networks, we fed the test set of the target task through both of them, and for each layer we extracted the internal representations of each test point, which are these H's here. At this point we have all the ingredients to compute the information imbalance, as in the sketch below. We first go into the space of the source internal representations and rank the internal representations of all test points according to their Euclidean distances; then, for each point in the test set, we check which one is its first nearest neighbor. Once we have identified it, we go into the space of the internal representations of the target network and check whether the first nearest neighbor in the source space is also the first nearest neighbor in the target space. If this is the case, it means that the local neighborhoods of the source and the target spaces are similar.
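Here is a small numpy/scipy sketch of that nearest-neighbor procedure for a single layer; the 2/N normalization follows my reading of the information-imbalance literature and should be checked against the original definition.

```python
import numpy as np
from scipy.spatial.distance import cdist

def information_imbalance(H_src, H_tgt):
    """Estimate Delta(source -> target) between two internal-representation
    spaces: for each test point, take its first nearest neighbor in the source
    space and look up that neighbor's distance rank in the target space.
    Small values mean the local neighborhoods of the two spaces are similar."""
    n = len(H_src)
    d_src, d_tgt = cdist(H_src, H_src), cdist(H_tgt, H_tgt)
    np.fill_diagonal(d_src, np.inf)            # exclude self-matches
    nn_src = d_src.argmin(1)                   # 1st neighbor in the source space
    ranks_tgt = d_tgt.argsort(1).argsort(1)    # distance ranks in the target space
    return 2.0 / n * ranks_tgt[np.arange(n), nn_src].mean()

# Unrelated random representations give Delta close to 1;
# identical neighborhoods would give Delta close to 0.
rng = np.random.default_rng(0)
print(information_imbalance(rng.standard_normal((500, 64)),
                            rng.standard_normal((500, 64))))
```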
The information imbalance is a quantity that precisely measures the similarity among, let's say, neighborhoods of internal neural-network representation spaces. You can then repeat this procedure for all the layers, and if you do that, you get a plot like this, where I'm plotting the information imbalance as a function of the convolutional layer at which it has been computed. In particular, I'm considering three different curves: the pink one corresponds to the case where the source task is isoGM, the green one to the case where it is GM, and the remaining one to the clone that comes from the deep autoencoder; the target task is always CIFAR-10. What you can notice is that the information imbalance is actually able to capture the fact that the deep autoencoder clone is the most promising task from which to transfer if you want to train on CIFAR-10, because the smaller the information imbalance, the more similar the two neighborhood spaces. A side result is that this further corroborates the idea that the first layers learn the first-order statistics, while the deeper layers of the network learn some higher-order information in the moments of the input data distribution.

With this I conclude, and I would like to give you some take-home messages. First of all, supervised learning is not always feasible, because not all datasets are easy to label. Second, it is true that transfer learning is a possible solution to data scarcity, but it needs to be used wisely, because its success strongly depends on how much the source and the target set are correlated, so in principle we should not use it blindly. Third, once you have a model of synthetic data that is actually able to reproduce what you observe on real datasets, this is a good thing, because you can explore several transfer learning, I mean machine learning, scenarios. And the final point, which is more technical, is that the replica calculation seems to suggest the existence of some universality class, because just by tuning very few parameters you are actually able to reproduce the experiments that you see on real datasets. So with this I conclude; I would like to thank all these amazing people with whom I had the pleasure to work, and in particular Ludovic, Esther, Lorenzo and of course Sebastian, who are in the room. Thank you very much.

(Answering questions:) Well, about the statistics: I mean the statistics of the input data distribution, not of the internal representations. isoGM is constructed from CIFAR-10 so as to share just the first moment of the input data distribution. For a transformer, you mean? Ah, okay, in that sense. Yes, but these are not the correlations I was talking about: I was really referring just to the input data distribution, so I'm not looking at the distribution of the internal representations. Once you have the signal at the input, it propagates across the network and then, yes, what you are saying happens; but it is the input data distribution that I'm referring to when I say first, second and higher-order moments. Well, my guess is that if you take a transformer, maybe you could observe something similar, but I don't know how much this is instead related to the fact that these networks compress the signal, because in a transformer, for instance, this is not the case. So, I don't know, we
should definitely check this out, but for the moment this work was restricted to computer vision tasks; that's why we took state-of-the-art computer vision architectures. We could also use vision transformers, and this is something we are planning to do as a next step: this was just a first step, to see whether we managed to see some signal.

(On the thermodynamic limit:) Yes, okay, you mean to get these curves? In that case it is the standard limit of the replica calculation: you send the input dimension to infinity and the number of samples to infinity, and you keep their ratio finite; it's the usual one. Yes, also in this case, because we do these replica calculations with, for instance, the random features model, which has then been extended to the case where the features are not simply random but have been trained on a given task. What you do is that, if all these layers are fixed, you can run the replica calculation as if you had just a single-layer network, the readout, whose input data are nothing but the activations of the very last layer; then you apply the standard limit, sending the input dimension, which in this case is the hidden dimension, and the size of the training set to infinity while keeping the ratio finite. But actually the input dimension, the hidden dimension and the number of samples all go to infinity at the same rate, that's the point, because if you instead kept the input dimension finite and sent the hidden dimension to infinity, then yes, you would be doing precisely what you are saying. This is a non-trivial limit because they all scale in precisely the same way.

(On the final question:) Yes, if you change the size of the bottleneck, of course you would observe something. You observe that the green curve is pushed up, closer to the blue one? Yes, I think you would observe something more similar, in the sense that the two datasets become closer. We didn't compute the distance between the images themselves: the information imbalance is a sort of distance measure between internal network representations, and that is what we were interested in, because we believe that, by pushing further with this sort of measure, we could implement an automatic algorithm that detects the optimal transfer depth by looking at internal representation similarities. But yes, if you did that, as you were saying, I suppose it would give something similar but reversed: if the datasets are closer, then you get a stronger agreement, and there is a hierarchy in these datasets that is reproduced. But, let's say, this experiment was more a proof of concept to see whether it could work than a real-world application.
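For reference, the proportional high-dimensional limit described in this answer can be written compactly as follows; the symbols are shorthand of mine (d: input dimension, p: hidden/feature dimension, n: number of training samples).

```latex
% All three scales diverge together, with the sample-to-dimension ratio
% alpha and the width-to-dimension ratio gamma held fixed.
\[
  n,\; d,\; p \;\to\; \infty,
  \qquad
  \alpha = \frac{n}{d} = \mathcal{O}(1),
  \qquad
  \gamma = \frac{p}{d} = \mathcal{O}(1).
\]
```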