Christos Thrampoulidis from UBC in Vancouver. Right, and he's going to talk about imbalance trouble. Christos, the floor is yours.

Thank you, Marco. Good morning, everyone. Thank you for the invitation; it's a great pleasure to be here. I've attended the previous workshops online and always thought this is a great venue. Today I'm going to talk about very recent work. This is actually unpublished, and hopefully it will soon be on arXiv. It's joint work with my students: Ganesh, who is at UCSB in Santa Barbara, and Vala and Tina, who are at UBC. Vala was actually an undergrad and is now starting his graduate studies.

The motivation for this talk comes from the common knowledge nowadays that large models generalize better. This is depicted in this plot. It's somewhat outdated, a plot from 2019, but it already shows the trend that some of the networks that generalize best, that is, have the best top-1 accuracy, are such that the number of parameters they train is much larger than the size of the training dataset, on the order of 10 or 20 times larger. And of course this trend has exploded over the recent years. Really, the holy-grail question here is: why do these models generalize well? There is a lot of work trying to understand that, and perhaps a conceivable step in this direction, one of many, is to answer the following question: what are the structural properties, if any, of the solutions learned by these large models? By structural properties I mean whether there is some specific geometry in the learned weights of the neural net. I will make this question much more concrete, but at a high level this is the question.

To specify the question, let's consider supervised K-class classification, a very standard setting. We are given n training data, pairs of feature vectors and labels $(x_i, y_i)$. The $x_i$'s are, say, images, so they are p-dimensional vectors, and there are K classes. The goal is to learn a model $f_\theta$, a mapping from the p-dimensional feature space to the K-dimensional output space, parameterized by the weight vector $\theta$. Prediction is done according to the winner-takes-all rule: you evaluate the model at a feature vector, look at the entries of the output, pick the largest one, and declare that as your class.

Now, how do neural networks learn such mappings? They take the examples $x_i$ and learn an embedding map $h_\theta$, parameterized by the weights of all layers but the last, with a hidden dimension that we'll call d; these are the embeddings. Once you have the embeddings, you essentially do linear classification: you have K classifier vectors, one per class, each d-dimensional, which you stack into a d-by-K matrix $W$. The overall model then takes the form $f_\theta(x) = W^\top h_\theta(x)$, and people sometimes call these outputs of the neural network the logits.
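To make this last-layer decomposition concrete, here is a minimal sketch in PyTorch. The backbone architecture, dimensions, and names are illustrative assumptions, not the networks used in the talk; the point is only the split into a black-box embedding map and a d-by-K linear head.

```python
# Minimal sketch of the last-layer view: a black-box backbone produces
# d-dimensional embeddings h_theta(x), and a linear head maps them to K
# logits. All sizes and the backbone itself are illustrative placeholders.
import torch
import torch.nn as nn

p, d, K = 784, 512, 10           # input dim, embedding dim, number of classes

backbone = nn.Sequential(        # stand-in for "all layers but the last"
    nn.Linear(p, d), nn.ReLU(),
    nn.Linear(d, d), nn.ReLU(),
)
head = nn.Linear(d, K, bias=False)   # its weight plays the role of W^T (K x d)

x = torch.randn(32, p)           # a batch of feature vectors
h = backbone(x)                  # embeddings h_theta(x), shape (32, d)
logits = head(h)                 # W^T h_theta(x), shape (32, K)
y_pred = logits.argmax(dim=1)    # winner-takes-all prediction
```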
Now, the whole point of neural network training, and what makes it different from, say, support vector machine classification, is that you jointly train the classifiers and the learned embeddings. Typically, if you are doing K-class classification, the first choice for your loss would be the cross-entropy loss, and training is done with respect to both the parameters $\theta$ of the hidden embeddings and the classification matrix $W$.

In this talk, we will view the rest of the network as a black box and focus our attention on the last layer, that is, on the learned embeddings and the classifiers. The question we will be asking is: what is the geometry of the learned embeddings and of the classifiers when overparameterized neural networks are trained with stochastic gradient descent until zero cross-entropy loss? In particular, by geometry I mean what, if anything, we can say about the norms and the angles of these vectors. At this point, it seems rather optimistic that this question has an answer, because, as you can imagine, the answer could potentially depend on many things: on the specific architecture, on the nature of the dataset, even on the specific realization of the stochasticity of the algorithm.

Despite that, there is a result by Papyan, Han, and Donoho from 2020, which they call the neural collapse phenomenon, and which to me is quite surprising and a very nice formalization. The authors made the following empirical discovery: under the assumption that all classes have the same number of examples, the geometry of the classifiers and of the learned embeddings is characterized by two properties. The first is the neural collapse (NC) property, and the second is the equiangular tight frame property, which I will be calling the ETF property for short. Let me tell you what these two properties are.

Let's start with the first one. The NC property says that the embeddings of examples belonging to the same class c collapse to their class-mean embedding $\mu_c$; in symbols, $h_\theta(x_i) \to \mu_c$ for every example i in class c, where convergence is with respect to the training epochs. Let's see in a sketch what this means. If I look at the original space of images, say for an instance with four classes and six examples per class, then for each class there is a corresponding mean, and the examples are scattered somewhere around it. What neural collapse says is that if I look at the same picture after training, in the embedding space, the learned embeddings concentrate at their class means. So as training continues, this type of picture should emerge.
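As a concrete version of the collapse statement, here is a minimal NumPy sketch of a deviation metric of the kind tracked in the experiments below. The exact normalization is an illustrative choice of mine, not necessarily the metric from the original paper.

```python
# Sketch of an NC collapse metric: average deviation of each embedding from
# its class-mean embedding, normalized so the metric is scale-free. The
# normalization here is one illustrative choice among several.
import numpy as np

def nc_deviation(H, y, K):
    """H: (n, d) array of embeddings; y: (n,) integer labels in {0, ..., K-1}."""
    mus = np.stack([H[y == c].mean(axis=0) for c in range(K)])  # class means mu_c
    dev = np.linalg.norm(H - mus[y], axis=1).mean()   # mean distance to own class mean
    scale = np.linalg.norm(mus, axis=1).mean()        # typical class-mean norm
    return dev / scale

# The NC property predicts nc_deviation(H, y, K) -> 0 as training epochs grow.
```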
All right, what about the ETF property? The ETF property is much stronger: it says something specific about the geometry of the learned classifiers. The first part is that the classifiers lie in a (K-1)-dimensional subspace. Here is how the geometry looks for four-class classification, which I can draw in R^3: I have four classifiers, they all have the same norm, so they all lie on the same sphere, and they are maximally separated in space, with the cosine of the angle between any pair equal to $-1/(K-1)$. In matrix form, this means that the Gram matrix $W^\top W$ converges to a scaling of the identity minus a rank-one component: $W^\top W \propto I_K - \frac{1}{K}\mathbf{1}\mathbf{1}^\top$. And what about the embeddings? The class-mean embeddings converge to exactly the same geometry as the classifiers; specifically, they align with the classifiers.

So these are the two properties, the NC property and the ETF property, and at this point they are just formalizations, hypotheses. What the authors then did is form appropriate metrics and try to verify through experiments whether these hold. Here are our replicated experiments for the NC property. Following their suggestions, we take the deviation of the embeddings $h_i$ from their corresponding class mean, average over all examples and all classes, and keep track of how this metric evolves over the training epochs. We do so for two datasets, CIFAR-10 and MNIST. What you see is that this metric goes down over training, suggesting that as the network trains longer and longer, the embeddings indeed collapse to their class means.

Now, what about the geometry? One way to check it, and there are many ways, is to look at the normalized Gram matrix and compare it to what the theory suggests, where the theory here is rather a hypothesis or conjecture: the identity-minus-rank-one matrix above. Again, you can observe that as training progresses, this metric goes down, both for CIFAR-10 and for MNIST. The same is true for the embeddings; you can see that the convergence for the embeddings is a little worse, but the trend is still there.
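Here is a minimal sketch of the kind of Gram-matrix comparison just described, assuming a normalized-Frobenius distance as the metric; the theory target is the identity-minus-rank-one ETF Gram matrix.

```python
# Sketch: compare the normalized Gram matrix of the learned classifiers to
# the ETF prediction I_K - (1/K) 1 1^T, using a Frobenius-norm distance.
import numpy as np

def etf_gap(W):
    """W: (d, K) matrix whose columns are the K classifier vectors."""
    K = W.shape[1]
    G = W.T @ W
    G = G / np.linalg.norm(G)                  # normalize to unit Frobenius norm
    G_etf = np.eye(K) - np.ones((K, K)) / K    # ETF Gram matrix (up to scaling)
    G_etf = G_etf / np.linalg.norm(G_etf)
    return np.linalg.norm(G - G_etf)           # should shrink over training
```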
Okay, so what are the highlights of this result? It has two nice features. The first is that it is a very, very simple description, and if you look at their paper, they also do experiments for other datasets and for other architectures. The second important feature is that it is invariant across situations: it holds for CIFAR-10, it holds for MNIST, it holds for Fashion-MNIST, and so on. The only real restriction is that all classes have the same number of examples. That is the requirement, and it holds in all their experiments.

So what I want to ask here is: what if the data are imbalanced? Specifically, what if the classes are imbalanced? Now, why am I even asking this question? It's not just for the sake of doing something different. Here are three reasons why I think it is meaningful and important to look at imbalanced datasets. First of all, imbalances are more frequent than not. Second, this geometry, as you noticed, is very, very symmetric, and once you have such a result, I think it's good to try to understand how brittle it is to changes in the structure or the symmetry of your data. Is this something that occurs only in very symmetric situations, or is it possible to find such structural properties even when we break the symmetry? And the last motivation is something we're not at yet, but I'm hoping that by understanding such structural properties in other scenarios as well, perhaps a link could emerge between this geometry and generalization, which is of course that holy grail. I can say maybe one more word about this link afterwards.

All right, so here is the specific setting we'll be looking at: step-imbalanced datasets. This is a particular case of general class imbalance where a minority fraction $\rho$ of the classes have $n_{\min}$ examples each, and the majority classes have $R \cdot n_{\min}$ examples each, where $R$ is the imbalance ratio. What we really want to know is how this geometry changes as a function of $R$ and the minority fraction.

And here is our hypothesis, our conjecture. We say that the geometry is characterized by the following two properties. The first property doesn't change: the NC property stays the same. Then we propose to substitute the ETF property with something we call the SELI property, standing for simplex-encoded-labels interpolation. This is a more general geometry that boils down to the ETF geometry in the balanced case. Let me show you an instance of how this geometry looks for an example with four classes, two of them majorities and two minorities, and an imbalance ratio of R = 10. You see eight vectors plotted in total: four for the classifiers in red and four for the embeddings in blue, with two of each for majorities and two for minorities. The first thing you observe is that the symmetry is broken. The classifiers and the embeddings no longer align, and there is different behavior, in both angles and norms, for minorities versus majorities. For example, you can see that majority classifiers have larger norms than minority classifiers according to this geometry.

I'll tell you how we came up with this geometry, but before I do that: does it work? Here is an experiment where, during training, we measure the ratio of the average majority-classifier norm to the average minority-classifier norm as training progresses, and compare it to the theoretical value, which equals one under the ETF conjecture and takes a different value under the SELI conjecture. The comparison to the SELI geometry is given in solid lines, and the comparison to the ETF geometry in dashed lines. What you see is that when compared to the SELI geometry the curves do go down, but compared to the ETF geometry they go up. So clearly they converge better to the SELI geometry. Now, this is about the norms. What about angles and embeddings? Here is the comparison for the Gram matrices. We do the same experiment for CIFAR-10 and MNIST with ResNet-18, for imbalance ratios R = 1 (which corresponds to the balanced case; that is why there is no blue dashed line there, since ETF and SELI coincide), R = 5, R = 10, and R = 100. And note here that all these curves do go down.
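To fix notation, here is a small sketch of the step-imbalanced setup and of the norm-ratio metric from the experiment just described. The class ordering (majorities first) and the default values are illustrative assumptions.

```python
# Sketch of the step-imbalanced setting: a fraction rho of the K classes
# (the minorities) get n_min examples each; the rest get R * n_min each.
# Majorities come first by convention here; defaults are illustrative.
import numpy as np

def step_imbalanced_labels(K=4, rho=0.5, n_min=50, R=10):
    n_minority = int(rho * K)
    sizes = [R * n_min] * (K - n_minority) + [n_min] * n_minority
    labels = np.concatenate([np.full(s, c) for c, s in enumerate(sizes)])
    return labels, sizes

# Metric from the norm experiment: ratio of the average majority-classifier
# norm to the average minority-classifier norm.
def norm_ratio(W, n_majority_classes):
    norms = np.linalg.norm(W, axis=0)      # per-class classifier norms
    return norms[:n_majority_classes].mean() / norms[n_majority_classes:].mean()
```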
The same is true for the embeddings, but for the embeddings we do observe that convergence worsens, and in fact it worsens with increasing imbalance. That is something that, at least in this experiment, seems to be the case. I could show you more of these pictures, and perhaps more offline, but let me now tell you a little bit about where this SELI geometry comes from.

In order to describe the SELI geometry, I need to define only one object: the simplex-encoded label matrix. This is a K-by-n matrix, where remember that K is the number of classes and n is the number of samples, and each column corresponds to one example. Here I'm showing you how this matrix looks for the case of four classes: every row corresponds to one of the classes and every column to one of the examples, and without loss of generality I've taken the first 50 examples to belong to class one, the next 50 to class two, and so on. The feature of this matrix is that its entries take only two values: $(K-1)/K$ if the example belongs to that class, and $-1/K$ otherwise. So essentially this is an encoding matrix; it is nothing but the standard one-hot encoding matrix, centered: $\hat{Z} = Y - \frac{1}{K}\mathbf{1}\mathbf{1}^\top$, where $Y$ is the K-by-n one-hot matrix.

Why do I care about this SELI matrix? Because the SELI geometry is defined according to the SVD factorization of this matrix. You take this $\hat{Z}$, you form its SVD, and the hypothesis is that $W^\top W$ converges to $V \Lambda V^\top$, and correspondingly for the embeddings. This is the description of the geometry, and really this is where the earlier picture came from. In fact, it is possible to compute the SVD factors in closed form, which is pleasing because it means we get exact formulas for the norm ratios in terms of the imbalance ratio and the minority fraction. Here I'm just showing you the example with minority fraction one-half. So when I showed you the plot comparing the ratio of majority to minority norms against a theory formula, this was the theory formula being evaluated, and you see the dependence on the imbalance ratio and on the number of samples. And this is again the picture I showed you.
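Here is a minimal sketch of the SELI construction: build $\hat{Z}$ from the labels, take its SVD, and form the conjectured Gram matrices. I take $\Lambda$ to be the diagonal matrix of singular values of $\hat{Z}$, which is one natural reading of the convention above; the overall scaling is left unspecified and may differ from the paper's exact statement.

```python
# Sketch: the simplex-encoded label matrix Z_hat = Y - (1/K) 1 1^T and the
# Gram matrices the SELI conjecture predicts from its SVD. Lambda is taken
# as the diagonal of singular values; scalings/conventions are assumptions.
import numpy as np

def seli_grams(y, K):
    """y: (n,) integer labels. Returns conjectured Grams for W and H (up to scale)."""
    n = len(y)
    Y = np.zeros((K, n))
    Y[y, np.arange(n)] = 1.0               # one-hot encoding, one column per example
    Z_hat = Y - np.ones((K, n)) / K        # entries are (K-1)/K or -1/K
    V, s, Uh = np.linalg.svd(Z_hat, full_matrices=False)  # Z_hat = V @ diag(s) @ Uh
    Lam = np.diag(s)
    G_W = V @ Lam @ V.T                    # conjectured classifier Gram  W^T W
    G_H = Uh.T @ Lam @ Uh                  # conjectured embedding Gram   H^T H
    return G_W, G_H
```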
All right, so the last thing: where does this come from? If you have a guess for the geometry, then checking it through experiments is "easy", and I'm putting easy in quotes because, as anyone who has done these experiments knows, it takes some effort. But really, for our purposes, the big question was how to come up with this geometry in the first place. In fact, in the original work by Papyan, Han, and Donoho, what is most surprising to me is exactly how they came up with this formalization.

So here is how we do it. We use something called the unconstrained features model, which views the rest of the neural network as a black box and focuses only on the embeddings and the classifiers. This is a model that was proposed in several contemporaneous works just after the original paper by Papyan, Han, and Donoho. What the model essentially says is that you should treat all layers but the last as a black box that can generate unconstrained features, because it is very powerful. So when you look at your cross-entropy minimization, you are now minimizing over the classifiers $W$ and over the embeddings directly, without any restrictions on the embeddings. Note that the embeddings are no longer parameterized by some $\theta$ that depends on the previous layers, so we have a non-convex optimization over $W$ and $H$. These papers introduced this model and used it to explain the neural collapse phenomenon in the case of balanced data. In the balanced case everything is symmetric and aligned, and in all these papers, and again there are several of them (I've highlighted the one whose analysis I found most transparent and which motivated much of our work), the proof more or less goes as follows: you take the cross-entropy loss, you successively apply Cauchy-Schwarz and Jensen lower bounds to get a lower bound on the loss, and then you show that the equalities are satisfied if and only if the ETF geometry holds. Because everything is symmetric, it makes sense that this goes well with Cauchy-Schwarz and Jensen: everything is aligned, and at the end of the day, in this case you know what you want to prove, which helps the analysis a little bit. In our case there is no similar alignment, and it is not a priori clear what to prove. What is the geometry? What is the answer?

So we take a somewhat different route than directly lower-bounding the cross-entropy loss. We have a sequence of results analyzing this loss. The first realization is that once you allow for imbalances and more than two classes, the geometry of the solutions changes with the regularization. In the balanced case and in the binary case this does not occur; you need both imbalances and multiple classes to see the solution change with $\lambda$. That is why, at the end of the day, I call this "imbalance trouble". Now, the current practice in neural network training is that $\lambda$ is negligible. So what if we look at the limit of $\lambda$ going to zero? This is the regularization path as $\lambda \to 0$; it is related to the implicit-bias analysis of gradient descent, but it is something weaker. What we show is that as $\lambda \to 0$, the global optimum of this ridge-regularized cross-entropy minimization converges to a max-margin problem. It is a non-convex max-margin problem, and we call it the UF-SVM, standing for unconstrained-features support vector machine. So really, the claim now is that perhaps we should be trying to understand the global optimum of this problem. And this is what the result shows: provided that the dimension d is larger than K - 1, where K is the number of classes, the optimal logits learned by this support vector machine are equal to the simplex-encoded label matrix.
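To make the unconstrained-features objective concrete, here is a minimal PyTorch sketch of the ridge-regularized cross-entropy minimization over free $W$ and $H$, whose $\lambda \to 0$ limit is the UF-SVM above. The sizes, learning rate, and $\lambda$ are illustrative assumptions.

```python
# Sketch of the unconstrained features model: the embeddings H are free
# optimization variables rather than network outputs, and we minimize the
# ridge-regularized cross-entropy jointly over W and H. All hyperparameters
# are illustrative.
import torch
import torch.nn.functional as F

K, d, n = 4, 8, 240
y = torch.randint(0, K, (n,))               # labels (step-imbalanced in our experiments)
W = torch.randn(d, K, requires_grad=True)   # classifier matrix
H = torch.randn(n, d, requires_grad=True)   # unconstrained features, one row per example
opt = torch.optim.SGD([W, H], lr=0.1)
lam = 1e-5                                  # small ridge regularization

for step in range(5000):
    opt.zero_grad()
    logits = H @ W                          # (n, K) logits  W^T h_i
    loss = F.cross_entropy(logits, y) + lam * (W.pow(2).sum() + H.pow(2).sum())
    loss.backward()
    opt.step()
```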
So essentially, if I view these logits as the learned model, then I am interpolating a label matrix that is encoded according to the simplex, and this is what motivates the name of the SELI matrix. The second property is that the Gram matrices depend on the SVD factors of $\hat{Z}$. So, according to the definition of the SELI geometry, this shows that the global optimum of this non-convex problem follows the SELI geometry. Now, how does the proof go? We use a rather standard convex relaxation, which substitutes the non-convex objective with the nuclear norm of the logits; the insight, essentially, is that you should be looking at the logits. With this convex relaxation in hand, the bulk of the proof is showing, through an explicit construction of a dual certificate, that $\hat{Z}$, the SELI matrix, is the global optimum, and then using this to show that the relaxation is tight. I don't know if I have time to go through that.

Maybe I'll quickly say that up to now we have only characterized the global optimum of this non-convex SVM; one question is whether gradient descent converges to it. It is possible to view the unconstrained features model as a two-layer linear model on a specific input, in particular the standard basis input. So, according to previous works by Lyu and Li and by Ji and Telgarsky, it is known that gradient descent will converge to a stationary point of this UF-SVM. Now, is this stationary point a global optimum? In the binary case, yes, by previous results. In the balanced case, I don't know of a formal result; if anyone knows of one, I'd be happy to hear about it. But the experiments suggest that this is the case. There is something interesting here, though, which is that convergence slows down with increasing imbalance, and I think that might be something interesting to investigate theoretically. And again, these are just experiments where we run gradient descent on the cross-entropy loss and measure the distance to the SELI geometry.
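As a concrete version of that last check, here is a sketch of a distance-to-SELI metric, again assuming the normalized-Frobenius comparison; it is meant to pair with the seli_grams and UFM sketches above, and the usage shown in the comments is a hypothetical pipeline.

```python
# Sketch: measure how far a learned Gram matrix is from its SELI counterpart,
# with both normalized to unit Frobenius norm so the comparison is scale-free.
import numpy as np

def seli_gap(G, G_seli):
    return np.linalg.norm(G / np.linalg.norm(G) - G_seli / np.linalg.norm(G_seli))

# Example usage with the earlier sketches (hypothetical pipeline):
#   G_W, G_H = seli_grams(y.numpy(), K)
#   seli_gap((W.T @ W).detach().numpy(), G_W)   # tracked over training steps
# The empirical observation is that this gap shrinks more slowly as the
# imbalance ratio R grows.
```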
Okay, so I will end here with this slide, which just summarizes our hypotheses, our findings, and the motivation for looking at imbalanced data. Thank you.

Questions?

Thanks a lot for the interesting talk. I was just wondering, have you looked at or tried experiments with a non-cross-entropy loss, like a poly-loss type, or I guess square loss? Would you still see something like this happen?

No, we have not, and that's a good point. Let me say one word about this: there is a lot of follow-up work on the neural collapse paper, and people have looked at various things. One of the things that has been looked at is whether the same geometry occurs if you optimize the square loss, and for the square loss it does; it's the same, and there is a corresponding theoretical analysis, again using the unconstrained features model. I don't know about polynomially decaying losses.

The reason I'm wondering is that in the case of imbalanced data it might be useful to use a different loss, if you have to reweight it.

Yeah, absolutely. And this is why I also say that, at the end of the day, I'm hoping there is perhaps also something to learn about generalization, if there is any link, because once you have imbalances, generalization gets worse. So now we have these different geometries with different generalization behaviors; could there be a link?

Other questions?

Thanks for the talk. Is a nice geometric effect, something of the nature of neural collapse, known also in the case of regression? This looks specific to classification, right?

Not that I know of, yeah. Not that I know of.

Thank you for your presentation. I was wondering if you can tell us a bit more about generalization. In particular, would I see the same effect if I train on random labels?

Yeah, very good question. This is, I would say, one of the two missing experiments for us before publishing the arXiv paper: we do want to run experiments on random labels to see whether the same thing occurs. For now, we've run experiments on ResNet-18 and VGG with MNIST and CIFAR-10, but in terms of experiments there is definitely more work that can be done here. For example, as I showed you, convergence gets slower for the embeddings, and it gets slower as you increase the imbalance ratio, so I think that is something that should be investigated more. We'll do our best to investigate it at some point, and I'm hoping there will be more interest in that.

Then if there are no other questions, I think we can move on and thank Christos again. Thank you.