Yeah, please show your screen. People are taking a seat right now. That looks good. Fantastic. OK, give me just one second to find the mic. All right. And we're ready to go. Recording in progress. Welcome to this last session of Even Higher Dimensions. We have two more talks, which are on the machine learning theory side of things. And for the first one, we're being joined by Noam Razin, who's joining us online. He's going to talk about generalization in deep learning, in particular these low-rank biases that we find in some of these networks. So Noam, thank you very much for joining us.

OK, so hi, everyone. First, I'd like to thank the organizers for making this wonderful conference possible. And I apologize for not being able to make it in person. My research mostly focuses on theoretical aspects of deep learning. And indeed, as you can see from the title, today I'm going to talk about an implicit rank-lowering phenomenon and how it may shed light on generalization. I'll cover results from the following three works, which were done in collaboration with Asaf Maman and my advisor, Nadav Cohen.

Perhaps the biggest mystery in deep learning today is how neural networks are able to generalize, even when we train them without any explicit regularization. And this is despite the fact that they're often overparametrized: the number of learned weights can be far greater than the number of training examples. In these cases, we know that there are many possible solutions, that is, weight settings for the network, which exactly minimize the loss and fit the data. Some of these will generalize well, but others will not. And perhaps the magic here is how simple gradient-based optimization methods lead us towards solutions, towards predictors, that do generalize. The conventional wisdom around this is that gradient descent and its variants, the leading optimization algorithms, induce some form of implicit regularization, which guides us towards predictors that are simple in some sense, meaning that they have low complexity with respect to some complexity measure, which we don't quite know. And so a major goal in the theory of deep learning is to mathematically formalize this intuition.

Since tackling state-of-the-art neural networks can be difficult at times, we can start by looking at simpler settings and make our way forward, towards models that are closer to what we have in practice. Perhaps the simplest setting of them all is linear regression. Here, when the number of learned weights is larger than the number of training examples, when we have more features than samples, we have infinitely many possible solutions. But by what is by now a folklore analysis, if we run gradient descent initialized at zero, we don't just converge to an arbitrary global minimizer of the loss, but rather to the one with the lowest Euclidean norm. Out of all possible global minimizers, we get the one with the lowest norm. So in linear models, we have a norm which is implicitly minimized, and we understand it well.
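As a concrete illustration of this folklore result, here is a minimal numpy sketch (not from the talk; the dimensions, step size, and iteration count are arbitrary choices): gradient descent from zero on an underdetermined least-squares problem lands on the minimum-Euclidean-norm interpolator.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100  # fewer samples than features: infinitely many interpolators
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Gradient descent on the squared loss, initialized at zero.
w = np.zeros(d)
lr = 1e-2
for _ in range(50_000):
    w -= lr * X.T @ (X @ w - y) / n

# The minimum-Euclidean-norm interpolator, via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

print(np.linalg.norm(X @ w - y))       # ~0: the data is fit exactly
print(np.linalg.norm(w - w_min_norm))  # ~0: GD found the min-norm solution
```

The reason is that every gradient update stays in the row span of X, and the minimum-norm interpolator is the unique interpolator in that span.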
But now the question is, what happens when we go to more complex models? Today we're going to see this through the models of matrix and tensor factorizations, and of course we'll discuss how exactly they relate to neural networks and deep learning. We'll examine whether here, too, we have some norm that is implicitly minimized, or perhaps something else entirely. We'll start with matrix factorization.

One of the common testbeds for studying implicit regularization is matrix completion. Here we're given a subset of entries from an unknown matrix, and our goal is to predict the missing entries. It is straightforward to see that completing a d by d' matrix is the same as solving a prediction problem over two discrete input variables. The first corresponds to the row index and takes values from one to d; the second corresponds to the column index and takes values from one to d'. Through this correspondence, you can look at the value of an entry (i, j) in the matrix we're trying to complete as the label of the input (i, j); the inputs are just pairs of indices. The observed entries are our training data, the unobserved entries are our test data, and any matrix is simply a predictor: a lookup table that at each location holds the prediction for the corresponding input.

We can go about solving matrix completion through matrix factorization, which refers to parametrizing our solution, our matrix, as a product of matrices W1 through WL, and fitting the observations using gradient descent. The objective takes the following form: a sum over all observed entries, and for each one, the squared difference between what we observe, y_ij, and the value of our solution at the same location. Notice that there is no explicit regularization here. In particular, the hidden dimensions between the matrices are large enough that the rank is not constrained, and we can express any matrix. In matrix completion, we have infinitely many possible solutions, right? Essentially, we can complete the matrix in any way we like, as long as we fit the observed entries. So what determines which type of matrix, which type of solution, we get is exactly the implicit regularization at play with gradient descent.

The main motivation for looking at this model from a deep learning perspective is that matrix factorization corresponds to a linear neural network, a fully connected network with no non-linearity. So although these models are apparently simple, they still have highly non-trivial optimization and generalization dynamics, and they are still widely studied today. The empirical observation, made a few years back by Gunasekar et al., is that if the ground truth matrix we're trying to complete has low rank, then just running matrix factorization from a small initialization near zero, and with a small step size, will actually lead to accurate recovery, even though there are many solutions that would not generalize. And they conjectured that this accurate recovery is due to the fact that implicitly we are converging to the minimal nuclear norm solution. The nuclear norm is the sum of the singular values of the matrix. Why the nuclear norm? This is based on classical results from matrix completion theory, which show that if you return the minimal nuclear norm matrix, you get accurate recovery. So they conjectured that implicitly this is what is going on. To be more formal, their conjecture is for gradient flow, a continuous-time variant of gradient descent; it's what you get when you take the step size towards zero. All the theoretical analyses I'm going to show today are for gradient flow. Okay, so this conjecture was proven for certain restricted cases, but it was not clear whether it is true more generally.
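A minimal numpy sketch of this kind of experiment (not the authors' code; the sizes, observation rate, initialization scale, and step size are arbitrary choices): a depth-2 factorization trained on a subset of entries of a low-rank matrix, with no explicit regularization, tends to recover the unobserved entries and ends up with a rapidly decaying spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank = 30, 2
W_star = rng.standard_normal((d, rank)) @ rng.standard_normal((rank, d))
W_star /= np.linalg.norm(W_star, 2)  # low-rank ground truth, unit spectral norm
mask = rng.random((d, d)) < 0.4      # observed entries

# Depth-2 factorization W1 @ W2 with unconstrained hidden dimension,
# initialized near zero; no explicit regularization anywhere.
W1 = 1e-3 * rng.standard_normal((d, d))
W2 = 1e-3 * rng.standard_normal((d, d))

lr = 0.2
for _ in range(20_000):
    R = mask * (W1 @ W2 - W_star)    # residual on observed entries only
    W1, W2 = W1 - lr * R @ W2.T, W2 - lr * W1.T @ R

W = W1 @ W2
print(np.linalg.norm(mask * (W - W_star)))     # train error: ~0
print(np.linalg.norm(~mask * (W - W_star)))    # test error: small if low rank is recovered
print(np.linalg.svd(W, compute_uv=False)[:5])  # few large singular values, rest tiny
```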
And indeed, a couple of years later, follow-up work suggested that maybe something more nuanced is going on, which cannot be explained through norm minimization. That work examined the singular values of the matrix and how they behave. So here we denote by W_M the end-to-end matrix, the product of our weight matrices; that's the solution we return. And sigma_r^M denotes its singular values: r is the index of the singular value, and M just stands for matrix factorization. What Arora et al. showed, this is the work I was referring to, is that when we train a matrix factorization starting from near zero using gradient flow, the singular values evolve at a rate that is proportional to themselves raised to a certain power. And this has a profound impact on how they behave. It means they're subject to a momentum-like effect: they move slower when they're small, because the rate of movement is proportional to themselves, which is then something small, and faster when they're large. As a result, since we initially start near zero, as is common in practice, at first the singular values barely move, they move really, really slowly, until one of them reaches a certain critical threshold, after which it shoots up while the rest stay stuck near zero. And then perhaps another one reaches the threshold and shoots up, while the rest stay stuck near zero. So we get this incremental learning phenomenon, which you can also widely observe in practice. Here you see the singular values of the matrix plotted against the iterations of gradient descent: one at a time, the singular values are learned, and this leads to solutions with few large singular values and many small ones, that is, effectively low-rank matrices.

So the bottom line of what they showed is that there seems to be some form of an incremental learning phenomenon which leads to low rank. And they conjectured that this is not something that can be captured by norm minimization, meaning that for any norm, we should probably be able to find some setting in which we don't reach the minimal norm solution. To put everything together, we started with two opposing viewpoints. According to the first one, there is a norm that is being minimized, the nuclear norm. And according to the second one, no norm is being minimized; something else is going on. And here is where our work comes into play. Essentially, we aimed to resolve this tension between the two viewpoints and check whether indeed there is some norm that is implicitly minimized. As the title of the work suggests, perhaps surprisingly, what we showed is that, no, there actually exist settings in which all norms are driven towards infinity, and that is in favor of minimizing rank. I'm not going into the full details of these settings, but they're actually quite simple. And if you take them and run experiments in practice, you can also see that as the loss is minimized, all norms shoot up. This means that norm minimization cannot capture the implicit regularization in matrix factorization, but perhaps rank is a better interpretation. I'll also mention here that several other works have shown further support for implicit rank minimization in this setting. Okay, so we saw that in matrix factorization it's not exactly clear that we have norm minimization, in fact there are cases in which provably we do not, and rank might be a better way forward.
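For reference, the singular-value dynamics just described can be written as follows; this is the shape of the result of Arora et al. as recalled in the talk, with the exponent 2 − 2/L spelled out later in the Q&A (the alignment factor involving the singular vectors is suppressed here).

```latex
% Gradient flow over a depth-L matrix factorization W_M = W_L \cdots W_1:
% each singular value of the end-to-end matrix evolves as
\dot{\sigma}^{M}_{r}(t) \;\propto\; \big(\sigma^{M}_{r}(t)\big)^{\,2 - \frac{2}{L}} ,
% so small singular values barely move while large ones move fast,
% producing the incremental learning of one singular value at a time.
```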
And now we'll see how this interpretation translates to more complicated models, closer to practical neural networks. We're going to see this through tensor factorizations. And why would we want to do that? Well, although matrix factorization is interesting in its own right, as a surrogate for deep learning it is inherently limited. First and foremost, it corresponds to linear networks, so it misses out on the crucial aspect of non-linearity. The second limitation concerns the matrix completion setup: viewed as a prediction problem, as before, it corresponds only to prediction over two input variables. As we're going to see next, by moving from matrix to tensor factorization, we'll be able to address both of these limitations, in a way.

A tensor, for the sake of this talk, is just a multi-dimensional array. Capital N denotes the number of axes, which is also called the order of the tensor. And we're going to look at a straightforward extension of matrix completion to tensors, known as tensor completion, where we get a subset of entries from some unknown tensor, and our goal is to reconstruct it. Just as matrix completion can be seen as a prediction problem, tensor completion is also equivalent to a prediction problem, but now the number of input variables is N; it corresponds to the number of axes the tensor has, and the number of values each input variable can take is equal to the corresponding dimension. The correspondence is the same as before: the value of an entry in the tensor is the label of an input, the observed entries are the training data, the unobserved entries are the test data, and each tensor is a predictor. There is nothing too profound going on here; it's just worth noticing that prediction problems over discrete input variables are equivalent to tensor completion.

Just as we can factorize matrices, we can also factorize tensors, and here we'll refer by tensor factorization to parametrizing our solution as a sum of outer products and fitting the observations using gradient descent. The objective is very similar to before: a sum over all observed entries, and for each one, the squared difference between what we observe and the value of our solution at the same location. Our solution is the sum of these outer products: each of w_r^1, ..., w_r^N is a vector, we take their outer product and get an order-N tensor; these are called components, and our factorization is the sum of these components. An important definition is the tensor rank, which is an extension of matrix rank to tensors: for a given tensor, its tensor rank is defined to be the minimal number of components needed to express it as such a sum. So we could limit the tensor rank of the tensors we get by using fewer components, but since we're interested in the implicit regularization at play, meaning in which solutions we get simply by running gradient descent without any explicit constraints, we consider the case where R is allowed to be arbitrarily large, so we can express any tensor.
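A toy numpy sketch of this setup (not from the talk; order three, a rank-one ground truth, and all sizes, scales, and step counts are arbitrary choices): a tensor factorization with many available components, initialized near zero and fit to the observed entries by gradient descent, typically ends up with a few large components and the rest near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
d, R = 8, 10  # dimension per axis; number of components (unconstrained)

# Rank-1 order-3 ground truth tensor, observed on a random subset of entries.
a, b, c = rng.standard_normal((3, d))
T_star = np.einsum('i,j,k->ijk', a, b, c)
mask = rng.random((d, d, d)) < 0.3

# Sum of R outer products w_r^1 (x) w_r^2 (x) w_r^3, initialized near zero.
W = [1e-2 * rng.standard_normal((R, d)) for _ in range(3)]

def reconstruct(W):
    return np.einsum('ri,rj,rk->ijk', W[0], W[1], W[2])

lr = 0.05
for _ in range(30_000):
    E = mask * (reconstruct(W) - T_star)             # residual on observed entries
    g = (np.einsum('ijk,rj,rk->ri', E, W[1], W[2]),  # gradients of the squared loss
         np.einsum('ijk,ri,rk->rj', E, W[0], W[2]),
         np.einsum('ijk,ri,rj->rk', E, W[0], W[1]))
    W = [Wn - lr * gn for Wn, gn in zip(W, g)]

# Component norms (the norm of an outer product = product of the vector norms):
norms = np.prod([np.linalg.norm(Wn, axis=1) for Wn in W], axis=0)
print(np.sort(norms)[::-1])  # typically one dominant component, rest near zero
```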
The main benefit of transitioning from matrix to tensor factorization is that, while matrix factorization corresponds to a linear network, tensor factorization turns out to correspond to a certain shallow non-linear convolutional network with multiplicative non-linearity. So it's still not quite what we have in practice, the non-linearity here is not ReLU or max pooling, but it does take us a step beyond the commonly studied linear models. And this equivalence is not new; it has actually been studied extensively in the past in the context of expressive power. Now we're going to examine the implicit regularization at play in these models.

The analysis here follows a similar line to what we saw for the singular values in matrix factorization. We'll denote by sigma_r^T the norm of the r-th component: r is the index of the component, and T stands for tensor factorization. Recall that the number of non-zero components controls the tensor rank. What we show is that when you train a tensor factorization starting from near zero, the norms of the components evolve at a rate that is proportional to themselves raised to a certain power. These dynamics are structurally identical to what we had for the singular values in matrix factorization. Accordingly, we again get the momentum-like effect where components move slowly when they're small and faster when they're large. And again we get an incremental learning process: here you can see the component norms plotted against the iterations, where each time a single component is learned while the rest stay stuck near zero. This leads to solutions with a small number of non-zero components, that is, with low tensor rank. We can also take this dynamical analysis a step forward and show, under certain technical conditions, an exact tensor rank minimization result. For example, if the tensor completion problem has a rank-one solution, then under certain conditions, in particular if the initialization is small enough, we will reach that solution, meaning we stay close to rank one for longer and longer the smaller the initialization is. Okay, so that was tensor factorization.
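In the same form as the matrix case, the component-norm dynamics just described read as follows (again with the exponent taken from the Q&A and the alignment factor suppressed); the hierarchical case discussed next obeys an identical law with a constant K in place of N.

```latex
% Gradient flow over an order-N tensor factorization (sum of outer products):
\dot{\sigma}^{T}_{r}(t) \;\propto\; \big(\sigma^{T}_{r}(t)\big)^{\,2 - \frac{2}{N}} ,
% structurally identical to the matrix factorization dynamics, with L replaced by N.
```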
And now we'll move on to the last model we're going to discuss today. Basically, the reason to go there is that although tensor factorization takes us beyond linear predictors, it is still shallow; recall that it corresponds to a shallow non-linear convolutional network, with only a single hidden layer. It turns out that if we move from tensor factorizations to hierarchical tensor factorizations, whose exact definition I won't go into, it's a bit technical, but what's important to know is that they are essentially a composition of many local tensor factorizations, and each of these local factorizations has components, which we call the local components of the hierarchical factorization, then these factorizations correspond to certain deep non-linear convolutional networks. These are deep variants of the same networks we had before. Again, the non-linearity is multiplicative, still not quite what we have in practice, but also not that far from it. And again, this equivalence is not new; it has been studied extensively in the past, in the same works I mentioned before, in the context of expressive power. Just as tensor factorization induces a notion of rank, controlled by the number of non-zero components, these factorizations have their own notion of rank, called the hierarchical tensor rank. I won't go into its exact definition either, but what's important to know is that if you can represent a tensor with few local components, then it has low hierarchical tensor rank.

Okay, so motivated by the incremental learning phenomenon we saw before, we now examine what happens with the local components in this model. We'll denote by sigma_r^H the Frobenius norm of the r-th local component at some location, and K is the order of the local component; you can think of it as some constant greater than or equal to two. What we see is the same phenomenon again: the exact same dynamics, where local components evolve at a rate proportional to themselves raised to a power. As a result, this leads to the same momentum-like effect, where they move slowly when small and faster when large. And we again get the incremental learning process, now of local components, which leads to low-rank solutions, but now with respect to the different notion of rank, the hierarchical tensor rank.

So, to sum up the theoretical analysis part, we can put the three models we saw side by side and see this nice analogy between the implicit regularization in all cases. In all models, we examined some quantity: singular values for matrix factorization, component norms for tensor factorization, and local component norms for hierarchical tensor factorization. In all of them, this quantity evolves at a rate proportional to itself raised to a power, and this leads to an incremental learning process, which results in low-rank solutions for the corresponding notion of rank. Bear in mind that we have this structural identity between the implicit regularizations.

Okay, so this concludes the theoretical analysis part, and in the time I have remaining, I will briefly discuss some implications of this theory for modern deep learning. First, a practical application: if we take neural networks and parametrize some of their layers, either the linear weights or the convolutional weights, as one of the factorizations we saw, and run gradient descent over the factorized form, then the analyses imply that we'll get low rank implicitly, by gradient descent, which means that we can compress the layer. And this is beneficial both for computational reasons and because it can boost generalization. This application has actually found its way into practice and has been incorporated in several papers already.
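A hypothetical sketch of that compression step (my own illustration, not code from the papers): after training a layer parametrized as a product of two matrices, the implicit bias tends to leave few large singular values, so the product can be truncated and stored as two thin matrices.

```python
import numpy as np

def compress_factorized_layer(W1, W2, tol=1e-3):
    """Truncate the trained product W1 @ W2 to its effective rank.

    If the implicit low-rank bias did its job, only a few singular values
    exceed the tolerance, and storing the thin factors A (d_out x k) and
    B (k x d_in) is much cheaper than storing the dense d_out x d_in matrix.
    """
    U, s, Vt = np.linalg.svd(W1 @ W2, full_matrices=False)
    k = max(1, int(np.sum(s > tol * s[0])))  # effective rank
    A = U[:, :k] * s[:k]
    B = Vt[:k]
    return A, B  # forward pass: x @ B.T @ A.T instead of x @ (W1 @ W2).T
```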
Okay, the second implication is more of a theoretical one: a possible explanation for generalization over natural data. A major challenge in explaining generalization through implicit regularization is that we need to find complexity measures that are implicitly minimized by gradient descent; that's the obvious criterion, which is often discussed. But, equally important, we also need these complexity measures to capture the essence of natural data, in the sense that we need to be able to fit natural data with predictors of low complexity. Because if we cannot, then even if we converge to the minimal complexity subject to fitting the data, that will still not really be low complexity, and generalization bounds based on this measure may not tell us much. Motivated by notions of rank, in particular the tensorial ranks that we saw being minimized in certain non-linear convolutional networks, the question we asked is: can they serve as measures of complexity for explaining generalization? Basically, can we fit natural data sets with predictors of low tensorial rank? And what we showed is that, at least for standard simple data sets, MNIST and Fashion-MNIST, we indeed can. We can fit them with predictors of extremely low tensorial rank, meaning that for the convolutional networks corresponding to tensor factorizations, the implicit regularization towards low rank can actually explain generalization. We have this agreement between the complexity measure that is implicitly minimized by gradient descent and the complexity measure under which the data can be fit with low complexity, and both of these properties together are what we need for good generalization.

That was the second implication, and the last one is a mix of theory and practice. We saw that in hierarchical tensor factorization the hierarchical tensor rank is implicitly minimized, but what does this even mean for the corresponding convolutional networks? It turns out that the hierarchical tensor rank measures the long-range dependencies modeled by the network. If, for example, we're in an image classification task, it measures how strongly we take into account relations between more distant patches of pixels in the image. So an implicit lowering of hierarchical rank implies an implicit regularization towards locality: all long-range dependencies will be weak, and only local ones can be strong. This gives rise to the question of whether we can use explicit regularization to counter this implicit bias, to improve the performance of modern convolutional networks on long-range tasks. And we showed that this is indeed possible. We designed an explicit regularizer, motivated by our theory, that promotes high hierarchical tensor rank, that is, long-range dependencies, and showed that it can significantly improve the performance of modern convolutional networks, such as ResNets, on long-range tasks. Quite surprisingly, we can counter the locality of these overparametrized networks just by using explicit regularization, without modifying the architecture, which is often believed to be necessary.

Okay, so that was the last of the implications, and now I'll wrap up the talk. To recap what we saw: the overarching goal was to try to improve the understanding of implicit regularization, even if by a bit. We started by looking at matrix factorization, which corresponds to linear networks. The existing conjecture there was that a norm is implicitly minimized, but we showed that there actually exist settings where all norms are driven towards infinity, in favor of minimizing rank. Then we moved to tensor and hierarchical tensor factorizations, which correspond to certain non-linear convolutional networks, and we saw that there, too, a notion of rank is implicitly lowered. Then we discussed implications for modern deep learning. We saw that by parametrizing layers of a network as one of the factorizations, we can get compression almost for free. We saw that implicit rank lowering may have the potential to explain generalization over natural data. And lastly, that we can counter the locality of convolutional networks just by using explicit regularization.

One final note to conclude, on what I believe is the main takeaway from this line of works. Zooming out, we saw three different types of neural network architectures which all have a notion of rank that is implicitly lowered for the function they realize. In each of these cases, the function is represented either as a matrix or as a tensor, and that representation has a notion of rank which is implicitly lowered.
And seeing this, we believe that perhaps more generally, in more complex models, whether transformers, recurrent networks, or residual convolutional networks, there is also some notion of rank for the function they realize which is implicitly lowered. This hypothesis is perhaps a bit optimistic, or even a bit naive, but I believe there is quite a bit of empirical evidence that points towards such a thing. And if this is true, then discovering these notions of rank may pave the way both to explaining generalization and to improving performance, either by designing explicit regularization or by designing the architecture so that it better aligns with the data. And with that, I'll finish. So again, thank you very much, and I'll be glad to answer any questions if there are.

Very nice talk, Noam. Thank you very much. I'm sure there are going to be some questions; let's start right here.

Hi, thanks for the talk. I was curious about the experiment you mentioned with the deep CNNs on long-range tasks, on which you managed to achieve good performance by promoting a higher hierarchical tensor rank. Can you tell us more about that experiment?

Yeah, so to be brief, essentially we used two different benchmarks for modeling long-range dependencies. The simplest one you can think of: say I put two CIFAR-10 images next to each other and ask the network to classify whether they're of the same class or not. Then I create data sets where the two CIFAR-10 images are placed further apart. This is just a controlled setting. You will see that if you run a standard ResNet, then when the images are close it is able to generalize, but when they are far apart it does not generalize at all; it is basically as good as random guessing. And if you just add a certain regularizer which promotes high hierarchical tensor rank, whose exact definition I didn't go into, so it's a bit hard to describe exactly what it does, but you can of course find the details in the paper, then suddenly you are able to close this gap and generalize on the more long-range tasks as well as you do on the local ones. And we see a similar phenomenon in other benchmarks for long-range dependencies. Okay, thanks.
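A hypothetical reconstruction of that controlled benchmark (my own sketch; the actual construction in the paper may differ): pair up images, separate them by a variable gap, and label each pair by whether the two images share a class.

```python
import numpy as np

def make_pair_dataset(images, labels, gap, rng):
    """Place two images side by side, separated by `gap` blank columns;
    the label says whether the two images are of the same class.
    `images`: (n, h, w, c) array, `labels`: (n,) array."""
    n, (h, w, c) = len(images), images.shape[1:]
    idx1, idx2 = rng.integers(0, n, size=(2, n))
    pad = np.zeros((n, h, gap, c), dtype=images.dtype)
    x = np.concatenate([images[idx1], pad, images[idx2]], axis=2)
    y = (labels[idx1] == labels[idx2]).astype(np.int64)
    return x, y

# Larger `gap` means longer-range dependencies; per the talk, a standard
# ResNet degrades towards chance as the gap grows, unless a regularizer
# promoting high hierarchical tensor rank is added.
```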
Any other questions? Okay, so one there and one there; I'm going to have to walk a lot.

Yeah, thanks for the nice talk. My question, I guess, is about the theoretical results you showed. If I understand correctly, these are for the square loss. Could you comment a little bit on what is known about, say, the logistic loss?

Yeah, that's an excellent question. So actually, the square loss here was just for simplicity; the dynamical analysis we saw holds for any differentiable loss. And indeed, if you run experiments for classification, you'll see similar behavior. The question is what kind of results you can prove beyond the dynamical analysis, for example the tensor rank minimization result, and that was indeed for regression. I don't think much is known in terms of rank minimization results for these models in classification. So it's a very interesting problem, but the dynamics are similar across all settings: you get this low-rank bias pretty much regardless of the loss, in every common setting that we tried.

Maybe I can follow up on that question. I mean, it's quite striking; you had it on your summary slide too, right? It's really a very universal structure that you find, even in the exponent: it's always just two minus two over the number of matrices, or the tensor order, or whatever. Can you give us a bit of intuition, in the gradient flow dynamics, on what it is mathematically that always brings about this structure, this particular form?

Yeah, sure, that's an excellent question. In these particular models, it's the fact that you have a product of multiple things. In matrix factorization, you have a product of L matrices, so the gradient with respect to each matrix depends on all the other matrices; that gives you L minus one terms. And then, when you take the derivative of the singular value, or of the norm, you get this twice, and that's where the two minus two over L comes from. In tensor factorization, N is the order of the tensor, and again you have a product, essentially; it's a different operation, but you have products of N things. And in the hierarchical factorization it's the same: K stands for the number of factors whose product you take. So the exponent basically reflects the fact that you have a product of some number of things, and the gradient of each one depends on all the rest.

Okay, I see, that makes sense. Thanks.
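That intuition can be recorded as a back-of-the-envelope count (my own paraphrase of the answer above, under a balancedness assumption on the factors):

```latex
% Heuristic for the exponent 2 - 2/L in a depth-L factorization:
% with balanced factors, each factor carries singular values of size
% \sigma^{1/L}. The gradient of one factor is a product of the other
% L-1 factors, and the chain rule back to the end-to-end matrix
% contributes another L-1 such factors, so
\dot{\sigma}_r \;\propto\;
\big(\sigma_r^{1/L}\big)^{L-1} \cdot \big(\sigma_r^{1/L}\big)^{L-1}
\;=\; \sigma_r^{\,2 - \frac{2}{L}} ,
% and similarly with N (tensor order) or K (local product size) in place of L.
```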
So we have two more questions, and then I guess we have to move on.

Hi, thanks very much for your talk. My question concerns the part about tensor factorization. You show this very interesting phenomenon where you have this analog of the singular values, the component norms, which start growing one by one. I was wondering if you can also say something about the overlaps. I mean, if you assume that you have data which is exactly a rank-R tensor, we know that under mild conditions you have uniqueness properties for these models, and in some applications we want to extract the right directions in the space. Are you able, in this framework, to say something about the overlaps that you get?

Yeah, that's a very good question. The results that I displayed are just a dynamical analysis, plus the fact that under a small initialization you can reach the rank-one solution. But a follow-up work, an excellent work by Rong Ge and colleagues, does exactly this: for a specific type of ground truth tensor, an orthogonally decomposable tensor, they characterize the full stagewise process where you learn these components one by one. So for simple ground truth tensors, others have characterized exactly this whole process, and I believe it's probably possible to do in other, more general setups, but it depends on the ground truth tensor.

Right, thanks. All right, we'll have one last question for you, Noam, and then we'll let you go.

I was wondering, when you do your gradient descent, does the initialization matter? And also, when you state your matrix completion or tensor completion problem, what is the sample complexity in your setting?

Okay, so first, the initialization is crucial, in particular the fact that it is small, near zero. All these dynamics imply that when you're small, you move slowly, and this leads to the incremental process where each time one of the quantities becomes large enough to shoot up. If you initialize large, you will not get the low-rank bias, because everything will be large and things will not go to zero. So this is indeed crucial. In terms of sample complexity, we do not analyze the exact reconstruction that you get. These analyses are a bit asymptotic, in the sense that, for example, the tensor rank minimization result requires taking the initialization towards zero. So this is indeed a hole that we still haven't filled in the theory: getting exact sample complexity results for how well you can reconstruct the tensor just from the implicit bias. I know that for matrices there are works that have already done this, but for tensors I'm not sure.

All right, let's thank Noam again.