Sorry, it's okay. Can you see the screen? Yep, we do. Great. Okay, so I'm going to talk about something slightly different, because I'm going to talk about deep linear networks. But it's quite related to the last two talks, because we will see a similar type of sparsity. Of course, in the case of linear networks, because we represent linear maps, the sparsity just means that we have a certain bias towards low-rank functions. What I want to focus on is a special kind of dynamics that we observe that leads to this low-rank bias. It's a very interesting dynamic where, during training, we observe, and we prove at least to some degree, that in this regime the parameters jump from saddle to saddle, learning each time a solution of slightly larger rank, until reaching a final solution. Describing this jumping from saddle to saddle is a very difficult problem, because these saddles can be quite complex, especially in the deep case. But that's what we did in this very recent paper.

So let's define these deep linear networks. Deep linear networks are just the idea of representing a matrix as the product of multiple matrices. Here we have L matrices, and we call L the depth. We then define the parameters as the concatenation of all of the entries of these matrices. I will focus in particular on what are sometimes called rectangular networks, where the number of neurons in all of the hidden layers is equal to some constant w. This is just for simplicity; it's not particularly important in the analysis. What we are going to focus on is the gradient flow of the cost over the parameters, where we do our analysis for a general cost C that takes the product matrix as input and is convex and differentiable.

One case of particular interest is matrix completion. In matrix completion, the idea is that you have some matrix A star that you want to reconstruct, but you only observe some entries of the matrix, and you want to recover the missing entries. What's interesting with this problem, again, is that it's very important to have a sparsity or low-rank assumption. If you have no assumption about the rank of the matrix, the best guess you could make is just to put zeros in all of the entries that you haven't observed. This would be the minimal Frobenius norm solution, which you can essentially think of as the equivalent of the kernel method solution in the previous talks. And so it's not able to generalize at all in this setting. To be able to generalize, you first need the matrix A star to actually be low rank, because if A star is not low rank, there's little chance to recover the missing entries. And then, assuming that A star is low rank, you need an algorithm that actually finds a low-rank matrix fitting the observed entries. This is a very hard problem; we know it's an NP-hard problem. So of course we cannot expect a fast algorithm that always recovers this low-rank matrix, but we can be interested in how different algorithms are able to approximate this low-rank solution. And I will explain a bit how linear networks, in a certain setting, can actually do that, at least to some degree.
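To make the setup concrete, here is a minimal numpy sketch of the parameterization and of the matrix-completion cost. The names (a list of weight matrices, width w, scale alpha, an observation mask) are my own notation, not taken from the talk.

```python
import numpy as np

# Depth-L "rectangular" linear network: every hidden layer has width w.
# The network represents the single matrix A(theta) = W_L @ ... @ W_1.
def product_matrix(weights):
    A = weights[0]
    for W in weights[1:]:
        A = W @ A
    return A

# Matrix-completion cost: convex and differentiable in the product matrix.
# `mask` is 1 on the observed entries of A_star and 0 on the missing ones.
def completion_cost(weights, A_star, mask):
    A = product_matrix(weights)
    return 0.5 * np.sum(mask * (A - A_star) ** 2)

# Initialization at scale alpha; the talk studies the limit alpha -> 0.
def init_weights(L, n_out, n_in, w, alpha, rng):
    dims = [n_in] + [w] * (L - 1) + [n_out]
    return [alpha * rng.standard_normal((dims[l + 1], dims[l])) for l in range(L)]
```

In this notation, the "fill the missing entries with zeros" baseline mentioned above is simply taking the reconstruction equal to mask * A_star, which is the minimal Frobenius norm interpolant of the observed entries.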
And the setting I'm talking about is the following. Of course, in linear networks you have the same kind of transition into a lazy regime, and in the lazy regime you cannot expect any generalization in this matrix completion setting, because the missing entries are essentially not learned at all. This lazy regime roughly corresponds to a very large initialization of the parameters. So instead, what we focused on is the regime where the norm of the parameters at initialization becomes very small, and we study that limit.

Before I present the result, I will show you a bit what it looks like, because I think it's quite a nice behavior to see. On the left, I plot the train and test error during training. At the beginning of training we see a plateau where both train and test error stay almost fixed for a long number of iterations. This is not so much of a surprise, because the origin in a deep linear network is actually a saddle. In particular, here it's a depth-four network, so it's a non-strict saddle, in the sense that the Hessian is zero at the origin, and it takes a very long time to escape this saddle. But then you would expect that it could just escape the saddle and converge to some global minimum. What's very surprising is that we see subsequent plateaus: this first one, and then a second one. It's not clear just from this plot what the explanation for these plateaus would be, but you could expect that they would again be explained by some saddles. But why would the training path be attracted to the saddles? In general, saddles are not attractive. And that's exactly what we're going to show: the existence of these saddles and this path from saddle to saddle.

Here I have a plot to describe a bit what I'm talking about. In the middle I show a projection of the training trajectory in parameter space. You see the green line does this kind of zigzag: it is initialized close to the origin, marked with a zero, and the two zones where it does a turn are exactly the regions where you have these plateaus. What we showed is that there is a first path: as the initialization becomes closer and closer to zero along a specific direction, the green path converges to this blue path. This blue path is very special, because it starts from a saddle and converges to another saddle. That's pretty rare: usually there are very few paths that actually converge to a saddle. Usually a path will get stuck close to a saddle for some time, but will then be able to escape. This blue path is very specific in that it actually converges to the saddle and gets stuck there. Then at this saddle, represented by these black dots, we find another path, which goes from this saddle to the next one, and a third path going from the second saddle to the last one, which is a global minimum. So the idea is that we want to approximate the complex trajectory of the green line by this sequence of paths.
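As a rough illustration of this regime, here is a small numerical sketch, reusing the helpers from the code block above. The target, mask, step size and initialization scale are my own toy choices, not the experiment from the talk; with a small enough scale, the printed loss tends to stay flat for a while near the origin saddle and then drop in stages, and how long the plateaus last depends strongly on that scale.

```python
import numpy as np

# Gradient of the completion cost with respect to each factor W_l,
# using the chain rule through the matrix product A = W_L ... W_1.
def layer_grads(weights, A_star, mask):
    A = product_matrix(weights)
    G = mask * (A - A_star)   # dC/dA for the squared error on observed entries
    out = []
    for i in range(len(weights)):
        left = product_matrix(weights[i + 1:]) if i + 1 < len(weights) else np.eye(A_star.shape[0])
        right = product_matrix(weights[:i]) if i > 0 else np.eye(A_star.shape[1])
        out.append(left.T @ G @ right.T)
    return out

# Plain gradient descent with a small step size, as a crude discretization of gradient flow.
rng = np.random.default_rng(0)
A_star = rng.standard_normal((10, 2)) @ rng.standard_normal((2, 10))   # rank-2 target (toy choice)
mask = (rng.random(A_star.shape) < 0.5).astype(float)                  # observe roughly half the entries
weights = init_weights(L=4, n_out=10, n_in=10, w=10, alpha=0.1, rng=rng)
for step in range(50_000):
    weights = [W - 1e-2 * g for W, g in zip(weights, layer_grads(weights, A_star, mask))]
    if step % 5_000 == 0:
        print(step, completion_cost(weights, A_star, mask))
```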
And you can see that this approximation can be quite good: if I plot the train and test error of each of these three paths one after the other (I put these three dots to represent the fact that, because the path gets stuck at the saddle, there is possibly an infinite amount of time between one segment and the next), you can see that if you glue together these three paths, you get exactly the same loss behavior, the exact same dynamics in time, as the green dynamics that we had on the left. So that's the behavior we try to explain. We are not able to describe everything completely, but we were at least able to identify what these paths are, and you see that this identification works quite well, at least in this example. We are also able to prove that if you initialize close to the origin, you will follow this first path.

Okay, so before I can explain a bit how the proof works, I need a few elements related to the symmetries of deep linear networks. There are two types of symmetries that are interesting. The first one is the rotation of a linear network. Basically, in a linear network you can rotate all of the hidden layers by an orthogonal transformation, and it will keep the same output and not change the gradient flow. So I represent a rotation as one orthogonal matrix for each of the hidden layers, and you can apply this rotation to any parameters by applying a rotation on either side of the weight matrices. The other type of symmetry that we need is the inclusion. It's quite simple: if you have a network with a smaller width, w prime smaller than w, then you can include its parameters into a larger network by just adding some zero neurons. The nice thing is that, after this inclusion, you still represent exactly the same matrix. But also, you keep the same gradient flow path, so in particular, if you initialize the parameters at the inclusion of some smaller network, you will stay inside this inclusion throughout training. Now, combining the two, by taking the image of an inclusion followed by a rotation, you get lots of invariant subspaces. These are subspaces of the parameter space where, if you initialize inside the subspace, you get stuck in this subspace and cannot escape. What's also interesting about these subspaces is that, even though you are in a width-w network, if you initialize inside the subspace for w prime equal to one, the training dynamics look exactly the same as if you were training just a width-one network. Even though you are in a large-width network, in effect you are training a width-one network.

And basically, that's how we show it. The idea for the proof is that, as the initialization gets close to the saddle at the origin, you escape the saddle along the direction which escapes the fastest, which is what we call the optimal escape direction. It seems intuitive, but it's actually quite difficult to prove.
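To illustrate the two symmetries just described, here is a small numpy check. The sizes and variable names are my own; the rotation is one orthogonal matrix per hidden layer, and the inclusion pads a width-one network with zero neurons.

```python
import numpy as np

rng = np.random.default_rng(1)
L, w, n = 3, 5, 4                                   # depth, hidden width, in/out dimension (toy sizes)
dims = [n] + [w] * (L - 1) + [n]
weights = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(L)]

def product_matrix(ws):
    A = ws[0]
    for W in ws[1:]:
        A = W @ A
    return A

# Rotation symmetry: one orthogonal matrix per hidden layer, applied on either
# side of consecutive weight matrices; the end-to-end product is unchanged.
Rs = [np.linalg.qr(rng.standard_normal((w, w)))[0] for _ in range(L - 1)]
rotated = []
for i, W in enumerate(weights):
    left = Rs[i] if i < L - 1 else np.eye(dims[-1])       # rotate the output side of hidden layers
    right = Rs[i - 1].T if i > 0 else np.eye(dims[0])     # undo the previous layer's rotation on the input side
    rotated.append(left @ W @ right)
print(np.allclose(product_matrix(weights), product_matrix(rotated)))   # should print True

# Inclusion symmetry: embed a width-1 network into the width-w network by
# padding the hidden layers with zero neurons; the product matrix is the same,
# and gradient flow started from such a point stays inside this subspace.
small = [rng.standard_normal((d_out, d_in)) for d_in, d_out in zip([n, 1, 1], [1, 1, n])]
included = [np.zeros_like(W) for W in weights]
included[0][:1, :] = small[0]          # first layer: keep only one hidden neuron
included[1][:1, :1] = small[1]         # middle layer: a 1x1 block
included[2][:, :1] = small[2]          # last layer: read out from that single neuron
print(np.allclose(product_matrix(small), product_matrix(included)))    # should print True
```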
In the shallow case, when L is equal to two, this escape direction corresponds to the eigenvectors of the most negative eigenvalues of the Hessian, so it's quite easy to identify. But for deep networks the Hessian is zero; only the L-th derivative is non-zero. So you need to define this optimal escape direction as some kind of generalization of eigenvectors to tensors, which complicates the analysis.

So the first part is showing that, as you initialize closer and closer to the saddle, you escape along this escape direction. But we don't only care about which direction you escape along: you want to know that you first escape along this direction and then follow the blue path that we had, and we need to define what this blue path is. That's what we define as the optimal escape path, theta one. This optimal escape path is unique, in the sense that it is the only path that escapes along this optimal escape direction. This is not always true: if you look at all of the paths that escape a saddle and you choose a direction, in general there could be many paths that escape along that direction. But because this escape direction is optimal, and also because of lots of other constraints which are thankfully satisfied for linear networks, for the shape of this particular saddle, we are able to show that there is actually a unique path which escapes along this optimal escape direction. The other thing is that there are actually multiple optimal escape directions, but they are all the same up to rotation, and they are all of the form of the inclusion of a width-one network. So if you escape along these directions, the escape direction lies inside one of these invariant manifolds, where you get stuck. And because of this uniqueness, we can show that the optimal escape path must also lie inside this invariant subspace, as described here.

So we know that we escape and follow this optimal escape path for some time, but then this optimal escape path is going to get stuck, because it behaves just like a width-one network: it cannot learn the full matrix, it can only learn rank-one matrices, because it has a width of one. At some point it gets stuck at a local minimum among the width-one networks. This critical point is a local minimum inside the space of width-one networks, but it's actually a saddle inside the bigger space. Basically, you could use the same analysis to show that you approach this second saddle, theta one, that you escape it again along an optimal escape direction and an optimal escape path, and then go to another saddle, and so on. The only difficulty is that, for this proof to work, we need to assume that the initialization approaches the origin along a generic direction, and we don't have enough control over exactly how the trajectory approaches the next saddle. That's why we cannot simply apply the same result again.
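For the shallow case mentioned at the start of this part, here is a hedged sketch of why the optimal escape directions are rank one. The notation and formulas are my own reconstruction from standard facts about two-layer linear factorizations, not something spelled out in the talk.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Sketch for the shallow case L = 2; notation is mine, not from the talk.
For $L=2$, gradient flow on $C(W_2 W_1)$ reads
\begin{align*}
  \dot W_1 = -W_2^{\top}\,\nabla C(W_2 W_1),
  \qquad
  \dot W_2 = -\nabla C(W_2 W_1)\,W_1^{\top}.
\end{align*}
Near the origin $W_2 W_1 \approx 0$, so the flow is driven by the SVD
$\nabla C(0) = \sum_i \sigma_i\, u_i v_i^{\top}$.
Restricting to $W_1 = b\, v_1^{\top}$ and $W_2 = a\, u_1$ gives
$\dot a = -\sigma_1 b$, $\dot b = -\sigma_1 a$, whose growing mode $(a=-b)$
has rate $+\sigma_1$. The rates $\pm\sigma_i$ are the nonzero eigenvalues of
the Hessian of the loss at the origin, so the fastest escape is along the
rank-one direction $W_2 W_1 \propto -u_1 v_1^{\top}$, i.e.\ the inclusion of a
width-one network.
\end{document}
```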
That's also why we are not able to fully prove the whole sequence of steps. Still, we are able to identify all of these optimal escape directions, and that's exactly what I used to plot this path, to identify the blue, purple and red paths. So what happens is that you jump from saddle to saddle: each time you escape along this optimal escape path, which is unique up to rotation, and each time you learn a matrix whose rank is one higher. It's really like a greedy algorithm that tries to find the minimal-rank solution, because an easy way to try to find a minimal-rank solution is to say: okay, I'm going to try to learn a width-one network; if it doesn't work, then I try a width-two network, then a width-three network, and so on, until I find a global minimum. In some cases this algorithm could work, but because it's a greedy algorithm, you cannot expect it to work in general. Under some assumptions you can show that this algorithm actually converges to the minimal-rank solution, but you do need to assume something. I don't think it's general, because again, if it were general, it would solve an NP-hard problem, so it's not clear whether this algorithm can solve the problem in full generality.

So, to conclude: we show that if you have a small initialization, then gradient flow exhibits a very interesting phenomenon where it jumps from saddle to saddle; this explains the plateaus that we observe during training and leads to some low-rank bias in the end. Thank you. I think I'm out of time. Oh, I'll have to check, but I think I'm done.
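For concreteness, here is a hedged sketch of the greedy rank-incremental idea described above: try to fit the observed entries with rank one, and if that fails, retry with rank two, three, and so on. This is my own illustrative implementation using alternating least squares on the observed entries, not the algorithm analyzed in the paper; it only illustrates the analogy.

```python
import numpy as np

def greedy_rank_completion(A_obs, mask, max_rank, tol=1e-8, iters=500, seed=0):
    """Try ranks r = 1, 2, ... until the observed entries are fit."""
    rng = np.random.default_rng(seed)
    n, m = A_obs.shape
    for r in range(1, max_rank + 1):
        U = 0.1 * rng.standard_normal((n, r))
        V = 0.1 * rng.standard_normal((m, r))
        for _ in range(iters):
            # Alternating least squares, row by row, using only observed entries.
            for i in range(n):
                obs = mask[i] > 0
                if obs.any():
                    U[i] = np.linalg.lstsq(V[obs], A_obs[i, obs], rcond=None)[0]
            for j in range(m):
                obs = mask[:, j] > 0
                if obs.any():
                    V[j] = np.linalg.lstsq(U[obs], A_obs[obs, j], rcond=None)[0]
        err = np.sum(mask * (U @ V.T - A_obs) ** 2)
        if err < tol:
            return U @ V.T, r          # first rank that fits the observed entries
    return U @ V.T, max_rank
```

With enough observed entries of a low-rank target, a loop like this will often stop at the true rank, but as discussed above there is no general guarantee, consistent with the NP-hardness of the problem.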