as well. Perfect. How are you doing? Not too bad, I think. Can you see my slides? Yes, we can. So thank you very much, Nicolas, for joining us. Our next talk will be given by Nicolas. Nicolas is an assistant professor at EPFL and he is going to talk about the role of stochasticity in learning algorithms. Nicolas, take it away.

Can you see the pointer? Yes. Okay, thanks so much for the invitation, and I am really sorry not to be able to be with you today; the line-up of the workshop really looks amazing. I am super happy to talk to you about this recent line of work we have done on the role of stochasticity in learning algorithms. It is mainly thanks to my amazing collaborators: Scott, who is a PhD student with me, Lucas, who was a postdoc with me, and Julien, who is a researcher at the École Normale in Paris. The project really started some time ago, when Nati Srebro went on a sabbatical at EPFL. We were brainstorming among different professors in the theory of machine learning, and at one point he raised an open problem: to find a framework in which it would be possible to theoretically characterize a difference in generalization between stochastic gradient descent and gradient descent. At that time this was super striking for me, because I come from a convex optimization background and I had always seen stochastic gradient descent just as an efficient algorithm to get to the same point, to the same performance. But at that point we realized that in fact, when you have a non-convex landscape, the noise can push you towards good parts of the loss and can in fact lead to better generalization properties. That is really what we are trying to study today, and it is something which has been fairly well observed empirically: there are a lot of papers which explain that in practice, when you want to train a neural network, you will get to a better solution if you use a small batch. Put another way, it would be super difficult to train a neural network with a large batch or with full gradient descent. And that is something which is not well understood theoretically: there were some first works on it, but they were always in the case of gradient descent, or of stochastic gradient descent with label noise, so it is not exactly the noise of stochastic gradient descent; it is more that you artificially add some noise to your data and then you see how it modifies the dynamics. Today we will really consider stochastic gradient descent, and if there is only one thing to remember from this talk, it would be this plot. What we will see is that there exists one framework, namely a sparse regression problem, for which it is possible to show that, under a particular parameterization, a non-convex parameterization, the performance of stochastic gradient descent is one order of magnitude better in terms of generalization than the performance of gradient descent. That is what we will try to study today. Is it clear? Very much so. So let us start with a quick introduction; what we are doing is really the classical machine learning setup.
So we have observations (x_i, y_i) which are i.i.d. according to some unknown distribution rho, and what we want is a prediction function f_w(x) such that y will be close to f_w(x). The goal is to find the w star which minimizes the true risk. What is the true risk? It is the expectation, under the unknown distribution rho, of some loss function between your prediction f_w(x) and your label y. Today we are considering regression problems, so the loss will be the quadratic loss. You have different examples: you can consider linear predictors, where your prediction is equal to w transpose x, and you can also consider neural networks, where your prediction function is a non-linear function of your observation x which is still parameterized by some parameter w. Since the distribution rho is unknown, the classical machine learning way to find a good estimator of w star is to minimize the empirical risk, that is, to minimize the sum of your losses not over the true distribution but over the empirical distribution: you minimize the average of your loss over your data. What is super important to have in mind in this particular setting is that we are in the overparameterized regime: the number of parameters, the dimension d, is really larger than the number of observations n. And d larger than n means that there exist many different parameters w for which you have an exact fit, that is, f_w(x_i) equals y_i for all i. So there exist many different interpolating parameters, and the main question will be to understand to which particular solution, to which particular minimizer of the empirical risk, we are converging. We will consider two different algorithms: the gradient descent algorithm and the stochastic gradient descent algorithm. These are optimization algorithms to minimize your empirical loss. Gradient descent, I think you all know it, is the algorithm which follows the negative gradient direction at each step, and stochastic gradient descent is the algorithm which follows an estimate of the true gradient: to build this estimate you sample uniformly one observation between 1 and n and you follow the gradient of the loss evaluated at this observation. As I already told you, in the classical convex setting, for example for linear predictors, the goal is often to show that gradient descent and stochastic gradient descent converge to the same point. But now, in this non-convex setting, for example when the predictor is a neural net, stochastic gradient descent and gradient descent can converge to different solutions, and in fact what we will be able to prove is that the noise in SGD helps you converge to a different solution which has better generalization properties. The way to characterize to which solution the algorithm converges is what is referred to as the implicit bias.
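To make the two update rules concrete, here is a minimal sketch in Python, assuming the squared loss and a linear predictor as above; the synthetic data, the step size gamma, and the iteration count are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                          # overparameterized: d > n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def empirical_risk(w):
    return 0.5 / n * np.sum((X @ w - y) ** 2)

def full_gradient(w):
    # gradient of the empirical risk, averaged over all n observations
    return X.T @ (X @ w - y) / n

def stochastic_gradient(w):
    # unbiased estimate: sample one observation uniformly and use its gradient
    i = rng.integers(n)
    return (X[i] @ w - y[i]) * X[i]

gamma, T = 0.01, 20_000
w_gd, w_sgd = np.zeros(d), np.zeros(d)
for _ in range(T):
    w_gd -= gamma * full_gradient(w_gd)           # gradient descent step
    w_sgd -= gamma * stochastic_gradient(w_sgd)   # stochastic gradient descent step

# Both drive the empirical risk towards zero; the question of the talk is which
# of the many interpolating solutions each algorithm ends up close to.
print(empirical_risk(w_gd), empirical_risk(w_sgd))
```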
Let me quickly define the motivations behind the implicit bias concept. We know that in practice, neural networks which are overparameterized, so with more parameters than observations, generalize super well, even though there are many different solutions which exactly fit the data, and they generalize well even if we use very little regularization, or even sometimes no regularization at all. The main question is to understand why we still obtain this good generalization property, and one of the ways to understand such a phenomenon is to understand the implicit bias of the algorithm. This is the idea that the training algorithm you are using, SGD, GD, Adam, whatever, is not picking an arbitrary solution, an arbitrary minimizer of the training loss; in fact it picks one particular solution which has some good properties, and what we are trying to do when we study the implicit bias of an algorithm is to exactly characterize what these properties of the solution picked by the algorithm are. The second idea we will investigate today is that stochasticity in the algorithm plays a fundamental role in this phenomenon, and in fact helps to, in some sense, emphasize these good properties. In order to characterize the implicit bias of an algorithm, we will use some norms. Let me define this. We still have our empirical risk, which we minimize with gradient descent or stochastic gradient descent, and we denote by w_infinity the limit of our optimization algorithm, so w_t converges to w_infinity. Even if both algorithms, GD and SGD, converge to a minimizer of the training loss, which corresponds to an interpolator of the data, what we want to characterize are the properties of this particular solution w_infinity.
Okay, and in order to get a final description of this solution picked by the algorithm, we use the concept of minimum norm interpolation. We say that w_infinity is the minimizer of some auxiliary problem: it is the minimizer of some criterion R over the set of minimizers of the training loss. So your algorithm is not converging to an arbitrary solution, an arbitrary minimizer of the training loss; it converges to one particular one, the one which corresponds to the minimum R solution. You can think of the criterion R as, for example, some norm, the L1 norm or the L2 norm. If your algorithm is implicitly biased towards the L1 norm, it means that the solution picked by the algorithm is the minimum L1 norm interpolator, and if your algorithm is implicitly biased towards the L2 norm, then the solution your algorithm converges to is the minimum L2 norm interpolator. This is what we are trying to do: we are trying to characterize, for a particular algorithm, what the corresponding R is. And you have to keep in mind that this is really in contrast with explicit regularization, which we could also do: you could directly add the regularizer R to your optimization objective, so that you would explicitly be biased towards this regularization. But today what we are considering is the implicit bias: we will be biased towards this R without adding it to the objective. Maybe one of the simplest examples of implicit bias is the case of least squares, when you are doing a regression with a linear predictor. In that case you directly see that the iterates of both gradient descent and stochastic gradient descent have a very nice property, which is that they belong to the span of the observations x_1 to x_n. Using this, and using the fact that w_t converges to a solution, you can directly show that, for a linear predictor, the iterates of both gradient descent and stochastic gradient descent converge to the minimum L2 norm interpolator. I don't have a lot of time today, so I won't go through the proof in detail, but it is a super simple proof, and the main takeaway is that if you have a linear predictor you are biased towards the L2 geometry, and both gradient descent and stochastic gradient descent have the same implicit bias. This idea of implicit bias has been studied in a lot of different papers, for a lot of different models, for classification, for regression, for linear and non-linear predictors; there has been a lot of interesting work on this, but the role of stochasticity has been a bit forgotten. And since for linear models and convex problems it is not possible to distinguish between gradient descent and stochastic gradient descent, what we do is move to a slightly more complicated model, but still a super simple one: diagonal linear networks of depth 2. We consider a diagonal network of depth 2, which is shown here, and this is really equivalent to a linear parameterization, but instead of having a weight vector beta, I parameterize beta as u ⊙ v, where this Hadamard product is just the component-wise product. So instead of looking for a vector beta such that your prediction is the scalar product of beta and x, you are looking for two vectors u and v, and your prediction will be equal to the scalar product of x with u ⊙ v.
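As a small check of the least-squares claim above, here is a sketch with made-up synthetic data: gradient descent on an overparameterized linear regression, started from zero so that the iterates stay in the span of the observations, ends up at the minimum L2 norm interpolator given by the pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10, 50                       # overparameterized linear regression
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Gradient descent on the least-squares loss, started from zero.
w = np.zeros(d)
gamma = 0.05
for _ in range(50_000):
    w -= gamma * X.T @ (X @ w - y) / n

# Minimum L2 norm interpolator, computed directly with the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

print(np.max(np.abs(X @ w - y)))       # ~ 0: w interpolates the data
print(np.linalg.norm(w - w_min_norm))  # ~ 0: gradient descent picked the min L2 norm solution
```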
And so the problem is equivalent to minimizing this function over w, which gathers the two parameters u and v: you minimize over u and v the square loss of your prediction parameterized by u ⊙ v. What is super important to have in mind is that this problem in u and v is non-convex. It is a convex problem in the prediction parameter beta, but when you look at the problem in the parameterization u and v, it becomes a non-convex problem. So even if the class of functions you can generate is still the same, the dynamics will be different when you minimize this function with a gradient algorithm on u and v. In order to study the implicit bias of a gradient algorithm on this particular example, we consider the gradient flow, which is the infinitesimal step size limit of the gradient descent algorithm. The main result we have in that case, which has been shown by Woodworth, Srebro and coauthors in a paper from 2020, is the following. When you run the gradient flow on this loss with this parameterization (here, just for simplicity, it is not u ⊙ v anymore but simply w squared; it is exactly the same, you are just assuming that your beta is positive and running the dynamics on positive vectors), and you initialize your gradient flow at scale alpha, then you can show that your algorithm is biased towards a particular geometry defined by a function R_alpha. The function is parameterized by your initialization scale, and what is very important to have in mind is that for large alpha, when your initialization is large, the geometry is close to the L2 norm, close to the Euclidean geometry, but when the initialization is small, when alpha is small, the geometry is close to the L1 geometry. So if you start from a point with a large initialization, you are implicitly biased towards the L2 norm, but if you start from a point with a small initialization, you are implicitly biased towards the L1 norm, and if you have a sparse prior on your data, it is particularly nice to have this L1 implicit regularization. That is for gradient descent, and then the question is: what about the stochasticity induced by stochastic gradient descent? What happens if you minimize this same non-convex function, not with gradient descent but with stochastic gradient descent? To do that, we will not consider exactly stochastic gradient descent, but one particular model of its dynamics, which can be seen as a limit when the step size is small. The main question will be: will the noise change the implicit bias, will the noise change the implicit regularization of the problem? Empirically, you directly see that if you plot the generalization performance, assuming a sparse prior, and you compare the generalization performance of gradient descent and two different runs of stochastic gradient descent (here I took two somewhat extreme ones), you really see that the generalization performance of SGD is far better, while the loss still goes to zero; but you also see that the loss of SGD goes to zero more slowly. That is what we will try to understand in the last four minutes we have today.
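To give an idea of the kind of experiment behind such a plot, here is a rough sketch: gradient descent and stochastic gradient descent on the depth-2 diagonal network, on a sparse regression problem, with the same small initialization scale alpha. The sizes, the step size and the iteration count are invented and this is not the exact setup of the paper; how visible the gap is depends a lot on the step size, which is exactly the point discussed in the questions at the end.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, s = 20, 40, 3                         # sparse regression with d > n
X = rng.standard_normal((n, d))
beta_star = np.zeros(d)
beta_star[:s] = 1.0                         # sparse ground truth
y = X @ beta_star

def run(stochastic, gamma=0.02, T=100_000, alpha=0.1):
    u = alpha * np.ones(d)                  # diagonal network, initialized at scale alpha
    v = alpha * np.ones(d)
    for _ in range(T):
        if stochastic:                      # SGD: gradient of the loss at one sampled observation
            i = rng.integers(n)
            r = X[i] @ (u * v) - y[i]
            gu, gv = r * v * X[i], r * u * X[i]
        else:                               # GD: full gradient of the empirical loss
            r = X @ (u * v) - y
            gu, gv = v * (X.T @ r) / n, u * (X.T @ r) / n
        u, v = u - gamma * gu, v - gamma * gv
    return u * v                            # the linear predictor beta = u * v

for stochastic in (False, True):
    beta = run(stochastic)
    print("SGD" if stochastic else "GD ",
          "train error:", float(np.mean((X @ beta - y) ** 2)),
          "L1 norm:", float(np.sum(np.abs(beta))),
          "distance to beta*:", float(np.linalg.norm(beta - beta_star)))
```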
But first, what we have to do is define the stochastic gradient flow, because, as I told you, even in this simple case it is difficult to characterize the implicit bias of the discrete stochastic gradient descent algorithm, so we use a continuous, small step size approximation. If you rewrite your stochastic gradient descent algorithm by putting the noise xi_{i_t} of the model in a particular form, you can see that you can model this algorithm by a particular stochastic gradient flow, this particular stochastic differential equation: it is really your usual gradient flow plus a particular noise term driven by a Brownian motion. You see that the covariance of this noise term is super important: it is really this covariance which lets you exactly match the covariance of the noise of the original stochastic gradient descent algorithm. That is the first point. You also see that the noise term belongs to a particular space, w multiplied by the span of the observations, so in some sense you keep the same structure between the two algorithms, and that is really important, because our results hold for this exact model of SGD and would not hold for a different stochastic model. Empirically, you really see that if you discretize this dynamics with a very small step size, it behaves in a similar manner to stochastic gradient descent. Another important point is that the step size is still present in this stochastic model. And what we were able to show for this stochastic model is the following. Let me remind you that for gradient descent we were biased towards the geometry induced by the function R_alpha. For stochastic gradient descent, it is possible to show that, with high probability, the iterates converge to zero training error, and the limit beta_infinity is still biased towards the geometry induced by the function R_alpha, but not with the alpha of the initialization: with an effective alpha, an alpha_infinity, for which we have an explicit formula. Roughly, alpha_infinity equals alpha times a factor which is random, because it depends on the integral of the loss along the trajectory, and this factor is always smaller than one. So the effective alpha you enjoy with stochastic gradient descent is always smaller, and hence better, than the alpha you have with gradient descent. It is also possible to be more quantitative, but the idea is really that, thanks to the noise, stochastic gradient descent behaves like gradient descent started from a smaller initialization. What is also important to have in mind is that this is not a free lunch: the smaller your initialization is, the slower you converge to the solution, so you improve your generalization performance but you degrade your convergence speed. I won't have time to talk about the label noise results, but I can quickly show you this kind of result, where you see that, when you compare gradient descent and stochastic gradient descent, the convergence of the training loss of stochastic gradient descent is slower, which means that the integral of the loss is larger for stochastic gradient descent, and so the implicit bias is better for stochastic gradient descent than for gradient descent.
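To convey the shape of these two statements in the notation of the talk, here is a schematic version; the noise covariance Sigma and the constant c are placeholders standing in for the exact expressions of the paper, so the display is only meant to show the structure of the result, not the precise constants.

```latex
% Stochastic gradient flow: the usual gradient flow plus a noise term whose
% covariance is chosen to match the covariance of the SGD noise (gamma is the
% step size, B_t a Brownian motion); schematic form only.
\[
  \mathrm{d}w_t \;=\; -\nabla F(w_t)\,\mathrm{d}t \;+\; \sqrt{\gamma}\,\Sigma(w_t)^{1/2}\,\mathrm{d}B_t .
\]
% Implicit bias of the limit: same family of geometries R_alpha as for the
% gradient flow, but with an effective, random, strictly smaller scale.
\[
  \beta_\infty \;=\; \operatorname*{arg\,min}_{\beta \,:\, X\beta = y} R_{\alpha_\infty}(\beta),
  \qquad
  \alpha_\infty \;\approx\; \alpha\,\exp\!\Big(-c\,\gamma \int_0^\infty L(\beta_t)\,\mathrm{d}t\Big) \;\le\; \alpha .
\]
```

Here F denotes the empirical loss over the parameters w, and L the same loss seen as a function of the predictor beta = u ⊙ v; a slower decay of the loss means a larger integral, hence a smaller effective alpha and a stronger L1-like bias, which is exactly the mechanism described above.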
So what are the take-away messages of today's talk? The first one is that it can be a good idea to consider an appropriate stochastic model of your optimization algorithm in order to bring out interesting and pertinent results, and the second message is that for a specific type of problem, namely sparse regression, it is possible to provably show that the noise can help to bias your dynamics towards well-generalizing solutions. Okay, thank you very much.

Thank you very much, Nicolas, that was a very interesting, very nice talk. We have time for some questions, and I already see one over there, so let me just walk over.

Thank you for the nice presentation, I had a question. Some authors have pointed out that a large step size may be very important for the performance of stochastic gradient descent, so I was wondering what you think is missing, or what would change in the picture if you included this additional piece of the puzzle.

Okay, that is a very nice question, and we have some ongoing work on this. What is first important to keep in mind is that in the stochastic model we still have the step size: to SGD with different step sizes correspond different continuous stochastic equations. So the continuous dynamics, in some sense, once you have noise, still model the step size. That is the first point, which is important to have in mind, and which is different from the gradient flow, which totally neglects the step size; in fact, if you take the step size to zero, then the limit of SGD is the gradient flow, so in order to still get some noise you have to consider a large step size. Then, just to give you an intuition: we have seen that the implicit bias is better when the effective alpha is smaller, and in order to have the effective alpha small you need this term to be as large as possible, and you see that this term depends on the step size. So having a large step size is important in order to increase this effect on the effective initialization. So a large step size is totally present in the picture, but it is possible to be more quantitative, and we are currently working on this. Okay, thank you. Is that okay for you?

I have another question right here. Hi, I was wondering, do you have an explicit expression for this function capital R? How does it depend on this parameter alpha?

Yes, you have a closed form; you can look at it in our paper. In that case it involves some logarithms, and it is a closed-form function, which you can plot, depending on alpha and on your parameter.

And in your asymptotic expression, when you take a limit in alpha, it converges to some norm, is that the idea?

Yes, exactly. It does not exactly converge to a norm, but it will be equivalent to one: in some sense, when you look at this problem for super large alpha, it is equivalent to the problem of minimizing the L2 norm of your parameter under the interpolation condition, and when you look at this problem for super small alpha, it becomes equivalent to minimizing the L1 norm under the interpolation condition. And really, it is not difficult; I did not put the formula for R on the slide, but there is nothing mysterious, it is one function, an analytical function of everything.

Okay, so with logs and, I guess, hyperbolic sines and things like that. Okay.
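For completeness, the closed form being discussed here is, if I recall the 2020 paper correctly, a hyperbolic-entropy-type potential of roughly the following shape; the constants may differ, but this is where the logarithms and hyperbolic functions come from.

```latex
\[
  R_{\alpha}(\beta) \;=\; \sum_{i=1}^{d} \alpha^{2}\, q\!\Big(\frac{\beta_i}{\alpha^{2}}\Big),
  \qquad
  q(z) \;=\; 2 \;-\; \sqrt{4 + z^{2}} \;+\; z \,\operatorname{arcsinh}\!\Big(\frac{z}{2}\Big).
\]
```

Since q(z) behaves like z^2/4 for small z and like |z| log |z| for large z, a large alpha gives an L2-like geometry and a small alpha an L1-like geometry, which is the equivalence mentioned in the answer.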
And we have time for one last question, so I am just going to walk around. Thank you, Nicolas. Actually, I just have a question regarding this diagonal network. I understand it has some nice properties because you have some tractable closed-form solutions, but do you have any intuition whether there could exist another simple model where you could have this kind of implicit bias characterization as well?

That is a really good question. For me, this is the only simple model where you have an explicit characterization of the implicit bias. Otherwise you get a characterization of the implicit bias only in limits, if you take the initialization to zero, if you take the initialization to infinity, or if you consider a classification problem, because in some sense for classification problems without regularization your iterates diverge to infinity, so you always escape this kind of initialization scale. So classification is less difficult in that sense, but for regression, for now, this is the only kind of problem for which we have such a precise characterization. The reason is that, in fact, and I did not have time to talk about it today, underlying all this mechanism there is a mirror descent structure, and the implicit bias of mirror descent is perfectly understood. The key idea is that when I run the gradient flow on u and v, it is equivalent to running a mirror descent, a mirror flow, on beta, and then the geometry is given by the potential of the mirror descent. For now, the only problems for which we have this mirror descent structure are diagonal networks, and you can add multiple layers to the diagonal network, but if you consider a fully connected network, something like u times a matrix V, you will not get such a characterization. That is a very good question.

Yes, it was. Okay, I think we have to move on to the next speaker. Nicolas, we hope that you will be able to join us in person at some point in the future, but for now let's just thank you again for your talk and your answers.