[The opening remarks are garbled in the recording.] ...about those kinds of heuristics, and in particular the generalization properties of such variants of SGD — what I will call the randomized stochastic gradient method, stochastic gradient descent: SGM, SGD. So, I start from the problem setting. We consider linear regression in an abstract Hilbert space, with the square loss. The choice of the square loss is critical in our analysis; everything I am going to say depends heavily on this choice. [inaudible] The usual goal in stochastic optimization is to minimize the expected square loss, the risk, with respect to w. [inaudible] Here H is the Hilbert space. [inaudible] This abstract setting includes, for instance, functional regression, if my input points belong to an infinite-dimensional space — for instance, they are curves, or functions.
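For concreteness, the risk being minimized can be written as follows (my reconstruction from the description; the slides' exact notation may differ):

```latex
% Linear least squares in a Hilbert space H; \rho is the unknown
% data distribution on H x R (reconstructed from the talk).
\min_{w \in H} \mathcal{E}(w),
\qquad
\mathcal{E}(w) \;=\; \int_{H \times \mathbb{R}} \big(\langle w, x \rangle - y\big)^{2} \, d\rho(x, y).
```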
Also, this is, let's say, an equivalent way to write the learning problem in reproducing kernel Hilbert spaces: you consider learning in an RKHS. What you want to do is to minimize the risk, where this time the function w that I want to find, an element of H_k, depends non-linearly on the input variable. As I think most of you know, if w is a function in a reproducing kernel Hilbert space generated by a kernel k, then by the reproducing property w evaluated at a point can be written as a scalar product between two elements of our Hilbert space H. "So x_i here is like x?" Exactly. The x_i are the inputs — you will see in a moment why I don't call them x. So the x_i belong to the input space, R is the output space, and w is a function. [A stretch of the recording is garbled here.] The classical approach is Tikhonov regularization. My plan is to start from some very classical and basic results about Tikhonov regularization for learning problems; then I will state some assumptions, state the main results, and then compare them with the ones that we get for stochastic gradient descent. OK, so what is the Tikhonov estimator, the regularized empirical risk estimator? It is a point in my space H that can be built in this way: I first replace the risk, which is unknown, by its empirical version, and then I add to this risk a regularization functional r, which in this talk will be the square of the norm.
OK, then what I do is that I minimize this regularized empirical risk, and I build this estimator w hat, which depends on a parameter — the regularization parameter — that balances the weight of the regularizer against the data-fitting term. The properties of this estimator have been studied a lot, but mainly from a statistical point of view. So now I want to present to you some statistical properties of this estimator. In order to do that, I need to introduce some assumptions. The first one is fairly standard: basically it requires that my measure rho has bounded support, so the norm of my points x is bounded almost surely, and the same holds for the outputs y. This assumption can be relaxed for what I'm going to say, but I will consider this simpler setting. Another assumption that I make in this talk is that the risk has a minimizer. Since the set of minimizers is then non-empty, we can consider the minimal norm solution. This is a less standard assumption; it is a particular case that in learning is referred to as — I will call it — the attainable case, where you have exactly one target parameter to estimate. Under these assumptions you can already prove some consistency results for the Tikhonov estimator, but more assumptions are needed if you want more precise, non-asymptotic bounds for the error. "So the existence assumption here is just because you are perhaps in infinite dimension, right? In finite dimension, if you're bounded as in this case, you will always have a solution, right?" In infinite dimension, no; in finite dimension, yes: bounded sets are compact, and the risk is a convex continuous functional, therefore I have a minimizer.
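As a concrete sketch of the Tikhonov (regularized empirical risk) estimator in an RKHS — the Gaussian kernel, all variable names, and the data model below are my own illustrative choices, not the speaker's — one can use the closed form given by the representer theorem:

```python
import numpy as np

def gauss_kernel(A, B, sigma=0.5):
    """Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

# Tikhonov (kernel ridge) regression in the RKHS of k:
#   minimize (1/n) sum_i (f(x_i) - y_i)^2 + lam * ||f||_H^2.
# By the representer theorem f = sum_i alpha_i k(x_i, .), and
# evaluation uses the reproducing property f(x) = <f, k(x, .)>_H.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (50, 1))
y = np.sin(3 * X[:, 0]) + 0.05 * rng.standard_normal(50)

lam = 1e-3                                   # regularization parameter
K = gauss_kernel(X, X)
alpha = np.linalg.solve(K + 50 * lam * np.eye(50), y)

def f(x_new):
    """Evaluate the estimator via the kernel expansion."""
    return gauss_kernel(x_new, X) @ alpha
```

The parameter `lam` is exactly the regularization parameter discussed in the talk, balancing the data-fitting term against the squared norm penalty.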
"In finite dimension — anywhere else in the talk, in the future, can we always assume that we are in finite dimension for w?" Yes, you can, yes. "OK. So, to be correct, you could be in finite dimension, but the norm of w star could be very large." Yeah, it can be, yes. "So large, infinite — but it is finite, right?" It is finite, but maybe you don't have an estimate for it, for instance. "No, sorry, my comment was the other way: if you don't assume that w is in the set, it's like having w star very large in the finite-dimensional setting; but here you assume that w star is finite, and the norm is reasonable. Thank you." This is a standard assumption. So I'm presenting the results in this setting; some of the results that I will present hold even without these assumptions, but in this talk I will assume that they all hold — mainly because I want to introduce the source condition, the technical condition that allows us to get these error bounds. First, I define T, the second moment operator. It is a linear operator from H to H, and it is well defined and bounded thanks to the boundedness assumption I just made. OK. What is this source condition? It can be read as a regularity assumption on the solution I am assuming to exist. More precisely, it asks that the minimal norm solution belongs not only to the range of T, but to the range of T at a certain power. I use this strange indexing just to make the comparison with existing results easier, because this is the usual scaling used in the available papers. OK. What does it mean? In the picture we have the space H: if r is equal to one half, I'm not asking anything, since T to the zero is the identity, therefore I'm just asking existence of the minimizer.
If r is bigger than one half, then these spaces, which are vector subspaces of H, are nested, and as r increases my condition is stronger: I'm requiring more regularity of my solution. Why do I call it regularity? Here is another interpretation. The operator T is a compact operator from H to H, self-adjoint, therefore I can diagonalize it and build a sequence of eigenvectors and eigenvalues. My point w dagger belongs to H, the eigenvectors form a basis of H, and so this quantity — the squared norm of w dagger in this basis — is finite; this is equivalent to the existence of w dagger. What does the source condition add? Since w dagger is in the range of this power of T, I can invert the operator on its range and write w dagger as the image of some element h, which therefore belongs to the space H, the capital H. Computing the norm of h in this basis of H, I get this condition: since h has finite norm, this sum here is finite. What does it mean? Remember that the eigenvalues go to zero — we have a compact operator. So this inequality asks that these Fourier coefficients go to zero sufficiently fast, faster than usual. This is the usual condition, and this is the stronger assumption. OK. So with this condition — boundedness plus source condition — we can prove the following error bounds on the Tikhonov regularized estimator. The key point here is the choice of the regularization parameter: as you see, I can prove these bounds if I choose the regularization parameter as a function of n in this way, in a way that depends on the source condition. What I get is that if r is smaller than three halves, then I have this rate, which improves as r increases. But at a certain point the rate stops improving, at minus one half: if r is bigger than three halves, even if I require stronger regularity of my solution, I will not be able to approximate it at a faster rate. OK.
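In symbols, with (sigma_j, e_j) the eigenpairs of T, the conditions just described read (my reconstruction, using the indexing of the talk):

```latex
% Second moment operator and its eigendecomposition
T = \mathbb{E}\big[\, x \otimes x \,\big], \qquad
T e_j = \sigma_j e_j, \quad \sigma_j \downarrow 0 .
% Existence of the minimal norm solution:
\|w^\dagger\|^2 = \sum_j \langle w^\dagger, e_j \rangle^2 < \infty .
% Source condition (r >= 1/2): w^\dagger = T^{\,r-1/2} h for some h in H, i.e.
\|h\|^2 = \sum_j \frac{\langle w^\dagger, e_j \rangle^2}{\sigma_j^{\,2r-1}} < \infty .
```

For r equal to one half this is just existence; for larger r the Fourier coefficients of w dagger must decay faster relative to the eigenvalues, which is the sense in which the condition is a regularity assumption.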
So, just the idea of the proof: it is a bias-variance tradeoff. I introduce an auxiliary point, w lambda, which is the minimizer of the true risk plus regularization. I will not have access to this point, but I don't need to compute it; I just use it as a reference point. Then I bound the norm of the difference between my estimator and my true parameter w dagger by the sum of these two terms. As you can see, we have a term that depends on the sample and a term that depends only on the choice of lambda — on the regularization properties of my approach. Balancing the two terms gives the result, since the behavior of the second term is determined by the source condition. These results, the results that I showed, are minimax, in the following sense. You fix a class of probability measures — in this case, the ones that satisfy the assumptions I just introduced — and in general I will have a different solution for each probability measure. If I fix an algorithm, I can compute the distance between my estimator and my true parameter. Then I can do a worst-case analysis: I put a maximum in front of it, taking the problem on which my algorithm behaves worst. And then, among all the algorithms at my disposal, I take the best one — the algorithm with the best worst-case behavior. OK, and as you can see here — I don't know if you remember the exponent in the bounds for Tikhonov regularization — the exponent of n is the same, therefore... yes? "Oh, sorry. Is there a reason why you're choosing to do this in the Hilbert norm rather than the L2 norm?" OK, yes. The reason is the following.
The results that we obtained for multiple-pass SGD are minimax optimal in the H norm, but — let's say, to be optimistic — we still don't have optimal results for the prediction error. That's why I'm presenting these results in the H norm; basically you can do the same in the error norm, with r minus one half replaced by r, and you get the same picture. OK. So what I described for r bigger than three halves — the fact that even if you ask more of your solution you don't see this extra regularity — is called saturation: Tikhonov regularization has a saturation effect. The problem in practice is that you do not know r, so you do not know how to choose the regularization parameter, and usually you need some adaptive procedure; in this case, since you are dealing with the H norm, you can use, for instance, Lepskii's method, also known as the balancing principle, which allows you to recover these optimal rates. OK. Finally, if you put some further assumptions on your operator T — more precisely, that the eigenvalues decay at a certain rate — then you can prove improved rates, called capacity-dependent rates, which depend on this decay of the eigenvalues and are more precise, let's say tailored to the properties of your operator T. In this talk I will consider the capacity-independent setting. OK, but there is a but: all the analysis that I showed you does not take into account the optimization error. So, what can we do? Someone did it already: in the paper by Bottou and Bousquet, what they point out is that in the large-scale scenario computing this estimator w hat lambda can be costly. This estimator is the solution of a minimization problem and does not come automatically out of your computer; you will need to approximate it, to compute it.
And so what makes more sense in this scenario, since this will be the bottleneck, is to analyze the behavior of, let's say, an approximation of the true minimizer. OK. We will denote it w hat lambda t, meaning that w hat lambda t is the outcome of the t-th iteration of an iterative procedure applied to the regularized empirical risk. So to the classical bias-variance decomposition we add another term, an optimization term. The first obvious comment one can make, looking at this decomposition, is that when we use an optimization procedure to minimize the regularized empirical risk, it makes no sense to go beyond the statistical accuracy, since the additional effort will be lost. OK. Parallel to this observation, in the last years there has been very active research on optimization methods for problems of exactly this form — objectives that can be interpreted as regularized empirical risk functions, that have some structure, possibly strongly convex or not — especially in the large-scale scenario. So there has been a huge amount of work on methods that scale to this dimension: first-order methods that, in a sense, are not blind to the structure of the function we optimize, taking into account that we are minimizing a sum. I'm thinking of first-order methods in a broad sense — splitting, stochastic, incremental, aggregated — all these classes of methods. So, once we have the convergence properties, the complexity bounds, of these optimization methods, we can think of balancing these terms and obtaining new tradeoffs that include also the number of steps needed to approximate my true parameter.
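The three-term decomposition described above can be written as (my reconstruction):

```latex
\|\hat w_{\lambda,t} - w^\dagger\|
\;\le\;
\underbrace{\|\hat w_{\lambda,t} - \hat w_{\lambda}\|}_{\text{optimization error}}
\;+\;
\underbrace{\|\hat w_{\lambda} - w_{\lambda}\|}_{\text{sample error}}
\;+\;
\underbrace{\|w_{\lambda} - w^\dagger\|}_{\text{approximation error}} .
```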
What we can also do — and it's what I will do now — is to take another direction and abandon this decomposition, trying to have one that does not split, let's say, the statistical part from the optimization one. OK, so the stochastic gradient method is maybe the algorithm that does this job. The idea is that we forget the empirical risk and try to minimize the true risk directly — the expected error in this case — with a stochastic gradient descent method, since we do not have access to the true gradient, our measure being unknown. One possibility is to use this simple version, where each stochastic approximation of the gradient is given by a single sample point. In this setting there is no explicit regularization — lambda does not appear explicitly — and we also have several theoretical results, under various assumptions, establishing convergence of this iteration towards the minimizer, or the minimum, of the true risk. But here is the point I want to make: in practice, very often multiple passes over the data are done. Of course, I am considering a situation in which I know how many data points I have — it's not a truly online situation, because I know the horizon, how many data I will see — and in which I have the possibility of visiting them more than once. This is done in practice, and our idea was to try to explain from a theoretical point of view why this works. OK, so we focused on the simplest version of multiple-pass stochastic gradient, which is a cyclic version, and I rewrite the iteration in order to make some points, as you can see here. OK. If you look at multiple-pass stochastic gradient, it can be immediately interpreted as a minimization procedure for the empirical risk. This is well known: if we visit each point infinitely many times, we will converge to the minimum of the empirical risk. But then the other
observation that can be made is that each step is a stochastic gradient step, as before. Why do I use this strange decomposition with an inner loop and an outer one? Because I wanted to put in evidence this number t, the number of passes over the data, usually called the number of epochs. OK, so the main question here that I try to answer is: how many passes should we do — how many times should we visit our data — in order to minimize the true risk? So, as I told you, convergence of the iteration to the minimum of the empirical risk is well established, starting from Bertsekas but also before; let's say it's classical. What we would like is to prove this kind of result, and the idea is to exploit stability properties of our stochastic gradient, and early stopping. How? We first introduce an auxiliary iteration, which is gradient descent applied to the true risk, as you can see here. I fix the step size to gamma over n — gamma is a constant, n is the number of points — and I write the gradient descent in this strange way to make the comparison with the incremental method easier: w t plus 1 will be the result of, let's say, n times (t plus 1) iterations of gradient descent on the true risk. So what happens? We know that, being gradient descent applied to the risk, this iteration will converge to the minimizer of the risk. On the other hand, I have my multiple-pass stochastic gradient applied to the empirical risk, and I will prove that for a certain time these two iterations are close to each other; but then, after a while, my iteration deviates from my goal and goes towards the minimizer of the empirical risk. So the job here is to detect, let's say, the zone where the approximation of my true objective holds. "I'm a bit confused here, because if you're using a fixed step size and you're not averaging, then you're not converging."
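A minimal finite-dimensional sketch of the cyclic multi-pass iteration just described — step size gamma over n, an outer loop over epochs and an inner cyclic loop over the data; all names and sizes below are my own illustrative choices:

```python
import numpy as np

def multipass_sgd(X, y, gamma, n_epochs):
    """Cyclic (incremental) stochastic gradient for least squares
    with constant step size gamma / n: an outer loop over epochs
    and an inner loop sweeping the n points in a fixed order."""
    n, d = X.shape
    w = np.zeros(d)
    step = gamma / n
    for _ in range(n_epochs):          # t: number of passes (epochs)
        for i in range(n):             # one cyclic pass over the data
            # stochastic gradient of the square loss at point i
            grad_i = (X[i] @ w - y[i]) * X[i]
            w = w - step * grad_i
    return w

# Tiny usage example on noiseless synthetic data
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
w_star = np.array([1.0, -2.0, 0.5])
y = X @ w_star
w_hat = multipass_sgd(X, y, gamma=0.5, n_epochs=300)
```

Run long enough, the iterate approaches the minimizer of the empirical risk; the statistical point of the talk is precisely about stopping this process early, so that t plays the role of the regularization parameter.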
So here I cheated a little bit. This one is converging, do you agree? "No." So this is a true gradient method applied to a convex functional, with fixed step size. "Oh, there was no x i. OK, sorry — I meant this one, not that one." For this one it's not quite true, let's say: it's true in function values for sure, but here we do not assume strong convexity, so that's the subtle point — but it's not essential for the presentation. OK, so before giving the theoretical results, let's see what happens in practice. What we did here: it's a linear regression problem; we generated some random points uniformly in R^d and some noisy measurements, then we divided these points into a training and a test set and applied this incremental stochastic gradient. What happens is that if you look at the training error — we know that, since it's an incremental method, it will not be a descent method — it converges towards the minimum. But if you monitor the error on the test set, which should be a good approximation of our true, expected error, what happens is that after some iterations this curve starts to go up again; we enter, let's say, an overfitting region, and we had better stop our iterations here instead of going until the end. "How many points is this?" So, yes — I don't know, it's an old figure; I think it's around 30, 50, less than 100, and this is the number of iterations. I think the right answer is 30 points, and this is the true number of iterations, so here is really the first epoch. The first epoch is very effective here, and then the error starts decreasing a little more slowly. "Do you know what function — what r is — for this particular simulation? Sorry, what function, or what r?"
It's a linear regression, and I don't remember what d is — it's a plot to show the behavior — but it's a linear function on some R^d, so the x_i are in R^d and the y_i are in R. OK, so here there is another example, another point of view, another perspective on a similar (not the same) problem. This one I know better because it's more recent. Here we try to approximate — again a linear regression problem — a function that can be written as a linear combination of trigonometric functions; here we have 40 trigonometric functions and, if I'm not wrong, around 30-40 points. OK, so this is the result after the first epoch of incremental gradient, this is what happens after the tenth, and this after the 100th epoch. The idea is to stop the iteration in the middle, in order to achieve a reasonable, regularized approximation of our points. OK, so the result: if we assume only boundedness and existence of a solution, without assuming any source condition, what we can prove is the following. The step size, as you can see here, depends on a constant that is unknown in general, and we can prove that this estimator is universally consistent — it converges almost surely to the true solution — if we assume two things: first, the number of epochs goes to infinity as the number of points grows; and second, it does not go to infinity too fast. So the comments that we can make on this statement are three. This step size is fixed, a constant, but I recall that it is divided by n in the implementation. What this says is that, on one hand, early stopping is needed to achieve consistency — this can be interpreted as an early stopping rule, since the number of epochs should not go to infinity too fast — but on the other hand we need multiple passes: what this result says is that with our analysis, with this choice of step size, you need multiple passes. There are
single-pass results. "Single pass — I know that it does the exact same path." OK — yes, it is needed; what I am saying is that it is needed with this choice of step size. OK, so just a brief comment on this: what we proved is that if you take a very small step size, gamma over n, and you do multiple passes, where the number of passes is of the order of this quantity here, then you have consistency. On the other hand, what we can prove is that if you do the one-pass stochastic gradient method, choosing a larger step size and just one epoch, then you achieve the same result. OK — here again is the same imprecision that I mentioned before: I am stating this for the iterates, for the w's; this is not known, I think. So the idea is this one: if we care about the error, why do multiple passes make sense — why should I use a smaller step size and visit the data more than once, instead of doing only one pass with a larger step size? I hope I will convince you that this makes sense in the following slides, and that's where the source condition comes into play. If we want finite-sample bounds, we need to assume a source condition, and in this case we can still prove convergence with high probability, with this rate here — this exponent here — if the number of passes over the data is proportional to, depends on, the regularity of our solution. So, some comments again: what does this tell us? First, the rates are again minimax, so optimal — the same that I showed at the beginning for the Tikhonov method. Good news: there is no saturation with this method. Even as r goes to infinity, this rate keeps improving, up to one over square root of n. And the stopping rule: what happens is that the only thing I have to tune — where I need to be adaptive — is the stopping rule, therefore the number of passes over the data. So also here, as in the
Tikhonov case, I can get adaptive results by using, for instance, again a balancing principle. OK. So now I'd like to do a second round of comparison with one-pass stochastic gradient, in order to compare with the existing results that are, let's say, in the same setting as ours — an infinite-dimensional setting taking into account the source condition; I add this additional regularization term in order to compare all of them. OK. So the theory of stochastic gradient, as I said at the beginning, has a long history: convergence in the finite-dimensional case for strongly convex functions is classical and goes back to Robbins and Monro, and there have been many other developments later. In the infinite-dimensional case, and specifically in reproducing kernel Hilbert spaces with the square loss, I would say the analysis starts from the paper by Smale and Yao in 2006. So there are three papers with which I think it is useful to compare our results: this paper by Ying and Pontil, this one by Tarrès and Yao, and this one by Dieuleveut and Bach. OK, first I mention some things that are less relevant — not less relevant in general, but less relevant for what I'm going to say later. This is the only paper that is able to obtain capacity-dependent bounds — the same that hold for Tikhonov regularization — that is, able to take into account the effective dimension of the problem, interpolating between the finite and infinite dimensional case. The other two papers instead give capacity-independent rates, as ours. And as you can see there is a difference: we have some methods which have saturation — for instance this one, or this one — and this one by Ying and Pontil, which does not saturate. There are also different results in expectation and with high probability — this one is with high probability, as ours — and usually the analysis to obtain high-probability bounds is more involved. So now I want to make the only comment that I think is really relevant here. As you
can see here — and by the way, all these papers obtain optimal rates, minimax rates, so from a statistical point of view all the methods that I presented are indistinguishable — the point is that in order to achieve these rates, in all these papers the step size depends on the source condition, on the parameter r. Here we have only gamma, also here, and here we have both gamma and lambda. "Is that also the case in the third paper?" Yes, also here the gamma depends on the source condition. So here we have the new picture: basically we have two regimes in which we can achieve minimax rates. One is the regime where you take a small, universal step size — not dependent on the source condition — and a number of passes over the data that depends on your source condition. The other is the regime where you take a bigger step size and do only one pass, possibly with averaging. In both cases, what I want to say is that model selection is needed in order to achieve minimax rates: here you have to select the right step size, and here you have to select the number of passes over the data. From this point of view, what I'm claiming is that our method comes as a natural approach, in the sense that what you can do, for instance, is to split your data, keep a test set, monitor online the behavior of the test error, and simply stop when overfitting happens. That, I think, is the main difference: who plays the role of the regularization parameter — here it is the step size, here it is the number of passes. And what I'm saying, I repeat, is that being adaptive in the number of passes over the data is natural and can be done. "Is it too far to say that all that really matters is the sum of the step sizes being kind of the same, whatever approach you use?" Yes — so here, all the analysis that I have in
mind is with the constant step size, let's say a constant divided by n; but yes, for instance for the approximation error — that is, the optimization error here — what really matters is the sum of the step sizes, and I believe the same holds more generally. "So if you use different step sizes, as long as the sum stays kind of the same, you get the same?" So, the point is that from our analysis you can derive some convergence for this case, for t star equal to 1, if you take gamma into account and let gamma depend on the source condition; but the point is that we do not get optimal rates, and the step sizes derived from our analysis are too small. So an open problem, a future direction of research, would be to understand whether there is an analysis that allows us to interpolate between the large step size with one pass and the small step size with multiple passes. "Coming back to the triangle of computation, statistics, and so on: the multiple passes actually cost much more in gradient computations?" Much more, it's true. But it depends what you have to do here in order to select the right step size: probably you will have to do multiple passes anyway, maybe on less data, just to adjust the step size — I don't know. "Why not use a constant step size?"
"I'm sure you can derive similar results with gamma t equal to a constant gamma, with early stopping, which should happen much earlier." So, gamma constant: in our case, at the moment, we are not able to do it; it would be great. "It's always convergent — if you do a single pass with a fixed gamma it is always convergent; it will just be very slow. With multiple passes you reduce the variance, maybe, I don't know." I think you need smaller step sizes, because, as you can see, if you look at the sample error — the stability — it increases with gamma; therefore I think the sample error needs a smaller step size, so you can only go smaller and do multiple passes; you cannot go bigger. "So you need to do less than in a single pass?" Exactly, yes, it's exactly like this. We can prove — you can derive from our analysis — that you have convergence for t star of n equal to 1, and gamma should then be a gamma of n, increasing. I don't know if this answers the question. So, just one last comparison: there is also gradient descent. We know that gradient descent on the empirical risk with early stopping works, and what I can say is that, both from the computational and the statistical point of view, there is apparently no difference between multiple-pass stochastic gradient and the full gradient. This is one point to be understood — and I have to say it is one point to be understood also in the optimization scenario. "The same comparison, but with gradient descent — can you say more?" You can prove that the stopping time is the same: if you do the mapping — one epoch equal to one pass — it is exactly the same; same stopping time, same step size. "Can you use a constant step size?" You can use a constant step size for the batch-mode problem, but it should be smaller, I think: the Lipschitz constant of the empirical gradient has the one over n appearing. "Is it capacity dependent?" Yes, for sure this one; ours, at the moment, we don't have it.
It seems feasible if you combine what you did; it might be possible to combine the two, and to get that would be nice.

On this, as I was saying, this is an open problem also in optimization: it is not clear what the advantage is of an incremental gradient method applied to a sum of parabolas instead of a gradient method. What is known, empirically observed at least and to some extent theoretically justified, is that at least when you are far from the minimizer, at the beginning of the iteration, one pass of the incremental method is more or less equivalent to one full-gradient pass, until you enter the confusion region, where you are close to the minimizer but the minimizers of your summands are away from it and push you away from the true one. So this, again, could be one reason why, empirically, the incremental gradient has a better performance than the full one.

OK, so maybe some idea of the proof, just to see a tradeoff, a new tradeoff. I recall that I am comparing my iteration with the true-gradient one, and so the bias-variance tradeoff this time is obtained by bounding this quantity here, the difference between our, let's say, multiple-pass stochastic gradient iteration and the gradient iteration on the true risk. Now the bias is exactly the optimization error, because I am applying the gradient method to minimize a function; and here, instead, the sampling comes into play. So, as I showed you before with the picture, the idea will be to prove that this quantity is increasing with t and decreasing with n, while this other quantity, of course, is decreasing with t; the idea of the proof is therefore to balance these two terms. OK, so I prepared just a few slides to give an idea of the proof. As I said at the beginning, so I don't forget: as you can already see from here, the square loss plays a crucial role, because otherwise I do not have this closed-form expression of the iterates; so the
idea is that I can write the iterate at the end of one epoch as a perturbed gradient descent iteration. So it is almost a gradient descent step with step size gamma, but with a perturbation term here, and the idea will be to compare this with the analogous iteration in the continuous setting. So here I have this operator, which is just the covariance operator, the second moment; this one is an element of H; and then here you have these two quantities that are quite complex, as you can see, and that is the point. The same you can do for the iteration on the expected risk, and you get a very similar expression, with very similar A and B. What you do, basically, is write the difference between these two iterates as a sum involving an operator whose norm is less than one, which you can forget, plus this sum here, where you always have, let's say, an empirical mean and its expectation; you compare the empirical mean with the expectation, and so basically you can apply a concentration inequality. The only problem is the two complex terms, because they are not sums of independent variables; but they are still sums of martingale differences, so the concentration inequality still applies, and then you get a bound of this form, increasing in t and decreasing in n, as I was saying, with high probability. The approximation error instead is standard and well known: you can prove that the rate depends on the source condition and on the sum of the step sizes, which are constant. The final result is obtained by simply balancing these two terms in the expression that I showed.

OK, so, the contributions: I think we add some results that are among the first trying to explain the generalization properties of some widely used heuristics for stochastic gradient, more specifically multiple passes and early stopping. Some future work I already mentioned. So, I think that's all, thank you. OK, this is the paper.
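The first step of the proof sketch, one epoch as a perturbed gradient step, can be checked numerically on the same kind of toy 1-D least-squares model (names and constants here are illustrative assumptions, not the talk's construction): the difference between one cyclic pass and one full-gradient step with the same step size behaves like a second-order term in gamma.

```python
import random

rng = random.Random(1)
n = 40
xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
ys = [0.8 * x + 0.2 * rng.gauss(0, 1) for x in xs]

def full_gradient_step(w, gamma):
    # one step of gradient descent on the empirical risk (1/n) sum (w x - y)^2
    return w - gamma * sum(2.0 * x * (w * x - y) for x, y in zip(xs, ys)) / n

def one_epoch(w, gamma):
    # one cyclic pass of the incremental iteration, one summand at a time
    for x, y in zip(xs, ys):
        w -= gamma * (2.0 / n) * x * (w * x - y)
    return w

w0 = 3.0
r_big = abs(one_epoch(w0, 0.1) - full_gradient_step(w0, 0.1))
r_small = abs(one_epoch(w0, 0.05) - full_gradient_step(w0, 0.05))
# halving gamma shrinks the remainder by roughly (0.1 / 0.05)**2 = 4
print(r_big, r_small, r_big / r_small)
```

The quadratic scaling of the remainder is what makes the epoch comparable, to first order, with one step of the iteration on the expected risk.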
Further questions and comments for Silvia?

So there was no notion of randomization within a pass?

OK, so, we started from the cyclic one; I think it was the easiest, though I don't know if, from the point of view of the analysis, it is the easiest one. In any case, I think that Lorenzo Rosasco with Junhong Lin are already trying to generalize these results both to losses other than the square loss and to other sampling techniques, in particular the stochastic one, where you visit your points multiple times but randomly select one at each step. What is still missing, and would be very interesting, is the random reshuffling.

Because here you are saying: my data came from some distribution, i.i.d., then I have whatever order and I am just running through it, and you are saying everything is fine?

It works; it works also in the deterministic case, as a pure optimization scheme.

But here it is all asymptotic, I guess?

No, the key is that you can do that: you run up to the point where the two error terms balance, and this is the t*(n).

I am just trying to reconcile this kind of result with the earlier results we talked about during this workshop, on random permutation versus random sampling.

The random sampling comes from the fact that your points are i.i.d., and that's it, I think. After that, this is, let's say, just an optimization procedure to minimize the empirical risk, and you can choose to visit your points randomly, to visit them cyclically, or to visit them by randomly reshuffling them after each epoch. From the optimization point of view this last policy seems empirically to be the best one, and there are some very recent results showing that it works also in theory, although I don't know the constants exactly. What is known, at least up to my knowledge, is that in the deterministic setting the incremental choice of the points usually behaves worse than the
stochastic sampling, but for instance in the strongly convex case they are the same. Then there is a recent paper by Gürbüzbalaban, Parrilo and Ozdaglar.

Yeah, but there the constant is awful.

Yeah, the constant, yes, it's a matter of the constants.

Yeah, but still it's linear, so, let's say, we don't see this effect here.

I don't think so. What could happen, if your optimization procedure is better, faster, is that you can probably stop earlier, but I would expect this more from an accelerated method than from the random reshuffling one, yeah.

Another question for Silvia?

Yep. Could you show your convergence rate again?

Again? Yeah. The best rate is for R going to infinity. Yeah, but you cannot choose it; it comes with your problem, so it is a datum of your problem.

So what's the typical value of R? I don't know, I don't know.

In the traditional framework the convergence rate is one over the square root of n; is your best convergence rate also one over the square root of n? Be careful, because here we are looking at convergence of the iterates and not of the error: the square root of n that you are talking about is on the error. OK, so in this case, on the error, the rate would be 2R/(2R+1), so it should be better than one half, I hope; you should compare one half with this. So when R goes to infinity, this goes to one over n.

OK. Other questions?

A question related to regularization: basically, this approach is not using regularization at all. So I basically have two questions. If I know the value of R, which one is going to be faster, early stopping or... It's only one pass, so I would suggest using one pass, because it is one pass. The point is that usually you don't know it. The regularization comes from the early stopping, if you want. The idea, which is well known in the inverse problems literature, is that you in a sense exploit a self-regularizing effect of iterative procedures, so you can interpret them as regularization procedures, and therefore the early stopping of the iteration allows you to get a form of implicit regularization.

I understand, but could adding some regularization make your problem more strongly convex? It would help you to make progress at the beginning; is there a way to add regularization? Basically you want to have the cake and eat it too.

No, I think it works: you can add it, you can apply this analysis, although it can probably be refined. What I fear is that the lambda will depend on R again, so you would have something like this, where you should then cross-validate also on this parameter. If you use a fixed lambda, I don't know; if you want to approximate the true solution, I don't think so. I meant that you could allow lambda to change with t and not only with n. So, sorry: I compared my results with the results in the online setting where the horizon is known, because this makes more sense, in the sense that we know the horizon, and then all these quantities are constant with respect to t; but in the true online scenario you would have something that varies with t, more or less in the same way, so it will not be constant. Yes, you could do it, but I guess that it will again depend on R also in this case.

Last question: when you don't know R, what kind of cross-validation do you use? Cross-validation; also for the guarantees, you can achieve the same rates with cross-validation for the risk. For the iterates you need a balancing principle, but since the stopping time is the same both for the risk and for the iterates, you can play a little bit with this and use cross-validation.
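As a concrete footnote to the last two exchanges, here is a minimal hold-out version of cross-validated early stopping, run with two of the point-selection policies discussed above (cyclic and per-epoch reshuffling). The model, split sizes, step size, and function names are illustrative assumptions, not the procedure from the paper.

```python
import random

def cyclic_order(n, rng):
    return list(range(n))                   # same fixed order every epoch

def reshuffled_order(n, rng):
    idx = list(range(n))
    rng.shuffle(idx)                        # fresh random permutation each epoch
    return idx

def early_stopped_sgd(order_fn, epochs=100, gamma=0.3, seed=7):
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(60)]
    ys = [2.0 * x + 0.8 * rng.gauss(0, 1) for x in xs]
    tr_x, tr_y = xs[:40], ys[:40]           # training split
    va_x, va_y = xs[40:], ys[40:]           # hold-out split

    def val_error(w):
        return sum((w * x - y) ** 2 for x, y in zip(va_x, va_y)) / len(va_x)

    n, w = len(tr_x), 0.0
    best = (val_error(w), 0, w)             # (validation error, epoch t, iterate w)
    for t in range(1, epochs + 1):
        for i in order_fn(n, rng):          # one pass with the chosen policy
            w -= gamma * (2.0 / n) * tr_x[i] * (w * tr_x[i] - tr_y[i])
        best = min(best, (val_error(w), t, w))
    return best

results = {}
for policy in (cyclic_order, reshuffled_order):
    results[policy.__name__] = early_stopped_sgd(policy)
    print(policy.__name__, results[policy.__name__])
```

The epoch with the smallest hold-out error plays the role of the stopping time t*; since, as noted above, the stopping time is the same for the risk and the iterates, this simple selection rule is enough in practice.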