OK, so first I would like to thank the organizers for inviting me here; it's a pleasure to be here with you today. I wasn't totally sure about the level of technicality I should put in my talk, so there will be pictures at the beginning and then a bit of maths, and I hope everybody will get something out of it. I should mention that the technical part is joint work with Éric Moulines from Télécom ParisTech.

OK, so a bit of scientific context. You must have heard about big data in the last few years. What has changed over the last ten years is that technical progress has given us very cheap computation, very cheap storage and very cheap sensors. Let me give a few examples gathered from the web. This is Moore's law for CPUs: essentially, computing power doubles every 18 months or two years. People keep saying this should slow down, but it is still going up, so computation, like everything else, keeps getting faster. That should make our lives easier, but the problem is that, at the same time as computers get faster, people store more data. This is the Moore's law for the cost of storage: one gigabyte, which is essentially one movie, 200 pictures or 30,000 e-mails, cost about one million dollars in 1980, and now it costs a few cents. So clearly storage is not an issue anymore. Another huge technical progress is on the sensor side, so that most scientific disciplines and companies can now collect data, because the cost of acquiring data has been reduced significantly. Take DNA sequencing: the first human genome was very expensive, something like 100 million dollars, but as the technology matured it became less and less expensive, and I think if I added the latest figure it would be even lower. Now people are starting to do large-scale DNA sequencing; I think the United States is going to sequence something like 10% of its population. So it is now much cheaper, and there is a lot of data.

But I think one of the main messages, now that everything is bigger, is that the main change for me in "big data" is not the word "big", it is the word "data". Size is not the only interesting thing. The novelty is that most companies and most scientific disciplines are starting to do business, or science, based on data. Sometimes it is big, sometimes it is small, and I will give examples where it is big and examples where it is not. The other key word is variety: there are many different types of data. In all the examples I am going to show you, there will always be two magic numbers, as there are in statistics: n observations in dimension p. For example, you have n images in dimension p, p being the number of pixels. So in all my examples, keep in mind that those numbers may be quite big.

OK, so the first example of where the money comes from is search engines. Whenever you type a query on Google, Bing or Baidu, there is a lot going on behind the scenes, and essentially there is machine learning going on behind the scenes.
Essentially, what are n and p here? n would be the number of clicks, or the number of clients of those search engines, so this can be billions. And what is p, what is recorded for each client? This would be the entire search history, if not more, so p can be the list of all potential websites in the world, which can be billions or more, with a one where you have visited that website and a zero otherwise. Given that data, they are going to order the results depending on your own preferences. For me "Tour de France" may mean touring the country, but you may want the Tour de France, the cycling race, and that is what gets output for you. So whenever you type something on Google or Bing, there is large-scale machine learning going on.

Then there is marketing. This is a huge aspect, one of the main applications of machine learning right now: whenever you go on Amazon or any online merchant, you get proposed objects which are supposed to be tailored to your needs. This is my Amazon account and they propose me skirts, so there is still progress to be made, there is still research to be done to make this better. Or maybe I do wear skirts, but that is my own private life.

OK, so let me move away from direct business to other disciplines heavily using data. This one is closer to my interests, which is computer vision. The task of object recognition, "give me an image, tell me if there is something in it, tell me if there is a dog, a cat or something", is immediately cast as a machine learning problem. As input you have n images, and n is large: the current benchmarks are millions of images, and if you are Google or Facebook you have billions of such images. And each image is a big object, because you have millions of pixels per image. A very similar problem, which looks very different but in the end for us is almost the same, is bioinformatics. You essentially replace images by proteins and object classes by the function of the protein. We have a lot of proteins, I think more than 2 million different proteins for humans, and each protein is a complex object: it is a sequence of amino acids, it is a molecule in three dimensions, so the data is complex and high-dimensional and you have lots of it. The funny aspect is that many of the techniques developed for computer vision may be used for bioinformatics and vice versa, and machine learning provides a good way to abstract the problems.

The last one is very big, so big that it is not even open yet: astronomy, where you record things from space. They are currently building the Square Kilometre Array, which should be ready in about ten years, and the output would be something like 10^9 gigabytes per day. This is big data, very big data. And the last one, which is not big data and which I removed from my slides, is personal pictures. Typically I have a slide, which I forgot, where I show my own personal pictures organized, like most of us, in a flat folder, and it is a very small problem.
It fits on my hard drive, it fits on my cell phone, but it is quite complex, because it is hard to distinguish between several kids, in particular from the same family, and organizing all this is a huge challenge. It is small, it is not big. So this is just to give you examples of problems which are big, like this one, and problems which are small but still complicated, like computer vision in some setups.

OK, so to summarize this introduction: I have a lot of problems where both p and n are large. p will always be the dimension of my inputs, and if you wish we can take computer vision as a running example, where p is the number of pixels and n the number of observations; I have shown many examples. The aspect I will try to address today is running time. Whenever you want to run any algorithm on this type of data, when both p and n are large, you have to be careful about running time, and you want to avoid any non-linear complexity. You want your algorithm to be at most linear in the size of your data: if you make no assumptions and you have n objects in dimension p, it takes O(pn) just to read the data, and you want your algorithms to run in that time complexity. Of course, as soon as you say "I am going to run fast", you have to accept a trade-off in terms of predictive performance: if I do nothing, it is O(0) in terms of running time, but it is not very helpful. So there will always be a trade-off between optimization, where you want to go fast, and statistics, where you want to predict correctly. This is one of the main themes.

Maybe the other theme, which is a bit amusing, is that we go back to very simple methods from the 50s, Robbins and Monro; I think I have a slide on that. This is the last picture I am going to show: one of the leading computers of the 50s, the IBM 1620. I have no clue what it is, but I have been told it was a good computer, very expensive, around 100K dollars, very slow, and very big in terms of size as well. At that time computers were not powerful enough, so people created efficient algorithms to run on those computers. Now we are in the 2010s; I don't have a Huawei phone, I have another brand of cell phone, but this is much more powerful, more beautiful as well, and a bit less expensive. So computing power has increased a lot, but what has increased even more is the size of the data. Data have outgrown the computing power, and now we need to use the same algorithms, essentially Robbins–Monro, where we look at the data only once. This is the master equation which I will go over heavily in this talk, but to me it is amusing that, for different reasons, we come back to the same algorithms, while for 30 years, when computers were powerful enough to deal with the data we had, we did not need them. So there is a cycle, due both to technical progress and to the increase in the amount of data.

OK, so this will be the outline of my talk. I will spend some time reviewing what I mean by supervised machine learning, and I will go over the main algorithm that people use, which is called stochastic gradient. The only technical concept that I will introduce in the talk is the concept of strong convexity; it is like the classical conditioning of an optimization problem: if you are well conditioned it goes fast, and if you are ill conditioned it goes slow.
Those numbers will drive the convergence rates of the algorithms: the quantity of interest will be the performance of your algorithm minus the optimal performance on your problem. This has to go to zero as n, the number of observations, goes up, and of course you want it to go to zero fast. The magic numbers will be 1/n and 1/√n as convergence rates, 1/n being better than 1/√n. We will see that in the classical analysis of those algorithms you have lower bounds of complexity, rates which you cannot beat, and we will show that we can beat them by, of course, adding some assumptions: this will be least squares at first, and then smooth losses, to beat the classical lower bounds of optimization.

OK, so a bit of notation. I am always going to consider n observations (x_i, y_i): x_i can be your image and y_i a real number coding the presence or absence of an object in the image. We assume the n observations are independent and identically distributed from the same distribution, and we are going to assume that we do linear predictions. This is a strong assumption, but you have to be careful: it is linear in the parameters of the problem, not linear in the inputs. In Φ(x) you are going to put all your knowledge of the problem: if you do computer vision you encode shape, color, texture in Φ(x), and these are of course non-linear functions of your inputs; if you do bioinformatics you encode whatever you know about the biology and chemistry, and so on. This is where you put all the expert knowledge, and once you know Φ(x) it becomes an abstract problem where you do linear predictions; this is what I am going to consider today. I am going to assume I have p such features, p being quite large.

So I am going to consider very classical regularized empirical risk minimization, where I minimize with respect to my predictor θ an objective function which is the sum of a data-fitting term, where I go over my data and average the loss ℓ, the cost of making this prediction when the true output is y_i, and a regularization term; I will describe the losses on the next slide. The goal is to minimize this; you can take least squares as one example, and we will need to regularize for several reasons, one of them being that you want to predict well on future data.

So what are the usual losses? For this talk you can think of least squares: it is the simplest loss, and it is adapted to the case where the output is a real number. This will be one of the main motivations, but in many instances you want to predict a binary random variable. Look at one of the most used instances, which is advertising, click prediction: whether you are going to click or not on an ad, so the output is click / no click, y will be 0 or 1, or −1 or +1. Clearly, if you use the square loss you are going to force things and get bad predictions, because the loss is not adapted. So what people have designed over the last 20 years are losses adapted to binary classification, where the output is in {−1, +1}: +1 will be a click on the ad, −1 no click, and you are going to predict with the sign of your linear function.
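In symbols, as a hedged rendering of the objective just described (the squared-norm regularizer written here is the one the talk comes back to later), the regularized empirical risk minimization problem is

\[
\min_{\theta \in \mathbb{R}^p} \;\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i,\, \langle \theta, \Phi(x_i)\rangle\big) \;+\; \frac{\mu}{2}\,\|\theta\|_2^2,
\qquad \text{e.g. } \ell(y,a)=\tfrac12 (y-a)^2 \text{ for least squares.}
\]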
Of course the linear function outputs a real number, so you threshold: you take the sign of that real number. Essentially you make an error if the sign of your prediction is not the sign of your label, so if you take the product of y times your prediction, when that product is positive you make no error and pay no cost, and when that product is negative you pay a cost. So what you are going to be judged on is a function of your prediction which depends only on the product of y times your prediction, and that function is the 0-1 loss, the blue one here. This is the metric you will be judged by when the algorithm is deployed; the problem is that this metric is not convex, not even continuous, so it is very hard to optimize. So what people have been doing, I think it was a hot topic in the 90s and 2000s, is to design convex surrogates of that 0-1 loss. Those have names which you may be familiar with. The first one, in turquoise, is the square loss: a very simple surrogate, but as you can see it over-penalizes very good predictions, which is quite bad. What has been leading the pack for the last 20 years is the loss in red, called the hinge loss, which leads to the support vector machine, which you may have heard of. And the green one is the logistic loss, the one heavily used in industry, and it is a smooth loss. This will be a major aspect of my talk, that smoothness helps you optimize, and this is why in this talk I am going to consider only the logistic loss in green and the square loss in turquoise. So from now on I will assume that I have one of those two losses.
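As a small illustrative sketch (my own, not from the slides) of those losses, written as functions of the margin u = y · ⟨θ, Φ(x)⟩ for y in {−1, +1}:

```python
import numpy as np

def zero_one(u):
    """0-1 loss: 1 if the sign of the prediction is wrong, 0 otherwise."""
    return (u <= 0).astype(float)

def square(u):
    """Square loss written in terms of the margin (equals 0.5*(y - a)^2 when y^2 = 1)."""
    return 0.5 * (1.0 - u) ** 2

def hinge(u):
    """Hinge loss (support vector machine)."""
    return np.maximum(0.0, 1.0 - u)

def logistic(u):
    """Logistic loss: smooth, heavily used in industry."""
    return np.log1p(np.exp(-u))

margins = np.linspace(-3, 3, 7)
for name, loss in [("0-1", zero_one), ("square", square),
                   ("hinge", hinge), ("logistic", logistic)]:
    print(name, np.round(loss(margins), 2))
```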
So let's go back to my original problem. You then have two quantities of interest which are super important in machine learning. You have the training cost: this is the cost you can observe; give me your data and your predictor, I compute the loss on the training data. This I have access to, and this is what I minimize, but I don't really care so much about it. What I really care about is the prediction on unseen data, which I call the testing cost, F: the expectation, over pairs of inputs and outputs coming from the same distribution, of the same loss. I have access to the training cost, but I really care about the testing cost. So there are two main questions that people tackle in statistics and machine learning. The first one is to compute θ̂: give me the data, I want to compute this minimizer; it is a pure optimization problem. The second one is a pure statistical question: given θ̂, the minimizer or something close to it, does it do well on the testing cost F? So you have the training cost, which you have access to, and the testing cost, which you really care about, and the key in recent methods is to tackle the two simultaneously: you don't want to separate this into two pieces, people who do optimization and people who do statistics; we are going to merge the two together.

Now a few assumptions; this is very simple. I am going to assume that my functions are smooth, and what I mean by smooth is that the second-order derivatives are bounded from above: for all my Hessians, the eigenvalues are less than L. Very simple: on the left a smooth function, on the right a non-smooth function. In the context of machine learning, for my training loss, if I assume that the loss ℓ is differentiable and smooth, then essentially the Hessians are proportional to covariance matrices, and to have bounded Hessians one sufficient condition is to have bounded data. This is typically seen as a weak assumption, and I am going to make it throughout the talk. The less weak assumption is when you invert that inequality; this is what people call strong convexity. A function is strongly convex, and I am simplifying a bit, if all the eigenvalues of all the Hessians are bounded from below by a constant: if that constant is zero you get convexity, but if that constant is strictly greater than zero you get strong convexity. On the left you see a convex function which is not strongly convex, because it has a flat part over there; on the right a strongly convex function, which has curvature in every direction. This is in 1D; in 2D the traditional image is this one, where I plot level sets, contour plots, of a function: here is the global minimum and you go up in all directions, so this is a function with a good condition number, a large value of μ, and this is a function with a small value of μ. These are really the main pictures, and as you can guess, the first ones will be easier to optimize than the second ones; that is the message of the top of the slide.

So why is strong convexity a big assumption for machine learning? Because if I take the same problem, where I have an average over my training data of a loss function, and if I assume the square loss for simplicity, then the Hessians are all equal to the covariance matrix. The covariance matrix is a matrix of size p obtained as a sum of n rank-one matrices, so its rank is at most n. If p is bigger than n, which is common in applications, think about computer vision where p may be millions, then it will never be invertible, so you will never be strongly convex. In that sense most modern problems are not strongly convex; you always have directions of very strong correlation. What people have done to deal with this is: if you are not strongly convex, add something to make it strongly convex, and this corresponds to adding a squared norm. The problem is that while this makes your problem better behaved and easier to optimize, you are now optimizing something different: you add a bias to your problem. So whenever you see a regularizer like this, you should think of μ tending to 0 as you get more data. This is very important: μ is not free; if you add μ you make the problem better behaved, but you don't solve the correct problem. So in what follows I will do another type of regularization.

OK, so now let's review the classical algorithms for optimization. This is very basic: if you assume a convex and smooth function, gradient descent is the most classical algorithm. It is an iterative algorithm where you start from any point and you go down in the direction of the negative gradient, with a small scalar step size, and the key is that this algorithm is known to converge to the global optimum of the function if you assume convexity. The speed at which it converges will depend on the easiness of the problem, which is characterized by the presence or absence of strong convexity. If you take a really easy problem, strongly convex, then the gradient pushes you in a good direction very quickly, and you can show that you get an exponential convergence rate, often called a linear convergence rate: every time you make an iteration, you divide the cost by a fixed amount.
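As a minimal sketch (mine, on a synthetic least-squares objective) of the gradient descent iteration just described; the 1/L step size used here is a standard choice for an L-smooth function, not something prescribed in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

def grad(theta):
    """Gradient of the full least-squares training cost 0.5/n * ||X theta - y||^2."""
    return X.T @ (X @ theta - y) / n

L = np.linalg.eigvalsh(X.T @ X / n).max()   # smoothness constant (largest Hessian eigenvalue)
theta = np.zeros(p)
gamma = 1.0 / L                              # small scalar step size
for t in range(500):
    theta -= gamma * grad(theta)             # go down the negative gradient
print("final training cost:", 0.5 * np.mean((X @ theta - y) ** 2))
```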
So those are the easy problems. The problem is that for harder problems you go from something exponential to something much slower, because you tend to oscillate a lot: you make very small steps when you have a long valley like this. So this may be bad news, and the even worse news is that this cannot really be beaten, in the sense that if the problem happens to be hard, it's hard; it is not because the algorithm is slow, it is not because of the algorithms but because of your problem. You have lower bounds saying that you cannot really beat this: the best you can do is 1/t², for example, which is faster than 1/t, but you cannot do better if the problem is difficult. What gradient descent does give you, though, is adaptivity, in the sense that you don't need to tell it in advance whether the problem is going to be easy or not: just run the algorithm and it adapts automatically. So you see the key advantage of those algorithms is adaptivity; if you are not adaptive, you are going to be always very slow.

Of course people have noticed that you want to avoid those oscillations, and people have considered the Newton method, where you replace the scalar step size by the inverse of the Hessian. This converges quadratically, so you get very quickly a lot of significant digits in your solution, but you first have to build the Hessian, which is a p by p matrix, so this is going to be hard when p is one billion, and then you have to solve a linear system, which is even harder. So people don't consider this with large amounts of data, because every step is too costly. And, even more important, this is a very big difference between machine learning or statistics and optimization: we don't care about high precision. Newton will get you the answer up to machine precision very quickly, in a small number of iterations: every iteration is very slow, but you need only a few of them. But we don't care, "we" being people in machine learning; if you go to people in optimization they may care, and they are right in their setup. In our setup the cost functions are averages, which means they naturally deviate from their expectation by a deviation of order 1/√n. Our functions are already not well defined beyond that, so it is useless to go below that error. This is really a key point: we can use bad optimization methods, which may be very slow for others, but for us it is enough; we want to get quickly to a decent place. That is the first aspect. The second aspect is that in machine learning the cost functions are averages: we don't optimize an arbitrary cost function g, we have averages, and we can exploit this to get faster algorithms.

So this is the main topic of the talk, which is stochastic approximation. What is stochastic approximation? It is a wide field, which I will look at from a very narrow angle. I want to minimize a function f, this is the main theme of this talk, this is my cost which I want to minimize, but I do not observe its gradient. If I could observe the gradient I would simply do gradient descent and be happy; I only observe noisy estimates of my gradients. At every time step I don't have access to the gradient, just a noisy estimate, and the key is that this noisy estimate is unbiased: it has the correct expectation. It may have any variance, as long as it is bounded, but it has zero bias. In our case this is very natural: we are going to consider a noisy gradient obtained from a single observation.
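In symbols (my notation, not the slides'), the oracle just described delivers at step n a vector h_n such that

\[
\mathbb{E}\big[h_n(\theta_{n-1}) \mid \mathcal{F}_{n-1}\big] = f'(\theta_{n-1}),
\qquad
\mathbb{E}\big[\|h_n(\theta_{n-1}) - f'(\theta_{n-1})\|^2 \mid \mathcal{F}_{n-1}\big] \le \sigma^2 < \infty,
\]

so it is unbiased with bounded variance, but otherwise arbitrary.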
For me, this is the most important slide of the talk. What is f in this setup? f will be something that you don't observe: this will be your testing cost, the expectation on unseen data, which is what you really want to minimize, and you would like its gradient, but you don't have access to it. So you are going to consider the loss from a single observation, f_n, which is the loss of the observation y_n against the prediction on Φ(x_n). Give me one data point, I compute the loss and its gradient; if I take the expectation of this, I get the gradient of the test error. This is the main thing, the only thing, that people do, "people" being the community and not just me: realizing that a single data point provides you a noisy, unbiased gradient of the test loss. This is the key point. Of course stochastic approximation goes far beyond optimization, but in our setup we can characterize the behavior, which is the goal of today.

OK, so now let's look at the algorithm. Oops, sorry. I am going to have an iterative algorithm, and the key here is that I have switched from the index t to the index n: t was my index of iterations for gradient descent, but now I am going to use n, the number of observations. Why do I do this? Because the number of iterations will be exactly equal to the number of observations; t is going to be equal to n. Whenever I see a new data point, I compute the gradient of the loss of my current predictor on that data point, and I take a step in the direction of the negative gradient. This is often called stochastic gradient descent, and it is a form of Robbins–Monro. A key novelty from the 90s, which seems super trivial but is very important, is averaging: at the end, when I want to use my predictor, I take the average over all the past iterates, which I call θ̄; this is due to Polyak and Ruppert, and I will show examples later.

Now, the key question is: what should the step size be? For deterministic gradient descent the step size is easy to find, you have line searches, this is very basic; but for stochastic gradient descent it is almost an open problem how the step size should decay. If you don't decay it, you can easily see that you move in the direction of the gradient plus some noise, so if you never decay you oscillate around the optimum and it does not converge. So you need γ_n going to 0, and the key question is how fast. If you read a book on stochastic approximation from the 60s, 70s or even 80s, it will tell you γ_n of order 1/n, because then the sum diverges and the sum of squares converges. But this is only for easy problems, in my context only for strongly convex problems. Then in the 80s and 90s people said you should take bigger step sizes, of order 1/√n; why 1/√n, because the sum diverges and, well, forget about the sum of squares, but anyway this is good for robustness. And what we propose with my colleague is to go even further: go constant. This is the main topic for today.

So let me come back to running time. This is why, in my introduction, the goal was to have algorithms whose complexity is O(np), the size of the data, and this is the case by design: since my number of iterations is my number of observations, I simply take a single pass through the data, and that's it.
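As a hedged sketch (my own code, not the speaker's) of that single pass: constant-step-size stochastic gradient with Polyak–Ruppert averaging, written here for the least-squares loss.

```python
import numpy as np

def averaged_sgd(X, y, gamma):
    """Single pass of constant-step-size SGD with averaging,
    for the least-squares loss 0.5 * (y_i - <theta, x_i>)^2."""
    n, p = X.shape
    theta = np.zeros(p)
    theta_bar = np.zeros(p)
    for i in range(n):
        g = (X[i] @ theta - y[i]) * X[i]            # stochastic gradient from one observation
        theta = theta - gamma * g                    # the "single line" of learning
        theta_bar += (theta - theta_bar) / (i + 1)   # running average of the iterates
    return theta_bar

# Toy data; gamma = 1 / (4 R^2), with R^2 a bound on ||Phi(x)||^2, is the kind of
# constant step size used in this line of work (my choice of constant here).
rng = np.random.default_rng(1)
n, p = 10_000, 20
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)
R2 = np.max(np.sum(X ** 2, axis=1))
print(averaged_sgd(X, y, gamma=1.0 / (4 * R2))[:5])
```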
And the funny aspect, don't take this the wrong way, is that it is a single line of code. What my colleague and I are focusing on is a single line of code. But don't think that to do all the things I showed at the beginning, being Google, Amazon and so on, there is a single line of code: there are millions of lines of code just to prepare the data to run that single line. All the heavy coding is done just to access the data and compute the features, and the learning part is really one line, essentially that one. So we are going to focus on that single line, remembering that you need much more to make things work.

OK, first some bad news. The bad news is that this has already been done by the Russians, good news or bad news depending on your point of view; it was done in the 80s by the Russians. Essentially, the best you can do is stochastic gradient descent, "best" meaning that you have global minimax rates of convergence: if you find something that goes faster on all functions, you are wrong. Those rates are also linked to the absence or presence of strong convexity; the condition number is a key quantity here. If you are strongly convex, the rate is 1/(nμ), so it goes decently fast, and this is achieved by stochastic gradient descent with a certain step size. And if you have a large number of features, a lot of correlations, if your problem is quite hard, then you go to 1/√n, which is also achieved by stochastic gradient descent. So the good news is that we can achieve the best possible rate; the bad news is that for modern problems this is very slow: 1/√n goes to 0, for sure, but quite slowly. The second piece of bad news is that you have to adapt the step size to the difficulty of the problem: you have to decide in advance whether you are in the easy or the hard situation.

Another line of work from the 90s, also by Russians, Polyak and Juditsky, says the following: if you start to use bigger step sizes and you consider smooth problems — here what I have hidden is that the lower bounds hold for all problems, smooth and non-smooth, but since we do the modeling we are free to choose the loss we want, and if we choose a good loss we get smoothness — then what those people have shown is that you get the 1/n convergence rate asymptotically. If n is very large you go from 1/√n to 1/n, and you are free of the condition number, of the strong convexity constant. Remember that μ has to be small, so sometimes 1/(nμ) is bigger than 1/√n when μ is too small. So they showed that this can be done for smooth problems. The question now is: is it possible to merge everything and get a single algorithm that works for smooth problems with a convergence rate of 1/n in all situations? The idea is that you want to be robust to ill-conditioning: you want to go beyond that 1/√n and get 1/n in all situations. This is what we are going to present, and I have about 20 minutes left — yeah, OK, cool.

So let's take least squares; least squares is by far the simplest example, and this is my function for least squares. In that setting stochastic gradient descent is often referred to as least mean squares, LMS; this has been studied a lot, but typically with a decaying step size.
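To summarize the rates just quoted (my compressed restatement, constants omitted):

\[
\mathbb{E}\,F(\bar\theta_n) - F(\theta_\ast) \;\lesssim\;
\begin{cases}
\dfrac{1}{\mu n} & \text{strongly convex, with step sizes tuned to } \mu,\\[2mm]
\dfrac{1}{\sqrt{n}} & \text{convex but not strongly convex,}
\end{cases}
\]

and the goal of the talk is a single algorithm which, for smooth losses, achieves O(1/n) in both cases without knowing μ.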
What I mean by strong convexity here, since in that context the Hessian is the covariance matrix, is simply that the Hessian is invertible, with lowest eigenvalue bigger than μ; this is a classical assumption. What I have proposed with my colleague Éric Moulines from Télécom is to use a constant step size: not 1/n, not 1/√n, just constant. Here is the result, and I will try to explain the intuition behind it. You assume boundedness of the data, with a bound we assume to know, and you assume a noise level σ in your predictions. What we were able to show, and this is just to give you the flavor of the kind of thing we prove, is the following. You have F; F is the test error, and the key point is that it is the test error: whenever you see a convergence result in optimization or machine learning, you should ask, do I get convergence of my test error or of my training error? This is the test error, which is what you really care about. θ* is my optimal predictor in my class of functions, the best I can do, and θ̄_n is my averaged estimate; it is random because my data are random, so I take an expectation over the randomness of the data. This quantity has to be positive, because θ* is the minimizer of F, and it goes to 0 with two terms which are classical in optimization and in statistics. One term depends on your starting point: if you start close, of course, it is easier, so the bounds typically have to reflect the closeness of the initial point; it depends on the initial point, divided by n — and this is key, n and not √n. The other term is a classical term which depends on the noise of your problem: σ²p/n. For the experts in statistics in the room, this is the traditional performance of least squares, the one you cannot beat even if you had infinite computational power, and we achieve it with a single pass through the data. The exact formula is not so important; the key is that you have 1/n, and that μ has disappeared from the bound.
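The flavor of the bound just described (my compressed, hedged restatement; the precise constants are in the paper with Moulines):

\[
\mathbb{E}\,F(\bar\theta_n) - F(\theta_\ast) \;\le\; \frac{C_1\,\sigma^2 p}{n} \;+\; \frac{C_2\,\|\theta_0 - \theta_\ast\|^2}{n},
\]

with constants C_1, C_2 that do not depend on the strong convexity constant μ; the σ²p/n term is the statistical rate of least squares that cannot be improved even with unlimited computation.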
But the funny part is that you could get that rate with almost no computation, and this is maybe another important slide, just by looking at the Markov chain aspect of the problem. Take my recursion: my loss on a single data point is (y_n minus my prediction) squared, and this is my stochastic recursion; if I take the gradient of this with respect to θ, I get my residual times my feature vector. This is very classical. Now my x_n are i.i.d., and whenever you do stochastic approximation with i.i.d. inputs, your iterates form a Markov chain; this is always true, fine. But now my γ, my step size, is constant, so my Markov chain is homogeneous, because the dynamics are always the same. If you make a few extra assumptions, that Markov chain converges to a stationary distribution, which I will call π_γ; of course it depends on the value of the step size. So in the end, the way you should think about this algorithm is that it never converges: you start from θ_0, you follow your dynamics, and at the end you start to oscillate around the expectation under your stationary distribution, θ̄_γ. You oscillate around it, but you never converge.

The key now is when you do averaging. The first key point, sorry, is that for least squares you can easily show that you oscillate around the true value: the reason is that the gradient is a linear function, so you can exchange gradients and expectations, and therefore you oscillate around the true value. This means that if you average the trajectory, you converge to the expectation under the stationary distribution, and here this is your global optimum. Even better, it is known, this is the usual ergodic theorem, that the rate at which you converge to the global optimum is 1/√n in distance; but since we measure errors in squared distance, you get 1/n from the start, without any computation. Just because you have a homogeneous Markov chain which, in the end, oscillates around the true value, you get a convergence rate of 1/n. This is a bit misleading, in the sense that you get 1/n but you may not get the right constants; the heavy lifting in the paper is to get those constants and to show that they do not depend on the condition number. But the 1/n aspect is straightforward from this analysis.

OK, so let's look at some very big data: p equals 20. I have a bigger one on the next slide; this is just to show the intuition behind the algorithm. I am considering a synthetic example with p equal to 20, and in all my plots the x-axis is the number of iterations, which is equal to the number of observations, in logarithmic scale, and the y-axis is a distance to the optimum, also in logarithmic scale. So if you converge, you should go down like this, towards minus infinity. Here I have tried several step sizes: three constant step sizes and one decaying step size; dotted is before averaging and solid is after averaging. As you can see, if you don't average, those three colors oscillate; this is my slide from before, you are not supposed to converge, because you converge to your stationary distribution, so you never reach the optimum. What you can observe is that if you take a smaller step size you get a bit closer, but never there. Now if you start to average, those curves go down, and if you look very carefully the slope tends to be −1. This is just an illustration of the fact that, by doing averaging, you transform your non-converging algorithm into a converging one. My intuition is that with a constant step size you go very quickly to a region around the optimum, then you oscillate, and averaging allows you to converge; whereas with a decaying step size you start there and move very slowly towards that region, and it takes a while. This is what you see in the turquoise plot: if you use a decaying step size without averaging, you still converge, but slowly, and if you add averaging you eventually recover the fast behavior, but it takes more time.
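A small synthetic sketch in the spirit of that p = 20 experiment (my own code and parameter choices, purely illustrative), comparing a constant and a decaying step size, with and without averaging, on a least-squares problem:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20_000, 20
theta_star = rng.standard_normal(p)
X = rng.standard_normal((n, p))
y = X @ theta_star + 0.5 * rng.standard_normal(n)

def run(step_size):
    """One pass of SGD for least squares; returns the last iterate and the average."""
    theta = np.zeros(p)
    theta_bar = np.zeros(p)
    for i in range(n):
        g = (X[i] @ theta - y[i]) * X[i]
        theta = theta - step_size(i) * g
        theta_bar += (theta - theta_bar) / (i + 1)
    return theta, theta_bar

R2 = np.mean(np.sum(X ** 2, axis=1))
schedules = [("constant 1/(4 R^2)", lambda i: 1.0 / (4 * R2)),
             ("decaying 1/(2 R^2 sqrt(i+1))", lambda i: 1.0 / (2 * R2 * np.sqrt(i + 1)))]
for name, step in schedules:
    last, avg = run(step)
    print(f"{name:30s} last iterate: {np.linalg.norm(last - theta_star):.3f}"
          f"  averaged: {np.linalg.norm(avg - theta_star):.3f}")
```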
And this will be my last technical slide, just benchmarks. Of course you want to apply this to something bigger than p equal to 20, and even these benchmarks, which were classical two or three years ago, are big but not super big. I have taken two classical benchmarks from the community: one where p is about 500, and one, which is more typical, where p is huge, one million features, but the data are very sparse, so many entries are zero. Top is one dataset and bottom is the other.

What do I have on the left and on the right? This touches one of the hidden, not very glorious, aspects of optimization, which is that typically what people do, and I do this often so I am allowed to say bad things about myself, is to show a bound like this, say "oh, I take this step size and it converges optimally for various reasons", and then show simulations, but often with a different step size which is much bigger, because otherwise it is much too slow. So on the left I have taken C equal to 1, the step size given by the analysis. Forget about the red line; you should focus on the blue line, which is the constant step size, and the green one, which is the decaying step size. You can see that if you don't tune the step size, the decaying step size goes very slowly; this is a logarithmic scale, which means that to reach the performance you had here, you need tens of thousands more data points. This is very typical of optimization: if you use the step sizes from the papers, it is very slow; with a constant step size, it still goes down quite quickly. Now if you start to tune, things may differ a bit. For the decaying step size, to make the dotted line correct at the end, I have to take C so large that it starts to diverge. This is a classical behavior of decaying step sizes: the decay is too rapid, so to work well at the end you have to take a very large constant and behave badly at the beginning, clearly not a good behavior. And if I tune the constant C for the constant step size, it does not change much. This is just one dataset, but you see the same on several other datasets. Robustness is a key issue here: you want to be robust to the conditioning, but also robust with respect to the various constants you need to choose for the algorithm. Of course if you tune the constant C it will work a bit better, because it will be a bit faster, but this is first costly, and sometimes dangerous, because you move away from the convergence proof: it may seem to work at the beginning but diverge at the end. So here you have both a convergence proof and a step size for which you have the analysis and which works decently in practice.

Then, and this is really the last point: what happens if you don't have the square loss? Of course in advertising they don't use the square loss, they use the logistic loss, which is not a square loss, so let's see if we can reuse our intuition. We still have a Markov chain, that does not depend on the shape of the loss; it is still homogeneous, because γ is constant, and you still have a stationary distribution to which you converge, but now the stationary distribution is characterized by the expected gradient being 0. The problem is that if F is not a quadratic function, F′ is not a linear function anymore, and you cannot exchange expectations and gradients, so the gradient at the average under your stationary distribution is not 0. What you get is that you do oscillate, and if you average you do converge quickly, but to the wrong value: this is your optimal value, and you converge to the wrong value. What I won't describe are ways to essentially restore convergence: there are ways, with twice the complexity, to converge to the optimum, but in the interest of time I won't go through them; they are based on Newton's method.
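In equations (my hedged restatement of the argument above): under the stationary distribution π_γ of the constant-step-size Markov chain,

\[
\mathbb{E}_{\theta \sim \pi_\gamma}\big[F'(\theta)\big] = 0,
\qquad \text{but in general} \qquad
F'\big(\mathbb{E}_{\theta \sim \pi_\gamma}[\theta]\big) \neq 0,
\]

so averaging converges quickly, but to \(\bar\theta_\gamma = \mathbb{E}_{\pi_\gamma}[\theta]\), which differs from \(\theta_\ast\) unless F′ is affine, as it is for the square loss.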
Just to conclude, what I have presented is constant-step-size averaged stochastic gradient. If you have to remember one number, you have 1/n and 1/√n: if you are slow it is 1/√n, but the goal is to be fast, and this is what we are able to achieve by using the smoothness of the loss functions. If we have a non-smooth loss function there is no way we can do better than 1/√n; we are able to do it by using smoothness. This is simply constant-step-size stochastic gradient for the square loss, and for the non-square loss, the logistic, what I call the online Newton step, which has the same complexity as stochastic gradient descent but which I didn't describe. A key point is robustness to step-size selection. Of course there are many extensions; people have been considering that setup in many directions which I won't re-describe, but maybe the last two are worth mentioning. The first one is parallelization: all of this assumes a single computer, where I do all my computing on a single node, but of course when you have lots of data, the data are stored on several computers, so clearly people are trying hard to make this distributed. If you want to distribute the slow algorithms from the 50s, that is already possible; people can do this, it is quite easy to distribute. The problem is that when you go from one machine to two machines, you have to go from the fast algorithms, like this one, to a slow algorithm, so you lose a lot. What you always want when you go distributed is that going from one to two machines improves things, so a lot of people are trying to take fast algorithms like this one, and we are not the only ones having such algorithms, and make them distributed, to make sure that you always improve as you add more nodes. And finally, very important, a lot of research right now: what about non-convexity? Here I have assumed convexity of my loss functions, which includes least squares and logistic, but many people now are trying to go non-convex, and one of the reasons is the features. It is OK to assume convexity if you are given the features: if you have an expert that gives you good features, linear predictions are OK. But in many setups you also want to learn the features, as in computer vision, where the ultimate goal is "give me a bunch of images and let the machine do its stuff" and you get the prediction; now you have to learn the features, and this becomes a non-convex problem. You may have heard about people doing deep learning, neural networks: the only difference with what I am presenting is that they optimize both with respect to the parameter θ and with respect to Φ(x). This is non-convex and really interesting, because this is where you can make a lot of improvements: if n is very large, you can really let the machine do everything from scratch, but now you need to learn Φ(x), and this is non-convex and quite open for the moment. Thank you for your attention.

Thank you. We are applying stochastic approximation to predict the general state of wireless usage, and it turned out to be quite a powerful approach, and I found your approach with the constant step size quite interesting, because there are two properties where a wireless system differs from a system like advertising. One is that we operate under strict real-time constraints, so we basically need a solution every millisecond, or every 10 milliseconds, on a small DSP. The second property is that we need robust solutions, which means that when the statistics of the channel change, we cannot be wrong, or we should not be wrong, because when we are wrong, one or multiple users lose coverage.
In fact, people use constant step sizes a lot for tracking, typically because if your data are not i.i.d. but vary over time, a constant step size is a way to be adaptive to the changes in your data. This is not what we have considered, it is not our motivation, but constant step sizes are used a lot for tracking: you adapt to changes in your data, but then you need to average over a smaller window; you average over the last however many iterations instead of averaging from the start. This is used a lot, I think.

Actually, one more point: both issues, robustness and real-time constraints, are related, because if, say, the statistics of the channel change, you need to adapt to that, and it is always good if you can do that quicker. On robustness, my goal is this: take linear systems; I hope you all trust that when you solve a linear system, when you invert a matrix, you trust the result, maybe not the experts but I trust it, and the goal is to have the same kind of trust for these algorithms. This means that to be extra robust you have to cover all possible cases. Typically here, if I take a step size that is too big, let's take an example, even for that toy problem, actually I should put it up again: if I take something like 2/R², it is going to blow up, and not blow up as in being a bit bad, blow up as in diverging a lot. So whenever you play around with gradient descent and big step sizes, you have to be careful; this is why making sure that your data are bounded matters, and this can be enforced explicitly: if I have a large data point, I just squash it. So clearly we want robustness; we don't achieve that full robustness yet, it is not all there yet, but the goal is really for this to be as robust as linear systems. Thank you.

I have a question about the Φ function. As you said, we might think of estimating it, or if we don't have enough data, it is the data scientist's job to choose it properly. My question is: are there problems where the function Φ needs to be extremely complicated, which makes it very difficult to choose, or maybe to compute, and does this constrain you? Because you have to choose Φ to express your problem as an optimization problem; so is this always the right way to go, or are there classes of problems where this would not be the right formulation, because Φ would be too complicated, for instance?

I think, first, that for most problems you know a bit about them, so I don't believe in totally blind machine learning, because you have no-free-lunch theorems: if you want to be optimal everywhere, you are going to be very bad somewhere, and you can formalize this very precisely, so there is no magic algorithm. This being said, you have classes of problems which are common, like all natural signals: if you take speech, audio, images, then you have a common theme and there are experts in those, so for those you can use the expert knowledge to create a good Φ. For a random problem which you have never seen before, there is no universal solution; if you have lots of data you can start to learn from it, but if you have a small amount of data this is hard.

My question was not so much about learning, but about classes of problems, on graphs for instance. Oh, OK. Then, I assumed that Φ(x) is known and explicit, but there is something which was very popular ten years ago: you can have implicit Φ(x), you can have infinitely
many features Φ(x). You may think it is useless to have infinitely many features, but you can use algorithms that only depend on the dot products between those Φ(x), the so-called kernel methods, and now for many classes of problems, like problems on graphs or on trees, you have good kernels. So for every type of problem you typically have good features; for graph mining there is a list of features that people try, based on degrees and so on. This is very classical: for every type of problem there is a class of features which people have considered before.

You kind of dismissed the Newton method because of practical complexity issues, but there are classes of approximate Newton methods that are in some cases much easier to handle; some classical computations are done with modified Newton methods.

I agree with you: if you do PDEs, for example, the structure of the problem is very precise and you can leverage it to get approximate Newton methods. In our context, the only thing you know is that you have to do Newton on a covariance matrix, and this may have no structure; it is hard to leverage structure to get fast algorithms. This is a big difference with numerical analysis, where you know a lot about your problem: here you have to consider generality by design, exactly because you don't know your features in advance and you want to be robust to all possible features. This is a key difference. But clearly any of the classical preconditioning ideas from numerical analysis can be used, and it always helps a bit; still, you want to be generic, and robustness is the key issue here. Other questions?

Yeah. One of the major problems in the wireless regime, as Stefan mentioned, is that we have to do calculations on the fly, and most of the time we do not have the possibility of observing the perfect, actual value of the gradient. Is there an interplay between the step size and the imperfect observations of the gradient that can be put in place to get a better convergence rate, or convergence to the actual optimum?

That is a good question. When you do stochastic gradient descent you have errors in the gradient, and the key is that the errors are unbiased: the expectation is exactly the gradient, but the variance can be anything. It is a bit amazing that you always make errors but you still end up converging; in other cases in life, when you make errors, typically you do not converge. Here, randomness and unbiasedness allow you to converge to the global optimum. Now, if your errors are not unbiased, if they may be adversarial, then if the error does not go to zero as you get more observations, it can take you anywhere. So if the errors are not random, you have to have a decaying magnitude of those errors, otherwise you can be really bad. Of course you can bound things: we have other works where, if you give me an error in each gradient, I can bound the distance to the optimum at the end; but in the end you need the errors to go down as you get more iterations. If you have quantization, I guess the quantization errors are not random, or they are close to random but not totally random, so you have to control for this a bit if you want to make sure you converge. The problem is that sometimes all those bounds are very conservative, so do you want to be robust and provably correct, or do
you want to gain maybe another order of magnitude of speed by allowing yourself to sometimes diverge? There is a trade-off here which is not totally clear.