This is joint work with a PhD student of mine, who is still at Télécom ParisTech; I moved last year to École Polytechnique. It is about one part of my work, on sampling in large dimension, so it will have a very different flavour from the talk given by Jean-Claude: it is much more about statistics, probability and Bayesian statistics, with no algebra, no special functions, no transforms, so a completely different set of tools will be used. I will start by trying to motivate my talk, then go through some technical details (I am not sure I will cover everything), and then present some numerical illustrations and some conclusions. I will spend quite a lot of time on the type of application we have in mind, just to justify the type of techniques we are trying to develop right now. I am interested in computational statistics with applications to machine learning. As you know, the big data wave is there, and when you look at big data, people also start to look at much more complex models: once you collect a lot of data you can try to model things which are more subtle. So instead of the classical, very simple models which are typical in statistics, like linear regression or logistic regression, you can ask questions which are much more sophisticated and look at more detailed or complex models. When you have a complex model you have a lot of parameters to fit, and since I am more on the Bayesian statistics side, it is not only about fitting one value of the parameters but about investigating the posterior distribution of these parameters, which means that you need to be able to sample a distribution over a very high-dimensional space. Of course there are tools to do this, and these tools have been used for a long time in statistical physics: the famous Markov chain Monte Carlo techniques were among the first algorithms implemented on a computer, to solve statistical physics problems. But there are special structures in statistical physics which are typically absent when you are doing Bayesian inference, where the models are much less structured in some sense, and so deriving techniques to sample these distributions efficiently is really an issue at the heart of many, many important problems now.
I am working on more or less four applications, which are the following (not exactly in that order). The first application I looked at is what is called the aggregation problem. In the aggregation problem, instead of doing what you typically do in machine learning (you collect data, the training set; on this training set you estimate parameters; and once you have trained your classifier or predictor you take fresh data, a new data set, and do the prediction), there is a better thing to do, and the better thing to do is aggregation of estimators. Instead of picking from the training set the best parameter, in some sense, you sample from a distribution which is constructed according to some precise rule (I will not go into that level of detail), and then, for the prediction, instead of taking the one value of the parameter estimated on the training set, you compute the conditional expectation of the predictor on the new, fresh data: you aggregate the answers. This is called aggregation of experts. But to do aggregation of experts you need to learn the proper aggregation rule, and the proper aggregation rule is a kind of posterior inference: it is a kind of sampling according to some posterior distribution over a high-dimensional space. So that is the first application: I started to look at this, and at what is called the PAC-Bayesian problem, in the PhD thesis, five years ago, of Benjamin Guedj, who is now a researcher at Inria in Lille and still works on this type of topic, aggregation. That was the first motivation. The second motivation is to solve Bayesian inverse problems; I will give you an example instead of going into the maths. In the Bayesian inverse problem I was looking at, you have pollution data: a boat at sea, for example, which injects some pollutant over some amount of time; the pollution is not immediately detected, and you have satellites which collect images monitoring this. What you are trying to do is reconstruct the amount of pollutant injected by the boat, through a system which is quite complex: a PDE, a linear PDE in that case, with diffusion and transport terms. So you really have to solve a kind of linear inverse problem from observations, with a lot of uncertainty, and what you want is not only an estimate but an assessment of all the uncertainty which is there. There are other examples, like Bayesian inference for high-dimensional systems, which I will use later on, and Bayesian non-parametrics, but I will not cover those aspects. All these problems (and the list is not exhaustive) share the same difficulty, which is sampling techniques for this type of models, which are not very well structured; and the techniques
known so far are not adequate, because they do not really scale to high dimension. Now I will use a very simple example to let you know more precisely what I am looking at, and then I will try to describe an algorithm for doing this. To illustrate the type of problems I am looking at, I will use Bayesian inference (my first case) for a high-dimensional model. I start with one of the most commonly used problems in statistics and machine learning, which is the binary regression problem. What is a binary regression problem? You want to predict answers: the y are the answers you are willing to predict, and in the binary regression problem these answers are yes or no. You also have a set of explanatory variables: the yes/no answers are to be explained, conditionally on some regressors, and these explanatory variables may be of extremely large dimension. For example, you are visiting a website and you click or do not click on a given link: that is the binary variable, and you have a lot of explanatory variables, from which you try to explain the answers. There are many, many problems of that sort. Recommender systems, of course: I said yes or no, but it can be a rating from one to five, and I can make recommendations out of that, because I know a lot of things about you and I am trying to interpret the answers you have given. Here it is yes or no, but it can be other discrete data: instead of being binary it can be a multinomial variable, for example, and it is exactly the same thing. The most commonly used type of model for this binary regression problem is what is called a generalized linear model. In a generalized linear model you model the answers as follows: the y_i are Bernoulli variables with a success probability which is a certain function f, called the link function, of a linear combination of the explanatory variables. So the model is: you have a binary variable, and this binary variable is considered to be a Bernoulli random variable with success probability f(βᵀx_i). Of course you may imagine much more complex models; this is a very simple one. You may imagine, for example, interactions among factors instead of the simple linear form there, you can have interaction terms and things like this, or instead of a simple function you can have a function which is much more complex. But that is the basic example of a generalized linear model, and one example used in many cases is to take f to be the logistic function, or sigmoid function, so that you obtain the model called logistic regression, which is used in many, many applications in machine learning.
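To fix notation, the model just described can be written out as follows (this display is mine, matching the talk's description):

```latex
y_i \mid x_i \;\sim\; \operatorname{Bernoulli}\!\left(f(\beta^{\top} x_i)\right),
\qquad
f(t) \;=\; \frac{1}{1+e^{-t}} \quad \text{(logistic link)} .
```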
The problem there is that the number of predictor variables can be extremely large: there are many applications where the number of predictor variables is more than 10,000 or sometimes 100,000, depending on the application you are looking at; in a topic model, for example, it would be something like 10^5. This means that you have to estimate parameters in a dimension which is very large, of order 10,000 or even more (the data set can also be very large, so n can be very large too, but I will not discuss that point today). So that is the setting, and I am in a Bayesian setting. In a Bayesian setting you are not only willing to estimate a parameter, which is what you typically do in machine learning, where you use this data set to fit the "best", in quotes, parameter: the parameter β which in some sense best explains the data you have in your training set. In a Bayesian setting you have a more ambitious goal, which is to sample a posterior distribution: the posterior distribution of the parameter given the set of observations. In a Bayesian setting you have an additional piece of information: you put a distribution on the parameters, called the prior distribution, which is a way to describe, loosely speaking, the prior information you have on the parameter before making any observation. A typical prior is, for example, the Gaussian prior, where you put a Gaussian distribution as the prior distribution of the parameter, but you can also use a Laplace prior: the Gaussian prior gives a quadratic term, the Laplace prior simply an L1 term, and these two have a very different behaviour. The Laplace prior is at the heart of compressed sensing and of the lasso; if you have looked at recent work in statistics, L1 priors are extremely popular now because they favour sparse solutions, which are important, but we will not discuss that point today. Doing Bayesian statistics, what you are willing to sample from is the posterior distribution: the posterior distribution of the parameters given the observations, and you simply apply Bayes' rule. When you apply Bayes' rule, you have the prior distribution of the parameters, π(β), and the conditional distribution of the observations given the parameters, where we have assumed that y_i is a Bernoulli random variable with probability of success f(βᵀx_i); this is called the likelihood term. The posterior distribution of the parameter given the observations is then the product of the prior times the likelihood, and of course, to complete this, you need to renormalize. But one of the difficulties in Bayesian statistics in general is that it is impossible to obtain a closed-form expression for this normalizing term: to get the constant you would have to integrate this quantity with respect to β, and remember that β is in dimension 10,000 or so, so of course it is impossible to compute the normalizing factor in closed form.
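As a minimal sketch of the quantities involved (my code, not the speaker's; the Gaussian prior scale tau is an illustrative parameter), the unnormalized log-posterior for Bayesian logistic regression can be written like this:

```python
import numpy as np

def log_posterior(beta, X, y, tau=1.0):
    """Unnormalized log-posterior for Bayesian logistic regression.

    X: (n, d) matrix of explanatory variables; y: (n,) vector of 0/1 answers;
    tau: prior standard deviation (Gaussian prior; a Laplace prior would
    replace the quadratic term below with an L1 term).
    """
    z = X @ beta
    # Bernoulli log-likelihood with logistic link, in a numerically stable form:
    # sum_i [ y_i * z_i - log(1 + exp(z_i)) ]
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))
    log_prior = -0.5 * np.sum(beta**2) / tau**2  # Gaussian prior, up to a constant
    return log_lik + log_prior                   # known only up to the log-normalizer
```

The missing normalizer is exactly the integral over β that the talk says is intractable in dimension 10,000.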
So that is the expression of the posterior distribution: the posterior distribution is known only up to a normalizing factor. Okay. As I said, there is a lot of work in computational statistics on this logistic regression model, and the techniques which are used typically rely on data augmentation. I will not go into this today, because if you are not an expert in MCMC it is not really needed to understand what the data augmentation is; the only thing you need to know is that, up to now, the convergence time of these techniques is prohibitive as soon as the dimension is larger than about 100. So there are at least one or two orders of magnitude missing: we are willing to attack problems whose number of parameters is of the order of 10,000 or more, whereas for the techniques known today, for example the Pólya-Gamma sampler, a dimension of around 100 is really the limit; beyond that it is much too costly. So it has been considered a daunting problem, but in fact I do not think it is a daunting problem; I think the techniques which have been used are simply not appropriate. If you look at link prediction, logistic regression is behind most of the techniques; it is one of the workhorses of Google and Facebook for link prediction, and they really do regression on a large-dimensional space. In fact there is a lot of structure in the problem. Look at the sampling problem: you are trying to sample the distribution which is exponential of −U(β), where U(β), given there, is a potential, a Gibbs potential if you wish. This Gibbs potential has two terms: a term which is the negative log-likelihood of the observations, and this term in that case is convex (it is not completely obvious, but it is convex in this case, at least when f is the logistic function), and a penalty term, which is either the L1 norm or the squared L2 norm of β, possibly composed with some design matrix called B. In terms of optimization, it can even be a strongly convex potential, so if you simply want to optimize with respect to the parameter β, you simply have to minimize a convex function in large dimension, which is doable; even if you take the L1 norm there, it is a non-smooth convex optimization problem, but you have tools to do that. So in terms of optimization it is a rather simple problem, and in terms of sampling it should not be a very difficult problem either; and I think it is not a very difficult problem, if you use the proper tools. Okay, so I will start with what I call the smooth case. In the smooth case I will assume that the potential is differentiable, which means that I first look at the case where the term coming from the prior is differentiable; later I will explain more briefly how you can use this type of machinery when the prior distribution is not differentiable, where you need to do some smoothing in order to apply it. So let me change the notation a little bit: the problem now is to sample the distribution which is given by this.
The distribution I want to sample is given by π(x) ∝ exp(−U(x)), where U is a Gibbs potential; you typically do not know the normalizing constant, and the dimension d is very large. What type of smoothness have I assumed? I assume that U is L-smooth, which means that U is continuously differentiable and that the gradient of U is Lipschitz, with a Lipschitz constant L which can be computed. This is in fact very important, in the sense that one of the consequences is that the growth of the gradient of U at infinity is at most linear in the norm of x, which is a consequence of being globally Lipschitz; the global Lipschitz assumption is important there. Okay. Now, in continuous time you have a diffusion, so I can consider this diffusion, introduced by the French physicist Paul Langevin, one of the founders of modern physics, to describe the behaviour of a particle in a liquid. He proposed this equation, which is a diffusion; if you are not familiar with diffusions, it is like an extension of an ordinary differential equation: it is an ordinary differential equation to which, loosely speaking, you have added a Brownian term (B_t here is a Brownian motion), and this is called an SDE. What you can prove, especially because we have assumed that the gradient of U is globally Lipschitz, is that you have a unique strong solution to this diffusion; and under this assumption what you may also prove is that you have a single invariant distribution, and this single invariant distribution is precisely the distribution that you are willing to sample from.
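Written out (my notation, consistent with the description above), the Langevin diffusion and its invariant measure are:

```latex
\mathrm{d}X_t \;=\; -\nabla U(X_t)\,\mathrm{d}t \;+\; \sqrt{2}\,\mathrm{d}B_t,
\qquad
\pi(\mathrm{d}x) \;\propto\; e^{-U(x)}\,\mathrm{d}x .
```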
In that case you may prove that this diffusion is reversible with respect to π, and therefore it has a unique invariant distribution, which is proportional to exp(−U); you may even prove that convergence to the stationary distribution takes place at a geometric rate. Typically it means that if you were able to simulate this diffusion exactly, starting from some point, then after a certain amount of time you would be close to the distribution you want to sample, and you know exactly how close, because there are extremely precise estimates on the rate at which you converge to the stationary distribution. You can use functional inequalities, like the Poincaré or log-Sobolev inequalities used for example by Cédric Villani recently, or transport inequalities, or coupling techniques, which are more probabilistic if you wish; these types of techniques give extremely precise estimates of the rate at which the diffusion converges to π. Conceptually this is very interesting, in the sense that it gives you a way to sample π. The problem is that you then need to be able to sample exactly from a diffusion, which is of course not possible: up to now we do not have computers which work in continuous time, so we can only discretize. An obvious way to approach the problem is therefore to discretize the diffusion, and the simplest scheme for that is the Euler-Maruyama scheme; it is for SDEs what the Euler scheme is for ordinary differential equations. What you do is fix a step size, γ_k; the step size of the discretization is not necessarily constant (we will see later that it is important not to take a fixed step size), which is why I put the index k there. Then you compute the next point X_{k+1}: you start from X_k, you compute the gradient of U at X_k and take a gradient step, and then you add some noise, √(2γ_{k+1}) Z_{k+1}, because it is an increment of the Brownian motion with the corresponding variance. This has a very familiar form: if you forget the additional noise term, you simply have a gradient descent; the Euler-Maruyama scheme is a kind of gradient algorithm in which, at each step, you add an amount of noise whose variance is proportional to the step size you have performed. So it is very simple to implement, more or less one line of code, as soon as you are able to compute the gradient.
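A minimal sketch of one Euler-Maruyama step, i.e. one step of the unadjusted Langevin algorithm (my code; grad_u is assumed to compute ∇U):

```python
import numpy as np

def ula_step(x, grad_u, gamma, rng):
    """One Euler-Maruyama (unadjusted Langevin) step.

    x: current point; grad_u: callable returning the gradient of U;
    gamma: step size gamma_{k+1}; rng: numpy random Generator.
    """
    z = rng.standard_normal(x.shape)  # Z_{k+1} ~ N(0, I_d)
    return x - gamma * grad_u(x) + np.sqrt(2.0 * gamma) * z
```

With a constant gamma this iteration is the homogeneous Markov chain discussed next; with a decreasing sequence gamma_k it is the inhomogeneous chain used later in the talk.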
Now look at the case of a fixed step size. If you take γ_k equal to a constant γ in the previous equation, this equation defines a homogeneous Markov chain (otherwise, if γ_k is not constant, it is time-inhomogeneous). You can of course write the transition probability density of this Markov chain, R_γ(x, y), which is given there, and what you can prove, under some appropriate conditions (you need a bit more than the gradient being Lipschitz; I will give the precise conditions later, but essentially you need some kind of curvature at infinity), is that the Markov chain is irreducible (it is a continuous state space Markov chain, so irreducibility is not defined exactly as for a discrete state space, it is a bit more involved) and positive recurrent, which means that it has a single invariant distribution, and this invariant distribution depends on γ. And here is the problem: your target distribution is π, but because you have discretized, the chain targets a distribution π_γ, and π_γ is not equal to π. That is our problem: with such a discretization you obtain, at the end of the day, an algorithm which has some invariant distribution, and which may even converge geometrically fast to that invariant distribution, but as soon as you have taken a fixed step size γ, the stationary distribution π_γ is not equal to π. Just a word on stability: it is important here that ∇U(x) grows at most linearly at infinity, because otherwise the scheme is typically unstable; imagine that ∇U(x) grows like the square of x when x becomes large, then the discretization will typically not be stable. The stability conditions in discrete time and in continuous time are completely different: in continuous time, as is well known in physics, you may have a steep ∇U, and a steep ∇U gives a kind of confinement of the solution, whereas steep potentials are very bad when you discretize. In that case it is no longer possible to use the Euler scheme, and you need more sophisticated schemes, called implicit schemes, in order to preserve stability; but I think that is not necessary in our case. So there is one way out, proposed in 1996 by Roberts and Tweedie.
Their idea is the following. I have a Markov chain (taking γ_k = γ) which does not have exactly the right stationary distribution, but there is a technique, due to Metropolis and later refined by Hastings, by which, starting from a Markov chain, you can correct its stationary distribution by adding a kind of accept-reject step. It works as follows: at each iteration of the algorithm you propose a point Y_{k+1} according to the discretized Langevin step, and then you accept or reject this move with a certain probability, which is given there. The good thing with this correction is that you get the right stationary distribution: you can prove that, because you have added this accept-reject step, the algorithm becomes reversible with respect to the target distribution π, and not π_γ, which is good. But there are cons. First, it requires additional computation at each iteration to evaluate the acceptance ratio, which is sometimes unnecessary. Second, there is a problem linked with convergence, which I will not discuss in detail, but the relevant conditions are very difficult to check: even in simple geometries it is not completely clear that the Metropolis correction preserves geometric ergodicity. Third, scaling analysis suggests that adding the accept-reject step forces you to use very small step sizes, and the rate at which you converge to stationarity is then much slower than with the technique I will explain a bit later. So now I will try to explain what we suggest, and we will see that we can tweak the discretization a little in order to obtain the right stationary distribution without having to use the MALA correction. I need essentially four results to do this. First, I need a stability condition for the diffusion: a condition which guarantees, in some sense, that the diffusion converges geometrically fast to its stationary distribution. Starting from the diffusion, I look at its generator; the generator of this diffusion is 𝒜f = Δf − ⟨∇U, ∇f⟩, the Laplacian of f minus the scalar product of the gradient of U with the gradient of f. For geometric convergence to the stationary distribution I need, more or less as an if-and-only-if condition, a Lyapunov function for the diffusion: a function V satisfying what is called the Foster-Lyapunov condition. Lyapunov of course did not invent it in this form; Foster understood it for discrete Markov chains, and it was then extended to diffusions. The condition is, more or less: if you want geometric convergence in total variation to the stationary distribution, you need to find a function V satisfying this drift condition.
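A standard form of this continuous-time drift condition is the following display (mine; the talk does not fix the constants, so read κ > 0, b finite, C a compact set, V ≥ 1):

```latex
\mathcal{A}f \;=\; \Delta f \;-\; \langle \nabla U, \nabla f \rangle ,
\qquad
\mathcal{A}V \;\le\; -\kappa\, V \;+\; b\,\mathbf{1}_{C} .
```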
As an example (as I said, you need a little bit more than gradient-Lipschitz to satisfy a Lyapunov condition; what you typically need is some curvature at infinity): take V(x) = exp(U(x)/2). When you look at Markov processes it is often the case that, remembering that π(x), the distribution we need to sample, is proportional to exp(−U(x)), the negative power V(x) = π(x)^{−1/2} is a Lyapunov function. To check it you need a little bit more at infinity: you need ∇U to have some curvature at infinity. If you have that, you get exponential convergence to stationarity: you may prove that, starting from a distribution μ_0 (with some constraint on μ_0), with P_t the semigroup associated with the diffusion and π the stationary distribution, which is the target distribution, you can control the total variation distance between μ_0 transported by the semigroup of the diffusion and π: it is ‖μ_0 P_t − π‖_TV ≤ C(μ_0) e^{−κt}, with completely explicit expressions, which I will not give, depending on the type of technique you use. You may either use functional inequalities like the Poincaré or log-Sobolev inequalities, or different types of couplings, synchronous couplings or reflection couplings; there are different techniques for obtaining completely explicit expressions for C(μ_0) and the exponent κ. But at this level of understanding you simply need to know that the Langevin diffusion converges to π exponentially fast, and that you have a precise control on the rate at which it converges to stationarity and on the way this depends on the initial distribution μ_0. Now look at the discretization of the diffusion. I need the same type of result; as I said, it is a Markov chain, time-inhomogeneous if you assume the step sizes are not constant, and the condition I need for R_γ is also called a Foster-Lyapunov condition, but this time in discrete time. The condition is that I can find a function V (it can be the same function V as for the diffusion) such that R_γV, the transition kernel of the discrete Markov chain applied to V, is less than κ^γ V + γb. It is not the classical Foster-Lyapunov condition, because now it depends on γ, which makes complete sense: if γ is taken to zero, the Markov chain becomes completely trivial, X_{k+1} = X_k, so you do not mix at all; indeed with γ = 0 you have R_γV = V, so if γ is very small you will not mix much, and if you take a larger γ the factor κ^γ (with κ < 1) typically becomes smaller, which means that you contract V more. What you can prove then is the same type of result: you converge in a V-norm; that is a detail, it can be the total variation norm.
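In symbols, the discrete-time condition just described reads (again my display, with κ in (0, 1) and b finite):

```latex
R_{\gamma} V \;\le\; \kappa^{\gamma}\, V \;+\; \gamma\, b ,
\qquad 0 < \gamma \le \gamma_0 .
```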
The iterates of this Markov kernel typically converge to π_γ at a rate governed by κ^γ. So it is a Foster-Lyapunov result, but you need a little bit more: you need the Foster-Lyapunov condition to hold more or less uniformly over γ in a range γ ≤ γ_0, with this type of behaviour. Under exactly the same conditions as before you get this: if U is smooth and satisfies a tail condition of that form, you obtain exactly this type of condition, and if I add other conditions on U, for example if I assume that U is strongly convex at infinity, I get better constants. So κ and b depend on the type of assumption you are ready to make on U: if you assume that U is convex, or strongly convex outside a ball but with bounded oscillation within the ball, you will obtain different values for κ and b; the weakest condition, which leads to the worst bound, is some kind of curvature at infinity. Starting from this curvature, what you can prove is a control of the moments of V. Here Q_γ^n (I think I did not define it on the slide) is simply the composition of the kernels: R_γ is the transition kernel of the discretization, the γ_k are the step sizes, and Q_γ^n is the product of the kernels R_{γ_k}, applied one after the other up to n. And if you have the Foster-Lyapunov condition for this family, what you can prove is a bound on Q_γ^nV: Q_γ^nV(x), which is nothing but the expectation of V(X_n) for the discretization started at x, is less than κ raised to the power of the sum of the γ_k, times V(x), plus an explicit term. So the type of condition I have imposed on the discretization is in fact a way to control the V-moments of the discretization in that form. Now I will try to explain what I suggest in order to turn this type of result into a very simple technique for sampling the stationary distribution, and the idea is very simple. What I want to control is the total variation distance between μ_0 Q_γ^p, where μ_0 is the initial distribution and Q_γ^p is the kernel product of the discretization, and the stationary distribution π. What do I know? π is the stationary distribution of the diffusion. So what I do, just to explain the principle (it is not the best way to obtain the tightest bound, but it is a way to explain why it works), is to cut this difference into two terms. Q_γ^p satisfies a kind of Chapman-Kolmogorov equation: for any n, Q_γ^p is a product of kernels, Q_γ^n multiplied by Q_γ^{n+1,p}, with obvious notation. Then you replace Q_γ^{n+1,p}, the discrete kernel from step n+1 to step p, by the semigroup of the diffusion run for the corresponding amount of time, the sum of the step sizes from n+1 to p. Adding and subtracting, you obtain exactly these two terms.
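The triangle-inequality decomposition just described can be written as follows (my notation, with Γ_{n+1,p} = γ_{n+1} + ... + γ_p):

```latex
\|\mu_0 Q_\gamma^{p} - \pi\|_{\mathrm{TV}}
\;\le\;
\|\mu_0 Q_\gamma^{n} P_{\Gamma_{n+1,p}} - \pi\|_{\mathrm{TV}}
\;+\;
\|\mu_0 Q_\gamma^{n} Q_\gamma^{n+1,p} - \mu_0 Q_\gamma^{n} P_{\Gamma_{n+1,p}}\|_{\mathrm{TV}} ,
```

where the first term is the convergence of the diffusion to π and the second is the discretization error accumulated between steps n+1 and p.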
Now look at these two terms. The first term is easy to control. Why? Because you look exactly at the semigroup of the diffusion, run during a period of time Γ_{n+1,p}, with an initial distribution which is μ_0 Q_γ^n: μ_0 Q_γ^n is the initial distribution transported by the semigroup of the diffusion, and you look at the rate at which it converges to π. So this term is simply controlled by the rate at which the diffusion converges to its stationary distribution, which gives you a precise estimate of the rate at which you converge to π, starting this time from μ_0 Q_γ^n. Then you have a second term to control, and what does it control? You start at time Γ_n with the same initial distribution μ_0 Q_γ^n, and you compare what happens when you transport this measure with the semigroup of the diffusion versus the semigroup associated with the discretization of the diffusion; it is exactly this term. For the first term we already have the right tools; now you need a technique to control the second one. You can reinterpret it as follows: take two diffusions driven by the same Brownian motion, but evolving with different drifts. The first diffusion evolves with the Langevin drift, built from the gradient of U; the second evolves with the discretized version of the Langevin drift, ∇̄U, given there, piecewise constant between the grid points; and both are driven by the same Brownian motion. What you can prove, and it is a simple trick, is that these two diffusions are mutually absolutely continuous: the law of the diffusion on path space is absolutely continuous with respect to the law of the discretized diffusion, because you have the same Brownian term. And there is a famous theorem in stochastic calculus, due to Girsanov, which gives an exact expression for the Radon-Nikodym derivative of the law of the diffusion on path space with respect to the law of the discretized diffusion: it is exactly given by this term. So, without the technical details, you can see that this quantity is a Radon-Nikodym derivative, the Radon-Nikodym derivative of the law of the diffusion on path space with respect to the law of the discretized diffusion. Now you have a density, and you want to control the total variation distance, starting from some distribution, say δ_x, between these two laws; you simply use Pinsker's inequality, which is famous in information theory.
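Schematically, the chain of bounds is the following (my display, under the usual normalizations for the diffusion coefficient √2; Y is the diffusion, Ȳ the Euler interpolation, and ∇̄U its piecewise-constant drift):

```latex
\bigl\| \mathcal{L}(\bar{Y}) - \mathcal{L}(Y) \bigr\|_{\mathrm{TV}}^{2}
\;\le\; \tfrac{1}{2}\, \mathrm{KL}\bigl( \mathcal{L}(\bar{Y}) \,\big\|\, \mathcal{L}(Y) \bigr)
\;=\; \tfrac{1}{4}\, \mathbb{E} \int_{0}^{T}
\bigl\| \nabla U(\bar{Y}_s) - \overline{\nabla U}(\bar{Y}_s) \bigr\|^{2}\, \mathrm{d}s ,
```

with the expectation taken under the law of the discretized diffusion, exactly as the talk says.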
Pinsker's inequality tells you that you can control the total variation distance between the two distributions by the Kullback-Leibler divergence, the relative entropy of one law with respect to the other. There are generalizations of Pinsker's inequality and you can do better than that, but it is the simplest way, and it is a good inequality here because it is quite tight when the two laws are close (it is not very tight otherwise); there are other notions of entropy which could be used. In our case it is extremely simple: if you plug the expression of this density ratio into Pinsker's inequality, you see that you simply have to control a second moment under the law of the discretized diffusion: the second moment of the difference between the gradient of U and the discretized gradient of U. So you obtain a completely explicit bound on the difference between the diffusion and its discretization, and the good thing is that you simply have to control these moments with respect to the law of the discretization, which is quite easy if you keep in mind that, thanks to the Foster-Lyapunov inequality, we are able to control moments of exactly this type of quantity. At the end of the day, you simply plug in the moment controls we had before, and you obtain a completely explicit control of this quantity. What are the quantities appearing in this bound? The dimension d of the space: you have an explicit dependence on d. An explicit expression in terms of the γ_k. And of course the Lipschitz constant L, because you have a difference of gradients there, and everything I have done works because I assumed that U is L-smooth. If I put everything together I get explicit bounds which look like that, where n is a point I can choose: this part corresponds to the control of the convergence of the diffusion, and the second part to the control of the discretization error. I did not put it on the slide, but you have completely explicit expressions for these quantities; all these bounds are quantitative and can be computed, and (I did not discuss it, but it is important in our context) you have an explicit control of the dependence on the dimension. And if I assume that γ_k goes to zero, with some additional assumptions, but basically at almost any rate, in such a way that the sum of the γ_k goes to infinity, then what I can prove is that ‖μ_0 Q_γ^p − π‖_TV goes to zero as p goes to infinity. So what have I proven? The message, and the proof is in fact quite easy, is a bit surprising. I take a diffusion which has the right stationary distribution, π. If I use the Euler discretization with a fixed step size, I end up with a Markov chain, but it does not have the right stationary distribution; whereas if I use a decreasing step size, provided the sum of the γ_k goes to infinity, the discretization still converges to π, and I can control exactly the rate at which it converges, because everything is explicit.
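A minimal driver loop with decreasing step sizes (my sketch; the schedule gamma_k = gamma0 / k**alpha is one admissible choice, the theory asking only that gamma_k decreases to 0 while the sum of the gamma_k diverges, e.g. 0 < alpha <= 1):

```python
import numpy as np

def ula_decreasing(x0, grad_u, n_iter, gamma0=0.1, alpha=0.5, seed=0):
    """Unadjusted Langevin algorithm with decreasing step sizes
    gamma_k = gamma0 / k**alpha; with alpha <= 1 the step sizes sum to infinity."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    samples = []
    for k in range(1, n_iter + 1):
        gamma = gamma0 / k**alpha
        x = x - gamma * grad_u(x) + np.sqrt(2.0 * gamma) * rng.standard_normal(x.shape)
        samples.append(x.copy())
    return np.array(samples)
```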
I sometimes have trouble convincing my colleagues, because it is not completely obvious to understand why, with a fixed step size, you do not converge to the right distribution, whereas with a decreasing step size you do converge to the right distribution, and on top of that with explicit control on all these quantities. I should have mentioned that there is a huge literature on the discretization of Langevin diffusions; the point where we were a bit more careful than other authors is that we have tried to control explicitly how all these bounds depend on the dimension d, because we have in mind sampling over a very high-dimensional space. There are also results, using the same techniques, on how π and π_γ are related, for example, but I will not explain that. I should also have mentioned that many results of this type were obtained a long time ago, by Talay in particular, with different applications in mind, not necessarily for simulation purposes; he did a lot in this area, and in particular he obtained this type of result for even more general diffusions, but not the result before. Okay, so now, and this is more or less the last part of my talk: what is essential in what I said before? There are not many assumptions, but there is one assumption which is important, and which is even needed to have a unique solution to the diffusion: the potential U has to be differentiable. So, for example, this type of approach is not applicable when U is not differentiable, which is exactly what happens in a typical machine learning situation: you want to sample π(x) ∝ exp(−U(x)), and in machine learning there are many cases where the potential is U(x) = f(x) + g(x), where f is a nice function and g is convex but nasty otherwise: a convex but non-regular function. There are many, many examples of this. In the famous, or infamous, lasso-type techniques, g is the L1 norm, convex but not differentiable. With support vector machines, introduced by Vapnik a long time ago, the loss is what is called the hinge loss, a hinge function which is not differentiable at one point; in deep learning now you also have functions which are not differentiable at some points. So there are many problems where you have this kind of composite potential, with f smooth and g convex but not smooth, and it is an important problem to have a solution for this too. This is exactly what I will describe now. I will use something closely related to what people do in optimization, a technique introduced by
Moreau, a French mathematician, in the sixties. He proposed a kind of regularization of a function which, in this case, is the right way to do the regularization; it is called the Moreau envelope, and it is defined as follows. You take a function h, and assume for example that h is convex (it is not necessary, but assume it). The Moreau envelope h^λ(x), where λ is a regularization parameter, is obtained by taking h(y), adding a regularization term, ‖x − y‖² times 1/(2λ), and minimizing over y. It is always below h(x), and of course it converges to h as λ goes to zero. In the case where h is convex, the function y ↦ h(y) + ‖x − y‖²/(2λ) is strongly convex, so it has a unique minimizer, and in this case h^λ(x), the Moreau envelope, is well defined (otherwise it is a bit more complex to define, but that is not needed at this point). The point at which the minimum is achieved is called the proximal point, prox_{λh}(x); it depends on h and x, and it is characterized by a subgradient condition, so we have a closed-form characterization. The important thing to know at this point is that for many functions h the proximal operator is known, and sometimes it is extremely simple. If h(x) is, for example, the sum of the |x_i| for i = 1 to d, which is the lasso penalty, then prox_{λh}(x) is simply the coordinatewise soft-thresholding operator: it is zero on [−λ, λ], then x − λ above the threshold and x + λ below. So the proximal operator is completely simple in that case. And of course this is related, though people did not notice it first in this way, to what has happened in statistics over the last thirty years: nonlinear thresholding was one of the main themes, with people doing functional approximation with wavelets and things like this. So this is related to the lasso, to L1, to compressed sensing: L1 is always behind this, and the proximal operator of L1 is the soft-thresholding operator; when you rewrite the story you can reinterpret everything in this language, although it was of course not first presented that way. The properties to know are that the regularized function h^λ converges to h as λ goes to zero, that h^λ is smooth, and that there is a way to compute its gradient: ∇h^λ(x) = (x − prox_{λh}(x))/λ, which is very easy in such cases.
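For the lasso penalty the ingredients are completely explicit; here is a minimal sketch (my code) of the soft-thresholding prox and of the gradient of the Moreau envelope via the identity just stated:

```python
import numpy as np

def prox_l1(x, lam):
    """Proximal operator of lam * ||.||_1: coordinatewise soft thresholding.
    Zero on [-lam, lam], shrunk towards zero by lam outside."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def grad_moreau(x, lam, prox):
    """Gradient of the Moreau envelope g^lam, using
    grad g^lam(x) = (x - prox_{lam g}(x)) / lam."""
    return (x - prox(x, lam)) / lam
```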
So, to make a long story short, what we propose in a paper with Alain Durmus and Marcelo Pereyra is to take our potential, keep its smooth part f(x), and regularize the convex but nasty part g using the Moreau envelope, g^λ. What you can prove (it is a theorem, not very difficult to prove, though not completely easy either) is that if exp(−g(x)) is proper, i.e. integrable, then exp(−g^λ(x)) is also integrable, so the regularized potential still defines a proper distribution. And we have proper control: we can prove that π^λ, the distribution associated with this new potential, converges to π as λ goes to zero, and we have a simple explicit control in total variation between π^λ and π (this can also be used to solve another problem, about sampling from convex bodies, but that is another issue). The algorithm, which is called MYULA, for Moreau-Yosida regularized Unadjusted Langevin Algorithm, simply runs, with a decreasing step size, the Langevin discretization of the potential regularized using the Moreau envelope. So what you do is compute the gradient of f at x_k and add the gradient of the non-smooth part after regularization, which is given by the proximal term; that is explicit, and you use it with a decreasing step size. There are other techniques, but I will not talk about that.
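One MYULA iteration then looks like this (my sketch; grad_f and prox_g, the gradient of the smooth part and the proximal operator of the non-smooth part, are assumed given):

```python
import numpy as np

def myula_step(x, grad_f, prox_g, lam, gamma, rng):
    """One MYULA step: unadjusted Langevin on the regularized potential
    U_lam = f + g^lam, whose gradient is grad f + (x - prox_{lam g}(x)) / lam."""
    drift = grad_f(x) + (x - prox_g(x, lam)) / lam
    z = rng.standard_normal(x.shape)
    return x - gamma * drift + np.sqrt(2.0 * gamma) * z
```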
Now a small illustration. It is a simple one, but easy to grasp, which is why I chose it; the problem is very classical: a Bayesian inverse problem in imaging, in high dimension. x is an image, observed through an operator H called a blur operator, a kind of convolution operator; H is known, and the problem is ill-posed in the sense that H is extremely low-pass in that case, which is also typical; and you add some amount of noise W. So what you observe is an image Y: a given object X, the image we want to recover, observed through the point spread function H, plus some amount of noise, which is assumed to be Gaussian. What I want to do is restore X. For this I add a prior, and the prior information used here is given by this: ∇_d is called a discrete gradient operator, and the prior it defines is called total variation regularization. An image is a lattice of pixels; you take the discrete gradient, that is, the differences between pixels row-wise and column-wise, and your prior information is that, in a natural image, the sum of these differences should be small: in an image like this you have a lot of more or less homogeneous areas, separated of course by contours, so you expect the sum of the absolute differences between neighbouring pixels to be small, which is true in natural images. So what you want to sample: the y are the pixels you have observed, the x are the pixels of the true image, and the potential has a term which is simply the likelihood of the observations under this observation model, plus a term which is the prior information on your image, given there. So I use this prior. It is of course always a bit difficult to see, but here is the true image; the number of pixels to restore is of the order of 10^5, so I really am in high dimension. There is the blurred image (blurred because of the blurring operator H, plus noise), there is the MAP estimate, and there is what I have obtained: credibility intervals, which is about the only thing I can display there. I am looking at the posterior distribution: now that I have sampled the image, I compute π(x) given the observation Y, and I display coordinatewise credibility intervals; x is a large vector of pixels, and I display, pixel-wise, more or less the zone which contains 90% of the posterior distribution. With the standard techniques, obtaining credibility intervals takes something like more than one day; with our technique it is something like 20 minutes, which is a bit better. Okay, so a lot of work remains to be done, and some has already been done: I did not say it, but we have computable bounds for the convergence (that is done, I just did not explain it), and we can compute the MSE of estimates, so we have results in that direction. As future work: partial updates, because in very large dimension using a full gradient is sometimes not the right way to do it, it is much better to update the parameters group by group, but it is not that easy to do proofs with partial updates, so perhaps we will do that later; we also have to work on handling more complex priors; a more detailed comparison with MALA, to try to understand why they are so different (we observe a big difference, our method works much better, but we need to explain why); and we have to look at bias reduction techniques to speed up the convergence even further. This is reminiscent of something which has been done for SDEs to remove bias, the Richardson-Romberg technique, where the idea is to use Euler discretizations with two different step sizes and then recombine the estimates; it is well known how to do that when you estimate, for example, smooth functionals of a diffusion, but it is less clear how to implement it when you are trying to sample the stationary distribution of the diffusion, and we already have some results there, not explained today. Thank you for your attention, and I hope there will be some questions. Question: I have the feeling that in the continuous case things are quite clear, and whenever you start discretizing the gates are open and anybody tries different techniques on how to discretize and mix; is that the case?
Answer: yes, the difficulty comes from the discretization; the theory in continuous time is very elegant, because you have extremely precise results on the rates of convergence and a natural link with PDEs, so you can say a lot of things, and of course in discrete time it is more complex. Question: so my question, put differently, is this: the biggest problem is how to implement a continuous SDE, or whatever, on a computer; do you think there are other ways, for example going directly from the PDE to an analog computer? Why not, because I have the feeling it is just the discretization which poses all these problems for you. Answer: if you look at what was done in the sixties, people were using analog computers; it is doable, people used analog computing to solve differential equations, so it is costly but doable. But the discretization is a non-trivial issue; it is clearly non-trivial, and in this case I have tried to convince you that it is difficult, but not so difficult. Question: still, we have the feeling that in going from the continuous to the discrete case a lot of people try many recipes; every time I go to conferences each one tries a different recipe, one says I am going to do the Euler discretization, another guy does another type of discretization; are we going to find a general framework for doing it? Answer: in that case, in fact, I think there are different issues. There are potentials, for example, which are steep; that is a different problem, which I did not mention: when you have steep potentials you need to be more careful, because you need a kind of implicit scheme, you cannot use explicit schemes. I think that in our case we have a reasonably simple way to proceed, but there are many other problems where it is very difficult. The other thing which is interesting to me, and this is why I am pushing this, is that people who use machine learning have already implemented gradient descent, because gradient descent is the workhorse of machine learning techniques in large dimension right now, and the good thing about our solution is that you simply have to change one line of code: you take your favourite gradient descent algorithm and you add some amount of noise; you do not have to implement an accept-reject step or anything else. The only other thing I advocate is that you also need to tune the step size in a slightly different way. So at least when I go to the machine learning community, it is so simple to modify the code that I can convince people to at least try it. And it becomes especially interesting because, if you look at the machine learning literature of the last 20 years, when people started to look at large-dimensional problems there was a trend towards convex problems: there was the work of Vapnik, who was the first to popularize support vector machines, there was the lasso and things like this, and there is a constant stream of literature saying that in large dimension we should use simple models which are convex, so that they have a single maximum, and there are a lot of things we know about performing convex optimization even in very large dimension. That held until about five years ago, when deep learning reappeared, and then
This becomes especially interesting if you look at the machine learning literature of the last 20 years. When people started to look at large-dimensional problems there was a trend towards convex problems: there was the work of Vapnik, who was the first to popularize support vector machines, there was the lasso, and so on, and a constant stream of literature saying that in large dimension we should use simple, convex models, which have a single maximum, because we know how to perform convex optimization even in very large dimension. That lasted until about five years ago, when deep learning reappeared, and with deep learning you are really in the non-convex case. And there is something good about this type of approach there: when you take a stochastic gradient descent algorithm and add some noise, you get a very large additional benefit, because plain gradient descent typically gets stuck in the first local minimum it finds, whereas with some amount of noise you also escape from traps, so it behaves more like a global optimization technique. It is funny, by the way: if you look at the first contests run in the machine learning community with neural nets (not deep nets, but architectures not so far from deep nets), the first winning technique was based on exactly this type of idea, on MCMC, and I do not think that was by chance; I really believe it was successful because the technique at least escaped from the obvious local minima, and I think there are other benefits as well. So I see two main messages in my presentation. First, whatever your algorithm, with one line of code you are Bayesian, which is a good message. Second, if you look at non-convex optimization problems, adding some noise makes the method extremely robust: I do not guarantee that you will reach the global optimum, but you will at least escape the annoying local minima. I think this is why, if you look at the literature now, more and more techniques along these lines are being proposed.
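The escape-from-traps effect can be seen on a one-dimensional toy example; this is a minimal sketch, with an asymmetric double-well potential chosen purely for illustration.

```python
import numpy as np

# Asymmetric double well: shallow minimum near x = -0.93, deeper one near
# x = +1.06.  U(x) = (x^2 - 1)^2 - 0.5 * x,  so U'(x) = 4x(x^2 - 1) - 0.5.
grad_U = lambda x: 4.0 * x * (x**2 - 1.0) - 0.5

rng = np.random.default_rng(2)
step = 0.01

# Plain gradient descent started in the shallow basin stays there...
x = -1.0
for _ in range(10_000):
    x = x - step * grad_U(x)
print("gradient descent ends near", round(x, 2))   # about -0.93, the trap

# ...whereas the noisy version hops the barrier and spends most of its
# time near the deeper well around x = +1.
x = -1.0
for _ in range(100_000):
    x = x - step * grad_U(x) + np.sqrt(2.0 * step) * rng.normal()
print("noisy iterate ends near", round(x, 2))
```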
Are there more questions? I have one from an applied perspective: I liked the image reconstruction algorithm you showed; does it always have to work on the complete image, or can it be parallelized, for instance by working on blocks?

I did not do that, but yes, it can be done: there are what people call consensus techniques, and there is a lot of literature on sample splitting. I did not work on that myself, but it is doable.

Another question: I am not going to start a debate about Bayesian orthodoxy, but one thing we wonder about is the choice of the prior; you used a Laplace prior here, so what was the impact of that choice?

It is not exactly about that point. When you do classical machine learning you have two terms: a likelihood term, we all start with a likelihood, and a regularization term, which in the lasso is the L1 norm. So everyone in machine learning is Bayesian; there is no orthodox machine learner, because as soon as you use a regularization term you are Bayesian. The main difference is elsewhere: when you are not Bayesian, you take the function which is the sum of the likelihood and the regularization and compute only its maximum, whereas when you are Bayesian you sample the posterior distribution. And when I sample the posterior distribution in this kind of dimension, what I see is that there are many, many solutions which are extremely close to the maximum. There is something everybody knows intuitively: when you start with an ill-posed problem, the regularization gives you one solution, of course, but that solution becomes extremely sensitive to the regularization. If you report just one point, you do not see that there are many other points whose criterion values are extremely close while the solutions themselves can be very different; the objective is typically extremely flat in large dimension, so many, many solutions achieve almost the same value, and reporting a single maximum hides that. You can see this in the machine learning literature, where people try to optimize the amount of regularization by cross-validation and similar methods: when you play with L1, lasso-type techniques, a small perturbation of the regularization triggers the appearance of completely different solutions.

But you do have a way to promote the solutions you believe in: when the problem is ill-conditioned, you need prior information to solve it at all. How do you choose it? Typically, when people assume sparsity, they say: I have many regressors, but my assumption is that only a small fraction of them are meaningful, so I will use a sparsity-inducing prior; L1 is one of many possible priors of that kind, and by now it is a bit old-fashioned. In other problems it is smoothness instead; it is always a smoothness or sparsity-inducing assumption. So it is not about orthodox versus Bayesian: you cannot do orthodox machine learning, because as soon as you are in large dimension you are non-parametric, and when you are non-parametric you need assumptions.

There is a question there: if L1 is old-fashioned for enforcing sparsity, what about SCAD?

There are many other types of penalty; the lasso is a bit old-fashioned because it has a well-understood problem: it biases the solution. To favor sparsity you have to increase the regularization, and the L1 term then contracts the large components too, so it shrinks the true value of theta and biases the solution, quite significantly. The right penalty should instead eliminate the small components but leave the large ones alone: when theta is very large, you do not want to shrink it. That is exactly what the SCAD penalty does: it shrinks the small values to zero but does not penalize the large values. Of course, the ideal penalty is L0, but L0 is not convex: if you insist on remaining convex, L1 is the last stop before non-convexity, and if you are ready for a non-convex experience you can go much closer to L0; a small sketch contrasting the two thresholding rules follows below.
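To make the contrast concrete, here is a minimal sketch comparing the lasso's soft-thresholding rule with the SCAD thresholding rule of Fan and Li; the value a = 3.7 is the conventional default, and the test values are arbitrary.

```python
import numpy as np

def soft_threshold(z, lam):
    """Proximal operator of the L1 (lasso) penalty: shrinks everything,
    so large coefficients are biased towards zero by lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def scad_threshold(z, lam, a=3.7):
    """SCAD thresholding rule (Fan & Li): kills small values like the
    lasso but leaves large values untouched, removing the bias on them."""
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) <= 2 * lam,
                    np.sign(z) * np.maximum(np.abs(z) - lam, 0.0),
                    np.where(np.abs(z) <= a * lam,
                             ((a - 1) * z - np.sign(z) * a * lam) / (a - 2),
                             z))

z = np.array([0.5, 1.5, 5.0])
lam = 1.0
print(soft_threshold(z, lam))   # [0.  0.5 4. ] : the 5.0 is shrunk to 4.0
print(scad_threshold(z, lam))   # [0.  0.5 5. ] : the 5.0 passes unchanged
```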
And that is it for the questions? OK, I have just one small comment: for some reason, analog is coming back everywhere, even in our disciplines, as frequencies are getting higher.

That is possible, and maybe analog would even be easy in this case: you simply have to compute gradients, which is sometimes not that complex, so yes, that is a good idea; and there are many sources of Brownian motion.

OK, I think we will take a break right now, there is lunch, so let's thank the speaker.