Thank you very much for the introduction, and thank you to the organizers for organizing this nice workshop and for inviting me to give a talk. I actually changed my mind about the talk I wanted to give, so you may have noticed that the title is not exactly the same as the one I announced in the first place. This talk is about using recent optimization techniques to learn conditional random fields for applications in machine learning. So it is about Markov random fields, but through a different lens than what we have heard so far in the workshop. What I will present is joint work with my PhD student, Chelle Chouhou; SDCA here stands for stochastic dual coordinate ascent, and I will explain in a minute why that is relevant. To start with a motivating example and something concrete, I would like to begin with this image. In many applications it is important to be able to process images, segment them, and identify the objects within them. Today, this is an application where the learning algorithm you would use is not a conditional random field: you would use deep learning. Roughly five to eight years ago, conditional random fields were superseded in this type of application by deep learning algorithms. I kept this illustration anyway, because the focus of my talk is learning CRFs in general, not just this application, and it gives a concrete example that should make the setup easier to understand, at least if you are not too familiar with the uses of conditional random fields in machine learning. There are still a number of cases where you might prefer a conditional random field over a neural network, perhaps not for exactly this application but for similar ones: deep neural networks require a lot of labeled data to reach great performance, and in applications where you have a smaller amount of data, the way you can parameterize a conditional random field is advantageous. To give an example, a couple of years ago we worked with colleagues on facade parsing: you have images of buildings, and you try to detect the windows, the doors, and the different parts of the buildings. By parameterizing a Markov random field cleverly, we could encode the fact that the windows and the doors are very often aligned; this was hard-coded in the structure of the model, and that is not so easy to do with neural networks, even though there is interesting research to do there. There are other applications where CRFs are natural and neural networks less so, but my main point here is to use this application to present the framework in which I want to work and to present a fast learning algorithm.
The idea is that we view this image as a grid of pixels, and at each pixel we have to classify the corresponding pixel into one of a certain number of semantic classes: for example road, sidewalk, wall, tree, traffic sign, pedestrian, and so on; I think here there are eight semantic classes. Essentially, at each pixel you want to solve a multi-class classification problem, but of course you would like to leverage the spatial structure of the problem and model the fact that the decisions you make at neighboring pixels should probably be related. So I would like to consider a conditional random field model for this problem and propose a fast learning algorithm for it, and what I would like to try and leverage is recent ideas in convex optimization that have allowed people to solve very quickly optimization problems of the following form. You want to minimize a function f plus a certain regularization, where typically f(w) is the empirical risk of a machine learning problem, and an empirical risk is usually a sum, over a large number of data points, of the loss evaluated at each data point. Typically, the function f_s that we would consider has the form of a small function of w transpose phi(x_s): the loss l evaluated at the label y_s, our output variable, and a linear function of a feature representation phi(x_s) of the input data. So if the empirical risk is a sum of terms of this form, our optimization problem is the minimization with respect to w of a sum of a large number of functions plus a regularizer. It turns out that when your problem has this structure, you can use stochastic gradient methods with variance reduction techniques. The general idea of these algorithms is that at each iteration, instead of computing a batch gradient, you pick a single one of these functions and compute a stochastic gradient; but instead of the classical stochastic gradient, you compute one with a particular structure such that, as you approach the minimum of the function, the variance of the stochastic gradient converges to zero quickly enough that the algorithm behaves like a stochastic gradient algorithm at the beginning and more and more like deterministic batch gradient descent as it starts converging. People have shown that using these algorithms on finite data sets, you get very fast convergence rates that essentially combine the best of what you can get with SGD and with gradient descent. There are several algorithms in this family: SVRG stands for stochastic variance reduced gradient, and, without spelling out all the acronyms, SAG and SAGA are two other famous members. The one I would like to consider is called stochastic dual coordinate ascent, which is actually of the same family, although this is not obvious because it is written in the dual. The way you implement this algorithm is that you first transform the optimization problem by writing its Lagrange dual.
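To make this concrete, the regularized empirical risk just described can be written as follows (a sketch; the squared-norm regularizer and the 1/n normalization are conventions I am assuming here, not something fixed by the talk):

```latex
\min_{w \in \mathbb{R}^d} \;\; \frac{1}{n} \sum_{s=1}^{n} f_s(w) \;+\; \frac{\lambda}{2}\,\|w\|^2,
\qquad
f_s(w) \;=\; \ell\big(y_s,\; w^\top \phi(x_s)\big).
```

It is exactly this finite-sum-plus-regularizer structure that the variance-reduced methods exploit.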
Essentially, if f_s* here is the Fenchel conjugate of f_s, then, even though this may be a big step if you have not looked at these kinds of things before, it is standard to transform this type of objective into its Lagrange dual for this empirical risk minimization problem. In the dual, we have a property called dual decomposition: for each data point s we have a dual variable alpha_s, and the dual objective is a sum of functions that each depend only on the dual variable corresponding to a single data point. The dual of the regularization term becomes this term, the norm of a linear combination of the feature representations, and it is this term that couples the variables; apart from it, the dual optimization problem almost decomposes over the data points. Yes, sorry, I forgot to mention this: in the setting I consider, I assume that the losses we work with are convex, or rather that the small functions f_s are convex. Then we can solve this dual very quickly by randomly selecting one of the coordinates and doing what is like a gradient step, except that it is a proximal step. I will not go into the details, but the point is that with this formulation we obtain an algorithm that can be interpreted in the primal as one of these stochastic gradient methods with variance reduction, with very fast convergence. To give you an idea, I gathered the theoretical results known for these different algorithms. If n is the number of data points, that is, the number of terms in the sum, and the problem is characterized by a condition number kappa and the ambient dimension, then these are the running times to reach an error epsilon on the gap to the optimum. Roughly, gradient descent needs on the order of n kappa log(1/epsilon) operations, so you obtain linear convergence; the accelerated algorithms reduce the dependence on the condition number; and the family of algorithms I just mentioned, these stochastic gradient methods with variance reduction, needs on the order of (n + kappa) log(1/epsilon): n and kappa combine additively instead of multiplicatively, which makes quite a big difference in practice. Now, those algorithms are applicable to problems of this form, and coming back to my example of semantic segmentation: if I were classifying each pixel independently, I could use multi-class logistic regression on each pixel, my objective would be of this form, I could use these algorithms, and I would be happy. But what I would like to do is consider a conditional random field in which the graphical model does not have this structure, because I want to model the interactions between the predictions that I am making at each pixel.
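For reference, the Lagrange dual being described, in the standard SDCA form of Shalev-Shwartz and Zhang (a sketch; the sign and scaling conventions may differ from the slides), is

```latex
\max_{\alpha \in \mathbb{R}^{n}} \;\; \frac{1}{n} \sum_{s=1}^{n} -f_s^*(-\alpha_s)
\;-\; \frac{\lambda}{2} \Big\| \frac{1}{\lambda n} \sum_{s=1}^{n} \alpha_s\, \phi(x_s) \Big\|^2,
\qquad
w(\alpha) \;=\; \frac{1}{\lambda n} \sum_{s=1}^{n} \alpha_s\, \phi(x_s).
```

Each conjugate f_s* touches only its own alpha_s; only the squared-norm term couples them, which is what makes randomized coordinate updates cheap.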
You have to imagine that this is a one-dimensional image, with each of these being a pixel, and the red connections corresponding to the edges of the graph. In that case I have actually coupled everything, and my objective no longer has the structure of a sum of independent functions. So what we looked at in this work is: how can we find a formulation in which we can still leverage these fast algorithms for fast learning of conditional random fields? To do that, we will obviously have to make a number of approximations. To explain the formulation we consider, I will start from the conditional random field and rewrite it in a number of ways, and one trick I will use is to write the log-likelihood as a log-partition function. To make sure what I present is clear, let me explain this trick first; I apologize, because the next couple of slides are a bit technical, but if I skipped right from here to the abstract formulation I will consider, I think you would find it obscure. The idea is that I have observed data: x^o and y^o are my observed data points, and in the example, x^o is the whole input image and y^o is my semantic segmentation, the whole output image. If we model the negative log-likelihood using an exponential family, then our exponential family has this form, with phi a vector of sufficient statistics, w the natural parameter, and this term being the log-partition function which, since this is a conditional model, obviously depends on the input x^o. Written out, the log-partition function is the logarithm of the sum over all possible values of y of this exponential. And when I say all possible values of y, remember that what I call y is the entire segmentation: if there are eight possible classes per pixel and I have a million pixels, there are 8 to the power of one million different y. So this is a very big sum; I put it in red because it is an intractable sum, and we will have to take care of it at some point. What I would like to point out is that we can write the term on the left as the logarithm of its exponential and, as a consequence, push it inside the log-sum-exp, so that what appears is the difference between the sufficient statistics and the same sufficient statistics evaluated at the observed value. So far this is the general formula for an exponential family, but I would like to apply it to a graphical model, where the vector of sufficient statistics decomposes over a certain number of cliques. In the case I presented, the cliques correspond to all the nodes in the graph and all the edges connecting neighboring nodes, so I have unary potentials and binary potentials: the set of cliques contains two types, the singletons and the pairs, that is, the edges.
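In symbols, the trick is simply this (using the notation of the slides):

```latex
-\log p(y^{o} \mid x^{o}; w)
\;=\; \log \sum_{y} \exp\big( w^\top \phi(x^{o}, y) \big) \;-\; w^\top \phi(x^{o}, y^{o})
\;=\; \log \sum_{y} \exp\Big( w^\top \big( \phi(x^{o}, y) - \phi(x^{o}, y^{o}) \big) \Big).
```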
For each type of clique I have a certain parameter, which is why I have a sum over all the cliques of this expression, a specific instance of the general form. And we can rearrange this expression into yet another form, which will be convenient: for each clique we can consider all the possible values of y_c, list the values of the dot product between w_{tau_c} and the feature vector evaluated at each of those possible values, and then pull the actual y_c out and write the whole thing as an inner product. You introduce a dummy variable y'_c that ranges over all the possible values, and you can rewrite the expression as the dot product y_c . theta_c, where theta_c now looks like a canonical parameter but is obtained as the product between features and w. So my point was just to take the log-likelihood and write it as a log-partition function of a sum of dot products. Let me now explain the particular instance in the case of our conditional random field. Specifically, we have an input image x, with features phi_s at each pixel. At each pixel I encode the class by an indicator variable: y_s is the class at pixel s, a vector with K components where a single component is one and the others are zero. It is a Potts model. If we were making the predictions at each pixel individually, I would have a multi-class logistic regression model, which I could write this way, with a different parameter for each class and the vector here corresponding to the different outputs. The conditional random field I would like to consider essentially takes the multi-class logistic regression model, so these two parts of the model are the same, but adds an interaction term between adjacent output values. Here the product y_{sk} y_{tl} is one if pixel s is from class k and pixel t is from class l, and the potential associated with this configuration is this parameter. I can rewrite this Potts model more abstractly: this expression becomes the dot product of w_{tau_1} with a feature vector that, in general, depends on the output and the input, and for the second term I can also write a more general expression, a feature of the two output variables at these nodes and the whole input. Even though here the input does not actually appear in the pairwise term, it is fine to generalize to this case. Once I have this more abstract form, you see that in both cases, cliques of size one and cliques of size two, I have a parameter, w_{tau_1} and w_{tau_2}, associated with these two types of cliques and multiplied by a feature vector associated with the corresponding clique. So I can write this in an even more abstract form as a sum over cliques of the parameter vector associated with cliques of type tau_c times this feature vector, and then I have to introduce the log-partition function: so far I only wrote "proportional to". Of course we know that this log-partition function is intractable, but using the trick from the previous slide, the log of the conditional model takes this form, and I can rewrite it.
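Concretely, the score of the pairwise model being described is, up to notation,

```latex
\sum_{s \in V} \sum_{k=1}^{K} y_{sk}\; w_k^\top \phi_s(x)
\;+\; \sum_{(s,t) \in E} \sum_{k,l=1}^{K} y_{sk}\, y_{tl}\; W_{kl},
```

where the first term alone is the multi-class logistic regression model and the second is the Potts interaction; here V denotes the set of pixels and E the set of neighboring pairs, symbols of mine rather than the slides'.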
This product is the theta_c that I introduced on the previous slide, and you see that the whole point is that, when we have a conditional random field model, we can write it as F of the product of a certain design matrix, containing the information about the data, times a parameter vector w, where F is a log-sum-exp function, that is, a log-partition function. Now that we have identified this log-partition function, the learning problem can be reformulated as the minimization with respect to w of F(C^T w) plus a regularization term, where F is the log-partition function associated with the corresponding graph. The problem is that this log-partition function is difficult to compute, and its gradient is difficult to compute; it is possible to use stochastic simulation methods, but those are in general relatively slow. So what I consider is using variational methods to solve this minimization problem, and then making relaxations that will allow us to have a more efficient algorithm. Maybe this slide is superfluous, but just to contrast with the disconnected case: there, the log-sum-exp appears at the level of each pixel, everything is tractable, we can use those fast algorithms, and in particular stochastic dual coordinate ascent; the question is whether, for the previous formulation, we can obtain a formulation that looks like this. So, as I said, I would like to use variational methods. The idea is to leverage what we know about the Lagrange dual of this optimization problem, in particular the fact that the Fenchel dual of the log-partition function is the Shannon entropy: we can write the log-partition function as the maximization, over the vector of moments associated with the Markov random field or conditional random field, of this dot product plus the Shannon entropy. This allows us to construct a dual, but there are a couple of difficulties, so let me introduce the objects here. The set of moments mu here is called the marginal polytope, and mu is a collection of small vectors, each associated with one of the cliques in the model. Here there is a vector mu associated with each node and a vector mu associated with each edge, and each mu_c has to be a marginal probability distribution: for a node, a distribution over the K possible values corresponding to the K classes you could assign to that pixel; for an edge, a distribution over K squared values, corresponding to the pairs of classes you could assign to the pair of pixels. And H is the Shannon entropy which, it turns out, can be expressed directly as a function of the moments. So starting from our primal optimization problem, we can write an equivalent dual optimization problem based on this dual objective, which contains the negative of the Shannon entropy.
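The variational representation invoked here is the standard one from the exponential-family literature (Wainwright and Jordan):

```latex
F(\theta) \;=\; \log \sum_{y} e^{\langle \theta,\, \phi(y) \rangle}
\;=\; \max_{\mu \in \mathcal{M}} \; \langle \theta, \mu \rangle + H(\mu),
```

where M is the marginal polytope and H(mu) is the Shannon entropy of the maximum-entropy distribution with moments mu.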
What I wrote here is the indicator function of the marginal polytope, which is equal to zero inside the polytope and plus infinity otherwise, and this term, which is the regularization. The primal problem is then equivalent to the maximization of this dual problem; actually this should be a minimization, sorry, or no, a maximization, but then there should be a plus here. The difficulty is that the dual formulation is again intractable, because the marginal polytope is defined by an exponential number of constraints, and the Shannon entropy is itself intractable to write for a general graph, and for a grid in particular. So the idea is to consider this dual problem but with a relaxation, and a classical relaxation is based on the local polytope. What is the idea of the local polytope? It is to take the marginal polytope and keep only those of its defining constraints that involve single nodes and pairs of adjacent nodes. In our particular case, I construct the local polytope as follows. First, for each node: here, this is my set of cliques that are nodes, and for each pixel s I consider the simplex, the set of mu_s that are probability distributions, that is, non-negative with entries summing to one; mu_s is a vector of dimension K, the number of classes. Then I introduce a similar object for each edge: this time, think of mu_st as a square matrix of size K by K, the joint probability distribution over the class assignments of the nodes s and t, which has to be non-negative and, being a probability distribution, has to sum to one as well. The collection of all these simplex constraints on all nodes and all edges is what I will call the independent polytope: a Cartesian product of constraints on all these small moments. The local polytope is then the set of moments in the independent polytope that are consistent with each other, in the sense that when you take the moment on an edge and sum over one of the two variables, you get the moment corresponding to the other node: if I sum mu_st over t's variable I should get mu_s, and if I sum over s's variable I should get mu_t. Let me point out that the local polytope can be written as the intersection of the independent polytope, a Cartesian product that decouples over the cliques, with the set of equalities I wrote in pink, which are just linear constraints: I can encode them all as one big linear constraint, a certain matrix A times mu equals zero. The local polytope is going to be my relaxation: it can be shown that the marginal polytope is included in the local polytope, and I will use the local polytope as a relaxation of M. Unfortunately, I cannot compute the Shannon entropy either, so I need to introduce a surrogate for the entropy; many surrogates exist, and the difficulty is to find one suitable for my case.
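Putting the definitions together (a sketch in my own notation), the independent and local polytopes are

```latex
\mathcal{I} \;=\; \big\{ \mu \;:\; \mu_s \in \Delta_K \ \forall s,\;\; \mu_{st} \in \Delta_{K \times K} \ \forall (s,t) \in E \big\},
\qquad
\mathcal{L} \;=\; \Big\{ \mu \in \mathcal{I} \;:\; \sum_{l} \mu_{st}(k,l) = \mu_s(k),\;\; \sum_{k} \mu_{st}(k,l) = \mu_t(l) \Big\},
```

and stacking the pink marginalization equalities into one matrix gives the single linear constraint A mu = 0, so that L is the intersection of I with the set of mu satisfying A mu = 0.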
Since I would like to leverage ideas from convex optimization, I need a surrogate that is convex, and this is probably the step that might be the most shocking for an audience with a perspective inspired by physics, because I am going to do something somewhat brutal here. I am not going to consider the Bethe entropy, because it is not convex. I would be more inclined to consider a convexification of the Bethe entropy, which would be the tree-reweighted entropy, but the difficulty is that the tree-reweighted entropy is convex on the local polytope and not outside of it, and I need a function that is convex outside as well. So what I do is consider a family of approximations of the entropy that can be written as a sum of terms over all the cliques, where I require that each term, as it enters the dual objective, be smooth and convex on the simplex, and in addition I would like the approximation to be strongly convex, but only on the local polytope. In the numerical experiments that I will show, we used something very simple, the Gini entropy for each of the terms, and we also considered something closer to the tree-reweighted entropy, namely the oriented tree-reweighted entropy, but I am not going to describe the details here. The whole point is to obtain an approximation of the entropy that decomposes over cliques and has nice convexity properties. [Question from the audience.] Yes, sure: in the simple setting I am describing, I have nodes and edges, and obviously the edges overlap at the nodes. Exactly; if I just use the Gini entropy, I cheat and do not account for that, but the oriented tree-reweighted entropy is a way to actually account for it. The oriented tree-reweighted entropy is a weighted average of entropies associated with spanning trees: you pick a root, create a spanning tree oriented from that root, and the entropy terms you add are conditional entropy terms that take into account the terms you need to subtract. Yes, it remains convex. Okay, so we make the two relaxations: replace the marginal polytope by the local polytope, and replace the Shannon entropy by this approximation. What do I get? From the dual problem I introduced, I now obtain this tractable dual, which has this form: I have replaced this function by this one, and I have replaced the indicator of the marginal polytope by the indicator of the independent polytope plus one big linear constraint; and then I have this term corresponding to the regularization, which I wanted to have. [Question from the audience.] You are asking the difficult question. As far as I know, there is relatively little literature on quantifying what you lose when you go from the exact formulation to this type of formulation. There is a gap in the literature there, and in this work we did not try to tackle that aspect, so we do not know how much we lose when we go from one to the other.
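As an example of such a decomposable surrogate, the Gini entropy used in the experiments is, in one common convention (an assumption on my part; the talk does not spell it out),

```latex
H_{c}(\mu_c) \;=\; 1 - \sum_{y_c} \mu_c(y_c)^2,
\qquad
\widetilde{H}(\mu) \;=\; \sum_{c \in \mathcal{C}} H_{c}(\mu_c),
```

each term being a smooth, strongly concave quadratic on its simplex, so its negative, which is what enters the dual objective, is smooth and strongly convex, matching the requirements above.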
Okay, so now look at the structure of what I have introduced: the entropy here is a sum of entropy terms on individual cliques, and the independent polytope is a product of constraints saying that each small moment lies in a given simplex. So if I take the sum of these two functions for a clique and call it F_c*(mu_c), I can rewrite my dual problem as the sum over cliques of F_c*(mu_c), minus G*, the regularizer, which I am fine to have coupling the different cliques, and then the linear constraint. If I did not have the linear constraint, then, remembering the form of the objective I was leveraging to use stochastic dual coordinate ascent, I could just apply stochastic dual coordinate ascent to this problem; but of course I have the linear constraint. So I still have to do something to get rid of it, and the standard thing to do in convex optimization is to introduce an augmented Lagrangian formulation. The idea is to replace the linear constraint by this term, where c is a Lagrange multiplier multiplied by A mu, plus the augmentation term of the augmented Lagrangian, which basically provides some strong convexity. What we know is that if I minimize this objective with respect to c, I recover D(mu). So what I would like to solve is: maximize with respect to mu the minimum with respect to c of D_rho(mu, c). By strong duality, we can exchange the order of the maximization and the minimization: instead of maximizing over mu the minimum over c, I can minimize over c the maximum over mu. A natural way to solve this optimization problem is then to optimize with respect to mu, that is, maximize with respect to mu, and once we have more or less converged, use the result to compute a gradient of this function D(c), take a gradient step with respect to c, and iterate. It turns out that the gradient with respect to c here is very simple, because c appears only in this term: the gradient is A times mu-hat, where mu-hat is the optimal value of mu for the inner optimization problem, that is, the solution of that optimization problem. Right, that is what I am writing. Obviously, we do not want to solve a sequence of optimization problems: we do not want to have to minimize with respect to mu to high precision, take a gradient step in c, and then solve another full optimization problem. What we want to do is partially optimize with respect to mu, then update c, then partially optimize with respect to mu again. And the question is: what is the minimal amount of work you have to do so that you still make progress?
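Written out, the augmented Lagrangian just described is, up to sign conventions,

```latex
D_{\rho}(\mu, c) \;=\; \sum_{C \in \mathcal{C}} F_C^*(\mu_C) \;-\; G^*(\mu) \;-\; c^\top A\mu \;-\; \frac{\rho}{2}\,\|A\mu\|^2,
```

so that minimizing over c recovers the constrained dual (the minimum is minus infinity unless A mu = 0, in which case it equals D(mu)), and the gradient of the outer function mapping c to the maximum over mu of D_rho(mu, c) is -A mu-hat(c).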
So the idea is to optimize a bit with respect to mu using stochastic dual coordinate ascent, determine the smallest amount of work we need to do to make progress, and then update the Lagrangian parameters. At this point, let me go relatively quickly over the results we have. On this optimization problem there are two gaps: one, which I call delta-hat_t, corresponds to the optimization over mu, and gamma_t is the gap on the objective as a function of c. Long story short: if the algorithm we use to optimize over mu, which I propose to be stochastic dual coordinate ascent, has the guarantee that after a fixed amount of effort it reduces this gap by a factor beta smaller than one, then we can show that the overall algorithm is globally convergent, and not only globally convergent, but that the gaps converge to zero at an exponential rate. What needs to hold is that the eigenvalue of this matrix be smaller than one, and that specifies how much work you have to do at each step. Based on this, we can show that if you do a fixed number of iterations of stochastic dual coordinate ascent on mu at each epoch, the algorithm is globally and linearly convergent. That was for the problem in the dual, but we can show the same thing for the primal: if we optimize in the dual and remap the dual solution to the primal via this mapping, we also have linear convergence in the primal. Of course, this is linear convergence in the primal of the problem defined by the relaxation, based on the local polytope and on this approximation of the entropy. There is a lot of related work on this topic. One thing I should insist on, and maybe did not insist on enough at the beginning, is that in the traditional approach to learning conditional random fields, each time you update the parameters, you have to solve an inference problem to compute the gradient, and that inference problem is usually itself intractable; so in the classical approach to learning CRF models, you have to solve a whole, hard inference problem at each iteration. The advantage of this type of formulation is that the whole learning algorithm has one single inference procedure embedded in it: we learn and do inference at the same time. This is not our idea: the first formulations trying to go beyond the standard formulation are these, and then there is work closer to ours, but essentially nobody had made the connection with stochastic dual coordinate ascent and obtained algorithms for which we could show formally the type of linear convergence guarantee that we showed. I have very little time remaining, so let me just show you a couple of experiments, and I have to apologize, because I do not really have the curves or experiments that would be most suitable for this audience.
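Before the experiments, here is a minimal schematic sketch of the overall loop, under the assumptions above. The callbacks grad_block and prox_block are hypothetical stubs standing in for the problem-specific pieces (the gradient of the smooth coupling terms, and the proximal operator of each F_C* restricted to its simplex), so this is an illustration of the structure rather than the authors' implementation.

```python
import numpy as np

def learn_crf_sdca(A, grad_block, prox_block, dims,
                   rho=1.0, n_inner=5, n_outer=100, step_c=None):
    """Augmented-Lagrangian loop with SDCA-style inner updates (schematic).

    Approximately maximizes over mu
        D_rho(mu, c) = sum_C F_C^*(mu_C) - c^T A mu - (rho/2) ||A mu||^2
    by randomized per-clique proximal steps, then takes a gradient step
    on the multiplier c (the outer minimization), and iterates.
    """
    n_cliques = len(dims)
    # Start from uniform marginals on every clique (one block per clique).
    mu = np.concatenate([np.full(d, 1.0 / d) for d in dims])
    c = np.zeros(A.shape[0])  # Lagrange multiplier for the constraint A mu = 0
    offsets = np.concatenate([[0], np.cumsum(dims)])
    if step_c is None:
        step_c = rho  # classical multiplier step size for augmented Lagrangians
    for _ in range(n_outer):
        # Inner phase: a fixed budget of randomized clique updates, enough
        # (per the theory) to contract the inner gap by a factor beta < 1.
        for _ in range(n_inner * n_cliques):
            C = np.random.randint(n_cliques)
            lo, hi = offsets[C], offsets[C + 1]
            g = grad_block(C, mu, c, rho)            # grad of coupling terms w.r.t. mu_C
            mu[lo:hi] = prox_block(C, mu[lo:hi], g)  # prox of F_C^* on the simplex
        # Outer phase: single batch gradient step on c, since dD/dc = -A mu.
        c = c + step_c * (A @ mu)
    return mu, c
```

This mirrors the structure described in the talk: the inner loop is stochastic over cliques, while the multiplier update is one deterministic step per epoch.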
I have comparisons of our algorithm with formulations that people have proposed in the machine learning community and that are established as the state of the art; we compare with them because we know that if we do as well as or better than these methods, we are also achieving the state of the art. Among the methods I compare with, however, I do not have the classical methods that leverage belief propagation on the graph or something like that, because I compare with methods that have already been shown to be significantly faster. So I compare with these methods, which proceed by an approach similar to the one I presented, except that they use a penalty method, so the dual problem they consider does not enforce consistency between the different moments; and with a more recent formulation, which is somehow a greedy version of the previous ones. I have a synthetic data set and an actual semantic segmentation data set, and our algorithm is the one in orange or yellow. I am showing here the duality gap, that is, the convergence certificate between the primal and the dual: you see that we have linear convergence and that we are faster than this other algorithm; the contenders are not solving exactly the same problem, so they cannot have a duality gap that converges to zero. In terms of accuracy, we nevertheless reach the same performance, but that is maybe not so surprising. On a real segmentation data set, our algorithm is actually not doing as well as block-coordinate Frank-Wolfe, and this is something we do not quite understand well at this point: these block-coordinate Frank-Wolfe algorithms, even though they have different guarantees, seem to perform very well in some cases. Here it is a case where we have removed the entropy, so it is more like a support vector machine type of formulation, a max-margin structured problem. And I will stop here because I am over time. I presented a formulation that allows us to leverage stochastic dual coordinate ascent, for which we could show a globally linearly convergent algorithm in the primal and in the dual; our algorithm can be used in other problems in which the different terms are coupled by a linear constraint. There are a number of open questions about how we could improve this further: for example, the algorithm right now is stochastic in mu, but not in the multiplier updates. Thank you.