Okay, so first, thanks to the organizers for this nice invitation, and thanks to the European Research Council for funding the research I will try to summarize in this talk. I decided to be lazy and use the acronym of the project as the title. What I will be talking about is the result of a collaboration with a number of postdocs, students and colleagues; this is only a sampler of those who did the main work I will present.

The basic idea I would like to promote here is that there are fruitful connections between signal processing and machine learning, and that all the activity around dimension reduction and compressive sensing could be leveraged in ways that have not been completely explored in machine learning. So the question is: can we develop compressive machine learning approaches that leverage ideas from dimension reduction? I state it as a question because there will be more questions than answers, but I will provide some case studies, namely compressive clustering approaches that rely on random Fourier sampling ideas, and compressive Gaussian mixture model estimation. Then we will see that a few things can indeed be leveraged: there are some results about dimension reduction and information preservation, a kind of information-theoretic guarantee, and we will see how far we can get.

My one-slide view of machine learning is that we have data and we have some task, which at some stage consists in inferring some parameters: the parameters of a classifier, or the centroids when you do vector quantization. Lots of data, and some parameters to estimate. One of the main issues is how well the learned parameters will generalize when you get new data with the same underlying distribution. Take your favorite example from unsupervised techniques such as PCA, clustering or dictionary learning, or from supervised techniques such as classification.

One classical way of seeing it, which I would like to emphasize, is to take a geometric viewpoint on this type of problem. We consider the data as a cloud of points in high dimension; the cloud of points has some structure, and maybe the task is to infer this structure. In any case, you can represent it either as a point cloud or as the collection of columns of a large matrix. In many talks these were the rows of the matrix, but here I am using columns, which is so much more convenient; that is a remark from somebody who actually manipulates data files, since when you store them in MATLAB they come as columns. When we say high dimensional, there are actually two types of high dimension. The first is the dimension of each column, that is, the dimension of the feature: if each column is an image or a signal it can already be millions of dimensions, and even a SIFT descriptor is a few hundred dimensions. The other large dimension is the volume of the training collection. And so the question I want to address is: how much can we hope to compress this whole collection before we even start performing a learning task?
There are different ways of compressing. One way, keeping this geometric picture, is to think of each column of the matrix as a high-dimensional vector living in a high-dimensional vector space, and to use some projection, some sketching, some low-dimensional map, to send it to a lower-dimensional vector. We then get a collection of lower-dimensional vectors. In certain cases this may be enough, and it has indeed been used in machine learning, for example in the work of Calderbank. But in most cases this requires some knowledge, some model of the geometry of your cloud of points: that it has some low-dimensional structure. Another issue is that the impact is limited when you really have large collections: reducing the dimension of each feature vector has some impact, but a limited one. So there is an even stronger challenge in the era of large collections, which is to compress the collection size itself.

We can think of these collections as large point clouds, and there is an alternative way of trying to reduce the size of a collection, again related to Garvesh's talk this morning: the idea of sketching or sampling. Coresets, about which I know very little but which I think are related to sampling methods, find clever ways of sampling points in your data collection that will be relevant to your task. There are also sketching-based approaches; I am thinking of the early works of Cormode and others, where histograms in low dimension, with a finite number of bins, can be considered as vectors and actually sketched. However, when you are really in high dimension you cannot afford even to build a histogram, and you would not even know how to discretize the space. This calls for other approaches.

The spirit of what we have been investigating is that of sketching, with the idea that you start from a collection and you consider this collection as the empirical probability distribution of your data. What you would like to build is a sketching operator that starts from the collection and produces a finite-dimensional vector, the sketch. This sketch will usually be nonlinear in the feature vectors themselves, but it should be designed so as to preserve the information content of your data relative to the task you want to perform. The key point is that we can build it to be linear in the probability distribution of the data, and that is actually quite easy. This linearity is what enables a number of parallels with inverse problems and sparse reconstruction.

Before we go further, let us take an example. I am not yet describing the nature of the sketch, but this is an example of compressive clustering. Here I drew a cloud of points from an artificial mixture of four Gaussians; I guess you can see the four clusters. They are not too separated, but not too mixed either. The blue centers are the actual centroids of the Gaussians. And here we designed an approach in two steps.
In the first step we sketch: we compute a 60-dimensional vector. From this vector only, completely forgetting the rest of the collection, an algorithm inspired by sparse reconstruction estimates the centroids of the Gaussians, and here you can see a fairly good match between the estimated centroids and the original ones. (Question: so you take the data matrix and you turn it into one sketch vector, not one sketch per data point?) Exactly, this is an aggregated description of the whole data matrix. Let me contrast this with what I understood from Garvesh's talk: there, the matrix was sketched by multiplying it by a linear projection. The sketch I am going to build is not a linear projection of the individual data vectors, nor of the rows of the matrix; it is going to be linear in the probability distribution of the data. This should become clearer in a few slides.

What are the potential impacts of this type of method? In terms of memory and computational resources, if the collection size increases, typical methods for clustering such as k-means, or the expectation-maximization algorithm for Gaussian mixture models, have a cost, indicated by the pink curve, that grows with the collection size. With this type of approach you have to sketch the collection, so there is something linear in the collection size, but as we will see this can be distributed; this is the yellow curve, linear in the collection size but not that costly. Then there is a fixed cost for learning from the sketch. So if the collection size is relatively small it is not really worth it, but when you get to large collections you essentially have a fixed cost for learning. In terms of memory it is even clearer: you have a fixed-size sketch, so you can forget the collection, and you can compute the sketch online, as we will see.

(Question: but from a statistical perspective, wouldn't you want the size of the sketch to increase with the data?) It is possible. For the moment, to keep things simple, let us think of a given task and a fixed-size sketch. It is actually relatively easy, for example with doubling schemes, to progressively increase the size of the sketch, but it is not clear in which scenarios this is desirable, depending on the trade-offs between accuracy and complexity you want to reach. So let us keep it simple, think of it as a fixed size, and one of the questions will be what order of sketch size is reasonable. Again, I have more questions than answers here.

To make clear what I mean by the sketch, let me draw a parallel with the geometric picture I had before. In signal processing we are used to thinking of the objects we manipulate as vectors in a finite-dimensional space: sampled signals, sampled images, and we do low-dimensional projections. In machine learning and statistics, the natural object to manipulate is the distribution of your data, and you would like to compute something linear in this distribution that maps it to a finite-dimensional vector gathering the relevant information. How can you do it?
Well, if you really had the distribution, you could simply choose a number of functions and compute their expectations. The expectation of each function is linear in the underlying distribution, and you can also estimate it from finitely many samples by a simple average (or, if needed, by other, more robust estimators). This is nonlinear in the feature vectors, because the functions h_l that you choose can, and probably should, be nonlinear, but it is linear in the distribution. We realized recently that this can be interpreted as a finite-dimensional version of the kernel mean map embedding: you take a distribution and you map it to a Hilbert space, here a finite-dimensional one, where you can measure distances and work with it. Any questions at this stage?

With this sketching trick for machine learning we can also try to mimic the questions that have been addressed in signal processing around inverse problems. Here, once you have chosen a sketch, recovering a distribution from the sketch is a sort of generalized method of moments, but you don't necessarily want to do density estimation: the metric in which you are interested in reconstructing the density might be dictated by the learning task at hand. And while it is a generalized method of moments, we also want to do dimension reduction, so we have the possibility of choosing the sketch. In signal processing this would be compressive sensing; in computer science it would be sketching. You design sketches to preserve information, and probably also to have favorable computational properties. We would like to do the same thing here, and compressive learning would be about designing sketches with similar properties. Any questions?

(Question: couldn't we consider something like the empirical risk as a sort of sketch of the data?) Absolutely, the empirical risk is a sort of sketch, except that usually it is an infinite-dimensional one, because you need it for the whole family of possible parameter values. If you only have to do hypothesis testing, that is, compare the risk of a finite number of parameter values, then sure, you get a natural finite sketch. More generally, the question is precisely whether you can design a finite-dimensional sketch that gathers enough information to reconstruct your empirical risk, or at least to find its minimizer.

So much for the general picture I am trying to explore here with you. I would now like to give some highlights on compressive learning examples, mostly heuristic. They were done within the PhD theses of Anthony Bourrier and Nicolas Keriven, who is still doing his PhD, and in collaboration with Patrick Pérez. We have seen that the idea is to take a point cloud, considered as the empirical probability distribution of your data, and to design a sketching operator. Typically the sketch takes the form of averages of chosen functions h_l, and the challenge is to choose them so that they preserve the information relevant for your problem.
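In symbols, and with generic notation (this is only a restatement of what was just said, nothing beyond it), the sketch of a collection x_1, ..., x_n through m chosen functions h_1, ..., h_m reads:

```latex
% Empirical distribution and sketch (generic notation):
\hat{P}_n \;=\; \frac{1}{n}\sum_{i=1}^{n}\delta_{x_i},
\qquad
\hat{z}_\ell \;=\; \frac{1}{n}\sum_{i=1}^{n} h_\ell(x_i)
\;=\; \mathbb{E}_{x\sim \hat{P}_n}\!\left[h_\ell(x)\right],
\qquad \ell = 1,\dots,m.
% The map P \mapsto \big(\mathbb{E}_{x\sim P}[h_\ell(x)]\big)_{\ell=1}^{m} is linear in P,
% even when each h_\ell is nonlinear in x: a finite-dimensional analogue of the mean map.
```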
So let us go back to the compressive clustering example I showed. The standard approach here would be K-means: you alternately update your clusters, you manipulate all your data, you draw Voronoi cells, and so on. Here, the way we designed the sketch was by analogy with signal processing, where we know that if something is spatially localized, sampling it in the Fourier domain is a good way to make sure we don't miss anything. Think of a signal that is perfectly localized in space: if you take only a few random samples of it, you will probably miss the information, unless you take many samples; but if you sample in the Fourier domain, everything is well known to work. We use this analogy here: having few clusters means that the idealized version of the clustered distribution is a sum of Diracs; if you have K clusters, it is K Diracs, and a good way to sample such a distribution is to sample it in the Fourier domain. Sampling in the Fourier domain simply means sampling the characteristic function of your distribution, and here we sample the empirical characteristic function. These are the sketches we used.

(Question: this is related to what you said about the mean map; you essentially assume that the h_l belong to an RKHS. Is this basically linear sketching, but on transformed features, so you transform your x's and then sketch?) Yes, this is related, and we are still working on the relations. Probably a good way to design sketches is to design a kernel mean map, which is infinite-dimensional, and then sample it; the interplay between the two, and the number of dimensions you need when sampling, is not completely clear to us at this stage.

Here, precisely, starting from the signal processing intuition that you sample in the Fourier domain, you sample the empirical characteristic function, so essentially what you have to choose is a set of frequencies, the omegas. When you look at it and think about random Fourier features, this is very related: it is not directly a random Fourier feature, but it is a pooled version, the empirical average of certain random Fourier features, and it describes the distribution of your data. So this is the sketch. Behind it there are also a lot of questions about how you choose the sampling frequencies; these are details I will not give in this presentation. We have heuristics that, I think, need to be related to kernel mean map embeddings to make proper choices. If you are interested in more details, there is a recent paper being presented at ICASSP this week by Nicolas Keriven. In this example we used this type of sketching with 30 complex values, so 60 real entries, sampling the characteristic function. How do we obtain the result? We need an algorithm that starts from the sketch and estimates the centroids.
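As a rough illustration, here is a minimal Python version of this sketch computation, consistent with the description above; drawing the frequencies as isotropic Gaussians is a simplified stand-in for the frequency-design heuristic discussed in the papers, and all names are illustrative:

```python
import numpy as np

def draw_frequencies(m, d, scale=1.0, rng=None):
    """Draw m random frequencies in R^d (a simple isotropic Gaussian stand-in
    for the tuned radial distribution used in the actual heuristic)."""
    rng = np.random.default_rng(rng)
    return scale * rng.standard_normal((m, d))

def sketch(X, Omega):
    """Empirical characteristic function of the data X (n x d), sampled at the
    rows of Omega (m x d): an average of random Fourier features."""
    # z_l = (1/n) * sum_i exp(i <omega_l, x_i>)  -- linear in the empirical distribution
    return np.exp(1j * X @ Omega.T).mean(axis=0)

# Toy usage: two clusters in dimension 2, sketched into 30 complex numbers (60 real values)
rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 2)) + np.array([5.0, 0.0]) * (rng.random((10000, 1)) > 0.5)
Omega = draw_frequencies(m=30, d=2, scale=0.5, rng=rng)
z = sketch(X, Omega)
print(z.shape)
```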
For that, we need a model: we have taken an arbitrary distribution and sketched it, so there is no way we can reconstruct anything without a model, and the model here is a Gaussian mixture model with equal variances. All covariances are the identity, so the only parameters are the weights of the Gaussians, or clusters, and their centroids. If you assume that your distribution is a mixture of Gaussians, then by linearity of the sketching operator the sketch is itself a mixture of sketched Gaussians. The good news is that the characteristic function of a Gaussian has an analytic expression, so everything can be written explicitly, and you can design a decoding algorithm that mimics orthogonal matching pursuit and uses gradient descent steps, leveraging the explicit expressions of the gradients with respect to the parameters, to perform the reconstruction. In fact we use orthogonal matching pursuit with replacement rather than plain OMP, which gives significantly better results, presumably because of a better ability to escape local minima; that is my hand-waving interpretation. We have no convergence analysis or guarantees for this algorithm for the moment, but we observe that it performs really well, and I will show some more examples.

So this was compressive clustering. Can we extend it to more types of problems? I have already used mixtures of Gaussians with identity covariances, and it is not difficult to extend this to compressive Gaussian mixture model estimation. What we have done is not the fully general Gaussian mixture model, but mixtures with diagonal covariance matrices, which is already a richer set. With it you can adapt the algorithm: either reuse the algorithm we had before, just with gradients with respect to these less constrained Gaussians, or, and this is what will be used in the next experiments, use a variant that is computationally more efficient. The latter is not designed for clustered scenarios but rather for scenarios with overlapping Gaussians, where you do not really want to recover the centers but want the Gaussians to fit the data well; it uses hierarchical splitting to accelerate the algorithm. Any questions?
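To make the decoding step concrete before moving to a large-scale example, here is a deliberately simplified toy decoder. It is not the OMP-with-replacement algorithm just described, but it shows the structure: a closed-form model sketch for an identity-covariance mixture, and a sketch-matching objective fitted with an off-the-shelf optimizer. The function names, the optimizer choice and the random initialization are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_mixture_sketch(weights, centroids, Omega):
    """Sketch of sum_k w_k N(c_k, I): the characteristic function of each Gaussian
    is exp(i <omega, c_k> - ||omega||^2 / 2), so the mixture sketch is explicit."""
    damp = np.exp(-0.5 * np.sum(Omega**2, axis=1))           # (m,)
    phases = np.exp(1j * centroids @ Omega.T)                 # (K, m)
    return (weights[:, None] * phases * damp).sum(axis=0)     # (m,)

def decode_centroids(z_emp, Omega, K, d, rng=None):
    """Toy decoder: fit weights and centroids to the empirical sketch by minimizing
    the squared sketch mismatch (a stand-in for the greedy algorithm in the papers).
    With a random initialization it may need several restarts."""
    rng = np.random.default_rng(rng)
    theta0 = np.concatenate([np.full(K, 1.0 / K), rng.standard_normal(K * d)])

    def loss(theta):
        w = np.abs(theta[:K]); w = w / w.sum()                # nonnegative, normalized weights
        C = theta[K:].reshape(K, d)
        r = gaussian_mixture_sketch(w, C, Omega) - z_emp
        return np.sum(np.abs(r) ** 2)

    res = minimize(loss, theta0, method="L-BFGS-B")
    w = np.abs(res.x[:K]); w = w / w.sum()
    return w, res.x[K:].reshape(K, d)
```

With z_emp computed as in the previous snippet and the same Omega shared between sketching and decoding, this illustrates the shared-frequency, sketch-matching structure; the actual algorithm is greedy and far more scalable.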
So let's see an example, a proof of concept of how this can be used in a large-scale scenario. The scenario considered is speaker verification. Who is not familiar with speaker verification? Good. Speaker verification is the following: suppose you call your bank and say, "I am Rémi Gribonval and I would like to transfer all my money to the following account number." You are claiming an identity, and speaker verification is about checking that the identity is really the one you claim. It is not about deciding, among a million identities, who you are; it is just checking whether you are the person you claim to be. This is typically done with trained models of each possible claimed identity, but in general you have very little data to train a given speaker's model. So the training is done in two steps. First, you build a so-called world model with as much data as you can, representative of a wide diversity of speakers. Then you adapt this model to be representative of each speaker. For the first step you take a large collection; here we used the 2005 NIST database, about 50 gigabytes of data. For our scale this is small compared to the petabytes some people manipulate, but it is still thousands of hours of speech, and on it you learn a Gaussian mixture model. Two approaches are compared: either expectation-maximization or the proposed algorithm. Once this world model is learned, there is a so-called adaptation procedure for each claimed identity, and when Alice calls and asks to transfer some money, the speech she pronounces is compared both to the world model and to the model of the claimed identity, and a likelihood ratio test decides whether to accept the transaction or not.

So what we did was to compare the compressive approach with the classical EM approach in this setting. Looking in more detail at the database used to learn this universal background model: it consists of thousands of hours of speech. What are the individual vectors we consider? The speech is cut into small time frames, and in each time frame we compute MFCC coefficients in dimension 12. You do not need to know what MFCC coefficients are; they are more or less the equivalent of SIFT descriptors for audio. If you take the whole database there are about 300 million such coefficient vectors. In fact, many of these coefficients correspond to inactive frames, the silence between words, which are not really helpful for learning, and it is standard to first do silence detection. Even after silence detection you still have 60 million vectors in your training collection. Now, when we use a state-of-the-art EM C++ toolbox, the maximum size it can manage is 300,000 samples, those that fit in memory. So we first conducted an experiment with this number of training samples, either with the compressive approach or with expectation-maximization. Here you can see the picture: essentially a false positive versus false negative curve, where lower is better. The violet curve corresponds to EM, the other curves to the proposed approach with various sketch sizes. You can see that if you increase the sketch size it gets better, but it does not quite reach the quality of EM. Remember, though, that EM was limited by the collection size; we could not run it with more data, whereas our method can. (Question: was the sketch here also using those random Fourier features?) Absolutely. (Question: and how were the sampling frequencies chosen?)
The sampling frequencies: okay, so for the omegas there is a heuristic developed in the papers. Essentially they are isotropically distributed in direction, with a radial distribution that has low density around the origin, because the characteristic function of any distribution is always equal to one at the origin: even though the classical polynomial moments could implicitly be measured there, there is not much information near zero. So the heuristic samples more where the gradient of the sketch with respect to the parameters of your model is expected to be large. With this sampling strategy we can run the proposed approach with the sketch computed on the 60 million samples, and what we get is that for sufficiently large sketch sizes we go below the EM curve, so the results get better. In detail, if we use a small sketch, say 500 entries to represent the 60-million-sample collection, we are not quite at the performance of EM from this small representation, but not that far. If we want to essentially match the performance of EM, we simply take twice that sketch size, and with a slightly larger sketch size, still tiny compared to the initial collection, we get something better than EM. This is really a proof of concept; I am not claiming a comparison to the state of the art, and EM is probably no longer the state of the art in speaker verification. But it gives you an idea that by simply unrolling these ideas, these parallels, you get something that exploits the fact that you have a large training collection: by capturing empirical averages over the larger collection you probably capture more of its diversity, and you can exploit it better than if you simply subsampled 300,000 samples from the collection. This is also where you could wonder whether you should increase the sketch size: if you only have a few samples you can take a small sketch, but if you have more samples you would like to really benefit from them by starting to increase the sketch size. Okay, how much time do I have?
Okay, so now there are a few things I would like to talk about after this illustration of the general idea of sketching. The first one is an interlude about how this can be implemented and its potential computational efficiency; then we will dig into some possible theoretical connections with what is known about inverse problems and compressive sensing.

Regarding computational efficiency, this is the expression of the sketches as we propose to compute them, essentially by averaging random Fourier features. In terms of architecture, you start from your large collection of training samples, and then there is a matrix W whose rows are the frequencies at which you sample the empirical characteristic function. You first enlarge your vectors, since you measure more frequencies than the initial dimension, then apply a nonlinearity, the complex exponential, and then average to compute the sketch. This is probably reminiscent of one layer of a neural network, and I believe there may be more connections with neural networks there, but they are still to be investigated. There is recent work by the group at Duke on information preservation guarantees for neural networks, but I think they consider information preservation in the sense of being able to reconstruct the initial data, whereas here the take-home message would rather be that what you want to preserve is the distribution of the data, at least in the scenario considered here.

Now, when you think about this architecture in terms of privacy, and this is related to something that was partly mentioned earlier: once you have computed the sketch, you just hand out the sketch, not the rest of the data. To some extent this could help with privacy issues. Of course, if the sketch contains enough information to perform a certain task, you are not private with respect to that task; but if we can investigate information preservation further, with lower bounds, we may be able to guarantee that there is not enough information in the sketch to perform certain tasks. Besides, you can compute the sketch itself online, so it is compatible with streaming and with distributed computing; a small illustration of this is given at the end of this interlude.

In summary, with these sketches, in this particular compressive GMM scenario, you start from a large collection and you compute the sketch, achieving a high dimension reduction, and then, with certain algorithms that are memory-efficient and relatively computationally efficient, you are able to extract the information you need, provided everything has been designed so that you have guarantees of information preservation. In this talk I have shown empirical evidence of scenarios where information seems to be preserved; let us now see what the roads towards proving information preservation could be.
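Here is the small illustration announced above: since the sketch is an average, it can be updated one sample at a time, and partial sketches computed on disjoint chunks can be merged by a weighted average. A minimal sketch, assuming the same random Fourier feature map as in the earlier snippet:

```python
import numpy as np

def update_sketch(z, n_seen, x, Omega):
    """Online (streaming) update: fold one new sample x into the running average
    sketch z, where z is the sketch of the first n_seen samples."""
    phi = np.exp(1j * Omega @ x)              # random Fourier features of the new sample
    return (n_seen * z + phi) / (n_seen + 1), n_seen + 1

def merge_sketches(z1, n1, z2, n2):
    """Distributed setting: merge sketches of two disjoint chunks by a weighted average."""
    return (n1 * z1 + n2 * z2) / (n1 + n2), n1 + n2
```

Starting from z = np.zeros(m, dtype=complex) and n_seen = 0 and streaming through the data reproduces the batch sketch exactly.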
This is related to work we did a couple of years ago with Anthony Bourrier, Mike Davies, Tomer Peleg and Patrick Pérez, starting from the observation that the techniques classically used to analyze low-rank matrix completion, sparse recovery and inverse problems in general actually have a rather large generality, and that maybe they are worth packaging in a sufficiently universal way. Typically, in inverse problems you consider that you have a high-dimensional vector and you observe a low-dimensional version of it; without further information there is no way you can hope to reconstruct it. But if you know that your data is well approximated by a sparse vector, then there are algorithms with reconstruction guarantees. Under certain well-known conditions for k-sparse recovery, you are able to build an algorithm called a decoder; this is the terminology introduced in a nice paper by Albert Cohen, Ron DeVore and Wolfgang Dahmen. Ideally, from your measurements you want to be able to build a decoder that has reconstruction guarantees, where the reconstruction error is measured with a metric that I will leave quite fuzzy. The decoder delta is such that, even if there is some noise on your data, the reconstruction error on x is controlled essentially by the size of the noise, provided your data satisfies the model, which here is exact k-sparsity. This has been investigated, and we know cases where such decoders exist, but the nice contribution of Cohen and co-authors was to investigate the fundamental, information-theoretic question: when can I expect such a decoder to exist? We know a number of decoders, L1 minimization, greedy algorithms and so on; I expect that many of you are more or less familiar with them. In terms of guarantees, one possible focus is on so-called uniform, worst-case guarantees, which are related to the well-known restricted isometry property.

A question: who is not familiar with the restricted isometry property? Okay, enough people that I will spend some time explaining. These algorithms are designed to recover k-sparse vectors. You have k-sparse vectors, you project them to low dimension, and what you would like to be sure of is that there is no way you can confuse two k-sparse vectors: if you take one k-sparse vector and another, project them, and get essentially the same representation, then you are lost, there is no way you can hope to reconstruct k-sparse vectors. The restricted isometry property essentially says that if you take two k-sparse vectors that are sufficiently distant in the original space, they remain distant in the observation space. Since the difference between two k-sparse vectors is a vector with at most 2k nonzero components, the restricted isometry property is expressed as a preservation of the norm of every such difference.

This is what is known in the case of sparse recovery, but a number of other models have been considered in the literature: sparse vectors, vectors that are sparse in a dictionary, and many other low-dimensional models; pick your favorite one. In our case we are particularly interested in models where the low dimensionality comes from the fact that you have a mixture of few Gaussians, so somehow you have few parameters, and you would like to consider objects that do not really live in a finite-dimensional vector space but in the space of probability distributions, or of finite signed measures. The question is still the same: given some low-dimensional model (for the moment let us not specify what we mean by low-dimensional) in some space, and some measurement operator, think of the sketching operator, when can we expect that there exists a reconstruction algorithm, that is, a decoder with instance optimality guarantees? This question is not new; it is related to old questions, going back to the forties in the literature on embeddings.
People have been investigating questions such as: I have this fractal set, can I map it to finite dimension, and what dimension can I map it to? But I will not develop this further. Here we will see that the existence of such decoders is again related to a generalized notion of restricted isometry property. Instead of stating one new restricted isometry property for each type of model, why not state one for any model set? Here it is. We consider sigma, some subset of your ambient space, and we say that your measurement operator M satisfies a restricted isometry property on this model, or actually on the secant set of this model, the set of differences between two vectors of the model, if a certain inequality holds. I wrote it here in an asymmetric way, but you can write it in the usual way, with 1 minus delta and 1 plus delta, in a straightforward manner. What we can show is, first, that if there exists a decoder with this reconstruction guarantee, then necessarily the operator M satisfies the restricted isometry property; this follows from a fairly direct worst-case analysis. The second result, probably the most interesting one, is that if the restricted isometry property holds, then there is indeed a decoder with reconstruction guarantees. I will come back to the decoder; you are anticipating my next slides. It is just an existence result, and this is why I call these information-theoretic results: they do not deal with complexity trade-offs, and there are many nice questions around that, some of which I will try to evoke.

So if your matrix, or rather your measurement operator, since you are not necessarily in finite dimension, satisfies a RIP, then there is a decoder that provides exact recovery with stability to noise: whenever you take x in your model set sigma, measure it and add some noise, the decoder provides an error proportional to the size of the noise. The proof essentially follows by rewriting the arguments in the work of Cohen and co-authors; we have just shown that it can be extended to arbitrary model sets. And there is a bonus, which is probably the main difference with the early embedding results from the literature: there is also stability to modeling error. Even if your initial signal does not exactly belong to your model set, you measure its distance to the model set, with a metric that is also related to the model set, and this distance appears as an additive term in the inequality; the signal does not even need to be close to the model. So this is both stable to noise and robust to modeling error.

Now of course Francis asked: yes, but what is the decoder? Some work this year with Yann Traonmilin tries to provide some answers, but first let me exhibit the decoder used in the previous proof. You know you have an operator satisfying the restricted isometry property; what is the decoder? It is written here: given your observation, you find the x that minimizes a weighted sum of the data fidelity and the distance to the model set. You may be happy to manipulate this if the distance is not too abstract; in certain cases the distance to sigma is essentially related to an L1 norm, so it could look fine, except that you are computing, say, the distance to the set of k-sparse vectors, which is not so convenient to manipulate inside an optimization.
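To fix ideas, here is a schematic way of writing these statements, consistent with the verbal description above; the norms, the constants C and D and the weight lambda are generic, the precise values depending on the RIP constants in the actual results:

```latex
% Generalized RIP on the secant set \Sigma-\Sigma of a model set \Sigma (asymmetric form):
\alpha\,\|x - x'\| \;\le\; \|M(x - x')\| \;\le\; \beta\,\|x - x'\|
\qquad \text{for all } x, x' \in \Sigma.
% (Classical case: \Sigma = \{k\text{-sparse vectors}\}, so that x - x' is 2k\text{-sparse}.)

% Instance optimality of a decoder \Delta: stability to noise e, robustness to modeling error,
\|\Delta(Mx + e) - x\| \;\le\; C\,\|e\| \;+\; D\, d(x,\Sigma),
\qquad d(x,\Sigma) \;=\; \inf_{u\in\Sigma}\|x - u\|.

% The idealized decoder exhibited in the proof: a weighted sum of data fidelity
% and distance to the model set,
\Delta(y) \;\in\; \arg\min_{\tilde{x}} \;\; \|M\tilde{x} - y\| \;+\; \lambda\, d(\tilde{x},\Sigma).
```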
The good news is that these decoders are also noise-blind: you do not need to know the noise level; the decoder works and provides these guarantees whatever the noise level, so you do not need to tune it to begin with. But I agree it is not very convenient. Something slightly more convenient was proposed by Thomas Blumensath, who, together with Mike Davies, introduced the iterative hard thresholding algorithm for sparse reconstruction; the projected Landweber algorithm is just its generalization to arbitrary model sets. Here is something connected to the talk of Alexandre d'Aspremont yesterday, who manipulated proximal operators, and to the question by Guillaume Obozinski about what happens when the prox is not easily computable: this will pop up here. The idea of the projected Landweber algorithm is that it is an iterative, projected gradient scheme: at each step you take a gradient step to decrease the data fidelity term, which is an L2 term, and then you project onto the closest point of your model set (a minimal version is sketched in code a bit further down). This is perfectly fine when the model set consists of k-sparse vectors or low-rank matrices, sets for which the projection is known to be easy, but it is easy to exhibit a number of cases where computing this projection is NP-hard. This opens questions about characterizing the model sets for which the projection is actually computable, because for those you may have both dimension reduction and computability, which would be nice. There is a proof of convergence: with a well-chosen step size, related to the constants of the restricted isometry property of your operator, corresponding to delta smaller than 1/5, you can prove that the algorithm converges and that the iterates provide a stable recovery. If your data is in the model set you converge to it exactly; if it is not, you may not converge, but you will circle within a ball whose radius is not far from the noise level plus the modeling error. Again, maybe not so convenient.

Since there were a number of talks on convex optimization, you might rather expect decoders based on the minimization of some functional, and we know a number of them. The decoder would be: among all solutions within a prescribed data fidelity, find the one that minimizes some regularizer f(x); for sparse vectors, f would be the L1 norm, and so on. For a number of such pairs of model set and regularizer, many authors have proved that if the restricted isometry constant is below a certain value, then this is an instance optimal decoder. In particular, the results of Cai and Zhang are known to be sharp: the 1 over square root of 2 cannot be improved for L1 minimization, closing an endless series of works showing that 0.41, 0.42, 0.43 would work. Still, every time you get a new model and try a new regularizer, there is a new guarantee to be obtained. What we investigated was the existence of an underlying principle, and we got one. Please don't read this slide; the font is small precisely so that you don't read it. In fact, if you have a model set that is simply homogeneous, a cone, meaning that if you multiply by a positive scalar you remain in the model set, which includes unions of subspaces, and you take any regularizer, you can actually define a restricted isometry constant.
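Here is the minimal version announced above: a projected gradient (iterative hard thresholding) iteration for the special case where the model set is the set of k-sparse vectors, so that the projection is simply hard thresholding. The step size choice and the toy example are illustrative assumptions, not the tuned constants from the convergence result:

```python
import numpy as np

def project_k_sparse(x, k):
    """Projection onto the model set of k-sparse vectors: keep the k largest entries."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def projected_landweber(y, M, k, step=None, n_iter=200):
    """Projected gradient (iterative hard thresholding) for y = M x + noise, x k-sparse."""
    if step is None:
        step = 1.0 / np.linalg.norm(M, 2) ** 2    # crude step size from the operator norm
    x = np.zeros(M.shape[1])
    for _ in range(n_iter):
        grad = M.T @ (M @ x - y)                  # gradient of the L2 data fidelity
        x = project_k_sparse(x - step * grad, k)  # gradient step, then projection on the model
    return x

# Toy usage: random Gaussian measurements of a 5-sparse vector
rng = np.random.default_rng(0)
n, d, k = 80, 200, 5
M = rng.standard_normal((n, d)) / np.sqrt(n)
x_true = np.zeros(d); x_true[rng.choice(d, k, replace=False)] = rng.standard_normal(k)
y = M @ x_true + 0.01 * rng.standard_normal(n)
x_hat = projected_landweber(y, M, k)
print(np.linalg.norm(x_hat - x_true))
```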
This constant depends only on your model set and your regularizer, and with it you can say that any measurement matrix that has a RIP with delta smaller than this particular constant will be compatible with regularization by f on your model, and the corresponding decoder will have guarantees. What we would have liked to do was: take a model set and find out automatically what the right regularizer is. At first we thought the atomic norm would do it. The paper on atomic norms is nice, but if you look carefully at what it does, it says: here is my model set, but in fact this model set has special points in it, which I call the atoms, and they have special properties that are not really discussed, and from these I build an atomic norm that will work. (Comment: the atoms should be at the extreme points of the atomic norm.) Yes, but here is an example: take your model set to be simply the set of 2-sparse vectors. Probably the natural way of designing atoms is to take every normalized 2-sparse vector, and then the atomic norm is the so-called k-support norm with k equal to 2, and it does not allow recovery of points that are at the corners of the L1 ball, because the atomic norm is flat there. So I think the question of starting from the model and constructing the regularizer is still an interesting and open one. (Comment: if you start from a norm, you can define a model class adapted to that norm.) Yes, that is the way you can do it starting from a norm, but I am interested in the reverse direction: starting from a model, the vectors you want to reconstruct, how do you build your regularizer? One possible way we would like to investigate is to design it so as to maximize this constant. But we have also done some work showing that some sets do not have good convex regularizers. (Question: so it may be impossible to design an f that is convex and compatible with many model classes, even simple ones?) Yes, it may be impossible to have a convex one; note that here I did not even require f to be convex: take some f and some sigma, and I can define this constant. Of course, at some point you need to be able to optimize with it, but sometimes f is not convex while the squared loss plus f is convex; there is work by Ivan Selesnick on cases where the overall problem you pose is convex even though the penalty is not.

Just to give you an example of what can be obtained with these results, and I am probably at minus one minute now: you can unroll the mechanics of the previous theorem and recover existing results, get sharper results, or get guarantees for new examples. For example, I do not know if anybody is interested in recovering permutation matrices from low-dimensional projections, but you can do it provided you have a RIP constant of 2/3, and you can design measurement operators that satisfy this RIP with a number of measurements controlled by the covering numbers of the set of permutation matrices, which I do not have in mind, but it can be done. Since I am a bit short on time I will probably skip this; it was just to illustrate that you can already use this technique to design a regularizer for a model: for certain models you could either use the L1 norm, as done in classical papers, or, knowing the sparsity in different blocks, use a weighted L1 norm, and with this weighted L1 norm the admissible RIP constant is higher, so you get better recovery guarantees. It is high time to conclude.
The main message I have tried to convey here is that there are interesting connections to be exploited between signal processing and machine learning, with the idea that compressive sensing could give rise to compressive learning techniques, able to substantially reduce the size of the collection while preserving the information necessary to perform the task you want. I illustrated this with particular sketches that are nonlinear in the data but linear in the probability distribution. Just a last minute of advertisement for some further results I could not fit into the previous hour. For compressive clustering and compressive GMM estimation, these are the references of the papers where they can be found. Regarding information preservation, what I showed with this restricted isometry property is the first layer of the story, a worst-case analysis guaranteeing that the information is present in your low-dimensional projection. Then there is the question of how much dimension reduction you can perform while actually preserving the information. There are many current works trying to generalize, essentially, the Johnson-Lindenstrauss lemma to general model sets; in particular there is the work of Dirksen establishing links between the Gaussian width, which is a measure of dimension, and the dimension you can hope to project to. But it seems this is not yet the end of the story and that this dimension is not sharp, so there are questions about the right measure of dimension for a model set and, in particular, when you have a given learning task, how you should measure it. Some of the questions around this are related to the notion of compressive statistical learning: when what you want to reconstruct is the risk functional, for certain risk functionals we now know how to characterize the intrinsic sketch dimension needed to achieve it, but many questions remain open. So thank you for your patience and attention. Any questions or comments?

(Question: you used the RIP to do the analysis; is it easy to satisfy? Because I read in a paper by Candès studying the matrix completion problem that the RIP is not likely to be satisfied for matrix completion.) Absolutely. The RIP is necessary and sufficient to have uniform, worst-case reconstruction guarantees, but there are many analyses that do not require a worst-case viewpoint and that are rather about typical behavior, somewhat as in Candès' talk. In particular, for low-rank matrix reconstruction there are ways of sampling low-rank matrices that do satisfy the restricted isometry property, but this is not the pointwise sampling corresponding to matrix completion. If you want to do matrix completion, where you sample the matrix at a few entries, the measurement operator will never satisfy the restricted isometry property, because you can find very simple low-rank matrices, for example one with a single nonzero entry at an unobserved location, that are mapped to zero. So the restricted isometry property is feasible, but with measurements that essentially touch all the coefficients at once. Does that answer your question? (Follow-up: so you mean that although it does not apply to the matrix completion problem, it may apply to the matrix approximation problem?) It applies to matrix reconstruction problems with certain types of observations, but not with observations related
to the Netflix problem, where you have unobserved entries. (Question: even for k-sparse vectors it is hard to check that the M of your application satisfies this property; are there other, particular cases you studied where it is easy to check?) If your model set is a linear subspace, then it is essentially a condition number, and that observation still turned out to be useful in certain scenarios that you can revisit under the same umbrella. But apart from that, I think it is typically difficult; there are papers now showing the hardness of certain problems related to testing whether the restricted isometry constant exceeds a given threshold. Other questions?

(Question: could you say something about Tsybakov's restricted eigenvalue property; could it also be used to simplify this analysis?) You are talking about the restricted eigenvalue property? Okay, there I am in a slightly less familiar context, but as far as I know it is related to the descent cones of a particular cost function. Here I was talking about the ability to do reconstruction in general: is there information in your measurements or not? The restricted eigenvalue property, I think, is more related to the choice of a particular regularizer: you look at the behavior of the restricted isometry property on vectors that decrease this cost function. I think this typically leads to non-uniform guarantees, where you take a given point, see in which directions you can decrease the cost function, and analyze the restricted isometry property there.

(Question: from a practical point of view, how would you choose the size of the sketch? What kind of principle can you give for choosing it?) Okay, so first, among the things I did not really take the time to evoke, there are general arguments: when you try to generalize the Johnson-Lindenstrauss lemma, there are measures of dimension of your set that naturally appear, and if the size of your sketch sufficiently exceeds this measure of dimension, you should satisfy the restricted isometry property; in particular, for k-sparse vectors the k log(n/k) appears naturally. Of course there is a constant that needs to be tuned. For the Gaussian mixture models that we played with, you can essentially count the number of parameters: if you have k Gaussians in dimension d, you have k weights, k centroids and k variances; do the calculation and you know the size. We have empirically observed phase transitions for the algorithms at this size of sketch. (Follow-up: and if you only have data, say you split the data, and with the first part you try to learn the minimal size of the sketch for the rest of the data?) Okay, that is a different problem, which I have not paid too much attention to, namely how to choose your model for your data. Here the point of view was rather: I am given a model, or I choose my model, say I decided I will fit 64 Gaussians; what size of sketch should I use? What you are evoking seems much harder.