My talk is more focused than the previous one. It is essentially dedicated to the problem of detecting multiple change points in a time series, and the main aspect of this work is to introduce positive semi-definite kernels into this machinery so that we can detect particular kinds of changes; I will say more about that in a few seconds. It is joint work with Sylvain Arlot, Zaïd Harchaoui, Guillem Rigaill and Guillemette Marot. Okay, so let's have a look at the outline of the talk. First I will introduce some motivating examples, as well as the framework we will use in the sequel, in particular the kernels. Then I will give details about the algorithm for detecting change points, which is called KCP, and also discuss some computational aspects of this algorithm. Then we will discuss the statistical part of the talk, which is divided into two questions: first, assuming the number of segments is known, where are the change points? And second, how many change points should we detect? Okay, so let's start with the simplest problem, which is detecting changes in the mean of a signal. You observe the blue curve, which in this example has been generated from the regression function shown in black, a piecewise-constant function. You observe abrupt changes in the blue curve, and the main question is whether these changes correspond to abrupt changes in the regression function or not. Here it is the case: some of them are perhaps difficult to localize, and others are clearly false positives. So how do we do this in an automatic way? What are the purposes of this work? The first one is to be able to detect changes in the whole distribution of the observations, so in particular not only in the mean, as I just presented. For instance, here is a signal coming from biology; I will not give too many details about it, but the fact you can observe is that if you compute a local average of this signal, you obtain essentially the same value, about one half, everywhere. So all the classical techniques that detect changes in the mean of a signal are completely useless with such a structured signal. The second point is that we would like to be able to deal with complex data, which I split into two kinds: high-dimensional data, such as high-dimensional measurements or curves, and more structured objects such as video sequences, graphs, or DNA sequences. For instance, we would like to detect video segments in which a given action is happening, say someone playing music or people discussing. The main point here is that with such objects, people usually summarize the observations, the images, by histograms: at each instant you observe a histogram, and you have to detect changes in this time series of histograms. Another structured object we would like to deal with is, for instance, a time series of graphs. From time to time the graph changes, but the question is whether this change arises by chance or whether it reflects a structural change in the distribution that generated the graph. So we would like to be able to detect abrupt changes in such time series of graphs as well.
And of course, we would ideally like to deal simultaneously with all these different types of data and to provide efficient algorithms, as usual. So how do we do that? I first introduce the framework, which is based on kernels. In all that follows, I assume that I observe independent random variables X1, ..., Xn taking values in a given set, capital X. This set has no particular structure; the only requirement I make is that I can define a positive semi-definite kernel K on it, that is, a reproducing kernel in the sense introduced by Aronszajn in 1950. As long as you have such a kernel, you have the associated reproducing kernel Hilbert space (RKHS) and the canonical feature map phi, defined this way. The idea is simply to use this canonical feature map to map observations from the initial space into the RKHS, where there is a vector space structure you can exploit. So it is a versatile and convenient tool to deal with very different kinds of data, and our goal is to provide a unified analysis of very different types of data by use of kernels. Two instances are very classical, in particular in machine learning. The first one is the Gaussian kernel, which is defined this way. The second one, which is perhaps less classical, is called the chi-square kernel, and it is defined as the exponential of minus the chi-square distance between histograms; we will use it a little later in the talk. So what is the reason for introducing kernels? It is summarized by this identity. You take the observation Xi and, thanks to the canonical feature map phi, you write it as Yi. Now you work with the Yi's, and you can write that Yi, in the Hilbert space, is equal to its expectation in the RKHS plus an error term, which is defined simply as the difference, by construction. Why do this? Because we have to be cautious about mu_i star: it is defined as the expectation of an object that belongs to an infinite-dimensional vector space, so we have to take care of how this expectation is defined. The mean element mu_i star of PXi is defined as the unique element of the RKHS such that, for every f in the RKHS, the inner product in the RKHS between mu_i star and f equals the expectation of f(Xi). The assumptions needed for this are that H is separable and that the expectation of k(X, X) is finite. The main point of interest for us is that, for characteristic kernels, any difference between the distributions PXi and PXj implies a difference between the corresponding mean elements. So, since we are interested in detecting changes in the distribution, the problem reduces to detecting changes in the mean element over time. That is why we will try to estimate the sequence of mean elements, which is assumed to be piecewise constant. The point to keep in mind is that in regions where the signal-to-noise ratio is too low, it will be impossible to recover all the true change points, because of the noise, of course. That is why we only try to estimate the sequence of mean elements; in settings where the signal-to-noise ratio is large enough, this will of course provide us with the true change points. Okay, so now let us give some details about the algorithm we use.
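Before moving on to the algorithm, here is a minimal sketch of the two example kernels mentioned above. The precise formulas are on the slides, so the bandwidth conventions and the normalization of the chi-square distance below are assumptions of mine.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two vectors in R^d.
    The 2 * bandwidth^2 convention is one common choice, assumed here."""
    diff = np.asarray(x, float) - np.asarray(y, float)
    return np.exp(-np.dot(diff, diff) / (2.0 * bandwidth**2))

def chi2_kernel(p, q, bandwidth=1.0, eps=1e-12):
    """Chi-square kernel between two histograms (non-negative bins summing to 1):
    k(p, q) = exp(-chi2(p, q) / bandwidth), with
    chi2(p, q) = sum_b (p_b - q_b)^2 / (p_b + q_b)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    chi2 = np.sum((p - q) ** 2 / (p + q + eps))
    return np.exp(-chi2 / bandwidth)
```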
For a given segmentation tau into D segments, we quantify its quality with the same measure as the one used by Harchaoui and Cappé in 2007, denoted R̂n(tau), which has a simple expression. The thing to keep in mind about this quantity is that with the linear kernel, which is simply the dot product between vectors in Rd, it reduces to the classical least-squares empirical risk of the empirical risk minimizer. So we can measure the quality of a given segmentation tau with this quantity, and then we use the KCP algorithm, which is defined as follows. As input, you have the observations X1, ..., Xn and the kernel used to compare them and measure their similarity. In step one, for each number of segments D between one and a given Dmax, you compute the minimum, over all possible segmentations with D segments, of the criterion I just introduced, R̂n(tau). Note that this is a hard optimization problem; it is performed by dynamic programming, so that for each D we obtain the best segmentation with D segments. We will discuss this just after. The second step of the algorithm is, given this collection of best segmentations for each number of segments D between one and Dmax, to design a penalized criterion, defined as the sum of R̂n(tau_D) plus a penalty term that has to be made precise, and this is precisely what we do by model selection. Finally, by optimizing this penalized criterion, we obtain the best number of segments, and then the best segmentation with D hat segments. Let's focus on the first step of the algorithm, which is based on dynamic programming. The update rule of the dynamic programming algorithm is the following: for each number of segments D between two and Dmax, the cost of the best segmentation of 1, ..., n into D segments is the minimum over t of the cost of the best segmentation into D minus one segments up to time t, plus the cost of a single segment from t to n. The cost of a single segment has the formula on the slide. The usual strategy is first to compute the cost matrix, which is of size n by n, and to store it. This induces a space complexity of n squared, because you have to store this n by n matrix. Moreover, once this rather naive strategy is embedded in the kernel framework, it costs n to the fourth in time, because you have to compute n squared terms C_{s,t}, and each of them involves a quadratic number of terms, namely coefficients of a Gram matrix. So it is by far too costly to deal with large signals. With Guillem Rigaill and Guillemette Marot, we proposed a small change in this algorithm, summarized by this pseudo-code, which is based on two ideas. The first one is that we never store the cost matrix, because we do not want a space complexity of n squared. The second is that, looking at the algorithm, the crucial step is to compute column t plus one of the cost matrix from column t, that is, to have an update rule to compute each column from the previous one.
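To spell out the quantities involved, in my own notation since the exact formulas are on the slides, the kernel least-squares cost of a single segment and the dynamic-programming recursion read:

```latex
% Kernel least-squares cost of a single segment [a, b]:
C_{a,b} \;=\; \sum_{i=a}^{b} k(x_i, x_i) \;-\; \frac{1}{b-a+1} \sum_{i=a}^{b}\sum_{j=a}^{b} k(x_i, x_j).

% Dynamic-programming recursion, with L_{D,t} the best cost of
% segmenting x_1, \dots, x_t into D segments:
L_{1,t} \;=\; C_{1,t}, \qquad
L_{D,t} \;=\; \min_{D-1 \le s < t} \bigl( L_{D-1,\,s} + C_{s+1,\,t} \bigr), \quad D = 2, \dots, D_{\max}.
```

R̂n(tau) is then, up to the normalization by n, the sum of C_{a,b} over the segments of tau. And here is a minimal, hedged sketch in Python of how the column-wise update just described can be organized; it is my own reconstruction of the idea rather than the authors' pseudo-code, the `kernel` argument is any user-supplied positive semi-definite kernel, and the bookkeeping needed to backtrack the actual change-point positions is omitted for brevity.

```python
import numpy as np

def kcp_dynamic_programming(X, kernel, d_max):
    """Kernel change-point DP with column-wise cost updates.

    Returns L[d, t] = best cost of segmenting X[0..t] into d + 1 segments.
    A sketch: about n^2 kernel evaluations overall and O(n * d_max) memory,
    without ever storing the full n x n cost matrix.
    """
    n = len(X)
    L = np.full((d_max, n), np.inf)
    quad = np.zeros(n)   # quad[s] = sum_{i,j = s..t} k(X[i], X[j])
    diag = np.zeros(n)   # diag[s] = sum_{i = s..t}   k(X[i], X[i])
    for t in range(n):
        # New kernel column against X[t], and its suffix sums.
        col = np.array([kernel(X[i], X[t]) for i in range(t + 1)])
        suffix = np.cumsum(col[::-1])[::-1]      # suffix[s] = sum_{i=s..t} k(X[i], X[t])
        quad[: t + 1] += 2.0 * suffix - col[t]   # add the new row/column of the Gram block
        diag[: t + 1] += col[t]
        # Cost of a single segment [s, t] for every start s <= t.
        lengths = np.arange(t, -1, -1) + 1.0     # t - s + 1
        cost_s_t = diag[: t + 1] - quad[: t + 1] / lengths
        L[0, t] = cost_s_t[0]
        for d in range(1, d_max):
            if d <= t:
                # Best split: d segments on [0, s-1], one segment [s, t].
                L[d, t] = np.min(L[d - 1, d - 1 : t] + cost_s_t[d : t + 1])
    return L
```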
This allows us to avoid storing the whole cost matrix, and so to reduce the space complexity to be linear in n and the time complexity to be at most n squared. It is illustrated in this picture, where you see the time complexity of a naive implementation and the reduced time complexity of our improvement. As you can notice, we are able to deal with a sample size of 100,000 observations in about three minutes. So it is already quite useful, but it can still be seen as a limitation, because it prevents us from dealing with very, very large sample sizes. An open question that would be interesting to address is how to reduce this computation time further, for instance by using a low-rank approximation of the Gram matrix together with the pruning strategies that are actually used to reduce the computation time of dynamic programming algorithms. Another point would be to quantify what has been lost through such an approximation.

Okay. So the modification to the algorithm that you presented was an approximation? No, the algorithm I just presented provides an exact solution to the problem. But a way to reduce the computation time would be to use an approximation of the original Gram matrix and to run a pruned version of this algorithm, and the point is that you then have to quantify what has been lost, from a statistical point of view, by using this approximation.

Okay. So now let us talk about the statistical performance of step one of the algorithm. For each number of segments D, I said that we are able, using dynamic programming, to compute the best segmentation with D segments by minimizing this criterion. So what is the statistical performance of this procedure? To compare segmentations, we introduce two distances between them. The first one is quite classical: the Hausdorff distance. The second one is the Frobenius distance, defined as the Frobenius norm between the matrices M_tau, where the coefficient (i, j) of M_tau is equal to one divided by the cardinality of the segment if i and j belong to the same segment of the segmentation tau, and zero otherwise. With these distances between segmentations, we consider several scenarios to assess, from an empirical point of view, the quality of this first step, and we vary the choice of the kernel we use. For the first scenario, in this picture you see an instance of the kind of signal we have to deal with. First we fix a true segmentation with D star equal to 11 segments; the red dashed lines mark the positions of the true break points. Given this true partition, in each segment we randomly choose a distribution among a pool of seven of them, and all these distributions have different means and variances. So as soon as you move from one segment to the next, you have a change in the mean and in the variance; it is already a rather simple problem. And here are the results: on the left panel you see the performance in terms of the Hausdorff distance and the Frobenius norm for the Gaussian kernel, with the curves plotted versus the number of segments.
What we see is that the Gaussian kernel performs quite well, because the minimum location, that is, the best performance of the segmentation, is obtained when the number of segments equals the true one, D star equal to 11. If we compare this to what is obtained with a linear kernel, you see, looking at the Hausdorff distance for instance, that it is always decreasing. What this means is that even when the number of segments is far larger than the true one, you still have to add change points to get closer to the true ones than with the previous segmentation. Generally speaking, it means that the linear kernel detects changes in the noise. This is confirmed by these graphs, which show the frequency of exact recovery of each true change-point position; the true change points are in red, and you see the results for the Gaussian kernel and for the linear one. You observe that, over 500 repetitions, the Gaussian kernel detects exactly the true change points in at least 60% of the repetitions, whereas with the linear kernel the exact recovery frequency is only between 10% and 20%. If I turn to the second scenario, the main difference with the previous one is that here, when you move from a segment to the next one, there is no change in mean or variance: the distributions on either side share the same mean and variance, and the difference occurs in higher-order moments of the marginal distributions of the observations. Here you have an instance of the kind of signal you have to be able to deal with. The results are quite similar: the minimum is again located at the true number of segments. The conclusions for the linear kernel are similar, but its performance is even worse than before; here you see that the exact recovery frequency for the linear kernel is almost null, so you almost never recover a true change point; it only puts change points in noisy regions. We have also considered another kernel, the Hermite kernel, so called because it is related to the Hermite polynomials. The main idea of this kernel is simply that it is sensitive to changes in the distributions up to the first five moments. Its performance in terms of exact recovery frequency is better than with the linear kernel, of course, but you see that there is still a gap with respect to the Gaussian kernel, and it is mostly related to the fact that the Gaussian kernel is a characteristic kernel, which is not the case for the Hermite kernel. The last scenario is the following: at each position we observe a 20-bin histogram. In each segment we have randomly chosen the 20 parameters of a Dirichlet distribution, so that within each segment we generate histograms with 20 bins. Here you see an instance of the first three coordinates of the signal you observe. An important question in this scenario is that we are dealing with a structured object, because the coefficients of each histogram sum to one.
So the question is: is it necessary to take into account this structure of the data we are dealing with? To answer it, we compared two different kernels: the chi-square kernel I introduced a bit earlier, and the Gaussian kernel. The main point is that the Gaussian kernel ignores the structure of the data: it does not know that each observation is a histogram, it simply treats it as a vector in R20. What you see is that the performance of the Gaussian kernel is less accurate than what you obtain with the chi-square kernel, which exploits precisely this structure, and this is confirmed by the exact recovery frequencies, which are lower with the Gaussian kernel than with the chi-square one. So there is a potential gain in exploiting the structure of the data; that is the main conclusion of this third scenario. Okay, so now we come to the last part of the talk, which is mainly concerned with designing a penalty so that minimizing the penalized criterion provides an estimator of the number of segments, and this is done by model selection. The model is the following: each observation in the RKHS equals its expectation plus a noise term, and we assume that the sequence of mean elements is piecewise constant. For a given segmentation tau, we can consider the associated vector space of piecewise-constant functions built from this segmentation, denoted F_tau, and the estimator we consider is simply the empirical risk minimizer, denoted mu hat tau, which in our case is the orthogonal projection of the observations onto this vector space. So what is the best possible choice of penalty to find the best segmentation? It is what we call the ideal penalty, defined as follows: the ideal penalty is what we would have to add to the empirical risk in order to recover the best possible segmentation, which is called the oracle segmentation. First, how would we recover the best segmentation? Well, if we minimize over tau the true loss associated with the segmentation tau, we get the best possible segmentation tau star. So it is simply a matter of writing things this way: you can define the ideal penalty as the difference between this quantity, the true loss, and that one, the empirical risk, and writing it in another way, it is equal to the sum of a quadratic term and a linear term.
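Written out in my own notation, with Yi = phi(Xi) = mu_i star plus epsilon_i, Pi_tau the orthogonal projection onto F_tau (so mu hat tau equals Pi_tau Y), and the norm taken over the n-tuple of RKHS elements, this decomposition of the ideal penalty is, up to terms that do not depend on tau:

```latex
\mathrm{pen}_{\mathrm{id}}(\tau)
  \;=\; \frac{1}{n}\,\bigl\|\mu^{\star} - \hat{\mu}_{\tau}\bigr\|^{2}
        \;-\; \frac{1}{n}\,\bigl\|Y - \hat{\mu}_{\tau}\bigr\|^{2}
  \;=\; \underbrace{\frac{2}{n}\,\bigl\|\Pi_{\tau}\,\varepsilon\bigr\|^{2}}_{\text{quadratic term}}
        \;+\; \underbrace{\frac{2}{n}\,\bigl\langle (\Pi_{\tau} - I)\,\mu^{\star},\ \varepsilon \bigr\rangle}_{\text{linear term}}
        \;+\; \text{terms independent of } \tau .
```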
So our strategy is the following: we simply use concentration inequalities to provide a high-probability upper bound on this ideal penalty. We proved a concentration inequality for the linear term first, which is simply based on Bernstein's inequality, so it is quite classical, and another concentration inequality for the quadratic term, which is this one. It is stated under some assumptions. The first one is simply that the data are bounded in the RKHS, which holds as long as we consider a bounded kernel, for instance, or if we are in settings where the Xi's are themselves bounded. We also assume that the variance of the noise in the RKHS is bounded by a constant vmax; in particular, we do not assume that the data are Gaussian, and there is no constant-variance assumption. This result allows us to deal with Hilbert-valued vectors, and not only with vectors in Rd as in existing results of that kind. Our result says that the quadratic term is close to its expectation, with one deviation term related to the estimation error and another deviation term that depends on x. An important fact is that this dependence on x, which is related to the probability of the event on which the inequality holds, is of order x and not x squared. In particular, if you used a strategy based on Talagrand's inequality, for instance, you would get an x squared, and in our setting, where we have a very large collection of segmentations, this x squared prevents you from getting an accurate result. So it was important to obtain this order of magnitude for the second deviation term. Thanks to this concentration inequality, we were able to derive an oracle inequality, which simply says, under our assumptions, that if we define the best segmentation by minimizing this penalized criterion, where the penalty is defined this way, with constants c1 and c2 and D_tau the number of segments of the segmentation tau, then there exists a high-probability event on which the performance of the final estimator remains almost the same as the best performance achievable over the collection of estimators we consider, up to a constant larger than one and a remainder term. The interesting thing is that in our more general setting, since we consider Hilbert-valued vectors, we recover a penalty quite similar to the one derived by Birgé and Massart in 2001 under a Gaussian assumption and for real-valued observations. So the whole algorithm can be split into two steps: the first one is to compute, by dynamic programming, the optimum of this criterion over all segmentations with D segments; the second step is to find the best number of segments by optimizing this penalized criterion. Another important point is that c1 and c2 are constants, and they are chosen from the data by use of what we call the slope heuristic.
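As an illustration of step two only, here is a hedged sketch in Python. The penalty shape below, of the Birgé-Massart type the talk alludes to, is an assumption on my part; the exact form used in the paper may differ, and the numerical values of c1 and c2 in the usage comment are purely hypothetical, since in practice these constants come from the slope heuristic.

```python
import numpy as np

def select_number_of_segments(best_risks, n, c1, c2):
    """Step 2 of KCP: pick D_hat by minimizing a penalized criterion.

    best_risks[d - 1] is assumed to hold the minimum, over segmentations
    with d segments, of the empirical risk (output of the dynamic program).
    The penalty shape (D / n) * (c1 * log(n / D) + c2) is an assumed
    Birge-Massart-type shape; c1, c2 would be calibrated from the data.
    """
    d_max = len(best_risks)
    dims = np.arange(1, d_max + 1)
    penalty = dims / n * (c1 * np.log(n / dims) + c2)
    crit = np.asarray(best_risks) + penalty
    return int(np.argmin(crit)) + 1   # estimated number of segments D_hat

# Hypothetical usage, assuming `L` comes from the dynamic program sketched above:
# best_risks = L[:, n - 1] / n
# d_hat = select_number_of_segments(best_risks, n, c1=2.0, c2=5.0)
```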
Do you know, if I knew everything about the problem, whether I could get a formula for c1 and c2? Do they depend on the noise, or whatever? No, as far as I know it is a really difficult thing to get; it is still an open question. Can you get c1 and c2 only approximately? Actually, when we derive such a result, the values of c1 and c2 we get from the theory are probably pessimistic, because they come from concentration inequalities and the constants there are not optimized. If I were to play the devil's advocate: is estimating c1 and c2 as hard as estimating the number of segments? I would not say that, because the slope heuristic is based on theoretical arguments which say that, in some regimes, it works quite well for estimating these constants, so it is not really a difficult problem in practice.

If we now compare the penalized criterion we optimize, the black curve, to the true risk, the red curve, you see that in scenario one, which is the simplest one, both for the Gaussian kernel and the Hermite kernel, our penalty captures the behavior of the true risk reasonably well, and the minimum location is the same. You also observe that the frequency of exact recovery of all the true change points, when we choose D hat, is quite good, and it is even better if we allow small mistakes: if you count a candidate change point as a true recovery whenever it is within a small error of a few positions, then these frequencies grow up to 70%. The same conclusion holds for the second scenario, where all the distributions share the same mean and variance; you see that the distribution of D hat is a little more spread out than in the previous case, as it is actually a harder problem. And for histogram-valued data it also works quite well.

To summarize, we have provided and described an algorithm to detect changes arising in the distribution of the observations. It is relatively efficient, and it is theoretically grounded thanks to concentration inequalities and also an oracle inequality, which provides non-asymptotic guarantees on the performance of the algorithm. It allows us to deal with vectors and also with structured data such as graphs and so on, as long as you can define a kernel on them. There are many open problems. Some of them I have already mentioned, such as reducing the n squared time; of course you can use approximations instead of the exact dynamic programming algorithm if you want, but the key point is always to quantify what has been lost by using the approximation, and that is a really hard problem from a statistical point of view. We also have to investigate the link between the choice of the kernel and the type of changes it is sensitive to: depending on whether the change arises in the first or second moments or in higher-order moments of the distribution, a given kernel may not be the best for all of them simultaneously, even if it is characteristic, because in theory a characteristic kernel is sensitive to a change arising in any moment of the distribution, but not in the same way when it arises in the first moment or in higher-order moments. And the last point concerns the slope heuristic I mentioned: I have to say that using it induces an additional computational cost, because you have to explore segmentations up to Dmax, and as you saw, all the complexities, in time and space, are linear in Dmax, so it is an important feature of the problem. It would be a good thing to revisit the slope heuristic so as to save computational resources, both in time and space, while preserving the accuracy. Thank you. Questions or comments?
Go ahead. I have a doubt: how could you detect a change in the variance if two segments have the same mean but different variances? Won't the kernel estimates give the same estimate of mu star?

No, precisely not, thanks to what I mentioned before: when you have a kernel which is characteristic, any change between two distributions induces a change between the mean elements. As long as you use a characteristic kernel, if there is a change between these two distributions, then in principle you should be able to detect this change in terms of the mean elements; the mean elements cannot be the same, otherwise it would contradict the fact that the kernel is characteristic.

So the mu hats will be different too?

In theory, yes: the theory says that mu star i is different from mu star j. But when you estimate these quantities, it obviously depends on the quality of the estimator you use, and if you are in noisy regions, or if you have a very small number of observations, the estimators of these two quantities can be so similar that you cannot detect any change. But it is only a matter of signal-to-noise ratio, I would say. Is it clear? Yes.

You use a Gaussian kernel, so how do your results depend on the bandwidth of the Gaussian kernel, and how did you choose the bandwidth?

Of course it depends on the bandwidth, but I would say the dependence is not so strong, except if you choose a very, very bad bandwidth, for instance of order 10 to the minus 4, or at the other extreme 10 to the power 10. Except in such extreme situations, it works reasonably well. Of course there is an optimal bandwidth, but the results I have shown you were not obtained with an optimized bandwidth; we only used a bandwidth that gave reasonable results. So there is some room for improvement if we were able to choose that bandwidth precisely.

But in terms of the analysis, where does the bandwidth come in, as an assumption?

I think we do not have all the elements to answer that. If I show you the oracle inequality, you see that this norm here depends on the kernel, because it is defined this way, sorry, like this: the sum over all positions of the RKHS norm, and this norm depends on the kernel you have chosen. So the bandwidth appears here and also here, and it is quite difficult to say; for instance, you cannot optimize by saying that you will choose the bandwidth h such that this upper bound is as small as possible, because it also depends on this quantity, and it is not that simple.

Let me give a partial answer to that one: with the Gaussian kernel, the distance between mean elements is essentially the distance between kernel density estimates, and the bandwidth of the kernel is the bandwidth of those density estimates. So this gives you an idea of the influence of the bandwidth. And this leads to a question: as I just said, the difference between mean elements acts as a distance between distributions, which is meant to be very good for doing tests, for example, but it is not very sensitive to small changes in distributions. Would you try to normalize by the variance to get a better distance, with an estimated variance or...?

Yes, of course, but the analysis would be really more
difficult; so perhaps it is possible, but with many additional technicalities, I would say.

There is a paper in the works at Wisconsin trying to do this for tests, going from differences of mean elements to a normalized difference, and they are able to show that... Yep.

In your simulations, how do you choose Dmax? My question is, for example, for model selection: if you choose Dmax less than the true number of change points, you will not do so badly, because you will find the most reliable change points; and if you choose this model-selection parameter very big, you can overfit.

I agree. Actually, I would say that, except in very extreme situations, the choice of Dmax is really influential in one step, namely when I estimate the constants c1 and c2 from the data by use of the slope heuristic. The slope heuristic is difficult to explain in a few seconds, but roughly speaking it says that you can predict the type of dependence of the empirical risk on the dimension: with high probability, you can say what the behavior of the empirical risk should be as the dimension grows, or at least once it is large enough. The slope heuristic exploits the knowledge of this shape the empirical risk should have: you compare this predicted shape to the observed shape of the empirical risk, and the comparison provides you with the two constants. The point is that this comparison depends on the range of values of D you use to make it, and if it is too small you will get really bad estimates of these two constants, and it will degrade your results. So in practice, in our simulations, I chose Dmax equal to 100, and if you remember, D star was equal to 11; but it already provides quite reliable results with lower values of Dmax. Is it clear? Okay, thank you.

Last question, on the slide before, slide 39: could you get the constant delta 1 to go to 1 if n is sufficiently large? I would say it would not mean much, because there is a log n in this penalty, so you can decrease delta 1, but it will increase the other constant here; it is a matter of trade-off. And even if delta 1 were equal to 1, it would not really say that the procedure is optimal, because there is a log n here, and it is not very clear anymore.

Do you have results in terms of the recovery of tau? Because this is not giving recovery of tau, this is giving... From a practical point of view, there are the simulation results; from a theoretical point of view, I would say there is the PhD thesis of Damien Garreau, co-advised by Gérard Biau and Sylvain Arlot, which is doing exactly that, and he has consistency results showing that this procedure, the same algorithm or almost the same, provides a consistent estimator tau hat of tau star, the true segmentation.

Do you also have a notion of distance? You already mentioned some distances on segmentations; do you have guarantees on the size of the distance? I would say that I am pretty sure there is a rate of convergence in the Hausdorff distance, and probably in the Frobenius one as well. That is true, in fact; Damien is here. There is more on the bandwidth of the Gaussian kernel: I have a result on the number of change points, and it depends on the norm in the RKHS, so this norm is a function of the bandwidth, and...

No further questions for Alain? Okay, so let's thank Alain again.