So this talk is going to be about clustering, which is not something people have really studied much in parameterized complexity before; this is, if not the first, then one of the first works on it. And we will show some nice tractability results with interesting properties. Anyhow, the problem we are interested in is the clustering problem. Our input is a set of points in the plane or in higher dimension, generally in R^d, and we want to cluster them. What does this mean? We want to split them into clusters: basically say these points are the first cluster, these points are the second cluster, these points are the third cluster. We also pick a cluster center, another point, for each of the clusters. The cost is then computed per cluster: the cost of a cluster is the sum of distances from every point in it to the cluster center. So we sum the distance from this point to the center, this one to the center, this one to the center, and these numbers give the cluster cost; the same for all the other clusters, and the whole cost is the sum of the costs of the clusters. The distance we use can actually be different things. For instance, for the k-means problem it is the squared Euclidean distance; if it is just the Euclidean distance, the problem is k-median; it may also be the Manhattan distance, the Hamming distance, whatever you please. Probably the most studied notions are k-means and k-median. So this is the intuition, and this is how the problem is formally defined. We are given a multiset X of n vectors in Z^d; Z means integers, which is somewhat important for our study, and I will explain why later, but genuinely it is not that important. We are given k, how many clusters there should be, and we are given a cost bound D, and we want to decide whether we can fit our clusters into this cost bound. So basically, can we partition the points into k clusters, just k subsets of the points, and find k vectors in R^d such that this sum of distances is at most D, where we sum over all clusters, and inside each cluster we sum, over all points of the cluster, the distance from that point to the cluster center. This is stated here for k-means, where the distance is the squared Euclidean norm. About motivation: clustering is a very important tool in machine learning, data science and elsewhere, so I won't spend a lot of time motivating the problem; it is quite well known. Okay, so what was known about this problem? There is the famous Lloyd's heuristic: take some random centers, cluster the points, shift the centers, and repeat, which has been studied a lot. There are other heuristics based on it or independent of it, and basically that is how people solve clustering in the real world: they run some heuristic and it often works well. But from the complexity point of view, this problem is not that easy. We know that it is NP-hard even for two clusters, and NP-hard even in the plane. The good thing is that it is somewhat solvable if we have a really small dimension and a really small number of clusters: there is an algorithm running in time n^{O(dk)}, given by Inaba et al., based on some computational geometry.
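To make the cost function concrete, here is a minimal sketch in Python; the function name and the toy data are mine, not from the talk. It computes the cost of a given partition and set of centers, for both the k-means (squared Euclidean) and the k-median (Euclidean) objective.

```python
import numpy as np

def clustering_cost(points, labels, centers, dist="kmeans"):
    """Total cost of a clustering: sum over all points of the distance
    from the point to the center of its cluster.
    dist="kmeans" uses squared Euclidean distance,
    dist="kmedian" uses plain Euclidean distance."""
    cost = 0.0
    for x, c in zip(points, labels):
        diff = x - centers[c]
        if dist == "kmeans":
            cost += np.dot(diff, diff)               # squared Euclidean
        else:
            cost += np.sqrt(np.dot(diff, diff))      # Euclidean
    return cost

# Tiny example: 4 points in the plane, 2 clusters.
X = np.array([[0, 0], [1, 0], [10, 10], [11, 10]], dtype=float)
labels = [0, 0, 1, 1]
centers = np.array([[0.5, 0.0], [10.5, 10.0]])
print(clustering_cost(X, labels, centers))             # k-means cost
print(clustering_cost(X, labels, centers, "kmedian"))  # k-median cost
```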
So it's not really that fast, and I don't think people really use it in practice. Basically this doesn't look very good in terms of complexity: it's NP-hard here, NP-hard there, and we only have this XP algorithm. Moreover, there were also lower bounds, of the form that you cannot solve the problem faster than n^{Ω(k)}, even for constant dimension. But this was for a slightly different variant of the problem: in my definition the cluster center can be anything, any point in R^d, whereas in that paper the cluster centers had to be chosen among the given points. So up to this technicality, this result suggests that the n^{O(dk)} algorithm is essentially the best possible, though not really formally. So we can only hope to do well when both d and k are small. There was also, of course, a lot of work on approximation algorithms for k-clustering, because in practice people probably don't care about the exact solution, and a (1 + ε)-approximation is fine. Here there are many nice results, for instance the coreset line of research, which is long and extensive. For instance, there is a coreset of size poly(k/ε), I think even roughly k/ε². A coreset means that we can pick a small number of points such that the instance restricted to these points is essentially equivalent to the original instance; so it is basically a kernelization, in the sense of parameterized complexity. And since we can reduce the instance to this many points, we automatically get an algorithm whose running time depends mainly on k and ε. But from the parameterized complexity point of view, again, the problem wasn't studied much. Probably the closest prior paper to what I present here is one about binary clustering under the Hamming distance; it actually also studied low-rank approximation and other problems, but this is the result closest to ours. Remember that capital D is our budget, the total cost the clustering is allowed to have. If it is small, one can actually give an FPT algorithm: there is an algorithm running in time roughly D^{O(D)} times a polynomial in n and d, where n is the number of points and d is the dimension. The number of clusters k can be arbitrary here. Actually, if D is small, then k is kind of forced to be large: if you pay a really small cost compared to the number of points, then a lot of the points have to sit in their own clusters, or they have to coincide. So k here is either really big, or there are many identical points. So when we parameterize by D, bounding k additionally doesn't make much sense; k is supposed to be large, or at least not small. But formally k can be arbitrary in these results. This parameterization gives the problem more of an editing flavor: we have a data set which we know is close to an actual clustering of the points, but there is some small number D of mistakes, and we have to correct them to turn this into a real clustering. This will also be the setting of our work.
Yeah, so anyhow, in our study we focus on the exact version of the problem, so we don't allow any approximation; we want to solve the problem exactly. And we parameterize with respect to D, plus maybe some other parameters in some of our results. Again, this has the editing flavor: we want to take an instance which is nearly a clustering and tweak it into an actual clustering without spending too much time. Also, I was saying that the input is integer. It's not really that important, but the point is that when we bound D, there must be some kind of scale. If the input were arbitrary real points, then a small D would not really mean much, because the point set could just as well lie on a sphere of radius 1000 or be rescaled arbitrarily. When the points are integer, that gives a scale: when D is small, the points really are not far from each other, or a lot of them coincide. The cluster centers, though, are allowed to be arbitrary points in R^d. For most natural versions, like the Euclidean, L1, or Hamming norm, the centers are nicely structured anyway. For instance, in the k-means case the cluster center is always the coordinate-wise average of the points in the cluster, so it is a rational number with a fairly small denominator; for the L1 norm it will always be integer, because it is a coordinate-wise median of the points, and so on. We study the problem with respect to various distances. I was stating the problem just for the squared Euclidean distance, but generally we consider distances of the form d_p, where d_p(x, y) is the p-th power of the p-norm of x − y, that is, d_p(x, y) = Σ_i |x_i − y_i|^p. We also consider the special cases p = 0 and p = ∞: d_0 is the L0 or Hamming distance, just the number of coordinates where the points differ, and d_∞ is the L∞ norm, the maximum coordinate-wise difference. It is natural to see these as the two limit cases, one at p = 0 and one at p = ∞. And it is all interesting: picking a measure different from the Euclidean distance allows us to emphasize different things. In these two pictures you can see optimal clusterings for a smaller and a larger value of p. When p is lower, it basically penalizes the spread of a cluster less, but penalizes more the number of coordinates in which points change. So in the low-p clustering it is more acceptable to have these long stretched clusters than to group points which are close overall but differ slightly in many coordinates at once. This point was in the red cluster, but for the other p it moves into the green cluster, because that is no longer bad enough. You can see it from the distance definition: if p is small and the difference between x and y is large but only in one coordinate, that difference is raised to a small power, so it cannot contribute that much; but if the differences are small and spread over many coordinates, the cost gets relatively worse as p goes down.
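As a small illustration of this trade-off, here is a Python sketch (my own toy example, not from the paper) of the d_p measure, comparing one large coordinate change against many small ones.

```python
import numpy as np

def d_p(x, y, p):
    """d_p(x, y) = sum_i |x_i - y_i|^p, the p-th power of the p-norm.
    p = 0 is read as the Hamming distance (number of differing coordinates),
    p = inf as the maximum coordinate difference."""
    diff = np.abs(np.asarray(x, float) - np.asarray(y, float))
    if p == 0:
        return np.count_nonzero(diff)
    if p == float("inf"):
        return diff.max()
    return np.sum(diff ** p)

x = np.zeros(5)
one_big = np.array([4, 0, 0, 0, 0])   # one large difference
many_small = np.ones(5)               # many small differences
for p in (0.5, 1, 2):
    print(p, d_p(x, one_big, p), d_p(x, many_small, p))
# For p = 0.5 the single large change (2.0) is cheaper than five small ones (5.0);
# for p = 2 the single large change (16.0) is far more expensive than the five small ones (5.0).
```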
So the change of measure basically shifts the priorities between changing many coordinates and a large total change. Anyhow, that is about the different measures; let's start with some simple observations just to get a flavor of the problem. Imagine we work with d_2, the squared Euclidean distance. Then it is a well-known fact that, for a fixed set of points, the cluster center minimizing the total cost is always the coordinate-wise average: we just take the sum of all the points as vectors and divide by the number of points. So for the d_2 case we don't really need to find the centers; they are already determined by the clustering, if we know the clustering. And actually vice versa, and this works for any distance: if we are given the set of cluster centers, we can always determine the optimal clusters. Since every point pays the distance from itself to its cluster center, it always makes sense to assign the point to the cluster of its closest center; it never makes sense to put it into a cluster whose center is further away. So whenever the set of cluster centers is fixed, we can forget about the actual partition: the clustering always follows from the centers. This lets us reformulate the problem a bit: instead of asking for a partition into k sets plus k cluster centers minimizing the cost, we can just ask for k vectors, where every point is assigned to its closest center and we sum over all points; or, in our case, whether this can be made at most D. Anyhow, those are the simple things; let us move to the actual results. We start with the L1 case. Here the distance is just the Manhattan distance: the distance between two vectors is the sum over all coordinates of the absolute difference in that coordinate. Again we want to find k centers such that the total cost is at most D. For the case of L1 we know that if we fix a cluster, its optimal center is a coordinate-wise median: in the d_2 case it was the coordinate-wise average, now it is the coordinate-wise median. From that, a trivial n^{O(dk)} algorithm follows: in every coordinate of every cluster we can brute force all possible values of the median, since the value of a median is always one of the values present among the points in that coordinate. So in time roughly n^d we can brute force all possible cluster centers, and then choose k of them, which is at most n^{O(dk)}. But of course we want to do something better. So now I will present the algorithm which runs in time roughly D^{O(D)}, up to constants in the exponent, times a polynomial in the input size. Again, if D is small, this is good. I will briefly sketch the proof. Assume the points are distinct. If two distinct points end up in the same cluster, then that cluster already has cost at least 1, because the points are integer, so they differ by at least 1 in some coordinate, and the center cannot agree with both of them there.
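A tiny sketch of these two observations in Python (the helper names are mine): the optimal center of a fixed cluster is the coordinate-wise mean for d_2 and the coordinate-wise median for L1, and given fixed centers the optimal assignment sends each point to its nearest center.

```python
import numpy as np

def optimal_center(cluster, dist):
    """Optimal center of a fixed cluster:
    coordinate-wise mean for the squared Euclidean (d_2) cost,
    coordinate-wise median for the L1 cost."""
    cluster = np.asarray(cluster, float)
    return cluster.mean(axis=0) if dist == "d2" else np.median(cluster, axis=0)

def assign(points, centers, dist_fn):
    """Given fixed centers, the optimal clustering sends every point
    to its closest center."""
    return [min(range(len(centers)), key=lambda j: dist_fn(x, centers[j]))
            for x in points]

l1 = lambda a, b: np.abs(np.asarray(a) - np.asarray(b)).sum()
cluster = [[0, 0], [0, 10], [0, 1]]
print(optimal_center(cluster, "d2"))   # mean: [0.0, 3.666...]
print(optimal_center(cluster, "l1"))   # coordinate-wise median: [0.0, 1.0]
print(assign([[0, 0], [9, 9]], [[0, 1], [10, 10]], l1))  # [0, 1]
```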
And that means that we cannot have more than D of these composite clusters, where by composite I mean clusters containing at least two distinct points. So clusters like this I call composite, and there cannot be many of them, at most D. Moreover, there can be at most 2D points in composite clusters in total: in each composite cluster at most one point can coincide with the center, and every other point pays at least 1. So really only a very small number of points is interesting, in some sense, because the points which are not in composite clusters pay zero: each is the only point of its cluster and is itself the center, so it contributes nothing. We are only interested in identifying these interesting points and then clustering them; there are not too many of them. The next step is color coding. We randomly color the points, and with good probability all of the interesting points receive different colors. The success probability is the usual color-coding bound, which is good enough for us, since it is only exponential in D. Then we also try all possible ways to split the colors between the clusters. For instance, in this picture, we don't actually know the clustering, but we have assigned colors to the points at random and we assume that all the interesting points got different colors. We can then brute force that, say, red and blue go together in one cluster and the remaining colors go to the other cluster. We don't yet know which of the red points is actually in that cluster, but we can guess that the cluster consists of one red point and one blue point; we just don't know which ones yet. This gives a reduction to an auxiliary problem, which is probably interesting on its own. In this auxiliary problem, Cluster Selection, we have T sets of points and a budget D, and we have to choose one point from each of the sets, plus a center, such that the resulting cluster has cost at most D. It is essentially the problem of identifying one cluster when we already know that all the points of this cluster are colored differently: from each set exactly one point is selected, and together these points form the cluster; the cost is the cost of taking one point from each set and clustering them. This is the auxiliary problem, and the idea is that if we can solve it, we can solve our original problem: whenever we have a coloring and we have guessed the partition of the colors, then to identify cluster X1 we solve Cluster Selection on all the red and all the blue points, where all the red points form one set and all the blue points form another set, and our task is to select one red point and one blue point such that the cluster has small cost, where the budget for this cluster can also be guessed. When we repeat this for all the groups of colors, we solve the original problem, because for every group of colors we get a cluster which Cluster Selection guarantees to be optimal. So it only remains to solve this problem.
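To make the auxiliary problem concrete, here is a toy brute-force sketch in Python. It is exponential in the number of sets and purely illustrative, not the FPT algorithm from the paper; it assumes the L1 cost with the coordinate-wise median as the optimal center.

```python
import numpy as np
from itertools import product

def cluster_selection_bruteforce(point_sets, D):
    """Toy brute force for Cluster Selection under L1: pick one point from
    each set so that the chosen points, clustered around their coordinate-wise
    median, cost at most D. Exponential in the number of sets."""
    best = None
    for choice in product(*point_sets):
        pts = np.asarray(choice, float)
        center = np.median(pts, axis=0)          # optimal L1 center
        cost = np.abs(pts - center).sum()
        if best is None or cost < best[0]:
            best = (cost, choice, center)
    cost, choice, center = best
    return (cost <= D), choice, center, cost

sets = [
    [[0, 0, 0], [5, 5, 5]],   # e.g. the "red" points
    [[0, 1, 0], [9, 9, 9]],   # e.g. the "blue" points
]
print(cluster_selection_bruteforce(sets, D=2))
```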
Again, to get a flavor, some boring ideas for solving this problem first. For instance, we can try all possible selections by brute force, in time roughly the product of the set sizes: for every set we try which point we select, and once the selected points are fixed, finding the optimal cluster center is easy. We can also do the following: say we fix which point x1 we take from the first set; then we know the cluster center is at distance at most D from x1, because that distance is one term of the sum and the whole sum is at most D. So we take x1 and, in every coordinate, try shifting its value by up to D, up or down; that is 2D + 1 possible values per coordinate, so (2D + 1)^d candidate cluster centers in total. This is also not great, because it is exponential in the dimension d, but it gives the nice idea that fixing one point restricts the set of cluster centers we need to consider, and the actual solution follows this idea. Another observation: not only is the distance between x1 and the center c at most D, there are also at most D coordinates where x1 and c differ. So if we have identified x1, we could try all possible centers if we also knew the coordinates where they differ: they differ in at most D coordinates, and in each such coordinate by at most D, and since D is small we could try everything. But we don't know which coordinates these are. So the actually interesting part of the algorithm is to identify the set of coordinates where x1 and c differ, and for this we employ some machinery with hypergraphs. We fix our point x1 from the first set and build a hypergraph that records where the other input points differ from x1. The vertices of this hypergraph are the coordinates, one vertex for each coordinate from 1 to d, and there is exactly one edge per point in the input sets: the vertices included in the edge are exactly the coordinates where x1 differs from the point x_j for which we construct the edge. Since every point that can share a cluster with x1 must be close to it, because the whole cluster cost is at most D and so no two points of the cluster can be far apart, these edges are small, of size O(D). And if we look at one potential solution, it induces a sub-hypergraph with at most D vertices, because there cannot be more than D coordinates where things differ, and at most D edges, because there cannot be more than D points which differ from our point x1.
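Here is a small Python sketch (names are mine) of the hypergraph construction just described: one vertex per coordinate, one edge per input point, containing exactly the coordinates where that point differs from the fixed x1. Coordinates are 0-indexed here.

```python
import numpy as np

def difference_hypergraph(x1, points):
    """One vertex per coordinate; for every input point one hyperedge
    containing exactly the coordinates where it differs from x1."""
    x1 = np.asarray(x1)
    edges = []
    for x in points:
        edge = frozenset(np.flatnonzero(np.asarray(x) != x1))
        edges.append(edge)
    return edges

x1 = [1, 0, 2, 0, 3]
others = [[1, 5, 2, 0, 7],   # differs from x1 in coordinates 1 and 4
          [0, 0, 2, 0, 7],   # differs in coordinates 0 and 4
          [1, 0, 2, 0, 9]]   # differs in coordinate 4 only (singleton edge)
print(difference_hypergraph(x1, others))
# [frozenset({1, 4}), frozenset({0, 4}), frozenset({4})]
```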
So here is one simple example. We have a five-dimensional space, we have fixed x1 with some values, and we have some other points. Again, the vertices are one per coordinate, and for every point we construct an edge. The first point yields the edge on 2 and 5, because it differs from x1 in coordinate 2 and coordinate 5; this is the edge corresponding to the first point. Analogously, this is the edge corresponding to the second point, which differs from x1 in coordinates 1 and 5, and this one is a singleton edge, because that point differs from x1 in only one coordinate. And if we look at a solution, say we should actually take these three points, then the solution induces a sub-hypergraph which has not too many edges and not too many vertices. So the idea is, since we have this bound, to try to bound combinatorially the number of sub-hypergraphs we are interested in. If we can do that, we win. The plan is: we try all possible H of this form, and then we try to find H inside G. Of course, done naively this gives nothing, because there can be many hypergraphs and many possible locations of one hypergraph inside another. But we will use a theorem saying that if the fractional edge cover number of H is small, then we can actually do this sufficiently fast. The exact statement doesn't matter too much: basically, if that number is constant, then this factor is polynomial, this factor is polynomial, and this one depends only on D, roughly D^{O(D)}. This gives the desired running time, provided we can prove that the fractional edge cover number of H is small. So again, our general strategy is: try all possible H, try to find them in G, and provided the bound holds, the theorem lets us do this quickly. I should remark that this whole discussion is quite similar to the problem Marx was solving in his paper on the Consensus Patterns problem; ours is kind of a slight generalization, and the line of the proof is somewhat similar. As a reminder, the usual edge cover number is just the size of the smallest edge cover, and the fractional edge cover number is a relaxation of this notion. For the triangle, the edge cover number is 2, since we need two edges to cover all the vertices, but the fractional edge cover number is 1.5: if we assign weight one half to each edge, we cover all the vertices, where covering means that the total weight of the edges incident to each vertex is at least one. The main idea, which actually gives us the FPT algorithm, is that the hypergraphs we are interested in have fractional edge cover number at most 2. The intuition here is quite simple: each vertex of H is contained in at least half of the edges of H. From that the bound follows immediately: assign weight 2 divided by the number of edges to every edge, and then every vertex collects total weight at least 1. But why does every vertex lie in at least half of the edges? Well, what is a vertex?
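For the fractional edge cover number, here is a minimal sketch (assuming scipy is available; the triangle example matches the one just mentioned) computing it as a small linear program.

```python
import numpy as np
from scipy.optimize import linprog  # assumes scipy is installed

def fractional_edge_cover_number(num_vertices, edges):
    """Minimize the total weight put on edges so that every vertex receives
    total weight at least 1 from the edges containing it
    (the LP relaxation of the edge cover number)."""
    A = np.zeros((num_vertices, len(edges)))
    for j, e in enumerate(edges):
        for v in e:
            A[v, j] = 1.0
    # Constraints A w >= 1 become -A w <= -1 for linprog's A_ub form.
    res = linprog(c=np.ones(len(edges)), A_ub=-A, b_ub=-np.ones(num_vertices),
                  bounds=[(0, None)] * len(edges))
    return res.fun

triangle = [{0, 1}, {1, 2}, {0, 2}]
print(fractional_edge_cover_number(3, triangle))  # 1.5, versus integral cover number 2
```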
We just have to remember the construction. A vertex is a coordinate where x1 and the center c differ, and an edge is another point x. So an edge fails to cover a vertex exactly when x agrees, in that coordinate, with the point x1 we picked for the cluster. And if more than half of the points have the same value as x1 in some coordinate, then c must also have that value there, because with the L1 distance it never makes sense for the center to take a value different from more than half of the values in that coordinate; the coordinate-wise median is optimal. So such a coordinate cannot be a vertex at all, and this translates exactly into the statement that every vertex is covered by at least half of the edges. Then we can use all of this machinery, and that is it. Here is just a summary of the algorithm, which I will skip. Now, the funny thing about this whole problem is that for L1, and actually for all the d_p with p at most 1, this algorithm works, and the problem is FPT parameterized by D. But for the L2 norm, the Cluster Selection subproblem is W[1]-hard parameterized by D, and even by T as well. So this shows the same approach cannot work for L2, though we don't really know what happens for d_2 overall: we know this approach fails because the subproblem is W[1]-hard, but we don't know whether there is an FPT algorithm for the clustering problem itself, because W[1]-hardness of the original clustering problem in D does not directly follow from this. Anyhow, just to sketch why this is true, recall the Cluster Selection problem: we have some sets of points and we have to select exactly one point from each set, plus a cluster center, such that the resulting cluster has cost at most D. I will show a reduction from Multicolored Clique, which gives the W[1]-hardness; W[1]-hardness roughly means that, under standard assumptions, we cannot solve the problem in FPT time, that is, in time f(D) times a polynomial in the input size. The reduction is more or less easy. From the clique instance we produce a Cluster Selection instance where the coordinates are the vertices, every vertex is its own coordinate, and the point sets correspond to pairs of colors. Basically, we take all the edges between the red color class and the blue color class, turn every such edge into a point, and put them all into the same set; in the final solution we will pick exactly one point of this set, and the same for all the other pairs of colors. So for every edge of the original graph we have one point, with coordinates given in the natural incidence way: every edge produces a point with two ones in the coordinates corresponding to its endpoints. The edge {1, 2} has a 1 in coordinate 1, a 1 in coordinate 2, and 0 elsewhere; the edge {1, 3} has a 1 in coordinates 1 and 3 and 0 elsewhere. What Cluster Selection then does is select one point, that is one edge, from each of the groups, which for this instance means selecting one edge between every pair of color classes; so we are selecting a clique, if we are lucky. And one can actually show that if the cost is small enough, then the selection must be a clique: if we select a clique, all the ones are nicely packed into just k coordinates, the k clique vertices, and all the other entries are zeros.
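Here is a Python sketch (the function name and the toy graph are mine) of just the instance construction in this reduction: each vertex becomes a coordinate, each edge between two color classes becomes a 0/1 point with ones at its endpoints, and points are grouped by the pair of colors.

```python
def clique_to_cluster_selection(num_vertices, edges, color_of):
    """One coordinate per vertex of the graph; every edge {u, v} between two
    different color classes becomes a 0/1 point with ones exactly at
    coordinates u and v; points are grouped into one set per pair of colors."""
    groups = {}
    for (u, v) in edges:
        if color_of[u] == color_of[v]:
            continue  # edges inside a color class play no role in Multicolored Clique
        point = [0] * num_vertices
        point[u] = point[v] = 1
        key = frozenset((color_of[u], color_of[v]))
        groups.setdefault(key, []).append(point)
    # Cluster Selection must pick one point (= one edge) from each set, i.e. one
    # edge between every pair of color classes; with the right cost bound this
    # forces the chosen edges to form a clique.
    return groups

# Toy graph: 6 vertices, 3 color classes {0,1}, {2,3}, {4,5}.
color = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
edges = [(0, 2), (0, 4), (2, 4), (1, 3)]
for colors, pts in clique_to_cluster_selection(6, edges, color).items():
    print(sorted(colors), pts)
# The clique {0, 2, 4} corresponds to selecting the points for edges
# (0,2), (0,4) and (2,4), one from each color-pair set.
```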
Then we can bound the cost and see that if the cost is this low, the selection is a clique, and conversely, if the selection is a clique, the cost is not high; if the selection is not a clique, the ones are not packed that nicely and the cost is strictly larger. So this is a sketch of why Cluster Selection is W[1]-hard for d_2. This is an interesting dichotomy between the different distances. Anyhow, let me give the full list of results in this paper. I went through this algorithm, which is the main algorithmic result; it also generalizes to d_p for every p at most 1. We also show that Cluster Selection is W[1]-hard for d_2, and the same holds for every p greater than 1. So this again gives a nice dichotomy: for p at most 1 we know the problem is FPT; for p greater than 1 we don't know what the clustering problem really is, but we do know it cannot be handled by the same approach: it is either W[1]-hard parameterized by D, or it is FPT via a completely different approach. Another funny thing: for d_0, which is one limit of these measures, the original clustering problem itself is W[1]-hard parameterized by D. So for L1 we have an FPT algorithm, but for the Hamming distance it is W[1]-hard, and adding the dimension as a parameter still does not help. And for L-infinity we also know it is W[1]-hard parameterized by D. So for these two we have W[1]-hardness, for p at most 1 we have FPT, and for p greater than 1 we don't know whether it is FPT or not. That is the state of things after the results in our paper. Just to conclude, there are of course many open questions, but the ones that fill the gaps of our paper most closely are these. The first is more of a general question: for essentially all of these distances, we don't know whether the problem is FPT or W[1]-hard parameterized by d and k. There was the old XP algorithm in d and k for L2, and essentially it works for L1 and so on; for L1 there is the trivial one anyway. There was a hardness result, but for the restricted variant where the cluster centers must be selected from some fixed set, so we don't even know for L2 whether it is W[1]-hard in d and k in the general setting where the centers can be anything. And we also don't know this for the other L_p variants. So this is a really interesting question; I would rather doubt that it is FPT, but if it were, that would be really good. Again, for L2, as I already said, we don't know whether the clustering problem is FPT or W[1]-hard parameterized by D; we only know this for the Cluster Selection subproblem, and W[1]-hardness of the original clustering does not directly follow from W[1]-hardness of Cluster Selection. And of course there is the question of whether other metrics can be used to obtain something efficient, or at least something interesting, or whether other parameters of the dataset can be used, for instance known structural parameters of the matrix you can form from the points; maybe they are useful for designing FPT or other kinds of fast algorithms for this problem. Okay, this is all.