Okay, so I think we're live now. Welcome everyone. It's a pleasure to have Shai Moran from IAS, who's going to give the talk today. Before moving on to Shai, let me introduce everyone who's joining us today. It's nice to have a large number of groups. First I'll start with André from MPII. Welcome, guys. Then we have Benjamin from, sorry, I lost track here, UW, sorry, University of Wisconsin at Madison. Welcome, Benjamin. Then there's Clément with a group from Stanford. Welcome, guys. Then there's Irfan from Indiana University. Welcome. Fang Yi just joined us from the University of Michigan. Hi, Fang. Then there's Jinish with a group from Caltech. Welcome. And then Jiang Yang is joining us from Virginia Commonwealth University. Welcome. And finally, we have a group led by Nishant from the University of Victoria. Welcome, everyone.

All right. So it's a pleasure to have Shai. Before we start, let me thank everyone who's working for TCS+ behind the scenes: that's Ilya Razenshteyn, Clément Canonne, who's here, Anindya De, and Oded Regev, who are doing all the legwork. Let me also announce that the next talk, a couple of weeks from now, will be given by Danupon Nanongkai from KTH, and then two weeks after that it will be Michael Kearns from the University of Pennsylvania.

But today it's a great pleasure to have Shai Moran from IAS. Shai got his PhD in 2016 from the Technion, advised by Amir Yehudayoff. He then spent a year in California, split between UCSD and the Simons Institute; at UCSD he was working with Shachar Lovett, and now he's a postdoctoral scholar at the Institute for Advanced Study in Princeton. Shai has done an impressive amount of work in statistical learning theory and, more generally, complexity theory. Today he's going to tell us about some joint work with, sorry, Daniel Kane, Shachar Lovett, and Jiapeng Zhang. So welcome, Shai, it's a pleasure to have you as our speaker today.

Okay, thank you very much. Just a small correction: Amir Shpilka was also my advisor for my PhD. He should get the credit for that; he suffered me for three or four years. Okay, so yeah, thanks a lot for inviting me. It's really a great honor, and I hope I will be able to live up to the standards of these great talks that you host. Okay, so shall we begin with the talk? I assume yes.

Okay, go ahead.

Okay, so I'll talk today about two related projects, and the common theme they share is that they are both about comparisons: they both manifest the strength of comparison queries. Comparison is maybe one of the most basic algorithmic tools that we know, from the very early days of computer science: sorting algorithms, many data structures. Today we discuss two more manifestations of their potential power. The first part will be about machine learning, about active learning, and this is joint work with Daniel Kane, Shachar Lovett, and Jiapeng Zhang from UCSD.

So what is active learning? The talk is going to be completely self-contained; you don't need to know anything about machine learning, everything will be explained. In standard passive, or supervised, learning, we're given a bunch of labeled examples and we need to learn some unknown concept. In active learning, we assume that we are given unlabeled examples; we will soon see examples of this.
And we need to query the labels of hopefully few points and still predict as well as if we had gotten all the labels. This is useful when unlabeled data is cheap but labeling it is costly. For example, in the context of medicine, you may need to query a doctor to ask whether this data indicates some disease or not. So the question is: when is this possible?

Let's consider, just to get used to it, a very simple example. Assume there is some unknown threshold in one dimension that labels all the data, and what we are given are n points, these black points that you see at the bottom. We do not see their labels. Some of them are minus and some of them are plus according to this unknown threshold: there is some point t such that everything below t is minus and everything from t onwards is plus. Our goal is to make as few label queries as possible. In each round, we can pick a black point and ask what the label of this point is, and the goal is to make as few queries as possible while still being able to infer the labels of all n points.

As you can probably already see, one way to do it is using binary search. What does that mean? At first we get these n unlabeled points. We query the middle point. Say we see that it's positive; once we see it's positive, we can infer that everything to the right of it is also positive. And then in each iteration, we go to the half of the unknown interval, query its middle point, and infer at least half of the remaining points. So after a total of log n rounds, or log n queries, we infer all n labels. So this is one example where, even if you get unlabeled input, you can still make few queries — exponentially fewer than the number of labels you have.
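Here is a minimal Python sketch of the binary-search strategy just described. The oracle `query_label` stands in for the human annotator and hides the threshold; all names here are illustrative rather than from the paper.

```python
def infer_all_labels(points, query_label):
    """Infer every +/- label with O(log n) label queries, assuming the
    points are labeled by an unknown threshold t: '-' below t, '+' from
    t onwards."""
    xs = sorted(points)
    lo, hi = 0, len(xs)        # invariant: xs[:lo] are '-', xs[hi:] are '+'
    queries = 0
    while lo < hi:             # binary search for the first '+' point
        mid = (lo + hi) // 2
        queries += 1
        if query_label(xs[mid]) == '+':
            hi = mid           # xs[mid] and everything right of it is '+'
        else:
            lo = mid + 1       # xs[mid] and everything left of it is '-'
    return {x: ('-' if i < lo else '+') for i, x in enumerate(xs)}, queries

t = 0.37                       # hidden threshold, known only to the oracle
points = [i / 100 for i in range(100)]
labels, q = infer_all_labels(points, lambda x: '-' if x < t else '+')
assert all(labels[x] == ('-' if x < t else '+') for x in points)
print(q, "label queries for", len(points), "points")   # about 7 queries
```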
Unfortunately, this phenomenon, where you have such strong algorithms, does not hold in general. Even for very simple function classes it does not hold. Let's see an example: two-dimensional thresholds. What are two-dimensional thresholds? Now, again, we are given unlabeled points in the plane; we have n points in the plane. There is some hidden line such that everything to the left of it is labeled, say, positively and everything to the right of it is labeled negatively. We don't know this line; we just get the n input points, unlabeled. And our goal is again to make as few queries as possible so as to correctly infer all the labels. Now I hope to convince you that any algorithm must query essentially all labels.

Why is that? The reason is that you can find n points in the plane that are in convex position; you can take them on a circle, for instance. What does it mean that they're in convex position? It means that every point can be separated from the rest of the points by some line. As you can see, if the points are on a circle, this is indeed possible. Now consider the following adversary argument. The adversary will always answer you red: whenever you query a point, whether it's red or blue, he will tell you it's red. And as long as at least two points remain unqueried, there are two consistent continuations of this labeling: one in which the one point is blue, and one in which the other point is blue. So any algorithm will have to query basically all the points. Okay, so this is somewhat disappointing, because still, in machine learning, there is a lot of research on active learning. So what can we do next?

One direction is to assume that the data is nice, so that these kinds of examples are not possible. But another direction, which is actually also practically relevant, is to allow additional queries. What do we mean by allowing additional queries? Remember that we have a domain expert, the one that gives us the labels, and we have the algorithm that can query the labels in each iteration. Instead of just being able to ask whether a point is red or blue, we want to give the algorithm more possibilities. Now, of course, the important question is which additional queries we may allow, and of course this is problem dependent. You can consider a very strong model where any yes/no query is possible, and indeed this was considered in the 90s, but you can show that from a practical perspective this is useless, because most of the yes/no questions the algorithm will ask will be meaningless in many contexts. So, without getting into too many details, we need to somehow restrict the queries. And one direction that seems to be successful in practice is to enable relative queries.

Now, what are relative queries? I will not define them formally — I don't think there is even a formal definition — but let us see just two examples. Here is one relative query: you can ask which of two restaurants, say the Japanese place or McDonald's, is better. You query relative information. Or you can take three objects and ask: is object A more like object B, or more like object C? This is another example of a relative query. And experience with practical algorithms shows that human annotators are able to deal with such queries and to provide meaningful answers that accelerate the learning process.

Going back to the formal world, what we consider is perhaps the most basic relative query. Let me formally define what I mean by this. We assume that the class of functions from which the adversary labels the points is a class of Boolean functions, capital H, where every Boolean function in it is in fact the sign applied to some g, a real function from an underlying class capital F. So, for example, when the underlying class of real functions is the class of all linear functions from R^d to R, we get that H is the class of all half-spaces. And there are many function classes in learning that are of this type, for which there is some underlying class of real functions on top of which we take the sign: neural nets are examples, and also signs of polynomials, polynomial threshold functions, et cetera.

Now, what kinds of queries does the learner get to ask? As before, the algorithm can ask a label query: it can pick an unlabeled input point and ask what the label of this point is, which corresponds to the sign of g applied to the input point. The novel type of queries we allow are comparison queries: the learner can also pick two input points and ask, is g(x_i) larger than or equal to g(x_j)? Okay, so these are the kinds of queries the algorithm is allowed to ask. The goal is the same as before, to reveal all labels, but the learner now has more power and also gets to ask these comparison queries.
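In symbols, the query model just described looks roughly as follows (this is my notation, not necessarily the paper's):

```latex
Fix a class $F$ of real functions and let
\[
  H = \{\, \operatorname{sign}(g) : g \in F \,\}, \qquad g : \mathbb{R}^d \to \mathbb{R}.
\]
The annotator holds a hidden $g \in F$; on unlabeled points $x_1,\dots,x_n$
the learner may ask
\[
  \text{label queries: } \operatorname{sign}(g(x_i)), \qquad
  \text{comparison queries: } \mathbf{1}\!\left[\, g(x_i) \ge g(x_j) \,\right],
\]
and must output $\operatorname{sign}(g(x_1)),\dots,\operatorname{sign}(g(x_n))$.
```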
Okay, so let us consider a very simple example that hopefully will help us digest this. Let's go back to the case of two-dimensional thresholds, half-planes. What is the crucial observation here? There is a very nice interpretation of a comparison query. Let's say we already queried two negative points, so we know their labels are negative, and now we make a comparison query. I claim that if you have two points of the same label, then the comparison query exactly corresponds to comparing their distances from the separating line. Okay, if this is not clear, then please shout at me.

So, in this context, the game is as follows. We get n points in the plane. There is some hidden line that separates the points into two sets, and what we can do in each step is either query whether a point is on the positive side or on the negative side, or take two points and compare which one is closer to the separating line. This is the game. Now I claim that we can remedy the previous situation, in which we had to query all n points. So let me elaborate on what the algorithm is. Are there any questions about the problem we are going to solve?

Yeah, I have just one question. I was going to ask it later, but I'm just wondering about this: is it going to make a big difference if you actually reveal the value of g(x), instead of just whether it's bigger than g(y) or something?

Yeah, so if you reveal g(x), this is an example of a non-relative query, which for practical reasons seems to be useless. Revealing g(x) gives you the value, right? The value of the linear function at position x, and this is a lot of information, and I guess with such queries you just need something like d plus one queries to interpolate, right? If you have a linear function, it makes the problem kind of trivial. But also, practically — let's say the learning problem is to predict which restaurants you like or dislike. It makes sense to ask whether you like McDonald's more than Burger King, but it makes less sense to ask you to assign an absolute value to how much you like McDonald's. These kinds of queries are practically more difficult to answer: it's easier to compare two things than to assign absolute values. I mean, there's no canonical scale.

Okay, so going back to the problem we're going to solve. We get n input points like this, black input points; there is the unknown line, which we don't know; and our goal is to make as few queries, either comparison queries or label queries, as possible, and to reveal all the labels. So how does the algorithm work? In each step, we sample 10 black points, 10 points whose labels we don't know. We label all of them, so we know, say, that these are blue and these are red; we spend 10 queries just to label all of them. And then, using comparison queries, we find in each class — in the blue class and in the red class — the point that is closest to the separating line. This is just like finding a minimum in an array of at most 10 points; two minimums. Okay, so using comparison queries, we find in each class the point that is closest to the separating line. And now the observation is that we can build these cones. What are these cones? We know that the point closest to the separating line, say the blue one, lies on the convex hull of the blue points. So we take both of its neighbors on the boundary of the convex hull, and we consider this cone, the intersection of two half-planes, whose apex is at this point.
So I claim that every point in the blue cone has to be labeled minus, and every point in the red cone has to be labeled plus. And this is just because we know that the apex of the cone is at the nearest point to the line. Okay, so how do we proceed? All the points that are inside this region, inside this inferred region, we know their labels — we know that they are blue — so we can get rid of them. And then we repeat this process on the unlabeled points: again we sample a set S of size 10, label them, compare to find the two nearest points, build the cones, infer, and go on. Obviously we repeat as long as there is some unlabeled point, so at the end of the day we will have all points labeled. And the question is how many rounds it will take.

As we will later see, the crucial lemma is that in each round we infer half of the unlabeled points that remain. This is what we're going to discuss later: in expectation over the sampling of the 10 points, in each round half of the remaining points get labeled. So this gives us a bound on the expected number of queries: in each iteration we make 20 queries, we have log n rounds in expectation, so the expected number of queries is 20 log n. Okay, so is this example clear? Now I want to move on to describe the more general results, so if there are any questions, I'll be happy to answer.

So I've got a stupid question. Is there any meaning to the constant 10, or...?

It's a very good question. The constant 10 is — you can think of it as a combinatorial parameter that is assigned to the class of all half-planes, just like the VC dimension of half-planes is whatever, three. So it's another kind of combinatorial dimension that we introduce, and we will discuss it in more detail later. So 10 is some kind of dimension that is assigned to each class of Boolean functions. Any other questions? Okay, great.
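Here is a hedged Python sketch of the round-based algorithm just described. The oracles `label` and `closer` stand in for the human annotator and hide the separating line; their names are mine, and the cone test is simplified (the cone at the closest sampled point is spanned by all other sampled points of its class, rather than by its two convex-hull neighbors as in the talk — a sound but slightly weaker inference rule).

```python
import random

def in_cone(apex, gens, q, eps=1e-9):
    """Is q inside apex + cone(gens)?  Checks all pairs of generators."""
    qx, qy = q[0] - apex[0], q[1] - apex[1]
    for ax, ay in gens:
        for bx, by in gens:
            det = ax * by - ay * bx
            if abs(det) < eps:
                continue
            s = (qx * by - qy * bx) / det   # solve s*(ax,ay)+t*(bx,by)=(qx,qy)
            t = (ax * qy - ay * qx) / det
            if s >= -eps and t >= -eps:
                return True
    return False

def active_learn_halfplane(points, label, closer, sample_size=10):
    unlabeled, known, queries = list(points), {}, 0
    while unlabeled:
        sample = random.sample(unlabeled, min(sample_size, len(unlabeled)))
        for p in sample:                     # label queries
            known[p] = label(p)
            queries += 1
        for sgn in ('+', '-'):
            cls = [p for p in sample if known[p] == sgn]
            if len(cls) < 3:                 # need an apex plus two generators
                continue
            apex = cls[0]
            for p in cls[1:]:                # comparison queries: find the
                queries += 1                 # class point nearest the line
                if closer(p, apex):
                    apex = p
            gens = [(p[0] - apex[0], p[1] - apex[1]) for p in cls if p != apex]
            for q in unlabeled:              # inference step: costs no queries;
                if q not in known and in_cone(apex, gens, q):
                    known[q] = sgn           # sound since g is affine and apex
        unlabeled = [p for p in unlabeled if p not in known]  # minimizes |g|
    return known, queries

random.seed(0)
g = lambda p: p[0] + 2 * p[1] - 1            # hidden affine function
label = lambda p: '+' if g(p) >= 0 else '-'
closer = lambda p, q: abs(g(p)) <= abs(g(q))
pts = [(random.random(), random.random()) for _ in range(1000)]
known, q = active_learn_halfplane(pts, label, closer)
assert all(known[p] == label(p) for p in pts)
print(q, "queries for", len(pts), "points")
```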
So let me now discuss some of the more general results we have in this context. What we focus on is the class of half-spaces. We just saw the two-dimensional case; what can be said about three dimensions? Unfortunately, the first news I'm going to bring you is bad news: there are sets of n points in R^3 that require at least Omega(n) label and comparison queries. So, just like in the plane we had a counterexample — on the line we did log n label queries, and then in the plane we had this difficult example — similarly, if we go to R^3, we have a difficult example for comparison and label queries.

I think already here there is an interesting, kind of semi-formal, open question. Notice that in the one-dimensional case we could do very well just with label queries, and this failed in two dimensions. But in two dimensions, once we introduced comparison queries, which query two points — some information about two points — we could again do well, and this in turn fails in three dimensions. And my question is: is there a three-wise comparison, a query comparing three points in some sense, that would again remedy the three-dimensional case, so you could do log n in three dimensions? And maybe you'll get a hierarchy: some three-way comparison gives you this exponential speedup in 3D space, but then it does not work for R^4, and then you need some four-wise query. So maybe there will be an interesting hierarchy like that. Okay, but that's just a small detour.

So why are there examples like that, that require many queries? These point sets are exotic, and what I really mean by that is that we can do much better if we assume the data is well behaved. Let me be more precise. In particular, if we have bounded bit complexity or large margin, then we can do much better. More formally: assume now that each of the unlabeled points is on the integer grid, and the norm of each point is not too large — say the L-infinity norm is at most capital B, so each point is in this d-dimensional grid. Then we have a phenomenon similar to the plane: we can still reveal all n labels with order of log n queries, and the constant in front of log n is roughly d log B. So d log B is roughly the description length of each point. For example, if we just work with bit strings, so B equals one and our points come from the Boolean cube, then we get something like d log n — the dimension times log n. And just as a toy example, imagine that the unlabeled points are all possible points of the Boolean cube {0,1}^d. Then n in this case equals 2^d: you have all possible 0/1 points in R^d. So basically what this says is that all 2^d labels can be determined using just d^2 comparison or label queries. You reveal 2^d labels with just d^2 queries to the annotator. Okay? So this is the first positive result: if the bit complexity is bounded, then we retain the logarithmic dependency.

The margin-based bound is as follows. What is the margin? Here we assume that the convex hull of the positive points and the convex hull of the negative points are far away from each other — that's what margin means. And here also we get very good behavior: we get d log(1/margin) times log n. For those of you who do optimization, you can recognize this d log(1/gamma) from cutting-plane methods such as the ellipsoid method or Vaidya's method. Another remark: there is a dependence on the dimension in the margin-based bound — it's d log(1/gamma) — and in statistical learning it is natural to ask for dimension-independent bounds when one considers margin. But we have a lower bound for that: if you allow a very large dimension, then there are point sets with very large margin for which you need a lot of queries.
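In symbols, the two positive results just described read roughly as follows (my phrasing of the talk's statements; constants are only indicative):

```latex
\textbf{Bounded bit complexity.} If $x_1,\dots,x_n \in \mathbb{Z}^d$ with
$\|x_i\|_\infty \le B$, then all $n$ labels of a half-space labeling can be
revealed using
\[
  O\!\left( d \log B \cdot \log n \right)
\]
label and comparison queries.

\textbf{Margin.} If the convex hulls of the positive and the negative points
are at distance $\gamma$ (after normalization), then
\[
  O\!\left( d \log (1/\gamma) \cdot \log n \right)
\]
queries suffice; no dimension-independent bound of this form is possible.
```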
Okay. So the next five minutes are, I think, the only really technical discussion, so please concentrate a bit harder. Now I'm going to introduce this combinatorial dimension I mentioned before — the 10; what was 10 for the plane? So, what is the inference dimension? Again, it's some kind of combinatorial dimension: a number that is assigned to every class of functions, and we call it the inference dimension. This inference dimension essentially captures the query complexity of the class. I think the nice thing about it, or what really helped us technically, is that it reduces the analysis to just analyzing this combinatorial parameter. So now, say I give you a problem — some class of functions — and ask you what its query complexity is. You don't need to think about an algorithm explicitly; you just need to analyze some very specific combinatorial dimension. It may be hard to analyze, but it's very concrete and specific.

And just to make sure: this is query complexity in the comparison query model that you introduced?

Yes, comparison and label queries; we will soon see the exact statements. And before we get to the exact statements, let me just note that this inference dimension extends to any type of local additional queries. Say that, instead of using comparison queries, you want to use three-wise comparisons or any other type of comparison — they should be local in some sense, but most natural things are local, from my experience, in this context. A similar type of parameter will also capture other types of queries.

Sorry, three-wise comparison — doesn't that immediately reduce to two-wise comparison?

No, no. I mean, if we take a definition like in the example I had with objects A, B, C, then it reduces to two-wise comparisons. But I'm saying: if you now have some kind of query whose answer depends on just three input points, and you have a bunch of queries like this, then you can define an inference dimension for this type of queries — soon it will be clear how, once you see the definition — and for it you have a similar characterization. So basically, once you tell me, okay, I want to design an active learning algorithm, and the way I can communicate with the human annotator is label queries plus some additional type of queries that you specify, which depends on the problem and the annotator — then I can tell you: for this kind of query, there is this inference dimension, and you just need to understand the inference dimension to understand the information complexity of your problem. That's what I'm saying. But for now, let's forget it and just focus on comparisons; later, maybe, it will be clear.

Okay, so let me define the inference dimension. Remember that we have a class of functions H, and H is the class of sign(g), where g is in some underlying class of real functions, capital F. We are mostly interested in the case where capital F is the set of all linear functions, but this is a general definition. The inference dimension of H is the minimum number k such that for every realizable sample of size k, there is some point in the sample — God knows this point; you don't know it, but God knows it — such that God can remove this point from the sample, you ask all queries on the remaining k minus one points, and then God shows you the point again without telling you its label, and you can infer the label. So: for every realizable sample of size k, there exists at least one point whose label can be inferred from the queries on the other points.
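An attempt to write this definition in symbols (the notation is mine):

```latex
\textbf{Inference dimension.} For a realizable sample $S$ labeled by $g \in F$
and a point $x \in S$, say the label of $x$ is \emph{inferred} from
$S \setminus \{x\}$ if every $g' \in F$ agreeing with $g$ on all label and
comparison queries over $S \setminus \{x\}$ satisfies
$\operatorname{sign}(g'(x)) = \operatorname{sign}(g(x))$. The inference
dimension of $H$ is the minimal $k$ such that
\[
  \forall g \in F,\ \forall S,\ |S| = k:\quad
  \exists x \in S \text{ whose label is inferred from } S \setminus \{x\}.
\]
```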
Let's see two examples, and I hope it will be clear afterwards. The first example is thresholds on the real line. I claim that the inference dimension of thresholds is at most three. What do I need to prove to you? What I need to show is that for every three points, there is at least one point that I can remove, such that after querying the remaining two points I can tell the label of this third point.

There are two cases. Either the hidden threshold labels all three points positively — God sees all of this — and then I can remove the middle point: if you know the labels of the two extreme points and you know that both of them are plus, then the middle point must also be plus, because if it were minus there would be two sign changes, and for a threshold there is at most one sign change. So I repeat: if the three points all have the same label, you can hide the middle point's label and still infer it just from querying the other points. The other case is that not all three points have the same label — say two of them are positive and one is negative. Then one of the extreme points can be inferred, for the very same reason: there can be at most one sign change. Okay, so again, what I needed to show in order to prove that the inference dimension is three is that for every way of taking three points and labeling them with a function in the class, there is one point that can be removed and inferred.

So, Shai, two quick questions. One is: here you're just considering label queries; there are no comparisons in your computing of the inference dimension.

Very good point, very good point, right. Okay.

And the other question is: does this reduce to the VC dimension when you have only label queries?

No, no. Let me answer both questions. So the first comment was that we only use label queries here — in the inference process, we only used label queries. What we actually just argued is that even if we restrict ourselves to label queries, the inference dimension of thresholds on the one-dimensional line is three. And this is strongly related to the fact that we can learn all labels in the one-dimensional case using just log n queries — this binary search business that we did before. So this is a very good comment, and soon we'll see another example, in the two-dimensional case, where we also need comparison queries for the inference. And the second question was about the connection with the VC dimension. It's not connected; they're incomparable. You can cook up classes — they will not be very exciting — where one is large and the other is small, and vice versa. But for label queries alone, it corresponds to what is called the teaching dimension: it's not completely trivial, but one can prove that the inference dimension with label queries is in fact a version of the teaching dimension, at least — Hanneke defined it; I forget the name he gave it. But it's not the VC dimension, in any case.
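Going back to the threshold example, here is a small brute-force sanity check (my code, using label queries only, matching the argument above) that every threshold labeling of three points has a point whose label is implied by the other two:

```python
def threshold_labelings(k):
    # all labelings of k sorted points by a 1D threshold: -...-+...+
    return [tuple('-' if i < t else '+' for i in range(k))
            for t in range(k + 1)]

def inferable(lab, hidden, k):
    # Is lab[hidden] forced, given the other labels, over all threshold
    # labelings consistent with those other labels?
    consistent = {l[hidden] for l in threshold_labelings(k)
                  if all(l[i] == lab[i] for i in range(k) if i != hidden)}
    return len(consistent) == 1

# For every labeling of 3 (sorted) points by a threshold, some point
# can be removed and its label inferred from the remaining two:
for lab in threshold_labelings(3):
    assert any(inferable(lab, h, 3) for h in range(3)), lab
print("inference dimension of 1D thresholds is at most 3")
```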
Okay, so let's now see another example that will hopefully help us digest the definition. I claim that the inference dimension of half-planes is at most seven. Again, what do I need to show you? That given any seven points and any labeling half-plane, there is a point that can be removed — God can remove a point — and later we can infer the label of this point just from the queries on the remaining points.

The proof is very similar to the algorithm we saw earlier. If we have seven points, then at least four of them have the same label, say plus, say red, just like in the picture here. Then, after the label queries, using comparison queries we can find the nearest point, and then we build this cone, and the point that is within the cone can be removed — the point circled in green. Notice it's the same logic that we followed in the algorithm. Really, the point here is that any cone in the plane is determined by three points: if you take a cone spanned by some bunch of points, then there are three of the points that form a basis for this cone. That is basically what it boils down to. Okay, so this shows that the inference dimension is at most seven. It is in fact less than seven — it's only five — but, yeah.

Okay, so let me now state the general theorem. If we have a class H whose inference dimension is k, then there is an algorithm that infers all n labels of any realizable sample with just k log k times log n queries. And the algorithm is very similar to what we saw before. In each iteration, you sample 2k of the remaining unlabeled points, uniformly at random. You query all of their labels, and, using comparison queries, you sort the positive points and the negative points according to their g-values. And then you just infer — this is an abstract step: you now have all this information from the queries on the 2k points you sampled, and for any other point, any point you did not sample, you ask yourself whether its label is implied by the queries you just made; if it is, you label it. Then you remove all inferred points and repeat the same iteration on the unlabeled points. And the lemma is that in each iteration, in expectation, you label half of the remaining points. The proof of this part is not complicated, but I will not discuss it here.

There is also a lower bound: if the inference dimension is larger than k, then there are samples for which you need at least k label or comparison queries. So basically, whenever this inference dimension is small, something non-trivial can be done, and if it's large, there is a lower bound. If it's infinite, for instance, then nothing non-trivial can be done.
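Stated compactly (my phrasing; constants suppressed):

```latex
\textbf{Theorem (roughly).} Let $k$ be the inference dimension of $H$.
\begin{itemize}
  \item (Upper bound) There is an algorithm that reveals all $n$ labels of
        any realizable sample using $O(k \log k \cdot \log n)$ label and
        comparison queries in expectation.
  \item (Lower bound) If the inference dimension exceeds $k$, then there are
        realizable samples on which any algorithm must make at least $k$
        label or comparison queries.
\end{itemize}
```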
Okay, so that is it about the first part. The second part is about showing an application of this methodology that we developed, of this inference dimension, to complexity theory. But first let me ask if there are any questions about the first part. Anybody?

Yes, I have a question. I'm just wondering if it's possible to say more about the relationship between the inference dimension and the particular kinds of comparison queries that are allowed.

What do you mean by particular kinds of comparison queries?

You said that you define the inference dimension with respect to having some comparison queries, but you didn't say — I mean, in the examples, for the half-planes, you gave just the simple kind of comparison query, but the inference dimension is defined for a class of comparison queries, and you didn't really say what they are.

I think I managed to confuse you. Okay, so the point is as follows — do you see the slide now? This is how the comparison queries are defined. You have a class of Boolean functions H, which is obtained by taking the sign of some class of real functions F. We only focused on the case where the class F of real functions is the class of linear functions, and then we get half-spaces. But comparison queries are defined as they are here: there is the unknown function g — or rather sign of g, but there is this g that labels our points — and a comparison query just asks the annotator whether g applied to this point is larger than or equal to g applied to that point. And when g is a linear function — in the case where g is linear and, say, both g(x_i) and g(x_j) are non-negative — this comparison query is equivalent to asking which of the two points is farther from the hyperplane that bounds the half-space, the zero set of g. That's what we used in the plane. So for a linear, or affine, function g, we have g(x_i) larger than or equal to g(x_j), when both are non-negative, if and only if x_i is farther away than x_j from the zero set of g. And that's what I meant in the geometric picture: the comparison query just takes the hidden function and asks on which of the two points it gives a larger value, but in the case of half-spaces there is a natural geometric interpretation in terms of distances from the zero set. Does that clarify?

I was asking about the definition of inference dimension. In the definition of inference dimension, you say the label can be inferred from comparison and label queries, but — I mean, you could imagine having different kinds of comparison queries.

Okay, so what we formally defined is only with respect to the comparison queries on the previous slide, the ones we just discussed. So this is what you can do: you can take the remaining k minus one points, and for each pair of them you can ask, is g(x_i) larger than g(x_j), and for each point you can ask what the sign of g(x_i) is. And from these particular queries you need to be able to infer the label of the point that was removed.

And I ask because you mentioned more general comparisons.

Exactly, yeah. So I didn't give examples, but what I'm saying is that if you specify other types of queries — say, for instance, three-wise comparisons — then you can still make the same definition; you can extend this definition to other queries. We do need some locality of the type of queries in order for the theory to extend, but the definition makes sense also if, instead of two-wise comparison queries, you have three-wise comparison queries. Or, I don't know, you take some points, you build a matrix, and you look at the sign of its determinant, or something. So you can think of other types of queries that one can...
That could potentially give you different dimensions.

Yeah, that's correct, that's correct.

That's what I was asking, yeah.

Okay, okay, so I hope...

Yep, yep, thank you.

Okay, are there any other questions before we move to part two, which is much shorter, with just a simple reduction? Are there any other questions? Okay, so let's continue.

So the second part is basically an application of the first part. Let me first define the 3SUM problem. In the 3SUM problem, we get as input an array of n numbers, x_1 to x_n, and one formulation of the problem is that we need to decide whether there is a triplet i, j, k such that x_i + x_j + x_k = 0. There is a trivial algorithm that just goes over all triplets and checks whether they sum to zero; this takes roughly n-cubed time. A nice exercise is to improve it to an n-squared algorithm using sorting somehow — one standard solution is sketched below.
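Here is one standard solution to that exercise (a spoiler, and my code rather than anything from the talk): sort the array, then for each fixed element scan the rest with two pointers.

```python
def three_sum(xs):
    """Decide whether some x[i] + x[j] + x[k] == 0 (distinct indices).

    Sorting costs O(n log n); the two-pointer scan is O(n) per fixed
    element, so the total running time is O(n^2).
    """
    a = sorted(xs)
    n = len(a)
    for i in range(n - 2):
        lo, hi = i + 1, n - 1
        while lo < hi:
            s = a[i] + a[lo] + a[hi]
            if s == 0:
                return True
            if s < 0:
                lo += 1          # need a larger sum
            else:
                hi -= 1          # need a smaller sum
    return False

print(three_sum([5, -2, 9, -3, -6, 1]))   # True: 5 + (-2) + (-3) == 0
print(three_sum([1, 2, 3, 4]))            # False
```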
In '95, in a paper by Gajentaan and Overmars, they showed that the 3SUM problem is in fact a bottleneck for many, many other problems. Their context was computational geometry, in the sense that if you can improve the quadratic-time algorithm for 3SUM, then you also get an improvement for a whole bunch of other problems studied in computational geometry. And based on this hardness of 3SUM — the fact that it's a complete problem for a large class of problems — they also conjectured that this quadratic-time algorithm cannot be substantially improved. One example of a problem for which 3SUM is a bottleneck: you're given n points in the plane, and you need to decide whether they are in general position. For this problem, the best known algorithm is quadratic, and 3SUM-hardness would also imply it's optimal. So it seems like a pretty central problem for many other interesting problems. And also, more recently, there have been other connections in the context of fine-grained complexity.

Now, one avenue for at least convincing ourselves that 3SUM is hard, or that the 3SUM conjecture is true, is to try to derive lower bounds in simpler models of computation. For example, let's go back to sorting algorithms. If you ask a random CS graduate what the complexity of sorting is, they will tell you n log n, and if you ask them why, they will tell you: there is this information-theoretic lower bound of log of n factorial, and we also have merge sort, which matches this lower bound as an upper bound. Formally this statement is incorrect — in the Turing machine model we don't have an n log n lower bound for sorting, and it's not even precisely true, at least in some cases — but still, if we have a class of algorithms that solve sorting using comparisons, and we can show that using comparisons you cannot do better, this is some form of explanation, of evidence, for the complexity of the problem.

A similar avenue was taken in the context of 3SUM, and there it was noticed that many of the basic algorithms use a certain kind of query: in each step, you take a linear combination of your input array and ask whether it's equal to zero, larger than zero, or smaller than zero. For example, the basic simple algorithm that just goes over all triplets is of this kind, right? You go over every triplet and ask whether x_i + x_j + x_k is zero, larger than zero, or smaller than zero. That algorithm is in this model, but the more sophisticated algorithm is also of this type.

So the question is really about linear decision trees. What is a linear decision tree? It's a decision tree where every node is labeled by a query of this type, the out-degree of each node is three — equal to zero, smaller than zero, or greater than zero — and the depth is the maximum number of queries. We also look at the sparsity of a query, which is the number of variables with non-zero coefficients in the query. And the question is: maybe we can prove that any linear decision tree for 3SUM must have depth at least n squared? This would be analogous to the fact that any comparison-based algorithm for sorting requires n log n comparisons — has depth at least n log n.

Actually, in 2014 this conjecture was refuted by Grønlund and Pettie: they devised an algorithm in this linear decision tree model that makes only n to the three-halves queries, which is much, much less than n squared. And in terms of lower bounds, there are two papers, one by Ailon and Chazelle and one by Erickson, and they show that if the queries are very sparse — only 3-sparse, so you can only take a combination of three variables — then you cannot do better than n squared.

What we show in this work is that there is an LDT for 3SUM of depth roughly n log squared n, and the only queries it uses are these label queries — you pick a triplet and ask: is its sum zero, larger than zero, or smaller than zero — and comparison queries: you take two triplets and compare their sums. So, in these terms, I wouldn't even call it a linear decision tree; for me it's just a comparison decision tree. But if you insist on looking at it as a linear decision tree, then it is 6-sparse, right? The sparsity of each query is at most six. So it's a very simple kind of linear decision tree, and I personally prefer to look at it as a comparison decision tree. So you can see that just with comparisons, you can basically achieve the information-theoretic barrier of n log n, up to a log factor.

Wait, sorry, I got completely confused. Why is that not an n log squared n algorithm?

That's a good question. Because — for every n, you give me n as an input, and then I can take my time and I build this...

Oh, it's non-uniform.

This decision tree, yes. And maybe building this decision tree will take me n cubed or n to the fourth — it will actually probably be something like that. But then I build it, and then — so you can also think of it in an amortized sense, right? If you tell me, look, Shai, for the next thousand years you're going to solve 3SUM on n equals one billion, then I tell you, okay, let me work a little bit, spend maybe a year, and I'll build this decision tree, and then, in an amortized sense, it will be very, very efficient. And actually this is also interesting: whether you can incorporate these things into a data structure. But yeah, okay. Is it clear?

Yeah, thank you. So how does your upper bound not contradict the Ailon–Chazelle lower bound of n squared for higher sparsities, since you've got sparsity six?
Because the lower bound for higher sparsities — for four and five — decays very fast as you go from k-sparse to 2k-sparse queries.

Oh, okay.

It doesn't contradict it, and the reason is that the lower bound is very weak when you reach 2k; it's actually undefined at 2k, and we are at 2k, right? Yeah. So another remark is that you can also use the same algorithm for k-SUM, and you get n times k times log squared n for k-SUM.

Okay. So now, before I discuss some interesting open problems, let me just explain the connection between this and the inference dimension business. The idea is very simple. Remember that in the inference dimension business we have these n points in R^d, we want to reveal all the labels, and we get to ask these comparison queries and label queries. So here, assume that the unlabeled points are all points of the n-dimensional Boolean cube with exactly three ones, and think of the input array as the normal to the hyperplane — to the half-space — that labels these points. We have this x_1 to x_n, and the function g is basically the inner product with (x_1, ..., x_n). So, as you can see, a point p of Hamming weight 3 is labeled positive if and only if the sum of the three corresponding entries is negative — the sign convention is arbitrary — and it's labeled zero if and only if the sum is zero. So the 3SUM problem is basically to check whether, among these n-choose-3 unlabeled points, there is one point whose label is zero.

What we show is something stronger: with just n log squared n queries, not only can we decide whether there is a triplet whose sum is zero, but for every triplet we classify whether its sum is positive, negative, or zero. And this is exactly the active learning problem for half-spaces with comparison queries: the unlabeled points are all points of Hamming weight 3, and the input array defines the target function that labels them. And then, basically, what we show is that the inference dimension of this set of points is only n. From this we already get a randomized algorithm like before: you sample 2n points in each iteration at random, you query their labels — classifying each sampled triplet as positive, negative, or zero — you sort all the positive triplets and all the negative triplets using comparison queries, and then you infer, and you proceed in this manner. You can also de-randomize this algorithm; it requires some non-trivial technical effort, but it's fairly standard, not very exciting. Okay, so is the reduction clear — how these two problems are related, the active learning of half-spaces and linear decision trees, or comparison decision trees, actually?
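Here is a small Python sketch of the reduction just described (my code and naming): the unlabeled points are the 0/1 vectors of Hamming weight 3, the hidden linear function is the inner product with the input array, a label query reveals the sign of a 3-term sum, and a comparison query compares two such sums.

```python
from itertools import combinations

def sign(v):
    return (v > 0) - (v < 0)

def make_oracles(x):
    # The hidden linear function g(p) = <x, p>; on a weight-3 point,
    # identified with a triple of indices, this is a 3-term sum of x.
    g = lambda triple: sum(x[i] for i in triple)
    label = lambda t: sign(g(t))           # label query: sign of the sum
    compare = lambda s, t: g(s) >= g(t)    # comparison query between triples
    return label, compare

x = [5, -2, 9, -3, -6, 1]                      # 3SUM input (hidden from learner)
points = list(combinations(range(len(x)), 3))  # all n-choose-3 weight-3 "points"
label, compare = make_oracles(x)

# A 3SUM instance is a YES instance iff some point gets the label 0.
# Classifying every point as +, -, or 0 with few label/comparison
# queries (n log^2 n in the talk) in particular decides 3SUM.
print(any(label(t) == 0 for t in points))      # True: 5 + (-2) + (-3) == 0
```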
Okay, so let me give you some more applications that follow from exactly the same reduction. Assume you have two arrays, A and B, of n real numbers each, and your goal is to sort A + B — where A + B is the set of all pairwise sums a + b, with a in capital A and b in capital B. We also give an LDT that does this with just n log squared n queries. Notice that the number of pairs can be quadratic, and this is only near-linear in n — roughly the square root of the output size. And the only queries it uses are comparison queries of the form: is a_1 + b_1 greater than or equal to a_2 + b_2? All difference-comparison queries: is this pair's sum larger than that pair's sum? So with only this kind of access, you can very efficiently find the order type of A + B.
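A toy illustration (mine) of this access model: the code below touches A and B only through pairwise-sum comparisons. Plain comparison sorting of the n² sums, as here, spends O(n² log n) queries; the point of the result above is that, non-uniformly, n log² n queries suffice, which this sketch makes no attempt to achieve.

```python
import functools, itertools

def sort_A_plus_B(A, B):
    """Sort all pairwise sums, accessing A and B only via the
    comparison query 'is a1 + b1 <= a2 + b2 ?'."""
    queries = 0
    def le(p, q):                        # the only allowed query
        nonlocal queries
        queries += 1
        return A[p[0]] + B[p[1]] <= A[q[0]] + B[q[1]]
    pairs = list(itertools.product(range(len(A)), range(len(B))))
    pairs.sort(key=functools.cmp_to_key(lambda p, q: -1 if le(p, q) else 1))
    return pairs, queries

A, B = [3, 1, 4], [1, 5, 9]
order, q = sort_A_plus_B(A, B)
print([A[i] + B[j] for i, j in order])   # [2, 4, 5, 6, 8, 9, 10, 12, 13]
print(q, "comparison queries")
```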
Another application, a toy application — it's not in the paper, but it's something I prepared for the talk. Let p be some unknown polynomial of degree at most d over the real line, a univariate polynomial. Then p defines an ordering on the numbers one to n: i is less than j in this ordering if p(i), the value p gives to i, is less than p(j). So every polynomial defines some ordering, and the question is whether we can sort this ordering — whether it is simpler when d is small. That is the goal: when the degree of the polynomial is small, to understand this ordering better. And again, we show that you can basically sort it using roughly d squared log squared n queries. So with d constant, you do just a polylogarithmic number of queries and you sort the n points, using only the fact that the degree of the polynomial defining the ordering is small. And again, we use comparisons, or difference comparisons, just like before.

One of the disadvantages of this whole framework that we suggest is that it's non-uniform. We show, information-theoretically, that you can sort such orders very fast. But can it also be done in a uniform way? This, I think, is a nice open question. Take the same setting: we have some unknown polynomial of degree at most d, and we want to find something about the ordering — say, what is the median? Can we do it fast? Can we do it in sub-linear time?

Okay, so let me summarize and then take some more questions. We had two parts in this talk. First, we discussed active learning, and we showed that if we allow the learning algorithm to also use comparison queries, then from an information-theoretic perspective it becomes much stronger, and it overcomes many of the bottlenecks of classical active learning. From a technical perspective, we developed this inference dimension, which captures the query complexity. And then we used this machinery to devise nearly optimal comparison decision trees for a bunch of combinatorial and geometric problems.

In terms of future research, one obvious direction is to consider other types of additional queries, maybe ones that are used in practice, in some application, maybe other types of relative queries; there are many possibilities in this direction. Also, you can consider a streaming version of this question: now you don't get to see all the unlabeled points together, but you get them one by one, and you can only remember six of them, or whatever — like a streaming model. Maybe a noisy version, agnostic learning: if you think of crowdsourcing, which is a very typical scenario in which people use such relative queries, then maybe some of the comparisons will be noisy. Uniform algorithms — oh yeah, another curious fact is that these decision trees that we build, say for 3SUM, actually give you a short certificate for an array being a no-instance of 3SUM. Think of 3SUM from an NP-versus-coNP perspective. If the array contains a triplet that sums to zero, it's very easy to prove it, right? I'll just show you this triplet and you verify. But can I convince you quickly that the array contains no triplet that sums to zero? Is there a short proof, a short certificate, of this fact? It seems that our comparison decision tree, if you just follow the path that corresponds to the no-instance, gives you a very short proof of that — but verifying this proof actually takes time, at least in the naive way. So I think it is interesting to understand the non-deterministic, or the co-non-deterministic, complexity of 3SUM: is it sub-quadratic? And, as I mentioned, sorting A + B and sorting polynomially induced orders — I think all of these questions make sense in the uniform model, and it's interesting to study them. Okay, thank you very much.

Thank you, Shai. So we can take questions. Let me ask about these other types of additional queries. I had this question earlier — well, okay, what if you reveal the value of the function? — and you convinced me that that was too much, so comparison queries are sort of somewhere at the bottom. And in situations where there's some underlying real-valued function but the labels are given by the sign of the function, you're sort of truncating it. But in other learning models — say what you're trying to learn is actually a Boolean function, for example just learning DNFs or something, so that there's no underlying real-valued function — is there any such extension of the query model that can be done?

So you're asking if there is a meaningful extension of the model to other learning scenarios, where the class of functions you're trying to learn is just a straight class of Boolean functions, instead of, as here, Boolean functions obtained from a real-valued function, right? And whether there's a natural way to extend. Yeah, so I didn't think about it; I don't have an answer. I mostly work with classes that are basically signs of real functions. But yeah, for DNF it's an interesting question — whether for DNF there is some type of additional query that makes sense and can actually accelerate the learning process, the sample complexity. So now, I guess, you would have a hidden DNF, the secret DNF, and you get some examples and you can get the labels — okay, that's the standard thing — and the question is whether there is a meaningful additional query.

Yeah, it's not clear what else it might be.

Yeah, but it's cool. I mean, it's also not clear that there's nothing; I would have to think about it. But in general, I think that in this active learning setting there are plenty of works, and in practice they use plenty of additional queries — I'm really not an expert on practical work in active learning, but I know that there are plenty of other types of queries that people use — and maybe it would also be interesting to consider them from a theoretical perspective and to formalize them in this context.

I had another very small question, which is about this parameter, the inference dimension. It's not going to be easy to compute, is it? Can you say anything — I mean, it doesn't seem like it's even obviously in NP, for example.
Yeah, so — I guess in order to formally define a computational problem here, it's also — yeah, okay.

Well, deciding whether the inference dimension is at least some k.

Yeah, but usually we work with infinite classes, like classes of linear functions from R^d to R, so there's the question of how you represent the function class. But I think that already the inference dimension just for label queries — so without comparisons — is already hard in this context. I believe, I don't want to commit, but if I remember correctly, even when, say, you get the whole truth table of your class, it's hard. It's like hitting set or something. So I think it is computationally hard to decide whether the inference dimension is small or large, just like other parameters — the VC dimension, or most of the parameters that pop up in learning, are like that.

Do we have other questions? Anyone want to ask a question? If there are no more questions, I think I'll take us offline. So thanks, Shai, for the talk. Thanks everyone for attending. I remind you that a couple of weeks from now it'll be Danupon Nanongkai telling us about distributed shortest paths. Okay, so thanks, bye-bye. So Shai, I'll take us offline, but you can stay for a minute if you'd like, or anyone could stay for a minute.