Welcome everyone to another TCS+. My name is G, I'm the moderator today. We have Oded Regev as the operator, and we're gonna start by having him take us around the table and introduce the groups we have with us today. It's been very challenging today, lots of groups, let me try. So there's, I believe, Amit Levy with the group from Waterloo. Hi everyone there. We have Bouddhima from EPFL. I can't quite see you, but I believe you're there. I see the logo. Clément with the group from Columbia. Hi guys. Dimitris from Wisconsin. Hi everyone. Esan from USC. Hi everyone. And we have Gregory from Indiana University. Hi everyone. We have Jalal from Johns Hopkins University. I don't quite see you, but I believe you're there. And we have Janish with the group from Caltech. Kevin from the University of Michigan. Hi. And we have Samson. You'll have to help me out here. Oh, from Purdue. Okay, hi Samson. And we have Shavas from a few floors above me here at NYU, still setting up. And we have Sankirt from UCSD. I see a nice chair, but I'm sure you're there somewhere. Okay, so back to you, G. Thank you, Oded. And I'd like to also mention that behind the scenes we have other organizers, including Clément Canonne, Anindya De, Thomas Vidick, and Ilya Razenshteyn. And I'd like to remind everyone: if you have any questions, please unmute yourself and ask at any time. And without further ado, let me introduce our speaker for today, Moses Charikar from Stanford. Moses completed his PhD at Stanford, spent one year at Google and then several years at Princeton, where he worked on a lot of exciting things, including approximation algorithms, metric embeddings, and streaming algorithms. His work maintains a good balance: it also has strong applications to practice, as evidenced by the Paris Kanellakis Award in 2014 for his work on locality-sensitive hashing. Without further ado, I'll ask him to take it away. Sorry, Moses, I must have muted you, so you have to unmute yourself. You press the microphone button at the top. Got it. Okay, now you're with us. Okay. Well, thanks to everyone for joining, and thanks to the organizers for inviting me. I know it takes a lot of effort to organize these things, with all the technical issues and so on, so I really appreciate all the effort that everyone puts into this. So today we're gonna talk about some work I did recently on using hashing to estimate kernel density in high dimensions. This is joint work with my student, Paris Siminelakis, at Stanford. There's a lot of jargon in that title, and hopefully as we go along some of these words will become clear. I should apologize: there seems to be some issue with my video, so it's frozen most of the time, but every once in a while you'll see the picture move. The slides should be fine. Okay, so let me start by telling you what the problem is. You have n points given to you. These are points in high dimensions, y_1 through y_n. There's a certain potential function, k(x, y), which takes two points, x and y, and outputs a value. In some sense this is the influence of one point on the other. Think of it as a function that drops off with distance: it's highest when the distance is zero and falls off monotonically with the distance. So you have this database of n points, you have this potential function k(x, y), and there's a query point x. What you'd like to do is evaluate the average potential at x from the points in the database.
So you wanna sum up the potential contributions of each of the points in the database, k(x, y_i), and divide by the number of points. That's what we mean by kernel density estimation, and that's exactly the problem that we'll be talking about. When you look at it, it seems like a very simple problem. You have n points, you've got a summation that involves n terms, one for each of these points. What's the obvious thing to do? Well, you go over all the points one by one and sum up all of these terms. That gives you essentially a linear time algorithm. There's some issue about how you compute each of these pairwise potentials, but let's not worry about that. The main question that we'll be thinking about is: can we do better than this? If we wanted to estimate the value of this summation, could we do it without actually having to make a pass over the entire dataset? Is it possible to do some work upfront, so that we index the points in some way, and when the query point x is presented to us, we're able to estimate the value of the summation accurately with high probability? So let me tell you what these potentials are. So far I have this mystery function k(x, y), but what exactly is this potential? If you have to think of one particular function k(x, y) for this talk, think about the Gaussian kernel. The value of this kernel k(x, y) is just the probability density function of a high dimensional Gaussian: e to the power of minus the Euclidean distance squared divided by sigma squared. Among the kernels that people think about, the Gaussian kernel is a very popular one, but there are also others. There is an exponential kernel, where instead of the square of the Euclidean distance in the exponent, you have just the Euclidean distance. You could also have polynomial kernels, where the value of the kernel drops off as an inverse polynomial in the distance. But that's the basic question. It's a very simple problem. Okay, so now that we understand what the problem is, let me try to tell you why you might care about it. Here's the setup. The setting where this problem typically arises is where you have a set of n points, sampled from some distribution that has some probability density p. We don't know what this density is, but what we'd like to do is estimate the value of this density at some point x. So we have some point x and we'd like to estimate the value of the density there; you give me another point, I'd like to figure out the value of the density at that point, and so on and so forth. And all of this we have to do on the basis of these n samples from the density. Now, there are many approaches to this kind of problem. One common approach, and this is an approach that folks in TCS have pursued quite a bit, is to assume something about this density function. Maybe it's a mixture of Gaussians, or maybe it has some other form, and now from my samples I can try to learn or estimate the values of the parameters in my functional form for the density. So this is some sort of parametric estimation. There's another branch of statistics called non-parametric estimation, where you would like not to make any assumptions about the density that you're trying to estimate.
Can you still come up with estimates that are close to the actual density without assuming any functional form for it? In this field of non-parametric statistics, kernel density estimation is a very popular technique. So how do you do this? Well, you essentially take your n points, and if you want to estimate the value of the density at a new point x, then all you do is compute this kind of kernel density estimate. And there are all kinds of results that show you that, under suitable conditions, this will actually converge to the actual density. So this is a very popular technique in non-parametric statistics for estimating an unknown probability density function. And it turns out to be a useful subroutine, a useful primitive, and it has all sorts of applications. One particular application is outlier detection. Given a dataset and a new point x, what you'd like to know is: is this a typical point or an unusual point, given the points that you've already seen? One way to do this is to go ahead and estimate this kernel density. If the kernel density is very low, that tells you that this point is likely an outlier. If the kernel density is high, it tells you it's kind of a typical point. And if you think about it, this is doing something very natural, right? It's saying that if you have a point which is close to many points in the dataset you've seen so far, then the kernel density is gonna be high, and you would think of this as a typical point. If it's far away from most of your dataset, then you would expect that this point is an outlier. So it turns out this is a useful technique for outlier detection, and people do use it quite a bit. People also use it as a method for clustering and so on, but we won't really talk about those. In fact, it was this application to outlier detection that initially led us to start thinking about this question. Okay, so let me give you an example of what this kernel density might look like. Let's say we have a two-dimensional set of points that are sampled from a particular density. On the x and y axes I have the dataset; the vertical z axis represents the density function. So this is some density that I picked, I sampled the dataset, and now I wanna figure out what this unknown density is. I don't have access to the density anymore, I only have access to the dataset. The first thing you might do is just take the discrete distribution that is non-zero only on the points of your sample. Obviously this is a very bad approximation to the unknown density; that's not a good idea. So what's the kernel density idea? The kernel density idea is that you take this point distribution and smooth it out by incorporating an appropriate smoothing function, and that's exactly what this kernel does. You might, for example, put a little Gaussian around each of the points in your dataset, and then average these Gaussian densities. Now, a lot depends on exactly what the variance, or the bandwidth, of this Gaussian is.
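To make the estimator concrete, here is a minimal sketch in Python of the naive linear-time computation for the Gaussian kernel, with the bandwidth sigma as an explicit knob. The variable names and numbers are made up for illustration; this is just the baseline we're trying to beat.

```python
import numpy as np

def gaussian_kernel(x, y, sigma):
    # k(x, y) = exp(-||x - y||^2 / sigma^2): equals 1 at distance zero, falls off with distance.
    return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

def kde_naive(x, data, sigma):
    # Exact kernel density at the query x: the average kernel value over all n points.
    # This is the linear-time baseline; the whole question of the talk is whether
    # we can estimate this average without touching every point.
    return np.mean([gaussian_kernel(x, y, sigma) for y in data])

# Toy usage: n = 1000 points in 50 dimensions, one query point.
rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 50))
x = rng.standard_normal(50)
print(kde_naive(x, data, sigma=5.0))
```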
So suppose you end up picking a Gaussian which has very small sigma, very low bandwidth. Then the resulting density that you get by averaging is not gonna look very different from the discrete distribution that we had earlier. It's gonna have a lot of peaks around the points in the set itself, and it's not really gonna do much averaging. That's the problem with having very low bandwidth. If you have very high bandwidth, on the other hand, then you'll do too much smoothing, you'll average things out too much, and you'll actually lose some low-level structure in your density. What I'm trying to point out is that if you really wanna get a good estimate of the density, it's really important to pick the right bandwidth. Too small is not good, too large is not good. And there's a lot of literature that goes into figuring out exactly what a good choice of bandwidth is for this problem. But for our purposes, we're gonna assume that someone's actually done this work. Someone's told us: this is the problem that I wanna solve, this is the bandwidth that I wanna use. And we as algorithm designers now have the task of figuring out just how quickly we can estimate this kernel density. All right. Now, I should point out that there are other applications where you have similar-looking functions: you have a set of points y_1 to y_n, you have a query point x, and you'd like to evaluate some sum over all of the points, where each term in the summation is just a function of two points, the query point x and one of the database points y_i. So for example, the log partition function arises frequently, and this is sort of a softmax function: the summation is just the sum over all points y in the database of e to the power of beta times the inner product of x and y. And again, if you had methods by which you could evaluate this summation quickly, then this would be very useful for all kinds of learning applications. So that's another example where we have this kind of summation involving n terms, and the question is: can we actually estimate it faster than evaluating all of the n terms one by one? Another situation arises when you're trying to do gradient descent. Let's say we have some loss function that we'd like to minimize over the examples that we have. So we're trying to figure out the value of x; we have a set of examples y_1 through y_n; and there's a loss function l(x, y_i), which gives you the error, or loss as it's typically called in machine learning, between x and y_i. Each of these terms l(x, y_i) tells you how close x is to y_i, and what you'd like to do is minimize the sum of these loss functions. A popular technique to do this is gradient descent, and when you do gradient descent, the gradient again is a sum of such pairwise terms. Now one method to speed up gradient descent is so-called stochastic gradient descent. Instead of computing the full gradient, you come up with an estimate of the gradient; for example, just one term in this summation, appropriately scaled, is on average a good estimate of the whole sum.
So think of it as a random variable that has the right expectation, and using such estimators can speed up gradient descent in practice. You could ask the same question here: this is one particular estimator, just taking one term; is there a better way to estimate the gradient in time less than linear in the number of terms? Okay, all right. So, winding back to the actual problem that we'll be discussing. Again, this is the prototypical problem: given a database of points y_1 to y_n, a pairwise potential k(x, y), and a query point x, what we'd like to do is estimate the average of this pairwise potential over all of the n points in our database. All right, and as I alluded to earlier, we're not going to try to compute this exactly. Instead, we'll try to estimate it, and we wanna come up with a good estimate. So let's say we have some parameter epsilon, and we'd like to come up with an estimate within a multiplicative one plus epsilon factor of the right answer, with high probability. And if you think of one particular kernel for this talk, think about the Gaussian kernel; the actual pairwise function that we care about is this Gaussian kernel. That's the prototypical problem. All right, any questions at this point? I realize that everyone's muted, but do you guys have any questions about the actual problem, the basic setup? I see Clément saying something but I can't hear him. Yeah, I think he was just unmuted for some reason. So maybe let me ask: are we focusing now on the kernel density estimation or the non-parametric estimation, what should we keep in mind? So, the problem that I want to solve is this one. I'm not gonna worry about how well this kernel density approximates the underlying density. Someone's figured out that this is a good problem to solve, they're using it as a subroutine, and my goal is to speed it up as much as possible without losing accuracy. I wanna come up with a good estimate of this summation without doing a linear amount of work. That's the question. Do you have anything to add? I see the question now from Clément. You were right, he had the same question, and he's asking whether we can evaluate the kernel in unit time. Oh, whether we can evaluate the kernel in unit time. So, let's say we can, or let's say it's logarithmic; it's not gonna be too important, so I'm not gonna agonize about that too much. Let's say the kernel can be evaluated quickly. For the kinds of kernels we're talking about, you might be worried that if your points are high dimensional, as I said they are, then just evaluating the Euclidean distance is gonna be very expensive. But you can do dimension reduction, get it down to log n dimensions, get a good estimate of distances, and use that in your estimate. So for all purposes, think of kernel evaluations as fast, think of them as unit time; we might be off by a factor of log n perhaps. All right, any other questions? Are there any assumptions on the kernel that you make for this approximation? So we'll state our results for different kernels. Presumably the method that I'll describe could be applied to other kernels as well, although I'll actually state results for two or three families of kernels.
Basically, we're assuming that the kernel is monotone: it drops off with distance. Other than that, the results will involve some properties of the kernel, but we'll see as we go along. Keep the Gaussian kernel in mind for now. Any other questions? All right. Okay, so this kind of problem arises in non-parametric statistics, but another place where it arises is in the so-called batch setting. For example, in n-body simulations you have n particles, each exerting forces on the others, and what you'd like to do is figure out, for each of the particles, the net force that it experiences. Then you might wanna evolve the system a little bit, recompute the forces on the particles, and so on. So in these so-called n-body simulations, this is exactly the kind of problem that you need to solve, except that in this case the dataset and the queries are exactly the same: you have n points that form the dataset, and the queries are exactly the same set of n points. You'd like to evaluate the action of the dataset on each of the points of the dataset itself. So in this case, the naive algorithm would take n squared time, because there are n squared pairs of particles; you'd like to compute each of their pairwise interactions and add them up. So the naive algorithm is n squared for this problem. It turns out, and if you've never seen this before it's pretty surprising, that you can actually get good approximations that run in time n log n, and sometimes even linear in the number of points. This was a big breakthrough in the late 80s and 90s, done by Greengard and Strain for the Gaussian function, and by Greengard and Rokhlin more generally for various other kinds of potential functions. The method is called the fast multipole method; for the Gaussian function it's called the fast Gauss transform. This is an enormously influential method in numerical analysis. In fact, it was awarded the 2001 Steele Prize, and it was also selected as one of the top 10 algorithms of the 20th century by the editors of Computing in Science & Engineering. So it's a big deal in numerical analysis. And there's an analog of this method in computational geometry: the well-separated pair decomposition, due to Callahan and Kosaraju. So this sounds really amazing: what was an n squared time calculation can really be done in linear time or n log n. So maybe this is something that's useful also for the problem that I described, where we have these n points and we're trying to compute the total contribution at one query point. Okay, so at a high level, how do these fast multipole methods work? What they do is rely on some kind of hierarchical recursive space decomposition. They take space and chop it up into pieces. (Trying to find my cursor, okay.) And now, if you wanted to evaluate the total force, say, at one particular particle, the insight is that if you look at some portion of the space that's very far away, you don't need to worry about where exactly all of the particles in that portion are located.
As long as you know that this stuff is very far away from where you're trying to evaluate the action, you might as well aggregate it all into one piece. You might as well approximate the action of all of those faraway particles by a single particle, and doing that does not introduce too much error. Applying this idea, and then using various properties of these kernels, how their tails fall off and so on, is essentially the basic idea behind the fast multipole method. And amazingly, you can get runtimes which are linear or n log n. This seems great; this seems like something we could apply to our problem. The caveat is that these methods rely on space decompositions which have an exponential dependence on the dimension. The kinds of settings that Greengard and Rokhlin were really interested in were 2D or 3D; that's what happens in the physical world, right, with electromagnetic forces or gravitational forces. So these methods work really well in low dimensions, but they scale very poorly with dimension. In our setting, we're interested in high dimensional data, we're interested in finding outliers and so on, and these space partitioning schemes are not gonna work for us. So we need something else. All right, so here's a recap. Here's the problem: we have the database of points. One more thing: let's assume for now that my potential function, the pairwise quantity that I'm trying to sum up, is some quantity between zero and one. I have my query point x, and I'd like to estimate the value of the average potential contribution at the point x. And one more piece of notation: let's assume that for the x that I'm actually interested in, the actual value, the answer, is mu. So now let me ask you guys: how would you go about estimating this? You know that you can compute it exactly in linear time, essentially. Can you do this faster? If you just wanted to estimate this value, how would you do it? In particular, let me give you one hint: can you come up with a method where the complexity, the amount of work you need to do, is some function of mu, the actual answer that we're trying to estimate? Any thoughts? I see some people are cheating. Clément writes: based on the title, I'd like to use hashing. Oh, okay, all right, that's cheating. Not a lot, we'll do that in a few slides. But before we think about hashing, what's the simplest thing that you would think about? Yeah, someone answers. There's one question before we get there: I'm confused, because what you showed us previously, multipole, allows us to compute all pairs of values with only order n time, but what does it do in terms of one point versus all the others, as you are asking here? Is it better than order n? So, okay, I sort of hinted at this, and I guess I haven't really checked this, but the same kind of space partitioning techniques that they use in these fast multipole methods could in principle be used to solve this problem in the query setting. Presumably one could construct this space partition beforehand.
And now when your query comes along, you place it within this space partition that you've already constructed, and use exactly these fast multipole ideas to quickly come up with an estimate of the contribution at the query. I assume the savings will really come when you have many queries rather than one, basically. Sorry, I didn't quite get that comment. I think, for one query you probably wouldn't get any savings, but if you amortize across many queries, then you'll get savings. Well, so the thing is, there is this upfront cost of actually building the space partition, and in some sense I'm giving that to you for free, even in my problem setup, right? I said you have this database of n points, you get to preprocess and do whatever you want, and after that you've got to answer the queries. So you're right: if you just had to answer one query, we're not gonna get any advantage, but we're not gonna get any advantage even from the eventual scheme that we describe. This really makes sense if you amortize that upfront cost over a lot of queries. Okay, any other questions? All right, so, back to this: what's the first method that you would think of? Well, the first thing you would think of, if you had to estimate this sum, would be to just use a random sample. Why not sample some points of your database, average their contributions, and use that sample average as an estimate of the overall sum? So how badly does this do? Let's think about it. Remember, we have n terms, each term is a number between zero and one, and the average is mu. How many samples do you need to get a good estimate of this average? Well, you need one over mu times one over epsilon squared samples. The one over epsilon squared is kind of standard: if you want one plus or minus epsilon accuracy, you typically have this overhead of one over epsilon squared. What about the one over mu? Well, in random sampling, if it turns out that your average value of mu comes about because a mu fraction of the data has value one and a one minus mu fraction of the data has value zero, then just to hit something with a non-zero contribution, you're gonna have to sample one over mu times. So with random sampling, you're stuck with one over mu samples. Does that make sense? Okay. And now the question is, can we do any better than random sampling? Random sampling needs one over mu samples; I'm gonna ignore the one over epsilon squared and always talk about the dependence on mu. Can we do any better? I should say one more thing: another regime that people have studied, actually one paper studied it a little bit, is additive approximation. If you allow yourself an additive epsilon approximation, it's known that one can construct a small sample of size something like one over epsilon squared, and this gives you a good additive epsilon approximation. So if we want to improve on random sampling, the next natural idea that comes along is to use what's called importance sampling. So the setup is like this: we have a summation involving n terms.
Let's call them w_1, w_2, ..., w_n, where w_i is the kernel contribution of data point y_i to the query x. You want to estimate the sum of the weights, the sum of the w_i's. Now, random sampling has highest variance when some terms have high values and the other terms have really low values. If you had a way to tweak the sampling distribution, if you could control the probability with which you sample element i, then you can actually reduce the variance of the sampling procedure. In particular, if you had the ability to sample i with probability q_i, and you had the freedom to design your own distribution q_i, then one way to do this is to say: I sample i with probability q_i, and then I report w_i divided by q_i. Now, the expected value of this estimator is just the sum of the w_i's; that's a very easy calculation. So great, this is an unbiased estimator. And the whole point is that by giving ourselves the flexibility of designing the distribution q_i, we can come up with a low variance estimator. So here's a question: how should I set my q_i's to get the lowest possible variance? Anybody? Actually, I'm not able to see the chat, maybe I should. Yeah, someone says: based on distances. Based on distances, okay. All right, so if you think about the variance of this particular estimator: the expectation of z squared is the summation of w_i squared divided by q_i. And it's a simple exercise to show that the optimal setting of these probabilities is for q_i to be directly proportional to w_i. So you're right, we should sample with probability some function of the distances. And if we really had complete flexibility over how we sample, the right thing to do is to sample point i with probability directly proportional to w_i, its kernel contribution. Now, there are many reasons why this is too good to be true. We can't do this, because even to compute the normalizing constant in these probabilities, we'd need to know the sum of the w_i's, which is exactly the quantity we wanted to compute. If we actually did this, we would get variance zero, but that's not gonna happen. Still, it's a good principle to keep in mind. What is the lesson that we learned? The lesson is: if we want to improve over random sampling, we should think about sampling in a biased fashion, and it seems like the right thing to do is to bias our sampling in such a way that points with high kernel values are sampled with higher probabilities. All right. And when we try to figure out just how expensive such an estimator is, this is pretty standard: what we want to compute is the expected value of z squared, the second moment of the estimator, and how large the expectation of z squared is relative to mu squared is what determines how many samples I need. So my goal is to design an estimator which has the smallest possible value of the expectation of z squared, the smallest possible variance. So now let me take a little bit of a detour and explain to you where we're gonna get a method for implementing this kind of importance sampling.
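Before the detour, here is a minimal sketch of the idealized importance sampling estimator just described. It's purely illustrative: it assumes we can somehow draw index i with a probability q_i of our choosing, which is exactly the part we don't yet know how to do cheaply, and the weights below are made-up numbers.

```python
import numpy as np

def importance_sampling_estimate(w, q, num_samples, rng):
    # Estimate sum_i w_i: draw i with probability q_i, report w_i / q_i, and average.
    # Unbiased for any valid q; the second moment is sum_i w_i^2 / q_i,
    # which is minimized when q_i is proportional to w_i.
    idx = rng.choice(len(w), size=num_samples, p=q)
    return np.mean(w[idx] / q[idx])

rng = np.random.default_rng(1)
w = rng.uniform(0.0, 1.0, size=10_000)      # hypothetical kernel contributions w_i
q_uniform = np.full(len(w), 1.0 / len(w))   # plain random sampling
q_ideal = w / w.sum()                       # the "too good to be true" choice
print(w.sum(),
      importance_sampling_estimate(w, q_uniform, 200, rng),
      importance_sampling_estimate(w, q_ideal, 200, rng))  # last one is exact: zero variance
```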
So right now we've stopped at this point where it'd be great if we could design a way to sample the points with probabilities somehow proportional to their kernel values. Here's where our method is gonna come from. This is an idea that's now about 20 years old; it's the culmination of a lot of interesting work in theoretical computer science, and a beautiful paper of Indyk and Motwani introduced this idea of what's called locality sensitive hashing. The basic idea is very simple. Suppose you had a family of hash functions that somehow distinguishes between near points and far points: for two points that are close, the probability that they end up in the same hash bucket is much higher than the collision probability for two points that are far away. If you had such a family of hash functions, then it can be used as a building block for nearest neighbor search. You simply construct a hash table using such a hash function; in fact you construct not one but several hash tables. Then you take the query, map it to these hash tables using the same hash functions, and examine the contents of the hash buckets that the query falls into. In this business, people give guarantees in terms of the approximate version of the nearest neighbor problem. They say: if the nearest neighbor is at distance r, I'm happy to get something which is within distance c times r, and the guarantees for these locality sensitive hashing based data structures are given as a function of c. The larger the value of c you're willing to allow, the better these data structures perform. And usually the way in which people study these... sorry, is that a question? Getting an additive estimate, cheaper? Sorry, I just saw some question pop up and I wasn't sure if it was for me or for somebody else. I think it's for you; it says the microphone is broken. I think what Clément means to say is: why can't you first get an additive estimate and then use that as the approximation of the w_i's to get the... So, first get what estimate? An additive estimate? You quoted, you cited the result that gives additive epsilon estimates. Oh, I see. So I think the hard case for us is exactly when the actual answer is a very small value. When mu, the actual answer that we return as an estimate, is very small, then this additive epsilon is not gonna be very interesting, and I don't know how to make use of that at all. And if you think about it, once you set epsilon equal to mu, you get essentially complexity one over mu squared, which is not that interesting. Yeah, okay. Sorry, I missed the question earlier. Okay, so going back to locality sensitive hashing. The usual analysis of locality sensitive hashing is in terms of trying to understand, if you look at two distance thresholds, r and c times r, just how fast the collision probability drops off from one threshold to the next. And rho is a parameter that measures how separated these two probabilities are. It's not terribly important for us how exactly rho is computed, but a lot of work goes into trying to understand how exactly rho behaves as a function of c. And this locality sensitive hashing stuff is fairly mainstream now. For example, there's a figure from a paper in Science a couple of years ago where people are using LSH to detect earthquake signals.
So yeah, it's certainly a method that's been adopted quite widely in practice. A couple of LSH constructions that I'll point out; there are several. One is by Datar, Immorlica, Indyk, and Mirrokni. It's a very simple locality sensitive hash function: it takes points in Euclidean space, projects them down to a single line, buckets the line into buckets of width w, and shifts this bucketing by a random amount. The hash value is just the ID of the bucket that you fall into. So it's very simple: you project onto a line, discretize the projection, and that's the value of your hash function. Another hash function that we'll refer to later is one introduced by Andoni and Indyk in 2006. It's a slight generalization of this idea: you take your points in Euclidean space, but now instead of projecting down to a single line, you project them to a t-dimensional space for some large constant t, and then in this t-dimensional space you partition by doing some kind of ball partitioning. Think of it as: pull out a ball, pull out another ball, pull out another ball, and every point gets assigned to the first ball that it falls into. It turns out that for L2, the value of rho that you get for this one is one over c squared, and that is actually the right dependence; this is the optimal LSH scheme for L2. There are a few other things that I'll just mention in passing. Recently there's been a lot of interest in LSH. This work, as you see, was done about 10 years ago, but in the last two or three years there's been a resurgence of interest in locality sensitive hashing. One of the key insights was that all of the methods we had before ran up against a wall because they are data oblivious: they choose the hash functions and the partitioning of space without actually looking at the data, the distribution of the data. It turns out that if you allow yourself the ability to look at the data and do something data-dependent, then you can break those lower bounds, and now we have even better schemes that partition the data based on the structure of the data itself. Okay, all right. Back to where we were. Remember, we were thinking about importance sampling. The idea was that it would be great if we could somehow sample points with probabilities proportional to these kernel values, but we didn't know how to do that, and we took this interlude into discussing LSH. The key insight is that locality sensitive hashing is a very powerful method, a beautiful method, for approximate nearest neighbor search, but it actually also gives us a biased sampling scheme. This insight was also independently developed by Spring and Srivastava in a recent paper. So what is LSH? LSH is a scheme by which you can build a hash table. You have a certain collision probability, the probability that two points x and y end up in the same bucket. What exactly the collision probability is, is a function of the space that you're operating in and the hash family that you pick, but you can control this collision probability to some extent by picking an appropriate hash family.
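Here's a minimal sketch of the first construction, in the spirit of the Datar-Immorlica-Indyk-Mirrokni hash just described: project onto a random line, shift by a random offset, and discretize into buckets of width w. Concatenating a few independent projections is the usual way to tune how fast the collision probability falls off with distance; the parameter values below are made up.

```python
import numpy as np

class EuclideanLSH:
    # One hash value is the tuple of bucket ids over several independent random lines:
    # h_j(x) = floor((a_j . x + b_j) / w), with a_j Gaussian and b_j uniform in [0, w).
    def __init__(self, dim, bucket_width, num_projections, rng):
        self.a = rng.standard_normal((num_projections, dim))
        self.b = rng.uniform(0.0, bucket_width, size=num_projections)
        self.w = bucket_width

    def hash(self, x):
        return tuple(np.floor((self.a @ x + self.b) / self.w).astype(int))

# Nearby points collide (share a hash value) more often than far ones,
# on average over the random choice of hash function.
rng = np.random.default_rng(2)
x = rng.standard_normal(50)
near = x + 0.1 * rng.standard_normal(50)
far = x + 5.0 * rng.standard_normal(50)
collisions_near = collisions_far = 0
for _ in range(1000):
    h = EuclideanLSH(dim=50, bucket_width=4.0, num_projections=2, rng=rng)
    collisions_near += h.hash(x) == h.hash(near)
    collisions_far += h.hash(x) == h.hash(far)
print(collisions_near / 1000, collisions_far / 1000)  # empirical collision probabilities
```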
And the point is, if you build a hash table and then map the query x to this hash table, you can think of the set of elements that are in the same bucket as the query as a biased sample of the dataset. So, we wanted a biased sampling scheme; well, LSH gives us a biased sampling scheme. Could this be an efficient implementation of importance sampling? The caveat is that this is a somewhat dependent sampling scheme: it's not that every element i is sampled independently with probability q_i, so we have to tweak things a little bit to make this work for us. So, let's say we have n points, y_1 to y_n. Each of them has a weight w_i, its kernel contribution. We have an LSH scheme with collision probabilities p(x, y); let's say the probability that y_i ends up in the same bucket as x is p_i. What are we gonna do? Well, we're gonna map the query x to its hash bucket H(x), and then pick a random element y_i in H(x) and use that as our sample. That's how we're gonna use LSH to get a biased sample. The question is, how should we analyze this? The analysis of importance sampling somewhat suggested that the best choice of LSH would be one where the collision probability is exactly proportional to the weight, exactly proportional to the kernel value; that's the insight we got out of importance sampling. But it turns out that if we use LSH to get a biased sample, the probability of sampling y_i is not quite the collision probability. y_i needs to end up in the same hash bucket as x, which happens with probability p(x, y_i). That having happened, you're gonna pick one element of the hash bucket at random (we're not actually gonna look at all the elements, you just pick one at random), so the probability that y_i is picked is inversely proportional to the size of the hash bucket. Now, the numerator of this expression is something we can control; it's a function of the hash family that we pick. The denominator, on the other hand, is something that we don't quite have control over; it depends on the structure of the dataset. So how could we possibly analyze this? It seems somewhat intractable. I'll try to motivate the analysis by working through a quick example, and then I'll give you a summary of the results we get. All right. So, this is our scheme: we're gonna use LSH to get a biased sample. Now, the final estimator that we use is, remember in importance sampling it was w_i divided by q_i; in this case it's gonna be w_i divided by p_i, the collision probability, times the size of the hash bucket. That's gonna be our estimator, and just as with importance sampling, it is an unbiased estimator. The main question is, what's the variance of this estimator? If you compute the second moment of this estimator, it turns out to be a somewhat scary looking expression. As before, we have this summation of w_i squared over p_i, which is exactly the expression we had for importance sampling, but now we have an extra factor which comes from the size of the hash bucket: the expected size of the hash bucket, given that y_i belongs to the hash bucket. And it's really not clear how to analyze this. So, let's work through a very quick example.
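Before the example, here is a rough sketch of the scheme as described so far: preprocess by hashing the dataset into a few independent tables, and at query time hash x into each table, grab one random point from its bucket, and reweight by the collision probability and the bucket size. This is a simplified illustration rather than the exact scheme from the paper; the `kernel` and `collision_prob` arguments are assumed to be given (for standard LSH families the collision probability has a closed form in terms of the distance).

```python
import numpy as np

def hbe_preprocess(data, hash_fns):
    # One hash table per hash function: bucket id -> indices of the points in that bucket.
    tables = []
    for h in hash_fns:
        table = {}
        for i, y in enumerate(data):
            table.setdefault(h(y), []).append(i)
        tables.append(table)
    return tables

def hbe_query(x, data, hash_fns, tables, kernel, collision_prob, rng):
    # Each table yields one estimate of (1/n) * sum_i k(x, y_i):
    #   Z = k(x, y_i) / p(x, y_i) * |bucket| / n, for a uniformly random y_i from
    #   x's bucket (0 if the bucket is empty). Unbiased, as with importance sampling;
    #   the variance depends on how well p tracks k (ideally p is about k^(1/2)).
    n, estimates = len(data), []
    for h, table in zip(hash_fns, tables):
        bucket = table.get(h(x), [])
        if not bucket:
            estimates.append(0.0)
            continue
        y = data[rng.choice(bucket)]
        estimates.append(kernel(x, y) / collision_prob(x, y) * len(bucket) / n)
    return float(np.mean(estimates))
```

In the actual scheme the hash family, the number of tables, and the number of repetitions are chosen based on the kernel and the target accuracy; here they are simply left as parameters.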
Okay, so for this example, just to get a sense for what we should be doing, instead of thinking of the estimator for the sum, I'm actually gonna average over the n data points: I'm gonna take the estimate and divide by n. So the value of the estimator is w_i divided by p_i, times the fraction of the dataset that's in the current hash bucket, that is, the size of H(x) divided by n. And here's the example that I wanna think about. Think of an instance where a mu fraction of the dataset is at distance zero and has weight one, and a one minus mu fraction of the dataset is at distance square root of log one over mu and has weight mu. The square root of log one over mu is just because I'm assuming a Gaussian kernel. So a mu fraction has weight one, and a one minus mu fraction has weight mu. And suppose we're able to design a scheme so that the collision probability for the i-th point is its weight raised to the power beta. Initially I said, well, maybe we should have a sampling scheme where the collision probability is exactly the weight; I'm saying, well, maybe that's not quite right, so let's look at schemes where the collision probability equals the weight raised to the power beta, and try to figure out what the right value of beta might be. Now, the expected value of my estimator is just the kernel density, and in this case the kernel density is very close to two times mu. That's the easy calculation: you get a contribution of one from the mu fraction and a contribution of mu from the one minus mu fraction. The key point is: what is the expectation of y squared? Here's what you should think about. The collision probability is w_i to the power beta. So here's what's going to happen when you apply this to this kind of dataset. With probability very close to one, namely one minus mu to the power beta, the hash bucket will only have points of weight one. And with probability mu to the power beta, the hash bucket is going to have all the points in the dataset. Now, if you do the calculation, and maybe since we're running a little short on time I won't actually work through it, but it's not very hard at this point, you should be able to do it by yourself, it turns out that the expectation of y squared is mu to the power of one plus beta, plus mu to the power of two minus beta; these are the dominant terms, the other terms don't matter. So here's what we have: for this particular instance, where you have points at just two different distances, distance zero and some large distance, the expected value is two mu, and the expected value of y squared is mu to the power one plus beta plus mu to the power two minus beta. Obviously we should try to equalize the two terms and set beta equal to 0.5, and then the expected value of y squared divided by mu squared is about one over square root of mu. If you remember the analysis of the random sampling estimator, there the normalized variance was one over mu. So this estimator, if this indeed is the worst case, suggests that we might be able to get away with about one over square root of mu samples. All right.
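In case you want to check that calculation, here it is in outline for the two-point instance above, with collision probability p_i = w_i^beta and constants dropped (a sketch of the step skipped in the talk):

```latex
% mu fraction at distance 0: weight 1, collision probability 1.
% (1 - mu) fraction far away: weight mu, collision probability mu^beta.
% Estimator: Y = (w_i / p_i) * |H(x)| / n.
\begin{aligned}
\text{w.p. } \approx 1 - \mu^{\beta}:\ & \text{the bucket holds only the near points, so }
  |H(x)|/n \approx \mu,\quad Y \approx \mu,\quad Y^2 \approx \mu^2,\\
\text{w.p. } \approx \mu^{\beta}:\ & \text{the bucket holds everything, } |H(x)|/n \approx 1,\quad
  \mathbb{E}\!\left[Y^2 \mid \text{this case}\right] \approx \mu \cdot 1^2 + (1-\mu)\,\bigl(\mu^{1-\beta}\bigr)^2,\\[4pt]
\mathbb{E}[Y^2] \approx\ & \mu^2 + \mu^{\beta}\bigl(\mu + \mu^{2-2\beta}\bigr)
  \;\approx\; \mu^{1+\beta} + \mu^{2-\beta},
  \qquad \frac{\mathbb{E}[Y^2]}{\mu^2}\bigg|_{\beta = 1/2} \approx \frac{1}{\sqrt{\mu}}.
\end{aligned}
```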
So, two theorems about this. It turns out that even though the expression for the variance of these estimators that we get out of hashing is a little unwieldy and seemingly complicated, the worst case variance really does arise from these two-point configurations. So the kind of analysis we just did is the right one: these really are the worst cases, and we can show that. And then it turns out that if we are able to design hashing schemes where the collision probability is the kernel value raised to the power beta, for beta between 0.5 and one, we can show that the variance of this estimator is mu to the power of two minus beta. So indeed the best thing to do for us is to set beta equal to 0.5. So, to summarize, we have this new way to think about LSH as a way to facilitate importance sampling, and this makes us go back and redo the analysis of all of these LSH schemes. Typically, locality sensitive hashing has been analyzed in the setting of c-approximate nearest neighbor: what's the complexity of your hashing scheme there? We weren't able to use those results as a black box; we had to go back and redo the analysis, relying on the same hash functions but analyzing different things. It turns out that the right thing to do, as our results suggest, is that whatever kernel you're interested in, you need to pick the LSH family that's appropriate for it. For the Gaussian kernel, if you pick the project-onto-a-single-line Euclidean LSH, you get complexity one over mu to the three-fourths. On the other hand, if you pick the Andoni-Indyk LSH, which projects to t dimensions, you get one over square root of mu. For the exponential kernel, projecting onto a single line already gives you the one over square root of mu complexity, and for the polynomial kernel you can also get one over square root of mu. The running time can be made adaptive to mu. These results say that the running time is a function of the answer that you're trying to compute, but you can easily modify the scheme so that if mu is large, you terminate quickly and certify that mu is large; if, on the other hand, the kernel density is small, it takes a little longer, and how long you take is exactly one over square root of mu. Just a couple of other things. It turns out that we can also show that if your data has some pseudo-random structure, if not all of it is clustered in a small region of space, then you can get even better bounds, although I won't state them explicitly. There are some lower bounds that one can get for the batch setting, but these lower bounds need values of epsilon that are very, very tiny, and they rely on the strong exponential time hypothesis; this is work of Backurs, Indyk, and Schmidt. A few other things I just wanted to mention, some directions of ongoing work. One thing that we realized subsequent to this work, and this is joint work with Piotr Indyk and Arturs Backurs, is that for the polynomial kernel, where the potential drops off as a polynomial function of the distance, you can actually get polylogarithmic query time. Right now, the results that we have in general are one over square root of mu; of course, if mu is tiny, then this means that you are gonna spend a lot of time.
But for the polynomial kernel, you can always guarantee that you get polylogarithmic query time. You can also modify these methods to estimate not just the kernel densities we've been talking about, but sums of general functions of inner products, where your pairwise terms are of the form e to the power of some function of the inner product; that can all be done, although this is still work in progress. It seems that data-dependent LSH should give improved bounds, although we haven't quite nailed down all the details there; it seems plausible, but we don't quite have a result that we can point to at this point. And finally, it would be interesting to experiment with this idea. It's a very simple idea, using locality-sensitive hashing, and it's a very practical scheme in high dimensions. I'm hoping that just as LSH became a method of choice for practitioners, this would also be something that people would readily implement for kernel density in high dimensions. I'll stop there. I think I'm a little bit over time, sorry about that. Thanks for listening. Excellent, thank you, Moses. At this point, we'll take some questions from the audience. So, one question: I wonder if we can open the black box, as you suggested at the end. It seems like what's actually going on is you basically take your set of points, you're already projecting it in some way, and then there's a bucketing going on. It sounds a bit like the following naive idea: project my points to a low-dimensional space and then maybe apply the multipole method, the old algorithm that works in low dimensions. Are there any connections between those two things? So, I don't know exactly how you intend to implement this. One thing is that this projection is purely being used to facilitate sampling, and eventually we come up with an estimate based on sampling. Whereas if you project to low dimensions, you distort the distances a little bit, and if you run multipole on those distorted distances, I think you would have a lot of error. So that's one problem; we're really doing this projection to low dimensions to facilitate sampling. In fact, I should say that in this polylogarithmic query time that we have for the polynomial kernel, this is really the idea: we are indeed projecting to low dimensions, but then instead of working with these low dimensional distances directly, we're using the low dimensional projection as a way to facilitate a good sampling scheme, and having found a good sample, we then proceed to work with the original distances in the original space. So it is useful to open that black box up a little bit, but in a slightly different way, at least the way that we approach it. We have one question in the chat here from Clément Canonne: what happens if you use kernel approximation methods like random Fourier features for kernel density estimation? So as far as we know, that doesn't give us improved results for our setting. If we use random Fourier features, I don't quite remember what the calculation was, but it doesn't even do better than random sampling for this particular setting. All right, are there any other questions? If not, I'm going to take us offline shortly, and you can still hang around and chat with the speaker. I just wanna remind you of some upcoming talks we have.
In particular, two weeks from now we will have Seth Pettie, two weeks after that we'll have Ola Svensson, and the talk after that will be shifted by a week to avoid clashing with Thanksgiving. All right, thank you all for joining.