The second part was added due to popular request. It was not initially part of the talk, but you'll see. So the first part is a research talk, and the second part deals with K-12 math education in California. Is this mic working, by the way? Now it's on. It's working? OK, good.

So yeah, the research part of the talk is this: private frequency estimation via projective geometry. This is joint work with Vitaly Feldman and Kunal Talwar at Apple, and Huy Nguyen at Northeastern. I was having a little too much fun with the title slide there, using projective geometry, of course. And then at the end, a little bit on K-12 math education, mostly talking about California.

So, first part of the talk: what is it about? Most of us, I think, have smartphones, and we rely on various features in our smartphones that make using them more enjoyable. Features like autocomplete or automatic spell correction. And how do these features work? Well, here's a patent that I found online by Apple. Bottom line is, they need to learn from our data, right? So let me actually just zoom in on the part that's boxed in the bottom right: "Systems and methods are disclosed for a server learning new words generated by user client devices" — the server is them, the client devices are our iPhones — "in a crowdsourced manner." So you might read this and say, what does this mean? They're reading our texts and learning from our texts to then train their ML models to get better autocomplete and spell correction, right? That is roughly what it means. But we'll come back to that. I probably shouldn't speak for them; I'm being recorded.

So here's the setup. There's a server, which is, say, the company making the texting app, or the device manufacturer. And then there's a bunch of devices. And let's make it simple: I'm just paying attention to the last word you texted your friend on your phone. So each device has the last word that it texted.
And the server wants to know the popularities of different words in the dictionary. So it wants to know a histogram f, where f sub x says how many devices are holding the word x — they just texted the word x. These are the kinds of things the server would want to know; they want to know other things too, but this is our example. Simple enough: every time you text your friend, you send a carbon copy of your text to the device manufacturer. They learn the texting patterns of everyone in the world who's using their devices, and then they can learn how to do spell correction, whatever.

But there's a constraint, which is privacy. Do you really want the phone manufacturer to read all of your texts? Probably not. And if you go back to that patent filing, you'll notice all these things that I've boxed in red. So you want to learn in a crowdsourced manner, but while maintaining local differential privacy — a differential privacy system, privacy budget, privacy, privacy, privacy. So they're aware that their customers — us — are maybe not happy with the company learning everything about us.

So how does the server, the device manufacturer, learn from us in a way that maintains our privacy? And the basic idea is: yes, we're going to text our friends, but we're not going to send a carbon copy of that text to the device manufacturer. We're going to send some other message to the device manufacturer, which is some randomized thing that adds noise — whatever that means, we're going to see in a second — to hide a lot of our information from the device manufacturer. So let's pretend, for example, that we're not texting words to our friends, but we're texting images. Like, I just attended — well, this is a few years old now — I attended the baby shower for my first daughter, and there was a game where it's like, who can drink from a baby bottle the fastest? That's the actual text message I sent my friend.
But then what I send to the server is some noise-ified version of that, and here I'm progressively adding more and more noise, like static, basically. This is not what we actually do — I'm just giving you the idea, okay, you'll see what we actually do. And here it is with a ton of noise.

And the basic idea is something like the following, although the picture on the right is not as clear as I'd like, but okay, here's the basic idea. This picture of me drinking from a baby bottle: if there's only one phone in the world that's texting this picture, like I'm texting it to my wife, then okay, the server might not learn it — that's okay. But if it went viral, and people all over the world are texting this photo, then the server would really like to know that. So you can think that once something is popular enough, the server should learn it. Which means there are lots of devices out there sending this particular word, which is this image, okay, and each one of them is independently noisifying it with independent random noise. And there are some other people in the world who are texting other things, like a cat or whatever — not everyone in the world is texting this picture. And I would like a procedure where everyone is individually noisifying things, and the server can somehow aggregate all these noisified images, extract knowledge from them, and realize: oh, this picture — and maybe they won't recover the picture exactly, precisely, but they'll recover something close to the original picture — and they'll be like, well, I can ascertain that this particular picture is something that's going viral, but I don't know who actually sent it. I know that many people amongst these people on the left sent it, but I don't know who. So that's the privacy that's being maintained, okay?
So the moral — what we want the moral of the story to be — is that we can have each individual message look like total random garbage, thereby protecting individual privacy, but in a way that the server can still extract useful knowledge by aggregation. Okay, so that's the goal of what I'm gonna talk about.

But what exactly does privacy mean? You have to be careful. Here's an example — a real example. I took that same image and added a ton of noise; on the left, you can barely tell what's going on, right? Okay? And then I ran a signal processing algorithm on it called wavelet denoising — it doesn't really matter what it is — and lo and behold, from the garbage came... I mean, okay, it's not as good quality as the original image, but you can tell that it's a person drinking from something and they're wearing a jacket. You're not supposed to be able to tell anything about any individual's data, but from this one individual randomized message, you're able to actually extract a lot of information. So this is bad, okay? I don't want this kind of thing to happen. So we need to be careful with our definitions: we need to mathematically define what privacy means, and then prove that whatever algorithm we're running actually satisfies that definition, okay?

So we're gonna use a definition that's kind of the gold standard in this area, called local differential privacy. What is the idea? The idea is similar to what you saw before: each device i will send a random message M sub i that is only weakly correlated with its data x sub i — x sub i is like a word in the dictionary, or a picture of me drinking from a bottle. And we want that one individual device's message looks almost like random garbage, just like I said, but the server can extract from the aggregate. And then here's the privacy definition.
By the way, I know where I'm standing, I'm like blocking this projector — is it inconvenient for anyone on this side of the room? Is it okay? Okay.

So let's just walk through what this math means. What does it mean to satisfy epsilon differential privacy? For any device i — any phone — and any possible message M that the device could send, let's look at two different data elements: x is a picture of me drinking from a bottle, and x prime is a picture of a cat. The probability that device i sends that message M given that its data is the bottle picture, versus the probability that it sends that same message M given that its data is the picture of a cat — these two probabilities should be close. That's the privacy definition, okay? In other words, the probability distribution over messages you send should not be too sensitive to your data. And "close" means that the ratio should be at most e to the epsilon, okay?

When epsilon is zero, e to the zero is one, which means that my probability distribution over messages is the same no matter what my data is, which intuitively means you're not learning anything about my data from my message — so that's perfect privacy. So we'll call epsilon the privacy loss parameter: epsilon being zero means there's no privacy loss, you're not losing any privacy, and as you increase epsilon, the right-hand side becomes bigger than one, so you start having gaps between these probabilities, and you have some amount of actual privacy loss, okay?
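To make the ratio in that definition concrete, here's a minimal runnable sketch — my own illustrative toy, not the mechanism from this talk — of binary randomized response over a two-element universe. For any fixed message m, the ratio of sending probabilities under the two possible inputs is at most e to the epsilon:

```python
import math

def rr_message_prob(x, m, eps):
    """P[device sends message m | its data is x] for binary randomized
    response over the universe {0, 1}: send the true bit with probability
    e^eps / (e^eps + 1), otherwise send the flipped bit."""
    p_true = math.exp(eps) / (math.exp(eps) + 1.0)
    return p_true if m == x else 1.0 - p_true

eps = 1.0
for m in (0, 1):
    ratio = rr_message_prob(0, m, eps) / rr_message_prob(1, m, eps)
    # The epsilon-LDP definition: this ratio is at most e^eps for every m.
    assert ratio <= math.exp(eps) + 1e-12
```

With eps = 0 the two probabilities coincide (perfect privacy); as eps grows, the gap between them — and hence the leakage — grows.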
So there are two regimes to keep in mind with this definition. The first is small epsilon, where I mean epsilon is like 0.01, less than one, so there's very little privacy loss, and there e to the epsilon is roughly one plus epsilon, just by Taylor approximation. The other regime is large epsilon — not that large; I think in reality people usually use like five or six or something, and e to the five is around 150 — and that's what's usually deployed in practice. And you might say, well, wait, I thought these companies wanted to maintain my privacy. Why am I allowing them to use large epsilon? If you set epsilon to be infinity, that means there's no privacy whatsoever — so why am I happy with large epsilon?

Well, okay, before I answer that question, we should first understand that there is a fundamental trade-off between utility and privacy, utility meaning the quality of knowledge that the server is able to extract from the messages. On the one hand, if epsilon is zero, there's no privacy loss, but then the utility is arbitrarily bad, right? They don't learn anything about the data. So really great privacy, awesome privacy, no utility. The other extreme is epsilon is infinity, where there's no privacy at all — they're not maintaining privacy, the devices are sending their messages in the clear — but the server learns the data exactly, okay? So you would hope that there's some kind of smooth trade-off as you move epsilon between zero and infinity, that the utility somehow smoothly changes. And it turns out that for the problem we're considering today, that is true, and we're gonna talk about that. But because of that — remember now, the server probably really wants to learn our data — oh yeah, question.
So you mentioned that this utility is kind of the inverse of this privacy; however, in the equation that you showed us on the last slide, utility was not mentioned anywhere, so how can we understand that they are actually inverse to each other? Oh yeah — so when I defined privacy, I did not mention utility, is that what you're saying? Yes, that's true. So I guess, yeah, I'm only introducing utility now, but just intuitively: if I tell you that privacy needs to be perfect, that you shouldn't learn anything, then there should be no utility, right? How can there be any utility if they're not learning anything? Intuitively, I understand that. Yeah, yeah, but we'll get there. So what's the theoretical background behind it? Yeah, I mean, about the theoretical background — I haven't even defined utility mathematically yet, so we'll get there; we just have that intuition for now, right, that they should be inversely related.

Okay, and you know, the company wants to get as much utility as they can from our data, so they would love it if epsilon were infinity. But they're constrained: they've promised us that they're gonna maintain some level of privacy, so they'd like to set epsilon to be as large as possible without pissing us off, or without violating the law, or whatever it is, okay? But then there's a silver lining, which basically says: if you have a locally differentially private algorithm — remember the model, the model is that you have n devices on the left, you have the server on the right, and each one is sending a randomized message to the server — so if you create a mechanism in this model that satisfies a particular epsilon, and then you just run that algorithm in a different model called the shuffle model, which I'm not gonna define right now, there are theorems that say that the effective epsilon you get by running that mechanism in this alternate model is automatically amplified.
If your old epsilon was some constant, say two, your new epsilon is roughly some constant divided by root n, where n is the number of devices — n is the number of people who are using an iPhone, for example, okay? So the point is, if you have a lot of users out there, and you run your locally differentially private mechanism in this shuffle model, you automatically get a much better privacy guarantee. So that's why they're able to kind of morally get away with deploying with epsilon being five: their effective epsilon in the field is actually some function of five divided by the square root of the number of iPhones out there.

Yeah — doesn't this assume that nobody's picking up on traffic going back and forth between you and the server? What do you mean by picking up on traffic? Like, if you're communicating with the server, and the server shuffles it when it receives it — if somebody's picking up on the traffic before the server shuffles it, then... Okay, so I guess I didn't define the shuffle model. Let me just say what the shuffle model is, because it's not super complicated. The model you saw was devices here and a server there. In the shuffle model, there's one extra server called the shuffler, who sits in the middle, and they're the ones who actually receive the messages, and their only job is to take the messages, apply a uniformly random permutation, and forward them to the server. Okay, so I guess your question is about what happens when, say, someone intercepts a message and sees it before it reaches the shuffler. I mean, that can happen. If that happens, and they leak that information to the actual server or something, then you don't get the privacy amplification.
But you still do get whatever epsilon guarantee you had from the local model — I mean, the message that came out of the device in the first place is not the raw data, it's a randomized message. So you'll still get whatever epsilon guarantee you had from the local model; you just wouldn't get the amplification.

Okay, so let's keep going. Now that we're all on the same page with the model, what is the problem being studied today? I kind of already said it, but let me just say it firmly. Each device i holds a data element x sub i, which is an element of the universe one to k — let's say k is the size of some dictionary. This implies a frequency histogram f: the x-th coordinate of f is just the number of devices whose data is x. And the server wants to recover an f tilde that is close to f. Closeness I'm gonna measure as mean squared error, which is the average, over elements, of the squared difference between the true frequency of an element and my estimate of that frequency. This is gonna be a random variable, because the algorithms are all randomized — f tilde is gonna be a random vector that I reconstruct. So I want that to be small, say in expectation or with high probability. In this talk, I'm gonna talk about expectation.

And as I'm designing the algorithm, what are the things that I wanna optimize? There are five things in this talk that I'll care about, and I want all five of them to be small. The privacy loss is epsilon: if I fix the other four things, the smaller epsilon is, the better — more privacy. Utility loss is the mean squared error: again, fixing the other four things, the smaller the utility loss, the better — I want small reconstruction error. Low communication: each device has to send a randomized message, some B-bit message, and I want B to be low. Server time: the server collects all these messages and applies some algorithm to them to compute an f tilde, and I want that algorithm to be fast.
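In symbols, the way I read the setup, the histogram and the error measure being described are:

```latex
f_x = \left|\{\, i \in [n] : x_i = x \,\}\right| \quad \text{for } x \in [k],
\qquad
\mathrm{Err}(f, \tilde f) \;=\; \frac{1}{k}\sum_{x=1}^{k}\bigl(f_x - \tilde f_x\bigr)^2
\;=\; \frac{1}{k}\,\bigl\lVert f - \tilde f\bigr\rVert_2^2,
```

and the goal is to make the expectation of this error over the mechanism's randomness small.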
And then the device time: I'm a phone, I have my data, and I need to figure out what randomized message I'm gonna send given my data — that should be fast too. So I want all five to be small. And of course, the first two are related: you'd expect there's gonna be some privacy-utility trade-off.

So before I tell you what we did, let me just ask, out of curiosity: who's seen local differential privacy at all before this talk? Okay, a minority of people, which is fine. So let me give you a quick crash course on some of the basic mechanisms that existed in this space before our work.

One is something called randomized response. Okay? Remember now the definition of privacy: for any message that I could send, if my data is x versus x prime, the ratio of probabilities of sending that message should be bounded by e to the epsilon. Okay? So what do I do? Each device will send its true item x with probability e to the epsilon times p; otherwise, it sends a uniformly random other element, so that any other element is sent with probability p. Now what is p? There's only one value of p I can put into this that makes any sense, because the probability that I send something is one. What's the probability that I send something? Well, I either send my true data, which has probability e to the epsilon p, or I send some other element of the universe — there are k minus one other elements, each with probability p — and that has to equal one. So solve for p and you get something. So that specifies what the device does.

What does the server do? It's gonna use some linear estimator. What do I mean by that? Okay, let's say the server wants to estimate how many devices are holding data element x. For each message m sub i that I receive as the server, I'm gonna think: do I believe the user who sent this message has x or not? If the message is x, I'm gonna think it's more likely that they were really holding x.
So if the message was x, I will add alpha plus beta to a counter; if it's not x, I'll only add beta. What are alpha and beta? I'll get to that in a second. If x sub i equals x, what's the expected contribution to this counter? Well, remember, I always add the beta, no matter what. What's the probability that I add the alpha? The probability that I add the alpha is e to the epsilon p. So that's the expected contribution when x sub i equals x. If x sub i is not equal to x, then the probability that I add the alpha is only p, so I get alpha p plus beta, right? Okay, so you tell me — just to make sure that everybody's on the same page here — if I want this to be an unbiased estimator of f sub x, what do I want this to equal? If you're actually holding x, I want the expected contribution to the counter to be one, and if you're not holding x, I want it to be zero. So I want the first thing to be one and the second thing to be zero: two equations in two unknowns, alpha and beta. I can solve for alpha and beta, and I get something. So that's it. That's the whole protocol: I've told you what the device does, and I've told you what the server does.

And now that you know what the mechanism is, you can analyze it — compute its variance. What is the expected mean squared error? It's a calculation; you could do it, okay? And if you do that calculation, you'll get some really bad utility loss.

Yeah — would you explain why, when x sub i is not equal to x, you have alpha p plus beta? Why not beta itself? Oh, because even if my data is not x, there's still a chance that the message I send is x, right? And the chance that I send x, if I'm not holding x, is p. Yeah, you're with me? Okay.

Okay, so if you do the analysis, it turns out the utility loss is terrible, but the nice thing is, the communication's not that big — you'll see, I'll compare it to the next scheme, which has really bad communication.
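As a sanity check on the whole protocol, here's a short runnable sketch of k-ary randomized response as just described — solving for p, then for alpha and beta from the two unbiasedness equations. The function names and parameter values are mine, for illustration only:

```python
import math
import random
from collections import Counter

def rr_device(x, k, eps, rng):
    """Send the true item with probability e^eps * p, otherwise a uniformly
    random other element (each sent with probability p), where p solves
    e^eps * p + (k - 1) * p = 1."""
    p = 1.0 / (math.exp(eps) + k - 1)
    if rng.random() < math.exp(eps) * p:
        return x
    other = rng.randrange(k - 1)          # uniform over the k-1 other items
    return other if other < x else other + 1

def rr_estimate(messages, k, eps):
    """Linear estimator: alpha and beta solve
    alpha * e^eps * p + beta = 1  (device holds x)  and
    alpha * p         + beta = 0  (device doesn't)."""
    p = 1.0 / (math.exp(eps) + k - 1)
    alpha = 1.0 / (p * (math.exp(eps) - 1.0))
    beta = -1.0 / (math.exp(eps) - 1.0)
    counts = Counter(messages)
    n = len(messages)
    return [alpha * counts.get(x, 0) + beta * n for x in range(k)]

rng = random.Random(0)
k, eps, n = 10, 1.0, 50_000
data = [rng.randrange(k) for _ in range(n)]            # made-up device data
messages = [rr_device(x, k, eps, rng) for x in data]   # what the server sees
f_tilde = rr_estimate(messages, k, eps)
```

The estimates sum to exactly n (alpha plus beta times k works out to one), and each coordinate is unbiased but noisy — that per-coordinate variance is exactly the bad utility loss being described.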
In particular, you're trying to send an element of the universe, which should take log k bits; instead, you're sending possibly some other element of the universe, which is still only log k bits. So there's no communication overhead compared to sending your actual data, which I'm happy with. And it's a linear-time algorithm for the server, right? The server just loops through all the messages, and for each message m sub i, it adds an alpha to the m sub i-th counter, and then it adds beta times n to all counters at the end, or something like that. So it's a very fast algorithm.

Okay. Now here's another simple scheme, called subset selection; this is due to Ye and Barg. There are some similarities, but it's not exactly the same. So what is this algorithm? There's a parameter d — we'll get to that parameter later, but d is some positive integer. Each device will send a random subset of the universe of size d: of all k choose d sets of size d, it chooses a random one. So now what are the probabilities? If the set S contains x — x is its data — we'll send that set with probability e to the epsilon p; if it doesn't contain x, we'll only send that set with probability p. What is p? Again, there's only one value of p where this makes any sense, because the probability that I send something has to be one. What's the probability that I send a set that contains x? Well, how many sets contain x? k minus one choose d minus one, and each of those has probability e to the epsilon p. How many sets don't contain x? k minus one choose d, right? And those have probability p. So that sum has to be one, and p is determined.

So now we know what the device does. What does the server do to estimate the frequency of x? It's gonna be very similar: if x is in m sub i — now m sub i is a set — I'll add alpha plus beta, otherwise I only add beta.
If x sub i equals x, the expected contribution is something; if it's not equal to x, the expected contribution is something else. You can figure out what these things are — not too hard to come up with if you sit down and think about it for a few seconds, or a few minutes. We want the first thing to equal one and the second thing to equal zero, just like last time: two equations in two unknowns. Once you solve for alpha and beta — of course alpha and beta will be functions of d, just like p is a function of d — you do some variance calculation to compute the expected mean squared error, and you'll get an expression that's a function of d. d was a parameter, so you then do some calculus, set a derivative to zero, optimize, and choose the best d to minimize the mean squared error. It turns out that the right value of d to choose is something like k over e to the epsilon. If epsilon's a constant, the denominator is just some constant, so d is proportional to k. k is the size of the whole dictionary. So I have one word in the dictionary, and my message is almost as big as sending the entire dictionary — like sending 10% of the dictionary.

So the con is that the communication is terrible: I'm sending really big messages. And just because the messages are so big, even for the server to read all the messages takes forever, so the server is gonna be really slow as well — instead of linear time, it's gonna be like quadratic time. The pro is that you can prove a lower bound — actually Ye and Barg, in a different paper, got the lower bound before the upper bound — and this is actually the optimal privacy loss versus utility loss trade-off. So for any fixed epsilon, the utility loss they achieve is the best possible.

Question? Can you explain what utility loss is again? It's the expectation of the squared L2 norm of f minus f tilde, divided by k or something — but yeah, the squared Euclidean distance.
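The subset-selection scheme can be sketched the same way. This is my own illustrative code, not from the talk; to stay efficient, it samples a d-subset conditioned on containing (or not containing) the device's item, rather than enumerating all k-choose-d sets:

```python
import random
from math import comb, exp
from collections import Counter

def ss_probs(k, d, eps):
    """p solves e^eps * p * C(k-1, d-1) + p * C(k-1, d) = 1.  Returns:
    q_in  = P[my message contains my own item x_i],
    q_out = P[my message contains a fixed OTHER item x != x_i]."""
    p = 1.0 / (exp(eps) * comb(k - 1, d - 1) + comb(k - 1, d))
    q_in = exp(eps) * p * comb(k - 1, d - 1)
    q_out = (q_in * (d - 1) + (1.0 - q_in) * d) / (k - 1)
    return q_in, q_out

def ss_device(x, k, d, eps, rng):
    """Send a random d-subset, biased toward subsets containing x."""
    q_in, _ = ss_probs(k, d, eps)
    others = [y for y in range(k) if y != x]
    if rng.random() < q_in:
        return frozenset([x] + rng.sample(others, d - 1))
    return frozenset(rng.sample(others, d))

def ss_estimate(messages, k, d, eps):
    """Linear estimator: alpha*q_in + beta = 1 and alpha*q_out + beta = 0."""
    q_in, q_out = ss_probs(k, d, eps)
    alpha = 1.0 / (q_in - q_out)
    beta = -q_out / (q_in - q_out)
    n = len(messages)
    est = [beta * n] * k
    for m in messages:
        for x in m:
            est[x] += alpha
    return est

rng = random.Random(1)
k, d, eps, n = 6, 2, 1.0, 20_000
data = [rng.randrange(k) for _ in range(n)]            # made-up device data
messages = [ss_device(x, k, d, eps, rng) for x in data]
f_tilde = ss_estimate(messages, k, d, eps)
```

With the d proportional to k that the talk says is optimal, the variance is great, but each message costs roughly log of k-choose-d bits — the communication blow-up just discussed.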
And when I say optimal, I mean they really nailed the optimal. It's not just asymptotically optimal; that expression is optimal up to one plus little-o of one factors. Okay, great.

So we had a really fast algorithm on the previous slide with terrible utility loss. Now we have great utility loss, but terrible runtime and terrible communication. So now we want the best of all worlds, right? Why not?

First, let's observe that both of these fit under a certain meta-approach. What is the meta-approach? The data is an element of a universe of size k, and there's some message space Y. Whenever a device sends a message, the message it sends will be an element of the message space. So in randomized response, the message space was just the universe itself; in subset selection, the message space was the set of all size-d subsets of one to k. But you can construct whatever message space you want. And for each element x of the universe, we'll have a subset of the message space, which is the set of preferred messages for x — I prefer to send a message that's a preferred message. For example, in randomized response, S sub x is the singleton set that only contains x; in subset selection, S sub x is the set of all size-d subsets that contain x. And I'll construct a set system where these preferred message lists all have the same size for every data element in the universe — they all have size little s. And also, it'll be like some design. What I mean by that: if I have x and x prime, two different elements of the universe, their preferred message lists always have the same intersection size, which I'll call l. In randomized response, l was zero, because you have two singleton sets that don't intersect. Okay, great.

So the mechanism is: I have my data x, and I'm trying to figure out what probability I should assign to each message y.
Well, if y is not a preferred message, I send it with probability p, and if it is a preferred message, I send it with probability e to the epsilon p. Again, p is determined, okay? How many preferred messages do I have? s, each with probability e to the epsilon p. How many non-preferred messages do I have? The size of Y minus s, each with probability p. The server will estimate f sub x as you saw before: if m sub i is a preferred message for x, I add alpha plus beta; if it's not, I only add beta. Again, we want that if x sub i equals x, this has expectation one, and otherwise it has expectation zero. I get two equations in two unknowns, and I can solve for alpha and beta. They depend on s and l as well as p, and p depends on the size of the message space as well as epsilon and s.

Okay. And then, now that the protocol is determined, I can compute its mean squared error — and this slide only exists as a proof of concept to say: look, there is a calculation you can do. It doesn't really matter; line by line, you don't have to follow this. All that matters is that I can get a closed-form expression for the expected mean squared error just as a function of properties of this combinatorial set system: as a function of s, the size of the preferred message lists; l, the intersection size; it also depends on epsilon and n; and it also depends on the size of the message space. In particular, I want the ratio l over s to be small — that makes this small — and I want the ratio of the message space size to s to also be small. And if you look at this — I've highlighted in blue the leading term; that fraction in white, just take my word for it, is a smaller-order term than the blue one. Okay.
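Since everything is determined by the design parameters, the generic solve can be written once. A sketch with my own naming: given the message-space size Y, preferred-list size s, intersection size l, and epsilon, it recovers p, alpha, and beta from the two unbiasedness equations, and plugging in the randomized-response and subset-selection parameters reproduces both earlier schemes:

```python
from math import comb, exp

def meta_coeffs(Y, s, ell, eps):
    """Generic estimator coefficients for a design where every preferred
    list has size s, pairwise intersections have size ell, and the
    message space has size Y."""
    p = 1.0 / (s * exp(eps) + (Y - s))             # probabilities sum to 1
    p_same = s * exp(eps) * p                      # P[m in S_x | data = x]
    p_diff = ell * exp(eps) * p + (s - ell) * p    # P[m in S_x | data != x]
    alpha = 1.0 / (p_same - p_diff)                # alpha * p_same + beta = 1
    beta = -p_diff / (p_same - p_diff)             # alpha * p_diff + beta = 0
    return alpha, beta

k, d, eps = 10, 3, 1.0
rr_coeffs = meta_coeffs(k, 1, 0, eps)  # randomized response: Y = k, s = 1, l = 0
ss_coeffs = meta_coeffs(comb(k, d), comb(k - 1, d - 1), comb(k - 2, d - 2), eps)
```

For subset selection, Y is k choose d, s is k-1 choose d-1 (sets containing x), and l is k-2 choose d-2 (sets containing both x and x prime), so both schemes really are instances of the one template.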
If you then look at this and ask why subset selection did so well, you'll realize that when you chose d at the end, what happened was you basically ended up with an l over s that canceled this e to the epsilon minus one, and a Y over s equal to this e to the epsilon minus one. So since this canceled, you get one plus one is two; and then this equals that, so you get another two times e to the epsilon minus one. So this two, times whatever, times another two is a four — you get like four times e to the epsilon something. So that's where the four came from: those two terms canceling. Okay, great.

So in any case, mechanism design under this meta-approach now reduces to a combinatorial question, right? How do you design a set system that has the nice l and s you want, and the nice message space you want? Also keep in mind: what's the message length? The message length is log of Y bits — Y is the message space, and specifying an element of Y takes log of the size of Y bits. So you also want Y to be small, all things considered. So you have these combinatorial expressions that you're trying to make small all at the same time. You also want your set system to be such that you can compute all of these counters for all x simultaneously, quickly, because you want a fast algorithm, right? I mean, you could just do a double for-loop — loop over x, then loop over all i, and compute this for each one separately — but you want something better than that if possible. Okay, yes, question.

Why isn't it n k? Because I thought the histogram would be the size of the... I guess, what is n? n is the number of devices. k is the size of the universe, and n is the number of devices. I see, I see. That's right, the mean squared error does depend on n. And I guess it would depend on k, except in my definition I divided by k. Okay, so now, yeah, so I mean, roughly — okay, well, I'll get there.
Let me come back to that, but yeah. So now this is a combinatorial design question, and here's one approach to try — it's not gonna quite work, but we'll fix it. We'll pick some prime q. Later we'll set q to be the right thing — it'll end up being roughly e to the epsilon — but for now, don't worry about what q is; it's just some prime. And the message space is gonna be F q to the t: all t-dimensional vectors over the finite field F q, the finite field of q elements. And I'll pick t large enough so that the size of F q to the t is at least k, which means I can view any element of my universe as an element of F q to the t. That basically means t has to be the ceiling of log base q of k.

Okay, so now I can pretend that every element of my universe is a t-dimensional vector over F q. So I'll define its preferred message list as a subspace of F q to the t: the t minus one dimensional subspace orthogonal to x — any y such that x dot y is zero mod q. I'm gonna define that to be my preferred message list. And then, if I look at two different preferred message lists, these are just two different subspaces, each of dimension t minus one; I intersect them and get a t minus two dimensional subspace. So this means that little s, the size of my preferred message list, is q to the t minus one, and my intersection sizes are q to the t minus two. So then I get l over s and s over Y to both be one over q. And then I set q to be roughly e to the epsilon, or e to the epsilon minus one — whatever I needed; remember I said the magic happens when l over s cancels that, right? So l over s is now one over q, and I'll pick q to be as close to that as possible.

Great — not so fast, it doesn't quite work. And the reason it doesn't work is that the x's are just different elements of F q to the t.
Like, what if one element of FQ to the T is one zero zero and another element is two zero zero? Then actually the preferred message lists are the same; they define the same orthogonal subspace. So the intersection of S X and S Y is not a T minus two dimensional subspace. It's just the same T minus one dimensional subspace back again, right? So that's a problem. It doesn't satisfy the meta approach that I described. But that's okay. The fix is projective geometry. And honestly, okay, I will go through this slide just because I think it's interesting; I didn't know about this before working on this project. I mean, the real fix is, okay, there's this thing called projective space. It's not too complicated. All it means is you normalize vectors, okay? So your space will not be FQ to the T. It'll be normalized FQ to the T, projective FQ to the T. What does that mean? It means you take the set of all non-zero vectors, and for any non-zero vector, you normalize it so that its first non-zero entry is a one. So you look at its first non-zero entry, going from left to right, divide everything by that number in the finite field, and now you have a normalized vector. And if you have normalized vectors like that, then it's never the case that two different elements of the space are multiples of each other, which means they will always define different orthogonal subspaces. Okay? And the connection, so I don't know why this happened, but I was invited to give a talk to a media and arts department, and I warned them. I was like, you know, that's not what I do. I don't really know anything about art. But then I tried to fit this talk into art, and I was like, well, where did projective geometry come from? Why is it important? And actually one of the reasons it's important, there are others, I guess, is that it's connected to Renaissance art, right?
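A small sanity check of the failure mode just described: over a finite field, scalar multiples of x define exactly the same orthogonal preferred list, so the intersection bound breaks. This is my own toy snippet with parameters of my choosing, not code from the talk:

```python
from itertools import product

# Over F_q, scalar multiples of x define the same orthogonal preferred list.
q, t = 5, 3
vecs = list(product(range(q), repeat=t))

def preferred(x):
    # S_x: all y in F_q^t with <x, y> = 0 (mod q), a (t-1)-dimensional subspace.
    return frozenset(y for y in vecs
                     if sum(a * b for a, b in zip(x, y)) % q == 0)

assert preferred((1, 0, 0)) == preferred((2, 0, 0))  # same subspace: the problem
assert len(preferred((1, 0, 0))) == q ** (t - 1)     # s = q^(t-1)
# For genuinely different directions, the intersection is q^(t-2), as intended:
assert len(preferred((1, 0, 0)) & preferred((0, 1, 0))) == q ** (t - 2)
```

So the subspace construction works exactly when no two universe elements are scalar multiples of each other, which is what the projective-space fix arranges.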
So one of the big innovations in the Renaissance was drawing more realistic pictures, right? Oh yeah. Yeah, exactly. So it's like drawing in perspective. Yes, exactly. Do people know about this? Drawing in perspective. Some people, not everyone, okay. So the idea is, this is you, right? This yellow person. And you're standing on the road, and these are the edges of the road, which are parallel lines, of course, right? And you're looking out into the distance and drawing onto a canvas. So this plane here is the picture plane, they call it. That's the canvas you're drawing on. And what these lines indicate is, this is your eye, and if you look anywhere out into the world, there's a line, or a ray, that comes out of your eye into the world, right? And that ray intersects the picture plane, the canvas. And anything on that line projects onto the same point on the canvas. So in other words, lines become points, right? Which is exactly what we're doing when we said, look, one zero zero, two zero zero, three zero zero, these are all on a particular line through the origin, and we're going to normalize to map all the points on that line to a single point. That's the projection; that's projective geometry. Okay, so everything on this line is the same point on the canvas; everything on that line is a different point. And there are some cool things that happen when you draw in perspective: parallel lines, like the edges of this road, are actually not parallel in the drawing. They actually intersect at infinity. And this point at infinity, in art, in a perspective drawing, they call the vanishing point. But whatever, okay, this is not an art class. Yes? So when you move to projective space, don't you get collisions between your messages? Like one zero zero and two zero zero, these could be two distinct messages in your initial vector space. Ah, okay, good.
So first of all, now the space I work in is projective space. Every element of my universe, I will view as an element of projective space. So one zero zero and two zero zero don't exist anymore; there's only one zero zero now. And then, well, let me just go to the next slide. Oh, there's a picture of it. Anyway, you can see that's the origin there, and everything that is the same color is the same point in projective space, because they're on a line, but a line in F three to the three. Yes, okay, anyway, this was for the art people. So you define projective points in FQ to the T as non-zero vectors whose first non-zero entry is a one; you normalize them. And then you can count, I mean, it's not that complicated, how many projective points there are. There are Q to the T vectors in FQ to the T; take out the origin, so you subtract one. And then every non-zero vector has Q minus one different scalings, so you divide by Q minus one. That's the number of projective points that exist. And I'll choose T such that the number of projective points is at least K, so now I can identify elements of the universe with projective points. And my preferred set now is the projective subspace orthogonal to X: that is, all projective points U that satisfy X dot U is zero mod Q. And then, now that you have the construction, you can figure out what s and l are, and it turns out that everyone's happy at the end, okay? So this is it. Okay, so I still have time left. What I want to show you is, first of all, we implemented this thing, and we implemented all the previous algorithms, and it works well. And also, the thing that's maybe not obvious right now is why this lends itself to a fast algorithm. Where does the speed come into play? So let's talk about that. First of all, here is a table of communication, utility loss, and server time.
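The counting argument just described can be sketched in a few lines. This is my own helper (the function name and parameters are mine, not from the talk): enumerate the normalized representatives and check the count against (Q to the T minus one) over (Q minus one):

```python
from itertools import product

def projective_points(q, t):
    # Projective points of F_q^t: nonzero vectors whose first nonzero entry is a 1
    # (each line through the origin contributes exactly one such representative).
    pts = []
    for v in product(range(q), repeat=t):
        nz = [c for c in v if c != 0]
        if nz and nz[0] == 1:  # nonzero, and already in canonical (normalized) form
            pts.append(v)
    return pts

# q^t vectors, minus the origin, divided by the q - 1 scalings of each vector:
assert len(projective_points(3, 2)) == (3**2 - 1) // (3 - 1)   # 4 points
assert len(projective_points(5, 3)) == (5**3 - 1) // (5 - 1)   # 31 points
```

One would then choose t as the smallest value making this count at least k, so universe elements can be identified with projective points.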
We give two new algorithms. One is the algorithm I just told you about, which we call projective geometry response. And the other is a hybrid projective geometry response algorithm. And what is the difference? You'll notice the previous state of the art was something that got pretty decent communication and optimal utility loss. This is the optimal bound; even the constant factor is optimal. But the server time was quadratic. That was the previous state of the art. If you look at projective geometry response, it gets great communication, optimal utility loss, and almost linear runtime: n plus k log k times e to the epsilon. Okay, now this e to the epsilon: remember I said in practice epsilon is often five when people deploy these things? So e to the epsilon is about 150. So okay, it's linear time, but there's a 150 factor right there. Can I reduce that? And if you look at some of the previous algorithms, they were actually closer to linear time; they had runtimes like n plus k log k without any e to the epsilon in them at all. But those did not have optimal utility loss, right? So what we do in our hybrid algorithm is show that you can smoothly trade off: if you really don't like that factor 150 slowdown in your runtime, you can trade off speed for utility loss. We show that if you're willing to worsen your utility loss by a one plus one over q minus one factor, for any q, q can be any prime of your choice, then you can turn that e to the epsilon in the runtime into a q. So imagine q is five. I could get n plus five k log k as my runtime, but I'll have to sacrifice and worsen my utility loss by twenty-five percent. So instead of four n, it'll be five n. Which is still better than eight n, I guess. Okay, so that's the statement of our contribution.
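Just to check the arithmetic in that q equals five example (my back-of-the-envelope, matching the numbers just stated):

```latex
\underbrace{1 + \tfrac{1}{q-1}}_{\text{utility-loss blowup}}\Big|_{q=5}
  \;=\; \tfrac{5}{4} \;=\; 1.25,
\qquad
4n \times 1.25 \;=\; 5n,
\qquad
e^{\epsilon}\big|_{\epsilon=5} \;\approx\; 148 \;\approx\; 150 .
```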
And then I want to show some plots and talk about why the algorithm is fast, and at the end maybe I'll take a break for questions before we move into part two, which is the math education stuff. It's completely unrelated, well, not completely unrelated; I guess you have to get educated in math to be able to do algorithms. Yeah, so questions before we move on. Yeah. Your projective geometry is slightly different from the math definition in algebraic geometry, right? Because there you could also have the first coordinate zero, and it's required to have, you know, one dimension less than the other coordinates. Wait, so I could have the first coordinate zero? I can also have the first coordinate zero. So for me, I'm just saying the first non-zero coordinate has to be a one. Okay. Yeah. Yeah, okay. Okay, so here are some plots. Roughly, what do the plots mean? So remember, it's a randomized mechanism, right? So the error I get is a random variable. So I ran the experiment a lot of times, I don't know, 100 times, 1,000 times, something, and each time I ran it, I looked at what the error was, the mean squared error. So I get an empirical CDF of the error distribution. And in different colors, so the y-axis is the error, high is bad, that's more error, and the x-axis is the percentiles of the empirical CDF. Blue is randomized response. Remember, that's the super duper fast algorithm, but it has really bad error, as you can see empirically. And from now on, I'm just going to stop plotting randomized response, because it's so bad that you can't really even distinguish the other ones on the plot. So there's a different plot without the blue one, without randomized response. Green and red up top, those have a constant factor worse mean squared error. So you can see, instead of, I don't know, 150, they have 600 or something. But they're really fast, okay?
And then down here, blue is subset selection, which is an optimal algorithm; it has optimal expected mean squared error. That's the one you saw that has really bad communication. And the teal is our new algorithm, which is basically right down there with subset selection in terms of error, which is great. And the yellow one is our hybrid algorithm. Remember I said, if you're willing to worsen the utility loss by twenty-five percent, you can get a fast algorithm; that's the yellow one. More plots. I won't spend too much time on this, but we ran different kinds of experiments with different kinds of data, and the picture always looked very similar. Teal is us, blue is subset selection; we're always competitive with each other. Yellow is our hybrid algorithm. And green and red are other fast algorithms that are not optimal in terms of utility loss, yeah. Do you happen to know if Apple switched to this algorithm? Has Apple implemented this algorithm? I don't know. And if I knew, I probably couldn't say, but I don't know. I don't know. Okay, in terms of runtime: the previous state of the art, for reasonable settings of universe sizes and data sizes, on my desktop took about half an hour to reconstruct the approximate histogram. I measured runtime in seconds, so that number of seconds is a little more than half an hour. Projective geometry response on the same data took 37 seconds, roughly. Our hybrid algorithm took about six seconds. And then recursive Hadamard response and Hadamard response, these are algorithms that are really fast but have a constant factor worse utility loss; these are the red and the green on the plots. They are noticeably faster, so we're not as fast as the fastest algorithms. And randomized response is blazingly fast, but has ridiculously bad error, so you wouldn't use it.
So we're not as fast as the fastest, but we're, I think, in the same ballpark; maybe it's acceptable. And we're much faster than the previous state-of-the-art algorithm that had optimal utility loss. There was a question over there. How do you generate data in the experiment? How do we generate the data? So we tried different things, and we realized the plots didn't really change much in terms of shape. We tried spike data, where, like, everyone has the same element one. We tried Zipfian decay data with different Zipf parameters. What else did we try? You can see in the top left, this has Zipf with some parameter; these are all Zipf, and spike data. And I think there was one other thing we tried too, but the picture was always similar. Yeah. Good. Yeah. I may have missed this. Is client runtime the same for all of these algorithms, or is that not really? Ah, client runtime. It's a good question. So it's not the same for all of them. For example, for subset selection, it's really slow. Randomized response is really fast. For our algorithm, what do you need to do? You have your element, which is a number between one and K, and you need to figure out what projective point it maps to. That you can do in roughly T time; remember, it's FQ to the T, so T is roughly log base Q of K. So it's not too big. And then you need to decide, okay, am I going to generate a random vector orthogonal to this, or a random vector that's not orthogonal to this? So what I'll do is I'll basically say, okay, if it's not orthogonal, then its dot product will be a random number between one and Q minus one, something that is not zero. And then I need to generate a random vector that satisfies that dot product.
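That last step, generating a random vector with a prescribed dot product, can be sketched as follows. This is my own illustration (the function name is mine, and the real scheme samples among projective points; this only shows the dot-product trick): fix a coordinate where x is nonzero, fill in the rest uniformly, and solve for that coordinate.

```python
import random

def sample_with_dot(x, c, q):
    # Sample u uniformly from {u in F_q^t : <u, x> = c (mod q)}:
    # pick a pivot coordinate j with x_j != 0, choose all other coordinates
    # uniformly at random, then solve the linear constraint for u_j.
    t = len(x)
    j = next(i for i, xi in enumerate(x) if xi % q != 0)  # pivot coordinate
    u = [random.randrange(q) for _ in range(t)]
    rest = sum(u[i] * x[i] for i in range(t) if i != j) % q
    u[j] = ((c - rest) * pow(x[j], -1, q)) % q  # inverse exists since q is prime
    return u

x, q = [1, 2, 0], 5
u = sample_with_dot(x, 3, q)
assert sum(a * b for a, b in zip(u, x)) % q == 3
```

Since every choice of the non-pivot coordinates extends to exactly one solution, this is uniform over the constraint set, and it runs in the roughly-T client time mentioned above.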
So overall, the runtime is like T, which is log base Q of K for our algorithm, on the client side. Okay, so maybe in the last part of the talk: making our scheme fast. Why do we have a fast algorithm? The overall summary, and I'll show you the details, is: you find a recurrence relation that computes the thing you want to compute, and then you use dynamic programming. So actually, this was my first time writing DP, not differential privacy, dynamic programming, my first time writing dynamic programming code outside a programming contest, I guess. So, okay, now let's think about what it is we want. Remember, f tilde of x is the sum over all i from one to n of alpha times an indicator of whether m i is preferred for x, plus beta. Let me pull the alpha out, and realize that I add beta every time, so I just add beta times n at the end. Okay, and now that inner sum, I want to count how many devices sent a preferred message. I'm going to think of that in a different way. I'm going to imagine that there's a histogram over the message space; I'm going to call it y. Y sub u is how many devices sent u as their message. Okay, so now I want to know how many preferred messages were sent for x. Let me sum, over all u that are canonical, meaning they're projective points, such that x dot u is zero mod q, the value y sub u. At the end of the day, I want this sum for all x simultaneously. And once I have that sum for all x simultaneously, I can write down the histogram. Naively computing that sum would take about k over q time per x, because that's how many orthogonal u there are per x. And there are k values of x, so the naive runtime would be something like k squared over q. So, quadratic time in the universe size; I'm trying to get something like linear time. Okay. And we're going to do it using dynamic programming. And how does that work? So, I'm teaching undergrad algorithms this semester.
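Written out as code (my own naming, just for concreteness), the naive quadratic-time baseline is a double loop over x and the message histogram:

```python
def naive_decode(y, points, q, alpha, beta, n):
    # y: dict mapping messages (tuples over F_q) to counts.
    # points: the universe, as projective points.
    # Estimate f~(x) = alpha * (# preferred messages for x) + beta * n.
    est = {}
    for x in points:                      # k outer iterations
        hits = sum(cnt for u, cnt in y.items()
                   if sum(a * b for a, b in zip(u, x)) % q == 0)
        est[x] = alpha * hits + beta * n
    return est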
I always tell students, when they're designing dynamic programming solutions, that it's really just brute force recursion, and then you add the memoization at the end. I mean, that's how I usually design these algorithms. And then later, if you want to implement it bottom up, you can. So what's the recursion? I'm going to define f of a, b, z as the sum of all y sub u's where the length j prefix of u is a, and the length t minus j suffix of u, dotted with b, is z mod q. So I'm summing y sub u over all u's that satisfy this. That's the definition of f of a, b, z, okay? Now, what's the quantity I actually want to compute? I want to compute this sum for some x. Sorry, I should have written x here; pretend this v is an x. So the thing I actually want to compute is: well, z should be zero, because I want the dot product to be zero mod q. B should be x itself. And I want the length t suffix of u, dotted with x, to be zero. The length t suffix of u is all of u. So that means a, the prefix of u, is the length zero prefix, which is the empty string. So f of the empty string, x, zero is the thing that I want, okay? And now what we'll show is that f of a, b, z satisfies a recurrence relation, okay? What's the recurrence relation? All right, I don't want to get buried in the details, but let's walk through this a little bit. So, okay, let's think about where we are. First of all, we have the base cases. Let's ignore the base case and get to the recursive step; I guess that's the interesting part. I want the length j prefix of u to be a, and the dot product of the remaining suffix of u with b to be z. So basically, what I'm going to recurse on is: if I know the length j prefix of u is a, I'm going to recurse on all ways to extend that prefix by one more element. So let me loop over all w's I can use to extend the prefix, okay?
And now the new prefix is a concatenated with w. The new suffix is the old suffix with the first element removed. And now, remember, I want the dot product to equal z. So the new dot product, after I do that extension, should be z minus b one times w, mod q. That way, when I add back in the b one times w, because remember, the coordinate I dropped was b one, and b one got multiplied by that w, I get z mod q. Okay, and then there are two cases in the recursive step, based on whether a is all zeros or not. And why was that again? Ah, yes. The point is, remember, at the end of the day, a is a prefix of u. So as I extend a, I'm building out what u is, right? At the end of the recursion, when the recursion bottoms out at the base case, u needs to be a projective point, right? Which means its first nonzero entry has to be a one. So if the a I've built out so far is all zeros, then the next thing I concatenate has to be either a zero or a one. But if it's not all zeros, that means I've already taken care of the starting one, and I'm allowed to append any number I want; it doesn't have to be a one anymore. Okay, and if you just figure out what this means, you get a dynamic program that takes k q squared t time and k q space, right? You just count how many a, b, z possibilities there are and sum up the time over all of them. There's an optimization you can do, which is: in this recursion, remember, b is a suffix of x, and x is canonical, meaning it's projective, meaning its first nonzero entry is a one. A suffix of a canonical vector is not necessarily canonical, right? A suffix of a canonical vector might start with a number that's not a one. So b is not necessarily canonical. But the optimization is to observe that we only need to fill in entries of the DP table where b is also canonical. Why is that?
Because if b is not canonical, and I'm not going to spend too much time on this, by the definition of f, f of a, b, z is the same as f of a, alpha b, alpha z for any nonzero alpha. So just choose alpha to normalize b and make it canonical. So you only need to compute the entries of the DP table where b is also canonical, which essentially cuts down your number of states by a factor of q, or q minus one or something. So that saves you a factor of q in your time, as well as a factor of q in your space, and overall you get k q t time and k space. And then there's the usual trick with bottom-up DP: here you have a 3D table as your DP table, but you realize that each plane of the table only depends on the previous plane. So if you go bottom up, you can save memory by reusing memory. So you can do all that and implement it. There it is, this is the full algorithm; it's on GitHub. And you can see that's exactly what's happening here. I don't know if you can see, there are some loops, swap last next; that's basically just keeping track of the last two slices of the DP table at a time, to save memory. That's all that is. Okay, so the trade-off. As I mentioned, we have a trade-off between utility and run time, and how does that trade-off work? The idea is, you have this universe of size k, and you use a parameter h. You pretend that your universe breaks up into h blocks, each of equal size k over h. So there are the elements one, two, three, up to k over h, then k over h plus one, k over h plus two, up to two k over h, et cetera. These are the different blocks. You use randomized response to reveal which block your element is actually in, and then you do projective geometry response inside the block, and your overall message is these two things sent together.
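The recurrence just walked through can be sketched as a top-down memoized recursion. This is my own illustrative version, not the paper's optimized bottom-up GitHub implementation (in particular it skips the canonical-b state reduction):

```python
from functools import lru_cache

def dp_decode(y, points, q):
    # y: dict mapping projective points (tuples over F_q) to observed message counts.
    # For each projective point x in `points`, compute sum of y[u] over all
    # projective points u with <u, x> = 0 (mod q), via the f(a, b, z) recurrence.
    t = len(points[0])

    @lru_cache(maxsize=None)
    def f(a, b, z):
        # a: prefix of the message u built so far; b: remaining suffix of x;
        # z: required value of <remaining suffix of u, b> (mod q).
        if not b:  # recursion bottoms out: u = a must be a nonzero projective point
            return y.get(a, 0) if z == 0 and any(a) else 0
        # While a is still all zeros, the next entry must be 0 or 1, so that
        # u's first nonzero entry ends up being a one (canonical form).
        ws = (0, 1) if not any(a) else tuple(range(q))
        return sum(f(a + (w,), b[1:], (z - b[0] * w) % q) for w in ws)

    return {x: f((), x, 0) for x in points}  # the quantity wanted: f(empty, x, 0)
```

Memoization shares the (a, b, z) subproblems across the different x queries, which is where the speedup over the naive double loop comes from.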
And then you can show that this gives you the trade-off you want, as a function of h. What next? Also, the point is, remember, the run time of projective geometry response grows with the universe size. So as you make h larger and larger, k over h becomes smaller and smaller, which means you're doing PGR over a smaller universe, which is why it ends up being faster. One thing that I kind of swept under the rug is, for the algorithm to work ideally, remember, we have to round t up. So we said, pick the smallest t you can such that q to the t is at least k, or rather such that q to the t minus one, over q minus one, is at least k. So t is roughly the ceiling of log base q of k. But that ceiling might round up to the next integer, which means effectively you're working over a projective space that's potentially a factor q larger than you really wanted, which means your run time is a factor q slower than you ideally wanted, and your communication has an extra log q bits that you maybe don't want. So can you get some kind of set system design that doesn't only work for such a sparse set of values? Can I get it so that there's always a set system with message space within a factor of two, let's say, of what I really wanted? And the other thing is sublinear time. This is related to something called the heavy hitters problem, which I'm not going to define, but the point is, we could hope for an algorithm that solves problems like this in sublinear time, not just linear. And in fact, we have such algorithms, but they don't have the optimal utility loss; remember this four times n times whatever, they don't get that. So can we get a sublinear time algorithm that actually has optimal utility loss? That would be interesting. Okay, so that's it for the research part of the talk, and then we can talk about math education stuff. Questions for this part? Yeah.
So just a quick question about the normalization process for projective space? Yeah. Like in R2, when you're constructing it, you just divide by the norm. Is the fact that you can just set the first non-zero value to one just a product of it being a prime field? We're not working over R2; we're working over a finite field. So is the fact that you can just set it to one to normalize, like, some artifact of it being a finite field? Even if we were working over the reals, I would just normalize it to be one; I wouldn't normalize by the norm. Like dividing by the norm to get onto the unit circle, right? Oh, I see what you mean. That's one way to do it. I mean, I could also normalize so that the first entry is a one, and that would put it on the L infinity ball. Okay. Right? Instead of the L2 ball, I could do the L infinity ball, and that would make the first entry, well, yeah. Okay, so it's like an L infinity norm, is what you're, Sure. Or something? Yeah. Well, but it's not even, I'm not dividing by the L infinity norm. Actually, what I'm saying is crap; no, that's not true. I'm just normalizing so the first non-zero entry is a one. That's what I'm doing. Forget about norms. Yeah, forget about norms. Yeah. So what is the state of the art in privacy-preserving heavy hitters? Yeah, I mean, I think the state of the art is probably, I had a paper in PODS 2018 with Uri Stemmer and Mark Bun, and in that paper we didn't care about constants at all. It was just about getting the right asymptotics. So for sublinear time algorithms for private heavy hitters, I don't think there's really been any progress on getting the right constants. Wait, do we have approximate, sorry? Do we have approximate heavy hitters? Yeah, we do have sublinear time heavy hitters algorithms. We do. In LDP? Yes, in LDP.
Like, for example, the paper I just mentioned: Mark Bun, me, Uri Stemmer, PODS 2018. But it doesn't get the right utility loss, I mean, in terms of constant factors. And also, it's not measuring mean squared error there; the error is measured via L infinity. Any other questions?