The title of my talk is Pinning Down "Privacy" in Statistical Databases, and as the title suggests, a lot of the talk is going to be about discussing the meaning of the word "privacy" — in quotes — and how we convert this sort of natural-language notion into something more mathematically rigorous. When I say privacy in this talk, I mean privacy in statistical databases: imagine a bunch of individuals who have sensitive, personal data of some kind, which is collected by some server or agency, and the goal of the server or agency is to make the benefits of this data available to as wide a public as possible. They'd like to make it available to users, they'd like users to be able to ask queries about this data, but they're concerned about the privacy of the information in that data set. Large collections of sensitive personal information are ubiquitous. Census data is the classic example, but there are lots of other good examples: clinical and public health data, data gathered from social networks — both the online kind and the traditional offline kind — recommendation systems, trace data from computer systems like search records and network usage, data collected by intrusion detection systems. Recently there's been a trend towards more and more types of data being collected, so these data sets are getting larger, a lot more varied, and a lot more valuable. They're valuable in the sense that mining this information offers all sorts of public benefits — by and large we consider it a social good that things like clinical data be made as available as possible — but there's obviously a flip side: we need to be concerned about the sensitive information contained in these data sets. So we basically have two conflicting goals. On one hand, we want to get as much utility out of these data sets as we can. On the other hand, we'd like information that's specific to individuals — whatever that means — to remain hidden. The theme of at least the first half of this tutorial, more like two thirds, is going to be: how can we state this question precisely, and what are the challenges involved in formalizing it? Variations on this model have been studied for a long time: in statistics under the name statistical disclosure control since the 60s and 70s, and in the data mining and database communities since the early 80s, where it's come to be called privacy-preserving data mining — although that term has a second, very distinct meaning in this community that I'll get back to in a couple of minutes. But roughly since 2002 there's been a push from several communities, although it started very much in the crypto community, to try and put this field on more rigorous foundations, and I'll be telling you a little bit about that. Okay, so what are the differences between privacy as I mean it in this tutorial and crypto? One of the challenges is that in privacy there are no bright lines. In crypto we're used to a situation that roughly follows the model of a psychiatrist with a patient: there's some sensitive information.
There's a group of people who are supposed to have unrestricted access to that information — namely the patient and the psychiatrist himself or herself — and then there's the rest of the world, and the rest of the world should have absolutely no access to that information; it should be completely hidden from them. What I'll try to explain is that with data privacy we don't get these bright lines, and we have to pick and choose what we're going to release. When we release certain types of information, we do so at the expense of other kinds of information, and moreover, if we wanted the kind of strong guarantees we're all used to in crypto, we wouldn't be able to release anything at all. So that's one big difference. In particular, it's very different from secure function evaluation, which is what's sometimes meant by privacy-preserving data mining if you hear that term in this community — in particular in the paper at CRYPTO 2000 that introduced the term. Secure function evaluation roughly focuses on how to securely distribute a computation we've already agreed on, whereas the question in data privacy is which computation we should actually be distributing, and I'll give you some examples later of how starkly those two questions can differ. Maybe the companion question to ask is: what can crypto contribute to this field of data privacy? I actually think crypto has a lot to say. As a community, we think a lot about modeling security and what it should mean, about different measures of information leakage and things like that, and we're used to thinking adversarially. I think a contribution our community can make to this area is more hacking and thinking about different kinds of attacks — both carrying them out and trying to establish a systematic understanding of them — and thinking about what the various security and privacy concerns are in various distributed models; that's something I won't really get into today, but I think it's a very interesting topic. What can crypto get out of it? Well, as I'll try to convince you, privacy has this aspect that we're forced to deal with non-negligible leakage of information, and so one of the interesting aspects is that it pushes us into coming up with a theory of how to think about these moderately secure settings. These are things that come up to some extent already in crypto, but I think this is a different twist on them, and I think some of the ideas being developed in this area will have applications to other, more traditional crypto areas, like anonymous communication systems and perhaps voting. Okay, so this is a tutorial, and you might hope it would be an overview of research on this broad question I've laid out. Unfortunately, data privacy research is extremely diverse. It involves researchers from lots of areas, which makes it fun; it involves tools from lots of areas, which also makes it fun and challenging. Unfortunately, that also means it's very hard to cover in a 75-minute tutorial with any kind of real coverage, so necessarily I had to pick and choose — and, yeah, deal.
The other thing I wanted to say is that there's been a lot of progress in this field since 2002 — we really are ten years ahead of where we were in 2002 — but the area is still immature, and that also makes it hard to give a tutorial, because it makes it hard to present the really coherent theory of everything that one would like to give. I hope you'll forgive me for not doing that, and I hope you'll view it as a challenge, and an interesting one, to participate in making the area a bit more mature. Okay, so this talk is more tutorial than survey. I'm going to try to go in depth into a couple of things rather than touching on everything, so there's a lot of stuff that's been left out. I will be discussing my own work a reasonable amount, but I won't only be discussing my own work — I'm sorry about that. And then just a disclaimer that my slides are somewhat sparse on references, because there were just too many to fit, so they shouldn't be taken as complete; if I've left your work off some slides, I apologize ahead of time. One last disclaimer: no Hitler cats, sorry. All right. So, an outline of this talk. In the first act we'll talk about attacks, and I'll get into the math of reconstruction attacks in particular, but really, in some sense, the purpose of this first section of the talk is to explain why the problem is hard. Act two: definitions. I'll go into depth about one particular approach to pinning down this ambiguous notion of privacy, called differential privacy, something I was involved in developing. I'll be talking about variations on the theme of differential privacy, some of the things that are good about it, and some of its flaws — or not flaws per se, but some of the things that one would hope to remedy about it. And then in the third part I'll be talking about algorithms — designing differentially private algorithms. I'll get through some basic techniques such as noise addition and exponential sampling, and then I've got some more advanced stuff that I may or may not have time to get to. All right. So what's the deal with attacks? One of the things that makes this problem hard to reason about, and quite different from the settings we're used to as cryptographers, is that the users who are getting access to the information themselves have access to lots of external sources of information. They can get information from the internet, from their own social networks, maybe from other anonymized data sets — there's all sorts of other information about these individuals out there besides what's coming out of the server or the agency. And we can't really assume we know what those sources are, but unfortunately, in a sense that we'll make precise, we can't ignore them either, the way we somehow can in traditional crypto, where with auxiliary information the definitions are so strong that we've found ways to not think too hard about it.
As a result of these complications, anonymization schemes — or things people thought were anonymous — are regularly broken. So I'll start with a warm-up: some examples of problems that have come up with anonymization systems, and then I'll talk in a bit more detail about reconstruction attacks. One example is the Netflix data release; many of you have probably heard about this. In 2006 Netflix — a movie rental company — released a data set for the purposes of a competition on recommendation systems. What they released were ratings for a subset of movies and users: their users, and the movies those people had watched. The user names were removed and replaced with random IDs, and there was some additional perturbation added, although that wasn't entirely specified. So for each of these users you got a random ID together with a list of movies they'd reviewed — or rather, a subset of the movies they'd reviewed — the scores they'd given, and an approximate date for each review. What Arvind Narayanan and Vitaly Shmatikov, at Texas, did was take this anonymized data and correlate it with an external source of information. In this case they looked at the public data available through IMDb, the movie ratings website, where people can create an account — normally with their real name — and post reviews of the movies they want everybody to know they watched, and how much they liked them; these are a subset of the movies they actually watched. What they were able to show is that it wasn't too difficult to re-identify the records in the Netflix release based on this very partial and incomplete information available on IMDb. In fact, on average it took about four movies to uniquely identify a user. As a result of this work and the publicity it created, the second round of the Netflix competition, which was meant to take place around 2010 or so, was postponed basically indefinitely. So that's one example, and one thing it illustrates is the perils of very high-dimensional data. This data set is very sparse — most people don't watch a large fraction of the Netflix movie corpus — so it's hard to condense it into some smaller, more succinct format, and that was part of the issue. Another type of issue that can come up — this is something from my own work — is what I'll call composition attacks, which should be a natural name for this crowd, especially given the session we just listened to. Here's an example: you've got two hospitals that serve overlapping populations, and you ask what happens if they independently anonymize and release data about the populations they serve. It turns out that you can combine these releases to learn a lot about the people in the intersection of the populations — this holds for various popular anonymization schemes, including k-anonymity and l-diversity and a few other things — and the idea is really not that complicated.
Roughly speaking, there's a large class of anonymization schemes that were around for a while, pre-2002 in particular, where the kind of information released works like this: if I'm a person in the intersection, what you'd learn about me from one hospital is that I visited either because I had diabetes or because of complaints about high blood pressure. What you might learn from the other hospital's release is that I visited for one of some other set of possible reasons. Under the reasonable hypothesis that the two visits happened around the same time, you can look at the intersection of these disease lists, and typically the intersection is a single item — you figure out what the problem is, you figure out why I was visiting these hospitals. There are lots of other attacks in the literature from various domains, and I'm not going to go through them, but one that bears mentioning is the paper of Homer et al. that appeared in the genetics community on genome-wide association data. That paper caused the NIH, the National Institutes of Health, to pull a whole bunch of data off of their websites, and it drastically changed the policies in the genetics community for sharing research data. Okay, so those are some attacks, and one might ask: what's going on, what's wrong here? Let's try to understand in a little more detail. So far, the examples I gave you all had to do with releasing individual information, and one might ask whether that's the problem — that the information being released is just too fine-grained — and whether, if we just release coarse-grained, global statistics about the data, we'll be fine. Actually, no, we won't, and there are a few reasons for that. One of them is, again, composition-type problems: if I tell you the average salary in a department before and after a given professor resigns, then you learn exactly how much that person was making at the time they resigned. Incidentally, this illustrates one of the differences I mentioned — why secure function evaluation is addressing a different problem: secure function evaluation would tell you how to release the average, but it doesn't tell you whether or not releasing the average is a good idea. Also, seemingly global results, even in a one-shot version, can reveal specific values. There's a popular classification algorithm called the support vector machine. It produces a linear classifier — a rule for telling apart two different kinds of points in a data set; roughly, it's just a line in space, or a hyperplane. It's derived through a relatively complicated algorithm, but the description of the final output is always given in terms of a small number of exact input points from the data set. So even though it's the result of very global processing, the final output reveals a bunch of data points in the clear. And maybe a more subtle problem is that global statistics, taken jointly, may together encode a lot of information about the data.
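Before getting to those joint-statistics attacks, here's the two-hospital composition attack from a moment ago in miniature — a toy sketch in which the candidate diagnosis sets are entirely made up; real k-anonymity or l-diversity releases are tables rather than sets, but the principle is exactly this.

```python
# Toy sketch of the two-hospital composition attack described above.
# Each "anonymized" release, on its own, only narrows the target's
# diagnosis down to a small candidate set; intersecting the two sets
# typically pins it down exactly. The sets below are hypothetical.

# What hospital A's release reveals about the target's record:
candidates_from_A = {"diabetes", "hypertension", "asthma"}

# What hospital B's independent release reveals about the same person:
candidates_from_B = {"hypertension", "fractured wrist", "migraine"}

# Each release alone looks safe (several possible diagnoses),
# but together they usually leave a single possibility.
print(candidates_from_A & candidates_from_B)   # {'hypertension'}
```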
The attacks that take advantage of this — of many accurate statistics taken jointly — are often called reconstruction attacks, because they reconstruct some piece of the data based on a bunch of released information. They all have the same broad flavor: if you release too many, too accurate statistics, then you can reconstruct either the entire data set or some very particular part of it, and these reconstruction attacks are robust even to fairly significant noise. I'll tell you a little bit about them. As I'll present them, these attacks came up in the work of Dinur and Nissim in 2003. Let's make things a bit more concrete. Suppose we've got some data set x, and we want to release a function f of x, or some approximation to it. Essentially, a reconstruction attack takes this approximation to f of x and reconstructs some x-hat that's ideally close to x. To make it concrete, imagine we've got n users, each with a secret bit, and the queries we're going to allow are subset queries: for a given subset of the data, the query asks for the sum of the bits in that subset, divided by n for convenience. So I can think of this as an inner product of the characteristic vector of my set S with the vector of secret bits, divided by n so it's normalized between zero and one. The question is: what collections of subset queries allow me to reconstruct significant parts of the data? So I might ask how many queries I can make before I start running into trouble; how close my reconstruction is to the original data set — I'll use Hamming distance to measure that; how much noise the attack tolerates — if I've got a certain amount of distortion in my answers, how well do I do in terms of reconstructing the data; and how much time the attack takes. There's a lot of work on this; I'll give you two data points. The first is from the original paper of Dinur and Nissim. They give an attack that uses 2-to-the-n queries — all possible subset queries — and it shows that if the query answers are accurate to within alpha, then the reconstructed data set is correct on all but roughly an alpha fraction of the points. So if alpha is o(1), if alpha is very small, then I'm reconstructing the data set almost exactly. And the algorithm is very simple. The idea is that if I've got some candidate data set y that I think might be the real data set, I can write the Hamming distance between y and x in terms of two subset queries, S1 and S0, where S1 is the set of positions in which y is one and S0 is the set of positions in which y is zero. Given approximations to these subset queries, I can get an estimate for this Hamming distance using the answers I have for f of S1 and f of S0, and I just output the candidate data set that minimizes this estimated distance. It's a very simple attack. It takes a long time — you're explicitly searching over the entire space — but at least it tells you that, modulo running-time considerations, you can't release everything at high accuracy, or even at any non-trivial accuracy, without essentially broadcasting your data.
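Here's a minimal sketch of that brute-force attack. The parameters are made up, and uniformly bounded noise stands in for whatever distortion the release mechanism adds; it is exponential-time, exactly as described, so n is kept tiny.

```python
import itertools, random

# Toy Dinur-Nissim-style brute-force reconstruction: given answers to all
# subset-sum queries, each within additive error alpha, exhaustive search
# recovers a candidate agreeing with the secret bits on almost all positions.

n, alpha = 8, 0.05
secret = [random.randint(0, 1) for _ in range(n)]

def true_answer(subset):               # f_S(x) = (1/n) * sum_{i in S} x_i
    return sum(secret[i] for i in subset) / n

# The "release": every subset query, answered to within additive error alpha.
subsets = [frozenset(s) for r in range(n + 1)
           for s in itertools.combinations(range(n), r)]
released = {S: true_answer(S) + random.uniform(-alpha, alpha) for S in subsets}

def estimated_disagreement(y):
    # Hamming(y, x)/n in terms of two subset queries:
    # (fraction with y=1, x=0) + (fraction with y=0, x=1).
    S1 = frozenset(i for i in range(n) if y[i] == 1)
    S0 = frozenset(i for i in range(n) if y[i] == 0)
    return (len(S1) / n - released[S1]) + released[S0]

best = min(itertools.product([0, 1], repeat=n), key=estimated_disagreement)
print(secret, list(best))              # typically identical, or off in a few bits
```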
There are various refinements of that result; let me give you another data point. In one of them, instead of using 2-to-the-n queries I use only n queries, and the error to which I reconstruct is now higher by a factor of square root of n than before, but the running time is significantly better: it went from exponential in n down to basically linear. So this attack is meaningful when the noise is little-o of one over square root of n — that's when it's basically reconstructing the entire data set — and one can show that's optimal. I can explain the attack to you; it's very elegant. This is a version from a paper of Dwork and Yekhanin, at CRYPTO 2008 actually. Again I want to construct the set of subset queries to ask, and what I'll do is choose my queries according to the rows of a Hadamard matrix. This is a plus-one, minus-one matrix — the entries are plus one and minus one — defined recursively by this pair of formulas. A Hadamard matrix has the nice property that any two rows differ in exactly n over two positions, and because the entries are plus or minus one, that means the inner product of any two distinct rows is exactly zero. So the matrix is orthogonal, and that tells you that all its eigenvalues are exactly plus or minus square root of n; we know a lot about this matrix. Now, it turns out that using n subset queries, roughly one per row, I can reconstruct something like the matrix product of this Hadamard matrix with the secret data x. There's a little mismatch here — I'm talking about plus-one, minus-one matrices and I said my queries were zero-one — but it turns out that's easy to deal with. What my error guarantee tells me is that I'm getting this matrix product plus an error vector e, and I know something about the L-infinity norm of e: it's at most two alpha in this case. Now what I can do is compute an estimate for x — I'll call it x-hat-prime, because afterwards I'm going to have to round it. For x-hat-prime I just invert: I take the inverse of the Hadamard matrix, multiply by n to take care of that one-over-n factor, and multiply whatever I reconstructed — this vector z — by the inverse. Since H_n is invertible, what I get back is x plus some other error vector e-prime, which is just e multiplied by this inverse of the Hadamard matrix. And because I've controlled the eigenvalues of the Hadamard matrix, I know exactly how much the length of e can be blown up. That allows me to analyze the rounding procedure in a very simple way and figure out that if the error is below one over square root of n, then again x-hat will agree with x in almost all positions. These attacks can be extended considerably: they can handle some very distorted answers, and they can exploit sparsity in the error vector. I won't tell you too much about that; maybe I'll just mention that these results draw on a deep and extensive theory of reconstruction from the signal-processing world, and there are really nice connections to things like compressed sensing, if those are things you've heard about — it's a very nice area to play around in.
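Here's a toy version of that Hadamard-based attack, again with illustrative parameters. For simplicity the queries below use the ±1 rows directly rather than converting them to 0/1 subset queries — that's the easy-to-handle mismatch mentioned above.

```python
import numpy as np

# Toy Hadamard-based reconstruction in the spirit of Dwork-Yekhanin:
# n noisy query answers, recovery by one matrix multiplication, so the
# attack runs in matrix-arithmetic time rather than exhaustive-search time.

def hadamard(n):                          # Sylvester construction, n a power of 2
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

n = 64
alpha = 0.1 / np.sqrt(n)                  # attack needs noise well below 1/sqrt(n)
x = np.random.randint(0, 2, size=n).astype(float)
H = hadamard(n)

# The "release": one (+/-1-weighted) query per Hadamard row, each answer
# within additive error 2*alpha of the entry of (1/n) * H @ x.
z = (H @ x) / n + np.random.uniform(-2 * alpha, 2 * alpha, size=n)

# Invert: this H is symmetric with H @ H = n * I, so multiplying by H
# undoes both the matrix and the 1/n factor.
x_hat = H @ z                             # = x + (bounded) error vector
x_rounded = (x_hat > 0.5).astype(float)

print("fraction of bits recovered:", np.mean(x_rounded == x))
```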
So far, the queries in the examples I've given you are kind of unnatural, in the sense that they're algebraically defined, which is odd — although it turns out you could also use uniformly random queries. The other funny thing about the queries I've described is that they somehow require you to be able to name rows — to say, give me the bits of this specific set of people — and it may not be a priori obvious why that's possible. So you could ask: could you pull this off with natural, very symmetric queries that don't have this feature of naming people — the kinds of things people actually release about, say, government data sets or clinical data? And the answer is yes. It turns out the same ideas can be exploited to get reconstruction based on releases of marginal tables, for example, or regression analyses, decision-tree classifiers, and a bunch of other things. Basically, once you start releasing too much information that's too accurate — of almost any kind — about your data, you're going to run into trouble. I won't explain the details, since I want to move on to other things, but let me summarize what I wanted to say about attacks. So far, in the literature, there are a lot of ad hoc examples of attacks. There are some general principles starting to emerge — things like the composition properties we'd like but can't have — but not too much. There are some very sophisticated attacks out there; these reconstruction attacks are examples, and they draw on the theory of coding and signal processing. There are sophisticated lower bounds for various classes of release mechanisms, and some interesting ideas from crypto, like lower bounds based on the hardness of forging various signature schemes, that are also quite sophisticated. But I would say that, by and large, we still don't have anything like a systematic understanding of what kinds of attacks are possible — there's no clean taxonomy or anything like that, and there's certainly nothing close to a suite of standard attack techniques the way we have for, say, the design of standard primitives. I think this is something the crypto community could contribute. So, transitioning to the next part of the talk: what have we learned? Even if we're releasing only aggregate statistics, we can't release everything, and when we do release some information, we're releasing it at the expense of other kinds of information. This kind of inherent tradeoff is very different from crypto as usual: we're used to nice hybrid arguments that say, well, if each of my blocks has negligible leakage, then I can run them a bunch of times and the leakage doesn't add up too much. Here it does. Even a single aggregate statistic can be hard to reason about — the support vector machine example I gave illustrates that — and it really begs the question of what exactly "aggregate" means. That's where we're going with the definitions: to try to understand that question. So, having shown you some things that can go wrong, it would be nice to understand what it would mean for things to go right, and that's going to be the second third of the talk. Maybe it's a good time to take a break and stretch, at least for me. I should say, this is a tutorial in a rather formal setting, and I realize it's somewhat hard to ask questions when it's you and 200 of your closest friends, but feel free to ask questions if you'd like.
Okay, so let me tell you about definitions. What I'd like to do is start off with one approach to defining privacy in this context, called differential privacy, and then I'll talk about some of the good things and bad things about it, and about other approaches that are out there. The high-level idea of differential privacy is that we're going to try to define aggregate information as information that's stable to small changes in the input. It's a definition that handles arbitrary external information, in a sense that I'll define precisely, and it's a burgeoning field of research — there's really a lot of work going on there now, and I'll tell you about some of it in the third part. The intuition is that I'm going to try to define privacy in the following way: changes to my data — meaning a particular person's data — should not be noticeable to users on the outside of the system. What we'd really like to enforce is that, in some sense, the output is independent of any individual's data, although that's a more problematic interpretation. So formally, here's the picture. We've got a data set x1 through xn. Each entry comes from some domain, and this domain could be arbitrarily complicated — it could be numbers, it could be categories, it could be your tax forms, it could be some complicated high-dimensional summary of your data. We're going to think of x as fixed, not random, but there is randomness involved: I'm going to think of the algorithm A as a randomized procedure, so in particular, for every fixed x, A of x is a random variable. That randomness might come from adding noise, from re-sampling, or things like that. What we'd like to do is compare the behavior of this algorithm on an input x to its behavior on a neighboring data set, where a neighbor is a data set that differs from x in one data point. There are actually several ways to define exactly what differing in one data point means, and I'm not going to get into the distinctions, but for today just imagine that I take, say, person two's data and replace it with some other value. The intuition behind the definition is that neighboring data sets should induce similar distributions on outputs — this is very much inspired by indistinguishability-type definitions in crypto. Specifically, we'll say that A is epsilon-differentially private if for every pair of neighbors x and x prime, and for all subsets of possible outputs — so for all events in the probability space — the probability that A of x lies in that set of outputs is approximately the same as the probability that A of x prime lies in it, up to a multiplicative factor of e to the epsilon; think of that as one plus epsilon if epsilon is reasonably small. By the way, if I take the smallest epsilon for which this holds, that actually defines a metric on probability distributions, so this is really saying that neighboring data sets induce close distributions on outputs. One thing to notice is that the condition is on the algorithm itself — it doesn't talk about a particular output, a particular table, being safe to release. That's very natural, I think, in our community, but it was actually sort of a big deal at the time for this area; a lot of previous definitions had that per-output flavor.
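For the record, here is the condition written out in symbols; this is just the definition stated above, nothing extra.

```latex
\textbf{Definition.}\; A randomized algorithm $A$ is $\varepsilon$-differentially private if,
for every pair of neighboring data sets $x, x'$ and every set $S$ of possible outputs,
\[
  \Pr[A(x) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[A(x') \in S].
\]
```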
One thing I will say is that the exact choice of distance measure matters — it matters a lot in terms of how meaningful the definition is — and I'll talk more about that in a minute. Then the obvious question is: what is this epsilon? It's some measure of information leakage, and unfortunately it's not too small — think maybe 1 in 10 or 1 in 100 or 1 in 300, but not 1 in 2 to the 50 or anything like that. I'll talk more about this in a few minutes; first let me go through an example of something that satisfies this definition. Suppose I've got some function f that I'd like to release, and think of it as a vector-valued function that returns a vector of p real numbers. What I'd like to do is add some kind of noise to f and then release that noisy estimate. For example, I might have a clinical data set and be interested in releasing a proportion: for each person I have a bit, and the f of x I want is just 1 over n times the sum of those bits. A simple approach is to add noise; the question is how much. The intuition is that we should be able to satisfy this type of indistinguishability definition as long as f itself is insensitive to changes in the entries of x, and that leads to the following definition: the global sensitivity of f is the maximum change from f of x to f of x prime over neighboring data sets x and x prime. Pictorially, we've got a little ball of radius one around x, and I'm asking how big the corresponding ball around f of x is. For the proportion, for example, the change is at most 1 over n. What we prove is that if you add noise proportional to this global sensitivity of f to each of the entries — remember, f can be a vector — then the algorithm is epsilon-differentially private, as long as the noise scales like the global sensitivity divided by epsilon and is distributed according to this particular Laplace distribution, which is kind of like a Gaussian distribution but a little pointier: its log-density goes down linearly with the magnitude of the argument instead of quadratically. It's a distribution whose standard deviation is roughly that scale parameter. So you're adding noise proportional to this quantity, and the reason it's differentially private is that if I consider two data sets x and x prime, I'm comparing two Laplace distributions centered at different values, f of x and f of x prime — essentially comparing a translated version of this density function with itself — and the Laplace distribution has the property that when you shift it by some amount delta, the probability of any given event changes by a factor of at most e to the delta over the noise scale; since delta is at most the global sensitivity and the scale is global sensitivity over epsilon, that factor is at most e to the epsilon, so it matches the definition we wanted directly. So, for example, for the proportion of diabetics, we're releasing the proportion plus or minus roughly one over epsilon n — a random variable whose magnitude is about that big. You might ask: is that a lot of noise or a little noise? Well, it does go to zero with n, and that's a good thing.
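Here's what that looks like in code — a minimal sketch of the Laplace mechanism for the proportion example, with made-up data and an illustrative epsilon.

```python
import numpy as np

# Minimal sketch of the Laplace mechanism described above, for releasing
# the proportion of records with a sensitive bit set. The global
# sensitivity of a proportion over n records is 1/n, so Laplace noise
# with scale (1/n)/epsilon gives epsilon-differential privacy.

def laplace_mechanism(value, sensitivity, epsilon):
    return value + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

bits = np.random.randint(0, 2, size=10_000)     # one sensitive bit per person
n, epsilon = len(bits), 0.1

true_proportion = bits.mean()                   # f(x), global sensitivity 1/n
released = laplace_mechanism(true_proportion, sensitivity=1.0 / n, epsilon=epsilon)

print(true_proportion, released)                # differ by roughly 1/(epsilon * n)
```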
More precisely, if x is a random sample from some large underlying population, then the sampling noise alone — the fact that I'm not looking at the actual population but at a sample from it — is going to be on the order of one over the square root of n. So the error I get just from basic statistical considerations is on that order, and the noise I'm adding for privacy is considerably less than that for large n — it goes to zero much more quickly. What that means is that this quantity A of x is essentially as good as the real proportion, at least for statistical inference: inasmuch as you want to learn about some underlying population, this amount of noise is not getting in the way. That's a crude way of calibrating whether we're distorting things too much or too little. Okay, so it turns out that this — what's come to be called the Laplace mechanism, because you're adding Laplace noise — has been very useful, and that's because a lot of natural functions have low global sensitivity in this sense. For example, the histogram of a data set — I'll come back to that — has constant global sensitivity, and so do the sample mean, the covariance matrix, things like the distance of the data to a given property (how many points would I have to change for the data set to satisfy that property), and various things people do in statistics and convex optimization. So a lot of things drop naturally into this framework. Moreover, this mechanism can be thought of as a programming interface: I can imagine just asking for various functions — making different queries, where each query describes a function, and what I get back is a noisy answer to the query. And indeed, a lot of algorithms that don't themselves have low global sensitivity, if you think of them as talking interactively to the data, ask questions of the data that do have low global sensitivity. A good example — there are a number of papers on this — is the k-means clustering algorithm. I'm not going to go into it because it would take too much time, but the k-means clustering algorithm has the property that all of the individual updates made by this iterative algorithm have very low global sensitivity, so you can add noise inside the algorithm instead of to the final output, and that actually works pretty well. This idea was implemented in several systems, first by McSherry and later by a few other people, and I'll come back to those and talk more about them at the end of the talk. Okay, so — this is more for the TCC crowd, but let's hack away at the definition a bit. What is this definition really telling us? First, some notes. As I said earlier, epsilon can't be too small, and it's actually easy to see why: if I consider two data sets that are as far apart as possible — Hamming distance n — then just by the triangle inequality, if my algorithm satisfies this definition, those two data sets induce distributions at distance roughly n times epsilon. So if epsilon is small, the two data sets induce very close distributions, and if epsilon is cryptographically small, they're so close that you can't tell them apart, which is totally useless — these are exactly the kinds of things we want to be able to tell apart. The other question is: why this funny distance measure? Why can't we use something more traditional — variation distance, KL divergence, something like that, maybe mutual information?
There are good reasons for that. Let me consider, as a straw man, a mechanism that takes my data — say the US Census corpus — grabs one person at random, and publishes that person's data in its entirety. It turns out that this mechanism satisfies the following property: for any two data sets that differ in one person's data, the statistical difference between the induced distributions is exactly one over n. That's because the chance that you're the unlucky person is only one over n — but of course, with probability one, somebody gets shafted. So we need a distance measure that's a little more worst-case, that's more sensitive to this type of problem, and that's why we settled on this definition in the original paper. Some other nice things about it: it satisfies the natural composition lemma — if A1 and A2 are each epsilon-differentially private, then the joint output (A1, A2) is itself 2-epsilon-differentially private — so you get this kind of graceful degradation as you compose systems, just as you'd expect from an indistinguishability-based definition. The difference here, of course, is that 2 epsilon is really much bigger than epsilon, because epsilon isn't that small, so these things accumulate quickly. The other thing that's nice about this definition — before I start telling you all the things that aren't nice about it — is that it remains meaningful in the presence of arbitrary external information. Let me explain what I mean by that, because it's a claim that's often misunderstood. In the line of work on interpreting the definition, it would be nice to have something along the lines of what we have with semantic security for encryption, where we could say: if you're the adversary looking at the outputs of the system, your beliefs about me are the same after you see the output as they were before, so you've really learned nothing about me. Unfortunately, that's very hard to achieve. Consider the following example: suppose that, as side information about me, you find out that I smoke, and suppose you then read the anonymized results of a public health study that teach you there's a strong link between smoking and certain kinds of lung cancer. You could then conclude that I myself am at risk for cancer — in particular, that I'm a bad insurance risk. So in fact you've learned a lot about me: if you lived on a different planet and didn't know that smoking was problematic, and then you read this public health study, you'd learn a lot about me. And in fact that's exactly what we want — that's exactly what these studies are supposed to do: they're supposed to teach us high-level connections between different attributes of people. Notice that what you learned about me didn't depend on whether or not my data was used in this public health study — you learned something about me even though I may have gone nowhere near the person doing the survey. This point was made first by Dwork and Naor and reformulated by Kifer and Machanavajjhala recently, and it's not hard to argue that learning things about individuals is unavoidable in the presence of interesting external information. There are a couple of theorems out there proving this, but actually the smoking-and-cancer example pretty much tells you the whole story.
Okay, so what can we say? We can't get a guarantee like that. What you can say about differential privacy is that no matter what you know ahead of time, you'll learn almost the same things about me whether or not my data is used in the calculation of the statistics. So it's not that you won't learn about me — you should learn things about me — but what you learn shouldn't depend on the use of my data. You can formulate this in a nice Bayesian way, in terms of how an adversary updates a prior distribution about me, say, but at a high level I think the English sentence conveys it well. All right, so these are the reasons I like the definition — some of the reasons the definition has caught on and become, I think, a focus of a lot of study. What I want to do now is tell you, in a couple of minutes, some of the caveats that come with the definition, which I think highlight again why privacy is a tricky area where we don't necessarily have the silver bullets we have in traditional crypto. Yes — Ben, you had a question? That's right — we'll get there; let me put that on ice for about three bullets. Okay. So one of the interesting things about the definition is that it provides a formalization — maybe not the only one — of a line between global and individual information. But what happens when the global information is itself the problem? In the smoking-and-cancer situation, if I'm denied health insurance because some pesky doctor went out and did a study on smoking and cancer, I may not be happy. A more serious example: if I've got a data set of financial transactions, differential privacy might provide privacy for individual investors' actions, but it might not hide, say, trading strategies used at the level of a large financial institution, because that's a global piece of information that concerns lots of people. And there are lots of settings — for example in the context of social networks, or data about relationships — where my presence in the data actually affects everybody else. This gets to Ben's question: if you've got data about a social network, it's very difficult to pin down which part of that information is just about me. As a trivial example, consider what I'll call the annoying-colleague example. Consider the hypothetical situation that I've got a colleague who, every time this person shows up for a faculty meeting, everybody leaves annoyed — I don't actually have such a colleague, I'm happy to say. Suppose I were to survey people before and after this faculty meeting. I could easily figure out, even if the results were protected by differential privacy, whether or not that colleague was at the faculty meeting, just by virtue of everybody being pissed off. So when the data is about people who react to one another, it becomes much harder to say what's my data and what's your data. That was nicely pointed out in a paper of Kifer and Machanavajjhala last year, and in the same paper they had another example, looking at what happens if I've got exact, deterministic information about the data set that somehow comes from another source — how can that be used? For example — this is a simplistic example — suppose I'm trying
to release the populations of all 50 states, and what somehow gets released through a side channel, for whatever reason, are the exact differences between the populations of those states. It turns out there is a differentially private release which, together with this side information, would allow you to figure out the exact populations of all 50 states with very high confidence — even though, without the side information, it's hard to figure that out from a differentially private release, and even though, with the side information, you could figure it out whether or not my data was used. You've got this very brittle kind of side channel. Another type of issue that comes up is that leakage accumulates: these epsilons add up over lots of releases, and they're really inevitable in some form or other. So the obvious question is how we set epsilon, and I'm not going to say anything about that. This type of issue is inherent in the problem, but it's not clear that the particular tradeoff — the particular formalization provided by differential privacy — is the right one. Okay, so there are a lot of variations on this idea out there, and I'll just mention a few of them as food for thought. First, there are some predecessors to this definition — I've been bad about giving citations so far, so I'm not going to change that now. There's something called (epsilon, delta)-differential privacy, which is a relaxation where I allow an additive delta; it's got very similar semantics as long as the delta is very small, which it normally is. There are various variants that incorporate computational considerations from the adversary's point of view, like bounded running time for the adversary, in various ways. There are distributional variants, where I try to assume something specific about the adversary's prior knowledge — maybe a prior distribution the adversary has — and exploit that, perhaps to get a deterministic release rather than something randomized. There are a couple of papers on this; there was a paper at Asiacrypt last year in particular that made a lot of headway in formalizing it. One thing that comes with these deterministic releases, though, is that they almost necessarily have poor composition guarantees, so it's tricky, and there are a number of papers on that. There are various generalizations of this definition that are simulation-based: one from TCC last year called zero-knowledge privacy, the other called distributional privacy — sort of confusing — and I'm not going to talk more about the zero-knowledge privacy paper here because it's going to come up in a talk tomorrow, or I think it's tomorrow or Thursday. But it's out there; it's a strict strengthening of differential privacy. Then there's something called Pufferfish, which came out this year; it's a vast generalization of differential privacy, where you've got lots of knobs you can turn, and differential privacy is one of the things you can get out of it. It's an interesting paper; the problem with vast generality is that it then becomes very tricky to instantiate — you have to make all these choices, and it gets complicated fast. And then there's another work, on crowd-blending privacy, that I'm not going to talk about because, again, there's a talk on that tomorrow. All right — so what did I want to say about these? There are a lot of variations out there, and I think
perhaps the first one is less interesting here just because it's very similar to differential privacy — it's not that it's not interesting, but it doesn't really change the semantics. The computational variants change things in the distributed setting, but not so much in the single-server setting. The distributional variants are perhaps the place where there are still a lot of insights to be had: we really have a very poor idea of the right way to incorporate precise assumptions about adversarial knowledge, and I think there are interesting open questions there about the right way to do that. So we've got this definition: privacy we'll now equate with the requirement that changes in one input should lead to small changes in the overall output distribution, and the question becomes: what computational tasks can we achieve privately? There's been a lot of work on this, and we've developed a lot of tools for reasoning about differential privacy that I hope will be valuable more broadly in this space. It's just a lot of interesting work — impossible to survey — but to give you an idea of the communities in which it happens: there's the STOC/FOCS/SODA crowd, the database and data-mining people, the USENIX and security crowd, the learning people, us, the networking folks, the statisticians. It's an interesting area to work in. Okay, so what I'd like to do with my remaining time, which isn't as much as I'd hoped, is tell you more about how to design differentially private algorithms, and some of the cool results. So again, let's just take a second to stretch — if you want to stand up and stretch you can do so — because it'll become a lot more technical fast. As one person put it, I will now switch modes and start pressing the clicker button a lot faster. Okay. So what we're after is an understanding of what we know about designing differentially private algorithms. We have some basic tools and techniques: I mentioned the Laplace mechanism; there's something called the exponential mechanism that I will describe; there are various algorithms for answering a vast set of queries simultaneously or in sequence, where some basic techniques have been developed; there's a set of techniques that revolve around the idea of local sensitivity, which I might get to; there's a lot of work on basic theoretical foundations — feasibility results about learning, optimization, synthetic data, connections to game theory, learning, robustness; there are a lot of domain-specific algorithms for, say, network data or clinical data; and there are a number of systems that have been implemented, which I'll talk about right at the end. Okay, so I want to talk about tools and techniques, and my goal here is to give you enough information that if you go and try to read papers on the topic, it will be easier. Technique number one is noise addition, and by noise addition I mean the simplest variant, which is the Laplace mechanism — that's what I talked about earlier, and these are just the same slides as before, to remind you. To give you an example of where this can be useful, suppose I've got a histogram: my data points lie in some domain — I don't know what the domain is, but I've somehow partitioned it into a bunch of disjoint bins, maybe the 50 states. Then I can consider the following function, which counts how many data points
live in each of the bins. Now, if I add or remove one person from the data set, then at most one of these numbers changes, so the global sensitivity of this function is one, independent of the dimensionality. So to get differential privacy it's sufficient to add noise on the order of one over epsilon — essentially a constant — to each of these counts, and you get something differentially private, again regardless of the dimensionality. So you get something that scales very nicely to very high-dimensional outputs, and it's going to be very useful as long as the bins aren't too sparsely populated — if there are a lot of zeros and ones among these counts, then you're kind of obliterating that information. An example of this is a histogram on the line; there are some nice results on the convergence of density estimators based on this idea. The populations of the 50 states, which I just mentioned, are another example, and the marginal tables I mentioned briefly earlier are an example too — there are various algorithms for releasing marginal tables based on this that are fairly accurate, but I'm not going to tell you the details. There are variants of this noise-addition idea in other metrics: I defined global sensitivity — maybe without making it precise — in terms of L1 distance, and it turns out you can generalize this in various ways. For example, you can use L2 to measure the distance between the function's outputs, and there are cases where that makes a huge difference. Basically, the theorems get a bit more complicated — you have to add normal, Gaussian noise, the parameters get messier, but they still look like global sensitivity over epsilon plus some other stuff — and the guarantee changes: it's now (epsilon, delta)-differential privacy instead of (epsilon, 0). For example, if I've got a bunch of predicates — say a large set of d subset queries — then I can add noise proportional to d to each of these questions and get something differentially private, and that actually matches the reconstruction-attack lower bounds that I mentioned earlier, for some settings of parameters. Okay, so that's one basic technique, the Laplace mechanism, and it comes up a lot. The other very basic one is something called exponential sampling, or the exponential mechanism, which came from a paper of McSherry and Talwar. Their motivation was really that sometimes noise addition just doesn't make sense. For example, suppose what I'm trying to release is the mode of a distribution on a discrete data set: I want to know whether more people are voting Republican or Democrat, and I can't add noise to "Republican" — it doesn't make sense. Or I could consider something complicated like the minimum cut in a graph — what does it mean to add noise to the description of a minimum cut? Or some classification rule given by a decision tree or something like that; again, it's hard to add noise to that. So their motivation was this broad class of problems. Specifically, they were looking at auction-design problems, because differential privacy has this lovely property that it implies approximate truthfulness — if an auction is differentially private, it is in some sense an approximately truthful auction — and that has generated a line of work connecting differential privacy and game theory that is the subject of a whole other tutorial. But this idea, first applied to auction design,
subsequently came to be applied very broadly. Let me give you an example — I'll call it a voting example, though it's not really voting. Suppose I've got the following data: a bunch of students at UCSB, and the websites each of them visited today, and I'm curious to know which website was the most visited today. So the data lies in some range, which is the set of possible website names, and for each person I've got a set of websites. For each possible website, let me compute a score, which is just the number of students who visited that website. My goal is to output the most frequently visited site, or at least some approximation to it. The mechanism is the following: given my data set, I'm going to sample the name of a website from a funny probability distribution that depends on x, and that distribution assigns to y a probability proportional to e to the epsilon times its score. So here the green is the score, and the blue is this exponential version of it. Why is this a reasonable thing to do? Well, popular sites are exponentially more likely to come out than unpopular ones, so hopefully I'll get something good. On the other hand, why does it have a hope of satisfying privacy? Well, one person's data changes each website's score by at most one, so the individual probabilities in this distribution won't go up or down by too much. More formally, the mechanism as stated is 2-epsilon-differentially private, and that's because if I consider the ratio of the probability of y under x to the probability of y under x prime, there are really two ratios — the thing I want to be proportional to, and the constant of proportionality, the normalization factor — and each of those kicks in a factor of at most e to the epsilon, so I get e to the 2 epsilon. Now for utility, the claim is that if the most popular website has score t, then in expectation I'm going to get something whose score is almost t: it's going to be t minus roughly the log of the number of websites over epsilon — so if we're talking a couple of thousand websites, that's not that big — and remember, t here is the number of students who visited the most popular website, presumably a large number. The proof is just a line, but I'm not going to do it. So that's voting as one example, but it turns out to be a special case of something much more general. I could have a set of possible outputs y with some prior distribution p of y; I just need a score function q such that for neighboring data sets x and x prime, the difference in scores is at most one — really it just has to be bounded, because I can always renormalize the scores. Then I output an element of my set of possible outputs with probability proportional to the prior times this factor of e to the epsilon q — sorry, there's an extra minus sign on the slide that shouldn't be there — so that things with high score come out more often.
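Here's that most-visited-website version of the exponential mechanism in code — a minimal sketch with made-up data and an illustrative epsilon; the score of a site is how many people visited it, and since one person changes every score by at most one, this is 2-epsilon-differentially private, as discussed above.

```python
import math, random
from collections import Counter

def exponential_mechanism(scores, epsilon):
    # Sample y with probability proportional to exp(epsilon * score(y)).
    # Subtract the max score before exponentiating, purely for numerical
    # stability; it cancels in the normalization.
    m = max(scores.values())
    weights = {y: math.exp(epsilon * (s - m)) for y, s in scores.items()}
    r = random.uniform(0, sum(weights.values()))
    for y, w in weights.items():
        r -= w
        if r <= 0:
            return y
    return y  # fallback for floating-point edge cases

visits = [{"news.example", "search.example"},
          {"search.example"},
          {"search.example", "video.example"},
          {"video.example", "search.example"}]
scores = Counter(site for person in visits for site in person)

print(exponential_mechanism(scores, epsilon=0.5))   # usually "search.example"
```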
So what's an example? Let me give you another one. Voting is one example, but it turns out to be a special case of something much more general. I could have a set of possible outputs Y with some prior distribution p on Y; all I need is a score function such that for neighboring data sets x and x prime, the difference in scores is at most one — really, it just has to be bounded, because I can always rescale the scores. Then I output an element from my set of possible outputs with probability that scales as the prior times this factor of e to the epsilon q (sorry, there's an extra minus sign on the slide; there shouldn't be), so that things with high score come out more often. Now, for example, suppose I'm trying to learn a classification rule for a data set. I might have a set of possible classifiers that I'm considering; those could be decision trees, or settings of the parameters of some linear classifier. For convenience I'm going to work with discrete objects, so think of it as maybe a set of discretized half-planes. Then my score could simply be the error rate of this classifier on the data — how well does it do at distinguishing good credit risks from bad credit risks, or whatever the data set happens to be — and I add a minus sign because I want to select for things with high score, not low score. Now it turns out, by the same analysis I didn't give you on the previous slide, that the output is a classifier whose expected error rate is the error rate of the optimal classifier plus something that scales as the log of the number of classifiers divided by epsilon n. So as long as the number of classifiers grows more slowly than e to the n, I'm in good shape: I get something with vanishing additional error. A corollary of this, which came from a FOCS 2008 paper, is that every PAC-learnable concept class — if you know what PAC learning is — is privately PAC-learnable.

Okay, so the mechanism above is extremely general; in fact, it's completely general: every differentially private algorithm is an instance of that mechanism. That makes it sort of threateningly general — almost too general to be useful. It turns out, though, that it's still a really useful way to think about designing algorithms, and that perspective has been used explicitly to design algorithms in a bunch of different contexts. So we've got these two basic techniques in our toolbox: noise addition and exponential sampling. If you remember nothing else from this talk other than those two techniques, you'll actually have a pretty good basis for reading the papers on this topic. I'd like to give you an application, releasing a whole list of functions, that nicely combines these two techniques together with some other ideas I'll explain; this is the last thing I'll get to. We're going to consider something called linear queries. These are related to subset queries, but not quite the same, so I'm going to redefine everything just to be totally clear that we're all on the same page. I've got some data set X which is a multiset over some domain D — different people can have the same value, that's okay — and I'm going to think of X as a vector in R to the D which has been normalized so that the entries sum to one. So I'm actually going to think of my data as a probability distribution over the domain: the i-th entry of this vector is the number of occurrences of value i in X, divided by n, the number of people. What's a linear query? A linear query is just a function f from the domain to [0, 1], and my answer to the query f on data set X is the expected value of that function under this probability distribution — one over n times the sum of the values of the function, which is the same thing as an inner product: I can write f as a vector and X as a vector and take the inner product (sorry, there's a one over n missing right where the pointer is). The subset queries we talked about earlier are special cases of this, as long as you represent things just the right way, and most of the low-sensitivity queries people actually use fall into this category of linear queries. Our goal is that, given a bunch of queries f1 through fM, we'd like to release approximations f1-hat through fM-hat, and what we're aiming to minimize is the maximum possible error we would have made — the worst difference between the true value and the approximate value.
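As a small illustration of this setup — my own sketch, with hypothetical data and names — here is the normalized vector representation of a data set and the evaluation of a linear query as an inner product.

```python
import numpy as np

def distribution_vector(records, domain):
    """Represent the multiset of records as the normalized histogram x in R^|D|:
    x[i] = (# occurrences of domain[i]) / n, so the entries sum to one."""
    index = {value: i for i, value in enumerate(domain)}
    x = np.zeros(len(domain))
    for r in records:
        x[index[r]] += 1.0
    return x / len(records)

def linear_query(f, x):
    """Answer a linear query f: D -> [0,1], given as a vector over the domain,
    as the inner product <f, x>, i.e. the average of f over the data."""
    return float(np.dot(f, x))

# Hypothetical example: domain {0,...,9}; query = "is the value even?"
domain = list(range(10))
records = [1, 2, 2, 4, 7, 8, 8, 8]
f_even = np.array([1.0 if v % 2 == 0 else 0.0 for v in domain])
x = distribution_vector(records, domain)
print(linear_query(f_even, x))  # fraction of records with an even value: 0.75
```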
The question is how low we can get this error to be — this is just our error measure again. We could use the Laplace mechanism plus some basic composition results to get error that scales as M over epsilon n, where M is the number of questions I'm asking, or, if you fiddle with things a little and use that Gaussian noise idea I mentioned earlier, square root of M over epsilon n, with that additional delta. Either way, this mechanism is only useful when the number of questions is small compared to the size of the data set — roughly, M has to be much less than n, or than n squared for the Gaussian variant. The good thing about it is that it runs reasonably quickly; you can actually implement it. So we might ask: is this the best we can possibly do differentially privately? It's actually complicated. The answer is yes in certain settings of parameters: when n is very large relative to M, or even just a bit bigger than M, you can show this mechanism is optimal in general. But in the interesting range, when this mechanism fails — that is, when the number of queries is very large — it's really not clear what happens, because those reconstruction attacks we talked about earlier only rule out a certain amount of error: they rule out error below roughly one over the square root of n, but they don't rule out all possible error. And in fact, reconstruction attacks can't rule out the following: if I randomly sample T people from the population and publish their data, that gives you something that lets you estimate any linear query to within error of about the square root of log M over T. So if I were willing to publish a bunch of people's data raw, I'd be fine — I'd actually get a pretty useful mechanism. Of course, I don't want to do that, because those T people get completely screwed. What was surprising is that there's a line of work, started by a pioneering paper of Blum, Ligett, and Roth, which generated a lot of interesting follow-up, showing roughly that I can get error that scales much more nicely in the number of questions I want to ask — actually logarithmically in the number of questions. I pay an extra factor, the log of the size of the domain, the number of different possible data values, and instead of scaling as one over n, things now scale as one over n to the one-third, or one over the square root of n in some variants (the exponent on the slide has a typo). The point is, the great thing about this is that it's useful when M is very large; the bad thing is that it runs really slowly — you pay running time proportional to the number of possible elements in the domain, which could be very large.
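For comparison with what follows, here is what that Laplace-plus-composition baseline looks like on top of the previous sketch; the even budget split, the sensitivity bound of roughly 1/n per query, and the function name are my own simplifying assumptions.

```python
import numpy as np

def laplace_baseline(x, n, queries, epsilon):
    """Answer M linear queries with the Laplace mechanism plus basic composition.
    Each answer <f, x> moves by at most about 1/n when one person's data changes,
    and splitting the budget evenly over the M queries gives noise of scale
    M / (epsilon * n) per answer -- useful only when M is small relative to n."""
    m = len(queries)
    scale = m / (epsilon * n)
    return [float(np.dot(f, x)) + np.random.laplace(0.0, scale) for f in queries]

# Continuing the earlier hypothetical example (x, records, f_even as above):
# print(laplace_baseline(x, n=len(records), queries=[f_even], epsilon=0.5))
```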
Let me try to give you, in a minute, a quick overview of how these ideas work. I'm going to give you a version due to Hardt, Ligett, and McSherry, and it's based on an idea I'll call "learning the data" that came out of a paper of Dwork, Naor, Reingold, Rothblum, and Vadhan. So what I'm going to do is design my algorithm this way: inside my algorithm there's somebody who is trying to learn X, but he talks to X only through a very restricted interface — he can ask questions only via the Laplace mechanism or the exponential mechanism. So the release mechanism tries to learn X through this differentially private interface, and it outputs a synthetic data set X-hat, which hopefully minimizes this error — the difference between X-hat and X on these queries. In general I'm not actually reconstructing X here; I'm only reconstructing it relative to this very specific measure. It may seem a bit weird to think about this as learning, but the correspondence holds up remarkably well: what in learning would be the parameters of some linear classifier I'm trying to learn is now the data set; what used to be the data points now become the queries; and what used to be gradient computations in most of the algorithms people use for classification now become actual data accesses. So why is this even remotely reasonable? Well, the way it works, the learner computes a sequence of estimates of the data set, and if I look at the error of my current estimate and take the gradient of that error, it turns out — you just work through it — that, up to a minus sign, the gradient is exactly the query fj that achieves the maximum error. So the basic idea of the algorithm is that at each step I use the exponential mechanism to find the query that's causing my current maximum error, and I use the Laplace mechanism to ask what the difference on that query actually is — which direction the error goes, whether my estimate is too big or too small right now — and then I do a wacky-looking update based on this, which isn't all that wacky: it's a classic idea called multiplicative weights. Effectively I'm doing gradient updates on the log of the x-vector instead of on the x-vector itself, which is sort of confusing, but there are good reasons to do it. The utility claim, which I'll just state, is that if I do it right, I've got a potential function — the Kullback-Leibler distance between x and my current estimate x-hat — and that potential drops by roughly the square of the current error on each iteration of the algorithm. Why is it useful to have the KL distance drop? I actually don't care whether the KL distance goes to zero; that's not my utility function. I'm only using it to measure progress, because as long as my error is large, this claim tells me I will make an update that reduces the potential a lot. But the potential doesn't start that large — it starts at log D — and so if my error is always above alpha, then after about log D over alpha squared updates I will run out of potential to use up. What that tells me is that after that many updates, this algorithm will be giving you an x-hat with error less than alpha. I think it's a really remarkable analysis, and basically, given the analysis on this slide, you can work out all the parameters I had on the previous slide. Okay, at this point I will skip to the end of the talk. I wanted to tell you about how to take advantage of local-sensitivity-based methods, but I also kind of knew I wasn't going to get there, so I'm not too disappointed.
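Here is a rough, simplified sketch of the kind of loop just described — an MWEM-style private multiplicative weights algorithm over the vector representation from earlier. The budget accounting, the stability shift, and all names are my own simplifications, not the exact algorithm from the talk; queries are assumed to be [0,1]-valued vectors over the domain.

```python
import numpy as np

def private_multiplicative_weights(x, n, queries, epsilon, steps):
    """Learner that talks to the data only through the exponential and Laplace
    mechanisms and maintains a synthetic distribution x_hat (naive budgeting)."""
    d = len(x)
    x_hat = np.ones(d) / d                      # start from the uniform distribution
    eps_step = epsilon / (2.0 * steps)          # split the budget over the two private calls per step
    for _ in range(steps):
        # Exponential mechanism: pick the query with (roughly) the largest current error.
        errors = np.array([abs(np.dot(f, x) - np.dot(f, x_hat)) for f in queries])
        weights = np.exp(eps_step * n * (errors - errors.max()) / 2.0)
        j = np.random.choice(len(queries), p=weights / weights.sum())
        f = queries[j]
        # Laplace mechanism: noisy measurement of that query's true answer.
        measured = np.dot(f, x) + np.random.laplace(0.0, 1.0 / (eps_step * n))
        # Multiplicative weights update: a gradient step on the log of x_hat.
        x_hat = x_hat * np.exp(f * (measured - np.dot(f, x_hat)) / 2.0)
        x_hat = x_hat / x_hat.sum()
    return x_hat
```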
All right, so let me just end with a quick postscript about systems and implementation. This talk has been very theoretically focused, as you probably noticed, but what's going on in practice? There's actually a lot going on in practice. The high-level message is that, so far, there's been a lot of progress on differentially private algorithms, but they're hard to use. They're hard to use because they add noise and distortion, and that's something users of data don't want to deal with if they can avoid it; they're hard to use because they're not necessarily compatible with out-of-the-box software; and they're hard to use for the more fundamental and, frankly, disappointing reason that people don't really like to think hard every time they do things, and if you've got a definition that forces you to think hard every time you do anything, it's going to run into inherent resistance. So there are a number of systems that have been developed to make this much easier to use. Frank McSherry has something called PINQ, a query language built on LINQ, Microsoft's SQL-like query framework. There's a group at Penn — I'm actually missing a reference here — and also Anupam Datta's group at CMU, and they've been developing programming languages where differential privacy is actually enforced by the type system, so any program that compiles will be differentially private, in principle. And then there were a couple of systems developed by Roy et al. and by Mohan et al. — the latter including a student of mine — for various restricted classes of queries, where the focus was really on the usability of legacy code. Those systems are actually remarkably usable, but it's still very far from something you can just hand to the statisticians at the Census and say, okay, you don't have to think anymore, go use it. I don't have it on the slide, but there is a product in use at the Census that uses differential privacy, called OnTheMap, which is based on an economic data set; I was hoping to have pictures from it in time for the talk, but I didn't, sorry. So there are these systems out there that make use easier. The disappointing thing is that it's really hard to get these things right — it's really hard to program these systems in a way that's actually differentially private — and there are all sorts of channels that show up, some of which we're used to thinking about, like timing attacks, and some, like leakage via numerical errors, that are the subject of a paper that will appear at CCS this year. And I would say that just as the definitions could be in flux, the algorithms are getting better and changing, so the systems aspect of this area is very much in its adolescence and still on the way to being refined. Here are some things I didn't cover — there are lots of things I didn't cover — so, just wrapping up. What I've suggested is that if I think of privacy in terms of the effect of an individual's data on the output, this buys me something: it gives me a purchase on this problem; it's meaningful, the way it was described for differential privacy, in the presence of essentially arbitrary external information; and it has this kind of game-theoretic interpretation that I should pitch my data in if I know you're going to be using a differentially private algorithm and I'm going to get some benefit out of it — be it a set of free steak knives or a warm fuzzy feeling from having filled out the census form and followed the law — then I should participate. It leads to this question: what can we compute with these rigorous guarantees? I gave you some basic tools, for example, and I was going to get into more, drawn from my own
recent work, but I didn't get to it. Future work: there's a lot to be done. I mentioned earlier that the state of the art in attacks is extremely ad hoc, and we don't really know much there. Definitions: we have one, and we've got some other candidates, but they start getting really hard to compare fast, and unfortunately, because of these inherent trade-offs, it's not likely that we're going to find one that just makes all the problems go away. This is an area where, to work in it, you're going to have to actually read the definitions and understand them, and as I said earlier, that's an upsetting thing, because suddenly people have to think hard, and we all know where that leads. There are a lot of application areas where the algorithms still do badly. One of them is genetics: the data is very sparse, very high-dimensional, and very hard to deal with. Another is finance; finance is interesting because, for the reason I mentioned earlier, there are other considerations at play besides individual privacy. Okay, so what I hope you got out of this talk is the sense that (a) privacy is hard to reason about, (b) it is possible to reason about, and (c) there's lots of interesting work to be done. Further resources: Aaron Roth at Penn has a nice set of lecture notes from a course he taught last fall; Sofya Raskhodnikova and I, at Penn State, have some course notes that are older and frankly less polished; and there's a DIMACS workshop on data privacy coming up immediately after FOCS — FOCS is also at Rutgers — so the workshop will be the Wednesday, Thursday, and Friday right after FOCS. If you're in the New York area, or if you're going to FOCS, please consider attending; it should be a very interesting workshop. I'll just stop there. Thank you very much.