Thank you for the introduction. Let me first remind all of us of the problem in data privacy. In data privacy, we have a database containing data, such as medical data or census records, and what we want to do is design a database mechanism that allows users to analyze the data. There are two main goals. One of them is utility: we want to release accurate statistical information to the users. The second goal is privacy: we want to make sure that each individual's sensitive information remains hidden.

It turns out that simple anonymization techniques are not good enough, and let me give you some examples. In the past, a medical insurance organization released supposedly anonymized medical data. However, by combining this medical data with public voter registration records, one can actually re-identify some of the medical records, including that of the governor of Massachusetts. Another attack is the Netflix attack, which was already described yesterday in the tutorial. There, Netflix released supposedly anonymized user movie rating data. However, by combining this data with the public IMDb database, one can partly de-anonymize the Netflix data set.

These sorts of attacks have led to the proposal of many privacy definitions. One of the first is k-anonymity, a privacy definition specifically for releasing data tables. Roughly speaking, it requires that each record in the released data table be indistinguishable from at least k-1 other records with respect to certain identifying attributes. The current standard privacy definition is differential privacy, as described in yesterday's tutorial. Roughly speaking, it requires that when one person's data is added to or removed from the database, the output distribution of the mechanism changes by at most an epsilon amount. More precisely, for every pair of databases D and D' differing in only one row, the output distribution of the mechanism San on D is epsilon-close to the output distribution of San on D'. By epsilon-close, I mean that the two output distributions differ pointwise by at most a multiplicative factor of e to the epsilon.

An equivalent way of viewing differential privacy is that whatever an adversary learns about an individual i from the mechanism, he or she could have learned from knowing everyone else in the database. However, in situations where there is correlation between individuals in the database, such as in social networks, knowing everyone else in the database can actually allow the adversary to learn a lot about individual i. So in such settings, differential privacy might not be strong enough. This issue led to the proposal of an even stronger privacy definition called zero-knowledge privacy. Roughly speaking, zero-knowledge privacy requires that whatever an adversary learns about an individual i from the database mechanism, he or she could have learned from just knowing k of the remaining individuals in the database, where k is something strictly less than n, the size of the database. More formally, we require that for every adversary A interacting with the mechanism, there exists a simulator such that for every database D, every auxiliary input z, and every individual i in the database, the simulator can simulate the adversary's output given just k random samples from the remaining individuals in the database. More precisely, we require that the output distribution of A is epsilon-close to the output distribution of the simulator, where epsilon-close means the same thing as in differential privacy.
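To write these two guarantees down in symbols, here is a standard formulation (the notation San, A, Sim, and RS_k is mine, not from the slides):

```latex
% Differential privacy: for all databases D, D' differing in one row
% and all sets S of outputs,
\Pr[\mathrm{San}(D) \in S] \le e^{\varepsilon} \cdot \Pr[\mathrm{San}(D') \in S].

% Zero-knowledge privacy: for every adversary A there exists a simulator
% Sim such that for every database D, individual i, and auxiliary input z,
\mathrm{out}_A\big(A(z) \leftrightarrow \mathrm{San}(D)\big)
  \approx_{\varepsilon}
  \mathrm{Sim}\big(z,\, \mathrm{RS}_k(D_{-i})\big)
% where RS_k(D_{-i}) denotes k random samples from the rows of D other
% than individual i's, and \approx_{\varepsilon} denotes the same
% pointwise e^{\varepsilon} closeness as above.
```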
So let me tell you what's good and bad about these privacy definitions. k-anonymity is good in the sense that it's simple, efficient, and practical. However, it is bad in the sense that it only provides weak privacy protection, and there are known attacks on k-anonymity, of which I'll give an example later. Differential privacy is good in the sense that it provides strong privacy protection, and there are also a lot of mechanisms for differential privacy. However, it is bad in the sense that you have to add noise, and it's not clear how efficient and practical differentially private mechanisms are. Zero-knowledge privacy is good in the sense that it provides even stronger privacy protection, and there are also a number of mechanisms for zero-knowledge privacy. However, it is bad because one has to add even more noise, and therefore it is even less clear whether zero-knowledge private mechanisms can be efficient and practical.

Differential privacy and zero-knowledge privacy both require the database mechanism to be randomized, and noise needs to be added to the exact answer, sometimes quite a lot of noise. In practice, however, we don't want to add much noise; we want simple and efficient sanitization mechanisms. This leads to the following question: is there a practical way of sanitizing data while ensuring privacy and good utility?

We observe that in practice, data is often collected via random sampling from some population, such as in surveys, and the collected data is stored in a database. Next, some sanitization mechanism San is run on the database. It is already known that if the mechanism San is differentially private, then the random sampling step amplifies the privacy of the combined process. However, we can ask ourselves: can we use a qualitatively weaker privacy definition for the database mechanism San, and still have the combined process satisfy a strong privacy notion?

So the goal is the following. We want to provide a privacy definition such that if a mechanism San satisfies it, and we combine the mechanism with a random sampling step, as is done during data collection, then the combined process satisfies a strong privacy definition, such as differential privacy or zero-knowledge privacy. We want this definition to be weaker than differential privacy, because we actually want better utility. Furthermore, we want the definition to be meaningful by itself, even without any random sampling. This is important because if the random sampling is corrupted or its outcome is completely leaked, we want the definition to still provide a strong fallback guarantee.
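As a concrete picture of the sample-then-sanitize pipeline we have in mind, here is a minimal sketch (the function names and interface are my own illustration, not code from the paper):

```python
import numpy as np

def sample_then_sanitize(population, p, san, rng=None):
    """Combined process: Bernoulli(p) random sampling, then sanitization.

    Hypothetical interface: `population` is a list of records, and `san`
    is any sanitization mechanism that takes a list of records.
    """
    rng = np.random.default_rng() if rng is None else rng
    # include each individual's record independently with probability p,
    # as in data collection via surveys
    keep = rng.random(len(population)) < p
    sampled_db = [row for row, kept in zip(population, keep) if kept]
    # run the sanitizer on the sampled database only
    return san(sampled_db)
```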
So towards this goal, let us revisit k-anonymity, since it is both simple and practical. Let me remind ourselves of what k-anonymity is. It's a privacy definition for releasing data tables, and it requires that each record in the released data table be indistinguishable from k-1 other records with respect to certain identifying attributes. It's based on the notion of blending in a crowd, in the sense that each record, representing an individual, is required to blend with k-1 other records, also representing individuals. Since it's simple and practical, we want to consider it. However, there is a problem: the definition restricts the output of the mechanism, but it does not restrict the mechanism that generates the output. And this has led to practical attacks on k-anonymity.

Let me give you a simple example illustrating the problem. Suppose that you have some database and you run any existing algorithm to generate a data table satisfying k-anonymity; there are a lot of such algorithms. Next, however, at the end of each row of the data table, you attach the personal data of some fixed individual from the original database, and then you output the modified data table. The output satisfies k-anonymity, but it reveals the personal data of some individual, which is really bad. And there are plenty of other examples. So the problem is that k-anonymity does not impose restrictions on the mechanism, only on the output, and as a result it does not properly capture the notion of blending in a crowd. In fact, one of the key insights of differential privacy is that privacy should be a property of the mechanism and not just of the output. What we want is a privacy definition that imposes restrictions on the database mechanism and properly captures the notion of blending in a crowd.

So here are our main results. We provide a new privacy definition called crowd-blending privacy. We construct simple and practical mechanisms for releasing histograms and synthetic data points. And we show that if we take a crowd-blending private mechanism and combine it with a random sampling step, as is done during data collection, the combined process satisfies zero-knowledge privacy. Since zero-knowledge privacy is stronger than differential privacy, the combined process satisfies differential privacy as well.

Before I give you the formal definition of crowd-blending privacy, let me start off with some preliminary definitions. Two individuals with data values t and t' are epsilon-indistinguishable by the mechanism San if, whenever you have a database D containing t, you can replace t with t' and the output distribution of the mechanism San changes by at most an epsilon amount. More formally, for every database D, the output distribution of San on D together with t is epsilon-close to the output distribution of San on D together with t', where epsilon-close uses the same distance measure as in differential privacy.

With this definition, we can actually phrase differential privacy in the following manner: every individual t in the universe is epsilon-indistinguishable from every other individual t' in the universe. In particular, in any database D, each individual in D is epsilon-indistinguishable by the mechanism San from every other individual in the database. One possible way of relaxing this is to require that each individual in the database only needs to be indistinguishable from, say, k other people in the database, as opposed to everybody else in the database.
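In symbols, this preliminary definition reads roughly as follows (again, the notation is mine):

```latex
% Data values t and t' are epsilon-indistinguishable by San if for every
% database D and every set S of outputs,
e^{-\varepsilon} \cdot \Pr[\mathrm{San}(D \cup \{t'\}) \in S]
  \le \Pr[\mathrm{San}(D \cup \{t\}) \in S]
  \le e^{\varepsilon} \cdot \Pr[\mathrm{San}(D \cup \{t'\}) \in S].
```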
And so let us make a first attempt at our privacy definition: for every database D of size at least k, and for every individual in the database, the individual is epsilon-indistinguishable by the mechanism from at least k-1 other individuals in the database. Unfortunately, this definition collapses back down to differential privacy. Intuitively, this is because if differential privacy doesn't hold, then there exist t and t' such that the mechanism San can epsilon-distinguish t and t'. Now we can form a database D consisting of a single individual with data value t, while the remaining individuals have data value t'. The individual with data value t is then not indistinguishable from anyone else in the database, and this violates our first-attempt definition.

But we have a solution. The idea is to allow the database to contain outliers, the people who are not indistinguishable from sufficiently many other people in the database, but to require the mechanism San to essentially delete or ignore these outliers. I now give the formal definition of crowd-blending privacy, the new privacy definition that we propose. A mechanism San is (k, epsilon) crowd-blending private if for every database D and for every individual t in the database, either t is epsilon-indistinguishable from at least k individuals in D, or t's data is essentially ignored, meaning that even if we remove t from the database, the output distribution of the mechanism changes by at most an epsilon amount. This definition is weaker than differential privacy, which is what we want, because we want better utility. Furthermore, the definition is meant to be used in conjunction with random sampling, but it is still meaningful by itself.

Now let me give you an example of a crowd-blending private mechanism for releasing a histogram. Here's how the mechanism works: we first compute the histogram, and each count that is less than k is suppressed to zero. For example, on the left we have a histogram, and the red line represents k; the mechanism takes all the counts below the red line and suppresses them to zero. Intuitively, this is crowd-blending private because if you consider any individual i in the database, individual i blends with, that is, is indistinguishable from, everyone else belonging to the same bin. If there are at least k such people, then individual i is indistinguishable from at least k-1 other people in the database, and so our privacy definition is satisfied. On the other hand, if individual i's bin contains fewer than k people, the mechanism suppresses the count of that bin to zero, which is essentially the same as ignoring individual i's data, and that is also allowed by our privacy definition. In fact, to get better utility, one doesn't have to suppress counts that are less than k; one can simply add noise to those counts instead. So this is simple and similar to what is done in practice. However, it is not differentially private, because we are releasing the exact counts of some of the bins, and it's not too hard to see that this is impossible under differential privacy.
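Here is a minimal sketch of this histogram mechanism, including the noisy variant (the function and parameter names are my own; this is an illustration, not the paper's code):

```python
import numpy as np

def release_histogram(counts, k, epsilon=None, rng=None):
    """Crowd-blending private histogram release (sketch).

    Bins with at least k people are released exactly.  Smaller bins are
    suppressed to zero, or, in the better-utility variant (epsilon set),
    released with Laplace noise added instead.
    """
    rng = np.random.default_rng() if rng is None else rng
    released = []
    for c in counts:
        if c >= k:
            # everyone in this bin blends in a crowd of at least k people,
            # so the exact count can be released
            released.append(float(c))
        elif epsilon is None:
            # suppress the small bin: essentially ignores these individuals
            released.append(0.0)
        else:
            # noisy variant: perturb small counts rather than suppressing them
            released.append(c + rng.laplace(scale=1.0 / epsilon))
    return released
```

For example, release_histogram([12, 3, 40], k=5) would release the counts 12 and 40 exactly and suppress the count 3 to zero.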
Next, I'll give an example of a crowd-blending private mechanism for releasing synthetic data points. Let me first describe the problem. We are given as input a set of data points in some Euclidean space, and the goal is to release some version of these data points, which we call synthetic data points, so that users can perform statistical analysis on them. It is already known that it is impossible to efficiently and privately release synthetic data points for answering general classes of counting queries, and this holds for any reasonable notion of privacy. However, counting queries are somewhat non-smooth, in the sense that even if you perturb the input data points slightly, the output of a counting query can change quite a lot. As a result, we focus on answering smooth query functions instead of counting queries.

Now let me give you a crowd-blending private mechanism that does this. The mechanism works as follows. First, it identifies the outliers, shown here as the red data points; an outlier is a data point that belongs to a cell containing fewer than k data points. The mechanism removes the outliers, and then adds noise to each of the remaining data points. This mechanism is useful for answering all smooth query functions with decent accuracy, and it's also crowd-blending private. In our paper, we show that this is not possible with differentially private synthetic data points.

I now present our main theorem. Consider the following scenario: we have a population, we sample from it with probability p independently for each individual, and we collect the sampled individuals' data and store it in a database. Next, we run a crowd-blending private mechanism on the database. Our theorem says that the combined process of sampling and then running the crowd-blending private mechanism satisfies zero-knowledge privacy. And since zero-knowledge privacy is stronger than differential privacy, the combined process also satisfies differential privacy.

However, sometimes the random sampling is slightly biased, because it's done slightly incorrectly or maybe an adversary is influencing it. Furthermore, an adversary might already know whether certain individuals were sampled or not. As a result, in our paper we extend our theorem to hold even when the sampling is slightly biased, in the sense that most individuals are sampled with probability close to p, but not necessarily equal to p, and the remaining individuals are sampled with arbitrary probability. Our theorem still holds in this case.

Due to time constraints, I won't talk about the proof; you can look at our paper for the details. This concludes my talk. Thank you for listening.