Good afternoon, everyone. My name is Christian Cachin. I'm the president of the IACR, and I'm here speaking because the next session's invited talk is the IACR Distinguished Lecture, which is awarded annually, selected by the IACR board of directors. So today I am pleased to welcome Cynthia Dwork for this IACR Distinguished Lecture. She has had a long and extremely distinguished career in our field. She is today the Gordon McKay Professor of Computer Science at Harvard's School of Engineering. She studied at Princeton and Cornell, and I guess at Cornell she got involved with distributed computing, because she later worked at IBM Research in Almaden for a decade or more, including on problems in distributed computing. Among other things, she much later received the Dijkstra Prize for the DLS paper on consensus, which is still today a prominent work in this area. Later she moved to MSR in the Bay Area, and she stayed there until 2017, when she became a professor at Harvard. In this time she also worked on many things that are now very well-known basic foundations of our field, such as non-malleable cryptography, chosen-ciphertext security, and differential privacy, which is probably what she's going to talk about today. And last but not least, also the kind of proof of work that has, through Satoshi, definitely contributed to a revival of ideas in distributed computing. She has received the TCC Test of Time Award and the Dijkstra Prize, among many other awards in our field. So please join me in welcoming Cynthia Dwork.

Thank you very much, Christian. Thank you for inviting me. All right, so: differential privacy and the people's data. There are two kinds of data analysis that you might think about when you think about privacy-preserving data analysis. In one of them, we're looking at the behavior of the population as a whole.
So examples would be the Census, or the Consumer Financial Protection Bureau in the United States, which looks at loan applications and mortgages to find evidence of systematic discrimination. Great. And another thing you might do is some kind of machine learning, where you're trying to, say, learn to distinguish cancerous cells from healthy cells. And you have data from lots and lots of different people, and you want to understand, sort of statistically, how to make these distinctions without compromising the privacy of individuals. A different kind of situation where privacy comes up, a different kind of data analysis, would be when you're looking for a needle in a haystack. An example of this would have been the Total Information Awareness program in the United States, intended to detect terrorists through analyzing troves of information. Privacy is an issue in both of these. Differential privacy deals with the case on the left: preserving privacy when we're trying to do statistical analysis of data. And in fact, differential privacy preserves the privacy of individuals while allowing the statistical information to come through. And it's entirely the wrong tool for the needle in the haystack. Differential privacy protects all of the outliers, exactly hiding the needle and just allowing, sort of, the shape of the haystack to come through. So it's completely the wrong tool for the other case. Now, you might have an intuition that statistics feel private. A statistic is a quantity that's computed from a sample, and it tells us about the population as a whole. So intuitively, if we built two data sets by collecting them independently from the same population using standard statistical methods, following the correct methodology, we expect these two data sets to tell us the same things about our population as a whole. Now, that is where a sense of privacy comes from; the sense of privacy is derived from this fact.
You can say to yourself, well, nobody knows I was in the sample, if you were actually in the sample; or, I can claim that I opted out; or, somehow this isn't about me, this is about the population as a whole. And this intuition is on the right track, but it needs some help to be made rigorous, and differential privacy provides exactly that help. So differential privacy extends this sort of statistical notion of privacy to all computations. It preserves the "I could have opted out" privacy for every computation, including things like total population counts. This is our general model of computing. We have a database, and I don't care what form the database is in for this abstraction; it's just some giant mass of data, and a data analyst interacts with this data set. You can think of this as a single data analyst who asks a question of the data, some kind of statistical query, and gets back an answer; that's Q1 and A1. And based on that, adaptively chooses another query Q2, gets an answer A2, and so on. Or you could think of it more abstractly, where Q1 is a study done by some researcher or research group, and they publish the results of their study, and that's A1. And then some other group perhaps does another study, and that would be Q2, and they publish their results A2, and so on. So this is a very abstract notion. It allows us to capture collusion among various data analysts: we just view them as different parts of one giant malevolent analyst that's getting at the data. The driving scenario when we started this work was analysis of US Census data, and I'll say more about that in a bit. But this is a very old problem. It's been studied for, I guess, at least 54 years now, starting with Warner's work in 1965, that I know of, and I'll mention more about that in a moment. So in English, differential privacy says that the outcome of any analysis is essentially equally likely independent of whether any individual joins or refrains from joining the data set.
That's that "I could have opted out" intuition. So what this says is that the analyses are going to be randomized, and that's how we're going to get these similarities of outcomes. But still, on the intuitive level, what we want is that we should learn the same things whether any individual is or is not in the database. We should learn the same fundamental statistical truths independent of the participation of any individual. Now think about machine learning. You want stability. In fact, you need stability in machine learning for generalization. You need that the same things would have been learned independent of whether any sample or small set of samples was included. This is a requirement for generalization. And so while we often hear about privacy and utility being at odds with each other, we see that in fact privacy and this very strong notion of utility, generalizability to samples not in your training data, are actually aligned. So here's the formal definition. Our algorithms are called mechanisms, M, and M, as I mentioned earlier, is going to be randomized. So all of the randomness in our discussion is over the coin flips of the algorithm, which, as you know from crypto, is the good source of randomness. We're not talking about who happens to have been included in the data set. All right. And we need the notion of a pair of adjacent data sets. So data sets X and Y are adjacent if they have almost exactly the same set of people, but one of them has the data of just one more person. So everybody in this room versus everybody in this room without me: I could have opted out. And the requirement is that for any event S in the output space of the mechanism, of the algorithm, the probability that we observe this event when the database is X is very close to the probability that we observe it when the database is Y: it's at most e to the epsilon times the probability of observing it when the database is Y. Now, think of epsilon as a small constant.
So e to the epsilon is about one plus epsilon. And notice that the roles of X and Y are symmetric in this definition, so we get the other inequality as well. And epsilon is called a bound on the privacy loss. And again, the randomness is introduced by the algorithm. And of course, this is joint work with McSherry, Nissim, and Smith. So the key properties of differential privacy are, first of all, it's future-proof. If you have an algorithm that's differentially private, then it retains this privacy no matter what additional computation is performed on the outputs of the algorithm, or what auxiliary information your adversary might come in contact with, other than, of course, seeing right into the database. So it's resilient to present or future auxiliary information. The second important property is that it composes gracefully and automatically. So we understand how the epsilons of different analyses, or steps, or differentially private steps, add up. And we get nice composition properties, and they're automatic: we don't have to do any special work to make composition happen. These two properties together tell us that differential privacy is programmable, meaning that you can build complex differentially private analyses from small differentially private primitives and building blocks. And it's the only notion of privacy that has this property. And this is, of course, where it gets all of its power from, and why we can use it for so many different things. Now, there are various relaxations of the formal definition that I gave. And in fact, if you think of a pair of adjacent databases X and Y, and you run the algorithm once on database X and get an output C, you can define the privacy loss of C on X with respect to Y as the log of this ratio of probabilities. And the privacy loss, when you look at it this way, is clearly a random variable.
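To make the privacy loss concrete, here is a minimal sketch in Python (my own illustration, not code from the talk): it computes the privacy loss of each outcome of Warner-style randomized response on a single bit, where the function name `rr_output_prob` and the value of epsilon are made up for the example, and checks that the loss never exceeds epsilon in absolute value.

```python
import math

def rr_output_prob(bit, output, eps):
    # Randomized response: report the true bit with probability
    # e^eps / (1 + e^eps), otherwise report the flipped bit.
    p_true = math.exp(eps) / (1.0 + math.exp(eps))
    return p_true if output == bit else 1.0 - p_true

eps = 0.5
# Adjacent "databases": my bit is 0 (database X) versus 1 (database Y).
for outcome in (0, 1):
    loss = math.log(rr_output_prob(0, outcome, eps)
                    / rr_output_prob(1, outcome, eps))
    # Pure differential privacy: |privacy loss| <= eps for every outcome.
    assert abs(loss) <= eps + 1e-12
```

Seeing outcome 0 yields loss +epsilon (mild evidence the bit was 0), while outcome 1 yields -epsilon; under composition these signed losses can cancel, which is exactly the point about the privacy loss being a random variable.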
And so then you can start playing games with: well, what does this random variable look like, what is its variance, how do these things add up, and so on. So it's a random variable; it can be positive, and, thankfully, it can be negative. And that's very nice for composition, because things add up nicely and there's cancellation. In pure differential privacy, the privacy loss is always bounded by epsilon. But in what's called relaxed or approximate differential privacy, or (epsilon, delta)-differential privacy, it's bounded by epsilon with probability one minus delta. And other relaxations capture various moments of the privacy loss random variable. And I think the key insight of the last couple of years is that when you have very heavy levels of composition, and you're doing many, many, many computations, to some approximation it doesn't really matter which variant you're using. The privacy loss random variable is sub-Gaussian, and that's really nice. And we can say that the probability that it exceeds K times its expectation drops with e to the minus K squared over 2. So here's a simple mechanism for differential privacy. Remember that we're trying to preserve the opt-in and opt-out privacy semantics. So if we're trying to get a privacy-preserving approximation to the function f, we look at how much the data of a single person can possibly swing the value of the function f, and that's called the L1 sensitivity. And what we do is we add noise that's drawn according to a Laplace distribution with scale delta-1 of f, the L1 sensitivity, over epsilon. So if we have a single counting query, how many people in the database satisfy some property P, we have to add noise that's scaled to 1 over epsilon. As I said, think of epsilon as a small constant, so we're adding noise that's basically constant for one counting query. If we want to handle K counting queries, a single individual can affect perhaps each of these different queries.
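Before moving to multiple queries: the Laplace mechanism for a single counting query is only a few lines. This is a minimal sketch of my own (the helper name `laplace_count` and the example data are made up, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_count(data, predicate, eps):
    # A counting query has L1 sensitivity 1: adding or removing one
    # person changes the count by at most 1. Laplace noise with scale
    # 1/eps therefore gives eps-differential privacy.
    true_count = sum(1 for row in data if predicate(row))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / eps)

ages = [23, 35, 41, 52, 67, 29, 44, 58]
noisy = laplace_count(ages, lambda a: a >= 40, eps=0.5)
```

Note that the noise scale depends only on epsilon, not on the size of the database, which is why a single count is nearly free.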
So if you're asking how many people like Beethoven, and how many people like Bach, and how many people like Schubert, there may be one person who is in all of those counts. So that single person could affect that triple of queries by 3, and we would add noise scaled to 3 in each case. When we relax to (epsilon, delta)-differential privacy, we can use a different notion of sensitivity, which is the L2 sensitivity. And here we're adding Gaussian noise, and roughly speaking, the variance is the square of the L2 sensitivity times the log of 2 over delta, over epsilon squared. What that means is that this time, for K counting queries, we're adding noise that's scaled to the square root of K for each one, rather than K for each one. This is a major win. And in the long run, we're going to get to ignore that term, the log of 2 over delta divided by epsilon squared. It'll show up every now and then, but in these high levels of composition it sort of goes away. And so I thought I'd tell you about a beautiful, beautiful algorithm from a paper of Nikolov, Talwar, and Zhang. It's not their main result; in fact, it's Algorithm 5 in their paper. But it's really stunning, and it has some nice properties, and it'll give you an idea of some of the things that people get to do in this field. The paper is on the geometry of differential privacy. So let's say that we have a query matrix. Our query matrix is a set, in this case, of K different counting queries. So U is my universe of possible individuals, and I'm representing the database by a vector X, which says, for each of the possible kinds of individuals, how many of them are in the database. It's a histogram representation of the database. So I can describe a counting query by describing what I'm looking for in my counting query, say people over six feet tall and less than 150 pounds or something.
I can say which sorts of people would be in that count, and the inner product of that with my histogram would be the number of people in the database with that property. So I have a row of this matrix for each of the queries, and Nikolov, Talwar, and Zhang define a convex body K, which is the query matrix times the L1 ball. So it's the feasible region in answer space if the database contained only one person. And the vertices of this body are the columns, plus and minus, of the matrix A. Now, given this convex body K, you can define a K-norm, which is, sort of, how much do I have to inflate K until I capture the element Z? So if Z is outside of K, I have to blow it up more, and I get a multiple R which is larger than one. If Z is inside of K, then I can shrink it down a little bit. So that's the K-norm. The polar norm has a different and less intuitive definition: it's the maximum over all Y in the body of the inner product of Y and Z. Now, we care about these because Hölder's inequality says that for every pair of vectors U and W, the inner product of U and W is bounded by the K-norm of U times the polar norm of W. So here's their algorithm. First of all, instead of looking at the feasible region in answer space for databases of size one, we look at databases of size N. So that gives us the body N times K. And the algorithm is simple. They're going to use the Gaussian noise mechanism that I introduced earlier. So with that funny term B, which was like square root of log 1 over delta, over epsilon, they add noise whose variance is K times B squared to the answers for each of the queries. So they take this point Y, which is in answer space, and they add a phenomenal amount of noise, because little k, the number of questions, could be very, very large. And they get some point way outside, called Y tilde.
So the noise that they've added for each query is way bigger than N, the size of the database, and it's a counting query. How are we ever going to make any sense out of these things? And that's the amazing thing. They then take that point and project it back onto the body N times K, and that's what they output as the answer. So you add this completely independent noise to the answers for all of the questions, in all of the dimensions, you get this point way out there, and then you project it down, and you get something meaningful. So what they do is they analyze the root mean squared error. And the way they do it is, first they do some basic trigonometry that shows that the squared norm of our error U, which is Y hat minus Y, is bounded by this term, which is at most 2 times the inner product of U and W, where W is that crazy amount of noise that we added. Then they apply Hölder's inequality. So first of all, since Y is inside the body, the K-norm of Y is at most N, because Y is in N times K; it's inside the body. And the same thing is true for Y hat. So by the triangle inequality, we get a bound of 2N there. For the polar norm, we're looking at this maximum inner product, and standard techniques say it's achieved at one of the vertices of our convex body. And so we get the 2 from the 2 times U times W in the upper right, and we get the 2N, and we have 4N times the max over vertices of the inner product of one of these corners with our noise vector W. But our noise vector was of a really special form: it was just a whole bunch of Gaussians added up. So we end up with a total variance for A i dot W bounded by K squared times that funny term B squared, and an expected absolute value of at most KB. So with high probability, the maximum over all of the different A i's, since there are only universe-many of those, will be at most the square root of the log of the size of the universe, times KB. And for this term, 4N times the expected max over i of A i dot W, we get this quantity.
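Stepping back from the analysis for a moment, the mechanism itself is short enough to sketch in code. This is my own toy rendering, not code from the paper: the computationally hard exact projection onto N times K is replaced here by a few Frank-Wolfe steps, whose linear-optimization step over the scaled L1 ball simply selects a best signed column of A, so the projection is only approximate.

```python
import numpy as np

rng = np.random.default_rng(1)

def noise_then_project(A, x, eps, delta, steps=300):
    # Answer the k counting queries Ax with Gaussian noise, then pull
    # the noisy point back toward the body N*K = {A z : ||z||_1 <= N}.
    n = np.abs(x).sum()                      # database size N
    k = A.shape[0]                           # number of queries
    B = np.sqrt(2.0 * np.log(2.0 / delta)) / eps
    y_tilde = A @ x + rng.normal(0.0, np.sqrt(k) * B, size=k)

    # Approximate projection: Frank-Wolfe on min ||A z - y_tilde||^2
    # over the scaled L1 ball; the iterate always stays feasible.
    z = np.zeros(A.shape[1])
    for t in range(steps):
        grad = A.T @ (A @ z - y_tilde)
        i = np.argmax(np.abs(grad))
        vertex = np.zeros_like(z)
        vertex[i] = -n * np.sign(grad[i])    # best vertex of n * L1-ball
        z += (2.0 / (t + 2.0)) * (vertex - z)
    return A @ z

# Toy instance: universe of 40 types, 500 queries, 100 people.
A = rng.integers(0, 2, size=(500, 40)).astype(float)
x = rng.multinomial(100, np.ones(40) / 40).astype(float)
answers = noise_then_project(A, x, eps=1.0, delta=1e-6)
```

Even though the per-query noise has standard deviation on the order of sqrt(k) times B, far larger than N, every projected answer lands back inside the feasible body, and the theorem says the answers track Ax on average.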
But now we're going to divide by K, because that's what the title of the slide says. We want the average error. So the number of queries drops out completely. It's astonishing. And the noise that this algorithm gives is essentially tight. Can't do any better. Okay, so why do I love this algorithm? First of all, it's conceptually simple. Use the Gaussian mechanism and project. Secondly, it's almost uncoordinated. There's a result that says that essentially to get good answers to more than N squared queries, you have to coordinate the noise that you use. Where did we coordinate? The only place we coordinated was in that projection, where we projected down onto this body N times K, which is a publicly known body. And, yeah, so these are the reasons that I love it. Also, I love it because it introduces you ever so slightly to some of the techniques that are used in the more advanced geometric approaches. Now, projection onto this body is computationally hard. And so I'll say a little bit more about this later. Oh, dear. Okay, so where are we today with differential privacy? First of all, it's being used in industry quite a bit. So Apple is using it for things like learning new emojis and new spelling terms. Google was using it for detecting vectors for malware in the Chrome browser. Microsoft is using it for Windows telemetry. I'm not sure what Uber is using it for, but I've seen them advertising for people who know the field. Most of this work is in what we call the local differential privacy model, which means that before information is sent to the company, it's randomized in some way, using, for example, techniques of randomized response introduced by Warner in 1965. But modernized, of course, to be differentially private and to control the privacy loss. So in industry parlance, this says that the trust boundary is moved to the client. Your client is randomizing, so you don't have to worry about whether the central server is secure or not. That's the intuition. 
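In the local model, Warner-style randomized response can be sketched as follows. This is my own minimal illustration, not any company's deployment; the function names are invented, and the debiasing constant comes from the standard analysis.

```python
import math
import random

random.seed(0)
EPS = 1.0
P_TRUE = math.exp(EPS) / (1.0 + math.exp(EPS))  # prob. of reporting truthfully

def local_randomize(bit):
    # Runs on the client: the true bit never leaves the device
    # unperturbed, so the trust boundary is moved to the client.
    return bit if random.random() < P_TRUE else 1 - bit

def estimate_count(reports):
    # Server-side debiasing: E[sum] = count*(2p - 1) + n*(1 - p),
    # so solve for the count.
    n = len(reports)
    return (sum(reports) - n * (1.0 - P_TRUE)) / (2.0 * P_TRUE - 1.0)

true_bits = [1] * 300 + [0] * 700            # 300 people have the property
estimate = estimate_count([local_randomize(b) for b in true_bits])
```

The estimate is unbiased, but its standard error grows like the square root of n, in contrast to the constant error achievable in the centralized model.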
For simple counting queries, we saw that in the centralized model you could get errors that look like a constant, and in this model the error is more like the square root of n. But there are some really exciting new directions in which the elements are randomized and then shuffled, so that you break the link between the randomized element and where it came from, in the shuffle process. And there will be a talk on this randomized-and-shuffled model in the next session, for example. And in this world, you can get down to an error of n to the one-sixth. On the other hand, the trust model now has to extend a bit to the other users: you have to assume that the other users are doing what they're supposed to be doing. Facebook is using differential privacy for Social Science One, which is a project to allow academic researchers to do social science research on Facebook data, and they have their first project running now. And they're also partnering with Udacity to give scholarships for people to study these techniques, together with federated learning and encrypted computation. The first large-scale system of differential privacy ever deployed was done at the U.S. Census Bureau, in a tool which allows users to find out where people work and where workers live. That's OnTheMap. And very recently, there has been some exciting census-based research, using bespoke techniques, by the economist Raj Chetty and colleagues, on building something called the Opportunity Atlas, which maps demographic mobility, or social mobility, according to where people live and how long they have lived in various regions of the United States. So it's a fun tool, and you can take a look at it on the web. And a final application that's worth mentioning: there's been a lot of discussion of the ways in which people hack their data, and headlines like "most scientific results are false," and so on and so forth.
So the issue is what's called adaptivity, where the question that you ask depends on the data that you are currently exploring. And this is a known statistical pitfall, but people do it all of the time. And the TL;DR here is that if you interact with your data in a differentially private fashion, then this neutralizes the risks to validity that are caused by adaptivity. Okay. So now let's get back to the census. The census is the people's data. It's used to allocate billions of dollars in resources. It's used to determine the allocation of seats in Congress and in the Electoral College. It's used in enforcement of the Voting Rights Act. And the Census Bureau has a legal mandate for privacy. So one of the things that we found is that you say, I have this shiny new privacy technique, you should use it, and people say, well, we're fine; you know, we've had access to data all along, we don't want to hear about you and your privacy-preserving whatever. But when there's a legal mandate for privacy, and you can show that there's a problem, then in some sense people have to pay attention. And in general, in other scenarios, if you can say, here are data that people didn't have access to, and now we have privacy-preserving technology that permits access, that's another good way of getting technology deployed. Now, back in 2003, long before anybody really started formalizing privacy for statistical data analysis, Dinur and Nissim showed an amazing result: overly accurate estimates of too many statistics completely destroy privacy. There wasn't even a definition of privacy yet. What they said was: here's this thing, which we now call blatant non-privacy; whatever you think privacy is, this is obviously a violation of it. And if you allow overly accurate estimates of too many statistics, you're going to get this thing, which is clearly a violation of privacy. And that set off a suite of results that strengthened and generalized it. So it says that there's a limit to what can be done.
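A toy version of this reconstruction phenomenon is easy to simulate. The sketch below is my own, with made-up sizes and noise levels: hand out slightly noisy answers to many random subset-sum queries over a secret bit vector, and the attacker reconstructs it by ordinary least squares and rounding.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 100
secret = rng.integers(0, 2, size=n)                # one private bit per person

m = 2 * n                                          # "too many statistics"
Q = rng.integers(0, 2, size=(m, n)).astype(float)  # random subset queries
# "Overly accurate": each released answer is off by less than 1.
answers = Q @ secret + rng.uniform(-1.0, 1.0, size=m)

# The attacker solves a least-squares problem and rounds to bits.
x_hat, *_ = np.linalg.lstsq(Q, answers, rcond=None)
recovered = (x_hat > 0.5).astype(int)
fraction_correct = float((recovered == secret).mean())
```

With this much accuracy, essentially the whole secret vector comes back: blatant non-privacy. Noise on the order of the square root of n per answer is what defeats attacks of this kind, which is one way to read the limit that the result establishes.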
Now, the definition of overly accurate will vary according to your definition of many: if you have lots and lots and lots of statistics, they don't have to be that accurate; if you only have a few, then they would have to be pretty accurate. So when he became chief scientist and associate director for research and methodology at the Census Bureau, John Abowd, the labor economist, started looking into this. And what he says is that the techniques that were used in the 2010 decennial census did not suffice. So the United States has a short census every 10 years, called the decennial census, and then there's a much more detailed census called the American Community Survey. But everybody has to answer the decennial, and only a small percentage of the population is required to respond to the community survey. Okay, so, from a talk that he gave, "Staring-Down the Database Reconstruction Theorem": he outlined how they actually launched the attacks that we had been talking about, reconstructed records, and linked them with publicly available data. He mapped out how often they were successful in re-identification. He pointed out that the harm they were able to identify was that the attacker could learn how people describe their race and their ethnicity. The United States form has quite a bit about race and ethnicity on it, even the short decennial form. And if you do this on the American Community Survey, you get much, much more personal information. As a matter of fact... oh, I'll tell you that in a minute, okay. So then he gave statistics for exactly how many times they were able to correctly re-identify people, and he says, well, we fixed this by implementing differential privacy. So this was really gratifying to us, in part because the census was our driving scenario from the very beginning. This is a picture that we had in mind. Now, there are a lot of challenges still. First of all, census data are used for many, many purposes.
Historians and sociologists, demographers, economists: they're not trained to interact with data through any sort of official mechanism, certainly not in a differentially private way. We don't have vast libraries of tools for this sort of analysis and that sort of analysis. There's a privacy budget; I said that privacy loss accumulates, so somebody has to keep watch and make sure that the privacy loss hasn't grown too large. By the way, the actual value of the privacy budget will be chosen by the Secretary of Commerce. I don't know what they're going to choose. Okay. So you have all of these people who work with the data who are not trained to use it in this new fashion, and there aren't tools for them to just deploy, and so on. They're also used to seeing what are called PUMS, or Public Use Microdata Samples, which are, roughly speaking, de-identified individual records of real people, with some small changes thrown in by a process called swapping, where, intuitively, families in different regions of the country that are similar in various respects have their data swapped. But the swap rate is not public. Now, the decennial form is really meager. Here's the whole form for one person, and this is a description of most of the questions in it. In contrast, in the American Community Survey, the questions are about housing; ancestry; your journey to work and how you commute; computer and internet use; disability; employment; family and relationship to the householder; fertility; food stamps; grandparents as caregivers; health insurance coverage; Hispanic origin; home heating fuel; housing costs for owners; industry, occupation, and class of worker; marital status and history; ownership, home value, and rent; place of birth, citizenship, and year of entry; plumbing, kitchen, and telephone service; residence a year ago, did you live in this house a year ago; migration, when did you move here;
school enrollment; sex; vehicles available; veteran status; and the year the home was built. So it's a huge trove of information, and the techniques that exist right now can't cope with all of that. Now, what can we do about it? So one possibility might be to build differentially private synthetic data. This is this wonderful image, right, where somehow or other you can look at the data set and make up synthetic data, and publish this little synthetic data set, and let people run the same kinds of queries they would have run against the big database, but against this privacy-preserving synthetic database. Is it doable? Is it possible? Astonishingly, yes. Theoretically, you can build a small database that is completely synthetic, and you build it in a differentially private way, and you can even handle exponentially many queries. This was a result of Blum, Ligett, and Roth, which absolutely knocked my socks off in 2008. It was an offline process, and an online process was then developed by Hardt and Rothblum. But there are hardness results. So, you're cryptographers: suppose the database had records of this form. Each record has a message, a public verification key, and a signature on that message under the corresponding signing key, and you want to release a synthetic database where the queries would be verification keys, and the questioner could ask: how many rows in the database contain valid signatures that verify under the key k*? How would you create a synthetic element that didn't exist in the database? You would need to forge a signature in order to do it. So you get a hardness result right off the bat, and in fact there's a very tight connection to traitor tracing; you can play similar games. Now, these are counting queries of a very specific and contrived form. So what about simple things: can you create synthetic data that will give you the answers for all two-way marginals?
And I'll say a little bit more about what marginals are on a later slide, if you don't know already. And if the dimension of your data isn't too large, you could answer all of the two-way marginals using the Gaussian mechanism or the Laplace mechanism. But you can't make synthetic data that captures all of these two-way marginals, assuming that one-way functions exist. And in fact there are small families of queries where it's hard to create just the answers for these small sets of queries, and this also draws on traitor tracing. So this leaves three directions for moving forward. One of them is to look at structured query classes, like threshold functions and marginals. The previous hardness results were for arbitrary families, or contrived families; what about very structured ones? Okay. And somehow we're going to have to go around the impossibility results, so these are not going to be synthetic data necessarily, but just answers for structured query classes. Another is to try the AI method of punting: use differentially private GANs, or generative adversarial networks, to try to generate synthetic data. Well, if you're going to generate synthetic data, you have to say what it is you want these data to capture. What are the statistical tests that you'll use for evaluating the quality of these synthetic data? So you might say, well, I want to preserve some kind of low-order interactions. We know there's a hardness result here, but in practice a lot of things can be done even when there are hardness results, so we could see. And the third is a fanciful problem which I will leave you with at the very end of the talk. This is Steve Fienberg, who... he died two years ago. He was a real inspiration to me. He's a statistician who did a lot of work, among many fields, on privacy, and many of the forms of the questions that I asked came from conversations with Steve.
And he's pictured here with his wife Joyce, who was murdered along with 10 other people in the Tree of Life synagogue massacre in Pittsburgh. So: differentially private marginals. Think about data with D different Boolean attributes. So you can describe each person by just a D-bit string, and you can look at a histogram, or a contingency table, that has a cell for each of the two-to-the-D settings of these D different attributes. A marginal is just a subtable. So in this picture, the blue table is the marginal where attribute A has value zero, and the red table is the one where attribute A has value one. Now, each individual has exactly one setting, of course, of their D bits, so they live in just one cell. So the sensitivity of a contingency table query is just one: adding or deleting one person can change only one cell, and that cell only by one. So we could add noise scaled to one over epsilon to each cell, and then when somebody wants a marginal, they just add up all of the noisy cells, and that's that. But if you're adding up an exponential number of cells, even with cancellation in the errors, you still have big errors that remain. So the marginals problem is, sort of, how to deal with that. There are two general approaches in the literature. One of them separates the problem into two steps: one step adds privacy, and the other is some kind of hard computational step, where people try to get consistency across the marginals, or create synthetic data distributions or a synthetic data set, and they throw some really well-established, high-powered solvers at it. So there are some examples of that. And a different approach is just to release noisy values for these marginals without trying for consistency among them, without trying to say, well, there's one table, with a non-negative integer number of people in each cell, that gives rise to this collection of marginals.
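The per-cell noising just described can be sketched as follows (my own toy example; the number of attributes, the database size, and epsilon are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

d = 4                                         # Boolean attributes per person
people = rng.integers(0, 2, size=(1000, d))   # toy database of d-bit strings

# Contingency table: one cell per setting of the d bits.
table = np.zeros([2] * d)
for row in people:
    table[tuple(row)] += 1

# Each person lives in exactly one cell, so the sensitivity is 1:
# Laplace noise with scale 1/eps in every cell suffices.
eps = 0.5
noisy = table + rng.laplace(0.0, 1.0 / eps, size=table.shape)

def one_way_marginal(t, attr, value):
    # Sum the subtable where attribute `attr` equals `value`.
    index = [slice(None)] * d
    index[attr] = value
    return float(t[tuple(index)].sum())
```

A marginal that sums 2^(d-1) noisy cells accumulates error with standard deviation about 2^(d/2)/eps even after cancellation, which is exactly the problem for large d described above.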
Two approaches to that involve private approximation via low-degree polynomials, or a relaxation of that Algorithm 5. So remember, I mentioned Algorithm 5: we added all this Gaussian noise and projected down onto this body K, and the projection was a hard step. The idea here is that now we're going to relax that: we're going to approximately project onto a convex body that contains K and is, in some nice way, just a little bit bigger, so that the errors you get aren't bad. And this gives state-of-the-art results for two-way marginals in polynomial time. This is a very interesting direction to experiment with and to keep pursuing.

About the GANs, I'll only say this. In GANs, we have a fake-data generator that is trying to learn to generate data that look like the real data, and we have a distinguisher who is trying to distinguish the real data from the fake data, and the results from the distinguisher are fed back into the fake-data generator so that it knows how to improve what it's doing. Now, the only box here that actually sees real data is the distinguisher, so we can make the distinguisher differentially private. This is an approach that has been suggested in a few places. There's a nice work on bioRxiv that uses this: Privacy-Preserving Generative Deep Neural Networks Support Clinical Data Sharing, that's the name of the paper. And Steven Wu is from our community, and so we are working on variants of this that, for example, might be good with respect to low-order interactions among variables. This is very preliminary work, also done with Marcel Neunhoeffer.

Okay, so I'm going to close with this problem, the Fienberg problem, because Fienberg used to pound on this problem: how can we let a trusted researcher access raw data and then privately publish the results?
That is, I trust Fienberg to do the right thing, to follow the protocol, and to make public only the things that the protocol says can safely be made public. So the question is, how can we let him do this when what he chooses to make public will depend in a very immediate way, because he's looking at the raw data, perhaps very strongly, on a single individual in the data? That's the worry: his choice of what he decided was interesting might have been completely different had Warren Buffett not been included in the data set.

So what can we do? Well, the first idea is, we assume that Fienberg is very energetic and that we can rerun him. And we apply a beautiful technique due to Nissim, Raskhodnikova, and Smith called sample and aggregate, which says, essentially: how can you take a function that you absolutely don't understand and get some kind of privacy-preserving approximation to it? Here was their idea. Remember, we're assuming that data are copious. We take the data and split it into a bunch of slices; in this picture, I have four slices. We feed a slice of the data to the function, and we do that B different times if I have B different slices, so the function is operating independently on each slice, let's say in parallel. Then you take all of those answers and aggregate them using a differentially private aggregator.

So why is this whole process differentially private? Well, consider two databases that differ in the presence or absence of that red data point. The difference affects only one of the copies of the function, only the second copy in the picture, so it affects just one of the inputs to the aggregator. And since the aggregator is differentially private, it behaves essentially the same way independent of the value of any one of its inputs. So in this fanciful version, my function is Fienberg.
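Sample and aggregate, as just described, can be sketched as follows. This is a minimal illustration under an assumption of mine: the differentially private aggregator is a clamped mean plus Laplace noise (the original paper offers more refined aggregators), and the function and parameter names are not from any particular library.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_and_aggregate(data, f, num_slices, epsilon, lo, hi):
    """Toy sample-and-aggregate: split the data into disjoint slices,
    run the arbitrary, un-analyzed function f on each slice
    independently, then combine the per-slice answers with a
    differentially private aggregator (here: clamp to [lo, hi],
    average, add Laplace noise).

    Each person lives in exactly one slice, so changing one record
    moves at most one clamped answer; the mean of the answers then
    changes by at most (hi - lo) / num_slices, which is the
    sensitivity we calibrate the noise to."""
    slices = np.array_split(np.asarray(data), num_slices)
    answers = np.clip([f(s) for s in slices], lo, hi)  # f runs per slice
    sensitivity = (hi - lo) / num_slices
    return float(np.mean(answers) + rng.laplace(scale=sensitivity / epsilon))

# Usage: privately approximate a black-box statistic (here, the mean).
data = rng.normal(loc=5.0, scale=1.0, size=10_000)
print(sample_and_aggregate(data, np.mean, num_slices=100,
                           epsilon=1.0, lo=0.0, hi=10.0))
```

The clamping is what makes the analysis go through without understanding f at all: whatever f does on its slice, its influence on the aggregate is bounded, which is exactly the property needed when "f" is Fienberg himself.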
And if he were energetic and I could rerun him, this is something that I could do. Now, can we relax this? Can Fienberg incorporate some of his friends? So here are some ideas for what the friends could do. Fienberg does a computation. He says: I want to publish statistic T; in fact, I'd like to publish it now, using differential privacy to release the statistic T. What do you guys think? So he sends T to all of the other participants, Frank McSherry, Kobbi Nissim, Adam Smith. And what do they do? We're going to hope that there's some verification procedure they can run that's easier than solving the Fienberg problem in the first place: verification might be easier than finding T. So what they're going to do is say: I'm going to look at T on my own slice, and if I like it from my own slice, I'll vote one; otherwise I'll vote zero. And then we'll have some differentially private aggregator for these votes, which I don't have time to describe.

So this is nice, but there's a problem with it: if the friends are rubber stamps, then we haven't solved anything. So we conjecture that there are simple picky verifiers, for example, verifiers who would verify T for their own slice S only if Fienberg, running on S, would have produced T. But as I said, this is a fanciful problem, and I think it's also an excellent problem for the crypto community. So with that, I thank you for your attention.

We certainly have time for questions. Please step up to the microphones there in the middle of the audience. So let me open the round of questions; they will come. You had on one slide GDP, generalized differential privacy. And in Europe, we have the GDPR. Sorry. Good. Thank you. So I wanted to ask: to what extent are you aware that legal documents and frameworks are influenced by the notion? To what extent am I aware of what?
To what extent have legal documents been influenced by the technical notions? The notion isn't only in the lab, right? Right, right, right. OK. So, by the way, that GDP was Gaussian differential privacy. But let me answer a couple of things along with that. First of all, I talked about the use of differential privacy in industry, and I talked about its use in the US Census Bureau; this is a very US-centric set of observations. What about Europe, and what about the UK? So to my knowledge, there is... I'm sorry, I didn't mean it that way. I meant that there's been some motion in the UK that I have not yet seen in the rest of Europe. And I do hope that it stays together. OK. So at least there's real awareness in the Office for National Statistics. In the GDPR, there's been a lot of resistance; the rapporteur wanted to put some things about differential privacy in a few years ago, and it was voted out. I don't know what's happening with legislation and regulation here. There is interest in differential privacy in Germany. For example, Frauke Kreuter, who among her many positions is a professor at Mannheim, just spent the semester at the Simons Institute for the special semester on differential privacy, and I mentioned one of her students, Marcel Neunhoeffer, earlier. So regulation is much slower. OK. Thank you.

Yes. So, about the question of synthetic data. You said that there are limitations related to the cryptographic problems. But aren't there limitations even with a normal database of people? Isn't there a problem that the synthetic data would have less information than the real data, so there would still be a limitation even with a normal database? So you bring up an excellent point. You're absolutely right, and I should have mentioned this. We've been looking at this for a couple of years, and I was talking to a medical researcher on the phone.
And he said, can't you just give me a synthetic table that would let me look up the answers to any of the questions I want? So synthetic data would do that, but a big table that wasn't of special form would also let him do that. And the answer is: of course not. Because if I did, then you could launch the attack that corresponds to the fundamental law of information recovery against that table. If it would allow you to obtain relatively accurate answers to essentially all the questions you might want to ask, then it would be a violation of the fundamental law of information recovery. Yes. And so when we talk about synthetic data, we have to say: what are the statistical tests, what are the things that we want these synthetic data to capture? And it will definitely be less. That's why we talk about things like low-order marginals or small variable interactions.

And you also mentioned that differential privacy is really nice because it's programmable. So I wondered, how does that happen in practice? Do we really have the libraries and everything that allow us to do that? If I have a database, in what sense can I start saying I want to program differential privacy into it? Are there basic building blocks? Right. I also said that there aren't really big, ready-for-industrial-scale libraries. Nonetheless, there are many places in which people are starting to build these libraries, quite prominently among them the privacy tools for social science research data project that Salil Vadhan has at Harvard. And if you search around on the web, you'll find other libraries. I think various industry players are starting to get there, but there's a lot of work to be done. Any other questions? If not, then I would like to thank you again very much.