And at that point, it can be vulnerable. Enter secure multi-party computation, which gives us the ability to compute values using encrypted data without revealing the underlying data. This is something that's been a theoretical possibility for quite some time. The first real-world implementation of it was about 10 years ago in the Danish sugar beet market, an obvious place to begin applying new privacy-enhancing technologies. Now, the farmers all sell to this one entity that commercially sells the beets. Farmers don't really want to reveal the price they're willing to sell at. So they were able to partner with the university there in Denmark to create a multi-party computation protocol that allows the farmers to submit the beet price they would sell at. And the protocol calculates the market clearing price of beets in Denmark. It's really been in the past three years or so, though, that we've seen a number of other practical uses of MPC, including to study gender pay gaps in Boston, to evaluate VAT tax payment in Estonia, and to avoid spy satellite collisions. There's a lot more potential for MPC, not just in further securing existing data uses and data transactions, but in potentially being the key to unlocking valuable research data in areas such as health and government. So today, you'll learn a lot more about what MPC is, how it works, and what its potential benefits and applications might be. We've got a great panel discussion, which will then be followed by a fireside chat with Bob Groves, former census director. But first, I want to introduce Senator Ron Wyden. Senator Wyden has represented Oregon in Congress since 1981. He is the ranking member of the Senate Finance Committee, as well as a senior member of the Intelligence, Budget, and Natural Resources Committees. 
Last year, he also introduced the latest version of the Student Right to Know Before You Go Act, which calls for the use of secure multi-party computation in evaluating higher education student outcomes. Senator Ron Wyden. Chris, thank you very much, and we're gonna make this a filibuster-free zone today. I know you've got an extensive program, and suffice it to say, multi-party computation probably does not come up consistently when Americans get together at the local coffee shop. It, however, is an important issue, because I think we all understand that encryption may not grab the morning headlines and command the coffee shops and locker rooms of this country, but it has very significant implications. And I thought I would just give you my sense of where things are and perhaps frame this in terms of where the Senate is and what the debate will be all about. Suffice it to say, I'm very troubled by the fact that senior law enforcement officials have restarted their effort to weaken strong encryption. This is, in effect, launching once again the failed war on encryption. Like his predecessor, the FBI director, Christopher Wray, wants American technology companies to build back doors into encrypted products so that the government can get access to Americans' phones and their laptops. And I think it is fair to say I have been one of the most outspoken members of Congress in opposing it, because weakening strong encryption is, even by Beltway standards, a flawed concept. It is bad for security. I think people will be more vulnerable if their information is not protected with strong security. I think it will be bad for people's individual liberty, and I think it will be very damaging to American competitiveness, because if we go to this kind of policy, the first thing American companies are gonna say is, where in the world can we find a place that isn't doing something like this? 
So when you have a loser that is a threefer, bad for security, bad for liberty and bad for the American economy, that ought to give people a bit of pause, and I have already indicated that if there is an effort to weaken strong encryption, I'm prepared to shut down the United States Senate over that effort, and I don't make that statement casually. I have only done that a couple of times in my time in public service, but I really do believe weakening strong encryption is that flawed a concept. We've reached that judgment based on talking to a lot of independent experts, who are virtually unanimous in saying that weakening encryption to give our government access will also open up Americans' data to theft by the people that almost universally are described as the bad guys, the criminals and the foreign hackers. Now, if you were to confine your views on this topic to what is said by Director Wray, you would easily walk away with the impression that encryption is really an evil, that encryption is a bad thing that's damaging US security. As I indicated, I think that is a big leap from reality. It represents, in my view, a misunderstanding of how policymakers ought to look at encryption, cybersecurity and, of course, an important part of your work today, the protection of data. So this is an important program. You're gonna be talking about one of the many ways that encryption can protect Americans' private information. We will be following this closely. Laura and Chris here are the gurus in the Senate on this issue. Feel free to call them nights and weekends. Take all their free time. But this is an extremely important issue because there are vast amounts of sensitive data sitting in servers across the globe, and it's been used to advance scientific research, identify and prosecute cases of race-based discrimination and even bolster the effort to find cures for dreaded diseases. But it can seem like practically every day, almost every hour, there is a new account of a massive data breach. 
Equifax, Yahoo, OPM, it goes on and on, and it is now well-known that the bad guys aren't small-time crooks. Any Tom, Dick and Harry can try to steal a spreadsheet of credit card numbers. The biggest concern, I think, is that these days it's not Tom, Dick and Harry, it's Vlad, Boris and Oleg, and they are trying to find and exploit sensitive information on US citizens and companies. So our view is it is nothing short of irresponsible to not ensure that data collection efforts move hand-in-hand with data protection. My take is, with all the talent and wisdom out there, it ought to be possible to put data to work and protect it at the same time. I don't think those two are mutually exclusive, protecting data and putting it to work at the same time. A smart policy will get you both, and by the way, a policy that isn't too smart probably won't get you much of either. So these two really do need to be intertwined. That is the fundamental premise behind the bipartisan legislation I have with Senator Rubio from Florida in the Senate. As you may know, it's called the Student Right to Know Before You Go Act. The bill is about connecting student data with employment and earnings data in order to help answer the most important questions, the key questions about different colleges and programs. How likely are students to graduate? What are average debt levels? Are graduates able to find jobs and pay their student loans? And empowering students and parents with information is especially important when the cost of college is the second most expensive investment many young people are gonna have, next to buying a house. And I always try to put this in the context of how dramatic the changes have been since I was coming up. I went to school on a basketball scholarship dreaming of playing in the NBA, a ridiculous idea because I was too small, and I made up for it by being really slow. 
Going to college and paying for college was a big expense, but it wasn't a boulder riding on the back of college students for decades. To give you an idea of how ominous this is: Chris made mention of the fact that I'm the ranking Democrat on the Senate Finance Committee. I actually put in a bill long ago to keep senior citizens from having their retirement garnished because they had so much student loan debt. I put this in with Senator Brown, a very valuable member of the committee from Ohio. Just try to get your arms around that. This problem is now so serious that you can have an older person who would just like to have a dignified retirement, and they're sitting there faced with the prospect that their retirement is gonna be garnished because they're still walking around with that boulder on their back. So this is really high-stakes stuff, and as we know, there isn't the same market for holding down the cost of college that there is for many other goods and services. So we like to think that with this information, empowering people, if school A is doing a good job and school B is not doing a good job, school B is gonna have to clean up its act in a hurry or, in a more competitive landscape, they're not gonna stay in business, and school A better worry that at some point there'll be a school C that will knock school A off the top. So we really think that there are multiple benefits that come from this. Now, given that, we recognize that this information involves sensitive data sets, and so the most recent version of the bill does mandate the use of advanced encryption technology, which of course is what secure multi-party computation is about. The high-profile breaches that I mentioned show that contractual privacy protections will no longer suffice, and it's more important than ever to come up with an approach that secures sensitive data using state-of-the-art technology. That is what we are doing in the latest version of Know Before You Go. 
With multi-party computation, colleges don't lose control of individual data. The only information that is shared is information that is encrypted, and if hackers somehow get their hands on the encrypted data, I think it is fair to say they aren't gonna be able to make much sense out of it. Over the past decade, taxpayers have poured millions into developing this advanced encryption technology through defense and intelligence spending. So this may sound a little bit magical, but it is not some high-priced, pie-in-the-sky theoretical idea. Other countries, as well as the city of Boston, have shown this technology can be deployed in the public sector for everything from setting agricultural prices, to detecting tax fraud, to identifying gender pay gaps, and Chris touched on some of the array of applications. And I'll just close by way of saying, when I'm home, I do town hall meetings. I had made this pledge to do something nobody had ever done: go to every county, every year, and I've done hundreds of them, and what people constantly say is, what are you doing to get a better value for our tax dollars? What are you doing to squeeze more value out of the money that we spend in this area? And suffice it to say, we're spending vast sums on higher education, and taxpayers deserve a better return on their investment, and I think they can get it in the form of better encryption technology and advanced privacy protections. That is what we are trying to do in the bipartisan legislation, Senator Rubio and myself, and I look forward to hearing the results of your deliberations. I gather I have achieved a degree of notoriety in that one of the slides from one of your distinguished group of panelists actually highlights my infamous activities, but what you're doing is really important, and we're joking about multi-party computation and the like, but the reality is, at the end of the day, this is a chance to really get it right in technology. 
So much of what I talk about in technology, and I got into this in the late 90s, when really the only one in the Senate who knew how to use a computer was Pat Leahy, literally, and we had had this period where the late-night comics were talking about, hey, the folks who run these committees just said the internet is a series of tubes, and everybody started laughing and the like. And my take is, on technology, the more you learn, the less you know, and so it's important to have people like Chris and Laura who really have drilled down on this, and I guess BPC and several of you have put in a lot of sweat equity in terms of our getting this right. Laura advises me that at some point we'll have a higher education bill. I don't know; it's not gonna be a fast-moving year with the elections and the like, but we really wanna get this right and look forward to your feedback and hearing about the results of the program. Chris and New America, big thanks for having us, and we'll put this conversation in the to-be-continued department. Thanks everybody. So we'll move to our panel now. I'll have each of the panelists come up, introduce themselves. They'll provide some introductory remarks. Then we'll all get seated and have a panel discussion where I'll ask a few questions, and then we'll turn it over to you, the audience, for some further questions. You know, we really wanted to get a Danish beet farmer for the panel, but it would have been expensive to fly someone in from Denmark, and I'm sure if you're a beet farmer, you're very busy, you don't have time to participate in secure multi-party computations, but we found some other great people. Sorry, I'm done with the beet jokes. Starting with George Alter. Thanks, Chris. I'm George Alter, I'm a research professor at the Inter-University Consortium for Political and Social Research at the University of Michigan. Better known as ICPSR, we're a data repository and archive for the social sciences, and I'll leave it there. 
So, can whoever is running the slides get my slides up again? Do I do it from here? So, while they're getting that up, just so you know what you're gonna see: among the three of us on the panel, I offered to do an introduction to MPC, and so I'm gonna try to explain to you what it is and what it isn't, and then tell you a little bit about a project that we're doing with a small company called Stealth Software to actually build a system that we can test at scale. So, it looks like something's happening. Ah, the screen just timed out. Should I do anything here? No? Okay, I'm getting signaled that something's coming. Okay, so. Right. Okay. So, let me see if I can remember my first slide. The place I wanted to start was to say that we currently have a number of different procedures for dealing with confidential data. As a data archive, ICPSR has to deal with this all the time, and we have a number of procedures that we can use. We can anonymize the data, change the data so it can't be re-identified. We can try to make the people who use it safer by having them sign data use agreements, and we can put the data into secure places where we can protect it. MPC is not those things. MPC is for cases where we can't do that, and in particular, it's for cases where the data are in multiple locations, in different databases maintained by different organizations, and where those organizations cannot share with each other or with anyone else. So, MPC is not everything. It's a well-defined set of technologies, and it is specifically designed for the case where the databases cannot share with each other. And then I think I have a slide that's hard to describe about how we do this, so let me go on to another one that I can describe, which, here we go. Okay, good. So, this is what I was just telling you, that there are ways to deal with confidential data. 
One thing I didn't mention is having a trusted broker, somebody that everybody agrees can receive and protect the data, but all of these solutions assume that the data exists in a single place. MPC is for the case where the data can't exist in a single place because one or more of the parties are not allowed to share with anyone, including each other. Chris has already referred to this. There's the Danish sugar beet auction, but the short version of this slide is that cryptographers have been working on MPC and the algorithms behind it for quite a while now. MPC is not a single algorithm. It's actually a whole suite of algorithms that are used in different combinations. So, here's my slide which I got from Danny Goroff at the Sloan Foundation that explains how MPC works to people like me. So, the question, you know, the issue here is how can you compute the average incomes of three people without them revealing to each other what they earn? And the solution is that each of them generates two random numbers. Those are the R's, and they send those random numbers to the other parties. And so, we've got... So, the S's are the numbers that we want to protect, and then we create a new number X by adding in the random numbers that the party created themselves, and then subtracting out the two random numbers that they receive from others. And they share these encrypted numbers. Now, the cool thing about this is that when you add together these three numbers, the randomness all cancels out and you get the right answer. So, that's how... This is one type of MPC algorithm. And what's cool about it is that they pass these random numbers back and forth, but the random numbers are under the control of the original owners of the data, and because none of the other parties knows all of the random numbers generated by each party, the other parties can't... Nobody can decrypt things. So, what MPC does is it puts the data owner in charge of the encryption. 
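The random-number scheme on that slide can be sketched in a few lines of Python. This is a toy illustration with made-up incomes; a real protocol would work modulo a large prime and exchange the masks over secure, authenticated channels:

```python
import random

# Each party sends one random number to every other party, then
# publishes its secret masked by (masks it sent) - (masks it received).
incomes = {"A": 50_000, "B": 72_000, "C": 91_000}  # illustrative only
parties = list(incomes)

# r[i][j] is the random number party i generates and sends to party j.
r = {i: {j: random.randrange(10**9) for j in parties if j != i}
     for i in parties}

def masked(i):
    """Party i's public value X_i = S_i + (masks sent) - (masks received)."""
    sent = sum(r[i].values())
    received = sum(r[j][i] for j in parties if j != i)
    return incomes[i] + sent - received

# Every mask appears once with + (at its sender) and once with - (at
# its receiver), so summing the X's cancels all the randomness.
total = sum(masked(p) for p in parties)
average = total / len(parties)
assert average == 71_000
```

Each X value looks random on its own, so publishing it reveals nothing about the underlying income; only the data owners ever see their own masks.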
They create the encryption themselves. They hold the encryption keys themselves. No one outside sees unencrypted data. And this is all based on math, so you can make the algorithms public and demonstrate what you're doing mathematically. Two other things that are characteristic of MPC are that the calculations can be done in real time, so you get an immediate result, and that the results are exact. They're not approximations. So, the security comes from the encryption. It does not come from adding noise, which happens in some other data-protection techniques. What doesn't MPC do? MPC tells you that the calculation that you did is exact and that it was protected in transit. But it does not prevent the calculation that you do from revealing information that you don't want to let out. So, this is the slide that the senator referred to. There are a lot of computations you can do where computing averages or tables could actually be used to identify a person. That's not what MPC does. MPC protects the way that the computations are done. Now, MPC can be combined with other techniques. So, there is a large literature now on differential privacy, which is about this problem: when you've got a result, how do you make sure that the result does not leak information? And differential privacy and MPC are two different things, but they can actually work together. Because of the way MPC works, you actually have to build a new application from the ground up. MPC is not something where you can take algorithms and modules in R or Stata or whatever your favorite stat packages are and just add MPC on. You actually have to design it from the bottom up. And one of the debates is about the expense of MPC, because MPC works by sending a lot of messages back and forth that are all encrypted, and, as you saw before, for doing a simple thing you end up sending a lot of messages. 
And there is a question about how much this will cost. It's difficult to estimate those things a priori because the messaging actually depends on the data, not just on the algorithm. So what we're doing in our project which is funded by the Arnold Foundation is we're actually building a system to do MPC computations and we're running it on synthetic data. Why synthetic data? Basically because no one in their right mind would give real confidential data to a project that is testing new technologies. So we're testing the technology on synthetic data where we can put it out in the world and let other people have access to it and run it. So this is the problem that we're doing in our current prototype which is very similar to what the senator was talking about. We've set up a model system that has four databases that represent schools and in the schools there are students and the students all have IDs and they have majors and they have other characteristics which we're not using now but we will add later. And then those students can then be linked to a separate database that has income data from three points in time and another database that has loan data. So we've got six databases that can now be linked through ID numbers and then there's a seventh database that serves as the interface for doing the calculations. And we've run this now on synthetic data ranging in size from 1,000 cases up to 250,000 cases. In the future we've actually generated 10 million cases but we didn't get up that high before today. So I know you can't read this but this is what it looks like, the website currently looks like and maybe you can at least get a sense of this. What's happening here is for each combination of a school and a degree we generate an average loan amount, an average income, two years after degree, three years after degree and ten years after degree. 
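The statistic the prototype publishes can be sketched in the clear on synthetic rows. The table contents and field names here are made up for illustration; in the actual system the IDs are matched and the aggregates accumulated on encrypted values, so no party ever sees the joined data:

```python
from collections import defaultdict
from statistics import mean

# Synthetic stand-ins for the separately held databases (made-up rows).
school_db = [  # held by the schools: student ID, school, major
    ("s1", "School A", "Biology"),
    ("s2", "School A", "Biology"),
    ("s3", "School B", "History"),
]
income_db = {"s1": 52_000, "s2": 61_000, "s3": 48_000}  # income 2 yrs out
loan_db = {"s1": 18_000, "s2": 25_000, "s3": 12_000}    # loan balances

# The published result: per (school, degree) average loan and income.
groups = defaultdict(lambda: {"income": [], "loan": []})
for sid, school, major in school_db:
    g = groups[(school, major)]
    g["income"].append(income_db[sid])
    g["loan"].append(loan_db[sid])

report = {k: (mean(v["loan"]), mean(v["income"])) for k, v in groups.items()}
assert report[("School A", "Biology")] == (21_500, 56_500)
```

Under MPC the linkage by ID and the per-group sums are exactly what drives the message traffic, which is why the cost depends on the data as well as the algorithm.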
And so this is the front end of the system that we're running now, and, as I said, we've run this on cases of 10,000, 50,000, 100,000 and 250,000 cases to see how long it takes. These graphs are the total amount of time it takes for different servers to complete their task, and what we see here is that we think the result is pretty close to being linear. It's a little bit non-linear here; we think that might be more the implementation than the algorithm. But in terms of understanding what's going to happen to the costs in the future, what our project is about is actually trying these things at scale so we can say what the costs will be in the future. And here's my name and a URL to my partners at Stealth Software. Thank you. Hi, I'm Ben Kreuter. I'm hoping... No? Yes. Okay. Hi, I'm Ben Kreuter. I work at Google on a team that designs and helps to deploy MPC applications to protect user data. So we're also deploying MPC at scale. And, you know, it's been mentioned a few times: MPC research predates me as a person. In fact, it started before I was born. And it's only been recently that we've seen applications get deployed, but something I think is interesting about that is most of the work post the early 90s was more on the engineering, on making these things fast enough or practical enough to use. The actual theory has been fairly well established for some time. And one reason why this has worked so well, one reason why more and more applications are getting deployed, is that the theoretical model for security actually very nicely captures a large class of real-world problems. So, unfortunately, there's a rendering error here. A common approach to deal with an MPC-type problem would be to do something that was mentioned earlier: find some trusted party and just hand them all the data and have them do whatever the calculation is, and then they'll tell us all the results. 
And we'll hope that this party is a good steward of the data and doesn't collude with someone or otherwise violate our trust. But unfortunately, as Senator Wyden mentioned, this can fail. The party could be hacked, or there could be an insider. So, this is actually very bad. All of our data has been concentrated in one place, and now we have no idea what's going to happen to it. One of the unfortunate facts about computers and the Internet is that once data is out, it's sort of out forever. It's very hard to gather it all back again. The Internet is sort of the world's best copy machine. And so, we want to avoid having trusted parties that accumulate this much data in one place. And so, enter MPC. The idea in MPC is we'll emulate a perfectly trustworthy and completely unhackable party by interacting with each other, by just sending messages back and forth in some prescribed way, and then out of this interaction will emerge the answer. There's a lot of math hiding behind those statements. That's a bit beyond the scope of this, but that's the sort of high-level concept. And what's nice about this is that this is a very, very strong security model. It's not the typical cat-and-mouse approach where we design something, someone breaks it, and then we spend the money to design a new thing. We'll actually prove mathematically that this has this property. And the interesting thing about it is that the most anyone can do to cheat in these systems is to lie about their input. So, I can always claim to be a billionaire. You don't really have a way to show that I'm not. And then the system will output whatever the correct answer was for that input. But I can't do more. I can't somehow trick these systems into revealing your salary to me if I wasn't supposed to learn it, or telling you an answer that's not consistent with the input I gave. And this is true even if I'm in some kind of conspiracy with other parties in the system. 
So, in theory we can design a system where you only need one party to behave honestly, even if there's a thousand other parties. And you'll still get the right answer and you'll still protect that one party's privacy. So, it's a very strong model, and it's very nice to be able to make a strong statement about a computer security system like that. Now, when we actually deploy it, you know, it's still a little early to say what the general lessons on deployments are, but there's a few points that have emerged, at least from my experience, which is admittedly limited. And one that I think is sort of a key point that should be taken away is it's very expensive to run these. In particular, the amount of communication you need is very expensive. So, it would be very easy, for example, to exhaust your entire data plan and still not even get one computation done. So, we have to be very careful when we design these to not do that. It's also not really well suited for real time, and I define real time as order of seconds or milliseconds. It's not well suited to that. It's better if you have some sort of offline, in-the-background analysis that you want to do on some data, if you're willing to wait minutes or hours, maybe not days, but minutes or hours at least. And it's very important to keep our parties separate. The whole point here was we didn't want to concentrate all of our data in one place. So, if we take an MPC system and put all of the parties in one place, we haven't really accomplished that. All we've done is put the data in one place and treat it somewhat differently. So, we do want to keep them separate. So, I'll go through a brief example to sort of illustrate what it's like. Great, it rendered correctly. So, this is something that we did at Google. The idea is your phone has some model for your keyboard, so when you type, it can predict what you're trying to type. 
And it turns out if we combine models from many phones, we can get an even better model. Now, the flip side, the reason why we want MPC here, is that the keyboard model also reveals what you typed in the past. So, that's somewhat sensitive and we don't want to share that. Now, in principle, we could just have a bunch of phones communicate with each other and run an MPC amongst themselves to accomplish this. Unfortunately, we'll very quickly run into limitations on communication and computation. So, something that we'll do, and something that George had on his slides but didn't call out, is we'll introduce a service provider. We'll have someone take on the heavier computational work so that we don't run out of the phones' capabilities and resources. But, in this setting, we'll only have one server. So, this is still going to require the phones to work together a bit. And in particular, they're just going to set things up among themselves. They'll do some lighter-weight thing where they set up some kind of encryption, and then they can upload their encrypted models to our service provider, which then does the much heavier work of combining all the models, which will still be encrypted, so we still won't actually reveal anything. And then the service provider can just tell everyone what the updated model is. Since that was sort of the public output, we're okay if the service provider learns that. And there's more details, of course. You can read our research papers on it, but that's the idea. And this is what I think a typical practical MPC system today looks like. There's a few counterexamples of these peer-to-peer models and other models, but this is fairly typical. The sugar beet auction had this client-server type approach. George's example had seven servers, and we have this here. And so a few other things. As I mentioned, the server will not see any inputs directly. The cost for the clients is quite low. 
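The lighter-weight setup step can be sketched with pairwise masks. This is a hedged toy version with made-up update vectors: here the masks are shared directly between clients, whereas a real deployment would derive them from a key agreement, and model updates would be large floating-point vectors rather than small integers:

```python
import itertools
import random

MOD = 2**31 - 1  # arithmetic modulo a fixed value keeps masked sums exact

# Hypothetical model updates from three phones (tiny vectors for clarity).
updates = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
n, dim = len(updates), len(updates[0])

# Each pair of clients agrees on a shared random mask vector; the
# lower-numbered client adds it and the higher-numbered one subtracts it.
masks = {(i, j): [random.randrange(MOD) for _ in range(dim)]
         for i, j in itertools.combinations(range(n), 2)}

def masked_update(i):
    """What client i uploads: its update plus/minus all its pair masks."""
    out = list(updates[i])
    for (a, b), m in masks.items():
        for k in range(dim):
            if a == i:
                out[k] = (out[k] + m[k]) % MOD
            elif b == i:
                out[k] = (out[k] - m[k]) % MOD
    return out

# The server sees only masked vectors, yet the pair masks cancel in the
# sum, so it recovers exactly the combined model update.
server_sum = [sum(masked_update(i)[k] for i in range(n)) % MOD
              for k in range(dim)]
assert server_sum == [12, 15, 18]
```

Each client's work is just generating masks and one upload, so the client-side cost stays small.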
Low enough to satisfy the constraints we were given. And it's actually easy to deploy when we only have one server. I suspect that when you have seven or three or ten, coordinating that many different parties and making sure they're actually separate can be a bit of a challenge. With just one, of course, you only need to do it yourself, so that's nice. And yeah, this is real, so it's actually being done. And thank you, and I look forward to the discussion. Mine worked right away. So I think I've been set up really well, because I'm Amy O'Hara. I'm currently at Stanford University. As George said, the people that are doing this, they would be out of their mind to share data and test this with live data. Well, that's what I'm doing at Stanford University: figuring out how you could do that. And following up on what Ben said, you want to break away from the model where there's a trusted third party. Prior to joining Stanford last summer, I was at the U.S. Census Bureau for 14 years, where I acted as the trusted third party. So I'm actually seeing this from both sides here, and, speaking to the implementation issues for what federal agencies would need to do, I think I'm pretty well-versed in some of the policy-relevant issues that would need to be addressed to deploy this. And it's very exciting, and I want to see this happen. So I can start with that. But the complexity of the questions was already brought up. If you're just drawing data together, like the sugar beets, it's pretty straightforward. But what if I wanted to do modeling? What if I wanted to run regressions? You've already heard that this is going to be computationally intensive, it's going to be expensive, and who is going to bear that cost? So knowing what questions you want to answer and then determining whether this is a good application, that's the first critical step. They also mentioned where the data are. In George's example, there were four schools. 
There was someone that had the student loan data and there was someone that had the income data. So that means you need six parties to agree to this. And I think that you need to frame it in terms of data access. But many of these parties are currently viewing this in terms of data sharing, because they were giving the data to a trusted third party like I was. So really the entire legal construct around what sharing the data among these units means, that needs to be revisited. I listed there: what about if you're using two agencies? That's pretty straightforward. Ben's example had three parties there. Well, what if you're doing it in every state? What if you're doing it across the 15,000 law enforcement agencies in the country? That's a really big number. Let's scale it back. What about the 1,700 state prisons? Still a pretty big number. Let's dial it back. What about the 170 VA hospitals? Is that a small enough number? Well, it's still going to be a big undertaking, I fear, whenever you try to envision what that sharing looks like and also what the obstacles will be. So no matter how many parties you're talking about, they need to have legal agreements that say: we are going to engage in this. These are the terms of use. These are the retention properties. This is what happens in the event of a spill or a breach. The financial considerations: who is going to pay for this? In Ben's diagram, each one of those parties agreed to do their own pre-work on the data to make sure that they would have known schema documentation so that the secure multi-party computation would actually work. Who's going to pay for that? Imagine in a federal agency right now, everybody already has a job. So this is a new job for someone that would need to figure out how to get new software in, potentially a new server in, and get that all working to test something that, George pointed out, you would have to be out of your mind to do with real data. 
So what needs to be done to build the confidence, to have people willing to take that risk, to have trust both in the parties that are in this computation example and in the methodology itself? And the last point there, the cultural aspect. Whenever you're dealing especially with government agencies, they are very willing to say, that is not what we do. And often it's not because they can't do it due to a legal prohibition, though I have listed the three main ones there: if you're in the federal government, you often trip over HIPAA when you're dealing with protected health information. You often deal with FERPA with education data. And if you're looking to measure earnings and employment, and you want to try to use the IRS data, you deal with Title 26 of the U.S. Code. And there are a lot of legal issues with who can use the data there. So: the legal issues, the financial issues of how we're going to pay for this, and the cultural issues of that's-not-the-way-we-do-it. Because historically, the data are pulled together by a trusted third party, and they are analyzed in the clear, and that carries a great deal of risk, both in the way that the data are drawn together and in having to trust the individuals or the parties that are doing that computation for you. Where is secure multi-party computation really going to shine? It's when you have these data joins that make you wince a little bit. The places that have data and are really adamant that those data should not be shared further. The placement, the address where foster children are living, that is held very tightly. But what if you wanted to match it to the list of addresses where there are sex offenders? And this was an actual example in Baltimore City, where they needed to do that in the clear. But secure multi-party computation would enable this to be done with the encrypted data.
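To make the encrypted matching idea concrete, here is a toy Diffie-Hellman-style private set intersection, the kind of primitive that could let two agencies find shared addresses without either side seeing the other's list. This is a sketch under assumptions, not a real deployment: the prime, the hash-to-number step, and the addresses are all invented for illustration, and a production protocol would use a vetted hash-to-group construction and audited parameters.

```python
import hashlib
import secrets

P = 2**127 - 1  # a Mersenne prime, adequate for a demo

def blind(records, secret):
    """Hash each record to a number, then raise it to a secret exponent mod P."""
    out = set()
    for r in records:
        h = int.from_bytes(hashlib.sha256(r.encode()).digest(), "big") % P
        out.add(pow(h, secret, P))
    return out

# Each side picks its own secret exponent and never reveals it.
a = secrets.randbelow(P - 2) + 1
b = secrets.randbelow(P - 2) + 1

foster_addresses = {"12 Elm St", "40 Oak Ave", "7 Pine Rd"}
offender_addresses = {"40 Oak Ave", "99 Main St"}

# Round 1: each side sends its blinded set to the other.
# Round 2: each side blinds the received values with its own exponent.
# Exponentiation commutes, so a shared address produces the same
# double-blinded value on both sides: (H(x)^a)^b == (H(x)^b)^a mod P.
double_blinded_a = {pow(v, b, P) for v in blind(foster_addresses, a)}
double_blinded_b = {pow(v, a, P) for v in blind(offender_addresses, b)}

print(len(double_blinded_a & double_blinded_b))  # 1 shared address found
```

Each side only ever sees blinded values; the raw addresses never leave their holders. Real PSI protocols add protections this sketch omits, such as defenses against a participant who deviates from the protocol.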
And it may make it more palatable to the individuals, the parties that hold those data, to do a linkage that could be very valuable in terms of child safety. What about public health departments? They all have their surveillance data on sexually transmitted diseases, or on HIV. They can't share with each other. So if they're prohibited from doing that, could secure multi-party computation help them take that step forward to improve public health? Outcomes for community college students: who has the information on the enrollment, and on the outcomes, and on the financial outcomes? That's one that we've heard several times. That's just really hard to do now, if not impossible. And the final one, a safe haven for crimes against businesses. You hear in the news, when the businesses feel like sharing it, that there has been a hack at a corporation, or that there has been this financial fraud, or that they had their data held hostage. They don't want to tell the public about it. They don't want to tell each other about it. But all of those corporations would like to know: is this happening to my peers? And right now, there's no safe place. They're not all going to volunteer their data up to the feds. But could secure multi-party computation provide that safe place, where the companies that are experiencing this sort of fraud, these cyber incidents, would have a safe place to share that information amongst each other, and perhaps roll up aggregated information to the feds instead of exposing that sort of information? So here's another example that I bring up just to show you the range of legal agreements that would need to be addressed. I'm going to start at the bottom. This is an opioid addiction example, so bear with me. You start at the bottom, and you could potentially blame Medicaid for paying for a lot of prescriptions; with over-prescribing, you get people addicted to opiates. Okay?
And then, say it's me, say I have an overdose, and the emergency response team comes and I get Narcan. And Medicaid is paying for approximately 7,000 of these kits a year. So they're paying for them; some of them succeed in reversing the overdose, some of them fail. But let's say that I'm one of the ones that it succeeds on. So I have been brought back and I'm looking for a treatment center, and I go to SAMHSA, which is a government website, and they have a list of the treatment providers. But to get to SAMHSA, I probably went to Google, and I probably looked for a treatment provider, and it would refer me to a lot of the listings there, especially the treatment providers that have bought ads. Last fall, in September, Google suspended the ads for addiction treatment centers, because they found out that there were some disreputable companies that were buying these ads and pulling in these people, and it was just a bad overall outcome, so they suspended ads. In yesterday's news, Google announced that they are going to start bringing in the addiction center ads again, because they have found a third party that's going to vet them. So I would want to understand more about this. I would need data from federal agencies. I would need data from emergency response teams in potentially every county, if not below that. I would need data from Google, a private company. Think of the legal agreements that would need to be made so that you would be able to pull the right strands of information from these different players, and think of the amount of pre-work that each one of them would need to do to make their data comparable for the secure multi-party computation. And that just makes it sound really hard. So I would like to go to a simpler example. A simpler example involving IRS data and Department of Education data, which is kind of hard to believe is a simple example, but there are only two parties. Could they have data access agreements?
And the answer is yes. Could there be will? Well, obviously, we heard earlier that these are the sorts of data that would be needed to understand and inform the Student Right to Know Before You Go Act. So how could this work? You would need staff at both agencies to prepare their data and document their data. The lawyers would need to understand that this is going to happen. They would need to have servers dedicated to this. You would need to have that third server in the middle ready to do it, but it's not impossible. Especially if it were going to be done periodically instead of in real time, this could actually happen. So what is preventing it from happening, potentially? And this is ironic, given the Student Right to Know Before You Go: this is a different use of the word no. Who can say no to this? The lawyers will say no, almost definitely. They will say, no, no, we can't do that with our data. The information security officers at both the Department of Education and the Department of Treasury, IRS, they are likely to say, no, that's not how we do it. We're not going to get a new piece of equipment, we're not going to put the data on it, and we're certainly not going to transmit it into this central cloud as Ben had shown. The privacy officers are going to say, that's not what we do with the data. They're going to pull out FERPA. They're going to pull out the Internal Revenue Code, and they're going to be looking for language in there that would allow them to say no. The program administrators, do they say no? There are different reasons that people say no. And the leadership is probably likely to say yes, but with all of these cumulative no's from the people beneath them, they are going to be tipped toward saying, maybe we shouldn't do this. But really, what are they saying no to? Are they saying no to the fact that you would want to look at outcomes? Are they saying no to the way that we would be measuring the outcomes for students from federal employment and earnings data?
So it's really understanding what people are saying no to, and whether they're saying no because of cultural issues (I've never done this before; that's not how we've done it in the past), or due to legal issues. And are those legal issues firm, or are they a little wiggly? Was this an interpretation of statute or an interpretation of regs? So is there any malleability there? Or is there a need for a change in regs or a change in statute? So what is really blocking people from wanting to do this? The privacy officers, it's the same thing: if the lawyers read the statute in a certain way, they are likely to go along with getting to yes. The program administrators, they're the ones that really have a heavy lift here. They have to take their data, which might be on a server, might be in that bucket, might be backed up to tape, and they've got to figure out how to get it all together, get it all documented, and have it available for this sort of use. So they really need to deploy staff to making this work. But the payoff is so great, because instead of having the data exposed, instead of doing these calculations in the clear, instead of having to bump up against FERPA saying the data can't go anywhere, and IRS saying the data must go to a place that's protected in this way, you may be able to come up with an incredibly powerful strategy to make these calculations across projects where in the past they were simply saying no. So I'm very hopeful. Thanks, everyone, for those great presentations. So let me just sort of start with a baseline question. What can we do within an MPC protocol? Obviously we can perform math calculations, statistics, and I assume most database-type commands, joins and such. Is there a limit, either theoretically or in practice? Can we do a logistic regression like Amy had suggested? So in theory, we can do anything. That's a very old result. That, in fact, was one of the first results in MPC.
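That classical feasibility result rests on simple primitives. One of them, additive secret sharing, fits in a few lines of Python; here is a minimal sketch, with the modulus, party count, and salary figures invented for illustration:

```python
import secrets

# Additive secret sharing: each input is split into random-looking shares
# that individually reveal nothing, yet sums can still be computed.
MODULUS = 2**61 - 1

def share(value, n_parties):
    """Split value into n random shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares):
    return sum(shares) % MODULUS

# Three parties each secret-share a private input (say, a salary) and
# hand one share to each of three servers.
inputs = [52_000, 61_000, 48_500]
shared = [share(v, 3) for v in inputs]

# Server i adds up the i-th share of every input -- purely local work on
# values that look random in isolation.
partial_sums = [sum(s[i] for s in shared) % MODULUS for i in range(3)]

# Only recombining all three partial sums reveals the total, never the
# individual inputs.
print(reconstruct(partial_sums))  # 161500
```

Addition comes almost for free this way; the expense the panel keeps returning to shows up when protocols need multiplications and comparisons, which require interaction between the servers.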
So it really comes down to how much you're willing to pay to do it. If you have an unlimited budget for resources, then whatever you want. For the example you gave, logistic regression, a colleague of mine actually has a paper out very recently on doing it, and they give examples of what it costs to do for particular-sized data sets. So that one I think is promising for practicality. It is not hard to come up with examples that are not promising. There's been research showing pathological cases where we can create computations that are just terrible and will sort of maximize the amount you're going to have to pay. They're not useful computations, but they exist. So it's hard to predict: if I don't know the computation, I can't really tell you. So let me just say that the project we're on is planning to create a suite of statistics that we'll calculate. We started with means, but we're planning to go to regression and to logistic regression. That's sort of the last one on our list for the current project. Based on what we've done so far, I have a somewhat different view than Ben: it looks to me like the costs are manageable. That obviously depends on the application, and what you're calculating, and what the data look like, but I think our first results are encouraging that the costs are going to be manageable. I do think, though, as both of them mentioned, you might not get the results in real time. In the system we're designing, some things will probably be set up such that you put in a request and then you get an email when your results are ready, but for a lot of things that's not too bad. So in terms of cost, there are significant bandwidth requirements and computing requirements, and also the development costs to develop the protocol. What's the major thing driving costs there, or is it just all those things? So we're still in the first year of a three-year project, so we'll see how far it goes.
I don't think that the development costs and the machine costs are, in the end, going to be the limitation. I think what Amy was talking about is going to be the limitation: the social process of getting this adopted is going to take more time, and I think we have to look at it as a process. We're not going to roll this thing out and then have people just jump on board. I think it's very important to have public demonstrations where we can say, this is what's involved, this is what it looks like, you can work with it. For some of the legal issues that Amy was referring to, we're going to have to convince people that we're saying something different. What we're used to doing is writing legal agreements about how you transfer confidential data in a secure way. What we have to explain to people is that we're getting a computation at the end, but the things that we're releasing are not confidential, because they're encrypted, and I'm not sure that we have examples of legal agreements to do that. So I actually think that it's going to be the social process of explaining this to people, and getting them over the "we don't do that here," that's going to be the hard part. So I think I disagree on the cost, and discussions aren't interesting if everyone agrees. If you have a practical system, you have to talk about the cost of deployment, and the cost of communication, I think, is what will invariably become the bottleneck, especially for protocols that require more communication.
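A rough back-of-envelope shows why communication, rather than computation, tends to dominate in circuit-based protocols. Every constant below is an assumption for illustration: roughly 32 bytes of traffic per AND gate is in the ballpark of garbled-circuit constructions, and the comparator count is the standard formula for a bitonic sorting network, but real figures vary widely by protocol.

```python
import math

def cleartext_bytes(n, record_bytes=64):
    """Cost of simply sending the list in the clear."""
    return n * record_bytes

def mpc_sort_traffic(n, record_bits=512, bytes_per_and_gate=32):
    """Assumed protocol traffic for obliviously sorting n records."""
    log_n = math.log2(n)
    # Bitonic sorting network: ~ (n/2) * log2(n) * (log2(n)+1) / 2
    # compare-exchange units.
    comparators = (n / 2) * log_n * (log_n + 1) / 2
    # Assume each compare-exchange needs roughly record_bits AND gates.
    return int(comparators * record_bits * bytes_per_and_gate)

n = 1_000_000
print(f"in the clear: {cleartext_bytes(n) / 1e6:.0f} MB")
print(f"inside MPC:   {mpc_sort_traffic(n) / 1e12:.1f} TB")
```

Under these assumptions, privately sorting a million 64-byte records generates tens of thousands of times more network traffic than shipping the list in the clear, which is exactly the sense in which communication, not CPU, becomes the bottleneck.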
An easy way to understand that is that it's a lot less expensive to buy a computer than to lay a fiber-optic line, so it's much easier to get more computation resources than it is to get more communication resources. And depending on the setting, the cost can be higher or lower. In the example I gave with consumer devices, cell phones, the communication cost can be very high, and actually we won't run that protocol if you're not on Wi-Fi, because on your data plan it would be far too prohibitive. If you have a data center, you have more room for communication, but you also have a lot more competition for that communication. Whenever Google runs one of these protocols, it is using resources that other business applications want to use, and again, it's just very expensive to increase the capacity of the network. As to the point of development cost and the cost of convincing everyone else: development costs can be high. I mean, finding experts in the field is not all that easy; it's not something as simple as putting out an ad in the newspaper and having a line of people at your door. But once you've designed the system, it's sort of a one-time cost; you should be able to just keep using it indefinitely. And likewise, once you've convinced the relevant people that this is something that can be done and should be done, hopefully you don't have to convince them again after it's been used a couple of times. So I think there's a difference between the one-time costs and the sort of ongoing costs for resources as you use it more and more. So I'm a skeptic on a lot of things, but I will focus on his last point. When the parties are preparing their data to go in there, using the example of community colleges looking for outcomes: the first time they do it, they put all of their data in for real, and then the results that come out look a little sketchy; it's hard to explain to their leadership, it's hard to explain to their
constituents. So maybe the next time they do it, they don't quite put all of the data in. How are you trusting that the parties are really putting the right data in, the unperturbed, full universe of data? And how do you build that trust, given that they're retaining the control? How do you know that they're not going to game the inputs so that they don't get an embarrassing output? That's beyond the scope of what MPC can give you, right? So I have a question along these lines. I have a bit of a data research background; I used to work in epidemiology, and I know how hard it can be, when you're bringing together data, to match the data and to clean the data. So when I talk to other people who are in data research and explain MPC to them, it seems very black box, which it is by design. But they're very concerned about, well, how do I know that the results I'm getting out the other side are correct when I can't see any of the inputs that I've got coming from multiple sources? How do we ensure data quality? How do I clean the data and make sure that data matches are happening successfully? Because a lot of times, when you do this work, you want to look at intermediate data sets to see whether what you're doing is correct, but you can't do that with MPC. So how do we best deal with those challenges of data quality and verification of the results? So you have to spend some time ensuring that what you're computing is the thing you want. You'll debug it, you'll make sure that this function does not do the wrong thing. But then, once it's in the MPC, as a security measure, you should not be able to see intermediate states; that's sort of the point. I think we're also getting a bit more into philosophical questions when you talk about whether or not the inputs themselves are correct. Unless you have some authority that goes around certifying everyone's input, it's very hard to say, with or without MPC. It's just generally hard
to say whether or not people are lying. I mean, if I said I was a billionaire, would you believe me? Maybe I am. So I have a somewhat different take on that. I think that, first of all, one of the things it's going to take, if you want to go beyond really simple things, is that people in the data centers are going to have to do a good job of documenting their data and describing it for others. In our experience, we receive data from a lot of researchers, and the quality of the documentation we get is very variable, to say the least. Documenting something well for others to see is not a difficult thing; it's often tedious, but it can be done. But Ben, I think you missed what I see in Chris's comment, which is that in a lot of calculations and statistics, what you do is a lot of looking at the data from different perspectives to see if the data makes sense, and some things you can catch that way. So if the community colleges suddenly change from one year to the next, you'll probably pick that up. But when you move beyond the MPC to the differential privacy issues, then you might run into trouble, because in doing those tasks you look at every stage of the calculation, at frequencies and things like that, and you might begin to get into a problem with differential privacy. So that, I think, is an issue. Most of the applications we've been talking about, though, most of the ones that Amy was discussing, and the places where I would expect this to be done, are where, at least at first, the computations are relatively simple and the database provider is someone you can trust. So I think that's how this will evolve: we'll start with some simple applications where we can have some transparency, and as people get understanding and more comfort with the process, I think we'll move to more complex things. I'd like to mention that at Stanford, the director of the crypto lab, Dan Boneh, has come out with an application
called Bulletproofs, which is trying to build that trust between the parties that are engaging in a number of different applications, including secure multi-party computation. So it is an active area of research to address your question of how we can get faith in the process. Just to be clear, though: it's fine if you want to set up your MPC to give you the intermediate results; you now call them outputs instead of intermediate results. But the security definition itself should stop you from ever seeing how you got to the output, other than the fact that this was the function that you used. I shouldn't be able to pull out some half-computed state and examine it; even if I'm trying to debug the MPC, it just won't let me. Picking up on differential privacy, maybe we can quickly talk a little bit more about how that fits in here, and how much the further use of MPC hinges on differential privacy. Well, it depends on what you compute, right? What MPC guarantees is that the function you expected to compute is what you compute. It doesn't tell you what the function itself reveals, and that's where the differential privacy question comes in. So if, for example, you're computing an average of salaries, you should also have differential privacy in there, because an average of salaries does reveal something. Not your whole salary as the input, but it does reveal something about your salary. In a degenerate case, if it was a two-person average, it would reveal the other person's salary to you if you did not use differential privacy. And as I tried to say in my presentation, these are two different things that can be combined. I know that there are people in Michigan and other places who are working on systems that can actually detect the differential privacy problems on the fly and then add noise to protect the results, and I think that eventually there will be a melding between those two technologies, where the MPC will solve the problem of data that cannot be released in the clear, and
then the differential privacy will monitor the results to make sure that the end results are not disclosive. I think it's going to be really tricky how you word the legal agreements about the data access, because you can't make the claim that with the secure multi-party computation the outputs are confidential. So it's really going to need to be worded carefully and explained to the attorneys very well. But you can make a claim about who sees which output. So it is possible, for instance, to have a party that exists only to help you with the MPC: it has no input and has no view of the output. And I'm not a lawyer; I don't know what legal implications there would be for including such a party. Technically, cryptographically speaking, that party has learned nothing, and so it shouldn't even be relevant, but that's not how lawyers approach these things. So in privacy and security, nothing's really a silver bullet, although sometimes MPC sounds that way. What types of ways of compromising MPC, or attacks on MPC, are there that we have to worry about? So that's a somewhat deep question; it gets into specifics about the security model. We've spoken of MPC as if there were one security model; there's really more than one, and some of them are more attackable than others. The problem here is that the stronger the security model, so the less able you are to attack it directly, the more expensive it tends to be to use a protocol that satisfies it. And there are also some interesting questions that keep people like me up at night when you start to think about concurrent uses of MPC. A simple way to think of that is: I cannot personally beat a chess grandmaster, but if I play two of them at the same time, I'll either beat one of them or I'll play to a draw against both; I'll sort of forward each one's moves to the other board. And with MPC we have similar problems that can come up, where I start taking messages from one session and then somehow use them in another one. So this is a more
theoretical attack. Of course, there are also implementation attacks; you can just implement it wrong. So I want to turn it over to the audience for some questions. Joe will come around with the mic; wait for the mic for a second. Yeah, are these based on the basic algorithms, some of the prime number theory, large primes and things like this, and is that why the communication cost is so high, that if you're transmitting information that encodes that, it takes an awful lot of bits? So to the first part, are these protocols based on the number-theoretic constructions we see in other kinds of cryptography: sometimes it sort of has to be, if you only have two parties, but if you have more, you don't necessarily have to do that. As for what makes it expensive, that is actually not the thing that makes it expensive; we have a number of techniques for minimizing our use of that sort of cryptography. What makes it expensive is that in many cases the communication amount will grow with the amount of computation that you do. So imagine the work needed to compute a sorted list of names: that's going to scale, as the list increases in length, by a factor that's greater than the amount that the list has increased in length. And so if your protocol has to communicate as much as that computation, it's worse than just sending the list in the clear. It's a somewhat technical explanation, but unfortunately, for the most part, protocols that support arbitrary computation have this property that your communication grows with the computation. Not all, but many, and in particular the ones that are the most promising for efficiency and practicality. They're in the very back. Hi, both George and Amy mentioned the importance of documentation for the datasets that are involved in this, and I was wondering if you could talk a little bit more about the specificity of the metadata that's needed and the comparability, and also, Amy, some of the cultural barriers to making that happen. So the specificity has to do with which of the
elements are going to be used in the computation. In the example of name, I'll stick to the IRS example: IRS captures name in one string, so if there is a return that is married filing jointly, both of the individuals are in this very long string, and it needs to be parsed. And if you were going to match that to data from Federal Student Aid, that probably has name as last, middle initial, and first. Social security numbers, you assume, are going to be pretty consistent, pretty clean, but if they're missing, do they have it filled with zeros? Do they have it filled with nines? Do they have it blanked? What are you going to do if there are only eight of nine digits? So it's all of these issues that the documentation needs to be very crisp on. And it also has to do with the data elements that you want in the computation: the amount of earnings, which fields are you going to get that from? Fortunately, the IRS data don't change a lot year to year, but if the FAFSA form did change, you're going to need to understand the delta on that schema, so that you don't just keep feeding information in and having it spit out the wrong results. So that's crucial: you need to know which data elements you need to do the join, which data elements are going to be part of the computation, and you need to be able to track any changes that occur over time. You've talked a lot about the costs, but of course costs are relative; I haven't heard much about the opportunity costs of not doing it, the possible gains. And then the other issue is the diffusion, getting folks just culturally to accept that this is important, even if it's not my job; that's at the program or agency level, and perhaps at the political level. But I just wondered: you all seem extremely talented at numbers and programming and the math behind this and computations. Where are the lawyers involved in your project, so that you start identifying which ones will become the sort of patent lawyers of this specialty? Where are the innovation diffusion
anthropologists, back to the diffusion of agricultural technology and finding peer networks and all that literature? Is there thinking in your groups, maybe yours at Stanford, about how to mobilize those kinds of expertise to start to build this out? Because it seems to me it's really very promising. It's easy to think about the technology transfer, that you can do it in sugar beets and you're going to jump it over to understanding something about education. I like to describe it as the governance process transfer that has really been absent. That has to do with the templates, the language, the terms and conditions in these agreements, information on whether it's data access or data sharing. That's the sort of new technology transfer that is needed, so that you'll be able to move this across from existing precedents, to convince the attorneys, the privacy officers, and the security people that it can be applied in another domain, whether that's going to involve health data, education data, law enforcement data, wherever the need exists. But the opportunity cost of not doing it is: you need to have a trusted third party, you need to wrangle with the lawyers to see if they would send the data, and then you have a vulnerability, because the data have been transmitted to one place. Even if they're encrypted at rest and encrypted in motion, the fact that they need to be decrypted in order to do the computation, that they've been drawn to one place, that to me is the push to doing this with secure multi-party computation. Or there's another opportunity cost: you could just not do the computation. Maybe in the public sector that's not always something you can consider, because you just have a mandate to do it, but for a private company like Google, it's always possible that it's just not worth the effort: it's not worth the risk of having a trusted third party, it's not worth the cost of MPC, we just won't compute the thing at all. That's another opportunity cost. I won't
comment too much on Google's lawyers, other than to say that, yes, I do deal with them regularly. I will say, though, for other fields of expertise: the team I'm on, we basically get approached by other teams that are doing other stuff, and they have in some way encountered a privacy problem that they think MPC can be used to solve. So, at least internally, we've learned quite a bit about other things that are happening outside of the security group at Google, and I'm sure the team in Denmark that did the sugar beet auction learned quite a lot about how sugar beet prices are computed in Denmark. So yeah, there's a lot of interdisciplinary collaboration, I think. So I will comment on the lawyers at the University of Michigan, and I've had a very good experience with them. I've had to work with them for a number of years on data sharing issues and data use agreements, and there is a lot of expertise there. I think that what we're talking about, though, is a new model, and it will take time for those new models to be developed and spread. I think ultimately the problem will be one of dissemination and diffusion, because I think we'll be able to come up with legal models that work. The sphere that I know is the academic sphere, and the problem we have there is that there are just a lot of universities out there, and each one has a fleet of lawyers, and it takes a while for our lawyers to convince their lawyers that, yeah, this is okay. To that point: are these human subjects? Whenever the computation is being done, how is this going to go through IRB? That's one of the interesting questions, because this is an odd situation: if the information you're letting out is encrypted, it's not human subjects, but the computation could result, because of the differential privacy issues, in a result that is identifiable. I don't think we actually have that issue. In a different sense, currently, when we share data that are confidential, we have various ways of
making it available to people, but we actually have to put in our data use agreements that the results cannot identify people. Now, currently, ICPSR's data use agreements, some of them have like 10 or 12 clauses about you can't do this and you can't do that, you can't have this many cases, and so on, and those clauses actually don't really work anymore. But I think that's one of the interesting issues here, about where this actually falls, and I think we just need to bring in the IRBs and the lawyers and get them to help us figure it out. We've got time for one, maybe two more questions before the fireside chat. Hi, my name is Jack Kropansky, unaffiliated, but with a technical background with a lot of data and databases. Provocative question: could Facebook and Cambridge Analytica have used this technique, and would it have passed legal muster? And would you have to treat the campaign as another one of the parties, so like three parties, or actually four parties, I guess? Would that have applied here, or would there be any reason that they wouldn't do it? So I can't tell you what they in particular would or would not do, and I certainly can't comment on whether or not anything is legal; it's just not my expertise. What I can say is that MPC is probably not the right answer in that situation, because, as I said earlier, you can use MPC for any function, so in principle they could have used it for the exact same function. And I think in that case it was really not so much the data but the outcome that people were concerned about, these psychometric profiles. So I don't know; I mean, it's not a silver bullet. Maybe one more question. Sorry. So thanks, everyone. My question is on, we've talked a lot about how MPC addresses certain issues and differential privacy is a solution to a lot of those.
Can you just talk about how the addition of differential privacy, for trying to cover some of these privacy issues, affects the cost of MPC, both as a technical measure and as a social and cultural measure? I think those are two difficult conversations, and combining them might be a bigger challenge, but I'm curious about your thoughts. It depends. Sometimes it's actually very easy to add differential privacy. For instance, in the salary average it's actually very simple: you can literally just add noise to your own input, and it turns out that the sum will then have sufficient noise to be differentially private. It's also possible in some cases that you have some central aggregator that sees the result and can add the differential privacy before releasing it; maybe you trust them enough to see the non-differentially-private version, because you believe they will never actually reveal it, they'll just delete it after they've added noise. Where the cost can really accumulate is if you try to bake the differential privacy into the MPC protocol itself. There's been quite a bit of research on that, but I think that raises the technical cost of actually doing this quite a lot. And if you're going to stage it so that you have your multi-party computation happening here, then you take the results and hand them to this other, think of it as another server, focusing on the differential privacy: if you were going to do this at a federal agency, and assume they have a data center and have not gone to cloud, the likelihood of them having the computational firepower to do both of those is, in my opinion, very, very low. So it would be great if there were an integrator that would enable this, because whether it's viable given the procurement, IT, and cultural factors that swirl there, that's going to be very challenging. So let me take the other side: there are a lot of these things that probably are going to be simple, so
if you're doing just means, as in the example that I was showing, and that's all you're doing with those data, you might not need differential privacy and you might not need anything else. I think the place where it's going to be most difficult is if you allow a lot of different things to be done, because then you create other ways for people to re-identify by doing multiple things. There are projects I know of that are experimenting with things like privacy budgets, where you only get a certain number of goes at the data, so there are some approaches to that. But again, I think we should be walking before we run; applying this to some of the simple cases to get experience with it, before we deal with the more complex cases, is the way to go. Thanks. We're going to move to our fireside chat; give us one minute to reconfigure the stage, and then we'll bring up Nick Hart from the Bipartisan Policy Center to preface that chat. Thanks to our incredible panel. Thank you, Chris, and thanks to the panelists: Ben, Amy, and George. I think that was an excellent discussion. As Chris mentioned, I'm Nick Hart from the Bipartisan Policy Center, where I'm the director of our evidence-based policymaking initiative. On behalf of BPC, I just want to thank New America for inviting us to partner on this event. At BPC we are specifically continuing the efforts of the United States Commission on Evidence-Based Policymaking, which delivered its recommendations to Congress and the President last September. Those recommendations were unanimous and therefore bipartisan. Among the recommendations was a call to modernize our data infrastructure in government as well as to enhance privacy protections, including through the deployment of privacy-preserving or privacy-protecting technologies. The commission said, and I'll quote, that government must adapt to new threats to security and privacy and take advantage of
emerging technologies that better protect privacy. So while the commission did not explicitly endorse any particular technology, SMC, secure multi-party computation, was one approach the commission specifically called out for its potential to enhance privacy protections. I want to acknowledge two of the commission staffers who are here, who were really instrumental in writing some of the commission's language around privacy protections, and Sharon Boyven. You've just heard the panel take a deep dive into the approach, how it works, and some of the challenges faced in practical deployment. I want to now turn to how the approach fits into the current government landscape, and specifically how we can use it to improve government statistical activities. George mentioned that the social process here is really important for deployment, and we're going to get into aspects of this in the course of the fireside chat. Participating will be Bob Groves, a former commissioner on the Commission on Evidence-Based Policymaking. Bob was specifically appointed for his expertise in privacy, and he was one of five individuals who served on the commission in that capacity. He's also a former Census Bureau director and the provost of Georgetown University. His research, among other areas, is focused on public concerns about privacy that affect attitudes toward statistical agencies. So thank you, Bob, for joining us. Leading the discussion with Bob is Tobi Zakaria, who is a senior advisor with the Bipartisan Policy Center and a long-time DC journalist who's covered a range of domestic and international issues. So thank you, Tobi; please come on up. So Bob, as a former Census Bureau director, I imagine you know a lot about data in government. What would you say is the current state of data ownership and access issues in government? Let me start with memories of being Census Bureau director. If you read the enabling legislation of
the Census Bureau, there's a wonderful clause in there that gives the Census Bureau director the right to access data from any federal agency. I remember reading that while I was waiting for confirmation hearings, and I said, wow, that's a really cool thing. But I quickly learned that the other agencies that possess data have nothing in their enabling legislation that requires them to give data to the Census Bureau director, so it's rather a hollow authority in a real way, and that relates to your question. So what is the setup in federal agencies with existing data resources? There are roughly a dozen or more primary statistical agencies in the federal government. They are the ones that provide the country the basic information about the economy and the society: what are we, how is it going, are things getting better or worse? They're fundamental to the democracy, in my opinion. They are united through a set of protections of data that apply to them, really strict standards of keeping data confidential, data supplied by citizens, by residents of the country, or by establishment units in the country. There is less legislation that permits those agencies to share data, even though they have in one sense a common mission, or that permits program agencies, agencies that are not statistical in nature, that are really there to provide direct benefits or services to the population, to have those data be used for statistical purposes. For that reason, the agencies are actually doing original measurements right now of attributes of the population where the data are contained in other data sets, in other agencies, and they can't access them. An interesting fact about this is that if you ask the American public how their data are handled in the federal government, large portions of the American public believe that all the data are shared, that if you give an answer to the Census Bureau, it is known within minutes by every other agency. Only when you're
inside of these agencies do you realize that the opposite is really true. So we are hampered in federal agencies that are designing data. And to speak to some of the implied motivations for MPC, let me tell you why that's important. In the 20th century we developed a set of measurement tools in the social sciences that provide us everything we know about the economy and the society. The sample survey, the use of structured measurement and probability sampling, was key to what we know; that tool is the foundation of what we know about the society. That tool itself is fraying at the edges in all the developed societies of the world, and the problem has to do with public participation in those surveys. In every developed country in the world, the proportion of sample units that are responding is declining. With declining participation, a lot of the statistical inferential properties of surveys are threatened, and that's happening at the same time that we have other digital data resources. So these other federal agencies, program agencies, not statistical agencies, have data that would be quite useful to these common good purposes, and without somehow conjoining those data with sample survey data, what we know about the country is threatened going forward. Could you give us a very specific example of how this computing technique can help the government in its assessments? You have spoken a little bit about the assessments of the economy. Well, it's not just MPC that's the issue. Actually, if you listened to Amy very carefully, she made a distinction between access to data and the use of MPC for computational purposes, and I'm really talking about access to data right now. Every month you can see from the Department of Commerce an estimate of retail trade purchases, right? Pretty important in a consumer-dominated economy to know what that number is. That's based on a survey of retail outlets. The participation in that survey is declining; there's a threat to the credibility of
those estimates. Yet we have much richer transaction data from credit cards, the result of the behaviors of consumers making purchases, and those are real-time, high-volume data. Those data could be used in conjunction with the survey of retail outlets to provide much richer estimates of the consumer economy. We're not doing that. So, again thinking of what Amy was saying, how you compute on a conjoined set of data is one issue, and MPC is an attractive candidate for that. The real question is why the country can't, for common good purposes, conjoin those data for the good of the American public in knowing how the economy is going. That's a critical issue. So do you think it is possible for the public, and also private companies, to actually believe the government when it makes these privacy pledges? Yeah, it is odd that the term trust hasn't been prevalent in this meeting, but that really is the ultimate variable. It seems a squishy variable, but even right now, or even back in the 20th century, why would anyone in the public believe the unemployment rate? I mean, really, why would they believe it? Well, it appears that the behavior of these agencies, the building up of credibility over time, the relationship between the unemployment rate and the observations the public makes about how things are going, built up credibility in the numbers themselves. The agency was as transparent as it could be, even though these are hideously complex computations; the unemployment rate is a simple number, but underlying it is a whole lot of complexity. The agency was transparent about how that was done, and there was a deep code of ethics that was known and displayed by the public servants who produced it. And then there were laws. It's all of that combined that built up public trust. Is that still the case today, would you say? Our measures of trust in all the institutions of this country are declining, except for the military, interestingly enough, and science is hanging in there,
although it's iffy at times. That is the critical issue as we move forward on blending data together, I think. Another perspective on MPC that we ought to promote is this: is it a new way of thinking about access to data that breaks the interpretation that this is a violation of the pledge of confidentiality of these data owners? That is a critical step in the logic of the acceptance of this. There are all sorts of technical complexities and costs and so on, but I actually think the most important thing is: if I'm a federal agency, and I view another data set as a way to execute my mission, yet that data set is protected by its own set of confidentiality rules, and the other agency sees its mission as potentially achievable with my data, could we agree that this is not data sharing, not a violation of the pledge of confidentiality? It's just a new way to execute my mission. I am providing common good information to the public the way I did before, with new resources, for the benefit of the American public, without any increased threat to privacy concerns. Last week, obviously, Congress was very much focused on privacy issues, with the Facebook and Cambridge Analytica hearings. What kind of implications do you see from that on these issues going forward, in terms of evidence-based policymaking, in terms of using these kinds of computing techniques? So, I thought what was missing, and actually the Commission on Evidence-Based Policymaking spent a lot of time on this, what was missing was transparency. What the commission ended up saying is that trust of the public in the outcomes requires both transparency of inputs, that is, at any point the public ought to know what data are being shared, or used in conjunction, blended together, and transparency of outputs. And neither one of those things was true in that event. I think that's the sine qua non of going forward: if we're going to build trust, we have to be transparent in a new way, and we have to be transparent and
accessible in our language of transparency. So exhibiting the code to the American public is transparent, but it's completely incomprehensible to the American public, right? We have to be better than that, and we can't avoid that challenge; as I see it, we have to come up to that challenge. What kinds of things would you see as being necessary on transparency? So, the commission was asked to opine on the construction of a data clearinghouse for the country. We interpreted that as meaning a huge data warehouse; imagine all data records of all people merged together. It didn't take us very long, as I recall, Shelly could remind me whether it took a long time, to say, boy, that is not an idea that ought to go forward. And indeed, we proposed a world that is completely compatible with MPC. It's a world where, whatever functions exist, there would not be a large set of data permanently stored. Rather, it would be access to multiple data sets, instantaneous analytic work, and output of statistical information for common good purposes. And the standards of transparency for that would be that every resident, or anyone in the world, could go to a web portal and ask: what's going on now, today? What data are now being analyzed? What are the purposes of that? And then I could, in the same way, see the results of those data analyses. So, radical transparency. And why are we doing that? It's to build trust over time, and I think that's the price of doing what we do as we go forward. Just in our conversation, a couple of times you've referred to the common good. What does that mean, when you talk about that? So, James Madison, when he was writing the Constitution, is the father of the census clause, and so, again, while I was waiting to be confirmed, I read a lot about James Madison. He was really a social scientist, and in the 1790 census he said, well, in addition to counting, to measuring people by age and sex and having their name, why don't we ask their occupation?
And it turned out Thomas Jefferson didn't like this idea at all, and he was critiqued. He was a member of the House at the time, and he was critiqued by people saying, well, you're turning this into some research project; what's this about? We just want to reapportion the House; all we need are counts of people. And he has an articulate speech on the floor where he says, you know, this is a fledgling nation; if we're going to grow, we're going to need people with skills in all the trades, because we want to be a powerful and successful nation, and if we don't know whether we have that occupational mix, we don't know our chances of success. Well, he lost the battle, as it turns out, but the common good is part of that. Why did he care about that? Well, each of us has to give up a little bit of our privacy. I have to reveal my occupation, and maybe I don't want to reveal my occupation, but I'm doing this for a larger purpose, and the larger purpose is the common good. So one of the unfortunate outcomes, in my personal opinion, of letting privacy be defined by a legal framework is that we've missed the other side. In my view, each of us also has an obligation: we're part of a society, and we have to reveal things about ourselves for higher purposes. I give up a little of my privacy voluntarily because I believe that makes a better society. And these dozen or so statistical agencies that exist are all about that. They depend on people saying, okay, I'll let you know that about myself, because I trust that you're going to use those data for a higher purpose, and I believe in that purpose. It's good in a democracy for people to know something about the society; how else will we know whether to kick the guys out of office if we don't know how well we're doing? And that ought to be pure information, untainted by political ideology. I'll step down from my soapbox. So, talking about these issues, can you give a specific example, so that people can understand, of how this would impact the work at
a government agency, if they had this kind of computing technique, and would it be helpful? Again, let's first talk about what questions we cannot answer because we're not blending data; MPC is a way to blend data within a particular constellation of constraints. So, again referring back to the commission: the commission report actually has some examples of this. One of the things that united Paul Ryan and Patty Murray in creating this commission was the idea that we ought to be making decisions about whether programs are working based on data, on evidence. That seems like a good idea, but we actually can't assemble the evidence from the program data themselves, because often the dependent variable, if you will, the outcome of interest in a program, is not collected by the agency executing the program. Think of all the welfare programs, housing supports, all sorts of things that are attempting to uplift people in need: the outcome is often some income-related outcome where the data are not possessed by the program agency. So quite quickly the commission said: we're not going to get better at evidence-based policymaking until we assemble evidence that covers the whole course of a program through to its outcome, and that requires blending data together. Blending data together requires agencies to collaborate in a way that right now may run counter to regulatory or legal constraints, and that's a problem. We need a new entity to make that work in a way that's legal and safe and engenders the trust of the American public. How would such an entity work? The entity, as proposed by the commission, has all of the legal protections of a statistical agency. It would have legal authority to access data from program agencies for statistical purposes. Maybe I should stop for a minute and define those terms, because this is a very important legal concept: in the Privacy Act there's sort of a carve-out for statistical uses, and that means that the outcome of those uses
has no way of harmfully affecting an individual in particular. These are the results of aggregate statistical operations; we know the attributes of groups because of statistical uses. We could not browse a data set to find your record and then do something to you; that is not a statistical use. For that reason, all of the protections of a statistical agency would be attached to this entity. The commission proposes that this build on a whole lot of great work that's gone on, particularly at the Census Bureau, in blending data together from different sources, and that it not be a data warehouse. It has an ability to let third parties access the data under controlled circumstances. All of those features, plus the transparency that we talked about before, build a multi-leg stool that we think would build public trust, increase the privacy protections of the existing data, and provide better access. And this in turn means, for example, that you can combine data from maybe the Education Department and the Health Department to come up with a broader data set that we currently don't have? Actually, the senator mentioned some uses, and Amy has mentioned other uses, but take the question of what the long-term impacts of education are, both on income and on health. Well, you have at least three data sets there: you need education data, you'd like to know what a student took throughout their educational experiences; you need longitudinal data on their occupation and income; and you need health data. All three of those are now under the control of different agencies, and some of them are a mix of federal and state data. So when Amy started really freaking you out, 1700 prisons or something like that? Yeah, that's the problem. And when George says calm down, let's start simply, he's right too, right?
MPC is a great idea, but it's not ready to go to 1700 prisons, or 5000, whatever. It's a tool, though, that might engender trust in a way that no other tool we have does, and we're at the beginning of this, I think everyone would admit. There is low-hanging fruit we ought to go for, and then let's see how far we can push it as computing gets better and better as well. But in that example you just gave, currently you can't access all of that in one set? That's right. And if you go back to the old way of doing this, for which I guess the Census Bureau is the best example, going to all these entities one by one, let's just take the 50 states: it took, is it 20 years or 15 years, to assemble agreements across the states just to get simple employment and unemployment data that could be merged, and that has already provided insights that are revolutionary in labor economics; we answered questions we couldn't answer before. And then once you got to 50, as I believe somebody dropped out, then boom, you're back to 49. That process is uniquely a US problem. If you go to other countries in the world, where there is a central statistical agency, that agency often has the authority to bring in data from lower geographical units. We don't have that in our republic, so it's a uniquely American problem that we're facing, really. But this new tool is just one among many that may allow us to blend data in new ways. So in addition to being a member of the Commission on Evidence-Based Policymaking, you were also on a National Academies panel. Were there similar approaches to data privacy?
The panel was a little more narrowly focused. We were focused entirely on the federal statistical system, and we were asking how that system of agencies could be advanced by blending data from multiple sources, and what the technical issues were there, and what the policy issues were. Interestingly enough, the panel and the commission came to very similar conclusions: that going forward we have to put privacy first, and then we can do all the stuff we want to do. If we don't build public trust in our enterprise of blending data together, forget it; we don't deserve to do what we want to do without building that trust, so we have to take that seriously. We also opined on a bunch of technical issues with regard to the statistical formulations we need when we start blending data together. Computer science is doing its thing, statistics is doing its thing; we'll see how they come together eventually, but it isn't quite clear in all circumstances what modeling approach you want to go forward with, so we opined on that as well. So, just to sum it up, why does this approach actually mean stronger privacy protections? I think what's really attractive about this is: if I'm a data owner, as I was trying to motivate, and I have a mission, and I can achieve my mission by blending my data with another data set, yet I'm constrained from sharing my data because of law or confidentiality agreements, whatever, it doesn't matter, this might be a way for me to fulfill my mission without any violation of user agreements. And what I would add is: if we also make it transparent, to assure people that we are not increasing the risk of harm to them, then we have a package we haven't had before. I don't have to ship my data out and lose control of it; I can achieve my agency's mission, and I can pledge to my users and my data record owners that they're under no greater threat. That's a great combination. Thank you very much. Thank you. Thank you all for coming today. Thanks to the Bipartisan Policy
Center for partnering with us, and thanks to all of our great speakers. I hope you learned a lot today about secure multi-party computation, and enjoyed the empanadas. We should have had a beet salad, I just thought of that now. But thank you, and have a great afternoon.