All right, I guess I get to end here with Ed Felten. Ed has been party to lots of controversy. I can't think of anyone who's been involved in more, except maybe Richard Nixon, and he's dead. Ed's work has always been illuminating and instructive. I first met Ed in connection with a lot of issues that he and his succession of graduate students dug up in security research. He's been involved in things like the Microsoft Department of Justice antitrust case, broke a lot of DRM systems. Via this clever method of looking at the patents, right? That was one of the tricks. Yeah, that was one of the tricks. He was the first chief technologist of the FTC, and many other things that I won't list because we're short on time. All right. So as John said, I'm a computer scientist but I've been involved a fair amount in public policy. And so I want to talk about one example of how those things come together. In particular, I'll talk about the NSA's mass phone call data program. And I'm going to apply some easy computer science to some of the issues that arise around this thing, and try to shed some light on it. A lot of what I end up doing in my policy engagement is to explain blindingly obvious computer science to policy makers in a way that they'll hopefully understand, or at least to state it with confidence while being an Ivy League professor, which often carries more weight in Washington than it probably should. All right. So here's the program. We're talking about the story broken by the Guardian back in June that the NSA is collecting phone records on millions of Verizon customers and, we later learned, on customers of lots of other U.S. phone providers as well. So what I want to talk about basically is a few things today. First, what we know about this NSA program, what it does, and how it works. I'll do some analysis to try to shed light on how useful the program actually would be in trying to find bad guys.
I'll talk a little bit about whether it can be made more privacy-friendly, whether there are different ways of organizing the data and the computing to make it more privacy-friendly while still being just as effective, and then some concluding thoughts. All right. What do we know about this thing? Well, for many or most domestic phone calls, we know that the NSA collects information about who called whom, when the call was made, and for how long they talked. There's some dispute about whether it is for essentially all domestic phone calls or for fewer of them. Originally it was reported as nearly all; later, unnamed NSA sources said it was only 20 to 30 percent; and just recently a high NSA official said under oath that it's about 20 to 30 percent under this program. So we don't really know exactly how many calls, and I'll talk later about how the degree of coverage actually matters for the effectiveness of the program. All this stuff they call metadata. They use the term metadata in a way that isn't really the same way that computer scientists use it, but it doesn't matter. When you say metadata in Washington, what you mean is this stuff: who called whom, when the call was made, for how long they talked, in contrast to the voice audio content of the call. Now all that's for domestic calls. For calls that are either foreign to foreign, or one end foreign and one end domestic, they collect metadata wherever possible. You might as well assume they collect all metadata. And in some cases they collect content. There's no legal limit, really, on their ability to collect content overseas from non-U.S. persons. In the boundary cases, where it's one end foreign, or where there's a person who's American on one end of the connection, or they're not sure whether the people are American, the law's ambiguous, and as far as we know they pretty much collect the data. But I'm going to focus here on the domestic case.
So, domestic calls, where they have just information about who called whom and so on. The first thing they do with this is they build a data structure which we'll call a call graph. It's a graph that has a node for each phone number, and it has an edge between two nodes if those numbers have talked to each other at all in the last five years. So that's the call graph. And then they do some computations on the call graph. In particular there are two computations that they do. The first one is called contact chaining, and the second one, well, there's one that's always redacted in the documents. Typically in a court order authorizing the program there's about a third of a page about contact chaining and there's about two thirds of a page about some redacted thing with some redacted footnotes. So we don't know what the second thing is, but contact chaining we know about, and I'll talk a little bit about what that is. Do you have a guess on the other one? What's behind the black rectangle? Do I have a guess as to what the other one is? I have some guesses. I'll talk later about that. By contact chaining, do you mean generically either clustering and graph partitioning, or pathfinding? Here's what they say. What they say is they look for all paths of length up to three hops, or up to two hops. They call it contact chaining. Some of the details are redacted. I would say that to the extent that they do any kind of sophisticated clustering, or look for relationships that are more complex, or distance between two entities, that would fit under number two. As to my guess about number two, actually, let me just answer that very briefly. I'll come back to a more specific example later. But the guess is that the redacted one is basically aiming to counter some particular form of tradecraft that the target organizations use.
For example, people throwing away phones and switching to new burner phones; you might be trying to identify which phone the person has switched to. But we don't know for sure. So this is what's going on. There have been a bunch of legal and policy fights about this. I've been involved in this in various forms: filed affidavits, testified, and so on. One of the biggest arguments that one hears in favor of this program is the idea that it's only metadata, and therefore we shouldn't have to worry too much about the privacy problem. Now, if you know much about data and what can be inferred from data, that doesn't really impress you very much. And if you think in any kind of detail about what people do on the phone, and the fact that people tend to use the phone for those communications that are most sensitive, because they don't want to commit them to writing, what you get is a lot of sensitivity here. First of all, sensitive calls. Suppose somebody calls the suicide hotline at 1:15 a.m. and talks for 45 minutes. That's just metadata. We don't know anything about what they talked about. There are all kinds of numbers such that calling that number, or that number calling you, says something about you. If, let's say, the reminder service that your doctor uses to remind you of an appointment the next morning calls you tonight, it's pretty obvious what that means. It means you have a doctor's appointment tomorrow. And there are lots and lots of examples. And examples that have to do not only with personal details but also with public policy. For example, if an NSA employee calls an abuse line at the NSA Inspector General's office, that goes in this database, and it's potentially available to NSA analysts. And that also is problematic. So, sensitive calls. Patterns of calls, even more so. You can tell all kinds of stories about this: back and forth with, say, different medical specialists, and so on. A pattern of calls between people might tell you something about the nature of their relationship.
It might tell you about the beginning or end of a relationship. It might tell you about a professional relationship or changes, and so on. And, of course, there's all the standard big data types of analysis, where people train on a large data set to build predictive models that let you figure out things like a person's family status, how happy their marriage is, whether they have kids, whether they're a student, etc., etc. All kinds of things like this, the sorts of things you would expect if you know how people use data. There are all kinds of inferences you can make about a person based on their call patterns. And, of course, the largest data set, a data set much bigger than any that's available to any researchers, even the ones at the phone companies, is held by NSA. And so there is a serious privacy issue here. And that's now finally, I think, being recognized by the policy world. So that's led to suggestions for reform. The first major reform suggestion came from a report of the President's Review Group on Intelligence and Communications Technologies. This was a group of five D.C. lawyers who were appointed by the president to take a look at this program. And in mid-December they released a really substantive and surprisingly hard-hitting report of a couple hundred pages, which this is the cover of. And they said two things that are significant about this particular program. First they said: our review suggests that the information contributed to terrorist investigations by the use of this data was not essential to preventing attacks and could readily have been obtained in a timely manner using conventional court orders. Now, it turns out that when pressed, administration officials, the only success they can point to is that they arrested a cab driver in San Diego for donating eight thousand dollars to a group in Somalia that has engaged in attacks against U.S. forces. That's the one and only success that arguably came from this program.
That has been publicly announced? That has been publicly announced, yes. The case is publicly known, and there have been all kinds of claims that have been made, but under oath, when pressed, the only one that seems to stand up at all is the case of this cab driver in San Diego. The other thing that this group did is they made a recommendation: we recommend that legislation should be enacted that terminates the storage of bulk telephony metadata by the government, and transitions as soon as reasonably possible to a system in which such metadata is held instead either by private providers or by a private third party. Now, when the president gave his big speech about surveillance on January 17th, he actually endorsed both of these ideas, or at least semi-endorsed them. That is, he talked about the efficacy of the program and didn't really disagree, and he also directed the administration to decide by March 28th on a strategy with respect to reorganizing how the data is held. So I want to do two little bits of computer science analysis to look at the two questions that are raised by these two quotes. First, how useful would this program be, or at least how useful would contact chaining be, in actually catching bad guys? And second, how feasible would it be to redesign this program so that the data is not held by the government? Okay, so first, how useful is contact chaining? In order to do that, let's look at this scenario here. The scenario is that an intelligence analyst has some evidence suggesting that some person, we'll call him Bob, might be a terrorist. Yes? I'm not a terrorist. Yes, Bob? I should really switch the name for this reason. Okay. Great. So the analyst has some evidence suggesting that Bob might be a terrorist. Not like strong evidence, but enough to generate some suspicion. The analyst then uses the call graph and is going to determine whether Bob is in the near neighborhood of a known bad guy, someone who is known to be involved in terrorist activities.
So if yes, if Bob is in the near neighborhood of known bad guy, then this is evidence, perhaps weak evidence, that Bob might be a terrorist. And if not, it's weak evidence that Bob is not a terrorist. Now, terrorist groups know that we do this kind of stuff. And so presumably they're trying to avoid being in each other's near neighborhood in the call graph, but that's harder than it sounds. It takes really rigorous discipline to avoid ever, within five years, having even an indirect two-hop connection to someone you're actually seeing and coordinating with. And so there's some substantial probability that the bad guys will slip up and create a link, even if they're trying not to. Yes? In the call graph, do they also have weights on the edges, or some other... Well, they do record when and how long, and you can imagine some kind of scoring function that puts a weight on each edge. I'm not going to talk about that here, but you can imagine how you might extend the kind of analysis I'm talking about to include that. Probably they do. At the very least, one of the things they do is try to identify high degree nodes in the graph and prune those. Yes? When you originally mentioned the call graph, the vertices are phone numbers, not people. Is there a reverse mapping between those, in practice? Is there a reverse mapping in practice? Yes. It turns out you can map most numbers to a person by using broadly available commercial databases, or advanced techniques such as Google, and in any case the government has subpoena power to get that information. So it's not as though terrorists could just buy burner phones; or rather, they can buy burner phones, and we'll get to that. It's true that I'm using number and name interchangeably here, to simplify the discussion.
You can imagine a known bad phone: you can imagine that they're watching Bob's phone because they know it's Bob, who they suspect may be a bad guy, and for a known bad guy you can substitute a known bad phone. All right. So this is the scenario. Now, the analysis to do here is actually not that complicated. You want to do a very simple application of Bayes' law to this, and then a little bit of discussion of the graph structure and so on. So let's work through the simple Bayes' law argument by looking at a grid. This is a 2 by 2 grid where, across the top, the columns represent the states of Bob being a terrorist or not a terrorist, and the rows represent Bob being in the near neighborhood of known bad guy or not in the near neighborhood of known bad guy. So we'll make some assumptions. We'll say the probability that Bob is a terrorist, that's the analyst's prior: after having all the other information that's available to the analyst, but not the call graph, that's the analyst's best estimate of the probability that Bob is a bad guy. Now, the probability that Bob is in the neighborhood of known bad guy if he's a terrorist, we call that alpha; that represents more or less the likelihood of a slip-up. And the probability that Bob is in the neighborhood of known bad guy assuming Bob's not a terrorist, we'll call that p_N, and we'll make that equal to just the probability that two random people happen to be in each other's neighborhood. So now we can plug in the values to each of the four boxes in the obvious way. Now let's make some assumptions.
So let's assume, just for the sake of argument, that our prior that Bob is a terrorist is 20%, and we'll assume that if Bob's a terrorist there's a 50-50 chance that he's actually in the near neighborhood of known bad guy. And so the only one that is maybe a little bit hard to get is: what is the probability that two random people are in the same near neighborhood? To explore that, we'll use a random graph model. We'll use, in particular, an Erdős–Rényi model. This has a couple of advantages. First, it makes the analysis relatively simple, and second, I can show off my mastery of special characters. So the model is about the simplest random graph model you can imagine. There are n nodes, and each pair of nodes is connected with probability epsilon, chosen independently for every pair. And we'll make the assumption at the bottom just to simplify the analysis. Let's plug in values. We'll say there are about half a billion nodes; that seems like a good guess for the number of domestic US phones. We'll say each pair is connected with probability 1 in a million. That makes the expected degree of the graph 500, which also seems like a good guess. All right, now, given that model, we can say, for example, if near neighborhood means that the distance is less than or equal to 3, which is what the NSA has talked about in much of their discussion of this program, it turns out that the probability that two random people are in each other's near neighborhood is about 22%. So plug those values into the grid and we get this, and now we can work these things out. The probability that Bob's a terrorist if he's in the neighborhood of known bad guy is 36%. That's up from the 20% prior. If he's not in the near neighborhood, we go from 20% down to 14%. But the expected entropy gain is 0.04 bits. Not that much. What's that? What's the expected entropy gain? So basically it's, well, it's a little bit complicated, actually let me skip that.
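The 22% and 0.05% figures can be reproduced with a quick back-of-the-envelope calculation. This is my own sketch, not the exact computation from the slides: it uses a simple branching-process estimate (the `1 - exp(-reachable/n)` step is my assumption for handling overlap), which lands close to the quoted numbers.

```python
import math

def near_neighborhood_prob(n, eps, hops):
    """Approximate probability that two random nodes in an
    Erdos-Renyi graph G(n, eps) are within `hops` of each other.
    The expected number of nodes within k hops of a given node is
    roughly d + d^2 + ... + d^k, where d = n*eps is the expected
    degree; a second random node falls in that set with probability
    about 1 - exp(-reachable/n)."""
    d = n * eps
    reachable = sum(d**k for k in range(1, hops + 1))
    return 1 - math.exp(-reachable / n)

n = 500_000_000   # ~ number of domestic US phone numbers
eps = 1e-6        # pairwise connection probability (expected degree ~ 500)

print(f"3 hops: {near_neighborhood_prob(n, eps, 3):.0%}")   # ~22%
print(f"2 hops: {near_neighborhood_prob(n, eps, 2):.3%}")   # ~0.050%
```

Note how the third hop dominates: d³ = 125 million nodes, a quarter of the whole graph, while two hops reaches only about 250,500 nodes.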
In the interest of time, sorry. Well, here's roughly what it is. It's more or less a measure of the change in the analyst's uncertainty with respect to whether Bob is a terrorist or not. There are two cases, and what this really is is an attempt to capture the idea that you're much more likely to be in the bottom case, the case where you're not in the near neighborhood, and there you only go from 20% to 14%, so you don't gain very much knowledge. So why is that? Why is it that you don't get much advantage? The real reason you don't get much advantage, in particular that you only go from 20% to 36%, which is not very good for confirming that Bob is a bad guy, is the upper right box. That's false positives. False positives happen 18% of the time; true positives happen only 10% of the time. And therefore a positive doesn't really help you that much. So obviously, if you want this system to be better at actually catching bad guys, you need to reduce the number of false positives. And that 18%, that basically is driven by the probability that two random people are in the near neighborhood. And the way to make that lower is to reduce the size of the near neighborhood. So if you reduce the near neighborhood to distance less than or equal to two hops, then that probability in this model is about 0.05%, which is way, way lower. Now the grid looks like this; that upper right cell, which is the false positive rate, is way smaller. And now the probability that Bob is a terrorist given that he's in the neighborhood is 99.6%. So that actually is pretty strong confirmation. Whereas if he's not in the near neighborhood, we go from the 20% prior down to 11.1%. Now look at this. The president said in his speech: effective immediately, we will only pursue phone calls that are two steps removed from a number associated with a terrorist organization, instead of the current three.
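The grid arithmetic above is just Bayes' rule. A minimal sketch (my own, using the assumed values from the slides: prior = 20%, alpha = 50%) reproduces the quoted posteriors and the roughly 0.04-bit expected entropy gain:

```python
import math

def posterior(prior, alpha, p_n):
    """Bayes update for the 2x2 grid.
    prior - analyst's prior P(Bob is a terrorist)
    alpha - P(in near neighborhood | terrorist), the slip-up rate
    p_n   - P(in near neighborhood | not a terrorist),
            ~ probability two random people are near each other."""
    p_in = prior * alpha + (1 - prior) * p_n         # P(in neighborhood)
    post_in = prior * alpha / p_in                   # P(terrorist | in)
    post_out = prior * (1 - alpha) / (1 - p_in)      # P(terrorist | not in)
    return post_in, post_out, p_in

def entropy(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

prior, alpha = 0.20, 0.50

# Three-hop neighborhood: p_N ~ 22%
post_in, post_out, p_in = posterior(prior, alpha, 0.22)
gain = entropy(prior) - (p_in * entropy(post_in) + (1 - p_in) * entropy(post_out))
print(f"3 hops: P(T|in)={post_in:.0%}  P(T|out)={post_out:.0%}  gain={gain:.2f} bits")
# -> 3 hops: P(T|in)=36%  P(T|out)=14%  gain=0.04 bits

# Two-hop neighborhood: p_N ~ 0.05%
post_in, post_out, _ = posterior(prior, alpha, 0.0005)
print(f"2 hops: P(T|in)={post_in:.1%}  P(T|out)={post_out:.1%}")
# -> 2 hops: P(T|in)=99.6%  P(T|out)=11.1%
```

The expected entropy gain weights each posterior's uncertainty by how likely that branch is; since the "not in neighborhood" branch dominates, the average information gained is small.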
It was believed that the NSA had already decided they liked two hops better than three. This is basically why. Three hops is way too big. You have all kinds of false positives. And that has two negative effects. One, it makes the analysis not work that well. And second, lawyers who are suing the NSA make annoying claims about how nearly everyone is close to a terrorist in the graph, which is pretty much true if the number of terrorists is more than a few. You're hitting too many innocents. You will identify too many innocents. That is indeed a problem. Sorry? Do you know whether the scenario that you mentioned is the way that they're actually using the data? Because I can imagine, if you have a suspect, you might say you don't really care about finding other suspects; you might care where that person is going to travel tomorrow. Right. So, well, this gets kind of interesting. This connects to complicated questions about legal authority. We know that in a non-US setting, for non-US persons, the NSA has done all kinds of things like trying to figure out which people travel together, which people tend to be together, by looking at location information. Whether they routinely get location information domestically from phones without a specific warrant is not exactly clear. Probably they don't. We know that if they get a warrant, if they had a 20% prior that Bob is a terrorist, and I'll talk about this later, they could get a warrant to basically find out almost everything about what Bob does. But that's not information that they are legally authorized to gather routinely about everyone domestically, as far as we know. That's a lot of caveats, but unfortunately they're necessary. Do you know whether they can combine this with data from other sources? This also gets complicated.
The court orders that authorize the NSA to gather this information domestically put limits on how they can use it. In particular, what they say is that the information, when it's collected directly from a phone company, is supposed to go into a special zone which is called the collection store, and then they're allowed to query that only when they have something called reasonable articulable suspicion about an individual. Then they can use that individual as the starting point, as the known bad guy, in an analysis. The results of any such query are allowed to go into a different database called the corporate store, and once it's there they can do whatever they want with it. And so some commentators believe that what they're really doing, that contact chaining is a tricky way of essentially vacuuming up as much information as possible out of this very limited collection store and getting it into the corporate store, where they can mix it with other data and do whatever they want, and run around the legal protections for Americans' data. And there's some debate about whether that's the way they actually use it. This step from three hops to two actually, to me, is an argument against that claim. Because if the goal was to make as much data as possible be the results of a query, and therefore not legally protected anymore, then they'd want to stay with the biggest possible neighborhood, so they could suck as much as possible into the unprotected state. We don't really know how much information in the form of query replies has been integrated with other stuff. I'm not sure they know. They seem to have challenges in administering and tracking the history of all this data, making sure that the right data stays in the right database and doesn't leak across a boundary into a state where there are fewer procedural controls.
As you know, if you know about the challenges of ensuring compliance with complicated rules while processing data, it's hard to do. And they don't seem that great at it. So when I call my bank and key in digits to find my balance, is that content? That is content. The information to dial and connect is considered metadata, but even if you're pressing buttons on the keypad, once the call is connected, everything else that happens is content. It seems to me that you still need to collect the same amount of data regardless, under this program. They collect it all. The model is: they collect it all, and then there are limits on the kinds of queries they're allowed to do against it. The court orders that authorize the collection say you can collect it, but you can only do the following queries. Which is a little bit weird. You may not remember anything like that in the Constitution's discussion of warrants, but that's what the courts have done. I'm curious: words in a speech like this are extremely carefully chosen, and this one clearly is very carefully worded, yet to the vast 99.9% of the population, two hops versus three means nothing. So this would have been said for a reason, because the number of words in something like this is very scarce. So who is this meant for? It's an interesting question. One thing it's meant for, probably, is the various judges who are ruling on the constitutionality of this program. Because there have been very effective arguments, essentially the sort of argument that I showed before, that there's a 22% chance that a random person is within three hops of any given other person. And therefore, if you believe there are, say, 50 legitimate suspects in the U.S. at any given time, or in the world, the odds that any one of us is within three hops of at least one of those is actually pretty high. And that argument has gotten a lot of traction with the judges.
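The arithmetic behind that lawyers' argument is simple. Treating the 50 suspects' neighborhoods as independent (my simplification), the chance that a random person is within three hops of at least one of them is:

```python
# p = chance of being within three hops of any one given person (~22%)
# With 50 suspects, assumed independent, the chance of being within
# three hops of at least one of them is 1 minus the chance of
# missing all 50.
p = 0.22
suspects = 50
print(1 - (1 - p)**suspects)   # ~0.999996 -- essentially everyone
```

So with a three-hop neighborhood, virtually the entire population is "close to a terrorist" in the graph, which is exactly the point the lawyers have been pressing.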
So by knocking it down to two, it does two things. First, it helps with the legal argument. And second, to those people who are insiders and sort of follow all the ins and outs of this, it might look like a significant concession, but it's in fact a concession that they had already made for their own reasons. It's not a big sacrifice, but it was presented as, look at this big thing we're giving you, even though the third hop doesn't seem to really help that much anyway. I'm going to add, if I may, that this kind of random graph is not the kind of graph that you see in a real social network. Let me come back to that in a little while; it turns out that the exact form of the graph doesn't matter much. Actually, let me do it now. So, real graphs differ in a few ways. One is that they're power law graphs, and in particular there are some nodes with very high degree. But they identify those nodes and eliminate them from the analysis, and the reason makes perfect sense: the fact that you and I, let's say, both called United Airlines customer service within the last five years doesn't really say anything about whether we know each other. Those particular links are not indicative of any kind of a terrorist plot, and they do generate a lot of false positives, so it makes sense to knock them out of the analysis. So high degree nodes already get neutralized in the analysis, and the variance in degrees doesn't matter as much as you might think. The second thing to say is that the only thing that really matters, in some sense, is the size of the near neighborhood. That p_N parameter is the only thing that matters: what's the probability that two random people are in each other's near neighborhood? And that doesn't depend so much on the exact distribution from which the graph is drawn. And in fact you can imagine that there's a kind of dial that they turn to try to get the best value of this.
You want a near neighborhood that's large enough that two terrorists who are trying to avoid it will accidentally stumble into each other's near neighborhood with high enough probability, while at the same time dialing down the probability that two random people will be in the same neighborhood. That's more or less your goal. And so you're going to choose a p_N which does well for that, and I think it's going to be around 0.05% regardless of the exact form of the graph. So that turns out, I think, not to matter much. Given that terrorists are, hopefully, not that common in the population, when you look at absolute numbers, wouldn't they still end up with more false positives than true positives? Yeah, so we'll get there. Actually, we'll get there. Right, so here we have, remember, this grid. Now some people might be uneasy. What do you mean there's a 99.6% chance that Bob is a terrorist just because he's in the near neighborhood of some other terrorist? Well, you have to remember, of course, there's a 20% prior here, which is very unusual, and that plays a big role. So what if, rather than our prior being 20%, it's 1 in a million, which is closer to what the number would be for just a random member of the population? Well, now the grid looks like this, and now that false positive number looks pretty big. In fact, in that case, the probability that Bob's a terrorist if he's in the near neighborhood is still only 0.1%. And if he's not, well, it's half in a million. And that makes sense, right? If you start with no knowledge, you don't really get much out of this, because of the way the grid lines up. It's only where you have a prior which puts you substantially above what you'd think for a random member of the population, it's only then that you really get any kind of strong confirmation. All right, so what do we see so far? This works best if the neighborhood is relatively small.
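Plugging the 1-in-a-million prior into the same Bayes update (my own sketch, with the two-hop p_N of 0.05% and the 50% slip-up rate from the earlier slides) reproduces these numbers:

```python
# Same 2x2 grid, but with a population-base-rate prior instead of 20%.
prior, alpha, p_n = 1e-6, 0.50, 0.0005   # two-hop neighborhood

p_in = prior * alpha + (1 - prior) * p_n        # P(in neighborhood)
post_in = prior * alpha / p_in                  # P(terrorist | in)
post_out = prior * (1 - alpha) / (1 - p_in)     # P(terrorist | not in)

print(post_in)    # ~0.001 -- only 0.1%, despite the "hit"
print(post_out)   # ~5e-7  -- half in a million
```

With a base-rate prior, even a two-hop hit is overwhelmed by false positives: the hit multiplies the odds by about a thousand, but a thousand times one-in-a-million is still almost nothing.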
It can work for confirming suspicion; you can go from 20% to 99.6%. But it's not as good at eliminating suspicion. Even with the 20% prior, if Bob's not in the near neighborhood, you're still above 10% probability that he's a bad guy. And confirming suspicion is more or less what people who are involved with the program say it's for. Okay, other network structures, well, I talked about this before, so let me skip over that. The precise network structure doesn't matter so much. Okay, so we had recently this story that NSA is maybe collecting less than 30% of U.S. call data, and that, strangely, they are collecting a lot of data on land lines and not nearly as much on mobile phones, which seems to be backwards, right? We think that terrorists are more likely to use mobile phones. Nonetheless, this is the story we now have about what NSA does under this program. So we can model this easily: suppose each node is, quote, covered with some probability c, and an edge is covered if either of the nodes it connects is covered. Because remember that the caller's and callee's carriers both have a record of a particular call. So now we can take our original graph and construct the covered graph, which includes all the nodes but just the covered edges. And you can show that a two-hop path in the original graph survives with this probability down there. So if you have 25% coverage, a two-hop path survives with about 30% probability. So where this was our initial grid for a two-hop analysis, we go to these new numbers if you assume a coverage of 25%. And that means that the system is just as good at confirming suspicion when a hit happens, but not nearly as good at ruling it out; you're much more likely to be in the not-in-the-neighborhood case. Essentially what happens here is that it's now quite rare to find Bob in known bad guy's neighborhood. So this limited coverage makes the program substantially less useful.
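The survival probability follows directly from the stated coverage model: both edges of a two-hop path A-M-B survive iff the middle node is covered, or both endpoints are. A small sketch (scaling both alpha and p_N by the survival factor is my modeling assumption, not something stated in the talk) reproduces the roughly 30% figure and the weakened ruling-out power:

```python
def two_hop_survival(c):
    """Probability a two-hop path A-M-B survives when each node is
    covered independently with probability c and an edge is recorded
    if either endpoint is covered: either M is covered (both edges
    recorded), or M is not but both A and B are."""
    return c + (1 - c) * c**2

s = two_hop_survival(0.25)
print(f"{s:.1%}")   # ~29.7%

# Redo the Bayes grid: both the slip-up rate and the random-pair
# neighborhood probability get scaled by the survival factor.
prior, alpha, p_n = 0.20, 0.50 * s, 0.0005 * s
p_in = prior * alpha + (1 - prior) * p_n
print(f"P(T|in)  = {prior * alpha / p_in:.1%}")              # still ~99.6%
print(f"P(T|out) = {prior * (1 - alpha) / (1 - p_in):.1%}")  # ~17.6%, vs 11.1% at full coverage
```

Confirming power is untouched because the survival factor cancels out of the ratio, but the "not in neighborhood" branch is now mostly due to missing data rather than genuine absence, so a miss tells you much less.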
Now, one question is: why not use targeted subpoenas? Because in fact, if what you're looking for is whether Bob and known bad guy are within two hops of each other, well, you can subpoena known bad guy's records, because he's a known bad guy. You can subpoena Bob's records, given a 20% prior; all you need is reasonable articulable suspicion, which is a pretty low standard. And so you can get both of their records, you can construct both of their neighbor sets, and if their neighbor sets have any node in common, then Bob and known bad guy are within two hops of each other. So you can do this with targeted subpoenas, and you don't really need to collect all this data in advance. And so this calls into question: once you've moved to a two-hop scenario, why do you need this preemptive collection at all? The only case where you couldn't do it with subpoenas is where you have suspicion that is substantially above the base rate but doesn't yet reach the level, which is relatively low, where you can get an order for this information. So there's a little window of prior probability of Bob being a terrorist in which you can't subpoena Bob's records, but you might be able to get above that threshold by finding Bob in the near neighborhood of a known bad guy. The other thing is that subpoenas take some effort. So I'm just wondering, what's the scale here if you were going to follow that strategy? A thousand, a million, ten million subpoenas? It can't be ten million, right? If you assume that the number of terrorists in the domestic US population is relatively low, there can't be that many people for whom you have a substantial prior that they're a terrorist. So it can't be that large. I think the answer is it's already on the order of a few hundred that are on the list, and that probably won't go up a lot. Can they go further, get orders, and listen to the content if they want to? If they have strong enough suspicion; that requires stronger suspicion.
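The neighbor-set argument a moment ago can be written down directly. A toy sketch (the numbers and the `contacts` dictionary are hypothetical; real records would come from the two targeted subpoenas):

```python
def within_two_hops(x, y, contacts):
    """x, y: phone numbers. contacts[n] = set of numbers n's phone
    has talked to, obtainable for two specific people via targeted
    subpoenas. x and y are within two hops iff they talked directly
    or their contact sets share a common number."""
    return y in contacts[x] or x in contacts[y] or bool(contacts[x] & contacts[y])

contacts = {
    "bob":     {"alice", "carol"},
    "bad-guy": {"carol", "dave"},
}
print(within_two_hops("bob", "bad-guy", contacts))  # True: both talked to carol
```

The point is that this check needs only the two subpoenaed record sets, not a preemptively collected graph of everyone's calls.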
There's a Supreme Court precedent that arguably says that getting someone's phone metadata is not a search under the Fourth Amendment, and therefore the warrant requirement doesn't apply, but that getting content does require a warrant. And so there's a much higher standard to get access to content, even for foreign intelligence purposes. They can get content if they have strong enough suspicion. So in this scenario, for a known bad guy, they're presumably listening to all of his calls, monitoring all of the known bad guy's communications continuously. Bob, with the 20% prior, I don't know if they can get his content. This is begging for one of the eigenvector-based prestige models, where I inherit, with some decay, the prestige of nodes one hop or two hops away, for sure; because you would think that there's a very different interpretation of a link to a bad guy along only one path versus along five different paths, where you can find some clique. Have you done any analysis of that? I haven't done analysis of that; I wouldn't be surprised if they're doing something like that. One has to parse very carefully the statements that government people make about these programs to understand what they're doing, and one needs to pay attention to which statements are made under oath and which are not. I think the contact-chaining analysis is not that sophisticated; the redacted thing may be sophisticated, and they're probably using sophisticated analysis on the non-domestic data set. One of the challenges they have here is that, because the court order they get specifies what the algorithm is, they need to be able to explain the algorithm to a judge, and so it can't be too complicated.
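The questioner's idea, inheriting suspicion with per-hop decay so that several independent paths to a known bad guy count for more than a single path, can be sketched as a simple iterative propagation. This is purely illustrative: the decay value, the update rule, and the toy graph are all assumptions, not anything the NSA is known to use:

```python
# Illustrative decayed-suspicion propagation: seeds (known bad guys)
# start at 1.0; each round, every node additionally inherits `decay`
# times the previous-round suspicion of each neighbor. A node reachable
# through five distinct intermediaries therefore ends up with a higher
# score than one reachable through a single intermediary.

def propagate_suspicion(edges, seeds, decay=0.5, rounds=2):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    score = {n: (1.0 if n in seeds else 0.0) for n in adj}
    for _ in range(rounds):
        prev = dict(score)
        for n, nbrs in adj.items():
            score[n] = prev[n] + decay * sum(prev[m] for m in nbrs)
    return score

# Toy graph: "hub5" reaches the bad guy via five intermediaries,
# "hub1" via only one.
edges = [("bad", f"p{i}") for i in range(5)]
edges += [(f"p{i}", "hub5") for i in range(5)]
edges += [("bad", "q"), ("q", "hub1")]
scores = propagate_suspicion(edges, {"bad"})
print(scores["hub5"] > scores["hub1"] > 0.0)  # True
```

With decay 0.5, hub5 ends at 1.25 (five paths of two hops) versus 0.25 for hub1, which is exactly the multi-path effect the questioner was pointing at.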
Actually, you have to pity the poor judges; the judges don't have the support or help of anyone who understands technology. In a regular case, if there's a criminal defendant and there's a fight about something like this, the judge, before deciding whether information can be used, has a bunch of options: there may be an expert witness on behalf of the defendant, or the judge can get a court-appointed expert or a special master, someone who has expertise, to help with this sort of thing. But these judges in the Foreign Intelligence Surveillance Court don't have that. That's one of the reforms that people have talked about, but it's not clear if it's going to happen; some day perhaps Congress will pass a law letting these judges actually get technical help, but right now they really don't have it. The judge hears only what the government tells them: they'll get a statement under oath from the chief technology person at the NSA, but that's really the only source of technical input that the judges get, and I think they really struggle with it, and they know they're struggling with it. By the way, the same thing is true in Congress. In Congress, it's the members of the intelligence committees in the House and Senate who have access to classified briefings about this stuff, plus a very limited number of staff who are allowed to hear and participate in these briefings, very limited, and that doesn't include anybody with technical expertise. Now, back in the good old days, my congressman from Princeton, Rush Holt, who is an actual scientist, was on the House Intelligence Committee, and he understood things like statistics and graphs and so on, but he's not on that committee anymore, and now he's retired. Let me move on. I just had a quick question: if I understand correctly, the NSA is allowed to do essentially whatever it wants with foreign data, pretty much without a process requirement. Is GCHQ also allowed to do whatever they want with foreign data, and if so, can they give it to the NSA?
GCHQ is the British equivalent of the NSA, and as I understand it they can do overseas stuff, and there's a question about whether they can just collect on each other's citizens. The belief is that that's not okay. That seems like common sense: the NSA and GCHQ can't just say, we'll each spy on the other's citizens and then exchange the data. But has that common sense been confirmed? Well, there are a few questions. One is, has a court ruled on it; the other is, has the NSA General Counsel or some authoritative DOJ lawyer given NSA people guidance on this. I don't know for sure, but I would be very surprised if NSA people believed it was legal to do that, so they probably don't. Are there any instances of data that they could exchange? Yeah, some results could be exchanged, and there's this other tricky issue of information where they don't know whether the person is a US person or not, and the question of how much obligation they have to try to find out; that's another way to slide around the rules. Let me move on to the second part of the analysis and talk, pretty briefly, about possibilities for reorganizing the data. This, again, is Recommendation 5 from the president's review panel, which says we want to transition as soon as possible to a system in which the metadata is held not by the government but by some kind of private party. The president, in his speech, told the intelligence community and the Attorney General, a bunch of lawyers, to develop options for a new technical approach that matches current capabilities and so on, without the government holding this metadata itself, and they're supposed to report back on March 28. So here's roughly how the system works now, very schematically: you have a bunch of data, you have computing resources, you have an NSA analyst, all of this within the NSA. And so one proposal that's been made is to move the data outside of the NSA and have some kind of private-party custodian that holds the data rather than the NSA holding it,
or of course you can move the computing across that boundary, or, if you're a computer scientist at all, you want to put some computing on both sides. Yes? Obviously there's no difference in the system; you're just shifting things around, the NSA handing the computing over to a Google or a Facebook, right? The analyst could do the same stuff. Well, to the extent there are differences, the differences go to things like the risk of abuse; the abuse scenarios are somewhat different. For example, one question is whether this stuff is kept in a data center that NSA people can badge into, or one that only the custodian's people can badge into, and so on; would it be a felony for an NSA person to go into that data center? So there are various factors that might be different. There is increased accountability at this boundary, in the sense that it's easier for this entity to log what passes across the boundary. So there are some differences, but in practice what this probably leads to is that the contractor who operates this program now continues to hold the data, and they just change the access control lists on some data and groups, so maybe that's not so great. This next one is a little bit more interesting: keep the data at the providers. The providers are already, in many cases, retaining this data for their own business purposes, so why not let them keep it, maybe contributing some computing, but hold the data outside the NSA. So how are we going to design this? Well, we're going to design for the usual things computer scientists care about: performance, cost, and reliability. But we're also going to design for oversight; that is, we want an architecture that makes it easy for the legal and political processes that are supposed to limit and control what the NSA does to operate. So we want to optimize for that. How, in the current system, does the data get into the NSA's data center? It arrives on a regular schedule:
there's a daily transfer; the court order actually requires every covered provider to turn over data on a daily basis. How exactly that happens, we don't know; whether a courier delivers it or whether it goes across a wire, I don't know, but it is daily delivery of complete records for each day. For the foreign ones, they just scoop it up wherever they can get it; they kind of capture it off a wire somewhere. That's the best they can do? Yes; for the domestic stuff it's a legal process, the phone companies under court order delivering data. So this is what you want to optimize for, and you can come up with some easy design principles. Try to avoid replication of data: the more places the data is, the more opportunities there are for abuse. Try to avoid aggregation of data into one place: that also increases the possibility of abuse or breach. Think about not just storage but processing: when lawyers talk about this they always talk about storage, but we're computer scientists, we actually care where the computation happens, for the obvious reasons. And we want to design for accountability as well as the other things, and we'd like to use actual computer science in thinking about this, instead of just saying, gee, that looks hard. Okay, so let's look at this design. This is the most attractive approach with respect to those design goals: you avoid aggregation of the data, and you avoid replication. One of the key questions is how long these providers hold the data. Do they hold it only as long as they would for their own business purposes, deleting it as they do now, or would they actually be required to retain it for some length of time? That would require a new law; it would be controversial, but it's a thing that might happen and that's been discussed. Obviously it's best for privacy if the data is only held for as long as the company would hold it anyway. All right, so what kinds of stuff can we do in this kind of
setting? One question is: can we support simple warrants to get data about individual people? And it turns out, yeah, cryptographers have been all over this stuff for a while. Probably the best single publication on this is a blog post by Seny Kamara from last summer, but the idea is pretty much this. You can have a protocol in which the telecom company encrypts its database in a particular way: it basically encrypts each record with a key specific to that record. When the court issues a warrant, it cryptographically and publicly commits to the fact that a warrant has been issued for a particular number, without revealing which number the warrant is for, and it sends the NSA the key to unlock that commitment. Then, when the NSA wants to execute the warrant, a secure multi-party computation runs between the telecom and the NSA: the telecom puts in its master key, which is used to generate the individual record keys, and the NSA puts in the identity of the number it wants to search and the warrant, as well as the unlocking key for the warrant. The secure multi-party computation, by the usual cryptographers' methods, verifies that the warrant matches the number whose records are being requested, computes the record key for the requested phone number, and gives that key to the NSA, all without revealing anything more. The telecom learns only that some warrant was executed; the NSA learns only the unlocking key for that one record, which it then uses to unlock that record. So this has all the properties you might want: it reveals the least possible information consistent with the phone company actually holding title to and control of the data, the NSA doesn't have to leak information about what it did, and yet you have accountability. So that's great. Okay, but isn't there a huge flaw
in the system, in that you have the world's highest concentration of decryption talent and big iron, and you're handing your entire, quote, encrypted, unquote, database to this group of people who are the best in the world at breaking systems? Isn't the proposal sort of dead in the water for that reason? I'm not so sure that that's true, and the reason is that if the NSA engaged in some kind of big project to try to break this particular system, that would be a clear violation of the law, and it would require a conspiracy that would have a higher chance of getting discovered, compared to a scenario where the data is just sitting in a data center somewhere and every disgruntled participant has access to it; because we know that a disgruntled NSA participant can get a lot of data. So to accountability you would want to add auditability? I think auditability is part of accountability: one of the ways you get accountability is that records can be audited, so that if there is a dispute or a challenge later, you can unwrap a record of what happened and verify that it was done properly. All right, so that's simple warrants. You can also ask about the computations on the call graph: can we do those computations when the data is segregated like this? Well, remember, these are the two computations; the first one is easy to analyze, the second one not so much. Let's look at contact chaining first. Contact chaining is easy: contact chaining is essentially a form of breadth-first search, and breadth-first search is really easy to parallelize in this setting. With basically two round trips to each provider you can construct the two-hop neighborhood of any starting point. That's easy to do; it's like a freshman programming assignment. So that's good news. The other computation, as I said, is kind of hard to analyze; it depends on what exactly the algorithm is. But we can say some things. If it's MapReduce-style, some kind of computation on each separate data item that then aggregates the results together tree-wise, then it's very easy to
do the map part at each provider, do a reduction within each provider, and then send results back for final combination at the NSA. So if it's MapReduce, that works pretty efficiently, and there are some hints that that may be what they're doing. For example, we believe, based on some records, that the data is stored in a system called Accumulo, which is the NSA's version of BigTable and the NSA's favorite platform for this kind of processing, so there's some indication that they may be doing MapReduce-style computations, which parallelize well. Why is it so important to hide the fact that they're doing it? Yeah, so let me give you an example, and here's the next one: similarity search. One reason why they might be hiding the particular analysis is that the analysis is designed to counter some particular kind of tradecraft that adversaries are using, which the adversaries think is working. For example, it could be that the adversaries are switching phones: throw away a phone, get a new burner phone, and switch to it. And so you might use some kind of algorithm, a form of similarity search, to try to counter that. That is to say: well, gee, this phone went inactive at some time; among all the phones that went active at about that time, let's look for one that has a neighbor set similar to the previous known bad phone's. If you're doing something like that, it's certainly to your advantage if your adversaries believe that switching phones works, when in fact you have a method against it. So that might be the kind of thing that's going on here; that's one reasonable guess as to what this redacted analysis is. Presumably it's being withheld for a reason like that, that it has something to do with countering a particular form of tradecraft. If only 20 to 30 percent of calls are being collected, and you believe that most of those are landline, that means a very small fraction of mobile ones, so what you just said doesn't even apply to what they're collecting. I
mean, you actually get to another question, which is: is there some technical or historical reason why it would be landline rather than mobile, given that they all collect billing records? This, to me, is one of the biggest unanswered questions about how the program operates, and there are different hypotheses. One is that what they care about most is one-end-foreign calls; that is, what they care about is whether people tend to call the same foreign numbers. If that's the case, and if you have ubiquitous collection overseas, especially in certain target areas where calls are the most interesting, then maybe you're fine with 20 or 30 percent domestic collection. So that's one possibility: they just don't care that much. As to why it's primarily landlines, that's less clear. What I hear, and NSA people don't talk about this 20-to-30-percent business or the reasons for it on the record, so one has to pick up indirectly, through reporters' questions, what the reporters are hearing off the record, is that there are technical difficulties with inhaling mobile phone company records. That's kind of hard to believe; at least, it's hard to see why it would be the case. One possibility is that in fact they don't care that much about domestic collection and they just let it slide: in the early years of the program they went to landlines first, because when people in intelligence agencies think about phones, that's what they think about, and then later, given the trend toward mobile, they just haven't caught up, though it seems likely that they actually lost coverage of certain carriers. Now, there's another issue, which has to do with foreign ownership, or partial foreign ownership, of some mobile phone companies, and there may be a desire not to serve these kinds of orders on non-American companies. I've heard that speculation; I'm not sure
whether that's the case or not, but it is a little bit of a mystery, because the work to get another court order is negligible, and the technical work to be able to inhale data in whatever format the next provider uses isn't that hard, so it's hard to see why they're not doing it. You're left with the possibility that it just doesn't matter to them, that domestic calls don't matter very much in their analysis; that's a possibility. But it's hard to see how they could legally make up for that some other way. What about picking calls up over the radio? Picking it up over the radio by itself would not actually be legal without a separate warrant. Well, what if the military does it anyway; have they said that they don't collect the data that way? They've said that they don't collect data domestically over the air from mobile phones without a targeted warrant, which could mean that with an individualized warrant they will do pretty much anything to get data on an individual, but they don't sweep up mobile phones domestically in a broad way without a warrant. There's another piece of the pie, besides mobile, where they don't seem to have good coverage: VoIP. We know that they use individualized warrants to collect data from VoIP providers, especially Skype, but they probably don't have much coverage of VoIP metadata pursuant to these broad orders domestically. (Inaudible question about whether similarity search is still sufficient.) Let me just close; I'm happy to discuss more afterward. One of the things that's happening as a result of all of this discussion, and the mild collision between computer science and intelligence and lawyers, is some changes to the debate. But really, we need further changes to the public policy debate about this in order to get to a better place: to have it be not just a debate about security against privacy, but also a debate about how to achieve accountability, and in particular how to use the things that computer scientists know how to do to get enhanced accountability
within the system. Often, people who don't think like computer scientists just don't imagine that accountability is even possible in certain settings where even one of our students would know immediately that it is, and so one of the things we can do is talk about accountability solutions in this kind of space. There's one other way in which the debate needs to change, and maybe it's starting to, and it's exemplified by two quotes I'm going to give you. This is a column by Walter Pincus in the Washington Post, actually on Christmas Day 2013, and he is a very well-connected and knowledgeable reporter on national security and intelligence matters. I'll blow up the part I'm most interested in. This column is his attempt to summarize the views of some people in the intelligence community about all of this. So he asks: should the United States engage in secret, covert, or clandestine activity if the public cannot be convinced of the necessity and wisdom of such activities, or should they be disclosed? To intelligence professionals, he says, that's a bizarre question. Now, to me this is the question we should start with in discussing this; from an abstract public policy standpoint, it seems like we should be starting with this question. But to some of the intelligence professionals it's not just an interesting question, it's not just a strange question, it's, what is it, a bizarre question. And he goes on to say, again summarizing the view of many in the intelligence community rather than giving his own view: the prime reason for secrecy is that you don't want the targets to know what you're doing. Okay, that makes sense. And in democracies, another reason is that you don't want your own citizens to know what their government is doing, and you have to keep it secret, as long as it's within the country's laws. Now, this to me is really, really troubling, and the idea that many people in our intelligence community think
this way is, I think, really bad news, because it amounts to an assertion that aggressive intelligence collection is incompatible with democracy: that the public and the political process cannot have meaningful oversight over intelligence, that people shouldn't know, "as long as it's within the law." But of course the law is made by members of Congress, who also are relatively uninformed about these things; they have a little more information than us, but not much, and we've seen over and over members of Congress reacting with anger on learning things that one would have expected the overseers to know. So this is really worrisome, and to me it's part of the reason that we've gotten into the mess that we're in now. The good news is that just last week, James Clapper, the Director of National Intelligence, gave an interview in which he stated a different view. He said the problems facing the community over its collection of phone records could have been avoided. "I probably shouldn't say this, but I will: had we been transparent about this from the outset, right after 9/11, and said both to the American people and their elected representatives, we need to cover this gap, we need to make sure this never happens to us again, so here is what we're going to set up, here's how it's going to work and why we have to do it, and here are the safeguards." If we had done that, Clapper says, we wouldn't have the problem we have now. This is the kind of thinking we want to see more of from our intelligence officials: not that intelligence collection is incompatible with public knowledge, but that public knowledge, and public and political buy-in, is a necessary prerequisite to actually engaging in programs like this. We don't need to know necessarily who you have a warrant to spy on, but we want to know what the conditions are for getting a warrant, how many there are, and what happens when you have one. So this, at least, is a sign of progress in the bigger political situation.
Yeah, but this was a secret program that they tried to push through quietly, and the public only learned about it through leaks; there wasn't even a chance to react to it. Well, it's true, it was not good policy to do this, and they got away with it for a long time. In fact, although the program is now under a legal umbrella, for the first few years there was no legal authorization at all, other than a memo signed by the president, and so the program was, I think, pretty clearly illegal in those first few years. It then came under the legal umbrella, but based on an arguable legal theory. Had the intelligence community actually gone to the public after 9/11 and said, here's what we're doing, there would have been a debate; it probably would have been ratified at that time. But when the program came up for reauthorization in Congress three years later, and three years after that, and so on, we could have had a debate about whether we wanted to keep doing this and whether the safeguards were good enough, and we didn't have the opportunity to have that. The only way we could actually have a public discussion was via these leaks, and now we're in this painstaking back-and-forth process of extracting information from the government: by reporters, by analyzing leaked documents, and by the byproducts of litigation. This is far from ideal, but at least we're having a conversation, and at least Mr.
Clapper is saying that maybe they should have done it a different way from the beginning; that's a little bit of good news. Excuse me, a question and a comment. The question, purely with a computer scientist's hat on: is there a devil's advocate argument that says, if you anonymize who the parties are, either via encryption or some injective function, wouldn't you want, for your machine learning to be at its best, to give it the best possible sample of ones and zeros in the classification of good guys and bad guys, to see what an ordinary first-generation immigrant does when he calls his hometown back in Asia or wherever, so that you can improve your algorithms? Here, the notion that unless you have a warrant you can't do much, which is where this talk is going, is sort of as if you can't do any unsupervised learning to thin out your features, nor can you have a good sample off which to train your algorithm, including the good cases and the bad cases. So are there any thoughts that say, well, the key is to figure out how to fudge the data, not to prevent the analysis, and in fact to make the algorithm more discriminating so there are fewer false positives? That's my question. Well, it's a good question, and there's no doubt that you can probably get better analysis results, at the margin, if you have more and more accurate data. That said, I think the intelligence agencies are far from the cutting edge in most of the analysis that they're doing; from what we have seen, I think it doesn't get to the level of sophistication that a Stanford graduate student in machine learning could manage, let alone what you might expect from a very large and well-resourced agency. That's one thing I would say. The other thing is that the warrant requirement is not really a matter of policy for us; it comes from the Constitution, and legally there's not really flexibility on that. There's only the anonymization: if you somehow, via the fudging of
who the endpoints are, could protect the privacy of individuals in the way the dataset is analyzed or used, then you might be able to argue that a warrant is not required. But obviously you have to make a technically sound argument in favor of the privacy-preserving properties of what you're doing, which is pretty hard to do. My concern, from a policy standpoint, is that if you go down that road, what you get is very lame anonymization. For example, I haven't mentioned it in the talk, but all kinds of people, including the president, have said: we don't collect names, we don't collect your name, we only collect your phone number, so it's not personally identifying. That doesn't pass the laugh test with any computer scientist, and yet the President of the United States has said it as a justification for the program. So my concern about this from a policy standpoint is that if you say we're going to allow the data to be used in an anonymized fashion, what you're going to get is something like that: first an argument that it already is anonymized, and if that doesn't stand up, then something that's only marginally better. Let me just say one more thing very quickly in response to that. If you look at the history of the Fourth Amendment, the requirement of an individualized warrant was a response to general warrants, which said that the king's agents could go and search anywhere they wanted in investigating a particular thing. It was that delegation of discretion to the law enforcement officer, to decide what to search, which the Fourth Amendment's particularized warrant requirement was designed to prevent. There really wasn't a consideration of machine learning and big-data analysis, of trying to build models, in that thinking. But you can imagine that someday someone would try to make a novel
argument in court that some process is sufficiently privacy-preserving that, even though it uses your data, no warrant is needed. I think we're pretty far from the point where that kind of analysis would be able to convince a judge, and I think that's actually good. So I'm still seeing a big discrepancy here: imagine for a moment that we somehow managed to achieve this wild public policy victory, all of this analysis is done with multi-party computations, the organization commits to everything in advance, and the public sees everything they're doing. And then, in order to implement this, at the end of the day they contract with RSA and get it implemented with state-of-the-art Dual EC, and somehow there is an internal mechanism by which they can actually do whatever they want if they really want to. Do you see any hope of preventing that, or what kind of political climate would enable this? I think there are actually some benefits to a technological protection that is not perfect but that requires knowingly breaking it; this is like locking the information up. It doesn't make it impossible to get, but it's a big, bright blinking light that says you are violating expectations and the law by breaking it. I guess what I'm getting at is that they can be arbitrarily more creative than just doing cryptanalysis. That's right; you obviously need to think carefully about how to design these things. Really, this is one of the reasons why, in the policy world, people think in terms of oversight rather than prevention, accountability rather than prevention: there are so many ways to work around any kind of restriction you can put in place. The restrictions do two things: they provide a boundary which people know they shouldn't cross and whose crossing will be penalized, number one, and they tend to increase accountability, by doing things like increasing the size of the conspiracy required to do something improper. Those are valuable properties as well. Yeah, it just seems at odds
with the spirit of the intelligence community to ever prove a universal, in a sense, to ever say: not only are we not doing this under this program, but here is a list of all the programs, so you can see we're not doing it at all. Well, they're not going to give a list of all the programs, but there are two kinds of denials you get. One is "we are not doing X," and the other is "we are not doing X under this program." People learned pretty quickly that the under-this-program denial is not really a denial at all, and if you look at things like the transcripts of congressional hearings, starting in June of 2013, pretty quickly you see members of Congress asking the follow-up question when they get an under-this-program answer, saying: I'm not asking about this program; in general, can you give me a flat statement of whether you are or are not doing X. As a consequence, we're getting a lot fewer of these under-this-program cases. What you are getting is caveats, and the caveats have changed: it's less common now to have the under-this-program caveat, and more common to have caveats like "domestically" or "of US persons" and so on, and of course there's always a caveat about targeting. One last question. So when we have these discussions about policy going forward, saying that we need to change the program, do they actually acknowledge the abuses in this discussion in some sense? It wasn't really a good start; I was thinking, for example, of all the documented abuses that these reports describe. Yeah, so what's tricky is that you can imagine two kinds of mechanisms. One kind of mechanism tries to enforce strong, computer-science-style access control to data, where some technical condition needs to be satisfied before something can happen. That turns out to be hard to do in practice, because the criteria involved might involve
evidence that's outside the system, right. But the fact remains that these abuses happened, and they came to light through leaks of classified material; how is the system protected against them? Do they just say, in the abstract, oh, we will make sure that's not happening?

Well, there are two answers to that, I think. One is that even if there's human discretion in the system, there are things you can do to design for accountability: require a reason to be given and recorded at the time by the person making the decision, and have some kind of auditing of those decisions. So that helps.

As for how much impact this has on the discussion, the answer, unfortunately, is that rhetorically it has less impact than you might think. The tendency is to say that any future abuses are purely hypothetical and we don't need to rebut those because they haven't happened; any past abuses are water under the bridge and we've already fixed that problem; and any current abuses, well, we don't know of any current abuses. So there's a tendency to argue away particular localized abuses in the system, which makes systemic abuse hard to discuss.

The other thing that tends to be under-recognized is systemic error. For example, there was a design problem which caused contact chaining to be done from thousands of identifiers without the appropriate legal authorization, because there was some condition the code was supposed to check but didn't, or some labels that were supposed to be on certain data but weren't. That gets ignored, and it's a real problem. It really stems from the unfortunate fact that fact-based debates have difficulty happening in the political world. It's hard to make facts stick, and people are not afraid enough of getting caught misleading people about the facts. In our community, on our good days, we are engaged in a collective search for truth; we're at least pretty embarrassed if we get caught saying something false in front of a big room, especially if we knew it was false at the time we said it.
But in political debate that's sometimes just an accepted thing to do, and that's one of the most frustrating things about going to Washington. Thanks, everybody.
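The "accountability rather than prevention" design discussed above, a recorded justification at query time plus auditable decisions, can be sketched in a few lines. This is a minimal illustration only, not a description of any real agency system; the class, field names, and hash-chained log format are all assumptions made for the example.

```python
# Hypothetical sketch of accountability-by-design: every query must
# carry an analyst's recorded justification and satisfy a technical
# precondition, and every decision (allowed or denied) is appended to
# a hash-chained audit log that an overseer can review later.
import datetime
import hashlib
import json

class AuditedQueryGate:
    def __init__(self, precondition):
        self.precondition = precondition  # e.g. "approval flag is present"
        self.log = []                     # append-only audit trail

    def query(self, analyst, selector, justification, approved):
        allowed = bool(justification) and self.precondition(approved)
        entry = {
            "time": datetime.datetime.utcnow().isoformat(),
            "analyst": analyst,
            # Log a digest of the selector rather than the raw value.
            "selector_hash": hashlib.sha256(selector.encode()).hexdigest()[:16],
            "justification": justification,
            "allowed": allowed,
        }
        # Chain each entry to the previous one so after-the-fact
        # tampering with the log is detectable.
        prev = self.log[-1]["chain"] if self.log else ""
        entry["chain"] = hashlib.sha256(
            (prev + json.dumps(entry, sort_keys=True)).encode()
        ).hexdigest()
        self.log.append(entry)
        return allowed

gate = AuditedQueryGate(precondition=lambda approved: approved)
ok = gate.query("analyst7", "+1-555-0100", "RAS determination #123", approved=True)
denied = gate.query("analyst7", "+1-555-0199", "", approved=True)  # no reason given
print(ok, denied, len(gate.log))
```

Note that the denied query still produces a log entry: the point Felten makes is not that the gate is uncircumventable, but that circumventing or abusing it leaves a record and enlarges the conspiracy needed to hide misuse.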