Good morning everyone. Thank you for coming today. Thank you to those here in person as well as those watching on the live stream for joining us this morning. My name is Benjamin Lennett. I am the Policy Director for the New America Foundation's Open Technology Institute. Before we get started, let me say a few brief words about New America and OTI. New America is a non-profit, non-partisan public policy institute that invests in new thinkers and new ideas to address the next generation of challenges in the United States. We have programs that work on a range of public policy issues including education, health care, and economic growth. My program, the Open Technology Institute, formulates policies to support open networks and open source innovations. We promote universal and affordable communications access through partnerships with communities, researchers, industry, and public interest groups. And we are committed to maximizing the potential of innovative open technologies. This morning we are talking about big data. Which paint color is most likely to tell you a used car is in good shape? How did Google searches predict the spread of flu outbreaks? How can data help government be more efficient and politicians win elections? The key to answering these questions and many more is big data: our newfound ability to crunch vast amounts of information, analyze it instantly, and draw sometimes profoundly surprising conclusions from it. To explain this brave new world we have with us today Viktor Mayer-Schönberger and Kenneth Cukier, close enough, co-authors of Big Data: A Revolution That Will Transform How We Live, Work, and Think. Viktor is a professor of internet governance and regulation at the Oxford Internet Institute at Oxford University. He is a widely recognized authority on big data and the author of more than 100 articles and eight books, of which the most recent is Delete: The Virtue of Forgetting in the Digital Age.
He serves on advisory boards of corporations and organizations around the world including Microsoft and the World Economic Forum. Kenneth is the data editor of The Economist and a prominent commentator on developments in big data. His writings on business and economics have appeared in Foreign Affairs, the New York Times, the Financial Times, and elsewhere. We'll start with a presentation from Viktor and Kenneth. We'll then sit down for a series of questions and then open it up to the audience for questions as well. So without further ado, Viktor and Kenneth. Hi, thank you very much. It's absolutely a pleasure to be here. In some ways it's particularly nice to be here because of the leadership that the New America Foundation has shown on all of these issues; this is the right home for discussing these issues here in Washington, D.C. I want to start by talking about big data by telling a story, and the story goes like this. It starts in 2009. Now we know that every year the common flu kills tens of thousands of people around the world. But in 2009, a new flu virus was discovered and experts feared it might kill tens of millions. There was no vaccine available. The best health authorities could do was to slow its spread, but to do that, they had to know where the virus was. The Centers for Disease Control had doctors informing them of flu cases that walked in. But despite their best efforts, the data collection was always a bit far behind. The analysis took time. So the picture of the pandemic that emerged was always a week or two late, which of course if you're trying to address a pandemic is like an eternity. But not long before the flu pandemic broke, engineers at Google had an idea. They developed what they thought was a better way of predicting the spread of the flu, not just nationally but down to individual regions in the United States.
They took 50 million of the most common search terms that Americans searched online, from the over 3 billion queries they get every single day. Google then compared when and where those terms were searched with flu data from the previous years from the CDC. And the idea was to predict the spread of the flu virus using Google search queries. And they struck gold. They identified 45 search terms that, when taken together in a mathematical model, could predict the spread of the flu with a high degree of accuracy. So what you're seeing here is the CDC data, the real data of when and where the flu outbreaks are happening across about five years, overlaid with Google's data. But the difference is that where the CDC has a two or three week time lag, Google's Flu Trends is almost in real time. So strikingly, Google's method doesn't involve distributing mouth swabs or contacting physicians' offices. Instead it's built on big data, the ability to harness data to produce novel insights and valuable goods and services. A world awash with data. The amount of digital data that is being collected is growing fast, doubling a little more than every three years. It's like Moore's law all over again. The amount of stored information in the world is estimated to be around 1.2 zettabytes, of which less than 2% today is non-digital. But we suggest big data is about more than just volume. We suggest in the book that there are three defining and reinforcing qualities that characterize big data: more, messy, and correlations. So first, more. Today we can collect and analyze far more data about a particular problem or particular phenomenon than ever before, when we were limited to working with just a small sample. That gives us today a remarkably clear view of the granular, the details, that conventional samples of data just can't assess. Importantly, as we capture so much data we are no longer forced to use data simply to confirm or to disprove a concrete hypothesis.
Instead we can, as the experts say, let the data speak. And that often reveals insights we never expected. Let me hand over to Ken for an example. Great. So consider DNA analysis. Since 2007 the Silicon Valley startup 23andMe has been analyzing people's DNA for only a couple hundred dollars. The technique reveals the genetic codes that make people susceptible to certain diseases like breast cancer or heart problems. But there's a hitch: 23andMe sequences just a small portion of a person's DNA, places that are known to be markers indicating particular genetic weaknesses. Meanwhile billions of base pairs of DNA remain unsequenced. Thus 23andMe can only answer questions about the markers it considers. Whenever a new marker is discovered, a person's DNA has to be sequenced again. Working with a subset rather than the whole, 23andMe cannot answer questions that it didn't consider in advance. Apple's legendary CEO Steve Jobs took a totally different tack in his fight against cancer. He became one of the first people in the world to have his entire DNA sequenced, as well as that of his tumor. To do that he paid a six-figure sum, hundreds of times more than the price 23andMe charges. But in return he didn't receive a sample, not a set of markers, but a data file containing the entire genetic code of his body and that of his tumor. The result is that Jobs's doctors could select therapies by how well they would work given his specific genetic makeup. Whenever one treatment lost its effectiveness because the cancer mutated or worked around it, the doctors could switch to another drug. Jobs at the time had said, I'm either going to be the first person to outrun a cancer like this or I'm going to be the last one to die from it. Now the prediction went unfulfilled, but the method, having all the data and not just a little bit of it, gave him years of extra life. That's more. The next quality of big data is messy. That is, we can embrace the messiness of data.
Looking at vastly more data permits us to loosen up our desire for exactitude. When our ability to measure in the analog times was limited, we had to treat what we did bother to quantify as precisely as possible. In contrast, big data often is messy and it varies in quality, but rather than going after exactitude of measuring and just collecting small quantities of information at great cost, in the big data age we'll be accepting messiness. We'll often be satisfied with a sense of direction rather than striving to know a phenomenon down to the inch, the penny, or the atom. We don't give up on exactitude entirely. We only give up our singular devotion to it. What we lose in accuracy at the micro level we gain in insight at the macro level. Ken will provide you with another example. So the example is machine translation. In the 1990s IBM researchers used a very precise set of documents, Canadian parliamentary transcripts available in both French and English, to train the computer. It worked in principle: statistical machine translation means comparing the two sets of documents and trying to find the statistical probability that a word in one language is the suitable word in the other. It worked, but the quality wasn't great. Around 2006 Google marched in. Instead of using just a few million sentences of clean translations from the Canadian government, they used everything they could get their hands on. They used the entire global internet, harnessing translations of varying quality: corporate websites from multinationals that were in multiple languages, books from their book scanning project that existed in multiple translations, EU documents in all 21 languages. And by doing all of that there was a trade-off: the data was much less clean, but the vast increase in scale greatly improved the quality of their translations. So if you will, more messy data trumped less but cleaner data.
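The word-alignment idea at the core of statistical machine translation, counting how often words co-occur across aligned sentence pairs and picking the most probable correspondence, can be shown with a toy example. This is only a sketch of the intuition: the corpus is invented, and real systems (the IBM alignment models, Google's) use probabilistic models far beyond raw co-occurrence counts.

```python
# Toy word-alignment sketch: across aligned French/English sentence pairs,
# count co-occurrences and translate a word by its most frequent partner.

from collections import Counter, defaultdict

# A tiny invented "parallel corpus" of aligned (French, English) pairs.
corpus = [
    ("la maison", "the house"),
    ("la voiture", "the car"),
    ("une maison bleue", "a blue house"),
    ("une voiture bleue", "a blue car"),
]

cooc = defaultdict(Counter)  # cooc[french_word][english_word] -> count
for fr, en in corpus:
    for f in fr.split():
        for e in en.split():
            cooc[f][e] += 1

def translate_word(f):
    """Return the English word most often co-occurring with a French word."""
    return cooc[f].most_common(1)[0][0]
```

Here "maison" maps to "house" and "voiture" to "car" because those pairings dominate the counts, while an ambiguous word like "bleue" (which always appears alongside both "a" and "blue") shows why real systems need proper alignment models, and why, as the speakers say, sheer volume of messy data helps wash such ambiguities out.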
More and messy lead to the third. These two shifts that we discussed already, more and messy, lead to a third and most important change: a move away from the age-old search for causality. Humans are conditioned to understand the world as a series of causes and effects. If we fall sick the day after having eaten at a new restaurant, our hunch tells us that it was the food, even though it's far more likely we got the stomach bug from shaking hands with a colleague. This hunch of causality gives us a sense of comprehension in this seemingly inexplicable world. It's enlightening, it's comforting, it's reassuring, and often it's plain wrong. These quick hunches, these fast insights, are our brain's feeble attempts to offer explanations with a scarcity of facts. That may have worked when we needed to fight for survival in the hostile world thousands of years ago, when deciding in a split second whether or not to run was of the essence and inquiring more deeply might have been deadly. But as Nobel laureate Daniel Kahneman has so eloquently argued, in our complex world this fast thinking, as he calls it, these quick causal insights, often leads us down the wrong path. The alternative of course is to think slower, to reflect harder on causal connections; that's embedded in what we call the scientific method. So we run experiments to uncover causal relationships, and yet even the most carefully orchestrated double blind trial cannot prove causality conclusively either. Facilitated by big data, we now have an alternative method available. Instead of asking why, of looking for sometimes elusive causal relationships, in many instances today we can simply ask what, and often that's good enough. Big data correlations help Amazon and Netflix in recommending products to customers.
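This "what, not why" style of recommendation can be sketched minimally: items that tend to appear in the same purchase baskets get recommended together, with no causal model of taste at all. The baskets below are invented, and this co-occurrence counting is only an illustration of the idea, not Amazon's or Netflix's actual algorithm.

```python
# Minimal "what, not why" recommender: count how often items co-occur in
# the same basket and recommend the most frequent companions. Invented data.

from collections import Counter
from itertools import combinations

baskets = [
    {"tent", "sleeping bag", "lantern"},
    {"tent", "sleeping bag"},
    {"lantern", "batteries"},
    {"tent", "lantern", "batteries"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        pair_counts[(a, b)] += 1

def recommend(item):
    """Items most often bought together with `item`, best first."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [other for other, _ in scores.most_common()]
```

Asking `recommend("batteries")` puts "lantern" first simply because the two are bought together most often; the model never asks why campers who buy batteries want lanterns, and as the speakers note, for this purpose it doesn't need to.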
Correlations are at the heart of Google's translation service as well as its spellchecker. Again, they tell us what but not why, but that is good enough, because we need the information at a crucial time in order to act. Correlations do much more than improve consumer efficiency, however. A premature baby can fit into the hand of an adult. Until they are grown, these babies have a terrible vulnerability to even relatively benign infections. Dr. Carolyn McGregor in Canada works to give these babies the best chance to survive. Using big data analysis, after collecting a thousand data points per minute from these babies, Dr. McGregor discovered a surprising truth: whenever a little premature baby showed very stable vital signs, often the body wasn't stabilizing; rather, it was preparing for the onset of an infection. With this knowledge she could identify babies needing medication at a much earlier stage, before it was too late. If you will, it's the quintessential big data application. McGregor used much more data than ever before for these preemies, premature babies, collected through better sensors. She accepted, even embraced, that in such situations not all data is clean and accurate, and thus accounted for inexactitude in her analysis. But more importantly, she put aside the question of why, of identifying the exact cause of the temporary stabilization of the vital signs before the infection's onset. Instead she looked for what, and that allowed her to identify the infections before the overt symptoms appeared, so she could respond sooner with better treatments to save the baby's life. Now, often big data is portrayed as the consequence of the digital age, but that misses the point. What really matters is that we're taking things that were always informational but that we never rendered in a data form before. If you will, we're datafying them, to coin a term. Once it's in data form we can use it, process it, store it, analyze it, and extract new value from it. So think about it in terms of posture.
You're all sitting right now and you're all sitting slightly differently. You sit differently than your neighbors, obviously because of your weight, your leg length, your positioning, the distribution of your weight. It's very rare that two people should sit exactly the same. But what could we do with this if we were to datafy it? Well, researchers in Tokyo are doing just that. Think about it as an anti-theft device in cars: the car would know when it is being driven by someone other than its rightful owner. But if we analyzed millions of drivers this way, we might discover subtle shifts, perhaps in body posture, prior to accidents that would suggest driver fatigue. We'd be datafying, if you will, fatigue. And the car would know to send an alarm to the driver to be more alert. Datafication is also a core byproduct of social media platforms. For example, Facebook has datafied our friendships and the things that we like. Twitter datafies our stray thoughts and our whispers. LinkedIn datafies our professional contacts. Once things are in data form they can be transformed into something else. Traditionally, even in the early stages of the digital change, data was processed just for a primary purpose, with little thought given to novel reuses. But this is changing. The core economic point of big data is that a myriad of reuses of the information are possible that can unleash new services or improve existing ones. So the value of data shifts from the reason it was collected and the immediate use at its surface to the subsequent uses that may not have been apparent initially, but are worth a great deal. Another example is Inrix, a company in the Seattle area. Inrix has found a strong correlation between road traffic and the health of the local economy. That's an interesting insight: one can have a new indicator of whether the recession is improving or deepening. But there's more.
One investment fund uses the data of Inrix to examine the weekend traffic in the area around stores of large national retailers, since this correlates with sales. It uses that information to decide whether to buy or to sell the company's shares ahead of its quarterly earnings announcement. Unfortunately, big data, like every potent tool, also has a dark side, and one that, given its bleakness, we must learn about, aim to understand, and guard against. In the big data age, the value of data is so much greater than what can be extracted for the primary purpose. So much of data's value remains hidden, as I said. This puts big data on a direct collision course with our conventional privacy notion of notice and consent: telling individuals at the point of collection why we are gathering the data and asking for their consent. In the big data age, we simply do not know at the point of collection for what purposes we'll be using personal information. As a result, we are stuck with a conventional privacy protection paradigm that fails to be effective in the big data age. But that's not all. Yes, so privacy is an issue and still will be an issue. But a new problem emerges. If you will, it's the issue not of privacy but of propensity. We'll be predicting human behavior, what we are likely to do, and possibly penalizing us before we've even committed the infraction. Now this might sound like the idea of pre-crime from the movie Minority Report, where people are punished for crimes they have yet to commit. It is. But it goes far beyond policing. A case in point is Britain, where the National Health Service tells people in some instances that they're no longer eligible for life-saving surgical operations because the statistical prediction for them is so bleak in terms of mortality a few years after. So what can we learn from that? Predicting an individual's future behavior seems to be one of the most valuable features of big data.
Wouldn't it be great to predict crime before it happens so that we can prevent it rather than have to punish afterwards? Isn't prevention always better than punishment? And yet such use of big data would be terribly misguided. For starters, predictions are never perfect. They only reflect statistical probability, so we would punish people without certainty, negating a fundamental tenet of justice. Worse, by intervening before an illicit action has actually taken place and punishing the individuals involved, we essentially deny them the human condition: our ability to live our lives freely and to decide when and whether to act or not. In a world of predictive punishment, we never know whether or not somebody would actually have committed the crime that he was predicted to commit. We would not have let fate play out, holding people responsible on the basis of a big data analysis that can never be disproven. But let's be careful. Let's be careful. The culprit here is not big data itself, but how we use it. The crux is that holding people responsible for actions they have yet to commit is using big data correlations to make causal decisions about individual responsibility. If so abused, big data threatens to imprison us, perhaps literally, in probabilities. The third problem is one that's not unique to big data, but society needs to be vigilant to protect against it, and that is what we call the dictatorship of data. It's the idea that we may fetishize the data, that we may endow it with more meaning than it truly deserves. As big data starts to play an increasing part in all areas of our lives, the tendency to place trust in data and shut off our common sense may only grow. Placing one's trust in data without a deep appreciation of what the data means and an understanding of its limitations can lead to terrible consequences. In American history, we have fought a war on behalf of a data point: the war in Vietnam, and the data point was the body count.
It was used to measure progress when the situation was far, far more complex. So in the big data age it's critical that we do not follow blindly the path that big data seems to set. As Ken said, I am the lawyer, so I have to talk about control: how to prevent big data abuse and to ensure that the big data era can attain its potential. In our book we call for new strategies to protect privacy in this big data age, to protect and guarantee human volition, and to limit the dangers of relying too much on what the data dictates. Importantly, our recommendations include the creation of a new cadre of professionals. We call them algorithmists, and they would be specially trained to understand big data analysis. They would not only work at big data companies to examine and audit their big data analyses, to ensure the soundness of the methods and to protect people from harm, but also enable individuals and regulators to peek into what we call the black box of big data and its predictions and correlations. This could help, we suggest, to ensure the ideals of transparency, openness, and accountability that are deep and important features of the big data world. Of course, big data also requires more than safeguards for the individual to fulfill its amazing potential. For instance, we need to ensure that the data isn't being held by an ever smaller number of big data holders. Much like previous generations rose to the challenge posed by the robber barons who dominated the railway and steel manufacturing sectors in the 19th century, we may need to constrain the reach of nascent data barons and ensure big data markets stay competitive. So big data is going to help us understand the world better and improve how we make decisions, from which medical treatments work, to how best to educate our children, to how a car can drive itself. But it also brings new concerns. What is essential is that we harness this technology understanding that we remain its master.
And just as there is a vital need to learn from data, we also need to carve out a space for the human: for our reason, our imagination, for acting in defiance of what the data says. Because the data is always just a simulacrum of reality, and therefore always imperfect, always incomplete. As we walk into the big data age, we need to do so with humility and humanity. Thank you very much. Well, that was great. Thank you so much. And this is the book. There are copies out front to purchase. We are subsidizing it. And the book is actually great. It's quite interesting. You really take us through the entire landscape. What does it mean? What's the value of this data? I think something that is generally missed in a lot of these discussions is what are the risks? Because there are a number of risks. So one term that I really like in the book is this idea of datafication. I was wondering if you could unpack that a bit more and talk about all of the different components of data that are being collected. Not just numbers now, but being able to convert text into statistics. What does that look like? Let me give an example of just the iPhone and then I'll have Viktor talk more deeply about it. So an iPhone has roughly around 20 sensors in it. It's an extraordinarily high tech device. That's obvious. The uses of it are limited almost only by our imagination. So for example, the way that we walk. Here's an example. My telephone has a password, probably yours does as well, where you type in a code. That's really silly when you think about it. The phone is on you. The phone is on, and it has the sensors. So it could be monitoring how you walk at all times. It could be building an index of your gait. It's pretty probable that the way you walk is different from the way someone else walks. So it would get to know you intimately.
Why should the security feature ask you for a number code, when the phone could know that it's at a certain elevation, that it's always perhaps in your pocket, that you put it to your ear in the same way, that you hold it up in that same sort of swoop because your arm is a certain length, and all of that? You do it in roughly the same way every time. The phone should be able to know that for you. It could datafy these aspects of your life. That would be your security feature. Layered on top of it, you could apply a new application. So the application, by monitoring your mobility and the number of calls you make, perhaps your speech, might be able to tell that you're getting sick prior to you feeling the symptoms of the flu. This sounds like science fiction. It's not. It's already been done with earlier technology, about five years ago at MIT by Sandy Pentland. Looking at the patterns of usage of the mobile phone and mobility, he was able to predict that someone was coming down with the flu prior to that individual themselves knowing they were getting the flu. So when we datafy aspects of our life in this way, we can do radically new and interesting things that will add value to our lives. And what works on an individual level can also work on a societal level. So in Boston there is an application for the smartphone, and as you travel through the city and you drive over potholes, the phone shakes, and that shaking is recorded and interpreted as a pothole and sent to the city, so that the city for the first time has, or begins to have, a really accurate map of its potholes. That's really valuable, because that's not what they had before. We see examples in the commercial sector, we see examples with utilities. We tell a story in the book about New York's utility company having to prioritize which manholes to look at and to service before they blow up.
Turns out that there are dozens of manholes that blow up every year, and there was no really good way of prioritizing them. Now what they did was to employ big data. They took the last 100 years of maintenance records for each of those manholes. This was incredibly messy data because it was not recorded in a standardized format. They took it and said, we'll accept that messiness, we'll just hope that the volume of data that we have drowns that messiness out, and they were right. They were four times better able to predict exploding manholes than before, and thereby able to prioritize which manholes to maintain or to service. So there are lots of different ways to datafy the world, and we are just seeing the tip of the iceberg. I would emphasize two things about the pothole example. The first one is that the data was collected passively. That's different from the way that we have traditionally done this in the past, as with mySociety's sites in the UK, where you have to identify where the pothole is and let the government know; here it's collected passively. The second thing is that the data was collected for a different purpose than it's actually being put towards now. This is a secondary usage. The GPS data, which is at the heart of this aside from the sensor detecting the bump, was collected to route phone calls. So you're cross-applying data, or reusing it, getting a secondary purpose out of it beyond the primary purpose for which it was collected. So one of the mechanisms you mention in the book for datafication: Facebook and other social networks are major hubs for this kind of data collection. There was an Economist article in February called Stat Oil: Lenders Are Turning to Social Media to Assess Borrowers. I'm going to read some snippets from it and then point to something in your book, because I think there's a really interesting connection there. This is from the Economist article.
Some firms piece together scores by analyzing applicants' online social networks. Professional contacts on LinkedIn are especially revealing of an applicant's character and capacity to repay, says Navin Bathija, the founder of Neo, a startup that assesses the creditworthiness of car loan applicants. Neo's software helps determine if applicants' claimed jobs are real by looking, with permission, at the number and nature of their LinkedIn connections to coworkers. It also estimates how quickly laid-off employees will land a new job by rating their contacts at other employers. Neo also aims to improve accuracy by recording borrowers' Facebook data. Mr. Bathija reckons that within a year there will be enough evidence to determine if making racist comments on Facebook is correlated with a lack of creditworthiness. An additional example: applicants who type only in lower case letters, or entirely in upper case, are less likely to repay loans, other factors being equal, says Douglas Merrill, founder of ZestFinance, an American online lender whose default rate is roughly 40% lower than that of a typical payday lender. You write in the book about this notion of the algorithm: algorithms will predict the likelihood that one will get a heart attack and pay more for health care, default on a mortgage and be denied a loan, or commit a crime and perhaps get arrested in advance. It leads to an ethical consideration of the role of free will versus the dictatorship of data. Later on: big data can lead to penalties based on propensities, and I would add, penalties based on associations. What does this mean then for the future, when you have potentially what you write on Facebook influencing your credit score? Let me start and then I'll hand it to Viktor. In the first instance, some of the signals, using Facebook to determine if the person really has the job or not: it seems to me that what's happening here is done by the human eye.
A human person is looking at it and identifying if this makes sense. It's not, if you will, a real big data analysis. It's just what we've done in the past, but we now have this new tool to look at it. Now, it's rather more interesting when you think about whether racist people repay their loans more or less, and how you would use that information. It's kind of there to say, isn't this interesting. Likewise, Douglas Merrill's extraordinary finding that people who write in all caps or all lower case tend not to be as creditworthy as people who write in a normal way, normal being, however you want to put it, maybe the normal distribution, or what we ordinarily think of as the rules of grammar. That is just a pure correlation. We could come up with scenarios that seem plausible to us for why that might be the case, or it might be just ridiculous. Why even go there? That would be an instance where you'd say, hey, that's good enough. The troubling thing is that we will be, if you will, victimized by this, both by the private sector and by government, and it won't be clear to us why, because we don't know the causality. It's not even certain that right now we can talk about big data in a form that's explainable and comprehensible, because we're just at the outset of the big data age. But what happens when we want to explain why we face some sort of penalty or sanction, or are denied credit or an operation, based on an enormous big data analysis that looks a little bit like Flu Trends but where the explainability is missing? You could imagine that you were to knock on the door of a computer scientist or a data scientist and ask, why did I get denied credit, or why am I being put into jail? And they said, well, we've got a model that looks at a thousand different variables. There are about 80 different strong signals and there's a long tail of 200 weaker signals, and they're all of different weightings, and it all changes over time.
So it's hard to actually give you an answer. And with that lack of transparency in a process that can harm oneself, in terms of fairness, of justice, should that be tolerated in society? That's a debate we need to have. Indeed. And in the book, in the control chapter, we really look at categories of probabilistic predictions. And we quite clearly say that if probabilistic predictions are used to decide on a societal level whether or not to punish an individual, whether or not to hold an individual responsible, we need to be extremely careful. In fact we argue in the book that governments should never do that, because that is assigning individual responsibility. But at the same time we are cognizant of the fact that governments as well as the private sector need to make some educated guesses about how groups behave, or how individuals behave, in order to make decisions that are not necessarily assigning individual responsibility but are just policy decisions that need to be made. Or just decisions that companies make, that banks make, on whether to give out a loan or not. In these types of circumstances we suggest that if the decision is an important one, a central one about one's financial well-being, about one's own health, one's own life, for example, or one's own freedom, if such a decision is very important to an individual, even if that decision is being made by the private sector using big data analysis, there should be certain safeguards in place. There should be a way by which the individual can question the reasoning. That is where we want to open up the black box of big data, not to the individual alone, because if I'm told that there are 280 signals, 80 strong ones and 200 weak ones, and they're all combined in this very complex mathematical model, I'm lost at the first sentence they tell me. So I need to be able to avail myself of experts. That's what I do when I have a medical problem.
I go to a doctor. Or that's what I do when I have an engineering problem: I go to a civil engineer. And so we suggest that we really need a new cast of intermediary professionals, call them algorithmists, whom I could then go to and ask to help me. And what works, or could work, on an individual level we also foresee could work on a societal level, where regulators come in and say: okay, this is really strange here, this is really suspect. It could be perfectly fine, it could be really dangerous, we don't know, it's a black box; let's have a look at it, and let's avail ourselves of these intermediaries, basically data auditors, to use that term, to peek into that black box. We believe that that is a regulatory mechanism that is relatively lightweight and has a relatively high success rate. So you have the algorithmist, and maybe a mechanism to figure out what's happening, or why a decision was made. What I'm interested in, though, and I think you talk about this in the book, is whether this has a chilling effect on the way that we interact in society. If every move online has an impact on your credit score, or your ability to get a loan, or some healthcare issue, or who you're associated with; I mean, look at something like persistent poverty: what if the propensity data show that if you have associated with other individuals in poverty, you have this outcome? Does this fundamentally have a chilling effect on how we interact online, and does it reinforce existing social inequality? The interesting thing is that these feedback loops, these chilling-effect feedback loops, have been assumed to exist, particularly in the free speech area: if we have to assume that we are being watched, that the censors are watching us, will we self-censor? The truth is that, as with many social science problems, we don't have enough data.
We don't know whether chilling effects actually exist, to what extent they exist, and so forth. In the legal world a lot has been written about chilling effects, and I believe there is something there, but I don't have the empirical evidence. So what is very interesting about big data is that we can find answers, or at least shed some light, on some of these societal dynamics, including the dynamic of chilling effects, and then use that outcome, that understanding, that better insight, to inform our policymaking. I was going to echo that: we don't know yet, but I think there is a potential risk. Right now many people use Facebook as a sort of social barometer for themselves, and so they have a great interest in friending everyone who wants to be their friend. You would imagine people would be a lot more selective in their interactions if they knew that they were going to be denied medical coverage because they were friends with people who were slightly overweight. If that turned out to be a signal that people were using, for whatever purpose, you would imagine that interactions in the social sphere would look different. Now, we've never as a society thought about that before. We've never had that as an issue before, but as you've read from the article, that's the world into which we're stepping. So obviously the age of big data is bringing new challenges around new rules to safeguard the sanctity of the individual. Privacy is clearly one of those challenges. One of the things that I thought about with the book: one of the refrains from people when they're not concerned about online privacy is, it's only about seeing ads, online ads, or, I have nothing to hide. And so you write in the book: how can companies provide notice for a purpose that has yet to exist? How can individuals give informed consent to an unknown?
So much of our privacy debate is about: I'm opting in, I know what data I'm giving up, I'm a participant in this process. But if I have no idea that this post of mine on Facebook is later going to be used as a data point in some other analysis that I would never think about, how does that address this issue of privacy in the digital age? In the book we suggest that a way around this is to really rethink the notice and consent paradigm, the notice-and-consent-at-collection paradigm, and to consider instead accountability for data use. That is, the problem is not whether or not data has been collected, but what the data is being used for. We shouldn't hope that individuals, at the point of collection, can imagine all the potential purposes the data could be used or abused for and then make an informed decision; that's very unlikely, in fact we know it's almost impossible. Instead we should go to the data users and say: you're going to use the data, you're going to unearth and uncover a lot of its hidden value, and you get to do that without having to go back and ask for consent for every different purpose, which is a huge benefit to you. But in return you need to take on some accountability, accountability that has teeth. So can people opt out of big data? Yeah, there's still a place in Antarctica. Yeah, totally. We've seen people opt out. You're called the Unabomber. Ted Kaczynski opted out. Can you opt out of computers? Let me think about it; can you? I don't think you can. So the answer is going to be no. You can't opt out of autopilot when the plane is landing. And that, look, is not really big data, but it's certainly big processing power, and it's computational. So I think this is going to worm its way into all facets of human life.
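The opaque scoring model described earlier (a thousand variables, roughly 80 strong signals and a long tail of 200 weak ones, all differently weighted) can be sketched in a few lines of Python. Everything here is an invented toy, not any real lender's system; the point is only that in such a model no single signal explains the outcome.

```python
import random

random.seed(0)

# Hypothetical illustration: a score built from many weighted signals,
# like the "80 strong and 200 weak signals" the speakers describe.
# Every weight and input below is invented for the sketch.
N_STRONG, N_WEAK = 80, 200

# Strong signals carry large weights, weak signals small ones.
weights = ([random.uniform(-2.0, 2.0) for _ in range(N_STRONG)] +
           [random.uniform(-0.1, 0.1) for _ in range(N_WEAK)])

def score(applicant_signals):
    """Linear score over all 280 signals: a single number, hard to explain."""
    return sum(w * x for w, x in zip(weights, applicant_signals))

applicant = [random.random() for _ in range(N_STRONG + N_WEAK)]
s = score(applicant)
approved = s > 0

# No one signal "caused" the decision: each weak signal barely moves the
# score on its own, yet together they can flip the outcome.
contributions = sorted((abs(w * x) for w, x in zip(weights, applicant)),
                       reverse=True)
print(approved, round(contributions[0], 3), round(contributions[-1], 3))
```

Asking "why was I denied?" of such a model yields exactly the answer quoted above: many signals, many weightings, all shifting over time, which is why the speakers argue for expert intermediaries rather than raw disclosure.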
What is important is that if we can't opt out, if we can't leave the car we're in, it's better to be in the front seat rather than the back seat, so that we can shape the direction in which we're going. And that is what we hope to achieve with the book as well: that we as a society have a debate. You need to know about big data. You need to know what upsides it has and what challenges it poses, so we can have a societal debate about it. So, one final question before I turn it over to the audience. You say at one point in the book: at the same time, society will have to redefine the very notion of justice to guarantee human freedom to act, and thus to be held responsible for those actions. So we talked about privacy, but what are the other big ideas that we're going to have to address to deal with the challenges of big data? Victor and I are looking at each other because we both feel so impassioned about it; we're not certain who's going to talk first. I think we both should. This is really important. The judicial system has never had to grapple with the idea of penalizing someone before they've committed an act. And because of that, it's on woolly ground in terms of what the value is that we need to preserve. The way that I tend to think about it, as a writer, is by looking at the information explosion of the past: the printing press. The printing press gave rise to free speech laws. This is simplifying the history, but remember that when Socrates was on trial for corrupting the youth of Athens, in his Apology he used a couple of defenses, and one of them was not free expression; the reason why is that awareness of a right to free expression simply didn't exist. It took the printing press, it took a cornucopia of information, for us as a society to realize that this was something sacred that needed to be preserved, and we eventually needed laws to protect it.
Now I'm simplifying the history, but that's one way to look at it. I'm a journalist, so I kind of live and breathe these issues. So what would be the thing that you need to preserve in an age of big data? What is that one corollary, that touchstone, that we're going to have an awareness of? I think it might be human volition, free will, moral choice, and that we might need to actually enshrine these things in law so that they are a feature of our lives that we can hold sacrosanct. Victor wants to add to that, because he's impassioned about it from an individual responsibility point of view. I am quite impassioned about that, because my biggest concern is that we as a society might also move away from assigning individual responsibility. We're saying one of the problems with big data is that we will predict the future and hold people responsible for future events, for future behavior that they haven't yet committed, haven't yet acted on, and that's really Minority Report. But what if a society decides not to punish people for future events but just to create incentives, societal incentives, so that people shift their behavior in the future, so that everything is like the nanny state, the big data version of the nanny state, where there is no individual responsibility anymore, but the nanny state is still driving us, still using predictions in order to guide us? If there is a world without responsibility, a world without individual guilt, if you want, and punishment, then this is a world without innocence, and I wouldn't like to live in that world. We're going to open up for questions. We have a mic going around. I think the gentleman in the black sweater (Heba, right there) had his hand up early on. The second is whether you can address one of the corollaries of what you're saying, and that is: should we think differently about requiring more data availability for public policy purposes? Let me give an example.
As people talk about MOOCs, the massive open online courses, the clickstream data from these courses would be enormously valuable to learning scientists thinking about how online education can be improved. Yet venture capitalists will want that data kept to the individual company, to improve its position in offering these things. That's a real tension. I was wondering if you might want to talk about the information commons pieces of this. The third piece may be that we have wrestled with some of these issues if we think about insurance, and charges that differ by gender, for instance, or the like. We make a decision about whether there can be differences assigned to people on the basis simply of age and gender, which are all statistical in nature, and we say that there are ethical reasons for making certain decisions about charges like that. I want to know if you can comment on this tension, not in the realm of punishment but in the realm of decisions like the one made by the National Health Service about offering services to someone on an individualized basis, based on a statistical analysis. Thanks. Vice versa. Vice versa. Go for it. Sure. On the question of the information commons and the question of open data and data access: in the book we have an entire section devoted to the importance of open data, particularly open government data, and we think it is imperative for governments to open the data sources that they have. That is often easier said than done; it requires the right infrastructure, it also requires an understanding of what the data contains, some thinking about the metadata structures and all that. But we clearly believe that is of very big value.
We also believe that there is a potential place for government incentives, government-created incentives, to release data. For example, as we have seen, if you receive a government subsidy to do research, then you should make not only your research results openly accessible (you're smiling; we both know what we're talking about here) but also the underlying data, and I think that's important. We could expand that to civil society organizations, to NGOs and self-help groups, who might get tax benefits if they follow that policy. We could even think of private sector entities getting some tax benefits, some incentives from society, if they share that valuable information with society at large, with the research community. Again, keep in mind we had this debate before, some 200 years ago, when we thought about the patent system. The patent system in this country tries to find the balance between giving an exclusive right to the patentee and making sure that everything about an invention is documented, so that once the patent's time period is up, society can use it and utilize its value. So there is a balancing going on there, and I think we need exactly that kind of balancing in a number of areas of society. You mentioned education, which is a fabulous example; we believe big data will reshape education. But we also believe that health is going to be a huge field of application, where it is remarkable that it took a private company, 23andMe, which we mention in the book, to be the first to aggregate 10,000 DNA sequences of Alzheimer's sufferers so that we could do the first analysis of the genetic traits of Alzheimer's sufferers, because our current healthcare and health research system was just unable to do that. So we really need to revisit that. And thank you very much for that part of the question; now I toss the really hard second question over to Ken.
Before I try to tackle, imperfectly, the second part of the question, let me go back to one aspect of the first that I feel strongly about. I'm all in favor of a data commons; I think it's a great idea when, for example, it's public information provided by the state that becomes open data. But I'm against data communism, and what I mean by that is the idea of compelling private sector entities to disclose the data that they've collected. So in the case of the MOOCs, the courses, if there were some sort of government regulation saying that by dint of being a company, or in this case an educational institution that's granting degrees, you have to share the clickstream data, I think that would thwart the innovation that you get. I think you want to bring a degree of competition, if you will, into lots of spheres of activity, and I don't see why higher education shouldn't be a part of that as well. It already is, right? It's called the Ivy League versus the Baby Ivies, etc. It's research money they're competing for, and, as the intellectual property lawyers will know from Madey v. Duke, in fact the Supreme Court agrees with me. So I like the idea that Victor's pointing out, that you can create incentives to do it, but I wouldn't want to make it mandatory, because there's a cost to the collection; Amazon shouldn't have to disclose its data to Barnes & Noble, etc. The second part of your question, in terms of insurance, is a very difficult one. Basically what we're seeing is a clash, and the clash is between what the statistics say, in terms of the differences in the frequency, the propensity, of say getting into a car accident if you're male versus female, and our ethos of equality, our values in terms of treating two people the same, commercially or in the eyes of the law. And there I think it's just up to society to decide. Yes, on one hand it looks anti-math, right? It looks like creationism, to say
that you have these known statistical features of something, yet we're going to abrogate it entirely, be blind to it, pretend it doesn't exist, and say: price insurance in this form. If you think about where big data takes insurance, it almost pulls the floor out from under the market for insurance. The reason why is that if an insurer has a high degree of predictability, can see into the future of whether someone has a propensity for a disease, for getting into a car accident, etc., it will always have an incentive to insure the person it knows will never get ill, and always have an incentive to deny coverage to the person it knows will, right? And this asymmetry of power in the marketplace would be pernicious to the individual. So it doesn't mean that insurance as pooled risk, as a mechanism of pooled risk, goes away, but it does mean the market for insurance goes away. So you might have to have public policy step in to allow this to take place. In the European Union they do: you have to price insurance the same for men and women, even though the propensities for certain things, like a car accident, are different. We've even seen the federal government march into insurance to help it, by creating a floor for terrorism coverage, right, because there are things that would be uninsurable in a marketplace situation. And so you can imagine that the government is going to have to step in here too, to say: okay, we like the idea of pooled risk, we realize the market for it has changed now, because we have greater visibility, and that would have a distortionary effect on the public policy values that we have, that we want to attain, so here's what we're going to do. Let's go up the front. Mike Nelson with Bloomberg Government. I noticed that in the book the word policymaker only appears 11 times, and once in the acknowledgements, so your focus clearly is not on defining what the right policy is for policymakers. And that's a good commercial choice, since there are more CEOs and people who work in business who want to know about big data than there are policymakers. But we're in policy town, so I have to ask a policy question, and I want to ask about the rest of the world here. If the US is going to adopt good policies for big data, we want to learn from the mistakes being made elsewhere and the good examples being set elsewhere. Are there any places we should look to for good big data policies? There are a lot of different policies: privacy, the right to be forgotten, research investments, open data. Do we have anybody that's doing better than we are in this area? No. It's as simple as that, from my point of view. This is very early on. As you know, Mike, having been in the rough tunnel of policymaking yourself, policymaking really is not anticipatory; it is reactive. And so policymaking, almost by necessity, is behind the curve, or at best close behind the curve, right? But what is really interesting about this, if you let me indulge in a very small anecdote: this book came out two days ago in the United States, where the home publisher is Houghton Mifflin, and we're really excited to be on this book tour. Houghton got the world rights and sold them; we now have 12 language editions coming out this year, and one of them is Chinese, simplified Chinese, for China, a legal deal that they did, not a pirated version or anything. A Chinese publisher got a small army of translators to translate the book once they got the complete manuscript, you know, when we went into print, and they then basically printed overnight, and on the 20th of December last year the book became legally available in China, all contractually fine, available in Chinese. And it has been the number one business book in China ever since, 10 weeks now, and it's among the top 50 books in all of China. The
telecom minister has read it. We've been there; we've been inundated with hundreds and hundreds of people asking about it, and we always, or almost always, get the same question, which is: is big data a tool, a mechanism, by which we, China, can leapfrog the West? Give us the blueprint on what to do, what kind of incentive structures, what kind of rules to put in place, because we are going to do it, and we're going to do it fast, because we want to leapfrog. And so I can't point to a better jurisdiction in the world that we could learn from, but I can tell you that a number of jurisdictions are breathing down our necks, so to speak, because they get the power of big data. Let me add one thing to that, which is what is an interesting thing to do and what is a bad thing to do, and it's a tale of two countries within Europe, hopelessly divided as always. In the case of Britain, they have the Cabinet Office, and the Cabinet Office created, as in America, a very strong open data initiative. But they did something very interesting: they wired into a committee that advises the Cabinet minister (the Cabinet Office being very strong, sort of the control tower for all the other national agencies in the country) something called the Open Data User Group, which would nominate from the community two or three people to sit on this board and represent it, and to meet on a monthly basis with the minister as well as with the other people on this board, and often with the heads of other agencies. This was really shrewd, because what it did is it took the wolf, if you want, into the general's tent. It took your biggest critics, the ones militating the loudest for open data, and brought them into the seat of government, into the Cabinet Office, into the minister's plush, palatial office in Whitehall (I've been there). That's really good, because what it did is it emboldened the people within the government of Britain who want open data to have
their proxy, their surrogate, fight their battle for them, because otherwise, as we know in politics, you're going to get the knives in your back. So what they did is they said: great, we're going to bring them in, they can do the screaming that we don't have to do, and convince all the other people who are reluctant to open up their data to open it up. Very shrewd. That's one thing that you can do. What don't you want to do? Well, in another European country, it happens to be France, the finance ministry has just issued a report. It's just a green paper, it doesn't mean anything yet, but it does basically consider the idea of taxing data. I don't think that would be the right way to go. We'll take two more questions back to back, and then they'll answer, and then we'll probably go ahead and close, but Victor and Kenneth have agreed to stay a bit after to answer any other questions. So we have one up here, and then one in the back as well. I've been running Dupont Circle Village, which is an aging-in-place thing, but I'm also a lifetime reporter. I remember taking a course on the economics of health care at Harvard ages ago and being shocked when the professor said: these are choices you make about availability, and if you're 55 or older, you're not eligible for a kidney transplant or kidney dialysis in England. And I thought, my god, I know half of the people on them here are over 50. I don't know whether that was based on any data; I think it was probably just based on health policy. But there's a lot of data these days showing that the biggest spending on health care in a person's life is in the last year of life. Now what do you do with that data? Do you say, sorry, you're 80 and you might live to 90, but we don't know, and so we want to deprive you of that? Do you set rules in hospitals? I mean, I don't know. I was very interested in the Canadian preemies example, which is a positive way of getting more data and saying, we're doing it wrong, and here's what we
do better. But I don't know how you deal with this data that's out there; it's pretty harrowing, but it's real, if you're an insurer, if you're a hospital, if you're a person. Let's take another question from the back. One of the questions that I have concerns the pitfalls of big data that you point out. One thought that I have is how quickly trends change in the digital world, and how quickly the prescriptions, or the correlations, that you draw from big data change with them. Something like all-caps or all-lower-case typing must have changed in the last five years, with tweens coming onto Twitter and Facebook and not using caps at all. So, drawing correlations: how quickly do these correlations expire? And the expiry date is going to shrink as we move on in the digital world. Do you guys have any thoughts on that? Yeah, really good. So, to the first question: it's a really difficult issue, and again it shows where the algorithm butts up against our values. What we do know for certain is that future generations will be mystified to realize that we're not collecting all of our data all the time. In terms of healthcare, within the next five, ten, fifteen, twenty years we're going to get there: we're going to have really data-driven medicine. It's going to lower the cost and it's going to increase the care. All of that is certain. After that, we're still going to have to make these decisions, and the decisions are going to look like that. We might make a better decision; it might have to be fifty-two and not fifty-five, or it might now be sixty, for a lot of factors. It might be individualized rather than for the group, so for this person it's sixty, but for that person it's fifty-seven, and that's really going to be a puddle of tears at the kitchen table. But again, these are decisions that we're probably going to have to make. Importantly, there is another elephant in the room, and that is access to that piece of information, and I know Ken in particular, he's too
polite not to push for it at this point in time, but he is adamant about making healthcare data, health information, available to the research community. One of the problems that we have, I said it earlier on, is that it took 23andMe, a new startup, to come in and collect Alzheimer's genetic information. We really need to be better at collecting health information and making it available to the research community. Think about it just for a second: FDA drug trials begin with a couple of dozen cases, then have a couple of hundred. Would Facebook ever make a decision on where to place the Like button based on a sample of a couple of hundred users? Never. They would sample something like eight hundred million, and then make sure that the Like button is exactly where it needs to be. If the Like button is so important that you look at eight hundred million data points, shouldn't we use more data points when we decide whether or not a new pharmaceutical should be made available? So we really think that big data requires the healthcare community to make health information, health data, more accessible to the research community, and we are quite adamant about that. Any thoughts on that final question? I think it's a great point. The fact is that when we're doing our big data analyses, the situation changes; the analysis may only be valid for this instant in time, and we have to do it again. These are the things we're going to learn about, right? I think that's one of the big issues that we're facing. Great. Well, please give the authors a hand. Thank you all for coming this morning. Are you signing books? Yes, and books are up front.
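The drift the last questioner describes, correlations that expire as behavior changes, can be illustrated with a small sketch: the same signal-to-outcome correlation computed over two invented eras of data. All numbers below are made up for illustration, not real credit data.

```python
# Hypothetical sketch of correlation expiry: a signal (all-caps typing)
# that once predicted loan repayment stops doing so in a later era,
# as typing habits change. The data are invented for illustration.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Each pair is (typed_in_all_caps, loan_repaid), both 0/1.
era_2008 = [(1, 0)] * 40 + [(1, 1)] * 10 + [(0, 1)] * 40 + [(0, 0)] * 10
era_2013 = [(1, 0)] * 25 + [(1, 1)] * 25 + [(0, 1)] * 27 + [(0, 0)] * 23

for label, era in [("2008", era_2008), ("2013", era_2013)]:
    xs, ys = zip(*era)
    print(label, round(pearson(list(xs), list(ys)), 2))
```

With these made-up numbers the correlation comes out at -0.6 in the first era and roughly zero in the second: a model trained on the old era would keep penalizing a signal that no longer means anything, which is exactly why such analyses have to be redone over time.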