Hello, hello. Good evening everyone. My name is Urs Gasser. I'm the Executive Director of the Berkman Center. It is on; it's the only way it works. So, but thanks. Welcome to this special book talk on a great book that is an instant bestseller. Trust me on this one. A book on big data, written by two of my favorite authors and thinkers in the information space, Viktor Mayer-Schönberger and Kenneth Cukier. Viktor Mayer-Schönberger is Professor of Internet Governance and Regulation at Oxford University. He served for many years on the Kennedy School faculty here at Harvard and, I'm proud to say, is also a graduate of Harvard Law School. And just to add a personal footnote: without Viktor, I would not be here, because Viktor was responsible for writing a letter of recommendation that got me into the LLM program here at Harvard. So welcome back on campus, Viktor. Ken Cukier is the data editor at The Economist, one of the coolest job titles ever, I think. Ken was previously the Japan correspondent for The Economist on business and finance, and has extensive experience across the world thinking hard and doing research on information policy matters and on data issues more broadly. Today they will talk about their new book, Big Data. They have a very busy travel schedule, and I'm super delighted that they're here tonight. They will not only talk about the opportunities that are in front of us as we get into this new revolution of big data, but will also focus on the dark side, on some of the challenges. So I'm very much looking forward to the presentation. Fair to say: big data, big audience, let's have a big time. As mentioned, everything you say is recorded, so keep that in mind. Thank you.

Thank you very much. Thank you very much, everybody. It is just delightful to be here, particularly because you're guinea pigs: we haven't done this presentation in this form before, so we'll find out how it plays. But when we talk about big data and its dark side, what we need to do first is set the stage. Some of you might already know some of the material that we'll quickly run through here, but please bear with us, because we want to give it a particular interpretation, a particular spin. So let's look at big data. What is it? How do we understand it?

Each year, the common flu kills tens of thousands around the world. But in 2009, as you may recall, a new flu virus was discovered, and experts feared that it might kill tens of millions of people. With no vaccine available, the best that health authorities could do was to monitor its spread. But for that, they needed to know where it already was. The Centers for Disease Control in Atlanta had doctors inform them of new flu cases, but despite best efforts, such data collection and analysis takes time. Hence, the picture of the pandemic that emerged was always a week or two late. An eternity when a potentially dangerous pandemic is underway. Long before that new flu pandemic hit the U.S., engineers at Google developed what they thought was a much better way of predicting the spread of the flu, not just nationally, but down to regions. They took the 50 million most common terms that Americans search for online and compared where and when these terms were searched for with flu data from previous years. The idea was to predict the spread of the flu virus through Google searches. And Google's researchers struck gold.
They identified 45 search terms that, when taken together in a mathematical model, could predict the spread of the flu with high accuracy. Here is the official data of the Centers for Disease Control, that's the orange line, and here is Google's prediction, the blue line: very impressive predictive power. But unlike the Centers for Disease Control, and that's important, Google would know the spread of the flu almost in real time. Strikingly, Google's method does not involve mouth swabs or contacting physicians' offices. Instead, it is built on what is called big data, the ability to harness data to produce novel insights or valuable goods and services.

So it's tempting to think of big data in terms of its size, but that would be like describing an elephant by the size of its footprints. In contrast, we argue that big data is something more profound than having lots of information in digital storage. There are three defining and reinforcing qualities that characterize big data: more, messy, and correlations. We can collect and analyze far more data about a particular problem or phenomenon than ever before, when we were limited to working with just a small sample. That gives us a remarkably clear view of the granular, the details that conventional samples simply cannot capture. Importantly, as we capture much more data, we are no longer forced to use the data simply to confirm or disprove a concrete hypothesis that we had. Instead, we can let the data speak, and that often reveals insights that we never would have thought of.

Consider DNA analysis. Since 2007, the Silicon Valley startup 23andMe has been analyzing people's DNA for only a couple hundred dollars. Its techniques can reveal traits in people's genetic codes that may make them more susceptible to certain diseases, like breast cancer or heart problems. But there's a hitch. The company sequences just a small portion of a person's genetic code, places that are known to be markers indicating particular genetic weaknesses. Meanwhile, billions of base pairs of DNA go unsequenced. Thus, 23andMe can only answer questions about the markers it considers, and whenever a new marker is discovered, a person's DNA has to be sequenced again. Working with a subset rather than with the whole, 23andMe cannot answer questions that it did not consider in advance.

Apple's legendary chief executive, Steve Jobs, took a totally different approach in his fight against cancer. He became one of the first people in the world to have his entire DNA sequenced, as well as that of his tumor. To do that, he paid a six-figure sum, many hundreds of times more than the price that 23andMe charges. In return, he didn't receive a sample, a mere set of markers, but a data file containing his entire genetic code. Thus, Steve Jobs' team of doctors could select therapies by how well they would work given his specific genetic makeup. Whenever one treatment lost its effectiveness because the cancer mutated and worked around it, the doctors could switch to another drug. In his words: "I'm either going to be one of the first to be able to outrun a cancer like this, or I'm going to be one of the last ones to die from it." His prediction sadly went unfulfilled, but the method, having all the data and not just a little bit of it, likely gave him years of extra life.
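A toy illustration, in code, of why having all the data matters (invented numbers, nothing from the book or from 23andMe): a trait carried by one person in a thousand is essentially invisible in a classic 1,000-person sample, but yields plenty of cases to study when we keep everything.

```python
# Hypothetical illustration: a rare trait (0.1% of people) nearly vanishes
# in a conventional small sample, but is well represented in the full data.
import random

random.seed(42)
POPULATION = 1_000_000
MARKER_RATE = 0.001  # assumed rarity of the trait

population = [random.random() < MARKER_RATE for _ in range(POPULATION)]
sample = random.sample(population, 1_000)  # the classic small-data approach

print("carriers in full data:", sum(population))  # ~1,000: enough to study
print("carriers in sample:   ", sum(sample))      # typically 0 to 3: useless
```

The full dataset also allows new questions to be asked later without collecting anything new, which is exactly the 23andMe-versus-whole-genome contrast.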
The second quality of big data is its embrace of messiness. Looking at vastly more data permits us to loosen up on our desire for exactitude. When our ability to measure was limited, we had to treat what we did bother to quantify as precisely as possible. In contrast, big data is often messy and varies in quality. But rather than going after exactitude, measuring and collecting small quantities of data at great cost, in the big data age we'll be accepting of messiness. We'll often be satisfied with a sense of general direction, rather than striving to know a phenomenon down to the inch, down to the penny, down to the atom. We don't give up on exactitude entirely; we only give up on our singular devotion to it. What we lose in accuracy at the micro level, we gain in insight at the macro level.

An example of this is machine translation. IBM researchers in the 1990s used a very precise set of documents, Canadian parliamentary transcripts available in French and English, to train the computer. It worked in principle, but the overall quality of statistical machine translation remained low. Then, around 2006, Google marched in. Instead of using just a few million sentences of clean translations from the Canadian government, they used everything they could get their hands on: the entire global internet, harnessing billions of pages of translations of widely varying quality. The data was much less clean, but that was a small trade-off, because the vast increase in data greatly improved the quality of their translations. The lesson here, if you will, is that more, messy data trumped less, cleaner data.

Well, these two shifts, more and messy, lead to a third and most important shift, one that goes against everything we know: a move away from the age-old search for causality. Humans are conditioned to understand the world as a series of causes and effects. If we fall sick after a meal at a new restaurant, we believe it must have been the food we ate, although it is much more likely that we got the bug by shaking hands with a colleague. This hunch of causality gives us a sense of comprehension in the seemingly inexplicable world we live in. It's enlightening, it's comforting, it's reassuring, and most often it's plain wrong. These quick hunches, these fast insights, are our brain's feeble attempts to offer explanations with a scarcity of facts at hand. That may have worked when we needed to decide in a split second whether to run away from a potential danger some hundred thousand years ago. But as Nobel laureate Daniel Kahneman has shown, in our complex world of today this fast thinking, as he calls it, these quick causal insights, often leads us down the wrong path. The alternative, of course, is to think slower, to reflect harder on causal connections. It's embodied in what's been called the scientific method: as a result, we run experiments to uncover causal relationships. And yet even the most carefully orchestrated double-blind trial cannot prove causality conclusively either. Facilitated by big data, we now have an additional method available. Instead of asking why, of looking for elusive causal relationships, in many instances we can simply ask: what? And often that is good enough. Big data correlations, asking what, helped Amazon and Netflix recommend products to their customers. Correlations are at the heart of Google's translation service, as we heard, as well as its spell checker. They don't tell us why, but they tell us what, at a crucial moment and in time for us to act.
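As a minimal sketch of what asking what rather than why can look like in code, the snippet below ranks a few invented search terms purely by how strongly their weekly volumes correlate with a flu series. None of this is Google's actual data or pipeline; the terms and numbers are made up.

```python
# Rank candidate signals by correlation with an outcome, with no causal
# story attached. All series below are invented for illustration.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

flu = [10, 14, 30, 55, 80, 62, 31, 12]        # weekly flu cases (invented)
search_volume = {                              # weekly search counts (invented)
    "fever remedy":   [12, 15, 33, 60, 78, 60, 35, 14],
    "cough medicine": [20, 22, 40, 66, 90, 70, 41, 22],
    "cheap flights":  [50, 48, 52, 49, 51, 50, 47, 52],  # unrelated noise
}

for term in sorted(search_volume,
                   key=lambda t: pearson(search_volume[t], flu), reverse=True):
    print(f"{term:15} r = {pearson(search_volume[term], flu):+.2f}")
```

The top-ranked terms say nothing about why people search for them; they are simply useful for prediction, which is the point.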
Let me explain. A premature baby can fit into an adult's hand. Until they have grown, these little babies are terribly vulnerable to even relatively benign infections. Dr. Carolyn McGregor in Canada researches how to give these babies the best chance to survive, using big data analysis after collecting a thousand data points a minute from these babies. Dr. McGregor has discovered a shocking truth: whenever a little premature baby would have very stable vital signs, often its body wasn't stabilizing, but rather preparing for the onset of a severe infection. With this knowledge, she could identify babies needing medication at a much earlier stage, and certainly before it was too late. It is the quintessential big data application. Dr. McGregor used much more data than ever before from these premature babies, collected through better sensors. She accepted, she even embraced, that in such situations not all data is clean and accurate, and thus accounted for some inexactitude in her analysis. Most importantly, she put aside the question of why, of identifying the exact causality for the temporary stabilization before the infection's onset. Intent on finding a pragmatic way to help, she looked for the what, and as a consequence enabled us to foresee infections well before they ravage the little body. Lest we forget: big data saves lives.
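A rough sketch of the pattern that finding suggests, stated in code: watch a stream of vital signs and flag windows where the variability drops unusually low, the suspicious stability described above. This is only an illustration of the idea with invented readings, not Dr. McGregor's actual model.

```python
# Flag stretches where a normally noisy vital sign becomes "too stable".
from statistics import stdev

def suspicious_stability(heart_rates, window=5, threshold=1.0):
    """Yield indices where rolling variability falls below the threshold."""
    for i in range(window, len(heart_rates) + 1):
        if stdev(heart_rates[i - window:i]) < threshold:
            yield i - 1  # last reading of the unusually quiet window

# Normally fluctuating readings that suddenly flatten out (invented data):
readings = [142, 138, 145, 140, 147, 139, 141, 141, 141, 141, 141, 142]
print("check on the baby at readings:", list(suspicious_stability(readings)))
```

A real system would, as the talk says, take in a thousand data points a minute per baby and tolerate plenty of sensor noise; the rolling-window idea stays the same.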
So what's behind all of this? Often big data has been portrayed as a consequence of the digital age, but that misses the point. What really matters is that we are taking things that we never really thought of as informational and rendering them into a data format. We're datafying them, to coin a term. Once it is data, we can use it, process it, store it, analyze it, and extract new value from it. Think about posture. We don't typically think of the way that you are all sitting right now as informational, but it is. It's rare for two people to sit in exactly the same way. So if we measure the weight, the leg length, the positioning, the distribution of the weight, we can calculate a distinctive index for each individual. What can we do with all of this? Well, researchers in Tokyo are working to implement it in cars as an anti-theft device: a car would know when it's not being driven by its owner. But if we analyzed millions of drivers in this way, we might discover subtle shifts in body posture prior to accidents that suggest driver fatigue, and the car could sound an alarm to make the driver more alert.

Datafication is also at the core of social media platforms. Facebook has datafied our friendships and the things that we like. Twitter datafies our stray thoughts and our whispers. LinkedIn datafies our professional contacts. Once things are in data form, they can be transformed into something else. And that brings us to the applications of big data. The applications you're familiar with, apart from Google Flu Trends, may be Amazon's and Netflix's recommendation engines, or the predictive maintenance of car engines and airplane engines. But the applications of big data are not confined to the commercial sector. We already talked about healthcare, the preemies, Google Flu Trends. Because we can now capture so much data about our social interactions, big data promises an enormous boost for the empirical social sciences, and we suggest that it will also transform many humanities into social sciences. We'll be able to see social interactions at scale, in detail, and almost in real time.

I'm going to show you three examples. It's a little hard here, I have to admit, to show these examples, because I have in this audience a great friend and colleague, Professor David Lazer, who actually is an expert on exactly this. So I am a little hesitant to do this, but I feel like the pupil showing off in front of the teacher. So here I go. For example, with big data analysis we can visualize what topics were discussed, and how frequently, in the British House of Commons over the course of the last 75 years. We can see how concerns about the environment are a fairly recent development, and how crime and national security (the top line), in comparison to defense and foreign affairs (toward the bottom), have seen long-term and sustained growth in interest. That's an amazing insight, an insight that we can gain into the debates at the higher level, at the elite level, of our society. And when you look at the environment, you really see that it came up in the 1970s.

Or a second example. This is the work of two students of mine at the Oxford Internet Institute. They examined the topology of Mitt Romney's Twitter followers last summer and discovered that most of his recent followers were so different in their network ties from average followers that, with 99.999% probability, they were fake. They were.

Or this final example of using big data to shed light on society and our societal dynamics. We datafied over a century of U.S. Supreme Court decisions to investigate the behavior of conservative and liberal justices. It turns out that, especially over recent decades, justices have become more likely to cite within their respective ideological universes, to cite primarily precedents that confirm their ideology. And conservative justices do this more strongly than liberals.

Analyses such as these are only the beginning. In the future, big data analysis will greatly aid societal decision-making. We may have a vastly improved sense of where the next war is going to take place, where regimes are becoming unstable, as we shift from believing experts and their hunches to trusting big data. We will never gain certainty; likelihoods of events, probabilistic predictions, are all that we can hope for. But in many instances, such big data analysis, believe it or not, will outperform experts.

Unfortunately, big data, like every potent tool, also has a dark side, and one that, given its bleakness, we must learn about, aim to understand, and guard against. In the big data age, some of the policies and technologies for the protection of privacy don't work very well anymore. Without adjusting them, we risk either repressing big data's potential or leaving individuals exposed to big data abuses. But new problems emerge as well. One of them is predicting human behavior, what we are likely to do, and possibly penalizing us before we have even committed the infraction. This sounds like the idea of pre-crime from the movie Minority Report. It is: people are punished for crimes they have yet to commit. So Ken said, I need to take over now, because I'm the lawyer. Predicting an individual's future behavior is one of the most valuable features of big data. Wouldn't it be great to predict when somebody is going to commit a crime and then just stop it? Prevention is better than punishment, isn't it? And yet, such a use of big data would be terribly misguided.
For starters, predictions are never perfect. They reflect statistical probabilities. So we would punish people essentially without certainty, negating a fundamental tenet of justice. Worse, by intervening before an illicit action can take place and punishing the individuals involved, we essentially deny them human volition, the ability to live their lives freely and to decide whether or not, and when, to act. In a world of predictive punishment, we never know whether somebody would actually have committed the predicted crime. We would not let fate play out, holding people responsible on the basis of big data analysis that can never be disproven.

But let's be careful. The culprit here is not big data itself, but how we use it. The crux is that holding people responsible for actions they have yet to commit is using big data correlations to make causal decisions about individual responsibility. As we've explained, big data correlations cannot tell us about the why, the causality behind things. Often that's good enough. But it makes big data correlations singularly unfit to decide whom to punish, whom to hold responsible. The trouble, though, is that we humans are primed to see the world through the lens of cause and effect. Thus, big data is constantly under threat of being abused for causal purposes. It is the quintessential slippery slope, leading straight to a world in which individual choice and free will have been eliminated, in which our moral compass has been replaced by predictive algorithms, and individuals are exposed to the unencumbered brunt of collective fear. If so abused, big data threatens to imprison us, perhaps even literally, in probabilities.

A third problem is one that's not unique to big data, but one that society needs to be vigilant about and guard against, and that is what we call the dictatorship of data. It's the idea that we may fetishize the data and endow it with more meaning than it truly deserves. As big data starts to play a part in all areas of human life, this tendency to place trust in the data and to shut off our common sense may only grow. Placing one's trust in data without a deep appreciation of what the data means and an understanding of its limitations can lead to terrible consequences. In American history, we have experienced a war fought on behalf of a data point: the war in Vietnam, and the data point was the body count. It was used to measure progress when the situation was far, far more complex. So in the age of big data, it will be critical that we do not blindly follow the path that the data sets before us.

But wait a second, we are lawyers, and lawyers should know how to control this, or at least how to advise policymakers. So what can we do to control the dark side of big data? We call for new strategies to protect privacy, to move away from relying on notice and consent and toward accountability of use. We call for new strategies to protect human volition, to prevent big data analysis from being used to hold individuals responsible for future actions, and to limit the dangers of relying too much on what the data dictates. These strategies include creating a new cadre of professionals. We call them the algorithmists (lawyers, watch out), who would be specially trained to understand big data analysis.
They would not only work with big data companies to examine and audit their big data analyses, to ensure the soundness of methods and protect people from harm, but they would also enable individuals out there, and regulators, to peek into the black box of big data predictions and correlations. This, we hope, could help ensure the ideals of transparency and accountability that should be features of the big data world. Of course, big data requires more than that, more than the couple of safeguards I just mentioned, to fulfill its potential. For instance, we will need to ensure that data isn't held by an ever smaller group of big data holders: the Facebooks, the Googles of the world, or the NSA, pick your favorite. Much like previous generations rose to the challenge posed by the robber barons that dominated railways and steel manufacturing in the 19th century, we may need to constrain the data barons to ensure that big data markets stay competitive.

Big data is going to help us understand the world better and improve how we make decisions, from which medical treatments work, to how best to educate our kids, to how a car can drive itself. But it also brings new concerns. What is essential is that we harness this technology understanding that we remain its master. That just as there is a vital need to learn from data, we also need to carve out a space for the human, for our reason, our imagination, for acting in defiance of what the data says. Because the data is always just a simulacrum of reality, and therefore is always imperfect, always incomplete. As we walk into the big data age, we need to do so with humility and humanity. Thank you very much.

Thank you very much. We have time for questions, right? You have about 20 minutes or so, yeah. So please press the button, and please also introduce yourself.

Brian Cain, MIT. I was intrigued by your comments on data barons, because I think there are such clear economies of scope and scale here that the risk of proprietary control is very high. I think we've seen this in the internet environment already, that the stakes are very high, but it's very difficult to get at these situations from an antitrust perspective, because the benefits are also very high, especially for consumers.

I think that's absolutely right, but there is a slight sliver of hope on the horizon in the United States, not elsewhere, by the way, but in the United States. And that is that the federal government, the FTC, and others have been quite cognizant of the power of data. I'll give you one example. You might have heard of the ability to predict the price of airline tickets, whether an airfare will go up or down. It was originally called Farecast, and it is now part of Bing Travel; Kayak has something similar as well. And the data behind that was licensed to Farecast, originally, by a company called ITA, one of the large airline reservation systems, based here, by the way, in Massachusetts. So this data was used to make these great predictions. It was licensed. Then a company came in to buy up ITA, called Google. And immediately the third parties were extremely concerned about this quasi-monopolization of data, this concentration of data. And to its credit, the federal government came in and mandated, in this particular instance, that Google would have to continue to license the ITA data to third parties for a period of time, and on fair licensing terms.
So I think that the federal government actually does understand that particular question, not everywhere, not all agencies, not in all cases, but there are precedents for that. We might need more than these individual mandates in merger decisions, but they're a really good start.

I'm less worried about that monopolization bit than I'm worried about Minority Report.

Yeah. So those are two great points. We've thought long and hard about them. They get a sprinkling of mentions in the book; we can elaborate now. For education, I think that's true. We already do that: Harvard's entrance exam actually tries to be a predictor of your performance, both at Harvard and later in life. Right. Certainly, without a doubt. But I think it's more likely that we're going to see incredible improvements in education through big data. And I'll give you an example. Right now, our second, third, and fourth graders get their tests back for their mathematics. Actually, a better example might be in 10th grade, when you start learning serious math, algebra and all that. There, you could imagine a situation where the teachers are grading the tests, and a few people get something in the 70s, something in the 80s, something in the 90s, and you get it back. And it's a very blunt instrument. With big data (and this doesn't even have to be big data, but this is how it's going to be implemented at that level), you could imagine the teacher grading the tests and being able to find out that 60% of the class got a certain question wrong, with the exact same answer. And so that would be an interesting signal for the teacher to learn: maybe I mistaught it, maybe I've got to do something different. What's really interesting is that in this instance you'd see that they inverted an algebraic equation, not realizing that the sequence actually mattered. So what you can imagine happening right now, and actually it is happening with the MOOCs, the massive open online courses, is that they are using these sorts of techniques to identify trends and correlations that actually improve the learning process, that tailor education down to the individual student. So I think we're going to see a boost. Yes, there will be a winnowing, a tracking. But it already exists today, so I don't know if this is a big data problem, and I don't know if there's a solution that targets big data rather than just our social values.
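A minimal sketch of the grading analysis described here, with an invented class and made-up answers; the 60% figure is just the talk's example reproduced in code.

```python
# A toy version of the classroom example: group wrong answers by question
# and by the answer given, and flag cases where a large share of the class
# made the exact same mistake. All data here is invented for illustration.
from collections import Counter

CLASS_SIZE = 20
# (student, question, answer_given); the correct answer to Q3 is "x = 2"
wrong_answers = [("s%02d" % i, "Q3", "x = -2") for i in range(12)]  # sign error
wrong_answers += [("s12", "Q3", "x = 4"), ("s13", "Q5", "7"), ("s14", "Q5", "9")]

counts = Counter((q, a) for _, q, a in wrong_answers)
for (question, answer), n in counts.most_common():
    share = n / CLASS_SIZE
    if share >= 0.5:  # more than half the class made the identical mistake
        print(f"{share:.0%} of the class answered {question} with {answer!r}:"
              " probably worth re-teaching that step")
```

The same grouping idea scales from one classroom to a MOOC with a hundred thousand students; only the data volume changes.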
So, in the instance of insurance, you raise a very fundamental question. The principle behind insurance is pooled risk, and that essentially goes away. The market for insurance does seem to winnow dramatically, because if you're an insurance company and you have very good predictive power about the person you're going to insure, you're always going to want to offer insurance to someone, in this case health care, who you know is not going to get ill, and you're always going to want to deny coverage to someone who you know is going to get sick. So what it suggests is that insurance doesn't go away, but the market for insurance changes. But you can have values; public policy can put values into it. In the European Union, they did see that health insurance and driving insurance for men and women were charged at different rates, because men were more likely to get into accidents, and there are certain sorts of cancers that men would get and women wouldn't. And they felt that this was wrong on the grounds of equality, right? Now, it wasn't wrong on the grounds of statistics and mathematics. So it almost looks like the European version of creationism, if you will. Kind of weird, like saying, oh no, you have to be equal, but we know statistically these two sides aren't equal. Well, guess what the European Union did? They said, no, you have to price it equally, in this case driving insurance. Now, we might laugh at it if we're statisticians, but we might accept it as individual citizens of countries, because it does seem to align with our values. I'm not going to take a stance on it; I'll simply say that that's the debate we've had in a small data world, and we're going to have the exact same debate, times 10, in a big data world.

Okay, so: how the writing of the book has influenced what I do at The Economist, and the second one is... sure. Okay. One of the things that I'm doing is thinking about Moneyballing The Economist. Right now we make decisions... if you know the reference from the book and from the movie, it's about how a baseball team, the Oakland A's, improved their game by using statistics and data. An example of how I'm looking at that: I'm looking at some of the data that we have and thinking about what we can learn from the way users are interacting with our content to improve what we do at the paper. So one interesting thing is, I've taken a lot of the data from the books and arts section related to what's popular online, that is, essentially online-only content, versus what is popular in print, based on what people are reading online, and looked for interesting anomalies. And it turns out that among our most interesting correlations, or if you will, our most interesting findings, is that our Q&As, our question-and-answer interviews with authors, are extremely popular online. And it is one of the formats we almost never run in the paper. So it's suggesting to us that we may want to take this format, which seems to really serve readers' interests, into the paper. However, we have to be really careful. Are we going to make the presumption that our subscribers, who have paid a lot of money for us on a weekly basis, have the same interests as the hoi polloi who visit us online because they've seen a link, and who don't pay us? We have to think through all of these issues; the human being is still there with the judgment. But we actually didn't know that at the outset; we had to listen to the data to learn it. And it actually takes a sort of humility as a professional, if you're the books and arts editor who really believes you know what is best for your audience on a weekly basis, to second-guess yourself and say: I recognize that in some ways I have good instincts, but in some ways I'm blindfolded. I'm not going to blindly accept the data, but I'm not going to be blind to it either. These are techniques that I think a lot of media companies are using. The Economist is walking very light-footedly into it, to be careful, but it's going to be necessary, because this is the new reality. We can do this. The only reason we didn't do it in the past is that we weren't able to. Or, insofar as we did do this in the past, the methodology was called brunch on Sunday and cocktails, right? We heard from people what they said. This is just a more data-driven approach to that.
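A minimal sketch of the online-versus-print comparison Ken describes, with invented figures rather than The Economist's actual data or tooling: compute each format's share of online reading and of print coverage, and flag the gaps.

```python
# Compare how popular each content format is online versus how much space
# it gets in print, and surface formats that online readers love but print
# barely runs. All shares are invented for illustration.
formats = {
    # format: (share of online pageviews, share of print pages)
    "book reviews": (0.30, 0.45),
    "essays":       (0.25, 0.35),
    "author Q&As":  (0.35, 0.02),  # popular online, nearly absent in print
    "listings":     (0.10, 0.18),
}

for fmt, (online, in_print) in sorted(formats.items(),
                                      key=lambda kv: kv[1][0] - kv[1][1],
                                      reverse=True):
    note = "  <- consider running more in print" if online - in_print > 0.2 else ""
    print(f"{fmt:12}  online {online:4.0%}  print {in_print:4.0%}{note}")
```

The judgment call, whether paying subscribers share the online crowd's tastes, stays with the human, as the talk stresses.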
What should journalists do in the future? Learn stats. When we teach journalism in the future, we're not going to just teach people the fundamentals of how to do an interview and what a lead paragraph is. We're going to teach people how to interview databases. And also, just as we train our journalists by reminding them that sometimes the people we interview are unfaithful and lie, we're going to have to teach journalists to be suspicious of the data, because sometimes the data lies too. You can't just park your judgment somewhere else. You have to bring the same scrutiny that you bring in the analog world of talking to people and observing to the data as well.

John Deighton at the Harvard Business School. I want to say at the outset that I'm slightly skeptical of the term big data. (So are we.) For it to matter, it would have to represent some kind of discontinuity. So I look at the three elements of your definition. Messiness, it seems to me, is intrinsically a property of data. Most econometricians spend most of their time worrying about the error term, and most of the advances seem to be in the area of sophisticated treatment of error. So I'm not sure messiness is a characteristic of big data as opposed to data generally. Or take correlations. Well, there's been an interest in data reduction, as opposed to causal analysis, from the very beginnings of statistics. So the one thing you're left with is the more, the quantity. And the trouble there is that I suspect that if we mapped the quantity of data from the Stone Age to now, it would be log-linear. And if it is, then you've got to worry about whether the social implications, which are a big part of your argument, are questions of kind rather than questions of degree. Because if they're simply questions of degree, society is getting used to it, and what we did with an abacus prepared us for what we can now do with a mobile phone. So a way to probe the question of where the discontinuity lies is to ask: what does empiricism crowd out? Now that we can do more, now that we can answer more questions empirically, for there to be harm there has to be something that we neglect. And certainly you make the argument that one of the things we neglect, we can afford to neglect: misplaced intuition. But there's a class of social problems that I don't think are amenable to empiricism. One of them is the kind of social problem that really is a values question, a question of neglect of the interests of one community in favor of the community that has power, using arguments that are empirically based but in fact are misdirected empiricism. So one thing that worries me is that we will make fewer decisions politically and more decisions based on expertise, and the experts will be those hired by those in possession of power. And the second is a whole class of taste-related issues, where it seems to me we'll be told what to like instead of simply liking it. So my core question is: do you think there's an inflection? Has something really changed in the last 10 years, or is it just an IBM sales slogan?

Boy, did we fail. Actually, well, I should say before Viktor starts: it can be both, right?
It's not mutually exclusive. We're in the middle of the hype cycle for big data, but there could be something real here as well, and we want to distinguish between the two. But I'll let Viktor take it over.

Well, apparently we failed to make clear what we mean... just with one person, only one person. So let me try again. Messiness: yes, it's true, data is messy, absolutely. And that is in a small data world, too. If you have very few data points, you spend a lot of time and expend a lot of cost making sure that the data is accurate, that it is as clean as possible. In the big data realm, because you have so much more, you can be more permissive of messiness. Not by much, but you understand it as a trade-off, an inherent trade-off, between the amount you're getting and the small increase in messiness that comes with it. That is a very different approach to investigating empirical problems compared with a small data world.

But most importantly, when you look at correlations, which you discarded so quickly, when you look at correlations and correlational insights, they have the potential to change the way we make sense of the world. The classical way of making empirical sense of this world has been to think, based on our theory, about a hypothesis; then to go out and get a sample of data to prove, quote-unquote, or disprove, quote-unquote, your hypothesis. If your hypothesis didn't work out, you went back and tried again with a different hypothesis. If your hypothesis seemed to work out, you published it in a paper. That is the scientific method that we're used to. Now, the problem with this approach is that we need to test one hypothesis after the other, and we need to think of a hypothesis before we can test it, and before we collect the data. With big data, we can, for some problems, do it the other way around. We can use existing data, and because we have so much of it relative to a phenomenon, we can ask it not just one question but multiple questions, so to speak, and we can test multiple hypotheses. Google's engineers had the theory that there would be a correlation between search terms and the spread of the flu, that they could therefore predict the spread of the flu using search terms. But they didn't know which search terms. Out of 50 million search terms, which would you pick? You have no idea. So what they did was ask the data which search terms would work best, adding more and more terms, so you have one, you have 10, you have 20, and they ended up with 45, because when they added the 46th search term, the predictability of the model went down. Now, when they looked at the search terms afterwards, all of them made sense; they were all about flu, flu information, sickness and so forth. But the Google engineers didn't know that in the first place. That is the kind of hypothesis-generating power of big data correlations that we are talking about, and that is changing how we make sense of the world and how we understand it.
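A minimal sketch of that greedy search, on toy data: keep adding whichever candidate term most improves a simple model's fit to the flu series, and stop as soon as the best remaining term makes the fit worse. Google's real system worked from 50 million candidates with far more sophisticated models; these term names and numbers are invented.

```python
# Greedy forward selection of search terms against a flu series.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs)) or 1.0
    sy = sqrt(sum((y - my) ** 2 for y in ys)) or 1.0
    return cov / (sx * sy)

def fit(terms, signals, target):
    """Fit quality of a model that simply averages the chosen signals."""
    combined = [sum(signals[t][i] for t in terms) / len(terms)
                for i in range(len(target))]
    return pearson(combined, target)

flu = [10, 14, 30, 55, 80, 62, 31, 12]               # invented weekly cases
signals = {                                           # invented weekly volumes
    "flu symptoms":  [11, 16, 28, 58, 77, 60, 33, 13],
    "fever":         [ 9, 13, 32, 52, 83, 65, 29, 11],
    "cough":         [14, 18, 27, 50, 70, 55, 30, 15],
    "cheap flights": [50, 48, 52, 49, 51, 50, 47, 52],  # noise
}

chosen, best = [], -1.0
while True:
    candidates = [(fit(chosen + [t], signals, flu), t)
                  for t in signals if t not in chosen]
    if not candidates:
        break
    top_fit, top_term = max(candidates)
    if top_fit <= best:      # the next-best term only makes the model worse
        break
    chosen.append(top_term)
    best = top_fit

print("selected terms:", chosen, f"(r = {best:.3f})")
```

The stopping rule is the talk's 46th-term moment: the model keeps terms only as long as each addition improves predictability.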
I enjoyed your talk. David Lazer, Northeastern University, and I should say, in the interest of full disclosure, that I'm a former colleague of Viktor's, as he mentioned, and of Ken's as well, at the Kennedy School during his respite there. So just a quick question. I wanted to press on the Minority Report dimension, because the main application I can think of in the criminal justice system that is something akin to big data, I'll call it big-ish (it's an interesting question what constitutes data versus big data, as we've discussed), is around programs like CompStat in New York City, which involve tracking crime over time and trying to project where crime is likely to occur, because typically, where crime occurred yesterday is where it's likely to occur tomorrow. It's widely debated whether this accounted for some of the decline in crime in the United States and in New York; actually, it's an ambiguous literature. But the question is: do you see issues just at an aggregate or population level, targeting resources to places where you predict crime is likely to occur, versus at the individual level? And at the individual level, we have actually used data for a long time to put people away who we think are an imminent danger to themselves or to others. So we've maybe been using data in problematic ways already, to say who's a likely risk to others and then having them committed. So I'm not sure whether there's really a distinction here when we're talking about big data projecting risk at the individual level.

Very good point. Let me make two distinctions here. One is, with respect to predictions and what predictions are used to do, or how we utilize predictions, we make the particular point that we should not use big data predictions to assess individual responsibility. That's a very important qualification to start with. But your question points to something else, and that is that we have used data before to direct limited law enforcement resources to certain neighborhoods, to certain cities, to certain times of the week and so forth. This is called profiling; that's what we've been doing through profiling, and that's what we do at airport security all the time. But what we do in profiling is create suspect groups. And you are subjected to this profiling, the extra security, the extra pat-downs and so forth at the airport, irrespective of your own propensity, just based on a group propensity. It's guilt by association. And the idealistic hope of big data is that we move down from large groups, where we look, as unfortunately this country has done at times, at African American populations in cities, police more there, and then find out that there is crime. Surprise, surprise: if you police hard, you find crime, and that seems to prove that the money was well spent. That's a self-fulfilling prophecy. That's guilt by association. The hope of big data is that as we move away from groups and group identities toward individuals, we'll have fewer misidentifications, so that the person who really isn't a risk, even though he has an Arabic name and was naturalized in the United States from Saudi Arabia, doesn't need secondary inspection at the airport.

We have time for more questions. Hi, Kate Wilkner of Microsoft Research. How do you think the digital divide and digital exclusion play into this? Your analysis can only be as good as your data points. So what happens when you have entire populations for whom adoption curves are different from the mainstream population? Will the same digital divide that we've seen in computing also appear in big data?
And so we'll see winners and losers, and the divide may grow more extreme, is that right? Well, how do you think that's going to play out?

Yeah, I think it's going to exacerbate existing divides; there's not much question there. Data is going to accrue to scale, and these technologies are not equally dispersed. In some ways the vendors are going to try to sell it so that you can buy it off the shelf, but you're going to need human resources. Here's an example on the fly. We're in the middle of an African renaissance, right? GDP growth across the countries of Africa is taking off, cell phone penetration is taking off; they're doing extraordinarily well as a continent. Big data is going to come through and might actually stir things up again, and we might see differences grow even bigger between how the West performs, optimizing its economy based on the data, and how Africa does, not because they can't buy the kit, but because they might not have enough statisticians and mathematicians and scientists who can actually apply their minds to tease out these insights. Now, it might not turn out that way, right? These are skills that you can probably buy from the West. So there's no reason why it has to turn out this way, but it likely could.

What we may indeed see, when you look at, for example, the East and the West: manufacturing has generally gone from the West to Asia, because it's about low cost. But who has the data? Is it the Asian factories or the Western factories, and who is more likely to adopt big data techniques? Probably the factories here. So although we have been seeing a trickle of manufacturing come back to the West, and particularly to America, because it's become more sophisticated, we might actually see that the great dominance of low-cost Asian manufacturing starts to winnow, and the people who are best suited to profit in this world are going to be Western factories, because they're the incumbents, and the incumbents have the data. There's going to be a greater advantage than ever before for incumbency rather than for startups. A Chinese car company that wants to get into manufacturing might not be able to optimize its manufacturing lines really, really well; that's about volume, about scale, about a low rate of defects, and it's going to be an optimization play in learning from the data. So you might want to buy Saab or Volvo; you might want to buy a Western car company to acquire that expertise. So I think there's a lot to play for.

In terms of education and healthcare, if you don't have data from populations who aren't using these technologies, how do we address that? I mean, that's a major problem.

Well, okay, let me leave education aside. It's going to be a tough nut to crack in that respect, but there it's just going to be about classic development issues, I think, in the short term. In the case of healthcare, the fact that we have cell phones is already transforming healthcare in Africa, and in Asia as well, but particularly Africa. We've seen mobile banking take off first in Africa, and now it's coming back to the West, in the reverse of the usual order.
We may see mobile health applications actually go first where there's light regulation rather than heavy regulation of healthcare, and where there's greater need, because they don't have the legacy healthcare systems, and then come back here as well. So in that sense, I'm optimistic. Now, I'm not Panglossian about it, but I think: let's find out and see where this takes us.

Thank you so much. I know there are many more questions, but I'm aware that you have to catch a flight tonight. Thank you so much for being here. We're looking forward to reading the book. Thank you.