Thank you for the introduction. Can everyone hear me? So let me start with just a little bit about who I am. My research areas are in internet-scale systems. I was part of the RAD Lab, where we proposed some of the early foundations of cloud computing and helped define what cloud computing is. I'm now part of the AMP Lab, and that's what I'm going to spend a lot of this talk on. I'm also part of the DETER testbed, which is the largest public cybersecurity testbed. So I am both a researcher and an operator; for those of you on the IT staff side of things, I share your pain. I'm also very interested in machine learning for systems in the presence of adversaries: when you've got people who are trying to game the system, how do you deal with that? And I'm an educator, so tomorrow I get to fly back to the U.S. and give a final exam to my operating systems class. An important disclaimer is that I'm not speaking on behalf of the University of California or any of the sponsors. These are my own personal opinions on big data.

Now, as the last speaker, I am in a nice position. I'm going to try to tie together many of the ideas that we've heard from the other speakers today, and then I'm going to propose a novel approach to dealing with big data. So we heard big data is huge. It's massive. The volume of data being produced is only ever growing. Companies like Google are processing huge amounts of data every day. We've heard about the Large Hadron Collider from several of the speakers. And we also heard in the first keynote about a zettabyte; for those of you who don't know what that is, it's a million petabytes.

The other aspect of big data is that it's very diverse. It's not just scientific data, it's financial data. The Walmart number, I think, is particularly interesting, because Walmart collects and retains, ad infinitum, every single customer transaction that happens in their stores. Why do they do this? Because if they can see that there's a run on Beanie Babies in Cincinnati, they can send a message to redirect a truck heading to Akron, Ohio, to Cincinnati instead and meet that customer demand. We also heard from several speakers about human genome sequencing and some of the big data challenges associated with that. I'll talk later on about the particular take we're pursuing on cancer genomics in this area.

So one of the things that I think is really interesting with big data is when you're able to analyze user behavior as opposed to user inputs. As an example, a researcher at the U.S. Geological Survey has developed a system called TED, the Twitter Earthquake Detector. Basically, he has a feed from the Twitter firehose, and when an earthquake occurs, what do people do? "OMG, shaking," right, you know, "#quake." So he simply looks for those tags and those keywords, and it's remarkable. He can geolocate where an earthquake is occurring, because a lot of tweets are sent from mobile devices which include their GPS coordinates. So he can identify where a quake is occurring. He can even identify the magnitude of the quake, because the rate at which people tweet is correlated with the magnitude of the earthquake. And he can do all of this faster than the USGS's own network of seismographs. And this is worldwide. That's really interesting because we don't have a lot of seismographs in Asia, for example, in the Pacific Rim area, and yet everybody has a mobile phone and everybody's on Twitter.
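To make the idea concrete, here is a minimal sketch of the kind of keyword filtering and centroid-based geolocation a TED-style detector might perform. The field names, thresholds and keyword list are my own assumptions for illustration, not details of the actual USGS system.

```python
# Hypothetical sketch of TED-style detection over a stream of geotagged tweets.
# Field names ("text", "lat", "lon", "ts") and thresholds are assumptions.
from collections import deque
from statistics import mean

QUAKE_KEYWORDS = ("earthquake", "quake", "#quake", "temblor", "shaking")

def looks_like_quake(text):
    text = text.lower()
    return any(keyword in text for keyword in QUAKE_KEYWORDS)

def detect_quake(tweets, window_seconds=120, threshold=50):
    """Flag a burst of quake-related tweets and crudely estimate the epicenter.

    `tweets` is an iterable of dicts {"text", "lat", "lon", "ts"} ordered by time.
    Returns (estimated_lat, estimated_lon, report_count) or None.
    """
    window = deque()
    for tweet in tweets:
        if not looks_like_quake(tweet["text"]):
            continue
        window.append(tweet)
        # Drop reports that have slid out of the time window.
        while window and tweet["ts"] - window[0]["ts"] > window_seconds:
            window.popleft()
        if len(window) >= threshold:
            # Centroid of the geotagged reports as a rough epicenter estimate;
            # the tweet rate itself is a rough proxy for magnitude.
            return (mean(t["lat"] for t in window),
                    mean(t["lon"] for t in window),
                    len(window))
    return None
```

The real system obviously has to deal with spam, retweets and sparse geotags; the point is only that the signal here is behavioral rather than instrumental.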
So this Twitter-based detection is actually being looked at seriously from a scientific standpoint as a way of filling in the gaps in our scientific network.

Now, another area that I think is really interesting, and we're going to come back to this, and it ties in with what we heard from the last speaker about analyzing big data from a government standpoint, is that Google has been very active in what they call nowcasting. This is looking at what people are typing in as search terms, what people are looking for, and using that to infer various trends. So if you go to the Google.org Flu Trends website, you can see roughly how many people have the flu in a given area. Is it peaking? Is it declining? Is it a high outbreak, a low outbreak, and so on? How do they do this? What does Google know about the flu? Nothing. But they know that people are searching for "do I have the flu," "what are the symptoms of the flu," "home remedies for the flu." And so they can correlate that against known data to determine today what the flu trends are, as opposed to the CDC, which can make the same determination weeks later when they get all of the reports collated from doctors' offices, from county records and from state-level medical records.

In 2009, as part of the reinvestment program in the United States, the government announced the Cash for Clunkers program. Most of you are probably not familiar with that program, but basically the government said: we will give you a cash rebate up front for turning in your clunker, having the engine destroyed on that machine, and instead buying a new fuel-efficient automobile. So the idea was to remove these clunker automobiles from the market and get people into fuel-efficient cars. Now, the way it worked was that they announced this program, and dealers would give you an instant rebate on your purchase, then fill out a form and send that form off to the government to get reimbursed. So it took the government weeks and months to find out whether this program was going to be effective. From day one, Google knew the program was going to be effective, because they could see people searching for terms like "how much is my car worth," "cash for clunkers," "what's the trade-in value of a clunker," and so on.

And the last thing I want to point out in terms of government data is state unemployment records. Every month, at the end of the month, the county offices report unemployment office visits, the state reports those visits to the federal government, and a few weeks later the federal government releases the official unemployment rates for states in the United States. Again, Google already knows what the unemployment rate is, because they can see people searching for unemployment benefits, jobs and so on, and they can correlate that with historic unemployment trends. So keep all of that in mind; we're going to come back to it later on.

Okay, so another aspect of big data is that it keeps getting bigger and bigger: more devices, more people getting connected to the internet and generating even more data, and storage getting cheaper and cheaper every year. But of course, as we heard, we like to hoard. We don't want to throw anything away, for many reasons. One of them is that you don't know what answers that data might provide in the future. And since I don't know what my analysts might want to do with that data in the future, I keep it.
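A toy illustration of this nowcasting pattern, under my own assumptions: fit the historical relationship between weekly query volume and the officially reported statistic, then apply it to this week's queries, which are available immediately. All of the numbers are made up.

```python
# Toy nowcasting sketch: fit query volume against historical official figures,
# then estimate the current, not-yet-reported figure from today's query volume.
# All numbers here are invented for illustration.
import numpy as np

# Historical overlap: weekly flu-related query volume and the official
# flu-activity figure the CDC published weeks later for the same weeks.
query_volume = np.array([120, 180, 260, 400, 520, 610, 480, 300], dtype=float)
official_activity = np.array([1.1, 1.6, 2.3, 3.8, 5.0, 5.9, 4.5, 2.9])

# Simple linear fit over the overlap period.
slope, intercept = np.polyfit(query_volume, official_activity, deg=1)

# This week's official figure won't exist for weeks, but query volume is
# available today, so we can "nowcast" the figure from it.
this_weeks_queries = 550.0
nowcast = slope * this_weeks_queries + intercept
print(f"Nowcast of this week's flu activity: {nowcast:.1f}")
```

The same shape of model applies to the Cash for Clunkers and unemployment examples: the search signal leads the official statistic by weeks.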
But then we run into this problem where the amount of information that we're creating is drastically exceeding the amount of available storage. So ultimately you have to ask: what big data do we keep? And this is really hard, because if you delete the right data, no one says thank you for deleting it, but if you delete the wrong data then you have to stand up in front of Congress and testify.

The unfortunate backstory here is that the Climatic Research Unit, at the University of East Anglia, did the heroic job of collecting historic surface temperature data from around the globe, and then they had to canonicalize it. They had to remove the biases of the various sensors and normalize all of that data. They did that effort, and that was the data set they used to posit their hypotheses around global warming. Then they moved buildings, didn't have enough space in the new building, and threw out the original data. Now fast forward, and all the climate change doubters are saying, well, you know, maybe in that bias removal phase you actually introduced bias, because you had this hypothesis and that skewed the data; so let's see the original data. Well, the original data is gone.

So as a result of this, in the United States we now have a rather onerous data retention requirement. The National Science Foundation, which funds much of our research in the sciences, the information sciences and many of the physical sciences, has a requirement that, as of January of last year, all proposals have to include a data management plan. And at a minimum you have to keep all data for three years after the end of the award. So from an institutional and organizational standpoint there are opportunities here. We don't want people storing that data on a little drive under their desk, because what happens if there's a fire and then we get audited, or there's a theft and that drive is stolen? So this is an opportunity to invest in pooled storage, either at the campus level, at the system-wide level, at the regional level, or perhaps even at the national level for some countries.

Cost: it's expensive. These are yearly costs, just sample costs taken from one of the universities that offers this kind of pooled storage. And you now have to build this cost into your grants, so the data management cost associated with big data comes back to really affect you. I tried to figure out what the requirement was for the United Kingdom, and I found documents that varied wildly, from something termed an "appropriate time" up to ten years. That was mainly bioinformatics; medical records had to be kept for ten years. And the NIH in the U.S. has historically had similar kinds of requirements: ten years, or in some cases, for patents, indefinitely.

Now, one of the things that somebody talked about today is that it's not always a requirement that big data be big. My argument is that data that's expensive to manage and hard to extract value from is what you should think of as big data. And this might be because you don't have the right tools to do the analysis, and also because the management costs start to become really significant and dominate the cost of your project. We heard earlier today that storing data and managing that data, especially in archival form, is not cheap. Facebook has almost a billion users; they crossed the 900 million user mark, and those users are uploading at least hundreds of millions of pictures every year. As a result, Facebook is growing by 200 petabytes per year.
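As a back-of-the-envelope check on what 200 petabytes per year means physically (the drive capacity and replication factor below are my assumptions, not figures from the talk):

```python
# Rough arithmetic behind 200 PB/year of growth. Drive capacity and the
# replication factor are assumptions chosen for illustration.
PB_PER_YEAR = 200
TB_PER_DRIVE = 3          # a typical large drive circa 2012
REPLICATION = 3           # e.g. three copies of everything for durability

raw_tb_per_year = PB_PER_YEAR * 1000 * REPLICATION
drives_per_year = raw_tb_per_year / TB_PER_DRIVE
print(f"{drives_per_year:,.0f} drives per year, "
      f"roughly {drives_per_year / 52:,.0f} drives per week")
# -> on the order of 200,000 drives per year, several thousand every week.
```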
To put that into perspective, that literally means they have containers of disk drives arriving at their data centers every week. Now, consider a typical cloud startup: Conviva is a network analytics company; they do streaming video. Their logging cost is actually starting to dominate their infrastructure cost; that's the storage for the analytical data. Again, they can't throw out this data, because their business is built on analyzing the performance of streaming video.

So what makes it hard to extract value from this data? Again, as we've heard earlier today, it comes from many different sources; it's uncurated, with no schema, no standard syntax, and inconsistent semantics. When you look at things like tweets, it's really hard to extract information from them, because there is no standardized form and people can say the same thing a thousand different ways, and if you think you've seen all thousand, then we find the thousand and first when we look at the data. So integrating all of this together can be a huge challenge. The second problem here is that there's no easy way to get answers that are both high quality and timely. And again, we want to be able to make real-time decisions. Wouldn't it be great if we could provide earthquake warnings from the first set of people that report an earthquake? For people hundreds of miles away, because it takes time for the various waves to propagate through the earth, especially the destructive ones, we could provide advance warning. But that means being able to do real-time analytics on something as messy as the Twitter firehose.

So the challenge is: we want to maximize the value that we can get from the data by providing the best possible answers, and we want to make that possible for naive users. We don't want you to have to be a big data scientist in order to do this kind of work. Doing this, we believe, requires a multifaceted approach. There are three dimensions to the problem, and I think we've heard about two of them today: the algorithms dimension and the machines dimension. But I want to talk in particular about the people dimension. We want to improve the scale, efficiency and quality of algorithms. We want to be able to scale up our data centers. And we want to be able to leverage the fact that people are pretty smart: they can solve problems that are really messy and hard for computers to solve, and they can do it really inexpensively too. So you need to adaptively and flexibly combine all of these dimensions to solve the big data problem.

So where are we today? The state of the art is that we have algorithms, machines, and people, and along these axes we have various point solutions. We have Hadoop and Oracle databases on the machines axis. We have Matlab and R on the algorithms axis. And we have crowdsourcing applications like Yelp and Mechanical Turk on the people axis. So when I got here on Tuesday evening I was hungry for dinner, and what did I do? I fired up Yelp and I quickly found what was supposed to be the best fish and chips place in this area of London. And it was really good, actually. So it does work.

How many people are familiar with Mechanical Turk? Okay, so let me explain what Mechanical Turk does. Mechanical Turk is: let's leverage the intelligence of people. You create a problem called a human intelligence task, or HIT, you give that to people, and you pay them to solve that problem. A very simple example of a problem is image segmentation. So I'll show you a picture of a bird.
You can quickly draw an outline around that bird, because that's just how our brains are wired up. It's very easy for us to pick out something like that, and very hard for a computer to do. In fact, at Berkeley some of my colleagues are developing algorithms for doing that image segmentation, but you need ground truth. So you take thousands of pictures of various things, let's say animals, because they're a lot easier to segment than people. You take birds, you take tigers and so on, and you farm out these HITs to people. How much do you pay them? Pennies. There are people around the world who are willing to solve these problems for pennies. Now, one of the problems you have to deal with is quality. Some people will just simply draw a circle around the bird, and that's not a very good segmentation. So you have to do quality control on the results that you get back from people. But the great thing is that people are awake on the other side of the globe, and they're awake here, so you can get answers very quickly. When we do projects like this, it typically takes anywhere from ten minutes to an hour to label a large data set of HITs.

Okay. There are some applications that push out along one of the planes. Watson, the Jeopardy winner from IBM, and Google Search push out along the machines and algorithms dimensions. But the key thing is that all of these are fixed points in the space, and what we really want is techniques that dynamically pick the best operating point to work at.

All right, so let's come back to what makes big data a problem. There are two reasons here. The first is that the more data you have, the greater the chance that you'll find a pattern that you're looking for. I think it was in Simon's talk where he showed how, if you had some rows in a table for the landslide analysis, you could add more columns, more variables, more dimensions. And as you added more dimensions, there were more hypotheses to test, and the problem grows exponentially in terms of trying to solve it. So that's good and bad: we'd like to explore that space, but we need to explore it intelligently. The second aspect is that the more data we get, the less likely it is that our super sophisticated machine learning algorithm with great error bounds will be able to give us an answer in an acceptable amount of time, which means we have to use a simpler algorithm, which is more likely to give us an erroneous answer. So we can't win either way.

As a more concrete formulation of the problem: if we're given an inferential goal in terms of a required error bound, and we're given a fixed computational budget, which is effectively machines and time, we want to provide a guarantee, supported by an algorithm and by an analysis of that algorithm, that the quality of the inference will increase monotonically as we get more data. Very simple. We can't do this today. But this is what we think you need to do in order to solve the big data problem. We think one of the solutions here is that if you blend statistical and computational design principles, you'll be able to build systems that can do this, and I'll talk about an example of such an approach in just a little while.

Okay, let's come back to the US and what big data looks like there. We've talked to a lot of companies, and there are many companies that are in the situation of collecting lots of data and putting it into these huge write-once, read-only repositories.
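One way to write that requirement down more formally (the notation here is mine, not from the talk): for a target error bound and a fixed budget, the chosen algorithm's error should stay within the bound and should never get worse as the number of data points grows.

```latex
% A hedged formalization of the stated goal; notation is illustrative.
% n        : number of data points
% B        : fixed computational budget (machines x time)
% \epsilon : required error bound
% A        : the algorithm (and its analysis) we must supply
\[
  \mathrm{err}\bigl(A(x_{1},\dots,x_{n}); B\bigr) \;\le\; \epsilon
  \qquad \text{for all } n \ge n_{0},
\]
\[
  \mathrm{err}\bigl(A(x_{1},\dots,x_{n+1}); B\bigr)
  \;\le\;
  \mathrm{err}\bigl(A(x_{1},\dots,x_{n}); B\bigr)
  \qquad \text{(more data never hurts, at the same budget).}
\]
```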
Now, why do these companies collect the data? For the reasons I mentioned: some analyst might want to go and look at that data, and if you tell them you threw it out, you lose your job. So you keep it, and you just keep buying more and more tape and keeping the disk manufacturers in business.

Many of the US government agencies are in the same situation, as we heard in the last talk: they've got lots of data, some of which they want to share with the public, and they want to perform analyses on it. Meanwhile Google is able to do all of these real-time analyses on data, and the agencies don't have the same ability to do that, from an infrastructure standpoint, from an expertise standpoint, and from an analytics and algorithms standpoint. There are many companies jumping into the big data space in the United States, and they're all deploying proprietary solutions, and other Fortune 1000 companies are buying those proprietary solutions, discovering they don't work after they've imported hundreds of petabytes of data, and now they're stuck. So this is where the hype machine is really starting to cause problems: everybody's trying to buy big data solutions and finding that not all solutions are equivalent. The good news is that there's a very, very active and broad open source community. It's really international; it's not just a US-centric operation. And there are even nonprofits now like DataKind, formerly Data Without Borders, which organizes groups of multidisciplinary experts who help nonprofits collect data, curate that data, analyze it and then visualize the results.

Now, because of the recognition that there is this big gap in the United States around big data, in terms of research and in terms of operational use within the US government, at the end of March the White House announced a broad $200 million new commitment to big data, both from a research standpoint and from an operational standpoint, and it's a litany of departments and organizations and agencies that are involved. Just to give you a little bit of insight into some of the projects: Health and Human Services is developing a data warehouse based on Hadoop. The FDA is developing a virtual library environment which will use crowdsourcing for analytics. The National Archives has more than 87 million documents in their renaissance collection that they want to make available for people to analyze. The NIH is developing the Worldwide Protein Data Bank; this is actually something that the UK is also participating in, and every month one terabyte of protein data is downloaded from that organization. The CDC is trying to do a more generic form of Flu Trends called BioSense 2.0, using all sorts of data sources to identify diseases that are present in the population, so we can figure out that there's an outbreak of X, Y or Z somewhere in the United States based on data that's being collected. The Veterans Administration has a Million Veteran Program, which is soliciting a million veterans to voluntarily donate their blood so that it can be genotyped and sequenced, so that they can look at how genotype affects disease progression and the correlations with diseases within that population, and of course within the broader American population.

So there's a lot of work going on in the open source community on all aspects of the big data pipeline, but there's much more that needs to be done.
Analysis environments, and in particular guidance for novice users about which tools they should use and, given a particular tool, how to use it. I'll come back to that in a moment when I talk about the AMP Lab.

So in the AMP Lab, our goal is to make sense of data at scale by tightly integrating algorithms, machines and people. We are one of the projects that was just sponsored as part of that $200 million commitment: the National Science Foundation, as part of that announcement, announced a $10 million, five-year grant to the AMP Lab specifically to study big data. What we're interested in is moving from point-wise solutions to solutions that mix elements of algorithms, machines and people. We have a broad set of faculty and experts involved in the project, and a broad set of sponsors on the industrial side, so we're working with some of the largest companies and some of the smallest companies, and as I mentioned we're now also sponsored by the National Science Foundation.

In terms of algorithms, the problem is that state-of-the-art machine learning algorithms just don't scale. You can't process all of the data points. Even at companies like Google, at best they can process the data they're collecting once; it's very hard to process it more than that. So here, on the x-axis we have the number of data points, on the y-axis we have our estimate, and there's the true answer. What we would like is for our algorithm to converge monotonically to that true answer. Now, a question that arises is: when do we stop? We just keep applying our algorithm to more and more data. We really need to provide error bars with our answers. If we provide error bars with every answer, then we can simply say what we want our bound to be and stop when we have reached that threshold. So now we're not thinking about it in terms of the number of data points, but in terms of time: time to reach a particular threshold. Once we reach the threshold, we can stop.

The second part of this is that we'd like to automatically pick the best algorithms, so that you don't have to be a world-renowned machine learning researcher in order to analyze big data. So again, we have our sophisticated algorithm that converges nicely and monotonically. We have a very simple algorithm, but it's messy: it's wildly inaccurate with small amounts of data, but eventually it converges nicely to the true answer. So if we have only a small amount of time, we say we can't give you an accurate answer; you've got to give us at least some minimum bound in terms of time, and then we will meet your bound in terms of error. Above that minimum, if you give me a little bit more time, we'll pick the sophisticated algorithm, because we know it converges monotonically. If you give more time still, we'll pick the simple algorithm. And that works, because the sophisticated algorithm won't work when we start to add a lot of points; the time grows exponentially for those algorithms.

The second axis is machines. The data center as a computer is still in its infancy. No two data centers are alike. There are projects like Facebook's Open Compute Project which are trying to standardize data center design so we can get economies of scale. But very few companies (Amazon, Microsoft and Google) are really able to get true economies of scale out of their designs.
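Going back to the algorithm-selection idea just described, here is a toy sketch of what such a policy could look like. The runtime models, thresholds and algorithm names below are entirely invented; a real selector would be driven by profiling and by the error-bar machinery described above.

```python
# Toy sketch of budget-driven algorithm selection. Cost models, thresholds
# and the candidate algorithms are invented purely for illustration.
def runtime_sophisticated(n_points):
    # Assume a superlinear algorithm: great error bounds, poor scalability.
    return 1e-6 * n_points ** 2

def runtime_simple(n_points):
    # Assume a roughly linear algorithm: noisy on small data, scales well.
    return 1e-6 * n_points

def choose_algorithm(n_points, time_budget_sec, min_budget_sec=1.0):
    """Pick an algorithm that can finish within the given time budget."""
    if time_budget_sec < min_budget_sec:
        return None                     # refuse: no useful error bound possible
    if runtime_sophisticated(n_points) <= time_budget_sec:
        return "sophisticated"          # converges monotonically, so prefer it
    if runtime_simple(n_points) <= time_budget_sec:
        return "simple"                 # only choice once the data gets large
    return "simple-on-a-sample"         # last resort: subsample, widen error bars

print(choose_algorithm(n_points=1_000, time_budget_sec=5))        # sophisticated
print(choose_algorithm(n_points=50_000_000, time_budget_sec=60))  # simple
```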
And even within those companies' designs you'll find that they have purpose-built clusters; the machines are not all identical, and as a result you have highly variable performance. It's very hard to program a data center as a computer, and it's very hard to debug when something goes wrong. In fact, it's even hard to know that something has gone wrong. We would like to make it equivalent to programming a PC, where it is easy to program, easy to debug, and easy to know when something has gone wrong. So we want to make the data center really be a real computer. And to do this you need a data center operating system, and we have developed one. It's called Apache Mesos. Since December of 2010 it has been an Apache Software Foundation incubator project, and just two days ago we celebrated our first official Apache release of the software. How many people here use Twitter? We're powering a dozen of their production services using Mesos, and they're active contributors to the project. So we have both an operational side and a very scientific side to our research.

On top of a data center operating system like Mesos you can share these different machines, provide new abstractions and services, and run your existing types of applications, ranging from Hadoop to high performance computing applications, HyperTable, Cassandra and so on. And we're also developing novel application frameworks. One of the frameworks we've developed is Spark, and it supports an iterative version of MapReduce. A lot of machine learning algorithms, logistic regression for example, are iterative processes that run until you converge to some error bound. The problem with traditional MapReduce is that every time you finish the reduce phase you write everything out to disk, then you start a map phase that reads it all back in from disk, and you have to serialize and deserialize the data in those stages, so if you're doing a lot of iterations that gets very, very expensive. Spark keeps all of this in memory. We're now talking about machines that have anywhere from 32 gigabytes up to 200-plus gigabytes of RAM, so we can keep the data set in memory, and we get multiple orders of magnitude speedup with approaches like that. That's Spark.

SCADS is a consistency-adjustable data store; it's not a SQL-based store, and it allows us to put bounds on consistency and on the time to reach consistency, especially when you have distributed data centers. And PIQL, pronounced "pickle," is a query language like SQL except that it includes the notion of time: you can say how long you want a query to take and whether you're willing to accept errors. So it might not be a perfectly accurate result, because we haven't looked at all of the data, but we'll put bounds on the error of the result we give back to you. On top of all of this we can build our applications and tools, and these include advanced machine learning algorithms, interactive data mining and collaborative visualization.

The last axis is people, and what's great about people is that you can give them messy data and they figure it out. I talked about Amazon Mechanical Turk; there are also other systems like Quora, Many Eyes and Galaxy Zoo. All of these basically give people tasks and let them solve them. We want to make people an integral part of the system in two ways. First, we want to leverage human activities: we want to monitor what people are doing and use that to infer things. Examples would be the Google Flu Trends work or the FDA's BioSense 2.0.
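As a concrete illustration of the Spark point above, here is the classic iterative logistic regression pattern: parse the data set once, cache it in cluster memory, and then iterate over it without going back to disk. This sketch uses the later PySpark API and an invented input path, so treat it as illustrative of the idea rather than the exact code behind the numbers in the talk.

```python
# Sketch of iterative logistic regression over a cached, in-memory data set,
# in the style of Spark's classic example. The PySpark API shown here and the
# input path are illustrative assumptions.
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="logreg-sketch")

def parse_point(line):
    # Each line: label followed by the feature values, space separated.
    values = np.array(line.split(), dtype=float)
    return values[1:], values[0]          # (features, label), label in {-1, +1}

# cache() is the crucial step: the parsed points stay in cluster RAM, so the
# iterations below do not re-read and re-deserialize the data from disk.
points = sc.textFile("hdfs:///data/points.txt").map(parse_point).cache()

num_features = 10                         # assumed dimensionality
w = np.zeros(num_features)

for _ in range(20):                       # each pass reuses the cached data
    gradient = points.map(
        lambda p: (1.0 / (1.0 + np.exp(-p[1] * w.dot(p[0]))) - 1.0) * p[1] * p[0]
    ).reduce(lambda a, b: a + b)
    w -= gradient

print("final weights:", w)
```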
And the second way is to leverage human intelligence through crowdsourcing. Use people to curate and clean dirty data, because that's what people are really good at. Use them to answer imprecise questions. Use them to test and improve the answers to questions that may have been generated by the algorithms: is this a good answer or not? A person can instantly say yes or no. Now, the challenge is that people, unlike computers, are not equivalent; we can't just cookie-cutter them. And as in the example I gave with image segmentation, people will game the system and they will return wildly different answers. With that bird example, some people are very finely detailed and some people aren't. So we need ways of taking the data we get from people and doing a cleaning step and a validation step on it.

So, quickly, I want to talk about a couple of real applications that we have built with our collaborators in the AMP Lab. Alex Bayen is a professor in civil and environmental engineering at Berkeley, and he has a project called Mobile Millennium. This uses a combination of loop sensors in highways and smartphones to collect traffic information, and does it in a privacy-preserving manner. Unlike systems like Google Maps, where Google knows exactly who you are, where you're going and where you're coming from, in this system all of your data is anonymized, so you can't infer individuals from the collected data.

My colleague Paul Waddell over in the College of Environmental Design is building a system called UrbanSim for micro-simulation of urban development. This is basically answering questions like: I have a river and I need to put a bridge across it; where do I put the bridge? Where you put the bridge is going to affect population growth, traffic and economic growth in that environment, not for a few years but perhaps for decades or even a hundred years. So where you put a highway, where you put infrastructure, is a really important decision to make, and you want to simulate down to the level of neighborhoods. It's a very computationally intense problem, a real big data problem. He holds the world record for accuracy in terms of the predictions he can make based upon where you put highways and other infrastructure.

And then my colleague in industrial engineering and operations research, Ken Goldberg, is working on crowd-based opinion formation. He has a project called Opinion Space with the US State Department, where you can basically ask people questions to get their opinions, and then people can see other people's answers and rate those answers. So you can see where your opinion fits within the space of other people's opinions and, sort of, the global opinion thought process. It's really interesting in terms of how people can interact within a social space.

And then the last project that I want to mention is personalized sequencing. We're collaborating with Dr. Taylor Sittler, who's over at the University of California, San Francisco, on cancer genomics. The motivation here is that, as we heard today, the cost of sequencing is dropping dramatically, and the time to sequence is dropping dramatically. We're going to be at $1,000 very shortly, and a little after that we're going to be at $100 per sequence, less than a typical in-office diagnostic test, to get your genome sequenced. So what we're working on with Taylor is cancer genomics.
So this means sequencing the individual and then sequencing the cancer tumor to understand what genetic mutations have occurred and what pathways have been activated or deactivated in the cells. Now, one of the interesting challenges here is that what people have learned by doing this sequencing is that there's actually a whole ecosystem within a tumor: you have multiple genetic strains within a single tumor. In one of the examples, I think it was a kidney tumor that they sequenced, they found 155 different cell strains. So you're taking that $100 and multiplying it by 155 plus 1, because you also have to sequence the donor. But if you can do that, and you can do it quickly, you can save lives. And so this really is an interesting problem, because timeliness is important: we need to be able to do the analysis in time for you to develop a personalized treatment.

Okay, so all of these applications sit at fixed points in the space today, and what we're trying to do in the AMP Lab, working with them, is move them so that they can leverage all of the axes. As an example, we've developed a new sequence alignment algorithm that is more than an order of magnitude faster than the current alignment algorithms, and we've also found that some people have been looking at crowdsourcing the alignment problem. It turns out that you can give people the little fragments of base pairs and they're really quick at aligning them, and they actually view it as a game. So it's interesting, because we can combine algorithms with crowdsourcing to check whether the answers we get from the algorithms are actually correct.

Okay, so in closing: big data in 2020, are you prepared? We need to create a new generation of data scientists who know how to work with big data. These are people who are going to be very interdisciplinary, because they're going to be working across scientific domains and engineering domains. Machine learning is going to move from being a scientific discipline to being an engineering tool that we use to solve these problems, and we want people to be deeply integrated into the big data analysis pipeline at all stages, from the data collection and curation process to the data processing steps to the data analysis steps. So for those of you at educational institutions (or you can substitute your company name), are you going to offer a big data curriculum that touches all of these fields? Will you have hired cross-disciplinary faculty and staff who can deal with the challenges associated with big data? Will you have invested in the pooled storage to deal with the tsunami of data that people want to collect and hoard? Will you have invested appropriately in private clouds to process that data, or in infrastructure to work with the public cloud providers, like private peering and other sorts of things? Will you have built the inter-campus and intra-campus networks to be able to move that big data around? What we're finding on a lot of our campuses is that we can get big data to the edge, but then getting it across campus is a huge problem, and even moving it within a building can be challenging.

So in summary, the goal here is to tame the big data problem. We want to get results with the right quality in the right amount of time, and our belief is that you have to take a holistic approach.
You have to combine algorithms, machines and people, and there are lots and lots of research areas covering many, many domains. This is a five-year project. We're just going to try to scratch the surface in terms of creating the infrastructure and the tools, and really try to lay the groundwork for other researchers to solve all of the problems here. So with that, I'll close and take any questions.

Thanks very much for a great talk. Any questions for Anthony? Nothing? Really? Hang on for the mic.

Hi, Matt Johnson. My question is about what the US is doing about training the next generation of data scientists.

Yeah, that's a really good question. Mainly it's about creating these interdisciplinary centers. So there are a number of schools that have started to try to create centers that bring together machine learning people along with social scientists, along with the systems people. But it's a relatively small number. The Expeditions in Computing program made four different grants; it covers a wide spread of universities. I think there's a total of maybe about a dozen universities that are funded under that program.

Simon Price from the University of Bristol. I was curious, with the machine learning and the automatic choice of algorithm: when we can't do it for small data, how do you think it's going to work more easily with big data?

Yeah, that's a really good question. It's not; it's still going to be really hard. The thing is, what we want to do is try to create toolkits that allow us to profile your data on the small side, and then, once we've understood how it behaves in terms of cleanliness and convergence properties, use that to feed into the decision-making engine that will select the appropriate algorithm given a computational bound. But it's by no means a solved problem or a simple problem.

That's a good answer. Thank you very much. It's really interesting; I think the sessions after tea have been really, really interesting. On the last point, where you said you need to have the right quality in the right time: who judges what is the right quality?

Yeah, that is a really good and interesting question. That's one of the areas where we want to try to provide people with guidance. If you're a data scientist, then you know what a quality metric means. But if you're just an ordinary analyst, you probably don't. So one of the things that we're trying to understand is how sensitive answers are to a particular quality metric, and whether we can then use that to give people guidance as to what it means when we say this is the error bound that we're providing on your answer. This also feeds into things like PIQL, where you want to make a query on a database and you're going to set an error bound and a time for that query to be performed. We understand what it means to set a time bound, because maybe someone's clicked on something on the web and we want to render the page within 100 milliseconds. But we don't necessarily understand what that accuracy bound means. So that's an open area for research.

And is that associated with cost?

Yes, absolutely. Time equals cost in many different dimensions. Time equals cost in the sense that if I have to render that page within 100 milliseconds and it takes me longer, I lose viewers; people go away, and there have been studies at Microsoft and Google that have demonstrated that slow search causes people to stop using your engine.
But also from a business standpoint, if I need to redirect that truck from Akron to Cincinnati, I have a finite window in which to be able to do that. So, you know, there are many examples in each of the problem domains where time equals money in some way or another.

Questions at the back, and then I'm going to wrap things up. All right.

You talked a bit about a number of challenges that you're currently facing, and I'm assuming that over time many of those are going to be resolved, because the cost of fixing them is going to come down. But I'm assuming there are also some intractable problems that you're facing, ones that you can't currently see reducing in cost over time. Are there any things that you can highlight there?

So the simplest intractable problem is: what data do I throw away? Right? That's one of the really hard questions, because, you know, if you look at genomics data, when they first sequenced the genome they found all of these repeating sequences across multiple species and thought that was junk, the so-called junk DNA. And it turns out that's all the machinery that drives cells. So when you sequence, you get a lot of data, and what do you keep and what do you throw out? In every field this is the same thing: what is important today is one thing, and what's important tomorrow may be something entirely different. So I think that's one of the really intractable problems: when can I throw away data?

Graham Gilbert from the University of York, and I'm the finance director. I know nothing about IT and I know nothing about research. But as finance director, of course, I've become quite sensitive to issues that drive investment bubbles. And as I listened to you, first of all, I found it very interesting. I think the analytical spin you put on it, which enabled us to make a bit of sense of how you actually penetrate the problem of getting information out of all this data, was very insightful. But I am left with the question as to whether this isn't the next tech bubble. When you look at the list of sponsors that are funding your activities, the first thing you realize is that these are very big companies. They're also companies that have huge market values, and yet many of them have not yet produced any profits. They are presumably building their values around the notion that they can make sense of the vast amounts of data that they are collecting. And the question then ultimately arises: yes, but really, will that happen? Are we going to find ourselves having invested in these businesses, many of which will actually go pop because there really isn't as much value in that data as you are hoping to find?

Yeah, that is actually a really good question to ask: are we entering the next giant bubble? The thing is, today, the really large companies are the ones that can leverage big data, and we have heard lots of examples of that today. Amazon is able to beat out all of its competitors because it uses big data. Walmart beats out all of its competitors because it uses big data. Amex beats out its competitors because it uses big data. All of these companies have figured out how to get a monetary return on their investment in big data. The challenge is: how do we take it to the Fortune 2000? How do we give any company, any researcher, anybody, the same easy access to big data?
So I think, you know, there is a trap of believing that if I just throw a lot of data into a machine learning algorithm, I will instantly start printing money. It's not that simple. But if you do make the investment, you will ultimately see big returns. Okay, I think we're going to wrap up there. Thank you very much.