So, welcome. In terms of full disclosure, I'm going to say almost nothing about cancer, so if you came to this meeting to learn something about cancer, this is your moment to go back to the other room, get a cup of coffee, and relax. What I've done for the last 20 years is build statistical models over big data, and as the amount of data in genomics grows, it's worth thinking about how best to build statistical models over it. But I'm not going to get to the cancer, and — you can sue me if you want — I'm going to use the term "big data" for a while simply to mean large amounts of data. So I'm going to talk about some of the problems we have with large data, how you might compute over it, one of the technologies — science clouds — that can be used for it, and how we might organize at scale as the amount of genomic data grows. The whole talk is organized around four questions, and I have no idea what the answers are, so if you have strong opinions, it would be interesting to talk about them during the meeting. The first question is: what is the same and what is different about large biomedical data versus big science data versus commercial data? You'll hear my opinions, but I think it's going to take us a couple of years to get to ground truth. The second question is: as the amount of genomics data grows, what's the right instrument to use? The third question is: as the amount of data grows, can we use the same statistical techniques we have been using, or do we need to create new ones? I think that's the toughest problem to think about. And the fourth is: how do we organize as a community as the amount of genomics data grows? We wouldn't have to think through these questions if the amount of dollars we had were growing with the amount of data. Then we could keep doing what we're doing, and as long as both the dollars and our ability to do smart things with data grew as fast as Moore's law, we wouldn't have a problem. But more or less there is a constant amount of dollars, a constant number of bioinformaticians, and an exponentially growing amount of data, so we have to think about what to do in that situation. That's pretty much what this talk is about. The way I think of it: the 10,000th sample was just completed, and shortly we'll have the analysis of roughly 10,000 genomes. We have other large projects like ICGC, and over the next few years there will be a growing number of projects. So think of the problem at a million genomes: at about a terabyte for a matched pair of a tumor and a normal, that's a million terabytes, or a thousand petabytes, or about an exabyte. That's a lot of data. We can compress it; it's still a lot of data, and it costs a lot of money. You might think of it as 100 studies of 10,000 patients each, each about the size of TCGA. How do we organize the data management, the data analysis, and the collaborative analysis of something like a million genomes? I may run out of time at the end, so I really want to thank my colleagues and collaborators now, especially Kevin White — we worked out a lot of these ideas together — and Nancy Cox, Andrey Rzhetsky, Lincoln Stein, and Barbara Stranger; a lot of what I'm going to say reflects joint work with these individuals. And these are my lab.
I especially want to thank Allison Heath, who is the lead of the Bionimbus Protected Data Cloud, and some key people from the White lab who helped. I want to talk about the disruption that big data is causing in biomedical computing. This is a standard slide I creatively borrowed from the NCI, and it shows what you might think of as the current model of biomedical computing. There's an artist's rendition of a bioinformatician in the lower right-hand corner. He or she has a private local computer, downloads data — data that can now be created for about $1,000 — uses community software, and everything is fine, except that if you think of the $100,000 it might cost to analyze a genome, there's a mismatch between how easily we can create the data and how much it costs and how long it takes to analyze it. And as the amount of data grows, you can't use this model anymore. The first problem is the growth of the data: it can take weeks just to download a couple of petabytes. The second issue is the growing number of types of data. If we want to understand environmental impact at scale — and I'm talking a few years out — then we need geospatial data at scale. I'll talk a little about other ways we might capture individual patient data and environmental data; there will be various devices you might wear. There's image data. So we have new types of data, and we have this mismatch between the amount of data and the bioinformatics capacity required to analyze it, and so things are pretty fundamentally broken. Let me dive down into one of these areas. I don't have to tell this audience how big genomics data is; instead I want you to think ahead a couple of years. I heard an interesting talk recently about the sensors that are going to go into phones — this is just one example, from LifeWatch. Phones already have a lot of sensors, whether or not you're aware of them; that's how they can sense where you are, what your acceleration is, and so on. Over the next few years they're going to add sensors for environmental data, like temperature, pressure, and humidity, and they will be adding sensors for biometrics, first for security. But once those are there, you can do other things. Here's a simple example where a phone will give you blood glucose, oxygen saturation, and heart rate, and infer stress. So over the next few years there's going to be a lot of data coming from devices, and that's just one of the modalities that will eventually give us environmental data. If we think over the next five years, we're going to have a lot of new types of data. Now, my wife told me I needed a hobby, and my hobby, as many of you know, is collecting pictures of data centers. This is a picture of a Google data center. Luckily, Google has made a lot of their data centers public, and you can drive through them virtually. Whenever I have trouble sleeping at night, I like driving through the Google data center until I fall asleep. I recommend it — raise your hand if you have trouble sleeping when you travel, anyone? Yeah — just take a tour of a Google data center. It's kind of interesting, and it will help you get back to sleep. So why do I talk about a Google data center? That's on the next slide.
I spent a little bit of time as a consultant building computational advertising systems. Why is computational advertising interesting? First, what is it? It finds the best match between a user in a given context — searching, or walking in a given area — and something that might make them happy; what makes people happy is good health and seeing nice advertisements. Why is this relevant? It's relevant because it's a $100 billion industry, and if you do it right, you make an extra billion dollars. If we do cancer right, we save a lot of lives, but we don't make an extra billion dollars. So there's a cycle in computational advertising that drives innovation in data analysis that we don't have in our field. We might save money, we might improve health, but we can't mint money by building better data infrastructure or better data analysis. We can, however, borrow what those people build. Just to give you an example: a modern advertising platform will build full behavioral statistical models over 100 million or more individuals. They will reanalyze the data each night — all of it, every bit of data in the data center. They'll serve tens of thousands of ads per second, in milliseconds, using exactly where you are geospatially. They'll do it at machine speed, with fewer people than we use in bioinformatics, and with people who have a lot less training. So there's probably something to be learned from this. One of the standard solutions to the problem I just described — the growth of genomic, environmental, and other types of data — is to borrow that technology and create something similar: instead of asking every individual to download and work locally with terabytes, and soon petabytes, of data, you do it in a commons. Everyone shares in that commons, and we try to make the experience as close as possible to what they would get locally. For the bioinformaticians: when you first did mail in the cloud with Google or Yahoo, you didn't think it would work, but soon it did, and now most of us do it. That's probably the transition we'll see when you have to analyze these large terabyte- to petabyte-scale data sets. At the beginning it will be awkward and painful — for those of us working on it now it may be very painful — but it will eventually get better, and in the end it will be as seamless as mail is now. We will also interoperate with private storage and compute at medical research centers and universities. The point of this talk is how to structure all of this so that it works as seamlessly as possible. Let me remind you of the terminology: why do we call it a commons? Garrett Hardin wrote an interesting Science paper in 1968 called "The Tragedy of the Commons." The term commons has been used in economics for something like a cow pasture in a village. If you have a village with a pasture, and everyone keeps the village in mind, then all the cows can share the pasture and it's available to everyone. If some individuals with cows take more than their share, then there's not enough common pasture for the village and everyone suffers. And that's more or less where we are in bioinformatics today.
We can't afford for everyone to have their own local petabyte cache of genomics data. We have to build some commons and figure out the most transparent way to share it. If we don't do it right, we get what Garrett Hardin called the tragedy of the commons, so you can think of a lot of this talk as how we stay away from the tragedy of the commons as we build large biomedical commons. The standard story we're trying to work out is that you would use these large structures to create a commons of biomedical data; there would be a number of them, they would interoperate, and you would compute over them. I want to put this into perspective with three eras of bioinformatics, each about 10 years long, and we're just leaving what I think of as the first era. The first era started in 1999 with the Botstein and Smarr report that created BISTI. It was really about the integration of informatics tools and how we could do that at scale — and whenever you do things at scale with computing and data, it's easier to get it wrong than to get it right. I think we've just emerged from this era. There was a new report: the Data and Informatics Working Group, led by DeMets and Tabak, did a 2012 report on big data, and I want to come back to that and how we get there. There have been different assessments of what we did in that first era, but as we think into the future, I think it's good to keep this almost 30-year context in mind: how do we do bioinformatics at scale, and how do we do it right? I'm going to come back to fill in this chart. Now, in most talks you can only learn one thing and then you stop paying attention. I can't tell you anything about cancer, because I really don't know anything about cancer; I can tell you a little bit about computing. The only thing I want you to take away from this talk is that I want you to stop using the term "cloud computing." Why? Well, by the time you can't go through an airport without seeing the words "cloud computing," there's probably no meaning left in them. At least when I go to airports, if I didn't have a family to support, I would probably remove all the signs I see about cloud computing, because I think they send the wrong message — but at this point in my life I don't want to be arrested for vandalism. So I want to give you a better way to think about this, and this is again a picture of a Google data center: the mechanical side, the power, the cooling, and so on. If there's one thing you take away, it's that you should try to get into a Google data center. If you're mischievous, you could try to paint different colors on some of the pipes, but it's kind of interesting. So the first question: we know that for small objects we use microscopes, and for far objects we use telescopes — what would a datascope look like? Barroso and Hölzle wrote The Datacenter as a Computer; this is the second edition. The key point is that a datascope is probably a data center, and we need to learn how to engineer data centers that will scale with the bioinformatics software we need.
So really, we don't care about cloud computing; we care about how we can build boutique data centers for our community that let us do the bioinformatics we need — with a modest number of people, with software that scales and is easy to use, and in a way that allows a number of them to interoperate at scale. The data center as a computer is what's relevant for our community. So forget about cloud computing; think about data centers. In a modern data center you can go to a portal and self-provision a cluster with 100 nodes, do a computation — wake up in the middle of the night, get on the portal, do a bioinformatics computation — then tear those 100 machines down and have the results of an analysis you can use the next day. The only way this can be done at scale is if the data center uses massive automation, and the only way it works for our community is if the software scales to the entire data center. Now, about databases — this audience knows them well, but the people I hang out with get very emotional about the design of databases. There are traditional relational databases and there are the so-called NoSQL ("not only SQL") databases, and there are a couple of simple rules. Databases, like all software, take about 10 years to get right. And the main reason databases are no longer used for some of these problems in advertising is that, until very recently, databases didn't scale to a data center; they scale to a rack or a portion of a rack, not to hundreds of racks. The types of databases we need are the types that scale to a large number of racks in a data center. If you read the NIST report that defines a cloud, you can have a cloud that looks like the thing on the left, or you can have someone operating a cloud that looks like what the bioinformatician on the right has, because from that point of view all that matters is that it runs OpenStack or some other software and has certain characteristics. But the difference is that we can't do what we need with this — I'm not going to embarrass anyone, and I don't think the person is in the audience — this is a real, live data center at a university that does top-rate science, regularly published in Science and Nature. That's their data center; compare it to this one, and there's a big difference. We have to get to this. So here's a schematic of what you need: accounting and billing, monitoring, provisioning. If you do this right, you can get to hundreds of petabytes, which is pretty much what we need for our exabyte of analysis. This is a commercial data center. It's run by about 25 people, and they measure computing in megawatts. The point I want to pose to this audience is that there's nothing to keep us from building these things for our community; the real questions are who should build them, how they should operate, and how we best do bioinformatics with them. And there's a lot of open source software out there to do it. For the first time, Facebook has described how they build their data centers — people have not done this very often — in something called Open Compute.
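To make the self-provisioning workflow described a moment ago a bit more concrete, here is a minimal sketch, assuming an OpenStack-based data center (OpenStack is one of the technologies the talk mentions betting on) and the openstacksdk Python client. The cloud name, image, flavor, and network names below are placeholders for illustration, not details of any real deployment.

```python
# Minimal sketch of portal-style self-provisioning against an OpenStack cloud.
# Assumes openstacksdk is installed and a "bionimbus" entry exists in clouds.yaml;
# the image, flavor, and network names are illustrative placeholders.
import openstack

def provision_cluster(conn, n_nodes, image_name, flavor_name, network_name):
    """Stand up n_nodes identical VMs and wait for them to become ACTIVE."""
    image = conn.compute.find_image(image_name)
    flavor = conn.compute.find_flavor(flavor_name)
    network = conn.network.find_network(network_name)
    servers = []
    for i in range(n_nodes):
        server = conn.compute.create_server(
            name=f"analysis-node-{i:03d}",
            image_id=image.id,
            flavor_id=flavor.id,
            networks=[{"uuid": network.id}],
        )
        servers.append(server)
    # Block until every node is up (or raise if one fails to boot).
    return [conn.compute.wait_for_server(s) for s in servers]

def teardown_cluster(conn, servers):
    """Tear the cluster back down once the analysis is finished."""
    for server in servers:
        conn.compute.delete_server(server)

if __name__ == "__main__":
    conn = openstack.connect(cloud="bionimbus")             # placeholder cloud name
    nodes = provision_cluster(conn, n_nodes=100,
                              image_name="ubuntu-analysis",  # placeholder image
                              flavor_name="m1.xlarge",       # placeholder flavor
                              network_name="private")        # placeholder network
    # ... run the alignment or variant-calling pipeline on the nodes here ...
    teardown_cluster(conn, nodes)
```

The particular API matters less than the point: standing up and tearing down a hundred nodes becomes a scripted, automated operation rather than a procurement process.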
Facebook will spend a couple hundred million dollars on a data center, and most of us can't spend that much money, but they've open-sourced how they do it through Open Compute, so if we want to build a downsized data center, we can. This is a picture of a 30-megawatt Facebook data center, and the blueprints are out there; there's nothing to keep our community from doing this and interoperating with others. I'm not going to talk much about our own work, but what we did was take the standard blueprints for a commercial data center, scale them down dramatically so we could afford to build it, use open source software wherever we could, and build the smallest possible amount of glue software — some automation software, some compliance software, and some billing. We used open source for almost all of it because we know you can only build a little software yourself: it takes 10 years, and you get it wrong the first three times. Luckily we've been doing this for almost 10 years now. We bet on three things — OpenStack, Hadoop, and Gluster. Two of them worked out well and one didn't; I won't mention which, but over the last couple of weeks and the next few we're transitioning to a data center infrastructure built on the two that did work. The standard question is: why not just use commercial data centers? If you go through airports you see signs from Microsoft and Rackspace and all the commercial providers, and they're certainly going to be part of the solution; the question in front of us is whether they're the only part of the solution. They scale, they work as long as you have a credit card, and they give you a lot of choices. The reasons we should build our own, in addition to interoperating with the commercial ones, are these: at medium scale it costs less — costs matter, and if you're a CIO you probably know the number, but it's roughly three to five times less expensive to build your own; we can build the specialized infrastructure we need for bioinformatics; we can build the specialized infrastructure we need for compliance and security; and the data is probably too important to be trusted exclusively to commercial providers. Then the question is how we would build these. Say you've done your NIH grant: you've gotten samples, sequenced them, analyzed them, and put them in some cloud. You should have what you might think of as a green button. Has anyone used a blue button to get your medical records? There's a process HHS runs where there's a blue button, and if you hit it, you get your medical records. I think the way our community should work is that anytime we do anything, we should have a green button, and if you hit the green button, your terabytes or petabytes of data get moved to another data center that you like better, so that you control where your data goes. It has to be done securely and compliantly, and it has to work with the access apparatus we have with dbGaP or whatever replaces it, but you should have a green button, and whenever you hit it you should be able to liberate your data and move it around. If we do that, it's a pretty safe way to interoperate with these data centers.
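One small but necessary piece of any "green button" is being able to verify that terabytes of files arrived intact at the destination data center. The sketch below covers only that piece, using just the Python standard library; the paths and manifest format are made up for the example, and it says nothing about the transfer mechanism, access control, or dbGaP compliance that a real green button would also have to handle.

```python
# Minimal sketch: build (and later verify) a checksum manifest for a data export,
# so that after a "green button" transfer the destination can confirm integrity.
# Standard library only; paths and the manifest format are illustrative.
import hashlib
import json
from pathlib import Path

def sha256sum(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 without loading it into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(data_dir, manifest_path):
    """Record relative path, size, and checksum for every file in the export."""
    data_dir = Path(data_dir)
    entries = [
        {"path": str(p.relative_to(data_dir)),
         "bytes": p.stat().st_size,
         "sha256": sha256sum(p)}
        for p in sorted(data_dir.rglob("*")) if p.is_file()
    ]
    Path(manifest_path).write_text(json.dumps(entries, indent=2))
    return entries

def verify_manifest(data_dir, manifest_path):
    """At the destination, re-hash every file; report missing or corrupted ones."""
    data_dir = Path(data_dir)
    entries = json.loads(Path(manifest_path).read_text())
    problems = []
    for e in entries:
        target = data_dir / e["path"]
        if not target.is_file() or sha256sum(target) != e["sha256"]:
            problems.append(e["path"])
    return problems

if __name__ == "__main__":
    build_manifest("export/study_0001", "study_0001.manifest.json")         # at the source
    bad = verify_manifest("import/study_0001", "study_0001.manifest.json")  # at the destination
    print("missing or corrupted files:", bad or "none")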
This chart shows our costs compared to a large provider: the green line is the largest provider of computing infrastructure, and the pink line below it is our cost for adding a petabyte of data. At the beginning, if you have less than about half a petabyte, it is much, much less expensive to use a commercial provider, but as the number of petabytes grows, the gap between your own cost and the commercial providers' cost becomes very large. This assumes you're operating at, say, 10 to 20 petabytes; if you operate much smaller, the slopes are different, but around 10 petabytes there's a big difference, and that's one of the reasons. I'm going to run short of time here, but one of the tests you can use is this: if you've built this right and you're doing, say, alignment or variant calling, and you double the amount of data, then you should be able to double the number of racks and nothing should change. The software shouldn't change, the analysis shouldn't change, the accounting shouldn't change — you should be able to do the variant calling in exactly the same amount of time with twice as much data. I call this the rack test, and almost no infrastructure we build today passes it. Typically, if we double the number of racks, we go through a process of months or longer to change the software, change how we provision, and change the standard operating procedures. What we should get to is that if we double the racks, we can handle double the data in the same amount of time. That's the rack test, and it's the architectural principle for building these. So I want to fill in the next line of this chart. What I think of as something we now know how to do — not yet very well, but we should get better at it over the next five to ten years — is what you might call data-center-scale science. We have examples: the one we're working on, Bionimbus; CGHub, which is probably one of the best examples of genomics at this scale and is what David Haussler and his collaborators are doing; the Cancer Genome Collaboratory that Lincoln Stein is doing; GenomeBridge, which Gaddy and others are doing; and companies like DNAnexus. Over the next few years we're going to understand how to do data-center-scale science, and we have to make these interoperate. There are a lot of details to get right — I think this is some of what Warren was talking about in the big-data NIH report — and I want to talk very briefly about whether our discipline is the same as or different from the other large science disciplines. If this were a general science audience, the people who work on the LSST in astronomy or the LHC in physics would get upset with me and confront me at the end of the talk. Just so I know whether I have to go out the back door quickly: do we have anyone here who works on the LSST or the LHC? Okay. The reason they get upset is that these are the official numbers, and they always say they have more data. Everyone has more data, but these are the official numbers, and there are a couple of important differences. One, these communities have 10 years to get something right, and by Moore's law, over 10 years almost anything gets easy. The other is that they're not actually handling that much data.
They're handling a scary amount — 12, 15, 20 petabytes of original data, with perhaps 10 times that derived from it — but it's a manageable amount of data. Now look at our community. I wasn't able to get exact numbers, but I've been in this business about eight years, and every three to four years we radically change the instruments we use. The cost of an instrument, fully loaded with people and so on, is closer to a million dollars. We don't have one 10-billion-dollar instrument; we have thousands or tens of thousands of million-dollar instruments. There is no business value in the physics data or the astronomy data; there is business value in the data we produce, so people are not always well behaved about sharing it. The physicists, and more recently the astronomers, have a culture that I think of as big-data friendly. I'm trying to say this in a nice way, but many of my colleagues — I won't mention who — will spend 10 to 15 years refining an ontology. It's not that that isn't important; it's that it doesn't scale, and one of the lessons people have learned, which I'll come back to, is that as the amount of data scales you basically have to replace manual processes with automated processes, and we haven't been through that. And the compliance requirements are quite different. So, looking at the bottom rows, in aggregate we have more data; at the top, there is common computing, storage, and transport technology, and our community is by and large afraid to use it — I'll come back to that, but there's no reason we can't. Yet the security, compliance, sharing, and collaboration are quite different, as is the distributed nature: we have an intrinsically distributed source of data. The physicists and astronomers analyze in a distributed fashion, but we both produce and analyze in a distributed fashion. I don't want to go through this whole slide — it's a standard one — but I wanted one slide on why we can borrow from other communities. There's a lot of open source software for transport. If you pull it off the shelf, it doesn't work very well, but if you change a few parameters, it does. This is a 10-gigabit-per-second connection between the US and Europe, and I'm moving some 1000 Genomes data across the Atlantic. It starts out at 1.6 gigabits per second. I change the buffer sizes and it goes to 3.3. I set a flag so it pins to the right core and it goes to 3.7. I do that on the other side and it goes to 4.6. I turn on processor affinity on both sides and it goes to 6.7. So I've taken a terabyte of data from 85 minutes down to 20 minutes with a little bit of tuning, all with open source software and a single flow — and I can run multiple flows. This is UDT, which is used by the physics community and the astronomers; there's a lot of technology out there that's simple to use. But to get back to what's the same and what's different: we need to interoperate with commercial cloud service providers. They're a different infrastructure, but fundamentally we need our green button to move data out, and there's a danger that if we rely exclusively on commercial infrastructure and don't interoperate, we will implicitly not have the green buttons.
That's not because anyone's heart is in the wrong place. But for those of you who have built software on Amazon, you tend to use a lot of Amazon-specific infrastructure, and even though it's supposed to be open source, most people's experience is that if you build exclusively for Amazon, you can't get the software out, even if it's open source — and this is true for a lot of the software we use in our community. So I think this lock-in issue, and the green button, are essential; it's really about interoperability at scale. The key question, as I've said, is not whether we're going to use commercial cloud providers, but whether we as a community will figure out how to do this on our own as well. In high-performance computing, the community spends a lot on a few very large machines and a number of smaller ones; we have not done that for large data, and the critical question over the next couple of years is whether we will do it as a community, so that we understand how to do this. It has to be a different model than in advertising, because we don't generate money — we generate science and we improve health — but we do have to figure it out. I've learned about this business by building something called the Open Science Data Cloud: satellite data, environmental data, social science data — about a petabyte of open-access data, and about a petabyte, growing to several petabytes, of controlled-access data — run through a not-for-profit. So you can do this; it takes time, and most things in software take a few false starts and five to seven years to get right. I want to talk a little about how you might analyze data at the scale of a data center, because I think that's what's really going to be interesting over the next few years. I keep coming back to this metaphor — it may seem silly — but in 1609 the telescope only got you about a 30x improvement in resolution, and the microscope in 1670 about 250x. Simulation science was born out of the Cray and devices like it, which got you between 10x and 100x, just as experimental science came out of instruments like the telescope and the microscope. What fascinates me, and what I'm going to work on for the next 10 years, is that we're at a similar place: either we're not as smart as the people who created experimental science and simulation science and we won't create a new kind of science, or else we will — because the resolution we're going to get from computing at the scale of a data center is, at the pessimistic end, 10 to 100 times greater, and at the optimistic end probably 1,000 times greater. If I give you — give your postdocs and your colleagues — an instrument with 1,000 times more resolving power and you can't make discoveries no one else has made, then you're probably in the wrong business; at least that's how I look at it. So as we build these instruments at scale, there's probably something new coming out of them — not the same things we've been doing at larger scale, but something new — and to me that's what's really exciting as we look at this data. The best articulation of this I've seen — you can go to YouTube and listen to it; it's about an hour — is a talk Phil Anderson, the Nobel Prize-winning physicist, gave at Northwestern several years ago. He asked the question: in his field, solid-state physics, as things increased in scale, did genuinely new phenomena happen, or did the same phenomena just happen more?
He gives this fascinating lecture arguing that more really is different in solid-state physics — that phenomena emerge at the statistical level, with spins and symmetry breaking, that aren't present at any smaller scale — and he gives a retrospective of all the new things in his career that happened at scale. So as we build data-center-scale computing and fill it with genomic, environmental, and self-tracking data, we have an instrument that's 100 to 1,000 times larger, and I think the fundamental question is: what new thing happens? There's a related debate in language translation — has anyone here worked in language translation? Okay. For many years DARPA took the view that the way to get smarter at language translation was to put more money in and build more complex models — I'm simplifying, but they spent hundreds of millions of dollars and built very complex models — while Google, very quietly, built extremely simple Bayesian models, but built them with the data on the internet, at scale. A few years ago that was controversial, but right now most people in statistical language translation would say you're better off with simple models at scale that get smarter with more data than with more complex models at small scale. The way I think of it: if you're in the data center business you talk about watts, kilowatts, and megawatts; if you're in the disk business you talk about gigabytes, terabytes, and petabytes; but if you're in the statistics business, you have to think about how to build large-scale models that work when you fill a data center with data. I've shown this next slide a couple of times, and I love it. Some scientists took the standard methodology used to analyze functional MRI, abstracted it, and tried it out on a salmon — the salmon was dead, so no salmon was harmed in this experiment. They put the salmon in an fMRI and showed it two sets of pictures, with a really good experimental design: one set of humans with angry emotions, and one set of humans with happy, positive emotions. Using the methodology that was standard in their field, they identified which voxels lit up when the dead salmon was shown pictures of happy humans. The first conference they sent it to rejected the poster, but they eventually got it published — you can Google it — and so that voxel is the voxel a dead salmon uses to identify happy humans. The point is that as you get more data, there's really nothing to keep you from being stupid; in fact, the more data you have, the easier it is to be stupid. But that's no reason not to analyze big data. There's going to be a backlash — there was a lot of talk about using search terms to predict flu; that was a first generation, and the models weren't great — good for PR, not good for predicting flu — but that doesn't mean the models won't get better over time. So there will be a backlash, but one of the things I'm enjoying doing is taking massive amounts of environmental data and medical records, overlaying them on genomic data with my colleagues, and seeing what we can build that we haven't been able to build before. There's nothing to keep us from being stupid, and we may in the end be stupid, but there are going to be people out there who figure out how to build these models.
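The dead-salmon result is, at bottom, a multiple-comparisons problem: test enough voxels and some of them will "light up" by chance. Here is a minimal simulation sketch of that effect; the voxel count, group sizes, and threshold are illustrative choices, not the numbers from the actual study.

```python
# Minimal sketch of the multiple-comparisons trap behind the dead-salmon result:
# pure-noise "voxels" compared across two conditions still produce "significant"
# hits if enough of them are tested without correction. Numbers are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_voxels = 50_000        # number of independent tests (voxels)
n_trials = 20            # "images" per condition
alpha = 0.001            # a typical uncorrected voxel-wise threshold

# Pure noise: the "salmon" responds to nothing, so the null is true everywhere.
happy = rng.normal(size=(n_voxels, n_trials))
angry = rng.normal(size=(n_voxels, n_trials))

# One t-test per voxel, comparing the two conditions.
_, p_values = stats.ttest_ind(happy, angry, axis=1)

uncorrected_hits = int((p_values < alpha).sum())
bonferroni_hits = int((p_values < alpha / n_voxels).sum())

print(f"expected false positives at alpha={alpha}: ~{alpha * n_voxels:.0f}")
print(f"uncorrected 'active' voxels: {uncorrected_hits}")
print(f"after Bonferroni correction: {bonferroni_hits}")
```

With 50,000 tests at p < 0.001, roughly 50 false positives are expected from noise alone, which is the salmon paper's point: more data makes it easier, not harder, to fool yourself unless the statistics are corrected to match the scale.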
The world's simplest classifiers — I've spent about 20 years of my life building these — are decision trees. They're very simple to build, so when Beowulf clusters came out, you built a bunch of decision trees and combined them in an ensemble. At the beginning I got heckled by the statisticians for a couple of years for talking about ensembles, but now high school kids build ensembles of models. The question is what that looks like at the scale of a data center, and we and others have done simple experiments asking what replaces an ensemble. Ensembles at the typical scale of a data center don't scale very well, because they tend to overfit — if I have 100,000 machines and build 100 models on each machine, that's millions of classifiers — but you can use randomization to send skeleton classifiers out and just train those, so there are ways to scale. I don't want to go into it here, but I think that's what's going to be exciting. In important fields like advertising, you analyze all the data every night; in the fields we're in — cancer, genomics, healthcare — we don't analyze the data every night, but with the right infrastructure you can simply ask what we could do to impact the field if we could. So the question I think we're about to be able to ask in bioinformatics is whether more is different at the scale of a data center when it's filled with genomic, environmental, and phenotype data, and I don't think we know the answer. I think it's a wonderful time to be entering the field, but that's the challenge in front of us: how do we do the modeling at scale, and how do we do the validation at scale? I have a couple of rules I use when I build statistical models, and one of them is that you can't do data analysis at scale unless you have an automated testing and validation environment — it's really about how we validate the models we build at scale. I want to end quickly. There are a lot of clouds out there, from DNAnexus to the Cancer Genome Collaboratory to CGHub, and I want to make this very concrete; this is the only slide about some of the things we've done. The traditional way to analyze, say, 100 terabytes of data was to hire staff, set up the infrastructure, get your environment approved, have fights with your security and compliance people, hire the bioinformaticians, set up the pipelines, download the data, and then begin analysis. That doesn't scale, because of cost, because there aren't enough bioinformaticians, because there just aren't enough people to do it. What we and others have done is make it so you can log on to one of these computing infrastructures with your existing NIH grant credentials, it checks what you have dbGaP access to, and you begin your analysis immediately — something that took a year can now be done more or less instantaneously. So the real questions are: what is the scale of this, how do we make it interoperate, and how do we build the models over it? In the last five minutes I want to talk about how we organize, and I like a model that worked in networking — most of you have heard me talk about this before — called the condo model.
The reason we in the research community have 10- and 100-gigabit-per-second networks is that we don't buy them from commercial carriers. Universities and research centers get together and lay their own dark fiber, and they've been doing this for 10 years; it changed the way we do scientific research. You can think of these as cyber condos, and these cyber condos interoperate with the commercial carriers, the commercial ISPs. It's the reason we could just do an experiment with a 100-gigabit connection between one of our facilities and NCBI, doing BAM slicing at scale over a 100-gigabit pipe; we wouldn't have been able to do that if we had bought that infrastructure from the commercial internet service providers. So you can think of genomics cloud condos, where groups get together, build these data centers at scale, interoperate with each other, and interoperate with the commercial providers — oh, okay, thank you. I think this is what we have to do, and we're doing one in the Chicago region called the Burnham Science Cloud; if you're doing one in your region, I would really like to talk to you, because I think this is what's going to happen over the next couple of years. Over the next few years we have to learn how to interoperate. The way the internet took off is that the large internet service providers peer — that is, they share data without cost — and that's what I hope will happen with our commons. If you're a researcher in Europe, you should be able to log on to whatever you're using, and we should be able to interoperate with EBI and the others. It shouldn't matter whether the data is at EBI or in our cloud or in the Cancer Genome Collaboratory or in CGHub; there should be transparent interoperability, with no cost, between the large commons providers, and we just need standards to do that. If you write the standards before building anything, you're going to fail miserably; these are what we'll be working on over the next few years. Let me recap, so we have about 10 minutes for questions. Where we are right now is that we can build an infrastructure for roughly 10,000 patients, but we can't make it interoperate with other infrastructures. What we need to figure out over the next few years is how to scale that up by a factor of 10 to 100, how to interoperate, how to do the governance, and how to do sustainability. From my point of view, the challenge is really not in the data-center-scale science — I think we more or less know how to do that; we just have to do it. The intellectual challenge is in how we analyze and model data at scale and bring in other, orthogonal data types at scale that we've never been able to use before. I think it's perfectly feasible: everyone has a smartphone, and very shortly smartphones are going to have a lot of sensors, and there's going to be a lot of data we can bring in. We can't align it perfectly — because of consent, I think we won't be in the position for a long time where the environmental data, the phenotype information from medical records, and the genomes are aligned; that is, we won't have one key that points to a genome, a phenotype, and an environment. But at scale there are only two possibilities: either the devil is in charge of data management, and the data comes to us in such a way that in the asymptotic limit nothing aligns, or else, as we get more data, even though we don't know exactly how it lines up, we can do statistical inference and build models across it.
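The talk doesn't say which statistical machinery would be used to line up records that share no key; one classical candidate is Fellegi–Sunter-style probabilistic record linkage, sketched below as a toy illustration. The comparison fields, match/non-match probabilities, and decision threshold are all invented for the example.

```python
# Toy sketch of probabilistic record linkage (Fellegi-Sunter style): score how
# likely two records refer to the same person when there is no shared key.
# The m/u probabilities and the decision threshold are illustrative assumptions.
import math

# m = P(field agrees | same person), u = P(field agrees | different people)
FIELD_WEIGHTS = {
    "birth_year": (0.95, 0.02),
    "zip_code":   (0.90, 0.001),
    "sex":        (0.99, 0.50),
}

def match_score(rec_a, rec_b):
    """Sum of log-likelihood ratios over the comparison fields."""
    score = 0.0
    for field, (m, u) in FIELD_WEIGHTS.items():
        if rec_a.get(field) == rec_b.get(field):
            score += math.log(m / u)              # agreement is evidence for a match
        else:
            score += math.log((1 - m) / (1 - u))  # disagreement is evidence against
    return score

emr_record    = {"birth_year": 1972, "zip_code": "60637", "sex": "F"}
genome_record = {"birth_year": 1972, "zip_code": "60637", "sex": "F"}

score = match_score(emr_record, genome_record)
print(f"log-likelihood ratio: {score:.2f}",
      "-> probable match" if score > 5.0 else "-> not enough evidence")
```

At the scale the talk describes, the hard part is not this scoring function but the blocking, the estimation of the match probabilities, and the validation over billions of record pairs.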
If I have a billion medical records covering a lot of people, and I have 10 million genomes, I should be able to statistically figure out how they line up, and that's why I'm excited about the future. We have at least one minute for questions — more than one — but let me make one request: if you are interested in anything I talked about, come see me sometime during this conference. Thank you.
[Moderator] Questions for Dr. Grossman — please come up to the mic.
[Audience] Hi. With the advertising model for large-data computing, you mentioned the methods are fairly consistent and one company usually owns the method that operates on all the data. An important part of bioinformatics analysis is iteration on the methods, and that can be hard to do, because methods for distributed computing have to be written carefully so they work on a distributed system. Do you think this kind of commons approach would support fast iteration of methods, and how do you manage new approaches and new data analysis methods as they come up?
[Grossman] So is the question that there may be less iteration in computational advertising — fewer interested parties in what the right algorithm should be — whereas in a research environment there are certainly a lot of opinions about the right algorithm to use? Yeah, I think it's a different trade-off space. In computational advertising there's a lot of money at stake, and people have strong opinions, whether right or wrong. I think there are similarities and differences, but we can steal certain techniques from them and they'll probably steal certain techniques from us — we'll probably acknowledge it and they probably won't. So there's a bit to learn, not a lot; sorry for not giving you a more satisfactory answer.
[Audience] In the bioinformatics space, the nature of the data is very complex, and there are things that haven't even started, like imaging and histology data for millions of patients — extracting features from them and integrating them with the clinical data. What is your opinion on the state of algorithmic scalability when you consider these other aspects of the data beyond genome sequencing?
[Grossman] I think you're right. But if you look at the progress we've made in statistical language translation by using relatively simple models over data at scale, at the progress in image recognition with internet-scale data, where you can fairly easily recognize faces, and at the way we're geotagging data and synthesizing data gathered on the web to create mosaics — there's absolutely no question about the complexity of genomic data and the other data around it, but I'm in a relatively optimistic position, because problems that were formerly intractable, like language translation and image recognition, we can now do with relatively simple techniques at scale. Hopefully this will appear soon: I worked with Andrey Rzhetsky and a few others, and we asked the question, if you could double the complexity of the model — defined carefully, in an information-theoretic way — or double the amount of data, which would leave you better off in terms of standard metrics for improving the model?
There are different regimes for different problems, but most of the time, over large regimes, you're better off just doubling the data. Biomedical data is clearly extremely complex, but I think we'll be in a position to make discoveries we wouldn't have imagined a few years ago.
[Moderator] We have more questions, and I think there's a lot more room for discussion, but it's 9:59 and we have a tight schedule, so Dr. Grossman, thank you again very, very much. John Weinstein from MD Anderson was originally scheduled to chair the session; unfortunately, a minor medical issue has prevented him from traveling, and his colleague Dr. Han Liang has graciously agreed to chair in his place. Dr. Liang, thank you.