And so, just to be clear, my qualifications for giving this talk are that I recruit people who are interested in building clouds so that we can use them. I don't build them; I just try to use them. And for me, sequencing, well, people talk about it as drinking out of the fire hose, which doesn't sound appealing to me at all, and I would never try to do that. It's more like, in my misspent youth, watching all those I Love Lucy reruns. Remember when they were doing the chocolate thing? That's what it feels like at first. It's great. And then, oh my gosh, okay, it's coming faster and faster, and oh, I'm sick of this, I can't deal with it anymore, what are you doing to me?

So we have all this data; we're drowning in data. Everybody talks about the thousand dollar genome and the hundred thousand dollar genome analysis, and it's not really that bad, but as Rick pointed out yesterday, analysis is becoming a higher and higher portion of the sequencing costs. The processing costs that Daniel was just talking about may come down, but the analysis is still a big deal, and part of that has to do with all of the integration that needs to be done using the beautiful ENCODE data that Steven was talking about. Analysis is still the bottleneck, and we'll be the bottleneck for the foreseeable future. And when you think of the other parts of this, making the data accessible, you have to have integrated solutions that allow you to do the analyses that need to be done on very large amounts of data. The variant calls are not nearly as bad as the raw sequencing files, but variant calls for whole genomes on a million people will not be a trivial amount of data. So you need to have all that data accessible for analysis, and you need additional data accessible so that you can do the analysis on the sequence data well.
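To make the "not a trivial amount of data" point concrete, here is a back-of-envelope sketch. The per-genome variant count and bytes-per-record figures are rough illustrative assumptions, not numbers from the talk:

```python
# Back-of-envelope estimate of variant-call storage for a million genomes.
# All figures below are order-of-magnitude assumptions for illustration only.

VARIANTS_PER_GENOME = 4_000_000   # assumed variants per whole genome
BYTES_PER_RECORD = 100            # assumed size of one compressed VCF record
PEOPLE = 1_000_000

per_genome_bytes = VARIANTS_PER_GENOME * BYTES_PER_RECORD
total_bytes = per_genome_bytes * PEOPLE

print(f"per genome:      ~{per_genome_bytes / 1e9:.1f} GB")
print(f"million genomes: ~{total_bytes / 1e15:.1f} PB")
```

Even under these conservative assumptions the variant calls alone land in the hundreds of terabytes, which is far smaller than the raw reads but still well beyond what a typical lab server holds.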
So all of those ENCODE data, 1000 Genomes data, other kinds of reference data. And then when you want to deal with making it accessible for anything related to patient care, there are all these complicated mobiles that we have to deal with. But like Eric, I'm an optimist. I don't think this is any reason to feel particularly concerned; this guy's not going to fall. We used to worry about how we would deal with the deluge of data from GWAS, and that looks like nothing by comparison.

So what I'm going to talk about today is cloud stuff, not as the solution, but as part of a solution. It's certainly not the be-all and end-all, but it's something that we at the University of Chicago have thought can be part of the solution we need to put into place so that our investigators can play in this space in some competitive way.

So what is cloud computing? There are lots of definitions. Some people say that anything you do that's out there, not directly on your desktop or your laptop, is cloud computing, and that's a reasonable definition. Some people think of it more in terms of what you don't do anymore: you're basically outsourcing your IT support, your hardware purchase and maintenance, your software updates, and so forth. Certainly there are aspects of that that I find very attractive. And from the business perspective, it's really a new kind of business model for computing. The idea is that you scale your computing so that you have what you need when you need it, and you pay for that, but you're not paying for things you're not using all the time. You don't own what you can't use efficiently all the time. So these are all blind-men-and-the-elephant ways of looking at cloud computing.
And I think the keys to cloud computing are also kind of a blind-men-and-the-elephant sort of thing, but this is the part of the elephant that I feel. Part of the reason we can really do something with clouds now is a revolution in the virtualization of computing architecture, so that you can deal with security and connectivity almost at the level of the processor, as opposed to by the rack. That means you can expand and contract what users access very quickly, and serve a community much more effectively. A key part of what we think of as cloud computing is redundant, abundant data storage that is relatively inexpensive, though of course, since you don't own it, you pay for it one way or another. Another key part is that the computational resources to process the data are local to the data.

There is a lot of misconception in the scientific community, I think, about the security around clouds: there is a perception among some scientists that data in clouds are never secure. But you can build a private cloud to any level of security that you need. You can make HIPAA-compliant clouds for your electronic medical record data. You can have dbGaP-level security for omics research data in clouds. You can make private clouds with whatever security you need to have. Nothing is perfectly secure, but that's a given anyway. So you can meet the standards required for any level of security within a cloud the same way you can with any particular computer. And of course, there are many public clouds that people use all the time. Amazon, Google, Dropbox, those kinds of public clouds, can be very cost-effective and very reliable for large-scale but short-term usage. There are lots of places, for example, that are doing their variant calling in the cloud.
So they'll move data up, do their variant calling in the cloud, and pull the data back down. In part, reliability is the issue for them: their current university computing resources are not sufficiently reliable that, once they start the variant calling, they can be sure the computer isn't going to fail and force them to start all over again. It's just more cost-effective and more reliable to use the clouds, in part because of the redundancy that's inherent in them. Amazon right now is hosting the 1000 Genomes data and some other large-scale omics data to enable computations to be done in the Amazon cloud for people who want that option for these kinds of data analyses, and people are finding it very effective. Amazon is trying to lure people into this idea by making grants available to individuals, so you can apply to Amazon for a $10,000 grant and see how it feels to do your research in the cloud. Things that might take weeks on your local server or the cluster you've put together can be done much more quickly on Amazon, and these grants give you the ability to experiment and see how much it costs, so that you don't get surprised by the costs. But you should not wait too long to get your grant application in, because once the 1000 Genomes data and some of these other omics data migrated to Amazon and people learned about these grants, they became much more popular. This is the kind of note that grant applicants are getting now on a regular basis.

So I'm going to take you through just a little bit of Bionimbus, the cloud architecture that's being put together at the University of Chicago. Up at the top, out in the Internet, we've got public clouds, and, we can hope, scientific commons, accessible to investigators in a variety of ways. You can build HIPAA-compliant clouds, and Bob has done so.
That's Bob Grossman at the University of Chicago. But the main part of the architecture is more of a research cloud. There's a clinical data warehouse there, along with auxiliary big clinical data like imaging and so forth, and the clinical data warehouse converses with a data mart that's much more accessible to investigators. So that's all there within the structure. There are pipelines for the analysis of sequencing data, so the sequencing cores and individual labs that do sequencing can all feed into the pipelines that have been built for analysis. That can be variant calling for sequence data, or RNA-seq pipelines, so that you end up with omics data that can be stored and accessed by people with the right permissions. And then there are pipelines for the analysis of the omics data, so that you get results of omics studies of various types, also storable and available to investigators, and those results can of course talk with the other pieces of these data.

With a CLIA lab, you have at least a theoretical possibility of omics data created in a CLIA environment feeding into electronic medical records. You could imagine people being sequenced, the sequence data being held in this sort of omics sandbox, and, when clinically actionable variants are detected, those migrating into the electronic medical record. One of the reasons I've listed insurance and law among the challenges in dealing with this is that nobody really knows, or has tested, whether that sort of thing would allow a reach-through by insurance companies. We know that insurance companies can access electronic medical records in order to insure patients. They ask you, can we access your medical records? And people, if they want insurance, say yes.
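The staged structure described here, raw sequence data flowing through an analysis pipeline into an omics store that downstream pipelines then read from, can be sketched very roughly. Every name and stage below is a hypothetical toy, not the actual Bionimbus implementation:

```python
# Toy sketch of the store-then-analyze pipeline structure described in the talk.
# The "variant caller" just flags positions where a read differs from a tiny
# hard-coded reference; the downstream stage summarizes the stored calls.

def variant_calling_stage(reads: str):
    """Toy caller: report (position, ref_base, observed_base) mismatches."""
    reference = "ACGTACGT"  # made-up reference sequence
    return [(i, ref, obs)
            for i, (ref, obs) in enumerate(zip(reference, reads))
            if ref != obs]

def downstream_analysis_stage(variants):
    """Toy downstream analysis: summarize the stored variant calls."""
    return {"n_variants": len(variants)}

# Each stage's output lands in a store, and the next stage consumes it,
# mirroring the pipelines-feeding-stores-feeding-pipelines architecture.
omics_store = variant_calling_stage("ACGTACGA")   # differs at the last base
results = downstream_analysis_stage(omics_store)
print(results)  # {'n_variants': 1}
```

The point of the structure is the decoupling: the sequencing cores only need to know how to feed the first stage, and investigators with the right permissions work from the stored intermediate data rather than from raw reads.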
So it's one thing for the clinically actionable variants to be there, but does the ability to bring that data over allow insurance companies to reach through to things that haven't gone in yet, things that are not considered clinically actionable but might be of interest to insurance companies? You'd think not, but nobody really knows. So there are issues around law and insurance that make it hard to know quite how to set these things up and use them.

In addition to these data marts, they've set up what they call drop boxes. All BSD faculty, our Biological Sciences Division faculty, get drop boxes in this space. The drop boxes can be used just as drop boxes, or, as we use them, for storing data that we're going to run through some of these pipelines. We can go into variant calling with data we get from sequencing centers, we can go into the analysis pipeline that's already set up, or we can set up our own analysis pipelines, and we've done all of that. We can go directly into public clouds for analysis if we want, and if there's a scientific commons, we can use that. And the space is all scalable: when we're analyzing a bunch of image data together with a bunch of omics data, we can scale up our workspace to do it, and then scale back down when we're done. It doesn't mean that this is all centralized. I still have a cluster, we still have a server room, and we still do a lot of work on it, as we always have, and those computers can talk well to this cloud. A key thing for us is that this cloud contains a lot of public-access databases, so that when we're developing our pipelines, we can do computes over ENCODE data and over 1000 Genomes data. It's all inside the cloud to work with, and that speeds things up substantially compared with going out and fetching things in. Back to what Peter said, you really need this.
A lot of the statistical genetic analysis of sequence data for common phenotypes is going to revolve around bringing in information on function, and then, when you discover things, iterating back and adding more information about function. So I have to, as I say, mention my colleagues: Bob Grossman, who is the one who builds these clouds; Ian Foster, the head of the Computation Institute at the University of Chicago and a very talented computer scientist; and Kevin White, our genomics guy, who's pushing a lot of this. One of Ian's big things is developing new ways of moving really big data, and we're able to take advantage of that to some degree. Bob builds the clouds, Kevin's down here pushing things forward, and I sit here and like to use all of it.

Good. Thank you, Nancy. Questions? I have one question to start us off. It seems to me that in entertainment, finance, and probably the energy sector, they have much larger data problems, data sets, and data-analytic challenges than we have. Do you think that as a community we're doing enough to learn from them? How do we tap into that, avoid reinventing the wheel, and learn from their mistakes?

So, you know, that's definitely true, and that's the world that Bob comes from. He's a computer scientist; some of his data-mining algorithms are used by credit card companies to detect when somebody is using a credit card fraudulently. He's worked in some of those areas. And yes, they have huge amounts of data. In the physical sciences, too, huge amounts of data are collected and have to be moved, whether it's geology or atmospheric science. So it's not that these problems are unique to us. It's coming at us very fast, though; some of those other fields got to build up to it more gradually.

On the issue of security, and making your local cloud secure and HIPAA compliant: how would you define a local cloud?
Do you have to have a password to get into it? What makes it local?

Well, how do you think of dbGaP? The level of security around the data stored in dbGaP is comparable to what we have for the main private cloud. The HIPAA cloud has additional layers of security; that's why it was shown separately.

And then you certify people to have access to the private cloud?

That's right. And within this space, you can set up different kinds of access for different people. It may be that everybody in the cloud can access omics results, but you need special permissions to access the clinical data warehouse. Everybody can query the data mart: how many patients at the University of Chicago have been seen in the past two years with Crohn's disease? That's a simple query; it doesn't require any PHI, and anybody can ask it of the data mart. So there are definitely different levels of security, and people can be granted or denied access to different parts.

So is there a difference in who actually owns the hardware in a local cloud? Amazon would be a cloud, and there are policy considerations underway at NIH about whether the Amazon cloud is appropriate storage.

That's right. Our private cloud, the University of Chicago cloud, is owned by the University of Chicago, managed by the University of Chicago, and run by Bob's IT group; he's director of clinical research informatics. But in terms of unlimited capacity for really large projects, an Amazon cloud, or a Google cloud, or whoever provides the service, could be an appropriate venue.

With appropriate data?

Yes. And I think the other issue, taking a step back, is what you're going to partition off to put into that. Everything in the world isn't going to go into a cloud. You make a decision that you're putting the 1000 Genomes in.
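The Crohn's disease example is the classic de-identified count query: the data mart returns only an aggregate number, never patient-level fields. A minimal sketch using an in-memory SQLite database; the schema, table, and column names are hypothetical, not the actual University of Chicago data mart:

```python
# Sketch of a PHI-free data mart count query, using a hypothetical schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE encounters (
    patient_id INTEGER,   -- de-identified surrogate key, not PHI
    diagnosis  TEXT,
    visit_year INTEGER)""")
conn.executemany(
    "INSERT INTO encounters VALUES (?, ?, ?)",
    [(1, "Crohn's disease", 2011),
     (1, "Crohn's disease", 2012),   # same patient, second visit
     (2, "Crohn's disease", 2012),
     (3, "asthma", 2012)])

# Distinct patients seen with Crohn's disease in the past two years:
# the query returns an aggregate count only, no patient-level rows.
(n,) = conn.execute(
    """SELECT COUNT(DISTINCT patient_id) FROM encounters
       WHERE diagnosis = 'Crohn''s disease' AND visit_year >= 2011""").fetchone()
print(n)  # 2
```

Tiered access then amounts to which queries a given user class may run: anyone can run aggregates like this against the mart, while row-level access to the warehouse behind it requires special permissions.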
Or, for instance, TCGA, the Cancer Genome Atlas, has established what's called a trusted partnership, a new kind of relationship NIH has with a particular group committed to house, archive, and serve that data, as opposed to dbGaP or NCBI doing it. And there, UCSC, for example, can obviously use a cloud to hold the TCGA data, and the requirements for security and for the veracity and stability of the data are clearly there. But I think that as we think about cloud computing, we're at a very early stage of this. Many universities are talking about having their own private clouds, and there are user groups talking about creating their own private clouds even in the academic venue: people sharing their data from non-genetic studies, or the metabolomics groups talking about creating these small clouds.

The modENCODE project, for example, absolutely used a cloud; it used this sort of architecture for the storage and analysis of a good part of the analyses done for modENCODE.

Would it be a recommendation, as an outcome of this meeting, to consider a cloud model?

More than a recommendation. I didn't say that we could do it without the cloud.

Yeah, it would be hard to imagine sequencing this number of people and phenotypes without a cloud involved. Your mind explodes thinking of that amount of data without a cloud. Now, I come back to Maynard's comment that one person can do a lot of analysis, and that's absolutely right, and with this sort of architecture they can do it from an iPad or a phone. Really.

Nancy, that was a great presentation. And the fact that you had the HIPAA cloud there on the left was terribly important to one thing that's kept coming up over and over again, which is the notion of re-contacting.
So if there is going to be re-contacting as part of this, as we learn through our experience with NHLBI cohorts, you need a whole other informatics dimension that has to do with privacy and confidentiality: what is the consent status of the participants in real time, not just at one point in time? What's really good is that you seem to be solving that problem, at least at the University of Chicago, or at least thinking about solving it.

Well, Bob has taken the security issues very seriously. To date there hasn't been a lot of research activity in his HIPAA-compliant cloud, but it's there. There is some use of it, but it's much more of a pain to use precisely because it is compliant and the security issues are more serious.

So in the HIPAA-compliant cloud, is there any move afoot to try to integrate aspects of the EMR for patients who consent to this?

Yes, through our biobanking effort. You have the clinical data warehouse, which is basically a mirror of the electronic medical record, updated roughly daily, and the biobanking effort, which is creating omics data that links to some of those patients. So you have the ability, with an honest-broker sort of system, to put omics data together with clinical data.

So, Stephen again; I have one last comment. Nancy, can you speak to a more theoretical question? If different groups have different clouds with slightly different security, how would you get to the next step of meta-analysis and connections? You're probably only as good as your weakest, or least rigorous, security for any given cloud. We're new at this, but I imagine it's already started down the road. Can you talk about how clouds connect? It may very well be that Harvard, Chicago, and WashU each have their own clouds and somebody wants to put those three together. Is that real easy, or is it not so easy?
Well, some of these have already been linked up with pretty big pipes. I'm sure some of the sequencing centers have that capability to NIH, for example, and we certainly have some big pipes to Amazon. Another point is that the pipelines for analysis that you set up in one cloud are very easily migrated to other clouds, which is a big help. So for meta-analysis it's not necessarily the case that you have to migrate the data; you can migrate the pipelines and put things together after the fact.

My comment is: within NIH, are there conversations occurring on confidentiality and data safety criteria in cloud computing, particularly commercial cloud computing? It seems this needs to be happening now, so that we're ready when the data are coming out.

Actually, the gentleman to your right. Yeah, I'm on the senior oversight committee, and now that Eric is the chair of it, this is clearly part of the discussion. This issue of trusted partnerships has come into place, this mechanism that connects NIH to an outside academic organization that will house, in cloud computing, very important data sets: TCGA, 1000 Genomes, and the like. I think there has been a very close inspection of that, and we are moving that way. There are some formidable barriers with respect to interpreting some of the very difficult language that we use for patient privacy and patient protection, but I think I can confidently say we're getting to the point where it will be tractable.
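The migrate-the-pipeline-not-the-data idea is what makes cross-cloud meta-analysis workable: each cloud runs the same analysis locally and exports only summary statistics, which are then combined. A minimal sketch using a standard fixed-effect inverse-variance combination; the site names and effect estimates are made up for illustration:

```python
# Fixed-effect inverse-variance meta-analysis over per-site summary statistics.
# Only (beta, se) pairs ever leave each site, never individual-level data.
import math

site_results = {          # hypothetical per-site effect estimates (beta, se)
    "harvard": (0.12, 0.05),
    "chicago": (0.10, 0.04),
    "washu":   (0.15, 0.06),
}

# Weight each site by the inverse of its variance (1 / se^2).
weights = {site: 1.0 / se**2 for site, (_, se) in site_results.items()}
total_w = sum(weights.values())

pooled_beta = sum(weights[s] * beta
                  for s, (beta, _) in site_results.items()) / total_w
pooled_se = math.sqrt(1.0 / total_w)

print(f"pooled beta = {pooled_beta:.3f}, se = {pooled_se:.3f}")
```

Each cloud only needs to run the identical pipeline and ship back a few numbers, which sidesteps both the bandwidth problem and, to some extent, the data-access problem, though as discussed below the one-to-one access policies still constrain what can be shared.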
There may be certain requirements, and that's what I was asking Nancy about a bit in the background: different clouds may have different levels of security and different impositions on what can and can't be shared, particularly if it's NIH-supported data, because the current data access policy is a one-to-one relationship. Someone gets the data from dbGaP at Chicago, and if somebody else has it at Northwestern, each of them has to obtain it separately before they can put it together, because Chicago can't hand the data over to Northwestern. So you can see where this has to be worked out, but I think there's a lot of effort to solve it, because it's upon us.

That feeds into my underlying question. This is usually cloaked in patient confidentiality, but often what's driving it is attorneys' fears of legal responsibility. The NIH is outstanding at pushing legal responsibility onto the universities and the investigators, and how is that going to play out if we go the route of a commercial cloud?

Perhaps I could comment, rashly. Anyone in the room who knows more about this than me, which is most of you, correct me. But one of the reasons we insist on having a one-to-one relationship with the universities is exactly so we don't force Chicago to be responsible for Northwestern, or for Emory, or for anyone else. So that is one instance where we actually don't push that onto others. But then again, there should be some way that there could be sharing without having to go through the entire process again.
But there is another issue in terms of cloud computing that NIH probably does need to deal with, and that's the fact that private clouds can get funded through NIH grants, but appropriate use of public clouds is virtually impossible, even when it's the most cost-effective way to do something. For example, the last time PharmGKB had their grant proposal renewed, there was some discussion with NIH about going to a cloud computing model. They could show it was much more cost-effective than all the IT they otherwise had to establish, the hardware and so forth, and yet there just wasn't a model to fund the enterprise that way. When we think about some of these big results databases that people are talking about, it may be more effective to have them running in a public cloud. So it's something to think about: the way NIH does funding might have to be rethought given these new kinds of computing models.

All right, thank you everybody. We have a 15-minute break, and when we come back we'll attack the problem of selection and participation.