Good afternoon, everyone. Okay, great. So I will be talking about the NCI Cancer Genomics Cloud Pilots, but before I do that, I'm going to talk about the Genomic Data Commons. So we'll start with that. The Center for Cancer Genomics, or the CCG, has three major projects in cancer genomics: TCGA; TARGET, which is the childhood cancer genomics project; and CGCI, the Cancer Genome Characterization Initiative. They're represented here. All three of these projects have their own data coordinating centers as well as their own data portals, so you have to go to three different data portals to get the data from these three projects. In addition, for the sequencing data, the BAM files for all three projects are stored in a second location, not the data coordinating center, and you cannot get that data through the data portals. For TCGA, TARGET, and CGCI, you would go to CGHub, the Cancer Genomics Hub at UC Santa Cruz, to get your BAMs. The TARGET and CGCI BAMs are also available at the NCBI Sequence Read Archive, the SRA. But if you want the higher-level data, while the BAMs are at CGHub or SRA, the VCFs and MAFs are again at the DCC. So I think this is confusing for users: all of this data, all generated by the same division, the CCG, is spread across so many different locations. The goal of our Genomic Data Commons, or the GDC, which is currently in production, is to unify these fragmented repositories at NCI, so that all of the cancer genomics projects we have at the CCG, and in the future hopefully all the projects we have at the NCI, will be available from a single repository. There will be a single sign-on for all users; you won't have to go to multiple places to get the data from multiple projects. Now, a secondary goal of the GDC is to harmonize the diverse standards that we also see in our data.
Here on the left-hand side of the slide, we have a representation of the BAMs being aligned to various different references: was it aligned to hg18, hg19, or hg20? This causes a lot of difficulty in analysis, of course. On the right-hand side, we have a Venn diagram for mutation callers with very little overlap. If your mutation calls come from two different pipelines that don't overlap very much, again, that's going to make analyses difficult. So a secondary goal of the Genomic Data Commons is to harmonize some of these data standards: all of the BAMs that go into the GDC will be aligned to one reference, and they will all go through the same mutation calling pipeline. The Genomic Data Commons is a contract awarded to the University of Chicago; the PI is Dr. Robert Grossman. The go-live date for full functionality of the GDC should be around late spring of 2016. That's when you can anticipate being able to log on to one location to get all of your data. And finally, it's not a commercial cloud. The photographs here show the actual data warehouse at the University of Chicago, and one of the benefits of not using a commercial cloud for the storage of this data is that it will be free to download the data. So, speaking of clouds, I'm going to move on now to talk a little bit about the cloud pilots. Right now, the standard model of computational analysis is, of course, that you've generated your local data and you keep it locally stored. If you're interested in TCGA data or other public data sets, you have to download those locally as well. You've got your locally developed software, plus any software you've pulled down from other locations. Everything, again, is local to your university or research lab, and as a user, you're doing everything locally. And this works fine until the data sets get quite large.
There are definitely some limitations to the standard model when you're dealing with much larger data sets. Assuming that we have approximately 2.5 petabytes of TCGA data towards the end of the project, the storage and data protection will cost about $2 million a year. Downloading the TCGA data at 10 gigabits per second, which is really a best-case scenario, is going to take a minimum of 23 days; in practice, it's probably more like two or three months to download all this data. At this point, really only large institutions have the ability to utilize all of this. Moreover, these data sets are going to continue to grow, right? We're going to have new projects after TCGA is closed, new projects at NCI, other projects, for example, TARGET and CGCI. So we anticipate that the data sets will just get larger and larger. The answer, of course, is to co-locate all of the compute and data in the cloud. What we want is to have everything on the cloud. We want to have our core data set; in this case, we're just talking about TCGA at this point. We want to have user-uploaded data, so instead of storing all your data locally, you would upload it to the cloud, which should be a much easier task than downloading the entire TCGA data set. You would also, of course, have some standard tools available in the cloud, as well as some user-uploaded tools. And all this computational capacity would be there in the cloud. As the user, you would come to the cloud, do all of your analysis, and then, at the end, download only your results, which, again, should be a much smaller data set than the original large data set, including TCGA. So here you can see the data generated from a project like TCGA.
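The 23-day figure can be checked with back-of-the-envelope arithmetic. Here is a minimal sketch, assuming decimal petabytes and a perfectly sustained 10 Gbit/s link with no protocol overhead (the speaker's best-case scenario):

```python
# Back-of-the-envelope check of the download-time claim.
# Assumptions: 2.5 PB = 2.5e15 bytes (decimal petabytes), and a
# sustained 10 Gbit/s link with zero protocol overhead.

PETABYTE = 10**15           # bytes
LINK_BPS = 10 * 10**9       # 10 gigabits per second

data_bits = 2.5 * PETABYTE * 8        # total bits to transfer
seconds = data_bits / LINK_BPS        # ideal transfer time in seconds
days = seconds / 86_400               # convert to days

print(f"{days:.1f} days")             # about 23 days at best-case throughput
```

Any real-world overhead (TCP behavior, shared links, restarts) pushes this toward the two-to-three-month estimate mentioned in the talk.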
It goes into the data coordinating center, where there's QA, QC, and validation, as well as aggregation of the data. Once that's done, we have the authoritative NCI data reference set. This data set is being loaded into the Genomic Data Commons, which is one of the reasons I mentioned it at the beginning of the talk. So all of this data is coming into the GDC; this is going to be the main repository for all of this data. The data is also being copied to our NCI cloud pilots, so the data set will be replicated in the NCI cloud pilots. Now, why is it replicated? Here you are as the user, and there are different reasons to use the different systems. You would come to the Genomic Data Commons, the GDC, if you were interested in searching, retrieving, or just downloading the data, because this is the authoritative data reference set. However, if you were interested in doing high-performance computing or analysis, instead of coming to the GDC, you would come to the NCI cloud pilots. You'd be able to upload your user data, upload your own analyses, and then run everything on the clouds. As I mentioned earlier, it's really only the largest institutions right now that can download three petabytes' worth of data and run analyses on it. So this is NCI's effort to democratize access to this genomics data. It's managed through CBIIT, where I'm from, in partnership with the CCG, and we're coordinating closely with the Genomic Data Commons. We've awarded three contracts: one to the Broad, one to the Institute for Systems Biology, and one to Seven Bridges Genomics. Our period of performance started in September of 2014, and it goes to September of 2016. If you're interested, this link takes you to a CBIIT information page on the cloud pilots. And our anticipated launch date is actually January of 2016, so fairly shortly you'll be able to start analyzing data using these.
So, some considerations for the cloud pilots. First, the designs must be released under a non-viral open source license, so everything's open. For extensibility, the initial clouds need to focus on a set of core data types that I'll mention in a few moments, but they must also extend to additional data types without major refactoring of the existing system, because we know that there will always be new data types coming up. As far as sustainability, we have cost assessments for operating at the current scale, which is on the order of 2.5 petabytes, and also at 10- and 100-fold increases in storage, compute, and usage, so we have an idea of what it would actually take to get to the next level once the data is that large. And finally, for security, these systems need to be FISMA moderate, all of the cloud providers need to be FedRAMP certified, and the three cloud pilots must all obtain Trusted Partner status. This is an NIH Trusted Partner agreement which allows the cloud pilots, or any other trusted partner, to share controlled-access data; in this case, controlled-access TCGA data. Finally, we have the open-access data versus the controlled-access data. The open-access data, everyone will be able to use and analyze at the cloud pilots. For example, the open-access data will even include CCLE data, so that you could do some analysis on BAMs even if you do not have controlled-access dbGaP approval. If you do have controlled access, you'll be able to utilize all the data. Okay, core data sets. All three awardees have to host a common set of core data that includes all DNA-seq BAMs; RNA-seq FASTQs and BAMs; SNP array CEL files; somatic and germline mutation calls, that'll be your VCF files and your MAF files; and we'll also have the clinical data. In addition to that, each awardee was required to host at least one additional TCGA data set, and they all chose to do more than one. Broad chose to add validation BAMs, miRNA-seq, and methyl-seq.
ISB is adding miRNA-seq as well as all level 3 data across all of TCGA. Seven Bridges has added whole genome and exome DNA-seq FASTQs, miRNA-seq data, and methyl-seq. Okay, so the project schedule and deliverables. Just about a month ago, we finished design-build phase one, which was six months of initial design and development. We've just started design-build phase two, nine months of completing the design, development, and implementation. And then right around January of 2016, we'll be in our nine-month evaluation phase, and this is where we're going to need the community members' help. We're going to be opening up the cloud pilots to allow the community to evaluate them. NCI will also be evaluating the cloud pilots, but we really need researchers in the community to try these systems out and tell us what is working. So what are some things that are common to all three cloud pilots? We have the core data sets, as I mentioned. Some use cases that everyone has to support are running pre-loaded pipelines on TCGA data; uploading and processing user data, so that you can upload your own data; and uploading and running custom algorithms, so you're not forced to use what's available on the cloud pilot; you can upload any algorithm that you choose. Finally, these pilots need to serve both biologists and bioinformaticians. The Common Workflow Language is being considered as the workflow language for all three pilots; that's not finalized yet, but we're looking at it. All three will be using Docker containers for improved portability and reproducibility. All three will be using the emerging GA4GH standards. And finally, they'll all have the same authorization and authentication process. This is, again, to make sure that only people with controlled-access dbGaP approval get access to things like BAMs. Okay, so I mentioned what the three have in common, but what makes them unique?
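To illustrate how the Common Workflow Language and Docker containers fit together for portability and reproducibility, here is a minimal sketch of a CWL tool description. This is purely illustrative: the tool choice, the image tag, and even the CWL version are assumptions (CWL was still a draft standard at the time of this talk), not anything actually shipped by the pilots.

```yaml
# Hypothetical example: a CWL CommandLineTool that pins a Docker image,
# so the same samtools build runs identically on any pilot's cloud.
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [samtools, flagstat]
requirements:
  DockerRequirement:
    dockerPull: quay.io/biocontainers/samtools:1.9   # placeholder image tag
inputs:
  bam:
    type: File
    inputBinding:
      position: 1
outputs:
  stats:
    type: stdout
stdout: flagstat.txt
```

A workflow engine that understands CWL would pull the pinned image, mount the input BAM, and capture stdout as `flagstat.txt`; because both the command and its environment are fully specified, the same description runs unchanged on any of the three platforms.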
We chose three cloud pilots because we wanted three different solutions; we want to see what works. The first is the Broad cloud pilot, led by Gad Getz. His collaborators are at the University of California, Berkeley, as well as Santa Cruz. They've chosen Google as their cloud platform, and some unique technologies they're using include ADAM and Spark. They've incorporated Firehose, and they've chosen to call their cloud pilot FireCloud. Each of the three cloud pilots has its own website, which I'll be showing on these slides as well as on the very last slide today, if you're taking notes. Each of them has a link that allows you to sign up for notifications and news on what's going on, and for the opportunity to evaluate the system once it is open. Okay, here's the FireCloud slide. FireCloud allows you to take your data, your tools, and your workflows and put them all in either a user or a team workspace, where you can securely track and manage your metadata, tools, job execution, and results. It also allows you to capture the provenance for each run. There are versatile job executors, as I mentioned earlier: Docker, ADAM, Spark, and Google Cloud Dataflow. And the data can be stored in these distributed stores, called the ReadStore and the VariantStore. From ISB, the PI is Ilya Shmulevich. His collaborators are Google and SRA International, so of course his platform is Google. He's using the Google Genomics platform as one of his unique technologies, and some of the tools they've incorporated include Regulome Explorer and GeneSpot. ISB is really focusing on interactive data visualization, exploration, and analysis. And again, the cloud pilot website here is cgc.systemsbiology.net; I'll be repeating those at the end. This slide shows some of the highlights from the ISB cloud pilot. The interactive tools allow you to explore all tumors or a subset of tumors.
They allow you to define custom cohorts and focus on specific molecular data types or platforms. Finally, we have programmatic access, including REST APIs for cloud storage, SQL-like queries through BigQuery, and the GA4GH API for Google Genomics. All three cloud pilots will have tutorials and ways to learn more about the system and how to use it. Finally, we have the Seven Bridges cloud pilot; the PI is Deniz Kural. There are no collaborators on this contract, but unlike the other two systems, they are using Amazon Web Services. Their unique technology includes the Seven Bridges Genomics platform, and they've already incorporated over 30 public pipelines that you can see at this URL. And finally, their cloud pilot website is sbgenomics.com/cancer-genomics-cloud. Here's their slide. At the Seven Bridges Cancer Genomics Cloud, you'll be able to wrap your tools; mix and match your data with other data, including TCGA; analyze the data; and then finally collaborate and discuss what you've discovered with your colleagues through the Seven Bridges platform. Okay, so I'm out of time, but we also have a cloud pilot workshop this evening for additional details. All three PIs are here to talk about their specific pilots. It will be from four to five, and repeated again from five to six, right here in Natcher Auditorium. So I encourage you to please attend if you are interested in learning more. And then finally, we just have the project teams that make all this possible, and once again, I've repeated the URLs. Thank you. Question: Can you ensure that the economics after two years would be affordable to most of the users? That's a good question. One of the things the cloud pilots need to deal with in the next nine months is how to allocate usage time for the evaluators, right? Right now, we're really just looking at the evaluation phase and how to fund that, so that the evaluators do not have to put money into the system.
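As a sketch of what the SQL-like access might look like, here is a hypothetical BigQuery query in the legacy SQL dialect of the time. The dataset, table, and column names are assumptions for illustration only, not the ISB pilot's actual schema:

```sql
-- Hypothetical schema: count cases per tumor type in a TCGA clinical table.
SELECT
  Study,                                         -- tumor type, e.g. BRCA, LUAD
  COUNT(DISTINCT ParticipantBarcode) AS n_cases  -- unique patients
FROM [isb-cgc:tcga_clinical.Clinical_data]       -- placeholder table name
WHERE vital_status = 'Alive'
GROUP BY Study
ORDER BY n_cases DESC
LIMIT 10
```

The point of hosting the level 3 and clinical data in BigQuery is that a cohort question like this runs in seconds across all of TCGA, without downloading anything.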
But I think we need to see what happens in the evaluation phase before we decide what happens beyond the two years. Question: Would it be beneficial to bring in more commercial cloud providers? Two out of the three are on Google; if there were other providers, competition could make things more competitive overall. Yes, I believe the systems will actually be able to run on other cloud providers if necessary, yes. Thank you. Just a quick follow-up, and maybe you can't answer this now, but you just said let's see what happens in the pilot phase to see how that carries forward. Is there a tentative plan as to what would happen in the spring of 2016 in terms of people getting access free of charge to those clouds to run their jobs? Yes, so. What's the limit of capacity, or what's the rough plan for that commitment? All three cloud pilots, I think, are going to come up with their own system, but everyone who wants to help with the evaluation will be given a certain number of hours or credits or something like that to utilize. And I think if you exceed or use up your credits, you could ask for more. And the helpers with the evaluation will be organized by your team? Yes, okay. That's right. All right, thank you very much. Thank you, Tanya. So our last speaker for this session is Andrew Gross. He's going to talk about paired tumor and normal analysis using TCGA data.