All right, good morning everyone. Just to give you a little bit of introduction to myself, my name is Christina Young and my background is in mathematics and computational biology. I started working at OICR several years ago and stayed there for eight years doing different kinds of analyses in pancreatic cancer and colorectal cancer, and I also got very much involved in the PCAWG project that you heard about yesterday; I'll dive into more details on that. Then last year I joined the University of Chicago, and I'm now leading a project called the NCI Genomic Data Commons, so I'll tell you about that as well, because these are both very big data sets for cancer research. So we'll talk about PCAWG in a bit more detail, and then, because I was doing the technical work in PCAWG, I want to tell you about some of the lessons we learned in some very tough ways. Hopefully, when you plan out your own cloud analysis, you'll have heard about these lessons and won't run into the same kinds of issues. We'll talk about the data sets in ICGC and the Genomic Data Commons, and there are also multiple cloud resources available for your big data analysis, so I want to touch on that. And at the end of module 11, actually after Brian's lecture, we'll do a tutorial so that you can write your own Common Workflow Language wrapper around a Docker container and actually run one of the PCAWG workflows. So, ICGC: I believe Francis already gave you a bit of an introduction about how this project started almost 10 years ago with the goal of collecting 25,000 tumors with matched normals. The project is currently wrapping up, and there are altogether 107 projects from 17 jurisdictions that have committed more than 29,000 tumor genomes, spread really all around the world, with the exception of Africa, which is not participating. For ICGC, having this much data requires a lot of data coordination, and planning was done way ahead, because having just sequencing data doesn't help you; you need metadata and clinical data. As you can imagine, with that many projects the clinical data is very diverse: you have different terminologies that may actually mean the same thing. So the DCC was responsible for harmonizing the data and coming up with a data dictionary and controlled vocabularies so that people can submit clinical data in a very uniform way. Data submission is important, and validation is even more important: you've got to make sure that bad data doesn't get into your system, otherwise it's trash in, trash out. Annotation, so annotating your genomic variants and your clinical data, is all important. ICGC also has a web portal that you have probably seen some screenshots of already. It allows you to search and download data, and we also have controlled access: there is a Data Access Compliance Office (DACO) where people submit applications to get access to data. One thing about ICGC is that once you get approved by DACO, you can access all ICGC data; it's not just on a per-project basis. On the portal, there is continuous development of analysis and visualization tools. Recently, the ICGC data have also gone onto the cloud on AWS, so there is now some integration with cloud management as well. And of course, a help desk is very important to handle all these requests. So there are a lot of data types in ICGC. Clinical data, as I said, is important. There are also mutations, both germline and somatic, and the somatic mutations are open access on ICGC.
So you don't have to get DACO approval for those; germline variants, of course, can identify a patient, so those are controlled access. We also have copy number, structural variations, gene expression, microRNA, exon, DNA methylation, and protein expression, so very diverse data types within ICGC. And this is how the data has been growing over the 10-year period. We are now at 20,000 donors in total that have been submitted to the DCC. Some of them don't have molecular data yet, meaning that the submitters have sent in the clinical data first, but the sequencing variants haven't come in. But we do have over 17,000 donors with molecular data. There are two more releases of the data to come, and we are aiming to hit that target of 25,000. As part of the ICGC effort, there was a benchmarking group. The genomic variants submitted to the DCC are analyzed by the project owners themselves, so they use their own pipelines; they may use different alignment algorithms, and they definitely use different variant calling algorithms. This benchmarking working group did an analysis and found that there is actually a lot of variation from method to method. So when you try to combine multiple data sets, say multiple lung cancer data sets, to find some meaningful variants, you're hoping that by combining you get better power, but you actually see variation that is due to the technical variability of the pipelines, not true biological variation. So in order to really enable people to use these large data sets and combine multiple cohorts, we need to do uniform analysis across the data. This is the motivation to basically reanalyze all the data: starting from raw sequencing reads, we completely realign and use the same variant calling pipeline for all the data. This had already been done by TCGA; that was the pan-cancer analysis on cancer exomes, published back in 2013. So for ICGC to do something new, we decided to do a uniform analysis on whole genomes. This is the Pan-Cancer Analysis of Whole Genomes; we call it PCAWG. At the start of the project, we called out to the ICGC members to say we wanted to collect 2,000 whole genomes; that was our goal. And we had an overwhelming response; people were very willing to contribute the data they already had on hand. But it's not as easy as just saying, hey, here's my data. There's a lot of work, and they were willing to do it, because we came up with standardized metadata; they had to re-header their BAM files with the correct naming conventions. It was a lot of work for them just to prep the data to submit. But in the end, we had over 2,800 whole genomes, which was really great. So the project was launched in 2014, almost four years ago. And of course, with whole genomes we can get a lot more information than with exomes. We can look at non-coding regions, regulatory elements, genomic structural variations, and pathogens; we have viruses inserted into some parts of the human DNA, and we can now look at those as well. Mutation signatures are another interesting topic, and then of course we also look at driver pathways and driver genes. So those were the goals set out back then. This consortium is organized with a five-member steering committee, and then there are 16 research working groups. Basically, at the start of the project, we made a call out to consortium members to submit abstracts: what would they do with this whole-genome data set when it became available?
So we received over 130 abstracts. These were organized into themes, and that's how we got 16 research working groups. But before the research working groups could do any research, there was a lot of technical work to be done, so there was a technical working group responsible for uniform alignment and variant calling. We also did a lot of data curation and data checking before disseminating the data. This technical working group is the one I was very much involved in. These are the 16 research working groups. I won't read out each of them, but there may well be a topic you're interested in, and I would say look out for the papers coming out of this consortium, probably later in the year. So those 2,800 tumor-normal whole genomes come from 48 projects and 14 jurisdictions. You probably noticed that the projects coming from the United States are actually TCGA samples; all the TCGA whole genomes were also contributed to this project. There are many different types of cancer here, and it was basically a really generous contribution of data sets from different groups. This gives you a sense of the number of donors in each primary site. Pancreas actually came from four projects: some of them are adenocarcinomas, some of them neuroendocrine tumors. So if there is a specific disease you're interested in, this gives you a sense of what data are available. Initially, when we set out to do this, we basically said, OK, we'll collect 2,000 donors. We received unaligned BAMs from the data submitters, in general about 150 gigabytes per whole genome, which is quite large; about 25x coverage is the minimum that we required. Actually, there's a typo there, because this amounts to 800 terabytes of unaligned BAMs. So it's a lot of data. As I said, the data owners made a lot of effort to prepare the reads in the format that we wanted, and we actually provided them a tool, which we call the PCAP tool. This allows them to basically validate their BAMs locally. It's a tool they had to set up locally, and it's Perl-based. So they get a chance to validate the metadata in their BAM file, and if the data is correct, the metadata is extracted in the form of an XML for upload. At the time we were using a server called GNOS, which is very similar to the CGHub server, if you ever used CGHub a while back. It basically allows people to submit their XML with the metadata in it, and then the BAM file. This makes sure that the data comes to us clean, so it's a very important part. Yes, a question: what is an unaligned BAM? Basically, it's like a FASTQ, but in BAM format. It's raw sequencing data; it hasn't been aligned, and it has quality scores in it and everything. We then aligned those BAM files with BWA-MEM. Don't forget, when you align something, you basically double your footprint, so we had another 800 terabytes of aligned BAMs. Alignment started in August of 2014, and it was basically a continuous process; we aligned the data as it came in, because data comes in asynchronously. You can't get everyone to submit their data at the same time. So as data came in, we aligned, and then as we finished alignment, we did variant calling. I think George mentioned to you that we ran three core pipelines: the Sanger pipeline, the DKFZ/EMBL pipeline, and the Broad pipeline.
At the time, we didn't have a very good idea of how much resource these pipelines would take or how long they would run; it was very vague. So we decided we just had to do small batches and then figure out how much resource we needed. Along with this effort, at the very start, we also went ahead to do validation. We picked out 63 donors; the picking was based on the availability of DNA and also on a material transfer agreement that could be made between institutes. So 63 donors were selected, and we basically ran the pipelines on those 63 donors, which gave us a set of variants. Then we sent that set of variants to the lab so they could design a targeted panel and deep-sequence those tissues to do the validation. If some of you have been involved in targeted sequencing validation, you know it actually takes a while to design the target panel and have your panel ordered and made, so the validation process took nine months. And you can imagine, during this time we had already started three algorithms for variant calling; we didn't have a good sense of the accuracy, and we had to wait until the validation came back nine months later to really know whether we had been running really good pipelines or really crappy pipelines. I was very nervous the entire time. Then, using the validation data, we could use a consensus strategy. With machine learning methods, you apply the methods to learn from the validated results which features these variant calling pipelines are good at, and then you can estimate your accuracy and also decide how to combine the calls from the three pipelines into one set of consensus calls. So that was the roadmap at the beginning, and of course we discovered a lot of challenges along the way. As I said, having 800 terabytes of raw data and then another 800 terabytes of aligned reads is a lot of data. No single institute at the time could host all this data. We also had to think about bandwidth: if we hosted this data just in North America, then transferring the data from Asia to North America would take a long time. So we decided on a strategy of having multiple data centers, and again we called out to our consortium members: who could contribute the compute resources to host this data? One subtle thing that people may not be aware of is that there was no dedicated grant to support the PCAWG project. This was just volunteering from the different centers; whatever they had, the data, the knowledge, or the compute resources, they donated it to the project. We ended up with very good results, and I'll show you who the data centers are. But we also had to do a lot of benchmarking: we had to make sure that the alignment we do at one center actually gives the same results when it's done at a different compute center. And when your data is spread across multiple places, how do you track it? So metadata is very important, as I cannot emphasize enough, and we also used Elasticsearch as an indexing tool; I'll talk about that a little bit later. When we started off, we had seven data centers. The idea is that these centers contribute storage for us to upload the data to, and they also have compute co-located, so once the files are uploaded to a center, the center can do the alignment itself. You minimize the movement of data.
The blue lines show where the data is uploaded depending on the region. Obviously, if you're in Asia, we asked you to upload to Korea or the University of Tokyo. In Europe, we have Barcelona, Heidelberg, and London, three centers hosting. The TCGA data is special because it can only be hosted at the University of Chicago, which is a trusted partner of the National Cancer Institute; the TCGA data, the American data, cannot leave American soil, in a sense. And then we also have the NCI cluster at UCSC; again, that is a specialized center to deal with the TCGA data. We had a special case, though: Australia at the time had very slow bandwidth to any of the data centers, so Australia ended up shipping 40 pounds of hard drives to Chicago so that Chicago could do the upload for them. That has changed since, because AWS has now set up a center in Australia, so bandwidth there is much better. Okay, so we started alignment, and we found that it took about 2,000 core-hours per donor. This was done over a couple of months, and when we finished alignment and went on to variant calling, we thought, oh, maybe variant calling would require about the same amount of resources. But we were very wrong, because when we tried the pipelines there were multiple challenges. The pipelines were originally running in the production environment of the institute that donated them. They work well there, but now you're pulling those pipelines out and trying to make them stand-alone so they can run in different compute environments, and that was quite challenging. For example, the Broad pipeline was very integrated into their internal production tool called Firehose, so pulling out all those components and getting them to be stand-alone was not easy. There are also more specific issues: the Sanger pipeline, for example, was very specific about how they name their read groups. They were expecting certain things, but now we had data from multiple groups, so those assumptions were no longer valid and they had to make changes as well. And don't forget, the pipelines were not completely ready when we wanted to run variant calling; they were actually still in development as we were trying to run them. We noticed that, okay, we need more compute power, so we made a call out to some additional consortium members asking who could donate more resources, and I'll show you the outcome of that call-out. But there were some key changes regarding cloud computing that happened during this project. When we first started, data was actually not allowed to go onto commercial clouds like AWS or Google. But some of our steering committee members, especially Lincoln Stein, lobbied the NCI, trying to convince them how beneficial cloud computing would be and how much we needed it. There was an article, I believe, in your reading material as well; that's one of the convincing arguments he made. Then, in March 2015, the NIH finally updated the cloud policy to say, yes, TCGA data can go to the cloud. If you apply for TCGA access, there's actually a checkbox in your application asking whether you will use it on the cloud or not. And ICGC also changed its policy to allow us to use the cloud. That changed the project a lot, because now we could have a burst of compute. And with all those changes, we also got an agreement with AWS.
They are hosting the ICGC data, not just PCAWG; they're willing to host the ICGC data set as part of their public data set program, which is sort of a community contribution they make. So we managed to put 1,400 PCAWG donors in the AWS bucket, and this makes it a lot easier for us to do the compute. We also switched to using Dockers to run our pipelines. Our first few pipelines weren't Dockerized, but having the Dockers really helped us go from one environment to another. So this is a slide you may have seen already: these are the different components and the three variant calling pipelines to call SNVs, indels, structural variants, copy number variants, and germline variants, and this is basically the number of core-hours you need to run each pipeline. We knew these were very ambitious to run on 2,800 donors, so we asked our consortium members for more resources and got a really great response. The yellow ones are the additional resources we added to the project, but keep in mind that these resources only have compute; they don't store the data locally, so they actually have to pull from the data centers that have the data. One thing I forgot to mention is that after the alignment phase, we actually started synchronizing the alignments between data centers, because that gives us more flexibility on where to run the data. At that point, we could also delete the unaligned raw data, because we can do variant calling using just the aligned reads. Just one more point: we did use AWS and the Seven Bridges platform. Those are on AWS, which already has the data, so it's great to use it there. OICR Toronto, that's the Collaboratory that George Bryan mentioned; that's where all the ICGC data is as well. And this is a very simplified view of the progress of the project. You notice this is BWA-MEM; we started running that back in the summer of 2014. You see a lag here because we finished data tranche one, which was about 2,000, and then people submitted more data; we waited a while, and then we had a burst to finish all the alignment. The reason you see a stagger in the start of these variant calling pipelines is that they were still under development at the time. We could start up the Sanger pipeline really fast, but it took some time for DKFZ/EMBL to finish development, and Broad started even later. And the reason you see these dips is that throughout the process we discovered issues, QC issues with the data, so we had to pull it back. The good thing is those were not major issues; it wasn't a complete rerun, but the data had to be post-processed and then uploaded back to the server. That's why you see these dips. We also have an OxoG pipeline: after all the variant calling, there are oxidative artifacts in the reads, so we have to filter some of the variants using this algorithm. OxoG is actually a published tool from the Broad. In fact, to get good quality data, we had to do a lot more than just simple filtering. We disseminated this data, the variant calls, to the working groups, and we asked them, please be very critical, look at this data, tell us which samples could be bad. It was very interesting, because people looked at this data from different angles: some people looked at it just from an SNV standpoint, while some looked at it from a structural variant standpoint.
What they discovered meant we had to eliminate some of the samples. Some of this was simply because we didn't have any clinical or histological information, so you don't know the disease type and have to get rid of it. We also found that sometimes the tumor is contaminated with normal DNA. We had a cutoff of 4%: if it is more than 4% contaminated with normal DNA, we have to eliminate it. And interestingly, sometimes we found that the normal is contaminated with tumor DNA, in which case you basically have a much tougher time making variant calls; the cutoff there was set to 15%. We also had some samples with excessive numbers of mutations that are already in dbSNP; those, again, we believe are contaminated samples. And we also found some contamination, either from cDNA or sometimes from mouse. None of these samples are xenografts, but who knows, when you're sequencing, your machines or your bench could be contaminated with other samples not intended for the project. We also see some... yes, a question. So it's a mix-up of your samples, right? If you're processing your normal and your tumor at the same time, maybe a pipette tip gets a little bit in there. Yeah. So it's not because the normals are run adjacent to the tumors? No. The normals are mostly from blood, so it's not from the tumor tissue itself during the extraction process; it's probably during the preparation, the library prep. Yes, it's library prep. We also looked at some extreme outliers based on the QC working group, because if they see that your paired reads are mapped to multiple chromosomes at a very high level, more than is biologically possible, that's probably a QC issue as well. So in the end, we eliminated 6% of the samples; for a large-scale project, this is pretty standard. And there are some samples that we listed as gray because they have a low level of contamination; some filtering managed to rescue the sample, but if you think you have a discovery in there, you'd better double check, because it could be an artifact. Along with these samples, we also have RNA-seq data. So this was all made ready for downstream analysis by the 16 research working groups. After the samples were eliminated, we actually had to go through another round of filtering and annotation. As I mentioned, oxidative artifacts were eliminated by OxoG. We also looked for PCR template bias and strand bias using a panel of normals. We tried to eliminate germline leaks, and we also looked at SNVs that overlap with germline calls in the 1000 Genomes Project. These are just ways to improve the quality of your data. Another good example is when you see chromosome Y calls in a female donor; that's probably an alignment artifact, so those are eliminated. Annotation is important because we need to help the machine learning algorithm make consensus calls, so we annotate things like signature artifacts and also any enriched SNVs. Yes, a question about the previous step: how many false positives were removed versus true positives? Yeah, so unfortunately we weren't able to assess that, even using our validation data, because for some reason what we validated was not filtered out; when we tried to run the same filtering algorithms on the validation data, not much was removed, so we didn't get a chance to assess that. So here's the consensus strategy.
After cleaning up all our data, we looked at how to combine the calls from the three pipelines. Down at the bottom, which is a little hard to read, are the individual pipelines, so this is DKFZ, this is Sanger, and this is "two-plus", meaning that if two or more of the three callers make a call, this is the kind of sensitivity, precision, and F1 score we get. What you're trying to get is basically a high score everywhere, but if you just want to look at one score, then F1 is a good estimate; you want a high F1 score. It's basically a combination of sensitivity and precision, so it allows you to estimate how accurate your algorithm is. Then we also tried other methods: decision trees, logistic regression, support vector machines, random forests, so fancier machine learning algorithms, compared to the simple two-out-of-three intersection. We decided to go with the simple one: if a variant is called by two or more algorithms, that's the call we take, and that gives us 90% precision and 90% sensitivity. Yes, a question: the truth set is from the targeted sequencing? Yes, this is from the validation data. So what I just showed you was SNVs. For indels, it's a little more complicated. For those of you who have worked with indels, if you take indels from any two algorithms and try to overlap them, the overlap is probably only 50%. Indels are tough to call; it's very tough to be accurate. So in this case we had to use a machine learning algorithm; this is the logistic regression over there. Basically, we feed genomic features into the machine learning algorithm, because one pipeline may be better at calling variants with specific features, and in that case a call by that pipeline is given a larger weight, since there's a higher chance it will be accurate. By using different weights against the genomic features, we managed to get a method with 60% sensitivity and 90% precision. Indels are just tougher to call than SNVs. So you may ask, okay, why do you want to use three algorithms when you could maybe run only one, or just two? This is where we want to look at the accuracy and the cost. The x-axis is the cost per donor, based on our estimate using AWS spot instances, for running the alignment and then any combination of one or more variant calling pipelines, and the y-axis shows you the accuracy. What you want is high accuracy at low cost; you want to be at the top left corner, but that's hard to achieve. As you can see, if we run just DKFZ or Sanger alone, the cost is low, but we don't achieve as high an accuracy as we want. If you run just the Broad pipeline alone, it's actually just as costly as running two pipelines. So the best-case scenario for us in terms of accuracy is using all three pipelines, but it comes at a cost of about $100 per donor. Keep in mind, for PCAWG we were just out the door; we had to get it done very quickly, so we did not spend a lot of time optimizing. If you have the time to optimize, and you know that you're going to scale up to tens of thousands of samples, do spend your time benchmarking and optimizing.
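Just to make the two-out-of-three vote and the F1 score concrete, here is a toy sketch; it is not the actual PCAWG consensus code, and the caller names, variant tuples, and truth set are invented purely for illustration.

```python
# Toy sketch of the two-out-of-three consensus vote and of computing sensitivity,
# precision, and F1 against a validation truth set. The call sets and truth set
# below are made up; this is not the actual PCAWG code.
from collections import Counter

def consensus(call_sets, min_callers=2):
    """Keep variants reported by at least `min_callers` of the pipelines."""
    votes = Counter(v for calls in call_sets for v in set(calls))
    return {v for v, n in votes.items() if n >= min_callers}

def evaluate(calls, truth):
    """Sensitivity (recall), precision, and F1 = 2PR / (P + R) against a truth set."""
    tp = len(calls & truth)
    sensitivity = tp / len(truth) if truth else 0.0
    precision = tp / len(calls) if calls else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return sensitivity, precision, f1

# Made-up SNV calls (chrom, pos, ref, alt) from three hypothetical pipelines.
sanger = {("1", 100, "A", "T"), ("2", 200, "C", "G"), ("3", 300, "G", "A")}
dkfz   = {("1", 100, "A", "T"), ("2", 200, "C", "G"), ("4", 400, "T", "C")}
broad  = {("1", 100, "A", "T"), ("3", 300, "G", "A"), ("5", 500, "A", "C")}
truth  = {("1", 100, "A", "T"), ("2", 200, "C", "G"), ("3", 300, "G", "A")}

two_plus = consensus([sanger, dkfz, broad], min_callers=2)
print(evaluate(two_plus, truth))  # (1.0, 1.0, 1.0) for this toy example
```

The real consensus also weighted callers by genomic features for indels, as described above; the sketch only shows the simple SNV-style vote and the scoring.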
Just to give you a very simple example: when we started off with alignment, we used a 32-core machine on AWS; it took two days to finish and cost $30. But then we realized that the BWA alignment step, yes, used all of our cores, but the rest of the workflow used only about 4% of the CPU. So we were wasting a lot of CPU, and we didn't do anything fancy, just a simple switch to a smaller machine with only four cores. Of course it took five days to run, but it only cost $6. So something simple can actually save you a bit of money, and it's worth doing, but you have to weigh how much time and effort you want to put into your optimization. Yes, a question about structural variations: I didn't talk about those as much here. Using the validation data, the SV group also decided that any calls made by two or more of the three algorithms would be used, but the strategy is slightly different because they had to look for consensus breakpoints. That's what the SV group did: they looked for consensus breakpoints, and a breakpoint is kept if it is called by any two of the three algorithms. Okay, so after all this work that we have done, the alignments and variant calling pipelines, the data is usable and searchable; you can find it at the ICGC portal. But the workflows, those are our blood and sweat, and we have Dockerized all of these workflows and put them on Dockstore. You can actually go to this site and find all of our pipelines there. Not all of the Broad pipelines are there yet, but we're working on it; otherwise, alignment, Sanger, DKFZ, those pipelines are all there and you can use them for your own data. Out of this technical part, we're working on two manuscripts: one describing the software infrastructure and workflow operations, for which a preprint is already on bioRxiv, and then, if you're interested in the algorithms used to generate these variants, a paper is in the works; it's not on bioRxiv yet. The lead author is Jonathan Dursi, and soon this will go onto bioRxiv. In fact, a lot of the papers are already posted on bioRxiv, specifically if you're interested in the scientific outcomes. So we're completing our manuscripts. There are, I believe, papers already on germline, mutation signatures, and driver events. Altogether we expect about 50 manuscripts; hopefully they will come out in late 2018. Just to give you some highlights of the results, we found that 50% of the donors have at least one non-coding driver mutation. Non-coding regions were not looked at in exomes, so this data is very valuable, allowing us to see the non-coding events. And this is very important to know, because looking at driver genes is great, but these non-coding regions, especially regulatory regions, are important in cancer as well. We also found that on average each sample has 4.6 driver events. When we call something a driver event, it can be an SNV in a coding or a non-coding region, or it could be a structural variant as well. I think a lot of them are in the promoters, but I would have to go back and look at the paper; those are very interesting ones. And we also found that we don't find many new driver genes; basically, these are already known driver genes. However, the events occurring in these driver genes are very diverse, because sometimes the promoter regions are affected, sometimes the UTRs or other regulatory regions.
So this gives us a better view of what along that gene is being affected in these already-known cancer driver genes. Sure. Yeah, there is a large analysis in the paper that you'll have to go through, but in general it is basically looking for statistically significant events against a background, and the background can actually come from a set of normals from other projects as well. One other thing we're very interested in is the structural variations. This doesn't show up very well on the screen because the room is very bright, but basically the two sides here are the chromosome numbers; some of it is actually covered here. Every time there is an intra-chromosomal structural variant, it is marked between, say, chromosome one and chromosome one, and interchromosomal variations are marked between the two chromosomes involved. It's hard to see here, but you'll have to believe me when I say that chromosome 12 has a lot of intra-chromosomal events; I can't see what the other number is, but there are also a lot of interchromosomal events between chromosome 12 and another chromosome there. Okay, so that helps me, actually; let me go back. These are the regions with breakpoints between two chromosomes, and yes, there are some hotspots. We also looked at viruses, as I mentioned before, but we didn't actually find any new viruses. These are all known already: HPV16 in cervical and head and neck cancer, and in liver you have HBV. So you do see that this is reconfirming what we know already; unfortunately, we didn't see any new virus coming out of this project. Let's skip this one, but this I found very interesting. This method was developed by someone at OICR where they try to use the SNVs and the mutation patterns to predict the origin of a tumor. Along here, this is the predicted tumor type, and this is the actual tumor type. Overall, they have pretty good accuracy; basically, the more red, the higher the accuracy. It does depend on the cancer type and also on the number of samples. Why is this interesting? Because if we can do liquid biopsy, so if you just take circulating tumor DNA and, based on the mutations you see, can predict where that tumor DNA comes from, you can potentially diagnose the origin of that cancer. Yes, exactly, cancers of unknown primary. And I guess if you could combine mutations and methylation, you would be even more accurate. Yeah, because you see methylation being a very good classifier for tumor subtypes. So yeah, it would be really nice to get that kind of diagnosis. I mean, imagine if you could just get a blood test and easily get a diagnosis, and possibly look into other areas for potential cancer. And that's also what a lot of clinicians say. Right, yeah. So hopefully this will see actual clinical usage in the near future. I just want to wrap up the PCAWG part with some lessons that we learned. As I said before, we had multiple compute centers for data storage and computing, and what I've highlighted here is that their environments are different: the blue ones are HPC environments, the yellow ones are academic clouds, and the pink ones are commercial clouds. So it was helpful to have Dockers to allow us to run algorithms in different environments, but they are all slightly different to manage.
When we did this, we didn't have a lot of tooling at the time. What happened is that in each environment, whether it's a cloud or HPC, there was always what we called a cloud shepherd. This is a person supervising the execution of the workflows and reporting back to the project manager to say, OK, we have a problem here, maybe a system outage or problems with samples. At least one good thing is that we did have a real-time job tracker. The way we managed that is through centralized metadata: GNOS allows us to track a lot of the metadata, and we pull from all the GNOS servers on a nightly basis, so we know which sample has finished which pipeline and can report back. That's why you can see progress over time, what samples have already been finished and what has not. And then the project manager, which was me at the time, can delegate out and say, OK, EBI, could you please run this pipeline on this set of samples? We sort of delegated out these jobs on a weekly basis. We did it this way because it has to be dynamic: maybe a site goes down because of a scheduled or unscheduled outage, so we need to dynamically allocate and reshuffle the work as needed. So there was a lot of manual intervention in this process, and we can definitely do better as we go. I also want to mention the comparison of the different compute environments. We're talking about using the cloud a lot, but I don't think HPC clusters will completely go away, and there are some advantages to using HPC, because the cloud shepherds at the center already know the system: they know the storage system, they know how to queue up their jobs, and it was very fast for them to just get going and running. The problem, though, is that usually the hardware is slightly older, and they cannot give us very large machines to run jobs; the Broad pipeline took 32 cores and 256 GB of RAM, and not many institutes have those kinds of machines. Also, this is a resource shared with their own researchers, so they cannot really monopolize the system and just run our jobs over a long period of time. We also had academic clouds in this environment. When we used them, they were new clouds for many institutes, so we had to kick the tires, and there were a lot of issues, but I think overall it worked out for everybody: they got feedback to improve their systems, and we got some free compute, though it was still limited because they were ramping up and building out. Commercial clouds: AWS and Azure were great because we could use a lot of VMs at any one time, and it was really helpful to have that burst. But at the time, and this is still true, actually, some of the jurisdictions don't allow their data to go onto commercial clouds. This was true of the German data set; they did not want their data to go onto commercial clouds because of privacy issues specific to that country. So what you want to do, knowing the amount of data you get in a project like this, is to make sure you co-locate your data and your compute; whether it's commercial or academic doesn't matter. You don't want to spend time downloading your data, especially if you're downloading from the cloud: there's an egress cost of about $55,000 for one petabyte if you try to move it out of Amazon. You don't want to move that. And as I said, metadata is really key for tracking both your raw data, your analyzed data, and also the progress of your workflow.
So standardize the metadata from the beginning, but it's hard to predict everything that you need, so you've got to be very flexible with your metadata along the way. Use JSON; that's a flexible way to represent your metadata. We used Elasticsearch to index the data; think of it as a very fast database for a lot of metadata, so you can easily query it and get things out. I'll show a small sketch of what that can look like in a moment. Also, when you try to run your jobs, use the stats you have gathered to help you predict what your jobs will need and how much resource is required. The number one guideline is coverage: if you have a very high coverage sample, you need a larger machine. I think George will show you; just dividing your samples into two groups, small and large, already tells you what type of machine to use, and higher granularity would be even better. We also learned how to predict long-running jobs: we found that samples with a large percentage of discordant reads usually take a long time to run. When I say discordant reads, I mean the two reads in a pair are mapped to different chromosomes. Those are usually long-running jobs, so you don't want to run them on commercial clouds where you have to pay; save them for an academic cloud or HPC where you don't have to pay. As I said before, we need to minimize human intervention with some smarter logic if we were to do this in the long run. Any failed job needs to be restarted automatically, and that may mean going to larger VMs automatically instead of having someone requeue them. We need a better monitoring system: you don't want your VMs sitting idle for a couple of days without you knowing, costing you money and doing nothing. Execution services need to take into account real-time conditions. What do I mean by real-time conditions? We have replicated the data to multiple data centers, so you want to know which data center is available at runtime and which one is closest to you, giving you the most cost-effective way to transfer the data to the VM you're running on. If one resource is out, try the next resource; you need to be dynamic about where you download your data. And if you have multiple compute environments, you may want to know, at the time I'm about to run, which one is cheaper. Amazon has spot instances, and Google Cloud has preemptible VMs; you want to know the market price and then make a decision on where to run your jobs instead of just sticking with one environment. Also, during our project we got free donations; groups said, we have compute credits on Azure, it's free, feel free to use it, and you want to be able to switch environments to take advantage of these offers. And like I said, save the long-running jobs for academic clouds. Those are some of the things we need to consider, so we need to improve our cloud orchestration if we were to use this kind of strategy long term. Instead of having people, as in the previous picture, we should have something like a multi-cloud orchestrator that can consider the real-time conditions and assign jobs to different cloud environments to take advantage of cost efficiency and bandwidth. And at each cloud environment, we should have a workflow management engine that automatically retries jobs and moves us onto larger VMs if necessary, to minimize the amount of work a cloud shepherd has to do.
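Here is the small sketch I mentioned of what indexing per-donor metadata can look like. It is not our actual tracking code, just a minimal illustration of the idea; the index name, field names, and values are all made up, and the calls follow the elasticsearch-py 8.x client style.

```python
# Minimal sketch of indexing per-donor workflow metadata as JSON documents in
# Elasticsearch so progress can be queried quickly. Everything here is
# illustrative, not the real PCAWG tracking schema.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local test instance

doc = {
    "donor_id": "DO1234",                 # hypothetical identifiers
    "project_code": "PACA-CA",
    "workflow": "sanger_variant_calling",
    "state": "completed",
    "compute_site": "collaboratory",
    "coverage": 38.5,
    "discordant_read_pct": 2.1,           # handy for predicting long-running jobs
    "completed_at": "2015-06-01T12:00:00Z",
}
es.index(index="pcawg-job-tracking",
         id=doc["donor_id"] + ":" + doc["workflow"],
         document=doc)

# Example query: which donors still have the Sanger workflow running?
resp = es.search(index="pcawg-job-tracking", query={
    "bool": {"must": [
        {"term": {"workflow.keyword": "sanger_variant_calling"}},
        {"term": {"state.keyword": "running"}},
    ]}
})
print(resp["hits"]["total"])
```

Because each document is just JSON, you can add new fields as the project evolves without redesigning a schema, which is exactly the flexibility I was describing.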
So why do we want to do this? Because there are several new projects that are very similar to PCAWG. One of them is the Pan-Prostate initiative, which is ongoing. It doesn't sound like a lot of samples relative to PCAWG, but they do have a thousand whole genomes, 500 exomes, and a lot more data types, so managing the data types is challenging for them, and this is also across multiple jurisdictions. Those are very similar scenarios to PCAWG. But the most important project is the one about to be launched this spring. Yes, a question; so I know Pan-Prostate has already adopted the Sanger pipeline. I think what they do is use a lot of our pipelines and further improve them, both in terms of performance and accuracy. So the one large initiative is ICGC-ARGO, which stands for Accelerating Research in Genomic Oncology. This will be launched in the spring under ICGC. It's intended to be a 10-year project to collect and characterize over 200,000 donors, mostly participants in clinical trials. This is not just getting sequencing done; they want to have drug responses from these patients during clinical trials, and a lot more clinical data will be collected. So this is much larger than PCAWG, almost 100 times larger, and it will be done over eight to ten years. Previously, we did PCAWG over two to three years; now this needs to be sustainable. You don't want cloud shepherds spending a lot of time monitoring jobs; you need an automated system to really handle these numbers. This is sort of the proposal ARGO is going to use: they will have sequencing data submitted to different data centers, probably by region, just like PCAWG, and then they will also have a team to deal with pipeline engineering. Once a method is deemed good, based on the benchmarking group saying, hey, this variant calling pipeline is really good, this is the one we should use, then this engineering group will package it into a Docker, make sure that it can run in a cloud environment, and we can run it basically anywhere we have resources. Then, once the data is analyzed, it can go through the DCC, where they curate the data and further collect the clinical data from the different groups, and all this data will be made available through the ICGC data portal. So the idea is that these data centers will be spread out across regions, but they will be given workflows by the engineering group to run on the local data. This is going to be the model that ICGC ARGO will use. Yes, yes; we want to make sure the analysis is uniform across sites. Yeah, okay. So we've gone through an hour and I don't know if people need a break, if people are dozing off. Now let's talk about the two major cancer data sets that are available for research use. As I said before, in ICGC we have 17,000 donors. You can access raw data if you wish, and somatic mutations are openly available and clinical data is also available, so you can do a lot already without having to apply for DACO approval. And then there is the NCI Genomic Data Commons; currently there are 32,000 donors in there, and I'll go over what data sets are available. For ICGC data, you can get access to the whole genomes; there are about 6,500 donors with whole genomes, 2,800 of which are actually part of PCAWG, so those are uniformly analyzed already if that's what you want. There are 7,500 donors with whole exomes.
Some of them also have RNA-seq, microRNA-seq, bisulfite-seq, and array-based calls as well, so it's a very good mix. If you want to access it, it's at the portal. For controlled access, yes, you will have to apply through the Data Access Compliance Office, but once approved, you get access to all ICGC data. This is the ICGC portal, just a screenshot of it. You can see that this is the latest release, from December, and these are the cancer projects; all together there are over 10,000 unique donors with SNVs, the simple somatic mutations, already analyzed. This, I guess, is what was called the hamburger plot before, showing the mutation rate: these are the different projects, and this is the number of mutations per megabase. As usual, melanoma has the highest mutation rate. So this gives you a bit of an overview, but the good thing is on the left side there is a faceted search, just like when you go shopping on Amazon: you can choose the cancer types you're interested in and the data types you're interested in. In the faceted search, you'll see there is a donor tab, a gene tab, and a mutations tab, meaning for each tab you can select donors with specific characteristics, genes you're specifically interested in, and even particular mutations, if that's what you care about. This is how you do your search, and then you'll see the donors that meet your search criteria and which project they're coming from, along with the disease site, simple clinical data, and, the great thing, what data types are available and even quick summaries of the number of mutations and number of genes. The ICGC portal is not just for querying and downloading data; it allows you to do some simple exploration of the data itself without having to download it. It's possible to look, in a genome browser, at the mutation regions you're specifically interested in; you can zoom in and out, and this saves you the trouble of downloading the data. Also notice that there is pathway information. This is actually the query that was selected: the disease site was brain, with a consequence type of frameshift variant, related to a pathway; this is the code for the pathway, but you can actually search for the pathways you're interested in. And once you're happy with your search, you can open the genome browser right on the portal to look at the data without downloading. This is another very cool feature, because you can actually stream a BAM file. When you don't have full access to the raw BAM data, you cannot download it, but you can still get some summary information of a BAM that is streamed right there: you can see the coverage of that file and the read distribution in a region, and you can click a chromosome or zoom into the chromosome, all streamed live. So this is pretty cool. There is also streaming of VCFs. I know a VCF is always tough for a human to read when there's so much information in there, so there is at least some summary from the VCF about where along the chromosome there are variants and a summary of the base changes. This is another nice way of summarizing a lot of data, because there are 24,000 variants actually summarized here. On the ICGC portal, there is other information for annotating your mutations.
This is slightly blocked on the screen, but it's a lollipop plot showing where along the gene the mutations are and also the frequency of each mutation; the y-axis is the number of donors. So you can see maybe some hotspots based on a cohort. The nice thing is there is now also additional annotation on compounds, so drug entities can be annotated and you can start linking your mutation data to compounds if the data is available. It all depends on the community starting to have more of the drug compounds annotated in databases. You can also visualize pathways; this is a Reactome pathway. The information is pulled from the Reactome database, and the mutations you're interested in can be overlaid onto the pathway to give you a sense of how they fit into a specific pathway. This is the OncoGrid, if you have seen something like it in cBioPortal before: each column is a donor, each row is a gene, and the genes mutated in this cohort are grouped in a visual way so that you can actually see exclusivity. So donors with this TP53 mutation typically do not have the other mutation, which I can't see from here, but they will have this second gene mutated most of the time. These are all nice features to have. One thing I should mention is that a lot of the visualization tools on the ICGC portal are actually open source, available in a JavaScript tool suite called OncoJS. You can download that and set it up in your own portal if you wish, so that you can see the pathways, have the OncoGrid, the lollipop plot, survival plots, and other tools. If that's something you're interested in, definitely take a look at this open source project. Another interesting thing you can do with the ICGC portal is synthetic cohorts. Based on your search criteria, you can create cohort A and cohort B and then compare them, for example with a survival analysis between the two cohorts. This is a nice way to enable some analysis without downloading the data. As I said, the DCC releases have been going on for the last eight or nine years, and all the data is archived. So if you read a publication that specifically refers to ICGC data release 19, you can always come back here; this is basically a snapshot of the releases. As you can see, there are usually, I believe, three or four releases per year, so you won't have to worry about the data changing every day, and you can easily look up a release referred to in a publication. One thing I have to say, though: the ICGC portal basically hosts all the clinical data and the mutations, and those are small data sets. The raw data itself is not hosted at the ICGC portal because that's too much, so it is distributed. Some of the data, the raw reads we're talking about, is in the EGA archive. We also have the raw reads on the PCAWG GNOS servers, but we're actually gradually retiring those. As I said before, the NCI data, the TCGA data, is actually in the Genomic Data Commons because it cannot leave the US. The Cancer Collaboratory holds some of the data, and so does Amazon. So the raw data is distributed to many places, but you can still use the portal to find it. This is what you can do: come to the browser and choose, maybe, a specific data repository you're interested in, or a data type, or you can look at donors, maybe the disease types you're interested in.
Basically, once you choose your criteria, you have a set of files you're interested in downloading. It tells you which repository hosts these files, and what you can do is save the donor set, save it as a manifest, and then use a tool to download those files. As I showed you before, a file can actually have multiple copies in different repositories, so you have a choice of where you want to download from. At the beginning it shows you a lot of files, everything, but once you click remove duplicate files, it only shows you the repositories and you can prioritize them. You can say, I prefer to download from AWS Virginia, so most of your files come from there, and the remaining files that aren't there come from your second-choice repository. This is a really cool way to help you download the unique files you want, based on repositories that are maybe closer to you geographically, or maybe because of cost issues. So this is a nice tool. Once you have this set up, you can generate a manifest. A manifest is just a list of files and the locations you want to download them from. But because the files are in multiple sources and each data source has a different download client, it gets very confusing. For example, GNOS requires a tool called gtdownload; you have to go and install it and set it up, which is complicated. The GDC has its own tool called the Data Transfer Tool. For AWS and the Collaboratory, you can use the ICGC storage client or just the AWS client. The EGA has its own tool that is tough to install, and the PDC is yet another source, which uses AWS clients. To simplify all this, so you don't have to remember any of it, the ICGC DCC came up with one tool called icgc-get. It's a very simple concept: they have packaged all of these download clients into one Docker, so now you just invoke one command, icgc-get, and it is smart enough to know which download client to use depending on your data repository. So, as I showed you already, you do your search on your data and you get a manifest, and you don't even have to download the manifest itself; you just need a manifest ID, because the actual manifest is stored on the ICGC server. You just point to it by ID, and then you can do icgc-get download with my manifest ID, and it is all done in the background for me. The files could be coming from multiple locations, but I don't have to worry about it. So this is a pretty smart tool that the ICGC team came up with. The next bit is about the Genomic Data Commons. Who knows about the GDC already? Okay, do people know about CGHub? Okay, so basically the NCI had the TCGA data; this is TCGA, The Cancer Genome Atlas. It was started over 10 years ago, with multiple cancer types being sequenced, mostly exomes with RNA-seq; sorry, at the time it was mainly expression arrays. All the data was originally put on just an SFTP site, what they called the Jamboree site, but over time the raw data went to CGHub. CGHub was retired a couple of years ago, and now it is all basically hosted in the NCI Genomic Data Commons. The idea is that there is a portal for you to search, which also allows you to download and do some visualization and some analysis as well. So this is all sounding very similar to the ICGC portal.
And as you look at the two, they're pretty much cousins, because the front end is developed by the same team at OICR. But the GDC, being a government-funded project, is very structured and very organized. Besides the data portal, there is a website with a lot of documentation, and these two websites tell you everything you need to know: if you want to know about the pipelines, the data types, everything is in there, and it's very well documented. The GDC also has a submission portal, so new projects that want to submit data to the GDC can go through a process: they can request to submit data, and once approved, they get to use a tool to submit the data to the GDC. Again, during the submission process the data is validated, and there is controlled metadata and a clinical data vocabulary they have to follow. As I said, there is a Data Transfer Tool that allows you to do multi-part downloads, meaning that if your download stalls or is aborted, you can easily restart and pick up where you left off. And then there's an application programming interface, so people who don't want to use the portal to interact with the data can use the API to get at the data easily as part of their own code. One very interesting thing about the GDC, though it may not be obvious, is that the GDC is tasked with realigning all the data, again doing uniform analysis, but this time against the latest genome build, GRCh38. In PCAWG, I didn't point this out, but the analysis was done against hg19. I think we're in a transition time where a lot of people are still sticking with hg19; it's very difficult to get them to switch, but I think we need to gradually get there. So the GDC is taking the lead in reharmonizing all the TCGA data, and anything that goes into the GDC will be uniformly analyzed against GRCh38. Because we cannot get rid of the hg19 data yet, the older data is put in the legacy archive. There's the older TCGA data, and TARGET data; TARGET stands for Therapeutically Applicable Research to Generate Effective Treatments, which is basically pediatric data. And then there's CCLE, which stands for the Cancer Cell Line Encyclopedia. That's older data that was previously in CGHub and has now been migrated over as it was, still on hg19. So it was not reanalyzed; it could have been called by different algorithms depending on the working group's choice at the time. We still have to make that available because people want it. But the active portal is where we have the reharmonized data. There are over 32,000 donors across 40 projects. The TCGA data comprises 11,000 donors and 33 projects. TARGET has 3,000 donors; it's typically pediatric data, smaller data sets, so it only has 3,000 donors. And the latest and fairly interesting addition is the Foundation Medicine data. TCGA and TARGET were mostly exomes; we haven't made the whole genomes available yet, but they will be. Foundation Medicine is targeted sequencing, so 18,000 cases sounds like a lot, but it's all targeted sequencing; in terms of data size, it's actually not much. And because Foundation Medicine gave us only the variants, we do not have the raw data, so we could not go through the whole realignment; we basically lifted those calls over to GRCh38. Yes, a question: so they don't use any germline at all? No. Then how do you integrate that with everything else? Are you asking specifically about Foundation Medicine?
Yeah, they don't give us the germline, and I don't think they will have it, because it's all a targeted panel. Yeah. No, it's based on their own pipeline. So this is something we try to make clear to people in the documentation. We just don't have a choice there; it's too precious a data set to give up just because it doesn't come from the same pipelines. Yes? Is there something in the GDC right now that doesn't exist in ICGC? Yeah, so, definitely. TARGET data is not in ICGC, Foundation Medicine is not, and also TCGA went through some kind of vetting process for choosing which samples to submit to ICGC. I don't know what their criteria were, but there are definitely samples in the GDC that are not in ICGC. Okay, so for GDC data, keep in mind that access is slightly different from ICGC. All the metadata and clinical data are open access, no problem. Only the TCGA somatic mutations are open, and they have some germline masking already done. The TARGET data is completely controlled, because the pediatric community typically wants to protect their donors a bit more, and FMI, Foundation Medicine, is also controlled. So you need to apply for dbGaP access; this is the link. But keep in mind that this is not like ICGC: you have to apply to specific studies to get access, and you provide a research statement saying what you will do with the data. If along the way you want to expand your research, so you now want to do more than you originally said you would, you can submit an amendment. I know dbGaP sometimes gets a bad rap, that it's so complicated and takes so long, but that's actually not true. Usually the application is stuck at people's own institution, because your institution has to sign off on the application, your IT director has to sign off; that's actually the bottleneck. Once dbGaP receives your application, they turn it around within a couple of days. So, the GDC portal. I'll try to do a bit of a live demo instead of showing you screenshots. All right, so this is the GDC portal, and these visual components are actually part of the OncoJS visualization suite, which you can reuse if you're interested. We have 40 projects there, and you can go to the projects page to explore. Just like the ICGC portal, you can choose your primary site, and you see the TCGA, TARGET, and Foundation Medicine data sets. You can see the number of cases affected by a particular gene; this is TP53, the most frequently mutated gene across all the projects, and you get to see the distribution of cases across the projects. So say I choose kidney. Actually, I'll show the exploration page first. Okay, so this is the cool page, because it lets you start doing some visualization. You can choose kidney as the primary site and filter on mutated genes; let's type in TP53. That's another nice thing: as you type, it automatically suggests terms for you. So these are all the cases, only 36, not that many, with kidney as the primary site and the gene TP53 mutated. Now, if I click on OncoGrid... actually, let me take out TP53 first. If I take out TP53 and just look at kidney cancer, I can easily get an OncoGrid showing me the most frequently mutated genes, with VHL at the top. Each column is a patient, and here you can actually see the number of mutations in that patient, so you've got one with 45 mutations compared to these lower ones.
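Incidentally, the same kind of facet search can also be expressed against the GDC API rather than clicked in the portal. This is a rough sketch only: the endpoint is the public one at api.gdc.cancer.gov, but the field names (primary_site and so on) are assumptions that should be checked against the GDC API documentation.

    import json
    import requests

    cases_endpoint = "https://api.gdc.cancer.gov/cases"

    # Roughly the "primary site = Kidney" facet from the portal, as an API filter.
    filters = {
        "op": "in",
        "content": {"field": "primary_site", "value": ["Kidney"]},
    }

    params = {
        "filters": json.dumps(filters),
        "fields": "submitter_id,primary_site",
        "format": "JSON",
        "size": "10",
    }

    response = requests.get(cases_endpoint, params=params)
    response.raise_for_status()
    for case in response.json()["data"]["hits"]:
        print(case.get("submitter_id"), case.get("primary_site"))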
And if you're interested in looking at these cases, you can. At the bottom there's some very simple annotation based on clinical data, but let's look at one of the mutations. I can click on any one of them, and this one is a missense variant. Now I want to look at the mutation itself, so this is your lollipop plot, right? This is the gene, VHL, and it shows me where it is mutated, and the color code tells me the consequence: red is missense, blue is stop gained, and these are considered high-impact mutations. Looking at the frequencies, there's no single hotspot that stands out, but at least you can see where the mutated regions are, and here are the Pfam domains, if that's what you're interested in. Okay, so what's also very interesting is that you can now do analysis. You can actually create sets, the synthetic cohorts I mentioned before. This is a demo set: I create three sets, all from bladder cancer with high-impact mutations, one detected by the algorithm MuTect2, which is one of the variant calling pipelines, a second set from a different calling pipeline, VarScan, and a third from MuSE. Then I use the set operations just to find out how well these three pipelines overlap, and you see there's a good overlap here at the center, mutations called by all three pipelines. We don't have consensus calls on the portal yet, so if you need very accurate mutations for your research purposes, this intersection is the region you'll want to use, and those are the mutations you'll want to look at. Now you can actually click on this set and bring up the mutations; maybe that's too many to bring up, but what you can also do is save it to your list. So this goes back to the exploration page. Yeah, so these are all 6,006 mutations that I can list and look at, and I can easily look at any of the cases specifically if I want to pull out any clinical attributes. Another cool thing is that for these cases I have a survival plot, and I can select a specific mutation to see whether patients with that mutation have a slightly different survival profile. So the orange curve here is the cases with this particular mutation, the blue is the cases that do not have it, and you see the log-rank p-value. So this lets you analyze the data quite nicely without much downloading or having to manipulate the data yourself, which is nice. Yes, sure. Right, so right now the back end of this is all in Elasticsearch. But that's one question I had for the OncoJS team as well: people need to get their data into a certain form to use these components. They do give you instructions on how to do that; I believe it starts from a flat file, but I'll have to double-check. Okay, and this is the cohort comparison, which is a nice example. This was basically pancreatic cancer, where KRAS is predominantly mutated, but you also have patients who do not have KRAS mutated. So you create those two cohorts and look at the difference in their survival, which is actually significant. And then you can also look at how that segregates by gender, maybe by vital status, or by age at diagnosis, where you see the distribution of when patients were diagnosed, split by whether they have or do not have the KRAS mutation, and you can sort of choose what to compare.
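Since survival comparisons come up a couple of times here, this is a small, purely illustrative sketch of the log-rank idea in Python using the lifelines package; the survival times and event flags are invented, and in practice you would pull them from the clinical data for your KRAS-mutant and KRAS-wild-type cohorts.

    from lifelines.statistics import logrank_test

    # Invented data: survival time in months and an event flag
    # (1 = died, 0 = censored) for two cohorts.
    mutant_months = [5, 8, 12, 14, 20, 25, 30]
    mutant_events = [1, 1, 1, 0, 1, 1, 0]

    wildtype_months = [10, 18, 24, 32, 40, 48, 60]
    wildtype_events = [1, 0, 1, 0, 1, 0, 0]

    result = logrank_test(
        mutant_months, wildtype_months,
        event_observed_A=mutant_events,
        event_observed_B=wildtype_events,
    )
    print("log-rank p-value:", result.p_value)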
The GDC is in the process of adding a lot more clinical data, so this will gradually build up and hopefully become even more useful for people doing analysis right on the portal. And the repository page is very similar to what you've seen in ICGC: you pick your donors and files, put them in your cart, and either download them or turn them into a manifest. So there are a lot of options. Do play around with this when you have a chance, because there are a lot of cool features here. I can skip through a few of these slides now that I've done a live demo. Okay, so the data transfer tool, as I mentioned before, is something you can use for multi-part downloads; it's much easier when you have a lot of files or large files. Okay, so I do want to talk about the API. Programmatically, you can build these same queries using the API. This is the API URL, and this is the endpoint saying that I'm interested in searching projects. You can imagine it as working through the facets on your left-hand side: I pick, under projects, a specific project ID and a primary site, which is just like checking off those facets in the search, and this is how I can run the query. And these are the features you've already seen in the demo: the survival plot, the protein view of mutated coding regions, the OncoGrid, and the set analysis I just showed you. Okay, so I think I can skip these.
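Just to give a flavor of that projects query, here is a minimal sketch, assuming the public endpoint at api.gdc.cancer.gov; the filter and field names mirror the portal facets, but they are assumptions to verify against the GDC API documentation.

    import json
    import requests

    projects_endpoint = "https://api.gdc.cancer.gov/projects"

    # Equivalent of checking the "primary site" facet under Projects in the portal.
    filters = {
        "op": "in",
        "content": {"field": "primary_site", "value": ["Kidney"]},
    }

    params = {
        "filters": json.dumps(filters),
        "fields": "project_id,name,primary_site",
        "format": "JSON",
        "size": "100",
    }

    response = requests.get(projects_endpoint, params=params)
    response.raise_for_status()
    for project in response.json()["data"]["hits"]:
        print(project["project_id"], "-", project["name"])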
So I do want to talk about the resources you can find for cloud-based analysis. What I've shown you is that you can find the data sets in either ICGC or the GDC; you may just want to look at mutations and survival plots, and you're happy with that. But there is a different type of researcher who wants to analyze the data a little deeper, maybe looking at BAMs and VCFs, doing additional analysis, maybe testing their own algorithms on the data. So there are a couple of options. You have the commercial clouds: Amazon Web Services, Google Cloud Platform, Microsoft Azure. Then there are the NCI cloud resources; I'll dive into those in a little bit, but the NCI cloud resources are themselves built on commercial clouds: Seven Bridges is on AWS, and the ISB cloud and the Broad FireCloud are on Google. The good thing about them is that the TCGA and GDC data are already on those two clouds, so the NCI cloud resources become a platform, a workspace, where they provide you with analysis tools that have already been containerized and give you a space to share your results with your collaborators. That's very customized for cancer research. And then you also have the academic clouds: the University of Chicago has the Bionimbus Protected Data Cloud, and OICR has the Cancer Genome Collaboratory. Commercial clouds are good because you can start up there easily and they come with native services. If you don't want anything fancy, for example, AWS Batch will let you run a large number of jobs; they have a container registry, which is roughly equivalent to, say, Dockstore, plus container services, database services, and Elasticsearch already there. So they do have a lot of services, and sometimes it's actually a little overwhelming just to get through the catalog and find the things you need, but the community you talk to will have recommendations for you. Google Cloud Platform comes with several great things: Kubernetes, which is open source, for deploying your containers, and BigQuery, which people find very useful. You can think of BigQuery as just a big table, but it has very good performance, so there's even talk of putting variants into BigQuery tables to do very fast searches. Now, dsub on Google Cloud I think is very interesting, because if you're familiar with an HPC cluster, say Sun Grid Engine, you're already familiar with qsub commands. dsub gives you pretty much the same kind of syntax and options as qsub, so it becomes much easier for people who are accustomed to HPC. This removes a bit of the barrier, because the cloud can overwhelm some researchers who feel they have to port everything and rewrite their tools to get onto the cloud; dsub makes the transition easier, and there's a small sketch of a dsub submission at the end of this section. Microsoft Azure I don't have a lot to say about, just because I haven't played with it much. The NCI Cloud Resources were launched as pilots back in 2016 and are now mature enough to be official cloud resources. They have the TCGA and TARGET data co-located on AWS and Google Cloud, and they also allow users to import their own data into the workspace, so you can commingle your own data with the TCGA and TARGET data in your analysis. And there are a lot of analysis tools and pipelines that are specific to genomic analysis, so you don't have to install your own tools or try to get them from somewhere else. Another great thing is that you get a workspace where you can collaborate with other researchers. For example, Seven Bridges has made available, I actually have it on the next slide, over 300 tools on their platform, and they provide a visual editor to link up your workflow. You can imagine each little box as a method: you can have BWA-MEM as your first step and Picard as your next step, arrange it visually, and easily communicate with your collaborators: hey, this is the method I use, this is the flow of the analysis. So that's really great. The other thing they have is Rabix; Brian may talk about that, because Rabix enables you to run workflows written in CWL. And yes, I've interacted with Seven Bridges; they are really great with their support and with answering questions, so there's the URL, sorry, the email, in case you want to get started. Typically there are free credits to start with; I can't remember how much now, but there are usually some free credits for you to start up and play around. So this is my personal impression of the three clouds as I've interacted with them. Seven Bridges really has a great number of tools and a great graphical interface for building workflows, but at the same time some people don't like graphical interfaces, so ISB is more appealing for people who want to use the command line; those are different types of users. ISB also uses a lot of Google's native services, so if you're already familiar with Google, or you don't want anything too fancy that might get confusing, that would be a good option.
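Going back to dsub for a moment, here's the promised sketch of what a submission could look like. The project ID, bucket, and image are placeholders, and the flags should be checked against the dsub documentation; the point is just that it reads much like a qsub submission.

    import subprocess

    # Placeholder GCP project and bucket; replace with your own.
    subprocess.run(
        [
            "dsub",
            "--provider", "google-v2",
            "--project", "my-gcp-project",
            "--logging", "gs://my-bucket/logs/",
            "--image", "ubuntu:18.04",
            "--input", "INPUT=gs://my-bucket/sample.fastq.gz",
            "--output", "OUTPUT=gs://my-bucket/sample.count.txt",
            "--command", "zcat ${INPUT} | wc -l > ${OUTPUT}",
            "--wait",
        ],
        check=True,
    )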
FireCloud, the Broad's FireCloud, is interesting because the Broad themselves use that platform for their production runs. You may know that the Broad does a lot of sequencing, and they actually use this platform for their production alignment, so they have production-level analysis in mind; you'll notice that difference. Also, the Broad publishes a lot of methods, and some of them are not available simply as open source or as a package you can download; however, those tools can be found in FireCloud, so that's one advantage. Then the academic clouds. Bionimbus, as I said, is a trusted partner for distributing NCI data at the University of Chicago. That's one good resource, because you can get both the GDC data and some of the PCOG data, the TCGA portion of PCOG, on that cloud platform, so the data is co-located. At this point I won't encourage applications to Bionimbus, because I know we're at capacity, but I would recommend the Cancer Genome Collaboratory if you're interested in an academic cloud. That's the place that already has the ICGC data, and the great thing is it has a very fast connection to Bionimbus at UChicago, so if you have to download some of the TCGA data to the Collaboratory for analysis, it's a fast connection. This is the actual Collaboratory hardware, and that's George over there. Back in November 2015, when we first started, we only had about a rack and a half; now we have six racks, and as of November all six racks are full. We have 2,500 cores right now, with local disk, 60 terabytes of block storage, and raw storage of almost 7.3 petabytes. We will do another expansion this year and hope to reach 4,600 cores by the end of the year. I think George already mentioned that the Collaboratory uses all open-source systems, OpenStack and Ceph storage; these are all open-source products, so we don't have to pay licenses. This is roughly how things are linked up. The file storage system at the back can easily be accessed through APIs that follow the standard set by Amazon, meaning that whatever you develop that works on the Collaboratory will also work on AWS; there's a small sketch of that at the end of this section. So the Collaboratory is a nice place to start your development, and that's typically what we recommend to people: start your development on the Collaboratory, and then as you need to scale up, port your code or your data to Amazon as necessary. Here we also just want to show that this is easily linked to Dockstore, to get the Docker containers where all the algorithms are, and there's also a connection to the ICGC data portal, so after you do your searches you can easily get the data from the storage system. Just very briefly, this is the data that is already in the Collaboratory: somatic mutations, I believe, structural variants, aligned reads, and copy number. These are the ICGC portion of PCOG, and the TCGA portion of PCOG is in Bionimbus. There are other ICGC data sets beyond PCOG, and that data is currently over at EGA; gradually, we're in the process of importing all of that data into the Collaboratory as well, so it can all reside in one place. It just takes a little bit of time.
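And here is the promised sketch of that S3-compatible access, using boto3. The endpoint URL, bucket name, and credentials are placeholders rather than the Collaboratory's real addresses; the same code without the endpoint_url would talk to AWS S3.

    import boto3

    # Placeholder endpoint and credentials; substitute what your provider gives you.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://object-store.example.org",  # drop this line on AWS itself
        aws_access_key_id="YOUR_ACCESS_KEY",
        aws_secret_access_key="YOUR_SECRET_KEY",
    )

    # List objects in a (hypothetical) bucket and download one of them.
    for obj in s3.list_objects_v2(Bucket="my-analysis-bucket").get("Contents", []):
        print(obj["Key"], obj["Size"])

    s3.download_file("my-analysis-bucket", "results/sample.vcf.gz", "sample.vcf.gz")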
Okay, so for the Collaboratory, you can see that we currently have 31 active projects and a good number of users. And just to give you a sense of what people are doing in terms of research: people are looking at tumor types using deep learning, genomic alterations of the 3D organization of the cancer genome, the formation and consequences of non-coding mutations, and comprehensive comparisons of primary and metastatic cancers. So this is open to the research community; you can easily go to the site, register, and get an account. It is based on a cost-recovery model: basically, we look at the cost of replacing hardware over time, we're not even putting in staffing costs, just hardware replacement, and we come up with a cost-recovery model that is actually pretty competitive with AWS. And if you ask me why you would use this instead of AWS, it's because you're in a community with other cancer researchers. The Collaboratory has two support staff, one of whom is George, so any question you send goes to those two people. You only have to give them the background once, they understand what you're trying to do, and so they're very able to help you in your specific scenario. Whereas if you go to AWS, every time you open a ticket you get someone different, you have to give all the background again, and they may not understand the cancer tools you're trying to run. George, on the other hand, has actually run some of these tools himself, so that's the difference. But of course, when you have to scale up, that's when you go to AWS, where you can start up 500 machines at the same time if you want to get it done. But like George said, remember to start small: do your optimization while you're still in development. So, just a couple of last things. Another reason to support the Collaboratory is that it continues to develop tools that are open source for the community. They've gone with a musical theme for the naming: the entire software suite is called Overture. They have come up with systems, including one for managing the genomic data hosted on cloud storage, so that's their storage system; Song is a metadata tracking system; and we also developed our own billing system in order to do cost recovery. Dockstore is a tool registry, and then there's enrolment and the data portal. So all of these things can be reused by the community if they want to set up their own kind of cloud portal. In terms of science, the Collaboratory is also trying to comply with standards like those of the Global Alliance for Genomics and Health. And one thing I think people are very interested in is an API to interrogate variants, instead of having to download the VCFs and manipulate them. The Collaboratory has already started a prototype using a 10-node Elasticsearch cluster, and it gets very good performance: they have indexed 300 million variants, and it took only about 1,200 milliseconds to get results, much faster than what you can do with, say, a SQL database. So it's nice that this is something developed based on demand from the users; a lot of the work they do is driven by user demand, to get better software integrated into the portal and into the cloud.
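To give a feel for what interrogating variants through such an API could look like, here is a rough Elasticsearch-style sketch; the index name, field names, and coordinates are invented for illustration and are not the actual schema of the Collaboratory prototype.

    import requests

    # Hypothetical Elasticsearch endpoint and index holding one document per variant.
    es_url = "http://localhost:9200/variants/_search"

    query = {
        "size": 10,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"chromosome": "17"}},
                    {"range": {"position": {"gte": 7668000, "lte": 7688000}}},  # illustrative window
                    {"term": {"consequence": "missense_variant"}},
                ]
            }
        },
    }

    response = requests.post(es_url, json=query)
    response.raise_for_status()
    for hit in response.json()["hits"]["hits"]:
        print(hit["_source"])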